next up previous contents
Next: 5. Systems For Natural Up: Towards Linguistic Steganography: A Previous: 3. Lexical Language Processing   Contents

Subsections

4. Approaches to Linguistic Steganography

Approaches

We have seen in the previous chapters why the study of steganography needs to be closely linked to that of the channels supposed to cover steganograms and the interpretation of the usual cover-datagrams.

The structure of this section is aligned along traditional linguistic lines of layers accounting for atomic symbols, syntax relating the symbols and semantics expressing their meanings, approached via lexical, grammatical and ontological models.

Since language is essentially redundant, it will carry information that is irrelevant for understanding its meaning. In the context of steganographic embedding, a good model for redundant information in language suitable for steganography is meaning-preserving substitution. Depending on the approach we employ, the term ``meaning-preserving'' has different interpretations.

4.1 Words and Symbolic Equivalence: Lexical Steganography

Lexical Steganography

The most straightforward subliminal channel in natural language is probably the choice of words. On the word-level, meaning is traditionally linked to the lexical relation of synonymy. For example, consider the following set of covers:

\begin{eqnarray*}
C & = \lbrace& \w{Midshire is a nice little city}, \\
&& \w{...
...le town}, \\
&& \w{Midshire is a wonderful little town}
\rbrace
\end{eqnarray*}

We can use synonymy to encode ten states in the above sentences, since two information-carrying symbols are available, one of which can take on five values, whereas the other one can take on two values. The first is the choice for either wonderful, decent, nice, fine or great and the other is for either city or town:

\begin{displaymath}
\w{Midshire is a}
\left\lbrace
\begin{array}{c}
\w{wonderful...
...gin{array}{c}
\w{city} \\
\w{town}
\end{array}\right\rbrace
.
\end{displaymath}

We call sets of words that can be used interchangeably synonymy sets or synsets for short and denote them in standard set-notation as

\begin{eqnarray*}
\lbrace \w{wonderful}, \w{decent}, \w{fine}, \w{great}, \w{nice} \rbrace,
\cr
\lbrace \w{city}, \w{town} \rbrace
.
\end{eqnarray*}

We have seen in the previous section, how lexica can be used as a source of such synsets.

A mixed-radix number is one way to encode a secret state in a sentence, if we think of state in numeric terms. Mixed-radix numbers can be written using positional notation in the following way (, , p. 208):

\begin{displaymath}
\left[\begin{array}{rrrrr}
\ldots, & a_3, & a_2, & a_1, & a_0 \\
\ldots, & b_3, & b_2, & b_1, & b_0 \\
\end{array}\right]
\end{displaymath}

for $0 \leq a_i < b_i$. The numeric interpretation of a mixed radix number is

\begin{displaymath}
\ldots
+ a_3 b_2 b_1 b_0
+ a_2 b_1 b_0
+ a_1 b_0
+ a_0.
\end{displaymath}

Given the bases of an $n$-digit mixed-radix number $b_{n-1},\ldots,b_2,b_1,b_0$, we can always use this numerical identity to uniquely write each number in the range from $0$ to $\prod b_i - 1$ as a sequence $a_{n-1},\ldots,a_2,a_1,a_0$ where $0 \leq a_i < b_i$ for each $i$.

Since, analogously, each sequence of bits of a length $\leq \lfloor\log_2(\prod b_i)\rfloor$ can always be thought of as a binary number in that range, we can always find a unique mixed-radix representation for it, and vice-versa. In practice such a conversion can be done efficiently using Horner's rule. (, p. 635) gives some examples.

To establish a direct correspondence between a text and a mixed-radix number encoding document state, let's establish the convention that the leftmost word in a text always corresponds to the most significant digit in the mixed-radix representation of the text and words from the text that do not fall into any synset can be ignored for this purpose. If a text contains $n$ words $w_1,w_2,\ldots,w_n$ that could be replaced from synsets $S_1,S_2,\ldots,S_n$ we represent state for the subliminal channel by an $n$-digit mixed-radix number with bases $\vert S_1\vert,\vert S_2\vert,\ldots,\vert S_n\vert$. We can then interpret a mixed-radix digit $a_i$ in the range $0 \leq a_i < \vert S_i\vert$ as the choice for the $a_i$-th word in synset $S_i$ with respect to alphabetic order.

Turning back to the example, we would encode one of the ten possible states by a mixed-radix number in the range from zero to nine:

\begin{displaymath}
\left[
\begin{array}{rr}
a_1, & a_0 \\
5, & 2 \\
\end{array}\right].
\end{displaymath}

If we wanted to encode a bitstring 101, we would interpret it as the numeric value 5, which clearly is in the range from zero to nine. The number 5 can be written in the above system as

\begin{displaymath}
\left[
\begin{array}{rr}
2, & 1 \\
5, & 2 \\
\end{array}\right]=2*2+1=5,
\end{displaymath}

which corresponds to the following choices of words from the synsets:

\begin{displaymath}
\w{Midshire is a}
\left\lbrace
\begin{array}{rl}
0 & \w{wond...
... \\
\mathbf{1} & \w{\textbf{town}}
\end{array}\right\rbrace
.
\end{displaymath}

Therefore, 101 would encode to the sentence

\begin{displaymath}
\w{Midshire is a fine town.}
\end{displaymath}

However, mixed-radix-conversion is computationally rather intensive, given that all we want to achieve is a uniquely decodable representation for a bitstring. A more efficient approach would be to reconstruct the secret message by concatenating codewords that are directly associated with a word-choice, instead of using mixed-radix conversion. The most simplistic approach would be to restrict synsets to cardinalities that are powers of two. For example:

\begin{displaymath}
\w{Midshire is a}
\left\lbrace
\begin{array}{rl}
\bs{00} & \...
...\bs{\textbf{1}} & \w{\textbf{town}}
\end{array}\right\rbrace
.
\end{displaymath}

The fact that nice is never chosen by the stegosystem, although it is sometimes chosen by native speakers, could be security-relevant. It would, in such circumstances be better to allow a random choice between two synonymous words that decode to the same bitstring.

\begin{displaymath}
\w{Midshire is a}
\left\lbrace
\begin{array}{rl}
\bs{00} & \...
...\bs{\textbf{1}} & \w{\textbf{town}}
\end{array}\right\rbrace
.
\end{displaymath}

In this example, when encoding the bitstring 11 using the left synset, a random number generator decides whether it should be encoded to great or nice.

The restriction to block-codes clearly reduces capacity, since fewer states can be represented by the same elements. () uses variable-length codes, the Huffman code in particular, for his mimic functions. Not only do variable-length codes lift this restriction, they also make it possible to control the probabilities with which symbols are chosen, as opposed to block-codes where all choices are equally probable:

Figure: A Huffman-tree of words in a synset.
\begin{figure}\begin{center}
\begin{bundle}{$\bigcirc$}
\chunk[0]{\w{wonderful}...
...end{bundle}}
\end{bundle}}
\end{bundle}}
\end{bundle}
\end{center}\end{figure}


\begin{displaymath}
\w{Midshire is a}
\left\lbrace
\begin{array}{lll}
\bs{0} & \...
...textbf{1}} & \w{\textbf{town}} & .5
\end{array}\right\rbrace
.
\end{displaymath}

In the above example, the variable-length codewords were chosen according to a Huffman tree (see Figure [*]). Recall that, given any set of symbols, a binary prefix-code can be constructed by arranging the symbols as leafs in a tree in which each node has two branches. Thinking of them as labeled either with 0 or 1, we can assign a binary codeword to each symbol by concatenating all the bits along the path from the root to the symbol. Prefix-codes have the property that they never assign a codeword that is the prefix of another codeword, which is important because otherwise a string resulting from concatenation of such codewords could not be uniquely deciphered. If $X$ and $Y$ are two symbols assigned to leaf-nodes $n_X$ and $n_Y$, then $X$ will of course be assigned a longer codeword than $Y$ if $n_X$ it is at a greater depth in the tree than $n_Y$. Huffman-trees in particular have the important property that they provide optimal compression in the sense that if $X$ occurs more frequently in a string of symbols that is to be compressed than $Y$, then $n_X$ will be at a smaller depth in the tree than $n_Y$. Huffman-trees can be constructed for any alphabet.

The code's impact on the distribution of words in the cover follows intuitively. In one of two cases (0, 1), a random bitstring will start with 0. Only in one out of four cases (00, 01, 10, 11), it will start with 10, etc. The probabilities will, of course, sum up to one, because some choice is made in any case, and since we are using the binary system, the probabilities are restricted to negative powers of two. Figure [*] demonstrates the use of relative entropy for quantifying the security of such a system. It can be seen that the restriction to negative powers of two is theoretically exploitable, however, () shows some approaches that make more precise mimicry possible.

Figure: An example for relative entropy.
OT1panrmn
% latex2html id marker 2718
\fbox{
\begin{minipage}{0.95\linewidth}
\par
\begin...
... \mathbf{0.0247706}
\end{eqnarray*}\end{small}\end{minipage}\par
\end{minipage}}

All approaches we have seen so far have one basic idea in common: transforming a sequence of symbols

\begin{displaymath}
s_1,s_2,s_3,\ldots,s_n
\end{displaymath}

into a sequence

\begin{displaymath}
T(s_1)~\vert~
T(s_2)~\vert~
T(s_3)~\vert~
\ldots~\vert~
T(s_n),
\end{displaymath}

which has a ``dual'' interpretation, one with regard to the cover-channel, one with regard to a secret message.

Here $T$ is some code that reinterprets single symbols, and $\vert$ is some operation that reconstructs the secret message. We have seen three examples for $T$: a binary block-code, a binary variable-length code (in particular the Huffman-code) and numbers with respect to alphabetic order. The operation most commonly used for ``$\vert$'' is string-concatenation, since it is simple and computationally efficient. However, other operations are possible, such as mixed-radix conversion.

4.2 Sentences and Syntactic Equivalence: Context-Free Mimicry

Context-Free Mimicry

() demonstrated a more sophisticated approach with his context-free mimic functions. It provides higher capacities, since it uses not only the choice for a word for steganographic purposes, but also the choice for a syntactic structure, relating the words. His approach closely mimics the real structure of natural language, since it generates context-free structures, just like natural languages do.

Recall that the Kleene-closure $X^{*}$ of a set of symbols $X$ is the set of finite sequences that can be constructed by choosing each of its elements from $X$ (including the empty sequence). Further, recall that a context-free grammar $G=(V,T,P,S)$ consists of a set of variables $V$, a set of terminals $T$, a finite set $P$ of productions and a designated start-symbol $S \in V$. Following (), if $A \rightarrow \beta$ is a production of $P$ and $\alpha$ and $\gamma$ are strings in $(V \cup T)^{*}$, we define a relation $\Rightarrow$ as $\alpha A \gamma \Rightarrow \alpha \beta \gamma$, ( $\alpha A \gamma$ directly derives $\alpha \beta \gamma$). If $\stackrel{*}{\Rightarrow}$ is the reflexive and transitive closure of $\Rightarrow$, then the language characterized by the grammar $G$ can be defined as

\begin{displaymath}
L(G) = \lbrace s \vert s \mbox{ is in } T^{*} \mbox{ and } S \stackrel{*}{\Rightarrow} s \rbrace
\end{displaymath}

Clearly, if a string $s$ is in $L(G)$, there will be a sequence of direct derivations of the form

\begin{displaymath}
S \Rightarrow \alpha_1,~~
\alpha_1 \Rightarrow \alpha_2,~~
\...
... \Rightarrow \alpha_3,~~
\ldots,~~
\alpha_{m-1} \Rightarrow s.
\end{displaymath}

If there exists only one such sequence (disregarding permutations) for each string $s \in L(G)$, we call $G$ unambiguous, and this is the only case we wish to consider furtheron.

A probabilistic context-free grammar (PCFG) associates a probability $P(p)$ with each production $p \in P$ to be chosen. If $v$ is a fixed non-terminal variable, then all productions $(v \rightarrow \alpha) \in P$ are mutually exclusive and the probabilities of these productions sum up to one.

Wendy might be a computer using such a grammar to observe a cover-channel. A datagram $e$ will be suspicious if $e \notin L(G)$ and if $D(C \vert\vert E) > \epsilon$, where $C$ is the set of all productions occuring in the cover-channel and $E$ is the set of productions the stegosystem actually uses to derive $e$ from $S$.4.1

Then, if Alice wants to send a message $m$ to Bob, she first has to randomize her message to get a bitstring $B(m)$ with bits uniformly distributed. Of course she can not transmit a 0/1-coded random bitstring over a channel, arbitrated by Wendy, unless $\lbrace 0, 1 \rbrace^{*} \in L(G)$, but she can use $G$ to generate messages that will be in $L(G)$.

To do so, she starts with $S$, and applies productions to non-terminal symbols until she is left with a grammatically correct message. Whenever she expands a non-terminal symbol $v$, and when there are several productions expanding $v$, she can use these ``degrees of freedom'' to encode some bits from the message.

In fact, the approach is very similar to that presented in the previous section. Alice constructs a sequence $d$ that serves a dual purpose. First, $d$ will be a sequence of productions deriving a message $e$ from $S$, i.e.

\begin{displaymath}
S \Rightarrow \alpha_1,~~
\alpha_1 \Rightarrow \alpha_2,~~
\...
... \Rightarrow \alpha_3,~~
\ldots,~~
\alpha_{m-1} \Rightarrow e.
\end{displaymath}

Secondly, Bob will be able to use a code $T$ and an operation ``$\vert$'' such that

\begin{displaymath}
T(S \Rightarrow \alpha_1)~\vert~
T(\alpha_1 \Rightarrow \alp...
...w \alpha_3)~\vert~
\ldots~\vert~
T(\alpha_{m-1} \Rightarrow e)
\end{displaymath}

reproduces $B(m)$. The operation ``$\vert$'' used by Wayner is string-concatenation and the code $T$ is a Huffman-code, where there is a separate Huffman-tree for each set of productions expanding a non-terminal. (The same restrictions that were presented in the previous section apply). Note that it would theoretically be possible to use block-codes or alphabets, or to reconstruct $B(m)$ by mixed-radix conversion, since we are dealing with nothing but symbolic steganography, with the restriction that symbols happen to denote productions from a CFG.

If Alice and Bob agree on some parsing-conventions, this sequence will be uniquely recoverable, given only the steganogram $e$ and the grammar $G$, and therefore the bitstring will be recoverable as well. On the other hand, Alice and Bob can be sure that $C$ is innocuous to Wendy, and therefore they have exploited a subliminal channel.

Figure: The example grammar by ().
\begin{figure}\par
\begin{small}
\begin{center}
\par
\begin{eqnarray}\cfgrule\s{...
...} ~ (.25) ~~ \bs{11}\end{eqnarray}\par\end{center}\end{small}\par
\end{figure}

Let's consider the example used by (): Figure [*] shows a PCFG. The probabilities are given in parentheses. We could have measured them by counting the occurences of the productions in a treebank (a statistically representative collection of syntax-trees). This grammar will, for example, derive a cover

\begin{displaymath}
% latex2html id marker 2218S \stackrel{\ref{ex3:1}}{\Right...
...{\Rightarrow} c_3,~~
c_3 \stackrel{\ref{ex3:4}}{\Rightarrow} C
\end{displaymath}

where the numbers are rule-numbers from Figure [*] and

\begin{eqnarray*}
c_1 &=& \s{AB} \\
c_2 &=& \gs{Good Golly } \s{B} \\
c_3 &=& ...
...& \gs{Good Golly, loving is better than pickles for lunch. } \\
\end{eqnarray*}

In our example, when reconstructing the bitstring

\begin{displaymath}
% latex2html id marker 2237T(S \stackrel{\ref{ex3:1}}{\Rig...
...row} c_3)~\vert~
T(c_3 \stackrel{\ref{ex3:4}}{\Rightarrow} C)
\end{displaymath}

we know that % latex2html id marker 2578
$T(c_2 \stackrel{\ref{ex3:A}}{\Rightarrow} c_3)$ decodes to 0. If we had % latex2html id marker 2580
$T(c_2 \stackrel{\ref{ex3:B}}{\Rightarrow} c_3)$ it would be 10 and % latex2html id marker 2582
$T(c_2 \stackrel{\ref{ex3:C}}{\Rightarrow} c_3)$ would decode to 110, etc. Figure [*] shows the bitstrings, assigned by the Huffman trees for each of the non-terminals.

We can now go through a complete example. The ``$\bullet$'' will be used to denote the current state in reading the secret bitstring and in grammatical production.

  1. Start with a bitstring \fbox{$\bullet\bs{11101011}$} and try to encode it to \fbox{$\bullet \s{S}$}. Choose between the four possible productions to expand an $S$. Since the string 11 is a prefix of the secret message and is associated with production ([*]), use it to expand $S$ to $\s{D} \s{B}$.
  2. Encode bitstring \fbox{$\bs{11}\bullet\bs{101011}$} to \fbox{$\bullet \s{D} \s{B}$}. There are three possibilities to expand $D$. Since 10 is a prefix of the message, apply production ([*]).
  3. Encode bitstring \fbox{\bs{1110}$\bullet$\bs{1011}} to \fbox{$\gs{Well, } \bullet \s{B}$}
  4. Encode bitstring \fbox{$\bs{111010} \bullet \bs{11}$} to \fbox{$\gs{Well, } \gs{a winter's night} \bullet \s{E}$}
  5. Stop at state \fbox{$\bs{11101011}\bullet$} which encodes to
    \fbox{$\gs{Well, } \gs{a winter's night }
\gs{shouldn't be } \gs{overestimated. } \bullet$}.

Note that, as opposed to replacement of lexically synonymous words, mimicry completely disregards semantic aspects of language. Data is not hidden in linguistic ambiguity, but in semantically significant parts of language. Therefore mimic-functions, as studied by Wayner, can only fool machines, not humans.

4.3 Meanings and Semantic Equivalence: The Ontological Approach

The Ontological Approach

Of the techniques considered herein, the ontological one is the most sophisticated approach with respect to modelling semantics. Instead of implicitly leaving semantics intact by replacing only synonymous words while embedding information into an innocuous text, an explicit model for ``meaning'' is used to evaluate equivalence between texts.

Consider the following examples:

Although the question whether or not one can talk about equal or distinct ``meanings'' should be left open with respect to its philosophical implications, let's pretend we could capture meaning by what () calls an Artificial Natural Language (ANL). The idea is to define the ANL in such a way, that translation from any of the above natural language sentences to ANL would yield the same representation.

For this to be successful in the above case, we would have to be able to translate between English and ANL, between German and ANL, we would have to understand sentences in active-voice and passive-voice, and we would have to make use of some knowledge about the world under discussion, such as the fact that Steve's guitar is named ``Evo''.

Figure: A simplified version of the example for systemic grammar presented by ()
\includegraphics{img/sysgra.eps}

One formalism traditionally considered a good candidate for an ANL is First-Order Predicate-Calculus. The meaning of all of the four above sentences could be represented by

\begin{eqnarray*}
\exists e,s,g:~~ && \textit{IsA}(e,\textit{Playing}) \\
&\we...
...IsA}(g,\textit{Guitar}) \wedge \textit{HasName}(g,\textit{Evo}).
\end{eqnarray*}

The predicate $IsA(x,y)$ reads ``$x$ is a kind of $y$'' or ``$x$ is an instance of $y$'' and is frequently used by computational linguists to state inclusion in the broadest sense. An important property is that it is transitive. Naturally if $IsA(x,y)$ and $IsA(y,z)$ then $IsA(x,z)$. Everything that has to do with an event is usually described in terms of its type, an actor and an experiencer. The type of an event could, for example, be $\textit{Playing}$, $\textit{Writing}$, $\textit{Building}$, indicating that, in this event, someone plays something, someone writes something, or someone builds something. Here it can be seen that an event's type is closely linked to the verb used to describe it. If $a$ is an actor taking part in event $e$, $\textit{Actor}(a,e)$ indicates $a$ as the object that causes the event to happen. It could be indicated by a sentence's subject. If $x$ is an experiencer taking part in event $e$, $\textit{Experiencer}(e,x)$ indicates $x$ as an object whose state is in any way altered in the course of the event. It is usually indicated by a sentence's object.

With respect to this model, the above expression formalizes the meaning of the sentence by the assertion ``There is an $e$, such that $e$ is the event of playing something, the actor taking part in $e$ is $s$, $s$ has the name $\textit{Steve}$, the experiencer taking part in $e$ is $g$, $g$ is a $\textit{Guitar}$, and $g$ is called $\textit{Evo}$.''

Linguists call the ``semantic'' model, just presented ``deep structure''. It is a very vague way of modeling the semantics behind a natural language sentence, and therefore it might be questioned whether one can talk about a meaning-representation in the context of deep structure. However, it demonstrates what it takes for a model to go beyond purely syntactic issues.

Natural language generation is the problem of translating such an ANL representation to a natural language. In the course of this translation many decisions must be made. Formalizing these decisions made in the course of generating a sentence is the basic idea of systemic grammar. One could say that systemic grammars are somehow anchored in the ``semantic'' view of language while context-free grammars take a ``more syntactic'' point of view. Therefore systemic grammars are often the linguistic model of choice, when it comes to automatically generating sentences. Figure [*] shows an example.

In this example, generating natural language sentences from meaning, amounts to making the following decisions:

  1. What is the grammatical mood? Indicative or Imperative? (Steve plays the guitar. or Steve, play the guitar!)
  2. What constraints do we have regarding transitivity? Is it a material process, where the verb expresses something about its noun, as in ``Steve is playing.'' or ``Steve is waiting.'', or do we have a relational process as in ``Steve plays the guitar.'' or ``Steve is waiting for the bus.'' where the verb relates a subject and an object?
  3. What mood do we want to use? Active as in ``Steve plays the guitar.'' or passive as in ``Steve is played by the guitar.''?

If we have a specific meaning ``in mind'' (or rather represented in ANL), we can make different sequences of choices, as we traverse this network, that do not affect meaning. For example, the decision voice in the grammar tells us, that we can go for active, which means that the semantic actor will syntactically be in subject position, and the semantic experiencer will syntactically end up in object position. This would yield Steve plays the guitar. The grammar also offers passive, which means that the semantic actor will end up in object position, and the semantic experiencer will end up in subject position. Then, the very same meaning will be expressed as The guitar is played by Steve. Almost all such grammatical choices are instances of ambiguity, where one meaning is expressible in many ways. Of course mood is not the only example, transitivity could also be used, by generating two sentences Steve plays. The guitar is played.

If we did not have an ANL-representation of a sentence's meaning available, we could never judge whether or not two sequences of grammatical choices are equivalent in meaning. Therefore, this basic approach was presented as the ``ontological'' one, since it relies on a formal model of the semantics behind the text that should be generated.

Note that the symbolic approach is not symbolic because it replaces symbols, but because it replaces them in such a way, that symbolic ``correctness'' is preserved, and analogously the syntactic approach is not syntactic because it replaces syntactic structures, but because it preserves syntactic correctness. This is crucial to the understanding of the semantic approach, which does not change semantics, but rather preserves a specific formal interpretation of the semantics behind a natural language text.


next up previous contents
Next: 5. Systems For Natural Up: Towards Linguistic Steganography: A Previous: 3. Lexical Language Processing   Contents
Richard Bergmair 2005-01-31