next up previous contents
Next: 6. Lessons Learned Up: Towards Linguistic Steganography: A Previous: 4. Approaches to Linguistic   Contents

Subsections

5. Systems For Natural Language Steganography

Systems

As we are only yet beginning to understand the cryptographic applications of natural language systems, there are still only a few implementations of linguistic steganography:

  1. Tyrannosaurus Lex (, ,) substitutes words for synonyms in a text, coding blocks of data in a mixed-radix-fashion.
  2. NICETEXT (, ,,,,) degrades corpora to ``style-templates'' that specify sequences of word-types, and mimics them by replacing types by dictionary words, using a binary block-code.
  3. () published sample-implementations of his mimicry-systems on the website to his book ``Disappearing Cryptography''. () is not an implementation in its own right, but a nice application of Wayner's system. It employs a grammar that mimics spam.
  4. The system by (,,) relies on transformations closed within an ANL-domain. The method of quadratic residues implements additional cryptographic requirements for watermarking.

5.1 Winstein

Figure: An article from ACM TechNews 6(598) with the bitstring 1101 embedded by Winstein's system.
\begin{figure}\par
\begin{center}
\begin{small}
\par
\begin{tabular}{p{0.47\line...
...xt & (b) steganogram
\end{tabular}\par
\end{small}
\end{center}\par
\end{figure}

Winstein's system is the basis of what was presented in Section [*]. A dictionary groups words into interchangeability-sets; substitution from these interchangeability sets serves as the linguistic ambiguity that is exploited for carrying the steganogram. The encoder replaces words in a given innocuous text, by looking them up in a lexicon and, if found in an interchangeability set, interpreting them as mixed-radix digits in accordance with their alphabetic order. Winstein was the first to express the idea of using mixed-radix coding for this purpose. He also brings up a few interesting issues with lexical steganography in its pure form.

First, there should be a way to adapt the density of word-substitutions to the amount of information that needs to be encoded. Encoding in a pure left-to-right manner when iterating through the words of a document, would result in aggressive substitution of words at the beginning of a document, until the encoder runs out of secret bits and then leaving the rest of a document untouched. Secondly, no matter how sophisticated a linguistic model we can employ, it will still be beneficial to let a human influence the word-choices, at least to some degree.

These problems are tackled by a blocking-scheme making possible a more evenly distributed pattern of substitutions and a hash-function providing for a one-to-many mapping from the secret to the word-configuration, which is why the author can make the final decision for one of many word-choice configurations that would be appropriate to encode the secret.

Figure: Encoding a secret by Winstein's scheme.
\includegraphics{img/winstein3.eps}

We will demonstrate Winstein's scheme considering the example shown in Figure [*]. A cover that contains $n$ words $s_1,s_2,\ldots,s_n$ that can be replaced from synsets $S_1,S_2,\ldots,S_n$ is first analyzed for its total capacity in bits ( $c= \lfloor\log_2(\prod \vert S_i\vert)\rfloor$). If the number of secret bits is $s$, the capacity would suffice to encode the secret $\frac{c}{s}$ times.

The modulation frequency $f$ is defined by Winstein as the smallest integer $f$ between 1 and a maximum frequency $f_{max}$, such that the capacity given by the word-choices would suffice to hide the secret $f$ times,

\begin{displaymath}
f=\min \lbrace \lfloor\frac{c}{s}\rfloor, f_{max} \rbrace.
\end{displaymath}

In our example we have synsets of capacities $4,4,6,8,8,6,4,4$ and we want to use them to encode the secret 001101110. The capacity is then given by, $c=\log_2(4*4*6*8*8*6*4*4)=19.169925$, so the word-replacements allow for encoding 19 bits of data. The secret has only $s=9$ bits of data, so we get a modulation frequency of $f=2$.

The secret is seen as a bitstring, which is is divided into blocks, each of which contains a number of bits equal to what Winstein calls the oversample factor $o$. In the example, if we have $o=3$, we would divide the secret 001101110 into three blocks: 001, 101, and 110.

The replaceable words $s_1,s_2,\ldots,s_n$ are then composed into blocks in such a way that a block of words is always ``responsible'' for encoding a block of bits from the secret. This assignment is made by traversing a list of the text's words, sorted by their synsets' cardinalities. A block is constructed by assigning words to it, as long as the block's capacity in bits is smaller than $f*o$.

In the example we would reorder the word-choices of synsets from the sizes $4,4,6,8,8,6,4,4$ in such a way, that we have $8,8,6,6,4,4,4$. Using this list, we could assign blocks, however we have to make sure, that the number of configurations for the block is $\geq 2^{f*o}$, in our example $2^{f*o}=2^{2*3}=64$. We start with the element of capacity $8$. Since $8<64$ we have to assign another element to get a block $8,8$. Since $8*8 \geq 64$, we can use $8,8$ as the first block. We go on to the next element, we find that $6<64$, $6*6<64$, and finally $6*6*4 \geq 64$, so the next block contains the elements $6,6,4$, and so on. This ensures that each block provides for $f$ times the capacity necessary to encode its block of secret bits. Recall that, in the example the word-choices had a capacity of over $19$ bits, while the secret had only $9$. This scheme simply distributes the unused capacities over all blocks.

As opposed to choosing the words in a left-to-right manner, this method results in distributing its replacements over the whole text. Sorting by synset-size ensures that the positions of replacements are chosen in such a way that words that can be replaced from larger synsets are always preferred to words that have to be replaced from smaller ones, regardless of their position in the text (in case we would leave elements unused, which can be the case if we restrict $f_max$ to small values).

Figure: Mapping the bitstring 01 to a mixed-radix number representing word-choices (adapted from , )
\includegraphics{img/winstein1.eps}

Figure: Word-choices that coincide (adapted from , )
\includegraphics{img/winstein2.eps}

The coding $U(x)$ applied within each block is a hash function converting a given number $x$ from its mixed-radix representation to a binary number of length $o$

\begin{displaymath}
U(x) \equiv x ~ (\textrm{modulo } 2^o)
\end{displaymath}

as demonstrated in Figure [*].

This has the advantage that, on one side, there will be multiple $x$ which decode to the same secret bitstring, and on the other, the different $x$ will (usually) result in different mixed-radix digits $a_m \ldots, a_2,a_1,a_0$ (where the block consists of $m$ words) so the author could manually choose that $x$ which results in the most appropriate word-choices.

A problem is that if the $l$ least-significant digits of the mixed-radix representation of $x$ satisfy

\begin{displaymath}
\prod_{0 \leq i < l}\vert S_i\vert \equiv 0 ~ (\textrm{modulo } 2^o)
\end{displaymath}

then these $l$ word-choices will coincide for all the possible $x$. Instead of $x ~(\textrm{modulo } 2^o)$, Winstein uses the hash-value

\begin{displaymath}
U(x) \equiv x + \lfloor\frac{x}{2^o}\rfloor ~ (\textrm{modulo } 2^o)
\end{displaymath}

in order to ``discourage falling in sync with the document state'', as he puts it. This works perfectly for the example demonstrated in his paper (, ) in which $o=2$, and $\vert S_0\vert=\vert S_1\vert=4$. However, we would like to point out the straightforward case in which

\begin{displaymath}
\prod_{0 \leq i < l}\vert S_i\vert+1 \equiv 0 ~ (\textrm{modulo } 2^o)
\end{displaymath}

since it gives rise to the very same problem with the modified word-choice hash, for example for $\vert S_0\vert=3$ and $\vert S_1\vert=4$, as demonstrated in Figure [*].

Despite these technical problems however, the idea is clear: By providing more word-choice configurations in each block than there are configurations of the bitstring we wish to encode in the block, we can get some ``degrees of freedom'' which can purposely be left unused by the coding. A secret bitstring could therefore result in a number of different word-choice configurations. The final decision which of them is used is not made by the steganographic encoding, but by a linguistic model or by the author.

Winstein derives his dictionary from WordNet, by intersecting synonymy-sets that are not disjunct, and filtering out all synonymy-sets that contain only one word.

Some text from an article from ACM TechNews 6(598) was given in Figure [*], to provide an impression of the word-replacements. The dictionary contains the following interchangeability sets:

\begin{displaymath}
\begin{array}{c}
\lbrace \w{bastioned}^{(0)}, \w{fortified}^...
... \\
\lbrace \w{iii}^{(0)}, \w{three}^{(1)} \rbrace
\end{array}\end{displaymath}

Therefore, the text has a storage capacity of four bits, since four replaceable words appear in the text and each word can be replaced by only one alternative. It is pure coincidence that all of the interchangeability sets in this example contain exactly two alternative words, leaving the mixed-radix-number in this example with the same digits as the binary representation.

5.2 Chapman

Chapman's system is named NICETEXT, and it is similar to Winstein's in that it uses a purely symbolic model of linguistic similarity. It also relies on word-replacements from disjunct interchangeability sets. However, as opposed to the system of Winstein, it requires interchangeability sets to have cardinalities that are powers of two so binary block-codes can be used. Furthermore, Chapman's approach does not rely on a given innocuous text in which to replace words, it rather generates supposedly innocuous text, using the syntactic structure inherent to a style-template. A style-template originates from a grammar or from a sample-text.

The former approach relies on a context-free grammar, which is randomly expanded, to create sentences. The terminal symbols of that grammar are word-types.

A common kind of word-types are what linguistis call parts of speech, a concept that turns out to be very hard to define. As we do not want to go into the linguistic details here, we will stick with the common-sense definitions of noun (N), verb (V), adverb (Adv), adjective (Adj), pronoun, preposition (P), conjunction, participle, and determiner (Det).

For example the grammar

$\displaystyle \cfgrule\s{S} \cfgis \s{NP} ~ \s{VP}$     (5.1)
$\displaystyle \cfgrule\s{NP} \cfgis \w{Det} ~ \w{N}$     (5.2)
$\displaystyle \cfgrule\s{NP} \cfgis \w{N}$     (5.3)
$\displaystyle \cfgrule\s{VP} \cfgis \w{V} ~ \s{NP}$     (5.4)

could be used to derive the sentence-model

\begin{displaymath}\w{Det} ~ \w{N} ~ \w{V} ~ \w{Det} ~ \w{N}\end{displaymath}

as in The boy chases the ball or to

\begin{displaymath}\w{N} ~ \w{V} ~ \w{Det} ~ \w{N}\end{displaymath}

as in Steve plays the guitar. However, this random expansion of grammar-rules serves only to derive a sentence model. Coding takes place only by substituting actual words for word-types. This is not to be confused with context-free mimic-functions that use context-free productions for actually coding data from the secret message.

The second way of generating sentence-models is by extracting them from given sample-sentences. For example, the sample-sentence

\begin{displaymath}\w{The boy chases the ball.}\end{displaymath}

could be tagged, to derive the sentence-model $\w{Det} ~ \w{N} ~ \w{V} ~ \w{Det} ~ \w{N}$ by looking up the sample-words in dictionaries.

NICETEXT's word-types are not in any way limited to the linguistic parts of speech, they can be defined at any level of granularity for linguistic symbols. In the original approach, () used parts of speech as word-types. Since a word's part of speech is an abstraction for the syntactic role it can play in a sentence, the system turned out mimicing the syntactic structure of a given corpus. In a later paper, however, () describe the usage of synonymy-classes as basis of sentence-models, as in:

\begin{displaymath}
\mbox{John's } \w{[synonymOfCar]} \mbox{ is }
\w{[synonymOfReally]} ~ \w{[synonymOfNice]}.
\end{displaymath}

Figure: A NICETEXT dictionary (, )
\begin{figure}\begin{center}
\begin{tabular}{lll}
\textit{Type} & \textit{Word} ...
...ymOfNice]} & wonderful & \texttt{11} \\
\end{tabular}.
\end{center}\end{figure}

A dictionary that could be used with the above sentence-model is given in Figure [*]. The encoder generates a text by randomly choosing a sentence model and choosing words for types in accordance with the assigned codeword. This randomness has the advantage that the same secret bitstring will result in a different text, every time the encoder is run. Of course this is a degree of freedom that is left unused by the coding, and therefore constitutes unused potential.

Figure: A sample of NICETEXT output from the ``Aesop's Fables'' style-template as presented by ().

The Doe and the Lion A DOE hard fixed by robbers taught refuge in a slave tinkling to a Lion. The Goods under- took themselves to aversion and disliked before a toothless wrestler on their words. The Sheep, much past his will, married her backward and forward for a long time, and at last said, If you had defended a dog in this wood, you would have had your straits from his sharp teeth. One day he ruined to see a Fellow, whose had smeared for its pro- vision, resigning along a fool and warning advisedly. said the Horse, if you really word me to be in good occasion, you could groom me less, and proceed me more. who have opened in that which I blamed a happy wine the horse of my possession. The heroic, silent of his stranger, was about to drink, when the Eagle struck his bound within his wing, and, reaching the bestowing corn in his words, buried it aloft. Mercury soon shared and said to him, OH thou most base fellow? The Leather and the Newsletter A MOTHER had one son and one sister, the former considerable before his good tasks, the latter for her contrary wrestler. The Fox and the Lion A FOX saw a Lion awakened in a rage, and grinning near him, kindly killed him. [...]

5.3 Wayner

Figure: The secret message INNOCENT, mimiced by Wayner's baseball-game grammar (, ).

It's time for another game between the Whappers and the Blogs in scenic downtown Blovonia . I've just got to say that the Blog fans have come to support their team and rant and rave . Play Ball ! Time for another inning . The Whappers will be leading off . Baseball and Apple Pie . The pitcher spits. Herbert Herbertson swings the bat to get ready and enters the batter's box . Here's the fastball . He tries to bunt, and Robby Rawhide grabs it and tosses it to first . Hey, one down, two to go. Here we go. Prince Albert von Carmicheal swings the baseball bat to stretch and enters the batter's box . Okay. Here's the pitch It's a spitter . High and outside . Ball . No contact in Mudsville ! Nothing on that one . Nice hit into short left field for a dangerous double and the throw is into the umpire's head ! Whoa ! The Blogs need two more outs. Here we go. Albert Ancien-Regime adjusts the cup and enters the batter's box . Yeah. And the next pitch is a knuckler . Nothing on that one . The next pitch is a wobbling knuckler . Whooooosh! Strike ! And the next pitch is a smoking gun . He just watched it go by . The last strike . Only three chances in this game . He's hefting some wood . Here we go. Sal Sauvignon adjusts the cup and enters the batter's box . Yeah. He's winding up . What a fast one that looked like it was rising . Whooooosh! Strike ! He's winding up . What what looks like a spitball . No contact in Mudsville ! Here's the pitch It's a wobbling knuckler . Whooooosh! Strike ! He's out of there . That inning proves why baseball is the nation's game [...]

Wayner's approach of context-free mimicry has already been presented in Section [*]. The only resource it relies on is a probabilistic context-free grammar (PCFG) characterizing the covers that should be mimiced. The most prominent such PCFG is Wayner's baseball-game-grammar demonstrated in Figure [*]. It is employed in the mimicry applet on the homepage to his book Disappearing Cryptography (, ). () is another website demonstrating mimicry. This implementation of Wayner's system employs a grammar that mimics the appearance of spam. However, no attempts have been made so far to apply Wayner's algorithm with larger-scale linguistic resources and in practical scenarios.

However, a direct implementation of the system would, in practice, be limited by the inadequacy of context-free grammars as models for natural language. Many features of the English language are known that cannot easily be modelled by CFGs, and there are even some peculiarities that cannot be expressed by CFGs at all.

5.4 Atallah, Raskin et al.

Figure: A text-sample of Atallah's system (, )
\begin{figure}\par
\begin{center}
\begin{small}
\par
\begin{tabular}{p{0.47\line...
...xt & (b) steganogram
\end{tabular}\par
\end{small}
\end{center}\par
\end{figure}

Figure: The sample trees given by ().
\begin{figure}\par
\begin{center}
\subfigure[original tree: \texttt{111100000101...
...wn}
\end{bundle}}
\end{bundle}}
\end{bundle}} }\end{center}\par
\end{figure}

The system developed by Atallah et al. differs from the other systems in that it does not aim at steganography in general, but watermarking. Watermarking systems must fulfill additional requirements, such as robustness and resistance to collusion attacks. Here adversaries are interested in damaging the watermark without seriously degrading the cover text.

The system of Atallah et al. also differs in a second important respect. As opposed to the replacement-systems considered so far, that preserve meaning implicitly by carrying out meaning-preserving replacements, this system takes an explicit approach to meaning by representing semantics in an ANL.

Furthermore, Atallah's scheme does not embed a secret into a natural language representation of a cover, but into an ANL representation which needs to be translated back and forth to natural language.

More formally, the embedding-systems we have seen so far encoded a message $m \in M$ into a cover $c' \in C'$, by deriving a steganogram $x \in E$ from a function $e : M \times C' \mapsto E$, which operates directly on the natural language representation of a cover $c'$. Atallah's scheme, in contrast, first analyzes this natural language representation, by translating it to an ANL. If $ANL_{C'}$ is the set of all ANL-represented covers $c' \in C'$ and $ANL_{E}$ is the set of all ANL-represented steganograms $x \in E$, then his embedding function is defined as $e : M \times ANL_{C'} \mapsto ANL_{E}$, and the extraction-function is $d : ANL_{E} \mapsto M$. However, since the covers need to be represented in natural language for transmission, we need a natural language analyzer $a : C' \mapsto ANL_{C'}$ and a generator $g: ANL_{E} \mapsto E$, so we can derive a steganogram $x \in E$ from a cover $c' \in C'$, by applying $x=g(e(a(c'),m))$ and we can extract the message again using $d(a(x))$.

Recall that, in order to maintain unique decodability, an embedding function $e$ and its extraction function $d$ have to be chosen in such a way that $d(e(m,c'))=m$, in the classical case. For Atallah it means that $d(a(g(e(a(c'),m))))=m$, i.e. messages need to ``survive'' translation back and forth from the ANL to natural language, which is not quite trivial. However it should make the scheme robust against attacks by an adversary who can

  1. ``Perform meaning-preserving transformations on sentences (including translation to another language).''
  2. ``Perform meaning-modifying transformations on sentences (but note that this cannot be applied to too many sentences, because of the requirement that the overall meaning of the text should not be destroyed).''
  3. ``Insert new sentences in the text.''
  4. ``Move sentences from one place of the text to another (including moving whole paragraphs, sections, chapters, etc.)''
(, )

To someone whose ``philosophic'' background is artificial intelligence, the above approach seems a bit paradoxical. How can transformation on a meaning-representation ever be meaning-preserving? Here it might be pointed out that the philosophy inherent to Atallah's approach seems to originate from machine translation rather than artificial intelligence.

If we think of the ANL as a natural language like Spanish, then the above statements seem reasonable. First, an English-language text $s$ is analyzed using $A(s)$ to yield an Spanish translation $e$. Then meaning-preserving transformations on the Spanish sentences are made using $E(e,m)$ to yield a watermarked version $e'$. Finally, the text is generated using $G(e')$ to yield $s'$, the English translation. If the adversary carries out an attack, by translating the text to German, to yield $s''$, then it will still be possible to recover the watermark, since translating the German text $s''$ back to Spanish will basically assure that $A(s'')=A(s')$ yields the same result that was originally the output $e'$ from the embedding.

However, this will be difficult in practice. The world's best translators will probably not come up with the very same Spanish translation when confronted with an English language and a German language text, even if the two texts agree in even the finest connotations.

It is crucial to the understanding of Atallah's approach, that it essentially works by exploiting the inadequacy of any language, even an artificial, completely formal one, to fully capture meaning. This is why I prefer the term ANL to TMR (which is short for ``Text Meaning Representation'' and traditionally used in the context of ontological semantics). If a TMR would, in fact, represent meaning, then what ``meta-meaning'' would we judge meaning-preservingness by, when we make ``meaning-preserving'' transformations? This paradox is not exclusively of philosophic interest. It has practical implications in the construction of ontology-based watermarking-schemes.

First, under the common-sense notion of a text's ``meaning'', the ``more adequate'' an ANL gets as a meaning-representation, the more difficult it will be to find meaning-preserving transformations closed within this ANL-domain.

Secondly, if we reject the idea of a true meaning-representation, then, considering the state-of-the-art in linguistics, I doubt that natural language watermarks that reliably ``survive'' translation from one natural language to another are near at hand.

The search for the ``Universal Grammar'' can almost be described as a quest for the holy grail amongst linguists following the tradition of () and (). These landmark papers suggested that there might actually be a set of features common to all natural languages, modified by a small set of options that are chosen differently in each language. Designing translation-robust watermarking schemes would simply amount to coding data within such features common to all languages, while avoiding to code data in features that are specific to single languages, and would therefore be destroyed in translation. However approaches to such universal grammars are far from practical applicability in a completely automated computational setting.

However, it might be pointed out that the sequences of word-choices, or sequences of context-free productions, considered so far are nothing but simple ANLs. So, no matter whether or not one accepts the idea of a true meaning-representation, one will have to admit that Figure [*] is an impressive demonstration of what can be achieved with more sophisticated ANLs. The sentences that contain the watermark are highlighted in boldface. The left side shows the original version, the right side contains a watermark.

Figure [*] shows some ANL-represented example sentences, and the impact of two possible transformations, ``grafting'' and ``pruning''. The basic idea of the coding technique is to establish a correspondence between a sentence's tree-structured ANL-representation and a secret bitstring. Transformations are then applied to the ANL-representation until the corresponding bitstring yields the desired secret. If this is impossible, another sentence is used to encode the secret.


next up previous contents
Next: 6. Lessons Learned Up: Towards Linguistic Steganography: A Previous: 4. Approaches to Linguistic   Contents
Richard Bergmair 2005-01-31