As we are only yet beginning to understand the cryptographic applications of natural language systems, there are still only a few implementations of linguistic steganography:
Winstein's system is the basis of what was presented in
Section
.
A dictionary groups words into interchangeability-sets;
substitution from these interchangeability sets serves as the linguistic
ambiguity that is exploited for carrying the steganogram.
The encoder replaces words in a given innocuous text, by looking them up
in a lexicon and, if found in an interchangeability set, interpreting them
as mixed-radix digits in accordance with their alphabetic order.
Winstein was the first to express the idea of using mixed-radix coding for
this purpose. He also brings up a few interesting issues with lexical
steganography in its pure form.
First, there should be a way to adapt the density of word-substitutions to the amount of information that needs to be encoded. Encoding in a pure left-to-right manner when iterating through the words of a document, would result in aggressive substitution of words at the beginning of a document, until the encoder runs out of secret bits and then leaving the rest of a document untouched. Secondly, no matter how sophisticated a linguistic model we can employ, it will still be beneficial to let a human influence the word-choices, at least to some degree.
These problems are tackled by a blocking-scheme making possible a more evenly distributed pattern of substitutions and a hash-function providing for a one-to-many mapping from the secret to the word-configuration, which is why the author can make the final decision for one of many word-choice configurations that would be appropriate to encode the secret.
We will demonstrate Winstein's scheme considering the example shown in Figure
. A cover that contains
words
that can be
replaced from synsets
is first analyzed for
its total capacity in bits (
).
If the number of secret bits is
, the capacity would suffice
to encode the secret
times.
The modulation frequency
is defined by Winstein as
the smallest integer
between 1 and a maximum frequency
,
such that the capacity given by the word-choices would suffice to
hide the secret
times,
In our example we have synsets of capacities
and we want
to use them to encode the secret 001101110.
The capacity is then given by,
,
so the word-replacements allow for encoding 19 bits of data. The
secret has only
bits of data, so we get a modulation
frequency of
.
The secret is seen as a bitstring, which is
is divided into blocks, each of which contains a number
of bits equal to what Winstein calls the oversample factor
.
In the example, if we have
, we would divide the secret
001101110 into three blocks: 001, 101,
and 110.
The replaceable words
are then composed into blocks
in such a way that a block of words is always ``responsible'' for encoding
a block of bits from the secret.
This assignment is made by traversing
a list of the text's words, sorted by their synsets' cardinalities.
A block is constructed by assigning words to it, as long as the block's
capacity in bits is smaller than
.
In the example we would reorder the word-choices of synsets from the
sizes
in such a way, that we have
.
Using this list, we could assign blocks, however we have to make
sure, that the number of configurations for the block is
,
in our example
.
We start with the element of capacity
. Since
we have to
assign another element to get a block
. Since
, we
can use
as the first block. We go on to the next element,
we find that
,
, and finally
, so
the next block contains the elements
, and so on.
This ensures that each block provides for
times the capacity necessary
to encode its block of secret bits. Recall that, in the example the word-choices
had a capacity of over
bits, while the secret had only
. This scheme
simply distributes the unused capacities over all blocks.
As opposed to choosing the words in a left-to-right manner, this method results
in distributing its replacements over the whole text.
Sorting by synset-size ensures that the positions of replacements are chosen
in such a way that words that can be replaced from larger
synsets are always preferred to words that have to be replaced from smaller
ones, regardless of their position in the text (in case we would leave elements
unused, which can be the case if we restrict
to small values).
|
The coding
applied within each block is a hash function converting a
given number
from its mixed-radix representation to a binary number
of length
.
This has the advantage that, on one side, there will be multiple
which
decode to the same secret bitstring, and on the other, the different
will (usually) result in different mixed-radix digits
(where the block consists of
words) so the author could manually choose
that
which results in the most appropriate word-choices.
A problem is that if the
least-significant digits of the mixed-radix
representation of
satisfy
.
Despite these technical problems however, the idea is clear: By providing more word-choice configurations in each block than there are configurations of the bitstring we wish to encode in the block, we can get some ``degrees of freedom'' which can purposely be left unused by the coding. A secret bitstring could therefore result in a number of different word-choice configurations. The final decision which of them is used is not made by the steganographic encoding, but by a linguistic model or by the author.
Winstein derives his dictionary from WordNet, by intersecting synonymy-sets that are not disjunct, and filtering out all synonymy-sets that contain only one word.
Some text from an article from ACM TechNews 6(598)
was given in Figure
, to provide an impression of the
word-replacements.
The dictionary contains the following interchangeability sets:
Chapman's system is named NICETEXT, and it is similar to Winstein's in that it uses a purely symbolic model of linguistic similarity. It also relies on word-replacements from disjunct interchangeability sets. However, as opposed to the system of Winstein, it requires interchangeability sets to have cardinalities that are powers of two so binary block-codes can be used. Furthermore, Chapman's approach does not rely on a given innocuous text in which to replace words, it rather generates supposedly innocuous text, using the syntactic structure inherent to a style-template. A style-template originates from a grammar or from a sample-text.
The former approach relies on a context-free grammar, which is randomly expanded, to create sentences. The terminal symbols of that grammar are word-types.
A common kind of word-types are what linguistis call parts of speech, a concept that turns out to be very hard to define. As we do not want to go into the linguistic details here, we will stick with the common-sense definitions of noun (N), verb (V), adverb (Adv), adjective (Adj), pronoun, preposition (P), conjunction, participle, and determiner (Det).
For example the grammar
The second way of generating sentence-models is by extracting them from
given sample-sentences. For example, the sample-sentence
NICETEXT's
word-types are not in any way limited to the linguistic parts of speech, they
can be defined at any level of granularity for linguistic symbols.
In the original approach, () used parts of speech as word-types.
Since a word's part of speech is an abstraction for the syntactic role it can
play in a sentence, the system turned out mimicing the syntactic structure of
a given corpus.
In a later paper, however,
() describe the usage of synonymy-classes as basis
of sentence-models, as in:
A dictionary that could be used with the above sentence-model
is given in Figure
. The encoder generates
a text by randomly choosing a sentence model and choosing words for
types in accordance with the assigned codeword. This randomness has
the advantage that the same secret bitstring will result in a different
text, every time the encoder is run. Of course this is a degree of freedom
that is left unused by the coding, and therefore constitutes unused
potential.
|
|
Wayner's approach of context-free mimicry has already been presented in
Section
. The only resource it relies on is a
probabilistic context-free grammar (PCFG) characterizing the
covers that should be mimiced. The most prominent such PCFG
is Wayner's baseball-game-grammar demonstrated in Figure
.
It is employed in the mimicry applet on the homepage
to his book Disappearing Cryptography (, ).
() is another website demonstrating mimicry. This
implementation of Wayner's system employs a grammar that mimics
the appearance of spam. However, no attempts have been made so far
to apply Wayner's algorithm with larger-scale linguistic resources
and in practical scenarios.
However, a direct implementation of the system would, in practice, be limited by the inadequacy of context-free grammars as models for natural language. Many features of the English language are known that cannot easily be modelled by CFGs, and there are even some peculiarities that cannot be expressed by CFGs at all.
The system developed by Atallah et al. differs from the other systems in that it does not aim at steganography in general, but watermarking. Watermarking systems must fulfill additional requirements, such as robustness and resistance to collusion attacks. Here adversaries are interested in damaging the watermark without seriously degrading the cover text.
The system of Atallah et al. also differs in a second important respect. As opposed to the replacement-systems considered so far, that preserve meaning implicitly by carrying out meaning-preserving replacements, this system takes an explicit approach to meaning by representing semantics in an ANL.
Furthermore, Atallah's scheme does not embed a secret into a natural language representation of a cover, but into an ANL representation which needs to be translated back and forth to natural language.
More formally, the embedding-systems we have seen so far encoded a message
into a cover
, by deriving a steganogram
from a function
, which operates directly on the natural language
representation of a cover
.
Atallah's scheme, in contrast, first analyzes this natural language representation,
by translating it to an ANL.
If
is the set of all ANL-represented
covers
and
is the set of all ANL-represented
steganograms
, then his embedding function is
defined as
, and the extraction-function
is
.
However, since the covers need to be represented in natural language
for transmission, we need a natural language analyzer
and a generator
, so
we can derive a steganogram
from a cover
,
by applying
and we can extract the message again
using
.
Recall that, in order to maintain unique decodability, an embedding
function
and its extraction function
have to be chosen
in such a way that
, in the classical case. For
Atallah it means that
, i.e.
messages need to ``survive'' translation back and forth from the
ANL to natural language, which is not quite trivial. However it
should make the scheme robust against attacks by an adversary who can
(, )
To someone whose ``philosophic'' background is artificial intelligence, the above approach seems a bit paradoxical. How can transformation on a meaning-representation ever be meaning-preserving? Here it might be pointed out that the philosophy inherent to Atallah's approach seems to originate from machine translation rather than artificial intelligence.
If we think of the ANL as a natural language like Spanish, then the above
statements seem reasonable. First, an English-language text
is analyzed
using
to yield an Spanish translation
.
Then meaning-preserving transformations on the Spanish sentences
are made using
to yield a watermarked version
. Finally,
the text is generated using
to yield
, the English
translation.
If the adversary carries out an attack, by translating the text
to German, to yield
, then it will still be possible to recover
the watermark, since translating the German text
back to
Spanish will basically assure that
yields the
same result that was originally the output
from the embedding.
However, this will be difficult in practice. The world's best translators will probably not come up with the very same Spanish translation when confronted with an English language and a German language text, even if the two texts agree in even the finest connotations.
It is crucial to the understanding of Atallah's approach, that it essentially works by exploiting the inadequacy of any language, even an artificial, completely formal one, to fully capture meaning. This is why I prefer the term ANL to TMR (which is short for ``Text Meaning Representation'' and traditionally used in the context of ontological semantics). If a TMR would, in fact, represent meaning, then what ``meta-meaning'' would we judge meaning-preservingness by, when we make ``meaning-preserving'' transformations? This paradox is not exclusively of philosophic interest. It has practical implications in the construction of ontology-based watermarking-schemes.
First, under the common-sense notion of a text's ``meaning'', the ``more adequate'' an ANL gets as a meaning-representation, the more difficult it will be to find meaning-preserving transformations closed within this ANL-domain.
Secondly, if we reject the idea of a true meaning-representation, then, considering the state-of-the-art in linguistics, I doubt that natural language watermarks that reliably ``survive'' translation from one natural language to another are near at hand.
The search for the ``Universal Grammar'' can almost be described as a quest for the holy grail amongst linguists following the tradition of () and (). These landmark papers suggested that there might actually be a set of features common to all natural languages, modified by a small set of options that are chosen differently in each language. Designing translation-robust watermarking schemes would simply amount to coding data within such features common to all languages, while avoiding to code data in features that are specific to single languages, and would therefore be destroyed in translation. However approaches to such universal grammars are far from practical applicability in a completely automated computational setting.
However,
it might be pointed out that the sequences of word-choices, or
sequences of context-free productions, considered so far are
nothing but simple ANLs. So, no matter whether or not one accepts
the idea of a true meaning-representation, one will have to admit
that Figure
is an impressive demonstration of what
can be achieved with more sophisticated ANLs.
The sentences that contain the watermark are
highlighted in boldface. The left side shows the original version,
the right side contains a watermark.
Figure
shows some ANL-represented example
sentences, and the impact of two possible transformations,
``grafting'' and ``pruning''. The basic idea of the coding technique
is to establish a correspondence between a sentence's tree-structured
ANL-representation and a secret bitstring. Transformations
are then applied to the ANL-representation until the corresponding
bitstring yields the desired secret. If this is impossible,
another sentence is used to encode the secret.