next up previous contents
Next: 9. Conclusions Up: Towards Linguistic Steganography: A Previous: 7. Towards Secure and   Contents

Subsections

8. Towards Coding in Lexical Ambiguity

Coding in Lexical Ambiguity

8.1 Two Instances of Ambiguity

Instances of Ambiguity

Figure: Two kinds of ambiguity involved in the replacement of move by run.
[``Forward ambiguity'']\includegraphics[scale=.9]{img/amb-cap.eps} [``Backward ambiguity'']\includegraphics[scale=.9]{img/amb-cap2.eps}

Basically, a lexical steganography system deals with two kinds of sense-ambiguity. We will refer to the sense-ambiguity an encoder is confronted with when deciding which synset the replacements of a specific word should be chosen from as forward ambiguityambiguity@forward, and the sense-ambiguity a decoder is confronted with, when deciding which synset a replacement was originally chosen from as backward ambiguityambiguity@backward.

Let $W$ be the set of words and $S$ be the set of synsets in a lexicon. We require that $W$ is enumerable and $S \subseteq 2^W$ is a set of synsets with words from $W$. In accordance with chapter [*], we use a function $\mathcal{L}: W \mapsto 2^S$ to denote the lexical evidence $\mathcal{L}(w)$ of a word $w$, which is nothing but the set of synsets that contain $w$. We use a function $\mathcal{C}: W \mapsto C$ to denote the contextual evidence $\mathcal{C}(w)$ of an occurence of $w$ in a text under investigation. We can think of $C$8.1as a set of contexts, but we will not be concerned with the actual data-structure, since it depends on the model employed by the actual disambiguation system. The disambiguation system will herein be a function $\textrm{dis}: 2^S \times C \mapsto S$, so that $s_o=\textrm{dis}(\mathcal{L}(o),\mathcal{C}(o))$ and $r \in s_o$ implies that $r$ is a correct replacement for $o$ in the specific context $\mathcal{C}(o)$.

A text in which a secret is to be embedded could, for example, contain the word $o=\textsf{move}$. When the encoder looks up the lemma move in its dictionary, it will find three synsets: $s_0 = \lbrace \textsf{move, run, go} \rbrace$, $s_1 = \lbrace \textsf{move, impress, strike} \rbrace$, and $s_2 = \lbrace \textsf{move, motion, movement} \rbrace$. These make up the lexical evidence $\mathcal{L}(o)= \lbrace s_0, s_1, s_2 \rbrace$. Since there are several alternatives from which to choose, we call $\mathcal{L}(o)$ forward-ambiguous. The disambiguator would be needed to decide upon the correct synset from which the replacements can be chosen. If it chooses $s_0$ from $\mathcal{L}(o)$, we can replace the original word $o=\textsf{move}$ by a word from $s_0$ determined for coding-purposes, for example $r=\textsf{run}$.

When decoding the secret again, the decoder would look up run in its dictionary, and will find several synsets: $s_0 = \lbrace \textsf{move, run, go} \rbrace$, $s_3 = \lbrace \textsf{run, go, work} \rbrace$, $s_4 = \lbrace \textsf{run, test} \rbrace$, which make up the lexical evidence $\mathcal{L}(r)= \lbrace s_0, s_3, s_4 \rbrace$. Since there are several alternatives, from which to choose, we call $\mathcal{L}(r)$ backward-ambiguous. At this point, the decoder would have to employ a disambiguator to decide upon the sense the given replacement was originally chosen from. If it chooses $s_0$ from $\mathcal{L}(r)$, we can interpret $r$ as a replacement from $s_0$, therefore correctly decoding the data again. However, we have to be aware of the fact that the disambiguator might just as well choose a different sense, a problem we will deal with in the next section.

8.2 Two Types of Replacements and Three Types of Words

Types of Replacements and Words

The problem of forward-ambiguity is security relevant, since an incorrect identification of the replacements will produce unnatural text. Backwards-ambiguity is relevant for the decoder, since an incorrect identification of the synset the replacements were originally chosen from will result in incorrectly decoding the data coded by the replacement.

If we employ an automated scheme, we cannot do much against the consequences of forward-ambiguity but to use a highly precise word-sense-disambiguator. However, we could get better results if we let a human judge, whether the disambiguator's decision was correct or not. As long as the human judges on the transmitting and receiving ends agree, this will not affect the performance of the code. The first drawback of this scheme is that this might not always be the case, and the second one is that it could take many such decisions, discouraging the use of practical systems implementing such a scheme, because of the user's additional effort.

In the presence of backwards-ambiguity, it is crucial not to see a word-sense-disambiguator as a black box where we put in a word, and get out an identification of the set containing the universally correct replacements. Given the context of a lexeme $\mathcal{C}(o)$ and the lexical evidence about this lexeme $\mathcal{L}(o)$, the disambiguator simply estimates which $l \in \mathcal{L}(o)$ best fits the context $\mathcal{C}(o)$. If we replace $o$ by $r$, we do not change the context, so $\mathcal{C}(o)$ will always equal $\mathcal{C}(r)$, but we have to be aware that $\mathcal{L}(o)$ is not necessarily equal to $\mathcal{L}(r)$, and we have to think of the consequences.

For example, the disambiguator employed in the encoder deciding upon the correct synset to replace move from, might choose the synset also containing the word run, because both motion and strike are very unlikely to appear in the context. The problem is that, if we substitute run for move, we change the lexical evidence. If the decoder would now blindly use a disambiguator to pick the most probable synset to replace run from, then this synset might well be the one containing test, instead of the one containing move, because the context might happen to give such evidence.

We can resolve the problem of backward-ambiguity, by letting an encoder analyze the lexicon, and decide in advance whether a word should be chosen for coding-purposes or not. It needs to decide for each possible $r$ which could be replaced for $o$, whether or not this replacement would lead to backwards-ambiguity. Looking only at the lexicon, this could potentially be the case whenever $\mathcal{L}(r) \neq \mathcal{L}(o)$, which is the reason why Winstein and Chapman required their synsets to be disjunct. However, if we bring a sense-disambiguator into the picture, a replacement of $r$ for $o$ involves problematic ambiguity only if the disambiguators resolving the forward- and backward- ambiguity disagree about the word-sense. Formally,

\begin{displaymath}
s_f = \textrm{dis}(\mathcal{L}(o),\mathcal{C}(o))
~\wedge~ ...
...trm{dis}(\mathcal{L}(r),\mathcal{C}(r))
~\wedge~ s_f \neq s_b.
\end{displaymath} (8.1)

It might be desirable to avoid using such replacements, so we can be sure the decoder will be able to pick the right synset. However, if a human would be able to pick the correct synset and only a computer would pick the wrong one, then we might even want to provoke this situation, because it is an HIP, giving an arbitrator a hard time trying to automatically analyze the text.

Let $\textrm{rep}(o)$ denote the replacements for a word $o$. It can easily be determined from the synset

\begin{displaymath}
\textrm{rep}(o) = \textrm{dis}(\mathcal{L}(o),\mathcal{C}(o)).
\end{displaymath}

We can now distinguish between

They are given by

\begin{displaymath}
r \in \textrm{rep}(o) \Rightarrow
r \in \left\{
\begin{array...
...\\
\textrm{rep}_B(o), & \mbox{otherwise}.
\end{array} \right.
\end{displaymath}

Building upon this classification of replacements we can distinguish

8.3 Variants of Replacement-Coding

We can basically think of three different scenarios for using these replacements in a coding strategy:

8.3.0.0.1 Coding for fully automated extraction

restricts us to using only type-A-words for coding purposes. Here we can be sure that it will be possible to extract all of the data again, automatically, both for the receiver and for the arbitrator. Since the type-A-replacements have the property that $\textrm{rep}(o)=\textrm{rep}(a)$ it will still be possible to automatically identify the synset that was originally used for coding the secret, after carrying out replacements.

8.3.0.0.2 Coding for extraction by humans

restricts us to using only type-B-words for coding purposes. Here we can be sure that no data can be extracted without human intervention, neither by the receiver, nor by the arbitrator. Since type-B-replacements have the property that $\textrm{rep}(o) \neq \textrm{rep}(b)$ it will not even be possible to automatically identify the words that were used for coding the secret any more. The user would have to manually disambiguate all words, so the decoder knows which words are relevant. It can do so by checking the judgement of the user, against the automatic judgement of the word-sense-disambiguator. All words, where the two disagree are then known to hold the secret.

8.3.0.0.3 Coding for maximal capacity

This is a variant of the above scheme, in which we use all words. Here the user will have to manually disambiguate all words, but an arbitrator will be able to recover parts of the secret automatically. However, since some words will resolve to the wrong synsets, and some will not, the arbitrator will not be able to distinguish between correctly and incorrectly decoded data. It will therefore be possible for the arbitrator to extract parts of the secret, but it will not easily be possible to identify it, in terms of distinguishing the correctly decoded type-A-words that hold data from the secret from the noise due to the type-B- and type-C-words that are incorrectly decoded.

8.3.0.0.4 Distinguishing two codings in one document

The above scheme has the severe disadvantage that we have no control over the parts of the secret that can be extracted automatically and those that cannot. It might therefore be desirable to distinguish between two data-blocks coded into one text-document. One could be made up of the type-A- and type-C-words to encode data we can allow to be extracted automatically (a cryptogram, for instance), and we can use type-B-words to encode data we cannot allow to be extracted automatically (for example, parts of a key necessary to decrypt the cryptogram in the other part of the document).


next up previous contents
Next: 9. Conclusions Up: Towards Linguistic Steganography: A Previous: 7. Towards Secure and   Contents
Richard Bergmair 2005-01-31