next up previous contents
Next: 4. Approaches to Linguistic Up: Towards Linguistic Steganography: A Previous: 2. Steganographic Security   Contents

Subsections

3. Lexical Language Processing

In the previous chapter we discussed what steganography is all about. Since we want to put a strong emphasis on lexical steganography, we will dedicate this chapter to lexical language processing. Especially the problem of sense-ambiguity is highly relevant, not only because it enables linguistic HIPs, which were briefly presented in the previous section. As we will see later on in this work, enabling stegosystems to mimic these peculiarities of natural language can be highly security-relevant as well.

The problem of word-sense ambiguity can be traced back to the question, ``What is the meaning of a word?''. It opens up a philosophical spectrum of thought:

Figure: Ambiguity in the matrix-representation.
\begin{figure}\begin{center}
\par
\subfigure[the lexical matrix]{
\begin{tabular...
... & 0 & 0 & 1 & 1 \\
$\ldots$\ \\
\end{tabular}}
\par\end{center}\end{figure}

Figure: Ambiguity illustrated by VENN-diagrams.
[lexical semantics]\includegraphics[scale=.9]{img/ambiguity3b.eps} [``contextual'' semantics]\includegraphics[scale=.9]{img/ambiguity3a.eps}

3.1 Ambiguity of Words

The creators of WordNet, perhaps the most prominent lexical resource in Computational Linguistics, define the notion of synonymy as follows:

``According to one definition (usually attributed to Leibniz) two expressions are synonymous if the substitution of one for the other never changes the truth value of a sentence in which the substitution is made. By that definition, true synonyms are rare, if they exist at all. A weakened version of this definition would make synonymy relative to a context: two expressions are synonymous in a linguistic context $\mathcal{C}$ if the substitution of one for the other in $\mathcal{C}$ does not alter the truth value.'' (, )
This definition clearly follows the lexical idea, and it is called a differential theory of semantics, because meaning is not represented beyond the property of different symbols to be distinguishable. For example, move, in a sense where it can be replaced by run or go, has a different meaning than move, in a sense where it can be replaced by impress or strike. If we wanted our dictionary to model semantics explicitly, we would have to formulate statements like ``use move interchangeably with run, if you want to express that something changes its position in space'' or ``use move interchangeably with impress or strike if you want to express that something has an emotional impact on you''. However, in differential approaches to semantics, we model meaning only implicitly, because we cannot formalize the ``if you want to express that...''-part of the above phrases. All we can do is to formulate statements of the form ``there exists one sense for move, in which it can be interchanged by run or go'' and ``there exists another sense for move, in which it can be interchanged by impress or strike''.

In this framework, word-meanings $s_1,s_2,\ldots$ emerge from recording words and their semantic equivalence. In a lexicon, we represent word-forms explicitly. Such explicit representations of word-forms are called lemmatalemma. For machine-readable lexica, they are most commonly ASCII-strings of a word's written form. Meanings of words are only represented implicitly, by organizing words into semantic equivalence classes, where ``semantic equivalence'' is relative to linguistic context.

() used the lexical matrix to demonstrate this relation between word-forms and their senses. Figure [*] represents this relation, considering the words from our example. If we wanted to analyze the meaning of a word, say run, we would have to look up its meaning. In this case, we would get multiple senses $s_3$,$s_4$, and $s_5$. This ambiguity is called polysemy. Inversely, if we want to express a meaning by a word, we would have to look up all the word forms that express, for example, meaning $s_2$. Here we would get multiple word-forms: move, motion and movement. This ambiguity is called synonymy.

3.2 Ambiguity of Context

We can think of context as another view of differential semantics. Let's rephrase Miller's statement, for that purpose, in order to highlight an interesting isomorphism:

According to one definition two expressions are synonymous if the substitution of one for the other never changes the truth value of the expression that is substituted. By that definition, true synonyms are rare, if they exist at all. A weakened version of this definition would make synonymy relative to a variable: two expressions are synonymous for a linguistic variable $\mathcal{L}$ if the substitution of one for the other does not alter the truth value contributed by $\mathcal{L}$.

Informally, if we have a lexicon but no text, we know everything about the words, but nothing about their usage. The ambiguity that arises about the meaning of a word needs to be resolved by knowledge inherent to linguistic context. Analogously, if we have a text but no lexicon, we know everything about how the words are used, but nothing about the words themselves. The ambiguity that arises about the meaning of a text needs to be resolved by knowledge from a linguistic variable.

We can think about a linguistic variable as a ``gap'' in a text written as ``...''. For example, if we see

\begin{displaymath}
\w{My favourite color is \ldots}
\end{displaymath}

we know that ``...'' must be one of red, green, blue, etc. If, for any reason, the interpreter of the sentence knows that the speaker does not like the color green, then the choice is even narrower.

Conversely, we can think about linguistic context as the meaning of ``...''. For example, if we see

\begin{displaymath}
\w{\ldots green \ldots}
\end{displaymath}

We know that ``...'' must be one of Grass is ..., I bought ...paint, etc.

Formally, we can think of contexts $\mathcal{C}_1,\mathcal{C}_2,\ldots,\mathcal{C}_n$, arranged in a matrix, much like the lexical matrix. Figures [*] and [*] show the idea of ``contextual semantics'' in analogy to lexical semantics.

In the lexical case, we explicitly expressed words, and senses emerged from the different configurations of these words appearing interchangeably in any context. In the contextual case, we explicitly express contexts, and senses emerge from the different configurations of them appearing with any word. The example in Figure [*] confronts us with the problem that both red and white are national colors of Austria, and we do not know anything about ``my favourite color'', except that it must be a color. These are contexts that could equally fit for red and white. If we have a third contextual clue, like blood is ..., there is only one word left to fill the gap, which is red.

3.3 A Common Approach to Disambiguation

A Common Approach

In the previous section, we examined the notion of meaning established by differential approaches to semantics, either based on words or contexts. For our purposes, it will suffice to view sense-ambiguity as the phenomenon of the lexical formalization underspecifying the meaning of a word found in a text, so that additional contextual clues are needed. For example, from a lexical point of view, we would have to expect that a lemma represents a meaning. However this is not the case with bank, since bank has a different meaning in The east river ...was flooded as in This ...has the best interest rates.

Since the notion of context turns out to be rather hard to put in formal terms, as opposed to words which can be represented by a written form, the first step in the analysis of a piece of text is to resolve a word by the lexicon. Since move is underspecified by a lexicon, sense-ambiguity arises; if we want to substitute move by a synonym, we do not know whether to replace it by movement or by impress, without changing the overall meaning. Therefore, we have to carry out a second step in the analysis, which is to disambiguate these competing word-senses. This process is what is usually abbreviated WSD (short for Word-Sense Disambiguation). Such disambiguation would have to be based on contextual evidence. The advantage of first letting ambiguity arise in the lexical analysis, and then bringing context into the picture by a selection-process has the advantage that such a heuristic selection can usually be carried out, even if we have only a ``rough idea'' of the context like a probabilistic formalization based on a few simple assumptions.

Usually the context of a word $w$ is formalized by a window of $\pm n$ words around it. For a window of $\pm 3$ words, for example, we would pick out $7$ consecutive words, as they appear in the text, and denote then as a vector that contains the $3$ words immediately to the left of the word of interest, the word itself, and the $3$ words immediately to the right (although the word itself is, of course, not significant evidence for disambiguating its word-sense).

We denote a context with:

\begin{displaymath}
\mathcal{C}(w) = \langle w_{-3}, w_{-2}, w_{-1}, w_0, w_1, w_2, w_3 \rangle,
\end{displaymath}

where $w_0=w$. Words that are insignificant for sense-disambiguation, like function-words and prepositions, are usually filtered out. For example, in the sentence

\begin{displaymath}
\w{Uncle Steve turned out to be a brilliant player of the electric guitar.}
\end{displaymath}

a window of $\pm 2$ words would formally be

\begin{displaymath}
\mathcal{C}(\textrm{brilliant}) = \langle \w{Steve}, \w{turned}, \textrm{brilliant}, \w{player}, \w{electric} \rangle.
\end{displaymath}

If $\mathcal{L}(w)$ is the set of all possible senses of a word $w$ we can derive from the lexicon, then we can consider a sense $s \in \mathcal{L}(w)$ as a correct interpretation of the word, if it maximizes the conditional probability of appearing in context $\mathcal{C}(w)$,

\begin{displaymath}
\max_{s \in \mathcal{L}(w)} P(s \vert \mathcal{C}(w)).
\end{displaymath} (3.1)

We could collect statistics for the probability $P(\mathcal{C}(x))$ by analyzing a corpus (a statistically representative collection of natural language texts). The simplest approach would be to sense-tag it by hand, i.e. to assign the correct lexical sense $s \in \mathcal{L}(w)$ to each word $w$, and count how often a particular sense appears in this context, therefore providing statistics for the probability $P(\mathcal{C}(w) \vert s)$, which we can always rewrite in the usual Bayesean manner as

\begin{displaymath}
P(s \vert \mathcal{C}(w)) = \frac{P(s) P(\mathcal{C}(w) \vert s)}{P(\mathcal{C}(w))}.
\end{displaymath}

This is why the method is called a Bayes classifier.

The first problem this approach suffers from is that corpora must be sense-tagged for the specific lexicon that is to be used, which is a tedious and costly task.

The second problem is that of sparse data. Although there are large corpora available (for example the British National Corpus, contains over 100 Million untagged words), even the largest ones would not suffice to collect significant statistics for larger windows. This is why we collect the statistics of a specific word $w$ appearing anywhere in the context of a sense $s$, written $P(w \vert s)$, from the corpus and estimate the probability of the complete window by assuming the words are independent. This leads to

\begin{displaymath}
P(\mathcal{C}(x) \vert s) = \prod_{j=-n}^{n} P( w_n \vert s ).
\end{displaymath}

Although this approach is successfully applied in part-of-speech tagging (an experimental setup that is very similar to word-sense-disambiguation, in that it assigns ambiguous semantic tokens to words) and word-sense-disambiguation, the assumption of the words in a context being independent of each other is somewhere between linguistically questionable and self-contradictory. (Wasn't the assumption of a functional dependency between subsequent words the very argument we based the idea of sense-disambiguation by context on?) This is why the method is called the naive Bayes classifier.

Using a naive Bayes classifier, we can rewrite Equation [*] as

\begin{displaymath}
\max_{s \in \mathcal{L}(w)} P(s) \prod_{j=-n}^{n} P( w_n \vert s ),
\end{displaymath} (3.2)

leaving out the division by $P(\mathcal{C}(w))$, since it is constant for all senses.

3.4 The State of the Art in Disambiguation

The State of the Art

Of course, the naive Bayes classifier is not the only way to go about WSD. There have been many approaches to formalizing context, which can be roughly divided into approaches based on co-occurrence and approaches based on collocation. The former observe which words occur together with a particular word-sense, at any position in a word's context. Decision-lists are suitable data-structures, simply enlisting, for each word-sense, the words commonly observed in a sense's surrounding. The latter concentrates on observing words at specific positions in the text surrounding a word, for example, collecting statistics about certain features of these words to point out the correct word-sense. Of course many hybrid approaches can be thought of, combining co-occurence and collocation-features. More accurate formalizations of context could result, for example, from shallow-parsing a document, so a disambiguator could concentrate on relationships like verb-object, verb-subject, head-modifier, etc.

Once a probabilistic model and its computational framework is set up, different algorithms for statistical natural language learning can be used to train the model. Generally we can distinguish

Progress in this evolving field has been measured, amongst others, in the SENSEVAL initiative, a large-scale attempt to evaluate WSD systems in a competitive way. A Gold standard corpus was compiled, by having two human annotators tag a sample of text. A basic requirement was that it should be replicable, so human annotators would have to agree at least 90% of the time. This corpus consists of a trial-, a training-, and a testing-set. In SENSEVAL-2, participating teams had 21 days to work with the training data and 7 days with the test data before submitting their systems' results to a central website for automatic scoring.

Three criteria were evaluated: Recall is the percentage of correctly tagged words in the complete test set. This measure is a good estimator for the overall system-performance since it measures how many correct answers were given overall. Precision is the percentage of correct answers in the set of instances that were answered. This measure favors systems that ``know their limits'', i.e. ones that are very accurate, even though they might be limited to solving only a small subset. Coverage is the percentage of instances that were answered. These measures were compared against the baseline of always choosing the most frequent sense appearing in the corpus.

A highly precise WSD system will enable very secure systems for lexical steganography, since it does not leave suspicious patterns in the steganograms. As far as capacity is concerned, there is a tradeoff between precision and coverage. On the one hand, systems with high coverage will identify more possibilities of word-substitutions, therefore providing more information-carrying elements, resulting in higher capacities for coding raw data. However, lower precision will result in higher probabilities of incorrectly decoding the information which has to be compensated for by error-correction. Since the redundancy which needs to be introduced by error-correction raises exponentially with the error-probability, one can say that, usually, precision is a more important criterion for lexical steganography than coverage.

Figure [*] shows the results of SENSEVAL-2, for the English lexical sample, sorted by precision. The performance of the ``BCU - ehu-dlist-best'' system (, ) was particularly impressive. It is based on a decision list that only uses features above a certainty-threshold of 85%, using 10-fold cross-validation. Unsupervised methods perform below the most-frequent-sense baseline. However, this comparison is not quite fair, since the most-frequent-sense heuristic is, of course, based on a hand-tagged corpus, whereas unsupervised WSD systems do not use any hand-tagged data.

() cites personal communication with George Miller, reporting an upper bound for human performance in sense-disambiguation of around 90% for ambiguous cases, as opposed to the level of recall for automatic systems of up to 64%, as evaluated in SENSEVAL-2. Clearly, there is room for improvement here, but research into WSD is still under way, motivated by applications in natural language understanding, machine translation, information retrieval, spell-checking, and many other fields of Natural Language Processing. The results of SENSEVAL-3 will be presented in July 2004.

Figure: Results of SENSEVAL-2: ``English Lexical Sample - Fine-grained Scoring'' (, ). Only the top five were given here.
\begin{figure}\par
\begin{center}
\par
\subfigure[unsupervised]{
\begin{tabular}...
...& Grouping Commonest \\
\end{tabular}}
\par\par
\end{center}\par
\end{figure}

3.5 Semantic Relations in the Lexicon

Semantic Relations

Figure: VENN-diagram for the levels of abstraction for guitar.
\includegraphics[scale=.9]{img/ambiguity-vert.eps}

Figure: A sample of WordNet's hyponymy-structure.
\includegraphics[scale=.7]{img/guitar-inheritance.eps}

Generally one can say $x$ is a hyponym of $y$ if a native speaker would accept sentences of the form ``$x$ is a kind of $y$''. The inverse of hyponymy is hypernymy, so if $x$ is a hyponym of $y$, then $y$ is a hypernym of $x$. Hyponymy is basically an inclusion-relation, adding a dimension of abstraction for words.

The idea of inclusion in the space of word-senses is depicted in Figure [*]. In many linguistic systems this inclusion is modelled as an inheritance system, so if $x$ is a kind of $y$, then $x$ is viewed to have all properties of $y$, and is only modified by additional ones. Lexical inheritance can be found in the glossaries of most conventional dictionaries. If we looked up the word guitar in a dictionary, it would give us a glossary like ``a stringed instrument that is small, light, made of wood, and has six strings usually plucked by hand or a pick''. Now what is a stringed instrument? If we looked up that word in the dictionary, we would get something like ``a musical instrument producing sound through vibrating strings''. What does that tell us about guitars? Obviously, that a guitar is ``a musical instrument producing sound through vibrating strings, that is small, light, made of wood, and has six strings usually plucked by hand or a pick''. Thereby we have resolved one level of lexical inheritance, and could recursively apply this, looking up instrument, and so on.

Note that hyponymy and hypernymy are semantic relations. As opposed to synonymy and polysemy, which relate words, hyponymy and hypernymy relate specific senses of words. For example, for one sense,

\begin{displaymath}
\lbrace \w{bank}, \w{banking company}, \w{financial institution} \rbrace
~\textit{IsA}~
\lbrace \w{institution} \rbrace
\end{displaymath}

but for another sense,

\begin{displaymath}
\lbrace \w{bank} \rbrace
~\textit{IsA}~
\lbrace \w{geological formation}, \w{formation} \rbrace.
\end{displaymath}

() sees synonymy and polysemy, as a horizontal kind of ambiguity and hyponymy and hypernymy as a vertical kind. This idea gets visible in Figure [*]. Analogous to synonymy, which confronts us with the problem of choosing the correct word to express something, hyponymy confronts us with the problem of choosing the correct level of abstraction, which might be viewed as another kind of interchangeability. In many sentences it would be possible to substitute guitar for electric guitar, based on the fact that an electric guitar is just a special kind of guitar. For example, instead of Yesterday I had my electric guitar repaired, one could say Yesterday I had my guitar repaired.

This idea of inheritance is crucial to how hyponymy establishes substitutability. While Yesterday I had my instrument repaired would probably still be accepted by a native-speaker, Yesterday I had my entity repaired would already sound quite peculiar. This could be viewed as a result of the fact that the speaker of Yesterday I had my guitar repaired, is using guitar, to refer to an object which has certain properties, for example that it is a physical object which can easily break, and needs repair. Since entity has not yet inherited these properties from its hypernyms in the lexicon, the word does not fit in the context.

3.6 Semantic Distance in the Lexicon

Semantic Distance

Many measures have been proposed that try to capture a degree of semantic similarity of two words in a lexicon. These measures are particularly useful in lexical steganography, since they use the knowledge from a lexicon for a model capturing the substitutability of words, which is the central issue in lexical steganography. In particular, we will introduce measures that rely on WordNet's hyponymy graph, idealized as a tree.3.1

() rely on a logarithmic measure of the length $\textrm{len}(s_1,s_2)$ of the shortest path between two word-meanings $s_1$ and $s_2$. They scale it by the depth $D$ of the whole tree.

\begin{displaymath}
\textrm{sim}_{LC}(s_1,s_2) = - \log ( \frac{\textrm{len}(s_1,s_2)}{2D} ).
\end{displaymath} (3.3)

The measure of () is based on the lowest super-ordinate $\textrm{lso}(s_1,s_2)$, also known as most specific common subsumer. It is the root of the smallest subtree containing both $s_1$ and $s_2$. () points out that, if lexica vary in the depths of the ``hyponymy-tree'' in different parts of the taxonomy, this severely limits the performance of approaches based on path length, so he uses the probability of the LSO to occur in a corpus instead, as the basis for the information-theoretic measure,

\begin{displaymath}
\textrm{sim}_R(s_1,s_2)= - \log (P(\textrm{lso}(s_1,s_2))).
\end{displaymath} (3.4)

Note that he collects the statistics in such a way that $P(\textit{super}) \geq P(\textit{sub})$, if $\textit{sub}~ \textit{IsA} ~ \textit{super}$, so the probability-spaces themselves reflect the inclusion-properties of hyponymy-relations. (see , )

() compared the most important similarity-measures based on WordNet for their overall accuracy. They examined the agreement of the degree of relatedness predicted by these measurements with data from a study by () asking human subjects to rate the degree of semantic relatedness. Furthermore they investigated the performance of these measures in a system for malapropism-detection, an experimental setup that widely parallels the application in lexical steganography. According to their observations, the most accurate similarity-measure was that of (),

\begin{displaymath}
\textrm{dist}_{JC}(s_1,s_2)=
2 \log ( P( \textrm{lso}(s_1,s_2) ) ) -
\big( \log ( P( s_1 ) ) + \log ( P( s_2 ) ) \big).
\end{displaymath} (3.5)

This measure has, from an information-theoretic point of view, an intuitive appeal, if we bear in mind the idea of lexical inheritance. $\log ( P( \textrm{lso}(s_1,s_2) ) )$ is the information both senses $s_1$ and $s_2$ share, since it contains features that are inherited down to both $s_1$ and $s_2$, which is also the idea behind the measure of (). However, since this measure is supposed to be a distance, rather than a degree of similarity, the expression has a positive sign. This amount of information is then reduced by the information that distinguishes the senses, the features that are specific to the words, as captured by $\log ( P( s_1 ) )$, respectively $\log ( P( s_2 ) )$.


next up previous contents
Next: 4. Approaches to Linguistic Up: Towards Linguistic Steganography: A Previous: 2. Steganographic Security   Contents
Richard Bergmair 2005-01-31