In the previous chapter we discussed what steganography is all about. Since we want to put a strong emphasis on lexical steganography, we will dedicate this chapter to lexical language processing. Especially the problem of sense-ambiguity is highly relevant, not only because it enables linguistic HIPs, which were briefly presented in the previous section. As we will see later on in this work, enabling stegosystems to mimic these peculiarities of natural language can be highly security-relevant as well.
The problem of word-sense ambiguity can be traced back to the question, ``What is the meaning of a word?''. It opens up a philosophical spectrum of thought:
The creators of WordNet, perhaps the most prominent lexical resource in Computational Linguistics, define the notion of synonymy as follows:
``According to one definition (usually attributed to Leibniz) two expressions are synonymous if the substitution of one for the other never changes the truth value of a sentence in which the substitution is made. By that definition, true synonyms are rare, if they exist at all. A weakened version of this definition would make synonymy relative to a context: two expressions are synonymous in a linguistic contextThis definition clearly follows the lexical idea, and it is called a differential theory of semantics, because meaning is not represented beyond the property of different symbols to be distinguishable. For example, move, in a sense where it can be replaced by run or go, has a different meaning than move, in a sense where it can be replaced by impress or strike. If we wanted our dictionary to model semantics explicitly, we would have to formulate statements like ``use move interchangeably with run, if you want to express that something changes its position in space'' or ``use move interchangeably with impress or strike if you want to express that something has an emotional impact on you''. However, in differential approaches to semantics, we model meaning only implicitly, because we cannot formalize the ``if you want to express that...''-part of the above phrases. All we can do is to formulate statements of the form ``there exists one sense for move, in which it can be interchanged by run or go'' and ``there exists another sense for move, in which it can be interchanged by impress or strike''.if the substitution of one for the other in
does not alter the truth value.'' (, )
In this framework, word-meanings
emerge from recording
words and their semantic equivalence. In a lexicon, we represent
word-forms explicitly. Such explicit representations of word-forms
are called lemmatalemma. For machine-readable lexica, they are
most commonly ASCII-strings of a word's written form. Meanings of words are
only represented implicitly, by organizing words into semantic equivalence
classes, where ``semantic equivalence'' is relative to linguistic context.
() used the lexical matrix to demonstrate
this relation between word-forms and their senses. Figure
represents this relation, considering the words from our example.
If we wanted to analyze
the meaning of a word, say run, we would have to look up its meaning.
In this case, we would get multiple senses
,
, and
.
This ambiguity is called polysemy. Inversely, if we want to express a
meaning by a word, we would have to look up all the word forms that express,
for example, meaning
. Here we would get multiple word-forms: move,
motion and movement. This ambiguity is called synonymy.
We can think of context as another view of differential semantics. Let's rephrase Miller's statement, for that purpose, in order to highlight an interesting isomorphism:
According to one definition two expressions are synonymous if the substitution of one for the other never changes the truth value of the expression that is substituted. By that definition, true synonyms are rare, if they exist at all. A weakened version of this definition would make synonymy relative to a variable: two expressions are synonymous for a linguistic variableif the substitution of one for the other does not alter the truth value contributed by
.
Informally, if we have a lexicon but no text, we know everything about the words, but nothing about their usage. The ambiguity that arises about the meaning of a word needs to be resolved by knowledge inherent to linguistic context. Analogously, if we have a text but no lexicon, we know everything about how the words are used, but nothing about the words themselves. The ambiguity that arises about the meaning of a text needs to be resolved by knowledge from a linguistic variable.
We can think about a linguistic variable as a ``gap'' in a text written
as ``...''. For example, if we see
Conversely, we can think about linguistic context as the meaning
of ``...''. For example, if we see
Formally, we can think of
contexts
,
arranged in a matrix, much like the lexical matrix.
Figures
and
show the idea
of ``contextual semantics'' in analogy to lexical semantics.
In the lexical case, we explicitly expressed words, and senses emerged from
the different configurations of these words appearing interchangeably
in any context.
In the contextual case, we explicitly express contexts, and senses
emerge from the different configurations of them appearing with any word.
The example in Figure
confronts us with the problem that
both red and white are national colors of Austria, and we
do not know anything about ``my favourite color'', except that it must be
a color. These are contexts that could equally fit for red and
white. If we have a third contextual clue, like blood is ...,
there is only one word left to fill the gap, which is red.
In the previous section, we examined the notion of meaning established by differential approaches to semantics, either based on words or contexts. For our purposes, it will suffice to view sense-ambiguity as the phenomenon of the lexical formalization underspecifying the meaning of a word found in a text, so that additional contextual clues are needed. For example, from a lexical point of view, we would have to expect that a lemma represents a meaning. However this is not the case with bank, since bank has a different meaning in The east river ...was flooded as in This ...has the best interest rates.
Since the notion of context turns out to be rather hard to put in formal terms, as opposed to words which can be represented by a written form, the first step in the analysis of a piece of text is to resolve a word by the lexicon. Since move is underspecified by a lexicon, sense-ambiguity arises; if we want to substitute move by a synonym, we do not know whether to replace it by movement or by impress, without changing the overall meaning. Therefore, we have to carry out a second step in the analysis, which is to disambiguate these competing word-senses. This process is what is usually abbreviated WSD (short for Word-Sense Disambiguation). Such disambiguation would have to be based on contextual evidence. The advantage of first letting ambiguity arise in the lexical analysis, and then bringing context into the picture by a selection-process has the advantage that such a heuristic selection can usually be carried out, even if we have only a ``rough idea'' of the context like a probabilistic formalization based on a few simple assumptions.
Usually the context of a word
is formalized by a window of
words around it. For a window of
words, for example, we
would pick out
consecutive words, as they appear in the text,
and denote then as a vector that contains the
words immediately to the left
of the word of interest, the word itself, and the
words immediately
to the right (although the word itself is, of course, not significant
evidence for disambiguating its word-sense).
We denote a context with:
If
is the set of all possible senses of a word
we can derive
from the lexicon, then we can consider a sense
as a correct interpretation of the word, if it maximizes
the conditional probability of appearing in context
,
We could collect statistics for the probability
by analyzing a corpus (a statistically representative collection
of natural language texts). The simplest approach would be to
sense-tag it by hand, i.e. to assign the correct lexical sense
to each word
, and count how often a
particular sense appears in this context, therefore providing
statistics for the probability
, which
we can always rewrite in the usual Bayesean manner as
The first problem this approach suffers from is that corpora must be sense-tagged for the specific lexicon that is to be used, which is a tedious and costly task.
The second problem is that of sparse data.
Although there are large corpora available (for example the
British National Corpus, contains over 100 Million untagged words),
even the largest ones would not suffice to collect significant statistics
for larger windows. This is why we collect the statistics of
a specific word
appearing anywhere in the context of a sense
,
written
, from the corpus and estimate the probability of the
complete window by assuming the words are independent. This leads to
Although this approach is successfully applied in part-of-speech tagging (an experimental setup that is very similar to word-sense-disambiguation, in that it assigns ambiguous semantic tokens to words) and word-sense-disambiguation, the assumption of the words in a context being independent of each other is somewhere between linguistically questionable and self-contradictory. (Wasn't the assumption of a functional dependency between subsequent words the very argument we based the idea of sense-disambiguation by context on?) This is why the method is called the naive Bayes classifier.
Using a naive Bayes classifier, we can rewrite Equation
as
Of course, the naive Bayes classifier is not the only way to go about WSD. There have been many approaches to formalizing context, which can be roughly divided into approaches based on co-occurrence and approaches based on collocation. The former observe which words occur together with a particular word-sense, at any position in a word's context. Decision-lists are suitable data-structures, simply enlisting, for each word-sense, the words commonly observed in a sense's surrounding. The latter concentrates on observing words at specific positions in the text surrounding a word, for example, collecting statistics about certain features of these words to point out the correct word-sense. Of course many hybrid approaches can be thought of, combining co-occurence and collocation-features. More accurate formalizations of context could result, for example, from shallow-parsing a document, so a disambiguator could concentrate on relationships like verb-object, verb-subject, head-modifier, etc.
Once a probabilistic model and its computational framework is set up, different algorithms for statistical natural language learning can be used to train the model. Generally we can distinguish
Progress in this evolving field has been measured, amongst others, in the SENSEVAL initiative, a large-scale attempt to evaluate WSD systems in a competitive way. A Gold standard corpus was compiled, by having two human annotators tag a sample of text. A basic requirement was that it should be replicable, so human annotators would have to agree at least 90% of the time. This corpus consists of a trial-, a training-, and a testing-set. In SENSEVAL-2, participating teams had 21 days to work with the training data and 7 days with the test data before submitting their systems' results to a central website for automatic scoring.
Three criteria were evaluated: Recall is the percentage of correctly tagged words in the complete test set. This measure is a good estimator for the overall system-performance since it measures how many correct answers were given overall. Precision is the percentage of correct answers in the set of instances that were answered. This measure favors systems that ``know their limits'', i.e. ones that are very accurate, even though they might be limited to solving only a small subset. Coverage is the percentage of instances that were answered. These measures were compared against the baseline of always choosing the most frequent sense appearing in the corpus.
A highly precise WSD system will enable very secure systems for lexical steganography, since it does not leave suspicious patterns in the steganograms. As far as capacity is concerned, there is a tradeoff between precision and coverage. On the one hand, systems with high coverage will identify more possibilities of word-substitutions, therefore providing more information-carrying elements, resulting in higher capacities for coding raw data. However, lower precision will result in higher probabilities of incorrectly decoding the information which has to be compensated for by error-correction. Since the redundancy which needs to be introduced by error-correction raises exponentially with the error-probability, one can say that, usually, precision is a more important criterion for lexical steganography than coverage.
Figure
shows the results of SENSEVAL-2,
for the English lexical sample, sorted by precision. The performance
of the ``BCU - ehu-dlist-best'' system (, ) was particularly
impressive. It is based on a decision list that only uses features
above a certainty-threshold of 85%, using 10-fold cross-validation.
Unsupervised methods perform below the most-frequent-sense baseline.
However, this comparison is not quite fair, since the most-frequent-sense
heuristic is, of course, based on a hand-tagged corpus, whereas
unsupervised WSD systems do not use any hand-tagged data.
() cites personal communication with George Miller, reporting an upper bound for human performance in sense-disambiguation of around 90% for ambiguous cases, as opposed to the level of recall for automatic systems of up to 64%, as evaluated in SENSEVAL-2. Clearly, there is room for improvement here, but research into WSD is still under way, motivated by applications in natural language understanding, machine translation, information retrieval, spell-checking, and many other fields of Natural Language Processing. The results of SENSEVAL-3 will be presented in July 2004.
![]() |
Generally one can say
is a hyponym of
if a native speaker
would accept sentences of the form ``
is a kind of
''.
The inverse of hyponymy is hypernymy, so if
is a hyponym of
,
then
is a hypernym of
. Hyponymy
is basically an inclusion-relation, adding a dimension of abstraction for
words.
The idea of inclusion in the space of word-senses is depicted in
Figure
. In many linguistic systems this inclusion is modelled
as an inheritance system, so if
is a kind of
, then
is viewed to have
all properties of
, and is only modified by additional ones.
Lexical inheritance can be found in the glossaries of most
conventional dictionaries. If we looked up the word
guitar in a dictionary, it would give us a glossary like ``a stringed instrument
that is small, light, made of wood, and has six strings usually plucked by
hand or a pick''. Now what is a stringed instrument? If we looked up that
word in the dictionary, we would get something like ``a musical instrument
producing sound through vibrating strings''. What does that tell us about
guitars? Obviously, that a guitar is ``a musical instrument producing
sound through vibrating strings, that is small, light, made of wood, and has
six strings usually plucked by hand or a pick''. Thereby we have resolved one
level of lexical inheritance, and could recursively apply this, looking up
instrument, and so on.
Note that hyponymy and hypernymy are semantic relations. As opposed to
synonymy and polysemy, which relate words, hyponymy and hypernymy relate
specific senses of words. For example, for one sense,
() sees synonymy and polysemy, as a horizontal kind of ambiguity
and hyponymy and hypernymy as a vertical
kind. This idea gets visible in Figure
.
Analogous to synonymy, which confronts us with the problem of choosing the
correct word to express something, hyponymy confronts us with the problem of
choosing the correct level of abstraction, which
might be viewed as another kind of interchangeability. In many sentences it would
be possible to substitute guitar for electric guitar, based on the
fact that an electric guitar is just a special kind of guitar. For example,
instead of Yesterday I had my electric guitar repaired, one could say
Yesterday I had my guitar repaired.
This idea of inheritance is crucial to how hyponymy establishes substitutability. While Yesterday I had my instrument repaired would probably still be accepted by a native-speaker, Yesterday I had my entity repaired would already sound quite peculiar. This could be viewed as a result of the fact that the speaker of Yesterday I had my guitar repaired, is using guitar, to refer to an object which has certain properties, for example that it is a physical object which can easily break, and needs repair. Since entity has not yet inherited these properties from its hypernyms in the lexicon, the word does not fit in the context.
Many measures have been proposed that try to capture a degree of semantic similarity of two words in a lexicon. These measures are particularly useful in lexical steganography, since they use the knowledge from a lexicon for a model capturing the substitutability of words, which is the central issue in lexical steganography. In particular, we will introduce measures that rely on WordNet's hyponymy graph, idealized as a tree.3.1
() rely on a logarithmic measure of the length
of the shortest path between two word-meanings
and
. They scale it
by the depth
of the whole tree.
| (3.3) |
The measure of () is based on the
lowest super-ordinate
, also known as
most specific common subsumer. It is the root of the smallest
subtree containing both
and
.
() points out that, if lexica vary in the depths of the
``hyponymy-tree'' in different parts of the taxonomy,
this severely limits the performance of approaches based on path length, so he
uses the probability of the LSO to occur in a corpus instead, as the basis
for the information-theoretic measure,
| (3.4) |
() compared the most important similarity-measures based on WordNet
for their overall accuracy.
They examined the agreement of the degree of relatedness predicted by these
measurements with data from a study by () asking human subjects to rate
the degree of semantic relatedness. Furthermore they investigated the performance of
these measures in a system for malapropism-detection, an experimental setup that
widely parallels the application in lexical steganography.
According to their observations, the most accurate similarity-measure was that of
(),
| (3.5) |
This measure has, from an information-theoretic point of view, an
intuitive appeal, if we bear in mind the idea of lexical inheritance.
is the information both
senses
and
share, since it contains features that
are inherited down to both
and
, which is also the idea
behind the measure of (). However, since this measure
is supposed to be a distance, rather than a degree of similarity,
the expression has a positive sign.
This amount of information is then reduced by the information
that distinguishes the senses, the features that are specific
to the words, as captured by
, respectively
.