Next: 10. Evaluation & Future Up: Towards Linguistic Steganography: A Previous: 8. Towards Coding in   Contents

9. Conclusions

This report has shown that natural language steganography is a promising approach which has, unfortunately, received little attention so far.

Relevant background from steganography was presented systematically. The information-theoretic characterization of steganography was presented first; it relies on meaningless symbolic black boxes exchanged by sender and receiver. We then moved on to the ontological demand for models that relate these symbols to each other in such a way that we can interpret them and tell whether they are innocuous or suspicious from the point of view of a model that accounts for their semantics. The covers we want to hide secrets in are usually interpreted, ultimately, by intelligent humans. Unfortunately, models of the essentially cognitive ability to understand the content of datagrams are difficult, if not impossible, to construct. This is where we were confronted with the limits of what we can expect a computer to do, but it was shown how even these limits can be used to improve steganographic security by exploiting them as human interactive proofs.

It was demonstrated that systems based on the replacement of dictionary words can only succeed if they rely on sophisticated models of lexical semantics, as investigated in computational linguistics. The ambiguity inherent in a word, and in the context a word is used in, was presented as the linguistic phenomenon we seek to exploit when we encode data by substituting words. Computational models of these ambiguities have been driven by research in word-sense disambiguation and are now a well-understood topic. The state of the art in this field was summarized. Moving away from purely synonymy-based ideas of substitutability, other lexical relations found in state-of-the-art computer-readable dictionaries were shown, and current measures that quantify the degree to which two words can be considered substitutable, based on lexical evidence, were described.

The ideas and approaches behind current prototypes for natural language steganography were described and systematized by the kind of linguistic model they employ. A distinction was made between approaches that measure the distortion imposed by embedding a hidden message by means of symbolic, syntactic, and semantic models of language. It was shown that all of these approaches have one theme in common: manipulating a sequence of symbols in such a way that it can be reinterpreted by a function to reconstruct a secret message, while leaving the usual interpretation of this sequence of symbols intact. The critical distinction between symbolic, syntactic, and semantic approaches to natural language steganography is then simply the model that accounts for this "common interpretation". Lexical approaches were shown to correspond to symbolic models, context-free grammars to syntactic models, and ontological analysis of deep structure to semantic models. These linguistic models were related to the steganographic background by pointing out that all the symbols originating from any of these levels of linguistic analysis are relevant "clues" to a steganalyst trying to detect hidden communication.
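The common theme named above can be made concrete with a minimal sketch of the lexical (symbolic) case. It is only an illustration under stated assumptions, not the design of any of the prototypes discussed: the synonym sets, the names `SYNSETS`, `INDEX`, `embed`, and `extract`, and the one-bit-per-word encoding are all hypothetical. The secret bits are recovered by a second function that reinterprets the same word sequence, while a human reader still sees an ordinary sentence.

```python
# Hypothetical toy synonym sets; a real system would draw these from a lexicon.
SYNSETS = [
    ["big", "large"],
    ["quick", "fast"],
    ["begin", "start"],
]

# Map every word to (its synonym set, its position inside that set).
INDEX = {word: (synset, i) for synset in SYNSETS for i, word in enumerate(synset)}


def embed(cover_words, bits):
    """Replace substitutable words so that each one carries one secret bit."""
    remaining = list(bits)
    stego = []
    for word in cover_words:
        if word in INDEX and remaining:
            synset, _ = INDEX[word]
            # The chosen member of the synonym set encodes the next bit.
            stego.append(synset[remaining.pop(0)])
        else:
            stego.append(word)
    if remaining:
        raise ValueError("cover text too short for message")
    return stego


def extract(stego_words, n_bits):
    """Reinterpret the word sequence: read off the position of each known word."""
    bits = [INDEX[word][1] for word in stego_words if word in INDEX]
    return bits[:n_bits]


cover = "the big dog will start a quick run".split()
stego = embed(cover, [1, 0, 1])
print(" ".join(stego))
print(extract(stego, 3))
```

The sketch ignores word-sense ambiguity entirely: it assumes that every member of a synonym set is an acceptable replacement in every context, which is exactly the assumption that the linguistic models summarized above are meant to replace.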

Moving on from the ideas and approaches behind current prototypes to their actual design and implementation, the special issues addressed in these systems were presented in detail. Winstein's word-choice hash was described, which allows a human author to influence the word-choice configurations made by a stegosystem. Chapman's approach of modelling natural language via style templates was presented, as well as Wayner's approach to context-free mimicry, which uses Huffman trees to guide the selection of context-free productions from a grammar that characterizes innocuous covers. The use of ANLs was presented as a way to provide for the semantic side of the "linguistic equation" as stated by Atallah et al.
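The idea of letting message bits guide the selection of productions can be illustrated with a small sketch. It is not Wayner's implementation: the grammar, the names `GRAMMAR`, `generate`, and `parse`, and the hand-assigned prefix-free codes are assumptions made for illustration. The codes stand in for what a Huffman tree over production probabilities would assign (for example, probabilities 1/2, 1/4, 1/4 for the three `NP` productions).

```python
# Hypothetical toy grammar. Each production is (prefix-free bit code, right-hand side).
# A nonterminal with a single production carries no bits (empty code).
GRAMMAR = {
    "S": [("", ["NP", "VP"])],
    "NP": [("0", ["the", "cat"]), ("10", ["the", "dog"]), ("11", ["a", "bird"])],
    "VP": [("0", ["sleeps"]), ("1", ["sings"])],
}


def generate(symbol, bits, pos=0):
    """Expand symbol, letting the bits at position pos choose each production."""
    if symbol not in GRAMMAR:  # terminal symbol: emit it unchanged
        return [symbol], pos
    for code, rhs in GRAMMAR[symbol]:
        if bits.startswith(code, pos):
            pos += len(code)
            words = []
            for child in rhs:
                sub_words, pos = generate(child, bits, pos)
                words += sub_words
            return words, pos
    raise ValueError("ran out of message bits")


def parse(symbol, words, pos=0):
    """Recover the bit codes of the productions that derive words[pos:]."""
    if symbol not in GRAMMAR:
        if pos < len(words) and words[pos] == symbol:
            return "", pos + 1
        return None
    for code, rhs in GRAMMAR[symbol]:
        bits, p = code, pos
        for child in rhs:
            result = parse(child, words, p)
            if result is None:
                break  # this production does not match; try the next one
            sub_bits, p = result
            bits += sub_bits
        else:
            return bits, p
    return None


words, used = generate("S", "101")
print(" ".join(words), used)
print(parse("S", words))
```

The parser takes the first matching production at each step without full backtracking, which is sufficient for this unambiguous toy grammar; a grammar that characterizes realistic covers would need a proper parser.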

A summary was then given of the lessons learned from the theoretical and practical issues investigated so far, and objectives for the design and analysis of natural language stegosystems were proposed. Based on these objectives, the current prototypes were evaluated, and future research directions were pointed out.

Although current systems for lexical steganography allow data to be encoded into natural language text, none of their coding techniques was designed with theoretically strong security and robustness in mind. It was shown that these problems are not trivial, for example because of limitations in the applicability of current techniques for error-correcting coding. A blocking scheme was shown that allows us to overcome these limitations, making the scheme robust. The use of one-way functions in this blocking scheme was described as a way to address the issue of security. Although no strong formal claims could be made, it was shown by example that the scheme does indeed provide some degree of robustness and security.

Although current systems already use lexical replacement for coding text, none of their replacement strategies has been thoroughly analyzed from a linguistic point of view. The problem of word-sense ambiguity was investigated for the first time in this context. The two manifestations of this ambiguity in a coding scheme, forward and backward ambiguity, were identified. Based on these phenomena, it was shown how lexical ambiguity can be used to construct coding schemes with different interesting properties. One coding scheme outlined allows data to be encoded by carrying out lexical replacements and to be decoded again automatically. Due to the use of sense disambiguators, these lexical replacements are much more adequate than any of those carried out by current systems. Another coding scheme outlined allows data to be encoded in such a way that no computer will be able to extract it again, confronting large-scale detection of hidden communication with a serious practical obstacle. Some hybrid schemes, combining the two, were shown as well.

In summary, although we are nowhere near the goal of constructing provably secure and robust natural language steganography systems today, this report may have shed some light on the road that could lead us there.


Richard Bergmair 2005-01-31