next up previous contents
Next: 2. Steganographic Security Up: Towards Linguistic Steganography: A Previous: Dear Diary,   Contents

1. Introduction

``Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers.''

United Nations
Universal Declaration of Human Rights

Technologies for information and communication security have often brought forth powerful tools to make this vision come true, despite many different kinds of adverse circumstances. The most urgent threat to security that has been addressed so far is probably the exploitation of sensitive data by interceptors of messages, a situation studied in the context of cryptography. Cryptograms protect their message content from unauthorized access, but they are vulnerable to detection. This is not a problem as long as cryptography is broadly perceived as a legitimate way of protecting one's security. It becomes a problem when cryptography is seen as a tool useful primarily to a potential terrorist, Volksfeind, enemy of the revolution, or whatever term the historical context seems to prefer.

Throughout history, whenever the political climate got difficult, we could often observe attempts to limit the individual's freedom of opinion and expression. What is new to the times we are living in is that we now rely heavily upon electronic media and automated systems to distribute and gather information for us. The fact that these media do not, by design, rule out the possibility of central control and monitoring is dangerous in itself. However, the fact that we can now watch the necessary infrastructures being built should be highly alarming.

This is why I believe that today it is more important than ever before that we start asking ourselves about the consequences of these infrastructures being controlled by what we will often refer to as an arbitrator in this report. The connotations of this English stem already define the setup we are thinking about very well. In German, we use words like willkürlich, tyrannisch, eigenmächtig, and launenhaft for arbitrary, which roughly translate back to despotic, tyrannical, high-handed, and moody.

Clearly, it is highly desirable to protect Alice's and Bob's freedom to communicate securely in the presence of Wendy the warden, an individual who controls the communication channels in use and seeks to detect and penalize unwanted communication. This is a well-understood setup in information security, studied in the context of steganography.

Whether we write books, articles, websites, emails, or post-it notes, whether we talk to each other over the telephone, over radio or simply over the fence that separates our next-door-neighbour's garden from our own, our communication will always adhere to one and the same protocol: natural language. So, when we talk about information and communication security, we should be well aware that we encode most of the information that makes up our society in natural language. The security of steganograms arises from the difficulty of detecting them in large amounts of data. Therefore, it seems reasonable to study natural language in the context of steganography, as a very promising haystack to hide a needle in.

Today, the best-known steganography systems use images to hide their data in. The simplest technique is LSB substitution. We can think of a digital image with 24 bits of color depth as using three bytes to code the color of each pixel, one each for the strength of a red, a green, and a blue light source, which together produce the color under additive synthesis. If we randomly toggle the least significant bit (LSB) of each of these bytes, the respective color component of the pixel will deviate by $\pm \frac{1}{256}$ units of light strength. By substituting bits of a secret message for these LSBs, instead of randomly toggling them, we can encode a secret into the image. If we do not expect humans to be able to tell the difference between the original color of a pixel and the color of the same pixel after we have made it one of 256 degrees more, say, reddish, we have in fact hidden a secret.
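The LSB substitution just described can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: the cover image is taken to be a flat sequence of color bytes (three per pixel), the message is embedded starting at the first byte, and the receiver is assumed to know the message length in advance. The function names embed_lsb and extract_lsb are hypothetical and chosen for this example only.

```python
def embed_lsb(cover: bytes, message: bytes) -> bytes:
    """Replace the least significant bit of each cover byte with one message bit."""
    # Unpack the message into single bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover is too small for this message")
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        # Clear the LSB of the color byte, then set it to the message bit.
        stego[i] = (stego[i] & 0xFE) | bit
    return bytes(stego)


def extract_lsb(stego: bytes, n_bytes: int) -> bytes:
    """Read n_bytes of hidden data back from the LSBs of the stego bytes."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for color_byte in stego[8 * i: 8 * i + 8]:
            value = (value << 1) | (color_byte & 1)
        out.append(value)
    return bytes(out)


# Assumed toy cover: 64 pixels of 24-bit color, i.e. 192 color bytes.
cover = bytes((7 * i) % 256 for i in range(192))
stego = embed_lsb(cover, b"secret")
recovered = extract_lsb(stego, 6)
```

Each color byte changes by at most one of its 256 levels, which is the deviation the paragraph above refers to.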

From linguistics we know that natural language has similar features. For example, is there a significant difference between Yesterday I had my guitar repaired and I had my guitar repaired yesterday? Is there a significant difference between This is truly striking! and This is truly awesome!? We can think of many transformations that do not change much about the semantic content of natural language text. In this report, our attention will be devoted to using such transformations for hiding secrets.
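To make the analogy with LSB substitution concrete, the following toy sketch hides bits by choosing between interchangeable words. It is not the method developed in this report, only an illustration under strong assumptions: the table SYNONYM_PAIRS is a hypothetical, hand-made list, the words in each pair are assumed to be interchangeable in every context (which real language does not guarantee), and the text is assumed to be split into words by whitespace without punctuation.

```python
# Hypothetical pairs of interchangeable words; position 0 encodes bit 0,
# position 1 encodes bit 1.
SYNONYM_PAIRS = [("striking", "awesome"), ("big", "large"), ("quick", "fast")]

# Map every listed word to its pair and to the bit it stands for.
LOOKUP = {word: (pair, bit) for pair in SYNONYM_PAIRS for bit, word in enumerate(pair)}


def embed(cover: str, bits: list) -> str:
    """Encode bits by picking one word of each synonym pair found in the cover."""
    remaining = list(bits)
    out = []
    for word in cover.split():
        if word in LOOKUP and remaining:
            pair, _ = LOOKUP[word]
            word = pair[remaining.pop(0)]
        out.append(word)
    if remaining:
        raise ValueError("cover text offers too few substitutable words")
    return " ".join(out)


def extract(stego: str, n_bits: int) -> list:
    """Read the first n_bits back from the synonym choices in the text."""
    bits = [LOOKUP[word][1] for word in stego.split() if word in LOOKUP]
    return bits[:n_bits]


stego_text = embed("this is a truly striking and big result", [1, 0])
hidden_bits = extract(stego_text, 2)
```

Here the word pair plays the role of the least significant bit: either choice is assumed to leave the meaning of the sentence essentially unchanged.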

While automatic analysis of images sent over electronic channels is already difficult, it is an undertaking that still seems feasible. Natural language text, however, is so omnipresent in today's society that arbitrators will hardly ever be able to cope efficiently with these masses of data, most of which are not even available in electronic form.

If we already had the kind of technology we envision, it would be possible to encode a secret PDF file into a natural language text. It would be possible to distribute it by having the resulting text printed, say, onto a t-shirt and showing the text around on the streets, and it would be possible for legitimate receivers to enter the text into a computer and reconstruct the file again. Most importantly, it would not be possible for any arbitrator to prove that there is anything unusual about the text on that t-shirt.

Clearly, this vision outlines a long way we will have to go, but we will necessarily have to build upon two disciplines: steganography and natural language processing.

Combining these two disciplines is not a common thing to do, so all the necessary background, as far as it is relevant to the understanding of the issues discussed in this report, will be introduced in chapters [*] and [*] for readers with a traditional computer science background. As far as steganography is concerned, we will rely on information-theoretic models. As far as natural language processing is concerned, we will mainly deal with lexical models. Other investigations of the topic, for example based on complexity-theoretic approaches to steganography or on strictly grammatical models of natural language such as unification grammars, would surely be very interesting. We concentrated on the approaches named above since they are well understood and, for a number of reasons we will discuss in chapter [*], most likely to lead to practical systems in the near future.

Unfortunately, the topic of natural language steganography has not been extensively studied in the past. One significant theoretical result has been achieved, and a small number of prototypes have been built, each following a different general approach. Currently there is no formal framework for the design and analysis of such systems. No systematic literature covering relevant aspects of the field has been available, a gap we will try to fill with this report. In chapter [*], we will investigate the few systems built so far, and chapter [*] will try to systematize the ideas behind these implementations. A number of issues that are of central importance for building secure and robust steganography systems in a natural language domain have never been addressed before. Chapters [*] and [*] will identify some of these problems and will present approaches towards overcoming them.

Natural language also offers itself to analysis in the context of another topic, fairly new to computer security. Human Interactive Proofs, or HIPs for short, deal with the distinction between computers and humans in a communication system, and the applications of such distinctions for security purposes. HIPs have been recognized as effective mechanisms to counter abuse of web services, spam and worms, and denial-of-service and dictionary attacks. Throughout this report, we will often find ourselves confronted with major gaps between the abilities of computers and humans to understand natural language. We will analyze these gaps with respect to their value as HIPs, making it difficult for arbitrators to automatically process steganograms. This has already led to the construction of an HIP relying on natural language as a medium. It provides a promising approach towards an often cited open problem.

Based on such considerations, we will discuss many properties of natural language that are highly advantageous from a steganographic point of view. For example, using natural language, it is possible to encode data in such a way that it can only be extracted by humans, but not by machines. This provides for a significant security benefit, since it is a considerable practical obstacle for large-scale attempts to detect hidden communication.

Summing it all up, we can say that steganography is a highly exciting field to be working in at the moment, investigating interesting technologies with rewarding applications already in sight, and natural language is a particularly promising medium to study in the context of steganography.


Richard Bergmair 2005-01-31