Posts Tagged 'textual analysis'

What can be learned from text III

Another property that can be learned from text is the author’s attitude to whatever the text is about. This is called, variously, sentiment analysis or appraisal theory. For obvious reasons, it has always been interesting to advertisers and marketers.

In its simplest form, it just analyzes text for associations of adjectives with the nouns of interest, for example films or people. This could be as simple as seeing whether the adjective “good” or “bad” appears near the noun(s) in question. It is not too difficult to extend this to other sets of adjectives that can be considered positive or negative: “the movie was exciting” (good), or “the movie was boring” (bad).

However, this process is not quite as easy as it looks. First of all, it’s hard in languages like English to be sure which adjective goes with which noun — proximity in the sentence is often used, but this is not very robust: “Although parts of the movie were good, overall it was bad” is not a positive comment about the movie.
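
Here is a minimal sketch of the simple proximity approach, and of how it goes wrong on the sentence above. The word lists, the function name `proximity_sentiment`, and the window size are all my own assumptions for illustration; real systems use much larger lexicons.

```python
import re

# Hypothetical, tiny opinion lexicons (assumed for illustration only).
POSITIVE = {"good", "exciting", "great"}
NEGATIVE = {"bad", "boring", "dull"}

def proximity_sentiment(text, target, window=3):
    """Score a target noun by counting opinion words within `window` tokens of it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Look at the tokens on either side of this occurrence of the target.
        nearby = tokens[max(0, i - window): i + window + 1]
        score += sum(w in POSITIVE for w in nearby)
        score -= sum(w in NEGATIVE for w in nearby)
    return score

print(proximity_sentiment("the movie was exciting", "movie"))
print(proximity_sentiment(
    "Although parts of the movie were good, overall it was bad", "movie"))
```

With a window of three tokens, the second call sees "good" but never reaches "bad", so it scores an overall negative comment as positive. Widening the window just moves the problem elsewhere.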

Second, authors often use devices such as irony and sarcasm which look, syntactically, as if they are giving one opinion, but are actually giving the opposite opinion. Humans figure this out using deep background knowledge about the situation and about human mental life, so it’s difficult for an algorithm to mimic this level of understanding.

Third, texts often comment about the parts of an object as well as the whole object, and it becomes difficult to decide which adjectives go with which parts.

There are three levels of algorithmic analysis used for this problem:

  1. Using simple sets of opinion adjectives (and maybe other words) and trying to associate them with the nouns of interest using proximity, perhaps with a little extra sophistication such as trying to pick out dependent clauses.
  2. Parsing the text more deeply and using natural language analysis techniques to associate opinion words with the nouns of interest.
  3. Using systemic functional linguistics approaches, which treat language generation as a goal-driven task by an individual in a societal setting, as well as a technology.
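
As one possible reading of the "little extra sophistication" in the first level, the sketch below ignores opinion words in clauses that open with a concessive marker such as "although". The marker list, the lexicons, and the name `clause_aware_score` are assumptions of mine, and the function scores a whole sentence rather than tying opinions to a particular noun.

```python
import re

# Hypothetical lexicons and clause markers, chosen only for illustration.
POSITIVE = {"good", "exciting"}
NEGATIVE = {"bad", "boring"}
CONCESSIVE = ("although", "though", "while")

def clause_aware_score(sentence):
    """Sum opinion words, skipping clauses that open with a concessive marker."""
    score = 0
    for clause in re.split(r"[,;]", sentence.lower()):
        words = re.findall(r"[a-z']+", clause)
        if not words or words[0] in CONCESSIVE:
            continue  # treat the concession as background, not the verdict
        score += sum(w in POSITIVE for w in words)
        score -= sum(w in NEGATIVE for w in words)
    return score

print(clause_aware_score(
    "Although parts of the movie were good, overall it was bad"))
```

This gets the example sentence right, but only because the sentence happens to be punctuated helpfully. That is roughly where the second level, real parsing, has to take over.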

These levels are arranged in increasing order of sophistication and of complexity. However, even the best algorithms reach only about 80% accuracy, and that is for relatively unsophisticated judgements.

There are obvious applications of sentiment analysis in adversarial situations: trying to decide whether a terrorist group pronouncement or a threat represents a genuine opinion by the author or some form of propaganda; and, if it is propaganda, who it might be aimed at.

What can be learned from text II

Today let me talk about the internal state channel and what can be learned from it.

First, let me point out that this channel is even more driven by subconscious processes, so we have very little control over it, even if we know how it works. This makes it very revealing.

Some of the properties that can be inferred from the internal state channel are:

  • personality. This is useful in many adversarial situations because you can’t usually get an adversary to sit down and take an MMPI test. Several ways to categorize personality from word usage have been developed, although they are obviously somewhat limited.
  • status of each participant in a conversation. In general, lower status participants tend to use first-person singular pronouns at higher rates, which gives clues about how each participant regards him/herself with respect to the others involved.
  • health. The health of an individual going forward for several months can be predicted by flexibility in pronoun use. This is a bit different from the other categories because it doesn’t rely on a particular signature of word frequencies, but on the ability of each individual to vary his/her word usage widely over time. In other words, unhealthy people maintain a single perspective on the world, and so use consistent pronouns to describe themselves and those around them; healthier people have a changing perspective that is reflected in changing pronouns. (Note the connections to first position, second position etc. associated with NLP, in the sense of neuro-linguistic programming, and its antecedent psychological approaches.)
  • stress/depression. This is related to the previous category, but both stress and depression show up in characteristic ways in word usage. In the case of depression, the changes continue even after the depression ends, so that the never-depressed differ from the once-depressed.
  • community involvement or embeddedness. The way in which first-person singular and plural pronouns are used gives clues about how an individual feels in relation to a community.
  • deception. I’ve written extensively about this in previous posts.
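
Several of these properties come down to counting pronouns. As a rough sketch, the function below computes first-person singular and plural pronoun rates per 100 words. The pronoun lists and the name `pronoun_rates` are my own assumptions, not the exact categories used in the published work.

```python
import re
from collections import Counter

# Assumed pronoun lists for illustration.
FIRST_SINGULAR = {"i", "me", "my", "mine", "myself"}
FIRST_PLURAL = {"we", "us", "our", "ours", "ourselves"}

def pronoun_rates(text):
    """Return first-person singular and plural pronoun rates per 100 words."""
    # Splitting on non-letters means a contraction such as "I'm" counts as "i" and "m".
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return {"singular": 0.0, "plural": 0.0}
    counts = Counter(words)
    singular = sum(counts[w] for w in FIRST_SINGULAR)
    plural = sum(counts[w] for w in FIRST_PLURAL)
    return {
        "singular": 100.0 * singular / len(words),
        "plural": 100.0 * plural / len(words),
    }

print(pronoun_rates("I think I should ask my boss"))
print(pronoun_rates("We decided our plan"))
```

Comparing these rates across the participants in a conversation, or for one author across time, is the kind of signal the status, health, and community items above rely on.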

Many of these results are the work of James Pennebaker and his group at the University of Texas at Austin. His work (here) is quite accessible. The paper by Chung and Pennebaker is particularly relevant.

What can be learned from text?

I’ve blogged before about text as a way of seeing both a content channel and an internal state channel from its author’s mind; and I’ve talked about looking for deception in the internal state channel extensively in the context of politics.

What else can be learned from text?

From the content channel, these kinds of things can be extracted:

  • the names of people, places, things, and sometimes other specifics such as money (these are called named entities).
  • the topic of a text, that is what it is about expressed as a single word or phrase (or perhaps a short list of such things). This property is what is used to collect together similar stories in Google News, for example.
  • events that occurred in the text. Although this seems straightforward, it is actually quite difficult and little work has been done on it.
  • a summary of the text. There is a wide range of possible summary styles: a tag cloud is one kind of summary; the snippet that comes below the URL in a page of search results is another; but there are also summaries that consist of sets of sentences drawn from the text that represent the main ideas or content.
  • a narrative from the text. Summaries try to cherry pick ideas from content, while a narrative tries to provide some sense of the flow of events.

All of these techniques make it easier to take a text, usually one of a very large number of texts, and decide whether it is worth looking at the raw text or not.
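
To make the sentence-based kind of summary concrete, here is a minimal sketch that picks the sentences whose content words are most frequent in the text as a whole. The stopword list, the scoring rule, and the name `extractive_summary` are assumptions for illustration; this is one simple way to do it, not a description of any particular system.

```python
import re
from collections import Counter

# A small, hypothetical stopword list; real systems use longer ones.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "it", "that", "on"}

def content_words(sentence):
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def extractive_summary(text, n_sentences=1):
    """Pick the sentences whose content words are most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Restore original order so the summary reads naturally.
    return [s for s in sentences if s in ranked]

text = ("The bank raised interest rates. "
        "Markets fell after the bank raised rates. "
        "The weather was sunny.")
print(extractive_summary(text, n_sentences=1))
```

The same word-frequency table, sorted by count, is essentially what a tag cloud displays.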

Other properties of the content channel that are often of interest are:

  • who the author is. This is an old kind of question, going back to questions like: were Shakespeare’s plays written by Shakespeare, or the Earl of Essex, or someone else? It’s actually a range of questions: was this text written by author A or someone (anyone) else; was this text written by author A or author B; or match this large set of documents to this large set of authors.
  • properties of the author. For example, there is some work on determining author gender, author age, and author first language.
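
The "author A or author B" version of the question can be sketched by comparing how often each text uses common function words. The word list, the use of cosine similarity, and the names `profile` and `attribute` are my assumptions for illustration; serious attribution work uses far more words and far more text.

```python
import math
import re
from collections import Counter

# Hypothetical short list of function words; published studies use many more.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "with"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute(disputed, sample_a, sample_b):
    """Return 'A' or 'B', whichever known sample has the closer profile."""
    d = profile(disputed)
    sim_a = cosine(d, profile(sample_a))
    sim_b = cosine(d, profile(sample_b))
    return "A" if sim_a >= sim_b else "B"

known_a = "the king of the hill and the lord of the land"
known_b = "it is true that it rains but it helps with crops"
print(attribute("the queen of the castle and the lady of the lake", known_a, known_b))
```

Note that this always answers A or B. The "author A or anyone else" question is harder because there is no second sample to compare against.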

All of these problems have been studied, and many can be solved with knowledge discovery techniques whose accuracies are in the range 70-85%. This sounds pretty good, but is not terribly practical when the set of texts involved is in the tens of thousands: at 80% accuracy, 20,000 texts would produce roughly 4,000 wrong answers. Most people who use Google News will have come across the occasional howler, a topic that is completely useless, for example.

Text in Adversarial Situations

Whenever we, as humans, write or speak we reveal something about ourselves. Part of this is what we want to reveal — the purpose of our communication. But we also reveal a great deal that we did not necessarily intend to reveal, and this is part of what makes textual analysis interesting in adversarial situations.

It’s helpful to think of what is happening when we speak or write as happening simultaneously in two channels:

  1. The content channel, which serves the purpose for which the communication is intended, and is often carried by the ‘big’ words: nouns and verbs; and
  2. The internal state channel, which reveals information about our mental state, intentions, and feelings, and is often carried by the ‘little’ words such as pronouns, conjunctions, and auxiliary verbs.

The internal state channel is what we examined when looking for deception in earlier posts.

We tend to think, intuitively, that we control the content channel consciously even though we can’t control the internal state channel. In fact, we do not control either channel very well at the level of detail; although, of course, when we set out to say something we usually manage to get the content we want across.

When we think about communication emanating from bad guys, there are a number of different scenarios:

  1. They are communicating in a public way to disseminate content. They may attempt concealment, but are aware that their communication could be intercepted both accidentally by almost anyone, and by people looking for it explicitly.
  2. They are communicating in a private way to disseminate content. They will have to attempt concealment, and know that communication that is intercepted will be scrutinized carefully.
  3. They are communicating in a public way, but it is their mental state that is most interesting. For example, propaganda is a form of content-filled public communication, but the mental models and intentions behind it are probably of more interest than the content.

Each of these scenarios requires a particular kind of analysis, but the overall structure is the same:

  • Use selection to find potentially interesting communication in the mass of communication in today’s media and internet. This relies on modelling normality; looking for concealment; and looking for evasion (reaction to simple selection techniques).
  • Use analysis techniques on the set of selected communications to extract content, authorship information, metainformation (e.g. traffic analysis), intention, emotional state, deception, and attitudes.

I’ll talk in more detail about these aspects in subsequent posts.