Posts Tagged 'content'

Inspire Magazine Issue 10

The tenth issue of this al Qaeda in the Arabian Peninsula magazine is out. Continuing the textual analysis I’ve done on the earlier issues, I can conclude two things:

  1. Issue 10 wasn’t written by whoever wrote Issue 9 (nor by those who wrote the previous issues since they’re dead). In almost every respect the language resembles that of earlier issues, and is bland with respect to almost every word category. Except …
  2. The intensity of Jihadist language, which has been steadily increasing over the series, decreases sharply in Issue 10. Whoever the new editors/authors are, their hearts are not in it as much as the previous ones.

Differentiating from the other candidate

One of the puzzles of the early phases of the 2012 election campaign was how little the candidates managed to differentiate themselves from one another.

Campaigns are a situation where getting daylight between your candidate and the other guys seems essential (preferably in a good way). But not only did the Republican contenders all tend to use similar words, they also used words similar to Obama's. There was some indication that each had a home ground to which they constantly returned, but it wasn’t different enough from everybody else’s to differentiate them, certainly not to a human audience. (I’m talking about aspects of this analysis at the Foundations of Open Source Intelligence at the end of the month in Istanbul: politicians acting as surrogates for other highly motivated, sophisticated, well-funded persuaders.)

Now that the campaign has become a two-person one, there is differentiation in the language use of the two candidates, shown here:

The blue crosses are Obama speeches and the red ones Romney speeches. There are clear differences.

So the next question is: do these differences result from differences of content or differences of style? This turns out to be hard to answer. If we pick out particular classes of words (nouns, verbs, adjectives) then there’s more of an overlap, but still a visible difference. For example, here is the equivalent plot for just the nouns, which you would imagine would primarily capture differences in content:

This rather suggests that a big part of the difference is what the candidates are talking about. But when you dig into the data, it turns out that the differentiating nouns are not big content-filled nouns, but little ordinary nouns where the differences are as much about habits and taste as they are about content.
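One way to dig out the differentiating nouns is to compare how often each candidate uses each noun per thousand nouns. The following is only a minimal sketch of that idea, not the analysis behind the plots: it assumes the nouns have already been extracted by some part-of-speech tagger, and the function names and the two tiny word lists are hypothetical.

```python
from collections import Counter

def rates_per_thousand(tokens):
    """Occurrences of each word per 1000 tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    return {word: 1000 * count / total for word, count in Counter(tokens).items()}

def most_distinctive(tokens_a, tokens_b, top=2):
    """Words whose usage rate differs most between two speakers.

    Positive differences mean speaker A uses the word more often.
    """
    rate_a = rates_per_thousand(tokens_a)
    rate_b = rates_per_thousand(tokens_b)
    vocabulary = set(rate_a) | set(rate_b)
    diffs = {w: rate_a.get(w, 0.0) - rate_b.get(w, 0.0) for w in vocabulary}
    # Largest absolute difference first; ties broken alphabetically.
    ranked = sorted(diffs, key=lambda w: (-abs(diffs[w]), w))
    return ranked[:top], diffs

# Hypothetical noun lists standing in for two candidates' speeches.
nouns_a = ["people", "time", "jobs", "people", "way", "people"]
nouns_b = ["things", "time", "jobs", "things", "day", "things"]

top_words, diffs = most_distinctive(nouns_a, nouns_b)
print(top_words)
```

In this toy data the shared content noun "jobs" does not separate the speakers at all, while the small habitual nouns do, which is the pattern described above.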

Comparing content in the US presidential campaign 2008 vs 2012

I posted about the content in the 2012 presidential campaign speeches. It’s still relatively early in the campaign, so comparisons aren’t necessarily going to reveal a lot, but I went back and looked at the 2008 speeches by Hillary Clinton, McCain, and Obama, and compared them to those of the four remaining Republican contenders and President Obama so far this year.

Here’s the result of looking just at the nouns:

The key is as follows. 2008: Clinton, magenta circles; Obama, red circles; McCain, light blue stars. 2012: Gingrich, green circles; Paul, yellow circles; Romney, blue circles; Santorum, black circles; Obama, red squares.

Recall that the way to interpret these plots is that points far from the origin are more interesting speeches (in the sense that they use more variable word patterns) while different directions represent different “themes” in the words used.
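For readers who want to see how a plot with these properties can arise, here is a minimal sketch. It assumes the plots come from something like a singular value decomposition of a speech-by-word count matrix, which this post does not actually state, and it uses a plain power iteration so that it needs no libraries. The word list, the counts, and the function names are all made up for illustration.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so the 'average speech' sits at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def top_direction(rows, iterations=200):
    """Leading right singular vector of rows, found by power iteration."""
    width = len(rows[0])
    v = [1.0 / (j + 1) for j in range(width)]  # arbitrary non-uniform start
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in rows]
        v = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(width)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm < 1e-12:  # no variation left to explain
            return [0.0] * width
        v = [w / norm for w in v]
    return v

def project(rows, k=2):
    """Return k-dimensional coordinates per speech and the k word directions."""
    residual = center_columns(rows)
    coords = [[] for _ in rows]
    directions = []
    for _ in range(k):
        v = top_direction(residual)
        scores = [sum(x * w for x, w in zip(row, v)) for row in residual]
        for point, s in zip(coords, scores):
            point.append(s)
        # Remove the direction just found before looking for the next one.
        residual = [[x - s * w for x, w in zip(row, v)]
                    for row, s in zip(residual, scores)]
        directions.append(v)
    return coords, directions

# Hypothetical counts: rows are speeches, columns are the words in WORDS.
WORDS = ["economy", "jobs", "energy", "oil"]
SPEECHES = [
    [5, 4, 0, 1],  # economy-heavy
    [4, 5, 1, 0],  # economy-heavy
    [0, 1, 5, 4],  # energy-heavy
    [1, 0, 4, 5],  # energy-heavy
    [2, 3, 3, 2],  # bland mixture
]

coords, directions = project(SPEECHES)
distances = [math.hypot(x, y) for x, y in coords]
for point, dist in zip(coords, distances):
    print([round(c, 2) for c in point], round(dist, 2))
```

In this toy data the bland fifth speech lands closest to the origin, and the economy-heavy and energy-heavy speeches fall on opposite sides of the first axis. The entries of `directions` play the role of the word loadings, so they show which words pull a speech in which direction.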

The most obvious difference is that the topics talked about were much more wide-ranging in 2008 than they have been this year. This may be partly because of the early stage of the campaign, the long Republican primary season keeping those candidates focused on a narrow range of topics aimed at the base, or a change in the world that has focused our collective attention on different, and fewer, topics.

This can be teased out a bit by looking at the words that are associated with each direction and distance. The next figure shows the nouns that were actually used (only those that are substantially above the median level of interestingness are labelled):

You can see that there are four “poles” or topics that differentiate the speech content. To the right are words associated with the economy, but from a consumer perspective. At the bottom are words associated with energy. To the left are actually two groups of words, although they interleave a little. At the lower end are words associated with terrorism and the associated wars and threats. At the upper end are words associated with the human side of war and patriotism.

These two figures can be lined up with each other to get a sense of which candidates are talking about which topics. The 2012 speeches and Obama’s 2008 speeches all lean heavily towards the economic words. In 2008, McCain and Clinton largely talked about the war/security issues, with a slight bias by Clinton towards the patriotism cluster.

Obama’s 2012 speeches tend towards the energy cluster but, at this point, quite weakly given the overall constellation of topics and candidates.

The other thing that is noticeable is how similar the topics for some of the Republican contenders are: their speeches cluster quite tightly.

What can be learned from text?

I’ve blogged before about text as a way of seeing both a content channel and an internal state channel from its author’s mind; and I’ve talked about looking for deception in the internal state channel extensively in the context of politics.

What else can be learned from text?

From the content channel, these kinds of things can be extracted:

  • the names of people, places, things, and sometimes other specifics such as money (these are called named entities).
  • the topic of a text, that is what it is about expressed as a single word or phrase (or perhaps a short list of such things). This property is what is used to collect together similar stories in Google News, for example.
  • events that occurred in the text. Although this seems straightforward, it is actually quite difficult and little work has been done on it.
  • a summary of the text. There is a wide range of possible summary styles: a tag cloud is one kind of summary; the snippet that comes below the URL in a page of search results is another; but there are also summaries that consist of sets of sentences drawn from the text that represent the main ideas or content.
  • a narrative from the text. Summaries try to cherry pick ideas from content, while a narrative tries to provide some sense of the flow of events.

All of these techniques make it easier to take a text, usually one of a very large number of texts, and decide whether it is worth looking at the raw text or not.
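As an illustration of the sentence-based summary style mentioned in the list above, here is a deliberately naive sketch: score each sentence by the average frequency of its non-trivial words in the whole text, and keep the top few in their original order. The stopword list, the function names, and the sample text are assumptions for illustration; real summarizers are considerably more sophisticated.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "this", "on", "for", "was", "are"}

def content_words(text):
    """Lowercased words with the stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def split_sentences(text):
    """Crude sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(text, max_sentences=2):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original order
    return [sentences[i] for i in keep]

sample = ("Energy prices rose sharply. The weather was pleasant. "
          "Energy prices worry voters because energy costs affect jobs.")
print(summarize(sample, max_sentences=2))
```

On the sample text the off-topic weather sentence is the one that gets dropped.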

Other properties of the content channel that are often of interest are:

  • who the author is. This is an old kind of question, going back to questions like: Were Shakespeare’s plays written by Shakespeare, or the Earl of Essex, or someone else? It’s actually a range of questions: was this text written by author A or someone (anyone) else; was this text written by author A or author B; or match this large set of documents to this large set of authors.
  • properties of the author. For example, there is some work on determining author gender, author age, and author first language.
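The "author A or author B" version of the question can be sketched with a very old idea: compare how often each text uses small function words, which authors tend to choose out of habit. This is only a toy sketch under that assumption; the word list, the function names, and the sample sentences are hypothetical, and real attribution needs far more text and far better features.

```python
import math
import re

# A short, hypothetical list of function words to profile.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [tokens.count(w) / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two profiles (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute(unknown, candidates):
    """Name of the candidate whose profile is most similar to the unknown text."""
    target = profile(unknown)
    return max(candidates,
               key=lambda name: cosine(target, profile(candidates[name])))

# Made-up writing samples for two hypothetical authors.
known = {
    "A": "the cat and the dog and the bird sat in the sun",
    "B": "it is a fact that it is a rule but it is a pity",
}
print(attribute("the horse and the cow stood in the rain", known))
```

With samples this short the result means very little, which is part of why the accuracies discussed next are what they are.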

All of these problems have been studied, and many can be solved with knowledge discovery techniques whose accuracies are in the range 70-85%. This sounds pretty good, but is not terribly practical when the set of texts involved is in the tens of thousands. Most people who use Google News will have come across the occasional howler topic that is completely useless, for example.
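To make the practicality point concrete, a quick back-of-the-envelope calculation helps. The collection size of 20,000 texts below is a hypothetical stand-in for "tens of thousands"; the accuracies are the two ends of the range quoted above.

```python
def expected_errors(num_texts, accuracy):
    """Expected number of texts handled wrongly at a given accuracy."""
    return round(num_texts * (1.0 - accuracy))

# Hypothetical collection size, with both ends of the 70-85% range.
for accuracy in (0.70, 0.85):
    print(f"{accuracy:.0%} accurate: about {expected_errors(20_000, accuracy)} wrong")
```

Even at the top of the range, that is thousands of wrong answers that someone has to find and fix by hand.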