What can be learned from text?

I’ve blogged before about text as carrying two channels from its author’s mind: a content channel and an internal state channel. I’ve also written extensively, in the context of politics, about looking for deception in the internal state channel.

What else can be learned from text?

From the content channel, these kinds of things can be extracted:

  • the names of people, places, things, and sometimes other specifics such as money (these are called named entities).
  • the topic of a text, that is, what it is about, expressed as a single word or phrase (or perhaps a short list of them). This property is what Google News uses, for example, to collect similar stories together.
  • events described in the text. Although this seems straightforward, it is actually quite difficult and little work has been done on it.
  • a summary of the text. There is a wide range of possible summary styles: a tag cloud is one kind of summary; the snippet that appears below the URL in a page of search results is another; and there are also summaries that consist of sets of sentences drawn from the text that represent its main ideas or content.
  • a narrative from the text. Summaries try to cherry-pick ideas from the content, while a narrative tries to provide some sense of the flow of events.

All of these techniques make it easier to take a text, usually one of a very large number of texts, and decide whether it is worth looking at the raw text or not.
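To make two of the summary styles above concrete, here is a minimal sketch of a tag cloud (the most frequent content words) and of a sentence-based summary that keeps the sentences whose words are most frequent in the text as a whole. The function names, the tiny stopword list, and the scoring rule are all assumptions chosen for illustration; real systems are considerably more sophisticated.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that",
             "on", "for", "was", "as", "with", "by", "at"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def content_word_counts(text):
    """Count every word in the text that is not a stopword."""
    return Counter(w for w in tokenize(text) if w not in STOPWORDS)


def tag_cloud(text, top_n=5):
    """Return the top_n most frequent content words with their counts."""
    return content_word_counts(text).most_common(top_n)


def extractive_summary(text, num_sentences=1):
    """Return the sentences whose content words are, on average,
    the most frequent in the whole text, in their original order."""
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = content_word_counts(text)

    def score(sentence):
        words = [w for w in tokenize(sentence) if w not in STOPWORDS]
        return sum(counts[w] for w in words) / len(words) if words else 0.0

    chosen = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in chosen]
```

Calling `tag_cloud(text)` gives (word, count) pairs that could be rendered at different sizes, and `extractive_summary(text, 2)` gives the two highest-scoring sentences. Averaging the score over a sentence's content words is one way to avoid simply favouring long sentences.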

Other properties of the content channel that are often of interest are:

  • who the author is. This is an old kind of question, going back to questions like: Were Shakespeare’s plays written by Shakespeare, the Earl of Essex, or someone else? It’s actually a range of questions: was this text written by author A or by someone (anyone) else; was this text written by author A or by author B; or, given a large set of documents and a large set of authors, which author wrote which document.
  • properties of the author. For example, there is some work on determining author gender, author age, and author first language.
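As a rough illustration of the "author A or author B" form of the question, the sketch below compares how often each text uses a handful of common function words and picks the known author whose profile is closest. The word list, the cosine-similarity measure, and the function names are assumptions made for the example, not a description of any particular published method.

```python
import math
import re
from collections import Counter

# A small set of function words (an assumption chosen for illustration).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "it", "but", "not", "with", "upon"]


def style_profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_author(unknown_text, known_texts):
    """known_texts maps an author name to a sample of that author's writing.
    Returns the name whose style profile is most similar to unknown_text."""
    target = style_profile(unknown_text)
    return max(known_texts,
               key=lambda name: cosine(target, style_profile(known_texts[name])))
```

Function words are used here because they say little about the topic, so their frequencies are more likely to reflect habit than subject matter. With only a dozen words and short samples this is a toy; it shows the shape of the approach rather than a usable classifier.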

All of these problems have been studied, and many can be solved with knowledge discovery techniques whose accuracies are in the range 70-85%. This sounds pretty good, but is not terribly practical when the set of texts involved is in the tens of thousands: even at 85% accuracy, a collection of 10,000 texts leaves about 1,500 wrong answers. Most people who use Google News will have come across the occasional howler, a topic label that is completely useless, for example.

