Improving the process of detecting deception

The main problem with Pennebaker’s deception model, discussed in an earlier post, is that for two of the word classes the signal for deception is a decrease in word frequency. But a decrease with respect to what? Without some care, a text can look unusually deceptive simply because it happens not to use certain kinds of words. For example, formal business writing almost never uses first-person singular pronouns (“I”, “me”, “my”), so the occasional text that used one or two might be judged undeceptive on that basis alone.

It’s better to consider the deceptiveness of a group of documents from a common domain, and rank them from most to least deceptive, than to imagine that deception is an absolute property. Then it’s clear that a decrease means “frequencies that are lower than the norms of documents in this domain”.
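To make the ranking idea concrete, here is a minimal sketch. The word list and toy corpus are illustrative only, not Pennebaker’s actual lexicon: each document is scored by how far its first-person-pronoun rate falls below the domain average, and the documents are then sorted by that deficit.

```python
# Illustrative word class; Pennebaker's model uses a fuller lexicon.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def rate(text):
    """Fraction of a document's words that are first-person singular pronouns."""
    words = text.lower().split()
    return sum(w in FIRST_PERSON for w in words) / max(len(words), 1)

def rank_by_deficit(docs):
    """Rank documents from most to least deceptive on this one signal:
    the further a document's pronoun rate falls below the domain mean,
    the higher it ranks."""
    rates = [rate(d) for d in docs]
    mean = sum(rates) / len(rates)
    deficits = [mean - r for r in rates]  # decrease relative to domain norm
    return sorted(range(len(docs)), key=lambda i: deficits[i], reverse=True)

docs = [
    "We completed the report on schedule.",
    "I think I finished my part of the report.",
    "The report was completed and filed.",
]
print(rank_by_deficit(docs))  # the pronoun-heavy document ranks last
```

The point of the example is that the scores only make sense relative to the group: the same pronoun rate could rank as deceptive in one domain and unremarkable in another.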

It also turns out to be useful to consider the correlations in word usage across the documents in a domain: there may be conventions about the way ideas are expressed that go beyond simple word frequencies. So deception detection is improved by considering the correlation among messages, and using it to pick out the more unusual documents for closer examination. This is the approach I’ve used to look at Enron emails and politicians’ speeches, for example
here for the US presidential election in February 2008, and
for the Canadian Federal election in 2006.
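One simple way to operationalise “unusual with respect to the domain’s correlation structure” is the Mahalanobis distance: measure each document’s deviation from the domain mean, weighted by the inverse covariance of the word-class frequencies, so that deviations which cut against the usual correlations count for more. This is a generic sketch under assumed toy data, not necessarily the method used in the analyses linked above.

```python
import numpy as np

def unusualness(freq_matrix):
    """Squared Mahalanobis distance of each document (row) from the
    domain mean, using the covariance of word-class frequencies.
    pinv tolerates a singular covariance for small corpora."""
    X = freq_matrix - freq_matrix.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", X, inv_cov, X)

# Toy document-by-word-class frequency matrix (rows: documents,
# columns: e.g. first-person pronouns, exclusive words, negative emotion).
F = np.array([
    [0.10, 0.05, 0.02],
    [0.11, 0.04, 0.02],
    [0.09, 0.05, 0.03],
    [0.10, 0.06, 0.02],
    [0.12, 0.05, 0.01],
    [0.02, 0.15, 0.09],   # out of line with the rest of the domain
])
scores = unusualness(F)
print(scores.argmax())  # index of the most unusual document
```

Documents with the largest scores are the ones to pull out for closer examination; within a single domain the scores give exactly the most-to-least-unusual ranking described above.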
