Posts Tagged 'language'

Presidential speech word patterns

In the continuing saga of presidential campaign speech language, I’ve been analyzing parts of speech that don’t get much attention such as verbs, adverbs, and adjectives. Looking at the way in which each candidate uses such words over time turns up some interesting patterns. I don’t understand their deep significance, but there’s some work suggesting that variability in writing is a sign of health; and Ashby’s Law of requisite variety can be interpreted to mean that the actor in a system with the most available options tends to control the system.

Here are the plots of adjective use (in a common framework) for the 2008 and 2012 candidates (up to the time that Santorum dropped out of the race).

It’s striking how much the patterns over time form a kind of spiral, moving from one particular combination of adjectives to another and another and eventually back to the original pattern. The exception is Obama who displays a much more radial structure, with an adjective combination that he uses a lot, and occasional deviations to something else, but a rapid return to his “home ground”.

You can see (the extremal set of) adjectives and their relationships in this figure:

You can see that they form 3 poles: on the left, adjectives associated with energy policy; at the bottom, adjectives associated with patriotism; and on the right, adjectives associated with defence [yes, it is spelled that way]. This figure can be overlaid on those of the candidates to get a sense of which poles they are visiting. For example, Obama’s “home ground” is largely associated with the energy-related adjectives.

Comparing content in the US presidential campaign 2008 vs 2012

I posted about the content in the 2012 presidential campaign speeches. It’s still relatively early in the campaign so comparisons aren’t necessarily going to reveal a lot, but I went back and looked at the speeches in 2008 by Hillary Clinton, McCain, and Obama; and compared them to the four remaining Republican contenders and President Obama so far this year.

Here’s the result of looking just at the nouns:

The key is: Clinton — magenta circles; Obama 2008 — red circles; McCain — light blue stars;

Gingrich — green circles; Paul — yellow circles; Romney — blue circles; Santorum — black circles; Obama 2012 — red squares.

Recall that the way to interpret these plots is that points far from the origin are more interesting speeches (in the sense that they use more variable word patterns) while different directions represent different “themes” in the words used.

The most obvious difference is that the topics talked about were much more wide-ranging in 2008 than they have been this year. This may be partly because of the early stage of the campaign, the long Republican primary season keeping those candidates focused on a narrow range of topics aimed at the base, or a change in the world that has focused our collective attention on different, and fewer, topics.

This can be teased out a bit by looking at the words that are associated with each direction and distance. The next figure shows the nouns that were actually used (only those that are substantially above the median level of interestingness are labelled):

You can see that there are four “poles” or topics that differentiate the speech content. To the right are words associated with the economy, but from a consumer perspective. At the bottom are words associated with energy. To the left are actually two groups of words, although they interleave a little. At the lower end are words associated with terrorism and the associated wars and threats. At the upper end are words associated with the human side of war and patriotism.

These two figures can be lined up with each other to get a sense of which candidates are talking about which topics. The 2012 speeches and Obama’s 2008 speeches all lean heavily towards the economic words. In 2008, McCain and Clinton largely talked about the war/security issues, with a slight bias by Clinton towards the patriotism cluster.

Obama’s 2012 speeches tend towards the energy cluster but, at this point, quite weakly given the overall constellation of topics and candidates.

The other thing that is noticeable is how similar the topics for some of the Republican contenders are: their speeches cluster quite tightly.

Negative words in the campaign

Yesterday we looked at the use of positive words in the campaign. Today, I want to present the use of negative words.

We saw that President Obama is much better at using positive words than the Republican contenders; but they are all about the same at using negative words. Note that these two flavors of words are not necessarily opposites; someone can use both positive and negative words at high rates (although that itself might be interesting).

Here are the speeches according to their patterns of negative word use:

Again, distance from the origin indicates intensity of negative word use, and direction indicates different words being used.

Romney has the strongest use of negative words (and the associated words are ones like “disappointments” and “worrying”). Ron Paul also has quite strong use of negative words. His word choices are quite different from those of the other candidates, though; they include “bankrupt”, “flawed” and “inconvenient”.

President Obama and Gingrich have moderate levels of negative word use; the most popular word for both of them is “problem”, followed by “challenge”.

Santorum has the lowest levels of negative word use of all five of them.

The differences are interesting because they shed some light on how each candidate views those aspects of the situation that are not favorable to them. Obama and Gingrich have a more proactive view: negatives to them are problems. The other candidates have a more outward focus on the source of difficulties and, at the same time, a more negative inward focus, that is they use negative words that reflect how they feel about themselves.

I also ran an experiment weighting the positive words positively and the negative words negatively, to see if there is any ranking from, as it were, most positive person to most negative person. It turns out that there isn’t such a ranking. All of them use mixtures of positive and negative words, different mixtures for each, but all of about the same ratio of positivity to negativity.
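For readers who want to try something similar, here is a minimal sketch of the signed-weighting idea. It is not the analysis I ran: the two word lists are hypothetical (a handful of words borrowed from these posts), and the `signed_tone` function and its net-tone ratio are assumptions made for illustration.

```python
import re

# Hypothetical word lists, for illustration only; a real analysis would
# use a much larger positive/negative lexicon.
POSITIVE = {"profitable", "creative", "outstanding", "optimistic", "gains"}
NEGATIVE = {"problem", "challenge", "bankrupt", "flawed", "worrying"}

def signed_tone(text):
    """Return (positive count, negative count, net tone) for one speech.

    Positive words count +1 and negative words count -1. Net tone is the
    signed sum divided by the number of rated words, so it lies in [-1, 1].
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    rated = pos + neg
    net = (pos - neg) / rated if rated else 0.0
    return pos, neg, net
```

Sorting speeches by the third value would give a most-positive to most-negative ranking if one existed; speakers with about the same ratio of positive to negative words end up with about the same net tone, which is the situation described above.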

Positive words in the campaign

Yesterday I posted about the content of the speeches of the campaigners for the 2012 presidential election cycle: the Republican contenders and President Obama. Today I have similar results for the use of positive words.

Here are the speeches:

The figure should be interpreted like this: distance from the origin indicates intensity of positive word use; direction indicates the use of a different set of positive words. So President Obama is much more positive than the Republican contenders, of which Gingrich is noticeably more positive than the rest. These are only based on the use of positive words so a placement close to the origin should be interpreted as the absence of positive words, not any kind of negativity (stay tuned). In other words, speeches near the origin are not positive (they could be either neutral or negative but this analysis can’t differentiate).

Some of the positive words associated with President Obama are: “profitable”, “creative”, “efficiency” and “outstanding”.

Some of the positive words associated with Gingrich are: “tremendous”, “optimistic”, “gains”, “happiness”, and “positive” itself.

You can see why the Republican approval numbers are dropping — people pick up on the tone of speeches, and they are attracted to positive language — which they aren’t getting. Even Gingrich’s positive words are mostly about the improvement (perceived) in his chances, not in the wider US situation.

Content in Presidential Campaign Speeches

Last week I posted details of the level of “persona deception” among the Republican presidential candidates and President Obama. Persona deception measures how much a candidate is trying to present himself as “better” in some way than he really is. This is the essence of campaigning — we don’t elect politicians based on the quality of their proposals; and we don’t fail to elect them because they tell us factual lies. Almost everything is based on our assessment of character which we get from appearance and behavior, and also from language.

Today I’ll post a description of the different content of the speeches so far in 2012. This is less informative than levels of deception, but it does give some insight into what candidates are thinking is of interest or importance to the voters they are currently targeting. Here is an overview of the topic space:

You can see that most of the Republican candidates are talking about very similar things. In fact, the speeches in the upper right-hand corner are associated strongly with words such as “greatness”, “freedom”, “opportunity”, “principles” and “prosperity” — all very abstract nouns without much content that could come back to haunt them.

Gingrich’s speeches towards the bottom of the figure are quite different, although still associated with quite abstract words: “bureaucracy”, “media”, “pipeline”, “elite”, “establishment”. These are almost all things that he is against — stay tuned for an analysis of negative word use later in the week.

Obama’s speeches, on the left-hand side, are heavily oriented to manufacturing associated with words such as: “cars”, “hi-tech”, “plant”, “oil”, “demand”, “prices”.

What a candidate chooses to talk about seems to be a mix of his personal hobbyhorses (at the time) and some judgement of what issues are of interest to the general public, or at least which can create daylight between one candidate’s position and the others. From this perspective, Gingrich separates himself from the other Republicans quite well. Somewhat surprisingly, Ron Paul’s content is not very different from that of Romney and Santorum. Probably this can be accounted for as a function of the three of them all trying to appeal to a very similar segment of the base. Whether Gingrich is consciously trying to address different issues, or whether his history or personality compel him to is not clear.

2012 US Election, Republicans plus President Obama

Yesterday I posted details about the levels of persona deception in the speeches by the Republican candidates since the beginning of 2012. In striking contrast to the 2008 cycle, the speeches fall along a single axis, indicating widespread commonalities in the way that they use words, particularly the words of the deception model.

Today I’ve included President Obama’s speeches this year in the mix. I’ve tried to select only those speeches where there was an audience. Of course, for a sitting president, the distinction between an ordinary speech and a campaign speech is difficult to draw. Almost all of these are labelled as campaign events at whitehouse.gov.

Here is the plot of the persona deception levels, with Obama’s speeches added in magenta.

Generally speaking, Obama’s levels of persona deception (see yesterday’s post to be clear on what this means) are in the low range compared to the Republican presidential candidates. This is quite different from what happened in the 2008 cycle, where his levels were almost always well above those of McCain and Clinton. It’s not altogether surprising, though. First, he can no longer be the mirror in which voters see what they want to see since he has a substantial and visible track record. Second, he doesn’t have to try as hard to project a persona (at least at this stage of the campaign) since he has no competitor. I expect that his values will climb as the campaign progresses, particularly after the Republican nominee becomes an actual person and not a potential one.

The interesting point is the outlier at the top left of the figure. This is Obama’s speech to AIPAC. Clearly this is not really a campaign speech, so the language might be expected to be different. On the other hand, if it were projected onto the single-factor line formed by the other speeches, it would be much more towards the deceptive end of that axis. Since the underlying model detects all kinds of deception, not just that associated with persona deception in campaigns, this may be revealing of the attitude of the administration to the content expressed in this speech.

Republican presidential candidates — first analysis of persona deception

Regular readers of this blog will know that I carried out extensive analysis of the speeches of the contenders in the 2008 US presidential election cycle (see earlier postings). I’m now beginning similar analysis for the 2012 cycle, concentrating on the Republican contenders for now.

You will recall that Pennebaker’s deception model enables a set of documents to be ranked in order of their deceptiveness, detected via changes in the frequency of occurrence of 86 words in four categories: first-person singular pronouns, exclusive words, negative-emotion words, and action verbs. Words in the first two categories decrease in the presence of deception, while those in the last two categories increase. The model only allows for ranking, rather than true/false determination, because “increase” and “decrease” are always relative to some norm for the set of documents being considered.
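To make the ranking idea concrete, here is a minimal sketch. It is not the model itself: the four word lists are tiny hypothetical stand-ins for the 86 words, and standardizing each category against the mean of the document set (a z-score) and adding the categories with equal weight are my assumptions about one simple way to realize "relative to some norm".

```python
import re
from statistics import mean, pstdev

# Hypothetical stand-ins for the four categories. Each entry is
# (words, sign): sign is -1 when the category decreases under deception
# and +1 when it increases.
CATEGORIES = {
    "first_person": ({"i", "me", "my", "mine", "myself"}, -1),
    "exclusive": ({"but", "except", "without", "or"}, -1),
    "negative_emotion": ({"hate", "afraid", "worthless", "sad"}, +1),
    "action": ({"go", "carry", "run", "lead"}, +1),
}

def category_rates(text):
    """Fraction of a document's tokens that fall in each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = max(len(tokens), 1)
    return {name: sum(t in words for t in tokens) / total
            for name, (words, _) in CATEGORIES.items()}

def deception_scores(documents):
    """Score each document relative to the norm of the whole set."""
    rates = [category_rates(d) for d in documents]
    scores = [0.0] * len(documents)
    for name, (_, sign) in CATEGORIES.items():
        column = [r[name] for r in rates]
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue  # this category does not vary, so it cannot rank
        for i, value in enumerate(column):
            scores[i] += sign * (value - mu) / sigma
    return scores

def rank_by_deception(documents):
    """Document indices ordered from most to least deceptive."""
    scores = deception_scores(documents)
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
```

Because every score is measured against the mean of the set, adding or removing documents changes all of the scores. That is why the output is only meaningful as a ranking within one set of documents.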

How does this apply to politics? First of all, the point isn’t to detect when a politician is lying (Cynical joke: Q: How do you tell when a politician is lying? A: His lips are moving). Politicians tell factual lies, but this seems to have no impact on how voters perceive them, perhaps because we’ve come to expect it. Rather, the kind of deception that is interesting is the kind where a politician is trying to present him/herself as a much better person (smarter, wiser, more competent) than they really are. This is what politicians do all the time.

Why should we care? There are two reasons. The first is that it works — typically the politician who is able to deliver the highest level of what we call “persona deception” gets elected. Voters have to decide on the basis of something, and this kind of presentation as a great individual seems to play more of a role than, say, actual plans for action.

Second, though, watching the changes in the levels of persona deception gives us a window into how each candidate (and campaign) is perceiving themselves (and, it turns out, their rivals) from day to day. Constructing and maintaining an artificial persona is difficult and expensive. Levels of persona deception tend to drop sharply when a candidate becomes confident that they’re doing well; and when some issue surfaces about which they don’t really have a persona opinion because, apparently, it takes time to construct the new piece.

So, with that preliminary, on to some results.

The figure shows the speeches in a space where speeches with greater persona deception (spin) are further to the right, and those with less persona deception are further to the left. Ron Paul shows the lowest level of persona deception which is not surprising — nobody has ever accused him of trying to be what he is not. In contrast, Romney shows the highest level of persona deception — again not surprising as he has had to try hardest to make himself appealing to voters. Note that this also predicts that he will do well. Both Gingrich and Santorum occupy the middle ground; both are running on a very overt track record and are not trying as hard to make themselves seem different from who they are. Indeed, candidates with a strong history tend to have lower levels of persona deception simply because it’s very difficult to construct a new, more attractive persona when you already have a strong one. (The two points vertically separated from the rest are the result of a sudden burst of using “I’d” in these two speeches.)

The following figures break out the temporal patterns for the four candidates:

What’s striking about Romney is how much the level of persona deception changes from speech to speech. In the last election cycle, this wasn’t associated with audience type or recent success but seemed to be much more internally driven. This zig-zag pattern is much more the norm than a constant level of persona deception — some mystery remains.

Language in Presidential Elections — 2012 Season Opener

Readers of this blog will know that we spent a lot of time analyzing the speeches of the U.S. presidential candidates in the 2008 election. Our primary interest was in the use of the deception model, a linguistic/textual model of how freeform language changes when the speaker/writer is being deceptive.

In the political arena, factual deception, saying things that just ain’t so, plays very little role, perhaps because voters have very low expectations of politicians in this area. What we call persona deception, presenting oneself as a better, wiser, more powerful, more able, more knowledgeable person than one really is, is the heart of successful campaigning. It turns out that the deception model captures deception across the whole range from factual to persona deception, so it gives us a lens to look at candidates and campaigns. What’s more, because language generation is almost entirely subconscious, this lens is hard to fool.

The most important skill candidates and their campaigns have is the ability to reach out to potential voters to convince them that they are better than the other possibilities. The language that they use is an important channel, especially in settings where everyone is conservatively dressed, and standing behind a podium that conceals most of their body language, as the Republican presidential field was in Iowa yesterday.

Strong candidates understand, at least instinctively, that they are not making arguments to convince voters, but presenting themselves as more compelling human beings. Our analysis of the speeches of candidates in the 2008 U.S. presidential election showed that candidates use three different kinds of speeches: blue skies speeches that promise generically good things and could be delivered interchangeably by any candidate – they are aimed at a wide audience; track record speeches that use past achievements to imply special qualifications for future achievements – they are aimed at swing voters; and manifesto speeches that describe a candidate’s personal qualities directly – they are aimed at a candidate’s base and reinforce common identity. But in all three cases, it’s not the content of the speech that matters, but what it implies about the speaker.

Our analysis in the last election cycle showed that Obama was by far the best at presenting himself as a wonderful person, and many voters, and certainly many in the media, projected onto the persona positive qualities that were perhaps not there. Interestingly, yesterday was the first time I have seen open Democratic buyer’s remorse about electing Obama, something I predicted would happen from the analysis we did.

The Republican candidates’ debate in Ames showed what a shaky grasp many of the candidates have on how to be a convincing candidate. Of course, this venue was a difficult one. Its overt purpose was for candidates to explain themselves to the local Republican base ahead of the Ames Straw Poll, which would have required largely manifesto content; but national television coverage made it an unmissable opportunity to reach out to a wider, but much more diverse audience, suggesting track record content. Blue skies content is always dangerous in the early stages of a campaign because grand but potentially unwise statements can come back to haunt a candidate.

Manifesto content was indeed popular – for example, we learned how many children almost every candidate has – typical content aimed at the base (“I’m a parent just like you”). Several candidates also tried for track record content, but got it quite wrong. The purpose of a track record speech is not for candidates to read their resumes to the audience; it’s to make the argument “I was able to do A, so you can trust me to be able to do similar-but-larger B” and this second part was notably absent.

Voters also want candidates to be sincere — recall the famous quotation “The secret of success is sincerity. Once you can fake that you’ve got it made” (Jean Giraudoux). This is not just a cute quotation; this is what good politicians are able to do. In Iowa, this was another area where almost everyone stumbled. It was clear that most of the candidates had not only prepared talking point responses to probable questions, but had also rehearsed actual answers. Delivering from a prepared and memorized script and seeming sincere is a difficult business, and actors who can do it reliably command high rewards. Most of the candidates failed at seeming sincere. Several managed the worst of both worlds by trying to combine their prepared scripts with some ad libbing and came across as quite incoherent. One of the reasons for Gingrich’s strong showing is that he stayed away from scripts and delivered his answers as if he had just thought of them. Huntsman and Romney, in contrast, were especially wooden.

When humans listen to humans, the content matters. But when character is the issue, other aspects of language matter more. Much language generation is subconscious, and therefore beyond a candidate’s control. This is good for voters because it means we can sometimes see through to the real person no matter how sophisticated their speech writers and spin doctors.

How do I demonstrate that I am me?

The question of identity, how the question in the title gets answered, is one with an interesting history; and one that is changing again at the moment.

For much of human history, identity was almost completely determined by the fact that a person was born and grew up in a community where everyone knew them, and never moved far from this community. This is still true in many parts of the world, but was surprisingly true in the developed world until quite recently.

Things changed when migration to cities started in a big way, in Western countries perhaps around the 16th century and accelerating since then. Someone who moved to a city could become anyone they wanted as long as they kept away from people from the same general area as they were, who might know them or know of them. This was harder than it seemed, mostly because of the tendency of people with the same origin to live contiguously when they arrived in a city (so if you were from X but didn’t live in the X area, you automatically attracted attention). This ability to assume new identities was grist to the mill of detective stories up to about 100 years ago (notably Austin Freeman).

In the last 100 years, governments have become the guarantors of identity because of the requirement to collect taxes, mostly income taxes; and, for an increasing number of people, because of the need to cross borders. So governments issue identity documents that are tied to a single person via some kind of link, perhaps a biometric or even an address. And, for most people, this is where things stand now.

But there are new forms of identity beginning to be created, and new ways to blur identities as well.

I have had a web page with my photo on it, and links to my papers, and so on, since the web began. Copies of this web page have been periodically archived, at moments that I can’t control, by the Wayback Machine and probably several other places as well. If I want to prove my identity, I can now do it without any government intervention by pointing to these copies of my web page which have information that links them uniquely to me. For many people, their Facebook or LinkedIn profile pages would do the same thing if they were publicly archivable. So identity is once again moving away from something that is government mediated to something that is more decentralized and community based.

On the other side of the coin, governments and others are actively creating artificial personas, sometimes called sock puppets. These personas are controlled by a real person, but one person can control many of them, and the postings of each persona don’t need to be the ones that the controller would naturally make. In other words if, on the internet, nobody can tell you’re a dog, it follows that nobody can tell you’re not a construct either.

In order to make these sock puppets realistic, a back story has to be created for each one; increasingly, this means that they have to have a created trail in places where this might be looked for. Once upon a time, intelligence organizations would go into official records and create entries for non-existent people; this is inherently difficult, especially in records that are owned by other governments (remember, governments validated identities); so often identities of people who had died were used as starting points. I expect we’ll see that same thing happening in the online world.

But there’s an important difference: while governments can go back and change history embodied in records, neither they nor anyone else can change the history embedded in web sites that, at random times, take a snapshot of some part of the web. So creating realistic sock puppets is actually really difficult.

There’s also the issue of language: one controller running multiple sock puppets cannot avoid using detectably similar language patterns for all of them; and eventually this will make it possible to detect artificial personas.
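As a rough illustration of what comparing language patterns across accounts could look like, here is a minimal sketch. It is not a working detector: the short function-word list, the `similar_pairs` helper and the 0.9 threshold are all hypothetical, and comparing function-word profiles by cosine similarity is just one common stylometric choice that I am assuming here.

```python
import math
import re
from collections import Counter

# Hypothetical short list of function words; real stylometry uses far more.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "but", "i", "it"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_pairs(posts_by_account, threshold=0.9):
    """Pairs of accounts whose function-word profiles are suspiciously close."""
    names = sorted(posts_by_account)
    profiles = {n: profile(" ".join(posts_by_account[n])) for n in names}
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            similarity = cosine(profiles[a], profiles[b])
            if similarity >= threshold:
                pairs.append((a, b, similarity))
    return pairs
```

With only a few short posts per account almost any two writers can look alike, so a flagged pair is at best a prompt for a closer look, not evidence of a shared controller.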

Metaphors and counterterrorism

The Intelligence Advanced Research Projects Activity (IARPA) has a call out for proposals to develop a system that will extract metaphors from text. The assumption is that the metaphors that are used in a document, or a community, reflect a way of viewing and organizing the world that can provide a higher-level way to understand other (sub)cultures. This seems like a very difficult challenge, which is exactly the kind of thing these funding agencies derived from DARPA are supposed to take on.

I remember reading a paper that Charles Williams presented to the Inklings (the Oxford society that included C.S. Lewis, Tolkien, and other high fliers) in which he talked about just how difficult it is to understand what a metaphor does (I haven’t been able to find either paper or reference). Similes are (by comparison) straightforward; when we say “A is like B” we draw attention to or highlight some aspect of B that is similar to that of A, and therefore emphasize some aspect of A, perhaps one that isn’t obvious.

A metaphor is a much more difficult object. When we say “A is B” we could take the view that this is just a more obscure kind of simile, in which the reader/hearer is invited to conceive of the possible similarity without a hint from the writer/speaker. But Williams argues, and I agree, that more is going on here. For a start, metaphors are not symmetric: if I say “A is B” it’s often nonsense to say “B is A” whereas similes usually are symmetric. Often there is no obvious and straightforward way to reduce a metaphor to a simile, that is there is no small set of properties common to A and B. And yet metaphors can be powerful.

There is a little relevant work in psychology, most of it associated with Judy DeLoache and what’s called the Dual Representation Hypothesis. Roughly speaking, the idea is that brains are well-equipped to represent symbols and the things they denote and to map computations on the symbols to computations on the denoted things in usable ways (apologies to psychologists for this mangled and computational perspective).  This goes some way to explain abstract reasoning, with some very nice experiments with young children showing when various levels of sophistication kick in; but it might also provide some explanatory power for metaphors. Unfortunately, there is some evidence that the more black-box the symbol, the more usable it is, which is evidence against this being a useful explanation for metaphors.

I won’t be applying for funding to work on this — but I’ll be watching the results with interest.

And Williams’ conclusion? That metaphors are something like a legal fiction, which I didn’t find very convincing at the time I read the article, and still don’t.

Finding significance automatically

In a world where lots of data is collected and available, the critical issue for intelligence, law enforcement, fraud, and cybersecurity analysts is attention.

So the critical issue for tools to support such analysts is focus: how can the most significant and interesting pieces of data/information/knowledge be made the easiest to pay attention to?

This isn’t an easy issue to address for many reasons, some of which I talked about a few posts ago in the context of connecting the dots. But the fundamental problems are: (1) significance or interestingness are highly context dependent, so where to focus depends, in a complex way, on what the analyst already knows and understands; and (2) every new piece of information has the potential to completely alter the entire significance landscape in one hit.

Many existing tools are trying, underneath, to address the issue of focus indirectly, by providing ways for analysts to control their own focus more directly. For example, there are many analyst platforms that allow available information to be sliced and diced in many different ways. These allow two useful things to be done: (1) dross (the guaranteed insignificant stuff) can be removed (or at least hidden from sight); and (2) the rest of the data can be rearranged in many different ways in the hope that human pattern-recognition skills can be brought to bear to find significance.

But it seems like a good idea to try and address the significance issue more directly. This has motivated a couple of the research projects I’m involved with:

  • The ATHENS system tries to find information on the web that is probably new to the user, but which s/he is well-positioned to understand; in other words, the new information is just over the horizon from the user’s current landscape. It builds this new information starting from a query that allows the user to provide the current context;
  • Finding anomalies in large graphs. Lots of data is naturally represented relationally as a graph, with nodes representing some kind of entities, and edges representing some kind of (weighted) similarity between some subset of the nodes (e.g. social networks). Graphs are difficult to work with because they don’t really have a representation that humans can work with — even drawing nice pictures of them tends to (a) occlude chunks once the graph gets big enough, and (b) hide the anomalous structure in the corners because the nice representation is derived from the big structure (e.g. the simple bits of the automorphism group). We’ve developed some tools that find and highlight anomalous regions, anomalous in the sense that, if you were to stand at their nodes and look at the landscape of the rest of the graph, it would look unusual.
  • Finding anomalies in text caused by either a desire to obfuscate the content that’s being talked about, or caused by internal mental state that’s unusual — being deceptive, or highly tense, for example.

Some other people are working in similar directions. For example, there is some work aimed at using social processes to help discover significance. In a sense, sites like Slashdot work this way — each user provides some assessment of quality/importance of some stories, and in return gets information about the quality/importance of other stories. This is also, of course, how refereed publications are supposed to work. The challenge is to contextualize this idea: what makes an object high quality/important for you may not mean anything to me. In other words, most significance lies somewhere on the spectrum from universal agreement to completely taste-based, and it’s hard to tell where, let alone compute it in a practical way.

Deception scores for the UEA emails

I’ve also calculated the deception scores for the UEA “climategate” emails, using the same methodology that I’ve written about in the context of the speeches of presidential candidates.

This doesn’t (yet) give any great results. This is partly because deception scores can only be computed for sets of similar documents. The UEA emails, however, fall into two broad classes: simple emails, and discussions and suggestions about more formal documents (papers and grant proposals). The language in these two classes is quite different, which makes them difficult to compare. For example, the base rates of first-person singular pronouns are very different.
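To make the base-rate point concrete, here is a minimal sketch of how the rate of first-person singular pronouns could be compared between two classes of text. The pronoun list, the tokenizer, and the example sentences are all assumptions made for illustration; they are not the word lists or the data used for the analysis in this post.

```python
import re

# Assumed pronoun forms (including common contractions); the actual
# word list behind the analysis is not given in the post.
FIRST_PERSON_SINGULAR = {
    "i", "me", "my", "mine", "myself",
    "i'm", "i've", "i'll", "i'd",
}

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def pronoun_rate(text):
    """Fraction of tokens that are first-person singular pronouns."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in FIRST_PERSON_SINGULAR)
    return hits / len(tokens)

def base_rate(documents):
    """Mean pronoun rate over one class of documents."""
    rates = [pronoun_rate(d) for d in documents]
    return sum(rates) / len(rates) if rates else 0.0

# Invented toy examples, not taken from the UEA emails.
simple_emails = [
    "I will send my comments tomorrow.",
    "Can you call me when I am back?",
]
formal_drafts = [
    "The proposal describes the methods used.",
    "Section two reviews the earlier results.",
]

print(base_rate(simple_emails), base_rate(formal_drafts))
```

When the two base rates differ this much, a score that depends on pronoun frequency will mostly separate the classes rather than rank documents within them, which is why the classes are hard to compare directly.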

What I have done is to see whether there are any patterns in deception scores with time. A strong change in either class of email should show up as a variation of score with time, which might be visible in a plot. The result is shown below, with the deception score axis running from right (low) to left (high), and the markers getting lighter with the passage of time.

Deception scores of UEA emails

The only thing that strikes me so far is that many of the emails with low deception scores are older ones. This might be taken to indicate some kind of change in the language patterns of these email users.

The released emails are a small and not very random set of all of the emails sent by these individuals. So not too much should be read into this plot.

Patterns of word usage in the UEA climate emails

I’m always pleased to see examples of real emails because they can act as testbeds for various textual analysis techniques. I’ve begun to analyse the “climategate” emails from the University of East Anglia. The figure below shows a plot of the structure of the words used. (This is quite a quick and dirty analysis — I didn’t try to remove email headers or otherwise clean up the content of the files.)

There are three parts to the structure. The arm to the right is an artifact: several Word files were included in the bodies of emails, rather than as attachments, so my extraction software sees them as part of the text. This can be fixed, but will take me some time.

The interesting property is the longitudinal structure from top to bottom in the figure. The phrases at the bottom are all content, while the phrases at the top are all identifiers of people and places (admittedly hard to see). Since the analysis algorithms know nothing of the semantics of emails, and are based purely on “bag of words” style analysis, this is an interesting, and unexpected, outcome.
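For readers unfamiliar with the term, a “bag of words” representation keeps only how often each word occurs in a document and discards order and grammar. The sketch below shows the idea with a small document-term matrix. The function names and the tokenizer are hypothetical, and this is not the software used to produce the figure; how the plot itself was derived from such counts is not described here.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring order, case, and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row of counts per document."""
    bags = [bag_of_words(d) for d in documents]
    vocabulary = sorted(set().union(*bags))
    # A Counter returns 0 for words a document does not contain.
    matrix = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, matrix

# Word order is lost: these two texts have identical representations.
same = bag_of_words("the data show the trend") == bag_of_words(
    "trend the show data the"
)
print(same)

vocabulary, matrix = document_term_matrix(["a b a", "b c"])
print(vocabulary, matrix)
```

Because the representation knows nothing about what the words mean, any separation between content phrases and names has to come from co-occurrence counts alone.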

Spin — the technical basis

The work on spin that I’ve written about here is based primarily on the work of Pennebaker’s group at the University of Texas, Austin. The primary reference is here. The model is based on empirical studies of deception in settings where the ground truth is known, and has, by now, been validated many times. When someone is being deceptive, there will be characteristic changes in the ways they use certain words. Deception here means, of course, saying things that the speaker does not believe to be (entirely) true, not things that are factually incorrect.

Of course, this is domain dependent, so we can’t judge how deceptive politicians are in comparison to, say, used-car salesmen or nuns.

The use of the word “spin” rather than deception acknowledges the fact that there are differences between explicit intent to deceive and implicit, unconscious desire to present oneself as better (along some set of dimensions) than one actually is. This kind of self-improvement happens in job interviews, dating, and in politics.

And we don’t condemn someone for being deceptive when they make an initial offer in a negotiation, but the same kind of signals will appear in their language use.

The interesting thing is not so much that politicians try to appeal to as many people as possible, at the expense of strict accuracy, but that there are differences in how much this happens — changing over short time scales for a single person, and at longer time scales for a campaign; and that there are systematic differences between campaigns. There is a great deal of evidence that the properties mediated by changes in language patterns are not under conscious control (see e.g. Pennebaker and Chung) so they provide an insight into campaigns that is hard for the campaigns to obscure.

Can a speaker fool the deception model?

If deception can be detected in text, and the properties that signal deception are known (more or less), can a clever speaker or author use this knowledge to come across as truthful?

The short answer is No. The reason is that language production is a deeply unconscious process. Although we can decide consciously what we would like to say, we have much less control over how we say it, much less even than we think we do.

A speaker, with some practice, could start to insert more first-person singular pronouns and exclusive words into their speaking, but only by concentrating. But concentration creates other problems: it tends to make the whole delivery sound more stilted, and the processing resources spent thinking about the form of the speech come at the expense of the content. In other words, to make the speech sound less deceptive, it’s almost necessary to make it more bland, and therefore less effective for whatever purpose it is being made in the first place.

In written text, there is more opportunity to work on the signals in the text to make it seem less deceptive. For example, an author could use our deception detection software to measure the deceptiveness of the text, and change a few words to improve it.

There are two problems with this. First, it’s always better to think of deceptiveness in the context of a set of documents from the same domain. It may not be obvious what the norms for a domain are, and so how much the particular document needs to be adjusted. Second, explicitly manipulating word frequencies tends to create unusual documents because that’s not how documents get edited. This kind of tinkering runs the risk of creating an alternative signature which may also be detected by a different kind of analysis.
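The first problem can be illustrated with a rough sketch of a score that only has meaning relative to a set of documents. It uses just the two signals mentioned above (first-person singular pronouns and exclusive words), standardizes each across the set, and treats lower-than-usual rates as more deceptive. The word lists, the equal weighting, and the function names are assumptions for illustration only; this is not the deception detection software referred to in this post, and the published model uses more word categories than these two.

```python
import re
from statistics import mean, pstdev

# Tiny illustrative word lists (assumed, not the real category lists).
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
EXCLUSIVE = {"but", "except", "without", "unless", "however"}

def rate(tokens, words):
    """Fraction of tokens that belong to the given word list."""
    return sum(t in words for t in tokens) / len(tokens) if tokens else 0.0

def z_scores(values):
    """Standardize values against the mean and spread of the whole set."""
    if not values:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def relative_scores(documents):
    """Higher score = fewer of both markers than is normal for this set."""
    token_lists = [re.findall(r"[a-z']+", d.lower()) for d in documents]
    fp = z_scores([rate(t, FIRST_PERSON) for t in token_lists])
    ex = z_scores([rate(t, EXCLUSIVE) for t in token_lists])
    return [-(a + b) for a, b in zip(fp, ex)]

# Invented example sentences.
docs = [
    "I think my result holds, but I am not sure.",
    "The result holds and the work is done.",
]
scores = relative_scores(docs)
print(scores)
```

Because every score is computed from the mean and spread of the set, adding or removing documents changes all of them. An author editing one document in isolation has no way of knowing what the target rates for the domain should be.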

The bottom line is that language carries all sorts of information at different levels of abstraction, and it’s all consistent when the language was generated in an ordinary way. Messing around with pieces of language as if they were independent quickly breaks some or all of this global consistency.


