Posts Tagged 'deception'

Results from the first Democratic debate

The debate held on Tuesday night pitted one well-known figure (Hillary Clinton) against one up-and-coming figure (Sanders) and three others with no name recognition except among the wonkiest. The differences in exposure and preparation were obvious. I can’t see that it made any difference to anyone’s opinions.

But it remains interesting to see how well each person did at presenting a persona. Extremely well known politicians do not usually have the luxury of presenting themselves with a new, improved persona because the old one is so well known, so it’s common to find that persona deception scores are low for such candidates. For those who aren’t well-known, the strength of their persona is a blend of how well they can do it personally, and how big the gap is between their previous self-image and the persona that they are trying to project. A relatively unknown candidate with a high persona deception score, therefore, is likely to do well; one with a low score probably will not.

Here are the results from this debate:

The red and green points represent artificial word use corresponding to moderately high and moderately low levels of persona deception. Clinton, as expected (and consistent with my analysis in the 2008 cycle), has low levels of persona deception. Sanders’s levels are in the mid-range. Chafee is sincere, but this won’t help him with his current level of recognition. O’Malley has the highest level of persona deception, which is a positive indicator for him (for what it’s worth in this crowd). Webb is also in the midrange, but his language use is quite different from that of Sanders.

Results from second Republican debate

Regular readers will know that, especially in a crowded marketplace, politicians try to stand out and attract votes by presenting themselves in the best possible light that they can. This is a form of deception, and carries the word-use signals associated with deception, so it can be measured using some straightforward linguistic analysis.

Generally speaking, the candidate who achieves the highest level of this persona deception wins, so candidates try as hard as they can. There are, however, a number of countervailing forces. First, different candidates have quite different levels of ability to put on this kind of persona (Bill Clinton excelled at it). Second, it seems to be quite exhausting, so that candidates have trouble maintaining it from day to day. Third, the difficulty depends on the magnitude of the difference between the previous role and the new one that is the target of a campaign: if a vice-president runs for president, he is necessarily lumbered with the persona that’s been on view in the previous job; if not, it’s easier to present a new persona and make it seem compelling (e.g. Obama in 2008). Outsiders therefore have a greater opportunity to re-invent themselves. Fourth, it depends on the content of what is said: a speech that’s about pie in the sky can easily present a new persona, while one that talks about a candidate’s track record cannot, because it drags the previous persona into at least the candidate’s mind.

Some kinds of preparation can help to improve the persona being presented — a good actor has to be able to do this. But politicians aren’t usually actors manqué so the levels of persona deception that they achieve from day to day emerge from their subconscious and so provide fine-grained insights into how they’re perceiving themselves.

The results from the second round of debates are shown in the figure:


The red and green points represent artificial debate participants who use all of the words of the deception model at high frequency and low frequency respectively.

Most of the candidates fall into the band between these two extremes, with Rand Paul showing the lowest level of persona deception (which is what you might expect). The highest levels belong to Christie and Fiorina, who had obviously prepped extensively and were regarded as having done well; and to Jindal, who is at roughly the same level, but via completely different word use.

Comparing these to the results from the first round of debates, there are two obvious changes: Trump has moved from being at the low end of the spectrum to being in the upper-middle; and Carson has moved from having very different language patterns from all of the other candidates to being quite similar to most of them. This suggests that both of them are learning to be better politicians (or being sucked into the political machine, depending on your point of view).

The candidates in the early debate have clustered together on the left-hand side of the figure, showing that there was a different dynamic in the two different debates. This is an interesting datum about the strength of verbal mimicry.

Canadian election 2015: Leaders’ debate

Regular readers will recall that I’m interested in elections as examples of the language and strategy of influence — what we learn can be applied to understanding jihadist propaganda.

The Canadian election has begun, and last night was the first English-language debate by the four party leaders: Stephen Harper, Elizabeth May, Thomas Mulcair, and Justin Trudeau. Party leaders do not get elected directly, so all four participants had trouble wrapping their minds around whether they were speaking as party spokespeople or as “presidential” candidates.

Deception is a critical part of election campaigns, but not in the way that people tend to think. Politicians make factual misstatements all the time, but it seems that voters have already baked this in to their assessments, and so candidates pay no penalty when they are caught making such statements. This is annoying to the media outlets that use fact checking to discover and point out factual misstatements, because nobody cares, and they can’t figure out why.

Politicians also try to present themselves as smarter, wiser, and generally more qualified for the position for which they’re running, and this is a much more important kind of deception. In a fundamental sense, this is what an election campaign is — a Great White Lie. Empirically, the candidate who is best at this kind of persona deception tends to win.

Therefore, measuring levels of deception is a good predictor of the outcome of an election. Recall that deception in text is signalled by (a) reduced use of first-person singular pronouns, (b) reduced use of so-called exclusive words (“but”, “or”) that introduce extra complexity, (c) increased use of action verbs, and (d) increased use of negative-emotion words. This model can be applied by counting the number of occurrences of these words, adding them up (with appropriate signs), and computing a score for each document. But it turns out to be much more effective to add a step that weights each word by how much it varies in the set of documents being considered, and computing this weighted score.
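The scoring recipe above can be sketched in code. This is a minimal illustration, not the actual model: the word lists are tiny stand-ins for the full Pennebaker categories, and the variance weighting is one plausible reading of “weight each word by how much it varies” across the document set.

```python
# Hedged sketch of signed, variance-weighted deception scoring.
# Word lists are illustrative stand-ins, not the real categories.
from statistics import pstdev

FIRST_PERSON = {"i", "me", "my", "mine"}           # (a) reduced in deception
EXCLUSIVE    = {"but", "or", "except", "without"}  # (b) reduced in deception
ACTION       = {"go", "going", "take", "taking"}   # (c) increased in deception
NEG_EMOTION  = {"hate", "worthless", "enemy"}      # (d) increased in deception

VOCAB = sorted(FIRST_PERSON | EXCLUSIVE | ACTION | NEG_EMOTION)
SIGN = {w: (+1 if w in ACTION | NEG_EMOTION else -1) for w in VOCAB}

def rates(doc):
    """Rate of each model word per 1000 tokens of the document."""
    tokens = doc.lower().split()
    n = max(len(tokens), 1)
    return [tokens.count(w) * 1000.0 / n for w in VOCAB]

def deception_scores(docs):
    """Score per document; higher means more of the deception signal."""
    table = [rates(d) for d in docs]                # documents x words
    weights = [pstdev(col) for col in zip(*table)]  # per-word variability
    return [sum(r * SIGN[w] * wt for r, w, wt in zip(row, VOCAB, weights))
            for row in table]

scores = deception_scores([
    "I think my plan is good but I worry",
    "We are going to take the fight to the enemy",
])
```

The first sentence is heavy on first-person and exclusive words (low-deception signals), the second on action and negative-emotion words, so the second scores higher.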

So, I’ve taken the statements by each of the four candidates last night, and put them together into four documents. Then I’ve applied this deception model to these four documents, and ranked the candidates by levels of deceptiveness (in this socially acceptable election-campaign meaning of deceptiveness).

This figure shows, in the columns, the intensity of the 35 model words that were actually used, in decreasing frequency order. The rows are the four leaders in alphabetical order: Harper, May, Mulcair, Trudeau; and the colours are the intensity of the use of each word by each leader. The top few words are: I, but, going, go, look, take, my, me, taking, or. But remember, a large positive value means a strong contribution of this word to deception, not necessarily a high frequency — so the brown bar in column 1 of May’s row indicates a strong contribution coming from the word “I”, which actually corresponds to low rates of “I”.

This figure shows a plot of the variation among the four leaders. The line is oriented from most deceptive to least deceptive; so deception increases from the upper right to the lower left.

Individuals appear in different places because of different patterns of word use. Each leader’s point can be projected onto this line to generate a (relative) deception score.
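The projection step can be sketched as follows. All coordinates here are invented purely for illustration; only the geometry matches the description, with the two artificial endpoint documents defining the deception axis.

```python
# Hedged sketch: project each leader's point onto the axis joining the
# artificial low- and high-deception documents to get a relative score.
# All coordinates are made up for illustration.
import math

low_end  = (-2.0, -1.5)   # artificial least-deceptive document
high_end = ( 2.0,  1.5)   # artificial most-deceptive document

ax = (high_end[0] - low_end[0], high_end[1] - low_end[1])
norm = math.hypot(*ax)
unit = (ax[0] / norm, ax[1] / norm)   # unit vector along the deception axis

def deception_score(point):
    """Relative score: scalar projection of the point onto the axis."""
    dx, dy = point[0] - low_end[0], point[1] - low_end[1]
    return dx * unit[0] + dy * unit[1]

leaders = {"May": (1.8, 1.2), "Mulcair": (-1.0, -0.8)}
scores = {name: deception_score(p) for name, p in leaders.items()}
```

A point nearer the high-deception endpoint projects to a larger score, which is all the ranking needs.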

May appears at the most deceptive end of the spectrum. Trudeau and Harper appear at almost the same level, and Mulcair appears significantly lower. The black point represents an artificial document in which each word of the model is used at one standard deviation above neutral, so it represents a document that is quite deceptive.

You might conclude from this that May managed much higher levels of persona deception than the other candidates and so is destined to win. There are two reasons why her levels are high: she said much less than the other candidates and her results are distorted by the necessary normalizations; and she used “I” many fewer times than the others. Her interactions were often short as well, reducing the opportunities for some kinds of words to be used at all, notably the exclusive words.

Mulcair’s levels are relatively low because he took a couple of opportunities to talk autobiographically. This seems intuitively to be a good strategy — appeal to voters with a human face — but unfortunately it tends not to work well. To say “I will implement a wonderful plan” invites the hearer to disbelieve that the speaker actually can; saying instead “We will implement a wonderful plan” makes the hearer’s disbelief harder because they have to eliminate more possibilities; and saying “A wonderful plan will be implemented” makes it a bit harder still.

It’s hard to draw strong conclusions in the Canadian setting because elections aren’t as much about personalities. But it looks as if this leaders’ debate might have been a wash, with perhaps a slight downward nudge for Mulcair.

Three kinds of knowledge discovery

I’ve always made a distinction between “mainstream” data mining (or knowledge discovery or data analytics) and “adversarial” data mining — they require quite distinct approaches and algorithms. But my work with bioinformatic datasets has made me realise that there are more of these differences, and the differences go deeper than people generally understand. That may be part of the reason why some kinds of data mining are running into performance and applicability brick walls.

So here are three distinct kinds of data mining, with some thoughts about what makes them different:

1. Modelling natural/physical, that is clockwork, systems.
Such systems are characterised by apparent complexity, but underlying simplicity (the laws of physics). Such systems are entropy minimising everywhere. Even though parts of such systems can look extremely complex (think surface of a neutron star), the underlying system to be modelled must be simpler than its appearances would, at first glance, suggest.

What are the implications for modelling? Some data records will always be more interesting or significant than others — for most physical systems, records describing the status of deep space are much less interesting than those near a star or planet. So there are issues around the way data is sampled.
Some attributes will also be more interesting or significant than others — but, and here’s the crucial point, this significance is a global property. It’s possible to have irrelevant or uninteresting attributes, but these attributes are similarly uninteresting everywhere. Thus it makes sense to use attribute selection as part of the modelling process.

Because the underlying system is simpler than its appearance suggests, there is a bias towards simple models. In other words, physical systems are the domain of Occam’s Razor.

2. Living systems.
Such systems are characterised by apparent simplicity, but underlying complexity (at least relatively speaking). In other words, most living systems are really complicated underneath, but their appearances often conceal this complexity. It isn’t obvious to me why this should be so, and I haven’t come across much discussion about it — but living systems are full of what computing people call encapsulation, putting parts of systems into boxes with constrained interfaces to the outside.

One big example where this matters, and is starting to cause substantial problems for data mining, is the way diseases work. Most diseases are complex activities in the organism that has the disease, and their precise working out often depends on the genotype and phenotype of that organism as well as of the diseases themselves. In other words, a disease like influenza is a collaborative effort between the virus and the organism that has the flu — but it’s still possible to diagnose the disease because of large-scale regularities that we call symptoms.
It follows that, between the underlying complexity of disease, genotype, and phenotype, and the outward appearances of symptoms, or even RNA concentrations measured by microarrays, there must be substantial “bottlenecks” that reduce the underlying complexity. Our lack of understanding of these bottlenecks has made personalised medicine a much more elusive target than it seemed to be a decade ago. Systems involving living things are full of these bottlenecks that reduce the apparent complexity: species, psychology, language.

All of this has implications for data mining of systems involving living things, most of which have been ignored. First, the appropriate target for modelling should be these bottlenecks because this is where such systems “make the most sense”; but we don’t know where the bottlenecks are, that is which part of the system (which level of abstraction) should be modelled. In general, this means we don’t know how to guess the appropriate complexity of model to fit with the system. (And the model should usually be much more complex than we expect — in neurology, one of the difficult lessons has been that the human brain isn’t divided into nice functional building blocks; rather it is filled with “hacks”. So is a cell.)

Because systems involving living things are locally entropy reducing, different parts of the system play qualitatively different roles. Thus some data records are qualitatively of different significance to others, so the implicit sampling involved in collecting a dataset is much more difficult, but much more critical, than for clockwork systems.

Also, because different parts of the system are so different, the attributes relevant to modelling each part of the system will also tend to be different. Hence, we expect that biclustering will play an important role in modelling living systems. (Attribute selection may also still play a role, but only to remove globally uninteresting attributes; and this should probably be done with extreme caution.)

Systems of living things can also be said to have competing interests, even though these interests are not conscious. Thus such systems may involve communication and some kind of “social” interaction — which introduces a new kind of complexity: non-local entropy reduction. It’s not clear (to me at least) what this means for modelling, but it must mean that it’s easy to fall into a trap of using models that are too simple and too monolithic.

3. Human systems.
Human systems, of course, are also systems involving living things, but the big new feature is the presence of consciousness. Indeed, in settings where humans are involved but their actions and interactions are not conscious, models of the previous kind will suffice.

Systems involving conscious humans are locally and non-locally entropy reducing, but there are two extra feedback loops: (1) the loop within the mind of each actor which causes changes in behaviour because of modelling other actors and themself (the kind of thing that leads to “I know that he knows that I know that … so I’ll …”); (2) the feedback loop between actors and data miners.

The first feedback loop creates two processes that must be considered in the modelling:
a. Self-consciousness, which generates, for example, purpose tremor;
b. Social consciousness, which generates, for example, strong signals from deception.

The second feedback loop creates two other processes:
a. Concealment, the intent or action of actors hiding some attributes or records from the modelling;
b. Manipulation, the deliberate attempt to change the outcomes of any analysis that might be applied.

I argue that all data mining involving humans has an adversarial component, because the interests of those being modelled never run exactly with each other, or with those doing the modelling, and so all of these processes must be considered whenever modelling of human systems is done. (You can find much more on this topic by reading back in the blog.)

But one obvious effect is that records and attributes need to have metadata associated with them that carries information about properties such as uncertainty or trustworthiness. Physical systems and living systems might mislead you, but only with your implicit connivance or misunderstanding; systems involving other humans can mislead you either with intent or as a side-effect of misleading someone else.
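One possible shape for this metadata idea is a record that carries per-attribute trust annotations alongside its values. The field names and threshold here are illustrative, not a settled design.

```python
# Hedged sketch: records annotated with per-attribute trust levels,
# so the modelling step can filter or down-weight untrusted values.
# Field names and the 0.5 threshold are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class AnnotatedRecord:
    values: dict                                    # observed attributes
    trust: dict = field(default_factory=dict)       # trust in [0, 1] per attribute

    def trusted(self, attr, threshold=0.5):
        """Whether an attribute is trusted enough to use in modelling."""
        return self.trust.get(attr, 0.0) >= threshold

rec = AnnotatedRecord(values={"age": 34.0, "income": 52000.0},
                      trust={"age": 0.9, "income": 0.3})
usable = {a: v for a, v in rec.values.items() if rec.trusted(a)}
```

A complete design would also track provenance and uncertainty, as the text notes; this only shows where such annotations could live.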

As I’ve written about before, systems where actors may be trying to conceal or manipulate require care in choosing modelling techniques so as not to be misled. On the other hand, when actors are self-conscious or socially conscious they often generate signals that can help the modelling. However, a complete way of accounting for issues such as trust at the datum level has still to be designed.

Inspire and Azan magazines

I’ve been working (with Edna Reid) on understanding Inspire and Azan magazines from the perspective of their language use.

These two magazines are produced by islamists, aimed at Western audiences, and intended primarily to motivate lone-wolf attacks. Inspire comes out of AQAP, whereas Azan seems to have a Pakistan/Afghanistan base and to be targeted more at South Asians.

Both magazines have some inherent problems: it’s difficult to convince others to carry out actions that will get them killed or imprisoned using such a narrow channel and appealing only to mind and emotions. The evidence for the effectiveness of these magazines is quite weak — those (few) who have carried out lone-wolf attacks in the West have often been found to have read these magazines — but so have many others in their communities who didn’t carry out such attacks.

Regardless of effectiveness, looking at language usage gives us a way to reverse engineer what’s going on in the minds of the writers and editors. For example, it’s clear that the first 8 issues of Inspire were produced by the same (two) people, but that issues 9-11 have been produced by three different people (but with some interesting underlying commonalities). It’s also clear that all of the issues of Azan so far are produced by one person (or perhaps a small group with a very similar mindset) despite the different names used as article authors.

Overall, Inspire lacks a strategic focus. Issues appear when some event in the outside world suggests a theme, and what gets covered, and how, varies quite substantially from issue to issue. Azan, on the other hand, has been tightly focused with a consistent message, and much more regular publication. Measures of informative and imaginative language are also consistently higher for Azan than for Inspire.

The intensity of jihadist language in Inspire has been steadily increasing in recent issues. The level of deception has also been increasing; the latter is surprising because previous studies have suggested that jihadi intensity tends to be correlated with low levels of deception. This may be a useful signal for intelligence organizations.

A draft of the paper about this is available on SSRN:

Verbal mimicry isn’t verbal (well, not lexical anyway)

One of my students, Carolyn Lamb, has been looking at deception in interrogation settings.

The Pennebaker model of deception, as devoted readers will know, is robust only for freeform documents. Sadly, the settings in which deception is often most interesting tend to be dialogues (law enforcement, forensic) and it’s known that the model doesn’t extend in any straightforward way to such settings.

We started out with the idea that responses would be mixtures of language elicited by the words in a question and freeform language from the respondent, and developed a clever method to separate them. Sadly, it worked, but it didn’t help. When the effect of question language was removed from answers, the differences between deceptive and truthful responses decreased.

Digging a little deeper, we were able to show that the influence of words from the question must impact response language at a higher level (i.e. earlier in the answer construction process than simply the lexical). Those who are being deceptive respond in qualitatively different ways to prompting words than those being truthful. A paper about this has been accepted for the IEEE Intelligence and Security Informatics Conference in Seattle next month.

Part of the explanation seems to be mirror neurons. There’s a considerable body of work on language acquisition, and on responses to single words, that uses mirror neurons as a big part of the explanation; I haven’t seen anything at an intermediate level where these results fit.

There are some interesting practical applications for interrogators. One strategy would be to reduce the presence of prompting words (and do so consistently across all subjects) so that responses become closer to statements, and so closer to freeform. My impression, from personal acquaintance, is that smarter law enforcement personnel already know this and act on it.

But our results also suggest a new strategy: increase the number of prompting words because that tends to increase the separation between the deceptive and the truthful. This needs a good understanding of what kinds of response words to look for (and, for most, this has to be done offline because we as humans are terrible at estimating rates of words in real-time, especially function words). But it could be very powerful.

You heard it here first

As I predicted on August 8th, Obama has won the U.S. presidential election. The prediction was made based on his higher levels of persona deception, that is the ability to present himself as better and more wonderful than he actually is. Romney developed this a lot during the campaign and the gap was closing, but it wasn’t enough.

On a side note, it’s been interesting to notice the emphasis in the media on factual deception, and the huge amount of fact checking that they love to do. As far as I can tell, factual deception has at best a tiny effect on political success, whether because it’s completely discounted or because the effect of persona is so much stronger. On his record, it seems to me a tough argument that Obama has been a successful president, and indeed I saw numerous interviews with voters who said as much — but then went on to say that they would still be voting for him. So I’m inclined to the latter explanation.