textual analysis | Finding Bad Guys in Data

Posts Tagged 'textual analysis'

Getting election winning right

Published October 8, 2020 Uncategorized
Tags: biden, election language, Gettysburg, language, politics, textual analysis, Trump, US election

In the previous post I reviewed our model for how to win a U.S. presidential election:

Use high levels of positive language;
Avoid negative language completely;
Stay away from policy;
Don’t mention your opponent.

Joe Biden’s speech at Gettysburg was a textbook example of how to do this (and it’s no easy feat avoiding mentioning your opponent when it’s Trump).

He should have stopped after the first five minutes (HT Bob Newhart “On the backs of envelopes, Abe”, also Lincoln himself, 271 words).

After the first five minutes it got rambling and repetitive. The media hates speeches that fit our model, and so the only sound bites came from the second half, which was much less well-written.

How to win a US presidential election — reminder

Published August 25, 2020 Uncategorized
Tags: biden, election, Hillary Clinton, language, politics, spin, textual analysis, Trump, US election

As the US presidential election ramps up, let me remind you of our conclusions about the language patterns used by winners. Since 1992, the winner is the candidate who:

uses high levels of positive language;
avoids all negative language;
stays away from policy and talks in generalities;
doesn’t talk about the opposing candidate.

https://www.sciencedirect.com/science/article/pii/S0261379416302062

The reason this works is that the choices made by voters are not driven by rational choice but by a more immediate appeal of the candidate as a person. The media doesn’t believe in these rules, and constantly tries to drive candidates to do the opposite. For first time candidates this pressure often works, which is partly why incumbents tend to do well in presidential elections.

But wait, you say. How did Trump win last time? The answer is that, although he doesn’t do well on the second and fourth points (avoiding negative language and not talking about the opponent), Hillary Clinton did very poorly on all four. So it wasn’t that Trump won, so much as that Hillary Clinton lost.

Based on this model, and its historical success, Biden is doing pretty much exactly what he needs to do.

Detecting intent and abuse in natural language

Published March 19, 2020 Uncategorized
Tags: abuse, counterterrorism, intent, language, natural language, textual analysis

One of my students has developed a system for detecting intent and abuse in natural language. As part of the validation, he has designed a short survey to get human assessments of how the system performs.

If you’d like to participate, the url is

aquarius.cs.queensu.ca

Thanks in advance!

6.5/7 US presidential elections predicted from language use

Published November 9, 2016 Uncategorized
Tags: Clinton, election, language, textual analysis, Trump, US election

I couldn’t do a formal analysis of Trump/Clinton language because Trump didn’t put his speeches online — indeed many of them weren’t scripted. But, as I posted recently, his language was clearly closer to our model of how to win elections than Clinton’s was.

So since 1992, the language model has correctly predicted the outcome, except for 2000 when the model predicted a very slight advantage for Gore over Bush (which is sort of what happened).

People judge candidates on who they seem to be as a person, a large part of which is transmitted by the language they use. Negative and demeaning statements obviously affect this, but so do positivity and optimism.

Voting is not rational choice

Published November 6, 2016 Uncategorized
Tags: election, Hillary Clinton, language, politics, textual analysis, Trump, US election

Pundits and the media continue to be puzzled by the popularity of Donald Trump. They point out that much of what he says isn’t true, that his plans lack content, that his comments about various subgroups are demeaning, and so on, and so on.

Underlying these plaintive comments is a fundamental misconception about how voters choose the candidate they will vote for. This has much more to do with the standard human judgements of character and personality, made in the first few seconds, than it does with calm, reasoned decision making.

Our analysis of previous presidential campaigns (about which I’ve posted earlier) makes it clear that this campaign is not fundamentally different in this respect. It’s always been the case that voters decide based on the person who appeals to them most on a deeper than rational level. As we discovered, the successful formula for winning is to be positive (Trump is good at this), not to be negative (Trump is poor at this), not to talk about policy (Trump is good at this), and not to talk about the opponent (Trump is poor at this). On the other hand, Hillary Clinton is poor at all four — she really, really believes in the rational voter.

We’ll see what happens in the election this week. But apart from the unusual facts of this presidential election, it’s easy to understand why Trump isn’t doing worse and Hillary Clinton isn’t doing better from the way they approach voters.

It’s not classified emails that are the problem

Published October 31, 2016 Uncategorized
Tags: cybersecurity, election, email, Hillary Clinton, Huma Abedin, language, textual analysis, US election

There’s been reporting that the email trove, belonging to Huma Abedin but found on the laptop of her ex-husband, got there as the result of automatic backups from her phone. This seems plausible; if it is true then it raises issues that go beyond whether any of the emails contain classified information or not.

First, it shows how difficult it is for ordinary people to understand, and realise, the consequences of their choices about configuring their life-containing devices. Backing up emails is good, but every user needs to understand what that means, and how potentially invasive it is.

Second, to work as a backup site, this laptop must have been Internet-facing and (apparently) unencrypted. That means that more than half a million email messages were readily accessible to any reasonably adept cybercriminal or nation-state. If there are indeed classified emails among them, then that’s a big problem.

But even if there are not, access to someone’s emails, given the existence of textual analytics tools, means that a rich picture can be built up of that individual: what they are thinking about, who they are communicating with (their ego network in the jargon), what the rhythm of their day is, where they are located physically, what their emotional state is like, and even how healthy they are.

For any of us, that kind of analysis would be quite invasive. But when the individual is a close confidante of the U.S. Secretary of State, and when many of the emails are from that same Secretary, a picture of them at this level of detail is valuable, and could be exploited by an adversary.

Lawyers and the media gravitate to the classified information issue. This is a 20th Century view of the problems that revealing large amounts of personal text causes. The real issue is an order of magnitude more subtle, but also an order of magnitude more dangerous.

The real problem with the Clinton email server

Published May 2, 2016 Uncategorized
Tags: email server, Hillary Clinton, textual analysis, US election

Every intelligence person I’ve talked to has told me that the probability that the Russians and Chinese (at least) hacked Hillary Clinton’s email server is 100%.

While the question of whether any of the emails were classified, about to be classified, or should have been classified is interesting, the real risk created by the use of this server is that it provided a real-time look at the communications of the Secretary of State (and the people she was talking to).

Even the unclassified emails provided insight into the Secretary’s state of mind, plans, location, and intentions. Some of these might have been obvious; others would follow from examining email headers; and others by carrying out textual analysis (which is getting quite good at reverse engineering mental state, as regular readers will know).

Access to your entire email stream + some analytic capacity = fairly complete understanding of your life.

(Note that Google already does this for everyone who has a gmail account, and also for anyone who sends or receives email from anyone with a gmail account.)

Added 2016/05/06: A new problem now arises: control of the presidential election is in the hands of any country that can claim to have hacked the server. While hacking by a foreign power remains a (virtually certain) hypothetical, it is clearly having no impact on the election. But if a foreign power were to leak that they had hacked the server and exploited that somehow, the impact would surely be catastrophic. And I can imagine several of America’s enemies who might prefer a President Trump to a President Clinton II.

Trump’s continuing success

Published December 30, 2015 Uncategorized
Tags: Christie, election, language, politics, successful election language, textual analysis, Trump, US election, words

As I posted earlier, our study of previous successful presidential candidates shows that success is very strongly correlated with a particular language model, consisting of:

Uniformly positive language
Complete absence of negative language
Using uplifting, aspirational metaphors rather than policy proposals, and
Ignoring the competing candidates

Trump presumably polls well, to a large extent, because he uses this language model (not so much ignoring of the competing candidates recently, but maybe that’s the effect of a primary). This language pattern tends to be used by incumbent presidents running for re-election, and seems to derive from their self-perception as already-successful in the job they’re re-applying for. Trump, similarly, possesses huge self confidence that seems to have the same effect — he perceives himself as (automatically, guaranteed) successful as president.

The dynamic between the successful self-perception issue and the competence issue was hard to separate before; and we’ve used ‘statesmanlike’ to describe the model of language of electoral success. All of the presidential incumbents whom we previously studied had a self-perception of success and a demonstrated competence and we assumed that both were necessary to deploy the required language comfortably and competently. Trump, however, shows that this isn’t so — it’s possible to possess the self-perception of success without the previously demonstrated competence. In Trump’s case, presumably, it is derived from competence in a rather different job: building a financial empire.

The media is in a frenzy about the competence issue for Trump. But our language model explains how it is possible to be popular among voters without demonstrating much competence, or even planned competence, to solve the problems of the day.

Voters don’t care about objective competence in the way that the media do. They care about the underlying personal self-confidence that is revealed in each candidate’s language. The data is very clear about this.

It may even be the rational view that a voter should take. Presidents encounter, in office, many issues that they had not previously formulated a policy for, so self-confidence may be more valuable than prepackaged plans. And voters have learned that most policies do not get implemented in office anyway.

It’s silly to treat Trump as a front runner when no actual vote has yet been cast. But it wouldn’t be surprising if he continues to do well for some time. Of the other candidates, only Christie shows any sense of the use of positive language but, as a veteran politician, he cannot seem to avoid the need to present policies.

Results from the first Democratic debate

Published October 15, 2015 Uncategorized
Tags: Bernie Sanders, Chafee, CNN debate, deception, Democratic candidate debate, election, Hillary Clinton, language, O'Malley, politics, textual analysis, US election, Webb

The debate held on Tuesday night pitted one well known figure (Hillary Clinton) against one up and coming figure (Sanders) and three others with no name recognition except among the wonkiest. The differences in exposure and preparation were obvious. I can’t see that it made any difference to anyone’s opinions.

But it remains interesting to see how well each person did at presenting a persona. Extremely well known politicians do not usually have the luxury of presenting themselves with a new, improved persona because the old one is so well known, so it’s common to find that persona deception scores are low for such candidates. For those who aren’t well-known, the strength of their persona is a blend of how well they can do it personally, and how big the gap is between their previous self-image and the persona that they are trying to project. A relatively unknown candidate with a high persona deception score, therefore, is likely to do well; one with a low score probably will not.

Here are the results from this debate:

The red and green points represent artificial word use corresponding to moderately high and moderately low levels of persona deception. Clinton, as expected (and from my analysis in the 2008 cycle) has low levels of persona deception. Sanders’s levels are in the mid-range. Chafee is sincere, but this won’t help him with his current level of recognition. O’Malley has the highest level of persona deception, which is a positive indicator for him (for what it’s worth in this crowd). Webb is also in the midrange, but his language use is quite different from that of Sanders.

Results from second Republican debate

Published September 17, 2015 Uncategorized
Tags: #cnndebate, Buch, Carson, Christie, CNN debate, Cruz, debate, deception, election, Fiorina, Graham, huckabee, JIndal, Kasich, language, Pataki, Paul, politics, Rubio, Santorum, textual analysis, Trump, US election, Walker

Regular readers will know that, especially in a crowded marketplace, politicians try to stand out and attract votes by presenting themselves in the best possible light that they can. This is a form of deception, and carries the word-use signals associated with deception, so it can be measured using some straightforward linguistic analysis.

Generally speaking, the candidate who achieves the highest level of this persona deception wins, so candidates try as hard as they can. There are, however, a number of countervailing forces. First, different candidates have quite different levels of ability to put on this kind of persona (Bill Clinton excelled at it). Second, it seems to be quite exhausting, so that candidates have trouble maintaining it from day to day. Third, the difficulty depends on the magnitude of the difference between the previous role and the new one that is the target of a campaign: if a vice-president runs for president, he is necessarily lumbered with the persona that’s been on view in the previous job; if not, it’s easier to present a new persona and make it seem compelling (e.g. Obama in 2008). Outsiders therefore have a greater opportunity to re-invent themselves. Fourth, it depends on the content of what is said: a speech that’s about pie in the sky can easily present a new persona, while one that talks about a candidate’s track record cannot, because it drags the previous persona into at least the candidate’s mind.

Some kinds of preparation can help to improve the persona being presented — a good actor has to be able to do this. But politicians aren’t usually actors manqué so the levels of persona deception that they achieve from day to day emerge from their subconscious and so provide fine-grained insights into how they’re perceiving themselves.

The results from the second round of debates are shown in the figure:

The red and green points represent artificial debate participants who use all of the words of the deception model at high frequency and low frequency respectively.

Most of the candidates fall into the band between these two extremes, with Rand Paul at the lowest level of persona deception (which is what you might expect). The highest levels of deception are Christie and Fiorina, who had obviously prepped extensively and were regarded as having done well; and Jindal, who is roughly at the same level, but via completely different word use.

Comparing these to the results from the first round of debates, there are two obvious changes: Trump has moved from being at the low end of the spectrum to being in the upper-middle; and Carson has moved from having very different language patterns from all of the other candidates to being quite similar to most of them. This suggests that both of them are learning to be better politicians (or being sucked into the political machine, depending on your point of view).

The candidates in the early debate have clustered together on the left hand side of the figure, showing that there was a different dynamic in the two different debates. This is an interesting datum about the strength of verbal mimicry.

The secret of Trump’s success

Published September 15, 2015 Uncategorized
Tags: election, election winning language, George Will, Karl Rove, language, politics, textual analysis, Trump. Hillary Clinton, US election, words

Looking at US presidential elections through the lens of empirical investigation of word use shows that there’s a pattern of language that is associated with electoral success. Those who use it win, and the difference in the intensity of the pattern correlates well with the margin of victory.

The effective pattern is, in a way, intuitive: use positive language, eliminate negative language completely, talk in the abstract rather than about specific policies, and pay no attention to the other candidates.

In other words, a successful candidate should appear “statesmanlike”.

Candidates find it extremely difficult to use this approach — they feel compelled to compare themselves to the other candidates, dragging in negativity, and to explain the cleverness of their policies. Only incumbent presidents, in our investigation, were able to use this language pattern reliably.

I listened to some of Trump’s speech in Texas last night, and I’ve come to see that the media are completely and utterly wrong about why he is doing so well in the polls. It’s not that he’s tapping into a vein of disaffection with the political system; it is that he’s using this language model. In previous cycles, it’s only been incumbent presidents who’ve had the self-confidence to use it, but Trump, of course, has enough self-confidence to start a retail business selling it.

Let’s look at the components of the model:

Positive language: Trump’s positivity is orders of magnitude above that of the other candidates, and in two ways. First, he is relentlessly positive about the U.S. and about the future (catchphrase: “we can do better”). Second, he’s positive about almost everyone he mentions (catchphrase: “he’s a great guy”).

Negative language: Trump doesn’t avoid negativity altogether, but he uses it cleverly. First, his individual negative targets are not the other candidates (by and large) but pundits. Karl Rove and George Will were mentioned last night, but I doubt if more than 1% of the audience could have identified either in a line-up; so this kind of negativity acts as a lightning rod, without making Trump seem mean. And the negative references to others lack the bitterness that often bleeds through in the negative comments of more typical candidates. Second, when he mentions negative aspects of the Obama administration and its policies and actions, he does it by implication and contrast (“that’s not what I would do”, “I could do better”).

Vision not policies: the media cannot stand that Trump doesn’t come out with detailed policy plans, but it’s been clear for a while that voters don’t pay a lot of attention to policies. They’ve learned that (a) there’s a huge gap between what a president can want to do and what he can actually make happen, and (b) policies are generated with one eye on the polls and focus groups, so they often aren’t something that the candidate has much invested in doing in the first place. [It’s incredible that Secretary Clinton ran focus groups to prep her “apology”, which was actually a meta-apology for not having apologized better or earlier.]

Trump has one huge “policy” advantage — he isn’t beholden to donors, and so is freer of the behind-the-scenes pressure that most candidates face. In the present climate, this has to be a huge selling point.

Ignore the other candidates: Trump doesn’t quite do this (and it gets him into trouble), but he’s learning fast; in last night’s speech, he only mentioned a handful of his competitors and his comments about all of them were positive.

If Trump continues to give this kind of speech, then the more exposure he gets, the more voters are going to like him. I remain doubtful that he will be the Republican nominee, but I don’t see him flaming out any time soon. Even if he makes some serious gaffe, he’ll apologize in seconds and move on (in contrast to Clinton who seems determined to make acute issues into chronic ones).

Republican candidates’ debate: persona deception results

Published August 8, 2015 Uncategorized
Tags: Carson, Christie, Cruz, debate, election, Fiorina, Gilmore, Graham, huckabee, Jeb Bush, JIndal, Kasich, language, Pataki, Paul, Perry, politics, Rubio, textual analysis, Trump, US election, Walker

Here are results from the first Republican debate, combining the early and prime-time material into a single corpus.

There’s more detail about the theory in the previous post, but the basic story is: an election campaign is a socially sanctioned exercise in deception; factual deception is completely discounted and so doesn’t matter, but the interesting question is the deception required of each candidate to present themselves as better than they really are; and the candidate who can implement this kind of deception best tends to be the winner. Note that, although deception often has negative connotations, there are many situations where it is considered appropriate, allowed, or condoned: negotiation, dating, selling and marketing — and campaigns are just a different kind of marketing. Sometimes this is called, in the political context, “spin” but it’s really more subtle than that.

The basic plot shows the variation in level of deception, aggregated over all of the turns by each candidate during the debate. The line is the deception axis; the further towards the red end, the stronger the deception. Other variation is caused by variations in the use of different words of the model, that is, by different styles.

These results aren’t terribly surprising. Both Fiorina and Huckabee have broad media experience and so are presumably good at presenting a facade appropriate to many different occasions (and no wonder Fiorina is widely regarded as having “won” the early debate). Trump has low levels of deception — that’s partly because he doesn’t bother with a facade, and partly because the more well-known a person is, the harder it is to successfully present a different facade.

Note, again unsurprisingly, that Carson, while in the middle of the pack on the deception axis, has quite different language patterns from any of the others. That’s partly opportunity — he wasn’t asked the same kind of questions — but partly not being a professional politician.

This figure zooms in to show the structure of the pack in the centre. There isn’t a lot of difference, which reinforces the takeaway that these debates didn’t make a lot of difference, positively or negatively, for most of the candidates.

The contributions of language to the ranking can be looked at by drilling down into this table:

The rows are candidates in alphabetical order (Fiorina 5, Huckabee 8, Perry 13, Trump 15), the columns are 42 of the words of the deception model that were actually used in decreasing order of overall frequency, and the blocks are darker in colour when a word used by a candidate makes a greater contribution to the model. The top words were: I, but, going, my, me, or, go, take, look, lead, run, rather, without, move, and hate. So Huckabee’s high score comes primarily from low use of first-person singular pronouns, while Fiorina’s comes from heavier use of lower-ranked words that most others didn’t use. There are qualitative similarities between Fiorina’s language and Carson’s (row 2).

In previous presidential election campaigns, the candidate who managed to present the best facade in the strongest way was the winner.

A separate question is: what kind of facade should a candidate choose? We have empirical results about that too. A winning persona is characterised by: ignoring policy issues completely, ruthlessly eliminating all negative language, using plenty of positive language, and ignoring the competing candidates. Although, at one level, this seems obvious, no candidate and no campaign can bring themselves to do it until their second presidential campaign. But not only does it predict the winner, the margin of victory is predictable from it as well.
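As a rough illustration of how the four properties of a winning persona could be turned into a single number, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the word lists are invented, the function name `statesmanlike_score` is hypothetical, and the equal-weight combination is not taken from the published model.

```python
import re

# Hypothetical word lists, for illustration only; not the published model.
POSITIVE = {"great", "hope", "better", "proud", "future"}
NEGATIVE = {"fail", "worse", "disaster", "wrong", "hate"}
POLICY = {"tax", "bill", "plan", "policy", "budget"}


def statesmanlike_score(text, opponent_names):
    """Reward positive language; penalise negativity, policy talk and
    mentions of opponents. Each term is a rate per token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    opponents = {name.lower() for name in opponent_names}

    def rate(vocabulary):
        return sum(token in vocabulary for token in tokens) / total

    # Equal weights are an assumption; the source does not say how the
    # four properties are combined.
    return rate(POSITIVE) - rate(NEGATIVE) - rate(POLICY) - rate(opponents)
```

Under these assumptions, a speech full of optimistic generalities scores higher than one that names an opponent and criticises a tax plan.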

Canadian election 2015: Leaders’ debate

Published August 7, 2015 Uncategorized
Tags: Canada, candidates, deception, election, election campaign, Elizabeth May, influence, Justin Trudeau, language, leaders' debate, Macleans, May, Mulcair, politics, spin, Stephen Harper, textual analysis, Thomas Mulcair, Trudeau, words

Regular readers will recall that I’m interested in elections as examples of the language and strategy of influence — what we learn can be applied to understanding jihadist propaganda.

The Canadian election has begun, and last night was the first English-language debate by the four party leaders: Stephen Harper, Elizabeth May, Thomas Mulcair, and Justin Trudeau. Party leaders do not get elected directly, so all four participants had trouble wrapping their minds around whether they were speaking as party spokespeople or as “presidential” candidates.

Deception is a critical part of election campaigns, but not in the way that people tend to think. Politicians make factual misstatements all the time, but it seems that voters have already baked this in to their assessments, and so candidates pay no penalty when they are caught making such statements. This is annoying to the media outlets that use fact checking to discover and point out factual misstatements, because nobody cares, and they can’t figure out why.

Politicians also try to present themselves as smarter, wiser, and generally more qualified for the position for which they’re running, and this is a much more important kind of deception. In a fundamental sense, this is what an election campaign is — a Great White Lie. Empirically, the candidate who is best at this kind of persona deception tends to win.

Therefore, measuring levels of deception is a good predictor of the outcome of an election. Recall that deception in text is signalled by (a) reduced use of first-person singular pronouns, (b) reduced use of so-called exclusive words (“but”, “or”) that introduce extra complexity, (c) increased use of action verbs, and (d) increased use of negative-emotion words. This model can be applied by counting the number of occurrences of these words, adding them up (with appropriate signs), and computing a score for each document. But it turns out to be much more effective to add a step that weights each word by how much it varies in the set of documents being considered, and computing this weighted score.
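The scoring just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code behind the figures: the word list is a small illustrative subset, the assignment of each word to a category (and hence its sign) is my reading of the four signals, and using the standard deviation of a word's rate across documents as its weight is only one plausible interpretation of "weights each word by how much it varies".

```python
import re
from statistics import mean, pstdev

# Illustrative subset of model words. The sign gives the direction that
# signals deception (an assumption based on the four signals above).
MODEL = {
    "i": -1, "my": -1, "me": -1,        # first-person singular: fewer is more deceptive
    "but": -1, "or": -1,                # exclusive words: fewer is more deceptive
    "go": +1, "going": +1, "take": +1,  # action verbs: more is more deceptive
    "hate": +1,                         # negative emotion: more is more deceptive
}


def word_rates(text):
    """Rate of each model word per token, so long and short documents compare."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1
    return {word: tokens.count(word) / total for word in MODEL}


def deception_scores(docs, weighted=True):
    """docs maps a speaker name to their combined text; returns relative scores."""
    rates = {name: word_rates(text) for name, text in docs.items()}
    scores = dict.fromkeys(docs, 0.0)
    for word, sign in MODEL.items():
        column = [rates[name][word] for name in docs]
        centre = mean(column)
        # Assumed weighting: words whose rates vary more across the
        # documents contribute more to the score.
        weight = pstdev(column) if weighted else 1.0
        for name in docs:
            scores[name] += sign * weight * (rates[name][word] - centre)
    return scores
```

Because each word's rate is centred on the mean across documents, the scores are relative: they rank the documents in one set against each other and sum to zero, so they are not comparable across different sets of documents.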

So, I’ve taken the statements by each of the four candidates last night, and put them together into four documents. Then I’ve applied this deception model to these four documents, and ranked the candidates by levels of deceptiveness (in this socially acceptable election-campaign meaning of deceptiveness).

This figure shows, in the columns, the intensity of the 35 model words that were actually used, in decreasing frequency order. The rows are the four leaders in alphabetical order: Harper, May, Mulcair, Trudeau; and the colours are the intensity of the use of each word by each leader. The top few words are: I, but, going, go, look, take, my, me, taking, or. But remember, a large positive value means a strong contribution of this word to deception, not necessarily a high frequency — so the brown bar in column 1 of May’s row indicates a strong contribution coming from the word “I”, which actually corresponds to low rates of “I”.

This figure shows a plot of the variation among the four leaders. The line is oriented from most deceptive to least deceptive; so deception increases from the upper right to the lower left.

Individuals appear in different places because of different patterns of word use. Each leader’s point can be projected onto this line to generate a (relative) deception score.

May appears at the most deceptive end of the spectrum. Trudeau and Harper appear at almost the same level, and Mulcair appears significantly lower. The black point represents an artificial document in which each word of the model is used at one standard deviation above neutral, so it represents a document that is quite deceptive.
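The projection of each leader's point onto the deception line can be sketched as follows. This is an illustration only: the function name `project_onto_axis` and the endpoint arguments are hypothetical, and it assumes the line is defined by two reference points (a low-deception end and a high-deception end) in whatever coordinates the plot uses.

```python
def project_onto_axis(point, low_end, high_end):
    """Position of `point` along the line from low_end to high_end.

    Returns 0.0 at the low-deception end and 1.0 at the high-deception
    end; values outside that range lie beyond the reference points.
    """
    axis = [h - l for h, l in zip(high_end, low_end)]
    relative = [p - l for p, l in zip(point, low_end)]
    length_sq = sum(a * a for a in axis)
    if length_sq == 0:
        raise ValueError("axis endpoints must differ")
    # Scalar projection, normalised by the squared axis length.
    return sum(r * a for r, a in zip(relative, axis)) / length_sq
```

Two points that sit far apart in the plot can still receive the same score if they differ only in the direction perpendicular to the line, which corresponds to a difference in style rather than in level of deception.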

You might conclude from this that May managed much higher levels of persona deception than the other candidates and so is destined to win. There are two reasons why her levels are high: she said much less than the other candidates and her results are distorted by the necessary normalizations; and she used “I” many fewer times than the others. Her interactions were often short as well, reducing the opportunities for some kinds of words to be used at all, notably the exclusive words.

Mulcair’s levels are relatively low because he took a couple of opportunities to talk autobiographically. This seems intuitively to be a good strategy (appeal to voters with a human face) but unfortunately it tends not to work well. To say “I will implement a wonderful plan” invites the hearer to disbelieve that the speaker actually can; saying instead “We will implement a wonderful plan” makes the hearer’s disbelief harder because they have to eliminate more possibilities; and saying “A wonderful plan will be implemented” makes it a bit harder still.

It’s hard to draw strong conclusions in the Canadian setting because elections aren’t as much about personalities. But it looks as if this leaders’ debate might have been a wash, with perhaps a slight downward nudge for Mulcair.

Empirical Assessment of Al Qaeda, Isis, and Taliban Propaganda

Published January 7, 2015 Uncategorized
Tags: al Qaeda, AQAP, Azan, corpus analytics, counterterrorism, Dabiq, Daish, Inspire magazine, ISIS, language, lone wolf attacks, propaganda, Taliban, Taliban Propaganda, terrorism, textual analysis

I’ve just been working on assessing the potential impact of the three major magazines: Inspire (AQAP), Azan (Taliban), and Dabiq (ISIS), competing for the market in lone wolf jihadists in the West.

I compare these magazines using models for the intensity of informative, imaginative, deceptive, jihadist, and gamification language, and build an empirical model for propaganda which combines these into a single scale.

Unsurprisingly, Dabiq ranks highest in propaganda intensity.

The details can be found in the full draft paper, posted to SSRN:

Skillicorn, David, Empirical Assessment of Al Qaeda, Isis, and Taliban Propaganda (January 7, 2015). Available at SSRN: http://ssrn.com/abstract=2546478.

More subtle lessons from the Sony hack

Published December 19, 2014 Uncategorized
Tags: aggegation, castle model, cybersecurity, data mining, deanonymization, knowledge discovery, Sony, Sony hack, textual analysis

There are some obvious lessons to learn from the Sony hack: perimeter defence isn’t much use when the perimeter has thousands of gates in it (it looks as if the starting point was a straightforward spearphishing attack); and if you don’t compartmentalise your system inside the perimeter, then anyone who gets past it has access to everything.

But the less obvious lesson has to do with the difference between our human perception of the difficulties of de-anonymization and aggregation, and the actual power of analytics to handle both. For example, presumably Sony kept data on their employees’ health in properly protected HIPAA-compliant storage, but there were occasional emails that mentioned individuals and their health status. The people sending these emails presumably didn’t feel as if any particular one was a breach of privacy, since the private content in each one was small. But they failed to realise that all of these emails get aggregated, at least in backups. So now all of those little bits of information are in one place, and the risk of building significant models from them has increased substantially.

Anyone with analytic experience and access to a large number of emails can find structures that are decidedly non-obvious; but this is far from intuitive to the public at large, and hence to Sony executives.

We need to learn to value data better, and to understand in a deep way that the value of data increases superlinearly with the amount that is collected into a single coherent unit.

Inspire and Azan paper is out

Published September 24, 2014 Uncategorized
Tags: Azan, counterterrorism, gamification, Inspire magazine, intelligence analysis, ISIS, jihadist language, language, terrorism, textual analysis

The paper Edna Reid and I wrote about the language patterns in Inspire and Azan magazines has now appeared (at least online) in Springer’s Security Informatics journal. Here’s the citation:

“Language Use in the Jihadist Magazines Inspire and Azan”
David B Skillicorn and Edna F Reid
Springer Security Informatics, 2014, 3:9

The paper examines the intensity of various kinds of language in these jihadist magazines. The main conclusions are:

These magazines use language as academic models of propaganda would predict, something that has not been empirically verified at this scale AFAIK.
The intellectual level of these magazines is comparable to other mass market magazines — they aren’t particularly simplistic, and they assume a reasonably well-educated readership.
The change in editorship/authorship after the deaths of Al-Awlaki and Samir Khan is clearly visible in Inspire. The new authors have changed for each issue, but there is an overarching similarity. Azan has articles claiming many different authors, but the writing style is similar across all articles and issues; so it’s either written by a single person or by a tightly knit group.
Jihadist language intensity has been steadily increasing over the past few issues of Inspire, after being much more stable during the Al-Awlaki years (this is worrying).
Inspire is experimenting with using gamification strategies to increase motivation for lone-wolf attacks and/or to decrease the reality of causing deaths and casualties. It’s hard to judge whether this is being done deliberately, or by osmosis — the levels of gamification language waver from issue to issue.

ISIS is putting out its own magazine. Its name, “Islamic State News”, and the fact that it is entirely pictorial (comic or graphic novel depending on your point of view) says something about their view of the target audience.

Update on Inspire and Azan magazines

Published April 14, 2014 Uncategorized
Tags: Azan, counterterrorism, Inspire, jihadist language, language, Samir Khan, terrorism, textual analysis

Issue 12 of Inspire and Issue 5 of Azan are now out, so I’m updating the analysis of the language patterns in these two sequences of magazines.

To recap, both of these magazines are glossy and picture-heavy and intended primarily to encourage lone-wolf attacks by diaspora jihadists. It’s unclear how much impact they have actually had — several attackers have had copies, but so have many other non-attackers in the same environments. We have written a full analysis that can be downloaded from SSRN (here).

Here is the variation among issues for Inspire, based on the 1000 most-frequent words:

[Figure: variation among issues of Inspire, based on the 1000 most-frequent words]

You can see that the first 8 issues, edited by Samir Khan, are quite similar to one another, except for Issues 3 and 7, which are different in tone (and quite similar to one another, although that isn’t obvious in this figure). The new issues, by unknown editors, don’t resemble one another very much, but they do have an underlying consistency (they form almost a straight line) which argues for some underlying organization.
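The post does not say how the figure was produced, so the following is only a minimal sketch of one way to compare issues using their most-frequent words. It builds a relative-frequency profile for each issue over the top n words of the whole collection and compares profiles by cosine similarity. The tokenizer, the cosine measure, and all function names are assumptions; a plot like the one described would additionally need a projection to two or three dimensions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(texts, n=1000):
    """The n most frequent words across all issues taken together."""
    totals = Counter()
    for t in texts:
        totals.update(tokenize(t))
    return [w for w, _ in totals.most_common(n)]


def profile(text, vocab):
    """Relative frequency of each vocabulary word within one issue."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)


def similarity_matrix(issues, n=1000):
    """Pairwise similarity between issues, given one text string per issue."""
    vocab = top_words(issues, n)
    profiles = [profile(t, vocab) for t in issues]
    return [[cosine(p, q) for q in profiles] for p in profiles]
```

Issues written in a similar style should score close to 1 against one another, while an issue with a different tone should score noticeably lower against the rest.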

The other interesting figures are based on a model of the intensity of jihadi language. The figure shows the variation among issues of both magazines, with jihadi intensity increasing from right to left:

Overall, the jihadist intensity of Azan is lower than that of Inspire; but the most recent four issues of Inspire represent a departure: their levels are much, much greater than previous issues of Inspire and all of the issues of Azan. This is a worrying trend.

Inspire and Azan magazines

Published January 25, 2014 Uncategorized
Tags: Azan, counterterrorism, deception, gamification, Inspire, jihadist language, language, textual analysis

I’ve been working (with Edna Reid) on understanding Inspire and Azan magazines from the perspective of their language use.

These two magazines are produced by islamists, aimed at Western audiences, and intended primarily to motivate lone-wolf attacks. Inspire comes out of AQAP, whereas Azan seems to have a Pakistan/Afghanistan base and to be targeted more at South Asians.

Both magazines have some inherent problems: it’s difficult to convince others to carry out actions that will get them killed or imprisoned using such a narrow channel and appealing only to mind and emotions. The evidence for the effectiveness of these magazines is quite weak — those (few) who have carried out lone-wolf attacks in the West have often been found to have read these magazines — but so have many others in their communities who didn’t carry out such attacks.

Regardless of effectiveness, looking at language usage gives us a way to reverse engineer what’s going on in the minds of the writers and editors. For example, it’s clear that the first 8 issues of Inspire were produced by the same (two) people, but that issues 9-11 have been produced by three different people (but with some interesting underlying commonalities). It’s also clear that all of the issues of Azan so far are produced by one person (or perhaps a small group with a very similar mindset) despite the different names used as article authors.

Overall, Inspire lacks a strategic focus. Issues appear when some event in the outside world suggests a theme, and what gets covered, and how, varies quite substantially from issue to issue. Azan, on the other hand, has been tightly focused with a consistent message, and much more regular publication. Measures of informative and imaginative language are also consistently higher for Azan than for Inspire.

The intensity of jihadist language in Inspire has been steadily increasing in recent issues. The level of deception has also been increasing; the latter is surprising because previous studies have suggested that jihadi intensity tends to be correlated with low levels of deception. This may be a useful signal for intelligence organizations.

A draft of the paper about this is available on SSRN:

        http://ssrn.com/abstract=2384167

Businesses processing emails

Published September 6, 2013 Uncategorized
Tags: data mining, Facebook, Google, knowledge discovery, private emails, textual analysis, Twitter

The Daily Mail reports an experiment by the High-Tech Bridge company in which they sent private emails or uploaded documents containing unique URLs to 50 different platforms, and then waited to see whether these URLs were visited, and by whom.

Sure enough, several of them were visited by the businesses that had handled the matching document, including Facebook, Twitter, and Google. This won’t come as a surprise to readers of this blog, but once again points out the extent to which businesses like these are processing any documents they see to extract models of the sender/receiver.
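The mechanics of an experiment like this can be sketched in a few lines. This is not High-Tech Bridge's actual setup: the base URL `https://canary.example.com/t/` and both function names are hypothetical, and the sketch assumes you control the web server and can read the list of URLs it was asked for. Each platform gets its own unguessable URL, so any request for that URL must have come from someone, or something, that saw the matching document.

```python
import secrets
from urllib.parse import urlparse


def make_canary_urls(platforms, base="https://canary.example.com/t/"):
    """Create one unique, unguessable URL per platform.

    Returns (platform -> url, token -> platform). The base URL is a placeholder.
    """
    urls, owners = {}, {}
    for platform in platforms:
        token = secrets.token_urlsafe(16)  # random, URL-safe, contains no "/"
        urls[platform] = base + token
        owners[token] = platform
    return urls, owners


def platforms_that_visited(requested_urls, owners):
    """Map URLs seen in the server's access log back to the platforms they were sent to."""
    seen = set()
    for url in requested_urls:
        token = urlparse(url).path.rsplit("/", 1)[-1]
        if token in owners:
            seen.add(owners[token])
    return seen
```

Because each token is sent to exactly one platform, a hit identifies which document was processed. Identifying who made the request needs further evidence, such as the requesting IP address.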

There has been some confusion in the media about how this process might work. Evidently it’s not obvious to many that such a process is automated: there isn’t anyone ‘reading’ these documents; they’re being processed by software which is capable of ingesting the pages pointed to, and processing the contents of those pages as well. It would help if we agreed on verbs that distinguish ‘read by a human’ from ‘processed by software’, and that are simple enough for the wider public to understand the difference.
