Update on Inspire and Azan magazines

Issue 12 of Inspire and Issue 5 of Azan are now out, so I’m updating the analysis of the language patterns in these two sequences of magazines.

To recap, both of these magazines are glossy and picture-heavy, and intended primarily to encourage lone-wolf attacks by diaspora jihadists. It’s unclear how much impact they have actually had: several attackers have had copies, but so have many non-attackers in the same environments. We have written a full analysis that can be downloaded from SSRN (http://ssrn.com/abstract=2384167).

Here is the variation among issues for Inspire, based on the 1000 most-frequent words:

[Figure: variation among Inspire issues, based on the 1000 most-frequent words]

You can see that the first 8 issues, edited by Samir Khan, are quite similar to one another, except for Issues 3 and 7, which differ in tone (and are quite similar to each other, although that isn’t obvious in this figure). The newer issues, by unknown editors, don’t resemble one another very much, but they do have an underlying consistency (they form almost a straight line), which argues for some underlying organization.
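
Here’s a minimal sketch of this kind of comparison, for anyone who wants to experiment: count the 1000 most-frequent words across the issues, then project the count vectors to two dimensions so that issues with similar word usage land near one another. (This is an illustration, not our actual pipeline, and the file names are hypothetical.)

    # Minimal sketch: compare issues by their 1000 most-frequent words.
    # Assumes one plain-text file per issue (hypothetical file names).
    from pathlib import Path

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer

    texts = [p.read_text(encoding="utf-8")
             for p in sorted(Path(".").glob("inspire_*.txt"))]

    # Rows are issues; columns are counts of the 1000 most frequent words.
    counts = CountVectorizer(max_features=1000).fit_transform(texts)

    # Project to 2D so that issues with similar word usage land near
    # one another, as in the figure above.
    coords = TruncatedSVD(n_components=2).fit_transform(counts)
    for i, (x, y) in enumerate(coords, start=1):
        print(f"Issue {i:2d}: {x:8.2f} {y:8.2f}")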

The other interesting figure is based on a model of the intensity of jihadi language. It shows the variation among the issues of both magazines, with jihadi intensity increasing from right to left:

[Figure: jihadi language intensity for the issues of both magazines]

Overall, the jihadist intensity of Azan is lower than that of Inspire; but the most recent four issues of Inspire represent a departure: their levels are much, much higher than those of previous issues of Inspire and of all of the issues of Azan. This is a worrying trend.

Inspire and Azan magazines

I’ve been working (with Edna Reid) on understanding Inspire and Azan magazines from the perspective of their language use.

These two magazines are produced by Islamists, aimed at Western audiences, and intended primarily to motivate lone-wolf attacks. Inspire comes out of AQAP, whereas Azan seems to have a Pakistan/Afghanistan base and to be targeted more at South Asians.

Both magazines face some inherent problems: it’s difficult to convince others to carry out actions that will get them killed or imprisoned using such a narrow channel, one that can appeal only to mind and emotions. The evidence for the effectiveness of these magazines is also quite weak: those (few) who have carried out lone-wolf attacks in the West have often been found to have read them, but so have many others in their communities who didn’t carry out such attacks.

Regardless of effectiveness, looking at language usage gives us a way to reverse-engineer what’s going on in the minds of the writers and editors. For example, it’s clear that the first 8 issues of Inspire were produced by the same (two) people, but that issues 9-11 were produced by three different people (with some interesting underlying commonalities). It’s also clear that all of the issues of Azan so far have been produced by one person (or perhaps a small group with a very similar mindset), despite the different names used as article authors.

Overall, Inspire lacks a strategic focus. Issues appear when some event in the outside world suggests a theme, and what gets covered, and how, varies quite substantially from issue to issue. Azan, on the other hand, has been tightly focused, with a consistent message and much more regular publication. Measures of informative and imaginative language are also consistently higher for Azan than for Inspire.

The intensity of jihadist language in Inspire has been steadily increasing in recent issues. The level of deception has also been increasing, which is surprising, because previous studies have suggested that jihadi intensity tends to be correlated with low levels of deception. This may be a useful signal for intelligence organizations.
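
The intensity model itself is described in the paper, but the mechanics can be illustrated with a sketch: score each document against a weighted lexicon, normalized by document length. (The words and weights below are placeholders, not the actual model.)

    # Illustrative sketch only: score a document against a weighted term
    # list. These words and weights are placeholders, not the real
    # jihadi-intensity lexicon from the paper.
    import re

    WEIGHTS = {"jihad": 3.0, "martyrdom": 2.0, "kuffar": 2.0}  # placeholder

    def intensity(text: str) -> float:
        words = re.findall(r"[a-z']+", text.lower())
        if not words:
            return 0.0
        score = sum(WEIGHTS.get(w, 0.0) for w in words)
        return 1000.0 * score / len(words)  # rate per 1000 words

    # issues maps issue name -> full text (loading is up to the reader)
    issues = {"Inspire 11": "...", "Inspire 12": "..."}
    for name, text in sorted(issues.items()):
        print(f"{name}: {intensity(text):.2f}")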

A draft of the paper about this is available on SSRN:

        http://ssrn.com/abstract=2384167

Protecting data

In the world of things, we often value objects more than the rest of the world would value them — the inherited china or silver, the souvenir bought on a meaningful trip, and so on.

In the world of data, this seems to be exactly the other way around: once we’ve captured some data (and perhaps used it to model our customers), most of its value to us has been exhausted. So we fail to see that it may have much greater value to others than it has to us, and so fail to protect it well. For example, Adobe used, and uses, its data about customers for its own internal purposes, but it clearly failed to realize that this data was of huge potential value to criminals, who can use it for identity theft.

The bottom line is: data should be protected according to its real-world, open-market value, not its current value to the business that holds it. Until this sinks in, we are going to continue to see data breaches in businesses and governments.

Understanding “anomaly” in large dynamic datasets

A pervasive mental model of what it means to be an “anomaly” is that this concept is derived from difference or dissimilarity; anomalous objects or records are those that are far from the majority, common, ordinary, or safe records. This intuition is embedded in the language used — for example, words like “outlier”.

May I suggest that a much more helpful, and more practical, intuition of what “anomaly” means comes from considering boundaries rather than dissimilarity. Consider the following drastically simplified rendering of a clustering:

[Figure: a simplified clustering: three clusters plus outlying points A, B, C, and D]

There are 3 obvious clusters and a selection of individual points. How are we to understand these points?

The point A, which would conventionally be considered the most obvious outlier, is probably actually the least interesting. Points like this are almost always the result of some technical problem on the path between data collection and modelling. You wouldn’t think this would happen with automated systems, but it’s actually surprisingly common for data not to fit properly into a database schema, or for data to be shifted over one column in a spreadsheet, and that’s exactly the kind of thing that leads to points like A. An inordinate amount of analyst attention can be focused on such points because they look so interesting, but they’re hardly ever of practical importance.

Points B and C create problems for many outlier/anomaly detection algorithms because they aren’t particularly far from the centre of gravity of the entire dataset. Sometimes points like these are called local outliers or inliers and their significance is judged by how far they are (how dissimilar) from their nearest cluster.

Such accounts are inadequate because they are too local. A much better way to judge B and C is to consider the boundaries between each cluster and the aggregate of the remaining clusters, and then to consider how close such points lie to these boundaries. For example, B lies close to the boundary between the lower left cluster and the rest, and is therefore an interesting anomalous point. If it were slightly further down in the clustering, it would be less anomalous, because it would be closer to the lower left cluster and further from this boundary. Point C is more anomalous than B because it lies close to three boundaries: between the lower left cluster and the rest, between the upper left cluster and the rest, and between the rightmost cluster and the rest. (Note that a local outlier approach might not consider C anomalous at all, because it’s close to all three clusters.)

The point D is less anomalous than B and C, but is also close to a boundary: the boundary that wraps the rightmost cluster. So this idea can be extended to many different settings. For example, wrapping a cluster more or less tightly changes the set of points that are “outside” the wrapping, and so gives an ensemble score for how unusual the points on the fringe of a cluster might be. This is especially important in adversarial settings, because these fringes are often where those with bad intent lurk.

The heart of this approach is that anomaly must be a global property derived from all of the data, not just a local property derived from the neighbourhood of the point in question. Boundaries encode non-local properties in a way that similarity (especially similarity in a geometry, which is usually how clusterings are encoded) does not.

The other attractive feature of this approach is that it actually defines regions of the space based on the structure of the “normal” clusters. These regions can be precomputed and then, when new points arrive, it’s fast to decide how to understand them. In other words, the boundaries become ridge lines of high abnormality in the space, and it’s easy to see and understand the height of any other point in the space. Thus the model works extremely effectively for dynamic data, as long as there’s an initial set of normal data to prime the system. (New points can also be exploited as feedback to the system so that, if a sequence of points arrives in a region, the first few will appear as strong anomalies, but their presence creates a new cluster, and hence a new set of boundaries, so that later points in the same region no longer appear anomalous.)
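
One crude way to operationalize boundary-closeness, as a sketch of the idea rather than the full model: cluster the normal data once, then score each new point by the margin between its distances to the nearest and second-nearest cluster centres. A small margin means the point sits near the bisector between two clusters, as B and C do; and since the centres are precomputed, new points can be scored quickly.

    # Sketch: boundary-proximity scoring with precomputed cluster centres.
    # Small margins flag points near inter-cluster boundaries (like B, C).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    normal = np.vstack([rng.normal(c, 0.5, size=(100, 2))
                        for c in [(0, 0), (0, 5), (6, 2)]])  # 3 clusters

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(normal)

    def boundary_margin(points):
        d = km.transform(np.asarray(points))  # distances to each centre
        d.sort(axis=1)
        return d[:, 1] - d[:, 0]  # small value = close to a boundary

    print(boundary_margin([[3.0, 2.5],    # between clusters: anomalous
                           [0.0, 0.1]]))  # deep inside a cluster: normal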

Businesses processing emails

The Daily Mail reports an experiment by the High-Tech Bridge company in which they sent private emails, or uploaded documents, containing unique urls to 50 different platforms, and then waited to see whether, and by whom, these urls were visited.

Sure enough, several of them were visited by the businesses that had handled the matching document, including Facebook, Twitter, and Google. This won’t come as a surprise to readers of this blog, but it once again points out the extent to which businesses like these process any documents they see to extract models of the sender and receiver.
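
The mechanics of such an experiment are easy to reproduce. Here’s a hypothetical sketch: generate a unique url for each platform, send it through that platform, and later scan the web server’s access log for the token. (The domain and log file are placeholders.)

    # Hypothetical sketch of the canary-url experiment. The domain, paths,
    # and log file are placeholders.
    import uuid

    def make_canary_url() -> str:
        return f"https://example.org/canary/{uuid.uuid4().hex}"

    # one unique url per platform under test
    tokens = {platform: make_canary_url()
              for platform in ("webmail-a", "social-b")}

    def was_visited(logfile: str, url: str) -> bool:
        token = url.rsplit("/", 1)[1]
        with open(logfile) as f:
            return any(token in line for line in f)

    # e.g. after sending the urls and waiting a few days:
    # for platform, url in tokens.items():
    #     print(platform, was_visited("/var/log/nginx/access.log", url))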

There has been some confusion in the media about how this process might work. Evidently it’s not obvious to many that such a process is automated: nobody is ‘reading’ these documents; they’re being processed by software that is capable of ingesting the pages pointed to, and of processing the contents of those pages as well. It would help if we agreed on verbs that distinguish ‘read by a human’ from ‘processed by software’ and that are simple enough for the wider public to understand the difference.

Benford’s Law in action

Benford’s Law is about the distribution of initial digits in numbers from the real world. It plays a role in detecting, for example, financial fraud in tax returns because made-up numbers are quite easily distinguishable from actual ones.

There have been several attempts to explain Benford’s Law based on the processes that give rise to actual numbers. I’ve been analysing U.S. State of the Union speeches over the past 200+ years. The patterns agree with what Benford’s Law would predict, but it’s much less clear that the putative explanations make sense, given the time scale and varying authorship of these documents.
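
For reference, Benford’s Law predicts that a first digit d occurs with probability log10(1 + 1/d), so about 30% of numbers should start with 1 and fewer than 5% with 9. Here’s a quick sketch of the check (the corpus file name is hypothetical):

    # Compare observed first-digit frequencies with Benford's prediction,
    # log10(1 + 1/d). The corpus file name is hypothetical.
    import math
    import re
    from collections import Counter

    text = open("state_of_the_union.txt").read()
    digits = [m[0] for m in re.findall(r"[1-9][0-9]*", text)]
    counts = Counter(digits)
    total = sum(counts.values()) or 1

    for d in "123456789":
        observed = counts.get(d, 0) / total
        benford = math.log10(1 + 1 / int(d))
        print(f"{d}: observed {observed:.3f}  Benford {benford:.3f}")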

Here’s the list of the top 100 in decreasing order of frequency of occurrence, with increasing indents by magnitude:

          000
one
two
1
2
3
three
     30
          million
5
4
          billion
     10
6
four
     20
          500
     12
five
7
     15
          hundred
9
     50
     25
          100
8
     11
     14
ten
six
     40
     13
     16
     18
     00
     22
          thousand
seven
     17
               1947
          300
          200
     24
eight
     60
               1890
     27
     21
     35
     28
     26
     23
     19
     90
          400
     80
               1893
     70
               1946
               1945
     75
               1899
          600
     31
     twenty
     33
          700
     45
                1860
                1891
                1898
     65
          250
          150
                1892
     fifty
                1900
     twelve
     41
     37
                1894
nine
                1897
                1846
                1861
     36
                1878
     55
                1911
                1889
     thirty
                1909
     29
     32
                1885
                1858
     54
     34
     63

Within each range, the larger the number, the lower its frequency. But there are some interesting exceptions: numbers that are multiples of 5 or 10 tend to appear ‘earlier’ than they should. The references to years almost all cluster around the end of the 19th Century and the beginning of the 20th.

Simmelian backbones in social networks

There’s growing understanding that many, perhaps most, social networks have a structure with two main parts: a central, richly connected part sometimes called the hairball, and a set of much smaller, more isolated parts sometimes called whiskers. A growth process that reproduces this well is the forest fire model which, roughly speaking, models a new person joining one of the whiskers and then slowly building connections via friends of friends that tie them into the hairball, eventually causing the whole whisker to be absorbed into the hairball. The hairball is therefore the result of the accretion of a large number of whiskers that overlap increasingly richly.
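
A toy version of this growth process, loosely following the description above (the parameters are illustrative, and this is not the exact model from the literature), looks like this:

    # Toy forest-fire-style growth: each new node picks a random
    # "ambassador", links to it, then spreads to its neighbours with
    # probability p, and onward recursively.
    import random
    import networkx as nx

    def forest_fire(n: int, p: float = 0.35, seed: int = 0) -> nx.Graph:
        rng = random.Random(seed)
        g = nx.Graph([(0, 1)])
        for new in range(2, n):
            frontier = [rng.choice(list(g.nodes))]  # the "ambassador"
            burned = {new}
            while frontier:
                v = frontier.pop()
                if v in burned:
                    continue
                burned.add(v)
                g.add_edge(new, v)
                # the fire spreads along some of v's edges
                frontier += [u for u in g.neighbors(v) if rng.random() < p]
        return g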

The problem, from a social network analysis point of view, is that the hairball, as its name suggests, is a big mess of overlapping communities. For reasons I’ve written about before, applying clustering or cut techniques to the hairball doesn’t do very well because there are many “long” edges between communities (since the whole graph is typically small world).

One of the talks at ASONAM 2013 provides some hope of being able to look at the structure of the hairball, using the insight that triangles are more useful than edges. The likelihood of triangles containing two “long” edges is low, so triangles are a good way of distinguishing which edges lie within communities. And, of course, social theories have posited triangles as an important pattern of social interaction for a century, most recently in the book Tribal Leadership, which points out that the advantage of a triangle relationship is that, when two people fall out, the third person can mediate.

Roughly speaking, the trick is to consider how many triangles each pair of nodes share and leave connections between them only when this number is large enough. The paper shows some nice examples of how this works on real-world datasets, with some convincing evidence that it does the right thing.
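
A stripped-down version of the triangle idea (the paper’s actual method is more subtle, ranking each node’s ties rather than applying one global threshold) keeps an edge only when its endpoints share enough common neighbours:

    # Simplified sketch, not the full Simmelian backbone algorithm:
    # keep an edge only if it participates in at least k triangles,
    # i.e. its endpoints have at least k common neighbours.
    import networkx as nx

    def triangle_backbone(g: nx.Graph, k: int = 3) -> nx.Graph:
        strong = [(u, v) for u, v in g.edges()
                  if sum(1 for _ in nx.common_neighbors(g, u, v)) >= k]
        backbone = nx.Graph()
        backbone.add_nodes_from(g.nodes())
        backbone.add_edges_from(strong)
        return backbone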

This approach is in contrast to previous work that has tried to peel the hairball by removing the whiskers and their edges into the hairball, and then removing the new whiskers that this reveals, and so on. While this peeling approach seems sensible and tries, in a way, to undo the hypothetical formation process, it is much less clear that it can get the order right. A “clustering” technique that is order-independent seems inherently more attractive.

The reference is: Bobo Nick, Conrad Lee, Pádraig Cunningham and Ulrik Brandes. Simmelian Backbones, ASONAM 2013; and there’s a version here: http://www.inf.uni-konstanz.de/algo/publications/nlcb-sb-13.pdf


