Posts Tagged 'clustering'

Annular similarity

When similarity is used for clustering, then obviously the most similar objects need to be placed in the same cluster.

But when similarity is being used for human consumption, a different dynamic is in play — humans usually already know what the most similar objects are, and are interested in those that are (just) beyond those.

This can be seen most clearly in recommender systems. Purchase an item or watch a Netlflix show, and your recommendation list will fill up with new objects that are very similar to the thing you just bought/watched.

From a strictly algorithm point of view, this is a success — the algorithm found objects similar to the starting object. But from a human point of view this is a total fail because it’s very likely that you, the human, already know about all of these recommended objects. If you bought something, you probably compared the thing you bought with many or all of the objects that are now being recommended to you. If you watched something, the recommendations are still likely to be things you already knew about.

The misconception about what similarity needs to mean to be useful to humans is at the heart of the failure of recommender systems, and even the ad serving systems that many of the online businesses make their money from. Everyone has had the experience of buying something, only to have their ad feed (should they still see it) fill up with ads for similar products (“I see you just bought a new car — here are some other new cars you might like”).

What’s needed is annular similarity — a region that is centred at the initial object, but excludes new objects that are too similar, and focuses instead on objects that are a bit similar.

Amazon tries to do this via “People who bought this also bought” which can show useful add-on products. (They also use “People who viewed this also viewed” but this is much less effective because motivations are so variable.) But this mechanism also fails because buying things together doesn’t necessarily mean that they belong together — it’s common to see recommendations based on the fact that two objects were on special on the same day, and so more likely to be bought together because of the opportunity, rather than any commonality.

Annular similarity is also important in applications that help humans to learn new things: web search, online courses, intelligence analysis. That’s why we built the ATHENS divergent web search engine (refs below) — give it some search terms and it returns (clusters of) web pages that contain information that is just over the horizon from the search terms. We found that this required two annuli — we first constructed the information implicit in the search terms, then an annulus around that of information that we assumed would be known to someone who knew the core derived from the search terms, and only then did we generate another annulus which contains the results returned.

We don’t know many algorithmic ways to find annular similarity. In any distance-based clustering it’s possible, of course, to define an annulus around any point. But it’s tricky to decide on what the inner and outer radii should be, the calculations have to happen in high-dimensional space where the points are very sparse, and it’s not usually clear whether the space is isotropic.

Annular similarity doesn’t work (at least straightforwardly) in density-based (e.g. DBScan) or distribution-based clustering (e.g. EM) because the semantics of ‘cluster’ doesn’t allow for an annulus.

One way that does work (and was used extensively in the ATHENS system) is based on singular vallue decomposition (SVD). An SVD projects a high-dimensional space into a low-dimensional one in such a way as to preserve as much of the variation as possible. One of its useful side-effects is that a point that is similar to many other points tends to be projected close to the origin; and a point that is dissimilar to most other points also tends to be projected close to the origin because the dimension(s) it inhabits have little variation and tend to be projected away. In the resulting low-dimensional projection, points far from the origin tend to be interestingly dissimilar to those at the centre of the structure — and so an annulus imposed on the embedding tends to find an interesting set of objects.

Unfortunately this doesn’t solve the recommender system problem because recommenders need to find similar points that have more non-zeroes than the initial target point — and the projection doesn’t preserve this ordering well. That means that the entire region around the target point has to be searched, which becomes expensive.

There’s an opportunity here to come up with better algorithms to find annular structures. Success would lead to advances in several diverse areas.

(A related problem is the Sound of Music problem, the tendency for a common/popular object to muddle the similarity structure of all of the other objects because of its weak similarity to all of them. The Sound of Music plays this role in movie recommendation systems, but think of wrapping paper as a similar object in the context of Amazon. I’ve written about this in a previous post.)


Tracy A. Jenkin, Yolande E. Chan, David B. Skillicorn, Keith W. Rogers:
Individual Exploration, Sensemaking, and Innovation: A Design for the Discovery of Novel Information. Decision Sciences 44(6): 1021-1057 (2013)
Tracy A. Jenkin, David B. Skillicorn, Yolande E. Chan:
Novel Idea Generation, Collaborative Filtering, and Group Innovation Processes. ICIS 2011
David B. Skillicorn, Nikhil Vats:
Novel information discovery for intelligence and counterterrorism. Decision Support Systems 43(4): 1375-1382 (2007)
Nikhil Vats, David B. Skillicorn:
Information discovery within organizations using the Athens system. CASCON 2004: 282-292


The right level of abstraction = the right level to model

I think the take away from my last post is that models of systems should aim to model them at the right level of abstraction, where that right level corresponds to the places where there are bottlenecks. These bottlenecks are places where, as we zoom out in terms of abstraction, the system suddenly seems simpler. The underlying differences don’t actually make a difference; they are just variation.

The difficulty is that it’s really, really hard to see or decide where these bottlenecks are. We rightly laud Newton for seeing that a wide range of different systems could all be described by a single equation; but it’s also true that Einstein showed that this apparent simplicity was actually an approximation for a certain (large!) subclass of systems, and so the sweet spot of system modelling isn’t quite where Newton thought it was.

For living systems, it’s even harder to see where the right level of abstraction lies. Linnaeus (apparently the most-cited human) certainly created a model that was tremendously useful, working at the level of the species. This model has frayed a bit with the advent of DNA technology, since the clusters from observations don’t quite match the clusters from DNA, but it was still a huge contribution. But it’s turning out to be very hard to figure out the right level of abstractions to capture ideas like “particular disease” “particular cancer” even though we can diagnose them quite well. The variations in what’s happening in cells are extremely difficult to map to what seems to be happening in the disease.

For human systems, the level of abstraction is even harder to get right. In some settings, humans are surprisingly sheep-like and broad-brush abstractions are easy to find. But dig a little, and it all falls apart into “each person behaves as they like”. So predicting the number of “friends” a person will have on a social media site is easy (it will be distributed around Dunbar’s number), but predicting whether or not they will connect with a particular person is much, much harder. Does advertising work? Yes, about half of it (as Ogilvy famously said). But will this ad influence this person? No idea. Will knowing the genre of this book or film improve the success rate of recommendations? Yes. Will it help with this book and this person? Not so much.

Note the connection between levels of abstraction and clustering. In principle, if you can cluster (or, better, bicluster) data about your system and get (a) strong clusters, and (b) not too many of them, then you have some grounds for saying that you’re modelling at the right level. But this approach founders on the details: which attributes to include, which algorithm to use, which similarity measure, which parameters, and so on and on.

Crime in Chicago

The Chicago Police Department makes details of all of its incidents available. For each one, there’s a record describing what kind of incident and crime it was, where it took place (thinly anonymized), and when it happened.

This data is available for more than a decade’s worth of crimes (a big file!) but I’ve used one subset of just over a month as a working dataset. What sorts of things can be learned from such data? One research project looked at seasonal patterns, and discovered that there are strong and consistent patterns over time.

I’m more interested in this dataset as a publicly available example of the kind of information that might be collected about terrorist incidents, IED explosions, and the like. Such data is of mixed type: some fields are numeric, others (times and dates) are cyclic, and still others are textual. So it’s not straightforward to apply knowledge discovery algorithms to them.

In what follows, I’m using a hashing technique to deal with the non-numeric fields, z-scoring to treat variation in each attribute as equally significant, and singular value decomposition to project the data into lower dimensions and to visualize it. To help understand which attributes are making an interesting difference, the resulting plot can be overlaid so that the colour of each point corresponds to the value of that particular attribute for the record corresponding to that point. (In an earlier post, I did a similar analysis for the complete START set of terrorist attacks.)

Incidents are labelled with the physical coordinates of where they took place, so one way to visualize the data is to plot each of the attributes against position. Here is a figure showing the distribution on incidents in space, labelled their FBI crime descriptor:

There are patterns to be seen, but they are weak and hard to pick out.

The advantage of clustering using singular value decomposition (SVD) is that we can see the effects of all of the attributes on which incidents resemble others, without having to know anything in advance about the signficance of any one of them.

Here’s the clustering derived from SVD:

Clustering of all incidents derived from SVD

Clustering of all incidents derived from SVD

It’s clear from this figure that there are clusters, i.e. not all crimes are one-offs, nor are all crimes essentially the same. But what properties of these incidents accounts for the 7 or so strong clusters? That’s where being able to overlay single attributes on the clusters can help.

For example, if we overlay time of day on the clustering like this:

Clusters overlaid with time of day

Clusters overlaid with time of day

we see that time of day is somehow orthogonal to the clustering — it has some relevance, but each of the clusters has the same internal structure with respect to time. So this doesn’t help to explain the clustering but it does suggest that there is deeper structure that is connected to time.

On the other hand, if we overlay the clustering with whether or not it led to an arrest:

Clustering overlaid with arrests

Clustering overlaid with arrests

we see that arrested or not plays a major differentiating role in the clustering. Similarly, if we overlay whether or not the incident was domestic:

Clustering overlaid with domestic or not

Clustering overlaid with domestic or not

we see that this also makes a big difference in the clustering.

If we overlay with the primary description of the incident:

Clustering overlaid with the primary classification of the incident.

Clustering overlaid with the primary classification of the incident.

then we see that another part of the clustering is explained. Notice that, while arrested or not varies from top to bottom, incident classification varies from left to right. In other words, macroscopically at least, there is no correlation between type of crime and arrest rate — they vary in an uncorrelated way — which is a good thing.

There’s a similar structure in the secondary crime description attribute:

Clustering overlaid with the secondary crime classification

Clustering overlaid with the secondary crime classification

We can get another sense of which attributes are driving the clustering by plotting the attributes, produced from the same SVD. In these plots, points far from the centre are more significant than those close to the centre, and those in the same direction from the centre are correlated. Furthermore, points representing incidents are “pulled” towards attributes for which they have large values. So this plot provides a sense of which attributes play the largest role in differentiating incidents, and how they fit together.

Variation among attributes aligned with variation among incidents

Variation among attributes aligned with variation among incidents

In fact, we can plot incidents and attributes in the same plot to make the connections obvious:

The relationship between attributes and incidents

The relationship between attributes and incidents

This dataset also shows how difficult some of the problems of anomaly detection are. Suppose we wanted to answer the question: which incident was the most unusual in this dataset? The SVD provides a theoretically well-motivated answer: the one whose representative point is farthest from the origin. However, looking at the clustering, this theoretical answer seems rather weak.

Understanding “anomaly” in large dynamic datasets

A pervasive mental model of what it means to be an “anomaly” is that this concept is derived from difference or dissimilarity; anomalous objects or records are those that are far from the majority, common, ordinary, or safe records. This intuition is embedded in the language used — for example, words like “outlier”.

May I suggest that a much more helpful, and even more practical, intuition of what “anomaly” means comes from the consideration of boundaries rather than dissimilarity. Consider the following drastically simplifed rendering of a clustering:


There are 3 obvious clusters and a selection of individual points. How are we to understand these points?

The point A, which would conventionally by considered the most obvious outlier, is probably actually the least interesting. Points like this are almost always the result of some technical problem on the path between data collection and modelling. You wouldn’t think this would happen with automated systems, but it’s actually surprisingly common for data not to fit properly into a database schema or for data to be shifted over one column in a spreadsheet, and that’s exactly the kind of thing that leads to points like A. An inordinate amount of analyst attention can be focused on such points because they look so interesting, but they’re hardly ever of practical importance.

Points B and C create problems for many outlier/anomaly detection algorithms because they aren’t particularly far from the centre of gravity of the entire dataset. Sometimes points like these are called local outliers or inliers and their significance is judged by how far they are (how dissimilar) from their nearest cluster.

Such accounts are inadequate because they are too local. A much better way to judge B and C is to consider the boundaries between each cluster and the aggregate rest of the clusters; and then to consider how close such points lie to these boundaries. For example, B lies close to the boundary between the lower left cluster and the rest and is therefore an interesting anomalous point. If it were slightly further down in the clustering it would be less anomalous because it would be closer to the lower left cluster and further from this boundary. Point C is more anomalous than B because it lies close to three boundaries: those between the lower left cluster and the rest, between the upper left cluster and the rest, and the rightmost cluster and the rest. (Note that a local outlier approach might not think C is anomalous because it’s close to all three clusters.)

The point D is less anomalous  than B and C, but is also close to a boundary, the boundary the wraps the rightmost cluster. So this idea can be extended to many different settings. For example, wrapping a cluster more or less tightly changes the set of points that are “outside” the wrapping and so gives an ensemble score for how unusual the points on the fringe of a cluster might be. This is especially important in adversarial settings, because these fringes are often where those with bad intent lurk.

The heart of this approach is that anomaly must be a global property derived from all of the data, not just a local property derived from the neighbourhood of the point in question. Boundaries encode non-local properties in a way that similarity (especially similarity in a geometry, which is usually how clusterings are encoded) does not.

The other attractive feature of this approach is that it actually defines regions of the space based on the structure of the “normal” clusters. These regions can be precomputed and then, when new points arrive, it’s fast to decide how to understand them. In other words, the boundaries become ridge lines of high abnormality in the space and it’s easy to see and understand the height of any other point in the space. Thus the model works extremely effectively for dynamic data as long as there’s an initial set of normal data to prime the system. (New points can also be exploited as feedback to the system so that, if a sequence of points arrive in a region, the first few will appear as strong anomalies, but their presence creates a new cluster, and hence a new set of boundaries that mean that newer points in the same region no longer appear anomalous).

Terrorist incidents come in only a few flavors

Terrorist attacks are different in many ways: they take place in different countries, with different motivations behind them, using different mechanisms, and with varying degrees of success. But are there any commonalities that could be used, for example, to categorize them and so to defend against them in more focused ways? The answer is yes, there are large-scale similarities.

To do this analysis, I started from the Global Terrorism Database developed by START, the National Consortium for the Study of Terrorism and Responses to Terrorism. The database contains details of all incidents that meet their coding standards since the beginning of 1970, and I used the version released at the end of 2012. There was one major discontinuity where new fields were added but overall the coding has been consistent over the entire 40+ year period.

The image below shows the clustering of all attacks over that time period:

attackslabelledThe large structure looks like a hinge with clusters A and B at the top, clusters C and D forming the hinge itself, and clusters E, F, G, and H at the bottom. There’s also a distinction between the clusters at the front (B, D, F, and H) and those at the back (A,C,E, and G). (You’ll have to expand the figure to see the labels clearly.)

The first thing to notice is that there are only 8 clusters and, with the exception of H which is quite diffuse, they clusters are fairly well defined. In other words, there are 8 distinctive kinds of terrorist attack (and only 8, over a very long time period).

Let’s dig into these clusters and see what they represent. The distinction between the front and the back is almost entirely related to issues of attribution: whether the attack was claimed, how clear that claim is (for example, are there multiple claim of responsibility for the same incident), and whether the incident should be properly claimed as terrorism or something else (quasi-military, for example).

The structure of the hinge differentiates between incidents involving capturing people (hijackings or kidnappings in A and B) and incidents that are better characterized as attacks (C, D, E, F, G, H).  The extremal ends of A and B (to the right) are incidents that lasted longer and/or the ransom was larger.

The differences between C/D, E/F, and G/H arise from the number of targets (which seems to be highly correlated with the number of different nationalities involved). So C and D are attacks on a single target, E and F are attacks on two targets, and G and H are attacks on three targets. Part of the diffuse structure of H happens because claims are always murkier for more complex attacks and part because there is a small group of incidents involving 4 targets that appears, as you’d expect, even further down and to the right.

Here are some interesting figures which overlay the intensity of a property on the clustering, so that you can see how it’s associated with the clusters:


This figure shows whether the incident was claimed or not. The color coding runs from dark red to bright yellow; I’m not specifying the direction, because it’s complicated, but the contrast shows differences. In each case, the available color spectrum is mapped to the range of values.


This figure shows the differences between incidents where there were some hostages or kidnapped and those where there weren’t.

overlaycountryThis figure shows that the country in which the incident took place is mostly unrelated to other properties of the incident; in other words, attacks are similar no matter where they take place.

This analysis shows that, despite human variability, those designing terrorist incidents choose from a fairly small repertoire of possibilities. That’s not to say that there couldn’t be attacks in which some people are also taken hostage; rather that those doing the planning don’t seem to conceptualize incidents that way, so when it happens it’s  more or less by accident. Perhaps some kind of Occam’s razor plays a role: planning an incident is already difficult so there isn’t a lot of brainpower to try for extra cleverness, and there’s probably also a perception that complexity increases risk.

Understanding High-Dimensional Spaces

My new book with the title above has been published by Springer, just in time for Christmas gift giving for the data miner on your list.

The book explores how to represent high-dimensional data (which almost all data is), and how to understand the models, particularly for problems where the goal is to find the most interesting subset of the records. “Interesting”, of course, means different things in different settings; a big part of the focus is on finding outliers and anomalies.

Partly the book is a reaction to the often unwitting assumption that clouds of data can be understood as if they had a single centre — for example, much of the work on social networks.

The most important technical ideas are (a) that clusters themselves need to be understood as having a structure which provides each one with a higher-level context that is usually important to make sense of them, and (b) that the empty space between clusters also provides information that can help to understand the non-empty-space.

You can buy the book here.

Finding things you don’t already know

One of the big problems with the tools available on the web is that they are all convergent — in other words, they help you find out more about things you already know something about. Search engines are the most obvious example. Almost nobody looks beyond the first one or two results returned, so there’s almost no opportunity to be surprised by something that was returned in response to a query. Blogs and tags look, at first glance, as if they’re a little better; but after a while you come to realize that an awful lot of content is recycled among different blogs, and genuinely coming across something new and unexpected is a rarity.

We started working, some years ago, on a divergent information system called Athens. It’s designed to show you things that you don’t already know about, but which you are well-positioned to understand. So you get to find out new things, but not just random new things — things that should make some sense to you.

In a nutshell, the way it works is as follows:

  • You provide some “search terms” that indicate an area that you already know about.
  • To reduce the dependence of the system on your particular choices of initial search terms, Athens uses them as search terms, fetches a large set of pages, extracts their content, and treats this as a definition of the “area you already know about”.
  • Athens then searches out one level from this initial area using a combination of your original search terms, and new terms discovered from understanding the initial area. A large set of pages are retrieved and clustered. The content of these new pages is assumed to be things you already know about, because they are so closely related to the initial subject area.
  • Athens then repeats the process using combinations of search terms from the new clusters, and from the topics of the initial area. Again, a large set of pages are retrieved and clustered.
  • These clusters are presented as the results of the search. They are two levels away from the initial search, and so are likely to be both novel and related to the initial area of knowledge.

This works well, but there are some intricate problems along the way.

The first is that people actually have trouble getting their minds around the idea of novel knowledge. They keep being sucked into expectations derived from search engines such as Google and Yahoo — they expect, unconsciously, to see things that look familiar. And when they don’t, they want to see and be reassured about the connections between the new ideas and the old.

The second problem is that it’s hard to present the content of a cluster of novel information in a way that makes sense, given that the user doesn’t know yet what it is about. This goes to show how much we fill in the results of ordinary search with background and contextual information without realizing that we are doing it.

The other thing that surprises people is how long the process takes. A search engine query takes less than a second. An Athens query can take 8-10 hours!

Athens is useful for novel knowledge discovery in a number of different ways, which I’ll talk about in the next post.