Understanding “anomaly” in large dynamic datasets

A pervasive mental model of what it means to be an “anomaly” is that this concept is derived from difference or dissimilarity; anomalous objects or records are those that are far from the majority, common, ordinary, or safe records. This intuition is embedded in the language used — for example, words like “outlier”.

May I suggest that a much more helpful, and even more practical, intuition of what “anomaly” means comes from the consideration of boundaries rather than dissimilarity. Consider the following drastically simplifed rendering of a clustering:


There are 3 obvious clusters and a selection of individual points. How are we to understand these points?

The point A, which would conventionally by considered the most obvious outlier, is probably actually the least interesting. Points like this are almost always the result of some technical problem on the path between data collection and modelling. You wouldn’t think this would happen with automated systems, but it’s actually surprisingly common for data not to fit properly into a database schema or for data to be shifted over one column in a spreadsheet, and that’s exactly the kind of thing that leads to points like A. An inordinate amount of analyst attention can be focused on such points because they look so interesting, but they’re hardly ever of practical importance.

Points B and C create problems for many outlier/anomaly detection algorithms because they aren’t particularly far from the centre of gravity of the entire dataset. Sometimes points like these are called local outliers or inliers and their significance is judged by how far they are (how dissimilar) from their nearest cluster.

Such accounts are inadequate because they are too local. A much better way to judge B and C is to consider the boundaries between each cluster and the aggregate rest of the clusters; and then to consider how close such points lie to these boundaries. For example, B lies close to the boundary between the lower left cluster and the rest and is therefore an interesting anomalous point. If it were slightly further down in the clustering it would be less anomalous because it would be closer to the lower left cluster and further from this boundary. Point C is more anomalous than B because it lies close to three boundaries: those between the lower left cluster and the rest, between the upper left cluster and the rest, and the rightmost cluster and the rest. (Note that a local outlier approach might not think C is anomalous because it’s close to all three clusters.)

The point D is less anomalous  than B and C, but is also close to a boundary, the boundary the wraps the rightmost cluster. So this idea can be extended to many different settings. For example, wrapping a cluster more or less tightly changes the set of points that are “outside” the wrapping and so gives an ensemble score for how unusual the points on the fringe of a cluster might be. This is especially important in adversarial settings, because these fringes are often where those with bad intent lurk.

The heart of this approach is that anomaly must be a global property derived from all of the data, not just a local property derived from the neighbourhood of the point in question. Boundaries encode non-local properties in a way that similarity (especially similarity in a geometry, which is usually how clusterings are encoded) does not.

The other attractive feature of this approach is that it actually defines regions of the space based on the structure of the “normal” clusters. These regions can be precomputed and then, when new points arrive, it’s fast to decide how to understand them. In other words, the boundaries become ridge lines of high abnormality in the space and it’s easy to see and understand the height of any other point in the space. Thus the model works extremely effectively for dynamic data as long as there’s an initial set of normal data to prime the system. (New points can also be exploited as feedback to the system so that, if a sequence of points arrive in a region, the first few will appear as strong anomalies, but their presence creates a new cluster, and hence a new set of boundaries that mean that newer points in the same region no longer appear anomalous).


0 Responses to “Understanding “anomaly” in large dynamic datasets”

  1. Leave a Comment

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: