Posts Tagged 'anomaly detection'

Provenance — a neglected aspect of data analytics

Provenance is defined by Merriam-Webster as “the history of ownership of a valued object or work of art or literature”, but the idea has much wider applicability.

There are three kinds of provenance:

  1. Where did an object come from. This kind of provenance is often associated with food and drink: country of origin for fresh produce, brand for other kinds of food, appellation d’origine contrôlée for French wines, and many other examples. This kind of provenance is usually signalled by something that is attached to the object.
  2. Where did an object go on its way from source to destination. This is actually the most common form of provenance historically — the way that you know that a chair really is a Chippendale is to be able to trace its ownership all the way back to the maker. A chair without provenance is probably much less valuable, even though it may look like a Chippendale, and the wood seems the right age. This kind of provenance is beginning to be associated with food. For example, some shipments now have temperature sensors attached to them that record the maximum temperature they ever encountered between source and destination. Many kinds of shipments have had details about their pathway and progress available to shippers, but this is now being exposed to customers as well. So if you buy something from Amazon you can follow its progress (roughly) from warehouse to you.
  3. The third kind of provenance is still in its infancy — what else did the object encounter on its way from source to destination. This comes in two forms. First, what other objects was it close to? This is the essence of Covid19 contact tracing apps, but it applies to any situation where closeness could be associated with poor outcomes. Second, were the objects that it was close to ones that were expected, or that made sense?

The first and second forms of provenance don’t lead to interesting data-analytics problems. They can be solved by recording technologies with, of course, issues of reliability, unforgeability, and non-repudiation.

But the third case raises many interesting problems. Public health models of the spread of infection usually assume some kind of random particle model of how people interact (with various refinements such as compartments). These models would be much more accurate if they could be based on actual physical encounter networks — but privacy quickly becomes an issue. Nevertheless, there are situations where encounter networks are already collected for other reasons: bus and train driver handovers, shift changes of other kinds, police-present incidents; and such data provides natural encounter networks. [One reason why Covid19 contact tracing apps work so poorly is that Bluetooth proximity is a poor surrogate for potentially infectious physical encounter.]

Customs also has a natural interest in provenance: when someone or something presents at the border, the reason they’re allowed to pass or not is all about provenance: hard coded in a passport, pre-approved by the issue of a visa, or with real-time information derived from, say, a vehicle licence plate.

Some clearly suspicious, but hard-to-detect, situations arise from mismatched provenance. For example, if a couple arrive on the same flight, then they will usually have been seated together; if two people booked their tickets or got visas using the same travel agency at the same time then they will either arrive on different flights (they don’t know each other), or they will arrive on the same flight and sit together (they do know each other). In other words, the similarity of provenance chains should match the similarity of relationships, and mismatches between the two signal suspicious behaviour. Customs data analytics is just beginning to explore leveraging this kind of data.

Predicting the Future

Arguably the greatest contribution of computing to the total of human knowledge is that relatively simple results from theoretical models of computation show that the future is an inherently unknowable place — not just in practice, but for fundamental reasons.
A Turing machine is a simple model of computation with two parts: an infinite row of memory elements, each able to contain a single character; and a state machine, a simple device that is positioned at one of the memory elements, and makes moves by inspecting the single character in the current element and using a simple transition table to decide what to do next. Possible next moves include changing the current character to another one (from a finite alphabet), moving one element to the right or to the left, changing the machine’s own state, or halting.
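
To make this concrete, here is a minimal Turing machine simulator, sketched in Python; the transition-table format and the little example machine are my own illustration, not part of the formal definition above.

# A minimal Turing machine simulator (illustrative sketch).
# A machine is a transition table mapping (state, symbol) -> (new symbol, move, new state),
# where move is -1 (left), 0 (stay), or +1 (right). No entry for the current
# (state, symbol) pair means the machine halts.
from collections import defaultdict

def run(transitions, tape, state="start", max_steps=1000):
    cells = defaultdict(lambda: "_", enumerate(tape))   # unvisited cells read as blank "_"
    pos = 0
    for _ in range(max_steps):
        key = (state, cells[pos])
        if key not in transitions:                       # no rule: halt and return the tape
            return "".join(cells[i] for i in sorted(cells)).strip("_")
        symbol, move, state = transitions[key]
        cells[pos] = symbol
        pos += move
    return None   # still running; in general we cannot tell whether it will ever halt

# Hypothetical example machine: flip 0s and 1s until the first blank, then halt.
flipper = {
    ("start", "0"): ("1", +1, "start"),
    ("start", "1"): ("0", +1, "start"),
}
print(run(flipper, "0110"))   # prints 1001
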
A Turing machine is a simple device; all of its parts are straightforward; and many real-world simulators have been built. But it is hypothesised that this simple device can compute any function that can be computed by any other computational device, and so it contains within its simple structure everything that can be found in the most complex supercomputer, except speed.
It has been suggested that the universe is a giant Turing machine, and everything we know so far about physics continues to work from this perspective, with the single exception that it requires that time is quantized rather than continuous — the universe ticks rather than runs.
But here’s the contribution of computation to epistemology: Almost nothing interesting about the future behaviour of a Turing machine is knowable in any shortcut way, that is, in any way that is quicker than just letting the Turing machine run and seeing what happens. This includes questions like: will the Turing machine ever finish its computation? Will it revisit this particular memory element? Will this memory element ever again contain the same symbol that it does now? And many others. (These questions may be answerable in particular cases, but they can’t be answered in general — that is, you can’t inspect the transition table and the storage and draw conclusions in a general way.)
If most of the future behaviour of such a simple device is not accessible to “outside” analysis, then almost every property of more complex systems must be equally inaccessible.
Note that this is not an argument built from limitations that might be thought of as “practical”. Predicting what will happen tomorrow is not, in the end, impossible because we can’t gather enough data about today, or because we don’t have the processing power to actually build the predictions — it’s a more fundamental limitation in the nature of what it means to predict the future. This limitation is akin (in fact, quite closely related) to the fact that, within any sufficiently powerful formal system, there are statements that are true but cannot be proved within the system.

There are other arguments that also speak to the problem of predicting the future. These aren’t actually needed, given the argument above, but they are often advanced, and speak more to the practical difficulties.

The first is that non-linear systems are not easy to model, and often have unsuspected actions that are not easy to infer even when we have a detailed understanding of them. Famously, bamboo canes can suddenly appear from the ground and then grow more than 2 feet in a day.

The second is that many real-world systems are chaotic, that is, infinitesimal differences in their conditions at one moment in time can turn into enormous differences at a future time. This is why forecasting the weather is difficult: a small error in measurement at one weather station today (caused, perhaps, by a butterfly flapping its wings) can completely change tomorrow’s weather a thousand miles away. The problem with predicting the future here is that the current state cannot be measured to sufficient accuracy.

So if the future is inherently, fundamentally impossible to predict, what do we mean when we talk about prediction in the context of knowledge discovery? The answer is that predictive models are not predicting a previously unknown future, but are predicting the recurrence of patterns that have existed in the past. It’s desperately important to keep this in mind.

Thus when a mortgage prediction system (should this new applicant be given a mortgage?) is built, it’s built from historical data: which of a pool of earlier applicants for mortgages did, and did not, repay those loans. The prediction for a new mortgage applicant is, roughly speaking, based on matching the new applicant to the pool of previous applicants and making a determination from what the outcomes were for those. In other words, the prediction assumes an approximate rerun of what happened before — “now” is essentially the same situation as “then”. It’s not really a prediction of the future; it’s a prediction of a rerun of the past.

All predictive models have (and must have) this historical replay character. Trouble starts when this gets forgotten, and models are used to predict scenarios that are genuinely in the future.  For example, in mortgage prediction, a sudden change in the wider economy may be significant enough that the history that is wired into the predictor no longer makes sense. Using the predictor to make new lending decisions becomes foolhardy.

Other situations have similar pitfalls, but they are a bit better hidden. For example, the dream of personalised medicine is to be able to predict the outcome for a patient who has been diagnosed with a particular disease and is being given a particular treatment. This might work, but it assumes that every new patient is close enough to some of the previous patients that there’s some hope of making a plausible prediction. At present, this is foundering on the uniqueness of each patient, especially as the available pool of existing patients for building the predictor is often quite limited. Without litigating the main issue, models that attempt to predict future global temperatures are vulnerable to the same pitfall: previous dependencies of temperatures on temperatures at earlier times do not provide a solid epistemological basis for predicting future temperatures based on temperatures now (and they face the triple whammy of fundamental unpredictability, chaos, and non-linearity).

All predictors should be built so that every prediction passes through a preliminary step that compares the incoming record to the totality of the data used to build the predictor. New records that do not resemble records used for training cannot legitimately be passed to the predictor, since the result has a strong probability of being fictional. In other words, the fact that a predictor was built from a particular set of training data must be preserved in the predictor’s use. Of course, there’s an issue of how similar a new record must be to the training records to be plausibly predicted. But at least this question should be asked.
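
As a minimal sketch of what such a preliminary step might look like (assuming scikit-learn is available; the 95% quantile threshold and the nearest-neighbour distance are arbitrary choices of mine, not something prescribed above), new records can be gated on their distance to the nearest training record before the predictor is consulted:

# Sketch: refuse to predict for records that do not resemble the training data.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_gate(X_train, quantile=0.95):
    # Fit a nearest-neighbour index on the training data and choose a distance
    # threshold from the data itself: a high quantile of each training point's
    # distance to its nearest other training point.
    nn = NearestNeighbors(n_neighbors=2).fit(X_train)
    dists, _ = nn.kneighbors(X_train)        # column 0 is each point's distance to itself
    threshold = np.quantile(dists[:, 1], quantile)
    return nn, threshold

def gated_predict(model, nn, threshold, X_new):
    # 'model' stands for any fitted predictor with a predict() method.
    dists, _ = nn.kneighbors(X_new, n_neighbors=1)
    preds = model.predict(X_new)
    # Return None instead of a prediction for records unlike anything in training.
    return [p if d <= threshold else None for p, d in zip(preds, dists[:, 0])]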

So can we predict the future? No, we can only repredict the past.

Understanding “anomaly” in large dynamic datasets

A pervasive mental model of what it means to be an “anomaly” is that this concept is derived from difference or dissimilarity; anomalous objects or records are those that are far from the majority, common, ordinary, or safe records. This intuition is embedded in the language used — for example, words like “outlier”.

May I suggest that a much more helpful, and even more practical, intuition of what “anomaly” means comes from the consideration of boundaries rather than dissimilarity. Consider the following drastically simplified rendering of a clustering:

[Figure “anomalies”: a drastically simplified clustering with three clusters and a few individual points labelled A, B, C, and D]

There are 3 obvious clusters and a selection of individual points. How are we to understand these points?

The point A, which would conventionally be considered the most obvious outlier, is probably actually the least interesting. Points like this are almost always the result of some technical problem on the path between data collection and modelling. You wouldn’t think this would happen with automated systems, but it’s actually surprisingly common for data not to fit properly into a database schema or for data to be shifted over one column in a spreadsheet, and that’s exactly the kind of thing that leads to points like A. An inordinate amount of analyst attention can be focused on such points because they look so interesting, but they’re hardly ever of practical importance.

Points B and C create problems for many outlier/anomaly detection algorithms because they aren’t particularly far from the centre of gravity of the entire dataset. Sometimes points like these are called local outliers or inliers and their significance is judged by how far they are (how dissimilar) from their nearest cluster.

Such accounts are inadequate because they are too local. A much better way to judge B and C is to consider the boundaries between each cluster and the aggregate rest of the clusters; and then to consider how close such points lie to these boundaries. For example, B lies close to the boundary between the lower left cluster and the rest and is therefore an interesting anomalous point. If it were slightly further down in the clustering it would be less anomalous because it would be closer to the lower left cluster and further from this boundary. Point C is more anomalous than B because it lies close to three boundaries: those between the lower left cluster and the rest, between the upper left cluster and the rest, and between the rightmost cluster and the rest. (Note that a local outlier approach might not think C is anomalous because it’s close to all three clusters.)

The point D is less anomalous than B and C, but is also close to a boundary, the boundary that wraps the rightmost cluster. So this idea can be extended to many different settings. For example, wrapping a cluster more or less tightly changes the set of points that are “outside” the wrapping and so gives an ensemble score for how unusual the points on the fringe of a cluster might be. This is especially important in adversarial settings, because these fringes are often where those with bad intent lurk.

The heart of this approach is that anomaly must be a global property derived from all of the data, not just a local property derived from the neighbourhood of the point in question. Boundaries encode non-local properties in a way that similarity (especially similarity in a geometry, which is usually how clusterings are encoded) does not.

The other attractive feature of this approach is that it actually defines regions of the space based on the structure of the “normal” clusters. These regions can be precomputed and then, when new points arrive, it’s fast to decide how to understand them. In other words, the boundaries become ridge lines of high abnormality in the space and it’s easy to see and understand the height of any other point in the space. Thus the model works extremely effectively for dynamic data as long as there’s an initial set of normal data to prime the system. (New points can also be exploited as feedback to the system so that, if a sequence of points arrive in a region, the first few will appear as strong anomalies, but their presence creates a new cluster, and hence a new set of boundaries that mean that newer points in the same region no longer appear anomalous).
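
A rough sketch of how this might be implemented (my own illustration, using k-means centroids as a stand-in for the clustering, not the specific method behind the figure): fit the clusters on the normal data once, then score each new point by how evenly it is pulled between its nearest two clusters, since points near a boundary are roughly equidistant from the clusters on either side.

# Sketch: score points by closeness to inter-cluster boundaries.
import numpy as np
from sklearn.cluster import KMeans

def boundary_scores(X_normal, X_new, k=3):
    # Fit the "normal" clusters once; new points can then be scored cheaply.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_normal)
    # Distance from every new point to every cluster centre.
    d = np.linalg.norm(X_new[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
    d.sort(axis=1)
    # A point near a boundary is pulled almost equally by two clusters, so its
    # nearest and second-nearest distances are similar and the ratio is near 1;
    # deep inside a cluster the ratio is near 0.
    return d[:, 0] / (d[:, 1] + 1e-12)

Scores near 1 mark points sitting close to a ridge line between clusters; scores near 0 mark points deep inside a cluster.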

Understanding High-Dimensional Spaces

My new book with the title above has been published by Springer, just in time for Christmas gift giving for the data miner on your list.

The book explores how to represent high-dimensional data (which almost all data is), and how to understand the models, particularly for problems where the goal is to find the most interesting subset of the records. “Interesting”, of course, means different things in different settings; a big part of the focus is on finding outliers and anomalies.

Partly the book is a reaction to the often unwitting assumption that clouds of data can be understood as if they had a single centre — for example, much of the work on social networks.

The most important technical ideas are (a) that clusters themselves need to be understood as having a structure which provides each one with a higher-level context that is usually important to make sense of them, and (b) that the empty space between clusters also provides information that can help to understand the non-empty space.

You can buy the book here.

Anomalies in record-based data

Many organisations have large datasets whose entities are records, perhaps records of transactions. In some settings, such as detecting credit-card fraud, sophisticated sets of rules have been developed to decide which records deserve further attention as potentially fraudulent. What does an organisation do, however, when it has a large dataset like this, hasn’t developed a model of what “interesting” records look like, but would still like to focus attention on “interesting” records — usually because there aren’t enough resources even to look at all of the records individually?

One way to decide which records are interesting is to label records as uninteresting if there are a lot of other records like them. I have developed ways to rank records by interestingness using this idea.
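
The ranking method itself isn’t spelled out here, but a minimal sketch of the underlying idea (my own illustration, not necessarily the method referred to above): score each record by how rare its field values are across the dataset, so records surrounded by many similar records score low.

# Sketch: rank records by how rare their field values are across the dataset.
import math
from collections import Counter

def rank_by_rarity(rows):
    # rows: list of records, each a list of field values (all strings).
    counts = [Counter(r[i] for r in rows) for i in range(len(rows[0]))]
    n = len(rows)
    def score(rec):
        # Sum of -log(frequency) over fields: rare values contribute large scores.
        return sum(-math.log(counts[i][v] / n) for i, v in enumerate(rec))
    return sorted(rows, key=score, reverse=True)

# Usage (hypothetical file name; columns as in the example below):
#   import csv
#   with open("contracts.csv") as f:
#       ranked = rank_by_rarity(list(csv.reader(f)))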

So when the Sydney Morning Herald published a dataset of Australian defence contracts (700,000 of them) I thought I would try my approach. The results are interesting. Here are the most unusual records from this ranking (the columns are contract number, description, contracting agency, start date, end date, amount, and supplier):

1.   1217666,REPAIR PARTS,Department of Defence,16-October-2002,,5872.52,L
This one comes at the top of the list because the supplier name is unusual, only a single letter.

2.  1120859,Supply of,Department of Defence,15-May-2002,,0,C & L AEROSPACE

This one has a very short description and an amount of $0.
3.  854967,EARTH MOVING EQUIPMENT PARTS FOR REPAIR,Department of Defence,21-May-2002,,2134.05,439
Unusual because the supplier name is a number

4.  956798,PRESSURE GAUGE (WRITE BACK  SEE ROSS DAVEY),Department of Defence,11-September-2002,,1,WORMALD FIRE & SAFETY
Unusual because of the extra detail in the description and the cost of $1

5.  1053172,5310/66/105/3959.PURCHASE OF WASHER  FLAT.*CANCELLED* 29/04/03,Department of Defence,12-February-2003,,0,ID INTERNATIONAL
Unusual because of the dollar value, and the unusual description because of the cancellation

6.  868380,cancelled,Department of Defence,14-June-2002,,0,REDLINE
Unusual again because of the description and dollar value

7.  1043448,tetanus immunoglobulin-human,Department of Defence,10-January-2003,,1,AUSTRALIAN RED CROSS
Unusual because of the low dollar value

8.  1014322,NATIONAL VISA PURCHASING,Department of Defence,18-October-2002,,26933.99,NAB 4715 2799 0000 0942
Unusual because the supplier is a bank account number (and so numeric); also a largish dollar value

9.  1023922,NATIONAL VISA PURCHASING,Department of Defence,18-September-2002,,25586.63,NAB 4715 2799 0000 0942
Same sort of pattern as (8) — globally unusual but similar to (8), note the common date

10.  968986,COIL  RADIO FREQUENCY,Department of Defence,27-September-2002,,2305.6,BAE
Unusual because of the short supplier name and large dollar value

11.  887357,SWIMMING POOL COVER.,Department of Defence,07-May-2002,,7524,H & A TEC
Unusual supplier name and large (!!) dollar value — hope it’s a big pool

12.  1010554,NAB VISA CARD,Department of Defence,02-August-2002,,16223.19,NAB 4715 2799 0000 0942
Another numeric bank account number as supplier and large dollar amount

13.  1005569,Interest,Department of Defence,12-August-2002,,2222.99,NAB 4715 2799 0000 1494
And again

14.  925011,FLIR RECORDER REPPRODUCER SET REPAIR KIOWA,Department of Defence,16-August-2002,,1100,BAE
Short supplier name, long description with unusual words

15.  1012869,NAB VISA STATEMENT,Department of Defence,22-August-2002,,12934.87,NAB 4715 2799 0000 0942
Another financial transaction

16.  1073019,NATIONAL VISA,Department of Defence,03-February-2003,,10060.16,NAB 4715 2799 0000 0942
And again

17.  969039,SUSPENDERS  WHITE,Department of Defence,30-September-2002,,41800,ADA
Short supplier name and very large dollar amount (hopefully not just one suspender)

18.  1097060,Purchase of Coveralls  Flyers  Lightweight  Sage Green.,Department of Defence,11-February-2003,,18585.6,ADA
Again short supplier name and large dollar amount

19.  959232,SUPPLY OF COATS AND TROUSERS DPDU,Department of Defence,23-September-2002,,1032350,ADA

Again short supplier name and very (!!) large dollar amount

Clearly the process is turning up example records that seem to be quite unusual within this large set, and might sometimes be worth further investigation.

This technique can be applied to any record-based data. As well as providing a version of the data ranked by interestingness, it also provides a graphical view of the data, and some indication of the density of unusual records compared to ordinary records. As the example shows, it also often turns up technical problems with the way that the data was collected, since records with the wrong fields, or with fields in the wrong place, will usually show up as anomalous. Some of the top records are probably there not because they are really unusual but because something went wrong with the capture of the supplier names. So the technique can be used for quality control as well.

A Gentle Introduction to Finding the Most Important Node in Graph-Connected Data

I’ve posted frequently about the problem that analysts face when they are dealing with a set of objects, and connections between pairs of objects. For example, the nodes could be people, and the edges some measure of how much each pair of people communicate; the nodes could be reports and the edges how much each pair of reports mention the same entities; the nodes could be computers and the edges how well connected they are to each other. In other words, the natural data structure to represent the set of knowledge is a graph.

Very often such graphs are large, with thousands of nodes, and thousands of edges. The problem the analyst has is to work out which nodes of the graph deserve the most attention; often most of the graph is boring, or at least it is already well understood.

In what follows, I’ll assume that nodes by themselves have no associated score, but there is some natural way to associate a score with the edges that connect them. Let’s see how such a graph could be built, and so think about how we would keep track of which bits of it might be most important.

If the graph consists of only a single node, there’s not much to say. So consider what happens when a second node is added, and let’s assume that the second node is connected to the first (by a weighted edge). If they aren’t connected, then we might treat it as two separate graphs and two separate analysis problems. The only way to represent the graph with two nodes is with the single edge between them, but it might be plausible to make the length of the edge inversely proportional to the weight of the edge — if the weight is large, there’s a strong connection between them and so the nodes might plausibly be close together. We start to insert some geometry into the graph relationship because geometry is reasonably intuitive to us humans.

Now if we add a third node, there are two possibilities, if we want to continue to make edge length inversely proportional to edge weights (strong = close). If the third node is connected to both of the current two nodes, then the resulting structure is a triangle (whose exact shape depends on the relative size of the edge weights). If the third node is connected to only one of the current nodes, then the shape is a path.

The worst case is that adding each new node requires us to expand into another dimension: 3 nodes needs 2 dimensions, 4 nodes needs 3 dimensions, and so on, if we want to preserve the relationship between edge length and edge weight.

In this representation, nodes that are important because they are strongly connected to other important nodes (note the circularity) tend to be pulled towards the “middle” of the graph structure. So if we want to find the most important nodes, we should look near the middle.

But we don’t have that much geometry so, although the idea of the middle is intuitive, finding the middle is rather difficult. One approach is widely used in Social Network Analysis — the middle can (sort of) be found by computing how far it is from each node to the outer surface of the graph, and then looking at those nodes that are farthest inside. This works, after a fashion, but the algorithms can be complex and they tend not to tell you much else about the graph structure.

What about trying to use geometry directly and embedding the graph into an (n-1)-dimensional space? If this can be made to work then “close” actually will mean close, say in Euclidean distance.

The first possible way to do this is to build the adjacency matrix of the graph. For a graph with n nodes, this just means forming an n by n matrix with one row and column corresponding to each node. The matrix entries are zero, except that when node i and node j are connected, then entries ij and ji are set to the edge weight of the connection. We can actually dispense with the last column since, the matrix being symmetric, its contents are duplicated in the last row.

Now the rows of any matrix can be interpreted geometrically by treating the entries as coordinates in a space spanned by the columns. So we can take our n by n-1 adjacency matrix and create an (n-1)-dimensional space with a point corresponding to each row of the matrix, and so to each node of the graph.

This geometry for the graph is still a bit awkward, because an (n-1)-dimensional space is still an ugly place to be (our intuitions don’t work very well in high-dimensional spaces). But if all you want to do is find the most important few nodes, then this is now easy.

Just for clarity, imagine that each entry of each row of the matrix has been divided by the sum of that row, so the entries are all less than 1. Now the embedded representation of the graph lives inside the (n-1)-dimensional (hyper)cube with one corner at the origin and the other corner (1,1,1,…..,1). The nodes with lots of high-weight edges to lots of other nodes are represented by points that are quite far from the origin, because the entries in their rows are comparatively large, and there are comparatively many non-zero entries. The nodes that are poorly connected to the rest of the graph are represented by points quite close to the origin because their rows have values that are mostly zeroes, and otherwise small. The most important nodes, then, will be on the “outer” edge of the geometric representation.

Now if we can find the direction out from the origin in which the “cloud” representing the points of the graph sticks out the most, then the extremal points in this direction will correspond to the most important nodes. Note that this direction isn’t necessarily towards the most extremal point (although we would usually expect it to be generally in that direction).

The direction in which the cloud bulges out the most from the origin is given by the principal eigenvector of the matrix representing the graph. Although this sounds complicated, it’s actually straightforward to compute. What’s more, the position of every node can be projected onto this vector and interpreted as a relative measure of the importance of that node. This is the basis for Google’s PageRank algorithm which treats the web as a giant graph, and does (more or less) this calculation to give every page an importance score. Note that this solves the problem of having to work in n-1 dimensions by projecting the entire graph structure onto a 1-dimensional space.
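
That computation really is straightforward. Here is a minimal power-iteration sketch in Python; the 4-node weighted adjacency matrix is made up purely for illustration.

# Sketch: principal eigenvector of a (symmetric, non-negative) adjacency matrix
# by power iteration; larger entries correspond to more "important" nodes.
import numpy as np

def principal_eigenvector(A, iters=200):
    v = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

# A hypothetical 4-node weighted graph: node 0 is the most heavily connected.
A = np.array([[0, 3, 2, 1],
              [3, 0, 1, 0],
              [2, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
print(principal_eigenvector(A))   # node 0 gets the largest score

The entries of the returned vector can be read as relative importance scores for the nodes, which is the 1-dimensional projection described above.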

But there’s something badly wrong with this geometry. Suppose node i is connected to node j with weight 3, so in the geometry the point that corresponds to it is 3 units from the origin in dimension j. Now suppose that the weight on this edge is increased to 10. The point corresponding to node i suddenly moves further from the origin in dimension j, and so further from all of the other points it is connected to, including point j. But an increase in weight is supposed to mean that i and j are closer! The problem is that, in this simple geometric embedding, the highly connected nodes are all far from the origin, on the outside of the graph structure, which tends to force them to be geometrically far from one another rather than close to one another.

The calculation of a single global importance works because large edge weights create large distances in the geometry and so tend to push nodes with edges like that far out. But in doing so it messes up any other geometry among the nodes.

The solution is to use a different geometric embedding that turns the graph inside out, so that the points that represent well-connected nodes are positioned close to the origin, and the sparsely connected nodes are positioned on the outer edges. This is done by converting the adjacency matrix, with its rows normalized by dividing by the row sums, into a walk Laplacian matrix: negate all of the entries of the normalized adjacency matrix and add 1’s down the diagonal.

The global variation among the nodes of the graph is now discovered by computing an eigendecomposition of this matrix (again straightforward) and looking at the eigenvector corresponding to the smallest non-zero eigenvalue. The intuition in this embedding is that this eigenvector points along the direction of minimal vibration in the embedded graph structure. Think of a cigar-shaped graph where the nodes are joined by metal rods. Striking it parallel to its long axis will produce much smaller vibrations than striking it parallel to any of the other axes. (In fact, the values of the eigenvector can be interpreted as the amplitude of the associated vibration of each node and the eigenvalue as the frequency of the vibration).

This approach will produce the same answer for the global importance ranking of the nodes, and so will find the same most important nodes. But the geometry is much better behaved. The distances between the embedded points behave as we would want them to (short edges = heavy weights) so clustering algorithms can be applied directly in the embedded space. The other eigenvectors also properly capture the substructure of the graph — for example, the eigenvector corresponding to the second-smallest non-zero eigenvalue describes the next most significant uncorrelated variation in the graph (in fact, the vibrational mode if the graph is struck parallel to the second axis of variation).
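
A sketch of the walk-Laplacian embedding described above (using the same made-up adjacency matrix as before, and numpy’s dense eigendecomposition, which is only sensible for small graphs):

# Sketch: embed a graph via its walk Laplacian and keep the eigenvectors
# associated with the smallest non-zero eigenvalues.
import numpy as np

def walk_laplacian_embedding(A, dims=2):
    W = A / A.sum(axis=1, keepdims=True)      # rows normalised to sum to 1
    L = np.eye(A.shape[0]) - W                # walk Laplacian: negate and add 1's on the diagonal
    evals, evecs = np.linalg.eig(L)
    order = np.argsort(evals.real)            # smallest eigenvalue (~0) first
    # Skip the trivial ~0 eigenvalue; the next eigenvectors give the embedding.
    return evecs.real[:, order[1:dims + 1]]

A = np.array([[0, 3, 2, 1],
              [3, 0, 1, 0],
              [2, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
print(walk_laplacian_embedding(A))

The first returned column corresponds to the smallest non-zero eigenvalue and plays the role of the global importance direction; the later columns capture the remaining uncorrelated variation, and distances in this embedding behave as described (heavy weights pull points together).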

A geometric embedding is much more convenient if the graph is being updated. One problem with computing properties of a graph directly is that adding a single node can completely change all of the paths and all of the distances — so a social network measure has to be recomputed from scratch. In a geometric space, adding a new point doesn’t change the existing distances at all. An approximation to the position of the new node can be computed by calculating an adjacency matrix row for it, converting to a walk Laplacian row, and then using the existing eigendecomposition to embed it. If the new point really does change the structure of the graph, then this approximation is not useful; but most of the time it probably does not.

In fact, this embedding has a few other features that are not so obvious. The geometric embedding takes into account not just the (weighted) distance between nodes but also the possible presence of multiple paths between nodes. In other words, two nodes connected by lightweight paths, but many of them, will appear more similar than if only the weights had been taken into account. This is usually the right way to think about global relationships between unconnected nodes. In other words, this geometric embedding takes into account not only distance, but also density and curvature in the original graph.

This approach to finding importance is called spectral graph theory, and there is a large literature on it. But it is easy to lose sight of the wood for the trees. This short blog is intended to provide a more intuitive overview of the main point of using spectral graph techniques. In particular, if you already use PCA, it’s often possible to compute correlations of records to create a graph structure, and then analyse the structure using the approach described here. Almost always, clearer structures can be seen.

Low Hanging Fruit in Cybersecurity III

Any attempt to decide whether a particular action is “bad” or “good” requires some model of what “good” actually means. The only basis for intelligent action in almost any setting is to be able to have a plan for the expected, but also a mechanism for noticing the unexpected — to which some kind of meta-planning can be attached. This is, of course, a crucial part of how we function as humans; we don’t hang as software often does, because if we encounter the unexpected, we do something about it. (Indeed, an argument along this line has been used by J.R. Lucas to argue that the human mind is not a Turing machine.)

But most cybersecurity applications do not try (much) to build a model of what “good” or “expected” or “normal” should be like. Granted, this can be difficult; but I can’t help but think that often it’s not as difficult as it looks at first. Partly this is because of the statistical distribution that I discussed in my last post — although, on the internet, lots of things could happen, most of them are extremely unlikely. It may be too draconian to disallow them, but it seems right to be suspicious of them.

Actually, three different kinds of models of what should happen are needed. These are:

  1. A model of what “normal” input should look like. For example, for an intrusion detection system, this might be IP addresses and port numbers; for a user-behavioral system, this might be executables and times of day.
  2. A  model of what “normal” transformations look like. Inputs arriving in the system lead to consequent actions. There should be a model of how these downstream actions depend on the system inputs.
  3. A model of what “normal” rates of change look like. For example, I may go to a web site in a domain I’ve never visited before; but over the course of different time periods (minutes, hours, days) the rate at which I encounter brand new web sites exhibits characteristic patterns.

An exception to the first model shows that something new is happening in the “outside” world — it’s a signal of novelty. An exception to the second model shows that the system’s model of activity is not rich enough — it’s a signal of interestingness. An exception to the third model shows that the environment is changing.
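
As a small illustration of the third kind of model (a sketch only; the window length, history size, and tolerance factor are arbitrary choices of mine), one can keep a running estimate of how often never-before-seen items turn up and flag windows where that rate jumps:

# Sketch: track the rate of first-time-seen items (e.g. web domains) per window
# and flag windows whose novelty rate is far above the historical norm.
from collections import deque

class NoveltyRateMonitor:
    def __init__(self, history=30, tolerance=3.0):
        self.seen = set()                     # everything ever observed
        self.rates = deque(maxlen=history)    # novelty counts for recent closed windows
        self.current = 0                      # novelty count in the open window
        self.tolerance = tolerance

    def observe(self, item):
        # Call once per event, e.g. per web domain contacted.
        if item not in self.seen:
            self.seen.add(item)
            self.current += 1

    def end_window(self):
        # Close the current window (e.g. once per hour or day); return True if
        # its count of never-seen-before items is far above the historical norm.
        rate, self.current = self.current, 0
        if len(self.rates) == self.rates.maxlen:
            mean = sum(self.rates) / len(self.rates)
            anomalous = rate > self.tolerance * max(mean, 1.0)
        else:
            anomalous = False                 # not enough history yet
        self.rates.append(rate)
        return anomalous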

Activity that does not fit with any one of these models should not necessarily cause the actions to be refused or to sound alarms — but it does provide a hook to which a meta-level of analysis can be attached, using more sophisticated models with new possibilities that are practical only because they don’t get invoked very often.

Again think of the human analogy. We spend a great deal of our time running on autopilot/habit. This saves us cognitive effort for things that don’t need much. But, when anything unusual happens, we can quickly snap into a new mode where we can make different kinds of decisions as needed. This isn’t a single two-level hierarchy — in driving, for example, we typically have quite a sophisticated set of layers of attention, and move quickly to more attentive states as conditions require.

Cybersecurity systems would, it seems to me, work much more effectively if they used the combination of models of expected/normal behavior, organized in hierarchies, as their building blocks.

Low Hanging Fruit in Cybersecurity II

If cybersecurity exists to stop bad things happening in computing systems, then it seems to me that there are several implicit assumptions that underlie many approaches and techniques that might not be completely helpful. These are:

  • The distinction between “good” (or “allowable”) and “bad” is a binary distinction;
  • The decision about this distinction has to be made monolithically in a single step;
  • The distribution of likely things that could happen is uniform (flat).

Even to write them explicitly shows that they can’t quite be right, but nevertheless I suspect they exist, unexamined, in the design of many security systems.

What happens if we remove these assumptions?

If the distinction between “good” and “bad” is not discrete, then our systems instead allocate some kind of continuous risk or suspicion to actions. This creates an interesting new possibility — the decision about what to do about an action can now be decoupled from how the action is categorized. This is not even a possibility if the only distinction we recognize is binary.

From a purely technical point of view, this means that many different kinds of risk measuring algorithms can be developed and used orthogonally to decisions about what the outputs of these algorithms means. Critical boundaries can be determined after the set of risks has been calculated, and may even be derived from the distribution of such risks. For example, bad things are (almost always) rare, so a list of actions ordered by risk will normally have a bulge of “normal” actions and then a small number of anomalous actions. The boundary could be placed at the edge of the bulge.

Second, what if the decision about whether to allow an action doesn’t have to be made all at once? Then systems can have defence in depth. The first, outer, layer can decide on the risk of a new action and decide whether or not to allow it. But it can be forgiving of potentially risky actions if there are further layers of categorization and defence to follow. What it can do is to disallow the clearly and definitively bad things, reducing the number of potentially bad things that have to be considered at later stages.

From a technical point of view, this means that weaker but cheaper algorithms can be used on the front lines of defence, with more effective but more expensive algorithms available for later stages (where they work with less data, and so do not cost as much overall, despite being more expensive per instance).

Third, what if our defence took into account that the landscape of expected actions is not uniform, so that low-probability events are automatically treated as more suspicious? For example, spam filtering does lots of clever things, but it doesn’t build a model of the sources of my email, and flag emails from countries that I’ve never, ever received email from as inherently more likely to be spam. (Yes, I know that sender addresses can be spoofed.)

This idea has been used in behavioral profiling of computer activity, and it sort of works. But it needs to be combined with the ideas above, so that actions can be rated along a continuum from: routine (allow), to unusual but still not that unusual (allow, but maybe with a user question or at least logged for occasional inspection), to very unusual (user better explicitly allow), to bizarre (disallow). Windows has a weak version of this, which hasn’t been accepted well by users, but it flags only one thing (program start) and it doesn’t build a model of typical behavior by each user.

For example, the set of IP addresses with which my computer interacts is quite large, and hard to represent by some kind of convex structure, so intrusion detection doesn’t work very well if it depends on wrapping/categorising those IP addresses that are OK, and blocking traffic from those that are not. And usually the set of OK IP addresses is not derived from those I interact with, but encoded in some set of rules that apply to many computers. But if instead I built a model of the IP addresses I interact with, allowing older ones to get stale and disappear, and then looked at new IP addresses and allowed them if they resembled (tricky) those I already interact with, and asked me about the others, then this might work better than current approaches. An IP address is a hierarchical structure, with a possible country followed by the top octet, and so on, so I can discriminate quite finely about what it might mean. Even a web server that is theoretically visible to every other IP address could still benefit from handling unlikely source IP addresses differently.
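
A sketch of what such a model might look like (the prefix granularity and the 90-day aging window are my own arbitrary choices, and octet prefixes are only a rough stand-in for real CIDR blocks): remember the address prefixes seen recently, and score a new address by how much of it matches something familiar.

# Sketch: score a new IPv4 address by how much of it matches previously seen traffic.
import time

class IPFamiliarity:
    def __init__(self, max_age_days=90):
        self.last_seen = {}                   # prefix -> timestamp of last sighting
        self.max_age = max_age_days * 86400   # older sightings go stale

    def _prefixes(self, ip):
        octets = ip.split(".")
        # Approximate /8, /16, /24, and the full address by octet prefixes.
        return [".".join(octets[:k]) for k in (1, 2, 3, 4)]

    def record(self, ip):
        now = time.time()
        for p in self._prefixes(ip):
            self.last_seen[p] = now

    def familiarity(self, ip):
        # 0 = never seen anything like it; 4 = this exact address seen recently.
        now = time.time()
        score = 0
        for p in self._prefixes(ip):
            if now - self.last_seen.get(p, -1e18) <= self.max_age:
                score += 1
        return score

Low familiarity scores would then feed the graded responses described above (log, ask the user, or disallow), rather than a single allow/deny rule.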

OK, maybe this isn’t exactly low hanging fruit, but the ideas are straightforward and (IMHO) should be built into the design of more robust systems.

Finding significance automatically

In a world where lots of data is collected and available, the critical issue for intelligence, law enforcement, fraud, and cybersecurity analysts is attention.

So the critical issue for tools to support such analysts is focus: how can the most significant and interesting pieces of data/information/knowledge be made the easiest to pay attention to?

This isn’t an easy issue to address for many reasons, some of which I talked about a few posts ago in the context of connecting the dots. But the fundamental problems are: (1) significance or interestingness are highly context dependent, so where to focus depends, in a complex way, on what the analyst already knows and understands; and (2) every new piece of information has the potential to completely alter the entire significance landscape in one hit.

Many existing tools are trying, underneath, to address the issue of focus indirectly, by providing ways for analysts to control their own focus more directly. For example, there are many analyst platforms that allow available information to be sliced and diced in many different ways. These allow two useful things to be done: (1) dross (the guaranteed insignificant stuff) can be removed (or at least hidden from sight); and (2) the rest of the data can be rearranged in many different ways in the hope that human pattern-recognition skills can be brought to bear to find significance.

But it seems like a good idea to try and address the significance issue more directly. This has motivated a few of the research projects I’m involved with:

  • The ATHENS system tries to find information on the web that is probably new to the user, but which s/he is well-positioned to understand; in other words, the new information is just over the horizon from the user’s current landscape. It builds this new information starting from a query that allows the user to provide the current context;
  • Finding anomalies in large graphs. Lots of data is naturally represented relationally as a graph, with nodes representing some kind of entities, and edges representing some kind of (weighted) similarity between some subset of the nodes (e.g. social networks). Graphs are difficult to work with because they don’t really have a representation that humans can work with — even drawing nice pictures of them tends to (a) occlude chunks once the graph gets big enough, and (b) hide the anomalous structure in the corners because the nice representation is derived from the big structure (e.g. the simple bits of the automorphism group). We’ve developed some tools that find and highlight anomalous regions, anomalous in the sense that, if you were to stand at their nodes and look at the landscape of the rest of the graph, it would look unusual.
  • Finding anomalies in text caused by either a desire to obfuscate the content that’s being talked about, or caused by internal mental state that’s unusual — being deceptive, or highly tense, for example.

Some other people are working in similar directions. For example, there is some work aimed at using social processes to help discover significance. In a sense, sites like Slashdot work this way — each user provides some assessment of quality/importance of some stories, and in return gets information about the quality/importance of other stories. This is also, of course, how refereed publications are supposed to work. The challenge is to contextualize this idea: what makes an object high quality/important for you may not mean anything to me. In other words, most significance lies somewhere on the spectrum from universal agreement to completely taste-based, and it’s hard to tell where, let alone compute it in a practical way.