Posts Tagged 'link analysis'

Small world is incompatible with communities

Most human-generated graphs, especially social networks, have the small world property: their diameter is small relative to their size, growing only logarithmically with the number of nodes, or, more intuitively, there are short paths linking every pair of nodes. For example, in the real world one plausible path that could be used to transfer something between A and B would be:

A — local political representative — president of A’s country — president of B’s country — local political representative — B

taking 5 steps. The famous “6 degrees of separation” from Milgram’s experiment is slightly longer because it depends on the willingness of the people along the path to cooperate; it also shows that there must be many redundant paths connecting A and B.

The example shows that the small world property sort of implies that there’s a tree contained within the topology of the network, so that one way to find short paths is to move up the tree and down again. Of course, real networks will tend to be much richer. But a binary tree is the canonical example of a small world graph.
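To make the claim about trees concrete, here is a small self-contained sketch (the helper names `binary_tree_adj` and `diameter` are mine, not from the post) that builds complete binary trees and measures their diameters by brute-force BFS:

```python
from collections import deque

def binary_tree_adj(depth):
    """Adjacency list of a complete binary tree with 2**(depth+1)-1 nodes."""
    n = 2 ** (depth + 1) - 1
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for c in (2 * i + 1, 2 * i + 2):   # children of node i
            if c < n:
                adj[i].append(c)
                adj[c].append(i)
    return adj

def diameter(adj):
    """Longest shortest path, via BFS from every node (fine for small graphs)."""
    def eccentricity(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(eccentricity(s) for s in adj)

for d in (3, 6, 9):
    n = 2 ** (d + 1) - 1
    print(n, diameter(binary_tree_adj(d)))  # diameter is 2*d
```

A tree of depth d has 2^(d+1)−1 nodes but diameter only 2d (leaf, up to the root, down to another leaf), so the diameter grows logarithmically with the number of nodes: exactly the small world property.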

Now what about communities within such networks? It’s a little difficult to come up with a reasonable definition of a community, although the idea is intuitive: a community is a set of nodes that are more richly connected to one another than to the rest of the graph.

The trouble is that the property that this definition is trying to capture is actually the opposite of what needs to happen in a small world graph. Suppose that I’m a node in a community and I want to reach some distant node via a short path. Once I get to the long-distance connection structure everything is easy because it gets me right across the network in a few steps. The problem is getting to this structure in the first place. If I have to follow a number of steps within my community (and have to do the same within the community at the other end) then the small world property breaks.

The only way to preserve the small world property and also have communities of non-trivial size is for the graph to be a tree with cliques at each leaf — then every node has immediate access to the node that is connected to the long-distance edges of the tree. But this is a very artificial graph. In real networks, the nodes are not specialized for long-distance connection; they are simply ordinary nodes (and so could be inside communities) that happen to have an incident edge that’s connected “far away”.

In other words, a small world graph can’t really have subgraphs that are connected mostly internally with only an occasional edge going to other subgraphs. Yet it’s surprising how often the example figures in research papers in this area show such a graph. Leskovec, Lang et al. did a series of experiments looking at the community structure discovered by a large variety of algorithms. They showed what you’d expect — if you look for communities of size k, you’ll find them, because you’re carving a structure that has communities all the way down into chunks of size k. The really interesting part of their results is an inhomogeneity: there are better (more easily selected) communities of size around 125–150, right at the Dunbar number that seems to be hardwired into humans. And this seems to be true for several different kinds of networks, suggesting that this size is cognitively mediated rather than directly socially mediated.

J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6(1), 2009.

The bottom line: there are communities in a network exactly to the extent that it fails to be small world (except for the trivial tree-connected cliques case). We think that there are communities in the networks to which we belong but it may (must) be, to some extent, an illusion that we impose by using other, non-graph criteria for who’s in and who’s out. Useful community detection probably can’t be done on the basis of graph structure alone.

Questions are data too

In the followup investigation of the Boston Marathon bombings, we see again the problem that data analytics has with questions.

Databases are built to store data. But, as Jeff Jonas has most vocally pointed out, simply keeping the data is not enough in adversarial settings. You also need to keep the questions, and treat them as part of the ongoing data. The reason is obvious once you think about it — intelligence analysts need not only to know the known facts; they also need to know that someone else has asked the same question they just asked. Questions are part of the mental model of analysts, part of their situational awareness, but current systems don’t capture this part and preserve it so that others can build on it. In other words, we don’t just need to connect the dots; we need to connect the edges!

Another part of this is that, once questions are kept, they can be re-asked automatically. This is immensely powerful. At present, an analyst can pose a question (“has X ever communicated with Y?”), get a negative answer, only for information about such a communication to arrive a microsecond later and not be noticed. In fast changing environments, this can happen frequently, but it’s implausible to expect analysts to remember and re-pose their questions at intervals, just in case.
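A toy sketch of the idea, with entirely hypothetical names (`QuestionStore`, `ask`, `add_fact` are mine, not any real system's API): stored questions are re-evaluated automatically whenever a new fact arrives, so a question that was answered “no” a microsecond too early still fires.

```python
class QuestionStore:
    """Sketch: persist analysts' questions and re-ask them on new data."""
    def __init__(self):
        self.facts = set()       # e.g. ("X", "communicated_with", "Y")
        self.questions = []      # (analyst, predicate) pairs kept as data

    def ask(self, analyst, predicate):
        """Answer now, but also remember the question itself."""
        self.questions.append((analyst, predicate))
        return [f for f in self.facts if predicate(f)]

    def add_fact(self, fact, notify):
        self.facts.add(fact)
        # re-ask every stored question against the newly arrived fact
        for analyst, predicate in self.questions:
            if predicate(fact):
                notify(analyst, fact)

alerts = []
store = QuestionStore()
# Analyst asks: has X ever communicated with Y?  Answer now: no.
hits = store.ask("analyst1", lambda f: f[0] == "X" and f[2] == "Y")
print(hits)  # []
# Moments later the communication is observed; the stored question fires.
store.add_fact(("X", "communicated_with", "Y"),
               lambda a, f: alerts.append((a, f)))
print(alerts)  # [('analyst1', ('X', 'communicated_with', 'Y'))]
```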

We still have some way to go with the tools and techniques available for intelligence analysis.

Understanding High-Dimensional Spaces

My new book with the title above has been published by Springer, just in time for Christmas gift giving for the data miner on your list.

The book explores how to represent high-dimensional data (which almost all data is), and how to understand the models, particularly for problems where the goal is to find the most interesting subset of the records. “Interesting”, of course, means different things in different settings; a big part of the focus is on finding outliers and anomalies.

Partly the book is a reaction to the often unwitting assumption that clouds of data can be understood as if they had a single centre — for example, much of the work on social networks.

The most important technical ideas are (a) that clusters themselves need to be understood as having a structure which provides each one with a higher-level context that is usually important for making sense of them, and (b) that the empty space between clusters also provides information that can help in understanding the non-empty space.

You can buy the book here.

The power is in the edges

I’ve argued that it isn’t social media unless there are relational edges between individuals or individual objects. These edges are the drivers of power because the graph structure that emerges from them reveals a lot more than the individual nodes and edges do.

The number of LinkedIn contacts I have is now large enough that I can tell this story. I know someone from one of the more secretive US government organizations. His (or it might be her) public web presence, of course, has nothing at all to do with his day job, and we’ve never exchanged emails using his public email. Yet LinkedIn suggests him as someone I might possibly know.

The reason must be that we have enough mutual connections that the software LinkedIn uses sees that there “should” be some connection between us — it is doing edge prediction. This is exactly the kind of analysis that good intelligence tools can do on relational/graph data. The knowledge is in the links, collectively; in other words, noticing the potential link between us requires knowing both the presence of some links and the absence of others (because the system doesn’t recommend other people whose web presence is as dissimilar from mine as his is).
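Whatever LinkedIn actually runs is doubtless far more sophisticated, but the simplest version of edge prediction from mutual connections is the common-neighbours score; a sketch on an invented network (all node names hypothetical):

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Score each absent edge by the number of shared neighbours.
    This is the simplest link-prediction heuristic, used here as a
    stand-in sketch, not as LinkedIn's actual algorithm."""
    scores = {}
    for u, v in combinations(adj, 2):
        if v not in adj[u]:                      # only score non-edges
            scores[(u, v)] = len(adj[u] & adj[v])
    return scores

# Hypothetical network: 'me' and 'them' share three contacts but no edge.
adj = {
    "me":    {"c1", "c2", "c3"},
    "them":  {"c1", "c2", "c3"},
    "c1":    {"me", "them", "other"},
    "c2":    {"me", "them"},
    "c3":    {"me", "them"},
    "other": {"c1"},
}
scores = common_neighbor_scores(adj)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('me', 'them') 3
```

The highest-scoring non-edge is exactly the “people you may know” suggestion: two nodes with no direct link but many shared neighbours.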

So, well done LinkedIn, but a cautionary tale for security folks generally, and especially those who believe in anonymization — it can’t be done!

Finding the most important node in a graph — the plot thickens

I posted earlier about finding the most important node in a graph. Many kinds of data have a natural (although not necessarily obvious) representation as a graph, where the nodes represent pieces of data, and the edges represent the strength of the association or affinity between them. Finding the most important/interesting node is an effective way to direct human (analyst) attention to the good stuff.

The standard way to do this, then, is to build an adjacency matrix from the edge affinities, treating them as edge weights (a large weight indicates a strong affinity), transform the adjacency matrix into a random-walk Laplacian matrix (which twists it inside out so that nodes with many/heavy edges are on the inside rather than the outside), and then use an eigendecomposition technique to project the resulting embedded graph into fewer dimensions. The projection step reveals the global structure in the graph by discounting weak variation and emphasising strong variation. The embedded graph can then be analyzed further by clustering or edge prediction or local neighbour search or one of a number of other possibilities.
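A minimal NumPy sketch of that pipeline, on an invented five-node graph with two groups joined by one weak edge. It uses the symmetric normalized Laplacian rather than the random-walk Laplacian; the two are similar matrices, so they have the same eigenvalues, and the symmetric form has real orthogonal eigenvectors that `eigh` can compute directly.

```python
import numpy as np

# Toy weighted adjacency matrix: nodes 0-2 form one tight group,
# nodes 3-4 another, joined by a single weak edge (2, 3).
A = np.array([[0, 3, 2, 0, 0],
              [3, 0, 2, 0, 0],
              [2, 2, 0, 1, 0],
              [0, 0, 1, 0, 3],
              [0, 0, 0, 3, 0]], float)

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
# Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
embedding = vecs[:, 1:3]           # drop the trivial first eigenvector
print(np.round(vals, 3))           # smallest eigenvalue is 0
print(np.round(embedding, 3))      # 2-d coordinates for the 5 nodes
```

In the embedding, the second eigenvector (the Fiedler vector) already separates the two groups; further clustering or neighbour search happens on these coordinates.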

There are a number of difficulties with this basically straightforward process — there are actually an alarmingly large number of choices to be made, starting with the way edge weights are constructed.

One of the more subtle issues has to do with whether or not the data used to build the graph should properly be thought of as a sample or as the entire graph. If it’s the entire graph, then it is what it is, and can be processed as described above. But if it’s a sample, then deeper thought is needed.

If the sample is truly i.i.d. then there’s not an issue — but real graphs are never like this. The sampling process always introduces artifacts and they tend to make the heavily weighted, richly connected nodes and edges overrepresented. For example, consider a collaborative filtering system that asks people to rate films they have seen. Each person who fills in some ratings is far more likely to recall films that are popular (and well advertised) than other films, even though they might not have liked the popular film all that much. If the graph edges are built from intercepted data, then the most active edges tend to have higher probabilities of being intercepted, and so appear in the collected data more often than their true frequency would suggest. Similarly, nodes with lots of edges are more likely to have been captured in the graph in the first place.

Thus, in sampled graphs, there’s a systematic bias towards heavily weighted edges, and derivatively towards nodes with many or heavily weighted edges.

When such a sampled graph is embedded, these overrepresented nodes gravitate towards the centre of the representation, and the edges that connect them therefore tend to be short. Thus two edges that have the same weight in the initial graph can end up being of quite different lengths in the embedding just because one connects two ‘popular’ nodes and the other does not. For example, suppose everyone in a film rating dataset rated a blockbuster film 5/10. The resulting graph would place the node for that film at the origin and the graph would be almost a star — the structure depending on the ratings of other films would generate some other edges and there would be connections between the other nodes, but this peripheral structure would tend to be overwhelmed by the central node to which every other node was connected. In particular, trying to make recommendations based on neighbourhoods would not work at all because everyone is only two steps away from everyone else.

The solution to this sampling problem is to project the nodes onto a hypersphere centred at the origin. This forces all of the nodes to be the same distance from the origin and so cancels out (most of) the distorting effect caused by the sampling. In the film example, the node corresponding to the blockbuster gets pushed out in some direction (that depends in a complicated way on the structure of the whole graph) and so is only close to a few of the nodes corresponding to other films. (This is a bit like what happens in latent semantic indexing where a query is treated like a very short document, but has to be compared to very long documents.)
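The projection itself is just row normalization of the embedded coordinates; a sketch (the function name is mine, and the three-point embedding is invented to show the effect):

```python
import numpy as np

def project_to_hypersphere(embedding, eps=1e-12):
    """Scale each embedded node to unit distance from the origin,
    cancelling the 'popular nodes drift to the centre' sampling bias."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    return embedding / np.maximum(norms, eps)   # eps guards zero rows

# A 'blockbuster' node pulled close to the origin vs. two ordinary nodes.
emb = np.array([[0.01, 0.02],    # popular node, artificially central
                [0.90, 0.10],
                [0.10, 0.90]])
on_sphere = project_to_hypersphere(emb)
print(np.round(np.linalg.norm(on_sphere, axis=1), 6))  # all 1.0
```

After the projection, distances between nodes depend only on the angles between their embedding directions, not on how centrally the sampling bias happened to place them.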

The resulting graph has still integrated the local affinity structure into a global affinity structure, where the length of edges reflects the (inverse) association between the nodes at their ends — but in a way that more accurately reflects the fact that the graph being analyzed is a biased sample of the ‘real’ graph.

Call for Papers: Link Analysis, Counterterrorism and Security

The Call for the LACTS 2009 workshop is now available here.

The workshop takes place at the SIAM Data Mining Conference and brings together academics, practitioners, law enforcement, and intelligence people to talk about leading-edge work in the area of adversarial data analysis.

The workshop is intended primarily for early-stage work. The proceedings are published electronically, but authors may retain copyright.

The deadline for submissions is probably late December, but perhaps a little later (still being decided).

Using private documents to improve search in public documents

I’m back from the SIAM International Conference on Data Mining, and the 5th Workshop on Link Analysis, Counterterrorism, and Security, which I helped to organize. The workshop papers are now online, along with some open problems that were discussed at the end of the workshop.

I’ll post about some ideas that were tossed around at the workshop and conference in the next few days.

Let me start by talking about the work of Roger Bradford. Information retrieval starts from a document-term matrix, which is typically extremely large and sparse, and then reduces the dimensionality by using an SVD, a process sometimes called latent semantic indexing. This creates a representation space for both documents and terms. A query is treated as if it were a kind of short document and mapped into this representation space. Its near neighbours are then the documents retrieved in response to the query; and they can be sorted in increasing distance from the query point as well, so that the most relevant come first.
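A small sketch of latent semantic indexing on an invented four-document corpus, including the fold-in of a query as a very short pseudo-document (the corpus and the `fold_in` helper are illustrative, not from Bradford's paper):

```python
import numpy as np

# Tiny document-term count matrix: rows = documents, columns = terms.
docs = ["graph node edge", "node edge cluster",
        "query retrieval index retrieval", "retrieval index document"]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(t) for t in vocab] for d in docs], float)

# Latent semantic indexing: truncated SVD keeping k latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_space = U[:, :k] * s[:k]               # document coordinates

def fold_in(query):
    """Map a query into the latent space as a short pseudo-document:
    q_hat = q V_k / sigma_k."""
    q = np.array([query.split().count(t) for t in vocab], float)
    return (q @ Vt[:k].T) / s[:k]

q = fold_in("retrieval index")
# Rank documents by cosine similarity to the query point.
sims = doc_space @ q / (np.linalg.norm(doc_space, axis=1)
                        * np.linalg.norm(q))
order = np.argsort(-sims)                  # nearest documents first
print(order[:2])                           # the two retrieval-themed docs
```

The two documents about retrieval come out nearest the query, even though neither shares every query term, because the SVD has merged co-occurring terms into common latent dimensions.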

Bradford showed that the original space can be built using a set of private documents and a set of public documents, and that the resulting representation space allows better retrieval performance than the space derived from the public documents, without allowing the properties of the private documents to be inferred.

In fact, the set of private documents can be diluted by mixing them with other documents before the process starts, making it even more difficult to work backwards to the private documents.

This process has a number of applications that he talks about in the paper. One of the most interesting is that it allows different organizations, for example allies, to share sensitive information without compromising it to each other — and still get the benefits of the relationships in the full set of documents.