Posts Tagged 'link analysis'

The power is in the edges

I’ve argued that it isn’t social media unless there are relational edges between individuals or individual objects. These edges are the drivers of power because the graph structure that emerges from them reveals a lot more than the individual nodes and edges do.

The number of LinkedIn contacts I have is now large enough that I can tell this story. I know someone from one of the more secretive US government organizations. His (or it might be her) public web presence, of course, has nothing at all to do with his day job, and we’ve never exchanged emails using his public email. Yet LinkedIn suggests him as someone I might possibly know.

The reason must be that we have enough mutual connections that the software LinkedIn uses sees that there “should” be some connection between us — it is doing edge prediction. This is exactly the kind of analysis that good intelligence tools can do on relational/graph data. The knowledge is in the links, collectively; in other words, noticing the potential link between us requires knowing both the presence of some links and the absence of others (because the system doesn’t recommend other people whose web presence is as dissimilar from mine as his is).

So, well done LinkedIn, but a cautionary tale for security folks generally, and especially those who believe in anonymization — it can’t be done!

Finding the most important node in a graph — the plot thickens

I posted earlier about finding the most important node in a graph. Many kinds of data have a natural (althought not necessarily obvious) representation as a graph, where the nodes represent pieces of data, and the edges represent the strength of the association or affinity between them. Finding the most important/interesting node is an effective way to direct human (analyst) attention to the good stuff.

The standard way to do this, then, is to build an adjacency matrix from the edge affinities, treating them as edge weights (a large weight indicates a strong affinity), transform the adjacency matrix into a walk laplacian matrix (which twists it inside out so that nodes with many/heavy edges are on the inside rather than the outside), and then use an eigendecomposition technique to project the resulting embedded graph into fewer dimensions. The projection step reveals the global structure in the graph by discounting weak variation and emphasising strong variation. The embedded graph can then be analyzed further by clustering or edge prediction or local neighbor search or one of a number of other possibilities.

There are a number of difficulties with this basically straightforward process — there are actually an alarmingly large number of choices to be made, starting with the way edge weights are constructed.

One of the more subtle issues has to do with whether or not the data used to build the graph should properly be thought of as a sample or as the entire graph. If it’s the entire graph, then it is what it is, and can be processed as described above. But if it’s a sample, then deeper thought is needed.

If the sample is truly i.i.d. then there’s not an issue — but real graphs are never like this. The sampling process always introduces artifacts and they tend to make the heavily weighted, richly connected nodes and edges overrepresented. For example, consider a collaborative filtering system that asks people to rate films they have seen. Each person who fills in some ratings is far more likely to recall films that are popular (and well advertised) than other films, even though they might not have liked the popular film all that much. If the graph edges are built from intercepted data, then edges that are most active tend to have higher probabilities of being intercepted and so appear more often in the collected data than their true frequency. Similarly, nodes with lots of edges are more likely to have been captured in the graph in the first place.

Thus, in sampled graphs, there’s a systematic bias towards heavily weighted edges, and derivatively towards nodes with many or heavily weighted edges.

When such a sampled graph is embedded, these overrepresented nodes gravitative towards the centre of the representation, and the edges that connect them therefore tend to be short. Thus two edges that have the same weight in the initial graph can end up being of quite different lengths in the embedding just because one connects two ‘popular’ nodes and the other does not. For example, suppose everyone in a film rating dataset rated a blockbuster film 5/10. The resulting graph would place the node for that film at the origin and the graph would be almost a star — the structure depending on the ratings of other films would generate some other edges and there would be connections between the other nodes, but this peripheral structure would tend to be overwhelmed by the central node to which every other node was connected. In particular, trying to make recommendations based on neighborhoods would not work at all because everyone is only two steps away from everyone else.

The solution to this sampling problem is to project the nodes onto a hypersphere centred at the origin. This forces all of the nodes to be the same distance from the origin and so cancels out (most of) the distorting effect caused by the sampling. In the film example, the node corresponding to the blockbuster gets pushed out in some direction (that depends in a complicated way on the structure of the whole graph) and so is only close to a few of the nodes corresponding to other films. (This is a bit like what happens in latent semantic indexing where a query is treated like a very short document, but has to be compared to very long documents.)

The resulting graph has still integrated the local affinity structure into a global affinity structure, where the length of edges reflects the (inverse) association between the nodes at their ends — but in a way that more accurately reflects the fact that the graph being analyzed is a biased sample of the ‘real’ graph.

Call for Papers: Link Analysis, Counterterrorism and Security

The Call for the LACTS 2009 workshop is now available here.

The workshop takes place at the SIAM Data Mining Conference and brings together academics, practitioners, law enforcement, and intelligence people to talk about leading-edge work in the area of adversarial data analysis.

The workshop is intended primarily for early-stage work. The proceedings are published electronically, but authors may retain copyright.

The deadline for submissions is probably late December, but perhaps a little later (still being decided).

Using private documents to improve search in public documents

I’m back from the SIAM International Conference on Data Mining, and the 5th Workshop on Link Analysis, Counterterrorism, and Security, which I helped to organize. The workshop papers are now online, along with some open problems that were discussed at the end of the workshop.

I’ll post about some ideas that were tossed around at the workshop and conference in the next few days.

Let me start by talking about the work of Roger Bradford. Information retrieval starts from a document-term matrix, which is typically extremely large and sparse, and then reduces the dimensionality by using an SVD, a process sometimes called latent semantic indexing. This creates a representation space for both documents and terms. A query is treated as if it were a kind of short document and mapped into this representation space. Its near neighbours are then the documents retrieved in response to the query; and they can be sorted in decreasing distance from the query point as well.

Bradford showed that the original space can be built using a set of private documents and a set of public documents, and that the resulting representation space allows better retrieval performance than the space derived from the public documents, without allowing the properties of the private documents to be inferred.

In fact, the set of private documents can be diluted by mixing them with other documents before the process starts, making it even more difficult to work backwards to the private documents.

This process has a number of applications that he talks about in the paper. One of the most interesting is that it allows different organizations, for example allies, to share sensitive information without compromising it to each other — and still get the benefits of the relationships in the full set of documents.

Workshop and Link Analysis, Counterterrorism, and Security

If you’re interested in the content of this blog, and you live in the Atlanta area, you might be interested in coming to LACTS, the Workshop on Link Analysis, Counterterrorism, and Security. It’s being held on April 26th (Saturday) as part of the SIAM International Data Mining Conference. A one-day registration deal is available.

The proceedings will also be available online, both via my website and from SIAM after the workshop.

Here is the schedule:

0825-0830: Introduction
Antonio Badia and David Skillicorn

0830-0900: Detecting Hidden Passages in Documents
Saket S.R. Mengle and Nazli Goharian

0900-0930: Exploiting Sensitive Information in Background Mode using Latent Semantic Indexing
R. B. Bradford

0930-1000: Topic Detection Using Independent Component Analysis
Scott Grant, David Skillicorn, and James R. Cordy

1000-1030: Coffee Break

1030-1100: Using AI for Sensemaking in Investigative Analysis
Summer Adams, Ashok K. Goel, and Neha Sugandh

1100-1130: Vulnerability Assessment on Adversarial Organization: Unifying Command and Control Structure Analysis and Social Network Analysis
Il-Chul Moon, Kathleen M. Carley, and Alexander H. Levis

1130-1200: Torus Graph Inference for Detection of Localized Activity
Elizabeth A. Beer, Carey E. Priebe, and Edward R. Scheinerman

1200-1330: Lunch (on your own)

1330-1430: Workshop Keynote: “The Road to Link Intelligence”
Sherry Marcus, 21st Century Technologies.

1430-1500: Enhancing the Automated Analysis of Criminal Careers
Tim K. Cocx, Walter A. Kosters, and Jeroen F.J. Laros

1500-1530: Summarization and Information Loss in Network Analysis
Jamie F. Olson and Kathleen M. Carley

1530-1545: Summing Up
Antonio Badia and David Skillicorn



Follow

Get every new post delivered to your Inbox.