Finding the most important node in a graph — the plot thickens

I posted earlier about finding the most important node in a graph. Many kinds of data have a natural (althought not necessarily obvious) representation as a graph, where the nodes represent pieces of data, and the edges represent the strength of the association or affinity between them. Finding the most important/interesting node is an effective way to direct human (analyst) attention to the good stuff.

The standard way to do this, then, is to build an adjacency matrix from the edge affinities, treating them as edge weights (a large weight indicates a strong affinity), transform the adjacency matrix into a walk laplacian matrix (which twists it inside out so that nodes with many/heavy edges are on the inside rather than the outside), and then use an eigendecomposition technique to project the resulting embedded graph into fewer dimensions. The projection step reveals the global structure in the graph by discounting weak variation and emphasising strong variation. The embedded graph can then be analyzed further by clustering or edge prediction or local neighbor search or one of a number of other possibilities.

There are a number of difficulties with this basically straightforward process — there are actually an alarmingly large number of choices to be made, starting with the way edge weights are constructed.

One of the more subtle issues has to do with whether or not the data used to build the graph should properly be thought of as a sample or as the entire graph. If it’s the entire graph, then it is what it is, and can be processed as described above. But if it’s a sample, then deeper thought is needed.

If the sample is truly i.i.d. then there’s not an issue — but real graphs are never like this. The sampling process always introduces artifacts and they tend to make the heavily weighted, richly connected nodes and edges overrepresented. For example, consider a collaborative filtering system that asks people to rate films they have seen. Each person who fills in some ratings is far more likely to recall films that are popular (and well advertised) than other films, even though they might not have liked the popular film all that much. If the graph edges are built from intercepted data, then edges that are most active tend to have higher probabilities of being intercepted and so appear more often in the collected data than their true frequency. Similarly, nodes with lots of edges are more likely to have been captured in the graph in the first place.

Thus, in sampled graphs, there’s a systematic bias towards heavily weighted edges, and derivatively towards nodes with many or heavily weighted edges.

When such a sampled graph is embedded, these overrepresented nodes gravitative towards the centre of the representation, and the edges that connect them therefore tend to be short. Thus two edges that have the same weight in the initial graph can end up being of quite different lengths in the embedding just because one connects two ‘popular’ nodes and the other does not. For example, suppose everyone in a film rating dataset rated a blockbuster film 5/10. The resulting graph would place the node for that film at the origin and the graph would be almost a star — the structure depending on the ratings of other films would generate some other edges and there would be connections between the other nodes, but this peripheral structure would tend to be overwhelmed by the central node to which every other node was connected. In particular, trying to make recommendations based on neighborhoods would not work at all because everyone is only two steps away from everyone else.

The solution to this sampling problem is to project the nodes onto a hypersphere centred at the origin. This forces all of the nodes to be the same distance from the origin and so cancels out (most of) the distorting effect caused by the sampling. In the film example, the node corresponding to the blockbuster gets pushed out in some direction (that depends in a complicated way on the structure of the whole graph) and so is only close to a few of the nodes corresponding to other films. (This is a bit like what happens in latent semantic indexing where a query is treated like a very short document, but has to be compared to very long documents.)

The resulting graph has still integrated the local affinity structure into a global affinity structure, where the length of edges reflects the (inverse) association between the nodes at their ends — but in a way that more accurately reflects the fact that the graph being analyzed is a biased sample of the ‘real’ graph.