I’ve posted frequently about the problem that analysts face when they are dealing with a set of objects, and connections between pairs of objects. For example, the nodes could be people, and the edges some measure of how much each pair of people communicate; the nodes could be reports and the edges how much each pair of reports mention the same entities; the nodes could be computers and the edges how well connected they are to each other. In other words, the natural data structure to represent the set of knowledge is a graph.
Very often such graphs are large, with thousands of nodes, and thousands of edges. The problem the analyst has is to work out which nodes of the graph deserve the most attention; often most of the graph is boring, or at least it is already well understood.
In what follows, I’ll assume that nodes by themselves have no associated score, but there is some natural way to associate a score with the edges that connect them. Let’s see how such a graph could be built, and so think about how we would keep track of which bits of it might be most important.
If the graph consists of only a single node, there’s not much to say. So consider what happens when a second node is added, and let’s assume that the second node is connected to the first (by a weighted edge). If they aren’t connected, then we might treat it as two separate graphs and two separate analysis problems. The only way to represent the graph with two nodes is with the single edge between them, but it might be plausible to make the length of the edge inversely proportional to the weight of the edge — if the weight is large, there’s a strong connection between them and so the nodes might plausibly be close together. We start to insert some geometry into the graph relationship because geometry is reasonably intuitive to us humans.
Now if we add a third node, there are two possibilities, if we want to continue to make edge length inversely proportional to edge weights (strong = close). If the third node is connected to both of the current two nodes, then the resulting structure is a triangle (whose exact shape depends on the relative size of the edge weights). If the third node is connected to only one of the current nodes, then the shape is a path.
The worst case is that adding each new node requires us to expand into another dimension: 3 nodes needs 2 dimensions, 4 nodes needs 3 dimensions, and so on, if we want to preserve the relationship between edge length and edge weight.
In this representation, nodes that are important because they are strongly connected to other important nodes (note the circularity) tend to be pulled towards the “middle” of the graph structure. So if we want to find the most important nodes, we should look near the middle.
But we don’t have that much geometry so, although the idea of the middle is intuitive, finding the middle is rather difficult. One approach is widely used in Social Network Analysis — the middle can (sort of) be found by computing how far it is from each node to the outer surface of the graph, and then looking at those nodes that are farthest inside. This works, after a fashion, but the algorithms can be complex and they tend not to tell you much else about the graph structure.
What about trying to use geometry directly and embedding the graph into an (n-1)-dimensional space? If this can be made to work then “close” actually will mean close, say in Euclidean distance.
The first possible way to do this is to build the adjacency matrix of the graph. For a graph with n nodes, this just means forming an n by n matrix with one row and column corresponding to each node. The matrix entries are zero, except that when node i and node j are connected, then entries ij and ji are set to the edge weight of the connection. We can actually dispense with the last column, since its contents can be inferred from the other columns.
Now the rows of any matrix can be interpreted geometrically by treating the entries as coordinates in a space spanned by the columns. So we can take our n by n-1 adjacency matrix and create an (n-1)-dimensional space with a point corresponding to each row of the matrix, and so to each node of the graph.
This geometry for the graph is still a bit awkward, because an (n-1)-dimensional space is still an ugly place to be (our intuitions don’t work very well in high-dimensional spaces). But if all you want to do is find the most important few nodes, then this is now easy.
Just for clarity, imagine that each entry of each row of the matrix has been divided by the sum of that row, so the entries are all less than 1. Now the embedded representation of the graph lives inside the (n-1)-dimensional (hyper)cube with one corner at the origin and the other corner (1,1,1,…..,1). The nodes with lots of high-weight edges to lots of other nodes are represented by points that are quite far from the origin, because the entries in their rows are comparatively large, and there are comparatively many non-zero entries. The nodes that are poorly connected to the rest of the graph are represented by points quite close to the origin because their rows have values that are mostly zeroes, and otherwise small. The most important nodes, then, will be on the “outer” edge of the geometric representation.
Now if we can find the direction out from the origin in which the “cloud” representing the points of the graph sticks out the most, then the extremal points in this direction will correspond to the most important nodes. Note that this direction isn’t necessarily towards the most extremal point (although we would usually expect it to be generally in that direction).
The direction in which the cloud bulges out the most from the origin is given by the principal eigenvector of the matrix representing the graph. Although this sounds complicated, it’s actually straightforward to compute. What’s more, the position of every node can be projected onto this vector and interpreted as a relative measure of the importance of that node. This is the basis for Google’s PageRank algorithm which treats the web as a giant graph, and does (more or less) this calculation to give every page an importance score. Note that this solves the problem of having to work in n-1 dimensions by projecting the entire graph structure onto a 1-dimensional space.
But there’s something badly wrong with this geometry. Suppose node i is connected to node j with weight 3, so in the geometry the point that corresponds to it is 5 units from the origin in dimension j. Now suppose that the weight on this edge is increased to 10. The point corresponding to node i suddenly moves further from the origin in dimension 3, and so further from all of the other points it is connected to, including point j. But an increase in weight is supposed to mean that i and j are closer! The problem is that, in this simple geometric embedding, the highly connected nodes are all far from the origin, on the outside of the graph structure, which tends to force them to be geometrically far from one another rather than close to one another.
The calculation of a single global importance works because large edge weights create large distances in the geometry and so tend to push nodes with edges like that far out. But in doing so messes up any other geometry among the nodes.
The solution is to use a different geometric embedding that turns the graph inside out so the the points that represent well-connected nodes are positioned close to the origin, and the sparsely connected nodes are positioned on the outer edges. This is done by converting the adjacency matrix, with its rows normalized by dividing by the rows sums, into a walk Laplacian matrix. This is done by negating all of the entries of the normalized adjacency matrix and adding 1’s down the diagonal.
The global variation among the nodes of the graph is now discovered by computing an eigendecomposition of this matrix (again straightforward) and looking at the eigenvector corresponding to the smallest non-zero eigenvalue. The intuition in this embedding is that this eigenvector points along the direction of minimal vibration in the embedded graph structure. Think of a cigar-shaped graph where the nodes are joined by metal rods. Striking it parallel to its long axis will produce much smaller vibrations than striking it parallel to any of the other axes. (In fact, the values of the eigenvector can be interpreted as the amplitude of the associated vibration of each node and the eigenvalue as the frequency of the vibration).
This approach will produce the same answer for the global importance ranking of the nodes, and so will find the same most important nodes. But the geometry is much better behaved. The distances between the embedded points behave as we would want them to (short edges = heavy weights) so clustering algorithms can be applied directly in the embedded space. The other eigenvectors also properly capture the substructure of the graph — for example, the eigenvector corresponding to the second-smallest non-zero eigenvalue describes the next most significant uncorrelated variation in the graph (in fact, the vibrational mode if the graph is struck parallel to the second axis of variation).
A geometric embedding is much more convenient if the graph is being updated. One problem with computing properties of a graph directly is that adding a single node can completely change all of the paths and all of the distances — so a social network measure has to be recomputed from scratch. In a geometric space, adding a new point doesn’t change the existing distances at all. An approximation to the position of the new node can be computed by calculating an adjacency matrix row for it, converting to a walk Laplacian row, and then using the existing eigendecomposition to embed it. If the new point really does change the structure of the graph, then this approximation is not useful; but most of the time it probably does not.
In fact, this embedding has a few other features that are not so obvious. The geometric embedding takes into account not just the (weighted) distance between nodes but also the possible presence of multiple paths between nodes. In other words, two nodes connected by lightweight paths, but many of them, will appear more similar than if only the weights had been taken into account. This is usually the right way to think about global relationships between unconnected nodes. In other words, this geometric embedding takes into account not only distance, but also density and curvature in the original graph.
This approach to finding importance is called spectral graph theory, and there is a large literature on it. But it is easy to lose sight of the wood for the trees. This short blog is intended to provide a more intuitive overview of the main point of using spectral graph techniques. In particular, if you already use PCA, it’s often possible to compute correlations of records to create a graph structure, and then analyse the structure using the approach described here. Almost always, clearer structures can be seen.