Posts Tagged 'social network analysis'

Advances in Social Network Analysis and Mining Conference — Sydney

This conference will be in Sydney in 2017, from 31st July to 3rd August.

As well as the main conference, there is also a workshop, FOSINT: Foundations of Open Source Intelligence, which may be of even more direct interest for readers of this blog.

Also I will be giving a tutorial on Adversarial Analytics as part of the conference.

The right level of abstraction = the right level to model

I think the take away from my last post is that models of systems should aim to model them at the right level of abstraction, where that right level corresponds to the places where there are bottlenecks. These bottlenecks are places where, as we zoom out in terms of abstraction, the system suddenly seems simpler. The underlying differences don’t actually make a difference; they are just variation.

The difficulty is that it’s really, really hard to see or decide where these bottlenecks are. We rightly laud Newton for seeing that a wide range of different systems could all be described by a single equation; but it’s also true that Einstein showed that this apparent simplicity was actually an approximation for a certain (large!) subclass of systems, and so the sweet spot of system modelling isn’t quite where Newton thought it was.

For living systems, it’s even harder to see where the right level of abstraction lies. Linnaeus (apparently the most-cited human) certainly created a model that was tremendously useful, working at the level of the species. This model has frayed a bit with the advent of DNA technology, since the clusters from observations don’t quite match the clusters from DNA, but it was still a huge contribution. But it’s turning out to be very hard to figure out the right level of abstractions to capture ideas like “particular disease” “particular cancer” even though we can diagnose them quite well. The variations in what’s happening in cells are extremely difficult to map to what seems to be happening in the disease.

For human systems, the level of abstraction is even harder to get right. In some settings, humans are surprisingly sheep-like and broad-brush abstractions are easy to find. But dig a little, and it all falls apart into “each person behaves as they like”. So predicting the number of “friends” a person will have on a social media site is easy (it will be distributed around Dunbar’s number), but predicting whether or not they will connect with a particular person is much, much harder. Does advertising work? Yes, about half of it (as Ogilvy famously said). But will this ad influence this person? No idea. Will knowing the genre of this book or film improve the success rate of recommendations? Yes. Will it help with this book and this person? Not so much.

Note the connection between levels of abstraction and clustering. In principle, if you can cluster (or, better, bicluster) data about your system and get (a) strong clusters, and (b) not too many of them, then you have some grounds for saying that you’re modelling at the right level. But this approach founders on the details: which attributes to include, which algorithm to use, which similarity measure, which parameters, and so on and on.

Three kinds of knowledge discovery

I’ve always made a distinction between “mainstream” data mining (or knowledge discovery or data analytics) and “adversarial” data mining — they require quite distinct approaches and algorithms. But my work with bioinformatic datasets has made me realise that there are more of these differences, and the differences go deeper than people generally understand. That may be part of the reason why some kinds of data mining are running into performance and applicability brick walls.

So here are 3 distinct kinds of data mining, with some thoughts about what makes them different:

1. Modelling natural/physical, that is clockwork, systems.
Such systems are characterised by apparent complexity, but underlying simplicity (the laws of physics). Such systems are entropy minimising everywhere. Even though parts of such systems can look extremely complex (think surface of a neutron star), the underlying system to be modelled must be simpler than its appearances would, at first glance, suggest.

What are the implications for modelling? Some data records will always be more interesting or significant than others — for most physical systems, records describing the status of deep space are much less interesting than those near a star or planet. So there are issues around the way data is sampled.
Some attributes will also be more interesting or significant than others — but, and here’s the crucial point, this significance is a global property. It’s possible to have irrelevant or uninteresting attributes, but these attributes are similarly uninteresting everywhere. Thus is makes sense to use attribute selection as part of the modelling process.

Because the underlying system is simpler than its appearance suggests, there is a bias towards simple models. In other words, physical systems are the domain of Occam’s Razor.

2. Living systems.
Such systems are characterised by apparent simplicity, but underlying complexity (at least relatively speaking). In other words, most living systems are really complicated underneath, but their appearances often conceal this complexity. It isn’t obvious to me why this should be so, and I haven’t come across much discussion about it — but living systems are full of what computing people call encapsulation, putting parts of systems into boxes with constrained interfaces to the outside.

One big example where this matters, and is starting to cause substantial problems for data mining, is the way diseases work. Most diseases are complex activities in the organism that has the disease, and their precise working out often depends on the genotype and phenotype of that organism as well as of the diseases themselves. In other words, a disease like influenza is a collaborative effort between the virus and the organism that has the flu — but it’s still possible to diagnose the disease because of large-scale regularities that we call symptoms.
It follows that, between the underlying complexity of disease, genotype, and phenotype, and the outward appearances of symptoms, or even RNA concentrations measured by microarrays, there must be substantial “bottlenecks” that reduce the underlying complexity. Our lack of understanding of these bottlenecks has made personalised medicine a much more elusive target than it seemed to be a decade ago. Systems involving living things are full of these bottlenecks that reduce the apparent complexity: species, psychology, language.

All of this has implications for data mining of systems involving living things, most of which have been ignored. First, the appropriate target for modelling should be these bottlenecks because this is where such systems “make the most sense”; but we don’t know where the bottlenecks are, that is which part of the system (which level of abstraction) should be modelled. In general, this means we don’t know how to guess the appropriate complexity of model to fit with the system. (And the model should usually be much more complex than we expect — in neurology, one of the difficult lessons has been that the human brain isn’t divided into nice functional building blocks; rather it is filled with “hacks”. So is a cell.)

Because systems involving living things are locally entropy reducing, different parts of the system play qualitatively different roles. Thus some data records are qualitatively of different significance to others, so the implicit sampling involved in collecting a dataset is much more difficult, but much more critical, than for clockwork systems.

Also, because different parts of the system are so different, the attributes relevant to modelling each part of the system will also tend to be different. Hence, we expect that biclustering will play an important role in modelling living systems. (Attribute selection may also still play a role, but only to remove globally uninteresting attributes; and this should probably be done with extreme caution.)

Systems of living things can also be said to have competing interests, even though these interests are not conscious. Thus such systems may involve communication and some kind of “social” interaction — which introduces a new kind of complexity: non-local entropy reduction. It’s not clear (to me at least) what this means for modelling, but it must mean that it’s easy to fall into a trap of using models that are too simple and too monolithic.

3. Human systems.
Human systems, of course, are also systems involving living things, but the big new feature is the presence of consciousness. Indeed, in settings where humans are involved but their actions and interactions are not conscious, models of the previous kind will suffice.

Systems involving conscious humans are locally and non-locally entropy reducing, but there are two extra feedback loops: (1) the loop within the mind of each actor which causes changes in behaviour because of modelling other actors and themself (the kind of thing that leads to “I know that he knows that I know that … so I’ll …); (2) the feedback loop between actors and data miners.

The first feedback loop creates two processes that must be considered in the modelling:
a. Self-consciousness, which generates, for example, purpose tremor;
b. Social consciousness, which generates, for example, strong signals from deception.

The second feedback loop creates two other processes:
a. Concealment, the intent or action of actors hiding some attributes or records from the modelling;
b. Manipulation, the deliberate attempt to change the outcomes of any analysis that might be applied.

I argue that all data mining involving humans has an adversarial component, because the interests of those being modelled never run exactly with each other, or with those doing the modelling, and so all of these processes must be considered whenever modelling of human systems is done. (You can find much more on this topic by reading back in the blog.)

But one obvious effect is that records and attributes need to have metadata associated with them that carries information about properties such as uncertainty or trustworthiness. Physical systems and living systems might mislead you, but only with your implicit connivance or misunderstanding; systems involving other humans can mislead you either with intent or as a side-effect of misleading someone else.

As I’ve written about before, systems where actors may be trying to conceal or manipulate require care in choosing modelling techniques so as not to be misled. On the other hand, when actors are self-conscious or socially conscious they often generate signals that can help the modelling. However, a complete way of accounting for issues such as trust at the datum level has still to be designed.

Simmelian backbones in social networks

There’s growing understanding that many, perhaps most, social networks have a structure with two main parts: a central, richly connected part sometimes called the hairball, and a set of more isolated and much smaller parts sometimes called whiskers. A growth process that reproduces this well is the forest fire model which, roughly speaking, models a new person joining one of the whiskers, and then slowly building connections via friends of friends that connect them into the hairball, and so eventually causethe whole whisker to be absorbed into the hairball. The hairball is therefore the result of the accretion of a large number of whiskers that overlap increasingly richly.

The problem, from a social network analysis point of view, is that the hairball, as its name suggests, is a big mess of overlapping communities. For reasons I’ve written about before, applying clustering or cut techniques to the hairball doesn’t do very well because there are many “long” edges between communities (since the whole graph is typically small world).

One of the talks at ASONAM 2013 provides some hope of being able to look at the structure of the hairball, using the insight that triangles are more useful than edges. The likelihood of triangles with two “long” edges is low, so this seems to be a good way of distinguishing which edges lie within communities; and, of course, there have been social theories positing triangles as an important pattern of social interaction for a century, most recently in the book Tribal Leadership, which points out that the advantage of a triangle relationship is that, when two people fall out, the third person can mediate.

Roughly speaking, the trick is to consider how many triangles each pair of nodes share and leave connections between them only when this number is large enough. The paper shows some nice examples of how this works on real-world datasets, with some convincing evidence that it does the right thing.

This approach is in contrast to previous work that has tried to peel the hairball by removing the whiskers and their edges into the hairball, and then removing the new whiskers that this reveals, and so on. While this peeling approach seems sensible and tries, in a way, to undo the hypothetical formation process, it is much less clear that it can get the order right. A “clustering” technique that is order-independent seems inherently more attractive.

The reference is: Bobo Nick, Conrad Lee, Pádraig Cunningham and Ulrik Brandes. Simmelian Backbones, ASONAM 2013; and there’s a version here:

Compelling evidence on Benghazi timeline

Kathleen Carley presented work on the social media data flow before, during, and after the Benghazi embassy attack in September 2012. She happened to be teaching a course on analysis of social media (tweets and mainstream media) over the early part of September and was able to quickly repurpose it.

Her results show that, in Libya, there was no social media discussion of the embassy attacks until several hours after they happened. Discussion of the infamous movie also only begins well after the attacks and then only as a result of speculation about whether it played any role.

In contrast, Egyptian social media feeds were abuzz with demonstration rhetoric well before the activity in Cairo.

This seems to provide a compelling argument against any “spontaneous demonstration” scenario to explain what happened in Benghazi (if anyone still thinks that). It’s also a nice demonstration of the coming of age of real-time social media analysis, although it also shows that getting real-time analysis requires having a team in place before hand.

The reference is: Near Real Time Assessment of Social Media Using Geo-Temporal Network Analytics, Kathleen M. Carley,  Juergen Pfeffera, Huan Liu, Fred Morstatter, Rebecca Goolsby, Proceedings of Advances in Social Network Analysis and Modelling (ASONAM) 2013, ACM & IEEE, 517-524.

Notes from ASONAM 2013

I’m at Asonam in Niagara Falls. I have noticed a few macro changes from the same conference last year in Istanbul:

  1. There is almost no interest in any form of clustering or community detection. I think that this is the result not of solving the problem but realising that it isn’t a well-formed problem for social networks (regular readers will be aware of my thoughts about this);
  2. There is growing awareness that preferential attachment does not generate networks that look very realistic, except when you look at them from a long way off with your eyes half closed (and some hope that models like forest fire might be close to usable);
  3. There has been a significant amount of progress in understanding the language use of tweets despite the obvious issues of dialect/patois, short length, mistyping and, often, lack of mental engagement when writing. I thought there was very little hope for this, so I’m delighted to be proved wrong. There are starting to be useful results learned from tweet corpora.

The mixture of attendees is less diverse than last year in Istanbul, not just geographically but by “home discipline”, which is  a pity.


Small world is incompatible with communities

Most human-generated graphs, especially social networks, have the small world property: their diameter is much smaller than for a random graph with the same number of nodes or, more intuitively, there are short paths linking every pair of nodes. For example, in the real world one plausible path that could be used to transfer something between A and B would be:

A — local political representative — president of A’s country — president of B’s country — local political representative — B

taking 5 steps. The famous “6 degrees of separation” from Milgram’s experiment reflects the need for willingness of the people along the path, and also shows that there must be many redundant paths connecting A and B.

The example shows that the small world property sort of implies that there’s a tree contained within the topology of the network, so that one way to find short paths is to move up the tree and down again. Of course, real networks will tend to be much richer. But a binary tree is the canonical example of a small world graph.

Now what about communities within such networks. It’s a little difficult to come up with a reasonable definition for a community, although it’s kind of intuitive. A community is a set of nodes that’s more richly connected to one another than they are to the rest of the graph.

The trouble is that the property that this definition is trying to capture is actually the opposite of what needs to happen in a small world graph. Suppose that I’m a node in a community and I want to reach some distant node via a short path. Once I get to the long-distance connection structure everything is easy because it gets me right across the network in a few steps. The problem is getting to this structure in the first place. If I have to follow a number of steps within my community (and have to do the same within the community at the other end) then the small world property breaks.

The only way to preserve the small world property and also have communities of non-trivial size is for the graph to be a tree with cliques at each leaf — then every node has immediate access to the node that is connected to the  long-distances edges of the tree. But this is a very artificial graph. In real networks, the nodes are not specialized for long-distance connection; they are simply ordinary nodes (and so could be inside communities) that happen to have an incident edge that’s connected “far away”.

In other words, a small world graph can’t really have subgraphs that are connected mostly internally with only an occasional edge going to other subgraphs. Yet it’s surprising how often the example figures in research papers in this area show such a graph. Leskovec, Lang et al. did a series of experiments looking at the community structure discovered by a large variety of algorithms. They showed what you’d expect — if you look for communities of size k, you’ll find them because you’re carving a structure that has communities all the way down into chunks of size k. The really interesting part of their results is that there is an inhomogeneity: there are better (more easily selected) communities of size around 125-150, right at the Dunbar size that seems to be hardwired into humans. And this seems to be true for several different kinds of networks, suggesting that it this size is cognitively mediated rather than directly socially mediated.

J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6(1), 2009.

The bottom line: there are communities in a network exactly to the extent that it fails to be small world (except for the trivial tree-connected cliques case). We think that there are communities in the networks to which we belong but it may (must) be, to some extent, an illusion that we impose by using other, non-graph criteria for who’s in and who’s out. Useful community detection probably can’t be done on the basis of graph structure alone.

Understanding High-Dimensional Spaces

My new book with the title above has been published by Springer, just in time for Christmas gift giving for the data miner on your list.

The book explores how to represent high-dimensional data (which almost all data is), and how to understand the models, particularly for problems where the goal is to find the most interesting subset of the records. “Interesting”, of course, means different things in different settings; a big part of the focus is on finding outliers and anomalies.

Partly the book is a reaction to the often unwitting assumption that clouds of data can be understood as if they had a single centre — for example, much of the work on social networks.

The most important technical ideas are (a) that clusters themselves need to be understood as having a structure which provides each one with a higher-level context that is usually important to make sense of them, and (b) that the empty space between clusters also provides information that can help to understand the non-empty-space.

You can buy the book here.

Problems at the heart of social network analysis

About a month ago, I was at the conference on Advances in Social Network Analysis and Modelling (ASONAM), a first for me. It’s a wide ranging conference with, for example, both sociologists and computer scientists presenting. Of course, “social networks” in this context means so-called online social networks.

I was surprised by the kind of presentations, and I came away thinking that there are two big problems at the heart of social network analysis that are unsolved and that are hardly being investigated, or even thought about:

  1. We can’t generate social networks artificially that look much like the real thing. Clearly, there’s been some progress here: preferential attachment gets much closer to the real thing than random graph models did; assortativity was another big step closer to reality; but it seems as if there must be at least one more big subtle process that we don’t understand about how real-world social networks get built. In other words, when humans form pairwise connections, there’s some aspect of that that we still don’t understand.
  2. There seems to be a universal assumption that an individual reveals, in his/her online social activity, a kind of homunculus of their total social activity — in other words, online behavior is a subset of real-world behavior. I only have to say this explicitly to show how fragile (maybe even foolhardy) such an assumption is. What people post on Facebook or tweet is a side-channel of their full behavioral spectrum — and an extremely odd side-channel as well. I haven’t seen any work in behavioral modelling that tries to understand how such a side-channel works and what it captures. But building models that make the subset assumption seems to me to be building castles on clouds.

There were some other, smaller assumptions that seemed to me to be accepted with too little thought as well. First, I don’t think that centrality measures have much to tell us in all but the smallest networks, because centrality implicitly assumes that a network has a single centre — but most networks have multiple centres whether or not they contain multiple communities (clusters); and anyway they mostly do.

Second, I think that the creation of a link is qualitatively a different kind of action to any subsequent use of that link — and so it is important to model them differently. This is quite tricky to do, it seems to me, but might repay the effort. Sometimes, the creation is a weak signal and its only the use of the link that makes it a noteworthy connection; friending someone is pointless unless there’s some actual interaction. On the other hand, marrying someone is a strong status-changing signal regardless of whether anything happens subsequently (for example, if nothing happens subsequently, immigration enforcement authorities may take an interest).

The power is in the edges

I’ve argued that it isn’t social media unless there are relational edges between individuals or individual objects. These edges are the drivers of power because the graph structure that emerges from them reveals a lot more than the individual nodes and edges do.

The number of LinkedIn contacts I have is now large enough that I can tell this story. I know someone from one of the more secretive US government organizations. His (or it might be her) public web presence, of course, has nothing at all to do with his day job, and we’ve never exchanged emails using his public email. Yet LinkedIn suggests him as someone I might possibly know.

The reason must be that we have enough mutual connections that the software LinkedIn uses sees that there “should” be some connection between us — it is doing edge prediction. This is exactly the kind of analysis that good intelligence tools can do on relational/graph data. The knowledge is in the links, collectively; in other words, noticing the potential link between us requires knowing both the presence of some links and the absence of others (because the system doesn’t recommend other people whose web presence is as dissimilar from mine as his is).

So, well done LinkedIn, but a cautionary tale for security folks generally, and especially those who believe in anonymization — it can’t be done!

What is social media?

I was at a meeting last week whose focus was on social media. It quickly became clear that there were two kinds of interests. One group wanted to build high-level systems that would revolutionize business and government (somehow) leveraging social media; another group were building or wanted to build tools that would provide some kind of meta-view of social media content and activity.

The topic that was missing from all of the discussion was what social media was, and why it is the way it is; and so I came away feeling like the entire discussion, and quite a lot of work, was dancing on clouds. There seem to be a number of things that “everybody knows” about social media, but for which there seems to be little or no evidence. The Arab Spring was driven by social media! Well, maybe, but (a) was it and how much, (b) which parts were important and which were irrelevant?

It seems helpful to divide social media into three categories:

1.  Media that is essentially public access publishing or public access (micro)blogging. Although sites that provide this kind of functionality are often considered “social” there is almost nothing social about them — yes, the audience for posts can be restricted to a particular group, but that’s always been true of any publication. There is an interesting question lurking here though: what are the reasons why individuals read such posts? What kind of bond does it imply between the reader and the author? (Cynically, why would I care what even my closest friend had for breakfast?)

2. Media that start as public access publishing, but where the conversation built on an initial post is more important or interesting than the initial post itself — in other words, there’s something emergent in the conversation that transcends what any of the participants would have said ab initio. This is a kind of social knowledge or opinion construction, and there are lots of interesting questions about who participates, what their roles are, and how the content and tone are affected by the interactions. This is, of course, not a new phenomenon but what’s new is the scope and the detail of what’s recorded, allowing answers to be worked out in ways that were impractical or too expensive before.

3. Media in which explicit relational links are created between one person and another. This is the real heart of social media. Relational links between a pair of people have, of course, always existed, but they could only be constructed in a small number of ways and were (almost always) limited by geography.

The emergent structure of these links is a really interesting artifact that deserves study and from which we will probably learn a lot about what it means to be human in a global society. What does it mean when one person “friends” another? This is one question for which simple answers tend to be assumed, but even a brief consideration of A’s Facebook friends and the rest of A’s relationships in the real world quickly shows that there’s a complex connection between the two sets (and it depends heavily on characteristics of A).

One thing that quickly becomes clear when these questions are addressed computationally is that we aren’t going to get far until relationship links are typed. It’s fairly easy to look at each relationship and give it a numerical weight that reflects (say) closeness — but it’s still true that different kinds of relationships behave differently, and need to be modelled differently to understand them. (Social media sites should also implement this typing — not every piece of data should flow down every link of A’s social network.)

The fundamental question in a world where one person can create a visible relationship, is what does this mean — for the person creating it, for the person at the other end of the relationship, and for the emergent graph structure that a collection of these individual relationships creates. Good, solid answers to this question would be a foundation on which much more useful applications could be built.

Structure of social network graphs

Many researchers study social network graphs to try and understand how we as humans interact, especially in information systems and online. However, it has always been difficult to validate results because privacy concerns usually limit access to real datasets. Many results have been validated using artificial graphs, generated in a way that mimics the large-scale properties of real graphs. For example, the artificial graphs look like real graphs in the sense that they obey power laws, have the right kind of degree sequences and so on.

Often, preferential attachment is used as the construction technique. In this algorithm, an edge is attached to a vertex with a probability in proportion to the number of edges already attached to it.  This seems intuitively plausible in many human settings: a person with many friends tends to meet more people and so has a greater chance of making more friends.

There have been hints for a while that these artificial graphs were not quite like real graphs, even though they match according to many large-scale measures. For example, Newman showed that, in human graphs, high-degree nodes tended to be connected to high-degree nodes, while in technical networks this was not the case — even though both looked the same from a power-law perspective.

But now Faloutsos’s group at CMU have shown convincingly that there are substantial differences between artificial and real graphs. They looked at what happens to the diameter of a graph as edges are uniformly randomly deleted. As edges are deleted, the diameter grows slowly but, at some point, there’s a sharp increase. They call this the shatter point.

The important thing is that the shatter point of artificial graphs is substantially higher (i.e. the fraction of edges remaining when this happens) than for real graphs. In other words, graphs generated by humans rather than simply by preferential attachment are somehow tougher. Although humans must be choosing edges to connect based on local criteria, they must somehow do this in a way that makes the global srtucture of the graph more robust. It’s not at all clear (to me at least) how this happens, but it seems plausible.

One of the implications of this difference is that it calls into question much of the conventional wisdom about social networks, whenever this has been derived from, or validated by, artificially generated datasets. Which is quite a lot of the time.

Call for Papers: Link Analysis, Counterterrorism and Security

The Call for the LACTS 2009 workshop is now available here.

The workshop takes place at the SIAM Data Mining Conference and brings together academics, practitioners, law enforcement, and intelligence people to talk about leading-edge work in the area of adversarial data analysis.

The workshop is intended primarily for early-stage work. The proceedings are published electronically, but authors may retain copyright.

The deadline for submissions is probably late December, but perhaps a little later (still being decided).