Posts Tagged 'social network analysis'

Advances in Social Network Analysis and Mining Conference — Sydney

This conference will be in Sydney in 2017, from 31st July to 3rd August.

As well as the main conference, there is also a workshop, FOSINT: Foundations of Open Source Intelligence, which may be of even more direct interest for readers of this blog.

I will also be giving a tutorial on Adversarial Analytics as part of the conference.

The right level of abstraction = the right level to model

I think the takeaway from my last post is that models of systems should aim to model them at the right level of abstraction, where that right level corresponds to the places where there are bottlenecks. These bottlenecks are places where, as we zoom out in terms of abstraction, the system suddenly seems simpler. The underlying differences don’t actually make a difference; they are just variation.

The difficulty is that it’s really, really hard to see or decide where these bottlenecks are. We rightly laud Newton for seeing that a wide range of different systems could all be described by a single equation; but it’s also true that Einstein showed that this apparent simplicity was actually an approximation for a certain (large!) subclass of systems, and so the sweet spot of system modelling isn’t quite where Newton thought it was.

For living systems, it’s even harder to see where the right level of abstraction lies. Linnaeus (apparently the most-cited human) certainly created a model that was tremendously useful, working at the level of the species. This model has frayed a bit with the advent of DNA technology, since the clusters from observations don’t quite match the clusters from DNA, but it was still a huge contribution. But it’s turning out to be very hard to figure out the right level of abstraction to capture ideas like “particular disease” or “particular cancer”, even though we can diagnose them quite well. The variations in what’s happening in cells are extremely difficult to map to what seems to be happening in the disease.

For human systems, the level of abstraction is even harder to get right. In some settings, humans are surprisingly sheep-like and broad-brush abstractions are easy to find. But dig a little, and it all falls apart into “each person behaves as they like”. So predicting the number of “friends” a person will have on a social media site is easy (it will be distributed around Dunbar’s number), but predicting whether or not they will connect with a particular person is much, much harder. Does advertising work? Yes, about half of it (as Ogilvy famously said). But will this ad influence this person? No idea. Will knowing the genre of this book or film improve the success rate of recommendations? Yes. Will it help with this book and this person? Not so much.

Note the connection between levels of abstraction and clustering. In principle, if you can cluster (or, better, bicluster) data about your system and get (a) strong clusters, and (b) not too many of them, then you have some grounds for saying that you’re modelling at the right level. But this approach founders on the details: which attributes to include, which algorithm to use, which similarity measure, which parameters, and so on and on.
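To make the idea of “strong clusters” a little more operational, here’s a minimal sketch (the toy data, the Euclidean distance, and the strength score are all my assumptions, not a recipe from the post): score a candidate clustering by comparing average between-cluster distance to average within-cluster distance.

```python
# Illustrative sketch only: a crude "cluster strength" score.
# Strong, well-separated clusters give a ratio well above 1.

from itertools import combinations


def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_strength(clusters):
    """Ratio of mean between-cluster to mean within-cluster distance."""
    within, between = [], []
    for i, ci in enumerate(clusters):
        within.extend(dist(a, b) for a, b in combinations(ci, 2))
        for cj in clusters[i + 1:]:
            between.extend(dist(a, b) for a in ci for b in cj)
    return (sum(between) / len(between)) / (sum(within) / len(within))


# Two tight, well-separated clusters: a high strength ratio.
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
print(cluster_strength(good))
```

Even in this toy form, the caveats from the paragraph above apply with full force: the score changes with the attributes included, the distance measure, and the clustering itself.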

Three kinds of knowledge discovery

I’ve always made a distinction between “mainstream” data mining (or knowledge discovery or data analytics) and “adversarial” data mining — they require quite distinct approaches and algorithms. But my work with bioinformatic datasets has made me realise that there are more of these differences, and the differences go deeper than people generally understand. That may be part of the reason why some kinds of data mining are running into performance and applicability brick walls.

So here are 3 distinct kinds of data mining, with some thoughts about what makes them different:

1. Modelling natural/physical (that is, clockwork) systems.
Such systems are characterised by apparent complexity, but underlying simplicity (the laws of physics). Such systems are entropy minimising everywhere. Even though parts of such systems can look extremely complex (think surface of a neutron star), the underlying system to be modelled must be simpler than its appearances would, at first glance, suggest.

What are the implications for modelling? Some data records will always be more interesting or significant than others — for most physical systems, records describing the status of deep space are much less interesting than those near a star or planet. So there are issues around the way data is sampled.
Some attributes will also be more interesting or significant than others — but, and here’s the crucial point, this significance is a global property. It’s possible to have irrelevant or uninteresting attributes, but these attributes are similarly uninteresting everywhere. Thus it makes sense to use attribute selection as part of the modelling process.
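A minimal sketch of what “global” attribute selection means in practice (the toy data and the use of variance as the relevance score are my assumptions): because relevance doesn’t vary from one part of a clockwork system to another, low-relevance attributes can be dropped once, before modelling, rather than per region of the data.

```python
# Sketch: global attribute selection by a variance filter.
# For clockwork systems, an attribute that is uninteresting is
# uninteresting everywhere, so one global decision suffices.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def select_attributes(rows, threshold=0.01):
    """Keep indices of attributes whose variance across ALL records
    exceeds a threshold: a global, not local, decision."""
    cols = list(zip(*rows))
    return [i for i, col in enumerate(cols) if variance(col) > threshold]


data = [
    (1.0, 5.0, 0.0),
    (2.0, 5.0, 0.0),
    (3.0, 5.0, 0.0),
]
print(select_attributes(data))  # attribute 0 varies; 1 and 2 are constant
```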

Because the underlying system is simpler than its appearance suggests, there is a bias towards simple models. In other words, physical systems are the domain of Occam’s Razor.

2. Living systems.
Such systems are characterised by apparent simplicity, but underlying complexity (at least relatively speaking). In other words, most living systems are really complicated underneath, but their appearances often conceal this complexity. It isn’t obvious to me why this should be so, and I haven’t come across much discussion about it — but living systems are full of what computing people call encapsulation, putting parts of systems into boxes with constrained interfaces to the outside.

One big example where this matters, and is starting to cause substantial problems for data mining, is the way diseases work. Most diseases are complex activities in the organism that has the disease, and their precise working out often depends on the genotype and phenotype of that organism as well as of the diseases themselves. In other words, a disease like influenza is a collaborative effort between the virus and the organism that has the flu — but it’s still possible to diagnose the disease because of large-scale regularities that we call symptoms.
It follows that, between the underlying complexity of disease, genotype, and phenotype, and the outward appearances of symptoms, or even RNA concentrations measured by microarrays, there must be substantial “bottlenecks” that reduce the underlying complexity. Our lack of understanding of these bottlenecks has made personalised medicine a much more elusive target than it seemed to be a decade ago. Systems involving living things are full of these bottlenecks that reduce the apparent complexity: species, psychology, language.

All of this has implications for data mining of systems involving living things, most of which have been ignored. First, the appropriate target for modelling should be these bottlenecks because this is where such systems “make the most sense”; but we don’t know where the bottlenecks are, that is which part of the system (which level of abstraction) should be modelled. In general, this means we don’t know how to guess the appropriate complexity of model to fit with the system. (And the model should usually be much more complex than we expect — in neurology, one of the difficult lessons has been that the human brain isn’t divided into nice functional building blocks; rather it is filled with “hacks”. So is a cell.)

Because systems involving living things are locally entropy reducing, different parts of the system play qualitatively different roles. Thus some data records are qualitatively of different significance to others, so the implicit sampling involved in collecting a dataset is much more difficult, but much more critical, than for clockwork systems.

Also, because different parts of the system are so different, the attributes relevant to modelling each part of the system will also tend to be different. Hence, we expect that biclustering will play an important role in modelling living systems. (Attribute selection may also still play a role, but only to remove globally uninteresting attributes; and this should probably be done with extreme caution.)
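To show what biclustering is asking that ordinary attribute selection isn’t, here’s a toy score in the spirit of classic biclustering methods such as Cheng and Church’s (the data is invented, and this is a scoring function, not a full biclustering algorithm): a candidate bicluster is a subset of records *and* a subset of attributes, judged coherent when its mean squared residue is low.

```python
# Sketch: mean squared residue of a candidate bicluster.
# Low residue = the chosen records behave coherently on the
# chosen attributes, even if they don't on the rest.

def mean_squared_residue(matrix, rows, cols):
    sub = [[matrix[r][c] for c in cols] for r in rows]
    n, m = len(rows), len(cols)
    row_mean = [sum(row) / m for row in sub]
    col_mean = [sum(sub[i][j] for i in range(n)) / n for j in range(m)]
    total = sum(map(sum, sub)) / (n * m)
    return sum(
        (sub[i][j] - row_mean[i] - col_mean[j] + total) ** 2
        for i in range(n) for j in range(m)
    ) / (n * m)


data = [
    [1, 2, 9],   # records 0 and 1 are coherent on attributes 0 and 1...
    [3, 4, 0],
    [7, 1, 5],   # ...but record 2 and attribute 2 break the pattern.
]
print(mean_squared_residue(data, [0, 1], [0, 1]))      # 0.0: perfectly coherent
print(mean_squared_residue(data, [0, 1, 2], [0, 1, 2]))  # much larger
```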

Systems of living things can also be said to have competing interests, even though these interests are not conscious. Thus such systems may involve communication and some kind of “social” interaction — which introduces a new kind of complexity: non-local entropy reduction. It’s not clear (to me at least) what this means for modelling, but it must mean that it’s easy to fall into a trap of using models that are too simple and too monolithic.

3. Human systems.
Human systems, of course, are also systems involving living things, but the big new feature is the presence of consciousness. Indeed, in settings where humans are involved but their actions and interactions are not conscious, models of the previous kind will suffice.

Systems involving conscious humans are locally and non-locally entropy reducing, but there are two extra feedback loops: (1) the loop within the mind of each actor which causes changes in behaviour because of modelling other actors and themself (the kind of thing that leads to “I know that he knows that I know that … so I’ll …”); (2) the feedback loop between actors and data miners.

The first feedback loop creates two processes that must be considered in the modelling:
a. Self-consciousness, which generates, for example, purpose tremor;
b. Social consciousness, which generates, for example, strong signals from deception.

The second feedback loop creates two other processes:
a. Concealment, the intent or action of actors hiding some attributes or records from the modelling;
b. Manipulation, the deliberate attempt to change the outcomes of any analysis that might be applied.

I argue that all data mining involving humans has an adversarial component, because the interests of those being modelled never run exactly with each other, or with those doing the modelling, and so all of these processes must be considered whenever modelling of human systems is done. (You can find much more on this topic by reading back in the blog.)

But one obvious effect is that records and attributes need to have metadata associated with them that carries information about properties such as uncertainty or trustworthiness. Physical systems and living systems might mislead you, but only with your implicit connivance or misunderstanding; systems involving other humans can mislead you either with intent or as a side-effect of misleading someone else.

As I’ve written about before, systems where actors may be trying to conceal or manipulate require care in choosing modelling techniques so as not to be misled. On the other hand, when actors are self-conscious or socially conscious they often generate signals that can help the modelling. However, a complete way of accounting for issues such as trust at the datum level has still to be designed.

Simmelian backbones in social networks

There’s growing understanding that many, perhaps most, social networks have a structure with two main parts: a central, richly connected part sometimes called the hairball, and a set of more isolated and much smaller parts sometimes called whiskers. A growth process that reproduces this well is the forest fire model which, roughly speaking, models a new person joining one of the whiskers, and then slowly building connections via friends of friends that connect them into the hairball, and so eventually cause the whole whisker to be absorbed into the hairball. The hairball is therefore the result of the accretion of a large number of whiskers that overlap increasingly richly.

The problem, from a social network analysis point of view, is that the hairball, as its name suggests, is a big mess of overlapping communities. For reasons I’ve written about before, applying clustering or cut techniques to the hairball doesn’t do very well because there are many “long” edges between communities (since the whole graph is typically small world).

One of the talks at ASONAM 2013 provides some hope of being able to look at the structure of the hairball, using the insight that triangles are more useful than edges. The likelihood of triangles with two “long” edges is low, so this seems to be a good way of distinguishing which edges lie within communities; and, of course, there have been social theories positing triangles as an important pattern of social interaction for a century, most recently in the book Tribal Leadership, which points out that the advantage of a triangle relationship is that, when two people fall out, the third person can mediate.

Roughly speaking, the trick is to consider how many triangles each pair of nodes share and leave connections between them only when this number is large enough. The paper shows some nice examples of how this works on real-world datasets, with some convincing evidence that it does the right thing.
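A minimal sketch of that edge-filtering idea (this is the basic triangle filter, not the authors’ exact algorithm, which ranks neighbourhoods more carefully): count the triangles each edge participates in, i.e. the common neighbours of its endpoints, and keep only edges embedded in at least k triangles.

```python
# Sketch: keep only edges that lie in at least k triangles.
# "Long" bridge edges between communities rarely close triangles,
# so they are the ones filtered out.

from collections import defaultdict


def triangle_backbone(edges, k=1):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Edge (u, v) lies in one triangle per common neighbour of u and v.
    return {(u, v) for u, v in edges if len(adj[u] & adj[v]) >= k}


# A 4-clique (a tight community) plus one "long" bridge edge to node 9.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 9)]
print(triangle_backbone(edges, k=1))  # the bridge (4, 9) is dropped
```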

This approach is in contrast to previous work that has tried to peel the hairball by removing the whiskers and their edges into the hairball, and then removing the new whiskers that this reveals, and so on. While this peeling approach seems sensible and tries, in a way, to undo the hypothetical formation process, it is much less clear that it can get the order right. A “clustering” technique that is order-independent seems inherently more attractive.

The reference is: Bobo Nick, Conrad Lee, Pádraig Cunningham and Ulrik Brandes. Simmelian Backbones, ASONAM 2013.

Compelling evidence on Benghazi timeline

Kathleen Carley presented work on the social media data flow before, during, and after the Benghazi embassy attack in September 2012. She happened to be teaching a course on analysis of social media (tweets and mainstream media) over the early part of September and was able to quickly repurpose it.

Her results show that, in Libya, there was no social media discussion of the embassy attacks until several hours after they happened. Discussion of the infamous movie also only begins well after the attacks and then only as a result of speculation about whether it played any role.

In contrast, Egyptian social media feeds were abuzz with demonstration rhetoric well before the activity in Cairo.

This seems to provide a compelling argument against any “spontaneous demonstration” scenario to explain what happened in Benghazi (if anyone still thinks that). It’s also a nice demonstration of the coming of age of real-time social media analysis, although it also shows that getting real-time analysis requires having a team in place beforehand.

The reference is: Kathleen M. Carley, Juergen Pfeffer, Huan Liu, Fred Morstatter and Rebecca Goolsby. Near Real Time Assessment of Social Media Using Geo-Temporal Network Analytics, Proceedings of Advances in Social Network Analysis and Mining (ASONAM) 2013, ACM & IEEE, 517-524.

Notes from ASONAM 2013

I’m at ASONAM in Niagara Falls. I have noticed a few macro changes from the same conference last year in Istanbul:

  1. There is almost no interest in any form of clustering or community detection. I think that this is the result not of solving the problem but realising that it isn’t a well-formed problem for social networks (regular readers will be aware of my thoughts about this);
  2. There is growing awareness that preferential attachment does not generate networks that look very realistic, except when you look at them from a long way off with your eyes half closed (and some hope that models like forest fire might be close to usable);
  3. There has been a significant amount of progress in understanding the language use of tweets despite the obvious issues of dialect/patois, short length, mistyping and, often, lack of mental engagement when writing. I thought there was very little hope for this, so I’m delighted to be proved wrong. There are starting to be useful results learned from tweet corpora.

The mixture of attendees is less diverse than last year in Istanbul, not just geographically but by “home discipline”, which is a pity.


Small world is incompatible with communities

Most human-generated graphs, especially social networks, have the small world property: their diameter is much smaller than for a random graph with the same number of nodes or, more intuitively, there are short paths linking every pair of nodes. For example, in the real world one plausible path that could be used to transfer something between A and B would be:

A — local political representative — president of A’s country — president of B’s country — local political representative — B

taking 5 steps. The famous “6 degrees of separation” from Milgram’s experiment reflects the need for the willingness of the people along the path, and also shows that there must be many redundant paths connecting A and B.

The example shows that the small world property sort of implies that there’s a tree contained within the topology of the network, so that one way to find short paths is to move up the tree and down again. Of course, real networks will tend to be much richer. But a binary tree is the canonical example of a small world graph.
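A quick way to see why a binary tree is the canonical small world graph is to measure it: the diameter of a complete binary tree grows like log n, tiny compared to the number of nodes. A toy sketch (plain BFS, my own construction):

```python
# Sketch: the diameter of a complete binary tree grows like 2*log2(n),
# so even a large tree has short paths between every pair of nodes.

from collections import deque


def diameter(adj):
    """Longest shortest path, by BFS from every node (fine for toy graphs)."""
    def ecc(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(ecc(u) for u in adj)


def binary_tree(n):
    """Complete binary tree on nodes 0..n-1 (node i's children: 2i+1, 2i+2)."""
    adj = {i: [] for i in range(n)}
    for i in range(1, n):
        p = (i - 1) // 2
        adj[i].append(p)
        adj[p].append(i)
    return adj


print(diameter(binary_tree(1023)))  # 18: leaf to leaf via the root
```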

Now what about communities within such networks? It’s a little difficult to come up with a reasonable definition for a community, although it’s kind of intuitive. A community is a set of nodes that’s more richly connected to one another than they are to the rest of the graph.

The trouble is that the property that this definition is trying to capture is actually the opposite of what needs to happen in a small world graph. Suppose that I’m a node in a community and I want to reach some distant node via a short path. Once I get to the long-distance connection structure everything is easy because it gets me right across the network in a few steps. The problem is getting to this structure in the first place. If I have to follow a number of steps within my community (and have to do the same within the community at the other end) then the small world property breaks.

The only way to preserve the small world property and also have communities of non-trivial size is for the graph to be a tree with cliques at each leaf — then every node has immediate access to the node that is connected to the long-distance edges of the tree. But this is a very artificial graph. In real networks, the nodes are not specialized for long-distance connection; they are simply ordinary nodes (and so could be inside communities) that happen to have an incident edge that’s connected “far away”.
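A toy construction (my own, for illustration) makes the tension concrete: in a “ring of cliques”, communities joined only by occasional bridge edges, the diameter grows linearly with the number of communities, so the small world property fails.

```python
# Sketch: communities connected by single bridge edges cannot be
# small world. Doubling the number of cliques roughly doubles the
# diameter, instead of adding a constant.

from collections import deque


def diameter(adj):
    """Longest shortest path, by BFS from every node (fine for toy graphs)."""
    def ecc(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(ecc(u) for u in adj)


def ring_of_cliques(num_cliques, size):
    """num_cliques cliques of `size` nodes each; the last node of clique i
    bridges to the first node of clique i+1 (wrapping around)."""
    adj = {}
    for c in range(num_cliques):
        nodes = [c * size + i for i in range(size)]
        for u in nodes:
            adj[u] = [v for v in nodes if v != u]
    for c in range(num_cliques):
        u = c * size + size - 1             # last node of clique c
        v = ((c + 1) % num_cliques) * size  # first node of the next clique
        adj[u].append(v)
        adj[v].append(u)
    return adj


# The second diameter is nearly double the first: linear, not logarithmic.
print(diameter(ring_of_cliques(8, 5)), diameter(ring_of_cliques(16, 5)))
```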

In other words, a small world graph can’t really have subgraphs that are connected mostly internally with only an occasional edge going to other subgraphs. Yet it’s surprising how often the example figures in research papers in this area show such a graph. Leskovec, Lang et al. did a series of experiments looking at the community structure discovered by a large variety of algorithms. They showed what you’d expect — if you look for communities of size k, you’ll find them because you’re carving a structure that has communities all the way down into chunks of size k. The really interesting part of their results is that there is an inhomogeneity: there are better (more easily selected) communities of size around 125-150, right at the Dunbar size that seems to be hardwired into humans. And this seems to be true for several different kinds of networks, suggesting that this size is cognitively mediated rather than directly socially mediated.

J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6(1), 2009.

The bottom line: there are communities in a network exactly to the extent that it fails to be small world (except for the trivial tree-connected cliques case). We think that there are communities in the networks to which we belong but it may (must) be, to some extent, an illusion that we impose by using other, non-graph criteria for who’s in and who’s out. Useful community detection probably can’t be done on the basis of graph structure alone.