Refining “Data Science”

Regular readers will know that I have been thinking about the constellation of ideas that are getting a lot of play in universities and the research community around words like “data science” and “big data”, and especially the intersection of these ideas with the other constellation of “data mining”, “knowledge discovery” and “machine learning”.

I’ve argued that inductive model discovery (which I think is the core of all of these ideas) is a new way of doing science that is rapidly replacing the conventional Enlightenment or Popperian view of science. This is happening especially quickly in fields that struggled to apply the conventional scientific method, especially in medicine, the social “sciences”, business schools, and in the humanities.

Attending the International Conference on Computational Science meeting made me realise, however, that computational science is a part of this story as well.

Here’s how I see the connections between these three epistemologies:

  1. Conventional science. Understand systems via controlled experiments: setting up configurations that differ in only a few managed ways and seeing whether those differences correspond to different system behaviours. If they do, construct an “explanation”; if they don’t, it’s back to the drawing board.
  2. Computational science. Understand systems by building simulations of them and tweaking the simulations to see if the differences are those that are expected from the tweaks. (Simulations increase the range of systems that can be investigated when either the tweaks can’t be done on the real system, or when the system is hypothesised to be emergent from some simpler pieces.)
  3. Data science. Understand systems by looking at the different configurations that naturally occur and seeing how these correspond to different system behaviours. When they do, construct an “explanation”.

In other words, conventional science pokes the system being investigated in careful ways and sees how it reacts; computational science creates a replica of the system and pokes that; and data science looks at the system being poked and tries to match the reactions to the poking.

Underlying these differences in approach are, of course, differences in validation: how one tells if an explanation is sufficient. The first two both start from a hypothesis and use statistical machinery to decide whether the hypothesis is supported sufficiently strongly. The difference is that computational science has more flexibility to set up controlled experiments and so, all things considered, can get stronger evidence. (But there is always the larger question of whether the simulation actually reproduces the system of interest: critical, but often ignored, and with huge risks of “unknown unknowns”.) Data science, in contrast, validates its models of the system being studied by approaches such as the use of a test set, a part of the data from the system that was not used to build the model, but which should behave as the original system did. It is also buttressed by the ability to generate multiple models and so compare among them.

Data science is advancing on two fronts: first, the flexibility it provides to conventional science not to have to construct carefully balanced controlled experiments; second, and much more significantly, the opportunity it creates for making scientific progress in the social sciences and humanities, replacing “qualitative” by “quantitative” in unprecedented ways.

Spectral graph embedding doesn’t work on an adjacency matrix

I’ve heard several talks at conferences in the past few weeks where someone has run an eigendecomposition or SVD on an adjacency matrix and assumed that the embedding they end up with is meaningful. Some of them noticed that this embedding didn’t represent their graph very well. There’s a simple explanation for that — it’s wrong. In this post I’ll try and explain why.

For a graph with n nodes, an adjacency matrix is an n×n matrix whose (i, j)th entry represents the weight of the edge connecting node i and node j. The entries of this matrix are all non-negative and the matrix must be symmetric. (There are ways to handle non-symmetric matrices, that is, directed graphs, but they require significantly more care to embed appropriately.)

Now remember, eigendecompositions or SVDs are numeric algorithms that don’t know that the content of this matrix represents a graph. They regard the rows of the adjacency matrix as vectors in an n-dimensional vector space — and this view does not fit very well with the graph that this matrix is representing. For example, a well-connected node in the graph has a corresponding row with many non-zero entries; as a vector, then, it is quite long and so the point corresponding to its end is far from the origin. A poorly connected node, on the other hand, has mostly zero entries in its row, so it corresponds to a short vector. All of the entries of the adjacency matrix are non-negative, so all of these vectors are in the positive hyperquadrant.
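A small sketch makes the point about row lengths visible. The 5-node graph here is hypothetical (a hub plus one extra edge), chosen only to illustrate; for an unweighted graph the length of a node's row vector is the square root of its degree.

```python
import numpy as np

# Hypothetical unweighted graph: node 0 is a hub joined to every other node,
# and there is one extra edge between nodes 1 and 2.
A = np.array([
    [0, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
], dtype=float)

degrees = A.sum(axis=1)                   # how well connected each node is
row_lengths = np.linalg.norm(A, axis=1)   # distance of each row's point from the origin

for node, (d, length) in enumerate(zip(degrees, row_lengths)):
    print(f"node {node}: degree {d:.0f}, row length {length:.3f}")
```

The hub ends up furthest from the origin, and every point has only non-negative coordinates.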

The cloud of points corresponding to the graph therefore looks like this figure:


where the red area represents the well-connected nodes of the graph. The eigendecomposition/SVD of this cloud corresponds to a rotation to new axes and (usually) a projection to a lower-dimensional space. There are several problems with this.

First, the well-connected nodes are on the outside of the cloud, but they should be in the middle — they are important and so should be central. Second, the well-connected nodes should be close to one another in general but they are spread along the outer shell of the cloud. In other words, the cloud derived from the adjacency matrix is inside-out with respect to the natural and expected structure of the graph. Any embedding derived from this cloud is going to inherit its inside-out structure and so will be close to useless.

There is also an equally serious issue: the direction of the first eigenvector of such a cloud will be the vector from the origin to ‘the center of the cloud’ because the numerically greatest variation is between the origin and this center. This vector is shown as black in the figure. So far so good: projection onto this vector does indeed provide an importance ranking for the graph nodes, with the most important projected onto the end away from the origin.

However, the second and subsequent axes are necessarily orthogonal to this first axis — but directions orthogonal to it do not tell us anything about the variation within the cloud. If we took exactly the same shaped cloud and moved it a little in the positive hyperquadrant, the first axis would change, forcing changes in all of the other axes, but the shape of the cloud has not changed! In other words, all of the axes after the first are meaningless as measures of variation in the graph.
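This effect is easy to check numerically. The sketch below uses a made-up four-point cloud in two dimensions (an assumption, purely for illustration) and places the same shape at two different spots in the positive quadrant. It compares the axes that an uncentered SVD finds with those found after centering.

```python
import numpy as np

# Hypothetical zero-mean cloud: long along x, thin along y.
shape = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -0.2], [0.0, 0.2]])

def axes(points):
    """Rows are the right singular vectors of the point matrix (no centering)."""
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    return vt

def unit(v):
    return v / np.linalg.norm(v)

# The same shape, moved to two places in the positive quadrant.
cloud_a = shape + np.array([5.0, 5.0])
cloud_b = shape + np.array([5.0, 1.0])
axes_a, axes_b = axes(cloud_a), axes(cloud_b)

# The first axis points (almost exactly) from the origin to the cloud's center.
align_a = abs(axes_a[0] @ unit(cloud_a.mean(axis=0)))
align_b = abs(axes_b[0] @ unit(cloud_b.mean(axis=0)))

# The second axes disagree, although the shape is identical.
second_axis_agreement = abs(axes_a[1] @ axes_b[1])

# After centering, both placements give the same leading axis.
centered_a = cloud_a - cloud_a.mean(axis=0)
centered_b = cloud_b - cloud_b.mean(axis=0)
centered_agreement = abs(axes(centered_a)[0] @ axes(centered_b)[0])

print(align_a, align_b, second_axis_agreement, centered_agreement)
```

Values near 1 mean two directions coincide. The second axes of the uncentered clouds do not coincide, while the centered clouds agree on their leading axis.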

The right way to embed a graph is to convert the adjacency matrix to one of several Laplacian matrices. This conversion has the effect of centering the cloud around the origin so that the eigendecomposition/SVD now finds the axes in which the cloud varies, and so gives you the embedding you want.
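As a minimal sketch of this conversion, the code below uses the combinatorial Laplacian L = D − A, where D is the diagonal matrix of node degrees. That choice is an assumption: it is only one of the several Laplacians available, and the normalized variants are built differently. The function name `laplacian_embedding` and the example graph are hypothetical.

```python
import numpy as np

def laplacian_embedding(adjacency, k):
    """Embed the nodes in k dimensions using the combinatorial Laplacian L = D - A."""
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A
    # eigh is for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    # Skip the first eigenvector: for a connected graph it is constant
    # (eigenvalue 0) and carries no information about structure.
    return eigenvalues, eigenvectors[:, 1:k + 1]

# Hypothetical graph: triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

eigenvalues, embedding = laplacian_embedding(A, k=1)
print(np.round(embedding[:, 0], 3))
```

In this one-dimensional embedding the two triangles fall on opposite sides of the origin, and the two bridging nodes (2 and 3), which are the best connected, sit closer to the middle than the rest.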

If you see something, say something — and we’ll ignore it

I arrived on a late evening flight at a Canadian airport that will remain nameless, and I was the second person into an otherwise deserted Customs Hall. On a chair was a cloth shoulder bag and a 10″ by 10″ by 4″ opaque plastic container. Being a good citizen, I went over to the distant Customs officers on duty and told them about it. They did absolutely nothing.

There are lessons here about predictive modelling in adversarial settings. The Customs officers were using, in their minds, a Bayesian predictor, which is the way that we, as humans, make many of our predictions. In this Bayesian predictor, the prior that the ownerless items contained explosives was very small, so the overall probability that they should act was also very small, and so they didn’t act.
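A rough sketch of the arithmetic behind such a predictor, using Bayes' rule. Every number here is invented for illustration; none comes from real data, and the function name is hypothetical.

```python
def posterior(prior, p_evidence_given_threat, p_evidence_given_benign):
    """Bayes' rule: probability of a threat given the observed evidence."""
    numerator = prior * p_evidence_given_threat
    return numerator / (numerator + (1.0 - prior) * p_evidence_given_benign)

# Illustrative assumptions only: a one-in-a-million prior that an unattended
# bag is dangerous, and evidence that is 90 times more likely under a threat.
prior = 1e-6
p = posterior(prior, p_evidence_given_threat=0.9, p_evidence_given_benign=0.01)
print(f"posterior probability of a threat: {p:.6f}")
```

Even evidence that strongly favours a threat leaves the posterior tiny when the prior is this small, which is why a predictor of this kind ends up recommending no action.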

Compare this to the predictive model used by firefighters. When a fire alarm goes off, they don’t consider a prior at all. That is, they don’t consider factors such as: a lot of new students just arrived in town, we just answered a hoax call to this location an hour ago, or anything else of the same kind. They respond regardless of whether they consider it a ‘real’ fire or not.

The challenge is how to train front-line defenders against acts of terror to use the firefighter predictive model rather than the Bayesian one. Clearly, there’s still some distance to go.

Bridging airgaps for amateurs

I’ve pointed out before that air gapping (for example, keeping military networks physically separated from the internet) is a very weak mechanism in a world where most devices have microphones and speakers. Devices can communicate using audio at frequencies humans in the room can’t hear, so real air gapping requires keeping the two networks separated by distances or soundproofing good enough to prevent this kind of covert channel. The significance of this channel is underappreciated: it’s common even in secure environments to find internet-connected devices in the same room as secure devices.

The ante has been upped a bit by Google’s introduction of Tone, a Chrome add-on that communicates via the audio channel to allow sharing of URLs, in sort of the same way that Palm Pilots used to communicate using infrared. Adapting this app to communicate even more content is surely straightforward, so even amateurs will be able to use the audio channel. Quite apart from the threat to military and intelligence systems, there are many other nasty possibilities, including exfiltrating documents and infecting with malware that can exploit this new channel. And it doesn’t help that its use is invisible (inaudible).

The introduction of LiFi, which will bring many benefits, also introduces a similar side channel when most devices have a camera and a screen.

A world in which cybersecurity is conceived of as a mechanism of walls and gates is looking increasingly obsolete when the network is everywhere, and every gate has holes in it.

Why Data Science?

Data Science has become a hot topic lately. As usual, there’s not a lot of agreement about what data science actually is. I was on a panel last week, and someone asked afterwards what the difference was between data mining, which we’ve been doing for 15 years, and data science.

It’s a good question. Data science is a new way of framing the scientific enterprise in which a priori hypothesis creation is replaced by inductive modelling; and this is exactly what data mining/knowledge discovery is about (as I’ve been telling my students for a decade).

What’s changed, perhaps, is that scientists in many different areas have realised the existence and potential of this approach, and are commandeering it for their own.

I’ve included the slides from a recent talk I gave on this subject (at the University of Technology Sydney).

And once again let me emphasise that the social sciences and humanities did not really have access to the Enlightenment model of doing science (because they couldn’t do controlled experiments), but they certainly do to the new model. So expect a huge development in data social science and data humanities as soon as research students with the required computational skills move into academia in quantity.

Why data science (ppt slides)

Radicalization — it’s just a phase he’s going through

All of the discussion of radicalization in the past few weeks seems to assume that it’s a one-way process.

But if it’s a process with a large personality component (and evidence suggests it is); and if it’s a phenomenon associated with adolescence and young adulthood (which are times of attitudinal change anyway); and if the data fit models of infection by disease (and they do), then it seems plausible that, for many people, radicalization is a phase they go through. Such people will not be obtrusive because they never act on their (temporary) beliefs, and eventually cease to hold them. If radicalization can be a temporary phenomenon, then there’s de-radicalization, but there’s also post-radicalization; the first extrinsic, but the second intrinsic.

What’s the practical relevance? If some people “get over” their radicalization, then it argues for more gentle responses during the infected period. Actions such as interviews by security services with radicalized individuals and their relatives (a practice of MI5, and soon to be possible in Canada via bill C-51), and pulling passports may indeed have negative consequences if they make infected individuals become more entrenched (and less likely to become cured).

Of course, there are risks to a more gentle intervention strategy (and government departments are allergic to risks). But, for countries with exit controls, perhaps it’s better to rely on these than to act more explicitly; and at least the discussions about strategy should keep the possibility of cure in mind.

