Why Data Science?

Data Science has become a hot topic lately. As usual, there’s not a lot of agreement about what data science actually is. I was on a panel last week, and someone asked afterwards what the difference was between data mining, which we’ve been doing for 15 years, and data science.

It’s a good question. Data science is a new way of framing the scientific enterprise in which a priori hypothesis creation is replaced by inductive modelling; and this is exactly what data mining/knowledge discovery is about (as I’ve been telling my students for a decade).

What’s changed, perhaps, is that scientists in many different areas have realised the existence and potential of this approach, and are commandeering it for their own.

I’ve included the slides from a recent talk I gave on this subject (at the University of Technology Sydney).

And once again let me emphasise that the social sciences and humanities did not really have access to the Enlightenment model of doing science (because they couldn’t do controlled experiments), but they certainly do to the new model. So expect a huge development in data social science and data humanities as soon as research students with the required computational skills move into academia in quantity.

Why data science (ppt slides)

Pull from data versus push to analyst

One of the most striking things about the discussion of the NSA data collection that Snowden has made more widely known is the extent to which the paradigm for its use is database oriented. Both the media and, more surprisingly, the senior administrators talk only about using the data as a repository: “if we find a cell phone in Afghanistan we can look to see which numbers in the US it has been calling and who those numbers in turn call” has been the canonical justification. In other words, the model is: collect the data and then have analysts query it as needed.

The essence of data mining/knowledge discovery is exactly the opposite: allow the data to actively and inductively generate models, each with an associated quality score, and use analysts to determine which of these models is truly plausible and then useful. In other words, rather than having analysts create models in their heads and then use queries to see if they are plausible (a “pull” model), algorithms generate models inductively and present them to analysts (a “push” model). Since getting analysts to creatively think of reasonable models is difficult (and suffers from the “failure of imagination” problem), the inductive approach is both cheaper and more effective.
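To make the contrast concrete, here is a minimal sketch in Python. The call-metadata columns are invented, and I’ve used an off-the-shelf anomaly detector (scikit-learn’s IsolationForest) purely as a stand-in for whatever an operational system would actually run; the point is the shape of the two workflows, not the particular algorithm.

```python
# "Pull": the analyst supplies a hypothesis (a specific number) and queries for it.
# "Push": an inductive model scores every number and surfaces the strangest ones,
# each with a quality (anomaly) score, for an analyst to judge.
# Columns and values below are invented for illustration.

import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical per-phone-number features aggregated from call metadata.
calls = pd.DataFrame({
    "number":           ["A", "B", "C", "D", "E"],
    "calls_per_day":    [12, 9, 11, 250, 10],
    "distinct_callees": [8, 6, 7, 240, 9],
    "median_call_secs": [95, 110, 88, 4, 102],
})

# Pull model: the analyst already suspects number "D" and asks about it.
pulled = calls[calls["number"] == "D"]
print(pulled)

# Push model: the algorithm ranks all numbers by how unusual their
# calling pattern is, with no hypothesis supplied in advance.
features = calls[["calls_per_day", "distinct_callees", "median_call_secs"]]
model = IsolationForest(contamination="auto", random_state=0).fit(features)
calls["anomaly_score"] = -model.score_samples(features)   # higher = stranger
print(calls.sort_values("anomaly_score", ascending=False).head(3))
```

In the pull half, nothing is found unless someone thinks to ask; in the push half, the strangest numbers arrive at the analyst’s desk already ranked, and the analyst’s job is to judge them.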

For example, given the collection of metadata about which phone numbers call which others, it’s possible to build systems that produce results of the form: here’s a set of phone numbers whose calling patterns are unlike any others (in the whole 500 million node graph of phones). Such a calling pattern might not represent something bad, but it’s usually worth a look. The phone companies themselves do some of this kind of analysis, for example to detect phones that are really business lines but are claiming to be residential and, in the days when long distance was expensive, to detect the same scammers moving across different phone numbers.
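Here is a rough sketch of that idea at toy scale: a tiny invented call graph, a couple of per-number structural features, and a deliberately crude outlier score. A real system over hundreds of millions of numbers needs far more sophisticated features and scoring, but the shape of the computation is the same.

```python
# Hypothetical sketch: derive graph features for each phone number from
# who-calls-whom metadata, then flag numbers whose local calling structure
# is unlike everyone else's. The edge list and scoring rule are illustrative only.

import networkx as nx
import numpy as np

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a"),
         ("x", "p"), ("x", "q"), ("x", "r"), ("x", "s"), ("x", "t")]  # "x" fans out oddly
G = nx.Graph(edges)

nodes = list(G.nodes)
degree = np.array([G.degree(n) for n in nodes], dtype=float)
clustering = np.array([nx.clustering(G, n) for n in nodes])

# Crude outlier score: how far each node's (degree, clustering) pair sits
# from the population mean, in standard deviations.
feats = np.column_stack([degree, clustering])
z = np.abs((feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-9))
score = z.max(axis=1)

# The most structurally unusual numbers get pushed forward for a human look.
for n, s in sorted(zip(nodes, score), key=lambda t: -t[1])[:3]:
    print(n, round(float(s), 2))
```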

I would hope that inductive model building is being used on the collected data, and that the higher-ups in the NSA either don’t really understand it or are being cagey. But I’ve talked to a lot of people in government who collect data at large scale but are completely stuck in the database model, and have no inkling of inductive modelling.

The Analysis Chasm

I’ve recently heard a couple of government people (in different countries) complain about the way in which intelligence analysis is conceptualized, and so how intelligence organizations are constructed. There are two big problems:

1.  “Intelligence analysts” don’t usually interact with datasets directly, but rather via “data analysts”, who aren’t considered “real” analysts. I’m told that, at least in Canada, you have to have a social science degree to be an intelligence analyst. Unsurprisingly (at least for now), people with this background don’t have much feel for big data or for what can be learned from it. Intelligence analysts tend to treat the aggregate of the datasets and the data analysts as a large black box, and use it as a form of Go Fish. In other words, intelligence analysts ask data analysts “Have we seen one of these?”; the data analysts search the datasets and the models built from them, and write a report giving the answer. The data analyst doesn’t know why the question was asked, and so cannot write the more helpful report that would be possible with some knowledge of the context. Neither side is getting as much benefit from the data as they could, and it’s mostly because of a separation of roles that developed historically but now makes little sense.

2. Intelligence analysts, and many data analysts, don’t understand inductive modelling from data. It’s not that they lack the technical knowledge (although they usually do); they lack the conceptual mindset to understand that data can push models to analysts: “Here’s something that’s anomalous and may be important”; “Here’s something that only occurs a few times in a dataset where all behaviour should be typical and so highly repetitive”; “Here’s something that has changed since yesterday in a way that nothing else has” (a toy sketch of this last case follows the list). Data systems that do inductive modelling don’t have to wait for an analyst to think “Maybe this is happening”. The role of an analyst changes from being the person who has to think up hypotheses to being the person who has to judge hypotheses for plausibility. The first task is something humans aren’t especially good at: it requires imagination, which tends to disappear in a crisis or under pressure. The second task is easier, although not something we’re necessarily perfect at.
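The promised toy sketch of the “changed since yesterday” case, with invented counts and a deliberately simple robust score; the point is only that the system, not the analyst, nominates what to look at.

```python
# Compare each entity's activity today against yesterday and push forward
# the ones whose change stands apart from everyone else's, rather than
# waiting for an analyst to ask about them. Entity names and counts are invented.

import pandas as pd

yesterday = pd.Series({"ent1": 100, "ent2": 52, "ent3": 75, "ent4": 60})
today     = pd.Series({"ent1": 104, "ent2": 49, "ent3": 310, "ent4": 63})

change = (today - yesterday).abs()
typical = change.median()
spread = (change - typical).abs().median() + 1e-9   # robust scale (median absolute deviation)
oddness = (change - typical).abs() / spread

# Entities whose day-over-day change is far outside the norm get pushed to an analyst.
print(oddness.sort_values(ascending=False).head(2))
```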

There simply is no path for inductive models from data to get to intelligence analysts in most organizations today. It’s difficult enough to get data analysts to appreciate the possibilities; getting models across the chasm, unsolicited, to intelligence analysts is (to coin a phrase) a bridge too far.

Addressing both of these problems requires a fairly revolutionary redesign of the way intelligence analysis is done, and an equally large change in the kind of education that analysts receive. And it really is a different kind of education, not just a kind of training, because inductive modelling from data seems to require a mindset change, not the supply of some missing mental information. Until such changes are made, most intelligence organizations are fighting with one and a half arms tied behind their collective backs.

Adversarial knowledge discovery is not just knowledge discovery with classified data

Someone made a comment to me this week implying that data mining/knowledge discovery in classified settings was a straightforward problem because the algorithms didn’t have to be classified, just the datasets.

This view isn’t accurate, for the following reason: mainstream/ordinary knowledge discovery builds models inductively from data by maximizing the fit between the model and the available data. In an adversarial setting, this creates problems: adversaries can make a good guess about what your model will look like, and so can do a better job of hiding their records (making them less obvious or detectable); and they can also tell exactly what kinds of data they should try to force you to use if/when you retrain your model.

A simple example: the two leading-edge predictors today are support vector machines and random forests and, in mainstream settings, there’s often little to choose between them in prediction performance. In an adversarial setting, however, the difference is huge: support vector machines are relatively easy to game, because even one record that’s in the wrong place, or mislabelled, can change the entire orientation of the decision surface. To make it even worse, the way in which the boundary changes is roughly predictable from the kind of manipulation made. (You can see how this works in J. G. Dutrisac and D. B. Skillicorn, Subverting prediction in adversarial settings, ISI 2008, pp. 19–24.) Random forests, on the other hand, are much more robust predictors, partly because their ensemble structure, with its inherent randomness ‘inside’ the algorithm, makes it hard to force particular behaviour.
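Here is a toy illustration of the difference (not the experiment from the paper): a well-separated two-class dataset, one deliberately mislabelled record added, and a comparison of how much a linear SVM’s boundary rotates versus how much a random forest’s predictions change over the same region. The data and the single poisoned point are invented.

```python
# Add one mislabelled record and compare: how far does the linear SVM's
# decision boundary rotate, and on what fraction of the region does the
# random forest change its predictions?

import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([3, 3], 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Poisoned copy: a single record placed off to one side and given the wrong label.
X_poisoned = np.vstack([X, [[0.0, -2.0]]])
y_poisoned = np.append(y, 1)

def svm_direction(X, y):
    # Unit normal of the linear SVM's decision surface.
    w = SVC(kernel="linear", C=1e3).fit(X, y).coef_[0]
    return w / np.linalg.norm(w)

# Angle between the clean and poisoned SVM boundary normals.
d_clean, d_pois = svm_direction(X, y), svm_direction(X_poisoned, y_poisoned)
angle = np.degrees(np.arccos(np.clip(d_clean @ d_pois, -1, 1)))
print(f"SVM boundary rotated by ~{angle:.1f} degrees")

# Fraction of a grid over the data region on which the random forest changes its mind.
grid = np.column_stack([g.ravel() for g in
                        np.meshgrid(np.linspace(-2, 5, 60), np.linspace(-2, 5, 60))])
rf_clean = RandomForestClassifier(random_state=0).fit(X, y).predict(grid)
rf_pois  = RandomForestClassifier(random_state=0).fit(X_poisoned, y_poisoned).predict(grid)
print(f"Random forest changed on {np.mean(rf_clean != rf_pois):.1%} of the grid")
```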

The same thing happens with clustering algorithms. Algorithms such as Expectation-Maximization are easily misled by a few carefully chosen records; and no mainstream clustering technique does a good job of finding the kinds of ‘fringe’ clusters that occur when something unusual is present in the data but adversaries have tried hard to make it look as usual as possible.
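A toy sketch of the clustering point, using scikit-learn’s EM-based Gaussian mixture and a handful of invented, carefully placed records: a small number of planted points is enough to drag a component’s mean away from where the genuine cluster sits.

```python
# Fit a two-component Gaussian mixture (estimated by EM) before and after
# planting five identical far-out records, and compare the component means.
# All data here are invented for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
clean = np.vstack([rng.normal([0, 0], 0.4, (100, 2)),
                   rng.normal([5, 5], 0.4, (100, 2))])

# Five carefully placed records, far out along one direction.
planted = np.tile([[12.0, 0.0]], (5, 1))
poisoned = np.vstack([clean, planted])

means_clean = GaussianMixture(n_components=2, random_state=0).fit(clean).means_
means_poisoned = GaussianMixture(n_components=2, random_state=0).fit(poisoned).means_
print("clean component means:\n", np.round(means_clean, 2))
print("poisoned component means:\n", np.round(means_poisoned, 2))
```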

In fact, adversarial knowledge discovery requires building the entire framework of knowledge discovery over again, taking into account the adversarial nature of the problem from the very beginning. Some parts of mainstream knowledge discovery can be used directly; others with some adaptation; and others can’t safely be used. There are also some areas where we don’t know very much about how to solve problems that aren’t interesting in the mainstream, but are critical in the adversarial domain.