Posts Tagged 'knowledge discovery'

Advances in Social Network Analysis and Mining Conference — Sydney

The conference will be held in Sydney from 31st July to 3rd August 2017.

http://asonam.cpsc.ucalgary.ca/2017/

As well as the main conference, there is also a workshop, FOSINT: Foundations of Open Source Intelligence, which may be of even more direct interest to readers of this blog.

I will also be giving a tutorial on Adversarial Analytics as part of the conference.

“But I don’t have anything to hide” Part II

In an earlier post, I pointed out that differential pricing — the ability of businesses to charge different people different prices — is one of the killer apps of data analytics. Since the goal of businesses will be to charge everyone the most they’re willing to pay, there are strong reasons why we might not want to be modelled this way, even if “we have nothing to hide”.
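To make the mechanism concrete, here is a minimal sketch of what such a pricing pipeline might look like. Everything in it is an illustrative assumption: the features, the model choice, and the numbers are invented, and no retailer has shown me their code. The point is how little machinery it takes: a regression fitted to the prices customers have accepted before, and a quote set just under each new prediction.

```python
# Hypothetical sketch of differential pricing; the features, model
# choice, and numbers are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row describes one past customer:
# [estimated income, highest price previously paid, browsing intensity]
X_train = np.array([[55000, 40.0, 3.1],
                    [82000, 65.0, 7.4],
                    [39000, 22.0, 1.2],
                    [67000, 50.0, 5.0]])
y_train = np.array([42.0, 70.0, 25.0, 55.0])  # prices they accepted

wtp_model = LinearRegression().fit(X_train, y_train)

def quote(customer, margin=0.95):
    """Quote just under the customer's predicted willingness to pay."""
    predicted_wtp = wtp_model.predict(np.array([customer]))[0]
    return round(margin * predicted_wtp, 2)

# The identical product, priced per person:
print(quote([61000, 48.0, 4.0]))
print(quote([90000, 75.0, 8.0]))
```

A real system would be vastly more elaborate, but the adversarial shape is already visible: the better the model of you, the closer the price sits to your ceiling.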

There’s a second area in which models of everyone might feel like a bad thing — healthcare. As the costs of healthcare rise, there will be strong arguments that individuals need to be participants in maintaining their health. This sounds good — but when big data collection means that a health care provider might charge more to someone who has been smoking (plausible case), overeating (hmm), not exercising enough (hmmm), or making any of a number of other lifestyle choices, it begins to be uncomfortable, even if “we have nothing to hide”.

The point of much data analytics by organisations and states is to assess the cost/benefit ratio of interacting with you. Of course, humans (and their organisations) have always made such assessments. What is new is the ability to do so in a fine-grained, pervasive, and never-removable way. What will be lost is the possibility of a fresh start, of redemption, or even of convenient amnesia about aspects of the past.

“But I don’t have anything to hide”

This is the common response of many ordinary people when the discussion of (especially) government surveillance programs comes up. And they’re right, up to a point. In a perfect world, innocent people have nothing to fear from government.

The bigger problem, in fact, comes from the data collected and the models built by multinational businesses. Everyone has something to hide from them: the bottom line prices we are willing to pay.

We have not yet quite reached the world of differential pricing. We’ve become accustomed to the idea that the person sitting next to us on a plane may have paid (much) less for the identical travel experience, but we haven’t quite become reconciled to the idea that an online retailer might be charging us more for the same product than they charge other people, let alone that the chocolate bar at the corner store might be more expensive for us. If anything, we’re inclined to think that an organisation that has lots of data about us and has built a detailed model of us might give us a better price.

But it doesn’t require too much prescience to see that this isn’t always going to be the case. The seller’s slogan has always been “all that the market will bear”.

Any commercial organisation, under the banner of customer relationship management, is building a model of your predicted net future value. Their actions towards you are driven by how large this is. Any benefits and discounts you get now are based on the expectation that, over the long haul, they will reap the converse benefits and more. It’s inherently an adversarial relationship.
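For readers who haven’t met it, the arithmetic behind “predicted net future value” is not mysterious. A toy version is just a discounted sum of expected future margins, weighted by the probability that you are still a customer; the retention rate, margin, and discount rate below are invented for illustration.

```python
# Toy version of predicted net future value (customer lifetime value).
# Retention rate, margin, and discount rate are invented for illustration.
def net_future_value(annual_margin=120.0, retention=0.80,
                     discount=0.10, horizon_years=10):
    """Sum of discounted expected margins, weighted by survival probability."""
    return sum(
        annual_margin * (retention ** t) / ((1 + discount) ** t)
        for t in range(1, horizon_years + 1)
    )

# Roughly 307 in this toy case: the ceiling on what benefits and
# discounts it is rational to offer you now.
print(round(net_future_value(), 2))
```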

Now think about the impact of data collection and modelling, especially with the realization that everything collected is there for ever. There’s no possibility of an economic fresh start, no bankruptcy of models that will wipe the slate clean and let you start again.

Negotiation relies on the property that each party holds back their actual bottom line. In a world where your bottom line is probably better known to the entity you’re negotiating with than it is to you, can you ever win? Or even win-win? Now tell me that you have nothing to hide.

[And, in the ongoing discussion of post-Snowden government surveillance, there’s still this enormous blind spot about the fact that multinational businesses collect electronic communication, content and metadata; location; every action on portable devices and some laptops; complete browsing and search histories; and audio around any of these devices. And they’re processing it all extremely hard.]

Time to build an artificial intelligence (well, try anyway)

NASA is a divided organisation, with one part running uncrewed missions, and the other part running crewed missions (and, apparently, internal competition between these parts is fierce). There’s no question that the uncrewed-mission part is advancing both science and solar system exploration, and so there are strong and direct arguments for continuing in this direction. The crewed-mission part is harder to make a case for; the ISS spends most of its time just staying up there, and has been reduced to running high-school science experiments. (Nevertheless, there is a strong visceral feeling that humans ought to continue to go into space, which is driving the various proposals for Mars.)

But there’s one almost invisible but extremely important payoff from crewed missions (no, it’s not Tang!): reliability. You only have to consider the difference in the level of response between last week’s failed liftoff of a supply rocket to the ISS and the accidents involving human deaths. When an uncrewed rocket fails, there’s an engineering investigation leading to a significant improvement in the vehicle or processes. When a crewed rocket fails, it triggers a phase shift in the whole of NASA, something far more searching. The Grissom/White/Chaffee accident triggered a complete redesign of the way life support systems worked; the Challenger disaster caused the entire launch approval process to be rethought; and the Columbia disaster led to a new regime of inspections after reaching orbit. Crewed missions cause a totally different level of attention to safety, which in turn leads to a new level of attention to reliability.

Some changes only get made when there’s a really big, compelling motivation.

Those of us who have been flying across oceans for a while cannot help but be aware that 4-engined aircraft have almost entirely been replaced by 2-engined aircraft; and that spare engines strapped underneath the wing, being ferried out to replace an engine that failed at some remote location, are never seen any more either. This is just one area where increased reliability has paid off. Cars routinely last for hundreds of thousands of kilometres, which only a few high-end brands used to do forty years ago. There are, of course, many explanations, but the crewed space program is part of the story.

What does this have to do with analytics? I think the time is right to try to build an artificial intelligence. The field known as “artificial intelligence” or machine learning or data mining or knowledge discovery has been notorious for overpromising and underdelivering since at least the 1950s, so I need to be clear: I don’t think we actually know enough to build an artificial intelligence — but I think it has potential as an organising principle for a large amount of research that is otherwise unfocused.

One reason why I think this might be the right time is that deep learning has shown how to solve (at least potentially) some of the roadblock problems (although I don’t know that the neural network bias in much deep learning work is necessary). Consciousness, which was considered an important part of intelligence, has turned out to be a bit of a red herring. Of course, we can’t tell what the next hard part of the problem will turn out to be.

The starting problem is this: how can an algorithm identify when it is “stuck” and “go meta”? (Yes, I know that computability tells us there isn’t a general solution, but we, as humans, do this all the time — so how can we encode this mechanism algorithmically, at least as an engineering strategy?) Any level of success here would pay off immediately: today’s analytics platforms are really good at building models, and even assessing them, but hopeless at knowing when to revise them.
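To be concrete about what I mean by “going meta”, here is a strawman, nothing more than an engineering sketch with an invented stall threshold and an invented escalation ladder: the system watches its own rate of improvement and, when progress stalls, moves up a level, from tuning parameters, to changing model family, to questioning the representation of the data itself.

```python
# Strawman of a "go meta" loop; the patience threshold and the
# escalation ladder are invented for illustration.
import random

STRATEGIES = ["tune hyperparameters", "change model family",
              "re-represent the data", "question the task itself"]

def improve(score_fn, patience=3, min_gain=0.01):
    best, stalled, level = float("-inf"), 0, 0
    while level < len(STRATEGIES):
        score = score_fn(STRATEGIES[level])    # one attempt at this level
        if score > best + min_gain:
            best, stalled = score, 0           # still making progress
        else:
            stalled += 1                       # no meaningful gain
        if stalled >= patience:                # "stuck": go meta
            level, stalled = level + 1, 0
    return best

# Toy stand-in for model quality; a real score_fn would build and
# evaluate an actual model under the named strategy.
print(improve(lambda strategy: random.random()))
```

The hard research question is, of course, everything this sketch waves away: where the levels come from, and how a system could invent new ones.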

Refining “Data Science”

Regular readers will know that I have been thinking about the constellation of ideas that is getting a lot of play in universities and the research community around words like “data science” and “big data”, and especially the intersection of these ideas with the other constellation of “data mining”, “knowledge discovery”, and “machine learning”.

I’ve argued that inductive model discovery (which I think is the core of all of these ideas) is a new way of doing science that is rapidly replacing the conventional Enlightenment or Popperian view of science. This is happening especially quickly in fields that have struggled to apply the conventional scientific method: medicine, the social “sciences”, business schools, and the humanities.

Attending the International Conference on Computational Science made me realise, however, that computational science is a part of this story as well.

Here’s how I see the connections between these three epistemologies:

  1. Conventional science. Understand systems via controlled experiments: setting up configurations that differ in only a few managed ways and seeing whether those differences correspond to different system behaviours. If they do, construct an “explanation”; if they don’t, it’s back to the drawing board.
  2. Computational science. Understand systems by building simulations of them and tweaking the simulations to see if the differences are those that are expected from the tweaks. (Simulations increase the range of systems that can be investigated when either the tweaks can’t be done on the real system, or when the system is hypothesised to be emergent from some simpler pieces.)
  3. Data science. Understand systems by looking at the different configurations that naturally occur and seeing how these correspond to different system behaviours. When they do, construct an “explanation”.

In other words, conventional science pokes the system being investigated in careful ways and sees how it reacts; computational science creates a replica of the system and pokes that; and data science looks at the system being poked and tries to match the reactions to the poking.

Underlying these differences in approach are also, of course, differences in validation: how one tells whether an explanation is sufficient. The first two both start from a hypothesis and use statistical machinery to decide whether the hypothesis is supported sufficiently strongly. The difference is that computational science has more flexibility to set up controlled experiments and so, all things considered, can get stronger evidence. (But there is always the larger question of whether the simulation actually reproduces the system of interest — critical, but often ignored, and with huge risks of “unknown unknowns”.) Data science, in contrast, validates its models of the system being studied by approaches such as the use of a test set: a portion of the data that was not used to build the model, but which should behave as the original system did. It is also buttressed by the ability to generate multiple models and compare among them.
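The test-set machinery in the third column is mechanically simple; here is a minimal sketch using scikit-learn, with a stock dataset and two stock models standing in as placeholders.

```python
# Minimal sketch of data-science-style validation: hold out a test set,
# then compare several candidate models on it. The dataset and the two
# models are placeholders, not recommendations.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for model in (LogisticRegression(max_iter=5000), RandomForestClassifier()):
    model.fit(X_train, y_train)
    # The held-out data stands in for "the system behaving as it did".
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```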

Data science is advancing on two fronts: first, the flexibility it provides to conventional science not to have to construct carefully balanced controlled experiments; second, and much more significantly, the opportunity it creates for making scientific progress in the social sciences and humanities, replacing “qualitative” by “quantitative” in unprecedented ways.

Why Data Science?

Data Science has become a hot topic lately. As usual, there’s not a lot of agreement about what data science actually is. I was on a panel last week, and someone asked afterwards what the difference was between data mining, which we’ve been doing for 15 years, and data science.

It’s a good question. Data science is a new way of framing the scientific enterprise in which a priori hypothesis creation is replaced by inductive modelling; and this is exactly what data mining/knowledge discovery is about (as I’ve been telling my students for a decade).

What’s changed, perhaps, is that scientists in many different areas have realised the existence and potential of this approach, and are commandeering it for their own.

I’ve included the slides from a recent talk I gave on this subject (at the University of Technology Sydney).

And once again let me emphasise that the social sciences and humanities did not really have access to the Enlightenment model of doing science (because they couldn’t do controlled experiments), but they certainly do have access to the new model. So expect a huge development in data social science and data humanities as soon as research students with the required computational skills move into academia in quantity.

Why data science (ppt slides)

More subtle lessons from the Sony hack

There are some obvious lessons to learn from the Sony hack: perimeter defence isn’t much use when the perimeter has thousands of gates in it (it looks as if the starting point was a straightforward spearphishing attack); and if you don’t compartmentalise your system inside the perimeter, then anyone who gets past it has access to everything.

But the less obvious lesson has to do with the difference between our human perception of the difficulties of de-anonymization and aggregation, and the actual power of analytics to handle both. For example, presumably Sony kept data on their employees’ health in properly protected, HIPAA-compliant storage — but there were occasional emails that mentioned individuals and their health status. The people sending these emails presumably didn’t feel as if any particular one was a breach of privacy — the private content in each one was small. But they failed to realise that all of these emails get aggregated, at least in backups. So now all of those little bits of information are in one place, and the risk of building significant models from them has increased substantially.
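A sketch of why the aggregation matters (the email snippets and the crude pattern below are invented for illustration): each message discloses almost nothing, but pooled in one archive they yield per-person profiles with almost no analytic effort.

```python
# Sketch: individually innocuous mentions, pooled across a mail archive,
# aggregate into per-person health profiles. The snippets and the crude
# pattern are invented for illustration.
import re
from collections import defaultdict

archive = [
    "Alice will be out on Tuesday for her chemo appointment.",
    "Can someone cover for Alice? Treatment again this week.",
    "Bob mentioned his back surgery is scheduled for May.",
]

HEALTH_TERMS = re.compile(r"\b(chemo|treatment|surgery|diagnosis)\b", re.I)

profiles = defaultdict(list)
for message in archive:
    for name in ("Alice", "Bob"):            # known employee names
        if name in message and HEALTH_TERMS.search(message):
            profiles[name].append(message)

for person, mentions in profiles.items():
    print(person, "->", len(mentions), "health-related mentions")
```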

Anyone with analytic experience and access to a large number of emails can find structures that are decidedly non-obvious; but this is far from intuitive to the public at large, and hence to Sony executives.

We need to learn to value data better, and to understand in a deep way that the value of data increases superlinearly with the amount that is collected into a single coherent unit.