Posts Tagged 'data analytics'

The growing role of data curation

My view of Data Science, or Big Data if you prefer, is that it divides naturally into three different subfields:

  1. Data curation, which involves focusing on the issues of managing large amounts of heterogeneous data, but is primarily concerned about provenance, that is tracking the metadata about the data.
  2. Computational science, which builds models of the real-world inside computer systems to study their properties.
  3. Analytics, which infers the properties of systems based on data about them.

I’ve posted about these ideas previously (https://skillicorn.wordpress.com/2015/05/09/why-data-science/),

Data curation might have seemed like the poor cousin among these three, and certainly gets the least funding and attention.

But issues of provenance have suddenly become mainstream as everyone on the web struggles to figure out what to do about fake news stories. So far, the Internet has not really addressed the issues of metadata. Most of the big content providers know who generated the content that they create and distribute, but they don’t necessarily make this information known or available for those who read the content to leverage. It’s time for the data curation experts, who tend to come from information systems and library science, to step up.

Data curation is also about to become the front line in cyberattack. As I’ve suggested (Skillicorn, DB, Leuprecht, C, and Tait, V. 2016. Beyond the Castle Model of Cybersecurity.  Government Information Quarterly.), a natural cyberdefence strategy is replication. Data exfiltration is made much more difficult if there many, superficially similar, versions of any document or data that might be a target. However, progress in assigning provenance becomes the cyberattack that matches this cyber defence.

So here’s the research question for data curation: how can I tell, from the internal evidence, and partial external evidence, whether this particular document is legitimate (or is the legitimate version of a set of almost-replicates)?

Advertisements

“But I don’t have anything to hide” Part II

In an earlier post, I pointed out that differential pricing — the ability of businesses to charge different people different prices — is one of the killer apps of data analytics. Since the goal of businesses will be to charge everyone the most they’re willing to pay, there are strong reasons why we might not want to be modelled this way, even if “we have nothing to hide”

There’s a second area in which models of everyone might feel like a bad thing — healthcare. As the costs of healthcare rise, there will be strong arguments that individuals need to be participants in maintaining their health. This sounds good — but when big data collection means that a health case provider might charge more to someone who has been smoking (plausible case), overeating (hmm), not exercising enough (hmmm), or any of a number of other lifestyle choices, it begins to be uncomfortable, even if “we have nothing to hide”.

The point of much data analytics by organisations and states is to assess the cost/benefit ratio of interacting with you. Of course, humans (and their organisations) have always made such assessments. What is new is the ability to do so in a fine-grained, pervasive, and never-removable way. What will be lost is the possibility of a fresh start, of redemption, or even of convenient amnesia about aspects of the past.

Three kinds of knowledge discovery

I’ve always made a distinction between “mainstream” data mining (or knowledge discovery or data analytics) and “adversarial” data mining — they require quite distinct approaches and algorithms. But my work with bioinformatic datasets has made me realise that there are more of these differences, and the differences go deeper than people generally understand. That may be part of the reason why some kinds of data mining are running into performance and applicability brick walls.

So here are 3 distinct kinds of data mining, with some thoughts about what makes them different:

1. Modelling natural/physical, that is clockwork, systems.
Such systems are characterised by apparent complexity, but underlying simplicity (the laws of physics). Such systems are entropy minimising everywhere. Even though parts of such systems can look extremely complex (think surface of a neutron star), the underlying system to be modelled must be simpler than its appearances would, at first glance, suggest.

What are the implications for modelling? Some data records will always be more interesting or significant than others — for most physical systems, records describing the status of deep space are much less interesting than those near a star or planet. So there are issues around the way data is sampled.
Some attributes will also be more interesting or significant than others — but, and here’s the crucial point, this significance is a global property. It’s possible to have irrelevant or uninteresting attributes, but these attributes are similarly uninteresting everywhere. Thus is makes sense to use attribute selection as part of the modelling process.

Because the underlying system is simpler than its appearance suggests, there is a bias towards simple models. In other words, physical systems are the domain of Occam’s Razor.

2. Living systems.
Such systems are characterised by apparent simplicity, but underlying complexity (at least relatively speaking). In other words, most living systems are really complicated underneath, but their appearances often conceal this complexity. It isn’t obvious to me why this should be so, and I haven’t come across much discussion about it — but living systems are full of what computing people call encapsulation, putting parts of systems into boxes with constrained interfaces to the outside.

One big example where this matters, and is starting to cause substantial problems for data mining, is the way diseases work. Most diseases are complex activities in the organism that has the disease, and their precise working out often depends on the genotype and phenotype of that organism as well as of the diseases themselves. In other words, a disease like influenza is a collaborative effort between the virus and the organism that has the flu — but it’s still possible to diagnose the disease because of large-scale regularities that we call symptoms.
It follows that, between the underlying complexity of disease, genotype, and phenotype, and the outward appearances of symptoms, or even RNA concentrations measured by microarrays, there must be substantial “bottlenecks” that reduce the underlying complexity. Our lack of understanding of these bottlenecks has made personalised medicine a much more elusive target than it seemed to be a decade ago. Systems involving living things are full of these bottlenecks that reduce the apparent complexity: species, psychology, language.

All of this has implications for data mining of systems involving living things, most of which have been ignored. First, the appropriate target for modelling should be these bottlenecks because this is where such systems “make the most sense”; but we don’t know where the bottlenecks are, that is which part of the system (which level of abstraction) should be modelled. In general, this means we don’t know how to guess the appropriate complexity of model to fit with the system. (And the model should usually be much more complex than we expect — in neurology, one of the difficult lessons has been that the human brain isn’t divided into nice functional building blocks; rather it is filled with “hacks”. So is a cell.)

Because systems involving living things are locally entropy reducing, different parts of the system play qualitatively different roles. Thus some data records are qualitatively of different significance to others, so the implicit sampling involved in collecting a dataset is much more difficult, but much more critical, than for clockwork systems.

Also, because different parts of the system are so different, the attributes relevant to modelling each part of the system will also tend to be different. Hence, we expect that biclustering will play an important role in modelling living systems. (Attribute selection may also still play a role, but only to remove globally uninteresting attributes; and this should probably be done with extreme caution.)

Systems of living things can also be said to have competing interests, even though these interests are not conscious. Thus such systems may involve communication and some kind of “social” interaction — which introduces a new kind of complexity: non-local entropy reduction. It’s not clear (to me at least) what this means for modelling, but it must mean that it’s easy to fall into a trap of using models that are too simple and too monolithic.

3. Human systems.
Human systems, of course, are also systems involving living things, but the big new feature is the presence of consciousness. Indeed, in settings where humans are involved but their actions and interactions are not conscious, models of the previous kind will suffice.

Systems involving conscious humans are locally and non-locally entropy reducing, but there are two extra feedback loops: (1) the loop within the mind of each actor which causes changes in behaviour because of modelling other actors and themself (the kind of thing that leads to “I know that he knows that I know that … so I’ll …); (2) the feedback loop between actors and data miners.

The first feedback loop creates two processes that must be considered in the modelling:
a. Self-consciousness, which generates, for example, purpose tremor;
b. Social consciousness, which generates, for example, strong signals from deception.

The second feedback loop creates two other processes:
a. Concealment, the intent or action of actors hiding some attributes or records from the modelling;
b. Manipulation, the deliberate attempt to change the outcomes of any analysis that might be applied.

I argue that all data mining involving humans has an adversarial component, because the interests of those being modelled never run exactly with each other, or with those doing the modelling, and so all of these processes must be considered whenever modelling of human systems is done. (You can find much more on this topic by reading back in the blog.)

But one obvious effect is that records and attributes need to have metadata associated with them that carries information about properties such as uncertainty or trustworthiness. Physical systems and living systems might mislead you, but only with your implicit connivance or misunderstanding; systems involving other humans can mislead you either with intent or as a side-effect of misleading someone else.

As I’ve written about before, systems where actors may be trying to conceal or manipulate require care in choosing modelling techniques so as not to be misled. On the other hand, when actors are self-conscious or socially conscious they often generate signals that can help the modelling. However, a complete way of accounting for issues such as trust at the datum level has still to be designed.