Posts Tagged 'computational science'

The growing role of data curation

My view of Data Science, or Big Data if you prefer, is that it divides naturally into three different subfields:

  1. Data curation, which involves focusing on the issues of managing large amounts of heterogeneous data, but is primarily concerned about provenance, that is tracking the metadata about the data.
  2. Computational science, which builds models of the real-world inside computer systems to study their properties.
  3. Analytics, which infers the properties of systems based on data about them.

I’ve posted about these ideas previously (https://skillicorn.wordpress.com/2015/05/09/why-data-science/),

Data curation might have seemed like the poor cousin among these three, and certainly gets the least funding and attention.

But issues of provenance have suddenly become mainstream as everyone on the web struggles to figure out what to do about fake news stories. So far, the Internet has not really addressed the issues of metadata. Most of the big content providers know who generated the content that they create and distribute, but they don’t necessarily make this information known or available for those who read the content to leverage. It’s time for the data curation experts, who tend to come from information systems and library science, to step up.

Data curation is also about to become the front line in cyberattack. As I’ve suggested (Skillicorn, DB, Leuprecht, C, and Tait, V. 2016. Beyond the Castle Model of Cybersecurity.  Government Information Quarterly.), a natural cyberdefence strategy is replication. Data exfiltration is made much more difficult if there many, superficially similar, versions of any document or data that might be a target. However, progress in assigning provenance becomes the cyberattack that matches this cyber defence.

So here’s the research question for data curation: how can I tell, from the internal evidence, and partial external evidence, whether this particular document is legitimate (or is the legitimate version of a set of almost-replicates)?

Advertisements

Refining “Data Science”

Regular readers will know that I have been thinking about the constellation of ideas that are getting a lot of play in universities and the research community around words like “data science”, and ‘big data”,  and especially the intersection of these ideas with the other constellation of “data mining”, “knowledge discovery” and “machine learning”.

I’ve argued that inductive model discovery (which I think is the core of all of these ideas) is a new way of doing science that is rapidly replacing the conventional Enlightenment or Popperian view of science. This is happening especially quickly in fields that struggled to apply the conventional scientific method, especially in medicine, the social “sciences”, business schools, and in the humanities.

Attending the International Conference on Computational Science meeting made me realise, however, that computational science is a part of this story as well.

Here’s how I see the connections between these three epistemologies:

  1. Conventional science. Understand systems via controlled experiments: setting up configurations that differ in only a few managed ways and seeing whether those differences correspond to different system behaviours. If they do, construct an “explanation”; if they don’t, it’s back to the drawing board.
  2. Computational science. Understand systems by building simulations of them and tweaking the simulations to see if the differences are those that are expected from the tweaks. (Simulations increase the range of systems that can be investigated when either the tweaks can’t be done on the real system, or when the system is hypothesised to be emergent from some simpler pieces.)
  3. Data science. Understand systems by looking at the different configurations that naturally occur and seeing how these correspond to different system behaviors. When they do, construct an “explanation”.

In other words, conventional science pokes the system being investigated in careful ways and sees how it reacts; computational science creates a replica of the system and pokes that; and data science looks at the system being poked and tries to match the reactions to the poking.

Underlying these differences in approach is also, of course, differences in validation: how one tells if an explanation is sufficient. The first two both start from a hypothesis and use statistical machinery to decide whether the hypothesis is supported sufficiently strongly. The difference is that the computational science has more flexibility to set up controlled experiments and so, all things considered, can get stronger evidence. (But there is always the larger question of whether the simulation actually reproduces the system of interest — critical, but often ignored, and with huge risks of “unknown unknowns”.) Data science, in contrast, validates its models of the system being studied by approaches such as the use of a test set, a component of the system that was not used to build the model, but which should behave as the original system did. It is also buttressed by the ability to generate multiple models and so compare among them.

Data science is advancing on two fronts: first, the flexibility it provides to conventional science not to have to construct carefully balanced controlled experiments; second, and much more significantly, the opportunity it creates for making scientific progress in the social sciences and humanities, replacing “qualitative” by “quantitative” in unprecedented ways.