The growing role of data curation

My view of Data Science, or Big Data if you prefer, is that it divides naturally into three different subfields:

  1. Data curation, which deals with managing large amounts of heterogeneous data, but is primarily concerned with provenance, that is, tracking the metadata about the data.
  2. Computational science, which builds models of the real world inside computer systems to study their properties.
  3. Analytics, which infers the properties of systems based on data about them.

Data curation might have seemed like the poor cousin among these three, and certainly gets the least funding and attention.

But issues of provenance have suddenly become mainstream as everyone on the web struggles to figure out what to do about fake news stories. So far, the Internet has not really addressed the issues of metadata. Most of the big content providers know who generated the content that they create and distribute, but they don’t necessarily make this information known, or available for readers to use. It’s time for the data curation experts, who tend to come from information systems and library science, to step up.

Data curation is also about to become the front line of cyberattack. As I’ve suggested (Skillicorn, DB, Leuprecht, C, and Tait, V. 2016. Beyond the Castle Model of Cybersecurity. Government Information Quarterly.), a natural cyberdefence strategy is replication. Data exfiltration is made much more difficult if there are many superficially similar versions of any document or data that might be a target. However, progress in assigning provenance then becomes the cyberattack that matches this cyberdefence: an attacker who can work out which version is the real one has defeated the replication.
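
To make the replication idea concrete, here is a minimal sketch in Python. It assumes something the discussion above does not specify: that the defender privately records a SHA-256 digest of the legitimate version as a piece of provenance metadata. The function names and the sample documents are hypothetical, made up purely for illustration.

```python
import hashlib
import hmac


def digest(text):
    """Return the SHA-256 hex digest of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_legitimate(candidates, recorded_digest):
    """Return the candidates whose digest matches the recorded one.

    Assumption: recorded_digest is provenance metadata that the defender
    stored privately when the legitimate document was created.
    """
    return [
        doc for doc in candidates
        if hmac.compare_digest(digest(doc), recorded_digest)
    ]


# Hypothetical documents: one legitimate version and two near-replicates.
legitimate = "Shipment 4471 leaves the depot on 12 March."
replicates = [
    legitimate,
    "Shipment 4471 leaves the depot on 14 March.",
    "Shipment 4417 leaves the depot on 12 March.",
]

recorded = digest(legitimate)  # kept by the defender, not stored with the files
matches = find_legitimate(replicates, recorded)
```

Whoever holds the recorded digest can pick out the legitimate version immediately; someone who has only exfiltrated the replicates has nothing in this sketch to distinguish them.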

So here’s the research question for data curation: how can I tell, from internal evidence and partial external evidence, whether this particular document is legitimate (or is the legitimate version among a set of almost-replicates)?