Provenance — a neglected aspect of data analytics

Provenance is defined by Merriam-Webster as “the history of ownership of a valued object or work of art or literature”, but the idea has much wider applicability.

There are three kinds of provenance:

  1. Where did an object come from. This kind of provenance is often associated with food and drink: country of origin for fresh produce, brand for other kinds of food, appellation d’origine contrôlée for French wines, and many other examples. This kind of provenance is usually signalled by something that is attached to the object.
  2. Where did an object go on its way from source to destination. This is actually the most common form of provenance historically — the way that you know that a chair really is a Chippendale is to be able to trace its ownership all the way back to the maker. A chair without provenance is probably much less valuable, even though it may look like a Chippendale, and the wood seems the right age. This kind of provenance is beginning to be associated with food. For example, some shipments now have temperature sensors attached to them that record the maximum temperature they ever encountered between source and destination. Many kinds of shipments have had details about their pathway and progress available to shippers, but this is now being exposed to customers as well. So if you buy something from Amazon you can follow its progress (roughly) from warehouse to you.
  3. The third kind of provenance is still in its infancy: what else did the object encounter on its way from source to destination? This comes in two forms. First, what other objects was it close to? This is the essence of Covid19 contact tracing apps, but it applies to any situation where closeness could be associated with poor outcomes. Second, were the objects that it was close to ones that were expected or made sense?
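
To make the third kind concrete, here is a minimal sketch of both forms: finding which objects were close to each other, and then checking whether those encounters were expected. It assumes, purely for illustration, that each object's journey has been logged as a list of stays (place, arrival time, departure time); the `Stay` record, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Stay:
    """One logged stay of an object at a place (hypothetical record format)."""
    obj: str
    place: str
    start: int  # arrival time, arbitrary units
    end: int    # departure time, same units


def encounters(stays):
    """Return the pairs of objects that were at the same place at overlapping times."""
    pairs = set()
    for a, b in combinations(stays, 2):
        overlap = a.start < b.end and b.start < a.end
        if a.obj != b.obj and a.place == b.place and overlap:
            pairs.add(frozenset((a.obj, b.obj)))
    return pairs


def unexpected(pairs, expected):
    """Return the encounters that are not in the set of expected pairings."""
    return pairs - expected


# Made-up sample data: three crates passing through a depot.
stays = [
    Stay("crate-1", "depot", 0, 5),
    Stay("crate-2", "depot", 3, 8),
    Stay("crate-3", "depot", 6, 9),
    Stay("crate-1", "truck", 5, 10),
]
met = encounters(stays)
# Assume only crate-1 and crate-2 were supposed to travel together.
surprises = unexpected(met, {frozenset(("crate-1", "crate-2"))})
```

Here `met` contains the pairs crate-1/crate-2 and crate-2/crate-3, and `surprises` contains only crate-2/crate-3. The pairwise comparison is quadratic in the number of stays, which is fine for a sketch but would need indexing by place and time for real volumes.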

The first and second forms of provenance don’t lead to interesting data-analytics problems. They can be solved by recording technologies with, of course, issues of reliability, unforgeability, and non-repudiation.

But the third case raises many interesting problems. Public health models of the spread of infection usually assume some kind of random particle model of how people interact (with various refinements such as compartments). These models would be much more accurate if they could be based on actual physical encounter networks — but privacy quickly becomes an issue. Nevertheless, there are situations where encounter networks are already collected for other reasons: bus and train driver handovers, shift changes of other kinds, police-present incidents; and such data provides natural encounter networks. [One reason why Covid19 contact tracing apps work so poorly is that Bluetooth proximity is a poor surrogate for potentially infectious physical encounter.]
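
As a rough illustration of what an actual encounter network offers over a random-mixing assumption, the sketch below builds a network from handover records and asks who is within a given number of handovers of one person. The record format (pairs of names), the function names, and the sample data are assumptions made for the example, and the sketch ignores the timing of encounters, which a real infection model could not.

```python
from collections import defaultdict, deque


def build_network(handovers):
    """Build an undirected encounter network from (person, person) handover records."""
    graph = defaultdict(set)
    for a, b in handovers:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def reachable_within(graph, source, max_hops):
    """Breadth-first search: map each person within max_hops of source to their hop count."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for other in graph.get(person, ()):
            if other not in hops:
                hops[other] = hops[person] + 1
                queue.append(other)
    del hops[source]
    return hops


# Made-up shift handovers between drivers.
handovers = [("ann", "bob"), ("bob", "cy"), ("cy", "dee"), ("eve", "fay")]
network = build_network(handovers)
exposed = reachable_within(network, "ann", 2)
```

With this data `exposed` is `{"bob": 1, "cy": 2}`: dee is three handovers away, and eve and fay are never reached at all, a distinction that a uniform random-mixing model cannot express.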

Customs also has a natural interest in provenance: when someone or something presents at the border, the reason they’re allowed to pass or not is all about provenance: hard coded in a passport, pre-approved by the issue of a visa, or with real-time information derived from, say, a vehicle licence plate.

Some clearly suspicious, but hard to detect, situations arise from mismatched provenance. For example, if a couple arrive on the same flight, then they will usually have been seated together; if two people booked their tickets or got visas using the same travel agency at the same time then they will either arrive on different flights (they don’t know each other), or they will arrive on the same flight and sit together (they do know each other). In other words, the similarity of provenance chains should match the similarity of relationships, and mismatches between the two signal suspicious behaviour. Customs data analytics is just beginning to explore leveraging this kind of data.
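
The booking example can be turned into a simple rule, sketched below under strong simplifying assumptions: each traveller has a single booking key (agency plus booking time), a flight, and a seat row, and "sitting together" means being in the same row. The `Traveller` record, the `mismatched_pairs` function, and the data are hypothetical and not taken from any real customs system.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Traveller:
    name: str
    booking: str  # hypothetical key: agency plus booking time
    flight: str
    row: int      # seat row, used as a crude proxy for sitting together


def mismatched_pairs(travellers, max_row_gap=0):
    """Flag pairs who booked together and share a flight but did not sit together."""
    flagged = []
    for a, b in combinations(travellers, 2):
        same_booking = a.booking == b.booking
        same_flight = a.flight == b.flight
        together = same_flight and abs(a.row - b.row) <= max_row_gap
        if same_booking and same_flight and not together:
            flagged.append((a.name, b.name))
    return flagged


# Made-up arrivals.
arrivals = [
    Traveller("A", "agency1-0900", "F100", 12),
    Traveller("B", "agency1-0900", "F100", 12),
    Traveller("C", "agency1-0900", "F100", 30),
    Traveller("D", "agency1-0900", "F200", 5),
    Traveller("E", "agency2-1400", "F100", 30),
]
flags = mismatched_pairs(arrivals)
```

Here `flags` is `[("A", "C"), ("B", "C")]`. A and B match the "know each other" pattern, D matches the "don't know each other" pattern, and C fits neither. A flag is only a prompt for a closer look, since there are innocent reasons for sitting apart.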
