Regular readers will know that I have been thinking about the constellation of ideas that are getting a lot of play in universities and the research community around words like “data science” and “big data”, and especially the intersection of these ideas with the other constellation of “data mining”, “knowledge discovery” and “machine learning”.
I’ve argued that inductive model discovery (which I think is the core of all of these ideas) is a new way of doing science that is rapidly replacing the conventional Enlightenment or Popperian view of science. This is happening especially quickly in fields that have struggled to apply the conventional scientific method, notably medicine, the social “sciences”, business schools, and the humanities.
Attending the International Conference on Computational Science made me realise, however, that computational science is a part of this story as well.
Here’s how I see the connections between these three epistemologies:
- Conventional science. Understand systems via controlled experiments: setting up configurations that differ in only a few managed ways and seeing whether those differences correspond to different system behaviours. If they do, construct an “explanation”; if they don’t, it’s back to the drawing board.
- Computational science. Understand systems by building simulations of them and tweaking the simulations to see if the differences are those that are expected from the tweaks. (Simulations increase the range of systems that can be investigated when either the tweaks can’t be done on the real system, or when the system is hypothesised to be emergent from some simpler pieces.)
- Data science. Understand systems by looking at the different configurations that naturally occur and seeing how these correspond to different system behaviours. When they do, construct an “explanation”.
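The “simulate and tweak” loop in the computational-science bullet can be sketched with a toy model. This is my own illustrative example (a biased random walk), not anything from the conference: build a simulation, change one parameter, and check that the simulated behaviour shifts in the direction the tweak predicts.

```python
import random

def simulate_walk(bias, steps, seed):
    """Random walk: step +1 with probability `bias`, else -1."""
    rng = random.Random(seed)
    position = 0
    for _ in range(steps):
        position += 1 if rng.random() < bias else -1
    return position

# Baseline simulation, then the same simulation with one tweaked parameter.
baseline = simulate_walk(bias=0.5, steps=10_000, seed=1)
tweaked = simulate_walk(bias=0.6, steps=10_000, seed=1)

# Expected effect of the tweak: raising the up-step probability from 0.5 to
# 0.6 should drift the endpoint upward by roughly (0.6 - 0.4) * steps = 2000.
# If the observed difference matches the expectation, the simulation behaves
# as hypothesised; if not, the model (or the hypothesis) needs revisiting.
```

The epistemic move is the comparison in the last step: the tweak plays the role of the controlled intervention, applied to the replica rather than to the real system.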
In other words, conventional science pokes the system being investigated in careful ways and sees how it reacts; computational science creates a replica of the system and pokes that; and data science looks at the system being poked and tries to match the reactions to the poking.
Underlying these differences in approach are also, of course, differences in validation: how one tells whether an explanation is sufficient. The first two both start from a hypothesis and use statistical machinery to decide whether the hypothesis is supported sufficiently strongly. The difference is that computational science has more flexibility to set up controlled experiments and so, all things considered, can get stronger evidence. (But there is always the larger question of whether the simulation actually reproduces the system of interest — critical, but often ignored, and with huge risks of “unknown unknowns”.) Data science, in contrast, validates its models of the system being studied by approaches such as the use of a test set: a portion of the data that was not used to build the model, but which should behave as the original system did. It is also buttressed by the ability to generate multiple models and so compare among them.
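The test-set idea can be made concrete with a minimal sketch. The data and the one-parameter model below are hypothetical, purely for illustration: fit the model on a training subset, then judge it by how well it predicts the held-out subset it never saw.

```python
def fit_slope(xs, ys):
    # Least-squares slope for a one-parameter model y = w*x (no intercept).
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mean_squared_error(xs, ys, w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical observations: y is roughly 2*x with small deviations.
x_all = [1, 2, 3, 4, 5, 6, 7, 8]
y_all = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9]

# Hold out the last quarter of the data as the test set.
split = int(len(x_all) * 0.75)
x_train, y_train = x_all[:split], y_all[:split]
x_test, y_test = x_all[split:], y_all[split:]

w = fit_slope(x_train, y_train)
test_error = mean_squared_error(x_test, y_test, w)
# A small test_error means the model generalises beyond the data it was
# built from — the data-science analogue of a validated hypothesis.
```

The validation here comes not from a designed intervention but from the model's performance on observations it was never fitted to.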
Data science is advancing on two fronts: first, by freeing conventional science from having to construct carefully balanced controlled experiments; second, and much more significantly, by creating opportunities for scientific progress in the social sciences and humanities, replacing the “qualitative” with the “quantitative” in unprecedented ways.