This morning Jonah Bossewitch pointed me to an article over at Wired, authored by Chris Anderson, which announces “The End of Theory”. The article’s main argument is in itself not very interesting for anybody with a knack for epistemology – Anderson has apparently never heard of the induction / deduction discussion and has a limited idea of what statistics does – but there is a very interesting question lurking somewhere behind all the Californian Ideology, and the following citation points right to it:
We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
One could point to the fact that the natural sciences had their experimental side for quite a while (Roger Bacon advocated his scientia experimentalis in the 13th century) and that a laboratory is in a sense a pattern-finding machine where induction continuously plays an important role. What interests me more though is Anderson’s insinuation that statistical algorithms are not models. Let’s just look at one of the examples he uses:
Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required.
This is a very limited understanding of what constitutes a model. I would argue that PageRank does in fact rely very explicitly on a model which combines several layers of justification. In their seminal paper on Google, Brin and Page write the following:
PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank.
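To make the modeling explicit, here is a minimal sketch of the random-surfer idea in Python. It is not Google's implementation, only the probability reading of the quote above: the function name `pagerank`, the four-page `toy_web` graph, the fixed number of iterations, and the treatment of pages without outgoing links are my own assumptions for illustration. The damping value of 0.85 (the chance that the surfer keeps clicking rather than getting bored) is the one Brin and Page mention as typical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Estimate the random surfer's visit probability for every page.

    links maps each page to the list of pages it links to (hypothetical input format).
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    # Start with the surfer equally likely to be anywhere.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Boredom: with probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Clicking: the page's probability is split evenly over its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: at a dead end the surfer jumps to any page at random.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# A made-up web of four pages: a, b and d all link to c; c links back to a.
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Even in this toy version, several decisions are taken before any counting happens: what counts as a link, how easily the surfer gets bored, and what happens at a dead end. In the made-up graph, page c comes out on top simply because three pages link to it.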
The assumption behind this graph-oriented justification is that people do not place links randomly but with purpose. Linking implies attribution of importance: we don’t link to documents that we’re indifferent about. The statistical exploration of the huge graph that is the Web is indeed oriented by this basic assumption, and it adds the quite contestable rule that whatever the greatest number of linkers think important should be most visible. I would, then, argue that there is no experimental method that is purely inductive, not even neural networks. Sure, on the mathematical side we can explore data without limitations concerning their dimensionality, i.e. the number of characteristics that can be taken into account; the method of gathering data is, however, always a process of selection that is influenced by some idea or intuition that at least implicitly has the characteristic of a model. There is a deductive side to even the most inductive approach. Data is made, not given, and every projection of that data is oriented. To quote Fernando Pereira:
[W]ithout well-chosen constraints — from scientific theories — all that number crunching will just memorize the experimental data.
As Jonah points out, Anderson’s article is probably a straw man argument whose sole purpose is to attract attention, but it points to something that is really important: too many people think that mathematical methods for knowledge discovery (data mining, that is) are neutral and objective tools that will find what’s really there and show the world as it is without the stain of human intentionality; these algorithms are therefore not seen as objects of political inquiry. In this view statistics is all about counting facts, and only higher layers of abstraction (models, theories, …) can have a political dimension. But it matters what we count and how we count.
In the end, Anderson’s piece is little more than the habitual prostration before the altar of emergence and self-organization. Just exchange the invisible hand for the invisible brain and you’ll get pop epistemology for hive minds…
Pingback: Abstraction » statistics vs. science (and why this is rather political)
July 2, 2008 at 10:19 pm /
Thanks for continuing this thread – you really captured my intent better than I originally expressed 😉
I have to say it’s been a bit disconcerting to learn this week that quite a few CS folks I have talked with accept Anderson’s formulation, to some degree. What’s scary here is that this is precisely how we relinquish control to the machines – if we give up our agency, I suppose we’ll get what we deserve.
While I’ve had some success swaying perspectives in conversations (some of this philosophy stuff is actually quite practical), the tougher argument to develop is that there “is no experimental method that is purely inductive, not even neural networks.” These implicit forms of knowledge representation are so counter-intuitive that we don’t really know how to think about them yet.
I think it’s possible (and important) to tease out the specific intentional choices that go into deciding how and what to count, as you have done with the PageRank algorithm here.
This was another example I came across a few months ago that struck me as problematic in similar ways:
We had better pre-empt this quickly, before “they” decide to feed their behavioural auto-classification systems the schemes in the DSM …