While there are probably a lot of people that have stumbled over the Google Ngram Viewer, it is safe to assume that fewer have read the paper (Science, January 2011) by Michel et al. that documents the project and gives a good idea of the kind of “big iron” science we can expect to capture quite a lot of attention over the next couple of years. According to the (14, one being “The Google Books Team”, another Steven Pinker) authors, the projet – fittingly termed culturomics – is based on a sample of 5,195,769 books, which apparently represents roughly 4% of all the books ever published. They easiest way to show the scope of what the researchers aim to do is quoting the abstract in full:

We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

Next to the sheer size of the corpus, there are several things that are quite remarkable with this project:

1) While the paper is full of graphs, it is immensely interesting that many of the measurements taken can be “reenacted” with the Ngram Viewer. In a passage that diagnoses “a greater focus on the present” in more recent publications, the authors show that the half-life (i.e. the number of years it takes for a date to get to half the frequency value of an initial peak) of dates gets much shorter over time. We can easily graph the result ourselves:This possibility to query the data ourselves (as well as the comprehensive data sharing) represents quite a change in how we can relate to the results as scholars and while only the most well-funded projects will be able to provide a “companion” data-tool, there is a real epistemological shift underway. From a teaching perspective, the hands-on approach may actually be even more valuable.

2) We increasingly have very comprehensive available data sets that can be used as concept markers in very different contexts. In this case, the authors used 740.000 names of persons from Wikipedia to study different aspects of fame. But one could easily imagine using GeoNames to perform a similar survey of the ebb and fall of geographic prominence. I am quite sure that linguists will soon bring together the Ngram data with WordNet to study concept evolution and other things.

3) While the examples developed in the article are fascinating – and there will certainly be many more – the epistemological horizon is quite vague for the moment. There is no question that historical linguistics will have a field day plunging into the data, but the intellectual rationale behind the project of culturomics is a bit thin for the moment:

Culturomics is the application of high-throughput data collection and analysis to the study of human culture. Books are a beginning, but we must also incorporate newspapers, manuscripts, maps, artwork, and a myriad of other human creations. Of course, many voices—already lost to time— lie forever beyond our reach.

Culturomic results are a new type of evidence in the humanities. As with fossils of ancient creatures, the challenge of culturomics lies in the interpretation of this evidence.

I would argue that it is not so much the interpretation of evidence that represents a challenge but the integration of these new computer-based approaches into meaningful research agendas that ask non-trivial questions. While it may be interesting to be able to attach a number to the competence of Nazi censorship efforts, this competence  was never very much in doubt and while numbers and graphs may confer an aura of scientific respectability, the findings will most probably not add anything to our understanding of national socialism.

While it is increasingly unpopular to cite Snow’s Two Cultures, this early proposal for a quantitative approach to culture (in its historic dimension) will give rise to all kinds of polemics, misunderstandings, and demarkation efforts. The public availability of a query tool is, however, a real reason for hope: humanities scholars will be able to try it out for themselves and with a bit of luck, we will have a broader view on its usefulness for cultural analysis in a couple of month.

Post filed under database, epistemolgy, method, statistics.

Leave a Reply

Your email address will not be published. Required fields are marked *

Tech support questions will not be answered. Please refer to the FAQ of the tool.