The New York Times is not only a very good newspaper, it is also a really, really interesting archive that provides search access to all articles since 1851 via a pretty nice API. I’ve been meaning to play with it for some time, but things were extremely busy this year. But yesterday, I had some time in the evening and looked into the system a little bit and wrote a couple of scripts to try out some quick ideas.
While the API has all kinds of interesting things – in particular access to the Times’ controlled vocabulary – I am most interested in the article archive and the different possibilities to explore it. Understandably, the API does not provide the full text of articles; but it does search in the full text and for every found article it delivers quite a number of interesting things. Here is an example of what the returned data for a query (“guantanamo bay”) looks like:
While there are many things to go with, I found the manually attributed (and controlled) keywords to be particularly interesting. So I decided to explore and visualize how a particular subject evolves over time inside of this classificatory structure. Because the request rate for the search API is quite generous (10/s, 10K/day) I wrote a short PHP script (grab.php) that grabs this metadata for every article corresponding to a given search query. It simply downloads the data and stores it in a bunch of JSON files. A second script (analyze.php) then parses these files and creates a simple CSV file that can then be visualized with something like R (which I started working with some weeks ago, much easier than I thought, lots of fun).
With the help of the amazing ggplot2 library in R, using “guantanamo bay” as query, I quickly got a first result (click for larger image):
One can quite easily see that Guantanamo Bay was discussed in the 1990s in terms of immigration, asylum, and similar terms, while the current frame (terrorism, etc.) appears just after 9/11. While this script (bubbles.R) provides overview, a second one (bubbles_numbers.R) provides a combination of bubbles and numbers (click for larger image):
There is certainly much more interesting stuff to do with this data (e.g. different types of normalization, taking into account word count and page number, etc.) and I’ll hopefully come back to this in more detail in the future. In the meantime, all scripts can be found here.
Update June 2, 2013:
I’ve added a network export feature to the scripts on github. Generated network files are not limited to subject tags, but include people, organizations, locations, and creative works (e.g. books or movies). If two tags appear on the same article, a link is created and the more often they appear together, the stronger the connection. Here’s a quick visualization, made with gephi, of the most common people (red), organizations (green), and locations (blue) for the query “climate change” (click for larger image):
Pages are part of Facebook’s project to suck up the Web. They are also full of data. In the next version of netvizz I will add a feature that allows to dig into that data a little bit. Here is a preview:
This network (visualized with gephi) shows interactions on the Facebook page of the Guardian. I extracted all the likes and comments for the last 80 posts. On the whole, there are 9.500 users liking and commenting away. Each dark and labeled node is a post while all the others are users. A heat scale (blue => yellow => red) shows how often a user interacts with the page; size shows how often a node was liked or commented on (for pages) or liked and commented (for users).
One can see a a core of regulars in the middle of the graph, but the main engagement comes from a large majority of users that have only interacted with a single posts. These users drag the big subjects out to the margins in this specific spatialization. Engagement, here, comes from a fleeting audience rather than a more stable group or community.
There is still some testing to do, but I hope to get this feature ready soon for general use.
The second issue of computational culture is out and I am really looking forward to plunging into it once my teaching schedule leaves me a little bit more time. I am very happy that my paper made it in. As I am currently preparing a lecture on visual analysis for a class, I’ve been using the text for a bit of fun.
James A. Danowski‘s co-word tool wordij is unfortunately no longer online (why?), but it’s a really interesting and powerful piece of software and I used it here to create an alternative view of the paper (click for bigger image) with the help of gephi:
Many Eyes is still has a few tricks up its sleeve and this word tree visualization is really quite a strong tool for exploring the use and context of select words in a text:
These really work quite well on this particular paper, but I hope to spend more time with text analysis over the next months – working on historical papers from computer and information science – to see how well these and other tools hold up in a truly exploratory context.
Before moving back into media studies, I taught programming at various levels for quite a number of years. One of the things that always struck me, in particular at the beginner’s level, is how little my attempts to find different metaphors, explanations, ways of approaching the subject and practical exercises changed about a basic observation that would repeat itself over the years: some people get it, some don’t. Some students seem like they are born to program, while others cannot – even after long hours of training and no lack of trying – write even the simplest piece of functional and useful code.
Arstechnica has posted a really interesting piece relaying a Q&A on Stack Exchange on this exact topic: “can everyone be a programmer”? Amongst the different resources cited is a link to a paper from 2006 by Saeed Dehnadi and Richard Bornat from the School of Computing at Middlesex University that basically confirms the “double bump” so many programming teachers observe year after year (a lively discussion of the paper is here). What really struck me about the paper was not so much the empirical outcome though, neither the fact that a relatively simple test seems to yield robust results in predicting programming aptitude, but rather a passage that speculates on the reasons for the findings:
It has taken us some time to dare to believe in our own results. It now seems to us, although we are aware that at this point we do not have sufficient data, and so it must remain a speculation, that what distinguishes the three groups in the first test is their different attitudes to meaninglessness.
Formal logical proofs, and therefore programs – formal logical proofs that particular computations are possible, expressed in a formal system called a programming language – are utterly meaningless. To write a computer program you have to come to terms with this, to accept that whatever you might want the program to mean, the machine will blindly follow its meaningless rules and come to some meaningless conclusion. In the test the consistent group showed a pre-acceptance of this fact: they are capable of seeing mathematical calculation problems in terms of rules, and can follow those rules wheresoever they may lead. The inconsistent group, on the other hand, looks for meaning where it is not. The blank group knows that it is looking at meaninglessness, and refuses to deal with it.
While a single explanatory strategy like this is certainly not enough, I must admit that I have rarely read something about teaching programming that comes as close to my own experience and intuition. It was always my impression – and I have discussed this in great length with some of my peers – that programming requires a leap of faith, an acceptance of something that contradicts many of the things we have learned from our everyday experience. When programming, we are using words to build a machine and because the “meaning” of software is expressed through the medium of functionality, the words we write are simply not expressive in the same way as human language.
Over these last years, there has been a debate in software studies and related fields about the linguistic vs. functional dimension of “code”, and I have always found this debate to be strangely removed from the practice of programming itself (which may actually be a goal rather than an omission) – perhaps simply because I have always seen the an acceptance of the meaninglessness – in terms of semantics – of programming languages as a requirement to be a programmer. This does not mean that code does not have meaning as text but, to put it bluntly, if one cannot suspend that belief and look through the code directly into the functionality it produces, one cannot be a programmer. This is also the reason why I am so skeptical about the focus on code and the idea that scholars interested in software have to “learn how to write code”. Certainly, programming languages, development environments, etc. structure the practice – they open certain doors and close others, they orient and facilitate – but in the end, from the perspective of the programmer, the developer who actually builds a program the true expressiveness of software is behind the code; and when developers discuss the different advantages and disadvantages of various programming languages, they are measuring them in terms of how to get to that behind, how they allow us to pierce through their formalism, their meaninglessness to get to the level of expression we are actually speaking on: functionality.
This is the leap of faith that I have seen so many students not be able to commit to (for whatever reason, perhaps not the least my incapacity to present it in a way they can relate to): to accept the pure abstract performative formality of the words you are writing and see the machine behind the text. Because for those who program – and they are certainly not the only ones who have the right to speak about these matters – the challenge is to go through one kind of language (code) to get to another kind of language (function).
The example discussed in the paper – and used as a test – is telling:
int a = 10;
int b = 20;
a = b;
I do not know how many times I have tried to explain assignment. One of the tricks I came up with was the “arrow to the left” idea: a = 10 does not mean that a is equivalent to 10 but that the “=” should be a “<=” that puts “what is on the right” into “what is on the left”. In my experience – and the paper fully confirms that – not every student can get to a functional understanding of why “a” contains the value “20″ at the end of this script. Actually, the “why” is completely secondary. “a = b” means that you are putting the value for b into a. You’re doing it. Really. Really? I don’t know. But if you want to write a program you need to accept that this is what’s happening here. You don’t have to believe in it; formalism is about the commitment to a method, not an ontology. Let go of the idea that words mean something; here, they do something.
Software is not “linguistic” because source code has meaning – it may very well, but to write a program you have to forget about that – but because functionality is itself a means of expression. Despite the many footnotes I should add here, it is the capacity to be able to express the meaningful through the meaningless that makes a programmer. We have to let go a little of the world as we know it, in order to find it again, but in a different way.
This implies a very specific epistemological and even psychological stance, a way of unveiling the world in a certain manner. The question why this manner is apparently not everybody’s cup of tea may be a fruitful way to better understand what it actually consists of.
So, I share my name with quite a bunch of other people. In a search engine culture, where names are queries, this does lead to confusion, for I am but one of many. But no longer, Google has just shown me how I can rise above the other BRs:
On Thursday, I will be giving a talk at the “The Lived Logics of Database Machinery” workshop, organized by computational culture, which will take place at the Wellcome Collection Conference Centre in London, from 10h to 17h30. I am very much looking forward to this, although I’ll be missing a couple of days from the currently ongoing DMI summer school. This is what I will be talking about:
ORDER BY column_name. The Relational Database as Pervasive Cultural Form
This contribution starts from the observation that, in a way similar to the computational equivalence of programming languages, the major types of database models (network, relational, object-oriented, etc.) and implementations are all able to store and manage a very large variety of data structures. This means that most data structures could be modeled, in one way or another, in almost any existing database system. So why have there been so many intense debates about how to conceive and build database systems? Just like with programming languages, the specific way a database system embeds an abstract concept in a set of concrete methods and mechanisms for specifying, accessing, and manipulating datasets is significant. Different database models and implementations imply different ways of “thinking” data organization, they vary in performance, robustness, and “logistics” (one of the reasons why Oracle’s product succeeded well in the enterprise sector in the 1980s, despite its lack of certain features, was the ability to make backups of a running database), and they provide different modes of interaction with both the data and the system.
The central vector of differentiation, however, is the question how users “see” the data: during the “database debates” of the 1970s and 1980s the idea of the database as a set of tables (relational model) was put in opposition to the vision of the database as a network of records (network model). The difference between the two concerned not only performance, flexibility, and complexity, but also the crucial question who the users of these systems would be in the first place. The supporters of the network model clearly saw the programmer as the target audience for database systems but the promoters of the much simpler relational model and its variants imagined “accountants, engineers, architects, and urban planners” (Chamberlin and Boyce 1974) to directly interact with data by means of a simple query language. While this vision has not played out, according to Michael Stonebreaker’s famous observation, SQL (the most popular, albeit impure implementation of Codd’s relational ideas) has indeed become “intergalactic data-speak” (most packages on the market provide SQL interfaces) and this standardization has strongly facilitated the penetration of database systems into all corners of society and contributed to a widespread “relational view” of data organization and manipulation, even if data modeling is still mostly in expert hands.
The goal of this contribution is to examine this “relational view” in terms of what Jack Goody called the “modes of thought” associated with writing, and in particular with the list form, which “encourages the ordering of the items, by number, by initial sound, by category, etc.” (Goody 1977). As with most modern technologies, the relational model implies a complex set of constraining and enabling elements. The basic structural unit, the “relation” (what most people would simply call a table) disciplines data modeling practices into logical consistency (tables only accept tuples/rows with the same attributes) while remaining “semantically impoverished” (Stonebreaker 1993). Heterogeneity is purged from the relational model on the level of modeling, especially if compared to navigational approaches (e.g. XPath or DOM), but the “set-at-a-time” retrieval concept, combined with a declarative query language, affords remarkable flexibility and expressiveness on the level of data selection. The relational view thus implies an “ontology” consisting of regular, uniform, and only loosely connected objects that can be ordered in a potentially unlimited number of ways at the time of retrieval (by means of the query language, i.e. without having to program explicit retrieval routines). In this sense, the relational model perfectly fits the qualities that Callon and Muniesa (2005) attribute to “powerful” calculative agency: handle a long list of diverse entities, keep the space of possible classifications and reclassifications largely open, multiply possible hierarchies and classifications. What database systems then do, is bridging the gap between these calculative capacities and other forms of agency by relating them to different forms of performativity (e.g., in SQL speak, to SELECT, TRIGGER, and VIEW).
While the relational model’s simplicity has led to many efforts to extend or replace it in certain application areas, its near universal uptake in business and government means that the logistics of knowledge and ordering implied by the relational ontology resonate through the technological layers and database schemas into the domains of management, governance, and everyday practices.
I will argue that the vision of the “programmer as navigator” trough a database (Bachman 1973) has, in fact, given way to a setting where database consultants, analysts, and modelers sit between software engineering on the one side and management on the other, (re)defining procedures and practices in terms of the relational model. Especially in business and government sectors, central forms of management and evaluation (reporting, different forms of data analysis, but also reasoning in terms of key performance indicators and, more generally, “evidence based” management) are directly related to the technological and cognitive standardization effects derived from the pervasiveness of relational databases. At the risk of overstretching my argument, I would like to propose that Thrift’s (2005) “knowing capitalism” indeed knows (largely) in terms of the relational model.