Category Archives: epistemology

While there are probably a lot of people who have stumbled upon the Google Ngram Viewer, it is safe to assume that fewer have read the paper (Science, January 2011) by Michel et al. that documents the project and gives a good idea of the kind of “big iron” science we can expect to capture quite a lot of attention over the next couple of years. According to the authors (14 in total, one being “The Google Books Team”, another Steven Pinker), the project – fittingly termed culturomics – is based on a sample of 5,195,769 books, which apparently represents roughly 4% of all the books ever published. The easiest way to show the scope of what the researchers aim to do is to quote the abstract in full:

We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

Next to the sheer size of the corpus, there are several things that are quite remarkable with this project:

1) While the paper is full of graphs, it is immensely interesting that many of the measurements taken can be “reenacted” with the Ngram Viewer. In a passage that diagnoses “a greater focus on the present” in more recent publications, the authors show that the half-life of dates (i.e. the number of years it takes for a date to fall to half the frequency value of an initial peak) gets much shorter over time. We can easily graph the result ourselves. This possibility to query the data ourselves (as well as the comprehensive data sharing) represents quite a change in how we can relate to the results as scholars, and while only the most well-funded projects will be able to provide a “companion” data-tool, there is a real epistemological shift underway. From a teaching perspective, the hands-on approach may actually be even more valuable.
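As a toy illustration – my own sketch, not the authors’ code – the half-life measure can be computed from a yearly frequency series like the ones the Ngram Viewer exports (the numbers below are invented):

def half_life(series, peak_year):
    # years until the frequency first drops below half of the peak value
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] < half:
            return year - peak_year
    return None  # the frequency never dropped below half the peak

# hypothetical frequencies of the 1-gram "1950" in books published 1950-1956
freq = {1950: 0.00030, 1951: 0.00027, 1952: 0.00021, 1953: 0.00014,
        1954: 0.00012, 1955: 0.00010, 1956: 0.00009}

print(half_life(freq, 1950))  # -> 3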

2) We increasingly have very comprehensive data sets available that can be used as concept markers in very different contexts. In this case, the authors used 740,000 names of persons from Wikipedia to study different aspects of fame. But one could easily imagine using GeoNames to perform a similar survey of the rise and fall of geographic prominence. I am quite sure that linguists will soon bring together the Ngram data with WordNet to study concept evolution and other things.

3) While the examples developed in the article are fascinating – and there will certainly be many more – the epistemological horizon is quite vague for the moment. There is no question that historical linguistics will have a field day plunging into the data, but the intellectual rationale behind the project of culturomics is a bit thin for the moment:

Culturomics is the application of high-throughput data collection and analysis to the study of human culture. Books are a beginning, but we must also incorporate newspapers, manuscripts, maps, artwork, and a myriad of other human creations. Of course, many voices—already lost to time—lie forever beyond our reach.

Culturomic results are a new type of evidence in the humanities. As with fossils of ancient creatures, the challenge of culturomics lies in the interpretation of this evidence.

I would argue that it is not so much the interpretation of evidence that represents a challenge but the integration of these new computer-based approaches into meaningful research agendas that ask non-trivial questions. While it may be interesting to be able to attach a number to the competence of Nazi censorship efforts, this competence was never much in doubt, and while numbers and graphs may confer an aura of scientific respectability, the findings will most probably not add anything to our understanding of National Socialism.

While it is increasingly unpopular to cite Snow’s Two Cultures, this early proposal for a quantitative approach to culture (in its historic dimension) will give rise to all kinds of polemics, misunderstandings, and demarcation efforts. The public availability of a query tool is, however, a real reason for hope: humanities scholars will be able to try it out for themselves and, with a bit of luck, we will have a broader view of its usefulness for cultural analysis in a couple of months.

The use of computers in the humanities has a long and fine history. What is striking though is how lucidly scholars reflected on their tools even in the earliest days. Here’s a beautiful quotation by Irwin C. Lieb from a text published in the inaugural issue of Computers and the Humanities, a journal started in 1966.

The great advances which have so far been made with computers have been in those fields where we find countable items or have ready substitutes for them. The real or seeming extraneousness of computer studies for the humanities is owed to the fact that, in the humanities, what are most important are, if items at all, items that we can’t count, or can count only most artificially. We know, for example, how little definite we mean in saying that we have two or three ideas, that there are four themes in a play, or that there were this or that number of historical events. Our “counting” is not the counting of items that were somehow there separate, waiting to be pointed out; it is a “counting” in which judgments themselves mark out what come to be the items that we count. Apart from the judgments, there are no separate items. Therefore, no technique of counting such items so as to yield, for the first time, a judgment or a summary is possible at all. But, granting that this sort of limitation is inescapable, computers could, it seems, still come to have a more vital use in the humanities than we have seen so far.

[…]

The suggestion, then, is that some of the simplest but most important work to be done in deepening the usefulness of computers for the humanities will be in imagining those schemas by which we will model what we know cannot be modeled undistortedly: — ideas, themes, events and even more importantly, insights, appraisals, and appreciations. There are, there must be, revealing models for all of these. And as we think of them, and then use them in the humanities, the achievement for us will come as we feel out just what the distortions are, as we make the right mistakes. For as we see them as mistakes, we will penetrate further and still more appreciate what we are most concerned to understand. With the possibilities for computer studies of depth and importance in the humanities seeming still so genuine, it would be a mistake, I think, to curtail our exploration of them soon.

Some debates are just so much older than our short forgetful minds allow us to recognize. In 1965 Jacques Barzun (still alive today at a biblical 102!) made the following statement:

What have the humanities been doing for thirty-five years except to do exactly what a computer would do, only with their own unaided card indexes and fountain pens? They have taken apart poetry, they have taken apart novels, they have counted images, they have followed symbols that are sometimes non-existent, they have destroyed their own subject matter by a pseudo-computer-like approach, and now they have only themselves to blame if they have to learn the tricks and the jargon of computerizing. (Jacques Barzun at a conference at Yale University, cited in Taviss (ed.), The Computer Impact, 1970, p.199)

While I have not found the original document of Barzun’s talk, Bowler (ed.), Computers in Humanistic Research, 1967, p.232 has a summary of his three main points of critique:

First is the assumption of a false relation between the units defined and written and the reality they are supposed to represent. For example, 20 years ago, someone attempted to study genius by selecting names from Who’s Who in America, as being indicative of the quality of genius. Second is the fallacy of assessing importance by weight or numbers. The speaker mentioned a published census, again some 20 years ago, which indicated that the number of brownstone or frame houses in New York was much larger than the number of skyscrapers, giving the erroneous impression that the former represented the city’s characteristic architectural form. The third error is the attribution of meaning based upon only a partial study of the object in question. Two conspicuous examples of the faulty attribution of meaning to partial signs are the cases of machine translation and the objective tests given to school children and the people in business.

Would it be very hard to find contemporary examples that fit these three points?

What is a link? From a methodological standpoint, there is no answer to that question, only the recognition that when using graph theory and associated software tools, we project certain aspects of a dataset as nodes and others as links. In my last post, I “projected” authors from the air-l list as nodes and mail-reply relationships as links. In the example below, I still use authors as nodes, but links are derived from a similarity measure based on a statistical analysis of each poster’s mails. Here are two gephi graphs:

If you are interested in the technique, it’s a simple similarity measure based on the vector-space model, and my amateur computer scientist’s PHP implementation can be found here. The fact that the two posters who changed their “from:” text have both of their accounts close together (can you find them?) is a good indication that the algorithm is not completely botched. The words floating on the links in the right graph are the words that contribute the highest value to the similarity calculation, which means that such a word is relatively often used by both of the linked authors while being generally rare in the whole corpus. Ellis Godard and danah boyd, for example, have both written on air-l about Ron Vietti, a pastor who (rightfully?) thinks the Internet is the devil, and because very few other people mentioned the holy warrior, the word “vietti” is the highest-value “binder” between the two.
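For those who would rather not wade through my PHP, here is a minimal Python sketch of the kind of calculation involved (a simplification under my own assumptions, not the exact implementation): each author’s mails are merged into one document, each document becomes a vector of tf-idf weights, and the link between two authors is the cosine of the angle between their vectors.

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: {author: list of words}; returns {author: {word: weight}}
    n = len(docs)
    df = Counter()  # number of documents containing each word
    for words in docs.values():
        df.update(set(words))
    return {author: {w: tf * math.log(n / df[w])
                     for w, tf in Counter(words).items()}
            for author, words in docs.items()}

def cosine(u, v):
    # cosine similarity between two sparse vectors stored as dicts
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# invented toy corpus: three posters and their (cleaned) vocabulary
docs = {"poster_a": "vietti pastor internet devil internet".split(),
        "poster_b": "vietti internet research methods".split(),
        "poster_c": "conference cfp deadline".split()}
vectors = tfidf_vectors(docs)
print(cosine(vectors["poster_a"], vectors["poster_b"]))

Words shared by two authors but rare in the whole corpus contribute the most, which is exactly why “vietti” becomes a strong binder.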

What is important in networks that are the result of heavily iterative processing is that the algorithms used to create them are full of parameters, and changing one of these parameters just a little bit may (!) have larger repercussions. In the example above I actually calculate a similarity measure between each pair of nodes (60^2 / 2 results), but in order to make the graph somewhat readable I inserted a threshold that boils it down to 637 links. The missing measures are not taken into account in the physics simulation that produces the layout – although they may (!) be significant. I changed the parameter a couple of times to get the graph “right”, i.e. to find a good compromise between link density for simulation and readability. But look at what happens when I raise the threshold so that only the 100 strongest similarity measures survive:

First, a couple of nodes disconnect, two binary stars form around the “from:” changers, and the large component becomes a lot looser. Second, Jeremy Hunsinger loses the highest PageRank to Chris Heidelberg. Hunsinger had more links when lower similarity scores were taken into account, but when things get rough in the network world, bonding is better than bridging. What is result and what is artifact?
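The thresholding step itself is trivial – which is precisely why it is easy to forget how much it shapes the outcome. A sketch of the step (my own reconstruction, with invented scores):

def strongest_links(scores, k):
    # scores: {(node_a, node_b): similarity}; keep only the k highest-scoring links
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k])

# hypothetical similarity scores between four posters
scores = {("ann", "ben"): 0.91, ("ann", "eve"): 0.42,
          ("ben", "eve"): 0.77, ("ben", "dan"): 0.08}
print(strongest_links(scores, 2))  # dan disconnects; change k and the graph is redrawn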

Most advanced algorithmic techniques are riddled with such parameters and getting a “good” result not only implies fiddling around a lot (how do I clean the text corpus, what algorithms to look for what kind of structures or dynamics, what parameters, what type of representation, here again, what parameters, and so on…) but also having implicit ideas about what kind of result would be “plausible”. The back and forth with the “algorithmic microscope” is always floating against a backdrop of “domain knowledge” and this is one of the reasons why the idea of a science based purely on data analysis is positively absurd. I believe that the key challenge is to stay clear of methodological monoculture and to articulate different approaches together whenever possible.

The Association of Internet Researchers (AOIR) is an important venue if you’re interested in, as the name indicates, Internet research. But it is also a good primary source if one wants to inquire into how and why people study the Internet, which aspects of it, etc. Conveniently for the lazy empirical researcher that I am, the AOIR has an archive of its mailing-list, which holds about 22K mails posted by 3K addresses, enough for a little playing around with the impatient person’s tool, the algorithm. I have downloaded the data and I hope I can motivate some of my students to build something interesting with it, but I just had to put it into gephi right away. Some of the tools we’ll hopefully build will concentrate more on text mining, but using an address as a node and a mail-reply relationship as a link, one can easily build a social graph, as sketched below.
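A sketch of that construction (with invented data – the real archive obviously has to be parsed first): count posts per address for node size and reply pairs for link weight, then write a minimal .gdf file that gephi can open.

from collections import Counter

# hypothetical (author, replied-to author) pairs extracted from the mail archive
mails = [("a@x.org", None), ("b@y.org", "a@x.org"),
         ("c@z.org", "a@x.org"), ("a@x.org", "b@y.org"),
         ("b@y.org", "a@x.org")]

posts = Counter(author for author, _ in mails)        # node size: number of posts
replies = Counter(pair for pair in mails if pair[1])  # link weight: number of replies

with open("airl.gdf", "w") as f:
    f.write("nodedef>name VARCHAR,posts INTEGER\n")
    for author, n in posts.items():
        f.write(f"{author},{n}\n")
    f.write("edgedef>node1 VARCHAR,node2 VARCHAR,weight INTEGER\n")
    for (source, target), weight in replies.items():
        f.write(f"{source},{target},{weight}\n")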

I would like to take this example as an occasion to show how different algorithms can produce quite different views on the same data:

So, these are the air-l posters with more than 60 messages posted since 2001. Node size indicates the number of posts, a node’s color (from blue to red) shows its connectivity in the graph (click on the image to see a much larger version). Link strength, i.e. number of replies between two people, is taken into account. You can download the full .gdf here. The only difference between the four graphs is the layout algorithm used (Force Atlas, Force Atlas with attraction distribution, Yifan Hu, and Fruchterman Reingold). You can instantly notice that Yifan Hu pushes nodes with low link count much more strongly to the periphery than the others, while Fruchterman Reingold as always keeps its symmetrical sphere shape, suggesting a more harmonious picture than the rest. Force Atlas’ attraction distribution feature will try to differentiate between hubs and authorities, pushing the former to the periphery while keeping the latter in the center; just compare Barry Wellman’s position over the different graphs.

I’ll probably repeat this experiment with a more segmented graph, but I think this already shows that layout algorithms are not just innocently rendering a graph readable. Every method puts some features of the graph to the forefront and the capacity for critical reading is as important as the willingness for “critical use” that does not gloss over the differences in tools used.

This blogpost is somewhat of an experiment that I hope will turn into a series. I have started to work seriously on a book that will suggest a somewhat different take on understanding computing and particularly contemporary software deployed on the Internet. A large part of that work consists of historical analysis and in this context I am (re)reading many of the seminal papers of the information and computer sciences. What is striking about these texts is not only their content but their far-reaching influence on the landscape of technological concepts and, often enough, on the actual technological developments that followed. Writing software today is in most cases an articulation that takes place in an extremely dense space of established languages, APIs, frameworks, and libraries but also of concepts, methodologies, best practices, tacit assumptions, strategies, and community rules. There is so much “old” in every “new” but many concepts have become so pervasive, so dominant that we no longer see them as the particularities they in fact are. Being canonical, they become second nature. But many of these path-defining moments can be retraced and given the pervasiveness of computers today, an archeology of computing is, in a way, an archeology of our culture.

One of the ways to do such an archeology may simply consist in trying to read seminal computer and information science papers sideways, not (only) as technological proposals, but as political and cultural projects that combine a (most often critical) analysis of a status quo with a prescriptive take on what a more ideal setting could/should look like. Technology is, in that sense, a way of relating to society, a means of contributing that is political in a very different way than the traditional arenas of governance and debate. What I would like to suggest is that this aspect of technological writing (science papers but also reports, RFCs, norms, proposals, documentation, etc.) is by far not examined enough, particularly when it comes to techniques that are related to software. Our view of technology is still very much shaped by the physical machine – the box, the screen, the keyboard – perhaps also because these physical parts are closer to our bodies, more visible and easier to integrate into the cognitive practices of a culture that, paradoxically, is able to produce extremely sophisticated mechanisms while being quite inept when it comes to understanding the role technical objects play in constituting its very fabric.

In my view, the central mistake is to assimilate technology to techné and be done with it. Perhaps I am wrong, but I cannot shake the feeling that very few scholars in the humanities and social sciences are prepared to accord to technological creation the same depth, complexity, variety, the same imbrication in society, the same amount of “humanity” as literature or artistic creation in general. This unwillingness to really engage technology beyond the surface leads to the familiar reflex-like reactions, both positive and negative, that seem to dominate public debates on “hot” topics like social networking, privacy on the Internet, or computer games.

So what I am looking for is a different way of understanding technology that subscribes neither to an engineering perspective concerned with function nor to a purely “culturalist” analysis that sees only imaginaries, symbols, and metaphors, thereby risking losing the machine in the machine. So, today, a first try, and why not start with a big one.

In 1970, Edgar F. Codd, a British computer scientist who moved to the US in the 1940s, published one of the most influential papers in the history of computer science, A Relational Model of Data for Large Shared Data Banks (available here, doi:10.1145/362384.362685), in which he proposed a concept for the construction of database systems built around the central idea of separating the logical organization of information from the way it is stored on a physical storage medium. While the usefulness of such a separation may seem very obvious from today’s viewpoint, Codd’s paper stirred a virulent debate and his employer, IBM, was quite reluctant when it came to turning the proposal into a product (it took eight years for the first relational database system to make it to the market). When discussing Codd’s work, we should be very suspicious of the popular narratives of technological development as a series of inventions, or worse, ideas. To separate logical organization from physical storage had been a common practice in libraries for a long time: the library catalogue, in combination with some basic shelf logistics, allows for very different ways of ordering books – alphabetically, by subject, and so on. But technologies are not simply ideas; Gene Roddenberry did not invent beaming. As science and technology studies have shown many times, a successful scientific “discovery” or a technological “invention” is somewhat of a “perfect storm”: many pieces have to fall into place, many different actors have to be mobilized, and most often there is talking, writing, demonstrating, debating, and a whole lot of fuss. As computer history shows, having an idea (Babbage) or even building a functioning machine (Zuse) may simply not be enough to establish a technology. Since the industrial revolution, technologies are increasingly often systems that require logistics, markets, organizational reform, or an installed user base. In our case, the really interesting thing is not necessarily the abstract idea for what has become today’s omnipresent relational database, but the way Codd builds an idea into a technological concept, as an argument as well as a potential system. To start, let’s quote the abstract in full:

Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.
Existing noninferential, formatted data systems provide users with tree-structured files or slightly more general network models of the data. In Section 1, inadequacies of these models are discussed. A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced. In Section 2, certain operations on relations (other than logical inference) are discussed and applied to the problems of redundancy and consistency in the user’s model. (p. 377)

First of all, who are these users that have to be “protected”? In 1970, this is obviously not (yet) the manager sitting in front of a screen and keyboard but rather the application programmer who will implement the “query, update, and report” functions every larger organization relies on for management. These users/programmers had been forced to make changes in storage structures whenever requirements changed in a significant way. This was not just an onerous task but also a source of potentially crippling problems, as every adaptation risked breaking existing applications. Without explicit reference, Codd’s work is directly related to what has come to be known as the “software crisis” that led to the emergence of software engineering. The separation of systems into black-boxed modules that communicate via well-specified interfaces was one of the solutions put forward to counter the explosion of complexity that followed the introduction of computers into large-scale, real-world (business) organizations. Seen in this light, the relational model and the concept of “data independence” (p. 377) is an extremely powerful agent for the division of labor that cleanly separates the engineering of a database system from the specification of data structures, adding to the groundwork for the concept of end-user software that we know today.

So what is Codd’s proposal? For a reader trained in the humanities, trying to read a paper like the one in question (even the first half, which does not use any formal notation) requires adaptations to the habitual reading style to get something useful out of it. Much like mathematics, computer science deploys language quite differently than the humanities (except for analytical philosophy): language, here, is not (only) narrative and argumentative, it aims at building a demonstration, which is most certainly a rhetorical form, but a very formal one that follows a convention of laying out a space of thinking through a series of very precise definitions, which often attribute quite specific significations to words taken from everyday language. Miss one of these definitions and the whole pyramid crumbles. In Codd’s case, the basic building block is the concept of relation (taken from mathematical set theory, like most reasoning about databases), which designates a basic form for structuring data where every abstract entity is composed of a series of attributes. This data structure can be “filled” with entries (rows). If you’re familiar with SQL (today’s standard query language, a derivative of Codd’s work), relation (or rather relationship, the unordered version of relation in Codd’s paper; nowadays, relation is used for Codd’s relationship and I’ll follow that convention) is simply the structure of a table. In practice, Codd suggests building databases that represent all data in a form that looks like this:

students:  name  email           major
           Jack  jack@email.com  history
           Mary  mary@email.com  science

Here, students is a relation composed of three attributes (name, email, major). Jack is a row (entry), Mary is another one. What was new in this definition is obviously not the notion of the table, but rather the idea to define a relation as a purely abstract and unordered structure, a logical construct that does not specify in any way how it is to be stored on a physical medium. An important indicator for this decoupling is Codd’s comment that “the ordering of rows is immaterial” (p. 379). Without stating it explicitly, Codd shifts the construction of order from the storage to the query. More on this later.
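Anticipating the query language discussed below, here is what the toy relation looks like in SQL terms – a sketch using Python’s built-in sqlite3 module, one of the many descendants of Codd’s model (the syntax, of course, postdates the 1970 paper):

import sqlite3

con = sqlite3.connect(":memory:")  # a throwaway in-memory database
con.execute("CREATE TABLE students (name TEXT, email TEXT, major TEXT)")
con.executemany("INSERT INTO students VALUES (?, ?, ?)",
                [("Jack", "jack@email.com", "history"),
                 ("Mary", "mary@email.com", "science")])

# how rows are stored is "immaterial"; order is constructed in the query
for row in con.execute("SELECT * FROM students ORDER BY name"):
    print(row)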

The second key concept is the notion of primary key and its corollary, the foreign key. Let’s add a primary key to our table:

students:  id  name  email           major
           1   Jack  jack@email.com  history
           2   Mary  mary@email.com  science

The primary key is a way of addressing a row of data unambiguously (student #1 is Jack and no other student; keys have to be unique). A foreign key simply means using a primary key in another table. Instead of duplicating information (which may lead to all kinds of update problems as well as storage overhead), we’re simply “pointing” from one table to another. Take the relation (table) “grades”:

grades:   students.id  english  history  geography
          1            C        C        C
          2            B        B        B

In this case, students.id (relation.attribute is the notation we still use today) is the foreign key linking to the primary key of our “students” relation. In practice this means that Jack had all Cs and Mary all Bs in the three classes they took. Codd shows that using this concept of primary/foreign key, very complex organizations of data can be produced while keeping the basic principles very simple. While both of the dominant models of the time, the tree and network models, were based on data hierarchies (that had to be rebuilt if informational practices changed), the relational model is much more flexible.
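Again as a sqlite3 sketch (in actual SQL a column cannot itself be called students.id, so the foreign key column is named student_id here):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE grades (student_id INTEGER, english TEXT,
               history TEXT, geography TEXT,
               FOREIGN KEY (student_id) REFERENCES students (id))""")
con.execute("INSERT INTO students VALUES (1, 'Jack'), (2, 'Mary')")
con.execute("INSERT INTO grades VALUES (1, 'C', 'C', 'C'), (2, 'B', 'B', 'B')")

# the JOIN follows the foreign key from grades back to students
query = """SELECT students.name, grades.english FROM students
           JOIN grades ON grades.student_id = students.id"""
for row in con.execute(query):
    print(row)  # ('Jack', 'C') and ('Mary', 'B')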

To put things into perspective: most of the world’s structured data is currently organized according to this basic form. I would guess that despite the current NoSQL hype (companies like Google and Facebook use even simpler and highly customized data structures for ultra-high speed access) more than 90% of all Web applications have a database backend based on one of the many implementations of the relational model, e.g. Oracle, MS SQL Server, MySQL, PostgreSQL, to name just a few. But data organization is only the first half of the proposal.
The next step in Codd’s paper is to reflect on a language that would allow for data retrieval and manipulation by addressing the logical organization of the data rather than its physical storage. Rather than specifying the physical location of the data, saying “I want the entries from address 0x00000 to address 0xfffff” (and we would have to know these addresses beforehand!), we could simply ask for all the entries in the table students. Remember that above, I indicated that Codd declared entry order as “immaterial”? This is because the ordering of data is no longer (merely) a property of the archive. Ordering is done in the language we use to get the data: “I want all the students, sorted alphabetically by name” (SQL: SELECT * FROM students ORDER BY name). The data structure has of course to be prepared for the kind of queries we will want to make, but in our example, I could group my list by major, sort it by email, or, by “joining” our two tables, order by grade average. More elaborate queries would allow me to select the 25% of students with the best grade average or to plot the grade evolution over the years if I had that data.
A data retrieval and manipulation language would have to do more than just query and this quote summarizes the requirements:

A set so specified may be fetched for query purposes only, or it may be held for possible changes. Insertions take the form of adding new elements to declared relations without regard to any ordering that may be present in their machine representation. Deletions which are effective for the community (as opposed to the individual user or sub-communities) take the form of removing elements from declared relations. (p. 382)

These are the four building blocks of every database system I have worked with (again using SQL): SELECT (query a database using different parameters for searching and ordering, e.g. get all students with a certain grade average), INSERT (insert new data into a table, e.g. add a new student into students), UPDATE (change data, e.g. change a student’s grade after accepting a bribe), DELETE (erase data, e.g. expel a student for offering you a bribe). Such a language – Codd proposed the Alpha language in the 1970s, but IBM’s SQL (structured query language; Larry Ellison of Oracle actually was the first to bring an SQL-based product to the market and consequently became one of the richest people on the planet) largely won out – would again “protect” the user from having to interact with anything but the data organization specified in the terms of the relational model.
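A compact sqlite3 sketch of the four operations (bribe included):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade TEXT)")

con.execute("INSERT INTO students VALUES (1, 'Jack', 'C')")  # INSERT: a new student
con.execute("UPDATE students SET grade = 'A' WHERE id = 1")  # UPDATE: the bribe pays off
print(con.execute("SELECT * FROM students WHERE grade = 'A'").fetchall())  # SELECT
con.execute("DELETE FROM students WHERE id = 1")             # DELETE: expelled after all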
In the rest of the paper, Codd tackles a series of problems that could arise in the implementation of actual systems (and what we would call a “storage engine” today) based on the relational model, but this part is less interesting for my purposes.

I would like, however, to propose a couple of comments that may help putting things into a larger perspective:

1) The central critique of Codd’s proposals came from programmers and engineers who abhorred the loss of control (and potentially performance) over the actual organization of data storage on the physical medium and the dangers such black-boxing may pose to data integrity in the case of dysfunction or accident. But in the 1980s the demands for more flexibility and cost control won the day, driven by lower hardware costs and better techniques for securing data. This evolution towards layering, modularity, and a general “abstraction” from the hardware has happened in all fields of computing and, indeed, the loss of control and visibility is most often the prime concern. In a sense, software has followed a similar trajectory as social organization, from community to society (and back, whenever there is a new frontier to homestead), that is, from small-scale teams and organizations to the large-scale efforts of companies like Microsoft or Oracle. Abstraction techniques like Codd’s played a central role here as enablers of the division of labor. They also permitted – and this is crucial – a much tighter integration between management processes and information technology. The moment information structures are “liberated” from questions of physical storage, they can be implemented in flexible, end-user-friendly software packages, which makes it possible for management to interact much more directly with data. The rise of Business Intelligence and Decision Support Systems would have been much less spectacular without the relational model turning “information” into the malleable material it has become.

2) While I am of course tempted to write something like “The decoupling of the logical structure of data from physical storage and the immense power and flexibility afforded by query languages have led to the emergence of late-modern network economies.”, this would be too quick and easy. The relational database, the powerful query languages, and the business control and intelligence functions they enable are certainly a central part of the informational infrastructure that supports contemporary economic organization. Data, once collected, can be interrogated from every possible angle and automatic reporting (which is no more than a series of very elaborate SQL queries over a large number of tables) has introduced incredible speed into business processes, while keeping up an illusion of control. Illusion, because just like any formal model of reality, data and query models are necessarily reductionist. At the same time, databases are themselves part of a much longer trend in management that started with systems management in the late 19th century. We’re snowballing from one information age to the next and technologies like the relational model are as much enablers as results, causes and effects.

3) The relational database is part of a much larger transformation in how documents, information, and knowledge are handled. From the library catalog to documentation centers and further on to data banks, information retrieval, and data mining, we see a steady growth in the attention being paid to the logistics, organization, and “exploitation” of an ever faster growing mountain of texts, images, sounds, and so forth. The relational model not only helps with classic tasks such as storage and retrieval, it shares in the birth of what could be called the “automated production of knowledge”, i.e. the creation of new information from cross-referencing, comparing, statistically examining, synthesizing, and representing large quantities of information. Whether these automated processes (think reporting, data mining, etc.) produce “real” knowledge is a rather stale question; it is much more important to emphasize how businesses and other organizations have come to depend on these tools for everyday management and decision-making. Query languages built on Codd’s proposal constitute the foundation for these developments.

There would be much more to say about Codd’s work and the relational database but I want to close by going back to the initial question about reading computer science from a humanities perspective. A classic analysis of language and use of metaphors would probably have proceeded quite differently and would have homed in on things like the “protection” of users or citations such as this footnote:

Naturally, as with any data put into and retrieved from a computer system, the user will normally make far more effective use of the data if he is aware of its meaning. (p. 380)

Imaginaries are indeed important aspects of an archeology of computing but even in written form, computer science is, in a way, always looking elsewhere, beyond the text, and Codd points to this “elsewhere” in his last paragraph:

Nevertheless, the material presented should be adequate for experienced systems programmers to visualize several approaches. (p. 387)

What Codd asks the reader to visualize is the laboratory of computer science, the site where things come together, the working system. While the discursive aspects are certainly important, I feel that function is central to the poetics of the technical sciences and if we want to understand their cultural significance we have to read them both as texts and as functional blueprints.

If we want to understand the plethora of very specific roles computers play in today’s world, the question “What is software?” is inevitable. Many different answers have been articulated from different viewpoints and different positions – creator, user, enterprise, etc. – in the networks of practices that surround digital objects. From a scholarly perspective, the question is often tied to another one, “Where does software come from?”, and is connected to a history of mathematical thought and the will/pressure/need to mechanize calculation. There we learn, for example, that the term “algorithm” is derived from the name of the Persian mathematician al-Khwārizmī and that in mathematical textbooks from the middle ages, the term algorism is used to denote the basic arithmetic techniques – which we now learn in grammar school – that break down e.g. the multiplication of large numbers into a series of smaller operations. We learn first about Pascal, Babbage, and Lady Lovelace and then about Hilbert, Gödel, and Turing, about the calculation of projectile trajectories, about cryptography, the halting problem, and the lambda calculus. The heroic history of bold pioneers driven by an uncompromising vision continues into the PC (Engelbart, Kay, the Steves, etc.) and Network (Engelbart again, Cerf, Berners-Lee, etc.) eras. These trajectories of successive invention (mixed with a sometimes exaggerated emphasis on elements from the arsenal of “identity politics”, counter-culture, hacker ethos, etc.) are an integral part of answering our twin question, but they are not enough.

A second strand of inquiry has developed in the slipstream of the monumental work by economic historian Alfred Chandler Jr. (The Visible Hand), who placed the birth of computers and software in the flux of larger developments like industrialization (and particularly the emergence of the large-scale enterprise in the late 19th century), bureaucratization, (systems) management, and the general history of modern capitalism. The books by James Beniger (The Control Revolution), JoAnne Yates (Control through Communication and more recently Structuring the Information Age), James W. Cortada (most notably The Digital Hand in three volumes), and others deepened the economic perspective, while Paul N. Edwards’ The Closed World and Jon Agar’s The Government Machine look more closely at the entanglements between computers and government (bureaucracy). While these works supply a much needed corrective to the heroic accounts mentioned above, they rarely go beyond the 1960s and do not aim at understanding the specifics of computer technology and software beyond their capacity to increase efficiency and control in information-rich settings (I have not yet read Martin Campbell-Kelly’s From Airline Reservations to Sonic the Hedgehog; the title is a downer but I’m really curious about the book).

Lev Manovich’s The Language of New Media is perhaps the most visible work of a third “school”, where computers (equipped with GUIs) are seen as media born from cinema and other analogue technologies of representation (remember Computers as Theatre?). Clustering around an illustrious theoretical neighborhood populated by McLuhan, Metz, Barthes, and many others, these works used to dominate the “XY studies” landscape of the 90s and early 00s before all the excitement went to Web 2.0, participation, amateur culture, and so on. This last group could be seen as a fourth strand, but people like Clay Shirky and Yochai Benkler focus so strongly on discontinuity that the question of historical filiation is simply not relevant to their intellectual project. History is there to be baffled by both present and future.

This list could go on, but I do not want to simply inventory work on computers and software but to make the following point: there is a pronounced difference between the questions “What is software?” and “What is today’s software?”. While the first one is relevant to computational theory, software engineering, analytical philosophy, and (curiously) cognitive science, there is no direct line from universal Turing machines to our particular landscape with the millions of specific programs written every year. Digital technology is so ubiquitous that the history of computing is caught up with nearly every aspect of the development of western societies over the last 150 years. Bureaucratization, mass-communication, globalization, artistic avant-garde movements, transformations in the organization of labor, expert movements in public administrations, big science, library classifications, the emergence of statistics, minority struggles, two world wars and too many smaller conflicts to count, accounting procedures, stock markets and the financial crisis, politics from fascism to participatory democracy,… – all of these elements can be examined in connection with computing, shaping the tools and being shaped by them in return. I am starting to believe that for the humanities scholar or the social scientist the question “What is software?” is only slightly less daunting than “What is culture?” or “What is society?”. One thing seems sure: we can no longer pretend to answer the latter two questions without bumping into the first one. The problem for the author, then, becomes to choose the relevant strands, to untangle the mess.

In my view, there is a case to be made for a closer look at the role the library and information sciences played in the development of contemporary software techniques, most obviously on the Internet, but not exclusively. While Bush’s Memex has perhaps been commented on somewhat beyond its actual relevance, the work done by people such as Eugene Garfield (citation analysis), Calvin M. Mooers (information retrieval), Hans Peter Luhn (KWIC), Edgar Codd (the relational database) or Gerard Salton (the vector space model) from the 1950s on has not been worked on much outside of specialist circles – despite the fact that our current ways of working with information (yes, this includes your Facebook profile, everything Google is doing, cloud computing, mobile applications and all the other cool stuff Wired writes about) have left behind the logic of the library catalog quite some time ago. This is also where today’s software comes from.

Gabriel Tarde is a wellspring of interesting – and sometimes positively weird – ideas. In his 1899 article L’opinion et la conversation (reprinted in his 1901 book L’opinion et la foule), the French judge/sociologist makes the following comment:

Il n’y [dans un Etat féodal, BR] avait pas “l’opinion”, mais des milliers d’opinions séparées, sans nul lien continuel entre elles. Ce lien, le livre d’abord, le journal ensuite et avec bien plus d’efficacité, l’ont seuls fourni. La presse périodique a permis de former un agrégat secondaire et très supérieur dont les unités s’associent étroitement sans s’être jamais vues ni connues. De là, des différences importantes, et, entre autre, celles-ci : dans les groupes primaires [des groupes locales basés sur la conversation, BR], les voix ponderantur plutôt que numerantur, tandis que, dans le groupe secondaire et beaucoup plus vaste, où l’on se tient sans se voir, à l’aveugle, les voix ne peuvent être que comptées et non pesées. La presse, à son insu, a donc travaillé à créer la puissance du nombre et à amoindrir celle du caractère, sinon de l’intelligence.

After a quick survey, I haven’t found an English translation anywhere – there might be one in here – so here’s my own (taking some liberties to make it easier to read):

[In a feudal state, BR] there was no “opinion” but thousands of separate opinions, without any steady connection between them. This connection was only delivered by first the book, then, and with greater efficiency, the newspaper. The periodical press allowed for the formation of a secondary and higher-order aggregate whose units associate closely without ever having seen or known each other. Several important differences follow from this, amongst others, this one: in primary groups [local groups based on conversation, BR], voices ponderantur rather than numerantur, while in the secondary and much larger group, where people connect without seeing each other – blind – voices can only be counted and cannot be weighed. The press has thus unknowingly labored towards giving rise to the power of the number and reducing the power of character, if not of intelligence.

Two things are interesting here: first, Lazarsfeld, Berelson, and Gaudet’s classic study from 1944, The People’s Choice, and even more so Lazarsfeld’s canonical Personal Influence (with Elihu Katz, 1955) are seen as a rehabilitation of the significance (for the formation of opinion) of interpersonal communication at a time when media were considered all-powerful brainwashing machines by theorists such as Adorno and Horkheimer (Adorno actually worked with/for Lazarsfeld in the 1930s, where Lazarsfeld tried to force poor Adorno into “measuring culture”, which may have soured the latter on any empirical inquiry, but that’s a story for another time). Tarde’s work on conversation (the first-order medium) is theoretically quite sophisticated – floating against the backdrop of Tarde’s theory of imitation as the basic mechanism of cultural production – and actually succeeds in thinking together everyday conversation and mass media without creating any kind of onerous dichotomy. L’opinion et la conversation would merit inclusion in any history of communication science and it should come as no surprise that Elihu Katz actually published a paper on Tarde in 1999.

Second, the difference between ponderantur (weighing) and numerantur (counting) is at the same time rather self-evident – an object’s weight and its number are logically quite different things – and somewhat puzzling: it reminds us that while measurement does indeed create a universe of number where every variable can be compared to any other, the aspects of reality we choose to measure remain connected to a conceptual backdrop that is by itself neither numerical nor mathematical. What Tarde calls “character” is a person’s capacity to influence, to entice imitation, not the size of her social network.

I’m currently working on a software tool that helps with studying Twitter, and while sifting through the literature I came across this citation from a 2010 paper by Cha et al.:

We describe how we collected the Twitter data and present the characteristics of the top users based on three influence measures: indegree, retweets, and mentions.

Besides the immense problem of defining influence in non-trivial terms, I wonder whether many of the studies on (social) networks that pop up all over the place are hoping to weigh but end up counting again. What would it mean, then, to weigh a person’s influence? What kind of concepts would we have to develop and what could be indicators? In our project we use the bit.ly API to look at clickstream referrers – if several people post the same link, who succeeds in getting the most people to click it – but this may be yet another count that says little or nothing about how a link will be used/read/received by a person. But perhaps this is as far as the “hard” data can take us. And is that really a problem? The one thing I love about Tarde is how he can jump from a quantitative worldview to beautiful theoretical speculation and back with a smile on his face…

Over the last year, I have been reading loads of books in and on Information Science, paying special attention to key texts in the (pre)history of the discipline. Fritz Machlup and Una Mansfield’s monumental anthology The Study of Information (Wiley & Sons, 1983) has been a pleasure to read and there are several passages in the foreword that merit a little commentary. I have always wondered why Shannon’s Mathematical Theory of Communication from 1948 has been such a reference point in the discipline I started out in, communication science. Dealing with purely technological problems and packed with formulas that very, very few social science scholars could make sense of, the whole thing seems like a misunderstanding. The simplicity and clearness of the schema on page two – which has been built into the canonical sender-receiver model – cannot be the only reason for the exceptional (mostly second- or third-hand) reception the text has enjoyed. In Machlup & Mansfield’s foreword one can find some strong words on the question of why a work on engineering problems that excludes even the slightest reference to matters of human understanding came to be cited in probably every single introduction to communication science:

“When scholars were chiefly interested in cognitive information, why did they accept a supposedly scientific definition of ‘information apart from meaning’? One possible explanation is the fact that they were impressed by a definition that provided for measurement. To be sure, measurement was needed for the engineering purposes at hand; but how could anybody believe that Shannon’s formula would also measure information in the sense of what one person tells another by word of mouth, in writing, or in print?
We suspect that the failure to find, and perhaps impossibility of finding, any ways of measuring information in this ordinary sense has induced many to accept measurable signal transmission, channel capacity, or selection rate, misnamed amount of information, as a substitute or proxy for information. The impressive slogan, coined by Lord Kelvin, that ‘science is measurement’ has persuaded many researchers who were anxious to qualify as scientists to start measuring things that cannot be measured. As if under a compulsion, they looked for an operational definition of some aspect of communication or information that stipulated quantifiable operations. Shannon’s formula did exactly that; here was something related to information that was objectively measurable. Many users of the definition were smart enough to realize that the proposed measure – perfectly suited for electrical engineering and telecommunication – did not really fit their purposes; but the compulsion to measure was stronger than their courage to admit that they were not operating sensibly.” (p. 52)

For Machlup & Mansfield – who, as trained (neoclassical) economists, should not be deemed closet postmodernists – this compulsion to measure is connected to implicit hierarchies in academia where mathematical rationality reigns supreme. A couple of pages further on, the authors’ judgment becomes particularly harsh:

“This extension of information theory, as developed for communication engineering, to other quite different fields has been a methodological disaster – though the overenthusiastic extenders did not see it, and some of them, who now know that it was an aberration, still believe that they have learned a great deal from it. In actual fact, the theory of signal transmission or activating impulses has little or nothing to teach that could be extended or applied to human communication, social behavior, or psychology, theoretical or experimental.” (p. 56)

Shannon himself avoided the term “information theory” and his conception of communication obviously had nothing to do with what the term has come to mean in the social sciences and general discourse. But the need to show that the social sciences could be “real” sciences in search of laws formulated in mathematical terms proved stronger than the somewhat obvious epistemological mismatch.
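For reference, the formula in question measures the average uncertainty (in bits per symbol) of a source selecting among symbols with probabilities p_i:

H = -\sum_{i=1}^{n} p_i \log_2 p_i

Nothing in it refers to what the symbols mean – which is precisely Machlup & Mansfield’s point.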

Like many classic texts, Machlup & Mansfield’s work offers a critique that is not based on dismissal or handbag relativism but on deep engagement with the complexities of the subject matter and long experience with interdisciplinary work, which, necessarily, makes one bump into unfamiliar concepts, methods, ontological preconceptions, modes of reasoning, vectors of explanation, and epistemological urges (what is your knowledge itch? how do you want to scratch it?). The Study of Information is a pleasure to read because it brings together very different fields without proposing some kind of unifying meta-concept or imperialist definition of what science – the quest for knowledge – should look like.

My colleague Theo Röhle and I went to the Computational Turn conference this week. While I would have preferred to hear a bit more on truly digital research methodology (in the fully scientific sense of the word “method”), the day was really quite interesting and the weather unexpectedly gorgeous. Most of the papers are available on the conference site, make sure to have a look. The text I wrote with Theo tried to structure some of the epistemological challenges and problems to take into account when working with digital methods. Here’s a tidbit:

…digital technology is set to change the way scholars work with their material, how they “see” it and interact with it. The question is, now, how well the humanities are prepared for these transformations. If there truly is a paradigm shift on the horizon, we will have to dig deeper into the methodological assumptions that are folded into the new tools. We will need to uncover the concepts and models that have carried over from different disciplines into the programs we employ today…