Reading Technology 1: Edgar F. Codd, A Relational Model of Data for Large Shared Data Banks, 1970

Posted by Bernhard on September 30th 2010 @ 11:46 pm

This blogpost is somewhat of an experiment that I hope will turn into a series. I have started to work seriously on a book that will suggest a somewhat different take on understanding computing and particularly contemporary software deployed on the Internet. A large part of that work consists of historical analysis and in this context I am (re)reading many of the seminal papers of the information and computer sciences. What is striking about these texts is not only their content but their far-reaching influence on the landscape of technological concepts and, often enough, on the actual technological developments that followed. Writing software today is in most cases an articulation that takes place in an extremely dense space of established languages, APIs, frameworks, and libraries but also of concepts, methodologies, best practices, tacit assumptions, strategies, and community rules. There is so much “old” in every “new” but many concepts have become so pervasive, so dominant that we no longer see them as the particularities they in fact are. Being canonical, they become second nature. But many of these path-defining moments can be retraced and given the pervasiveness of computers today, an archeology of computing is, in a way, an archeology of our culture.

One of the ways to do such an archeology may simply consist in trying to read seminal computer and information science papers sideways, not (only) as technological proposals, but as political and cultural projects that combine a (most often critical) analysis of a status quo with a prescriptive take on how a more ideal setting could/should look like. Technology is, in that sense, a way of relating to society, a means of contributing that is political in a very different way than the traditional arenas of governance and debate. What I would like to suggest is that this aspect of technological writing (science papers but also reports, RFCs, norms, proposals, documentation, etc.) is by far not examined enough, particularly when it comes to techniques that are related to software. Our view of technology is still very much shaped by the physical machine – the box, the screen, the keyboard – perhaps also because these physical parts are closer to our bodies, more visible and easier to integrate into the cognitive practices of a culture that, paradoxically, is able to produce extremely sophisticated mechanisms while being quite inept when it comes to understanding the role technical objects play in constituting its very fabric.

In my view, the central mistake is to assimilate technology to techné and be done with it. Perhaps I am wrong, but I cannot shake the feeling that very few scholars in the humanities and social sciences are prepared to accord to technological creation the same depth, complexity, variety, the same imbrication in society, the same amount of “humanity” than literature or artistic creation in general. This unwillingness to really engage technology beyond the surface leads to the familiar reflex-like reactions, both positive and negative, that seem to dominate public debates on “hot” topics like social networking, privacy on the Internet, or computer games.

So what I am looking for is a different way of understanding technology that subscribes neither to an engineering perspective concerned with function nor to a purely “culturalist” analysis that sees only imaginaries, symbols, and metaphors, thereby risking to loose the machine in the machine. So, today, first try and why not start with a big one.

In 1970, Edgar F. Codd, a British computer scientist who moved to the US in the 1940s, published one of the most influential papers in the history of computer science, A Relational Model of Data for Large Shared Data Banks (available here, doi:10.1145/362384.362685), in which he proposed a concept for the construction of database systems built around the central idea of separating the logical organization of information from the way it is stored on a physical storage medium. While the usefulness of such a separation may seem very obvious from today’s viewpoint, Codd’s paper stirred a virulent debate and his employer, IBM, was quite reluctant when it came to turning the proposal into a product (it took eight years for the first relational database system to make it to the market). When discussing Codd’s work, we should be very suspicious of the popular narratives of technological development as a series of inventions, or worse, ideas. To separate logical organization from physical storage had been a common practice in libraries for a long time: the library catalogue, in combination with some basic shelf logistics, allows for very different ways of recording books – alphabetically, by subject, and so on. But technologies are not simply ideas; Gene Roddenberry did not invent beaming. As science and technology studies have shown many times, a successful scientific “discovery” or a technological “invention” is somewhat of a “perfect storm”: many pieces have to fall into place, many different actors have to be mobilized, and most often there is talking, writing, demonstrating, debating, and a whole lot of fuzz. As computer history shows, having an idea (Babbage) or even building a functioning machine (Zuse) may simply not be enough to establish a technology. Since the industrial revolution, technologies are increasingly often systems that require logistics, markets, organizational reform, or an installed user base. In our case, the really interesting thing is not necessarily the abstract idea for what has become today’s omnipresent relational database, but the way Codd builds an idea into a technological concept, as an argument as well as a potential system. To start, let’s quote the abstract in full:

Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.
Existing noninferential, formatted data systems provide users with tree-structured files or slightly more general network models of the data. In Section 1, inadequacies of these models are discussed. A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced. In Section 2, certain operations on relations (other than logical inference) are discussed and applied to the problems of redundancy and consistency in the user’s model. (p. 377)

First of all, who are these users that have to be “protected”? In 1970, this is obviously not (yet) the manager sitting in front of a screen and keyboard but rather the application programmer that will implement the “query, update, and report” functions every larger organizations rely on for management. These users/programmers had been forced to make changes in storage structures whenever requirements changed in a significant way. This was not just an onerous task but also a source of potentially crippling problems as every adaptation risked breaking existing applications. Without explicit reference, Codd’s work is directly related to what has become to be known as the “software crisis” that lead to the emergence of software engineering. The separation of systems into black-boxed modules that communicated via well-specified interfaces was one of the solutions put forward to counter the explosion of complexity that followed the introduction of computers into large-scale, real-world (business) organizations. Seen in this light, the relational model and the concept of “data independence” (p. 377) is an extremely powerful agent for the division of labor that cleanly separates the engineering of a database system from the specification of data structures, adding to the ground work for the concept of end-user software that we know today.

So what is Codd’s proposal? For a reader trained in the humanities trying to read a paper like the one in question (even the first half, which does not use any formal notation), adaptions to the habitual reading style have to be made to get something useful out of it. Much like mathematics, computer science deploys language quite differently than the humanities (except for analytical philosophy): language, here, is not (only) narrative and argumentative, it aims a building a demonstration, which is most certainly a rhetorical form, but a very formal one that follows a convention consisting of laying out a space of thinking through a series of very precise definitions, which often attribute quite specific significations to words taken from everyday language. Miss one of these definitions and the whole pyramid crumbles. In Codd’s case, the basic building block is the concept of relation (taken from mathemataical set theory, like most reasoning about databases), which designates a basic form for structuring data where every abstract entity is composed of a series of attributes. This data structure can be “filled” with entries (rows). If you’re familiar with SQL (today’s standard query language, derivative of Codd’s work), relation (or rather relationship, the unordered version of relation in Codd’s paper; nowadays, relation is used for Codd’s relationship and I’ll follow that convention) is simply the structure of a table. In practice, Codd suggest to build databases that represent all data in a from that looks like this:

students:  name  email           major
           Jack  jack@email.com  history
           Mary  mary@email.com  science

Here, students is a relation composed of three attributes (name, email, number). Jack is a row (entry), Mary is another one. What was new in this definition is obviously not the notion of the table, but rather the idea to define a relation as a purely abstract and unordered structure, a logical construct that did not specify in any way how it was to be stored on a physical medium. An important indicator for this decoupling is Codd’s comment that “the ordering of rows is immaterial” (p. 379). Without stating it explicitly, Codd shifts the construction of order from the storage to the query. More on this later.

The second key concept is the notion of primary key and its corollary, the foreign key. Let’s add a primary key to our table:

students:  key  name  email           major
           1    Jack  jack@email.com  history
           2    Mary  mary@email.com  science

The primary key is a way of addressing a row of data unambiguously (student #1 is Jack and no other student, keys have to be unique). The idea of a foreign key means to simply use a primary key in another table. Instead of doubling information (which may lead to all kinds of update problems as well as storage overhead), we’re simply “pointing” from one table to another. Take the relation (table) “grades”:

grades:   student.id  english  history  geography
          1           C        C        C
          2           B        B        B

In this case, students.id (relation.attribute is the notation we still use today) is the foreign key linking to the primary key of our “students” relation. In practice this means that Jack had all Cs and Mary all Bs in the three classes they took. Codd shows that using this concept of primary/foreign key, very complex organizations of data can be produced while keeping the basic principles very simple. While both of the dominant models of the time, the tree and network models, were based on data hierarchies (that had to be rebuilt if informational practices changed), the relational model is much more flexible.

To put things into perspective: most of the world’s structured data is currently organized according to this basic form. I would guess that despite the current NoSQL hype (companies like Google and Facebook use even simpler and highly customized data structures for ultra-high speed access) more than 90% of all Web applications have a database backend based on one of the many implementations of the relational model, e.g. Oracle, MS SQL Server, MySQL, PostgreSQL, to name just a few. But data organization is only the first half of the proposal.
The next step in Codd’s paper is to reflect on a language that would allow for data retrieval and manipulation by addressing the logical organization of the data rather than its physical storage. Rather than specifying the physical location of the data, saying “I want the entries from address 0x00000 to address 0xfffff” (and we would have to know these addresses beforehand!), we could simply ask for all the entries in the table students. Remember that above, I indicated that Codd declared entry order as “immaterial”? This is because the ordering of data is no longer (merely) a property of the archive. Ordering is done in the language we use to get the data: “I want all the students, sorted alphabetically by name” (SQL: SELECT * FROM students ORDER BY name). The data structure has of course be prepared for the kind of queries we will want to make, but in our example, I could group my list by major, sort it by email, or, by “joining” our two tables, order by grade average. More elaborate queries would allow me to select the 25% percent students with the best grade average or to plot the grade evolution over the years if I have that data.
A data retrieval and manipulation language would have to do more than just query and this quote summarizes the requirements:

A set so specified may be fetched for query purposes only, or it may be held for possible changes. Insertions take the form of adding new elements to declared relations without regard to any ordering that may be present in their machine representation. Deletions which are effective for the community (as opposed to the individual user or sub- communities) take the form of removing elements from declared relations. (p. 382)

These are the four building blocks of every database system I have worked with (again using SQL): SELECT (query a database using different parameters for searching and ordering, e.g. get all students with a certain grade average), INSERT (insert new data into a table, e.g. add a new student into students), UPDATE (change data, e.g. change a student’s grade after accepting a bribe), DELETE (erase date, e.g. expel a student for offering you a bribe). Such a language – Codd will propose the Alpha language in the 1970s but IBMs SQL (structured query language; Larry Ellison of Oracle actually was the first to bring a SQL based product to the market and consequently became one of the richest people on the planet) largely won out – would again “protect” the user from having to interact with anything but the data organization specified in the terms of the relational model.
In the rest of the paper, Codd tackles a series of problems that could arise in the implementation of actual systems (and what we would call a “storage engine” today) based on the relational model, but this part is less interesting for my purposes.

I would like, however, to propose a couple of comments that may help putting things into a larger perspective:

1) The central critique of Codd’s proposals came from programmers and engineers that abhorred the loss of control (an potentially performance) over the actual organization of data storage on the physical medium and the dangers such a black-boxing may pose to data integrity in the case of dysfunction or accident. But in the 1980s the demands for more flexibility and cost control won the day, driven by lower hardware costs and better techniques for securing data. This evolution towards layering, modularity, and a general “abstraction” from the hardware has happened in all fields of computing and, indeed, the loss of control and visibility is most often the prime concern. In a sense, software has followed a similar trajectory as social organization, from community to society (and back, whenever there is a new frontier to homestead), that is from small-scale teams and organizations to the large-scale efforts of companies like Microsoft or Oracle. Abstraction techniques like Codd’s played a central role here as enablers of division of labor. It also permitted – and this is crucial – a much tighter integration between management processes and information technology. The moment information structures are “liberated” from questions of physical storage, they can be implemented in flexible, end-user friendly software packages, which makes it possible for management to interact much more directly with data. The rise of Business Intelligence and Decision Support Systems would have been much less spectacular without the relational model turning “information” into the malleable material it has become.

2) While I am of course tempted to write something like “The decoupling of the logical structure of data from physical storage and the immense power and flexibility afforded by query languages have led to the emergence of late-modern network economies.”, this would be too quick and easy. The relational database, the powerful query languages, and the business control and intelligence functions they enable are certainly a central part of the informational infrastructure that supports contemporary economic organization. Data, once collected, can be interrogated from every possible angle and automatic reporting (which is no more than a series of very elaborate SQL queries over a large number of tables) has introduced incredible speed into business processes, while keeping up an illusion of control. Illusion, because just like any formal model of reality, data and query models are necessarily reductionist. At the same time, databases are themselves part of a much longer trend in management that started with systems management in the late 19th century. We’re snowballing from one information age to the next and technologies like the relational model are as much enablers as results, causes and effects.

3) The relational database is part of a much larger transformation in how documents, information, and knowledge are handled. From the library catalog to documentation centers and further on to data banks, information retrieval, and data mining, we see a steady growth in the attention being payed to the logistics, organization, and “exploitation” of an always faster growing mountain of texts, images, sounds, and so forth. The relational model not only helps with classic tasks such as storage and retrieval, it shares in the birth of the what could be called the “automated production of knowledge”, i.e. the creation of new information from cross-referencing, comparing, statistically examining, synthesizing, and representing large quantities of information. Whether these automated processes (think reporting, data mining, etc.) produce “real” knowledge is a rather stale question; it is much more important to emphasize how businesses and other organizations have come to depend on these tools for everyday management and decision-making. Query languages built on Codd’s proposal constitute the foundation for these developments.

There would be much more to say about Codd’s work and the relational database but I want to close by going back to the initial question about reading computer science from a humanities perspective. A classic analysis of language and use of metaphors would probably have proceeded quite differently and would have homed in on things like the “protection” of users or citations such as this footnote:

Naturally, as with any data put into and retrieved from a computer system, the user will normally make far more effective use of the data if he is aware of its meaning. (p. 380)

Imaginaries are indeed important aspects of an archeology of computing but even in written form, computer science is, in a way, always looking elsewhere, beyond the text, and Codd points to this “elsewhere” in his last paragraph:

Nevertheless, the material presented should be adequate for experienced systems programmers to visualize several approaches. (p. 387)

What Codd asks the reader to visualize is the laboratory of computer science, the site where things come together, the working system. While the discursive aspects are certainly important, I feel that function is central to the poetics of the technical sciences and if we want to understand their cultural significance we have to read them both as texts and as functional blueprints.

Post filed under computing, database, epistemolgy, reading technology, technological determinism.

The Politics of Systems

Reading Technology 1: Edgar F. Codd, A Relational Model of Data for Large Shared Data Banks, 1970

One Comment

Leave a Reply to Hugo Tavares Cancel reply

Categories

Archives

Blogroll

Meta