Category Archives: network theory
Over the last couple of years, the social sciences have become increasingly interested in using computer-based tools to analyze the complexity of the social ant farm that is the Web. Issuecrawler was one of the first such tools, and today researchers are using very sophisticated pieces of software to “see” the Web. Sciences-Po, one of these rather strange French institutions that were founded to educate the elite but now increasingly have to justify their existence by producing research, has recently hired Bruno Latour to head their new médialab, which will most probably move in that very direction. Given Latour’s background (and the fact that Paul Girard, a very competent former colleague at my lab, heads the R&D department), this should be really interesting. I do hope that there will be occasion to tackle the most compelling methodological question when it comes to the application of computers (or mathematics in general) to the analysis of human life, a question beautifully framed in a rather reluctant statement from 1889 by Karl Pearson, a major figure in the history of statistics:
“Personally I ought to say that there is, in my own opinion, considerable danger in applying the methods of exact science to problems in descriptive science, whether they be problems of heredity or of political economy; the grace and logical accuracy of the mathematical processes are apt to so fascinate the descriptive scientist that he seeks for sociological hypotheses which fit his mathematical reasoning and this without first ascertaining whether the basis of his hypotheses is as broad as that human life to which the theory is to be applied.” Cited in Stigler, Stephen M.: The History of Statistics. Harvard University Press, 1990, p. 304.
This morning Jonah Bossewitch pointed me to an article over at Wired, authored by Chris Anderson, which announces “The End of Theory”. The article’s main argument is in itself not very interesting for anybody with a knack for epistemology – Anderson has apparently never heard of the induction / deduction debate and has only a limited idea of what statistics does – but there is a very interesting question lurking somewhere behind all the Californian Ideology, and the following citation points right to it:
We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
One could point to the fact that the natural sciences have had their experimental side for quite a while (Roger Bacon advocated his scientia experimentalis in the 13th century) and that a laboratory is in a sense a pattern-finding machine where induction continuously plays an important role. What interests me more, though, is Anderson’s insinuation that statistical algorithms are not models. Let’s just look at one of the examples he uses:
Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required.
This is a very limited understanding of what constitutes a model. I would argue that PageRank does in fact rely very explicitly on a model which combines several layers of justification. In their seminal paper on Google, Brin and Page write the following:
PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank.
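That description translates quite directly into a small computation. Here is a minimal sketch of the power iteration the random-surfer model implies; the toy graph and the damping factor of 0.85 are my own assumptions for illustration, not values taken from the paper:

```python
# A toy "random surfer" PageRank as power iteration.
# The graph and the damping factor (0.85) are assumptions for illustration.

links = {
    "A": ["B", "C"],   # page A links to pages B and C
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # the surfer starts on a random page
    for _ in range(iterations):
        # with probability (1 - damping) the surfer "gets bored" and jumps anywhere
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # otherwise she keeps clicking: each outlink gets an equal share
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank(links))  # the stationary probabilities, i.e. the PageRanks
```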
The assumption behind this graph-oriented justification is that people do not place links randomly but with purpose. Linking implies attribution of importance: we don’t link to documents we’re indifferent about. The statistical exploration of the huge graph that is the Web is indeed oriented by this basic assumption, and it adds the quite contestable rule that what is deemed important by the greatest number of linkers shall be most visible. I would, then, argue that there is no experimental method that is purely inductive, not even neural networks. Sure, on the mathematical side we can explore data without limitations concerning its dimensionality, i.e. the number of characteristics that can be taken into account; the method of gathering data, however, is always a process of selection that is influenced by some idea or intuition which at least implicitly has the character of a model. There is a deductive side to even the most inductive approach. Data is made, not given, and every projection of that data is oriented. To quote Fernando Pereira:
[W]ithout well-chosen constraints — from scientific theories — all that number crunching will just memorize the experimental data.
As Jonah points out, Anderson’s article is probably a straw man argument whose sole purpose is to attract attention, but it points to something that is really important: too many people think that mathematical methods for knowledge discovery (data mining, that is) are neutral and objective tools that will find what’s really there and show the world as it is, without the stain of human intentionality; these algorithms are therefore not seen as objects of political inquiry. In this view, statistics is all about counting facts and only higher layers of abstraction (models, theories, …) can have a political dimension. But it matters what we count and how we count.
In the end, Anderson’s piece is little more than the habitual prostration before the altar of emergence and self-organization. Just exchange the invisible hand for the invisible brain and you’ll get pop epistemology for hive minds…
Two things currently stand out in my life: a) I’m working on an article on the relationship between mathematical network analysis and the humanities, and b) continental
Part of the research that I’m looking into is what has been called “The New Science of Networks” (NSN), a field founded mostly by physicists and mathematicians who started to quantitatively analyze very big networks belonging to very different domains (networks of acquaintance, the Internet, food networks, brain connectivity, movie actor networks, disease spread, etc.). Sociologists have worked with mathematical analysis and network concepts since at least the 1930s, but because of the limits of available data, the networks studied rarely went beyond hundreds of nodes. NSN, however, studies networks with millions of nodes and tries to come up with representations of structure, dynamics and growth that are not just used to make sense of empirical data but also to build simulations and to develop models that are independent of specific domains of application.
Very large data sets have only become available quite recently: social network data used to be based on either observation or surveys and was thus inherently limited. Since the arrival of digital networking, a lot more data has been produced, because many forms of communication or interaction leave analyzable traces. From newsgroups to trackback networks on blogs, very simple crawler programs suffice to produce matrices that include millions of nodes and can be played around with indefinitely, from all kinds of angles. Social network sites like Facebook or MySpace are probably the best example of data pools just waiting to be analyzed by network scientists (and marketers, but that’s a different story). This brings me to a naive question: what is a social network?
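Before turning to that question, a brief aside on just how simple the step from crawled traces to an analyzable matrix really is. A minimal sketch, with blog names and links invented purely for illustration:

```python
# A minimal sketch: turning crawled link traces into an adjacency matrix.
# The blog names and links are invented for illustration.
import numpy as np

edges = [
    ("blogA", "blogB"),  # e.g. a trackback from blogA to blogB
    ("blogA", "blogC"),
    ("blogB", "blogC"),
]

nodes = sorted({name for edge in edges for name in edge})
index = {name: i for i, name in enumerate(nodes)}

# rows link to columns; 1 means "there is a link"
matrix = np.zeros((len(nodes), len(nodes)), dtype=int)
for source, target in edges:
    matrix[index[source], index[target]] = 1

print(nodes)
print(matrix)
```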
The problem of creating data sets for quantitative analysis in the social sciences is always twofold: a) what do I formalize, i.e. what are the variables I want to measure? b) how do I produce my data? The question is that of building a representation. Do my categories represent the defining traits of the system I wish to study? Do my measuring instruments truly capture the categories I decided on? In short: what to measure and how to measure it, categories and machinery. The results of mathematical analysis (which is not necessarily statistical in nature) will only begin to make sense if formalization and data collection were done with sufficient care. So, again, what is a social network?
Facebook (pars pro toto for the whole category, qua currently most annoying of the bunch) allows me to add “friends” to my “network”. By doing so, I am “digitally mapping out the relationships I already have”, as Mark Zuckerberg recently explained. So I am, indeed, creating a data model of my social network. Fifty million people are doing the same, so the result is a digital representation of the social connectivity of an important part of the Internet-connected world. From a social science research perspective, we could now ask whether Facebook’s social network (as database) is a good model of the social network (as social structure) it supposedly maps. This does, of course, depend on what somebody would want to study, but if you ask yourself whether Facebook is an accurate map of your social connections, you’ll probably say no. For the moment, the formalization and data collection that apply when people use a social networking site do not capture the whole gamut of our daily social interactions (work, institutions, groceries, etc.) and do not include many of the people that play important roles in our lives. This does not mean that Facebook would not be an interesting data set to explore quantitatively; but it means that there still is an important distinction between the formal model (data and algorithm, what? and how?) of “social network” produced by this type of information system and the reality of daily social experience.
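To make that distinction concrete, here is a minimal sketch of what the formal model of a “friendship” amounts to; the names are invented and the real schema is certainly richer, but the reduction is of the same kind:

```python
# A sketch of the kind of formal model a "friend" relation implies:
# a symmetric, unweighted edge. Names are invented for illustration.

friendships = {
    ("alice", "bob"),
    ("alice", "carol"),
}

def are_friends(a, b):
    # the database can only answer yes or no; how often we talk, in what
    # context we know each other, or how much the relationship matters
    # is simply not part of the model
    return (a, b) in friendships or (b, a) in friendships

print(are_friends("alice", "bob"))   # True
print(are_friends("bob", "carol"))   # False
```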
So what’s my point? Facebook is not a research tool for the social sciences and nobody cares whether the digital maps of our social networks are accurate or not. Facebook’s data model was not created to represent a social system but to produce a social system. Unlike the descriptive models of science, computer models are performative in a very materialist sense. As Baudrillard argues, the question is no longer whether the map adequately represents the territory, but in which way the map is becoming the new territory. The data model in Facebook is a model in the sense that it orients rather than represents. The “machinery” is not there to measure but to produce a set of possibilities for action. The social network (as database) is set to change the way our social network (as social structure) works – to produce reality rather than map it. But much as we can criticize data models in research for not being adequate to the phenomena they try to describe, we can examine data models, algorithms and interfaces of information systems and decide whether they are adequate for the task at hand. In science, “adequate” can only be defined in connection to the research question. In design and engineering there needs to be a defined goal in order to make such a judgment. Does the system achieve what I set out to achieve? And what is the goal, really?
When looking at Facebook and what the people around me do with it, the question of what “the politics of systems” could mean becomes a little clearer: how does the system affect people’s social network (as social structure) by allowing them to build a social network (as database)? What’s the (implicit?) goal that informs the system’s design?
Social networking systems are in their infancy and both technology and uses will probably evolve rapidly. For the moment, at least, what Facebook seems to be doing is quite simply to sociodigitize as many forms of interaction as possible; to render the implicit explicit by formalizing it into data and algorithms. But beware, merry people of The Book of Faces! For in a database, “identity” and “consumer profile” are one and the same thing. And that might just be the design goal…
I have been working, for a couple of months now, on what has been called “network theory” – a rather strange amalgam of social theory, applied mathematics and studies of ICT. What has interested me most in that area is the epistemological heterogeneity of the network concept and the difficulties that come with it. Quite obviously, a cable-based computer network, an empirically established social network and the mathematical exploration of dendrite connections in worm brains are not one and the same thing. The buzz around a possible “new science of networks” (Duncan J. Watts) suggests, however, that there is enough common ground between a great number of very different phenomena to justify the use of similar (mathematical) tools and concepts to map them as networks. The question of whether these things (the Internet, the spreading of disease, ecosystems, etc.) “are” networks or not seems less important than the question of whether network models produce interesting new perspectives on the areas they are being applied to. And this is indeed the case.
One important, albeit often overlooked, aspect of any mathematical modeling is the question of formalization: the mapping of entities from the “real” world onto variables and back again, a process that necessarily implies selection and reduction of complexity. This is a first layer of ambiguity and methodological difficulty. A second one has been noted even more rarely, and it concerns software. Let me explain: the goal of network mapping, especially when applied to the humanities, is indeed to produce a map: a representation of numerical relations that is more intuitively readable than a matrix. Although graph (or network) theory does not need to produce a graphical representation as its result, such representations are highly powerful means to communicate complex relationships in a way that works well with the human capacity for visual understanding. These graphs, however, are not drawn by hand but generally modeled by computer software, e.g. programs like InFlow, Pajek, different tools for social network analysis, or a plethora of open source network visualization libraries. It may be a trivial task to visualize a network of five or ten nodes, but positioning 50 or more nodes and their connections is quite a daunting task, and there are different conceptual and algorithmic solutions to the problem. Some tools use automatic clustering methods that lump nodes together and allow users to explore a network structure as a hierarchical system whose lower levels only unfold when zooming in on them. Parabolic projection is another method for reducing the number of nodes to draw for a given viewport. Three-dimensional projections represent yet another way to handle large numbers of nodes and connections. Behind these basic questions lurk matters of spatial distribution, i.e. the algorithms that try to strike a compromise between accurate representation and visual coherence. Interface design adds yet another layer, in the sense that certain operations are made available to users while others are not: zooming, dragging, repositioning of nodes, manual clustering, etc.
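To give a concrete sense of how much the software decides, here is a small sketch that lays out the same graph twice, once with a force-directed algorithm and once on a circle; networkx and matplotlib are my choice of tools for the illustration, not ones mentioned above, and changing the layout parameters or the random seed alone already produces a different map of unchanged data:

```python
# A sketch of how much the layout algorithm decides: the same graph, two maps.
# networkx and matplotlib are my choice of tools here, not ones named in the text.
import networkx as nx
import matplotlib.pyplot as plt

graph = nx.karate_club_graph()  # a small standard example network

# force-directed ("spring") layout: positions emerge from a simulation whose
# parameters (k, iterations, random seed) are set by the software, not the data
spring_positions = nx.spring_layout(graph, k=0.3, iterations=100, seed=42)
plt.figure()
nx.draw(graph, spring_positions, node_size=50)
plt.show()

# the very same data, arranged on a circle, tells a different visual story
circular_positions = nx.circular_layout(graph)
plt.figure()
nx.draw(graph, circular_positions, node_size=50)
plt.show()
```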
The point I’m trying to make is the following: the way we map networks is in large part channeled by the software we use, and these tools are therefore not mere accessories to research but epistemological agents that participate in the production of knowledge and in shaping research results. For the humanities, this is, in a sense, a new situation: while research methods based on mathematics are nothing new (sociometrics, etc.), the new research tools that a network science brings with it (other examples come to mind, e.g. data mining) might imply a conceptual rift where part of the methodology gets blackboxed into a piece of software. This is not necessarily a problem, but it is something that has to be discussed, examined, and understood.