Category Archives: algorithms
Edit: a map of the English Wikipedia is here.
Wikipedia is a fascinating object for way too many reasons. The way it is produced, the place it has taken in society, it’s size and evolution, and many other aspects are truly remarkable. Studying Wikipedia has become a discipline in itself and while there may be certain signs of fatigue on the editing front, there is still much to learn and to discover. I have recently started to take an interest in looking at the way knowledge is structured in different contexts and the availability of certain tools and datasets makes Wikipedia a perfect object for scrutiny. If it just wasn’t that big. Still, it’s the 21st century and computers are getting really fast, so why not try mapping Wikipedia. All of it.
There are different ways to start such a project, but simply taking the link structure is probably the most obvious. This allows for bypassing the internal taxonomy and may lead to a more “organic” expression of underlying knowledge structures. Unfortunately, computers are not that fast – at least not mine – and so I had to make two concessions: I took a non English variant (I settled for French) and reduced the number of nodes to a (barely) manageable amount. The final graph file (.gdf – do not even think about working with it with less than 4GB of RAM) was built by taking pages that had at least 100 connections with other pages. From an initial 183K pages and 11.5M links I went down to a more manageable 40K and 2M respectively. To make things workable, I chose to visualize the page names only, no nodes, no edges. The result looks like this (click on the image for a very big .png):

Reliable gephi did not only do the graph layout (OpenOrd plugin, 1000 iterations) but dutifully detected “communities” in the network, which actually did work really well. And here is a version in elegant grayscale, this time without community detection:

The graph shows a big dense zone in the middle that is quite unreadable but composed out of world history, politics, geography, and other elements that constitute a core set of knowledge elements that are highly interlinked. While France plays and important role here, these elements are actually very globalized and include countries from all over the world. Could we interpret this as a field of “common” or “shared” knowledge? A set of topics that transcend specialization and form the very core of what our culture considers essential?
To the close right of the very center, there is a rather visible (in orange) cluster on the United States. Around the center you’ll find major historic events and periods (WWII, middle ages, renaissance, etc.). The arts are on the right (mostly music) and France’s most popular art form – Cinema – starts at the top right, in a highly dense orange cluster and goes to the top left, tellingly fusing with theatre. The Sciences form a rather strange blue band the goes from the center top to the top right.
And then there is sports. I was a bit surprised by how much of it there is and how well the clustering and community detection works for identifying individual fields – football, tennis, car racing, and so on. The second surprise was how few “geek” subjects appear on the map. There is a digital technology cluster on the top right but I haven’t found any traces of the legendary Star Trek cluster. In the end, French Wikipedia appears to be a rather classic encyclopedia if you look at it from a subject angle. Could we use such maps to compare subject prominence between cultures?
Obviously, the method for mapping Wikipedia has to be refined to make maps more readable but the results are actually already quite telling. Let’s see whether the same approach can work for the English version – which is a cool 10 times bigger…
The Official Google Blog has recently written about changes to the ranking procedure that were introduced after a NYT article wrote about an online retailer that had apparently found out that being nasty to your customers would help getting good search rankings because all of the complaints and bad user reviews would get you links and boost PageRank. While Google denies that this logic would work, they have added a ranking layer to their search results that specifically targets online merchants. The interesting thing about the blog post is that the author details several things that the company could have done but didn’t do while actually revealing very little about what the “algorithmic solution” they implemented actually consists of. From the post:
Instead, in the last few days we developed an algorithmic solution which detects the merchant from the Times article along with hundreds of other merchants that, in our opinion, provide an extremely poor user experience. The algorithm we incorporated into our search rankings represents an initial solution to this issue, and Google users are now getting a better experience as a result.
While I do not believe that transparency is the prime solution to the gatekeeper issues surrounding search, this paragraph really is strikingly vague. Has Google compiled a list of merchants that are systematically downranked? How is this list compiled? What does “in our opinion” mean? Is this “opinion” expressed in the form of an algorithmic procedure (one could imagine using the hReview microformat to collect reviews on merchants)?
We’ll probably not get any answers to these questions but the case really shows how murky the whole ranking thing really has become: in an always growing online world, search visibility has extremely important financial ramifications (despite the social media hype) and I believe that companies like Google will increasingly rely on human judgment as a complement to algorithmic procedures (which are just another form of human judgment BTW). This will certainly lead to more legal activity around ranking in the future because courts still understand human meddling a lot better than software design…
What is a link? From a methodology standpoint, there is no answer to that question but only the recognition that when using graph theory and associated software tools, we project certain aspects of a dataset as nodes and others as links. In my last post, I “projected” authors from the air-l list as nodes and mail-reply relationships as links. In the example below, I still use authors as nodes but links are derived from a similarity measure of a statistical analysis of each poster’s mails. Here are two gephi graphs:
If you are interested in the technique, it’s a simple similarity measure based on the vector-space model and my amateur computer scientist’s PHP implementation can be found here. The fact that the two posters who changed their “from:” text have both of their accounts close together (can you find them?) is a good indication that the algorithm is not completely botched. The words floating on the links on the right graph are the words that confer the highest value to the similarity calculation, which means that it is a word that is relatively often used by both of the linked authors while being generally rare in the whole corpus. Elis Godard and Dana Boyd for example have both written on air-l about Ron Vietti, a pastor who (rightfully?) thinks the Internet is the devil and because very few other people mentioned the holy warrior, the word “vietti” is the highest value “binder” between the two.
What is important in networks that are the result of heavily iterative processing is that the algorithms used to create them are full of parameters and changing one of these parameters just little bit may (!) have larger repercussions. In the example above I actually calculate a similarity measure between each two nodes (60^2 / 2 results) but in order to make the graph somewhat readable I inserted a threshold that boils it down to 637 links. The missing measures are not taken into account in the physics simulation that produces the layout – although they may (!) be significant. I changed the parameter a couple of times to get the graph “right”, i.e. to find a good compromise between link density for simulation and readability. But look at what happens when I grow the threshold so than only the 100 strongest similarity measures survive:
First, a couple of nodes disconnect, two binary stars form around the “from:” changers and the large component becomes a lot looser. Second, Jeremy Hunsinger looses the highest PageRank to Chris Heidelberg. Hunsinger had more links when lower similarity scores were taken into account, but when things get rough in the network world, bonding is better than bridging. What is result and what is artifact?
Most advanced algorithmic techniques are riddled with such parameters and getting a “good” result not only implies fiddling around a lot (how do I clean the text corpus, what algorithms to look for what kind of structures or dynamics, what parameters, what type of representation, here again, what parameters, and so on…) but also having implicit ideas about what kind of result would be “plausible”. The back and forth with the “algorithmic microscope” is always floating against a backdrop of “domain knowledge” and this is one of the reasons why the idea of a science based purely on data analysis is positively absurd. I believe that the key challenge is to stay clear of methodological monoculture and to articulate different approaches together whenever possible.
The Association of Internet Researchers (AOIR) is an important venue if you’re interested in, like the name indicates, Internet research. But it is also a good primary source if one wants to inquire into how and why people study the Internet, which aspects of it, etc. Conveniently for the lazy empirical researcher that I am, the AOIR has an archive of its mailing-list, which has about 22K mails posted by 3K addresses, enough for a little playing around with the impatient person’s tool, the algorithm. I have downloaded the data and I hope I can motivate some of my students to build something interesting with it, but I just had to put it into gephi right away. Some of the tools we’ll hopefully build will concentrate more on text mining but using an address as a node and a mail-reply relationship as a link, one can easily build a social graph.
I would like to take this example as an occasion to show how different algorithms can produce quite different views on the same data:
So, these are the air-l posters with more than 60 messages posted since 2001. Node size indicates the number of posts, a node’s color (from blue to red) shows its connectivity in the graph (click on the image to see a much larger version). Link strength, i.e. number of replies between two people, is taken into account. You can download the full .gdf here. The only difference between the four graphs is the layout algorithm used (Force Atlas, Force Atlas with attraction distribution, Yifan Hu, and Fruchterman Reingold). You can instantly notice that Yifan Hu pushes nodes with low link count much more strongly to the periphery than the others, while Fruchterman Reingold as always keeps its symmetrical sphere shape, suggesting a more harmonious picture than the rest. Force Atlas’ attraction distribution feature will try to differentiate between hubs and authorities, pushing the former to the periphery while keeping the latter in the center; just compare Barry Wellman’s position over the different graphs.
I’ll probably repeat this experiment with a more segmented graph, but I think this already shows that layout algorithms are not just innocently rendering a graph readable. Every method puts some features of the graph to the forefront and the capacity for critical reading is as important as the willingness for “critical use” that does not gloss over the differences in tools used.
If we want to understand the plethora of very specific roles computers play in today’s world, the question “What is software?” is inevitable. Many different answers have been articulated from different viewpoints and different positions – creator, user, enterprise, etc. – in the networks of practices that surround digital objects. From a scholarly perspective, the question is often tied to another one, “Where does software come from?”, and is connected to a history of mathematical thought and the will/pressure/need to mechanize calculation. There we learn for example that the term “algorithm” is derived from the name of the Persian mathematician al-Khwārizmī and that in mathematical textbooks from the middle ages, the term algorism is used to denote the basic arithmetic techniques – that we now learn in grammar school – which break down e.g. the calculation of a multiplication with large numbers into a series of smaller operations. We learn first about Pascal, Babbage, and Lady Lovelace and then about Hilbert, Gödel, and Turing, about the calculation of projectile trajectories, about cryptography, the halt-problem, and the lambda calculus. The heroic history of bold pioneers driven by an uncompromising vision continues into the PC (Engelbart, Kay, the Steves, etc.) and Network (Engelbart again, Cerf, Berners-Lee, etc.) eras. These trajectories of successive invention (mixed with a sometimes exaggerated emphasis on elements from the arsenal of “identity politics”, counter-culture, hacker ethos, etc.) are an integral part for answering our twin question, but they are not enough.
A second strand of inquiry has developed in the slipstream of the monumental work by economic historian Alfred Chandler Jr. (The Visible Hand) who placed the birth of computers and software in the flux of larger developments like industrialization (and particularly the emergence of the large scale enterprise in the late 19th century), bureaucratization, (systems) management, and the general history of modern capitalism. The books by James Beniger (The Control Revolution), JoAnne Yates (Control through Communication and more recently Structuring the Information Age), James W. Cortada (most notably The Digital Hand in three Volumes), and others deepened the economic perspective while Paul N. Edwards’ Closed World or Jon Agar’s The Government Machine look more closely at the entanglements between computers and government (bureaucracy). While these works supply a much needed corrective to the heroic accounts mentioned above, they rarely go beyond the 1960s and do not aim at understanding the specifics of computer technology and software beyond their capacity to increase efficiency and control in information-rich settings (I have not yet read Martin Campell-Kelly’s From Airline Reservations to Sonic the Hedgehog, the title is a downer but I’m really curious about the book).
Lev Manovich’s Language of New Media is perhaps the most visible work of a third “school”, where computers (equipped with GUIs) are seen as media born from cinema and other analogue technologies of representation (remember Computers as Theatre?). Clustering around an illustrious theoretical neighborhood populated by McLuhan, Metz, Barthes, and many others, these works used to dominate the “XY studies” landscape of the 90s and early 00s before all the excitement went to Web 2.0, participation, amateur culture, and so on. This last group could be seen as a fourth strand but people like Clay Shirky and Yochai Benkler focus so strongly on discontinuity that the question of historical filiation is simply not relevant to their intellectual project. History is there to be baffled by both present and future.
This list could go on, but I do not want to simply inventory work on computers and software but to make the following point: there is a pronounced difference between the questions “What is software?” and “What is today’s software?”. While the first one is relevant to computational theory, software engineering, analytical philosophy, and (curiously) cognitive science, there is no direct line from universal Turing machines to our particular landscape with the millions of specific programs written every year. Digital technology is so ubiquitous that the history of computing is caught up with nearly every aspect of the development of western societies over the last 150 years. Bureaucratization, mass-communication, globalization, artistic avant-garde movements, transformations in the organization of labor, expert movements in public administrations, big science, library classifications, the emergence of statistics, minority struggles, two world wars and too many smaller conflicts to count, accounting procedures, stock markets and the financial crisis, politics from fascism to participatory democracy,… – all of these elements can be examined in connection with computing, shaping the tools and being shaped by them in return. I am starting to believe that for the humanities scholar or the social scientist the question “What is software?” is only slightly less daunting than “What is culture?” or “What is society?”. One thing seems sure: we can no longer pretend to answer the latter two questions without bumping into the first one. The problem for the author, then, becomes to choose the relevant strands, to untangle the mess.
In my view, there is a case to be made for a closer look at the role the library and information sciences played in the development of contemporary software techniques, most obviously on the Internet, by not exclusively. While Bush’s Memex has perhaps been commented on somewhat beyond its actual relevance, the work done by people such as Eugene Garfield (citation analysis), Calvin M. Mooers (information retrieval), Hans-Peter Luhn (KWIC), Edgar Codd (relational database) or Gerard Salton (the vector space model) from the 1950s on has not been worked on much outside of specialist circles – despite the fact that our current ways of working with information (yes, this includes your Facebook profile, everything Google is doing, cloud computing, mobile applications and all the other cool stuff Wired writes about) have left behind the logic of the library catalog quite some time ago. This is also where today’s software comes from.
My colleague Theo Röhle and I went to the Computational Turn conference this week. While I would have preferred to hear a bit more on truly digital research methodology (in the fully scientific sense of the word “method”), the day was really quite interesting and the weather unexpectedly gorgeous. Most of the papers are available on the conference site, make sure to have a look. The text I wrote with Theo tried to structure some of the epistemological challenges and problems to take into account when working with digital methods. Here’s a tidbit:
…digital technology is set to change the way scholars work with their material, how they “see” it and interact with it. The question is, now, how well the humanities are prepared for these transformations. If there truly is a paradigm shift on the horizon, we will have to dig deeper into the methodological assumptions that are folded into the new tools. We will need to uncover the concepts and models that have carried over from different disciplines into the programs we employ today…
This spring worked on an R&D project that was really quite interesting but – as it happens with projects – took up nearly all of my spare time. La montre verte is based on the idea that pollution measurement can be brought down to street level if sensors can be made small enough to be carried around by citizens. Together with a series of partners from the private sector, the CiTu group of my laboratory came up with the idea to put an ozone sensor and a microphone (to measure noise levels) into a watch. That way, the device is not very intrusive and still in direct contact with the surrounding air. We built about 15 prototypes, based on the fact that currently, Paris’ air quality is measured by only a handful of (really high quality) sensors and even the low resolution devices we have in our watches should therefore be able to complement that data with a geographically more fine grained analysis of noise and pollution levels. The watch produces a georeferenced measurement (a GPS is built into the watch) every second and transmits the data via Bluetooth to a Java application on a portable phone, which then sends every data packet via GPRS to a database server.
My job in the project was to build a Web application that allows people to interact with and make sense of the data produced by the watches. Despite the help from several brilliant students from our professional Masters program, this proved to be a daunting task and I spent *at lot* of time programming. The result is quite OK I believe; the application allows users to explore the data (which is organized in localized “experiments”) in different ways, either in real-time or afterward. With a little more time (we had only about three month for the whole project and we got the hardware only days before the first public showcase) we could have done more but I’m still quite content with the result. Especially the heatmap (see image) algorithm was fun to program, I’ve never done a lot of visual stuff so this was new territory and a steep learning curve.
Unfortunately, the strong emphasis on the technological side and the various problems we had (the agile methods one needs for experimental projects are still not understood by many companies) cut down the time for reflection to a minimum and did not allow us to come up with a deeper analysis of the social and political dimensions of what could be called “distributed urban intelligence”. The whole project is embedded in a somewhat naive rhetoric of citizen participation and the idea that technological innovation can solve social problems, in this case matters of urban planning and local governance. A lesson I have learned from this is that the current emphasis in funding on short-term projects that bring together universities and the industry makes it very difficult to carve out an actual space for scientific practice between all the deadlines and the heavy technical demands. And by scientific practice, I mean a *critical* practice that does not only try to base specifications and prototyping on “scientifically valid” approaches to building tools and objects but which includes a reflection on social utility that takes a wider view than just immediate usefulness. In the context of this project, this would have implied a close look at how urban development is currently configured in respect to environmental concerns in order to identify structures of governance and chains of decision-making. This way, the whole project could have targeted issues more clearly and consciously, fine-tuning both the tools and the accompanying discourse to the social dimension it aimed at.
I think my point is that we (at least I) have to learn how to better include a humanities-based research agenda into very high-tech projects. We have known for a long time now that every technical project is in fact a socio-technical enterprise but research funding and the project proposals that it generates are still pretending that the “socio-” part is some fluffy coating that decorates the manly material core where cogs and wire produce tangible effects. As I programmer I know how difficult and time-consuming technical work can be but if there is to be a conscious socio-technical perspective in R&D we have to accept that the fluffy stuff takes even more time – if it is done right. And to do it right means not only reading every book and paper relevant to a subject matter but to take the time to reflect on methodology, to evaluate every step critically, to go back to the drawing board, and to include and to produce theory every step of the way. There is a cost to the scientific method and if that cost is not figured in, the result may still be useful, interesting, thought-provoking, etc. but it will not be truly scientific. I believe that we should defend these costs and show why they are necessary; if we cannot do so, we risk confining the humanities to liberal armchair commentary and the social sciences to ex-post usage analysis.
After having finished my paper for the forthcoming deep search book I’ve been going back to programming a little bit and I’ve added a feature to termCloud search, which is now v0.4. The new “show relations” button highlights the eight terms with the highest co-occurrence frequency for a selected keyword. This is probably not the final form of the feature but if you crank up the number of terms (with the “term+” button) and look at the relations between some of the less common words, there are already quite interesting patterns being swept to the surface. My next Yahoo BOSS project, termZones, will try to use co-occurrence matrices from many more results to map discourse clusters (sets of words that appear very often together), but this will need a little more time because I’ll have to read up on algorithms to get that done…
PS: termCloud Search was recently a “mashup of the day” at programmeableweb.com…
Winter holidays and finally a little bit of time to dive into research and writing. After giving a talk at the Deep Search conference in Vienna last month (videos available here), I’ve been working on the paper for the conference book, which should come out sometime next year. The question is still “democratizing search” and the subject is really growing on me, especially since I started to read more on political theory and the different interpretations of democracy that are out there. But more on that some other time.
One of the vectors of making search more productive in the framework of liberal democracy is to think about search not merely as the fasted way to get from a query to a Web page, but to think about how modern technologies might help in providing an overview on the complex landscape of a topic. I have always thought that clusty – a metasearcher that takes results from Live, Ask, DMOZ, and other sources and organizes them in thematic clusters – does a really good job in that respect. If you search for “globalisation”, the first ten clusters are: Economic, Research, Resources, Anti-globalisation, Definition, Democracy, Management, Impact, Economist-Economics, Human. Clicking on a cluster will bring you the results that clusty’s algorithms judge as pertinent for the term in question. Very often, just looking on the clusters gives you a really good idea of what the topic is about and instead of just homing in on the first result, the search process itself might have taught you something.
I’ve been playing around with Yahoo BOSS for one of the programming classes I teach and I’ve come up with a simple application that follows a similar principle. TermCloud Search (edit: I really would like to work on this some more and the name SearchCloud was already taken, so I changed it…) is a small browser-based app that uses the “keyterms” (a list of keywords the system provides you with for every URL found) feature of Yahoo BOSS to generate a tagcloud for every search you make. It takes 250 results and lets the user navigate these results by clicking on a keyword. The whole thing is really just a quick hack but it shows how easy it is to add such “overview” features to Web search. Just try querying “globalisation” and look at the cloud – although it’s just extracted keywords, a representation of the topic and its complexity does emerge at least somewhat.
I’ll certainly explore this line of experimentation over the next months, jQuery is making the whole API thing really fun, so stay tuned. For the moment I’m kind of fascinated by the possibilities and by imagining search as a pedagogical process, not just a somewhat inconvenient stage in accessing content that has to be speeded up by personalization and such. Search can become in itself a knowledge producing (not just knowledge finding) activity by which we may explore a subject on a more general level.
And now I’ve got an article to finish…
You’ve probably already read it somewhere (like here or here), amazon.com has blundered a little bit – for a couple of hours the search query “terrorist costume” brought up a single hit, a rubber mask with Obama’s face. I really don’t know how many people would have found out on their own but there’s some buzz going now and there actually is something worth pondering about the case. How it happened is quite easy to reconstruct: amazon allows users to label products (Folksonomy) and includes these tags into their general search engine. So somebody tagged the Obama mask with “terrorist” (“costume” was already a common keyword) and there you go. What I find interesting about this is not that there would be any real political consequence to this matter but the fact that folk-tagging can be as easily dragged into different directions as anything else. I’m currently working on a talk for the Deep Search conference (running late as so often these days) and I’ve been looking at Jimmy Wales’ project Wikia Search which uses community feedback in order to re-rank results. The question for me is how this system would be less pervasive to manipulation or SEO than today’s dominant principle, link analysis. The amazon case shows quite well that when you enter a contested field, there’s going to be fallout and the reason that there isn’t more of it already is probably because the masses are not yet aware of the mischief potential. And I don’t see how the “wisdom of the crowd” principle (whether that is folksonomy, voting, result re-ranking, etc.) cannot be hijacked by a determined individual or company that understands the workings of the algorithms that structure results (in the amazon case you would have needed to know that user tags are used in the general search). So what is really interesting about the Obama mask incident is how things continue at amazon (and other folksonomy based servives) – if user tags can be used to drive traffic to specific products, the marketeers will come in droves the moment the numbers are relevant…


