Category Archives: algorithms

This spring I worked on an R&D project that was really quite interesting but – as happens with projects – took up nearly all of my spare time. La montre verte is based on the idea that pollution measurement can be brought down to street level if sensors can be made small enough to be carried around by citizens. Together with a series of partners from the private sector, the CiTu group of my laboratory came up with the idea to put an ozone sensor and a microphone (to measure noise levels) into a watch. That way, the device is not very intrusive and still in direct contact with the surrounding air. We built about 15 prototypes, based on the fact that Paris’ air quality is currently measured by only a handful of (really high quality) sensors; even the low-resolution devices in our watches should therefore be able to complement that data with a geographically more fine-grained analysis of noise and pollution levels. The watch produces a georeferenced measurement (a GPS is built into the watch) every second and transmits the data via Bluetooth to a Java application on a portable phone, which then sends every data packet via GPRS to a database server.

My job in the project was to build a Web application that allows people to interact with and make sense of the data produced by the watches. Despite the help of several brilliant students from our professional Masters program, this proved to be a daunting task and I spent *a lot* of time programming. The result is quite OK, I believe; the application allows users to explore the data (which is organized in localized “experiments”) in different ways, either in real-time or afterward. With a little more time (we had only about three months for the whole project and we got the hardware only days before the first public showcase) we could have done more, but I’m still quite content with the result. The heatmap algorithm (see image) was especially fun to program; I’ve never done a lot of visual stuff, so this was new territory and a steep learning curve.
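For the record, a minimal sketch of the kind of heatmap computation involved – this is illustrative Python, not the project’s actual code, and the Gaussian kernel and radius are my assumptions:

```python
import math

def heatmap(points, width, height, radius=10):
    """Accumulate a Gaussian kernel around each measurement point.

    points: list of (x, y, value) tuples, already projected to pixel space.
    Returns a height x width grid of floats; brighter cells mean more
    (or stronger) nearby measurements.
    """
    grid = [[0.0] * width for _ in range(height)]
    sigma = radius / 3.0  # kernel fades to near zero at the radius edge
    for px, py, value in points:
        # Only visit grid cells within the kernel radius of the point.
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                d2 = (x - px) ** 2 + (y - py) ** 2
                grid[y][x] += value * math.exp(-d2 / (2 * sigma ** 2))
    return grid

# One measurement in the middle of a small grid.
grid = heatmap([(5, 5, 1.0)], 10, 10, radius=3)
```

Rendering is then just a matter of mapping grid values to a color gradient.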

Unfortunately, the strong emphasis on the technological side and the various problems we had (the agile methods one needs for experimental projects are still not understood by many companies) cut down the time for reflection to a minimum and did not allow us to come up with a deeper analysis of the social and political dimensions of what could be called “distributed urban intelligence”. The whole project is embedded in a somewhat naive rhetoric of citizen participation and the idea that technological innovation can solve social problems, in this case matters of urban planning and local governance. A lesson I have learned from this is that the current emphasis in funding on short-term projects that bring together universities and the industry makes it very difficult to carve out an actual space for scientific practice between all the deadlines and the heavy technical demands. And by scientific practice, I mean a *critical* practice that not only tries to base specifications and prototyping on “scientifically valid” approaches to building tools and objects but also includes a reflection on social utility that takes a wider view than just immediate usefulness. In the context of this project, this would have implied a close look at how urban development is currently configured with respect to environmental concerns in order to identify structures of governance and chains of decision-making. This way, the whole project could have targeted issues more clearly and consciously, fine-tuning both the tools and the accompanying discourse to the social dimension it aimed at.

I think my point is that we (at least I) have to learn how to better include a humanities-based research agenda in very high-tech projects. We have known for a long time now that every technical project is in fact a socio-technical enterprise, but research funding and the project proposals it generates still pretend that the “socio-” part is some fluffy coating that decorates the manly material core where cogs and wire produce tangible effects. As a programmer I know how difficult and time-consuming technical work can be, but if there is to be a conscious socio-technical perspective in R&D we have to accept that the fluffy stuff takes even more time – if it is done right. And to do it right means not only reading every book and paper relevant to a subject matter but taking the time to reflect on methodology, to evaluate every step critically, to go back to the drawing board, and to include and produce theory every step of the way. There is a cost to the scientific method, and if that cost is not figured in, the result may still be useful, interesting, thought-provoking, etc., but it will not be truly scientific. I believe that we should defend these costs and show why they are necessary; if we cannot do so, we risk confining the humanities to liberal armchair commentary and the social sciences to ex-post usage analysis.

After having finished my paper for the forthcoming deep search book I’ve been going back to programming a little bit and I’ve added a feature to termCloud search, which is now v0.4. The new “show relations” button highlights the eight terms with the highest co-occurrence frequency for a selected keyword. This is probably not the final form of the feature but if you crank up the number of terms (with the “term+” button) and look at the relations between some of the less common words, there are already quite interesting patterns being swept to the surface. My next Yahoo BOSS project, termZones, will try to use co-occurrence matrices from many more results to map discourse clusters (sets of words that appear very often together), but this will need a little more time because I’ll have to read up on algorithms to get that done…
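A sketch of how such a co-occurrence feature can work – hypothetical Python, not termCloud’s actual code; it counts, for each pair of terms, how many results contain both, and returns the strongest partners of a selected keyword:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(term_lists):
    """Count how often each pair of terms appears in the same result."""
    counts = Counter()
    for terms in term_lists:
        # Sort so each unordered pair is counted under a single key.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts

def top_related(term, term_lists, n=8):
    """Return the n terms that co-occur most often with `term`."""
    counts = cooccurrence_counts(term_lists)
    related = Counter()
    for (a, b), c in counts.items():
        if a == term:
            related[b] += c
        elif b == term:
            related[a] += c
    return [t for t, _ in related.most_common(n)]

# Toy corpus: each inner list stands for the terms of one search result.
docs = [["trade", "globalisation", "economy"],
        ["globalisation", "economy"],
        ["trade", "politics"]]
print(top_related("globalisation", docs, n=2))  # → ['economy', 'trade']
```

With real result sets, one would of course filter stopwords and normalize the counts before highlighting anything.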

PS: termCloud Search was recently a “mashup of the day” at programmableweb.com

Winter holidays and finally a little bit of time to dive into research and writing. After giving a talk at the Deep Search conference in Vienna last month (videos available here), I’ve been working on the paper for the conference book, which should come out sometime next year. The question is still “democratizing search” and the subject is really growing on me, especially since I started to read more on political theory and the different interpretations of democracy that are out there. But more on that some other time.

One of the vectors of making search more productive in the framework of liberal democracy is to think about search not merely as the fastest way to get from a query to a Web page, but to think about how modern technologies might help in providing an overview of the complex landscape of a topic. I have always thought that clusty – a metasearcher that takes results from Live, Ask, DMOZ, and other sources and organizes them in thematic clusters – does a really good job in that respect. If you search for “globalisation”, the first ten clusters are: Economic, Research, Resources, Anti-globalisation, Definition, Democracy, Management, Impact, Economist-Economics, Human. Clicking on a cluster will bring up the results that clusty’s algorithms judge as pertinent for the term in question. Very often, just looking at the clusters gives you a really good idea of what the topic is about, and instead of just homing in on the first result, the search process itself might have taught you something.

I’ve been playing around with Yahoo BOSS for one of the programming classes I teach and I’ve come up with a simple application that follows a similar principle. TermCloud Search (edit: I really would like to work on this some more and the name SearchCloud was already taken, so I changed it…) is a small browser-based app that uses the “keyterms” (a list of keywords the system provides you with for every URL found) feature of Yahoo BOSS to generate a tagcloud for every search you make. It takes 250 results and lets the user navigate these results by clicking on a keyword. The whole thing is really just a quick hack but it shows how easy it is to add such “overview” features to Web search. Just try querying “globalisation” and look at the cloud – although it’s just extracted keywords, a representation of the topic and its complexity does emerge at least somewhat.
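The principle is simple enough to sketch in a few lines of Python – the result structure and the font-sizing scheme here are illustrative assumptions, not the app’s actual code:

```python
from collections import Counter

def build_cloud(results, min_size=10, max_size=40):
    """Aggregate the keyterm lists of all results into (term, font_size) pairs.

    results: list of dicts with a "keyterms" list, roughly as a BOSS-style
    API might return them (the field name is assumed for illustration).
    Font sizes are scaled linearly between min_size and max_size.
    """
    freq = Counter(t.lower() for r in results for t in r["keyterms"])
    if not freq:
        return []
    ranked = freq.most_common()
    hi, lo = ranked[0][1], ranked[-1][1]
    span = max(hi - lo, 1)  # avoid division by zero when all counts are equal
    return [(term, min_size + (count - lo) * (max_size - min_size) // span)
            for term, count in ranked]

cloud = build_cloud([{"keyterms": ["trade", "economy"]},
                     {"keyterms": ["economy"]}])
```

In the browser app, each (term, size) pair becomes a clickable link that filters the 250 results down to those tagged with the chosen keyword.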

I’ll certainly explore this line of experimentation over the next months; jQuery is making the whole API thing really fun, so stay tuned. For the moment I’m kind of fascinated by the possibilities and by imagining search as a pedagogical process, not just a somewhat inconvenient stage in accessing content that has to be sped up by personalization and such. Search can become in itself a knowledge-producing (not just knowledge-finding) activity by which we may explore a subject on a more general level.

And now I’ve got an article to finish…

You’ve probably already read it somewhere (like here or here): amazon.com has blundered a little bit – for a couple of hours the search query “terrorist costume” brought up a single hit, a rubber mask with Obama’s face. I really don’t know how many people would have found out on their own, but there’s some buzz going around now and there actually is something worth pondering about the case. How it happened is quite easy to reconstruct: amazon allows users to label products (folksonomy) and includes these tags in its general search engine. So somebody tagged the Obama mask with “terrorist” (“costume” was already a common keyword) and there you go. What I find interesting about this is not that there would be any real political consequence to the matter but the fact that folk-tagging can be dragged in different directions as easily as anything else. I’m currently working on a talk for the Deep Search conference (running late, as so often these days) and I’ve been looking at Jimmy Wales’ project Wikia Search, which uses community feedback in order to re-rank results. The question for me is how this system would be less vulnerable to manipulation or SEO than today’s dominant principle, link analysis. The amazon case shows quite well that when you enter a contested field, there’s going to be fallout, and the reason there isn’t more of it already is probably that the masses are not yet aware of the mischief potential. And I don’t see how the “wisdom of the crowd” principle (whether that is folksonomy, voting, result re-ranking, etc.) cannot be hijacked by a determined individual or company that understands the workings of the algorithms that structure results (in the amazon case you would have needed to know that user tags are used in the general search).
So what is really interesting about the Obama mask incident is how things continue at amazon (and other folksonomy-based services) – if user tags can be used to drive traffic to specific products, the marketeers will come in droves the moment the numbers are relevant…

This is not a substantial post, just a pointer to this interview with Digg lead scientist Anton Kast on Digg’s upcoming recommendation engine (which is really just collaborative filtering, but as Kast says, the engineering challenge is to make it work in real time – which is quite fascinating given the volume of users and content on the site). Around 2:50 Kast explains why Digg will list the “compatibility coefficient” (algorithmic proximity, anyone?) with other users and give an indication of why stories are recommended to you (because these users dugg them): Digg users hate getting stuff imposed on them, and just showing recommendations without a trail “looks editorial”. Wow, “editorial” is becoming a swearword. Who would have thought…

This morning Jonah Bossewitch pointed me to an article over at Wired, authored by Chris Anderson, which announces “The End of Theory”. The article’s main argument is in itself not very interesting for anybody with a knack for epistemology – Anderson has apparently never heard of the induction / deduction discussion and has only a limited idea of what statistics does – but there is a very interesting question lurking somewhere behind all the Californian Ideology, and the following citation points right to it:

We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

One could point to the fact that the natural sciences had their experimental side for quite a while (Roger Bacon advocated his scientia experimentalis in the 13th century) and that a laboratory is in a sense a pattern-finding machine where induction continuously plays an important role. What interests me more though is Anderson’s insinuation that statistical algorithms are not models. Let’s just look at one of the examples he uses:

Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required.

This is a very limited understanding of what constitutes a model. I would argue that PageRank does in fact rely very explicitly on a model which combines several layers of justification. In their seminal paper on Google, Brin and Page write the following:

PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank.
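The random-surfer description translates almost directly into a power iteration; here is a toy sketch (simplified, for instance in how dead ends are handled, and obviously nothing like Google’s production code):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for the random-surfer model.

    links: dict mapping each page to the list of pages it links to.
    With probability `damping` the surfer follows a link on the current
    page; otherwise (or on a page with no links) she jumps to a random page.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Everyone gets the "bored surfer" teleport share.
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
            else:
                # Dead end: redistribute this page's rank to everyone.
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

# Tiny graph: a -> b, b -> a and c, c -> a.
r = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

The point to note is how much model is baked in: the damping factor, the uniform teleport, the equal split among outlinks are all modeling decisions, not facts found in the data.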

The assumption behind this graph-oriented justification is that people do not place links randomly but with purpose. Linking implies attribution of importance: we don’t link to documents we are indifferent about. The statistical exploration of the huge graph that is the Web is indeed oriented by this basic assumption, and it adds the quite contestable rule that whatever is thought important by the greatest number of linkers shall be most visible. I would, then, argue that there is no experimental method that is purely inductive, not even neural networks. Sure, on the mathematical side we can explore data without limitations concerning its dimensionality, i.e. the number of characteristics that can be taken into account; the method of gathering data, however, is always a process of selection influenced by some idea or intuition that at least implicitly has the character of a model. There is a deductive side to even the most inductive approach. Data is made, not given, and every projection of that data is oriented. To quote Fernando Pereira:

[W]ithout well-chosen constraints — from scientific theories — all that number crunching will just memorize the experimental data.

As Jonah points out, Anderson’s article is probably a straw-man argument whose sole purpose is to attract attention, but it points to something that is really important: too many people think that mathematical methods for knowledge discovery (data mining, that is) are neutral and objective tools that will find what’s really there and show the world as it is, without the stain of human intentionality; these algorithms are therefore not seen as objects of political inquiry. In this view, statistics is all about counting facts and only higher layers of abstraction (models, theories,…) can have a political dimension. But it matters what we count and how we count.

In the end, Anderson’s piece is little more than the habitual prostration before the altar of emergence and self-organization. Just exchange the invisible hand for the invisible brain and you’ll get pop epistemology for hive minds…

A couple of weeks ago, Google released App Engine, a Web hosting platform that makes the company’s extensive knowledge in datacenter technology available to the general public. The service is free for the moment (including 500MB of data storage and a quite generous contingent of CPU cycles) but a commercial service is in preparation. Apps use Google’s account system for user identification and are currently limited to (lovely) Python as the programming language. I don’t want to write about the usual Google über alles matter but rather restate an idea I proposed in a paper in 2005. When criticizing search engine companies, authors generally demand more inclusive search algorithms, less commercial results, transparent ranking algorithms, or non-commercial alternatives to the dominant service(s). This is all very important, but I fear that a) there cannot be search without bias, b) transparency would not reduce the commercial coloring of search results, and c) open source efforts would have difficulties mustering the support on the hardware and datacenter front to provide services to billions of users and effectively take on the big players. In 2005 I suggested the following:

Instead of trying to mechanize equality, we should obligate search engine companies to perform a much less ambiguous public service by demanding that they grant access to their indexes and server farms. If users have no choice but to place confidence in search engines, why not ask these corporations to return the trust by allowing users to create their own search mechanisms? This would give the public the possibility to develop search algorithms that do not focus on commercial interest: search techniques that build on criteria that render commercial hijacking very difficult. Lately we have seen some action to promote more user participation and control, but the measures undertaken are not going very far. From a technical point of view, it would be easy for the big players to propose programming frameworks that allow writing safe code for execution in their server environment; the conceptual layers already are modules and replacing one search (or representation) module with another should not be a problem. The open source movement as part of the civil society has already proven its capabilities in various fields and where control is impossible, choice might be the only answer. To counter complete fragmentation and provide orientation, we could imagine that respected civic organizations like the FSF endorse specific proposals from the chaotic field of search algorithms that would emerge. In France, television networks have to invest a percentage of their revenue in cinema; why not make search engine companies dedicate a percentage of their computer power to algorithms written by the public? This would provide the necessary processing capabilities to civil society without endangering the business model of those companies; they could still place advertising and even keep their own search algorithms a secret. But there would be alternatives – alternative (noncommercial) viewpoints and hierarchies – to choose from.

I believe that the Google App Engine could be the technical basis for what could be called the Google Search Sandbox, a hosting platform equipped with either an API to the company’s vast indexes or even something as simple as a means to change weights for parameters in the existing set of algorithms. A simple JSON input like {"shop": "-1", "checkout": "-1", "price": "-1", "cart": "-1", "bestseller": "-1"} could be enough to, e.g., eliminate amazon pages from the result list. SEOing for these scripts would be difficult because there would be many different varieties (one of the first would be bernosworld.google.com – we aim to displease! no useful results guaranteed!). It is of course not in Google’s best interest to implement something like this because many scripts might direct users away from commercial pages using AdSense, the foundation of the company’s revenue stream. But this is why we have governments. Hoping for or even legislating more transparency and “inclusive” search might be less effective than people wish. I demand access to the index!
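To illustrate, applying such a weight map could be as simple as a reranking pass over the results – a hypothetical sketch, with made-up fields and scores:

```python
def rerank(results, weights):
    """Re-score results with user-supplied term weights, as in the JSON idea.

    results: list of (url, text, base_score) tuples (an assumed structure);
    weights: dict mapping a term to a score delta. A negative weight pushes
    pages containing that term down the list.
    """
    def score(result):
        url, text, base = result
        adjustment = sum(delta for term, delta in weights.items()
                         if term in text.lower())
        return base + adjustment
    return sorted(results, key=score, reverse=True)

results = [("a.com", "Buy now, add to cart and checkout", 1.0),
           ("b.org", "A scholarly analysis of retail", 0.9)]
ranked = rerank(results, {"cart": -1, "checkout": -1})
```

A real sandbox would of course apply such weights inside the index rather than over a handful of fetched results, but the user-facing principle is the same.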

I have no idea whether it’s going to be accepted, but here is my proposal for the Internet Research 9.0: Rethinking Community, Rethinking Place conference. The title is: Algorithmic Proximity – Association and the “Social Web”

How to observe, describe and conceptualize social structure has been a central question in the social sciences since their beginning in the 19th century. From Durkheim’s opposition between organic and mechanic solidarity and Tönnies’ distinction of Gemeinschaft and Gesellschaft to modern Social Network Analysis (Burt, Granovetter, Wellman, etc.), the problem of how individuals and groups relate to each other has been at the core of most attempts to conceive the “social”. The state of “community” – even in the loose understanding that has become prevalent when talking about sociability online – already is an end result of a permanent process of proto-social interaction, the plasma (Latour) from which association and aggregation may arise. In order to understand how the sites and services (Blogs, Social Networking Services, Online Dating, etc.) that make up what has become known as the “Social Web” allow for the emergence of higher-order social forms (communities, networks, crowds, etc.) we need to look at the lower levels of social interaction where sociability is still a relatively open field.
One way of approaching this very basic level of analysis is through the notion of “probability of communication”. In his famous work on the diffusion of innovations, Everett Rogers notes that the absence of social structure would mean that all communication between members of a population would have the same probability of occurring. In any real setting, of course, this is never the case: people talk (interact, exchange, associate, etc.) with certain individuals more than with others. Beyond the limiting aspects of physical space, the social sciences have identified numerous parameters – such as age, class, ethnicity, gender, dress, modes of expression, etc. – that make communication and interaction between some people a lot more probable than between others. Higher-order social aggregates emerge from this background of attraction and repulsion; sociology has largely concluded that, for all practical purposes, opposites do not attract.
Digital technology largely obliterates the barriers of physical space: instead of being confined to his or her immediate surroundings, an individual can now potentially communicate and interact with all the millions of people registered on the different services of the Social Web. In order to reduce “social overload”, many services allow their users to aggregate around physical or institutional landmarks (cities, universities, etc.) and encourage association through network proximity (the friend of a friend might become my friend too). Many of the social parameters mentioned above are also translated onto the Web in the sense that a person’s informational representations (profile, blog, avatar, etc.) become markers of distinction (Bourdieu) that strongly influence the probability of communication with other members of the service. Especially in youth culture, opposite cultural interests effectively function as social barriers. These are, in principle, not new; their (partial) digitization however is.
Most of the social services online see themselves as facilitators for association and constantly produce “contact trails” that lead to other people, through category browsing, search technology, or automated path-building via backlinking. Informational representations like member profiles are not only read and interpreted by people but also by algorithms that will make use of this data whenever contact trails are being laid. The most obvious example can be found on dating sites: when searching for a potential partner, most services will rank the list of results based on compatibility calculations that take into account all of the pieces of information members provide. The goal is to compensate for the very large population of potential candidates and to reduce the failure rate of social interaction. Without the randomness that, despite spatial segregation, still marks life offline, the principle of homophily is pushed to the extreme: confrontation with the other as other, i.e. as having different opinions, values, tastes, etc. is reduced to a minimum and the technical nature of this process ensures that it passes without being noticed.
In this paper we will attempt to conceptualize the notion of “algorithmic proximity”, which we understand as the shaping of the probability of association by technological means. We do not, however, intend to argue that algorithms are direct producers of social structure. Rather, they intervene on the level of proto-social interaction and introduce biases whose subtlety makes them difficult to study and theorize conceptually. Their political and cultural significance must therefore be approached with the necessary caution.

When sites that involve any kind of ranking change their algorithm, there’ll probably be a spectacle worth watching. When Google made some changes to its search algorithms in 2005, the company was sued by KinderStart.com (a search engine for kids, talk about irony), which went from PageRank riches to rags and lost 70% of its traffic in a day (the case was dismissed in 2007). When Digg finally gave in to a lot of criticism about organized front-page hijacking and changed the way story promotion works to include a measure of “diversity”, the regulars were vocally hurt and unhappy. What I find fascinating about the latter case was the technical problem-solving approach, which implied the programming of nothing less than diversity. It’s not that hard to understand how such a thing works (think “anti-recommendation system” or “un-collaborative filtering”), but still, one has to sit back and appreciate the idea. We are talking about social engineering done by software engineers. Social problem = design problem.
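To see what “programming diversity” might look like in principle – this is pure speculation, since Digg never published its algorithm – one can imagine promoting a story only if its upvoters overlap little with those of previously promoted stories:

```python
def jaccard(a, b):
    """Overlap between two voter sets (0 = disjoint, 1 = identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def promote_diverse(candidates, promoted, threshold=0.5):
    """Promote a story only if its upvoters overlap little with the voters
    of already-promoted stories - a crude, hypothetical 'diversity' measure.

    candidates / promoted: lists of (story_id, voter_id_set) pairs.
    """
    chosen = []
    for story, voters in candidates:
        previous = [v for _, v in promoted] + [v for _, v in chosen]
        if all(jaccard(voters, p) < threshold for p in previous):
            chosen.append((story, voters))
    return [s for s, _ in chosen]

# s2 is backed by nearly the same crowd as s1, so only s1 and s3 make it.
picks = promote_diverse(
    [("s1", {1, 2, 3}), ("s2", {1, 2, 4}), ("s3", {7, 8, 9})],
    promoted=[])
```

Note how the elusive concept gets an analytical core: “diversity” here is simply low Jaccard similarity between voter sets, one contestable formalization among many possible ones.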

The very real-world effects of algorithms are quite baffling and since I started to read this book, I truly appreciate the ingenuity and complex simplicity that cannot be reduced to a pure “this is what I want to achieve and so I do it” narrative. There is a delta between the “want” and the “can” and the final system will be the result of a complex negotiation that will have changed both sides of the story in the end. Programming diversity means to give the elusive concept of diversity an analytical core, to formalize it and to turn it into a machine. The “politics” of a ranking algorithm is not only about the values and the project (make story promotion more diverse) but also a matter – to put it bluntly – of the state of knowledge in computer science. This means, in my opinion, that the politics of systems must be discussed in the larger context of an examination of computer science / engineering / design as in itself an already oriented project, based on yet another layer of “want” and “can”.

Thanks to Joris for pointing out that my blog was hacked. Damn you spammers.