Category Archives: search engines
Winter holidays and finally a little bit of time to dive into research and writing. After giving a talk at the Deep Search conference in Vienna last month (videos available here), I’ve been working on the paper for the conference book, which should come out sometime next year. The question is still “democratizing search” and the subject is really growing on me, especially since I started to read more on political theory and the different interpretations of democracy that are out there. But more on that some other time.
One of the vectors for making search more productive in the framework of liberal democracy is to think about search not merely as the fastest way to get from a query to a Web page, but to think about how modern technologies might help in providing an overview of the complex landscape of a topic. I have always thought that clusty – a metasearcher that takes results from Live, Ask, DMOZ, and other sources and organizes them in thematic clusters – does a really good job in that respect. If you search for “globalisation”, the first ten clusters are: Economic, Research, Resources, Anti-globalisation, Definition, Democracy, Management, Impact, Economist-Economics, Human. Clicking on a cluster will bring you the results that clusty’s algorithms judge as pertinent for the term in question. Very often, just looking at the clusters gives you a really good idea of what the topic is about, and instead of just homing in on the first result, you may find that the search process itself has taught you something.
I’ve been playing around with Yahoo BOSS for one of the programming classes I teach and I’ve come up with a simple application that follows a similar principle. TermCloud Search (edit: I really would like to work on this some more and the name SearchCloud was already taken, so I changed it…) is a small browser-based app that uses the “keyterms” (a list of keywords the system provides you with for every URL found) feature of Yahoo BOSS to generate a tagcloud for every search you make. It takes 250 results and lets the user navigate these results by clicking on a keyword. The whole thing is really just a quick hack but it shows how easy it is to add such “overview” features to Web search. Just try querying “globalisation” and look at the cloud – although it’s just extracted keywords, a representation of the topic and its complexity does emerge at least somewhat.
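To make the principle concrete, here is a minimal sketch in Python of how a cloud can be built from per-result key terms. The record layout (`url`, `keyterms`), the sample data and the function names `build_cloud` and `filter_by_term` are assumptions made for illustration; this is neither the actual Yahoo BOSS response format nor the code behind TermCloud Search.

```python
from collections import Counter

# Hypothetical, simplified result records; the real Yahoo BOSS response
# format is not reproduced here.
results = [
    {"url": "http://example.org/a", "keyterms": ["economy", "trade", "wto"]},
    {"url": "http://example.org/b", "keyterms": ["trade", "protest"]},
    {"url": "http://example.org/c", "keyterms": ["economy", "trade"]},
]

def build_cloud(results, min_size=10, max_size=32):
    """Map each key term to a font size, scaled linearly by the number
    of results that carry the term."""
    counts = Counter(term for r in results for term in set(r["keyterms"]))
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        term: min_size + (n - lo) * (max_size - min_size) // span
        for term, n in counts.items()
    }

def filter_by_term(results, term):
    """Return the results tagged with the key term the user clicked on."""
    return [r for r in results if term in r["keyterms"]]

cloud = build_cloud(results)
economy_hits = filter_by_term(results, "economy")
```

Terms that appear in many results get a large font size, rare ones a small one, and clicking a term simply narrows the result list to the pages that carry it.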
I’ll certainly explore this line of experimentation over the next months; jQuery is making the whole API thing really fun, so stay tuned. For the moment I’m kind of fascinated by the possibilities and by imagining search as a pedagogical process, not just a somewhat inconvenient stage in accessing content that has to be sped up by personalization and such. Search can become in itself a knowledge-producing (not just knowledge-finding) activity by which we may explore a subject on a more general level.
And now I’ve got an article to finish…
Mashable.com has a piece on Google’s expanding media empire and there is one observation that is actually quite obvious but which I’ve never really thought about:
It becomes pretty clear how Google is going about launching new products or acquiring others: analyzing the most popular topics within its search engine.
People are searching a lot for Second Life? All right, let’s launch our own 3D virtual world then. Google Trends already exploits search statistics for really simple trend / market analysis, but in a dynamic marketplace like the Web the vast number of search queries Google registers can really be a much more formidable tool for taking society’s pulse. There is no doubt that Google uses this data internally for some heavy market research and I could imagine that the company might license these tools or data to third parties in the future. Nielsen would get some serious competition.
The point I find really interesting about this matter is that Google is mostly criticized for commercially biased search results, its monopoly on online search and the gathering of data that might be used to spy on citizens – I have yet to read something that reflects on the collection of data about users’ search behavior not only as potentially dangerous to individual rights but as a unique tool for corporate strategy. Mining their all-knowing logfile might give Google a competitive advantage that other companies simply cannot emulate. Spotting shifts in cultural trends early could give their business planning an asset that money (currently) cannot buy. It would be prudent to convert to Googlism while they still accept new members.
This morning Jonah Bossewitch pointed me to an article over at Wired, authored by Chris Anderson, which announces “The End of Theory”. The article’s main argument in itself is not very interesting for anybody with a knack for epistemology – Anderson has apparently never heard of the induction / deduction discussion and has a limited idea about what statistics does – but there is a very interesting question lurking somewhere behind all the Californian Ideology and the following citation points right to it:
We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
One could point to the fact that the natural sciences had their experimental side for quite a while (Roger Bacon advocated his scientia experimentalis in the 13th century) and that a laboratory is in a sense a pattern-finding machine where induction continuously plays an important role. What interests me more though is Anderson’s insinuation that statistical algorithms are not models. Let’s just look at one of the examples he uses:
Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required.
This is a very limited understanding of what constitutes a model. I would argue that PageRank does in fact rely very explicitly on a model which combines several layers of justification. In their seminal paper on Google, Brin and Page write the following:
PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank.
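The random surfer described in this quote can be written down in a few lines, which makes the modeling assumptions rather visible. The following Python sketch uses a common normalized power-iteration variant, not Google’s actual implementation; the toy graph `web`, the function name `pagerank`, the damping value of 0.85 and the way dead-end pages are handled are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict that maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # The bored surfer: with probability (1 - damping) jump to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Otherwise follow one of the outgoing links with equal probability.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dead end (an assumption of this sketch): restart on a random page.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Toy graph: page "d" links out but nobody links to it.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

Every choice in there (the damping factor, equal probability for all links on a page, what happens at dead ends) is a decision about how users are supposed to behave, which is exactly what one would call a model.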
The assumption behind this graph-oriented justification is that people do not place links randomly but do so with purpose. Linking implies attribution of importance: we don’t link to documents that we’re indifferent about. The statistical exploration of the huge graph that is the Web is indeed oriented by this basic assumption and adds the quite contestable rule that whatever the greatest number of linkers thinks important shall be most visible. I would, then, argue that there is no experimental method that is purely inductive, not even neural networks. Sure, on the mathematical side we can explore data without limitations concerning their dimensionality, i.e. the number of characteristics that can be taken into account; the method of gathering data is however always a process of selection that is influenced by some idea or intuition that at least implicitly has the characteristic of a model. There is a deductive side to even the most inductive approach. Data is made, not given, and every projection of that data is oriented. To quote Fernando Pereira:
[W]ithout well-chosen constraints — from scientific theories — all that number crunching will just memorize the experimental data.
As Jonah points out, Anderson’s article is probably a straw man argument whose sole purpose is to attract attention, but it points to something that is really important: too many people think that mathematical methods for knowledge discovery (data mining, that is) are neutral and objective tools that will find what’s really there and show the world as it is without the stain of human intentionality; these algorithms are therefore not seen as objects of political inquiry. In this view statistics is all about counting facts and only higher layers of abstraction (models, theories,…) can have a political dimension. But it matters what we count and how we count.
In the end, Anderson’s piece is little more than the habitual prostration before the altar of emergence and self-organization. Just exchange the invisible hand for the invisible brain and you’ll get pop epistemology for hive minds…
A couple of weeks ago, Google released App Engine, a Web hosting platform that makes the company’s extensive knowledge in datacenter technology available to the general public. The service is free for the moment (including 500MB in data storage and a quite generous quota of CPU cycles) but there is a commercial service in preparation. Apps use Google Accounts, Google’s account system, for user identification and are currently limited to (lovely) Python as programming language. I don’t want to write about the usual Google über alles matter but rather restate an idea I proposed in a paper in 2005. When criticizing search engine companies, authors generally demand more inclusive search algorithms, less commercial results, transparent ranking algorithms or non-commercial alternatives to the dominant service(s). This is all very important but I fear that a) there cannot be search without bias, b) transparency would not reduce the commercial coloring of search results, and c) open source efforts would have difficulties mustering the support on the hardware and datacenter front to provide services to billions of users and effectively take on the big players. In 2005 I suggested the following:
Instead of trying to mechanize equality, we should obligate search engine companies to perform a much less ambiguous public service by demanding that they grant access to their indexes and server farms. If users have no choice but to place confidence in search engines, why not ask these corporations to return the trust by allowing users to create their own search mechanisms? This would give the public the possibility to develop search algorithms that do not focus on commercial interest: search techniques that build on criteria that render commercial hijacking very difficult. Lately we have seen some action to promote more user participation and control, but the measures undertaken are not going very far. From a technical point of view, it would be easy for the big players to propose programming frameworks that allow writing safe code for execution in their server environment; the conceptual layers already are modules and replacing one search (or representation) module with another should not be a problem. The open source movement as part of the civil society has already proven its capabilities in various fields and where control is impossible, choice might be the only answer. To counter complete fragmentation and provide orientation, we could imagine that respected civic organizations like the FSF endorse specific proposals from the chaotic field of search algorithms that would emerge. In France, television networks have to invest a percentage of their revenue in cinema, why not make search engine companies dedicate a percentage of their computer power to algorithms written by the public? This would provide the necessary processing capabilities to civil society without endangering the business model of those companies; they could still place advertising and even keep their own search algorithms a secret. But there would be alternatives – alternative (noncommercial) viewpoints and hierarchies – to choose from.
I believe that the Google App Engine could be the technical basis for what could be called the Google Search Sandbox, a hosting platform equipped with either an API to the company’s vast indexes or even something as simple as a means to change weights for parameters in the existing set of algorithms. A simple JSON input like {"shop":"-1", "checkout":"-1", "price":"-1", "cart":"-1", "bestseller":"-1"} could be enough to e.g. eliminate Amazon pages from the result list. SEOing for these scripts would be difficult because there would be many different varieties (one of the first would be bernosworld.google.com – we aim to displease! no useful results guaranteed!). It is of course not in Google’s best interest to implement something like this because many scripts might direct users away from commercial pages using AdSense, the foundation of the company’s revenue stream. But this is why we have governments. Hoping for or even legislating more transparency and “inclusive” search might be less effective than people wish. I demand access to the index!
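To give an idea of how little such a sandbox script would need to do, here is a minimal Python sketch of the weight idea. Everything in it is hypothetical: no such Google API exists, and the result tuples, the base scores, the `rerank` function and its `drop_below` parameter are assumptions made purely for illustration.

```python
import json

# User-supplied term weights, in the spirit of the JSON input above.
weights = json.loads(
    '{"shop": "-1", "checkout": "-1", "price": "-1", "cart": "-1", "bestseller": "-1"}'
)

# Hypothetical results: (url, base score assigned by the engine, page text).
results = [
    ("http://shop.example.com/book", 3.0, "Add to cart and checkout at the best price"),
    ("http://example.org/essay", 2.0, "An essay on the history of the book"),
    ("http://example.net/review", 1.5, "Review of a bestseller novel"),
]

def rerank(results, weights, drop_below=None):
    """Add the weight of every listed term found on a page to its base score,
    optionally drop pages that fall below a threshold, and sort by score."""
    scored = []
    for url, base_score, text in results:
        words = set(text.lower().split())
        adjustment = sum(float(w) for term, w in weights.items() if term in words)
        score = base_score + adjustment
        if drop_below is None or score >= drop_below:
            scored.append((score, url))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]

ranked = rerank(results, weights)
filtered = rerank(results, weights, drop_below=0.5)
```

In this toy case the shop page, which started at the top, ends up last in `ranked` and disappears entirely from `filtered`. A real index would of course need far more robust term matching, but the principle of user-defined weights stays the same.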