Category Archives: search engines

Over the last couple of weeks, things have heated up considerably for Google – on the mobile side with the start of a patent war, but also in the search area, the core of the company’s business. Led by Senator Mike Lee (a Utah Republican), the US Senate’s Antitrust Subcommittee has started to probe into certain aspects of Google’s ranking mechanisms and potential cases of abuse and manipulation.

In a hearing on Wednesday, Lee confronted Eric Schmidt with accusations of tampering with results, and the evidence the Senator presented was in fact very interesting because it raises the question of how to show or even prove that a highly complex algorithmic procedure “has been tampered with”. As you can see in this video, Lee presented a scatter-plot from an “independent study” that compares the search ranking for three price comparison sites (Nextag, Pricegrabber, and Shopper) with Google Price Search using 650 shopping related queries. What we can see on the graph is that while there is considerable variation in ranking for the competitors (a site shows up first for one query and way down for another), Google’s site seems to consistently stick to place three. Lee makes this astounding difference the core of his argument and directly asks Schmidt: “These results are in fact the result of the same algorithm as the rankings for the other comparison sites?” The answer is interesting in itself as Schmidt argues that Google’s service is not a product comparison site but a “product site” and that the study basically compares apples to oranges (“they are different animals”). Lee then homes in on the “uncanny” statistical regularity and says “I don’t know whether you call this a separate algorithm or whether you’ve reverse engineered a single algorithm, but either way, you’ve cooked it!” to which Schmidt replies “I can assure you that we haven’t cooked anything.”

According to this LA Times article, Schmidt’s testimony did not satisfy the senators and there’s open talk about bias and conflict of interest. I would like to add three things here:

1) The debate shows a real mismatch between 20th century concepts of both bias and technology and the 21st century challenge to both of these questions that comes in the form of Google. For the senator, bias is something very blatant and obvious, a malicious individual going to the server room at night, tampering with the machinery, transforming the pure technological objectivity into travesty by inserting a line of code that puts Google in third place most of the time. The problem with this view is of course that it makes a clear and strong distinction between a “biased” and an “unbiased” algorithm and clearly misses the point that every ranking procedure implies a bias. If Schmidt says “We haven’t cooked anything!”, who has written the algorithm? If it comes to an audit of Google’s code, I am certain that no “smoking gun” in the form of a primitive and obvious “manipulation” will be found. If Google wants to favor its own services, there are much more subtle and efficient ways to do so – the company does have the best SEO team one could possibly imagine after all. There is simply no need to “cook” anything if you are the one who specifies the features of the algorithm.

2) The research method applied in the mentioned study however is really quite interesting and I am curious to see how far the Senate committee will be able to take the argument. The statistical regularity shown is certainly astounding and if the hearings attain a deeper level of technological expertise, Google may be forced to detail a significant portion of its ranking procedures to show how something like this can happen. It would, of course, be extremely simple to break the pattern by introducing some random element that does not affect the average rank but adds variation. That’s also the reason why I think that Lee’s argument will ultimately fizzle.

3) The core of the problem, I would argue, is not so much the question of manipulation but the fact that by branching into more and more commercial areas, Google finds itself in a market configuration where conflicts of interest are popping up everywhere they turn. As both a search business and an actor on many of the markets that are, at least in part, ordered by the visibility layering in search results, there is a fundamental and structural problem that cannot be solved by any kind of imagined technical neutrality. Even if there is no “in house SEO” going on, the mere fact that Google search prominently links to other company services could already be seen as problematic. In a sense, Senator Lee’s argument actually creates a potentially useful “way out”: if there is no evil line of code written in the dark of night, no “smoking gun”, then everything is fine. The systematic conflict of interest persists however, and I do not believe that more subtle forms of bias towards Google services could be proven or even be seriously debated in a court of law. This level of technicality, I would argue, is no longer (fully) in reach for this kind of causal demonstration. Not so much because of the complexity of the algorithms, but rather because the “state” of the machine includes the full structure of the dataset it is working on, which means the full index in this case. To understand what Google’s algorithms actually do, looking at these algorithms without the data is no longer enough. And the data is big. Very big.
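The remark in 2) that a random element could add variation without affecting the average rank can be illustrated with a small simulation. This is only a sketch under invented assumptions: the helper `jitter_ranks`, the offset range, and the idea of a site sitting at rank three for all 650 queries are illustrative and say nothing about how Google's ranking actually works.

```python
import random
from statistics import mean, pstdev

def jitter_ranks(ranks, max_offset=2, seed=0):
    """Hypothetical helper: add noise to ranks without changing their mean.

    Offsets are created in +d / -d pairs, so they sum to zero: the average
    rank stays the same while the spread of the ranks increases.
    """
    rng = random.Random(seed)
    offsets = []
    for _ in range(len(ranks) // 2):
        d = rng.randint(0, max_offset)
        offsets.extend([d, -d])
    if len(ranks) % 2:
        offsets.append(0)  # odd length: one rank is left untouched
    rng.shuffle(offsets)
    return [r + o for r, o in zip(ranks, offsets)]

# Assumed toy data: a site that lands on rank 3 for every one of 650 queries.
constant = [3] * 650
noisy = jitter_ranks(constant)

print("constant:", mean(constant), pstdev(constant))
print("noisy:   ", mean(noisy), pstdev(noisy))
```

With these settings the jittered ranks stay between one and five and keep the same mean, but the flat line in a scatter-plot would turn into a band.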

As you can see, I am quite pessimistic about the possibility of bringing the kind of argumentation presented by Senator Lee to a real conclusion. If the case against Microsoft is an indicator, I would argue that this pessimism is warranted.

I do believe that we need to concentrate much more on the principal conflicts of interest rather than actual cases of abuse that may be simply too difficult to prove. The fundamental question is really how far a search company that controls such a large portion of the global market should be allowed to be active in other markets. And, really, should a single company control the search market in the first place? Limiting the very potential for abuse is, in my view, the road that legislators and regulators should take, rather than picking a fight over technological issues that they simply cannot win in the long run.

EDIT: Google has compiled its own Guide to the Hearing. Interesting.

While scholars often underline their commitment to non-deterministic conceptions of “effects”, models of causality in the human and social sciences can still be a bit simplistic sometimes. But a more subtle approach to causality would have to concede that, while most often cumulative and contradictory, lines of causation can sometimes be quite straightforward. Just consider this example from Commensuration as a Social Process, a great text from 1998 by Espeland and Stevens:

Faculty at a well-regarded liberal arts college recently received unexpected, generous raises. Some, concerned over the disparity between their comfortable salaries and those of the college’s arguably underpaid staff, offered to share their raises with staff members. Their offers were rejected by administrators, who explained that their raises were ‘not about them.’ Faculty salaries are one criterion magazines use to rank colleges. (p.313)

This is a rather direct effect of ranking techniques on something very tangible, namely salary. But the relative straightforwardness of the example also highlights a bifurcation of effects: faculty gets paid more, staff less. The specific construction of the ranking mechanism in question therefore produces social segmentation. Or does it simply reinforce the existing segmentation between faculty and staff that led college evaluators to construct the indicators the way they did in the first place? Well, there goes the simplicity…

In the beginning, it was all about the algorithm. PageRank and its “no humans involved” mantra have dominated Google since its inception. In recent years however, Google has started to expand the role of “conceptual” knowledge in different areas of its services. The main search bar and its capacity to do all kinds of little tricks is a good example, but I was really quite astounded how seamless concept integration has become on my last trip to Google Translate:

The Official Google Blog has recently written about changes to the ranking procedure that were introduced after a NYT article wrote about an online retailer that had apparently found out that being nasty to your customers would help get good search rankings because all of the complaints and bad user reviews would get you links and boost PageRank. While Google denies that this logic would work, they have added a ranking layer to their search results that specifically targets online merchants. The interesting thing about the blog post is that the author details several things that the company could have done but didn’t do while actually revealing very little about what the “algorithmic solution” they implemented actually consists of. From the post:

Instead, in the last few days we developed an algorithmic solution which detects the merchant from the Times article along with hundreds of other merchants that, in our opinion, provide an extremely poor user experience. The algorithm we incorporated into our search rankings represents an initial solution to this issue, and Google users are now getting a better experience as a result.

While I do not believe that transparency is the prime solution to the gatekeeper issues surrounding search, this paragraph really is strikingly vague. Has Google compiled a list of merchants that are systematically downranked? How is this list compiled? What does “in our opinion” mean? Is this “opinion” expressed in the form of an algorithmic procedure (one could imagine using the hReview microformat to collect reviews on merchants)?

We’ll probably not get any answers to these questions, but the case shows how murky the whole ranking thing has become: in an ever-growing online world, search visibility has extremely important financial ramifications (despite the social media hype) and I believe that companies like Google will increasingly rely on human judgment as a complement to algorithmic procedures (which are just another form of human judgment BTW). This will certainly lead to more legal activity around ranking in the future because courts still understand human meddling a lot better than software design…

Yesterday, Microsoft announced another step in their “long-term partnership” with Facebook. The two companies have had close ties since Microsoft invested a hefty sum in Facebook in 2007 and the former has managed advertisement on the latter’s site for quite a while. The “next step” will basically add a “social layer” to Bing search results (go to Ars Technica for a writeup or All Things Digital for a liveblog of the PR event) and this is actually a pretty big thing. Google has certainly taken contextual information into account when deciding which results to show and how to rank them: physical location, search history, and gmail contacts have been part of that process for a while, but the effects have been rather subtle.

Bing’s new features basically use the same technical layer as the Facebook boxes that popped up all over the Web about half a year ago (most modern browsers have plug-ins that allow you to block those by the way). Bing detects the Facebook cookie while you’re on their site and adds a couple of features that allow you to interact with “friends” more easily. There are some basic convenience features but it is the “liked results” that are the most remarkable: Bing will use your contacts’ “likes” to rank results. While we will have to wait to see how these features will pan out, social search may look something like this:

Bing social search interface

In this example, the first result is the announcement of a news article on the release of the DVD version of Iron Man 2, and this would hardly be a top-ranked result without the social layer. If Bing continues to make inroads on Google, the “like” button may take on additional importance for driving traffic and marketeers will most certainly devise new ways to get people to “like” stuff – e.g. “press the button and win a free t-shirt”.

Cass Sunstein’s arguments on the dangers of echo chambers – “incestuous amplification” in social groups – will certainly be taken up again, and perhaps rightfully so: while the Internet remains a beautifully heterogeneous mess, the algorithmically sustained support for the logic of homophily (“birds of a feather…”) that can be observed in more and more places on the Web merits critical examination. While Diana Mutz’s work makes the inconvenient argument that “hearing the other side” of political debate may actually lead to less political engagement, our representative systems of democratic governance require a certain willingness to accept different political viewpoints (that always float on less clearly delineated cultural sensibilities) as sincere and legitimate. Also, adding a “friend” dimension to yet another dimension of the Web could be seen as a further reduction of the “publicness” that, according to Michael Schudson, characterizes working democratic discourse. Being able to dissociate ourselves from our private entanglements and take into account the interests of those who do not resemble us is perhaps the central prerequisite to successfully navigating a smaller planet.

Bing’s new features are certainly not the end of life as we know it but I believe that the privacy question – as important as it is – is covering a series of more difficult problems that sit at the heart of political life in the age of the Internet…

…but you can go ahead and waste everybody else – according to Google’s suggest function at least:

This is interesting because it is very obvious that Google erases certain queries in their suggest function (porn, etc.) and the idea that the Internet would “make” suggestible teenagers kill themselves is a recurring and media-fed scare that, as a consequence, is one of the few domains where censoring is near consensual. What I find interesting though is that all these other carnage scenarios do not get the DELETE FROM treatment, although one may argue that killing oneself is not more condemnable than killing somebody else.

But independently of this philosophical question (the only one worth pondering according to Camus, remember?), Google suggest is yet another way to query the closest thing to god there is: Google’s database; and the way certain queries are removed, most certainly by hand (BTW, “je veux me” on google.fr DOES suggest that you may want to end your life)…

…is so much easier if you’ve got a couple of popular pages to advertise on…

chrome_suggest_march_2010

…and another one…

chrome_suggest_april_2010.JPG

…browser wars all over again…

When it comes to search interfaces, there are a lot of good ideas out there, but there is also a lot of potential for further experimentation. Search APIs are a great field for experimentation as they allow developers to play around with advanced functionality without forcing them to work on a heavy backend structure.

Together with Alex Beaugrand, a student of mine, I have built (a couple of months ago) another little search mashup / interface that allows users to switch between a tag cloud view and a list / cluster mode. contextDigger uses the delicious and Bing APIs to widen the search space using associated searches / terms and then Yahoo BOSS to download a thousand results that can be filtered through the interface. It uses the principle of faceted navigation to shorten the list: if you click on two terms, only the results associated with both of them will appear…
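The filtering logic behind this kind of faceted navigation comes down to a subset test. The following is a minimal sketch, not contextDigger's actual code: the `results` structure, its `terms` field, and the function `filter_by_terms` are hypothetical and only illustrate the principle.

```python
def filter_by_terms(results, selected_terms):
    """Keep only the results associated with every selected term."""
    wanted = set(selected_terms)
    # `wanted <= r["terms"]` is True when all selected terms are present.
    return [r for r in results if wanted <= r["terms"]]

# Assumed toy data: each result carries the set of terms extracted for it.
results = [
    {"title": "A", "terms": {"search", "api", "mashup"}},
    {"title": "B", "terms": {"search", "privacy"}},
    {"title": "C", "terms": {"api", "mashup"}},
]

filtered = filter_by_terms(results, ["search", "api"])
print([r["title"] for r in filtered])  # ['A']
```

Every additional term clicked can only shorten the list, and with no term selected the full list is shown.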

Since Yahoo recently ~sold its search business to Microsoft (see this NYT article for details) a lot of people were asking themselves what would happen to the Yahoo search APIs, which are in fact some of the most powerful free tools out there to build search mashups with. As Simon Wilson indicates in this blog post, some of them (Term Extraction and Contextual Web Search) are closing down at the end of August. Programmable Web lists 33 mashups that use the Term Extraction service and these sites will either have to close down or start looking for alternatives. This highlights a problem that can be a true roadblock for developing applications making heavy use of APIs. My own termcloud search and its spiced up cousin contextdigger use Yahoo BOSS and quite honestly, if MS kills that service, these experiments (and many others) will be gone for good, because Yahoo BOSS is the only search API that provides a list of extracted keywords for each delivered Web result.

If service providers can close APIs at will, developers might hesitate when deciding whether to put in the necessary coding hours to build the latest mashup. But it is mashups that over the last years have really explored many of the directions left blank by “pure” applications. This creative force should be cherished and I wonder if there may be a need for something similar to Creative Commons for APIs – a legal construct that gives at least some basic rights to mashup developers…

After having finished my paper for the forthcoming deep search book I’ve been going back to programming a little bit and I’ve added a feature to termCloud search, which is now v0.4. The new “show relations” button highlights the eight terms with the highest co-occurrence frequency for a selected keyword. This is probably not the final form of the feature but if you crank up the number of terms (with the “term+” button) and look at the relations between some of the less common words, there are already quite interesting patterns being swept to the surface. My next Yahoo BOSS project, termZones, will try to use co-occurrence matrices from many more results to map discourse clusters (sets of words that appear very often together), but this will need a little more time because I’ll have to read up on algorithms to get that done…
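To give an idea of what a “show relations” computation can look like, here is a minimal sketch of counting co-occurrences for a selected keyword. It is not termCloud search's actual code: the function `top_cooccurring`, the toy data, and the choice to count each term once per result are assumptions made for illustration.

```python
from collections import Counter

def top_cooccurring(results_terms, keyword, n=8):
    """Return the n terms that most often appear in the same result as keyword."""
    counts = Counter()
    for terms in results_terms:
        terms = set(terms)  # count each term at most once per result
        if keyword in terms:
            counts.update(terms - {keyword})
    # Sort by descending count, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

# Assumed toy data: one list of extracted keywords per search result.
docs = [
    ["google", "ranking", "bias"],
    ["google", "ranking", "antitrust"],
    ["google", "ranking", "bias", "senate"],
    ["bing", "facebook", "ranking"],
]

print(top_cooccurring(docs, "google", n=2))  # ['ranking', 'bias']
```

Running the same count for every keyword rather than a single one would yield a full co-occurrence matrix of the kind mentioned for termZones.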

PS: termCloud Search was recently a “mashup of the day” at programmableweb.com