
Last week Google introduced its new, ontology-driven infobox and I have just been included in the roll-out. The whole knowledge modelling thing is quite a slippery slope though. Just compare these two boxes:

I don’t know about you, but I guess that Hitler l’auteur (“Hitler the author”) is perhaps not the most fitting template.

I have started to work a bit on a forthcoming paper on the history and conceptual thrust of probabilistic indexing in Information Retrieval (“naive Bayes classifiers” for the connoisseurs) – which will also be a chapter of a forthcoming book – and while researching I stumbled upon a beautiful little text. M. E. Maron, one of the central pioneers in that field, worked at the RAND Corporation in the early 1960s and, when not developing prototype systems that used Bayes’ theorem for “fuzzy” document scoring (a toy sketch of that scoring idea follows at the end of this post), he wrote memoranda on wider subjects such as cybernetics; “Computers and our future” is a very short piece from 1966 that is extremely lucid in terms of the questions it asks. Consider these three points about “the basic characteristics of machines [computers] and at their implications”:

  • Computers operate at exceedingly high speeds. What does this imply? This means that if a high speed machine is used to control a complex situation, then it could compute an action to be taken and execute that action before a human could intervene. What are the potential dangers?
  • Computers, at least at present, demand extreme precision in their instructions. They take their instructions literally. Could there be a tendency to delegate a complex decision to a machine and find out that the machine did what we asked, but that it was not what we wanted – because we ourselves did not fully comprehend the fine structure of our own instructions? What are the full implications of this?
  • Computers have the capacity to handle large amounts of data. They can digest, analyze and relate these data in complete detail. If these data concern financial and personal information on people, what are the implications for the concept of privacy, for improper manipulation and control? What happens when large amounts of information about the economic and political aspects of a society are fed back to the citizens of that society? What is the influence of this information on the truth of the information? Can this type of information, when fed back, cause instabilities – economic and political? (Could the information flow – feedback, overload, and instability – be modeled?)

I would argue that these are extremely timely questions, and they show how aware at least some of the technical pioneers were of the wider implications of the work they were doing. There is a tendency to caricature the cyberneticians of the 1950s and 1960s as narrow-minded Taylorist technocrats, but I guess the story is more complicated after all…
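To make the mention of Bayes’ theorem and “fuzzy” document scoring a bit more concrete, here is a toy sketch in Python. It is my own simplified illustration (made-up term probabilities, invented document names, and a crude term-independence assumption), not a reconstruction of Maron’s actual prototypes:

```python
# Toy illustration of probabilistic indexing in the naive Bayes spirit:
# score a document for a query by comparing how typical each query term is
# for that document with how common it is overall. All numbers are made up.
from math import log

# Hypothetical per-document term probabilities, P(term | document)
p_term_given_doc = {
    "doc1": {"computer": 0.30, "privacy": 0.05, "speed": 0.20},
    "doc2": {"computer": 0.10, "privacy": 0.40, "speed": 0.02},
}
# Hypothetical background probabilities, P(term), over the whole collection
p_term_background = {"computer": 0.08, "privacy": 0.03, "speed": 0.04}

def score(doc_id, query_terms, smoothing=1e-3):
    """Log-odds style relevance score under a term-independence assumption."""
    s = 0.0
    for t in query_terms:
        p_doc = p_term_given_doc[doc_id].get(t, smoothing)
        p_bg = p_term_background.get(t, smoothing)
        s += log(p_doc / p_bg)  # terms characteristic of the document raise the score
    return s

query = ["computer", "privacy"]
ranking = sorted(p_term_given_doc, key=lambda d: score(d, query), reverse=True)
print(ranking)  # ['doc2', 'doc1']: 'privacy' is far more characteristic of doc2
```

The ratio inside the logarithm simply weighs how typical a query term is for a given document against how common it is in general; real systems estimate these probabilities from data instead of hard-coding them, but the ranking logic is the same in spirit.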

Yesterday, Google introduced a new feature that represents a substantial extension of how their search engine presents information and marks a significant departure from some of the principles that have underpinned their conceptual and technological approach since 1998. The “knowledge graph” basically adds a layer to the search engine that is based on formal knowledge modelling rather than word statistics (relevance measures) and link analysis (authority measures). As the title of the post on Google’s search blog aptly points out, the new features work by searching “things, not strings”, because what they call the knowledge graph is simply a – very large – ontology, a formal description of objects in the world. Unfortunately, the roll-out is gradual and I have not yet been able to access the new features, but the descriptions, pictures, and video paint a rather clear picture of what product manager Johanna Wright calls the move “from an information engine to a knowledge engine”. In terms of the DIKW model (Data-Information-Knowledge-Wisdom), the new feature proposes to move up a layer by adding a box of factual information on a recognized object (the examples Google uses are the Taj Mahal, Marie Curie, Matt Groening, etc.) next to the search results. From the presentation, we can gather that the 500 million objects already referenced include a large variety of things, such as movies, events, organizations, ideas, and so on.
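To illustrate what a “thing” rather than a “string” might look like, here is a purely hypothetical entity record sketched in Python; the identifiers and the schema are invented for the example and have nothing to do with Google’s internal representation:

```python
# A "string" is just text to match against; a "thing" is a typed object with facts.
# Everything below is invented for illustration, not an actual schema.
marie_curie = {
    "id": "/entity/marie_curie",          # hypothetical identifier
    "type": ["Person", "Scientist"],
    "facts": {
        "born": "1867-11-07",
        "died": "1934-07-04",
        "field": ["Physics", "Chemistry"],
        "awards": ["Nobel Prize in Physics (1903)",
                   "Nobel Prize in Chemistry (1911)"],
    },
    "related": ["/entity/pierre_curie", "/entity/irene_joliot_curie"],
}

# A fact box can then be filled by looking facts up directly,
# instead of pointing the user to pages that mention the name.
print(marie_curie["facts"]["awards"])
```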

This is really a very significant extension of the current logic and, although we’ll need more time to try things out and get a better understanding of what this actually means, there are a few things that we can already single out:

  • On a feature level, the fact box brings Google closer to “knowledge engines” such as Wolfram Alpha, and, as we learn from the explanatory video, this explicitly includes semantic or computational queries, such as “how many women won the Nobel Prize?” (see the toy sketch after this list).
  • If we consider Wikipedia to be a similar “description layer”, the fact box can also be seen as a competitor to everybody’s favorite encyclopedia, and a further step in the direction of bringing information directly to the surface of the results page instead of simply referring to a location. This means that users no longer have to leave the Google garden to find a quick answer. It will be interesting to see whether this actually shows up in Wikipedia’s traffic stats.
  • The introduction of an ontology layer is a significant departure from the largely statistical and graph-theoretical methods favored by Google in the past. While features based on knowledge modelling have proliferated around the margins (e.g. in Google Maps and Local Search), the company is now bringing them to center stage. From what I understand, the selection of “facts” to display will be largely driven by user statistics, but the facts themselves come from places like Freebase, which Google bought in 2010. While large-scale ontologies were prohibitively costly to build in the past, a combination of crowd-sourced databases (Wikipedia, etc.), the open data movement, better knowledge extraction mechanisms, and simply the resources to hire people to do manual repairs has apparently made them a viable option for a company of Google’s size.
  • Competing with the dominant search engine has just become a lot harder (again). If users like the new feature, the threshold for market entry moves up because this is not a trivial technical gimmick that can be easily replicated.
  • The knowledge graph will most certainly spread out into many other services (it’s already implemented in the new Google Docs research bar), further boosting the company’s economies of scale and enhancing cross-navigation between the different services.
  • If the fact box – and the features that may follow – becomes a pervasive and popular feature, Google’s role in making information and knowledge accessible, and in defining their shape, scope, and relevance, will be extended even further. This is a reason to worry a bit more, not because the Google tools as such are a danger, but simply because of the levels of institutional and economic concentration the Internet has enabled. The company has become what Michel Callon calls an “obligatory passage point” in our relation to the Web and beyond; the knowledge graph has the potential to exacerbate this situation.
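To see why a question like the Nobel Prize example in the first point above is easier to answer over “things” than over “strings”, here is a toy sketch in Python; the handful of records is invented for illustration, so the count is obviously not the real-world answer:

```python
# Toy contrast between string matching and a query over structured records.
# Only a few laureates are listed here, purely for illustration.
laureates = [
    {"name": "Marie Curie",     "gender": "female", "prizes": 2},
    {"name": "Albert Einstein", "gender": "male",   "prizes": 1},
    {"name": "Dorothy Hodgkin", "gender": "female", "prizes": 1},
    {"name": "Niels Bohr",      "gender": "male",   "prizes": 1},
]

# "How many women won the Nobel Prize?" becomes a computation over typed facts,
# not a lookup of pages that happen to contain those words.
women_winners = sum(1 for p in laureates if p["gender"] == "female")
print(women_winners)  # 2 in this toy dataset
```

The point is simply that the answer is computed from typed facts rather than retrieved from a page that happens to contain the right words.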

This development looks like another element in the war for dominance on the Web, which is currently being fought at a frenetic pace. Since the introduction of actions into Facebook’s social graph, it has become clear that approaches based on ontologies and concept modelling will play an increasing role in this fight. In a world mediated by screens, the technological control of meaning – the one true metamedium – is the new battleground. I guess that this is not what Berners-Lee had in mind for the Semantic Web…