Edit: a map of the English Wikipedia is here.
Wikipedia is a fascinating object for way too many reasons. The way it is produced, the place it has taken in society, it’s size and evolution, and many other aspects are truly remarkable. Studying Wikipedia has become a discipline in itself and while there may be certain signs of fatigue on the editing front, there is still much to learn and to discover. I have recently started to take an interest in looking at the way knowledge is structured in different contexts and the availability of certain tools and datasets makes Wikipedia a perfect object for scrutiny. If it just wasn’t that big. Still, it’s the 21st century and computers are getting really fast, so why not try mapping Wikipedia. All of it.
There are different ways to start such a project, but simply taking the link structure is probably the most obvious. This allows for bypassing the internal taxonomy and may lead to a more “organic” expression of underlying knowledge structures. Unfortunately, computers are not that fast – at least not mine – and so I had to make two concessions: I took a non English variant (I settled for French) and reduced the number of nodes to a (barely) manageable amount. The final graph file (.gdf – do not even think about working with it with less than 4GB of RAM) was built by taking pages that had at least 100 connections with other pages. From an initial 183K pages and 11.5M links I went down to a more manageable 40K and 2M respectively. To make things workable, I chose to visualize the page names only, no nodes, no edges. The result looks like this (click on the image for a very big .png):
Reliable gephi did not only do the graph layout (OpenOrd plugin, 1000 iterations) but dutifully detected “communities” in the network, which actually did work really well. And here is a version in elegant grayscale, this time without community detection:
The graph shows a big dense zone in the middle that is quite unreadable but composed out of world history, politics, geography, and other elements that constitute a core set of knowledge elements that are highly interlinked. While France plays and important role here, these elements are actually very globalized and include countries from all over the world. Could we interpret this as a field of “common” or “shared” knowledge? A set of topics that transcend specialization and form the very core of what our culture considers essential?
To the close right of the very center, there is a rather visible (in orange) cluster on the United States. Around the center you’ll find major historic events and periods (WWII, middle ages, renaissance, etc.). The arts are on the right (mostly music) and France’s most popular art form – Cinema – starts at the top right, in a highly dense orange cluster and goes to the top left, tellingly fusing with theatre. The Sciences form a rather strange blue band the goes from the center top to the top right.
And then there is sports. I was a bit surprised by how much of it there is and how well the clustering and community detection works for identifying individual fields – football, tennis, car racing, and so on. The second surprise was how few “geek” subjects appear on the map. There is a digital technology cluster on the top right but I haven’t found any traces of the legendary Star Trek cluster. In the end, French Wikipedia appears to be a rather classic encyclopedia if you look at it from a subject angle. Could we use such maps to compare subject prominence between cultures?
Obviously, the method for mapping Wikipedia has to be refined to make maps more readable but the results are actually already quite telling. Let’s see whether the same approach can work for the English version – which is a cool 10 times bigger…
Tech support questions will not be answered. Please refer to the FAQ of the tool.