The New York Times is not only a very good newspaper, it is also a really, really interesting archive that provides search access to all articles since 1851 via a pretty nice API. I’ve been meaning to play with it for some time, but things were extremely busy this year. But yesterday, I had some time in the evening and looked into the system a little bit and wrote a couple of scripts to try out some quick ideas.
While the API has all kinds of interesting things – in particular access to the Times’ controlled vocabulary – I am most interested in the article archive and the different possibilities to explore it. Understandably, the API does not provide the full text of articles; but it does search in the full text and for every found article it delivers quite a number of interesting things. Here is an example of what the returned data for a query (“guantanamo bay”) looks like:
While there are many things to go with, I found the manually attributed (and controlled) keywords to be particularly interesting. So I decided to explore and visualize how a particular subject evolves over time inside of this classificatory structure. Because the request rate for the search API is quite generous (10/s, 10K/day) I wrote a short PHP script (grab.php) that grabs this metadata for every article corresponding to a given search query. It simply downloads the data and stores it in a bunch of JSON files. A second script (analyze.php) then parses these files and creates a simple CSV file that can then be visualized with something like R (which I started working with some weeks ago, much easier than I thought, lots of fun).
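To make the second step more concrete, here is a minimal sketch of what an analysis pass of this kind could look like. It is written in Python rather than PHP and is not the actual analyze.php. The JSON layout (a `results` list whose entries carry a `date` string in YYYYMMDD form and a `keywords` list) and the function names are assumptions made for illustration; the real API response uses its own field names.

```python
import csv
import sys
from collections import Counter


def count_keywords_per_year(pages):
    """Count keyword occurrences per year.

    `pages` is an iterable of already parsed JSON documents. Each one is
    ASSUMED to look like:
        {"results": [{"date": "YYYYMMDD", "keywords": ["...", ...]}, ...]}
    """
    counts = Counter()
    for page in pages:
        for article in page.get("results", []):
            year = article.get("date", "")[:4]
            if not year:
                continue  # skip records without a usable date
            for keyword in article.get("keywords", []):
                counts[(year, keyword)] += 1
    return counts


def write_csv(counts, handle):
    """Write one row per (year, keyword) pair, sorted by year then keyword."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["year", "keyword", "count"])
    for (year, keyword), n in sorted(counts.items()):
        writer.writerow([year, keyword, n])


if __name__ == "__main__":
    # Tiny made-up sample standing in for the downloaded JSON files.
    sample_pages = [
        {"results": [
            {"date": "19940512", "keywords": ["IMMIGRATION AND REFUGEES"]},
            {"date": "20020111", "keywords": ["TERRORISM", "PRISONERS OF WAR"]},
        ]},
        {"results": [{"date": "20021203", "keywords": ["TERRORISM"]}]},
    ]
    write_csv(count_keywords_per_year(sample_pages), sys.stdout)
```

A long-format table like this (one row per year and keyword) is convenient for ggplot2, since year, keyword, and count can be mapped directly to x position, y position, and bubble size.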
With the help of the amazing ggplot2 library in R, using “guantanamo bay” as query, I quickly got a first result (click for larger image):
One can quite easily see that Guantanamo Bay was discussed in the 1990s in terms of immigration, asylum, and similar terms, while the current frame (terrorism, etc.) appears just after 9/11. While this script (bubbles.R) provides overview, a second one (bubbles_numbers.R) provides a combination of bubbles and numbers (click for larger image):
There is certainly much more interesting stuff to do with this data (e.g. different types of normalization, taking into account word count and page number, etc.) and I’ll hopefully come back to this in more detail in the future. In the meantime, all scripts can be found here.
Update June 2, 2013:
I’ve added a network export feature to the scripts on github. Generated network files are not limited to subject tags, but include people, organizations, locations, and creative works (e.g. books or movies). If two tags appear on the same article, a link is created and the more often they appear together, the stronger the connection. Here’s a quick visualization, made with gephi, of the most common people (red), organizations (green), and locations (blue) for the query “climate change” (click for larger image):
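The linking rule described in this update (two tags on the same article create a link, and repeated co-occurrence strengthens it) can be sketched in a few lines. This is a Python illustration, not the code from the github scripts; the function name and the input format (one list of tags per article) are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(articles):
    """Build weighted, undirected edges from per-article tag lists.

    `articles` is an iterable of tag lists (one list per article).
    Returns a Counter mapping (tag_a, tag_b) -> number of shared articles,
    with each pair stored in sorted order so A-B and B-A count as one edge.
    """
    edges = Counter()
    for tags in articles:
        unique_tags = sorted(set(tags))  # ignore duplicate tags on one article
        for tag_a, tag_b in combinations(unique_tags, 2):
            edges[(tag_a, tag_b)] += 1
    return edges


if __name__ == "__main__":
    # Made-up tags, for illustration only.
    sample = [
        ["Obama", "EPA", "Kyoto"],
        ["Obama", "EPA"],
        ["EPA"],
    ]
    for (tag_a, tag_b), weight in sorted(cooccurrence_edges(sample).items()):
        print(tag_a, tag_b, weight)
```

The resulting weights can be written out as an edge list in whatever format the visualization tool reads.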
Pages are part of Facebook’s project to suck up the Web. They are also full of data. In the next version of netvizz I will add a feature that lets you dig into that data a little bit. Here is a preview:
This network (visualized with gephi) shows interactions on the Facebook page of the Guardian. I extracted all the likes and comments for the last 80 posts. In total, there are 9,500 users liking and commenting away. Each dark and labeled node is a post while all the others are users. A heat scale (blue => yellow => red) shows how often a user interacts with the page; size shows how often a node was liked or commented on (for posts) or how often it liked and commented (for users).
One can see a core of regulars in the middle of the graph, but the main engagement comes from a large majority of users that have only interacted with a single post. These users drag the big subjects out to the margins in this specific spatialization. Engagement, here, comes from a fleeting audience rather than a more stable group or community.
There is still some testing to do, but I hope to get this feature ready soon for general use.
After trying to map the French version of Wikipedia a couple of days ago, I’ve played around with the much bigger English version this weekend (the dbpedia file I worked with contains 130M links between Wikipedia pages in a cool 20GB). Thanks to a rare lucid moment I was able to transform that thing into a .gdf that is small enough to be opened in gephi. I settled for the 45K pages with the most links (undirected) and started mapping. All three maps I built use the OpenOrd layout algorithm (1000 iterations). The first uses the modularity measure for “community” detection and colors text accordingly (click on the image for a very large version):
The second uses a grey color scale to express the degree (number of links) of a page:
Finally, the same map, but with a different color scale (light blue => yellow => red):
Every version helps with certain readability issues and you can download all three of the maps as a big .psd so you can easily switch between the different modes.
When comparing these maps with their French counterpart, several things are quite remarkable:
- Most importantly, there is no cluster that I would qualify as “common culture” or “shared knowledge”. There is most certainly a large, dense zone at the center but while the French one draws in all kinds of topics, this version has worldwide country information only. I would prudently argue that the English version of Wikipedia shows a more globalized picture of the world, even if there is a large zone of pages on the left that deals with the United States. It’s a bigger and more heterogeneous world that emerges, but there still is a dominant player.
- Sports is even bigger on the English version and typically American sports (Baseball, NASCAR, etc.) show up on the left in smaller, denser clusters compared to the gigantic football (soccer) area on the center to bottom right.
- The Sciences are smaller but entertainment (TV, popular music, comic books, video games, etc.) is much more present. At least at this level of observation.
- There are some seriously “strange” clusters, such as the dense yellow zone on the far right halfway between top and center that shows a group of Russian painters I have never heard of. Not that I’m an expert but I’ve found little trace of any other painters. This shows the weakness of my selection method by link degree – if there was a way to select nodes by page-views, the results would probably be very different, at least for our Russian painters. But it also shows that despite having become a rather respectable Encyclopedia with a quite classic subject outlook, Wikipedia still is a space for off-the-track topics and for communities that are so passionate about a certain subject that they will groom it and grow it.
I plan on releasing the scripts used to build these maps in the future but I want to try out a couple more things before that, most particularly a version that only takes into account in-links, which should reduce the presence of certain “distributor” pages (“events in 2010”, “people alive”, etc.).
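To illustrate why counting only in-links should push “distributor” pages out of the selection, here is a small Python sketch that compares the two ways of picking the top pages. It is not one of the scripts used for the maps; the function names and the input format (a list of source/target pairs) are assumptions made for this example.

```python
from collections import Counter


def top_pages_by_degree(links, n):
    """Select the n pages with the most links, counting both directions."""
    degree = Counter()
    for source, target in links:
        degree[source] += 1
        degree[target] += 1
    return [page for page, _count in degree.most_common(n)]


def top_pages_by_in_degree(links, n):
    """Select the n pages with the most incoming links only."""
    in_degree = Counter(target for _source, target in links)
    return [page for page, _count in in_degree.most_common(n)]


if __name__ == "__main__":
    # Made-up toy graph: one list-like page links out to everything,
    # but nothing links back to it.
    links = [
        ("Events in 2010", "A"), ("Events in 2010", "B"),
        ("Events in 2010", "C"), ("Events in 2010", "D"),
        ("A", "B"), ("C", "B"), ("D", "B"), ("A", "C"),
    ]
    print(top_pages_by_degree(links, 2))
    print(top_pages_by_in_degree(links, 2))
```

In the toy graph, the list-like page makes the cut when links are counted in both directions, because its many outgoing links inflate its degree, but it drops out entirely when only incoming links count.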
Edit: a map of the English Wikipedia is here.
Wikipedia is a fascinating object for way too many reasons. The way it is produced, the place it has taken in society, its size and evolution, and many other aspects are truly remarkable. Studying Wikipedia has become a discipline in itself and while there may be certain signs of fatigue on the editing front, there is still much to learn and to discover. I have recently started to take an interest in looking at the way knowledge is structured in different contexts and the availability of certain tools and datasets makes Wikipedia a perfect object for scrutiny. If only it weren’t that big. Still, it’s the 21st century and computers are getting really fast, so why not try mapping Wikipedia. All of it.
There are different ways to start such a project, but simply taking the link structure is probably the most obvious. This allows for bypassing the internal taxonomy and may lead to a more “organic” expression of underlying knowledge structures. Unfortunately, computers are not that fast – at least not mine – and so I had to make two concessions: I took a non-English variant (I settled for French) and reduced the number of nodes to a (barely) manageable amount. The final graph file (.gdf – do not even think about working with it with less than 4GB of RAM) was built by taking pages that had at least 100 connections with other pages. From an initial 183K pages and 11.5M links I went down to a more manageable 40K and 2M respectively. To make things workable, I chose to visualize the page names only, no nodes, no edges. The result looks like this (click on the image for a very big .png):
Reliable gephi did not only do the graph layout (OpenOrd plugin, 1000 iterations) but dutifully detected “communities” in the network, which actually did work really well. And here is a version in elegant grayscale, this time without community detection:
The graph shows a big dense zone in the middle that is quite unreadable but composed of world history, politics, geography, and other elements that constitute a core set of knowledge elements that are highly interlinked. While France plays an important role here, these elements are actually very globalized and include countries from all over the world. Could we interpret this as a field of “common” or “shared” knowledge? A set of topics that transcend specialization and form the very core of what our culture considers essential?
Just to the right of the very center, there is a rather visible (in orange) cluster on the United States. Around the center you’ll find major historic events and periods (WWII, middle ages, renaissance, etc.). The arts are on the right (mostly music) and France’s most popular art form – Cinema – starts at the top right, in a highly dense orange cluster and goes to the top left, tellingly fusing with theatre. The Sciences form a rather strange blue band that goes from the center top to the top right.
And then there is sports. I was a bit surprised by how much of it there is and how well the clustering and community detection works for identifying individual fields – football, tennis, car racing, and so on. The second surprise was how few “geek” subjects appear on the map. There is a digital technology cluster on the top right but I haven’t found any traces of the legendary Star Trek cluster. In the end, French Wikipedia appears to be a rather classic encyclopedia if you look at it from a subject angle. Could we use such maps to compare subject prominence between cultures?
Obviously, the method for mapping Wikipedia has to be refined to make maps more readable but the results are actually already quite telling. Let’s see whether the same approach can work for the English version – which is a cool 10 times bigger…