Wikipedia graph mining: dynamic structure of collective memory

by on under Research
9 minute read

This is the accompanying blogpost for our upcoming research paper (read preprint on arXiv); joint work with Kirell Benzi, Benjamin Ricaud, and Pierre Vandergheynst (EPFL, LTS2). Here, we focus on the results, omitting the details of the algorithm and the implementation.

Intro

Wikipedia is a great source for data analysis due to its outstanding scale and the graph structure. Tens of millions of visitors surf it daily, leaving their footprint on the Web. The combination of the Wikipedia graph structure and visitor activity on the pages gives us the dynamic graph – the graph with time-series signals on the nodes. The dynamic nature of the graph makes the large-scale analysis problem complicated.

In the original paper we analyze the Wikipedia graph. The aim is to detect collective memories using the activity of the Wikipedia visitors. We use graph-based approach to build our model. The computational model is inspired by the synaptic plasticity and Hebbian theory.

Not surprisingly, we could not fit all the results into the paper. Apart from that, PDF is a poor format for communicating the research findings. The aim of this post is to show the results in an interactive way. While reading the paper and this post, we encourage you to open the graphs, appearing everywhere in this post, and play with them: zoom, click, move, search, and select. This is by far the funniest way to plunge into the main results of our work.

Graphs are interactive

  1. Click on any graph in this post to open it in a new window.
  2. Zoom, click on the nodes, search pages by name, highlight clusters by color.
    • When you click on a node – you select all the neighbors.
    • When you select a cluster – you select all the nodes in this cluster.
    • The list of selected nodes appears on the right.

Works best in the latest version of Chrome. DO NOT try to open the graphs on a smartphone. The graphs are too large and it might take forever to render.

Dataset

The original datasets are publicly available on the Wikimedia website. We took the SQL dumps of the English Wikipedia articles to create the graph. The visitor activity is the number of visits per page per hour. We consider the period from 02:00, 23 September 2014 to 23:00, 30 April 2015. Pre-processing details are described in our paper in the Dataset section.

Dynamics of the Wikipedia network

7 months dynamics Wikipedia graph

In the paper we make an assumption that the dynamics of the graph can affect its structure. We apply the update rule, based on the signal on the nodes, to observe this effect. Here we show that the Wikipedia graph can self-organize into the sets of meaningful communities of the nodes, if we take into account the visitor activity dynamics of the graph. Click on the graph on the right and explore the result by yourself.

This graph is the result of 7-month dynamics of visitor activity on Wikipedia. Here you can find main events that had taken place over the considered period. The stable or scheduled events, like tournaments, awards ceremonies, contests, and most popular holidays form big clusters. The unstable or unexpected events, like incidents and accidents are grouped in small clusters. Even though, this graph provides good summary of the dynamic patterns, we can see only the final result. What is more important, is to get insights on the dynamics of the graph over time. How the clusters emerge, evolve, and disappear? To answer this question, we pick one particular event and look at its dynamics in details.

Dynamics of an event: NFL championship

In order to understand the dynamics of evolution of the graph, we pick one of the most popular events, highlighted in English Wikipedia – the National Football League (NFL) championship. We consider the season 2014-2015. The plot is on the right (click to enlarge). For the plot interpretability we extracted 30 NFL teams out of 485 pages in the original cluster. The timeline shows the overall cluster activity over the period of 7 months. The dynamics timeline of the graph and the NFL cluster evolution is illustrated on the top row. It reflects the interest of the NFL fans in the championship. The cluster is small and sparse at the beginning of the championship and becomes denser and bigger, approaching the final game date. The behavior of the Wikipedia visitors during the day of the final game Super Bowl is outstanding. The activity of the NFL fans is much higher, comparing to the activity of other Wikipedia users. It makes an analogy to the real life, when during the finals the fans become the most active people on the streets.

The NFL championship is just an example of a detected event and its evolution. You can dig into the graphs of monthly activity and check out other detected event-clusters. The total number of the detected events is 172. Click on the graphs below to open an interactive version and explore by yourself.

October   November   December   January   February   March   April
           

The NFL cluster is a good example of a stable event, represented as one of the biggest clusters in the resulting graph. What about the unscheduled events, like attacks and other accidents?

Collective memory

Traumatic events, like terrorist attacks, flight crashes, wars, and conflicts, often remind us of the past. These memories are often common for most people in a social group. That is the reason why they are called Collective memories. Our approach allows to detect these memories and serves as a general model for collective memory emergence. We provide the examples of 3 events, detected among the others.

The table below contains examples of collective memories. To show the details of the detected collective memories, we pick 3 particular events among the others detected: Ferguson unrest (second wave - November 24, 2014), Charlie Hebdo attack (January 7, 2015), Germanwings flight 9525 airplane crash (March 24, 2015). Top row contains the extracted clusters of collective memories for each of the discussed events. Bottom row shows the detailed activity of each page in the clusters.

Ferguson unrest   Charlie Hebdo attack   Germanwings 9525 crash
   
   

We see that the core events trigger relevant memories. Ferguson unrest reminds us of other riots and other shootings of innocent people of color. Charlie Hebdo shooting has links to the other terrorist attacks, bloodshed, and law enforcement agencies. Germanwings crash is surrounded by the dense group of the other airplane crashes, indicating that the flight accidents are thoroughly structured on Wikipedia.

Although, we can see slightly different behavior in the clusters. For example, the activity during the Ferguson unrest has two distinctive spikes that indicate beginning and end dates of the riot. In case of the Germanwings, the activity of the cluster has a spike in December. The reason is another airplane crash that occurred in December and for this reason got attached to the main cluster.

The short-term events and their collective memories can be found on the monthly dynamic graphs, presented in the previous section in the timeline table. Check the graphs out and search for the events of your interest.

Conclusions

Wikipedia can tell us more than is written on its pages. It is a great source of data for the collective human behavior research. Nonetheless, the dynamic nature of the graph-structured data leads to new challenges for data science and machine learning. In the paper we proposed a new method for collective memory modeling and understanding. We applied the method to the Wikipedia datasets. We detected collective memories using the combination of the Wikipedia Web network graph and the its visitor activity history.

We assume that the learned graphs have associative memory properties. To verify this assumption, we model the recall process using Hopfield networks. The results of these experiments are in my blog post describing these memory properties.

Besides, we noticed that the events we detected were triggered by the anomalous activity of Wikipedia users. This inspired an idea of an anomaly detection algorithm that I described in another blog post. Take a look at the interactive demos of the results on the website of our Wikipedia Insights project.

Tools and code

We make all the experiments using Apache Spark GraphX. The code is written in Scala and available on GitHub. If you feel like playing with the code but new to Scala and Spark, take a look at this tutorial.

Acknowledgments

I would like to thank Michaël Defferrard and Andreas Loukas for fruitful discussions and useful suggestions.

Apache Spark, Wikipedia, Scala, GraphX, Machine Learning, Research, Dynamic graph, Network analysis
comments powered by Disqus