Wikipedia graph mining: dynamic structure of collective memory
This is the accompanying blogpost for our upcoming research paper (read preprint on arXiv); joint work with Kirell Benzi, Benjamin Ricaud, and Pierre Vandergheynst (EPFL, LTS2). Here, we focus on the results, omitting the details of the algorithm and the implementation.
Wikipedia is a great source for data analysis due to its outstanding scale and the graph structure. Tens of millions of visitors surf it daily, leaving their footprint on the Web. The combination of the Wikipedia graph structure and visitor activity on the pages gives us the dynamic graph – the graph with time-series signals on the nodes. The dynamic nature of the graph makes the large-scale analysis problem complicated.
In the original paper we analyze the Wikipedia graph. The aim is to detect events and collective memories using the activity of the Wikipedia visitors. We use graph-based approach to build our model. The computational model is inspired by the synaptic plasticity and Hebbian theory.
Not surprisingly, we could not fit all the results into the paper. Apart from that, PDF is a poor format for communicating the research findings. The aim of this post is to show the results in an interactive way. While reading the paper and this post, we encourage you to open the graphs, appearing everywhere in this post, and play with them: zoom, click, move, search, and select. This is by far the funniest way to plunge into the main results of our work.
Graphs are interactive
- Click on any graph in this post to open it in a new window.
- Zoom, click on the nodes, search pages by name, highlight clusters by color.
- When you click on a node – you select all the neighbors.
- When you select a cluster – you select all the nodes in this cluster.
- The list of selected nodes appears on the right.
Works best in the latest version of Chrome. DO NOT try to open the graphs on a smartphone. The graphs are too large and it might take forever to render.
The original datasets are publicly available on the Wikimedia website. We took the SQL dumps of the English Wikipedia articles to create the graph. The visitor activity is the number of visits per page per hour. We consider the period from 02:00, 23 September 2014 to 23:00, 30 April 2015. Pre-processing details are described in our paper in the Dataset section.
Dynamics of the Wikipedia network
In the paper we make an assumption that the dynamics of the graph can affect its structure. We apply the update rule, based on the signal on the nodes, to observe this effect. Here we show that the Wikipedia graph can self-organize into the sets of meaningful communities of the nodes, if we take into account the visitor activity dynamics of the graph. Click on the graph on the right and explore the result by yourself.
This graph is the result of 7-month dynamics of visitor activity on Wikipedia. Here you can find main events that had taken place over the considered period. The stable or scheduled events, like tournaments, awards ceremonies, contests, and most popular holidays form big clusters. The unstable or unexpected events, like incidents and accidents are grouped in small clusters. Even though, this graph provides good summary of the dynamic patterns, we can see only the final result. What is more important, is to get insights on the dynamics of the graph over time. How the clusters emerge, evolve, and disappear? To answer this question, we pick one particular event and look at its dynamics in details.
Dynamics of an event: NFL championship
In order to understand the dynamics of evolution of the graph, we pick one of the most popular events, highlighted in English Wikipedia – the National Football League (NFL) championship. We consider the season 2014-2015. The plot is on the right (click to enlarge). For the plot interpretability we extracted 30 NFL teams out of 485 pages in the original cluster. The timeline shows the overall cluster activity over the period of 7 months. The dynamics timeline of the graph and the NFL cluster evolution is illustrated on the top row. It reflects the interest of the NFL fans in the championship. The cluster is small and sparse at the beginning of the championship and becomes denser and bigger, approaching the final game date. The behavior of the Wikipedia visitors during the day of the final game Super Bowl is outstanding. The activity of the NFL fans is much higher, comparing to the activity of other Wikipedia users. It makes an analogy to the real life, when during the finals the fans become the most active people on the streets.
The NFL championship is just an example of a detected event and its evolution. You can dig into the graphs of monthly activity and check out other detected event-clusters. The total number of the detected events is 172. Click on the graphs below to open an interactive version and explore by yourself.
The NFL cluster is a good example of a stable event, represented as one of the biggest clusters in the resulting graph. What about the unscheduled events, like attacks and other accidents?
Traumatic events, like terrorist attacks, flight crashes, wars, and conflicts, often remind us of the past. These memories are often common for a group of people in a social community. That is the reason why they are called Collective memories. Our approach allows to detect these memories and serves as a general model for collective memory emergence. We provide the examples of 3 events, detected among the others.
The table below contains examples of collective memories. To show the details of the detected collective memories, we pick 3 particular events among the others detected: Ferguson unrest (second wave - November 24, 2014), Charlie Hebdo attack (January 7, 2015), Germanwings flight 9525 airplane crash (March 24, 2015). Top row contains the extracted clusters of collective memories for each of the discussed events. Bottom row shows the detailed activity of each page in the clusters.
|Ferguson unrest||Charlie Hebdo attack||Germanwings 9525 crash|
We see that the core events trigger relevant memories. Ferguson unrest reminds us of other riots, shootings of innocent people, and even leads us back to the slavery in the United States. Charlie Hebdo shooting has links to the other terrorist attacks, bloodshed, and law enforcement agencies. Germanwings crash is surrounded by the dense group of the other airplane crashes, indicating that the flight accidents are thoroughly structured on Wikipedia.
Although, we can see some noise in the clusters. The noise is relevant to the main topics of the clusters and does not affect the cluster formation. Usually, the major source of the noise is a node, that is relevant to several clusters of events. For instance, the cluster of Ferguson unrest contains the node Anonymous group. This node links another large cluster of leading tech and e-commerce corporations. In this case, the Online shopping page causes the first steady raise of the activity, since the most profitable day for online shops was detected on the 11/11/2014. Another example of the noise is in the Germanwings cluster. The main cause of the noise is the page of the day – 24 March – that contains most of the notable historical events.
Despite the fact that popular pages cause the noise, the algorithm is still able to localize the smaller events and create relevant clusters. To detect smaller events, like the ones presented in the examples, we used smaller time window of one week. The small events still can be found on the monthly dynamic graphs, presented in the previous section in the timeline table. Check the graphs out and search for the events of your interest.
Wikipedia can tell us more than is written on its pages. It is a great source of data for the collective human behavior research. Nonetheless, the dynamic nature of the graph-structured data leads to new challenges for data science and machine learning. In the paper we proposed a new method for patterns detection in large-scale dynamic graphs. We applied the method to the Wikipedia datasets. We have managed to detect dynamical patterns in terms of events and collective memories in Wikipedia using the combination of the hyperlinks graph and the visitor activity on the website. Next step is to improve the filtering part of the algorithm to decrease the amount of noise, described in the Collective memory section of this post.
Tools and code
We make all the experiments using Apache Spark GraphX. The code is written in Scala and available on GitHub. Data pre-processing can be done using the Python code, available in another GitHub repository. The git repos will be open after we submit the paper.
I would like to thank Michaël Defferrard for fruitful discussions and useful suggestions.
Let me know what you think of this article on twitter @mizvladimir or leave a comment below!