Wikipedia Graph Dataset
Needless to say, Wikipedia is an invaluable source of free knowledge. In addition, Wikipedia weblog data is a great resource for research in many different fields, such as collective behavior, data mining, or network science. After all the projects and hours spent on Wikipedia data, we got really tired of the data pre-processing. Even though Wikipedia data dumps are quite organized and clean, it takes time to pre-process the data to represent and handle it as a graph data structure. The Wikimedia Foundation has already done a great job making this data well structured and publicly available (consider donating). We wanted to make this data even more accessible to researchers and tech-savvy Wikipedia enthusiasts interested in studying network-related and socio-dynamic aspects of this knowledge base.
We are focusing on two aspects of the data dumps: the Wikipedia web network and pagecounts (the number of visits per Wiki page per hour). While working on multiple Wikipedia-related projects, we realized that the life of researchers interested in getting insights from the Wikipedia web network structure would be much easier if the data were natively represented as a graph. Also, we thought that it would be nicer and more convenient to have pagecounts stored in a database. This would allow us to ask questions not only about the static network structure but also about the dynamic aspects of Wikipedia, such as visitors’ interests over time, anomalies in viewership activity, or any other questions related to collective behavior.
All in all, we provide two databases: graph and pagecounts. We store the graph in Neo4j. Time-series pagecounts are stored in Apache Cassandra. To use them, you can deploy both databases (or either one separately) on your local computer or a remote server (preferably a Debian-based distribution). It takes at most 2 hours to deploy both databases once you have all the required dumps downloaded from Wikimedia servers (this may take some time). Take a look at the detailed deployment instructions at the end of this post.
To incentivize you to start using these databases, we will show a few use cases and queries that should give you an idea of what you potentially could do with them in your Wikipedia-related research projects.
Queries and use cases
Pretty much everyone is crazy about Game of Thrones these days so let’s use it as a demonstration example just to show courtesy to pop culture trends. Let’s take a look at the subnetwork of GoT characters and the activity of Wikipedia visitors over time. To do that, we will need to perform the following steps.
1. Graph database. Neo4j.
We will start with a simple Cypher query: to which Wikipedia categories does the Daenerys_Targaryen page belong?
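A query along these lines would do it. This is a sketch: the `BELONGS_TO` relationship type comes from the dataset description, but the node labels and the `title` property name are assumptions about the schema.

```cypher
// Sketch: find all categories the Daenerys_Targaryen page belongs to.
// Labels (:Page, :Category) and the `title` property are assumed.
MATCH (p:Page {title: "Daenerys_Targaryen"})-[:BELONGS_TO]->(c:Category)
RETURN c.title;
```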
This query returns 15 categories to which Daenerys Targaryen BELONGS_TO. For example, this list includes Fictional_princesses, Fictional_victims_of_child_abuse, and Female_characters_in_television. Taking this into account, we may assume that all GoT characters belong to two large categories, Female_characters_in_television and Male_characters_in_television. Let’s keep this in mind until we write the final query.
Now, we are interested in a sub-network around category GoT. To get this subnetwork we need to submit the following query:
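A variable-length pattern match is one way to express this. Again, a hedged sketch: the label and property names are assumptions.

```cypher
// Sketch: the neighborhood within 2 hops of the Game_of_Thrones category.
MATCH path = (c:Category {title: "Game_of_Thrones"})-[*..2]-(n)
RETURN path;
```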
Technical note: in this query, we set depth of the neighborhood to 2, which means we want a sub-network that includes nodes not further than 2 hops from the GoT category page.
The response is too large to visualize in the Neo4j browser. I would recommend either storing the resulting graph in a file and visualizing it elsewhere, or limiting the size of the response with the LIMIT keyword at the end of the query. For example, we can save the response in GraphML format and use Gephi to visualize it. To do that, we need to use the APOC procedures, and the query looks as follows.
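With the APOC library installed, the export could look roughly like this (a sketch; the output file name is a placeholder and the schema details are assumed):

```cypher
// Sketch: export the 2-hop GoT neighborhood to GraphML via APOC.
CALL apoc.export.graphml.query(
  "MATCH path = (c:Category {title: 'Game_of_Thrones'})-[*..2]-(n) RETURN path",
  "got_subnetwork.graphml",
  {});
```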
Once you get the file and apply some Gephi-fu, you will get a graph similar to the one shown in the image below. To take a closer look at this graph, open this project file in Gephi. You can also click on the image and explore the network in an interactive environment.
In this graph, you can find Wikipedia pages related to GoT such as actors, characters, seasons, episodes, and other related pages. Note that Wikipedia structure can be confusing due to two types of Wikipedia articles: pages and categories. You should take this into account when working with the dataset. For example, in this graph, you will find both types of nodes. Do not get confused when you find two nodes with title Game_of_Thrones. One of them (green) is a category, another one (pink) is a page.
Finally, to extract the subnetwork of GoT characters, we need all Wikipedia pages that BELONGS_TO the categories Game_of_Thrones, Female_characters_in_television, and Male_characters_in_television. To pull this off, we need a slightly more sophisticated yet concise and self-descriptive query. Combining the two previous queries, we get something like this:
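Here is one way such a query could look (a sketch; labels and property names such as `Page`, `Category`, `title`, and `id` are assumptions about the schema):

```cypher
// Sketch: pages that belong both to Game_of_Thrones and to one of the
// two television-character categories.
MATCH (p:Page)-[:BELONGS_TO]->(:Category {title: "Game_of_Thrones"}),
      (p)-[:BELONGS_TO]->(c:Category)
WHERE c.title IN ["Female_characters_in_television",
                  "Male_characters_in_television"]
RETURN p.title, p.id;
```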
This query returns characters and corresponding page IDs. If we wanted to get the full network of characters with edges between them, we would have to extend the query above in the following way:
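An extension along these lines would work (again a sketch under the same schema assumptions; matching `(a)-[r]->(b)` over any relationship type avoids assuming the name of the page-to-page link relationship):

```cypher
// Sketch: the GoT character pages, plus all edges among them.
MATCH (p:Page)-[:BELONGS_TO]->(:Category {title: "Game_of_Thrones"}),
      (p)-[:BELONGS_TO]->(c:Category)
WHERE c.title IN ["Female_characters_in_television",
                  "Male_characters_in_television"]
WITH collect(DISTINCT p) AS characters
MATCH (a)-[r]->(b)
WHERE a IN characters AND b IN characters
RETURN a.title, type(r), b.title;
```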
It is a bit longer than the previous one. We have added the part that fetches edges between all pages that we have extracted in the previous query. Click on the image below to interact with the network.
Once we have extracted the subnetwork of GoT characters, we can look at the popularity of the characters on Wikipedia according to visitor interests. To do that, we need to query another part of our dataset that is stored in Apache Cassandra database.
2. Time-series dataset. Cassandra.
In this section, we will query the Cassandra database using CQL, which is very similar to SQL.
To take a look at the popularity of GoT characters over time on Wikipedia, we need the IDs of the pages we are interested in. We can extract the IDs from the previous query (we can export them to a CSV file) and use them to query Cassandra in the following way:
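A CQL query of roughly this shape would do it. Hedged sketch: the keyspace, table, and column names below are assumptions about the schema, and the page IDs are placeholders standing in for the values exported from the Cypher query.

```sql
-- Sketch: hourly visit counts for a set of page IDs over a time window.
-- Keyspace/table/column names are assumed; the IDs are placeholders.
SELECT page_id, visit_time, visits
FROM wikipedia.pagecounts
WHERE page_id IN (123456, 234567)
  AND visit_time >= '2019-04-01 00:00:00'
  AND visit_time <  '2019-05-01 00:00:00';
```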
In the image below, you can see a partial result of such a query. We can notice day and night fluctuations in the time series. Here, we display the activity of only two pages to avoid clutter. To reproduce the result, you can use this notebook. You can also use this notebook as a reference and a starting point for your future projects.
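To give a flavor of the diurnal pattern mentioned above, here is a small self-contained Python sketch (synthetic data, not the real notebook) that averages hourly visit counts by hour of day; with real data, the `(timestamp, count)` pairs would come from the Cassandra query result.

```python
from collections import defaultdict
from datetime import datetime, timedelta
import math

def hourly_profile(series):
    """Average visit count per hour of day from (timestamp, count) pairs."""
    sums, ns = defaultdict(float), defaultdict(int)
    for ts, c in series:
        sums[ts.hour] += c
        ns[ts.hour] += 1
    return {h: sums[h] / ns[h] for h in sums}

# Synthetic hourly pagecounts for one week: a sinusoid peaking at midday
# and bottoming out at night, mimicking day/night viewership fluctuations.
start = datetime(2019, 4, 1)
series = [(start + timedelta(hours=i),
           100 + 50 * math.sin(2 * math.pi * ((i % 24) - 6) / 24))
          for i in range(24 * 7)]

profile = hourly_profile(series)
peak_hour = max(profile, key=profile.get)  # midday for this synthetic series
```

A profile like this makes the periodic component obvious and is a common first step before looking for anomalies in viewership activity.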
There are many ways to use activity information. For example, take a look at one of our latest projects on Wikipedia viewership anomaly detection.
More details and the team
This dataset is a joint effort of Nicolas Aspert, Volodymyr Miz, Benjamin Ricaud, and Pierre Vandergheynst (EPFL, LTS2). For more details, take a look at the paper we recently published in the proceedings of the Wiki Workshop 2019, held at The Web Conference 2019. The GitHub repo with the deployment instructions is available here. A step-by-step deployment tutorial is available here.
If you have any problems related to deployment or usage of the dataset, we encourage you to create an issue on GitHub. This is the most efficient way of communication. You can also leave a comment below or contact me directly.
Kudos to Kirell Benzi for the idea of the GoT case study.
Let me know what you think of this article on Twitter @mizvladimir or leave a comment below!