Wikipedia Graph Dataset
Needless to say, Wikipedia is an invaluable source of free knowledge. On top of that, Wikipedia web log data is a great resource for research in many fields such as collective behavior, data mining, and network science. After all the projects and hours spent on Wikipedia data, we got really tired of the data pre-processing. Even though the Wikipedia data dumps are quite organized and clean, it takes time to pre-process the data so that it can be represented and handled as a graph data structure. The Wikimedia Foundation has already done a great job making this data well structured and publicly available (consider donating). We wanted to make it even more accessible to researchers and tech-savvy Wikipedia enthusiasts interested in studying the network-related and socio-dynamic aspects of this knowledge base.
We focus on two aspects of the data dumps: the Wikipedia web network and the pagecounts (number of visits per Wiki page per hour). While working on multiple Wikipedia-related projects, we realized that the life of researchers interested in getting insights from the Wikipedia web network structure would be much easier if the data were natively represented as a graph. We also thought it would be more convenient to have the pagecounts stored in a database. This allows us to ask questions not only about the static network structure but also about the dynamic aspects of Wikipedia, such as visitors’ interests over time, anomalies in viewership activity, or any other question related to collective behavior.
All in all, we provide tools to pre-process two types of Wikipedia dumps: the page graph (pages as nodes, hyperlinks as edges) and the pagecounts. We store the graph in a Neo4j graph database. To use the graph dataset, you need to deploy a Neo4j database on your local computer or on a remote server (preferably a Debian-based distribution). The pagecounts are stored in Parquet files, so you can use them with any compatible framework. You can use the two datasets together as well as independently. If you decide to use them together, they can be joined on the PAGE_ID key (as shown in the figure). Take a look at the detailed deployment instructions at the end of this post.
To get you started with these datasets, we will show a few use cases and queries that should give you an idea of what you could do with them in your Wikipedia-related research projects.
Queries and use cases
Pretty much everyone is crazy about Game of Thrones (GoT) these days, so let’s use it as a demonstration example, as a courtesy to pop-culture trends. We will take a look at the subnetwork of GoT characters and at the activity of Wikipedia visitors over time. To do that, we need to perform the following steps.
1. Graph database: Neo4j
We will start with a simple Cypher query: to which Wikipedia categories does the Daenerys_Targaryen page belong?
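Here is a sketch of such a query. The BELONGS_TO relationship comes from the dataset; the :Page and :Category node labels and the title property are assumptions that you may need to adapt to the actual schema.

```cypher
// Categories the Daenerys_Targaryen page belongs to
// (:Page, :Category and the title property are assumed names).
MATCH (p:Page {title: "Daenerys_Targaryen"})-[:BELONGS_TO]->(c:Category)
RETURN c.title AS category
```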
This query returns 15 categories to which Daenerys_Targaryen BELONGS_TO. For example, the list includes Fictional_princesses, Fictional_victims_of_child_abuse, and Female_characters_in_television. Taking this into account, we may assume that every GoT character belongs to one of two large categories, Female_characters_in_television or Male_characters_in_television. Let’s keep this in mind until we write the final query.
Now, we are interested in the sub-network around the Game_of_Thrones category. To get this subnetwork, we need to submit the following query:
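A query along these lines does the job (same schema assumptions as above; the variable-length pattern is bounded by 2 hops):

```cypher
// 2-hop neighborhood around the Game_of_Thrones category node.
MATCH path = (c:Category {title: "Game_of_Thrones"})-[*..2]-(n)
RETURN path
```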
Technical note: in this query, we set the depth of the neighborhood to 2, which means we want a sub-network that includes the nodes that are no further than 2 hops away from the Game_of_Thrones category node.
The response is too large to visualize in the Neo4j browser. I would recommend either storing the resulting graph in a file and visualizing it elsewhere, or limiting the size of the response with the LIMIT keyword at the end of the query. For example, we can save the response in GraphML format and use Gephi to visualize it. To do that, we need the APOC procedures, and the query looks as follows.
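A sketch of such an export call, assuming the APOC plugin is installed and file export is enabled (apoc.export.file.enabled=true in the Neo4j configuration); the output file name is, of course, up to you:

```cypher
// Export the 2-hop GoT neighborhood to a GraphML file for Gephi.
CALL apoc.export.graphml.query(
  "MATCH path = (c:Category {title: 'Game_of_Thrones'})-[*..2]-(n) RETURN path",
  "got_subnetwork.graphml",
  {}
)
```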
Once you get the file and apply some Gephi-fu, you will get a graph similar to the one shown in the image below. To take a closer look at this graph, open this project file in Gephi. You can also click on the image and explore the network in an interactive environment.
In this graph, you can find Wikipedia pages related to GoT such as actors, characters, seasons, episodes, and other related pages. Note that the Wikipedia structure can be confusing because there are two types of Wikipedia articles: pages and categories. You should take this into account when working with the dataset. For example, this graph contains both types of nodes. Do not get confused when you find two nodes with the title Game_of_Thrones: one of them (green) is a category, the other one (pink) is a page.
Finally, to extract the subnetwork of GoT characters, we need all Wikipedia pages that BELONGS_TO the category Game_of_Thrones and to either Female_characters_in_television or Male_characters_in_television. To pull this off, we need a slightly more sophisticated yet concise and self-descriptive query. Combining the two previous queries, we get something like this:
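A possible version of this query (the page-ID property, called id here, is an assumption; adapt it to the actual property name in the dataset):

```cypher
// GoT character pages: pages that belong to the Game_of_Thrones category
// and to one of the two gender categories.
MATCH (p:Page)-[:BELONGS_TO]->(:Category {title: "Game_of_Thrones"}),
      (p)-[:BELONGS_TO]->(c:Category)
WHERE c.title IN ["Female_characters_in_television",
                  "Male_characters_in_television"]
RETURN DISTINCT p.title AS character, p.id AS page_id
```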
This query returns characters and corresponding page IDs. If we wanted to get the full network of characters with edges between them, we would have to extend the query above in the following way:
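One way to sketch this extension is shown below. We collect the character pages first and then match any relationship between two of them; since BELONGS_TO only goes from pages to categories, the remaining relationships between two pages are the hyperlinks.

```cypher
// GoT character pages, as in the previous query.
MATCH (p:Page)-[:BELONGS_TO]->(:Category {title: "Game_of_Thrones"}),
      (p)-[:BELONGS_TO]->(c:Category)
WHERE c.title IN ["Female_characters_in_television",
                  "Male_characters_in_television"]
WITH collect(DISTINCT p) AS characters
// Hyperlink edges between the character pages themselves.
UNWIND characters AS a
UNWIND characters AS b
MATCH (a)-[r]->(b)
RETURN a, r, b
```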
It is a bit longer than the previous one: we have added the part that fetches the edges between all the pages extracted in the previous query. Click on the image below to interact with the network.
Once we have extracted the subnetwork of GoT characters, we can look at the popularity of the characters on Wikipedia according to visitor interests. To do that, we need to query another part of our dataset that is stored in .parquet files.
2. Pagecounts dataset
In this section, we will take a look at the viewership activity on a couple of pages.
To look at the popularity of GoT characters on Wikipedia over time, we need the IDs of the pages we are interested in. We can extract these IDs from the previous query (for example, by exporting them to a CSV file) and use them to query the tables stored in the Parquet files.
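As a minimal sketch, this is how such a lookup could be done with pandas. The file names and the column names (page_id, timestamp, count) are assumptions and have to be adapted to the actual schema of the Parquet files.

```python
import pandas as pd

# Page IDs of the GoT characters, exported from the Cypher query above
# (hypothetical file and column names).
character_ids = pd.read_csv("got_characters.csv")["page_id"].tolist()

# Load the hourly pagecounts and keep only the character pages.
counts = pd.read_parquet("pagecounts.parquet")
got_counts = counts[counts["page_id"].isin(character_ids)]

# Hourly visits per character: one column per page, one row per hour.
activity = (
    got_counts.groupby(["timestamp", "page_id"])["count"]
              .sum()
              .unstack("page_id")
)
print(activity.head())
```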
In the image below, you can see a partial result of such a query. We can notice the day-and-night fluctuations in the time series. Here, we display the activity of only two pages to avoid clutter.
There are many ways to use activity information. For example, take a look at one of our latest projects on Wikipedia viewership anomaly detection.
More details and the team
If you are interested in languages other than English, our pre-processing tools allow you to work with the dumps of any language edition of Wikipedia. More details are given in the deployment instructions below. The GitHub repo with the deployment instructions is available here. A step-by-step deployment tutorial is available here.
If you have any problems with the deployment or the usage of the dataset, we encourage you to create an issue on GitHub. This is the most efficient way to communicate. You can also leave a comment below or contact me directly.
This dataset is a joint effort of Nicolas Aspert, Volodymyr Miz, Benjamin Ricaud, and Pierre Vandergheynst (EPFL, LTS2). For more details, take a look at the paper we have recently published in the proceedings of the Wiki Workshop 2019 held at The Web Conference 2019.
Acknowledgements
Kudos to Kirell Benzi for the idea of the GoT case study.
Let me know what you think of this article on Twitter @mizvladimir or leave a comment below!