Wikipedia, COVID-19, and readers' interests across languages
To tell you the truth, I should be writing my PhD thesis right now. But the recent findings got me so excited that I cannot wait to share them. Anyway, the thesis can wait; hello darkness, I mean procrastination, my old friend.
If you are only interested in the results, feel free to skip to 3. Results.
A few months before I started writing my thesis, the COVID-19 pandemic unfolded. At the beginning of the pandemic, strict confinement measures were introduced, first in China, then in several other countries in Asia, and finally across Europe and the rest of the world. These measures had a global impact, leading to drastic changes in mobility patterns, among many other things. Following these restrictions, the interests of Wikipedia readers also changed (you can read more about it in a recent study by Manoel Horta Ribeiro et al.).
All in all, the striking influence of the pandemic on everyday life inspired us to run another experiment. We decided to investigate how trending topics evolved throughout the pandemic across different language editions of Wikipedia. We focused on data from December 2019 until May 2020. The study covers seven language editions of Wikipedia: English, French, Russian, Italian, Spanish, German, and Chinese (note that Wikipedia has been banned in China since 23 April 2019).
In the previous post, I wrote about an algorithm for detecting and comparing trending topics across different Wikipedia language editions. The trend detection approach we use here is almost the same as in our previous study. The core difference is the classification engine. This time, we didn’t train it. Instead, we adopted a model designed and trained by Wikimedia Foundation researchers, Isaac Johnson, Martin Gerlach, and Diego Saez-Trumper. This model is more versatile than the one we used before because it makes predictions for Wikidata items instead of Wikipedia articles, which makes its application in multilingual settings easier. Read more about their classification model here.
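To illustrate why item-level predictions help in multilingual settings, here is a minimal sketch. It is not the Wikimedia model or our pipeline: the lookup tables, the placeholder item ID, and the function name `topics_for` are all made up for illustration. The only idea it shows is that articles in different languages that link to the same Wikidata item receive the same topic labels.

```python
# Hypothetical sitelink table: (language, article title) -> Wikidata item ID.
# The ID below is a placeholder, not a real Wikidata identifier.
SITELINKS = {
    ("en", "Pandemic"): "Q-PLACEHOLDER-1",
    ("fr", "Pandémie"): "Q-PLACEHOLDER-1",
    ("ru", "Пандемия"): "Q-PLACEHOLDER-1",
}

# Hypothetical output of a topic classifier, keyed by item rather than by article.
ITEM_TOPICS = {
    "Q-PLACEHOLDER-1": ["Medicine & Health"],
}


def topics_for(lang, title):
    """Return the topic labels of the item behind an article, or [] if unknown."""
    item = SITELINKS.get((lang, title))
    return ITEM_TOPICS.get(item, [])


# The same labels come back regardless of the language edition.
en_topics = topics_for("en", "Pandemic")
ru_topics = topics_for("ru", "Пандемия")
```

Because the labels are attached to the item, adding a new language edition only requires the sitelinks for that language, not a new classifier.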
Another difference is the way we label trends. In the previous study, we labeled an entire cluster with the most popular topic, computing popularity of the topic using a complex formula that involves graph-based attributes. This time, we decided to keep it simple. Each cluster has multiple topic labels that give a more fine-grained outlook on each trend. As we can see in the figure below, when we use the Wikidata-based approach for topic classification, we obtain heterogeneous clusters. Left: general trends with mixed topic labels; right: the same general trends decomposed into sub-clusters of articles on the same topic.
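The multi-label view of a cluster can be sketched in a few lines. This is an illustration under simplifying assumptions, not our actual code: each page is assumed to carry exactly one predicted topic, and the example cluster and the function name `topic_shares` are hypothetical.

```python
from collections import Counter


def topic_shares(cluster_pages):
    """Return (topic, share of pages) pairs for one cluster, largest share first."""
    counts = Counter(topic for _, topic in cluster_pages)
    total = sum(counts.values())
    return [(topic, n / total) for topic, n in counts.most_common()]


# Hypothetical cluster given as (page title, predicted topic) pairs.
cluster = [
    ("Belgian Grand Prix", "Sports"),
    ("Formula One", "Sports"),
    ("Circuit de Spa-Francorchamps", "Sports"),
    ("Power unit", "Engineering"),
    ("Tyre supplier", "Business"),
]

shares = topic_shares(cluster)
```

Instead of keeping only the first entry of `shares` as the label of the whole cluster, we keep the full list, which is what makes the sub-clusters visible.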
Hence, each cluster corresponds to one trend, which covers multiple topics. These topics are implicitly related to the trending events that triggered the creation of each cluster. To understand this better, let us look at several examples in more detail. The majority of pages in Sports-related clusters (highlighted in blue) are labeled as Sports. However, depending on the nature of the sport, we can also see smaller sub-clusters of pages related to other topics, such as Media, Education, Healthcare, and Engineering, inside the main cluster. Needless to say, media and education are integral parts of many popular sports these days.
Trends related to politics (highlighted in red) are very diverse as well. If we look at the cluster that emerged after the death of John McCain, we can see that the majority of pages do indeed belong to Politics. Interestingly, we can also see fairly large sub-clusters of articles on topics such as Society, Military and Warfare, History, and Business. All these areas are inevitably involved in political careers and processes.
Clusters created as a result of the readers’ interest in natural disasters (yellow) are also heterogeneous. The cluster that emerged after Hurricane Florence comprises pages that cover a wide range of topics, including STEM (mostly articles on Earth & Environment), Politics (articles about politicians involved in solving the crisis), and History (articles related to previous disasters and their casualties).
Finally, art-related topics (green) are also diverse. The cluster that emerged due to the anniversary of Aretha Franklin’s death includes articles on different topics associated with various art forms, such as Media, Music, and Visual Arts.
To get a better view of the heterogeneity of the clusters classified using the Wikidata-based approach, let’s look at another plot, which illustrates more concrete examples.
We can see that clusters related to sports, such as NCAA (American football), US Open (tennis), and Belgian Grand Prix (auto racing, Formula 1), have the least diverse sets of topics. Naturally, most of the pages are classified as Sports. Media, Business, and Politics are also among the common topics in the sports-related clusters. However, it is interesting to see that secondary topics in these clusters reflect specific features of different sports. Engineering advancements in the automotive industry play a significant role in Formula 1, which is reflected by the presence of the topics Transportation and Engineering in its cluster. NCAA is an organization that regulates student athletes from North American institutions, so we can see Education among the most represented secondary topics in that cluster. We can also notice a similar effect in the clusters related to politics and show business, where secondary topics give us a wider perspective on the specific nature of the events that occurred.
3.1 COVID-19 and trends
Now, let’s focus on general trends over the first months of the pandemic. On the plot below, we can see that the distribution of trending topics looks similar to what we saw in the previous experiments.
Sports and Media are the leading trends, followed by Music, Films, and Politics & Government. We also noticed the emergence of two new clusters that were not captured before, namely Medicine & Health and Biology. Their emergence was triggered by the readers’ increased interest in articles related to viruses, influenza, and the COVID-19 pandemic itself.
The popularity of Society is 20-25% higher in the Chinese and Russian editions than in the other languages we analyzed. After a qualitative analysis of the classification results, we discovered a significant overlap between the topics Politics and Society in the Chinese and Russian editions. We found that Wikidata items related to local political figures and elections in the regions where the majority of people speak Russian and Chinese often belong to the topic Society. This finding is thought-provoking and requires more research.
3.2 Evolution of trends during COVID-19
The evolution of trends during the pandemic is even more interesting. To capture the dynamics, we aggregated trending topics bi-weekly; each data point represents the popularity of a topic during a selected two-week period. In this study, we focus on the short-term dynamics, which reflect change points in the trends in a moving time window. This gives us a live picture of how users shifted their attention from one topic to another. In the stacked chart below, you can see a dynamic view of the changing popularity of some of the most popular topics. We normalized the popularity of each topic in each language between 0 and 1. The more drastic the attention shift, the thicker the line on the plot.
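The aggregation and normalization steps can be sketched as follows. This is a simplified illustration, not the code from our repository: the start date of the first window, the record layout, the example numbers, and the choice of min-max scaling as the way to map values to the range from 0 to 1 are all assumptions.

```python
from collections import defaultdict
from datetime import date

# Assumed start of the first two-week window (the study covers December 2019 to May 2020).
START = date(2019, 12, 1)


def biweekly_bin(day, start=START):
    """Index of the two-week window that a given day falls into."""
    return (day - start).days // 14


def aggregate(records):
    """Sum popularity per (language, topic) and two-week window.

    records: iterable of (language, topic, date, popularity) tuples.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for lang, topic, day, value in records:
        totals[(lang, topic)][biweekly_bin(day)] += value
    return totals


def min_max(series):
    """Rescale one topic's series to [0, 1]; a flat series maps to all zeros."""
    lo, hi = min(series.values()), max(series.values())
    if hi == lo:
        return {window: 0.0 for window in series}
    return {window: (v - lo) / (hi - lo) for window, v in series.items()}


# Made-up numbers for one topic in one language edition.
records = [
    ("it", "Biology", date(2019, 12, 3), 10),
    ("it", "Biology", date(2019, 12, 10), 5),
    ("it", "Biology", date(2019, 12, 20), 45),
    ("it", "Biology", date(2020, 1, 2), 30),
]

totals = aggregate(records)
normalized = min_max(totals[("it", "Biology")])
```

Normalizing each topic separately within each language makes the curves comparable in shape, but it also means that the heights of two different topics cannot be compared directly.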
We can see that the COVID-19-related topics, such as Biology and Medicine & Health, have an attention spike in January. Then, after a short-term drop, these topics gain steady momentum starting from February. In Chinese Wikipedia, we observe the most significant increase in attention to these topics in January. The interest in the topics remains consistent throughout the entire period and only slightly diminishes in April. Looking at other language editions, we observe that at the beginning of February, the readers’ interest in Biology and Medicine & Health drops. However, soon after that, the topic Biology regains popularity among Italian-speaking readers, followed by English- and German-speaking audiences. Attention to Medicine & Health also bounces back, first in the Italian and French editions, and then in German and English. Russian-speaking readers develop an interest in both topics closer to the end of March. All in all, most of these observations reflect the COVID-19 development timeline in the locations where these languages are primarily spoken. However, it is still hard to align the results for the English, French, and Spanish editions geographically because these languages are spoken in many different regions of the world.
Sports is the most popular topic in all languages at the beginning of the pandemic. However, we can notice an abrupt change in attention levels across all languages: the readers become indifferent to this topic starting from March. One possible explanation is that the pandemic resulted in the cancellation of the majority of sports events around the world.
Attention to Media, Films, and Music is mostly uniform across all languages during the pandemic. The spike in the topic Films can be explained by the worldwide popularity of the 92nd Academy Awards ceremony, which took place on February 9. When we look at Music, we can see a slight shift towards indifference among Italian-speaking readers in the second half of February. This can be attributed to the strict lockdown measures that were introduced on March 9; however, this is just an observational hypothesis.
Finally, let’s compare our results to the ones reported in the study by Manoel Horta Ribeiro et al. The main difference between the two approaches is the comparison strategy. In our approach, we focused on the live short-term attention shifts or change-points, while the authors of the study compared attention levels to the previous year, reporting long-term changes. Even though we used a different approach, it is exciting to see that our findings confirm some of the observations reported in that study. During the first months of the pandemic, Wikipedia articles on Biology and Medicine & Health gained a lot of interest from readers across all language editions, while Sports-related articles lost a significant share of their audience. Nonetheless, there are a few discrepancies. For instance, we did not notice similar short-term changes in the attention to the topics Media, Films, and Music. We found that these topics retained the same level of short-term attention throughout the studied period.
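The difference between the two comparison strategies is easy to see on a toy example. The sketch below is neither our change-point detection nor the method used in the other study; the two helper functions and the numbers are hypothetical and only contrast "window versus previous window" with "window versus the same window a year earlier".

```python
def short_term_shifts(series):
    """Change between each pair of consecutive two-week windows."""
    return [b - a for a, b in zip(series, series[1:])]


def year_over_year(current, previous):
    """Relative change of each window against the same window one year earlier."""
    return [(c - p) / p for c, p in zip(current, previous)]


# Made-up attention levels for one topic: higher than last year, but flat within the period.
this_year = [120, 120, 120]
last_year = [100, 100, 100]

shifts = short_term_shifts(this_year)          # no change between consecutive windows
growth = year_over_year(this_year, last_year)  # consistently above last year's level
```

A topic with this profile shows up as a clear long-term change and as no short-term change at all, which is one way both sets of observations for Media, Films, and Music could hold at the same time.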
The COVID-19 pandemic has affected pretty much every aspect of everyday life, including collective behavior and interests of digital users worldwide. This global crisis highlighted the utter importance of Wikipedia. We can see that people reached out to the encyclopedia looking for answers to pandemic-related questions. It is also exciting to see that, despite the ban, Chinese Wikipedia is still very active.
The difference between long- and short-term changes is interesting too. Attention to Biology, Medicine & Health, and Sports changed the most in both the long and the short term. Other topics, mostly related to entertainment, such as Media, Films, and Music, gained more interest compared to the previous year. However, throughout the pandemic, the level of attention to these topics did not fluctuate.
Also, we noticed that trends in Wikipedia’s viewership are more complex than we previously thought. Each trend that we detected contains a set of articles on various topics. This gives an interesting perspective on the collective perception of trends across languages. We can see which topics are associated with which trends in every language edition.
From a technical point of view, the new topic classification model gives really promising results and inspires us to develop the project further. Now, the master plan is to finish my thesis, or rather, to scale the study to even more languages, make it more automated, and create a web-based service reporting trends in different language editions of Wikipedia.
5. Code, data, results
Everything related to the COVID-19 case study is on GitHub. If you are interested in the trend detection algorithm, take a look at another blog post explaining it in more detail. If you want to run similar experiments, take a look at the distributed graph-based framework for Wikipedia data processing, which we use in all experiments.
Interactive visualizations (similar to this one) are coming soon. Stay tuned!
This project would not have happened without a great effort by Etienne Chalot, who did a summer internship (and the actual work on the project) in LTS2. I would like to thank Isaac Johnson for sharing the topic classification model and for providing advice on how to deploy it locally. Also, kudos to Nicolas Aspert and Benjamin Ricaud for coming up with interesting ideas that shaped this project. Special thanks to Nicolas for helping us with technical issues and taming our greed for computing power. Lastly, I hope my advisor, Pierre Vandergheynst, is not reading this, but just in case he is, I am grateful to him for letting me spend the entire morning not writing my thesis and writing this blog post instead :D
If you have any questions or suggestions, or want to let me know what you think of this article, leave a comment below or contact me directly on Twitter @mizvladimir.