Anomaly detection in the dynamics of Web and social networks
Imagine you have a large network, say a social one. Every day, users of this network (aka nodes) generate massive amounts of likes, messages, and clicks (aka activity logs). These users are also connected through links (aka edges): they follow or befriend each other. There is nothing special about the daily activity in this network. But at some point, something unexpected happens in the network dynamics (e.g. an attack). Since the network is huge and the activity logs are massive, it is hard to explain WHAT happened, WHERE it happened and, more importantly, WHY it happened. And the more active your users and the larger your network, the harder these questions become to answer.
This social network is just one example of a large dynamic network where anomaly detection, inspection, and analysis may be required. Other good candidates are Web networks, email networks, and brain networks, and you can think of any other domain that fits this data model. There are only three requirements for the data: entities in the dataset 1) can be represented as nodes, 2) have some relations among each other, and 3) generate some activity over time.
Introduction
In this post, I am going to present our recent work with Benjamin Ricaud, Kirell Benzi, and Pierre Vandergheynst (EPFL, LTS2). It is about anomaly detection in the dynamics of social and Web networks (a preprint is available on arXiv). This is yet another application of the algorithm that I have already described in previous blog posts. So far, we have been able to detect events and study collective memories. This time, we realized that we can use the same (but slightly modified) algorithm to detect anomalies in the dynamics of communication networks. Moreover, I will show you how we can use it to investigate network attacks and recover traces of an attack even if the intruders tried to hide or erase them.
Take a look at the related research and interactive demos on the website of our Wikipedia Insights project.
Method
We define an anomaly in the network dynamics as an unexpected spike of activity in a cluster of nodes. So, the objective is to find those clusters. As I said before, our networks are large and dense, and the corresponding activity recordings are very noisy, which makes it hard to localize the anomalous clusters. This is why we came up with an algorithm that keeps only those parts of the network that could potentially be related to anomalies.
The core of the algorithm is a modified Hebbian learning rule, a theoretical rule proposed by neuroscientists to explain how the brain learns. In its simplified form, the original rule says that if two neurons (nodes) are active together, the strength of the synapse connecting them (the edge weight) increases. We modify this rule to fit our dynamic network data model. First, we ignore the causality of activations and focus on simultaneous activation of nodes in the network. Second, we introduce a convenient update function that lets us sidestep problems such as normalization and weight thresholds during learning.
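To make the modification concrete, here is a minimal sketch of the per-edge update in Python. The parameter names (`lr`, `forgetting`) and the saturating form of the update are illustrative choices of mine, not the exact update function from the paper:

```python
def hebbian_update(weight, i_active, j_active, lr=0.1, forgetting=0.0):
    """One edge-weight update per time step (parameter names are illustrative)."""
    if i_active and j_active:
        # Simultaneous activation strengthens the edge. The saturating form keeps
        # weights in [0, 1], so no separate normalization or thresholding is needed.
        return weight + lr * (1.0 - weight)
    # Otherwise the weight stays unchanged (forgetting=0) or decays towards zero.
    return weight * (1.0 - forgetting)
```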
Given this data model, the algorithm consists of three main steps and works as follows. First, we select “spiky” nodes that have unexpected bursts of activity; we simply use the mean and standard deviation of each node's activity to assess the size of the bursts, and we remove the nodes with uniform activity. Second, we compute the strength of the connections (the edge weights) between the spiky nodes using the modified Hebbian learning rule: if two nodes are active at the same time, the weight of the edge connecting them increases; otherwise it either remains unchanged or, when the forgetting parameter is on, decreases. With forgetting on, the algorithm gradually forgets previous anomalies; with it off, the algorithm accumulates anomalous events and stores them as a memory. Finally, once the weights are computed, we remove the low-weight edges, so that what remains of the initial graph contains the clusters of nodes with anomalous activity.
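Here is a toy end-to-end sketch of the three steps using networkx and numpy. The spike threshold (mean plus k standard deviations) and the pruning threshold `w_min` are illustrative parameters, not the exact values used in our experiments:

```python
import numpy as np
import networkx as nx

def detect_anomalous_clusters(G, activity, k=3.0, w_min=0.5, lr=0.1, forgetting=0.0):
    """Toy version of the three steps; k, w_min, and lr are illustrative parameters.
    activity maps each node to a sequence of activity counts per time step."""
    # Step 1: keep "spiky" nodes whose activity exceeds mean + k * std at least once.
    # Nodes with uniform activity never exceed their own mean, so they are dropped.
    thresholds = {n: np.mean(ts) + k * np.std(ts) for n, ts in activity.items()}
    spiky = [n for n, ts in activity.items() if np.max(ts) > thresholds[n]]
    H = G.subgraph(spiky).copy()
    nx.set_edge_attributes(H, 0.0, "weight")

    # Step 2: Hebbian-style learning over time steps (same co-activation rule as above).
    n_steps = len(next(iter(activity.values())))
    for t in range(n_steps):
        active = {n for n in spiky if activity[n][t] > thresholds[n]}
        for u, v, d in H.edges(data=True):
            if u in active and v in active:
                d["weight"] += lr * (1.0 - d["weight"])  # co-activation strengthens the edge
            else:
                d["weight"] *= 1.0 - forgetting          # optional decay

    # Step 3: prune weak edges; the remaining components are candidate anomalous clusters.
    H.remove_edges_from([(u, v) for u, v, d in H.edges(data=True) if d["weight"] < w_min])
    H.remove_nodes_from(list(nx.isolates(H)))
    return H
```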
As you can see, all computations are local on the graph: the weight updates of a node's adjacent edges depend only on the node's attributes and the attributes of its neighbors. This allowed us to fit the algorithm into the GraphX framework and implement it in a distributed fashion. See the code on GitHub. If you are new to Scala and Spark, take a look at this tutorial.
UPDATE 0: We have a new, enhanced, and more readable version of the code available here. To run it, you need to deploy Neo4j and Apache Cassandra databases. You can find the deployment instructions in the same repo on GitHub. See details here.
UPDATE 1: If the Scala code is confusing, I wrote a minimal Python example with experiments on random data. Check out this GitHub repo for more details. It should give you a quick but good understanding of our approach. Since the algorithm relies on the structure of the graph, the results on a random graph are not very impressive. However, even though the structure of a random graph is very dense, we managed to get decent results (see the images below).
Figure: anomalous time-series, the initial random graph, and the detected anomalous cluster (bold edges).
Experiments
We conduct experiments on two datasets: the Wikipedia web network and the Enron email communication network. Both datasets are publicly available (check out the links).
Wikipedia web network
In this experiment, we studied the Wikipedia web network. The aim was to detect anomalous behavior of Wikipedia visitors. Here, we use Wikipedia pages as the nodes and the hyperlinks connecting the pages as the edges of the graph. Every node has a time-series attribute corresponding to the number of visits per hour. The Wikimedia Foundation makes viewership activity publicly available, so we used this data to run the experiment. The pre-processed dataset is on Zenodo (see the links in the previous paragraph).
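As a rough illustration of this data model, the sketch below builds such a graph with networkx. The input layout (a hyperlink edge list plus a table of hourly page views) is hypothetical and does not necessarily match the structure of the pre-processed files on Zenodo:

```python
import networkx as nx

def build_wikipedia_graph(hyperlinks, hourly_views):
    """hyperlinks: iterable of (source_page, target_page) pairs;
    hourly_views: dict mapping page title -> list of visits per hour."""
    G = nx.Graph()
    G.add_edges_from(hyperlinks)
    for page, series in hourly_views.items():
        if page in G:
            G.nodes[page]["activity"] = series  # per-node time series used by the algorithm
    return G
```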
This experiment is quite interesting because it is hard to validate the results quantitatively or to present a numerical measure that would reflect the success rate of the algorithm. As a workaround, we decided to validate the results against another anomaly detection framework used as ground truth. In this experiment, we use the Google Trends API as a benchmark.
As you see in the figures below, the results are quite impressive. The algorithm has managed to detect multiple anomalies in Wikipedia viewership activity. These anomalies correlate with the trending keywords on Google Trends.
Besides, there is another detail I want you to focus on. Look at the figure showing the anomaly related to the Germanwings Flight 9525 crash. There is a smaller spike of activity at the end of December 2014. The reason is that there was another airplane crash at the end of December 2014, Indonesia AirAsia Flight 8501. We detected this event not because of a bug or a noisy activity pattern. It is a feature of our approach that reflects an aspect of the collective behavior of Wikipedia visitors called collective memory. The phenomenon of collective memory suggests that social groups recall past traumatic events when a similar traumatic event happens in the present. This is exactly what our algorithm detected: when the Germanwings crash happened, people started searching for other related airplane accidents. Google Trends did not reflect this, since it focuses on particular keywords, whereas our approach looks at a more general picture thanks to the graph representation of the data. This means that our algorithm detects complex anomalies in the viewership dynamics that can only be spotted by looking at a group of concepts rather than a particular keyword.
Figure: Ferguson unrest, Charlie Hebdo attack, and Germanwings 9525 crash.
Enron email network
We also investigated the Enron email dataset and ran another anomaly detection experiment on it. We built a communication network from the emails sent among employees: the nodes of the network correspond to employees, and we drew an edge between two employees if they had exchanged at least a few emails. The activity of a node is the number of emails sent by the corresponding employee.
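A minimal sketch of this construction might look as follows; the record layout (timestamp, sender, recipient) and the `min_exchanged` threshold are illustrative and not the exact preprocessing used in the paper:

```python
from collections import Counter, defaultdict
import networkx as nx

def build_enron_graph(emails, min_exchanged=3):
    """emails: iterable of (timestamp, sender, recipient) records."""
    exchanged = Counter()
    sent = defaultdict(list)
    for ts, sender, recipient in emails:
        exchanged[frozenset((sender, recipient))] += 1  # count emails per pair, either direction
        sent[sender].append(ts)

    G = nx.Graph()
    for pair, count in exchanged.items():
        if len(pair) == 2 and count >= min_exchanged:   # skip self-emails and rare contacts
            G.add_edge(*sorted(pair))
    # Node activity: timestamps of sent emails, to be binned per time step downstream.
    for employee in G:
        G.nodes[employee]["sent_timestamps"] = sorted(sent.get(employee, []))
    return G
```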
In this experiment, we had ground truth: we knew when the anomalies in the dynamics happened, which gave us a chance to verify the accuracy of our algorithm. There are four main events related to the Enron scandal (see the details in the paper). We found that the real-world events related to the scandal triggered the anomalies in the communication dynamics of the corporate email network. The image below shows the result of the anomaly detection algorithm.
Attack investigation. Data recovery
We have reached the final part of the blog post. Imagine someone attacked your network, trying to wipe the network activity data, and this resulted in an anomalous spike of activity on the nodes of your network. Or, simply, you lost information about the network activity due to an unexpected shutdown of the hosting server. In either case, the main problem is to recover the missing information about the network activity. Our algorithm can help.
Below, you see an example of the recovery of damaged parts of the data. Here, we wipe 20% of the viewership activity data in a cluster of Wikipedia pages and recover it using our algorithm. To memorize and recover the missing data, we used a Hopfield neural network, a basic model of artificial associative memory. Incidentally, Hopfield nets also use the Hebbian learning rule.
We use the prior activity of the network to learn collective patterns in the network dynamics. The patterns correspond to clusters of pages with similar anomalous behavior. We then use these clusters to recover the missing data. We found that the algorithm can recover the missing data for the period when the anomaly happened (in this example, it is bounded by the red vertical lines). It is also possible to recover information outside these bounds, but here we focused on the information that is most interesting to us: the missing information around the moment of the attack.
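For intuition, here is a minimal binary Hopfield network in numpy: patterns are stored with the Hebbian outer-product rule, and a damaged pattern is restored by iterating the update while clamping the undamaged entries. This is a simplified sketch, not the exact recovery procedure from the paper:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian (outer-product) learning of binary +1/-1 activity patterns."""
    n_patterns, n_nodes = patterns.shape
    W = np.zeros((n_nodes, n_nodes))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)   # no self-connections
    return W / n_patterns

def recover(W, observed, known_mask, n_iter=20):
    """Restore wiped entries while clamping the valid (undamaged) observations."""
    state = observed.copy()
    for _ in range(n_iter):
        state = np.where(W @ state >= 0, 1.0, -1.0)  # synchronous Hopfield update
        state[known_mask] = observed[known_mask]      # keep undamaged entries fixed
    return state
```

In our setting, the stored patterns would correspond to binarized co-activation patterns of the pages in the anomalous cluster, and the clamped entries would be the observations that survived the wipe.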
As you can see, when recovering the data we rely heavily on the neighboring nodes in the cluster. This means that if the intruders wiped the entire cluster, we would not be able to recover the original data. To recover missing information in a cluster, we need at least 50% of its nodes to have valid, undamaged observations. This is a limitation of our approach.
Read more about the memory properties and the data recovery feature of our algorithm in my previous blog post.
Conclusions
Networks are everywhere. We are surrounded by networks in pretty much all aspects of our lives, explicitly or implicitly. It is important to develop a stable and robust toolkit for monitoring these networks and protecting us from unexpected accidents related to them. Here, we introduced a new method for anomaly detection and missing information recovery in large networks. We showed that the network structure carries a lot of useful information in itself, so it is very important to use it when working with dynamic or time-varying data that have a spatial or structural component. All things considered, I believe that in the coming years we will see more and more applications of spatio-temporal data mining (for more details about this emerging field, take a look at this recent survey).
Let me know what you think of this article on twitter @mizvladimir or leave a comment below!