Movie genres classification using character interaction networks
Social interactions are among the most important needs of everyday life. Whenever we communicate with relatives, colleagues, friends or acquaintances, we create invisible social links adding those people into our personal social network (here, I mean the real one, not 500+ Facebook friends and LinkedIn connections). Sometimes, we look at interactions in our networks and compare our lives to movies and TV series trying to find similarities with real life. This made me think of a couple of interesting questions. Can the structure of our social network tell us anything about ourselves? What if it defines the genre of our life? Let’s try to look at this question from a data scientist’s perspective.
To do that, we first need a labeled dataset reflecting a “social network” => “genre” relation. We do not have access to real-life social networks. Even if we were able to extract them from, say, Facebook, that would be considered a privacy violation, and we would still have no labels or “genres” characterizing each network. The good news is that we have movies: it is possible to extract networks of character interactions from them, and we know the genre of each movie. Fortunately, the team at Moviegalaxies has already captured quite a few of these networks and published them on their website.
All in all, we found a suitably labeled dataset containing a “character network of a movie”=>”genre” relation. Now, we can build a machine learning model to find out whether the structure of character networks has anything to do with genres of corresponding movies.
Data
Moviegalaxies is a collection of social graphs in movies. A team of researchers from RWTH, Cologne University, and MIT (Jermain Kaminski, Michael Schober, Raymond Albaladejo, Oleksandr Zastupailo, and Cesar Hidalgo) has collected and published social network interactions among characters in 774 movies (and counting). They used a “same-scene appearance” approach to build these social networks. This means, for example, that if Padme and Anakin appear in the same scene in Star Wars: Episode II Attack of the Clones, they are connected in the social network of the movie. On the other hand, Padme and Yoda never appear together, so there is no edge linking them in the graph.
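The “same-scene appearance” rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the Moviegalaxies pipeline: the function name `build_cooccurrence_graph` and the toy scene list are hypothetical, and the graph is stored as a plain dictionary mapping each character to a set of neighbours.

```python
from itertools import combinations


def build_cooccurrence_graph(scenes):
    """Return an undirected graph as {character: set of neighbours}.

    `scenes` is an iterable of character-name collections, one per scene.
    Two characters are linked if they share at least one scene.
    """
    graph = {}
    for scene in scenes:
        cast = set(scene)
        for name in cast:
            graph.setdefault(name, set())  # keep characters with no links too
        for a, b in combinations(sorted(cast), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical toy scene list, not taken from the Moviegalaxies data.
scenes = [
    ["Padme", "Anakin"],
    ["Anakin", "Obi-Wan"],
    ["Yoda", "Obi-Wan"],
]
graph = build_cooccurrence_graph(scenes)
```

In this toy example Padme and Anakin end up connected, while Padme and Yoda do not, because no scene lists both of them.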
However, it is not easy to get genre information for each movie from the Moviegalaxies website. To obtain it, I used the Open Movie Database (OMDB) API. The service provides 1000 free requests per day, which is enough to fetch additional information about every movie in the Moviegalaxies collection.
For each movie, we consider only the main genre and omit the secondary ones. Here and in all experiments below, we keep only the top 7 most popular genres. As we can see in the chart below, the dataset is quite imbalanced.
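As a rough sketch of this filtering step, assume the genre of each movie arrives as a comma-separated string with the main genre listed first (an assumption about the input format, not something stated above). The helper names `main_genre` and `keep_top_genres` and the sample records are made up for illustration.

```python
from collections import Counter


def main_genre(genre_field):
    """Return the first genre of a comma-separated genre string."""
    return genre_field.split(",")[0].strip()


def keep_top_genres(movie_genres, k=7):
    """Map each title to its main genre, keeping only the k most frequent genres."""
    mains = {title: main_genre(g) for title, g in movie_genres.items()}
    top = {genre for genre, _ in Counter(mains.values()).most_common(k)}
    return {title: g for title, g in mains.items() if g in top}


# Hypothetical records; k=2 only to keep the example small.
movie_genres = {
    "Movie A": "Action, Adventure, Sci-Fi",
    "Movie B": "Drama, Romance",
    "Movie C": "Action, Thriller",
    "Movie D": "Horror",
    "Movie E": "Drama",
}
labels = keep_top_genres(movie_genres, k=2)
```

Here "Movie D" is dropped because Horror is not among the two most frequent main genres in the toy data.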
Note: Visualizations work best with Chrome.
Features
We want to find out whether the structure of character networks is enough to define movie genres. To do that, we look at the topological properties of the graphs that correspond to the character networks. In particular, we are interested in assortativity, transitivity, degree distribution, clustering coefficient, and modularity. We also use the number of nodes and edges in the social graphs as features.
Apart from having fancy names, topological properties of graphs can tell us interesting stories about social networks. For example, clustering coefficient, transitivity, and modularity reflect to what extent members of a network tend to connect or cluster together. A high assortativity coefficient indicates that people in the network tend to socialize with peers that have similar properties (popularity, for instance).
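To make these definitions concrete, here is a minimal pure-Python sketch of three of the features: transitivity, average clustering coefficient, and degree assortativity. It assumes the graph is an undirected adjacency dictionary `{node: set of neighbours}`; all function names are illustrative. Modularity is left out because it depends on a community partition, and graph libraries such as NetworkX offer ready-made versions of these measures.

```python
from itertools import combinations


def _triangles_and_triples(graph, node):
    """Count closed and possible neighbour pairs around one node."""
    nbrs = graph[node]
    triples = len(nbrs) * (len(nbrs) - 1) // 2
    triangles = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return triangles, triples


def transitivity(graph):
    """Fraction of connected triples that are closed into triangles."""
    closed = total = 0
    for node in graph:
        t, p = _triangles_and_triples(graph, node)
        closed += t
        total += p
    return closed / total if total else 0.0


def average_clustering(graph):
    """Mean local clustering coefficient (0 for nodes with fewer than 2 neighbours)."""
    values = []
    for node in graph:
        t, p = _triangles_and_triples(graph, node)
        values.append(t / p if p else 0.0)
    return sum(values) / len(values) if values else 0.0


def degree_assortativity(graph):
    """Pearson correlation between the degrees at the two ends of each edge."""
    xs, ys = [], []
    for u in graph:
        for v in graph[u]:  # every undirected edge is visited in both directions
            xs.append(len(graph[u]))
            ys.append(len(graph[v]))
    if not xs:
        return 0.0
    m = sum(xs) / len(xs)  # same mean for xs and ys by symmetry
    var = sum((x - m) ** 2 for x in xs)
    cov = sum((x - m) * (y - m) for x, y in zip(xs, ys))
    return cov / var if var else 0.0  # 0.0 when all degrees are equal


def topological_features(graph):
    return {
        "nodes": len(graph),
        "edges": sum(len(n) for n in graph.values()) // 2,
        "transitivity": transitivity(graph),
        "avg_clustering": average_clustering(graph),
        "assortativity": degree_assortativity(graph),
    }


# Toy graph: a triangle A-B-C with one extra character D attached to C.
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
features = topological_features(toy)
```

For the toy graph the transitivity is 0.6 (three closed triples out of five), and the assortativity is negative because the best-connected character C is linked to the least-connected one, D.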
After removing outliers and normalizing, let’s take a look at pairwise relationships of the features in our dataset. To reduce clutter, you can choose genres using check-boxes below. I recommend comparing genres that have a similar number of samples, e.g. Horror vs Adventure.
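The exact outlier rule and normalization are not spelled out here, so the following is only one plausible sketch: z-score standardization per feature, and dropping any movie with a feature further than a chosen number of standard deviations from the mean. Both choices, the function names, and the toy numbers are assumptions.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column of a list of equal-length numeric rows."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # constant column: avoid dividing by zero
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def drop_outliers(rows, threshold=3.0):
    """Drop rows in which any standardized feature exceeds the threshold."""
    z_rows = standardize(rows)
    return [
        row
        for row, z_row in zip(rows, z_rows)
        if all(abs(z) <= threshold for z in z_row)
    ]


# Hypothetical feature rows; the last one has an extreme first feature.
rows = [
    [1.0, 10.0],
    [2.0, 11.0],
    [3.0, 12.0],
    [2.0, 10.0],
    [1.0, 11.0],
    [50.0, 12.0],
]
cleaned = drop_outliers(rows, threshold=2.0)  # low threshold because the sample is tiny
normalized = standardize(cleaned)
```

With very few rows a z-score can never get large, which is why the toy call uses a threshold of 2.0 rather than the default.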
Note: you may need to wait a second after you press a checkbox. Works best with Chrome.
As we can see, our dataset is quite complicated since the classes overlap. I assumed that the features might have non-linear relationships that would help to separate the classes. To verify that, I applied manifold learning to the dataset in order to learn non-linear features, but this did not improve the classification results. Check out this Jupyter notebook for more details.
Even though the dataset is far from being perfect, we see that feature distributions of some genres are slightly different. Take a look at Horror, Biography, and Adventure, for example. This means we could try to use these features to classify movie genres.
Classification
We want an interpretable model since we need to know which topological properties are common to a certain genre. Hence, a decision tree is a good candidate.
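What makes a decision tree readable is that every node is a single threshold test on one feature. The sketch below shows how one such test can be chosen by minimizing Gini impurity, a common splitting criterion. It illustrates the principle only and is not the model used in this experiment; the function names and the modularity values are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity.

    Returns (threshold, impurity); samples with value <= threshold go left.
    The threshold is None if no split improves on the unsplit impurity.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = ((lo + hi) / 2, impurity)
    return best


# Made-up modularity values for six movies.
modularity = [0.10, 0.15, 0.20, 0.40, 0.45, 0.50]
genres = ["Biography"] * 3 + ["Horror"] * 3
threshold, impurity = best_split(modularity, genres)
```

On this toy data the chosen rule reads "modularity <= about 0.3 means Biography", which is exactly the kind of statement we want to extract from a trained tree.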
Our dataset is imbalanced. Dealing with class imbalance is difficult, especially with a very limited number of samples, and it is not the purpose of this experiment, so we go for a workaround. To simplify the task and the model itself, let’s split the data into two groups so that the genres within each group have more or less the same number of samples: 1) Action, Comedy, Drama and 2) Biography, Adventure, Horror. Let’s also put Crime movies aside for now.
Below, you can see the classification results of the decision tree models trained on our data. We built a decision tree for each pair of genres in our groups. When we classify only two genres (binary classification), the accuracy of the models is quite high, reaching 79% for the Horror vs Biography pair, for example. However, with three classes the accuracy drops to 56% for Horror vs Adventure vs Biography and to 45% for Action vs Comedy vs Drama, which is much closer to a random guess (roughly 33% for three equally sized classes). This implies that some genres have similar topological properties of character networks.
To read and interpret the trees below, assume that the upper branches correspond to True conditions and the lower ones are False.
Let’s take a look at the actual output of the model. Below, you can see classification results for each genre in every group. The majority of the movies are classified correctly in all cases. However, the number of misclassified movies grows with the number of classes in the model.
Interpretation of the results
By looking at the decision trees we built, we can say which features characterize each genre. For example, if we compare Horror vs Biography, we can see that Biography movies tend to have lower modularity and higher assortativity. In other words, characters in Biography movies tend to socialize with those who have a similar level of centrality and importance in the movie but, at the same time, do not form strong communities. Reminds you of real life, doesn’t it? On the other hand, characters in Horror movies socialize with everybody but tend to interact in groups (to me, this sounds like networking at a conference). Take a look at the decision trees above and try to interpret the results yourself, explaining the particular features of each genre.
Conclusions and possible extensions
I always feel like static networks lack an important component – temporal information. In this project, we ignored the time-series of character co-occurrences. It would be interesting to look at the dynamics of character interactions in movies. If we had time-series of character interactions, we would be able to learn weights between nodes in graphs. That would give us insights into the strength and importance of connections in social networks in the movies. One way to learn the weights would be to use the approach we developed for Wikipedia graph mining.
Besides, this time we chose to avoid the problem of imbalanced classes. It would be interesting to apply class-weighted versions of random forests or XGBoost and compare the classification results for all genres in the dataset.
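One simple way to obtain such weights is the inverse-frequency heuristic, where each class gets the weight n_samples / (n_classes * class_count). This is the same formula scikit-learn applies for `class_weight="balanced"`. The sketch below computes it by hand; the function name and the label counts are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


# Hypothetical, deliberately imbalanced label list.
labels = ["Drama"] * 6 + ["Comedy"] * 3 + ["Crime"] * 1
weights = balanced_class_weights(labels)
```

Rare genres receive larger weights, so that every genre contributes equally to the total weight seen during training.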
Due to the low number of samples in the Moviegalaxies dataset, this was more of a fun project than serious research. I am looking forward to the next data release by the Moviegalaxies team. Nonetheless, the results are quite interesting, and we have actually managed to find a few insights about social networks in movies. In general, the idea of using topological properties of graphs for classification problems still excites me. If we had more data related to this problem, it would be interesting to build another model at a larger scale.
Let me know what you think of this article on Twitter @mizvladimir or leave a comment below!