Instagram social graph analysis using Apache Spark and GraphFrames
Databricks recently released GraphFrames, a graph processing library for Apache Spark. I was excited enough to play with it right away: build a graph and try out the most popular methods of the library.
In this post I will show how to build and analyze a social graph based on data collected using the Instagram API. Check out my previous post if you want to know how to collect the data for the analysis.
I store the data in a local MongoDB instance, so I run all my queries with PyMongo and use the MongoDB aggregation framework at the preprocessing stage. After that I will show how to use the recently released GraphFrames library to build and analyze the Instagram social graph.
Data preparation
First, extract follower ids from the collected followers data. Here I show the code only for followers. You should do the same for follows as well; the code is identical.
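A minimal sketch of this extraction step. The document shape (a `user_id` field plus a `followers` array of objects with an `id`) and the names `FOLLOWER_PIPELINE` and `follower_pairs` are assumptions made for illustration, not the actual schema of my collection.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical document shape (an assumption):
# {"user_id": "1", "followers": [{"id": "2"}, {"id": "3"}]}

# Aggregation pipeline that emits one document per (user, follower) pair.
FOLLOWER_PIPELINE: List[dict] = [
    {"$unwind": "$followers"},
    {"$project": {"_id": 0, "user_id": 1, "follower_id": "$followers.id"}},
]


def follower_pairs(docs: Iterable[Dict]) -> List[Tuple[str, str]]:
    """Pure-Python equivalent of the pipeline for in-memory documents."""
    pairs = []
    for doc in docs:
        for follower in doc.get("followers", []):
            pairs.append((doc["user_id"], follower["id"]))
    return pairs
```

With PyMongo the pipeline would be passed to `aggregate` on the collection, for example `db.followers.aggregate(FOLLOWER_PIPELINE)`, where `db.followers` is a placeholder collection name. The pure-Python function is handy for checking the expected output on a few sample documents.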
After that, I extract user ids and follower ids and represent them as vertices and edges, respectively, in order to construct the social graph.
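One way to sketch this step in plain Python, assuming the input is a list of (user id, follower id) pairs. The function name `build_vertices_and_edges` and the choice to point each edge from the follower to the followed user are my assumptions; that direction makes a user's incoming edges equal to that user's followers.

```python
from typing import Iterable, List, Set, Tuple


def build_vertices_and_edges(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[List[Tuple[str]], List[Tuple[str, str]]]:
    """Turn (user_id, follower_id) pairs into vertex rows and edge rows."""
    vertex_ids: Set[str] = set()
    edges: Set[Tuple[str, str]] = set()
    for user_id, follower_id in pairs:
        # Both endpoints of a connection become vertices.
        vertex_ids.update((user_id, follower_id))
        # src = follower, dst = followed user; duplicates collapse in the set.
        edges.add((follower_id, user_id))
    vertices = [(vertex_id,) for vertex_id in sorted(vertex_ids)]
    return vertices, sorted(edges)
```

For the follows data the pairs would be (user id, followed id), so the edge would go from the user to the followed account instead.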
GraphFrames
Now we can construct a graph using GraphFrames and Apache Spark.
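A minimal sketch of the construction, assuming the vertices and edges are available as plain Python tuples and that `spark` is an existing SparkSession with the graphframes package available. The helper name `build_graph` is hypothetical. GraphFrames expects the vertex DataFrame to have an `id` column and the edge DataFrame to have `src` and `dst` columns.

```python
from typing import List, Tuple

VERTEX_COLUMNS = ["id"]        # column name GraphFrames expects for vertices
EDGE_COLUMNS = ["src", "dst"]  # column names GraphFrames expects for edges


def build_graph(spark, vertices: List[Tuple[str]], edges: List[Tuple[str, str]]):
    """Build a GraphFrame from plain Python rows.

    The import is local so that this module still loads on machines
    where graphframes is not installed.
    """
    from graphframes import GraphFrame

    vertex_df = spark.createDataFrame(vertices, VERTEX_COLUMNS)
    edge_df = spark.createDataFrame(edges, EDGE_COLUMNS)
    return GraphFrame(vertex_df, edge_df)
```

A call would look like `graph = build_graph(spark, [("1",), ("2",)], [("2", "1")])`.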
Now that we have a social graph, we can extract information about its properties. Check out the documentation if you want to know all the features that are available in GraphFrames.
PageRank of a user
I am interested in the PageRank and the degree of each user. The PageRank (PR) of a vertex is a numerical measure of the importance of that vertex in a graph. In the context of a social graph, PR represents the influence of a user over the social network. The algorithm considers only incoming links, in other words who follows a user, instead of counting all edges (that is, calculating the degree), which would not give a reasonable result. Moreover, the PR algorithm takes into account the PR of each follower: the higher the PR of a follower, the more that follower contributes to the user's PR.
We can compute the degree and PR by calling the corresponding methods of the GraphFrames library:
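A minimal sketch of those calls, assuming `graph` is a GraphFrame. The helper name `rank_and_degrees` and the values 0.15 and 0.01 are illustrative assumptions (they mirror the example in the GraphFrames documentation), not necessarily the settings behind the results in this post.

```python
def rank_and_degrees(graph, reset_probability: float = 0.15, tol: float = 0.01):
    """Return (vertices with PageRank, degrees) for a GraphFrame.

    Passing tol (rather than maxIter) asks pageRank to iterate
    until the ranks converge within that tolerance.
    """
    ranked = graph.pageRank(resetProbability=reset_probability, tol=tol)
    # ranked.vertices has a "pagerank" column;
    # graph.degrees is a DataFrame with "id" and "degree" columns.
    return ranked.vertices, graph.degrees
```

The two results can then be inspected separately, for example by sorting the ranked vertices on the `pagerank` column.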
The tol parameter in this case sets the algorithm to run until convergence, and resetProbability represents the alpha parameter. For more information about the PageRank implementation, read the documentation.
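To make the role of the two parameters concrete, here is a small pure-Python power iteration. It is a sketch of the textbook algorithm, not the GraphFrames implementation, and its normalisation (ranks summing to 1) may differ from what GraphFrames reports. The function name `pagerank`, the default values, and the assumption that every edge endpoint appears in `vertices` are mine.

```python
from typing import Dict, Hashable, Iterable, Tuple


def pagerank(
    vertices: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
    reset_probability: float = 0.15,
    tol: float = 1e-6,
    max_iter: int = 1000,
) -> Dict[Hashable, float]:
    """Power-iteration PageRank; assumes every edge endpoint is in vertices."""
    nodes = list(dict.fromkeys(vertices))  # de-duplicate, keep order
    n = len(nodes)
    if n == 0:
        return {}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(ranks[node] for node in nodes if not out_links[node])
        base = reset_probability / n + (1 - reset_probability) * dangling / n
        new_ranks = {node: base for node in nodes}
        # Each node passes an equal share of its rank along its outgoing edges.
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = (1 - reset_probability) * ranks[src] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share
        change = max(abs(new_ranks[node] - ranks[node]) for node in nodes)
        ranks = new_ranks
        if change < tol:  # converged: stop iterating
            break
    return ranks
```

Here `reset_probability` plays the role of alpha, and `tol` is the threshold on the largest per-vertex change between two iterations. Only incoming edges raise a vertex's rank, and a follower with a higher rank passes on a larger share.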
I built the graph using 45 588 unique users as vertices. The edges are the sum of 197 643 followers connections and 274 726 follows connections, 472 369 in total.
The result of the analysis shows that influence over the Instagram network depends on the number of follows links, which is not surprising. However, the term influence in this case refers only to the fact that the higher a user's PR, the more people see that user's content in their timelines. This does not mean that a higher PR implies popularity on Instagram, which is expressed in likes.
Degrees of a social graph
Another interesting property of a graph is the degree of its nodes. I collected data from a certain location and wanted to single out the group of potential tourists. We can do this by computing the degrees of the nodes of our social graph. My assumption is that the users (nodes) who have degree 0 in this area represent the group of potential tourists. We can easily check the assumption by comparing the latest locations of these users with the locations collected during holidays. Indeed, if we look at the map of where the users are spread, we will see that the assumption concerning the tourists is confirmed.
This means that we can use the Instagram social graph for tourist flow identification and analysis.
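The degree-0 filter can be sketched in plain Python as follows. I assume here that `vertices` are the users seen in the area and that an edge counts only when both of its endpoints are among them; that reading of "degree 0 in this area", like the function names, is an assumption of this sketch.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def local_degrees(
    vertices: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> Dict[Hashable, int]:
    """Degree of each vertex, counting only edges between known vertices."""
    counts = {vertex: 0 for vertex in vertices}
    for src, dst in edges:
        if src in counts and dst in counts:
            counts[src] += 1
            counts[dst] += 1
    return counts


def zero_degree_users(
    vertices: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> List[Hashable]:
    """Users with no connections to other users in the area."""
    counts = local_degrees(vertices, edges)
    return [vertex for vertex, degree in counts.items() if degree == 0]
```

As far as I know, the `degrees` DataFrame in GraphFrames leaves out vertices that have no edges, so with the library itself the degree-0 users would be the vertices missing from that result rather than rows with a zero.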
Let me know what you think of this article on twitter @mizvladimir or leave a comment below!