How to cluster users on Instagram using hashtags and Apache Spark

by Volodymyr Miz on May 13, 2016 under Clustering

5 minute read

Unlike other popular social networks, Instagram has no explicit grouping mechanisms such as communities or groups. I found a group of users that represents tourists in our data set by means of social graph analysis (see my previous post). The main purpose of this post is to identify implicit groups of Instagram users who post about certain topics. To do that, we apply a clustering algorithm to features extracted from hashtags. In this post I use data from public accounts, collected during the Easter holidays of 2016 in Switzerland. Follow this link to see the clusters that you can obtain by following the steps listed below.

[Image: word cloud of hashtags]

The word cloud is created using this service.

Data preparation

First, I extract hashtags from all user accounts in my dataset. The data collection process is described here. All data is stored in a local MongoDB instance.
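As an illustration, hashtag extraction can be sketched in plain Python. The document shape (`username` and `caption` fields) and the helper names are assumptions made for this example, not the schema used in the notebook; in practice the posts would be read from the MongoDB collection rather than from a hard-coded list.

```python
import re
from collections import defaultdict

# A hashtag is '#' followed by word characters (letters, digits, underscore).
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(caption):
    """Return the lowercased hashtags found in a caption, without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(caption or "")]


def hashtags_per_user(posts):
    """Group hashtags by user.

    `posts` is an iterable of dicts with hypothetical 'username' and
    'caption' keys.
    """
    tags = defaultdict(list)
    for post in posts:
        tags[post["username"]].extend(extract_hashtags(post.get("caption")))
    return dict(tags)


# Stand-in for documents read from MongoDB (made-up sample data).
sample_posts = [
    {"username": "alice", "caption": "Morning hike #Zermatt #nature"},
    {"username": "alice", "caption": "Fondue time #food"},
    {"username": "bob", "caption": "No tags here"},
]
user_tags = hashtags_per_user(sample_posts)
```

Lowercasing merges variants such as `#Zermatt` and `#zermatt` into one feature; whether that is desirable depends on the data.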

Then I prepare a Spark DataFrame for the clustering algorithm.
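A minimal sketch of this step, assuming the DataFrame-based API (`pyspark.ml`) and hypothetical column names `user`, `hashtags`, `tf` and `features`; the notebook may build its features differently. Hashtag lists become term-count vectors, which are then reweighted with IDF.

```python
def to_rows(user_tags):
    """Turn {user: [hashtag, ...]} into (user, hashtags) rows, skipping
    users without any hashtags."""
    return [(user, tags) for user, tags in sorted(user_tags.items()) if tags]


def build_features(spark, user_tags):
    """Return (DataFrame with a TF-IDF 'features' column, vocabulary list)."""
    # Imported here so that to_rows() stays usable without Spark installed.
    from pyspark.ml.feature import CountVectorizer, IDF

    df = spark.createDataFrame(to_rows(user_tags), ["user", "hashtags"])
    # Term counts per user; the fitted model keeps the index -> hashtag mapping.
    cv_model = CountVectorizer(inputCol="hashtags", outputCol="tf").fit(df)
    tf = cv_model.transform(df)
    # Down-weight hashtags that occur for almost every user.
    idf_model = IDF(inputCol="tf", outputCol="features").fit(tf)
    return idf_model.transform(tf), cv_model.vocabulary
```

With an active `SparkSession` named `spark`, calling `build_features(spark, {"alice": ["zermatt", "food"]})` would return the feature DataFrame together with the vocabulary needed later to map term indices back to hashtags.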

Clustering

For clustering I use the Latent Dirichlet Allocation (LDA) algorithm. LDA is a topic model that extracts topics from a collection of documents. We represent the feature vectors as vectors of words weighted by TF-IDF coefficients. We did not use a plain Bag of Words approach: because of the nature of Instagram posts, each word appears only once in a post, so raw counts would bias the clusters towards the most popular hashtags. Many clustering models restrict a document to being associated with a single topic. LDA, on the other hand, samples the topic node repeatedly within the document, so a document can be associated with multiple topics. This is exactly what we need for tag clustering, since one user can post about different topics such as nature, traveling, food and events in one post.
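To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on invented hashtag lists. It uses the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)) documented for Spark MLlib's IDF, where m is the number of documents and df(t) is the number of documents containing term t. The sample data and function name are made up for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents
    and df(t) is the number of documents that contain term t.
    """
    m = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]


# Invented example: 'switzerland' is in every document, the rest is rarer.
docs = [
    ["switzerland", "nature"],
    ["switzerland", "food"],
    ["switzerland", "nature", "hiking"],
]
weights = tf_idf(docs)
```

With plain counts every hashtag in these documents would weigh 1. Here the ubiquitous tag gets weight 0, while rarer tags such as `food` keep a positive weight.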

I use the LDA implementation provided by Spark MLlib.

LDA does not perform well on this data with the EMLDAOptimizer, which is used by default: the topics are significantly biased towards the most popular hashtags. I used the OnlineLDAOptimizer instead. This optimizer implements the Online Variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration and updates the term-topic distribution adaptively. Because the optimization is approximate and the subset is sampled at random, the algorithm gives slightly different results on every run; however, the clusters and topic models remain more or less the same.
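A hedged sketch of how the online optimizer might be selected, assuming the DataFrame-based `pyspark.ml.clustering.LDA` estimator, where the optimizer is chosen by name. The helper names, `maxIter` and the seed are illustrative values rather than the settings used for this post; `k=15` matches the number of topics reported in this post.

```python
def online_lda_params(k=15, seed=42):
    """Keyword arguments for pyspark.ml.clustering.LDA (illustrative values)."""
    return {
        "k": k,                    # number of topics
        "optimizer": "online",     # Online Variational Bayes instead of EM
        "subsamplingRate": 0.05,   # fraction of the corpus used per iteration
        "maxIter": 50,             # assumed value, tune for your data
        "seed": seed,              # a fixed seed reduces run-to-run variation
        "featuresCol": "features",
    }


def fit_lda(features_df, **overrides):
    """Fit LDA on a DataFrame that has a vector column named 'features'."""
    # Imported here so the parameter helper works without Spark installed.
    from pyspark.ml.clustering import LDA

    params = {**online_lda_params(), **overrides}
    return LDA(**params).fit(features_df)
```

Calling `fit_lda(features_df, k=10)` would fit a model with 10 topics while keeping the other settings unchanged.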

Visualization

Let’s extract the topics from the LDA model we built in the previous section and visualize them using the D3.js library. First, we prepare the data and transform it into JSON format.
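One possible way to shape the topics for D3.js is a nested name/children hierarchy, as used by many D3 examples. The sketch below assumes each topic is given as a pair of term indices and term weights (similar to what Spark's `describeTopics` provides) plus a vocabulary list; the function name, field names and sample values are hypothetical.

```python
import json


def topics_to_d3(topics, vocabulary, root_name="topics"):
    """Build a nested dict: root -> topics -> hashtags with their weights."""
    children = []
    for topic_id, (term_indices, term_weights) in enumerate(topics):
        children.append({
            "name": "topic {}".format(topic_id),
            "children": [
                {"name": "#" + vocabulary[i], "size": round(w, 4)}
                for i, w in zip(term_indices, term_weights)
            ],
        })
    return {"name": root_name, "children": children}


# Invented vocabulary and topics, only to show the output shape.
vocabulary = ["zermatt", "nature", "food", "followme"]
topics = [([0, 1], [0.6, 0.3]), ([3, 2], [0.7, 0.2])]

d3_json = json.dumps(topics_to_d3(topics, vocabulary), indent=2)
```

The resulting string can be written to a file and loaded by the D3.js page; each top-level child corresponds to one topic and can be given its own color.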

Now the data is ready to be plotted. Click this link to see the clusters. Each color in the graph represents one topic, that is, one cluster. The cluster diagram clearly shows the 15 topics that were found in the dataset.

To conclude, we combined TF-IDF weighting with LDA to obtain meaningful clusters from hashtag data. This approach makes it possible to reveal certain events (e.g. Basel World 2016 exhibition) and even groups of posts that do not contain meaningful information (e.g. #l4l, #followme, #instalike) from hashtag data alone.

Another way to use Instagram hashtags is to plot the most popular ones on the map and see what people see, do and feel in certain places.

Check out more visualizations in my next blog post.

If you want to reproduce the results or just check out the full code, take a look at the public IPython notebook. It is a bit messy, so feel free to leave comments and questions below.

Data analysis, Instagram, Clustering, Spark, Python

I love feedback.
Let me know what you think of this article on Twitter @mizvladimir or leave a comment below!