How to integrate Apache Spark, IntelliJ IDEA and Scala
I love Jupyter Notebook. It is really useful when I want to present some code, let someone reproduce my research, or just learn how to use new tools and libraries. I use Jupyter almost every day and, like many others, I started learning Spark and developed my first data analysis pipelines using interactive notebooks and the Python API. Then I realized that I wanted more and that running notebooks locally was not enough for me, so I signed up for a Databricks Community Edition subscription. Databricks lets you forget about the problems of setting up and maintaining the environment.
Everyone who learns and uses Spark eventually realizes that the Python API is not as powerful and flexible as the API of the framework's core language, Scala. Scala gives you access to the full power of Spark, including its analytics, streaming and graph processing tools. However, Spark is just another framework for large-scale data analytics. Yes, it is convenient and powerful, but it ships with a limited number of algorithms, and sometimes you need to implement your own. That is the moment when you need an IDE.
I faced this problem when we decided to implement a recommendation algorithm that was recently developed at LTS2, where I am pursuing my PhD. The algorithm has a custom loss function, gradient, update rules and a tricky optimization part, so I cannot use the recommendation algorithms already implemented in Spark (e.g. ALS). But that is a topic for another blog post. For now, we are going to create a Spark Scala project in the IntelliJ IDEA IDE.
Create a Spark Scala project in IntelliJ IDEA
I decided to use IntelliJ IDEA Community Edition, and I am going to show how to run Apache Spark programs written in Scala using this IDE. I assume that you have already installed the IDE, the Scala plugin, SBT and a JDK. If not, first follow the instructions: install SBT and JDK, install IntelliJ IDEA and Scala plugin. The instructions below were tested on Ubuntu and Windows. They should work on Mac OS as well.
1. Create Scala project
Once you have everything installed, the first step is to create an SBT-based Scala project. Your JDK, Scala and SBT versions may vary.
2. Add libraries
The next step is to add the libraries to the project. To do this, update the build.sbt file with the required dependencies. In this example I added only the Spark core library. If you need other libraries, you can find them in the Maven repository.
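The original listing is not reproduced here, so the following is only a minimal sketch of what such a build.sbt might look like. The project name and every version number are assumptions: pick a Spark release that is published for your Scala version and that supports your JDK.

```scala
// build.sbt: a minimal sketch; the name and all version numbers are assumptions
name := "spark-scala-example"

version := "0.1"

// Must be a Scala version for which the chosen Spark release is published
scalaVersion := "2.12.18"

// %% appends the Scala binary version (here _2.12) to the artifact name,
// so this resolves to org.apache.spark:spark-core_2.12:3.5.1
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.1"
```

Other Spark modules, such as spark-sql or spark-mllib, can be added with further `libraryDependencies +=` lines of the same form.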
Rebuild the project. You might also need to open the console in IntelliJ IDEA (ALT + F12) and type:
sbt plugins update
Sometimes this is done automatically, but if the compiler says that something is undefined, run this command in order to force the download process.
3. Run Spark program
Now we can check if the project works. Create a Scala class (e.g. in the src/main/scala folder) and run the simple word count example.
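A word count along these lines could look like the sketch below. It is not the original listing: the object name `WordCount`, the helper `countWords` and the sample sentences are hypothetical, and the `local[*]` master is an assumption that lets the program run inside the IDE without a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  // Counts how often each whitespace-separated word occurs, using the RDD API
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+"))   // split every line into words
      .filter(_.nonEmpty)         // drop empty tokens caused by leading spaces
      .map(word => (word, 1))     // pair each word with a count of one
      .reduceByKey(_ + _)         // sum the counts per word
      .collect()                  // bring the (small) result to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process on all available cores
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val lines = Seq("to be or not to be", "that is the question")
      countWords(sc, lines).toSeq.sortBy(_._1).foreach {
        case (word, count) => println(s"$word: $count")
      }
    } finally {
      sc.stop() // always release Spark resources
    }
  }
}
```

To count the words of a real file, `sc.parallelize(lines)` could be replaced with `sc.textFile(path)`; the rest of the pipeline stays the same.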
Alternatively, you can check out a similar project from my Git repository.
That is it. Now you can build your custom machine learning algorithms using Scala, Apache Spark and the IntelliJ IDEA IDE.
Let me know what you think of this article on twitter @mizvladimir or leave a comment below!