How to integrate Apache Spark, IntelliJ IDEA and Scala
Updated 7 June 2020.
Here is a quick one-minute walkthrough tutorial on how to create a Spark Scala SBT project in the IntelliJ IDEA IDE. If you need more detailed instructions, keep scrolling and you will find them below.
I love Jupyter Notebook. It is really useful when I want to present some code, let someone reproduce my research, or just learn how to use new tools and libraries. I use Jupyter almost every day and, like many others, when I first started learning Spark I developed my first data analysis pipelines using interactive notebooks and the Python API. Then I realized that I wanted more, and running notebooks locally was not enough for me, so in 2015 I signed up for a Databricks Community Edition subscription. Databricks allowed me to forget about the problems related to setting up and maintaining the environment.
Everyone who is learning and using Spark eventually realizes that the Python API is not as powerful and flexible as the core language of the framework, Scala. Scala lets you tap into the full power of Spark, including its analytics, streaming, and graph processing tools. However, Spark is still just another framework for large-scale data analytics. Yes, it is convenient and powerful, but it ships with a limited number of algorithms, and sometimes you need to implement your own custom algorithm. That is the moment when you need an IDE.
Motivation
I faced this problem for the first time in 2016, when we decided to implement a recommendation algorithm that had recently been developed in LTS2, where I had just started my PhD. The algorithm had a custom loss function, gradient, update rules, and a tricky optimization part, so I could not use the recommendation algorithms already implemented in Spark (e.g. ALS). But that is a topic for another blog post. Now we are going to create a Spark Scala project in the IntelliJ IDEA IDE.
Create a Spark Scala project in IntelliJ IDEA
I decided to use IntelliJ IDEA Community Edition, and I am going to show how to run Apache Spark programs written in Scala using this IDE. I assume that you have already installed the IDE, the Scala plugin, SBT, and a JDK. If you do not have them installed, do that first.
I have tested this tutorial on Ubuntu 16.04 and 18.04, Windows 8, and macOS Catalina.
1. Create Scala project
Once you have everything installed, the first step is to create an SBT-based Scala project. Your JDK, Scala, and SBT versions may vary, but make sure that they are compatible with the Spark libraries you are going to use; for example, Spark 2.4.x supports Scala 2.11 and 2.12 on Java 8, while Spark 3.0 requires Scala 2.12 and runs on Java 8 or 11.
2. Add libraries
The next step is to add a few Spark libraries to the project. To do that, update the content of the build.sbt file by copying and pasting the code below. In this example I added the Spark Core and Spark SQL libraries. If you need other libraries, you can find them in the Maven repository.
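Here is a minimal build.sbt sketch. The project name and the exact Scala and Spark versions below are illustrative examples, not requirements; substitute whichever compatible pair you intend to use.

name := "spark-sandbox"  // hypothetical project name

version := "0.1"

scalaVersion := "2.12.10"  // example version; must match the Spark build below

// Spark Core and Spark SQL; 2.4.5 is an example version
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.5",
  "org.apache.spark" %% "spark-sql" % "2.4.5"
)

The %% operator appends the Scala binary version to the artifact name (e.g. spark-core_2.12), which is why the Scala and Spark versions have to agree.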
Rebuild the project. You might also need to open the console in IntelliJ (Alt+F12) and type:
sbt plugins update
Sometimes this is done automatically, but if the compiler says that something is undefined, run this command in order to force the download process.
3. Run Spark program
Now we can check if the project works. Create a Scala class (e.g. in the src/main/scala folder) and run a simple word count example.
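As a sketch, here is one way such a word count might look. The object name WordCount and the hard-coded sample lines are my illustrative choices, not part of the original tutorial; point the job at a real text file if you have one.

import org.apache.spark.sql.SparkSession

// Minimal word count sketch; object name and sample data are illustrative.
object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside the IDE using all available cores
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    val lines = spark.sparkContext.parallelize(Seq(
      "hello spark",
      "hello scala",
      "spark runs in intellij"
    ))

    // Split lines into words, pair each word with 1, and sum the counts
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}

Run the object from IntelliJ and the word counts should appear in the console.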
Alternatively, you can check out a similar project from my GitHub repository.
That’s it. Now you can build your custom machine learning algorithms using Scala, Apache Spark, and the IntelliJ IDEA IDE.
Happy coding!
Let me know what you think of this article on Twitter @mizvladimir, or leave a comment below!