Pyspark setup for IntelliJ IDEA

A simple configuration of a new Python project in IntelliJ IDEA with a working Pyspark. It is inspired by the "Pyspark on IntelliJ" blog post by Gaurav M Shah; I just removed all the parts about deep learning libraries.

I assume that you have a working IntelliJ IDEA installation with the Python plugin, and Python 3 installed on your machine. We will create a Python project in IntelliJ IDEA, switch its Python SDK to a virtualenv-based Python SDK, declare the Pyspark dependency in a requirements file, install it in this virtualenv, and finally test the setup with a small Pyspark hello world.
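The prerequisites above can be checked from Python before starting. This is only a sketch: the function name check_prerequisites is hypothetical, and the Java check is an assumption added here (Pyspark starts a JVM, so a java executable is normally expected on the PATH, although the original text does not list it).

```python
import shutil
import sys


def check_prerequisites() -> dict:
    """Report whether the basic requirements for this setup appear to be met."""
    return {
        # The guide assumes Python 3.
        "python3": sys.version_info.major >= 3,
        # Assumption: Pyspark needs a Java runtime; look for "java" on the PATH.
        "java_on_path": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```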

Create a Python Project

Open IntelliJ IDEA
Click on File → New → Project
Select Python and click Next, do not select any additional libraries
Do not tick "Create project from template" and click Next
Name your project and click Finish

Create new Python SDK using VirtualEnv

Go to File → Project Structure
In the Project Structure window, go to the SDKs section
Click on the + at the top of the window → Add Python SDK
Select the Virtualenv Environment section
Click OK; the new virtual environment's base directory will be the venv directory in your project’s root directory
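If you prefer not to use the dialog, a roughly equivalent environment can be created with Python's standard venv module. This is a minimal sketch, not what IntelliJ IDEA runs internally; the helper name create_project_venv is hypothetical, and you would still need to point the project SDK at the resulting venv/bin/python.

```python
import venv
from pathlib import Path


def create_project_venv(project_root: str, with_pip: bool = True) -> Path:
    """Create a virtual environment in <project_root>/venv and return its path."""
    env_dir = Path(project_root) / "venv"
    # with_pip=True bootstraps pip so that requirements can be installed later.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


if __name__ == "__main__":
    print(create_project_venv("."))
```

This has the same effect as running python3 -m venv venv from the project root directory.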

Change Python SDK used by project

In the Project Structure window, go to the Project section
In the Project SDK dropdown menu, select the Python SDK created in the previous section
Click Apply

Add Pyspark Dependency to VirtualEnv

In your project root directory, create a requirements.txt file containing the following line:

pyspark==3.0.1

Download Pyspark Dependency in VirtualEnv

In a terminal, go to your project root directory
Activate the virtual environment

% source venv/bin/activate
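To confirm that the activation worked, you can ask the interpreter itself. The sketch below relies on the convention that sys.prefix differs from sys.base_prefix inside a virtual environment (older virtualenv releases set sys.real_prefix instead); the function name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases expose sys.real_prefix.
    if getattr(sys, "real_prefix", None) is not None:
        return True
    # venv and recent virtualenv make sys.prefix differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

Run with the activated environment, the printed interpreter path should point inside the venv directory of your project.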

Install Pyspark dependency in VirtualEnv

% pip install -r requirements.txt
Collecting pyspark==3.0.1
  Downloading pyspark-3.0.1.tar.gz (204.2 MB)
     |████████████████████████████████| 204.2 MB 94 kB/s
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 8.0 MB/s
Using legacy 'setup.py install' for pyspark, since package 'wheel' is not installed.
Installing collected packages: py4j, pyspark
    Running setup.py install for pyspark ... done
Successfully installed py4j-0.10.9 pyspark-3.0.1
pip install -r requirements.txt  9,10s user 2,80s system 47% cpu 24,840 total
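After the installation, you may want to check that the installed version matches the pin in requirements.txt. Here is a minimal sketch using the standard importlib.metadata module; the helper names parse_pin and installed_version are hypothetical, and the parser only handles simple name==version lines like the one used here. Run it with the virtualenv's interpreter.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_pin(line: str) -> Tuple[str, str]:
    """Split a 'name==version' requirement line into (name, version)."""
    name, _, version = line.strip().partition("==")
    return name.strip(), version.strip()


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    name, wanted = parse_pin("pyspark==3.0.1")
    found = installed_version(name)
    print(f"{name}: wanted {wanted}, found {found}")
```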

Deactivate virtual environment

% deactivate

Test your Project’s configuration

In the root directory of your project, create a file PysparkTest.py containing the following lines:

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("test").getOrCreate()

sparkSession.createDataFrame([(1, "value1"), (2, "value2")], ["id", "value"]).show()

Run this file in IntelliJ IDEA (right-click on the file → Run). You should get output similar to the lines below:

/home/vincent/Projets/Personnel/pyspark-sandbox/venv/bin/python /home/vincent/Projets/Personnel/pyspark-sandbox/PysparkTest.py
21/01/24 15:07:14 WARN Utils: Your hostname, clementine resolves to a loopback address: 127.0.1.1; using 192.168.1.13 instead (on interface wlp3s0)
21/01/24 15:07:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/01/24 15:07:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+---+------+
| id| value|
+---+------+
|  1|value1|
|  2|value2|
+---+------+