This post walks through a simple configuration of a new Python project in IntelliJ IDEA with a working Pyspark installation. It was inspired by the "Pyspark on IntelliJ" blog post by Gaurav M Shah; I just removed all the parts about deep learning libraries.

I assume that you have a working IntelliJ IDEA installation with the Python plugin, and Python 3 installed on your machine. We will create a Python project in IntelliJ IDEA, switch its Python SDK to a virtualenv-based one, declare Pyspark as a dependency in requirements.txt, install it in the virtualenv, and finally test the setup with a small Pyspark hello world.

Create a Python Project

  • Open IntelliJ IDEA

  • Click on File → New → Project

  • Select Python and click Next; do not select any additional libraries

  • Do not tick "Create project from template" and click Next

  • Name your project and click Finish

Create new Python SDK using VirtualEnv

  • Go to File → Project Structure

  • In the Project Structure window, go to the SDKs section

  • Click on the + at the top of the window → Add Python SDK

  • Go to the Virtualenv Environment section

  • Click OK; by default, the new virtual environment’s base directory is the venv directory in your project’s root directory
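
For readers who prefer to script this step, the following is a minimal sketch that creates a comparable environment with Python's standard venv module. It only approximates what the IDE does; the helper name create_project_venv and its with_pip parameter are made up for this illustration.

```python
import sys
import venv
from pathlib import Path


def create_project_venv(project_root: str, with_pip: bool = True) -> Path:
    """Create <project_root>/venv and return the path to its interpreter."""
    env_dir = Path(project_root) / "venv"
    # with_pip=True bootstraps pip so packages can be installed afterwards
    venv.create(env_dir, with_pip=with_pip)
    # Windows uses Scripts\python.exe, other platforms use bin/python
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"
```

Calling create_project_venv(".") from the project root should leave a venv directory there, which can then be selected as an existing environment in the Add Python SDK dialog.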

Change Python SDK used by project

  • In the Project Structure window, go to the Project section

  • In the Project SDK dropdown menu, select the Python SDK created in the previous section

  • Click Apply

Add Pyspark Dependency to VirtualEnv

  • In your project’s root directory, create a file named requirements.txt containing the following line:

pyspark==3.0.1

Download Pyspark Dependency in VirtualEnv

  • In a terminal, go to your project root directory

  • Activate the virtual environment

% source venv/bin/activate

  • Install Pyspark dependency in VirtualEnv

% pip install -r requirements.txt
Collecting pyspark==3.0.1
  Downloading pyspark-3.0.1.tar.gz (204.2 MB)
     |████████████████████████████████| 204.2 MB 94 kB/s
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 8.0 MB/s
Using legacy 'setup.py install' for pyspark, since package 'wheel' is not installed.
Installing collected packages: py4j, pyspark
    Running setup.py install for pyspark ... done
Successfully installed py4j-0.10.9 pyspark-3.0.1
pip install -r requirements.txt  9,10s user 2,80s system 47% cpu 24,840 total

  • Deactivate virtual environment

% deactivate
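
To double-check the installation, a small script can compare the version pinned in requirements.txt with what is actually installed. This is a sketch under assumptions: it only understands simple name==version lines, the helper names are hypothetical, and it has to be run with the virtualenv's interpreter (venv/bin/python) to see the packages installed there.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_pin(line: str) -> Tuple[str, str]:
    """Split a 'name==version' requirement line into (name, version)."""
    name, _, version = line.strip().partition("==")
    return name.strip(), version.strip()


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_is_satisfied(line: str) -> bool:
    """True when the installed version equals the pinned one."""
    name, pinned = parse_pin(line)
    return installed_version(name) == pinned
```

After the install above, pin_is_satisfied("pyspark==3.0.1") should return True when evaluated inside the virtualenv, and False with an interpreter that does not have that version of Pyspark.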

Test your Project’s configuration

  • In the root directory of your project, create a file named PysparkTest.py containing the following lines:

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("test").getOrCreate()

sparkSession.createDataFrame([(1, "value1"), (2, "value2")], ["id", "value"]).show()

  • Run this file in IntelliJ IDEA (right-click on the file → Run). You should get something similar to the lines below:

/home/vincent/Projets/Personnel/pyspark-sandbox/venv/bin/python /home/vincent/Projets/Personnel/pyspark-sandbox/PysparkTest.py
21/01/24 15:07:14 WARN Utils: Your hostname, clementine resolves to a loopback address: 127.0.1.1; using 192.168.1.13 instead (on interface wlp3s0)
21/01/24 15:07:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/01/24 15:07:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+---+------+
| id| value|
+---+------+
|  1|value1|
|  2|value2|
+---+------+
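
As a possible variation on the test script, the sketch below wraps the same DataFrame in a function, lowers the log level for messages emitted after the session starts, and stops the session when done. The function name run_smoke_test, the explicit local[*] master and the ERROR log level are choices made for this illustration, not part of the original setup.

```python
# Rows and column names taken from the test script above
ROWS = [(1, "value1"), (2, "value2")]
COLUMNS = ["id", "value"]


def run_smoke_test():
    """Start a local session, round-trip ROWS through a DataFrame, return them."""
    # Imported here so this module still loads where pyspark is not installed
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
    try:
        # Only affects messages logged after this call
        spark.sparkContext.setLogLevel("ERROR")
        df = spark.createDataFrame(ROWS, COLUMNS)
        df.show()
        # Row objects are tuples, so they convert cleanly
        return sorted(tuple(row) for row in df.collect())
    finally:
        # Release the JVM resources held by the session
        spark.stop()
```

Calling run_smoke_test() from a script run in the IDE should print the same table as above and return the two rows. The startup warnings would still appear, since they are logged before the level is changed.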