Simple configuration of a new Python IntelliJ IDEA project with a working Pyspark setup. I was inspired by the "Pyspark on IntelliJ" blog post by Gaurav M Shah; I just removed all the parts about deep learning libraries.
I assume that you have a working IntelliJ IDEA IDE with the Python plugin installed, and Python 3 installed on your machine. We will create a Python project in IntelliJ IDEA, change its Python SDK to a virtualenv-based one, declare a Pyspark dependency, install it in the virtualenv, and finally test everything with a small Pyspark hello world.
Create a Python Project
- Open IntelliJ IDEA
- Click on File → New → Project
- Select Python and click Next; do not select any additional libraries
- Do not tick "Create project from template" and click Next
- Name your project and click Finish
Create new Python SDK using VirtualEnv
- Go to File → Project Structure
- In the Project Structure window, go to the SDKs section
- Click on the + at the top of the window → Add Python SDK
- Go to the Virtualenv Environment section
- Click OK; the new virtual environment base directory will be /venv in your project's root directory
Change Python SDK used by project
- In the Project Structure window, go to the Project section
- In the Project SDK section, select the Python SDK created in the previous section from the dropdown menu
- Click on Apply (a quick sanity check follows this list)
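At this point the project runs on the virtualenv interpreter. As a quick way to confirm it (my addition, not part of the original walkthrough), you can create a throwaway script and run it with the new SDK; in a virtualenv, sys.prefix points at the environment's base directory:
# CheckSdk.py (hypothetical file name): run it from IntelliJ IDEA with the new SDK
import sys
print(sys.prefix)  # should end with /venv, the environment created above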
Add Pyspark Dependency to VirtualEnv
- In your project root directory, create a file named requirements.txt containing the following line:
pyspark==3.0.1
Pinning the version keeps the setup reproducible; pip will pull in the matching py4j dependency on its own, as the install log below shows.
Download Pyspark Dependency in VirtualEnv
- In a terminal, go to your project root directory
- Activate the virtual environment:
% source venv/bin/activate
- Install the Pyspark dependency in the virtualenv:
% pip install -r requirements.txt
Collecting pyspark==3.0.1
  Downloading pyspark-3.0.1.tar.gz (204.2 MB)
     |████████████████████████████████| 204.2 MB 94 kB/s
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 8.0 MB/s
Using legacy 'setup.py install' for pyspark, since package 'wheel' is not installed.
Installing collected packages: py4j, pyspark
    Running setup.py install for pyspark ... done
Successfully installed py4j-0.10.9 pyspark-3.0.1
pip install -r requirements.txt  9,10s user 2,80s system 47% cpu 24,840 total
- Deactivate the virtual environment (a quick verification of the install follows below):
% deactivate
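To verify that the package really landed in the virtualenv (my addition, not in the original post), you can ask the environment's interpreter directly, for example by running this snippet with venv/bin/python so no activation is needed:
# CheckInstall.py (hypothetical file name): run with venv/bin/python
import pyspark
print(pyspark.__version__)  # expected: 3.0.1, matching the pin in requirements.txt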
Test your Project’s configuration
- In the root directory of your project, create a file named PysparkTest.py containing the following lines:
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session for this application
sparkSession = SparkSession.builder.appName("test").getOrCreate()

# Build a two-row DataFrame and print it to stdout
sparkSession.createDataFrame([(1, "value1"), (2, "value2")], ["id", "value"]).show()
- Run this file in your IntelliJ IDEA IDE (right-click on the file → Run). You should get something similar to the lines below; a note on the WARN lines follows the output:
/home/vincent/Projets/Personnel/pyspark-sandbox/venv/bin/python /home/vincent/Projets/Personnel/pyspark-sandbox/PysparkTest.py
21/01/24 15:07:14 WARN Utils: Your hostname, clementine resolves to a loopback address: 127.0.1.1; using 192.168.1.13 instead (on interface wlp3s0)
21/01/24 15:07:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/01/24 15:07:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+---+------+
| id| value|
+---+------+
|  1|value1|
|  2|value2|
+---+------+
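The WARN lines are harmless in this local setup: Spark simply falls back to a non-loopback address and to its built-in Java classes. If you prefer a quieter console, the log itself suggests sc.setLogLevel; here is a minimal sketch (my addition) of two lines you could append to PysparkTest.py:
# Optional: from this point on, only print errors (per the hint in the log above)
sparkSession.sparkContext.setLogLevel("ERROR")

# Optional: shut the Spark session down cleanly at the end of the script
sparkSession.stop()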