Apache Spark is developed in Scala. However Python API is more and more popular as Python is becoming the main language of Data Science. Although Python and Scala APIs are very close, there are some differences that can prevent a developer used to one API to smoothly use the other. This article lists those small differences, from the point of view of a Scala Spark developer wanting to use PySpark.

Pyspark’s Libraries

SQL API for pyspark is in the package pyspark.sql. You can have a description of this API in Spark’s Documentation.

Scala Python
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types._
import org.apache.spark.sql.Window
from pyspark.sql.functions import col
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType
from pyspark.sql.types import *
from pyspark.sql import Window

Create DataFrame is explicit in Pyspark

In Pyspark, you can’t import implicits from spark context and use them to create dataframes as in Scala.

Scala Python
import sparkSession.implicits._

Seq((1, "a"), (2, "b")).toDF("id", "value")
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

Moreover, if you want to create a dataframe with only one column, you should explicitly create an array of arrays in Pyspark.

Scala Python
import sparkSession.implicits._

Seq(1, 2).toDF()
spark.createDataFrame([[1], [2]])

== for column equality

As in Python, you can overload operators, equality comparison between columns is done using == instead of === in Scala API.

Scala Python
col("column_name") === 3
col("column_name") == 3

& for logical conjunction and | for logical disjunction

In Pyspark, you use & for logical conjunction and | for logical disjunction, instead of && and || in Scala. Moreover, as operators can be overloaded, you should use parenthesis to bound your logical expressions.

Scala Python
col("column_name") < 4 && col("column_name") > 2
(col("column_name") < 4) & (col("column_name") > 2)

End of line antislash \ for multiline commands

In Python, you can’t chain method on several lines like in Scala, you need to break the line using antislash \.

Scala Python
df.filter(...)
  .withColumn(...)
  .show()
df.filter(...) \
  .withColumn(...) \
  .show()

Named arguments in function

In Python, there is no method overloading. You can’t do polymorphism having several methods sharing the same name but having different number/type of arguments like in Scala. When you call a method with several arguments, you need to use named argument if you want to change trailing arguments.

Scala Python
df.show(false)
df.show(truncate=False)

Parenthesis when calling function/object

In Python, you need to add parenthesis when you call methods.

Scala Python
df.distinct.show
df.distinct().show()

This is also the case for singleton objects. In Scala, you have the object keyword to create a singleton, but not in Python. Thus, you need to create your singleton every time you use it.

Scala Python
df.withColumn("cast", col("raw").cast(IntegerType))
df.withColumn("cast", col("raw").cast(IntegerType()))

Boolean values start with uppercase

In Python, boolean values are True and False instead of true and false in Scala.

Scala Python
col("column_name") === false
col("column_name") == False

Rename column with alias instead of as

In Python, as is a reserved keyword. So this method doesn’t exist in Column class, you have to use alias instead when you want to rename a column.

Scala Python
col("column_name").as("column_alias")
col("column_name").alias("column_alias")

There is no Dataset API in Python

Pyspark does not support Dataset API. If you need to perform transformations such as map, you need to use RDD.

Scala Python
case class Item(id: Int, value: String)

import sparkSession.implicits._
df.as[Item].map(_.value).toDF
class Item:
    def __init__(self, id, value):
       self.id = id
       self.value = value

df.rdd.map(lambda x: x.value).toDF()

Some methods does not exist in Pyspark’s DataFrame

As of Spark 3.0, several methods of Scala’s DataFrame API are not available in Pyspark’s Dataframes. Here is the list:

collectAsList, reduce, takeAsList, inputFiles, isEmpty, javaRDD, writeTo, as, except, flatMap, groupByKey, joinWith, map, mapPartitions, observe, randomSplitAsList, apply, encoder, queryExecution, sparkSession, sqlContext

Types

You can find all the types in Spark Documentation. The following array is merely a copy of this documentation, with Scala types along Python types.

Spark SQL Type Scala Type Python Type

BooleanType

Boolean

bool

ByteType

Byte

int or long

ShortType

Short

int or long

IntegerType

Int

int or long

LongType

Long

long

FloatType

Float

float

DoubleType

Double

float

DecimalType

java.math.BigDecimal

decimal.Decimal

StringType

String

string

BinaryType

Array[Byte]

bytearray

DateType

java.sql.Date

datetime.date

TimestampType

java.sql.Timestamp

datetime.datetime

ArrayType

scala.collection.Seq

list, tuple, or array

MapType

scala.collection.Map

dict

StructType

org.apache.spark.sql.Row

list or tuple

Wrap Up

To summarize this article, here is an example which highlights all the point developed before.

Scala Python
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

import sparkSession.implicits._

case class Item(id: Long, value: Boolean)

Seq((1, true), (2, false))
  .toDF("id", "value")
  .filter(col("id") === 1 || col("value") === false)
  .as[Item]
  .map(item => Item(item.id + 1, item.value))
  .toDF
  .distinct
  .withColumn("id2", col("id").cast(IntegerType))
  .select(col("id2").as("intId"))
  .show(false)
from pyspark.sql.functions import *
from pyspark.sql.types import LongType

# this class should be defined in another file
class Item:
    def __init__(self, id, value):
       self.id = id
       self.value = value

sparkSession \
  .createDataFrame([(1, True), (2, False)], ["id", "value"]) \
  .filter((col("id") == 1) | (col("value") == False)) \
  .rdd \
  .map(lambda item: Item(item.id + 1, item.value)) \
  .toDF() \
  .distinct() \
  .withColumn("id2", col("id").cast(IntegerType())) \
  .select(col("id2").alias("intId")) \
  .show(truncate=False)