Pyspark gotchas for Scala Spark developers

Apache Spark is developed in Scala. However Python API is more and more popular as Python is becoming the main language of Data Science. Although Python and Scala APIs are very close, there are some differences that can prevent a developer used to one API to smoothly use the other. This article lists those small differences, from the point of view of a Scala Spark developer wanting to use PySpark.

Pyspark’s Libraries

SQL API for pyspark is in the package pyspark.sql. You can have a description of this API in Spark’s Documentation.

Scala Python

Scala	Python
`import org.apache.spark.sql.functions.col import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.IntegerType import org.apache.spark.sql.types._ import org.apache.spark.sql.Window`	`from pyspark.sql.functions import col from pyspark.sql.functions import * from pyspark.sql.types import IntegerType from pyspark.sql.types import * from pyspark.sql import Window`

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types._
import org.apache.spark.sql.Window

from pyspark.sql.functions import col
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType
from pyspark.sql.types import *
from pyspark.sql import Window

Create DataFrame is explicit in Pyspark

In Pyspark, you can’t import implicits from spark context and use them to create dataframes as in Scala.

Scala Python

Scala	Python
`import sparkSession.implicits._ Seq((1, "a"), (2, "b")).toDF("id", "value")`	`spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])`

import sparkSession.implicits._

Seq((1, "a"), (2, "b")).toDF("id", "value")

spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

Moreover, if you want to create a dataframe with only one column, you should explicitly create an array of arrays in Pyspark.

Scala Python

Scala	Python
`import sparkSession.implicits._ Seq(1, 2).toDF()`	`spark.createDataFrame([[1], [2]])`

import sparkSession.implicits._

Seq(1, 2).toDF()

spark.createDataFrame([[1], [2]])

`==` for column equality

As in Python, you can overload operators, equality comparison between columns is done using == instead of === in Scala API.

Scala Python

Scala	Python
`col("column_name") === 3`	`col("column_name") == 3`

col("column_name") === 3

col("column_name") == 3

`&` for logical conjunction and `|` for logical disjunction

In Pyspark, you use & for logical conjunction and | for logical disjunction, instead of && and || in Scala. Moreover, as we use bitwise logical operators that don’t have precedence over comparison, you should use parenthesis to bound your logical expressions.

Scala Python

Scala	Python
`col("column_name") < 4 && col("column_name") > 2`	`(col("column_name") < 4) & (col("column_name") > 2)`

col("column_name") < 4 && col("column_name") > 2

(col("column_name") < 4) & (col("column_name") > 2)

End of line antislash `\` for multiline commands

In Python, you can’t chain method on several lines like in Scala, you need to break the line using antislash \.

Scala Python

Scala	Python
`df.filter(...) .withColumn(...) .show()`	`df.filter(...) \ .withColumn(...) \ .show()`

df.filter(...)
  .withColumn(...)
  .show()

df.filter(...) \
  .withColumn(...) \
  .show()

Named arguments in function

In Python, there is no method overloading. You can’t do polymorphism having several methods sharing the same name but having different number/type of arguments like in Scala. When you call a method with several arguments, you need to use named argument if you want to change trailing arguments.

Scala Python

Scala	Python
`df.show(false)`	`df.show(truncate=False)`

df.show(false)

df.show(truncate=False)

Parenthesis when calling function/object

In Python, you need to add parenthesis when you call methods.

Scala Python

Scala	Python
`df.distinct.show`	`df.distinct().show()`

df.distinct.show

df.distinct().show()

This is also the case for singleton objects. In Scala, you have the object keyword to create a singleton, but not in Python. Thus, you need to create your singleton every time you use it.

Scala Python

Scala	Python
`df.withColumn("cast", col("raw").cast(IntegerType))`	`df.withColumn("cast", col("raw").cast(IntegerType()))`

df.withColumn("cast", col("raw").cast(IntegerType))

df.withColumn("cast", col("raw").cast(IntegerType()))

Boolean values start with uppercase

In Python, boolean values are True and False instead of true and false in Scala.

Scala Python

Scala	Python
`col("column_name") === false`	`col("column_name") == False`

col("column_name") === false

col("column_name") == False

Rename column with `alias` instead of `as`

In Python, as is a reserved keyword. So this method doesn’t exist in Column class, you have to use alias instead when you want to rename a column.

Scala Python

Scala	Python
`col("column_name").as("column_alias")`	`col("column_name").alias("column_alias")`

col("column_name").as("column_alias")

col("column_name").alias("column_alias")

There is no Dataset API in Python

Pyspark does not support Dataset API. If you need to perform transformations such as map, you need to use RDD.

Scala Python

Scala	Python
`case class Item(id: Int, value: String) import sparkSession.implicits._ df.as[Item].map(_.value).toDF`	`class Item: def __init__(self, id, value): self.id = id self.value = value df.rdd.map(lambda x: x.value).toDF()`

case class Item(id: Int, value: String)

import sparkSession.implicits._
df.as[Item].map(_.value).toDF

class Item:
    def __init__(self, id, value):
       self.id = id
       self.value = value

df.rdd.map(lambda x: x.value).toDF()

Use alias for importing functions in Pyspark

Some Spark built-in functions' names are in conflict with python functions, so it is better to use alias when importing Spark built-in functions.

Scala Python

Scala	Python
`import org.apache.spark.sql.functions._ df.agg(max("a_column"))`	`from pyspark.sql import functions as F df.agg(F.max("a_column"))`

import org.apache.spark.sql.functions._

df.agg(max("a_column"))

from pyspark.sql import functions as F

df.agg(F.max("a_column"))

Some methods does not exist in Pyspark’s DataFrame

As of Spark 3.0, several methods of Scala’s DataFrame API are not available in Pyspark’s Dataframes. Here is the list:

collectAsList, reduce, takeAsList, inputFiles, isEmpty, javaRDD, writeTo, as, except, flatMap, groupByKey, joinWith, map, mapPartitions, observe, randomSplitAsList, apply, encoder, queryExecution, sparkSession, sqlContext

Types

You can find all the types in Spark Documentation. The following array is merely a copy of this documentation, with Scala types along Python types.

Spark SQL Type	Scala Type	Python Type
BooleanType	Boolean	bool
ByteType	Byte	int or long
ShortType	Short	int or long
IntegerType	Int	int or long
LongType	Long	long
FloatType	Float	float
DoubleType	Double	float
DecimalType	java.math.BigDecimal	decimal.Decimal
StringType	String	string
BinaryType	Array[Byte]	bytearray
DateType	java.sql.Date	datetime.date
TimestampType	java.sql.Timestamp	datetime.datetime
ArrayType	scala.collection.Seq	list, tuple, or array
MapType	scala.collection.Map	dict
StructType	org.apache.spark.sql.Row	list or tuple

Spark SQL Type

Scala Type

Python Type

BooleanType

Boolean

bool

ByteType

Byte

int or long

ShortType

Short

int or long

IntegerType

Int

int or long

LongType

Long

long

FloatType

Float

float

DoubleType

Double

float

DecimalType

java.math.BigDecimal

decimal.Decimal

StringType

String

string

BinaryType

Array[Byte]

bytearray

DateType

java.sql.Date

datetime.date

TimestampType

java.sql.Timestamp

datetime.datetime

ArrayType

scala.collection.Seq

list, tuple, or array

MapType

scala.collection.Map

dict

StructType

org.apache.spark.sql.Row

list or tuple

Wrap Up

To summarize this article, here is an example which highlights all the point developed before.

Scala Python

Scala	Python
`import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.IntegerType import sparkSession.implicits._ case class Item(id: Long, value: Boolean) Seq((1, true), (2, false)) .toDF("id", "value") .filter(col("id") === 1 \|\| col("value") === false) .as[Item] .map(item => Item(item.id + 1, item.value)) .toDF .distinct .withColumn("id2", col("id").cast(IntegerType)) .select(col("id2").as("intId")) .show(false)`	from pyspark.sql import functions as F from pyspark.sql.types import LongType # this class should be defined in another file class Item: def __init__(self, id, value): self.id = id self.value = value sparkSession \ .createDataFrame([(1, True), (2, False)], ["id", "value"]) \ .filter((F.col("id") == 1) \| (F.col("value") == False)) \ .rdd \ .map(lambda item: Item(item.id + 1, item.value)) \ .toDF() \ .distinct() \ .withColumn("id2", F.col("id").cast(IntegerType())) \ .select(F.col("id2").alias("intId")) \ .show(truncate=False)

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

import sparkSession.implicits._

case class Item(id: Long, value: Boolean)

Seq((1, true), (2, false))
  .toDF("id", "value")
  .filter(col("id") === 1 || col("value") === false)
  .as[Item]
  .map(item => Item(item.id + 1, item.value))
  .toDF
  .distinct
  .withColumn("id2", col("id").cast(IntegerType))
  .select(col("id2").as("intId"))
  .show(false)

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# this class should be defined in another file
class Item:
    def __init__(self, id, value):
       self.id = id
       self.value = value

sparkSession \
  .createDataFrame([(1, True), (2, False)], ["id", "value"]) \
  .filter((F.col("id") == 1) | (F.col("value") == False)) \
  .rdd \
  .map(lambda item: Item(item.id + 1, item.value)) \
  .toDF() \
  .distinct() \
  .withColumn("id2", F.col("id").cast(IntegerType())) \
  .select(F.col("id2").alias("intId")) \
  .show(truncate=False)

Pyspark’s Libraries

Create DataFrame is explicit in Pyspark

== for column equality

& for logical conjunction and | for logical disjunction

End of line antislash \ for multiline commands

Named arguments in function

Parenthesis when calling function/object

Boolean values start with uppercase

Rename column with alias instead of as

There is no Dataset API in Python

Use alias for importing functions in Pyspark

Some methods does not exist in Pyspark’s DataFrame

Types

Wrap Up

`==` for column equality

`&` for logical conjunction and `|` for logical disjunction

End of line antislash `\` for multiline commands

Rename column with `alias` instead of `as`