Apache Spark is developed in Scala. However, the Python API is becoming more and more popular as Python establishes itself as the main language of data science. Although the Python and Scala APIs are very close, some differences can prevent a developer used to one API from smoothly using the other. This article lists those small differences, from the point of view of a Scala Spark developer wanting to use PySpark.
PySpark’s Libraries
The SQL API for PySpark is in the package `pyspark.sql`. You can find a description of this API in Spark’s documentation.
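For illustration, a minimal sketch of the imports on both sides (the session and application names are arbitrary):

```python
# PySpark: the SQL API lives in the pyspark.sql package
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Scala equivalent:
# import org.apache.spark.sql.SparkSession
# import org.apache.spark.sql.functions._
```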
Creating a DataFrame is explicit in PySpark
In PySpark, you can’t import implicits from the Spark session and use them to create DataFrames as in Scala; you have to call `createDataFrame` explicitly.
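For example, a minimal sketch (reusing the `spark` session created above):

```python
# Scala:
#   import spark.implicits._
#   val df = Seq((1, "a"), (2, "b")).toDF("id", "letter")

# PySpark: DataFrame creation goes explicitly through the session
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
```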
Moreover, if you want to create a DataFrame with only one column, you have to explicitly pass a list of one-element lists (or tuples) in PySpark.
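A minimal sketch:

```python
# Each row must itself be a list (or tuple), even for a single column
df_ids = spark.createDataFrame([[1], [2], [3]], ["id"])

# spark.createDataFrame([1, 2, 3], ["id"]) would raise a TypeError,
# since Spark cannot infer a schema from plain integers
```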
`==` for column equality
Since Python allows operator overloading, equality comparison between columns is done with `==` instead of `===` as in the Scala API.
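For example (reusing the `df` defined above):

```python
# Scala: df.filter($"letter" === "a")
df.filter(F.col("letter") == "a")
```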
`&` for logical conjunction and `|` for logical disjunction
In PySpark, you use `&` for logical conjunction and `|` for logical disjunction, instead of `&&` and `||` in Scala. Moreover, since these are bitwise operators that bind more tightly than comparison operators in Python, you have to wrap your comparisons in parentheses.
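A minimal sketch:

```python
# Scala: df.filter($"id" > 1 && $"letter" === "b")
# PySpark: & binds more tightly than ==, so each comparison
# must be wrapped in its own parentheses
df.filter((F.col("id") > 1) & (F.col("letter") == "b"))
```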
End-of-line backslash `\` for multiline commands
In Python, you can’t chain methods across several lines as freely as in Scala; you need to break the line with a backslash `\`, or wrap the whole expression in parentheses.
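For example:

```python
# Backslashes allow the chain to span several lines
result = df \
    .filter(F.col("id") > 1) \
    .select("letter")

# The parenthesized alternative, with no backslashes needed
result = (df
    .filter(F.col("id") > 1)
    .select("letter"))
```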
Named arguments in functions
In Python, there is no method overloading: you can’t have several methods sharing the same name but taking different numbers or types of arguments, as in Scala. Instead, PySpark methods take optional arguments with default values, so when you call a method with several optional arguments, you need to use named (keyword) arguments to set a trailing argument without spelling out all the ones before it.
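`show` is a good illustration of this point:

```python
# Scala has an overload show(truncate: Boolean), so df.show(false) works.
# Python has a single show(n=20, truncate=True, vertical=False), so the
# trailing argument must be passed by name:
df.show(truncate=False)
```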
Parentheses when calling functions/objects
In Python, you always need to add parentheses when you call a method, even one that takes no arguments.
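For example:

```python
# Scala lets you write df.distinct or df.printSchema without parentheses;
# in Python the parentheses are mandatory
df.distinct().count()
df.printSchema()
```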
This is also the case for singleton objects. In Scala, you have the `object` keyword to create a singleton, but not in Python. Thus, you need to instantiate the class every time you use it, as the sketch below shows with the Spark SQL data types.
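A minimal sketch:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Scala: StructField("id", IntegerType, nullable = false)
#        (IntegerType is a singleton object)
# Python: the type classes must be instantiated, hence IntegerType()
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("letter", StringType(), True),
])
```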
Boolean values start with an uppercase letter
In Python, boolean values are `True` and `False` instead of `true` and `false` in Scala.
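For example:

```python
# Scala: df.withColumn("active", lit(true))
df.withColumn("active", F.lit(True))
```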
Rename a column with `alias` instead of `as`
In Python, `as` is a reserved keyword, so this method doesn’t exist on the Column class; you have to use `alias` instead when you want to rename a column.
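For example:

```python
# Scala: df.select($"letter".as("char"))
df.select(F.col("letter").alias("char"))
```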
There is no Dataset API in Python
PySpark does not support the Dataset API. If you need to perform transformations such as `map`, you have to fall back on the RDD API.
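A minimal sketch (converting back to a DataFrame at the end):

```python
# Scala (Dataset API): ds.map(row => row.getInt(0) * 2)
# PySpark: go through the RDD API, then back to a DataFrame if needed
doubled = df.rdd.map(lambda row: (row["id"] * 2,)).toDF(["doubled_id"])
```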
Use an alias when importing functions in PySpark
Some Spark built-in functions have names that conflict with Python built-ins (such as `sum`, `min`, `max`, and `abs`), so it is better to import the `functions` module under an alias rather than importing the functions directly.
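For example:

```python
# sum, min, max and abs already exist as Python built-ins, so avoid
# `from pyspark.sql.functions import *`
import pyspark.sql.functions as F

df.groupBy("letter").agg(F.sum("id"), F.max("id"))
```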
Some methods do not exist on PySpark’s DataFrame
As of Spark 3.0, several methods of Scala’s DataFrame API are not available on PySpark’s DataFrame. Here is the list:
collectAsList, reduce, takeAsList, inputFiles, isEmpty, javaRDD, writeTo, as, except, flatMap, groupByKey, joinWith, map,
mapPartitions, observe, randomSplitAsList, apply, encoder, queryExecution, sparkSession, sqlContext
Types
You can find all the types in the Spark documentation. The following table is merely a copy of that documentation, with Scala types alongside Python types.
| Spark SQL Type | Scala Type | Python Type |
|---|---|---|
| BooleanType | Boolean | bool |
| ByteType | Byte | int or long |
| ShortType | Short | int or long |
| IntegerType | Int | int or long |
| LongType | Long | long |
| FloatType | Float | float |
| DoubleType | Double | float |
| DecimalType | java.math.BigDecimal | decimal.Decimal |
| StringType | String | string |
| BinaryType | Array[Byte] | bytearray |
| DateType | java.sql.Date | datetime.date |
| TimestampType | java.sql.Timestamp | datetime.datetime |
| ArrayType | scala.collection.Seq | list, tuple, or array |
| MapType | scala.collection.Map | dict |
| StructType | org.apache.spark.sql.Row | list or tuple |
Wrap Up
To summarize this article, here is an example which highlights all the points developed above.
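A minimal sketch combining the points above (explicit DataFrame creation, parenthesized logical expressions, `==`, uppercase booleans, `alias`, multiline chaining, and named arguments):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "b")], ["id", "letter"])

result = (df
    .filter((F.col("id") > 1) & (F.col("letter") == "b"))
    .withColumn("active", F.lit(True))
    .select(F.col("letter").alias("char"), "id", "active"))

result.show(truncate=False)
```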