DDL stands for Data Definition Language, and comes from SQL. It is the language you use when, for instance, you create a table in SQL. So in the following CREATE TABLE SQL query:
CREATE TABLE test (id INTEGER PRIMARY KEY, value VARCHAR(50) NOT NULL);
The DDL representation of the table's fields is:
id INTEGER PRIMARY KEY, value VARCHAR(50) NOT NULL
This language can also be used to represent a Spark Schema.
Convert Spark Schema from/to DDL
To convert a Spark Schema to a DDL string, you can use the toDDL method. So if you have a DataFrame df, you can retrieve its schema as a DDL string like this:
import sparkSession.implicits._
val df = Seq(
(1, "value1"),
(2, "value2")
).toDF("id", "value")
// returns "`id` INT, `value` STRING"
df.schema.toDDL
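Note that toDDL is defined on StructType, which is what df.schema returns. Each individual StructField also has its own toDDL method if you only need a single column; a minimal sketch (the exact quoting of the name may vary across Spark versions):
// returns "`id` INT"
df.schema("id").toDDL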
If you have a Spark DataType that you want to convert to a DDL string, you can use its sql method:
import org.apache.spark.sql.types.StringType
// returns "STRING"
StringType.sql
To transform a DDL string into a Spark Schema, you can use the fromDDL method of the DataType object:
import org.apache.spark.sql.types.DataType
// returns StructType(Seq(StructField("a", StringType), StructField("b", IntegerType)))
DataType.fromDDL("`a` STRING, `b` INT")
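Since fromDDL returns a DataType, you will usually cast the result to StructType when you need an actual schema. Also handy: DataFrameReader.schema accepts a DDL string directly (since Spark 2.3), so you can skip the conversion entirely when reading files. A minimal sketch, assuming the sparkSession from above and a hypothetical data.csv file:
import org.apache.spark.sql.types.{DataType, StructType}

// fromDDL returns a DataType; cast it when a StructType is required
val schema = DataType.fromDDL("`a` STRING, `b` INT").asInstanceOf[StructType]

// DataFrameReader.schema also accepts the DDL string directly
val df = sparkSession.read
  .schema("`a` STRING, `b` INT")
  .csv("data.csv")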
Spark’s DDL structure
To create a DDL string that can be transformed into a Spark Schema, you just have to list your fields and their types, separated by commas. Each field name should be wrapped between two grave accents (`), and the field name and field type are separated by a space. Case is ignored for field types. You can add an optional NOT NULL clause if you don't want your field to be nullable.
Here is an example of a valid DDL string:
`a` string NOT NULL, `b` INT, `c` STRUCT<`c1`: INT, `c2`: DOUBLE>
That translates into the following Spark Schema:
root
 |-- a: string (nullable = false)
 |-- b: integer (nullable = true)
 |-- c: struct (nullable = true)
 |    |-- c1: integer (nullable = true)
 |    |-- c2: double (nullable = true)
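You can verify this yourself by parsing the DDL string and printing the resulting schema tree; a quick sketch:
import org.apache.spark.sql.types.{DataType, StructType}

val ddl = "`a` string NOT NULL, `b` INT, `c` STRUCT<`c1`: INT, `c2`: DOUBLE>"
val schema = DataType.fromDDL(ddl).asInstanceOf[StructType]

// prints the tree shown above
schema.printTreeString()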
Now that we have the overall structure of a DDL string, let's look at the different DDL types available in Spark.
Spark’s DDL types
| Type | Spark Type | DDL Type |
| --- | --- | --- |
| string | StringType | STRING |
| byte | ByteType | TINYINT |
| short | ShortType | SMALLINT |
| integer | IntegerType | INT |
| long | LongType | BIGINT |
| boolean | BooleanType | BOOLEAN |
| float | FloatType | FLOAT |
| double | DoubleType | DOUBLE |
| decimal(5,2) | DecimalType(5,2) | DECIMAL(5,2) |
| date | DateType | DATE |
| timestamp | TimestampType | TIMESTAMP |
| struct | StructType(Seq(StructField("field1", IntegerType), StructField("field2", StringType))) | STRUCT<`field1`: INT, `field2`: STRING> |
| array | ArrayType(LongType) | ARRAY<BIGINT> |
| map | MapType(StringType, LongType) | MAP<STRING, BIGINT> |
| binary | BinaryType | BINARY |
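As a quick way to check these mappings, remember that each Spark type's sql method (seen earlier) returns its DDL form; a few examples:
import org.apache.spark.sql.types._

DecimalType(5, 2).sql              // "DECIMAL(5,2)"
ArrayType(LongType).sql            // "ARRAY<BIGINT>"
MapType(StringType, LongType).sql  // "MAP<STRING, BIGINT>"
BinaryType.sql                     // "BINARY"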
Complete Example
To recap everything, suppose we have the following Spark Schema representing a customer:
root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- number: integer (nullable = true)
 |    |-- road: string (nullable = true)
 |    |-- city: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- postcode: long (nullable = true)
 |-- options: map (nullable = true)
 |    |-- key: string
 |    |-- value: boolean (valueContainsNull = true)
 |-- created_at: timestamp (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: integer (containsNull = true)
The DDL representation of this schema is:
`id` BIGINT NOT NULL,`name` STRING,`address` STRUCT<`number`: INT, `road`: STRING, `city`: STRUCT<`name`: STRING, `postcode`: BIGINT>>,`options` MAP<STRING, BOOLEAN>,`created_at` TIMESTAMP,`phone_numbers` ARRAY<INT>
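To close the loop, you can parse this DDL string back into a Spark Schema and print it; a quick round-trip sketch:
import org.apache.spark.sql.types.{DataType, StructType}

val customerDDL = "`id` BIGINT NOT NULL,`name` STRING,`address` STRUCT<`number`: INT, `road`: STRING, `city`: STRUCT<`name`: STRING, `postcode`: BIGINT>>,`options` MAP<STRING, BOOLEAN>,`created_at` TIMESTAMP,`phone_numbers` ARRAY<INT>"

val customerSchema = DataType.fromDDL(customerDDL).asInstanceOf[StructType]

// prints the customer tree shown above
customerSchema.printTreeString()

// and back to the DDL string we started from
customerSchema.toDDL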