DDL stands for Data Definition Language, and comes from SQL. It is the language you use when, for instance, you create a table in SQL. So in the following CREATE TABLE SQL query:
CREATE TABLE test (id INTEGER PRIMARY KEY, value VARCHAR(50) NOT NULL);
The DDL representation of the table's fields is:
id INTEGER PRIMARY KEY, value VARCHAR(50) NOT NULL
This language can also be used to represent a Spark Schema.
Convert Spark Schema from/to DDL
To convert a Spark Schema to a DDL string, you can use the toDDL method. So if you have a DataFrame df, you can retrieve its schema as a DDL string like this:
import sparkSession.implicits._
val df = Seq(
(1, "value1"),
(2, "value2")
).toDF("id", "value")
// returns "`id` INT, `value` STRING"
df.schema.toDDL
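Note that toDDL is defined on StructType, which is what df.schema returns. Each individual StructField also has its own toDDL method if you only need a single column; a minimal sketch (the exact quoting of the name may vary across Spark versions):
// returns "`id` INT"
df.schema("id").toDDL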
If you have a Spark DataType that you want to convert to a DDL string, you can use its sql method:
import org.apache.spark.sql.types.StringType
// returns "STRING"
StringType.sql
To transform a DDL string into a Spark Schema, you can use the fromDDL method of the DataType object:
import org.apache.spark.sql.types.DataType
// returns StructType(Seq(StructField("a", StringType), StructField("b", IntegerType)))
DataType.fromDDL("`a` STRING, `b` INT")
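Since fromDDL returns a DataType, you will usually cast the result to StructType when you need an actual schema. Also handy: DataFrameReader.schema accepts a DDL string directly (since Spark 2.3), so you can skip the conversion entirely when reading files. A minimal sketch, assuming the sparkSession from above and a hypothetical data.csv file:
import org.apache.spark.sql.types.{DataType, StructType}

// fromDDL returns a DataType; cast it when a StructType is required
val schema = DataType.fromDDL("`a` STRING, `b` INT").asInstanceOf[StructType]

// DataFrameReader.schema also accepts the DDL string directly
val df = sparkSession.read
  .schema("`a` STRING, `b` INT")
  .csv("data.csv")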
Spark’s DDL structure
To create a DDL string that can be transformed into a Spark Schema, you just have to list your fields and their types, separated by commas. Each field name should be wrapped between two grave accents (`), and the field name and field type are separated by a space. Case is ignored for field types. You can add an optional NOT NULL clause if you don't want your field to be nullable.
Here is an example of a valid DDL string:
`a` string NOT NULL, `b` INT, `c` STRUCT<`c1`: INT, `c2`: DOUBLE>
That translates into the following Spark Schema:
root
 |-- a: string (nullable = false)
 |-- b: integer (nullable = true)
 |-- c: struct (nullable = true)
 |    |-- c1: integer (nullable = true)
 |    |-- c2: double (nullable = true)
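You can verify this yourself by parsing the DDL string and printing the resulting schema tree; a quick sketch:
import org.apache.spark.sql.types.{DataType, StructType}

val ddl = "`a` string NOT NULL, `b` INT, `c` STRUCT<`c1`: INT, `c2`: DOUBLE>"
val schema = DataType.fromDDL(ddl).asInstanceOf[StructType]

// prints the tree shown above
schema.printTreeString()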
Now that we have the overall structure of a DDL string, let's look at the different DDL types available in Spark.
Spark’s DDL types
| Type | Spark Type | DDL Type |
| --- | --- | --- |
| string | StringType | STRING |
| byte | ByteType | TINYINT |
| short | ShortType | SMALLINT |
| integer | IntegerType | INT |
| long | LongType | BIGINT |
| boolean | BooleanType | BOOLEAN |
| float | FloatType | FLOAT |
| double | DoubleType | DOUBLE |
| decimal(5,2) | DecimalType(5,2) | DECIMAL(5,2) |
| date | DateType | DATE |
| timestamp | TimestampType | TIMESTAMP |
| struct | StructType(Seq(StructField("field1", IntegerType), StructField("field2", StringType))) | STRUCT<`field1`: INT, `field2`: STRING> |
| array | ArrayType(LongType) | ARRAY<BIGINT> |
| map | MapType(StringType, LongType) | MAP<STRING, BIGINT> |
| binary | BinaryType | BINARY |
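As a quick way to check these mappings, remember that each Spark type's sql method (seen earlier) returns its DDL form; a few examples:
import org.apache.spark.sql.types._

DecimalType(5, 2).sql              // "DECIMAL(5,2)"
ArrayType(LongType).sql            // "ARRAY<BIGINT>"
MapType(StringType, LongType).sql  // "MAP<STRING, BIGINT>"
BinaryType.sql                     // "BINARY"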
Complete Example
To recap everything, suppose we have the following Spark Schema representing a customer:
root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- number: integer (nullable = true)
 |    |-- road: string (nullable = true)
 |    |-- city: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- postcode: long (nullable = true)
 |-- options: map (nullable = true)
 |    |-- key: string
 |    |-- value: boolean (valueContainsNull = true)
 |-- created_at: timestamp (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: integer (containsNull = true)
The DDL representation of this schema is:
`id` BIGINT NOT NULL,`name` STRING,`address` STRUCT<`number`: INT, `road`: STRING, `city`: STRUCT<`name`: STRING, `postcode`: BIGINT>>,`options` MAP<STRING, BOOLEAN>,`created_at` TIMESTAMP,`phone_numbers` ARRAY<INT>
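To close the loop, you can parse this DDL string back into a Spark Schema and print it; a quick round-trip sketch:
import org.apache.spark.sql.types.{DataType, StructType}

val customerDDL = "`id` BIGINT NOT NULL,`name` STRING,`address` STRUCT<`number`: INT, `road`: STRING, `city`: STRUCT<`name`: STRING, `postcode`: BIGINT>>,`options` MAP<STRING, BOOLEAN>,`created_at` TIMESTAMP,`phone_numbers` ARRAY<INT>"

val customerSchema = DataType.fromDDL(customerDDL).asInstanceOf[StructType]

// prints the customer tree shown above
customerSchema.printTreeString()

// and back to the DDL string we started from
customerSchema.toDDL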