Create a Spark aggregator to retrieve schema of json string in a column
on 2021-10-05
To transform a dataframe with a column containing a JSON string into a typed dataframe, we have to know the exact schema of that JSON string. This blog post presents a method to infer a global schema from a column containing different JSON strings by using a user-defined aggregate function.
Aggregate to a Map in Spark
on 2021-03-30
A small code snippet to aggregate two columns of a Spark dataframe into a map, grouped by a third column.
Initialize testcontainers postgresql database with flyway
on 2021-03-14
With the Testcontainers library, you can use a Docker container providing services such as a database for your tests. With the Flyway library, you can track the schema changes of your database and ensure that those changes are applied to all its instances. How can you initialize your test database provided by Testcontainers with the schema described in Flyway? In this post, we will see how to initialize a PostgreSQL database in a Docker container with Flyway scripts.
#scala #postgresql #testcontainers #flyway
Pyspark gotchas for Scala Spark developers
on 2021-01-22
Apache Spark is developed in Scala. However, its Python API is increasingly popular as Python becomes the main language of data science. Although the Python and Scala APIs are very close, there are some differences that can prevent a developer used to one API from smoothly using the other. This article lists those small differences from the point of view of a Scala Spark developer wanting to use PySpark.
Spark custom aggregator behavior on ordered window with duplicates
on 2020-12-06
User-defined aggregate functions are a powerful tool in Spark: you can avoid a lot of useless computation by crafting aggregate functions that do exactly what you want. However, their behavior can sometimes be surprising. For instance, be careful when using a custom aggregator over a window ordered by a column that contains duplicate values: the buffer is not flushed at each row but only when the value in the ordering column changes.
Option versus nullable: which type spark deserializes faster
on 2020-11-12
Recently, I was wondering about Spark’s deserialization performance, and especially this question: when you have a nullable column in a dataframe, is it better to deserialize it to an option or to a nullable type? Let’s answer this question in this blog post with a benchmark: I create simple input data, read it with three Spark applications that each select a column, replace its null values with a default value, and write the result to Parquet.
Reading parquets with different schemas in Spark
on 2020-10-25
Yesterday, I ran into a potentially misleading behavior of Spark’s DataFrameReader when reading Parquet data. If a Parquet data directory contains several Parquet files with different schemas, and we neither provide a schema nor use the mergeSchema option, the inferred schema depends on the order of the Parquet files in the data directory. In my setup, I am reading data stored in Parquet format.