Vincent Doba's Blog

Technical Blog

Python script to test Airflow’s S3 connection

on 2021-10-11

Test the S3 connection defined in Airflow with a small Python script that you can run on your Airflow server

#airflow #python

Create a Spark aggregator to retrieve the schema of JSON strings in a column

on 2021-10-05

To transform a dataframe with a column containing JSON strings into a typed dataframe, we have to know the exact schema of those JSON strings. This blog post presents a method to infer a global schema from a column containing different JSON strings by using a user-defined aggregate function.

#spark #scala

Data Definition Language (DDL) for defining Spark Schema

on 2021-10-04

If you want to transform a Spark dataframe schema into a string, two string representations are available: JSON and DDL. DDL stands for Data Definition Language and provides a very concise way to represent a Spark schema. But how do we represent a Spark schema in DDL?

#spark #sql

Remove logs from third-party libraries in tests

on 2021-04-10

Do not display logs from dependencies when running tests during the build of your Java project

#java #log4j

Install Fiona on Windows using pip

on 2021-04-07

Install Fiona on Windows using pip without the gdal-config related hassle

#python #pip #fiona

Aggregate to a Map in Spark

on 2021-03-30

A small code snippet to aggregate two columns of a Spark dataframe to a map grouped by a third column

#spark #scala

Leverage Docker multi-stage builds to create a tiny Docker image

on 2021-03-29

You have to create a Docker image containing an artifact. However, building this artifact requires tools that you don’t need in the final image. How can you get the smallest possible Docker image, without shipping tools that are only used to build the artifact? The solution is Docker multi-stage builds.

#docker

Create a Docker image for GitLab CI

on 2021-03-28

Create a Docker image containing some utilities, push it to Docker Hub and use it directly in GitLab CI

#docker #gitlab-ci

List all CSV files in a directory with Databricks in Python

on 2021-03-17

A small code snippet to recursively list all CSV files in a directory from a Databricks notebook in Python.

#databricks #python
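
As a rough illustration of the recursive CSV listing described in the entry above, here is a minimal sketch. It is not necessarily the approach used in the post itself. It assumes a listing function that behaves like Databricks' `dbutils.fs.ls`, meaning it takes a path and returns entries exposing a `path` attribute and an `isDir()` method. The function name `list_csv_files` and the `ls` parameter are hypothetical, introduced only so the sketch can also run outside a notebook.

```python
from typing import Callable, Iterable, List


def list_csv_files(root: str, ls: Callable[[str], Iterable]) -> List[str]:
    """Recursively collect the paths of all CSV files under `root`.

    `ls` is assumed to behave like `dbutils.fs.ls`: it takes a path and
    returns entries that expose `.path` and `.isDir()`.
    """
    csv_paths: List[str] = []
    for entry in ls(root):
        if entry.isDir():
            # Descend into the sub-directory and keep whatever it contains.
            csv_paths.extend(list_csv_files(entry.path, ls))
        elif entry.path.lower().endswith(".csv"):
            # Case-insensitive match on the file extension.
            csv_paths.append(entry.path)
    return csv_paths
```

In a Databricks notebook, this could be called as `list_csv_files("dbfs:/mnt/data", dbutils.fs.ls)`, where the path is only a placeholder. Passing the listing function as an argument keeps the traversal logic independent of the notebook environment.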

Initialize a Testcontainers PostgreSQL database with Flyway

on 2021-03-14

With the Testcontainers library, you can use a Docker container providing services such as a database for your tests. With the Flyway library, you can track the schema changes of your database and ensure that those changes are applied to all its instances. How can you initialize the test database provided by Testcontainers with the schema described in Flyway? In this post, we will see how to initialize a PostgreSQL database in a Docker container with Flyway scripts.

#scala #postgresql #testcontainers #flyway
