Vincent Doba's Blog

Technical Blog

Remove logs from third-party libraries in tests

on 2021-04-10

Hide logs from dependencies when tests run during the build of your Java project

#java #log4j

Install Fiona on Windows using pip

on 2021-04-07

Install Fiona on Windows using pip, without the gdal-config hassle

#python #pip #fiona

Aggregate to a Map in Spark

on 2021-03-30

A small code snippet to aggregate two columns of a Spark DataFrame into a map, grouped by a third column

#spark #scala
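
To make the target shape concrete, here is a plain-Python sketch of what such an aggregation computes. It is not the Spark snippet from the post: the column names `group`, `key` and `value` and the helper `aggregate_to_map` are hypothetical. In Spark itself, one common way to get an equivalent result is to combine `collect_list`, `struct` and `map_from_entries`, but that is an assumption about the approach, not a quote from the article.

```python
from collections import defaultdict


def aggregate_to_map(rows, group_col, key_col, value_col):
    """Group rows by group_col and build one {key: value} dict per group."""
    result = defaultdict(dict)
    for row in rows:
        # A later row with the same key overwrites the earlier value.
        result[row[group_col]][row[key_col]] = row[value_col]
    return dict(result)


# Hypothetical sample data standing in for DataFrame rows.
rows = [
    {"group": "a", "key": "x", "value": 1},
    {"group": "a", "key": "y", "value": 2},
    {"group": "b", "key": "x", "value": 3},
]
maps = aggregate_to_map(rows, "group", "key", "value")
print(maps)  # {'a': {'x': 1, 'y': 2}, 'b': {'x': 3}}
```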

Leverage Docker multi-stage builds to create a tiny Docker image

on 2021-03-29

You have to create a Docker image containing an artifact. However, building this artifact requires tools that you don’t need in the final image. How can you get the smallest possible Docker image, without shipping tools that are only used to build the artifact? The solution is Docker multi-stage builds.


Create a Docker image for GitLab CI

on 2021-03-28

Create a Docker image containing some utilities, push it to Docker Hub and use it directly in GitLab CI

#docker #gitlab-ci

List all CSV files in a directory with Databricks in Python

on 2021-03-17

A small code snippet to recursively list all CSV files in a directory from a Databricks notebook in Python.

#databricks #python
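
As a rough idea of what such a snippet can look like, here is a minimal sketch; it is not the code from the post. The helper name `list_csv_files` is hypothetical, and the listing function is passed in as a parameter so the sketch also runs outside Databricks. It assumes each entry exposes `.path` and `.isDir()`, which is the shape of the `FileInfo` objects returned by `dbutils.fs.ls`.

```python
def list_csv_files(path, ls):
    """Recursively collect the paths of .csv files under path.

    ls is any callable that takes a path and returns entries exposing
    .path and .isDir() (assumed to match what dbutils.fs.ls returns).
    """
    csv_files = []
    for entry in ls(path):
        if entry.isDir():
            # Descend into sub-directories.
            csv_files.extend(list_csv_files(entry.path, ls))
        elif entry.path.lower().endswith(".csv"):
            csv_files.append(entry.path)
    return csv_files
```

In a Databricks notebook this would be called as `list_csv_files("/mnt/data", dbutils.fs.ls)`, where `/mnt/data` is a placeholder path.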

Initialize testcontainers postgresql database with flyway

on 2021-03-14

With the Testcontainers library, you can use a Docker container to provide services such as a database for your tests. With the Flyway library, you can track the schema changes of your database and ensure that those changes are applied to all its instances. How can you initialize the test database provided by Testcontainers with the schema described in Flyway? In this post, we will see how to initialize a PostgreSQL database in a Docker container with Flyway scripts.

#scala #postgresql #testcontainers #flyway


Pyspark setup for IntelliJ IDEA

on 2021-01-24

A simple configuration of a new Python IntelliJ IDEA project with working PySpark. I was inspired by the "Pyspark on IntelliJ" blog post by Gaurav M Shah; I just removed all the parts about deep learning libraries. I assume that you have a working IntelliJ IDEA IDE with the Python plugin installed, and Python 3 installed on your machine. We will create a Python project in IntelliJ IDEA, change its Python SDK to a virtualenv-based Python SDK, add the PySpark dependency to this virtualenv, install PySpark in it, and finally test it with a small PySpark hello world.

#pyspark #spark #python


Pyspark gotchas for Scala Spark developers

on 2021-01-22

Apache Spark is developed in Scala. However, the Python API is increasingly popular as Python becomes the main language of data science. Although the Python and Scala APIs are very close, there are some differences that can prevent a developer used to one API from smoothly using the other. This article lists those small differences, from the point of view of a Scala Spark developer wanting to use PySpark.

#pyspark #spark #scala #python


Spark custom aggregator behavior on ordered window with duplicates

on 2020-12-06

User-defined aggregate functions are a powerful tool in Spark: you can avoid a lot of useless computation by crafting aggregate functions that do exactly what you want. However, their behavior can sometimes be surprising. For instance, be careful when using a custom aggregator over a window ordered by a column that contains duplicate values: the buffer is not flushed at each row, but only when the value in the ordering column changes.

#spark #scala

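
The effect is easiest to see with a running sum. The following is a plain-Python simulation, not Spark code, and the function names are hypothetical. It assumes the cause is Spark's default frame for an ordered window, `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, in which all rows sharing the current ordering value are peers and receive the same result. A `ROWS` frame, simulated by the second function, yields one result per row instead.

```python
def running_sum_range(rows):
    """Running sum where rows with equal ordering keys share one result."""
    rows = sorted(rows, key=lambda r: r[0])
    out = []
    for key, _ in rows:
        # Frame = every row whose ordering key is <= the current key,
        # so all duplicates of the current key are included.
        out.append(sum(v for k, v in rows if k <= key))
    return out


def running_sum_rows(rows):
    """Running sum evaluated row by row, ignoring duplicate keys."""
    rows = sorted(rows, key=lambda r: r[0])
    out, total = [], 0
    for _, value in rows:
        total += value
        out.append(total)
    return out


# (ordering key, value) pairs; key 2 appears twice.
rows = [(1, 10), (2, 20), (2, 30), (3, 40)]
print(running_sum_range(rows))  # [10, 60, 60, 100]
print(running_sum_rows(rows))   # [10, 30, 60, 100]
```

With the range-style frame, both rows with key 2 see the sum 60, which matches the "only when the ordering value changes" behavior described above.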