tag: parquet

Reading parquets with different schemas in Spark

on 2020-10-25

Yesterday, I ran into a behavior of Spark’s DataFrameReader that can be misleading when reading Parquet data. If a Parquet data directory contains several files with different schemas, and we neither provide a schema nor set the mergeSchema option, the inferred schema depends on the order of the Parquet files in the data directory. The setup: I am reading data stored in Parquet format.
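To make the behavior concrete, here is a minimal sketch, assuming a local Spark session and a throwaway temporary directory. The object name, the helper names, and the columns `id`, `name`, and `score` are all made up for illustration and are not taken from the original setup. It writes two Parquet batches with different schemas into the same directory, then reads them back with the default settings and with `mergeSchema` enabled.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical demo object; names and columns are assumptions for illustration.
object MergeSchemaSketch {

  // Write two batches with different schemas into the same Parquet directory.
  def writeDemoData(spark: SparkSession, dir: String): Unit = {
    import spark.implicits._
    Seq((1, "a")).toDF("id", "name").write.mode("append").parquet(dir)
    Seq((2, 3.5)).toDF("id", "score").write.mode("append").parquet(dir)
  }

  // Default read: no explicit schema and no mergeSchema option.
  // Spark does not merge the file schemas here, so the result may
  // contain only the columns of one of the two batches.
  def readDefault(spark: SparkSession, dir: String): DataFrame =
    spark.read.parquet(dir)

  // Read with schema merging: the schemas of all files are combined.
  def readMerged(spark: SparkSession, dir: String): DataFrame =
    spark.read.option("mergeSchema", "true").parquet(dir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-schema-sketch")
      .getOrCreate()

    // Assumed scratch location; any empty directory would do.
    val dir = Files.createTempDirectory("parquet-demo").resolve("data").toString

    writeDemoData(spark, dir)

    // May show only a subset of the columns, depending on which file is picked.
    readDefault(spark, dir).printSchema()

    // Shows the union of the columns: id, name, score.
    readMerged(spark, dir).printSchema()

    spark.stop()
  }
}
```

A third option is to pass an explicit schema with `spark.read.schema(...)`, which avoids inference altogether. Schema merging has to read the footers of all files, so it can be noticeably slower on large directories.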

#parquet #spark #scala
