Data lake

Curated list of data lake and lakehouse tools: open table formats (Apache Iceberg, Delta Lake, Apache Hudi), file formats (Parquet, ORC, Avro), metastores, object storage, and codecs.

Data lake

The data lake (or "lakehouse") approach applies a loosely enforced, semi-structured schema on top of cloud object storage.

It is composed of a few layers (from lower to higher level): codec, file format, table format + metastore, and the ingestion/query layer.

File formats and serialization

These formats are popular with query engines that share a common persistence layer on object storage. Data is organized in rows or columns, with a strict schema definition. The files are immutable and support partial reads (headers only, metadata, a single data page, etc.); any mutation requires uploading a new file. Most formats support nested schemas, codecs, compression, and data encryption. Indexes can be embedded in the file metadata for faster processing.

A single file typically weighs between tens of MB and a few GB. Many small files require more merge operations; larger files are costlier to update.

Open table formats

Open table formats are an abstraction layer on top of Avro/Parquet files, adding support for ACID transactions, CDC, partitioning, mixed streaming/batch processing, schema evolution, and mutation. Schema and statistics are stored in a metastore; data is persisted locally or in remote/cloud object storage.

Open tables are a cost-effective data warehouse at petabyte scale.
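The core trick behind ACID on immutable files can be sketched in a few lines: the table is a versioned log of commits, each listing the data files visible in that snapshot. This is a toy illustration of the idea, not any specific format's specification; all names and the layout are invented.

```python
# Toy transaction-log sketch (illustrative only, not Iceberg/Delta/Hudi spec):
# each commit is a JSON file naming the data files of one table snapshot.
import json
import os
import tempfile

root = tempfile.mkdtemp()
log_dir = os.path.join(root, "_log")
os.makedirs(log_dir)

def commit(version: int, files: list) -> None:
    # Writing one new log file is the atomic unit: readers see the whole
    # snapshot or none of it; existing data files are never modified.
    with open(os.path.join(log_dir, f"{version:08d}.json"), "w") as f:
        json.dump({"version": version, "files": files}, f)

def snapshot() -> dict:
    # The highest committed version wins; older log entries remain
    # available, which is what enables time travel and CDC-style diffs.
    latest = sorted(os.listdir(log_dir))[-1]
    with open(os.path.join(log_dir, latest)) as f:
        return json.load(f)

commit(0, ["part-000.parquet"])
commit(1, ["part-000.parquet", "part-001.parquet"])  # append = new snapshot
print(snapshot()["version"], snapshot()["files"])
```

Real formats add manifests, statistics, and a metastore-backed pointer to the latest version, but the append-only log of snapshots is the shared foundation.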

Comparison:

👆 Warning: pre-2022 articles should be considered out-of-date, as open table formats are evolving quickly.

Metastore

Object Storage

Codecs, encoding and compression