Data lake
The data lake approach (or "lakehouse") stores semi-structured data on top of cloud object storage.
It is composed of a few layers (from lowest to highest level): codec, file format, table format + metastore, and the ingestion/query layer.
File formats and serialization
These formats are popular for shared-data architectures that use object storage as the persistence layer. Data is organized in rows or columns, with a strict schema definition. Files are immutable and support partial reads (headers only, metadata, a single data page, etc.); mutating data requires uploading a new file. Most formats support nested schemas, codecs, compression, and data encryption. Indexes can be added to file metadata for faster processing.
A single file typically weighs between tens of MB and a few GB. Many small files require more merge operations, while larger files are costlier to update.
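The footer-metadata layout that enables these partial reads can be sketched in pure Python. This is a toy format (not Parquet's actual encoding): each column is written as a contiguous block, and a footer at the end of the file records per-column offsets, so a reader fetches the footer first and then seeks directly to the one column it needs.

```python
import io
import json
import struct

def write_columnar(columns: dict) -> bytes:
    """Write each column as a contiguous block, then a JSON footer with offsets."""
    buf = io.BytesIO()
    footer = {}
    for name, values in columns.items():
        start = buf.tell()
        for v in values:
            buf.write(struct.pack("<q", v))  # little-endian int64 per value
        footer[name] = {"offset": start, "rows": len(values)}
    footer_bytes = json.dumps(footer).encode()
    buf.write(footer_bytes)
    # A trailing fixed-size length field lets a reader locate the footer.
    buf.write(struct.pack("<I", len(footer_bytes)))
    return buf.getvalue()

def read_column(data: bytes, name: str) -> list:
    """Partial read: parse only the footer, then decode the requested column."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    footer = json.loads(data[-4 - footer_len:-4])
    start, rows = footer[name]["offset"], footer[name]["rows"]
    return [struct.unpack_from("<q", data, start + i * 8)[0] for i in range(rows)]

blob = write_columnar({"user_id": [1, 2, 3], "ts": [100, 200, 300]})
print(read_column(blob, "ts"))  # only the footer and the "ts" block are decoded
```

On object storage the same pattern maps to ranged GET requests: one for the footer, one per column block, instead of downloading the whole file.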
- Apache Arrow Columnar Format - Columnar format for in-memory Apache Arrow processing.
- Apache Avro - Row-oriented serialization for data streaming purpose.
- Apache ORC - Column-oriented serialization for data storage purpose. Part of Hadoop platform.
- Apache Parquet - Column-oriented serialization for data storage purpose.
- Apache Thrift - Row-oriented serialization for RPC purpose.
- Cap’n Proto - Row-oriented serialization with zero-copy access, as fast as mmap.
- FlatBuffers - Row-oriented serialization with zero-copy access, as fast as mmap.
- Google Protobuf - Row-oriented serialization for RPC purpose.
- Schema Registry - Centralized repository for validating row-oriented events. Part of Kafka and Confluent platform.
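The zero-copy access that Cap'n Proto and FlatBuffers advertise can be illustrated with a toy fixed-layout record (this is not either library's actual wire format): because fields live at known offsets, a reader pulls a single field straight out of the received buffer without deserializing the whole message.

```python
import struct

# Toy fixed layout: | id: uint32 | score: float64 | name: 16 bytes, NUL-padded |
RECORD = struct.Struct("<Id16s")

def encode(id_: int, score: float, name: str) -> bytes:
    return RECORD.pack(id_, score, name.encode().ljust(16, b"\0"))

def read_score(buf: memoryview) -> float:
    # Zero-copy: unpack one field at its fixed offset; nothing else is parsed,
    # and the buffer (possibly an mmap'ed file) is never copied.
    return struct.unpack_from("<d", buf, 4)[0]

msg = memoryview(encode(7, 99.5, "alice"))
print(read_score(msg))  # 99.5
```

The real formats add vtables (FlatBuffers) or pointer segments (Cap'n Proto) so schemas can evolve, but the principle is the same: field access is offset arithmetic, not parsing.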
Open table formats
Open table formats are an abstraction layer on top of Avro/Parquet files, adding support for ACID transactions, CDC, partitioning, mixed streaming/batch processing, schema evolution, and mutation. Schema and statistics are stored in a metastore, while data is persisted locally or in remote/cloud object storage.
Open tables provide a cost-effective data warehouse at petabyte scale.
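The core trick behind these formats — ACID over immutable files via an append-only log of snapshots, rather than in-place edits — can be sketched in plain Python. This is a heavily simplified model, not any format's real metadata layout:

```python
class ToyTable:
    """Append-only snapshot log: every commit produces a new immutable file list."""

    def __init__(self):
        self.snapshots = [[]]  # snapshot 0: empty table

    def commit(self, added_files: list) -> int:
        # A commit never mutates history: it appends a new snapshot
        # extending the previous file list.
        self.snapshots.append(self.snapshots[-1] + list(added_files))
        return len(self.snapshots) - 1  # new snapshot id

    def scan(self, snapshot_id: int = None) -> list:
        # Readers pin a snapshot id, so concurrent commits never affect a
        # running query (snapshot isolation); reading an older id is
        # "time travel" for free.
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]

t = ToyTable()
s1 = t.commit(["part-0001.parquet"])
s2 = t.commit(["part-0002.parquet"])
print(t.scan(s1))  # time travel: only the first file
print(t.scan())    # latest snapshot: both files
```

Real formats store this log as metadata files next to the data (plus delete files, statistics, and a pointer in the metastore to the current snapshot), but commits-as-new-snapshots is what makes mutation safe on immutable object storage.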
- Apache Hive - SQL-based data warehouse and query engine on top of Hadoop, and the origin of the Hive Metastore used by modern table formats.
- Apache Hudi - Open table format with strong CDC and upsert support, designed for incremental data pipelines.
- Apache Iceberg - Open table format for huge analytic datasets, with snapshot isolation, schema evolution, and partition pruning.
- Delta Lake - Open table format bringing ACID transactions and scalable metadata to Apache Spark and beyond.
Comparison:
- (2022) Open Table Formats: Delta vs Iceberg vs Hudi
- (2023) Choosing an open table format for your transactional data lake on AWS
- (2024) Apache Iceberg vs Delta Lake vs Apache Hudi: Choosing the Right Table Format
👆 Warning: pre-2022 articles should be considered out-of-date, as open table formats evolve quickly.
Metastore
- AWS Glue - Serverless data integration service with a managed catalog for AWS data assets.
- Databricks Unity Catalog - Unified governance layer for data and AI assets across the Databricks platform.
- Hive Metastore - Component of Hadoop HiveServer2 that can be used standalone as a schema registry for table metadata.
- Nessie - Git-like versioning catalog for data lakes, enabling branch and merge operations on Iceberg/Delta/Hudi tables.
Object Storage
- Apache HDFS - Hadoop distributed file system, the original large-scale storage layer for the big data ecosystem.
- AWS S3 - Highly durable and available object storage service, the dominant cloud storage backend for data lakes.
- Azure Blob Storage - Microsoft's massively scalable object storage for unstructured data.
- GCP Cloud Storage - Google's unified object storage service for any amount of data.
- MinIO - S3-compatible self-hosted object storage, suitable for on-premise data lake deployments.
Codecs, encoding and compression
- Bit packing - Encoding integers using only the bits required, eliminating wasted high-order zeros in columnar data.
- Brotli - General-purpose lossless compression by Google, offering better ratios than gzip at comparable speed.
- Deflate - Classic lossless compression combining LZ77 and Huffman coding; the basis of gzip and zlib.
- Delta - Stores differences between successive values instead of absolutes, ideal for monotonically increasing columns like timestamps.
- Dictionary + RLE - Replaces repeated values with dictionary codes, then run-length encodes consecutive duplicates; effective for low-cardinality columns.
- Gorilla - Facebook's XOR-based float compression for time-series metrics, achieving 1.37 bytes/value on typical monitoring data.
- LZ4 - Extremely fast lossless compression algorithm prioritizing throughput over ratio, widely used in real-time pipelines.
- Snappy - Google's fast lossless codec optimized for speed over compression ratio, default in many Hadoop/Parquet deployments.
- zstd - Facebook's modern lossless codec delivering high compression ratios at fast speeds; often preferred over gzip in data lakes.
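Some of these encodings are simple enough to show directly. A pure-Python sketch of Dictionary + RLE — conceptually what columnar formats apply to low-cardinality columns, not Parquet's exact bit layout:

```python
def dict_rle_encode(values: list):
    """Map each distinct value to a small code, then run-length encode the codes."""
    dictionary = {}
    codes = []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    runs = []  # list of (code, run_length) pairs
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    # Invert the dictionary for decoding: position in list == code.
    ordered = [v for v, _ in sorted(dictionary.items(), key=lambda kv: kv[1])]
    return ordered, runs

def dict_rle_decode(dictionary: list, runs: list) -> list:
    return [dictionary[c] for c, n in runs for _ in range(n)]

col = ["US", "US", "US", "FR", "FR", "US"]
d, r = dict_rle_encode(col)
print(d, r)  # ['US', 'FR'] [(0, 3), (1, 2), (0, 1)]
assert dict_rle_decode(d, r) == col
```

Six string values collapse to a 2-entry dictionary and 3 runs; on real columns the dictionary codes would additionally be bit-packed, since a 2-entry dictionary needs only 1 bit per code.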