Data lake
The data lake approach (or "lakehouse") stores semi-structured data on top of cloud object storage.
It is composed of a few layers (from lowest to highest level): codec, file format, table format + metastore, and the ingestion/query layer.
File formats and serialization
These formats are popular for shared-data architectures that use object storage as the persistence layer. Data is organized in rows or columns, with a strict schema definition. Files are immutable and support partial reads (headers only, metadata, a single data page, etc.); mutating data requires uploading a new file. Most formats support nested schemas, codecs, compression, and data encryption. Indexes can be added to file metadata for faster processing.
A single file typically weighs between tens of MB and a few GB. Many small files require more merge operations, while larger files are costlier to update.
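The footer-metadata layout that enables these partial reads can be sketched in pure Python. This is a toy format (not Parquet's actual encoding): each column is written as a contiguous block, and a footer at the end of the file records per-column offsets, so a reader fetches the footer first and then seeks directly to the one column it needs.

```python
import io
import json
import struct

def write_columnar(columns: dict) -> bytes:
    """Write each column as a contiguous block, then a JSON footer with offsets."""
    buf = io.BytesIO()
    footer = {}
    for name, values in columns.items():
        start = buf.tell()
        for v in values:
            buf.write(struct.pack("<q", v))  # little-endian int64 per value
        footer[name] = {"offset": start, "rows": len(values)}
    footer_bytes = json.dumps(footer).encode()
    buf.write(footer_bytes)
    # A trailing fixed-size length field lets a reader locate the footer.
    buf.write(struct.pack("<I", len(footer_bytes)))
    return buf.getvalue()

def read_column(data: bytes, name: str) -> list:
    """Partial read: parse only the footer, then decode the requested column."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    footer = json.loads(data[-4 - footer_len:-4])
    start, rows = footer[name]["offset"], footer[name]["rows"]
    return [struct.unpack_from("<q", data, start + i * 8)[0] for i in range(rows)]

blob = write_columnar({"user_id": [1, 2, 3], "ts": [100, 200, 300]})
print(read_column(blob, "ts"))  # only the footer and the "ts" block are decoded
```

On object storage the same pattern maps to ranged GET requests: one for the footer, one per column block, instead of downloading the whole file.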
- Apache Arrow Columnar Format - Columnar format for in-memory Apache Arrow processing.
- Apache Avro - Row-oriented serialization for data streaming purpose.
- Apache ORC - Column-oriented serialization for data storage purpose. Part of Hadoop platform.
- Apache Parquet - Column-oriented serialization for data storage purpose.
- Apache Thrift - Row-oriented serialization for RPC purpose.
- Cap’n Proto - Row-oriented serialization with zero-copy access, as fast as mmap.
- FlatBuffers - Row-oriented serialization with zero-copy access, as fast as mmap.
- Google Protobuf - Row-oriented serialization for RPC purpose.
- Schema Registry - Centralized repository for validating row-oriented events. Part of Kafka and Confluent platform.
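The zero-copy access that Cap'n Proto and FlatBuffers advertise can be illustrated with a toy fixed-layout record (this is not either library's actual wire format): because fields live at known offsets, a reader pulls a single field straight out of the received buffer without deserializing the whole message.

```python
import struct

# Toy fixed layout: | id: uint32 | score: float64 | name: 16 bytes, NUL-padded |
RECORD = struct.Struct("<Id16s")

def encode(id_: int, score: float, name: str) -> bytes:
    return RECORD.pack(id_, score, name.encode().ljust(16, b"\0"))

def read_score(buf: memoryview) -> float:
    # Zero-copy: unpack one field at its fixed offset; nothing else is parsed,
    # and the buffer (possibly an mmap'ed file) is never copied.
    return struct.unpack_from("<d", buf, 4)[0]

msg = memoryview(encode(7, 99.5, "alice"))
print(read_score(msg))  # 99.5
```

The real formats add vtables (FlatBuffers) or pointer segments (Cap'n Proto) so schemas can evolve, but the principle is the same: field access is offset arithmetic, not parsing.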
Open table formats
Open table formats are an abstraction layer on top of Avro/Parquet files, adding support for ACID transactions, CDC, partitioning, mixed streaming/batch processing, schema evolution, and mutation. Schema and statistics are stored in a metastore, while data is persisted locally or in remote/cloud object storage.
Open tables provide a cost-effective data warehouse at petabyte scale.
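The core trick behind these formats — ACID over immutable files via an append-only log of snapshots, rather than in-place edits — can be sketched in plain Python. This is a heavily simplified model, not any format's real metadata layout:

```python
class ToyTable:
    """Append-only snapshot log: every commit produces a new immutable file list."""

    def __init__(self):
        self.snapshots = [[]]  # snapshot 0: empty table

    def commit(self, added_files: list) -> int:
        # A commit never mutates history: it appends a new snapshot
        # extending the previous file list.
        self.snapshots.append(self.snapshots[-1] + list(added_files))
        return len(self.snapshots) - 1  # new snapshot id

    def scan(self, snapshot_id: int = None) -> list:
        # Readers pin a snapshot id, so concurrent commits never affect a
        # running query (snapshot isolation); reading an older id is
        # "time travel" for free.
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]

t = ToyTable()
s1 = t.commit(["part-0001.parquet"])
s2 = t.commit(["part-0002.parquet"])
print(t.scan(s1))  # time travel: only the first file
print(t.scan())    # latest snapshot: both files
```

Real formats store this log as metadata files next to the data (plus delete files, statistics, and a pointer in the metastore to the current snapshot), but commits-as-new-snapshots is what makes mutation safe on immutable object storage.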
- Apache Hive - SQL-based data warehouse and query engine on top of Hadoop, and the origin of the Hive Metastore used by modern table formats.
- Apache Hudi - Open table format with strong CDC and upsert support, designed for incremental data pipelines.
- Apache Iceberg - Open table format for huge analytic datasets, with snapshot isolation, schema evolution, and partition pruning.
- Delta Lake - Open table format bringing ACID transactions and scalable metadata to Apache Spark and beyond.
Comparison:
- (2022) Open Table Formats: Delta vs Iceberg vs Hudi
- (2023) Choosing an open table format for your transactional data lake on AWS
- (2024) Apache Iceberg vs Delta Lake vs Apache Hudi: Choosing the Right Table Format
👆 Warning: pre-2022 articles should be considered out-of-date, as open table formats evolve quickly.
Metastore
- AWS Glue - Serverless data integration service with a managed catalog for AWS data assets.
- Databricks Unity Catalog - Unified governance layer for data and AI assets across the Databricks platform.
- Hive Metastore - Component of Hadoop HiveServer2 that can be used standalone as a schema registry for table metadata.
- Nessie - Git-like versioning catalog for data lakes, enabling branch and merge operations on Iceberg/Delta/Hudi tables.
Object Storage
- Apache HDFS - Hadoop distributed file system, the original large-scale storage layer for the big data ecosystem.
- AWS S3 - Highly durable and available object storage service, the dominant cloud storage backend for data lakes.
- Azure Blob Storage - Microsoft's massively scalable object storage for unstructured data.
- GCP Cloud Storage - Google's unified object storage service for any amount of data.
- MinIO - S3-compatible self-hosted object storage, suitable for on-premise data lake deployments.
Codecs, encoding and compression
- Bit packing - Encoding integers using only the bits required, eliminating wasted high-order zeros in columnar data.
- Brotli - General-purpose lossless compression by Google, offering better ratios than gzip at comparable speed.
- Deflate - Classic lossless compression combining LZ77 and Huffman coding; the basis of gzip and zlib.
- Delta - Stores differences between successive values instead of absolutes, ideal for monotonically increasing columns like timestamps.
- Dictionary + RLE - Replaces repeated values with dictionary codes, then run-length encodes consecutive duplicates; effective for low-cardinality columns.
- Gorilla - Facebook's XOR-based float compression for time-series metrics, achieving 1.37 bytes/value on typical monitoring data.
- LZ4 - Extremely fast lossless compression algorithm prioritizing throughput over ratio, widely used in real-time pipelines.
- Snappy - Google's fast lossless codec optimized for speed over compression ratio, default in many Hadoop/Parquet deployments.
- zstd - Facebook's modern lossless codec delivering high compression ratios at fast speeds; often preferred over gzip in data lakes.
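Some of these encodings are simple enough to show directly. A pure-Python sketch of Dictionary + RLE — conceptually what columnar formats apply to low-cardinality columns, not Parquet's exact bit layout:

```python
def dict_rle_encode(values: list):
    """Map each distinct value to a small code, then run-length encode the codes."""
    dictionary = {}
    codes = []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    runs = []  # list of (code, run_length) pairs
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    # Invert the dictionary for decoding: position in list == code.
    ordered = [v for v, _ in sorted(dictionary.items(), key=lambda kv: kv[1])]
    return ordered, runs

def dict_rle_decode(dictionary: list, runs: list) -> list:
    return [dictionary[c] for c, n in runs for _ in range(n)]

col = ["US", "US", "US", "FR", "FR", "US"]
d, r = dict_rle_encode(col)
print(d, r)  # ['US', 'FR'] [(0, 3), (1, 2), (0, 1)]
assert dict_rle_decode(d, r) == col
```

Six string values collapse to a 2-entry dictionary and 3 runs; on real columns the dictionary codes would additionally be bit-packed, since a 2-entry dictionary needs only 1 bit per code.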