OLAP & Data Engineering Reading List — Papers, Architecture, Internals

Readings

Papers

Apache Flink state management - Carbone et al. (2017) on Flink's state backend, incremental checkpointing, and exactly-once fault tolerance.
Apache Parquet format - Formal specification of the Parquet columnar storage format, including encoding, compression, and nested schema representation.
Dremel paper - Google's 2010 paper introducing columnar storage for nested data and interactive ad-hoc queries at petabyte scale.
RDD - Zaharia et al. (2012) introducing Resilient Distributed Datasets, the fault-tolerant in-memory abstraction that became the foundation of Apache Spark.
RocksDB - Facebook's paper on how RocksDB's design priorities evolved to serve large-scale production workloads.
Spanner paper - Corbett et al. (2012) on Google Spanner, the globally distributed ACID database using TrueTime for external consistency.

Architecture

CoW vs MoR - Comparison of Copy-on-Write and Merge-on-Read table strategies in Apache Hudi, with trade-offs for read vs write performance.
CQRS (Command Query Responsibility Segregation) - Martin Fowler's guide on separating read and write models to independently scale and optimize each path.
DAG (Directed Acyclic Graph) - Dependency model used to represent task ordering in data pipeline orchestration.
Event sourcing - Pattern for persisting application state as an immutable log of events rather than mutable records.
Kappa architecture - Streaming-only alternative to the Lambda architecture that eliminates the batch layer.
Lambda architecture - Big data pattern combining a batch layer for accuracy and a speed layer for low-latency results.
Medallion architecture - Bronze/silver/gold data quality layering pattern for incrementally refining data in a lakehouse.
Reactive programming - Event-driven programming model based on asynchronous, composable data streams.
Star schema vs Snowflake schema - Dimensional data modeling patterns for organizing fact and dimension tables in an analytics warehouse.
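
To make the DAG entry above concrete, here is a minimal sketch of how an orchestrator might derive a valid run order from task dependencies. The task names are hypothetical, and the sketch relies on Python's standard `graphlib` module (Python 3.9+) rather than any particular orchestration tool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_orders_users": {"extract_orders", "extract_users"},
    "aggregate_daily": {"join_orders_users"},
    "publish_report": {"aggregate_daily"},
}


def execution_order(dag):
    """Return one valid run order; graphlib raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

Any order that places each task after all of its dependencies is valid, so the two extract tasks may appear in either order.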

Data modeling

Schema evolution - How Delta Lake enforces and evolves schemas without breaking downstream readers.
CDC (Change Data Capture) - Tracking row-level inserts, updates, and deletes for replication, auditing, and event-driven pipelines.
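
As an illustration of the CDC entry above, the following sketch replays a stream of row-level change events against an in-memory replica. The event shape (`op`, `key`, `row`) is an assumption made for this example and does not follow the format of any specific CDC tool.

```python
def apply_change(table, event):
    """Apply one row-level change event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        table[key] = event["row"]  # upsert, so replaying the same event is harmless
    elif op == "delete":
        table.pop(key, None)  # deleting a missing key is a no-op
    else:
        raise ValueError(f"unknown operation: {op}")
    return table


# Hypothetical change log, in commit order.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "Lin", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
    {"op": "delete", "key": 2},
]

replica = {}
for event in change_log:
    apply_change(replica, event)
print(replica)
```

Treating inserts and updates as upserts is a design choice that keeps replay idempotent, which matters when a consumer restarts and re-reads part of the log.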

Index

Partitioning - Dividing data into logical subsets (by date, region, etc.) to prune irrelevant partitions at query time.
Data skipping - ClickHouse's secondary skipping indexes for filtering data granules without full column scans.
Statistics - Column-level statistics in Hive used by the query planner to estimate cardinality and choose optimal join strategies.
High cardinality - Columns with many distinct values and how high cardinality impacts indexing and storage in time-series databases.
HyperLogLog - Probabilistic cardinality estimation algorithm used for COUNT(DISTINCT) at scale; its typical relative error is about 1.04/sqrt(m) for m registers, or roughly 0.8% with 16,384 registers.
Bloom filters - Space-efficient probabilistic data structure for fast set-membership testing in query engines and storage layers.
Minmax - Parquet min/max statistics, stored per column chunk and per page, used with predicate pushdown to skip row groups and pages that cannot match a filter.
Z-ordering - Multi-dimensional data clustering that co-locates related values to speed up multi-column filter queries.
Bitmap index - Compact bitset-based index efficient for low-cardinality columns and multi-predicate AND/OR queries.
Dense index - Index with an entry for every row in the table, enabling direct lookups at the cost of index size.
Sparse index - Index with entries for only a subset of rows, trading lookup precision for smaller index footprint.
Reverse index - Inverted index mapping terms or values back to the rows containing them, foundational to full-text search.
N-gram - PostgreSQL trigram-based similarity index for fast fuzzy text matching and LIKE query acceleration.
TF-IDF - Term Frequency-Inverse Document Frequency scoring algorithm for ranking documents by relevance in full-text search.
LSM Tree (Log-Structured Merge-Tree) - Write-optimized storage structure that buffers writes in memory, flushes them to disk as sorted files, and merges those files in the background; used in RocksDB, Cassandra, and LevelDB.
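
The Bloom filter entry above can be sketched in a few lines. This is a teaching sketch, not how any particular engine implements it: the bit-array size, the number of hash functions, and the use of salted SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Set-membership filter: no false negatives, a tunable rate of false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bloom = BloomFilter()
for n in range(50):
    bloom.add(f"user_{n}")

print(bloom.might_contain("user_7"))
```

A storage layer would typically consult such a filter before reading a file or block, paying the full read cost only when the filter answers "possibly present".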

Vector similarity search

Algorithms and indexes:

ANN (approximate nearest neighbor) - Family of algorithms that trade exact accuracy for speed when finding the closest vectors in high-dimensional space.
kNN (k nearest neighbor) - Exact algorithm returning the k closest vectors by distance; accurate but slow at scale without an index.
Faiss - Facebook AI library for efficient similarity search and clustering of dense vectors.
HNSW - Hierarchical Navigable Small World graph index for approximate nearest neighbor search.
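
For contrast with the approximate methods listed above, here is a minimal sketch of exact kNN by brute force. It compares the query against every stored vector, which is what makes it accurate and also what makes it slow at scale. The vector names and values are made up for the example, and Euclidean distance is an assumed choice of metric.

```python
import heapq
import math


def knn(query, vectors, k):
    """Exact k-nearest-neighbor search: score every vector, keep the k smallest distances."""
    scored = ((math.dist(query, vec), name) for name, vec in vectors.items())
    return heapq.nsmallest(k, scored)


# Hypothetical 2-dimensional embeddings.
vectors = {
    "a": (0.0, 0.0),
    "b": (1.0, 0.0),
    "c": (0.0, 3.0),
    "d": (5.0, 5.0),
}

nearest = knn((0.9, 0.1), vectors, k=2)
print([name for _, name in nearest])
```

ANN indexes such as HNSW avoid the full scan by visiting only a small part of the dataset, at the cost of occasionally missing a true neighbor.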

Dedicated vector databases:

Chroma - Lightweight open-source vector database for AI/RAG applications, optimized for developer simplicity.
LanceDB - Embedded, serverless vector database built on the Lance columnar format (Apache Arrow-based).
Milvus - Distributed open-source vector database designed for billion-scale similarity search.
pgvector - Open-source vector similarity search extension for PostgreSQL.
Qdrant - High-performance vector search engine written in Rust, with rich payload filtering and production-grade reliability.
Weaviate - AI-native vector database with built-in vectorization modules, hybrid search, and GraphQL/REST APIs.

Vectorized query processing

Apache Arrow vectorized execution - Talk on how Arrow's columnar memory layout enables SIMD-accelerated batch processing in query engines.
Apache Arrow SIMD parallel processing - SIMD (Single Instruction, Multiple Data) is CPU data-level parallelism in which one instruction operates on several columnar values at once.
Cockroach vectorized JOIN - CockroachDB's vectorized join implementation and the performance gains from columnar execution over row-at-a-time processing.
Latency comparison numbers - Reference table of hardware latency numbers (L1/L2/L3 cache, RAM, SSD, network) essential for reasoning about database performance.
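
The difference between row-at-a-time and vectorized execution described in the entries above can be sketched as follows. This only illustrates the control flow: pure Python will not emit SIMD instructions, and the batch size, column names, and query are assumptions made for the example.

```python
from array import array

BATCH_SIZE = 1024  # hypothetical batch size


def sum_where_row_at_a_time(rows, threshold):
    """Row-at-a-time model: one loop iteration per row tuple."""
    total = 0.0
    for price, qty in rows:
        if qty > threshold:
            total += price
    return total


def sum_where_vectorized(price_col, qty_col, threshold):
    """Batch-at-a-time model: each operator handles a whole column slice per call."""
    total = 0.0
    for start in range(0, len(qty_col), BATCH_SIZE):
        qty_batch = qty_col[start:start + BATCH_SIZE]
        price_batch = price_col[start:start + BATCH_SIZE]
        # Filter operator: build a selection vector of qualifying positions.
        selection = [i for i, q in enumerate(qty_batch) if q > threshold]
        # Aggregate operator: consume only the selected positions.
        total += sum(price_batch[i] for i in selection)
    return total


# Columnar layout: one contiguous typed array per column.
price_col = array("d", [float(i % 7) for i in range(5000)])
qty_col = array("i", [i % 5 for i in range(5000)])
rows = list(zip(price_col, qty_col))

print(sum_where_row_at_a_time(rows, 2), sum_where_vectorized(price_col, qty_col, 2))
```

In a compiled engine, the tight per-batch loops over contiguous typed arrays are the part that a compiler or hand-written kernel can map onto SIMD instructions.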

Querying

Cost Based Optimization - How Spark 2.2's cost-based optimizer uses column statistics to choose better join strategies and query plans.
Sampling - Statistical sampling techniques used in approximate query processing to return fast estimates over large datasets.
GraphX - Apache Spark's graph processing framework and Pregel-based API for iterative graph algorithms at scale.
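
The sampling entry above can be illustrated with a small sketch of approximate counting: evaluate the predicate on a uniform random sample and scale the result up to the full dataset. The function name, sample fraction, and data are hypothetical, and real engines use more refined estimators with error bounds.

```python
import random


def approx_count(values, predicate, sample_fraction, seed=0):
    """Estimate how many values satisfy predicate from a uniform random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample_size = max(1, int(len(values) * sample_fraction))
    sample = rng.sample(values, sample_size)
    hits = sum(1 for v in sample if predicate(v))
    # Scale the sample count up by the inverse of the sampling rate.
    return hits * len(values) / sample_size


def is_multiple_of_ten(v):
    return v % 10 == 0


values = list(range(100_000))
exact = sum(1 for v in values if is_multiple_of_ten(v))
estimate = approx_count(values, is_multiple_of_ten, sample_fraction=0.1)
print(exact, estimate)
```

The estimate reads only a tenth of the data here; a larger sample fraction narrows the error at the cost of more work.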

Transactions

ACID properties - Atomicity, Consistency, Isolation, Durability — the four guarantees that define correct database transaction behavior.
Serializable transaction - CockroachDB's interactive demo illustrating how serializable isolation prevents anomalies like write skew and phantom reads.
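
Atomicity, the first of the ACID properties listed above, can be demonstrated with Python's standard `sqlite3` module. This is a minimal sketch: the `accounts` table and the `transfer` helper are hypothetical, and the example covers only atomicity, not isolation levels such as serializable.

```python
import sqlite3


def open_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit together or not at all."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts").fetchall())


conn = open_bank()
transfer(conn, "alice", "bob", 30)
try:
    transfer(conn, "alice", "bob", 500)  # fails after the debit, so the debit is undone
except ValueError as exc:
    print("rolled back:", exc)
print(balances(conn))
```

The failed transfer leaves no trace of its partial debit, which is the behavior atomicity promises.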

Consensus

Paxos - Lamport's foundational distributed consensus algorithm for agreeing on a value across unreliable nodes.
Raft - Understandable distributed consensus algorithm designed as a more accessible alternative to Paxos, used in etcd and CockroachDB.
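
Both algorithms above rely on majority quorums. The sketch below shows only that arithmetic, not a full implementation of either protocol; the function names are hypothetical.

```python
def quorum_size(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """Nodes that can be unavailable while a majority remains reachable."""
    return cluster_size - quorum_size(cluster_size)


def wins_election(votes_granted, cluster_size):
    """A Raft candidate becomes leader once a majority of the cluster has voted for it."""
    return votes_granted >= quorum_size(cluster_size)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Because any two majorities of the same cluster overlap in at least one node, two conflicting decisions cannot both gather a quorum. The arithmetic also shows why even-sized clusters add cost without tolerating more failures than the next smaller odd size.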

Challenging platforms

Datadog event store - How Datadog built Husky, a column-store for ingesting and querying billions of tagged events at scale.
Cloudflare logging - How Cloudflare processes 6 million HTTP requests per second using ClickHouse for real-time log analytics.

Blogs to follow

Engineering at Meta - Meta's engineering blog covering data infrastructure, distributed systems, and AI at hyperscale.
Engineering at Criteo - Criteo's engineering blog on ad-tech data pipelines, Spark, and large-scale machine learning.
Engineering at Uber - Uber's engineering blog covering data infrastructure, streaming systems, and distributed databases.
Engineering at Airbnb - Airbnb's engineering blog on data platform, analytics engineering, and ML infrastructure.
Databricks - Engineering blog on lakehouse architecture, Apache Spark, Delta Lake, and MLflow.
Towards Data Science - Community publication covering data science, data engineering, and machine learning.
Antithesis - Blog from the autonomous testing platform covering distributed systems correctness, fault injection, and database reliability.

More

Modern Data Stack - Directory of tools and companies in the modern data stack ecosystem.
The Internals Of... (books.japila.pl) - Free online books covering the internals of Apache Spark, Kafka, Delta Lake, and related tools.
Jepsen analyses - Kyle Kingsbury's safety analyses of distributed databases, queues, and consensus systems.
Designing Data-Intensive Applications reading list - Reading list accompanying Martin Kleppmann's book Designing Data-Intensive Applications, covering distributed systems and data infrastructure.
