Readings
Papers
- Apache Flink state management - Carbone et al. (2017) on Flink's state backend, incremental checkpointing, and exactly-once fault tolerance.
- Apache Parquet format - Formal specification of the Parquet columnar storage format, including encoding, compression, and nested schema representation.
- Dremel paper - Google's 2010 paper introducing columnar storage for nested data and interactive ad-hoc queries at petabyte scale.
- RDD - Zaharia et al. (2012) introducing Resilient Distributed Datasets, the fault-tolerant in-memory abstraction that became the foundation of Apache Spark.
- RocksDB - Facebook's paper on how RocksDB's design priorities evolved to serve large-scale production workloads.
- Spanner paper - Corbett et al. (2012) on Google Spanner, the globally distributed ACID database using TrueTime for external consistency.
Architecture
- CoW vs MoR - Comparison of Copy-on-Write and Merge-on-Read table strategies in Apache Hudi, with trade-offs for read vs write performance.
- CQRS (Command Query Responsibility Segregation) - Martin Fowler's guide on separating read and write models to independently scale and optimize each path.
- DAG - Directed Acyclic Graph — the dependency model used to represent task ordering in data pipeline orchestration.
- Event sourcing - Pattern for persisting application state as an immutable log of events rather than mutable records.
- Kappa architecture - Streaming-only alternative to the Lambda architecture that eliminates the batch layer.
- Lambda architecture - Big data pattern combining a batch layer for accuracy and a speed layer for low-latency results.
- Medallion architecture - Bronze/silver/gold data quality layering pattern for incrementally refining data in a lakehouse.
- Reactive programming - Event-driven programming model based on asynchronous, composable data streams.
- Star schema vs Snowflake schema - Dimensional data modeling patterns for organizing fact and dimension tables in an analytics warehouse.
Data modeling
- Schema evolution - How Delta Lake enforces and evolves schemas without breaking downstream readers.
- CDC - Change Data Capture — tracking row-level insertions, updates, and deletes for replication, audit, and event-driven pipelines.
Index
- Partitioning - Dividing data into logical subsets (by date, region, etc.) to prune irrelevant partitions at query time.
- Data skipping - ClickHouse's secondary skipping indexes for filtering data granules without full column scans.
- Statistics - Column-level statistics in Hive used by the query planner to estimate cardinality and choose optimal join strategies.
- High cardinality - Columns with many distinct values and how high cardinality impacts indexing and storage in time-series databases.
- HyperLogLog - Probabilistic cardinality estimation algorithm used for COUNT(DISTINCT) at scale with sub-1% error.
- Bloom filters - Space-efficient probabilistic data structure for fast set-membership testing in query engines and storage layers.
- Minmax - Parquet page-level min/max statistics used for predicate pushdown to skip irrelevant row groups.
- Z-ordering - Multi-dimensional data clustering that co-locates related values to speed up multi-column filter queries.
- Bitmap index - Compact bitset-based index efficient for low-cardinality columns and multi-predicate AND/OR queries.
- Dense index - Index with an entry for every row in the table, enabling direct lookups at the cost of index size.
- Sparse index - Index with entries for only a subset of rows, trading lookup precision for smaller index footprint.
- Reverse index - Inverted index mapping terms or values back to the rows containing them, foundational to full-text search.
- N-gram - PostgreSQL trigram-based similarity index for fast fuzzy text matching and LIKE query acceleration.
- TF-IDF - Term Frequency-Inverse Document Frequency scoring algorithm for ranking documents by relevance in full-text search.
- LSM Tree - Log-Structured Merge-Tree — write-optimized storage structure that batches writes in memory before merging to disk, used in RocksDB, Cassandra, and LevelDB.
Vector similarity search
Algorithms and indexes:
- ANN (approximate nearest neighbor) - Family of algorithms that trade exact accuracy for speed when finding the closest vectors in high-dimensional space.
- kNN (k nearest neighbor) - Exact algorithm returning the K closest vectors by distance; accurate but slow at scale without an index.
- Faiss - Facebook AI library for efficient similarity search and clustering of dense vectors.
- HNSW - Hierarchical Navigable Small World graph index for approximate nearest neighbor search.
Dedicated vector databases:
- Chroma - Lightweight open-source vector database for AI/RAG applications, optimized for developer simplicity.
- LanceDB - Embedded, serverless vector database built on the Lance columnar format (Apache Arrow-based).
- Milvus - Distributed open-source vector database designed for billion-scale similarity search.
- pgvector - Open-source vector similarity search extension for PostgreSQL.
- Qdrant - High-performance vector search engine written in Rust, with rich payload filtering and production-grade reliability.
- Weaviate - AI-native vector database with built-in vectorization modules, hybrid search, and GraphQL/REST APIs.
Vectorized query processing
- Apache Arrow vectorized execution - Talk on how Arrow's columnar memory layout enables SIMD-accelerated batch processing in query engines.
- Apache Arrow SIMD parallel processing - Single Instruction Multiple Data — CPU instruction-level parallelism that processes multiple columnar values in a single clock cycle.
- Cockroach vectorized JOIN - CockroachDB's vectorized join implementation and the performance gains from columnar execution over row-at-a-time processing.
- Latency comparison numbers - Reference table of hardware latency numbers (L1/L2/L3 cache, RAM, SSD, network) essential for reasoning about database performance.
Querying
- Cost Based Optimization - How Spark 2.2's cost-based optimizer uses column statistics to choose better join strategies and query plans.
- Sampling - Statistical sampling techniques used in approximate query processing to return fast estimates over large datasets.
- GraphX - Apache Spark's graph processing framework and Pregel-based API for iterative graph algorithms at scale.
Transactions
- ACID properties - Atomicity, Consistency, Isolation, Durability — the four guarantees that define correct database transaction behavior.
- Serializable transaction - CockroachDB's interactive demo illustrating how serializable isolation prevents anomalies like write skew and phantom reads.
Consensus
- Paxos - Lamport's foundational distributed consensus algorithm for agreeing on a value across unreliable nodes.
- Raft - Understandable distributed consensus algorithm designed as a more accessible alternative to Paxos, used in etcd and CockroachDB.
Challenging platforms
- Datadog event store - How Datadog built Husky, a column-store for ingesting and querying billions of tagged events at scale.
- Cloudflare logging - How Cloudflare processes 6 million HTTP requests per second using ClickHouse for real-time log analytics.
Blogs to follow
- Engineering at Meta - Meta's engineering blog covering data infrastructure, distributed systems, and AI at hyperscale.
- Engineering at Criteo - Criteo's engineering blog on ad-tech data pipelines, Spark, and large-scale machine learning.
- Engineering at Uber - Uber's engineering blog covering data infrastructure, streaming systems, and distributed databases.
- Engineering at Airbnb - Airbnb's engineering blog on data platform, analytics engineering, and ML infrastructure.
- Databricks - Engineering blog on lakehouse architecture, Apache Spark, Delta Lake, and MLflow.
- Towards Data Science - Community publication covering data science, data engineering, and machine learning.
- Antithesis - Blog from the autonomous testing platform covering distributed systems correctness, fault injection, and database reliability.
More
- Modern Data Stack - Directory of tools and companies in the modern data stack ecosystem.
- The Internals Of... (books.japila.pl) - Free online books covering the internals of Apache Spark, Kafka, Delta Lake, and related tools.
- Jepsen analyses - Kyle Kingsbury's safety analyses of distributed databases, queues, and consensus systems.
- Designing Data-Intensive Applications reading list - Kyle Kingsbury's distributed systems course materials and reading list.