A curated list of OLAP databases, data engineering tools, columnar databases, data lake and lakehouse frameworks — covering 100+ tools across 20+ categories, for data engineers.
OLAP (Online Analytical Processing) refers to databases and query engines optimized for complex, read-heavy analytical queries over large datasets. Unlike OLTP systems, OLAP databases use columnar storage, vectorized execution, and distributed processing to aggregate and analyze billions of rows in seconds.
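The row-vs-column distinction above can be sketched in a few lines of plain Python. This is an illustrative toy, not how any specific engine lays out data on disk:

```python
# Row-oriented layout: each record stored together, as an OLTP engine would.
rows = [
    {"region": "EU", "revenue": 120.0},
    {"region": "US", "revenue": 340.0},
    {"region": "EU", "revenue": 80.0},
]

# Column-oriented layout: each column stored contiguously, as an OLAP engine would.
columns = {
    "region": ["EU", "US", "EU"],
    "revenue": [120.0, 340.0, 80.0],
}

# SELECT sum(revenue): the columnar layout scans a single contiguous array,
# while the row layout must walk every record and pull out one field each time.
total_row = sum(r["revenue"] for r in rows)
total_col = sum(columns["revenue"])
assert total_row == total_col == 540.0
```

At billions of rows, scanning one contiguous column instead of every full record is what makes columnar engines fast for aggregations.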
Contents
- OLAP Databases
- Storage engines
- Data lake
- Brokers and distributed messaging
- Ingestion and querying
- Scheduler
- Durable execution
- ETL, ELT and reverse ETL
- BI & Visualization
- Datasets
- Benchmark
- Readings
- FAQ
- People to follow
- Events
- Communities
- 🤝 Contributing
- 👤 Contributors
- 💫 Show your support
OLAP Databases
Real-time analytics
The following columnar databases use a shared-nothing architecture and provide sub-second response times. DDL, DML, and DCL operations are performed via SQL. These databases also support tiering to long-term cold storage.
- Apache Doris - MPP analytical database with MySQL-compatible interface, optimized for high-concurrency queries and real-time data ingestion.
- Apache Druid - Real-time OLAP database optimized for streaming ingestion, time-series analytics, and sub-second queries on high-cardinality data.
- Apache HBase - Distributed, wide-column NoSQL database on top of HDFS, modeled after Google Bigtable.
- Apache Pinot - Distributed OLAP datastore for user-facing real-time analytics, designed for low-latency queries at high concurrency.
- ClickHouse - Column-oriented DBMS for online analytical processing, capable of processing billions of rows per second.
- StarRocks - MPP OLAP database with vectorized execution engine, optimized for real-time analytics and high-concurrency workloads.
Search engines
Search engines complement OLAP systems for full-text search and log analytics use cases, where keyword relevance and inverted indexes matter more than aggregate query performance.
- Elasticsearch - Search and analytics engine based on Apache Lucene.
- Meilisearch - Open-source search engine that aims to be a ready-to-go, out-of-the-box solution.
- OpenSearch - Apache 2.0 fork of Elasticsearch.
- Quickwit - Search engine on top of object storage, using shared-everything architecture.
- Typesense - Open-source, typo-tolerant search engine optimized for instant search-as-you-type experiences and developer productivity.
Hybrid OLAP/OLTP NewSQL (aka HTAP)
HTAP (Hybrid Transactional-Analytical Processing) databases handle both transactional writes and analytical reads in a single engine, eliminating the need to maintain a separate data warehouse for reporting.
- Citus - PostgreSQL extension that distributes tables across a cluster for horizontal scaling.
- CockroachDB - Distributed SQL database with strong consistency, horizontal scaling, and PostgreSQL compatibility for HTAP workloads.
- TiDB - MySQL compatible SQL database that supports hybrid transactional and analytical processing workloads.
- YugabyteDB - Distributed SQL database compatible with PostgreSQL and Cassandra APIs, designed for global, cloud-native HTAP applications.
Timeseries
Time-series databases are optimized for append-heavy workloads where data is tagged, timestamped, and queried by time range — distinct from general OLAP because they prioritize ingestion throughput, automatic retention, and time-aligned aggregations.
- Grafana Mimir - Prometheus-compatible TSDB on top of object storage, horizontally scalable.
- InfluxDB - Purpose-built time series database optimized for high-write-throughput metrics, events, and IoT data with a SQL-like query language.
- Prometheus - Pull-based metrics collection and time series database, de facto standard for cloud-native monitoring.
- QuestDB - High-performance time series database written in Java and C++, with SQL support and ingestion rates exceeding millions of rows per second.
- TimeScaleDB - PostgreSQL-compatible TSDB with automatic partitioning and time-series-specific SQL extensions.
- VictoriaMetrics - Fast, cost-effective Prometheus-compatible TSDB with low memory and storage footprint.
Managed cloud services
Fully managed cloud data warehouses trade self-hosted operational overhead for elastic scaling and pay-as-you-go pricing. All handle petabyte-scale analytics; they differ in cost model, latency profile, and ecosystem integrations.
- AWS Redshift - Fully managed petabyte-scale data warehouse on AWS.
- Azure Synapse Analytics - Unified analytics service combining data integration, warehousing, and big data on Azure.
- Databricks - Lakehouse platform combining data warehousing and ML, built on Delta Lake and Apache Spark.
- Firebolt - Cloud-native OLAP warehouse engineered for sub-second query performance at scale.
- Google BigQuery - Serverless, pay-as-you-go data warehouse with built-in ML and BI capabilities.
- Snowflake - Cloud data platform with a decoupled storage and compute architecture, supporting multi-cloud deployments.
- Tinybird - Real-time analytics API platform built on ClickHouse.
Storage engines
Storage engines are the foundational frameworks on top of which higher-level databases and data systems are built. They handle durability, transactions, and low-level data organization.
- FoundationDB - Distributed ordered key-value store with full ACID transactions, designed as a reliable foundation layer for building higher-level databases and services.
- LevelDB - Google's embeddable key-value store using a log-structured merge-tree (LSM-tree); the inspiration for RocksDB and widely used in Blockchain and embedded systems.
- RocksDB - Embeddable persistent key-value store by Meta, optimized for fast storage and used as the storage engine inside many distributed databases (TiKV, CockroachDB, Kafka).
Data lake
The data lake approach (or "lakehouse") stores semi-structured data on top of cloud object storage.
It is composed of a few layers (from lowest to highest level): codec, file format, table format + metastore, and the ingestion/query layer.
File formats and serialization
These formats are popular for shared-everything databases, using object storage as a persistence layer. The data is organized in rows or columns, with a strict schema definition. These files are immutable and offer partial reads (only headers, metadata, data pages, etc.). Mutation requires a new upload. Most formats support nested schemas, codecs, compression, and data encryption. Indexes can be added to file metadata for faster processing.
A single file typically weighs between tens of MB and a few GB. Many small files require more merge operations; very large files are costly to update.
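The "partial reads" property comes from putting metadata in a fixed-position trailer. The sketch below mimics the Parquet trailer layout (a 4-byte little-endian footer length followed by the `PAR1` magic) with fake in-memory bytes, so a reader can locate the footer with a single ranged read of the file's tail (e.g. an S3 range GET). The footer contents here are a placeholder, not real Thrift metadata:

```python
import struct

fake_footer = b'{"schema": "..."}'          # stand-in for Thrift-encoded metadata
fake_file = (
    b"PAR1"                                  # leading magic
    + b"\x00" * 64                           # stand-in for column chunk data
    + fake_footer
    + struct.pack("<I", len(fake_footer))    # footer length, little-endian
    + b"PAR1"                                # trailing magic
)

# A reader only needs the last 8 bytes to locate and slice out the footer.
assert fake_file[-4:] == b"PAR1"
(footer_len,) = struct.unpack("<I", fake_file[-8:-4])
footer = fake_file[-8 - footer_len : -8]
assert footer == fake_footer
```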
- Apache Arrow Columnar Format - Columnar format for in-memory Apache Arrow processing.
- Apache Avro - Row-oriented serialization for data streaming purpose.
- Apache ORC - Column-oriented serialization for data storage purpose. Part of Hadoop platform.
- Apache Parquet - Column-oriented serialization for data storage purpose.
- Apache Thrift - Row-oriented serialization for RPC purpose.
- Cap’n Proto - Row-oriented serialization with zero-copy access, as fast as mmap.
- Flatbuffer - Row-oriented serialization with zero-copy access, as fast as mmap.
- Google Protobuf - Row-oriented serialization for RPC purpose.
- Schema Registry - Centralized repository for validating row-oriented events. Part of Kafka and Confluent platform.
Open table formats
Open table formats are an abstraction layer on top of Avro/Parquet files, with support for ACID transactions, CDC, partitioning, mixed streaming/batch processing, schema evolution, and mutation. Schema and statistics are stored in a metastore; data is persisted locally or in remote/cloud object storage.
Open tables are a cost-effective data warehouse at petabyte scale.
- Apache Hive - SQL-based data warehouse and query engine on top of Hadoop, and the origin of the Hive Metastore used by modern table formats.
- Apache Hudi - Open table format with strong CDC and upsert support, designed for incremental data pipelines.
- Apache Iceberg - Open table format for huge analytic datasets, with snapshot isolation, schema evolution, and partition pruning.
- DeltaLake - Open table format bringing ACID transactions and scalable metadata to Apache Spark and beyond.
Comparison:
- (2022) Open Table Formats: Delta vs Iceberg vs Hudi
- (2023) Choosing an open table format for your transactional data lake on AWS
- (2024) Apache Iceberg vs Delta Lake vs Apache Hudi: Choosing the Right Table Format
👆 Warning: pre-2022 articles should be considered out-of-date, as open table formats are evolving quickly.
Metastore
- AWS Glue - Serverless data integration service with a managed catalog for AWS data assets.
- Databricks unity catalog - Unified governance layer for data and AI assets across the Databricks platform.
- Hive Metastore - Component of Hadoop HiveServer2 that can be used standalone as a schema registry for table metadata.
- Nessie - Git-like versioning catalog for data lakes, enabling branch and merge operations on Iceberg/Delta/Hudi tables.
Object Storage
- Apache HDFS - Hadoop distributed file system, the original large-scale storage layer for the big data ecosystem.
- AWS S3 - Highly durable and available object storage service, the dominant cloud storage backend for data lakes.
- Azure Blob Storage - Microsoft's massively scalable object storage for unstructured data.
- GCP Cloud Storage - Google's unified object storage service for any amount of data.
- Minio - S3-compatible self-hosted object storage, suitable for on-premise data lake deployments.
Codecs, encoding and compression
- Bit packing - Encoding integers using only the bits required, eliminating wasted high-order zeros in columnar data.
- Brotli - General-purpose lossless compression by Google, offering better ratios than gzip at comparable speed.
- Deflate - Classic lossless compression combining LZ77 and Huffman coding; the basis of gzip and zlib.
- Delta - Stores differences between successive values instead of absolutes, ideal for monotonically increasing columns like timestamps.
- Dictionary + RLE - Replaces repeated values with dictionary codes, then run-length encodes consecutive duplicates; effective for low-cardinality columns.
- Gorilla - Facebook's XOR-based float compression for time-series metrics, achieving 1.37 bytes/value on typical monitoring data.
- LZ4 - Extremely fast lossless compression algorithm prioritizing throughput over ratio, widely used in real-time pipelines.
- Snappy - Google's fast lossless codec optimized for speed over compression ratio, default in many Hadoop/Parquet deployments.
- zstd - Facebook's modern lossless codec delivering high compression ratios at fast speeds; often preferred over gzip in data lakes.
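Two of the encodings above, delta and run-length encoding, can be sketched in a few lines. These toys show the idea only, not the exact bit layouts used by Parquet or ORC:

```python
def delta_encode(values):
    """Store the first value, then differences between neighbors."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def rle_encode(values):
    """Collapse runs of identical values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# Monotonic timestamps compress to tiny deltas...
timestamps = [1700000000, 1700000001, 1700000002, 1700000003]
assert delta_encode(timestamps) == [1700000000, 1, 1, 1]
assert delta_decode(delta_encode(timestamps)) == timestamps

# ...and low-cardinality columns collapse into short run lists.
statuses = ["ok", "ok", "ok", "error", "ok"]
assert rle_encode(statuses) == [["ok", 3], ["error", 1], ["ok", 1]]
```

In real formats the small deltas would then be bit-packed, stacking the encodings listed above.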
Brokers and distributed messaging
Message brokers sit between producers and consumers in the data stack, providing durable, ordered event streams that decouple ingestion from processing and enable replay, fan-out, and exactly-once delivery semantics.
- Apache Kafka - Distributed event streaming platform, the de facto standard for high-throughput data pipelines and event-driven architectures.
- Apache Pulsar - Distributed messaging and streaming platform with multi-tenancy, geo-replication, and a decoupled storage layer.
- NATS / JetStream - Lightweight cloud-native messaging system; JetStream adds persistence, replay, and streaming semantics.
- RabbitMQ Streams - Persistent, append-only log streams for RabbitMQ, enabling high-throughput message replay and fan-out.
- Redpanda - Kafka-compatible streaming data platform written in C++, with no ZooKeeper dependency and lower latency.
Ingestion and querying
Stream processing
Real-time data processing (also called event streaming) handles data as it is generated — enabling low-latency pipelines, continuous aggregations, and immediate downstream reactions to events.
- Akka Streams - Reactive stream processing library for JVM, built on the actor model.
- Apache Beam - Unified SDK for cross-language stream and batch processing. Available in Go, Python, Java, Scala and TypeScript.
- Apache Flink - Stateful stream processing with exactly-once semantics, supporting event time and out-of-order data.
- Apache Kafka Streams - Lightweight stream processing library embedded in the Kafka client, no separate cluster required.
- Apache Spark Streaming - Micro-batch stream processing on top of Spark, integrating with the broader Spark ecosystem.
- Redpanda Connect (formerly Benthos) - Declarative stream processing toolkit in Go, with a wide connector library.
- Materialize - Operational data warehouse that incrementally maintains SQL views over streaming data, always-fresh without recomputation.
- RisingWave - Distributed SQL streaming database (PostgreSQL-compatible) with sub-100ms end-to-end freshness and native Iceberg integration.
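The "continuous aggregations" the engines above maintain can be illustrated with a tumbling-window count in plain Python. This is a minimal sketch; real stream processors also handle out-of-order events, watermarks, and persistent state backends:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """events: iterable of (epoch_seconds, key) pairs.
    Returns {(window_start, key): count} for fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(100, "click"), (101, "click"), (161, "click"), (162, "view")]
result = tumbling_window_counts(events, window_seconds=60)
assert result == {(60, "click"): 2, (120, "click"): 1, (120, "view"): 1}
```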
Batch processing
Batch processing periodically processes a large amount of data in a single run.
- Apache Spark - Distributed batch processing engine with in-memory computation, supporting SQL, ML, and graph workloads.
- MapReduce - Programming model for processing large datasets in parallel across a cluster; the foundation of the Hadoop ecosystem.
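The MapReduce model reduces to three phases: map emits key/value pairs, shuffle groups them by key, and reduce aggregates each group. A single-process word-count sketch (a real framework runs the same phases across many machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit (word, 1) for every word in the input record.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle stage would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values independently (hence trivially parallel).
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
assert counts == {"the": 3, "quick": 1, "fox": 2, "lazy": 1, "dog": 1}
```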
In-memory processing
Non-real-time SQL queries against a large dataset can be processed locally, though the data may not fit in memory and jobs can run for a very long time.
- Apache Arrow - Low-level in-memory columnar data format with zero-copy access across languages via gRPC/IPC interfaces.
- Apache Arrow DataFusion - High-level SQL and DataFrame query engine built on Apache Arrow, written in Rust.
- chDB - Embeddable in-process OLAP engine powered by ClickHouse, callable from Python without a server.
- clickhouse-local - Lightweight CLI version of ClickHouse for running SQL queries against CSV, JSON, Parquet and other files.
- delta-rs - Standalone DeltaLake driver for Python and Rust. Does not depend on Spark.
- Delta Standalone - Standalone DeltaLake driver for Java and Scala. Does not depend on Spark.
- DuckDB - In-process SQL OLAP query engine for Parquet, CSV, and JSON files, with zero-copy Apache Arrow interoperability.
- Pandas - Python DataFrame library for data analysis and manipulation, the standard for data science workflows.
- Polars - High-performance DataFrame library written in Rust with a lazy query optimizer, significantly faster than Pandas.
Distributed SQL processing
These SQL engines execute distributed queries over very large datasets across a cluster. Many support ANSI SQL or ANSI-SQL-like interfaces. Some can also act as federated query engines, querying across heterogeneous data sources.
- Apache Spark SQL - Distributed SQL query engine that sits on top of Spark.
- Dremio - SQL lakehouse platform providing a semantic layer and query acceleration on top of data lakes.
- ksql (ksqlDB) - Streaming SQL engine for querying and transforming Kafka topics.
- PrestoDB - Distributed SQL query engine.
- Trino - Distributed SQL query engine. Fork of PrestoDB.
Scheduler
Orchestrators define and monitor complex multi-step DAG workflows with dependency management, retries, and observability. Cron-style schedulers simply trigger jobs at fixed time intervals. The tools below are full orchestrators.
- Apache Airflow - Platform for programmatically authoring, scheduling, and monitoring data pipelines as DAGs.
- Dagster - Data orchestration platform with an asset-centric approach, lineage tracking, and built-in observability.
Durable execution
Durable execution frameworks guarantee that workflows survive process crashes, network failures, and infrastructure restarts by persisting execution state automatically.
- Temporal - Durable workflow execution platform for building fault-tolerant pipelines and long-running data processes.
ETL, ELT and reverse ETL
ETL stands for Extract, Transform, Load (these tools are also called data pipeline or data integration tools). ELT performs data transformations directly within the data warehouse. Reverse ETL copies data from your data warehouse back into external tools or SaaS applications.
- Airbyte - Open-source ELT platform with 300+ pre-built connectors for syncing data to your warehouse.
- Census - Reverse ETL platform for syncing data warehouse data to CRMs, ad tools, and other SaaS.
- dbt - SQL-based transformation framework that runs inside your warehouse; the standard tool for the T in ELT.
- Debezium - Open-source CDC (Change Data Capture) platform that streams row-level changes from databases like PostgreSQL, MySQL, and MongoDB into Kafka and downstream systems.
- RudderStack - Customer Data Platform providing a pipeline between a tracking plan, event transformation, and destination tools.
BI & Visualization
Business intelligence and visualization tools sit on top of OLAP databases, enabling analysts to explore data, build dashboards, and share insights without writing SQL.
- Apache Superset - Open-source business intelligence platform with a rich SQL editor, drag-and-drop chart builder, and support for 40+ data sources.
- Grafana - Open-source observability and analytics platform for visualizing metrics, logs, and traces; widely used with time-series and OLAP backends.
- Metabase - Open-source BI tool focused on ease of use, letting non-technical users explore data and build dashboards without SQL.
- Redash - Open-source query editor and dashboarding tool with broad database connector support.
Datasets
Large-scale public datasets commonly used for benchmarking OLAP databases, query engines, and data lake tools.
- awesome-public-datasets - Curated list of high-quality public datasets organized by domain.
- CommonCrawl - Petabyte-scale web crawl dataset updated monthly; used for NLP, analytics, and link graph research.
- Criteo - 1TB click log dataset from Criteo, a standard benchmark for ad-tech and high-cardinality analytics workloads.
- Entso-e - European electricity generation and consumption statistics, useful for time-series and energy analytics benchmarks.
- GitHub Archives - Timestamped record of all public GitHub events; a popular dataset for querying with ClickHouse, BigQuery, and DuckDB.
- Kaggle - Community-sourced datasets and competitions covering a wide range of domains.
- NYCTaxi - NYC taxi trip records dating back to 2009; a classic columnar query benchmark with billions of rows.
Benchmark
Benchmarks help select the right database for a workload. Always run benchmarks on your own data and query patterns — published numbers reflect vendor-tuned configurations.
- ClickBench - De facto OLAP benchmark maintained by ClickHouse; covers 43 analytical queries on a 100GB web analytics dataset and includes results for 50+ engines.
- Jepsen - Distributed databases, queues and consensus protocols testing.
- TPC-DS - Decision support benchmark modeling a retail data warehouse with 99 complex SQL queries across multiple fact and dimension tables.
- TPC-H - Business intelligence benchmark with 22 ad-hoc queries on supply-chain data; the most widely cited OLAP benchmark in academic literature.
- TPC family benchmarks - Full catalog of TPC benchmarks for big data and analytical databases.
Readings
Papers
- Apache Flink state management - Carbone et al. (2017) on Flink's state backend, incremental checkpointing, and exactly-once fault tolerance.
- Apache Parquet format - Formal specification of the Parquet columnar storage format, including encoding, compression, and nested schema representation.
- Dremel paper - Google's 2010 paper introducing columnar storage for nested data and interactive ad-hoc queries at petabyte scale.
- RDD - Zaharia et al. (2012) introducing Resilient Distributed Datasets, the fault-tolerant in-memory abstraction that became the foundation of Apache Spark.
- RocksDB - Facebook's paper on how RocksDB's design priorities evolved to serve large-scale production workloads.
- Spanner paper - Corbett et al. (2012) on Google Spanner, the globally distributed ACID database using TrueTime for external consistency.
Architecture
- CoW vs MoR - Comparison of Copy-on-Write and Merge-on-Read table strategies in Apache Hudi, with trade-offs for read vs write performance.
- CQRS (Command Query Responsibility Segregation) - Martin Fowler's guide on separating read and write models to independently scale and optimize each path.
- DAG - Directed Acyclic Graph — the dependency model used to represent task ordering in data pipeline orchestration.
- Event sourcing - Pattern for persisting application state as an immutable log of events rather than mutable records.
- Kappa architecture - Streaming-only alternative to the Lambda architecture that eliminates the batch layer.
- Lambda architecture - Big data pattern combining a batch layer for accuracy and a speed layer for low-latency results.
- Medallion architecture - Bronze/silver/gold data quality layering pattern for incrementally refining data in a lakehouse.
- Reactive programming - Event-driven programming model based on asynchronous, composable data streams.
- Star schema vs Snowflake schema - Dimensional data modeling patterns for organizing fact and dimension tables in an analytics warehouse.
Data modeling
- Schema evolution - How Delta Lake enforces and evolves schemas without breaking downstream readers.
- CDC - Change Data Capture — tracking row-level insertions, updates, and deletes for replication, audit, and event-driven pipelines.
Index
- Partitioning - Dividing data into logical subsets (by date, region, etc.) to prune irrelevant partitions at query time.
- Data skipping - ClickHouse's secondary skipping indexes for filtering data granules without full column scans.
- Statistics - Column-level statistics in Hive used by the query planner to estimate cardinality and choose optimal join strategies.
- High cardinality - Columns with many distinct values and how high cardinality impacts indexing and storage in time-series databases.
- HyperLogLog - Probabilistic cardinality estimation algorithm used for COUNT(DISTINCT) at scale with sub-1% error.
- Bloom filters - Space-efficient probabilistic data structure for fast set-membership testing in query engines and storage layers.
- Minmax - Parquet page-level min/max statistics used for predicate pushdown to skip irrelevant row groups.
- Z-ordering - Multi-dimensional data clustering that co-locates related values to speed up multi-column filter queries.
- Bitmap index - Compact bitset-based index efficient for low-cardinality columns and multi-predicate AND/OR queries.
- Dense index - Index with an entry for every row in the table, enabling direct lookups at the cost of index size.
- Sparse index - Index with entries for only a subset of rows, trading lookup precision for smaller index footprint.
- Reverse index - Inverted index mapping terms or values back to the rows containing them, foundational to full-text search.
- N-gram - PostgreSQL trigram-based similarity index for fast fuzzy text matching and LIKE query acceleration.
- TF-IDF - Term Frequency-Inverse Document Frequency scoring algorithm for ranking documents by relevance in full-text search.
- LSM Tree - Log-Structured Merge-Tree — write-optimized storage structure that batches writes in memory before merging to disk, used in RocksDB, Cassandra, and LevelDB.
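The Bloom filter described above can be sketched in a few lines. This toy uses seeded SHA-256 hashes and a fixed bit count; production filters size the bit array and hash count from a target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Membership tests may return false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # int used as a bit array

    def _positions(self, item):
        # Derive k independent bit positions via seeded hashes.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add("user_42")
assert bf.might_contain("user_42")  # always true for inserted items
# Keys that were never added are *probably* reported absent, letting a query
# engine skip an entire file or row group without reading it.
```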
Vector similarity search
Algorithms and indexes:
- ANN (approximate nearest neighbor) - Family of algorithms that trade exact accuracy for speed when finding the closest vectors in high-dimensional space.
- kNN (k nearest neighbor) - Exact algorithm returning the K closest vectors by distance; accurate but slow at scale without an index.
- Faiss - Facebook AI library for efficient similarity search and clustering of dense vectors.
- HNSW - Hierarchical Navigable Small World graph index for approximate nearest neighbor search.
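Exact kNN is the baseline that ANN indexes like HNSW approximate. A brute-force cosine-similarity search is fine at thousands of vectors; at millions, the linear scan is what an index exists to avoid:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by both vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, vectors, k):
    """vectors: {id: vector}. Returns the k ids most similar to query."""
    ranked = sorted(vectors, key=lambda vid: cosine(query, vectors[vid]), reverse=True)
    return ranked[:k]

docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
assert knn([1.0, 0.05], docs, k=2) == ["a", "b"]
```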
Dedicated vector databases:
- Chroma - Lightweight open-source vector database for AI/RAG applications, optimized for developer simplicity.
- LanceDB - Embedded, serverless vector database built on the Lance columnar format (Apache Arrow-based).
- Milvus - Distributed open-source vector database designed for billion-scale similarity search.
- pgvector - Open-source vector similarity search extension for PostgreSQL.
- Qdrant - High-performance vector search engine written in Rust, with rich payload filtering and production-grade reliability.
- Weaviate - AI-native vector database with built-in vectorization modules, hybrid search, and GraphQL/REST APIs.
Vectorized query processing
- Apache Arrow vectorized execution - Talk on how Arrow's columnar memory layout enables SIMD-accelerated batch processing in query engines.
- Apache Arrow SIMD parallel processing - Single Instruction Multiple Data — CPU instruction-level parallelism that processes multiple columnar values in a single clock cycle.
- Cockroach vectorized JOIN - CockroachDB's vectorized join implementation and the performance gains from columnar execution over row-at-a-time processing.
- Latency comparison numbers - Reference table of hardware latency numbers (L1/L2/L3 cache, RAM, SSD, network) essential for reasoning about database performance.
Querying
- Cost Based Optimization - How Spark 2.2's cost-based optimizer uses column statistics to choose better join strategies and query plans.
- Sampling - Statistical sampling techniques used in approximate query processing to return fast estimates over large datasets.
- GraphX - Apache Spark's graph processing framework and Pregel-based API for iterative graph algorithms at scale.
Transactions
- ACID properties - Atomicity, Consistency, Isolation, Durability — the four guarantees that define correct database transaction behavior.
- Serializable transaction - CockroachDB's interactive demo illustrating how serializable isolation prevents anomalies like write skew and phantom reads.
Consensus
- Paxos - Lamport's foundational distributed consensus algorithm for agreeing on a value across unreliable nodes.
- Raft - Understandable distributed consensus algorithm designed as a more accessible alternative to Paxos, used in etcd and CockroachDB.
Challenging platforms
- Datadog event store - How Datadog built Husky, a column-store for ingesting and querying billions of tagged events at scale.
- Cloudflare logging - How Cloudflare processes 6 million HTTP requests per second using ClickHouse for real-time log analytics.
Blogs to follow
- Engineering at Meta - Meta's engineering blog covering data infrastructure, distributed systems, and AI at hyperscale.
- Engineering at Criteo - Criteo's engineering blog on ad-tech data pipelines, Spark, and large-scale machine learning.
- Engineering at Uber - Uber's engineering blog covering data infrastructure, streaming systems, and distributed databases.
- Engineering at Airbnb - Airbnb's engineering blog on data platform, analytics engineering, and ML infrastructure.
- Databricks - Engineering blog on lakehouse architecture, Apache Spark, Delta Lake, and MLflow.
- Towards Data Science - Community publication covering data science, data engineering, and machine learning.
- Antithesis - Blog from the autonomous testing platform covering distributed systems correctness, fault injection, and database reliability.
More
- Modern Data Stack - Directory of tools and companies in the modern data stack ecosystem.
- The Internals Of... (books.japila.pl) - Free online books covering the internals of Apache Spark, Kafka, Delta Lake, and related tools.
- Jepsen analyses - Kyle Kingsbury's safety analyses of distributed databases, queues, and consensus systems.
- Designing Data-Intensive Applications reading list - Kyle Kingsbury's distributed systems course materials and reading list.
FAQ
What is the best OLAP database
There is no single best OLAP database — the right choice depends on your latency, scale, and operational constraints:
- ClickHouse — best raw query speed on a single node or small cluster; ideal for user-facing analytics, logs, and event data.
- Apache Druid / Apache Pinot — best for sub-second queries at high concurrency over streaming-ingested data (ad tech, real-time dashboards).
- StarRocks — strong alternative to ClickHouse/Druid for hybrid batch+streaming with a MySQL-compatible interface.
- DuckDB — best for local or embedded analytics on files (Parquet, CSV); no server required.
- Trino / PrestoDB — best for federated queries across heterogeneous sources (S3, Hive, RDBMS) without moving data.
- Apache Spark — best for large-scale batch ETL and ML pipelines where latency is not critical.
- Snowflake / BigQuery / Redshift — best when you want fully managed infrastructure with elastic scaling and no ops overhead.
OLAP vs OLTP
| | OLAP | OLTP |
|---|---|---|
| Workload | Complex analytical queries (aggregations, scans) | Simple transactional queries (reads/writes by key) |
| Storage | Columnar | Row-oriented |
| Typical query | `SELECT sum(revenue) GROUP BY region` | `SELECT * FROM orders WHERE id = 42` |
| Scale | Billions of rows, read-heavy | Millions of rows, write-heavy |
| Examples | ClickHouse, Druid, BigQuery | PostgreSQL, MySQL, DynamoDB |
What is a data lakehouse
A data lakehouse combines the low-cost scalable storage of a data lake (files on S3/GCS/ADLS) with the ACID transactions, schema enforcement, and query performance of a data warehouse. Open table formats like Apache Iceberg, Delta Lake, and Apache Hudi implement the lakehouse pattern on top of Parquet files.
Kafka vs Pulsar
Apache Kafka is the de facto standard with the largest ecosystem, best tooling support, and widest operator knowledge. Apache Pulsar offers multi-tenancy, geo-replication, and a decoupled storage layer (via BookKeeper) out of the box — useful when those features are required from day one. Most teams should start with Kafka.
Open table formats: Iceberg vs Delta Lake vs Hudi
| | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Best for | Large-scale analytics, multi-engine | Spark-native workloads, Databricks | CDC / upsert-heavy pipelines |
| Engine support | Spark, Flink, Trino, Hive, Dremio | Spark (best), Flink, Trino | Spark, Flink |
| Upserts | Merge-on-read or copy-on-write | Copy-on-write (merge-on-read in progress) | First-class, optimized |
| Governance | Apache Foundation | Linux Foundation | Apache Foundation |
See the comparison links in the Open table formats section for detailed benchmarks.
ClickHouse vs Apache Druid vs Apache Pinot vs StarRocks
All four are real-time OLAP databases with sub-second query latency. Key differences:
| | ClickHouse | Apache Druid | Apache Pinot | StarRocks |
|---|---|---|---|---|
| Best for | Log/event analytics, ad-hoc queries | Streaming-ingested time-series data | User-facing analytics, high concurrency | Hybrid batch+streaming, flexible schema |
| Architecture | Shared-nothing, columnar MergeTree | Segment-based, time-partitioned | Segment-based, real-time + offline tables | MPP with vectorized execution engine |
| Ingestion | Kafka, files, HTTP push | Kafka, Kinesis, native streaming | Kafka, Kinesis, files | Kafka, files, Flink, Spark |
| Upserts | Limited (ReplacingMergeTree) | No | No (append-only) | Yes (primary key tables) |
| Query concurrency | Medium | High | Very high (user-facing) | High |
| SQL dialect | ClickHouse SQL (mostly ANSI) | Druid SQL (ANSI subset) | SQL (Calcite-based) + legacy PQL | MySQL-compatible SQL |
| Written in | C++ | Java | Java | C++ / Java |
| Managed cloud | ClickHouse Cloud | Imply Polaris | StarTree Cloud | CelerData |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 (Elastic for some features) |
When to pick which
- ClickHouse — highest raw throughput for analytics on a single cluster; ideal for logs, metrics, and BI queries.
- Apache Druid — best when data arrives via Kafka and you need time-partitioned rollups with guaranteed low latency.
- Apache Pinot — best for user-facing products where thousands of end-users hit the DB concurrently (dashboards, embedded analytics).
- StarRocks — best when you need upserts, a MySQL-compatible interface, or a single engine for both batch and streaming.
See the Benchmark section for query performance comparisons across engines.
People to follow
| Name | Description | GitHub | Twitter/X | LinkedIn | Bluesky |
|---|---|---|---|---|---|
| Alexey Milovidov | Co-founder and CTO of ClickHouse | alexey-milovidov | @fdooch123 | in/alexey-milovidov-clickhouse | |
| Hannes Mühleisen | Co-creator of DuckDB, CEO of DuckDB Labs | hannes | @hfmuehleisen | in/hfmuehleisen | bsky |
| Mark Raasveldt | Co-creator of DuckDB | Mytherin | @mraasveldt | in/mark-raasveldt-256b9a70 | bsky |
| Wes McKinney | Creator of Pandas, co-creator of Apache Arrow and Parquet | wesm | @wesmckinn | in/wesmckinn | bsky |
| Martin Traverso | Creator of Presto and Trino, CTO at Starburst | martint | @mtraverso | in/traversomartin | |
| Matei Zaharia | Creator of Apache Spark, co-founder and CTO of Databricks | mateiz | @matei_zaharia | in/mateizaharia | |
| Jacques Nadeau | Co-creator of Apache Arrow, Apache Drill, and Dremio | jacques-n | in/jacquesnadeau | bsky | |
| Andrew Lamb | PMC member for Apache Arrow, DataFusion, and Parquet | alamb | @andrewlamb1111 | in/andrewalamb | bsky |
| Andy Grove | PMC member of Apache Arrow and DataFusion. Author of "How Query Engines Work" | andygrove | in/andygrove | bsky | |
| Tristan Handy | Founder and CEO of dbt Labs | jthandy | @jthandy | in/tristanhandy | bsky |
| Fokko Driesprong | PMC member on Apache Avro, Airflow, Druid, Iceberg, and Parquet | Fokko | @_Fokko | in/fokkodriesprong | |
| Gian Merlino | Co-founder and CTO of Imply, co-creator of Apache Druid | gianm | @gianmerlino | in/gianmerlino | |
| Phil Eaton | Database and systems engineer, writer on database internals | eatonphil | @eatonphil | in/eatonphil | bsky |
Events
- Databricks Data+AI Summit - The world's largest data, analytics, and AI conference.
- Snowflake Summit - Annual conference for data and AI practitioners.
- Confluent Current - The Data Streaming Event focused on Apache Kafka and real-time data streaming.
- Data Council - Technical conference on data engineering, infrastructure, and analytics.
- dbt Summit - The world's largest gathering of dbt users and analytics engineering practitioners.
- Flink Forward - Conference dedicated to real-time stream processing and Apache Flink.
- Community Over Code - The Apache Software Foundation's official conference (formerly ApacheCon).
- VLDB - Premier academic conference on Very Large Data Bases.
- ACM SIGMOD/PODS - Leading international forum for database researchers and practitioners.
Communities
Generalist
- r/dataengineering - The largest Reddit community for data engineering discussions (300k+ members).
- r/databasedevelopment - Subreddit for database internals, query engines, and storage systems (10k+ members).
- DataTalks.Club - Global Slack community for data practitioners covering data engineering, ML, and MLOps.
- Big Data Hebdo - French-language Slack community and podcast covering big data, data engineering, and analytics.
- Software Internals - Discord community by Phil Eaton focused on database internals, compilers, and distributed systems (9k+ members).
- Locally Optimistic - Curated Slack community for current and aspiring analytics leaders.
Tool-specific
- dbt Community - 100,000+ member Slack workspace for analytics engineering and modern data stack discussions.
- ClickHouse Community Slack - Active Slack workspace for ClickHouse users and developers.
- DuckDB Discord - Discord server for DuckDB users covering Q&A, performance tuning, and feature discussions.
- Trino Community Slack - 13,000+ members discussing the Trino distributed SQL query engine.
- Apache Druid Community - Slack workspace and mailing lists for Apache Druid users and committers.
- Apache Pinot Community - Community Slack for real-time distributed OLAP datastore users.
🤝 Contributing
Contributions of any kind welcome! Read the guidelines before opening a PR.
👤 Contributors
💫 Show your support
Give a ⭐️ if this project helped you!