
Awesome OLAP

A curated list of OLAP databases, columnar stores, data lake & lakehouse frameworks, query engines, and data engineering tools.

100+ tools · 20+ categories · open source


OLAP (Online Analytical Processing) refers to databases and query engines optimized for complex, read-heavy analytical queries over large datasets. Unlike OLTP systems, OLAP databases use columnar storage, vectorized execution, and distributed processing to aggregate and analyze billions of rows in seconds.
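To make the columnar-storage point concrete, here is a minimal sketch (plain Python lists standing in for storage; real engines add compression, vectorized kernels, and SIMD on top) of why an aggregate query touches far less data in a columnar layout:

```python
# Illustrative sketch: the same table stored row-wise vs column-wise.
rows = [
    {"region": "EU", "revenue": 100},
    {"region": "US", "revenue": 250},
    {"region": "EU", "revenue": 75},
]

# Row-oriented scan: every field of every row is touched.
total_row = sum(r["revenue"] for r in rows)

# Columnar layout: each column is a contiguous array, so an aggregate
# reads only the bytes of the one column it needs.
columns = {
    "region": ["EU", "US", "EU"],
    "revenue": [100, 250, 75],
}
total_col = sum(columns["revenue"])

assert total_row == total_col == 425
```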

Contents

OLAP Databases

Real-time analytics

The following columnar databases use a shared-nothing architecture and provide sub-second response times. DDL, DML, and DCL operations are performed via SQL. These databases also support tiering data out to long-term cold storage.

Search engines

Search engines complement OLAP systems for full-text search and log analytics use cases, where keyword relevance and inverted indexes matter more than aggregate query performance.

Hybrid OLAP/OLTP NewSQL (aka HTAP)

HTAP (Hybrid Transactional-Analytical Processing) databases handle both transactional writes and analytical reads in a single engine, eliminating the need to maintain a separate data warehouse for reporting.

Timeseries

Time-series databases are optimized for append-heavy workloads where data is tagged, timestamped, and queried by time range — distinct from general OLAP because they prioritize ingestion throughput, automatic retention, and time-aligned aggregations.
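A time-aligned aggregation of the kind these databases optimize can be sketched as follows (the data points and bucket width are invented for illustration; real engines do this over compressed, timestamp-indexed storage):

```python
from collections import defaultdict

# Hypothetical data points: (unix_timestamp, value) pairs from a sensor.
points = [(0, 1.0), (30, 2.0), (70, 3.0), (130, 4.0)]

def bucket_avg(points, width_s=60):
    """Average values per fixed-width time bucket (a time-aligned aggregation)."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts // width_s * width_s].append(value)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}

print(bucket_avg(points))  # {0: 1.5, 60: 3.0, 120: 4.0}
```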

Managed cloud services

Fully managed cloud data warehouses trade self-hosted operational overhead for elastic scaling and pay-as-you-go pricing. All handle petabyte-scale analytics; they differ in cost model, latency profile, and ecosystem integrations.

Storage engines

Storage engines are the foundational frameworks on top of which higher-level databases and data systems are built. They handle durability, transactions, and low-level data organization.

Data lake

The data lake approach (and its "lakehouse" evolution) is an architecture for storing semi-structured data on top of cloud object storage.

It is composed of several layers (from lowest to highest): codec, file format, table format + metastore, and the ingestion/query layer.

File formats and serialization

These formats are popular for shared-everything databases that use object storage as the persistence layer. Data is organized in rows or columns, with a strict schema definition. The files are immutable and support partial reads (headers only, metadata, individual data pages, etc.); any mutation requires uploading a new file. Most formats support nested schemas, codecs, compression, and data encryption, and indexes can be added to file metadata for faster processing.

A single file can weigh between tens of MB and a few GB. Many small files require more merge operations, while larger files are costly to update.
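The "immutable file with metadata for partial reads" idea can be sketched with a toy, loosely Parquet-shaped layout (the layout and field names here are invented for illustration, not any real format's spec): data pages, then a footer holding per-page offsets and min/max stats, so a reader can prune pages by parsing the footer alone.

```python
import io, json, struct

def write_file(pages):
    """Write data pages, then a JSON footer with offsets and min/max
    stats, then the footer length (little-endian uint32)."""
    buf = io.BytesIO()
    meta = []
    for values in pages:
        payload = json.dumps(values).encode()
        meta.append({"offset": buf.tell(), "length": len(payload),
                     "min": min(values), "max": max(values)})
        buf.write(payload)
    footer = json.dumps(meta).encode()
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))
    return buf.getvalue()

def read_pages_overlapping(blob, lo, hi):
    """Partial read: parse only the footer, then fetch matching pages."""
    (flen,) = struct.unpack("<I", blob[-4:])
    meta = json.loads(blob[-4 - flen:-4])
    out = []
    for m in meta:  # prune pages via min/max stats, no full scan
        if m["max"] >= lo and m["min"] <= hi:
            out.append(json.loads(blob[m["offset"]:m["offset"] + m["length"]]))
    return out

blob = write_file([[1, 2, 3], [10, 20], [100]])
print(read_pages_overlapping(blob, 5, 50))  # [[10, 20]]
```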

Open table formats

Open table formats are an abstraction layer on top of Avro/Parquet files, adding support for ACID transactions, CDC, partitioning, mixed streaming/batch processing, schema evolution, and mutation. Schema and statistics are stored in a metastore; data is persisted locally or in remote/cloud object storage.

Open tables make a cost-effective data warehouse at petabyte scale.

Comparison:

👆 Warning: pre-2022 articles should be considered out of date, as open table formats evolve quickly.

Metastore

Object Storage

Codecs, encoding and compression

Brokers and distributed messaging

Message brokers sit between producers and consumers in the data stack, providing durable, ordered event streams that decouple ingestion from processing and enable replay, fan-out, and exactly-once delivery semantics.
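The core broker abstraction (an ordered, replayable log with per-consumer offsets) can be sketched in a few lines — this is illustrative only; real brokers add partitions, durability, acknowledgements, and delivery guarantees:

```python
class Log:
    """Toy append-only log: producers append, each consumer tracks its
    own offset, so messages can be replayed and fanned out independently."""
    def __init__(self):
        self.entries = []
        self.offsets = {}  # consumer name -> next offset to read

    def produce(self, msg):
        self.entries.append(msg)

    def consume(self, consumer):
        start = self.offsets.get(consumer, 0)
        batch = self.entries[start:]
        self.offsets[consumer] = len(self.entries)
        return batch

log = Log()
log.produce("a"); log.produce("b")
assert log.consume("etl") == ["a", "b"]         # first consumer reads all
log.produce("c")
assert log.consume("etl") == ["c"]              # resumes from its offset
assert log.consume("audit") == ["a", "b", "c"]  # fan-out: independent replay
```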

Ingestion and querying

Stream processing

Real-time data processing (also called event streaming) handles data as it is generated — enabling low-latency pipelines, continuous aggregations, and immediate downstream reactions to events.

Batch processing

Batch processing periodically processes a large volume of data in a single run.

In-memory processing

Non-real-time SQL queries against a large dataset can be processed locally, in memory, on a single machine. This approach may fail when the data does not fit in memory, or lead to very long job durations.

Distributed SQL processing

These SQL engines execute distributed queries over very large datasets across a cluster. Many support ANSI SQL or ANSI-SQL-like interfaces. Some can also act as federated query engines, querying across heterogeneous data sources.

Scheduler

Orchestrators define and monitor complex multi-step DAG workflows with dependency management, retries, and observability. Cron-style schedulers simply trigger jobs at fixed time intervals. The tools below are full orchestrators.
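The orchestrator core — run tasks in dependency order, retrying failures — can be sketched with the standard-library `graphlib` (task names and the `deps` mapping below are made up for illustration):

```python
from graphlib import TopologicalSorter

def run_dag(deps, tasks, max_retries=2):
    """Execute tasks in topological order; retry each up to max_retries."""
    done = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # exhausted retries: fail the workflow
        done.append(name)
    return done

deps = {"load": {"transform"}, "transform": {"extract"}}  # task -> upstream deps
order = run_dag(deps, {t: (lambda: None) for t in ("extract", "transform", "load")})
print(order)  # ['extract', 'transform', 'load']
```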

Durable execution

Durable execution frameworks guarantee that workflows survive process crashes, network failures, and infrastructure restarts by persisting execution state automatically.
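A minimal sketch of the checkpoint-and-resume idea (illustrative only: frameworks such as Temporal persist full event histories and replay deterministically, not just completed step names):

```python
import json, os, tempfile

def run_workflow(steps, state_path, log):
    """Run steps in order, checkpointing each completed step to disk,
    so a restarted run skips work it already did."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f))
    for name, fn in steps:
        if name in done:
            continue  # restarted run: skip already-completed steps
        fn()
        log.append(name)
        done.add(name)
        with open(state_path, "w") as f:
            json.dump(sorted(done), f)  # checkpoint progress

log = []
path = os.path.join(tempfile.mkdtemp(), "state.json")
steps = [("charge", lambda: None), ("email", lambda: None)]
run_workflow(steps, path, log)
run_workflow(steps, path, log)  # simulated restart: nothing re-executes
assert log == ["charge", "email"]
```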

ETL, ELT and reverse ETL

The popular acronym for Extracting, Transforming, and Loading data (such tools are also called data pipeline or data integration tools). ELT performs the transformations directly within the data warehouse. Reverse ETL copies data from your data warehouse back into external tools or SaaS applications.
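The three stages can be sketched end to end (the CSV contents, column names, and SQLite target are invented for illustration):

```python
import csv, io, sqlite3

# Toy ETL: extract rows from CSV, transform (cast + filter), load into SQLite.
raw = "user,amount\nalice,10\nbob,-3\ncarol,7\n"

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Cast amounts to int and drop non-positive ones.
    return [(r["user"], int(r["amount"])) for r in rows if int(r["amount"]) > 0]

def load(rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE payments (user TEXT, amount INT)")
    db.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    return db.execute("SELECT sum(amount) FROM payments").fetchone()[0]

print(load(transform(extract(raw))))  # 17
```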

BI & Visualization

Business intelligence and visualization tools sit on top of OLAP databases, enabling analysts to explore data, build dashboards, and share insights without writing SQL.

Datasets

Large-scale public datasets commonly used for benchmarking OLAP databases, query engines, and data lake tools.

Benchmark

Benchmarks help select the right database for a workload. Always run benchmarks on your own data and query patterns — published numbers reflect vendor-tuned configurations.

Readings

Papers

Architecture

Data modeling

Index

Algorithms and indexes:

Dedicated vector databases:

Vectorized query processing

Querying

Transactions

Consensus

Challenging platforms

Blogs to follow

More

FAQ

What is the best OLAP database

There is no single best OLAP database — the right choice depends on your latency, scale, and operational constraints:

ClickHouse — best raw query speed on a single node or small cluster; ideal for user-facing analytics, logs, and event data.

Apache Druid / Apache Pinot — best for sub-second queries at high concurrency over streaming-ingested data (ad tech, real-time dashboards).

StarRocks — strong alternative to ClickHouse/Druid for hybrid batch+streaming with a MySQL-compatible interface.

DuckDB — best for local or embedded analytics on files (Parquet, CSV); no server required.

Trino / PrestoDB — best for federated queries across heterogeneous sources (S3, Hive, RDBMS) without moving data.

Apache Spark — best for large-scale batch ETL and ML pipelines where latency is not critical.

Snowflake / BigQuery / Redshift — best when you want fully managed infrastructure with elastic scaling and no ops overhead.

OLAP vs OLTP

| | OLAP | OLTP |
|---|---|---|
| Workload | Complex analytical queries (aggregations, scans) | Simple transactional queries (reads/writes by key) |
| Storage | Columnar | Row-oriented |
| Typical query | `SELECT sum(revenue) FROM orders GROUP BY region` | `SELECT * FROM orders WHERE id = 42` |
| Scale | Billions of rows, read-heavy | Millions of rows, write-heavy |
| Examples | ClickHouse, Druid, BigQuery | PostgreSQL, MySQL, DynamoDB |

What is a data lakehouse

A data lakehouse combines the low-cost scalable storage of a data lake (files on S3/GCS/ADLS) with the ACID transactions, schema enforcement, and query performance of a data warehouse. Open table formats like Apache Iceberg, Delta Lake, and Apache Hudi implement the lakehouse pattern on top of Parquet files.

Kafka vs Pulsar

Apache Kafka is the de facto standard with the largest ecosystem, best tooling support, and widest operator knowledge. Apache Pulsar offers multi-tenancy, geo-replication, and a decoupled storage layer (via BookKeeper) out of the box — useful when those features are required from day one. Most teams should start with Kafka.

Open table formats: Iceberg vs Delta Lake vs Hudi

| | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Best for | Large-scale analytics, multi-engine | Spark-native workloads, Databricks | CDC / upsert-heavy pipelines |
| Engine support | Spark, Flink, Trino, Hive, Dremio | Spark (best), Flink, Trino | Spark, Flink |
| Upserts | Merge-on-read or copy-on-write | Copy-on-write (merge-on-read in progress) | First-class, optimized |
| Governance | Apache Foundation | Linux Foundation | Apache Foundation |

See the comparison links in the Open table formats section for detailed benchmarks.

ClickHouse vs Apache Druid vs Apache Pinot vs StarRocks

All four are real-time OLAP databases with sub-second query latency. Key differences:

| | ClickHouse | Apache Druid | Apache Pinot | StarRocks |
|---|---|---|---|---|
| Best for | Log/event analytics, ad-hoc queries | Streaming-ingested time-series data | User-facing analytics, high concurrency | Hybrid batch+streaming, flexible schema |
| Architecture | Shared-nothing, columnar MergeTree | Segment-based, time-partitioned | Segment-based, real-time + offline tables | MPP with vectorized execution engine |
| Ingestion | Kafka, files, HTTP push | Kafka, Kinesis, native streaming | Kafka, Kinesis, files | Kafka, files, Flink, Spark |
| Upserts | Limited (ReplacingMergeTree) | No | Yes (real-time tables with primary keys) | Yes (primary key tables) |
| Query concurrency | Medium | High | Very high (user-facing) | High |
| SQL dialect | ClickHouse SQL (mostly ANSI) | Druid SQL (ANSI subset) | Calcite-based SQL (plus legacy PQL) | MySQL-compatible SQL |
| Written in | C++ | Java | Java | C++ / Java |
| Managed cloud | ClickHouse Cloud | Imply Polaris | StarTree Cloud | CelerData |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 (Elastic for some features) |

When to pick which

ClickHouse — highest raw throughput for analytics on a single cluster; ideal for logs, metrics, and BI queries.

Apache Druid — best when data arrives via Kafka and you need time-partitioned rollups with guaranteed low latency.

Apache Pinot — best for user-facing products where thousands of end-users hit the DB concurrently (dashboards, embedded analytics).

StarRocks — best when you need upserts, a MySQL-compatible interface, or a single engine for both batch and streaming.

See the Benchmark section for query performance comparisons across engines.

People to follow

| Name | Description | GitHub | Twitter/X | LinkedIn | Bluesky |
|---|---|---|---|---|---|
| Alexey Milovidov | Co-founder and CTO of ClickHouse | alexey-milovidov | @fdooch123 | in/alexey-milovidov-clickhouse | |
| Hannes Mühleisen | Co-creator of DuckDB, CEO of DuckDB Labs | hannes | @hfmuehleisen | in/hfmuehleisen | bsky |
| Mark Raasveldt | Co-creator of DuckDB | Mytherin | @mraasveldt | in/mark-raasveldt-256b9a70 | bsky |
| Wes McKinney | Creator of Pandas, co-creator of Apache Arrow and Parquet | wesm | @wesmckinn | in/wesmckinn | bsky |
| Martin Traverso | Creator of Presto and Trino, CTO at Starburst | martint | @mtraverso | in/traversomartin | |
| Matei Zaharia | Creator of Apache Spark, co-founder and CTO of Databricks | mateiz | @matei_zaharia | in/mateizaharia | |
| Jacques Nadeau | Co-creator of Apache Arrow, Apache Drill, and Dremio | jacques-n | | in/jacquesnadeau | bsky |
| Andrew Lamb | PMC member for Apache Arrow, DataFusion, and Parquet | alamb | @andrewlamb1111 | in/andrewalamb | bsky |
| Andy Grove | PMC member of Apache Arrow and DataFusion. Author of "How Query Engines Work" | andygrove | | in/andygrove | bsky |
| Tristan Handy | Founder and CEO of dbt Labs | jthandy | @jthandy | in/tristanhandy | bsky |
| Fokko Driesprong | PMC member on Apache Avro, Airflow, Druid, Iceberg, and Parquet | Fokko | @_Fokko | in/fokkodriesprong | |
| Gian Merlino | Co-founder and CTO of Imply, co-creator of Apache Druid | gianm | @gianmerlino | in/gianmerlino | |
| Phil Eaton | Database and systems engineer, writer on database internals | eatonphil | @eatonphil | in/eatonphil | bsky |

Events

Communities

Generalist

Tool-specific

🤝 Contributing

Contributions of any kind are welcome! Read the guidelines before opening a PR.

👤 Contributors


💫 Show your support

Give a ⭐️ if this project helped you!

GitHub Sponsors