Ingestion and querying
Stream processing
Real-time data processing (also called event streaming) handles data as it is generated — enabling low-latency pipelines, continuous aggregations, and immediate downstream reactions to events.
- Akka Streams - Reactive stream processing library for JVM, built on the actor model.
- Apache Beam - Unified SDK for cross-language stream and batch processing. Available in Go, Python, Java, Scala and TypeScript.
- Apache Flink - Stateful stream processing with exactly-once semantics, supporting event time and out-of-order data.
- Apache Kafka Streams - Lightweight stream processing library embedded in the Kafka client, no separate cluster required.
- Apache Spark Streaming - Micro-batch stream processing on top of Spark, integrating with the broader Spark ecosystem.
- Redpanda Connect (formerly Benthos) - Declarative stream processing toolkit in Go, with a wide connector library.
- Materialize - Operational data warehouse that incrementally maintains SQL views over streaming data, always-fresh without recomputation.
- RisingWave - Distributed SQL streaming database (PostgreSQL-compatible) with sub-100ms end-to-end freshness and native Iceberg integration.
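The continuous-aggregation idea behind these engines can be sketched in a few lines of plain Python: group events into fixed-size tumbling windows keyed by event time, and count per window. This is only a toy illustration (the events, key names, and window size are made up); real engines like Flink or Kafka Streams add distributed state, fault tolerance, and handling of out-of-order events.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling window size (illustrative choice)

def tumbling_window_counts(events):
    """events: iterable of (event_time_seconds, key) pairs."""
    counts = defaultdict(int)
    for event_time, key in events:
        # Align each event to the start of its 60-second window.
        window_start = event_time - (event_time % WINDOW_SECONDS)
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5, "click"), (42, "click"), (61, "view"), (119, "click")]
print(tumbling_window_counts(events))
# {(0, 'click'): 2, (60, 'view'): 1, (60, 'click'): 1}
```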
Batch processing
Periodically process large volumes of data in a single batch.
- Apache Spark - Distributed batch processing engine with in-memory computation, supporting SQL, ML, and graph workloads.
- MapReduce - Programming model for processing large datasets in parallel across a cluster; the foundation of the Hadoop ecosystem.
In-memory processing
Non-real-time SQL queries against a large dataset can be executed locally on a single machine. The trade-off is that the data may not fit into memory, and large jobs can take a very long time.
- Apache Arrow - Low-level in-memory columnar data format with zero-copy access across languages via gRPC/IPC interfaces.
- Apache Arrow DataFusion - High-level SQL and DataFrame query engine built on Apache Arrow, written in Rust.
- chDB - Embeddable in-process OLAP engine powered by ClickHouse, callable from Python without a server.
- clickhouse-local - Lightweight CLI version of ClickHouse for running SQL queries against CSV, JSON, Parquet and other files.
- delta-rs - Standalone DeltaLake driver for Python and Rust. Does not depend on Spark.
- Delta Standalone - Standalone DeltaLake driver for Java and Scala. Does not depend on Spark.
- DuckDB - In-process SQL OLAP query engine for Parquet, CSV, and JSON files, with zero-copy Apache Arrow interoperability.
- Pandas - Python DataFrame library for data analysis and manipulation, the standard for data science workflows.
- Polars - High-performance DataFrame library written in Rust with a lazy query optimizer, significantly faster than Pandas.
Distributed SQL processing
These SQL engines execute distributed queries over very large datasets across a cluster. Many support ANSI SQL or ANSI-SQL-like interfaces. Some can also act as federated query engines, querying across heterogeneous data sources.
- Apache Spark SQL - Distributed SQL query engine that sits on top of Spark.
- Dremio - SQL lakehouse platform providing a semantic layer and query acceleration on top of data lakes.
- ksqlDB (formerly KSQL) - Streaming SQL engine for Apache Kafka, built on Kafka Streams.
- PrestoDB - Distributed SQL query engine, originally developed at Facebook for interactive analytics over large datasets.
- Trino - Distributed SQL query engine; community fork of PrestoDB (formerly known as PrestoSQL).
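The federation idea can be illustrated with a toy example in plain Python: join rows from two heterogeneous sources (a SQLite table and a CSV file) the way engines like Trino join across catalogs. The table, column, and customer names here are invented for the illustration; a real federated engine does this distribution, planning, and pushdown for you.

```python
import csv
import io
import sqlite3

# Source 1: a relational database (stand-in for e.g. a PostgreSQL catalog).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20)])

# Source 2: a flat file (stand-in for e.g. a data-lake catalog).
customers_csv = "id,name\n10,Ada\n20,Linus\n"
customers = {
    int(row["id"]): row["name"]
    for row in csv.DictReader(io.StringIO(customers_csv))
}

# "Federated" join: combine rows across the two sources in one result.
result = [
    (order_id, customers[customer_id])
    for order_id, customer_id in db.execute(
        "SELECT order_id, customer_id FROM orders ORDER BY order_id"
    )
]
print(result)  # [(1, 'Ada'), (2, 'Linus')]
```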