Ingestion and querying
Stream processing
Real-time data processing (also called event streaming) handles data as it is generated — enabling low-latency pipelines, continuous aggregations, and immediate downstream reactions to events.
- Akka Streams - Reactive stream processing library for JVM, built on the actor model.
- Apache Beam - Unified SDK for cross-language stream and batch processing. Available in Go, Python, Java, Scala and TypeScript.
- Apache Flink - Stateful stream processing with exactly-once semantics, supporting event time and out-of-order data.
- Apache Kafka Streams - Lightweight stream processing library embedded in the Kafka client, no separate cluster required.
- Apache Spark Streaming - Micro-batch stream processing on top of Spark, integrating with the broader Spark ecosystem.
- Redpanda Connect (formerly Benthos) - Declarative stream processing toolkit in Go, with a wide connector library.
- Materialize - Operational data warehouse that incrementally maintains SQL views over streaming data, always-fresh without recomputation.
- RisingWave - Distributed SQL streaming database (PostgreSQL-compatible) with sub-100ms end-to-end freshness and native Iceberg integration.
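The continuous-aggregation idea behind these engines can be sketched in a few lines of plain Python: group events into fixed-size tumbling windows keyed by event time, and count per window. This is only a toy illustration (the events, key names, and window size are made up); real engines like Flink or Kafka Streams add distributed state, fault tolerance, and handling of out-of-order events.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling window size (illustrative choice)

def tumbling_window_counts(events):
    """events: iterable of (event_time_seconds, key) pairs."""
    counts = defaultdict(int)
    for event_time, key in events:
        # Align each event to the start of its 60-second window.
        window_start = event_time - (event_time % WINDOW_SECONDS)
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5, "click"), (42, "click"), (61, "view"), (119, "click")]
print(tumbling_window_counts(events))
# {(0, 'click'): 2, (60, 'view'): 1, (60, 'click'): 1}
```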
Batch processing
Periodically process large volumes of data in a single batch.
- Apache Spark - Distributed batch processing engine with in-memory computation, supporting SQL, ML, and graph workloads.
- MapReduce - Programming model for processing large datasets in parallel across a cluster; the foundation of the Hadoop ecosystem.
In-memory processing
Non-real-time SQL queries against a large dataset can be executed locally on a single machine. The trade-off is that the data may not fit into memory, and large jobs can take a very long time.
- Apache Arrow - Low-level in-memory columnar data format with zero-copy access across languages via gRPC/IPC interfaces.
- Apache Arrow DataFusion - High-level SQL and DataFrame query engine built on Apache Arrow, written in Rust.
- chDB - Embeddable in-process OLAP engine powered by ClickHouse, callable from Python without a server.
- clickhouse-local - Lightweight CLI version of ClickHouse for running SQL queries against CSV, JSON, Parquet and other files.
- delta-rs - Standalone DeltaLake driver for Python and Rust. Does not depend on Spark.
- Delta Standalone - Standalone DeltaLake driver for Java and Scala. Does not depend on Spark.
- DuckDB - In-process SQL OLAP query engine for Parquet, CSV, and JSON files, with zero-copy Apache Arrow interoperability.
- Pandas - Python DataFrame library for data analysis and manipulation, the standard for data science workflows.
- Polars - High-performance DataFrame library written in Rust with a lazy query optimizer, significantly faster than Pandas.
Distributed SQL processing
These SQL engines execute distributed queries over very large datasets across a cluster. Many support ANSI SQL or ANSI-SQL-like interfaces. Some can also act as federated query engines, querying across heterogeneous data sources.
- Apache Spark SQL - Distributed SQL query engine that sits on top of Spark.
- Dremio - SQL lakehouse platform providing a semantic layer and query acceleration on top of data lakes.
- ksqlDB (formerly KSQL) - Streaming SQL engine for Apache Kafka, built on Kafka Streams.
- PrestoDB - Distributed SQL query engine, originally developed at Facebook for interactive analytics over large datasets.
- Trino - Distributed SQL query engine; community fork of PrestoDB (formerly known as PrestoSQL).
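The federation idea can be illustrated with a toy example in plain Python: join rows from two heterogeneous sources (a SQLite table and a CSV file) the way engines like Trino join across catalogs. The table, column, and customer names here are invented for the illustration; a real federated engine does this distribution, planning, and pushdown for you.

```python
import csv
import io
import sqlite3

# Source 1: a relational database (stand-in for e.g. a PostgreSQL catalog).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20)])

# Source 2: a flat file (stand-in for e.g. a data-lake catalog).
customers_csv = "id,name\n10,Ada\n20,Linus\n"
customers = {
    int(row["id"]): row["name"]
    for row in csv.DictReader(io.StringIO(customers_csv))
}

# "Federated" join: combine rows across the two sources in one result.
result = [
    (order_id, customers[customer_id])
    for order_id, customer_id in db.execute(
        "SELECT order_id, customer_id FROM orders ORDER BY order_id"
    )
]
print(result)  # [(1, 'Ada'), (2, 'Linus')]
```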