Ingestion and querying

Curated list of stream processing (Flink, Kafka Streams), batch processing (Spark), in-memory query engines (DuckDB, Polars), and distributed SQL (Trino, PrestoDB) tools.

Stream processing

Real-time data processing (also called event streaming) handles data as it is generated — enabling low-latency pipelines, continuous aggregations, and immediate downstream reactions to events.
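As a rough illustration of a continuous aggregation, the sketch below counts events per key in fixed (tumbling) time windows and emits each window's result as soon as a later event closes it. It is plain Python, not the API of Flink or Kafka Streams. The function name `tumbling_window_counts`, the `(timestamp, key)` event shape, and the sample data are all assumptions made for this example. It also assumes events arrive in timestamp order; production engines typically handle late or out-of-order data, which is omitted here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[float, str]  # hypothetical shape: (timestamp in seconds, key)


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (window_start, counts_per_key) each time a window closes.

    Assumes events arrive in timestamp order. Windows that contain
    no events are not emitted.
    """
    current_start = None
    counts: Dict[str, int] = defaultdict(int)
    for ts, key in events:
        # Map the timestamp to the start of its fixed-size window.
        window_start = (ts // window_seconds) * window_seconds
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # An event from a later window closes the current one.
            yield current_start, dict(counts)
            counts = defaultdict(int)
            current_start = window_start
        counts[key] += 1
    # Flush the last open window once the input ends.
    if current_start is not None:
        yield current_start, dict(counts)


# Made-up sample events for illustration only.
events = [(0.5, "click"), (1.2, "view"), (4.9, "click"), (5.1, "click"), (11.0, "view")]
results = list(tumbling_window_counts(events, window_seconds=5))
```

Because the function is a generator, downstream code can react to each window as it closes instead of waiting for the whole input, which is the main contrast with a batch job.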

Batch processing

Batch processing handles a large amount of data periodically, in a single batch, rather than record by record as it arrives.

In-memory processing

Non-real-time SQL queries against a large database can be processed locally, on a single machine. The drawback is that the data might not fit into memory, or jobs might run for a very long time.
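A minimal sketch of a local, in-process SQL query follows. It uses Python's standard-library `sqlite3` module with an in-memory database purely as a stand-in, so the example runs without extra dependencies; it is not the API of DuckDB or Polars. The `orders` table and its rows are hypothetical.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM inside this process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

# Hypothetical sample rows for illustration only.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# An analytical query executed locally, with no server or cluster involved.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()
```

Everything here lives in one process's memory, which is what makes this style fast to set up and also what limits it once the data outgrows a single machine.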

Distributed SQL processing

These SQL engines execute distributed queries over very large datasets across a cluster. Many support ANSI SQL or ANSI-SQL-like interfaces. Some can also act as federated query engines, querying across heterogeneous data sources.
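The idea of spreading one query across a cluster can be sketched conceptually as scatter-gather: each worker computes a partial aggregate over its own partition, and a coordinator merges the partials. The code below simulates this with threads on one machine. It is an assumption-laden illustration, not how Trino or PrestoDB are implemented, and the names `partial_aggregate`, `distributed_average`, and the sample partitions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Hypothetical data, already split into partitions as if stored on three nodes.
partitions: List[List[float]] = [[3, 5, 7], [10, 20], [1]]


def partial_aggregate(partition: List[float]) -> Tuple[float, int]:
    """Worker step: reduce one partition to a small (sum, count) pair."""
    return sum(partition), len(partition)


def distributed_average(parts: List[List[float]]) -> float:
    """Coordinator step: scatter the work, then merge the partial results.

    Threads stand in for cluster nodes in this simulation.
    """
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(partial_aggregate, parts))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


avg = distributed_average(partitions)
```

Averaging per-partition averages would give a wrong answer when partitions differ in size, so the workers return `(sum, count)` pairs and only the coordinator divides.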