Datasets
Large-scale public datasets commonly used for benchmarking OLAP databases, query engines, and data lake tools.
- awesome-public-datasets - Curated list of high-quality public datasets organized by domain.
- CommonCrawl - Petabyte-scale web crawl dataset updated monthly; used for NLP, analytics, and link graph research.
- Criteo - 1 TB click-log dataset; a standard benchmark for ad-tech and high-cardinality analytics workloads.
- ENTSO-E - European electricity generation and consumption statistics, useful for time-series and energy analytics benchmarks.
- GitHub Archives - Timestamped record of all public GitHub events; a popular dataset for querying with ClickHouse, BigQuery, and DuckDB.
- Kaggle - Community-sourced datasets and competitions covering a wide range of domains.
- NYCTaxi - NYC taxi trip records dating back to 2009; a classic columnar query benchmark with billions of rows.