PandasBench – The first benchmark for the Pandas API

June 23, 2025

Pandas is the driving force behind millions of notebooks (estimates suggest that almost every other notebook uses it), and multiple replacements have been created, such as Modin, Dask, and Koalas. Yet there has been no benchmark for the Pandas API.

We’re announcing PandasBench.

What my project does: PandasBench is the first systematic effort to create a benchmark for the Pandas API for single-machine workloads.

Target Audience: Data scientists, researchers in data management, and anyone who cares about the performance of Pandas and its alternatives.

Comparison: PandasBench is the largest Pandas API benchmark to date, with 102 notebooks and 3,721 cells. We used it to evaluate Modin, Dask, Koalas, and Dias on randomly selected real-world notebooks from Kaggle, which makes this the largest-scale evaluation of any of these techniques to date.

We used PandasBench to show that slowdowns on these single-machine notebooks are the norm, and we also identified many failures of these systems. Read more in our blog post.
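
To make "slowdown" concrete: a common convention is to report speedup as baseline time divided by candidate time, so any value below 1.0 means the replacement was slower than Pandas. The snippet below is only a minimal sketch of that arithmetic, not PandasBench's actual harness. The names `best_time`, `speedup`, `baseline_cell`, and `candidate_cell` are hypothetical, and the two "cells" are plain-Python stand-ins for the same notebook cell run under Pandas and under a replacement library.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time (in seconds) over `repeats` runs of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Speedup of the candidate over the baseline; a value below 1.0 is a slowdown."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


# Hypothetical stand-ins for one notebook cell executed under two libraries.
def baseline_cell() -> int:
    return sum(range(100_000))


def candidate_cell() -> int:
    # Does extra work (materializes a list) before summing.
    return sum(list(range(100_000)))


if __name__ == "__main__":
    ratio = speedup(best_time(baseline_cell), best_time(candidate_cell))
    label = "slowdown" if ratio < 1.0 else "speedup"
    print(f"{ratio:.2f}x ({label})")
```

Taking the minimum over several runs is one simple way to reduce timing noise. The paper describes the methodology actually used.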

Blog post: https://adapt.cs.illinois.edu/projects/PandasBench.html
Repository: https://github.com/ADAPT-uiuc/PandasBench
Paper (open access): https://arxiv.org/abs/2506.02345

submitted by /u/baziotis to r/Python