Pandas is the driving force behind millions of notebooks (estimates suggest that almost every other notebook uses Pandas), and multiple replacements have been created, such as Modin, Dask, and Koalas. Yet there is no benchmark for the Pandas API.
We’re announcing PandasBench.
What my project does: PandasBench is the first systematic effort to create a benchmark for the Pandas API for single-machine workloads.
Target Audience: Data scientists, researchers in data management, and anyone who cares about the performance of Pandas and its alternatives.
Comparison: PandasBench is the largest Pandas API benchmark to date, with 102 notebooks and 3,721 cells. We used it to evaluate Modin, Dask, Koalas, and Dias on randomly selected real-world notebooks from Kaggle, making this the largest-scale evaluation of any of these techniques to date.
Using PandasBench, we showed that slowdowns on these single-machine notebooks are the norm, and we also identified many failures of these systems. Read more in our blog post.
Blog post: https://adapt.cs.illinois.edu/projects/PandasBench.html
Repository: https://github.com/ADAPT-uiuc/PandasBench
Paper (open access): https://arxiv.org/abs/2506.02345
submitted by /u/baziotis to r/Python