TL;DR: Comprehensive benchmarks of Kreuzberg, Docling, MarkItDown, and Unstructured across 94 real-world documents. Results might surprise you.
📊 Live Results: https://goldziher.github.io/python-text-extraction-libs-benchmarks/
Context
As the author of Kreuzberg, I wanted to create an honest, comprehensive benchmark of Python text extraction libraries. No cherry-picking, no marketing fluff – just real performance data across 94 documents (~210MB) ranging from tiny text files to 59MB academic papers.
Full disclosure: I built Kreuzberg, but these benchmarks are automated, reproducible, and the methodology is completely open-source.
🔬 What I Tested
Libraries Benchmarked:
- Kreuzberg (71MB, 20 deps) – my library
- Docling (1,032MB, 88 deps) – IBM's ML-powered solution
- MarkItDown (251MB, 25 deps) – Microsoft's Markdown converter
- Unstructured (146MB, 54 deps) – enterprise document processing
Test Coverage:
- 94 real documents: PDFs, Word docs, HTML, images, spreadsheets
- 5 size categories: Tiny (<100KB) to Huge (>50MB)
- 6 languages: English, Hebrew, German, Chinese, Japanese, Korean
- CPU-only processing: no GPU acceleration, for a fair comparison
- Multiple metrics: speed, memory usage, success rates, installation sizes
🏆 Results Summary
Speed Champions 🚀
- Kreuzberg: 35+ files/second, handles everything
- Unstructured: moderate speed, excellent reliability
- MarkItDown: good on simple docs, struggles with complex files
- Docling: often 60+ minutes per file (!!)
Installation Footprint 📦
- Kreuzberg: 71MB, 20 dependencies ⚡
- Unstructured: 146MB, 54 dependencies
- MarkItDown: 251MB, 25 dependencies (includes ONNX)
- Docling: 1,032MB, 88 dependencies 🐘
Reality Check ⚠️
- Docling: frequently fails or times out on medium files (>1MB)
- MarkItDown: struggles with large/complex documents (>10MB)
- Kreuzberg: consistent across all document types and sizes
- Unstructured: most reliable overall (88%+ success rate)
🎯 When to Use What
⚡ Kreuzberg (Disclaimer: I built this)
- Best for: production workloads, edge computing, AWS Lambda
- Why: smallest footprint (71MB), fastest speed, handles everything
- Bonus: both sync and async APIs with OCR support (a hedged usage sketch follows below)
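To make the sync/async point concrete, here is a minimal sketch of how the two entry points can be used. The function names (`extract_file`, `extract_file_sync`), the `.content` attribute, and the example file names are assumptions based on my reading of the Kreuzberg docs; check the library's documentation for the exact signatures in your installed version.

```python
# Minimal sketch of Kreuzberg's sync and async entry points.
# Assumes extract_file / extract_file_sync and a result object with a
# .content attribute; file names are placeholders.
import asyncio

from kreuzberg import extract_file, extract_file_sync


def extract_one_sync(path: str) -> str:
    # Blocking call: convenient for scripts and CLI tools.
    result = extract_file_sync(path)
    return result.content


async def extract_many(paths: list[str]) -> list[str]:
    # Async call: lets a service extract several documents concurrently.
    results = await asyncio.gather(*(extract_file(p) for p in paths))
    return [r.content for r in results]


if __name__ == "__main__":
    print(extract_one_sync("report.pdf")[:200])
    print(asyncio.run(extract_many(["a.pdf", "b.docx"]))[0][:200])
```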
🏢 Unstructured
- Best for: enterprise applications, mixed document types
- Why: most reliable overall, good enterprise features
- Trade-off: moderate speed, larger installation
📝 MarkItDown
- Best for: simple documents, LLM preprocessing
- Why: good for basic PDFs/Office docs, optimized for Markdown output
- Limitation: fails on large/complex files
🔬 Docling
- Best for: research environments (if you have patience)
- Why: advanced ML document understanding
- Reality: extremely slow, frequent timeouts, 1GB+ install
📈 Key Insights
- Installation size matters: Kreuzberg's 71MB vs Docling's 1GB+ makes a huge difference for deployment
- Performance varies dramatically: 35 files/second vs 60+ minutes per file
- Document complexity is crucial: simple PDFs and complex layouts show very different results
- Reliability vs features: sometimes the simplest solution works best
🔧 Methodology
- Automated CI/CD: GitHub Actions run benchmarks on every release
- Real documents: academic papers, business docs, multilingual content
- Multiple iterations: 3 runs per document, with statistical analysis
- Open source: full code, test documents, and results available
- Memory profiling: psutil-based resource monitoring
- Timeout handling: 5-minute limit per extraction (a rough sketch of this kind of harness follows below)
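The following is not the benchmark's actual code, just a rough sketch of how a psutil-based memory probe and a per-extraction timeout can be combined; the real harness lives in the repository linked above. The `extract` callable, the 300-second default, and the sampling interval are stand-ins.

```python
# Illustrative only: measure wall time and peak RSS for one extraction,
# with a hard timeout. The real benchmark harness is in the repository.
import concurrent.futures
import threading
import time

import psutil


def run_with_metrics(extract, path, timeout_s=300, sample_interval=0.05):
    """Run one extraction, returning status, wall time, and peak RSS."""
    proc = psutil.Process()
    peak_rss = proc.memory_info().rss
    stop = threading.Event()

    def sample_memory():
        nonlocal peak_rss
        while not stop.is_set():
            peak_rss = max(peak_rss, proc.memory_info().rss)
            time.sleep(sample_interval)

    sampler = threading.Thread(target=sample_memory, daemon=True)
    sampler.start()

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    start = time.perf_counter()
    try:
        pool.submit(extract, path).result(timeout=timeout_s)
        status = "ok"
    except concurrent.futures.TimeoutError:
        # A thread cannot be killed; a production harness would run the
        # extraction in a subprocess so a hung library can be terminated.
        status = "timeout"
    except Exception:
        status = "error"
    finally:
        elapsed = time.perf_counter() - start
        stop.set()
        sampler.join()
        pool.shutdown(wait=False, cancel_futures=True)

    return {"status": status, "seconds": elapsed, "peak_rss_mb": peak_rss / 2**20}
```

Called per document (e.g. `run_with_metrics(extract_file_sync, "sample.pdf")`), the returned dicts can then be aggregated over the three iterations per file.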
🤔 Why I Built This
While working on Kreuzberg's performance and stability, I wanted a tool to measure it against other frameworks, one I could also use to further develop and improve Kreuzberg itself. That's why I created this benchmark. Since it was fun, I invested some extra time in it. The benchmark:
- Uses real-world documents, not synthetic tests
- Tests installation overhead (often ignored)
- Includes failure analysis (libraries fail more often than you'd think)
- Is completely reproducible and open
- Updates automatically with new releases
📊 Data Deep Dive
The interactive dashboard shows some fascinating patterns:
- Kreuzberg dominates on speed and resource usage across all categories
- Unstructured excels at complex layouts and has the best reliability
- MarkItDown's usefulness for simple docs shows clearly in the data
- Docling's ML models create massive overhead for most use cases, making it a hard sell
🚀 Try It Yourself
```bash
git clone https://github.com/Goldziher/python-text-extraction-libs-benchmarks.git
cd python-text-extraction-libs-benchmarks
uv sync --all-extras
uv run python -m src.cli benchmark --framework kreuzberg_sync --category small
```
Or just check the live results: https://goldziher.github.io/python-text-extraction-libs-benchmarks/
🔗 Links
- 📊 Live Benchmark Results: https://goldziher.github.io/python-text-extraction-libs-benchmarks/
- 📁 Benchmark Repository: https://github.com/Goldziher/python-text-extraction-libs-benchmarks
- ⚡ Kreuzberg (my library): https://github.com/Goldziher/kreuzberg
- 🔬 Docling: https://github.com/DS4SD/docling
- 📝 MarkItDown: https://github.com/microsoft/markitdown
- 🏢 Unstructured: https://github.com/Unstructured-IO/unstructured
🤝 Discussion
What’s your experience with these libraries? Any others I should benchmark? I tried benchmarking marker, but the setup required a GPU.
Some important points regarding how I used these benchmarks for Kreuzberg:
- I fine-tuned Kreuzberg's default settings based on the results.
- I updated our docs with recommendations for different settings per use case. For example, Kreuzberg can reach about 75% reliability at the cost of roughly a 15% slow-down.
- I made a best effort to configure the other frameworks following the best practices in their docs and using their out-of-the-box defaults. If you think something is off or needs adjustment, let me know here or open an issue in the repository.