I’ve been working on a Python tool called epub-utils that lets you inspect and extract data from EPUB files directly from the command line. I just shipped some major updates and wanted to share what it can do.
What My Project Does
A command-line tool that treats EPUB files like objects you can query:
    pip install epub-utils

    # Quick metadata extraction
    epub-utils book.epub metadata --format kv
    # title: The Great Gatsby
    # creator: F. Scott Fitzgerald
    # language: en
    # publisher: Scribner

    # See the complete structure
    epub-utils book.epub manifest
    epub-utils book.epub spine
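For context on where those metadata fields live: an EPUB is a ZIP archive whose META-INF/container.xml points at a package document (the .opf file) holding Dublin Core elements. Below is a minimal standard-library sketch of that lookup. It is not how epub-utils is implemented; the function name `read_metadata` is hypothetical, and the sketch only reads the first occurrence of four fields.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the OCF container, OPF package, and Dublin Core specs.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_metadata(epub):
    """Return (package version, dict of a few Dublin Core fields).

    `epub` may be a path or a binary file-like object. Hypothetical helper,
    not part of epub-utils.
    """
    with zipfile.ZipFile(epub) as zf:
        # META-INF/container.xml names the package document (the .opf file).
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    fields = {}
    for name in ("title", "creator", "language", "publisher"):
        node = package.find(f"opf:metadata/dc:{name}", NS)
        if node is not None and node.text:
            fields[name] = node.text.strip()
    # The version attribute is "2.0" for EPUB 2 packages and "3.0" for EPUB 3.
    return package.get("version"), fields
```

Calling `read_metadata("book.epub")` on a well-formed file would return the package version plus whichever of the four fields are present.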
Target Audience
Developers building publishing tools that make heavy use of EPUB archives.
Comparison
I kept running into situations where I needed to peek inside EPUB files: checking metadata for publishing workflows, extracting content for analysis, debugging malformed files. I was simply using the unzip command for this, but it didn’t give me the structured data access I wanted for scripting. epub-utils instead lets you inspect specific parts of the archive.
The files command lets you access any file in the EPUB by its path relative to the archive root:
    # List all files with compression info
    epub-utils book.epub files

    # Extract specific files directly
    epub-utils book.epub files OEBPS/chapter1.xhtml --format plain
    epub-utils book.epub files OEBPS/styles/main.css
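To illustrate the kind of information a file listing can draw on, here is a rough standard-library sketch that reads member names, sizes, and compression details straight from the ZIP directory. The helper names `list_files` and `read_file` are hypothetical, and the dictionary layout is an assumption for illustration, not the tool's actual output format.

```python
import zipfile


def list_files(epub):
    """Return one dict per archive member with its size and compression details."""
    rows = []
    with zipfile.ZipFile(epub) as zf:
        for info in zf.infolist():
            rows.append({
                "path": info.filename,              # path relative to the archive root
                "size": info.file_size,             # uncompressed size in bytes
                "compressed": info.compress_size,   # size as stored in the archive
                "stored": info.compress_type == zipfile.ZIP_STORED,
            })
    return rows


def read_file(epub, path):
    """Return the raw bytes of one member, addressed by its path from the archive root."""
    with zipfile.ZipFile(epub) as zf:
        return zf.read(path)
```

In a valid EPUB the `mimetype` entry is expected to come first and to be stored uncompressed, so the `stored` flag on the first row is a quick sanity check.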
Content extraction by manifest ID:
    # Get chapter text for analysis
    epub-utils book.epub content chapter1 --format plain
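A manifest ID is just the `id` attribute of an `item` in the package document, whose `href` is resolved relative to the folder that holds the .opf file. The sketch below shows that resolution by hand, followed by a crude tag-stripping step. It assumes UTF-8 content and hrefs that are not percent-encoded; `content_by_id` and `xhtml_to_text` are hypothetical names, not the epub-utils API.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def content_by_id(epub, item_id):
    """Return the decoded markup of the manifest item with the given id."""
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))
        for item in package.iterfind("opf:manifest/opf:item", NS):
            if item.get("id") == item_id:
                # Manifest hrefs are relative to the package document's folder.
                href = posixpath.join(posixpath.dirname(opf_path), item.get("href"))
                return zf.read(href).decode("utf-8")  # assumes UTF-8 content
    raise KeyError(item_id)


def xhtml_to_text(markup):
    """Crude plain-text conversion: drop tags and collapse whitespace."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(" ".join(collector.parts).split())
```

Joining text nodes with a space keeps headings and paragraphs apart, at the cost of inserting a space around inline tags such as `<em>`, which is acceptable for rough text analysis but not for faithful rendering.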
Pretty-printing for all XML output:
    epub-utils book.epub package --pretty-print
A Python API is also available:
    from epub_utils import Document

    doc = Document("book.epub")

    # Direct attribute access to metadata
    print(f"Title: {doc.package.metadata.title}")
    print(f"Author: {doc.package.metadata.creator}")

    # File system access
    css_content = doc.get_file_by_path('OEBPS/styles/main.css')
    chapter_text = doc.find_content_by_id('chapter1').to_plain()
epub-utils handles both EPUB 2.0.1 and EPUB 3.0+ with proper Dublin Core metadata parsing and W3C specification adherence.
It makes it easy to:
- Automate publishing pipeline validation
- Debug EPUB structure issues
- Extract metadata for catalogs
- Quickly inspect EPUB without opening GUI apps
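As one way to approach the pipeline-validation case, the sketch below shells out to the CLI and checks that required fields are present. It assumes the `metadata --format kv` invocation and the `key: value` output shape from the example earlier in this post; `parse_kv`, `missing_fields`, and the default list of required fields are hypothetical choices, not part of epub-utils.

```python
import subprocess


def parse_kv(text):
    """Parse 'key: value' lines into a dict, keeping the first value seen per key."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields.setdefault(key.strip(), value.strip())
    return fields


def missing_fields(epub_path, required=("title", "creator", "language")):
    """Run the CLI on one file and return the required fields it did not report.

    Assumes `epub-utils` is on PATH and prints one 'key: value' pair per line.
    """
    result = subprocess.run(
        ["epub-utils", epub_path, "metadata", "--format", "kv"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    found = parse_kv(result.stdout)
    return [name for name in required if not found.get(name)]
```

A CI step could call `missing_fields` on each built EPUB and fail the job when the returned list is non-empty.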
The tool is still in alpha (version 0.0.0a5) but the API is stabilising. I’ve been using it daily for EPUB work and it’s saved me tons of time.
GitHub: https://github.com/ernestofgonzalez/epub-utils
PyPI: https://pypi.org/project/epub-utils/
Would love feedback from anyone else working with EPUB files programmatically!
submitted by /u/makeascript to r/Python