Docling parses documents in formats such as PDF, DOCX, PPTX, HTML, and images, and converts them to Markdown and JSON. It also supports OCR for scanned PDFs and integrates with AI frameworks, which makes it a practical way to prepare documents for generative AI applications.
Key Features
- Versatile Format Support: Reads popular document formats (PDF, DOCX, PPTX, images, HTML, AsciiDoc, and Markdown) and exports them to Markdown and JSON.
- Advanced Document Understanding: Parses PDFs with awareness of page layout, reading order, and table structure.
- Unified Document Representation: Uses the DoclingDocument format as a single, expressive representation for converted documents.
- Integration-Friendly: Integrates with LlamaIndex 🦙 and LangChain 🦜🔗 for Retrieval-Augmented Generation (RAG) and Question Answering (QA) applications.
- OCR Functionality: Extracts text from scanned PDFs through Optical Character Recognition (OCR).
- User-Friendly CLI: Offers a simple command-line interface for converting documents.
See the documentation for examples and usage tips.
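For the RAG use case listed among the features, converted Markdown is usually split into smaller chunks before it is indexed. The sketch below is a minimal, framework-independent illustration in plain Python. `split_markdown_by_heading` is a hypothetical helper, not part of Docling, LlamaIndex, or LangChain, and a real pipeline would more likely rely on the chunkers those frameworks provide.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown text into one chunk per heading section.

    Hypothetical helper for illustration only; it is not a Docling API.
    Each chunk is a dict with the heading text ("title", or None for
    text that appears before the first heading) and the section body.
    """
    chunks = []
    title = None
    body_lines = []

    def flush():
        text = "\n".join(body_lines).strip()
        # Skip the empty preamble, but keep headings that have no body.
        if title is not None or text:
            chunks.append({"title": title, "text": text})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # close the previous section
            title = match.group(2).strip()
            body_lines = []
        else:
            body_lines.append(line)

    flush()  # close the final section
    return chunks
```

The input would typically be a Markdown string exported from a converted document. This sketch does not treat fenced code blocks specially, so a `#` comment at the start of a line inside a code block would be read as a heading.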
What's Coming Next?
- Equation & Code Extraction: Extraction of mathematical equations and code snippets is planned.
- Metadata Extraction: Planned support for extracting document metadata, including title, authors, references, and language.
- Native LangChain Extension: A native LangChain extension is planned.
Quick Start Example
To convert a single document and print it as Markdown:
```python
from docling.document_converter import DocumentConverter

# Local file path or URL of the document to convert
source = "https://arxiv.org/pdf/2408.09869"

converter = DocumentConverter()
result = converter.convert(source)

# Print the converted document as Markdown
print(result.document.export_to_markdown())
```
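Since Docling targets both Markdown and JSON, it can be useful to write both exports to disk instead of printing them. The following is a minimal sketch, not an official recipe. `save_exports` and `convert_and_save` are hypothetical helper names, and the sketch assumes that the converted document exposes `export_to_dict()` alongside `export_to_markdown()` in your installed Docling version.

```python
import json
from pathlib import Path


def save_exports(document, out_dir, stem="document"):
    """Write a converted document to <stem>.md and <stem>.json in out_dir.

    Hypothetical helper. `document` is assumed to provide
    export_to_markdown() and export_to_dict(), as the converted
    Docling document is expected to.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    md_file = out_path / f"{stem}.md"
    json_file = out_path / f"{stem}.json"

    md_file.write_text(document.export_to_markdown(), encoding="utf-8")
    json_file.write_text(
        json.dumps(document.export_to_dict(), indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return md_file, json_file


def convert_and_save(source, out_dir):
    """Convert a local path or URL with Docling and save both exports."""
    # Imported here so that save_exports works without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return save_exports(result.document, out_dir)
```

For example, `convert_and_save("https://arxiv.org/pdf/2408.09869", "output")` would be expected to create `output/document.md` and `output/document.json`.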
For help, ask the community in the project's discussion section. For in-depth information about the project, see the Docling Technical Report.
Join us in advancing document processing with Docling, a project from IBM and open-source contributors!