Docling

Convert messy documents into structured data. Detect tables, formulas, and reading order to simplify downstream AI processing and RAG ingestion.

Docling converts documents in formats such as PDF, DOCX, and XLSX into a single structured format. It also handles scanned pages, using an OCR engine of your choice, so the content is ready for AI, RAG, and agentic systems.
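
As a rough sketch of what such a conversion can look like from Python, the snippet below uses the `DocumentConverter` class from the `docling` package. The helper name `convert_to_markdown` and the sample path in the comment are illustrative assumptions, and the package has to be installed separately (for example with `pip install docling`).

```python
def convert_to_markdown(source: str) -> str:
    """Convert one document (local path or URL) to Markdown text."""
    # Imported lazily so this module still loads when docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    # The converted document can be exported to several formats; Markdown here.
    return result.document.export_to_markdown()


# Example call ("report.pdf" is a placeholder path):
# markdown = convert_to_markdown("report.pdf")
```

The first conversion may take a while, since the layout and table models typically need to be fetched before processing starts.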

Docling analyzes document structure and content through several key features:

  • Component Detection: Accurately identifies and extracts tables with complex cell content, mathematical formulas (converting them to LaTeX), code blocks with language classification, and lists.
  • Reading Order: Preserves the logical flow of the document by traversing components in the correct reading order and concatenating fragmented paragraphs across pages.
  • Image Analysis: Extracts pictures as image data, classifies their contents (like charts and diagrams), and groups them with their corresponding captions.
  • AI-Ready Output: Partitions documents into bite-sized chunks ready for ingestion and exports the structured data to formats like JSON, Markdown, and HTML.
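
To make the reading-order and chunking ideas above concrete, here is a minimal standard-library sketch. It is not Docling's implementation: the `Block` type, the single-column sort, the punctuation heuristic for joining paragraphs, and the `max_chars` limit are all simplifying assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Block:
    page: int   # 1-based page number
    y: float    # distance from the top of the page; smaller is higher
    kind: str   # e.g. "paragraph" or "table"
    text: str


def in_reading_order(blocks):
    """Sort blocks by page, then top to bottom (single-column assumption)."""
    return sorted(blocks, key=lambda b: (b.page, b.y))


def merge_split_paragraphs(blocks):
    """Join a paragraph that ends a page without terminal punctuation
    to the paragraph that opens the next page."""
    merged = []
    for b in blocks:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.kind == "paragraph" and b.kind == "paragraph"
                and b.page == prev.page + 1
                and not prev.text.rstrip().endswith((".", "!", "?", ":"))):
            # Record the position where the joined paragraph ends.
            joined = prev.text.rstrip() + " " + b.text.lstrip()
            merged[-1] = Block(b.page, b.y, "paragraph", joined)
        else:
            merged.append(b)
    return merged


def chunk(blocks, max_chars=200):
    """Greedily pack whole blocks into chunks of at most max_chars characters.
    A single block longer than max_chars becomes its own chunk."""
    chunks, current = [], ""
    for b in blocks:
        candidate = current + "\n\n" + b.text if current else b.text
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = b.text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Blocks arrive out of order, with one paragraph split across pages 1 and 2.
blocks = [
    Block(2, 40.0, "paragraph", "across two pages."),
    Block(1, 50.0, "paragraph", "Intro sentence."),
    Block(1, 700.0, "paragraph", "This paragraph continues"),
    Block(2, 300.0, "paragraph", "Final remarks."),
]
ordered = merge_split_paragraphs(in_reading_order(blocks))
chunks = chunk(ordered, max_chars=45)
print(json.dumps({"chunks": chunks}, indent=2))
```

Real layouts with multiple columns, footnotes, or floating figures need a layout model rather than a simple sort, which is the part a tool like Docling takes care of.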
