Docling is a tool that parses documents and exports them to formats such as Markdown and JSON. It supports input formats including PDF, DOCX, PPTX, images, HTML, AsciiDoc, and Markdown.
A post on new techniques for parsing and searching PDFs, centered on converting them into a hierarchical structure for retrieval-augmented generation (RAG) search. The approach generates chunks dynamically for searches and sends the relevant chunks to the language model together with their headers and sub-headers.
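As a rough illustration of the idea described above, the sketch below walks a small document tree, produces one chunk per paragraph, and attaches the chain of headers and sub-headers to each chunk before it would be passed to a language model. It is not the post's implementation: `Section`, `Chunk`, `iter_chunks`, `search`, and `as_prompt_text` are hypothetical names, and the keyword-overlap scoring is an assumed stand-in for a real vector search.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Section:
    """Hypothetical node in a parsed document tree: a header plus its content."""
    title: str
    paragraphs: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


@dataclass
class Chunk:
    """A paragraph together with the headers that lead to it."""
    header_path: Tuple[str, ...]
    text: str

    def as_prompt_text(self) -> str:
        # Prefix the chunk with its header chain so the model sees the context.
        return " > ".join(self.header_path) + "\n" + self.text


def iter_chunks(section: Section, parents: Tuple[str, ...] = ()) -> Iterator[Chunk]:
    """Yield chunks in document order, generated on demand from the tree."""
    path = parents + (section.title,)
    for paragraph in section.paragraphs:
        yield Chunk(path, paragraph)
    for child in section.children:
        yield from iter_chunks(child, path)


def search(root: Section, query: str, top_k: int = 2) -> List[Chunk]:
    """Rank chunks by shared words with the query (assumed toy scoring)."""
    terms = set(query.lower().split())
    scored = []
    for chunk in iter_chunks(root):
        score = len(terms & set(chunk.text.lower().split()))
        if score > 0:
            scored.append((score, chunk))
    # The sort is stable, so ties keep document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


if __name__ == "__main__":
    doc = Section("User Guide", children=[
        Section("Installation", ["Run the installer and accept the license."]),
        Section("Configuration", children=[
            Section("Logging", ["Set the log level to debug for verbose output."]),
        ]),
    ])
    for hit in search(doc, "log level"):
        print(hit.as_prompt_text())
```

Carrying the header path with each chunk means a short paragraph such as the logging one still tells the model which part of the document it came from, which is the motivation the post gives for sending headers alongside chunks.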
The llmsherpa project provides APIs to accelerate Large Language Model (LLM) projects. Its features include LayoutPDFReader for PDF text parsing, smart chunking for vector search and retrieval-augmented generation, and table analysis. It is open source under the Apache 2.0 license.