How to read and convert PDFs to Markdown for better RAG results with LLMs.
Docling is a tool that parses documents and exports them to formats such as Markdown and JSON. It supports PDF, DOCX, PPTX, images, HTML, AsciiDoc, and Markdown as input, and provides advanced PDF understanding, metadata extraction, and integration with LlamaIndex and LangChain for RAG / QA applications.
A guided series of tutorials/notebooks for building a PDF-to-podcast workflow using Llama models for text processing, transcript writing, dramatization, and text-to-speech conversion.
- Approximate Tokens, Words and Characters Calculator for LLMs and Text Trimmer — Simple calculator to estimate token counts for Large Language Models, plus a text editor for trimming text
- Text File Merger for LLM — This tool combines multiple text files into a single document, with clear separation between files
- PDF to TXT Converter — Convert PDF documents to plain text format for use with LLMs and text analysis
- HTML to TXT Converter — Remove HTML tags and extract clean text content for LLM processing
- LLM System Prompt Generator — Generate optimized system prompts for different LLM sizes (3B, 33B, 70B, etc.)
- Creative Idea Generator — AI-powered brainstorming tool for generating creative solutions and ideas
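
The token calculator, text trimmer, and file merger in the list above can be approximated in a few lines. This is a minimal sketch, not the code behind those tools: the 4-characters-per-token ratio is an assumed rule of thumb (real tokenizers vary by model and language), and the `===== name =====` separator and all function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The chars_per_token ratio is an assumption."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def trim_to_tokens(text: str, max_tokens: int, chars_per_token: float = 4.0) -> str:
    """Cut text to roughly max_tokens using the same character heuristic."""
    return text[: int(max_tokens * chars_per_token)]


def merge_texts(named_texts: list[tuple[str, str]]) -> str:
    """Join (name, body) pairs into one document with a visible separator per file."""
    parts = []
    for name, body in named_texts:
        # The separator format is illustrative; any unambiguous marker works.
        parts.append(f"===== {name} =====\n{body.strip()}\n")
    return "\n".join(parts)


merged = merge_texts([("a.txt", "hello"), ("b.txt", "world")])
print(merged)
print("approx. tokens:", estimate_tokens(merged))
```

For exact counts, the tokenizer of the target model should be used instead of a character ratio.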
This blog post explores scaling ColPali for efficient document retrieval across large collections of PDFs using Vespa's phased retrieval and ranking pipeline, including the use of a hamming-based MaxSim similarity function.
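
To illustrate the idea of a hamming-based MaxSim, here is a small sketch in plain Python. It assumes each token or patch embedding has been binarized and packed into an integer, and it assumes `1 / (1 + hamming_distance)` as the similarity; that formula is an illustrative choice here, not necessarily the exact expression used in the post. MaxSim then takes, for each query vector, the best-matching document vector and sums those maxima.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two bit-packed vectors."""
    return bin(a ^ b).count("1")


def maxsim_hamming(query_vecs: list[int], doc_vecs: list[int]) -> float:
    """Sum over query vectors of the best inverse-hamming similarity to any doc vector."""
    if not doc_vecs:
        return 0.0
    score = 0.0
    for q in query_vecs:
        # Assumed similarity: 1 / (1 + hamming distance), higher is closer.
        score += max(1.0 / (1.0 + hamming(q, d)) for d in doc_vecs)
    return score


query = [0b1010, 0b1111]
doc = [0b1010, 0b0000]
print(maxsim_hamming(query, doc))
```

Because hamming distance is cheap on packed bits, a score like this is a plausible fit for an early, coarse ranking phase, with a more precise float-based MaxSim reserved for the top candidates.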
IncarnaMind enables chatting with personal documents (PDF, TXT) using Large Language Models (LLMs) like GPT. It uses a Sliding Window Chunking mechanism and Ensemble Retriever for efficient querying.
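
As a generic illustration of sliding-window chunking, the sketch below splits a word list into overlapping windows. It is not IncarnaMind's implementation: the fixed `window` and `stride` parameters and the function name are assumptions made for the example.

```python
def sliding_window_chunks(words: list[str], window: int, stride: int) -> list[list[str]]:
    """Split words into overlapping windows; consecutive windows share window - stride words."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + window])
        if start + window >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


for chunk in sliding_window_chunks("a b c d e f g".split(), window=4, stride=2):
    print(" ".join(chunk))
```

A stride smaller than the window gives overlap, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.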
A post discussing new techniques developed for parsing and searching PDFs, focusing on turning them into a hierarchical structure for RAG search. The approach involves dynamically generating chunks for searches, sending headers and sub-headers to the Language Model along with relevant chunks.
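
The idea of sending headers and sub-headers along with each chunk can be sketched as follows. This assumes the PDF has already been converted to Markdown with `#`-style headings; the function names and the `headers`/`text` dictionary layout are hypothetical and not taken from the post.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunks_with_header_path(markdown: str) -> list[dict]:
    """Split Markdown at headings and attach the enclosing header path to each chunk."""
    path: list[tuple[int, str]] = []  # stack of (level, title) for the current position
    chunks: list[dict] = []
    buffer: list[str] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"headers": [title for _, title in path], "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sections at the same or a deeper level
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks


def format_for_prompt(chunk: dict) -> str:
    """Prefix the chunk text with its header trail so the model sees where it came from."""
    return " > ".join(chunk["headers"]) + "\n" + chunk["text"]


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n# Appendix\nExtra."
for c in chunks_with_header_path(doc):
    print(format_for_prompt(c), end="\n\n")
```

Keeping the header trail separate from the text lets a retriever index the body while the prompt builder decides at query time how much of the hierarchy to include.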
The llmsherpa project provides APIs to accelerate Large Language Model (LLM) projects. It includes features like LayoutPDFReader for PDF text parsing, smart chunking for vector search and Retrieval Augmented Generation, and table analysis. It is open-sourced under the Apache 2.0 license.
We introduce LayoutLM, a well-known model developed by Microsoft for extracting information from documents. To tailor a solution to our specific needs, we label our documents using Label Studio, an open-source labeling tool, connected to our remote storage on AWS S3.