SemanticScuttle - klotz.me

Tags: pdf*

Parse PDFs for RAG locally with Docling (Rich Tables, No Cloud Upload)

This article examines Docling, a tool from IBM Research that converts complex PDF documents into structured Markdown or JSON for RAG applications. It offers a local-first approach to ensure data privacy and provides high-fidelity extraction of rich tables and layouts without relying on cloud services.

2026-06-14 Tags: docling, pdf, rag, ibm by klotz

Beyond extract_text: The two layers of a PDF that drive RAG quality

This article examines why basic text extraction from PDFs often falls short when building Retrieval-Augmented Generation (RAG) pipelines. It highlights how discarding visual layout information strips away semantic context, which hurts model accuracy and retrieval performance. The author introduces two critical layers within a document: the physical layer, consisting of raw character data and coordinates, and the logical layer, which constructs meaning through structural elements such as headings, tables, and multi-column layouts.
- Why standard text extraction limits RAG performance
- Understanding physical versus logical PDF layers
- The role of layout awareness in preserving semantic context
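
The gap between the physical and logical layers can be illustrated with a small sketch. It is not taken from the article: the `Word` tuple, the sample coordinates, and the fixed `column_split` threshold are all assumptions made for illustration, and a real layout analyzer would detect columns rather than take the split as a parameter.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word, in page units
    y: float  # top edge of the word; smaller means higher on the page


# Physical layer: words with coordinates, roughly what a basic extractor sees.
# Hypothetical two-column page: left column near x=50, right column near x=300.
WORDS = [
    Word("Left", 50, 100), Word("one", 90, 100),
    Word("Right", 300, 100), Word("one", 350, 100),
    Word("Left", 50, 120), Word("two", 90, 120),
    Word("Right", 300, 120), Word("two", 350, 120),
]


def naive_order(words):
    """Read strictly top-to-bottom, then left-to-right, ignoring columns."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))


def column_aware_order(words, column_split):
    """Logical layer (simplified): read the left column fully, then the right."""
    left = [w for w in words if w.x < column_split]
    right = [w for w in words if w.x >= column_split]
    return naive_order(left) + " " + naive_order(right)


NAIVE = naive_order(WORDS)
LOGICAL = column_aware_order(WORDS, column_split=200)
```

Here `NAIVE` interleaves the two columns line by line, while `LOGICAL` keeps each column's text contiguous, which is the kind of context a retriever needs in order to return coherent passages.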

2026-06-12 Tags: rag, pdf, parsing, document layout analysis, information retrieval, llm by klotz

5 Useful Python Scripts to Automate Boring PDF Tasks

This article provides five practical Python scripts designed to automate repetitive and tedious PDF management tasks, facilitating efficient batch processing for various document workflows.

The featured capabilities include:

* pypdf: Merging and splitting PDFs
* pdfplumber: Extracting text and tables
* reportlab: Applying stamps and watermarks
* pymupdf: Redacting sensitive content
* pypdf/pdfplumber: Generating metadata inventories

2026-06-11 Tags: python, pdf, data extraction, document processing, bala priya c by klotz

ReaderView

ReaderView is a productivity tool designed to transform cluttered web pages and PDFs into calm, readable documents. By stripping away distracting elements like ads, sidebars, and awkward layouts, it allows users to focus entirely on the content. The app is ideal for individuals who save numerous articles throughout the day and want a streamlined way to consume them later. Key features include the ability to highlight important sections, add personal notes, and save items for offline reading to solve the "read it later" dilemma. Additionally, users can customize their reading experience with preferred fonts and themes, and even share specific passages or full texts with others.

2026-04-07 Tags: readerview, reading mode, pdf, laurent denoue, ios, app by klotz

LiteParse

LiteParse is a lightweight, open‑source PDF parsing tool that delivers fast, high‑quality spatial text extraction with bounding boxes. Built on PDF.js and Tesseract.js, it runs entirely locally without cloud dependencies, supporting PDF, Office, and image formats via automatic conversion. Users can parse documents via a CLI or as a library, generate high‑resolution screenshots, and integrate custom OCR servers through a simple API. Ideal for production pipelines, LiteParse offers JSON or text outputs, precise bounding boxes, and multi‑platform support across Linux, macOS, and Windows.

2026-03-21 Tags: pdf, ocr, text‑extraction, pdf‑parser, document‑processing by klotz

MinerU

MinerU is a tool that converts PDFs into machine-readable formats such as Markdown and JSON, making their content easier to extract and reuse.

2026-01-04 Tags: pdf, markdown, json, ocr, latex, html, scientific literature, llm, document format conversion, foss by klotz

Summarize and Chat

This repository contains the source code for summarize-and-chat, a unified framework for document summarization and chat with LLMs. It aims to address the challenges of building a scalable summarization solution while supporting natural-language interaction through chat interfaces.

2025-08-19 Tags: summarization, chat, llm, document processing, langchain, llamaindex, ai, openai, pdf, docx, audio by klotz

Docling

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

2025-05-25 Tags: document, pdf, ocr, github, ibm, conversion by klotz

Let’s Build a RAG-Powered Research Paper Assistant

This article details building a Retrieval-Augmented Generation (RAG) system to assist with research paper tasks, specifically question answering over a PDF document. It covers document loading, splitting, embedding with Sentence Transformers, using ChromaDB as a vector database, and implementing a query interface with LangChain.
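
The load, split, embed, store, and query stages described above can be sketched with the standard library alone. This is a minimal stand-in, not the article's code: a word-count vector replaces Sentence Transformers, a plain Python list replaces ChromaDB, and a single function replaces the LangChain query interface. All function names and the sample `DOCUMENT` text are hypothetical.

```python
import math
import re
from collections import Counter


def split_text(text):
    """Split a document into sentence-sized chunks."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a sentence embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """In-memory list of (chunk, vector) pairs (stand-in for a vector database)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


DOCUMENT = (
    "Transformers rely on attention to weigh tokens. "
    "Vector databases store embeddings for similarity search. "
    "Chunking splits long papers into retrievable passages."
)

INDEX = build_index(split_text(DOCUMENT))
TOP = retrieve(INDEX, "Which component stores embeddings for similarity search?")
```

In a full pipeline the retrieved chunks in `TOP` would be placed into the LLM prompt as context; the sketch stops at retrieval because generation depends on whichever model is used.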

2025-04-23 Tags: docker, rag, langchain, sentence transformers, chromadb, vector database, pdf, llm by klotz

From PDF to Markdown with Local LLMs — Fast, Private, and Free

This article details a method for converting PDFs to Markdown using a local LLM (Gemma 3 via Ollama), focusing on privacy and efficiency. It involves rendering PDF pages as images and then using the LLM for content extraction, even from scanned PDFs.

2025-04-16 Tags: pdf, markdown, llm, self-hosted, gemma, ollama, ocr, pymupdf, pillow by klotz
