Tags: pdf*


  1. ReaderView is a productivity tool designed to transform cluttered web pages and PDFs into calm, readable documents. By stripping away distracting elements like ads, sidebars, and awkward layouts, it allows users to focus entirely on the content. The app is ideal for individuals who save numerous articles throughout the day and want a streamlined way to consume them later. Key features include the ability to highlight important sections, add personal notes, and save items for offline reading to solve the "read it later" dilemma. Additionally, users can customize their reading experience with preferred fonts and themes, and even share specific passages or full texts with others.
  2. LiteParse is a lightweight, open‑source PDF parsing tool that delivers fast, high‑quality spatial text extraction with bounding boxes. Built on PDF.js and Tesseract.js, it runs entirely locally without cloud dependencies, supporting PDF, Office, and image formats via automatic conversion. Users can parse documents via a CLI or as a library, generate high‑resolution screenshots, and integrate custom OCR servers through a simple API. Ideal for production pipelines, LiteParse offers JSON or text outputs, precise bounding boxes, and multi‑platform support across Linux, macOS, and Windows.
  3. MinerU is a tool that converts PDFs into machine-readable formats (e.g., Markdown, JSON) so their content can be extracted and reused in other formats.
  4. This repository contains the source code for the summarize-and-chat project, a unified framework for document summarization and chat with LLMs. It aims to address the challenges of building a scalable summarization solution while supporting natural language interaction through chat interfaces.
  5. Docling simplifies document processing by parsing diverse formats, including advanced PDF understanding, and provides integrations with the gen AI ecosystem.
    2025-05-25, by klotz
  6. This article details building a Retrieval-Augmented Generation (RAG) system to assist with research paper tasks, specifically question answering over a PDF document. It covers document loading, splitting, embedding with Sentence Transformers, using ChromaDB as a vector database, and implementing a query interface with LangChain.
  7. This article details a method for converting PDFs to Markdown using a local LLM (Gemma 3 via Ollama), focusing on privacy and efficiency. It involves rendering PDF pages as images and then using the LLM for content extraction, even from scanned PDFs.
    2025-04-16, by klotz
  8. A toolkit for training language models to work with PDF documents in the wild, including prompting strategies, evaluation tools, filtering, finetuning code, and processing PDFs through finetuned models.
  9. The article discusses the process of preparing PDFs for use in Retrieval-Augmented Generation (RAG) systems, with a focus on creating graph-based RAGs from annual reports containing tables. It highlights the benefits of Graph RAGs over vector store-backed RAGs, particularly in terms of reasoning capabilities, and explores the construction of knowledge graphs for better information retrieval. The author shares insights into the challenges and solutions involved in building an enterprise-ready graph data store for RAG applications.
    2025-01-20, by klotz
  10. MarkItDown is a utility for converting various files to Markdown, including PDF, PowerPoint, Word, Excel, Images, Audio, HTML, text-based formats, and ZIP files.


SemanticScuttle - klotz.me: tagged with "pdf"
