SemanticScuttle - klotz.me

Kreuzberg - A Polyglot Document Intelligence Framework

A polyglot document intelligence framework with a Rust core that extracts text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, TypeScript (Node/Bun/Wasm/Deno) or use via CLI, REST API, or MCP server.

2026-01-11 Tags: document-intelligence, text-extraction, metadata-extraction, pdf-extraction, ocr, table-extraction, rust, python, ruby, java, go, php, elixir, typescript, wasm, tesseract, pdfium, rag by klotz

MinerU

MinerU is a tool that converts PDFs into machine-readable formats (e.g., Markdown, JSON), so the content can be extracted and reused in other formats.

2026-01-04 Tags: pdf, markdown, json, ocr, latex, html, scientific literature, llm, document format conversion, foss by klotz

kdnuggets: Top 7 open source OCR Models

| **Model** | **Parameters (B)** | **Main Strength** | **Special Capabilities** | **Best Use Case** |
|---|---|---|---|---|
| olmOCR-2-7B-1025 | 7 | High-accuracy document OCR | GRPO RL training, equation/table OCR | Large-scale document pipelines, technical PDFs |
| PaddleOCR v5/VL | 1 | Multilingual parsing (109 langs) | Text, tables, formulas, charts, dynamic visual encoder | Global multilingual OCR, efficient inference |
| OCRFlux-3B | 3 | Markdown-accurate parsing | Cross-page merging, vLLM optimization | PDF-to-Markdown, consumer GPU friendly |
| MiniCPM-V 4.5 | 8 | State-of-the-art multimodal OCR | Video OCR, high-resolution images, fast/deep modes | Mobile/edge OCR, video understanding |
| InternVL 2.5-4B | 4 | Efficient OCR & reasoning | Dynamic tiling, strong text extraction | Resource-limited environments, multi-image/video |
| Granite Vision 3.3 2B | 2 | Visual document understanding | Charts, tables, diagrams, segmentation, multi-page QA | Enterprise document extraction |
| TrOCR Large Printed | 0.6 | Clean printed-text OCR | 16x16 patch encoder, BEiT/RoBERTa | Simple, high-quality printed text extraction |

2025-12-27 Tags: llm, ocr, kdnuggets by klotz

NVIDIA-Nemotron-Parse-v1.1

NVIDIA Nemotron Parse v1.1 is designed to understand document semantics and extract text and table elements with spatial grounding. It transforms unstructured documents into actionable, machine-usable representations.

2025-11-28 Tags: image-to-text, transformers, ocr, vlm, feature-extraction, nvidia, document understanding, table extraction by klotz

IBM Granite-Docling: End-to-end document understanding

IBM is releasing Granite-Docling-258M, an ultra-compact open-source vision-language model (VLM) for converting documents to machine-readable formats while preserving layout, tables, equations, and more. It is designed for accurate, efficient document conversion and goes beyond simple text extraction.

2025-10-14 Tags: vision language models, docling, ibm, llm, document conversion, granite-docling, ocr, rag, foss by klotz

Docling

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

2025-05-25 Tags: document, pdf, ocr, github, ibm, conversion by klotz

Custom OCR for mono-spaced line-printer listings

This document details a custom OCR program designed for recovering old computer programs from line-printer listings. It focuses on accuracy for mono-spaced fonts, even at the cost of speed, and outlines the algorithm, implementation details, and necessary preparation steps.

2025-05-16 Tags: ocr, line-printer, image processing, retrocomputing by klotz

From PDF to Markdown with Local LLMs — Fast, Private, and Free

This article details a method for converting PDFs to Markdown using a local LLM (Gemma 3 via Ollama), focusing on privacy and efficiency. It involves rendering PDF pages as images and then using the LLM for content extraction, even from scanned PDFs.

2025-04-16 Tags: pdf, markdown, llm, self-hosted, gemma, ollama, ocr, pymupdf, pillow by klotz
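The approach summarized in the entry above (render each PDF page as an image, then ask a local model to transcribe it) might look roughly like the following. This is a minimal sketch, not the article's own code: the model tag `gemma3`, the prompt text, the default Ollama address `http://localhost:11434`, and all function names are assumptions made for illustration. It uses PyMuPDF for rendering and Ollama's `/api/generate` endpoint, which accepts base64-encoded images.

```python
import base64
import json
import urllib.request

# Hypothetical prompt; the article's actual wording is not known here.
PROMPT = "Transcribe this page to Markdown. Output only the Markdown."


def render_pages_as_png(pdf_path, dpi=150):
    """Render each PDF page to PNG bytes with PyMuPDF (third-party)."""
    import fitz  # PyMuPDF; imported lazily so the rest runs without it

    with fitz.open(pdf_path) as doc:
        return [page.get_pixmap(dpi=dpi).tobytes("png") for page in doc]


def build_ollama_payload(png_bytes, model="gemma3", prompt=PROMPT):
    """Build the JSON body for Ollama's /api/generate with one page image."""
    return {
        "model": model,  # assumed model tag; use whatever is pulled locally
        "prompt": prompt,
        "images": [base64.b64encode(png_bytes).decode("ascii")],
        "stream": False,  # ask for a single JSON reply instead of chunks
    }


def page_to_markdown(png_bytes, url="http://localhost:11434/api/generate"):
    """Send one rendered page to a locally running Ollama server."""
    body = json.dumps(build_ollama_payload(png_bytes)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def pdf_to_markdown(pdf_path):
    """Convert a whole PDF by transcribing its pages one at a time."""
    pages = render_pages_as_png(pdf_path)
    return "\n\n".join(page_to_markdown(png) for png in pages)
```

Because every page is rendered to an image first, the same path covers scanned PDFs with no text layer. The entry's tags also mention Pillow, which this sketch leaves out; it would presumably be used to resize or re-encode pages before sending them.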

AI models make precise copies of cuneiform characters

Machine learning models can now accurately replicate cuneiform characters from photos of ancient tablets, making complex scripts easier to read. The ProtoSnap approach aligns a prototype character with its individual variations on tablets, enabling precise reproduction. This improves optical character recognition, particularly the identification of rare and varied characters, and could significantly increase the availability of ancient texts for analysis.

2025-03-05 Tags: llm, vlm, cuneiform, ocr, ancient history by klotz

olmOCR: Toolkit for Training Language Models to Work with PDF Documents

A toolkit for training language models to work with PDF documents in the wild, including prompting strategies, evaluation tools, filtering, finetuning code, and processing PDFs through finetuned models.

2025-02-28 Tags: pdf, llm, pdf processing, olmocr, allenai, ocr, document management, document conversion by klotz

klotz: ocr*
