SemanticScuttle - klotz.me

klotz: pdf*

Let’s Build a RAG-Powered Research Paper Assistant

This article details building a Retrieval-Augmented Generation (RAG) system to assist with research paper tasks, specifically question answering over a PDF document. It covers document loading, splitting, embedding with Sentence Transformers, using ChromaDB as a vector database, and implementing a query interface with LangChain.

2025-04-23 Tags: docker, rag, langchain, sentence transformers, chromadb, vector database, pdf, llm by klotz
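
The retrieval flow summarized in the entry above (split the document, embed the chunks, look up the chunks closest to a question, hand them to the LLM) can be sketched with the standard library alone. This is a minimal illustration, not the article's code: a bag-of-words vector stands in for Sentence Transformers embeddings, an in-memory list stands in for ChromaDB, and the function names (split_text, embed, cosine, retrieve, build_prompt) are hypothetical.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows."""
    step = chunk_size - overlap
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def embed(text):
    """Toy embedding: term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (a stand-in for a vector DB lookup)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, chunks, k=2):
    """Assemble the retrieved context and the question into one LLM prompt."""
    context = "\n---\n".join(retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


chunks = [
    "Transformers use attention layers.",
    "ChromaDB stores embedding vectors.",
    "Bananas are yellow.",
]
print(build_prompt("Where are embedding vectors stored?", chunks, k=1))
```

In a real pipeline the embed and retrieve steps would be replaced by an embedding model and a vector store query; the overall shape of the loop stays the same.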

From PDF to Markdown with Local LLMs — Fast, Private, and Free

This article details a method for converting PDFs to Markdown with a local LLM (Gemma 3 via Ollama), with a focus on privacy and efficiency. Each PDF page is rendered as an image and passed to the LLM for content extraction, so the method also works on scanned PDFs.

2025-04-16 Tags: pdf, markdown, llm, self-hosted, gemma, ollama, ocr, pymupdf, pillow by klotz

olmOCR: Toolkit for Training Language Models to Work with PDF Documents

A toolkit for training language models to work with PDF documents in the wild, including prompting strategies, evaluation tools, filtering, finetuning code, and processing PDFs through finetuned models.

2025-02-28 Tags: pdf, llm, pdf processing, olmocr, allenai, ocr, document management, document conversion by klotz

Preparing PDFs for RAGs

The article discusses the process of preparing PDFs for use in Retrieval-Augmented Generation (RAG) systems, with a focus on creating graph-based RAGs from annual reports containing tables. It highlights the benefits of Graph RAGs over vector store-backed RAGs, particularly in terms of reasoning capabilities, and explores the construction of knowledge graphs for better information retrieval. The author shares insights into the challenges and solutions involved in building an enterprise-ready graph data store for RAG applications.

2025-01-20 Tags: pdf, rags, knowledge graph, llm by klotz

MarkItDown - Python tool for converting files and office documents to Markdown

MarkItDown is a utility for converting various files to Markdown, including PDF, PowerPoint, Word, Excel, Images, Audio, HTML, text-based formats, and ZIP files.

2024-12-30 Tags: markitdown, markdown, file conversion, python, office documents, pdf, powerpoint, word, excel, images, audio, html, csv, json, xml, zip, openai, large language models, docker, llm, document, conversion by klotz

Improved RAG Document Processing With Markdown

How to read and convert PDFs to Markdown for better RAG results with LLMs.

2024-11-19 Tags: markdown, conversion, pdf, llm by klotz

Docling

Docling is a tool that parses documents and exports them to desired formats like Markdown and JSON. It supports various document formats and provides advanced PDF understanding, metadata extraction, and integration with LlamaIndex and LangChain for RAG / QA applications.

2024-11-01 Tags: docling, document, parsing, export, markdown, json, pdf, ibm, github, foss by klotz

DS4SD / docling

Docling is a tool that parses documents and exports them to desired formats like Markdown and JSON. It supports various document formats including PDF, DOCX, PPTX, Images, HTML, AsciiDoc, and Markdown.

2024-11-01 Tags: docling, ibm, document, parsing, markdown, json, pdf, docx, pptx, ocr, llm by klotz

NotebookLlama: An Open Source version of NotebookLM

A guided series of tutorials/notebooks to build a PDF to Podcast workflow using Llama models for text processing, transcript writing, dramatization, and text-to-speech conversion.

2024-10-28 Tags: notebookllama, pdf, llama, text processing, foss, facebook by klotz

Useful LLM Tools

Approximate Tokens, Words and Characters Calculator for LLM's and Text Trimmer — Simple calculator to estimate tokens for Large Language Models and text editor to trim text
Text File Merger for LLM — This tool combines multiple text files into a single document, with clear separation between files
PDF to TXT Converter — Convert PDF documents to plain text format for use with LLMs and text analysis
HTML to TXT Converter — Remove HTML tags and extract clean text content for LLM processing
LLM System Prompt Generator — Generate optimized system prompts for different LLM model sizes (3B, 33B, 70B, etc.)
Creative Idea Generator — AI-powered brainstorming tool for generating creative solutions and ideas

2024-10-26 Tags: llm, tools, linux, denis shiryaev, pdf, html, text by klotz
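
Several of the utilities listed in the entry above are simple enough to approximate in a few lines. The sketch below is not the linked tools' code; it assumes the common rule of thumb of roughly four characters per token for English text (real tokenizers vary by model), and the function names and the "=====" separator are hypothetical choices.

```python
import math
from html.parser import HTMLParser


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate; ~4 chars per token is a heuristic, not an exact count."""
    return math.ceil(len(text) / chars_per_token)


def trim_to_tokens(text, max_tokens, chars_per_token=4.0):
    """Cut text so its estimated token count stays within max_tokens."""
    return text[: int(max_tokens * chars_per_token)]


def merge_texts(named_texts):
    """Join (name, content) pairs into one document with a header line per file."""
    parts = [f"===== {name} =====\n{content.strip()}" for name, content in named_texts]
    return "\n\n".join(parts)


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Strip tags and return the visible text as one space-joined string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


print(estimate_tokens("abcdefgh"))
print(html_to_text("<p>Hello <b>world</b></p><script>x=1</script>"))
```

For exact counts, the tokenizer of the target model should be used instead of the character heuristic.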

