klotz: rag*


  1. MarkItDown is an open-source Python utility that simplifies converting diverse file formats into Markdown, designed to prepare data for LLMs and RAG systems. It handles various file types, preserves document structure, and integrates with LLMs for tasks like image description.
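
    A minimal sketch of the conversion flow, assuming the `markitdown` package's basic API (the input filename is illustrative):

    ```python
    # pip install markitdown
    from markitdown import MarkItDown

    md = MarkItDown()
    result = md.convert("report.pdf")  # hypothetical input; DOCX, XLSX, HTML, images also work
    print(result.text_content)         # Markdown text ready for LLM/RAG ingestion
    ```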
  2. IBM announces Granite 3.3, featuring a new speech-to-text model (Granite Speech 3.3 8B), enhanced reasoning capabilities in Granite 3.3 8B Instruct, and RAG-focused LoRA adapters for Granite 3.2. The release also includes activated LoRAs (aLoRAs) for improved efficiency and all models are open source.
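
    A hedged sketch of attaching one of the RAG LoRA adapters with Hugging Face `peft`; the adapter repository id below is a placeholder, not a confirmed name (check IBM's Granite collection on Hugging Face):

    ```python
    # pip install transformers peft accelerate
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_id = "ibm-granite/granite-3.2-8b-instruct"
    adapter_id = "ibm-granite/granite-3.2-8b-rag-lora"  # placeholder adapter id

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
    model = PeftModel.from_pretrained(base, adapter_id)  # LoRA weights applied on top of the base
    ```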
  3. This article details the often-overlooked cost of storing embeddings in RAG systems and shows how quantization techniques (int8 and binary) can significantly reduce storage requirements and improve retrieval speed without substantial accuracy loss.
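
    The binary variant can be sketched in a few lines of NumPy. The sign threshold and Hamming-distance retrieval below follow the common recipe, while the dimensions and data are illustrative (int8 quantization would instead calibrate a per-dimension scale):

    ```python
    import numpy as np

    def binary_quantize(embeddings: np.ndarray) -> np.ndarray:
        """Map float32 embeddings to packed 1-bit vectors (32x smaller)."""
        bits = (embeddings > 0).astype(np.uint8)  # keep only the sign of each dimension
        return np.packbits(bits, axis=-1)         # 8 dimensions per byte

    def hamming(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Retrieval proxy: lower Hamming distance ~ higher cosine similarity."""
        return np.unpackbits(a ^ b, axis=-1).sum(axis=-1)

    corpus = np.random.randn(10_000, 1024).astype(np.float32)  # 4 KB per vector...
    packed = binary_quantize(corpus)                           # ...becomes 128 bytes
    query = binary_quantize(np.random.randn(1, 1024).astype(np.float32))
    top10 = np.argsort(hamming(packed, query))[:10]
    ```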
  4. This article details building a Retrieval-Augmented Generation (RAG) system to assist with research paper tasks, specifically question answering over a PDF document. It covers document loading, splitting, embedding with Sentence Transformers, using ChromaDB as a vector database, and implementing a query interface with LangChain.
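
    A condensed sketch of that pipeline, assuming current LangChain package names (these move between releases) and an illustrative PDF path and question:

    ```python
    # pip install langchain-community langchain-text-splitters langchain-huggingface langchain-chroma pypdf
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_chroma import Chroma

    docs = PyPDFLoader("paper.pdf").load()  # hypothetical paper
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200).split_documents(docs)
    store = Chroma.from_documents(
        chunks, HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"))
    for hit in store.similarity_search("What dataset does the paper use?", k=4):
        print(hit.page_content[:200])
    ```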
  5. A Reddit thread discussing preferred local Large Language Model (LLM) setups for tasks like summarizing text, coding, and general use. Users share their model choices (Gemma, Qwen, Phi, etc.) and frameworks (llama.cpp, Ollama, EXUI) along with potential issues and configurations.

    | **Model** | **Use Cases** | **Size (Parameters)** | **Approx. VRAM (Q4 Quantization)** | **Approx. RAM (Q4)** | **Notes/Requirements** |
    |----------------|---------------------------------------------------|------------------------|-----------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
    | **Gemma 3 (Google)** | Summarization, conversational tasks, image recognition, translation, simple writing | 1B, 4B, 12B, 27B | ~1GB (1B), 2-4GB (4B), 8-12GB (12B) | 2-4GB (1B), 4-8GB (4B), 16-24GB (12B) | Excellent performance for its size. Recent versions have had memory leak issues (see Reddit post – use Ollama 0.6.6 or later, but even that may not be fully fixed). QAT versions are highly recommended. |
    | **Qwen 2.5 (Alibaba)** | Summarization, coding, reasoning, decision-making, technical material processing | 3B, 7B, 14B, 32B, 72B | 2-3GB (3B), 4-6GB (7B), 26-30GB (72B) | 4-6GB (3B), 8-12GB (7B), 50-60GB (72B) | Qwen models are known for strong performance. Coder variants are specifically tuned for code generation. |
    | **Qwen3 (Alibaba - upcoming)**| General purpose, likely similar to Qwen 2.5 with improvements | 70B | Estimated 25-30GB (Q4) | 50-60GB | Expected to be a strong competitor. |
    | **Llama 3 (Meta)**| General purpose, conversation, writing, coding, reasoning | 8B, 70B | 4-6GB (8B), 25-30GB (70B) | 8-12GB (8B), 50-60GB (70B) | Current state-of-the-art open-source model family. Excellent balance of performance and size. |
    | **YiXin (01.AI)** | Reasoning, brainstorming | 72B | ~26-30GB (Q4) | ~50-60GB | A powerful model focused on reasoning and understanding. Similar VRAM requirements to Qwen 72B. |
    | **Phi-4 (Microsoft)** | General purpose, writing, coding | 14B | ~7-9GB (Q4) | 14-18GB | Smaller model, good for resource-constrained environments, but may not match larger models in complexity. |
    | **Ling-Lite** | RAG (Retrieval-Augmented Generation), fast processing, text extraction | Variable | Varies with size | Varies with size | MoE (Mixture of Experts) model known for speed. Good for RAG applications where quick responses are important. |

    **Key Considerations:**

    * **Quantization:** The VRAM and RAM estimates above assume 4-bit quantization (Q4). Lower-bit quantization (e.g., Q2) reduces memory usage further but *may* hurt quality; higher-precision formats (e.g., Q8, FP16) improve quality but require significantly more memory. A rough memory-arithmetic sketch follows this list.
    * **Frameworks:** Popular frameworks for running these models locally include:
        * **llama.cpp:** Highly optimized for CPU and GPU inference, especially on Apple Silicon.
        * **Ollama:** Simplified setup and management of LLMs. (Be aware of the Gemma 3 memory leak issue!)
        * **Text Generation WebUI (oobabooga):** Web-based interface with many features and customization options.
    * **Hardware:** A dedicated GPU with sufficient VRAM is highly recommended for decent performance. CPU-only inference is possible but can be slow. More RAM is generally better, even if the model fits in VRAM.
    * **Context Length:** The "40k" context mentioned in the Reddit post refers to the maximum number of tokens (words or sub-words) the model can process at once. Longer context lengths require more memory.
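
    The arithmetic behind the table's Q4 estimates is simple. This sketch uses the standard parameters-times-bits rule of thumb, which covers weights only; KV cache and framework overhead, which grow with context length, come on top:

    ```python
    def approx_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
        """Back-of-envelope weight memory: parameters x bits / 8 bits-per-byte."""
        return params_billion * bits_per_weight / 8

    for size_b in (3, 8, 12, 70):  # sizes drawn from the table above
        print(f"{size_b}B @ Q4 ~ {approx_weight_gb(size_b):.1f} GB of weights")
    # 3B ~ 1.5 GB, 8B ~ 4.0 GB, 12B ~ 6.0 GB, 70B ~ 35.0 GB
    ```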
  6. This article details the creation of 'Stevens', a personal AI assistant built using a single SQLite table to store 'memories' and cron jobs to ingest data and generate daily briefs. It emphasizes a simple architecture leveraging Val.town for hosting and highlights the benefits of broader context for personal AI tools.
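
    The core idea fits in a few lines of Python's `sqlite3`. The schema here is hypothetical (the article's actual column names aren't given in this summary), but it shows the single-table "memories" pattern plus the daily-brief query a cron job would run:

    ```python
    import sqlite3
    from datetime import date

    conn = sqlite3.connect("stevens.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS memories (
        id      INTEGER PRIMARY KEY,
        due     TEXT,           -- date the memory becomes relevant, if any
        source  TEXT,           -- which ingester wrote it (calendar, weather, ...)
        content TEXT NOT NULL
    )""")
    conn.execute("INSERT INTO memories (due, source, content) VALUES (?, ?, ?)",
                 (date.today().isoformat(), "calendar", "Dentist at 3pm"))
    conn.commit()

    # The daily brief is just a query over today's relevant memories.
    for source, content in conn.execute(
            "SELECT source, content FROM memories WHERE due = ?",
            (date.today().isoformat(),)):
        print(f"[{source}] {content}")
    ```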
  7. Articles on Large Language Models, including RAG, Jupyter integration, complexity and pricing, and more.
  8. Ryan speaks with Edo Liberty, Founder and CEO of Pinecone, about building vector databases, the power of embeddings, the evolution of RAG, and fine-tuning AI models.
  9. This article details how to automate embedding generation and updates in Postgres using Supabase Vector, Queues, Cron, and pg_net extension with Edge Functions, addressing the issues of drift, latency, and complexity found in traditional external embedding pipelines.
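
    The article's pipeline runs inside Supabase itself (Cron enqueues work, pg_net calls an Edge Function). As a language-neutral illustration of the same keep-embeddings-in-sync idea, here is a hedged Python polling worker against a hypothetical `documents` table with a pgvector column:

    ```python
    # pip install "psycopg[binary]" pgvector sentence-transformers
    import psycopg
    from pgvector.psycopg import register_vector
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    with psycopg.connect("postgresql://localhost/app") as conn:  # hypothetical DSN
        register_vector(conn)  # lets numpy arrays round-trip to the vector column
        stale = conn.execute(
            "SELECT id, body FROM documents "
            "WHERE embedded_at IS NULL OR embedded_at < updated_at").fetchall()
        for doc_id, body in stale:
            conn.execute(
                "UPDATE documents SET body_embedding = %s, embedded_at = now() "
                "WHERE id = %s", (model.encode(body), doc_id))
    ```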
  10. This repository organizes public content to train an LLM to answer questions and generate summaries in an author's voice, focusing on the content of 'virtual_adrianco' but designed to be extensible to other authors.
