klotz: retrieval-augmented generation* + ai*

0 bookmark(s) - Sort by: Date ↓ / Title / - Bookmarks from other users for this tag

  1. Amazon S3 Vectors is now generally available with increased scale and production-grade performance capabilities. It offers native support to store and query vector data, potentially reducing costs by up to 90% compared to specialized vector databases.
  2. A comprehensive overview of the current state of Multi-Concept Prompting (MCP), including advancements, challenges, and future directions.
  3. This article explains the internal workings of vector databases, highlighting that they don't perform a brute-force search as commonly described. It details algorithms like HNSW, IVF, and PQ, the tradeoffs between recall, speed, and memory, and how different RAG patterns impact vector database usage. It also discusses production challenges like filtering, updates, and sharding.
  4. Google DeepMind research reveals a fundamental architectural limitation in Retrieval-Augmented Generation (RAG) systems related to fixed-size embeddings. The research demonstrates that retrieval performance degrades as database size increases, with theoretical limits based on embedding dimensionality. They introduce the LIMIT benchmark to empirically test these limitations and suggest alternatives like cross-encoders, multi-vector models, and sparse models.
  5. Nvidia’s NeMo Retriever models and RAG pipeline make quick work of ingesting PDFs and generating reports based on them. Chalk one up for the plan-reflect-refine architecture.
    2025-08-23 Tags: , , , , by klotz
  6. MarkItDown is an open-source Python utility that simplifies converting diverse file formats into Markdown, designed to prepare data for LLMs and RAG systems. It handles various file types, preserves document structure, and integrates with LLMs for tasks like image description.
  7. A Reddit thread discussing preferred local Large Language Model (LLM) setups for tasks like summarizing text, coding, and general use. Users share their model choices (Gemma, Qwen, Phi, etc.) and frameworks (llama.cpp, Ollama, EXUI) along with potential issues and configurations.

    | **Model** | **Use Cases** | **Size (Parameters)** | **Approx. VRAM (Q4 Quantization)** | **Approx. RAM (Q4)** | **Notes/Requirements** |
    |----------------|---------------------------------------------------|------------------------|-----------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
    | **Gemma 3 (Meta)** | Summarization, conversational tasks, image recognition, translation, simple writing | 3B, 4B, 7B, 8B, 12B, 27B+ | 2-4GB (3B), 4-6GB (7B), 8-12GB (12B) | 4-8GB (3B), 8-12GB (7B), 16-24GB (12B) | Excellent performance for its size. Recent versions have had memory leak issues (see Reddit post – use Ollama 0.6.6 or later, but even that may not be fully fixed). QAT versions are highly recommended. |
    | **Qwen 2.5 (Alibaba)** | Summarization, coding, reasoning, decision-making, technical material processing | 3.5B, 7B, 72B | 2-3GB (3.5B), 4-6GB (7B), 26-30GB (72B) | 4-6GB (3.5B), 8-12GB (7B), 50-60GB (72B) | Qwen models are known for strong performance. Coder versions specifically tuned for code generation. |
    | **Qwen3 (Alibaba - upcoming)**| General purpose, likely similar to Qwen 2.5 with improvements | 70B | Estimated 25-30GB (Q4) | 50-60GB | Expected to be a strong competitor. |
    | **Llama 3 (Meta)**| General purpose, conversation, writing, coding, reasoning | 8B, 13B, 70B+ | 4-6GB (8B), 7-9GB (13B), 25-30GB (70B) | 8-12GB (8B), 14-18GB (13B), 50-60GB (70B) | Current state-of-the-art open-source model. Excellent balance of performance and size. |
    | **YiXin (01.AI)** | Reasoning, brainstorming | 72B | ~26-30GB (Q4) | ~50-60GB | A powerful model focused on reasoning and understanding. Similar VRAM requirements to Qwen 72B. |
    | **Phi-4 (Microsoft)** | General purpose, writing, coding | 14B | ~7-9GB (Q4) | 14-18GB | Smaller model, good for resource-constrained environments, but may not match larger models in complexity. |
    | **Ling-Lite** | RAG (Retrieval-Augmented Generation), fast processing, text extraction | Variable | Varies with size | Varies with size | MoE (Mixture of Experts) model known for speed. Good for RAG applications where quick responses are important. |

    **Key Considerations:**

    * **Quantization:** The VRAM and RAM estimates above are based on 4-bit quantization (Q4). Lower quantization (e.g., Q2) will reduce memory usage further, but *may* impact quality. Higher quantization (e.g., Q8, FP16) will increase quality but require significantly more memory.
    * **Frameworks:** Popular frameworks for running these models locally include:
    * **llama.cpp:** Highly optimized for CPU and GPU, especially on Apple Silicon.
    * **Ollama:** Simplified setup and management of LLMs. (Be aware of the Gemma 3 memory leak issue!)
    * **Text Generation WebUI (oobabooga):** Web-based interface with many features and customization options.
    * **Hardware:** A dedicated GPU with sufficient VRAM is highly recommended for decent performance. CPU-only inference is possible but can be slow. More RAM is generally better, even if the model fits in VRAM.
    * **Context Length:** The "40k" context mentioned in the Reddit post refers to the maximum number of tokens (words or sub-words) the model can process at once. Longer context lengths require more memory.
  8. This repository organizes public content to train an LLM to answer questions and generate summaries in an author's voice, focusing on the content of 'virtual_adrianco' but designed to be extensible to other authors.
  9. pgai brings AI workflows to your PostgreSQL database. It simplifies the process of building search and Retrieval Augmented Generation (RAG) AI applications with PostgreSQL by bringing embedding and generation AI models closer to the database.
  10. The Towards Data Science team highlights recent articles on the rise of open-source LLMs, ethical considerations with chatbots, potential manipulation of LLM recommendations, and techniques for temperature scaling and re-ranking in generative AI.

Top of the page

First / Previous / Next / Last / Page 1 of 0 SemanticScuttle - klotz.me: Tags: retrieval-augmented generation + ai

About - Propulsed by SemanticScuttle