Tags: quantization*


  1. The article discusses the growing trend of running Large Language Models (LLMs) locally on personal machines, exploring the motivations behind this shift (including privacy concerns, cost savings, and a desire for technological sovereignty) as well as the hardware and software advancements making it increasingly feasible.
  2. This article details 7 lessons the author learned while self-hosting Large Language Models (LLMs), covering topics like the importance of memory bandwidth, quantization, electricity costs, hardware choices beyond Nvidia, prompt engineering, Mixture of Experts models, and starting with simpler tools like LM Studio.
  3. This page details the DeepSeek-R1-0528-Qwen3-8B model, a distilled variant of DeepSeek-R1-0528 built on Qwen3-8B, highlighting its improved reasoning capabilities, evaluation results, usage guidelines, and licensing information. It offers various quantization options (GGUF) for local execution.
  4. SGLang is a fast serving framework for large language models and vision language models. It focuses on efficient serving and controllable interaction through co-designed backend runtime and frontend language.
  5. This article details the often overlooked cost of storing embeddings for RAG systems, and how quantization techniques (int8 and binary) can significantly reduce storage requirements and improve retrieval speed without substantial accuracy loss.
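    The storage savings the article describes are easy to sketch. Below is a minimal illustration (not the article's own code) of int8 and binary quantization applied to a hypothetical batch of embedding vectors, using only NumPy; int8 cuts storage 4x, and sign-bit binarization cuts it 32x, with binary vectors then comparable via fast Hamming distance:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical corpus: 1,000 embeddings of dimension 768, float32
    emb = rng.standard_normal((1000, 768)).astype(np.float32)

    # int8 quantization: scale each dimension into [-127, 127] (4x smaller)
    scale = np.abs(emb).max(axis=0) / 127.0
    emb_int8 = np.round(emb / scale).astype(np.int8)

    # binary quantization: keep only the sign bit, packed 8 dims/byte (32x smaller)
    emb_bin = np.packbits(emb > 0, axis=1)

    print(emb.nbytes, emb_int8.nbytes, emb_bin.nbytes)  # 3072000 768000 96000
    ```

    Retrieval over the binary vectors typically uses Hamming distance (XOR plus popcount), which is why the speedup comes alongside the storage reduction; int8 is often kept for a rescoring pass to recover accuracy.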
  6. A user is seeking advice on deploying a new server with 4x H100 GPUs (320GB VRAM) for on-premise AI workloads. They are considering a Kubernetes-based deployment with RKE2, the NVIDIA GPU Operator, and tools like vLLM, llama.cpp, and LiteLLM. They are also exploring the option of GPU pass-through with a hypervisor. The post details their current infrastructure and asks for potential gotchas and best practices.
  7. This document details how to run Gemma models, covering framework selection, variant choice, and running generation/inference requests. It emphasizes considering available hardware resources and provides recommendations for beginners.
    2025-04-18 by klotz
  8. This document details how to run Qwen models locally using the Text Generation Web UI (oobabooga), covering installation, setup, and launching the web interface.
  9. Meta AI has released quantized versions of the Llama 3.2 models (1B and 3B), which improve inference speed by up to 2-4x and reduce model size by 56%, making advanced AI technology more accessible to a wider range of users.
    2024-10-26 by klotz
  10. This article discusses Neural Magic's extensive evaluation of quantized large language models (LLMs), finding that quantized LLMs remain competitive in accuracy with their full-precision counterparts while improving efficiency.

    - **Quantization Schemes**: Three different quantization schemes were tested: W8A8-INT, W8A8-FP, and W4A16-INT, each optimized for different hardware and deployment scenarios.
    - **Accuracy Recovery**: The quantized models demonstrated high accuracy recovery, often reaching over 99%, across a range of benchmarks, including OpenLLM Leaderboard v1 and v2, Arena-Hard, and HumanEval.
    - **Text Similarity**: Text generated by quantized models was found to be highly similar to that generated by full-precision models, maintaining semantic and structural consistency.
    2025-02-27 by klotz
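    In scheme names like W8A8-INT and W4A16-INT, "W" is the weight bit-width and "A" the activation bit-width. The sketch below (a NumPy illustration under assumed shapes, not Neural Magic's actual pipeline) shows the core of 8-bit symmetric per-channel weight quantization and why accuracy recovery stays high: the round-trip reconstruction error is tiny relative to the weights themselves:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical linear-layer weight matrix (output channels x input dims)
    w = rng.standard_normal((4096, 4096)).astype(np.float32)

    # One scale per output channel, mapping the channel's max |w| to 127
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # the "W8"
    w_dq = w_q.astype(np.float32) * scale  # dequantize to compare

    rel_err = np.linalg.norm(w - w_dq) / np.linalg.norm(w)
    print(f"relative reconstruction error: {rel_err:.4f}")  # well under 1%
    ```

    W4A16 follows the same idea with 16 weight levels instead of 255, which is why 4-bit schemes lean on finer-grained (e.g. per-group) scales to keep that error down.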


SemanticScuttle - klotz.me: tagged with "quantization"
