SGLang is a fast serving framework for large language models and vision language models. It focuses on efficient serving and controllable interaction through a co-designed backend runtime and frontend language.
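As an illustration of that frontend language, here is a minimal sketch; it assumes an SGLang server is already running locally on port 30000, and the prompt and variable names are placeholders rather than anything from the project's docs:

```python
# Minimal SGLang frontend sketch: connect to a running server and execute
# a small generation program. Endpoint URL and prompt are placeholders.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

# Assumes a server was launched separately, e.g. via
#   python -m sglang.launch_server --model-path <model> --port 30000
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = qa.run(question="What is quantization in the context of LLMs?")
print(state["answer"])
```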
This article details the often overlooked cost of storing embeddings for RAG systems, and how quantization techniques (int8 and binary) can significantly reduce storage requirements and improve retrieval speed without substantial accuracy loss.
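A minimal numpy sketch of the binary-quantization idea (illustrative only; the corpus size, dimensionality, and zero threshold are assumptions, and production systems typically rely on an embedding library or a vector database with quantized indexes):

```python
# Sketch: binary quantization of float32 embeddings and Hamming-distance retrieval.
# Shapes and data are illustrative; real systems would use calibrated thresholds
# and a proper ANN index.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 1024)).astype(np.float32)  # 10k docs, 1024-dim
query = rng.standard_normal(1024).astype(np.float32)

def to_binary(x: np.ndarray) -> np.ndarray:
    # Keep only the sign of each dimension, packed 8 dims per byte:
    # 1024 float32 values (4096 bytes) shrink to 128 bytes, a 32x reduction.
    return np.packbits(x > 0, axis=-1)

corpus_bin = to_binary(corpus)   # (10000, 128) uint8
query_bin = to_binary(query)     # (128,) uint8

# Hamming distance = popcount of the XOR between packed codes.
xor = np.bitwise_xor(corpus_bin, query_bin)
hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)

top_k = np.argsort(hamming)[:5]
print("closest documents:", top_k)
```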
A user is seeking advice on deploying a new server with 4x H100 GPUs (320GB VRAM) for on-premise AI workloads. They are considering a Kubernetes-based deployment using RKE2, the NVIDIA GPU Operator, and tools such as vLLM, llama.cpp, and LiteLLM. They are also exploring GPU pass-through with a hypervisor. The post details their current infrastructure and asks about potential gotchas and best practices.
This document details how to run Gemma models, covering framework selection, variant choice, and how to run generation (inference) requests. It emphasizes considering available hardware resources and provides recommendations for beginners.
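For instance, a minimal Hugging Face Transformers sketch along these lines (the model ID, dtype, and prompt are assumptions, not the document's exact choices; the document itself covers several frameworks and variants):

```python
# Sketch: running a small instruction-tuned Gemma variant with Transformers.
# Model ID and generation settings are illustrative; pick a variant that fits
# your available hardware (and accept the Gemma license on Hugging Face first).
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain int8 quantization in two sentences."}]
output = generator(messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])
```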
This document details how to run Qwen models locally using the Text Generation Web UI (oobabooga), covering installation, setup, and launching the web interface.
Meta AI has released quantized versions of the Llama 3.2 models (1B and 3B), which speed up inference by 2-4x and reduce model size by 56%, making advanced AI technology more accessible to a wider range of users.
This article discusses Neural Magic's extensive evaluation of quantized large language models (LLMs), finding that quantized LLMs maintain accuracy competitive with their full-precision counterparts while improving efficiency; a minimal serving sketch follows the list below.
- **Quantization Schemes**: Three different quantization schemes were tested: W8A8-INT, W8A8-FP, and W4A16-INT, each optimized for different hardware and deployment scenarios.
- **Accuracy Recovery**: The quantized models demonstrated high accuracy recovery, often reaching over 99%, across a range of benchmarks, including OpenLLM Leaderboard v1 and v2, Arena-Hard, and HumanEval.
- **Text Similarity**: Text generated by quantized models was found to be highly similar to that generated by full-precision models, maintaining semantic and structural consistency.
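As referenced above, here is a minimal sketch of serving one of these quantized checkpoints for offline inference with vLLM; the repository name is illustrative (Neural Magic publishes such checkpoints on Hugging Face), so substitute the model you actually use:

```python
# Sketch: offline inference on a W8A8-INT quantized checkpoint with vLLM.
# The model ID below is illustrative, not necessarily one from the article.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize the benefits of W8A8 quantization."], params)
print(outputs[0].outputs[0].text)
```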
A guide on how to download, convert, quantize, and use the Llama 3.1 8B model with llama.cpp on a Mac.
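As a companion sketch, loading a resulting quantized GGUF from Python via llama-cpp-python (the file name and settings are assumptions; the guide itself works through the download, convert, and quantize steps with llama.cpp's command-line tools):

```python
# Sketch: running a quantized Llama 3.1 8B GGUF with llama-cpp-python on a Mac.
# The model path is a placeholder for whatever file the quantization step produced;
# n_gpu_layers=-1 offloads all layers to Metal on Apple Silicon.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # hypothetical local file
    n_ctx=8192,
    n_gpu_layers=-1,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give one tip for running LLMs on a Mac."}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```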
This paper evaluates the performance of instruction-tuned LLMs across various quantization methods, including GPTQ, AWQ, SmoothQuant, and FP8, on models ranging from 7B to 405B. A key finding is that quantizing a larger LLM down to roughly the size of a smaller FP16 LLM generally yields better performance across most benchmarks, with the exceptions of hallucination detection and instruction following.
This article explains how to accurately quantize a Large Language Model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers using an importance matrix (imatrix) and K-Quantization method with Gemma 2 Instruct as an example, while highlighting its applicability to other models like Qwen2, Llama 3, and Phi-3.
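A rough sketch of that workflow, driving llama.cpp's tools from Python's subprocess (all file paths and the calibration text are placeholders, and the binary names and flags match recent llama.cpp builds but may differ in other versions, so treat this as an outline rather than the article's exact commands):

```python
# Sketch of the imatrix + K-quantization workflow for producing a GGUF.
# Paths are placeholders; llama-imatrix and llama-quantize are llama.cpp tools.
import subprocess

F16_GGUF = "gemma-2-9b-it-f16.gguf"   # produced earlier by convert_hf_to_gguf.py
CALIB_TEXT = "calibration.txt"        # text used to estimate weight importance
IMATRIX = "imatrix.dat"
OUT_GGUF = "gemma-2-9b-it-Q4_K_M.gguf"

# 1) Compute the importance matrix from calibration text.
subprocess.run(
    ["./llama-imatrix", "-m", F16_GGUF, "-f", CALIB_TEXT, "-o", IMATRIX],
    check=True,
)

# 2) Quantize with K-quants, guided by the importance matrix.
subprocess.run(
    ["./llama-quantize", "--imatrix", IMATRIX, F16_GGUF, OUT_GGUF, "Q4_K_M"],
    check=True,
)
print("wrote", OUT_GGUF)
```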