SGLang is a fast serving framework for large language models and vision language models. It focuses on efficient serving and controllable interaction through a co-designed backend runtime and frontend language.
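As an illustration of that frontend language, here is a minimal sketch; it assumes an SGLang server is already running locally on port 30000, and the prompt and variable names are placeholders rather than anything from the project's docs:

```python
# Minimal SGLang frontend sketch: connect to a running server and execute
# a small generation program. Endpoint URL and prompt are placeholders.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

# Assumes a server was launched separately, e.g. via
#   python -m sglang.launch_server --model-path <model> --port 30000
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = qa.run(question="What is quantization in the context of LLMs?")
print(state["answer"])
```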
This article details the often overlooked cost of storing embeddings for RAG systems, and how quantization techniques (int8 and binary) can significantly reduce storage requirements and improve retrieval speed without substantial accuracy loss.
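A minimal numpy sketch of the binary-quantization idea (illustrative only; the corpus size, dimensionality, and zero threshold are assumptions, and production systems typically rely on an embedding library or a vector database with quantized indexes):

```python
# Sketch: binary quantization of float32 embeddings and Hamming-distance retrieval.
# Shapes and data are illustrative; real systems would use calibrated thresholds
# and a proper ANN index.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 1024)).astype(np.float32)  # 10k docs, 1024-dim
query = rng.standard_normal(1024).astype(np.float32)

def to_binary(x: np.ndarray) -> np.ndarray:
    # Keep only the sign of each dimension, packed 8 dims per byte:
    # 1024 float32 values (4096 bytes) shrink to 128 bytes, a 32x reduction.
    return np.packbits(x > 0, axis=-1)

corpus_bin = to_binary(corpus)   # (10000, 128) uint8
query_bin = to_binary(query)     # (128,) uint8

# Hamming distance = popcount of the XOR between packed codes.
xor = np.bitwise_xor(corpus_bin, query_bin)
hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)

top_k = np.argsort(hamming)[:5]
print("closest documents:", top_k)
```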
A user is seeking advice on deploying a new server with 4x H100 GPUs (320GB VRAM) for on-premise AI workloads. They are considering a Kubernetes-based deployment using RKE2, the NVIDIA GPU Operator, and tools such as vLLM, llama.cpp, and LiteLLM. They are also exploring GPU pass-through with a hypervisor. The post details their current infrastructure and asks about potential gotchas and best practices.
This document details how to run Gemma models, covering framework selection, variant choice, and how to run generation (inference) requests. It emphasizes considering available hardware resources and provides recommendations for beginners.
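For instance, a minimal Hugging Face Transformers sketch along these lines (the model ID, dtype, and prompt are assumptions, not the document's exact choices; the document itself covers several frameworks and variants):

```python
# Sketch: running a small instruction-tuned Gemma variant with Transformers.
# Model ID and generation settings are illustrative; pick a variant that fits
# your available hardware (and accept the Gemma license on Hugging Face first).
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain int8 quantization in two sentences."}]
output = generator(messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])
```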
This document details how to run Qwen models locally using the Text Generation Web UI (oobabooga), covering installation, setup, and launching the web interface.
Meta AI has released quantized versions of the Llama 3.2 models (1B and 3B), which speed up inference by 2-4x and reduce model size by 56%, making advanced AI technology more accessible to a wider range of users.
This article discusses Neural Magic's extensive evaluation of quantized large language models (LLMs), finding that quantized LLMs maintain accuracy competitive with their full-precision counterparts while improving efficiency; a minimal serving sketch follows the list below.
- **Quantization Schemes**: Three different quantization schemes were tested: W8A8-INT, W8A8-FP, and W4A16-INT, each optimized for different hardware and deployment scenarios.
- **Accuracy Recovery**: The quantized models demonstrated high accuracy recovery, often reaching over 99%, across a range of benchmarks, including OpenLLM Leaderboard v1 and v2, Arena-Hard, and HumanEval.
- **Text Similarity**: Text generated by quantized models was found to be highly similar to that generated by full-precision models, maintaining semantic and structural consistency.
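As referenced above, here is a minimal sketch of serving one of these quantized checkpoints for offline inference with vLLM; the repository name is illustrative (Neural Magic publishes such checkpoints on Hugging Face), so substitute the model you actually use:

```python
# Sketch: offline inference on a W8A8-INT quantized checkpoint with vLLM.
# The model ID below is illustrative, not necessarily one from the article.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize the benefits of W8A8 quantization."], params)
print(outputs[0].outputs[0].text)
```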
A guide on how to download, convert, quantize, and use the Llama 3.1 8B model with llama.cpp on a Mac.
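As a companion sketch, loading a resulting quantized GGUF from Python via llama-cpp-python (the file name and settings are assumptions; the guide itself works through the download, convert, and quantize steps with llama.cpp's command-line tools):

```python
# Sketch: running a quantized Llama 3.1 8B GGUF with llama-cpp-python on a Mac.
# The model path is a placeholder for whatever file the quantization step produced;
# n_gpu_layers=-1 offloads all layers to Metal on Apple Silicon.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # hypothetical local file
    n_ctx=8192,
    n_gpu_layers=-1,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give one tip for running LLMs on a Mac."}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```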
This paper evaluates the performance of instruction-tuned LLMs across various quantization methods, including GPTQ, AWQ, SmoothQuant, and FP8, on models ranging from 7B to 405B. A key finding is that quantizing a larger LLM down to roughly the size of a smaller FP16 LLM generally yields better performance across most benchmarks, with the exceptions of hallucination detection and instruction following.
This article explains how to accurately quantize a Large Language Model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers using an importance matrix (imatrix) and K-Quantization method with Gemma 2 Instruct as an example, while highlighting its applicability to other models like Qwen2, Llama 3, and Phi-3.
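A rough sketch of that workflow, driving llama.cpp's tools from Python's subprocess (all file paths and the calibration text are placeholders, and the binary names and flags match recent llama.cpp builds but may differ in other versions, so treat this as an outline rather than the article's exact commands):

```python
# Sketch of the imatrix + K-quantization workflow for producing a GGUF.
# Paths are placeholders; llama-imatrix and llama-quantize are llama.cpp tools.
import subprocess

F16_GGUF = "gemma-2-9b-it-f16.gguf"   # produced earlier by convert_hf_to_gguf.py
CALIB_TEXT = "calibration.txt"        # text used to estimate weight importance
IMATRIX = "imatrix.dat"
OUT_GGUF = "gemma-2-9b-it-Q4_K_M.gguf"

# 1) Compute the importance matrix from calibration text.
subprocess.run(
    ["./llama-imatrix", "-m", F16_GGUF, "-f", CALIB_TEXT, "-o", IMATRIX],
    check=True,
)

# 2) Quantize with K-quants, guided by the importance matrix.
subprocess.run(
    ["./llama-quantize", "--imatrix", IMATRIX, F16_GGUF, OUT_GGUF, "Q4_K_M"],
    check=True,
)
print("wrote", OUT_GGUF)
```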