Alibaba's Qwen 2.5 LLM now supports input token limits of up to 1 million using Dual Chunk Attention. Two models have been released on Hugging Face, and running them at full capacity requires significant VRAM. The article discusses deployment challenges with quantized GGUF versions and system resource constraints.
Ollama now supports Hugging Face GGUF models, making it easier for users to run AI models locally without an internet connection. The GGUF format allows AI models to run on modest consumer hardware.
A step-by-step guide on building llamafiles from Llama 3.2 GGUFs, including scripting and Dockerization.
This article explains how to accurately quantize a Large Language Model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers using an importance matrix (imatrix) and the K-Quantization method, with Gemma 2 Instruct as an example, and notes that the approach applies to other models such as Qwen2, Llama 3, and Phi-3.
This document contains quantized LLM inference performance results for 70B+ models.
Mistral.rs is a fast LLM inference platform that supports a variety of devices and quantization, and offers an easy-to-use OpenAI-API-compatible HTTP server and Python bindings. It supports the latest Llama and Phi models, as well as X-LoRA and LoRA. The project aims to provide the fastest LLM inference platform possible.
Quantized models from
A deep dive into model quantization with GGUF and llama.cpp and model evaluation with LlamaIndex
Exploring Pre-Quantized Large Language Models