A comprehensive technical guide on setting up a high-performance local large language model environment for agentic coding tasks. The author demonstrates how to run a quantized Qwen3.5-27B model on a remote RTX 4090 workstation and access it from a MacBook using Tailscale, integrating the setup with OpenCode and Codex.
Key topics include:
* Step-by-step llama.cpp build configuration for CUDA support.
* Using Tailscale to create a secure network between client and GPU machine.
* Optimizing VRAM usage through specific quantization (UD-Q4_K_XL) and context size management.
* Implementing a corrected chat template to prevent tool-calling errors in agentic workflows.
* Performance insights regarding hybrid architectures and KV cache precision.
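The build-and-serve steps outlined above can be sketched roughly as follows; the model filename, layer-offload count, context size, and port are illustrative placeholders, not the author's actual values:

```shell
# Build llama.cpp with CUDA support (run on the RTX 4090 workstation)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve a UD-Q4_K_XL quant; --ctx-size trades VRAM for context, and
# binding to 0.0.0.0 exposes the server on the Tailscale network.
# (model path and context size are placeholders)
./build/bin/llama-server \
  -m models/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768 \
  --host 0.0.0.0 --port 8080
```

From the MacBook, OpenCode or Codex would then point at the workstation's Tailscale IP (e.g. `http://100.x.y.z:8080/v1`) as an OpenAI-compatible endpoint.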
The llama.cpp server has introduced support for the Anthropic Messages API, a highly requested feature that allows users to run Claude-compatible clients with locally hosted models. This implementation enables powerful tools like Claude Code to interface directly with local GGUF models by internally converting Anthropic's message format to OpenAI's standard. Key features of this update include full support for chat completions with streaming, advanced tool use through function calling, token counting capabilities, vision support for multimodal models, and extended thinking for reasoning models. This development bridges the gap between proprietary AI ecosystems and local, privacy-focused inference pipelines, providing a seamless experience for developers working with agentic workloads and coding assistants.
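In practice, pointing an Anthropic-format client such as Claude Code at a local llama-server might look like the following sketch; the model path is a placeholder, and the exact variable values are assumptions based on how Anthropic clients are typically redirected, not details confirmed by the announcement:

```shell
# Start llama-server with a local GGUF model (path is a placeholder)
./build/bin/llama-server -m models/my-model.gguf --port 8080

# Redirect the client to the local Anthropic-compatible endpoint.
# The auth token can be any non-empty string, since llama-server
# does not enforce API keys unless started with --api-key.
export ANTHROPIC_BASE_URL="http://127.0.0.1:8080"
export ANTHROPIC_AUTH_TOKEN="dummy"
export ANTHROPIC_MODEL="my-model"
claude
```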
A technical guide to running lightweight OCR models (LightOnOCR, GLM-OCR, Deepseek-OCR) on low-end hardware using llama.cpp. Includes implementation details for CLI, REST APIs, and performance optimization.
Topics Covered:
- llama.cpp OCR integration
- Low-spec hardware optimization
- CLI & REST API setup
- Quantization & Prompting
- Hallucination mitigation
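The CLI and REST paths described above might look like this sketch; the model and projector filenames are placeholders, and `base64 -w0` assumes GNU coreutils (macOS uses `base64 -i`):

```shell
# One-shot OCR from the CLI using llama.cpp's multimodal runner
./build/bin/llama-mtmd-cli \
  -m models/LightOnOCR.gguf \
  --mmproj models/mmproj-LightOnOCR.gguf \
  --image page.png \
  -p "Transcribe all text in this image."

# Or serve the model and hit the OpenAI-compatible REST endpoint:
./build/bin/llama-server -m models/LightOnOCR.gguf \
  --mmproj models/mmproj-LightOnOCR.gguf --port 8080 &

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Transcribe all text in this image."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,'"$(base64 -w0 page.png)"'"}}
      ]
    }]
  }'
```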
Bonsai-8B-GGUF-1bit is an end-to-end 1-bit language model designed for high-efficiency deployment using llama.cpp across CUDA, Metal, and CPU architectures. This model provides a massive 14.1x reduction in memory footprint compared to standard FP16, requiring only 1.15 GB of parameter memory. By leveraging the GGUF Q1_0_g128 format, it achieves significant performance boosts, including 6.2x faster throughput on an RTX 4090 and substantially lower energy consumption per token. It is an ideal solution for on-device assistants, mobile applications, and edge robotics where memory, thermal, and power constraints are paramount.
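The quoted 14.1x figure is consistent with back-of-envelope arithmetic, assuming roughly 8B parameters and a small per-group scale overhead from the g128 grouping:

$$
\text{FP16: } 8\times10^{9}\ \text{params} \times 2\,\text{B} \approx 16\,\text{GB},
\qquad
\text{Q1\_0\_g128: } \frac{8\times10^{9}}{8}\,\text{B} + \text{scales} \approx 1.15\,\text{GB},
\qquad
\frac{16\,\text{GB}}{1.15\,\text{GB}} \approx 14.
$$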
This collection, curated by prism-ml, features a specialized series of 1-bit Bonsai models designed for efficient text generation. The repository includes various model architectures and sizes, such as the 8B, 4B, and 1.7B parameter versions, optimized through extreme quantization. Available in formats like GGUF and MLX-1bit, these models are highly compressed to maximize performance while minimizing the computational footprint. This makes them ideal for running large language model tasks on hardware with limited resources. The collection serves as a hub for exploring the potential of ultra-compact, highly compressed models in the evolving landscape of machine learning and efficient inference.
This Hugging Face page details the Gemma 4 31B-it model, an open-weights multimodal model created by Google DeepMind. Gemma 4 can process both text and image inputs, generating text outputs, with smaller models also supporting audio. It comes in various sizes (E2B, E4B, 26B A4B, and 31B) allowing for deployment on diverse hardware, from phones to servers.
The model boasts a context window of up to 256K tokens and supports over 140 languages. It utilizes dense and Mixture-of-Experts (MoE) architectures, excelling in tasks like text generation, coding, and reasoning. The page provides details on model data, training, ethics, usage, limitations, and best practices, along with code snippets for getting started with Transformers.
This article details benchmarks for Unsloth Dynamic GGUFs of the Qwen3.5 model, including analysis of perplexity and KL divergence and comparisons against MXFP4. It covers performance across different bit widths and quant types, highlighting the impact of Imatrix and the limitations of certain quantization approaches. Full benchmark data is also provided.
Announcement that ggml.ai is joining Hugging Face to ensure the long-term sustainability and progress of the ggml/llama.cpp community and Local AI. Highlights continued open-source development, improved user experience, and integration with the Hugging Face transformers library.
This article details the performance of Unsloth Dynamic GGUFs on the Aider Polyglot benchmark, showcasing how it can quantize LLMs like DeepSeek-V3.1 to as low as 1-bit while outperforming models like GPT-4.5 and Claude-4-Opus. It also covers benchmark setup, comparisons to other quantization methods, and chat template bug fixes.
A detailed guide for running the new gpt-oss models locally with the best performance using `llama.cpp`. The guide covers a wide range of hardware configurations and provides CLI argument explanations and benchmarks for Apple Silicon devices.
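A typical launch along the lines the guide describes might look like this sketch; the model path, offload count, and context size are placeholders, not the guide's benchmarked settings:

```shell
# Serve gpt-oss locally: -ngl offloads layers to the GPU (or Metal on
# Apple Silicon), and --jinja enables the model's chat template so
# tool calling works correctly.
./build/bin/llama-server \
  -m models/gpt-oss-20b-F16.gguf \
  -ngl 99 --ctx-size 16384 --jinja \
  --port 8080
```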