SemanticScuttle - klotz.me

Tags: gguf*

How to run Gemma 3 effectively with our GGUFs on llama.cpp, Ollama, Open WebUI and how to fine-tune with Unsloth! This page details running Gemma 3 on various platforms, including phones, and fine-tuning it using Unsloth, addressing potential issues with float16 precision and providing optimal configuration settings.

2025-08-16 Tags: gemma 3, llm, fine-tuning, llama.cpp, unsloth, gguf, gpu, colab, vision, audio, oobabooga by klotz

DeepSeek-R1-0528-Qwen3-8B-GGUF

This page details the DeepSeek-R1-0528-Qwen3-8B model, a Qwen3 8B model distilled from DeepSeek-R1-0528, highlighting its improved reasoning capabilities, evaluation results, usage guidelines, and licensing information. It is offered in several GGUF quantizations for local execution.

2025-05-30 Tags: deepseek-r1, qwen3, gguf, llm, quantization, reasoning, text generation, transformers, model card, mcp, huggingface by klotz

gguf-parser-web

A web application for parsing GGUF files.

2025-04-28 Tags: gguf, parser, huggingface, llm, gpu by klotz
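
As a rough illustration of what parsing a GGUF file involves, the sketch below reads only the fixed-size header: the 4-byte magic `GGUF`, a 32-bit version, and two 64-bit counts (tensors and metadata key/value pairs). It is not taken from gguf-parser-web. The names `GGUFHeader` and `read_gguf_header` are hypothetical, and the code assumes a little-endian file of version 2 or later.

```python
import struct
from typing import BinaryIO, NamedTuple


class GGUFHeader(NamedTuple):
    """Fixed-size fields at the start of a GGUF file (hypothetical helper type)."""
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Read the GGUF header from a binary stream.

    Assumes little-endian byte order and GGUF version >= 2, where both
    counts are stored as unsigned 64-bit integers.
    """
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")

    # uint32 version + uint64 tensor count + uint64 metadata count = 20 bytes.
    raw = stream.read(20)
    if len(raw) != 20:
        raise ValueError("truncated GGUF header")

    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", raw)
    if version < 2:
        # Version 1 stored the counts as 32-bit values; not handled here.
        raise ValueError(f"unsupported GGUF version: {version}")
    return GGUFHeader(version, tensor_count, metadata_kv_count)
```

With a local file (the path `model.gguf` is a placeholder), this could be called as `read_gguf_header(open("model.gguf", "rb"))`. The metadata key/value pairs and tensor descriptors that follow the header are not decoded by this sketch.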

Server approved! 4xH100 (320gb vram). Looking for advice

A user is seeking advice on deploying a new server with 4x H100 GPUs (320 GB VRAM) for on-premise AI workloads. They are considering a Kubernetes-based deployment with RKE2, the NVIDIA GPU Operator, and tools such as vLLM, llama.cpp, and LiteLLM. They are also exploring GPU pass-through with a hypervisor. The post details their current infrastructure and asks about potential gotchas and best practices.

2025-04-28 Tags: h100, kubernetes, vllm, llama.cpp, gpu, ai, deployment, rke2, litellm, quantization, sxm, fp8, awq, gguf, production engineering, inference engineering, scale, reddit, localllama by klotz

Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens

Alibaba's Qwen 2.5 LLM now supports inputs of up to 1 million tokens using Dual Chunk Attention. Two models have been released on Hugging Face, and both require significant VRAM to use the full context length. The post also discusses the challenges of deploying quantized GGUF versions under limited system resources.

2025-01-28 Tags: qwen2.5-1m, alibaba, hugging face, gguf, llm, simon willison by klotz

Ollama just made it easier to use AI on your laptop — with no internet required

Ollama now supports Hugging Face GGUF models, making it easier for users to run AI models locally without an internet connection. The GGUF format allows AI models to run on modest consumer hardware.

2024-10-24 Tags: ollama, huggingface, gguf, llm, localllama by klotz

TIL: Building llamafiles from Llama 3.2 GGUFs

A step-by-step guide on building llamafiles from Llama 3.2 GGUFs, including scripting and Dockerization.

2024-09-28 Tags: llamafile, llama.cpp, llm, llama 3.2, gguf, model quantization, docker, mozilla-ocho by klotz

GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU

This article explains how to accurately quantize a Large Language Model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers using an importance matrix (imatrix) with the K-Quantization method, taking Gemma 2 Instruct as an example, and notes that the approach also applies to other models such as Qwen2, Llama 3, and Phi-3.

2024-09-14 Tags: gguf, quantization, llm, cpu, inference, imatrix by klotz

Artifacts Quantized LLM Inference Performance Results on 70b+ Models

This document contains the quantized LLM inference performance results on 70b+ models.

2024-06-23 Tags: artifacts, quantization, llm, gguf by klotz

mistral.rs

Mistral.rs is a fast LLM inference platform that supports inference on a variety of devices, quantization, and easy-to-use deployment through an OpenAI-API-compatible HTTP server and Python bindings. It supports the latest Llama and Phi models, as well as X-LoRA and LoRA. The project aims to provide the fastest LLM inference platform possible.

2024-04-29 Tags: rust, llm, mistral, gguf, github by klotz
