klotz: gguf* + llama.cpp*

0 bookmark(s) - Sort by: Date ↓ / Title / - Bookmarks from other users for this tag

  1. A comprehensive technical guide on setting up a high-performance local large language model environment for agentic coding tasks. The author demonstrates how to run a quantized Qwen3.5-27B model on a remote RTX 4090 workstation and access it from a MacBook using Tailscale, integrating the setup with OpenCode and Codex.
    Key topics include:
    * Step-by-step llama.cpp build configuration for CUDA support.
    * Using Tailscale to create a secure network between client and GPU machine.
    * Optimizing VRAM usage through specific quantization (UD-Q4_K_XL) and context size management.
    * Implementing a corrected chat template to prevent tool-calling errors in agentic workflows.
    * Performance insights regarding hybrid architectures and KV cache precision.
  2. The llama.cpp server has introduced support for the Anthropic Messages API, a highly requested feature that allows users to run Claude-compatible clients with locally hosted models. This implementation enables powerful tools like Claude Code to interface directly with local GGUF models by internally converting Anthropic's message format to OpenAI's standard. Key features of this update include full support for chat completions with streaming, advanced tool use through function calling, token counting capabilities, vision support for multimodal models, and extended thinking for reasoning models. This development bridges the gap between proprietary AI ecosystems and local, privacy-focused inference pipelines, providing a seamless experience for developers working with agentic workloads and coding assistants.

    ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL=
  3. A technical guide to running lightweight OCR models (LightOnOCR, GLM-OCR, Deepseek-OCR) on low-end hardware using llama.cpp. Includes implementation details for CLI, REST APIs, and performance optimization.

    Topics Covered:

    - llama.cpp OCR integration
    - Low-spec hardware optimization
    - CLI & REST API setup
    - Quantization & Prompting
    - Hallucination mitigation
  4. Bonsai-8B-GGUF-1bit is an end-to-end 1-bit language model designed for high-efficiency deployment using llama.cpp across CUDA, Metal, and CPU architectures. This model provides a massive 14.1x reduction in memory footprint compared to standard FP16, requiring only 1.15 GB of parameter memory. By leveraging the GGUF Q1_0_g128 format, it achieves significant performance boosts, including 6.2x faster throughput on an RTX 4090 and substantially lower energy consumption per token. It is an ideal solution for on-device assistants, mobile applications, and edge robotics where memory, thermal, and power constraints are paramount.
  5. Announcement that ggml.ai is joining Hugging Face to ensure the long-term sustainability and progress of the ggml/llama.cpp community and Local AI. Highlights continued open-source development, improved user experience, and integration with the Hugging Face transformers library.
  6. A detailed guide for running the new gpt-oss models locally with the best performance using `llama.cpp`. The guide covers a wide range of hardware configurations and provides CLI argument explanations and benchmarks for Apple Silicon devices.
  7. How to run Gemma 3 effectively with our GGUFs on llama.cpp, Ollama, Open WebUI and how to fine-tune with Unsloth! This page details running Gemma 3 on various platforms, including phones, and fine-tuning it using Unsloth, addressing potential issues with float16 precision and providing optimal configuration settings.
  8. A user is seeking advice on deploying a new server with 4x H100 GPUs (320GB VRAM) for on-premise AI workloads. They are considering a Kubernetes-based deployment with RKE2, Nvidia GPU Operator, and tools like vLLM, llama.cpp, and Litellm. They are also exploring the option of GPU pass-through with a hypervisor. The post details their current infrastructure and asks for potential gotchas or best practices.
  9. A step-by-step guide on building llamafiles from Llama 3.2 GGUFs, including scripting and Dockerization.
  10. - create a custom base image for a Cloud Workstation environment using a Dockerfile
    . Uses:

    Quantized models from

Top of the page

First / Previous / Next / Last / Page 1 of 0 SemanticScuttle - klotz.me: Tags: gguf + llama.cpp

About - Propulsed by SemanticScuttle