klotz: inference* + llm*


  1. Investigation into the effect of DDR5 speed on local LLM inference speed.
  2. d-Matrix's Corsair platform aims to transform the economics of large-scale AI inference, promising fast, commercially viable, and sustainable performance at scale.
    2025-01-26 by klotz
  3. The article discusses the importance of fine-tuning machine learning models for optimal inference performance and explores popular tools like vLLM, TensorRT, ONNX Runtime, TorchServe, and DeepSpeed.
  4. Simon Willison explains how to use the mistral.rs library in Rust to run the Llama Vision model on a Mac M2 laptop. He provides a detailed example and discusses the memory usage and GPU utilization.
  5. TabbyAPI is a FastAPI-based application for generating text with a large language model (LLM) via the ExLlamaV2 backend. It supports various model types and features such as HuggingFace model downloading and embedding model support.
    2024-09-25 by klotz
  6. This article explains how to accurately quantize a Large Language Model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers using an importance matrix (imatrix) and K-Quantization method with Gemma 2 Instruct as an example, while highlighting its applicability to other models like Qwen2, Llama 3, and Phi-3.
    2024-09-14 by klotz
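The imatrix + K-quantization workflow described in that article maps onto two llama.cpp command-line tools. The sketch below only prints typical invocations; the file names and paths are illustrative placeholders, not taken from the article:

```python
# Illustrative llama.cpp quantization workflow (file names are placeholders).
# Step 1: compute an importance matrix (imatrix) from a calibration corpus.
imatrix_cmd = [
    "llama-imatrix",
    "-m", "gemma-2-9b-it-f16.gguf",   # full-precision source model
    "-f", "calibration.txt",          # calibration text
    "-o", "imatrix.dat",              # importance matrix output
]

# Step 2: quantize to a K-quant type, guided by the importance matrix.
quantize_cmd = [
    "llama-quantize",
    "--imatrix", "imatrix.dat",
    "gemma-2-9b-it-f16.gguf",
    "gemma-2-9b-it-Q4_K_M.gguf",
    "Q4_K_M",                         # K-quantization type
]

for cmd in (imatrix_cmd, quantize_cmd):
    print(" ".join(cmd))
```

The same two-step shape applies to the other models the article mentions (Qwen2, Llama 3, Phi-3); only the model files change.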
  7. Inference.net offers LLM inference tokens for models like Llama 3.1 at a 50-90% discount from other providers. They aggregate unused compute resources from data centers to offer fast, reliable, and affordable inference services.

    "inference.net is a wholesaler of LLM inference tokens for models like Llama 3.1. We provide inference services at a 50-90% discount from what you would pay together.ai or groq."

    "We sell tokens in 10 billion token increments. The current cost per 10 billion tokens for an 8B model is $200."
    2024-08-14 by klotz
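A quick sanity check on that quote, using only the numbers given above: $200 per 10 billion tokens works out to two cents per million tokens.

```python
# inference.net's quoted rate: $200 per 10 billion tokens for an 8B model.
price_usd = 200
tokens = 10_000_000_000

cost_per_million = price_usd / tokens * 1_000_000
print(f"${cost_per_million:.2f} per million tokens")  # $0.02 per million tokens
```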
  8. The author explores using Gemma 2 with Mozilla's llamafile on AWS Lambda for serverless AI inference.
    2024-07-08 by klotz
  9. Explore the best LLM inference engines and servers available to deploy and serve LLMs in production, including vLLM, TensorRT-LLM, Triton Inference Server, RayLLM with RayServe, and HuggingFace Text Generation Inference.
    2024-06-21 by klotz
  10. Podman AI Lab makes it easy to work with Large Language Models (LLMs) on a local developer workstation. It provides a catalog of recipes and a curated list of open-source models, and lets you experiment with and compare models.
    2024-05-11 by klotz


SemanticScuttle - klotz.me: Tags: inference + llm
