This article details how to run a 120B-parameter LLM locally with 24GB of VRAM and 64GB of system RAM, using a setup built on Proxmox LXCs, with Whisper for voice transcription and Home Assistant integration for smart home automation.
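The Whisper piece of that pipeline is easy to try in isolation; below is a minimal sketch using the open-source `whisper` package, where the audio filename and model size are placeholders rather than the author's exact configuration:

```python
# Minimal transcription sketch with the open-source Whisper package.
# "command.wav" is a placeholder recording; pick a model size that fits your VRAM.
import whisper

model = whisper.load_model("small")       # loads onto the GPU if one is available
result = model.transcribe("command.wav")  # returns a dict with the decoded text
print(result["text"])                     # e.g. hand this off to Home Assistant
```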
This article details how the author successfully ran OpenAI's Codex CLI against a gpt-oss:120b model hosted on an NVIDIA DGX Spark, accessed through a Tailscale network. It covers Tailscale setup, Ollama configuration, and running the Codex CLI against the remote model, including building a Space Invaders game.
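Since Ollama exposes an OpenAI-compatible `/v1` endpoint, the remote-model half of that setup can be sanity-checked with a plain client before wiring up Codex. A rough sketch follows; the Tailscale hostname `dgx-spark` is a placeholder, and it uses the generic `openai` Python client rather than the Codex CLI's own configuration:

```python
# Sketch: talk to an Ollama-hosted gpt-oss:120b over a Tailscale network.
# "dgx-spark" is a placeholder MagicDNS hostname; Ollama listens on port 11434
# by default and must be told to accept non-localhost connections
# (e.g. OLLAMA_HOST=0.0.0.0) on the serving machine.
from openai import OpenAI

client = OpenAI(
    base_url="http://dgx-spark:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # Ollama ignores the key; the client requires one
)

resp = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Write a Space Invaders game in Python."}],
)
print(resp.choices[0].message.content)
```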
A detailed guide to getting the best performance from the new gpt-oss models locally using `llama.cpp`. It covers a wide range of hardware configurations, explains the relevant CLI arguments, and includes benchmarks for Apple Silicon devices.
oLLM is a Python library for running large-context Transformers on NVIDIA GPUs by offloading weights and KV-cache to SSDs. It supports models like Llama-3, GPT-OSS-20B, and Qwen3-Next-80B, enabling up to 100K tokens of context on 8-10 GB GPUs without quantization.
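oLLM's own API isn't reproduced here; as a rough stand-in for the same idea of spilling weights that don't fit in VRAM out to fast storage, Hugging Face Transformers with Accelerate can offload layers to a folder on an SSD (unlike oLLM it does not offload the KV cache; the model id and path are only examples):

```python
# Not oLLM's API: a generic illustration of disk offload with Hugging Face
# Transformers/Accelerate. Layers that don't fit on the GPU (or in RAM) are
# served from the offload folder on an SSD.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # example; any causal LM repo works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",                   # fill GPU first, then CPU, then disk
    offload_folder="/mnt/nvme/offload",  # placeholder path on a fast SSD
)

inputs = tok("Summarize this contract:", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```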
An in-depth look at the architecture of OpenAI's GPT-OSS models, detailing tokenization, embeddings, transformer blocks, Mixture of Experts, attention mechanisms (GQA and RoPE), and quantization techniques.
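To make the grouped-query attention (GQA) part concrete, here is a toy PyTorch sketch with arbitrary example dimensions (not the actual gpt-oss sizes):

```python
# Toy sketch of grouped-query attention (GQA): fewer K/V heads than Q heads,
# with each K/V head shared by a group of query heads.
import torch
import torch.nn.functional as F

batch, seq, d_model = 2, 16, 256
n_q_heads, n_kv_heads = 8, 2              # 4 query heads share each K/V head
head_dim = d_model // n_q_heads

x = torch.randn(batch, seq, d_model)
w_q = torch.nn.Linear(d_model, n_q_heads * head_dim, bias=False)
w_k = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
w_v = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)

q = w_q(x).view(batch, seq, n_q_heads, head_dim).transpose(1, 2)   # (B, Hq, T, D)
k = w_k(x).view(batch, seq, n_kv_heads, head_dim).transpose(1, 2)  # (B, Hkv, T, D)
v = w_v(x).view(batch, seq, n_kv_heads, head_dim).transpose(1, 2)

# Expand each K/V head across its query group, then run standard causal attention.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)       # (B, Hq, T, D)
out = out.transpose(1, 2).reshape(batch, seq, n_q_heads * head_dim)
print(out.shape)  # torch.Size([2, 16, 256])
```

The memory win is in the K/V projections: the KV cache only stores `n_kv_heads` heads per layer instead of `n_q_heads`.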
A user shares their experience running the GPT-OSS 120B model on Ollama with an i7-6700, 64GB of DDR4 RAM, an RTX 3090, and a 1TB SSD. They note slow initial token generation but acceptable performance overall, highlighting that it's possible on a relatively modest setup. The discussion includes comparisons with other hardware configurations, optimization techniques (llama.cpp), and the model's quality.
> I have a 3090 with 64gb ddr4 3200 RAM and am getting around 50 t/s prompt processing speed and 15 t/s generation speed using the following:
>
> `llama-server -m <path to gpt-oss-120b> --ctx-size 32768 --temp 1.0 --top-p 1.0 --jinja -ub 2048 -b 2048 -ngl 99 -fa 'on' --n-cpu-moe 24`
>
> This about fills up my VRAM and RAM almost entirely. For more wiggle room for other applications use `--n-cpu-moe 26`.
This blog post details a fine-tuning workflow for the gpt-oss model that recovers post-training accuracy while retaining the performance benefits of FP4. It involves supervised fine-tuning (SFT) on an upcasted BF16 version of the model, followed by quantization-aware training (QAT) using NVIDIA TensorRT Model Optimizer. The article also discusses the benefits of using NVFP4 for even better convergence and accuracy recovery.
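The QAT pattern itself is easy to illustrate generically; the sketch below shows fake-quantized weights in the forward pass with a straight-through estimator for the gradients, which conveys the general idea rather than the TensorRT Model Optimizer API or the exact FP4/NVFP4 formats used in the post:

```python
# Generic illustration of quantization-aware training (QAT): the forward pass sees
# fake-quantized weights, while gradients flow to the full-precision master weights
# via a straight-through estimator (STE). Not the TensorRT Model Optimizer API.
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized to a small symmetric grid."""
    def __init__(self, in_f, out_f, levels=16):
        super().__init__(in_f, out_f, bias=False)
        self.levels = levels  # 16 levels as a stand-in for a 4-bit format

    def forward(self, x):
        w = self.weight
        scale = w.abs().max() / (self.levels / 2 - 1) + 1e-8
        w_q = torch.round(w / scale).clamp(-self.levels // 2, self.levels // 2 - 1) * scale
        # STE: use quantized weights in the forward pass, but let gradients
        # pass through unchanged to the full-precision master weights.
        w_ste = w + (w_q - w).detach()
        return x @ w_ste.t()

layer = FakeQuantLinear(64, 64)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-3)
x, target = torch.randn(8, 64), torch.randn(8, 64)
loss = ((layer(x) - target) ** 2).mean()
loss.backward()   # gradients reach the BF16/FP32 master weights
opt.step()
```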
OpenAI's release of GPT-OSS marks their first major open source LLM since GPT-2, featuring improvements in reasoning, tool usage, and problem-solving capabilities. The article explores its architecture, message formatting, reasoning modes, and tokenizer details.
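A small sketch of rendering a gpt-oss prompt through the model's bundled chat template with Hugging Face Transformers; the `Reasoning: high` system line reflects the harmony-format convention for selecting a reasoning mode (treat the exact wording as an assumption and check the model card), and the model id is only an example:

```python
# Sketch: rendering a gpt-oss prompt with the model's own chat template.
# The "Reasoning: high" system line follows the harmony-format convention for
# reasoning modes as described in the article; verify against the model card.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "How many r's are in 'strawberry'?"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # shows the special tokens, roles, and channels the article describes
```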
A user demonstrates how to run a 120B model efficiently on hardware with only 8GB of VRAM by offloading the MoE expert layers to the CPU and keeping only the attention layers on the GPU, achieving high performance with minimal VRAM usage.