Tags: gpt-oss + inference

  1. A detailed guide to running the new gpt-oss models locally with the best performance using `llama.cpp`. The guide covers a wide range of hardware configurations and provides CLI argument explanations and benchmarks for Apple Silicon devices. A minimal example invocation is sketched after this list.
  2. oLLM is a Python library for running large-context Transformers on NVIDIA GPUs by offloading weights and KV-cache to SSDs. It supports models like Llama-3, GPT-OSS-20B, and Qwen3-Next-80B, enabling up to 100K tokens of context on 8-10 GB GPUs without quantization.
  3. A user shares their experience running the gpt-oss-120b model on Ollama with an i7-6700, 64 GB of DDR4 RAM, an RTX 3090, and a 1 TB SSD. They note slow initial token generation but acceptable performance overall, highlighting that such a run is possible on a relatively modest setup. The discussion includes comparisons with other hardware configurations, optimization via llama.cpp, and the model's output quality; a benchmark sketch follows this list.

    >I have a 3090 with 64gb ddr4 3200 RAM and am getting around 50 t/s prompt processing speed and 15 t/s generation speed using the following:
    >
    >`llama-server -m <path to gpt-oss-120b> --ctx-size 32768 --temp 1.0 --top-p 1.0 --jinja -ub 2048 -b 2048 -ngl 99 -fa 'on' --n-cpu-moe 24`
    > This about fills up my VRAM and RAM almost entirely. For more wiggle room for other applications use `--n-cpu-moe 26`.
  4. OpenAI releases gpt-oss-120b and gpt-oss-20b, two state-of-the-art open-weight language models that deliver strong real-world performance at low cost. They outperform similarly sized open models on reasoning tasks and are optimized for efficient deployment.
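For item 1, a minimal single-GPU invocation in the spirit of that guide might look like the sketch below. This is not taken from the guide itself: the Hugging Face repo name `ggml-org/gpt-oss-20b-GGUF` and the context size are assumptions, so substitute a local GGUF path via `-m` or your own values as needed.

```
# Minimal sketch (assumed repo name and context size -- adjust to your setup).
# --jinja applies the model's bundled chat template; -ngl 99 offloads all layers
# to the GPU; --temp 1.0 --top-p 1.0 match the sampling settings quoted above.
llama-server -hf ggml-org/gpt-oss-20b-GGUF \
  --ctx-size 16384 --temp 1.0 --top-p 1.0 --jinja \
  -ngl 99 -fa on --port 8080
```

The larger gpt-oss-120b does not fit in 24 GB of VRAM, which is why the command quoted under item 3 adds `--n-cpu-moe 24`: it keeps the MoE expert weights of the first 24 layers on the CPU so that attention and the remaining experts fit on the GPU.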
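For item 3, throughput figures like the quoted 50 t/s prompt processing and 15 t/s generation can be measured with `llama-bench`, which ships with llama.cpp. A minimal sketch, assuming a local GGUF file (the path below is a placeholder):

```
# Reports prompt-processing (pp512) and token-generation (tg128) throughput
# in tokens per second for the given model and GPU offload setting.
llama-bench -m /path/to/gpt-oss-120b.gguf -ngl 99 -p 512 -n 128
```

Default llama-bench settings will not reproduce the quoted numbers exactly unless the same batch sizes and CPU/GPU MoE split are applied.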
