Tags: inference* + llama.cpp*


  1. A detailed guide for running the new gpt-oss models locally with the best performance using `llama.cpp`. The guide covers a wide range of hardware configurations and provides CLI argument explanations and benchmarks for Apple Silicon devices.
  2. A user shares their experience running the gpt-oss-120b model on Ollama with an i7-6700, 64 GB of DDR4 RAM, an RTX 3090, and a 1 TB SSD. They note slow initial token generation but acceptable performance overall, showing that the model is usable on a relatively modest setup. The discussion covers comparisons with other hardware configurations, optimization via llama.cpp (see the benchmarking sketch after this list), and the model's output quality.

    >I have a 3090 with 64gb ddr4 3200 RAM and am getting around 50 t/s prompt processing speed and 15 t/s generation speed using the following:
    >
    >`llama-server -m <path to gpt-oss-120b> --ctx-size 32768 --temp 1.0 --top-p 1.0 --jinja -ub 2048 -b 2048 -ngl 99 -fa 'on' --n-cpu-moe 24`
    >This just about fills up my VRAM and RAM. For more wiggle room for other applications, use `--n-cpu-moe 26`.
  3. This document details how to run and fine-tune Gemma 3 models (1B, 4B, 12B, and 27B) using Unsloth, covering setup with Ollama and llama.cpp and addressing potential float16 precision issues. It also highlights Unsloth's unique ability to run Gemma 3 in float16 on machines such as Colab notebooks with Tesla T4 GPUs (a launch sketch follows this list).
  4. This document details how to run Qwen models locally using the Text Generation Web UI (oobabooga), covering installation, setup, and launching the web interface (a launch sketch appears at the end of this page).
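
For anyone trying to reproduce throughput figures like those quoted in bookmark 2, llama.cpp ships a `llama-bench` tool that reports prompt-processing (pp) and token-generation (tg) speed in tokens per second. A minimal sketch, assuming the same gpt-oss-120b GGUF and placeholder values for threads and GPU offload (adjust `-ngl`, `-t`, and the model path for your hardware):

`llama-bench -m <path to gpt-oss-120b> -p 512 -n 128 -ngl 99 -t 8`

Here `-p 512` times processing of a 512-token prompt and `-n 128` times generating 128 tokens; the resulting pp and tg rows correspond to the ~50 t/s and ~15 t/s figures quoted above. If a full GPU offload does not fit in VRAM, check `llama-bench --help` for the CPU-offload options available in your build.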

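Relating to bookmark 3: Unsloth publishes GGUF conversions of Gemma 3 that llama.cpp can serve directly. A minimal sketch, assuming a repository name of `unsloth/gemma-3-12b-it-GGUF` and a Q4_K_M quant (both names are assumptions; substitute whatever Unsloth actually publishes for the size you want):

`llama-server -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_M --ctx-size 8192 -ngl 99 --jinja`

The `-hf` flag downloads the GGUF from Hugging Face on first use. On Ollama, the equivalent would be something like `ollama run gemma3:12b`, again assuming that tag is available in the Ollama library.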

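Relating to bookmark 4: at the time of writing, the Text Generation Web UI can be launched from a clone of `oobabooga/text-generation-webui` roughly as follows (script and flag names vary between releases, so treat this as a sketch and check the project README):

`./start_linux.sh --listen --api`

The one-click start script sets up its own Python environment on first run; `--listen` exposes the UI beyond localhost and `--api` enables the API endpoint.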
