klotz: septerium* + nvidia*


  1. According to the discussion, /u/septerium reported their best performance for GLM 4.7 Flash (UD-Q6_K_XL) on an RTX 5090 with the following settings and parameters:
    - GPU: NVIDIA RTX 5090.
    - Throughput: 150 tokens/s.
    - Quantization: UD-Q6_K_XL (Unsloth quantized GGUF).
    - Flash Attention: Enabled (-fa on).
    - Context Size: 48,000 tokens (--ctx-size 48000), squeezed into VRAM.
    - GPU Layers: 99 (-ngl 99) to ensure the entire model runs on the GPU.
    - Sampler & Inference Parameters:
      - Temperature: 0.7 (recommended by Unsloth for tool calls).
      - Top-P: 1.0.
      - Min-P: 0.01.
      - Repeat Penalty: Must be disabled (llama.cpp disables it by default, but users warned that other platforms might not).
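
The settings above can be collected into a single llama.cpp invocation. The following is a minimal sketch, not the poster's actual command: the `llama-server` binary name, the model file name, and the sampler flags (`--temp`, `--top-p`, `--min-p`, `--repeat-penalty`) are assumptions, since the summary only quotes `-fa on`, `--ctx-size 48000`, and `-ngl 99` directly. A repeat penalty of 1.0 is used here to mean "disabled".

```python
import shlex


def build_llama_command(model_path, binary="llama-server"):
    """Return an argv list for llama.cpp using the settings from the summary.

    `binary` and `model_path` are hypothetical placeholders; adjust them to
    match the local installation and the downloaded GGUF file.
    """
    return [
        binary,
        "-m", model_path,
        # Flags quoted in the summary
        "-fa", "on",              # Flash Attention enabled
        "--ctx-size", "48000",    # 48k-token context
        "-ngl", "99",             # offload all layers to the GPU
        # Sampler flags: assumed llama.cpp spellings for the listed values
        "--temp", "0.7",
        "--top-p", "1.0",
        "--min-p", "0.01",
        "--repeat-penalty", "1.0",  # 1.0 leaves the repeat penalty disabled
    ]


if __name__ == "__main__":
    # Hypothetical file name, for illustration only
    cmd = build_llama_command("GLM-4.7-Flash-UD-Q6_K_XL.gguf")
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each flag and value separate, so it can be passed to `subprocess.run` without shell quoting issues.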
