Based on the discussion, /u/septerium achieved the best performance for GLM 4.7 Flash (UD-Q6_K_XL) on an RTX 5090 using llama.cpp with the following settings (illustrative commands follow each list):
- GPU: NVIDIA RTX 5090.
- Throughput: 150 tokens/s.
- Model: UD-Q6_K_XL (Unsloth dynamic GGUF quant).
- Context size: 48,000 tokens (`--ctx-size 48000`), squeezed entirely into VRAM.
- Flash Attention: enabled (`-fa on`).
- GPU layers: 99 (`-ngl 99`) so the entire model runs on the GPU.
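Assembled into a single launch, those flags would look roughly like this. This is a sketch, not /u/septerium's verbatim command; the GGUF filename and port are placeholders, not from the thread:

```sh
# Sketch of a llama-server launch with the settings above.
# Model path and port are placeholder assumptions.
llama-server \
  --model ./GLM-4.7-Flash-UD-Q6_K_XL.gguf \
  --n-gpu-layers 99 \
  --flash-attn on \
  --ctx-size 48000 \
  --port 8080
```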
Sampler & inference parameters:
- Temperature: 0.7 (recommended by Unsloth for tool calls).
- Top-P: 1.0.
- Min-P: 0.01.
- Repeat Penalty: must be disabled, i.e. left at the neutral value of 1.0 (llama.cpp's default; users warned that other platforms might not disable it).
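These samplers can be pinned server-side at launch (llama-server also accepts them per request through its HTTP API). Again a sketch under the same placeholder assumptions; `--repeat-penalty 1.0` just makes the disabled default explicit:

```sh
# Same launch as above with the sampler settings pinned server-side.
# --repeat-penalty 1.0 is the neutral (disabled) value.
llama-server \
  --model ./GLM-4.7-Flash-UD-Q6_K_XL.gguf \
  --n-gpu-layers 99 --flash-attn on --ctx-size 48000 \
  --temp 0.7 \
  --top-p 1.0 \
  --min-p 0.01 \
  --repeat-penalty 1.0
```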