Based on the discussion, /u/septerium achieved the best performance for GLM 4.7 Flash (UD-Q6_K_XL) on an RTX 5090 using llama.cpp with the following settings (illustrative commands follow each list):
- GPU: NVIDIA RTX 5090.
- Throughput: 150 tokens/s.
- Model: UD-Q6_K_XL (Unsloth dynamic GGUF quant).
- Context size: 48,000 tokens (`--ctx-size 48000`), squeezed entirely into VRAM.
- Flash Attention: enabled (`-fa on`).
- GPU layers: 99 (`-ngl 99`) so the entire model runs on the GPU.
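Assembled into a single launch, those flags would look roughly like this. This is a sketch, not /u/septerium's verbatim command; the GGUF filename and port are placeholders, not from the thread:

```sh
# Sketch of a llama-server launch with the settings above.
# Model path and port are placeholder assumptions.
llama-server \
  --model ./GLM-4.7-Flash-UD-Q6_K_XL.gguf \
  --n-gpu-layers 99 \
  --flash-attn on \
  --ctx-size 48000 \
  --port 8080
```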
Sampler & inference parameters:
- Temperature: 0.7 (recommended by Unsloth for tool calls).
- Top-P: 1.0.
- Min-P: 0.01.
- Repeat Penalty: must be disabled, i.e. left at the neutral value of 1.0 (llama.cpp's default; users warned that other platforms might not disable it).
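These samplers can be pinned server-side at launch (llama-server also accepts them per request through its HTTP API). Again a sketch under the same placeholder assumptions; `--repeat-penalty 1.0` just makes the disabled default explicit:

```sh
# Same launch as above with the sampler settings pinned server-side.
# --repeat-penalty 1.0 is the neutral (disabled) value.
llama-server \
  --model ./GLM-4.7-Flash-UD-Q6_K_XL.gguf \
  --n-gpu-layers 99 --flash-attn on --ctx-size 48000 \
  --temp 0.7 \
  --top-p 1.0 \
  --min-p 0.01 \
  --repeat-penalty 1.0
```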