klotz: inference* + nvidia*


  1. Based on the discussion, /u/septerium achieved the best performance for GLM 4.7 Flash (UD-Q6_K_XL) on an RTX 5090 using the following settings and parameters (a command-line sketch follows the list):
    - GPU: NVIDIA RTX 5090.
    - Throughput: 150 tokens/s.
    - Context: 48k tokens held entirely in VRAM.
    - Quantization: UD-Q6_K_XL (Unsloth GGUF quant).
    - Flash Attention: Enabled (-fa on).
    - Context Size: 48,000 (--ctx-size 48000).
    - GPU Layers: 99 (-ngl 99) to ensure the entire model runs on the GPU.
    - Sampler and inference parameters:
    - Temperature: 0.7 (recommended by Unsloth for tool calls).
    - Top-P: 1.0.
    - Min-P: 0.01.
    - Repeat Penalty: Must be disabled; llama.cpp disables it by default, but users warned that other platforms might not.
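    A minimal llama-server command assembling these settings (a sketch: the GGUF filename is a placeholder, and the sampler flags --temp, --top-p, --min-p, and --repeat-penalty are standard llama.cpp options added here rather than quoted from the thread; --repeat-penalty 1.0 leaves the penalty disabled):

      # Serve the quantized GLM model fully on the GPU with flash attention and a 48k context
      llama-server -m GLM-Flash-UD-Q6_K_XL.gguf \
        -fa on --ctx-size 48000 -ngl 99 \
        --temp 0.7 --top-p 1.0 --min-p 0.01 --repeat-penalty 1.0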
  2. Nvidia introduces the Rubin CPX GPU, designed to accelerate AI inference by decoupling the context and generation phases. It utilizes GDDR7 memory for lower cost and power consumption, aiming to redefine AI infrastructure.
  3. Running GenAI models is easy. Scaling them to thousands of users, not so much. This guide details avenues for scaling AI workloads from proofs of concept to production-ready deployments, covering API integration, on-prem deployment considerations, hardware requirements, and tools like vLLM and Nvidia NIMs.
  4. NVIDIA DGX Spark is a desktop-friendly AI supercomputer powered by the NVIDIA GB10 Grace Blackwell Superchip, delivering 1000 AI TOPS of performance with 128GB of memory. It is designed for prototyping, fine-tuning, and inference of large AI models.
  5. The NVIDIA Jetson Orin Nano Super is highlighted as a compact, powerful computing solution for edge AI applications. It enables sophisticated AI capabilities at the edge, supporting large-scale inference tasks with the help of high-capacity storage solutions like the Solidigm 122.88TB SSD. This review explores its use in various applications including wildlife conservation, surveillance, and AI model distribution, emphasizing its potential in real-world deployments.
