Tags: nvidia* + llm*


  1. The method, called KV Cache Transform Coding (KVTC), applies ideas from media compression formats like JPEG to shrink the key-value cache behind multi-turn AI systems, lowering GPU memory demands and speeding up time-to-first-token by up to 8x.
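    The transform-coding idea is easiest to see in miniature. The sketch below illustrates generic transform coding (an orthonormal DCT followed by uniform quantization, the JPEG recipe in 1-D), not NVIDIA's actual KVTC pipeline: decorrelate a cache row with a transform, store the coefficients as small integers, and invert at read time.

    ```python
    import math

    def dct(xs):
        # Orthonormal DCT-II of one vector: concentrates energy in a
        # few coefficients, which is what makes coarse quantization cheap.
        n = len(xs)
        out = []
        for k in range(n):
            s = sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                    for i, x in enumerate(xs))
            c = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            out.append(c * s)
        return out

    def idct(cs):
        # Inverse (DCT-III with matching normalization).
        n = len(cs)
        out = []
        for i in range(n):
            s = 0.0
            for k, c in enumerate(cs):
                a = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
                s += a * c * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            out.append(s)
        return out

    def compress(vec, bits=8):
        # Transform, then uniformly quantize coefficients to signed ints.
        coeffs = dct(vec)
        qmax = 2 ** (bits - 1) - 1
        scale = max(abs(c) for c in coeffs) / qmax or 1.0
        return [round(c / scale) for c in coeffs], scale

    def decompress(q, scale):
        return idct([c * scale for c in q])

    row = [math.sin(0.3 * i) for i in range(32)]  # toy "cache row"
    q, scale = compress(row)
    rec = decompress(q, scale)
    err = max(abs(a - b) for a, b in zip(row, rec))
    ```

    Storing the row as 8-bit integers plus a single scale cuts memory roughly 4x versus float32, at the price of the small reconstruction error measured in `err`.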
  2. The RTX 3090 offers a compelling combination of performance and 24GB of VRAM, making it a better choice for local LLM and AI workloads than newer Nvidia Blackwell GPUs like the RTX 5070 and even the RTX 5080, due to VRAM limitations and pricing.
    2026-02-07 by klotz
  3. Per the discussion, /u/septerium reported the best performance for GLM 4.7 Flash (UD-Q6_K_XL) on an RTX 5090 with these settings:
    - GPU: NVIDIA RTX 5090.
    - Throughput: ~150 tokens/s.
    - Context: 48k tokens fit entirely in VRAM.
    - Quantization: UD-Q6_K_XL (Unsloth GGUF).
    - Flash Attention: enabled (-fa on).
    - Context size: 48,000 (--ctx-size 48000).
    - GPU layers: 99 (-ngl 99), so the entire model runs on the GPU.
    - Sampler & inference parameters:
      - Temperature: 0.7 (recommended by Unsloth for tool calls).
      - Top-P: 1.0.
      - Min-P: 0.01.
      - Repeat penalty: must be disabled (llama.cpp does this by default, but users warned that other platforms might not).
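    Scripted, the reported configuration amounts to the following argument list (a sketch for launching llama-server from Python; the GGUF filename is a placeholder, not the thread's exact path):

    ```python
    # Placeholder model path; substitute the real GGUF file.
    MODEL = "GLM-4.7-Flash-UD-Q6_K_XL.gguf"

    cmd = [
        "llama-server",
        "-m", MODEL,
        "-fa", "on",            # flash attention enabled
        "-ngl", "99",           # offload every layer to the GPU
        "--ctx-size", "48000",  # 48k-token context held in VRAM
        "--temp", "0.7",        # Unsloth's tool-call recommendation
        "--top-p", "1.0",
        "--min-p", "0.01",
    ]
    # Hand cmd to subprocess.Popen(cmd) to actually start the server.
    ```

    No repeat-penalty flag appears because llama.cpp already disables it by default, per the thread.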
  4. NVIDIA AI releases Nemotron-Elastic-12B, a 12B-parameter reasoning model that embeds nested 9B and 6B variants in the same parameter space, so a single training job yields multiple model sizes.
  5. This blog post details how to build a natural language Bash agent using NVIDIA Nemotron Nano v2, requiring roughly 200 lines of Python code. It covers the core components, safety considerations, and offers both a from-scratch implementation and a simplified approach using LangGraph.
    2025-11-17 by klotz
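    The safety layer such an agent needs can be illustrated with a small allowlist check (a hypothetical helper, not the blog post's actual code): the agent refuses any command whose programs are not approved or that uses shell redirection or chaining.

    ```python
    import shlex

    # Hypothetical allowlist, not the blog post's actual policy.
    SAFE_COMMANDS = {"ls", "cat", "grep", "head", "tail", "wc", "pwd"}

    def is_safe(command: str) -> bool:
        """Return True only if every pipeline stage starts with an
        allowlisted program and no shell metacharacters appear."""
        # Reject redirection, chaining, and command substitution outright.
        if any(tok in command for tok in (">", "&", ";", "`", "$(")):
            return False
        for stage in command.split("|"):
            parts = shlex.split(stage)
            if not parts or parts[0] not in SAFE_COMMANDS:
                return False
        return True
    ```

    An agent loop would run the model's proposed command through `is_safe` before executing it, and ask the user to confirm anything rejected.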
  6. This discussion details performance benchmarks of llama.cpp on an NVIDIA DGX Spark, including tests for various models (gpt-oss-20b, gpt-oss-120b, Qwen3, Qwen2.5, Gemma, GLM) with different context depths and batch sizes.
    2025-10-15 by klotz
  7. Simon Willison received a preview unit of the NVIDIA DGX Spark, a desktop "AI supercomputer" retailing around $4,000. He details his experience setting it up and navigating the ecosystem, highlighting both the hardware's impressive specs (ARM64, 128GB RAM, Blackwell GPU) and the initial software challenges.

    Key takeaways:

    * **Hardware:** The DGX Spark is a compact, powerful machine aimed at AI researchers.
    * **Software Hurdles:** Initial setup was complicated by the need for ARM64-compatible software and CUDA configurations, though NVIDIA has significantly improved documentation recently.
    * **Tools & Ecosystem:** Claude Code was invaluable for troubleshooting. Ollama, `llama.cpp`, LM Studio, and vLLM are already gaining support for the Spark, indicating a growing ecosystem.
    * **Networking:** Tailscale simplifies remote access.
    * **Early Verdict:** It's too early to definitively recommend the device, but recent ecosystem improvements are promising.
    2025-10-15 by klotz
  8. Nvidia's DGX Spark is a relatively affordable AI workstation that prioritizes capacity over raw speed, enabling it to run models that consumer GPUs cannot. It features 128GB of memory and is based on the Blackwell architecture.
  9. Nvidia introduces the Rubin CPX GPU, designed to accelerate AI inference by decoupling the context and generation phases. It utilizes GDDR7 memory for lower cost and power consumption, aiming to redefine AI infrastructure.
  10. Canonical announced today that they will formally support the NVIDIA CUDA toolkit and also make it available via the Ubuntu repositories. This aims to simplify CUDA installation and usage on Ubuntu, particularly with the rise of AI development.
    2025-09-19 by klotz


SemanticScuttle - klotz.me: tagged with "nvidia+llm"
