Tags: gpu*

0 bookmark(s) - Sort by: Date ↓ / Title /

  1. This article details the journey of deploying an on-premise Large Language Model (LLM) server, focusing on security considerations. It explores the rationale behind on-premise deployment for privacy and data control, outlining the goals of creating an air-gapped, isolated infrastructure. The authors delve into the hardware selection process, choosing components like an Nvidia RTX Pro 6000 Max-Q for its memory capacity. The deployment process starts with a minimal setup using llama.cpp, then progresses to containerization with Podman and the use of CDI for GPU access. Finally, the article discusses hardening techniques, including kernel module management and file permission restrictions, to minimize the attack surface and enhance security.
  2. >The method, called KV Cache Transform Coding (KVTC), applies ideas from media compression formats like JPEG to shrink the key-value cache behind multi-turn AI systems, lowering GPU memory demands and speeding up time-to-first-token by up to 8x.
  3. Google AI introduces STATIC, a sparse matrix framework that accelerates constrained decoding for LLM-based generative retrieval. It addresses the inefficiency of traditional trie implementations on hardware accelerators by flattening the trie into a static Compressed Sparse Row (CSR) matrix, achieving up to 948x speedup and demonstrating improvements in YouTube video recommendations.
  4. A user is experiencing slow performance with Qwen3-Coder-Next on their local system despite having a capable setup. They are using a tensor-split configuration with two GPUs (RTX 5060 Ti and RTX 3060) and are seeing speeds between 2-15 tokens/second, with high swap usage. The post details their hardware, parameters used, and seeks advice on troubleshooting the issue.
  5. The RTX 3090 offers a compelling combination of performance and 24GB of VRAM, making it a better choice for local LLM and AI workloads than newer Nvidia Blackwell GPUs like the RTX 5070 and even the RTX 5080, due to VRAM limitations and pricing.
    2026-02-07 Tags: , , , , , , , , , by klotz
  6. This blog post details how to implement high-performance matrix multiplication using NVIDIA cuTile, focusing on Tile loading, computation, storage, and block-level parallel programming. It also covers best practices for Tile programming and performance optimization strategies.
  7. CUDA Tile is a new Python package that simplifies GPU programming by automatically tiling loops, handling data transfer, and optimizing memory access. It allows developers to write concise and readable code that leverages the full power of NVIDIA GPUs without needing to manually manage the complexities of parallel programming.
  8. A user shares their optimal settings for running the gpt-oss-120b model on a system with dual RTX 3090 GPUs and 128GB of RAM, aiming for a balance between performance and quality.
  9. A new patch enables Nvidia GPU support on Raspberry Pi 5 and Rockchip devices, allowing for GPU-accelerated compute tasks. The article details the setup process, performance testing with llama.cpp, and current limitations with display output.
  10. This post explores a new idea for parallelizing a simplified parsing task using a "stack monoid" and scan operations, potentially enabling efficient GPU implementation of parsing algorithms.

Top of the page

First / Previous / Next / Last / Page 1 of 0 SemanticScuttle - klotz.me: tagged with "gpu"

About - Propulsed by SemanticScuttle