This article details the journey of deploying an on-premise Large Language Model (LLM) server, focusing on security considerations. It explores the rationale behind on-premise deployment for privacy and data control, outlining the goals of creating an air-gapped, isolated infrastructure. The authors delve into the hardware selection process, choosing components like an Nvidia RTX Pro 6000 Max-Q for its memory capacity. The deployment process starts with a minimal setup using llama.cpp, then progresses to containerization with Podman and the use of CDI for GPU access. Finally, the article discusses hardening techniques, including kernel module management and file permission restrictions, to minimize the attack surface and enhance security.
>The method, called KV Cache Transform Coding (KVTC), applies ideas from media compression formats like JPEG to shrink the key-value cache behind multi-turn AI systems, lowering GPU memory demands and speeding up time-to-first-token by up to 8x.
CUDA 13.2 brings full support for CUDA Tile on Ampere, Ada, and Blackwell architectures, alongside enhancements to cuTile Python including recursive functions, closures, and custom reductions. Core updates include improved memory transfer APIs, reduced LMEM footprint in Windows, and a shift to MCDM for better compatibility. Math libraries gain experimental Grouped GEMM with MXFP8 and FP64-emulated cuSOLVERD. Developer tools see updates to Nsight Python, Nsight Compute, and Nsight Systems, alongside a modern C++ runtime in CCCL 3.2. CuPy also gains support for CUDA 13 and stream sharing.
NVIDIA GTC is the premier AI conference and exhibition. Learn about the latest advancements in AI, deep learning, and accelerated computing. Includes keynote speakers, sessions, workshops, and an exhibit hall.
The RTX 3090 offers a compelling combination of performance and 24GB of VRAM, making it a better choice for local LLM and AI workloads than newer Nvidia Blackwell GPUs like the RTX 5070 and even the RTX 5080, due to VRAM limitations and pricing.
Based on the discussion, /u/septerium achieved optimal performance for GLM 4.7 Flash (UD-Q6_K_XL) on an RTX 5090 using these specific settings and parameters:
- GPU: NVIDIA RTX 5090.
- 150 tokens/s
- Context: 48k tokens squeezed into VRAM.
- UD-Q6_K_XL (Unsloth quantized GGUF).
- Flash Attention: Enabled (-fa on).
- Context Size: 48,000 (--ctx-size 48000).
- GPU Layers: 99 (-ngl 99) to ensure the entire model runs on the GPU.
- Sampler & Inference Parameters
- Temperature: 0.7 (recommended by Unsloth for tool calls).
- Top-P: 1.0.
- Min-P: 0.01.
- Repeat Penalty: Must be disabled (llama.cpp does this by default, but users warned other platforms might not).
CUDA Tile is a new Python package that simplifies GPU programming by automatically tiling loops, handling data transfer, and optimizing memory access. It allows developers to write concise and readable code that leverages the full power of NVIDIA GPUs without needing to manually manage the complexities of parallel programming.
NVIDIA Nemotron Parse v1.1 is designed to understand document semantics and extract text and tables elements with spatial grounding. It transforms unstructured documents into actionable and machine-usable representations.
A new patch enables Nvidia GPU support on Raspberry Pi 5 and Rockchip devices, allowing for GPU-accelerated compute tasks. The article details the setup process, performance testing with llama.cpp, and current limitations with display output.
NVIDIA AI releases Nemotron-Elastic-12B, a 12B parameter reasoning model that embeds nested 9B and 6B variants in the same parameter space, allowing for multiple model sizes from a single training job.