Tags: inference*

10 bookmark(s), sorted by date (descending)

  1. A deep dive into the process of LLM inference, covering tokenization, transformer architecture, KV caching, and optimization techniques for efficient text generation. (A toy KV-cache sketch follows this list.)
  2. A visual introduction to probability and statistics, covering basic probability, compound probability, probability distributions, frequentist inference, Bayesian inference, and regression analysis. Created by Daniel Kunin and team with interactive visualizations using D3.js.
  3. This tutorial guides you through installing and using an inference snap, specifically Qwen 2.5 VL, a multi-modal large language model. It covers installation, status checks, basic chat, and configuring Open WebUI for image-based prompts.
  4. Canonical today announced optimized inference snaps, a new way to deploy AI models on Ubuntu devices, with automatic selection of engines, quantizations, and architectures based on the device's specific silicon.
    2025-10-31 by klotz
  5. On October 23rd, we announced the beta availability of silicon-optimized AI models in Ubuntu. Developers can locally install DeepSeek R1 and Qwen 2.5 VL with a single command, benefiting from maximized hardware performance and automated dependency management.
    2025-10-31 by klotz
  6. This article details the performance of Unsloth Dynamic GGUFs on the Aider Polyglot benchmark, showcasing how it can quantize LLMs like DeepSeek-V3.1 to as low as 1-bit while outperforming models like GPT-4.5 and Claude-4-Opus. It also covers benchmark setup, comparisons to other quantization methods, and chat template bug fixes.
  7. Nvidia introduces the Rubin CPX GPU, designed to accelerate AI inference by decoupling the context and generation phases. It utilizes GDDR7 memory for lower cost and power consumption, aiming to redefine AI infrastructure.
  8. A detailed guide for running the new gpt-oss models locally with the best performance using `llama.cpp`. The guide covers a wide range of hardware configurations and provides CLI argument explanations and benchmarks for Apple Silicon devices. (A hedged Python sketch of the equivalent settings follows this list.)
  9. oLLM is a Python library for running large-context Transformers on NVIDIA GPUs by offloading weights and KV-cache to SSDs. It supports models like Llama-3, GPT-OSS-20B, and Qwen3-Next-80B, enabling up to 100K tokens of context on 8-10 GB GPUs without quantization.
  10. A unified memory stack that functions as a memristor as well as a ferroelectric capacitor is reported, enabling both energy-efficient inference and learning at the edge.
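
Item 1 above mentions KV caching. As a quick illustration (not drawn from the linked article), the toy single-head attention below caches each token's key and value so that every decoding step only attends with the newest query instead of re-encoding the whole prefix; the dimensions and random projection weights are illustrative assumptions.

```python
# Minimal sketch of why a KV cache speeds up autoregressive decoding:
# keys/values for past tokens are stored once, so each new token only
# computes attention for its own query. Toy sizes and weights throughout.
import numpy as np

d = 8                               # model/head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

k_cache, v_cache = [], []           # grows by one row per generated token

def decode_step(x_new):
    """Attend the newest token (shape [d]) over all cached positions."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)      # cache K/V instead of recomputing them
    v_cache.append(x_new @ Wv)
    K = np.stack(k_cache)           # [t, d]
    V = np.stack(v_cache)           # [t, d]
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V                 # context vector for the new position

# Feed a short sequence one token at a time, as a decoder would.
for token_embedding in rng.standard_normal((5, d)):
    out = decode_step(token_embedding)
print("last context vector:", out.round(3))
```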
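Item 8 above refers to the `llama.cpp` CLI. The sketch below is only an approximation using the llama-cpp-python bindings rather than the CLI the guide covers; the GGUF filename, context size, and offload values are placeholder assumptions, not settings taken from the guide.

```python
# Hedged sketch: analogous knobs via llama-cpp-python instead of the llama.cpp CLI.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b.Q4_K_M.gguf",  # placeholder path to a local GGUF file
    n_ctx=8192,                            # context window (analogous to -c/--ctx-size)
    n_gpu_layers=-1,                       # offload all layers to the GPU (analogous to -ngl)
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what a KV cache does."}],
    max_tokens=128,
    temperature=0.7,
)
print(resp["choices"][0]["message"]["content"])
```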
