Tags: inference*


  1. This repository provides the official implementation of the STATIC (Sparse Transition-Accelerated Trie Index for Constrained decoding) framework, as described in Su et al., 2026. STATIC is a high-performance method for constraining autoregressive decoding from large language models to a prespecified set of outputs, designed for efficiency on modern hardware accelerators such as GPUs and TPUs.
  2. This article details benchmarks for Unsloth Dynamic GGUFs of the Qwen3.5 model, including perplexity and KL-divergence analysis and comparisons with MXFP4. It covers performance across different bit widths and quant types, highlighting the impact of Imatrix and the limitations of certain quantization approaches. Full benchmark data is also provided.
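For context on the metrics mentioned: KL divergence between the full-precision model's next-token distribution p and the quantized model's distribution q measures how far quantization shifts the model's predictions (0 means identical). A small illustrative computation, with two invented distributions:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats, for discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented next-token distributions: full-precision vs. quantized model.
p = [0.70, 0.20, 0.10]
q = [0.65, 0.25, 0.10]
print(kl_divergence(p, q))  # small positive number
print(kl_divergence(p, p))  # 0.0 for identical distributions
```

Benchmark suites like the one described average this quantity over many contexts to rank quant types.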
  3. Announcement that ggml.ai is joining Hugging Face to ensure the long-term sustainability and progress of the ggml/llama.cpp community and Local AI. Highlights continued open-source development, improved user experience, and integration with the Hugging Face transformers library.
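The guide in item 7 below covers tool calling with local LLMs; as a general illustration of the pattern it describes, an OpenAI-compatible tool call means advertising a JSON schema for each function and dispatching the model's returned tool calls back to local code. A minimal sketch with a made-up `add` function and a simulated tool call, so no running server is needed:

```python
import json

# Tool schema in the OpenAI function-calling format, as accepted by
# OpenAI-compatible endpoints such as llama-server's. `add` is made up.
tools = [{
    "type": "function",
    "function": {
        "name": "add",
        "description": "Add two numbers.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number"},
                "b": {"type": "number"},
            },
            "required": ["a", "b"],
        },
    },
}]

def add(a, b):
    return a + b

DISPATCH = {"add": add}

def run_tool_call(tool_call):
    """Execute one tool call of the shape the chat API returns."""
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])
    return DISPATCH[fn["name"]](**args)

# Simulated fragment of a model response (normally read from the API reply).
fake_call = {"function": {"name": "add", "arguments": '{"a": 2, "b": 3}'}}
print(run_tool_call(fake_call))  # 5
```

The real loop sends `tools` with the chat request, executes each returned tool call this way, and appends the results as `tool` messages.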
  4. The open-source AI landscape is rapidly evolving, and recent developments surrounding GGML and Llama.cpp are significant for those interested in running large language models locally. GGML, a C library for machine learning, has joined Hugging Face, ensuring its continued development and accessibility. Meanwhile, Llama.cpp, a project focused on running Llama models on CPUs, remains open-source and is finding a stable home. This article details these changes, the implications for local AI enthusiasts, and the benefits of an open ecosystem.
  5. A terminal tool that right-sizes LLM models to your system's RAM, CPU, and GPU. Detects your hardware, scores each model across quality, speed, fit, and context dimensions, and tells you which ones will actually run well on your machine.
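The tool's scoring internals aren't described in the summary, but its "fit" dimension can be illustrated with a rough memory estimate: parameter count times bytes per weight at a given quantization, plus overhead for the KV cache and buffers. A simplified sketch (the 20% overhead factor is an assumption for illustration, not the tool's actual formula):

```python
def fits_in_memory(n_params_b, bits_per_weight, mem_gb, overhead=1.2):
    """Rough fit check: quantized weights plus an assumed ~20% overhead
    for KV cache and buffers, compared against available memory in GB."""
    weight_gb = n_params_b * bits_per_weight / 8  # billions of params -> GB
    return weight_gb * overhead <= mem_gb

# An 8B model at 4-bit quantization on a 16 GB machine:
print(fits_in_memory(8, 4, 16))   # True
# A 70B model at 4-bit on the same machine:
print(fits_in_memory(70, 4, 16))  # False
```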
  6. An OWL 2 RL reasoner with Z3-verified inference rules. Written in SLOP, it compiles to efficient C code while using SMT solving to prove properties about the inference logic.
  7. This guide explains how to use tool calling with local LLMs, including examples with mathematical, story, Python code, and terminal functions, using llama.cpp, llama-server, and OpenAI endpoints.
  8. Qwen3-Coder-Next is an 80B MoE model with 256K context designed for fast, agentic coding and local use. It offers performance comparable to models with 10-20x more active parameters and excels in long-horizon reasoning, complex tool use, and recovery from execution failures.
  9. Based on the discussion, /u/septerium achieved optimal performance for GLM 4.7 Flash (UD-Q6_K_XL) on an RTX 5090 using these settings and parameters:
    - GPU: NVIDIA RTX 5090.
    - Throughput: ~150 tokens/s.
    - Quant: UD-Q6_K_XL (Unsloth dynamic GGUF).
    - Flash Attention: enabled (-fa on).
    - Context size: 48,000 tokens, squeezed entirely into VRAM (--ctx-size 48000).
    - GPU layers: 99 (-ngl 99), to keep the entire model on the GPU.
    - Sampler and inference parameters:
    - Temperature: 0.7 (recommended by Unsloth for tool calls).
    - Top-P: 1.0.
    - Min-P: 0.01.
    - Repeat penalty: must be disabled (llama.cpp does this by default, but users warned other platforms might not).
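Assembled into a single invocation, the reported settings would look roughly like the following. The model filename is a placeholder and exact flag spellings vary across llama.cpp versions, so treat this as a sketch rather than a verified command line:

```shell
# Sketch of the reported settings as a llama-server invocation.
# Model filename is a placeholder; flags follow recent llama.cpp builds.
llama-server \
  -m GLM-4.7-Flash-UD-Q6_K_XL.gguf \
  -ngl 99 \
  -fa on \
  --ctx-size 48000 \
  --temp 0.7 \
  --top-p 1.0 \
  --min-p 0.01 \
  --repeat-penalty 1.0
```

`--repeat-penalty 1.0` is llama.cpp's default (i.e., disabled); it is spelled out here because the thread warns other platforms may not default to it.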
  10. This article details how to run Large Language Models (LLMs) on Intel GPUs using the llama.cpp framework and its new SYCL backend, offering performance improvements and broader hardware support.
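The article's exact steps aren't reproduced here; per llama.cpp's SYCL documentation, the build is roughly: source the Intel oneAPI environment, then configure with the SYCL backend and Intel compilers. Flag names have changed across versions (older builds used LLAMA_SYCL), so treat this as a sketch:

```shell
# Sketch of a llama.cpp SYCL build for Intel GPUs (flag names per
# recent llama.cpp docs; older versions used -DLLAMA_SYCL=ON).
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release
```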


SemanticScuttle - klotz.me: tagged with "inference"
