klotz: vllm* + llm*

  1. Qwen3-Coder-Next is an 80B MoE model with 256K context designed for fast, agentic coding and local use. It offers performance comparable to models with 10-20x more active parameters and excels in long-horizon reasoning, complex tool use, and recovery from execution failures.
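A minimal sketch of loading a coder model locally through vLLM's offline Python API; the Hugging Face model id below is a placeholder, not confirmed from the bookmarked page:

```python
# Minimal sketch: running a coder model locally with vLLM's offline API.
# The model id is a placeholder -- substitute the actual repository name
# for the release you want to run.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-Coder-Example", max_model_len=32768)  # hypothetical id
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Write a Python function that parses a CSV header."], params)
print(outputs[0].outputs[0].text)
```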
  2. This blog post explains the causes of nondeterminism in LLM inference, arguing that it's not simply due to floating-point non-associativity and concurrency, but rather a lack of batch invariance in kernels. It details how to achieve batch invariance in RMSNorm, matrix multiplication, and attention, and presents experimental results demonstrating deterministic completions and the benefits for on-policy RL.
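As an illustration of the floating-point side of that argument (not code from the post): reductions are not associative, so a kernel that changes its reduction or tiling strategy with batch size can return slightly different values for the same input, which is the batch-invariance problem the post describes.

```python
# Illustration: the same numbers summed in two different orders/groupings
# give results that differ in the last bits under float32. A kernel that
# switches reduction strategy as the batch shape changes behaves the same way.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

serial = np.float32(0.0)
for v in x:                                      # one accumulation order
    serial += v

pairwise = x.reshape(64, 64).sum(axis=1).sum()   # a different grouping

print(serial, pairwise, serial == pairwise)      # typically differ slightly
```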
  3. The article discusses the evolution of model inference techniques from 2017 to a projected 2025, highlighting the progression from simple frameworks like Flask and FastAPI to more advanced solutions like Triton Inference Server and vLLM. It details the increasing demands on inference infrastructure driven by larger and more complex models, and the need for optimization in areas like throughput, latency, and cost.
  4. This article details how to accelerate deep learning and LLM inference using Apache Spark, focusing on distributed inference strategies. It covers basic deployment with `predict_batch_udf`, advanced deployment with inference servers like NVIDIA Triton and vLLM, and deployment on cloud platforms like Databricks and Dataproc. It also provides guidance on resource management and configuration for optimal performance.
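A minimal sketch of the basic `predict_batch_udf` path (available in Spark 3.4+); the predict function here is a stand-in for loading and calling a real model once per worker:

```python
# Sketch of distributed batch inference with Spark's predict_batch_udf.
# The "model" is a placeholder that returns constant scores; a real
# make_predict_fn would load a framework model inside each worker.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType
from pyspark.ml.functions import predict_batch_udf

def make_predict_fn():
    # Runs once per Python worker, so model loading is amortized over batches.
    def predict(inputs: np.ndarray) -> np.ndarray:
        return np.ones(len(inputs), dtype=np.float32)  # placeholder scores
    return predict

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float(i),) for i in range(100)], ["feature"])

score = predict_batch_udf(make_predict_fn, return_type=FloatType(), batch_size=32)
df.withColumn("score", score("feature")).show(5)
```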
  5. Running GenAI models is easy. Scaling them to thousands of users, not so much. This guide details avenues for scaling AI workloads from proofs of concept to production-ready deployments, covering API integration, on-prem deployment considerations, hardware requirements, and tools like vLLM and Nvidia NIMs.
  6. K8S-native cluster-wide deployment for vLLM. Provides a reference implementation for building an inference stack on top of vLLM, enabling scaling, monitoring, request routing, and KV cache offloading with easy cloud deployment.
  7. vLLM Production Stack provides a reference implementation on how to build an inference stack on top of vLLM, allowing for scalable, monitored, and performant LLM deployments using Kubernetes and Helm.
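A sketch of how a client might call a deployment like the two entries above, assuming the stack's router exposes vLLM's usual OpenAI-compatible API; the endpoint URL and model name are placeholders:

```python
# Minimal sketch: talking to a vLLM / vLLM Production Stack endpoint with a
# standard OpenAI client. base_url and model are placeholders for whatever
# the cluster's router and served model actually are.
from openai import OpenAI

client = OpenAI(base_url="http://vllm-router.example.svc:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # whichever model the stack serves
    messages=[{"role": "user", "content": "Summarize what PagedAttention does."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```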
  8. The article explores the evolution of large language model (LLM) serving, highlighting significant advancements from pre-2020 frameworks to the introduction of vLLM in 2023. It discusses the challenges of efficient memory management in LLM serving and how vLLM's PagedAttention technique revolutionizes the field by reducing memory wastage and enabling better utilization of GPU resources.
    2025-02-17 by klotz
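A conceptual sketch (not vLLM's actual code) of the block-table bookkeeping behind PagedAttention: KV-cache memory is handed out in fixed-size blocks from a shared pool instead of one large contiguous reservation per sequence.

```python
# Conceptual illustration of PagedAttention-style KV-cache paging.
BLOCK_SIZE = 16                          # tokens per KV block

free_blocks = list(range(1024))          # physical block ids in the shared pool
block_tables: dict[str, list[int]] = {}  # per-sequence logical -> physical map
token_counts: dict[str, int] = {}        # tokens stored so far per sequence

def append_token(seq_id: str) -> None:
    """Allocate a new physical block only when the current one is full."""
    n = token_counts.get(seq_id, 0)
    if n % BLOCK_SIZE == 0:              # first token, or current block is full
        block_tables.setdefault(seq_id, []).append(free_blocks.pop())
    token_counts[seq_id] = n + 1

def free_sequence(seq_id: str) -> None:
    """When a sequence finishes, its blocks return to the pool for reuse."""
    free_blocks.extend(block_tables.pop(seq_id, []))
    token_counts.pop(seq_id, None)

# Two sequences share the pool; neither reserves max-context memory up front.
for _ in range(40):
    append_token("seq-A")
for _ in range(5):
    append_token("seq-B")
print(len(block_tables["seq-A"]), len(block_tables["seq-B"]))  # 3 blocks vs 1
```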
  9. The article discusses the importance of fine-tuning machine learning models for optimal inference performance and explores popular tools like vLLM, TensorRT, ONNX Runtime, TorchServe, and DeepSpeed.
  10. A comparison of frameworks, models, and costs for deploying Llama models locally and privately.

    - Four tools were analyzed: HuggingFace, vLLM, Ollama, and llama.cpp.
    - HuggingFace has a wide range of models but struggles with quantized models.
    - vLLM is experimental and lacks full support for quantized models.
    - Ollama is user-friendly but has some customization limitations.
    - llama.cpp is preferred for its performance and customization options (see the sketch after this entry).
    - The analysis focused on llama.cpp and Ollama, comparing speed and power consumption across different quantizations.
    2024-11-03 by klotz
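A minimal sketch of the llama.cpp route via the llama-cpp-python bindings; the GGUF path and settings are placeholders for whichever quantization was benchmarked:

```python
# Minimal sketch: running a local quantized model with llama-cpp-python,
# the Python bindings for llama.cpp. The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm("Q: What is PagedAttention?\nA:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```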
