AMD now supports Google’s Gemma 4 models (2B–31B parameters) across its entire hardware lineup: Instinct GPUs for data centers, Radeon GPUs for workstations, and Ryzen AI processors for PCs. The integration works with vLLM, SGLang, llama.cpp, Ollama, and Lemonade Server, targeting optimized inference for both cloud and local deployment.
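To illustrate the local-deployment side, here is a minimal sketch of chatting with a locally pulled Gemma model through the Ollama Python client. The model tag and prompt are placeholders, not details from the announcement:

```python
# Minimal sketch: query a locally pulled Gemma model via the Ollama Python client.
# The tag "gemma3" is an assumption; substitute whatever tag you actually pulled
# (e.g. with `ollama pull <tag>`).
import ollama

response = ollama.chat(
    model="gemma3",  # hypothetical tag; adjust to the model available locally
    messages=[{"role": "user", "content": "Summarize the benefits of on-device inference."}],
)
print(response["message"]["content"])
```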
Qwen3-Coder-Next is an 80B-parameter MoE model with a 256K-token context window, designed for fast, agentic coding and local use. It reportedly matches the performance of models with 10–20x more active parameters and excels at long-horizon reasoning, complex tool use, and recovery from execution failures.
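As a rough illustration of what "agentic tool use" looks like in practice, here is a sketch of a tool-calling request against a locally served model through an OpenAI-compatible endpoint (for example, a vLLM server). The endpoint, model name, and tool are assumptions, not an official Qwen example:

```python
# Sketch of tool calling against a local OpenAI-compatible endpoint.
# base_url, model name, and the "run_tests" tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool exposed to the model
        "description": "Run the project's test suite and return any failures.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

resp = client.chat.completions.create(
    model="Qwen3-Coder-Next",  # placeholder; use the name the server actually exposes
    messages=[{"role": "user", "content": "Fix the failing tests in this repo."}],
    tools=tools,
)
# An agentic loop would execute any requested tool calls, feed the results back
# as "tool" messages, and repeat until the model stops calling tools.
print(resp.choices[0].message.tool_calls)
```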
A user is seeking advice on deploying a new server with 4x H100 GPUs (320GB VRAM total) for on-premises AI workloads. They are considering a Kubernetes-based deployment using RKE2, the NVIDIA GPU Operator, and tools such as vLLM, llama.cpp, and LiteLLM, and are also weighing GPU passthrough under a hypervisor. The post details their current infrastructure and asks for potential gotchas and best practices.
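One piece of that stack, LiteLLM fronting an in-cluster vLLM service, can be sketched as follows; the service hostname and model name are placeholders, not details from the post:

```python
# Minimal sketch: route a request through LiteLLM to an OpenAI-compatible vLLM
# service running inside the cluster. Hostname and model name are hypothetical.
import litellm

response = litellm.completion(
    model="openai/my-model",  # "openai/" prefix = generic OpenAI-compatible backend
    api_base="http://vllm.ai.svc.cluster.local:8000/v1",  # hypothetical in-cluster service
    api_key="not-needed",
    messages=[{"role": "user", "content": "Health check: reply with OK."}],
)
print(response.choices[0].message.content)
```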
This README describes LLooM, a tool that uses raw LLM logits to expand multiple probable continuation threads at once. It includes instructions for running LLooM against several backends, including vLLM, llama.cpp, and the OpenAI API, and documents LLooM's parameters and configuration options.
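For intuition, here is a minimal sketch of the underlying idea: request per-token probabilities from an OpenAI-compatible backend and treat each high-probability alternative as a candidate thread to explore. This is an illustration under assumed endpoint and model names, not LLooM's actual code:

```python
# Sketch of logit-based "thread weaving": fetch the top alternatives for the next
# token and treat each one as the root of a candidate continuation.
# Not LLooM's implementation; base_url and model are placeholders.
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Once upon a time"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # ask for the 5 most likely next tokens
)

for cand in resp.choices[0].logprobs.content[0].top_logprobs:
    # Each alternative token could seed a separate "thread" of continuation.
    print(f"{cand.token!r}: p={math.exp(cand.logprob):.3f}")
```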