This guide provides instructions for running Alibaba's Qwen3.6 multimodal hybrid-thinking models locally using Unsloth tools. It covers the 27B and 35B-A3B variants, which support a 256K context window across 201 languages and excel in agentic coding, vision, and chat tasks. The article details hardware requirements for various quantization levels and explains how to leverage Multi Token Prediction (MTP) for significantly faster inference.
Key topics:
- Hardware memory requirements for quantized models
- Faster generation via Multi Token Prediction (MTP)
- Integration with Unsloth Studio, llama.cpp, and MLX
- Preserved thinking mode configurations
A comprehensive technical guide on setting up a high-performance local large language model environment for agentic coding tasks. The author demonstrates how to run a quantized Qwen3.5-27B model on a remote RTX 4090 workstation and access it from a MacBook using Tailscale, integrating the setup with OpenCode and Codex.
Key topics include:
* Step-by-step llama.cpp build configuration for CUDA support.
* Using Tailscale to create a secure network between client and GPU machine.
* Optimizing VRAM usage through specific quantization (UD-Q4_K_XL) and context size management.
* Implementing a corrected chat template to prevent tool-calling errors in agentic workflows.
* Performance insights regarding hybrid architectures and KV cache precision.
The llama.cpp server has introduced support for the Anthropic Messages API, a highly requested feature that allows users to run Claude-compatible clients with locally hosted models. This implementation enables powerful tools like Claude Code to interface directly with local GGUF models by internally converting Anthropic's message format to OpenAI's standard. Key features of this update include full support for chat completions with streaming, advanced tool use through function calling, token counting capabilities, vision support for multimodal models, and extended thinking for reasoning models. This development bridges the gap between proprietary AI ecosystems and local, privacy-focused inference pipelines, providing a seamless experience for developers working with agentic workloads and coding assistants.
ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL=