M5 delivers over 4x the peak GPU compute performance for AI compared to M4, featuring a next-generation GPU with a Neural Accelerator in each core, a more powerful CPU, a faster Neural Engine, and higher unified memory bandwidth.
A 120 billion parameter OpenAI model can now run on consumer hardware thanks to its Mixture of Experts (MoE) architecture: only a few experts are active per token, which sharply reduces the per-token compute and active-weight footprint and makes it practical to process the bulk of the model on the CPU while offloading key parts to a modest GPU.
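To make the routing concrete, here is a minimal NumPy sketch of top-k MoE routing with illustrative sizes (the layer dimensions and expert counts are stand-ins, not the model's actual configuration); it shows why only a small fraction of the expert weights is ever touched for a given token.

```python
import numpy as np

# Illustrative sizes, not the real model's dimensions.
D_MODEL, D_FF = 512, 2048
N_EXPERTS, TOP_K = 32, 4

rng = np.random.default_rng(0)
# Each expert is a small two-layer MLP: (W_in, W_out).
experts = [(rng.standard_normal((D_MODEL, D_FF)) * 0.02,
            rng.standard_normal((D_FF, D_MODEL)) * 0.02)
           for _ in range(N_EXPERTS)]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_forward(x):
    """Route one token through its top-k experts only."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]            # indices of the k best-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                     # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # weighted ReLU-MLP outputs
    return out

token = rng.standard_normal(D_MODEL)
y = moe_forward(token)
print(f"touched {TOP_K}/{N_EXPERTS} experts "
      f"= {TOP_K / N_EXPERTS:.0%} of expert weights for this token")
```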
LocalScore is an open benchmark for evaluating local AI performance across various hardware configurations. It measures Prompt Processing speed, Token Generation speed, and Time-to-First-Token (TTFT), and combines the three into a single LocalScore.
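As a rough illustration of what such a harness measures, the Python sketch below times a streaming generator to obtain TTFT and generation speed; `fake_stream` is a placeholder for a real model call, the prompt-processing numbers are stand-ins, and the combined score is a simple geometric mean for illustration only, not LocalScore's published formula.

```python
import time

def fake_stream(n_tokens=64, delay=0.01):
    """Stand-in for a streaming LLM API; replace with your model's generator."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

def measure(stream, prompt_tokens=512, prompt_seconds=0.8):
    """Measure TTFT and generation speed around a token stream.

    prompt_tokens / prompt_seconds stand in for the prompt-processing
    phase, which a real harness would also time directly.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start   # time-to-first-token
        count += 1
    total = time.perf_counter() - start
    pp = prompt_tokens / prompt_seconds          # prompt processing, tok/s
    tg = count / total                           # token generation, tok/s
    # Illustrative combined score: geometric mean of the three metrics,
    # with TTFT inverted so that lower latency scores higher.
    combined = (pp * tg * (1.0 / ttft)) ** (1 / 3)
    return pp, tg, ttft, combined

pp, tg, ttft, score = measure(fake_stream())
print(f"pp={pp:.0f} tok/s  tg={tg:.0f} tok/s  ttft={ttft*1000:.0f} ms  score={score:.1f}")
```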
NVIDIA DGX Spark is a desktop-friendly AI supercomputer powered by the NVIDIA GB10 Grace Blackwell Superchip, delivering up to 1000 AI TOPS of FP4 compute with 128GB of coherent unified memory. It is designed for prototyping, fine-tuning, and inference of large AI models.
This article explains how to accurately quantize a Large Language Model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers using an importance matrix (imatrix) and the K-quantization method, with Gemma 2 Instruct as the worked example, and notes that the same process applies to other models such as Qwen2, Llama 3, and Phi-3.
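Assuming a local llama.cpp build with the `llama-imatrix` and `llama-quantize` tools on the PATH, the pipeline looks roughly like the sketch below (all file names and the calibration corpus are placeholders).

```python
import subprocess

# Placeholder paths; the F16 GGUF would come from llama.cpp's convert_hf_to_gguf.py.
F16_GGUF  = "gemma-2-9b-it-f16.gguf"
CALIB_TXT = "calibration.txt"   # small, diverse text corpus for the imatrix
IMATRIX   = "imatrix.dat"
OUT_GGUF  = "gemma-2-9b-it-Q4_K_M.gguf"

# 1) Compute the importance matrix: records which weights matter most on the
#    calibration text, so quantization can preserve them more precisely.
subprocess.run(
    ["llama-imatrix", "-m", F16_GGUF, "-f", CALIB_TXT, "-o", IMATRIX],
    check=True,
)

# 2) K-quantize to Q4_K_M, guided by the importance matrix.
subprocess.run(
    ["llama-quantize", "--imatrix", IMATRIX, F16_GGUF, OUT_GGUF, "Q4_K_M"],
    check=True,
)
print("wrote", OUT_GGUF)
```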
PowerInfer is a CPU/GPU hybrid LLM inference engine for consumer devices that exploits activation locality: a small set of "hot" neurons fires for almost every input and is kept on the GPU, while the rarely activated "cold" majority is computed on the CPU.
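The sketch below is a conceptual illustration of that hot/cold split, not PowerInfer's actual code: it simulates the power-law activation frequencies the engine exploits and partitions neurons between a GPU-resident "hot" set and a CPU-resident "cold" set (the 20% hot fraction is an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
N_NEURONS = 4096
HOT_FRACTION = 0.2   # assumption: a small hot set covers most activations

# Simulate power-law activation frequencies: a few neurons fire for
# almost every token, while most neurons rarely fire at all.
freq = rng.zipf(1.5, N_NEURONS).astype(float)
freq /= freq.max()

hot = np.argsort(freq)[-int(HOT_FRACTION * N_NEURONS):]   # preload on the GPU
cold = np.setdiff1d(np.arange(N_NEURONS), hot)            # keep on the CPU

coverage = freq[hot].sum() / freq.sum()
print(f"{len(hot)/N_NEURONS:.0%} of neurons cover {coverage:.0%} of activations")
# A hybrid engine computes the hot rows of each FFN weight matrix on the GPU
# and the rarely needed cold rows on the CPU, skipping predicted-inactive rows.
```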