SemanticScuttle - klotz.me » klotz: large language model+performance

klotz: large language model* + performance*

Prompt Repetition Improves Non-Reasoning LLMs

Repeating the input prompt improves the performance of popular LLMs (Gemini, GPT, Claude, and DeepSeek) when reasoning is not used, without increasing the number of generated tokens or the latency.

2026-01-18 Tags: large language model, prompt engineering, prompt repetition, performance, google by klotz
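
As a rough illustration of the idea in the entry above, the sketch below builds a prompt that contains the same query more than once before it is sent to a model. The function name `repeat_prompt`, the default of two copies, and the blank-line separator are assumptions made for this example; they are not taken from the linked paper, and no model API is called.

```python
def repeat_prompt(prompt: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return the prompt repeated `times` times, joined by `separator`.

    Hypothetical helper: the repeat count and separator are illustrative
    choices, not values prescribed by the bookmarked article.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([prompt] * times)


if __name__ == "__main__":
    question = "Which planet has the most known moons? Answer with the name only."
    # The repeated prompt would be sent as the user message in place of the original.
    print(repeat_prompt(question))
```

Only the input grows in this sketch; the expected answer is unchanged, which matches the entry's note that the number of generated tokens does not increase.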

Choosing the Right Chunking Strategy: A Comprehensive Guide to RAG Optimization

This article explores different chunking strategies for Retrieval-Augmented Generation (RAG) systems, comparing nine approaches using the agenticmemory library to improve retrieval accuracy and reduce hallucinations.

2025-12-22 Tags: llm, performance, rag, chunking, embedding, vector database, rag optimization by klotz

guide : running gpt-oss with llama.cpp · Discussion #15396

A detailed guide for running the new gpt-oss models locally with the best performance using `llama.cpp`. The guide covers a wide range of hardware configurations and provides CLI argument explanations and benchmarks for Apple Silicon devices.

2025-10-04 Tags: llama.cpp, gpt-oss, large language model, inference, apple silicon, benchmarks, performance, gguf by klotz

LocalScore

LocalScore is an open benchmark to evaluate local AI task performance across various hardware configurations, measuring Prompt Processing speed, Token Generation speed, Time-to-First-Token (TTFT), and a combined LocalScore.

2025-04-17 Tags: llm, benchmark, performance, gpu, cpu, inference, localscore by klotz

How did we get to vLLM, and what was its genius?

The article traces the evolution of large language model (LLM) serving, from pre-2020 frameworks to the introduction of vLLM in 2023. It discusses the challenges of efficient memory management in LLM serving and how vLLM's PagedAttention technique reduces memory waste and makes better use of GPU resources.

2025-02-17 Tags: vllm, llm, performance, pagedattention by klotz

LLM Calculator

A tool to estimate the memory requirements and performance of Hugging Face models based on quantization levels.

2025-01-28 Tags: llm, calculator, performance, github copilot by klotz
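
The kind of estimate such a calculator produces can be approximated with simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by eight. The sketch below is a minimal version under that assumption. The function name is hypothetical, the result covers model weights only (no KV cache, activations, or runtime overhead), and it is not the formula used by the linked tool.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes).

    Assumption: every weight is stored at `bits_per_weight` bits and nothing
    else (KV cache, activations, quantization metadata) is counted.
    """
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight must be > 0")
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # A round 8-billion-parameter model at common precision levels.
    for bits in (16, 8, 4):
        print(f"{bits}-bit: {estimate_weight_memory_gb(8e9, bits):.1f} GB")
```

For a round 8 billion parameters this gives 16 GB at 16 bits and 4 GB at 4 bits; real requirements are higher once the KV cache and runtime overhead are included.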

DDR5 Speed, CPU and LLM Inference

Investigation into the effect of DDR5 speed on local LLM inference speed.

2025-01-26 Tags: llm, machine learning, inference, performance, memory, ddr5 by klotz

LLM Tools by Examples: Exploring Tools for Optimal Inference Performance

The article discusses the importance of fine-tuning machine learning models for optimal inference performance and explores popular tools like vLLM, TensorRT, ONNX Runtime, TorchServe, and DeepSpeed.

2025-01-02 Tags: llm, inference, performance, vllm, tensorrt, onnx, torchserve, deepspeed by klotz

vLLM Benchmark

This repository contains scripts for benchmarking the performance of large language models (LLMs) served using vLLM.

2024-08-24 Tags: vllm, benchmark, llm, performance, backprop.co by klotz

Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands

A startup called Backprop has demonstrated that a single Nvidia RTX 3090 GPU, released in 2020, can handle serving a modest large language model (LLM) like Llama 3.1 8B to over 100 concurrent users with acceptable throughput. This suggests that expensive enterprise GPUs may not be necessary for scaling LLMs to a few thousand users.

2024-08-24 Tags: nvidia, rtx 3090, llm, gpu, performance, benchmark, llama 3.1 8b, vllm, production engineering, backprop.co by klotz
