SemanticScuttle - klotz.me

klotz: vllm*

LLM Tools by Examples: Exploring Tools for Optimal Inference Performance

The article discusses the importance of fine-tuning machine learning models for optimal inference performance and explores popular tools like vLLM, TensorRT, ONNX Runtime, TorchServe, and DeepSpeed.

2025-01-02 Tags: llm, inference, performance, vllm, tensorrt, onnx, torchserve, deepspeed by klotz

Running Large Language Models Privately

A comparison of frameworks, models, and costs for deploying Llama models locally and privately.

- Four tools were analyzed: HuggingFace, vLLM, Ollama, and llama.cpp.
- HuggingFace has a wide range of models but struggles with quantized models.
- vLLM is experimental and lacks full support for quantized models.
- Ollama is user-friendly but has some customization limitations.
- llama.cpp is preferred for its performance and customization options.
- The analysis focused on llama.cpp and Ollama, comparing speed and power consumption across different quantizations.

2024-11-03 Tags: llm, self-hosted, huggingface, vllm, ollama, llama-2 by klotz

Serving Large models (part one): VLLM, LLAMA CPP Server, and SGLang

This guide covers three prominent projects for serving large language models and vision-language models: vLLM, the llama.cpp server, and SGLang. For each project it describes the features, usage instructions, and deployment methods.

2024-09-30 Tags: vllm, llama cpp, llm by klotz

vLLM Benchmark

This repository contains scripts for benchmarking the performance of large language models (LLMs) served using vLLM.

2024-08-24 Tags: vllm, benchmark, llm, performance, backprop.co by klotz

Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands

A startup called Backprop has demonstrated that a single Nvidia RTX 3090 GPU, released in 2020, can serve a modestly sized large language model (LLM) such as Llama 3.1 8B to over 100 concurrent users with acceptable throughput. This suggests that expensive enterprise GPUs may not be necessary to scale LLM serving to a few thousand users.

2024-08-24 Tags: nvidia, rtx 3090, llm, gpu, performance, benchmark, llama 3.1 8b, vllm, production engineering, backprop.co by klotz

vLLM: Serve LLMs at Scale

A high-performance deployment of the vLLM serving engine, optimized for serving large language models at scale.

2024-08-16 Tags: vllm, llm, scalability, openai, api, production engineering by klotz

Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM)

This blog post benchmarks and compares the performance of SGLang, TensorRT-LLM, and vLLM for serving large language models (LLMs). SGLang demonstrates superior or competitive performance in offline and online scenarios, often outperforming vLLM and matching or exceeding TensorRT-LLM.

2024-07-27 Tags: sglang, tensorrt-llm, vllm, llama, llm by klotz

LLooM: Leverage raw LLM logits to weave threads

This page provides information about LLooM, a tool that uses raw LLM logits to weave threads in a probabilistic way. It includes instructions on how to use LLooM with various environments, such as vLLM, llama.cpp, and OpenAI. The README also explains the parameters and configurations for LLooM.

2024-07-04 Tags: lloom, llm, logits, vllm, llama.cpp, openai, greedy decoding, beamsearch, github by klotz

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog

2024-01-10 Tags: llm, vllm by klotz
