SemanticScuttle - klotz.me » klotz: llm+production engineering+gpu

klotz: llm* + production engineering* + gpu*

How Much Stress Can Your Server Endure if You’re Self-Hosting LLMs?

The article discusses challenges and strategies for load testing self-hosted Large Language Models (LLMs), along with the infrastructure decisions involved.

2024-10-20 Tags: load testing, self-hosted, llm, gpu, production engineering by klotz

Run:ai - Accelerate AI Development & Innovation

Run:ai offers a platform to accelerate AI development, optimize GPU utilization, and manage AI workloads. It is designed for GPUs, offers CLI & GUI interfaces, and supports various AI tools & frameworks.

2024-08-26 Tags: llm, orchestration, infrastructure, gpu, workload management, k8s, nvidia, production engineering by klotz

Maximize your LLM serving throughput for GPUs on GKE — a practical guide

This blog post is a guide to optimizing LLM serving performance on Google Kubernetes Engine (GKE). It covers infrastructure decisions, model server optimizations, and best practices for maximizing GPU utilization, with recommendations on quantization, GPU selection (G2 vs A3), batching strategies, and model server features such as PagedAttention.

2024-08-25 Tags: llm, gke, gpu, production engineering by klotz

Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands

A startup called Backprop has demonstrated that a single Nvidia RTX 3090 GPU, released in 2020, can serve a modest large language model (LLM) such as Llama 3.1 8B to over 100 concurrent users with acceptable throughput. Since only a fraction of an application's users are typically active at the same moment, this suggests that expensive enterprise GPUs may not be necessary to scale LLM serving to a few thousand users.

2024-08-24 Tags: nvidia, rtx 3090, llm, gpu, performance, benchmark, llama 3.1 8b, vllm, production engineering, backprop.co by klotz
