Tags: llms* + gpu*


  1. Apple's M5 delivers over 4x the peak GPU compute performance for AI compared to the M4, featuring a next-generation GPU with a Neural Accelerator in each core, a more powerful CPU, a faster Neural Engine, and higher unified memory bandwidth.
  2. Nvidia's DGX Spark is a relatively affordable AI workstation that prioritizes memory capacity over raw speed, letting it run models that consumer GPUs cannot fit. It offers 128GB of unified memory and is built on the Blackwell architecture.
  3. Nvidia introduces the Rubin CPX GPU, designed to accelerate AI inference by decoupling the compute-heavy context (prefill) phase from the memory-bound generation (decode) phase. It uses GDDR7 memory for lower cost and power consumption than HBM, aiming to reshape AI inference infrastructure. (A toy sketch of the prefill/decode split follows the list.)
  4. Nvidia has expanded its Jetson lineup with the Jetson AGX Thor Developer Kit, a compact platform built around the new Jetson T5000 system-on-module. Although marketed as a developer system, its dimensions and form factor place it firmly in mini PC territory, even if its design and purpose align more with edge AI deployment than home computing.
    2025-08-31 by klotz
  5. A 120-billion-parameter OpenAI model can now run on consumer hardware thanks to the Mixture of Experts (MoE) technique: because only a few experts are active per token, the bulk of the weights can stay in CPU RAM while the always-used layers are offloaded to a modest GPU. (The memory arithmetic is sketched after the list.)
  6. Unsloth's guide to running Gemma 3 effectively with their GGUFs on llama.cpp, Ollama, and Open WebUI, and to fine-tuning it with Unsloth. It covers running Gemma 3 on various platforms, including phones, addresses known float16 precision issues, and provides optimal configuration settings. (A minimal llama-cpp-python invocation follows the list.)
  7. This article details 7 lessons the author learned while self-hosting Large Language Models (LLMs): the importance of memory bandwidth, quantization, electricity costs, hardware choices beyond Nvidia, prompt engineering, Mixture of Experts models, and starting with simpler tools like LM Studio. (The bandwidth point reduces to back-of-envelope arithmetic, sketched after the list.)
  8. This tutorial introduces the essential topics of the PyTorch deep learning library in about an hour, covering tensors, training neural networks, and training models on multiple GPUs. (A condensed training-loop example follows the list.)
  9. A web application for parsing GGUF files. (The fixed header a GGUF parser reads first is sketched after the list.)
    2025-04-28 by klotz
  10. Running GenAI models is easy; scaling them to thousands of users, not so much. This guide details avenues for scaling AI workloads from proofs of concept to production-ready deployments, covering API integration, on-prem deployment considerations, hardware requirements, and tools like vLLM and Nvidia NIMs. (A minimal vLLM call follows the list.)
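
Entry 3's "decoupled" phases refer to disaggregated inference: prefill processes the whole prompt in one compute-bound batch and produces the KV cache, while decode generates one token at a time and mostly streams that cache. A toy numpy sketch of the split, with illustrative sizes and a single attention step standing in for a full model (not Nvidia's API):

```python
import numpy as np

d = 64           # head dimension (toy size)
prompt_len = 8   # tokens in the prompt

rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for one query over a KV cache."""
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Prefill (context) phase: all prompt tokens in one batch, compute-bound.
# Its product is the KV cache, which a disaggregated design hands off
# to a different device or worker.
prompt = rng.standard_normal((prompt_len, d))
K_cache = prompt @ Wk
V_cache = prompt @ Wv

# Decode (generation) phase: one token per step, re-reading the whole
# cache each time, so it is bound by memory bandwidth rather than FLOPs.
x = rng.standard_normal(d)
for _ in range(4):
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    x = attend(x @ Wq, K_cache, V_cache)  # stand-in for the full layer stack

print("KV cache holds", K_cache.shape[0], "token entries")
```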
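
For entry 5, a back-of-envelope look at why MoE makes a 120B model feasible locally. The active-parameter count and quantization width are illustrative assumptions, not the model's published specs:

```python
# Back-of-envelope memory math for a 120B-parameter MoE model.
total_params    = 120e9   # all experts combined
active_params   = 5e9     # assumed: params actually used per token
bytes_per_param = 0.5     # ~4-bit quantization

total_gb  = total_params  * bytes_per_param / 1e9
active_gb = active_params * bytes_per_param / 1e9

print(f"full weights (must fit in CPU RAM):   ~{total_gb:.0f} GB")
print(f"hot path per token (modest GPU-size): ~{active_gb:.1f} GB")
# The MoE trick: keep the huge, sparsely-used expert tensors in system
# RAM and offload only the always-used layers (attention, norms) to GPU.
```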
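
For entry 6, a minimal sketch of loading a Gemma 3 GGUF through llama-cpp-python. The filename is hypothetical, and the sampling values are the Gemma-3-style settings the guide discusses; verify both against the guide itself:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-4b-it-Q4_K_M.gguf",  # hypothetical local GGUF path
    n_gpu_layers=-1,   # offload every layer that fits on the GPU
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what GGUF is."}],
    temperature=1.0, top_k=64, top_p=0.95,  # assumed Gemma-3-style sampling
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```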
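
Entry 7's memory-bandwidth lesson reduces to simple arithmetic: each decoded token streams roughly the whole set of weights once, so bandwidth divided by model size bounds tokens per second. The bandwidth figures below are ballpark, not measured:

```python
# Upper bound on decode speed: bandwidth / bytes read per token.
model_gb = 70 * 0.5   # 70B params at ~4-bit quantization ≈ 35 GB

for name, bw_gb_s in [("dual-channel DDR5", 90),
                      ("Apple M-series unified memory", 400),
                      ("RTX 4090 GDDR6X", 1008)]:
    print(f"{name:30s} <= {bw_gb_s / model_gb:5.1f} tok/s")
```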
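
For entry 8, the flavor of the tutorial's core loop, condensed into a toy linear regression (not the tutorial's own code):

```python
import torch

# Synthetic data: y is a linear function of X plus noise.
X = torch.randn(256, 4)
y = X @ torch.tensor([1.0, -2.0, 0.5, 3.0]) + 0.1 * torch.randn(256)

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = torch.nn.MSELoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(-1), y)
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")
# The tutorial's multi-GPU section builds on this same loop, typically by
# wrapping the model in DistributedDataParallel.
```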
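
For entry 9, the first thing any GGUF parser reads is the fixed little-endian header: a 4-byte magic, a uint32 version, then uint64 tensor and metadata-entry counts. A minimal reader following that published layout:

```python
import struct
import sys

def read_gguf_header(path):
    """Read the fixed GGUF header: magic, version, tensor/KV counts."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return version, n_tensors, n_kv

if __name__ == "__main__":
    v, t, k = read_gguf_header(sys.argv[1])
    print(f"GGUF v{v}: {t} tensors, {k} metadata entries")
```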
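
For entry 10, vLLM's offline batch API in miniature; the model name is an assumption:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches and schedules prompts internally (continuous batching),
# which is what makes it scale past single-user proofs of concept.
outputs = llm.generate(["Explain continuous batching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

For production serving, the same engine can instead be exposed as an OpenAI-compatible HTTP server via the `vllm serve` command.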


