Tags: llms* + performance*


  1. A detailed guide for running the new gpt-oss models locally with the best performance using `llama.cpp`. The guide covers a wide range of hardware configurations and provides CLI argument explanations and benchmarks for Apple Silicon devices.
  2. LocalScore is an open benchmark to evaluate local AI task performance across various hardware configurations, measuring Prompt Processing speed, Token Generation speed, Time-to-First-Token (TTFT), and a combined LocalScore.
  3. The article traces the evolution of large language model (LLM) serving, from pre-2020 frameworks to the introduction of vLLM in 2023. It discusses the challenges of efficient memory management in LLM serving and how vLLM's PagedAttention technique reduces memory waste and enables better utilization of GPU resources.
    2025-02-17 by klotz
  4. A tool to estimate the memory requirements and performance of Hugging Face models based on quantization levels.
    2025-01-28 by klotz
  5. Investigation into the effect of DDR5 speed on local LLM inference speed.
  6. The article discusses the importance of fine-tuning machine learning models for optimal inference performance and explores popular tools like vLLM, TensorRT, ONNX Runtime, TorchServe, and DeepSpeed.
  7. This repository contains scripts for benchmarking the performance of large language models (LLMs) served using vLLM.
    2024-08-24 by klotz
  8. A startup called Backprop has demonstrated that a single Nvidia RTX 3090 GPU, released in 2020, can handle serving a modest large language model (LLM) like Llama 3.1 8B to over 100 concurrent users with acceptable throughput. This suggests that expensive enterprise GPUs may not be necessary for scaling LLMs to a few thousand users.
  9. A study investigating whether format restrictions like JSON or XML impact the performance of large language models (LLMs) in tasks like reasoning and domain knowledge comprehension.
  10. A discussion post on Reddit's LocalLLaMA subreddit about logging the output of running models and monitoring performance, specifically for debugging errors, warnings, and performance analysis. The post also asks for flags that write logs to flat files and for GPU metrics (GPU utilization, RAM usage, TensorCore usage, etc.) to support troubleshooting and analytics.

SemanticScuttle - klotz.me: tagged with "llms+performance"
