klotz: llm* + performance*


  1. This repository contains scripts for benchmarking the performance of large language models (LLMs) served using vLLM.
    2024-08-24, by klotz
  2. A startup called Backprop has demonstrated that a single Nvidia RTX 3090 GPU, released in 2020, can handle serving a modest large language model (LLM) like Llama 3.1 8B to over 100 concurrent users with acceptable throughput. This suggests that expensive enterprise GPUs may not be necessary for scaling LLMs to a few thousand users.
  3. A study investigating whether format restrictions like JSON or XML impact the performance of large language models (LLMs) in tasks like reasoning and domain knowledge comprehension.
  4. An overview of techniques for improving the memory and computational efficiency of large language models (LLMs) when handling long input sequences, as in retrieval-augmented question answering, summarization, and chat. It covers lower-precision computation, the Flash Attention algorithm, positional embedding methods, and key-value caching strategies, which reduce memory consumption and speed up inference while maintaining accuracy. It also highlights Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), which further improve computational and memory efficiency without compromising performance.
  5. 2023-11-18, by klotz
  6. 2023-10-13, by klotz
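
The key-value caching point in item 4 can be made concrete with a little arithmetic. The sketch below estimates the size of the key-value cache for one sequence under standard multi-head attention, GQA, and MQA. The function name `kv_cache_bytes` and the model shape (32 layers, 32 query heads, head dimension 128, 4096 tokens, 16-bit values) are illustrative assumptions, not figures taken from the bookmarked article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate key-value cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit precision (an assumption).
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

# The attention variants differ only in how many key/value heads are cached.
variants = {
    "MHA (one KV head per query head)": N_QUERY_HEADS,
    "GQA (8 KV head groups)": 8,
    "MQA (a single shared KV head)": 1,
}

for name, n_kv_heads in variants.items():
    size = kv_cache_bytes(N_LAYERS, n_kv_heads, HEAD_DIM, SEQ_LEN)
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads, which is why MQA and GQA help most with long sequences and many concurrent requests.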
