klotz: performance* + production engineering*


  1. The article discusses the challenges of scaling Retrieval-Augmented Generation (RAG) from a proof of concept (POC) to production, covering performance, data management, risk, integration into existing workflows, and cost. It also outlines the architectural components needed to address these challenges: scalable vector databases, caching mechanisms, advanced search techniques, responsible-AI layers, and API gateways. (A retrieval-caching sketch follows this list.)
  2. A startup called Backprop has demonstrated that a single Nvidia RTX 3090 GPU, released in 2020, can serve a modest large language model (LLM) such as Llama 3.1 8B to over 100 concurrent users with acceptable throughput, suggesting that expensive enterprise GPUs may not be necessary for scaling LLM serving to a few thousand users. (A concurrency load-test sketch follows this list.)
  3. Distributable streaming
  4. Overall, though, the Istio/Envoy sidecar proxies use roughly 50% more CPU than Linkerd's proxies. (A per-sidecar CPU measurement sketch follows this list.)
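
Alongside item 1: a minimal sketch of one of the components that article names, a cache in front of the vector-database lookup. The vector_search function, the TTL, and the key normalization are assumptions standing in for a real retrieval stack, not the article's implementation.

    # In-memory cache in front of a vector-database lookup for RAG retrieval.
    # `vector_search` is a hypothetical stand-in for the real top-k query.
    import hashlib
    import time

    _CACHE: dict[str, tuple[float, list[str]]] = {}
    TTL_SECONDS = 300  # assumed cache lifetime; tune for your workload

    def _key(query: str) -> str:
        # Normalize so trivially different phrasings hit the same entry.
        return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

    def vector_search(query: str) -> list[str]:
        # Placeholder for the real vector-database call (top-k chunk lookup).
        return [f"chunk matching: {query}"]

    def cached_retrieve(query: str) -> list[str]:
        key = _key(query)
        hit = _CACHE.get(key)
        if hit is not None and time.time() - hit[0] < TTL_SECONDS:
            return hit[1]                     # cache hit: skip the vector DB
        chunks = vector_search(query)         # cache miss: query the vector DB
        _CACHE[key] = (time.time(), chunks)
        return chunks

    if __name__ == "__main__":
        print(cached_retrieve("How do I scale RAG to production?"))
        print(cached_retrieve("how do i scale rag to production?  "))  # cache hit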
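
Alongside item 2: a rough load-test sketch for measuring concurrent throughput against an OpenAI-compatible endpoint such as one vLLM exposes. The URL, model id, prompt, and concurrency level are assumptions, not Backprop's benchmark setup.

    # Fire N concurrent chat-completion requests at a local OpenAI-compatible
    # server (e.g. an 8B model on one RTX 3090) and report aggregate throughput.
    import asyncio
    import time

    import aiohttp  # third-party: pip install aiohttp

    URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint
    MODEL = "meta-llama/Llama-3.1-8B-Instruct"         # assumed model id
    CONCURRENCY = 100

    async def one_request(session: aiohttp.ClientSession) -> int:
        payload = {
            "model": MODEL,
            "messages": [{"role": "user", "content": "Summarize RAG in one sentence."}],
            "max_tokens": 64,
        }
        async with session.post(URL, json=payload) as resp:
            body = await resp.json()
            return body.get("usage", {}).get("completion_tokens", 0)

    async def main() -> None:
        start = time.time()
        async with aiohttp.ClientSession() as session:
            tokens = await asyncio.gather(*(one_request(session) for _ in range(CONCURRENCY)))
        elapsed = time.time() - start
        print(f"{CONCURRENCY} requests in {elapsed:.1f}s, "
              f"~{sum(tokens) / elapsed:.0f} completion tokens/s")

    if __name__ == "__main__":
        asyncio.run(main())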
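
Alongside item 4: a sketch of one way to compare sidecar CPU, summing `kubectl top pod --containers` readings for the istio-proxy and linkerd-proxy containers. It assumes metrics-server is available; the namespace is an assumption about your cluster.

    # Sum per-container CPU (millicores) reported by `kubectl top pod --containers`
    # and compare Istio's istio-proxy against Linkerd's linkerd-proxy sidecars.
    import subprocess
    from collections import defaultdict

    def container_cpu_millicores(namespace: str) -> dict[str, int]:
        out = subprocess.run(
            ["kubectl", "top", "pod", "--containers", "--no-headers", "-n", namespace],
            capture_output=True, text=True, check=True,
        ).stdout
        totals: dict[str, int] = defaultdict(int)
        for line in out.splitlines():
            parts = line.split()              # columns: POD  NAME  CPU  MEMORY
            if len(parts) < 3:
                continue
            container, cpu = parts[1], parts[2]
            if cpu.endswith("m"):
                totals[container] += int(cpu[:-1])  # e.g. "12m" -> 12 millicores
        return totals

    if __name__ == "__main__":
        cpu = container_cpu_millicores("default")   # assumed namespace
        print("istio-proxy total CPU:", cpu.get("istio-proxy", 0), "m")
        print("linkerd-proxy total CPU:", cpu.get("linkerd-proxy", 0), "m")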

