Simon Willison explains how to use the mistral.rs library in Rust to run the Llama Vision model on a Mac M2 laptop. He provides a detailed example and discusses the memory usage and GPU utilization.
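As a rough companion to the post, here is a minimal sketch of querying a locally running mistral.rs instance through its OpenAI-compatible HTTP server; the port, model id, and image URL are placeholders of mine, not details from Willison's write-up.

```python
# A minimal sketch: chat with a vision model served by mistral.rs's
# OpenAI-compatible HTTP server running locally. Port, model id and the
# image URL are assumptions for illustration only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",  # placeholder; use whatever model id your server reports
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)
```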
TabbyAPI is a FastAPI-based application for generating text with a large language model (LLM) through the ExLlamaV2 backend. It supports various model types and features such as HuggingFace model downloading, embedding model support, and more.
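A minimal sketch of talking to a running TabbyAPI instance from Python via its OpenAI-compatible endpoint; the port, API key, and model name below are assumptions, so check your own TabbyAPI config for the real values.

```python
# Query a locally running TabbyAPI server through its OpenAI-compatible API.
# The address, API key and model name are placeholders, not TabbyAPI defaults
# you should rely on.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5000/v1",  # assumed local TabbyAPI address
    api_key="your-tabby-api-key",          # placeholder; TabbyAPI issues its own keys
)

response = client.chat.completions.create(
    model="my-exl2-model",  # hypothetical name of the model loaded via ExLlamaV2
    messages=[{"role": "user", "content": "Summarize ExLlamaV2 in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```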
This article explains how to accurately quantize a Large Language Model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers using an importance matrix (imatrix) and K-quantization, with Gemma 2 Instruct as an example, while highlighting the approach's applicability to other models like Qwen2, Llama 3, and Phi-3.
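The quantization itself is done with llama.cpp's tooling as the article describes; as a hedged follow-on, this sketch shows loading a resulting GGUF file for CPU inference with llama-cpp-python. The file name and settings are illustrative assumptions, not values from the article.

```python
# Run a K-quantized GGUF model on CPU with llama-cpp-python.
# The file name below is a hypothetical imatrix-quantized artifact produced
# beforehand with llama.cpp's quantization tools.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-9b-it-Q4_K_M.gguf",  # assumed output of the quantization step
    n_ctx=4096,    # context window
    n_threads=8,   # CPU threads to use for inference
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what an importance matrix is in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```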
Inference.net offers LLM inference tokens for models like Llama 3.1 at a 50-90% discount from other providers. They aggregate unused compute resources from data centers to offer fast, reliable, and affordable inference services.
inference.net is a wholesaler of LLM inference tokens for models like Llama 3.1. We provide inference services at a 50-90% discount from what you would pay together.ai or groq.
"We sell tokens in 10 billion token increments. The current cost per 10 billion tokens for an 8B model is $200."
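For scale, that quoted price works out to about $0.02 per million tokens; the conversion below is my own arithmetic, not a figure published by inference.net.

```python
# Unit conversion of the quoted rate: $200 per 10 billion tokens for an 8B model.
price_per_10b_tokens = 200                       # USD
millions_of_tokens = 10_000_000_000 / 1_000_000  # = 10,000 million tokens
print(f"${price_per_10b_tokens / millions_of_tokens:.2f} per million tokens")  # -> $0.02
```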
The author explores the use of Gemma 2 and Mozilla's llamafile on AWS Lambda for serverless AI inference.
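A hedged sketch of what a setup like this could look like: a Lambda handler shelling out to a bundled llamafile in CLI mode. The binary path, packaging, and flags are assumptions based on llamafile's llama.cpp-style options, not the author's actual code.

```python
# Hypothetical Lambda handler that invokes a llamafile binary bundled with the
# function (e.g. in a container image or layer). Path and CLI flags are
# assumptions; consult the article for the author's real setup.
import json
import subprocess

LLAMAFILE = "/opt/gemma-2-2b-it.llamafile"  # hypothetical location in the Lambda environment

def handler(event, context):
    prompt = event.get("prompt", "Hello!")
    result = subprocess.run(
        [LLAMAFILE, "-p", prompt, "-n", "256", "--temp", "0.7"],
        capture_output=True, text=True, timeout=120,
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"completion": result.stdout}),
    }
```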
Explore the best LLM inference engines and servers available to deploy and serve LLMs in production, including vLLM, TensorRT-LLM, Triton Inference Server, RayLLM with RayServe, and HuggingFace Text Generation Inference.
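For a flavor of one of these engines, here is a minimal offline-batch sketch with vLLM; the model name is an assumed example, and any HuggingFace-hosted model vLLM supports would do.

```python
# Offline batch inference with vLLM. The model id is an illustrative assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")  # assumed example model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```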
In this article, we explore how to deploy and manage machine learning models using Google Kubernetes Engine (GKE), Google AI Platform, and TensorFlow Serving. We will cover the steps to create a machine learning model and deploy it on a Kubernetes cluster for inference.
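Once a model is deployed this way, clients typically call TensorFlow Serving's REST predict endpoint behind the GKE service; a small sketch, with the service address, model name, and input shape as placeholders:

```python
# Call a TensorFlow Serving REST endpoint exposed by a GKE service.
# Host, model name and input shape are placeholders for illustration.
import requests

SERVING_URL = "http://<gke-service-ip>:8501/v1/models/my_model:predict"  # hypothetical endpoint

payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}  # must match the model's input signature
resp = requests.post(SERVING_URL, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json()["predictions"])
```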
Podman AI Lab is the easiest way to work with Large Language Models (LLMs) on your local developer workstation. It provides a catalog of recipes and a curated list of open source models, so you can experiment with and compare models, get ahead of the curve, and take your development to new heights with Podman AI Lab!
This review article discusses the concept of entropy in statistical physics and its role as both a tool for inference and a measure of time irreversibility. It highlights the developments in stochastic thermodynamics and the principle of maximum caliber, emphasizing the importance of cross-talk among researchers in disparate fields.
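For orientation, the two central quantities the review builds on are the Gibbs-Shannon entropy over states and the path entropy ("caliber") maximized by the principle of maximum caliber; the standard textbook forms are sketched below, not reproduced from the paper itself.

```latex
% Standard definitions, stated for orientation rather than quoted from the review:
% S is the Gibbs-Shannon entropy over microstates i; C is the caliber, the
% entropy over trajectories Gamma, which the principle of maximum caliber
% maximizes subject to dynamical constraints.
\[
  S = -k_B \sum_i p_i \ln p_i,
  \qquad
  \mathcal{C} = -\sum_{\Gamma} P_{\Gamma} \ln P_{\Gamma}.
\]
```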