Ollama now supports Hugging Face GGUF models, making it easier to run AI models locally; once a model is downloaded, inference needs no internet connection. The GGUF format makes such models usable on modest consumer hardware.
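For instance, a GGUF repo on Hugging Face can be pulled and queried from Ollama's Python client. A minimal sketch, assuming the `ollama` package and a running local Ollama server; the repo and quant tag below are illustrative, not prescribed by the article:

```python
"""Minimal sketch: pulling a Hugging Face GGUF through Ollama's Python
client. The hf.co/ prefix tells Ollama to fetch the GGUF straight from
the Hugging Face repo, no Modelfile required."""
import ollama

model = "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M"  # illustrative repo/quant
ollama.pull(model)

# After the one-time download, inference runs entirely on local hardware.
reply = ollama.chat(
    model=model,
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
)
print(reply["message"]["content"])
```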
A step-by-step guide to building llamafiles from Llama 3.2 GGUFs, covering scripting and Dockerization.
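The core packaging step is embedding the GGUF and a default-arguments file into a copy of the llamafile launcher using its bundled `zipalign` tool. A rough sketch of scripting this in Python, assuming the llamafile release binaries (`llamafile`, `zipalign`) are on PATH; the file names are placeholders, not the guide's actual script:

```python
"""Rough sketch of scripting a llamafile build; file names are
placeholders and the binaries are assumed to be installed from a
llamafile release."""
import shutil
import subprocess
from pathlib import Path

GGUF = "Llama-3.2-1B-Instruct-Q4_K_M.gguf"  # placeholder model file
OUT = "llama-3.2-1b.llamafile"

# 1. Start from a copy of the generic llamafile launcher binary.
shutil.copy(shutil.which("llamafile"), OUT)

# 2. Write the default arguments the llamafile reads at startup
#    (the trailing "..." lets users append their own flags at run time).
Path(".args").write_text(f"-m\n{GGUF}\n...\n")

# 3. Embed the weights and the .args file into the executable's zip archive.
subprocess.run(["zipalign", "-j0", OUT, GGUF, ".args"], check=True)
```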
This article explains how to accurately quantize a Large Language Model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers using an importance matrix (imatrix) and K-quantization, with Gemma 2 Instruct as the example, while noting that the approach applies to other models such as Qwen2, Llama 3, and Phi-3.
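The workflow typically runs through llama.cpp's command-line tools: first compute the importance matrix on calibration text, then pass it to the quantizer. A sketch, assuming llama.cpp is built and its binaries are on PATH (binary and flag names vary across llama.cpp versions; the file names are placeholders):

```python
"""Sketch of the imatrix + K-quant pipeline via llama.cpp's CLI tools."""
import subprocess

FP16 = "gemma-2-9b-it-f16.gguf"  # full-precision GGUF to quantize (placeholder)
CALIB = "calibration.txt"        # text used to compute the importance matrix

# 1. Measure per-weight importance on the calibration text.
subprocess.run(
    ["llama-imatrix", "-m", FP16, "-f", CALIB, "-o", "imatrix.dat"],
    check=True,
)

# 2. K-quantize, letting the imatrix steer precision toward important weights.
subprocess.run(
    ["llama-quantize", "--imatrix", "imatrix.dat",
     FP16, "gemma-2-9b-it-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```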
This document presents inference performance results for quantized LLMs at 70B+ parameters.
Mistral.rs is a fast LLM inference platform supporting a variety of devices and quantization schemes, with an OpenAI-compatible HTTP server and Python bindings for easy integration. It supports the latest Llama and Phi models, as well as X-LoRA and LoRA adapters. The project aims to be the fastest LLM inference platform available.
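Because the server speaks the OpenAI API, any OpenAI client can talk to it. A minimal sketch, assuming `mistralrs-server` is already running locally; the port and model id are assumptions to match to how the server was actually launched:

```python
"""Minimal sketch: querying a local mistralrs-server through the
standard OpenAI client."""
from openai import OpenAI

# Point the client at the local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="unused")

resp = client.chat.completions.create(
    model="default",  # placeholder; use the model id the server was started with
    messages=[{"role": "user", "content": "Hello from mistral.rs!"}],
)
print(resp.choices[0].message.content)
```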
- Create a custom base image for a Cloud Workstation environment using a Dockerfile. Uses: quantized models from
A deep dive into model quantization with GGUF and llama.cpp, and model evaluation with LlamaIndex
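As a companion to such a walkthrough, a quantized GGUF can be smoke-tested locally with llama-cpp-python before wiring it into a LlamaIndex evaluation. A minimal sketch with a placeholder model path, not the article's exact code:

```python
"""Minimal smoke test of a quantized GGUF with llama-cpp-python."""
from llama_cpp import Llama

# Load the quantized model (path is a placeholder).
llm = Llama(model_path="model-Q4_K_M.gguf", n_ctx=2048, verbose=False)

# One quick completion to confirm the quantized weights still behave.
out = llm("Q: What does GGUF stand for?\nA:", max_tokens=48, stop=["\n"])
print(out["choices"][0]["text"].strip())
```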
Exploring Pre-Quantized Large Language Models