Tags: localllama*


  1. Small, inexpensive single-board computers like the Raspberry Pi 5 are becoming viable platforms for running large language models (LLMs) locally. By using quantization to reduce model size and memory requirements, users can run quantized builds of popular models such as Llama 3, Mistral, and Qwen (see the example command after the list below). While throughput remains limited compared to high-end GPUs, these devices offer a private, low-cost way to apply AI to specific tasks.

    - Quantization allows large models to fit into the Pi's limited RAM by reducing numerical precision.
    - Tiny models (1B-3B parameters) run comfortably, while 7B-parameter models are usable on the 8GB Pi with tempered expectations.
    - Throughput sits in the low single digits of tokens per second, which suits batch and other non-real-time tasks rather than interactive use.
    - Hardware upgrades like the Raspberry Pi AI HAT+ or an external GPU can significantly boost on-device neural processing.
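
    A rough, hedged illustration of the workflow (the model file, quantization level, and thread count are assumptions, not taken from the article), using llama.cpp to run a 4-bit 1B model on a Pi 5:

    `llama-cli -m Llama-3.2-1B-Instruct-Q4_K_M.gguf -t 4 -c 2048 -n 128 -p "Summarize: ..."`

    Here `-t 4` matches the Pi 5's four cores, `-c` keeps the context window small enough for the limited RAM, and `-n` caps the number of generated tokens.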
  2. A practical pipeline for classifying messy free-text data into meaningful categories using a locally hosted LLM, no labeled training data required.
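
    A minimal sketch of such a pipeline (the port, prompt, and label set are illustrative assumptions, not details from the article): with a model loaded in llama-server, each free-text record is posted to its OpenAI-compatible endpoint with a system prompt that restricts the answer to a fixed label set, for example:

    `curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"system","content":"Classify the user text as exactly one of: billing, shipping, technical, other. Reply with the label only."},{"role":"user","content":"My package never arrived"}],"temperature":0}'`

    The returned label is then parsed from the response and mapped onto the target categories.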
  3. Announcement that ggml.ai is joining Hugging Face to ensure the long-term sustainability and progress of the ggml/llama.cpp community and of local AI more broadly. The post highlights continued open-source development, an improved user experience, and integration with the Hugging Face transformers library.
  4. The open-source AI landscape is evolving rapidly, and recent developments around GGML and llama.cpp matter to anyone running large language models locally. GGML, a C library for machine learning, has joined Hugging Face, ensuring its continued development and accessibility. Meanwhile, llama.cpp, the project that began as a way to run Llama-family models on CPUs, remains open source and is finding a stable home. This article details these changes, their implications for local AI enthusiasts, and the benefits of an open ecosystem.
  5. A user is experiencing slow performance with Qwen3-Coder-Next on their local system despite having a capable setup. They are using a tensor-split configuration across two GPUs (an RTX 5060 Ti and an RTX 3060) and are seeing between 2 and 15 tokens/second with heavy swap usage. The post details their hardware and parameters and asks for troubleshooting advice.
  6. A user shares their optimal settings for running the gpt-oss-120b model on a system with dual RTX 3090 GPUs and 128GB of RAM, aiming for a balance between performance and quality.
  7. A user shares their experience running the GPT-OSS 120B model on Ollama with an i7-6700, 64GB of DDR4 RAM, an RTX 3090, and a 1TB SSD. They note slow initial token generation but acceptable performance overall, highlighting that it is possible on a relatively modest setup. The discussion includes comparisons with other hardware configurations, optimization techniques (llama.cpp), and notes on the model's quality.

    >I have a 3090 with 64gb ddr4 3200 RAM and am getting around 50 t/s prompt processing speed and 15 t/s generation speed using the following:
    >
    >`llama-server -m <path to gpt-oss-120b> --ctx-size 32768 --temp 1.0 --top-p 1.0 --jinja -ub 2048 -b 2048 -ngl 99 -fa 'on' --n-cpu-moe 24`
    > This about fills up my VRAM and RAM almost entirely. For more wiggle room for other applications use `--n-cpu-moe 26`.
  8. The article discusses how NotebookLM can be used to document and troubleshoot a home lab setup. It highlights its ability to consolidate documentation, simplify complex tasks, and provide step-by-step instructions. The author shares practical examples of using NotebookLM for learning, troubleshooting, and managing a home lab environment.
  9. A user demonstrates how to run a 120B model efficiently on hardware with only 8GB of VRAM by offloading the MoE expert layers to the CPU and keeping only the attention layers on the GPU, achieving high throughput with minimal VRAM usage.
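
    A hedged sketch of that layout with llama.cpp (the model path, context size, and expert-offload count are assumptions, not the poster's exact settings):

    `llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 36 -c 16384 -fa on`

    Here `-ngl 99` pushes all layers toward the GPU, while `--n-cpu-moe` keeps the MoE expert weights of that many layers in system RAM, so mostly attention and KV-cache tensors occupy the limited VRAM.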
  10. llama-swap is a lightweight, transparent proxy server that provides automatic model swapping to llama.cpp's server. It allows you to easily switch between different language models on a local server, supporting OpenAI API compatible endpoints and offering features like model grouping, automatic unloading, and a web UI for monitoring.
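
    As an illustration of the workflow (the port and model name are assumptions, and the exact configuration keys should be checked against the llama-swap README), clients talk to the proxy just as they would to any OpenAI-compatible server, and the `model` field of the request determines which configured llama.cpp instance the proxy launches:

    `curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen2.5-7b-instruct","messages":[{"role":"user","content":"hello"}]}'`

    Requesting a different model name later causes the running instance to be swapped out automatically.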
