A user shares their optimal settings for running the gpt-oss-120b model on a system with dual RTX 3090 GPUs and 128GB of RAM, aiming for a balance between performance and quality.
A user shares their experience running the GPT-OSS 120b model on Ollama with an i7 6700, 64GB DDR4 RAM, an RTX 3090, and a 1TB SSD. They note slow initial token generation but acceptable performance overall, highlighting that the model is usable on a relatively modest setup. The discussion includes comparisons with other hardware configurations, optimization techniques (such as running the model through llama.cpp directly), and the model's output quality.
>I have a 3090 with 64gb ddr4 3200 RAM and am getting around 50 t/s prompt processing speed and 15 t/s generation speed using the following:
>
>`llama-server -m <path to gpt-oss-120b> --ctx-size 32768 --temp 1.0 --top-p 1.0 --jinja -ub 2048 -b 2048 -ngl 99 -fa 'on' --n-cpu-moe 24`
> This about fills up my VRAM and RAM almost entirely. For more wiggle room for other applications use `--n-cpu-moe 26`.
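The same offloading approach scales to the dual-3090, 128GB machine from the first item. That post's exact command isn't reproduced here, but a sketch of what it might look like, splitting the model across both cards with `--tensor-split` and keeping fewer MoE layers on the CPU, is:

```bash
# Illustrative only: these flag values are not the original poster's settings.
# With ~48GB of combined VRAM, fewer expert layers need to live in system RAM,
# so --n-cpu-moe can be lower than in the single-3090 command above.
llama-server -m <path to gpt-oss-120b> --ctx-size 32768 --temp 1.0 --top-p 1.0 --jinja \
  -ub 2048 -b 2048 -ngl 99 -fa 'on' --tensor-split 1,1 --n-cpu-moe 12
```

As with the quoted command, raising `--n-cpu-moe` frees VRAM for other applications at the cost of some generation speed.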
The article discusses how NotebookLM can be used to document and troubleshoot a home lab setup. It highlights its ability to consolidate documentation, simplify complex tasks, and provide step-by-step instructions. The author shares practical examples of using NotebookLM for learning, troubleshooting, and managing a home lab environment.
A user demonstrates how to run a 120B model efficiently on hardware with only 8GB of VRAM by offloading the MoE expert layers to the CPU and keeping only the attention layers on the GPU, achieving high performance with minimal VRAM usage.
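With llama.cpp this split is typically done by requesting all layers on the GPU and then forcing the MoE expert layers back into system RAM. A minimal sketch, assuming a recent llama.cpp build; the path and the exact values are illustrative:

```bash
# Load all layers on the GPU (-ngl 99), then push essentially every MoE expert
# layer back to system RAM with a high --n-cpu-moe value; the attention weights
# and KV cache stay in the 8GB of VRAM. Tune the number down if VRAM allows.
llama-server -m <path to gpt-oss-120b> --ctx-size 16384 --jinja -fa 'on' \
  -ngl 99 --n-cpu-moe 36
```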
llama-swap is a lightweight, transparent proxy server that adds automatic model swapping to llama.cpp's server. It lets you switch between different language models on a local server through OpenAI-compatible API endpoints, and offers features like model grouping, automatic unloading, and a web UI for monitoring.
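In practice the proxy exposes the usual OpenAI-style endpoints, and the `model` field of a request decides which configured backend gets started or swapped in. A sketch of such a request, assuming llama-swap is listening on port 8080 and has an entry named `gpt-oss-120b` in its config:

```bash
# llama-swap starts the matching llama-server instance on demand, proxies the
# request to it, and can unload it again after an idle timeout.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss-120b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```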
This article details how to add a weather report to a Home Assistant dashboard using a local LLM served through Ollama, producing friendlier summaries and clothing suggestions while avoiding cloud-based services for privacy reasons. It covers the setup process, prompt engineering, and hardware considerations.
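The core of such a setup is a single call to Ollama's local HTTP API with the forecast data embedded in the prompt; a Home Assistant automation or REST command ends up wrapping the equivalent of the sketch below. The model name and prompt are illustrative, not the article's exact configuration:

```bash
# Turn raw forecast values into a short, human-friendly summary with a local model.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "In two friendly sentences, summarize this forecast and suggest what to wear: 7°C, light rain, wind 20 km/h.",
  "stream": false
}'
```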
This article details how to enhance the Paperless-ngx document management system by integrating a local Large Language Model (LLM) served through Ollama. It covers the setup process, including installing Docker and Ollama and configuring Paperless-AI, to enable AI-powered features such as improved search and document understanding.
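Since the stack is Docker-based, the Ollama side of the setup usually amounts to running the official container and pulling a model for Paperless-AI to talk to. A minimal sketch; the model choice here is illustrative rather than the article's recommendation:

```bash
# Run Ollama in a container, persist models in a named volume, and expose its API
docker run -d --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

# Pull a model for Paperless-AI to use for tagging and document understanding
docker exec -it ollama ollama pull mistral
```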
Real-time observability and analytics platform for local LLMs, with dashboard and API.
A post with pithy observations and clear conclusions from building complex LLM workflows, covering topics like prompt chaining, data structuring, model limitations, and fine-tuning strategies.
A user is seeking advice on deploying a new server with 4x H100 GPUs (320GB VRAM) for on-premise AI workloads. They are considering a Kubernetes-based deployment with RKE2, the NVIDIA GPU Operator, and tools like vLLM, llama.cpp, and LiteLLM, and are also exploring GPU pass-through with a hypervisor. The post details their current infrastructure and asks about potential gotchas and best practices.
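For the vLLM part of such a stack, serving one model across all four H100s typically comes down to a single tensor-parallel launch behind vLLM's OpenAI-compatible API. A sketch; the model name and values are placeholders, not the poster's plan:

```bash
# Shard one model across the 4 GPUs with tensor parallelism and expose an
# OpenAI-compatible endpoint that LiteLLM (or clients directly) can route to.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```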