Builder's Notes
- Scaling up to five RTX 3090s, the new MiniMax M2.5 as daily driver, and the refactored website (February 2026)
- Getting the massive machine up and running: motherboard issues, RAM troubleshooting, and power solutions (January 2026)
- Starting the engineering journal: from hobbyist hacker to building LLM inference systems with off-the-shelf components (January 2026)
Hardware Builds
- A scalable multi-GPU inference system designed for expansion from 2 to 6 GPUs with Threadripper backbone (January 2026)
- Technical analysis of PCIe lanes, CPU selection, and the workstation motherboard that makes it all possible (January 2026)
Software
- Vector DB RAG for your coding agents: a cross-platform memory system built with PostgreSQL + pgvector and MCP (March 2026)
Model Reviews
- 6-bit quantization meets 122B parameters: high-quality MoE inference at 40+ tokens/sec (March 2026)
- 1-bit quantization meets 397B parameters: the largest model we've run yet (February 2026)
- MiniMax M2.1 (IQ2_M): slumming it with lower quant precision (February 2026)
- Performance and memory analysis of an 80B MoE model with Q5_K_XL quantization (February 2026)
- Performance and memory analysis of a 30B MoE model with Q5_K_M quantization (January 2026)
- Extreme quantization test: 358B model at 1-bit vs 30B model at 5-bit (February 2026)
- MoE CPU offloading to fit a 120B parameter model on 48GB VRAM + 64GB RAM (January 2026)
Performance & Experiments
- Discovering 7x speedup with Mixture of Experts models: Nemotron-3-Nano vs Qwen3-32B benchmark (January 2026)
- Optimizing multi-GPU setups, testing context windows, and workarounds for 65K token limits (January 2026)
Efficiency & Optimization
- Running MiniMax M2.5 through a power limit sweep (100W-350W per GPU) to understand efficiency and performance tradeoffs (February 2026)
- How capping RTX 3090 power consumption to 200W cuts electricity costs by ~43%, reduces heat, and extends GPU lifespan (February 2026)
Technical Deep Dives
- Quantization techniques and memory optimizations to support large context windows on limited VRAM (January 2026)
- Real-world llama.cpp server setup, framework comparison, and troubleshooting journey (January 2026)
- Troubleshooting memory management, port conflicts, and common server failure patterns (January 2026)
- Lightweight tool for measuring LLM inference performance across configurations (January 2026)