Ironhorse is a scalable AI inference platform built on off‑the‑shelf hardware, designed to run the largest language models efficiently. Leveraging Opencode and a modular multi‑GPU architecture, it can grow from a modest dual‑GPU setup to a full six‑GPU powerhouse, scaling compute capacity to match your workload demands.
Bill of Materials
Core Components
| Component | Part | Est. Cost | Notes |
|---|---|---|---|
| Motherboard | Supermicro M12SWA-TF-O (E-ATX) | $700 | sWRX8 socket, 6x PCIe 4.0 x16, 8-ch DDR4 |
| CPU | Threadripper Pro 5955WX | $660 | 16C/32T, 4.0/4.5GHz, 128 PCIe 4.0 lanes |
| RAM | OWC 512GB (8x64GB) DDR4-3200 ECC RDIMM | $3,423 | 2Rx4 288-pin 1.2V ECC Registered |
| GPUs | 2x NVIDIA RTX 3090 | $1,600 | 48GB VRAM total (24GB each) |
| PSU | Corsair HX1500i 80+ Platinum | $400 | 1500W single PSU (not dual) |
| PSU Adapter | ADD2PSU-S Dual PSU Adapter | $17 | For future dual PSU setup |
| Frame | Bomeiqee Mining Rig Frame | $65 | 12-GPU frame with fans |
| Risers | 6x LINKUP PCIE 5.0 Riser Cable 40cm | $714 | Black v2, true x16 bandwidth |
| CPU Cooler | Noctua NH-U14S TR4-SP3 | $110 | Threadripper compatible cooler |
| Case Fans | 4x Noctua NF-A12x25 PWM | $152 | 120mm high-performance fans |
Total: ~$7,875 (estimated prices based on actual purchases)
Actual spent to date (with tax): ~$5,618 (the two additional RTX 3090s and the second PSU haven't been purchased yet)
Critical Component Selection Criteria
Motherboard: Supermicro M12SWA-TF
Chosen specifically for its 6x PCIe 4.0 x16 slots at full bandwidth. No bifurcation or sharing - each GPU gets the full 16 lanes. This board supports both 3000WX and 5000WX Threadripper Pro CPUs with 8-channel memory configuration.
CPU: Threadripper Pro 5955WX
The 5955WX provides 16 cores/32 threads with 4.0/4.5GHz boost speeds. For GPU-bound inference workloads, this is more than adequate - the key advantage is all Threadripper Pro CPUs provide 128 PCIe 4.0 lanes regardless of core count. The 5955WX delivers excellent value compared to higher core count models for LLM inference workloads.
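Those 128 lanes are what make the six-GPU endgame possible without bifurcation. Here's a rough lane-budget sketch; the NVMe and chipset allowances are my assumptions, not values from the board manual.

```python
# Rough PCIe 4.0 lane budget for the 6-GPU endgame, as a sanity check.
# Per-device lane counts below are assumptions based on typical wiring,
# not measured values from this specific board.
LANES_AVAILABLE = 128  # all Threadripper Pro CPUs expose 128 lanes

consumers = {
    "6x GPU @ x16": 6 * 16,   # one full x16 slot per GPU, no bifurcation
    "NVMe storage": 4,        # a single x4 M.2 drive (assumed)
    "chipset/misc": 8,        # 10GbE, USB, SATA uplink (rough allowance)
}

used = sum(consumers.values())
print(f"lanes used: {used} / {LANES_AVAILABLE} "
      f"(headroom: {LANES_AVAILABLE - used})")
```

Even fully populated, the budget leaves headroom for storage and networking, which is exactly why a consumer platform with 20-24 lanes can't do this.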
Power Requirements
| Component | Power Draw |
|---|---|
| 2x RTX 3090 @ load | ~700W |
| Threadripper Pro 5955WX | ~280W |
| RAM, storage, fans | ~50W |
| Total Peak Load | ~1,030W |
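A quick sanity check of the table above against the HX1500i's rating; the 20% headroom rule of thumb is my assumption, not a PSU spec.

```python
# Sanity-check the peak-load table against the HX1500i's 1500W rating.
# The 20% headroom rule of thumb is an assumption, not a PSU spec.
draws_w = {
    "2x RTX 3090 @ load": 700,
    "Threadripper Pro 5955WX": 280,
    "RAM, storage, fans": 50,
}
peak = sum(draws_w.values())
psu_w = 1500
print(f"peak load: {peak}W, PSU headroom: {psu_w - peak}W "
      f"({(psu_w - peak) / psu_w:.0%} of rated capacity)")
```

At ~1,030W peak the single 1500W unit has comfortable margin for the dual-head configuration; the second PSU only becomes necessary as more heads come online.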
Assembly Process & Lessons Learned
PCIe Riser Cable Reality Check
Cheap mining risers won't work. Those USB-style risers are x1-to-x16 adapters with only x1 bandwidth (~4 GB/s). For LLM inference, you need true x16-to-x16 extension cables. The LINKUP Ultra cables deliver full 32 GB/s bandwidth with no measurable performance loss.
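The 32 GB/s figure follows from the PCIe 4.0 signaling rate; a minimal per-direction sketch, before protocol overhead:

```python
# Per-direction PCIe throughput for a given generation and lane count.
# PCIe 4.0 runs 16 GT/s per lane with 128b/130b encoding.
def pcie_gbps(gt_per_s: float, lanes: int, enc: float = 128 / 130) -> float:
    """Usable GB/s in one direction, before protocol overhead."""
    return gt_per_s * enc / 8 * lanes

x16 = pcie_gbps(16.0, 16)  # a true x16 riser on this board
x1 = pcie_gbps(16.0, 1)    # a USB-style mining riser's single lane
print(f"gen4 x16: {x16:.1f} GB/s, gen4 x1: {x1:.1f} GB/s")
```

The 16x gap is why mining risers are fine for hashing (weights stay resident) but crippling for multi-GPU inference, where tensors move across the bus every forward pass.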
CPU Lock Warning
Some Threadripper Pro CPUs pulled from Lenovo P620 workstations are firmware-locked to that platform. Always verify the seller confirms "unlocked" or "retail/OEM tray" before purchase. A locked CPU won't POST in aftermarket boards.
RAM Configuration
ECC RDIMM is required for the 512GB configuration: 64GB DIMMs are only available as registered modules (RDIMMs), while standard UDIMMs max out at 256GB (8x32GB). The 8-channel configuration provides ~200 GB/s of theoretical memory bandwidth.
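The ~200 GB/s figure is just the channel math; a quick check:

```python
# Theoretical peak bandwidth for the 8-channel DDR4-3200 configuration.
mt_per_s = 3200          # DDR4-3200 transfer rate (megatransfers/s)
bytes_per_transfer = 8   # 64-bit channel width
channels = 8
bw_gbs = mt_per_s * bytes_per_transfer * channels / 1000
print(f"peak memory bandwidth: {bw_gbs:.1f} GB/s")
```

This matters for MoE offload: expert layers that spill out of VRAM stream from system RAM at this rate, so it sets the floor on offloaded-layer throughput.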
Thermal Management
Open-air frame with directional airflow across GPUs works well. RTX 3090s run hot (~80-85C under load). Consider undervolting if thermals become problematic. Noctua fans provide excellent airflow without excessive noise.
Capability Overview
Ironhorse Configuration: Dual-Head Mode
- 48GB VRAM + 512GB system RAM (2 active heads)
- ~1.9 TB/s aggregate VRAM bandwidth
- Runs medium models: Qwen3-32B-Q5_K_M, deepseek-r1:32b, Nemotron-3-Nano-30B-A3B-Q6_K
- Context window: under 65K tokens at usable speeds
- Power draw: ~1,030W (2 heads + backbone)
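To gauge whether a quantized model fits the dual-head budget, here's a back-of-the-envelope estimator. The bits-per-weight figures and the 20% KV-cache allowance are ballpark assumptions, not measured values.

```python
# Rough check of whether a quantized model's weights fit in VRAM.
# Bits-per-weight values and the KV-cache/activation allowance are
# ballpark assumptions, not measured figures.
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8  # B params * bits -> GB

vram_gb = 48  # dual-head configuration
for name, params_b, bpw in [
    ("Qwen3-32B Q5_K_M", 32, 5.7),       # ~5.7 bits/weight (assumed)
    ("32B-class model @ Q4_K_M", 32, 4.8),
]:
    need = weights_gb(params_b, bpw)
    # leave ~20% of VRAM for KV cache and activations (assumption)
    fits = need <= vram_gb * 0.8
    print(f"{name}: ~{need:.0f} GB weights, fits in {vram_gb}GB: {fits}")
```

A 32B model at Q5_K_M lands around 23 GB of weights, which is why this class of model is the dual-head sweet spot: it fits with room left for a meaningful context window.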
Head Growth: 4-Head Configuration
Expanding Ironhorse to 4 heads (~$1,600) with a second PSU unlocks greater parallel processing:
- 96GB VRAM + 512GB system RAM (4 active heads)
- ~3.7 TB/s aggregate VRAM bandwidth
- Enables larger MoE models: GLM-4.5 Air, GLM-4.6/4.7 (MoE offload), MiniMax M2.1 Q2
- Total power draw: ~1,730W (4 heads + backbone)
Full Ironhorse: 6-Head Configuration
Maximum expansion (~$2,400 for 4 additional heads + second PSU) delivers:
- 144GB VRAM + 512GB system RAM (6 active heads)
- ~5.6 TB/s aggregate VRAM bandwidth
- Full capability for premier MoE models: DeepSeek V3 Q4 (MoE offload), MiniMax M2.1 Q4, GLM-4.6/4.7
- Total power draw: ~2,430W (6 heads + backbone)
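The aggregate-bandwidth figures across all three configurations are simply the RTX 3090's ~936 GB/s GDDR6X bandwidth multiplied by head count:

```python
# Aggregate VRAM bandwidth per configuration: the RTX 3090's
# ~936 GB/s GDDR6X bandwidth times the number of active heads.
BW_3090_GBS = 936
for heads in (2, 4, 6):
    agg_tbs = heads * BW_3090_GBS / 1000
    print(f"{heads} heads: ~{agg_tbs:.1f} TB/s aggregate VRAM bandwidth")
```

Note this is aggregate, not unified: each head reads its own VRAM at ~936 GB/s, so the benefit comes from splitting model layers across heads rather than any single head getting faster.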
Key Takeaways
Ironhorse Design Philosophy
- Modular Growth: Start with 2 heads (GPUs), expand to 6 as needs grow
- Neural Architecture: Each head works independently while sharing the same nervous system
- Quality Interconnects: True x16 risers ensure each head communicates at full bandwidth
- Future-Proof Foundation: Threadripper Pro provides the computational backbone for expansion
Cost vs. Reality
So far, this build delivers ~48GB of VRAM for ~$5,600 of actual spend, which is not especially economical on its own. However, I think the design choices - full PCIe lane allocation and the flexibility afforded by 512GB of system RAM - will pay off when we look to cram as much model as we can into the system. I've not seen used A100 40GB cards for less than $3,000, so I think we are doing better than buying server-grade gear. And while a Mac Studio has decent performance and supports larger models with greater context length, the 256GB configuration will cost you at least $7,500, and the 512GB model over $12,000. If we can find a way to avoid memory bandwidth bottlenecks (~200 GB/s system RAM vs. the Mac's ~800 GB/s unified memory), the GPUs should offer faster inference speeds.
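For a rough dollars-per-GB view of the options above (prices are this section's estimates; VRAM and unified memory aren't directly comparable, so treat this as indicative only):

```python
# Cost per GB of model-addressable memory across the options discussed.
# Prices are the estimates from this section; VRAM vs. unified memory
# is an apples-to-oranges comparison, so this is indicative only.
options = {
    "Ironhorse 2-head (48GB VRAM)": (5618, 48),
    "Mac Studio 256GB": (7500, 256),
    "Mac Studio 512GB": (12000, 512),
}
for name, (cost, gb) in options.items():
    print(f"{name}: ${cost / gb:.0f}/GB")
```

Per GB, the Mac wins handily; the bet this build makes is that GPU memory bandwidth (and expansion headroom) buys faster tokens per dollar, not cheaper gigabytes.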
Ironhorse's backbone (motherboard, CPU, RAM, frame) is established. Once the first two heads (RTX 3090s) are transplanted from the previous system, I'll begin benchmarking and testing this multi-headed architecture. As workload demands increase, additional heads can be activated - Ironhorse adapts to your computational needs.