This post documents an experiment running GPT-OSS-120B, a large Mixture of Experts model, on a dual RTX 3090 system with 64GB of system RAM. The goal was to see how far we could push consumer hardware using llama.cpp's MoE CPU offloading feature.
Hardware and Model
System Specs
- GPUs: 2x NVIDIA RTX 3090 (48GB VRAM total)
- CPU: AMD Ryzen 7 5800X
- RAM: 64GB DDR4
- Framework: llama.cpp server
Model Details
GPT-OSS-120B is a Mixture of Experts model with the following architecture:
| Parameter | Value |
|---|---|
| Total Parameters | 116.83 billion |
| Experts | 128 total, 4 active per token |
| MoE Layers | 36 |
| Max Context | 131,072 tokens |
| File Size (F16) | ~65 GB |
The Challenge
With 48GB of VRAM and a ~65GB model, the entire model cannot fit in GPU memory. The solution is llama.cpp's --n-cpu-moe flag, which offloads MoE expert layers to system RAM while keeping the rest of the model on GPU.
Configuration Journey
Initial Attempt: Single GPU
The starting configuration used only one GPU:
```
--split-mode none --main-gpu 1 --n-cpu-moe 30
```
This immediately hit out-of-memory (OOM) errors: the GPU-resident portion of the model plus the KV cache and compute buffers doesn't fit on a single 24GB card.
Working Configuration: Row Split with Both GPUs
The working configuration uses both GPUs with row splitting:
```
--split-mode row --tensor-split 0.5,0.5 --main-gpu 0 --n-cpu-moe 26 --ctx-size 38000 --fit on
```
Memory Distribution
With this configuration, memory is distributed as follows:
| Location | Usage | Contents |
|---|---|---|
| GPU 0 | ~7.5 GB | Model split + KV cache + compute buffers |
| GPU 1 | ~22.8 GB | Model split + KV cache + compute buffers |
| System RAM (CUDA_Host) | ~42 GB | 26 offloaded MoE layers |
Performance Results
Token Generation Speed: 10-17 tokens/second (when active experts are cached in VRAM)
The throughput varies depending on whether the required experts are already in VRAM or need to be swapped from system RAM.
The Expert Swapping Problem
With 26 of 36 MoE layers offloaded to CPU, the model frequently needs to swap experts between system RAM and VRAM. This creates dead time during inference: the model could go silent for a minute or more. Generation proceeds smoothly while the required experts are cached, then stalls when new experts have to be loaded.
In practice with opencode, this manifested as periodic pauses in response generation. Offloading fewer experts to CPU reduces how often these stalls occur.
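The stalls hurt more than the raw generation speed suggests, because a single long stall is amortized over every token in the burst around it. A sketch with illustrative numbers (none of these are measurements; the burst length in particular is an assumption):

```python
# Effective throughput when smooth generation is punctuated by
# expert-swap stalls. All numbers are ILLUSTRATIVE assumptions,
# not measurements from this experiment.
burst_rate = 15.0      # tok/s while the needed experts are cached
burst_tokens = 500     # ASSUMPTION: tokens generated between stalls
stall_s = 60.0         # stalls of "a minute or more"

effective = burst_tokens / (burst_tokens / burst_rate + stall_s)
print(f"effective rate: ~{effective:.1f} tok/s")
```

Under these assumptions a single one-minute stall cuts a 15 tok/s burst to roughly a third of that, which matches how inconsistent the sessions felt.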
What We Tried That Didn't Work
Quantization (Q4_K_M)
We downloaded the Q4_K_M quantized version expecting significant memory savings. The actual file sizes:
| Quantization | File Size |
|---|---|
| F16 | 65.4 GB |
| Q4_K_M | 62.8 GB |
Only a 4% reduction, not the roughly 4x reduction typical of quantizing a dense F16 model. The expert weights appear to be stored in low precision already (GPT-OSS reportedly ships with MXFP4 expert weights), so Q4_K_M had little left to compress and provided minimal benefit for this particular model.
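The small gap makes sense once you look at bytes per parameter: a true F16 model needs 2 bytes per weight, but dividing the file sizes above by the parameter count shows both files already sit near 4-bit density (this arithmetic treats GB as GiB, which only shifts the result slightly):

```python
# Bytes per parameter for each file, vs. the 2 bytes/param a true
# F16 model would need and the 0.5 bytes/param of a pure 4-bit format.
total_params = 116.83e9
gib = 1024**3

results = {}
for label, size_gb in [("F16", 65.4), ("Q4_K_M", 62.8)]:
    results[label] = size_gb * gib / total_params
    print(f"{label}: {results[label]:.2f} bytes/param")
```

Both files land around 0.6 bytes/param, far closer to 4-bit than to F16, so Q4_K_M had almost nothing to gain.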
KV Cache on CPU (--no-kv-offload)
We attempted to move the KV cache to system RAM to allow larger context sizes. This failed because the 64GB system RAM was already consumed by the ~42GB of offloaded MoE layers, leaving insufficient space for both the KV cache and OS overhead.
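The budget arithmetic makes the failure concrete. The OS/overhead figure here is an assumption for illustration; everything else comes from the numbers above:

```python
# Why --no-kv-offload didn't fit: the system-RAM budget on this box.
total_ram_gb = 64.0
moe_offload_gb = 42.0   # 26 offloaded MoE layers, per the table above
os_overhead_gb = 8.0    # ASSUMPTION: OS, buffers, other processes

headroom_gb = total_ram_gb - moe_offload_gb - os_overhead_gb
print(f"RAM left for a CPU-side KV cache: ~{headroom_gb:.0f} GB")
```

Whatever remains has to hold the KV cache plus llama.cpp's own host-side buffers, which left no comfortable margin at the context sizes we wanted.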
Larger Context (65k+)
Attempts to increase context beyond ~40k tokens failed due to GPU memory constraints. The KV cache and compute buffers are split between both GPUs based on --tensor-split, and GPU 1 runs out of headroom first. The --main-gpu 0 flag doesn't concentrate KV cache on a single GPU when using row split mode.
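For intuition on why context is the bottleneck: KV-cache size grows linearly with context length. This sketch uses the generic per-token formula with assumed attention dimensions; the head counts are placeholders, not published figures for this model, and sliding-window layers would shrink the real number:

```python
# Generic per-token KV-cache size: 2 (K and V) x layers x kv_heads x
# head_dim x bytes per element. Head counts are ASSUMED placeholders.
n_layers = 36
n_kv_heads = 8          # ASSUMPTION (grouped-query attention)
head_dim = 64           # ASSUMPTION
bytes_per_elem = 2      # f16 cache

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
for ctx in (38_000, 65_536, 131_072):
    print(f"ctx {ctx:>7}: ~{per_token * ctx / 1024**3:.1f} GB KV cache")
```

Even under these modest assumptions, doubling the context roughly doubles the cache, and that growth has to fit inside whichever GPU hits its ceiling first.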
Key Learnings
What MoE CPU Offloading Enables
- Running models that exceed total VRAM capacity
- Trading latency (expert swapping) for capability (larger models)
- Using system RAM as a "cold storage" tier for inactive experts
Limitations Encountered
- Context ceiling: ~38-40k tokens due to GPU 1 memory constraints
- Latency variance: Smooth generation interrupted by expert swap stalls
- System RAM competition: MoE offload and KV cache both want system RAM
- Uneven GPU utilization: Row split mode distributes evenly by ratio, not by available headroom
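One workaround for the uneven-utilization point is to derive --tensor-split from each GPU's actual free memory instead of splitting 50/50. A sketch, with the free-VRAM values hardcoded as assumptions (in practice you would read them from nvidia-smi):

```python
# Derive a --tensor-split ratio from per-GPU free VRAM instead of
# splitting 50/50. The free-memory values are ASSUMED examples; query
# real ones with: nvidia-smi --query-gpu=memory.free --format=csv
free_vram_gb = [16.0, 24.0]   # ASSUMPTION: GPU 0 also drives the desktop

total = sum(free_vram_gb)
split = [round(v / total, 2) for v in free_vram_gb]
print("--tensor-split " + ",".join(str(s) for s in split))
```

This biases the split toward the GPU with more headroom, which should delay the point at which the more loaded card runs out first.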
The MoE Advantage
Despite the challenges, this experiment demonstrates the core MoE value proposition:
A 120B parameter model with only 4 experts active per token (~5.1B active parameters) can run on consumer hardware. The sparse activation pattern means you get the knowledge encoded in 120B parameters while only computing through a fraction of them per token.
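The sparse activation pattern can be illustrated with a toy top-k router: a gating network scores all 128 experts, but only the top 4 are actually evaluated for each token. This is a schematic sketch of the technique, not the model's actual routing code:

```python
import numpy as np

# Toy top-4-of-128 MoE router: score every expert, evaluate only the top 4.
rng = np.random.default_rng(0)
n_experts, n_active, d_model = 128, 4, 64

gate_w = rng.standard_normal((d_model, n_experts))
experts_run = set()  # track which experts were actually evaluated

def route(token_vec):
    """Pick the top-4 experts by gate score; mix outputs by softmax weight."""
    scores = token_vec @ gate_w
    top = np.argsort(scores)[-n_active:]      # indices of the top-4 experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                              # softmax over the top-4 only
    experts_run.update(top.tolist())
    # Stand-in expert computation: a per-expert scaling of the token.
    return sum(wi * (token_vec * (1 + ei / n_experts))
               for wi, ei in zip(w, top))

out = route(rng.standard_normal(d_model))
print(f"experts evaluated for this token: {len(experts_run)} of {n_experts}")
```

The gating scores exist for all 128 experts, but only 4 expert computations ever run, which is exactly why offloading the cold experts to RAM is workable.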
Final Configuration
The stable working configuration for GPT-OSS-120B on dual 3090s:
```
llama-server \
  -m gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
  --ctx-size 38000 \
  --n-gpu-layers 99 \
  --n-cpu-moe 26 \
  --split-mode row \
  --tensor-split 0.5,0.5 \
  --main-gpu 0 \
  --fit on \
  -ub 512 \
  -b 512 \
  --threads 8
```
Would I Recommend This?
For experimentation and understanding MoE behavior: yes. For production use or latency-sensitive applications: no. We had neither enough VRAM nor enough system RAM to achieve satisfactory results, and llama.cpp could not split the active parameters evenly across the GPUs, so we couldn't make the best use of what little VRAM we had.
The expert swapping latency makes interactive use feel inconsistent. A third 3090 (72GB total VRAM) or more system RAM (128GB+) would likely improve the experience by keeping more experts resident. Keeping all experts on GPU would be ideal, of course.
That said, running a 120B parameter model at all on $2000 worth of used GPUs is notable. The MoE architecture and llama.cpp's CPU offloading make it possible, even if not perfectly smooth.