This post documents an experiment running GPT-OSS-120B, a large Mixture of Experts model, on a dual RTX 3090 system with 64GB of system RAM. The goal was to see how far we could push consumer hardware using llama.cpp's MoE CPU offloading feature.
Hardware and Model
System Specs
- GPUs: 2x NVIDIA RTX 3090 (48GB VRAM total)
- CPU: AMD Ryzen 7 5800X
- RAM: 64GB DDR4
- Framework: llama.cpp server
Model Details
GPT-OSS-120B is a Mixture of Experts model with the following architecture:
| Parameter | Value |
|---|---|
| Total Parameters | 116.83 billion |
| Experts | 128 total, 4 active per token |
| MoE Layers | 36 |
| Max Context | 131,072 tokens |
| File Size (F16) | ~65 GB |
The Challenge
With 48GB of VRAM and a ~65GB model, the entire model cannot fit in GPU memory. The solution is llama.cpp's --n-cpu-moe flag, which offloads MoE expert layers to system RAM while keeping the rest of the model on GPU.
Configuration Journey
Initial Attempt: Single GPU
The starting configuration used only one GPU:
```
--split-mode none --main-gpu 1 --n-cpu-moe 30
```
This immediately hit out-of-memory (OOM) errors: the GPU-resident portion of the model plus the KV cache and compute buffers doesn't fit on a single 24GB card.
Working Configuration: Row Split with Both GPUs
The working configuration uses both GPUs with row splitting:
```
--split-mode row --tensor-split 0.5,0.5 --main-gpu 0 --n-cpu-moe 26 --ctx-size 38000 --fit on
```
Memory Distribution
With this configuration, memory is distributed as follows:
| Location | Usage | Contents |
|---|---|---|
| GPU 0 | ~7.5 GB | Model split + KV cache + compute buffers |
| GPU 1 | ~22.8 GB | Model split + KV cache + compute buffers |
| System RAM (CUDA_Host) | ~42 GB | 26 offloaded MoE layers |
Performance Results
Token Generation Speed: 10-17 tokens/second (when active experts are cached in VRAM)
The throughput varies depending on whether the required experts are already in VRAM or need to be swapped from system RAM.
The Expert Swapping Problem
With 26 of 36 MoE layers offloaded to CPU, the model frequently needs to swap experts between system RAM and VRAM. This creates dead time during inference: the model could go silent for a minute or more. Generation proceeds smoothly while the required experts are cached, then stalls when new experts have to be loaded.
In practice with opencode, this manifested as periodic pauses in response generation. Offloading fewer experts to CPU reduces how often these stalls occur.
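The stalls hurt more than the raw generation speed suggests, because a single long stall is amortized over every token in the burst around it. A sketch with illustrative numbers (none of these are measurements; the burst length in particular is an assumption):

```python
# Effective throughput when smooth generation is punctuated by
# expert-swap stalls. All numbers are ILLUSTRATIVE assumptions,
# not measurements from this experiment.
burst_rate = 15.0      # tok/s while the needed experts are cached
burst_tokens = 500     # ASSUMPTION: tokens generated between stalls
stall_s = 60.0         # stalls of "a minute or more"

effective = burst_tokens / (burst_tokens / burst_rate + stall_s)
print(f"effective rate: ~{effective:.1f} tok/s")
```

Under these assumptions a single one-minute stall cuts a 15 tok/s burst to roughly a third of that, which matches how inconsistent the sessions felt.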
What We Tried That Didn't Work
Quantization (Q4_K_M)
We downloaded the Q4_K_M quantized version expecting significant memory savings. The actual file sizes:
| Quantization | File Size |
|---|---|
| F16 | 65.4 GB |
| Q4_K_M | 62.8 GB |
Only a 4% reduction, not the roughly 4x reduction typical of quantizing a dense F16 model. The expert weights appear to be stored in low precision already (GPT-OSS reportedly ships with MXFP4 expert weights), so Q4_K_M had little left to compress and provided minimal benefit for this particular model.
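The small gap makes sense once you look at bytes per parameter: a true F16 model needs 2 bytes per weight, but dividing the file sizes above by the parameter count shows both files already sit near 4-bit density (this arithmetic treats GB as GiB, which only shifts the result slightly):

```python
# Bytes per parameter for each file, vs. the 2 bytes/param a true
# F16 model would need and the 0.5 bytes/param of a pure 4-bit format.
total_params = 116.83e9
gib = 1024**3

results = {}
for label, size_gb in [("F16", 65.4), ("Q4_K_M", 62.8)]:
    results[label] = size_gb * gib / total_params
    print(f"{label}: {results[label]:.2f} bytes/param")
```

Both files land around 0.6 bytes/param, far closer to 4-bit than to F16, so Q4_K_M had almost nothing to gain.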
KV Cache on CPU (--no-kv-offload)
We attempted to move the KV cache to system RAM to allow larger context sizes. This failed because the 64GB system RAM was already consumed by the ~42GB of offloaded MoE layers, leaving insufficient space for both the KV cache and OS overhead.
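The budget arithmetic makes the failure concrete. The OS/overhead figure here is an assumption for illustration; everything else comes from the numbers above:

```python
# Why --no-kv-offload didn't fit: the system-RAM budget on this box.
total_ram_gb = 64.0
moe_offload_gb = 42.0   # 26 offloaded MoE layers, per the table above
os_overhead_gb = 8.0    # ASSUMPTION: OS, buffers, other processes

headroom_gb = total_ram_gb - moe_offload_gb - os_overhead_gb
print(f"RAM left for a CPU-side KV cache: ~{headroom_gb:.0f} GB")
```

Whatever remains has to hold the KV cache plus llama.cpp's own host-side buffers, which left no comfortable margin at the context sizes we wanted.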
Larger Context (65k+)
Attempts to increase context beyond ~40k tokens failed due to GPU memory constraints. The KV cache and compute buffers are split between both GPUs based on --tensor-split, and GPU 1 runs out of headroom first. The --main-gpu 0 flag doesn't concentrate KV cache on a single GPU when using row split mode.
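For intuition on why context is the bottleneck: KV-cache size grows linearly with context length. This sketch uses the generic per-token formula with assumed attention dimensions; the head counts are placeholders, not published figures for this model, and sliding-window layers would shrink the real number:

```python
# Generic per-token KV-cache size: 2 (K and V) x layers x kv_heads x
# head_dim x bytes per element. Head counts are ASSUMED placeholders.
n_layers = 36
n_kv_heads = 8          # ASSUMPTION (grouped-query attention)
head_dim = 64           # ASSUMPTION
bytes_per_elem = 2      # f16 cache

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
for ctx in (38_000, 65_536, 131_072):
    print(f"ctx {ctx:>7}: ~{per_token * ctx / 1024**3:.1f} GB KV cache")
```

Even under these modest assumptions, doubling the context roughly doubles the cache, and that growth has to fit inside whichever GPU hits its ceiling first.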
Key Learnings
What MoE CPU Offloading Enables
- Running models that exceed total VRAM capacity
- Trading latency (expert swapping) for capability (larger models)
- Using system RAM as a "cold storage" tier for inactive experts
Limitations Encountered
- Context ceiling: ~38-40k tokens due to GPU 1 memory constraints
- Latency variance: Smooth generation interrupted by expert swap stalls
- System RAM competition: MoE offload and KV cache both want system RAM
- Uneven GPU utilization: Row split mode distributes evenly by ratio, not by available headroom
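One workaround for the uneven-utilization point is to derive --tensor-split from each GPU's actual free memory instead of splitting 50/50. A sketch, with the free-VRAM values hardcoded as assumptions (in practice you would read them from nvidia-smi):

```python
# Derive a --tensor-split ratio from per-GPU free VRAM instead of
# splitting 50/50. The free-memory values are ASSUMED examples; query
# real ones with: nvidia-smi --query-gpu=memory.free --format=csv
free_vram_gb = [16.0, 24.0]   # ASSUMPTION: GPU 0 also drives the desktop

total = sum(free_vram_gb)
split = [round(v / total, 2) for v in free_vram_gb]
print("--tensor-split " + ",".join(str(s) for s in split))
```

This biases the split toward the GPU with more headroom, which should delay the point at which the more loaded card runs out first.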
The MoE Advantage
Despite the challenges, this experiment demonstrates the core MoE value proposition:
A 120B parameter model with only 4 experts active per token (~5.1B active parameters) can run on consumer hardware. The sparse activation pattern means you get the knowledge encoded in 120B parameters while only computing through a fraction of them per token.
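The sparse activation pattern can be illustrated with a toy top-k router: a gating network scores all 128 experts, but only the top 4 are actually evaluated for each token. This is a schematic sketch of the technique, not the model's actual routing code:

```python
import numpy as np

# Toy top-4-of-128 MoE router: score every expert, evaluate only the top 4.
rng = np.random.default_rng(0)
n_experts, n_active, d_model = 128, 4, 64

gate_w = rng.standard_normal((d_model, n_experts))
experts_run = set()  # track which experts were actually evaluated

def route(token_vec):
    """Pick the top-4 experts by gate score; mix outputs by softmax weight."""
    scores = token_vec @ gate_w
    top = np.argsort(scores)[-n_active:]      # indices of the top-4 experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                              # softmax over the top-4 only
    experts_run.update(top.tolist())
    # Stand-in expert computation: a per-expert scaling of the token.
    return sum(wi * (token_vec * (1 + ei / n_experts))
               for wi, ei in zip(w, top))

out = route(rng.standard_normal(d_model))
print(f"experts evaluated for this token: {len(experts_run)} of {n_experts}")
```

The gating scores exist for all 128 experts, but only 4 expert computations ever run, which is exactly why offloading the cold experts to RAM is workable.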
Final Configuration
The stable working configuration for GPT-OSS-120B on dual 3090s:
```
llama-server \
  -m gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
  --ctx-size 38000 \
  --n-gpu-layers 99 \
  --n-cpu-moe 26 \
  --split-mode row \
  --tensor-split 0.5,0.5 \
  --main-gpu 0 \
  --fit on \
  -ub 512 \
  -b 512 \
  --threads 8
```
Would I Recommend This?
For experimentation and understanding MoE behavior: yes. For production use or latency-sensitive applications: no. We had neither enough VRAM nor enough system RAM to achieve satisfactory results, and llama.cpp could not split the active parameters evenly across the GPUs, so we couldn't make the best use of what little VRAM we had.
The expert swapping latency makes interactive use feel inconsistent. A third 3090 (72GB total VRAM) or more system RAM (128GB+) would likely improve the experience by keeping more experts resident. Keeping all experts on GPU would be ideal, of course.
That said, running a 120B parameter model at all on $2000 worth of used GPUs is notable. The MoE architecture and llama.cpp's CPU offloading make it possible, even if not perfectly smooth.