Getting large language models to handle long contexts on consumer hardware is challenging. After extensive testing with dual RTX 3090s, I've learned what works, what doesn't, and how to push the boundaries of context windows while maintaining stable performance.
Starting Point: 2x RTX3090 Configuration
Hardware Foundation
- GPU: 2x NVIDIA RTX 3090 (48GB total VRAM)
- CUDA: 12.4 with latest drivers
- Framework: llama.cpp with multi-GPU support
- Models Tested: Qwen3-32B (Dense) and Nemotron-3-Nano (MoE)
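Before loading anything, it's worth confirming that both cards and the CUDA toolkit are actually visible. A minimal sanity check (output formatting varies slightly by driver version):

# List both GPUs and their total VRAM
nvidia-smi --query-gpu=index,name,memory.total --format=csv

# Confirm the CUDA toolkit version llama.cpp was built against
nvcc --version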
Model Setup and Quick Switch Commands
Nemotron-3-Nano (MoE) - 65K Context
pkill llama-server 2>/dev/null; sleep 2
nohup /home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/9ad8b366c308f931b2a96b9306f0b41aef9cd405/Nemotron-3-Nano-30B-A3B-Q6_K.gguf \
  -c 65536 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8000 \
  > /home/tomwest/models/nemotron-server.log 2>&1 &
echo "Nemotron started on port 8000"
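Once the server is up, a quick smoke test through the OpenAI-compatible endpoint that llama-server exposes confirms the model is actually answering (the prompt and max_tokens values here are just placeholders):

# Minimal chat request against the local server
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'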
Qwen3-32B (Dense) - 40K Context
pkill llama-server 2>/dev/null; sleep 2
nohup /home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/models/Qwen3-32B-GGUF/Qwen3-32B-Q5_K_M.gguf \
  -c 40960 \
  -ngl 99 \
  --split-mode layer \
  --host 0.0.0.0 \
  --port 8000 \
  > /home/tomwest/models/qwen3-server.log 2>&1 &
echo "Qwen3-32B started on port 8000"
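A 32B model takes a while to load from disk, so anything that switches models should wait for the server to become ready before sending requests. A small polling loop, assuming the /health route that llama-server provides (it returns an error status while the model is still loading):

# Poll the health endpoint until the model has finished loading
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for llama-server to become ready..."
  sleep 5
done
echo "llama-server is ready"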
The Context Window Battle
Qwen3-32B vs. Nemotron Context Limits
| Model | Context Tokens | VRAM Usage | Status |
|---|---|---|---|
| Qwen3-32B Q5_K_M | 32K | ~40.5GB | ✓ OK |
| Qwen3-32B Q5_K_M | 40K | ~44.8GB | ✓ OK (recommended) |
| Qwen3-32B Q5_K_M | 45K | ~46.9GB | ⚠ Loads but OOM during inference |
| Qwen3-32B Q5_K_M | 48K | - | ✗ FAIL (OOM on load) |
| Nemotron-3-Nano Q6_K | 65K | ~35GB | ✓ OK |
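The table was built by stepping the context size upward until the server either failed to load or fell over during generation. A rough sketch of that kind of sweep (a hypothetical script, not the exact one used; it only catches load-time failures, since inference-time OOMs need a long prompt to trigger):

#!/bin/bash
# Step through candidate context sizes and record VRAM usage after a successful load
for ctx in 32768 40960 46080 49152; do
  pkill llama-server 2>/dev/null; sleep 2
  /home/tomwest/llama.cpp/build/bin/llama-server \
    -m /home/tomwest/models/Qwen3-32B-GGUF/Qwen3-32B-Q5_K_M.gguf \
    -c "$ctx" -ngl 99 --split-mode layer --port 8000 > /tmp/ctx-sweep.log 2>&1 &
  sleep 90   # allow time for the model to load; adjust for your disk speed
  if curl -sf http://localhost:8000/health > /dev/null; then
    echo "ctx=$ctx loaded OK, VRAM in use:"
    nvidia-smi --query-gpu=memory.used --format=csv,noheader
  else
    echo "ctx=$ctx failed to load (see /tmp/ctx-sweep.log)"
  fi
done
pkill llama-server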
The Core Challenge: Multi-GPU KV Cache
KV Cache Quantization Failure
The biggest roadblock: KV cache quantization (the -ctk and -ctv flags) requires Flash Attention, but Flash Attention doesn't work with --split-mode layer on multi-GPU setups.
Error: "quantized V cache was requested, but this requires Flash Attention"
Why This Matters
- KV Cache Size: At FP16, Qwen3-32B uses roughly 0.25GB of KV cache per 1K tokens of context
- 45K Context: ~11GB just for KV cache
- Available VRAM: 48GB total - 23GB model = 25GB remaining
- The Bottleneck: FP16 KV cache consumes too much memory
Understanding the Memory Math
Qwen3-32B Memory Breakdown
| Component | Size (40K Context) | Size (65K Context) |
|---|---|---|
| Model Weights (Q5_K_M) | ~23GB | ~23GB |
| KV Cache (FP16) | ~10GB | ~16GB |
| Overhead | ~2GB | ~2GB |
| Total | ~35GB | ~41GB |
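The KV cache figures follow directly from the attention geometry: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. A back-of-the-envelope check in shell, assuming Qwen3-32B's published shape of 64 layers, 8 KV heads (GQA), and a head dimension of 128:

# Approximate FP16 KV cache size for Qwen3-32B (assumed: 64 layers, 8 KV heads, head_dim 128)
layers=64; kv_heads=8; head_dim=128; bytes_per_elem=2
ctx=40960
per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))    # bytes per token (K + V)
total=$(( per_token * ctx ))
echo "Per token: $(( per_token / 1024 )) KiB"                         # ~256 KiB
echo "Total at ${ctx} tokens: $(( total / 1024 / 1024 / 1024 )) GiB"  # ~10 GiB

At 40K tokens this lands on the ~10GB in the table; at 65K it comes out near 16GB.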
Why MoE Wins for Context
| Model | Active Parameters | VRAM for Weights | Available for KV Cache |
|---|---|---|---|
| Qwen3-32B (Dense) | 32B (100%) | ~23GB | ~25GB remaining |
| Nemotron-3-Nano (MoE) | 3B (10%) | ~19GB | ~29GB remaining |
Optimization Attempts and Results
Strategy 1: CPU KV Cache Offloading
First attempt: enable CPU-based KV cache offloading to reclaim VRAM.
# Build with CPU KV cache support
cmake -Bbuild -DGGML_USE_CPU_KV_CACHE=ON
cmake --build build

# Runtime configuration
--n-gpu-layers 32   # Adjust based on GPU capacity
--cpu-kv            # Enable CPU-based KV cache
Result: Worked but introduced performance bottlenecks due to CPU-GPU memory transfers.
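For comparison, stock llama.cpp exposes a --no-kv-offload (-nkvo) flag that keeps the KV cache in system RAM without a custom build; whether it helps depends on how badly the PCIe transfers hurt prompt processing. A sketch, assuming a recent llama-server build:

# Keep the KV cache in host RAM instead of VRAM (expect slower prompt processing)
/home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/models/Qwen3-32B-GGUF/Qwen3-32B-Q5_K_M.gguf \
  -c 65536 -ngl 99 --split-mode layer --no-kv-offload \
  --host 0.0.0.0 --port 8000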
Strategy 2: Failed Quantization Experiments
Tried various KV cache quantization approaches:
- --cache-type-k q4_0 --cache-type-v q4_0: Blocked by the multi-GPU limitation
- --rope-freq-scale 0.5: Extended the theoretical limits but still memory-constrained
- --ctx-size 44000: Incremental testing led to OOM failures
Current Working Solutions
Hybrid Approach: Model Selection
For Long Context Requirements
Use Nemotron-3-Nano (MoE): Handles 65K context comfortably at 64.82 tokens/sec
- VRAM-efficient due to sparse activation
- Excellent for document analysis, long conversations
- Maintains good throughput despite large context
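The 64.82 tokens/sec figure can be cross-checked with llama.cpp's bundled llama-bench tool, which reports prompt-processing and generation speed (the prompt and generation lengths below are arbitrary):

# Benchmark generation speed with all layers on GPU
/home/tomwest/llama.cpp/build/bin/llama-bench \
  -m /home/tomwest/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/9ad8b366c308f931b2a96b9306f0b41aef9cd405/Nemotron-3-Nano-30B-A3B-Q6_K.gguf \
  -p 512 -n 128 -ngl 99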
For Maximum Quality
Use Qwen3-32B (Dense): Limited to ~40K context but higher output quality
- Better for creative writing tasks
- Consistent modeling style across all tokens
- Faster per-token processing (when context fits)
Multi-GPU Configuration Challenges
Split Mode Limitations
The --split-mode layer requirement for multi-GPU operation prevents several optimizations:
- Flash Attention (required for KV cache quantization)
- Certain memory optimizations
- Advanced batching strategies
GPU Load Balancing
# Current working configuration
--split-mode layer   # Split model across GPUs
-ngl 99              # Offload all layers to GPU
--host 0.0.0.0       # Network accessibility
--port 8000          # Service port
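By default the layer split is roughly even across the two cards, but if one GPU carries extra load (a desktop session, another process), llama.cpp's --tensor-split flag can bias how many layers land on each device. A hedged example, assuming the flag behaves as in current builds:

# Put ~60% of the layers on GPU 0 and ~40% on GPU 1
--split-mode layer --tensor-split 60,40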
Future Optimization Paths
Potential Workarounds
- vLLM with AWQ: An alternative framework that supports FP8 KV cache without the Flash Attention requirement (sketched after this list)
- Q4_K_M Quantization: Using Q4 instead of Q5_K_M saves ~3GB VRAM
- llama.cpp Updates: Future Flash Attention multi-GPU support could enable KV cache quantization
- Single GPU Mode: Running on one RTX3090 eliminates multi-GPU limitations but halves available VRAM
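To make the first option concrete, here is roughly what it would look like with vLLM's OpenAI-compatible server; the model ID is a placeholder and the flags assume a recent vLLM release (AWQ weights, FP8 KV cache, tensor parallelism across both cards):

# Hypothetical vLLM launch: AWQ quantization, FP8 KV cache, both GPUs
vllm serve Qwen/Qwen3-32B-AWQ \
  --tensor-parallel-size 2 \
  --quantization awq \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --port 8000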
Practical Recommendations
Setup Command Summary
Check Server Status
# Check if running
ps aux | grep llama-server | grep -v grep

# Check health
curl -s http://localhost:8000/health

# Check VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Tail logs
tail -f /home/tomwest/models/qwen3-server.log
Model Switching Script
#!/bin/bash
case $1 in
"nemotron")
echo "Starting Nemotron-3-Nano with 65K context..."
# Nemotron startup command from above
;;
"qwen")
echo "Starting Qwen3-32B with 40K context..."
# Qwen startup command from above
;;
"stop")
pkill llama-server
echo "llama-server stopped"
;;
*)
echo "Usage: $0 {nemotron|qwen|stop}"
exit 1
;;
esac
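Saved as, say, switch-model.sh (the filename is arbitrary), switching models becomes a one-liner:

chmod +x switch-model.sh
./switch-model.sh nemotron   # or: qwen, stop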
Key Learnings
Technical Insights
- KV Cache is the Bottleneck: At large contexts the FP16 KV cache grows to rival the model weights themselves in VRAM consumption
- Multi-GPU Trade-offs: Distributed processing enables larger models but prevents key optimizations
- MoE Architecture Benefits: Sparse activation leaves VRAM for larger context windows
- Framework Limitations: llama.cpp's design choices impact what optimizations are possible
Practical Truths
- Roughly 40K context is the practical maximum for dense 32B models on dual RTX 3090s (45K loads but OOMs during inference)
- MoE models can handle 65K+ context on the same hardware
- Performance varies significantly between model architectures
- Future framework updates may dramatically change these limits
Bottom Line
For anyone wanting large context windows on consumer hardware, MoE models like Nemotron-3-Nano are currently the best option. The combination of sparse activation and efficient memory usage makes 65K context windows practical and performant on dual RTX3090 hardware.