Getting large language models to handle long contexts on consumer hardware is challenging. After extensive testing with dual RTX 3090s, I've learned what works, what doesn't, and how to push the boundaries of context windows while maintaining stable performance.
Starting Point: 2x RTX3090 Configuration
Hardware Foundation
- GPU: 2x NVIDIA RTX 3090 (48GB total VRAM)
- CUDA: 12.4 with latest drivers
- Framework: llama.cpp with multi-GPU support
- Models Tested: Qwen3-32B (Dense) and Nemotron-3-Nano (MoE)
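Before loading anything, it's worth confirming that both cards and the CUDA toolkit are actually visible. A minimal sanity check (output formatting varies slightly by driver version):

# List both GPUs and their total VRAM
nvidia-smi --query-gpu=index,name,memory.total --format=csv

# Confirm the CUDA toolkit version llama.cpp was built against
nvcc --version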
Model Setup and Quick Switch Commands
Nemotron-3-Nano (MoE) - 65K Context
pkill llama-server 2>/dev/null; sleep 2
nohup /home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/9ad8b366c308f931b2a96b9306f0b41aef9cd405/Nemotron-3-Nano-30B-A3B-Q6_K.gguf \
  -c 65536 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8000 \
  > /home/tomwest/models/nemotron-server.log 2>&1 &
echo "Nemotron started on port 8000"
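Once the server is up, a quick smoke test through the OpenAI-compatible endpoint that llama-server exposes confirms the model is actually answering (the prompt and max_tokens values here are just placeholders):

# Minimal chat request against the local server
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'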
Qwen3-32B (Dense) - 40K Context
pkill llama-server 2>/dev/null; sleep 2
nohup /home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/models/Qwen3-32B-GGUF/Qwen3-32B-Q5_K_M.gguf \
  -c 40960 \
  -ngl 99 \
  --split-mode layer \
  --host 0.0.0.0 \
  --port 8000 \
  > /home/tomwest/models/qwen3-server.log 2>&1 &
echo "Qwen3-32B started on port 8000"
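A 32B model takes a while to load from disk, so anything that switches models should wait for the server to become ready before sending requests. A small polling loop, assuming the /health route that llama-server provides (it returns an error status while the model is still loading):

# Poll the health endpoint until the model has finished loading
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for llama-server to become ready..."
  sleep 5
done
echo "llama-server is ready"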
The Context Window Battle
Qwen3-32B vs. Nemotron Context Limits
| Model | Context Tokens | VRAM Usage | Status |
|---|---|---|---|
| Qwen3-32B Q5_K_M | 32K | ~40.5GB | ✓ OK |
| Qwen3-32B Q5_K_M | 40K | ~44.8GB | ✓ OK (recommended) |
| Qwen3-32B Q5_K_M | 45K | ~46.9GB | ⚠ Loads but OOM during inference |
| Qwen3-32B Q5_K_M | 48K | - | ✗ FAIL (OOM on load) |
| Nemotron-3-Nano Q6_K | 65K | ~35GB | ✓ OK |
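The table was built by stepping the context size upward until the server either failed to load or fell over during generation. A rough sketch of that kind of sweep (a hypothetical script, not the exact one used; it only catches load-time failures, since inference-time OOMs need a long prompt to trigger):

#!/bin/bash
# Step through candidate context sizes and record VRAM usage after a successful load
for ctx in 32768 40960 46080 49152; do
  pkill llama-server 2>/dev/null; sleep 2
  /home/tomwest/llama.cpp/build/bin/llama-server \
    -m /home/tomwest/models/Qwen3-32B-GGUF/Qwen3-32B-Q5_K_M.gguf \
    -c "$ctx" -ngl 99 --split-mode layer --port 8000 > /tmp/ctx-sweep.log 2>&1 &
  sleep 90   # allow time for the model to load; adjust for your disk speed
  if curl -sf http://localhost:8000/health > /dev/null; then
    echo "ctx=$ctx loaded OK, VRAM in use:"
    nvidia-smi --query-gpu=memory.used --format=csv,noheader
  else
    echo "ctx=$ctx failed to load (see /tmp/ctx-sweep.log)"
  fi
done
pkill llama-server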
The Core Challenge: Multi-GPU KV Cache
KV Cache Quantization Failure
The biggest roadblock: KV cache quantization (the -ctk and -ctv flags) requires Flash Attention, but Flash Attention doesn't work with --split-mode layer on multi-GPU setups.
Error: "quantized V cache was requested, but this requires Flash Attention"
Why This Matters
- KV Cache Size: At FP16, Qwen3-32B uses roughly 0.25GB of KV cache per 1K tokens of context
- 45K Context: ~11GB just for KV cache
- Available VRAM: 48GB total - 23GB model = 25GB remaining
- The Bottleneck: FP16 KV cache consumes too much memory
Understanding the Memory Math
Qwen3-32B Memory Breakdown
| Component | Size (40K Context) | Size (65K Context) |
|---|---|---|
| Model Weights (Q5_K_M) | ~23GB | ~23GB |
| KV Cache (FP16) | ~10GB | ~16GB |
| Overhead | ~2GB | ~2GB |
| Total | ~35GB | ~41GB |
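The KV cache figures follow directly from the attention geometry: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. A back-of-the-envelope check in shell, assuming Qwen3-32B's published shape of 64 layers, 8 KV heads (GQA), and a head dimension of 128:

# Approximate FP16 KV cache size for Qwen3-32B (assumed: 64 layers, 8 KV heads, head_dim 128)
layers=64; kv_heads=8; head_dim=128; bytes_per_elem=2
ctx=40960
per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))    # bytes per token (K + V)
total=$(( per_token * ctx ))
echo "Per token: $(( per_token / 1024 )) KiB"                         # ~256 KiB
echo "Total at ${ctx} tokens: $(( total / 1024 / 1024 / 1024 )) GiB"  # ~10 GiB

At 40K tokens this lands on the ~10GB in the table; at 65K it comes out near 16GB.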
Why MoE Wins for Context
| Model | Active Parameters | VRAM for Weights | Available for KV Cache |
|---|---|---|---|
| Qwen3-32B (Dense) | 32B (100%) | ~23GB | ~25GB remaining |
| Nemotron-3-Nano (MoE) | 3B (10%) | ~19GB | ~29GB remaining |
Optimization Attempts and Results
Strategy 1: CPU KV Cache Offloading
First attempt: enable CPU-based KV cache offloading to reclaim VRAM.
# Build with CPU KV cache support
cmake -Bbuild -DGGML_USE_CPU_KV_CACHE=ON
cmake --build build

# Runtime configuration
--n-gpu-layers 32   # Adjust based on GPU capacity
--cpu-kv            # Enable CPU-based KV cache
Result: Worked but introduced performance bottlenecks due to CPU-GPU memory transfers.
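For comparison, stock llama.cpp exposes a --no-kv-offload (-nkvo) flag that keeps the KV cache in system RAM without a custom build; whether it helps depends on how badly the PCIe transfers hurt prompt processing. A sketch, assuming a recent llama-server build:

# Keep the KV cache in host RAM instead of VRAM (expect slower prompt processing)
/home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/models/Qwen3-32B-GGUF/Qwen3-32B-Q5_K_M.gguf \
  -c 65536 -ngl 99 --split-mode layer --no-kv-offload \
  --host 0.0.0.0 --port 8000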
Strategy 2: Failed Quantization Experiments
Tried various KV cache quantization approaches:
- --cache-type-k q4_0 --cache-type-v q4_0: Blocked by the multi-GPU limitation
- --rope-freq-scale 0.5: Extended the theoretical limits but still memory-constrained
- --ctx-size 44000: Incremental testing led to OOM failures
Current Working Solutions
Hybrid Approach: Model Selection
For Long Context Requirements
Use Nemotron-3-Nano (MoE): Handles 65K context comfortably at 64.82 tokens/sec
- VRAM-efficient due to sparse activation
- Excellent for document analysis, long conversations
- Maintains good throughput despite large context
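The 64.82 tokens/sec figure can be cross-checked with llama.cpp's bundled llama-bench tool, which reports prompt-processing and generation speed (the prompt and generation lengths below are arbitrary):

# Benchmark generation speed with all layers on GPU
/home/tomwest/llama.cpp/build/bin/llama-bench \
  -m /home/tomwest/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/9ad8b366c308f931b2a96b9306f0b41aef9cd405/Nemotron-3-Nano-30B-A3B-Q6_K.gguf \
  -p 512 -n 128 -ngl 99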
For Maximum Quality
Use Qwen3-32B (Dense): Limited to ~40K context but higher output quality
- Better for creative writing tasks
- Consistent modeling style across all tokens
- Faster per-token processing (when context fits)
Multi-GPU Configuration Challenges
Split Mode Limitations
The --split-mode layer requirement for multi-GPU operation prevents several optimizations:
- Flash Attention (required for KV cache quantization)
- Certain memory optimizations
- Advanced batching strategies
GPU Load Balancing
# Current working configuration
--split-mode layer   # Split model across GPUs
-ngl 99              # Offload all layers to GPU
--host 0.0.0.0       # Network accessibility
--port 8000          # Service port
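By default the layer split is roughly even across the two cards, but if one GPU carries extra load (a desktop session, another process), llama.cpp's --tensor-split flag can bias how many layers land on each device. A hedged example, assuming the flag behaves as in current builds:

# Put ~60% of the layers on GPU 0 and ~40% on GPU 1
--split-mode layer --tensor-split 60,40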
Future Optimization Paths
Potential Workarounds
- vLLM with AWQ: An alternative framework that supports FP8 KV cache without the Flash Attention requirement (sketched after this list)
- Q4_K_M Quantization: Using Q4 instead of Q5_K_M saves ~3GB VRAM
- llama.cpp Updates: Future Flash Attention multi-GPU support could enable KV cache quantization
- Single GPU Mode: Running on one RTX3090 eliminates multi-GPU limitations but halves available VRAM
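To make the first option concrete, here is roughly what it would look like with vLLM's OpenAI-compatible server; the model ID is a placeholder and the flags assume a recent vLLM release (AWQ weights, FP8 KV cache, tensor parallelism across both cards):

# Hypothetical vLLM launch: AWQ quantization, FP8 KV cache, both GPUs
vllm serve Qwen/Qwen3-32B-AWQ \
  --tensor-parallel-size 2 \
  --quantization awq \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --port 8000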
Practical Recommendations
Setup Command Summary
Check Server Status
# Check if running
ps aux | grep llama-server | grep -v grep

# Check health
curl -s http://localhost:8000/health

# Check VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Tail logs
tail -f /home/tomwest/models/qwen3-server.log
Model Switching Script
#!/bin/bash
case $1 in
"nemotron")
echo "Starting Nemotron-3-Nano with 65K context..."
# Nemotron startup command from above
;;
"qwen")
echo "Starting Qwen3-32B with 40K context..."
# Qwen startup command from above
;;
"stop")
pkill llama-server
echo "llama-server stopped"
;;
*)
echo "Usage: $0 {nemotron|qwen|stop}"
exit 1
;;
esac
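Saved as, say, switch-model.sh (the filename is arbitrary), switching models becomes a one-liner:

chmod +x switch-model.sh
./switch-model.sh nemotron   # or: qwen, stop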
Key Learnings
Technical Insights
- KV Cache is the Bottleneck: At large contexts the FP16 KV cache grows to rival the model weights themselves in VRAM consumption
- Multi-GPU Trade-offs: Distributed processing enables larger models but prevents key optimizations
- MoE Architecture Benefits: Sparse activation leaves VRAM for larger context windows
- Framework Limitations: llama.cpp's design choices impact what optimizations are possible
Practical Truths
- Roughly 40K context is the practical maximum for dense 32B models on dual RTX 3090s (45K loads but OOMs during inference)
- MoE models can handle 65K+ context on the same hardware
- Performance varies significantly between model architectures
- Future framework updates may dramatically change these limits
Bottom Line
For anyone wanting large context windows on consumer hardware, MoE models like Nemotron-3-Nano are currently the best option. The combination of sparse activation and efficient memory usage makes 65K context windows practical and performant on dual RTX3090 hardware.