Pushing Context Limits with llama.cpp

Optimizing multi-GPU setups, testing context windows, and workarounds for 65K token limits
January 2026

Getting large language models to handle long contexts on consumer hardware is challenging. After extensive testing with dual RTX 3090s, I've learned what works, what doesn't, and how to push the boundaries of context windows while maintaining stable performance.

Starting Point: 2x RTX3090 Configuration

Hardware Foundation

Model Setup and Quick Switch Commands

Nemotron-3-Nano (MoE) - 65K Context

pkill llama-server 2>/dev/null; sleep 2

nohup /home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/9ad8b366c308f931b2a96b9306f0b41aef9cd405/Nemotron-3-Nano-30B-A3B-Q6_K.gguf \
  -c 65536 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8000 \
  > /home/tomwest/models/nemotron-server.log 2>&1 &

echo "Nemotron started on port 8000"

Qwen3-32B (Dense) - 40K Context

pkill llama-server 2>/dev/null; sleep 2

nohup /home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/models/Qwen3-32B-GGUF/Qwen3-32B-Q5_K_M.gguf \
  -c 40960 \
  -ngl 99 \
  --split-mode layer \
  --host 0.0.0.0 \
  --port 8000 \
  > /home/tomwest/models/qwen3-server.log 2>&1 &

echo "Qwen3-32B started on port 8000"

The Context Window Battle

Qwen3-32B vs. Nemotron Context Limits

Model                | Context | VRAM Usage | Status
Qwen3-32B Q5_K_M     | 32K     | ~40.5GB    | ✓ OK
Qwen3-32B Q5_K_M     | 40K     | ~44.8GB    | ✓ OK (recommended)
Qwen3-32B Q5_K_M     | 45K     | ~46.9GB    | ⚠ Loads but OOM during inference
Qwen3-32B Q5_K_M     | 48K     | -          | ✗ FAIL (OOM on load)
Nemotron-3-Nano Q6_K | 65K     | ~35GB      | ✓ OK
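
These numbers came from relaunching the server at each context size and watching nvidia-smi. A loop like the sketch below automates the sweep; it reuses the Qwen3 paths from above, the 90-second load wait is a guess, and it only catches load-time failures, so the 45K in-flight OOM still has to be provoked with a genuinely long prompt.

# Rough sweep over context sizes: does the server come up, and how much VRAM does it take?
for CTX in 32768 40960 46080 49152; do
  pkill llama-server 2>/dev/null; sleep 2
  /home/tomwest/llama.cpp/build/bin/llama-server \
    -m /home/tomwest/models/Qwen3-32B-GGUF/Qwen3-32B-Q5_K_M.gguf \
    -c "$CTX" -ngl 99 --split-mode layer --port 8000 \
    > /tmp/ctx-test-$CTX.log 2>&1 &
  sleep 90   # allow time for ~23GB of weights to load; adjust for your disk speed
  HEALTH=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/health)
  VRAM=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk '{s+=$1} END {print s}')
  echo "ctx=$CTX  health=$HEALTH  vram=${VRAM}MiB"
done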

The Core Challenge: Multi-GPU KV Cache

KV Cache Quantization Failure

The biggest roadblock: KV cache quantization (the -ctk and -ctv flags) requires Flash Attention, but Flash Attention doesn't work with --split-mode layer on multi-GPU setups.

Error: "quantized V cache was requested, but this requires Flash Attention"
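
For reference, this is the combination that produces the error: the same multi-GPU Qwen3 launch with quantized cache types requested (q8_0 here is just an illustrative choice).

# Requesting quantized KV cache on the layer-split setup (fails with the error quoted above)
/home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/models/Qwen3-32B-GGUF/Qwen3-32B-Q5_K_M.gguf \
  -c 40960 -ngl 99 --split-mode layer \
  -ctk q8_0 -ctv q8_0 \
  --port 8000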

Why This Matters

Understanding the Memory Math

Qwen3-32B Memory Breakdown

Component              | Size (40K Context) | Size (65K Context)
Model Weights (Q5_K_M) | ~23GB              | ~23GB
KV Cache (FP16)        | ~10GB              | ~16GB
Overhead               | ~2GB               | ~2GB
Total                  | ~35GB              | ~41GB
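
The KV cache rows follow from the usual formula: bytes per token = 2 (K and V) × layers × KV heads × head dim × bytes per value. Plugging in Qwen3-32B's shape (64 layers, 8 KV heads, head dim 128 — treat those figures as my assumptions about the architecture) reproduces the table:

# FP16 KV cache, Qwen3-32B shape assumed: 64 layers, 8 KV heads, head_dim 128
BYTES_PER_TOKEN=$(( 2 * 64 * 8 * 128 * 2 ))                       # 262144 bytes (~256 KiB/token)
echo "40K context: $(( BYTES_PER_TOKEN * 40960 / 1024**3 )) GiB"  # ~10 GiB
echo "65K context: $(( BYTES_PER_TOKEN * 65536 / 1024**3 )) GiB"  # ~16 GiB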

Why MoE Wins for Context

Model                 | Active Parameters | VRAM for Weights | Available for KV Cache
Qwen3-32B (Dense)     | 32B (100%)        | ~23GB            | ~25GB remaining
Nemotron-3-Nano (MoE) | 3B (10%)          | ~19GB            | ~29GB remaining

Optimization Attempts and Results

Strategy 1: CPU KV Cache Offloading

First attempt: enable CPU-based KV cache offloading to reclaim VRAM.

# Standard CUDA build (no special KV-cache build flag is required)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Runtime configuration
--n-gpu-layers 32    # Adjust based on GPU capacity; remaining layers run on CPU
--no-kv-offload      # Keep the KV cache in system RAM instead of VRAM

Result: Worked but introduced performance bottlenecks due to CPU-GPU memory transfers.

Strategy 2: Failed Quantization Experiments

Tried various KV cache quantization settings via the -ctk and -ctv flags; every combination ran into the Flash Attention requirement described above.

Current Working Solutions

Hybrid Approach: Model Selection

For Long Context Requirements

Use Nemotron-3-Nano (MoE): Handles 65K context comfortably at 64.82 tokens/sec

For Maximum Quality

Use Qwen3-32B (Dense): Limited to ~40K context but higher output quality

Multi-GPU Configuration Challenges

Split Mode Limitations

Key Finding: The --split-mode layer requirement for multi-GPU operation rules out several optimizations, most notably Flash Attention and, with it, quantized KV cache (-ctk/-ctv).

GPU Load Balancing

# Current working configuration
--split-mode layer    # Split model across GPUs
-ngl 99               # Offload all layers to GPU
--host 0.0.0.0        # Network accessibility
--port 8000           # Service port
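
If the split across the two 3090s comes out lopsided (for example when GPU0 also drives a desktop), the distribution can be nudged with --tensor-split; the even 1,1 ratio below is just a starting point.

# Optional: control the proportion of layers placed on each GPU
--tensor-split 1,1    # even split across GPU0 and GPU1; skew to e.g. 0.9,1.1 to relieve GPU0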

Future Optimization Paths

Potential Workarounds

  1. vLLM with AWQ: Alternative framework supports FP8 KV cache without Flash Attention requirement (see the sketch after this list)
  2. Q4_K_M Quantization: Using Q4 instead of Q5_K_M saves ~3GB VRAM
  3. llama.cpp Updates: Future Flash Attention multi-GPU support could enable KV cache quantization
  4. Single GPU Mode: Running on one RTX3090 eliminates multi-GPU limitations but halves available VRAM
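
For option 1, the vLLM launch would look roughly like the sketch below. This assumes an AWQ build of the model is available (the model ID is a placeholder) and that 4-bit weights plus an FP8 KV cache actually fit across the two cards at this length, which I haven't verified yet.

# Sketch only: vLLM with tensor parallelism across both GPUs and an FP8 KV cache
vllm serve Qwen/Qwen3-32B-AWQ \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --port 8000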

Practical Recommendations

Setup Command Summary

Check Server Status

# Check if running
ps aux | grep llama-server | grep -v grep

# Check health
curl -s http://localhost:8000/health

# Check VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Tail logs
tail -f /home/tomwest/models/qwen3-server.log

Model Switching Script

#!/bin/bash

case $1 in
  "nemotron")
    echo "Starting Nemotron-3-Nano with 65K context..."
    # Nemotron startup command from above
    ;;
  "qwen")
    echo "Starting Qwen3-32B with 40K context..."
    # Qwen startup command from above
    ;;
  "stop")
    pkill llama-server
    echo "llama-server stopped"
    ;;
  *)
    echo "Usage: $0 {nemotron|qwen|stop}"
    exit 1
    ;;
esac

Key Learnings

Technical Insights

  1. KV Cache is the Bottleneck: at large contexts, the FP16 KV cache approaches the size of the model weights themselves and is what ultimately runs the cards out of memory
  2. Multi-GPU Trade-offs: Distributed processing enables larger models but prevents key optimizations
  3. MoE Architecture Benefits: Sparse activation leaves VRAM for larger context windows
  4. Framework Limitations: llama.cpp's design choices impact what optimizations are possible

Practical Truths

Bottom Line

For anyone wanting large context windows on consumer hardware, MoE models like Nemotron-3-Nano are currently the best option. The combination of sparse activation and efficient memory usage makes 65K context windows practical and performant on dual RTX3090 hardware.