This post documents our experience running Qwen3-Coder-Next, an 80B Mixture of Experts model, on a quad RTX 3090 system. We investigate VRAM distribution across multiple GPUs, benchmark token generation throughput, and document the configuration needed to run this powerful coding model locally.
Hardware and Model
System Specs
- GPUs: 4x NVIDIA RTX 3090 (96GB VRAM total)
- CPU: AMD Threadripper 5995WX
- RAM: 64GB DDR4 (all model data in VRAM)
- PCIe: Full PCIe 4.0 x16 bandwidth on all GPUs
- Framework: llama.cpp server
Model Details
Qwen3-Coder-Next is a Mixture of Experts model designed for agentic coding tasks:
| Parameter | Value |
|---|---|
| Total Parameters | 80B |
| Active Parameters | ~3B per token |
| Quantization | Q5_K_XL from Unsloth's HuggingFace page |
| File Size | 53.02 GB (2 GGUF chunks) |
| Context Length | 262,144 tokens (supports up to 1M) |
| Framework Support | llama.cpp, vLLM, SGLang, transformers |
Why Qwen3-Coder-Next?
Qwen3-Coder-Next is an 80B MoE model with performance comparable to models 10-20x larger. It excels at long-horizon reasoning, complex tool use, and recovery from execution failures. What makes it particularly impressive is that it requires only ~46GB VRAM to run, making it feasible on consumer hardware. For comparison, our dual-RTX-3090 setup (48GB) can run GLM-4.7-Flash (30B), but Qwen3-Coder-Next (80B) requires our full quad-GPU setup due to its larger parameter count, even though the active parameters per token are similar.
Q5_K_XL Quantization
Unsloth's Q5_K_XL provides an excellent balance of quality and speed. The model is non-reasoning (it emits no `<think></think>` blocks), enabling quick code responses. Q4_K_XL is smaller (fewer bits per weight) and would use less VRAM, but Q5_K_XL delivers better quality for only ~7GB more storage. Given our quad-RTX-3090 setup has ample VRAM, we chose Q5_K_XL for the stronger quality-to-speed tradeoff.
Benchmark Results
Token Generation Speed: 61-63 tokens/second
(Average over 5 runs, ~1080 character prompt generating ~100 tokens per run)
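The throughput numbers can be reproduced with a small client that times generations against the server's HTTP API. Below is a minimal sketch, assuming the server is listening on localhost:8000; the `/completion` endpoint and the `tokens_predicted` response field are llama.cpp server conventions, so check your build's docs if the names differ:

```python
import json
import time
import urllib.request

SERVER = "http://localhost:8000"  # host/port from the startup script

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in generated tokens per second."""
    return n_tokens / elapsed_s

def run_once(prompt: str, n_predict: int = 100) -> float:
    """Time one generation against the server's /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        f"{SERVER}/completion", data=body,
        headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    elapsed = time.monotonic() - start
    # The response reports how many tokens were actually generated
    return tokens_per_second(data["tokens_predicted"], elapsed)

# Example driver: average over 5 runs with a ~1KB prompt
# runs = [run_once("Explain this function in detail. " * 33) for _ in range(5)]
# print(f"avg TPS: {sum(runs) / len(runs):.2f}")
```

Note this measures wall-clock time including prompt processing; the server's own log reports generation-only timings if you want to separate the two.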
Performance Characteristics
The model shows fairly consistent performance, with per-run throughput ranging from 53.24 to 63.09 TPS and most runs landing in the 61-63 TPS band. This suggests stable inference on the current hardware configuration. With full PCIe 4.0 x16 bandwidth on all 4 GPUs and 64GB of CPU RAM (sufficient since all model data resides in VRAM), the Threadripper platform provides an excellent foundation for MoE models. The benchmark used -ngl 999 (all layers on GPU), which keeps throughput optimal.
Memory Analysis
The Qwen3-Coder-Next model weights total 53GB. With 4 GPUs in use, VRAM is distributed as follows (from nvidia-smi):
| GPU | Memory Usage | Total VRAM | Usage % |
|---|---|---|---|
| GPU 0 | 19.1 GB | 24.6 GB | 77% |
| GPU 1 | 17.6 GB | 24.6 GB | 72% |
| GPU 2 | 19.6 GB | 24.6 GB | 80% |
| GPU 3 | 18.3 GB | 24.6 GB | 74% |
| Total | 74.6 GB | 96.0 GB | 78% |
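The per-GPU figures above come from `nvidia-smi`; its query mode makes them easy to collect and total without scraping the default table output:

```shell
# Per-GPU memory snapshot while the server is running (values in MiB)
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits

# Sum used MiB across all GPUs and convert to GB
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits \
  | awk '{sum += $1} END {printf "total used: %.1f GB\n", sum / 1024}'
```

Running the second command during a long-context session is a quick way to watch KV cache growth.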
Memory Distribution Strategy
With 53GB of model weights and 4x 24GB GPUs, we have more than enough VRAM to load the entire model. The llama.cpp server automatically distributes layers across available GPUs when -ngl 999 is specified. The relatively balanced VRAM usage (17-20GB per GPU) suggests even distribution of MoE experts across devices.
Total VRAM usage: 74.6GB out of 96GB (78%). Model weights account for 53GB; remaining ~22GB handles KV cache and overhead.
Configuration Details
Startup Script
```bash
#!/bin/bash
MODEL_PATH="/home/tomwest/models/qwen3-coder-next/Qwen3-Coder-Next-UD-Q5_K_XL-00001-of-00002.gguf"
LLAMA_SERVER="$HOME/llama.cpp/build/bin/llama-server"
LOG_FILE="$HOME/models/qwen3-coder-next.log"
nohup "$LLAMA_SERVER" \
  -m "$MODEL_PATH" \
  --host 0.0.0.0 \
  --port 8000 \
  -ngl 999 \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.01 \
  --top-k 40 \
  --jinja \
  "$@" > "$LOG_FILE" 2>&1 &
```
Key Parameters
- `-ngl 999`: Load all layers to GPU for maximum throughput
- `--temp 1.0`: Default for general coding tasks (per Unsloth recommendations)
- `--top-p 0.95`: Nucleus sampling, standard for general tasks
- `--min-p 0.01`: Critical, since llama.cpp's default (0.05) is too high for this model
- `--top-k 40`: Standard top-k sampling
- `--jinja`: Enable Jinja templating for chat formatting
Recommended Parameters
- Coding/general use: `--temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40`
- Tool-calling: `--temp 0.7 --top-p 1.0 --min-p 0.01 --top-k 40`
- Maximize creativity: `--temp 1.0 --top-p 0.99 --min-p 0.01 --top-k 100`
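These presets can also be applied per request rather than at launch: llama.cpp's server exposes an OpenAI-compatible endpoint, and request-level sampling fields override the launch defaults. A quick smoke test (the prompt is just an example):

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "temperature": 1.0,
        "top_p": 0.95
      }'
```

This is handy for switching between the coding and tool-calling presets without restarting the server.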
Performance Benchmarks
Qwen3-Coder-Next on Public Benchmarks
According to Unsloth's benchmarks, Qwen3-Coder-Next performs impressively across multiple coding tasks:
| Benchmark | Qwen3-Coder-Next | DeepSeek-V3.2 | GLM-4.7 | MiniMax M2.1 |
|---|---|---|---|---|
| SWE-Bench Verified | 70.6 | 70.2 | 74.2 | 74.8 |
| SWE-Bench Multilingual | 62.8 | 62.3 | 63.7 | 66.2 |
| SWE-Bench Pro | 44.3 | 40.9 | 40.6 | 34.6 |
| Terminal-Bench 2.0 | 36.2 | 39.3 | 37.1 | 32.6 |
| Aider | 66.2 | 69.9 | 52.1 | 61.0 |
The MoE Advantage
Qwen3-Coder-Next demonstrates the core benefit of Mixture of Experts architecture: you get the knowledge encoded in 80B parameters while only computing through ~3B per token. This balance of model size and compute cost enables impressive performance while keeping the model runnable on consumer hardware. Despite having 80B total parameters (2.6x more than GLM-4.7-Flash's 30B), the model only requires marginally more VRAM because only a fraction of parameters are active per token.
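The arithmetic behind that tradeoff is easy to sketch. The 5.5 bits/weight average for Q5_K_XL is a rough assumption (K-quants mix bit widths across tensor types), but it lands close to the observed 53GB file:

```python
# Back-of-envelope MoE arithmetic using the figures from this post
TOTAL_PARAMS = 80e9     # total parameters
ACTIVE_PARAMS = 3e9     # ~active parameters per token
BITS_PER_WEIGHT = 5.5   # rough Q5_K_XL average (assumption)

weights_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
compute_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"approx. weight footprint: {weights_gb:.0f} GB")  # in the ballpark of the 53 GB file
print(f"active compute per token: {compute_fraction:.1%} of a dense 80B forward pass")
```

In other words, the VRAM bill scales with total parameters while the per-token compute bill scales with active parameters, which is exactly why an 80B MoE can hit 60+ TPS on four 3090s.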
VRAM Optimization Recommendations
If you need to free up VRAM for longer contexts or additional concurrent sessions:
Reduce Context Size
Add --ctx-size 32768 to limit context to 32K tokens (default is 256K). This can save several GB of VRAM by reducing the KV cache allocation.
KV Cache Quantization
For longer contexts, quantize the KV cache to reduce VRAM usage:
- `--cache-type-k q4_1`: 4-bit K cache quantization
- `--cache-type-v q4_1`: 4-bit V cache quantization (requires Flash Attention build)
MoE Layer Offloading
Offload MoE experts to CPU to save VRAM at the cost of speed:
- `-ot ".ffn_.*_exps.=CPU"`: Offload all MoE experts to CPU
- `-ot ".ffn_(up|down)_exps.=CPU"`: Offload up/down projection experts
- `-ot ".ffn_(up)_exps.=CPU"`: Offload only up projection experts
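Putting these options together, a reduced-VRAM launch might look like the following sketch. It reuses the variables from the startup script above; the exact savings depend on your context length and which expert tensors you offload, so treat this as a starting point rather than a tuned configuration:

```shell
# Hypothetical low-VRAM variant of the startup script
MODEL_PATH="/home/tomwest/models/qwen3-coder-next/Qwen3-Coder-Next-UD-Q5_K_XL-00001-of-00002.gguf"
LLAMA_SERVER="$HOME/llama.cpp/build/bin/llama-server"

"$LLAMA_SERVER" \
  -m "$MODEL_PATH" \
  -ngl 999 \
  --ctx-size 32768 \
  --cache-type-k q4_1 \
  -ot ".ffn_(up|down)_exps.=CPU"
```

Expect a throughput hit from the CPU-side experts; offloading only the up projections is the gentler middle ground.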
Alternative Deployment Options
For different use cases, consider these frameworks:
vLLM
Best for production deployments with high throughput needs:
```bash
vllm serve unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL \
  --max-model-len 256000 \
  --tool-call-parser qwen3_coder
```
SGLang
Excellent for structured generation and efficient serving:
```bash
python3 -m sglang.launch_server \
  --model-path unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL \
  --max-total-tokens 256000 \
  --tool-call-parser qwen3_coder
```
Would I Recommend This?
TL;DR: absolutely, yes. Qwen3-Coder-Next is one of the most capable coding models I've run locally. It's fast, produces high-quality code, and handles tool use well, and the MoE architecture gives it a good balance of skill and speed.
The 61-63 TPS throughput is notable for an 80B model, which is a result of having only ~3B active parameters per token. The quad-RTX-3090 setup provides more than enough VRAM to load the entire model, and the even distribution across GPUs suggests efficient load balancing.
The configuration is also well-suited to multi-user scenarios, since the llama.cpp server supports continuous batching of concurrent requests. You'll need to monitor VRAM usage during long-running sessions with many concurrent users, as the KV cache grows with sequence length. For small personal coding tasks, however, you could get away with a 32K-64K context if you're already stretched for VRAM.
Overall, Qwen3-Coder-Next represents a solid option for local LLM inference, offering strong coding performance at an accessible model size. The MoE architecture brings usable performance even to older hardware like RTX 3090s. Our earlier GLM-4.7-Flash results showed the promise of small models; Qwen3-Coder-Next is a bit larger and offers stronger quality, and it is now a daily driver alongside GLM-4.7-Flash. Small models keep improving, which strengthens the case for buying a GPU (shoutout Ahmad) and running the model yourself.