This post documents our experience running Qwen3-Coder-Next, an 80B Mixture of Experts model, on a quad RTX 3090 system. We investigate VRAM distribution across multiple GPUs, benchmark token generation throughput, and document the configuration needed to run this powerful coding model locally.
Hardware and Model
System Specs
- GPUs: 4x NVIDIA RTX 3090 (96GB VRAM total)
- CPU: AMD Threadripper 5995WX
- RAM: 64GB DDR4 (all model data in VRAM)
- PCIe: Full PCIe 4.0 x16 bandwidth on all GPUs
- Framework: llama.cpp server
Model Details
Qwen3-Coder-Next is a Mixture of Experts model designed for agentic coding tasks:
| Parameter | Value |
|---|---|
| Total Parameters | 80B |
| Active Parameters | ~3B per token |
| Quantization | Q5_K_XL from Unsloth's HuggingFace page |
| File Size | 53.02 GB (2 GGUF chunks) |
| Context Length | 262,144 tokens (supports up to 1M) |
| Framework Support | llama.cpp, vLLM, SGLang, transformers |
Why Qwen3-Coder-Next?
Qwen3-Coder-Next is an 80B MoE model with performance comparable to models 10-20x larger. It excels at long-horizon reasoning, complex tool use, and recovery from execution failures. What makes it particularly impressive is that it requires only ~46GB VRAM to run, making it feasible on consumer hardware. For comparison, our dual-RTX-3090 setup (48GB) can run GLM-4.7-Flash (30B), but Qwen3-Coder-Next (80B) requires our full quad-GPU setup due to its larger parameter count, even though the active parameters per token are similar.
Q5_K_XL Quantization
Unsloth's Q5_K_XL provides an excellent balance of quality and speed. The model is non-reasoning (it emits no `<think></think>` blocks), enabling quick code responses. Q4_K_XL is smaller (fewer bits per weight) and would use less VRAM, but Q5_K_XL delivers better quality for only ~7GB more storage. Given our quad-RTX-3090 setup has ample VRAM, we chose Q5_K_XL for the stronger quality-to-speed tradeoff.
Benchmark Results
Token Generation Speed: 61-63 tokens/second
(Average over 5 runs, ~1080 character prompt generating ~100 tokens per run)
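The throughput numbers can be reproduced with a small client that times generations against the server's HTTP API. Below is a minimal sketch, assuming the server is listening on localhost:8000; the `/completion` endpoint and the `tokens_predicted` response field are llama.cpp server conventions, so check your build's docs if the names differ:

```python
import json
import time
import urllib.request

SERVER = "http://localhost:8000"  # host/port from the startup script

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in generated tokens per second."""
    return n_tokens / elapsed_s

def run_once(prompt: str, n_predict: int = 100) -> float:
    """Time one generation against the server's /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        f"{SERVER}/completion", data=body,
        headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    elapsed = time.monotonic() - start
    # The response reports how many tokens were actually generated
    return tokens_per_second(data["tokens_predicted"], elapsed)

# Example driver: average over 5 runs with a ~1KB prompt
# runs = [run_once("Explain this function in detail. " * 33) for _ in range(5)]
# print(f"avg TPS: {sum(runs) / len(runs):.2f}")
```

Note this measures wall-clock time including prompt processing; the server's own log reports generation-only timings if you want to separate the two.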
Performance Characteristics
The model shows fairly consistent performance, with per-run throughput ranging from 53.24 to 63.09 TPS and most runs landing in the 61-63 TPS band. This suggests stable inference on the current hardware configuration. With full PCIe 4.0 x16 bandwidth on all 4 GPUs and 64GB of CPU RAM (sufficient since all model data resides in VRAM), the Threadripper platform provides an excellent foundation for MoE models. The benchmark used -ngl 999 (all layers on GPU), which keeps throughput optimal.
Memory Analysis
The Qwen3-Coder-Next model weights total 53GB. With 4 GPUs in use, VRAM is distributed as follows (from nvidia-smi):
| GPU | Memory Usage | Total VRAM | Usage % |
|---|---|---|---|
| GPU 0 | 19.1 GB | 24.6 GB | 77% |
| GPU 1 | 17.6 GB | 24.6 GB | 72% |
| GPU 2 | 19.6 GB | 24.6 GB | 80% |
| GPU 3 | 18.3 GB | 24.6 GB | 74% |
| Total | 74.6 GB | 96.0 GB | 78% |
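The per-GPU figures above come from `nvidia-smi`; its query mode makes them easy to collect and total without scraping the default table output:

```shell
# Per-GPU memory snapshot while the server is running (values in MiB)
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits

# Sum used MiB across all GPUs and convert to GB
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits \
  | awk '{sum += $1} END {printf "total used: %.1f GB\n", sum / 1024}'
```

Running the second command during a long-context session is a quick way to watch KV cache growth.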
Memory Distribution Strategy
With 53GB of model weights and 4x 24GB GPUs, we have more than enough VRAM to load the entire model. The llama.cpp server automatically distributes layers across available GPUs when -ngl 999 is specified. The relatively balanced VRAM usage (17-20GB per GPU) suggests even distribution of MoE experts across devices.
Total VRAM usage: 74.6GB out of 96GB (78%). Model weights account for 53GB; remaining ~22GB handles KV cache and overhead.
Configuration Details
Startup Script
```bash
#!/bin/bash
MODEL_PATH="/home/tomwest/models/qwen3-coder-next/Qwen3-Coder-Next-UD-Q5_K_XL-00001-of-00002.gguf"
LLAMA_SERVER="$HOME/llama.cpp/build/bin/llama-server"
LOG_FILE="$HOME/models/qwen3-coder-next.log"
nohup "$LLAMA_SERVER" \
  -m "$MODEL_PATH" \
  --host 0.0.0.0 \
  --port 8000 \
  -ngl 999 \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.01 \
  --top-k 40 \
  --jinja \
  "$@" > "$LOG_FILE" 2>&1 &
```
Key Parameters
- `-ngl 999`: Load all layers to GPU for maximum throughput
- `--temp 1.0`: Default for general coding tasks (per Unsloth recommendations)
- `--top-p 0.95`: Nucleus sampling, standard for general tasks
- `--min-p 0.01`: Critical, since llama.cpp's default (0.05) is too high for this model
- `--top-k 40`: Standard top-k sampling
- `--jinja`: Enable Jinja templating for chat formatting
Recommended Parameters
- Coding/general use: `--temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40`
- Tool-calling: `--temp 0.7 --top-p 1.0 --min-p 0.01 --top-k 40`
- Maximize creativity: `--temp 1.0 --top-p 0.99 --min-p 0.01 --top-k 100`
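These presets can also be applied per request rather than at launch: llama.cpp's server exposes an OpenAI-compatible endpoint, and request-level sampling fields override the launch defaults. A quick smoke test (the prompt is just an example):

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "temperature": 1.0,
        "top_p": 0.95
      }'
```

This is handy for switching between the coding and tool-calling presets without restarting the server.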
Performance Benchmarks
Qwen3-Coder-Next on Public Benchmarks
According to Unsloth's benchmarks, Qwen3-Coder-Next performs impressively across multiple coding tasks:
| Benchmark | Qwen3-Coder-Next | DeepSeek-V3.2 | GLM-4.7 | MiniMax M2.1 |
|---|---|---|---|---|
| SWE-Bench Verified | 70.6 | 70.2 | 74.2 | 74.8 |
| SWE-Bench Multilingual | 62.8 | 62.3 | 63.7 | 66.2 |
| SWE-Bench Pro | 44.3 | 40.9 | 40.6 | 34.6 |
| Terminal-Bench 2.0 | 36.2 | 39.3 | 37.1 | 32.6 |
| Aider | 66.2 | 69.9 | 52.1 | 61.0 |
The MoE Advantage
Qwen3-Coder-Next demonstrates the core benefit of Mixture of Experts architecture: you get the knowledge encoded in 80B parameters while only computing through ~3B per token. This balance of model size and compute cost enables impressive performance while keeping the model runnable on consumer hardware. Despite having 80B total parameters (2.6x more than GLM-4.7-Flash's 30B), the model only requires marginally more VRAM because only a fraction of parameters are active per token.
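The arithmetic behind that tradeoff is easy to sketch. The 5.5 bits/weight average for Q5_K_XL is a rough assumption (K-quants mix bit widths across tensor types), but it lands close to the observed 53GB file:

```python
# Back-of-envelope MoE arithmetic using the figures from this post
TOTAL_PARAMS = 80e9     # total parameters
ACTIVE_PARAMS = 3e9     # ~active parameters per token
BITS_PER_WEIGHT = 5.5   # rough Q5_K_XL average (assumption)

weights_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
compute_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"approx. weight footprint: {weights_gb:.0f} GB")  # in the ballpark of the 53 GB file
print(f"active compute per token: {compute_fraction:.1%} of a dense 80B forward pass")
```

In other words, the VRAM bill scales with total parameters while the per-token compute bill scales with active parameters, which is exactly why an 80B MoE can hit 60+ TPS on four 3090s.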
VRAM Optimization Recommendations
If you need to free up VRAM for longer contexts or additional concurrent sessions:
Reduce Context Size
Add --ctx-size 32768 to limit context to 32K tokens (default is 256K). This can save several GB of VRAM by reducing the KV cache allocation.
KV Cache Quantization
For longer contexts, quantize the KV cache to reduce VRAM usage:
- `--cache-type-k q4_1`: 4-bit K cache quantization
- `--cache-type-v q4_1`: 4-bit V cache quantization (requires Flash Attention build)
MoE Layer Offloading
Offload MoE experts to CPU to save VRAM at the cost of speed:
- `-ot ".ffn_.*_exps.=CPU"`: Offload all MoE experts to CPU
- `-ot ".ffn_(up|down)_exps.=CPU"`: Offload up/down projection experts
- `-ot ".ffn_(up)_exps.=CPU"`: Offload only up projection experts
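Putting these options together, a reduced-VRAM launch might look like the following sketch. It reuses the variables from the startup script above; the exact savings depend on your context length and which expert tensors you offload, so treat this as a starting point rather than a tuned configuration:

```shell
# Hypothetical low-VRAM variant of the startup script
MODEL_PATH="/home/tomwest/models/qwen3-coder-next/Qwen3-Coder-Next-UD-Q5_K_XL-00001-of-00002.gguf"
LLAMA_SERVER="$HOME/llama.cpp/build/bin/llama-server"

"$LLAMA_SERVER" \
  -m "$MODEL_PATH" \
  -ngl 999 \
  --ctx-size 32768 \
  --cache-type-k q4_1 \
  -ot ".ffn_(up|down)_exps.=CPU"
```

Expect a throughput hit from the CPU-side experts; offloading only the up projections is the gentler middle ground.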
Alternative Deployment Options
For different use cases, consider these frameworks:
vLLM
Best for production deployments with high throughput needs:
```bash
vllm serve unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL \
  --max-model-len 256000 \
  --tool-call-parser qwen3_coder
```
SGLang
Excellent for structured generation and efficient serving:
```bash
python3 -m sglang.launch_server \
  --model-path unsloth/Qwen3-Coder-Next-GGUF:UD-Q5_K_XL \
  --max-total-tokens 256000 \
  --tool-call-parser qwen3_coder
```
Would I Recommend This?
TL;DR: absolutely, yes. Qwen3-Coder-Next is one of the most capable coding models I've run locally. It's fast, produces high-quality code, and handles tool use well, and the MoE architecture gives it a good balance of skill and speed.
The 61-63 TPS throughput is notable for an 80B model, which is a result of having only ~3B active parameters per token. The quad-RTX-3090 setup provides more than enough VRAM to load the entire model, and the even distribution across GPUs suggests efficient load balancing.
The configuration is also well-suited to multi-user scenarios, since the llama.cpp server supports continuous batching of concurrent requests. You'll need to monitor VRAM usage during long-running sessions with many concurrent users, as the KV cache grows with sequence length. For small personal coding tasks, however, you could get away with a 32K-64K context if you're already stretched for VRAM.
Overall, Qwen3-Coder-Next represents a solid option for local LLM inference, offering strong coding performance at an accessible model size. The MoE architecture brings usable performance even to older hardware like RTX 3090s. Our earlier GLM-4.7-Flash results showed the promise of small models; Qwen3-Coder-Next is a bit larger and offers stronger quality, and it is now a daily driver alongside GLM-4.7-Flash. Small models keep improving, which strengthens the case for buying a GPU (shoutout Ahmad) and running the model yourself.