NOTE: This post documents our journey with Qwen3.5-397B-A17B in UD-TQ1_0 (1-bit) quantization. After successfully running MiniMax M2.5 and GLM-4.7, we pushed further into extreme quantization territory with this 397B-parameter MoE model.
The Backstory
The 1-Bit Experiment
After proving that 2-bit quantization (IQ2_M) worked beautifully for MiniMax M2.1 on quad 3090s, and that GLM-4.7 at 1-bit (IQ1_M) was viable on sextet 3090s, the next logical step was clear: push the largest possible model into our 144GB VRAM. Qwen3.5-397B-A17B in UD-TQ1_0 format is the largest model we've attempted yet, compressing 397B parameters into ~88GB.
The Compatibility Challenge
Initial attempts with llama.cpp build b7991 hit a critical Jinja template crash on tool calls: the model's chat template wasn't recognized, llama.cpp fell back to the Hermes 2 Pro format, and every tool call turned into a loop of 500 errors (details in the compatibility section below). Rebuilding at b8137 with the proper Qwen3.5 support commits fixed everything.
Hardware and Model
System Specs
- GPUs: 6x NVIDIA RTX 3090 (24GB VRAM each, 144GB total)
- CPU: AMD Threadripper 5995WX
- RAM: 64GB DDR4-3200 ECC
- PCIe: Full 4.0 16x speed on all GPUs
- Framework: llama.cpp server (build b8137+)
Model Details
Qwen3.5-397B-A17B is the largest Mixture of Experts model we've run locally:
| Parameter | Value |
|---|---|
| Total Parameters | 397B |
| Active Parameters per Token | ~17B (sparse MoE, ~4.3% active) |
| Experts | 512 routed + 1 shared (10 routed per token) |
| Quantization | UD-TQ1_0 from Unsloth's HuggingFace page |
| File Size | ~88GB (single GGUF file) |
| Max Context | 262,144 tokens (native) |
| Architecture | Hybrid Gated DeltaNet + Sparse MoE |
| Layers | 61 |
| Framework Support | llama.cpp, vLLM, SGLang, transformers |
Why UD-TQ1_0?
Unsloth's UD-TQ1_0 (Unsloth Dynamic 1-bit) quantization compresses the model to ~88GB. It is the most aggressive quantization that still produces usable results:
- IQ1_M variants: Multiple 1-bit options exist, but UD-TQ1_0 provides the best quality/size balance
- 88GB model: Leaves ~56GB headroom for KV cache and context
- 262K context: Full native context with q8_0 KV cache quantization
The UD-TQ1_0 method uses dynamic per-layer quantization, allocating more bits to critical layers while keeping the overall average under 2 bits per weight (~88GB for 397B parameters works out to roughly 1.8 bits). This is more sophisticated than a uniform 1-bit approach.
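For a rough sense of how per-layer bit allocation still lands near the advertised file size, here's a back-of-the-envelope sketch. The split between higher-precision and bulk weights below is invented for illustration and is not Unsloth's actual recipe:

```python
# Back-of-the-envelope GGUF size estimate for a dynamically quantized model.
# The fraction/bit-width split below is illustrative only, not Unsloth's recipe.
TOTAL_PARAMS = 397e9

layers = [
    # (fraction of weights, bits per weight)
    (0.08, 4.0),   # assumed: embeddings, attention, router kept at higher precision
    (0.92, 1.55),  # assumed: bulk of the MoE expert weights near ternary precision
]

avg_bits = sum(frac * bits for frac, bits in layers)
size_gb = TOTAL_PARAMS * avg_bits / 8 / 1e9

print(f"average bits/weight: {avg_bits:.2f}")    # ~1.75
print(f"estimated file size: {size_gb:.0f} GB")  # ~87 GB, the same ballpark as the ~88GB GGUF
```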
llama.cpp Compatibility
The Jinja Template Crash
Build b7991 (57487a64c) initially loaded and ran the model but hit a critical bug on tool calls:
Error: Unknown (built-in) filter 'items' for type String
The model's embedded chat template wasn't recognized, causing llama.cpp to fall back to Hermes 2 Pro format. When the model generated tool calls, the Hermes template tried to call `tool_call.arguments|items` on a JSON string (not a dict), producing 500 errors in a loop.
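For intuition, here's what that class of failure looks like in plain Python Jinja2. llama.cpp uses its own built-in C++ Jinja engine, so the exact error text differs, but the root cause is the same: the `items` filter expects a mapping, not a JSON string. The template line is a simplified stand-in for the Hermes-style tool-call loop:

```python
from jinja2 import Environment

# Simplified stand-in for the Hermes-style template, which iterates over
# tool-call arguments as key/value pairs.
env = Environment()
tmpl = env.from_string(
    "{% for k, v in tool_call.arguments | items %}{{ k }}={{ v }} {% endfor %}"
)

# Works when arguments is a dict...
print(tmpl.render(tool_call={"arguments": {"path": "/tmp/a.txt", "mode": "r"}}))

# ...but fails when arguments is a JSON *string*, which is the failure llama.cpp
# surfaced as: Unknown (built-in) filter 'items' for type String
try:
    print(tmpl.render(tool_call={"arguments": '{"path": "/tmp/a.txt"}'}))
except TypeError as exc:
    print(f"template failed: {exc}")
```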
The Fix: Rebuild at b8137
Key commits between b7991 and b8137 that resolved our issues:
| Commit | Description |
|---|---|
| da38c9df | models: fix qwen3.5 beta/gate shapes (#19730) |
| 39e4b1dc9 | common: fix gpt-oss Jinja error when assistant message has both content and thinking with tool calls (#19704) |
| 5452d736f | jinja: correct stats for tojson and string filters (#19785) |
| 94b0200a0 | common: merge qwen3-coder and nemotron nano 3 parsers (#19765) |
After rebuilding, the model's chat template is detected natively with thinking mode enabled. No more Hermes 2 Pro fallback, no more 500 errors.
Build Configuration
cmake /home/tomwest/llama.cpp -B /home/tomwest/llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build /home/tomwest/llama.cpp/build --config Release -j$(nproc) --clean-first
GGML_CUDA_FA_ALL_QUANTS=ON compiles Flash Attention kernels for all KV cache quantization types (q8_0, q4_0, etc.), not just f16/f32. This is required for efficient inference when using quantized KV cache.
Configuration Details
Startup Script
#!/bin/bash
MODEL_PATH="/home/tomwest/models/qwen-3.5/Qwen3.5-397B-A17B-UD-TQ1_0.gguf"
LLAMA_SERVER="$HOME/llama.cpp/build/bin/llama-server"
LOG_FILE="$HOME/models/qwen-3.5.log"
nohup "$LLAMA_SERVER" \
-m "$MODEL_PATH" \
--host 0.0.0.0 \
--port 8000 \
-ngl 999 \
-c 262144 \
--parallel 1 \
-ctk q8_0 \
-ctv q8_0 \
--split-mode layer \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--jinja \
"$@" Key Parameters
- `-ngl 999`: Offload all 61 layers to GPU
- `-c 262144`: Full native 262K context window
- `--parallel 1`: Single slot, so the entire KV cache budget goes to one user
- `-ctk q8_0 -ctv q8_0`: Quantize KV cache keys and values (halves KV memory)
- `--split-mode layer`: Layer-wise GPU split (equal distribution)
- `--temp 0.6 --top-p 0.95 --top-k 20`: Recommended sampling settings for thinking mode
- `--jinja`: Use the native Jinja template engine (required for Qwen3.5)
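With the server listening on port 8000, any OpenAI-compatible client can talk to it. Here's a minimal sketch using only the Python standard library; the prompt is a placeholder, and the model name is largely cosmetic since llama-server serves whichever single model it was launched with:

```python
import json
import urllib.request

# Minimal chat completion request against the local llama-server
# (OpenAI-compatible /v1/chat/completions endpoint on --port 8000).
payload = {
    "model": "Qwen3.5-397B",  # placeholder; a single-model llama-server serves its loaded model
    "messages": [{"role": "user", "content": "Summarize the benefits of MoE models."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])
```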
KV Cache Quantization
The 88GB model across 144GB VRAM leaves ~56GB for KV cache. At full f16 precision, 262K context would need more than that. Using -ctk q8_0 -ctv q8_0 halves the KV memory footprint while remaining near-lossless, making 262K context fit comfortably.
The GGML_CUDA_FA_ALL_QUANTS=ON build flag is essential—without it, Flash Attention only works with f16/f32 KV types, so q8_0 KV would fall back to a slower non-FA path.
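As a rough illustration of the arithmetic, here's a sketch of KV cache sizing. The head count, head dimension, and the assumption that all 61 layers keep a full KV cache are placeholders rather than Qwen3.5's real configuration (the Gated DeltaNet layers keep a fixed-size state instead of a KV cache, which shrinks the real figure), but the 2x saving from f16 to q8_0 is the point:

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * context * bytes/element.
# All architecture numbers here are illustrative placeholders, not Qwen3.5's real config.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int,
                bytes_per_elem: float) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9

CTX = 262_144
LAYERS, KV_HEADS, HEAD_DIM = 61, 8, 128  # assumed; also assumes every layer keeps a KV cache

f16  = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, CTX, 2.0)     # 2 bytes per element
q8_0 = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, CTX, 1.0625)  # ~8.5 bits per element

print(f"f16  KV cache @ 262K: {f16:.0f} GB")   # ~65 GB: would not fit next to 88GB of weights
print(f"q8_0 KV cache @ 262K: {q8_0:.0f} GB")  # ~35 GB: roughly half; DeltaNet layers reduce it further
```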
Benchmark Results
We ran comprehensive benchmarks across context lengths (4K to 262K) at 350W power limit per GPU. Here are the results:
| Context | Prefill (t/s) | Decode (t/s) | TTFT (ms) | Power (W) | Decode (tokens/kWh) |
|---|---|---|---|---|---|
| 4K | 328.10 | 42.71 | 1,699.63 | 849.84 | 180,955 |
| 8K | 343.79 | 42.08 | 2,885.53 | 852.26 | 177,753 |
| 16K | 337.80 | 41.40 | 4,229.28 | 852.63 | 174,788 |
| 32K | 332.02 | 39.83 | 6,960.84 | 862.46 | 166,270 |
| 64K | 318.06 | 37.15 | 15,889.99 | 923.60 | 144,831 |
| 131K | 300.99 | 32.78 | 25,217.52 | 862.79 | 136,787 |
| 262K | 269.61 | 26.49 | 57,740.14 | 882.12 | 108,150 |
Performance Characteristics
Prefill speed peaks at 8K context (343.79 t/s) and declines gradually at longer contexts, reaching 269.61 t/s at 262K. Decode speed follows the same trend, dropping from 42.71 t/s at 4K to 26.49 t/s at 262K, a decline of roughly 38% across the full context range.
Power Consumption
Power draw rises with context length, from ~850W at 4K/8K to a peak of ~924W at 64K, then settles around 860-880W at 131K-262K. Efficiency declines steadily as contexts grow, since each decoded token requires more attention work over the cache: decode efficiency drops from 180,955 tokens/kWh at 4K to 108,150 tokens/kWh at 262K.
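The decode tokens/kWh column follows directly from decode throughput and average power (tokens per second, scaled to an hour, divided by kilowatts); checking two rows reproduces the table to within rounding:

```python
# The decode tokens/kWh column is throughput divided by average power draw.
def tokens_per_kwh(tokens_per_sec: float, watts: float) -> float:
    return tokens_per_sec * 3600 / (watts / 1000)

print(f"{tokens_per_kwh(42.71, 849.84):,.0f} tokens/kWh")  # ~180,923 (table: 180,955)
print(f"{tokens_per_kwh(26.49, 882.12):,.0f} tokens/kWh")  # ~108,107 (table: 108,150)
```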
Performance Visualization
The following charts visualize the performance characteristics across different context lengths:
Prefill Speed vs Context Length
Prefill phase performance across context lengths (4K to 262K). Four panes: throughput vs context, efficiency vs context, energy efficiency, and power range.
Decode Speed vs Context Length
Decode phase performance across context lengths (4K to 262K). Six panes: throughput vs context, efficiency vs context, energy efficiency, power range, TTFT, and efficiency vs power.
Context Scaling Performance
Despite being a 397B model, Qwen3.5 maintains reasonable throughput thanks to the MoE architecture—only ~17B parameters are active per token. The progressive decline in both prefill and decode speeds at longer contexts is expected and manageable for most use cases.
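The sketch below shows the mechanism that makes this possible: for each token, only the top-k routed experts are selected, so the vast majority of the 397B weights are never touched for that token. The expert counts mirror the model card, but the random scores stand in for the learned router, so this is purely illustrative:

```python
import random

# Toy sketch of top-k MoE routing. Expert counts mirror the model card
# (512 routed experts, 10 routed + 1 shared active per token); the random
# scores stand in for a learned router, so this is illustrative only.
N_ROUTED_EXPERTS = 512
TOP_K = 10

def route_token() -> list[int]:
    # One router score per routed expert; keep only the top-k expert indices.
    scores = [random.random() for _ in range(N_ROUTED_EXPERTS)]
    return sorted(range(N_ROUTED_EXPERTS), key=scores.__getitem__, reverse=True)[:TOP_K]

active = route_token()
frac = (len(active) + 1) / (N_ROUTED_EXPERTS + 1)  # +1 for the always-on shared expert
print(f"experts used this token: {len(active)} routed + 1 shared "
      f"(~{frac:.1%} of expert weights touched)")
```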
Memory Analysis
The UD-TQ1_0 quantized model is ~88GB, fitting in 144GB VRAM with room for KV cache:
| Component | Memory |
|---|---|
| Model Weights | ~88GB |
| KV Cache (q8_0, 262K) | ~20-25GB estimated |
| Available Headroom | ~30-40GB |
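For anyone wanting to verify the split on their own box while the server is loaded, per-GPU usage can be read from nvidia-smi; a small wrapper sketch (assumes the NVIDIA driver's nvidia-smi is on PATH):

```python
import subprocess

# Query per-GPU memory usage in MiB via nvidia-smi (must be on PATH).
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    text=True,
)

total_used_mib = 0
for line in out.strip().splitlines():
    idx, used, total = (field.strip() for field in line.split(","))
    total_used_mib += int(used)
    print(f"GPU {idx}: {used} / {total} MiB")

print(f"all GPUs: {total_used_mib / 1024:.1f} GiB used")
```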
Key Configuration: KV Cache Quantization
Getting 262K context to fit required quantizing the KV cache. By default, llama-server stores the KV cache at full precision, which would not fit alongside the 88GB of weights at this context length. Using -ctk q8_0 -ctv q8_0 halves the KV cache memory with negligible quality impact, which is what makes 262K context workable.
opencode Integration
Added to ~/.config/opencode/opencode.json under the llama-cpp provider:
{
"Qwen3.5-397B-A17B-UD-TQ1_0.gguf": {
"name": "Qwen3.5-397B",
"contextLength": 262144,
"supportsStructuredOutputs": false
}
}
Performance Benchmarks
Context sweep benchmark results at 350W/GPU:
Prefill Phase
Prefill throughput peaks at smaller context lengths and gradually declines:
- 4K-8K: ~330-345 tokens/sec
- 16K-32K: ~330-340 tokens/sec
- 64K-131K: ~300-320 tokens/sec
- 262K: ~270 tokens/sec
Decode Phase
Decode throughput shows similar patterns:
- 4K-8K: ~42-43 tokens/sec
- 16K-32K: ~40-41 tokens/sec
- 64K-131K: ~33-37 tokens/sec
- 262K: ~26-27 tokens/sec
Power Efficiency
Despite 397B total parameters, the model draws roughly 850-925W across all six GPUs during operation (at a 350W per-GPU power limit). The MoE architecture means only ~17B parameters are active per token, keeping power consumption reasonable for the model size.
Would I Recommend This?
TL;DR: Absolutely—if you need the largest possible model. Qwen3.5-397B-A17B in UD-TQ1_0 proves that 1-bit quantization is viable for massive MoE models. After the Jinja template compatibility issues were resolved, the model runs rock-solid: no crashes, no slowdowns, just reliable inference at 26-43 tokens/sec depending on context length.
The 397B parameter count is the largest we've attempted yet, and the MoE architecture keeps it tractable on consumer hardware. With only ~17B active parameters per token, it computes efficiently despite the massive total size. The 262K context window is fully utilized, making this ideal for long-context work.
The Bottom Line: 1-bit quantization works. UD-TQ1_0 compression preserves enough quality to fit a 397B model into 144GB of VRAM and still get useful output. After proving 2-bit (MiniMax) and 1-bit (GLM-4.7) quantizations viable, Qwen3.5 pushes the boundary further, and both the output quality and the generation rate are good enough for a daily driver.
More broadly, Qwen3.5-397B-A17B UD-TQ1_0 on the sextet of RTX 3090s shows how much the MoE architecture changes the economics of local inference. With 397B total parameters and only ~17B active per token, you can run a model locally that would otherwise require data center GPUs. And with the native Jinja template in place, the configuration has been rock-solid: no crashes, no slowdowns, just fast, reliable inference on consumer hardware.