LLM Garage

Home Engineer's AI Hardware Journal


Running GLM-4.7 (IQ1_M) on Six RTX 3090s

1-bit quantization: 358B model in 144GB VRAM
February 2026

The Genie: GLM-4.7 1-bit

"Phenomenal cosmic power... itty bitty living space."

To follow up on our test and review of MiniMax M2.x 2-bit, I figured I'd take this to its logical conclusion and find the most capable model I could stuff into my six RTX 3090s' VRAM. Unfortunately, Kimi K2.5 was still too large, but we've enjoyed GLM-4.7-Flash so much that I figured GLM-4.7 was worth a try.

In order to do this, we picked a really big model and chose the 1-bit quantization. The specific model is UD_IQ1_M from Unsloth's HuggingFace page. The model clocks in at ~101GB, which doesn't leave a lot of room for everything else with our total VRAM at 144GB. We had to quantize the KV cache to Q8_0, and we're only using 128K context because of out-of-memory crashes at 180K.

In all, it's a very good model. I am satisfied with the quality despite the heavy quantization, and I also can tolerate the tokens per second, although it is meaningfully slower than MiniMax M2.5 or Qwen3-Coder-Next.

Our testing so far consists of the performance baselines presented here, writing this blog post, and creating my meal plan and grocery list for the week. Generally, I'd say it passed. Enjoy.

Predominantly written by GLM-4.7 using Opencode.

Hardware and Model

System Specs

GPUs 6x NVIDIA RTX 3090 (24 GB each)
Total VRAM 144 GB
GPU Power Limit 350 W per GPU

Model Details

GLM-4.7 is a massive Mixture of Experts model with the following architecture:

Parameter Value
Total Parameters 358B
Experts 160 routed + 1 shared, 8 active per token (~5%)
Active Parameters per Token ~40-50B estimated
Quantization IQ1_M from Unsloth's HuggingFace page
File Size ~101GB (3 split files)
Max Context 202,752 tokens
Layers 92
Attention 96 heads, 8 KV heads (GQA), 128 head dim
Framework Support vLLM, SGLang, llama.cpp, transformers

The Quantization Question

IQ1_M is the largest of the 1-bit quantization variants available in GGUF format (smaller 1-bit variants exist). The weights are compressed to roughly 1.75 bits per parameter on average, plus a small amount of metadata, with some sensitive layers kept at higher precision. This is an extreme test of whether the knowledge encoded in 358B parameters can survive such heavy compression while maintaining usable quality.
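As a sanity check on the compression ratio, we can back out the effective bits per parameter from the file size. This is a back-of-envelope sketch; it lands above the nominal 1-bit rate because Unsloth's dynamic quants keep some layers at higher precision:

```python
# Back-of-envelope: effective bits per parameter of the IQ1_M file.
file_size_bytes = 101e9      # ~101 GB on disk (3 split GGUF files)
total_params = 358e9         # 358B total parameters

bits_per_param = file_size_bytes * 8 / total_params
print(f"{bits_per_param:.2f} bits/param")  # ~2.26
```

So "1-bit" is really ~2.3 bits per weight effective, still far below the ~4.5 bits of a typical Q4 quant.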

Context Length Trade-offs

Despite 144GB of total VRAM, we couldn't achieve 180K context with the Q8_0 KV cache - we hit OOM at 180K. We reduced to 128K context for stable operation with ~20GB of headroom. The model's theoretical max context is 202,752 tokens, but our hardware limits us to 128K in this configuration.
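The KV-cache footprint follows from the architecture table above: 92 layers, 8 KV heads (GQA means only 8 heads are cached, not all 96 query heads), 128 head dim. Treating Q8_0 as roughly 1 byte per element (a simplification that ignores block-scale overhead), the estimate lines up with what we observe at 128K:

```python
# Rough KV-cache size estimate for GLM-4.7 with a Q8_0 cache.
layers, kv_heads, head_dim = 92, 8, 128
bytes_per_elem = 1.0        # Q8_0 ~= 1 byte/element (ignoring block scales)
ctx = 131072                # 128K context

# K and V each store kv_heads * head_dim values per layer per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
kv_gib = bytes_per_token * ctx / 2**30
print(f"{bytes_per_token / 1024:.0f} KiB/token, {kv_gib:.1f} GiB at 128K")
# ~184 KiB/token, ~23.0 GiB at 128K
```

At ~184 KiB per token, every additional 1K of context costs roughly 180 MiB, which is why the jump to 180K tipped us into OOM.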

Benchmark Results

We ran comprehensive benchmarks across multiple context lengths (4K, 8K, 16K, 32K, 64K, 128K) at 350W power limit per GPU. Here are the results:

Context Prefill (t/s) Decode (t/s) TTFT (ms) Power (W) Decode (tokens/kWh)
4K 107.14 30.81 976.57 819.71 135,310
8K 223.68 28.13 5,864.65 820.82 123,360
16K 234.44 25.18 10,693.10 825.91 109,754
32K 217.28 20.43 23,058.76 830.33 88,595
64K 193.23 14.94 51,677.00 846.03 63,567
128K 158.93 9.74 126,053.89 863.81 40,608

Performance Characteristics

The model shows interesting performance patterns across context lengths. Prefill speed peaks at 16K context (234.44 TPS) and gradually declines at longer contexts, reaching 158.93 TPS at 128K. Decode speed also declines with context length, from 30.81 TPS at 4K to 9.74 TPS at 128K.

Power Consumption

Power consumption increases with context length, from ~820W at 4K/8K to ~864W at 128K. This is expected as longer contexts require more computation and memory access. The efficiency (tokens/kWh) also declines with context length, from 135,310 tokens/kWh at 4K to 40,608 tokens/kWh at 128K for decode operations. Notably, we're still not maxing out power despite this being a heavy model (albeit still MoE).
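The efficiency column follows directly from decode speed and measured wall power; small differences from the table come from rounding in the reported inputs:

```python
# Decode efficiency (tokens per kWh) from decode speed and wall power.
def tokens_per_kwh(decode_tps: float, power_watts: float) -> float:
    # tokens/s * 3600 s/h, divided by power in kW, gives tokens/kWh.
    return decode_tps * 3600 / (power_watts / 1000)

print(round(tokens_per_kwh(30.81, 819.71)))  # ~135k tokens/kWh at 4K
print(round(tokens_per_kwh(9.74, 863.81)))   # ~41k tokens/kWh at 128K
```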

Performance Visualization

The following charts visualize the performance characteristics across different context lengths:

Prefill Speed vs Context Length

Prefill speed chart across context lengths

Decode Speed vs Context Length

Decode speed chart across context lengths

Memory Analysis

With ~101GB of model weights and ~23GB KV cache at 131K context, we're using ~124GB total VRAM. This leaves ~20GB headroom across 6 GPUs, distributed as ~3.3GB per GPU.

Component Size
Model weights (IQ1_M) ~101 GB
KV cache (Q8_0, 131k) ~23 GB
Total ~124 GB
Available VRAM 144 GB
Headroom ~20 GB

KV Quant Context KV Size Total Fits?
Q8_0 180k ~31.6 GB ~133 GB No (OOM on GPU 2)
Q8_0 131k ~23 GB ~124 GB Yes (~20 GB headroom)
Q4_0 180k ~17.8 GB ~119 GB Should work (untested)
Q4_0 202k ~20 GB ~121 GB Should work (untested)
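The untested Q4_0 rows can be sanity-checked with the same back-of-envelope estimate, assuming ~0.5625 bytes per element for Q4_0 (4 bits plus block scales) across 92 layers, 8 KV heads, 128 head dim:

```python
# Estimated KV-cache sizes for the Q4_0 scenarios in the table above.
layers, kv_heads, head_dim = 92, 8, 128
elems_per_token = 2 * layers * kv_heads * head_dim   # K and V

def kv_gib(ctx_tokens: int, bytes_per_elem: float) -> float:
    return elems_per_token * bytes_per_elem * ctx_tokens / 2**30

print(f"Q4_0 @ 180K: {kv_gib(180 * 1024, 0.5625):.1f} GiB")  # table: ~17.8 GB
print(f"Q4_0 @ 202K: {kv_gib(202752, 0.5625):.1f} GiB")      # table: ~20 GB
```

The estimates land within a few percent of the table, close enough to trust that Q4_0 at full context should fit.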

Context Length vs KV Cache Trade-offs

180K context with Q8_0 KV caused CUDA OOM during inference. To get full 180K+ context, switch to Q4_0 KV cache (at the cost of some quality on long-context recall). This is a critical trade-off: Q8_0 KV preserves quality but limits context, while Q4_0 KV enables full context at the cost of potential degradation in long-context recall accuracy.

Configuration Details

Startup Parameters

$HOME/llama.cpp/build/bin/llama-server \
    -m "$MODEL_PATH" \
    --host 0.0.0.0 \
    --port 8000 \
    -ngl 99 \
    -ts 1,1,1,1,1,1 \
    --ctx-size 131072 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.6 \
    --top-p 0.95
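Once the server is up, it exposes an OpenAI-compatible API on port 8000. A minimal client sketch using only the standard library (host, port, and the prompt are just placeholders for your setup):

```python
import json
import urllib.request

# Build a chat request for llama-server's OpenAI-compatible endpoint.
payload = {
    "messages": [{"role": "user", "content": "Summarize GQA in one sentence."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
}
body = json.dumps(payload).encode()

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=300) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
except OSError as e:
    print(f"server not reachable: {e}")
```

The generous timeout matters here: at 128K context, time-to-first-token runs over two minutes.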

Key parameters:

-ts 1,1,1,1,1,1: split tensors evenly across all six GPUs
--ctx-size 131072: 128K context, the most that fit without OOM
--cache-type-k/--cache-type-v Q8_0: 8-bit KV cache to stay inside the VRAM budget
--temp 0.6 / --top-p 0.95: sampling settings

Comparison: GLM-4.7 (IQ1_M) vs GLM-4.7-Flash (Q5_K_M)

This experiment directly compares two extreme approaches:

Aspect GLM-4.7 (IQ1_M) GLM-4.7-Flash (Q5_K_M)
Total Parameters 358B 30B
Active Parameters/Token ~40-50B ~5B (estimated)
Quantization IQ1_M (extreme) Q5_K_M (conservative)
Model Size ~101GB ~20GB
Decode Speed 31.5 TPS 60.9 TPS
Prefill Speed 202.6 TPS Unknown
Power ~979W (6 GPUs) Unknown (2 GPUs)
VRAM Required 124GB (131K ctx) 44.5GB (202K ctx)
GPU Count 6x RTX 3090 2x RTX 3090

The Trade-off Analysis

The fundamental question is whether the knowledge encoded in 358B parameters can survive IQ1_M quantization while providing better quality than a 30B model at Q5_K_M. The decode speed suggests the MoE architecture is working correctly (only ~40-50B active parameters per token), but the quality impact of IQ1_M remains to be determined through practical use.

VRAM Optimization Recommendations

Several optimizations can save VRAM if you don't need the full context window:

Reduce Context Size

Set --ctx-size explicitly to potentially save VRAM. Currently using 131K context to prevent OOM. Lower context values (e.g., 64K or 32K) would free significant VRAM for other uses.
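A rough sketch of the savings, assuming a Q8_0 KV cache at roughly 1 byte per element across 92 layers, 8 KV heads, and 128 head dim:

```python
# VRAM freed by shrinking the pre-allocated KV cache (rough Q8_0 estimate).
bytes_per_token = 2 * 92 * 8 * 128   # K + V, ~1 byte/element at Q8_0

for ctx in (131072, 65536, 32768):
    print(f"{ctx // 1024}K context: {bytes_per_token * ctx / 2**30:.1f} GiB KV")
# 128K ~23.0 GiB, 64K ~11.5 GiB, 32K ~5.8 GiB
```

Halving context from 128K to 64K frees roughly 11.5 GiB, enough headroom for a small draft model or a second workload.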

KV Cache Quantization

Switch from Q8_0 to Q4_0 KV cache to enable full 180K+ context. This comes at the cost of some quality on long-context recall, but may be acceptable for many use cases.

Monitor Long-Running Sessions

KV cache grows with sequence length. Long conversations or multiple concurrent users will fill VRAM faster than expected due to the pre-allocated 131K token buffer.

Would I Recommend This?

TL;DR: Yes, with caveats. This is a very good model. I am satisfied with the quality despite the heavy quantization, and I can tolerate the tokens per second, although it is meaningfully slower than MiniMax M2.5 or Qwen3-Coder-Next (but this is expected).

The model successfully passed our testing, which included performance baselines, writing this blog post, and creating a meal plan and grocery list for the week. The quality is impressive given the extreme 1-bit quantization, suggesting that the knowledge encoded in 358B parameters can survive such heavy compression.

However, there are trade-offs to consider:

Decode speed drops steeply with context, from ~31 TPS at 4K to under 10 TPS at 128K.
Q8_0 KV caps us at 128K context; reaching the full 202K window requires Q4_0 KV and its long-context recall cost.
The quality impact of 1-bit quantization hasn't been characterized beyond our everyday use.
It ties up all six GPUs and draws ~820-864W under load.

For production use, I'd still recommend GLM-4.7-Flash (Q5_K_M) as the safer choice due to its faster decode speed and lower resource requirements. But for research, experimentation, or when you need the absolute maximum model capacity that can fit in your hardware, this GLM-4.7 (IQ1_M) configuration is a viable option that delivers surprisingly good quality.