This post documents an experiment running GLM-4.7-Flash, a 30B-A3B Mixture of Experts model, on a dual RTX 3090 system. We investigate memory usage patterns, KV cache behavior, and benchmark the model's token generation throughput on consumer hardware.
Hardware and Model
System Specs
- GPUs: 2x NVIDIA RTX 3090 (24GB VRAM each, 48GB total)
- CPU: AMD Threadripper 5995WX
- RAM: 384GB DDR4-3200 ECC
- PCIe: Full 4.0 16x speed on both GPUs
- Framework: llama.cpp server
Model Details
GLM-4.7-Flash is a Mixture of Experts model with the following architecture:
| Parameter | Value |
|---|---|
| Total Parameters | 30B |
| Active Parameters per Token | ~3B (the "A3B" in 30B-A3B) |
| Quantization | Q5_K_M from Unsloth's HuggingFace page |
| File Size | ~20GB |
| Max Context | 202,752 tokens |
| Framework Support | vLLM, SGLang, llama.cpp, transformers |
Quantization Notes
Used Unsloth's Q5_K_M model from their Unsloth Dynamic 2.0 quantization collection. I initially tried the Q4_K_M version released the same day but wasn't immediately impressed. The quality difference likely comes from Unsloth's specialized optimizations in the Q5_K_M release plus the higher-precision Q5 quantization.
Context Length Testing
We've currently used 69,000 tokens in a running session with no OOM errors or crashes. It's not yet clear what maximum context this configuration will tolerate before exhausting VRAM, but the current setup shows good headroom for long conversations.
Benchmark Results
Token Generation Speed: 60.93 tokens/second
(Average over 5 runs, ~1080 character prompt generating ~100 tokens per run)
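For reproducibility, here is a sketch of how each run's throughput can be read back from llama-server: its /completion endpoint returns a timings object that includes the tokens/sec llama.cpp itself measured for the request. Field names assume a recent llama.cpp build, and localhost:8000 matches the startup command later in this post.

```python
import json
import urllib.request

def extract_tps(response):
    # llama-server includes per-request timing stats in its JSON reply;
    # predicted_per_second is the generation-phase tokens/sec.
    return response["timings"]["predicted_per_second"]

def measure_tps(prompt, n_predict=100, url="http://localhost:8000/completion"):
    # Post one benchmark prompt to a running llama-server and return
    # the tokens/sec the server reports for that generation.
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        body and url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_tps(json.load(resp))
```

Averaging five calls to measure_tps with the same prompt reproduces the methodology above.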
Performance Characteristics
The model shows consistent performance with modest variance (53.24 to 63.09 TPS across runs), suggesting stable inference on this hardware configuration. With full PCIe 4.0 16x bandwidth on both GPUs and 384GB of RAM, the Threadripper platform provides an excellent foundation for MoE models. The benchmark ran with -ngl 999 (all layers offloaded to GPU) and llama-server's continuous batching, which maintains multiple inference slots and suits multi-user scenarios.
Memory Analysis
With only 20GB of model weights but 44.5GB VRAM being used, there's additional memory allocation that's not immediately obvious. From nvidia-smi, GPU 0 is using 23GB and GPU 1 is using ~21.5GB.
| Memory Location | Usage | Contents |
|---|---|---|
| Model weights | ~20GB | Q5_K_M quantized parameters |
| Additional VRAM | ~24.5GB | Unknown allocation - doesn't appear to be growing |
Memory Allocation Mystery
We're unsure exactly what is using the additional ~24.5GB of VRAM. Reported KV cache usage appears static throughout the session, and CPU RAM usage has not increased either. The most likely explanation is that llama.cpp pre-allocates the KV cache and compute buffers for the full fixed context of 202,752 tokens at startup (the context size is visible in the logs), which would account for VRAM usage far above the 20GB of weights, but we don't have direct visibility into the allocation.
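One way to sanity-check the pre-allocation theory is to estimate the KV cache size from the model's attention hyperparameters. The values below (48 layers, 8 KV heads, head size 128, FP16 cache) are placeholders for illustration, not GLM-4.7-Flash's actual configuration; substitute the real numbers from the GGUF metadata.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each [n_ctx, n_kv_heads * head_dim],
    # stored at the cache precision (2 bytes for FP16).
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Placeholder hyperparameters -- NOT the model's real config.
gib = kv_cache_bytes(n_ctx=202_752, n_layers=48, n_kv_heads=8, head_dim=128) / 2**30
print(f"~{gib:.1f} GiB")  # prints ~37.1 GiB with these placeholders
```

Even with made-up numbers, the point stands: a 200K-token FP16 cache can easily run to tens of gigabytes, which is the right order of magnitude for the "missing" ~24.5GB.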
Configuration Details
Startup Parameters
```shell
$HOME/llama.cpp/build/bin/llama-server \
  -m "$MODEL_PATH" \
  --host 0.0.0.0 \
  --port 8000 \
  -ngl 999 \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.01
```
Key parameters:
- -ngl 999: Offload all model layers to the GPUs; llama-server's continuous batching (enabled by default) handles multiple concurrent requests
- --temp 1.0: Default for general use (matches Unsloth recommendations)
- --top-p 0.95: Nucleus sampling, standard for general tasks
- --min-p 0.01: Critical to set explicitly in llama.cpp (Unsloth notes llama.cpp's default is 0.1)
Recommended Parameters
- General use: --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0
- Tool-calling: --temp 0.7 --top-p 1.0 --min-p 0.01 --repeat-penalty 1.0
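These presets can also be sent per-request through llama-server's OpenAI-compatible endpoint instead of being fixed at startup. A minimal sketch, assuming a recent llama.cpp build (which accepts min_p and repeat_penalty as extension fields on /v1/chat/completions):

```python
import json
import urllib.request

def build_request(messages, preset="general"):
    # Attach one of the two recommended sampling presets to a chat request.
    presets = {
        "general":      {"temperature": 1.0, "top_p": 0.95, "min_p": 0.01},
        "tool_calling": {"temperature": 0.7, "top_p": 1.0,  "min_p": 0.01},
    }
    return {"messages": messages, "repeat_penalty": 1.0, **presets[preset]}

def send(payload, url="http://localhost:8000/v1/chat/completions"):
    # Post the request to the llama-server instance started above.
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Switching presets per request is handy when the same server backs both chat and Opencode tool-calling sessions.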
Performance Benchmarks
GLM-4.7 Flash on Public Benchmarks
According to the official model card, GLM-4.7-Flash performs impressively across multiple tasks:
| Benchmark | GLM-4.7-Flash | Runner-up | Winner |
|---|---|---|---|
| AIME 25 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
The MoE Advantage
GLM-4.7-Flash demonstrates the core benefit of the Mixture of Experts architecture: you get the knowledge encoded in 30B parameters while only computing through a fraction of them (roughly 3B active, per the A3B designation) for each token. Because only the active experts' weights are read and multiplied per token, memory traffic and compute per token are a fraction of what a dense 30B model would need, which is what makes the model suitable for lightweight local deployment and explains the very impressive tokens/sec rate we're seeing.
VRAM Optimization Recommendations
Several optimizations can save VRAM if you don't need the full context window:
Reduce Context Size
Set --ctx-size explicitly to potentially save VRAM; for example, --ctx-size 32768 caps the context at 32K tokens instead of the 202,752-token default. I'm currently unsure how much of the VRAM is context allocation, but the model weights are only 20GB and I have just ~3.5GB to spare.
Disable Repeat Penalty
Add --repeat-penalty 1.0 as recommended by Unsloth to disable the default repeat penalty.
Monitor Long-Running Sessions
KV cache usage grows with sequence length. Long conversations or multiple concurrent users will fill the pre-allocated 200K-token buffer faster than expected, so watch headroom during long-running sessions.
Alternative Deployment Options
For different use cases, consider these frameworks:
vLLM
Best for production deployments with high throughput needs:
```shell
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 4 \
  --speculative-config.method mtp \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice
```
SGLang
Excellent for structured generation and efficient serving:
```shell
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --tp-size 4 \
  --tool-call-parser glm47 \
  --mem-fraction-static 0.8
```
Would I Recommend This?
TL;DR: 100%. I have not yet experienced a ~30B-parameter model this good. It's the first local model I've felt I could actually trust for real tasks in place of Claude, Minimax, etc. Tool use in Opencode has worked flawlessly, at least for read/write/web search. It's fast, reliable, and also: FREE (hardware cost of entry notwithstanding). It's the most capable small model I've used.
The 60 TPS throughput is excellent for a 30B model. The fact that it works so well with such a small footprint means there's a lot more room for context, which is essential for real-life use cases, such as Opencode sessions.
The configuration is well-suited to multi-user scenarios, with all layers on GPU (-ngl 999) and llama-server's continuous batching sharing inference slots across requests. You'll still need to monitor VRAM usage during long-running sessions, as open-ended conversations steadily fill the pre-allocated KV cache.
Overall, GLM-4.7-Flash represents an excellent option for local LLM inference, offering strong performance at an accessible model size. It's a pleasure to use with Opencode.