This post documents an experiment running GLM-4.7-Flash, a 30B-A3B Mixture of Experts model, on a dual RTX 3090 system. We investigate memory usage patterns, KV cache behavior, and benchmark the model's token generation throughput on consumer hardware.
Hardware and Model
System Specs
- GPUs: 2x NVIDIA RTX 3090 (24GB VRAM each, 48GB total)
- CPU: AMD Threadripper 5995WX
- RAM: 384GB DDR4-3200 ECC
- PCIe: Full 4.0 16x speed on both GPUs
- Framework: llama.cpp server
Model Details
GLM-4.7-Flash is a Mixture of Experts model with the following architecture:
| Parameter | Value |
|---|---|
| Total Parameters | 30B |
| Active Parameters per Token | ~3B (the "A3B" in 30B-A3B) |
| Quantization | Q5_K_M from Unsloth's HuggingFace page |
| File Size | ~20GB |
| Max Context | 202,752 tokens |
| Framework Support | vLLM, SGLang, llama.cpp, transformers |
Quantization Notes
Used Unsloth's Q5_K_M model from their Unsloth Dynamic 2.0 quantization collection. I initially tried the Q4_K_M version released the same day but wasn't immediately impressed. The quality difference likely comes from Unsloth's specialized optimizations in the Q5_K_M release plus the higher-precision Q5 quantization.
Context Length Testing
We've currently used 69,000 tokens in a running session with no OOM errors or crashes. It's not yet clear what maximum context this configuration will tolerate before exhausting VRAM, but the current setup shows good headroom for long conversations.
Benchmark Results
Token Generation Speed: 60.93 tokens/second
(Average over 5 runs, ~1080 character prompt generating ~100 tokens per run)
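For reproducibility, here is a sketch of how each run's throughput can be read back from llama-server: its /completion endpoint returns a timings object that includes the tokens/sec llama.cpp itself measured for the request. Field names assume a recent llama.cpp build, and localhost:8000 matches the startup command later in this post.

```python
import json
import urllib.request

def extract_tps(response):
    # llama-server includes per-request timing stats in its JSON reply;
    # predicted_per_second is the generation-phase tokens/sec.
    return response["timings"]["predicted_per_second"]

def measure_tps(prompt, n_predict=100, url="http://localhost:8000/completion"):
    # Post one benchmark prompt to a running llama-server and return
    # the tokens/sec the server reports for that generation.
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        body and url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_tps(json.load(resp))
```

Averaging five calls to measure_tps with the same prompt reproduces the methodology above.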
Performance Characteristics
The model shows consistent performance with modest variance (53.24 to 63.09 TPS across runs), suggesting stable inference on this hardware configuration. With full PCIe 4.0 16x bandwidth on both GPUs and 384GB of RAM, the Threadripper platform provides an excellent foundation for MoE models. The benchmark ran with -ngl 999 (all layers offloaded to GPU) and llama-server's continuous batching, which maintains multiple inference slots and suits multi-user scenarios.
Memory Analysis
With only 20GB of model weights but 44.5GB VRAM being used, there's additional memory allocation that's not immediately obvious. From nvidia-smi, GPU 0 is using 23GB and GPU 1 is using ~21.5GB.
| Memory Location | Usage | Contents |
|---|---|---|
| Model weights | ~20GB | Q5_K_M quantized parameters |
| Additional VRAM | ~24.5GB | Unknown allocation - doesn't appear to be growing |
Memory Allocation Mystery
We're unsure exactly what is using the additional ~24.5GB of VRAM. Reported KV cache usage appears static throughout the session, and CPU RAM usage has not increased either. The most likely explanation is that llama.cpp pre-allocates the KV cache and compute buffers for the full fixed context of 202,752 tokens at startup (the context size is visible in the logs), which would account for VRAM usage far above the 20GB of weights, but we don't have direct visibility into the allocation.
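One way to sanity-check the pre-allocation theory is to estimate the KV cache size from the model's attention hyperparameters. The values below (48 layers, 8 KV heads, head size 128, FP16 cache) are placeholders for illustration, not GLM-4.7-Flash's actual configuration; substitute the real numbers from the GGUF metadata.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each [n_ctx, n_kv_heads * head_dim],
    # stored at the cache precision (2 bytes for FP16).
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Placeholder hyperparameters -- NOT the model's real config.
gib = kv_cache_bytes(n_ctx=202_752, n_layers=48, n_kv_heads=8, head_dim=128) / 2**30
print(f"~{gib:.1f} GiB")  # prints ~37.1 GiB with these placeholders
```

Even with made-up numbers, the point stands: a 200K-token FP16 cache can easily run to tens of gigabytes, which is the right order of magnitude for the "missing" ~24.5GB.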
Configuration Details
Startup Parameters
```shell
$HOME/llama.cpp/build/bin/llama-server \
  -m "$MODEL_PATH" \
  --host 0.0.0.0 \
  --port 8000 \
  -ngl 999 \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0.01
```
Key parameters:
- -ngl 999: Offload all model layers to the GPUs; llama-server's continuous batching (enabled by default) handles multiple concurrent requests
- --temp 1.0: Default for general use (matches Unsloth recommendations)
- --top-p 0.95: Nucleus sampling, standard for general tasks
- --min-p 0.01: Critical to set explicitly in llama.cpp (Unsloth notes llama.cpp's default is 0.1)
Recommended Parameters
- General use: --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0
- Tool-calling: --temp 0.7 --top-p 1.0 --min-p 0.01 --repeat-penalty 1.0
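These presets can also be sent per-request through llama-server's OpenAI-compatible endpoint instead of being fixed at startup. A minimal sketch, assuming a recent llama.cpp build (which accepts min_p and repeat_penalty as extension fields on /v1/chat/completions):

```python
import json
import urllib.request

def build_request(messages, preset="general"):
    # Attach one of the two recommended sampling presets to a chat request.
    presets = {
        "general":      {"temperature": 1.0, "top_p": 0.95, "min_p": 0.01},
        "tool_calling": {"temperature": 0.7, "top_p": 1.0,  "min_p": 0.01},
    }
    return {"messages": messages, "repeat_penalty": 1.0, **presets[preset]}

def send(payload, url="http://localhost:8000/v1/chat/completions"):
    # Post the request to the llama-server instance started above.
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Switching presets per request is handy when the same server backs both chat and Opencode tool-calling sessions.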
Performance Benchmarks
GLM-4.7 Flash on Public Benchmarks
According to the official model card, GLM-4.7-Flash performs impressively across multiple tasks:
| Benchmark | GLM-4.7-Flash | Runner-up | Winner |
|---|---|---|---|
| AIME 25 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
The MoE Advantage
GLM-4.7-Flash demonstrates the core benefit of the Mixture of Experts architecture: you get the knowledge encoded in 30B parameters while only computing through a fraction of them (roughly 3B active, per the A3B designation) for each token. Because only the active experts' weights are read and multiplied per token, memory traffic and compute per token are a fraction of what a dense 30B model would need, which is what makes the model suitable for lightweight local deployment and explains the very impressive tokens/sec rate we're seeing.
VRAM Optimization Recommendations
Several optimizations can save VRAM if you don't need the full context window:
Reduce Context Size
Set --ctx-size explicitly to potentially save VRAM; for example, --ctx-size 32768 caps the context at 32K tokens instead of the 202,752-token default. I'm currently unsure how much of the VRAM is context allocation, but the model weights are only 20GB and I have just ~3.5GB to spare.
Disable Repeat Penalty
Add --repeat-penalty 1.0 as recommended by Unsloth to disable the default repeat penalty.
Monitor Long-Running Sessions
KV cache usage grows with sequence length. Long conversations or multiple concurrent users will fill the pre-allocated 200K-token buffer faster than expected, so watch headroom during long-running sessions.
Alternative Deployment Options
For different use cases, consider these frameworks:
vLLM
Best for production deployments with high throughput needs:
```shell
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 4 \
  --speculative-config.method mtp \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice
```
SGLang
Excellent for structured generation and efficient serving:
```shell
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --tp-size 4 \
  --tool-call-parser glm47 \
  --mem-fraction-static 0.8
```
Would I Recommend This?
TL;DR: 100%. I have not yet experienced a ~30B-parameter model this good. It's the first local model I've felt I could actually trust for real tasks in place of Claude, Minimax, etc. Tool use in Opencode has worked flawlessly, at least for read/write/web search. It's fast, reliable, and also: FREE (hardware cost of entry notwithstanding). It's the most capable small model I've used.
The 60 TPS throughput is excellent for a 30B model. The fact that it works so well with such a small footprint means there's a lot more room for context, which is essential for real-life use cases, such as Opencode sessions.
The configuration is well-suited to multi-user scenarios, with all layers on GPU (-ngl 999) and llama-server's continuous batching sharing inference slots across requests. You'll still need to monitor VRAM usage during long-running sessions, as open-ended conversations steadily fill the pre-allocated KV cache.
Overall, GLM-4.7-Flash represents an excellent option for local LLM inference, offering strong performance at an accessible model size. It's a pleasure to use with Opencode.