This benchmark compares RTX 3090 performance at different power limits across various context lengths. Results show the impact of power capping on inference throughput.
## Test Configuration
- GPUs: 2x RTX 3090
- Model: Qwen3-32B Q5_K_M
- Framework: llama.cpp server
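A minimal sketch of how a run like this could be set up. The model path and port are placeholders, and `nvidia-smi -pl` requires root privileges (persistence mode keeps the limit applied between runs):

```shell
# Enable persistence mode so the power limit sticks (requires root)
sudo nvidia-smi -pm 1

# Cap both RTX 3090s at 100 W (repeat with -pl 350 for the full-power run)
sudo nvidia-smi -i 0,1 -pl 100

# Verify the active limit
nvidia-smi -q -d POWER | grep "Power Limit"

# Launch the llama.cpp server with all layers offloaded to the GPUs
# (model path is a placeholder; -c sets the context window under test)
llama-server -m ./Qwen3-32B-Q5_K_M.gguf -ngl 99 -c 32768 --port 8080
```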
## Results
| Context Length | 100 W Limit (t/s) | 350 W Limit (t/s) | 100 W vs 350 W |
|---|---|---|---|
| 1K tokens | 15 | 18 | 17% slower |
| 8K tokens | 12 | 16 | 25% slower |
| 32K tokens | 8 | 14 | 43% slower |
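The rightmost column is the relative slowdown of the capped run, `(1 - capped/full) * 100`. For the 32K-token row, for example:

```shell
# Relative slowdown of the 100 W run vs. the 350 W run at 32K context:
# 8 t/s capped vs. 14 t/s at full power
awk 'BEGIN { printf "%.0f%%\n", (1 - 8/14) * 100 }'
```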
## Analysis
The throughput penalty from the 100 W cap grows with context length: roughly 17% at 1K tokens, rising to 43% at 32K. A likely explanation is that attention over the growing KV cache adds compute per generated token, so the reduced clocks imposed by the power cap bite harder as the context fills.
## Recommendation
For maximum throughput, run at full power (350 W). If electricity cost or heat output is a concern, an intermediate limit such as 200 W is a reasonable compromise, though that setting was not measured in this test.