LLM Garage

Home Engineer's AI Hardware Journal

Tokens-per-Second Test Bench: Power Limit Sweep

Systematic benchmarking from 100W to 350W per GPU
February 2026

I built a better tokens-per-second test bench to benchmark inference performance systematically. The key improvement over my previous ad-hoc scripts is that it now measures prefill and decode separately, along with TTFT, power draw, and efficiency metrics.

The Experiment

I ran the MiniMax M2.5 UD_IQ2_M model (Q2 quantization) across 6x RTX 3090s (144GB total VRAM) with the llama.cpp server. The test swept power limits from 100W to 350W per GPU; power draw figures below are DC draw as reported by nvidia-smi.
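The sweep itself is straightforward to script. Here's a minimal sketch of the power-limit step, assuming `nvidia-smi` is on PATH; the helper names are mine, not the actual bench code:

```python
import subprocess

GPU_COUNT = 6
POWER_LIMITS_W = [100, 150, 200, 250, 300, 350]  # sweep points used in this test

def power_limit_cmd(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi command that caps one GPU at `watts` W."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

def apply_power_limit(watts: int, dry_run: bool = True) -> list[list[str]]:
    """Cap all GPUs at the same limit; with dry_run=True just return the commands."""
    cmds = [power_limit_cmd(i, watts) for i in range(GPU_COUNT)]
    if not dry_run:
        # Requires root; persistence mode (nvidia-smi -pm 1) may be needed
        # first so the limit sticks between benchmark runs.
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds
```

The bench then restarts the llama.cpp server and runs the prefill/decode measurements at each sweep point.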

Results

Prefill Performance

| Power Limit | Context | Prefill tps | Efficiency (tok/kWh) |
|-------------|---------|-------------|----------------------|
| 350W | 198K | 9,893 | 43.5M |
| 300W | 198K | 9,326 | 40.2M |
| 250W | 198K | 8,436 | 35.2M |
| 200W | 198K | 5,941 | 25.1M |
| 150W | 198K | 2,549 | 11.3M |
| 100W | 198K | 1,390 | 7.5M |

Higher power limits deliver higher prefill throughput. At 350W vs 100W, you're getting ~7x more tokens/second during prefill.
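The efficiency column follows directly from throughput and measured wall draw: tok/kWh = tps × 3600 / kW. You can also invert it to back out the average draw behind each row. A quick sketch (function names are mine; numbers from the table above):

```python
def tokens_per_kwh(tps: float, draw_watts: float) -> float:
    """tokens/sec * 3600 sec/hour, divided by power in kW -> tokens per kWh."""
    return tps * 3600.0 / (draw_watts / 1000.0)

def implied_draw_watts(tps: float, tok_per_kwh: float) -> float:
    """Invert the formula to recover average draw from a table row."""
    return tps * 3_600_000.0 / tok_per_kwh

# 350W row: 9,893 tps at 43.5M tok/kWh -> ~819W across the rig
print(round(implied_draw_watts(9893, 43.5e6)))
# 100W row: 1,390 tps at 7.5M tok/kWh -> ~667W
print(round(implied_draw_watts(1390, 7.5e6)))
```

Both rows back out to roughly the 670-820W band mentioned in the takeaways, which is why the efficiency spread tracks throughput rather than the limit setting.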

Decode Performance

| Power Limit | Context | Decode tps | Efficiency (tok/kWh) |
|-------------|---------|------------|----------------------|
| 350W | 4K | 72.6 | 338K |
| 300W | 4K | 72.1 | 326K |
| 250W | 4K | 70.7 | 314K |
| 200W | 4K | 68.3 | 298K |
| 150W | 4K | 62.6 | 277K |
| 100W | 4K | 12.1 | 65K |

Decode performance plateaus between 250W and 350W (72.6 vs 70.7 tps, only a ~3% improvement). But at 100W, decode is throttled dramatically: 12.1 tps vs 72.6 tps at 350W.

Time to First Token (TTFT)

| Power Limit | TTFT @ 4K context | TTFT @ 198K context |
|-------------|-------------------|---------------------|
| 350W | 1,178 ms | 20,003 ms |
| 300W | 1,207 ms | 21,219 ms |
| 250W | 1,310 ms | 23,458 ms |
| 200W | 1,725 ms | 33,310 ms |
| 150W | 3,803 ms | 77,636 ms |
| 100W | 7,421 ms | 142,324 ms |

TTFT falls steeply as the power limit rises, tracking prefill throughput almost exactly. Going from 100W to 350W gives you ~6-7x faster time to first token.
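At long context, TTFT is essentially prompt length divided by prefill throughput. A quick sanity check against the 198K rows, assuming "198K" means 198,000 tokens (function name is mine):

```python
def estimate_ttft_ms(prompt_tokens: int, prefill_tps: float) -> float:
    """TTFT is approximately the time to prefill the whole prompt."""
    return prompt_tokens / prefill_tps * 1000.0

# 350W: 198,000 tokens / 9,893 tps -> ~20,014 ms (measured: 20,003 ms)
print(round(estimate_ttft_ms(198_000, 9893)))
# 100W: 198,000 tokens / 1,390 tps -> ~142,446 ms (measured: 142,324 ms)
print(round(estimate_ttft_ms(198_000, 1390)))
```

The estimates land within ~0.1% of the measured values, so almost all of the long-context TTFT is prefill compute, not queueing or sampling overhead.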

Key Findings

The Bottom Line on Power Limits

Based on this data, power limiting doesn't save energy for inference workloads like MiniMax; the GPUs simply don't use the full power allocation. The real reasons to use power limits are practical ones: circuit capacity, heat management, and GPU longevity.

Charts

The test bench generates detailed charts showing prefill and decode performance across all context lengths and power levels.

Prefill Performance Charts
Decode Performance Charts

Takeaways

  1. Power limits don't necessarily save energy - MiniMax draws ~670-820W regardless of power limit setting. There's no meaningful efficiency gain.
  2. 350W delivers the best performance - highest throughput, fastest TTFT, and the best efficiency (lowest energy per token).
  3. 100W should be avoided entirely - you get neither speed nor efficiency. Decode drops to ~12 tps vs 72 tps at higher power, and you generate fewer total tokens per kWh due to the throttling.
  4. Use power limits for practical reasons - circuit capacity, heat management, or extending GPU life, not energy savings.
  5. Need to test dense models - these results might be different with dense (non-MoE) models that actually max out GPU power draw.