Measuring LLM performance shouldn't require complex tooling. After experimenting with various benchmarking approaches, I settled on a simple bash script that focuses on the one metric that matters most for interactive use: tokens per second (TPS). This tool became essential for comparing model configurations, quantization levels, and hardware setups.
The Script Explained
Core Functionality
The bench-tps.sh script is a lightweight bash utility that:
- Sends a standardized prompt to an OpenAI-compatible LLM server
- Measures the exact time from request start to response completion
- Calculates tokens-per-second from the reported completion token count and elapsed wall-clock time
- Parses server response using jq for accurate token counts
- Handles common error cases (server down, malformed responses)
Script Breakdown
Configuration and Defaults
PORT="${1:-8000}" # Server port (default: 8000)
PROMPT="${2:-Count from 1 to 50...}" # Test prompt (customizable)
MAX_TOKENS="${3:-256}" # Max output tokens (default: 256)
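All three arguments are positional and optional, so invocation stays terse:
./bench-tps.sh                            # port 8000, default prompt, 256 tokens
./bench-tps.sh 8001                       # same defaults against a server on port 8001
./bench-tps.sh 8000 "Explain DNS" 128     # custom prompt and token cap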
Timing and Request Logic
START=$(date +%s.%N) # High-precision start time
RESPONSE=$(curl -s http://localhost:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{\"messages\": [{\"role\":\"user\",\"content\":\"$PROMPT\"}],\"stream\":false,\"max_tokens\":$MAX_TOKENS}")
END=$(date +%s.%N) # End time
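One caveat worth flagging: the JSON body is assembled by string interpolation, so a prompt containing double quotes or backslashes produces invalid JSON. A safer variant (a sketch, not what the script currently does) lets jq build the payload:
# Build the request body with jq so all escaping is handled automatically
PAYLOAD=$(jq -n --arg p "$PROMPT" --argjson mt "$MAX_TOKENS" \
  '{messages: [{role: "user", content: $p}], stream: false, max_tokens: $mt}')
RESPONSE=$(curl -s "http://localhost:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD")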
Metrics Extraction
MODEL=$(echo "$RESPONSE" | jq -r '.model // "unknown"')
PROMPT_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.prompt_tokens // 0')
COMPLETION_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.completion_tokens // 0')
ELAPSED=$(echo "$END - $START" | bc)
TPS=$(echo "scale=2; $COMPLETION_TOKENS / $ELAPSED" | bc)
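The script finishes by echoing a short summary. The exact wording isn't important, but the later examples grep for a line labeled "Speed", so the reporting section boils down to something like this (a sketch of the output format, not a verbatim excerpt):
# Print the summary; the "Speed" label is what later greps match on
echo "Model: $MODEL"
echo "Prompt tokens: $PROMPT_TOKENS"
echo "Completion tokens: $COMPLETION_TOKENS"
echo "Elapsed: ${ELAPSED}s"
echo "Speed: $TPS tokens/sec"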
Why Simple TPS Matters
The Importance of Interactive Performance
While complex benchmarks measure throughput and latency, the single most important metric for interactive LLM use is tokens-per-second. This directly correlates with user experience:
- 10+ TPS: Excellent - text streams faster than most people read
- 5-10 TPS: Good - responsive with slight lag
- 2-5 TPS: Usable but noticeable delay
- <2 TPS: Frustrating for interactive work
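To translate those ranges into wait time, a quick back-of-the-envelope check (300 tokens is just an illustrative answer length):
# Rough wait time for a 300-token response at different speeds
for tps in 10 5 2; do
  echo "$tps TPS -> $(echo "300 / $tps" | bc) s for 300 tokens"
done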
Practical Applications
Configuration Comparison
# Test quantization impact
./bench-tps.sh 8000 "Hello world" 256            # Default config
./bench-tps.sh 8000 "Hello world" 256            # Same, verify consistency
# Different models
./bench-tps.sh 8000 "Hello world" 256            # Model A
./bench-tps.sh 8001 "Hello world" 256            # Model B on a different port
# Context window impact
./bench-tps.sh 8000 "Summarize: [long text]" 512 # Large context
./bench-tps.sh 8000 "Hello world" 256            # Small context
Real-World Testing Scenarios
Model Configuration Experiments
Quantization Performance Comparison
| Configuration | Prompt | Result (TPS) | Interpretation |
|---|---|---|---|
| Nemotron Q6_K | "Count from 1 to 50" | 64.82 | Excellent - MoE efficiency |
| Qwen3-32B Q5_K_M | "Count from 1 to 50" | 9.06 | Dense model slower |
| Qwen3-32B Q4_K_M | "Count from 1 to 50" | 11.3 | Lighter quantization helps |
Hardware Performance Verification
Multi-GPU Scaling
Before/After Optimization
Used this script to verify multi-GPU setup effectiveness:
# Single GPU test
./bench-tps.sh 8000 "test prompt" 256
# Result: ~13 TPS (GPU limited)
# Dual GPU setup
./bench-tps.sh 8000 "test prompt" 256
# Result: ~9.06 TPS
# Confirmed: Both GPUs engaged, model split correctly
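The TPS number alone doesn't prove both cards are doing work, so it helps to watch utilization in a second terminal while the benchmark runs; on NVIDIA hardware, something like:
# Per-GPU utilization and memory while the benchmark runs
watch -n 1 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv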
Model Loading Validation
Health Check Integration
# Part of automated server validation
if ! ./bench-tps.sh 8000 "test" 64 > /dev/null 2>&1; then
  echo "ERROR: Server not responding properly"
  exit 1
fi
echo "Server validation passed"
The script serves as both benchmark and health check - if it can't complete, the server isn't ready for production use.
Advanced Usage Patterns
Automated Testing
Benchmark Series
#!/bin/bash
# Automated benchmark loop - one server per model, each listening on its own port
models=("nemotron" "qwen3-32b" "mistral-7b")
ports=(8000 8001 8002)
prompts=("Hello" "Write Python code" "Explain physics")
echo "Running comprehensive benchmark suite..."
for i in "${!models[@]}"; do
  echo "Testing ${models[$i]} (port ${ports[$i]}):"
  for prompt in "${prompts[@]}"; do
    echo "  $prompt: $(./bench-tps.sh "${ports[$i]}" "$prompt" 256 | grep "Speed")"
  done
done
Performance Monitoring
Continuous Monitoring Script
#!/bin/bash
# monitor-performance.sh
LOG_FILE="/var/log/llm-performance.log"
while true; do
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  RESULT=$(./bench-tps.sh 8000 "test prompt" 128)
  TPS=$(echo "$RESULT" | grep "Speed" | grep -o '[0-9.]*' | head -1)
  echo "$TIMESTAMP TPS: $TPS" >> "$LOG_FILE"
  # Alert if performance drops significantly (or the benchmark fails outright)
  if (( $(echo "${TPS:-0} < 5" | bc -l) )); then
    echo "WARNING: Low TPS detected: $TPS" >> "$LOG_FILE"
  fi
  sleep 300 # Check every 5 minutes
done
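To keep the monitor running after logout, launch it detached (or wrap it in a systemd unit):
nohup ./monitor-performance.sh > /dev/null 2>&1 &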
Error Handling Insights
Common Failure Modes Discovered
Server Loading Issues
The script revealed critical timing issues during model loading:
# Health endpoint responds immediately
curl http://localhost:8000/health
{"status":"ok"}
# But TPS script fails with model loading error
./bench-tps.sh 8000 "test" 64
Error: Loading model
# Root cause: HTTP server starts before model loads completely
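The practical workaround is to treat a successful bench-tps.sh run, not the health endpoint, as the readiness signal; a retry loop along these lines (attempt count and sleep interval are arbitrary choices):
# Wait until the model has actually loaded before trusting the server
for attempt in $(seq 1 30); do
  if ./bench-tps.sh 8000 "test" 16 > /dev/null 2>&1; then
    echo "Model ready after $attempt attempt(s)"
    break
  fi
  sleep 10
done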
Resource Exhaustion Detection
- Out of Memory: Script returns "No response from server"
- Port Conflicts: curl fails to connect, returns connection refused
- Model Not Loaded: Returns JSON error with "Loading model" message
- Server Overload: Extremely slow TPS indicates resource starvation
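Handling these cases boils down to a couple of guard clauses after the curl call. Roughly this shape (a sketch: variable names match the earlier excerpts, message wording is approximate):
# Empty body: server down, port conflict, or OOM-killed process
if [[ -z "$RESPONSE" ]]; then
  echo "No response from server on port $PORT"
  exit 1
fi
# Server answered, but with an error object (e.g. model still loading)
ERR=$(echo "$RESPONSE" | jq -r '.error.message // empty')
if [[ -n "$ERR" ]]; then
  echo "Error: $ERR"
  exit 1
fi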
Script Evolution
Design Decisions
- Bash only: No external dependencies except curl, jq, and bc
- Curl vs HTTP clients: curl is ubiquitous, reliable for simple POST requests
- jq for parsing: JSON parsing safer than text processing
- Sub-second precision: date +%s.%N yields nanosecond-resolution timestamps, more than enough timing accuracy
- Streaming disabled: Non-streaming responses easier to parse reliably
Future Improvements
- Multiple runs: Calculate average TPS over several runs
- Latency breakdown: Measure time-to-first-token separately
- JSON output: Machine-readable results for automation
- Context window testing: Variable context sizes to test KV cache impact
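The first item is simple enough to sketch as a thin wrapper (a hypothetical bench-avg.sh, not part of the current tooling):
#!/bin/bash
# bench-avg.sh - average TPS over several runs of bench-tps.sh
RUNS="${1:-5}"
TOTAL=0
for i in $(seq 1 "$RUNS"); do
  TPS=$(./bench-tps.sh 8000 "Count from 1 to 50" 256 | grep "Speed" | grep -o '[0-9.]*' | head -1)
  echo "Run $i: $TPS TPS"
  TOTAL=$(echo "$TOTAL + $TPS" | bc)
done
echo "Average: $(echo "scale=2; $TOTAL / $RUNS" | bc) TPS over $RUNS runs"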
Key Learnings from Usage
Performance Realities
Consistent benchmarking revealed important performance insights:
- MoE superiority: 7x speed advantage of Nemotron over Qwen3 with similar parameter counts
- Quantization impact: Q4 vs Q5 can provide significant speed gains with quality trade-offs
- Model loading overhead: First request after server start often slower than subsequent ones
The Value of Simple Metrics
Complex vs. Simple
While sophisticated benchmarking suites exist, the simple TPS metric directly correlates with user experience. When a model feels "slow" or "fast" in interactive use, it's almost always reflected in the TPS numbers.
The script's simplicity makes it perfect for:
- Quick performance checks during development
- Automated health monitoring
- Configuration optimization
- Hardware upgrade justification
Integration into Testing Workflow
Development Validation
# Before/after model changes
BEFORE_TPS=$(./bench-tps.sh 8000 "test" 256 | grep "Speed")
# Apply configuration changes
AFTER_TPS=$(./bench-tps.sh 8000 "test" 256 | grep "Speed")
echo "Performance impact: $BEFORE_TPS -> $AFTER_TPS"
Automated Testing
# CI/CD integration
TPS=$(./bench-tps.sh 8000 "test" 256 | grep "Speed" | grep -o '[0-9.]*' | head -1)
if (( $(echo "${TPS:-0} < 5.0" | bc -l) )); then
  echo "Performance regression detected"
  exit 1
fi
Bottom Line
The bench-tps.sh script became an essential tool in the LLM development workflow. Its simplicity belies its effectiveness - providing immediate, actionable performance metrics without complex setup or external dependencies. For anyone developing, deploying, or optimizing LLM services, measuring tokens-per-second is the most practical way to understand real-world performance.