LLM Garage

Home Engineer's AI Hardware Journal


Tokens-per-Second Benchmark Script

Lightweight tool for measuring LLM inference performance
January 2026

Measuring LLM performance shouldn't require complex tooling. After experimenting with various benchmarking approaches, I settled on a simple bash script that focuses on the one metric that matters most for interactive use: tokens per second (TPS). This tool became essential for comparing model configurations, quantization levels, and hardware setups.

The Script Explained

Core Functionality

The bench-tps.sh script is a lightweight bash utility that:

  • Sends a standardized prompt to an OpenAI-compatible LLM server
  • Measures the exact time from request start to response completion
  • Calculates tokens-per-second based on generated output length
  • Parses server response using jq for accurate token counts
  • Handles common error cases (server down, malformed responses)

Script Breakdown

Configuration and Defaults

Variables with default values for port, prompt, and max tokens:

  • PORT - Server port (default: 8000)
  • PROMPT - Test prompt (default: "Count from 1 to 50...")
  • MAX_TOKENS - Max output tokens (default: 256)
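The defaults above can be sketched with bash's `${VAR:-default}` expansion so each setting is overridable from the environment; the variable names follow the article, but the exact default prompt text is an assumption:

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the configuration block.
# The full default prompt wording is assumed, not taken from the script.
PORT="${PORT:-8000}"
PROMPT="${PROMPT:-Count from 1 to 50, separated by commas.}"
MAX_TOKENS="${MAX_TOKENS:-256}"
```

Any setting can then be overridden per run without editing the script, e.g. `PORT=8001 ./bench-tps.sh`.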

Timing and Request Logic

High-precision timing using the date command with nanosecond resolution:

  • START=$(date +%s.%N) - High-precision start time
  • curl request to /v1/chat/completions endpoint
  • END=$(date +%s.%N) - End time
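A minimal sketch of the timing-and-request step, assuming an OpenAI-compatible server on localhost. The endpoint path and timing calls come from the article; the JSON payload shape and the awk elapsed-time calculation are assumptions (note that `%N` requires GNU date, standard on Linux):

```shell
# Sketch of the timing wrapper around the completion request.
# PORT, PROMPT, and MAX_TOKENS are the configuration variables described above.
PORT="${PORT:-8000}"
PROMPT="${PROMPT:-Count from 1 to 50}"
MAX_TOKENS="${MAX_TOKENS:-256}"

START=$(date +%s.%N)   # wall-clock start, nanosecond resolution (GNU date)
RESPONSE=$(curl -s --max-time 120 "http://localhost:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"messages\": [{\"role\": \"user\", \"content\": \"${PROMPT}\"}], \"max_tokens\": ${MAX_TOKENS}}" \
  || echo "")          # empty response if the server is unreachable
END=$(date +%s.%N)     # wall-clock end, after the full response has arrived

ELAPSED=$(awk -v s="$START" -v e="$END" 'BEGIN { printf "%.6f", e - s }')
```

Measuring wall-clock time around the whole request means `ELAPSED` includes prompt processing, not just generation, which is exactly what an interactive user experiences.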

Metrics Extraction

Using jq to parse JSON response:

  • MODEL - Extract model name from response
  • PROMPT_TOKENS - usage.prompt_tokens from the response
  • COMPLETION_TOKENS - usage.completion_tokens from the response
  • TPS calculation: completion_tokens / elapsed_time
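The extraction step can be sketched against a canned response so it runs without a live server; the field names follow the standard OpenAI chat-completions schema, and the sample values here are made up:

```shell
# jq extraction sketch, run against a canned response. With a live server,
# RESPONSE and ELAPSED would come from the curl/timing step instead.
RESPONSE='{"model":"demo-model","usage":{"prompt_tokens":12,"completion_tokens":128}}'
ELAPSED=2.0

MODEL=$(echo "$RESPONSE" | jq -r '.model')
PROMPT_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.prompt_tokens')
COMPLETION_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.completion_tokens')

# TPS = completion tokens divided by wall-clock elapsed seconds
TPS=$(awk -v t="$COMPLETION_TOKENS" -v e="$ELAPSED" 'BEGIN { printf "%.2f", t / e }')

echo "model=$MODEL prompt_tokens=$PROMPT_TOKENS tps=$TPS"
# -> model=demo-model prompt_tokens=12 tps=64.00
```

Using the server's own `usage.completion_tokens` count avoids guessing token boundaries client-side, so the TPS figure matches what the backend actually generated.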

Why Simple TPS Matters

The Importance of Interactive Performance

While complex benchmark suites measure throughput and latency percentiles, the single most important metric for interactive LLM use is tokens-per-second, because it maps directly onto how responsive the model feels in a chat session.

Practical Applications

Configuration Comparison

  • Test quantization impact: run the same prompt against different quantization levels
  • Compare different models served on different ports
  • Test context window impact with longer prompts

Real-World Testing Scenarios

Model Configuration Experiments

Quantization Performance Comparison

Configuration       Prompt                 Result (TPS)   Interpretation
Nemotron Q6_K       "Count from 1 to 50"   64.82          Excellent - MoE efficiency
Qwen3-32B Q5_K_M    "Count from 1 to 50"   9.06           Dense model slower
Qwen3-32B Q4_K_M    "Count from 1 to 50"   11.3           Lighter quantization helps

Hardware Performance Verification

Multi-GPU Scaling

Before/After Optimization

Used this script to verify multi-GPU setup effectiveness:

  • Single-GPU test: ~13 TPS (GPU-limited)
  • Dual-GPU setup: ~9.06 TPS, confirming both GPUs were engaged

Model Loading Validation

Health Check Integration

The script doubles as a health check: if it can't complete a request, the server isn't ready for production use.

Advanced Usage Patterns

Automated Testing

Benchmark Series

Loop through multiple models and prompts for comprehensive benchmarking:

  • Define array of models to test
  • Iterate through ports and prompts
  • Collect and compare results
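The loop above can be sketched as follows; the port/model pairs and prompts are illustrative, and `bench-tps.sh` is assumed to print a single TPS number on stdout:

```shell
# Hypothetical benchmark series: one model per port, several prompts.
declare -A MODELS=( [8000]="nemotron-q6k" [8001]="qwen3-32b-q5km" )
PROMPTS=("Count from 1 to 50" "Explain TCP slow start in one paragraph")

for PORT in "${!MODELS[@]}"; do
  for PROMPT in "${PROMPTS[@]}"; do
    # "n/a" when the server on this port is down or the script is missing
    TPS=$(PORT="$PORT" PROMPT="$PROMPT" ./bench-tps.sh 2>/dev/null || echo "n/a")
    printf '%-18s %-45s %s\n' "${MODELS[$PORT]}" "$PROMPT" "$TPS"
  done
done
```

Piping the output to a file gives a comparison table for free, one row per model/prompt pair.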

Performance Monitoring

Continuous Monitoring Script

Set up periodic checks to detect performance degradation:

  • Log TPS results with timestamps
  • Alert if TPS drops below threshold
  • Run every 5 minutes via cron
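A sketch of such a monitoring wrapper; the threshold value, log path, and alert text are assumptions, and `bench-tps.sh` is again assumed to print only a number:

```shell
# Monitoring sketch: append a timestamped TPS reading to a log and warn
# when it falls below a floor. THRESHOLD and LOG are assumed values.
THRESHOLD=5
LOG="${LOG:-/tmp/llm-tps.log}"

TPS=$(./bench-tps.sh 2>/dev/null || echo "0")   # 0 if the benchmark fails
echo "$(date -Is) tps=$TPS" >> "$LOG"

if awk -v t="$TPS" -v min="$THRESHOLD" 'BEGIN { exit !(t < min) }'; then
  echo "ALERT: TPS ${TPS} below threshold ${THRESHOLD}" >&2
fi
```

Scheduled via cron, e.g. `*/5 * * * * /opt/llm/tps-monitor.sh` (path hypothetical), this catches gradual degradation such as thermal throttling or VRAM pressure between runs.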

Error Handling Insights

Common Failure Modes Discovered

Server Loading Issues

The script revealed critical timing issues during model loading:

  • Health endpoint responds immediately
  • But the TPS script fails with a "Loading model" error
  • Root cause: HTTP server starts before model loads completely
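Given that root cause, one workaround is to treat a successful completion as the readiness signal rather than trusting the health endpoint. A sketch, with retry count and interval as assumptions:

```shell
# wait_ready: poll with a real completion request until one succeeds.
# $1 = max attempts (default 30), $2 = seconds between attempts (default 10);
# both defaults are assumed values, not taken from the original script.
wait_ready() {
  local max="${1:-30}" interval="${2:-10}" attempt
  for attempt in $(seq 1 "$max"); do
    if ./bench-tps.sh >/dev/null 2>&1; then
      echo "server ready after attempt ${attempt}"
      return 0
    fi
    sleep "$interval"
  done
  echo "server never became ready" >&2
  return 1
}
```

For example, `wait_ready 30 10` before kicking off a benchmark series avoids the false "server up" signal from the bare health endpoint.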

Resource Exhaustion Detection

Script Evolution

Design Decisions

Future Improvements

  1. Multiple runs: Calculate average TPS over several runs
  2. Latency breakdown: Measure time-to-first-token separately
  3. JSON output: Machine-readable results for automation
  4. Context window testing: Variable context sizes to test KV cache impact
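The first improvement is simple to retrofit. A sketch of averaging over several runs, where the run count and the single-number output convention are assumptions:

```shell
# avg_tps: run bench-tps.sh N times and print the mean TPS.
# Assumes the benchmark prints a single number per run; failed runs count as 0.
avg_tps() {
  local runs="${1:-5}" total=0 i tps
  for i in $(seq 1 "$runs"); do
    tps=$(./bench-tps.sh 2>/dev/null || echo "0")
    total=$(awk -v a="$total" -v b="$tps" 'BEGIN { printf "%.4f", a + b }')
  done
  awk -v t="$total" -v n="$runs" 'BEGIN { printf "%.2f\n", t / n }'
}
```

Averaging smooths out run-to-run noise from cache warm-up and background load, which a single measurement hides.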

Key Learnings from Usage

Performance Realities

Consistent benchmarking revealed performance patterns that single ad-hoc runs would have missed.

The Value of Simple Metrics

Complex vs. Simple

While sophisticated benchmarking suites exist, the simple TPS metric directly correlates with user experience. When a model feels "slow" or "fast" in interactive use, it's almost always reflected in the TPS numbers.

The script's simplicity makes it perfect for:

  • Quick performance checks during development
  • Automated health monitoring
  • Configuration optimization
  • Hardware upgrade justification

Integration into Testing Workflow

Development Validation

Compare TPS before and after model changes to measure performance impact.

Automated Testing

Integrate the script into CI/CD to catch performance regressions: fail the build if TPS drops below a set threshold.
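A minimal CI gate along those lines; the floor value and the exit-code convention are assumptions:

```shell
# tps_gate: return nonzero when measured TPS falls below a floor, so a CI
# job fails on performance regressions. $1 = minimum acceptable TPS.
tps_gate() {
  local min="${1:-8}" tps
  tps=$(./bench-tps.sh 2>/dev/null || echo "0")   # 0 if the benchmark fails
  if awk -v t="$tps" -v m="$min" 'BEGIN { exit !(t < m) }'; then
    echo "FAIL: ${tps} tok/s below ${min} tok/s floor" >&2
    return 1
  fi
  echo "PASS: ${tps} tok/s"
}
```

Because most CI systems fail a step on any nonzero exit status, `tps_gate 8` is enough to block a merge that regresses inference speed.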

Bottom Line

The bench-tps.sh script became an essential tool in the LLM development workflow. Its simplicity is deceptive: it provides immediate, actionable performance numbers with no setup beyond curl and jq. For anyone developing, deploying, or optimizing LLM services, measuring tokens-per-second is the most practical way to understand real-world performance.