Measuring LLM performance shouldn't require complex tooling. After experimenting with various benchmarking approaches, I settled on a simple bash script that focuses on the one metric that matters most for interactive use: tokens per second (TPS). This tool became essential for comparing model configurations, quantization levels, and hardware setups.
The Script Explained
Core Functionality
The bench-tps.sh script is a lightweight bash utility that:
- Sends a standardized prompt to an OpenAI-compatible LLM server
- Measures the exact time from request start to response completion
- Calculates tokens-per-second from the reported completion token count and elapsed wall-clock time
- Parses server response using jq for accurate token counts
- Handles common error cases (server down, malformed responses)
Script Breakdown
Configuration and Defaults
PORT="${1:-8000}" # Server port (default: 8000)
PROMPT="${2:-Count from 1 to 50...}" # Test prompt (customizable)
MAX_TOKENS="${3:-256}" # Max output tokens (default: 256)
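All three arguments are positional and optional, so invocation stays terse:
./bench-tps.sh                            # port 8000, default prompt, 256 tokens
./bench-tps.sh 8001                       # same defaults against a server on port 8001
./bench-tps.sh 8000 "Explain DNS" 128     # custom prompt and token cap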
Timing and Request Logic
START=$(date +%s.%N) # High-precision start time
RESPONSE=$(curl -s http://localhost:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{\"messages\": [{\"role\":\"user\",\"content\":\"$PROMPT\"}],\"stream\":false,\"max_tokens\":$MAX_TOKENS}")
END=$(date +%s.%N) # End time
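One caveat worth flagging: the JSON body is assembled by string interpolation, so a prompt containing double quotes or backslashes produces invalid JSON. A safer variant (a sketch, not what the script currently does) lets jq build the payload:
# Build the request body with jq so all escaping is handled automatically
PAYLOAD=$(jq -n --arg p "$PROMPT" --argjson mt "$MAX_TOKENS" \
  '{messages: [{role: "user", content: $p}], stream: false, max_tokens: $mt}')
RESPONSE=$(curl -s "http://localhost:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD")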
Metrics Extraction
MODEL=$(echo "$RESPONSE" | jq -r '.model // "unknown"')
PROMPT_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.prompt_tokens // 0')
COMPLETION_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.completion_tokens // 0')
ELAPSED=$(echo "$END - $START" | bc)
TPS=$(echo "scale=2; $COMPLETION_TOKENS / $ELAPSED" | bc)
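The script finishes by echoing a short summary. The exact wording isn't important, but the later examples grep for a line labeled "Speed", so the reporting section boils down to something like this (a sketch of the output format, not a verbatim excerpt):
# Print the summary; the "Speed" label is what later greps match on
echo "Model: $MODEL"
echo "Prompt tokens: $PROMPT_TOKENS"
echo "Completion tokens: $COMPLETION_TOKENS"
echo "Elapsed: ${ELAPSED}s"
echo "Speed: $TPS tokens/sec"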
Why Simple TPS Matters
The Importance of Interactive Performance
While complex benchmarks measure throughput and latency, the single most important metric for interactive LLM use is tokens-per-second. This directly correlates with user experience:
- 10+ TPS: Excellent - text streams faster than most people read
- 5-10 TPS: Good - responsive with slight lag
- 2-5 TPS: Usable but noticeable delay
- <2 TPS: Frustrating for interactive work
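To translate those ranges into wait time, a quick back-of-the-envelope check (300 tokens is just an illustrative answer length):
# Rough wait time for a 300-token response at different speeds
for tps in 10 5 2; do
  echo "$tps TPS -> $(echo "300 / $tps" | bc) s for 300 tokens"
done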
Practical Applications
Configuration Comparison
# Test quantization impact
./bench-tps.sh 8000 "Hello world" 256            # Default config
./bench-tps.sh 8000 "Hello world" 256            # Same, verify consistency
# Different models
./bench-tps.sh 8000 "Hello world" 256            # Model A
./bench-tps.sh 8001 "Hello world" 256            # Model B on a different port
# Context window impact
./bench-tps.sh 8000 "Summarize: [long text]" 512 # Large context
./bench-tps.sh 8000 "Hello world" 256            # Small context
Real-World Testing Scenarios
Model Configuration Experiments
Quantization Performance Comparison
| Configuration | Prompt | Result (TPS) | Interpretation |
|---|---|---|---|
| Nemotron Q6_K | "Count from 1 to 50" | 64.82 | Excellent - MoE efficiency |
| Qwen3-32B Q5_K_M | "Count from 1 to 50" | 9.06 | Dense model slower |
| Qwen3-32B Q4_K_M | "Count from 1 to 50" | 11.3 | Lighter quantization helps |
Hardware Performance Verification
Multi-GPU Scaling
Before/After Optimization
Used this script to verify multi-GPU setup effectiveness:
# Single GPU test
./bench-tps.sh 8000 "test prompt" 256
# Result: ~13 TPS (GPU limited)
# Dual GPU setup
./bench-tps.sh 8000 "test prompt" 256
# Result: ~9.06 TPS
# Confirmed: Both GPUs engaged, model split correctly
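The TPS number alone doesn't prove both cards are doing work, so it helps to watch utilization in a second terminal while the benchmark runs; on NVIDIA hardware, something like:
# Per-GPU utilization and memory while the benchmark runs
watch -n 1 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv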
Model Loading Validation
Health Check Integration
# Part of automated server validation
if ! ./bench-tps.sh 8000 "test" 64 > /dev/null 2>&1; then
  echo "ERROR: Server not responding properly"
  exit 1
fi
echo "Server validation passed"
The script serves as both benchmark and health check - if it can't complete, the server isn't ready for production use.
Advanced Usage Patterns
Automated Testing
Benchmark Series
#!/bin/bash
# Automated benchmark loop - one server per model, each listening on its own port
models=("nemotron" "qwen3-32b" "mistral-7b")
ports=(8000 8001 8002)
prompts=("Hello" "Write Python code" "Explain physics")
echo "Running comprehensive benchmark suite..."
for i in "${!models[@]}"; do
  echo "Testing ${models[$i]} (port ${ports[$i]}):"
  for prompt in "${prompts[@]}"; do
    echo "  $prompt: $(./bench-tps.sh "${ports[$i]}" "$prompt" 256 | grep "Speed")"
  done
done
Performance Monitoring
Continuous Monitoring Script
#!/bin/bash
# monitor-performance.sh
LOG_FILE="/var/log/llm-performance.log"
while true; do
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  RESULT=$(./bench-tps.sh 8000 "test prompt" 128)
  TPS=$(echo "$RESULT" | grep "Speed" | grep -o '[0-9.]*' | head -1)
  echo "$TIMESTAMP TPS: $TPS" >> "$LOG_FILE"
  # Alert if performance drops significantly (or the benchmark fails outright)
  if (( $(echo "${TPS:-0} < 5" | bc -l) )); then
    echo "WARNING: Low TPS detected: $TPS" >> "$LOG_FILE"
  fi
  sleep 300 # Check every 5 minutes
done
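To keep the monitor running after logout, launch it detached (or wrap it in a systemd unit):
nohup ./monitor-performance.sh > /dev/null 2>&1 &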
Error Handling Insights
Common Failure Modes Discovered
Server Loading Issues
The script revealed critical timing issues during model loading:
# Health endpoint responds immediately
curl http://localhost:8000/health
{"status":"ok"}
# But TPS script fails with model loading error
./bench-tps.sh 8000 "test" 64
Error: Loading model
# Root cause: HTTP server starts before model loads completely
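The practical workaround is to treat a successful bench-tps.sh run, not the health endpoint, as the readiness signal; a retry loop along these lines (attempt count and sleep interval are arbitrary choices):
# Wait until the model has actually loaded before trusting the server
for attempt in $(seq 1 30); do
  if ./bench-tps.sh 8000 "test" 16 > /dev/null 2>&1; then
    echo "Model ready after $attempt attempt(s)"
    break
  fi
  sleep 10
done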
Resource Exhaustion Detection
- Out of Memory: Script returns "No response from server"
- Port Conflicts: curl fails to connect, returns connection refused
- Model Not Loaded: Returns JSON error with "Loading model" message
- Server Overload: Extremely slow TPS indicates resource starvation
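Handling these cases boils down to a couple of guard clauses after the curl call. Roughly this shape (a sketch: variable names match the earlier excerpts, message wording is approximate):
# Empty body: server down, port conflict, or OOM-killed process
if [[ -z "$RESPONSE" ]]; then
  echo "No response from server on port $PORT"
  exit 1
fi
# Server answered, but with an error object (e.g. model still loading)
ERR=$(echo "$RESPONSE" | jq -r '.error.message // empty')
if [[ -n "$ERR" ]]; then
  echo "Error: $ERR"
  exit 1
fi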
Script Evolution
Design Decisions
- Bash only: No external dependencies except curl, jq, and bc
- Curl vs HTTP clients: curl is ubiquitous, reliable for simple POST requests
- jq for parsing: JSON parsing safer than text processing
- Sub-second precision: date +%s.%N yields nanosecond-resolution timestamps, more than enough timing accuracy
- Streaming disabled: Non-streaming responses easier to parse reliably
Future Improvements
- Multiple runs: Calculate average TPS over several runs
- Latency breakdown: Measure time-to-first-token separately
- JSON output: Machine-readable results for automation
- Context window testing: Variable context sizes to test KV cache impact
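The first item is simple enough to sketch as a thin wrapper (a hypothetical bench-avg.sh, not part of the current tooling):
#!/bin/bash
# bench-avg.sh - average TPS over several runs of bench-tps.sh
RUNS="${1:-5}"
TOTAL=0
for i in $(seq 1 "$RUNS"); do
  TPS=$(./bench-tps.sh 8000 "Count from 1 to 50" 256 | grep "Speed" | grep -o '[0-9.]*' | head -1)
  echo "Run $i: $TPS TPS"
  TOTAL=$(echo "$TOTAL + $TPS" | bc)
done
echo "Average: $(echo "scale=2; $TOTAL / $RUNS" | bc) TPS over $RUNS runs"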
Key Learnings from Usage
Performance Realities
Consistent benchmarking revealed important performance insights:
- MoE superiority: 7x speed advantage of Nemotron over Qwen3 with similar parameter counts
- Quantization impact: Q4 vs Q5 can provide significant speed gains with quality trade-offs
- Model loading overhead: First request after server start often slower than subsequent ones
The Value of Simple Metrics
Complex vs. Simple
While sophisticated benchmarking suites exist, the simple TPS metric directly correlates with user experience. When a model feels "slow" or "fast" in interactive use, it's almost always reflected in the TPS numbers.
The script's simplicity makes it perfect for:
- Quick performance checks during development
- Automated health monitoring
- Configuration optimization
- Hardware upgrade justification
Integration into Testing Workflow
Development Validation
# Before/after model changes
BEFORE_TPS=$(./bench-tps.sh 8000 "test" 256 | grep "Speed")
# Apply configuration changes
AFTER_TPS=$(./bench-tps.sh 8000 "test" 256 | grep "Speed")
echo "Performance impact: $BEFORE_TPS -> $AFTER_TPS"
Automated Testing
# CI/CD integration
TPS=$(./bench-tps.sh 8000 "test" 256 | grep "Speed" | grep -o '[0-9.]*' | head -1)
if (( $(echo "${TPS:-0} < 5.0" | bc -l) )); then
  echo "Performance regression detected"
  exit 1
fi
Bottom Line
The bench-tps.sh script became an essential tool in the LLM development workflow. Its simplicity belies its effectiveness - providing immediate, actionable performance metrics without complex setup or external dependencies. For anyone developing, deploying, or optimizing LLM services, measuring tokens-per-second is the most practical way to understand real-world performance.