Tokens-per-Second Benchmark Script

A lightweight tool for measuring LLM inference performance across different configurations
January 2026

Measuring LLM performance shouldn't require complex tooling. After experimenting with various benchmarking approaches, I settled on a simple bash script that focuses on the one metric that matters most for interactive use: tokens per second (TPS). This tool became essential for comparing model configurations, quantization levels, and hardware setups.

The Script Explained

Core Functionality

The bench-tps.sh script is a lightweight bash utility that:

  1. Sends a single non-streaming chat completion request to a local OpenAI-compatible endpoint
  2. Times the round trip with sub-second precision
  3. Reads the server-reported token counts from the response's usage field
  4. Computes completion tokens per second

Script Breakdown

Configuration and Defaults

PORT="${1:-8000}"                    # Server port (default: 8000)
PROMPT="${2:-Count from 1 to 50...}"  # Test prompt (customizable)
MAX_TOKENS="${3:-256}"               # Max output tokens (default: 256)
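
All three arguments are positional and optional, so invocations stay terse:

./bench-tps.sh                               # All defaults: port 8000, stock prompt, 256 tokens
./bench-tps.sh 8001                          # Same defaults against a server on port 8001
./bench-tps.sh 8000 "Explain KV cache" 512   # Custom prompt and output budget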

Timing and Request Logic

START=$(date +%s.%N)                # High-precision start time
RESPONSE=$(curl -s http://localhost:$PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"messages\": [{\"role\":\"user\",\"content\":\"$PROMPT\"}],\"stream\":false,\"max_tokens\":$MAX_TOKENS}")
END=$(date +%s.%N)                   # End time
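
One caveat: interpolating $PROMPT straight into the JSON body breaks on prompts containing quotes or backslashes. A safer variant - a sketch, not the original script - lets jq build the payload:

# jq handles JSON escaping, so arbitrary prompts survive intact
PAYLOAD=$(jq -n --arg p "$PROMPT" --argjson m "$MAX_TOKENS" \
  '{messages: [{role: "user", content: $p}], stream: false, max_tokens: $m}')
RESPONSE=$(curl -s "http://localhost:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" -d "$PAYLOAD")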

Metrics Extraction

MODEL=$(echo "$RESPONSE" | jq -r '.model // "unknown"')
PROMPT_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.prompt_tokens // 0')
COMPLETION_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.completion_tokens // 0')
ELAPSED=$(echo "$END - $START" | bc)
TPS=$(echo "scale=2; $COMPLETION_TOKENS / $ELAPSED" | bc)
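
The script ends by printing a small report. The exact lines aren't reproduced here, but the examples later in this post grep for "Speed", so it looks roughly like this:

# Print the report (a sketch; later examples rely on the "Speed:" line)
echo "Model: $MODEL"
echo "Tokens: $PROMPT_TOKENS prompt, $COMPLETION_TOKENS completion"
echo "Elapsed: ${ELAPSED}s"
echo "Speed: $TPS tokens/sec"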

Why Simple TPS Matters

The Importance of Interactive Performance

While complex benchmarks measure throughput and latency percentiles, the single most important metric for interactive LLM use is tokens per second. It maps directly onto user experience: at high TPS a response streams faster than you can read it, while at low TPS you wait on every word.

Practical Applications

Configuration Comparison

# Quantization impact: reload the server with a different quant, rerun the same command
./bench-tps.sh 8000 "Hello world" 256  # e.g. Q5_K_M build
./bench-tps.sh 8000 "Hello world" 256  # Same command after reloading as Q4_K_M

# Different models served on different ports
./bench-tps.sh 8000 "Hello world" 256  # Model A
./bench-tps.sh 8001 "Hello world" 256  # Model B

# Context window impact
./bench-tps.sh 8000 "Summarize: [long text]" 512  # Large context
./bench-tps.sh 8000 "Hello world" 256             # Small context

Real-World Testing Scenarios

Model Configuration Experiments

Quantization Performance Comparison

Configuration      Prompt                 Result (TPS)   Interpretation
Nemotron Q6_K      "Count from 1 to 50"   64.82          Excellent - MoE efficiency
Qwen3-32B Q5_K_M   "Count from 1 to 50"   9.06           Dense model slower
Qwen3-32B Q4_K_M   "Count from 1 to 50"   11.3           Lighter quantization helps

Hardware Performance Verification

Multi-GPU Scaling

Before/After Optimization

Used this script to verify the multi-GPU setup - the question was whether the model split was actually engaged, not whether two cards were faster:

# Single GPU test
./bench-tps.sh 8000 "test prompt" 256
# Result: ~13 TPS (GPU limited)

# Dual GPU setup
./bench-tps.sh 8000 "test prompt" 256  
# Result: ~9.06 TPS
# Confirmed: Both GPUs engaged, model split correctly
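
A quick way to confirm both cards are engaged is to watch per-GPU utilization while a benchmark runs; the nvidia-smi invocation below is illustrative and not part of the script:

# Run a benchmark in the background, sample per-GPU utilization every second (Ctrl-C to stop)
./bench-tps.sh 8000 "test prompt" 256 &
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1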

Model Loading Validation

Health Check Integration

# Part of automated server validation
if ! ./bench-tps.sh 8000 "test" 64 > /dev/null 2>&1; then
    echo "ERROR: Server not responding properly"
    exit 1
fi
echo "Server validation passed"

The script serves as both benchmark and health check - if it can't complete, the server isn't ready for production use.
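
For unattended machines the same check drops straight into cron; the schedule and install path here are illustrative:

# crontab entry: validate the server every 15 minutes, log failures to syslog
*/15 * * * * /opt/llm/bench-tps.sh 8000 "test" 64 > /dev/null 2>&1 || logger -t llm-bench "health check failed"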

Advanced Usage Patterns

Automated Testing

Benchmark Series

#!/bin/bash
# Automated benchmark loop - each model is assumed to live on its own port
models=("nemotron" "qwen3-32b" "mistral-7b")
ports=(8000 8001 8002)
prompts=("Hello" "Write Python code" "Explain physics")

echo "Running comprehensive benchmark suite..."
for i in "${!models[@]}"; do
    echo "Testing ${models[$i]} (port ${ports[$i]}):"
    for prompt in "${prompts[@]}"; do
        echo "  \"$prompt\": $(./bench-tps.sh "${ports[$i]}" "$prompt" 256 | grep "Speed")"
    done
done

Performance Monitoring

Continuous Monitoring Script

#!/bin/bash
# monitor-performance.sh
LOG_FILE="/var/log/llm-performance.log"

while true; do
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    RESULT=$(./bench-tps.sh 8000 "test prompt" 128)
    # Extract the numeric value from the "Speed:" line
    TPS=$(echo "$RESULT" | grep "Speed" | grep -o '[0-9.]*' | head -1)

    if [[ -z "$TPS" ]]; then
        echo "$TIMESTAMP ERROR: benchmark failed" >> "$LOG_FILE"
    else
        echo "$TIMESTAMP TPS: $TPS" >> "$LOG_FILE"
        # Alert if performance drops significantly
        if (( $(echo "$TPS < 5" | bc -l) )); then
            echo "$TIMESTAMP WARNING: Low TPS detected: $TPS" >> "$LOG_FILE"
        fi
    fi

    sleep 300  # Check every 5 minutes
done
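
To spot drift later, a quick awk pass summarizes the collected samples (assuming the log format written above):

# Average and minimum TPS across all logged samples
awk '$3 == "TPS:" {sum += $4; n++; if (min == "" || $4 + 0 < min + 0) min = $4}
     END {if (n) printf "samples=%d  avg=%.2f  min=%.2f\n", n, sum / n, min}' /var/log/llm-performance.log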

Error Handling Insights

Common Failure Modes Discovered

Server Loading Issues

The script revealed critical timing issues during model loading:

# Health endpoint responds immediately
curl http://localhost:8000/health
{"status":"ok"}

# But TPS script fails with model loading error
./bench-tps.sh 8000 "test" 64
Error: Loading model

# Root cause: HTTP server starts before model loads completely
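
The practical fix is to gate on a real completion rather than /health. A minimal sketch, assuming (as the health check above does) that the script exits non-zero when the request fails; the retry count and interval are arbitrary:

# Poll with a tiny real completion until the model can actually generate
for i in $(seq 1 30); do
    if ./bench-tps.sh 8000 "ping" 8 > /dev/null 2>&1; then
        echo "Model ready after ~$((i * 10))s"
        break
    fi
    sleep 10
done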

Resource Exhaustion Detection
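
Memory pressure tends to show up as runs that hang or crawl rather than fail cleanly. Wrapping the benchmark in coreutils timeout turns that into a loud signal - a sketch, with an arbitrary 120-second budget:

# Fail fast if a run exceeds two minutes, e.g. when the model starts swapping
if ! timeout 120 ./bench-tps.sh 8000 "test" 64 > /dev/null 2>&1; then
    echo "WARNING: run timed out or failed - possible resource exhaustion"
fi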

Script Evolution

Design Decisions

The key choices are visible in the script itself: a non-streaming request keeps the timing logic to two date calls, the server's own usage field supplies exact token counts instead of client-side estimates, and the only dependencies are curl, jq, and bc.

Future Improvements

  1. Multiple runs: Calculate average TPS over several runs (sketched after this list)
  2. Latency breakdown: Measure time-to-first-token separately
  3. JSON output: Machine-readable results for automation
  4. Context window testing: Variable context sizes to test KV cache impact
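
The first item is only a few lines of wrapper. A sketch, assuming the "Speed:" output line used throughout this post:

#!/bin/bash
# avg-tps.sh (hypothetical wrapper): run bench-tps.sh N times and average the results
RUNS="${1:-5}"
TOTAL=0
for i in $(seq 1 "$RUNS"); do
    TPS=$(./bench-tps.sh 8000 "Count from 1 to 50" 256 | grep "Speed" | grep -o '[0-9.]*' | head -1)
    echo "Run $i: $TPS TPS"
    TOTAL=$(echo "$TOTAL + $TPS" | bc)
done
echo "Average over $RUNS runs: $(echo "scale=2; $TOTAL / $RUNS" | bc) TPS"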

Key Learnings from Usage

Performance Realities

Consistent benchmarking surfaced the realities documented above: MoE architectures deliver several times the TPS of dense models in the same class, stepping down a quantization level buys measurable speed, and multi-GPU splits need to be verified rather than assumed.

The Value of Simple Metrics

Complex vs. Simple

While sophisticated benchmarking suites exist, the simple TPS metric directly correlates with user experience. When a model feels "slow" or "fast" in interactive use, it's almost always reflected in the TPS numbers.

The script's simplicity makes it perfect for quick sanity checks after configuration changes, side-by-side model comparisons, and automated health monitoring.

Integration into Testing Workflow

Development Validation

# Before/after model changes
BEFORE_TPS=$(./bench-tps.sh 8000 "test" 256 | grep "Speed" | grep -o '[0-9.]*' | head -1)
# Apply configuration changes, then measure again
AFTER_TPS=$(./bench-tps.sh 8000 "test" 256 | grep "Speed" | grep -o '[0-9.]*' | head -1)

echo "Performance impact: $BEFORE_TPS -> $AFTER_TPS TPS"

Automated Testing

# CI/CD integration: fail the build on a performance regression
# (note: [[ "$TPS" < "5.0" ]] would compare lexicographically - use bc for numbers)
TPS=$(./bench-tps.sh 8000 "test" 256 | grep "Speed" | grep -o '[0-9.]*' | head -1)
if (( $(echo "$TPS < 5.0" | bc -l) )); then
    echo "Performance regression detected (TPS: $TPS)"
    exit 1
fi

Bottom Line

The bench-tps.sh script became an essential tool in the LLM development workflow. Its simplicity belies its effectiveness: immediate, actionable performance numbers with nothing more than curl, jq, and bc. For anyone developing, deploying, or optimizing LLM services, measuring tokens per second is the most practical way to understand real-world performance.