LLM Garage

Home Engineer's AI Hardware Journal


Tokens-per-Second Benchmark Script

Lightweight tool for measuring LLM inference performance
January 2026

Measuring LLM performance shouldn't require complex tooling. After experimenting with various benchmarking approaches, I settled on a simple bash script that focuses on the one metric that matters most for interactive use: tokens per second (TPS). This tool became essential for comparing model configurations, quantization levels, and hardware setups.

The Script Explained

Core Functionality

The bench-tps.sh script is a lightweight bash utility that:

  • Sends a standardized prompt to an OpenAI-compatible LLM server
  • Measures the exact time from request start to response completion
  • Calculates tokens-per-second based on generated output length
  • Parses server response using jq for accurate token counts
  • Handles common error cases (server down, malformed responses)

Script Breakdown

Configuration and Defaults

Variables with default values for port, prompt, and max tokens:

  • PORT - Server port (default: 8000)
  • PROMPT - Test prompt (default: "Count from 1 to 50...")
  • MAX_TOKENS - Max output tokens (default: 256)
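The defaults above can be sketched with bash's `${VAR:-default}` expansion so each setting is overridable from the environment; the variable names follow the article, but the exact default prompt text is an assumption:

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the configuration block.
# The full default prompt wording is assumed, not taken from the script.
PORT="${PORT:-8000}"
PROMPT="${PROMPT:-Count from 1 to 50, separated by commas.}"
MAX_TOKENS="${MAX_TOKENS:-256}"
```

Any setting can then be overridden per run without editing the script, e.g. `PORT=8001 ./bench-tps.sh`.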

Timing and Request Logic

High-precision timing using the date command with nanosecond resolution:

  • START=$(date +%s.%N) - High-precision start time
  • curl request to /v1/chat/completions endpoint
  • END=$(date +%s.%N) - End time
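A minimal sketch of the timing-and-request step, assuming an OpenAI-compatible server on localhost. The endpoint path and timing calls come from the article; the JSON payload shape and the awk elapsed-time calculation are assumptions (note that `%N` requires GNU date, standard on Linux):

```shell
# Sketch of the timing wrapper around the completion request.
# PORT, PROMPT, and MAX_TOKENS are the configuration variables described above.
PORT="${PORT:-8000}"
PROMPT="${PROMPT:-Count from 1 to 50}"
MAX_TOKENS="${MAX_TOKENS:-256}"

START=$(date +%s.%N)   # wall-clock start, nanosecond resolution (GNU date)
RESPONSE=$(curl -s --max-time 120 "http://localhost:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"messages\": [{\"role\": \"user\", \"content\": \"${PROMPT}\"}], \"max_tokens\": ${MAX_TOKENS}}" \
  || echo "")          # empty response if the server is unreachable
END=$(date +%s.%N)     # wall-clock end, after the full response has arrived

ELAPSED=$(awk -v s="$START" -v e="$END" 'BEGIN { printf "%.6f", e - s }')
```

Measuring wall-clock time around the whole request means `ELAPSED` includes prompt processing, not just generation, which is exactly what an interactive user experiences.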

Metrics Extraction

Using jq to parse JSON response:

  • MODEL - Extract model name from response
  • PROMPT_TOKENS - usage.prompt_tokens from the response
  • COMPLETION_TOKENS - usage.completion_tokens from the response
  • TPS calculation: completion_tokens / elapsed_time
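The extraction step can be sketched against a canned response so it runs without a live server; the field names follow the standard OpenAI chat-completions schema, and the sample values here are made up:

```shell
# jq extraction sketch, run against a canned response. With a live server,
# RESPONSE and ELAPSED would come from the curl/timing step instead.
RESPONSE='{"model":"demo-model","usage":{"prompt_tokens":12,"completion_tokens":128}}'
ELAPSED=2.0

MODEL=$(echo "$RESPONSE" | jq -r '.model')
PROMPT_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.prompt_tokens')
COMPLETION_TOKENS=$(echo "$RESPONSE" | jq -r '.usage.completion_tokens')

# TPS = completion tokens divided by wall-clock elapsed seconds
TPS=$(awk -v t="$COMPLETION_TOKENS" -v e="$ELAPSED" 'BEGIN { printf "%.2f", t / e }')

echo "model=$MODEL prompt_tokens=$PROMPT_TOKENS tps=$TPS"
# -> model=demo-model prompt_tokens=12 tps=64.00
```

Using the server's own `usage.completion_tokens` count avoids guessing token boundaries client-side, so the TPS figure matches what the backend actually generated.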

Why Simple TPS Matters

The Importance of Interactive Performance

While complex benchmark suites measure throughput and latency percentiles, the single most important metric for interactive LLM use is tokens-per-second, because it maps directly onto how responsive the model feels in a chat session.

Practical Applications

Configuration Comparison

  • Test quantization impact: run the same prompt against different quantization levels
  • Compare different models served on different ports
  • Test context window impact with longer prompts

Real-World Testing Scenarios

Model Configuration Experiments

Quantization Performance Comparison

Configuration       Prompt                 Result (TPS)   Interpretation
Nemotron Q6_K       "Count from 1 to 50"   64.82          Excellent - MoE efficiency
Qwen3-32B Q5_K_M    "Count from 1 to 50"   9.06           Dense model slower
Qwen3-32B Q4_K_M    "Count from 1 to 50"   11.3           Lighter quantization helps

Hardware Performance Verification

Multi-GPU Scaling

Before/After Optimization

Used this script to verify multi-GPU setup effectiveness:

  • Single-GPU test: ~13 TPS (GPU-limited)
  • Dual-GPU setup: ~9.06 TPS, confirming both GPUs were engaged

Model Loading Validation

Health Check Integration

The script doubles as a health check: if it can't complete a request, the server isn't ready for production use.

Advanced Usage Patterns

Automated Testing

Benchmark Series

Loop through multiple models and prompts for comprehensive benchmarking:

  • Define array of models to test
  • Iterate through ports and prompts
  • Collect and compare results
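The loop above can be sketched as follows; the port/model pairs and prompts are illustrative, and `bench-tps.sh` is assumed to print a single TPS number on stdout:

```shell
# Hypothetical benchmark series: one model per port, several prompts.
declare -A MODELS=( [8000]="nemotron-q6k" [8001]="qwen3-32b-q5km" )
PROMPTS=("Count from 1 to 50" "Explain TCP slow start in one paragraph")

for PORT in "${!MODELS[@]}"; do
  for PROMPT in "${PROMPTS[@]}"; do
    # "n/a" when the server on this port is down or the script is missing
    TPS=$(PORT="$PORT" PROMPT="$PROMPT" ./bench-tps.sh 2>/dev/null || echo "n/a")
    printf '%-18s %-45s %s\n' "${MODELS[$PORT]}" "$PROMPT" "$TPS"
  done
done
```

Piping the output to a file gives a comparison table for free, one row per model/prompt pair.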

Performance Monitoring

Continuous Monitoring Script

Set up periodic checks to detect performance degradation:

  • Log TPS results with timestamps
  • Alert if TPS drops below threshold
  • Run every 5 minutes via cron
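A sketch of such a monitoring wrapper; the threshold value, log path, and alert text are assumptions, and `bench-tps.sh` is again assumed to print only a number:

```shell
# Monitoring sketch: append a timestamped TPS reading to a log and warn
# when it falls below a floor. THRESHOLD and LOG are assumed values.
THRESHOLD=5
LOG="${LOG:-/tmp/llm-tps.log}"

TPS=$(./bench-tps.sh 2>/dev/null || echo "0")   # 0 if the benchmark fails
echo "$(date -Is) tps=$TPS" >> "$LOG"

if awk -v t="$TPS" -v min="$THRESHOLD" 'BEGIN { exit !(t < min) }'; then
  echo "ALERT: TPS ${TPS} below threshold ${THRESHOLD}" >&2
fi
```

Scheduled via cron, e.g. `*/5 * * * * /opt/llm/tps-monitor.sh` (path hypothetical), this catches gradual degradation such as thermal throttling or VRAM pressure between runs.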

Error Handling Insights

Common Failure Modes Discovered

Server Loading Issues

The script revealed critical timing issues during model loading:

  • Health endpoint responds immediately
  • But the TPS script fails with a "Loading model" error
  • Root cause: HTTP server starts before model loads completely
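Given that root cause, one workaround is to treat a successful completion as the readiness signal rather than trusting the health endpoint. A sketch, with retry count and interval as assumptions:

```shell
# wait_ready: poll with a real completion request until one succeeds.
# $1 = max attempts (default 30), $2 = seconds between attempts (default 10);
# both defaults are assumed values, not taken from the original script.
wait_ready() {
  local max="${1:-30}" interval="${2:-10}" attempt
  for attempt in $(seq 1 "$max"); do
    if ./bench-tps.sh >/dev/null 2>&1; then
      echo "server ready after attempt ${attempt}"
      return 0
    fi
    sleep "$interval"
  done
  echo "server never became ready" >&2
  return 1
}
```

For example, `wait_ready 30 10` before kicking off a benchmark series avoids the false "server up" signal from the bare health endpoint.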

Resource Exhaustion Detection

Script Evolution

Design Decisions

Future Improvements

  1. Multiple runs: Calculate average TPS over several runs
  2. Latency breakdown: Measure time-to-first-token separately
  3. JSON output: Machine-readable results for automation
  4. Context window testing: Variable context sizes to test KV cache impact
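The first improvement is simple to retrofit. A sketch of averaging over several runs, where the run count and the single-number output convention are assumptions:

```shell
# avg_tps: run bench-tps.sh N times and print the mean TPS.
# Assumes the benchmark prints a single number per run; failed runs count as 0.
avg_tps() {
  local runs="${1:-5}" total=0 i tps
  for i in $(seq 1 "$runs"); do
    tps=$(./bench-tps.sh 2>/dev/null || echo "0")
    total=$(awk -v a="$total" -v b="$tps" 'BEGIN { printf "%.4f", a + b }')
  done
  awk -v t="$total" -v n="$runs" 'BEGIN { printf "%.2f\n", t / n }'
}
```

Averaging smooths out run-to-run noise from cache warm-up and background load, which a single measurement hides.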

Key Learnings from Usage

Performance Realities

Consistent benchmarking revealed performance patterns that single ad-hoc runs would have missed.

The Value of Simple Metrics

Complex vs. Simple

While sophisticated benchmarking suites exist, the simple TPS metric directly correlates with user experience. When a model feels "slow" or "fast" in interactive use, it's almost always reflected in the TPS numbers.

The script's simplicity makes it perfect for:

  • Quick performance checks during development
  • Automated health monitoring
  • Configuration optimization
  • Hardware upgrade justification

Integration into Testing Workflow

Development Validation

Compare TPS before and after model changes to measure performance impact.

Automated Testing

Integrate the script into CI/CD to catch performance regressions: fail the build if TPS drops below a set threshold.
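A minimal CI gate along those lines; the floor value and the exit-code convention are assumptions:

```shell
# tps_gate: return nonzero when measured TPS falls below a floor, so a CI
# job fails on performance regressions. $1 = minimum acceptable TPS.
tps_gate() {
  local min="${1:-8}" tps
  tps=$(./bench-tps.sh 2>/dev/null || echo "0")   # 0 if the benchmark fails
  if awk -v t="$tps" -v m="$min" 'BEGIN { exit !(t < m) }'; then
    echo "FAIL: ${tps} tok/s below ${min} tok/s floor" >&2
    return 1
  fi
  echo "PASS: ${tps} tok/s"
}
```

Because most CI systems fail a step on any nonzero exit status, `tps_gate 8` is enough to block a merge that regresses inference speed.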

Bottom Line

The bench-tps.sh script became an essential tool in the LLM development workflow. Its simplicity is deceptive: it provides immediate, actionable performance numbers with no setup beyond curl and jq. For anyone developing, deploying, or optimizing LLM services, measuring tokens-per-second is the most practical way to understand real-world performance.