LLM Garage

Home Engineer's AI Hardware Journal


Production Debugging: Model Loading Issues

Troubleshooting memory management, port conflicts, and common server failure patterns
January 2026

Production deployment of LLM servers is filled with subtle failure modes. What seems like a simple model load can become a cascade of cryptic errors. This post documents a real debugging journey and the systematic approach that resolved each issue.

The Common Failure Patterns

Pattern #1: Multiple Server Processes

The Symptom

curl to /v1/chat/completions returns an invalid response format error.

The Discovery

Checking processes with ps aux reveals multiple servers running at once:

  • llama-server on port 8000 with qwen3-32b
  • llama-server on port 8000 with nemotron-nano

The Cause

The previous session's server didn't terminate properly, leaving stale processes that kept consuming resources and fighting over the same port.

The Systematic Fix

  1. Complete server cleanup: pkill -f llama-server
  2. Verify clean state: ps aux | grep llama-server should return nothing
  3. Start server with proper logging: nohup /path/to/llama-server [arguments] > /tmp/server.log 2>&1 &
  4. Monitor startup progress: tail -f /tmp/server.log
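The four steps above can be wrapped into a small restart script. A sketch under assumptions: the binary name, port, and log path are placeholders, and the bracketed pattern `[l]lama-server` is a regex trick that prevents the pattern from matching its own command line.

```shell
#!/usr/bin/env bash
# Clean-restart sketch; binary name, port, and log path are placeholders.
set -u

SERVER_BIN="${SERVER_BIN:-llama-server}"
LOG_FILE="${LOG_FILE:-/tmp/server.log}"

# 1. Kill lingering servers. The [l] bracket keeps the regex from matching
#    this script itself; "no such process" is not an error here.
pkill -f "[l]lama-server" 2>/dev/null || true
sleep 1

# 2. Verify a clean state before relaunching.
if pgrep -f "[l]lama-server" >/dev/null; then
    echo "ERROR: stale llama-server processes survived pkill" >&2
    exit 1
fi
echo "clean: no llama-server processes running"

# 3+4. Relaunch with logging (only if the binary is actually installed).
if command -v "$SERVER_BIN" >/dev/null 2>&1; then
    nohup "$SERVER_BIN" --port 8000 > "$LOG_FILE" 2>&1 &
    echo "started pid $!; follow progress with: tail -f $LOG_FILE"
fi
```

On a machine where the binary is absent, the script still verifies and reports the clean state, which is useful on its own.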

Pattern #2: Model Loading Deception

The False Positive

  • Health check passes: curl http://localhost:8000/health returns JSON with status ok
  • But inference fails: curl to /v1/chat/completions returns loading error (503)

The Reality

The HTTP server starts immediately and responds to health checks, but the model keeps loading in the background for 30+ seconds.

Proper Readiness Check

  • Wait for log message: grep "llm_load_tensors" /tmp/server.log
  • Look for: "llm_load_tensors: loaded all 32 layers in 15.73 seconds"
  • Then test with actual inference request
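The readiness check above can be wrapped in a small polling helper. A sketch under assumptions: the log path and the `llm_load_tensors` marker follow the llama.cpp output quoted above.

```shell
# Poll the server log until the model-load line appears, or time out.
wait_for_model() {
    local log="${1:-/tmp/server.log}" timeout="${2:-90}" waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if grep -q "llm_load_tensors" "$log" 2>/dev/null; then
            echo "model ready after ${waited}s"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "timed out after ${timeout}s waiting for model load" >&2
    return 1
}
```

Usage: `wait_for_model /tmp/server.log 90 && curl http://localhost:8000/v1/chat/completions ...` -- only send real traffic after the gate passes.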

Pattern #3: Memory Management Mysteries

The Silent OOM

Starting the server with a model that nearly fills the GPU's 48GB of VRAM produces no error message and no immediate failure. Then requests just hang or return generic errors.

The Investigation

  • Monitor VRAM during model load
  • Use: nvidia-smi --query-gpu=memory.used,memory.total
  • Finding: the model's weights need ~31GB, but adding the KV cache buffer and runtime overhead pushes the total past the 48GB available
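To catch this before it bites, VRAM can be sampled while the model loads. A minimal sketch, assuming `nvidia-smi` is on the PATH:

```shell
# Sample GPU memory once per second while the model loads.
watch_vram() {
    local samples="${1:-10}" i=0
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; cannot sample VRAM" >&2
        return 1
    fi
    while [ "$i" -lt "$samples" ]; do
        # CSV output is easy to log or graph later.
        nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
        sleep 1
        i=$((i + 1))
    done
}
```

Run it in a second terminal while the server starts: if memory.used plateaus at memory.total before loading finishes, the model will not fit.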

Memory Strategy Evolution

Approach      GPU Layers          VRAM Used  Result
Conservative  --n-gpu-layers 24   ~38GB      Slow inference (CPU offload)
Balanced      --n-gpu-layers 32   ~42GB      Adequate performance
Aggressive    --n-gpu-layers 99   ~45GB      Optimal (fits on 48GB)
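The three rows translate directly into launch commands. A hypothetical wrapper (the model path is a placeholder; `--n-gpu-layers` is the real llama-server flag):

```shell
# Launch with a chosen offload level: 24 = conservative, 32 = balanced,
# 99 = aggressive (offload every layer that fits).
launch_server() {
    local layers="${1:-99}"
    nohup llama-server \
        --model /path/to/model.gguf \
        --port 8000 \
        --n-gpu-layers "$layers" \
        > /tmp/server.log 2>&1 &
    echo "launched llama-server with --n-gpu-layers $layers (pid $!)"
}
```

Usage: start with `launch_server 24`, confirm it runs, then step up toward `launch_server 99` while watching nvidia-smi.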

The Debugging Toolkit

Essential Monitoring Commands

Server Health Monitoring

  • Process monitoring: ps aux | grep llama-server
  • Port usage: lsof -i :8000
  • GPU usage: nvidia-smi
  • CPU/memory: htop
  • Real-time GPU: nvidia-smi dmon -s pucvmet -o DT -d 1

Log Analysis

  • Real-time log: tail -f /tmp/server.log
  • Error patterns: grep -i error /tmp/server.log
  • Warnings: grep -i warn /tmp/server.log
  • Model loading: grep "llm_load_tensors" /tmp/server.log
  • CUDA activity: grep ggml_cuda /tmp/server.log

Common Error Messages and Solutions

Error: "Loading model" 503

Cause: Model still loading in background. Solution: Wait for "loaded all 32 layers" log message. Time to Resolution: 30-60 seconds

Error: "Model too large"

Cause: Insufficient VRAM for model + KV cache. Solution: Reduce --n-gpu-layers or use smaller quantization. Check: nvidia-smi for available VRAM

Error: "CUDA out of memory"

Cause: VRAM exhausted during inference. Solution: Restart server with reduced GPU layers. Prevention: Monitor VRAM usage during startup

Pre-Deployment Checklist

Before Starting Server

  1. Clean Environment: Kill existing processes, verify port availability
  2. Verify Dependencies: Check CUDA version, GPU drivers, model file integrity
  3. Resource Planning: Confirm VRAM capacity, disk space, network connectivity
  4. Configuration Review: Validate command arguments, config file settings
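Items 1-3 of the checklist can be automated in a preflight helper. A sketch under assumptions: the port and model-path arguments and the check order are my own choices, not fixed conventions.

```shell
# Preflight sketch covering the checklist above.
preflight() {
    local port="${1:-8000}" model="${2:-/path/to/model.gguf}" rc=0

    # Port availability (skipped if lsof is not installed).
    if command -v lsof >/dev/null 2>&1 && lsof -i ":$port" >/dev/null 2>&1; then
        echo "FAIL: port $port already in use"
        rc=1
    fi

    # Model file present and non-empty.
    if [ ! -s "$model" ]; then
        echo "FAIL: model file missing or empty: $model"
        rc=1
    fi

    # GPU driver reachable.
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "WARN: nvidia-smi not found; cannot verify VRAM"
    fi

    if [ "$rc" -eq 0 ]; then
        echo "preflight passed"
    fi
    return "$rc"
}
```

A nonzero return means the launch should be aborted before any GPU memory is touched.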

During Model Load

  1. Monitor Progress: Track GPU memory usage, watch log messages
  2. Watch Timeouts: Model load shouldn't exceed 60-90 seconds
  3. Resource Allocation: Verify balanced GPU utilization in multi-GPU setups
  4. Early Failure Detection: Any immediate error indicates configuration issues

After Server Ready

  1. Inference Testing: Send sample requests, verify response quality
  2. Performance Benchmarking: Test typical workloads, measure latency
  3. Memory Monitoring: Check for memory leaks or growing usage
  4. API Compatibility: Validate endpoint formats with client applications

Troubleshooting Methodology

Systematic Approach

Step 1: Isolate the Symptom

  • Server won't start?
  • Server starts but model won't load?
  • Model loads but health check fails?
  • Health passes but inference fails?

Step 2: Check the Basics

  • GPU available and drivers working?
  • Sufficient VRAM for model?
  • Port not in conflict?
  • Model file complete and accessible?

Step 3: Examine Logs

  • What was the last successful action?
  • When did the first error appear?
  • Are there resource exhaustion signs?
  • Any permission or path issues?

Step 4: Reproduce and Test

  • Restart cleanly with monitoring
  • Reproduce original error
  • Test minimal working configuration
  • Gradually increase complexity

The Debugging Workflow

My Routine Process

  1. Clean slate: pkill llama-server, verify clean with ps aux
  2. Start with monitoring: nohup llama-server [arguments] > /tmp/server.log 2>&1 &
  3. System verification: nvidia-smi, curl localhost:8000/health
  4. Test inference: curl -X POST /v1/chat/completions [test request]
  5. Full deployment: Only after all above pass
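Step 4's inference test deserves more than a bare curl. A minimal smoke-test sketch: the payload follows the OpenAI-compatible chat endpoint that llama-server exposes, and the "model" value is a placeholder since the server answers with whatever model it loaded.

```shell
# Send one tiny chat completion and inspect the raw JSON reply.
smoke_test() {
    curl -s --max-time 30 http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "local",
             "messages": [{"role": "user", "content": "Reply with OK"}],
             "max_tokens": 8}'
}
```

Usage: `smoke_test | grep -q '"content"' && echo "inference OK"` -- grepping for the content field catches the 503-while-loading case that a /health check misses.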

Production Monitoring

Health Check Automation

Simple Monitoring Script

  • Check if server is running: pgrep -f llama-server
  • Check HTTP response: curl with http_code
  • Verify model loaded: grep "llm_load_tensors" in log
  • Return OK or WARNING status codes
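The four bullets above fit in one function. A sketch of that monitoring script: the log path, port, and five-second curl timeout are my assumptions.

```shell
# Health-check sketch combining process, HTTP, and model-load checks.
healthcheck() {
    local log="${1:-/tmp/server.log}" code

    # 1. Is the process alive? ([l] keeps the pattern from matching ourselves.)
    if ! pgrep -f "[l]lama-server" >/dev/null; then
        echo "WARNING: llama-server process not running"
        return 1
    fi

    # 2. Does the HTTP endpoint answer?
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        http://localhost:8000/health) || true
    if [ "$code" != "200" ]; then
        echo "WARNING: /health returned HTTP $code"
        return 1
    fi

    # 3. Did the model actually finish loading?
    if ! grep -q "llm_load_tensors" "$log" 2>/dev/null; then
        echo "WARNING: no model-load line found in $log"
        return 1
    fi

    echo "OK"
}
```

The nonzero return on WARNING makes the script usable from cron or any supervisor that acts on exit codes.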

Crontab Setup

Add to crontab to run checks every 5 minutes: */5 * * * * /path/to/healthcheck.sh >> /var/log/llama-health.log 2>&1

Log Rotation

Prevent log files from growing without bound - add a logrotate configuration for /tmp/server.log with daily rotation, 7 days of retention, and compression enabled.
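A logrotate stanza matching that policy might look like the following (written to a temp path here for illustration; in production it would live under /etc/logrotate.d/). The copytruncate directive is worth noting: it lets rotation happen without restarting the server, since the process keeps its log file descriptor open.

```shell
# Logrotate stanza: daily rotation, 7 rotations kept, compressed.
cat > /tmp/llama-logrotate.conf <<'EOF'
/tmp/server.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
EOF
echo "wrote /tmp/llama-logrotate.conf"
```

Test the stanza without waiting a day via `logrotate -d /tmp/llama-logrotate.conf`, which dry-runs the rotation and prints what it would do.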

Key Takeaways

Understanding Server States

LLM servers have multiple readiness states: HTTP ready != Model ready. Always test actual inference, not just health endpoints.

Process Management

Clean startup procedures prevent most common issues. Always kill existing processes before starting new ones.

Resource Planning

VRAM needs exceed model size. Always account for KV cache, overhead, and fragmentation when planning memory usage.

Systematic Debugging

Isolate symptoms systematically. Most errors fall into predictable patterns with repeatable solutions.

Automation is Essential

Manual debugging works for one-time deployments, but production requires automated health checks and monitoring.

Bottom Line

Production LLM server deployment is complex but predictable. Most failures follow recognizable patterns that systematic approaches can resolve quickly. The key is understanding the difference between server startup and model readiness, and having proper monitoring to catch silent failures.