Setting up Nemotron-3-Nano-30B-A3B on dual RTX3090s was anything but straightforward. What should have been a simple model deployment became a week-long debugging session that revealed critical differences between inference frameworks and practical deployment realities.
The Initial Plan
Target Configuration
- Model: Nemotron-3-Nano-30B-A3B-Q6_K (31.2GB)
- Hardware: 2x RTX 3090 (48GB total VRAM)
- Framework: Initially tried vLLM, ended up with llama.cpp
- Context: 65K tokens
- Goal: OpenAI-compatible API for OpenCode integration
The Discovery Process
Finding Existing Infrastructure
Lucky Discoveries
On first login to the system, I found existing infrastructure:
- /home/tomwest/llama.cpp/build/bin/ - Already compiled and ready
- ~32GB of model blob in the HuggingFace cache
- vLLM server from previous attempt running on port 8001
The Roadblocks: Problem by Problem
Problem #1: Framework Incompatibility
The Issue
vLLM refused to load the GGUF format model with cryptic error messages.
Root Cause
- vLLM: Built around Safetensors/Hugging Face checkpoints; it would not accept this GGUF file
- llama.cpp: Native GGUF support (GGUF being its optimized successor to the older GGML format)
- Reality: The model's format largely determines the inference backend (a quick format check is sketched below)
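Before blaming the framework, it is worth confirming the file really is GGUF. A minimal sketch, assuming you run it next to the model file: GGUF files begin with the four-byte ASCII magic "GGUF", so peeking at the header settles the question.

# GGUF files start with the ASCII magic "GGUF"
head -c 4 Nemotron-3-Nano-30B-A3B-Q6_K.gguf; echo
# Expected output: GGUF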
Problem #2: Incomplete Model Download
The Issue
Model directory had 32GB blob file but no proper snapshot structure.
Root Cause
The previous download session was interrupted, leaving the model in an incomplete state.
Solution
Completed the download via the HuggingFace cache, relying on its built-in resume capability (a sketch follows).
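For reference, a minimal sketch of that resume step. The repo and filename are taken from the cache path shown in the final configuration below, and huggingface-cli should pick up partially downloaded blobs in the cache rather than starting over.

# Resume/complete the GGUF download into the local HuggingFace cache
huggingface-cli download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
  Nemotron-3-Nano-30B-A3B-Q6_K.gguf
# Afterwards, verify the snapshot directory actually contains the file
ls -lh ~/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/*/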
Problem #3: Port Conflicts
The Issue
# Multiple llama-server processes running
ps aux | grep llama-server
tomwest   1234  0.0 45.0 24567890 12345 ?  S  Jan10  0:02 llama-server --port 8000 --model qwen-model
tomwest   5678  0.0 47.2 24789012 13456 ?  S  Jan11  0:03 llama-server --port 8001 --model nemotron
Root Cause
Previous server instance still running when attempting new setup.
Solution
# Clean restart procedure
pkill llama-server
sleep 2
# Start fresh on a single port (8000)
Problem #4: Health Check Confusion
The Issue
curl http://localhost:8000/health
{"error":{"message":"Loading model","type":"unavailable_error","code":503}}
The Learning
The health endpoint answers as soon as the server process is up, but it returns a 503 "Loading model" error until the tensors finish loading in the background. A reachable API is not a ready API: inference requests fail until loading completes.
Real Health Check
# Wait for this message in the logs:
#   "llm_load_tensors: loaded all 32 layers in 15.73 seconds"
# Then test with actual inference
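In script form, the same check can be automated: poll /health until it returns success, then run one real completion. A minimal sketch assuming the defaults used elsewhere in this post (port 8000):

# Poll /health until the server reports ready, then fire one real completion
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "still loading..."; sleep 5
done
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Say hello.", "max_tokens": 8}'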
Problem #5: GPU Memory Optimization
Hardware Reality Check
Needed to determine optimal --n-gpu-layers parameter for 31.2GB model on 48GB VRAM.
Configurations Tested
# Conservative approach
--n-gpu-layers 32    # ~40GB VRAM usage, CPU offload for the remainder

# Aggressive approach
--n-gpu-layers 99    # All layers on GPU

# Final choice: --n-gpu-layers 99
# Fits comfortably within the ~48GB of total VRAM, with room left for the KV cache
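The safest way to tune this is empirically: start the server, watch per-GPU memory, and back off the layer count if either card creeps toward its 24GB limit. A sketch:

# Watch per-GPU memory while the model loads (Ctrl-C to stop)
watch -n 2 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv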
Framework Comparison: llama.cpp vs vLLM
Why llama.cpp Won Here
llama.cpp Advantages
- Native GGUF support: No format conversion needed
- Lower memory footprint: Efficient quantization handling
- Easier setup: Single binary, minimal dependencies
- OpenAI API: Works directly with OpenCode
- Dual GPU support: Via --split-mode layer (see the sketch below)
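For reference, a sketch of the flags involved (model.gguf is a placeholder path). With two visible GPUs llama.cpp already defaults to a layer split, so these flags mostly make the behaviour explicit and let you bias the ratio between cards.

# Explicit dual-GPU layer split; --tensor-split 1,1 divides layers evenly
# between GPU 0 and GPU 1 (layer split is the default when both GPUs are visible)
llama-server -m model.gguf --split-mode layer --tensor-split 1,1 --n-gpu-layers 99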
vLLM Strengths (But Not for This Use Case)
- High Concurrency: Better throughput for multiple simultaneous requests
- PagedAttention: Advanced memory management
- Batch Processing: Superior handling of parallel requests
- Production Features: Better monitoring and metrics
Usage Pattern Decision
Our Workload Profile
- Solo development: Single user, sequential requests
- Local serving: No need for high concurrency
- Model focus: One model at a time
Result: llama.cpp is the right fit for single-model, local development scenarios.
Configuration Deep Dive
Final Working Setup
Model Loading
nohup /home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/9ad8b366c308f931b2a96b9306f0b41aef9cd405/Nemotron-3-Nano-30B-A3B-Q6_K.gguf \
  -c 65536 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 \
  --port 8000 \
  > /tmp/llama-server.log 2>&1 &
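Once loading completes, a quick way to confirm the OpenAI-compatible surface is up (and to see what model id the server advertises) is to list models:

# List models exposed by the OpenAI-compatible API
curl -s http://localhost:8000/v1/models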
OpenCode Integration
# Update ~/.config/opencode/opencode.json
{
"provider": "llama_cpp",
"model_id": "nemotron-nano-30b",
"context_window": 65536,
"max_tokens": 32000,
"base_url": "http://localhost:8000/v1"
}
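Before wiring up OpenCode, it is worth hitting the same base_url by hand. A sketch of a one-shot chat completion against the configuration above; the model field matches the model_id in the config, though llama.cpp simply serves whatever single model it loaded.

# Minimal end-to-end test through the same endpoint OpenCode will use
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nemotron-nano-30b",
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
        "max_tokens": 16
      }'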
Performance Observations
| Metric | Value | Notes |
|---|---|---|
| Model Load Time | ~30 seconds | Includes GPU tensor loading |
| VRAM Usage | ~48GB | Full model on both GPUs |
| Context Window | 65K tokens | Maximum tested capacity |
| GPU Utilization | Both cards 85-90% | Layer split across both GPUs working |
Key Learnings
Model Format Ecosystem
Not all inference frameworks support all model formats. The choice between GGUF, Safetensors, and older GGML formats dictates your backend options.
Quantization Sweet Spot
Q6_K (8.49 bits per weight) provides an excellent balance for 30B-class models. A smaller quantization would leave VRAM for additional models, but quality degrades noticeably.
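As a rough sanity check on that figure (assuming the nominal 30B parameter count), effective bits per weight is approximately file size × 8 / parameter count: 31.2 × 10⁹ bytes × 8 / (30 × 10⁹ weights) ≈ 8.3 bits per weight, in the same ballpark; the exact number depends on the true parameter count and on which tensors the quantizer keeps at higher precision.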
Server Management
# Essential process management commands
pkill llama-server                        # Stop all instances
ps aux | grep llama-server                # Check for leftovers
curl -s http://localhost:8000/health      # Health check
tail -f /tmp/llama-server.log             # Monitor startup
Health vs. Ready
A responding HTTP endpoint != a loaded model. Wait for the "loaded all ... layers" message in the log (or a 200 from /health) before attempting inference.
Production Readiness Checklist
Before Going Live
- Clean Start: Kill existing servers, verify clean environment
- Model Integrity: Verify complete model download before startup
- Memory Planning: Confirm VRAM capacity with chosen quantization
- API Testing: Test actual inference, not just health endpoint
- Configuration Sync: Ensure OpenCode config matches server settings
- Monitoring Setup: Establish log monitoring before production (a consolidated startup sketch follows this list)
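Most of this checklist folds into a single restart script. A sketch using the same paths and port as above; treat it as a starting point rather than a hardened service.

#!/usr/bin/env bash
set -euo pipefail

MODEL=/home/tomwest/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/9ad8b366c308f931b2a96b9306f0b41aef9cd405/Nemotron-3-Nano-30B-A3B-Q6_K.gguf
LOG=/tmp/llama-server.log

# 1. Clean start: stop any leftover servers
pkill llama-server || true
sleep 2

# 2. Model integrity: refuse to start if the snapshot file is missing
[ -f "$MODEL" ] || { echo "model file missing: $MODEL"; exit 1; }

# 3. Launch with the known-good settings
nohup /home/tomwest/llama.cpp/build/bin/llama-server \
  -m "$MODEL" -c 65536 --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8000 > "$LOG" 2>&1 &

# 4. Wait for readiness, then run a real inference test
until curl -sf http://localhost:8000/health > /dev/null; do sleep 5; done
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"ping"}],"max_tokens":4}'

# 5. Monitoring hint
echo "server up; monitor with: tail -f $LOG"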
Bottom Line
The "simple" model setup became a framework compatibility discovery lesson. llama.cpp proved superior for local, single-model deployments despite vLLM's enterprise features. The key insight: match your use case to the appropriate framework from the start.