Setting up Nemotron-3-Nano-30B-A3B on dual RTX3090s was anything but straightforward. What should have been a simple model deployment became a week-long debugging session that revealed critical differences between inference frameworks and practical deployment realities.
The Initial Plan
Target Configuration
- Model: Nemotron-3-Nano-30B-A3B-Q6_K (31.2GB)
- Hardware: 2x RTX 3090 (48GB total VRAM)
- Framework: Initially tried vLLM, ended up with llama.cpp
- Context: 65K tokens
- Goal: OpenAI-compatible API for OpenCode integration
The Discovery Process
Finding Existing Infrastructure
Lucky Discoveries
On first login to the system, I found existing infrastructure:
- /home/tomwest/llama.cpp/build/bin/ - Already compiled and ready
- ~32GB of model blob in the HuggingFace cache
- vLLM server from previous attempt running on port 8001
The Roadblocks: Problem by Problem
Problem #1: Framework Incompatibility
The Issue
vLLM refused to load the GGUF format model with cryptic error messages.
Root Cause
- vLLM: Built around Safetensors/Hugging Face checkpoints; it would not accept this GGUF file
- llama.cpp: Native GGUF support (GGUF being its optimized successor to the older GGML format)
- Reality: The model's format largely determines the inference backend (a quick format check is sketched below)
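Before blaming the framework, it is worth confirming the file really is GGUF. A minimal sketch, assuming you run it next to the model file: GGUF files begin with the four-byte ASCII magic "GGUF", so peeking at the header settles the question.

# GGUF files start with the ASCII magic "GGUF"
head -c 4 Nemotron-3-Nano-30B-A3B-Q6_K.gguf; echo
# Expected output: GGUF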
Problem #2: Incomplete Model Download
The Issue
Model directory had 32GB blob file but no proper snapshot structure.
Root Cause
The previous download session was interrupted, leaving the model in an incomplete state.
Solution
Completed the download via the HuggingFace cache, relying on its built-in resume capability (a sketch follows).
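For reference, a minimal sketch of that resume step. The repo and filename are taken from the cache path shown in the final configuration below, and huggingface-cli should pick up partially downloaded blobs in the cache rather than starting over.

# Resume/complete the GGUF download into the local HuggingFace cache
huggingface-cli download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
  Nemotron-3-Nano-30B-A3B-Q6_K.gguf
# Afterwards, verify the snapshot directory actually contains the file
ls -lh ~/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/*/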
Problem #3: Port Conflicts
The Issue
# Multiple llama-server processes running
ps aux | grep llama-server
tomwest   1234  0.0 45.0 24567890 12345 ?  S  Jan10  0:02 llama-server --port 8000 --model qwen-model
tomwest   5678  0.0 47.2 24789012 13456 ?  S  Jan11  0:03 llama-server --port 8001 --model nemotron
Root Cause
Previous server instance still running when attempting new setup.
Solution
# Clean restart procedure
pkill llama-server
sleep 2
# Start fresh on a single port (8000)
Problem #4: Health Check Confusion
The Issue
curl http://localhost:8000/health
{"error":{"message":"Loading model","type":"unavailable_error","code":503}}
The Learning
The health endpoint answers as soon as the server process is up, but it returns a 503 "Loading model" error until the tensors finish loading in the background. A reachable API is not a ready API: inference requests fail until loading completes.
Real Health Check
# Wait for this message in the logs:
#   "llm_load_tensors: loaded all 32 layers in 15.73 seconds"
# Then test with actual inference
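In script form, the same check can be automated: poll /health until it returns success, then run one real completion. A minimal sketch assuming the defaults used elsewhere in this post (port 8000):

# Poll /health until the server reports ready, then fire one real completion
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "still loading..."; sleep 5
done
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Say hello.", "max_tokens": 8}'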
Problem #5: GPU Memory Optimization
Hardware Reality Check
Needed to determine optimal --n-gpu-layers parameter for 31.2GB model on 48GB VRAM.
Configurations Tested
# Conservative approach
--n-gpu-layers 32    # ~40GB VRAM usage, CPU offload for the remainder

# Aggressive approach
--n-gpu-layers 99    # All layers on GPU

# Final choice: --n-gpu-layers 99
# Fits comfortably within the ~48GB of total VRAM, with room left for the KV cache
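The safest way to tune this is empirically: start the server, watch per-GPU memory, and back off the layer count if either card creeps toward its 24GB limit. A sketch:

# Watch per-GPU memory while the model loads (Ctrl-C to stop)
watch -n 2 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv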
Framework Comparison: llama.cpp vs vLLM
Why llama.cpp Won Here
llama.cpp Advantages
- Native GGUF support: No format conversion needed
- Lower memory footprint: Efficient quantization handling
- Easier setup: Single binary, minimal dependencies
- OpenAI API: Works directly with OpenCode
- Dual GPU support: Via --split-mode layer (see the sketch below)
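For reference, a sketch of the flags involved (model.gguf is a placeholder path). With two visible GPUs llama.cpp already defaults to a layer split, so these flags mostly make the behaviour explicit and let you bias the ratio between cards.

# Explicit dual-GPU layer split; --tensor-split 1,1 divides layers evenly
# between GPU 0 and GPU 1 (layer split is the default when both GPUs are visible)
llama-server -m model.gguf --split-mode layer --tensor-split 1,1 --n-gpu-layers 99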
vLLM Strengths (But Not for This Use Case)
- High Concurrency: Better throughput for multiple simultaneous requests
- PagedAttention: Advanced memory management
- Batch Processing: Superior handling of parallel requests
- Production Features: Better monitoring and metrics
Usage Pattern Decision
Our Workload Profile
- Solo development: Single user, sequential requests
- Local serving: No need for high concurrency
- Model focus: One model at a time
Result: llama.cpp is the right fit for single-model, local development scenarios.
Configuration Deep Dive
Final Working Setup
Model Loading
nohup /home/tomwest/llama.cpp/build/bin/llama-server \
  -m /home/tomwest/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/9ad8b366c308f931b2a96b9306f0b41aef9cd405/Nemotron-3-Nano-30B-A3B-Q6_K.gguf \
  -c 65536 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 \
  --port 8000 \
  > /tmp/llama-server.log 2>&1 &
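Once loading completes, a quick way to confirm the OpenAI-compatible surface is up (and to see what model id the server advertises) is to list models:

# List models exposed by the OpenAI-compatible API
curl -s http://localhost:8000/v1/models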
OpenCode Integration
# Update ~/.config/opencode/opencode.json
{
"provider": "llama_cpp",
"model_id": "nemotron-nano-30b",
"context_window": 65536,
"max_tokens": 32000,
"base_url": "http://localhost:8000/v1"
}
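Before wiring up OpenCode, it is worth hitting the same base_url by hand. A sketch of a one-shot chat completion against the configuration above; the model field matches the model_id in the config, though llama.cpp simply serves whatever single model it loaded.

# Minimal end-to-end test through the same endpoint OpenCode will use
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nemotron-nano-30b",
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
        "max_tokens": 16
      }'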
Performance Observations
| Metric | Value | Notes |
|---|---|---|
| Model Load Time | ~30 seconds | Includes GPU tensor loading |
| VRAM Usage | ~48GB | Full model on both GPUs |
| Context Window | 65K tokens | Maximum tested capacity |
| GPU Utilization | Both cards 85-90% | Layer split across both GPUs working |
Key Learnings
Model Format Ecosystem
Not all inference frameworks support all model formats. The choice between GGUF, Safetensors, and older GGML formats dictates your backend options.
Quantization Sweet Spot
Q6_K (8.49 bits per weight) provides an excellent balance for 30B-class models. A smaller quantization would leave VRAM for additional models, but quality degrades noticeably.
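As a rough sanity check on that figure (assuming the nominal 30B parameter count), effective bits per weight is approximately file size × 8 / parameter count: 31.2 × 10⁹ bytes × 8 / (30 × 10⁹ weights) ≈ 8.3 bits per weight, in the same ballpark; the exact number depends on the true parameter count and on which tensors the quantizer keeps at higher precision.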
Server Management
# Essential process management commands
pkill llama-server                        # Stop all instances
ps aux | grep llama-server                # Check for leftovers
curl -s http://localhost:8000/health      # Health check
tail -f /tmp/llama-server.log             # Monitor startup
Health vs. Ready
A responding HTTP endpoint != a loaded model. Wait for the "loaded all ... layers" message in the log (or a 200 from /health) before attempting inference.
Production Readiness Checklist
Before Going Live
- Clean Start: Kill existing servers, verify clean environment
- Model Integrity: Verify complete model download before startup
- Memory Planning: Confirm VRAM capacity with chosen quantization
- API Testing: Test actual inference, not just health endpoint
- Configuration Sync: Ensure OpenCode config matches server settings
- Monitoring Setup: Establish log monitoring before production (a consolidated startup sketch follows this list)
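Most of this checklist folds into a single restart script. A sketch using the same paths and port as above; treat it as a starting point rather than a hardened service.

#!/usr/bin/env bash
set -euo pipefail

MODEL=/home/tomwest/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/9ad8b366c308f931b2a96b9306f0b41aef9cd405/Nemotron-3-Nano-30B-A3B-Q6_K.gguf
LOG=/tmp/llama-server.log

# 1. Clean start: stop any leftover servers
pkill llama-server || true
sleep 2

# 2. Model integrity: refuse to start if the snapshot file is missing
[ -f "$MODEL" ] || { echo "model file missing: $MODEL"; exit 1; }

# 3. Launch with the known-good settings
nohup /home/tomwest/llama.cpp/build/bin/llama-server \
  -m "$MODEL" -c 65536 --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8000 > "$LOG" 2>&1 &

# 4. Wait for readiness, then run a real inference test
until curl -sf http://localhost:8000/health > /dev/null; do sleep 5; done
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"ping"}],"max_tokens":4}'

# 5. Monitoring hint
echo "server up; monitor with: tail -f $LOG"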
Bottom Line
The "simple" model setup became a framework compatibility discovery lesson. llama.cpp proved superior for local, single-model deployments despite vLLM's enterprise features. The key insight: match your use case to the appropriate framework from the start.