Setting up Nemotron-3-Nano-30B-A3B on dual RTX 3090s was anything but straightforward. What should have been a simple model deployment became a week-long debugging session that revealed critical differences between inference frameworks and the practical realities of deployment.
The Initial Plan
Target Configuration
- Model: Nemotron-3-Nano-30B-A3B-Q6_K (31.2GB)
- Hardware: 2x RTX 3090 (48GB total VRAM)
- Framework: Initially tried vLLM, ended up with llama.cpp
- Context: 65K tokens
- Goal: OpenAI-compatible API for OpenCode integration
The Discovery Process
Finding Existing Infrastructure
Lucky Discoveries
On first login to the system, I found existing infrastructure:
- /home/tomwest/llama.cpp/build/bin/ - already compiled and ready
- ~32GB of model blobs in the HuggingFace cache
- vLLM server from a previous attempt running on port 8001
The Roadblocks: Problem by Problem
Problem #1: Framework Incompatibility
The Issue
vLLM refused to load the GGUF format model with cryptic error messages.
Root Cause
- vLLM: built around Safetensors/Hugging Face checkpoints; GGUF support is limited and experimental
- llama.cpp: native GGUF support (its own optimized format)
- Reality: the model's file format effectively determines the inference backend
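Before fighting a framework over a model file, it is worth confirming what format the file actually is. GGUF files begin with the 4-byte magic `GGUF`, so a quick check is possible with nothing but the standard library. A minimal sketch (the demo file below is a throwaway stand-in, not the real model):

```python
import tempfile

def is_gguf(path):
    """Check for the 4-byte GGUF magic at the start of the file."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Demo on a throwaway file; a real check would point at the model path.
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as f:
    f.write(b"GGUF" + b"\x00" * 12)  # fake header: magic only
    fake_model = f.name

print(is_gguf(fake_model))  # True
```

Had this check been run first, the vLLM detour would have been obvious: a file whose magic says GGUF wants llama.cpp.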
Problem #2: Memory Management
The Issue
Model loading failed with out-of-memory errors despite 48GB of VRAM being available.
The Missing Piece
KV cache requires additional memory beyond the model weights. The rough math:
- Model weights: 31GB
- KV cache: ~8GB at 32K context
- Runtime overhead: ~2GB
- Total: 41GB+
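The KV cache line in that sum can be estimated rather than guessed: K and V each store one value per layer, per position, per KV head, per head dimension. A minimal sketch of the arithmetic (the layer/head counts below are illustrative placeholders, not Nemotron's actual architecture):

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size: K and V each hold n_ctx * n_kv_heads * head_dim
    values per layer, stored at bytes_per_elem (2 for fp16)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative numbers only -- not Nemotron's real config.
gib = kv_cache_bytes(n_layers=32, n_ctx=32768, n_kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.1f} GiB")  # 4.0 GiB for this hypothetical config
```

The cache grows linearly with context length, which is why doubling the context from 32K to 65K noticeably tightens the VRAM budget.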
Problem #3: Health Check Confusion
The Issue
The server's health check passed while inference requests still failed.
The Learning
The HTTP health endpoint returns 200 before the model finishes loading. The API reports "ok", but inference fails until background loading completes.
Real Health Check
Wait for the log line `llm_load_tensors: loaded all 32 layers in 15.73 seconds`, then verify with an actual inference request.
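That "test with actual inference" step can be scripted: poll the server with a real 1-token completion instead of the health endpoint. A minimal sketch, assuming llama.cpp's OpenAI-compatible server on port 8000 (endpoint path and timeouts are assumptions to adapt):

```python
import json
import time
import urllib.request

def wait_until_ready(probe, timeout=300, interval=5):
    """Poll `probe` until it returns True (real inference works) or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

def inference_probe(base_url="http://localhost:8000"):
    """Send a 1-token completion; succeeds only once the model is fully loaded."""
    payload = json.dumps({
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status == 200
    except OSError:
        return False

# Usage: wait_until_ready(inference_probe)
```

Gating startup scripts on this probe, rather than on the `/health` endpoint, avoids the false-ready window entirely.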
The Solution
Final Configuration That Works
```sh
./llama-server \
  --model /path/to/nemotron-30b-a3b-q6_k.gguf \
  --n-gpu-layers 99 \
  --threads 32 \
  --ctx-size 65535 \
  --port 8000
```
Key Parameters
| Parameter | Value | Reason |
|---|---|---|
| --n-gpu-layers | 99 | Max GPU offload |
| --threads | 32 | Threadripper cores |
| --ctx-size | 65535 | Max context |
| --port | 8000 | Consistent API port |
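With the server up on port 8000, any OpenAI-style client can talk to it. A minimal sketch using only the Python standard library (the prompt and timeout are placeholders; the endpoint path assumes llama.cpp's OpenAI-compatible API):

```python
import json
import urllib.request

def build_chat_payload(prompt, max_tokens=256):
    """OpenAI-style chat body accepted by /v1/chat/completions."""
    return {"messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens}

def chat(prompt, base_url="http://localhost:8000"):
    """Send one chat completion request and return the assistant's reply text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server from the section above to be running):
# print(chat("Summarize the GGUF format in one sentence."))
```

Because the endpoint speaks the OpenAI wire format, tools like OpenCode only need the base URL pointed at localhost:8000.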
Lessons Learned
Framework Matters
GGUF is llama.cpp's native format. Don't fight the framework; work with its strengths.
Memory Math
Always account for KV cache. Model size ≠ total memory needed.
Health Check Reality
Test actual inference, not just HTTP endpoints. Model loading happens asynchronously.
Patience Required
First-time setup takes time. Debug systematically and document solutions.
Performance Results
Achieved Metrics
| Metric | Value |
|---|---|
| Tokens/Second | 64.82 |
| Context Size | 65K tokens |
| VRAM Usage | ~45GB / 48GB |
| Load Time | ~16 seconds |
What Works
The setup works reliably for interactive use with 65K context. Response times are fast enough for real-time interaction.
Remaining Limitations
- Single request at a time (no batching)
- VRAM at 93% utilization
- No room for more layers
Next Steps
Potential Improvements
- Explore quantization trade-offs (Q5 vs Q6)
- Test longer context stability
- Multi-GPU load distribution
What to Avoid
- Don't use wrong format for framework
- Don't skip KV cache math
- Don't trust the health endpoint alone