Setting up Nemotron-3-Nano-30B-A3B on dual RTX 3090s was anything but straightforward. What should have been a simple model deployment became a week-long debugging session that revealed critical differences between inference frameworks and the practical realities of deployment.
The Initial Plan
Target Configuration
- Model: Nemotron-3-Nano-30B-A3B-Q6_K (31.2GB)
- Hardware: 2x RTX 3090 (48GB total VRAM)
- Framework: Initially tried vLLM, ended up with llama.cpp
- Context: 65K tokens
- Goal: OpenAI-compatible API for OpenCode integration
The Discovery Process
Finding Existing Infrastructure
Lucky Discoveries
On first login to the system, I found existing infrastructure:
- /home/tomwest/llama.cpp/build/bin/ - already compiled and ready
- ~32GB of model blobs in the HuggingFace cache
- vLLM server from a previous attempt running on port 8001
The Roadblocks: Problem by Problem
Problem #1: Framework Incompatibility
The Issue
vLLM refused to load the GGUF format model with cryptic error messages.
Root Cause
- vLLM: built around Safetensors/Hugging Face checkpoints; GGUF support is limited and experimental
- llama.cpp: native GGUF support (its own optimized format)
- Reality: the model's file format effectively determines the inference backend
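Before fighting a framework over a model file, it is worth confirming what format the file actually is. GGUF files begin with the 4-byte magic `GGUF`, so a quick check is possible with nothing but the standard library. A minimal sketch (the demo file below is a throwaway stand-in, not the real model):

```python
import tempfile

def is_gguf(path):
    """Check for the 4-byte GGUF magic at the start of the file."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Demo on a throwaway file; a real check would point at the model path.
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as f:
    f.write(b"GGUF" + b"\x00" * 12)  # fake header: magic only
    fake_model = f.name

print(is_gguf(fake_model))  # True
```

Had this check been run first, the vLLM detour would have been obvious: a file whose magic says GGUF wants llama.cpp.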
Problem #2: Memory Management
The Issue
Model loading failed with out-of-memory errors despite 48GB of VRAM being available.
The Missing Piece
KV cache requires additional memory beyond the model weights. The rough math:
- Model weights: 31GB
- KV cache: ~8GB at 32K context
- Runtime overhead: ~2GB
- Total: 41GB+
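The KV cache line in that sum can be estimated rather than guessed: K and V each store one value per layer, per position, per KV head, per head dimension. A minimal sketch of the arithmetic (the layer/head counts below are illustrative placeholders, not Nemotron's actual architecture):

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size: K and V each hold n_ctx * n_kv_heads * head_dim
    values per layer, stored at bytes_per_elem (2 for fp16)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative numbers only -- not Nemotron's real config.
gib = kv_cache_bytes(n_layers=32, n_ctx=32768, n_kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.1f} GiB")  # 4.0 GiB for this hypothetical config
```

The cache grows linearly with context length, which is why doubling the context from 32K to 65K noticeably tightens the VRAM budget.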
Problem #3: Health Check Confusion
The Issue
The server's health check passed while inference requests still failed.
The Learning
The HTTP health endpoint returns 200 before the model finishes loading. The API reports "ok", but inference fails until background loading completes.
Real Health Check
Wait for the log line `llm_load_tensors: loaded all 32 layers in 15.73 seconds`, then verify with an actual inference request.
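That "test with actual inference" step can be scripted: poll the server with a real 1-token completion instead of the health endpoint. A minimal sketch, assuming llama.cpp's OpenAI-compatible server on port 8000 (endpoint path and timeouts are assumptions to adapt):

```python
import json
import time
import urllib.request

def wait_until_ready(probe, timeout=300, interval=5):
    """Poll `probe` until it returns True (real inference works) or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

def inference_probe(base_url="http://localhost:8000"):
    """Send a 1-token completion; succeeds only once the model is fully loaded."""
    payload = json.dumps({
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status == 200
    except OSError:
        return False

# Usage: wait_until_ready(inference_probe)
```

Gating startup scripts on this probe, rather than on the `/health` endpoint, avoids the false-ready window entirely.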
The Solution
Final Configuration That Works
```sh
./llama-server \
  --model /path/to/nemotron-30b-a3b-q6_k.gguf \
  --n-gpu-layers 99 \
  --threads 32 \
  --ctx-size 65535 \
  --port 8000
```
Key Parameters
| Parameter | Value | Reason |
|---|---|---|
| --n-gpu-layers | 99 | Max GPU offload |
| --threads | 32 | Threadripper cores |
| --ctx-size | 65535 | Max context |
| --port | 8000 | Consistent API port |
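With the server up on port 8000, any OpenAI-style client can talk to it. A minimal sketch using only the Python standard library (the prompt and timeout are placeholders; the endpoint path assumes llama.cpp's OpenAI-compatible API):

```python
import json
import urllib.request

def build_chat_payload(prompt, max_tokens=256):
    """OpenAI-style chat body accepted by /v1/chat/completions."""
    return {"messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens}

def chat(prompt, base_url="http://localhost:8000"):
    """Send one chat completion request and return the assistant's reply text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server from the section above to be running):
# print(chat("Summarize the GGUF format in one sentence."))
```

Because the endpoint speaks the OpenAI wire format, tools like OpenCode only need the base URL pointed at localhost:8000.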
Lessons Learned
Framework Matters
GGUF is llama.cpp's native format. Don't fight the framework; work with its strengths.
Memory Math
Always account for KV cache. Model size ≠ total memory needed.
Health Check Reality
Test actual inference, not just HTTP endpoints. Model loading happens asynchronously.
Patience Required
First-time setup takes time. Debug systematically and document solutions.
Performance Results
Achieved Metrics
| Metric | Value |
|---|---|
| Tokens/Second | 64.82 |
| Context Size | 65K tokens |
| VRAM Usage | ~45GB / 48GB |
| Load Time | ~16 seconds |
What Works
The setup works reliably for interactive use with 65K context. Response times are fast enough for real-time interaction.
Remaining Limitations
- Single request at a time (no batching)
- VRAM at 93% utilization
- No room for more layers
Next Steps
Potential Improvements
- Explore quantization trade-offs (Q5 vs Q6)
- Test longer context stability
- Multi-GPU load distribution
What to Avoid
- Don't use wrong format for framework
- Don't skip KV cache math
- Don't trust the health endpoint alone