Instructions to use blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled",
    filename="nemotron-120b-q4-k-m.gguf",
)
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
# Run inference directly in the terminal:
llama-cli -hf blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
# Run inference directly in the terminal:
llama-cli -hf blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
# Run inference directly in the terminal:
./llama-cli -hf blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
# Run inference directly in the terminal:
./build/bin/llama-cli -hf blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
Use Docker
docker model run hf.co/blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
- LM Studio
- Jan
- vLLM
How to use blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
Use Docker
docker model run hf.co/blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
- Ollama
How to use blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled with Ollama:
ollama run hf.co/blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
- Unsloth Studio
How to use blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled to start chatting
- Pi
How to use blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
Run Hermes
hermes
- Docker Model Runner
How to use blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled with Docker Model Runner:
docker model run hf.co/blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
- Lemonade
How to use blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
Run and chat with the model
lemonade run user.Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled-{{QUANT_TAG}}
List all available models
lemonade list
Beta Release - This is a beta release. A v2 is expected with more training data and improved training methodology. As of now, this model is fine-tuned exclusively on the nohurry/Opus-4.6-Reasoning-3000x-filtered dataset (2,326 reasoning traces from Claude Opus 4.6).
Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled
GGUF quantizations of the fine-tuned NVIDIA Nemotron-3-Super-120B-A12B, distilled from Claude Opus 4.6 reasoning traces.
Available Quantizations
| Quantization | File | Size | Description |
|---|---|---|---|
| Q4_K_M | nemotron-120b-q4-k-m.gguf | ~50 GB | Best balance of quality and size. Medium quality, recommended for most users. |
| Q8_0 | nemotron-120b-q8-0.gguf | ~120 GB | Near-lossless quantization. Best quality, requires more RAM. |
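As a rough aid for choosing between the two files, the sketch below picks the largest quantization that fits in a given amount of memory. The sizes are the approximate figures from the table above. The `headroom_gb` parameter and the `pick_quantization` helper are hypothetical, added here only to leave room for the context cache and the rest of the system. Actual memory use depends on context length and runtime settings.

```python
from typing import Optional

# Approximate file sizes in GB, copied from the table above.
QUANT_FILES = {
    "Q4_K_M": ("nemotron-120b-q4-k-m.gguf", 50),
    "Q8_0": ("nemotron-120b-q8-0.gguf", 120),
}


def pick_quantization(available_gb: float, headroom_gb: float = 8.0) -> Optional[str]:
    """Return the filename of the largest quantization that fits, or None.

    headroom_gb is an assumed safety margin, not a measured requirement.
    """
    fitting = [
        (size, filename)
        for filename, size in QUANT_FILES.values()
        if size + headroom_gb <= available_gb
    ]
    if not fitting:
        return None
    # Larger file means higher-precision weights, so prefer the biggest that fits.
    return max(fitting)[1]
```

For example, `pick_quantization(64)` selects the Q4_K_M file, while a machine with 32 GB gets `None` and cannot hold either file fully in memory under these assumptions.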
Model Details
| Property | Value |
|---|---|
| Base Model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 |
| Architecture | Nemotron-H (Mamba-2 SSM + MoE + Attention hybrid) |
| Parameters | 120B total / 12B active (MoE) |
| Fine-tuning Method | LoRA (r=32, alpha=64) merged into base weights |
| Training Data | nohurry/Opus-4.6-Reasoning-3000x-filtered |
| Epochs | 3 |
| Final Training Loss | 0.42 |
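For readers unfamiliar with the LoRA row above, the merge can be written out as follows. This assumes the standard LoRA scaling of alpha / r (not a rank-stabilized variant, which the card does not mention). The symbols W_0, B, and A are notation introduced here for the base weight and the two low-rank adapter matrices.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A
                  = W_0 + \frac{64}{32}\, B A
                  = W_0 + 2\, B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r = 32
```

Because the update is folded into the base weights, the merged model needs no adapter files at inference time.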
What's Different
This model has been fine-tuned on 2,326 high-quality reasoning traces from Claude Opus 4.6. The model produces structured reasoning with <think> tags before answering, similar to o1/reasoning-style models.
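Downstream code often wants the final answer without the reasoning. A minimal sketch of separating the two is shown here, assuming the model closes its reasoning with a matching `</think>` tag (the card only mentions `<think>` tags, so the closing tag is an assumption). `split_reasoning` is a hypothetical helper name.

```python
import re
from typing import Tuple

# Non-greedy match so only the first reasoning block is captured.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split raw model output into (reasoning, answer).

    If no complete <think>...</think> block is found, the whole
    text is returned as the answer with empty reasoning.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

If generation is cut off by the token limit before the closing tag appears, this sketch returns the truncated text as the answer, so callers may want to check for a dangling `<think>`.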
Usage
With llama.cpp
# Download the Q4_K_M quantization
huggingface-cli download blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled \
nemotron-120b-q4-k-m.gguf --local-dir ./models
# Run inference
./llama-cli -m ./models/nemotron-120b-q4-k-m.gguf \
-p "<|im_start|>system\nYou are a helpful reasoning assistant. Think step by step before answering.<|im_end|>\n<|im_start|>user\nWhat is 7 * 13?<|im_end|>\n<|im_start|>assistant\n" \
--temp 1.0 --top-p 0.95 -n 512
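The prompt string passed to `-p` above can also be assembled programmatically. The sketch below only mirrors the template used in that command. `build_prompt` is a hypothetical helper, and whether this template matches the chat template embedded in the GGUF file is not confirmed here.

```python
def build_prompt(system: str, user: str) -> str:
    """Assemble a single-turn prompt in the same layout as the llama-cli example."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        # Leave the assistant turn open so the model continues from here.
        "<|im_start|>assistant\n"
    )


prompt = build_prompt(
    "You are a helpful reasoning assistant. Think step by step before answering.",
    "What is 7 * 13?",
)
```

The resulting `prompt` contains real newlines, so it can be written to a file and passed to llama-cli with `-f` instead of escaping it on the command line.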
With Ollama
Create a Modelfile:
FROM ./nemotron-120b-q4-k-m.gguf
PARAMETER temperature 1.0
PARAMETER top_p 0.95
SYSTEM You are a helpful reasoning assistant. Think step by step before answering.
Then:
ollama create nemotron-reasoning -f Modelfile
ollama run nemotron-reasoning "What is the sum of all prime numbers less than 20?"
Recommended Sampling Parameters
Per NVIDIA's recommendation, use temperature=1.0 and top_p=0.95 across all tasks — reasoning, tool calling, and general chat alike.
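When calling a local OpenAI-compatible server, these values have to be sent with each request. The sketch below builds such a request using only the standard library. It assumes a llama.cpp server on its default address `http://localhost:8080/v1`, and `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Sampling values recommended above.
SAMPLING = {"temperature": 1.0, "top_p": 0.95}


def build_chat_request(
    prompt: str, base_url: str = "http://localhost:8080/v1"
) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint (does not send it)."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a server running, `urllib.request.urlopen(build_chat_request("What is 7 * 13?"))` sends the request and the JSON body of the response can be read from the returned object.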
Other Formats
- BF16 (safetensors): blobbybob/Nemotron-3-Super-120B-A12B-BF16-Claude-4.6-Opus-Reasoning-Distilled
- FP8 (safetensors): blobbybob/Nemotron-3-Super-120B-A12B-FP8-Claude-4.6-Opus-Reasoning-Distilled
Limitations
- Fine-tuned on only 2,326 examples — may not generalize to all domains
- Reasoning traces are from Claude Opus 4.6; model behavior reflects that style
- Beta release — expect improvements in v2
License
This model inherits the NVIDIA Nemotron Open Model License from the base model.