How to use from
Lemonade
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull majentik/Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q3_K_M:Q3_K_M
Run and chat with the model
lemonade run user.Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q3_K_M-Q3_K_M
List all available models
lemonade list
Quick Links

Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q3_K_M

GGUF Q3_K_M weight-quantized variant of nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 optimised for use with TurboQuant KV cache compression via a dedicated llama.cpp fork.

Important: TurboQuant KV cache types (planar3, iso3) are not available in upstream llama.cpp, standard Ollama, or LM Studio. They require a specific llama.cpp fork. The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).

Hardware compatibility

Device VRAM / RAM Recommendation
CPU host with β‰₯58 GB RAM ~58.1 GB works via llama.cpp; slower than GPU but no accelerator required
Apple Silicon (Metal) ~63.4 GB llama.cpp Metal backend; fast on M-series unified memory
NVIDIA GPU (partial offload) split between GPU + RAM offload as many layers as VRAM allows; rest on CPU

Overview

This model combines two independent compression techniques:

Technique What it does Requirement
GGUF Q3_K_M weight quantization Reduces model size from ~240 GB (BF16) to ~52.8 GB Any llama.cpp-compatible runtime
TurboQuant KV cache compression β€” random rotation + Lloyd-Max scalar quantization (--cache-type-k planar3 --cache-type-v planar3) Block-diagonal rotations / random rotation for compressed KV cache llama-cpp-turboquant fork only

Quickstart

Option A β€” With TurboQuant KV cache (fork required)

You must build from the TurboQuant-enabled llama.cpp fork:

# Clone and build the fork
git clone https://github.com/johndpope/llama-cpp-turboquant.git
cd llama-cpp-turboquant && git checkout feature/planarquant-kv-cache

# CUDA (Windows/Linux)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Run with TurboQuant KV cache
./build/bin/llama-cli -m Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q3_K_M.gguf \
  --cache-type-k planar3 --cache-type-v planar3 \
  -ngl 99 -fa \
  -p "Explain quantum computing"

# Or run as a server
./build/bin/llama-server -m Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q3_K_M.gguf \
  --cache-type-k planar3 --cache-type-v planar3 \
  -ngl 99 -fa --jinja

Option B β€” With standard llama.cpp / LM Studio / Ollama

The GGUF works as a normal quantised model. You won't get TurboQuant-specific KV cache benefits, but standard KV cache quantization (q8_0, q4_0) still reduces VRAM significantly.

llama.cpp (upstream)

llama-cli -m Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q3_K_M.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -ngl 99 -fa \
  -p "Explain quantum computing"

LM Studio

  1. Download the GGUF file and load in LM Studio.
  2. Enable Developer Mode (Settings β†’ Developer).
  3. In the model loader's advanced settings, set Flash Attention to ON.
  4. Set K Cache Quantization and V Cache Quantization to q8_0 (or q4_0 for more aggressive VRAM savings).
  5. Note: LM Studio does not currently support TurboQuant's planar3 cache types. Track this feature request for updates.

Ollama

# Standard Ollama does not support TurboQuant cache types.
# Use with default or q8_0 KV cache via OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama run majentik/Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q3_K_M

Specifications

Property Value
Base Model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
Architecture Mamba-2 + Transformer hybrid Sparse MoE
Parameters 120B total, 12B active per token
Context Length 1M
Weight Quantization GGUF Q3_K_M (compact 3-bit, slight quality loss)
Original Size (BF16) ~240 GB
Quantized File Size ~52.8 GB
KV Cache (TurboQuant) 3-bit via --cache-type-k planar3 --cache-type-v planar3 (fork only)
KV Cache (standard) q8_0, q4_0, f16, etc. (any llama.cpp runtime)
License other
Modalities Text only
Compatible Runtimes llama.cpp, LM Studio, Ollama, koboldcpp

What is TurboQuant?

TurboQuant (ICLR 2026) is a KV cache compression method that applies a random orthogonal rotation followed by optimal scalar quantization. Bit-identical prefill logits at 4-bit on tested models, with up to 4-8Γ— memory savings for long sequences.

Benchmarks from the TurboQuant repository (Llama 3.1 8B, RTX 5090 β€” results will vary by model and hardware):

Metric TurboQuant (4-bit) Standard q4_0
Quality Bit-identical prefill Lossy
KV Compression ~4Γ— vs FP16 ~4Γ— vs FP16
Speedup (Apple Silicon) 1.4–1.7Γ— β€”

Note: These benchmarks are from the TurboQuant repository using Llama 3.1 8B on an RTX 5090. Performance on Nemotron-3-Super-120B-A12B will differ. Independent benchmarks for this specific model are welcome β€” please open a discussion if you have results to share.

Current Status of TurboQuant in the Ecosystem

Runtime TurboQuant Support Standard KV Quant
llama.cpp (upstream) ❌ Not merged βœ… q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
llama-cpp-turboquant fork βœ… planar3 βœ… All standard types
LM Studio ❌ Requested βœ… Via advanced settings
Ollama ❌ Not supported βœ… Via OLLAMA_KV_CACHE_TYPE
koboldcpp ❌ Not supported βœ… Standard types

Recommended Settings

For VRAM-constrained setups, standard q8_0 KV cache quantization already halves KV cache memory with negligible quality impact. Flash Attention should always be enabled β€” it is required for V cache quantization and improves memory efficiency regardless.

VRAM Suggested Configuration
24 GB (RTX 4090) Q3_K_M + q8_0 KV cache + Flash Attention, 8K–16K context
16 GB Q3_K_M + q4_0 KV cache + Flash Attention, 4K–8K context
48+ GB Q3_K_M + f16 KV cache, full 32K+ context

See Also

Downloads last month
126
GGUF
Model size
121B params
Architecture
nemotron_h_moe
Hardware compatibility
Log In to add your hardware

3-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for majentik/Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q3_K_M

Quantized
(43)
this model

Paper for majentik/Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q3_K_M