---
library_name: mlx
license: other
license_name: nvidia-nemotron-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
pipeline_tag: text-generation
language:
- en
- fr
- es
- it
- de
- ja
- zh
tags:
- nvidia
- pytorch
- nemotron-3
- latent-moe
- mtp
- mlx
- turboquant
- quantization
- apple-silicon
- hybrid-quantization
- 48gb
- moe
- mamba
- hybrid
datasets:
- nvidia/nemotron-post-training-v3
- nvidia/nemotron-pre-training-datasets
track_downloads: true
base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
---

# Nemotron-3-Super-120B-A12B-tq3a-tq2e-g32 (48 GB hybrid)

**TurboQuant hybrid quantization** of [nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16) (3-bit attention + 2-bit experts at group_size=32), produced with [TurboQuant-MLX](https://github.com/manjunathshiva/turboquant-mlx).

This is the **48 GB-RAM** variant of the Nemotron-3 Super 120B quantization. The standard 3-bit (~50 GB) needs ~55 GB peak and only fits a 64 GB Mac after raising `iogpu.wired_limit_mb`. This hybrid keeps attention at 3-bit (where precision matters) and pushes experts to 2-bit (where the bulk of the weights live), dropping peak memory to ~40.8 GB and lifting decode speed to ~27.2 tok/s, so the model fits comfortably on a **48 GB or 64 GB Apple Silicon MacBook** with headroom for other apps.

## Model Details

- **Base Model**: [nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16) (hybrid Mamba + Sparse Attention + MoE, 120 B params total, ~12 B active per token)
- **Architecture**: 88 layers, hybrid override pattern `MEMEMEM*EMEMEMEM*…` (M = Mamba, E = MoE, * = Attention)
- **Experts**: 512 routed experts + 1 shared expert, latent MoE with `moe_latent_size = 1024`
- **Quantization**: TurboQuant hybrid (Hadamard rotation + Lloyd-Max codebook)
  - **Attention** (q/k/v/o_proj): **3-bit**
  - **MoE experts and shared expert**: **2-bit**
  - **Group size**: **32** (per-group scaling)
  - **Calibration data**: **none** (TurboQuant is data-free)
- **Size**: **~36 GB on disk** (vs ~240 GB BF16, ~6.7× smaller; vs the standard tq3 ~50 GB, 28% smaller)
- **Peak memory at decode**: **~40.8 GB**, which fits the default `iogpu.wired_limit_mb=49152` (48 GB) on a 64 GB Mac
- **Decode speed**: **~27.2 tok/s** (779-token generation, M-series MacBook, sampler-B config)
- **Runs on**: Apple Silicon (M1/M2/M3/M4) with **48 GB or more unified memory**

## Requirements

```bash
pip install "turboquant-mlx-full>=0.2.0" "mlx-lm>=0.31.3"
```

> ⚠️ Use **`turboquant-mlx-full` 0.2.0 or newer**. Earlier versions lack the per-layer `--attn-bits` / `--mlp-bits` plumbing required to load this hybrid model, the long-context kernel fix for prompts that span more than a few thousand tokens, and the v0.2 KV-cache CLI flags (`--kv-k-bits` / `--kv-v-bits` / `--kv-min-tokens`) shown below.

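
As a rough sanity check on the figures quoted in Model Details, the arithmetic can be reproduced in a few lines of Python. This is only a sketch: the constants are the approximate numbers from this card, GB is treated loosely as GiB when comparing against the wired limit, and `effective_bits` is a hypothetical helper that assumes one float16 scale is stored per group of 32 weights (the on-disk layout of TurboQuant-MLX may differ).

```python
# Approximate figures quoted in this model card.
BF16_GB = 240.0          # original BF16 checkpoint size
TQ3_GB = 50.0            # standard 3-bit TurboQuant size
HYBRID_GB = 36.0         # this hybrid, on disk
PEAK_GB = 40.8           # peak memory at decode
WIRED_LIMIT_MB = 49152   # wired limit quoted for a 64 GB Mac


def effective_bits(bits, group_size=32, scale_bits=16):
    """Bits per weight once one scale per group is amortised.

    Assumption for illustration: a single float16 scale per group and
    no other per-group metadata.
    """
    return bits + scale_bits / group_size


compression_vs_bf16 = BF16_GB / HYBRID_GB          # how many times smaller than BF16
saving_vs_tq3 = (TQ3_GB - HYBRID_GB) / TQ3_GB      # fraction saved vs standard tq3
headroom_gb = WIRED_LIMIT_MB / 1024 - PEAK_GB      # room left under the wired limit
expert_bpw = effective_bits(2)                     # 2-bit experts
attention_bpw = effective_bits(3)                  # 3-bit attention

print(f"vs BF16: {compression_vs_bf16:.1f}x smaller")
print(f"vs tq3:  {saving_vs_tq3:.0%} smaller")
print(f"headroom under wired limit: {headroom_gb:.1f} GB")
print(f"effective bits/weight: experts {expert_bpw}, attention {attention_bpw}")
```

Under these assumptions the per-group scale adds half a bit per weight, which is why a smaller group size trades a little storage for a tighter per-group fit.
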
## Quick Start

### Download the model

```bash
hf download manjunathshiva/Nemotron-3-Super-120B-A12B-tq3a-tq2e-g32 \
  --local-dir ~/models/nemotron-3-super-120b-tq3a-tq2e-g32
```

### Generate text (recommended config)

For prose, code, format, and long-context tasks, use the empirically validated decode config (see *Phase-1 known limitation* below for math/numeric prompts):

```bash
turboquant-generate \
  --model ~/models/nemotron-3-super-120b-tq3a-tq2e-g32 \
  --prompt "Why is the sky blue? Explain in detail." \
  --max-tokens 4096 --min-tokens 50 \
  --temp 0.7 --rep-penalty 1.04 --rep-ctx 256
```

The `--min-tokens 50` flag is required for Nemotron-3 Super: the model emits a `<think>` reasoning trace before its final answer, and the chat template primes EOS as the top-1 logit at the start of the assistant turn, so without a minimum length generation can end immediately.

The small repetition penalty (`--rep-penalty 1.04 --rep-ctx 256`) prevents long-form generation from collapsing into degenerate tail loops past ~1500 tokens. Without it, you may see em-dash runs or repeated phrases at the tail of long essays.

### Generate with TurboQuant KV cache (v0.2+): adds another ~30% RAM headroom

For long-context generation, layer the v0.2 KV-cache compression on top. Mixed `K8/V3` is required when the weights are TurboQuant-quantized; symmetric `K3` would compound the noise and break long-form output. The 128-token fp16 sink protects attention sinks at the prompt start.

```bash
turboquant-generate \
  --model ~/models/nemotron-3-super-120b-tq3a-tq2e-g32 \
  --prompt "Why is the sky blue? Explain in detail." \
  --max-tokens 4096 --min-tokens 50 \
  --temp 0.7 --rep-penalty 1.04 --rep-ctx 256 \
  --kv-k-bits 8 --kv-v-bits 3 --kv-min-tokens 128
```

## Phase-1 known limitation: math accuracy

**Step-by-step arithmetic on this hybrid is degraded under any non-zero `--rep-penalty`.** The 2-bit experts cause small slips in numeric reasoning that the repetition penalty doesn't compensate for.

For numeric/math prompts in this Phase-1 release, **omit `--rep-penalty`**:

```bash
# Math/numeric prompt: omit rep-penalty
turboquant-generate \
  --model ~/models/nemotron-3-super-120b-tq3a-tq2e-g32 \
  --prompt "A train leaves Boston at 9:00 AM going 60 mph..." \
  --max-tokens 2048 --min-tokens 50 \
  --temp 0.7
```

The trade-off: without the penalty you may see tail loops on prompts with very long outputs, but the arithmetic will land correctly more often. For serious numeric work, prefer the standard tq3 model: [`manjunathshiva/Nemotron-3-Super-120B-A12B-tq3`](https://huggingface.co/manjunathshiva/Nemotron-3-Super-120B-A12B-tq3).

A permanent fix is planned for **Phase 2** of TurboQuant-MLX, likely first/last-layer bit protection, a calibration-data codebook, or a fused QJL Metal kernel.

## Long-context support

The fused MoE decode kernel transparently chunks expert routings on long prompts (`K_CHUNK=4096`), so this hybrid handles long-context retrieval over 4000+ tokens of context without the kernel argument-validation crash that affected earlier builds. A 4000+ token "needle in a haystack" prompt (recall a password buried in 2000 words of filler on each side) recovers the password reliably.

## Results

Measured on a 64 GB MacBook M-series with macOS, MLX, and `turboquant-mlx-full` 0.1.6.

| Configuration | Size | Peak RAM | Fits 48 GB? | Speed |
|---|---|---|---|---|
| BF16 (original) | ~240 GB | n/a | ❌ | n/a |
| TurboQuant 3-bit (standard) | ~50 GB | ~55 GB | ❌ (needs sysctl) | ~19 tok/s |
| **TurboQuant hybrid (this repo)** | **~36 GB** | **~40.8 GB** | **✅** | **~27.2 tok/s** |

### Stress test summary (sampler-B config: `temp=0.7 rep_penalty=1.04 rep_ctx=256`)

| Test | Result |
|---|---|
| 1500-word essay (3500-tok budget) | ✅ clean: proper conclusion + references, no degenerate tail |
| Step-by-step math (train-meeting problem) | ⚠️ Phase-1 limitation: final number off |
| Python code generation (`merge_intervals` + 3 unit tests) | ✅ clean |
| Long-context needle (4000-tok password recall) | ✅ password recovered |
| Numbered-list format (5 benefits, ≤15 words each) | ✅ clean: exits `</think>`, exactly 5 lines |
| Open-ended explanation (4096-tok budget) | ✅ clean: terminates at ~1.5K tokens with proper structure |

## How It Works

TurboQuant applies, in one shot with no calibration data:

1. **Hadamard rotation**: a reversible orthogonal transform that flattens weight outliers, so all values land in a narrow range that 2-bit/3-bit quantization can represent without large error.
2. **Lloyd-Max codebook**: optimal scalar values (4 levels at 2-bit, 8 levels at 3-bit) chosen to minimize total quantization error. Codebooks are fixed and embedded in `config.json`.
3. **Group-wise scaling**: per-group float16 scales (group size **32**) preserve per-channel dynamic range. Smaller groups improve per-group fit at the cost of slightly larger storage.
4. **Hybrid bit allocation**: attention precision matters more for next-token coherence; experts dominate storage. Splitting attention to 3-bit and experts to 2-bit recovers most of the standard-tq3 quality at ~28% smaller size.
5. **Latent-MoE quantization**: Nemotron-3 Super's 512 experts share a 1024-dim latent space. Quantizing that shared space compresses every expert at once.

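
The first three steps above can be illustrated with a small, dependency-free Python sketch. It is not the TurboQuant-MLX implementation: every function name here is hypothetical, the codebook is fitted to samples from a standard normal distribution (one plausible way to get a fixed, data-free codebook), and the per-group scale is assumed to be the group's RMS. The demo quantizes one 32-value group containing a single large outlier, with and without rotation.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse)."""
    out = list(values)
    n = len(out)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                a, b = out[j], out[j + step]
                out[j], out[j + step] = a + b, a - b
        step *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in out]


def nearest(code, x):
    """Index of the codebook level closest to x."""
    return min(range(len(code)), key=lambda i: abs(x - code[i]))


def lloyd_max_gaussian(bits, n_samples=2000, iters=30, seed=0):
    """Fit 2**bits scalar levels to N(0, 1) samples with Lloyd's iteration."""
    rng = random.Random(seed)
    xs = sorted(rng.gauss(0.0, 1.0) for _ in range(n_samples))
    levels = 2 ** bits
    # Start from evenly spaced quantiles of the sample.
    code = [xs[int((i + 0.5) * n_samples / levels)] for i in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for x in xs:
            k = nearest(code, x)
            sums[k] += x
            counts[k] += 1
        # Move each level to the mean of the samples assigned to it.
        code = [sums[k] / counts[k] if counts[k] else code[k]
                for k in range(levels)]
    return code


def quantize_group(group, code):
    """Return (scale, indices); the RMS scale is an assumption of this sketch."""
    scale = math.sqrt(sum(v * v for v in group) / len(group)) or 1.0
    return scale, [nearest(code, v / scale) for v in group]


def dequantize_group(scale, indices, code):
    return [scale * code[i] for i in indices]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def demo(bits=2, group_size=32, seed=1):
    """Quantize one group with an outlier, without and with rotation."""
    rng = random.Random(seed)
    code = lloyd_max_gaussian(bits)
    weights = [rng.gauss(0.0, 0.02) for _ in range(group_size)]
    weights[5] = 1.0  # a single large outlier

    scale, idx = quantize_group(weights, code)
    plain = dequantize_group(scale, idx, code)

    rotated = fwht(weights)  # spreads the outlier across the group
    scale_r, idx_r = quantize_group(rotated, code)
    restored = fwht(dequantize_group(scale_r, idx_r, code))  # rotate back

    return squared_error(weights, plain), squared_error(weights, restored)


err_plain, err_rotated = demo()
print(f"squared error without rotation: {err_plain:.4f}")
print(f"squared error with rotation:    {err_rotated:.4f}")
```

Because the transform is orthonormal, the error measured after rotating back equals the error made in the rotated domain, so any gain comes from the rotated values being a better match for the fixed codebook.
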
## Architecture Notes

Nemotron-3 Super is a *hybrid* model. The layer pattern alternates between:

- **M**: Mamba state-space layers (cheap for long context)
- **E**: Mixture-of-Experts (512 routed experts, latent-MoE design)
- **\***: Sparse attention (used only where it helps)

Plus 1 MTP (multi-token-prediction) layer. There are no dense MLP layers; all FFN compute goes through the MoE.

`turboquant-mlx-full` 0.1.6 quantizes:

- Mamba `in_proj` / `out_proj` linears (2-bit)
- Attention QKV / O linears (**3-bit**, the hybrid distinction)
- The **latent-MoE projections** (`fc1_latent_proj`, `fc2_latent_proj`), the shared expert pantry (2-bit)
- The shared expert and MTP layer linears (2-bit)

Embeddings, layer norms, and small bias-style tensors stay in BF16 / FP16.

## Roadmap

- **Phase 1 (this release, v0.1.6)**: Hybrid quantization for 48 GB target + long-context kernel fix + recommended decode config. Math accuracy at long generation is a known limitation.
- **Phase 2 (planned)**: Permanent math accuracy fix. Candidates under evaluation: first/last-layer bit protection (architectural prior), calibration-data Lloyd-Max codebook (algorithmic), or a fused QJL Metal kernel (kernel-level).

## License

Released under the **NVIDIA Nemotron Open Model License** (same as the base model).
See https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/

## Acknowledgements

- **NVIDIA**: for releasing Nemotron-3-Super-120B-A12B-BF16 openly
- **Google Research**: for the original TurboQuant algorithm
- **Apple**: for MLX and the unified-memory architecture that makes this fit
- **`mlx-lm`** maintainers: for landing Nemotron-H + latent-MoE + MTP support in 0.31.3

## Citation

```bibtex
@article{zandieh2025turboquant,
  title  = {TurboQuant: A Unified Framework for Extremely Low-Bit Weight and KV Cache Quantization},
  author = {Zandieh, Amir and Han, Minsik and Dalca, Andre and Shin, Jungwoo and Wang, Brian and Zhang, Yichao and Bordegoni, Matteo and Tian, Yuan and others},
  year   = {2025}
}
```

## Repository

- **TurboQuant-MLX** (the conversion tool): https://github.com/manjunathshiva/turboquant-mlx
- **Standard tq3 variant** (~50 GB, needs sysctl bump on 64 GB): [`manjunathshiva/Nemotron-3-Super-120B-A12B-tq3`](https://huggingface.co/manjunathshiva/Nemotron-3-Super-120B-A12B-tq3)
- Issues / questions: https://github.com/manjunathshiva/turboquant-mlx/issues