---
library_name: mlx
license: other
license_name: nvidia-nemotron-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
pipeline_tag: text-generation
language:
- en
- fr
- es
- it
- de
- ja
- zh
tags:
- nvidia
- pytorch
- nemotron-3
- latent-moe
- mtp
- mlx
- turboquant
- quantization
- apple-silicon
- hybrid-quantization
- 48gb
- moe
- mamba
- hybrid
datasets:
- nvidia/nemotron-post-training-v3
- nvidia/nemotron-pre-training-datasets
track_downloads: true
base_model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
---

# Nemotron-3-Super-120B-A12B-tq3a-tq2e-g32 (48 GB hybrid)

**TurboQuant hybrid quantization** of [nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16) — 3-bit attention + 2-bit experts at group_size=32 — using [TurboQuant-MLX](https://github.com/manjunathshiva/turboquant-mlx).

This is the **48 GB-RAM** variant of the Nemotron-3 Super 120B quantization. The standard 3-bit build (~50 GB on disk) peaks at ~55 GB and only fits a 64 GB Mac after raising `iogpu.wired_limit_mb`. This hybrid keeps attention at 3-bit (where precision matters) and pushes experts to 2-bit (where the bulk of the weights live). Peak memory drops to ~40.8 GB and decode speed rises to ~27.2 tok/s, so the model fits on a **48 GB or 64 GB Apple Silicon MacBook** with headroom for other apps.

## Model Details

- **Base Model**: [nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16) (hybrid Mamba + Sparse Attention + MoE, 120 B params total, ~12 B active per token)
- **Architecture**: 88 layers, hybrid override pattern `MEMEMEM*EMEMEMEM*…` (M = Mamba, E = MoE, * = Attention)
- **Experts**: 512 routed experts + 1 shared expert, latent MoE with `moe_latent_size = 1024`
- **Quantization**: TurboQuant hybrid (Hadamard rotation + Lloyd-Max codebook)
  - **Attention** (q/k/v/o_proj): **3-bit**
  - **MoE experts and shared expert**: **2-bit**
  - **Group size**: **32** (per-group scaling)
- **Calibration data**: **none** — TurboQuant is data-free
- **Size**: **~36 GB on disk** (vs ~240 GB BF16, ~6.7× smaller; vs the standard tq3 ~50 GB, 28% smaller)
- **Peak memory at decode**: **~40.8 GB** — fits the default `iogpu.wired_limit_mb=49152` (48 GB) on a 64 GB Mac
- **Decode speed**: **~27.2 tok/s** (779-token generation, M-series MacBook, sampler-B config)
- **Runs on**: Apple Silicon (M1/M2/M3/M4) with **48 GB or more unified memory**
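
As a rough sanity check on the size figures above, the sketch below estimates bits per weight when each group of 32 codes shares one float16 scale. It is a back-of-envelope calculation, not a description of the actual file layout: the helper names are made up for illustration, and it ignores codebooks, the 3-bit attention share, and the tensors left in BF16.

```python
def effective_bits(code_bits: int, group_size: int = 32, scale_bits: int = 16) -> float:
    """Bits per weight once one scale per group is amortised over the group."""
    return code_bits + scale_bits / group_size


def size_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage in decimal gigabytes for n_params weights."""
    return n_params * bits_per_weight / 8 / 1e9


experts_bpw = effective_bits(2)    # 2-bit codes + 16/32 bits of scale
attention_bpw = effective_bits(3)  # 3-bit codes + 16/32 bits of scale

print("2-bit effective bits/weight:", experts_bpw)
print("3-bit effective bits/weight:", attention_bpw)
# Upper-level estimate if all 120 B weights were stored at the 2-bit rate.
print("estimated size (GB):", round(size_gb(120e9, experts_bpw), 1))
```

The estimate lands in the same range as the reported ~36 GB, which is all this kind of arithmetic can show; the exact figure depends on storage details not covered here.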

## Requirements

```bash
pip install "turboquant-mlx-full>=0.2.0" "mlx-lm>=0.31.3"
```

> ⚠️ Use **`turboquant-mlx-full` 0.2.0 or newer** — earlier versions don't have the per-layer `--attn-bits` / `--mlp-bits` plumbing required to load this hybrid model, the long-context kernel fix for prompts that span more than a few thousand tokens, or the v0.2 KV-cache CLI flags (`--kv-k-bits` / `--kv-v-bits` / `--kv-min-tokens`) shown below.
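
To confirm the installed versions before loading the model, a small check along these lines may help. It only uses the standard library's `importlib.metadata`; the helper names `version_tuple` and `meets_minimum` are illustrative and not part of either package.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '0.31.3' into (0, 31, 3); stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(package: str, minimum: str) -> bool:
    """True if the package is installed at or above the minimum version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


for package, minimum in [("turboquant-mlx-full", "0.2.0"), ("mlx-lm", "0.31.3")]:
    status = "OK" if meets_minimum(package, minimum) else f"needs >= {minimum}"
    print(package, status)
```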

## Quick Start

### Download the model

```bash
hf download manjunathshiva/Nemotron-3-Super-120B-A12B-tq3a-tq2e-g32 \
    --local-dir ~/models/nemotron-3-super-120b-tq3a-tq2e-g32
```

### Generate text — recommended config

For prose, code, format, and long-context tasks, use the empirically-validated decode config (see *Phase-1 known limitation* below for math/numeric prompts):

```bash
turboquant-generate \
    --model ~/models/nemotron-3-super-120b-tq3a-tq2e-g32 \
    --prompt "Why is the sky blue? Explain in detail." \
    --max-tokens 4096 --min-tokens 50 \
    --temp 0.7 --rep-penalty 1.04 --rep-ctx 256
```

The `--min-tokens 50` flag is required for Nemotron-3 Super. The model emits a `<think>` reasoning trace before its final answer, and the chat template primes EOS as the top-1 logit at the start of the assistant turn, so without a minimum length the generation can end before any text is produced.

The small repetition penalty (`--rep-penalty 1.04 --rep-ctx 256`) prevents long-form generation from collapsing into degenerate tail loops past ~1500 tokens. Without it, you may see em-dash runs or repeated phrases at the tail of long essays.
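
To make the two sampler controls concrete, here is a toy sketch of a minimum-length EOS mask and the commonly used multiplicative repetition penalty, applied to a 4-token vocabulary. Whether `turboquant-generate` implements either control exactly this way is an assumption; the function names and the toy logits are invented for illustration.

```python
def apply_repetition_penalty(logits, recent_tokens, penalty=1.04):
    """Return a copy of logits with recently generated token ids made less likely."""
    out = list(logits)
    for tok in set(recent_tokens):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def mask_eos(logits, eos_id, n_generated, min_tokens=50):
    """Block EOS until min_tokens tokens have been generated."""
    out = list(logits)
    if n_generated < min_tokens:
        out[eos_id] = float("-inf")
    return out


logits = [2.0, -1.0, 0.5, 3.0]  # toy vocabulary; id 3 plays the role of EOS
history = [0, 1, 0]             # token ids generated so far

penalised = apply_repetition_penalty(logits, history[-256:])  # rep-ctx window
first_step = mask_eos(penalised, eos_id=3, n_generated=0)

# EOS had the largest raw logit, but the mask forces a real token instead.
print("chosen token id:", first_step.index(max(first_step)))
```

With a penalty as small as 1.04 the logits of recent tokens move only slightly, which is consistent with it curbing loops without visibly changing ordinary prose.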

### Generate with TurboQuant KV cache (v0.2+) — adds another ~30% RAM headroom

For long-context generation, layer the v0.2 KV-cache compression on top. Mixed `K8/V3` is required when the weights are TurboQuant-quantized: quantizing keys and values symmetrically at 3 bits would compound the weight-quantization noise and break long-form output. The 128-token fp16 sink protects attention sinks at the prompt start.

```bash
turboquant-generate \
    --model ~/models/nemotron-3-super-120b-tq3a-tq2e-g32 \
    --prompt "Why is the sky blue? Explain in detail." \
    --max-tokens 4096 --min-tokens 50 \
    --temp 0.7 --rep-penalty 1.04 --rep-ctx 256 \
    --kv-k-bits 8 --kv-v-bits 3 --kv-min-tokens 128
```

## Phase-1 known limitation: math accuracy

**Step-by-step arithmetic on this hybrid is degraded whenever `--rep-penalty` is enabled.** The 2-bit experts cause small slips in numeric reasoning that the repetition penalty doesn't compensate for. For numeric/math prompts in this Phase-1 release, **omit `--rep-penalty`**:

```bash
# Math/numeric prompt — omit rep-penalty
turboquant-generate \
    --model ~/models/nemotron-3-super-120b-tq3a-tq2e-g32 \
    --prompt "A train leaves Boston at 9:00 AM going 60 mph..." \
    --max-tokens 2048 --min-tokens 50 \
    --temp 0.7
```

The trade-off: without the penalty you may see long-gen tail loops on prompts with very long outputs, but the arithmetic will land correctly more often. For serious numeric work, prefer the standard tq3 model: [`manjunathshiva/Nemotron-3-Super-120B-A12B-tq3`](https://huggingface.co/manjunathshiva/Nemotron-3-Super-120B-A12B-tq3).
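
If you script generation, one way to encode the rule above is a small command builder that drops the repetition-penalty flags for numeric prompts. This is only a sketch: `build_command` is a hypothetical helper, it uses only the CLI flags already shown in this README, and it prints the command rather than running it.

```python
import os
import shlex


def build_command(model_dir, prompt, numeric=False, max_tokens=4096):
    """Assemble a turboquant-generate invocation from the flags in this README."""
    cmd = [
        "turboquant-generate",
        "--model", os.path.expanduser(model_dir),
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
        "--min-tokens", "50",
        "--temp", "0.7",
    ]
    if not numeric:
        # Prose, code, and format prompts keep the small repetition penalty.
        cmd += ["--rep-penalty", "1.04", "--rep-ctx", "256"]
    return cmd


cmd = build_command(
    "~/models/nemotron-3-super-120b-tq3a-tq2e-g32",
    "A train leaves Boston at 9:00 AM going 60 mph...",
    numeric=True,
    max_tokens=2048,
)
print(shlex.join(cmd))
```

Pass the resulting list to `subprocess.run` if you want to execute it; deciding whether a prompt counts as numeric is left to the caller.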

A permanent fix is planned for **Phase 2** of TurboQuant-MLX — likely first/last-layer bit protection, a calibration-data codebook, or a fused QJL Metal kernel.

## Long-context support

The fused MoE decode kernel transparently chunks expert routings on long prompts (`K_CHUNK=4096`), so this hybrid handles long-context retrieval over 4000+ tokens of context without the kernel argument-validation crash that affected earlier builds.
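
The chunking idea can be pictured as splitting a long list of work items into fixed-size ranges. The sketch below is an illustration only: it assumes the unit being chunked is a flat list of routing entries, and `chunk_ranges` is a made-up helper, not the kernel's actual code.

```python
K_CHUNK = 4096


def chunk_ranges(n_items, chunk=K_CHUNK):
    """Split [0, n_items) into consecutive half-open ranges of at most `chunk` items."""
    return [(start, min(start + chunk, n_items)) for start in range(0, n_items, chunk)]


# 10,000 items become three launches instead of one oversized one.
print(chunk_ranges(10000))
```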

On a 4000+ token "needle in a haystack" prompt (a password buried between 2000 words of filler on each side), the model recovers the password reliably.
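
A prompt of this shape can be generated with a few lines of Python. The filler sentence, the password, and the question wording below are placeholders, not the exact prompt used for the test in this README.

```python
def build_needle_prompt(password, filler_words=2000):
    """Bury one password sentence between two blocks of repeated filler text."""
    filler_sentence = "The quick brown fox jumps over the lazy dog."  # 9 words
    repeats = filler_words // 9 + 1
    words = " ".join([filler_sentence] * repeats).split()[:filler_words]
    filler = " ".join(words)
    return (
        f"{filler}\n"
        f"The secret password is {password}.\n"
        f"{filler}\n\n"
        "What is the secret password?"
    )


prompt = build_needle_prompt("tangerine-42")
print(len(prompt.split()), "words")
```

Word count is only a proxy for token count, so check the tokenized length if you need to hit a specific context size.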

## Results

Measured on a 64 GB M-series MacBook with macOS, MLX, and `turboquant-mlx-full` 0.1.6.

| Configuration | Size | Peak RAM | Fits 48 GB? | Speed |
|---|---|---|---|---|
| BF16 (original)                              | ~240 GB | — | ❌ | n/a |
| TurboQuant 3-bit (standard)                  | ~50 GB  | ~55 GB | ❌ (needs sysctl) | ~19 tok/s |
| **TurboQuant hybrid (this repo)** | **~36 GB** | **~40.8 GB** | **✅** | **~27.2 tok/s** |

### Stress test summary (sampler-B config: `temp=0.7 rep_penalty=1.04 rep_ctx=256`)

| Test | Result |
|---|---|
| 1500-word essay (3500-tok budget)            | ✅ clean — proper conclusion + references, no degenerate tail |
| Step-by-step math (train-meeting problem)    | ⚠️ Phase-1 limitation — final number off |
| Python code generation (`merge_intervals` + 3 unit tests) | ✅ clean |
| Long-context needle (4000-tok password recall) | ✅ password recovered |
| Numbered-list format (5 benefits, ≤15 words each) | ✅ clean — exits `<think>`, exactly 5 lines |
| Open-ended explanation (4096-tok budget)     | ✅ clean — terminates at ~1.5K tokens with proper structure |

## How It Works

TurboQuant applies, in one shot with no calibration data:

1. **Hadamard rotation** — a reversible orthogonal transform that flattens weight outliers, so all values land in a narrow range that 2-bit/3-bit quantization can represent without large error.
2. **Lloyd-Max codebook** — optimal scalar values (4 levels at 2-bit, 8 levels at 3-bit) chosen to minimize total quantization error. Codebooks are fixed and embedded in `config.json`.
3. **Group-wise scaling** — per-group float16 scales (group size **32**) preserve per-channel dynamic range. Smaller groups improve per-group fit at the cost of slightly larger storage.
4. **Hybrid bit allocation** — attention precision matters more for next-token coherence; experts dominate storage. Splitting attention to 3-bit and experts to 2-bit recovers most of the standard-tq3 quality at ~28% smaller size.
5. **Latent-MoE quantization** — Nemotron-3 Super's 512 experts share a 1024-dim latent space. Quantizing that shared space compresses every expert at once.
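
The first three steps can be illustrated with a toy, pure-Python sketch. It is not the TurboQuant-MLX implementation: it uses a plain Hadamard transform without any extra randomisation, fits the codebook to synthetic Gaussian samples, and scales each group by its RMS, all of which are assumptions made for the example.

```python
import math
import random


def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse).

    len(x) must be a power of two.
    """
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in x]


def nearest(value, levels):
    """Index of the codebook level closest to value."""
    return min(range(len(levels)), key=lambda i: abs(value - levels[i]))


def lloyd_max(samples, n_levels, iters=20):
    """Alternate nearest-level assignment and centroid updates (Lloyd's algorithm)."""
    ordered = sorted(samples)
    # Start from evenly spaced quantiles of the data.
    levels = [ordered[(2 * i + 1) * len(ordered) // (2 * n_levels)] for i in range(n_levels)]
    for _ in range(iters):
        buckets = [[] for _ in levels]
        for s in ordered:
            buckets[nearest(s, levels)].append(s)
        levels = [sum(b) / len(b) if b else lv for b, lv in zip(buckets, levels)]
    return levels


def mse(samples, levels):
    """Mean squared error of snapping each sample to its nearest level."""
    return sum((s - levels[nearest(s, levels)]) ** 2 for s in samples) / len(samples)


def quantize_dequantize(values, codebook):
    """Scale a group by its RMS (assumed rule), snap to the codebook, scale back."""
    scale = math.sqrt(sum(v * v for v in values) / len(values)) or 1.0
    return [codebook[nearest(v / scale, codebook)] * scale for v in values]


def sse(a, b):
    """Sum of squared differences between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


random.seed(0)

# Step 1: one group of 32 weights with a single large outlier.
group = [random.gauss(0, 0.02) for _ in range(32)]
group[5] = 1.0
rotated = hadamard(group)
peak_before = max(abs(v) for v in group)
peak_after = max(abs(v) for v in rotated)

# Step 2: a 4-level (2-bit) codebook fitted to unit-variance Gaussian samples,
# compared with 4 evenly spaced levels over the same range.
samples = [random.gauss(0, 1) for _ in range(2000)]
codebook = lloyd_max(samples, n_levels=4)
lo, hi = min(samples), max(samples)
uniform = [lo + (i + 0.5) * (hi - lo) / 4 for i in range(4)]

# Steps 1 and 3 together: quantize the group directly and through the rotation.
direct = quantize_dequantize(group, codebook)
via_rotation = hadamard(quantize_dequantize(hadamard(group), codebook))

print("peak before / after rotation:", peak_before, peak_after)
print("MSE Lloyd-Max vs uniform:", mse(samples, codebook), mse(samples, uniform))
print("SSE direct vs rotated:", sse(group, direct), sse(group, via_rotation))
```

In this toy case the rotation spreads the outlier's energy across the whole group, so the 2-bit codebook no longer has to clip one huge value, and the reconstruction error after rotating back is lower than quantizing the raw group.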

## Architecture Notes

Nemotron-3 Super is a *hybrid* model — the layer pattern alternates between:

- **M** — Mamba state-space layers (cheap for long context)
- **E** — Mixture-of-Experts (512 routed experts, latent-MoE design)
- **\*** — Sparse attention (used only where it helps)

Plus 1 MTP (multi-token-prediction) layer. There are no dense MLP layers — all FFN compute goes through the MoE.
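
The pattern string can be read mechanically, one character per layer. The sketch below counts layer kinds in the visible prefix of the pattern quoted under Model Details; the full 88-layer string is elided in this README, so the counts cover only those 17 characters. `count_layers` is an illustrative helper.

```python
from collections import Counter

LAYER_KINDS = {"M": "mamba", "E": "moe", "*": "attention"}


def count_layers(pattern):
    """Count how many layers of each kind a pattern string describes."""
    return dict(Counter(LAYER_KINDS[ch] for ch in pattern))


prefix = "MEMEMEM*EMEMEMEM*"  # visible prefix only; the rest is elided in the README
print(count_layers(prefix))
```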

`turboquant-mlx-full` 0.1.6 quantizes:

- Mamba `in_proj` / `out_proj` linears (2-bit)
- Attention QKV / O linears (**3-bit** — the hybrid distinction)
- The **latent-MoE projections** (`fc1_latent_proj`, `fc2_latent_proj`) — the shared expert pantry (2-bit)
- The shared expert and MTP layer linears (2-bit)

Embeddings, layer norms, and small bias-style tensors stay in BF16 / FP16.
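
The split described above amounts to a name-based rule: attention projections get 3 bits, other quantized linears get 2 bits, and embeddings, norms, and biases are skipped. The sketch below expresses that rule; the parameter paths in the example are hypothetical, and the real tool's matching logic may differ.

```python
ATTENTION_PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")
UNQUANTIZED_MARKERS = ("embed", "norm", "bias")


def bits_for(param_name):
    """Return the bit width for a parameter, or None if it stays unquantized."""
    if any(marker in param_name for marker in UNQUANTIZED_MARKERS):
        return None
    module = param_name.split(".")[-2] if "." in param_name else param_name
    if module in ATTENTION_PROJECTIONS:
        return 3
    return 2


# Hypothetical parameter paths, used only to exercise the rule.
examples = [
    "layers.7.attn.q_proj.weight",
    "layers.0.mamba.in_proj.weight",
    "layers.1.moe.fc1_latent_proj.weight",
    "layers.1.norm.weight",
    "embed_tokens.weight",
]
for name in examples:
    print(name, "->", bits_for(name))
```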

## Roadmap

- **Phase 1 (this release, v0.1.6)** — Hybrid quantization for 48 GB target + long-context kernel fix + recommended decode config. Math accuracy at long generation is a known limitation.
- **Phase 2 (planned)** — Permanent math accuracy fix. Candidates under evaluation: first/last-layer bit protection (architectural prior), calibration-data Lloyd-Max codebook (algorithmic), or a fused QJL Metal kernel (kernel-level).

## License

Released under the **NVIDIA Nemotron Open Model License** (same as the base model). See https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/

## Acknowledgements

- **NVIDIA** — for releasing Nemotron-3-Super-120B-A12B-BF16 openly
- **Google Research** — for the original TurboQuant algorithm
- **Apple** — for MLX and the unified-memory architecture that makes this fit
- **`mlx-lm`** maintainers — for landing Nemotron-H + latent-MoE + MTP support in 0.31.3

## Citation

```bibtex
@article{zandieh2025turboquant,
  title   = {TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author  = {Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal = {arXiv preprint arXiv:2504.19874},
  year    = {2025}
}
```

## Repository

- **TurboQuant-MLX** (the conversion tool): https://github.com/manjunathshiva/turboquant-mlx
- **Standard tq3 variant** (~50 GB, needs sysctl bump on 64 GB): [`manjunathshiva/Nemotron-3-Super-120B-A12B-tq3`](https://huggingface.co/manjunathshiva/Nemotron-3-Super-120B-A12B-tq3)
- Issues / questions: https://github.com/manjunathshiva/turboquant-mlx/issues