Qwen 3.6 35B-A3B (Q4_K_M)

by fraQtl · calibration-aware, MoE-aware

Same size as a standard Q4_K_M. Measurably closer to the Q8 teacher across every measured lane (code/math, chat, tool_calling, long-form text).

A drop-in Q4_K_M for Qwen 3.6 35B-A3B. Identical file size, identical kernel path, identical loader. ~23% lower output-distribution divergence from the Q8 teacher on code/math and ~8% lower on general (chat + tool calling + long-form text), measured against a public Q4_K_M baseline on the same held-out slices, same prompts, same temperature.

No retraining. No custom runtime. Standard llama.cpp Q4_K_M kernel. The win is in the calibration and per-tensor bit allocation.

💻 This is the local / consumer build. It runs on Apple Silicon (M-series) and consumer GPUs with stock llama.cpp: no patched runtime, no special flags. A separate MTP runtime variant targets datacenter speculative decoding (its 1.49× decode speedup is A100-80GB only and shows no speedup on consumer hardware), so for local use, this is the build you want.


Quickstart

huggingface-cli download fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF \
  Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf --local-dir .

./llama-cli -m Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf \
  -p "Write a Python function that returns the nth Fibonacci number." \
  -n 256 --temp 0.2

Or via llama-server for an OpenAI-compatible local API:

llama-server -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M
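
Once the server is running, any OpenAI-style client can talk to it. The sketch below is a minimal example using only the Python standard library. It assumes llama-server is listening on its default address (http://localhost:8080) and exposes the /v1/chat/completions route; the helper names build_chat_request and ask are illustrative, not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: llama-server is listening on its default host and port.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt: str, temperature: float = 0.2,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling ask("Write a Python function that returns the nth Fibonacci number.") mirrors the llama-cli example above, with the same temperature and token limit.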

Apple Silicon (M-series)

Verified on Apple M4 / 24 GB unified memory in CPU mode (-ngl 0):

llama-server -m Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf -ngl 0 -c 2048

Observed on this hardware:

  • Cold load: ~128 s
  • Decode: ~4.9 tok/s
  • Output: coherent (correct Fibonacci function)

The 20 GB file fits in 24 GB RAM in CPU mode, and because this is an A3B MoE (3 B active params per token), CPU decode is usable.

Full Metal offload is not verified for this card. On Apple M4 / 24 GB, both -ngl 99 and -ngl 28 fail with a Metal out-of-memory error (kIOGPUCommandBufferCallbackErrorOutOfMemory) because the ~20 GB GGUF exceeds the practical ~16.8 GB Metal allocation ceiling on a 24 GB machine. Use CPU mode (-ngl 0) on 24 GB Apple Silicon.
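
The rule of thumb in the paragraph above can be written down as a small helper. This is a sketch only: choose_ngl is a hypothetical name, the 20 GB and 16.8 GB figures are the ones reported above for a 24 GB M4, and the ceiling for other machines is not given by this card. Fitting under the ceiling is treated as necessary, not sufficient, since the KV cache and compute buffers also need room.

```python
def choose_ngl(model_gb: float, metal_ceiling_gb: float,
               full_offload: int = 99) -> int:
    """Return 0 (CPU mode) unless the whole GGUF fits under the Metal ceiling.

    Partial offload is deliberately not attempted: on the 24 GB M4 above,
    -ngl 28 failed with the same out-of-memory error as -ngl 99.
    """
    return full_offload if model_gb < metal_ceiling_gb else 0


# Figures from this card for an Apple M4 with 24 GB unified memory.
print(choose_ngl(model_gb=20.0, metal_ceiling_gb=16.8))  # 0
```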

32 GB+ Apple Silicon: full Metal offload (-ngl 99) is expected to be more viable, but is not yet receipt-backed by us. Treat as experimental until we publish a hardware receipt.

Strict honesty note: this receipt was produced on the Hi-Fi MTP-runtime GGUF, which shares the same main-model quantization as this file. A strict receipt for the non-MTP Hi-Fi file itself is pending.

Works in any standard llama.cpp consumer.


Quality: measured (not claimed)

Same Q4_K_M file size, same llama.cpp kernel path, measured against two leading public Q4-class quants of the same base model. All five metrics run on the same eval harness against the same baselines.

  • KLD = symmetric top-20 vs the Q8 teacher, restricted to the Q8 support
  • Wikitext-2 PPL on the standard test split
  • GSM8K on the 1,319-question test set, 0-shot, standard chain-of-thought prompt
  • MATH-500 on the standard 500-question slice

| Metric | fraQtl Hi-Fi (this build) | Public Q4_K_M baseline | Public IQ4_XS baseline |
| --- | --- | --- | --- |
| KLD vs Q8, code/math ↓ | 0.02074 | 0.02700 | 0.03556 |
| KLD vs Q8, general ↓ | 0.04965 | 0.05423 | 0.06993 |
| Wikitext-2 PPL ↓ | 8.0845 | 8.1139 | 8.2195 |
| GSM8K accuracy ↑ | 89.0% | 90.5% | 87.5% |
| MATH-500 accuracy ↑ | 35.4% | 38.8% | 30.2% |
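
For readers who want to see what a "symmetric top-20 KLD restricted to the Q8 support" could look like for a single token position, here is a minimal pure-Python sketch. It is not the eval harness used for this card. Two details are assumptions: both distributions are renormalized over the teacher's top-k tokens, and "symmetric" is taken to mean the mean of the two KL directions. The function names are illustrative.

```python
import math

EPS = 1e-12  # guards against log(0) if a probability underflows


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def symmetric_topk_kld(teacher_logits, student_logits, k=20):
    """Symmetric KL over the teacher's top-k tokens, renormalized on that support."""
    p_full = softmax(teacher_logits)
    q_full = softmax(student_logits)
    # Support = the k tokens the teacher (Q8) ranks highest.
    support = sorted(range(len(p_full)), key=lambda i: p_full[i], reverse=True)[:k]
    p_mass = sum(p_full[i] for i in support)
    q_mass = sum(q_full[i] for i in support)
    p = [max(p_full[i] / p_mass, EPS) for i in support]
    q = [max(q_full[i] / max(q_mass, EPS), EPS) for i in support]
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    # Assumption: "symmetric" means the mean of the two directions.
    return 0.5 * (kl_pq + kl_qp)
```

A per-slice score would then be an average of this value over all evaluated token positions; how the card's harness aggregates is not stated here.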

Wins both KLD lanes and Wikitext-2 PPL. Trails the public Q4_K_M baseline on GSM8K and MATH-500, but not by a statistically significant margin.

  • KLD code/math: −23% vs Public Q4_K_M, −42% vs Public IQ4_XS
  • KLD general: −8% vs Public Q4_K_M, −29% vs Public IQ4_XS
  • Wikitext-2 PPL: lowest of the three
  • GSM8K and MATH-500: differences vs Public Q4_K_M sit inside the statistical noise floor at these sample sizes (1.5 pp on 1,319 questions ≈ 1.1σ; 3.4 pp on 500 questions ≈ 1.5σ). On MATH-500 we win +5.2 pp vs Public IQ4_XS.
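
The percentages in the list above follow directly from the table, and the significance claim can be sanity-checked with a simple test. The sketch below recomputes the relative KLD changes and applies a pooled, unpaired two-proportion z-test to the accuracy gaps. The card does not say which test produced its σ figures, so the z-test here is an assumption and its values need not match them exactly; only the conclusion (no gap reaches the usual 1.96 threshold) is checked.

```python
import math


def relative_change(ours: float, baseline: float) -> float:
    """Signed relative change of `ours` versus `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


def two_proportion_z(acc_a: float, acc_b: float, n: int) -> float:
    """Pooled, unpaired z-score for two accuracies measured on n questions each."""
    pooled = (acc_a + acc_b) / 2.0
    std_err = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n)
    return (acc_a - acc_b) / std_err


# KLD rows from the table: fraQtl vs public Q4_K_M, then vs public IQ4_XS.
print(round(relative_change(0.02074, 0.02700)))  # -23
print(round(relative_change(0.02074, 0.03556)))  # -42
print(round(relative_change(0.04965, 0.05423)))  # -8
print(round(relative_change(0.04965, 0.06993)))  # -29

# Accuracy gaps vs public Q4_K_M: |z| < 1.96 means not significant at the 5% level.
print(abs(two_proportion_z(0.890, 0.905, 1319)) < 1.96)  # True
print(abs(two_proportion_z(0.354, 0.388, 500)) < 1.96)   # True
```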

Reproducibility: three independent eval runs reproduced KLD to five decimal places (drift 0.00000). Build + eval pipeline is deterministic.


What's different

A higher-fidelity Q4_K_M of Qwen 3.6 35B-A3B (MoE, 256 routed experts), built with two changes vs a stock Q4_K_M:

  1. Per-tensor protection policy. Architecturally critical tensors (router, attention input projections, shared FFN) are quantized at higher precision; routed experts stay at the Q4_K floor. Same total size, smarter bit allocation.
  2. Calibration tuned to a measured optimum. Imatrix budget set to the empirically best point on this packet (see Calibration section).

The .gguf is a standard Q4_K_M; any llama.cpp build that runs Q4_K_M runs this. No patched runtime, no special flags.


Prompt format

Qwen 3.6 chat template, with optional <think> pre-fill for chain-of-thought:

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
<think>
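
When calling a raw completion interface rather than a chat endpoint, the template above has to be rendered by hand. A minimal sketch follows; build_prompt is a hypothetical helper, and the exact whitespace (in particular the newline after <think>) is an assumption based on the layout shown. The chat template embedded in the GGUF remains the authoritative version.

```python
def build_prompt(system_prompt: str, prompt: str, think: bool = True) -> str:
    """Render the chat template shown above, optionally pre-filling <think>."""
    text = (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    if think:
        # Pre-fill the opening tag so the model starts with its reasoning.
        text += "<think>\n"
    return text
```

Passing think=False leaves the assistant turn open without the pre-fill.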

Calibration

  • Packet: ~414K tokens of curated code + math (worked solutions, multiple languages, mostly Python).
  • Imatrix budget: 256K tokens, the measured optimum on this packet.
  • Decontamination: packet decontaminated against GSM8K test, MATH-500, and Hendrycks MATH-train. Eval slices are disjoint from calibration content.
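
The card does not describe how decontamination was carried out. One common approach is an n-gram overlap filter, sketched below purely as an illustration: the function names, the whitespace tokenization, and the 13-token window are all assumptions, not the procedure used for this packet.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercased whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(sample: str, benchmark_items: list, n: int = 13) -> bool:
    """True if the sample shares at least one n-gram with any benchmark item."""
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(item, n) for item in benchmark_items)


def decontaminate(packet: list, benchmark_items: list, n: int = 13) -> list:
    """Keep only calibration samples with no n-gram overlap."""
    return [s for s in packet if not is_contaminated(s, benchmark_items, n)]
```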

Why 256K and not "use the whole packet": a 384K budget produced a measurably worse artifact on the same eval slice (+7.20% relative KLD vs the 256K build, byte-identical substrate). The calibration-budget curve is non-monotonic; more is not always better.

Calibration budget is a real, measurable lever, but on this packet it has a measured peak (256K tokens) rather than a monotone curve.
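
In practice this means choosing the budget by measurement rather than by size. A minimal sketch of that selection step, using only the two data points reported above (KLD normalized to the 256K build, with the 384K build at +7.20%); best_budget is an illustrative name:

```python
def best_budget(kld_by_budget: dict) -> str:
    """Return the budget label with the lowest measured KLD."""
    return min(kld_by_budget, key=kld_by_budget.get)


# KLD relative to the 256K build; 1.0720 encodes the +7.20% reported above.
measured = {"256K": 1.0000, "384K": 1.0720}
print(best_budget(measured))  # 256K
```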

The included imatrix_fraQtl_256k.dat makes the calibration step independently reproducible.


Per-tensor protection policy (summary)

Same total bit budget, smarter spend. The recipe protects the architecturally critical tensors and quantizes the bulk routed experts at the Q4_K floor:

| Family | Quant types | Why |
| --- | --- | --- |
| Router / gate (40) | F32 | Top damage tensors by score signal |
| Linear attention QKV & gate (60) | Q8_0 | Top-4 damage tensors in the model |
| Standard attention Q/K/V (30) | Q8_0 | Input projection protection |
| Routed-expert up / gate / down (120) | mostly Q4_K with Q5/Q6/Q8 outliers | Routing-frequency-weighted mix |
| Shared / dense FFN (120) | Q8_0 / Q5_K / Q6_K | Higher bits than routed experts |
| Output head (1) | Q6_K | Output projection |
| Embed (1) | Q4_K | Input embedding |
| Norms / SSM coefficients | F32 | Untouched |
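
To see why protecting a small share of tensors barely moves the total size, it helps to work in bits per weight. The sketch below uses the block layouts of the llama.cpp quant types (bytes per block and weights per block) and computes a parameter-weighted average. The 95% / 5% split is hypothetical and chosen only for illustration; the card does not publish per-family parameter counts. The file-level figure assumes the 21.44 GB in the Files table is decimal gigabytes and treats 35B as a nominal parameter count.

```python
# Bits per weight implied by llama.cpp block layouts: block bytes * 8 / weights per block.
BPW = {
    "Q4_K": 144 * 8 / 256,  # 4.5
    "Q5_K": 176 * 8 / 256,  # 5.5
    "Q6_K": 210 * 8 / 256,  # 6.5625
    "Q8_0": 34 * 8 / 32,    # 8.5
    "F32": 32.0,
}


def average_bpw(params_by_type: dict) -> float:
    """Parameter-weighted mean bits per weight for a mix of quant types."""
    total = sum(params_by_type.values())
    return sum(BPW[qtype] * count for qtype, count in params_by_type.items()) / total


def file_bpw(file_bytes: float, n_params: float) -> float:
    """Effective bits per weight of a whole file (metadata overhead included)."""
    return file_bytes * 8 / n_params


# From the Files table: 21.44 GB over a nominal 35B parameters.
print(round(file_bpw(21.44e9, 35e9), 2))  # 4.9

# Hypothetical split: 95% of weights at Q4_K, 5% protected at Q8_0.
print(round(average_bpw({"Q4_K": 0.95, "Q8_0": 0.05}), 2))  # 4.7
```

Because the routed experts hold the bulk of the weights in an MoE, raising precision on the small critical tensors costs only a fraction of a bit on average.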

Honest limitations

  • What this card claims (measured): KLD vs Q8, Wikitext-2 perplexity, GSM8K accuracy, MATH-500 accuracy, all on the same eval harness as the comparison baselines.
  • Still unmeasured for this artifact: MMLU, BBH, HumanEval, tool-calling end-to-end. KLD ≠ benchmark accuracy across all tasks; do not over-generalize from the metrics shown.
  • No speed claim. Decode / prefill throughput is not benchmarked; the Apple M4 figures above are single observations, not benchmarks. Expect standard Q4_K_M kernel performance.
  • No long-context claim. Evaluation ran at 4096-token context. Behavior beyond 4K is unmeasured.
  • Comparator scope: measured against public Q4_K_M and IQ4_XS baselines on these two slices. Not claimed as universally best across all Q4-class quants or all evaluation slices.
  • Hardware: measurements ran on H100 (Modal) with llama-cpp-python. Reproducibility on other CUDA archs is expected (Q4_K_M is a stable kernel path) but not separately verified.

Files

| File | Size | Purpose |
| --- | --- | --- |
| Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf | 21.44 GB | The quantized model |
| imatrix_fraQtl_256k.dat | 192 MB | Importance matrix at the 256K-token measured-optimum budget; makes the calibration step independently reproducible |

License

Apache 2.0, inherited from the base model's license.


Citation

@misc{fraqtl-qwen36-35b-a3b-q4km,
  author = {fraQtl},
  title  = {Qwen 3.6 35B-A3B (Q4_K_M) --- fraQtl calibration},
  year   = {2026},
  url    = {https://huggingface.co/fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF}
}

Provenance & reproducibility (for verifiers)

| Field | Value |
| --- | --- |
| Base model | Qwen 3.6 35B-A3B Instruct |
| BF16 source revision (pinned) | d98fa7286daa6544d050929df95e436741ee739b |
| llama.cpp commit | 1e5ad35d560b90a8ac447d149c8f8447ae1fcaa0 |
| Recipe (per-tensor policy) sha-256 | 312f548b596b91265f408933f2cd5b0b9270e628fed63614cf3a0eff2873faa9 |
| Calibration packet sha-256 | 9bec84a28dcb0c940047e6084561a00857fd610b8a1e148cc38e27929e0a7e02 |
| Imatrix sha-256 | 5872a78f610050d2fccdce0c13ae450a472647c9fb297fe0a7ccaf2dfa945460 |
| GGUF sha-256 | 1860793d452610a2e4631a176c7f154bf6b36aba932b80b81fab17bb17e0e174 |
| Code/math eval slice sha-256 | cce68602… |
| General eval slice sha-256 | b10a79caf2c17cc10cd1edcae44d4655278baf5b09b6a867b4d3ade2f996b276 |
| Eval hardware | NVIDIA H100 (Modal) |
| Eval context | 4096 |
| Reproducibility drift | 0.00000 (KLD identical to 5 decimal places across 3 independent eval runs) |
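
A verifier can check a downloaded file against the published digest before running anything. The sketch below streams the file through SHA-256 using only the standard library; the helper names sha256_of and verify are illustrative, and the expected value is the GGUF sha-256 listed in the provenance fields above.

```python
import hashlib

# GGUF sha-256 as published in the provenance fields.
EXPECTED_GGUF_SHA256 = "1860793d452610a2e4631a176c7f154bf6b36aba932b80b81fab17bb17e0e174"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a ~20 GB GGUF never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_GGUF_SHA256) -> bool:
    """True if the file's digest matches the expected one."""
    return sha256_of(path) == expected.lower()
```

For example, verify("Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf") should return True for an intact download; the same function works for the imatrix file when given its digest as the second argument.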

By fraQtl. Built on the open-source work of the Qwen team and the llama.cpp community.
