Instructions for using fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF with libraries and local apps.

Libraries

How to use fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF",
	filename="Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
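
The call above returns an OpenAI-style completion dictionary rather than a plain string. Below is a minimal sketch for pulling out the reply text, assuming the usual `choices[0].message.content` layout. `first_reply` is a hypothetical helper name, and the sample dictionary is a stand-in with the same shape, not real model output.

```python
from typing import Any, Mapping


def first_reply(response: Mapping[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Stand-in response with the expected shape (not real model output).
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Paris."}}
    ]
}
print(first_reply(sample))
```

With a loaded model, the same helper would wrap the call directly: `first_reply(llm.create_chat_completion(messages=[...]))`.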

Local Apps

llama.cpp

How to use fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M

Use Docker

docker model run hf.co/fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M

vLLM

How to use fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M

Ollama
How to use fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF with Ollama:
```
ollama run hf.co/fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M
```

Unsloth Studio

How to use fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF to start chatting

Pi

How to use fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF with Docker Model Runner:
```
docker model run hf.co/fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M
```

Lemonade

How to use fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-Hi-Fi-GGUF-Q4_K_M

List all available models

lemonade list

Qwen 3.6 35B-A3B (Q4_K_M)

by fraQtl · calibration-aware, MoE-aware

Same size as a standard Q4_K_M. Measurably closer to the Q8 teacher across every measured lane (code/math, chat, tool_calling, long-form text).

A drop-in Q4_K_M for Qwen 3.6 35B-A3B. Identical file size, identical kernel path, identical loader. ~23% lower output-distribution divergence (KLD) from the Q8 teacher on code/math and ~8% lower on general (chat + tool calling + long-form text), measured against a public Q4_K_M baseline on the same held-out slices, same prompts, same temperature.

No retraining. No custom runtime. Standard llama.cpp Q4_K_M kernel. The win is in the calibration and per-tensor bit allocation.

💻 This is the local / consumer ship. Runs on Apple Silicon (M-series) and consumer GPUs with stock llama.cpp — no patched runtime, no special flags. A separate MTP runtime variant targets datacenter speculative decoding (its 1.49× decode speedup is A100-80GB only and shows no speedup on consumer hardware), so for local use, this is the build you want.

Quickstart

huggingface-cli download fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF \
  Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf --local-dir .

./llama-cli -m Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf \
  -p "Write a Python function that returns the nth Fibonacci number." \
  -n 256 --temp 0.2

Or via llama-server for an OpenAI-compatible local API:

llama-server -hf fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M
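
Once the server is running, any OpenAI-compatible client can talk to it. The sketch below uses only the Python standard library. It assumes the server listens on `http://localhost:8080/v1` (the address used in the agent configs earlier on this page). `build_chat_request` and `chat` are hypothetical helper names; only `build_chat_request` runs at import time, so nothing is sent until `chat` is called.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8080/v1",  # assumed server address
    model: str = "fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF:Q4_K_M",
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 600.0) -> str:
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
print(request.full_url)
```

Calling `chat("What is the capital of France?")` performs the actual round trip; the generous timeout is a guess that leaves room for slow CPU decode.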

Apple Silicon (M-series)

Verified on Apple M4 / 24 GB unified memory — CPU mode (-ngl 0):

llama-server -m Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf -ngl 0 -c 2048

Observed on this hardware:

Cold load: ~128 s
Decode: ~4.9 tok/s
Output: coherent (correct Fibonacci function)

The ~20 GB file fits in 24 GB RAM in CPU mode, and because this is an A3B MoE (~3 B active params per token), CPU decode is usable.

Full Metal offload is not verified for this card. On Apple M4 / 24 GB, both -ngl 99 and -ngl 28 fail with a Metal out-of-memory error (kIOGPUCommandBufferCallbackErrorOutOfMemory) because the ~20 GB GGUF exceeds the practical ~16.8 GB Metal allocation ceiling on a 24 GB machine. Use CPU mode (-ngl 0) on 24 GB Apple Silicon.
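
The reasoning above is a plain size comparison, sketched here as a rough pre-flight check. `suggested_ngl` is a hypothetical helper. It ignores KV cache and compute buffers, and it only chooses between the two settings mentioned in this card (`-ngl 0` and `-ngl 99`). The Metal ceiling for any other machine is not given here and would have to be measured.

```python
def suggested_ngl(gguf_gb: float, metal_ceiling_gb: float, all_layers: int = 99) -> int:
    """Return 0 (CPU mode) if the GGUF exceeds the Metal ceiling, else all_layers."""
    return all_layers if gguf_gb <= metal_ceiling_gb else 0


# Figures reported above for Apple M4 / 24 GB: ~20 GB file, ~16.8 GB ceiling.
ngl_24gb = suggested_ngl(gguf_gb=20.0, metal_ceiling_gb=16.8)
print(f"-ngl {ngl_24gb}")
```

Passing the check is not evidence that full offload works; it only rules out the case that is known to fail.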

32 GB+ Apple Silicon: full Metal offload (-ngl 99) is expected to be more viable, but is not yet receipt-backed by us. Treat as experimental until we publish a hardware receipt.

Strict honesty note: this receipt was produced on the Hi-Fi MTP-runtime GGUF, which shares the same main-model quantization as this file. A strict receipt for the non-MTP Hi-Fi file itself is pending.

Works in any standard llama.cpp consumer:

llama.cpp • LM Studio • Ollama • llama-cpp-python • koboldcpp • Jan • Text Generation Web UI

Quality — measured (not claimed)

Same Q4_K_M file size, same llama.cpp kernel path, measured against two leading public Q4-class quants of the same base model. All five metrics run on the same eval harness against the same baselines.

KLD = symmetric top-20 vs the Q8 teacher, restricted to the Q8 support
Wikitext-2 PPL on the standard test split
GSM8K on the 1,319-question test set, 0-shot, standard chain-of-thought prompt
MATH-500 on the standard 500-question slice
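
For readers unfamiliar with the KLD metric, here is one plausible reading of "symmetric top-20, restricted to the Q8 support" as a minimal sketch. It is an assumption about the definition, not the eval harness itself: take the teacher's 20 most probable tokens, renormalize both distributions over that set, and average the two KL directions. The toy distributions are made up for illustration and are not eval data.

```python
import math
from typing import Dict


def symmetric_topk_kld(
    teacher: Dict[int, float],
    student: Dict[int, float],
    k: int = 20,
    eps: float = 1e-12,
) -> float:
    """Symmetric KL over the teacher's top-k tokens (assumed definition)."""
    # Support = the teacher's k most probable token ids.
    support = sorted(teacher, key=teacher.get, reverse=True)[:k]
    p = [max(teacher[t], eps) for t in support]
    q = [max(student.get(t, 0.0), eps) for t in support]
    # Renormalize both distributions over the shared support.
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


# Toy next-token distributions (token id -> probability), illustration only.
teacher = {0: 0.70, 1: 0.20, 2: 0.10}
student = {0: 0.60, 1: 0.25, 2: 0.15}
print(symmetric_topk_kld(teacher, student))
```

In an eval this value would be averaged over all token positions in a slice. Lower means the quantized model's next-token distribution is closer to the teacher's.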

Metric	fraQtl Hi-Fi (this build)	Public Q4_K_M baseline	Public IQ4_XS baseline
KLD vs Q8 — code/math ↓	0.02074	0.02700	0.03556
KLD vs Q8 — general ↓	0.04965	0.05423	0.06993
Wikitext-2 PPL ↓	8.0845	8.1139	8.2195
GSM8K accuracy ↑	89.0%	90.5%	87.5%
MATH-500 accuracy ↑	35.4%	38.8%	30.2%
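
The relative KLD differences follow directly from the table. A short sketch of the arithmetic, using only the values printed above (`relative_change` is a hypothetical helper name):

```python
# Values copied from the table: (this build, public Q4_K_M, public IQ4_XS).
KLD = {
    "code/math": (0.02074, 0.02700, 0.03556),
    "general": (0.04965, 0.05423, 0.06993),
}


def relative_change(ours: float, baseline: float) -> float:
    """Percent change of `ours` relative to `baseline` (negative = lower KLD)."""
    return (ours - baseline) / baseline * 100.0


for lane, (ours, q4km, iq4xs) in KLD.items():
    print(
        f"{lane}: {relative_change(ours, q4km):+.0f}% vs Q4_K_M, "
        f"{relative_change(ours, iq4xs):+.0f}% vs IQ4_XS"
    )
```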

Wins on KLD and perplexity. On GSM8K and MATH-500 the gaps to Public Q4_K_M are not statistically significant.

KLD code/math: −23% vs Public Q4_K_M, −42% vs Public IQ4_XS

KLD general: −8% vs Public Q4_K_M, −29% vs Public IQ4_XS

Wikitext-2 PPL: lowest of the three

GSM8K and MATH-500: differences vs Public Q4_K_M sit inside the statistical noise floor at these sample sizes (1.5 pp on 1,319 questions ≈ 1.1σ; 3.4 pp on 500 questions ≈ 1.5σ). On MATH-500 we win +5.2 pp vs Public IQ4_XS.
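
To sanity-check the noise-floor argument, the sketch below computes a simple unpaired two-proportion z-score from the accuracies in the table. This is an assumed estimator. It need not reproduce the σ figures quoted above, which may come from a different (for example paired) calculation. It only checks whether each gap stays under a conventional 2σ threshold.

```python
import math


def two_proportion_z(acc_a: float, acc_b: float, n: int) -> float:
    """Unpaired z-score for two accuracies, each measured on n questions."""
    pooled = (acc_a + acc_b) / 2.0
    std_err = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n)
    return (acc_a - acc_b) / std_err


# This build vs Public Q4_K_M, accuracies from the table.
z_gsm8k = two_proportion_z(0.890, 0.905, n=1319)
z_math500 = two_proportion_z(0.354, 0.388, n=500)
print(f"GSM8K: {z_gsm8k:+.2f} sigma, MATH-500: {z_math500:+.2f} sigma")
```

Because both models answer the same questions, a paired test such as McNemar's would be the stricter check, but it needs per-question results that this card does not publish.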

Reproducibility: three independent eval runs reproduced KLD to five decimal places (drift 0.00000). Build + eval pipeline is deterministic.

What's different

A higher-fidelity Q4_K_M of Qwen 3.6 35B-A3B (MoE, 256 routed experts), built with two changes vs a stock Q4_K_M:

Per-tensor protection policy. Architecturally critical tensors (router, attention input projections, shared FFN) are quantized at higher precision; routed experts stay at the Q4_K floor. Same total size, smarter bit allocation.
Calibration tuned to a measured optimum. Imatrix budget set to the empirically best point on this packet (see Calibration section).

The .gguf is a standard Q4_K_M; any llama.cpp build that runs Q4_K_M runs this. No patched runtime, no special flags.

Prompt format

Qwen 3.6 chat template, with optional <think> pre-fill for chain-of-thought:

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
<think>
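
A minimal sketch that assembles this template as a string, for cases where raw text is sent to a completion endpoint. Chat endpoints typically apply the template stored in the GGUF themselves, so this is not needed there. `build_prompt` and its parameters are hypothetical names.

```python
def build_prompt(system_prompt: str, user_prompt: str, prefill_think: bool = True) -> str:
    """Render the chat template shown above, optionally pre-filling <think>."""
    parts = [
        f"<|im_start|>system\n{system_prompt}<|im_end|>",
        f"<|im_start|>user\n{user_prompt}<|im_end|>",
        "<|im_start|>assistant",
    ]
    text = "\n".join(parts) + "\n"
    if prefill_think:
        # Pre-fill the opening tag so generation starts inside the reasoning block.
        text += "<think>\n"
    return text


print(build_prompt("Be brief.", "What is 2 + 2?"))
```

Setting `prefill_think=False` leaves the assistant turn open without the reasoning pre-fill.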

Calibration

Packet: ~414K tokens of curated code + math (worked solutions, multiple languages, mostly Python).
Imatrix budget: 256K tokens — the measured optimum on this packet.
Decontamination: packet decontaminated against GSM8K test, MATH-500, and Hendrycks MATH-train. Eval slices are disjoint from calibration content.

Why 256K and not "use the whole packet": a 384K budget produced a measurably worse artifact on the same eval slice (+7.20% relative KLD vs the 256K build, byte-identical substrate). The calibration-budget curve is non-monotonic; more is not always better.

Calibration budget is a real, measurable lever, but on this packet it peaks at 256K tokens rather than improving monotonically.

The included `imatrix_fraQtl_256k.dat` makes the calibration step independently reproducible.

Per-tensor protection policy (summary)

Same total bit budget, smarter spend. The recipe protects the architecturally critical tensors and quantizes the bulk routed experts at the Q4_K floor:

Family	Quant types	Why
Router / gate (40)	F32	Top damage tensors by score signal
Linear attention QKV & gate (60)	Q8_0	Top-4 damage tensors in the model
Standard attention Q/K/V (30)	Q8_0	Input projection protection
Routed-expert up / gate / down (120)	mostly Q4_K with Q5/Q6/Q8 outliers	Routing-frequency-weighted mix
Shared / dense FFN (120)	Q8_0 / Q5_K / Q6_K	Higher bits than routed experts
Output head (1)	Q6_K	Output projection
Embed (1)	Q4_K	Input embedding
Norms / SSM coefficients	F32	Untouched
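
The table can be read as an ordered, first-match-wins lookup from tensor name to quant family. The sketch below illustrates that idea only. The actual recipe is published here as a hash, not as rules, and the GGUF-style tensor-name patterns are assumptions about how such names usually look, not names taken from this file.

```python
import re
from typing import List, Tuple

# (pattern, "Quant types" cell from the table). First match wins.
# The name patterns are illustrative assumptions, not the published recipe.
POLICY: List[Tuple[str, str]] = [
    (r"\.ffn_gate_inp\.", "F32"),                  # router / gate
    (r"\.attn_(qkv|gate)\.", "Q8_0"),              # linear attention QKV & gate
    (r"\.attn_(q|k|v)\.", "Q8_0"),                 # standard attention Q/K/V
    (r"\.ffn_(up|gate|down)_exps\.", "mostly Q4_K with Q5/Q6/Q8 outliers"),
    (r"\.ffn_(up|gate|down)_shexp\.", "Q8_0 / Q5_K / Q6_K"),
    (r"^output\.", "Q6_K"),                        # output head
    (r"^token_embd\.", "Q4_K"),                    # input embedding
    (r"norm|\.ssm_", "F32"),                       # norms / SSM coefficients
]


def quant_family(tensor_name: str) -> str:
    """Return the table's quant entry for a tensor name, or 'unclassified'."""
    for pattern, quant in POLICY:
        if re.search(pattern, tensor_name):
            return quant
    return "unclassified"


for name in ("blk.0.ffn_gate_inp.weight", "blk.5.ffn_down_exps.weight", "output.weight"):
    print(name, "->", quant_family(name))
```

Rule order matters: the router pattern is listed first, and the expert patterns require the `_exps` or `_shexp` suffix, so a router tensor is never picked up by an FFN rule.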

Honest limitations

What this card claims (measured): KLD vs Q8, Wikitext-2 perplexity, GSM8K accuracy, MATH-500 accuracy — all on the same eval harness as the comparison baselines.
Still unmeasured for this artifact: MMLU, BBH, HumanEval, tool-calling end-to-end. KLD ≠ benchmark accuracy across all tasks; do not over-generalize from the metrics shown.
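No speed claim. Decode / prefill throughput has not been benchmarked for this artifact; the Apple M4 figures above are a single observation, not a benchmark. Expect standard Q4_K_M kernel performance.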
No long-context claim. Evaluation ran at 4096-token context. Behavior beyond 4K is unmeasured.
Comparator scope: measured against public Q4_K_M and IQ4_XS baselines on these two slices. Not claimed as universally best across all Q4-class quants or all evaluation slices.
Hardware: measurements ran on H100 (Modal) with llama-cpp-python. Reproducibility on other CUDA archs is expected (Q4_K_M is a stable kernel path) but not separately verified.

Files

File	Size	Purpose
`Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf`	21.44 GB	The quantized model
`imatrix_fraQtl_256k.dat`	192 MB	Importance matrix at the 256K-token measured-optimum budget — makes the calibration step independently reproducible

License

Apache 2.0 — inherits the base model's license.

Citation

@misc{fraqtl-qwen36-35b-a3b-q4km,
  author = {fraQtl},
  title  = {Qwen 3.6 35B-A3B (Q4_K_M) — fraQtl calibration},
  year   = {2026},
  url    = {https://huggingface.co/fraQtl/Qwen3.6-35B-A3B-Hi-Fi-GGUF}
}

Provenance & reproducibility (for verifiers)

Field	Value
Base model	Qwen 3.6 35B-A3B Instruct
BF16 source revision (pinned)	`d98fa7286daa6544d050929df95e436741ee739b`
llama.cpp commit	`1e5ad35d560b90a8ac447d149c8f8447ae1fcaa0`
Recipe (per-tensor policy) sha-256	`312f548b596b91265f408933f2cd5b0b9270e628fed63614cf3a0eff2873faa9`
Calibration packet sha-256	`9bec84a28dcb0c940047e6084561a00857fd610b8a1e148cc38e27929e0a7e02`
Imatrix sha-256	`5872a78f610050d2fccdce0c13ae450a472647c9fb297fe0a7ccaf2dfa945460`
GGUF sha-256	`1860793d452610a2e4631a176c7f154bf6b36aba932b80b81fab17bb17e0e174`
Code/math eval slice sha-256	`cce68602…`
General eval slice sha-256	`b10a79caf2c17cc10cd1edcae44d4655278baf5b09b6a867b4d3ade2f996b276`
Eval hardware	NVIDIA H100 (Modal)
Eval context	4096
Reproducibility drift	0.00000 (KLD identical to 5 decimal places across 3 independent eval runs)
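
A verifier can check a downloaded file against the GGUF sha-256 listed above. Below is a minimal sketch using the standard library; `sha256_of` and `verify_gguf` are hypothetical helper names. The file is read in chunks so the full file never has to sit in memory.

```python
import hashlib
from pathlib import Path

# GGUF sha-256 copied from the provenance table above.
EXPECTED_GGUF_SHA256 = "1860793d452610a2e4631a176c7f154bf6b36aba932b80b81fab17bb17e0e174"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_gguf(path: Path, expected: str = EXPECTED_GGUF_SHA256) -> bool:
    """Return True when the file's digest matches the expected value."""
    return sha256_of(path) == expected.lower()
```

Usage would be `verify_gguf(Path("Qwen3.6-35B-A3B-fraQtl-Q4_K_M.gguf"))`. The same `sha256_of` helper works for the imatrix file and its listed hash.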

By fraQtl. Built on the open-source work of the Qwen team and the llama.cpp community.

GGUF

Model size

35B params

Architecture

qwen35moe

Hardware compatibility

4-bit