Libraries

llama-cpp-python

How to use luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF",
	filename="Qwen3.6-35B-A3B-apex-iquality.gguf",
)

# messages must be a list of role/content dicts, not a bare string
llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "Hello!"}
	]
)

Local Apps

llama.cpp

How to use luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16

Use Docker

docker model run hf.co/luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16

Ollama
How to use luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF with Ollama:
```
ollama run hf.co/luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16
```

Unsloth Studio

How to use luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF to start chatting

Pi

How to use luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16

Run Hermes

hermes

Docker Model Runner
How to use luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF with Docker Model Runner:
```
docker model run hf.co/luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16
```

Lemonade

How to use luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF:F16

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-APEX-IQuality-GGUF-F16

List all available models

lemonade list

Qwen3.6-35B-A3B-APEX-IQuality-GGUF

Qwen3.6-35B-A3B quantized with APEX imatrix built from real Hermes agent session traces.

Most GGUF quants calibrate on generic text alone. This one adds actual agentic workloads to the calibration data: tool calls, multi-turn reasoning, code generation, and task completions from production Hermes agent sessions. If you run local agents, the quantization importance weights should sit closer to your actual inference distribution.

21 GB on disk. Runs on M1 Max (64GB) at 132K context with about 37 GB of memory left over.
~42 tok/s generation, ~620 tok/s prompt processing on M1 Max (Metal).
Vision + video support included (mmproj).

Model Description

Built from Qwen/Qwen3.6-35B-A3B — a 35B hybrid MoE model with 256 experts, 8 active per token, 34.66B total parameters. Combines attention layers with Gated Delta Net SSM layers (full attention every 4th layer), trained to 262K context.

This GGUF applies APEX quantization with a custom imatrix built from bartowski's calibration dataset v3 combined with Hermes agent session traces: real multi-turn conversations including tool calls, reasoning chains, code generation, and agentic task completions. No generic wikitext. Verified for use with Hermes Agent on Apple Silicon (M1 Max).

Credits & Attribution

Base Model: Qwen/Qwen3.6-35B-A3B
- Original Qwen3.6 MoE release by Qwen team
Calibration Dataset: Combined imatrix calibration
- bartowski's calibration dataset v3 — high-quality general calibration base
- Hermes agent session traces — real multi-turn agentic conversations: tool calls, reasoning chains, code generation, scientific queries
- Combined and extracted using extract_hermes_traces.py with Qwen3.6 chat template
APEX Quantization: mudler/apex-quant
- Reference: mudler/Qwen3.5-35B-A3B-APEX-GGUF
TurboQuant backend: Custom Metal kernels for M-series Apple Silicon
- 3.5× faster TQ4_1S kernel, MoE 256-expert kernel instantiations
This release: @luffydenolan
- Built Hermes agent trace calibration dataset
- Applied APEX iQuality quantization using Hermes imatrix
- Local testing and verification on M1 Max (64GB)

Methodology

1. Started from Qwen/Qwen3.6-35B-A3B base (34.66B params, ~65GB f16, ~34GB Q8_0).
2. Built a custom imatrix calibration set combining:
   - bartowski's calibration dataset v3: general-purpose high-quality base
   - Hermes agent session traces: real multi-turn agentic conversations (tool calls, reasoning chains, code generation, task completions)
3. Generated imatrix importance weights using llama-imatrix on the Q8_0 model:
   - -c 512, --chunks 200, --threads 10, -ngl 99
4. Applied APEX quantization guided by the imatrix weights, producing the iQuality output (21GB).
5. Tested locally on M1 Max (64GB) with the TurboQuant Metal backend.
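
As a rough illustration of the imatrix step above, the following Python sketch assembles a llama-imatrix invocation with the stated settings. Only `-c 512`, `--chunks 200`, `--threads 10`, and `-ngl 99` come from the methodology. The file names (`qwen36-q8_0.gguf`, `calibration.txt`, `hermes.imatrix`) are hypothetical placeholders, not the author's actual paths, and the use of `-m`, `-f`, and `-o` for the model, calibration text, and output file is an assumption about how the tool was called.

```python
import shlex


def build_imatrix_command(
    model_path: str,
    calibration_path: str,
    output_path: str,
    ctx: int = 512,
    chunks: int = 200,
    threads: int = 10,
    gpu_layers: int = 99,
) -> list[str]:
    """Return the argv list for a llama-imatrix run (nothing is executed)."""
    return [
        "llama-imatrix",
        "-m", model_path,          # Q8_0 model used to collect activations
        "-f", calibration_path,    # combined calibration text
        "-o", output_path,         # where the importance matrix is written
        "-c", str(ctx),
        "--chunks", str(chunks),
        "--threads", str(threads),
        "-ngl", str(gpu_layers),
    ]


# Hypothetical file names, for illustration only.
cmd = build_imatrix_command("qwen36-q8_0.gguf", "calibration.txt", "hermes.imatrix")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would launch the run, assuming `llama-imatrix` is on the PATH.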

Architecture Notes

Qwen3.6-35B-A3B is a hybrid MoE + SSM model:

- 40 layers total, full attention every 4th layer (10 attention, 30 SSM/MoE)
- 256 experts per MoE layer, 8 activated per token (~3B active params per forward pass)
- SSM: Gated Delta Net, inner size 4096, state size 128, 16 groups
- GQA: 16 attention heads, 2 KV heads (8× GQA), head dim 256
- Context trained to 262K tokens (rope freq base 10M)
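
The figures above explain why the KV cache stays small even at long context: only the full-attention layers keep one, and each has just 2 KV heads. The sketch below works through that arithmetic. It is a back-of-envelope estimate, not a measurement: it ignores padding and allocator overhead, it assumes the 132K context used elsewhere in this card, and it does not model turbo4 because the card does not give its bits per element.

```python
# Figures taken from the architecture notes above.
TOTAL_LAYERS = 40
FULL_ATTENTION_EVERY = 4
ATTENTION_HEADS = 16
KV_HEADS = 2
HEAD_DIM = 256

attention_layers = TOTAL_LAYERS // FULL_ATTENTION_EVERY   # layers that keep a KV cache
ssm_layers = TOTAL_LAYERS - attention_layers              # layers with recurrent state instead
gqa_ratio = ATTENTION_HEADS // KV_HEADS                   # query heads per KV head

# Elements stored per token in the K cache (the V cache has the same count).
k_elements_per_token = attention_layers * KV_HEADS * HEAD_DIM


def cache_bytes(n_tokens: int, bytes_per_element: float) -> int:
    """Size of one cache (K or V) across all attention layers."""
    return int(n_tokens * k_elements_per_token * bytes_per_element)


F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # q8_0 block: 32 int8 values plus one 2-byte scale

n_ctx = 132_000  # assumed context size for the example
print(f"attention layers: {attention_layers}, GQA ratio: {gqa_ratio}x")
print(f"K cache, f16:  {cache_bytes(n_ctx, F16_BYTES) / 2**30:.2f} GiB")
print(f"K cache, q8_0: {cache_bytes(n_ctx, Q8_0_BYTES) / 2**30:.2f} GiB")
```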

Files Included

Qwen3.6-35B-A3B-apex-iquality.gguf — Main model weights (21 GB)
mmproj-F16-Qwen3.6-35B-A3B.gguf — Multimodal projection layer, vision + video (858 MB)

Usage

llama.cpp server (text only)

llama-server \
  -m Qwen3.6-35B-A3B-apex-iquality.gguf \
  -ngl 99 \
  -t 4 \
  --ctx-size 132000 \
  --cache-type-k q8_0 \
  --cache-type-v turbo4 \
  --flash-attn on \
  --jinja \
  --chat-template-file chat_template_qwen36.jinja \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

llama.cpp server (vision/video)

llama-server \
  -m Qwen3.6-35B-A3B-apex-iquality.gguf \
  --mmproj mmproj-F16-Qwen3.6-35B-A3B.gguf \
  -ngl 99 \
  -t 4 \
  --ctx-size 132000 \
  --cache-type-k q8_0 \
  --cache-type-v turbo4 \
  --flash-attn on \
  --jinja \
  --chat-template-file chat_template_qwen36.jinja \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

Recommended sampling parameters

Parameter	Value	Notes
temperature	0.7
top-k	20	Qwen3.x default
top-p	0.95
min-p	0.0
presence-penalty	0.0
repeat-penalty	1.0
cache-type-k	q8_0	Best attention accuracy on M-series
cache-type-v	turbo4	Good compression, less sensitive than K
ctx-size	132000	~half of trained 262K — fits 64GB with headroom
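
When the server is already running, the same sampling values can be sent per request. The sketch below only builds the request and leaves the network call commented out. It assumes the server listens on `http://localhost:8080/v1` (the address used in the Pi configuration in this card), and it assumes that llama-server accepts `top_k`, `min_p`, and `repeat_penalty` as extra fields in the chat completions body; if it does not, the values passed on the command line apply instead.

```python
import json
import urllib.request

# Sampling values from the table above.
SAMPLING = {
    "temperature": 0.7,
    "top_k": 20,
    "top_p": 0.95,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repeat_penalty": 1.0,
}


def build_chat_request(
    prompt: str, base_url: str = "http://localhost:8080/v1"
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request with the sampling settings."""
    payload = {"messages": [{"role": "user", "content": prompt}], **SAMPLING}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("Hello!")
print(request.full_url)
print(request.data.decode("utf-8"))

# To actually send it, with llama-server running:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```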

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.6-35B-A3B-apex-iquality.gguf",
    n_ctx=132000,
    n_gpu_layers=99,
    n_threads=4,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])

Performance

Tested on Apple M1 Max (64GB unified memory), llama.cpp via Metal + TurboQuant backend, 4 threads.

Throughput (llama-bench, 3 runs, solo — no other GPU load)

Test	Tokens/sec
Prompt processing — pp128	360.60 ± 13.95
Prompt processing — pp512	620.68 ± 2.74
Prompt processing — pp1024	611.77 ± 6.17
Token generation — tg128	42.24 ± 0.04
Token generation — tg512	41.87 ± 0.22

Model size: 21 GiB on disk, loaded to GPU via Metal
Params: 34.66B total (~3B active per token)
Quant: APEX iQuality (imatrix, Hermes traces)
Backend: Metal (GPU) + TurboQuant, Apple M1 Max
KV cache: q8_0 K / turbo4 V (10 attention layers only, 132K ctx)
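
To turn the throughput figures above into a feel for response time, the sketch below adds prompt-processing time and generation time for a given request. It uses the pp512 and tg512 means as constants, which is an assumption: the benchmark only covers prompts up to 1,024 tokens, and throughput at much longer contexts is likely lower, so the estimate is optimistic for long prompts. The 4,000-token prompt is an illustrative value, not a measured workload.

```python
# Mean throughput figures from the llama-bench table above (tokens/sec).
PROMPT_TOK_PER_SEC = 620.68   # pp512
GEN_TOK_PER_SEC = 41.87       # tg512


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end time: prompt processing plus token generation."""
    return prompt_tokens / PROMPT_TOK_PER_SEC + output_tokens / GEN_TOK_PER_SEC


# Example: a 4,000-token agent prompt followed by a 512-token reply.
print(f"{estimate_seconds(4_000, 512):.1f} s")
```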

TurboQuant Build (Apple Silicon)

The turbo4 KV cache type used in the commands above requires TheTom's TurboQuant fork of llama.cpp. Build instructions:

https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md

Memory at 132K context (M1 Max 64GB)

Component	Size
Model weights (APEX iQuality)	~21 GB
KV cache (q8_0 K / turbo4 V, 10 attn layers)	~2 GB
SSM recurrent state	~0.5 GB
Metal buffers + overhead	~3 GB
Total RSS	~26.6 GB

~37 GB headroom remaining on 64GB for OS + other processes.
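
The headroom figure follows directly from the table. The short check below reproduces it; the rows are rounded, so their sum (26.5 GB) differs slightly from the reported total RSS (26.6 GB), and the headroom is computed from the reported total.

```python
# Component sizes in GB, copied from the memory table above.
components_gb = {
    "model weights": 21.0,
    "KV cache": 2.0,
    "SSM recurrent state": 0.5,
    "Metal buffers + overhead": 3.0,
}
REPORTED_RSS_GB = 26.6
UNIFIED_MEMORY_GB = 64.0

component_sum = sum(components_gb.values())      # sum of the rounded rows
headroom = UNIFIED_MEMORY_GB - REPORTED_RSS_GB   # left for the OS and other processes

print(f"sum of components: {component_sum:.1f} GB")
print(f"headroom:          {headroom:.1f} GB")
```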

License

Apache 2.0 (inherited from base model)

Citation

@misc{qwen36-35b-a3b-apex-iquality-gguf,
  title = {Qwen3.6-35B-A3B-APEX-IQuality-GGUF},
  author = {luffydenolan},
  year = {2026},
  url = {https://huggingface.co/luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF}
}

Related Models

Qwen/Qwen3.6-35B-A3B — Base model
mudler/Qwen3.5-35B-A3B-APEX-GGUF — APEX quantization reference

GGUF

Model size: 35B params
Architecture: qwen35moe
