Qwen3.6-35B-A3B-APEX-IQuality-GGUF
Qwen3.6-35B-A3B quantized with APEX imatrix built from real Hermes agent session traces.
Most GGUF quants calibrate on generic text. This one calibrates on actual agentic workloads: tool calls, multi-turn reasoning, code generation, and task completions from production Hermes agent sessions. If you run local agents, the quantization importance weights reflect your actual inference distribution.
- 21 GB on disk. Runs on M1 Max (64GB) with large context headroom.
- ~42 tok/s generation, ~620 tok/s prompt processing on Apple Silicon (Metal).
- Vision + video support included (mmproj).
Model Description
Built from Qwen/Qwen3.6-35B-A3B, a 35B hybrid MoE model with 256 experts, 8 active per token, 34.66B total parameters. Combines attention layers with Gated Delta Net SSM layers (full attention every 4th layer), trained to 262K context.
This GGUF applies APEX quantization with a custom imatrix built around Hermes agent session traces: real multi-turn conversations including tool calls, reasoning chains, code generation, and agentic task completions. As described under Methodology, the traces are combined with bartowski's calibration dataset v3 as a general base. No generic wikitext. Verified for use with Hermes Agent on Apple Silicon (M1 Max).
Credits & Attribution
Base Model: Qwen/Qwen3.6-35B-A3B
- Original Qwen3.6 MoE release by Qwen team
Calibration Dataset: Combined imatrix calibration
- bartowski's calibration dataset v3: high-quality general calibration base
- Hermes agent session traces: real multi-turn agentic conversations (tool calls, reasoning chains, code generation, scientific queries)
- Combined and extracted using `extract_hermes_traces.py` with the Qwen3.6 chat template
APEX Quantization: mudler/apex-quant
- Reference: mudler/Qwen3.5-35B-A3B-APEX-GGUF
TurboQuant backend: Custom Metal kernels for M-series Apple Silicon
- 3.5× faster TQ4_1S kernel, MoE 256-expert kernel instantiations
This release: @luffydenolan
- Built Hermes agent trace calibration dataset
- Applied APEX iQuality quantization using Hermes imatrix
- Local testing and verification on M1 Max (64GB)
Methodology
- Started from `Qwen/Qwen3.6-35B-A3B` base (34.66B params, ~65GB f16, ~34GB Q8_0)
- Built custom imatrix calibration combining:
  - bartowski's calibration dataset v3: general-purpose high-quality base
  - Hermes agent session traces: real multi-turn agentic conversations (tool calls, reasoning chains, code generation, task completions)
- Generated imatrix importance weights using `llama-imatrix` on Q8_0: `-c 512`, `--chunks 200`, `--threads 10`, `-ngl 99`
- Applied APEX quantization guided by imatrix weights, producing the iQuality output (21GB)
- Tested locally on M1 Max (64GB) with TurboQuant Metal backend
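The imatrix step above can be written out as a command line. The sketch below only assembles the argument list from the flags listed in the methodology; the file names (`Qwen3.6-35B-A3B-Q8_0.gguf`, `calibration_combined.txt`, `imatrix.dat`) are hypothetical placeholders, not the names used for this release.

```python
import shlex


def imatrix_command(model, calibration, output):
    """Assemble a llama-imatrix invocation using the flags listed above."""
    return [
        "llama-imatrix",
        "-m", model,        # Q8_0 source model
        "-f", calibration,  # combined calibration text
        "-o", output,       # file the importance matrix is written to
        "-c", "512",
        "--chunks", "200",
        "--threads", "10",
        "-ngl", "99",
    ]


# Hypothetical file names, for illustration only.
cmd = imatrix_command(
    "Qwen3.6-35B-A3B-Q8_0.gguf",
    "calibration_combined.txt",
    "imatrix.dat",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, assuming `llama-imatrix` is on the `PATH`.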
Architecture Notes
Qwen3.6-35B-A3B is a hybrid MoE + SSM model:
- 40 layers total, full attention every 4th layer (10 attention, 30 SSM/MoE)
- 256 experts per MoE layer, 8 activated per token (~3B active params per forward pass)
- SSM: Gated Delta Net, inner size 4096, state size 128, 16 groups
- GQA: 16 attention heads, 2 KV heads (8× GQA), head dim 256
- Context trained to 262K tokens (rope freq base 10M)
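The counts in the list above follow from a little arithmetic, sketched here. The exact position of the full-attention layer within each group of four is an assumption made for illustration; the card only states "every 4th layer".

```python
N_LAYERS = 40
ATTN_EVERY = 4
N_EXPERTS = 256
N_ACTIVE_EXPERTS = 8
N_HEADS = 16
N_KV_HEADS = 2

# Assumption: the full-attention layer is the last one in each group of four
# (0-based indices 3, 7, ..., 39).
attn_layers = [i for i in range(N_LAYERS) if (i + 1) % ATTN_EVERY == 0]
ssm_layers = [i for i in range(N_LAYERS) if i not in attn_layers]

gqa_ratio = N_HEADS // N_KV_HEADS               # query heads per KV head
expert_fraction = N_ACTIVE_EXPERTS / N_EXPERTS  # share of experts used per token

print(len(attn_layers), len(ssm_layers))  # 10 30
print(gqa_ratio)                          # 8
print(f"{expert_fraction:.3%}")           # 3.125%
```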
Files Included
- `Qwen3.6-35B-A3B-apex-iquality.gguf`: Main model weights (21 GB)
- `mmproj-F16-Qwen3.6-35B-A3B.gguf`: Multimodal projection layer, vision + video (858 MB)
Usage
llama.cpp server (text only)
llama-server \
-m Qwen3.6-35B-A3B-apex-iquality.gguf \
-ngl 99 \
-t 4 \
--ctx-size 132000 \
--cache-type-k q8_0 \
--cache-type-v turbo4 \
--flash-attn on \
--jinja \
--chat-template-file chat_template_qwen36.jinja \
--temp 0.7 \
--top-k 20 \
--top-p 0.95 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0
llama.cpp server (vision/video)
llama-server \
-m Qwen3.6-35B-A3B-apex-iquality.gguf \
--mmproj mmproj-F16-Qwen3.6-35B-A3B.gguf \
-ngl 99 \
-t 4 \
--ctx-size 132000 \
--cache-type-k q8_0 \
--cache-type-v turbo4 \
--flash-attn on \
--jinja \
--chat-template-file chat_template_qwen36.jinja \
--temp 0.7 \
--top-k 20 \
--top-p 0.95 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0
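With the vision server running, an image can usually be sent through the OpenAI-compatible chat endpoint as an `image_url` content part holding a base64 data URI. The sketch below only builds the request body; it assumes the server accepts this message shape, and the image bytes are a placeholder.

```python
import base64
import json


def image_message(prompt, image_bytes, mime="image/png"):
    """Build an OpenAI-style user message carrying text plus one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }


# Placeholder bytes stand in for a real file read with open(path, "rb").
msg = image_message("Describe this image.", b"\x89PNG placeholder")
body = json.dumps({"messages": [msg]})
```

The resulting `body` would be POSTed to the server's `/v1/chat/completions` route.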
Recommended sampling parameters
| Parameter | Value | Notes |
|---|---|---|
| temperature | 0.7 | |
| top-k | 20 | Qwen3.x default |
| top-p | 0.95 | |
| min-p | 0.0 | |
| presence-penalty | 0.0 | |
| repeat-penalty | 1.0 | |
| cache-type-k | q8_0 | Best attention accuracy on M-series |
| cache-type-v | turbo4 | Good compression, less sensitive than K |
| ctx-size | 132000 | ~half of trained 262K, fits 64GB with headroom |
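The sampling values in the table can also be sent per request to a running `llama-server`. This is a minimal sketch using only the standard library; the address `http://127.0.0.1:8080` is an assumption (the server's default port), and the cache types and context size are launch options rather than request fields.

```python
import json
import urllib.request

# Sampling values copied from the table above.
SAMPLING = {
    "temperature": 0.7,
    "top_k": 20,
    "top_p": 0.95,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repeat_penalty": 1.0,
}


def build_request(prompt, base_url="http://127.0.0.1:8080"):
    """Build (but do not send) a chat completion request for llama-server."""
    payload = {"messages": [{"role": "user", "content": prompt}], **SAMPLING}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling `chat("Hello!")` requires the server from the previous section to be running.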
Python (llama-cpp-python)
from llama_cpp import Llama
llm = Llama(
model_path="Qwen3.6-35B-A3B-apex-iquality.gguf",
n_ctx=132000,
n_gpu_layers=99,
n_threads=4,
)
response = llm.create_chat_completion(
messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])
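For interactive use, the same call can stream tokens as they are generated. The helper below is a sketch that assumes an `llm` object created as above; `stream_reply` is an illustrative name, and the sampling values are the recommended ones from this card.

```python
def stream_reply(llm, prompt):
    """Yield text fragments from a streamed chat completion."""
    chunks = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        top_k=20,
        top_p=0.95,
        min_p=0.0,
        presence_penalty=0.0,
        repeat_penalty=1.0,
        stream=True,
    )
    for chunk in chunks:
        # Some chunks carry only the role or a finish marker, so "content" may be absent.
        text = chunk["choices"][0]["delta"].get("content")
        if text:
            yield text
```

A typical loop would be `for piece in stream_reply(llm, "Hello!"): print(piece, end="", flush=True)`.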
Performance
Tested on Apple M1 Max (64GB unified memory), llama.cpp via Metal + TurboQuant backend, 4 threads.
Throughput (llama-bench, 3 runs, solo, no other GPU load)
| Test | Tokens/sec |
|---|---|
| Prompt processing (pp128) | 360.60 ± 13.95 |
| Prompt processing (pp512) | 620.68 ± 2.74 |
| Prompt processing (pp1024) | 611.77 ± 6.17 |
| Token generation (tg128) | 42.24 ± 0.04 |
| Token generation (tg512) | 41.87 ± 0.22 |
- Model size: 21 GiB on disk, loaded to GPU via Metal
- Params: 34.66B total (~3B active per token)
- Quant: APEX iQuality (imatrix, Hermes traces)
- Backend: Metal (GPU) + TurboQuant, Apple M1 Max
- KV cache: q8_0 K / turbo4 V (10 attention layers only, 132K ctx)
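The benchmark figures give a rough way to estimate response latency. The sketch below simply divides token counts by the pp512 and tg128 rates; it is an approximation, since these rates were measured at short lengths and throughput at very long contexts may be lower.

```python
# Throughput figures from the llama-bench table above (tokens per second).
PP_TOK_S = 620.68  # prompt processing, pp512
TG_TOK_S = 42.24   # token generation, tg128


def estimate_seconds(prompt_tokens, output_tokens):
    """Rough wall-clock estimate: prompt processing time plus generation time."""
    return prompt_tokens / PP_TOK_S + output_tokens / TG_TOK_S


# Example: a 4,000-token prompt followed by a 500-token reply.
print(f"{estimate_seconds(4000, 500):.1f} s")
```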
TurboQuant Build (Apple Silicon)
turbo4 KV cache requires TheTom's TurboQuant fork of llama.cpp.
https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md
Memory at 132K context (M1 Max 64GB)
| Component | Size |
|---|---|
| Model weights (APEX iQuality) | ~21 GB |
| KV cache (q8_0 K / turbo4 V, 10 attn layers) | ~2 GB |
| SSM recurrent state | ~0.5 GB |
| Metal buffers + overhead | ~3 GB |
| Total RSS | ~26.6 GB |
~37 GB headroom remaining on 64GB for OS + other processes.
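As a quick check of the budget above, the component rows can be summed and compared with the reported total. All figures are the approximate values from the table, so the sum and the reported RSS differ slightly.

```python
# Component sizes in GB, copied from the table above (all approximate).
COMPONENTS_GB = {
    "model weights": 21.0,
    "kv cache": 2.0,
    "ssm recurrent state": 0.5,
    "metal buffers + overhead": 3.0,
}
REPORTED_RSS_GB = 26.6
TOTAL_RAM_GB = 64.0

component_sum = sum(COMPONENTS_GB.values())
headroom = TOTAL_RAM_GB - REPORTED_RSS_GB

print(f"sum of components: {component_sum:.1f} GB")  # 26.5 GB
print(f"headroom:          {headroom:.1f} GB")       # 37.4 GB
```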
License
Apache 2.0 (inherited from base model)
Citation
@misc{qwen36-35b-a3b-apex-iquality-gguf,
title = {Qwen3.6-35B-A3B-APEX-IQuality-GGUF},
author = {luffydenolan},
year = {2026},
url = {https://huggingface.co/luffydenolan/Qwen3.6-35B-A3B-APEX-IQuality-GGUF}
}
Related Models
- Qwen/Qwen3.6-35B-A3B: Base model
- mudler/Qwen3.5-35B-A3B-APEX-GGUF: APEX quantization reference