Qwen 3.6 27B — JANG_4M (MLX)

Balanced 4-bit quantization of Alibaba's hybrid linear/full-attention dense 27B VL — full-attention q/k/v/o, embeddings and lm_head at 8-bit affine; dense FFN, linear-attention projections and vision tower at 4-bit affine.

Model Details

Property	Value
Base model	`Qwen/Qwen3.6-27B`
Parameters	27.3 B, dense (no MoE)
Architecture	`qwen3_5`: 64 decoder layers, 48 `Gated DeltaNet` (linear-attn) + 16 full-attention with a sigmoid output gate
Quantization	JANG_4M — mixed 4/8-bit native MLX affine
Package size on disk	17.5 GB across 11 shards
Avg bits/param	4.45
vs BF16 source	52 GB → 17.5 GB, 3.0× compression
Context	262 144 native; upstream card reports up to ~1 M with YaRN
Vision tower	27-layer ViT (hidden 1152, patch 16), temporal_patch 2, quantized at 4-bit with patch-embed axes pre-transposed to MLX layout
Chat format	Qwen `im_start`/`im_end` with `enable_thinking` toggle

JANG_4M bit allocation

Category	Bits	Group	Notes
Full-attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) — 16 layers	8	64	Precision-critical. `q_proj` output is fused with the sigmoid output gate (half queries / half gate)
Embedding (`embed_tokens`), `lm_head`	8	64	Input/output precision bound
Dense FFN (`mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`) — 64 layers	4	64	Bulk of parameters
Linear-attention projections (`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`, `out_proj`) — 48 layers	4	64
Vision tower (27 ViT layers)	4	64
Norms, `A_log`, `dt_bias`, `conv1d`	bf16	—	Passthrough
MTP head	—	—	Stripped — mlx_vlm doesn't use it
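
For intuition on the 4-bit and 8-bit tiers in the table above, here is a minimal sketch of group-wise affine quantization in plain Python. It assumes the common min/max scheme, with one scale and one bias shared by each group of 64 weights. The function names are illustrative, the weights are random stand-ins, and this is not the MLX kernel itself.

```python
import random


def affine_quantize(group, bits):
    """Quantize one group of floats to `bits`-bit codes sharing a scale and bias."""
    lo, hi = min(group), max(group)
    levels = (1 << bits) - 1                      # 15 for 4-bit, 255 for 8-bit
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in group]
    return codes, scale, lo                       # bias is the group minimum


def affine_dequantize(codes, scale, bias):
    """Map integer codes back to approximate float weights."""
    return [c * scale + bias for c in codes]


def max_abs_error(group, bits):
    """Worst-case round-trip error for one group at the given bit width."""
    codes, scale, bias = affine_quantize(group, bits)
    restored = affine_dequantize(codes, scale, bias)
    return max(abs(a - b) for a, b in zip(group, restored))


random.seed(0)
# Stand-in for one group of 64 weights (synthetic, not taken from the model).
group = [random.gauss(0.0, 0.02) for _ in range(64)]
err4 = max_abs_error(group, 4)
err8 = max_abs_error(group, 8)
print(f"4-bit max error: {err4:.6f}")
print(f"8-bit max error: {err8:.6f}")
```

The round-trip error is bounded by half a quantization step, and the 8-bit step is 17 times smaller than the 4-bit step for the same group range (255 levels versus 15).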

Per-module bit overrides are encoded directly in config.json["quantization"] (65 overrides for the 8-bit tier), so any MLX-compatible runtime can load the bundle without custom decode paths.
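
To inspect the overrides, something like the following sketch may help. It assumes the quantization block holds default `bits` / `group_size` entries plus one nested dict per overridden module; that layout and the module paths in the example fragment are assumptions for illustration, not copied from this bundle.

```python
import json
from collections import Counter


def summarize_overrides(quant):
    """Return the default bit width and a count of per-module overrides by bit width."""
    default_bits = quant.get("bits")
    # Assumption: every dict-valued entry is a per-module override.
    overrides = {name: spec for name, spec in quant.items() if isinstance(spec, dict)}
    per_tier = Counter(spec.get("bits", default_bits) for spec in overrides.values())
    return default_bits, dict(per_tier)


# Hypothetical fragment; real module paths in config.json may differ.
example = json.loads("""
{
  "group_size": 64,
  "bits": 4,
  "language_model.model.layers.3.self_attn.q_proj": {"group_size": 64, "bits": 8},
  "language_model.model.layers.3.self_attn.o_proj": {"group_size": 64, "bits": 8},
  "language_model.lm_head": {"group_size": 64, "bits": 8}
}
""")

default_bits, per_tier = summarize_overrides(example)
print(default_bits, per_tier)
```

Against a real bundle, the same function can be fed the `"quantization"` entry of the parsed config.json.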

Why mixed 4/8? This dense 27B has 16 full-attention layers whose q_proj output is fused with a sigmoid(gate) multiplier. Noise near the gate's transition zone is amplified when the attention projections are 4-bit, so JANG_4M keeps those projections at 8-bit and absorbs the compression into the 64-layer dense FFN.
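
One way to read the transition-zone remark is that sigmoid is steepest around zero, so a fixed perturbation of the gate pre-activation moves the gate value most there. The sketch below illustrates that reading with scalar values; the noise magnitude and the helper names are illustrative assumptions, not measurements from this model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_output(attn_out, gate):
    """Multiply each attention output element by sigmoid of its gate value."""
    return [a * sigmoid(g) for a, g in zip(attn_out, gate)]


def gate_shift(g, noise):
    """Change in sigmoid(g) when the pre-activation is perturbed by `noise`."""
    return abs(sigmoid(g + noise) - sigmoid(g))


noise = 0.05                               # illustrative perturbation size
near_transition = gate_shift(0.0, noise)   # gate half open
saturated = gate_shift(6.0, noise)         # gate almost fully open
print(f"shift near transition: {near_transition:.5f}")
print(f"shift when saturated:  {saturated:.5f}")
```

The same perturbation changes the gate far more near zero than in the saturated region, which is the regime the 8-bit tier is meant to protect.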

Architecture notes

Hybrid attention stack: 48 of 64 layers use Gated DeltaNet, a linear-attention / delta-rule hybrid with a grouped conv1d input path and per-head A_log / dt_bias state — constant memory in sequence length. The other 16 layers (one every 4, given by full_attention_interval: 4) use full softmax attention with attn_output_gate: true — q_proj produces a fused (queries, gate) tensor; attention output is multiplied by sigmoid(gate) before o_proj.
Partial rotary embeddings: only the first 25% of head dim rotates (partial_rotary_factor: 0.25), rope_theta = 1e7. Position metadata for mixed text/image/video (mrope_section, mrope_interleaved: true) is preserved in config.json.
Dense FFN: no MoE. Each layer has gate_proj/up_proj (5120 → 17408) + down_proj (17408 → 5120) with SwiGLU activation.
Vision tower: qwen3_vl ViT, 27 layers, hidden 1152, patch 16, temporal_patch 2. Produces video token sequences via 3D conv patch-embed (pairs of frames merge into one temporal patch).
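
The layer counts and FFN size above can be checked with a few lines of arithmetic. This sketch assumes that every fourth layer (counting from 1) is the full-attention one and that the FFN projections carry no bias terms; both are assumptions about the layout, not statements from config.json.

```python
NUM_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4
HIDDEN, INTERMEDIATE = 5120, 17408


def layer_types(num_layers, interval):
    """Assumed schedule: layers 4, 8, 12, ... (1-based) use full attention."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = layer_types(NUM_LAYERS, FULL_ATTENTION_INTERVAL)
num_full = types.count("full_attention")
num_linear = types.count("linear_attention")

# gate_proj and up_proj are HIDDEN x INTERMEDIATE, down_proj is INTERMEDIATE x HIDDEN.
ffn_params_per_layer = 3 * HIDDEN * INTERMEDIATE
ffn_params_total = ffn_params_per_layer * NUM_LAYERS

print(num_linear, num_full)
print(f"FFN parameters: {ffn_params_total / 1e9:.1f} B")
```

Under these assumptions the dense FFN accounts for roughly 17.1 B of the 27.3 B parameters, which matches the "bulk of parameters" note in the bit-allocation table.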

Usage

Load in Osaurus on Apple Silicon (macOS) — single-click deploy, local chat + vision, no Python setup. The bundle also loads in any Apple Silicon MLX runtime that supports qwen3_5 VL bundles with per-module quantization config (see config.json["quantization"]).

Reasoning on/off, image inference, and video inference are all verified on this quant.

Verified modalities

Test	Result
Chat template (with + without thinking)	✓ coherent
Text (enable_thinking=False): "The capital of France is" → "Paris"	✓
Text: "Translate to French: Hello, how are you?" → "Bonjour, comment allez-vous ?"	✓
Text: `def fibonacci(n):` → correct recursive continuation	✓
VL image: solid red/green/blue/yellow 224×224 → correct color ID	✓ 4/4
VL video: 4-frame RGBY sequence → structurally coherent description	✓

The 4-frame RGBY video encodes as 2 temporal patches via temporal_patch_size=2, which the model perceives as a 2-region color composition — identical behavior to the BF16 source. This is not a quant artifact.

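For synthetic PIL-frame video tests, set processor.video_processor.do_sample_frames = False so the processor uses every supplied frame as-is instead of resampling; the frames are then paired into temporal patches as described above.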
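
The frame-to-patch arithmetic is small enough to sketch directly. The helper below is hypothetical, and rounding up for odd frame counts is an assumption about how a trailing unpaired frame would be handled.

```python
import math


def temporal_patches(num_frames, temporal_patch_size=2):
    """Number of temporal patches when consecutive frames are merged in groups."""
    # Assumption: a trailing partial group still produces one patch.
    return math.ceil(num_frames / temporal_patch_size)


print(temporal_patches(4))  # the 4-frame RGBY test clip
```

For the 4-frame RGBY clip this gives 2 temporal patches, consistent with the 2-region description reported above.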

MMLU-200 (10 subjects × 20 questions, reasoning OFF)

Both quants (MXFP4 and JANG_4M) were evaluated on the same 200-question slice of MMLU with enable_thinking=False (direct answer, no <think> preamble). Same prompts, same greedy decode, same extraction.

Subject	MXFP4	JANG_4M	Δ (JANG − MXFP4)
abstract_algebra	12/20 (60.0%)	15/20 (75.0%)	+3
anatomy	18/20 (90.0%)	16/20 (80.0%)	-2
astronomy	20/20 (100.0%)	19/20 (95.0%)	-1
college_computer_science	16/20 (80.0%)	16/20 (80.0%)	0
college_physics	15/20 (75.0%)	15/20 (75.0%)	0
high_school_biology	19/20 (95.0%)	19/20 (95.0%)	0
high_school_chemistry	16/20 (80.0%)	15/20 (75.0%)	-1
high_school_mathematics	12/20 (60.0%)	14/20 (70.0%)	+2
logical_fallacies	20/20 (100.0%)	19/20 (95.0%)	-1
world_religions	19/20 (95.0%)	17/20 (85.0%)	-2
Total	167/200 (83.5%)	165/200 (82.5%)	−1.0 pp

Both quants score above 82% on reasoning-OFF MMLU, and MXFP4 edges ahead by 1 pp overall (2 questions out of 200). JANG_4M wins on the harder math-heavy subjects (abstract_algebra +3, high_school_mathematics +2), plausibly because the 8-bit full-attention projections carry more signal on multi-step symbolic chains. MXFP4 wins on the rote-recall subjects (anatomy, world_religions) by 2 questions each; the remaining six subjects are ties or differ by a single question.
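
The totals in the table can be recomputed from the per-subject rows. The sketch below does that and, as a rough yardstick only, adds a simple binomial standard error for a 200-question score; it treats questions as independent and ignores that both quants answered the same items.

```python
import math

# (MXFP4 correct, JANG_4M correct) out of 20 per subject, copied from the table.
scores = {
    "abstract_algebra": (12, 15),
    "anatomy": (18, 16),
    "astronomy": (20, 19),
    "college_computer_science": (16, 16),
    "college_physics": (15, 15),
    "high_school_biology": (19, 19),
    "high_school_chemistry": (16, 15),
    "high_school_mathematics": (12, 14),
    "logical_fallacies": (20, 19),
    "world_religions": (19, 17),
}
QUESTIONS_PER_SUBJECT = 20

total_questions = QUESTIONS_PER_SUBJECT * len(scores)
mxfp4_total = sum(m for m, _ in scores.values())
jang_total = sum(j for _, j in scores.values())
delta_pp = 100 * (jang_total - mxfp4_total) / total_questions


def binomial_se_pp(correct, n):
    """Standard error of an accuracy estimate, in percentage points."""
    p = correct / n
    return 100 * math.sqrt(p * (1 - p) / n)


print(mxfp4_total, jang_total, f"{delta_pp:+.1f} pp")
print(f"SE at n={total_questions}: {binomial_se_pp(mxfp4_total, total_questions):.1f} pp")
```

The standard error comes out larger than the 1 pp gap, so the overall difference and the per-subject differences on 20 questions each are best read as indicative.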

Reasoning ON: not yet measured. Qwen 3.6 is a reasoning-optional model — with enable_thinking=True the model generates a <think>…</think> block before answering, which typically lifts MMLU significantly. Reasoning-ON benchmarks for both quants are planned as a follow-up.

Hardware notes

17.5 GB weights on disk; once loaded, expect ~18–22 GB resident plus KV cache.

Mac	Works?	Notes
24 GB unified	⚠️	Text + image tight; no video
32 GB unified	✅	Comfortable for text + image + short video
48 GB+ unified	✅	Full context + VL + video
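
To budget the KV cache on top of the resident weights, a back-of-the-envelope helper like the one below can be used. Only the 16 full-attention layers are counted, since the Gated DeltaNet layers keep constant-size state. The `kv_heads` and `head_dim` defaults are hypothetical placeholders, not values from this model; read the real ones from config.json before relying on the result.

```python
def kv_cache_bytes(seq_len, full_attn_layers=16, kv_heads=4, head_dim=256,
                   bytes_per_value=2):
    """Approximate KV cache size: keys plus values for each full-attention layer.

    kv_heads and head_dim are placeholder assumptions; bytes_per_value=2
    corresponds to a 16-bit cache.
    """
    return 2 * full_attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


def to_gib(num_bytes):
    return num_bytes / 2**30


for tokens in (8_192, 65_536, 262_144):
    print(f"{tokens:>7} tokens: {to_gib(kv_cache_bytes(tokens)):.2f} GiB")
```

The cache grows linearly with context length, which is why the full 262 144-token window is listed only for the larger memory configurations.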

License

Apache 2.0 — inherits from the base model.

Packaged on Apple Silicon by Osaurus.
© 2026 Osaurus AI — osaurus.ai
