Osaurus AI

Qwen 3.6 27B — JANG_4M (MLX)

Balanced 4-bit quantization of Alibaba's hybrid linear/full-attention dense 27B VL — full-attention q/k/v/o, embeddings and lm_head at 8-bit affine; dense FFN, linear-attention projections and vision tower at 4-bit affine.

Website  OsaurusAI


Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen3.6-27B |
| Parameters | 27.3 B, dense (no MoE) |
| Architecture | qwen3_5: 64 decoder layers, 48 Gated DeltaNet (linear-attention) + 16 full-attention with sigmoid output gate |
| Quantization | JANG_4M: mixed 4/8-bit native MLX affine |
| Package size on disk | 17.5 GB across 11 shards |
| Avg bits/param | 4.45 |
| vs BF16 source | 52 GB → 17.5 GB, 3.0× compression |
| Context | 262 144 native; upstream card reports up to ~1 M with YaRN |
| Vision tower | 27-layer ViT (hidden 1152, patch 16), temporal_patch 2, quantized at 4-bit with patch-embed axes pre-transposed to MLX layout |
| Chat format | Qwen im_start/im_end with enable_thinking toggle |
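
As a quick check on the size figures in the table, the compression ratio can be recomputed from the two quoted sizes. This is only a sketch of the arithmetic; the constant and function names are illustrative and do not come from the bundle.

```python
# Figures quoted in the Model Details table.
BF16_SIZE_GB = 52.0      # BF16 source checkpoint
PACKAGE_SIZE_GB = 17.5   # JANG_4M bundle on disk


def compression_ratio(source_gb: float, quantized_gb: float) -> float:
    """Return how many times smaller the quantized bundle is."""
    return source_gb / quantized_gb


ratio = compression_ratio(BF16_SIZE_GB, PACKAGE_SIZE_GB)
saved_gb = BF16_SIZE_GB - PACKAGE_SIZE_GB
print(f"{ratio:.1f}x smaller, {saved_gb:.1f} GB saved")
```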

JANG_4M bit allocation

| Category | Bits | Group size | Notes |
| --- | --- | --- | --- |
| Full-attention projections (q_proj, k_proj, v_proj, o_proj), 16 layers | 8 | 64 | Precision-critical. q_proj is fused with a sigmoid output gate (half queries / half gate) |
| Embedding (embed_tokens), lm_head | 8 | 64 | Input/output precision bound |
| Dense FFN (mlp.gate_proj, mlp.up_proj, mlp.down_proj), 64 layers | 4 | 64 | Bulk of parameters |
| Linear-attention projections (in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj), 48 layers | 4 | 64 | |
| Vision tower (27 ViT layers) | 4 | 64 | |
| Norms, A_log, dt_bias, conv1d | bf16 | | Passthrough |
| MTP head | | | Stripped; mlx_vlm doesn't use it |

Per-module bit overrides are encoded directly in config.json["quantization"] (65 overrides for the 8-bit tier), so any MLX-compatible runtime can load the bundle without custom decode paths.
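
To make the override mechanism concrete, here is a minimal sketch of how a loader could resolve the bit width for one module. The schema (bundle-wide `group_size` / `bits` defaults plus entries keyed by module path) and the module paths themselves are assumptions for illustration; check the bundle's actual config.json for the real keys.

```python
import json

# Hypothetical excerpt: defaults plus two per-module overrides.
# Key names and module paths are illustrative, not copied from the bundle.
CONFIG_TEXT = """
{
  "quantization": {
    "group_size": 64,
    "bits": 4,
    "language_model.model.layers.3.self_attn.q_proj": {"group_size": 64, "bits": 8},
    "language_model.lm_head": {"group_size": 64, "bits": 8}
  }
}
"""


def resolve_quant(quant_cfg: dict, module_path: str) -> dict:
    """Return the override for module_path, else the bundle-wide default."""
    override = quant_cfg.get(module_path)
    if isinstance(override, dict):
        return {"group_size": override["group_size"], "bits": override["bits"]}
    return {"group_size": quant_cfg["group_size"], "bits": quant_cfg["bits"]}


quant_cfg = json.loads(CONFIG_TEXT)["quantization"]
attn = resolve_quant(quant_cfg, "language_model.model.layers.3.self_attn.q_proj")
ffn = resolve_quant(quant_cfg, "language_model.model.layers.3.mlp.up_proj")
print(attn["bits"], ffn["bits"])
```

Keeping the overrides in plain config data is what lets a generic runtime pick the right width per module without knowing anything about JANG_4M itself.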

Why mixed 4/8? This dense 27B has 16 full-attention layers whose q_proj is fused with a sigmoid(gate) multiplier; noise near the gate's transition zone is amplified when the attention projections are 4-bit. JANG_4M keeps those projections at 8-bit and concentrates the compression in the 64-layer dense FFN.
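
The precision gap between the two tiers can be illustrated with a simplified group-wise affine round trip. This is a sketch under stated assumptions: it uses a per-group minimum and scale in pure Python on synthetic Gaussian weights, and it is not the MLX kernel or the actual JANG_4M conversion code.

```python
import random


def affine_roundtrip(weights, bits, group_size=64):
    """Quantize each group to `bits` with a per-group scale and offset, then dequantize."""
    levels = (1 << bits) - 1
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # guard against a constant group
        for w in group:
            q = round((w - lo) / scale)      # integer code in [0, levels]
            restored.append(q * scale + lo)  # dequantized value
    return restored


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in weights
err4 = mean_abs_error(weights, affine_roundtrip(weights, bits=4))
err8 = mean_abs_error(weights, affine_roundtrip(weights, bits=8))
print(f"4-bit MAE: {err4:.2e}, 8-bit MAE: {err8:.2e}, ratio: {err4 / err8:.1f}")
```

With 15 versus 255 quantization steps per group, the 4-bit reconstruction error is roughly an order of magnitude larger, which is the error that the 8-bit tier avoids on the gated projections.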


Architecture notes

  • Hybrid attention stack: 48 of 64 layers use Gated DeltaNet, a linear-attention / delta-rule hybrid with a grouped conv1d input path and per-head A_log / dt_bias state, so memory stays constant in sequence length. The other 16 layers (one in every 4, set by full_attention_interval: 4) use full softmax attention with attn_output_gate: true. In those layers q_proj produces a fused (queries, gate) tensor, and the attention output is multiplied by sigmoid(gate) before o_proj.
  • Partial rotary embeddings: only the first 25% of head dim rotates (partial_rotary_factor: 0.25), rope_theta = 1e7. Position metadata for mixed text/image/video (mrope_section, mrope_interleaved: true) is preserved in config.json.
  • Dense FFN: no MoE. Each layer has gate_proj/up_proj (5120 → 17408) + down_proj (17408 → 5120) with SwiGLU activation.
  • Vision tower: qwen3_vl ViT, 27 layers, hidden 1152, patch 16, temporal_patch 2. Produces video token sequences via 3D conv patch-embed (pairs of frames merge into one temporal patch).
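
The layer layout, the output gate, and the FFN size described in the list above can be sketched in a few lines. Two assumptions are made here and should be checked against the model code: that the full-attention layers sit at every fourth position counting from 1, and that the FFN projections carry no bias terms. The function names are illustrative.

```python
import math

NUM_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4  # full_attention_interval from the card


def layer_types(num_layers: int, interval: int) -> list[str]:
    """Assumption: every `interval`-th layer (1-based) is full attention."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


def gated_output(attn_out: list[float], gate: list[float]) -> list[float]:
    """Elementwise attn_out * sigmoid(gate), applied before o_proj."""
    return [a / (1.0 + math.exp(-g)) for a, g in zip(attn_out, gate)]


def ffn_params(hidden: int, intermediate: int, layers: int) -> int:
    """Weights in gate_proj + up_proj + down_proj across all layers (no biases assumed)."""
    return 3 * hidden * intermediate * layers


types = layer_types(NUM_LAYERS, FULL_ATTENTION_INTERVAL)
print(types.count("linear_attention"), types.count("full_attention"))
print(gated_output([1.0, 1.0], [0.0, 10.0]))
print(ffn_params(5120, 17408, 64))
```

Under these assumptions the FFN accounts for about 17.1 B weights, roughly 63% of the 27.3 B total, which is why the 4-bit tier is applied there.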

Usage

Load in Osaurus on Apple Silicon (macOS) — single-click deploy, local chat + vision, no Python setup. The bundle also loads in any Apple Silicon MLX runtime that supports qwen3_5 VL bundles with per-module quantization config (see config.json["quantization"]).

Reasoning on/off, image inference, and video inference are all verified on this quant.


Verified modalities

Test Result
Chat template (with + without thinking) ✓ coherent
Text (enable_thinking=False): "The capital of France is" → "Paris"
Text: "Translate to French: Hello, how are you?" → "Bonjour, comment allez-vous ?"
Text: def fibonacci(n): → correct recursive continuation
VL image: solid red/green/blue/yellow 224×224 → correct color ID ✓ 4/4
VL video: 4-frame RGBY sequence → structurally coherent description

The 4-frame RGBY video encodes as 2 temporal patches via temporal_patch_size=2, which the model perceives as a 2-region color composition — identical behavior to the BF16 source. This is not a quant artifact.

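For synthetic PIL-frame video tests, set processor.video_processor.do_sample_frames = False so the processor uses every supplied frame as-is instead of resampling; consecutive frames are then paired into temporal patches as described above.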
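
The frame-to-patch grouping behind the 4-frame result can be sketched as follows. This only illustrates the pairing implied by temporal_patch_size=2; it is not the processor's implementation, and the helper name is hypothetical.

```python
def temporal_patches(frames: list, temporal_patch_size: int = 2) -> list[tuple]:
    """Group consecutive frames into temporal patches.

    Assumes the frame count is a multiple of temporal_patch_size; how a real
    processor pads a ragged tail is not covered here.
    """
    if len(frames) % temporal_patch_size != 0:
        raise ValueError("frame count must be a multiple of temporal_patch_size")
    return [
        tuple(frames[i:i + temporal_patch_size])
        for i in range(0, len(frames), temporal_patch_size)
    ]


patches = temporal_patches(["red", "green", "blue", "yellow"])
print(len(patches), patches)
```

Four solid-color frames collapse into two patches, each mixing two colors, which matches the 2-region composition the model reports.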



MMLU-200 (10 subjects × 20 questions, reasoning OFF)

Both quants (MXFP4 and JANG_4M) were evaluated on the same 200-question slice of MMLU with enable_thinking=False (direct answer, no <think> preamble), using the same prompts, the same greedy decoding, and the same answer extraction.

| Subject | MXFP4 | JANG_4M | Δ (JANG − MXFP4) |
| --- | --- | --- | --- |
| abstract_algebra | 12/20 (60.0%) | 15/20 (75.0%) | +3 |
| anatomy | 18/20 (90.0%) | 16/20 (80.0%) | −2 |
| astronomy | 20/20 (100.0%) | 19/20 (95.0%) | −1 |
| college_computer_science | 16/20 (80.0%) | 16/20 (80.0%) | 0 |
| college_physics | 15/20 (75.0%) | 15/20 (75.0%) | 0 |
| high_school_biology | 19/20 (95.0%) | 19/20 (95.0%) | 0 |
| high_school_chemistry | 16/20 (80.0%) | 15/20 (75.0%) | −1 |
| high_school_mathematics | 12/20 (60.0%) | 14/20 (70.0%) | +2 |
| logical_fallacies | 20/20 (100.0%) | 19/20 (95.0%) | −1 |
| world_religions | 19/20 (95.0%) | 17/20 (85.0%) | −2 |
| Total | 167/200 (83.5%) | 165/200 (82.5%) | −1.0 pp |

Both quants are strong baselines on reasoning-OFF MMLU. MXFP4 edges ahead by 1 pp overall (2 questions out of 200). JANG_4M wins on the harder math-heavy subjects (abstract_algebra +3, high_school_mathematics +2), plausibly because the 8-bit full-attention projections carry more signal on multi-step symbolic chains. MXFP4 wins by 2 questions each on the rote-recall subjects (anatomy, world_religions) and by 1 question each on astronomy, high_school_chemistry and logical_fallacies; the remaining three subjects are ties.
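
The totals and the overall gap can be re-derived from the per-subject counts reported above. The snippet below is a small consistency check on those published numbers, not the evaluation harness itself.

```python
# (subject, MXFP4 correct, JANG_4M correct), each out of 20 questions.
SCORES = [
    ("abstract_algebra", 12, 15),
    ("anatomy", 18, 16),
    ("astronomy", 20, 19),
    ("college_computer_science", 16, 16),
    ("college_physics", 15, 15),
    ("high_school_biology", 19, 19),
    ("high_school_chemistry", 16, 15),
    ("high_school_mathematics", 12, 14),
    ("logical_fallacies", 20, 19),
    ("world_religions", 19, 17),
]
QUESTIONS_PER_SUBJECT = 20

total_questions = QUESTIONS_PER_SUBJECT * len(SCORES)
mxfp4_total = sum(m for _, m, _ in SCORES)
jang_total = sum(j for _, _, j in SCORES)
# Gap in percentage points, JANG_4M minus MXFP4.
delta_pp = 100.0 * (jang_total - mxfp4_total) / total_questions
jang_wins = [name for name, m, j in SCORES if j > m]
ties = [name for name, m, j in SCORES if j == m]

print(mxfp4_total, jang_total, delta_pp)
print("JANG_4M ahead on:", jang_wins)
print("Ties:", ties)
```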

Reasoning ON: not yet measured. Qwen 3.6 is a reasoning-optional model — with enable_thinking=True the model generates a <think>…</think> block before answering, which typically lifts MMLU significantly. Reasoning-ON benchmarks for both quants are planned as a follow-up.


Hardware notes

17.5 GB weights on disk; once loaded, expect ~18–22 GB resident plus KV cache.

| Mac | Works? | Notes |
| --- | --- | --- |
| 24 GB unified | ⚠️ | Text + image tight; no video |
| 32 GB unified | ✓ | Comfortable for text + image + short video |
| 48 GB+ unified | ✓ | Full context + VL + video |
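
To budget the KV cache on top of the resident weights, a rough estimate can be sketched as below. Only the 16 full-attention layers keep a cache that grows with sequence length; the Gated DeltaNet layers hold fixed-size state. The kv_heads and head_dim values are placeholders, not figures from this card, and the estimate assumes an unquantized 2-byte cache, so substitute the real values from config.json before relying on the numbers.

```python
def kv_cache_bytes(full_attn_layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes for keys + values across the layers that keep a growing cache."""
    return 2 * full_attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem


# 16 full-attention layers comes from the card; kv_heads and head_dim below are
# PLACEHOLDERS, so read the real values from config.json before trusting the output.
FULL_ATTN_LAYERS = 16
KV_HEADS = 4      # placeholder
HEAD_DIM = 256    # placeholder

for tokens in (8_192, 65_536, 262_144):
    gib = kv_cache_bytes(FULL_ATTN_LAYERS, KV_HEADS, HEAD_DIM, tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB KV cache")
```

The cache scales linearly with context length, so long-context use is what pushes the requirement toward the larger memory tiers in the table.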

License

Apache 2.0 — inherits from the base model.


Packaged on Apple Silicon by Osaurus.
© 2026 Osaurus AI — osaurus.ai
