Instructions to use OsaurusAI/Qwen3.6-27B-JANG_4M with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use OsaurusAI/Qwen3.6-27B-JANG_4M with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("OsaurusAI/Qwen3.6-27B-JANG_4M") config = load_config("OsaurusAI/Qwen3.6-27B-JANG_4M") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use OsaurusAI/Qwen3.6-27B-JANG_4M with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "OsaurusAI/Qwen3.6-27B-JANG_4M"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "OsaurusAI/Qwen3.6-27B-JANG_4M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use OsaurusAI/Qwen3.6-27B-JANG_4M with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "OsaurusAI/Qwen3.6-27B-JANG_4M"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default OsaurusAI/Qwen3.6-27B-JANG_4M
Run Hermes
hermes
Qwen 3.6 27B — JANG_4M (MLX)
Balanced 4-bit quantization of Alibaba's hybrid linear/full-attention dense 27B VL — full-attention q/k/v/o, embeddings and lm_head at 8-bit affine; dense FFN, linear-attention projections and vision tower at 4-bit affine.
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Parameters | 27.3 B, dense (no MoE) |
| Architecture | qwen3_5 — 64 decoder layers: 48 Gated DeltaNet (linear-attn) + 16 full-attention with swish output gate |
| Quantization | JANG_4M — mixed 4/8-bit native MLX affine |
| Package size on disk | 17.5 GB across 11 shards |
| Avg bits/param | 4.45 |
| vs BF16 source | 52 GB → 17.5 GB, 3.0× compression |
| Context | 262 144 native; upstream card reports up to ~1 M with YaRN |
| Vision tower | 27-layer ViT (hidden 1152, patch 16), temporal_patch 2, quantized at 4-bit with patch-embed axes pre-transposed to MLX layout |
| Chat format | Qwen im_start/im_end with enable_thinking toggle |
JANG_4M bit allocation
| Category | Bits | Group | Notes |
|---|---|---|---|
Full-attention projections (q_proj, k_proj, v_proj, o_proj) — 16 layers |
8 | 64 | Precision-critical. q_proj is fused with a swish output gate (half queries / half gate) |
Embedding (embed_tokens), lm_head |
8 | 64 | Input/output precision bound |
Dense FFN (mlp.gate_proj, mlp.up_proj, mlp.down_proj) — 64 layers |
4 | 64 | Bulk of parameters |
Linear-attention projections (in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj) — 48 layers |
4 | 64 | |
| Vision tower (27 ViT layers) | 4 | 64 | |
Norms, A_log, dt_bias, conv1d |
bf16 | — | Passthrough |
| MTP head | — | — | Stripped — mlx_vlm doesn't use it |
Per-module bit overrides are encoded directly in config.json["quantization"] (65 overrides for the 8-bit tier), so any MLX-compatible runtime can load the bundle without custom decode paths.
Why mixed 4/8? This dense 27B has 16 full-attention layers whose q_proj is fused with a sigmoid(gate) multiplier — activation noise near the gate transition zone is amplifying if attention is 4-bit. JANG_4M keeps those projections at 8-bit while absorbing the compression into the 64-layer dense FFN.
Architecture notes
- Hybrid attention stack: 48 of 64 layers use
Gated DeltaNet, a linear-attention / delta-rule hybrid with a groupedconv1dinput path and per-headA_log/dt_biasstate — constant memory in sequence length. The other 16 layers (one every 4, given byfull_attention_interval: 4) use full softmax attention withattn_output_gate: true—q_projproduces a fused (queries, gate) tensor; attention output is multiplied bysigmoid(gate)beforeo_proj. - Partial rotary embeddings: only the first 25% of head dim rotates (
partial_rotary_factor: 0.25),rope_theta = 1e7. Position metadata for mixed text/image/video (mrope_section,mrope_interleaved: true) is preserved inconfig.json. - Dense FFN: no MoE. Each layer has
gate_proj/up_proj(5120 → 17408) +down_proj(17408 → 5120) with SwiGLU activation. - Vision tower:
qwen3_vlViT, 27 layers, hidden 1152, patch 16, temporal_patch 2. Produces video token sequences via 3D conv patch-embed (pairs of frames merge into one temporal patch).
Usage
Load in Osaurus on Apple Silicon (macOS) — single-click deploy, local chat + vision, no Python setup. The bundle also loads in any Apple Silicon MLX runtime that supports qwen3_5 VL bundles with per-module quantization config (see config.json["quantization"]).
Reasoning on/off, image inference, and video inference are all verified on this quant.
Verified modalities
| Test | Result |
|---|---|
| Chat template (with + without thinking) | ✓ coherent |
| Text (enable_thinking=False): "The capital of France is" → "Paris" | ✓ |
| Text: "Translate to French: Hello, how are you?" → "Bonjour, comment allez-vous ?" | ✓ |
Text: def fibonacci(n): → correct recursive continuation |
✓ |
| VL image: solid red/green/blue/yellow 224×224 → correct color ID | ✓ 4/4 |
| VL video: 4-frame RGBY sequence → structurally coherent description | ✓ |
The 4-frame RGBY video encodes as 2 temporal patches via temporal_patch_size=2, which the model perceives as a 2-region color composition — identical behavior to the BF16 source. This is not a quant artifact.
For synthetic PIL-frame video tests, set processor.video_processor.do_sample_frames = False so each frame maps 1:1 to a patch.
MMLU-200 (10 subjects × 20 questions, reasoning OFF)
Both quants evaluated on the same 200-question slice of MMLU with enable_thinking=False (direct answer, no <think> preamble). Same prompts, same greedy decode, same extraction.
| Subject | MXFP4 | JANG_4M | Δ (JANG − MXFP4) |
|---|---|---|---|
| abstract_algebra | 12/20 (60.0%) | 15/20 (75.0%) | +3 |
| anatomy | 18/20 (90.0%) | 16/20 (80.0%) | -2 |
| astronomy | 20/20 (100.0%) | 19/20 (95.0%) | -1 |
| college_computer_science | 16/20 (80.0%) | 16/20 (80.0%) | 0 |
| college_physics | 15/20 (75.0%) | 15/20 (75.0%) | 0 |
| high_school_biology | 19/20 (95.0%) | 19/20 (95.0%) | 0 |
| high_school_chemistry | 16/20 (80.0%) | 15/20 (75.0%) | -1 |
| high_school_mathematics | 12/20 (60.0%) | 14/20 (70.0%) | +2 |
| logical_fallacies | 20/20 (100.0%) | 19/20 (95.0%) | -1 |
| world_religions | 19/20 (95.0%) | 17/20 (85.0%) | -2 |
| Total | 167/200 (83.5%) | 165/200 (82.5%) | −1.0 pp |
Both quants are strong baselines on reasoning-OFF MMLU. MXFP4 edges ahead by 1 pp overall. JANG_4M wins on the harder math-heavy subjects (abstract_algebra +3, high_school_mathematics +2) — plausibly because the 8-bit full-attention projections carry more signal on multi-step symbolic chains. MXFP4 wins on rote-recall subjects (anatomy, world_religions) by ~2 each, closer to ties on factual/scientific subjects.
Reasoning ON: not yet measured. Qwen 3.6 is a reasoning-optional model — with
enable_thinking=Truethe model generates a<think>…</think>block before answering, which typically lifts MMLU significantly. Reasoning-ON benchmarks for both quants are planned as a follow-up.
Hardware notes
17.5 GB weights on disk; once loaded, expect ~18–22 GB resident plus KV cache.
| Mac | Works? | Notes |
|---|---|---|
| 24 GB unified | ⚠️ | Text + image tight; no video |
| 32 GB unified | ✅ | Comfortable for text + image + short video |
| 48 GB+ unified | ✅ | Full context + VL + video |
License
Apache 2.0 — inherits from the base model.
Packaged on Apple Silicon by Osaurus.
© 2026 Osaurus AI — osaurus.ai
- Downloads last month
- 149
Quantized
Model tree for OsaurusAI/Qwen3.6-27B-JANG_4M
Base model
Qwen/Qwen3.6-27B