🪨 caveman-qwen3.6

A QLoRA fine-tune of unsloth/Qwen3.6-35B-A3B trained to produce minimized, direct, high-quality responses by default — no system prompt required. Brevity is baked into the weights.

Inspired by Mintzs/oogaboogalm (Qwen2.5-7B and Qwen3-8B variants); this is the Qwen3.6-35B-A3B (MoE) port.


Why

Standard LLMs are verbose. They greet, restate the question, and pad with markdown headers. The common fix is a system prompt like "Be concise", but that:

  • Wastes tokens on every single call
  • Is not reliably enforced across all responses
  • Adds boilerplate to every integration

caveman-qwen3.6 eliminates this — the compression behavior lives in the weights.


Before / After

Sampled from a 5-prompt smoke test (greedy decoding, enable_thinking=False, no system prompt):

| Prompt | Base (unsloth/Qwen3.6-35B-A3B) | caveman-qwen3.6 |
|---|---|---|
| "How do I reverse a string in Python?" | 98+ words: markdown headers, multiple methods, prose explanation | 8 words: s = "hello"; reversed_s = s[::-1] |
| "What is the capital of Japan?" | 53 words: explains Edo→Tokyo history, mentions constitution | 1 word: Tokyo. |
| "Write a function that returns true if a number is even." | 22 words: type hints + docstring | 10 words: bare function |
| "Explain what a closure is in JavaScript." | 45+ words: markdown structured | 31 words: direct definition |
| "How do I list files larger than 100MB on Linux?" | 114 words: explainer table breaking down each flag | 9 words: bare find command |

Average reduction across the five prompts: ~75-90% fewer output words (the table counts words, used here as a proxy for output tokens), with no observed correctness loss on the sample.
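As a rough check on that figure, the sketch below recomputes the reduction from the word counts in the table. It treats "98+" and "45+" as 98 and 45, so the result is a lower bound, and it measures words rather than tokens. The variable names are illustrative only.

```python
# Word counts copied from the Before / After table: (base, caveman).
# "98+" and "45+" are treated as 98 and 45, so reductions are lower bounds.
counts = [(98, 8), (53, 1), (22, 10), (45, 31), (114, 9)]

# Per-prompt reduction, then the unweighted mean across prompts.
per_prompt = [1 - tuned / base for base, tuned in counts]
mean_reduction = sum(per_prompt) / len(per_prompt)

# Aggregate reduction: total words saved over total base words.
total_base = sum(base for base, _ in counts)
total_tuned = sum(tuned for _, tuned in counts)
aggregate_reduction = 1 - total_tuned / total_base

print(f"mean per-prompt reduction: {mean_reduction:.1%}")
print(f"aggregate reduction:       {aggregate_reduction:.1%}")
```

The two averages differ because long base answers dominate the aggregate, while the per-prompt mean weights the short JavaScript and even-number answers equally.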


Use Cases

  • 💸 API cost reduction — fewer output tokens = lower cost at scale
  • Faster inference: fewer tokens to generate per request, so lower total latency (TTFT improves only by the omitted system-prompt tokens)
  • 🤖 Cleaner agent pipelines — response bloat compounds across multi-step LLM calls
  • 🔌 Lightweight integrations — no system-prompt boilerplate needed

Training

| Property | Value |
|---|---|
| Base model | unsloth/Qwen3.6-35B-A3B (256 experts, 8 active per token, hybrid DeltaNet+Attention) |
| Fine-tuning method | QLoRA (4-bit) via Axolotl 0.16.2.dev0 |
| MoE LoRA backend | ScatterMoE (custom Triton kernels) |
| Training environment | RunPod A100 80GB PCIe |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q_proj, k_proj, v_proj, o_proj (attention only) |
| Trainable parameters | 6,881,280 (0.04% of total) |
| Sequence length | 2048 |
| Sample packing | enabled |
| Effective batch size | 8 |
| Epochs | ~3 (33 packed steps) |
| Optimizer | AdamW 8-bit |
| Learning rate | 2e-4 (cosine) |
| Final training loss | 0.519 |
| Training time | 6.6 min |
| Distribution format | Adapter only (apply to base model at runtime) |
| Context window | inherited from base (128K via YaRN) |
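The batch-size and step figures can be cross-checked with a little arithmetic. The sketch below uses values from the table and from the Axolotl config later in this card, and assumes a single GPU and that the 33 steps are optimizer steps. The token capacity is an upper bound, since packed sequences are not necessarily full.

```python
# Values from the training table and the Axolotl config in this card.
micro_batch_size = 2
gradient_accumulation_steps = 4
total_steps = 33
num_epochs = 3
sequence_len = 2048
lora_r, lora_alpha = 32, 64

# Assumption: one GPU, so no data-parallel multiplier.
effective_batch_size = micro_batch_size * gradient_accumulation_steps

# Assumption: the 33 steps are optimizer steps, split evenly across epochs.
steps_per_epoch = total_steps // num_epochs
packed_sequences_per_epoch = steps_per_epoch * effective_batch_size

# Upper bound on tokens seen per epoch if every packed sequence were full.
token_capacity_per_epoch = packed_sequences_per_epoch * sequence_len

# Standard LoRA scaling factor applied to the adapter update (alpha / r).
lora_scaling = lora_alpha / lora_r

print(effective_batch_size, steps_per_epoch, packed_sequences_per_epoch)
print(token_capacity_per_epoch, lora_scaling)
```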

Dataset

A custom synthetic dataset of 1,500 (prompt, terse_response) pairs, generated via Groq Llama-3.3-70B as a "compression engine" teacher.

Domain mix:

  • 600 coding (Python, JS/TS, Go, Bash, SQL) — sampled from CodeAlpaca-20k + Alpaca
  • 400 factual ("what is X", "explain Y") — sampled from Dolly-15k
  • 300 task instructions — sampled from Dolly + Alpaca
  • 200 first-turn conversational — sampled from FineTome-100k

Filters: length 10-500 words, English only, dedup by normalized form, no safety/jailbreak prompts.
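A minimal sketch of the length and dedup filters follows. The card does not say how text was normalized or whether the word limit applies to the prompt, the response, or both. This version assumes the limit applies to the combined pair and that dedup keys on a lowercased, punctuation-stripped prompt. The language and safety filters are omitted, and `normalize` and `filter_pairs` are hypothetical names.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (assumed normalization)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def filter_pairs(pairs, min_words=10, max_words=500):
    """Keep (prompt, response) pairs within the word limits, one per normalized prompt."""
    seen = set()
    kept = []
    for prompt, response in pairs:
        # Assumption: the 10-500 word limit is applied to prompt + response combined.
        n_words = len(prompt.split()) + len(response.split())
        if not (min_words <= n_words <= max_words):
            continue
        key = normalize(prompt)
        if key in seen:  # duplicate prompt after normalization
            continue
        seen.add(key)
        kept.append((prompt, response))
    return kept


pairs = [
    ("How do I reverse a string in Python?", "Use s[::-1] to reverse it."),
    ("how do i reverse a string in python", "Slice with s[::-1] for a reversed copy."),
    ("Hi", "Hello."),
]
kept = filter_pairs(pairs)
print(len(kept))
```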

No system prompt is included in any training example — the brevity behavior is internalized in the weights, not instruction-followed at runtime.
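For illustration, one training record might look like the sketch below. The actual schema of pairs_flat.jsonl is not shown in this card. The `messages` key is an assumption based on the usual layout for Axolotl's `chat_template` dataset type, and `to_record` is a hypothetical helper.

```python
import json


def to_record(prompt: str, terse_response: str) -> dict:
    """Build one chat-format record with a user turn and an assistant turn only."""
    return {
        "messages": [
            # No system message: brevity is learned from the responses themselves.
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": terse_response},
        ]
    }


# One JSONL line per pair.
line = json.dumps(to_record("What is the capital of Japan?", "Tokyo."), ensure_ascii=False)
print(line)
```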

The full dataset (all 1,500 pairs) was audited for rule violations:

  • 0 pleasantry openers
  • 0 question restatements
  • 0 closing pleasantries ("Hope this helps!")
  • 0 meta-commentary ("In summary...")
  • 0 refusals

Median answer length: 15 words. Mean: 26 words. p90: 48 words.
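An audit and length summary of this kind could be approximated as follows. The regex patterns are hypothetical, since the card does not list the exact rules used. Question restatements are left out because they are hard to detect with a simple pattern. The p90 uses the nearest-rank method, which is also an assumption.

```python
import math
import re
from statistics import mean, median

# Hypothetical patterns; the exact audit rules are not given in this card.
RULES = {
    "pleasantry_opener": re.compile(r"^(sure|certainly|of course|great question)\b", re.I),
    "closing_pleasantry": re.compile(r"(hope this helps|let me know if)[^\n]*$", re.I),
    "meta_commentary": re.compile(r"\b(in summary|to summarize|in conclusion)\b", re.I),
    "refusal": re.compile(r"^(i can't|i cannot|i'm sorry|as an ai)\b", re.I),
}


def audit(responses):
    """Count how many responses trigger each rule."""
    counts = {name: 0 for name in RULES}
    for text in responses:
        stripped = text.strip()
        for name, pattern in RULES.items():
            if pattern.search(stripped):
                counts[name] += 1
    return counts


def length_stats(responses):
    """Median, mean, and nearest-rank p90 of response length in words."""
    lengths = sorted(len(r.split()) for r in responses)
    p90_rank = math.ceil(len(lengths) * 9 / 10)
    return {"median": median(lengths), "mean": mean(lengths), "p90": lengths[p90_rank - 1]}


sample = ["Tokyo.", "Sure! Use s[::-1].", "Use find. Hope this helps!"]
print(audit(sample))
print(length_stats(sample))
```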


Usage

With PEFT + Transformers (single GPU, requires base model)

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE = "unsloth/Qwen3.6-35B-A3B"
ADAPTER = "njmason/caveman-qwen3.6"

# Load the base model in 4-bit with bf16 compute, then attach the LoRA adapter.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

# Build a single-turn chat prompt: no system message, thinking disabled.
prompt = "How do I run a Docker container?"
msgs = [{"role": "user", "content": prompt}]
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True, enable_thinking=False)
ids = tok(text, return_tensors="pt").to(model.device)

# Greedy decoding; decode only the newly generated tokens, not the prompt.
with torch.inference_mode():
    out = model.generate(**ids, max_new_tokens=200, do_sample=False)
print(tok.decode(out[0][ids.input_ids.shape[1]:], skip_special_tokens=True).strip())

Tip: enable_thinking=False disables the <think>...</think> block that Qwen3.6 emits by default. With thinking disabled, the terse answer is returned directly. With thinking enabled, the model first reasons inside the <think> block and then answers, which can help on harder problems.
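When thinking is left enabled, the reasoning usually needs to be separated from the final answer before display. The helper below is a minimal sketch with a hypothetical name, `split_thinking`. It assumes the `<think>` and `</think>` tags survive decoding as literal text, and it tolerates output where the opening tag was already part of the prompt.

```python
def split_thinking(decoded: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a decoded generation."""
    head, sep, tail = decoded.partition("</think>")
    if not sep:
        # No closing tag: treat the whole output as the answer.
        return "", decoded.strip()
    # The opening tag may be missing if the chat template already emitted it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_thinking("<think>\nslice reverses\n</think>\n\ns[::-1]")
print(answer)
```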

With llama.cpp (after merging adapter into base)

For deployment via llama.cpp (CPU, Metal, Vulkan, CUDA), merge the adapter into the base model and convert to GGUF. See the Axolotl merge-lora docs and llama.cpp's convert_hf_to_gguf.py.
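As an alternative to the Axolotl merge command, the merge can be sketched with PEFT directly. This assumes enough CPU RAM to hold the base model in bf16, and it has not been verified that llama.cpp's converter supports this architecture. `merge_adapter` and `gguf_convert_command` are hypothetical helper names, and the output paths are placeholders.

```python
def merge_adapter(base_id: str, adapter_id: str, out_dir: str) -> None:
    """Merge the LoRA adapter into unquantized base weights and save the result."""
    # Imported lazily so the command helper below works without these packages.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load in bf16 rather than 4-bit so the merged weights are full precision.
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="cpu"
    )
    merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)


def gguf_convert_command(merged_dir: str, outfile: str, outtype: str = "bf16") -> list[str]:
    """Build the conversion command; run it from a llama.cpp checkout."""
    return ["python", "convert_hf_to_gguf.py", merged_dir, "--outfile", outfile, "--outtype", outtype]


# Example (placeholder paths); the merge call is commented out because it downloads the base model.
# merge_adapter("unsloth/Qwen3.6-35B-A3B", "njmason/caveman-qwen3.6", "./merged")
print(" ".join(gguf_convert_command("./merged", "caveman-qwen3.6.gguf")))
```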


Limitations & Caveats

  • MoE adapter is attention-only. The expert FFN weights were not adapted: an earlier attempt at a full attention+expert adapter crashed in Unsloth's MoE LoRA path on this model, so this run used Axolotl's ScatterMoE backend with attention-only targets. Brevity behavior emerged anyway from attention-level adaptation, but a full attention+expert adapter might improve quality.
  • Trained on a small synthetic dataset — 1,500 pairs. May not generalize perfectly to all domains.
  • Extreme brevity may omit context that some use cases require (tutorials, education, compliance documentation).
  • Not suited for tasks where verbose explanation is desirable — pedagogical content, creative writing, analysis essays.
  • Smoke-tested on 5 prompts only. No formal benchmark suite (MMLU, HumanEval, etc.) was run on the trained model. Production users should evaluate on their own task distribution.
  • Vision capability untested. The base model is multimodal (Qwen3.6 VL); fine-tuning was text-only and the vision pathway was not exercised post-training.

License

Apache-2.0 (matches base model).


Citation

@misc{caveman-qwen3.6,
  author = {Nick Mason},
  title = {caveman-qwen3.6: A brevity-trained QLoRA adapter for Qwen3.6-35B-A3B},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/njmason/caveman-qwen3.6}
}

Inspired by Mintzs/oogaboogalm, itself inspired by JuliusBrussee/caveman.

Axolotl training config
base_model: unsloth/Qwen3.6-35B-A3B

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
  - axolotl.integrations.kernels.KernelsPlugin
  - axolotl.integrations.liger.LigerPlugin
use_kernels: true
use_scattermoe: true
liger_layer_norm: true
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_rms_norm_gated: true

torch_compile: false

chat_template: qwen3_5
datasets:
  - path: pairs_flat.jsonl
    type: chat_template

val_set_size: 0
output_dir: ./outputs/qwen36-caveman-lora
dataset_prepared_path: last_run_prepared

sequence_len: 2048
sample_packing: true

load_in_4bit: true
quantize_moe_experts: true
adapter: qlora

lora_r: 32
lora_alpha: 64
lora_dropout: 0
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj

lora_qkv_kernel: true
lora_o_kernel: true
lora_mlp_kernel: false

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: true

gradient_checkpointing: true
activation_offloading: true
logging_steps: 10
save_strategy: epoch
save_total_limit: 2
flash_attention: true

warmup_ratio: 0.03
weight_decay: 0.01

Framework versions

  • PEFT 0.19.1
  • Transformers 5.5.4
  • PyTorch 2.10.0+cu128
  • Datasets 4.8.5
  • Tokenizers 0.22.2
  • Axolotl 0.16.2.dev0