🪨 caveman-qwen3.6
A QLoRA fine-tune of unsloth/Qwen3.6-35B-A3B trained to produce minimized, direct, high-quality responses by default — no system prompt required. Brevity is baked into the weights.
Inspired by Mintzs/oogaboogalm (Qwen2.5-7B and Qwen3-8B variants); this is the Qwen3.6-35B-A3B (MoE) port.
Why
Standard LLMs are verbose. They greet, restate the question, and pad with markdown headers. The common fix is a system prompt like "Be concise", but that:
- Wastes tokens on every single call
- Is not reliably enforced across all responses
- Adds boilerplate to every integration
caveman-qwen3.6 eliminates this — the compression behavior lives in the weights.
Before / After
Sampled from a 5-prompt smoke test (greedy decoding, enable_thinking=False, no system prompt):
| Prompt | Base (unsloth/Qwen3.6-35B-A3B) | caveman-qwen3.6 |
|---|---|---|
| "How do I reverse a string in Python?" | 98+ words — markdown headers, multiple methods, prose explanation | 8 words — s = "hello"; reversed_s = s[::-1] |
| "What is the capital of Japan?" | 53 words — explains Edo→Tokyo history, mentions constitution | 1 word — Tokyo. |
| "Write a function that returns true if a number is even." | 22 words — type hints + docstring | 10 words — bare function |
| "Explain what a closure is in JavaScript." | 45+ words — markdown structured | 31 words — direct definition |
| "How do I list files larger than 100MB on Linux?" | 114 words — explainer table breaking down each flag | 9 words — bare find command |
Average reduction: ~75-90% fewer output tokens with no observed correctness loss on the sample.
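The table reports word counts rather than token counts, so the figures can be checked by hand. A minimal sketch of that arithmetic follows; it treats "98+" and "45+" as 98 and 45 (lower bounds on the base output), which is an assumption, and the short prompt labels are made up for readability.

```python
# Word counts copied from the Before / After table: (base, caveman).
# ASSUMPTION: "98+" and "45+" are taken as 98 and 45, i.e. lower bounds.
samples = {
    "reverse a string": (98, 8),
    "capital of Japan": (53, 1),
    "even-number function": (22, 10),
    "JavaScript closure": (45, 31),
    "files larger than 100MB": (114, 9),
}


def reduction(base: int, tuned: int) -> float:
    """Fraction of words removed relative to the base model's answer."""
    return 1 - tuned / base


# Reduction for each prompt, then the unweighted mean of those reductions.
per_prompt = {name: reduction(b, t) for name, (b, t) in samples.items()}
mean_reduction = sum(per_prompt.values()) / len(per_prompt)

# Aggregate reduction: total words before versus total words after.
total_base = sum(b for b, _ in samples.values())
total_tuned = sum(t for _, t in samples.values())
aggregate_reduction = reduction(total_base, total_tuned)

for name, r in per_prompt.items():
    print(f"{name:26s} {r:6.1%}")
print(f"mean of per-prompt reductions: {mean_reduction:.1%}")
print(f"aggregate ({total_base} -> {total_tuned} words): {aggregate_reduction:.1%}")
```

By word count the per-prompt reductions are uneven: the closure explanation shrinks far less than the one-word factual answer. Token-level figures depend on the tokenizer and would need to be measured separately.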
Use Cases
- 💸 API cost reduction — fewer output tokens = lower cost at scale
- ⚡ Faster inference: fewer tokens to generate per request lowers total latency, and dropping the system prompt shortens the prefill that drives TTFT
- 🤖 Cleaner agent pipelines — response bloat compounds across multi-step LLM calls
- 🔌 Lightweight integrations — no system-prompt boilerplate needed
Training
| Property | Value |
|---|---|
| Base model | unsloth/Qwen3.6-35B-A3B (256 experts, 8 active per token, hybrid DeltaNet+Attention) |
| Fine-tuning method | QLoRA (4-bit) via Axolotl 0.16.2.dev0 |
| MoE LoRA backend | ScatterMoE (custom Triton kernels) |
| Training environment | RunPod A100 80GB PCIe |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q_proj, k_proj, v_proj, o_proj (attention only) |
| Trainable parameters | 6,881,280 (0.04% of total) |
| Sequence length | 2048 |
| Sample packing | enabled |
| Effective batch size | 8 |
| Epochs | ~3 (33 packed steps) |
| Optimizer | AdamW 8-bit |
| Learning rate | 2e-4 (cosine) |
| Final training loss | 0.519 |
| Training time | 6.6 min |
| Distribution format | Adapter only (apply to base model at runtime) |
| Context window | inherited from base (128K via YaRN) |
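
Several rows in the table follow from each other, and the arithmetic can be sketched as below. The batch, rank, alpha, and step values come from this card. The layer dimensions are assumptions, not values stated here: they were chosen because they reproduce the reported trainable-parameter count, and should be checked against the base model's `config.json`.

```python
# Values taken from the Training table and the Axolotl config in this card.
LORA_RANK = 32
LORA_ALPHA = 64
MICRO_BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 4
TOTAL_STEPS = 33
NUM_EPOCHS = 3
NUM_GPUS = 1  # ASSUMPTION: one GPU, consistent with the reported effective batch size

effective_batch = MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_GPUS
lora_scale = LORA_ALPHA / LORA_RANK        # standard LoRA scaling factor alpha / r
steps_per_epoch = TOTAL_STEPS / NUM_EPOCHS  # approximate, since the card says "~3" epochs

# ASSUMED architecture dimensions (not stated in this card).
HIDDEN = 2048            # model hidden size
N_ATTENTION_LAYERS = 10  # layers that actually contain q/k/v/o projections
Q_OUT = 8192             # q_proj output width
KV_OUT = 512             # k_proj and v_proj output width
O_IN = 4096              # o_proj input width


def lora_params(d_in: int, d_out: int, r: int = LORA_RANK) -> int:
    """A LoRA pair A (r x d_in) and B (d_out x r) adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


per_layer = (
    lora_params(HIDDEN, Q_OUT)          # q_proj
    + 2 * lora_params(HIDDEN, KV_OUT)   # k_proj and v_proj
    + lora_params(O_IN, HIDDEN)         # o_proj
)
total_trainable = per_layer * N_ATTENTION_LAYERS

print(f"effective batch size: {effective_batch}")
print(f"LoRA scale (alpha / r): {lora_scale}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"trainable LoRA parameters: {total_trainable:,}")
```

Under these assumed dimensions the total matches the 6,881,280 in the table, which suggests the adapter touches only the subset of layers that use standard attention rather than every layer in the hybrid stack.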
Dataset
A custom synthetic dataset of 1,500 (prompt, terse_response) pairs, generated via Groq Llama-3.3-70B as a "compression engine" teacher.
Domain mix:
- 600 coding (Python, JS/TS, Go, Bash, SQL) — sampled from CodeAlpaca-20k + Alpaca
- 400 factual ("what is X", "explain Y") — sampled from Dolly-15k
- 300 task instructions — sampled from Dolly + Alpaca
- 200 first-turn conversational — sampled from FineTome-100k
Filters: length 10-500 words, English only, dedup by normalized form, no safety/jailbreak prompts.
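A minimal sketch of the length and dedup filters, under stated assumptions: the word bounds are applied to the prompt (the card does not say which field they apply to), and the `normalize` function is a hypothetical normal form, since the card does not define one. The English-only and safety filters are omitted because they need a language or content classifier.

```python
import re
import unicodedata

MIN_WORDS, MAX_WORDS = 10, 500  # bounds quoted in the Filters line


def normalize(text: str) -> str:
    """Hypothetical normal form: NFKC, lowercase, no punctuation, single spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def filter_prompts(prompts):
    """Keep prompts within the word bounds, dropping normalized duplicates."""
    seen = set()
    kept = []
    for prompt in prompts:
        n_words = len(prompt.split())
        if not MIN_WORDS <= n_words <= MAX_WORDS:
            continue  # too short or too long
        key = normalize(prompt)
        if key in seen:
            continue  # same prompt up to case, punctuation, and spacing
        seen.add(key)
        kept.append(prompt)
    return kept
```

Keeping the first occurrence and preserving input order makes the result deterministic for a given source ordering.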
No system prompt is included in any training example — the brevity behavior is internalized in the weights, not instruction-followed at runtime.
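For illustration, a training record without a system message might look like the sketch below. The `messages` / `role` / `content` layout is an assumption about how `pairs_flat.jsonl` is structured; the card does not show the file, and `to_record` and `write_jsonl` are hypothetical helper names.

```python
import json


def to_record(prompt: str, terse_response: str) -> dict:
    """Build one chat-format record: a user turn and an assistant turn, no system turn."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": terse_response},
        ]
    }


def write_jsonl(pairs, path: str) -> None:
    """Write (prompt, terse_response) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in pairs:
            f.write(json.dumps(to_record(prompt, response), ensure_ascii=False) + "\n")
```

Because every record starts at the user turn, the chat template never renders a system block during training.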
The full dataset (all 1,500 pairs) was audited for rule violations:
- 0 pleasantry openers
- 0 question restatements
- 0 closing pleasantries ("Hope this helps!")
- 0 meta-commentary ("In summary...")
- 0 refusals
Median answer length: 15 words. Mean: 26 words. p90: 48 words.
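An audit of this kind can be approximated with pattern matching plus simple length statistics. The sketch below is not the check used for this dataset: the regular expressions are hypothetical stand-ins for the rule categories listed above, and question restatements are left out because detecting them requires comparing each answer against its prompt.

```python
import re
import statistics

# HYPOTHETICAL patterns; the card names the rule categories but not the exact checks.
RULES = {
    "pleasantry_opener": re.compile(
        r"^\s*(sure|certainly|of course|great question|hello|hi)\b", re.I
    ),
    "closing_pleasantry": re.compile(r"hope this helps|let me know if|feel free to", re.I),
    "meta_commentary": re.compile(r"\b(in summary|in conclusion|to summarize)\b", re.I),
    "refusal": re.compile(r"\b(i can(?:'|no)t (?:help|assist)|i'm sorry, but)\b", re.I),
}


def audit(responses):
    """Count how many responses trip each rule."""
    counts = {name: 0 for name in RULES}
    for text in responses:
        for name, pattern in RULES.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


def length_stats(responses):
    """Median, mean, and 90th-percentile answer length in words (needs >= 2 responses)."""
    lengths = [len(r.split()) for r in responses]
    p90 = statistics.quantiles(lengths, n=10)[-1]  # last of the nine decile cut points
    return {
        "median": statistics.median(lengths),
        "mean": statistics.fmean(lengths),
        "p90": p90,
    }
```

Running `audit` over the answer column and checking that every count is zero mirrors the list above; `length_stats` gives the same three summary figures, though the exact p90 depends on the percentile method used.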
Usage
With PEFT + Transformers (single GPU, requires base model)
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE = "unsloth/Qwen3.6-35B-A3B"
ADAPTER = "njmason/caveman-qwen3.6"

# Load the base model in 4-bit with bf16 compute.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16
)

# Attach the LoRA adapter on top of the quantized base.
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

prompt = "How do I run a Docker container?"
msgs = [{"role": "user", "content": prompt}]
# enable_thinking=False leaves the <think> block out of the rendered prompt.
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True, enable_thinking=False)
ids = tok(text, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**ids, max_new_tokens=200, do_sample=False)  # greedy decoding

# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][ids.input_ids.shape[1]:], skip_special_tokens=True).strip())
Tip:
enable_thinking=False disables the <think>...</think> block that Qwen3.6 emits by default. With thinking disabled, you get the terse-only behavior immediately. With thinking enabled, the model still reasons internally, which is useful for harder problems.
With llama.cpp (after merging adapter into base)
For deployment via llama.cpp (CPU, Metal, Vulkan, CUDA), merge the adapter into the base model and convert to GGUF. See the Axolotl merge-lora docs and llama.cpp's convert_hf_to_gguf.py.
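As an alternative to the Axolotl merge command, the merge can also be done directly with PEFT. The sketch below is a minimal outline, not a tested recipe for this model: `merge_adapter` and the output directory are hypothetical names, and it assumes enough CPU RAM to hold the full base model in bf16.

```python
def merge_adapter(base_id: str, adapter_id: str, out_dir: str) -> None:
    """Merge a LoRA adapter into its base model and save a standalone checkpoint."""
    # Imported lazily so this function can be defined without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base in bf16 rather than 4-bit so the LoRA deltas are folded
    # into full-precision weights.
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="cpu"
    )
    model = PeftModel.from_pretrained(base, adapter_id)

    # merge_and_unload() adds the scaled LoRA update to each target weight
    # and returns the plain base-model class without adapter wrappers.
    merged = model.merge_and_unload()

    merged.save_pretrained(out_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
```

A call such as `merge_adapter("unsloth/Qwen3.6-35B-A3B", "njmason/caveman-qwen3.6", "./caveman-merged")` would write a checkpoint that `convert_hf_to_gguf.py` can take as input, for example with `--outfile` and `--outtype bf16`. Whether the converter supports this architecture depends on the llama.cpp version and should be verified.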
Limitations & Caveats
- MoE adapter is attention-only. The expert FFN weights were not adapted: an earlier attempt at a full attention+expert adapter on this model crashed in Unsloth's MoE LoRA path, so this run used Axolotl's ScatterMoE backend with attention-only targets. Brevity behavior emerged anyway from attention-level adaptation, but a full attention+expert adapter might improve quality.
- Trained on a small synthetic dataset — 1,500 pairs. May not generalize perfectly to all domains.
- Extreme brevity may omit context that some use cases require (tutorials, education, compliance documentation).
- Not suited for tasks where verbose explanation is desirable — pedagogical content, creative writing, analysis essays.
- Smoke-tested on 5 prompts only. No formal benchmark suite (MMLU, HumanEval, etc.) was run on the trained model. Production users should evaluate on their own task distribution.
- Vision capability untested. The base model is multimodal (Qwen3.6 VL); fine-tuning was text-only and the vision pathway was not exercised post-training.
License
Apache-2.0 (matches base model).
Citation
@misc{caveman-qwen3.6,
  author = {Nick Mason},
  title = {caveman-qwen3.6: A brevity-trained QLoRA adapter for Qwen3.6-35B-A3B},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/njmason/caveman-qwen3.6}
}
Inspired by Mintzs/oogaboogalm, itself inspired by JuliusBrussee/caveman.
Axolotl training config
base_model: unsloth/Qwen3.6-35B-A3B
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
  - axolotl.integrations.kernels.KernelsPlugin
  - axolotl.integrations.liger.LigerPlugin
use_kernels: true
use_scattermoe: true
liger_layer_norm: true
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_rms_norm_gated: true
torch_compile: false
chat_template: qwen3_5
datasets:
  - path: pairs_flat.jsonl
    type: chat_template
val_set_size: 0
output_dir: ./outputs/qwen36-caveman-lora
dataset_prepared_path: last_run_prepared
sequence_len: 2048
sample_packing: true
load_in_4bit: true
quantize_moe_experts: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
lora_qkv_kernel: true
lora_o_kernel: true
lora_mlp_kernel: false
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: auto
tf32: true
gradient_checkpointing: true
activation_offloading: true
logging_steps: 10
save_strategy: epoch
save_total_limit: 2
flash_attention: true
warmup_ratio: 0.03
weight_decay: 0.01
Framework versions
- PEFT 0.19.1
- Transformers 5.5.4
- PyTorch 2.10.0+cu128
- Datasets 4.8.5
- Tokenizers 0.22.2
- Axolotl 0.16.2.dev0