Instructions to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Jackrong/Qwopus3.6-35B-A3B-v1-GGUF")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Jackrong/Qwopus3.6-35B-A3B-v1-GGUF", dtype="auto")

llama-cpp-python

How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Jackrong/Qwopus3.6-35B-A3B-v1-GGUF",
	filename="Qwopus3.6-35B-A3B-v1-IQ4_XS.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M

LM Studio
Jan

vLLM

How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Jackrong/Qwopus3.6-35B-A3B-v1-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jackrong/Qwopus3.6-35B-A3B-v1-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
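The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are hypothetical, and the base URL assumes the vLLM server from the command above is listening on port 8000.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict) -> dict:
    """POST the payload to <base_url>/chat/completions and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `chat("http://localhost:8000/v1", build_chat_payload(...))` should return an OpenAI-style response whose text is typically found under `["choices"][0]["message"]["content"]`.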

Use Docker

docker model run hf.co/Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M

SGLang

How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Jackrong/Qwopus3.6-35B-A3B-v1-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jackrong/Qwopus3.6-35B-A3B-v1-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Jackrong/Qwopus3.6-35B-A3B-v1-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jackrong/Qwopus3.6-35B-A3B-v1-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with Ollama:
```
ollama run hf.co/Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M
```

Unsloth Studio

How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Jackrong/Qwopus3.6-35B-A3B-v1-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Jackrong/Qwopus3.6-35B-A3B-v1-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Jackrong/Qwopus3.6-35B-A3B-v1-GGUF to start chatting

Pi

How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with Docker Model Runner:
```
docker model run hf.co/Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M
```

Lemonade

How to use Jackrong/Qwopus3.6-35B-A3B-v1-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Jackrong/Qwopus3.6-35B-A3B-v1-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwopus3.6-35B-A3B-v1-GGUF-Q4_K_M

List all available models

lemonade list

🌟 Qwopus3.6-35B-A3B-v1

💡 Base Model Overview

Qwen3.6-35B-A3B is an advanced hybrid sparse MoE (Mixture-of-Experts) model developed by Alibaba Cloud. It features 35B total parameters with only 3B active parameters per token, ensuring high inference efficiency. Architecturally, it combines Gated DeltaNet linear attention with standard gated attention layers, routing tokens across 256 experts. It natively supports a massive 262k context window and is specifically designed for high-performance agentic coding, deep reasoning, and multimodal tasks.

🚀 Model Refinement & Logic Tuning (Qwopus3.6-35B-A3B-v1)

🪐 Qwopus3.6-35B-A3B-v1 is a reasoning-enhanced MoE (Mixture of Experts) model fine-tuned on top of Qwen3.6-35B-A3B.

🛠 Training Strategy

The fine-tuning process for this model is structured into three distinct stages of distributed SFT (Supervised Fine-Tuning), progressively scaling reasoning complexity and data diversity. This systematic approach ensures the model inherits the base MoE capabilities while sharpening its logic-handling depth.

Looking ahead, Reinforcement Learning (RL) training will be introduced in subsequent versions to further optimize the reasoning paths and alignment performance.

This version uses LoRA fine-tuning but scales up the trainable parameters well beyond a typical LoRA setup, with approximately 9% of the model parameters participating in the update. This allows for a deeper adaptation of reasoning capabilities while retaining the efficiency of parameter-efficient fine-tuning. However, a 9% trainable share is a risky configuration for this MoE architecture, as it significantly increases the potential for training instability and weight-merging conflicts.
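To give a feel for how a trainable share is computed, here is a minimal sketch of the standard LoRA parameter count for a single linear layer. The layer shape and rank below are hypothetical illustrations, not the configuration used for this model, which the card does not specify.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(trainable: int, total: int) -> float:
    """Share of trainable parameters relative to the frozen weights they adapt."""
    return trainable / total


# Hypothetical example: a 2048 x 2048 projection adapted with rank 64.
adapter_params = lora_param_count(2048, 2048, 64)
layer_params = 2048 * 2048
print(adapter_params, trainable_fraction(adapter_params, layer_params))
```

Because the adapter size grows linearly with rank, reaching a share near 9% across a whole model generally means high ranks, many adapted modules, or both.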

Vision & Tool Calling Support: This model supports visual capabilities and tool calling. To enable vision, please place the mmproj.gguf file from the GGUF repository into the same directory as the main .gguf file.
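Before launching a runtime, it can help to confirm that the projector file really sits next to the model. The sketch below is a hypothetical helper, not part of any runtime; it assumes the projector file name starts with `mmproj` and ends with `.gguf`, which covers the `mmproj.gguf` name mentioned above.

```python
from pathlib import Path
from typing import Optional


def find_mmproj(model_path: str) -> Optional[Path]:
    """Return the first mmproj*.gguf file located next to the model file, or None."""
    model_dir = Path(model_path).resolve().parent
    candidates = sorted(model_dir.glob("mmproj*.gguf"))
    return candidates[0] if candidates else None
```

If `find_mmproj("models/Qwopus3.6-35B-A3B-v1-IQ4_XS.gguf")` returns `None`, the projector is not in the model's directory and vision input is unlikely to work.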

It is designed for:

🧩 More structured reasoning
🪶 More consistent answer style
🔁 Better cross-source distillation alignment
⚡ A stronger foundation for later larger-scale versions

Community Release Notice: Qwopus3.6-35B-A3B-v1 has not undergone complete performance evaluation or safety testing. It is released purely as an experimental community version for research and exploration.

🧪 Independent Benchmark Results

Benchmark Comparison

Model	Overall	Speed	Quality	Reliability (%)	Tokens/s
🏆 Jackrong/Qwopus3.6-35B-A3B-v1	88.6	69.3	94.2	91.7	44
hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled	82.7	69.2	86.0	86.1	44
GestaltLabs/Qwen3.6-35B-A3B-NSC-ACE-SABER	65.2	69.6	61.7	69.4	45
AtomicChat/Qwen3.6-27B-UDT-MTP	65.1	38.9	70.2	75.0	8
AtomicChat/Qwen3.6-35B-A3B-UDT-MTP	49.4	—	62.9	59.3	—

🚀 Qwopus3.6-35B-A3B-v1 demonstrates leading performance in this evaluation, particularly excelling in overall quality and reliability, while maintaining strong inference speed on consumer hardware.

Benchmark source: Independent test by Tekholms.aptm (@adsilva264). Results reflect quantized GGUF performance under consistent testing conditions.

SWE testing is currently underway, and results will be available soon!

🧪 Data Composition & Context Length Mix

The model was trained on a carefully curated dataset encompassing a wide range of domains, including mathematics, code, science, multilingual chat, and instruction following.

To balance different capabilities, the training data is divided into four main context-length buckets, incorporating a mix of:

Short, format-stable samples
Medium-complexity reasoning samples
Long-context, high-quality samples
A small amount of replay samples

Context Length Distribution:

< 4096 tokens: Short-context data focused on establishing stable formatting and basic reasoning.
4096 - 8192 tokens: Medium-context data introducing higher reasoning complexity.
8192 - 16384 tokens: Long-context reasoning data, which also includes 10% short sample replay to prevent catastrophic forgetting of basic instruction-following.
16384 - 32768 tokens: A small amount of multi-turn conversations to maintain extended interaction capabilities.
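The bucketing above can be sketched as a small classifier over token counts. This is an illustration only: the card does not say which side of each boundary is inclusive, so the sketch assumes lower bounds are inclusive and that a sample of exactly 32768 tokens falls in the last bucket.

```python
from collections import Counter

MAX_TOKENS = 32768

# (exclusive upper bound, label); boundary handling is an assumption.
BUCKETS = [
    (4096, "< 4096"),
    (8192, "4096 - 8192"),
    (16384, "8192 - 16384"),
    (MAX_TOKENS, "16384 - 32768"),
]


def context_bucket(num_tokens: int) -> str:
    """Map a sample's token count to one of the four context-length buckets."""
    if num_tokens < 0 or num_tokens > MAX_TOKENS:
        raise ValueError(f"token count {num_tokens} is outside 0..{MAX_TOKENS}")
    for upper, label in BUCKETS:
        if num_tokens < upper:
            return label
    return BUCKETS[-1][1]  # exactly MAX_TOKENS


def bucket_histogram(token_counts) -> Counter:
    """Count how many samples land in each bucket."""
    return Counter(context_bucket(n) for n in token_counts)
```

A histogram such as `bucket_histogram([10, 5000, 5001, 20000])` makes it easy to check whether a dataset's length mix matches the intended distribution.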

🎯 Three-Stage Curriculum Learning

Qwopus3.6-35B-A3B-v1 employs a phased, curriculum-learning-style reasoning data mix that progressively increases the difficulty and complexity of the training signals:

Early Stage (Format Establishment): Focuses on short-to-medium length, format-stable reasoning samples. The primary goal here is to establish a reliable, structured new reasoning format without overwhelming the model with extreme complexity.
Middle Stage (Complexity Scaling & Multi-Teacher Distillation): Gradually increases the proportion of complex reasoning samples from multiple teacher models.
- Distillation data sourced from a 27B model that closely matches the base model's stylistic distribution, ensuring the capability gap isn't too drastic to learn efficiently.
Final Stage (Long-Context Reinforcement & Anti-Drift): Strengthens long-context reasoning capabilities. Crucially, this stage retains short sample replay to ensure the model maintains its short-context instruction-following ability and minimizes capability drift.
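The short-sample replay used in the final stage can be sketched as follows. This is not the author's data pipeline: the function name, the sample-level (rather than token-level) interpretation of the replay share, and the fixed seed are all assumptions made for illustration. The 10% default mirrors the replay figure given in the context length distribution.

```python
import random


def mix_with_replay(primary, replay_pool, replay_fraction=0.10, seed=0):
    """Return primary samples plus replay samples making up replay_fraction of the result."""
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    # Solve r / (len(primary) + r) = replay_fraction for the replay count r.
    n_replay = round(len(primary) * replay_fraction / (1 - replay_fraction))
    n_replay = min(n_replay, len(replay_pool))  # cannot draw more than the pool holds
    rng = random.Random(seed)
    mixed = list(primary) + rng.sample(list(replay_pool), n_replay)
    rng.shuffle(mixed)
    return mixed
```

For example, mixing 90 long-context samples with a pool of short samples at the default fraction yields 100 samples, 10 of which are replayed short ones.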

🚀 Context Length and Long-Context Usage

During fine-tuning, this model was trained with a maximum sequence length of 32K tokens. The training data mixture was also constructed around samples up to 32K tokens, so the "Context Length Distribution" shown in this model card reflects the fine-tuning data distribution rather than a hard architectural limit.

The model still inherits the native long-context capability of the Qwen3.6 base model. Therefore, longer context windows such as 128K or 256K may be available in compatible inference runtimes, depending on the backend and configuration.

For practical long-context inference beyond 32K, especially when using llama.cpp / GGUF, it is recommended to enable RoPE/YaRN scaling instead of only increasing n_ctx / --ctx-size. Directly setting a larger context window without RoPE scaling may work in some cases, but it can be less stable and may not achieve the expected long-context performance.

This is consistent with Qwen community guidance for long-context GGUF usage: 128K context generally requires YaRN/RoPE scaling, and it is not necessarily enabled by default in llama.cpp. For example, Qwen maintainers have noted that "128K context length needs YaRN" and that it should be explicitly enabled when supported by the runtime.
Reference: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct-GGUF/discussions/2

Community feedback also suggests that RoPE/YaRN scaling can improve long-context stability for this model family. One user reported that, on HermesAgent-20, Qwopus3.6-35B-A3B-v1 performed better when extending from 32K to 128K via RoPE scaling than when directly setting a 128K context window without scaling, with scores of 83 vs. 72 in their setup. This result may vary depending on the backend, quantization type, KV cache settings, hardware, and benchmark configuration, but it is consistent with the recommendation to use RoPE/YaRN scaling for contexts beyond 32K.

Example llama.cpp configuration for extending from 32K to 128K:

./llama-server \
  -m model.gguf \
  --ctx-size 131072 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 32768

For 256K context, users may need to adjust the scaling factor accordingly and validate the result in their own workload:

./llama-server \
  -m model.gguf \
  --ctx-size 262144 \
  --rope-scaling yarn \
  --rope-scale 8 \
  --yarn-orig-ctx 32768

Please note that long-context behavior may vary depending on the inference backend, quantization type, KV cache settings, available memory, and task type. For best results, users should benchmark their own target workload when using contexts beyond 32K.
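The scaling factor in both examples is simply the target context divided by the 32K fine-tuning length. A minimal sketch that derives it and assembles the same `llama-server` arguments is shown below; the function name is hypothetical, and the flags are the ones used in the examples above.

```python
def yarn_server_args(model_path: str, target_ctx: int, orig_ctx: int = 32768) -> list:
    """Build llama-server arguments, adding YaRN scaling only beyond orig_ctx."""
    args = ["./llama-server", "-m", model_path, "--ctx-size", str(target_ctx)]
    if target_ctx > orig_ctx:
        scale = target_ctx / orig_ctx  # e.g. 131072 / 32768 = 4
        args += [
            "--rope-scaling", "yarn",
            "--rope-scale", f"{scale:g}",
            "--yarn-orig-ctx", str(orig_ctx),
        ]
    return args


print(" ".join(yarn_server_args("model.gguf", 131072)))
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting issues when the model path contains spaces.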

🚀 Quick Evaluation Summary: Qwopus3.6-35B-A3B-v1

This model represents a significant leap in inference efficiency and one-shot generation quality compared to previous dense architectures. By leveraging a Hybrid MoE structure (35B total / 3B active parameters) and Gated DeltaNet linear attention, it balances high throughput with deep reasoning capabilities.

Unmatched Speed: Achieves an average of 161.9 tok/s on an RTX 5090—a 2.6× speedup over the 27B dense predecessor—making it one of the fastest high-parameter models available for single-GPU consumer hardware.
Production-Grade Frontend Design: Evaluated as one of the strongest open models for one-shot HTML/CSS generation. Unlike models that provide surface-level scaffolding, this model delivers complete, functional pages with complex micro-interactions, animated components, and production-ready logic.
Starvation-Free Reasoning: Successfully resolves the "thinking starvation" issues seen in earlier versions. It maintains robust performance in long-context JSON extraction and multi-step agentic planning, outputting valid structured data even after extensive internal reasoning traces.
Architectural Efficiency: The integration of Gated DeltaNet allows for a massive 262K native context window with optimized VRAM usage, keeping memory requirements nearly flat even as sequence lengths increase.

Verdict: A premier choice for developers requiring a high-throughput, agentic model that excels at UI/UX generation and complex logical deduction on a single-GPU setup.

The summary above is based on the Qwopus3.6-35B-A3B-v1 comprehensive evaluation report by Kyle Hessling.

⚠️ Known Training & Deployment Issues

Due to the architectural complexities of the Qwen3.6 MoE models, several technical challenges were encountered during training and weight merging. Users should be aware of the following potential instabilities:

MoE Architecture Compatibility Issues

The weight structure of MoE expert layers differs significantly from standard dense models.

There are known, easily triggered incompatibilities between PEFT/LoRA, Transformers 5.x's fused expert pattern, and Unsloth patches.

Even when using the absolute latest environment and dependencies, merging the LoRA weights into the base model after training may fail or encounter severe compatibility bugs.

Common Error: You may encounter ModuleNotFoundError: Could not import module 'Qwen3_5MoeForConditionalGeneration' or similar structural mismatch errors during the weight merging phase.

If you are attempting to fine-tune or merge weights for this MoE architecture locally, proceed with caution and be prepared to manually patch model definition files or downgrade specific library versions.
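A quick pre-flight check can reveal whether the installed environment exposes the model class before a long merge job is started. The helper below is a hypothetical sketch; the class name is taken from the error message above, and the assumption that it would be exposed at the top level of the `transformers` package is not confirmed by this card.

```python
import importlib


def class_is_importable(module_name: str, class_name: str) -> bool:
    """Return True if module_name imports cleanly and exposes class_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    try:
        # Lazy-loading packages may raise errors other than AttributeError here.
        getattr(module, class_name)
    except Exception:
        return False
    return True


# Assumed usage: check the class named in the error message before merging.
print(class_is_importable("transformers", "Qwen3_5MoeForConditionalGeneration"))
```

If this prints `False`, the merge step is likely to hit the import error described above, and adjusting library versions first may save time.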

📚 Resources & Guides

👉 GitHub Repository: Jackrong-llm-finetuning-guide. Visit the repo to dive into the codebase and reproduce the results locally or on Colab.

🙏 Acknowledgements

Special thanks to:

The Qwen team for the strong Qwen3.6 MoE base model.
Unsloth for efficient fine-tuning frameworks.
Open-source datasets and community contributors.
Kyle Hessling for his generous hardware and equipment support. You can follow him for more updates on X / Twitter: @KyleHessling1.

📖 Citation

@misc{jackrong_qwopus36_35b_a3b_v1,
  title        = {Qwopus3.6-35B-A3B-v1},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face}
}

Downloads last month: 418,603

GGUF

Model size: 35B params

Architecture: qwen35moe

Hardware compatibility: 3-bit, 4-bit, 5-bit, 6-bit, 8-bit

Model tree for Jackrong/Qwopus3.6-35B-A3B-v1-GGUF

Base model

Qwen/Qwen3.6-35B-A3B

Finetuned

unsloth/Qwen3.6-35B-A3B

Adapter

(6)

this model

Finetunes

1 model
