Qwen 3 Next

Qwen3-Next represents the next-generation foundation models optimized for extreme context length and large-scale parameter efficiency. The series introduces architectural innovations including Hybrid Attention (Gated DeltaNet + Gated Attention), High-Sparsity MoE with 1:50 activation ratio, and Multi-Token Prediction for enhanced performance and inference acceleration.

This guide shows how to fine-tune it with Axolotl with multi-turn conversations and proper masking.

Getting started

  1. Install Axolotl following the installation guide.

  2. Install Cut Cross Entropy to reduce training VRAM usage.

  3. Install FLA for improved performance

pip3 uninstall -y causal-conv1d && pip3 install flash-linear-attention==0.4.1
  1. Run the finetuning example:
axolotl train examples/qwen3-next/qwen3-next-80b-a3b-qlora.yaml

This config uses about ~47 GiB (no target experts) and ~71GiB (target experts) VRAM.

Let us know how it goes. Happy finetuning! 🚀

TIPS

  • For inference, you can experiment with temperature: 0.7, top_p: 0.8, top_k: 20, and min_p: 0.
  • You can run a full finetuning by removing the adapter: qlora and load_in_4bit: true from the config. See Multi-GPU section below.
  • Read more on how to load your own dataset at docs.
  • The dataset format follows the OpenAI Messages format as seen here.

Optimization Guides