Qwen3 × ModelScope Toolkit: Faster Training + Comprehensive Evaluation
The Qwen team has recently released the Qwen3 models which, following the style of Qwen2.x, come in two series: Dense models and MoE models. In the open-source versions, the Dense models largely keep the previous architecture, the main difference being an added RMSNorm applied to the Q and K tensors. The MoE models remove the shared experts, while the rest of the structure remains essentially the same. Model sizes range from 0.6B to 32B (Dense) and up to 235B (MoE). For reasoning, the models add an optional "thinking" mode, making them more adaptable and versatile across different scenarios.
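As a concrete picture of the Dense-side change, here is a minimal sketch (not the official Qwen3 modeling code; dimensions are illustrative) of RMSNorm applied per attention head to the Q and K projections:

```python
# Minimal sketch of Q/K RMSNorm: query and key heads are RMS-normalized
# over the head dimension before rotary embedding and attention.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # scale by the inverse root-mean-square over the last dimension
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

hidden, num_heads, head_dim = 1024, 8, 128   # illustrative sizes
q_proj = nn.Linear(hidden, num_heads * head_dim, bias=False)
k_proj = nn.Linear(hidden, num_heads * head_dim, bias=False)
q_norm, k_norm = RMSNorm(head_dim), RMSNorm(head_dim)

x = torch.randn(2, 16, hidden)                             # (batch, seq, hidden)
q = q_norm(q_proj(x).view(2, 16, num_heads, head_dim))     # normalized per head
k = k_norm(k_proj(x).view(2, 16, num_heads, head_dim))
```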
After the model release, developers who need to privatize open-source models or have vertical industry requirements typically need to conduct secondary training (fine-tuning, alignment, etc.), followed by evaluation and deployment. From a training perspective, common requirements are:
Having large amounts of unlabeled industry data requiring Continued Pre-Training (CPT). Base models are generally used for this.
Having large quantities of question-answer pairs requiring Supervised Fine-Tuning (SFT). Base or Instruct models are chosen based on data volume.
Needing models with distinctive response styles or preferences, which requires additional RLHF.
Needing to enhance reasoning (or chain-of-thought) capabilities in a specific domain, typically via distillation, rejection-sampling fine-tuning, or GRPO.
In practical scenarios, multiple training methods are usually combined; for example, CPT is generally followed by SFT or RLVR (such as GRPO with verifiable rewards). Hardware requirements range from a single GPU to multi-machine setups, which makes choosing a training approach difficult.
On the evaluation side, a simple and easy-to-use evaluation method is also in high demand, especially in multi-domain or even multi-modal scenarios, where finding evaluation data and tracking evaluation progress remains a major challenge.
To address these issues, our community has launched a solution for the Qwen3 series that combines SWIFT (training) with EvalScope (evaluation). Notably, training MoE structures has always been a pain point for the open-source community, due to the high cost of secondary training and the complexity of the training process. We have therefore added full support for training Qwen3-MoE with Megatron, which delivers roughly 20% to 1000% faster training compared to the transformers implementation.
In the SWIFT framework, most parameters for Megatron structure training and transformers structure training are identical, allowing developers to flexibly switch between these two training methods with virtually no cost.
| Stage | CMD (Qwen/Qwen3-8B) |
|---|---|
| CPT | `swift pt --model Qwen/Qwen3-8B --dataset xxx` |
| SFT | `swift sft --model Qwen/Qwen3-8B --dataset xxx` |
| Megatron MoE | `megatron sft --model Qwen/Qwen3-8B --dataset xxx` |
| DPO | `swift rlhf --rlhf_type dpo --model Qwen/Qwen3-8B --dataset xxx` |
| GRPO | `swift rlhf --rlhf_type grpo --model Qwen/Qwen3-8B --dataset xxx` |
| Rejection sampling | example |
| Deployment | `swift deploy --model Qwen/Qwen3-8B --infer_backend vllm` |
| Eval | `evalscope eval --model Qwen/Qwen3-8B --datasets xxx` |
Megatron Support
In multi-GPU scenarios, the torch DDP (Distributed Data Parallel) framework is typically used as a foundation, with additional parallel grouping mechanisms layered on top to enable LLM training. Current mainstream training GPUs generally have 24GB, 40GB, or 80GB of memory, with some cards reaching 96GB or 128GB, but this is insufficient for full-parameter training of a 32B model, let alone larger ones. Therefore, beyond DDP, model partitioning mechanisms are usually added so that each GPU only holds a portion of the model shards, using all-gather to collect parameters and reduce-scatter to aggregate gradients onto their owning shards. This is the basic principle behind DeepSpeed ZeRO and FSDP (Fully Sharded Data Parallel).
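To make the sharding idea concrete, here is a minimal, hedged sketch using PyTorch FSDP; the model builder and dataloader are placeholders, not part of SWIFT:

```python
# Minimal ZeRO-3 / FSDP-style training sketch: each rank stores only a shard of the
# parameters, all-gathers them module-by-module during forward/backward, and
# reduce-scatters gradients back to the owning shards.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model().cuda()     # placeholder: any nn.Module, e.g. a causal LM
model = FSDP(model)              # parameters are sharded across all ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for batch in dataloader:         # placeholder DataLoader yielding CUDA tensors
    loss = model(**batch).loss
    loss.backward()              # gradients are reduce-scattered to their shards
    optimizer.step()
    optimizer.zero_grad()
```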
However, for models larger than 32B, or for MoE models, the transformers implementation combined with DeepSpeed incurs heavy inter-GPU communication and executes the experts serially, leading to poor training efficiency.
```python
# The expert loop in the transformers MoE forward pass: experts run one after
# another in a Python for-loop and therefore cannot execute in parallel.
for expert_idx in range(self.num_experts):
    expert_layer = self.experts[expert_idx]
    # find the tokens routed to this expert
    idx, top_x = torch.where(expert_mask[expert_idx])
    current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
    # run the expert and scale by its routing weight
    current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None]
    # scatter the expert output back into the shared result buffer
    final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
```
The serial for-loop over experts in the MoE forward pass in transformers
Megatron comes from NVIDIA's Megatron-LM library. This library generally handles extremely large-scale training, whereas the transformers library is more suited for lightweight training. This is because:
In typical small-scale Dense model training, lightweight methods (LoRA, quantization) offer better cost-effectiveness, while Megatron's complex distributed structure is not well suited to single- or dual-GPU setups
The learning curve for developers is relatively steep, hindering understanding and usage
However, in our tests, even in a single-machine, 8-GPU environment, Dense model training with Megatron achieves roughly a 20% speed improvement over the same model code in transformers, with higher GPU utilization. For MoE models the advantage is even more significant, with speedups of 1000% or more.
Advantages of the Megatron framework include:
Additional optimizations for Attention structures, such as fused kernels, resulting in faster training speeds
Better compatibility with multi-machine training, reasonably sharding models within and across machines to maintain lower communication volume
Additional parallel training support for MoE structures
As mentioned above, training MoE with a serial for-loop fails to leverage the advantages of multiple GPUs. SWIFT has therefore introduced Megatron's parallelization technologies to accelerate large model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Qwen2.5, Llama3, and the DeepSeek-R1 distilled series.
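For intuition on the expert-parallel part, here is a heavily simplified sketch (not Megatron's or SWIFT's actual implementation) of how tokens are exchanged with all-to-all so that each rank's experts run concurrently; token pre-sorting and per-expert routing within a rank are omitted:

```python
# Simplified expert-parallel dispatch: each rank owns a slice of the experts;
# tokens are exchanged across ranks with all-to-all, every rank runs only its
# own experts (in parallel with all other ranks), and results are sent back.
import torch
import torch.distributed as dist

def expert_parallel_forward(tokens, local_experts, send_counts):
    """tokens: hidden states already sorted by destination rank.
    send_counts: 1-D tensor, how many tokens go to each rank.
    local_experts: a module holding only this rank's experts (simplified to one call)."""
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)          # exchange token counts

    # exchange the tokens themselves
    recv_tokens = tokens.new_empty(int(recv_counts.sum()), tokens.size(-1))
    dist.all_to_all_single(recv_tokens, tokens,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())

    out = local_experts(recv_tokens)                          # this rank's experts only

    # return expert outputs to the ranks that own the original tokens
    combined = torch.empty_like(tokens)
    dist.all_to_all_single(combined, out,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())
    return combined
```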
|  | Megatron-LM | DeepSpeed-ZeRO2 | DeepSpeed-ZeRO3 |
|---|---|---|---|
| Train Speed | 9.6 s/it | - | 91.2 s/it |
| GPU Memory | 16 × 60 GiB | OOM | 16 × 80 GiB |

Performance comparison of Qwen3-30B-A3B full-parameter training
RLVR Support
Since the DeepSeek-R1 technical report, the industry has widely recognized that verifiable rewards can be used to train models' reasoning abilities. This approach needs far less data than PRM-based methods, trains faster, and is simpler to implement, which makes RL training practical for small and medium-sized developers with such needs. Common RLVR training algorithms include PPO, GRPO, and DAPO, with GRPO used most frequently because it drops the critic model and estimates the baseline from group sampling rather than fitting a value model, making it simpler and more robust in engineering terms. SWIFT provides support for these RLVR algorithms, and they can be used directly with the latest Qwen3 models.
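As a small illustration of why GRPO can skip the critic, here is a minimal sketch (not SWIFT's implementation) of the group-relative advantage: each rollout's verifiable reward is normalized against the other rollouts for the same prompt:

```python
# Group-relative advantages: rewards come from a verifiable checker (e.g. 1.0 if the
# final answer matches, else 0.0) and are normalized within each prompt's rollout group,
# so no separate critic/value model is required.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) tensor of verifiable rewards."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# eight rollouts for one math prompt, three of which pass the answer check
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))  # passing rollouts get positive advantage, failing ones negative
```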
SWIFT GRPO model placement
Currently, we support two model placement modes:
Side Mode: Actor and Rollout models occupy separate GPUs, allowing vLLM to use all of its GPUs' memory and compute, with support for tensor parallelism.
Colocate Mode: Actor and Rollout models share GPUs. In this mode, vLLM and Actor time-share the GPU through offload/load, which is more friendly to large-scale models.
Currently, SWIFT's GRPO can support training on hundreds of cards (or more) in a cluster.
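For intuition on the colocate mode, here is a purely conceptual sketch of the offload/load time-sharing; the helper functions (`offload_optimizer_state`, `load_weights_from`, `release_gpu_memory`) are hypothetical names, not SWIFT or vLLM APIs:

```python
# Conceptual time-sharing in colocate mode: before each rollout the trainer frees its
# GPU memory, the inference engine generates with the freshest policy weights, and
# afterwards the roles swap back for the optimization step.
def rollout_phase(actor, optimizer, engine, prompts):
    actor.to("cpu")                            # offload the training model
    offload_optimizer_state(optimizer, "cpu")  # hypothetical: move optimizer state off GPU
    engine.load_weights_from(actor)            # hypothetical: sync latest policy into the engine
    completions = engine.generate(prompts)     # rollout with (almost) the whole GPU available
    engine.release_gpu_memory()                # hypothetical: put the engine to sleep
    actor.to("cuda")                           # reload the training model for the update
    offload_optimizer_state(optimizer, "cuda")
    return completions
```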
Sampling and Distillation
Distillation, one of the main methods of knowledge infusion, was also covered in the DeepSeek-R1 technical report. During our own training of Qwen3, we found that directly applying SFT with custom datasets can cause serious knowledge forgetting. Although this problem has accompanied large model training for years, it is particularly evident in the latest models. We therefore expect the focus of future training paradigms to shift from SFT toward reinforcement fine-tuning, which includes on-policy methods such as RLVR as well as off-policy methods such as rejection-sampling fine-tuning and distillation. Rollout data (whether sampled from a larger model or from the model itself) tends to produce much better training quality than manually constructed datasets. For example, in earlier experiments we found that using competition_math for SFT actually caused the competition_math test score to drop by more than 10 points, whereas distillation, MCTS sampling, rejection sampling, and GRPO can improve performance on the corresponding test sets while preserving knowledge in other areas. This can be loosely understood as staying "proximal" to the current policy, even though some of these algorithms do not include an explicit KL-divergence regularization term.
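To show what the off-policy branch looks like in practice, here is a minimal sketch of rejection-sampling fine-tuning (not SWIFT's sampling implementation; `generate_fn` and `verify_fn` are placeholder callables):

```python
# Rejection-sampling fine-tuning data construction: roll out several answers per prompt,
# keep only those a verifier accepts, and use the survivors as ordinary SFT examples.
def build_rejection_sampling_dataset(prompts, generate_fn, verify_fn, num_samples=8):
    """generate_fn(prompt, n) -> list of n completions; verify_fn(prompt, completion) -> bool."""
    kept = []
    for prompt in prompts:
        for completion in generate_fn(prompt, num_samples):
            if verify_fn(prompt, completion):      # e.g. exact-match the final boxed answer
                kept.append({"messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": completion},
                ]})
                break                              # keep at most one accepted sample per prompt
    return kept
```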
Similarly, we have provided support for model sampling and distillation, which can be directly applied to the Qwen3 series models. Examples can be found here and here.
Evaluation Support
To comprehensively assess models' capabilities and understand how performance metrics change before and after training, ModelScope has launched the EvalScope evaluation tool. It provides a unified platform to integrate and manage evaluation processes for various models across different benchmarks, including large language models' coding abilities (LiveCodeBench), mathematical abilities (AIME2024, AIME2025), knowledge (MMLU-Pro, C-Eval), and instruction following (IFEval); multimodal models' visual understanding (ChartQA); and text-to-image models' text-image consistency (GenAI-Bench), among others.
Through EvalScope, we can conveniently perform the following operations:
Automated evaluation processes: reducing manual intervention and improving assessment efficiency
Visualized performance analysis: viewing all evaluation results for comprehensive model analysis
Custom evaluations: easily extending to new assessment tasks or building evaluation dataset collections through simple configurations (see the sketch below)
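As an example of the last point, a small evaluation task can be configured in a few lines. The sketch below assumes EvalScope's documented `TaskConfig`/`run_task` interface and uses illustrative dataset names, which may vary across versions:

```python
# Minimal EvalScope task sketch: evaluate a Qwen3 checkpoint on a couple of benchmarks,
# limiting each dataset to a handful of samples for a quick smoke test.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="Qwen/Qwen3-8B",          # model ID or local path
    datasets=["gsm8k", "ifeval"],   # illustrative benchmark names
    limit=10,                       # evaluate only the first 10 samples per dataset
)
run_task(task_cfg=task_cfg)
```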
In addition, EvalScope integrates stress testing of model-serving inference performance, allowing one-click measurement of metrics such as service throughput and first-token latency. Evaluations of the Qwen3 series models (covering model service inference performance, model capability, and model reasoning efficiency) can be found here.
Conclusion
Many voices in the AI industry predict that AGI may be achieved within a few years. It is also apparent that the models currently attracting the most attention in the Qwen3 series are Qwen3-32B, Qwen3-235B-A22B, and other models of similar size. Developers are increasingly turning to more capable, larger models, and this focus is shaping the application ecosystem, such as the latest technologies and directions in digital humans, Agents, and other fields. We hope that ModelScope's toolkit can keep pace with growing model sizes and make training and evaluation easier, while introducing new open-source training techniques and models. Developers can continue to follow our community (www.modelscope.cn), where we will keep building new model and application capabilities on top of powerful model ecosystems such as Qwen3, DeepSeek, and LLaMA.