Practical Case: Converting Model Weights to Megatron Model Weights
This case describes how to convert model weights saved by MindSpore Transformers (in Safetensors format) to the format used by the Megatron-LM library, which facilitates accuracy comparison or migration training. The converted Megatron-LM weights are of the BF16 type.
Environment Preparations
Preparing Code
Clone the Megatron-LM code repository and switch to the core_r0.12.0 branch.
git clone https://github.com/NVIDIA/Megatron-LM.git -b core_r0.12.0
Copy the conversion script to the Megatron-LM/tools/checkpoint/ directory.
Model Weight Preparations
The weights to be converted are the Safetensors weights saved by MindSpore Transformers.
Currently, only the weights of GPT-like models (such as GPT and Qwen) composed of SelfAttention and MLP can be converted. MLA and MoE are not supported.
Only complete weights that have not been split for distributed training are supported. If the weights are distributed, merge them first by referring to Weight Merging.
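Before converting, you can quickly confirm that the checkpoint is a single complete Safetensors file by listing the tensors it contains. The following is a minimal sketch using the safetensors Python package; the file path is a placeholder.

from safetensors import safe_open

# Placeholder path to the (merged) Safetensors checkpoint saved by MindSpore Transformers.
ckpt_file = "path_to_ms_ckpt/model.safetensors"

# List every tensor name and shape without loading the full weights into memory.
with safe_open(ckpt_file, framework="np") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())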
Weight Conversion Procedure
Go to the Megatron-LM directory.
cd Megatron-LM
Run the weight conversion command, replacing the paths and parameters with your actual values.
TARGET_TP_SIZE=2   # Target tensor parallel (TP) size
TARGET_PP_SIZE=2   # Target pipeline parallel (PP) size

python ./tools/checkpoint/convert.py \
    --megatron-path 'path_to_megatron' \
    --model-type GPT \
    --loader core_mf \
    --saver core \
    --target-tensor-parallel-size ${TARGET_TP_SIZE} \
    --target-pipeline-parallel-size ${TARGET_PP_SIZE} \
    --load-dir "path_to_ms_ckpt" \
    --save-dir "path_to_megatron_ckpt" \
    --loader-transformer-impl local \
    --saver-transformer-impl local \
    --position-embedding-type "rope" \
    --true-vocab-size 128000 \
    --padded-vocab-size 128000 \
    --num-layers 32 \
    --seq-length 2048 \
    --hidden-size 4096 \
    --ffn-hidden-size 16384 \
    --num-attention-heads 32 \
    --num-query-groups 16 \
    --normalization "RMSNorm" \
    --add-bias-linear \
    --swiglu
Parameters
Name | Required | Default Value | Description
--megatron-path | Yes | None | Root directory of the Megatron-LM repository.
--model-type | Yes | None | Model type (for example, GPT).
--loader | Yes | None | Loader type (core_mf in this example).
--saver | Yes | None | Saver type (for example, core).
--target-tensor-parallel-size | Yes | None | Target tensor parallel (TP) size.
--target-pipeline-parallel-size | Yes | None | Target pipeline parallel (PP) size.
--load-dir | Yes | None | Path to the Safetensors weights exported from MindSpore Transformers; it can be a single file or a directory containing one.
--save-dir | Yes | None | Output directory for the converted Megatron weights.
--loader-transformer-impl | No | transformer_engine | Transformer implementation used by the loader: local (for precision comparison) or transformer_engine.
--saver-transformer-impl | No | transformer_engine | Transformer implementation used by the saver: local (for precision comparison) or transformer_engine.
--position-embedding-type | No | learned_absolute | Position embedding type: learned_absolute or rope.
--true-vocab-size | No | None | Actual vocabulary size of the model. If specified, the padding of the embedding table is removed.
--padded-vocab-size | No | 128000 | Vocabulary size after padding. In MindSpore Transformers, this is usually the same as the actual vocabulary size.
--num-layers | No | 512 | Number of transformer layers.
--seq-length | No | 2048 | Maximum sequence length.
--hidden-size | No | 512 | Hidden layer dimension.
--ffn-hidden-size | No | 128 | Hidden dimension of the feedforward network.
--num-attention-heads | No | 64 | Number of attention heads.
--num-query-groups | No | None | Number of query groups (for grouped-query attention).
--normalization | No | RMSNorm | Normalization type.
--add-bias-linear | No | False | Whether to add bias to linear layers. Boolean flag; include it in the command to enable.
--swiglu | No | False | Whether to use the SwiGLU activation. Boolean flag; include it in the command to enable.
--ms2torch-ckpt-path | No | ./ms2pt_checkpoint | Path where the intermediate weights produced during conversion are saved.
After the conversion completes successfully, confirm that the intermediate weights have been saved to the location specified by --ms2torch-ckpt-path (./ms2pt_checkpoint by default); the converted Megatron weights are written to the directory specified by --save-dir.
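As a quick sanity check after conversion, you can load one of the converted checkpoint files and confirm that the tensors are stored as BF16. The following is a minimal sketch, not part of the conversion tooling; the checkpoint file path is a placeholder, and the exact layout under the output directory (for example, iter_0000001/mp_rank_00/model_optim_rng.pt) depends on the Megatron-LM version and the target TP/PP sizes, so adjust the path to what the converter actually produced.

import torch

# Placeholder path; adjust to the layout actually produced under --save-dir.
ckpt_file = "path_to_megatron_ckpt/iter_0000001/mp_rank_00/model_optim_rng.pt"

# weights_only=False because Megatron checkpoints contain non-tensor objects (e.g., args).
state = torch.load(ckpt_file, map_location="cpu", weights_only=False)

# The model weights are usually stored under the "model" key; adjust if your version differs.
model_state = state.get("model", state)
for name, value in list(model_state.items())[:10]:
    if torch.is_tensor(value):
        print(name, tuple(value.shape), value.dtype)  # expect torch.bfloat16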
FAQ
Q: What can I do if an error is reported when Megatron loads converted weights?
A: Ensure that all model structure parameters (such as the number of layers, hidden layer dimension, and vocabulary size) are the same as those of the original model.
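One practical way to do this check, assuming the converted checkpoint stores the conversion arguments under an "args" entry (as Megatron-LM checkpoints typically do), is to print those arguments and compare them with your training configuration. A minimal sketch with a placeholder path:

import torch

ckpt_file = "path_to_megatron_ckpt/iter_0000001/mp_rank_00/model_optim_rng.pt"  # placeholder
state = torch.load(ckpt_file, map_location="cpu", weights_only=False)

# Compare these values with the ones used by your Megatron training configuration.
args = state.get("args")
if args is not None:
    for field in ("num_layers", "hidden_size", "ffn_hidden_size",
                  "num_attention_heads", "num_query_groups", "padded_vocab_size"):
        print(field, getattr(args, field, None))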
Q: Are MoE and other structures supported?
A: Currently, only the standard SelfAttention+MLP structure is supported.
Q: Are distributed weights supported?
A: No. You need to merge the weights first.