Release Notes
MindSpore Transformers 1.9.0 Release Notes
The following is the changelog for MindSpore Transformers 1.9.0 compared with 1.8.0, including key new features and bug fixes.
New Features
Training: Supports forward inference during training; when pipeline parallelism is enabled for training jobs, parameter loading information for the corresponding rank can be printed.
Model support: Added inference and pre-training for TeleChat3-36B; added pre-training for TeleChat3-105B-A4.7B.
Performance monitoring: Extended the Profile performance monitoring module with timing tracking for the cluster’s first startup phase.
Checkpoint solution: Adapted Checkpoint 2.0 for fast recovery from failures; optimized Hugging Face weight loading performance[1].
PyNative Capability (Experimental): Supports launching the training process via Trainer; supports the construction of Qwen3 dense models.
New Models
Newly supported models:
| Model | Variants |
|---|---|
| TeleChat3 | TeleChat3-36B (pre-training, inference), TeleChat3-105B-A4.7B (pre-training) |
Bug Fixes
During this release cycle we fixed issues across models, features, usability, documentation, and more. Key fixes include:
!8006: Fixed incorrect TFLOPs printing for MoE models.
!7874: Fixed pad_token_id not taking effect in MCore networks.
!7818: Fixed hostname retrieval failures in some environments.
!7630: Fixed safetensors weight conversion and loading when changing parallel strategies.
!7620: Fixed accuracy issues caused by communication for VocabEmbedding under certain configurations.
Change Notes
This section lists changes in this release to historically deprecated models, code, and materials:
| Change | Description |
|---|---|
| None | No change notes for this version |
Contributors
Thanks to everyone who contributed during this release cycle:
@lanshaozuishuai, @zyw-hw, @smallsilly, @wei_zhuoyi, @yule100, @zzzkeke, @sunyu-xuan, @alpha-junh, @zhangyihuiben, @jimmyisme1, @yiyison, @huangjingwei, @chenrayray, @Sunshine_Youngster, @suhaibo, @minghu111, @senzhen-town, @limuan, @husichao, @xiaoqi-zhou, @silkage_jiajia, @hss-shuai, @pengjingyou, @wjlflyer, @shen_haochen, @wujinyuan1, @yyyyrf, @Somnus2020, @renyujin, @qsc97, @yinanf, @hangangqiang, @lzy0920232
Contributions in any form are welcome!
[1] Experimental tests show that loading time for a hundred-billion-parameter model on a hundred-NPU cluster has been reduced by 80%.