Release Notes

MindSpore Transformers 1.9.0 Release Notes

The following is the changelog for MindSpore Transformers 1.9.0 compared with 1.8.0, including key new features and bug fixes.

New Features

  • Training: Supports forward inference during training; when pipeline parallelism is enabled for training jobs, parameter loading information for the corresponding rank can be printed.

  • Model support: Added inference and pre-training support for TeleChat3-36B; added pre-training support for TeleChat3-105B-A4.7B.

  • Performance monitoring: Extended the Profile performance monitoring module with timing tracking for the cluster’s first startup phase.

  • Checkpoint solution: Adapted Checkpoint 2.0 for fast recovery from failures; optimized Hugging Face weight loading performance[1].

  • PyNative Capability (Experimental): Supports launching the training process via Trainer; supports the construction of Qwen3 dense models.

New Models

Newly supported models:

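| Model     | Variants                                                                      |
|-----------|-------------------------------------------------------------------------------|
| TeleChat3 | TeleChat3-36B (pre-training, inference), TeleChat3-105B-A4.7B (pre-training) |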

Bug Fixes

During this release cycle we fixed issues across models, features, usability, documentation, and more. Key fixes include:

  • !8006: Fixed incorrect TFLOPs printing for MoE models.

  • !7874: Fixed pad_token_id not taking effect in MCore networks.

  • !7818: Fixed hostname retrieval failures in some environments.

  • !7793, !7713: Fixed Hugging Face dataset-related issues.

  • !7630: Fixed safetensors weight conversion and loading when changing parallel strategies.

  • !7620: Fixed accuracy issues caused by communication for VocabEmbedding under certain configurations.
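The notes do not say what was wrong with the TFLOPs output in !8006. As general background on why throughput accounting for MoE models is easy to get wrong, the sketch below uses the common "6 × parameters × tokens" rule of thumb for training FLOPs and compares counting all parameters against counting only the parameters activated per token. This is not the library's formula: the function name is hypothetical, the token count and step time are made-up values, and reading "105B-A4.7B" as 105B total and 4.7B activated parameters is an assumption based on the usual naming convention.

```python
def approx_training_tflops(params: float, tokens_per_step: float, step_time_s: float) -> float:
    """Rough aggregate TFLOP/s from the 6 * params * tokens rule of thumb.

    Hypothetical helper for illustration only; it ignores attention FLOPs,
    recomputation, and other terms that a real profiler would include.
    """
    if step_time_s <= 0:
        raise ValueError("step_time_s must be positive")
    flops_per_step = 6.0 * params * tokens_per_step
    return flops_per_step / step_time_s / 1e12


# Assumed reading of the model name: 105B total, 4.7B activated per token.
TOTAL_PARAMS = 105e9
ACTIVE_PARAMS = 4.7e9

# Made-up workload numbers, chosen only to make the comparison concrete.
TOKENS_PER_STEP = 4_000_000
STEP_TIME_S = 10.0

naive_tflops = approx_training_tflops(TOTAL_PARAMS, TOKENS_PER_STEP, STEP_TIME_S)
active_tflops = approx_training_tflops(ACTIVE_PARAMS, TOKENS_PER_STEP, STEP_TIME_S)

print(f"counting all parameters:       {naive_tflops:,.0f} TFLOP/s")
print(f"counting activated parameters: {active_tflops:,.0f} TFLOP/s")
print(f"overestimate factor:           {naive_tflops / active_tflops:.1f}x")
```

Under these assumptions, the two counts differ by the ratio of total to activated parameters, which is why a dense-style estimate can badly overstate MoE throughput.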

Change Notes

Changes to previously deprecated models, code, and materials in this release are listed below:

| Change | Description                      |
|--------|----------------------------------|
| None   | No change notes for this version |

Contributors

Thanks to everyone who contributed during this release cycle:

@lanshaozuishuai, @zyw-hw, @smallsilly, @wei_zhuoyi, @yule100, @zzzkeke, @sunyu-xuan, @alpha-junh, @zhangyihuiben, @jimmyisme1, @yiyison, @huangjingwei, @chenrayray, @Sunshine_Youngster, @suhaibo, @minghu111, @senzhen-town, @limuan, @husichao, @xiaoqi-zhou, @silkage_jiajia, @hss-shuai, @pengjingyou, @wjlflyer, @shen_haochen, @wujinyuan1, @yyyyrf, @Somnus2020, @renyujin, @qsc97, @yinanf, @hangangqiang, @lzy0920232

Contributions in any form are welcome!

[1] Experimental tests show that weight loading time for a hundred-billion-parameter model on a hundred-NPU cluster was reduced by 80%.
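To put the 80% figure in perspective, a reduction in loading time can be converted into a speedup factor. The sketch below does that arithmetic; the helper names and the 600-second baseline are hypothetical and are not measurements from this release.

```python
def speedup_from_reduction(reduction_percent: float) -> float:
    """Convert a percentage reduction in time into a speedup factor."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in [0, 100)")
    return 100.0 / (100.0 - reduction_percent)


def remaining_time(baseline_s: float, reduction_percent: float) -> float:
    """Time left after reducing a baseline duration by the given percentage."""
    return baseline_s * (100.0 - reduction_percent) / 100.0


# Hypothetical baseline: 600 seconds to load weights before the optimization.
baseline_s = 600.0
print(f"speedup factor: {speedup_from_reduction(80):.1f}x")
print(f"loading time:   {remaining_time(baseline_s, 80):.0f} s (from {baseline_s:.0f} s)")
```

In other words, an 80% reduction means loading takes one fifth of the original time.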