Release Notes

MindSpore Transformers 1.8.0 Release Notes

The following outlines the key new features and bug fixes introduced in version 1.8.0 of the MindSpore Transformers suite, compared to version 1.7.0.

New Features

  • Training Features: Mcore models support fine-grained configuration of the weight-initialization standard deviation; the learning-rate strategy supports fine-grained configuration of grouped learning rates; a new Muon optimizer is added and, combined with QKClip configuration support, implements the MuonClip optimizer.

  • Mcore Model Architecture: Different TransformerLayer configurations can now use different position-encoding strategies; SlidingWindowAttention can now be configured.

  • Datasets: Hugging Face datasets support streaming data loading, reducing dataset loading time for fine-tuning tasks.

  • Architecture Upgrade: The weight saving/loading and resume-training solution has been upgraded with a new weight-directory structure, simplified configuration, and a Reshard loading mechanism, significantly improving usability and loading/recovery performance.

Bugfix

During this release cycle, we implemented numerous bug fixes across models, functionality, usability, and documentation. Key fixes include:

  • !7824: Fixed an issue where pad_token_id did not take effect in Mcore networks.

  • !7818: Fixed a hostname-retrieval failure in certain environments.

  • !7793 !7713: Fixed issues related to Hugging Face datasets.

  • !7630: Fixed a safetensors weight conversion and loading issue when the parallel strategy is changed.

  • !7743: Fixed the hidden_size assignment logic when the number of shared experts is greater than 1.

  • !7790: Fixed an inference weight-loading failure when q_lora_rank is None.

  • !7902: Fixed an error in the DeepSeek-V3 inference model when weights are not loaded.

Change Notes

This release modifies some previously deprecated models, code, and materials. Detailed changes and explanations are as follows:

Change Content | Change Description
Deprecated Model Sunset | The following models have been sunset: Llama3.1, Mixtral, Llm_boost.

Contributors

We extend our gratitude to the following team and its members for their outstanding contributions:

  • 天翼云息壤智算团队: RFC !7757: Support for hot/cold MoE expert migration, improving training performance during the initial phase of MoE model training, when expert load is unbalanced.

We also thank the following developers who contributed during the release cycle:

@ccsszz, @chenrayray, @hangangqiang, @highcloud3, @hss-shuai, @huan-xiaoling, @husichao, @jimmyisme, @JingweiHuang, @lanshaozuishuai, @limuan, @Lin-Bert, @liulili-huawei, @liu-yanwei6, @lzy0920232, @minghu111, @niu-junhao01, @pengjingyou, @qsc97, @renyujin, @senzhen-town, @smallsilly, @Somnus2020, @song-jiaqi1999, @suhaibo, @Sunshine_Youngster, @wei_zhuoyi, @xiaoqi-zhou, @yinanf, @yiyison, @yule100, @zhangyihuiben, @zyw-hw, @zzzkeke

Contributions to the project in any form are most welcome!