Logs

Log Saving

Overview

MindSpore Transformers writes the model's training configuration, training steps, loss, throughput, and other information to the log. Developers can specify where the logs are stored.

Training Log Directory Structure

During training, MindSpore Transformers generates a training log directory, ./log, in the output directory (./output by default).

When the training task is launched with msrun, an additional log directory, ./msrun_log, is generated in the output directory by default.

Folder	Description
log	Per-card logs are stored in `rank_{i}` folders, where `i` is the NPU card number used by the training task. Each `rank_{i}` folder contains `info.log` and `error.log`, which record the INFO-level and ERROR-level output of training respectively. By default, a single log file is capped at 50 MB and at most 5 backup logs are kept.
msrun_log	`worker_{i}.log` records the training log of each card (including error information), and `scheduler.log` records the startup information of msrun. Training logs are usually viewed from this folder.
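
To make the size limit and backup count in the table concrete, the following is a minimal sketch built on Python's standard `logging` module. It is not the MindSpore Transformers implementation: the helper name `make_rank_logger`, the logger name, and the message format are assumptions made for illustration. The sketch also routes ERROR records into `info.log` as well as `error.log`, which is a choice of this example rather than documented behavior.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # 50 MB per log file, as in the table above
BACKUP_COUNT = 5              # keep at most 5 rotated backup files


def make_rank_logger(log_root: str, rank: int) -> logging.Logger:
    """Create a logger writing info.log and error.log under rank_{rank}.

    Hypothetical helper for illustration; not a MindSpore Transformers API.
    """
    rank_dir = Path(log_root) / f"rank_{rank}"
    rank_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"demo_rank_{rank}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger
    logger.handlers.clear()

    # In this sketch info.log receives INFO and above, error.log ERROR and above.
    for filename, level in (("info.log", logging.INFO), ("error.log", logging.ERROR)):
        handler = RotatingFileHandler(
            rank_dir / filename, maxBytes=MAX_BYTES, backupCount=BACKUP_COUNT
        )
        handler.setLevel(level)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

With these settings, once `info.log` reaches the size limit it is renamed to `info.log.1`, older backups shift up by one, and anything beyond the fifth backup is discarded. This is the behavior of Python's `RotatingFileHandler`; the backup file naming used by MindSpore Transformers may differ.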

Take an 8-rank task started by msrun as an example. The specific log structure is as follows:

output
    ├── log
        ├── rank_0
            ├── info.log    # Record the training information of NPU rank 0
            └── error.log   # Record the error information of NPU rank 0
        ├── ...
        └── rank_7
            ├── info.log    # Record the training information of NPU rank 7
            └── error.log   # Record the error information of NPU rank 7
    └── msrun_log
        ├── scheduler.log   # Record the startup information of msrun
        ├── worker_0.log    # Record the training and error information of NPU rank 0
        ├── ...
        └── worker_7.log    # Record the training and error information of NPU rank 7
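
The layout above can be expressed as a small path helper, for example to check after a run that every expected log file exists. This is only a sketch: the function names `expected_log_files` and `missing_log_files` are hypothetical, and the file names are taken directly from the directory structure shown above.

```python
from pathlib import Path
from typing import List


def expected_log_files(output_dir: str, num_ranks: int, use_msrun: bool = True) -> List[Path]:
    """List the log files the documented layout implies for num_ranks cards."""
    root = Path(output_dir)
    files: List[Path] = []
    for rank in range(num_ranks):
        rank_dir = root / "log" / f"rank_{rank}"
        files += [rank_dir / "info.log", rank_dir / "error.log"]
    if use_msrun:
        # msrun_log is only generated when the task is launched with msrun.
        msrun_dir = root / "msrun_log"
        files.append(msrun_dir / "scheduler.log")
        files += [msrun_dir / f"worker_{rank}.log" for rank in range(num_ranks)]
    return files


def missing_log_files(output_dir: str, num_ranks: int, use_msrun: bool = True) -> List[Path]:
    """Return the expected log files that are not present on disk."""
    return [p for p in expected_log_files(output_dir, num_ranks, use_msrun) if not p.is_file()]
```

For the 8-rank msrun example, `expected_log_files("./output", 8)` yields 16 files under `log` plus `scheduler.log` and 8 worker logs under `msrun_log`.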

Configuration and Usage

By default, the training YAML file of MindSpore Transformers sets the output path to ./output. If you start the training task from the mindformers path, the logs generated by training are saved under mindformers/output by default.

YAML Parameter Configuration

To change the output log folder, modify the configuration in the YAML file.

Taking the DeepSeek-V3 pre-training YAML file as an example, the output directory can be configured as follows:

output_dir: './output' # path to save logs/checkpoint/strategy

Specifying Output Directory for Single-Card Tasks

In addition to the YAML configuration, the one-click start script run_mindformer accepts the --output_dir argument in the start command to specify the log output path.

If the output path is set on the command line, it overrides the configuration in the YAML file.
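
The precedence between the command line and the YAML file can be illustrated as follows. This is a sketch of the rule only, not the actual run_mindformer code: the function `resolve_output_dir` is hypothetical, and the YAML file is represented by an already-loaded dictionary.

```python
import argparse
from typing import Optional, Sequence

DEFAULT_OUTPUT_DIR = "./output"  # default output path stated in this document


def resolve_output_dir(yaml_config: dict, argv: Optional[Sequence[str]] = None) -> str:
    """Pick the output directory: command line first, then YAML, then the default.

    Hypothetical helper; yaml_config stands in for the parsed training YAML.
    """
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--output_dir", default=None)
    # Ignore any other arguments that a real launch script would accept.
    args, _unknown = parser.parse_known_args(argv)

    if args.output_dir is not None:
        return args.output_dir  # the command-line value overrides the YAML value
    return yaml_config.get("output_dir", DEFAULT_OUTPUT_DIR)
```

For example, `resolve_output_dir({"output_dir": "./output"}, ["--output_dir", "/data/run1"])` returns `/data/run1`, while calling it with an empty argument list returns the value from the YAML dictionary.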

Specifying Output Directory for Distributed Tasks

If the model training requires multiple servers, use the distributed task launch script to start the distributed training task.

If shared storage is configured, you can also pass the LOG_DIR parameter to the startup script to set the log output path of the Worker and Scheduler, so that the logs of all machine nodes are written to one path for unified observation.
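
Once the logs of all nodes land in one directory, they can be scanned together. The sketch below assumes that the files under LOG_DIR follow the `worker_{i}.log` naming described earlier and that error lines contain the text `ERROR`. Both are assumptions, and the function `collect_error_lines` is a hypothetical helper, not part of MindSpore Transformers.

```python
from pathlib import Path
from typing import Dict, List


def collect_error_lines(log_dir: str, keyword: str = "ERROR") -> Dict[str, List[str]]:
    """Map each worker log file name to the lines in it that contain keyword."""
    results: Dict[str, List[str]] = {}
    for path in sorted(Path(log_dir).glob("worker_*.log")):
        # errors="replace" keeps the scan going if a log has undecodable bytes.
        with path.open("r", encoding="utf-8", errors="replace") as f:
            hits = [line.rstrip("\n") for line in f if keyword in line]
        if hits:
            results[path.name] = hits
    return results
```

Calling `collect_error_lines("/path/to/shared/log_dir")` returns only the workers that logged a matching line, which narrows down the failing card before opening the full log.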