Logs

Logs Saving

Overview

MindSpore Transformers writes the model's training configuration, training steps, loss, throughput, and other information to log files. Developers can specify the path where the logs are stored.

Training Log Directory Structure

During training, MindSpore Transformers generates a training log directory, ./log, inside the output directory (./output by default).

When the training task is started with msrun, an additional log directory, ./msrun_log, is generated in the output directory by default.

Folder | Description
log | The log information of each card is stored in its own rank_{i} folder (i is the NPU card number used by the training task). Each rank_{i} folder contains info.log and error.log, which record the INFO-level and ERROR-level information output during training, respectively. The default maximum size of a single log file is 50 MB, with at most 5 backup logs.
msrun_log | worker_{i}.log records the training log of each card (including error information), and scheduler.log records the startup information of msrun. Training log information is usually viewed through this folder.
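The size and backup limits described above behave like standard rotating log files. The following is a minimal sketch using Python's standard logging module. It is not the MindSpore Transformers implementation; the function name make_rank_logger and the logger name are hypothetical, and only the 50 MB limit, the 5 backups, and the rank_{i}/info.log and error.log layout come from the text above.

```python
import logging
import os
from logging.handlers import RotatingFileHandler

MAX_BYTES = 50 * 1024 * 1024  # 50 MB per log file, as stated above
BACKUP_COUNT = 5              # at most 5 rotated backups, as stated above


def make_rank_logger(log_root: str, rank: int) -> logging.Logger:
    """Hypothetical helper: create a logger that writes info.log and error.log under rank_{rank}."""
    rank_dir = os.path.join(log_root, f"rank_{rank}")
    os.makedirs(rank_dir, exist_ok=True)

    logger = logging.getLogger(f"sketch.rank_{rank}")
    if logger.handlers:  # already configured; avoid attaching duplicate handlers
        return logger
    logger.setLevel(logging.INFO)
    logger.propagate = False

    # Assumption of this sketch: info.log receives INFO and above,
    # error.log receives ERROR and above.
    for filename, level in (("info.log", logging.INFO), ("error.log", logging.ERROR)):
        handler = RotatingFileHandler(
            os.path.join(rank_dir, filename),
            maxBytes=MAX_BYTES,        # roll over once the file would exceed this size
            backupCount=BACKUP_COUNT,  # keep info.log.1 ... info.log.5 at most
        )
        handler.setLevel(level)
        logger.addHandler(handler)
    return logger
```

With this kind of rotation, the oldest backup is discarded once the backup count is reached, so a long-running task keeps a bounded amount of log data per rank.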

Taking an 8-rank task started by msrun as an example, the log structure is as follows:

output
    ├── log
        ├── rank_0
            ├── info.log    # Record the training information of NPU rank 0
            └── error.log   # Record the error information of NPU rank 0
        ├── ...
        └── rank_7
            ├── info.log    # Record the training information of NPU rank 7
            └── error.log   # Record the error information of NPU rank 7
    └── msrun_log
        ├── scheduler.log   # Record the communication information between each NPU rank
        ├── worker_0.log    # Record the training and error information of NPU rank 0
        ├── ...
        └── worker_7.log    # Record the training and error information of NPU rank 7
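To check that a finished run produced the files shown in the tree, the expected paths can be enumerated programmatically. The sketch below is illustrative only; the function name expected_log_files is hypothetical, and the layout is taken directly from the structure above.

```python
from pathlib import PurePosixPath


def expected_log_files(output_dir: str, rank_count: int, use_msrun: bool = True) -> list[str]:
    """Hypothetical helper: list the log files expected for a task with `rank_count` ranks."""
    root = PurePosixPath(output_dir)
    files = []
    # Per-rank logs under <output_dir>/log/rank_{i}
    for rank in range(rank_count):
        files.append(str(root / "log" / f"rank_{rank}" / "info.log"))
        files.append(str(root / "log" / f"rank_{rank}" / "error.log"))
    # Extra logs generated only when the task is started by msrun
    if use_msrun:
        files.append(str(root / "msrun_log" / "scheduler.log"))
        for rank in range(rank_count):
            files.append(str(root / "msrun_log" / f"worker_{rank}.log"))
    return files
```

For an 8-rank task, ranks are numbered 0 through 7, so the list ends at rank_7 and worker_7.log.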

Configuration and Usage

By default, MindSpore Transformers specifies the file output path as ./output in the training yaml file. If you start the training task under the mindformers path, the log output generated by the training will be saved under mindformers/output by default.

YAML Parameter Configuration

To change the log output folder, modify the output_dir configuration in the yaml file.

Taking DeepSeek-V3 pre-training yaml as an example, the following configuration can be made:

output_dir: './output' # path to save logs/checkpoint/strategy

Specifying Output Directory for Single-Card Tasks

In addition to the yaml file configuration, MindSpore Transformers also supports specifying the log output path with the --output_dir startup argument of the run_mindformer one-click start script.

Note: if the output path is specified here, it overrides the configuration in the yaml file.
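The precedence rule above (command line over yaml, with ./output as the default) can be sketched as follows. This is not the actual run_mindformer argument-parsing code; the function resolve_output_dir is hypothetical, and the yaml content is represented as an already-loaded dict.

```python
import argparse


def resolve_output_dir(yaml_config: dict, argv: list[str]) -> str:
    """Hypothetical helper: pick the output directory, preferring --output_dir over the yaml value."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", default=None)
    # parse_known_args ignores any other startup arguments in argv
    args, _unknown = parser.parse_known_args(argv)

    if args.output_dir is not None:
        return args.output_dir  # command-line value overrides the yaml file
    # Fall back to the yaml value, then to the documented default
    return yaml_config.get("output_dir", "./output")


# Example: the command-line value wins over the yaml value
print(resolve_output_dir({"output_dir": "./output"}, ["--output_dir", "/data/logs"]))
```

When --output_dir is omitted, the value from the yaml file is used unchanged.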

Specifying Output Directory for Distributed Tasks

If the model training requires multiple servers, use the distributed task launch script to start the distributed training task.

If shared storage is configured, you can also set the input parameter LOG_DIR in the startup script to specify the log output path of the Worker and Scheduler, so that the logs of all machine nodes are written to one path for unified observation.