Logs
Log Saving
Overview
MindSpore Transformers writes the model's training configuration, training steps, loss, throughput, and other information to the logs. Developers can specify the path where the logs are stored.
Training Log Directory Structure
During training, MindSpore Transformers generates a log directory, ./log, in the output directory (./output by default).
When the training task is started with msrun, an additional log directory, ./msrun_log, is generated in the output directory.
Folder | Description
---|---
log | Training logs of each card (rank), stored separately: each rank_{i} subfolder contains an info.log with the training information and an error.log with the error information of that rank.
msrun_log | Logs generated when the task is started with msrun, including scheduler.log and one worker_{i}.log per rank.
Taking an 8-rank task started by msrun as an example, the log structure is as follows:
output
    ├── log
    │   ├── rank_0
    │   │   ├── info.log   # Records the training information of NPU rank 0
    │   │   └── error.log  # Records the error information of NPU rank 0
    │   ├── ...
    │   └── rank_7
    │       ├── info.log   # Records the training information of NPU rank 7
    │       └── error.log  # Records the error information of NPU rank 7
    └── msrun_log
        ├── scheduler.log  # Records the communication information between NPU ranks
        ├── worker_0.log   # Records the training and error information of NPU rank 0
        ├── ...
        └── worker_7.log   # Records the training and error information of NPU rank 7
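To follow the training progress of a given card in real time, you can tail its info.log. A minimal example, assuming the default ./output directory and NPU rank 0:
tail -f ./output/log/rank_0/info.log   # follow the training information of NPU rank 0 in real time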
Configuration and Usage
By default, MindSpore Transformers sets the output path to ./output in the training yaml file. If the training task is started under the mindformers path, the logs generated by training are saved under mindformers/output by default.
YAML Parameter Configuration
If you need to change the log output directory, you can modify the configuration in the yaml file. Taking the DeepSeek-V3 pre-training yaml as an example, the following configuration can be made:
output_dir: './output' # path to save logs/checkpoint/strategy
Specifying Output Directory for Single-Card Tasks
In addition to the yaml file configuration, MindSpore Transformers also supports specifying the log output path with the --output_dir argument of the run_mindformer one-click start script, as shown in the sketch below.
Note that if the output path is specified here, it overrides the configuration in the yaml file.
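For example, a minimal single-card start command; the config path is a placeholder, and --output_dir is the argument described above:
python run_mindformer.py --config path/to/pretrain_config.yaml --output_dir ./my_output   # logs are written to ./my_output/log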
Specifying the Output Directory for Distributed Tasks
If model training requires multiple servers, use the distributed task launch script to start the distributed training task.
If shared storage is configured, you can also set the LOG_DIR input parameter of the launch script to specify the log output path of the workers and the scheduler, so that the logs of all machine nodes are written to a single path for unified observation.
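A hedged sketch of a multi-node launch with scripts/msrun_launcher.sh, assuming two 8-card servers and shared storage mounted at /shared; the positional parameter order shown (worker_num, local_worker_num, master_addr, master_port, node_rank, LOG_DIR, join, cluster_time_out) and all addresses and paths are illustrative, so check the launch script itself for the exact arguments:
# run on node 0 (the node hosting the scheduler); run the same command on node 1 with node_rank set to 1
bash scripts/msrun_launcher.sh "run_mindformer.py --config path/to/pretrain_config.yaml" 16 8 192.168.1.1 8118 0 /shared/msrun_log False 300
With LOG_DIR pointing to shared storage, the scheduler.log and worker_{i}.log files of every node land in the same /shared/msrun_log directory for unified observation.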