Running Data Recorder

View Source On Gitee

Overview

Running Data Recorder (RDR) is a function that MindSpore provides to record data when the training program is running. The data to be recorded will be preset in MindSpore, and if there is a run exception in MindSpore when running the training script, the pre-recorded data in MindSpore will be automatically exported to help locate the cause of the run exception. Different runtime exceptions will export different data, for example, if a Run task error exception occurs, it will export information such as computation graph, graph execution order, memory allocation, etc. to help locate the cause of the exception.

Not all run exceptions will export data, and only some of them are currently supported.

Currently, only graph mode training scenarios are supported to collect CPU/Ascend/GPU related data.

Usage

Configuring RDR via Configuration File

  1. Create the configuration file mindspore_config.json.

    {
        "rdr": {
            "enable": true,
            "mode": 1,
            "path": "/path/to/rdr/dir"
        }
    }
    

    enable: Control whether the RDR function is enabled or not.

    mode: Control RDR data export mode. Set to 1 to export data only when training abnormally terminates, and set to 2 to export data when training abnormally terminates or ends normally.

    path: Set the path to save data in RDR, only absolute path is supported.

  2. Configure the RDR via context.

    import mindspore as ms
    ms.set_context(env_config_path="./mindspore_config.json")
    

Configuring RDR via Environment Variables

Enable RDR by export MS_RDR_ENABLE=1, set the export data mode by export MS_RDR_MODE=1 or export MS_RDR_MODE=2, and then set the root directory path for RDR file export by export MS_RDR_PATH=/path/to/root/dir. The RDR file will be saved in the /path/to/root/dir/rank_{RANK_ID}/rdr/ directory, where RANK_ID is the card number in the multi-card training scenario, and the default RANK_ID=0 in the single-card scenario.

User-set configuration files take precedence over environment variables.

Exception Handling

Suppose we train with MindSpore on Atlas training series, the training throws a Run task error exception.

At this point we go to the export directory of the RDR file and we can see that there are several files, each representing a type of data. For example, hwopt_d_before_graph_0.ir is a computational graph file. Open the file using the text tool to view the calculation diagram and analyze whether it meets expectations.

Diagnostic Handling

When RDR is turned on and environment variable export MS_RDR_MODE=2 is set, enter diagnostic mode. At the end of the graph compilation, we can also see the same files saved with exception handling in the export directory of the RDR file.