Implementing a Cross-Silo Federated Target Detection Application (x86)


Based on the type of participating clients, federated learning can be classified into cross-silo federated learning and cross-device federated learning. In a cross-silo federated learning scenario, the clients involved in federated learning are different organizations (e.g., healthcare or finance) or geographically distributed data centers, i.e., models are trained on multiple data silos. In a cross-device federated learning scenario, the participating clients are a large number of mobile or IoT devices. This tutorial describes how to implement a target detection application using the Faster R-CNN network on the MindSpore Federated cross-silo federated framework.

The full script for launching the cross-silo federated target detection application can be found here.

Preparation

This tutorial deploys the cross-silo federated target detection task based on the faster_rcnn network provided in MindSpore model_zoo. Please first follow the official faster_rcnn tutorial and code to understand the COCO dataset, the faster_rcnn network structure, and the training and evaluation processes. Since the COCO dataset is open source, please refer to its official website guidelines to download the dataset yourself and slice it (for example, with 100 clients, the dataset can be sliced into 100 shards, each representing the data held by one client).
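
Any slicing strategy that fits your scenario will do. Below is a minimal sketch that slices the COCO annotation file by image in round-robin fashion; the input file name, output names, and split count are illustrative and not part of the official scripts.

import json

NUM_CLIENTS = 100

# Load the original COCO annotation file (name is illustrative).
with open("instances_train2017.json") as f:
    coco = json.load(f)

for client_id in range(NUM_CLIENTS):
    # Round-robin assignment of images to clients.
    client_images = coco["images"][client_id::NUM_CLIENTS]
    image_ids = {img["id"] for img in client_images}
    shard = {
        "images": client_images,
        # Keep only the annotations that belong to this client's images.
        "annotations": [a for a in coco["annotations"] if a["image_id"] in image_ids],
        "categories": coco["categories"],
    }
    with open(f"train_{client_id}.json", "w") as out:
        json.dump(shard, out)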

Since the original COCO dataset is in JSON format while the target detection script provided by the cross-silo federated learning framework only supports input data in MindRecord format, you can convert the JSON files to MindRecord files according to the following steps.

  • Configure the following parameters in the configuration file default_config.yaml:

    • mindrecord_dir

      Used to set the save path of the generated MindRecord files. The folder name must follow the mindrecord_{num} format, where the number num is the client index: 0, 1, 2, 3, ......

      mindrecord_dir:"./datasets/coco_split/split_100/mindrecord_0"
      
    • instance_set

      Used to set the original JSON file path.

      instance_set: "./datasets/coco_split/split_100/train_0.json"
      
  • Run the script generate_mindrecord.py to generate the MindRecord file from train_0.json; it is saved in the mindrecord_dir path.
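
    Assuming generate_mindrecord.py reads mindrecord_dir and instance_set from default_config.yaml (check the script's argparse options in your version), the invocation is simply:

    python generate_mindrecord.py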

Starting the Cross-Silo Federated Task

Installing MindSpore and MindSpore Federated

Both installation from source and installation of a release version are supported on the CPU, GPU, and Ascend hardware platforms; choose the installation method that matches your hardware platform. For the installation steps, refer to MindSpore installation and MindSpore Federated installation.

Currently, the federated learning framework can only be deployed in Linux environments, and the cross-silo federated learning framework requires MindSpore version >= 1.5.0.
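
You can quickly verify that the installed MindSpore version meets this requirement:

python -c "import mindspore; print(mindspore.__version__)"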

Starting the Task

Refer to the example to start the cluster. The directory structure of the reference example is as follows:

cross_silo_faster_rcnn
├── src
│   ├── FasterRcnn
│   │   ├── __init__.py                  // init file
│   │   ├── anchor_generator.py          // Anchor generator
│   │   ├── bbox_assign_sample.py        // Phase I Sampler
│   │   ├── bbox_assign_sample_stage2.py // Phase II Sampler
│   │   ├── faster_rcnn_resnet.py        // Faster R-CNN network
│   │   ├── faster_rcnn_resnet50v1.py    // Faster R-CNN network taking Resnet50v1.0 as backbone
│   │   ├── fpn_neck.py                  // Feature Pyramid Network
│   │   ├── proposal_generator.py        // Candidate generator
│   │   ├── rcnn.py                      // R-CNN network
│   │   ├── resnet.py                    // Backbone network
│   │   ├── resnet50v1.py                // Resnet50v1.0 backbone network
│   │   ├── roi_align.py                 // ROI aligning network
│   │   └── rpn.py                       // Region proposal network
│   ├── dataset.py                     // Create and process datasets
│   ├── lr_schedule.py                 // Learning rate generator
│   ├── network_define.py              // Faster R-CNN network definition
│   ├── util.py                        // Routine operation
│   └── model_utils
│           ├── __init__.py                  // init file
│           ├── config.py                    // Obtain .yaml configuration parameter
│           ├── device_adapter.py            // Obtain on-cloud id
│           ├── local_adapter.py             // Get local id
│           └── moxing_adapter.py            // On-cloud data preparation
├── requirements.txt
├── mindspore_hub_conf.py
├── generate_mindrecord.py              // Convert annotations files in .json format to MindRecord format for reading datasets
├── default_yaml_config.yaml                 // Required configuration files for Federated training
├── default_config.yaml                         // Required configuration file of network structure, dataset address, and fl_plan
├── run_cross_silo_fasterrcnn_worker.py // Start Cloud Federated worker script
├── run_cross_silo_fasterrcnn_worker_distributed.py // Start Cloud Federated distributed worker training script
├── run_cross_silo_fasterrcnn_sched.py  // Start Cloud Federated scheduler script
├── run_cross_silo_fasterrcnn_server.py // Start Cloud Federated server script
└── test_fl_fasterrcnn.py               // Training script used by the client
  1. Note that you can choose whether to record the loss value of each step by setting the parameter dataset_sink_mode in the test_fl_fasterrcnn.py file:

    model.train(config.client_epoch_num, dataset, callbacks=cb, dataset_sink_mode=True)  # dataset_sink_mode=True records only the loss value of the last step in each epoch.
    model.train(config.client_epoch_num, dataset, callbacks=cb, dataset_sink_mode=False)   # dataset_sink_mode=False records the loss value of each step; this is the default mode in the code.
    
  2. Set the following parameters in the configuration file default_config.yaml:

    • pre_trained

      Used to set the pre-trained model path (.ckpt format).

      The pre-trained model used in this tutorial is a ResNet-50 checkpoint trained on ImageNet 2012. You can train it with the resnet50 script in ModelZoo, and then use src/convert_checkpoint.py to convert the trained ResNet-50 weight file into a loadable weight file.
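
      A sample setting (the path is illustrative; point it to your converted checkpoint):

      pre_trained: "/path/to/pretrained/faster_rcnn_backbone.ckpt"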

  3. Start Redis

    redis-server --port 2345 --save ""
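
    Optionally, verify that Redis is listening on the chosen port (redis-cli ships with Redis):

    redis-cli -p 2345 ping   # expected reply: PONG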
    
  4. Start Scheduler

    run_cross_silo_fasterrcnn_sched.py is the Python script used to start the Scheduler, and it supports modifying the configuration via argparse arguments. Execute the following command to start the Scheduler of this federated learning task. --yaml_config sets the yaml file path, and the Scheduler management address is 127.0.0.1:18019.

    python run_cross_silo_fasterrcnn_sched.py --yaml_config="default_yaml_config.yaml" --scheduler_manage_address="127.0.0.1:18019"
    

    For the detailed implementation, see run_cross_silo_fasterrcnn_sched.py.

    The following log output indicates a successful start:

    [INFO] FEDERATED(3944,2b280497ed00,python):2022-10-10-17:11:08.154.878 [mindspore_federated/fl_arch/ccsrc/scheduler/scheduler.cc:35] Run] Scheduler started successfully.
    [INFO] FEDERATED(3944,2b28c5ada700,python):2022-10-10-17:11:08.155.056 [mindspore_federated/fl_arch/ccsrc/common/communicator/http_request_handler.cc:90] Run] Start http server!
    
  5. Start Server

    run_cross_silo_fasterrcnn_server.py is a Python script for starting a number of Servers, and it supports modifying the configuration via argparse arguments. Execute the following command to start the Servers of this federated learning task, with TCP address 127.0.0.1, HTTP service ports starting at 6668, and 4 Servers.

    python run_cross_silo_fasterrcnn_server.py --yaml_config="default_yaml_config.yaml" --tcp_server_ip="127.0.0.1" --checkpoint_dir="/path/to/fl_ckpt" --local_server_num=4 --http_server_address="127.0.0.1:6668"
    

    The above command is equivalent to starting four Server processes with federated learning service ports 6668, 6669, 6670, and 6671 respectively, as detailed in run_cross_silo_fasterrcnn_server.py. --checkpoint_dir must be set to the directory where the checkpoints are located; the Server reads its initial weights from the checkpoints in this path, and the checkpoint prefix must follow the {fl_name}_recovery_iteration_ format.

    The following log output indicates a successful start:

    [INFO] FEDERATED(3944,2b280497ed00,python):2022-10-10-17:11:08.154.645 [mindspore_federated/fl_arch/ccsrc/common/communicator/http_server.cc:122] Start] Start http server!
    [INFO] FEDERATED(3944,2b280497ed00,python):2022-10-10-17:11:08.154.725 [mindspore_federated/fl_arch/ccsrc/common/communicator/http_request_handler.cc:85] Initialize] Ev http register handle of: [/disableFLS, /enableFLS, /state, /queryInstance, /newInstance] success.
    [INFO] FEDERATED(3944,2b280497ed00,python):2022-10-10-17:11:08.154.878 [mindspore_federated/fl_arch/ccsrc/scheduler/scheduler.cc:35] Run] Scheduler started successfully.
    [INFO] FEDERATED(3944,2b28c5ada700,python):2022-10-10-17:11:08.155.056 [mindspore_federated/fl_arch/ccsrc/common/communicator/http_request_handler.cc:90] Run] Start http server!
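
    As the log above shows, each Server registers HTTP handlers such as /state. Once a Server is up, you can optionally probe it (the response format depends on the MindSpore Federated version):

    curl http://127.0.0.1:6668/state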
    
  6. Start Worker

    run_cross_silo_fasterrcnn_worker.py is a Python script for starting a number of workers, and it supports modifying the configuration via argparse arguments. Execute the following command to start the workers of this federated learning task; at least 2 workers are required for the task to proceed properly.

    python run_cross_silo_fasterrcnn_worker.py --local_worker_num=2 --yaml_config="default_yaml_config.yaml" --pre_trained="/path/to/pre_trained" --dataset_path=/path/to/datasets/coco_split/split_100 --http_server_address=127.0.0.1:6668
    

    For the detailed implementation, see run_cross_silo_fasterrcnn_worker.py. Note that in dataset sink mode the synchronization frequency of Cloud Federated is measured in epochs; otherwise it is measured in steps.

    In the above command, --local_worker_num=2 means starting two clients, whose datasets are datasets/coco_split/split_100/mindrecord_0 and datasets/coco_split/split_100/mindrecord_1 respectively, laid out as sketched below. Please prepare the required datasets for the corresponding clients according to the preparation tutorial above.
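
    The expected layout under --dataset_path for this two-worker example, following the mindrecord_{num} naming rule from the preparation section (the split count is illustrative):

    datasets/coco_split/split_100
    ├── mindrecord_0   # data of worker_0
    └── mindrecord_1   # data of worker_1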

    After executing the above three commands and waiting for a while, go to the worker_0 folder in the current directory and check the worker_0 log with the command grep -rn "epoch:" *. You will see a log message similar to the following:

    epoch: 1 step: 1 total_loss: 0.6060338
    

    This means that the cross-silo federated task has started successfully and worker_0 is training. Other workers can be checked in the same way.

    At present, the worker node of Cloud Federated supports both single-machine multi-card and multi-machine multi-card distributed training. run_cross_silo_fasterrcnn_worker_distributed.py is a Python script for users to start distributed training of the worker node, and it supports modifying the configuration via argparse arguments. Execute the following command to start the distributed worker of this federated learning task, where --device_num specifies the number of processes started by the worker cluster (4 here), --run_distribute enables distributed training in the cluster, and the HTTP start port is 6668:

    python run_cross_silo_fasterrcnn_worker_distributed.py --device_num=4 --run_distribute=True --dataset_path=/path/to/datasets/coco_split/split_100 --http_server_address=127.0.0.1:6668
    

    Enter the worker_distributed/log_output/ folder in the current directory and run the command grep -rn "epoch" * to view the logs of the distributed worker cluster. You can see information like the following:

    epoch: 1 step: 1 total_loss: 0.613467
    

    For the description of the parameter configurations in the above scripts, please refer to the yaml configuration notes.

Viewing the Log

After the task is started successfully, the corresponding log files are generated under the current directory cross_silo_faster_rcnn. The log directory structure is as follows:

cross_silo_faster_rcnn
├── scheduler
│   └── scheduler.log     # Logs printed while the scheduler is running
├── server_0
│   └── server.log        # Logs printed while server_0 is running
├── server_1
│   └── server.log        # Logs printed while server_1 is running
├── server_2
│   └── server.log        # Logs printed while server_2 is running
├── server_3
│   └── server.log        # Logs printed while server_3 is running
├── worker_0
│   ├── ckpt              # Store the aggregated model ckpt obtained by worker_0 at the end of each federated learning iteration
│   │  └── mindrecord_0
│   │      ├── mindrecord_0-fast-rcnn-0epoch.ckpt
│   │      ├── mindrecord_0-fast-rcnn-1epoch.ckpt
│   │      │
│   │      │              ......
│   │      │
│   │      └── mindrecord_0-fast-rcnn-29epoch.ckpt
│   ├── loss_0.log        # Record the loss value of each step during the training process of worker_0
│   └── worker.log        # Record the output logs during worker_0's participation in the federated learning task
└── worker_1
    ├── ckpt              # Store the aggregated model ckpt obtained by worker_1 at the end of each federated learning iteration
    │  └── mindrecord_1
    │      ├── mindrecord_1-fast-rcnn-0epoch.ckpt
    │      ├── mindrecord_1-fast-rcnn-1epoch.ckpt
    │      │
    │      │                     ......
    │      │
    │      └── mindrecord_1-fast-rcnn-29epoch.ckpt
    ├── loss_0.log        # Record the loss value of each step during the training process of worker_1
    └── worker.log        # Record the output logs during worker_1's participation in the federated learning task

Closing the Task

If you want to exit midway, the following command is available:

python finish_cross_silo_fasterrcnn.py --redis_port=2345

For the detailed implementation, see finish_cloud.py.

Alternatively, when the training task finishes, the cluster exits automatically and there is no need to close it manually.

Results

  • Data used:

    The COCO dataset is split into 100 shards, and the first two shards are used as the datasets of the two workers respectively.

  • The number of client-side local training epochs: 1

  • Total number of cross-silo federated learning iterations: 30

  • Results (recording the loss values during the client-side local training):

    Go to the worker_0 folder in the current directory and check the worker_0 log with the command grep -rn "epoch:" * to see the loss value output at each step:

    epoch: 1 step: 1 total_loss: 5.249325
    epoch: 1 step: 2 total_loss: 4.0856013
    epoch: 1 step: 3 total_loss: 2.6916502
    epoch: 1 step: 4 total_loss: 1.3917351
    epoch: 1 step: 5 total_loss: 0.8109232
    epoch: 1 step: 6 total_loss: 0.99101084
    epoch: 1 step: 7 total_loss: 1.7741735
    epoch: 1 step: 8 total_loss: 0.9517553
    epoch: 1 step: 9 total_loss: 1.7988946
    epoch: 1 step: 10 total_loss: 1.0213892
    epoch: 1 step: 11 total_loss: 1.1700443
                      .
                      .
                      .
    

The line charts of the changes in training loss at each step of worker_0 and worker_1 during the 30-iteration training are as follows, [1] and [2]:

The line charts of the average loss per epoch (the sum of the losses of all steps in an epoch divided by the number of steps) of worker_0 and worker_1 during the 30-iteration training are as follows, [3] and [4]:

[Figure: cross-silo_fastrcnn-2workers-loss.png]
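
As a minimal sketch, the per-epoch average loss can be computed from a worker log, assuming lines of the form epoch: 1 step: 2 total_loss: 4.0856013 as shown above (the log path is illustrative):

import re
from collections import defaultdict

# Matches lines like "epoch: 1 step: 2 total_loss: 4.0856013".
pattern = re.compile(r"epoch:\s*(\d+)\s+step:\s*\d+\s+total_loss:\s*([\d.]+)")
losses = defaultdict(list)

with open("worker_0/loss_0.log") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            epoch, loss = match.groups()
            losses[int(epoch)].append(float(loss))

# Average loss per epoch: sum of the step losses divided by the number of steps.
for epoch in sorted(losses):
    avg = sum(losses[epoch]) / len(losses[epoch])
    print(f"epoch {epoch}: average total_loss = {avg:.6f}")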