Distributed Configuration
Q: What do I do if the error Init plugin so failed, ret = 1343225860 occurs during HCCL distributed training?
A: HCCL failed to be initialized. A likely cause is that the rank table file (rank json) is incorrect. You can use the tool in mindspore/model_zoo/utils/hccl_tools to generate a new one. Alternatively, export the environment variable ASCEND_SLOG_PRINT_TO_STDOUT=1 to enable HCCL log printing and check the log information.
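As a minimal sketch (the rank table path hccl_8p.json below is a placeholder, assuming a file generated by hccl_tools on a single host), the two settings mentioned above can be applied before initializing HCCL:

```python
import os
from mindspore import context
from mindspore.communication.management import init

# Placeholder path: point this at the rank table generated by hccl_tools.
os.environ["RANK_TABLE_FILE"] = "/path/to/hccl_8p.json"
# Print HCCL logs to stdout so initialization failures are visible.
os.environ["ASCEND_SLOG_PRINT_TO_STDOUT"] = "1"

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()  # HCCL initialization fails here if the rank table is invalid
```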
Q: How do I fix the following error when running MindSpore distributed training with GPU:
Loading libgpu_collective.so failed. Many reasons could cause this:
1.libgpu_collective.so is not installed.
2.nccl is not installed or found.
3.mpi is not installed or found
A: This message means that MindSpore failed to load the library libgpu_collective.so. The possible causes are:
OpenMPI or NCCL is not installed in this environment.
The NCCL version has not been updated to v2.7.6: MindSpore v1.1.0 supports the GPU P2P communication operator, which relies on NCCL v2.7.6. libgpu_collective.so cannot be loaded successfully if NCCL is not updated to this version.
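For reference, a minimal GPU initialization script (assuming OpenMPI and NCCL v2.7.6 or later are installed, and that the script is launched with mpirun) that triggers loading of libgpu_collective.so looks like this:

```python
from mindspore import context
from mindspore.communication.management import init

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
# init("nccl") loads libgpu_collective.so; the error above is raised here
# when OpenMPI or NCCL is missing, or NCCL is older than v2.7.6.
init("nccl")
```

Launching it with mpirun (for example, mpirun -n 8 python train.py) lets each process obtain its rank from OpenMPI.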
Q: The communication configuration file needs to be set up in the Ascend environment. How should it be configured?
A: Please refer to the Configuring Distributed Environment Variables section of Ascend-based distributed training in the MindSpore tutorial.
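As an illustration only (the field names below follow the rank table files commonly produced by hccl_tools; all IP addresses and IDs are placeholders, and the authoritative format is described in the tutorial above), a two-device communication configuration file could be written like this:

```python
import json

# Illustrative two-device rank table; values such as server_id and
# device_ip are placeholders and must match your actual environment.
rank_table = {
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "10.155.111.140",
            "device": [
                {"device_id": "0", "device_ip": "192.1.27.6", "rank_id": "0"},
                {"device_id": "1", "device_ip": "192.2.27.6", "rank_id": "1"},
            ],
            "host_nic_ip": "reserve",
        }
    ],
    "status": "completed",
}

with open("hccl_2p.json", "w") as f:
    json.dump(rank_table, f, indent=4)
```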
Q: How do I perform distributed multi-machine multi-card training?
A: For the Ascend environment, please refer to the Multi-machine Training section of the MindSpore tutorial “distributed_training_ascend”. For GPU-based environments, please refer to the Run Multi-Host Script section of the MindSpore tutorial “distributed_training_gpu”.
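For example, the per-process setup shared by single-machine and multi-machine GPU scripts is sketched below (a data-parallel configuration as an assumption; the cross-host launch itself is done with mpirun and a hostfile as described in the tutorial):

```python
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
init("nccl")
# Data-parallel configuration shared by single-host and multi-host runs;
# the device count is taken from the communication group at runtime.
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True,
                                  device_num=get_group_size())
```

Across machines, the same script is started with something like mpirun --hostfile hostfile -n 16 python train.py, where the hostfile lists each host and its slot count (the hostfile path and process count here are placeholders).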
