Implementation Problems
Q: How do I use MindSpore to implement multi-scale training?
A: In multi-scale training, when a Cell object is called with inputs of different shape, a separate graph is automatically built and called for each shape, which implements multi-scale training. Note that multi-scale training supports only the non-data-sink mode; the data sink (offloading) mode is not supported. For details, see the multi-scale training implementation of yolov3.
Q: If a MindSpore tensor whose requires_grad is set to False is converted into numpy for processing and then converted back into a tensor, will the computational graph and backward propagation be affected?
A: In PyNative mode, if numpy is used for computation, gradient transfer is interrupted. When requires_grad is set to False, there is no impact as long as the backward propagation of the tensor does not need to reach other parameters. When requires_grad is set to True, there is an impact.
Q: How do I modify the weight and bias of the fully-connected layer like torch.nn.functional.linear()?
A: The nn.Dense interface is similar to torch.nn.functional.linear(). nn.Dense can specify the initial values of weight and bias. Subsequent changes are applied automatically by the optimizer, so you do not need to change the values of the two parameters during training.
Q: What is the function of the .meta file generated after the model is saved using MindSpore? Can the .meta file be used to import the graph structure?
A: The .meta file is a compiled graph structure. However, this structure cannot be directly imported currently. If you do not know the graph structure, you still need to use the MindIR file to import the network.
Q: Can the yolov4-tiny-3l.weights model file be directly converted into a MindSpore model?
A: No. You need to convert the parameters trained by other frameworks into the MindSpore format, and then convert the model into a MindSpore model.
Q: Why is an error reported when model.train is called as follows in MindSpore?
model.train(1, dataset, callbacks=ms.train.LossMonitor(1), dataset_sink_mode=True)
model.train(1, dataset, callbacks=ms.train.LossMonitor(1), dataset_sink_mode=False)
A: Once the data sink (offloading) mode has been set, it cannot be switched to the non-sink mode. This is a restriction of the running mechanism.
Q: What should I pay attention to when using MindSpore in the eval phase of a model? Can the network and parameters be loaded directly? Does the optimizer need to be used in the Model?
A: It mainly depends on what is required in the eval phase. For example, in an image classification task the eval network outputs the probability of each class, and the accuracy (acc) is calculated against the corresponding label.
In most cases, the training network and parameters can be directly reused. Note that the inference mode needs to be set.
net.set_train(False)
The optimizer is not required in the eval phase. However, if the model.eval API of MindSpore is used, the loss function needs to be configured. For example:
# Define a model.
model = ms.train.Model(net, loss_fn=loss, metrics={'top_1_accuracy', 'top_5_accuracy'})
# Evaluate the model.
res = model.eval(dataset)
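To make the relationship between class probabilities and accuracy concrete, the following is a minimal pure-Python sketch of a top-k accuracy calculation. The function name top_k_accuracy and the sample data are hypothetical and only illustrate the idea; this is not the implementation behind the 'top_1_accuracy' and 'top_5_accuracy' metrics.

```python
def top_k_accuracy(probs, labels, k=1):
    """Fraction of samples whose true label is among the k most probable classes."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    hits = 0
    for row, label in zip(probs, labels):
        # Indices of the k largest probabilities in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Hypothetical eval outputs: one row of class probabilities per sample.
probs = [
    [0.7, 0.2, 0.1],  # most probable class: 0
    [0.1, 0.3, 0.6],  # most probable class: 2
    [0.4, 0.5, 0.1],  # most probable class: 1
    [0.2, 0.3, 0.5],  # most probable class: 2
]
labels = [0, 2, 0, 1]

print(top_k_accuracy(probs, labels, k=1))  # 0.5
print(top_k_accuracy(probs, labels, k=2))  # 1.0
```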
Q: How do I use param_group in SGD to reduce the learning rate?
A: To change the value per epoch, use a Dynamic LR Function and set step_per_epoch to step_size. To change the value per step, set step_per_epoch to 1. You can also use LearningRateSchedule.
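The effect of step_per_epoch can be illustrated with a small pure-Python sketch that produces one learning rate per training step. The function staircase_decay_lr is hypothetical and written only for illustration; it is not a MindSpore API, although it follows the same idea as a staircase-style dynamic LR function.

```python
def staircase_decay_lr(base_lr, decay_rate, total_step, step_per_epoch, decay_epoch=1):
    """Return one learning rate per training step.

    The rate is multiplied by decay_rate once every decay_epoch epochs,
    where one epoch is step_per_epoch consecutive steps.
    """
    lrs = []
    for step in range(total_step):
        current_epoch = step // step_per_epoch
        lrs.append(base_lr * decay_rate ** (current_epoch // decay_epoch))
    return lrs


# Decay once per epoch: step_per_epoch equals the number of steps in an epoch.
per_epoch = staircase_decay_lr(0.1, 0.5, total_step=6, step_per_epoch=3)
print(per_epoch)  # [0.1, 0.1, 0.1, 0.05, 0.05, 0.05]

# Decay on every step: step_per_epoch is 1.
per_step = staircase_decay_lr(0.1, 0.5, total_step=3, step_per_epoch=1)
print(per_step)  # [0.1, 0.05, 0.025]
```

A per-step list like this is the kind of value that is typically passed as the learning rate of an optimizer when a dynamic learning rate is wanted.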
Q: How do I modify parameters (such as the dropout value) on MindSpore?
A: When building a network, use if self.training: x = dropout(x). For inference, call network.set_train(False) before execution to disable dropout. For training, call network.set_train(True) to enable dropout.
Q: How do I view the number of model parameters?
A: You can load the checkpoint and count the parameters directly. Variables belonging to momentum and the optimizer may be included in the count, so you need to filter them out. You can refer to the following function to count the network parameters:
import numpy as np

def count_params(net):
    """Count the number of trainable parameters in the network.

    Args:
        net (mindspore.nn.Cell): MindSpore network instance.

    Returns:
        total_params (int): Total number of trainable parameters.
    """
    total_params = 0
    for param in net.trainable_params():
        total_params += int(np.prod(param.shape))
    return total_params
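The filtering step mentioned above can be sketched without any framework by working on a mapping from parameter names to shapes. The names used for optimizer state here ("moments.", "accum.", "learning_rate", "momentum", "global_step") are assumptions for illustration; inspect your own checkpoint to see which entries belong to the optimizer.

```python
from math import prod

# Assumed naming of optimizer state entries; adjust to match your checkpoint.
OPTIMIZER_PREFIXES = ("moments.", "accum.")
OPTIMIZER_NAMES = {"learning_rate", "momentum", "global_step"}


def count_checkpoint_params(shapes_by_name):
    """Sum the element counts of all entries that are not optimizer state."""
    total = 0
    for name, shape in shapes_by_name.items():
        if name in OPTIMIZER_NAMES or name.startswith(OPTIMIZER_PREFIXES):
            continue  # skip optimizer variables
        total += prod(shape)
    return total


# Hypothetical checkpoint contents: parameter name -> shape.
checkpoint_shapes = {
    "fc.weight": (10, 784),
    "fc.bias": (10,),
    "moments.fc.weight": (10, 784),
    "moments.fc.bias": (10,),
    "learning_rate": (),
    "global_step": (1,),
}
print(count_checkpoint_params(checkpoint_shapes))  # 7850
```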
Q: How do I monitor the loss during training and save the training parameters when the loss is the lowest?
A: You can refer to EarlyStopping.
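The bookkeeping behind "save when the loss is the lowest" can be sketched in plain Python. The class BestLossTracker and its patience and min_delta parameters are hypothetical names chosen for this illustration; they do not describe how EarlyStopping is implemented internally.

```python
class BestLossTracker:
    """Track the lowest loss seen so far and signal when to save or stop."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # steps without improvement before stopping
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_steps = 0

    def update(self, loss):
        """Return True if this loss is a new best, i.e. parameters should be saved."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.bad_steps = 0
            return True
        self.bad_steps += 1
        return False

    @property
    def should_stop(self):
        return self.bad_steps >= self.patience


tracker = BestLossTracker(patience=2)
saved_at = []
for step, loss in enumerate([0.9, 0.7, 0.8, 0.6, 0.65, 0.7, 0.5]):
    if tracker.update(loss):
        saved_at.append(step)  # this is where a checkpoint would be written
    if tracker.should_stop:
        break

print(saved_at, tracker.best_loss)  # [0, 1, 3] 0.6
```

In a real training script the same logic would live in a callback, with the checkpoint written wherever the step index is recorded here.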
Q: How do I obtain a feature map with the expected size when nn.Conv2d is used?
A: For the derivation of the Conv2d output shape, see the nn.Conv2d API documentation. To keep the shape unchanged, change pad_mode of Conv2d to same. Alternatively, you can calculate the pad from the Conv2d shape derivation formula. Generally, the pad is (kernel_size-1)//2 (for a stride of 1 and an odd kernel_size).
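As a quick check of the padding rule, the standard convolution output-size formula can be evaluated in a few lines. This is a minimal sketch; the helper names conv_output_size and same_pad are hypothetical, and the formula assumes symmetric padding on both sides.

```python
def conv_output_size(in_size, kernel_size, stride=1, pad=0, dilation=1):
    """Output length along one dimension for a convolution with symmetric padding."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (in_size + 2 * pad - effective_kernel) // stride + 1


def same_pad(kernel_size):
    """Padding that keeps the size unchanged for stride 1 and an odd kernel size."""
    return (kernel_size - 1) // 2


# With stride 1 and an odd kernel, (kernel_size - 1) // 2 preserves the size.
for k in (1, 3, 5, 7):
    assert conv_output_size(32, k, pad=same_pad(k)) == 32

print(conv_output_size(32, 3))                   # 30
print(conv_output_size(32, 3, pad=1))            # 32
print(conv_output_size(32, 3, stride=2, pad=1))  # 16
```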
Q: Can MindSpore be used to customize a loss function that can return multiple values?
A: After customizing the loss function, you also need to customize TrainOneStepCell. The number of sens values used for gradient calculation must be the same as the number of network outputs. For details, see the following:
net = Net()
loss_fn = MyLoss()
loss_with_net = MyWithLossCell(net, loss_fn)
train_net = MyTrainOneStepCell(loss_with_net, optim)
model = ms.train.Model(net=train_net, loss_fn=None, optimizer=None)
Q: How does MindSpore implement the early stopping function?
A: You can customize the callback method to implement the early stopping function.
Example: When the loss value decreases to a certain value, the training stops.
from mindspore.train import Callback

class EarlyStop(Callback):
    def __init__(self, control_loss=1):
        super(EarlyStop, self).__init__()
        self._control_loss = control_loss

    def step_end(self, run_context):
        cb_params = run_context.original_args()
        loss = cb_params.net_outputs
        if loss.asnumpy() < self._control_loss:
            # Stop training once the loss drops below the threshold.
            run_context.request_stop()

stop_cb = EarlyStop(control_loss=1)
model.train(epoch_size, ds_train, callbacks=[stop_cb])
Q: After a model is trained, how do I save the model output in text or npy format?
A: The network output is a Tensor. Use the asnumpy() method to convert the Tensor to a NumPy array and then save the data. For details, see the following:
out = net(x)
np.save("output.npy", out.asnumpy())
Q: Can the vgg16 model be loaded via Hub on a GPU and used for transfer learning?
A: Yes, but you need to manually modify the following two places:
# Add the **kwargs argument to the function signature:
def vgg16(num_classes=1000, args=None, phase="train", **kwargs):
# Pass **kwargs through to the Vgg constructor:
    net = Vgg(cfg['16'], num_classes=num_classes, args=args, batch_norm=args.batch_norm, phase=phase, **kwargs)
Q: How do I handle an abnormal shutdown of the cache server?
A: While the cache server is in use, system resources such as IPC shared memory and socket files are allocated. If overflow is allowed, overflow data files are also written to disk. In general, if the server is shut down normally via the cache_admin --stop command, these resources are cleaned up automatically.
However, if the cache server is shut down abnormally, for example because the cache service process is killed, first try to restart the server. If the startup fails, follow these steps to clean up the system resources manually:
1. Delete the IPC resource.
   Check for IPC shared memory residue.
   In general, the system allocates 4GB of shared memory for the caching service. The following command shows the usage of shared memory blocks in the system.
   $ ipcs -m
   ------ Shared Memory Segments --------
   key        shmid      owner      perms      bytes      nattch     status
   0x61020024 15532037   root       666        4294967296 1
   where shmid is the shared memory block ID, bytes is the size of the shared memory block, and nattch is the number of processes attached to the shared memory block. If nattch is not 0, there are still processes using the shared memory block. Before you delete the shared memory, stop all processes that use that memory block.
   Delete the IPC shared memory.
   Find the corresponding shared memory ID and delete it with the following command.
   ipcrm -m {shmid}
2. Delete socket files.
   In general, socket files are located in /tmp/mindspore/cache. Enter the folder and execute the following command to delete the socket file.
   rm cache_server_p{port_number}
   where port_number is the port number specified when the cache server was created, which defaults to 50052.
3. Delete data files that overflowed to disk.
   Enter the overflow data path specified when you enabled the cache server. In general, the default overflow path is /tmp/mindspore/cache. Find the corresponding data folders under the path and delete them one by one.
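The inspection part of the steps above can be sketched in Python: parse the ipcs -m output to see which shared memory blocks still have attached processes, and build the socket file path for a given port. The helper names parse_ipcs_segments and socket_file are hypothetical, and the sketch only reads and reports; it deliberately does not delete anything.

```python
from pathlib import Path


def parse_ipcs_segments(text):
    """Parse `ipcs -m` output into a list of dicts (shmid, owner, bytes, nattch)."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a hexadecimal key such as 0x61020024.
        if len(fields) >= 6 and fields[0].startswith("0x"):
            segments.append({
                "shmid": int(fields[1]),
                "owner": fields[2],
                "bytes": int(fields[4]),
                "nattch": int(fields[5]),
            })
    return segments


def socket_file(port_number=50052, cache_dir="/tmp/mindspore/cache"):
    """Path of the socket file for a cache server started on port_number."""
    return Path(cache_dir) / f"cache_server_p{port_number}"


sample = """\
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x61020024 15532037   root       666        4294967296 1
"""

segments = parse_ipcs_segments(sample)
# Blocks with nattch > 0 are still in use; stop those processes before deleting.
in_use = [s["shmid"] for s in segments if s["nattch"] > 0]

print(segments[0]["bytes"] // 2**30, "GiB")  # 4 GiB
print(in_use)                                # [15532037]
print(socket_file().name)                    # cache_server_p50052
```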
Q: How do I obtain middle-layer features of a VGG model?
A: Obtaining the middle-layer features of a network is not closely tied to a specific framework. For the vgg model defined in torchvision, the features field can be used to obtain the “middle-layer features”. The vgg source code of torchvision is as follows:
class VGG(nn.Module):
    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        self.features = features
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
For the vgg16 defined in the MindSpore ModelZoo, the middle-layer features can be obtained through the layers field as follows:
network = vgg16()
print(network.layers)
Q: When MindSpore is used for model training, CTCLoss takes four input parameters: inputs, labels_indices, labels_values, and sequence_length. How do I use CTCLoss for model training?
A: The dataset received by the model.train API can consist of multiple pieces of data, for example, (data1, data2, data3, …). Therefore, the dataset can contain the inputs, labels_indices, labels_values, and sequence_length information. You only need to define the dataset in the corresponding format and pass it to model.train. For details, see Data Processing API.
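When preparing such a dataset, the label sequences usually have to be converted into the labels_indices and labels_values pair. The sketch below assumes a sparse layout in which each index is a [batch, position] pair and the value at the same offset is the class ID, which is what the two names suggest; the helper to_sparse_labels is hypothetical, so check the CTCLoss API documentation for the exact format it expects.

```python
def to_sparse_labels(label_sequences):
    """Convert variable-length label sequences into (indices, values) lists.

    indices[i] is [batch_index, position] and values[i] is the class ID stored
    at that position (an assumed sparse layout, see the note above).
    """
    labels_indices = []
    labels_values = []
    for batch_idx, sequence in enumerate(label_sequences):
        for position, class_id in enumerate(sequence):
            labels_indices.append([batch_idx, position])
            labels_values.append(class_id)
    return labels_indices, labels_values


# Three samples with label sequences of different lengths.
indices, values = to_sparse_labels([[2, 5], [7], [1, 1, 3]])
print(indices)  # [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [2, 2]]
print(values)   # [2, 5, 7, 1, 1, 3]
```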
Q: What are the available recommendation or text generation networks or models provided by MindSpore?
A: Currently, recommendation models such as Wide & Deep, DeepFM, and NCF are under development. In the natural language processing (NLP) field, Bert_NEZHA is available and models such as MASS are under development. You can rebuild the network into a text generation network based on the scenario requirements. Please stay tuned for updates on the MindSpore ModelZoo.
Q: How do I use MindSpore to fit functions such as \(f(x)=a \times \sin(x)+b\)?
A: The following is based on the official MindSpore linear fitting case.
# The fitting function is: f(x)=2*sin(x)+3.
import numpy as np
import mindspore as ms
from mindspore.train import Model, LossMonitor
from mindspore import dataset as ds
from mindspore.common.initializer import Normal
from mindspore import nn

ms.set_context(mode=ms.GRAPH_MODE, device_target="CPU")

def get_data(num, w=2.0, b=3.0):
    # f(x)=w * sin(x) + b
    # f(x)=2 * sin(x) +3
    for i in range(num):
        x = np.random.uniform(-np.pi, np.pi)
        noise = np.random.normal(0, 1)
        y = w * np.sin(x) + b + noise
        yield np.array([np.sin(x)]).astype(np.float32), np.array([y]).astype(np.float32)

def create_dataset(num_data, batch_size=16, repeat_size=1):
    input_data = ds.GeneratorDataset(list(get_data(num_data)), column_names=['data','label'])
    input_data = input_data.batch(batch_size)
    input_data = input_data.repeat(repeat_size)
    return input_data

class LinearNet(nn.Cell):
    def __init__(self):
        super(LinearNet, self).__init__()
        self.fc = nn.Dense(1, 1, Normal(0.02), Normal(0.02))

    def construct(self, x):
        x = self.fc(x)
        return x

if __name__ == "__main__":
    num_data = 1600
    batch_size = 16
    repeat_size = 1
    lr = 0.005
    momentum = 0.9

    net = LinearNet()
    net_loss = nn.loss.MSELoss()
    opt = nn.Momentum(net.trainable_params(), lr, momentum)
    model = Model(net, net_loss, opt)

    ds_train = create_dataset(num_data, batch_size=batch_size, repeat_size=repeat_size)
    model.train(1, ds_train, callbacks=LossMonitor(), dataset_sink_mode=False)

    print(net.trainable_params()[0], "\n%s" % net.trainable_params()[1])
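The reason a single Dense layer is enough here is that the model is linear in the feature sin(x). As a framework-free cross-check, the sketch below generates the same kind of data and solves the one-feature least-squares problem in closed form; the helper names make_samples and least_squares_line and the fixed seed are assumptions made for this illustration, and it is not a substitute for the training script above.

```python
import math
import random


def make_samples(num, w=2.0, b=3.0, seed=0):
    """Generate (sin(x), y) pairs with y = w * sin(x) + b + Gaussian noise."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num):
        x = rng.uniform(-math.pi, math.pi)
        y = w * math.sin(x) + b + rng.gauss(0, 1)
        samples.append((math.sin(x), y))  # the input feature is sin(x), not x
    return samples


def least_squares_line(samples):
    """Closed-form least-squares fit of y = w * u + b for one feature u."""
    n = len(samples)
    mean_u = sum(u for u, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((u - mean_u) * (y - mean_y) for u, y in samples)
    var = sum((u - mean_u) ** 2 for u, _ in samples)
    w = cov / var
    return w, mean_y - w * mean_u


w_hat, b_hat = least_squares_line(make_samples(1600))
print(f"w={w_hat:.2f}, b={b_hat:.2f}")  # should be close to w=2, b=3
```

The weight and bias printed by the trained network should land near the same values, up to noise and the number of training steps.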
Q: How do I use MindSpore to fit quadratic functions such as \(f(x)=ax^2+bx+c\)?
A: The following code is adapted from the official MindSpore tutorial code.
To fit \(f(x) = ax^2 + bx + c\), modify the following items:
Dataset generation.
Network fitting.
Optimizer.
The modifications are explained in detail below:
import numpy as np
import mindspore as ms
from mindspore import dataset as ds
from mindspore.common.initializer import Normal
from mindspore import nn

# The selected optimizer does not support CPU, so the training platform is changed to GPU. This requires the GPU version of MindSpore to be installed.
ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU")

# Assume that the function to be fitted is f(x)=2x^2+3x+4. The data generation function is modified as follows:
def get_data(num, a=2.0, b=3.0, c=4.0):
    for i in range(num):
        x = np.random.uniform(-10.0, 10.0)
        noise = np.random.normal(0, 1)
        # The y value is generated by the fitting target function ax^2+bx+c.
        y = x * x * a + x * b + c + noise
        # When a*x^2+b*x+c is fitted, a and b are the weights and c is the bias. The training data for the two weights are x^2 and x, so the dataset is generated as follows:
        yield np.array([x*x, x]).astype(np.float32), np.array([y]).astype(np.float32)

def create_dataset(num_data, batch_size=16, repeat_size=1):
    input_data = ds.GeneratorDataset(list(get_data(num_data)), column_names=['data','label'])
    input_data = input_data.batch(batch_size)
    input_data = input_data.repeat(repeat_size)
    return input_data

class LinearNet(nn.Cell):
    def __init__(self):
        super(LinearNet, self).__init__()
        # The fully-connected layer now takes two input features, so the input size is changed to 2. The first Normal(0.02) randomly initializes the two weights and the second Normal(0.02) randomly initializes the bias.
        self.fc = nn.Dense(2, 1, Normal(0.02), Normal(0.02))

    def construct(self, x):
        x = self.fc(x)
        return x

if __name__ == "__main__":
    num_data = 1600
    batch_size = 16
    repeat_size = 1
    lr = 0.005
    momentum = 0.9

    net = LinearNet()
    net_loss = nn.loss.MSELoss()
    # The RMSProp optimizer, which works better for quadratic function fitting, is selected. Currently, it supports the Ascend and GPU computing platforms.
    opt = nn.RMSProp(net.trainable_params(), learning_rate=0.1)
    model = ms.train.Model(net, net_loss, opt)

    ds_train = create_dataset(num_data, batch_size=batch_size, repeat_size=repeat_size)
    model.train(1, ds_train, callbacks=ms.train.LossMonitor(), dataset_sink_mode=False)

    print(net.trainable_params()[0], "\n%s" % net.trainable_params()[1])
Q: How do I execute a single ut case in mindspore/tests?
A: ut cases are usually based on the debug version of the MindSpore package, which is not provided on the official website. You can run sh build.sh to compile from the source code and then run the pytest command. Compilation in debug mode does not depend on the backend. Compile with the sh build.sh -t on option. For details about how to execute cases, see the tests/runtest.sh script.
Q: For Ascend users, how do I get more detailed logs to help locate the problem when a run task error is reported while executing a case?
A: Use the msnpureport tool to set the on-device log level. The tool is stored in /usr/local/Ascend/latest/driver/tools/msnpureport.
Global-level:
/usr/local/Ascend/latest/driver/tools/msnpureport -g info
Module-level:
/usr/local/Ascend/latest/driver/tools/msnpureport -m SLOG:error
Event-level:
/usr/local/Ascend/latest/driver/tools/msnpureport -e disable/enable
Multi-device ID-level:
/usr/local/Ascend/latest/driver/tools/msnpureport -d 1 -g warning
Assume that the value range of deviceID is [0, 7], that devices 0 to 3 are on one OS, and that devices 4 to 7 are on another. Devices 0 to 3 share one log configuration file, and devices 4 to 7 share another. Changing the log level of any one of devices 0 to 3 therefore changes it for the other devices in that group as well. The same rule applies to devices 4 to 7.
After the Driver package is installed (assuming that the installation path is /usr/local/HiAI; on Windows, the executable msnpureport.exe is in the C:\ProgramFiles\Huawei\Ascend\Driver\tools\ directory), if the user runs the command directly in the /home/shihangbo/ directory, the device-side logs are exported to the current directory and stored in a folder named with a timestamp.
Q: What can I do when the error message Out of Memory!!! total[3212254720] (dynamic[0] memory poll[524288000]) malloc[32611480064] failed! is displayed during training on the Ascend platform?
A: This is an out-of-memory problem caused by excessive memory usage. There are two possible causes:
The value of batch_size is too large. Solution: Reduce the value of batch_size.
An abnormally large parameter is introduced. For example, a single piece of data has the shape [640,1024,80,81] and the data type float32, so its size is over 15G. When two pieces of data of similar size are added together, the two inputs and the result together occupy over 3*15G, which easily causes Out of Memory. Solution: Check the shape of the parameter. If it is abnormally large, reduce the shape.
If the above operations cannot solve the problem, you can raise the problem on the official forum, where dedicated technical personnel can help.
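The size quoted in the example can be verified with a short calculation. This is a minimal sketch; the helper name tensor_bytes is hypothetical, and 4 bytes per element corresponds to float32.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed to store a dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


size = tensor_bytes([640, 1024, 80, 81])  # float32: 4 bytes per element
print(size)                               # 16986931200
print(round(size / 2**30, 1))             # 15.8 (GiB)

# Adding two such tensors keeps both inputs and the result in memory.
print(round(3 * size / 2**30, 1))         # 47.5 (GiB)
```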
Q: How do I change hyperparameters for calculating loss values during neural network training?
A: Sorry, this function is not available yet. You can find the optimal hyperparameters by training, redefining an optimizer, and then training.
Q: What should I do when the error error while loading shared libraries: libge_compiler.so: cannot open shared object file: No such file or directory is displayed while the application is running?
A: When installing the Ascend 310 AI Processor software packages that MindSpore depends on, install the full-featured toolkit version of the CANN package instead of the nnrt version.
Q: Why does set_ps_context(enable_ps=True) in model_zoo/official/cv/ResNet/train.py in the MindSpore code have to be set before init?
A: In MindSpore Ascend mode, if init is called first, a card is allocated to every process. In parameter server training mode, however, the server does not need a card, so the worker and the server end up using the same card, which results in the error: Ascend kernel runtime initialization failed.
Q: What should I do if the memory continues to increase when resnet50 training is being performed on the CPU ARM platform?
A: When resnet50 training is performed on an ARM CPU, some operators are implemented on top of the oneDNN library, which uses the libgomp library for multi-threaded parallelism. libgomp currently has a problem where memory consumption keeps growing when different parallel regions are configured with different numbers of threads. The growth can be controlled by configuring a uniform number of threads globally. For overall performance, it is recommended to set this number to 1/4 of the number of physical cores, for example export OMP_NUM_THREADS=32.
Q: Why is an error reported that the stream exceeds the limit when executing a model on the Ascend platform?
A: Stream represents an operation queue. Tasks on the same stream are executed in sequence, and different streams can be executed in parallel. Various operations in the network generate tasks and are assigned to streams to control the concurrent mode of task execution. Ascend platform has a limit on the number of tasks on the same stream, and tasks that exceed the limit will be assigned to new streams. The multiple parallel methods of MindSpore will also be assigned to new streams, such as parallel communication operators. Therefore, when the number of assigned streams exceeds the resource limit of the Ascend platform, an error will be reported. Reference solution:
Reduce the size of the network model
Reduce the use of communication operators in the network
Reduce conditional control statements in the network
Q: On the Ascend platform, if the error “Ascend error occurred, error message:” is reported in the log, followed by an error code such as “E40011”, how do I find the cause of the error code?
A: When “Ascend error occurred, error message:” appears, a module of Ascend CANN has encountered an exception and reported an error log.
The error code is followed by an error message. For more detailed possible causes and solutions for this exception, refer to the “error code troubleshooting” section of the documentation for the corresponding Ascend version, such as CANN Community 6.0.RC1.alpha002 Error Code troubleshooting.
Q: When the third-party component gensim is used to train the NLP network, the error “ValueError” may be reported. What can I do?
A: The following error information is displayed:
>>> import gensim
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/miniconda3/envs/ci39_cj/lib/python3.9/site-packages/gensim/__init__.py", line 11, in <module>
from gensim import parsing, corpora, matutils, interfaces, models, similarities, utils # noqa:F401
File "/home/miniconda3/envs/ci39_cj/lib/python3.9/site-packages/gensim/corpora/__init__.py", line 6, in <module>
from .indexedcorpus import IndexedCorpus # noqa:F401 must appear before the other classes
File "/home/miniconda3/envs/ci39_cj/lib/python3.9/site-packages/gensim/corpora/indexedcorpus.py", line 14, in <module>
from gensim import interfaces, utils
File "/home/miniconda3/envs/ci39_cj/lib/python3.9/site-packages/gensim/interfaces.py", line 19, in <module>
from gensim import utils, matutils
File "/home/miniconda3/envs/ci39_cj/lib/python3.9/site-packages/gensim/matutils.py", line 1024, in <module>
from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
File "gensim/_matutils.pyx", line 1, in init gensim._matutils
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
For details about the error cause, see the gensim or numpy official website.
Solutions:
Method 1: Reinstall NumPy and gensim by running the following command: pip uninstall gensim numpy -y && pip install numpy==1.18.5 gensim
Method 2: If the problem persists, delete the cache files of the wheel installation package and then perform method 1. (The cache directory of the wheel installation package is ~/.cache/pip/wheels)
Q: What should I do if matplotlib.pyplot.show() or plt.show cannot be executed while running the documentation sample code?
A: First, confirm whether matplotlib is installed. If it is not, run pip install matplotlib on the command line to install it.
Second, because matplotlib.pyplot.show() displays data graphically, the system running the code must support graphical display. If it does not, comment out the line that shows the graph. This does not affect the results of the rest of the code.
Q: How do I handle failures when running the online runtime provided in the documentation?
A: Confirm that the following preparations have been made.
First, log in to ModelArts with your HUAWEI CLOUD account.
Second, check which hardware environment (Ascend, GPU, or CPU) is indicated by the tags in the tutorial document and configured in the example code. The hardware environment used by default after login is CPU, so you need to switch to the Ascend or GPU environment manually.
Finally, confirm that the current Kernel is MindSpore.
After completing the above steps, you can run the tutorial.
For the specific operation process, please refer to Based on ModelArts Online Experience MindSpore.
Q: Why is no error reported when the result of a division is used in GRAPH mode, while an error is reported in PYNATIVE mode?
A: In GRAPH mode, because graph compilation is used, the data type of an operator's output is determined at the graph compilation stage.
For example, the following code is executed in GRAPH mode. The input data is of type int, so according to graph compilation the output is also of type int.
import mindspore as ms
from mindspore import nn

ms.set_context(mode=ms.GRAPH_MODE, device_target="CPU")

class MyTest(nn.Cell):
    def __init__(self):
        super(MyTest, self).__init__()

    def construct(self, x, y):
        return x / y

x = 16
y = 4
net = MyTest()
output = net(x, y)
print(output, type(output))
output:
4 <class 'int'>
Now change the execution mode from GRAPH_MODE to PYNATIVE_MODE. PyNative mode follows Python syntax, in which the result of any division with / is of type float, so the execution result is as follows.
4.0 <class 'float'>
Therefore, when a subsequent operator explicitly requires int, it is recommended to use Python's floor division operator //.
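The Python behavior that PyNative mode follows can be checked in plain Python, without MindSpore. This short sketch shows the difference between true division and floor division for integer operands.

```python
x, y = 16, 4

true_div = x / y    # true division always returns a float
floor_div = x // y  # floor division of two ints returns an int

print(true_div, type(true_div))    # 4.0 <class 'float'>
print(floor_div, type(floor_div))  # 4 <class 'int'>

# Floor division rounds toward negative infinity, not toward zero.
print(-7 // 2)  # -4
```

Note the last line: if the operands can be negative, floor division and truncation give different results.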
Q: Why does running a script on GPU hang for a long time on version 1.8?
A: To be compatible with more GPU architectures, NVCC first compiles CUDA files into PTX files, which are then compiled into binary executable files the first time they are used. This compilation takes time.
Compared with the previous version, version 1.8 adds many CUDA operators, which increases the time of this compilation step. (The time varies with the device. For example, the first compilation on V100 takes about 5 minutes.)
This compilation generates a cache file (on Ubuntu, for example, the cache file is located in ~/.nv/computecache), and the cache file is loaded directly in subsequent executions.
Therefore, the script hangs for several minutes the first time it is used, and subsequent runs take a normal amount of time.
Pre-compilation and optimization are planned for subsequent versions.
Q: What can I do when the error message MemoryError: std::bad_alloc is reported during the execution of an operator?
A: The cause of this error is that the operator parameters are not configured correctly, so the memory requested by the operator exceeds the system memory limit and the system fails to allocate it. The following uses mindspore.ops.UniformCandidateSampler as an example.
UniformCandidateSampler samples a set of classes using a uniform distribution. The shape of the output tensor is (num_sampled,), where num_sampled is set by the user.
When the user sets num_sampled=int64.max, the memory requested for the output tensor exceeds the system memory limit, causing bad_alloc.
Therefore, set the operator parameters appropriately to avoid such errors.
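One way to catch such a configuration mistake early is to estimate the requested memory before calling the operator. The following is a minimal sketch under stated assumptions: the helper check_output_fits is hypothetical, 4 bytes per element is an assumed element size, and the 16 GiB limit is an arbitrary example value rather than anything MindSpore reports.

```python
INT64_MAX = 2**63 - 1


def check_output_fits(num_sampled, memory_limit_bytes, bytes_per_element=4):
    """Raise ValueError if a (num_sampled,) output would exceed memory_limit_bytes."""
    requested = num_sampled * bytes_per_element
    if requested > memory_limit_bytes:
        raise ValueError(
            f"num_sampled={num_sampled} needs {requested} bytes, "
            f"but the limit is {memory_limit_bytes} bytes"
        )
    return requested


limit = 16 * 2**30  # assumed example: 16 GiB of system memory

print(check_output_fits(1000, limit))  # 4000

try:
    check_output_fits(INT64_MAX, limit)
except ValueError as err:
    print("rejected:", err)
```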
Q: How do I understand the “Ascend Error Message” in the error message?
A: The “Ascend Error Message” is the fault information thrown when an error occurs during CANN execution after MindSpore calls a CANN (Ascend Heterogeneous Computing Architecture) interface. It contains information such as the error code and the error description. For example:
Traceback (most recent call last):
File "train.py", line 292, in <module>
train_net()
File "/home/resnet_csj2/scripts/train_parallel0/src/model_utils/moxing_adapter.py", line 104, in wrapped_func
run_func(*args, **kwargs)
File "train.py", line 227, in train_net
set_parameter()
File "train.py", line 114, in set_parameter
init()
File "/home/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/communication/management.py", line 149, in init
init_hccl()
RuntimeError: Ascend kernel runtime initialization failed.
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running. //EJ0001 is the error code, followed by the description and cause of the error. In this example, the cause is that distributed training on the same 8 nodes was started several times, causing process conflicts
Solution: Wait for 10s after killing the last training process and try again. //The message printed here gives the solution to the problem. In this example, it suggests that the user clean up the process
TraceBack (most recent call last): //The information printed here is the stack information used by developers to locate the problem. Users generally do not need to pay attention to it
tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:233]
In addition, CANN may throw some Inner Errors, for example with the error code “EI9999: Inner Error”. If you cannot find a description of the case on the MindSpore official website or forum, you can ask for help in the community by raising an issue.
Q: What can I do when the error message python: relocation error: /the-path-of-cuda/libcublas.so.11: symbol xxxxx version libcublasLt.so.11 not defined in file libcublasLt.so with link time reference is reported during the execution of an operator?
A: This is a known problem in 2.0.0alpha. Similar errors may occur when multiple CUDA versions are installed in the environment. The most conservative solution is to create a soft link at /usr/local/cuda that points to the CUDA version to be used and to rename the other CUDA directories. This issue has been fixed in the master branch, and the fix will be included in the next release.