Implementing Efficient Computation of Matmul for Accelerated MindSpore Network Inference
1 System Environment
Hardware environment: Ascend/GPU/CPU
MindSpore version: any version
Execution mode (PyNative/Graph): any mode
Python version: 3.7/3.8/3.9
OS platform: any OS
2 Error Information
2.1 Error Description
When running a network built with MindSpore, inference takes noticeably long and needs optimization.
2.2 Error Message
The Profiler analysis shows that most of the time is spent on the MatMul matrix multiplication in the fully-connected layer.
2.3 Script Code
No complete script is available; the test code below is constructed from the description.
3 Root Cause Analysis
According to the profiling results, the primary cause of the slow inference is the MatMul matrix multiplication. For computation-intensive operators, float32 arithmetic takes longer than float16. To save time, you can convert the data to float16 before the computation, and then convert the result back to float32 once the computation is complete.
The following script times the same MatMul in float32 and in float16 to compare the running times.
import time
import numpy as np
import mindspore as ms
import mindspore.nn as nn
from mindspore import ops

ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU")

class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.matmul = ops.MatMul(transpose_b=True)

    def construct(self, x, y):
        return self.matmul(x, y)

x = ms.Tensor(np.arange(10240 * 10240).reshape(10240, 10240).astype(np.float32))
y = ms.Tensor(np.arange(10240 * 10240).reshape(10240, 10240).astype(np.float32))
net = Net()

# Time the float32 computation
a = time.time()
output = net(x, y)
time32 = time.time() - a
print(output.shape)
print(time32)

# Convert the inputs to float16 before the computation
x2 = ms.Tensor(x, dtype=ms.float16)
y2 = ms.Tensor(y, dtype=ms.float16)
net2 = Net()

# Time the float16 computation
b = time.time()
output2 = net2(x2, y2)
# Convert the result back to float32 once the computation is complete
output2 = output2.astype(ms.float32)
time16 = time.time() - b
print(output2.shape)
print(time16)
The output shows that using float16 is several times faster than using float32.
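The cast-down, compute, cast-back pattern is framework-independent. The following NumPy sketch (illustrative only, not part of the original test) shows the same idea and the precision trade-off it entails:

```python
import numpy as np

rng = np.random.default_rng(0)
# Keep values small so they stay well inside float16's representable range
a = rng.standard_normal((256, 64)).astype(np.float32)
b = rng.standard_normal((64, 256)).astype(np.float32)

# Reference result computed entirely in float32
ref = a @ b

# Cast down, compute, cast back
out = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

# The result keeps the float32 dtype, at the cost of some precision
rel_err = np.abs(out - ref).max() / np.abs(ref).max()
print(out.dtype)           # float32
print(rel_err < 0.05)      # True: small relative error for well-scaled inputs
```

For well-scaled inputs the relative error is small; scaling matters because float16 has a much narrower range and coarser precision than float32.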

4 Solution
Convert the precision type to float16 before the computation, and convert the result back to float32 once the computation is complete. This accelerates the computation.
The profiling results point to the fully-connected layer. Timing tests at this layer show that the computation is roughly 50 times faster after the data-type conversion.
import time
import numpy as np
import mindspore as ms
import mindspore.nn as nn

ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU")

x = ms.Tensor(np.arange(10240 * 10240).reshape(10240, 10240).astype(np.float32))
net = nn.Dense(10240, 60)

# Time the float32 computation
a = time.time()
output = net(x)
time32 = time.time() - a
print(output.shape)
print(time32)

# Run the same layer in float16: to_float casts the layer's
# internal computation, and the input is converted to match
net2 = nn.Dense(10240, 60).to_float(ms.float16)
x2 = ms.Tensor(x, dtype=ms.float16)

# Time the float16 computation
b = time.time()
output2 = net2(x2)
# Convert the result back to float32 once the computation is complete
output2 = output2.astype(ms.float32)
time16 = time.time() - b
print(output2.shape)
print(time16)

Acceleration: 2.15 / 0.04 ≈ 54, i.e., roughly 50 times faster.
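One caveat worth noting: float16's largest finite value is 65504, while the demo inputs built with np.arange reach about 1e8, so casting such values to float16 overflows to infinity. In practice, scale inputs into float16's range before converting, or keep range-sensitive tensors in float32. A quick NumPy check (values chosen for illustration):

```python
import numpy as np

# float16's largest representable finite value
print(np.finfo(np.float16).max)        # 65504.0

# Values beyond that overflow to infinity when cast down
big = np.float32(10240 * 10240 - 1)    # largest value in the demo inputs
print(np.float16(big))                 # inf

# Well-scaled values survive the round trip with only a small rounding error
small = np.float32(1.2345)
print(np.float32(np.float16(small)))
```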
For details, see the official documents.
https://www.mindspore.cn/docs/en/r2.0.0-alpha/api_python/mindspore.html
https://www.mindspore.cn/docs/en/r2.0.0-alpha/api_python/nn/mindspore.nn.Dense.html