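Optimizing Compilation Performance of Static Graph Networks

Overview

When a deep learning network is trained or used for inference, its end-to-end time consists mainly of compilation time and execution time. In inference scenarios in particular, compilation time is often far greater than execution time, so optimizing compilation performance matters a great deal for how well a network deploys in practice. In MindSpore static graph mode, compilation performance can be improved in some scenarios by changing how the network is written (substituting a semantically equivalent construct) or by setting compile options that change the compilation mechanism.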
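To make the split between compilation time and execution time concrete, the following is a minimal sketch of one way to estimate it. It assumes that the first call of a graph-compiled function includes compilation and that later calls with the same inputs reuse the compiled graph. The names `time_first_and_later_calls` and `fake_graph_fn` are hypothetical and are not part of MindSpore; `fake_graph_fn` only simulates a compiled function so that the sketch runs without any framework installed.

```python
import time


def time_first_and_later_calls(fn, *args, repeats=3):
    """Return (first_call_seconds, best_later_call_seconds) for fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        later.append(time.perf_counter() - start)
    return first, min(later)


# Hypothetical stand-in for a compiled function: the first call pays a
# one-off "compilation" cost, later calls pay only the "execution" cost.
_compiled = {}


def fake_graph_fn(x):
    if "graph" not in _compiled:
        time.sleep(0.05)  # simulated compilation
        _compiled["graph"] = True
    return x + 1  # simulated execution


first, later = time_first_and_later_calls(fake_graph_fn, 1)
print(f"first call: {first:.4f}s, later call: {later:.4f}s, "
      f"estimated compile time: {first - later:.4f}s")
```

In a real measurement, `fake_graph_fn` would be replaced by the graph-mode function under test. The difference between the first and later calls is only an estimate, since the first call may also include other one-off costs such as device initialization.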

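Using HyperMap to Optimize Compilation Performance

HyperMap is a special class. Constructing a HyperMap object requires a mapping function f, and calling the object requires n argument sequences for f; see HyperMap for more usage details. The mapping function f must be of type MultitypeFuncGraph; see MultitypeFuncGraph. When a for loop processes list elements in bulk, replacing it with the semantically equivalent HyperMap can improve compilation performance.

The following sample replaces a for loop with HyperMap to improve compilation performance: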

[1]:

import time
from mindspore import ms_function, ops
from mindspore.ops import MultitypeFuncGraph, HyperMap

# Register a scalar-addition overload for (Number, Number) inputs.
add = MultitypeFuncGraph('add')
@add.register("Number", "Number")
def add_scalar(x, y):
    return ops.scalar_add(x, y)

# HyperMap applies `add` element-wise across the input sequences.
add_map = HyperMap(add)
list1 = list(range(200))
list2 = list(range(200))
@ms_function
def hyper_map_net():
    output = add_map(list1, list2)
    return output

# The first call triggers compilation, so the measured time includes it.
start_time = time.time()
output = hyper_map_net()
end_time = time.time()
print("hyper map cost time:", end_time - start_time)

# Equivalent computation written as an explicit for loop.
@ms_function
def for_loop_net():
    out = []
    for i in range(200):
        out.append(ops.scalar_add(i, i))
    return out

start_time = time.time()
for_loop_net()
end_time = time.time()
print("for loop cost time:", end_time - start_time)

hyper map cost time: 0.1894233226776123
for loop cost time: 1.2634551525115967
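The semantic equivalence that the sample relies on can be sketched in plain Python. This is only an illustration of the element-wise mapping behaviour described above, not MindSpore's implementation: `hyper_map_like` is a hypothetical helper, it handles only flat sequences, and it says nothing about compilation cost.

```python
from operator import add


def hyper_map_like(f, *sequences):
    """Apply f element-wise across equal-length sequences."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all input sequences must have the same length")
    return [f(*items) for items in zip(*sequences)]


list1 = list(range(200))
list2 = list(range(200))

# Mapped form: one call over both sequences.
mapped = hyper_map_like(add, list1, list2)

# Loop form: the same computation, one element at a time.
looped = []
for i in range(200):
    looped.append(add(list1[i], list2[i]))

assert mapped == looped
print(mapped[:5])  # [0, 2, 4, 6, 8]
```

Because both forms produce the same result, the choice between them in graph mode can be made purely on compilation cost.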

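Using the Select Operator to Optimize Compilation Performance

Networks frequently use if statements. When the condition of an if statement is a variable condition, each if statement produces additional subgraphs; see if statement for usage. In static graph mode, the more subgraphs there are, the longer compilation takes, so in some scenarios replacing an if statement with the equivalent Select operator improves compilation performance.

Note that replacing an if statement with the Select operator affects the network's execution performance. On the one hand, Select executes both the true branch and the false branch, whereas an if statement executes only one of them, which makes the if version faster to run. On the other hand, the Select operator performs better than the control-flow operators generated by an if statement, which makes the if version slower to run. The net effect on execution performance depends on the actual case. As a rule of thumb, use Select when the branches contain few operators and an if statement when they contain many.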
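The first half of this trade-off, that a select evaluates both branches while an if evaluates only one, can be illustrated with a small plain-Python sketch. All names here (`with_if`, `with_select`, `select`, the `calls` counter) are hypothetical and only model the evaluation order; they do not reflect how MindSpore executes either construct.

```python
calls = {"true_branch": 0, "false_branch": 0}


def true_branch(x, y):
    calls["true_branch"] += 1
    return x - y


def false_branch(x, y):
    calls["false_branch"] += 1
    return x + y


def with_if(x, y):
    # Only the taken branch is evaluated.
    if x < y:
        return true_branch(x, y)
    return false_branch(x, y)


def select(cond, a, b):
    # Both a and b were already computed by the caller; pick one.
    return a if cond else b


def with_select(x, y):
    return select(x < y, true_branch(x, y), false_branch(x, y))


result_if = with_if(0, 1)
calls_after_if = dict(calls)
result_select = with_select(0, 1)
calls_after_select = dict(calls)

print(result_if, calls_after_if)          # -1 {'true_branch': 1, 'false_branch': 0}
print(result_select, calls_after_select)  # -1 {'true_branch': 2, 'false_branch': 1}
```

Both versions return the same value, but the select version also pays for the branch that is not taken, which is why the number of operators per branch matters.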

The following sample replaces an if statement with the Select operator to improve compilation performance:

[2]:

import time
from mindspore import ms_function, Tensor, ops

# Version with a variable-condition if statement inside the loop.
@ms_function
def if_net(x, y):
    out = 0
    for _ in range(100):
        if x < y:
            x = x - y
        else:
            x = x + y
        out = out + x
    return out

# The first call triggers compilation, so the measured time includes it.
start_time = time.time()
out = if_net(Tensor([0]), Tensor([1]))
end_time = time.time()
print("if net cost time:", end_time - start_time)

# Version that computes both branches and picks one with Select.
@ms_function
def select_net(x, y):
    out = x
    for _ in range(100):
        cond = x < y
        x = ops.Select()(cond, x - y, x + y)
        out = out + x
    return out

start_time = time.time()
out = select_net(Tensor([0]), Tensor([1]))
end_time = time.time()
print("select net cost time:", end_time - start_time)

if net cost time: 1.1603329181671143
select net cost time: 0.483151912689209

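Using the Compilation Cache to Optimize Compilation Performance

During training or inference, if a network's structure has not changed at all, the compilation cache can shorten compilation time. The compilation cache stores the intermediate files produced while compiling the network model. When the model is unchanged, the generated intermediate files are identical, so the files produced by the previous compilation can be reused. For details, see the enable_compile_cache option in the context settings.

The following sample enables the compilation cache to improve compilation performance: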

[3]:

import time
from mindspore import Tensor, dtype, ms_function, set_context

@ms_function
def func(input_x, input_y):
    output = input_x
    for _ in range(200):
        output = input_x + input_x * input_y + output
    return output

# Disable the compilation cache so that every run compiles from scratch.
set_context(enable_compile_cache=False)
x = Tensor([1], dtype.float32)
y = Tensor([2], dtype.float32)
start_time = time.time()
out = func(x, y)
end_time = time.time()
print("Disable compile_cache cost time:", end_time - start_time)

Disable compile_cache cost time: 0.5485098361968994

The sample above runs with the compilation cache disabled. Running it twice gives the following first and second timings (actual timings depend on the hardware environment; the figures below are for reference only):

Disable compile_cache cost time: 0.5485098361968994

Disable compile_cache cost time: 0.4614279270172119

As shown, with the compilation cache disabled, the first and second runs take roughly the same time.

[4]:

import time
from mindspore import Tensor, dtype, ms_function, set_context

@ms_function
def func(input_x, input_y):
    output = input_x
    for _ in range(200):
        output = input_x + input_x * input_y + output
    return output

# Enable the compilation cache and store its files under my_compile_cache.
set_context(enable_compile_cache=True, compile_cache_path="my_compile_cache")
x = Tensor([1], dtype.float32)
y = Tensor([2], dtype.float32)
start_time = time.time()
out = func(x, y)
end_time = time.time()
print("Enable compile_cache cost time:", end_time - start_time)

Enable compile_cache cost time: 0.6357541084289551

The sample above runs with the compilation cache enabled. Running it twice gives the following first and second timings (actual timings depend on the hardware environment; the figures below are for reference only):

Enable compile_cache cost time: 0.6357541084289551

Enable compile_cache cost time: 0.09379792213439941

As shown, with the compilation cache enabled, the second run takes only about 1/7 of the time of the first run.
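The general idea behind a compilation cache, reusing a stored artifact when the input is unchanged, can be sketched in plain Python. This is a toy model under stated assumptions and not MindSpore's actual mechanism: `compile_with_cache` and `toy_compile` are hypothetical names, the cache key is simply a hash of the source text, and the "artifact" is a token list standing in for real intermediate files.

```python
import hashlib
import os
import pickle
import tempfile


def compile_with_cache(source, cache_dir, compile_fn):
    """Return (artifact, cache_hit); reuse a stored artifact if source is unchanged."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), True
    artifact = compile_fn(source)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    return artifact, False


def toy_compile(source):
    # Hypothetical "compiler": tokenizes the source. Stands in for slow, real work.
    return source.split()


cache_dir = tempfile.mkdtemp(prefix="toy_compile_cache_")
src = "output = input_x + input_x * input_y + output"

_, hit_first = compile_with_cache(src, cache_dir, toy_compile)
_, hit_second = compile_with_cache(src, cache_dir, toy_compile)
_, hit_changed = compile_with_cache(src + " + 1", cache_dir, toy_compile)
print(hit_first, hit_second, hit_changed)  # False True False
```

The third call misses the cache because the source changed, which mirrors the condition stated above: the cache helps only when the network is unchanged between runs.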

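Using vmap to Optimize Compilation Performance

MindSpore currently supports the vmap feature. When processing batch data whose items have no dependencies on one another, and the operators involved support vmap, vmap can replace a for loop over the batch to improve compilation performance. See vmap for a detailed introduction. Note that vmap improves not only compilation performance but also execution performance.

The following sample replaces a for loop over batch data with vmap to improve compilation performance: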

[5]:

import time
import numpy as np
from mindspore import ms_function, ops, Tensor

def hswish_func(x):
    return ops.HSwish()(x)

# Baseline: loop over the batch dimension and stack the per-row results.
@ms_function
def manually_batched(xs):
    output = []
    for i in range(xs.shape[0]):
        output.append(hswish_func(xs[i]))
    return ops.stack(output)

# Random input: a batch of 100 rows with 2 elements each.
shape = (100, 2)
prop = 100
x_np = (np.random.randn(*shape) * prop).astype(np.float32)
x = Tensor(x_np)
x = ops.sub(x, 0)

# vmap maps hswish_func over axis 0 of x in a single call.
start_time = time.time()
output_vmap = ops.vmap(hswish_func, in_axes=(0,))(x)
end_time = time.time()
print("vmap cost time:", end_time - start_time)

start_time = time.time()
output_manually = manually_batched(x)
end_time = time.time()
print("for loop cost time:", end_time - start_time)

vmap cost time: 0.05766916275024414
for loop cost time: 1.9284062385559082

The sample above effectively processes a batch of 100 groups of Tensor data. Using vmap is more than 30 times faster than using a for loop.
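The precondition for this replacement is that the items in the batch do not depend on one another. A plain-Python sketch of that property follows; it assumes the standard hard-swish definition x * relu6(x + 3) / 6 as a scalar reference, and the helpers `per_row` and `batched` are hypothetical names that only model the two processing orders, not how vmap is implemented.

```python
def hswish(x):
    """Reference hard-swish for a scalar: x * relu6(x + 3) / 6."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


def per_row(xs):
    # Explicit loop over the batch dimension, one row at a time.
    return [[hswish(v) for v in row] for row in xs]


def batched(xs):
    # Flatten the batch, apply the function over all elements, then reshape.
    width = len(xs[0])
    flat = [hswish(v) for row in xs for v in row]
    return [flat[i:i + width] for i in range(0, len(flat), width)]


xs = [[-4.0, -1.0], [0.0, 1.0], [3.0, 5.0]]

# Each row's result depends only on that row, so both orders agree.
assert per_row(xs) == batched(xs)
print(batched(xs))
```

If a row's result depended on a previous row (for example, a running sum across the batch), the two orders would no longer be interchangeable and the loop could not be replaced this way.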