Integrating ATB Operators into CustomOpBuilder via AtbOpRunner

Overview

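The Ascend Transformer Boost (ATB) operator acceleration library is an operator library built on Huawei Ascend AI processors and designed specifically for training and inference of Transformer models.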

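When you need an operator from the ATB acceleration library and MindSpore does not provide a corresponding operator interface, you can integrate it quickly as a custom operator.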

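For custom operators based on CustomOpBuilder, MindSpore provides PyboostRunner to help users integrate custom operators in dynamic graph mode. For ATB operators, MindSpore additionally provides AtbOpRunner, which encapsulates the ATB operator invocation flow together with the multi-stage pipeline of dynamic graph execution.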

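A complete ATB operator invocation requires the following steps: constructing the Param, creating the Operation and Context, setting the variantPack (the operator's input and output tensors), calling Setup, calling Execute, and destroying the Context and Operation. For a given operator, however, the Operation depends only on the operator attributes (Param) and the Context depends only on the stream, and both can be reused. MindSpore therefore keeps these data structures in a cache to avoid the unnecessary time cost of creating and destroying them repeatedly.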
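The reuse idea can be illustrated with a small conceptual model in Python. This is only a sketch of the caching principle, not MindSpore's implementation: the `Param`, `Operation`, `Context`, and `OpCache` classes below are hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Param:
    """Hypothetical stand-in for an ATB Param struct."""
    activation_type: str
    dim: int = -1


class Operation:
    """Hypothetical stand-in for an ATB Operation; counts constructions."""
    created = 0

    def __init__(self, param: Param) -> None:
        Operation.created += 1
        self.param = param


class Context:
    """Hypothetical stand-in for an ATB Context bound to one stream."""

    def __init__(self, stream: int) -> None:
        self.stream = stream


class OpCache:
    """Reuses Operations per Param and Contexts per stream."""

    def __init__(self) -> None:
        self._operations: Dict[Param, Operation] = {}
        self._contexts: Dict[int, Context] = {}

    def get_operation(self, param: Param) -> Operation:
        # The dict hashes the frozen dataclass over all of its fields.
        if param not in self._operations:
            self._operations[param] = Operation(param)
        return self._operations[param]

    def get_context(self, stream: int) -> Context:
        if stream not in self._contexts:
            self._contexts[stream] = Context(stream)
        return self._contexts[stream]


cache = OpCache()
op_a = cache.get_operation(Param("swiglu_forward", -1))
op_b = cache.get_operation(Param("swiglu_forward", -1))  # same Param: reused
op_c = cache.get_operation(Param("swiglu_forward", 0))   # new Param: created
print(op_a is op_b, op_a is op_c, Operation.created)
```

Only two Operation objects are constructed for the three requests, because the first two requests share an identical Param.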

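When integrating an ATB operator with the AtbOpRunner class, you only need to provide a hash function for the corresponding Param (used as the key for caching the Operation), call the Init interface to initialize (that is, construct the Operation), and then call the Run interface to execute the ATB operator. Alternatively, call the RunAtbOp function to do everything in one step (it calls Init and Run internally).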

This guide uses SwiGLU as an example to demonstrate how to integrate an ATB operator. For the complete code, see the code repository.

Installing the ATB Acceleration Library

For installation steps, see the ATB installation tutorial.

Because MindSpore is built with the "ABI=0" standard, the "ABI=0" setting must also be specified when sourcing ATB's set_env.sh script, for example:

source /usr/local/Ascend/nnal/atb/set_env.sh --cxx_abi=0 &> /dev/null

Integrating the SwiGLU Operator

Here ms::pynative::RunAtbOp is used to integrate the operator, and the function interface is invoked through ms::pynative::PyboostRunner::Call:

#include "ms_extension/api.h"

namespace atb {
template <>
struct HashOpParam<atb::infer::ActivationParam> {
  void operator()(const atb::infer::ActivationParam &param) const {
    add_param_to_buf("activationType", param.activationType);
    add_param_to_buf("scale", param.scale);
    add_param_to_buf("dim", param.dim);
    add_param_to_buf("geluMode", param.geluMode);
  }
};
}  // namespace atb

ms::Tensor InferSwigluForward(const ms::Tensor &x, int32_t dim) {
  ShapeVector out_tensor_shape(x.shape());
  int64_t split_dim = dim;
  if (split_dim < 0) {
    split_dim += out_tensor_shape.size();
  }
  const int64_t split_num = 2;
  out_tensor_shape[split_dim] /= split_num;
  return ms::Tensor(x.data_type(), out_tensor_shape);
}

ms::Tensor npu_swiglu(const ms::Tensor &x, int32_t dim) {
  auto y = InferSwigluForward(x, dim);

  atb::infer::ActivationParam param;
  param.activationType = atb::infer::ActivationType::ACTIVATION_SWIGLU_FORWARD;
  param.dim = dim;

  ms::pynative::RunAtbOp("SwiGLU", param, {x}, {y});
  return y;
}

auto pyboost_npu_swiglu(const ms::Tensor &x, int32_t dim) {
  return ms::pynative::PyboostRunner::Call<1>(npu_swiglu, x, dim);
}

PYBIND11_MODULE(MS_EXTENSION_NAME, m) {
  m.def("swiglu", &pyboost_npu_swiglu, "swiglu realization", pybind11::arg("x"), pybind11::arg("dim") = -1);
}

1. Provide a hash function for the Param

namespace atb {
template <>
struct HashOpParam<atb::infer::ActivationParam> {
  void operator()(const atb::infer::ActivationParam &param) const {
    add_param_to_buf("activationType", param.activationType);
    add_param_to_buf("scale", param.scale);
    add_param_to_buf("dim", param.dim);
    add_param_to_buf("geluMode", param.geluMode);
  }
};
}  // namespace atb

According to the ATB acceleration library API documentation, the ATB SwiGLU operator takes its parameters as atb::infer::ActivationParam.

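The hash function is defined as operator() of the HashOpParam class template. Specialize this template for the actual Param type, and place the specialization inside namespace atb. Inside the hash function, simply call add_param_to_buf to add each member variable of the Param in turn; when the framework invokes it, it computes an integer hash value from the contents of the buffer.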

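In general, if a parameter member is unused or always holds a fixed value, it can be left out of the hash function, because the purpose of the hash function is to create the Operation only once for identical Params. For maintainability and extensibility, however, it is advisable to add all member variables of the Param from the start. Otherwise, when the operator is extended later, a member might be accidentally omitted from the hash, leading to accuracy problems that are hard to trace.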
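The risk of an incomplete hash can be sketched with a toy cache in Python. This is a conceptual illustration only: the dictionary-based Param, the `"tanh"` value for geluMode, and the helper names `full_key`, `partial_key`, and `build_cached` are all hypothetical and do not reflect the framework's internals.

```python
from typing import Dict, List, Tuple

# Hypothetical Param represented as a plain dict for illustration.
Param = Dict[str, object]


def full_key(p: Param) -> Tuple:
    """Cache key that covers every member of the Param."""
    return (p["activationType"], p["scale"], p["dim"], p["geluMode"])


def partial_key(p: Param) -> Tuple:
    """Cache key that forgets `dim`, so different Params can collide."""
    return (p["activationType"], p["scale"], p["geluMode"])


def build_cached(params: List[Param], key_fn) -> List[object]:
    """For each Param, return the `dim` of the operation the cache serves."""
    cache: Dict[Tuple, Dict[str, object]] = {}
    served = []
    for p in params:
        # Create the operation only if no entry exists for this key.
        op = cache.setdefault(key_fn(p), {"built_with_dim": p["dim"]})
        served.append(op["built_with_dim"])
    return served


base = {"activationType": "swiglu_forward", "scale": 1.0, "geluMode": "tanh"}
params = [dict(base, dim=-1), dict(base, dim=0)]

print(build_cached(params, full_key))     # each Param gets its own operation
print(build_cached(params, partial_key))  # second Param reuses a stale one
```

With the partial key, the second request is served an operation that was built for `dim=-1` even though it asked for `dim=0`, which is the kind of silent mismatch the paragraph above warns about.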

2. Infer the operator's output information

ms::Tensor InferSwigluForward(const ms::Tensor &x, int32_t dim) {
  ShapeVector out_tensor_shape(x.shape());
  int64_t split_dim = dim;
  if (split_dim < 0) {
    split_dim += out_tensor_shape.size();
  }
  const int64_t split_num = 2;
  out_tensor_shape[split_dim] /= split_num;
  return ms::Tensor(x.data_type(), out_tensor_shape);
}

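For the SwiGLU operator, the output tensor has the same data type as the input. Its shape matches the input's in every dimension except dim, whose length is half that of the input. After the output shape is inferred, an empty tensor is constructed with the ms::Tensor constructor.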

Here the output tensor is defined as y:

auto y = InferSwigluForward(x, dim);
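The shape rule can be mirrored in a few lines of Python, which may help when checking expected output shapes on the Python side. The function name `infer_swiglu_shape` is hypothetical, and the range and evenness checks are additions for illustration; the C++ function above does not perform them.

```python
from typing import List


def infer_swiglu_shape(shape: List[int], dim: int = -1) -> List[int]:
    """Halve the length of `dim`; keep all other dimensions unchanged."""
    rank = len(shape)
    if not -rank <= dim < rank:
        raise ValueError(f"dim {dim} is out of range for rank {rank}")
    # Normalize a negative dim, as the C++ code does with split_dim.
    axis = dim + rank if dim < 0 else dim
    if shape[axis] % 2 != 0:
        raise ValueError(f"dimension {axis} has odd length {shape[axis]}")
    out = list(shape)
    out[axis] //= 2
    return out


print(infer_swiglu_shape([2, 32], -1))
print(infer_swiglu_shape([4, 6, 8], 0))
```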

3. Create and populate the operator attribute struct

atb::infer::ActivationParam param;
param.activationType = atb::infer::ActivationType::ACTIVATION_SWIGLU_FORWARD;
param.dim = dim;
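For orientation, the computation selected here can be sketched as a pure-Python reference. This assumes the commonly used SwiGLU definition, in which the input is split into two halves along dim and the result is SiLU(first half) multiplied elementwise by the second half; consult the ATB documentation to confirm the exact convention. The helper names `silu` and `swiglu_last_dim` are hypothetical, and the sketch handles only a 2-D input split along the last dimension.

```python
import math
from typing import List


def silu(v: float) -> float:
    """SiLU (swish) activation, v * sigmoid(v), written to avoid overflow."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def swiglu_last_dim(rows: List[List[float]]) -> List[List[float]]:
    """Assumed SwiGLU: silu(first half) * second half, along the last dim."""
    out = []
    for row in rows:
        half = len(row) // 2
        gate, value = row[:half], row[half:]
        out.append([silu(g) * v for g, v in zip(gate, value)])
    return out


result = swiglu_last_dim([[0.0, 1.0, 2.0, 3.0]])
print(result)
```

A reference like this can be compared against the operator's output within a float16 tolerance, provided the assumed definition matches the library's.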

4. Call the RunAtbOp interface to execute the operator

ms::pynative::RunAtbOp("SwiGLU", param, {x}, {y});

This is a template interface, equivalent to:

auto runner = std::make_shared<AtbOpRunner>("SwiGLU");
runner->Init(param);
runner->Run({x}, {y});

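Passing in the operator name, attributes, input tensor list, and output tensor list is enough to call the corresponding ATB operator. This interface supports the multi-stage pipeline execution flow of dynamic graph mode.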

5. Bind the C++ function to a Python function via pybind11

auto pyboost_npu_swiglu(const ms::Tensor &x, int32_t dim) {
  return ms::pynative::PyboostRunner::Call<1>(npu_swiglu, x, dim);
}

PYBIND11_MODULE(MS_EXTENSION_NAME, m) {
  m.def("swiglu", &pyboost_npu_swiglu, "swiglu realization", pybind11::arg("x"), pybind11::arg("dim") = -1);
}

6. Compile the custom operator with CustomOpBuilder

Save the C++ code above to a file named atb_activation.cpp, then compile it with the Python interface CustomOpBuilder.

import mindspore
import numpy as np
x = mindspore.Tensor(np.random.rand(2, 32).astype(np.float16))
my_ops = mindspore.ops.CustomOpBuilder("atb_activation", "atb_activation.cpp", enable_atb=True).load()
y = my_ops.swiglu(x, -1)
print(y)

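Here enable_atb=True is passed to CustomOpBuilder, so MindSpore automatically adds the compile and link options related to the ATB acceleration library. You need to make sure that ATB's set_env.sh script has been sourced correctly and that the ATB_HOME_PATH environment variable is present in the environment.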
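A quick pre-flight check for the environment variable can catch a missing set_env.sh step before compilation starts. The following is a minimal sketch; the helper `find_atb_home` is hypothetical and is not part of MindSpore, and the path in the demonstration is a made-up value.

```python
import os
from typing import Mapping, Optional


def find_atb_home(env: Optional[Mapping[str, str]] = None) -> str:
    """Return ATB_HOME_PATH, or raise if the variable is missing or empty."""
    env = os.environ if env is None else env
    atb_home = env.get("ATB_HOME_PATH", "")
    if not atb_home:
        raise RuntimeError(
            "ATB_HOME_PATH is not set; source the ATB set_env.sh script first"
        )
    return atb_home


# Demonstration with an explicit mapping instead of the real environment.
print(find_atb_home({"ATB_HOME_PATH": "/path/to/atb"}))
```

Calling `find_atb_home()` with no argument reads the real process environment, so it can be placed just before the CustomOpBuilder call.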