CustomOpBuilder: Integrating ACLNN Operators via AclnnOpRunner
Overview
The Operator Acceleration Library (AOL) in CANN provides a large number of deeply optimized, hardware-friendly, high-performance operators. If MindSpore has not yet wrapped an aclnn operator's Python interface, or if you have developed your own operator based on Ascend C, you can seamlessly integrate it in dynamic graph (PyNative) mode using CustomOpBuilder + AclnnOpRunner, without worrying about low-level details such as memory management, stream handling, or workspace allocation.
The typical calling convention for aclnn operators is based on a "two-stage" interface, like this:
aclnnStatus aclxxXxxGetWorkspaceSize(const aclTensor *src, ..., aclTensor *out, ..., uint64_t *workspaceSize, aclOpExecutor **executor);
aclnnStatus aclxxXxx(void *workspace, uint64_t workspaceSize, aclOpExecutor *executor, aclrtStream stream);
You must first call the first-stage interface aclxxXxxGetWorkspaceSize to calculate how much workspace memory this API call requires. After obtaining the required workspace size, allocate NPU memory accordingly, and then call the second-stage interface aclxxXxx to perform the computation.
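For reference, below is a minimal sketch of this manual two-stage flow, which AclnnOpRunner performs for you. aclxxXxx is the placeholder name from the signatures above, and src, out, and stream are assumed to be prepared by the caller:

uint64_t workspace_size = 0;
aclOpExecutor *executor = nullptr;
// Stage 1: query how much workspace memory this call needs.
aclnnStatus ret = aclxxXxxGetWorkspaceSize(src, /* ... */, out, &workspace_size, &executor);
// Allocate NPU memory of the reported size, e.g. via aclrtMalloc.
void *workspace = nullptr;
if (ret == ACL_SUCCESS && workspace_size > 0) {
  (void)aclrtMalloc(&workspace, workspace_size, ACL_MEM_MALLOC_HUGE_FIRST);
}
// Stage 2: launch the computation on the given stream.
ret = aclxxXxx(workspace, workspace_size, executor, stream);

AclnnOpRunner takes care of this sequence, including workspace allocation and stream handling, as well as device memory management for the input and output tensors.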
In Custom Operator Based on CustomOpBuilder, MindSpore provides PyboostRunner to help users integrate custom operators in dynamic graph mode. To simplify the calling process and hide the data type conversions between interfaces, MindSpore provides a unified execution entry, ms::pynative::AclnnOpRunner, for aclnn operators. It supports the PyBoost multi-level pipeline and MindSpore's operator caching capabilities, improving operator and network execution efficiency.
This tutorial uses the ArgMin operator as an example to demonstrate the full integration process. The complete code can be found in the MindSpore repository.
Installing ACLNN Development Environment
Operators in CANN
If the operator is already included in the CANN package, no additional environment configuration is required. Just follow the MindSpore Installation Guide to set up the MindSpore environment.
Custom Operators Based on Ascend C
If the operator is a custom one developed by the user based on Ascend C, you need to add the compiled operator package path to the environment variable ASCEND_CUSTOM_OPP_PATH, for example:

export ASCEND_CUSTOM_OPP_PATH={build_out_path}/build_out/_CPack_Package/Linux/External/custom_opp_euleros_aarch64.run/packages/vendors/{your_custom_name}:$ASCEND_CUSTOM_OPP_PATH
ArgMin Operator Integration Example
Below is the complete example code.
#include <set>
#include <optional>
#include "ms_extension/all.h"

namespace custom {
/* 1. Infer output shape */
static ShapeVector InferArgMinShape(const ShapeVector &in_shape, int64_t dim, bool keep_dims) {
  const int64_t rank = static_cast<int64_t>(in_shape.size());
  if (rank == 0) {
    return in_shape;
  }
  int64_t axis = (dim < 0) ? (dim + rank) : dim;
  if (axis < 0 || axis >= rank) {
    MS_LOG(EXCEPTION) << "For ArgMin, dim " << dim << " is out of range for input rank " << rank;
  }
  ShapeVector out_shape;
  out_shape.reserve(keep_dims ? rank : rank - 1);
  for (int64_t i = 0; i < rank; ++i) {
    if (i == axis) {
      // The reduced dimension is kept as 1 or dropped entirely.
      if (keep_dims) {
        out_shape.push_back(1);
      }
    } else {
      out_shape.push_back(in_shape[i]);
    }
  }
  return out_shape;
}

/* 2. Construct empty output tensor */
ms::Tensor GenResultTensor(const ms::Tensor &t, int64_t dim, bool keep_dim, ms::TypeId type_id) {
  ShapeVector in_shape = t.shape();
  ShapeVector out_shape = InferArgMinShape(in_shape, dim, keep_dim);
  return ms::Tensor(type_id, out_shape);
}

/* 3. Operator entry: called directly from Python */
ms::Tensor npu_arg_min(const ms::Tensor &x, int64_t dim, bool keep_dim) {
  auto result = GenResultTensor(x, dim, keep_dim, ms::TypeId::kNumberTypeInt64);
  auto runner = std::make_shared<ms::pynative::AclnnOpRunner>("ArgMin");
  runner->SetLaunchFunc(LAUNCH_ACLNN_FUNC(aclnnArgMin, x, dim, keep_dim, result));
  runner->Run({x}, {result});
  return result;
}
}  // namespace custom

/* 4. PYBIND11 interface definition */
PYBIND11_MODULE(MS_EXTENSION_NAME, m) { m.def("npu_arg_min", PYBOOST_CALLER(1, custom::npu_arg_min)); }
1. Infer Operator Output Info
auto result = GenResultTensor(x, dim, keep_dim, ms::TypeId::kNumberTypeInt64);

This step creates the output tensor based on the operator's logic, using its shape and type. For example, aclnnArgMin precomputes the output shape and type based on dim and keep_dim, and constructs an empty Tensor via ms::Tensor(dtype, shape). This tensor only holds metadata and does not allocate device memory; AclnnOpRunner::Run will allocate device memory internally.
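For illustration, here is how InferArgMinShape behaves for the sample input used at the end of this tutorial:

// in_shape = (2, 3, 4, 5), dim = 0
// keep_dim = false -> out_shape = (3, 4, 5)
// keep_dim = true  -> out_shape = (1, 3, 4, 5)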
2. Create AclnnOpRunner
In Custom Operator Based on CustomOpBuilder, MindSpore provides the general custom operator integration class PyboostRunner. For aclnn operators, users can directly use the AclnnOpRunner class to create an object:
auto runner = std::make_shared<ms::pynative::AclnnOpRunner>("ArgMin");
3. Call Interface to Execute Operator
runner->SetLaunchFunc(LAUNCH_ACLNN_FUNC(aclnnArgMin, x, dim, keep_dim, result));
runner->Run({x}, {result});
In LAUNCH_ACLNN_FUNC, pass the aclnn interface name followed by its inputs and outputs in order, and use SetLaunchFunc to set the resulting launch function on the runner. Then call the Run method, whose inputs and outputs are of type ms::Tensor.
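Conceptually, the launch function set here replays the two-stage convention from the overview. A simplified sketch of what it does per call (not the actual macro implementation):

// 1. aclnnArgMinGetWorkspaceSize(x, dim, keep_dim, result, &ws_size, &executor);
// 2. Allocate ws_size bytes of NPU workspace memory.
// 3. aclnnArgMin(workspace, ws_size, executor, stream);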
4. Bind C++ Function to Python via pybind11
PYBIND11_MODULE(MS_EXTENSION_NAME, m) { m.def("npu_arg_min", PYBOOST_CALLER(1, custom::npu_arg_min)); }
npu_arg_min: the frontend interface name.
custom::npu_arg_min: the actual backend function being called.
PYBOOST_CALLER: takes the number of operator outputs and the backend function.
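For instance, if a custom operator returned two tensors, the output count passed to PYBOOST_CALLER would be 2. The npu_two_outputs function below is hypothetical, shown only to illustrate the first argument:

PYBIND11_MODULE(MS_EXTENSION_NAME, m) { m.def("npu_two_outputs", PYBOOST_CALLER(2, custom::npu_two_outputs)); }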
5. Compile Custom Operator Using CustomOpBuilder
Save the above C++ code as argmin.cpp, then compile it using the Python interface CustomOpBuilder:
import numpy as np
import mindspore as ms
from mindspore import ops

my_ops = ops.CustomOpBuilder("my_custom", 'argmin.cpp', backend="Ascend").load()
x = np.random.randn(2, 3, 4, 5).astype(np.float32)
output = my_ops.npu_arg_min(ms.Tensor(x), 0, False)
print(output.shape)  # (3, 4, 5); values should match np.argmin(x, axis=0)