MindSpore Custom: A Unified Custom Operator API
August 26, 2022

Introduction

There are two major barriers that hinder the development of deep learning full-stack solutions:

The vertical barrier is the gap between manual optimization solutions and automatic compilation optimization solutions. Most deep learning frameworks face the dilemma of choosing between an operator library that focuses on manual operator optimization and a compilation scheme that focuses on automatic optimization. How to integrate manual optimization, automatic optimization, and expert knowledge in machine learning optimization is a major challenge in the industry.

The horizontal barrier is the cross-layer barrier caused by the layering of graph computing software. Currently, in most deep learning frameworks, abstractions of different layers, such as the graph layer and operator layer, are designed independently. However, deep learning compilation and optimization cannot be completed layer by layer: the different layers need to be considered as a whole.

These two barriers severely limit computing expression capabilities during network development in practice. On the one hand, vertical barriers also exist between different manual optimization solutions, and deep learning frameworks face great obstacles when integrating these solutions to accelerate training. Some manually optimized operators are implemented and packaged in third-party operator libraries, and adding all of these operators would make the framework too heavy. Other manual optimization solutions are applicable only to certain scenarios and lack universality. Especially in the AI-HPC convergence scenario, many operators in HPC applications are optimized for special hardware. How to integrate these operators is an important challenge for a deep learning framework.
On the other hand, in most industry deep learning frameworks, the expression and registration of operators are model-agnostic, which means that the graph layer sees operators as black boxes. When the user adds an operator outside the framework, the graph layer cannot understand the specific calculation logic of the operator. As a result, graph layer optimizations, such as operator fusion and operator splitting, cannot be applied. How to define operators directly in the graph layer, so that the graph layer understands their logic, has long been an important topic. This is the key to bridging the horizontal gap between the graph layer and the operator layer and enabling interworking between the abstractions of different layers.

In general, a full-stack deep learning solution is bound to encounter vertical barriers between different operator optimization solutions and horizontal barriers between graph and operator abstractions. To break these barriers, MindSpore provides its own solution: a unified custom operator expression.

1. Custom: New Unified Custom Operator API

With MindSpore suites such as MindSPONGE being increasingly used in new scenarios such as scientific computing, operator libraries designed for traditional deep learning networks fail to meet the growing requirements on the flexibility of operator expression. Therefore, MindSpore introduced the unified custom operator API Custom in version 1.6 and upgraded it in version 1.8. The Custom API combines manually optimized and automatically compiled operators, and enables the graph layer to perceive operator definitions, helping users add custom operators easily and efficiently. The API meets user requirements in different scenarios, such as quick verification, just-in-time compilation, and third-party operator integration. The Custom API now supports custom operators based on ms_kernel, tbe, aicpu, aot, pyfunc, and julia.
Custom operators developed using different methods and their application scenarios are as follows.

| Operator Development Method | Language | Compilation Method | Platform | Scenario Recommendation |
| --- | --- | --- | --- | --- |
| ms_kernel | MindSpore Python DSL | JIT | Ascend NPU, GPU, CPU | Universal Ascend NPU/GPU development and quick verification |
| tbe | TBE DSL | JIT | Ascend NPU | Ascend AI Core custom operators |
| aicpu | C/C++ | AOT | Ascend NPU | Ascend AI CPU custom operators |
| aot | C/C++/CUDA | AOT | GPU, CPU | Manually optimized operators for high performance; third-party operator library invocation |
| pyfunc | Python | JIT | CPU | Quick verification; interaction with Python |
| julia | Julia | JIT | CPU | Scientific computing; Julia-based programming |

Custom operator modes and platforms

When designing the modes of custom operators, we fully considered the two barriers that hinder the development of deep learning full-stack solutions and used a unified API to meet the various requirements on operators.

1.1 Quick Integration of Manually Optimized Operators

As mentioned in the previous discussion, the vertical barrier in developing a deep learning full-stack solution is the gap between the manual optimization solution and the automatic compilation optimization solution. The Custom API provides the aot mode to flexibly encapsulate manually optimized operators, helping users integrate them quickly. You can manually optimize the operator implementation and connect the operator to a MindSpore network as a dynamic library. In particular, to use APIs of C++ or CUDA functions provided by a third-party library, you can call those APIs in a custom operator and then link the third-party library into the MindSpore network at compile time. In this way, you can easily integrate manually optimized operators.

Take the PyTorch Aten library as an example. When porting a PyTorch-based network, you may encounter operators that are not supported by MindSpore.
In that case, you can use the aot development mode of the Custom API to invoke PyTorch Aten operators for quick verification, and directly use the operator interfaces provided by Aten to implement the calculation logic. In the following code example, the torch::leaky_relu_out operator interface of Aten is directly used to implement the LeakyRelu calculation. (The original snippet's include lines and template arguments were garbled in extraction; they are restored below, and the Aten device is assumed to be the CPU to match the CPU example that follows.)

```cpp
#include <string>
#include <unordered_map>
#include <vector>
#include <torch/extension.h>  // Referencing the Aten header file.

int8_t GetDtype(const std::string &dtypes) {
  // Map MindSpore dtype strings to Aten scalar type codes (defaults to float32).
  int8_t type = 6;
  std::unordered_map<std::string, int8_t> m{{"uint8", 0},  {"int8", 1},    {"int16", 2},   {"int32", 3},
                                            {"int64", 4},  {"float16", 5}, {"float32", 6}, {"float64", 7}};
  if (m.count(dtypes)) {
    type = m[dtypes];
  }
  return type;
}

extern "C" int LeakyRelu(int nparam, void **params, int *ndims, int64_t **shapes, const char **dtypes, void *stream,
                         void *extra) {
  // Wrap the raw MindSpore buffers as Aten tensors. CPU device assumed here.
  auto device = at::kCPU;
  std::vector<at::Tensor> tensors;
  for (int i = 0; i < nparam; i++) {
    std::vector<int64_t> size;
    for (int j = 0; j < ndims[i]; j++) {
      size.push_back(shapes[i][j]);
    }
    int8_t type = GetDtype(dtypes[i]);
    auto option = at::TensorOptions().dtype(static_cast<c10::ScalarType>(type)).device(device);
    tensors.emplace_back(at::from_blob(params[i], size, option));
  }
  auto at_input = tensors[0];
  auto at_output = tensors[1];
  torch::leaky_relu_out(at_output, at_input);
  return 0;
}
```

After compiling the preceding source code into a binary file using the C++ extension provided by Aten, you can use the aot mode of Custom to invoke torch::leaky_relu_out in the network. For more details about this example, see our operator migration tutorial.
```python
import numpy as np
import mindspore as ms
from mindspore.nn import Cell
import mindspore.ops as ops

ms.set_context(device_target="CPU")

def LeakyRelu():
    return ops.Custom("./leaky_relu_cpu.so:LeakyRelu", out_shape=lambda x: x, out_dtype=lambda x: x, func_type="aot")

class Net(Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.leaky_relu = LeakyRelu()

    def construct(self, x):
        return self.leaky_relu(x)

if __name__ == "__main__":
    x0 = np.array([[0.0, -0.1], [-0.2, 1.0]]).astype(np.float32)
    net = Net()
    output = net(ms.Tensor(x0))
    print(output)
```

In addition, on the Ascend platform, MindSpore provides a TBE operator library based on automatic optimization. However, some irregular operations call for manually optimized AI CPU operators, so Custom also supports AI CPU operators (the aicpu type). Manually optimized AI CPU operators can be quickly deployed on mainstream embedded platforms in AOT mode. Compared with TBE operators, AI CPU operators are better at logic operations and are ideal for operators that are difficult to vectorize. In this way, you can use both the TBE operator library for automatic optimization and AI CPU operators for manual optimization on the Ascend platform, applying the acceleration capability of Ascend to more scenarios.

1.2 Fusion of the Graph Layer and Operators

The graph kernel fusion feature of MindSpore converges the expressions of the graph layer and operator layer at the backend. However, the frontend expressions of the graph layer and operator layer still use different domain-specific languages (DSLs). The Custom API provides the ability to converge the graph layer and operator layer at the frontend to break the horizontal barrier: in tbe mode, you can directly write operators in the DSL of the operator compiler at the graph layer and use the Custom API to add them to the network through just-in-time (JIT) compilation.
Example:

```python
import numpy as np
import mindspore as ms
import mindspore.ops as ops
from mindspore.ops import DataType, CustomRegOp, custom_info_register

ms.set_context(device_target="Ascend")

# Implement the operator and register the operator information.
@custom_info_register(CustomRegOp() \
    .input(0, "a") \
    .input(1, "b") \
    .output(0, "output") \
    .dtype_format(DataType.F16_Default, DataType.F16_Default, DataType.F16_Default) \
    .dtype_format(DataType.F32_Default, DataType.F32_Default, DataType.F32_Default) \
    .target("Ascend") \
    .get_op_info())
def add(a, b, output, kernel_name="add"):
    import te.lang.cce
    from te import tvm
    data0 = tvm.placeholder(a.get("shape"), name="data0", dtype=a.get("dtype").lower())
    data1 = tvm.placeholder(b.get("shape"), name="data1", dtype=b.get("dtype").lower())
    res = te.lang.cce.vadd(data0, data1)
    with tvm.target.cce():
        sch = te.lang.cce.auto_schedule(res)
    config = {"print_ir": False, "name": kernel_name, "tensor_list": [data0, data1, res]}
    te.lang.cce.cce_build_code(sch, config)

if __name__ == "__main__":
    # Define the custom operator of the tbe type.
    op = ops.Custom(add, out_shape=lambda x, _: x, out_dtype=lambda x, _: x, func_type="tbe")
    x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32)
    x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32)
    output = op(ms.Tensor(x0), ms.Tensor(x1))
    print(output)
```

Here, we directly use the TBE DSL to define an operator in the network definition script, and use the tbe mode of Custom to add the operator to the network, greatly improving development efficiency.

2 New Features: From AI to Scientific Computing

When the unified custom operator API Custom was first introduced, we implemented the basic modes and functions based on practical challenges such as network migration and graph kernel expression.
As the convergence of AI and scientific computing attracts more and more attention from the industry, MindSpore is also exploring how custom operators can be applied to scientific computing.

2.1 Taking the Lead in Julia Language Support

Julia is a high-level general-purpose programming language that is fast and easy to use. It was initially designed for scientific computing, but has gained popularity with mainstream users in recent years due to its efficient and practical features. Ease of use is Julia's most significant feature: it allows users to code much as if they were writing mathematical formulas, which greatly facilitates operator development. Therefore, the MindSpore Custom API provides the julia mode to add Julia-based operators to MindSpore networks, so that users can write operators in Julia and take advantage of Julia's rich ecosystem. For example, you can use Julia to implement an addition function as follows:

```julia
# add.jl
module Add
# Inputs: x, y; output: z. The output should use .= for in-place assignment.
function add(x, y, z)
    z .= x + y
end
end
```

Then, you can reference the preceding function as an operator in julia mode in the network script. For example:

```python
import numpy as np
from mindspore import context, Tensor
from mindspore.nn import Cell
import mindspore.ops as ops

context.set_context(device_target="CPU")

class Net(Cell):
    def __init__(self):
        super(Net, self).__init__()
        # Define the custom operator of the julia type.
        self.add = ops.Custom("./add.jl:Add:add", out_shape=lambda x, _: x, out_dtype=lambda x, _: x,
                              func_type="julia")

    def construct(self, x, y):
        return self.add(x, y)

if __name__ == "__main__":
    net = Net()
    x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32)
    x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32)
    output = net(Tensor(x0), Tensor(x1))
    print(output)
```

In this way, you can apply Julia-based operators in scenarios such as model migration, quick verification, and model acceleration, and enjoy the benefits of the Julia language in MindSpore.
Specifically, in the scientific computing scenario, you can use Julia's powerful expression capability to write operators and build AI+scientific computing applications based on MindSpore.

2.2 Cross-Platform Unified ms_kernel Mode

As operators developed using Custom are added to the network using JIT compilation after automatic optimization, we found the following issues when developing scientific computing operators in practice:

1. In terms of automatic scheduling, most of the existing operator compilers implement optimization based on deep learning operators for large-scale parallel regular computing scenarios. Such optimization is not fully applicable to the irregular computing scenarios frequently encountered in scientific computing, especially on domain-specific architectures (DSAs).
2. The calculation logic of scientific computing operators is complex and requires extensive and repeated debugging.

MindSpore 1.8 provides a unified cross-platform ms_kernel mode to address these two challenges. Custom operators developed in ms_kernel mode can be used on all backends. In particular, the ms_kernel mode provides new scheduling primitives that enable custom operators to use a new scheduler module on the Ascend backend. The primitives assist the scheduler in generating code for operator scheduling and help users utilize the acceleration capability of the Ascend backend for scientific computing tasks. Operators in ms_kernel mode can also be executed by the native Python interpreter for quick verification.

2.2.1 New Scheduling Primitives for Scheduling

The ms_kernel mode provides scheduling primitives to describe different types of loops. These primitives assist the new scheduler on the Ascend backend in generating code. They include:

1. serial: instructs the scheduler to keep the sequence of the loop during schedule generation.
2. vectorize: often used in the innermost loop to prompt the scheduler that vectorization instructions can be generated.
3. parallel: often used in the outermost loop to indicate a parallel execution opportunity and prompt the scheduler to preferentially consider parallel execution.
4. reduce: indicates a reduction axis in the calculation.

When writing operators, you can use your experience to guide the scheduler to generate efficient code on the Ascend backend. For example:

```python
import numpy as np
from mindspore import context, Tensor, ops
from mindspore.nn import Cell
from mindspore.ops import ms_kernel

context.set_context(device_target="Ascend")

@ms_kernel
def hybrid_dsl_test(a, b):
    for i in parallel(a.shape[0]):
        for j in serial(a.shape[1]):
            for k in serial(j):
                b[i, j] = b[i, j] - a[i, j, k] * b[i, j]
    return b

class Net(Cell):
    def __init__(self):
        super(Net, self).__init__()
        # Define the custom operator of the ms_kernel type (default mode of Custom).
        self.cus_op = ops.Custom(hybrid_dsl_test)

    def construct(self, x, y):
        return self.cus_op(x, y)

if __name__ == "__main__":
    net = Net()
    x0 = np.random.randn(16, 16, 16).astype(np.float32)
    x1 = np.random.randn(16, 16).astype(np.float32)
    output = net(Tensor(x0), Tensor(x1))
    print(output)
```

The parallel primitive in the outermost loop indicates that the iterations of the outermost i-axis have no dependency on each other and can be scheduled for parallel acceleration. The serial primitives in the inner loops indicate that the calculations along j and k depend on each other, so their order must be maintained during scheduling. When device_target is set to Ascend, these hints are passed to the scheduler to implement manual operator scheduling and assist code generation, effectively extending the operator expression capability of MindSpore when applying the Ascend backend to scientific computing scenarios. In the future, MindSpore will further extend the scheduler to all backends for scientific computing.
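To make the semantics of the annotated loop nest concrete, the same computation can be written as a plain Python/NumPy reference (a verification sketch only, not the ms_kernel DSL: the `parallel` and `serial` primitives are replaced by ordinary `range` loops, since they carry scheduling hints rather than changing the numerics; `hybrid_dsl_reference` is an illustrative name):

```python
import numpy as np

def hybrid_dsl_reference(a, b):
    """Plain-Python equivalent of the hybrid_dsl_test kernel, for checking results."""
    out = b.copy()
    for i in range(a.shape[0]):        # 'parallel' axis: iterations are independent of each other
        for j in range(a.shape[1]):    # 'serial' axis
            for k in range(j):         # 'serial' axis: each step reads the previous step's write
                out[i, j] = out[i, j] - a[i, j, k] * out[i, j]
    return out

a = np.ones((2, 2, 2), dtype=np.float32)
b = np.ones((2, 2), dtype=np.float32)
print(hybrid_dsl_reference(a, b))  # column j=0 stays 1.0; column j=1 becomes 1 - 1*1 = 0.0
```

A reference like this is useful for comparing against the Ascend output of the compiled operator: only the i-loop iterations are order-independent, which is exactly why only that axis is marked parallel.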
2.2.2 Seamless pyfunc Switchover for Easy Debugging and High Performance

Because ms_kernel operators can also be executed by the native Python interpreter, switching one to pyfunc mode is a one-line change. For example, you can modify the preceding operator as follows:

```python
class Net(Cell):
    def __init__(self):
        super(Net, self).__init__()
        # Use the Python interpreter for quick verification.
        self.cus_op = ops.Custom(hybrid_dsl_test, func_type="pyfunc")

    def construct(self, x, y):
        return self.cus_op(x, y)
```

That is, you only need to change the mode of Custom to pyfunc to run the preceding operator as a native Python function. In this way, you can use Python to quickly verify the algorithm logic or insert print statements to check the correctness of intermediate results.
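Because a pyfunc operator body is ordinary Python, the same debugging style also works completely outside MindSpore. A minimal NumPy-only sketch (the function `leaky_relu_py` and its print are illustrative, not part of the MindSpore API; wrapping it would use `ops.Custom(leaky_relu_py, out_shape=lambda x: x, out_dtype=lambda x: x, func_type="pyfunc")` in the pattern shown earlier):

```python
import numpy as np

def leaky_relu_py(x, alpha=0.01):
    # Ordinary Python body: prints and breakpoints work as usual while debugging.
    mask = x >= 0
    print("negative elements:", int((~mask).sum()))  # intermediate-result check
    return np.where(mask, x, alpha * x)

x0 = np.array([[0.0, -0.1], [-0.2, 1.0]], dtype=np.float32)
print(leaky_relu_py(x0))
```

Once the logic is verified this way, the same function can be handed to Custom in pyfunc mode, or reimplemented in ms_kernel or aot mode for performance.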