MindSpore Serving-based Inference Service Deployment

Overview

MindSpore Serving is a lightweight and high-performance service module that helps MindSpore developers efficiently deploy online inference services in the production environment. After completing model training on MindSpore, you can export the MindSpore model and use MindSpore Serving to create an inference service for the model.

The following uses a simple Add network as an example to describe how to use MindSpore Serving.

Preparing the Environment

Before running the sample network, ensure that MindSpore Serving has been properly installed. To install MindSpore Serving, follow the MindSpore Serving installation page, and then configure the required environment variables as described on the MindSpore Serving environment configuration page.
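A quick way to confirm that the installation and environment variables are in place is to import the packages from a Python shell. This is only a sanity check and assumes MindSpore and MindSpore Serving were installed as described on those pages:

import mindspore
from mindspore_serving import master, worker

# If both imports succeed without errors, the environment is ready
# for the deployment steps below.
print("MindSpore version:", mindspore.__version__)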

Exporting the Model

Use add_model.py to build a network with only the Add operator and export the MindSpore inference deployment model.

import os
from shutil import copyfile
import numpy as np

import mindspore.context as context
import mindspore.nn as nn
import mindspore.ops as ops
import mindspore as ms

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")


class Net(nn.Cell):
    """Define Net of add"""

    def __init__(self):
        super(Net, self).__init__()
        self.add = ops.Add()

    def construct(self, x_, y_):
        """construct add net"""
        return self.add(x_, y_)


def export_net():
    """Export add net of 2x2 + 2x2, and copy output model `tensor_add.mindir` to directory ../add/1"""
    x = np.ones([2, 2]).astype(np.float32)
    y = np.ones([2, 2]).astype(np.float32)
    add = Net()
    output = add(ms.Tensor(x), ms.Tensor(y))
    ms.export(add, ms.Tensor(x), ms.Tensor(y), file_name='tensor_add', file_format='MINDIR')
    dst_dir = '../add/1'
    try:
        os.mkdir(dst_dir)
    except OSError:
        pass

    dst_file = os.path.join(dst_dir, 'tensor_add.mindir')
    copyfile('tensor_add.mindir', dst_file)
    print("copy tensor_add.mindir to " + dst_dir + " success")

    print(x)
    print(y)
    print(output.asnumpy())


if __name__ == "__main__":
    export_net()

To define a neural network with MindSpore, inherit from mindspore.nn.Cell (Cell is the base class of all neural networks). Declare the layers of the network in the __init__ method, and implement the forward computation in the construct method. Use the export function of the mindspore module to export the model file. For a more detailed example, see Implementing an Image Classification Application.

Execute the add_model.py script to generate the tensor_add.mindir file. The model takes two 2D float32 tensors with shape [2,2] as input and outputs their element-wise sum.
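If the export succeeds, the script's own print statements produce output roughly like the following: the copy confirmation, the two input arrays of ones, and their sum (exact formatting may vary slightly):

copy tensor_add.mindir to ../add/1 success
[[1. 1.]
 [1. 1.]]
[[1. 1.]
 [1. 1.]]
[[2. 2.]
 [2. 2.]]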

Deploying the Serving Inference Service

Start Serving with the following files:

test_dir
├── add/
│   ├── servable_config.py
│   └── 1/
│       └── tensor_add.mindir
└── master_with_worker.py

  • master_with_worker.py: Script file for starting the service.

  • add: Model directory, named after the model.

  • tensor_add.mindir: Model file generated by the network in the previous step, stored in folder 1 (the folder name is the version number). Different versions are stored in different folders, and the version number must be a string of digits. By default, the latest version is loaded (see the example layout after this list).

  • servable_config.py: Model configuration file, which defines the methods the model provides, including add_common and add_cast. add_common defines an addition whose inputs are two float32 tensors; add_cast defines an addition whose inputs are first cast to float32.
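For example, if a second version of the model were added later (a hypothetical layout shown only for illustration), the add directory would look as follows, and Serving would load version 2, the latest, by default:

add
├── servable_config.py
├── 1/
│   └── tensor_add.mindir
└── 2/
    └── tensor_add.mindir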

Content of the configuration file:

import numpy as np
from mindspore_serving.worker import register


def add_trans_datatype(x1, x2):
    """define preprocess, this example has one input and one output"""
    return x1.astype(np.float32), x2.astype(np.float32)


# when with_batch_dim is set to False, only 2x2 add is supported
# when with_batch_dim is set to True (default), Nx2 add is supported, where N is the batch dimension
# float32 inputs/outputs
register.declare_servable(servable_file="tensor_add.mindir", model_format="MindIR", with_batch_dim=False)


# register add_common method in add
@register.register_method(output_names=["y"])
def add_common(x1, x2):  # only float32 inputs are supported
    """method add_common data flow definition, only call model servable"""
    y = register.call_servable(x1, x2)
    return y


# register add_cast method in add
@register.register_method(output_names=["y"])
def add_cast(x1, x2):
    """method add_cast data flow definition, only call preprocess and model servable"""
    x1, x2 = register.call_preprocess(add_trans_datatype, x1, x2)  # cast input to float32
    y = register.call_servable(x1, x2)
    return y

MindSpore Serving provides both lightweight deployment and cluster deployment. In lightweight deployment mode, the master and worker nodes are deployed in the same process. In cluster deployment mode, the master and worker nodes are deployed in different processes. If there is only one worker node, you can consider lightweight deployment, that is, deploy the master node in the process where the worker node is located. If there are multiple worker nodes, you can deploy them in a cluster and use one of them as the master node to manage all worker nodes. You can select the deployment mode based on the actual requirements.

Lightweight Deployment

The server calls a Python API to start an inference process shared by the master and worker nodes. The client directly connects to the inference service and delivers inference tasks. Run the master_with_worker.py script to deploy the lightweight service:

import os
from mindspore_serving import master
from mindspore_serving import worker

def start():
    servable_dir = os.path.abspath(".")
    # load the "add" servable in the master process (master and worker share one process)
    worker.start_servable_in_master(servable_dir, "add", device_id=0)
    # expose the gRPC service to clients on port 5500
    master.start_grpc_server("127.0.0.1", 5500)

if __name__ == "__main__":
    start()

If the server prints the log Serving gRPC start success, listening on 0.0.0.0:5500, Serving has loaded the inference model successfully.

Cluster Deployment

The server consists of the master and worker processes. The master process manages all worker nodes in the cluster and distributes inference tasks. The cluster deployment is as follows:

Master deployment:

from mindspore_serving import master

def start():
    # expose the gRPC service to clients on port 5500
    master.start_grpc_server("127.0.0.1", 5500)
    # open the port that workers in the cluster register with
    master.start_master_server("127.0.0.1", 6500)

if __name__ == "__main__":
    start()

Worker deployment:

import os
from mindspore_serving import worker

def start():
    servable_dir = os.path.abspath(".")
    # load the "add" servable and register this worker with the master at 127.0.0.1:6500
    worker.start_servable(servable_dir, "add", device_id=0,
                          master_ip="127.0.0.1", master_port=6500,
                          worker_ip="127.0.0.1", worker_port=6600)

if __name__ == "__main__":
    start()

The lightweight and cluster deployment modes use different APIs to start a worker: start_servable_in_master is used in lightweight deployment mode, while start_servable is used in cluster deployment mode.

Inference Execution

The client can access the inference service through either gRPC or RESTful. The following uses gRPC as an example. Execute client.py to start the Python client.

import numpy as np
from mindspore_serving.client import Client


def run_add_common():
    """invoke servable add method add_common"""
    client = Client("localhost", 5500, "add", "add_common")
    instances = []

    # instance 1
    x1 = np.asarray([[1, 1], [1, 1]]).astype(np.float32)
    x2 = np.asarray([[1, 1], [1, 1]]).astype(np.float32)
    instances.append({"x1": x1, "x2": x2})

    # instance 2
    x1 = np.asarray([[2, 2], [2, 2]]).astype(np.float32)
    x2 = np.asarray([[2, 2], [2, 2]]).astype(np.float32)
    instances.append({"x1": x1, "x2": x2})

    # instance 3
    x1 = np.asarray([[3, 3], [3, 3]]).astype(np.float32)
    x2 = np.asarray([[3, 3], [3, 3]]).astype(np.float32)
    instances.append({"x1": x1, "x2": x2})

    result = client.infer(instances)
    print(result)


def run_add_cast():
    """invoke servable add method add_cast"""
    client = Client("localhost", 5500, "add", "add_cast")
    instances = []
    x1 = np.ones((2, 2), np.int32)
    x2 = np.ones((2, 2), np.int32)
    instances.append({"x1": x1, "x2": x2})
    result = client.infer(instances)
    print(result)


if __name__ == '__main__':
    run_add_common()
    run_add_cast()

The client uses the Client class defined in mindspore_serving.client and defines two cases that call the two model methods. In run_add_common, three pairs of float32 arrays are sent in one request and each pair is added up. In run_add_cast, two int32 arrays are sent; the add_cast method first casts them to float32 in its preprocess step and then adds them. If the results of the two cases are displayed as follows, Serving has executed the Add network inference properly.

[{'y': array([[2., 2.],
       [2., 2.]], dtype=float32)}, {'y': array([[4., 4.],
       [4., 4.]], dtype=float32)}, {'y': array([[6., 6.],
       [6., 6.]], dtype=float32)}]
[{'y': array([[2., 2.],
       [2., 2.]], dtype=float32)}]