Using Java Interface to Perform Concurrent Inference

View Source On Gitee

Overview

MindSpore Lite provides multi-model concurrent inference interface ModelParallelRunner. Multi model concurrent inference now supports Atlas 200/300/500 inference product, Atlas inference series (with Ascend 310P AI processor), Atlas training series, Nvidia GPU and CPU backends.

After exporting the mindir model by MindSpore or converting it by model conversion tool to obtain the mindir model, the concurrent inference process of the model can be executed in Runtime. This tutorial describes how to perform concurrent inference with multiple modes by using the Java interface.

To use the MindSpore Lite concurrent inference framework, perform the following steps:

  1. Create a configuration item: Create a multi-model concurrent inference configuration item RunnerConfig, which is used to configure multiple model concurrency.

  2. Initialization: initialization before multi-model concurrent inference.

  3. Execute concurrent inference: Use the Predict interface of ModelParallelRunner to perform concurrent inference on multiple Models.

  4. Release memory: When you do not need to use the MindSpore Lite concurrent inference framework, you need to release the ModelParallelRunner and related Tensors you created.

Preparation

  1. The following code samples are from Sample code for performing cloud-side inference by C++ interface.

  2. Export the MindIR model via MindSpore, or get the MindIR model by converting it with model conversion tool and copy it to the mindspore/lite/examples/cloud_infer/quick_start_parallel_java/model directory, and you can download the MobileNetV2 model file mobilenetv2.mindir.

  3. Download the Ascend, Nvidia GPU, CPU triplet MindSpore Lite cloud-side inference package mindspore-lite-{version}-linux-{arch}.tar.gz from Official Website and save it to mindspore/lite/examples/cloud_infer/quick_start_parallel_java directory.

Creating Configuration

The configuration item will save some basic configuration parameters required for concurrent inference, which are used to guide the number of concurrent models, model compilation and model execution.

The following sample code demonstrates how to create a RunnerConfig and configure the number of workers for concurrent inference:

// use default param init context
MSContext context = new MSContext();
context.init(1,0);
boolean ret = context.addDeviceInfo(DeviceType.DT_CPU, false, 0);
if (!ret) {
    System.err.println("init context failed");
    context.free();
    return ;
}
// init runner config
RunnerConfig config = new RunnerConfig();
config.init(context);
config.setWorkersNum(2);

For details on the configuration method of Context, see Context.

Multi-model concurrent inference currently only supports CPUDeviceInfo, GPUDeviceInfo, and AscendDeviceInfo several different hardware backends. When setting the GPU backend, you need to set the GPU backend first and then the CPU backend, otherwise it will report an error and exit.

Multi-model concurrent inference does not support FP32 type data reasoning. Binding cores only supports no core binding or binding large cores. It does not support the parameter settings of the bound cores, and does not support configuring the binding core list.

Initialization

When using MindSpore Lite to execute concurrent inference, ModelParallelRunner is the main entry of concurrent inference. Through ModelParallelRunner, you can initialize and execute concurrent inference. Use the RunnerConfig created in the previous step and call the init interface of ModelParallelRunner to initialize ModelParallelRunner.

ret = runner.predict(inputs,outputs);
if (!ret) {
    System.err.println("MindSpore Lite predict failed.");
    freeTensor();
    runner.free();
    return;
}

For Initialization of ModelParallelRunner, you do not need to set the RunnerConfig configuration parameters, and the default parameters will be used for concurrent inference of multiple models.

Executing Concurrent Inference

MindSpore Lite calls the Predict interface of ModelParallelRunner for model concurrent inference.

ret = runner.predict(inputs,outputs);
if (!ret) {
    System.err.println("MindSpore Lite predict failed.");
    freeTensor();
    runner.free();
    return;
}

Memory release

When you do not need to use the MindSpore Lite reasoning framework, you need to release the created ModelParallelRunner.

freeTensor();
runner.free();