MindSpore Lite Cloud-side Documentation

MindSpore Lite inference comprises two components: cloud-side inference and device-side inference. This document primarily introduces MindSpore Lite cloud-side inference. For device-side inference, please refer to the Device-side Inference Documentation.

Usage Scenarios

MindSpore Lite cloud-side inference is primarily designed for server-side devices. It supports models exported from the MindSpore training framework as well as open-source formats including ONNX, TFLite, and TensorFlow Pb. This version is applicable to Ascend cards (such as the Atlas 300I Duo, Atlas 800I A2, and Atlas 800I A3 series) and to CPU hardware based on the X86/Arm architectures. MindSpore Lite also provides targeted optimizations and adaptations for specific algorithmic scenarios, currently focusing on multi-modal generation, speech recognition, speech synthesis, autonomous driving, vector models, and traditional computer vision.

Advantages

  1. During model inference, MindSpore Lite reduces operator dispatch latency through whole-graph sinking, thereby improving inference performance;

  2. For multi-modal generative algorithm models, MindSpore Lite supports key capabilities including multiple cache mechanisms, quantization, shared memory, and multi-dimensional hybrid parallelism. For Ascend hardware, MindSpore Lite enables user-defined operator integration;

  3. For speech-related algorithm models, MindSpore Lite supports key capabilities such as zero-copy I/O data processing;

  4. For autonomous driving models, MindSpore Lite supports hybrid scheduling of single operators and subgraphs during inference on Ascend hardware. This ensures subgraph-sinking inference performance while enabling rapid integration of custom operators for autonomous driving applications through hybrid scheduling.

Development Process

Figure: MindSpore Lite runtime development process (lite_runtime.png)

Using the MindSpore Lite inference framework primarily involves the following steps:

  1. Model loading: You can directly load MindIR models exported from MindSpore training, or convert models exported from third-party frameworks into MindIR format using the MindSpore Lite conversion tool. These converted models can then be loaded via MindSpore Lite's interfaces (a conversion command is sketched after this list).

  2. Model compilation:

    1. Create a configuration context: Create a Context to hold the configuration parameters, such as the target device, that guide graph compilation and model execution.

    2. Model loading: Before performing inference, the Build interface of the Model must be invoked to load the model. This process parses the model file into a runtime model.

    3. Graph compilation: After model loading completes, the MindSpore Lite runtime compiles the graph. Because graph compilation is time-consuming, it is recommended to create and compile the model once and reuse it for multiple inferences (see the compilation sketch after this list).

  3. Model inference:

    1. Fill the input: Before model execution, copy the input data into the model's input tensors.

    2. Execute inference: Use the Predict function of the Model for model inference.

    3. Obtain the output: The outputs parameter of the Predict interface holds the inference results. By parsing each output MSTensor, you can obtain the result data along with its data type and size (see the inference sketch after this list).

  4. Memory release: During the model compilation phase, resources such as resident memory, device memory, and thread pools are allocated. These resources must be released after model inference concludes to prevent resource leaks.
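
The sketches that follow illustrate the development process above; the file names, function names, and include paths in them are illustrative placeholders rather than fixed conventions.

For step 1, a model exported from a third-party framework can first be converted with the converter_lite tool. Converting an ONNX model, for example, typically looks like the following, where model.onnx and the output name model are placeholders:

    ./converter_lite --fmk=ONNX --modelFile=model.onnx --outputFile=model

The converted model can then be loaded through the Build interface in the same way as a MindIR model exported directly from MindSpore.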
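
For step 2, the following is a minimal C++ sketch of creating a Context and compiling a MindIR model with the Build interface. The CPU device choice, the helper name BuildModel, and the include paths are assumptions that depend on the MindSpore Lite cloud-side package in use.

    // Minimal sketch: configure a Context and compile a MindIR model.
    #include <iostream>
    #include <memory>
    #include <string>

    #include "include/api/context.h"
    #include "include/api/model.h"
    #include "include/api/status.h"

    bool BuildModel(mindspore::Model *model, const std::string &model_path) {
      // 1. Create a configuration context and choose the target device.
      auto context = std::make_shared<mindspore::Context>();
      auto device_info = std::make_shared<mindspore::CPUDeviceInfo>();
      // For Ascend cards, use mindspore::AscendDeviceInfo and set the device ID instead.
      context->MutableDeviceInfo().push_back(device_info);

      // 2. Load and compile the model: Build parses the MindIR file into a runtime
      //    model and compiles the graph, so call it once and reuse the model for
      //    multiple inferences.
      auto status = model->Build(model_path, mindspore::ModelType::kMindIR, context);
      if (status != mindspore::kSuccess) {
        std::cerr << "Build failed: " << status.ToString() << std::endl;
        return false;
      }
      return true;
    }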
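
For steps 3 and 4, the sketch below fills the first input tensor, runs Predict, and walks through the returned MSTensor outputs; the helper name RunInference and the single-input assumption are illustrative.

    // Minimal sketch: fill an input tensor, run Predict, and read the outputs.
    // Assumes the model has already been compiled with Build and that input_data
    // holds host data whose byte size matches the first input tensor exactly.
    #include <cstring>
    #include <iostream>
    #include <vector>

    #include "include/api/model.h"
    #include "include/api/status.h"
    #include "include/api/types.h"

    bool RunInference(mindspore::Model *model, const void *input_data, size_t input_size) {
      // 1. Fill the input: copy user data into the model's input MSTensor objects.
      auto inputs = model->GetInputs();
      if (inputs.empty() || inputs[0].DataSize() != input_size) {
        std::cerr << "Unexpected model inputs." << std::endl;
        return false;
      }
      std::memcpy(inputs[0].MutableData(), input_data, input_size);

      // 2. Execute inference: Predict writes the results into `outputs`.
      std::vector<mindspore::MSTensor> outputs;
      auto status = model->Predict(inputs, &outputs);
      if (status != mindspore::kSuccess) {
        std::cerr << "Predict failed: " << status.ToString() << std::endl;
        return false;
      }

      // 3. Obtain the output: each MSTensor exposes its name, element count, and raw data.
      for (auto &tensor : outputs) {
        std::cout << tensor.Name() << ": " << tensor.ElementNum() << " elements, "
                  << tensor.DataSize() << " bytes" << std::endl;
      }

      // 4. Memory release: the resident memory, device memory, and thread pools
      //    allocated during compilation are freed when the Model object is destroyed.
      return true;
    }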