[{"data":1,"prerenderedAt":669},["ShallowReactive",2],{"content-query-H7VnVsTDt4":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":663,"_id":664,"_source":665,"_file":666,"_stem":667,"_extension":668},"/version-updates/en/3602","en",false,"","MindSpore 2.5 Is Officially Released, Enhancing Dynamic Graph Performance, Improving Static Graph Generalization, and Lowering Foundation Model Inference Costs and Latency","After months of development and contributions from the MindSpore open-source community, MindSpore 2.5 is now available.","2025-02-12","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/cf7eb4fbcd7c4e248784871e7f0709b3.png","version-updates",{"type":14,"children":15,"toc":626},"root",[16,24,29,34,39,44,49,60,71,76,81,89,100,105,116,121,126,133,138,143,154,165,170,181,186,193,281,290,301,311,316,321,328,333,340,351,360,365,370,375,382,387,394,405,410,419,424,429,434,443,448,453,462,467,478,487,492,497,504,509,518,523,528,538,547,558,563,570,575,579,588,599,604,611,616],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"mindspore-25-is-officially-released-enhancing-dynamic-graph-performance-improving-static-graph-generalization-and-lowering-foundation-model-inference-costs-and-latency",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":23,"value":9},{"type":17,"tag":25,"props":30,"children":31},{},[32],{"type":23,"value":33},"Dynamic graphs are supplemented with the view and in-place functions, optimizing dynamic shape capabilities to boost execution performance. Improvements to graph kernel fusion enhance the generalization of static graphs in O1 mode. A resource-efficient simulation cluster execution process has been introduced to improve optimization efficiency. In addition, existing functions on Atlas A2 are smoothly migrated to supernodes. 
By leveraging the interconnection advantages, the performance can reach up to 2.9 times that of Atlas A2, continuously enhancing framework usability.",{"type":17,"tag":25,"props":35,"children":36},{},[37],{"type":23,"value":38},"In terms of foundation model inference, Golden Stick offers low-bit weight quantization and dynamic quantization algorithms to reduce inference costs. By integrating graph kernel fusion optimization, it lowers overall network latency and enhances throughput. Additionally, it supports the DiT text-to-image model with storage-to-computation (cache) and gate algorithms to decrease end-to-end latency, achieving improved inference performance for foundation models.",{"type":17,"tag":25,"props":40,"children":41},{},[42],{"type":23,"value":43},"In terms of tool efficiency enhancement, the msprobe tool introduces hierarchical visual comparison, enabling swift accuracy issue analysis. Furthermore, the Profiler features lightweight marking for rapid issue demarcation in cluster scenarios.",{"type":17,"tag":25,"props":45,"children":46},{},[47],{"type":23,"value":48},"Now, let's delve into the key features of MindSpore 2.5.",{"type":17,"tag":50,"props":51,"children":53},"h3",{"id":52},"improvement-of-framework-usability",[54],{"type":17,"tag":55,"props":56,"children":57},"strong",{},[58],{"type":23,"value":59},"Improvement of Framework Usability",{"type":17,"tag":50,"props":61,"children":63},{"id":62},"_1-dynamic-graphs-supplemented-with-the-view-and-in-place-functions-improving-tensor-index-performance-by-an-average-of-34-times",[64,66],{"type":23,"value":65},"1 ",{"type":17,"tag":55,"props":67,"children":68},{},[69],{"type":23,"value":70},"Dynamic Graphs Supplemented with the View and In-place Functions, Improving Tensor Index Performance by an Average of 3.4 Times",{"type":17,"tag":25,"props":72,"children":73},{},[74],{"type":23,"value":75},"In AI frameworks, tensor operations are classified into view operations and in-place operations based on 
common computing operations. The view operation creates a tensor that shares the same data storage as the original tensor but has a different shape or arrangement. It interprets existing data from different perspectives without copying, making the view operation efficient by avoiding unnecessary memory allocation and data replication. The in-place operation modifies the input tensor's content directly without creating a new tensor; by convention it is indicated by an underscore appended to the function name, such as add_(), which is the in-place version of add().",{"type":17,"tag":25,"props":77,"children":78},{},[79],{"type":23,"value":80},"The tensor index operation is a complex operation based on the view and in-place operations. The dynamic graphs of MindSpore 2.5 have been supplemented with the view and in-place operation capabilities, improving the tensor index operation performance. As shown in the following figure, the tensor index performance is improved by 3.4 times on average in different scenarios.",{"type":17,"tag":25,"props":82,"children":83},{},[84],{"type":17,"tag":85,"props":86,"children":88},"img",{"alt":7,"src":87},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/d0cb2e7936b647eeb0e55f0e30c5c022.png",[],{"type":17,"tag":50,"props":90,"children":92},{"id":91},"_2-reverse-engineering-the-dynamic-shape-capabilities-to-enhance-dynamic-graph-execution-performance-by-30",[93,95],{"type":23,"value":94},"2 ",{"type":17,"tag":55,"props":96,"children":97},{},[98],{"type":23,"value":99},"Reverse-engineering the Dynamic Shape Capabilities to Enhance Dynamic Graph Execution Performance by 30%",{"type":17,"tag":25,"props":101,"children":102},{},[103],{"type":23,"value":104},"In the dynamic shape scenario of MindSpore's dynamic graphs, the reverse execution first constructs a complete IR graph, which is then split into individual operators for execution. 
This process involves constructing the entire graph before splitting, resulting in redundant operations. In MindSpore 2.5, the reverse execution process is optimized to establish logical connections and directly execute graphs based on the connections. This optimization reduces redundant operations and improves the dynamic graph execution performance in the dynamic shape scenario by up to 30%, enhancing the end-to-end performance of both the SDXL network and the OpenSora network.",{"type":17,"tag":50,"props":106,"children":108},{"id":107},"_3-improving-graph-kernel-fusion-and-enhancing-the-generalization-of-static-graphs-in-o1-mode",[109,111],{"type":23,"value":110},"3 ",{"type":17,"tag":55,"props":112,"children":113},{},[114],{"type":23,"value":115},"Improving Graph Kernel Fusion and Enhancing the Generalization of Static Graphs in O1 Mode",{"type":17,"tag":25,"props":117,"children":118},{},[119],{"type":23,"value":120},"MindSpore 2.3 supports static graph O(n) multi-level build. In this version, the O1 mode primarily adds graph kernel fusion optimization on top of O0 to satisfy models that require higher training performance.",{"type":17,"tag":25,"props":122,"children":123},{},[124],{"type":23,"value":125},"After continuous optimization and extensive testing, the O1 mode in MindSpore 2.5 has met the requirements for generalization in most scenarios. The following figure shows a typical network test based on Atlas A2. Enabling the O1 mode achieves an average network performance acceleration of approximately 10%. 
The specific benefits depend on the network structure, operator usage, and tensor shape.",{"type":17,"tag":25,"props":127,"children":128},{},[129],{"type":17,"tag":85,"props":130,"children":132},{"alt":7,"src":131},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/7e3f186df8ec49178292421639e400c7.png",[],{"type":17,"tag":25,"props":134,"children":135},{},[136],{"type":23,"value":137},"After the O1 mode is enabled, graph kernel fusion can automatically identify subgraphs that can be fused and perform fusion and replacement during static graph build. Compared to manual fusion, graph kernel fusion offers advantages such as simplicity, ease of use, and better generalization.",{"type":17,"tag":25,"props":139,"children":140},{},[141],{"type":23,"value":142},"Reference link:",{"type":17,"tag":25,"props":144,"children":145},{},[146],{"type":17,"tag":147,"props":148,"children":152},"a",{"href":149,"rel":150},"https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.JitConfig.html",[151],"nofollow",[153],{"type":23,"value":149},{"type":17,"tag":50,"props":155,"children":157},{"id":156},"_4-smooth-migration-of-supernode-functions-fully-utilizing-interconnection-advantages",[158,160],{"type":23,"value":159},"4 ",{"type":17,"tag":55,"props":161,"children":162},{},[163],{"type":23,"value":164},"Smooth Migration of Supernode Functions, Fully Utilizing Interconnection Advantages",{"type":17,"tag":25,"props":166,"children":167},{},[168],{"type":23,"value":169},"MindSpore's existing features on Atlas A2 can be smoothly migrated to Atlas A3 without modification, fully leveraging the hardware capabilities of Atlas A3 and comprehensively supporting the training and inference processes of models. In the supernode sweet spot scenario, MindSpore enables affinity features like high-dimensional tensor parallelism and RingAttention. 
High-dimensional tensor parallelism optimizes the MatMul computation shape partition strategy based on communication savings, fully unleashing the hardware performance of supernodes, and delivers a 10% to 20% performance improvement for dense LLaMA models with hundreds of billions of parameters. By fully leveraging interconnection advantages, typical sparse models with hundreds of billions of parameters can achieve long-sequence performance on Atlas A3 that is 2.9 times higher than Atlas A2, supporting training with sequence lengths of 10-trillion-level tokens.",{"type":17,"tag":50,"props":171,"children":173},{"id":172},"_5-adding-the-simulation-cluster-execution-process-that-does-not-occupy-resources-improving-the-optimization-efficiency",[174,176],{"type":23,"value":175},"5 ",{"type":17,"tag":55,"props":177,"children":178},{},[179],{"type":23,"value":180},"Adding the Simulation Cluster Execution Process that Does Not Occupy Resources, Improving the Optimization Efficiency",{"type":17,"tag":25,"props":182,"children":183},{},[184],{"type":23,"value":185},"During training, to enhance the computing power or memory utilization of devices, it is often necessary to iteratively tune hyperparameters related to parallel strategies, recomputation, and load balancing. For massive clusters comprising tens of thousands of devices, the cost of this iterative tuning is extremely high. MindSpore 2.5 supports simulation execution. You can directly simulate the graph build result and memory usage of any device without occupying the device. As shown in the following figure, you can adjust the hyperparameters based on the memory result. After the adjustment, you can apply the same hyperparameters to the production cluster and start large-cluster training with one click. 
This reduces resource usage during debugging and greatly improves debugging and optimization efficiency.",{"type":17,"tag":25,"props":187,"children":188},{},[189],{"type":17,"tag":85,"props":190,"children":192},{"alt":7,"src":191},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/c5e2ffa6f7ac4311ab72d07f63ef5935.png",[],{"type":17,"tag":25,"props":194,"children":195},{},[196,198,204,206,211,213,218,219,224,226,231,233,237,239,243,245,249,251,256,257,261,263,267,269,273,275,279],{"type":23,"value":197},"You can set the simulation execution level based on your needs by setting the environment variable ",{"type":17,"tag":199,"props":200,"children":201},"em",{},[202],{"type":23,"value":203},"export MS_SIMULATION_LEVEL",{"type":23,"value":205}," to ",{"type":17,"tag":55,"props":207,"children":208},{},[209],{"type":23,"value":210},"0",{"type":23,"value":212},", ",{"type":17,"tag":55,"props":214,"children":215},{},[216],{"type":23,"value":217},"1",{"type":23,"value":212},{"type":17,"tag":55,"props":220,"children":221},{},[222],{"type":23,"value":223},"2",{"type":23,"value":225},", or ",{"type":17,"tag":55,"props":227,"children":228},{},[229],{"type":23,"value":230},"3",{"type":23,"value":232},". ",{"type":17,"tag":55,"props":234,"children":235},{},[236],{"type":23,"value":210},{"type":23,"value":238}," represents only model build; you can focus on model build time. 
",{"type":17,"tag":55,"props":240,"children":241},{},[242],{"type":23,"value":217},{"type":23,"value":244}," records the input and output memory information of the operator based on ",{"type":17,"tag":55,"props":246,"children":247},{},[248],{"type":23,"value":210},{"type":23,"value":250},"; you can analyze the memory based on memory statistics or ",{"type":17,"tag":55,"props":252,"children":253},{},[254],{"type":23,"value":255},"memory_tracker",{"type":23,"value":232},{"type":17,"tag":55,"props":258,"children":259},{},[260],{"type":23,"value":223},{"type":23,"value":262}," adds the operator workspace memory information based on ",{"type":17,"tag":55,"props":264,"children":265},{},[266],{"type":23,"value":217},{"type":23,"value":268},"; the expected number of devices to be simulated is required. ",{"type":17,"tag":55,"props":270,"children":271},{},[272],{"type":23,"value":230},{"type":23,"value":274}," adds the execution process of the compute operator on the current device based on ",{"type":17,"tag":55,"props":276,"children":277},{},[278],{"type":23,"value":223},{"type":23,"value":280},"; you can optimize the performance.",{"type":17,"tag":50,"props":282,"children":284},{"id":283},"foundation-model-inference-performance-improvement",[285],{"type":17,"tag":55,"props":286,"children":287},{},[288],{"type":23,"value":289},"Foundation Model Inference Performance Improvement",{"type":17,"tag":50,"props":291,"children":293},{"id":292},"_6-golden-stick-supporting-low-bit-weight-quantization-and-dynamic-quantization-algorithms-reducing-inference-costs",[294,296],{"type":23,"value":295},"6 ",{"type":17,"tag":55,"props":297,"children":298},{},[299],{"type":23,"value":300},"Golden Stick Supporting Low-Bit Weight Quantization and Dynamic Quantization Algorithms, Reducing Inference 
Costs",{"type":17,"tag":302,"props":303,"children":305},"h4",{"id":304},"_61-golden-stick-adding-the-awq-and-gptq-algorithms-and-providing-4-bit-weight-quantization-inference-capability-reducing-inference-latency-by-40-and-the-number-of-parameters-by-60",[306],{"type":17,"tag":55,"props":307,"children":308},{},[309],{"type":23,"value":310},"6.1 Golden Stick Adding the AWQ and GPTQ Algorithms and Providing 4-Bit Weight Quantization Inference Capability, Reducing Inference Latency by 40% and the Number of Parameters by 60%",{"type":17,"tag":25,"props":312,"children":313},{},[314],{"type":23,"value":315},"Activation-aware weight quantization (AWQ) is a low-bit weight quantization algorithm. It selects significant weights based on the activation value distribution and protects the significant weights through scaling in consideration of hardware efficiency, implementing a hardware-friendly high-accuracy weight quantization algorithm. Like AWQ, GPTQ is a post-training, low-bit weight quantization algorithm. The core idea of the GPTQ algorithm is to quantize all parameters in a block one by one. After each parameter is quantized, other unquantized parameters in the block need to be adjusted to compensate for the accuracy loss caused by quantization.",{"type":17,"tag":25,"props":317,"children":318},{},[319],{"type":23,"value":320},"In MindSpore 2.5, Golden Stick reproduces the two weight quantization algorithms and optimizes the inference performance of 4-bit weight quantization. We performed performance and accuracy tests using eight-device tensor parallel inference on the Ascend Atlas 800I A2 hardware, and the results are shown in the following table. 
AWQ and GPTQ implemented A16W4 quantization with almost lossless accuracy on the BoolQ, SQuAD 1.1, and WikiText2 datasets.",{"type":17,"tag":25,"props":322,"children":323},{},[324],{"type":17,"tag":85,"props":325,"children":327},{"alt":7,"src":326},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/17d8401167f844deacf5b1b3098488e9.png",[],{"type":17,"tag":25,"props":329,"children":330},{},[331],{"type":23,"value":332},"For the LLaMA2-70B network, as illustrated in the following figure, 4-bit weight quantization can achieve up to 48.7% latency reduction when the batch size is less than 8. However, when the batch size is 16 or greater, the 4-bit weight quantization has a detrimental effect on latency due to the Ascend hardware architecture.",{"type":17,"tag":25,"props":334,"children":335},{},[336],{"type":17,"tag":85,"props":337,"children":339},{"alt":7,"src":338},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/8c7ee24b0ec443ff8b2bb92a07e46bc1.png",[],{"type":17,"tag":25,"props":341,"children":342},{},[343,345],{"type":23,"value":344},"Reference link: ",{"type":17,"tag":147,"props":346,"children":349},{"href":347,"rel":348},"https://gitee.com/mindspore/golden-stick/tree/master/mindspore_gs/ptq/ptq#gptq%E7%AE%97%E6%B3%95",[151],[350],{"type":23,"value":347},{"type":17,"tag":302,"props":352,"children":354},{"id":353},"_62-golden-stick-adding-the-activation-dynamic-quantization-algorithm-to-enhance-8-bit-quantization-accuracy",[355],{"type":17,"tag":55,"props":356,"children":357},{},[358],{"type":23,"value":359},"6.2 Golden Stick Adding the Activation Dynamic Quantization Algorithm to Enhance 8-Bit Quantization Accuracy",{"type":17,"tag":25,"props":361,"children":362},{},[363],{"type":23,"value":364},"For some models and tasks highly sensitive to accuracy, even with outlier suppression technologies like SmoothQuant, it may still be challenging to meet the accuracy requirements. 
In such cases, dynamic quantization can be employed to further minimize quantization accuracy losses, albeit at the expense of some 8-bit quantization performance benefits.",{"type":17,"tag":25,"props":366,"children":367},{},[368],{"type":23,"value":369},"Unlike weight quantization, the algorithm cannot obtain the actual activation values in the offline quantization phase; it can only approximate the distribution of activations using the calibration set, resulting in additional quantization saturation errors. Activation dynamic quantization refers to the process of collecting statistics on the activation distribution in real-time during inference to implement quantization inference, reducing the quantization accuracy loss.",{"type":17,"tag":25,"props":371,"children":372},{},[373],{"type":23,"value":374},"In MindSpore 2.5, Golden Stick offers the activation dynamic quantization algorithm, integrating static weight quantization and SmoothQuant outlier suppression technologies to deliver nearly lossless 8-bit quantization capabilities. We performed tests using eight-device tensor parallelism on the Ascend Atlas 800I A2 hardware, and the results are as shown in the following table. The activation dynamic quantization algorithm achieved lossless A8W8 quantization on the C-Eval and SQuAD1.1 datasets.",{"type":17,"tag":25,"props":376,"children":377},{},[378],{"type":17,"tag":85,"props":379,"children":381},{"alt":7,"src":380},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/8b26e6156ca84d12b2520bfdfe5116a7.png",[],{"type":17,"tag":25,"props":383,"children":384},{},[385],{"type":23,"value":386},"We performed fusion optimization for the RMSNorm and DynamicQuant operators. 
On the LLaMA2-57B network, as shown in the following figure, with the batch_size range set to [1,16] and the seq_length range set to [512,2048], we achieved an end-to-end time reduction of 3.8% to 8.1%.",{"type":17,"tag":25,"props":388,"children":389},{},[390],{"type":17,"tag":85,"props":391,"children":393},{"alt":7,"src":392},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/723beaccdc45492591857923bc3e012c.png",[],{"type":17,"tag":50,"props":395,"children":397},{"id":396},"_7-leveraging-graph-kernel-fusion-optimization-to-reduce-network-wide-inference-latency-and-improve-throughput",[398,400],{"type":23,"value":399},"7 ",{"type":17,"tag":55,"props":401,"children":402},{},[403],{"type":23,"value":404},"Leveraging Graph Kernel Fusion Optimization to Reduce Network-wide Inference Latency and Improve Throughput",{"type":17,"tag":25,"props":406,"children":407},{},[408],{"type":23,"value":409},"MindSpore 2.5 leverages graph kernel fusion technology to specifically handle various quantization scenarios, minimizing additional quantization operation-induced delivery time and memory access. It achieves improved computing efficiency through the Ascend affinity fusion operator and collaborates with quantization compression and PrefillFlatten technologies. By working with the Atlas A2 and Atlas inference series products, it optimizes network-wide latency reduction and throughput enhancement.",{"type":17,"tag":302,"props":411,"children":413},{"id":412},"_71-fusion-of-hidden-quantization-computation-overhead-accelerating-the-entire-network-by-3-to-8",[414],{"type":17,"tag":55,"props":415,"children":416},{},[417],{"type":23,"value":418},"7.1 Fusion of Hidden Quantization Computation Overhead, Accelerating the Entire Network by 3% to 8%",{"type":17,"tag":25,"props":420,"children":421},{},[422],{"type":23,"value":423},"PagedAttention supports the use of quantized KVCache as input, and integrates dequantization operations within the operator. 
By utilizing the cube/vector features of Ascend hardware, it enables parallel multi-computation pipelines, allowing the cube/vector computation latencies to mask each other and significantly improving the operator's performance. Compared to floating-point input, the performance is boosted by over 10%.",{"type":17,"tag":25,"props":425,"children":426},{},[427],{"type":23,"value":428},"We also provide a fusion pattern based on RmsNorm to merge forward Add operations and backward quantizations into larger vector operators. By utilizing key technologies such as multi-channel fusion parallelism, UB Bank conflict optimization, and in-place memory overcommitment, the operator performance improves by 10–30% after fusion. The above fusion simultaneously supports the Quant operator (per channel quantization) and the DynamicQuant operator (per token activation dynamic quantization).",{"type":17,"tag":25,"props":430,"children":431},{},[432],{"type":23,"value":433},"In addition, the MatMul parallel fusion and backward fusion, which were supported in previous versions, are now compatible with quantized data type inputs in MindSpore 2.5.",{"type":17,"tag":302,"props":435,"children":437},{"id":436},"_72-prefillflatten-load-balancing-throughput-improvement-by-5",[438],{"type":17,"tag":55,"props":439,"children":440},{},[441],{"type":23,"value":442},"7.2 PrefillFlatten Load Balancing, Throughput Improvement by 5%",{"type":17,"tag":25,"props":444,"children":445},{},[446],{"type":23,"value":447},"In foundation model inference scenarios where sequential data is processed, padding is often used to make the sequence length consistent for batch processing. This inevitably increases redundant computations.",{"type":17,"tag":25,"props":449,"children":450},{},[451],{"type":23,"value":452},"MindSpore 2.5 uses the PrefillFlatten method to concatenate input sequences according to their actual lengths, eliminating the need to pad them to a uniform length. 
Regarding this optimization, the Attention computation modules (FlashAttention, PagedAttention, ApplyRotaryPosEmb operators), combined with the Ascend AI Processor features, sort the actual lengths of the input sequences within the operator. It dynamically allocates different numbers of cores based on the computational load of each batch, ensuring that longer sequences can be distributed across different computing units. This reduces the amount of computation while ensuring load balancing for each computing unit. At the same time, it optimizes the parallel task processing of vector computation pipelines, improves UB utilization, and further increases the overall computational efficiency by more than 10%.",{"type":17,"tag":302,"props":454,"children":456},{"id":455},"_73-flexformat-fusion-achieving-matrix-multiplication-optimization-accelerating-the-entire-network-by-57",[457],{"type":17,"tag":55,"props":458,"children":459},{},[460],{"type":23,"value":461},"7.3 FlexFormat Fusion Achieving Matrix Multiplication Optimization, Accelerating the Entire Network by 5–7%",{"type":17,"tag":25,"props":463,"children":464},{},[465],{"type":23,"value":466},"Frequent conversions between the special formats required for Cube computation on Atlas inference hardware and the native formats introduce significant overhead. Additionally, the performance of the same operator can vary greatly depending on the format used. To address this issue, MindSpore 2.5 implements a more flexible format selection and operator fusion optimization scheme. By combining efficient UB scheduling, buffer reuse, and other technologies, Cube computation can effectively mask the time spent on format conversions, resulting in a substantial performance improvement. This optimization scheme supports various types of MatMul operators, including floating-point types, quantized types, and sparse quantization. 
Performance improvements of 3–40% can be achieved for individual operator fusion, with end-to-end performance improving by 5–7%.",{"type":17,"tag":50,"props":468,"children":470},{"id":469},"_8-support-for-dit-text-to-image-model-with-storage-to-computation-cache-and-gate-algorithms-reducing-end-to-end-latency-by-32",[471,473],{"type":23,"value":472},"8 ",{"type":17,"tag":55,"props":474,"children":475},{},[476],{"type":23,"value":477},"Support for DiT Text-to-Image Model with Storage-to-Computation (Cache) and Gate Algorithms, Reducing End-to-End Latency by 32%",{"type":17,"tag":302,"props":479,"children":481},{"id":480},"_81-reducing-attention-mechanism-computation-with-the-cache-algorithm-accelerating-end-to-end-processing-by-24",[482],{"type":17,"tag":55,"props":483,"children":484},{},[485],{"type":23,"value":486},"8.1 Reducing Attention Mechanism Computation with the Cache Algorithm, Accelerating End-to-End Processing by 24%",{"type":17,"tag":25,"props":488,"children":489},{},[490],{"type":23,"value":491},"Mainstream text-to-image generation models typically use a multi-step iterative diffusion denoising method based on the attention mechanism. During these multiple iterations, there is often redundant computation between adjacent time steps, and these computations exhibit high similarity. By identifying and reusing the results of these redundant computations, it is possible to reduce the computational load while maintaining accuracy.",{"type":17,"tag":25,"props":493,"children":494},{},[495],{"type":23,"value":496},"Based on this idea and the design concept of Delta-Cache, MindSpore 2.5 introduces the cache algorithm for the DiT text-to-image model. 
As shown in the diagram below, this method caches the offset of two specific feature values (⑦-⑥) from the previous time step (xt) and applies this offset directly to the input of layer B1 in the next time step (xt-1), thus skipping the computation process from B1 to B3.",{"type":17,"tag":25,"props":498,"children":499},{},[500],{"type":17,"tag":85,"props":501,"children":503},{"alt":7,"src":502},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/ed5544cd510647c9ad02e0c1440a5f8a.png",[],{"type":17,"tag":25,"props":505,"children":506},{},[507],{"type":23,"value":508},"Unlike the deployment strategy used by Delta-Cache, we have re-adjusted the parameters to better fit different model structures. Test results on the SD3 model show that when images with a resolution of 1024 x 1024 pixels are generated, this algorithm can reduce end-to-end inference latency by approximately 24% with almost no loss in accuracy.",{"type":17,"tag":302,"props":510,"children":512},{"id":511},"_82-adding-the-gate-algorithm-for-an-additional-10-inference-acceleration",[513],{"type":17,"tag":55,"props":514,"children":515},{},[516],{"type":23,"value":517},"8.2 Adding the Gate Algorithm for an Additional 10% Inference Acceleration",{"type":17,"tag":25,"props":519,"children":520},{},[521],{"type":23,"value":522},"Most mainstream text-to-image models use classifier-free guidance (CFG) technology, which involves executing two generation processes during each iteration: one is unconditional guidance, and the other is conditional text guidance. 
Experiments reveal that during the later stages of iteration, where the focus is on improving image quality, unconditional guidance has a relatively minor impact on the generation results.",{"type":17,"tag":25,"props":524,"children":525},{},[526],{"type":23,"value":527},"Based on this observation, MindSpore 2.5 introduces the Gate algorithm, which allows for the deactivation of unconditional guidance computation at specific time steps in text-to-image models that employ CFG technology. Additionally, the Gate algorithm can be used in collaboration with the aforementioned cache algorithm to further optimize performance. When the Gate algorithm is applied to the SD3 model, the inference latency is reduced by an additional 10% compared to using only the cache algorithm, achieving a total inference acceleration of approximately 32%.",{"type":17,"tag":25,"props":529,"children":530},{},[531,532],{"type":23,"value":344},{"type":17,"tag":147,"props":533,"children":536},{"href":534,"rel":535},"https://github.com/mindspore-lab/mindone/tree/master/examples/dit_infer_acceleration",[151],[537],{"type":23,"value":534},{"type":17,"tag":50,"props":539,"children":541},{"id":540},"tool-efficiency-improvement",[542],{"type":17,"tag":55,"props":543,"children":544},{},[545],{"type":23,"value":546},"Tool Efficiency Improvement",{"type":17,"tag":50,"props":548,"children":550},{"id":549},"_9-msprobe-tool-adding-hierarchical-visual-comparison-for-quick-accuracy-issue-analysis",[551,553],{"type":23,"value":552},"9 ",{"type":17,"tag":55,"props":554,"children":555},{},[556],{"type":23,"value":557},"msprobe Tool Adding Hierarchical Visual Comparison for Quick Accuracy Issue Analysis",{"type":17,"tag":25,"props":559,"children":560},{},[561],{"type":23,"value":562},"To address the low efficiency of locating accuracy issues in large model scenarios and the lack of intuitive presentation of accuracy data, the msprobe tool has been enhanced to support hierarchical visual comparison for MindSpore 
scenarios. This feature allows for comparing accuracy data across different levels of the model, helping users better understand the model structure and quickly analyze accuracy issues. As illustrated in the following figure, users can choose to view a single graph to observe the model structure or select a dual-graph comparison to perform cross-framework comparisons between MindSpore and PyTorch. The visual comparison displays the hierarchical structure of the model, with each node showing input-output data information, stack details, and more. It also supports searching by node name and filtering by node color to analyze accuracy.",{"type":17,"tag":25,"props":564,"children":565},{},[566],{"type":17,"tag":85,"props":567,"children":569},{"alt":7,"src":568},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/014cb18fbd3a41d380a42b039dfa082b.png",[],{"type":17,"tag":25,"props":571,"children":572},{},[573],{"type":23,"value":574},"The msprobe tool supports visual alignment and analysis of accuracy based on model structure, enabling users to quickly identify model implementation differences and accuracy anomalies with just one click, significantly improving the efficiency of accuracy comparison and analysis.",{"type":17,"tag":25,"props":576,"children":577},{},[578],{"type":23,"value":142},{"type":17,"tag":25,"props":580,"children":581},{},[582],{"type":17,"tag":147,"props":583,"children":586},{"href":584,"rel":585},"https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/msprobe#/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/./docs/22.visualization_MindSpore.md",[151],[587],{"type":23,"value":584},{"type":17,"tag":50,"props":589,"children":591},{"id":590},"_10-profiler-implementing-lightweight-marking-to-support-quick-issue-demarcation-in-cluster-environments",[592,594],{"type":23,"value":593},"10 ",{"type":17,"tag":55,"props":595,"children":596},{},[597],{"type":23,"value":598},"Profiler Implementing Lightweight Marking to 
Support Quick Issue Demarcation in Cluster Environments",{"type":17,"tag":25,"props":600,"children":601},{},[602],{"type":23,"value":603},"Traditional profiling is time-consuming and produces large amounts of data in large cluster scenarios. To address this, MindSpore 2.5 offers a lightweight profiling capability that collects performance data for critical model metrics at low cost in large-scale clusters. As illustrated in the following figure, users can add custom marks through the mstx.mark, mstx.range_start, and mstx.range_end interfaces, and built-in marking of communication operators is also supported. When users enable the lightweight marking function, marking is automatically performed before and after the communication operators. All marking tasks are issued by the runtime to the device side, so the time points or time slices of the marking tasks can be presented on both the host side and the device side.",{"type":17,"tag":25,"props":605,"children":606},{},[607],{"type":17,"tag":85,"props":608,"children":610},{"alt":7,"src":609},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/02/12/30b5157450b14e60a3f0ca3b757fd776.png",[],{"type":17,"tag":25,"props":612,"children":613},{},[614],{"type":23,"value":615},"Lightweight marking supports large-cluster training scenarios in MindSpore 2.5, making it possible to locate issue boundaries with only a small amount of data.",{"type":17,"tag":25,"props":617,"children":618},{},[619,620],{"type":23,"value":344},{"type":17,"tag":147,"props":621,"children":624},{"href":622,"rel":623},"https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.profiler.mstx.html?highlight=mstx#mindspore.profiler.mstx",[151],[625],{"type":23,"value":622},{"title":7,"searchDepth":627,"depth":627,"links":628},4,[629,631,633,635,637,639,641,642,647,653,658,659,661],{"id":52,"depth":630,"text":59},3,{"id":62,"depth":630,"text":632},"1 Dynamic Graphs 
Supplemented with the View and In-place Functions, Improving Tensor Index Performance by an Average of 3.4 Times",{"id":91,"depth":630,"text":634},"2 Reverse-engineering the Dynamic Shape Capabilities to Enhance Dynamic Graph Execution Performance by 30%",{"id":107,"depth":630,"text":636},"3 Improving Graph Kernel Fusion and Enhancing the Generalization of Static Graphs in O1 Mode",{"id":156,"depth":630,"text":638},"4 Smooth Migration of Supernode Functions, Fully Utilizing Interconnection Advantages",{"id":172,"depth":630,"text":640},"5 Adding the Simulation Cluster Execution Process that Does Not Occupy Resources, Improving the Optimization Efficiency",{"id":283,"depth":630,"text":289},{"id":292,"depth":630,"text":643,"children":644},"6 Golden Stick Supporting Low-Bit Weight Quantization and Dynamic Quantization Algorithms, Reducing Inference Costs",[645,646],{"id":304,"depth":627,"text":310},{"id":353,"depth":627,"text":359},{"id":396,"depth":630,"text":648,"children":649},"7 Leveraging Graph Kernel Fusion Optimization to Reduce Network-wide Inference Latency and Improve Throughput",[650,651,652],{"id":412,"depth":627,"text":418},{"id":436,"depth":627,"text":442},{"id":455,"depth":627,"text":461},{"id":469,"depth":630,"text":654,"children":655},"8 Support for DiT Text-to-Image Model with Storage-to-Computation (Cache) and Gate Algorithms, Reducing End-to-End Latency by 32%",[656,657],{"id":480,"depth":627,"text":486},{"id":511,"depth":627,"text":517},{"id":540,"depth":630,"text":546},{"id":549,"depth":630,"text":660},"9 msprobe Tool Adding Hierarchical Visual Comparison for Quick Accuracy Issue Analysis",{"id":590,"depth":630,"text":662},"10 Profiler Implementing Lightweight Marking to Support Quick Issue Demarcation in Cluster Environments","markdown","content:version-updates:en:3602.md","content","version-updates/en/3602.md","version-updates/en/3602","md",1776506143806]