[{"data":1,"prerenderedAt":423},["ShallowReactive",2],{"content-query-jMwVcwcgH9":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":417,"_id":418,"_source":419,"_file":420,"_stem":421,"_extension":422},"/news/en/3012","en",false,"","MindSpore-powered PanGu-Draw Model Advances Training and Inference Efficiency and Effect of Text-to-Image Models","Huawei Noah&#39;s Ark Laboratory, collaborating with partners, has recently launched the MindSpore-powered PanGu-Draw model architecture.","2024-01-16","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/11/29/7f288fe1208c492796b996d59486805e.png","news",{"type":14,"children":15,"toc":414},"root",[16,24,30,35,40,45,50,55,60,65,70,81,86,95,100,109,114,119,127,132,137,142,147,152,159,164,169,174,181,186,191,196,201,206,211,218,223,230,235,242,247,252,257,264,269,274,281,286,291,298,303,310,315,320,325,336,346,356,361,366,371,376,389,394,399,404,409],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"mindspore-powered-pangu-draw-model-advances-training-and-inference-efficiency-and-effect-of-text-to-image-models",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":23,"value":29},"Huawei Noah's Ark Laboratory, collaborating with partners, has recently launched the MindSpore-powered PanGu-Draw model architecture. With the support of Ascend Atlas hardware, PanGu-Draw significantly improves the efficiency of data utilization, training, and inference of text-to-image models. PanGu-Draw supports efficient fusion of several image diffusion models and further improves generation quality through multi-control image generation and one-stage super-resolution. 
(The inference code has been open-sourced in the MindONE repository.)",{"type":17,"tag":25,"props":31,"children":32},{},[33],{"type":23,"value":34},"In the AI field, text-to-image models, such as SD[1], SDXL[2], Imagen[3], and DALL-E 3[4], have made significant strides in performance. Fine-tuning techniques such as ControlNet[5] and LoRA[6] have contributed to the widespread adoption of these models in tasks like generating images from reference images, line drawings, and human poses.",{"type":17,"tag":25,"props":36,"children":37},{},[38],{"type":23,"value":39},"As the number of parameters in text-to-image models increases and the resolution of generated images improves, the demand for training data and computational resources also grows. It is vital to improve the data utilization, training, and inference efficiency of these models: doing so minimizes resource consumption, accelerates model iteration and updates, and expands the range of application scenarios.",{"type":17,"tag":25,"props":41,"children":42},{},[43],{"type":23,"value":44},"Huawei has introduced PanGu-Draw, a novel model architecture, to tackle these issues. The architecture comprises two innovations: a time-decoupling training strategy and the Coop-Diffusion algorithm. The efficiency-oriented training strategy divides the model into two sub-models, one for structure generation and one for texture generation, and optimizes the training of each, thereby improving data utilization efficiency by about 48%, training efficiency by 51%, and inference efficiency by 50%. The Coop-Diffusion algorithm integrates image diffusion models operating in different latent spaces or at different resolutions, providing a new approach to innovative image generation tasks.",{"type":17,"tag":25,"props":46,"children":47},{},[48],{"type":23,"value":49},"PanGu-Draw is now the largest Chinese text-to-image model in the industry. 
Developed on MindSpore with Ascend training hardware, it has been iteratively upgraded from Wukong-Huahua, with the number of parameters increasing from 1 billion to 5 billion. The model not only outperforms open-source models like Taiyi-CN[7] and SDXL in generation quality, but also rivals industry-leading closed-source models such as DALL-E 3 and MJ v5.2. Moreover, it offers the flexibility of handling mixed input containing both Chinese and English characters. It directly produces native 1024 x 1024 images and also provides output in various aspect ratios, including 16:9, 4:3, and 2:1.",{"type":17,"tag":25,"props":51,"children":52},{},[53],{"type":23,"value":54},"To enhance the inference process of text-to-image models, PanGu-Draw provides customizable options for generating different image styles, such as animation, art, and photography, through quantifiable values.",{"type":17,"tag":25,"props":56,"children":57},{},[58],{"type":23,"value":59},"Paper Title",{"type":17,"tag":25,"props":61,"children":62},{},[63],{"type":23,"value":64},"PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion",{"type":17,"tag":25,"props":66,"children":67},{},[68],{"type":23,"value":69},"Paper URL",{"type":17,"tag":25,"props":71,"children":72},{},[73],{"type":17,"tag":74,"props":75,"children":79},"a",{"href":76,"rel":77},"https://arxiv.org/abs/2312.16486",[78],"nofollow",[80],{"type":23,"value":76},{"type":17,"tag":25,"props":82,"children":83},{},[84],{"type":23,"value":85},"Project Homepage",{"type":17,"tag":25,"props":87,"children":88},{},[89],{"type":17,"tag":74,"props":90,"children":93},{"href":91,"rel":92},"https://pangu-draw.github.io",[78],[94],{"type":23,"value":91},{"type":17,"tag":25,"props":96,"children":97},{},[98],{"type":23,"value":99},"Code Repository",{"type":17,"tag":25,"props":101,"children":102},{},[103],{"type":17,"tag":74,"props":104,"children":107},{"href":105,"rel":106},"https://github.com/mindspore-lab/mindone/tree/master/examples/pangu_draw_v3",[78],[108],{"type":23,"value":105},{"type":17,"tag":25,"props":110,"children":111},{},[112],{"type":23,"value":113},"1. Method Introduction",{"type":17,"tag":25,"props":115,"children":116},{},[117],{"type":23,"value":118},"1.1 Resource-Efficient Text-to-Image Model Training Strategy: Time-Decoupling Training Strategy",{"type":17,"tag":25,"props":120,"children":121},{},[122],{"type":17,"tag":123,"props":124,"children":126},"img",{"alt":7,"src":125},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/b43a61a532db4ea5aad3e60d54f20f74.png",[],{"type":17,"tag":25,"props":128,"children":129},{},[130],{"type":23,"value":131},"Figure 1. Illustration of three text-to-image training strategies and a comparison of their resource efficiency in data, training, and inference. The proposed time-decoupling training strategy significantly surpasses the representative cascaded training and resolution boost training strategies in resource efficiency.",{"type":17,"tag":25,"props":133,"children":134},{},[135],{"type":23,"value":136},"As shown in Figure 1, the time-decoupling training strategy is more resource-efficient in data utilization, training, and inference, compared with the conventional cascaded training and resolution boost training strategies.",{"type":17,"tag":25,"props":138,"children":139},{},[140],{"type":23,"value":141},"The cascaded training strategy achieves high data utilization efficiency at the cost of roughly tripled training and inference time. On the other hand, the resolution boost training strategy raises image resolution significantly after low-resolution training, which saves time but falls short on data utilization efficiency. 
In contrast, the time-decoupling training strategy takes a clever approach by dividing a text-to-image model into two specialized sub-models: a structure generator and a texture generator. Such a decoupling strategy not only reduces the dependency on high-performance computing resources, but also simplifies the training process and avoids complex model sharding and inter-node communication overheads.",{"type":17,"tag":25,"props":143,"children":144},{},[145],{"type":23,"value":146},"In the inference phase, the structure generator first generates the basic contours from a noise image, and then the texture generator adds details to them, improving inference efficiency by about 50%. In addition, the structure generator is trained on the full data (including high-resolution and enlarged low-resolution images), improving data utilization by about 48%. The texture generator is trained at a lower resolution but applied to upsampled higher-resolution data, improving the overall training efficiency by about 51%.",{"type":17,"tag":25,"props":148,"children":149},{},[150],{"type":23,"value":151},"1.2 Fusion Algorithm for Multiple Diffusion Models: Coop-Diffusion",{"type":17,"tag":25,"props":153,"children":154},{},[155],{"type":17,"tag":123,"props":156,"children":158},{"alt":7,"src":157},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/fd3eff3241ca436ebcc3c1cdd2d58152.png",[],{"type":17,"tag":25,"props":160,"children":161},{},[162],{"type":23,"value":163},"Figure 2. Visualization of the Coop-Diffusion algorithm. The Huawei-proposed algorithm uses two submodules to eliminate the differences caused by different latent spaces and resolutions, unifying the denoising processes of multiple diffusion models into the same space. 
This provides a new way for generating multi-condition images.",{"type":17,"tag":25,"props":165,"children":166},{},[167],{"type":23,"value":168},"The algorithm first uses the pixel image space as a bridge to unify model predictions made in different latent spaces, achieving multi-model fusion in the same latent space. It then fuses diffusion models with different resolutions by downsampling or upsampling after specific steps, without compromising image quality.",{"type":17,"tag":25,"props":170,"children":171},{},[172],{"type":23,"value":173},"The innovation of the Coop-Diffusion algorithm lies in its ability to fuse multiple incompatible diffusion models into a unified one, thereby improving the flexibility and efficiency of models in practical applications while maintaining image quality. The following figure shows the complete algorithm process.",{"type":17,"tag":25,"props":175,"children":176},{},[177],{"type":17,"tag":123,"props":178,"children":180},{"alt":7,"src":179},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/fdacffb530bc46d6afb643d3d9b953b3.png",[],{"type":17,"tag":25,"props":182,"children":183},{},[184],{"type":23,"value":185},"Figure 3. Coop-Diffusion: process for fusing multiple diffusion models.",{"type":17,"tag":25,"props":187,"children":188},{},[189],{"type":23,"value":190},"2. Experiment Results",{"type":17,"tag":25,"props":192,"children":193},{},[194],{"type":23,"value":195},"Huawei's research team has successfully implemented the time-decoupling training strategy on the MindSpore framework and created the groundbreaking text-to-image model PanGu-Draw. The model has 5 billion parameters and can generate high-resolution, high-quality images in both Chinese and English. To support both languages, PanGu-Draw uses dedicated Chinese and English text encoders to extract features from the input text. 
In addition, to meet the requirements of generating multi-resolution images, the team selected 11 different resolutions close to 1024 x 1024 and incorporated the corresponding positional encodings into the model.",{"type":17,"tag":25,"props":197,"children":198},{},[199],{"type":23,"value":200},"In terms of training data construction, the team selected data from multiple sources, including Noah-Wukong[8], LAION[9], photography, animation, portrait, and game materials, to ensure data diversity and high quality. The data is strictly filtered by CLIP scores, aesthetic scores, and watermark scores. The team also removed low-quality text annotations while adopting approaches based on open-set detection models and large language models to obtain high-quality text annotations.",{"type":17,"tag":25,"props":202,"children":203},{},[204],{"type":23,"value":205},"Finally, PanGu-Draw uses technologies such as Flash Attention, mixed-precision training, and gradient accumulation to optimize device memory usage during training.",{"type":17,"tag":25,"props":207,"children":208},{},[209],{"type":23,"value":210},"2.1 Comparison of Quantitative Indicators",{"type":17,"tag":25,"props":212,"children":213},{},[214],{"type":17,"tag":123,"props":215,"children":217},{"alt":7,"src":216},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/ba87c9e7d845457aac4953e9ecc75138.png",[],{"type":17,"tag":25,"props":219,"children":220},{},[221],{"type":23,"value":222},"Table 1. Comparison of PanGu-Draw with representative English text-to-image models on the COCO dataset. 
PanGu-Draw surpasses open-source models and rivals the best closed-source models in the industry in generation quality.",{"type":17,"tag":25,"props":224,"children":225},{},[226],{"type":17,"tag":123,"props":227,"children":229},{"alt":7,"src":228},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/084d888afbbd46e2b71f1cc1c3923c93.png",[],{"type":17,"tag":25,"props":231,"children":232},{},[233],{"type":23,"value":234},"Table 2. Comparison of PanGu-Draw with representative Chinese text-to-image models on the COCO-CN dataset. PanGu-Draw achieves the best generation quality.",{"type":17,"tag":25,"props":236,"children":237},{},[238],{"type":17,"tag":123,"props":239,"children":241},{"alt":7,"src":240},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/fa7b9a25c5394386ba4f0f02bd0cff2a.png",[],{"type":17,"tag":25,"props":243,"children":244},{},[245],{"type":23,"value":246},"Table 3. Comparison between PanGu-Draw and baseline models in manual evaluation. PanGu-Draw has better generation quality than SD and SDXL and is on a par with DALL-E 3 and MJ v5.2.",{"type":17,"tag":25,"props":248,"children":249},{},[250],{"type":23,"value":251},"Tables 1, 2, and 3 compare PanGu-Draw with baseline models on the Chinese and English COCO datasets in terms of quantitative indicators and manual evaluation. The results show that PanGu-Draw achieves the best generation quality among open-source models and rivals industry-leading closed-source models such as DALL-E 3 and MJ v5.2.",{"type":17,"tag":25,"props":253,"children":254},{},[255],{"type":23,"value":256},"2.2 Visualized Result Comparison",{"type":17,"tag":25,"props":258,"children":259},{},[260],{"type":17,"tag":123,"props":261,"children":263},{"alt":7,"src":262},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/f2588d5e4fbb4fb7ad2afe2a71eba6be.png",[],{"type":17,"tag":25,"props":265,"children":266},{},[267],{"type":23,"value":268},"Figure 4. 
Visual comparison of the PanGu-Draw model with baseline methods. The input prompts are sourced from RAPHAEL and are displayed at the bottom of the figure. The results of PanGu-Draw are better than or on par with those of these top-performing baseline models.",{"type":17,"tag":25,"props":270,"children":271},{},[272],{"type":23,"value":273},"2.3 Display of Generation Results",{"type":17,"tag":25,"props":275,"children":276},{},[277],{"type":17,"tag":123,"props":278,"children":280},{"alt":7,"src":279},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/4bc7113ac5014eb0a76a3304580721f6.png",[],{"type":17,"tag":25,"props":282,"children":283},{},[284],{"type":23,"value":285},"Figure 5. Multi-resolution high-quality images generated by PanGu-Draw that are consistent with the input prompts.",{"type":17,"tag":25,"props":287,"children":288},{},[289],{"type":23,"value":290},"Figure 5 shows the text-to-image results of PanGu-Draw. For more visualization results, see the project homepage.",{"type":17,"tag":25,"props":292,"children":293},{},[294],{"type":17,"tag":123,"props":295,"children":297},{"alt":7,"src":296},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/8eff3683dab84227aefc00ea70d56530.png",[],{"type":17,"tag":25,"props":299,"children":300},{},[301],{"type":23,"value":302},"Figure 6. Multi-condition (Chinese text and images) generation results of an image reconstruction model and PanGu-Draw based on Coop-Diffusion.",{"type":17,"tag":25,"props":304,"children":305},{},[306],{"type":17,"tag":123,"props":307,"children":309},{"alt":7,"src":308},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/8bc8e9d1fded45efadbdda817cbddd29.png",[],{"type":17,"tag":25,"props":311,"children":312},{},[313],{"type":23,"value":314},"Figure 7. Single-stage super-resolution images generated with a low-resolution model and PanGu-Draw based on Coop-Diffusion. 
The left column shows the generation results of a low-resolution image generation model, and the right column shows the high-resolution results output by PanGu-Draw.",{"type":17,"tag":25,"props":316,"children":317},{},[318],{"type":23,"value":319},"As shown in Figures 6 and 7, Coop-Diffusion enables PanGu-Draw to be fused with existing models for multi-condition image generation and one-stage super-resolution tasks without additional training.",{"type":17,"tag":25,"props":321,"children":322},{},[323],{"type":23,"value":324},"3. Summary",{"type":17,"tag":25,"props":326,"children":327},{},[328,334],{"type":17,"tag":329,"props":330,"children":331},"strong",{},[332],{"type":23,"value":333},"Innovative training strategy",{"type":23,"value":335},": Huawei's PanGu-Draw architecture introduces a resource-efficient time-decoupling training strategy. This strategy significantly improves the efficiency of text-to-image models in data utilization, training, and inference, bringing new breakthroughs to the AI image generation field.",{"type":17,"tag":25,"props":337,"children":338},{},[339,344],{"type":17,"tag":329,"props":340,"children":341},{},[342],{"type":23,"value":343},"Breakthrough in multi-model fusion",{"type":23,"value":345},": Coop-Diffusion, a novel algorithm for fusing multiple diffusion models, is designed for PanGu-Draw. The algorithm unifies the denoising processes of diffusion models with different latent spaces and resolutions into the same space and effectively fuses multiple image diffusion models, thereby paving a new way to generate images.",{"type":17,"tag":25,"props":347,"children":348},{},[349,354],{"type":17,"tag":329,"props":350,"children":351},{},[352],{"type":23,"value":353},"High generation quality and flexibility",{"type":23,"value":355},": In terms of generation quality, PanGu-Draw not only ranks at the top among open-source models in the industry, but is also comparable to top closed-source models such as DALL-E 3 and MJ v5.2. 
In addition, Coop-Diffusion enables PanGu-Draw to be integrated with existing models for new downstream image generation tasks without additional training, demonstrating extremely high flexibility and practicality.",{"type":17,"tag":25,"props":357,"children":358},{},[359],{"type":23,"value":360},"References",{"type":17,"tag":25,"props":362,"children":363},{},[364],{"type":23,"value":365},"[1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.",{"type":17,"tag":25,"props":367,"children":368},{},[369],{"type":23,"value":370},"[2] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.",{"type":17,"tag":25,"props":372,"children":373},{},[374],{"type":23,"value":375},"[3] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.",{"type":17,"tag":25,"props":377,"children":378},{},[379,381,387],{"type":23,"value":380},"[4] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. ",{"type":17,"tag":74,"props":382,"children":385},{"href":383,"rel":384},"https://cdn.openai.com/papers/dall-e-3.pdf",[78],[386],{"type":23,"value":383},{"type":23,"value":388},", 2023. 
",{"type":17,"tag":25,"props":390,"children":391},{},[392],{"type":23,"value":393},"[5] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.",{"type":17,"tag":25,"props":395,"children":396},{},[397],{"type":23,"value":398},"[6] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.",{"type":17,"tag":25,"props":400,"children":401},{},[402],{"type":23,"value":403},"[7] Jiaxing Zhang, Ruyi Gan, Junjie Wang, Yuxiang Zhang, Lin Zhang, Ping Yang, Xinyu Gao, Ziwei Wu, Xiaoqun Dong, Junqing He, Jianheng Zhuo, Qi Yang, Yongfeng Huang, Xiayu Li, Yanghan Wu, Junyu Lu, Xinyu Zhu, Weifeng Chen, Ting Han, Kunhao Pan, Rui Wang, Hao Wang, Xiaojun Wu, Zhongshen Zeng, and Chongpei Chen. Fengshenbang 1.0: Being the foundation of Chinese cognitive intelligence. CoRR, abs/2209.02970, 2022.",{"type":17,"tag":25,"props":405,"children":406},{},[407],{"type":23,"value":408},"[8] Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431, 2022.",{"type":17,"tag":25,"props":410,"children":411},{},[412],{"type":23,"value":413},"[9] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint, 2021.",{"title":7,"searchDepth":415,"depth":415,"links":416},4,[],"markdown","content:news:en:3012.md","content","news/en/3012.md","news/en/3012","md",1776506046802]