# Quick Start

## Overview

This quick start guide introduces LiteBoost, including how to configure and use multi-card parallel inference.

## Multi-Card Parallel Inference

LiteBoost provides a one-line API to enable multi-card parallel inference for supported models. Two parallelism strategies are applied automatically:

- **Ulysses Sequence Parallel (USP)** for DiT models: sequence-dimension parallelism via `all_to_all` communication around attention layers.
- **Data Parallel (DP) temporal tiling** for VAE models: temporal-dimension slicing with overlap, distributed across devices.

### Model Background

This guide uses **Wan2.2-TI2V-5B** as an example. Wan2.2-TI2V-5B is a text-image-to-video generation model that takes a text prompt and a reference image as input and generates a video. The model consists of three main components:

- **T5 text encoder**: Encodes the text prompt into embeddings.
- **DiT (Diffusion Transformer)**: Iteratively denoises latent representations conditioned on text and image embeddings.
- **VAE**: Encodes the reference image into latent space and decodes the denoised latents back into video frames.

The inputs required for inference are:

- **Text prompt**: A natural language description of the desired video content.
- **Reference image**: An image that provides visual context (e.g., the starting appearance of a subject).

### Prerequisites

- LiteBoost is installed. See [Build and Install](build_and_install.md).
- Ascend CANN and HCCL are properly configured.
- Multiple NPU devices are available.

### Usage

The typical workflow for multi-card parallel inference is:

1. Call `initialize_usp()` to initialize the HCCL distributed environment.
2. Load the model (e.g., `WanTI2V`).
3. Wrap the model with `ParallelManager` to enable parallel inference.

The code below marks each section as either a **LiteBoost addition** or part of the **original workflow**:

```python
import os

import torch
import torch_npu

# --- Original workflow: Load the model ---
from wan.configs import WAN_CONFIGS
from wan.textimage2video import WanTI2V

local_rank = int(os.getenv("LOCAL_RANK", "0"))
rank = int(os.getenv("RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

cfg = WAN_CONFIGS["ti2v-5B"]
pipe = WanTI2V(
    config=cfg,
    checkpoint_dir="/path/to/Wan2.2-TI2V-5B",
    device_id=local_rank,
    rank=rank,
    t5_fsdp=False,
    dit_fsdp=False,
    use_sp=False,
    t5_cpu=True,
    init_on_cpu=True,
)

# --- LiteBoost addition: Multi-card parallel inference ---
if world_size > 1:
    from lite_boost.parallel import initialize_usp, ParallelManager

    initialize_usp()
    ParallelManager(pipe)

# --- Original workflow: Run inference ---
device = torch.device(f"npu:{local_rank}")
pipe.model.to(device)

from PIL import Image

img = Image.open("input.jpg").convert("RGB")
video = pipe.generate(
    "Your prompt here",
    img=img,
    size=(832, 480),
    max_area=832 * 480,
    frame_num=81,
    shift=3.0,
    sample_solver="unipc",
    sampling_steps=20,
    guide_scale=5.0,
    seed=42,
    offload_model=False,
)

# --- Original workflow: Save the generated video (rank 0 only) ---
if rank == 0:
    from wan.utils.utils import save_video

    tag = f"{world_size}card"
    save_video(
        tensor=video[None],
        save_file=f"ti2v-5B_{tag}.mp4",
        fps=cfg.sample_fps,
        nrow=1,
        normalize=True,
        value_range=(-1, 1),
    )
```

### Launching Multi-Card Inference

Use `torchrun` to launch the script across multiple NPU devices:

```bash
# 2-card inference
ASCEND_RT_VISIBLE_DEVICES=4,5 torchrun --nproc_per_node=2 your_script.py
```

### Expected Output

After a successful run, the generated video is saved as `ti2v-5B_2card.mp4` in the current directory.

### Environment Variables

The following environment variables control the distributed environment and are read by `initialize_usp()`:

| Variable | Description | Default |
|----------|-------------|---------|
| `RANK` | Global rank of the current process | `0` |
| `WORLD_SIZE` | Total number of distributed processes | `1` |
| `LOCAL_RANK` | Local rank on the current node | `0` |
| `MASTER_ADDR` | IP address of the master node | `127.0.0.1` |
| `MASTER_PORT` | Port of the master node | `29502` |
| `NUM_THREADS` | Number of CPU threads per process | `24` |

> When using `torchrun`, `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, and `MASTER_PORT` are set automatically.

### How It Works

When `ParallelManager(pipe)` is called on a pipeline object, it automatically:

1. Detects the model type.
2. For the DiT model: replaces `flash_attention` with an NPU-compatible version, patches each attention block's `forward` with `usp_attn_forward` (inserting `all_to_all` communication pairs), and replaces the model's `forward` with `usp_dit_forward` (entry sequence split + exit `all_gather`).
3. For the VAE model: replaces `vae.encode` and `vae.decode` with DP temporal tiling versions that split the video along the temporal dimension into overlapping chunks, distribute them across devices, and gather results.

The model is modified in-place and returned as-is, so all existing attributes and methods (`.to`, `.cpu`, `.eval`, etc.) continue to work normally.

### Reference

The [generate.py](https://github.com/Wan-Video/Wan2.2/blob/main/generate.py) file from the Wan2.2 repository is the original inference script without LiteBoost. Users can compare it with the LiteBoost code above to understand the differences and modify it accordingly to enable multi-card parallel inference.

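To make the DP temporal tiling idea from "How It Works" more concrete, here is a minimal, framework-free sketch of splitting a frame sequence into overlapping chunks, processing each chunk independently, and stitching the results. It is an illustration under assumptions, not LiteBoost's implementation: the names `plan_temporal_tiles` and `run_tiled` are hypothetical, the per-chunk function is assumed to return one output per input frame (a real VAE compresses the temporal dimension), and the overlap is simply trimmed before gathering, whereas the actual overlap handling in LiteBoost is not specified here.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def plan_temporal_tiles(
    num_frames: int, world_size: int, overlap: int
) -> List[Dict[str, object]]:
    """Plan one tile per rank (hypothetical helper, not a LiteBoost API).

    Each tile has a non-overlapping "core" range that the rank owns and a
    wider "halo" range that adds up to `overlap` context frames on each side.
    Ranges are half-open: (start, end).
    """
    if num_frames < 1 or world_size < 1 or overlap < 0:
        raise ValueError("num_frames and world_size must be >= 1, overlap >= 0")

    core_len = -(-num_frames // world_size)  # ceiling division
    tiles: List[Dict[str, object]] = []
    for rank in range(world_size):
        core_start = rank * core_len
        core_end = min(core_start + core_len, num_frames)
        if core_start >= core_end:
            break  # more ranks than frames: trailing ranks get no tile
        halo: Tuple[int, int] = (
            max(core_start - overlap, 0),
            min(core_end + overlap, num_frames),
        )
        tiles.append({"rank": rank, "core": (core_start, core_end), "halo": halo})
    return tiles


def run_tiled(
    frames: Sequence[float],
    world_size: int,
    overlap: int,
    fn: Callable[[Sequence[float]], List[float]],
) -> List[float]:
    """Apply `fn` per tile and stitch the core regions back together.

    The loop runs sequentially here; in a distributed run each rank would
    process only its own tile and the results would be gathered.
    """
    out: List[float] = []
    for tile in plan_temporal_tiles(len(frames), world_size, overlap):
        h0, h1 = tile["halo"]
        c0, c1 = tile["core"]
        processed = fn(frames[h0:h1])
        # Drop the overlap frames so each output frame appears exactly once.
        out.extend(processed[c0 - h0 : c1 - h0])
    return out
```

For example, `plan_temporal_tiles(21, 2, 2)` assigns rank 0 the core frames 0 to 10 with context up to frame 12, and rank 1 the core frames 11 to 20 with context starting at frame 9. The overlap gives each chunk some temporal context at its boundary, which is the usual motivation for tiling with overlap.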
## Fusion Operators

Fusion operators currently supported by lite_boost:

| Operator Name | Hardware | Operator Interface |
|---------------|----------|--------------------|
| [RainFusionAttention](https://atomgit.com/mindspore/mindspore-lite/blob/master/mindspore-lite/lite_boost/docs/ops/RainFusionAttention.md) | Atlas 800I A2 | lite_boost.ops.rain_fusion_attention<br>lite_boost.ops.sparse_attention |