MindSpore Graph Learning Prototype — Compilation and Optimization of Seastar
Author: Jin Xuefeng | Source: Zhihu
Paper Title
Seastar: Vertex-Centric Programming for Graph Neural Networks
Paper URL
http://www.cse.cuhk.edu.hk/~jcheng/papers/seastar_eurosys21.pdf
Published in EuroSys 2021, this paper is jointly produced by the team of Professor James Cheng of the Chinese University of Hong Kong (CUHK) and Huawei. Seastar is the design prototype of MindSpore Graph Learning.
This blog post briefly introduces its design principles.
01 Graph Neural Network Framework
Graph neural networks (GNNs) are designed to extract information representations from graph-structured data such as social networks and knowledge graphs. Since their first proposal in 2005, GNNs have been widely used in fields such as commodity recommendation, financial risk control, computer vision, biocomputing, and even materials science. Unlike image and text data, graph-structured data is irregular. As a result, tensor-based deep learning frameworks cannot directly express graph operations or GNNs. To solve this problem, major technology companies have proposed their own GNN frameworks, such as DGL, PyG, Graph-Learn, and PGL.
From the perspective of expression, many frameworks present their model expression as an operator library for GNNs. Some provide a graph abstraction that lets you convert feature vectors into tensors and save them as graph attributes; the graph then offers interfaces to execute GNNs, along with message and aggregation functions based on the message-passing mechanism that take graph attribute data as input and output. Other frameworks provide operators for the different phases of computing models based on the message-passing mechanism or the Scatter-Apply-Gather-Apply (SAGA) model, requiring you to organize the input tensors of the operators in the corresponding phases. Still others focus on the graph sampling process during GNN training and provide sampling interfaces, with model computing implemented on top of the operators of deep learning frameworks. Although programming models such as message passing are sufficient to express GNNs, you still need to convert graph operations into whole-graph tensor-computing logic, which is challenging when complex graph operations must be implemented. In addition, you need to learn low-level tensor operators and domain-specific APIs.
From the perspective of execution, many frameworks seek a tradeoff between operator generalization and performance optimization. Some focus on generalization by reusing the tensor-computing optimizations of deep learning frameworks and inserting operators such as Scatter and Gather to convert graph data into tensors; this requires a large number of memory and data read/write operations, leading to suboptimal performance. In contrast, other frameworks provide fused implementations of typical operators, which greatly improves performance, but custom operators cannot be optimized, resulting in poor generalization.
All this challenges frameworks to separate GNN expression from tensor operations and to balance generalization against optimization efficiency. This is exactly what Seastar attempts to solve. Seastar proposes a vertex-centric programming model that supports Python-style GNN expression, allowing you to use Python syntax to implement the logic of the central vertex. For GNN execution, Seastar designs an operator generator that produces efficient fused forward and backward operators.
The following describes the Seastar programming model, operator generator, and performance optimization methods.
02 Vertex-Centric Programming Model
This model focuses on how to compute the features of a central vertex by aggregating the features of its neighbors. To support vertex-centric programming, Seastar introduces a function decorator. A decorated function takes a vertex as input, and within it you can use Python syntax to implement the computing logic of vertices.
To implement a classic GAT network with the vertex-centric programming model, you define a function with vertex v as the input parameter. In the function, you obtain the list of in-neighbors through v.innbs() and traverse each neighbor vertex u to obtain its features, compute features between the neighbor and the central vertex to obtain a weight for each in-edge, and then take the weighted average of the edge weights and neighbor features to obtain the updated feature of the central vertex.
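The description above can be sketched in plain Python. Note that the decorator name `vertex_centric` and the `Vertex` class here are illustrative stand-ins, not Seastar's actual API; only `v.innbs()` follows the naming mentioned in the paper, and scalar features replace real feature vectors for brevity.

```python
import math

# Illustrative stand-ins for Seastar's abstractions (not the real API).
def vertex_centric(fn):
    """Run a per-vertex function over every vertex of a toy graph."""
    def run(graph):
        return {v.vid: fn(v) for v in graph}
    return run

class Vertex:
    def __init__(self, vid, feat, nbrs):
        self.vid, self.feat, self._nbrs = vid, feat, nbrs
    def innbs(self):
        return self._nbrs  # in-neighbor list, as in the paper's v.innbs()

@vertex_centric
def gat_layer(v):
    # 1) unnormalized attention score per in-edge (simplified: a product)
    scores = [u.feat * v.feat for u in v.innbs()]
    # 2) softmax over the in-edges of the central vertex
    exps = [math.exp(s) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    # 3) weighted average of neighbor features updates the central vertex
    return sum(a * u.feat for a, u in zip(alphas, v.innbs()))

# A tiny 3-vertex graph: v0 aggregates from v1 and v2.
v1 = Vertex(1, 1.0, [])
v2 = Vertex(2, 2.0, [])
v0 = Vertex(0, 1.0, [v1, v2])
out = gat_layer([v0, v1, v2])
```

The point of the programming model is visible even in this sketch: the function is written purely in terms of one central vertex and its neighbors, with no tensor shapes or scatter/gather bookkeeping.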
Unlike other frameworks, which execute operators one by one and materialize their intermediate outputs, this approach integrates directly with a deep learning framework. Seastar needs to convert vertex-centric code into tensor-operator-based computational graphs and combine them with the backend runtime of a deep learning framework. In Seastar's overall architecture, the vertex-centric code is recorded by a tracer, converted into a forward computational graph, and verified. The conversion has two parts.
First, operators in the frontend vertex-centric code are recorded and converted into DAGs (that is, computational graphs). A tensor is created for each node or edge attribute, inheriting the data type and shape of the node or edge feature, to serve as operator input and output. All deep learning operators and their implementations are inserted into the computational graph in monkey-patch mode. The DAGs can already be compiled and executed by a deep learning framework, but single-node execution is inefficient because no parallel processing is enabled.
Second, the tensors and operators in the DAGs are combined with the graph structure and operation information, and type labels such as S (source node), D (destination node), and E (edge) are added to each tensor (with an additional A label, for neighbor aggregation, on operators) to generate a graph-aware intermediate representation (GIR). These type labels indicate whether the full data is of node or edge granularity, as well as the data types an operator can process. They can be used for verification during encoding and provide support for backward propagation and optimization.
Based on the forward GIR and the chain rule, the backward GIR is generated; the graph type labels can be used to verify the correctness of the backward logic. Finally, the GIR is compiled into execution units. Seastar divides the computational graph into fused and unfused execution units depending on whether fusion optimization applies. A fused unit is compiled from the Seastar fusion template and inserted into the graph backend as a custom operator, while an unfused unit directly invokes the backend implementation.
The GIR of a GNN computational graph exhibits a common S-E-A computation pattern. S processes features at source-node granularity; E gathers source-node, destination-node, or edge-granularity features by ID to perform edge computation; and A aggregates the per-edge results by destination-node ID and writes the updated feature to the corresponding node. The output of each S-E, E-E, and E-A step can feed directly into the next step, but the operator fusion of deep learning frameworks does not cover this fusion pattern. As a result, materialized intermediate results cause high memory usage and time-consuming data reads and writes.
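The S-E-A pattern can be made concrete with a toy graph; the per-stage operations below (scaling in S, addition in E, sum in A) are arbitrary placeholders, and the edge list is hypothetical. The edge-granularity list `e_out` is exactly the kind of intermediate that unfused execution materializes and that Seastar's fusion avoids.

```python
# Edges are (src, dst) pairs; features are per-node scalars for brevity.
edges = [(0, 2), (1, 2), (1, 0)]
feat = [1.0, 2.0, 3.0]           # node-granularity features (|V| entries)

# S: per-source-node processing (here, a simple scaling)
s_out = [0.5 * f for f in feat]  # still node granularity

# E: gather by edge endpoints -> an edge-granularity intermediate.
# Unfused execution materializes this |E|-sized tensor in memory.
e_out = [s_out[src] + feat[dst] for src, dst in edges]

# A: aggregate the per-edge results into destination nodes by ID
agg = [0.0] * len(feat)
for (src, dst), msg in zip(edges, e_out):
    agg[dst] += msg
```

In a fused kernel, each edge's message would be accumulated into `agg` as soon as it is computed, so the `e_out` buffer never exists.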
Seastar designs a finite state machine to identify the fusable regions of a computational graph and generate fusion operators. Each operator's state is determined by its own type label and its upstream operator. Operators are traversed in order along the computational graph structure, and each operator updates the state of its downstream operators according to its own state and the state-machine rules. After the traversal, fusable operators are traced back recursively upstream and merged into a fusion operator.
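A minimal sketch of this idea over a linear operator chain, assuming the S-E-A transitions from the previous paragraph; the rule table is a simplification of Seastar's actual state machine, which operates on a full DAG.

```python
# Transitions along which two adjacent operators may be fused.
FUSABLE = {("S", "E"), ("E", "E"), ("E", "A")}

def find_fusion_spans(op_labels):
    """Group consecutive ops whose label transitions are all fusable."""
    spans, cur = [], [0]
    for i in range(1, len(op_labels)):
        if (op_labels[i - 1], op_labels[i]) in FUSABLE:
            cur.append(i)          # extend the current fusion candidate
        else:
            spans.append(cur)      # close it and start a new one
            cur = [i]
    spans.append(cur)
    return spans

# S -> E -> E -> A fuses into one unit; the trailing S starts a new one.
spans = find_fusion_spans(["S", "E", "E", "A", "S"])
```

Each returned span corresponds to one execution unit: multi-operator spans become fused operators, singleton spans fall back to the backend's stock implementation.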
To execute fusion operators efficiently, the hardware's parallelism must be fully exploited by properly mapping computation and data to parallel threads. GNN computing has three parallel dimensions: feature-wise parallelism (FP), vertex-wise parallelism (VP), and edge-wise parallelism (EP). FP is the most notable difference between GNNs and traditional graph processing: in GNN training, the smallest unit of computation is a node feature vector, which may contain hundreds or thousands of elements.
EP first needs to map edges to threads; when the edge count is large, the overhead of this distribution cannot be ignored. Moreover, for nodes with high in-degree, the aggregation phase suffers write conflicts when the results of different edges are written concurrently. To improve hardware utilization and make full use of FP, Seastar proposes a feature-adaptive parallelism policy. To handle the high-in-degree nodes common in real-world graphs, Seastar uses a locality-centric execution strategy combined with dynamic load balancing.
Under FP, a feature vector is mapped to the threads of a block, one element per thread, yielding coalesced memory access and efficient SIMT execution. If the feature dimension is small, many threads in the block are left idle, lowering hardware utilization. Thread allocation therefore needs to adapt to the feature dimension: with the block size fixed, Seastar adds a grouping dimension, where each group handles one feature vector and a group can share a block with other groups or span multiple blocks.
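The grouping arithmetic can be sketched as follows, assuming a fixed CUDA-style block of 256 threads; the power-of-two rounding policy is an illustrative assumption, not necessarily Seastar's exact rule.

```python
BLOCK = 256  # assumed fixed number of threads per block

def plan(feat_dim):
    """Return (group_size, groups_per_block, blocks_per_group)."""
    if feat_dim <= BLOCK:
        # round the group up to a power of two so groups tile the block
        group = 1
        while group < feat_dim:
            group *= 2
        return group, BLOCK // group, 1
    # large features: one group spans several blocks
    blocks = -(-feat_dim // BLOCK)   # ceiling division
    return feat_dim, 1, blocks

small = plan(30)    # small feature: several groups share one block
large = plan(600)   # large feature: one group spans multiple blocks
```

Either way, threads are never left idle in bulk: a block is filled with multiple small groups, or a large group simply takes more blocks.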
For data locality, Seastar assigns each node to a group for parallel processing. Computation over a node's adjacent edges is serial, which is preferable here: the destination node's features need to be loaded only once, and each edge's result is accumulated into the aggregate as it is produced, with no synchronization of results. However, nodes have different in-degrees, so node-level parallelism alone is load-unbalanced. For load balancing, Seastar sorts nodes by in-degree, a step that can be completed during data preprocessing, and assigns nodes with similar in-degrees to the same block. In addition, high-in-degree nodes take the longest to compute, so they are scheduled first so that their cost overlaps with that of low-in-degree nodes.
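A sketch of this scheduling step on a toy edge list; the block capacity of two nodes is purely illustrative. Sorting in descending in-degree order both packs similar-cost nodes together and puts the heaviest blocks first.

```python
from collections import Counter

def schedule(edges, num_nodes, nodes_per_block=2):
    """Sort vertices by in-degree (descending) and pack them into blocks."""
    indeg = Counter(dst for _, dst in edges)
    order = sorted(range(num_nodes), key=lambda v: -indeg[v])
    return [order[i:i + nodes_per_block]
            for i in range(0, num_nodes, nodes_per_block)]

# in-degrees: node 3 -> 3, node 1 -> 2, node 2 -> 1, node 0 -> 0
edges = [(0, 3), (1, 3), (2, 3), (0, 1), (2, 1), (0, 2)]
blocks = schedule(edges, 4)
```

The first block receives the two highest-in-degree nodes, so its (longest) execution starts earliest and overlaps with the cheap trailing blocks.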
Backward propagation is perfectly symmetric with forward propagation: where the forward pass aggregates neighbor features into the central node, the backward pass sends the central node's gradient to its neighbors along the reversed edges. Backward propagation can therefore reuse the compilation optimizations of the forward pass, re-sorting nodes for the new direction, since only the edge direction changes. To reduce memory usage and speed up data access, Seastar represents graph data in CSR format. During backward node sorting, node IDs and edge IDs must match those of the forward direction so that the correct forward results are read.
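The CSR layout and its reversed counterpart can be built as below; the helper name and edge-list input are illustrative. Keeping the original edge ID alongside each entry is what lets the backward pass index the forward results at matching positions.

```python
def to_csr(edges, num_nodes, key):
    """CSR over `edges` grouped by endpoint `key` (0 = src, 1 = dst)."""
    buckets = [[] for _ in range(num_nodes)]
    for eid, e in enumerate(edges):
        # keep (other endpoint, original edge ID) for each edge
        buckets[e[key]].append((e[1 - key], eid))
    indptr, indices, eids = [0], [], []
    for b in buckets:
        for other, eid in b:
            indices.append(other)
            eids.append(eid)
        indptr.append(len(indices))
    return indptr, indices, eids

edges = [(0, 2), (1, 2), (1, 0)]
fwd = to_csr(edges, 3, key=1)  # group by destination: forward aggregation
bwd = to_csr(edges, 3, key=0)  # group by source: backward gradient flow
```

`indptr[v]:indptr[v+1]` slices out node v's adjacent edges, so per-node serial accumulation reads one contiguous range, and `eids` carries the forward edge IDs into the backward pass.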
For heterogeneous GNNs, edges of the same type are aggregated first, and the results across edge types are then aggregated. Compared with homogeneous GNNs, this adds an outer loop over edge types. In the fusion operator, the edge-type loop is flattened, and a change of edge type serves as the signal to perform the outer accumulation.
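The flattened loop can be sketched as a single pass over type-sorted edges; the choice of per-type mean folded into an outer sum is an illustrative aggregation, not necessarily what a given heterogeneous model uses.

```python
def hetero_aggregate(typed_msgs):
    """typed_msgs: list of (edge_type, message), pre-sorted by type."""
    total, s, n, cur = 0.0, 0.0, 0, None
    for etype, msg in typed_msgs:
        if etype != cur and n:
            # type change: fold this type's mean into the outer sum
            total += s / n
            s, n = 0.0, 0
        cur = etype
        s += msg                 # inner accumulation within one edge type
        n += 1
    if n:
        total += s / n           # flush the last type
    return total

result = hetero_aggregate([("follows", 1.0), ("follows", 2.0),
                           ("likes", 2.5)])
```

A single flattened loop avoids launching one kernel per edge type: the type boundary in the sorted edge list plays the role of the outer loop iteration.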
Comparing Seastar against other frameworks on GAT, GCN, APPNP, and R-GCN training, Seastar's per-epoch performance is 3x to 14x that of DGL 0.4 and PyG 1.6.0. This is attributed to better operator fusion and locality-centric execution, which reduce the memory reads and writes of intermediate results and cross-block data access. Seastar performs better on datasets with high average in-degree, illustrating the effectiveness of combining VP with locality-centric execution and dynamic load balancing. In terms of memory, other frameworks use about 2x to 8x as much as Seastar, mainly because their intermediate results are edge-granularity features, which are much larger than node-granularity features; with fusion optimization, Seastar never materializes the full intermediate result.
03 MindSpore Graph Learning
With Seastar's design as its prototype, MindSpore Graph Learning's frontend implements the vertex-centric programming model and further improves on the decorator-based style: users write a for loop over the central nodes, bringing the programming logic closer to the GNN formula. In code generation, the tracer and GIR are replaced by a parser that translates vertex-centric code into MindSpore code and prints the translation result for debugging. For backend operator fusion and compilation, GNN fusion patterns are loaded into graph kernel fusion, and AKG is reused to automatically generate high-performance operators. On top of Seastar's optimization policies, a recomputation scheme based on operator fusion addresses the memory shortage caused by backward propagation's dependency on ultra-large edge-granularity features from the forward pass. This ensures high-performance GNN training; compared with other frameworks in the industry, performance improves by three to four times.
04 Summary
In recent years, GNNs have gradually developed into an important research direction of AI. However, in real-world development and application, GNNs face problems such as complex expression, low execution efficiency, and high memory usage. Seastar proposes a complete solution spanning frontend expression, intermediate representation, and backend execution. The vertex-centric programming model at the frontend allows you to program with the logic of GNN formulas while ignoring the data preparation required for computing. At the backend, operator fusion and optimization target the GNN computation pattern, including feature-adaptive parallelism, locality-centric execution, and load balancing. All this reduces training time and memory usage for typical GNN networks compared with other frameworks.
Distributed training of large-scale graph data still faces challenges such as efficient graph sampling, graph partitioning, and irregular communication. Heterogeneous GNNs are currently executed by partitioning heterogeneous graphs into multiple homogeneous graphs, which discards type information and affects precision and performance. These problems remain obstacles for the future development of GNN frameworks.