https://andytang02.github.io/parallel-topo-sort/
We will implement a parallel topological sort to evaluate complex computational graphs with backpropagation, in two versions: OpenMP and CUDA. We will evaluate speedup and memory efficiency on different kinds of graphs.
The application accelerates execution of an arbitrary computational graph, represented as a directed acyclic graph (DAG) where each node corresponds to a computation and edges represent data dependencies. Unlike simple pipelines, this graph can have complex branching and merging structure, capturing general-purpose computations such as scientific workflows, compiler intermediate representations, or differentiable programs.
The topological sort algorithm performs the computation while respecting data dependencies. In brief, the algorithm maintains a set of “ready” nodes with no unmet dependencies, processes them, and updates dependent nodes until the entire graph is evaluated. The reverse pass follows the same structure but traverses the graph in the opposite direction while accumulating gradients. An example sequential implementation is shown below.
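The sequential listing referenced above can be sketched as follows: a minimal Kahn's-algorithm evaluator in which the per-node computation is abstracted away and the function simply records the evaluation order (the names `q` and `in_deg` match the variables discussed later):

```cpp
#include <queue>
#include <vector>

// Evaluate a DAG in topological order (Kahn's algorithm).
// adj[u] lists the successors of node u; the actual per-node
// computation would run where the node is appended to `order`.
std::vector<int> topo_evaluate(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<int> in_deg(n, 0);
    for (int u = 0; u < n; ++u)
        for (int v : adj[u]) ++in_deg[v];

    std::queue<int> q;                  // "ready" nodes: all deps satisfied
    for (int u = 0; u < n; ++u)
        if (in_deg[u] == 0) q.push(u);

    std::vector<int> order;
    while (!q.empty()) {
        int u = q.front(); q.pop();
        order.push_back(u);             // process(u) would run here
        for (int v : adj[u])
            if (--in_deg[v] == 0)       // last dependency just finished
                q.push(v);
    }
    return order;                       // |order| == n iff the graph is acyclic
}
```

The reverse pass has the same shape, but seeds the queue with nodes of out-degree zero and walks predecessor edges while accumulating gradients.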
The application can benefit from parallelism because the DAG defines only a partial order, not a total order. At any point in time, multiple nodes may have all dependencies satisfied and can therefore execute simultaneously. This creates a dynamic “frontier” of ready nodes that can be processed in parallel.
Nonetheless, parallelizing the algorithm is challenging due to complex data dependencies and heavy contention on shared data structures. As mentioned above, the computational graphs in this application have complex dependencies. A node cannot execute until all of its predecessors have completed, which limits parallelism along the graph’s critical path and can lead to load imbalance in irregular graphs.
In addition, shared data structures introduce contention: for example, the ready queue (q) used in the sequential implementation becomes a bottleneck when multiple threads attempt to push and pop nodes simultaneously. Similarly, updating the in-degree array (in_deg[v]) requires synchronization (e.g., atomic decrement operations), and many threads may contend when multiple predecessors finish at the same time. Speedup would be undermined without techniques such as work-stealing queues or batching.
Yet another challenge is memory latency. Graph memory accesses have low arithmetic intensity, and locality is poor without techniques such as compressed sparse row (CSR) formatting or sharding.
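CSR storage keeps each node's out-edges contiguous in memory, which improves locality when a thread (or warp) scans them. A minimal sketch of the layout (names are illustrative):

```cpp
#include <vector>

// CSR adjacency: successors of u are col_idx[row_ptr[u] .. row_ptr[u+1]).
// A node's edges sit contiguously, so scanning them is cache-friendly.
struct CsrGraph {
    std::vector<int> row_ptr;  // size n+1; prefix sums of outdegrees
    std::vector<int> col_idx;  // size = number of edges

    static CsrGraph from_adj(const std::vector<std::vector<int>>& adj) {
        CsrGraph g;
        g.row_ptr.reserve(adj.size() + 1);
        g.row_ptr.push_back(0);
        for (const auto& succ : adj) {
            g.col_idx.insert(g.col_idx.end(), succ.begin(), succ.end());
            g.row_ptr.push_back((int)g.col_idx.size());
        }
        return g;
    }
};
```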
We will implement the project primarily in C++, using OpenMP for multicore CPU parallelism and CUDA for GPU acceleration. Development and testing will be performed on the GHC machines (8 cores) and PSC clusters (up to 128 cores and GPUs) available for the course.
We will build the system from scratch, including graph generation, a sequential baseline, and both parallel implementations. We may also consult prior work or open-source implementations of parallel graph processing for guidance if they prove useful.
Plan to Achieve:
Hope to Achieve (Stretch Goals):
Fallback (Reduced goals):
We will implement the system using both OpenMP and CUDA to evaluate parallelism across CPU and GPU architectures. OpenMP is well-suited for this workload because the DAG resides in shared memory and requires frequent concurrent access to structures such as the ready queue and in-degree array, making shared-memory parallelism a natural fit. CUDA complements this by enabling large-scale parallel processing of frontier nodes, which can expose substantial parallelism in wide regions of the graph. This type of forward and backward traversal is also similar to computations in neural networks and other scientific workloads, where graph-structured dependencies arise naturally, making it an interesting candidate for GPU acceleration. Although the irregular structure and synchronization requirements of the algorithm may limit GPU efficiency compared to more regular workloads, implementing a CUDA version allows us to study these tradeoffs and directly compare performance across architectures.
| Week | Plan |
|---|---|
| Week 1 (Mar 23) | Implement graph generator; implement sequential baseline |
| Week 2 (Mar 30) | Implement initial OpenMP version; benchmark on small/medium graphs at low core counts; identify bottlenecks |
| Week 3 (Apr 6) | Optimize OpenMP with work-stealing queues and batching; scale to larger graphs and more cores (10M+ nodes, 64+ cores); begin CUDA design and kernel planning |
| Week 4 (Apr 13) — Milestone | Implement initial CUDA version; basic GPU kernel for frontier processing; compare CPU vs. GPU performance; prepare milestone report and evaluation |
| Week 5 (Apr 20) | Optimize CUDA implementation; run large-scale experiments; start detailed performance analysis |
| Week 6 (Apr 27) — Final | Final optimizations (OpenMP and CUDA); complete full evaluation including speedup curves and CPU vs. GPU comparison; prepare final report and visualizations |
| Week | Plan |
|---|---|
| Week 1–3 (Mar 23–Apr 6) | Implemented graph generator; implemented sequential baseline; implemented initial OpenMP version; implemented initial CUDA versions with some warp-level and block-level optimizations |
| Week 4 First Half (Apr 13–Apr 17) — Milestone | Benchmark initial OpenMP and CUDA versions on preliminary graphs; identify key bottlenecks; prepare milestone report |
| Week 4 Second Half (Apr 17–Apr 20) | (Yizhou) Optimize OpenMP using thread-local queues, reduced synchronization, and possibly work stealing; (Andy) optimize CUDA by reducing global memory traffic and improving frontier processing; run experiments on more diverse graph types |
| Week 5 First Half (Apr 20–Apr 24) | Implement backpropagation on the computational graph; integrate backprop into the sequential baseline; extend to parallel versions; continue profiling and optimizing OpenMP and CUDA |
| Week 5 Second Half | Optimize performance of backpropagation in the parallel versions; begin large-scale experiments on GHC and PSC |
| Week 6 First Half | Finalize optimizations for OpenMP and CUDA; evaluate performance across graph size, outdegree, frontier size, and graph depth; collect data on speedup, scalability, and memory behavior |
| Week 6 Second Half (Apr 27) — Final | Complete full evaluation including speedup curves and CPU vs. GPU comparison; prepare final report, poster, and visualizations |
So far, we have implemented a graph generator, a sequential baseline, and preliminary OpenMP and CUDA implementations of topological sort.
The graph generator accepts several parameters, including the number of vertices, the average outdegree per node, the number of layers, and a Dirichlet shape parameter. This lets us generate graphs that vary in size, sparsity, and depth.
The CUDA implementation processes the graph's frontiers one at a time, launching a new CUDA kernel for each frontier. We implemented three versions:

- a naive version that launches one thread per frontier node, uses atomics to update the shared in-degree and state arrays, and writes directly to a new global frontier;
- a block-level version that first aggregates newly ready nodes into a block-local frontier before writing them out globally;
- a warp-level version that launches a warp per node, allowing a node's edges to be processed in parallel.
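As a rough sketch, the naive version's kernel has the following shape (assuming a CSR adjacency; names such as `frontier_step` and `next_size` are illustrative, not taken from our code):

```cuda
// Naive frontier kernel (version 1): one thread per frontier node.
// Atomics guard the shared in-degree array and the next-frontier cursor.
__global__ void frontier_step(const int* row_ptr, const int* col_idx,
                              int* in_deg,
                              const int* frontier, int frontier_size,
                              int* next_frontier, int* next_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;
    int u = frontier[i];
    for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
        int v = col_idx[e];
        // atomicSub returns the old value; 1 means v just became ready.
        if (atomicSub(&in_deg[v], 1) == 1) {
            int slot = atomicAdd(next_size, 1);  // global write: hot spot
            next_frontier[slot] = v;
        }
    }
}
```

The block-level version replaces the direct `atomicAdd` on `next_size` with aggregation into a `__shared__` buffer that is flushed once per block; the warp-level version assigns the loop over `e` to the lanes of a warp.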
The basic OpenMP implementation keeps the current frontier and parallelizes over its vertices. Newly discovered nodes whose in-degree reaches zero are added to the next frontier's queue. Contention arises both when updating the in-degrees of discovered nodes and when pushing to the shared next queue.
So far, we are on track with our goals and deliverables: we have implemented the graph generator as well as parallel OpenMP and CUDA versions of topological sort. However, we may have to lower our performance expectations for the CUDA version.
Specifically, the CUDA version requires massive parallelism to be successful: a large outdegree (1000+ per node) or large frontier sizes (1000+). Even then, it achieves only about 2x speedup over the CPU version, because the CUDA implementation performs many global memory writes, which are inefficient on GPUs, and the workload has low arithmetic intensity overall.
The OpenMP version achieves 3x speedup with 8 processors on the best inputs. Our next step is to profile the code to identify the remaining bottlenecks.
By the poster session, we believe we can also implement backpropagation, which was one of our stretch goals. We will present detailed graphs of performance across the different graph types and the different optimizations we implemented.
Results on CUDA:
| Nodes | Average Outdegree | Average Frontier Size | CUDA Baseline Speedup | CUDA Block-Level Speedup | CUDA Warp-Level Speedup |
|---|---|---|---|---|---|
| 100,000 | 1000 | 1000 | 0.46x | 0.39x | 2.45x |
| 10M | 5 | 100 | 0.20x | 0.29x | 0.57x |
| 10M | 5 | 1000 | 1.11x | 1.50x | 2.57x |
Results on OpenMP:
| Nodes | Average Outdegree | OpenMP Basic Speedup | OpenMP Thread-local Speedup |
|---|---|---|---|
| 100,000 | 1000 | 0.44x | 0.53x |
| 10M | 5 | 1.60x | 3.42x |
| 10M | 50 | 2.69x | 2.11x |