Parallel Topological Sort for Computational Graph and Back-Propagation

Team members: Andy Tang (andyt), Yizhou Wang (yizhouw3)

URL

https://andytang02.github.io/parallel-topo-sort/

Summary

We will implement a parallel topological sort to evaluate complex computational graphs with backpropagation, in two versions: OpenMP and CUDA. We will evaluate speedup and memory efficiency on different kinds of graphs.

Background

The application accelerates execution of an arbitrary computational graph, represented as a directed acyclic graph (DAG) where each node corresponds to a computation and edges represent data dependencies. Unlike simple pipelines, this graph can have complex branching and merging structure, capturing general-purpose computations such as scientific workflows, compiler intermediate representations, or differentiable programs.

The topological sort algorithm performs the computation while respecting data dependencies. In brief, the algorithm maintains a set of “ready” nodes with no unmet dependencies, processes them, and updates dependent nodes until the entire graph is evaluated. The reverse pass follows the same structure but traverses the graph in the opposite direction while accumulating gradients. An example sequential implementation is shown below.
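The sequential approach described above can be sketched as Kahn's algorithm. This is a sketch consistent with the description here; the names `q` and `in_deg` match the ready queue and in-degree array referenced in the Challenge section, and `evaluate_node` stands in for the per-node computation.

```cpp
#include <queue>
#include <vector>

// Kahn's algorithm: visit the nodes of a DAG in dependency order.
// adj[u] lists the successors of node u; in_deg[v] counts v's unmet
// dependencies. A node is "ready" once its in-degree drops to zero.
std::vector<int> topo_order(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<int> in_deg(n, 0);
    for (int u = 0; u < n; u++)
        for (int v : adj[u]) in_deg[v]++;

    std::queue<int> q;                  // ready queue
    for (int u = 0; u < n; u++)
        if (in_deg[u] == 0) q.push(u);  // sources are ready immediately

    std::vector<int> order;
    while (!q.empty()) {
        int u = q.front(); q.pop();
        order.push_back(u);             // evaluate_node(u) would go here
        for (int v : adj[u])
            if (--in_deg[v] == 0) q.push(v);
    }
    return order;
}
```

The reverse (gradient) pass would run the same loop over the transposed graph, seeding the queue with sink nodes instead of sources.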

The application can benefit from parallelism because the DAG defines only a partial order, not a total order. At any point in time, multiple nodes may have all dependencies satisfied and can therefore execute simultaneously. This creates a dynamic “frontier” of ready nodes that can be processed in parallel.

The Challenge

Nonetheless, parallelizing the algorithm is challenging due to complex data dependencies and heavy contention on shared data structures. As mentioned above, the computational graphs in this application have complex dependencies. A node cannot execute until all of its predecessors have completed, which limits parallelism along the graph’s critical path and can lead to load imbalance on irregular graphs.

In addition, shared data structures introduce contention: for example, the ready queue (q) used in the sequential implementation becomes a bottleneck when multiple threads attempt to push and pop nodes simultaneously. Similarly, updating the in-degree array (in_deg[v]) requires synchronization (e.g., atomic decrement operations), and many threads may contend when multiple predecessors finish at the same time. Speedup would be undermined without techniques such as work-stealing queues or batching.

Yet another challenge is memory latency. Graph traversal has low arithmetic intensity, and locality is poor without techniques such as compressed sparse row (CSR) formatting or sharding.
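As a concrete illustration of the locality point, a CSR layout stores all edges in two flat arrays, so scanning a node's successors is a sequential read rather than a chain of pointer dereferences. This is a minimal sketch; the struct and field names (`row_ptr`, `col_idx`) are our own, not fixed by the project.

```cpp
#include <vector>

// Compressed sparse row (CSR) layout: all edges live in one contiguous
// array, so iterating over a node's successors is a sequential scan.
struct CsrGraph {
    std::vector<int> row_ptr;  // edges of u occupy [row_ptr[u], row_ptr[u+1])
    std::vector<int> col_idx;  // flattened successor lists
};

CsrGraph to_csr(const std::vector<std::vector<int>>& adj) {
    CsrGraph g;
    g.row_ptr.reserve(adj.size() + 1);
    g.row_ptr.push_back(0);
    for (const auto& succs : adj) {
        g.col_idx.insert(g.col_idx.end(), succs.begin(), succs.end());
        g.row_ptr.push_back((int)g.col_idx.size());
    }
    return g;
}
```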

Resources

We will implement the project primarily in C++, using OpenMP for multicore CPU parallelism and CUDA for GPU acceleration. Development and testing will be performed on the GHC machines (8 cores) and PSC clusters (up to 128 cores and GPUs) available for the course.

We will build the system from scratch, including graph generation, a sequential baseline, and both parallel implementations. We may also consult prior work or open-source implementations of parallel graph processing for guidance if they prove useful.

Goals and Deliverables

Plan to Achieve:

Hope to Achieve (Stretch Goals):

Fallback (Reduced goals):

Platform Choice

We will implement the system using both OpenMP and CUDA to evaluate parallelism across CPU and GPU architectures. OpenMP is well-suited for this workload because the DAG resides in shared memory and requires frequent concurrent access to structures such as the ready queue and in-degree array, making shared-memory parallelism a natural fit. CUDA complements this by enabling large-scale parallel processing of frontier nodes, which can expose substantial parallelism in wide regions of the graph. This type of forward and backward traversal is also similar to computations in neural networks and other scientific workloads, where graph-structured dependencies arise naturally, making it an interesting candidate for GPU acceleration. Although the irregular structure and synchronization requirements of the algorithm may limit GPU efficiency compared to more regular workloads, implementing a CUDA version allows us to study these tradeoffs and directly compare performance across architectures.

Schedule

| Week | Plan |
| --- | --- |
| Week 1 (Mar 23) | Implement graph generator; implement sequential baseline |
| Week 2 (Mar 30) | Implement initial OpenMP versions; benchmark on small/medium graphs and low core counts; identify bottlenecks |
| Week 3 (Apr 6) | Optimize OpenMP with work-stealing queues and batching; scale to larger graphs and more cores (10M+ nodes, 64+ cores); begin CUDA design and kernel planning |
| Week 4 (Apr 13) — Milestone | Implement initial CUDA version; basic GPU kernel for frontier processing; compare CPU vs. GPU performance; prepare milestone report and evaluation |
| Week 5 (Apr 20) | Optimize CUDA implementation; run large-scale experiments; start detailed performance analysis |
| Week 6 (Apr 27) — Final | Final optimizations (OpenMP and CUDA); complete full evaluation including speedup curves and CPU vs. GPU comparison; prepare final report and visualizations |

Project Milestone Report

Team members: Andy Tang (andyt), Yizhou Wang (yizhouw3)

Revised Plan

| Week | Plan |
| --- | --- |
| Week 1–3 (Mar 23–Apr 6) | Implemented graph generator; implemented sequential baseline; implemented initial OpenMP version; implemented initial CUDA versions with some warp-level and block-level optimizations |
| Week 4, first half (Apr 13–Apr 17) — Milestone | Benchmark initial OpenMP and CUDA versions on preliminary graphs; identify key bottlenecks; prepare milestone report |
| Week 4, second half (Apr 17–Apr 20) | (Yizhou) Optimize OpenMP using thread-local queues, reduced synchronization, and possibly work stealing; (Andy) optimize CUDA by reducing global memory traffic and improving frontier processing; run experiments on more diverse graph types |
| Week 5, first half (Apr 20–Apr 24) | Implement backpropagation on the computational graph; integrate backprop into the sequential baseline; extend to the parallel versions; continue profiling and optimizing OpenMP and CUDA |
| Week 5, second half | Optimize performance of backpropagation in the parallel versions; begin large-scale experiments on GHC and PSC |
| Week 6, first half | Finalize optimizations for OpenMP and CUDA; evaluate performance across graph size, outdegree, frontier size, and graph depth; collect data on speedup, scalability, and memory behavior |
| Week 6, second half (Apr 27) — Final | Complete full evaluation including speedup curves and CPU vs. GPU comparison; prepare final report, poster, and visualizations |

Work Completed So Far

So far, we have implemented a graph generator, a sequential baseline, and preliminary OpenMP and CUDA implementations of topological sort.

The graph generator allows the specification of several parameters, including the number of vertices, the average outdegree per node, the number of layers, and the Dirichlet shape. This lets us generate a variety of graph types that differ in size, sparsity, and depth.
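A layered generator along these lines could look like the sketch below. The details are assumptions, not our exact generator: we interpret the Dirichlet shape as governing the relative layer sizes (sampled as normalized gamma variates), and draw per-node outdegrees from a Poisson distribution, with every edge pointing to a higher-numbered node so the result is acyclic by construction.

```cpp
#include <random>
#include <vector>

// Layered-DAG generator sketch (hypothetical parameterization): n nodes are
// split across `layers` layers with relative sizes drawn from a symmetric
// Dirichlet(dirichlet_shape); each node draws ~avg_outdeg edges, all of which
// point to strictly later nodes in later layers, so the graph is acyclic.
std::vector<std::vector<int>> gen_layered_dag(int n, int layers,
                                              double avg_outdeg,
                                              double dirichlet_shape,
                                              unsigned seed) {
    std::mt19937 rng(seed);

    // Dirichlet sample (normalized gammas) -> node count per layer.
    std::gamma_distribution<double> gamma(dirichlet_shape, 1.0);
    std::vector<double> w(layers);
    double total = 0;
    for (double& x : w) { x = gamma(rng); total += x; }

    std::vector<int> layer_of(n);
    int next = 0;
    for (int l = 0; l < layers; l++) {
        int cnt = (l == layers - 1) ? n - next : (int)(n * w[l] / total);
        for (int i = 0; i < cnt && next < n; i++) layer_of[next++] = l;
    }
    while (next < n) layer_of[next++] = layers - 1;

    // Edges only go forward, and only into strictly later layers.
    std::vector<std::vector<int>> adj(n);
    std::poisson_distribution<int> deg(avg_outdeg);
    for (int u = 0; u + 1 < n; u++) {
        std::uniform_int_distribution<int> pick(u + 1, n - 1);
        int d = deg(rng);
        for (int e = 0; e < d; e++) {
            int v = pick(rng);
            if (layer_of[v] > layer_of[u]) adj[u].push_back(v);
        }
    }
    return adj;
}
```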

The CUDA implementation processes the graph one frontier at a time, launching a new kernel for each frontier. We implemented three versions: (1) a naive version that launches one thread per frontier node, uses atomics to update the shared in-degree and state variables, and writes ready nodes directly to a new global frontier; (2) a version that optimizes the write to the new global frontier by first aggregating ready nodes into a block-local frontier; and (3) a version that launches a warp per node, allowing its edges to be processed in parallel.
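The naive version can be sketched as the following kernel. This is an illustrative sketch rather than our exact code; it assumes the CSR field names (`row_ptr`, `col_idx`) and an `in_deg` array, and uses `atomicSub`/`atomicAdd` for the shared updates described above.

```cuda
// Naive frontier kernel sketch: one thread per frontier node. Each thread
// scans its node's successors in CSR form, atomically decrements their
// in-degrees, and appends newly ready nodes to the next global frontier.
__global__ void process_frontier(const int* frontier, int frontier_size,
                                 const int* row_ptr, const int* col_idx,
                                 int* in_deg, int* next_frontier,
                                 int* next_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;

    int u = frontier[i];
    for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
        int v = col_idx[e];
        // atomicSub returns the old value: the last predecessor to
        // finish (old value 1) is the one that makes v ready.
        if (atomicSub(&in_deg[v], 1) == 1) {
            int slot = atomicAdd(next_size, 1);  // global-write hot spot
            next_frontier[slot] = v;
        }
    }
}
```

The block-level version replaces the `atomicAdd` on `next_size` with a shared-memory buffer flushed once per block; the warp-level version assigns the edge loop to the lanes of a warp.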

The basic OpenMP implementation keeps the current frontier and parallelizes over its vertices. Newly discovered nodes whose in-degree reaches zero are added to the queue for the next frontier. Contention occurs when updating the in-degrees of discovered nodes and when pushing to the next queue.

Updated Goals and Deliverables

So far, we are on track with our goals and deliverables in terms of implementing a graph generator as well as implementing a parallel OpenMP and CUDA version of topological sort. However, we may have to reduce the performance expectations of the CUDA version.

Specifically, the CUDA version requires massive parallelism to be successful: a large outdegree (1000+ per node) or large frontier sizes (1000+ nodes). Even then, it achieves only about 2x speedup over the CPU version, because the CUDA version performs many global memory writes, which are inefficient on GPUs, and the workload has low arithmetic intensity overall.

The OpenMP version with 8 processors achieves roughly 3x speedup on the best inputs. Our next step is to profile the code to identify the remaining bottlenecks.

By the poster session, we also believe we can implement backpropagation, which was one of our stretch goals. We will present detailed graphs of performance across the different graph types and the different optimizations we implemented.

Preliminary Results

Results on CUDA:

| Nodes | Average Outdegree | Average Frontier Size | CUDA Baseline Speedup | CUDA Block-Level Speedup | CUDA Warp-Level Speedup |
| --- | --- | --- | --- | --- | --- |
| 100,000 | 1000 | 1000 | 0.46x | 0.39x | 2.45x |
| 10M | 5 | 100 | 0.20x | 0.29x | 0.57x |
| 10M | 5 | 1000 | 1.11x | 1.50x | 2.57x |

Results on OpenMP:

| Nodes | Average Outdegree | OpenMP Basic Speedup | OpenMP Thread-Local Speedup |
| --- | --- | --- | --- |
| 100,000 | 1000 | 0.44x | 0.53x |
| 10M | 5 | 1.60x | 3.42x |
| 10M | 50 | 2.69x | 2.11x |

Concerns and Issues