https://andytang02.github.io/parallel-topo-sort/
We will implement a parallel topological sort to evaluate complex computational graphs with backpropagation, in two versions: OpenMP and CUDA. We will evaluate speedup and memory efficiency on different kinds of graphs.
The application accelerates execution of an arbitrary computational graph, represented as a directed acyclic graph (DAG) where each node corresponds to a computation and edges represent data dependencies. Unlike simple pipelines, this graph can have complex branching and merging structure, capturing general-purpose computations such as scientific workflows, compiler intermediate representations, or differentiable programs.
The topological sort algorithm performs the computation while respecting data dependencies. In brief, the algorithm maintains a set of “ready” nodes with no unmet dependencies, processes them, and updates dependent nodes until the entire graph is evaluated. The reverse pass follows the same structure but traverses the graph in the opposite direction while accumulating gradients. An example sequential implementation is shown below.
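The sequential listing referenced above can be sketched as follows: a minimal Kahn's-algorithm evaluator in which the per-node computation is abstracted away and the function simply records the evaluation order (the names `q` and `in_deg` match the variables discussed later):

```cpp
#include <queue>
#include <vector>

// Evaluate a DAG in topological order (Kahn's algorithm).
// adj[u] lists the successors of node u; the actual per-node
// computation would run where the node is appended to `order`.
std::vector<int> topo_evaluate(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<int> in_deg(n, 0);
    for (int u = 0; u < n; ++u)
        for (int v : adj[u]) ++in_deg[v];

    std::queue<int> q;                  // "ready" nodes: all deps satisfied
    for (int u = 0; u < n; ++u)
        if (in_deg[u] == 0) q.push(u);

    std::vector<int> order;
    while (!q.empty()) {
        int u = q.front(); q.pop();
        order.push_back(u);             // process(u) would run here
        for (int v : adj[u])
            if (--in_deg[v] == 0)       // last dependency just finished
                q.push(v);
    }
    return order;                       // |order| == n iff the graph is acyclic
}
```

The reverse pass has the same shape, but seeds the queue with nodes of out-degree zero and walks predecessor edges while accumulating gradients.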
The application can benefit from parallelism because the DAG defines only a partial order, not a total order. At any point in time, multiple nodes may have all dependencies satisfied and can therefore execute simultaneously. This creates a dynamic “frontier” of ready nodes that can be processed in parallel.
Nonetheless, parallelizing the algorithm is challenging due to complex data dependencies and heavy contention on shared data structures. As mentioned above, the computational graphs in this application have complex dependencies. A node cannot execute until all of its predecessors have completed, which limits parallelism along the graph’s critical path and can lead to load imbalance in irregular graphs.
In addition, shared data structures introduce contention: for example, the ready queue (q) used in the sequential implementation becomes a bottleneck when multiple threads attempt to push and pop nodes simultaneously. Similarly, updating the in-degree array (in_deg[v]) requires synchronization (e.g., atomic decrement operations), and many threads may contend when multiple predecessors finish at the same time. Speedup would be undermined without techniques such as work-stealing queues or batching.
Yet another challenge is memory latency. Graph memory accesses have low arithmetic intensity, and locality is poor without techniques such as compressed sparse row (CSR) formatting or sharding.
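CSR storage keeps each node's out-edges contiguous in memory, which improves locality when a thread (or warp) scans them. A minimal sketch of the layout (names are illustrative):

```cpp
#include <vector>

// CSR adjacency: successors of u are col_idx[row_ptr[u] .. row_ptr[u+1]).
// A node's edges sit contiguously, so scanning them is cache-friendly.
struct CsrGraph {
    std::vector<int> row_ptr;  // size n+1; prefix sums of outdegrees
    std::vector<int> col_idx;  // size = number of edges

    static CsrGraph from_adj(const std::vector<std::vector<int>>& adj) {
        CsrGraph g;
        g.row_ptr.reserve(adj.size() + 1);
        g.row_ptr.push_back(0);
        for (const auto& succ : adj) {
            g.col_idx.insert(g.col_idx.end(), succ.begin(), succ.end());
            g.row_ptr.push_back((int)g.col_idx.size());
        }
        return g;
    }
};
```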
We will implement the project primarily in C++, using OpenMP for multicore CPU parallelism and CUDA for GPU acceleration. Development and testing will be performed on the GHC machines (8 cores) and PSC clusters (up to 128 cores and GPUs) available for the course.
We will build the system from scratch, including graph generation, a sequential baseline, and both parallel implementations. We may also consult prior work or open-source implementations of parallel graph processing for guidance if they prove useful.
Plan to Achieve:
Hope to Achieve (Stretch Goals):
Fallback (Reduced goals):
We will implement the system using both OpenMP and CUDA to evaluate parallelism across CPU and GPU architectures. OpenMP is well-suited for this workload because the DAG resides in shared memory and requires frequent concurrent access to structures such as the ready queue and in-degree array, making shared-memory parallelism a natural fit. CUDA complements this by enabling large-scale parallel processing of frontier nodes, which can expose substantial parallelism in wide regions of the graph. This type of forward and backward traversal is also similar to computations in neural networks and other scientific workloads, where graph-structured dependencies arise naturally, making it an interesting candidate for GPU acceleration. Although the irregular structure and synchronization requirements of the algorithm may limit GPU efficiency compared to more regular workloads, implementing a CUDA version allows us to study these tradeoffs and directly compare performance across architectures.
| Week | Plan |
|---|---|
| Week 1 (Mar 23) | Implement graph generator; implement sequential baseline |
| Week 2 (Mar 30) | Implement initial OpenMP version; benchmark on small/medium graphs at low core counts; identify bottlenecks |
| Week 3 (Apr 6) | Optimize OpenMP with work-stealing queues and batching; scale to larger graphs and more cores (10M+ nodes, 64+ cores); begin CUDA design and kernel planning |
| Week 4 (Apr 13) — Milestone | Implement initial CUDA version; basic GPU kernel for frontier processing; compare CPU vs. GPU performance; prepare milestone report and evaluation |
| Week 5 (Apr 20) | Optimize CUDA implementation; run large-scale experiments; start detailed performance analysis |
| Week 6 (Apr 27) — Final | Final optimizations (OpenMP and CUDA); complete full evaluation including speedup curves and CPU vs. GPU comparison; prepare final report and visualizations |
| Week | Plan |
|---|---|
| Week 1–3 (Mar 23–Apr 6) | Implemented graph generator; implemented sequential baseline; implemented initial OpenMP version; implemented initial CUDA versions with some warp-level and block-level optimizations |
| Week 4 First Half (Apr 13–Apr 17) — Milestone | Benchmark initial OpenMP and CUDA versions on preliminary graphs; identify key bottlenecks; prepare milestone report |
| Week 4 Second Half (Apr 17–Apr 20) | (Yizhou) Optimize OpenMP using thread-local queues, reduced synchronization, and possibly work stealing; (Andy) optimize CUDA by reducing global memory traffic and improving frontier processing; run experiments on more diverse graph types |
| Week 5 First Half (Apr 20–Apr 24) | Implement backpropagation on the computational graph; integrate backprop into the sequential baseline; extend to parallel versions; continue profiling and optimizing OpenMP and CUDA |
| Week 5 Second Half | Optimize performance of backpropagation in the parallel versions; begin large-scale experiments on GHC and PSC |
| Week 6 First Half | Finalize optimizations for OpenMP and CUDA; evaluate performance across graph size, outdegree, frontier size, and graph depth; collect data on speedup, scalability, and memory behavior |
| Week 6 Second Half (Apr 27) — Final | Complete full evaluation including speedup curves and CPU vs. GPU comparison; prepare final report, poster, and visualizations |
So far, we have implemented a graph generator, a sequential baseline, and preliminary OpenMP and CUDA implementations of topological sort.
The graph generator accepts several parameters, including the number of vertices, the average outdegree per node, the number of layers, and a Dirichlet shape parameter. This lets us generate graphs that vary in size, sparsity, and depth.
The CUDA implementation processes the graph's frontiers one at a time, launching a new CUDA kernel for each frontier. We implemented three versions:

- a naive version that launches one thread per frontier node, uses atomics to update the shared in-degree and state arrays, and writes directly to a new global frontier;
- a block-level version that first aggregates newly ready nodes into a block-local frontier before writing them out globally;
- a warp-level version that launches a warp per node, allowing a node's edges to be processed in parallel.
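As a rough sketch, the naive version's kernel has the following shape (assuming a CSR adjacency; names such as `frontier_step` and `next_size` are illustrative, not taken from our code):

```cuda
// Naive frontier kernel (version 1): one thread per frontier node.
// Atomics guard the shared in-degree array and the next-frontier cursor.
__global__ void frontier_step(const int* row_ptr, const int* col_idx,
                              int* in_deg,
                              const int* frontier, int frontier_size,
                              int* next_frontier, int* next_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;
    int u = frontier[i];
    for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
        int v = col_idx[e];
        // atomicSub returns the old value; 1 means v just became ready.
        if (atomicSub(&in_deg[v], 1) == 1) {
            int slot = atomicAdd(next_size, 1);  // global write: hot spot
            next_frontier[slot] = v;
        }
    }
}
```

The block-level version replaces the direct `atomicAdd` on `next_size` with aggregation into a `__shared__` buffer that is flushed once per block; the warp-level version assigns the loop over `e` to the lanes of a warp.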
The basic OpenMP implementation keeps the current frontier and parallelizes over its vertices. Newly discovered nodes whose in-degree reaches zero are added to the next frontier's queue. Contention arises both when updating the in-degrees of discovered nodes and when pushing to the shared next queue.
So far, we are on track with our goals and deliverables: we have implemented the graph generator as well as parallel OpenMP and CUDA versions of topological sort. However, we may have to lower our performance expectations for the CUDA version.
Specifically, the CUDA version requires massive parallelism to be successful: a large outdegree (1000+ per node) or large frontier sizes (1000+). Even then, it achieves only about 2x speedup over the CPU version, because the CUDA implementation performs many global memory writes, which are inefficient on GPUs, and the workload has low arithmetic intensity overall.
The OpenMP version achieves 3x speedup with 8 processors on the best inputs. Our next step is to profile the code to identify the remaining bottlenecks.
By the poster session, we believe we can also implement backpropagation, which was one of our stretch goals. We will present detailed graphs of performance across the different graph types and the different optimizations we implemented.
Results on CUDA:
| Nodes | Average Outdegree | Average Frontier Size | CUDA Baseline Speedup | CUDA Block-Level Speedup | CUDA Warp-Level Speedup |
|---|---|---|---|---|---|
| 100,000 | 1000 | 1000 | 0.46x | 0.39x | 2.45x |
| 10M | 5 | 100 | 0.20x | 0.29x | 0.57x |
| 10M | 5 | 1000 | 1.11x | 1.50x | 2.57x |
Results on OpenMP:
| Nodes | Average Outdegree | OpenMP Basic Speedup | OpenMP Thread-local Speedup |
|---|---|---|---|
| 100,000 | 1000 | 0.44x | 0.53x |
| 10M | 5 | 1.60x | 3.42x |
| 10M | 50 | 2.69x | 2.11x |