The Doctoral School in Sciences and Engineering is happy to invite you to Guillaume HELBECQUE’s defence entitled
PGAS-based Parallel Branch-and-Bound for Ultra-Scale GPU-powered Supercomputers
Supervisor: Prof Pascal BOUVRY
Branch-and-Bound (B&B) algorithms are widely used for the exact resolution of many combinatorial optimization problems. Their parallel implementation to solve increasingly large instances presents several challenges related to the dynamic generation of large, highly irregular trees. With the advent of the exascale era, modern supercomputers are now composed of thousands of hybrid compute nodes, each integrating multi-core processors coupled with GPU accelerators. This hierarchical organization, providing multi-level parallelism (intra-node, GPU, inter-node or cluster, etc.), makes exascale parallel implementation complex. To address this complexity, most existing works employ the “evolutionary” MPI+X approach, which extends the MPI standard used for inter-node level with environments for intra-node parallelism (OpenMP, CUDA, etc.). In this thesis, we investigate the PGAS (Partitioned Global Address Space) approach, an alternative to MPI+X, in the context of implementing B&B algorithms for exascale. This “revolutionary” approach provides a higher level of parallelism abstraction, unifying intra-node and inter-node levels.
The first contribution of this thesis focuses on the design and implementation of a PGAS-based data structure, named distBag, dedicated to depth-first exploration of large, irregular trees. This multi-pool data structure integrates a dynamic load-balancing mechanism based on large-scale work-stealing, operating at both intra- and inter-node levels. This mechanism, which required sophisticated synchronization, promotes locality in work-stealing, enabling scalability. The data structure and its load-balancing mechanism are implemented in Chapel (HPE/Cray) and provided as a module in this PGAS-based language designed for exascale. The second contribution of this thesis extends the proposed work to the multi-GPU context to accelerate the extensive and costly evaluation of tree nodes being explored. The challenge of ensuring implementation portability across multi-vendor GPU architectures (NVIDIA and AMD) is addressed.
The algorithms developed in this thesis have been designed to be generic and to promote their reuse. This is evidenced by the application of these algorithms to various combinatorial optimization problems, including permutation flow-shop scheduling, binary knapsack problems, the N-Queens problem, and the Unbalanced Tree Search benchmark. Experimental validation was conducted on two TOP500 supercomputers (MeluXina and LUMI), among others. The results show that our PGAS-based algorithms are competitive in terms of scalability at both intra-node and inter-node levels compared to those obtained with the MPI+X approach. Additionally, the results confirmed the optimality of solutions for some of the largest flow-shop instances, utilizing up to 400 compute nodes, or 51,200 CPU cores. Furthermore, scalability concerning the number of GPUs was evaluated on 128 compute nodes, totaling 1,024 GPU accelerators. Overall, our results demonstrate the competitiveness of PGAS approaches compared to MPI+X, while identifying opportunities for further enhancement.