A3 Mega Slurm cluster overview

This document provides an overview of the Slurm deployment of an A3 accelerator-optimized machine family cluster on Google Cloud. This solution uses the A3 Mega (a3-megagpu-8g) machine type. Each a3-megagpu-8g VM has eight NVIDIA H100 GPUs, offers 80 GB of GPU memory per GPU, and can be configured to use GPUDirect-TCPXO. Clusters created by using this machine type are ideal for running large-scale artificial intelligence (AI) and machine learning (ML) training workloads.

Deployment architecture

This section provides an overview of the deployment architecture.

HPC blueprints

The deployment of an A3 Mega Slurm cluster uses three HPC blueprints, one for each of the following tasks (a sketch of the deployment sequence follows this list):

  • Base (networking & home filesystem) setup: the slurm-a3mega-base.yaml blueprint provisions a Virtual Private Cloud network and one Filestore file system for mounting /home across the cluster.

  • Image building: the slurm-a3mega-image.yaml blueprint builds a custom Debian 12 OS image that has Slurm pre-installed. This OS image also includes the latest kernel modules and configurations that are necessary to support the highest network performance.

  • Cluster deployment: the slurm-a3mega-cluster.yaml blueprint provisions a Slurm cluster that uses the custom Debian 12 OS image built in the previous step. You can update and re-provision the cluster deployment as needed.

    In addition to a system network card, each a3-megagpu-8g VM has eight network interfaces (NICs) dedicated to GPU communication. This blueprint also creates one Virtual Private Cloud network for each of these GPU NICs and sets the MTU of each network to 8244.
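
The following is a hedged sketch of deploying the three blueprints in order with the Cluster Toolkit. The gcluster binary name, the -d deployment-file flag, and the pairing of deployment-base.yaml with the base blueprint and deployment-image-cluster.yaml with the image and cluster blueprints are assumptions based on typical Cluster Toolkit usage; check the documentation for your Toolkit version for the exact syntax.

    # Hypothetical deployment sequence; binary name, flags, and file layout are assumptions.

    # 1. Base: VPC network and the Filestore file system for /home.
    ./gcluster deploy -d deployment-base.yaml slurm-a3mega-base.yaml

    # 2. Image: custom Debian 12 OS image with Slurm pre-installed.
    ./gcluster deploy -d deployment-image-cluster.yaml slurm-a3mega-image.yaml

    # 3. Cluster: Slurm cluster built from the custom OS image.
    ./gcluster deploy -d deployment-image-cluster.yaml slurm-a3mega-cluster.yaml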

Deployment files

This solution uses two deployment files, deployment-base.yaml and deployment-image-cluster.yaml, to centralize the configuration that is shared across the three HPC blueprints, which minimizes the number of changes needed in each individual blueprint file.

With this approach, the lifecycle of the Filestore instance and the lifecycle of the cluster are separated, which allows the cluster to be deleted while retaining access to data and home directories.
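
For example, you could tear down only the cluster deployment while leaving the base deployment, and with it the Filestore instance, untouched. The following is a rough sketch; the gcluster destroy command form and the deployment folder name are assumptions rather than values taken from this solution.

    # Hypothetical teardown; the deployment folder name is an assumption.
    # Destroying only the cluster deployment leaves the base deployment, and
    # therefore the Filestore /home data, in place.
    ./gcluster destroy slurm-a3mega-cluster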

Network performance components

The following components are used to optimize the network performance for your a3-megagpu-8g Slurm cluster. After deploying the cluster, see Enable GPUDirect-TCPXO optimized NCCL communication for an example of configuring a workload to use GPUDirect-TCPXO.

GPUDirect-TCPXO

GPUDirect-TCPXO is a custom, remote direct memory access (RDMA) networking stack that increases the network performance of your VMs by allowing data packet payloads to transfer directly from GPU memory to the network interface without having to go through the CPU and system memory. a3-megagpu-8g VMs can use GPUDirect-TCPXO combined with Google Virtual NIC (gVNIC) to deliver higher throughput between VMs in a cluster when compared to the A2 accelerator-optimized machine types on Google Cloud.
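
As a rough illustration of what using GPUDirect-TCPXO looks like from inside a Slurm job, the sketch below points NCCL at the plugin libraries under /var/lib/tcpxo/lib64 before launching the workload. The nccl-env-profile.sh helper name, the environment variables, and the training command are assumptions; see Enable GPUDirect-TCPXO optimized NCCL communication for the authoritative settings.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --gpus-per-node=8

    # Assumed plugin location (installed on the host by the Prolog described below);
    # the nccl-env-profile.sh helper script is an assumption, not confirmed here.
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    [ -f "${NCCL_LIB_DIR}/nccl-env-profile.sh" ] && source "${NCCL_LIB_DIR}/nccl-env-profile.sh"

    # Launch the NCCL-based training workload (placeholder command).
    srun python train.py
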
The Receive Data Path Manager (RxDM)

To achieve optimal application performance, an additional service called the Receive Data Path Manager (RxDM) runs alongside the applications that use GPUDirect-TCPXO.

Additionally, an NCCL net plugin must be installed into the execution environment of the workload. Both the RxDM and the NCCL net plugin are distributed as Docker container images.
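
If the workload itself runs inside a container, the plugin libraries installed on the host also have to be made visible to that container. The following is a hedged sketch only; the image variable, mount point, and library path are placeholders, not values taken from this solution.

    # Hypothetical: expose the host's TCPXO plugin libraries to a containerized workload.
    # WORKLOAD_IMAGE and the in-container paths are placeholders.
    docker run --gpus all \
      --volume /var/lib/tcpxo/lib64:/usr/local/tcpxo/lib64 \
      --env LD_LIBRARY_PATH=/usr/local/tcpxo/lib64 \
      "${WORKLOAD_IMAGE}" python train.py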

The cluster deployment blueprint

The slurm-a3mega-cluster.yaml blueprint includes Slurm Prolog and Epilog scripts that run, respectively, before and after every job that runs on more than one a3-megagpu-8g compute node.

The Prolog performs the following actions (a simplified sketch follows this list):

  • Ensures that the import-helper kernel module is loaded.
  • Installs the NCCL net plugin into /var/lib/tcpxo/lib64/ of the host.
  • Runs the RxDM service, a long-lived service that runs alongside the job. Starting the RxDM service can take 10-20 seconds and blocks the start of the job until the service is initialized; as a result, you won't see the Slurm job's output or error logs until the RxDM service has started.
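
The following is a simplified, hypothetical sketch of those Prolog steps. It is not the actual script shipped in the blueprint; the module handling, plugin installation, and RxDM container invocation are all placeholders.

    #!/bin/bash
    # Simplified, hypothetical Prolog sketch; not the blueprint's actual script.

    # 1. Ensure the import-helper kernel module is loaded (no-op if already loaded).
    modprobe import-helper

    # 2. Install the NCCL net plugin onto the host (installation details omitted).
    mkdir -p /var/lib/tcpxo/lib64
    # ...copy the plugin libraries into /var/lib/tcpxo/lib64...

    # 3. Start the long-lived RxDM service next to the job. RXDM_IMAGE is a
    #    placeholder; initialization takes roughly 10-20 seconds, during which
    #    the job (and its output/error logs) is held back.
    docker run --detach --name "rxdm-${SLURM_JOB_ID}" "${RXDM_IMAGE}"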

The Epilog performs the following actions (again, a simplified sketch follows the list):

  • Stops the RxDM service.
  • Prunes any stopped containers and frees up disk space.
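
A matching hypothetical Epilog sketch follows; the container name reuses the placeholder from the Prolog sketch above and is not taken from the blueprint.

    #!/bin/bash
    # Simplified, hypothetical Epilog sketch; not the blueprint's actual script.

    # 1. Stop the RxDM service that the Prolog started for this job.
    docker stop "rxdm-${SLURM_JOB_ID}" || true

    # 2. Prune stopped containers to free up disk space.
    docker container prune --force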

For more information about Prologs and Epilogs, see the Slurm documentation.

What's next