TPU v5e

This document describes the architecture and supported configurations of Cloud TPU v5e.

TPU v5e supports single and multi-host training and single-host inference. Multi-host inference is supported using Sax. For more information, see Large Language Model Serving.

System architecture

Each v5e chip contains one TensorCore. Each TensorCore has four matrix-multiply units (MXUs), a vector unit, and a scalar unit.

The following diagram illustrates a TPU v5e chip.

Diagram of a v5e chip

The following table shows the key chip specifications and their values for v5e.

Key chip specifications v5e values
Peak compute per chip (bf16) 197 TFLOPs
HBM2 capacity and bandwidth 16 GB, 819 GBps
Interchip Interconnect BW 1600 Gbps

The following table shows Pod specifications and their values for v5e.

Key Pod specifications v5e values
TPU Pod size 256 chips
Interconnect topology 2D Torus
Peak compute per Pod 100 PetaOps (Int8)
All-reduce bandwidth per Pod 51.2 TB/s
Bisection bandwidth per Pod 1.6 TB/s
Data center network bandwidth per Pod 6.4 Tbps

Configurations

Cloud TPU v5e is a combined training and inference (serving) product. To differentiate between a training and an inference environment, use the AcceleratorType or AcceleratorConfig flags with the TPU API or the --machine-type flag when creating a GKE node pool.

Training jobs are optimized for throughput and availability, while serving jobs are optimized for latency. A training job run on TPUs provisioned for serving could have lower availability; similarly, a serving job run on TPUs provisioned for training could have higher latency.

Use AcceleratorType to specify the number of TensorCores you want to use. Because each v5e chip has a single TensorCore, the TensorCore count equals the chip count. Specify the AcceleratorType when you create a TPU using the gcloud CLI or the Google Cloud console. The value you specify for AcceleratorType is a string with the format v$VERSION_NUMBER-$CHIP_COUNT, for example, v5litepod-16 for a 16-chip slice.
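For example, the following gcloud CLI command is a minimal sketch of creating a 16-chip v5e slice by AcceleratorType; TPU_NAME, ZONE, and RUNTIME_VERSION are placeholders you replace with values valid for your project and region:

gcloud compute tpus tpu-vm create TPU_NAME \
    --zone=ZONE \
    --accelerator-type=v5litepod-16 \
    --version=RUNTIME_VERSION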

You can also use AcceleratorConfig to specify the number of TensorCores you want to use. However, because there are no custom 2D topology variants for TPU v5e, there is no difference between using AcceleratorConfig and AcceleratorType.

To configure a TPU v5e using AcceleratorConfig, use the --version and the --topology flags. Set --version to the TPU version you want to use and --topology to the physical arrangement of the TPU chips in the slice. The value you specify for --topology is a string with the format AxB, where A and B are the chip counts in each dimension.

The following 2D slice shapes are supported for v5e:

Topology Number of TPU chips Number of Hosts
1x1 1 1/8
2x2 4 1/2
2x4 8 1
4x4 16 2
4x8 32 4
8x8 64 8
8x16 128 16
16x16 256 32

Each TPU VM in a v5e TPU slice contains 1, 4, or 8 chips. Each physical host has 8 chips, which is why the host counts in the preceding table equal the chip count divided by 8. In 4-chip and smaller slices, all TPU chips share the same Non-Uniform Memory Access (NUMA) node.

For 8-chip v5e TPU VMs, CPU-TPU communication is more efficient within NUMA partitions. For example, in the following figure, CPU0-Chip0 communication is faster than CPU0-Chip4 communication.

NUMA node communication
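One generic way to take advantage of this NUMA locality on Linux is to pin a process and its memory allocations to a single NUMA node with numactl. This is a minimal sketch, not a TPU-specific requirement, and your_workload.py is a placeholder for your own script:

# Bind the process's CPUs and memory to NUMA node 0, which is local to
# Chip0 through Chip3 in the figure above.
numactl --cpunodebind=0 --membind=0 python3 your_workload.py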

Cloud TPU v5e types for serving

Single-host serving is supported for up to 8 v5e chips. The following configurations are supported: 1x1, 2x2, and 2x4 slices, which have 1, 4, and 8 chips, respectively.

TPU v5e configurations that support serving: 1x1, 2x2, and 2x4.

To provision TPUs for a serving job, use one of the following accelerator types in your CLI or API TPU creation request:

AcceleratorType (TPU API) Machine type (GKE API)
v5litepod-1 ct5lp-hightpu-1t
v5litepod-4 ct5lp-hightpu-4t
v5litepod-8 ct5lp-hightpu-8t
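For example, the following commands are a minimal sketch of provisioning an 8-chip single-host serving slice through each interface; TPU_NAME, NODE_POOL_NAME, CLUSTER_NAME, ZONE, and RUNTIME_VERSION are placeholders:

# TPU API (gcloud CLI)
gcloud compute tpus tpu-vm create TPU_NAME \
    --zone=ZONE \
    --accelerator-type=v5litepod-8 \
    --version=RUNTIME_VERSION

# GKE node pool (single host, so one node)
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --zone=ZONE \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1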

Serving on more than 8 v5e chips, also called multi-host serving, is supported using Sax. For more information, see Large Language Model Serving.

Cloud TPU v5e types for training

Training is supported for up to 256 chips.

To provision TPUs for a v5e training job, use one of the following accelerator types in your CLI or API TPU creation request:

AcceleratorType (TPU API) Machine type (GKE API) Topology
v5litepod-16 ct5lp-hightpu-4t 4x4
v5litepod-32 ct5lp-hightpu-4t 4x8
v5litepod-64 ct5lp-hightpu-4t 8x8
v5litepod-128 ct5lp-hightpu-4t 8x16
v5litepod-256 ct5lp-hightpu-4t 16x16
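For example, the following commands are a minimal sketch of provisioning a 64-chip (8x8) training slice; TPU_NAME, NODE_POOL_NAME, CLUSTER_NAME, ZONE, and RUNTIME_VERSION are placeholders. The --tpu-topology flag is assumed here as the GKE mechanism for multi-host TPU node pools; verify it against the GKE documentation for your gcloud version. Because each ct5lp-hightpu-4t machine has 4 chips, a 64-chip slice needs 16 nodes:

# TPU API (gcloud CLI)
gcloud compute tpus tpu-vm create TPU_NAME \
    --zone=ZONE \
    --accelerator-type=v5litepod-64 \
    --version=RUNTIME_VERSION

# GKE multi-host node pool: 64 chips / 4 chips per machine = 16 nodes
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --zone=ZONE \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=8x8 \
    --num-nodes=16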

v5e TPU VM type comparison:

VM Type n2d-48-24-v5lite-tpu n2d-192-112-v5lite-tpu n2d-384-224-v5lite-tpu
# of v5e chips 1 4 8
# of vCPUs 24 112 224
RAM (GB) 48 192 384
# of NUMA Nodes 1 1 2
Applies to v5litepod-1 v5litepod-4 v5litepod-8
Disruption High Medium Low

To make room for workloads that require more chips, schedulers might preempt VMs with fewer chips, so 8-chip VMs are likely to preempt 1-chip and 4-chip VMs.