TPU v5p

This document describes the architecture and supported configurations of Cloud TPU v5p.

System architecture

This section describes the system architecture specific to the v5p version. Each TensorCore has four Matrix Multiply Units (MXU), a vector unit, and a scalar unit.

There are 8960 chips in a single v5p Pod. The largest job that can be scheduled is a 96-cube (6144-chip) job.

The following table shows the key specifications for a v5p.

Key specifications             v5p values
Peak compute per chip (bf16)   459 TFLOPs
HBM2e capacity and bandwidth   95 GB, 2765 GBps
TPU Pod size                   8960 chips
Interconnect topology          3D Torus *
Interchip Interconnect BW      4800 Gbps

Configurations

A TPU v5p Pod is composed of 8960 chips interconnected with reconfigurable high-speed links. TPU v5p's flexible networking lets you connect the chips in a same-sized slice in multiple ways. When you create a TPU slice using the gcloud compute tpus tpu-vm create command, you specify its type and shape using the AcceleratorType or AcceleratorConfig parameters.

The following table shows the most common single-slice shapes supported with v5p, plus most (but not all) full cube shapes greater than 1 cube. The maximum v5p shape is 16x16x24 (6144 chips, 96 cubes).

Slice Shape   VM Size     # Cores   # Chips   # of Machines   # of Cubes   Supports Twisted?
2x2x1         Full host   8         4         1               N/A          N/A
2x2x2         Full host   16        8         2               N/A          N/A
2x4x4         Full host   64        32        8               N/A          N/A
4x4x4         Full host   128       64        16              1            N/A
4x4x8         Full host   256       128       32              2            Yes
4x8x8         Full host   512       256       64              4            Yes
8x8x8         Full host   1024      512       128             8            N/A
8x8x16        Full host   2048      1024      256             16           Yes
8x16x16       Full host   4096      2048      512             32           Yes
16x16x16      Full host   8192      4096      1024            64           N/A
16x16x24      Full host   12288     6144      1536            96           N/A
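The column relationships in the preceding table follow from fixed ratios: two TensorCores per chip, four chips per host machine, and 64 chips (a 4x4x4 block) per cube. A minimal sketch of those ratios (the function name is illustrative, not part of any API):

```python
def slice_counts(x, y, z):
    """Derive v5p slice counts from a 3D topology shape (illustrative sketch)."""
    chips = x * y * z        # one chip per coordinate in the 3D topology
    cores = chips * 2        # two TensorCores per v5p chip
    machines = chips // 4    # each host machine holds four chips
    cubes = chips // 64      # a cube (rack) is a 4x4x4 block of 64 chips
    return {"cores": cores, "chips": chips, "machines": machines, "cubes": cubes}

# The largest supported slice, 16x16x24:
print(slice_counts(16, 16, 24))
```

Applying it to any row of the table (for example, 4x4x4) reproduces the listed core, chip, machine, and cube counts.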

Single-slice training is supported for up to 6144 chips and is extensible to 18432 chips using Multislice. For details, see the Cloud TPU Multislice Overview.

Using the AcceleratorType parameter

When you allocate TPU resources, you use the --accelerator-type argument to specify the number of TensorCores in a slice. --accelerator-type is a formatted string "v$VERSION_NUMBERp-$CORES_COUNT". For example, v5p-32 specifies a v5p TPU slice with 32 TensorCores (16 chips).
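Because each v5p chip has two TensorCores, the chip count is always half the core count encoded in the accelerator type. A hypothetical helper to illustrate the mapping (not part of any Cloud TPU API):

```python
def parse_accelerator_type(accelerator_type):
    """Split a string like 'v5p-32' into version, cores, and chips (sketch)."""
    version, cores_str = accelerator_type.rsplit("-", 1)
    cores = int(cores_str)
    # v5p has two TensorCores per chip, so chips = cores / 2.
    return {"version": version, "cores": cores, "chips": cores // 2}

print(parse_accelerator_type("v5p-32"))
```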

To provision TPUs for a v5p training job, use one of the following accelerator types in your CLI or TPU API creation request:

  • v5p-8
  • v5p-16
  • v5p-32
  • v5p-64
  • v5p-128 (one full cube/rack)
  • v5p-256 (2 cubes)
  • v5p-512
  • v5p-1024 ... v5p-12288
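For example, a creation request for a 16-chip slice might look like the following; the TPU name, zone, and runtime version are placeholders, so check the current documentation for the runtime image that matches your framework:

```shell
# Sketch: provision a v5p-32 slice (32 TensorCores, 16 chips).
gcloud compute tpus tpu-vm create my-v5p-slice \
  --zone=us-east5-a \
  --accelerator-type=v5p-32 \
  --version=v2-alpha-tpuv5
```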

Using the AcceleratorConfig parameter

For v5p and later Cloud TPU versions, AcceleratorConfig is used in much the same way it is with Cloud TPU v4. The difference is that instead of specifying the TPU type as --type=v4, you specify the TPU version you are using (for example, --type=v5p for the v5p release).
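Assuming the same flag shape that Cloud TPU v4 uses for AcceleratorConfig, an equivalent request for a 16-chip slice might look like this (the TPU name, zone, and runtime version are placeholders):

```shell
# Sketch: the same 16-chip slice expressed as a TPU version plus a topology.
gcloud compute tpus tpu-vm create my-v5p-slice \
  --zone=us-east5-a \
  --type=v5p \
  --topology=2x2x4 \
  --version=v2-alpha-tpuv5
```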

Cloud TPU ICI resiliency

ICI resiliency helps improve the fault tolerance of the optical links and optical circuit switches (OCS) that connect TPUs between cubes. (ICI connections within a cube use copper links, which are not affected.) ICI resiliency allows connections to be routed around OCS and optical ICI faults. As a result, it improves the scheduling availability of TPU slices, with the trade-off of temporarily degraded ICI performance.

Similar to Cloud TPU v4, ICI resiliency is enabled by default for v5p slices that are one cube or larger:

  • v5p-128 when specifying accelerator type
  • 4x4x4 when specifying accelerator config

VM, host, and slice properties

Property                Value in a TPU VM
# of v5p chips          4
# of vCPUs              208 (only half usable if using NUMA binding, to avoid a cross-NUMA performance penalty)
RAM (GB)                448 (only half usable if using NUMA binding, to avoid a cross-NUMA performance penalty)
# of NUMA Nodes         2
NIC Throughput (Gbps)   200
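To get the NUMA-bound behavior described above, you can pin a process's CPU and memory allocation to a single NUMA node with numactl; the training script name here is a placeholder:

```shell
# Bind CPU scheduling and memory allocation to NUMA node 0
# so the process never pays the cross-NUMA penalty.
numactl --cpunodebind=0 --membind=0 python3 train.py
```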

Relationship between the number of TensorCores, chips, hosts/VMs, and cubes in a Pod:

                          Cores   Chips   Hosts/VMs   Cubes
Host                      8       4       1           N/A
Cube (aka rack)           128     64      16          1
Largest supported slice   12288   6144    1536        96
v5p full Pod              17920   8960    2240        140