TPU v5p
This document describes the architecture and supported configurations of Cloud TPU v5p.
System architecture
This section describes the system architecture specific to the v5p version. Each v5p chip contains two TensorCores, and each TensorCore has four matrix-multiply units (MXUs), a vector unit, and a scalar unit.
A single v5p Pod contains 8960 chips. The largest job that can be scheduled is a 96-cube (6144-chip) job.
The following table shows the key specifications for a v5p.
Key specifications | v5p values |
---|---|
Peak compute per chip (bf16) | 459 TFLOPs |
HBM2e capacity and bandwidth | 95 GB, 2765 GBps |
TPU Pod size | 8960 chips |
Interconnect topology | 3D torus |
Interchip interconnect bandwidth | 4800 Gbps |
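For reference, these per-chip figures scale linearly across a slice: a full 8960-chip v5p Pod provides roughly 8960 × 459 TFLOPs ≈ 4.1 bf16 exaFLOPs of peak compute.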
Configurations
A TPU v5p Pod is composed of 8960 chips interconnected with reconfigurable high-speed links. TPU v5p's flexible networking lets you connect the chips in a same-sized slice in multiple ways. When you create a TPU slice using the gcloud compute tpus tpu-vm create command, you specify its type and shape using the AcceleratorType or AcceleratorConfig parameters.
The following table shows the most common single-slice shapes supported with v5p, plus most (but not all) full-cube shapes larger than one cube. The maximum v5p shape is 16x16x24 (6144 chips, 96 cubes).
Slice Shape | VM Size | # Cores | # Chips | # of Machines | # of Cubes | Supports Twisted? |
---|---|---|---|---|---|---|
2x2x1 | Full host | 8 | 4 | 1 | N/A | N/A |
2x2x2 | Full host | 16 | 8 | 2 | N/A | N/A |
2x4x4 | Full host | 64 | 32 | 8 | N/A | N/A |
4x4x4 | Full host | 128 | 64 | 16 | 1 | N/A |
4x4x8 | Full host | 256 | 128 | 32 | 2 | Yes |
4x8x8 | Full host | 512 | 256 | 64 | 4 | Yes |
8x8x8 | Full host | 1024 | 512 | 128 | 8 | N/A |
8x8x16 | Full host | 2048 | 1024 | 256 | 16 | Yes |
8x16x16 | Full host | 4096 | 2048 | 512 | 32 | Yes |
16x16x16 | Full host | 8192 | 4096 | 1024 | 64 | N/A |
16x16x24 | Full host | 12288 | 6144 | 1536 | 96 | N/A |
Single-slice training is supported for up to 6144 chips and is extensible to 18432 chips using Multislice. See the Cloud TPU Multislice Overview for details.
Using the AcceleratorType parameter
When you allocate TPU resources, you use the --accelerator-type argument to specify the number of TensorCores in a slice. --accelerator-type is a formatted string "v$VERSION_NUMBERp-$CORES_COUNT". For example, v5p-32 specifies a v5p TPU slice with 32 TensorCores (16 chips).
To provision TPUs for a v5p training job, use one of the following accelerator types in your CLI or TPU API creation request (an example command follows this list):
- v5p-8
- v5p-16
- v5p-32
- v5p-64
- v5p-128 (one full cube/rack)
- v5p-256 (2 cubes)
- v5p-512
- v5p-1024 ... v5p-12288
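For example, a minimal gcloud command to create a v5p-32 slice might look like the following. The TPU name, project, zone, and runtime version are illustrative placeholders; substitute the values appropriate for your environment.
```
gcloud compute tpus tpu-vm create my-v5p-slice \
  --project=my-project \
  --zone=us-east5-a \
  --accelerator-type=v5p-32 \
  --version=v2-alpha-tpuv5
```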
Using the AcceleratorConfig parameter
For v5p and later Cloud TPU versions, AcceleratorConfig is used in much the same way it is with Cloud TPU v4. The difference is that instead of specifying the TPU type as --type=v4, you specify it as the TPU version you are using (for example, --type=v5p for the v5p release).
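As a sketch, requesting a one-cube slice (4x4x4, equivalent to v5p-128) with AcceleratorConfig could look like the following; the name, project, zone, and runtime version are again placeholders.
```
gcloud compute tpus tpu-vm create my-v5p-slice \
  --project=my-project \
  --zone=us-east5-a \
  --type=v5p \
  --topology=4x4x4 \
  --version=v2-alpha-tpuv5
```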
Cloud TPU ICI resiliency
ICI resiliency helps improve the fault tolerance of the optical links and optical circuit switches (OCS) that connect TPUs between cubes. (ICI connections within a cube use copper links and are not affected.) ICI resiliency allows ICI connections to be routed around OCS and optical ICI faults. As a result, it improves the scheduling availability of TPU slices, at the cost of temporary degradation in ICI performance.
Similar to Cloud TPU v4, ICI resiliency is enabled by default for v5p slices that are one cube or larger:
- v5p-128 when specifying accelerator type
- 4x4x4 when specifying accelerator config
VM, host and slice properties
Property | Value per v5p host (VM) |
# of v5p chips | 4 |
# of vCPUs | 208 (only half is usable if using NUMA binding to avoid cross-NUMA performance penalty) |
RAM (GB) | 448 (only half is usable if using NUMA binding to avoid cross-NUMA performance penalty) |
# of NUMA Nodes | 2 |
NIC Throughput (Gbps) | 200 |
 | Cores | Chips | Hosts/VMs | Cubes |
---|---|---|---|---|
Host | 8 | 4 | 1 | |
Cube (aka rack) | 128 | 64 | 16 | 1 |
Largest supported slice | 12288 | 6144 | 1536 | 96 |
v5p full Pod | 17920 | 8960 | 2240 | 140 |
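The vCPU and RAM notes in the properties table refer to NUMA binding: pinning a training process to a single NUMA node so it avoids the cross-NUMA performance penalty. A minimal sketch using numactl is shown below; train.py is a stand-in for whatever script you run on each TPU VM.
```
# Install numactl on the TPU VM (Debian-based image assumed).
sudo apt-get update && sudo apt-get install -y numactl

# Bind the process to NUMA node 0 (both CPUs and memory), so it uses
# only the half of the vCPUs and RAM that is local to that node.
numactl --cpunodebind=0 --membind=0 python3 train.py
```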