Train a machine learning model with TensorFlow 2 on AI Platform Training by using runtime version 2.1 or later. TensorFlow 2 simplifies many APIs from TensorFlow 1. The TensorFlow documentation provides a guide to migrating TensorFlow 1 code to TensorFlow 2.
Running a training job with TensorFlow 2 on AI Platform Training follows the same process as running other custom code training jobs. However, some AI Platform Training features work differently with TensorFlow 2 compared to how they work with TensorFlow 1. This document provides a summary of these differences.
Python version support
Runtime versions 2.1 and later support training only with Python 3.7. Therefore, you must use Python 3.7 to train with TensorFlow 2.
The Python Software Foundation ended support for Python 2.7 on January 1, 2020. No AI Platform runtime versions released after January 1, 2020 support Python 2.7.
Distributed training
TensorFlow 2 provides an updated API for distributed training. Additionally, AI Platform Training sets the TF_CONFIG environment variable differently in runtime versions 2.1 and later. This section describes both changes.
Distribution strategies
To perform distributed training with multiple virtual machine (VM) instances in TensorFlow 2, use the tf.distribute.Strategy API. In particular, we recommend that you use the Keras API together with the MultiWorkerMirroredStrategy or, if you specify parameter servers for your job, the ParameterServerStrategy. However, note that TensorFlow currently only provides experimental support for these strategies.
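For example, the following sketch shows the general pattern of multi-worker training with Keras. The model and dataset are placeholders that you would replace with your own; in runtime version 2.1 the strategy is still experimental, so it lives under tf.distribute.experimental.

```python
import tensorflow as tf

# Minimal sketch of multi-worker training with Keras (placeholder model and
# data). In TensorFlow 2.1 the strategy is experimental, so it lives under
# tf.distribute.experimental.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Build and compile the model inside the strategy scope so that its
    # variables are mirrored across the workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# Each worker builds the same input pipeline; the strategy coordinates the
# synchronized gradient updates across workers.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1000, 10]), tf.random.normal([1000, 1]))
).batch(32)

model.fit(dataset, epochs=5)
```

AI Platform Training sets the TF_CONFIG environment variable on each VM (see the next section), so the strategy can discover the other workers without extra configuration in your training code.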
TF_CONFIG
TensorFlow expects a TF_CONFIG environment variable to be set on each VM used for training. AI Platform Training automatically sets this environment variable on each VM used in your training job. This lets each VM behave differently depending on its type, and it helps the VMs communicate with each other.
In runtime version 2.1 and later, AI Platform Training no longer uses the master task type in any TF_CONFIG environment variables. Instead, your training job's master worker is labeled with the chief type in the TF_CONFIG environment variable. Learn more about how AI Platform Training sets the TF_CONFIG environment variable.
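As an illustration, training code can parse TF_CONFIG to find out which role the current VM plays. The sketch below is one possible pattern, not something AI Platform Training requires; note that it checks for the chief type rather than master.

```python
import json
import os

# Sketch: parse the TF_CONFIG environment variable that AI Platform Training
# sets on each VM. The variable is a JSON object with "cluster" and "task"
# entries describing the job topology and this VM's role.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task = tf_config.get('task', {})
task_type = task.get('type')      # e.g. 'chief', 'worker', 'ps'
task_index = task.get('index', 0)

# In runtime version 2.1 and later, the master worker is labeled 'chief',
# so check for 'chief' (not 'master') when deciding which VM should do
# one-time work such as exporting the trained model.
is_chief = task_type in (None, 'chief')  # None: no TF_CONFIG set (local run)

if is_chief:
    print('Running as the chief (task index {}).'.format(task_index))
else:
    print('Running as a {} (task index {}).'.format(task_type, task_index))
```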
Accelerators for training
AI Platform Training lets you accelerate your training jobs with GPUs and TPUs.
GPUs
To learn how to use GPUs for training, read the AI Platform Training guide to configuring GPUs and TensorFlow's guide to using GPUs.
If you want to train on a single VM with multiple GPUs, the best practice is to use TensorFlow's MirroredStrategy. If you want to train using multiple VMs with GPUs, the best practice is to use TensorFlow's MultiWorkerMirroredStrategy, for example as shown in the sketch below.
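A minimal single-VM sketch looks like the following; compared to the multi-worker sketch earlier in this document, only the strategy constructor changes, and the model is again a placeholder.

```python
import tensorflow as tf

# Minimal sketch of single-VM, multi-GPU training. MirroredStrategy detects
# the GPUs attached to the VM and keeps one model replica per GPU in sync.
strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model; build and compile it inside the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# The rest of the training code is unchanged: call model.fit() with your
# tf.data.Dataset, exactly as in the multi-worker example above.
```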
TPUs
To learn how to use TPUs for training, read the guide to training with TPUs.
Hyperparameter tuning
If you are running a hyperparameter tuning job with TensorFlow 2, you might need to adjust how your training code reports your hyperparameter tuning metric to the AI Platform Training service.
If you are training with an Estimator, you can write your metric to a summary in the same way that you do in TensorFlow 1. If you are training with Keras, we recommend that you use tf.summary.scalar to write a summary.
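For example, the following sketch uses a custom Keras callback to report the metric with tf.summary.scalar after each epoch. The callback name and the 'val_loss' tag are illustrative assumptions; the tag you write must match the hyperparameter tuning metric configured for your training job.

```python
import tensorflow as tf

class HptuningMetricCallback(tf.keras.callbacks.Callback):
    """Illustrative callback that reports a metric with tf.summary.scalar.

    The tag it writes ('val_loss' here) should match the hyperparameter
    tuning metric configured for your training job.
    """

    def __init__(self, log_dir, metric='val_loss'):
        super().__init__()
        self._writer = tf.summary.create_file_writer(log_dir)
        self._metric = metric

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if self._metric in logs:
            with self._writer.as_default():
                tf.summary.scalar(self._metric, logs[self._metric], step=epoch)
            self._writer.flush()

# Usage sketch: pass the callback to model.fit() along with any others.
# model.fit(train_dataset, validation_data=val_dataset, epochs=10,
#           callbacks=[HptuningMetricCallback('gs://your-bucket/logs')])
```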
What's next
- Read about configuring the runtime version and Python version for a training job.
- Read more about configuring distributed training.
- Read more about hyperparameter tuning.