This page explains how custom containers support the structure of distributed training on AI Platform Training.
With custom containers, you can do distributed training with any ML framework that supports distribution. Although the terminology used here is based on TensorFlow's distributed model, you can use any other ML framework that has a similar distribution structure. For example, distributed training in MXNet uses a scheduler, workers, and servers. This corresponds to the distributed training structure of AI Platform Training custom containers, which uses a master, workers, and parameter servers.
Structure of the training cluster
If you run a distributed training job with AI Platform Training, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify. Your running job on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training:
- Master worker: Exactly one replica is designated the master worker (also known as the chief worker). This task manages the others and reports status for the job as a whole.
- Worker(s): One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
- Parameter server(s): One or more replicas may be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.
- Evaluator(s): One or more replicas may be designated as evaluators. These replicas can be used to evaluate your model. If you are using TensorFlow, note that TensorFlow generally expects that you use no more than one evaluator.
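To make the cluster composition concrete, the following is a minimal sketch of how these roles might be expressed as machine types and counts in a job configuration. The field names are assumed to follow the REST API's `TrainingInput` message (described in the next section), and the machine types shown are placeholders, not recommendations.

```python
# Minimal sketch of a four-role training cluster, written as the kind of
# dictionary you might use for TrainingInput (machine types are placeholders).
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "n1-standard-8",           # exactly one master worker
    "workerType": "n1-standard-8",
    "workerCount": 2,                        # two worker replicas
    "parameterServerType": "n1-standard-4",
    "parameterServerCount": 1,               # one parameter server replica
    "evaluatorType": "n1-standard-4",
    "evaluatorCount": 1,                     # TensorFlow expects at most one evaluator
}
```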
API mapping
The four different roles you can assign to machines in your training cluster correspond to four fields you can specify in `TrainingInput`, which represents the input parameters for a training job: `masterConfig.imageUri` represents the container image URI to be run on the master worker; `workerConfig.imageUri`, `parameterServerConfig.imageUri`, and `evaluatorConfig.imageUri` represent the container image URIs to be run on worker(s), parameter server(s), and evaluator(s), respectively. If no value is set for these fields, AI Platform Training uses the value of `masterConfig.imageUri`.
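For example, a job that runs one container image on the master worker and workers and different images on the parameter servers and evaluators might set these fields as in the sketch below. The image URIs are hypothetical.

```python
# Per-role container images (URIs are hypothetical). Combine these fields with
# the machine types and counts for the cluster; any role whose imageUri is not
# set falls back to masterConfig.imageUri.
role_images = {
    "masterConfig": {"imageUri": "gcr.io/my-project/trainer:latest"},
    "workerConfig": {"imageUri": "gcr.io/my-project/trainer:latest"},
    "parameterServerConfig": {"imageUri": "gcr.io/my-project/ps:latest"},
    "evaluatorConfig": {"imageUri": "gcr.io/my-project/evaluator:latest"},
}
```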
You can also set the values for each of these fields with their corresponding flags in `gcloud ai-platform jobs submit training`:

- For the master worker config, use `--master-image-uri`.
- For the worker config, use `--worker-image-uri`.
- For the parameter server config, use `--parameter-server-image-uri`.
- There is not currently a flag for specifying the container image URI for evaluators. You can specify `evaluatorConfig.imageUri` in a `config.yaml` configuration file.
See an example of how to submit a distributed training job with custom containers.
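If you submit jobs programmatically instead of with `gcloud`, the same fields go into the request body of the training API's `projects.jobs.create` method. The following is a minimal, illustrative sketch using the Google API Python client; the project ID, job ID, region, machine types, and image URI are placeholders.

```python
from googleapiclient import discovery  # pip install google-api-python-client

# Illustrative request body; the job ID, region, machine types, and image URI
# are placeholders.
job_body = {
    "jobId": "my_distributed_container_job",
    "trainingInput": {
        "scaleTier": "CUSTOM",
        "region": "us-central1",
        "masterType": "n1-standard-8",
        "workerType": "n1-standard-8",
        "workerCount": 2,
        "masterConfig": {"imageUri": "gcr.io/my-project/trainer:latest"},
        # workerConfig.imageUri is omitted, so workers reuse the master image.
    },
}

ml = discovery.build("ml", "v1")
response = ml.projects().jobs().create(
    parent="projects/my-project", body=job_body
).execute()
print(response.get("state"))
```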
Understanding CLUSTER_SPEC
AI Platform Training populates an environment variable, `CLUSTER_SPEC`, on every replica to describe how the overall cluster is set up. Like TensorFlow's `TF_CONFIG`, `CLUSTER_SPEC` describes every replica in the cluster, including its index and role (master worker, worker, parameter server, or evaluator).

When you run distributed training with TensorFlow, `TF_CONFIG` is parsed to build `tf.train.ClusterSpec`. Similarly, when you run distributed training with other ML frameworks, you must parse `CLUSTER_SPEC` to populate any environment variables or settings required by the framework.
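For instance, a container entrypoint for a framework other than TensorFlow might read `CLUSTER_SPEC` as in the sketch below and then map the task's role and index onto whatever settings that framework expects (the full format of the variable is described in the next section); the framework-specific mapping is left as a placeholder.

```python
import json
import os

# CLUSTER_SPEC is set by AI Platform Training on every replica of a custom
# container training job; fall back to an empty spec for local test runs.
cluster_spec = json.loads(os.environ.get("CLUSTER_SPEC", "{}"))

task = cluster_spec.get("task", {})
task_type = task.get("type", "master")   # "master", "worker", "ps", or "evaluator"
task_index = task.get("index", 0)        # zero-based index within that role

# Placeholder: translate the role and index into the environment variables or
# launcher arguments your framework expects.
print(f"Starting replica: role={task_type}, index={task_index}")
```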
The format of CLUSTER_SPEC
The `CLUSTER_SPEC` environment variable is a JSON string with the following format:
| Key | Description |
|---|---|
| `"cluster"` | The cluster description for your custom container. As with `TF_CONFIG`, this object is formatted as a TensorFlow cluster specification, and can be passed to the constructor of `tf.train.ClusterSpec`. |
| `"task"` | Describes the task of the particular node on which your code is running. You can use this information to write code for specific workers in a distributed job. This entry is a dictionary with the following keys: |
| `"type"` | The type of task performed by this node. Possible values are `master`, `worker`, `ps`, and `evaluator`. |
| `"index"` | The zero-based index of the task. Most distributed training jobs have a single master task, one or more parameter servers, and one or more workers. |
| `"trial"` | The identifier of the hyperparameter tuning trial currently running. When you configure hyperparameter tuning for your job, you set a number of trials to train. This value gives you a way to differentiate in your code between trials that are running. The identifier is a string value containing the trial number, starting at 1. |
| `"job"` | The job parameters you used when you initiated the job. In most cases, you can ignore this entry, as it replicates the data passed to your application through its command-line arguments. |
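Putting these keys together, the parsed value of `CLUSTER_SPEC` on, say, the second worker replica might look roughly like the sketch below. The hostnames, ports, and `"job"` contents are illustrative, not exact output from the service.

```python
# Roughly what json.loads(os.environ["CLUSTER_SPEC"]) could return on the
# second worker replica; addresses and the "job" entry are illustrative.
example_cluster_spec = {
    "cluster": {
        "master": ["replica-master-0:2222"],
        "worker": ["replica-worker-0:2222", "replica-worker-1:2222"],
        "ps": ["replica-ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},   # "trial" appears only with hyperparameter tuning
    "job": {"scale_tier": "CUSTOM"},          # mirrors the job's input parameters
}
```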
Comparison to TF_CONFIG
Note that AI Platform Training also sets the `TF_CONFIG` environment variable on each replica of all training jobs. AI Platform Training only sets the `CLUSTER_SPEC` environment variable on replicas of custom container training jobs. The two environment variables share some values, but they have different formats.
When you train with custom containers, the master replica is labeled in the `TF_CONFIG` environment variable with the `master` task name by default. You can configure it to be labeled with the `chief` task name instead by setting the `trainingInput.useChiefInTfConfig` field to `true` when you create your training job, or by using one or more evaluator replicas in your job. This is especially helpful if your custom container uses TensorFlow 2.
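Because the task name can be either `master` or `chief` depending on this configuration, training code that reads `TF_CONFIG` directly may want to treat the two names interchangeably. A minimal sketch:

```python
import json
import os

# TF_CONFIG is set on every replica; treat "chief" and "master" as the same role.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task_type = tf_config.get("task", {}).get("type", "chief")

is_chief = task_type in ("chief", "master")
if is_chief:
    # Only the chief/master replica should, for example, write checkpoints,
    # export the final model, and report overall job status.
    print("This replica coordinates the job.")
```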
Besides this configuration option, distributed training with TensorFlow works the same way when you use custom containers as when you use an AI Platform Training runtime version. See more details and examples on how to use `TF_CONFIG` for distributed training on AI Platform Training.
What's next
- Learn how to use custom containers for your training jobs.
- Train a PyTorch model using custom containers.