When you run a training job on AI Platform Training, you must specify the number and types of machines you need. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. Alternatively, you can choose a custom tier and specify the machine types yourself.
Specifying your configuration
How you specify your cluster configuration depends on how you plan to run your training job:
gcloud
Create a YAML configuration file representing the TrainingInput object, and specify the scale tier identifier and machine types in the configuration file. You can name this file whatever you want. By convention the name is config.yaml.
The following example shows the contents of the configuration file, config.yaml, for a job with a custom processing cluster.
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: n1-highcpu-16
  parameterServerType: n1-highmem-8
  evaluatorType: n1-highcpu-16
  workerCount: 9
  parameterServerCount: 3
  evaluatorCount: 1
Provide the path to the YAML file in the --config flag when running the gcloud ai-platform jobs submit training command:
gcloud ai-platform jobs submit training $JOB_NAME \
  --package-path $TRAINER_PACKAGE_PATH \
  --module-name $MAIN_TRAINER_MODULE \
  --job-dir $JOB_DIR \
  --region $REGION \
  --config config.yaml \
  -- \
  --user_first_arg=first_arg_value \
  --user_second_arg=second_arg_value
Alternatively, you may specify cluster configuration details with command-line flags, rather than in a configuration file. Learn more about how to use these flags.
The following example shows how to submit a training job with a configuration similar to the previous example, but without using a configuration file:
gcloud ai-platform jobs submit training $JOB_NAME \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--job-dir $JOB_DIR \
--region $REGION \
--scale-tier custom \
--master-machine-type n1-highcpu-16 \
--worker-machine-type n1-highcpu-16 \
--parameter-server-machine-type n1-highmem-8 \
--worker-count 9 \
--parameter-server-count 3 \
-- \
--user_first_arg=first_arg_value \
--user_second_arg=second_arg_value
See more details on how to run a training job.
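If you launch jobs from a script, the flag-based command above can be assembled programmatically. The following is a minimal Python sketch, not an official client: build_submit_command and its parameter names are hypothetical, and it uses only the flags that appear in the example above.

```python
import shlex


def build_submit_command(job_name, region, package_path, module_name, job_dir,
                         master_type, worker_type=None, worker_count=0,
                         ps_type=None, ps_count=0, user_args=()):
    """Return the gcloud argument list for a CUSTOM-tier training job.

    Hypothetical helper: it only mirrors the flags shown in the example
    command and does not run anything.
    """
    cmd = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--package-path", package_path,
        "--module-name", module_name,
        "--job-dir", job_dir,
        "--region", region,
        "--scale-tier", "custom",
        "--master-machine-type", master_type,
    ]
    # Only add worker flags when workers were requested.
    if worker_count > 0:
        cmd += ["--worker-machine-type", worker_type,
                "--worker-count", str(worker_count)]
    # Only add parameter server flags when parameter servers were requested.
    if ps_count > 0:
        cmd += ["--parameter-server-machine-type", ps_type,
                "--parameter-server-count", str(ps_count)]
    # Everything after a bare "--" is passed to the training application.
    if user_args:
        cmd += ["--", *user_args]
    return cmd


command = build_submit_command(
    job_name="my_job",  # placeholder value
    region="us-central1",
    package_path="trainer/",  # placeholder value
    module_name="trainer.task",
    job_dir="gs://my/training/job/directory",
    master_type="n1-highcpu-16",
    worker_type="n1-highcpu-16", worker_count=9,
    ps_type="n1-highmem-8", ps_count=3,
    user_args=["--user_first_arg=first_arg_value"],
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to subprocess.run once you have checked the printed command.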
Python
Specify the scale tier identifier and machine types in the TrainingInput object in your job configuration.
The following example shows how to build a Job representation for a job with a custom processing cluster.
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-highcpu-16',
    'workerType': 'n1-highcpu-16',
    'parameterServerType': 'n1-highmem-8',
    'evaluatorType': 'n1-highcpu-16',
    'workerCount': 9,
    'parameterServerCount': 3,
    'evaluatorCount': 1,
    'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
    'pythonModule': 'trainer.task',
    'args': ['--arg1', 'value1', '--arg2', 'value2'],
    'region': 'us-central1',
    'jobDir': 'gs://my/training/job/directory',
    'runtimeVersion': '2.11',
    'pythonVersion': '3.7',
}

# my_job_name is a string that you define to identify the job.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}
Note that training_inputs and job_spec are arbitrary identifiers: you can name these dictionaries whatever you want. However, the dictionary keys must be named exactly as shown, to match the names in the Job and TrainingInput resources.
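Because a misspelled key silently fails to match the resource field, it can help to check the dictionary before you submit it. The following is a minimal sketch: EXAMPLE_KEYS and unexpected_keys are hypothetical names, and the key set contains only the keys used in the example above, not the full TrainingInput resource.

```python
# Keys taken from the example above. This is NOT the complete list of
# TrainingInput fields; extend it with any other fields you rely on.
EXAMPLE_KEYS = {
    "scaleTier", "masterType", "workerType", "parameterServerType",
    "evaluatorType", "workerCount", "parameterServerCount", "evaluatorCount",
    "packageUris", "pythonModule", "args", "region", "jobDir",
    "runtimeVersion", "pythonVersion",
}


def unexpected_keys(training_inputs):
    """Return the keys that do not appear in EXAMPLE_KEYS, sorted by name."""
    return sorted(set(training_inputs) - EXAMPLE_KEYS)


# 'master_type' uses snake_case, so it does not match 'masterType'.
candidate = {"scaleTier": "CUSTOM", "master_type": "n1-highcpu-16"}
print(unexpected_keys(candidate))
```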
Scale tiers
Google may optimize the configuration of the scale tiers for different jobs over time, based on customer feedback and the availability of cloud resources. Each scale tier is defined in terms of its suitability for certain types of jobs. Generally, the more advanced the tier, the more machines are allocated to the cluster, and the more powerful the specifications of each virtual machine. As you increase the complexity of the scale tier, the hourly cost of training jobs, measured in training units, also increases. See the pricing page to calculate the cost of your job.
AI Platform Training does not support distributed training or training with accelerators for scikit-learn or XGBoost code. If your training job runs scikit-learn or XGBoost code, you must set the scale tier to either BASIC or CUSTOM.
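The framework restriction above can be expressed as a small pre-submission check. This is an illustrative sketch only: the function name and the framework labels ("scikit-learn", "xgboost") are assumptions made for the example, not identifiers used by the service.

```python
# Frameworks that, per the text above, cannot use distributed training
# or accelerators. The labels are hypothetical, chosen for this sketch.
SINGLE_MACHINE_FRAMEWORKS = {"scikit-learn", "xgboost"}
TIERS_FOR_SINGLE_MACHINE = {"BASIC", "CUSTOM"}


def scale_tier_allowed(framework, scale_tier):
    """Return True if the scale tier is usable with the given framework."""
    if framework.lower() in SINGLE_MACHINE_FRAMEWORKS:
        return scale_tier in TIERS_FOR_SINGLE_MACHINE
    # Other frameworks are not restricted by this particular rule.
    return True


print(scale_tier_allowed("scikit-learn", "STANDARD_1"))
print(scale_tier_allowed("XGBoost", "CUSTOM"))
```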
Below are the scale tier identifiers:
AI Platform Training scale tier | Description |
---|---|
BASIC | A single worker instance. This tier is suitable for learning how to use AI Platform Training and for experimenting with new models using small datasets. Compute Engine machine name: n1-standard-4 |
STANDARD_1 | One master instance, plus four workers and three parameter servers. Only use this scale tier if you are training with TensorFlow or using custom containers. Compute Engine machine name, master: n1-highcpu-8, workers: n1-highcpu-8, parameter servers: n1-standard-4 |
PREMIUM_1 | One master instance, plus 19 workers and 11 parameter servers. Only use this scale tier if you are training with TensorFlow or using custom containers. Compute Engine machine name, master: n1-highcpu-16, workers: n1-highcpu-16, parameter servers: n1-highmem-8 |
BASIC_GPU | A single worker instance with one GPU. To learn more about graphics processing units (GPUs), see the section on training with GPUs. Only use this scale tier if you are training with TensorFlow or using a custom container. Compute Engine machine name: n1-standard-8 with one GPU |
BASIC_TPU | A master VM and a Cloud TPU with eight TPU v2 cores. See how to use TPUs for your training job. Only use this scale tier if you are training with TensorFlow or using custom containers. Compute Engine machine name, master: n1-standard-4, workers: Cloud TPU (8 TPU v2 cores) |
CUSTOM | The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to the guidelines in the section on machine types for the custom scale tier. |
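To compare the CPU-only predefined tiers at a glance, the cluster shapes from the table can be written down as data. This is a sketch under the assumption that the table above is current; as noted earlier, Google may change these configurations over time. PREDEFINED_TIERS and machine_count are hypothetical names.

```python
# Snapshot of the CPU-only tiers from the table above. The values may
# change over time; treat this as an illustration, not a source of truth.
PREDEFINED_TIERS = {
    "BASIC": {
        "masterType": "n1-standard-4",
        "workerCount": 0, "parameterServerCount": 0,
    },
    "STANDARD_1": {
        "masterType": "n1-highcpu-8",
        "workerType": "n1-highcpu-8", "workerCount": 4,
        "parameterServerType": "n1-standard-4", "parameterServerCount": 3,
    },
    "PREMIUM_1": {
        "masterType": "n1-highcpu-16",
        "workerType": "n1-highcpu-16", "workerCount": 19,
        "parameterServerType": "n1-highmem-8", "parameterServerCount": 11,
    },
}


def machine_count(tier_name):
    """Total number of VMs in a tier: one master plus workers and servers."""
    tier = PREDEFINED_TIERS[tier_name]
    return 1 + tier["workerCount"] + tier["parameterServerCount"]


for name in PREDEFINED_TIERS:
    print(name, machine_count(name))
```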
Machine types for the custom scale tier
Use a custom scale tier for finer control over the processing cluster that you use to train your model. Specify the configuration in the TrainingInput object in your job configuration. If you're using the gcloud ai-platform jobs submit training command to submit your training job, you can use the same identifiers:
Set the scale tier (scaleTier) to CUSTOM.
Set values for the number of workers (workerCount), parameter servers (parameterServerCount), and evaluators (evaluatorCount) that you need.
AI Platform Training only supports distributed training when you train with TensorFlow or use a custom container. If your training job runs scikit-learn or XGBoost code, do not specify workers, parameter servers, or evaluators.
Set the machine type for your master worker (masterType). If you have chosen to use workers, parameter servers, or evaluators, then set machine types for them in the workerType, parameterServerType, and evaluatorType fields respectively.
You can specify different machine types for masterType, workerType, parameterServerType, and evaluatorType, but you can't use different machine types for individual instances. For example, you can use an n1-highmem-8 machine type for your parameter servers, but you can't set some parameter servers to use n1-highmem-8 and some to use n1-highcpu-16.
If you need just one worker with a custom configuration (not a full cluster), you should specify a custom scale tier with a machine type for the master only. That gives you just the single worker. Here's an example config.yaml file:
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
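The guidelines above can be turned into a quick consistency check on a TrainingInput dictionary before submission. The following sketch encodes only the rules stated in this section; custom_tier_problems is a hypothetical helper, and the service itself may enforce additional constraints.

```python
# Each optional role pairs a count field with a machine type field.
ROLE_FIELDS = (
    ("workerCount", "workerType"),
    ("parameterServerCount", "parameterServerType"),
    ("evaluatorCount", "evaluatorType"),
)


def custom_tier_problems(training_input):
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    if training_input.get("scaleTier") != "CUSTOM":
        problems.append("scaleTier must be CUSTOM")
    if not training_input.get("masterType"):
        problems.append("masterType is required")
    for count_key, type_key in ROLE_FIELDS:
        # A role that is used (count > 0) needs a machine type.
        if training_input.get(count_key, 0) > 0 and not training_input.get(type_key):
            problems.append(f"{type_key} is required when {count_key} > 0")
    return problems


single_worker = {"scaleTier": "CUSTOM", "masterType": "n1-highcpu-16"}
missing_type = {"scaleTier": "CUSTOM", "masterType": "n1-highcpu-16",
                "workerCount": 2}
print(custom_tier_problems(single_worker))
print(custom_tier_problems(missing_type))
```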
Compute Engine machine types
You can use the names of certain Compute Engine predefined machine types for your job's masterType, workerType, parameterServerType, and evaluatorType. If you are training with TensorFlow or using custom containers, you can optionally use various types of GPUs with these machine types.
The following list contains the Compute Engine machine type identifiers that you can use for your training job:
e2-standard-4
e2-standard-8
e2-standard-16
e2-standard-32
e2-highmem-2
e2-highmem-4
e2-highmem-8
e2-highmem-16
e2-highcpu-16
e2-highcpu-32
n2-standard-4
n2-standard-8
n2-standard-16
n2-standard-32
n2-standard-48
n2-standard-64
n2-standard-80
n2-highmem-2
n2-highmem-4
n2-highmem-8
n2-highmem-16
n2-highmem-32
n2-highmem-48
n2-highmem-64
n2-highmem-80
n2-highcpu-16
n2-highcpu-32
n2-highcpu-48
n2-highcpu-64
n2-highcpu-80
n1-standard-4
n1-standard-8
n1-standard-16
n1-standard-32
n1-standard-64
n1-standard-96
n1-highmem-2
n1-highmem-4
n1-highmem-8
n1-highmem-16
n1-highmem-32
n1-highmem-64
n1-highmem-96
n1-highcpu-16
n1-highcpu-32
n1-highcpu-64
n1-highcpu-96
c2-standard-4
c2-standard-8
c2-standard-16
c2-standard-30
c2-standard-60
m1-ultramem-40
m1-ultramem-80
m1-ultramem-160
m1-megamem-96
a2-highgpu-1g* (preview)
a2-highgpu-2g* (preview)
a2-highgpu-4g* (preview)
a2-highgpu-8g* (preview)
a2-megagpu-16g* (preview)
To learn about the technical specifications of each machine type, read the Compute Engine documentation about machine types.
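The identifiers in the list follow the Compute Engine naming pattern of series, class, and size. The sketch below splits a name into those parts; it assumes the usual Compute Engine convention (not stated on this page) that the trailing number is the vCPU count, except for a2 names, where the number before "g" is the GPU count. parse_machine_type is a hypothetical helper.

```python
import re

# series (e.g. n1), class (e.g. highcpu), size, optional "g" suffix for a2.
_MACHINE_TYPE = re.compile(
    r"^(?P<series>[a-z]\d)-(?P<kind>[a-z]+)-(?P<size>\d+)(?P<gpu>g?)$"
)


def parse_machine_type(name):
    """Split a Compute Engine machine type name into its components."""
    match = _MACHINE_TYPE.match(name)
    if match is None:
        raise ValueError(f"not a Compute Engine machine type name: {name!r}")
    size = int(match.group("size"))
    is_gpu_sized = bool(match.group("gpu"))
    return {
        "series": match.group("series"),
        "kind": match.group("kind"),
        # Assumption: size means vCPUs unless the name ends in "g".
        "vcpus": None if is_gpu_sized else size,
        "gpus": size if is_gpu_sized else None,
    }


print(parse_machine_type("n1-highcpu-16"))
print(parse_machine_type("a2-highgpu-8g"))
```

Legacy names such as complex_model_m do not match the pattern and raise ValueError, which makes the helper usable as a rough test for which group a name belongs to.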
Legacy machine types
Instead of using Compute Engine machine types for your job, you can specify legacy machine type names. These machine types provide the same vCPU and memory resources as equivalent Compute Engine machine types, but they have additional configuration limitations:
You cannot customize GPU usage using an acceleratorConfig. However, some legacy machine types include GPUs. See the following table.
If your training job configuration uses multiple machines, you cannot mix Compute Engine machine types with legacy machine types. Your master worker, workers, parameter servers, and evaluators must all use machine types from one group or the other.
For example, if you configure masterType to be n1-highcpu-32 (a Compute Engine machine type), you cannot set workerType to complex_model_m (a legacy machine type), but you can set it to n1-highcpu-16 (another Compute Engine machine type).
The following table describes the legacy machine types:
Legacy machine type | Description |
---|---|
standard | A basic machine configuration suitable for training simple models with small to moderate datasets. Compute Engine machine name: n1-standard-4 |
large_model | A machine with a lot of memory, specially suited for parameter servers when your model is large (having many hidden layers or layers with very large numbers of nodes). Compute Engine machine name: n1-highmem-8 |
complex_model_s | A machine suitable for the master and workers of the cluster when your model requires more computation than the standard machine can handle satisfactorily. Compute Engine machine name: n1-highcpu-8 |
complex_model_m | A machine with roughly twice the number of cores and roughly double the memory of complex_model_s. Compute Engine machine name: n1-highcpu-16 |
complex_model_l | A machine with roughly twice the number of cores and roughly double the memory of complex_model_m. Compute Engine machine name: n1-highcpu-32 |
standard_gpu | A machine equivalent to standard that also includes a single GPU. Only use this machine type if you are training with TensorFlow or using custom containers. Compute Engine machine name: n1-standard-8 with one GPU |
complex_model_m_gpu | A machine equivalent to complex_model_m that also includes four GPUs. Only use this machine type if you are training with TensorFlow or using custom containers. Compute Engine machine name: n1-standard-16 with 4 GPUs |
complex_model_l_gpu | A machine equivalent to complex_model_l that also includes eight GPUs. Only use this machine type if you are training with TensorFlow or using custom containers. Compute Engine machine name: n1-standard-32 with 8 GPUs |
standard_p100 | A machine equivalent to standard that also includes a single NVIDIA Tesla P100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers. Compute Engine machine name: n1-standard-8-p100x1 |
complex_model_m_p100 | A machine equivalent to complex_model_m that also includes four NVIDIA Tesla P100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers. Compute Engine machine name: n1-standard-16-p100x4 |
standard_v100 | A machine equivalent to standard that also includes a single NVIDIA Tesla V100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers. Compute Engine machine name: n1-standard-8-v100x1 |
large_model_v100 | A machine equivalent to large_model that also includes a single NVIDIA Tesla V100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers. Compute Engine machine name: n1-highmem-8-v100x1 |
complex_model_m_v100 | A machine equivalent to complex_model_m that also includes four NVIDIA Tesla V100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers. Compute Engine machine name: n1-standard-16-v100x4 |
complex_model_l_v100 | A machine equivalent to complex_model_l that also includes eight NVIDIA Tesla V100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers. Compute Engine machine name: n1-standard-32-v100x8 |
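The rule against mixing legacy and Compute Engine machine types in one job can be checked mechanically using the legacy names from the table. The following is a sketch with a hypothetical helper name; it skips cloud_tpu, which this page says can be combined with either group.

```python
# Legacy machine type names, copied from the table above.
LEGACY_MACHINE_TYPES = {
    "standard", "large_model", "complex_model_s", "complex_model_m",
    "complex_model_l", "standard_gpu", "complex_model_m_gpu",
    "complex_model_l_gpu", "standard_p100", "complex_model_m_p100",
    "standard_v100", "large_model_v100", "complex_model_m_v100",
    "complex_model_l_v100",
}
TYPE_FIELDS = ("masterType", "workerType", "parameterServerType",
               "evaluatorType")


def mixes_machine_type_groups(training_input):
    """Return True if legacy and Compute Engine types appear together."""
    names = [training_input[field] for field in TYPE_FIELDS
             if training_input.get(field)]
    # cloud_tpu may be used with either group, so it is ignored here.
    names = [name for name in names if name != "cloud_tpu"]
    is_legacy = [name in LEGACY_MACHINE_TYPES for name in names]
    return any(is_legacy) and not all(is_legacy)


mixed = {"masterType": "n1-highcpu-32", "workerType": "complex_model_m"}
consistent = {"masterType": "n1-highcpu-32", "workerType": "n1-highcpu-16"}
print(mixes_machine_type_groups(mixed))
print(mixes_machine_type_groups(consistent))
```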
Training with GPUs and TPUs
Some scale tiers and legacy machine types include graphics processing units (GPUs). You can also attach your own choice of several GPUs if you use a Compute Engine machine type. To learn more, read about training with GPUs.
To perform training with Tensor Processing Units (TPUs), you must use the BASIC_TPU scale tier or the cloud_tpu machine type. The cloud_tpu machine type has special configuration options: you can use it together with either Compute Engine machine types or with legacy machine types, and you can configure it to use 8 TPU v2 cores or 8 TPU v3 cores. Read about how to use TPUs for your training job.
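As an illustration of the cloud_tpu option, the sketch below builds a TrainingInput dictionary with a TPU worker. The workerConfig and acceleratorConfig nesting and the TPU_V2 and TPU_V3 type values are written from memory of the TrainingInput reference and should be treated as assumptions to verify against the TPU guide; tpu_training_input is a hypothetical helper.

```python
def tpu_training_input(master_type, tpu_version="TPU_V2"):
    """Build a CUSTOM-tier TrainingInput dict with one cloud_tpu worker.

    Assumption: the TPU is selected through workerConfig.acceleratorConfig
    with type TPU_V2 or TPU_V3 and a count of 8 cores.
    """
    if tpu_version not in ("TPU_V2", "TPU_V3"):
        raise ValueError("tpu_version must be 'TPU_V2' or 'TPU_V3'")
    return {
        "scaleTier": "CUSTOM",
        "masterType": master_type,
        "workerType": "cloud_tpu",
        "workerCount": 1,
        "workerConfig": {
            "acceleratorConfig": {"type": tpu_version, "count": 8},
        },
    }


config = tpu_training_input("n1-highcpu-16", tpu_version="TPU_V3")
print(config["workerConfig"])
```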