AI Platform Training provides model training as an asynchronous (batch) service.
This page describes how to configure and submit a training job by running gcloud ai-platform jobs submit training from the command line or by sending a request to the API at projects.jobs.create.
Before you begin
Before you can submit a training job, you must package your application and upload it and any unusual dependencies to a Cloud Storage bucket.
Note: If you use the Google Cloud CLI to submit your job, you can package the application and submit the job in the same step.
Configuring the job
You pass your parameters to the training service by setting the members of the Job resource, which includes the items in the TrainingInput resource.
If you use the Google Cloud CLI to submit your training jobs, you can:
- Specify the most common training parameters as flags of the gcloud ai-platform jobs submit training command.
- Pass the remaining parameters in a YAML configuration file, named config.yaml by convention. The configuration file mirrors the structure of the JSON representation of the Job resource. You pass the path of your configuration file in the --config flag of the gcloud ai-platform jobs submit training command. So, if the path to your configuration file is config.yaml, you must set --config=config.yaml.
Gathering the job configuration data
The following properties are used to define your job.
- Job name (jobId) - A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter). For one way to generate and check a job name, see the sketch after this list.
- Cluster configuration (scaleTier) - A scale tier specifying the type of processing cluster to run your job on. This can be the CUSTOM scale tier, in which case you also explicitly specify the number and type of machines to use.
- Disk configuration (diskConfig) - Configuration of the boot disk for each training VM. This field is optional; by default, each VM runs with a 100 GB pd-ssd boot disk. Specifying this field might incur extra disk charges.
- Training application package (packageUris) - A packaged training application that is staged in a Cloud Storage location. If you are using the Google Cloud CLI, the application packaging step is largely automated. See the details in the guide to packaging your application.
- Module name (pythonModule) - The name of the main module in your package. The main module is the Python file you call to start the application. If you use the gcloud command to submit your job, specify the main module name in the --module-name flag. See the guide to packaging your application.
- Region (region) - The Compute Engine region where you want your job to run. You should run your training job in the same region as the Cloud Storage bucket that stores your training data. See the available regions for AI Platform Training services.
- Job directory (jobDir) - The path to a Cloud Storage location to use for job output. Most training applications save checkpoints during training and save the trained model to a file at the end of the job. You need a Cloud Storage location to save them to, and your Google Cloud project must have write access to this bucket. The training service automatically passes the path you set for the job directory to your training application as a command-line argument named job_dir. You can parse it along with your application's other arguments and use it in your code. The advantage of using the job directory is that the training service validates the directory before starting your application.
- Runtime version (runtimeVersion) - The AI Platform Training runtime version to use for the job.
- Python version (pythonVersion) - The Python version to use for the job. Python 3.5 is available in runtime versions 1.13 through 1.14. Python 3.7 is available in runtime versions 1.15 and later.
- Maximum wait time (scheduling.maxWaitTime) - A maximum waiting duration in seconds with the suffix s (for example, 3600s) that determines how long you allow your job to remain in the QUEUED and PREPARING states. AI Platform Training does not always start running your job immediately due to resource constraints; specify this field if you are not willing to wait longer than a certain duration for the job to run. The limited duration starts when you create the job. If the job has not yet entered the RUNNING state by the end of this period, AI Platform Training cancels the job. This field is optional and defaults to no limit. If you specify this field, you must set the value to at least 1800s (30 minutes).
- Maximum running time (scheduling.maxRunningTime) - A maximum running duration in seconds with the suffix s (for example, 7200s) for your training job. The limited duration starts when the job enters the RUNNING state. If the job is still running after this amount of time, AI Platform Training cancels the job. This field is optional and defaults to seven days (604800s).
- Service account (serviceAccount) - The email address of a service account for AI Platform Training to use when it runs your training application. This can give your training application access to Google Cloud resources without granting direct access to your project's AI Platform service agent. This field is optional. Learn more about the requirements for custom service accounts.
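The following is a minimal sketch of generating and checking a job name that follows the rules above. It appends a timestamp to a model name, as the gcloud example later on this page does; the model name my_model is only a placeholder.
import datetime
import re

# Placeholder model name; replace with your own.
model_name = 'my_model'

# Append the current date and time so that each job name is unique.
timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
job_name = '{}_{}'.format(model_name, timestamp)

# Enforce the stated rules: mixed-case letters, numbers, and underscores
# only, starting with a letter.
if not re.match(r'^[A-Za-z][A-Za-z0-9_]*$', job_name):
    raise ValueError('Invalid job name: {}'.format(job_name))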
Formatting your configuration parameters
How you specify your configuration details depends on how you are starting your training job:
gcloud
Provide the job configuration details to the gcloud ai-platform jobs submit training command.
You can do this in two ways:
- With command-line flags.
- In a YAML file representing the Job resource. You can name this file whatever you want. By convention, the name is config.yaml.
Even if you use a YAML file, certain details must be supplied as command-line flags. For example, you must provide the --module-name flag and at least one of --package-path or --packages. If you use --package-path, you must also include --job-dir or --staging-bucket.
Additionally, you must either provide the --region flag or set a default region for your gcloud client.
These options, and any others you provide as command-line flags, override the values for those options in your configuration file.
Example 1: In this example, you choose a preconfigured machine cluster and supply all the required details as command-line flags when submitting the job. No configuration file is necessary. See the guide to submitting the job in the next section.
Example 2: The following example shows the contents of the configuration file for a job with a custom processing cluster. The configuration file includes some but not all of the configuration details, assuming that you supply the other required details as command-line flags when submitting the job.
trainingInput:
scaleTier: CUSTOM
masterType: complex_model_m
workerType: complex_model_m
parameterServerType: large_model
workerCount: 9
parameterServerCount: 3
runtimeVersion: '2.11'
pythonVersion: '3.7'
scheduling:
maxWaitTime: 3600s
maxRunningTime: 7200s
The preceding example specifies Python version 3.7, which is available when you use AI Platform Training runtime version 1.15 or later. It also configures worker and parameter server virtual machines; only configure these machines if your job performs distributed training using TensorFlow or custom containers. Read more about machine types.
Python
When you submit a training job using the Google API Client Library for Python, set your configuration in a dictionary with the same structure as the Job resource. This takes the form of a dictionary with two keys, jobId and trainingInput, whose respective values are the name for the job and a second dictionary with keys for the fields of the TrainingInput resource.
The following example shows how to build a Job representation for a job with a custom processing cluster.
training_inputs = {
'scaleTier': 'CUSTOM',
'masterType': 'complex_model_m',
'workerType': 'complex_model_m',
'parameterServerType': 'large_model',
'workerCount': 9,
'parameterServerCount': 3,
'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
'pythonModule': 'trainer.task',
'args': ['--arg1', 'value1', '--arg2', 'value2'],
'region': 'us-central1',
'jobDir': 'gs://my/training/job/directory',
'runtimeVersion': '2.11',
'pythonVersion': '3.7',
'scheduling': {'maxWaitTime': '3600s', 'maxRunningTime': '7200s'},
}
job_spec = {'jobId': 'my_job_name', 'trainingInput': training_inputs}
Note that training_inputs
and job_spec
are arbitrary identifiers: you
can name these dictionaries whatever you want. However, the dictionary keys
must be named exactly as shown, to match the names in the Job
and TrainingInput
resources.
The preceding example specifies Python version 3.7, which is available when you use AI Platform Training runtime version 1.15 or later. It also configures worker and parameter server virtual machines; only configure these machines if your job performs distributed training using TensorFlow or custom containers. Read more about machine types.
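Because the configuration file mirrors the JSON representation of the Job resource, you can also build the trainingInput dictionary by loading a config.yaml file like the one shown in the gcloud example. The following is a minimal sketch of that approach; it assumes the PyYAML package is installed and that config.yaml has the structure shown earlier, neither of which is required by AI Platform Training.
import yaml

# Load the same config.yaml you would pass to gcloud with --config.
# PyYAML is an extra dependency used only for this sketch.
with open('config.yaml') as f:
    config = yaml.safe_load(f)

# The file's trainingInput section uses the same keys as the TrainingInput
# resource, so it can seed the dictionary and be extended in code.
training_inputs = config['trainingInput']
training_inputs.update({
    'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
    'pythonModule': 'trainer.task',
    'region': 'us-central1',
})

job_spec = {'jobId': 'my_job_name', 'trainingInput': training_inputs}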
Submitting the job
When submitting a training job, you specify two sets of flags:
- Job configuration parameters. AI Platform Training needs these values to set up resources in the cloud and deploy your application on each node in the processing cluster.
- User arguments, or application parameters. AI Platform Training passes the value of these flags through to your application.
Create your job:
gcloud
Submit a training job using the gcloud ai-platform jobs submit training command.
First, it's useful to define some environment variables containing your configuration details. To create a job name, the following code appends the date and time to the model name:
PACKAGE_PATH="/path/to/your/application/sources"
now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="your_name_$now"
MODULE_NAME="trainer.task"
JOB_DIR="gs://your/chosen/job/output/path"
REGION="us-east1"
RUNTIME_VERSION="2.11"
The following job submission corresponds to configuration example 1 above, where you choose a preconfigured scale tier (basic) and you decide to supply all the configuration details via command-line flags. There is no need for a config.yaml file:
gcloud ai-platform jobs submit training $JOB_NAME \
--scale-tier basic \
--package-path $PACKAGE_PATH \
--module-name $MODULE_NAME \
--job-dir $JOB_DIR \
--region $REGION \
-- \
--user_first_arg=first_arg_value \
--user_second_arg=second_arg_value
The following job submission corresponds to configuration example 2 above, where some of the configuration is in the file and you supply the other details via command-line flags:
gcloud ai-platform jobs submit training $JOB_NAME \
--package-path $PACKAGE_PATH \
--module-name $MODULE_NAME \
--job-dir $JOB_DIR \
--region $REGION \
--config config.yaml \
-- \
--user_first_arg=first_arg_value \
--user_second_arg=second_arg_value
Notes:
- If you specify an option both in your configuration file (config.yaml) and as a command-line flag, the value on the command line overrides the value in the configuration file.
- The empty -- flag marks the end of the gcloud-specific flags and the start of the USER_ARGS that you want to pass to your application.
- Flags specific to AI Platform Training, such as --module-name, --runtime-version, and --job-dir, must come before the empty -- flag. The AI Platform Training service interprets these flags.
- The --job-dir flag, if specified, must come before the empty -- flag, because AI Platform Training uses the --job-dir to validate the path.
- Your application must handle the --job-dir flag too, if specified. Even though the flag comes before the empty --, the --job-dir is also passed to your application as a command-line flag.
- You can define as many USER_ARGS as you need. AI Platform Training passes --user_first_arg, --user_second_arg, and so on, through to your application. One way for your application to handle these flags is shown in the sketch after these notes.
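The following is a minimal sketch of how a training application might parse the flags described above with argparse. It assumes the user arguments from the example invocation (--user_first_arg and --user_second_arg); substitute your own argument names.
import argparse

parser = argparse.ArgumentParser()

# Passed through by AI Platform Training when you specify --job-dir.
parser.add_argument('--job-dir', type=str, default=None)

# User arguments from the example invocation above; use your own names.
parser.add_argument('--user_first_arg', type=str, default=None)
parser.add_argument('--user_second_arg', type=str, default=None)

args = parser.parse_args()

# args.job_dir holds the Cloud Storage path you chose for job output, which
# your code can use when saving checkpoints and the trained model.
print(args.job_dir, args.user_first_arg, args.user_second_arg)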
Python
You can use the Google API Client Library for Python to call the AI Platform Training and Prediction API without manually constructing HTTP requests. Before you run the following code sample, you must set up authentication.
Save your project ID in the format the APIs need ('projects/PROJECT_NAME'):
project_name = 'my_project_name'
project_id = 'projects/{}'.format(project_name)
Get a Python representation of the AI Platform Training services:
from googleapiclient import discovery
cloudml = discovery.build('ml', 'v1')
Form your request and send it. Note that job_spec was created in the previous step where you formatted the configuration parameters:
request = cloudml.projects().jobs().create(body=job_spec, parent=project_id)
response = request.execute()
Catch any HTTP errors. The simplest way is to put the previous command in a try block:
from googleapiclient import errors
import logging

try:
    response = request.execute()
    # You can put your code for handling success (if any) here.
except errors.HttpError as err:
    # Do whatever error response is appropriate for your application.
    # For this example, just send some text to the logs.
    logging.error('There was an error creating the training job.'
                  ' Check the details:')
    logging.error(err._get_reason())
What's next
- Monitor or visualize your training job while it runs.
- See more details on specifying machine types.
- See how to configure a hyperparameter tuning job.
- Get ready to deploy your trained model for prediction.