Containers on AI Platform Training is a feature that allows you to run your application within a Docker image. You can build your own custom container to run jobs on AI Platform Training, using ML frameworks and versions as well as non-ML dependencies, libraries and binaries that are not otherwise supported on AI Platform Training.
How training with containers works
Your training application, implemented in the ML framework of your choice, is the core of the training process.
- Create an application that trains your model, using the ML framework of your choice.
- Decide whether to use a custom container. There could be a runtime version that already supports your dependencies. Otherwise, you'll need to build a custom container for your training job. In your custom container, you pre-install your training application and all its dependencies onto an image that you'll use to run your training job.
- Store your training and verification data in a source that AI Platform Training can access. This usually means putting it in Cloud Storage, Bigtable, or another Google Cloud storage service associated with the same Google Cloud project that you're using for AI Platform Training.
- When your application is ready to run, you must build your Docker image and push it to Container Registry, making sure that the AI Platform Training service can access your registry.
- Submit your job using `gcloud ai-platform jobs submit training`, specifying your arguments in a `config.yaml` file or the corresponding `gcloud` flags (see the shell sketch after this list).
- The AI Platform Training service sets up resources for your job. It allocates one or more virtual machines (called training instances) based on your job configuration. You set up a training instance by using the custom container you specify as part of the `TrainingInput` object when you submit your training job.
- The training service runs your Docker image, passing through any command-line arguments you specify when you create the training job.
- You can get information about your running job in the following ways:
  - In Cloud Logging. You can find a link to your job logs on the AI Platform Training Jobs detail page in the Google Cloud console.
  - By requesting job details or streaming logs with the `gcloud` command-line tool (specifically, `gcloud ai-platform jobs stream-logs`).
  - By programmatically making status requests to the training service, using the `projects.jobs.get` method (see the Python sketch after this list). See more details about how to monitor training jobs.
- When your training job succeeds or encounters an unrecoverable error, AI Platform Training halts all job processes and cleans up the resources.
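For example, the build, push, and submit steps above might look like the following shell sketch. The image name, region, job name, and trainer arguments are placeholders rather than required values:

```shell
# Build the image from your Dockerfile and tag it for Container Registry.
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_URI=gcr.io/$PROJECT_ID/my-trainer:v1   # placeholder image name

docker build -f Dockerfile -t $IMAGE_URI ./
docker push $IMAGE_URI

# Submit the job, pointing AI Platform Training at the custom container.
# Arguments after the bare `--` are passed through to your training application.
gcloud ai-platform jobs submit training my_container_job_$(date +%Y%m%d_%H%M%S) \
  --region us-central1 \
  --master-image-uri $IMAGE_URI \
  -- \
  --epochs=10
```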
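As a sketch of the programmatic monitoring option, you can poll job status with the Google API Python client; the project and job names below are placeholders:

```python
# Sketch: check a training job's status via the projects.jobs.get method.
# Assumes the google-api-python-client package and application default credentials.
from googleapiclient import discovery

ml = discovery.build("ml", "v1")

# Placeholder project and job names.
job_name = "projects/my-project/jobs/my_container_job"

response = ml.projects().jobs().get(name=job_name).execute()
print(response["state"])  # e.g. QUEUED, PREPARING, RUNNING, SUCCEEDED, FAILED
```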
Advantages of custom containers
Custom containers allow you to specify and pre-install all the dependencies needed for your application.
- Faster start-up time. If you use a custom container with your dependencies pre-installed, you can save the time that your training application would otherwise take to install dependencies when starting up.
- Use the ML framework of your choice. If you can't find an AI Platform Training runtime version that supports the ML framework you want to use, then you can build a custom container that installs your chosen framework and use it to run jobs on AI Platform Training. For example, you can train with PyTorch.
- Extended support for distributed training. With custom containers, you can do distributed training using any ML framework.
- Use the newest version. You can also use the latest build or minor version of an ML framework. For example, you can build a custom container to train with `tf-nightly`.
Hyperparameter tuning with custom containers
To do hyperparameter tuning on AI Platform Training, you specify a goal metric, along with whether to minimize or maximize it. For example, you might want to maximize your model's accuracy or minimize your model's loss. You also list the hyperparameters you'd like to adjust, along with a range of acceptable values for each hyperparameter. AI Platform Training runs multiple trials of your training application, tracking and adjusting the hyperparameters after each trial. When the hyperparameter tuning job is complete, AI Platform Training reports the values for the most effective configuration of your hyperparameters, as well as a summary of each trial.
In order to do hyperparameter tuning with custom containers, you need to make the following adjustments:
- In your Dockerfile: install `cloudml-hypertune`.
- In your training code:
  - Use `cloudml-hypertune` to report the results of each trial by calling its helper function, `report_hyperparameter_tuning_metric` (see the Python sketch after this list).
  - Add command-line arguments for each hyperparameter, and handle the argument parsing with an argument parser such as `argparse`.
- In your job request: add a `HyperparameterSpec` to the `TrainingInput` object (a `config.yaml` sketch also follows below).
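To illustrate the training-code changes, here is a minimal Python sketch using `argparse` and the `cloudml-hypertune` helper. The hyperparameter, the metric tag, and the `train_and_evaluate` function are placeholders for your own code:

```python
# Sketch: read a tuned hyperparameter from the command line and report the trial metric.
import argparse

import hypertune  # provided by the cloudml-hypertune package


def train_and_evaluate(learning_rate):
    # Placeholder for your own training loop; return the value of your goal metric.
    return 0.0


def main():
    parser = argparse.ArgumentParser()
    # One command-line argument per hyperparameter listed in your HyperparameterSpec.
    parser.add_argument("--learning-rate", type=float, default=0.01)
    args = parser.parse_args()

    accuracy = train_and_evaluate(learning_rate=args.learning_rate)

    # Report the goal metric so the tuning service can compare trials.
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="accuracy",  # must match hyperparameterMetricTag
        metric_value=accuracy,
        global_step=1)


if __name__ == "__main__":
    main()
```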
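And as a sketch of the job request, a `config.yaml` along these lines attaches the `HyperparameterSpec` to the `TrainingInput` object; the image URI, metric tag, and parameter range are illustrative:

```yaml
trainingInput:
  scaleTier: BASIC
  masterConfig:
    imageUri: gcr.io/my-project/my-trainer:v1  # placeholder custom container
  hyperparameters:                             # the HyperparameterSpec
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy          # matches the tag reported in training code
    maxTrials: 10
    maxParallelTrials: 2
    params:
      - parameterName: learning-rate           # passed to the trainer as --learning-rate
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
```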
See an example of training with custom containers using hyperparameter tuning or learn more about how hyperparameter tuning works on AI Platform Training.
Using GPUs with custom containers
For training with GPUs, your custom container needs to meet a few special requirements. You must build a different Docker image than what you'd use for training with CPUs.
- Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the `nvidia/cuda` image as your base image is the recommended way to handle this: it has the matching versions of the CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly.
- Install your training application, along with your required ML framework and other dependencies, in your Docker image (a minimal sketch follows this list).
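For illustration only, a GPU training Dockerfile along these lines is sketched below; the CUDA base image tag, the framework, and the file layout are assumptions to adapt to your project:

```dockerfile
# Sketch of a GPU training image; pin versions that match your ML framework's requirements.
FROM nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04

# Install Python and pip.
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install the ML framework and any other dependencies (add cloudml-hypertune if you tune).
RUN pip3 install --no-cache-dir torch

# Copy the training application into the image.
COPY trainer/ /trainer/

# Run the trainer when the container starts; the job's command-line arguments are appended.
ENTRYPOINT ["python3", "/trainer/task.py"]
```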
See an example Dockerfile for training with GPUs.
What's next
- Learn how to use custom containers for your training jobs.
- Learn about distributed training with custom containers.