With built-in algorithms on AI Platform Training, you can submit your training data, select an algorithm, and let AI Platform Training handle the preprocessing and training for you, without writing any code for a training application. Built-in image algorithms allow you to train on TPUs with minimal configuration. The resulting TensorFlow SavedModel can then be served on CPUs and GPUs.
Overview
In this tutorial, you train an image classification model without writing any code. You submit the Flowers dataset to AI Platform Training for training, and then you deploy the model on AI Platform Prediction to get predictions. The resulting model classifies flower images based on species (daisy, tulip, rose, sunflower, or dandelion).
Before you begin
To complete this tutorial on the command line, use either Cloud Shell or any environment where the Google Cloud CLI is installed.
Complete the following steps to set up a GCP account, enable the required APIs, and install and activate the Google Cloud CLI:
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the AI Platform Training & Prediction and Compute Engine APIs.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:

  gcloud init
Authorize your Cloud TPU to access your project
Follow these steps to authorize the Cloud TPU service account name associated with your Google Cloud project:
Get your Cloud TPU service account name by calling projects.getConfig. Example:

PROJECT_ID=PROJECT_ID
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    https://ml.googleapis.com/v1/projects/$PROJECT_ID:getConfig

Save the values of the serviceAccountProject and tpuServiceAccount fields returned by the API.

Initialize the Cloud TPU service account:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" -d '{}' \
    https://serviceusage.googleapis.com/v1beta1/projects/<serviceAccountProject>/services/tpu.googleapis.com:generateServiceIdentity
Now add the Cloud TPU service account as a member in your project, with the role Cloud ML Service Agent. Complete the following steps in the Google Cloud console or using the gcloud command:
Console
- Log in to the Google Cloud console and choose the project in which you're using the TPU.
- Choose IAM & Admin > IAM.
- Click the Add button to add a member to the project.
- Enter the TPU service account in the Members text box.
- Click the Roles dropdown list.
- Enable the Cloud ML Service Agent role (Service Agents > Cloud ML Service Agent).
gcloud
Set environment variables containing your project ID and the Cloud TPU service account:
PROJECT_ID=PROJECT_ID
SVC_ACCOUNT=your-tpu-sa-123@your-tpu-sa.google.com.iam.gserviceaccount.com

Grant the ml.serviceAgent role to the Cloud TPU service account:

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
For more details about granting roles to service accounts, see the IAM documentation.
Setup
We have modified the TensorFlow Flowers dataset for use with this tutorial and hosted it in a public Cloud Storage bucket: gs://cloud-samples-data/ai-platform/built-in/image/flowers/
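If you want to inspect the training and validation shards before submitting a job, you can list the bucket's contents with the gcloud CLI (assuming you have already installed and authenticated it):

# List the TFRecord files in the public sample bucket.
gcloud storage ls "gs://cloud-samples-data/ai-platform/built-in/image/flowers/"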
Console
Select your algorithm
Go to the AI Platform Training Jobs page in the Google Cloud console:
Click the New training job button. From the options that display below, click Built-in algorithm training. The Create a new training job page displays.
The training job creation is divided into four steps. The first step is Training algorithm. Select image classification and click Next.
Training data
In the Training data section, select the training data for the sample dataset, hosted in our public Cloud Storage bucket:
Select Use multiple files stored in one Cloud Storage directory.
For Directory path, fill in: "cloud-samples-data/ai-platform/built-in/image/flowers/"
For Wildcard name, fill in "flowers_train*" to select all the training files in the directory.
The Complete GCS path displays below: "gs://cloud-samples-data/ai-platform/built-in/image/flowers/flowers_train*"
In the Validation data section, select the validation data for the sample dataset, hosted in our public Cloud Storage bucket:
Select Use multiple files stored in one Cloud Storage directory.
For Directory path, fill in: "cloud-samples-data/ai-platform/built-in/image/flowers/"
For Wildcard name, fill in "flowers_validation*" to select all the validation files in the directory.
The Complete GCS path displays below: "gs://cloud-samples-data/ai-platform/built-in/image/flowers/flowers_validation*"
Specify the Output directory in your Cloud Storage bucket where you want AI Platform Training to store your trained model, checkpoints, and other training job output. You can fill in the exact path in your bucket, or use the Browse button to select the path.
gcloud
Set up environment variables for your project ID, your Cloud Storage bucket, the Cloud Storage path to the training data, and your algorithm selection.
AI Platform Training built-in algorithms are in Docker containers hosted in Container Registry.
PROJECT_ID="YOUR_PROJECT_ID"
BUCKET_NAME="YOUR_BUCKET_NAME"
REGION="us-central1"
gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION
# Set paths to the training and validation data.
TRAINING_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/flowers/flowers_train*"
VALIDATION_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/flowers/flowers_validation*"
# Specify the Docker container for your built-in algorithm selection.
IMAGE_URI="gcr.io/cloud-ml-algos/image_classification:latest"
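If the Cloud Storage bucket referenced by BUCKET_NAME doesn't exist yet, one way to create it is shown below. This is a suggested step, not part of the original walkthrough; any regional bucket you control and can write to works:

# Create a regional bucket to hold the training job output.
gcloud storage buckets create "gs://${BUCKET_NAME}" --location=$REGION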
Submit a training job
To submit a job, you must specify some general training arguments and some arguments specific to the image classification algorithm.
General arguments for the training job:
| Argument | Description |
|---|---|
| job-id | Unique ID for your training job. You can use this to find logs for the status of your training job after you submit it. |
| job-dir | Cloud Storage path where AI Platform Training saves training files after completing a successful training job. |
| scale-tier | Specifies machine types for training. Use BASIC to select a configuration of just one machine. |
| master-image-uri | Container Registry URI used to specify which Docker container to use for the training job. Use the container for the built-in image classification algorithm defined earlier as IMAGE_URI. |
| region | Specify the available region in which to run your training job. For this tutorial, you can use the region us-central1. |
Arguments specific to the built-in image classification algorithm:
| Argument | Description |
|---|---|
| training_data_path | TFRecord path pattern used for training. |
| validation_data_path | TFRecord path pattern used for validation. |
| pretrained_checkpoint_path | The path to pretrained checkpoints. You may use some published checkpoints. |
| num_classes | The number of classes in the training/validation data. |
| max_steps | The number of steps that the training job will run. |
| train_batch_size | The number of images to use per training step. |
| num_eval_images | The total number of images used for evaluation. If it is 0, all the images in validation_data_path are used for evaluation. |
| learning_rate_decay_type | The method by which the learning rate decays during training. |
| warmup_learning_rate | The learning rate at the start of the warm-up phase. |
| warmup_steps | The number of steps to run during the warm-up phase, or the length of the warm-up phase in steps. The training job uses warmup_learning_rate during the warm-up phase. When the warm-up phase is over, the training job uses initial_learning_rate. |
| initial_learning_rate | The initial learning rate after the warm-up phase is complete. |
| stepwise_learning_rate_steps | The steps at which the learning rate decays/changes for the stepwise learning rate decay type. For example, 100,200 means the learning rate changes (with respect to stepwise_learning_rate_levels) at step 100 and step 200. Note that it is respected only when learning_rate_decay_type is set to stepwise. |
| stepwise_learning_rate_levels | The learning rate value for each step of the stepwise learning rate decay type. Note that it is respected only when learning_rate_decay_type is set to stepwise. |
| image_size | The image size (width and height) used for training. |
| optimizer_type | The optimizer used for training. It should be one of: momentum, adam, rmsprop. |
| optimizer_arguments | The arguments for the optimizer, given as a comma-separated list of "name=value" pairs. They must be compatible with optimizer_type. |
| model_type | The model architecture type used to train models (for example, efficientnet-b4, as used in this tutorial). |
| label_smoothing | Label smoothing parameter used in softmax_cross_entropy. |
| weight_decay | Weight decay coefficient for L2 regularization: loss = cross_entropy + params['weight_decay'] * l2_loss |
For a detailed list of all other image classification algorithm flags, refer to the built-in image classification reference.
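For instance, if you train with the momentum optimizer, its momentum coefficient could plausibly be passed in the name=value format like this (the parameter name shown is an assumption; consult the reference above for the exact names that optimizer_arguments accepts):

--optimizer_type='momentum' \
--optimizer_arguments='momentum=0.9'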
Console
Algorithm arguments
In the first part of the Algorithm arguments tab, fill in the following values:
- Number of classes: 5
- Max steps: 15000
- Train batch size: 128
- Number of evaluation images: 1
In the Model Section of the Algorithm arguments tab:
- For Model type, select Efficientnet-b4.
- Leave Pretrained checkpoint path blank.
- Leave Label smoothing and Weight Decay at their default values.
Job settings
On the Job settings tab:
- Enter a unique Job ID (such as "image_classification_example").
- Enter an available region (such as "us-central1").
- To select machine types, select "CUSTOM" for the scale tier. A section to provide your Custom cluster specification displays.
- For the Master type, select complex_model_m.
- For the Worker type, select cloud_tpu. The worker count defaults to 1.
Click Done to submit the training job.
gcloud
Set up all the arguments for the training job and the algorithm before using gcloud to submit the job:

DATASET_NAME="flowers"
ALGORITHM="image_classification"
MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_model"

# Give a unique name to your training job.
DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="${MODEL_NAME}_${DATE}"

# Make sure you have access to this Cloud Storage bucket.
JOB_DIR="gs://${BUCKET_NAME}/algorithms_training/${MODEL_NAME}/${DATE}"
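The submission command below passes --config=config.yaml, but this page doesn't show that file's contents. A minimal sketch that mirrors the Console machine settings above (CUSTOM scale tier, a complex_model_m master, and one cloud_tpu worker) might look like the following; treat it as a starting point rather than the canonical configuration:

# config.yaml: custom cluster specification for the training job.
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: cloud_tpu
  workerCount: 1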
Submit the job:
gcloud ai-platform jobs submit training $JOB_ID \
    --region=$REGION \
    --config=config.yaml \
    --master-image-uri=$IMAGE_URI \
    -- \
    --training_data_path=$TRAINING_DATA_PATH \
    --validation_data_path=$VALIDATION_DATA_PATH \
    --job-dir=$JOB_DIR \
    --max_steps=30000 \
    --train_batch_size=128 \
    --num_classes=5 \
    --num_eval_images=100 \
    --initial_learning_rate=0.128 \
    --warmup_steps=1000 \
    --model_type='efficientnet-b4'
After the job is submitted successfully, you can view the logs using the following gcloud commands:

gcloud ai-platform jobs describe $JOB_ID
gcloud ai-platform jobs stream-logs $JOB_ID
Understand your job directory
After the successful completion of a training job, AI Platform Training creates a trained model in your Cloud Storage bucket, along with some other artifacts. You can find the following directory structure within your JOB_DIR:
- model/ (a TensorFlow SavedModel directory that also contains a deployment_config.yaml file)
  - saved_model.pb
  - deployment_config.yaml
  - eval/
    - events.out.tfevents.[timestamp].cmle-training-[timestamp]
    - events.out.tfevents...
    - ...
  - variables/
    - variables.data-00000-of-00001
    - variables.index
The job directory also contains various model checkpoint files.
Confirm that the directory structure in your JOB_DIR matches:
gcloud storage ls $JOB_DIR/* --all-versions
Deploy the trained model
AI Platform Prediction organizes your trained models using model and version resources. An AI Platform Prediction model is a container for the versions of your machine learning model.
To deploy a model, you create a model resource in AI Platform Prediction, create a version of that model, then use the model and version to request online predictions.
For more information on how to deploy models to AI Platform Prediction, see how to deploy a TensorFlow model.
Console
On the Jobs page, you can find a list of all your training jobs. Click the name of the training job you just submitted ("image_classification_example" or the job name you used).
On the Job details page, you can view the general progress of your job, or click View logs for a more detailed view of its progress.
When the job is successful, the Deploy model button appears at the top. Click Deploy model.
Select Deploy as new model, and enter a model name, such as "algorithms_image_classification_model". Next, click Confirm.
On the Create version page, enter a version name, such as "v1", and leave all other fields at their default settings. Click Save.
gcloud
The training process with the built-in image classification algorithm produces a file, deployment_config.yaml, that makes it easier to deploy your model on AI Platform Prediction.
Copy the file to your local directory and view its contents:
gcloud storage cp $JOB_DIR/model/deployment_config.yaml .
cat deployment_config.yaml
Your deployment_config.yaml file should appear similar to the following:

deploymentUri: gs://BUCKET_NAME/algorithms_training/flowers_image_classification/model
framework: TENSORFLOW
labels:
  global_step: '1000'
  job_id: flowers_image_classification_20190227060114
runtimeVersion: '1.14'
Create the model and version in AI Platform Prediction:

gcloud ai-platform models create $MODEL_NAME --regions $REGION

# Create a version of the model using the deployment_config.yaml file.
VERSION_NAME="v_${DATE}"
gcloud ai-platform versions create $VERSION_NAME \
    --model $MODEL_NAME \
    --config deployment_config.yaml
The version takes a few minutes to create.
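To check whether the new version has finished deploying, you can describe it with the standard gcloud command shown below; when the state field reads READY, the version can serve predictions:

gcloud ai-platform versions describe $VERSION_NAME --model $MODEL_NAME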
Get online predictions
When requesting predictions, you need to make sure that your input data is formatted as JSON.
Download the training artifact files:
gcloud storage cp $JOB_DIR/artifacts/* .
Prepare the prediction input for one image.
To send an online prediction request using the Google Cloud CLI, as in this example, write each instance to a row in a newline-delimited JSON file.
Run the following commands in your terminal to create input for a single instance that you can send to AI Platform Prediction:
The following Python script encodes a single image using base64, formats it for prediction, adds an instance key, and writes the result to a file named prediction_instances.json:

import json
import base64
import tensorflow as tf

IMAGE_URI = 'gs://cloud-samples-data/ai-platform/built-in/image/tutorial_examples/daisy.jpg'

# tf.gfile is the TensorFlow 1.x file API; in TensorFlow 2.x, use tf.io.gfile.GFile instead.
with tf.gfile.Open(IMAGE_URI, 'rb') as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode('utf-8')

image_bytes = {'b64': str(encoded_string)}
instances = {'image_bytes': image_bytes, 'key': '1'}

with open("prediction_instances.json", "w") as f:
    f.write(json.dumps(instances))
Send the prediction request:
gcloud ai-platform predict --model $MODEL_NAME \
    --version $VERSION_NAME \
    --json-instances prediction_instances.json
Most likely, the prediction output includes the class daisy, indicating that the deployed model has classified the input image as a daisy. (Since training is non-deterministic, your model may differ.)
About the data
The Flowers dataset that this sample uses for training is provided by the TensorFlow Team.
What's next
- Learn more about using the built-in image classification algorithm.