Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in BERT algorithm works, and how to use it.
Overview
This built-in algorithm can do both training and model exporting:
- Training: Using the dataset and the model parameters you supplied, AI Platform Training runs training using TensorFlow's BERT implementation.
- Exporting: Using the initial checkpoint you supplied, AI Platform Training produces a serialized model in the job directory you specify. You can then deploy this model to AI Platform.
Limitations
The following features are not supported for training with the built-in BERT algorithm:
- Automated data preprocessing: This version of BERT requires input data to be in the form of TFRecords for both training and output. To handle unformatted input automatically, you must build your own training application.
Supported machine types
The following AI Platform Training scale tiers and machine types are supported:
- BASIC scale tier
- BASIC_TPU scale tier
- CUSTOM scale tier with any of the Compute Engine machine types supported by AI Platform Training.
- CUSTOM scale tier with any of the following legacy machine types:
  - standard
  - large_model
  - complex_model_s
  - complex_model_m
  - complex_model_l
  - standard_gpu
  - standard_p100
  - standard_v100
  - large_model_v100
  - complex_model_m_gpu
  - complex_model_l_gpu
  - complex_model_m_p100
  - complex_model_m_v100
  - complex_model_l_v100
  - TPU_V2 (8 cores)
We recommend using a machine type with access to TPUs.
Format input data
Ensure that input and evaluation data are in the form of TFRecords before training the model.
Check Cloud Storage bucket permissions
To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.
Submit a BERT training job
This section explains how to submit a training job using the built-in BERT algorithm.
You can find brief explanations of each hyperparameter within the Google Cloud console, and a more comprehensive explanation in the reference for the built-in BERT algorithm.
Console
Go to the AI Platform Training Jobs page in the Google Cloud console.
Click the New training job button. From the options that display below, click Built-in algorithm training.
On the Create a new training job page, select BERT and click Next.
To learn more about the available parameters, follow the links in the Google Cloud console or refer to the built-in BERT reference.
gcloud
Set environment variables for your job, replacing BUCKET_NAME and MODEL_NAME with your own values:

    # Specify the name of the Cloud Storage bucket where you want your
    # training outputs to be stored, and the Docker container for
    # your built-in algorithm selection.
    BUCKET_NAME='BUCKET_NAME'
    IMAGE_URI='gcr.io/cloud-ml-algos/bert:latest'

    # Build a unique job ID and output directory from the model name and a timestamp.
    DATE="$(date '+%Y%m%d_%H%M%S')"
    MODEL_NAME='MODEL_NAME'
    JOB_ID="${MODEL_NAME}_${DATE}"
    JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${MODEL_NAME}/${DATE}"

    # Pretrained checkpoint, model output location, and example GLUE dataset.
    BERT_BASE_DIR='gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16'
    # Double quotes so that the bucket name is expanded.
    MODEL_DIR="gs://${BUCKET_NAME}/bert-output"
    GLUE_DIR='gs://cloud-tpu-checkpoints/bert/classification'
    TASK='mnli'
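Because the job ID is assembled from MODEL_NAME and a timestamp, a quick local check can catch an unusable name before you submit. The following is a minimal sketch, assuming the AI Platform Training rule that job IDs start with a letter and contain only letters, numbers, and underscores. The helper name `is_valid_job_id` and the value `bert_mnli` are hypothetical and not part of gcloud.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the job ID starts with a letter and
# contains only letters, digits, and underscores (assumed naming rule).
is_valid_job_id() {
  case "$1" in
    '' | [!A-Za-z]* | *[!A-Za-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}

MODEL_NAME='bert_mnli'   # example value only; replace with your own
DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="${MODEL_NAME}_${DATE}"

if is_valid_job_id "$JOB_ID"; then
  echo "Job ID looks valid: ${JOB_ID}"
else
  echo "Invalid job ID: ${JOB_ID}" >&2
fi
```

A model name that contains a hyphen, such as `bert-mnli`, would fail this check, which is why the example uses underscores.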
Submit the training job using gcloud ai-platform jobs submit training. Adjust this generic example to work with your dataset:

    gcloud ai-platform jobs submit training $JOB_ID \
      --master-image-uri=$IMAGE_URI \
      --scale-tier=BASIC_TPU \
      --job-dir=$JOB_DIR \
      -- \
      --mode='train_and_eval' \
      --input_meta_data_path=${GLUE_DIR}/${TASK}_meta_data \
      --train_data_path=${GLUE_DIR}/${TASK}_train.tf_record \
      --eval_data_path=${GLUE_DIR}/${TASK}_eval.tf_record \
      --bert_config_file=${BERT_BASE_DIR}/bert_config.json \
      --init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
      --train_batch_size=32 \
      --eval_batch_size=32 \
      --learning_rate=2e-5 \
      --num_train_epochs=1 \
      --steps_per_loop=1000

The flags before the bare -- configure the AI Platform Training job itself. The flags after it are passed to the built-in BERT algorithm.
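When you adapt the example to your own dataset, it can help to preview the data-path flags that result from a given directory and task name. The sketch below only builds and prints strings, following the file naming used in the example (`TASK_meta_data`, `TASK_train.tf_record`, `TASK_eval.tf_record`). The helper name `bert_data_flags` is hypothetical, and the sketch assumes your files follow the same naming scheme.

```shell
#!/bin/sh
# Hypothetical helper (not part of gcloud): prints the three data-path flags,
# one per line, for a dataset directory and a task name.
bert_data_flags() {
  data_dir="$1"
  task="$2"
  printf '%s\n' \
    "--input_meta_data_path=${data_dir}/${task}_meta_data" \
    "--train_data_path=${data_dir}/${task}_train.tf_record" \
    "--eval_data_path=${data_dir}/${task}_eval.tf_record"
}

# Preview the flags for the MNLI example before submitting the job.
bert_data_flags 'gs://cloud-tpu-checkpoints/bert/classification' 'mnli'
```

If your files use different names, pass the paths to the submit command directly instead of deriving them this way.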
Monitor the status of your training job by viewing logs with gcloud. Refer to gcloud ai-platform jobs describe and gcloud ai-platform jobs stream-logs.

    gcloud ai-platform jobs describe ${JOB_ID}
    gcloud ai-platform jobs stream-logs ${JOB_ID}
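If you want a script to wait for the job to finish rather than watching the logs, one option is to poll the job state. The following is a rough sketch, assuming an authenticated gcloud installation and assuming that SUCCEEDED, FAILED, and CANCELLED are the states in which a job has stopped. The function names `is_terminal_state` and `wait_for_job` are hypothetical.

```shell
#!/bin/sh
# Returns 0 if the given state is one in which the job has stopped
# (assumed terminal states: SUCCEEDED, FAILED, CANCELLED).
is_terminal_state() {
  case "$1" in
    SUCCEEDED | FAILED | CANCELLED) return 0 ;;
    *) return 1 ;;
  esac
}

# Polls the job state until it is terminal.
# Returns 0 on SUCCEEDED, 1 on FAILED or CANCELLED, 2 if gcloud itself fails.
wait_for_job() {
  job_id="$1"
  interval="${2:-60}"   # seconds between polls
  while :; do
    state="$(gcloud ai-platform jobs describe "$job_id" --format='value(state)')" || return 2
    echo "Job ${job_id} state: ${state}"
    if is_terminal_state "$state"; then
      [ "$state" = "SUCCEEDED" ] && return 0
      return 1
    fi
    sleep "$interval"
  done
}

# Usage (requires gcloud, so it is not run here):
#   wait_for_job "${JOB_ID}" 60
```

Returning a nonzero status for failed or cancelled jobs lets a calling script stop before it tries to deploy a model that was never exported.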
Further learning resources
- Learn more about Cloud TPU.
- Learn more about TensorFlow Model Garden.