Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in NCF algorithm works, and how to use it.
Overview
This built-in algorithm does training only:
- Training: Using the dataset and the model parameters you supply, AI Platform Training runs training using TensorFlow's NCF implementation.
Limitations
The following features are not supported for training with the built-in NCF algorithm:
- Automated data preprocessing: This version of NCF requires input data to be in the form of TFRecords for both training and evaluation. A training application must be built to handle unformatted input automatically.
Supported machine types
The following AI Platform Training scale tiers and machine types are supported:
- BASIC scale tier
- BASIC_GPU scale tier
- BASIC_TPU scale tier
- CUSTOM scale tier with any of the Compute Engine machine types supported by AI Platform Training
- CUSTOM scale tier with any of the following legacy machine types:
  - standard
  - large_model
  - complex_model_s
  - complex_model_m
  - complex_model_l
  - standard_gpu
  - standard_p100
  - standard_v100
  - large_model_v100
  - complex_model_m_gpu
  - complex_model_l_gpu
  - complex_model_m_p100
  - complex_model_m_v100
  - complex_model_l_v100
  - TPU_V2 (8 cores)
We recommend using a machine type with access to TPUs or GPUs.
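For example, in the gcloud submission command shown later on this page, you can swap the predefined BASIC_TPU tier for the CUSTOM tier plus an explicit machine type. The machine type below is only an illustrative choice from the legacy list above, not a requirement:

# Instead of --scale-tier=BASIC_TPU, request a single P100 GPU worker.
# standard_p100 is just one of the legacy machine types listed above;
# pick whichever accelerator fits your quota and budget.
--scale-tier=CUSTOM --master-machine-type=standard_p100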
Format input data
Ensure that input and evaluation data are in the form of TFRecords before training the model.
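If your data is not already in TFRecord format, you must convert it before uploading it to Cloud Storage. The sketch below shows the general TensorFlow pattern for writing TFRecord files; the feature names and schema here are placeholders, not the schema the built-in NCF algorithm expects, so consult the built-in NCF reference for the required format.

import tensorflow as tf

# Hypothetical sketch: "user_id", "item_id", and "rating" are placeholder
# feature names, not the schema required by the built-in NCF algorithm.
def to_example(user_id, item_id, rating):
    return tf.train.Example(features=tf.train.Features(feature={
        "user_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[user_id])),
        "item_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[item_id])),
        "rating": tf.train.Feature(float_list=tf.train.FloatList(value=[rating])),
    }))

# Write a small, illustrative set of (user, item, rating) interactions.
interactions = [(0, 42, 5.0), (1, 7, 3.0), (1, 42, 4.0)]
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for user_id, item_id, rating in interactions:
        writer.write(to_example(user_id, item_id, rating).SerializeToString())

You would then copy the resulting TFRecord files to the Cloud Storage paths that you pass as --train_dataset_path and --eval_dataset_path when you submit the job.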
Check Cloud Storage bucket permissions
To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.
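For example, assuming the default AI Platform service agent for your project (check the IAM page of your project for the exact account; the address below follows a common naming pattern and may differ), you could grant it object access with gsutil:

# Grant the AI Platform service account read/write access to the bucket.
# Replace PROJECT_NUMBER and BUCKET_NAME; the service account shown is an
# assumption about the default service agent naming.
gsutil iam ch \
  serviceAccount:service-PROJECT_NUMBER@cloud-ml.google.com.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://BUCKET_NAME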
Submit an NCF training job
This section explains how to submit a training job using the built-in NCF algorithm.
You can find brief explanations of each hyperparameter within the Google Cloud console, and a more comprehensive explanation in the reference for the built-in NCF algorithm.
Console
Go to the AI Platform Training Jobs page in the Google Cloud console:
Click the New training job button. From the options that display below, click Built-in algorithm training.
On the Create a new training job page, select NCF and click Next.
To learn about all the available parameters, follow the links in the Google Cloud console and refer to the built-in NCF reference.
gcloud
Set environment variables for your job:
# Specify the name of the Cloud Storage bucket where you want your
# training outputs to be stored, and the Docker container for
# your built-in algorithm selection.
BUCKET_NAME='BUCKET_NAME'
IMAGE_URI='gcr.io/cloud-ml-algos/ncf:latest'

# Specify the Cloud Storage path to your training input data.
DATA_DIR="gs://${BUCKET_NAME}/ncf_data"

DATE="$(date '+%Y%m%d_%H%M%S')"
MODEL_NAME='MODEL_NAME'
JOB_ID="${MODEL_NAME}_${DATE}"

JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${MODEL_NAME}/${DATE}"
Replace the following:
- BUCKET_NAME: The name of the Cloud Storage bucket where you want training outputs to be stored.
- MODEL_NAME: A name for your model, to identify where model artifacts get stored in your Cloud Storage bucket.
Submit the training job using gcloud ai-platform jobs submit training. Adjust this generic example to work with your dataset:

gcloud ai-platform jobs submit training $JOB_ID \
  --master-image-uri=$IMAGE_URI --scale-tier=BASIC_TPU --job-dir=$JOB_DIR \
  -- \
  --train_dataset_path=${DATA_DIR}/training_cycle_*/* \
  --eval_dataset_path=${DATA_DIR}/eval_data/* \
  --input_meta_data_path=${DATA_DIR}/metadata \
  --train_epochs=3 \
  --eval_batch_size=160000 \
  --learning_rate=0.00382059 \
  --beta1=0.783529 \
  --beta2=0.909003 \
  --epsilon=1.45439e-07 \
  --num_factors=64 \
  --hr_threshold=0.635 \
  --keras_use_ctl=true \
  --layers=256,256,128,64
Monitor the status of your training job by viewing logs with gcloud. Refer to gcloud ai-platform jobs describe and gcloud ai-platform jobs stream-logs.

gcloud ai-platform jobs describe ${JOB_ID}
gcloud ai-platform jobs stream-logs ${JOB_ID}
Further learning resources
- Learn more about Cloud TPU.
- Learn more about TensorFlow Model Garden.