Getting started with the built-in NCF algorithm

This tutorial walks you through training the Neural Collaborative Filtering (NCF) model on the MovieLens dataset. It covers preprocessing the data, training using the built-in NCF algorithm, deploying the model to AI Platform, and requesting a prediction from the deployed model.

Dataset

The tutorial uses the following MovieLens datasets for model training and evaluation:

  • ml-1m (short for MovieLens 1 million)
  • ml-20m (short for MovieLens 20 million)

ml-1m

The ml-1m dataset contains 1,000,209 anonymous ratings of approximately 3,706 movies made by 6,040 users who joined MovieLens in 2000. All ratings are contained in the file "ratings.dat" without a header row, in the following format:

UserID::MovieID::Rating::Timestamp

  • UserIDs range between 1 and 6040.
  • MovieIDs range between 1 and 3952.
  • Ratings are made on a 5-star scale (whole-star ratings only).
  • The timestamp is represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
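
For example, a single line of "ratings.dat" looks similar to the following (an illustrative row recording a 5-star rating of movie 1193 by user 1):

    1::1193::5::978300760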

ml-20m

The ml-20m dataset contains 20,000,263 ratings of 26,744 movies by 138,493 users. All ratings are contained in the file "ratings.csv". Each line of this file after the header row represents one rating of one movie by one user, in the following format:

userId,movieId,rating,timestamp

The lines within this file are ordered first by userId. Rows with the same userId are ordered by movieId. Ratings are made on a 5-star scale, with half-star increments (0.5 stars to 5.0 stars). The timestamp is represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970. Each user has at least 20 ratings.
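
For example, the first few lines of "ratings.csv" look similar to the following (illustrative values):

    userId,movieId,rating,timestamp
    1,2,3.5,1112486027
    1,29,3.5,1112484676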

Objectives

  • Prepare the MovieLens dataset
  • Run training and evaluation
  • Deploy the trained model
  • Get online predictions from the deployed model

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the AI Platform Training & Prediction API.

    Enable the API

  5. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
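
    If you prefer to work from the command line, the following sketch shows one way to set the active project, enable the API, and create the Cloud Storage bucket used later in this tutorial. PROJECT_ID and BUCKET_NAME are placeholders for your own values:

      gcloud config set project PROJECT_ID
      gcloud services enable ml.googleapis.com
      gcloud storage buckets create gs://BUCKET_NAME --location=us-central1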

Prepare the data

  1. In Cloud Shell, create and activate a Python virtual environment:

    (vm)$  virtualenv ncf-env
    (vm)$  source ncf-env/bin/activate
  2. Install the TensorFlow Model Garden code:

    (vm)$  pip install tf-models-official==2.3.0
  3. Add environment variables for the URI to a Cloud Storage bucket in your Google Cloud project and a directory to store data within this bucket. Replace BUCKET_NAME with your bucket name.

    (vm)$ export STORAGE_BUCKET=gs://BUCKET_NAME
    (vm)$ export DATA_DIR=${STORAGE_BUCKET}/ncf_data
  4. Generate training and evaluation data for the ml-20m dataset in DATA_DIR:

    (vm)$ python -m official.recommendation.create_ncf_data \
        --dataset ml-20m \
        --num_train_epochs 4 \
        --meta_data_file_path ${DATA_DIR}/metadata \
        --eval_prebatch_size 160000 \
        --data_dir ${DATA_DIR}

This script generates and preprocesses the dataset in Cloud Shell. Preprocessing converts the data into the TFRecord format that the model requires. The download and preprocessing take approximately 25 minutes and generate output similar to the following:

I0804 23:03:02.370002 139664166737728 movielens.py:124] Successfully downloaded /tmp/tmpicajrlfc/ml-20m.zip 198702078 bytes
I0804 23:04:42.665195 139664166737728 data_preprocessing.py:223] Beginning data preprocessing.
I0804 23:04:59.084554 139664166737728 data_preprocessing.py:84] Generating user_map and item_map...
I0804 23:05:20.934210 139664166737728 data_preprocessing.py:103] Sorting by user, timestamp...
I0804 23:06:39.859857 139664166737728 data_preprocessing.py:194] Writing raw data cache.
I0804 23:06:42.375952 139664166737728 data_preprocessing.py:262] Data preprocessing complete. Time: 119.7 sec.
<BisectionDataConstructor(Thread-1, initial daemon)>
General:
  Num users: 138493
  Num items: 26744

Training:
  Positive count:          19861770
  Batch size:              99000
  Batch count per epoch:   1004

Eval:
  Positive count:          138493
  Batch size:              160000
  Batch count per epoch:   866

I0804 23:07:14.137242 139664166737728 data_pipeline.py:887] Negative total vector built. Time: 31.8 seconds
I0804 23:11:25.013135 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 250.9 seconds
I0804 23:15:46.391308 139664166737728 data_pipeline.py:674] Eval construction complete. Time: 261.4 seconds
I0804 23:19:54.345858 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 248.0 seconds
I0804 23:24:09.182484 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 254.8 seconds
I0804 23:28:26.224653 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 257.0 seconds
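
After the script finishes, you can confirm that the training and evaluation files were written to your bucket:

    (vm)$ gcloud storage ls ${DATA_DIR}

The listing should include the metadata file, one or more training_cycle_* directories, and an eval_data directory.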

Submit a training job

To submit a job, you must specify some general training arguments and some arguments specific to the NCF algorithm.

General arguments for the training job:

  • job-id: Unique ID for your training job. You can use this to find logs for the status of your training job after you submit it.
  • job-dir: Cloud Storage path where AI Platform Training saves training files after completing a successful training job.
  • scale-tier: Specifies machine types for training. Use BASIC to select a configuration of just one machine.
  • master-image-uri: Container Registry URI used to specify which Docker container to use for the training job. Use the container for the built-in NCF algorithm, referenced as IMAGE_URI in the example below.
  • region: The available region in which to run your training job. For this tutorial, you can use the region us-central1.

Arguments specific to training the built-in NCF algorithm on MovieLens, with the value to use for this tutorial shown for each:

  • train_dataset_path: Cloud Storage path where the training data is stored. Use ${DATA_DIR}/training_cycle_*/*.
  • eval_dataset_path: Cloud Storage path where the evaluation data is stored. Use ${DATA_DIR}/eval_data/*.
  • input_meta_data_path: Cloud Storage path where the input schema is stored. Use ${DATA_DIR}/metadata.
  • train_epochs: Number of training epochs to run. Use 3.
  • batch_size: Batch size for training. Use 99000.
  • eval_batch_size: Batch size for evaluation. Use 160000.
  • learning_rate: Learning rate used by the Adam optimizer. Use 0.00382059.
  • beta1: Beta 1 hyperparameter for the Adam optimizer. Use 0.783529.
  • beta2: Beta 2 hyperparameter for the Adam optimizer. Use 0.909003.
  • epsilon: Epsilon hyperparameter for the Adam optimizer. Use 1.45439e-07.
  • num_factors: Embedding size of the MF model. Use 64.
  • hr_threshold: Value of the HR (hit rate) evaluation metric at which training should stop. Use 0.635.
  • layers: Sizes of the hidden layers for the MLP, formatted as comma-separated integers. Use 256,256,128,64.
  • keras_use_ctl: Whether to use a custom Keras training loop in model training. Use True.

For a detailed list of all other NCF algorithm flags, refer to the built-in NCF reference.
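
If you prefer to submit the job from Cloud Shell instead of the console, the following sketch shows how these arguments fit together in a single command. The IMAGE_URI value here is a placeholder; look up the exact container URI for the built-in NCF algorithm in the built-in NCF reference:

    (vm)$ export IMAGE_URI=gcr.io/cloud-ml-algos/ncf:latest  # placeholder; confirm in the NCF reference
    (vm)$ export JOB_ID=ncf_movielens_$(date +%Y%m%d_%H%M%S)
    (vm)$ gcloud ai-platform jobs submit training ${JOB_ID} \
        --region=us-central1 \
        --scale-tier=BASIC_GPU \
        --master-image-uri=${IMAGE_URI} \
        --job-dir=${STORAGE_BUCKET}/${JOB_ID} \
        -- \
        --train_dataset_path="${DATA_DIR}/training_cycle_*/*" \
        --eval_dataset_path="${DATA_DIR}/eval_data/*" \
        --input_meta_data_path=${DATA_DIR}/metadata \
        --train_epochs=3 \
        --batch_size=99000 \
        --eval_batch_size=160000 \
        --learning_rate=0.00382059 \
        --beta1=0.783529 \
        --beta2=0.909003 \
        --epsilon=1.45439e-07 \
        --num_factors=64 \
        --hr_threshold=0.635 \
        --layers=256,256,128,64 \
        --keras_use_ctl=True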

Run the training job

  1. In the Google Cloud console, go to the AI Platform page:

    Go to AI Platform

  2. In the Model training section, select Train with a built-in algorithm.

  3. In the drop-down list, select NCF. Click Next.

  4. Use the Browse button to select the training and evaluation datasets in your Cloud Storage bucket and choose the output directory. Click Next.

  5. On the Algorithm arguments page, use the argument values from the table in the preceding section to configure the training job.

  6. Give your training job a name and use the BASIC_TPU or BASIC_GPU machine type.

  7. Click Submit to start your job.
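
You can also monitor the job from Cloud Shell. For example, using the job ID you chose at submission time:

    (vm)$ gcloud ai-platform jobs describe ${JOB_ID}
    (vm)$ gcloud ai-platform jobs stream-logs ${JOB_ID}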

Understand your job directory

After the successful completion of a training job, AI Platform Training creates a trained model in your Cloud Storage bucket, along with some other artifacts. You can find the following directory structure within your JOB_DIR:

  • model/ (a TensorFlow SavedModel directory)
    • saved_model.pb
    • assets/
    • variables/
  • summaries/ (logging from training and evaluation)
    • eval/
    • train/
  • various checkpoint files (created and used during training)
    • checkpoint
    • ctl_checkpoint-1.data-00000-of-00002
    • ...
    • ctl_checkpoint-1.index

Confirm that the directory structure in your JOB_DIR matches the structure described in the preceding list:

gcloud storage ls -a $JOB_DIR/*
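
If you have TensorFlow installed, you can also inspect the exported model's serving signature, which is helpful when you format prediction requests later. This sketch assumes you copy the model to a local directory named model/:

    (vm)$ gcloud storage cp -r ${JOB_DIR}/model .
    (vm)$ saved_model_cli show --dir model --tag_set serve --signature_def serving_default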

Deploy the trained model

AI Platform Prediction organizes your trained models using model and version resources. An AI Platform Prediction model is a container for the versions of your machine learning model.

To deploy a model, you create a model resource in AI Platform Prediction, create a version of that model, then use the model and version to request online predictions.

Learn more about how to deploy models to AI Platform Prediction.
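
If you prefer the command line, a deployment equivalent to the console steps below looks similar to the following sketch. The model name ncf_model is an example, and the runtime version matches the TensorFlow 2.3 release used earlier in this tutorial:

    (vm)$ gcloud ai-platform models create ncf_model --regions=us-central1
    (vm)$ gcloud ai-platform versions create v1 \
        --model=ncf_model \
        --origin=${JOB_DIR}/model \
        --runtime-version=2.3 \
        --python-version=3.7 \
        --framework=tensorflow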

Console

  1. On the Jobs page, you can find a list of all your training jobs. Click the name of the training job you just submitted.

  2. On the Job details page, you can view the general progress of your job, or click View logs for a more detailed view of its progress.

  3. When the job is successful, the Deploy model button appears at the top. Click Deploy model.

  4. Select "Deploy as new model", and enter a model name. Next, click Confirm.

  5. On the Create version page, enter a version name, such as v1, and leave all other fields at their default settings. Click Save.

  6. On the Model details page, your version name displays. The version takes a few minutes to create. When the version is ready, a checkmark icon appears by the version name.

  7. Click the version name (v1) to navigate to the Version details page. In the next step of this tutorial, you send a prediction request.

Get online predictions

When you request predictions, you must format input data as JSON in a manner that the model expects. Current NCF models do not automatically preprocess inputs.

Console

  1. On the Version details page for "v1", the version you just created, you can send a sample prediction request.

    Select the Test & Use tab.

  2. Copy the following sample to the input field:

     {
       "instances": [{
         "duplicate_mask": [0],
         "item_id": [1],
         "train_labels": [true],
         "user_id": [1],
         "valid_point_mask": [false]
       }]
     }
    
  3. Click Test.

    Wait a moment, and a prediction vector should be returned.
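
You can send the same request from Cloud Shell. The following sketch writes a single instance to a file and sends it to the example model and version created earlier (ncf_model and v1):

    (vm)$ echo '{"duplicate_mask": [0], "item_id": [1], "train_labels": [true], "user_id": [1], "valid_point_mask": [false]}' > instances.json
    (vm)$ gcloud ai-platform predict \
        --model=ncf_model \
        --version=v1 \
        --json-instances=instances.json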

What's next