Training HuggingFace GPT2 on Cloud TPU (TF 2.x)


If you are not familiar with Cloud TPU, we recommend that you go through the quickstart to learn how to create a TPU VM.

This tutorial shows you how to train the HuggingFace GPT2 model on Cloud TPU.

Objectives

  • Create a Cloud TPU
  • Install dependencies
  • Run the training job

Costs

In this document, you use the following billable components of Google Cloud:

  • Compute Engine
  • Cloud TPU

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

Before starting this tutorial, check that your Google Cloud project is correctly set up.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. This walkthrough uses billable components of Google Cloud. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up resources you create when you've finished with them to avoid unnecessary charges.

Train HuggingFace GPT2 with Cloud TPUs

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Create an environment variable for your project ID.

    export PROJECT_ID=your-project-id
    
  3. Configure the Google Cloud CLI to use the Google Cloud project where you want to create a Cloud TPU.

    gcloud config set project ${PROJECT_ID}
    

    The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make Google Cloud API calls with your Google Cloud credentials.

  4. Create a Service Account for the Cloud TPU project.

    Service accounts allow the Cloud TPU service to access other Google Cloud services.

    $ gcloud beta services identity create --service tpu.googleapis.com --project $PROJECT_ID
    

    The command returns a Cloud TPU Service Account with the following format:

    service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com
    
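
    If your training job needs to read or write Cloud Storage, you can also grant this service account a storage role. The following is an optional, hedged sketch; it is not required for this tutorial, which writes only to the VM's local disk, and PROJECT_NUMBER is a placeholder for your project number:

    # Optional, hedged sketch: grant the TPU service account access to
    # Cloud Storage objects (PROJECT_NUMBER is a placeholder).
    $ gcloud projects add-iam-policy-binding $PROJECT_ID \
      --member="serviceAccount:service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com" \
      --role="roles/storage.objectAdmin"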

Create a Cloud TPU

  1. Create a Cloud TPU VM using the gcloud command. The following command creates a v4-8 TPU. You can also create a TPU Pod slice by setting the --accelerator-type flag to a Pod slice type, for example v4-32 (see the example after these steps).

    $ gcloud compute tpus tpu-vm create hf-gpt2 \
      --zone=us-central2-b \
      --accelerator-type=v4-8 \
      --version=tpu-vm-tf-2.18.0-pjrt
    

    Command flag descriptions

    zone
    The zone where you plan to create your Cloud TPU.
    accelerator-type
    The accelerator type specifies the version and size of the Cloud TPU you want to create. For more information about supported accelerator types for each TPU version, see TPU versions.
    version
    The Cloud TPU software version.
  2. Connect to the TPU VM using SSH. When you are connected to the VM, your shell prompt changes from username@projectname to username@vm-name:

    gcloud compute tpus tpu-vm ssh hf-gpt2 --zone=us-central2-b
    
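
For example, to create a v4-32 Pod slice instead of a single v4-8, you would change the --accelerator-type flag as in the sketch below. The TPU name hf-gpt2-pod is a placeholder, and Pod slices may require a different runtime version (historically a -pod variant), so check the supported TPU software versions before running:

  # Hedged sketch: hf-gpt2-pod is a placeholder name; verify the
  # --version string against the published TPU software versions.
  $ gcloud compute tpus tpu-vm create hf-gpt2-pod \
    --zone=us-central2-b \
    --accelerator-type=v4-32 \
    --version=tpu-vm-tf-2.18.0-pod-pjrt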

Install dependencies

  1. Clone the HuggingFace Transformers repository:

    (vm)$ cd /tmp
    (vm)$ git clone https://github.com/huggingface/transformers.git
    (vm)$ cd transformers
    
  2. Install dependencies:

    (vm)$ pip install .
    (vm)$ pip install -r examples/tensorflow/_tests_requirements.txt
    (vm)$ cd /tmp/transformers/examples/tensorflow/language-modeling
    (vm)$ pip install -r requirements.txt
    
  3. Create a temporary directory to store the training output:

    (vm)$ mkdir /tmp/gpt2-wikitext
    
  4. When creating your TPU, if you set the --version parameter to a version ending with -pjrt, set the following environment variables to enable the PJRT runtime:

    (vm)$ export NEXT_PLUGGABLE_DEVICE_USE_C_API=true
    (vm)$ export TF_PLUGGABLE_DEVICE_LIBRARY_PATH=/lib/libtpu.so
    
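
Optionally, you can confirm that TensorFlow can see the TPU before launching training. This check is not part of the original steps; it is a minimal sketch using the standard TPUClusterResolver APIs:

(vm)$ python3 - <<'EOF'
# Optional sanity check (not in the original tutorial): resolve the
# local TPU, initialize it, and list the visible TPU devices.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print(tf.config.list_logical_devices('TPU'))
EOF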

Run training script

(vm)$ python3 run_clm.py \
  --model_name_or_path distilgpt2 \
  --max_train_samples 1000 \
  --max_eval_samples 100 \
  --num_train_epochs 1 \
  --output_dir /tmp/gpt2-wikitext \
  --dataset_name wikitext \
  --dataset_config_name wikitext-103-raw-v1

Command flag descriptions

model_name_or_path
The name of the model to train.
max_train_samples
The maximum number of samples to use for training.
max_eval_samples
The maximum number of samples to use for evaluation.
num_train_epochs
The number of epochs to train the model.
output_dir
The output directory for the training script.
dataset_name
The name of the dataset to use.
dataset_config_name
The dataset configuration name.

When the training is complete, a message similar to the following is displayed:

  125/125 [============================] - ETA: 0s - loss: 3.6176
  2023-07-07 21:38:17.902850: W tensorflow/core/framework/dataset.cc:956] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
  125/125 [============================] - 763s 6s/step - loss: 3.6176 - val_loss: 3.4233
  Configuration saved in /tmp/gpt2-wikitext/config.json
  Configuration saved in /tmp/gpt2-wikitext/generation_config.json
  Model weights saved in /tmp/gpt2-wikitext/tf_model.h5
  D0707 21:38:45.640973681   12027 init.cc:191]                          grpc_shutdown starts clean-up now
  
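
As an optional follow-up (not part of the original tutorial), you can reload the fine-tuned weights and generate a short sample to confirm the saved model loads. This sketch assumes run_clm.py saved the tokenizer files alongside the model in /tmp/gpt2-wikitext; if it did not, load the distilgpt2 tokenizer from the Hub instead:

(vm)$ python3 - <<'EOF'
# Hedged sketch: reload the fine-tuned model and generate a few tokens.
# Assumes the tokenizer was saved to the output directory with the model.
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/tmp/gpt2-wikitext")
model = TFAutoModelForCausalLM.from_pretrained("/tmp/gpt2-wikitext")

inputs = tokenizer("The history of natural language processing", return_tensors="tf")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
EOF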

Clean up

  1. Disconnect from the TPU VM instance:

    (vm)$ exit
    

    Your prompt should now be username@projectname, showing you are in the Cloud Shell.

  2. Delete the TPU resource.

    $ gcloud compute tpus tpu-vm delete hf-gpt2 \
      --zone=us-central2-b
    
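
  3. Verify that the TPU has been deleted by listing the TPU VMs in your zone. The deleted TPU should no longer appear in the output:

    $ gcloud compute tpus tpu-vm list --zone=us-central2-b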

What's next

Try one of the other supported reference models.