Optimize GKE resource utilization for mixed AI/ML training and inference workloads


This tutorial shows you how to efficiently share accelerator resources between training- and inference-serving workloads within a single Google Kubernetes Engine (GKE) cluster. By distributing your mixed workloads across a single cluster, you improve resource utilization, simplify cluster management, reduce issues from accelerator quantity limitations, and enhance overall cost-effectiveness.

In this tutorial, you create a high-priority serving Deployment using the Gemma 2 large language model (LLM) for inference and the Hugging Face TGI (Text Generation Interface) serving framework, along with a low-priority LLM fine-tuning Job. Both workloads run on a single cluster that uses NVIDIA L4 GPUs. You use Kueue, an open source Kubernetes-native Job queueing system, to manage and schedule your workloads. Kueue lets you prioritize serving tasks and preempt lower-priority training Jobs to optimize resource utilization. As serving demands decrease, you reallocate the freed-up accelerators to resume training Jobs. You use Kueue and priority classes to manage resource quotas throughout the process.

This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who want to train and host a machine learning (ML) model on a GKE cluster, and who also want to reduce costs and management overhead, especially when dealing with a limited number of accelerators. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.

Before reading this page, ensure that you're familiar with the following:

Objectives

By the end of this guide, you should be able to perform the following steps:

  • Configure a high-priority serving Deployment.
  • Set up lower-priority training Jobs.
  • Implement preemption strategies to address varying demand.
  • Manage resource allocation between training and serving tasks using Kueue.

Before you begin

  • Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  • In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  • Make sure that billing is enabled for your Google Cloud project.

  • Enable the required APIs.

    Enable the APIs

  • In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  • Make sure that billing is enabled for your Google Cloud project.

  • Enable the required APIs.

    Enable the APIs

  • Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.

Prepare the environment

In this section, you provision the resources that you need to deploy TGI and the model for your inference and training workloads.

Get access to the model

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement, then generate a Hugging Face access token.

  1. Sign the license consent agreement. Access the model consent page, verify consent using your Hugging Face account, and accept the model terms.
  2. Generate an access token. To access the model through Hugging Face, you need a Hugging Face token. Follow these steps to generate a new token if you don't have one already:

    1. Click Your Profile > Settings > Access Tokens.
    2. Select New Token.
    3. Specify a Name of your choice and a Role of at least Read.
    4. Select Generate a token.
    5. Copy the generated token to your clipboard.

Launch Cloud Shell

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl, gcloud CLI, and Terraform.

To set up your environment with Cloud Shell, follow these steps:

  1. In the Google Cloud console, launch a Cloud Shell session by clicking Cloud Shell activation icon Activate Cloud Shell in the Google Cloud console. This launches a session in the bottom pane of Google Cloud console.

  2. Set the default environment variables:

    gcloud config set project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    

    Replace PROJECT_ID with your Google Cloud project ID.

  3. Clone the sample code from GitHub. In Cloud Shell, run the following commands:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/
    cd kubernetes-engine-samples/ai-ml/mix-train-and-inference
    export EXAMPLE_HOME=$(pwd)
    

Create a GKE cluster

You can use an Autopilot or Standard cluster for your mixed workloads. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

  1. Set the default environment variables in Cloud Shell:

    export HF_TOKEN=HF_TOKEN
    export REGION=REGION
    export CLUSTER_NAME="llm-cluster"
    export PROJECT_NUMBER=$(gcloud projects list \
        --filter="$(gcloud config get-value project)" \
        --format="value(PROJECT_NUMBER)")
    export MODEL_BUCKET="model-bucket-$PROJECT_ID"
    

    Replace the following values:

    • HF_TOKEN: the Hugging Face token you generated earlier.
    • REGION: a region that supports the accelerator type you want to use, for example, us-central1 for the L4 GPU.

    You can adjust the MODEL_BUCKET variable—this represents the Cloud Storage bucket where you store your trained model weights.

  2. Create an Autopilot cluster:

    gcloud container clusters create-auto ${CLUSTER_NAME} \
        --project=${PROJECT_ID} \
        --region=${REGION} \
        --release-channel=rapid
    
  3. Create the Cloud Storage bucket for the fine-tuning job:

    gcloud storage buckets create gs://${MODEL_BUCKET} \
        --location ${REGION} \
        --uniform-bucket-level-access
    
  4. To grant access to the Cloud Storage bucket, run this command:

    gcloud storage buckets add-iam-policy-binding "gs://$MODEL_BUCKET" \
        --role=roles/storage.objectAdmin \
        --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/llm/sa/default \
        --condition=None
    
  5. To get authentication credentials for the cluster, run this command:

    gcloud container clusters get-credentials llm-cluster \
        --region=$REGION \
        --project=$PROJECT_ID
    
  6. Create a namespace for your Deployments. In Cloud Shell, run the following command:

    kubectl create ns llm
    

Standard

  1. Set the default environment variables in Cloud Shell:

    export HF_TOKEN=HF_TOKEN
    export REGION=REGION
    export CLUSTER_NAME="llm-cluster"
    export GPU_POOL_MACHINE_TYPE="g2-standard-24"
    export GPU_POOL_ACCELERATOR_TYPE="nvidia-l4"
    export PROJECT_NUMBER=$(gcloud projects list \
        --filter="$(gcloud config get-value project)" \
        --format="value(PROJECT_NUMBER)")
    export MODEL_BUCKET="model-bucket-$PROJECT_ID"
    

    Replace the following values:

    • HF_TOKEN: the Hugging Face token you generated earlier.
    • REGION: the region that supports the accelerator type you want to use, for example, us-central1 for the L4 GPU.

    You can adjust these variables:

    • GPU_POOL_MACHINE_TYPE: the node pool machine series that you want to use in your selected region. This value depends on the accelerator type you selected. To learn more, see Limitations of using GPUs on GKE. For example, this tutorial uses g2-standard-24 with two GPUs attached per node. For the most up-to-date list of available GPUs, see GPUs for Compute Workloads.
    • GPU_POOL_ACCELERATOR_TYPE: the accelerator type that's supported in your selected region. For example, this tutorial uses nvidia-l4. For the latest list of available GPUs, see GPUs for Compute Workloads.
    • MODEL_BUCKET: the Cloud Storage bucket where you store your trained model weights.
  2. Create a Standard cluster:

    gcloud container clusters create ${CLUSTER_NAME} \
        --project=${PROJECT_ID} \
        --region=${REGION} \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --release-channel=rapid \
        --machine-type=e2-standard-4 \
        --addons GcsFuseCsiDriver \
        --num-nodes=1
    
  3. Create the GPU node pool for inference and fine-tuning workloads:

    gcloud container node-pools create gpupool \
        --accelerator type=${GPU_POOL_ACCELERATOR_TYPE},count=2,gpu-driver-version=latest \
        --project=${PROJECT_ID} \
        --location=${REGION} \
        --node-locations=${REGION}-a \
        --cluster=${CLUSTER_NAME} \
        --machine-type=${GPU_POOL_MACHINE_TYPE} \
        --num-nodes=3
    
  4. Create the Cloud Storage bucket for the fine-tuning job:

    gcloud storage buckets create gs://${MODEL_BUCKET} \
        --location ${REGION} \
        --uniform-bucket-level-access
    
  5. To grant access to the Cloud Storage bucket, run this command:

    gcloud storage buckets add-iam-policy-binding "gs://$MODEL_BUCKET" \
        --role=roles/storage.objectAdmin \
        --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/llm/sa/default \
        --condition=None
    
  6. To get authentication credentials for the cluster, run this command:

    gcloud container clusters get-credentials llm-cluster \
        --region=$REGION \
        --project=$PROJECT_ID
    
  7. Create a namespace for your Deployments. In Cloud Shell, run the following command:

    kubectl create ns llm
    

Create a Kubernetes Secret for Hugging Face credentials

To create a Kubernetes Secret that contains the Hugging Face token, run the following command:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=$HF_TOKEN \
    --dry-run=client -o yaml | kubectl apply --namespace=llm --filename=-

Configure Kueue

In this tutorial, Kueue is the central resource manager, enabling efficient sharing of GPUs between your training and serving workloads. Kueue achieves this by defining resource requirements ("flavors"), prioritizing workloads through queues (with serving tasks prioritized over training), and dynamically allocating resources based on demand and priority. This tutorial uses the Workload resource type to group the inference and fine-tuning workloads, respectively.

Kueue's preemption feature ensures that high-priority serving workloads always have the necessary resources by pausing or evicting lower-priority training Jobs when resources are scarce.

To control the inference server Deployment with Kueue, you enable the v1/pod integration by applying a custom configuration using Kustomize, to ensure the server Pods are labeled with "kueue-job: true".

  1. In the /kueue directory, view the code in kustomization.yaml. This manifest installs the Kueue resource manager with custom configurations.

    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
    - https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.0/manifests.yaml
    patches:
    - path: patch.yaml
      target:
        version: v1
        kind: ConfigMap
        name: kueue-manager-config
    
  2. In the /kueue directory, view the code in patch.yaml. This ConfigMap customizes Kueue to manage Pods with the "kueue-job: true" label.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: kueue-manager-config
    data:
      controller_manager_config.yaml: |
        apiVersion: config.kueue.x-k8s.io/v1beta1
        kind: Configuration
        health:
          healthProbeBindAddress: :8081
        metrics:
          bindAddress: :8080
        # enableClusterQueueResources: true
        webhook:
          port: 9443
        leaderElection:
          leaderElect: true
          resourceName: c1f6bfd2.kueue.x-k8s.io
        controller:
          groupKindConcurrency:
            Job.batch: 5
            Pod: 5
            Workload.kueue.x-k8s.io: 5
            LocalQueue.kueue.x-k8s.io: 1
            ClusterQueue.kueue.x-k8s.io: 1
            ResourceFlavor.kueue.x-k8s.io: 1
        clientConnection:
          qps: 50
          burst: 100
        #pprofBindAddress: :8083
        #waitForPodsReady:
        #  enable: false
        #  timeout: 5m
        #  blockAdmission: false
        #  requeuingStrategy:
        #    timestamp: Eviction
        #    backoffLimitCount: null # null indicates infinite requeuing
        #    backoffBaseSeconds: 60
        #    backoffMaxSeconds: 3600
        #manageJobsWithoutQueueName: true
        #internalCertManagement:
        #  enable: false
        #  webhookServiceName: ""
        #  webhookSecretName: ""
        integrations:
          frameworks:
          - "batch/job"
          - "kubeflow.org/mpijob"
          - "ray.io/rayjob"
          - "ray.io/raycluster"
          - "jobset.x-k8s.io/jobset"
          - "kubeflow.org/mxjob"
          - "kubeflow.org/paddlejob"
          - "kubeflow.org/pytorchjob"
          - "kubeflow.org/tfjob"
          - "kubeflow.org/xgboostjob"
          - "pod"
        #  externalFrameworks:
        #  - "Foo.v1.example.com"
          podOptions:
            # You can change namespaceSelector to define in which 
            # namespaces kueue will manage the pods.
            namespaceSelector:
              matchExpressions:
              - key: kubernetes.io/metadata.name
                operator: NotIn
                values: [ kube-system, kueue-system ]
            # Kueue uses podSelector to manage pods with particular 
            # labels. The default podSelector will match all the pods. 
            podSelector:
              matchExpressions:
              - key: kueue-job
                operator: In
                values: [ "true", "True", "yes" ]
    
  3. In Cloud Shell, run the following command to install Kueue:

    cd ${EXAMPLE_HOME}
    kubectl kustomize kueue |kubectl apply --server-side --filename=-
    

    Wait until the Kueue Pods are ready:

    watch kubectl --namespace=kueue-system get pods
    

    The output should look similar to the following:

    NAME                                        READY   STATUS    RESTARTS   AGE
    kueue-controller-manager-bdc956fc4-vhcmx    2/2     Running   0          3m15s
    
  4. In the /workloads directory, view the flavors.yaml, cluster-queue.yaml, and local-queue.yaml files. These manifests specify how Kueue manages resource quotas:

    ResourceFlavor

    This manifest defines a default ResourceFlavor in Kueue for resource management.

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: default-flavor
    

    ClusterQueue

    This manifest sets up a Kueue ClusterQueue with resource limits for CPU, memory, and GPU.

    This tutorial uses nodes with two Nvidia L4 GPUs attached, with the corresponding node type of g2-standard-24, offering 24 vCPU and 96 GB RAM. The example code shows how to limit your workload's resource usage to a maximum of six GPUs.

    The preemption field in the ClusterQueue configuration references the PriorityClasses to determine which Pods can be preempted when resources are scarce.

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: "cluster-queue"
    spec:
      namespaceSelector: {} # match all.
      preemption:
        reclaimWithinCohort: LowerPriority
        withinClusterQueue: LowerPriority
      resourceGroups:
      - coveredResources: [ "cpu", "memory", "nvidia.com/gpu", "ephemeral-storage" ]
        flavors:
        - name: default-flavor
          resources:
          - name: "cpu"
            nominalQuota: 72
          - name: "memory"
            nominalQuota: 288Gi
          - name: "nvidia.com/gpu"
            nominalQuota: 6
          - name: "ephemeral-storage"
            nominalQuota: 200Gi
    

    LocalQueue

    This manifest creates a Kueue LocalQueue named lq in the llm namespace.

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      namespace: llm # LocalQueue under llm namespace 
      name: lq
    spec:
      clusterQueue: cluster-queue # Point to the ClusterQueue
    
  5. View the default-priorityclass.yaml, low-priorityclass.yaml, and high-priorityclass.yaml files. These manifests define the PriorityClass objects for Kubernetes scheduling.

    Default priority

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: default-priority-nonpreempting
    value: 10
    preemptionPolicy: Never
    globalDefault: true
    description: "This priority class will not cause other pods to be preempted."
    

    Low priority

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: low-priority-preempting
    value: 20
    preemptionPolicy: PreemptLowerPriority
    globalDefault: false
    description: "This priority class will cause pods with lower priority to be preempted."
    

    High priority

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: high-priority-preempting
    value: 30
    preemptionPolicy: PreemptLowerPriority
    globalDefault: false
    description: "This high priority class will cause other pods to be preempted."
    
  6. Create the Kueue and Kubernetes objects by running these commands to apply the corresponding manifests.

    cd ${EXAMPLE_HOME}/workloads
    kubectl apply --filename=flavors.yaml
    kubectl apply --filename=default-priorityclass.yaml
    kubectl apply --filename=high-priorityclass.yaml
    kubectl apply --filename=low-priorityclass.yaml
    kubectl apply --filename=cluster-queue.yaml
    kubectl apply --filename=local-queue.yaml --namespace=llm
    

Deploy the TGI inference server

In this section, you deploy the TGI container to serve the Gemma 2 model.

  1. In the /workloads directory, view the tgi-gemma-2-9b-it-hp.yaml file. This manifest defines a Kubernetes Deployment to deploy the TGI serving runtime and gemma-2-9B-it model. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

    The Deployment prioritizes inference tasks and uses two GPUs for the model. It uses tensor parallelism, by setting the NUM_SHARD environment variable, to fit the model into GPU memory.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tgi-gemma-deployment
      labels:
        app: gemma-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gemma-server
      template:
        metadata:
          labels:
            app: gemma-server
            ai.gke.io/model: gemma-2-9b-it
            ai.gke.io/inference-server: text-generation-inference
            examples.ai.gke.io/source: user-guide
            kueue.x-k8s.io/queue-name: lq
            kueue-job: "true"
        spec:
          priorityClassName: high-priority-preempting
          containers:
          - name: inference-server
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310
            resources:
              requests:
                cpu: "4"
                memory: "30Gi"
                ephemeral-storage: "30Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "4"
                memory: "30Gi"
                ephemeral-storage: "30Gi"
                nvidia.com/gpu: "2"
            env:
            - name: AIP_HTTP_PORT
              value: '8000'
            - name: NUM_SHARD
              value: '2'
            - name: MODEL_ID
              value: google/gemma-2-9b-it
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: llm-service
    spec:
      selector:
        app: gemma-server
      type: ClusterIP
      ports:
      - protocol: TCP
        port: 8000
        targetPort: 8000
    
  2. Apply the manifest by running the following command:

    kubectl apply --filename=tgi-gemma-2-9b-it-hp.yaml --namespace=llm
    

    The deployment operation will take a few minutes to complete.

  3. To check if GKE successfully created the Deployment, run the following command:

    kubectl --namespace=llm get deployment
    

    The output should look similar to the following:

    NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
    tgi-gemma-deployment   1/1     1            1           5m13s
    

Verify Kueue quota management

In this section, you confirm that Kueue is correctly enforcing the GPU quota for your Deployment.

  1. To check if Kueue is aware of your Deployment, run this command to retrieve the status of the Workload objects:

    kubectl --namespace=llm get workloads
    

    The output should look similar to the following:

    NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
    pod-tgi-gemma-deployment-6bf9ffdc9b-zcfrh-84f19   lq      cluster-queue   True                  8m23s
    
  2. To test overriding the quota limits, scale the Deployment to four replicas:

    kubectl scale --replicas=4 deployment/tgi-gemma-deployment --namespace=llm
    
  3. Run the following command to see the number of replicas that GKE deploys:

    kubectl get workloads --namespace=llm
    

    The output should look similar to the following:

    NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
    pod-tgi-gemma-deployment-6cb95cc7f5-5thgr-3f7d4   lq      cluster-queue   True                  14s
    pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  5m41s
    pod-tgi-gemma-deployment-6cb95cc7f5-tznkl-80f6b   lq                                            13s
    pod-tgi-gemma-deployment-6cb95cc7f5-wd4q9-e4302   lq      cluster-queue   True                  13s
    

    The output shows that only three Pods are admitted due to the resource quota that Kueue enforces.

  4. Run the following to display the Pods in the llm namespace:

    kubectl get pod --namespace=llm
    

    The output should look similar to the following:

    NAME                                    READY   STATUS            RESTARTS   AGE
    tgi-gemma-deployment-7649884d64-6j256   1/1     Running           0          4m45s
    tgi-gemma-deployment-7649884d64-drpvc   0/1     SchedulingGated   0          7s
    tgi-gemma-deployment-7649884d64-thdkq   0/1     Pending           0          7s
    tgi-gemma-deployment-7649884d64-znvpb   0/1     Pending           0          7s
    
  5. Now, scale down the Deployment back to 1. This step is required before deploying the fine-tuning job, otherwise it won't get admitted due to the inference job having priority.

    kubectl scale --replicas=1 deployment/tgi-gemma-deployment --namespace=llm
    

Explanation of the behavior

The scaling example results in only three replicas (despite scaling to four) because of the GPU quota limit that you set in the ClusterQueue configuration. The ClusterQueue's spec.resourceGroups section defines a nominalQuota of "6" for nvidia.com/gpu. The Deployment specifies that each Pod requires "2" GPUs. Therefore, the ClusterQueue can only accommodate a maximum of three replicas of the Deployment at a time (since 3 replicas * 2 GPUs per replica = 6 GPUs, which is the total quota).

When you attempt to scale to four replicas, Kueue recognizes that this action would exceed the GPU quota and it prevents the fourth replica from being scheduled. This is indicated by the SchedulingGated status of the fourth Pod. This behavior demonstrates Kueue's resource quota enforcement.

Deploy the training Job

In this section, you deploy a lower-priority fine-tuning Job for a Gemma 2 model that requires four GPUs across two Pods. This job will use the remaining GPU quota in the ClusterQueue. The Job uses a prebuilt image and saves checkpoints to allow restarting from intermediate results.

The fine-tuning Job uses the b-mc2/sql-create-context dataset. The source for the fine-tuning job can be found in the repository.

  1. View the fine-tune-l4.yaml file. This manifest defines the fine-tuning Job.

    apiVersion: v1
    kind: Service
    metadata:
      name: headless-svc-l4
    spec:
      clusterIP: None # clusterIP must be None to create a headless service
      selector:
        job-name: finetune-gemma-l4 # must match Job name
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: finetune-gemma-l4
      labels:
        kueue.x-k8s.io/queue-name: lq
    spec:
      backoffLimit: 4
      completions: 2
      parallelism: 2
      completionMode: Indexed
      suspend: true # Set to true to allow Kueue to control the Job when it starts
      template:
        metadata:
          labels:
            app: finetune-job
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/memory-limit: "35Gi"
        spec:
          priorityClassName: low-priority-preempting
          containers:
          - name: gpu-job
            imagePullPolicy: Always
            image: us-docker.pkg.dev/google-samples/containers/gke/gemma-fine-tuning:v1.0.0
            ports:
            - containerPort: 29500
            resources:
              requests:
                nvidia.com/gpu: "2"
              limits:
                nvidia.com/gpu: "2"
            command:
            - bash
            - -c
            - |
              accelerate launch \
              --config_file fsdp_config.yaml \
              --debug \
              --main_process_ip finetune-gemma-l4-0.headless-svc-l4 \
              --main_process_port 29500 \
              --machine_rank ${JOB_COMPLETION_INDEX} \
              --num_processes 4 \
              --num_machines 2 \
              fine_tune.py
            env:
            - name: "EXPERIMENT"
              value: "finetune-experiment"
            - name: MODEL_NAME
              value: "google/gemma-2-2b"
            - name: NEW_MODEL
              value: "gemma-ft"
            - name: MODEL_PATH
              value: "/model-data/model-gemma2/experiment"
            - name: DATASET_NAME
              value: "b-mc2/sql-create-context"
            - name: DATASET_LIMIT
              value: "5000"
            - name: EPOCHS
              value: "1"
            - name: GRADIENT_ACCUMULATION_STEPS
              value: "2"
            - name: CHECKPOINT_SAVE_STEPS
              value: "10"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - name: gcs-fuse-csi-ephemeral
              mountPath: /model-data
              readOnly: false
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-l4
          restartPolicy: OnFailure
          serviceAccountName: default
          subdomain: headless-svc-l4
          terminationGracePeriodSeconds: 60
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: <MODEL_BUCKET>
                mountOptions: "implicit-dirs"
                gcsfuseLoggingSeverity: warning
    
  2. Apply the manifest to create the fine-tuning Job:

    cd ${EXAMPLE_HOME}/workloads
    
    sed -e "s/<MODEL_BUCKET>/$MODEL_BUCKET/g" \
        -e "s/<PROJECT_ID>/$PROJECT_ID/g" \
        -e "s/<REGION>/$REGION/g" \
        fine-tune-l4.yaml |kubectl apply --filename=- --namespace=llm
    
  3. Verify that your Deployments are running. To check the status of the Workload objects, run the following command:

    kubectl get workloads --namespace=llm
    

    The output should look similar to the following:

    NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
    job-finetune-gemma-l4-3316f                       lq      cluster-queue   True                  29m
    pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  68m
    

    Next, view the Pods in the llm namespace by running this command:

    kubectl get pod --namespace=llm
    

    The output should look similar to the following:

    NAME                                    READY   STATUS    RESTARTS   AGE
    finetune-gemma-l4-0-vcxpz               2/2     Running   0          31m
    finetune-gemma-l4-1-9ppt9               2/2     Running   0          31m
    tgi-gemma-deployment-6cb95cc7f5-cbxg2   1/1     Running   0          70m
    

    The output shows that Kueue admits both your fine-tune Job and inference server Pods to run, reserving the correct resources based on your specified quota limits.

  4. View the output logs to verify that your fine-tuning Job saves checkpoints to the Cloud Storage bucket. The fine-tuning Job takes around 10 minutes before it starts saving the first checkpoint.

    kubectl logs --namespace=llm --follow --selector=app=finetune-job
    

    The output for the first saved checkpoint looks similar to the following:

    {"name": "finetune", "thread": 133763559483200, "threadName": "MainThread", "processName": "MainProcess", "process": 33, "message": "Fine tuning started", "timestamp": 1731002351.0016131, "level": "INFO", "runtime": 451579.89835739136}
    …
    {"name": "accelerate.utils.fsdp_utils", "thread": 136658669348672, "threadName": "MainThread", "processName": "MainProcess", "process": 32, "message": "Saving model to /model-data/model-gemma2/experiment/checkpoint-10/pytorch_model_fsdp_0", "timestamp": 1731002386.1763802, "level": "INFO", "runtime": 486753.8924217224}
    

Test Kueue preemption and dynamic allocation on your mixed workload

In this section, you simulate a scenario where the inference server's load increases, requiring it to scale up. This scenario demonstrates how Kueue prioritizes the high-priority inference server by suspending and preempting the lower-priority fine-tuning Job when resources are constrained.

  1. Run the following command to scale the inference server's replicas to two:

    kubectl scale --replicas=2 deployment/tgi-gemma-deployment --namespace=llm
    
  2. Check the status of the Workload objects:

    kubectl get workloads --namespace=llm
    

    The output looks similar to the following:

    NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
    job-finetune-gemma-l4-3316f                       lq                      False                 32m
    pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  70m
    pod-tgi-gemma-deployment-6cb95cc7f5-p49sh-167de   lq      cluster-queue   True                  14s
    

    The output shows that the fine-tuning Job is no longer admitted because the increased inference server replicas are using the available GPU quota.

  3. Check the status of the fine-tune Job:

    kubectl get job --namespace=llm
    

    The output looks similar to the following, indicating that the fine-tune Job status is now suspended:

    NAME                STATUS      COMPLETIONS   DURATION   AGE
    finetune-gemma-l4   Suspended   0/2                      33m
    
  4. Run the following command to inspect your Pods:

    kubectl get pod --namespace=llm
    

    The output looks similar to the following, indicating that Kueue terminated the fine-tune Job Pods to free resources for the higher priority inference server Deployment.

    NAME                                    READY   STATUS              RESTARTS   AGE
    tgi-gemma-deployment-6cb95cc7f5-cbxg2   1/1     Running             0          72m
    tgi-gemma-deployment-6cb95cc7f5-p49sh   0/1     ContainerCreating   0          91s
    
  5. Next, test the scenario where the inference server load decreases and its Pods scale down. Run the following command:

    kubectl scale --replicas=1 deployment/tgi-gemma-deployment --namespace=llm
    

    Run the following command to display the Workload objects:

    kubectl get workloads --namespace=llm
    

    The output looks similar to the following, indicating that one of the inference server Deployments is terminated, and the fine-tune Job is re-admitted.

    NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
    job-finetune-gemma-l4-3316f                       lq      cluster-queue   True                  37m
    pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  75m
    
  6. Run this command to display the Jobs:

    kubectl get job --namespace=llm
    

    The output looks similar to the following, indicating that the fine-tune Job is running again, resuming from the latest available checkpoint.

    NAME                STATUS    COMPLETIONS   DURATION   AGE
    finetune-gemma-l4   Running   0/2           2m11s      38m
    

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands:

gcloud storage rm --recursive gs://${MODEL_BUCKET}
gcloud container clusters delete ${CLUSTER_NAME} --location ${REGION}

What's next