Use Hugging Face Models

HuggingFace provides pre-trained models, fine-tuning scripts, and development APIs that make the process of creating and discovering LLMs easier. Model Garden can serve Text Embedding Inference, Regular Pytorch Inference, and Text Generation Inference supported models in HuggingFace.

Deployment options for Hugging Face models

You can deploy supported Hugging Face models in Vertex AI or Google Kubernetes Engine (GKE). The deployment option you choose can depend on the model you're using and how much control you want over your workloads.

Deploy in Vertex AI

Vertex AI offers a managed platform for building and scaling machine learning projects without in-house MLOps expertise. You can use Vertex AI as the downstream application that serves the Hugging Face models. We recommend using Vertex AI if you want end-to-end MLOps capabilities, value-added ML features, and a serverless experience for streamlined development.

To deploy a supported Hugging Face model in Vertex AI, go to Model Garden.

Go to Model Garden
Go to the Open source models on Hugging Face section and click Show more.
Find and select a model to deploy.
Optional: For the Deployment environment, select Vertex AI.
Optional: Specify the deployment details.
Click Deploy.

To get started, see the following examples:

Some models have detailed model cards and the deployment settings are verified by Google, such as google/gemma-7b-it, meta-llama/Llama-2-7b-chat-hf, mistralai/Mistral-7B-v0.1, BAAI/bge-m3, intfloat/multilingual-e5-large-instruct, stabilityai/stable-diffusion-2-1 and HuggingFaceFW/fineweb-edu-classifier.
Some models have the deployment settings verified by Google but no detailed model cards, such as NousResearch/Genstruct-7B.
Some models have deployment settings generated automatically, such as ai4bharat/Airavata.
Some models have automatically generated deployment settings that are based on model metadata, such as some latest trending models in text generation, text embedding, and text to image generation.

Deploy in GKE

Google Kubernetes Engine (GKE) is the Google Cloud solution for managed Kubernetes that provides scalability, security, resilience, and cost effectiveness. We recommend this option if you have existing Kubernetes investments, your organization has in-house MLOps expertise, or if you need granular control over complex AI/ML workloads with unique security, data pipeline, and resource management requirements.

To deploy a supported Hugging Face model in GKE, go to Model Garden.

Go to Model Garden
Go to the Open source models on Hugging Face section and click Show more.
Find and select a model to deploy.
For the Deployment environment, select GKE.
Follow the deployment instructions.

To get started, see the following examples:

Some models have detailed model cards and verified deployment settings, such as google/gemma-7b-it, meta-llama/Llama-2-7b-chat-hf, and mistralai/Mistral-7B-v0.1).
Some models have verified deployment settings, but no detailed model cards, such as NousResearch/Genstruct-7B.