Vertex AI RAG Engine quotas

The following quotas apply to each service that performs retrieval-augmented generation (RAG) using RAG Engine. Quotas are measured in requests per minute (RPM).

Service	Quota	Metric
RAG Engine data management APIs	60 RPM	`VertexRagDataService requests per minute per region`
`RetrievalContexts` API	1,500 RPM	`VertexRagService retrieve requests per minute per region`
`base_model: textembedding-gecko`	1,500 RPM	`Online prediction requests per base model per minute per region per base_model` (specify the additional filter `base_model: textembedding-gecko`)
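
To stay under these per-minute quotas, a caller can throttle requests on the client side. The following is a minimal sketch of a sliding-window limiter, not part of the Vertex AI SDK. The class name `RpmLimiter` and the constant names are hypothetical; only the RPM values come from the table above.

```python
import time
from collections import deque

# Values taken from the quota table; the names are illustrative assumptions.
DATA_MANAGEMENT_RPM = 60
RETRIEVE_CONTEXTS_RPM = 1500


class RpmLimiter:
    """Client-side sliding-window limiter: at most `rpm` calls per 60 seconds."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.rpm = rpm
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.calls = deque()    # timestamps of calls inside the current window

    def acquire(self):
        """Block until one more request fits in the window.

        Returns the total number of seconds spent waiting.
        """
        waited = 0.0
        while True:
            now = self.clock()
            # Drop timestamps that are 60 seconds old or older.
            while self.calls and now - self.calls[0] >= 60.0:
                self.calls.popleft()
            if len(self.calls) < self.rpm:
                self.calls.append(now)
                return waited
            # Wait until the oldest call leaves the window, then re-check.
            delay = 60.0 - (now - self.calls[0])
            self.sleep(delay)
            waited += delay
```

Calling `limiter.acquire()` before each request, with one limiter per service and region, keeps a single process within the quota. It does not coordinate across processes that share the same project quota.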

The following limits apply:

Service	Limit	Metric
Concurrent `ImportRagFiles` requests	3 RPM	`VertexRagService concurrent import requests per region`
Maximum number of files per `ImportRagFiles` request	10,000	`VertexRagService import rag files requests per region`
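
One way to respect both limits is to split a large file list into batches of at most 10,000 and to run no more than three imports at a time. The sketch below assumes the first limit caps the number of in-flight `ImportRagFiles` requests. The `import_fn` callable is a hypothetical placeholder for whatever function issues the actual import request; it is not an SDK function.

```python
from concurrent.futures import ThreadPoolExecutor

# Values taken from the limits table; the names are illustrative assumptions.
MAX_FILES_PER_IMPORT = 10_000
MAX_CONCURRENT_IMPORTS = 3


def batch_paths(paths, size=MAX_FILES_PER_IMPORT):
    """Split `paths` into consecutive lists of at most `size` items."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]


def import_all(paths, import_fn, max_workers=MAX_CONCURRENT_IMPORTS):
    """Call `import_fn(batch)` for every batch, at most `max_workers` at once.

    `import_fn` is a caller-supplied placeholder that sends one import
    request for the given batch. Results are returned in batch order.
    """
    batches = batch_paths(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(import_fn, batches))
```

For example, 25,000 paths produce three batches of 10,000, 10,000, and 5,000 files, which fit in a single round of three concurrent requests.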

For more rate limits and quotas, see Generative AI on Vertex AI rate limits.

What's next

To learn how to use the Vertex AI SDK to run Vertex AI RAG Engine tasks, see RAG quickstart for Python.
To learn about grounding, see Grounding overview.
To learn about the differences between RAG and grounding, see Ground responses using RAG.
To learn about the RAG architecture, see:
- Infrastructure for a RAG-capable generative AI application using Vertex AI and Vector Search
- Infrastructure for a RAG-capable generative AI application using Vertex AI and AlloyDB for PostgreSQL