Best practices for repository size
Repository size impacts multiple aspects of development in Dataform, such as:
- Collaboration
- Codebase readability
- Development processes
- Workflow compilation
- Workflow execution
Dataform enforces API quotas and limits on compilation resources. A large repository can exceed these quotas and limits, which can cause compilation and execution of your SQL workflow to fail.
To mitigate that risk, we recommend splitting large repositories. When you split a large repository, you divide a large SQL workflow into a number of smaller SQL workflows housed in different repositories and connected by cross-repository dependencies.
This approach lets you stay within Dataform quotas and limits, apply finer-grained processes and permissions, and improve codebase readability and collaboration. However, managing split repositories can be more challenging than managing a single repository.
To learn more about the impact of repository size in Dataform and best practices for splitting repositories, see Splitting repositories.
Best practices for repository structure
We recommend structuring files in the definitions directory to reflect the stages of your workflow. Keep in mind that you can adopt a custom structure that best fits your needs.
The following recommended structure of definitions subdirectories reflects the key stages of most SQL workflows:
- sources, storing data source declarations
- intermediate, storing data transformation logic
- output, storing definitions of output tables
- Optional: extras, storing additional files
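As an illustration of the recommended layout, the following minimal Python sketch reports which stage subdirectory a file belongs to and flags files that sit outside the four recommended subdirectories. It is not part of Dataform. The function name stage_of and the example file paths are hypothetical, and a repository that adopts a custom structure would need a different list of stages.

```python
from pathlib import PurePosixPath

# Stage subdirectories from the recommended structure described above.
STAGES = ("sources", "intermediate", "output", "extras")


def stage_of(path: str):
    """Return the stage subdirectory of a file, or None if the file
    is not under definitions/<stage>/ in the recommended layout."""
    parts = PurePosixPath(path).parts
    if len(parts) >= 3 and parts[0] == "definitions" and parts[1] in STAGES:
        return parts[1]
    return None


# Hypothetical file paths, for illustration only.
example_paths = [
    "definitions/sources/store.sqlx",
    "definitions/output/orders.sqlx",
    "definitions/misc/scratch.sqlx",
]

# Files that do not follow the recommended layout.
misplaced = [p for p in example_paths if stage_of(p) is None]
print(misplaced)
```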
Names of all files in Dataform must conform to BigQuery table naming guidelines. We recommend that the names of files in the definitions directory in a Dataform repository reflect the subdirectory structure.
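A file name check can catch naming problems before compilation. The following Python sketch applies a deliberately conservative rule, not the full BigQuery guidelines: it accepts only ASCII letters, digits, and underscores, and it assumes a 1,024-byte length limit that you should verify against the BigQuery documentation. The function name and the example paths are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumption: a conservative subset of the BigQuery table naming
# guidelines (ASCII letters, digits, and underscores only).
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")

# Assumption: maximum name length in UTF-8 bytes; verify this value
# against the BigQuery documentation.
MAX_NAME_BYTES = 1024


def is_conservative_table_name(file_path: str) -> bool:
    """Check the file name (without extension) against the conservative rule."""
    stem = PurePosixPath(file_path).stem
    within_limit = len(stem.encode("utf-8")) <= MAX_NAME_BYTES
    return bool(_SAFE_NAME.fullmatch(stem)) and within_limit


# Hypothetical file paths, for illustration only.
print(is_conservative_table_name("definitions/output/daily_orders.sqlx"))
print(is_conservative_table_name("definitions/output/daily orders.sqlx"))
```

Because the rule is stricter than the guidelines themselves, a name that this check rejects might still be acceptable to BigQuery.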
To learn more about best practices for structuring and naming files in a repository, see Structuring code in a repository.
Best practices for code lifecycle
The default code lifecycle in Dataform consists of the following phases:
- Development of SQL workflow code in Dataform workspaces.
  You can develop with Dataform core or exclusively with JavaScript.
- Compilation of your code into a compilation result using settings from your workflow settings file.
  You can configure custom compilation results with release configurations and workspace compilation overrides.
  - With release configurations, you can configure custom compilation results of your whole repository. You can later schedule their execution in workflow configurations.
  - With workspace compilation overrides, you can configure compilation overrides for all workspaces in your repository, creating custom compilation results of each workspace.
- Execution of the compilation result in BigQuery.
  You can schedule executions of repository compilation results with workflow configurations.
To manage code lifecycle in Dataform, you can create execution environments, for example, development, staging, and production.
To learn more about code lifecycle in Dataform, see Introduction to code lifecycle in Dataform.
You can choose to keep your execution environments in a single repository or in multiple repositories.
Execution environments in a single repository
You can create isolated execution environments such as development, staging, and production in a single Dataform repository with workspace compilation overrides and release configurations.
You can create isolated execution environments in the following ways:
- Split development and production tables by schema
- Split development and production tables by schema and Google Cloud project
- Split development, staging, and production tables per Google Cloud project
Then, you can schedule executions in staging and production environments with workflow configurations. We recommend triggering executions manually in the development environment.
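To make the idea of splitting tables by schema and by Google Cloud project concrete, the following Python sketch resolves where tables would land for each environment. It is a conceptual model only and does not call any Dataform API. The project IDs, the schema name, the override keys, and the convention of appending the suffix with an underscore are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    project: str
    schema: str


# Hypothetical default target, standing in for the workflow settings file.
DEFAULT = Target(project="example-project", schema="analytics")

# Hypothetical per-environment overrides:
# - development is split from production by schema only
# - staging is split from production by Google Cloud project
# - production uses the defaults unchanged
OVERRIDES = {
    "development": {"schema_suffix": "dev"},
    "staging": {"project": "example-project-staging"},
    "production": {},
}


def resolve_target(environment: str) -> Target:
    """Apply the environment's overrides on top of the default target."""
    override = OVERRIDES[environment]
    project = override.get("project", DEFAULT.project)
    suffix = override.get("schema_suffix")
    # Assumption: a suffix is appended to the schema with an underscore.
    schema = f"{DEFAULT.schema}_{suffix}" if suffix else DEFAULT.schema
    return Target(project=project, schema=schema)


for env in ("development", "staging", "production"):
    print(env, resolve_target(env))
```

In this model, each environment writes to a distinct project and schema pair, which is the property that keeps development work from overwriting production tables.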
To learn more about best practices for managing code lifecycle in Dataform, see Managing code lifecycle.
Code lifecycle in multiple repositories
To tailor Identity and Access Management permissions to each stage of the code lifecycle, you can create multiple copies of a repository and store them in different Google Cloud projects.
Each Google Cloud project serves as an execution environment that corresponds to a stage of your code lifecycle, for example, development and production.
In this approach, we recommend keeping the codebase of the repository the same in all Google Cloud projects. To customize compilation and execution in each copy of the repository, use workspace compilation overrides, release configurations, and workflow configurations.
What's next
- To learn more about repository size in Dataform, see Overview of repository size.
- To learn more about best practices for splitting repositories, see Splitting repositories.
- To learn more about best practices for repository structure, see Structuring code in a repository.
- To learn more about code lifecycle in Dataform and different ways to configure it, see Introduction to code lifecycle in Dataform.
- To learn more about best practices for code lifecycle, see Managing code lifecycle.