Cloud Composer 1 is in post-maintenance mode. Google does not release any further updates to Cloud Composer 1, including new versions of Airflow, bug fixes, and security updates. We recommend planning a migration to Cloud Composer 2.

Write Airflow DAGs


This guide shows you how to write an Apache Airflow directed acyclic graph (DAG) that runs in a Cloud Composer environment.

Because Apache Airflow does not provide strong DAG and task isolation, we recommend that you use separate production and test environments to prevent DAG interference. For more information, see Testing DAGs.

Structuring an Airflow DAG

An Airflow DAG is defined in a Python file and is composed of the following components:

DAG definition
Airflow operators
Operator relationships

The following code snippets show examples of each component out of context.

A DAG definition

The following example demonstrates an Airflow DAG definition:

import datetime

from airflow import models

default_dag_args = {
    # The start_date describes when a DAG is valid / can be run. Set this to a
    # fixed point in time rather than dynamically, since it is evaluated every
    # time a DAG is parsed. See:
    # https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    "start_date": datetime.datetime(2018, 1, 1),
}

# Define a DAG (directed acyclic graph) of tasks.
# Any task you create within the context manager is automatically added to the
# DAG object.
with models.DAG(
    "composer_sample_simple_greeting",
    schedule_interval=datetime.timedelta(days=1),
    default_args=default_dag_args,
) as dag:

Operators and tasks

Airflow operators describe the work to be done. A task is a specific instance of an operator.

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

    def greeting():
        import logging

        logging.info("Hello World!")

    # An instance of an operator is called a task. In this case, the
    # hello_python task calls the "greeting" Python function.
    hello_python = PythonOperator(task_id="hello", python_callable=greeting)

    # Likewise, the goodbye_bash task calls a Bash script.
    goodbye_bash = BashOperator(task_id="bye", bash_command="echo Goodbye.")

Task relationships

Task relationships describe the order in which the work must be completed.

# Define the order in which the tasks complete by using the >> and <<
# operators. In this example, hello_python executes before goodbye_bash.
hello_python >> goodbye_bash

Full DAG workflow example in Python

The following workflow is a complete working DAG template that is composed of two tasks: a hello_python task and a goodbye_bash task:


import datetime

from airflow import models

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator



default_dag_args = {
    # The start_date describes when a DAG is valid / can be run. Set this to a
    # fixed point in time rather than dynamically, since it is evaluated every
    # time a DAG is parsed. See:
    # https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    "start_date": datetime.datetime(2018, 1, 1),
}

# Define a DAG (directed acyclic graph) of tasks.
# Any task you create within the context manager is automatically added to the
# DAG object.
with models.DAG(
    "composer_sample_simple_greeting",
    schedule_interval=datetime.timedelta(days=1),
    default_args=default_dag_args,
) as dag:
    def greeting():
        import logging

        logging.info("Hello World!")

    # An instance of an operator is called a task. In this case, the
    # hello_python task calls the "greeting" Python function.
    hello_python = PythonOperator(task_id="hello", python_callable=greeting)

    # Likewise, the goodbye_bash task calls a Bash script.
    goodbye_bash = BashOperator(task_id="bye", bash_command="echo Goodbye.")

    # Define the order in which the tasks complete by using the >> and <<
    # operators. In this example, hello_python executes before goodbye_bash.
    hello_python >> goodbye_bash

For more information about defining Airflow DAGs, see the Airflow tutorial and Airflow concepts.

Airflow operators

The following examples show a few popular Airflow operators. For an authoritative reference of Airflow operators, see the Operators and Hooks Reference and Providers index.

BashOperator

Use the BashOperator to run command-line programs.

from airflow.operators import bash

    # Create BigQuery output dataset.
    make_bq_dataset = bash.BashOperator(
        task_id="make_bq_dataset",
        # Executing 'bq' command requires Google Cloud SDK which comes
        # preinstalled in Cloud Composer.
        bash_command=f"bq ls {bq_dataset_name} || bq mk {bq_dataset_name}",
    )

Cloud Composer runs the provided commands in a Bash script on an Airflow worker. The worker is a Debian-based Docker container that includes several packages:

gcloud command, including the gcloud storage sub-command for working with Cloud Storage buckets
bq command
kubectl command

PythonOperator

Use the PythonOperator to run arbitrary Python code.

Cloud Composer runs the Python code in a container that includes packages for the Cloud Composer image version used in your environment.

To install additional Python packages, see Installing Python Dependencies.

Google Cloud Operators

To run tasks that use Google Cloud products, use the Google Cloud Airflow operators. For example, BigQuery operators query and process data in BigQuery.

There are many more Airflow operators for Google Cloud and individual services provided by Google Cloud. See Google Cloud Operators for the full list.

from airflow.providers.google.cloud.operators import bigquery
from airflow.providers.google.cloud.transfers import bigquery_to_gcs

    bq_recent_questions_query = bigquery.BigQueryInsertJobOperator(
        task_id="bq_recent_questions_query",
        configuration={
            "query": {
                "query": RECENT_QUESTIONS_QUERY,
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": bq_dataset_name,
                    "tableId": bq_recent_questions_table_id,
                },
            }
        },
        location=location,
    )

EmailOperator

Use the EmailOperator to send email from a DAG. To send email from a Cloud Composer environment, configure your environment to use SendGrid.

from airflow.operators import email

    # Send email confirmation. You must first set up the email operator. See
    # https://cloud.google.com/composer/docs/how-to/managing/creating#notification
    # for more info on configuring the email operator in Cloud Composer.
    email_summary = email.EmailOperator(
        task_id="email_summary",
        to="{{var.value.email}}",
        subject="Sample BigQuery notify data ready",
        html_content="""
        Analyzed Stack Overflow posts data from {min_date} 12AM to {max_date}
        12AM. The most popular question was '{question_title}' with
        {view_count} views. Top 100 questions asked are now available at:
        {export_location}.
        """.format(
            min_date=min_query_date,
            max_date=max_query_date,
            question_title=(
                "{{ ti.xcom_pull(task_ids='bq_read_most_popular', "
                "key='return_value')[0][0] }}"
            ),
            view_count=(
                "{{ ti.xcom_pull(task_ids='bq_read_most_popular', "
                "key='return_value')[0][1] }}"
            ),
            export_location=output_file,
        ),
    )

Notifications on operator failure

Set email_on_failure to True to send an email notification when an operator in the DAG fails. To send email notifications from a Cloud Composer environment, you must configure your environment to use SendGrid.

import datetime

from airflow import models

# yesterday and project_id are assumed to be defined earlier in the DAG file.
default_dag_args = {
    "start_date": yesterday,
    # Email whenever an Operator in the DAG fails.
    "email": "{{var.value.email}}",
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": datetime.timedelta(minutes=5),
    "project_id": project_id,
}

with models.DAG(
    "composer_sample_bq_notify",
    schedule_interval=datetime.timedelta(weeks=4),
    default_args=default_dag_args,
) as dag:

DAG workflow guidelines

Place any custom Python libraries in a DAG's ZIP archive in a nested directory. Do not place libraries at the top level of the DAGs directory.

When Airflow scans the dags/ folder, Airflow only checks for DAGs in Python modules that are at the top level of the DAGs folder and at the top level of a ZIP archive that is also located in the top-level dags/ folder. If Airflow encounters a Python module in a ZIP archive that does not contain both airflow and DAG substrings, Airflow stops processing the ZIP archive. Airflow returns only the DAGs found up to that point.
For fault tolerance, do not define multiple DAG objects in the same Python module.
Do not use SubDAGs. Instead, group tasks inside DAGs.
Place files that are required at DAG parse time in the dags/ folder, not in the data/ folder.
Implement unit tests for your DAGs.
Test developed or modified DAGs as recommended in instructions for testing DAGs.
The Composer Local Development CLI tool streamlines Apache Airflow DAG development for Cloud Composer 2 by running an Airflow environment locally. This local Airflow environment uses an image of a specific Cloud Composer 2 version.
Verify that developed DAGs do not significantly increase DAG parse times.
Airflow tasks can fail for multiple reasons. To avoid failures of whole DAG runs, we recommend enabling task retries. Setting maximum retries to 0 means that no retries are performed.

We recommend overriding the default_task_retries option with a value for the task retries other than 0. In addition, you can set the retries parameter at the task level.
To use GPUs in your Airflow tasks, create a separate GKE cluster based on nodes that use machines with GPUs. Use GKEStartPodOperator to run your tasks.
Avoid running CPU- and memory-heavy tasks in the cluster's node pool where other Airflow components (schedulers, workers, web servers) are running. Instead, use KubernetesPodOperator or GKEStartPodOperator.
When deploying DAGs into an environment, upload only the files that are absolutely necessary for interpreting and executing DAGs into the /dags folder.
Limit the number of DAG files in the /dags folder.

Airflow continuously parses DAGs in the /dags folder. Parsing is a process that loops through the DAGs folder, and the number of files that need to be loaded (with their dependencies) affects the performance of DAG parsing and task scheduling. It is much more efficient to use 100 files with 100 DAGs each than 10000 files with 1 DAG each, so we recommend this optimization. It is a balance between parsing time and the efficiency of DAG authoring and management.

For example, to deploy 10000 DAG files, you could create 100 ZIP files that each contain 100 DAG files.

In addition to the hints above, if you have more than 10000 DAG files, then generating DAGs programmatically might be a good option. For example, you can implement a single Python DAG file that generates a number of DAG objects (for example, 20 or 100 DAG objects).
Avoid using deprecated Airflow operators. Instead, use their up-to-date alternatives.
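
The ZIP-archive suggestion in the guidelines above could be applied with a small packaging script. The following sketch uses only the Python standard library. The function name bundle_dag_files, the archive naming scheme, and the directory layout are assumptions made for illustration and are not part of Cloud Composer. Each file is stored at the top level of its archive because Airflow only looks for DAGs there.

```python
import pathlib
import zipfile


def bundle_dag_files(source_dir, output_dir, files_per_zip=100):
    """Pack the *.py files from source_dir into ZIP archives in output_dir.

    Every archive holds at most files_per_zip files, each stored at the top
    level of the archive. Returns the list of created archive paths.
    """
    source = pathlib.Path(source_dir)
    output = pathlib.Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    dag_files = sorted(source.glob("*.py"))
    archives = []
    for start in range(0, len(dag_files), files_per_zip):
        chunk = dag_files[start : start + files_per_zip]
        # Hypothetical naming scheme: dags_0000.zip, dags_0001.zip, ...
        archive_path = output / f"dags_{start // files_per_zip:04d}.zip"
        with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for dag_file in chunk:
                # arcname drops the directory part so that the module sits at
                # the top level of the archive.
                archive.write(dag_file, arcname=dag_file.name)
        archives.append(archive_path)
    return archives
```

The resulting archives would then be uploaded to the top level of the environment's dags/ folder in place of the individual files.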

FAQs for writing DAGs

How do I minimize code repetition if I want to run the same or similar tasks in multiple DAGs?

We suggest defining libraries and wrappers to minimize code repetition.

How do I reuse code between DAG files?

Put your utility functions in a local Python library and import the functions. You can reference the functions in any DAG located in the dags/ folder in your environment's bucket.
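
As an illustration, a shared helper might build the default_args dictionary that several DAGs use, including a non-zero retries value. The module path dags/common/dag_defaults.py, the function name build_default_args, and the default values are hypothetical. The import shown in the trailing comment assumes that the dags/ folder is on the Python path, which is how Airflow normally loads DAG files.

```python
# Hypothetical module path: dags/common/dag_defaults.py
import datetime


def build_default_args(owner, retries=2, retry_delay_minutes=5):
    """Return a default_args dictionary that several DAGs can share."""
    return {
        "owner": owner,
        # A fixed start date, because the value is evaluated at every parse.
        "start_date": datetime.datetime(2024, 1, 1),
        "retries": retries,
        "retry_delay": datetime.timedelta(minutes=retry_delay_minutes),
    }


# In a DAG file located in the dags/ folder, the helper could be used like this:
#
#   from common.dag_defaults import build_default_args
#
#   with models.DAG(
#       "my_dag",
#       default_args=build_default_args("data-team"),
#   ) as dag:
#       ...
```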

How do I minimize the risk of different definitions arising?

For example, you have two teams that want to aggregate raw data into revenue metrics. The teams write two slightly different tasks that accomplish the same thing. Define libraries to work with the revenue data so that the DAG implementers must clarify the definition of revenue that's being aggregated.

How do I set dependencies between DAGs?

This depends on how you want to define the dependency.

If you have two DAGs (DAG A and DAG B) and you want DAG B to trigger after DAG A, you can put a TriggerDagRunOperator at the end of DAG A.

If DAG B depends only on an artifact that DAG A generates, such as a Pub/Sub message, then a sensor might work better.

If DAG B is integrated closely with DAG A, you might be able to merge the two DAGs into one DAG.

How do I pass unique run IDs to a DAG and its tasks?

For example, you want to pass Dataproc cluster names and file paths.

You can generate a random unique ID by returning str(uuid.uuid4()) in a PythonOperator. This places the ID into XComs so that you can refer to the ID in other operators via templated fields.

Before generating a uuid, consider whether a DagRun-specific ID would be more valuable. You can also reference these IDs in Jinja substitutions by using macros.

How do I separate tasks in a DAG?

Each task should be an idempotent unit of work. Consequently, you should avoid encapsulating a multi-step workflow within a single task, such as a complex program running in a PythonOperator.

Should I define multiple tasks in a single DAG to aggregate data from multiple sources?

For example, you have multiple tables with raw data and want to create daily aggregates for each table. The tasks are not dependent on each other. Should you create one task and DAG for each table or create one general DAG?

If you are okay with each task sharing the same DAG-level properties, such as schedule_interval, then it makes sense to define multiple tasks in a single DAG. Otherwise, to minimize code repetition, multiple DAGs can be generated from a single Python module by placing them into the module's globals().

How do I limit the number of concurrent tasks running in a DAG?

For example, you want to avoid exceeding API usage limits/quotas or avoid running too many simultaneous processes.

You can define Airflow pools in the Airflow web UI and associate tasks with existing pools in your DAGs.

FAQs for using operators

Should I use the DockerOperator?

We do not recommend using the DockerOperator, unless it's used to launch containers on a remote Docker installation (not within an environment's cluster). In a Cloud Composer environment the operator does not have access to Docker daemons.

Instead, use KubernetesPodOperator or GKEStartPodOperator. These operators launch Kubernetes pods into Kubernetes or GKE clusters respectively. Note that we don't recommend launching pods into an environment's cluster, because this can lead to resource competition.

Should I use the SubDagOperator?

We do not recommend using the SubDagOperator.

Use alternatives as suggested in Grouping tasks.

Should I run Python code only in PythonOperators to fully separate Python operators?

Depending on your goal, you have a few options.

If your only concern is maintaining separate Python dependencies, you can use the PythonVirtualenvOperator.

Alternatively, consider using the KubernetesPodOperator. This operator allows you to define Kubernetes pods and run the pods in other clusters.

How do I add custom binary or non-PyPI packages?

You can install packages hosted in private package repositories.

How do I uniformly pass arguments to a DAG and its tasks?

You can use Airflow's built-in support for Jinja templating to pass arguments that can be used in templated fields.

When does template substitution happen?

Template substitution occurs on Airflow workers just before the pre_execute function of an operator is called. In practice, this means that templates are not substituted until just before a task runs.

How do I know which operator arguments support template substitution?

Operator arguments that support Jinja2 template substitution are explicitly marked as such.

Look for the template_fields field in the Operator definition, which contains a list of argument names that undergo template substitution.

For example, see the BashOperator, which supports templating for the bash_command and env arguments.

Deprecated and removed Airflow operators

The Airflow operators listed in the following table are deprecated:

Avoid using these operators in your DAGs. Instead, use the provided up-to-date replacement operators.
If an operator is listed as removed, then it has already become unavailable in one of the released versions of Cloud Composer 2.
If an operator is listed as planned for removal, then it is deprecated and will be removed in a future version of Cloud Composer 2.
If an operator is listed as already removed in latest Google providers, then the operator is removed in the latest version of the apache-airflow-providers-google package. However, Cloud Composer still uses a version of this package in which the operator is not yet removed.

Deprecated operator	Status	Replacement operator	Replacement available from
CreateAutoMLTextTrainingJobOperator	Deprecated, Removal planned, Already removed in latest Google providers	SupervisedFineTuningTrainOperator	composer-2.9.5-airflow-2.9.3, composer-2.9.5-airflow-2.9.1
GKEDeploymentHook	Deprecated, Removal planned, Already removed in latest Google providers	GKEKubernetesHook	composer-2.7.1-airflow-2.7.3
GKECustomResourceHook	Deprecated, Removal planned, Already removed in latest Google providers	GKEKubernetesHook	composer-2.7.1-airflow-2.7.3
GKEPodHook	Deprecated, Removal planned, Already removed in latest Google providers	GKEKubernetesHook	composer-2.7.1-airflow-2.7.3
GKEJobHook	Deprecated, Removal planned, Already removed in latest Google providers	GKEKubernetesHook	composer-2.7.1-airflow-2.7.3
GKEPodAsyncHook	Deprecated, Removal planned, Already removed in latest Google providers	GKEKubernetesAsyncHook	composer-2.7.1-airflow-2.7.3
SecretsManagerHook	Deprecated, Removal planned, Already removed in latest Google providers	GoogleCloudSecretManagerHook	composer-2.8.3-airflow-2.7.3
BigQueryExecuteQueryOperator	Deprecated, Removal planned, Already removed in latest Google providers	BigQueryInsertJobOperator	All versions
BigQueryPatchDatasetOperator	Deprecated, Removal planned, Already removed in latest Google providers	BigQueryUpdateDatasetOperator	All versions
DataflowCreateJavaJobOperator	Deprecated, Removal planned, Already removed in latest Google providers	beam.BeamRunJavaPipelineOperator	All versions
DataflowCreatePythonJobOperator	Deprecated, Removal planned, Already removed in latest Google providers	beam.BeamRunPythonPipelineOperator	All versions
DataprocSubmitPigJobOperator	Deprecated, Removal planned, Already removed in latest Google providers	DataprocSubmitJobOperator	All versions
DataprocSubmitHiveJobOperator	Deprecated, Removal planned, Already removed in latest Google providers	DataprocSubmitJobOperator	All versions
DataprocSubmitSparkSqlJobOperator	Deprecated, Removal planned, Already removed in latest Google providers	DataprocSubmitJobOperator	All versions
DataprocSubmitSparkJobOperator	Deprecated, Removal planned, Already removed in latest Google providers	DataprocSubmitJobOperator	All versions
DataprocSubmitHadoopJobOperator	Deprecated, Removal planned, Already removed in latest Google providers	DataprocSubmitJobOperator	All versions
DataprocSubmitPySparkJobOperator	Deprecated, Removal planned, Already removed in latest Google providers	DataprocSubmitJobOperator	All versions
BigQueryTableExistenceAsyncSensor	Deprecated, Removal planned, Already removed in latest Google providers	BigQueryTableExistenceSensor	composer-2.3.0-airflow-2.5.1, composer-2.3.0-airflow-2.4.3
BigQueryTableExistencePartitionAsyncSensor	Deprecated, Removal planned, Already removed in latest Google providers	BigQueryTablePartitionExistenceSensor	composer-2.3.0-airflow-2.5.1, composer-2.3.0-airflow-2.4.3
CloudComposerEnvironmentSensor	Deprecated, Removal planned, Already removed in latest Google providers	CloudComposerCreateEnvironmentOperator, CloudComposerDeleteEnvironmentOperator, CloudComposerUpdateEnvironmentOperator	composer-2.3.0-airflow-2.5.1, composer-2.3.0-airflow-2.4.3
GCSObjectExistenceAsyncSensor	Deprecated, Removal planned, Already removed in latest Google providers	GCSObjectExistenceSensor	composer-2.3.0-airflow-2.5.1, composer-2.3.0-airflow-2.4.3
GoogleAnalyticsHook	Deprecated, Removal planned, Already removed in latest Google providers	GoogleAnalyticsAdminHook	composer-2.3.0-airflow-2.5.1, composer-2.3.0-airflow-2.4.3
GoogleAnalyticsListAccountsOperator	Deprecated, Removal planned, Already removed in latest Google providers	GoogleAnalyticsAdminListAccountsOperator	composer-2.3.0-airflow-2.5.1, composer-2.3.0-airflow-2.4.3
GoogleAnalyticsGetAdsLinkOperator	Deprecated, Removal planned, Already removed in latest Google providers	GoogleAnalyticsAdminGetGoogleAdsLinkOperator	composer-2.3.0-airflow-2.5.1, composer-2.3.0-airflow-2.4.3
GoogleAnalyticsRetrieveAdsLinksListOperator	Deprecated, Removal planned, Already removed in latest Google providers	GoogleAnalyticsAdminListGoogleAdsLinksOperator	composer-2.3.0-airflow-2.5.1, composer-2.3.0-airflow-2.4.3
GoogleAnalyticsDataImportUploadOperator	Deprecated, Removal planned, Already removed in latest Google providers	GoogleAnalyticsAdminCreateDataStreamOperator	composer-2.3.0-airflow-2.5.1, composer-2.3.0-airflow-2.4.3
GoogleAnalyticsDeletePreviousDataUploadsOperator	Deprecated, Removal planned, Already removed in latest Google providers	GoogleAnalyticsAdminDeleteDataStreamOperator	composer-2.3.0-airflow-2.5.1, composer-2.3.0-airflow-2.4.3
DataPipelineHook	Deprecated, Removal planned, Already removed in latest Google providers	DataflowHook	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
CreateDataPipelineOperator	Deprecated, Removal planned, Already removed in latest Google providers	DataflowCreatePipelineOperator	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
RunDataPipelineOperator	Deprecated, Removal planned, Already removed in latest Google providers	DataflowRunPipelineOperator	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
AutoMLDatasetLink	Deprecated, Removal planned	TranslationLegacyDatasetLink	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
AutoMLDatasetListLink	Deprecated, Removal planned	TranslationDatasetListLink	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
AutoMLModelLink	Deprecated, Removal planned	TranslationLegacyModelLink	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
AutoMLModelTrainLink	Deprecated, Removal planned	TranslationLegacyModelTrainLink	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
AutoMLModelPredictLink	Deprecated, Removal planned	TranslationLegacyModelPredictLink	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
AutoMLBatchPredictOperator	Deprecated, Removal planned	vertex_ai.batch_prediction_job	composer-2.9.8-airflow-2.9.3
AutoMLPredictOperator	Deprecated, Removal planned	vertex_ai.generative_model.TextGenerationModelPredictOperator, translate.TranslateTextOperator	composer-2.8.3-airflow-2.7.3
PromptLanguageModelOperator	Deprecated, Removal planned	TextGenerationModelPredictOperator	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
GenerateTextEmbeddingsOperator	Deprecated, Removal planned	TextEmbeddingModelGetEmbeddingsOperator	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
PromptMultimodalModelOperator	Deprecated, Removal planned	GenerativeModelGenerateContentOperator	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
PromptMultimodalModelWithMediaOperator	Deprecated, Removal planned	GenerativeModelGenerateContentOperator	composer-2.8.6-airflow-2.9.1, composer-2.8.6-airflow-2.7.3
DataflowStartSqlJobOperator	Deprecated, Removal planned	DataflowStartYamlJobOperator	composer-2.9.5-airflow-2.9.3, composer-2.9.5-airflow-2.9.1
LifeSciencesHook	Deprecated, Removal planned	Google Cloud Batch Operators' hook	To be announced
DataprocScaleClusterOperator	Deprecated, Removal planned	DataprocUpdateClusterOperator	To be announced
MLEngineStartBatchPredictionJobOperator	Deprecated, Removal planned	CreateBatchPredictionJobOperator	To be announced
MLEngineManageModelOperator	Deprecated, Removal planned	MLEngineCreateModelOperator, MLEngineGetModelOperator	To be announced
MLEngineGetModelOperator	Deprecated, Removal planned	GetModelOperator	To be announced
MLEngineDeleteModelOperator	Deprecated, Removal planned	DeleteModelOperator	To be announced
MLEngineManageVersionOperator	Deprecated, Removal planned	MLEngineCreateVersion, MLEngineSetDefaultVersion, MLEngineListVersions, MLEngineDeleteVersion	To be announced
MLEngineCreateVersionOperator	Deprecated, Removal planned	parent_model parameter for VertexAI operators	To be announced
MLEngineSetDefaultVersionOperator	Deprecated, Removal planned	SetDefaultVersionOnModelOperator	To be announced
MLEngineListVersionsOperator	Deprecated, Removal planned	ListModelVersionsOperator	To be announced
MLEngineDeleteVersionOperator	Deprecated, Removal planned	DeleteModelVersionOperator	To be announced
MLEngineStartTrainingJobOperator	Deprecated, Removal planned	CreateCustomPythonPackageTrainingJobOperator	To be announced
MLEngineTrainingCancelJobOperator	Deprecated, Removal planned	CancelCustomTrainingJobOperator	To be announced
LifeSciencesRunPipelineOperator	Deprecated, Removal planned	Google Cloud Batch Operators	To be announced
MLEngineCreateModelOperator	Deprecated, Removal planned	corresponding VertexAI operator	To be announced