This page describes what data Cloud Composer stores for your environment in Cloud Storage.
When you create an environment, Cloud Composer creates a Cloud Storage bucket and associates the bucket with your environment. The name of the bucket is based on the environment region, name, and a random ID, such as `us-central1-b1-6efabcde-bucket`.
Cloud Composer synchronizes specific folders in your environment's bucket to Airflow components that run in your environment. For example, when you update a file with the code of your Airflow DAG in the environment's bucket, Airflow components also receive the updated version. Cloud Composer uses Cloud Storage FUSE for synchronization.
Folders in the Cloud Storage bucket
Folder | Storage path | Mapped directory | Description |
---|---|---|---|
DAGs | `gs://bucket-name/dags` | `/home/airflow/gcs/dags` | Stores the DAGs for your environment. |
Plugins | `gs://bucket-name/plugins` | `/home/airflow/gcs/plugins` | Stores your custom plugins, such as custom in-house Airflow operators, hooks, sensors, or interfaces. |
Data | `gs://bucket-name/data` | `/home/airflow/gcs/data` | Stores the data that tasks produce and use. |
Logs | `gs://bucket-name/logs` | `/home/airflow/gcs/logs` | Stores the Airflow task logs. Logs are also available in the Airflow web interface and in the Logs tab in the Google Cloud console. |
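To place files in these folders programmatically, you can write to the corresponding storage path with any Cloud Storage client. The following is a minimal sketch using the Python google-cloud-storage library; the bucket name and file names are illustrative:

```python
from google.cloud import storage

def upload_dag(bucket_name: str, local_path: str, dag_filename: str) -> None:
    """Uploads a local DAG file into the environment's dags/ folder."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # Objects under the dags/ prefix are synchronized to Airflow components.
    blob = bucket.blob(f"dags/{dag_filename}")
    blob.upload_from_filename(local_path)

# Example usage with a hypothetical bucket and DAG file:
upload_dag("us-central1-b1-6efabcde-bucket", "my_dag.py", "my_dag.py")
```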
Cloud Composer synchronizes the `dags/` and `plugins/` folders unidirectionally. Unidirectional syncing means that local changes to these folders on an Airflow component are overwritten. The `data/` and `logs/` folders synchronize bidirectionally.
Data synchronization is eventually consistent. To send messages from one operator to another, use XComs.
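As a sketch of how XComs pass messages between operators, here's a minimal Airflow 2 TaskFlow example (it assumes Airflow 2.4 or later for the `schedule` parameter; the DAG and task names are illustrative):

```python
import pendulum
from airflow.decorators import dag, task

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def xcom_example():
    @task
    def produce() -> str:
        # The return value is automatically pushed to XCom.
        return "gs://bucket-name/data/results.csv"

    @task
    def consume(path: str) -> None:
        # The value is pulled from XCom and passed in as an argument.
        print(f"Processing {path}")

    consume(produce())

xcom_example()
```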
Capacity considerations
Data from the `dags/`, `plugins/`, and `data/` folders is synchronized to Airflow schedulers and workers.
In Airflow 2, the content of the `plugins/` folder is also synchronized to the Airflow web server. In Airflow 1, the content of the `dags/` and `plugins/` folders is synchronized to the Airflow web server only if DAG serialization is turned off. Otherwise, the synchronization is not performed.
The more data you put into these folders, the more space is occupied in the local storage of Airflow components. Saving too much data in `dags/` and `plugins/` can disrupt your operations and lead to issues such as:

- A worker or a scheduler runs out of local storage and is evicted because of insufficient space on the local disk of the component.
- Synchronizing files from the `dags/` and `plugins/` folders to workers and schedulers takes a long time.
- Synchronizing files from the `dags/` and `plugins/` folders to workers and schedulers becomes impossible. For example, you store a 2 GB file in the `dags/` folder, but the local disk of an Airflow worker can only accommodate 1 GB. During the synchronization, the worker runs out of local storage and the synchronization can't be completed.
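To keep an eye on how much data these folders hold, you can sum object sizes by prefix. A minimal sketch, assuming the Python google-cloud-storage library and an illustrative bucket name:

```python
from google.cloud import storage

def folder_size_bytes(bucket_name: str, prefix: str) -> int:
    """Sums the sizes of all objects under a prefix, for example 'dags/'."""
    client = storage.Client()
    return sum(blob.size for blob in client.list_blobs(bucket_name, prefix=prefix))

# Report how much data each synchronized folder currently holds.
for prefix in ("dags/", "plugins/"):
    print(prefix, folder_size_bytes("us-central1-b1-6efabcde-bucket", prefix))
```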
DAGs and plugins folders
To avoid DAG run failures, store your DAGs, plugins, and Python modules in the `dags/` or `plugins/` folders, even if your Python modules don't contain DAGs or plugins.
For example, you might use a `DataFlowPythonOperator` that references a `py_file` Dataflow pipeline. That `py_file` doesn't contain DAGs or plugins, but you must still store it in the `dags/` or `plugins/` folder.
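To make that concrete, here's a minimal sketch of a DAG that references a pipeline file stored in the `dags/` folder. The import path assumes the Airflow 1 contrib package (in Airflow 2, use the equivalent Dataflow operator from the Google provider package), and `pipeline.py` is a hypothetical file name:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG(
    dag_id="dataflow_pipeline_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    run_pipeline = DataFlowPythonOperator(
        task_id="run_pipeline",
        # pipeline.py contains no DAGs or plugins, but it must still live in
        # the dags/ folder so it's synchronized to the workers that need it.
        py_file="/home/airflow/gcs/dags/pipeline.py",
    )
```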
Data folder
There are scenarios when certain files from the `data/` folder are synchronized to a specific Airflow component. This happens when Cloud Composer attempts to read a given file for the first time, for example during:

- DAG parsing: When a file is read for the first time during DAG parsing, Cloud Composer synchronizes it to the scheduler that parses the DAG.
- DAG execution: When a file is read for the first time during DAG execution, Cloud Composer synchronizes it to the worker running the execution.
Airflow components have limited local storage, so consider deleting downloaded files to free disk space in your components. Notice that local storage usage can also temporarily go up if you have concurrent tasks that download the same file to a single Airflow worker.
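As an illustration, a task can read a file through the mapped `data/` directory and remove it once it's no longer needed. A minimal sketch; the file name is illustrative, and note that because `data/` synchronizes bidirectionally, deleting the local copy also removes the object from the bucket:

```python
import os

# Hypothetical input file in the mapped data/ directory.
DATA_FILE = "/home/airflow/gcs/data/input.csv"

def process_and_clean_up() -> None:
    # The first read triggers synchronization of the file to this worker.
    with open(DATA_FILE) as f:
        contents = f.read()
    # ... process contents ...
    # Remove the file so it no longer occupies the worker's local disk.
    os.remove(DATA_FILE)
```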
Logs folder
The `logs/` folder is synchronized from Airflow workers to the environment's bucket using the Cloud Storage API.
Cloud Storage API quota is calculated based on the amount of data moved, so the number of Airflow tasks your system runs can increase your Cloud Storage API usage: the more tasks you run, the larger your log files.
Synchronization with the web server
Airflow 2 uses DAG serialization out of the box. The `plugins/` folder is automatically synchronized to the web server so that plugins can be loaded by the Airflow UI. You can't turn off DAG serialization in Airflow 2.
In Airflow 1, DAG serialization is supported and is turned on by default in Cloud Composer.
- When DAG serialization is turned on, the files from the `dags/` and `plugins/` folders aren't synchronized to the web server.
- When DAG serialization is turned off, the files from the `dags/` and `plugins/` folders are synchronized to the web server.