Gathering Cloud Composer settings
This page describes how to gather Cloud Composer settings to automate data processing within the Google Cloud Cortex Framework. If Cloud Composer is available, you need to create connections within Cloud Composer that point to the source project where your data resides. These connections act as bridges for Cloud Composer to access and interact with the data in the source project. For more information, see Creating new Airflow connections.
Create connections with the following names for DAG execution, based on the workload to deploy. Note that the SFDC Raw Ingestion module uses the same Airflow connection as the SFDC CDC module. For details about workloads, see Data sources and workloads. If you are creating tables in the Reporting layer, make sure to create separate connections for Reporting DAGs.
| Deploying workload | Create for Raw | Create for CDC | Create for Reporting |
|---|---|---|---|
| SAP | N/A | `sap_cdc_bq` | `sap_reporting_bq` |
| SFDC | `sfdc_cdc_bq` | `sfdc_cdc_bq` | `sfdc_reporting_bq` |
| Google Ads | `googleads_raw_dataflow` | `googleads_cdc_bq` | `googleads_reporting_bq` |
| CM360 | `cm360_raw_dataflow` | `cm360_cdc_bq` | `cm360_reporting_bq` |
| TikTok | `tiktok_raw_dataflow` | `tiktok_cdc_bq` | `tiktok_reporting_bq` |
| LiveRamp | N/A | `liveramp_cdc_bq` | N/A |
Connection Naming Conventions
Consider the following specifications for connection naming conventions:
- Connection suffixes: The connection names include suffixes that indicate their intended purpose:
  - `_bq`: Used for accessing BigQuery data.
  - `_dataflow`: Used for running Dataflow jobs.
- Raw data connections: You only need to create connections for Raw data if you are using the data ingestion modules provided by Cortex.
- Multiple data sources: If you are deploying multiple data sources (for example, both SAP and Salesforce), it's recommended to create separate connections for each, assuming security limitations are applied to individual service accounts. Alternatively, you can modify the connection name in the template before deployment to use the same connection for writing to BigQuery.
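The suffix convention above can be expressed as a small helper. This is an illustrative sketch only; the function name and layer labels are assumptions for this example and are not part of Cortex.

```python
# Illustrative sketch of the Cortex connection naming convention.
# The helper name and its structure are assumptions for this example.

def cortex_connection_name(workload_prefix: str, layer: str) -> str:
    """Build an Airflow connection name for a Cortex data layer.

    Raw ingestion connections use the _dataflow suffix (Dataflow jobs);
    CDC and Reporting connections use the _bq suffix (BigQuery access).
    """
    if layer == "raw":
        return f"{workload_prefix}_raw_dataflow"
    if layer in ("cdc", "reporting"):
        return f"{workload_prefix}_{layer}_bq"
    raise ValueError(f"unknown layer: {layer}")

print(cortex_connection_name("googleads", "raw"))     # googleads_raw_dataflow
print(cortex_connection_name("sap", "cdc"))           # sap_cdc_bq
print(cortex_connection_name("tiktok", "reporting"))  # tiktok_reporting_bq
```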
Security Best Practices
- Avoid Default Connections: It's not recommended to use the default connections and service accounts offered by Airflow, especially in production environments. This aligns with the principle of least privilege, which emphasizes granting only the minimum access permissions necessary.
- Secret Manager Integration: If you have Secret Manager enabled for Airflow, you can create these connections within Secret Manager using the same names. Connections stored in Secret Manager take precedence over those defined directly in Airflow.
The Cloud Storage bucket structure for some of the template DAGs expects the folders to be in /data/bq_data_replication, as in the following example. You can modify this path prior to deployment. If you don't have a Cloud Composer environment available yet, you can create one afterwards and move the files into the DAG bucket.
```python
# Template excerpt; the ${...} placeholders are substituted at deployment
# time, and default_dag_args is defined elsewhere in the generated file.
import airflow
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

with airflow.DAG("CDC_BigQuery_${base table}",
                 template_searchpath=['/home/airflow/gcs/data/bq_data_replication/'],  # example
                 default_args=default_dag_args,
                 schedule_interval="${load_frequency}") as dag:
    start_task = DummyOperator(task_id="start")
    copy_records = BigQueryOperator(
        task_id='merge_query_records',
        sql="${query_file}",
        create_disposition='CREATE_IF_NEEDED',
        bigquery_conn_id="sap_cdc_bq",  # example
        use_legacy_sql=False)
    stop_task = DummyOperator(task_id="stop")
    start_task >> copy_records >> stop_task
```
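At deployment time, each `${...}` placeholder in the template is replaced with a concrete value. The following is a minimal sketch of that substitution using plain string replacement; the substitution values are examples only, and the real Cortex generator may work differently.

```python
# Minimal sketch of rendering a DAG template by string replacement.
# The substitution values below are illustrative, not from this guide.

dag_template = (
    'with airflow.DAG("CDC_BigQuery_${base table}",\n'
    '                 schedule_interval="${load_frequency}") as dag:'
)

substitutions = {
    "${base table}": "sap_mara",   # example table name
    "${load_frequency}": "@daily",  # example schedule
}

rendered = dag_template
for placeholder, value in substitutions.items():
    rendered = rendered.replace(placeholder, value)

print(rendered)
```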
The scripts that process data in Airflow or Cloud Composer are purposefully generated separately from the Airflow-specific scripts. This lets you port those scripts to another tool of your choice.