After you deploy a replication job, you can't edit it or add tables to it. Instead, add the tables to a new or duplicate replication job.
Option 1: Create a new replication job
Adding tables to a new job is the simplest approach. It avoids a historical reload of all the existing tables and prevents data inconsistency issues.
The drawbacks are the increased overhead of managing multiple replication jobs and higher compute consumption, because each job runs on a separate ephemeral Dataproc cluster by default. You can mitigate the compute cost to some extent by running both jobs on a shared static Dataproc cluster.
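As a minimal sketch of that mitigation, the following Python snippet creates a static Dataproc cluster with the `google-cloud-dataproc` client library. The project ID, region, cluster name, and machine types are placeholder assumptions; after the cluster exists, you point both replication jobs at it through a Cloud Data Fusion compute profile, as described in the linked page.

```python
from google.cloud import dataproc_v1

# Placeholder values: replace with your own project and region.
project_id = "my-project"
region = "us-central1"

# The Dataproc client must target the regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Illustrative cluster shape; size it for the combined load of both jobs.
cluster = {
    "project_id": project_id,
    "cluster_name": "replication-static-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# create_cluster returns a long-running operation; result() blocks until done.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()
print(f"Cluster created: {result.cluster_name}")
```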
For more information about creating new jobs, see the Replication tutorials.
For more information about using a static Dataproc cluster in Cloud Data Fusion, see Run a pipeline against an existing Dataproc cluster.
Option 2: Stop the current replication job and create a duplicate
If you duplicate the replication job to add the tables, consider the following:
Enabling the snapshot for the duplicate job triggers a historical load of all the tables from scratch. This approach is recommended if you can't use the previous option of running separate jobs.
Disabling the snapshot to prevent the historical load can result in data loss, because events can be missed between when the old pipeline stops and the new one starts. Creating an overlap between the pipelines to mitigate this issue isn't recommended, as it can also result in data loss: historical data for the new tables isn't replicated.
To create a duplicate replication job, follow these steps:
Stop the existing replication job. To script this step instead of using the UI, see the sketch after these steps.
From the Replication jobs page, locate the job that you want to duplicate, and then click More > Duplicate.
Enable the snapshot:
- Go to Configure source.
- In the Replicate existing data field, select Yes.
Add tables in the Select tables and transformations window, and then follow the wizard to deploy the replication job.
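If you want to script stopping the existing job (step 1), the following Python sketch sends a stop request to the Cloud Data Fusion instance's CDAP REST API. The endpoint URL format, namespace, job name, and the `DeltaWorker` program name are assumptions to adapt to your instance; the UI steps above remain the supported path.

```python
import google.auth
import google.auth.transport.requests
import requests

# Hypothetical values: replace with your instance's API endpoint,
# namespace, and the name of the deployed replication job.
CDAP_ENDPOINT = "https://INSTANCE-PROJECT-dot-REGION.datafusion.googleusercontent.com/api"
NAMESPACE = "default"
JOB_NAME = "my-replication-job"

# Obtain an OAuth 2.0 access token from Application Default Credentials.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

# This sketch assumes the replication job runs as a CDAP worker program
# named "DeltaWorker"; verify the program name for your deployment.
url = f"{CDAP_ENDPOINT}/v3/namespaces/{NAMESPACE}/apps/{JOB_NAME}/workers/DeltaWorker/stop"
response = requests.post(
    url, headers={"Authorization": f"Bearer {credentials.token}"}
)
response.raise_for_status()
print("Stop request accepted:", response.status_code)
```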
What's next
- Learn more about Replication.