This page shows you how to resolve issues with Cloud Data Fusion.
Troubleshoot batch pipelines
The following advice is for batch pipelines.
Pipeline error: Text file busy
The following error occurs when you run a batch pipeline, causing it to fail:
error=26, Text file busy
Recommendation
To resolve this issue, set up a trigger that automatically retries a pipeline when it fails.
- Stop the pipeline.
- Create a trigger. In this case, when you select an event to execute, choose Fails. For more information, see Create an inbound trigger on a downstream pipeline.
- Start the pipeline.
Concurrent pipeline is stuck
In Cloud Data Fusion, running many concurrent batch pipelines can put a strain on the instance, causing jobs to get stuck in the Starting, Provisioning, or Running states. As a result, pipelines cannot be stopped through the web interface or API calls. When you run many pipelines concurrently, the web interface can become slow or unresponsive. This issue occurs because of multiple UI requests made to the HTTP handler in the backend.
Recommendation
To resolve this issue, control the number of new requests using Cloud Data Fusion flow control, which is available in instances running in version 6.6 and later.
SSH connection times out while running a pipeline
The following error occurs when you run a batch pipeline:
`java.io.IOException: com.jcraft.jsch.JSchException:
java.net.ConnectException: Connection timed out (Connection timed out)`
Recommendation
To resolve the error, check for the following issues:
- Check for a missing firewall rule (typically port 22). To create a new firewall rule, see Dataproc cluster network configuration, or use the sample command after this list.
- Check that the Compute Engine enforcer allows the connection between your Cloud Data Fusion instance and the Dataproc cluster.
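For example, a rule that allows SSH (TCP port 22) within the network used by the Dataproc cluster might look like the following sketch. The rule name, network, and source range are placeholders; adjust them to your environment:
gcloud compute firewall-rules create allow-ssh-from-data-fusion \
    --project=PROJECT_ID \
    --network=NETWORK_NAME \
    --direction=INGRESS \
    --allow=tcp:22 \
    --source-ranges=SOURCE_IP_RANGE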
Response code: 401. Error: unknown error
The following error occurs when you run a batch pipeline:
`java.io.IOException: Failed to send message for program run program_run:
Response code: 401. Error: unknown error`
Recommendation
To resolve this error, you must grant the Cloud Data Fusion Runner role (roles/datafusion.runner) to the service account used by Dataproc.
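For example, assuming the Dataproc cluster runs as the service account SERVICE_ACCOUNT_EMAIL (a placeholder for the actual service account email), you can grant the role with gcloud:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/datafusion.runner"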
Pipeline with BigQuery plugin fails with Access Denied error
There is a known issue where a pipeline fails with an Access Denied error when running BigQuery jobs. This impacts pipelines that use the following plugins:
- BigQuery sources
- BigQuery sinks
- BigQuery Multi Table sinks
- Transformation Pushdown
Example error in the logs (might differ depending on the plugin you are using):
POST https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/jobs
{
"code" : 403,
"errors" : [ {
"domain" : "global",
"message" : "Access Denied: Project xxxx: User does not have bigquery.jobs.create permission in project PROJECT_ID",
"reason" : "accessDenied"
} ],
"message" : "Access Denied: Project PROJECT_ID: User does not have bigquery.jobs.create permission in project PROJECT_ID.",
"status" : "PERMISSION_DENIED"
}
In this example, PROJECT_ID is the project ID that you specified in the plugin. The service account for the project specified in the plugin does not have permission to do at least one of the following:
- Run a BigQuery job
- Read a BigQuery dataset
- Create a temporary bucket
- Create a BigQuery dataset
- Create the BigQuery table
Recommendation
To resolve this issue, grant the missing roles to the service account for the project (PROJECT_ID) that you specified in the plugin. A sample gcloud command follows this list.
- To run a BigQuery job, grant the BigQuery Job User role (roles/bigquery.jobUser).
- To read a BigQuery dataset, grant the BigQuery Data Viewer role (roles/bigquery.dataViewer).
- To create a temporary bucket, grant the Storage Admin role (roles/storage.admin).
- To create a BigQuery dataset or table, grant the BigQuery Data Editor role (roles/bigquery.dataEditor).
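For example, assuming the pipeline runs as the service account SERVICE_ACCOUNT_EMAIL (a placeholder for the actual service account email), you can grant one of the missing roles with gcloud; repeat the command for each role that is needed:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/bigquery.jobUser"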
For more information, see the plugin's troubleshooting documentation (Google BigQuery Multi Table Sink Troubleshooting).
Pipeline doesn't stop at the error threshold
A pipeline might not stop after multiple errors, even if you set the error threshold to 1.
The error threshold is intended for any exceptions raised from the directive in the event of a failure that is not otherwise handled. If the directive already uses the emitError API, then the error threshold is not activated.
Recommendation
To design a pipeline that fails when a certain threshold is met, use the FAIL directive.
Whenever the condition passed to the FAIL directive is satisfied, it counts against the error threshold and the pipeline fails after the threshold is reached.
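For illustration, in a Wrangler recipe the FAIL directive is written as fail followed by a condition, and a record counts as a failure when the condition evaluates to true. The column name and condition below are placeholders only, a minimal sketch rather than part of any existing pipeline:
fail price < 0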
Oracle batch source plugin converts NUMBER to string
In Oracle batch source versions 1.9.0, 1.8.3, and earlier, the Oracle NUMBER data type, with undefined precision and scale, is mapped to the CDAP decimal(38,0) data type.
Plugin versions 1.9.1, 1.8.4, and 1.8.5 are backward incompatible, and pipelines that use earlier versions might not work after upgrading to versions 1.9.1, 1.8.5, and 1.8.4, if a downstream stage in the pipeline relies on the output schema of the source, because the output schema has changed. When there's an output schema defined for the Oracle NUMBER data type without precision and scale in the previous plugin version, after upgrading to versions 1.9.1, 1.8.5, or 1.8.4, the Oracle batch source plugin throws the following schema mismatch error for the types: Schema field '<field name>' is expected to have type 'decimal with precision <precision> and scale <scale> but found 'string'. Change the data type of field <field name> to string.
Versions 1.9.1, 1.8.5, and 1.8.4 will work with an output schema of the CDAP string data type for the Oracle NUMBER data type defined without precision and scale. If there's any Oracle NUMBER data type defined without precision and scale present in the Oracle source output schema, using the older version of the Oracle plugin isn't recommended, as it can lead to rounding errors.
The special case is when you use a macro for the database name, schema name, or table name, and if you haven't manually specified an output schema. The schema gets detected and mapped at runtime. The older version of the Oracle batch source plugin maps the Oracle NUMBER data type defined without precision and scale to the CDAP decimal(38,0) data type, while versions 1.9.1, 1.8.5, and 1.8.4 and later map the data types to string at runtime.
Recommendation
To resolve the possible precision loss issue while working with Oracle NUMBER data types with undefined precision and scale, upgrade your pipelines to use Oracle batch source plugin versions 1.9.1, 1.8.5, or 1.8.4.
After the upgrade, the Oracle NUMBER data type defined without precision and scale is mapped to the CDAP string data type at runtime. If you have a downstream stage or sink that consumes the original CDAP decimal data type (to which the Oracle NUMBER data type defined without precision and scale was mapped), either update it or expect it to consume string data.
If you understand the risk of possible data loss due to rounding errors, but choose to use the Oracle NUMBER data type defined without precision and scale as the CDAP decimal(38,0) data type, then deploy Oracle plugin version 1.8.6 (for Cloud Data Fusion 6.7.3) or 1.9.2 (for Cloud Data Fusion 6.8.1) from the Hub, and update the pipelines to use them instead.
For more information, see the Oracle Batch Source reference.
Delete an ephemeral Dataproc cluster
When Cloud Data Fusion creates an ephemeral Dataproc cluster during pipeline run provisioning, the cluster gets deleted after the pipeline run is finished. In rare cases, the cluster deletion fails.
Strongly recommended: Upgrade to the most recent Cloud Data Fusion version to ensure proper cluster maintenance.
Set Max Idle Time
To resolve this issue, configure the Max Idle Time option. This lets Dataproc delete clusters automatically, even if an explicit call on the pipeline finish fails.
Max Idle Time is available in Cloud Data Fusion versions 6.4 and later.
Recommended: For versions before 6.6, set Max Idle Time manually to 30 minutes or greater.
Delete clusters manually
If you cannot upgrade your version or configure the Max Idle Time option, delete stale clusters manually instead:
Get each project ID where the clusters were created:
- In the pipeline's runtime arguments, check if the Dataproc project ID is customized for the run.
- If a Dataproc project ID is not specified explicitly, determine which provisioner is used, and then check for a project ID:
  - In the pipeline runtime arguments, check the system.profile.name value.
  - Open the provisioner settings and check if the Dataproc project ID is set. If the setting is not present or the field is empty, the project that the Cloud Data Fusion instance is running in is used.
For each project:
- Open the project in the Google Cloud console and go to the Dataproc Clusters page.
- Sort the clusters by the date that they were created, from oldest to newest.
- If the info panel is hidden, click Show info panel and go to the Labels tab.
- For every cluster that is not in use (for example, more than a day has elapsed), check if it has a Cloud Data Fusion version label. That is an indication that it was created by Cloud Data Fusion.
- Select the checkbox by the cluster name and click Delete.
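If you prefer the command line, you can do the same cleanup with gcloud. REGION and CLUSTER_NAME are placeholders; confirm from the labels that a cluster was created by Cloud Data Fusion before you delete it:
# List Dataproc clusters so you can review their labels and creation times.
gcloud dataproc clusters list --project=PROJECT_ID --region=REGION
# Delete a stale cluster after confirming it has a Cloud Data Fusion version label.
gcloud dataproc clusters delete CLUSTER_NAME --project=PROJECT_ID --region=REGION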
Unable to create Cloud Data Fusion instance
While creating a Cloud Data Fusion instance, you might encounter the following error:
Read access to project PROJECT_NAME was denied.
Recommendation
To resolve this issue, disable and re-enable the Cloud Data Fusion API. Then, create the instance.
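For example, you can disable and then re-enable the API with gcloud (note that disabling the API briefly interrupts any workloads in the project that use it):
gcloud services disable datafusion.googleapis.com --project=PROJECT_ID
gcloud services enable datafusion.googleapis.com --project=PROJECT_ID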
Pipelines fail when run on Dataproc clusters with primary or secondary workers
In Cloud Data Fusion versions 6.8 and 6.9, an issue causes pipelines to fail when they run on Dataproc clusters with primary or secondary workers:
ERROR [provisioning-task-2:i.c.c.i.p.t.ProvisioningTask@161] - PROVISION task failed in REQUESTING_CREATE state for program run program_run:default.APP_NAME.UUID.workflow.DataPipelineWorkflow.RUN_ID due to
Caused by: io.grpc.StatusRuntimeException: CANCELLED: Failed to read message.
Caused by: com.google.protobuf.GeneratedMessageV3$Builder.parseUnknownField(Lcom/google/protobuf/CodedInputStream;Lcom/google/protobuf/ExtensionRegistryLite;I)Z.
Recommendation
To resolve the issue, upgrade to patch revision 6.8.3.1, 6.9.2.1, or later.