This page provides troubleshooting tips and debugging strategies that you might find helpful if you're using Dataflow Flex Templates. This information can help you detect a polling timeout, determine the reason behind the timeout, and correct the problem.
Troubleshoot polling timeouts
This section provides steps for identifying the cause of polling timeouts.
Polling timeouts
Your Flex Template job might return the following error message:
Timeout in polling result file: ${file_path}.
Service account: ${service_account_email}
Image URL: ${image_url}
Troubleshooting guide at https://cloud.google.com/dataflow/docs/guides/common-errors#timeout-polling
This error can occur for the following reasons:
- The base Docker image was overridden.
- The service account that fills in ${service_account_email} does not have some necessary permissions.
- External IP addresses are disabled, and VMs can't connect to the set of external IP addresses used by Google APIs and services.
- The program that creates the graph takes too long to finish.
- Pipeline options are being overwritten.
- (Python only) There is a problem with the requirements.txt file.
- There was a transient error.
To resolve this issue, first check the job logs for transient errors and retry the job. If those steps don't resolve the issue, try the following troubleshooting steps.
Verify Docker entrypoint
Try this step if you're running a template from a custom Docker image rather than using one of the provided templates.
Check for the container entrypoint using the following command:
docker inspect $TEMPLATE_IMAGE
The following output is expected:
Java
/opt/google/dataflow/java_template_launcher
Python
/opt/google/dataflow/python_template_launcher
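The full docker inspect output is long. As a quicker check, you can print only the entrypoint field; this sketch assumes $TEMPLATE_IMAGE is set and the image is available locally:

```shell
# Print only the configured entrypoint as a JSON array (sketch;
# assumes $TEMPLATE_IMAGE is set and the image was pulled or built locally).
docker inspect --format '{{json .Config.Entrypoint}}' "$TEMPLATE_IMAGE"
```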
If you get a different output, then the entrypoint of your Docker container is overridden. Restore $TEMPLATE_IMAGE to the default.
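For example, a custom Java template image built on the Google-provided base keeps the default launcher entrypoint as long as the Dockerfile doesn't declare its own ENTRYPOINT. The following is a hypothetical sketch; the jar path and main class are placeholders:

```dockerfile
FROM gcr.io/dataflow-templates-base/java17-template-launcher-base:latest

# Copy the bundled pipeline jar; the jar name and main class are placeholders.
COPY target/my-pipeline-bundled.jar /template/my-pipeline-bundled.jar
ENV FLEX_TEMPLATE_JAVA_CLASSPATH=/template/my-pipeline-bundled.jar
ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS=com.example.MyPipeline

# No ENTRYPOINT instruction: the base image's launcher entrypoint is preserved.
```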
Check service account permissions
Check that the service account mentioned in the message has the following permissions:
- It must be able to read and write the Cloud Storage path that fills in ${file_path} in the message.
- It must be able to read the Docker image that fills in ${image_url} in the message.
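As a sketch, the following gcloud commands grant those permissions when the staging path is a Cloud Storage bucket and the image is stored in Artifact Registry. The bucket, repository, region, and service account names are placeholders, and narrower roles might fit your setup:

```shell
# Grant read/write access to the staging bucket (placeholder names).
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectAdmin"

# Grant read access to images in the Artifact Registry repository.
gcloud artifacts repositories add-iam-policy-binding REPOSITORY_NAME \
    --location=REGION \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/artifactregistry.reader"
```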
Configure Private Google Access
If external IP addresses are disabled, you need to allow Compute Engine VMs to connect to the set of external IP addresses used by Google APIs and services. Enable Private Google Access on the subnet used by the network interface of the VM.
For configuration details, see Configuring Private Google Access.
By default, when a Compute Engine VM lacks an external IP address assigned to its network interface, it can only send packets to other internal IP address destinations.
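As a sketch, you can enable Private Google Access on a subnet with the gcloud CLI; the subnet name and region are placeholders:

```shell
# Enable Private Google Access on the subnet used by the worker VMs
# (SUBNET_NAME and REGION are placeholders).
gcloud compute networks subnets update SUBNET_NAME \
    --region=REGION \
    --enable-private-ip-google-access
```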
Check if the launcher program fails to exit
The program that constructs the pipeline must finish before the pipeline can be launched. The polling error could indicate that it took too long to do so.
Some things you can do to locate the cause in code are:
- Check job logs and see if any operation appears to take a long time to complete. An example would be a request for an external resource.
- Make sure no threads are blocking the program from exiting. Some clients might create their own threads, and if these clients are not shut down, the program waits forever for these threads to be joined.
Pipelines launched directly without a template don't have these limitations. Therefore, if the pipeline works when launched directly but not as a template, the template is likely the root cause. Finding and fixing the issue in the template might resolve the problem.
Verify whether required pipeline options are suppressed
When using Flex Templates, you can configure some but not all pipeline options during pipeline initialization. For more information, see the Failed to read the job file section in this document.
Remove Apache Beam from the requirements file (Python Only)
If your Dockerfile includes a requirements.txt file with apache-beam[gcp], remove it from the file and install Apache Beam separately. The following commands demonstrate how to complete this step:

RUN pip install apache-beam[gcp]
RUN pip install -U -r ./requirements.txt
Putting Apache Beam in the requirements file can cause long launch times, often resulting in a timeout.
Polling timeouts when using Python
If you're running a Dataflow job by using a Flex Template and Python, your job might queue for a period, fail to run, and then display the following error:
Timeout in polling
The requirements.txt
file that's used to install the required dependencies
causes the error. When you launch a Dataflow job, all of the
dependencies are staged first to make these files accessible to
the worker VMs. This process involves downloading and compiling
every direct and indirect dependency in the requirements.txt
file.
Some dependencies might take several minutes to compile. Notably, PyArrow might take a long time to compile. PyArrow is an indirect dependency that's used by Apache Beam and most Cloud Client Libraries.
To optimize your job's performance, use a Dockerfile or a custom container to prepackage the dependencies. For more information, see Package dependencies in "Configure Flex Templates."
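For example, a Python Flex Template Dockerfile can install the dependencies at image build time so that workers don't download and compile them at launch. The following is a sketch; it assumes the Google-provided Python launcher base image (the image tag is a placeholder) and a requirements.txt that doesn't list apache-beam:

```dockerfile
FROM gcr.io/dataflow-templates-base/python311-template-launcher-base:latest

COPY requirements.txt /template/requirements.txt

# Install Apache Beam separately, then the remaining dependencies, at
# image build time so they aren't compiled when the job launches.
RUN pip install --no-cache-dir apache-beam[gcp] \
    && pip install --no-cache-dir -U -r /template/requirements.txt
```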
Job launch failures
The following section contains common errors that lead to job launch failures and steps for resolving or troubleshooting the errors.
Early startup issues
When the template launching process fails in an early stage, regular Flex Template logs might not be available. To investigate startup issues, enable serial port logging for the templates launcher VM.
To enable logging for Java templates, set the enableLauncherVmSerialPortLogging option to true. To enable logging for Python and Go templates, set the enable_launcher_vm_serial_port_logging option to true. In the Google Cloud console, the parameter is listed in Optional parameters as Enable Launcher VM Serial Port Logging.
You can view the serial port output logs of the templates launcher VM in
Cloud Logging. To find the logs for a particular launcher VM, use the query
resource.type="gce_instance" "launcher-number"
where number starts with the current date in the format YYYYMMDD.
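For example, you can run the query with the gcloud CLI. The following is a sketch; the date in the VM name is a placeholder:

```shell
# List recent log entries for a launcher VM whose name starts with
# "launcher-" followed by the launch date (placeholder date shown).
gcloud logging read \
    'resource.type="gce_instance" "launcher-20240101"' \
    --limit=20
```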
Your organization policy might prohibit you from enabling serial port output logging.
Failed to read the job file
When you try to run a job from a Flex Template, your job might fail with the following error:
Failed to read the job file : gs://dataflow-staging-REGION-PROJECT_ID/staging/template_launches/TIMESTAMP/job_object with error message: ...: Unable to open template file
This error occurs when the necessary pipeline initialization options are overwritten. When using Flex Templates, you can configure some but not all pipeline options during pipeline initialization. If the command line arguments required by the Flex Template are overwritten, the job might ignore, override, or discard the pipeline options passed by the template launcher. The job might fail to launch, or a job that doesn't use the Flex Template might launch.
To avoid this issue, during pipeline initialization, don't change the following
pipeline options
in user code or in the metadata.json
file:
Java
runner
project
jobName
templateLocation
region
Python
runner
project
job_name
template_location
region
Go
runner
project
job_name
template_location
region
Failed to read the result file
When you try to run a job from a Flex Template, your job might fail with the following error:
Failed to read the result file : gs://BUCKET_NAME with error message: (ERROR_NUMBER): Unable to open template file: gs://BUCKET_NAME
This error occurs when the Compute Engine default service account doesn't have all the permissions that it needs to run a Flex Template. For the list of required permissions, see Permissions to run a Flex Template.
Permission denied on resource
When you try to run a job from a Flex Template, your job might fail with the following error:
Permission "MISSING_PERMISSION" denied on resource "projects/PROJECT_ID/locations/REGION/repositories/REPOSITORY_NAME" (or it may not exist).
This error occurs when the used service account does not have permissions to access necessary resources to run a Flex Template.
To avoid this issue, verify that the service account has the required permissions. Adjust the service account permissions as needed.
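For example, when the resource in the error is an Artifact Registry repository, you can inspect its IAM policy to confirm that the service account holds a role that grants read access (a sketch; the repository name and region are placeholders):

```shell
# Show the IAM bindings on the repository so you can check whether the
# service account has a role such as roles/artifactregistry.reader.
gcloud artifacts repositories get-iam-policy REPOSITORY_NAME \
    --location=REGION
```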
Flag provided but not defined
When you try to run a Go Flex Template with the worker_machine_type
pipeline
option, the pipeline fails with the following error:
flag provided but not defined: -machine_type
This error is caused by a known issue in the Apache Beam Go SDK versions 2.47.0 and earlier. To resolve this issue, upgrade to Apache Beam Go version 2.48.0 or later.
Unable to fetch remote job server jar
If you try to run a job from a Flex Template when you're not connected to the internet, your job might fail with the following error:
Unable to fetch remote job server jar at
https://repo.maven.apache.org/maven2/org/apache/beam/beam-sdks-java-io-expansion-service/VERSION/beam-sdks-java-io-expansion-service-VERSION.jar:
\u003curlopen error [Errno 101] Network is unreachable\u003e
This error occurs because the VM is unable to download the Apache Beam Java package from the internet. This package is required when you run a multi-language job by using a Flex Template.
To resolve this issue, make one of the following changes:
- Connect to the internet. When connected to the internet, your job can access the required file.
- Include the Apache Beam Java package in your local directory so that your job can access it locally. Put the file in the /root/.apache_beam/cache/jars/ directory. For example: /root/.apache_beam/cache/jars/beam-sdks-java-io-expansion-service-SDK_VERSION.jar.
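As a sketch, from a machine that does have internet access, you can download the jar from Maven Central and then bake it into your container image. The Beam version shown is a placeholder; use the version that matches your SDK:

```shell
# Download the expansion service jar (placeholder version shown;
# match your Apache Beam SDK version).
BEAM_VERSION=2.50.0
curl -fL -o "beam-sdks-java-io-expansion-service-${BEAM_VERSION}.jar" \
    "https://repo.maven.apache.org/maven2/org/apache/beam/beam-sdks-java-io-expansion-service/${BEAM_VERSION}/beam-sdks-java-io-expansion-service-${BEAM_VERSION}.jar"
```

In your Dockerfile, COPY the downloaded jar into /root/.apache_beam/cache/jars/ so the job finds it locally.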
Unable to get filesystem from specified path
When you try to run a job from a Flex Template, your job might fail with the following error:
ValueError: Unable to get filesystem from specified path, please use
the correct path or ensure the required dependency is installed, e.g., pip
install apache-beam[gcp]. Path specified: PATH
This error occurs when the job uses a Flex Template container image, and the container image doesn't contain a Java installation.
To resolve this issue, add the following line to your Dockerfile:
sh
RUN apt-get update && apt-get install -y openjdk-17-jdk
This command installs Java in your container environment.
Flex Template launcher delay
When you submit a Flex Template job, the job request goes into a Spanner queue. The template launcher picks up the job from the Spanner queue and then runs the template. When Spanner has a message backlog, a significant delay might occur between the time you submit the job and the time the job launches.
To work around this issue, launch your Flex Template from a different region.
The template parameters are invalid
When you try to use the gcloud CLI to run a job that uses a Google-provided template, the following error occurs:
ERROR: (gcloud.beta.dataflow.flex-template.run) INVALID_ARGUMENT: The template
parameters are invalid. Details: defaultSdkHarnessLogLevel: Unrecognized
parameter defaultWorkerLogLevel: Unrecognized parameter
This error occurs because some Google-provided templates don't support the defaultSdkHarnessLogLevel and defaultWorkerLogLevel options.
As a workaround, copy the template specification file to a Cloud Storage bucket. Add the following additional parameters to the file.
"metadata": {
...
"parameters": [
...,
{
"name": "defaultSdkHarnessLogLevel",
"isOptional": true,
"paramType": "TEXT"
},
{
"name": "defaultWorkerLogLevel",
"isOptional": true,
"paramType": "TEXT"
}
]
}
After you make this change to the template file, include the following flag in the command that runs the template:
--template-file-gcs-location=gs://BUCKET_NAME/FILENAME
Replace the following values:
- BUCKET_NAME: the name of your Cloud Storage bucket
- FILENAME: the name of your template specification file
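Putting it together, a hypothetical run command might look like the following; the job name, region, and parameter value are placeholders:

```shell
# Run the copied template specification (placeholder job name and region).
gcloud dataflow flex-template run "my-job-name" \
    --template-file-gcs-location=gs://BUCKET_NAME/FILENAME \
    --region=us-central1 \
    --parameters defaultSdkHarnessLogLevel=INFO
```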
Flex Template launcher logs show wrong severity
When a custom Flex Template launch fails, the following message appears in the log files with the severity ERROR:
ERROR: Error occurred in the launcher container: Template launch failed. See console logs.
The root cause of the launch failure usually appears in the logs before this message, with the severity INFO. Although this log level may be incorrect, it is expected, because the Flex Template launcher has no way to extract severity details from the log messages produced by the Apache Beam application.
If you want to see the correct severity for every message in the launcher log, configure your template to generate logs in the JSON format instead of in plain text. This configuration allows the template launcher to extract the correct log message severity. Use the following message structure:
{
"message": "The original log message",
"severity": "DEBUG/INFO/WARN/ERROR"
}
In Java, you can use Logback logger with a custom JSON appender implementation. For more information, see the Logback example configuration and the JSON appender example code in GitHub.
This issue only impacts the logs generated by the Flex Template launcher when the pipeline is launching. When the launch succeeds and the pipeline is running, the logs produced by Dataflow workers have the proper severity.
Google-provided templates
show the correct severity during job launch, because the Google-provided
templates use this JSON logging approach.