You can use this generic source whenever you need to read from a database. For example, you might use it to create daily snapshots of a generic database table and write the output to BigQuery.
Configure the plugin
- Go to the Cloud Data Fusion web interface and click Studio.
- Check that Data Pipeline - Batch is selected (not Realtime).
- In the Source menu, click Database.
- To configure the plugin, hold the pointer over the plugin node and click Properties.
Enter the following properties. For more information about each property, see Properties.
- Enter a Label for the database node—for
example,
database tables
. Enter the connection details. You can set up a new, one-time connection, or an existing, reusable connection.
New connection
To add a one-time connection to the database, follow these steps:
- Keep Use connection turned off.
- Enter the following connection properties:
- In the JDBC driver name field, enter the name of the driver, if there is one. Otherwise, keep No JDBC plugins selected.
- In the Connection string field, enter the JDBC connection string, including the database name.
- Optional: if your database requires authentication, enter database username and password credentials.
- Optional: if your JDBC driver needs additional configurations, in the Connection arguments field, enter key-value arguments for the connection.
Reusable connection
To reuse an existing connection, follow these steps:
- Turn on Use connection.
- Click Browse connections.
Select the connection.
Optional: if a connection doesn't exist and you want to create a new, reusable connection, click Add connection and refer to the steps in the New connection tab on this page.
Optional: to test connectivity, click Get schema. This schema is used in place of whatever schema comes back from the query. It must match the schema that comes back from the query, except that you can mark fields as nullable and it can contain a subset of the fields.
In the Import query field, enter a
SELECT
query for importing data from the specified table—for example,select id, name, email, phone from users;
.Optional: in the Bounding query field, enter the minimum and maximum values to read—for example,
SELECT * FROM table WHERE $CONDITIONS
.Optional: in the Split-by field name field, enter the name of the field that generates the splits.
Optional: in the Number of splits to generate field, enter a number—for example,
2
.Optional: in the Fetch size field, enter a number—for example,
1000
.Optional: enter Advanced properties, such as changing the case of the column names.
- Enter a Label for the database node—for
example,
Optional: click Validate and address any errors found.
Click Close. Properties are saved and you can continue to build your data pipeline in the Cloud Data Fusion Studio.
Properties
Property | Macro enabled | Required | Description |
---|---|---|---|
Label | No | Yes | The name of the node in your data pipeline. |
Use connection | No | No | Browse for a connection to the source. If Use connection is selected, you don't need to provide credentials. |
Connection | Yes | Yes | Name of the connection to use. If Use connection is selected, this field appears. Database and table information is provided by the connection. |
JDBC driver name | Yes | Yes | The JDBC driver to use. Default is No JDBC plugins. |
Connection string | Yes | Yes | JDBC connection string, including database name. |
Username | Yes | No | User identity for connecting to the database. Required for databases that need authentication. Optional for databases that don't require authentication. |
Password | Yes | No | Password to use to connect to the specified database. Required for databases that need authentication. Optional for databases that don't require authentication. |
Connection arguments | Yes | No | A list of arbitrary string tag/value pairs as connection arguments.
For JDBC drivers that need additional configuration, these arguments are
passed to the JDBC driver as connection arguments in the following
format: key1=value1;key2=value . |
Reference name | No | Yes | A name that uniquely identifies this source for lineage and annotating metadata. It's usually the name of the table or the view. |
Get schema | No | No | The schema of records output by the source. It's used in place of the schema that's returned by the query. It must match the schema that comes back from the query, except that it lets you mark fields as nullable and it can contain a subset of the fields. |
Import query | Yes | Yes | The SELECT query for importing data from the specified
table. You can specify an arbitrary number of columns to import, or
import all columns using * . The query should contain the
$CONDITIONS string. For example,
SELECT * FROM table WHERE $CONDITIONS . The
$CONDITIONS string is replaced by splitBy field
limit specified in the Bounding query field. The
$CONDITIONS string isn't required if the Number of
splits field is set to 1. |
Bounding query | Yes | No | The bounding query that returns the minimum and maximum of the values
from the split column. For example,
SELECT MIN(id),MAX(id) FROM table . Not required if the
number of splits is set to 1. |
Split-by field name | Yes | No | The field name used to generate splits. Not required if the number of splits to generate is set to 1. |
Number of splits to generate | Yes | No | Number of splits to generate. Default is 1. |
Fetch size | Yes | No | The number of rows to fetch at a time per split. A greater fetch size
can cause faster import, with the trade-off of higher memory usage. Default is 1000. |
Enable auto-commit | No | No | Whether to enable auto-commit for queries run by this source. Keep
this set to False unless you use a JDBC driver that
causes an error when the commit operation is run. Default is False. |
Column name case | Yes | No | Sets the letter case of the column names returned from the query. You
can choose uppercase or lowercase. By default, or for any other input,
the column names aren't modified and letter case returned from the
database is used. Setting this property provides predictability of
column name capitalization across different databases, but it might cause
column name conflicts if multiple column names are the same when the case
is ignored. Default is No change. |
Transaction isolation level | Yes | No | The transaction isolation level for queries run by this sink. For more
information, see the
setTransactionIsolation() .
The Phoenix JDBC driver throws an exception if the Phoenix database
doesn't have transactions enabled and this field is set to
true . For drivers like that, set this field to
TRANSACTION_NONE .Default is TRANSACTION_SERIALIZABLE. |
Pattern to replace | No | No | The pattern to replace in the field name in the table (typically used with the Replace with property). If the Replace with property isn't set, the pattern is removed from the field name. |
Replace with | No | No | The string that's replaced in the field name in the table. You must also configure the Pattern to replace field. |
Output schema | No | No | Specifies the schema that's output. Only columns defined in the schema are included in the output record. |
Best practices
Check if a more specific plugin is available for your database. For example, if you have an Oracle database source, instead use the Oracle database batch source plugin because it's designed to work with the Oracle schema.
Release notes
What's next
- Learn more about plugins in Cloud Data Fusion.