Apache Hive to BigQuery migration: Overview
This document compares the differences between Apache Hive and BigQuery and discusses key considerations for migration. For more information about how to complete the migration, see the Apache Hive migration guide.
You can also use batch SQL translation to migrate your SQL scripts in bulk, or interactive SQL translation to translate ad hoc queries. Apache HiveQL is fully supported by both SQL translation services.
Features
Hive and BigQuery are both distributed data warehouse systems. The following table compares some of their high level features:
Feature | Hive | BigQuery |
---|---|---|
Availability | Depends on the implementation | >= 99.99% monthly uptime |
Quotas | Depends on the implementation | BigQuery quotas |
Supported format | Avro, Parquet, ORC, Text | Avro, Parquet, ORC, CSV, JSON, Google Drive |
Managed table storage | HDFS | BigQuery storage in Capacitor format |
External table sources | HDFS, HBase | Google Cloud Storage, Google Sheets, Bigtable |
Transaction | Limited. ACID at partition level | Limited at partition level. Supports multi- statement transactions. |
API support | JDBC, ODBC, Thrift | JDBC, ODBC, Client library with many languages |
Partitioning | Support | Time-unit column, ingestion time, or integer range |
Clustering | Support | Support |
Replication | Depends on HDFS. No cross data center replication | Reliable cross data center replication managed by Google |
UDF | All languages | SQL or Javascript |
Load data | Batch, Streaming (limited) | Batch, Streaming |
Data operations | DDL, DML | DDL, DML, DCL, TCL |
Data types
Apache Hive and BigQuery have different data type systems. In
most cases, data types in Hive can be mapped to BigQuery data
types with few
exceptions, such as MAP
and UNION
. Apache Hive performs more implicit type
casting than BigQuery. As a result, the batch SQL translator
inserts many explicit casts.
Hive | BigQuery |
---|---|
TINYINT | INT64 |
SMALLINT | INT64 |
INT | INT64 |
BIGINT | INT64 |
DECIMAL | NUMERIC |
FLOAT | FLOAT64 |
DOUBLE | FLOAT64 |
BOOLEAN | BOOL |
STRING | STRING |
VARCHAR | STRING |
CHAR | STRING |
BINARY | BYTES |
DATE | DATE |
- | DATETIME |
- | TIME |
TIMESTAMP | DATETIME/TIMESTAMP |
INTERVAL | - |
ARRAY | ARRAY |
STRUCT | STRUCT |
MAPS | STRUCT with key values (REPEAT field) |
UNION | STRUCT with different types |
- | GEOGRAPHY |
- | JSON |
User defined functions
Apache Hive supports writing user defined functions (UDFs) in Java. You can load UDFs into Hive to be used in regular queries. BigQuery UDFs must be written in GoogleSQL or JavaScript. Converting the Hive UDFs to SQL UDFs is recommended because SQL UDFs perform better. If you need to use JavaScript, read Best Practices for JavaScript UDFs. For other languages, BigQuery supports remote functions that let you invoke your functions in Cloud Run functions or Cloud Run from GoogleSQL queries.
BigQuery does not support user-defined aggregation functions (UDAFs).
Security
Hive and BigQuery have different security models which are described in the following sections:
Hive access control
For authorization purpose, Hive introduced SQL Standards Based Authorization (introduced in Hive 0.13.0) to enable fine-grained access control. It uses role-based access control (RBAC) to manage authorization for a large set of users and data objects in a typical enterprise environment. This security mechanism is based on the Hadoop identities. Hadoop relies on Kerberos to authenticate identities.
Secure with Apache Sentry and Apache Ranger
Apache Ranger and Apache Sentry are Apache projects that use plugins provided by Hive to implement authorization. Configuring Hive RBAC does not fully secure data, because Hive data is typically stored on HDFS where a user can directly reach data that they should not see. Ranger and Sentry also enforce data on HDFS with POSIX ACLs translated from the RBAC settings from Hive.
BigQuery security
BigQuery uses Identity and Access Management to control access to data under a project. You can restrict access to BigQuery datasets. Tables and views are child resources of datasets and their permissions are inherited from datasets. You can assign row-level and column-level security to individual tables.
You can also restrict access by using Authorized views and Authorized datasets.
BigQuery encrypts data by default. To manage your own encryption keys, you can use Cloud KMS with BigQuery.
What's next
Get step-by-step instructions to Migrate schema and data from Apache Hive.
Reference the Apache Hive SQL translation guide.