Apache Hive to BigQuery migration: Overview

This document describes the differences between Apache Hive and BigQuery and discusses key considerations for migration. For more information about how to complete the migration, see the Apache Hive migration guide.

You can also use batch SQL translation to migrate your SQL scripts in bulk, or interactive SQL translation to translate ad hoc queries. Apache HiveQL is fully supported by both SQL translation services.

Features

Hive and BigQuery are both distributed data warehouse systems. The following table compares some of their high-level features (a partitioning and clustering example follows the table):

Feature | Hive | BigQuery
Availability | Depends on the implementation | >= 99.99% monthly uptime
Quotas | Depends on the implementation | BigQuery quotas
Supported formats | Avro, Parquet, ORC, Text | Avro, Parquet, ORC, CSV, JSON, Google Drive
Managed table storage | HDFS | BigQuery storage in Capacitor format
External table sources | HDFS, HBase | Google Cloud Storage, Google Sheets, Bigtable
Transactions | Limited. ACID at partition level | Limited. Supports multi-statement transactions
API support | JDBC, ODBC, Thrift | JDBC, ODBC, client libraries in many languages
Partitioning | Supported | Time-unit column, ingestion time, or integer range
Clustering | Supported | Supported
Replication | Depends on HDFS. No cross-data-center replication | Reliable cross-data-center replication managed by Google
UDFs | Multiple languages | SQL or JavaScript
Load data | Batch, streaming (limited) | Batch, streaming
Data operations | DDL, DML | DDL, DML, DCL, TCL
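To illustrate the partitioning and clustering rows above, the following sketch shows roughly equivalent managed-table DDL in both systems. The dataset, table, and column names are hypothetical, and the bucketing and partitioning choices are only examples.

-- Hive: managed table partitioned by date and bucketed (clustered) by order_id.
CREATE TABLE sales (
  order_id BIGINT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (order_date DATE)
CLUSTERED BY (order_id) INTO 16 BUCKETS
STORED AS ORC;

-- BigQuery: the partitioning column is part of the column list,
-- and clustering does not require a bucket count.
CREATE TABLE mydataset.sales (
  order_id INT64,
  amount NUMERIC,
  order_date DATE
)
PARTITION BY order_date
CLUSTER BY order_id;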

Data types

Apache Hive and BigQuery have different data type systems. In most cases, data types in Hive can be mapped to BigQuery data types, with a few exceptions such as MAP and UNION. Apache Hive performs more implicit type casting than BigQuery, so the batch SQL translator inserts many explicit casts (see the example after the following table).

Hive | BigQuery
TINYINT | INT64
SMALLINT | INT64
INT | INT64
BIGINT | INT64
DECIMAL | NUMERIC
FLOAT | FLOAT64
DOUBLE | FLOAT64
BOOLEAN | BOOL
STRING | STRING
VARCHAR | STRING
CHAR | STRING
BINARY | BYTES
DATE | DATE
- | DATETIME
- | TIME
TIMESTAMP | DATETIME/TIMESTAMP
INTERVAL | -
ARRAY | ARRAY
STRUCT | STRUCT
MAP | STRUCT with key values (REPEATED field)
UNION | STRUCT with different types
- | GEOGRAPHY
- | JSON
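For example, Hive implicitly casts string literals to numeric types in comparisons, while GoogleSQL generally requires an explicit conversion. The following is a minimal sketch that assumes a hypothetical orders table with a NUMERIC amount column.

-- Hive: the string literal is implicitly cast to a number.
-- SELECT * FROM orders WHERE amount > '100';

-- BigQuery: the translated query carries an explicit cast.
SELECT *
FROM mydataset.orders
WHERE amount > CAST('100' AS NUMERIC);

-- SAFE_CAST returns NULL instead of raising an error when the value cannot be converted.
SELECT SAFE_CAST('abc' AS INT64);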

User defined functions

Apache Hive supports writing user-defined functions (UDFs) in Java. You can load UDFs into Hive and use them in regular queries. BigQuery UDFs must be written in GoogleSQL or JavaScript. Converting Hive UDFs to SQL UDFs is recommended because SQL UDFs perform better. If you need to use JavaScript, read Best Practices for JavaScript UDFs. For other languages, BigQuery supports remote functions that let you invoke your functions in Cloud Run functions or Cloud Run from GoogleSQL queries.
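For example, a Hive Java UDF that multiplies a value by three could be rewritten as a SQL UDF, or as a JavaScript UDF if SQL is not sufficient. This is only a sketch; the function and dataset names are hypothetical.

-- Preferred: SQL UDF.
CREATE OR REPLACE FUNCTION mydataset.multiply_by_three(x INT64)
RETURNS INT64
AS (x * 3);

-- Alternative: JavaScript UDF. FLOAT64 is used because JavaScript
-- represents numbers as double-precision values.
CREATE OR REPLACE FUNCTION mydataset.multiply_by_three_js(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js
AS r"""
  return x * 3;
""";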

BigQuery does not support user-defined aggregation functions (UDAFs).

Security

Hive and BigQuery have different security models, which are described in the following sections.

Hive access control

For authorization, Hive provides SQL Standards Based Authorization (introduced in Hive 0.13.0) to enable fine-grained access control. It uses role-based access control (RBAC) to manage authorization for a large set of users and data objects in a typical enterprise environment. This security mechanism is based on Hadoop identities, and Hadoop relies on Kerberos to authenticate those identities.
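With SQL Standards Based Authorization enabled, access is typically granted through roles, as in the following sketch (the role, user, and table names are hypothetical):

-- Create a role, grant it a privilege on a table, and assign it to a user.
CREATE ROLE analyst;
GRANT SELECT ON TABLE sales TO ROLE analyst;
GRANT analyst TO USER alice;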

Secure with Apache Sentry and Apache Ranger

Apache Ranger and Apache Sentry are Apache projects that implement authorization through plugins provided by Hive. Configuring Hive RBAC alone does not fully secure data, because Hive data is typically stored on HDFS, where a user can directly access data that they should not see. Ranger and Sentry therefore also enforce access on HDFS by applying POSIX ACLs translated from the Hive RBAC settings.

BigQuery security

BigQuery uses Identity and Access Management (IAM) to control access to data within a project. You can restrict access at the level of BigQuery datasets. Tables and views are child resources of datasets and inherit permissions from their parent dataset. You can also apply row-level and column-level security to individual tables.
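For example, dataset-level access and a row-level policy can be granted with GoogleSQL DCL and DDL statements. This is a sketch; the dataset, table, and principals are hypothetical.

-- Grant read access on a dataset to a user.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA mydataset
TO "user:analyst@example.com";

-- Restrict a table to rows where region = "US" for a specific group.
CREATE ROW ACCESS POLICY us_only
ON mydataset.orders
GRANT TO ("group:us-sales@example.com")
FILTER USING (region = "US");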

You can also restrict access by using authorized views and authorized datasets.

BigQuery encrypts data by default. To manage your own encryption keys, you can use Cloud KMS with BigQuery.
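For example, a customer-managed encryption key can be attached to a table when you create it. The following is a minimal sketch; the dataset, table, and Cloud KMS key resource name are hypothetical.

CREATE TABLE mydataset.sensitive_data (
  id INT64,
  payload STRING
)
OPTIONS (
  kms_key_name = 'projects/my-project/locations/us/keyRings/my-keyring/cryptoKeys/my-key'
);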

What's next