Apache Hive to BigQuery migration: Overview

This document compares the differences between Apache Hive and BigQuery and discusses key considerations for migration. For more information about how to complete the migration, see the Apache Hive migration guide.

You can also use batch SQL translation to migrate your SQL scripts in bulk, or interactive SQL translation to translate ad hoc queries. Apache HiveQL is fully supported by both SQL translation services.

Features

Hive and BigQuery are both distributed data warehouse systems. The following table compares some of their high level features:

Feature	Hive	BigQuery
Availability	Depends on the implementation	>= 99.99% monthly uptime
Quotas	Depends on the implementation	BigQuery quotas
Supported format	Avro, Parquet, ORC, Text	Avro, Parquet, ORC, CSV, JSON, Google Drive
Managed table storage	HDFS	BigQuery storage in Capacitor format
External table sources	HDFS, HBase	Google Cloud Storage, Google Sheets, Bigtable
Transaction	Limited. ACID at partition level	Limited at partition level. Supports multi- statement transactions.
API support	JDBC, ODBC, Thrift	JDBC, ODBC, Client library with many languages
Partitioning	Support	Time-unit column, ingestion time, or integer range
Clustering	Support	Support
Replication	Depends on HDFS. No cross data center replication	Reliable cross data center replication managed by Google
UDF	All languages	SQL or Javascript
Load data	Batch, Streaming (limited)	Batch, Streaming
Data operations	DDL, DML	DDL, DML, DCL, TCL

Data types

Apache Hive and BigQuery have different data type systems. In most cases, data types in Hive can be mapped to BigQuery data types with few exceptions, such as MAP and UNION. Apache Hive performs more implicit type casting than BigQuery. As a result, the batch SQL translator inserts many explicit casts.

Hive	BigQuery
TINYINT	INT64
SMALLINT	INT64
INT	INT64
BIGINT	INT64
DECIMAL	NUMERIC
FLOAT	FLOAT64
DOUBLE	FLOAT64
BOOLEAN	BOOL
STRING	STRING
VARCHAR	STRING
CHAR	STRING
BINARY	BYTES
DATE	DATE
-	DATETIME
-	TIME
TIMESTAMP	DATETIME/TIMESTAMP
INTERVAL	-
ARRAY	ARRAY
STRUCT	STRUCT
MAPS	STRUCT with key values (REPEAT field)
UNION	STRUCT with different types
-	GEOGRAPHY
-	JSON

User defined functions

Apache Hive supports writing user defined functions (UDFs) in Java. You can load UDFs into Hive to be used in regular queries. BigQuery UDFs must be written in GoogleSQL or JavaScript. Converting the Hive UDFs to SQL UDFs is recommended because SQL UDFs perform better. If you need to use JavaScript, read Best Practices for JavaScript UDFs. For other languages, BigQuery supports remote functions that let you invoke your functions in Cloud Run functions or Cloud Run from GoogleSQL queries.

BigQuery does not support user-defined aggregation functions (UDAFs).

Security

Hive and BigQuery have different security models which are described in the following sections:

Hive access control

For authorization purpose, Hive introduced SQL Standards Based Authorization (introduced in Hive 0.13.0) to enable fine-grained access control. It uses role-based access control (RBAC) to manage authorization for a large set of users and data objects in a typical enterprise environment. This security mechanism is based on the Hadoop identities. Hadoop relies on Kerberos to authenticate identities.

Secure with Apache Sentry and Apache Ranger

Apache Ranger and Apache Sentry are Apache projects that use plugins provided by Hive to implement authorization. Configuring Hive RBAC does not fully secure data, because Hive data is typically stored on HDFS where a user can directly reach data that they should not see. Ranger and Sentry also enforce data on HDFS with POSIX ACLs translated from the RBAC settings from Hive.

BigQuery security

BigQuery uses Identity and Access Management to control access to data under a project. You can restrict access to BigQuery datasets. Tables and views are child resources of datasets and their permissions are inherited from datasets. You can assign row-level and column-level security to individual tables.

You can also restrict access by using Authorized views and Authorized datasets.

BigQuery encrypts data by default. To manage your own encryption keys, you can use Cloud KMS with BigQuery.

What's next

Get step-by-step instructions to Migrate schema and data from Apache Hive.
Reference the Apache Hive SQL translation guide.