Many organizations deploy cloud data warehouses to store sensitive information so that they can analyze the data for a variety of business purposes. This document describes how you can implement the Cloud Data Management Capabilities (CDMC) Key Controls Framework, managed by the Enterprise Data Management Council, in a BigQuery data warehouse.
The CDMC Key Controls Framework was published primarily for cloud service providers and technology providers. The framework describes 14 key controls that providers can implement to let their customers effectively manage and govern sensitive data in the cloud. The controls were written by the CDMC Working Group, with more than 300 professionals participating from more than 100 firms. While writing the framework, the CDMC Working Group considered many of the legal and regulatory requirements that exist.
This BigQuery and Data Catalog reference architecture was assessed and certified against the CDMC Key Controls Framework as a CDMC Certified Cloud Solution. The reference architecture uses various Google Cloud services and features as well as public libraries to implement the CDMC key controls and recommended automation. This document explains how you can implement the key controls to help protect sensitive data in a BigQuery data warehouse.
Architecture
The following Google Cloud reference architecture aligns with the CDMC Key Controls Framework Test Specifications v1.1.1. The numbers in the diagram represent the key controls that are addressed with Google Cloud services.
The reference architecture builds on the secured data warehouse blueprint, which helps you protect a BigQuery data warehouse that includes sensitive information. In the preceding diagram, the projects at the top of the diagram (in gray) are part of the secured data warehouse blueprint, and the Data governance project (in blue) includes the services that are added to meet the CDMC Key Controls Framework requirements. The Data governance project provides controls such as classification, lifecycle management, and data quality management, and it also provides a way to audit the architecture and report on findings.
For more information about how to implement this reference architecture, see the Google Cloud CDMC Reference Architecture on GitHub.
Overview of the CDMC Key Controls Framework
The following table summarizes the CDMC Key Controls Framework.
# | CDMC key control | CDMC control requirement |
---|---|---|
1 | Data control compliance | Cloud data management business cases are defined and governed. All data assets that contain sensitive data must be monitored for compliance with the CDMC Key Controls, using metrics and automated notifications. |
2 | Data ownership is established for both migrated and cloud-generated data | The Ownership field in a data catalog must be populated for all sensitive data or otherwise reported to a defined workflow. |
3 | Data sourcing and consumption are governed and supported by automation | A register of authoritative data sources and provisioning points must be populated for all data assets that contain sensitive data or otherwise must be reported to a defined workflow. |
4 | Data sovereignty and cross-border data movement are managed | The data sovereignty and cross-border movement of sensitive data must be recorded, audited, and controlled according to defined policy. |
5 | Data catalogs are implemented, used, and interoperable | Cataloging must be automated for all data at the point of creation or ingestion, with consistency across all environments. |
6 | Data classifications are defined and used | Classification must be automated for all data at the point of creation or ingestion and must be always on. Classification is automated for data such as personally identifiable information (PII), client identifiable information, and other sensitivity categories that your organization defines. |
7 | Data entitlements are managed, enforced, and tracked | Entitlements and access to sensitive data must default to the creator and owner until access is explicitly and authoritatively granted. Access to all sensitive data must be tracked. |
8 | Ethical access, use, and outcomes of data are managed | The data consumption purpose must be provided for all data sharing agreements that involve sensitive data. The purpose must specify the type of data required and, for global organizations, the country or legal entity scope. |
9 | Data is secured, and controls are evidenced | Appropriate security controls, such as encryption and de-identification, must be enabled for sensitive data, and evidence of those controls must be recorded. |
10 | A data privacy framework is defined and operational | Data protection impact assessments (DPIAs) must be automatically triggered for all personal data according to its jurisdiction. |
11 | The data lifecycle is planned and managed | Data retention, archiving, and purging must be managed according to a defined retention schedule. |
12 | Data quality is managed | Data quality measurement must be enabled for sensitive data with metrics distributed when available. |
13 | Cost management principles are established and applied | Technical design principles are established and applied. Cost metrics that are directly associated with data use, storage, and movement must be available in the catalog. |
14 | Data provenance and lineage are understood | Data lineage information must be available for all sensitive data. This information must at a minimum include the source from which the data was ingested or in which it was created in a cloud environment. |
1. Data control compliance
This control requires that you can verify, using metrics, that all sensitive data is monitored for compliance with this framework.
The architecture uses metrics that show the extent to which each of the key controls is operational. The architecture also includes dashboards that indicate when the metrics don't meet defined thresholds.
The architecture includes detectors that publish findings and remediation recommendations when data assets don't meet a key control. These findings and recommendations are in JSON format and published to a Pub/Sub topic for distribution to subscribers. You can integrate your internal service desk or service management tools with the Pub/Sub topic so that incidents are automatically created in your ticketing system.
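For example, the following Python sketch shows how a service management integration might consume findings from the topic and acknowledge each message after creating an incident. The project ID, subscription name, and finding fields are assumptions for illustration; the actual finding schema is defined by the reference architecture.

```python
import json
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

# Hypothetical project and subscription names; substitute your own.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    "my-governance-project", "cdmc-findings-sub"
)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Findings are published as JSON; the field names here are illustrative.
    finding = json.loads(message.data.decode("utf-8"))
    print(f"Control: {finding.get('control')}, asset: {finding.get('asset')}")
    # Create an incident in your ticketing system here, then acknowledge.
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result(timeout=60)  # Listen for findings for 60 seconds.
except TimeoutError:
    streaming_pull_future.cancel()
    streaming_pull_future.result()
```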
The architecture uses Dataflow to create an example subscriber to the findings events, which are then stored in a BigQuery instance that runs in the Data governance project. Using the supplied views, you can query the data by using BigQuery Studio in the Google Cloud console. You can also create reports by using Looker Studio or other BigQuery-compatible business intelligence tools. Reports that you can view include the following:
- Last run findings summary
- Last run findings detail
- Last run metadata
- Last run data assets in scope
- Last run dataset statistics
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Pub/Sub publishes findings.
- Dataflow loads findings into a BigQuery instance.
- BigQuery stores the findings data and provides summary views.
- Looker Studio provides dashboards and reports.
The following screenshot shows a sample Looker Studio summary dashboard.
The following screenshot shows a sample view of findings by data asset.
2. Data ownership is established for both migrated and cloud-generated data
To meet the requirements of this control, the architecture automatically reviews the data in the BigQuery data warehouse and adds data classification tags that indicate that owners are identified for all sensitive data.
Data Catalog, which is a feature of Dataplex, handles two types of metadata: technical metadata and business metadata. For a given project, Data Catalog automatically catalogs BigQuery datasets, tables, and views and populates the technical metadata. Synchronization between the catalog and data assets is maintained on a near real-time basis.
The architecture uses Tag Engine to add the following business metadata tags to a `CDMC controls` tag template in Data Catalog:

- `is_sensitive`: whether the data asset contains sensitive data (see control 6 for data classification)
- `owner_name`: the owner of the data
- `owner_email`: the email address of the owner
The tags are populated using defaults that are stored in a reference BigQuery table in the Data governance project.
By default, the architecture sets the ownership metadata at the table level, but you can change the architecture so that metadata is set at the column level. For more information, see Data Catalog tags and tag templates.
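Tag Engine applies these tags automatically, but the following Python sketch shows what the equivalent tagging operation looks like with the Data Catalog client library. The project IDs, table name, tag template ID, and field values are assumptions for illustration.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Look up the entry that Data Catalog automatically created for the table.
entry = client.lookup_entry(
    request={
        "linked_resource": (
            "//bigquery.googleapis.com/projects/my-confidential-project"
            "/datasets/crm/tables/customers"
        )
    }
)

# Attach ownership metadata by using the CDMC controls tag template.
tag = datacatalog_v1.Tag()
tag.template = (
    "projects/my-governance-project/locations/us-central1"
    "/tagTemplates/cdmc_controls"
)
tag.fields["is_sensitive"] = datacatalog_v1.TagField(bool_value=True)
tag.fields["owner_name"] = datacatalog_v1.TagField(string_value="Jane Example")
tag.fields["owner_email"] = datacatalog_v1.TagField(string_value="jane@example.com")

created = client.create_tag(parent=entry.name, tag=tag)
print(f"Created tag: {created.name}")
```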
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Two BigQuery data warehouses: one stores the confidential data and the other stores the defaults for data asset ownership.
- Data Catalog stores ownership metadata through tag templates and tags.
- Two Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
- Pub/Sub publishes findings.
To detect issues related to this control, the architecture checks whether sensitive data is assigned an owner name tag.
3. Data sourcing and consumption are governed and supported by automation
This control requires classification of data assets and a data register of authoritative sources and authorized distributors. The architecture uses Data Catalog to add the `is_authoritative` tag to the `CDMC controls` tag template. This tag defines whether the data asset is authoritative.

Data Catalog catalogs BigQuery datasets, tables, and views with technical metadata and business metadata. Technical metadata is automatically populated and includes the resource URL, which is the location of the provisioning point. Business metadata is defined in the Tag Engine configuration file and includes the `is_authoritative` tag.

During the next scheduled run, Tag Engine populates the `is_authoritative` tag in the `CDMC controls` tag template from default values stored in a reference table in BigQuery.
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Two BigQuery data warehouses: one stores the confidential data and the other stores the defaults for the data asset authoritative source.
- Data Catalog stores authoritative source metadata through tags.
- Two Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
- Pub/Sub publishes findings.
To detect issues related to this control, the architecture checks whether sensitive data is assigned the authoritative source tag.
4. Data sovereignty and cross-border data movement are managed
This control requires the architecture to inspect the data registry for region-specific storage requirements and enforce usage rules. A report describes the geographical location of data assets.
The architecture uses Data Catalog to add the `approved_storage_location` tag to the `CDMC controls` tag template. This tag defines the geographic location that the data asset is permitted to be stored in.
The actual location of the data is stored as technical metadata in the BigQuery table details. BigQuery doesn't let administrators change the location of an existing dataset or table. Instead, to change the data location, administrators must copy the dataset to a new location.
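To illustrate the underlying check, the following Python sketch reads each dataset's location from its technical metadata and flags datasets outside an approved list. The project ID and the allow-list are assumptions; in the architecture, the approved locations come from the `approved_storage_location` tag.

```python
from google.cloud import bigquery

# Hypothetical allow-list that mirrors the approved_storage_location tag.
APPROVED_LOCATIONS = {"us-central1"}

client = bigquery.Client(project="my-confidential-project")

for item in client.list_datasets():
    dataset = client.get_dataset(item.reference)
    if dataset.location not in APPROVED_LOCATIONS:
        # In the architecture, this condition would raise a finding.
        print(
            f"Finding: dataset {dataset.dataset_id} is stored in "
            f"{dataset.location}, which isn't an approved storage location."
        )
```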
The resource locations Organization Policy Service constraint defines the Google Cloud regions that you can store data in. By default, the architecture sets the constraint on the Confidential data project, but you can set the constraint at an organization or folder level if you prefer. Tag Engine replicates the permitted locations to the Data Catalog tag template and stores the location in the `approved_storage_location` tag. If you activate the Security Command Center Premium tier and someone updates the resource locations Organization Policy Service constraint, Security Command Center generates vulnerability findings for resources that are stored outside of the updated policy.
Access Context Manager defines the geographic locations that users must be in before they can access data assets. Using access levels, you can specify the regions that requests can come from. You then add the access policy to the VPC Service Controls perimeter for the Confidential data project.
To track data movement, BigQuery maintains a complete audit trail for every job and query against each dataset. The audit trail is stored in the BigQuery `INFORMATION_SCHEMA.JOBS` view.
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Organization Policy Service defines and enforces the resource locations constraint.
- Access Context Manager defines the locations that users can access data from.
- Two BigQuery data warehouses: one stores the confidential data and the other hosts a remote function that is used to inspect the location policy.
- Data Catalog stores approved storage locations as tags.
- Two Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
- Pub/Sub publishes findings.
- Cloud Logging writes the audit logs.
- Security Command Center reports on any findings related to resource location or data access.
To detect issues related to this control, the architecture includes a finding for whether the approved locations tag includes the location of the sensitive data.
5. Data catalogs are implemented, used, and interoperable
This control requires that a data catalog exists, and that the architecture can scan new and updated assets to add metadata as required.
To meet the requirements of this control, the architecture uses Data Catalog. Data Catalog automatically catalogs Google Cloud assets, including BigQuery datasets, tables, and views. When you create a new table in BigQuery, Data Catalog automatically registers the new table's technical metadata and schema. When you update a table in BigQuery, Data Catalog updates its entries in near real time.
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Two BigQuery data warehouses: one stores the confidential data and the other stores the non-confidential data.
- Data Catalog stores the technical metadata for tables and fields.
By default, in this architecture, Data Catalog stores technical metadata from BigQuery. If required, you can integrate Data Catalog with other data sources.
6. Data classifications are defined and used
This control requires that data can be classified based on its sensitivity, such as whether it's PII, identifies clients, or meets some other standard that your organization defines. To meet the requirements of this control, the architecture creates a report of data assets and their sensitivity. You can use this report to verify whether the sensitivity settings are correct. In addition, each new data asset or change to an existing data asset results in an update to the data catalog.
Classifications are stored in the `sensitive_category` tag in the Data Catalog tag template at the table level and the column level. A classification reference table lets you rank the available Sensitive Data Protection information types (infoTypes), with higher rankings for more sensitive content.
To meet the requirements of this control, the architecture uses Sensitive Data Protection, Data Catalog, and Tag Engine to add the following tags to sensitive columns in BigQuery tables:
- `is_sensitive`: whether the data asset contains sensitive information
- `sensitive_category`: the category of the data; one of the following:
  - Sensitive Personal Identifiable Information
  - Personal Identifiable Information
  - Sensitive Personal Information
  - Personal Information
  - Public Information
You can change the data categories to meet your requirements. For example, you can add the Material Non-Public Information (MNPI) classification.
After Sensitive Data Protection inspects the data, Tag Engine reads the per-asset DLP results tables to compile the findings. If a table contains columns with one or more sensitive infoTypes, the most notable infoType is determined, and both the sensitive columns and the whole table are tagged with the category that has the highest rank. Tag Engine also assigns a corresponding policy tag to the column and assigns the `is_sensitive` boolean tag to the table.
You can use Cloud Scheduler to automate the Sensitive Data Protection inspection.
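The following Python sketch shows one way to start a Sensitive Data Protection inspection job for a single BigQuery table, similar to what the Inspect Datasets script automates for every dataset. The project, dataset, and table names, and the chosen infoTypes, are assumptions for illustration.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-governance-project/locations/us-central1"

inspect_job = {
    # The BigQuery table to inspect.
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": "my-confidential-project",
                "dataset_id": "crm",
                "table_id": "customers",
            }
        }
    },
    # A few example infoTypes; the architecture ranks many more.
    "inspect_config": {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "DATE_OF_BIRTH"},
        ],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    },
    # Write findings to a results table that Tag Engine can read.
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "my-governance-project",
                        "dataset_id": "dlp_results",
                        "table_id": "customers_findings",
                    }
                }
            }
        }
    ],
}

job = dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print(f"Started inspection job: {job.name}")
```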
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Four BigQuery data warehouses store the following information:
- Confidential data
- Sensitive Data Protection results information
- Data classification reference data
- Tag export information
- Data Catalog stores the classification tags.
- Sensitive Data Protection inspects assets for sensitive infoTypes.
- Compute Engine runs the Inspect Datasets script, which triggers a Sensitive Data Protection job for each BigQuery dataset.
- Two Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
- Pub/Sub publishes findings.
To detect issues related to this control, the architecture includes the following findings:
- Whether sensitive data is assigned a sensitive category tag.
- Whether sensitive data is assigned a column-level sensitivity type tag.
7. Data entitlements are managed, enforced, and tracked
By default, only creators and owners are assigned entitlements and access to sensitive data. In addition, this control requires that the architecture tracks all access to sensitive data.
To meet the requirements of this control, the architecture uses the `cdmc sensitive data classification` policy tag taxonomy in BigQuery to control access to columns that contain confidential data in BigQuery tables. The taxonomy includes the following policy tags:

- Sensitive Personal Identifiable Information
- Personal Identifiable Information
- Sensitive Personal Information
- Personal Information

Policy tags let you control who can view sensitive columns in BigQuery tables. The architecture maps these policy tags to sensitivity classifications that are derived from Sensitive Data Protection infoTypes. For example, the `sensitive_personal_identifiable_information` policy tag and sensitive category map to infoTypes such as `AGE`, `DATE_OF_BIRTH`, `PHONE_NUMBER`, and `EMAIL_ADDRESS`.
The architecture uses Identity and Access Management (IAM) to manage the groups, users, and service accounts that require access to the data. IAM permissions are granted to a given asset for table-level access. In addition, column-level access based on policy tags allows for fine-grained access to sensitive data assets. By default, users don't have access to columns that have defined policy tags.
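For illustration, the following Python sketch attaches a policy tag from the taxonomy to a sensitive column by updating the table schema. The taxonomy and policy tag IDs, and the column name, are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-confidential-project")
table = client.get_table("my-confidential-project.crm.customers")

# Hypothetical policy tag resource name from the cdmc sensitive data
# classification taxonomy.
policy_tag = (
    "projects/my-governance-project/locations/us-central1"
    "/taxonomies/1234567890/policyTags/2345678901"
)

new_schema = []
for field in table.schema:
    if field.name == "email":  # The sensitive column to restrict.
        field = bigquery.SchemaField(
            field.name,
            field.field_type,
            mode=field.mode,
            policy_tags=bigquery.PolicyTagList(names=[policy_tag]),
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])
```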
To help ensure that only authenticated users can access data, Google Cloud uses Cloud Identity, which you can federate with your existing identity providers to authenticate users.
This control also requires that the architecture checks regularly for data assets that don't have entitlements defined. The detector, which is managed by Cloud Scheduler, checks for the following scenarios:
- A data asset includes a sensitive category, but there isn't a related policy tag.
- A category doesn't match the policy tag.
When these scenarios happen, the detector generates findings that are published by Pub/Sub and then written to the `events` table in BigQuery by Dataflow. You can then distribute the findings to your remediation tool, as described in 1. Data control compliance.
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- A BigQuery data warehouse stores the confidential data and the policy tag bindings for fine-grained access controls.
- IAM manages access.
- Data Catalog stores the table-level and column-level tags for the sensitive category.
- Two Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
To detect issues related to this control, the architecture checks whether sensitive data has a corresponding policy tag.
8. Ethical access, use, and outcomes of data are managed
This control requires the architecture to store data sharing agreements from both the data provider and data consumers, including a list of approved consumption purposes. The consumption purpose for the sensitive data is then mapped to entitlements that are stored in BigQuery using query labels. When a consumer queries sensitive data in BigQuery, they must specify a valid purpose that matches their entitlement (for example, `SET @@query_label = "use:3";`).
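For example, the following Python sketch creates a BigQuery session, sets the consumption-purpose query label in it, and then runs a query in the same session; the table name and label value are assumptions for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-confidential-project")

# Start a session and set the consumption-purpose query label in it.
session_job = client.query(
    'SET @@query_label = "use:3";',
    job_config=bigquery.QueryJobConfig(create_session=True),
)
session_job.result()
session_id = session_job.session_info.session_id

# Queries that run in the same session carry the query label, so the
# audit trail records the stated consumption purpose.
session_config = bigquery.QueryJobConfig(
    connection_properties=[bigquery.ConnectionProperty("session_id", session_id)]
)
rows = client.query(
    "SELECT COUNT(*) AS n FROM `my-confidential-project.crm.customers`",
    job_config=session_config,
)
print(list(rows.result())[0].n)
```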
The architecture uses Data Catalog to add the following tags to the `CDMC controls` tag template. These tags represent the data sharing agreement with the data provider:

- `approved_use`: the approved use or users of the data asset
- `sharing_scope_geography`: the list of geographic locations that the data asset can be shared in
- `sharing_scope_legal_entity`: the list of agreed entities that can share the data asset
A separate BigQuery data warehouse includes the `entitlement_management` dataset with the following tables:

- `provider_agreement`: the data sharing agreement with the data provider, including the agreed legal entity and geographic scope. This data provides the defaults for the `sharing_scope_geography` and `sharing_scope_legal_entity` tags.
- `consumer_agreement`: the data sharing agreement with the data consumer, including the agreed legal entity and geographic scope. Each agreement is associated with an IAM binding for the data asset.
- `use_purpose`: the consumption purpose, such as the usage description and permitted operations for the data asset
- `data_asset`: information about the data asset, such as the asset name and details about the data owner
To audit data sharing agreements, BigQuery maintains a complete audit trail for every job and query against each dataset. The audit trail is stored in the BigQuery `INFORMATION_SCHEMA.JOBS` view. After you associate a query label with a session and run queries inside the session, you can collect audit logs for queries with that query label. For more information, see the Audit log reference for BigQuery.
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Two BigQuery data warehouses: one stores the confidential data and the other stores the entitlement data, which includes the provider and consumer data sharing agreements and the approved usage purpose.
- Data Catalog stores the provider data sharing agreement information as tags.
- Two Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
- Pub/Sub publishes findings.
To detect issues related to this control, the architecture includes the following findings:
- Whether there is an entry for a data asset in the `entitlement_management` dataset.
- Whether an operation is performed on a sensitive table with an expired use case (for example, the `valid_until_date` in the `consumer_agreement` table has passed).
- Whether an operation is performed on a sensitive table with an incorrect label key.
- Whether an operation is performed on a sensitive table with either a blank or an unapproved use-case label value.
- Whether a sensitive table is queried with an unapproved operation method (for example, `SELECT` or `INSERT`).
- Whether the recorded purpose that the consumer specified when querying the sensitive data matches the data sharing agreement.
9. Data is secured, and controls are evidenced
This control requires the implementation of data encryption and de-identification to help protect sensitive data and provide a record of these controls.
This architecture builds on Google default security, which includes encryption at rest. In addition, the architecture lets you manage your own keys using customer-managed encryption keys (CMEK). Cloud KMS lets you encrypt your data with software-backed encryption keys or FIPS 140-2 Level 3 validated hardware security modules (HSMs).
The architecture uses column-level dynamic data masking, configured through policy tags, and stores confidential data within a separate VPC Service Controls perimeter. You can also add application-level de-identification, which you can implement either on-premises or as part of the data ingestion pipeline.
By default, the architecture implements CMEK encryption with HSMs, but also supports Cloud External Key Manager (Cloud EKM).
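As an example, the following Python sketch creates a BigQuery table that is protected with a customer-managed Cloud HSM key; the key ring, key, and table names are assumptions for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-confidential-project")

# Hypothetical Cloud HSM-backed key that is used as the CMEK for the table.
kms_key_name = (
    "projects/my-kms-project/locations/us-central1"
    "/keyRings/cdmc-keyring/cryptoKeys/confidential-hsm-key"
)

table = bigquery.Table("my-confidential-project.crm.customers_cmek")
table.schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("email", "STRING"),
]
table.encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name=kms_key_name
)
client.create_table(table)
```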
The following table describes the example security policy that the architecture implements for the us-central1 region. You can adapt the policy to meet your requirements, including by adding different policies for different regions.
Data sensitivity | Default encryption method | Other permitted encryption methods | Default de-identification method | Other permitted de-identification methods |
---|---|---|---|---|
Public Information | Default Encryption | Any | None | Any |
Sensitive Personal Identifiable Information | CMEK with HSM | EKM | Nullify | SHA-256 Hash or Default Masking Value |
Personal Identifiable Information | CMEK with HSM | EKM | SHA-256 Hash | Nullify or Default Masking Value |
Sensitive Personal Information | CMEK with HSM | EKM | Default Masking Value | SHA-256 Hash or Nullify |
Personal Information | CMEK with HSM | EKM | Default Masking Value | SHA-256 Hash or Nullify |
The architecture uses Data Catalog to add the `encryption_method` tag to the table-level `CDMC controls` tag template. The `encryption_method` tag defines the encryption method that is used by the data asset.

In addition, the architecture creates a `security policy template` tag to identify which de-identification method is applied to a particular field. The architecture uses the `platform_deid_method` tag, which is applied using dynamic data masking. You can add the `app_deid_method` tag and populate it using the Dataflow and Sensitive Data Protection data ingestion pipelines that are included in the secured data warehouse blueprint.
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Two optional Dataflow instances: one performs application-level de-identification and the other performs re-identification.
- Three BigQuery data warehouses: one stores the confidential data, one stores the non-confidential data, and the third stores the security policy.
- Data Catalog stores the encryption and de-identification tag templates.
- Two Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
- Pub/Sub publishes findings.
To detect issues related to this control, the architecture includes the following findings:
- The value for the encryption method tag doesn't match the permitted encryption methods for the specified sensitivity and location.
- A table contains sensitive columns but the Security Policy Template tag contains an invalid platform-level de-identification method.
- A table contains sensitive columns but the Security Policy Template tag is missing.
10. A data privacy framework is defined and operational
This control requires that the architecture inspects the data catalog and classifications to determine whether you must create a data protection impact assessment (DPIA) report or a privacy impact assessment (PIA) report. Privacy assessments vary significantly between geographies and regulators. To determine whether an impact assessment is required, the architecture must consider the residency of the data, and the residency of the data subject.
The architecture uses Data Catalog to add the following tags to the `Impact assessment` tag template:

- `subject_locations`: the location of the subjects referred to by the data in this asset
- `is_dpia`: whether a data protection impact assessment (DPIA) was completed for this asset
- `is_pia`: whether a privacy impact assessment (PIA) was completed for this asset
- `impact_assessment_reports`: an external link to where the impact assessment report is stored
- `most_recent_assessment`: the date of the most recent impact assessment
- `oldest_assessment`: the date of the first impact assessment
Tag Engine adds these tags to each sensitive data asset, as defined by control 6. The detector validates these tags against a policy table in BigQuery, which includes valid combinations of data residency, subject location, data sensitivity (for example, whether it's PII), and what impact assessment type (either PIA or DPIA) is required.
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Four BigQuery data warehouses store the following information:
- Confidential data
- Non-confidential data
- Impact assessment policy and entitlements timestamps
- Tag exports that are used for the dashboard
- Data Catalog stores the impact assessment details in tags within tag templates.
- Two Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
- Pub/Sub publishes findings.
To detect issues related to this control, the architecture includes the following findings:
- Sensitive data exists without an impact assessment template.
- Sensitive data exists without a link to a DPIA or PIA report.
- Tags don't meet the requirements in the policy table.
- The impact assessment is older than the most recently approved entitlement for the data asset in the consumer agreement table.
11. The data lifecycle is planned and managed
This control requires the ability to inspect all data assets to determine that a data lifecycle policy exists and is adhered to.
The architecture uses Data Catalog to add the following tags to the `CDMC controls` tag template:

- `retention_period`: the time, in days, to retain the table
- `expiration_action`: whether to archive or purge the table when the retention period ends
By default, the architecture uses the following retention period and expiration action:
Data category | Retention period, in days | Expiration action |
---|---|---|
Sensitive Personal Identifiable Information | 60 | Purge |
Personal Identifiable Information | 90 | Archive |
Sensitive Personal Information | 180 | Archive |
Personal Information | 180 | Archive |
Record Manager, an open source asset for BigQuery, automates the purging and archiving of BigQuery tables based on the preceding tag values and a configuration file. The purging procedure sets an expiration date on a table and creates a snapshot table with an expiration time that is defined in the Record Manager configuration. By default, the expiration time is 30 days. During the soft-deletion period, you can retrieve the table. The archival procedure creates an external table for each BigQuery table that passes its retention period. The table is stored in Cloud Storage in Parquet format and upgraded to a BigLake table, which lets the external file be tagged with metadata in Data Catalog.
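The following Python sketch illustrates the kind of update that Record Manager automates when it purges a table: it applies the retention period by setting the table's expiration time. The table name and retention value are assumptions for illustration.

```python
import datetime

from google.cloud import bigquery

client = bigquery.Client(project="my-confidential-project")
table = client.get_table("my-confidential-project.crm.customers")

# Apply a 60-day retention period, which mirrors the retention_period tag
# value for Sensitive Personal Identifiable Information.
retention_days = 60
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(
    days=retention_days
)
client.update_table(table, ["expires"])
```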
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Two BigQuery data warehouses: one stores the confidential data, and the other stores the data retention policy.
- Two Cloud Storage buckets: one provides archive storage and the other stores records.
- Data Catalog stores the retention period and action in tag templates and the tags.
- Three Cloud Run instances, as follows:
  - One instance runs Report Engine, which checks whether tags are applied and publishes results.
  - Another instance runs Tag Engine, which tags the data in the secured data warehouse.
  - Another instance runs Record Manager, which automates the purging and archiving of BigQuery tables.
- Pub/Sub publishes findings.
To detect issues related to this control, the architecture includes the following findings:
- Whether the retention method for sensitive assets aligns with the policy for the asset's location.
- Whether the retention period for sensitive assets aligns with the policy for the asset's location.
12. Data quality is managed
This control requires the ability to measure the quality of the data based on data profiling or user-defined metrics.
The architecture includes the ability to define data quality rules for either an individual or aggregate value and assign thresholds to a specific table column. It includes tag templates for correctness and completeness. Data Catalog adds the following tags to each tag template:
- `column_name`: the name of the column that the metric applies to
- `metric`: the name of the metric or quality rule
- `rows_validated`: the number of validated rows
- `success_percentage`: the percentage of values that satisfy this metric
- `acceptable_threshold`: the acceptable threshold for this metric
- `meets_threshold`: whether the quality score (the `success_percentage` value) meets the acceptable threshold
- `most_recent_run`: the most recent time that the metric or quality rule was run
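For example, the following Python sketch computes a completeness metric for a single column and compares it to a threshold; the results correspond to the `rows_validated`, `success_percentage`, and `meets_threshold` tags. The table, column, and threshold are assumptions for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-confidential-project")

# Completeness rule: the percentage of non-null values in one column.
sql = """
    SELECT
      COUNT(*) AS rows_validated,
      SAFE_DIVIDE(100 * COUNTIF(email IS NOT NULL), COUNT(*)) AS success_percentage
    FROM `my-confidential-project.crm.customers`
"""
row = list(client.query(sql).result())[0]

acceptable_threshold = 99.0
meets_threshold = row.success_percentage >= acceptable_threshold
print(row.rows_validated, row.success_percentage, meets_threshold)
```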
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Three BigQuery data warehouses: one stores the sensitive data, one stores the non-sensitive data, and the third stores the quality rule metrics.
- Data Catalog stores the data quality results in tag templates and tags.
- Cloud Scheduler defines when Cloud Data Quality Engine runs.
- Three Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
- The third instance runs the Cloud Data Quality Engine.
- Cloud Data Quality Engine defines data quality rules and schedules data quality checks for tables and columns.
- Pub/Sub publishes findings.
A Looker Studio dashboard displays the data quality reports at both the table level and the column level.
To detect issues related to this control, the architecture includes the following findings:
- Data is sensitive, but no data quality tag templates are applied (correctness and completeness).
- Data is sensitive but the data quality tag isn't applied to the sensitive column.
- Data is sensitive, but data quality results aren't within the threshold set in the rule.
- Data isn't sensitive and data quality results aren't within the threshold that was set by the rule.
As an alternative to the Cloud Data Quality Engine, you can configure Dataplex data quality tasks.
13. Cost management principles are established and applied
This control requires the ability to inspect data assets to confirm cost usage, based on policy requirements and the data architecture. Cost metrics should be comprehensive and not just limited to storage use and movement.
The architecture uses Data Catalog to add the following tags to the `cost_metrics` tag template:

- `total_query_bytes_billed`: the total number of query bytes that were billed for this data asset since the start of the current month
- `total_storage_bytes_billed`: the total number of storage bytes that were billed for this data asset since the start of the current month
- `total_bytes_transferred`: the sum of bytes transferred cross-region into this data asset
- `estimated_query_cost`: the estimated query cost, in US dollars, for the data asset for the current month
- `estimated_storage_cost`: the estimated storage cost, in US dollars, for the data asset for the current month
- `estimated_egress_cost`: the estimated egress cost, in US dollars, for the current month in which the data asset was used as a destination table

The architecture exports pricing information from Cloud Billing to a BigQuery table named `cloud_pricing_export`.
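For example, the following Python sketch derives a value like the `total_query_bytes_billed` tag for a single table from the BigQuery `INFORMATION_SCHEMA.JOBS` view; the region, project, dataset, and table names are assumptions for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-confidential-project")

# Sum the bytes billed this month for queries that referenced the table.
sql = """
    SELECT SUM(total_bytes_billed) AS total_query_bytes_billed
    FROM `region-us-central1`.INFORMATION_SCHEMA.JOBS
    WHERE creation_time >= TIMESTAMP(DATE_TRUNC(CURRENT_DATE(), MONTH))
      AND EXISTS (
        SELECT 1 FROM UNNEST(referenced_tables) AS t
        WHERE t.dataset_id = 'crm' AND t.table_id = 'customers'
      )
"""
row = list(client.query(sql).result())[0]
print(f"total_query_bytes_billed: {row.total_query_bytes_billed}")
```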
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Cloud Billing provides billing information.
- Data Catalog stores the cost information in tag templates and tags.
- BigQuery stores the exported pricing information and the historical query job information through its built-in `INFORMATION_SCHEMA` views.
- Two Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
- Pub/Sub publishes findings.
To detect issues related to this control, the architecture checks whether sensitive data assets exist without having cost metrics associated with them.
14. Data provenance and lineage are understood
This control requires the ability to inspect the traceability of the data asset from its source and any changes to the data asset lineage.
To maintain information about data provenance and lineage, the architecture uses the built-in Data Lineage features in Data Catalog. Additionally, the data ingestion scripts define the ultimate source and add the source as an additional node to the data lineage graph.
To meet the requirements of this control, the architecture uses Data Catalog to add the `ultimate_source` tag to the `CDMC controls` tag template. The `ultimate_source` tag defines the source for this data asset.
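For illustration, the following Python sketch uses the Data Lineage API to list the lineage links that feed a BigQuery table; the project, location, and table names are assumptions.

```python
from google.cloud import datacatalog_lineage_v1

client = datacatalog_lineage_v1.LineageClient()

# Search for lineage links where the table is the downstream (target) asset.
request = datacatalog_lineage_v1.SearchLinksRequest(
    parent="projects/my-governance-project/locations/us-central1",
    target=datacatalog_lineage_v1.EntityReference(
        fully_qualified_name="bigquery:my-confidential-project.crm.customers"
    ),
)

for link in client.search_links(request=request):
    print(f"{link.source.fully_qualified_name} -> {link.target.fully_qualified_name}")
```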
The following diagram shows the services that apply to this control.
To meet the requirements of this control, the architecture uses the following services:
- Two BigQuery data warehouses: one stores the confidential data, and the other stores the ultimate source data.
- Data Catalog stores the ultimate source in tag templates and tags.
- Data ingestion scripts load the data from Cloud Storage, define the ultimate source, and add the source to the data lineage graph.
- Two Cloud Run instances, as follows:
- One instance runs Report Engine, which checks whether tags are applied and publishes results.
- Another instance runs Tag Engine, which tags the data in the secured data warehouse.
- Pub/Sub publishes findings.
To detect issues related to this control, the architecture includes the following checks:
- Sensitive data is identified without an ultimate source tag.
- The lineage graph isn't populated for sensitive data assets.
Tag reference
This section describes the tag templates and tags that this architecture uses to meet the requirements of the CDMC key controls.
Table-level CDMC control tag templates
The following table lists the tags that are part of the CDMC control tag template and that are applied to tables.
Tag | Tag ID | Applicable key control |
---|---|---|
Approved Storage Location | `approved_storage_location` | 4 |
Approved Use | `approved_use` | 8 |
Data Owner Email | `data_owner_email` | 2 |
Data Owner Name | `data_owner_name` | 2 |
Encryption Method | `encryption_method` | 9 |
Expiration Action | `expiration_action` | 11 |
Is Authoritative | `is_authoritative` | 3 |
Is Sensitive | `is_sensitive` | 6 |
Sensitive Category | `sensitive_category` | 6 |
Sharing Scope Geography | `sharing_scope_geography` | 8 |
Sharing Scope Legal Entity | `sharing_scope_legal_entity` | 8 |
Retention Period | `retention_period` | 11 |
Ultimate Source | `ultimate_source` | 14 |
Impact Assessment tag template
The following table lists the tags that are part of the Impact Assessment tag template and that are applied to tables.
Tag | Tag ID | Applicable key control |
---|---|---|
Subject Locations | `subject_locations` | 10 |
Is DPIA impact assessment | `is_dpia` | 10 |
Is PIA impact assessment | `is_pia` | 10 |
Impact assessment reports | `impact_assessment_reports` | 10 |
Most recent impact assessment | `most_recent_assessment` | 10 |
Oldest impact assessment | `oldest_assessment` | 10 |
Cost Metrics tag template
The following table lists the tags that are part of the Cost Metrics tag template and that are applied to tables.
Tag | Tag ID | Applicable key control |
---|---|---|
Estimated Query Cost | `estimated_query_cost` | 13 |
Estimated Storage Cost | `estimated_storage_cost` | 13 |
Estimated Egress Cost | `estimated_egress_cost` | 13 |
Total Query Bytes Billed | `total_query_bytes_billed` | 13 |
Total Storage Bytes Billed | `total_storage_bytes_billed` | 13 |
Total Bytes Transferred | `total_bytes_transferred` | 13 |
Data Sensitivity tag template
The following table lists the tags that are part of the Data Sensitivity tag template and that are applied to fields.
Tag | Tag ID | Applicable key control |
---|---|---|
Sensitive Field | `sensitive_field` | 6 |
Sensitive Type | `sensitive_category` | 6 |
Security Policy tag template
The following table lists the tags that are part of the Security Policy tag template and that are applied to fields.
Tag | Tag ID | Applicable key control |
---|---|---|
Application De-Identification Method | `app_deid_method` | 9 |
Platform De-Identification Method | `platform_deid_method` | 9 |
Data Quality tag templates
The following table lists the tags that are part of the Completeness and Correctness Data Quality tag templates and that are applied to fields.
Tag | Tag ID | Applicable key control |
---|---|---|
Acceptable Threshold | `acceptable_threshold` | 12 |
Column Name | `column_name` | 12 |
Meets Threshold | `meets_threshold` | 12 |
Metric | `metric` | 12 |
Most Recent Run | `most_recent_run` | 12 |
Rows Validated | `rows_validated` | 12 |
Success Percentage | `success_percentage` | 12 |
Field-level CDMC policy tags
The following table lists the policy tags that are part of the CDMC Sensitive Data Classification policy tag taxonomy and that are applied to fields. These policy tags restrict field-level access and enable platform-level data de-identification.
Data Classification | Tag Name | Applicable key control |
---|---|---|
Personal Identifiable Information | `personal_identifiable_information` | 7 |
Personal Information | `personal_information` | 7 |
Sensitive Personal Identifiable Information | `sensitive_personal_identifiable_information` | 7 |
Sensitive Personal Information | `sensitive_personal_data` | 7 |
Prepopulated technical metadata
The following table lists the technical metadata that is synchronized by default in Data Catalog for all BigQuery data assets.
Metadata | Applicable key control |
---|---|
Asset Type | — |
Creation Time | — |
Expiration Time | 11 |
Location | 4 |
Resource URL | 3 |
What's next
- Learn more about CDMC.
- Read about the security controls used by the secured data warehouse blueprint.
- Discover Data Catalog.
- Learn more about Dataplex.
- Learn more about Tag Engine.
- Implement this solution using the Google Cloud CDMC Reference Architecture in GitHub.