Background
While transitioning to CTA PAD, some organizations may need their data to be temporarily available in both BigQuery and Redshift.
CTA can facilitate this transition by syncing data from BigQuery (PAD) into Google Cloud Storage. Once data has been made available in GCS, partners have many options at their disposal for syncing that data into Redshift.
Summary
For partners who need data available in Redshift while they transition, CTA can take the additional step of syncing the partner’s PAD data to a Google Cloud Storage bucket located in the Google Project provisioned for the partner.
For example, a test partner named Partner A would have data synced to a bucket in their project, which might have a GCS URI of gs://cta-partner-a-data-3edf27 and contain subfolders for different data sources that Partner A needs to have available in both BQ and Redshift.
Process
From start to finish, the steps needed to deliver PAD data into Redshift are as follows:
CTA:
- CTA syncs data into BigQuery
- CTA then delivers that data from BigQuery into Google Cloud Storage (GCS)
Partner:
- Sync data from GCS into S3
- Copy data from S3 into Redshift
Export From BQ to GCS (owned by CTA)
Google Workflows
CTA uses Google Workflows to sync data from BigQuery into GCS. Two workflows are used to export data from BQ to GCS. The first is a generic workflow (linked in the gist below) that exports a specified table to GCS by using the BigQuery API to submit an export job, which performs a full table export to the specified GCS bucket. The second workflow exists to make setting up exports easier: it takes a list of tables and a destination bucket and calls the first workflow for each table. This second workflow can be used to set up multiple schedules, each with a different set of tables.
You can find an example of the Workflow file in this GitHub gist. This template is easy to apply to any data sources that need to be synced from BQ to GCS.
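For illustration, here is a minimal Python sketch of the export step those workflows perform: submitting a BigQuery extract job that writes a table to GCS as gzipped CSV. The project, dataset, table, and destination path below are hypothetical, and CTA's actual implementation runs inside Google Workflows rather than Python.

```python
# Minimal sketch of the export job the workflow submits via the BigQuery API.
# Project, dataset, table, and object path are hypothetical examples.
from google.cloud import bigquery

client = bigquery.Client(project="cta-partner-a")  # hypothetical project ID

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)

# The wildcard (*) lets BigQuery shard the export into numbered part files,
# which is required for tables larger than 1 GB.
destination_uri = "gs://cta-partner-a-data-3edf27/hubble/election_district_*.csv.gz"

extract_job = client.extract_table(
    "cta-partner-a.hubble.election_district",  # hypothetical source table
    destination_uri,
    job_config=job_config,
)
extract_job.result()  # wait for the export job to finish
```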
Files are delivered as GZIP compressed CSV files with paths in the following format:
gs://<bucket>/.../<table name>_<part number>.csv.gz
- Part number: BQ can export at most 1 GB per file. To export tables larger than 1 GB, we deliver multiple files, each with a part number at the end of the file name (e.g., hubble_election__district_000000000001.csv.gz, hubble_election__district_000000000002.csv.gz, etc.).
- Scheduling: Workflows can be scheduled to run in accordance with the refresh schedule of the data being synced into BigQuery.
- Monitoring and Alerting: These workflows produce logs that can be monitored and alerted on, so that sync failures are noticed immediately and corrected as soon as possible.
Costs and Limitations
Batch exports from BigQuery to GCS come at no cost.
There are some limitations on the rate at which data may be synced from BigQuery to GCS:
- Maximum exported bytes per day: 50 TB
- Maximum exports per day: 100,000
- Maximum table size exported to a single file: 1 GB
You can read more about these constraints here.
Export From GCS to Redshift (owned by partner)
GCS to S3
There are multiple methods for syncing data between GCS and S3. Some examples are listed below; a short Python sketch follows the list.
- Google Cloud Platform
  - gsutil - the rsync command supports cross-cloud mirroring
  - Composer/Airflow - GCSToS3Operator is one option, though Airflow can also run any arbitrary code, so any method of syncing GCS to S3 (e.g., using Python) could be implemented and scheduled in Airflow.
  - Google Workflows (implementing the AWS S3 REST API)
  - Google Dataproc
- AWS
  - AWS DataSync
- Third-party Libraries
  - boto3 (Python)
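As one example of the third-party-library route, here is a minimal sketch that mirrors objects from a GCS bucket into S3 using the google-cloud-storage and boto3 clients. The S3 bucket name and prefix are hypothetical, and large files would normally be streamed or staged to disk rather than held in memory.

```python
# Minimal sketch: copy objects from a GCS bucket to an S3 bucket.
# Destination bucket and prefix are hypothetical examples.
import boto3
from google.cloud import storage

GCS_BUCKET = "cta-partner-a-data-3edf27"   # example bucket from this article
S3_BUCKET = "partner-a-redshift-staging"   # hypothetical destination bucket
PREFIX = "hubble/"                         # hypothetical subfolder to mirror

gcs_client = storage.Client()
s3_client = boto3.client("s3")

for blob in gcs_client.list_blobs(GCS_BUCKET, prefix=PREFIX):
    # Download each object and upload it to S3 under the same key.
    s3_client.put_object(
        Bucket=S3_BUCKET,
        Key=blob.name,
        Body=blob.download_as_bytes(),
    )
    print(f"Copied gs://{GCS_BUCKET}/{blob.name} -> s3://{S3_BUCKET}/{blob.name}")
```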
S3 to Redshift
There are multiple options for loading data from S3 into Redshift. The most common method is to run SQL using the COPY command, as in the sketch below.
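A minimal sketch of that approach, run from Python with psycopg2 (any Redshift-capable SQL client works). The table, bucket, IAM role, and cluster names are hypothetical, and IGNOREHEADER assumes the exported CSVs include a header row.

```python
# Minimal sketch: load gzipped CSV exports from S3 into Redshift with COPY.
# Table, bucket, IAM role, and cluster names are hypothetical examples.
import psycopg2

copy_sql = """
    COPY analytics.hubble_election__district
    FROM 's3://partner-a-redshift-staging/hubble/hubble_election__district_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS CSV
    GZIP
    IGNOREHEADER 1;
"""

conn = psycopg2.connect(
    host="partner-a-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="loader",
    password="...",  # use a secrets manager in practice
)
with conn, conn.cursor() as cur:
    # COPY loads every S3 object whose key starts with the given prefix,
    # so all part files from a multi-file BigQuery export are loaded together.
    cur.execute(copy_sql)
```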
Recommendations
In terms of ease of implementation, CTA’s recommendation is as follows:
- Using managed services and no scheduler, we recommend a combination of AWS DataSync and the Redshift COPY command (see the sketch after this list).
- If you wish to use an orchestration tool, we recommend Composer/Airflow.
- Google Workflows is also an option in GCP, but it does not have a pre-built operator/connector for AWS S3. Implementing the AWS S3 REST API is straightforward, however.
- A final option is to wrap a library like boto3 in Python (other languages have equivalent libraries).
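For the managed-service route, the per-run step can be as small as starting a pre-configured DataSync task. The sketch below assumes a task has already been created with the partner's GCS bucket as its source location and an S3 bucket as its destination; the ARN shown is hypothetical.

```python
# Minimal sketch: start an existing AWS DataSync task execution.
# The task ARN is a hypothetical placeholder; creating the task and its
# source/destination locations is a one-time setup in the DataSync console or API.
import boto3

datasync = boto3.client("datasync")

response = datasync.start_task_execution(
    TaskArn="arn:aws:datasync:us-east-1:123456789012:task/task-0abc123def456",
)
print("Started execution:", response["TaskExecutionArn"])
```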
Dataproc, a fully managed service, is also an option. Because it is based on Hadoop, it is comparatively complex to run, and it requires VM (non-serverless) resources, making it more expensive than the other options. Given the alternatives above, we would not recommend Dataproc unless it is already being used to run other workloads.
Schema and Data Type Differences
To reduce transfer complexity, we recommend using Avro or Parquet, formats in which schema information is embedded with the data. If you choose CSV or a similar simple delimited format, you may need to explicitly handle data type conversions that are not covered by implicit conversion.
Data Types
AWS Redshift and Google BigQuery data types can be found here. The data types under the “Data Types” header will be handled implicitly when data is loaded into Redshift via the COPY command.
When choosing a format, CTA recommends Avro, as data types are embedded directly in the export. Here's an example of the options that can be automated when ingesting Avro data into Redshift using COPY.
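A minimal sketch of such a COPY, again using psycopg2 with hypothetical table, bucket, IAM role, and cluster names. The 'auto' option maps Avro fields to table columns by name, so column types come from the embedded Avro schema rather than manual casts.

```python
# Minimal sketch: COPY Avro exports from S3 into Redshift using the 'auto'
# field-to-column mapping. All names are hypothetical examples.
import psycopg2

avro_copy_sql = """
    COPY analytics.hubble_election__district
    FROM 's3://partner-a-redshift-staging/hubble/avro/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS AVRO 'auto';
"""

with psycopg2.connect(
    host="partner-a-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="loader",
    password="...",  # use a secrets manager in practice
) as conn, conn.cursor() as cur:
    cur.execute(avro_copy_sql)
```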
When values don't convert appropriately, you may need to cast or convert the data type on ingestion into Redshift. Depending on where you run the COPY command, the conversion can be done with manual SQL, or with more templated, repeatable methods, for example using a tool like dbt.