CTA can help you transfer historical data from a variety of sources, including other data warehouses. This support article focuses on a one-time transfer of data from AWS Redshift to Google BigQuery.
- Estimated time to complete yourself: approximately 1 hour of setup. Total time depends on the number of tables and the amount of data to transfer.
- Estimated time for CTA to complete: once all information is received, 5 business days to set up the transfer.
Cost
There is no additional charge from Google to transfer data from Redshift to BigQuery. However, you may accrue costs in AWS for:
- Exporting data to S3
- Network egress from S3 to Google's network
To transfer data from Redshift, CTA will set up a BigQuery Data Transfer in your project. Here is the information you will need to provide to CTA (example formats follow this list):
- AWS Redshift JDBC URL, username, and password
- AWS region: the region your Redshift warehouse is running in.
  - Note: this could be multi-region (US or EU). If so, just let us know.
- AWS S3 bucket for staging data: AWS stages Redshift data in S3, so we will need:
  - S3 bucket URI (with a prefix, if needed, e.g., s3://bucket/prefix)
  - AWS IAM user access key pair
- Redshift schema
- Table name patterns (optional; if not provided, all tables in the schema will be imported)
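For reference, these values generally take the shape shown below. The pieces in angle brackets are placeholders rather than real values, and 5439 is only the default Redshift port; copy the exact JDBC URL from your Redshift console.

  JDBC URL: jdbc:redshift://<cluster-endpoint>:5439/<database>
  AWS region: us-east-1 (as an example)
  S3 staging URI: s3://<bucket_name>/<optional_prefix>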
Required Steps
- Create an S3 bucket in the same region as your Redshift cluster.
- If possible, create an IAM user dedicated to this transfer, and attach the following AWS managed policies*:
  - AmazonS3FullAccess
  - AmazonRedshiftReadOnlyAccess
- Ensure that the policy attached to this user includes the s3:GetBucketLocation permission. Here is an example policy, but please verify with CTA that it is correct:

  {
    "Statement": [
      {
        "Action": [
          "s3:ListBucket",
          "s3:GetBucketLocation"
        ],
        "Effect": "Allow",
        "Resource": "arn:aws:s3:::<bucket_name>",
        "Sid": "239u23rjklq2f3jlq23"
      },
      {
        "Action": [
          "s3:PutObject",
          "s3:GetObject",
          "s3:DeleteObject"
        ],
        "Effect": "Allow",
        "Resource": "arn:aws:s3:::<bucket_name>/*",
        "Sid": "239u23rjklq2f3jlq24"
      }
    ],
    "Version": "2012-10-17"
  }
- Obtain the JDBC URL, username, and password. You can follow the instructions here to obtain these from AWS. If you haven't yet been provided with a 1Password vault, please contact help@techallies.org. Once you have the vault, please create 2 logins:
  - AWS Redshift credentials, with the username in the username field, the password in the password field, and the JDBC URL in the website field.
  - AWS S3 IAM credentials (see below), with the IAM Key ID in the username field, the IAM Secret Key in the password field, and the S3 bucket URI in the website field.
- Make sure the user has the proper access on Redshift for the schema and tables that will be transferred. For each schema, the following grants should be applied, with <user> and <schema> replaced accordingly (see the example GRANT statements after this list).
- If your Redshift cluster is not accessible from the Internet, you will need to whitelist Google's IP addresses. Please contact CTA to confirm the latest set of IP addresses.
- Please make a request via help@techallies.org to begin the data transfer process. In this email, include the list of schemas and tables that the user has been granted access to transfer.
- CTA will create a transfer for each schema. You’ll be notified when all transfers are complete.
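For the schema access mentioned above, a typical set of Redshift grants looks like the sketch below, with <user> and <schema> as placeholders; please confirm the exact statements with CTA for your setup.

  -- Let the transfer user see objects in the schema
  GRANT USAGE ON SCHEMA <schema> TO <user>;

  -- Let the transfer user read all existing tables in the schema
  GRANT SELECT ON ALL TABLES IN SCHEMA <schema> TO <user>;

Note that GRANT SELECT ON ALL TABLES only covers tables that exist when it is run, so re-run it if new tables are added before the transfer.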
A few things to be aware of:
- BigQuery has a load quota of 15 TB, per load job, per table. Internally, Amazon Redshift compresses the table data, so the exported table size will be larger than the table size reported by Amazon Redshift.
- When you transfer data from Amazon S3, it is possible that some of your data will not be transferred to BigQuery, particularly if the files were added to the bucket very recently. Allow approximately 10 minutes after a file is added to the bucket for it to become available to the BigQuery Data Transfer Service.
- If you have concerns about resource usage on your Redshift cluster, you may want to consider Redshift workload management (WLM). There is some more information available here on how to set this up, and a minimal sketch follows this list.
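As one illustrative option (an assumption on our part, not something CTA requires), you could place the Redshift user that runs the transfer into its own manual WLM queue so the export cannot crowd out other workloads. The bq_transfer user group below is hypothetical, and the snippet is only a sketch of the wlm_json_configuration parameter; please verify it against the AWS documentation and with CTA before applying it.

  [
    {
      "user_group": ["bq_transfer"],
      "query_concurrency": 1,
      "memory_percent_to_use": 20
    },
    {
      "query_concurrency": 5
    }
  ]

Here the first queue limits the hypothetical bq_transfer group to one concurrent query and 20% of the cluster's memory, while the default queue handles everything else; you would also need to create the group and add the transfer user to it.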
* If you require more fine-grained permissions and cannot attach these managed policies, contact CTA for help.
Please contact help@techallies.org and share your query with our team so that we can see exactly what you are seeing.