CTA can help ingest batch data stored in S3 or GCS into BigQuery. The following are some preferences and recommendations for how to structure and format your data to simplify the data ingestion process.
File Formats
CTA’s preferred file formats are Avro, Parquet, and ORC. These formats are self-describing, meaning they include the schema of the data they contain. Each format differs in how it stores and compresses data, as well as how it defines the schema. Ingesting files with a defined schema simplifies the process and allows the BigQuery table schema to be updated automatically when the source schema changes.
If your source data is in CSV or JSON Lines format, we recommend that you provide CTA with a schema definition file (e.g., a data dictionary) to ensure the BigQuery table schema matches what is expected. Because these formats do not carry a defined schema, any planned changes to the initial schema provided to CTA should be communicated at least 3 to 4 weeks before implementation to reduce downtime to partner syncs.
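As a rough illustration of what a schema definition file could look like, the sketch below writes a JSON array of column definitions in the layout BigQuery uses for JSON schema files (`name`, `type`, `mode`, `description`). The column names and the helper functions `schema_to_json` and `write_schema` are hypothetical, and this article does not state which schema file layout CTA expects, so confirm the format with CTA before relying on it.

```python
import json

# Hypothetical columns, for illustration only; replace with your own fields.
SCHEMA = [
    {"name": "contact_id", "type": "STRING", "mode": "REQUIRED",
     "description": "Unique ID of the contact"},
    {"name": "signup_date", "type": "DATE", "mode": "NULLABLE",
     "description": "Date the contact signed up"},
    {"name": "donation_amount", "type": "NUMERIC", "mode": "NULLABLE",
     "description": "Donation amount in USD"},
]


def schema_to_json(schema):
    """Serialize the column definitions as a JSON array."""
    return json.dumps(schema, indent=2)


def write_schema(schema, path):
    """Write the schema definition to a UTF-8 encoded file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(schema_to_json(schema))
        f.write("\n")
```

Keeping a file like this under version control next to the export job makes it easier to spot and communicate schema changes ahead of time.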
Below you will find all of the supported file formats CTA can ingest into BigQuery, including additional documentation and requirements for each format.
Avro
Apache Avro documentation. Avro is the preferred format for loading data into BigQuery. It allows data to be read in parallel and for the schema to be automatically retrieved from source data.
- Avro to BigQuery Data Type Conversions
- Supported compression types for Avro files
  - Snappy
  - DEFLATE
Parquet
Apache Parquet documentation
- Parquet to BigQuery Data Type Conversions
- Supported compression types for Parquet files
  - GZip
  - LZO_1C and LZO_1X
  - Snappy
  - ZSTD
ORC
Apache ORC documentation
- ORC to BigQuery Data Type Conversions
- Supported compression types for ORC files
  - Zlib
  - Snappy
  - LZO
  - LZ4
CSV
There are many limitations when loading CSV data into BigQuery. Please refer to this Google-provided documentation to ensure your CSV data is correctly formatted for loading into BigQuery.
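One common source of CSV load failures is inconsistent quoting of fields that contain commas, quotes, or line breaks. The sketch below, using Python's standard `csv` module, writes a header row followed by data rows and quotes only the fields that need it, doubling any embedded double quotes. The function name `rows_to_csv` and the sample columns are hypothetical; this is one possible way to produce consistently quoted output, not a CTA requirement.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Return CSV text with a header row and consistently quoted fields."""
    buf = io.StringIO()
    # QUOTE_MINIMAL quotes only fields containing the delimiter, a quote
    # character, or a line break; embedded quotes are doubled.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def write_csv(header, rows, path):
    """Write the CSV text to a UTF-8 encoded file."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(rows_to_csv(header, rows))
```

For example, `rows_to_csv(["id", "note"], [["1", 'said "hi", left']])` produces a second line of `1,"said ""hi"", left"`.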
JSON Lines
JSON Lines documentation
BigQuery only supports loading JSON data when it is in JSON Lines format. This means that each record in a JSON file should be an independent JSON object, and records should be delimited by a new line. Please refer to this Google BigQuery document for more information on loading JSON data.
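To make the one-object-per-line requirement concrete, here is a minimal sketch that converts a list of records into JSON Lines text using Python's standard `json` module. The function names and sample records are hypothetical. Because `json.dumps` without an `indent` argument emits each object on a single line and escapes line breaks inside string values, each record stays on its own line.

```python
import json


def to_json_lines(records):
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def write_json_lines(records, path):
    """Write the records to a UTF-8 encoded JSON Lines file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_json_lines(records))
```

A file that holds a single top-level JSON array, or objects pretty-printed across several lines, would need to be converted in this way before loading.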
Folder Structure
Below is a chart highlighting CTA’s preferences for S3 file structures. To view the image in a bigger window, right-click on the image and select “Open Image in a New Tab”. This will allow you to zoom in and out and navigate the chart.
For any questions about CTA’s data format best practices and recommendations, please reach out to help@techallies.org.