Data Format for Ingestion - Best Practices and CTA Recommendations

Created by Ben Deverman, Modified on Mon, 25 Mar at 6:58 PM by Rebecca Sokol-Snyder

CTA can help ingest batch data stored in S3 or GCS into BigQuery. The following are some preferences and recommendations for how to structure and format your data to simplify the data ingestion process. 

File Formats

CTA’s preferred file formats are Avro, Parquet, and ORC. These formats are self-describing, meaning each file includes the schema of the data it contains. Each format differs in how it stores and compresses data, as well as how it defines the schema. Because the schema travels with the file, ingestion is simpler and schema changes can be picked up automatically.


If your source data is in CSV or JSON Lines format, we recommend that you provide CTA with a schema definition file (e.g., a data dictionary) to ensure the BigQuery table schema matches what is expected. Because these formats do not carry their own schema, any planned changes to the initial schema provided to CTA should be communicated at least 3 to 4 weeks prior to implementation, to reduce any downtime to partner syncs.
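As a rough illustration of what a schema definition file could look like, the sketch below turns a small data dictionary into the JSON field list (name, type, mode, description) that BigQuery accepts as a schema file. The column names and the dictionary layout are hypothetical, and CTA may prefer a different format for the file, so treat this as a starting point rather than a required template.

```python
import json

# Hypothetical data dictionary: column name -> (BigQuery type, nullable?, description).
DATA_DICTIONARY = {
    "contact_id": ("INTEGER", False, "Unique identifier for the contact"),
    "email": ("STRING", True, "Primary email address"),
    "created_at": ("TIMESTAMP", False, "When the record was created (UTC)"),
}


def build_schema(data_dictionary):
    """Convert the data dictionary into a list of BigQuery-style field definitions."""
    return [
        {
            "name": name,
            "type": field_type,
            "mode": "NULLABLE" if nullable else "REQUIRED",
            "description": description,
        }
        for name, (field_type, nullable, description) in data_dictionary.items()
    ]


schema = build_schema(DATA_DICTIONARY)
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Saving the printed JSON alongside the data files gives both sides a single, reviewable record of the expected columns, which also makes later schema changes easier to communicate.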


Below you will find all of the supported file formats CTA can ingest into BigQuery, including additional documentation and requirements for each format.


Avro

Apache Avro documentation. Avro is the preferred format for loading data into BigQuery. It allows data to be read in parallel and for the schema to be automatically retrieved from source data.


Parquet

Apache Parquet documentation


ORC

Apache ORC documentation

CSV

There are many limitations when loading CSV data into BigQuery. Please refer to this Google-provided documentation to ensure your CSV data is correctly formatted for loading into BigQuery.
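Before handing CSV files over, a quick local check can catch some common formatting problems. The sketch below is illustrative only: the function name is hypothetical, and the three checks (a leading byte order mark, duplicate header names, and records whose field count differs from the header) are assumptions about typical trouble spots, not a complete list of BigQuery's CSV requirements.

```python
import csv
import io


def find_csv_problems(text, delimiter=","):
    """Return a list of human-readable problems found in CSV text."""
    problems = []
    # A byte order mark at the start of the file can end up inside the first column name.
    if text.startswith("\ufeff"):
        problems.append("file starts with a byte order mark (BOM)")
        text = text[1:]
    # newline="" lets the csv module handle line endings inside quoted fields itself.
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    if len(set(header)) != len(header):
        problems.append("header contains duplicate column names")
    # The header is record 1, so data records are numbered from 2.
    for record_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"record {record_number} has {len(row)} fields, expected {len(header)}"
            )
    return problems


sample = 'id,name\n1,Ada\n2,"Lovelace, A."\n'
print(find_csv_problems(sample))
```

An empty list means none of these particular checks failed; the file may still need changes to meet the requirements in Google's documentation.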


JSON Lines

JSON Lines documentation

BigQuery only supports loading JSON data when it is in JSON Lines format. This means that each record in a JSON file should be an independent JSON object, and records should be delimited by a new line. Please refer to this Google BigQuery document for more information on loading JSON data.
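To make the difference concrete, the sketch below converts a regular JSON array into JSON Lines using only the Python standard library. The sample records and the function name are hypothetical; the point is simply that each record becomes one complete JSON object on its own line.

```python
import json


def to_json_lines(records):
    """Serialize an iterable of dicts as JSON Lines: one compact object per line."""
    # json.dumps without an indent never emits raw newlines, so each record stays on one line.
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


# A single JSON array holding all records, which is not JSON Lines.
json_array_text = '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]'

records = json.loads(json_array_text)
json_lines_text = to_json_lines(records)
print(json_lines_text, end="")
```

Nested values such as the `tags` list are kept inside each object; only the outer array wrapper and the commas between records are dropped.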

 

Folder Structure

Below is a chart highlighting CTA’s preferences for S3 file structures. To view the image in a bigger window, right-click on the image and select “Open Image in a New Tab”. This will allow you to zoom in and out and navigate the chart.

 

 

For any questions about CTA’s data format best practices and recommendations, please reach out to help@techallies.org
