Data Format for Ingestion - Best Practices and CTA Recommendations

Modified on Mon, 25 Mar, 2024 at 6:58 PM

CTA can help ingest batch data stored in S3 or GCS into BigQuery. The following are some preferences and recommendations for how to structure and format your data to simplify the data ingestion process.

File Formats

CTA’s preferred file formats are Avro, Parquet, and ORC. These formats are self-describing, meaning each file includes the schema of the data it contains. Each format differs in how it stores and compresses data, as well as how it defines the schema. Ingesting files that carry their own schema simplifies the process and allows the BigQuery table schema to be updated automatically if the source schema changes.

If your source data is in CSV or JSON Lines format, we recommend that you provide CTA with a schema definition file (e.g., a data dictionary) to ensure the BigQuery table schema matches what is expected. Because these formats do not carry their own schema, any planned changes to the initial schema provided to CTA should be communicated at least 3 to 4 weeks prior to implementation to reduce any downtime to partner syncs.
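As an illustration of what a schema definition file could contain, here is a minimal Python sketch that writes one as a JSON array of column definitions. It assumes the layout BigQuery uses for JSON schema files (name, type, mode, and description per column). CTA may prefer a different data dictionary format, and the column names and the helpers validate_schema and schema_to_json are hypothetical.

```python
import json

# Hypothetical example columns; replace with the fields in your own data.
EXAMPLE_SCHEMA = [
    {"name": "contact_id", "type": "STRING", "mode": "REQUIRED",
     "description": "Unique identifier for the contact"},
    {"name": "signup_date", "type": "DATE", "mode": "NULLABLE",
     "description": "Date the contact signed up"},
    {"name": "donation_total", "type": "NUMERIC", "mode": "NULLABLE",
     "description": "Lifetime donation amount"},
]


def validate_schema(fields):
    """Raise ValueError if a field lacks a name or type, or a name repeats."""
    seen = set()
    for field in fields:
        for key in ("name", "type"):
            if not field.get(key):
                raise ValueError(f"field is missing {key!r}: {field}")
        # Compare names case-insensitively so "ID" and "id" count as a clash.
        lowered = field["name"].lower()
        if lowered in seen:
            raise ValueError(f"duplicate column name: {field['name']}")
        seen.add(lowered)


def schema_to_json(fields):
    """Validate the field list and serialize it as a JSON array."""
    validate_schema(fields)
    return json.dumps(fields, indent=2)


if __name__ == "__main__":
    print(schema_to_json(EXAMPLE_SCHEMA))
```

Keeping a file like this under version control alongside your export job makes it easier to spot and communicate schema changes ahead of time.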

Below you will find all of the supported file formats CTA can ingest into BigQuery, including additional documentation and requirements for each format.

Avro

Apache Avro documentation. Avro is the preferred format for loading data into BigQuery. It allows data to be read in parallel and for the schema to be automatically retrieved from source data.

Avro to BigQuery Data Type Conversions
Supported Compression types for Avro files
- Snappy
- DEFLATE

Parquet

Apache Parquet documentation

Parquet to BigQuery Data Type Conversions
Supported Compression types for Parquet files
- GZip
- LZO_1C and LZO_1X
- Snappy
- ZSTD

ORC

Apache ORC documentation

ORC to BigQuery Data Type Conversions
Supported Compression types for ORC files
- Zlib
- Snappy
- LZO
- LZ4

CSV

There are many limitations when loading CSV data into BigQuery. Please refer to this Google-provided documentation to ensure your CSV data is correctly formatted for loading into BigQuery.
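Many CSV loading problems come from inconsistent quoting or ragged rows. The following is a minimal sketch, using only Python's standard csv module, of producing a file with a header row, consistent quoting, and UTF-8 encoding. The helper names rows_to_csv and write_csv_file are hypothetical, and the sketch does not cover every limitation described in Google's documentation, which remains the authority on what BigQuery accepts.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Return CSV text with a header row, quoting fields only where needed."""
    buffer = io.StringIO()
    # QUOTE_MINIMAL quotes fields containing commas, quotes, or line breaks;
    # embedded double quotes are escaped by doubling them.
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # Reject ragged rows so every record has the same number of columns.
        if len(row) != len(header):
            raise ValueError(
                f"expected {len(header)} fields, got {len(row)}: {row}"
            )
        writer.writerow(row)
    return buffer.getvalue()


def write_csv_file(path, header, rows):
    """Write the CSV as UTF-8; newline="" stops Python translating line endings."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(rows_to_csv(header, rows))
```

For example, rows_to_csv(["id", "note"], [["1", "hello, world"]]) quotes the second field because it contains a comma, so the row still parses as two columns.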

JSON Lines

JSON Lines documentation

BigQuery only supports loading JSON data when it is in JSON Lines format. This means each record in a JSON file should be an independent JSON object, and records should be delimited by a newline. Please refer to this Google BigQuery document for more information on loading JSON data.
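If your export currently produces a single JSON array, it can be converted to JSON Lines with a short script. The following is a minimal sketch using Python's standard json module. The helper names to_json_lines and json_array_to_json_lines are hypothetical, and the sketch assumes the whole array fits in memory.

```python
import json


def to_json_lines(records):
    """Serialize each record as one JSON object per line."""
    lines = []
    for record in records:
        if not isinstance(record, dict):
            raise TypeError(f"each record must be a JSON object, got: {record!r}")
        # json.dumps escapes line breaks inside strings, so each record
        # always occupies exactly one physical line.
        lines.append(json.dumps(record) + "\n")
    return "".join(lines)


def json_array_to_json_lines(text):
    """Convert the text of a JSON array of objects into JSON Lines text."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise TypeError("expected a top-level JSON array")
    return to_json_lines(records)
```

For example, json_array_to_json_lines('[{"a": 1}, {"a": 2}]') yields two lines, each holding one object.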

Folder Structure

Below is a chart highlighting CTA’s preferences for S3 file structures. To view the image in a bigger window, right-click on the image and select “Open Image in a New Tab”. This will allow you to zoom in and out and navigate the chart.

For any questions about CTA’s data format best practices and recommendations, please reach out to help@techallies.org.