In this NotePAD article, you will learn about dbt-cta, CTA’s open-source repository containing the dbt models we use to transform and deliver data into PAD.
Topics include:
- What is this? / Why does this repository exist?
- How do I view the repository?
- How do I contribute / report a bug / make a feature request?
- What does this code do? (what even is dbt?)
- How does CTA run this code?
- How is the repository structured?
- How are the dbt projects structured?
- How can PAD users learn about the dbt that delivers their data?
What is this? / Why does this repository exist?
CTA maintains a public GitHub repository called dbt-cta, available at https://github.com/community-tech-alliance/dbt-cta.
CTA uses dbt (data build tool) to normalize, transform, and deliver data from vendors (such as digital marketing platforms) to BigQuery on behalf of PAD users. We make all of these data models publicly available to promote transparency and encourage collaboration across the progressive movement.
How do I view the repository?
This repository is public, so all you need to do is follow the URL: https://github.com/community-tech-alliance/dbt-cta. You do not need to be logged into GitHub to view the code, the history of changes to the code, or any issues submitted by other users.
If you wish to submit an issue or contribute to the repository, you will need to log in with your GitHub account (or sign up).
How do I contribute / report a bug / make a feature request?
You can read more about how to engage with this repository by checking out CONTRIBUTING.md. In general, the best way to let us know about any problems, submit feature requests, or provide general feedback is to submit an Issue. You may also reach out to us by emailing help@techallies.org, and we will make sure someone from our engineering team takes a look!
What does this code do? (what even is dbt?)
CTA uses Airbyte as our ELT platform for most of the data we sync for partners. Airbyte delivers this data in the form of raw, unflattened JSON payloads. To turn this raw data into usable tables, it needs to be parsed: we extract the fields, cast the data to the appropriate types, and sometimes join across different data sources to produce the tables PAD users need.
This is where dbt (data build tool) comes in. This tool allows us to specify models that perform different steps of the process using code that is modular, making it easier to develop, maintain, and share with the world. (Gone are the days of the 1200-line SQL files of yesteryear - at least, we hope so!)
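For example, a model that parses a raw payload might look something like the following. This is a minimal sketch in BigQuery SQL, not an actual model from the repository; the source, table, and field names are all hypothetical.

-- Hypothetical sketch: parse a raw Airbyte JSON payload into typed columns
select
    json_extract_scalar(_airbyte_data, '$.id') as id,
    cast(json_extract_scalar(_airbyte_data, '$.amount') as numeric) as amount,
    cast(json_extract_scalar(_airbyte_data, '$.created_at') as timestamp) as created_at
from {{ source('airbyte_raw', 'donations') }}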
If you are new to dbt, we highly recommend checking out their tutorial to get a feel for how dbt projects are structured.
How does CTA run this code?
CTA runs dbt in Google Cloud Composer (Airflow). This allows us to orchestrate our data syncs and dbt processes at scale. (The details of how we implement this will be explained in another article, hopefully soon!)
dbt provides a command-line tool that runs anywhere you can install it: for example, in the Terminal on your local machine, in a GitHub Actions workflow, or in a container running in the cloud. You can read more about the tool on dbt’s website.
(In addition to the command-line tool, which is free to use, dbt also offers a paid hosted product, dbt Cloud.)
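As a quick illustration, installing the CLI with the BigQuery adapter and running a project from a terminal looks something like this (assuming Python is installed and a profiles.yml is configured for your warehouse):

# install the dbt CLI with the BigQuery adapter
pip install dbt-bigquery

# from inside a dbt project folder: install packages, then run the models
dbt deps
dbt run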
How is the repository structured?
At the top level of the repository, the main folder of interest is dbt-cta, which contains all of our dbt models.
This folder contains configuration files that apply to all dbt projects (dbt_project.yml, packages.yml, and profiles.yml) and subfolders for individual vendor syncs. Each vendor folder contains a README and a subdirectory, `models`, with all the dbt model files specific to that vendor sync. Below is a hierarchy view of the folders and subfolders:
dbt-cta
│ README.md
│ CONTRIBUTING.md
│
└───dbt-cta
│ │ dbt_project.yml
│ │ packages.yml
│ │ profiles.yml
│ │
│ └───actblue
│ │ README.md
│ └───models
│ │
│ └───action_builder
│ │ README.md
│ └───models
│ │
│ └─── …
│ │
│ └───stripe
│ │ README.md
│ └───models
How are the dbt projects structured?
When you look at the models for a dbt project, you will see a few sets of models with different numbered prefixes, like “0_ctes”, “1_cta_full_refresh”, “1_cta_incremental”, and “2_partner_matviews”. The numbered prefixes indicate the dependency structure of the models: first, we run CTEs (common table expressions) to do some basic pre-processing; then we stage data in base tables in CTA’s project; and finally, we deliver materialized views to partner projects.
This is the most common structure we use in our dbt projects:
- 0_ctes: Models in this step ingest the raw JSON and parse the fields into columns with appropriate data types (strings, integers, timestamps, etc.).
- 1_cta_full_refresh: These models prepare data for staging in CTA’s BigQuery project and represent tables that are fully refreshed on each run.
- 1_cta_incremental: These models prepare data for staging in CTA’s BigQuery project and represent tables that are refreshed incrementally on each run (see the sketch after this list).
- 2_partner_matviews: These models define the materialized views delivered to PAD users. This is the final step for all materialized views you see in PAD.
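To make the difference between full-refresh and incremental tables concrete, here is a minimal sketch of what an incremental model can look like in dbt. This is illustrative only, not a model from the repository; the model and column names are hypothetical.

-- Hypothetical incremental model: builds on a 0_ctes model via ref(),
-- and only processes new rows on each run
{{ config(materialized='incremental', unique_key='id') }}

select id, amount, created_at
from {{ ref('cte_donations') }}  -- hypothetical upstream 0_ctes model
{% if is_incremental() %}
-- on incremental runs, only pull rows newer than what is already loaded
where created_at > (select max(created_at) from {{ this }})
{% endif %}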
Not all projects use the exact same set of models. For example, in some cases, like Snapchat Marketing, we deliver views in addition to materialized views, and those models live in a folder called 3_partner_views. If you want to see all the models we use, you can look at everything under “models” in dbt_project.yml.
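That “models” block maps each model folder to its default configuration. As an illustration only (the project name and settings below are hypothetical, not copied from the repository), such a block might look like this:

models:
  dbt_cta:  # hypothetical project name
    0_ctes:
      +materialized: ephemeral  # dbt inlines ephemeral models as CTEs
    1_cta_full_refresh:
      +materialized: table
    1_cta_incremental:
      +materialized: incremental
    2_partner_matviews:
      +materialized: materialized_view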
How can PAD users learn about the dbt that delivers their data?
The dbt model files in this repository match the dbt that CTA runs for all of our syncs. So, for example, if you want to learn what dbt we are using to deliver Facebook Marketing data, you can look at the files in dbt-cta/facebook_marketing and see for yourself!
Each vendor sync also includes a README with (or soon to include) additional information about the sync, the schemas of the tables we deliver from that vendor, and sample queries for using the data.
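For example, a sample query against a delivered table might look something like this (the project, dataset, table, and column names are hypothetical):

-- Hypothetical query against a delivered materialized view
select
  date(created_at) as donation_date,
  sum(amount) as total_raised
from `my-pad-project.actblue.donations`
group by donation_date
order by donation_date desc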