
Google Analytics, BigQuery & dbt: A dbt example project

The dbt (data build tool) framework has gained a lot of popularity over the last few years. It's an excellent tool for transforming your data, but it also helps with setting up a data warehouse workflow (orchestration, dependencies, environments, documentation etc.). This tutorial provides an example dbt project repository, using the Google Analytics 4 BigQuery dataset, and walks you through the most important parts of the setup, so you can bootstrap your own project quickly.

  • A starter dbt project example, using BigQuery as a data warehouse backend.
  • Using the Google Analytics 4 event data export in BigQuery.
  • Transforming Google Analytics 4 event-level data to sessions.
  • Creating a user mapping table and grouping session data into customer profiles.
  • Creating the base for marketing reports, like a channel performance report.
  • Tips regarding project structure, incremental models and workflow.

What is dbt?

I would recommend you check out the dbt introduction articles first, but for starters:

dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views. dbt does the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.
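As a minimal sketch of what that means in practice (the model and source names below are made up for illustration), a dbt model is just a .sql file containing a select statement; dbt wraps it in the DDL needed to build a table or view:

```sql
-- models/staging/stg_orders.sql (hypothetical model)
-- dbt turns this select statement into a table or view in the warehouse
select
    order_id,
    customer_id,
    cast(created_at as date) as order_date
from {{ source('shop', 'raw_orders') }}
```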

These are the most important features;

  • Orchestration: dbt automatically detects dependencies (lineage) between models, generates a dependency graph (DAG) and executes SQL (models) in the correct order.
  • Deployment: Support for development and production environments; history is tracked in the dbt repository, so it basically forces you to use a code repository.
  • Testing & integration: dbt enables you to set up tests easily (missing data, data formats etc.) and performs integration tests.
  • Flexibility: Easily recreate or backfill tables.
  • Documentation: dbt automatically generates documentation / a data catalog.
  • DRY: Reusable chunks of code in the form of macros.
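As an illustration of the DRY point, a macro packages a reusable chunk of SQL that any model can call (the macro name and logic here are hypothetical):

```sql
-- macros/cents_to_dollars.sql (hypothetical example)
{% macro cents_to_dollars(column_name) %}
    round({{ column_name }} / 100.0, 2)
{% endmacro %}
```

In a model you would then write {{ cents_to_dollars('amount_cents') }} as amount_dollars, instead of repeating the expression everywhere.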

dbt-overview-datawarehouse.PNG

And above all, dbt doesn't really have a steep learning curve.

1. The example project

First of all, the example project can be found here on GitHub. I recommend you check out the dbt documentation first and understand the concepts of dbt in general. This project can help you master (and apply) those concepts, and you can also use it to bootstrap your own project.

The project has the following characteristics:

Data warehouse

The example project uses Google BigQuery as a data warehouse backend.

Sources

The project transforms the default Google Analytics 4 BigQuery exports. Google Analytics exports to date sharded tables (and currently this is something you can't change).

Transformations

The project will transform the raw GA4 data into the following models:

  • A staging model of the Google Analytics 4 data source (where you can rename and recast your data sources into a consistent format).
  • An intermediate model to aggregate event-level GA4 data to sessions (in an incremental table, so the complete GA4 dataset doesn't have to be transformed on every run).
  • An intermediate model to build a user mapping table out of all the available user identifiers in GA4 (based on this article).
  • A core user (customer) model, aggregating online behavioral data to user (customer) level (this can be used to build a 360° customer view, for example).
  • The base for marketing reports (a channel performance report on day level).

2. Prerequisites

You can configure the example project to use your own settings and data warehouse (BigQuery). For that, you have to set up at least the following:

  • Create a Google Cloud project and enable the Google Analytics 4 exports. The project "as is" is configured to use the Google Analytics 4 example dataset.
  • Install dbt or use the cloud hosted version of dbt (free for 1 developer seat, with the benefit that it has a built-in / dedicated IDE).
  • I've used dbt Cloud, where you can configure the BigQuery dbt profile, but you could also set this up manually.
  • Check and adjust the settings in the following files: dbt_project.yml and .dbt/profiles.yml (especially if you're not using dbt Cloud).
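For reference, a minimal BigQuery profile could look like the sketch below; the project id, dataset and location are placeholders you'd replace with your own values:

```yaml
# .dbt/profiles.yml (sketch; all values are placeholders)
default:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth          # or a service account with a keyfile
      project: your-gcp-project-id
      dataset: dev_staging   # default dataset for this environment
      location: EU
      threads: 4
```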

3. Project structure

You're free to define your own structure, but I tried to keep the setup proposed by dbt in mind, so there is some logical structure and your project stays manageable as it grows over time. dbt's articles regarding style and project structure are a good read.

In the example project;

  • Intermediate models are placed within a subfolder of the most relevant mart (you could create a separate top-level folder for intermediate tables as well).
  • Intermediate tables and views are placed in a dedicated BigQuery dataset.
  • The schema and dataset parameters are interchangeable. Since BigQuery uses the latter term, dataset is used throughout the project.

Folder structure

dbt-folder-structure.PNG

BigQuery Datasets

dbt will create and populate multiple datasets within BigQuery. Next to adding some structure to your project, splitting up your models makes it easier to hide certain datasets from (groups of) users in your organization.

  • <prefix>_staging: Staging data models (renaming / recasting source tables so sources are in a consistent format)
  • <prefix>_core: Core data warehouse models
  • <prefix>_marketing: Marketing data warehouse models
  • <prefix>_int: All intermediate tables (e.g. joining and grouping different sources). Placed in a separate dataset so you can hide these tables from certain users or user groups
  • <prefix>_utils: Dataset holding utilities, like (reusable) User Defined Functions (UDFs)

The development environment will create the same datasets, but with a development prefix.
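The dataset-per-folder setup above is driven by custom schema configs in dbt_project.yml. A sketch (the project name is a placeholder; folder names follow the structure described here):

```yaml
# dbt_project.yml (fragment; project name is a placeholder)
models:
  my_dbt_project:
    staging:
      +schema: staging     # materializes into <prefix>_staging
    core:
      +schema: core        # <prefix>_core
    marketing:
      +schema: marketing   # <prefix>_marketing
```

Note that by default dbt appends the custom schema to the target schema, which is what produces the <prefix>_... dataset naming.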

dbt-bigquery-structure.PNG

dbt BigQuery dataset structure

4. GA4 source tables and incrementality

The GA4 tables are date sharded (events_YYYYMMDD) instead of partitioned. In terms of performance (and costs), it's important to query these tables correctly.

Google recommends using partitioned tables instead of sharded tables (like the Google Analytics 4 tables), so it's strange that Google Analytics 4 uses date sharding instead of partitioning.

If you use a date column to select incremental amounts of data (instead of the _table_suffix pseudo column), every incremental select will query all the date sharded tables of the Google Analytics 4 exports. That's not a big problem when you don't have a lot of data, but it is a problem (in terms of cost and performance) when handling large datasets.
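To make the difference concrete: a query that filters on the _table_suffix pseudo column only scans the matching shards, while a filter on a parsed date column touches every shard. The project id and dataset name below are placeholders following the standard GA4 export naming:

```sql
-- scans only the shards for the selected week,
-- instead of every events_YYYYMMDD table in the export
select
    event_name,
    event_timestamp
from `your-project.analytics_123456789.events_*`
where _table_suffix between '20211001' and '20211007'
```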

Some solutions;

  1. Ideally, partitioned tables are the way to go. Since we have no control over this data source, you could materialize the date sharded tables into a date (day) partitioned table yourself. However, since the Google Analytics 4 dataset can be rather large, this will generate extra costs because you're duplicating the dataset, so it might not be the logical choice in some situations.
  2. Filter on _table_suffix instead of a date column (see this article).
  3. Create a macro to insert execution_date into the model's alias parameter (article from The Telegraph team).

In our project, we've selected option #2, so we don't have to duplicate our data. If there is a better method (now or in the future), please let me know and I will update the project. See the example method below, or this file in the example project:

In the example below, the table_suffix column refers to the BigQuery _table_suffix pseudo column

{{
  config(
    materialized = 'incremental',
    partition_by = {'field': 'session_start_at', 'data_type': 'timestamp'},
    incremental_strategy = 'insert_overwrite',
    tags = ['incremental']
  )
}}

-- google analytics 4 events
with events as (

    select
        *
    from {{ ref('stg_google_analytics__events') }}

    {% if var('execution_date') != 'notset' %}

        where
            -- specific date set by variable (using the _table_suffix pseudo column for performance)
            table_suffix = '{{ var('execution_date') }}'

    {% elif is_incremental() %}

        where
            -- incremental data (using the _table_suffix pseudo column for performance)
            (table_suffix between format_date('%Y%m%d', date(_dbt_max_partition))
                and format_date('%Y%m%d', date_sub(current_date(), interval 1 day)))

    {% endif %}
)

5. Execution & scheduling

You can use the following commands to run all or specific models.

# Run all models
dbt run

# Run a specific model
dbt run --select int_events_sessions__grouped

# Run models with a specific tag
dbt run --select tag:incremental

# Run all models and rebuild tables
dbt run --full-refresh

# Run all models for a specific date -> replaces (date) partitions for incremental models,
# for example the incremental session models built on top of the GA event exports
dbt run --vars '{execution_date: 20211025}'

The relation between the models (lineage):

dbt-google-analytics-4-lineage.PNG

Scheduling

You can run dbt both locally and in dbt Cloud; the latter has a built-in scheduler to run your dbt commands. Another good option is to use Apache Airflow to schedule dbt model execution. The benefit is that you have more insight into model execution, history is tracked, and you can (re)run specific models with ease.

6. Other features

Lastly, the example project has some other small features and capabilities.

  • Profiles: Configuration for a development and production environment (.dbt/profiles.yml).
  • Model configuration: Default model configuration (e.g. dataset / suffixes / folders) in dbt_project.yml.
  • Documentation: All fields in the final dimension and fact tables are documented (within the core.yml and marketing.yml files)
  • Variables: start_date can be used to define the start date of your data and execution_date can be used to rebuild or backfill (incremental) models for a specific date (and overwrite the old partitions). dataset_udf is used to configure the BigQuery dataset where the UDFs (macros) are stored.
  • Macros: There are several macros available in the project. Within the /macros/udf folder there are several macros that create user defined functions (UDFs), such as a channel grouping function. The create_udfs.sql macro runs all the UDF macros when the model is run (so the UDFs are up-to-date).
  • Tags: Incremental models are tagged with incremental, daily incremental tables with daily.
  • Testing: Freshness settings are set up for the GA4 source tables, and uniqueness checks on identifiers are built in for the fact and dimension tables.
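As an illustration of how such uniqueness checks are declared (the model and column names below are placeholders; check the repository's core.yml and marketing.yml for the actual ones):

```yaml
# core.yml (illustrative fragment)
version: 2
models:
  - name: dim_users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
```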

Closing thoughts

The goal of this example project is to bootstrap your project quickly and show you some of dbt's features and how to apply them in your own project. If you have any suggestions or improvements, please let me know!
