Insights & Analysis

Build a Data Driven Attribution Model using Google Analytics 4, BigQuery and Python

"Know your channels", a wise man once said. Get actionable insights on your channel performance by building custom attribution models using Google Analytics 4 data in BigQuery. This end-to-end tutorial will show you how to prepare the GA4 data and build rule based and data driven (Markov) models.

  • Prepare and aggerate the data (user journeys) using the Google Analytics 4 event level exports.
  • Leverage the Marketing-Attribution-Models Python library of DP6.
  • Create rule based (heuristic) models: First interaction, last (non-direct) interaction, linear, time decay & position based.
  • Create algorithmic data driven attribution model: Markov Chains
  • Attribution is calculated on user path level but also grouped on date and channel level.
  • Results saved to Google BigQuery.

As of January 2022, data driven attribution is enabled by default within the Google Analytics 4 interface. However, you can't adjust your channel groupings, the methodology is unknown and Google uses modelling for missing values. Next to that the values calculated by Google are only available within the interface and not exported to BigQuery. So building your own, custom attribution model can still be relevant.

Why you can't measure every interaction in a user journey

Attribution heavily depends on your data quality and what user interactions can be measured. It's a fact that you can't measure the complete complete journey on user level due to both technical as privacy reasons:

  • Different devices used (without some form of recognition like a user login)
  • The same user is measured as different users due to (first-party) cookies that are deleted or purged by build-in tracking prevention mechanisms within browsers like ITP or ETP.
  • Users don't give permission to be tracked.
  • Touchpoints that can't be measured on user level, like impression data in external networks, like display networks or social platforms

So it's important that the things you can measure, are measured correctly.

Use cases & business value

Besides that you probably will learn a lot from the process and how session / acquisition data is structured in GA4, creating different attribution models could bring your business the following:

  • Comparing different models will give you better insights in the value of each channel (and where in the user journey a channels adds value).
  • Help making marketing budget decisions: where to focus your marketing efforts.
  • Help making or improve marketing automation use cases (where can we add the most value)
  • Using BigQuery data will enable you to customize your user journeys (like path length and channel definitions) and tailor results to your specific organization (instead of using the default attribution models within Google Analytics).

So, let's start!

1. Prerequisites - First things first

This tutorial assumes that you already have some knowledge about Google Analytics (4) and BigQuery exports. Next to that:

  • Already set up a Google Cloud Project with an active billing account and GA4 BigQuery exports enabled (+ has some data to begin with). See this article.
  • You could also use the Google Analytics 4 sample dataset.
  • Understand and run Python scripts / Jupyter Notebook

We will use a Service Account to authenticate with your Google Cloud Project (getting data from and storing data in BigQuery).

If you're not familiar with setting up a Service Account, read the Google documentation first.

Create a Service Account

  • Create a Service Account and enable the BigQuery User role for the account.
  • Generate and download the JSON keyfile for the Service Account.

Keep your JSON key secure!

2. The Python library

This article is more about the practical approach, there are already a lot of other great articles around explaining the different attribution models in more depth. Therefore we use the Marketing-Attribution-Models Python library of DP6 to build the different attribution models, instead of writing all the logic manually.

In this article, we will use visits (sessions) and the corresponding marketing channel (e.g. social, search etc) as interactions, so we can credit conversions and conversion value to the marketing channels that generated these sessions. But attribution is not limited to these type of interactions, you could also add or use other definitions, like specific interactions on your website or within your channels (for example soft conversions or email opens).

Available models

attribution-model-types.PNG

ModelTypeDescription
First interactionRule basedSingle touchOnly the first interaction is credited for the conversion
Last interactionRule basedSingle touchOnly the last interaction is credited for the conversion
Last non-direct interactionRule basedSingle touchOnly the last (non-direct) interaction is credited for the conversion. Default model of Google Universal Analytics
LinearRule basedMulti touchEvery interaction is equally credited
Time DecayRule basedMulti touchThe more recent interaction is given more credits (based on time till conversion in hours)
Position BasedRule basedMulti touch40% to the first touchpoint, 40% to the last touchpoint and 20% is equally credited to the interactions in between
Shapley ValueData DrivenMulti touchNot used in this tutorial. See this article. Also used for the data driven model in the GA interface
Markov ChainsData DrivenMulti touchSee this article

The library expects the data in the following format:

  • journey_id: A unique journey-ID. In most cases a combination between some user-ID and a visit/session-ID.
  • path_channels: The user path leading to conversion
  • path_timestamps: The time difference in hours between interaction and conversion.
  • conversion: Indicate if the user has converted
  • conversion_value: Total conversion value (per visit)

Example input:

journey_idpath_channelspath_timestampsconversionconversion_value
123.456search_paid > direct > social50 > 20 > 0true450.00
789.321direct > search_organic > direct > email80 > 34 > 5 > 0true270.00
...............

3. Modelling your data & considerations

The interesting part. The table above doesn't look that complicated, but there are some considerations and decisions to be made when modeling your data. The following graph and table will illustrate some of the concepts:

google-analytics-bigquery-attribution-window.png Lookback window and conversion period

ConsiderationDescriptionThis tutorial
Conversion periodFor which period do you want to select conversions that the models take into consideration. For the data driven models, this period is important because it uses this data to calculate attribution for each channel (instead of looking to individual paths only). Also depends on the amount of conversions.90 days
Lookback (conversion) windowHow many days are you looking back from a conversion to construct a user path? This depends on your type of business. For example, retail typically has shorter journeys than let's say a when you're selling cars. Typical values are 30, 60 or 90 days. Good to know, due to technological and privacy developments, it's getting harder to accurately measure data of the same user over longer periods of time.30 days
Interaction / touchpoint definitionHow do you define the interactions (touchpoints) in the users (paths). Just source and/or medium or a custom channel grouping?Custom channel grouping using a user defined function
User pathsHow to build the user path when a user converts multiple times within the lookback window. Resetting the path after each transaction or also take the path of the previous conversion (within the lookback window) into consideration?For each conversion, look back 30 days, even if in those 30 days another conversion occurred
Conversions & conversion valueWhat is your conversion definition and how to deal with multiple conversion in 1 session.Purchases, multiple conversions in one session count as one
User definitionWhat type of user-ID is used to group sessions of the same user together. Based on the device-ID (cookie-ID) or (additionally) with custom user-ID such as a customer-ID of lead-ID. Using the latter, behavior on different devices for the same user can be measured.Device-ID (ga-cookie-ID), but example provided to use an additional custom user-ID.
Non-converting pathsFor some models it could be useful to also consider the non-converting paths (like the Shapley model).Not using non-converting paths

4. Google Analytics 4 data preparation

Based on the definitions of the previous chapter, we will need to prepare the Google Analytics 4 data stored in BigQuery.

Be aware that Google Analytics 4 has some quirks to deal with. For example it's unclear how GA4 is generating the Traffic Acquisition report in the interface. So it's impossible to align your BigQuery data with these interface reports. See some other issues below:

Things to take into account related to GA4 and traffic / acquisition parameters:

  • Compared to Universal Analytics, sessions are not reset when users switch channels in an active session. This means that one session could have multiple traffic sources and you should write some logic around this yourself.
  • In this tutorial, we will select the first traffic source within a session (instead of randomly selecting one with MIN() or MAX() functions.
  • Compared to Universal Analytics, GA4 doesn't seem to close sessions at midnight.
  • Google Analytics 4 batches events. Currently events are assigned the batch timestamp, instead of the actual timestamp when the event occurred.
  • It's not clear how GA4 generates the Traffic Acquisition reports in the interface / attributes sessions and conversions to channels. Partially this can be explained to to the fact that GA4 uses conversion modelling (thanks Ken Williams!). So currently it's impossible to align your BigQuery data with these interface reports.

Google Analytics 4 event and session structure

To illustrate how sessions are structured related to events and traffic acquisition data:

google_analytics_4_session_structure.PNG

  • user_pseudo_id: The device-ID (defaults to the GA cookie-ID on web and app instance-ID within apps). This ID will identify the user over different visits (sessions).
  • user_id: A custom user-ID, like a customer ID. This ID can be used to join activity of the same user over different devices or browsers. One user_id therefore can have multiple user_pseudo_id
  • ga_session_id: Identifies the specific session. Indivual events can be related to sessions using this ID. To create unique session-IDs, we have to concatenate user_pseudo_id and session_id (since the session_id only is based on a timestamp).
  • The same session can have events with different traffic acquisition parameters as mentioned earlier. In this tutorial we will use the first channel of a session.

That said, the following SQL code will fetch the Google Analytics 4 data from BigQuery according to the requirements of the Python library (see chapter 2), keeping in mind the considerations discussed earlier.

The SQL is using a persistent User Defined Function (UDF) that holds the channel groupings (<your-project>.<your-dataset>.CHANNEL_GROUPING()). You can create your channel definitions inline, but we recommend to use the channel grouping functions described in this article.

View SQL on GitHub

1declare conversion_period int64 default 90; -- select conversions in last x days
2declare lookback_window int64 default 30; -- how many days to lookback from the moment the conversion occurred;
3
4with
5-- group event level google analytics 4 data to sessions (visits)
6sessions as (
7    select
8        user_pseudo_id as ga_client_id,
9        --user_id as custom_user_id, -- use a custom user-id instead, like a customer-id
10        concat(user_pseudo_id,'.',(select cast(value.int_value as string) from unnest(event_params) where key = 'ga_session_id')) as session_id, -- combine user_pseudo_id and session_id for a unique session-id
11        timestamp_micros(min(event_timestamp)) as session_start,
12        array_agg(
13            if(event_name in('page_view','user_engagement','scroll'), struct(
14                event_timestamp,
15                lower((select value.string_value from unnest(event_params) where key = 'source')) as source,
16                lower((select value.string_value from unnest(event_params) where key = 'medium')) as medium,
17                lower((select value.string_value from unnest(event_params) where key = 'campaign')) as campaign,
18                (select value.int_value from unnest(event_params) where key = 'entrances') as is_entrance,
19                (select value.int_value from unnest(event_params) where key = 'ignore_referrer') as ignore_referrer
20            ), null) 
21        ignore nulls) as channels_in_session,
22        --countif(event_name = '<name-of-some-other-conversion-event>') as conversions_in_session,
23        countif(event_name = 'purchase') as conversions,
24        sum(ecommerce.purchase_revenue) as conversion_value
25    from `<your-project>.analytics_<your-dataset>.events_*`
26    where
27        -- select conversions based on <conversion_period> + additional daterange to construct the path of a conversion (based on <lookback_window>)
28        _table_suffix between 
29            format_date('%Y%m%d', date_sub(current_date(), interval (conversion_period + lookback_window) day))
30            and format_date('%Y%m%d', date_sub(current_date(), interval 1 day))
31    group by 1, 2
32),
33-- build conversion paths for all sessions with at least 1 conversion within the last <conversion_period> days
34sessions_converted as (
35    select
36        s.session_start,
37        s.session_id,
38        string_agg(
39            -- select first channel / campaign within session
40            `<your-project>.<your-dataset>.channel_grouping`(
41                (select t.source from unnest(s_lb.channels_in_session) as t where t.ignore_referrer is null order by t.event_timestamp asc limit 1),
42                (select t.medium from unnest(s_lb.channels_in_session) as t where t.ignore_referrer is null order by t.event_timestamp asc limit 1),
43                null
44            ),
45            ' > '
46            order by s_lb.session_start asc
47        ) as path_channels,
48        string_agg(cast(timestamp_diff(timestamp(s.session_start), timestamp(s_lb.session_start), hour) as string), ' > ' order by s_lb.session_start asc) as path_timestamps, -- hours till conversion
49        string_agg(cast(s_lb.session_start as string), ' > ' order by s_lb.session_start asc) as path_timestamps_check,
50        max(s.conversions) as conversions_in_session,
51        max(s.conversion_value) as conversion_value
52    from sessions as s
53    left join
54        -- joining historical sessions to construct the conversion path (with a max path length of <lookback_window>)
55        sessions as s_lb
56        on s.ga_client_id = s_lb.ga_client_id
57        and s.session_start >= s_lb.session_start -- only join current session and sessions before current session
58        and datetime(s_lb.session_start) >= date_sub(datetime(s.session_start), interval lookback_window day) -- only join sessions not older than <lookback_window> days counted from conversion
59    where
60        s.conversions > 0
61        and date(s.session_start) >= date_sub(current_date(), interval conversion_period day)
62    group by 1, 2
63    order by 
64        s.session_start asc
65)
66
67-- query data on user (journey) level
68select
69    date(session_start) as conversion_date,
70    session_start as conversion_timestamp,
71    session_id as journey_id,
72    path_channels,
73    path_timestamps,
74    true as conversion,
75    conversions_in_session,
76    conversion_value as conversion_value
77from sessions_converted

5. Building the models in Python

The Python script will build the different rule based and data driven (Markov) models:

First, make sure you have installed the following packages;

Make sure you have the latest versions of the packages used, since older packages seem to have difficulties with the combination Pandas / BigQuery

1pip install marketing_attribution_models
2pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]'
3pip install pyarrow

Then:

  1. Download the script (and the .sql) file from GitHub
  2. Adjust the configuration settings in the top section of the script.
  3. Run the script (paste the code in Jupyter Notebook or just run the Python script)

The script will:

  • Fetch the data from BigQuery (using the .sql file that fetches the user journeys from BigQuery)
  • Create the different models configured in the script
  • Result is a Pandas dataframe containing the attributions (for all models) on user path level.
  • Based on this dataframe, the data is grouped to date and channel level into an additional dataframe.
  • Both dataframes (user path + channel level) are saved back to Google BigQuery (good to know, the BigQuery tables are making use of repeated fields)

BigQuery table: Attribution on user path level

bigquery_table_attribution_conversion.PNG

BigQuery table: Attribution on channel and date level

bigquery_table_attribution_conversion_grouped.PNG

6. Querying the results in BigQuery

The last (and easy) step is to fetch the results from the BigQuery table that is created. The following query will get you the results grouped on channel level:

1SELECT
2    --conversion_date,
3    channels,
4    --SUM(attr_first_click[SAFE_OFFSET(0)]) AS first_click_conversions,
5    SUM(attr_first_click[SAFE_OFFSET(1)]) AS first_click_value,
6    --SUM(attr_last_click[SAFE_OFFSET(0)]) AS last_non_direct_conversions,
7    SUM(attr_last_click[SAFE_OFFSET(1)]) AS last_click_value,
8    --SUM(attr_last_non_direct_click[SAFE_OFFSET(0)]) AS last_non_direct_conversions,
9    SUM(attr_last_non_direct_click[SAFE_OFFSET(1)]) AS last_non_direct_value,
10    --SUM(attr_position[SAFE_OFFSET(0)]) AS position_conversions,
11    SUM(attr_position[SAFE_OFFSET(1)]) AS position_value,
12    --SUM(attr_time_decay[SAFE_OFFSET(0)]) AS time_decay_conversions,
13    SUM(attr_time_decay[SAFE_OFFSET(1)]) AS time_decay_value,
14    --SUM(attr_linear[SAFE_OFFSET(0)]) AS linear_conversions,
15    SUM(attr_linear[SAFE_OFFSET(1)]) AS linear_value,
16    --SUM(attr_markov[SAFE_OFFSET(0)]) AS markov_conversions,
17    SUM(attr_markov[SAFE_OFFSET(1)]) AS markov_value
18FROM
19  `<your-project>.<your-dataset>.conversions_attr_channels`
20GROUP BY
21    --conversion_date,
22    channels

Result

google_analytics_4_attribution_result.PNG

That's it!

Closing thoughts & next steps

In this more practical tutorial we hope to show you how to set up the end-to-end process. Of course, there is room for improvement and can be developed further to your specific business needs. Think of;

  • Automate the generation of attribution scores and values (and how to deal with historical data used to calculate scores of 'today')
  • User paths: How to deal with multiple conversions for the same user (cut-off the user journeys after a conversion or after a certain time of inactivity)
  • User paths: How to deal with multiple conversions in one session
  • Adding additional models, like the attribution models using Shapley Values.

Stay tuned for part 2, where we visualize the results in Google Data Studio.

Did you like this article? Stay hydrated and get notified when new articles, tutorials or other interesting updates are published. No spam and you can unsubscribe at any time. You can also check my Twitter account for updates and other interesting material.