"Know your channels", a wise man once said. Get actionable insights on your channel performance by building custom attribution models using Google Analytics 4 data in BigQuery. This end-to-end tutorial will show you how to prepare the GA4 data and build rule based and data driven (Markov) models.
- Prepare and aggerate the data (user journeys) using the Google Analytics 4 event level exports.
- Leverage the Marketing-Attribution-Models Python library of DP6.
- Create rule based (heuristic) models: First interaction, last (non-direct) interaction, linear, time decay & position based.
- Create algorithmic data driven attribution model: Markov Chains
- Attribution is calculated on user path level but also grouped on date and channel level.
- Results saved to Google BigQuery.
The upcoming part #2 of this article will show you how to visualize the results in a Google Data Studio dashboard.
Why you can't measure every interaction in a user journey
Attribution heavily depends on your data quality and what user interactions can be measured. It's a fact that you can't measure the complete complete journey on user level due to both technical as privacy reasons:
- Different devices used (without some form of recognition like a user login)
- The same user is measured as different users due to (first-party) cookies that are deleted or purged by build-in tracking prevention mechanisms within browsers like ITP or ETP.
- Users don't give permission to be tracked.
- Touchpoints that can't be measured on user level, like impression data in external networks, like display networks or social platforms
So it's important that the things you can measure, are measured correctly.
Use cases & business value
Besides that you probably will learn a lot from the process and how session / acquisition data is structured in GA4, creating different attribution models could bring your business the following:
- Comparing different models will give you better insights in the value of each channel (and where in the user journey a channels adds value).
- Help making marketing budget decisions: where to focus your marketing efforts.
- Help making or improve marketing automation use cases (where can we add the most value)
- Using BigQuery data will enable you to customize your user journeys (like path length and channel definitions) and tailor results to your specific organization (instead of using the default attribution models within Google Analytics).
So, let's start!
1. Prerequisites - First things first
This tutorial assumes that you already have some knowledge about Google Analytics (4) and BigQuery exports. Next to that:
- Already set up a Google Cloud Project with an active billing account and GA4 BigQuery exports enabled (+ has some data to begin with). See this article.
- You could also use the Google Analytics 4 sample dataset.
- Understand and run Python scripts / Jupyter Notebook
We will use a Service Account to authenticate with your Google Cloud Project (getting data from and storing data in BigQuery).
If you're not familiar with setting up a Service Account, read the Google documentation first.
Create a Service Account
- Create a Service Account and enable the
BigQuery Userrole for the account.
- Generate and download the JSON keyfile for the Service Account.
Keep your JSON key secure!
2. The Python library
This article is more about the practical approach, there are already a lot of other great articles around explaining the different attribution models in more depth. Therefore we use the Marketing-Attribution-Models Python library of DP6 to build the different attribution models, instead of writing all the logic manually.
In this article, we will use visits (sessions) and the corresponding marketing channel (e.g. social, search etc) as interactions, so we can credit conversions and conversion value to the marketing channels that generated these sessions. But attribution is not limited to these type of interactions, you could also add or use other definitions, like specific interactions on your website or within your channels (for example soft conversions or email opens).
|First interaction||Rule based||Single touch||Only the first interaction is credited for the conversion|
|Last interaction||Rule based||Single touch||Only the last interaction is credited for the conversion|
|Last non-direct interaction||Rule based||Single touch||Only the last (non-direct) interaction is credited for the conversion. Default model of Google Universal Analytics|
|Linear||Rule based||Multi touch||Every interaction is equally credited|
|Time Decay||Rule based||Multi touch||The more recent interaction is given more credits (based on time till conversion in hours)|
|Position Based||Rule based||Multi touch||40% to the first touchpoint, 40% to the last touchpoint and 20% is equally credited to the interactions in between|
|Shapley Value||Data Driven||Multi touch||Not used in this tutorial. See this article. Also used for the data driven model in the GA interface|
|Markov Chains||Data Driven||Multi touch||See this article|
The library expects the data in the following format:
- journey_id: A unique journey-ID. In most cases a combination between some user-ID and a visit/session-ID.
- path_channels: The user path leading to conversion
- path_timestamps: The time difference in hours between interaction and conversion.
- conversion: Indicate if the user has converted
- conversion_value: Total conversion value (per visit)
|123.456||search_paid > direct > social||50 > 20 > 0||true||450.00|
|789.321||direct > search_organic > direct > email||80 > 34 > 5 > 0||true||270.00|
3. Modelling your data & considerations
The interesting part. The table above doesn't look that complicated, but there are some considerations and decisions to be made when modeling your data. The following graph and table will illustrate some of the concepts:
Lookback window and conversion period
|Conversion period||For which period do you want to select conversions that the models take into consideration. For the data driven models, this period is important because it uses this data to calculate attribution for each channel (instead of looking to individual paths only). Also depends on the amount of conversions.||90 days|
|Lookback (conversion) window||How many days are you looking back from a conversion to construct a user path? This depends on your type of business. For example, retail typically has shorter journeys than let's say a when you're selling cars. Typical values are 30, 60 or 90 days. Good to know, due to technological and privacy developments, it's getting harder to accurately measure data of the same user over longer periods of time.||30 days|
|Interaction / touchpoint definition||How do you define the interactions (touchpoints) in the users (paths). Just ||Custom channel grouping using a user defined function|
|User paths||How to build the user path when a user converts multiple times within the lookback window. Resetting the path after each transaction or also take the path of the previous conversion (within the lookback window) into consideration?||For each conversion, look back 30 days, even if in those 30 days another conversion occurred|
|Conversions & conversion value||What is your conversion definition and how to deal with multiple conversion in 1 session.||Purchases, multiple conversions in one session count as one|
|User definition||What type of user-ID is used to group sessions of the same user together. Based on the device-ID (cookie-ID) or (additionally) with custom user-ID such as a customer-ID of lead-ID. Using the latter, behavior on different devices for the same user can be measured.||Device-ID (ga-cookie-ID), but example provided to use an additional custom user-ID.|
|Non-converting paths||For some models it could be useful to also consider the non-converting paths (like the Shapley model).||Not using non-converting paths|
4. Google Analytics 4 data preparation
Based on the definitions of the previous chapter, we will need to prepare the Google Analytics 4 data stored in BigQuery.
Be aware that Google Analytics 4 has some quirks to deal with. For example it's unclear how GA4 is generating the Traffic Acquisition report in the interface. So it's impossible to align your BigQuery data with these interface reports. See some other issues below:
Things to take into account related to GA4 and traffic / acquisition parameters:
- Compared to Universal Analytics, sessions are not reset when users switch channels in an active session. This means that one session could have multiple traffic sources and you should write some logic around this yourself.
- In this tutorial, we will select the first traffic source within a session (instead of randomly selecting one with MIN() or MAX() functions.
- Compared to Universal Analytics, GA4 doesn't seem to close sessions at midnight.
- Google Analytics 4 batches events. Currently events are assigned the batch timestamp, instead of the actual timestamp when the event occurred.
- It's not clear how GA4 generates the Traffic Acquisition reports in the interface / attributes sessions and conversions to channels. Partially this can be explained to to the fact that GA4 uses conversion modelling (thanks Ken Williams!). So currently it's impossible to align your BigQuery data with these interface reports.
Google Analytics 4 event and session structure
To illustrate how sessions are structured related to events and traffic acquisition data:
user_pseudo_id: The device-ID (defaults to the GA cookie-ID on web and app instance-ID within apps). This ID will identify the user over different visits (sessions).
user_id: A custom user-ID, like a customer ID. This ID can be used to join activity of the same user over different devices or browsers. One
user_idtherefore can have multiple
ga_session_id: Identifies the specific session. Indivual events can be related to sessions using this ID. To create unique session-IDs, we have to concatenate user_pseudo_id and session_id (since the session_id only is based on a timestamp).
- The same session can have events with different traffic acquisition parameters as mentioned earlier. In this tutorial we will use the first channel of a session.
That said, the following SQL code will fetch the Google Analytics 4 data from BigQuery according to the requirements of the Python library (see chapter 2), keeping in mind the considerations discussed earlier.
The SQL is using a persistent User Defined Function (UDF) that holds the channel groupings (
<your-project>.<your-dataset>.CHANNEL_GROUPING()). You can create your channel definitions inline, but we recommend to use the channel grouping functions described in this article.
1DECLARE conversion_period INT64 DEFAULT 90; -- Select conversions in last X days 2DECLARE lookback_window INT64 DEFAULT 30; -- How many days to lookback from the moment the conversion occurred; 3 4WITH 5-- Group event level Google Analytics 4 data to sessions (visits) 6sessions AS ( 7 SELECT 8 user_pseudo_id AS ga_client_id, 9 --user_id AS custom_user_id, -- Use a custom user-ID instead, like a customer-ID 10 CONCAT(user_pseudo_id,".",(SELECT CAST(value.int_value AS STRING) FROM UNNEST(event_params) WHERE key = "ga_session_id")) AS session_id, -- Combine user_pseudo_id and session_id for a unique session-ID 11 TIMESTAMP_MICROS(MIN(event_timestamp)) AS session_start, 12 ARRAY_AGG( 13 IF(event_name = "page_view", STRUCT( 14 event_timestamp, 15 LOWER((SELECT value.string_value FROM UNNEST(event_params) WHERE key = "source")) AS source, 16 LOWER((SELECT value.string_value FROM UNNEST(event_params) WHERE key = "medium")) AS medium, 17 LOWER((SELECT value.string_value FROM UNNEST(event_params) WHERE key = "campaign")) AS campaign, 18 (SELECT value.int_value FROM UNNEST(event_params) WHERE key = "entrances") AS is_entrance, 19 (SELECT value.int_value FROM UNNEST(event_params) WHERE key = "ignore_referrer") AS ignore_referrer 20 ), NULL) 21 IGNORE NULLS) AS channels_in_session, 22 --COUNTIF(event_name = "<name-of-some-other-conversion-event>") AS conversions_in_session, 23 COUNTIF(event_name = "purchase") AS conversions, 24 SUM(ecommerce.purchase_revenue) AS conversion_value 25 FROM 26 `<your-project>.analytics_<your-dataset>.events_*` 27 WHERE 28 -- Select conversions based on <conversion_period> + additional daterange to construct the path of a conversion (based on <lookback_window>) 29 _TABLE_SUFFIX BETWEEN 30 FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL (conversion_period + lookback_window) DAY)) 31 AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) 32 GROUP BY 33 ga_client_id, 34 session_id 35), 36-- Build conversion paths for all sessions with at least 1 conversion within the last <conversion_period> days 37sessions_converted AS ( 38 SELECT 39 s.session_start, 40 s.session_id, 41 STRING_AGG( 42 -- Select first channel / campaign within session 43 `<your-project>.<your-dataset>.CHANNEL_GROUPING`( 44 (SELECT t.source FROM UNNEST(s_lb.channels_in_session) AS t WHERE t.ignore_referrer IS NULL ORDER BY t.event_timestamp ASC LIMIT 1), 45 (SELECT t.medium FROM UNNEST(s_lb.channels_in_session) AS t WHERE t.ignore_referrer IS NULL ORDER BY t.event_timestamp ASC LIMIT 1), 46 NULL 47 ), 48 " > " 49 ORDER BY s_lb.session_start ASC 50 ) AS path_channels, 51 STRING_AGG(CAST(TIMESTAMP_DIFF(TIMESTAMP(s.session_start), TIMESTAMP(s_lb.session_start), HOUR) AS STRING), " > " ORDER BY s_lb.session_start ASC) AS path_timestamps, -- Hours till conversion 52 STRING_AGG(CAST(s_lb.session_start AS STRING), " > " ORDER BY s_lb.session_start ASC) AS path_timestamps_check, 53 SUM(s.conversions) AS conversions_in_session, 54 SUM(s.conversion_value) AS conversion_value 55 FROM 56 sessions AS s 57 LEFT JOIN 58 -- Joining historical sessions to construct the conversion path (with a max path length of <lookback_window>) 59 sessions AS s_lb 60 ON s.ga_client_id = s_lb.ga_client_id 61 AND s.session_start >= s_lb.session_start -- Only join current session and sessions before current session 62 AND DATETIME(s_lb.session_start) >= DATE_SUB(DATETIME(s.session_start), INTERVAL lookback_window DAY) -- Only join sessions not older than <lookback_window> days counted from conversion 63 WHERE 64 s.conversions > 0 65 AND DATE(s.session_start) >= DATE_SUB(CURRENT_DATE(), INTERVAL conversion_period DAY) 66 GROUP BY 67 s.session_start, 68 s.session_id 69 ORDER BY 70 s.session_start ASC 71) 72 73-- Query data on user (journey) level 74SELECT 75 DATE(session_start) AS conversion_date, 76 session_start AS conversion_timestamp, 77 session_id AS journey_id, 78 path_channels, 79 path_timestamps, 80 true AS conversion, 81 conversions_in_session, 82 conversion_value AS conversion_value 83FROM 84 sessions_converted
5. Building the models in Python
The Python script will build the different rule based and data driven (Markov) models:
First, make sure you have installed the following packages;
Make sure you have the latest versions of the packages used, since older packages seem to have difficulties with the combination Pandas / BigQuery
1pip install marketing_attribution_models 2pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]' 3pip install pyarrow
- Download the script (and the .sql) file from GitHub
- Adjust the configuration settings in the top section of the script.
- Run the script (paste the code in Jupyter Notebook or just run the Python script)
The script will:
- Fetch the data from BigQuery (using the .sql file that fetches the user journeys from BigQuery)
- Create the different models configured in the script
- Result is a Pandas
dataframecontaining the attributions (for all models) on user path level.
- Based on this dataframe, the data is grouped to date and channel level into an additional dataframe.
- Both dataframes (user path + channel level) are saved back to Google BigQuery (good to know, the BigQuery tables are making use of repeated fields)
BigQuery table: Attribution on user path level
BigQuery table: Attribution on channel and date level
6. Querying the results in BigQuery
The last (and easy) step is to fetch the results from the BigQuery table that is created. The following query will get you the results grouped on channel level:
1SELECT 2 --conversion_date, 3 channels, 4 --SUM(attr_first_click[SAFE_OFFSET(0)]) AS first_click_conversions, 5 SUM(attr_first_click[SAFE_OFFSET(1)]) AS first_click_value, 6 --SUM(attr_last_click[SAFE_OFFSET(0)]) AS last_non_direct_conversions, 7 SUM(attr_last_click[SAFE_OFFSET(1)]) AS last_click_value, 8 --SUM(attr_last_non_direct_click[SAFE_OFFSET(0)]) AS last_non_direct_conversions, 9 SUM(attr_last_non_direct_click[SAFE_OFFSET(1)]) AS last_non_direct_value, 10 --SUM(attr_position[SAFE_OFFSET(0)]) AS position_conversions, 11 SUM(attr_position[SAFE_OFFSET(1)]) AS position_value, 12 --SUM(attr_time_decay[SAFE_OFFSET(0)]) AS time_decay_conversions, 13 SUM(attr_time_decay[SAFE_OFFSET(1)]) AS time_decay_value, 14 --SUM(attr_linear[SAFE_OFFSET(0)]) AS linear_conversions, 15 SUM(attr_linear[SAFE_OFFSET(1)]) AS linear_value, 16 --SUM(attr_markov[SAFE_OFFSET(0)]) AS markov_conversions, 17 SUM(attr_markov[SAFE_OFFSET(1)]) AS markov_value 18FROM 19 `<your-project>.<your-dataset>.conversions_attr_channels` 20GROUP BY 21 --conversion_date, 22 channels
Closing thoughts & next steps
In this more practical tutorial we hope to show you how to set up the end-to-end process. Of course, there is room for improvement and can be developed further to your specific business needs. Think of;
- Automate the generation of attribution scores and values (and how to deal with historical data used to calculate scores of 'today')
- User paths: How to deal with multiple conversions for the same user (cut-off the user journeys after a conversion or after a certain time of inactivity)
- User paths: How to deal with multiple conversions in one session
- Adding additional models, like the attribution models using Shapley Values.
Stay tuned for part 2, where we visualize the results in Google Data Studio.
- Why you can't measure every interaction in a user journey
- Use cases & business value
- 1. Prerequisites - First things first
- 2. The Python library
- 3. Modelling your data & considerations
- 4. Google Analytics 4 data preparation
- 5. Building the models in Python
- 6. Querying the results in BigQuery
- Closing thoughts & next steps