Data Engineering

Enrich a Single Customer View with Google Analytics 4 BigQuery data

Use Google Analytics 4 data to enrich a 360° Customer View with online behavioral data, such as visit frequency, most used device, channel and/or engagement metrics. This tutorial provides an SQL example you can use directly.

  • Enrich a single customer view with online behavioral data collected by Google Analytics 4
  • Use the GA4 BigQuery exports to aggregate Google Analytics 4 data to user level
  • Think of most frequently used marketing channel, device, visit frequency & recency and most viewed content.
  • Go directly to the SQL code

What is a Single Customer View?

A (single) customer view is also called a Unified Customer View, a 360° Customer View or Profile. Basically it aggregates data on user level from different sources to a single profile / representation of a customer (or user).

Most Customer Data Platforms (CDPs) offer this functionality, but you can also build a single customer view within your data warehouse. Common sources for these customer views are:

  • CRM databases
  • Customer service systems
  • Order databases
  • Offline behavior (e.g. in-store purchases using a loyalty card)
  • Online behavior

In this article, we will add the online behavioral data the the customer view. Example:

google-analytics-4-single-customer-view-table.PNG

By creating such a customer view, data can be used to (but certainly not limited to):

  • Get a better and more complete understanding of your customer.
  • Input for (customer) segmentation (e.g. clustering).
  • Input for modeling / predictions (e.g. conversion or churn prediction)
  • Target ads (think of email or social/display campaigns, but also on-site personalization)
  • Channel preference / optimization
  • Content recommendations
  • Providing additional information for customer service (e.g. by connecting the view to customer service / CRM software)

Use Google Analytics 4 data to extract online behavior

As mentioned, Google Analytics 4 (GA4) can be used to add online behavior to the customer view. Leveraging the GA4 BigQuery exports, we can aggregate data to user level, thus providing all sorts of interesting user features.

Caution! Make sure you have the appropriate user consent before collecting and using the data.

single-customer-view-google-analytics-4.PNG

Customer identification & mapping table

To match GA4 data with other data sources, you will need some sort of a common identifier, like a customer-ID.

You could also build user profiles of anonymous users. So not for every use case a customer-ID is required. In this article, we will assume that you want to connect the GA4 data to other systems (but provided comments in the SQL code below to build profiles for anonymous users).

So if you're planning to connect the data to other systems, you will need to configure the collection and matching of customer-IDs / user-IDs within Google Analytics 4. This article will show you how to create a user mapping table (private Identity Graph), which you can use to:

  1. Match device-IDs (GA4 cookie-ID: user_pseudo_id) to customer-IDs or user-IDs
  2. Use this match / mapping table to easily add a customer-ID to sessions and unify sessions before aggregating the data to users, thus creating cross-device profiles, depending on the identification methods (e.g. login).

identity-graph-resolution-bigquery-sql.png

Private Identity Graph

Extracting data from the BigQuery GA4 dataset

We assume you've already set up the GA4 exports to Google BigQuery.

If you've decided to group on customer-ID, we assume you've set up the mapping table described in the previous chapter.

Now we're going to transform al the GA4 event level data to user level data. The different steps explained:

  1. Aggregate the event (hit) level data to session level. You have to set the lookback_window variable for which the user views will be constructed. We've set this to 12 months, but you should adjust this to your specific situation.
  2. Match sessions on device-ID (user_pseudo_id) to customer-IDs. We will use the mapping table that you can find in this article, as described in the previous chapter. Based on this table, we will create an intermediate match table, where the ga_client_id (= user_pseudo_id) is unique. We only select device-IDs that have a maximum of 3 (or less) customer-IDs attached. If multiple customer-IDs, the most recent one is selected. You can skip this step if you want to create a view for anonymous users.
  3. Aggregate the session level data to user level, and creating the "customer view".

The query will provide you with a bunch of user features (described in the table below). You're certainly not limited to these features, as you can come up with a lot more, specific for your business. The query will show you some concepts how to get different kinds of data and aggegrations.

All features are calculated within the configured lookback period (except first_touch and traffic_source_first, because these are persistent parameters within the GA4 dataset)

FeatureDescription
first_touchTimestamp of the very first session (can be outside the lookback window)
first_sessionTimestamp of first session within lookback period
last_sessionTimestamp of last session
last_transactionTimestamp of last transaction
sessionsTotal number of sessions (complete lookback window)
sessions_1d_7dSessions between last 1 and 7 days
sessions_7d_30dSessions between last 7 and 30 days
sessions_30d_90dSessions between last 30 and 90 days
sessions_engagedTotal number of engaged sessions (see definition)
engagement_time_avgAverage engagement time of engaged sessions (complete lookback window)
loginsSessions with a login
total_transactionsTotal number of transactions
total_transaction_valueSum of transaction values
device_typesNumber of unique device categories (mobile / tablet / desktop) used
channelsNumber of unique channel (groupings) used
days_since_last_sessionDays between current date and last session timestamp
days_since_last_transactionDays between current date and last transaction timestamp
device_firstFirst device (within lookback period)
device_lastLast device
device_most_freqMost frequently used device
os_lastLast used operating system
os_most_freqMost frequently used operating system
os_browser_lastLast used browser
os_browser_most_freqMost frequently used browser
channel_acquisitionAcquisition channel (channel of first touch session, can be outside the lookback window).
channel_firstFirst used channel (within lookback period )
channel_lastLast used channel. A session can have multiple channels,the first channel within a session is used.
channel_most_freqMost frequently used channel. A session can have multiple channels,the first channel within a session is used.
last_viewed_itemsLast 5 items viewed. Deduplicated.
most_viewed_itemsMost viewed items
mosted_viewed_categoriesMost viewed categories, based on items views

Query: The customer view

Keep in mind, this is a rather resource intensive query, so when used on large datasets make sure you select a small amount of data tot test. I'm also aware that it isn't he most efficient query (tips welcome!). In production, you could split up the query (for example create a separate sessions table that is updated daily, so you don't have to query the complete raw GA4 dataset every day).

I've included a temporary function for defining channel groupings (based on traffic source and medium), but you could also use a permanent function (see this article).

View the code on GitHub

1declare lookback_window int64 default 365; -- how many days to lookback into the ga4 dataset to calculate profiles
2
3-- udf: channel grouping (you could put this in a permanent function)
4-- also see https://stacktonic.com/article/google-analytics-4-and-big-query-create-custom-channel-groupings-in-a-reusable-sql-function
5create temporary function channel_grouping(tsource string, medium string, campaign string) as (
6    case
7        when (tsource = '(direct)' or tsource is null) 
8            and (regexp_contains(medium, r'^(\(not set\)|\(none\))$') or medium is null) 
9            then 'direct'
10        when regexp_contains(medium, r'^(social|social_advertising|social-advertising|social_network|social-network|social_media|social-media)$') 
11            then 'social'
12        when regexp_contains(medium, r'^(email|mail)$') 
13            then 'email'
14        when regexp_contains(medium, r'^(affiliate|affiliates)$') 
15            then 'affiliate'
16        when regexp_contains(medium, r'^(cpc|ppc|paidsearch)$') 
17            then 'search_paid'
18        when regexp_contains(medium, r'^(display|cpm|banner)$') 
19            then 'display'
20        when medium = 'organic'
21            then 'search_organic'
22        when medium = 'referral'
23            then 'referral'
24        else '(other)'
25    end
26);
27
28-- udf: most frequent value from array
29create temp function most_freq(arr any type) as ((
30    select x from unnest(arr) x group by x order by count(1) desc limit 1
31));
32
33-- udf: get first channel within session
34create temp function first_channel_session(arr any type) as ((
35    select 
36        channel_grouping(x.source, x.medium, x.campaign) 
37    from unnest(arr) x 
38    where x.ignore_referrer is null 
39    order by x.event_timestamp asc limit 1
40));
41
42with 
43-- this in-between table is based mapping table described in this article (you can leave out this part if you want to group on device-id only)
44-- https://stacktonic.com/article/create-a-user-mapping-table-based-on-the-google-analytics-4-big-query-dataset)
45-- only select ga_client_id (`user_pseudo_id`) with 3 of less customer-ids attached. 
46-- when multiple customer-ids, select most recent one.
47customer_id_mapping as (
48    select
49        ga_client_id,
50        customer_ids,
51        (select x.customer_id from unnest(customer_ids) x order by x.timestamp desc limit 1) as customer_id
52    from (
53        select
54            ga_id.id as ga_client_id,
55            count(customer_id) as customer_ids_count,
56            array_agg(struct(customer_id, ga_id.timestamp)) as customer_ids,
57        from 
58            `<your-project>.<your-dataset>.<your-mapping-table>`, 
59            unnest(ga_client_id) as ga_id
60        group by 
61            ga_client_id
62    ) 
63    where
64        customer_ids_count <= 3
65),
66-- aggegrate event level data to session level data and attach customer-id from mapping table
67sessions as (
68    select
69        user_pseudo_id as ga_client_id,
70        max(m.customer_id) as customer_id, -- logic to select only one customer-id per user_pseudo_id can be found in `customer_id_mapping`
71        concat(user_pseudo_id,'.',(select cast(value.int_value as string) from unnest(event_params) where key = 'ga_session_id')) as session_id, -- combine user_pseudo_id and session_id for a unique session-id
72        timestamp_micros(min(event_timestamp)) as session_start,
73        max((select value.string_value from unnest(event_params) where key = 'session_engaged')) as session_engaged,
74        max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec'))/1000 as engagement_time,
75        max(lower(traffic_source.source)) as traffic_source_first,
76        max(lower(traffic_source.medium)) as traffic_medium_first,
77        min(timestamp_micros(user_first_touch_timestamp)) as session_start_first_touch,
78        max(device.category) as device,
79        max(device.operating_system) as os,
80        max(device.web_info.browser) as browser,
81        max(geo.country) as country,
82        max(geo.city) as city,
83        ifnull(first_channel_session(
84            array_agg(
85                if(event_name in('page_view','user_engagement','scroll'), struct(
86                    event_timestamp,
87                    lower((select value.string_value from unnest(event_params) where key = 'source')) as source,
88                    lower((select value.string_value from unnest(event_params) where key = 'medium')) as medium,
89                    lower((select value.string_value from unnest(event_params) where key = 'campaign')) as campaign,
90                    (select value.int_value from unnest(event_params) where key = 'entrances') as is_entrance,
91                    (select value.int_value from unnest(event_params) where key = 'ignore_referrer') as ignore_referrer
92                ), null) 
93            ignore nulls order by event_timestamp asc)
94        ), 'direct') as first_channel_in_session,
95        --countif(event_name = '<name-of-some-other-conversion-event>') as some_conversion,
96        if(countif(event_name = 'login') > 0,1,0) as session_login,
97        countif(event_name = 'purchase') as transactions,
98        sum(ecommerce.purchase_revenue) as transaction_value,
99        array_agg(
100            if(event_name = 'view_item' or event_name = 'add_to_cart', struct(
101                event_timestamp,
102                event_name,
103                (select item_id from unnest(items) limit 1) as item_id,
104                (select item_category from unnest(items) limit 1) as item_category
105            ), null) 
106        ignore nulls order by event_timestamp desc limit 100) as item_interactions
107    from
108        `<your-project>.analytics_<your-ga4-property-id>.events_*` as s
109    left join customer_id_mapping as m
110        on m.ga_client_id = s.user_pseudo_id
111    where
112        _table_suffix between format_date('%Y%m%d',date_sub(current_date(), interval lookback_window day))
113        and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
114    group by
115        user_pseudo_id,
116        session_id
117)
118
119-- aggegrate to user level
120select
121    * except (item_interactions),
122    array_to_string(
123        array(select x.item_id from unnest(item_interactions) x where x.event_name = 'view_item' group by x.item_id order by max(x.event_timestamp) desc limit 5),
124        '|'
125    ) as last_viewed_items,
126    array_to_string(
127        array(select x.item_id from unnest(item_interactions) x where x.event_name = 'view_item' group by x.item_id order by count(1) desc limit 5),
128        '|'
129    ) as most_viewed_items,
130    array_to_string(
131        array(select x.item_category from unnest(item_interactions) x where x.event_name = 'view_item' group by x.item_category order by count(1) desc limit 5),
132        '|'
133    ) as most_viewed_categories
134from (
135    select
136        --ga_client_id, -- uncomment when grouping on device-id instead of customer-id
137        customer_id, -- remove when grouping on device-id
138        min(datetime(session_start_first_touch)) as first_touch,
139        min(datetime(session_start)) as first_session,
140        max(datetime(session_start)) as last_session,
141        max(datetime(if(transactions > 0, session_start, null))) as last_transaction,
142        count(session_id) as sessions,
143        countif(date_diff(current_date, datetime(session_start), day) between 0 and 7) as sessions_1d_7d,
144        countif(date_diff(current_date, datetime(session_start), day) between 8 and 30) as sessions_7d_30d,
145        countif(date_diff(current_date, datetime(session_start), day) between 31 and 90) as sessions_30d_90d,
146        countif(session_engaged = '1') as sessions_engaged,
147        safe_divide(sum(engagement_time), countif(session_engaged = '1')) as engagement_time_avg,
148        sum(session_login) as logins,
149        sum(transactions) as total_transactions,
150        ifnull(sum(transaction_value),0) as total_transaction_value,
151        count(distinct device) as devices,
152        count(distinct first_channel_in_session) as channels,
153        date_diff(current_date(),cast((max(session_start)) as date),day) as days_since_last_session,
154        date_diff(current_date(),cast((max(if(transactions > 0, session_start, null))) as date),day) as days_since_last_transaction,
155        string_agg(device order by session_start asc limit 1) as device_first,
156        string_agg(device order by session_start desc limit 1) as device_last,
157        most_freq(array_agg(device)) as device_most_freq,
158        string_agg(os order by session_start desc limit 1) as os_last,
159        most_freq(array_agg(os)) as os_most_freq,
160        string_agg(browser order by session_start desc limit 1) as browser_last,
161        most_freq(array_agg(browser)) as browser_most_freq,
162        channel_grouping(min(traffic_source_first), min(traffic_medium_first), null) as channel_acquisition,
163        string_agg(first_channel_in_session order by session_start asc limit 1) as channel_first,
164        string_agg(first_channel_in_session order by session_start desc limit 1) as channel_last,
165        most_freq(array_agg(first_channel_in_session ignore nulls)) as channel_most_freq,
166        array_concat_agg(item_interactions limit 200) as item_interactions
167    from
168        sessions
169    where -- remove when grouping on device-id
170        customer_id is not null -- remove when grouping on device-id
171    group by
172        customer_id -- remove when grouping on device-id
173        --ga_client_id --uncomment when grouping on device-id instead of customer-id
174)

That's it!

Closing thoughts & next steps

  • Building a customer view enables a lot of marketing and customer service use cases. But, make sure you have the appropriate consent and be transparent to your users about how you process and use the data collected.
  • You can run the query daily or multiple times per day. Based on the amount of data you have in GA4, keep in mind that this is probably a rather resource intensive query. You could split up the query and create an intermediate table for the session, an append data to this table instead of generating for the complete date range every time.
  • If your looking to activate audiences (based on the customer view) in your (marketing) channels, you could check out this article where I describe how to send an audience from BigQuery to Google Ads.
  • Lastly, if you want to personalize your website or app using (customer) data, you will need some sort of mechanism that can provide this data to your website fast. Using BigQuery directly is not a solid solution. In this article I describe how to set this up, using Redis as a fast-read database and Google Tag Manager Server to serve the data.
Did you like this article? Stay hydrated and get notified when new articles, tutorials or other interesting updates are published. No spam and you can unsubscribe at any time. You can also check my Twitter account for updates and other interesting material.