Enrich a Single Customer View with Google Analytics 4 BigQuery data
Use Google Analytics 4 data to enrich a 360° Customer View with online behavioral data, such as visit frequency, most used device, channel and/or engagement metrics. This tutorial provides an SQL example you can use directly.
- Enrich a single customer view with online behavioral data collected by Google Analytics 4
- Use the GA4 BigQuery exports to aggregate Google Analytics 4 data to user level
- Think of most frequently used marketing channel, device, visit frequency & recency and most viewed content.
- Go directly to the SQL code
What is a Single Customer View?
A (single) customer view is also called a Unified Customer View, a 360° Customer View or Profile. Basically it aggregates data on user level from different sources to a single profile / representation of a customer (or user).
Most Customer Data Platforms (CDPs) offer this functionality, but you can also build a single customer view within your data warehouse. Common sources for these customer views are:
- CRM databases
- Customer service systems
- Order databases
- Offline behavior (e.g. in-store purchases using a loyalty card)
- Online behavior
In this article, we will add the online behavioral data the the customer view. Example:
By creating such a customer view, data can be used to (but certainly not limited to):
- Get a better and more complete understanding of your customer.
- Input for (customer) segmentation (e.g. clustering).
- Input for modeling / predictions (e.g. conversion or churn prediction)
- Target ads (think of email or social/display campaigns, but also on-site personalization)
- Channel preference / optimization
- Content recommendations
- Providing additional information for customer service (e.g. by connecting the view to customer service / CRM software)
Use Google Analytics 4 data to extract online behavior
As mentioned, Google Analytics 4 (GA4) can be used to add online behavior to the customer view. Leveraging the GA4 BigQuery exports, we can aggregate data to user level, thus providing all sorts of interesting user features.
Caution! Make sure you have the appropriate user consent before collecting and using the data.
Customer identification & mapping table
To match GA4 data with other data sources, you will need some sort of a common identifier, like a customer-ID.
You could also build user profiles of anonymous users. So not for every use case a customer-ID is required. In this article, we will assume that you want to connect the GA4 data to other systems (but provided comments in the SQL code below to build profiles for anonymous users).
So if you're planning to connect the data to other systems, you will need to configure the collection and matching of customer-IDs / user-IDs within Google Analytics 4. This article will show you how to create a user mapping table (private Identity Graph), which you can use to:
- Match device-IDs (GA4 cookie-ID:
user_pseudo_id
) to customer-IDs or user-IDs - Use this match / mapping table to easily add a customer-ID to sessions and unify sessions before aggregating the data to users, thus creating cross-device profiles, depending on the identification methods (e.g. login).
Private Identity Graph
Extracting data from the BigQuery GA4 dataset
We assume you've already set up the GA4 exports to Google BigQuery.
If you've decided to group on customer-ID, we assume you've set up the mapping table described in the previous chapter.
Now we're going to transform al the GA4 event level data to user level data. The different steps explained:
- Aggregate the event (hit) level data to session level. You have to set the lookback_window variable for which the user views will be constructed. We've set this to 12 months, but you should adjust this to your specific situation.
- Match sessions on device-ID (
user_pseudo_id
) to customer-IDs. We will use the mapping table that you can find in this article, as described in the previous chapter. Based on this table, we will create an intermediate match table, where the ga_client_id (=user_pseudo_id
) is unique. We only select device-IDs that have a maximum of 3 (or less) customer-IDs attached. If multiple customer-IDs, the most recent one is selected. You can skip this step if you want to create a view for anonymous users. - Aggregate the session level data to user level, and creating the "customer view".
The query will provide you with a bunch of user features (described in the table below). You're certainly not limited to these features, as you can come up with a lot more, specific for your business. The query will show you some concepts how to get different kinds of data and aggegrations.
All features are calculated within the configured lookback period (except first_touch
and traffic_source_first
, because these are persistent parameters within the GA4 dataset)
Feature | Description |
---|---|
first_touch | Timestamp of the very first session (can be outside the lookback window) |
first_session | Timestamp of first session within lookback period |
last_session | Timestamp of last session |
last_transaction | Timestamp of last transaction |
sessions | Total number of sessions (complete lookback window) |
sessions_1d_7d | Sessions between last 1 and 7 days |
sessions_7d_30d | Sessions between last 7 and 30 days |
sessions_30d_90d | Sessions between last 30 and 90 days |
sessions_engaged | Total number of engaged sessions (see definition) |
engagement_time_avg | Average engagement time of engaged sessions (complete lookback window) |
logins | Sessions with a login |
total_transactions | Total number of transactions |
total_transaction_value | Sum of transaction values |
device_types | Number of unique device categories (mobile / tablet / desktop) used |
channels | Number of unique channel (groupings) used |
days_since_last_session | Days between current date and last session timestamp |
days_since_last_transaction | Days between current date and last transaction timestamp |
device_first | First device (within lookback period) |
device_last | Last device |
device_most_freq | Most frequently used device |
os_last | Last used operating system |
os_most_freq | Most frequently used operating system |
os_browser_last | Last used browser |
os_browser_most_freq | Most frequently used browser |
channel_acquisition | Acquisition channel (channel of first touch session, can be outside the lookback window). |
channel_first | First used channel (within lookback period ) |
channel_last | Last used channel. A session can have multiple channels,the first channel within a session is used. |
channel_most_freq | Most frequently used channel. A session can have multiple channels,the first channel within a session is used. |
last_viewed_items | Last 5 items viewed. Deduplicated. |
most_viewed_items | Most viewed items |
mosted_viewed_categories | Most viewed categories, based on items views |
Query: The customer view
Keep in mind, this is a rather resource intensive query, so when used on large datasets make sure you select a small amount of data tot test. I'm also aware that it isn't he most efficient query (tips welcome!). In production, you could split up the query (for example create a separate sessions table that is updated daily, so you don't have to query the complete raw GA4 dataset every day).
I've included a temporary function for defining channel groupings (based on traffic source and medium), but you could also use a permanent function (see this article).
1declare lookback_window int64 default 365; -- how many days to lookback into the ga4 dataset to calculate profiles
2
3-- udf: channel grouping (you could put this in a permanent function)
4-- also see https://stacktonic.com/article/google-analytics-4-and-big-query-create-custom-channel-groupings-in-a-reusable-sql-function
5create temporary function channel_grouping(tsource string, medium string, campaign string) as (
6 case
7 when (tsource = '(direct)' or tsource is null)
8 and (regexp_contains(medium, r'^(\(not set\)|\(none\))$') or medium is null)
9 then 'direct'
10 when regexp_contains(medium, r'^(social|social_advertising|social-advertising|social_network|social-network|social_media|social-media)$')
11 then 'social'
12 when regexp_contains(medium, r'^(email|mail)$')
13 then 'email'
14 when regexp_contains(medium, r'^(affiliate|affiliates)$')
15 then 'affiliate'
16 when regexp_contains(medium, r'^(cpc|ppc|paidsearch)$')
17 then 'search_paid'
18 when regexp_contains(medium, r'^(display|cpm|banner)$')
19 then 'display'
20 when medium = 'organic'
21 then 'search_organic'
22 when medium = 'referral'
23 then 'referral'
24 else '(other)'
25 end
26);
27
28-- udf: most frequent value from array
29create temp function most_freq(arr any type) as ((
30 select x from unnest(arr) x group by x order by count(1) desc limit 1
31));
32
33-- udf: get first channel within session
34create temp function first_channel_session(arr any type) as ((
35 select
36 channel_grouping(x.source, x.medium, x.campaign)
37 from unnest(arr) x
38 where x.ignore_referrer is null
39 order by x.event_timestamp asc limit 1
40));
41
42with
43-- this in-between table is based mapping table described in this article (you can leave out this part if you want to group on device-id only)
44-- https://stacktonic.com/article/create-a-user-mapping-table-based-on-the-google-analytics-4-big-query-dataset)
45-- only select ga_client_id (`user_pseudo_id`) with 3 of less customer-ids attached.
46-- when multiple customer-ids, select most recent one.
47customer_id_mapping as (
48 select
49 ga_client_id,
50 customer_ids,
51 (select x.customer_id from unnest(customer_ids) x order by x.timestamp desc limit 1) as customer_id
52 from (
53 select
54 ga_id.id as ga_client_id,
55 count(customer_id) as customer_ids_count,
56 array_agg(struct(customer_id, ga_id.timestamp)) as customer_ids,
57 from
58 `<your-project>.<your-dataset>.<your-mapping-table>`,
59 unnest(ga_client_id) as ga_id
60 group by
61 ga_client_id
62 )
63 where
64 customer_ids_count <= 3
65),
66-- aggegrate event level data to session level data and attach customer-id from mapping table
67sessions as (
68 select
69 user_pseudo_id as ga_client_id,
70 max(m.customer_id) as customer_id, -- logic to select only one customer-id per user_pseudo_id can be found in `customer_id_mapping`
71 concat(user_pseudo_id,'.',(select cast(value.int_value as string) from unnest(event_params) where key = 'ga_session_id')) as session_id, -- combine user_pseudo_id and session_id for a unique session-id
72 timestamp_micros(min(event_timestamp)) as session_start,
73 max((select value.string_value from unnest(event_params) where key = 'session_engaged')) as session_engaged,
74 max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec'))/1000 as engagement_time,
75 max(lower(traffic_source.source)) as traffic_source_first,
76 max(lower(traffic_source.medium)) as traffic_medium_first,
77 min(timestamp_micros(user_first_touch_timestamp)) as session_start_first_touch,
78 max(device.category) as device,
79 max(device.operating_system) as os,
80 max(device.web_info.browser) as browser,
81 max(geo.country) as country,
82 max(geo.city) as city,
83 ifnull(first_channel_session(
84 array_agg(
85 if(event_name in('page_view','user_engagement','scroll'), struct(
86 event_timestamp,
87 lower((select value.string_value from unnest(event_params) where key = 'source')) as source,
88 lower((select value.string_value from unnest(event_params) where key = 'medium')) as medium,
89 lower((select value.string_value from unnest(event_params) where key = 'campaign')) as campaign,
90 (select value.int_value from unnest(event_params) where key = 'entrances') as is_entrance,
91 (select value.int_value from unnest(event_params) where key = 'ignore_referrer') as ignore_referrer
92 ), null)
93 ignore nulls order by event_timestamp asc)
94 ), 'direct') as first_channel_in_session,
95 --countif(event_name = '<name-of-some-other-conversion-event>') as some_conversion,
96 if(countif(event_name = 'login') > 0,1,0) as session_login,
97 countif(event_name = 'purchase') as transactions,
98 sum(ecommerce.purchase_revenue) as transaction_value,
99 array_agg(
100 if(event_name = 'view_item' or event_name = 'add_to_cart', struct(
101 event_timestamp,
102 event_name,
103 (select item_id from unnest(items) limit 1) as item_id,
104 (select item_category from unnest(items) limit 1) as item_category
105 ), null)
106 ignore nulls order by event_timestamp desc limit 100) as item_interactions
107 from
108 `<your-project>.analytics_<your-ga4-property-id>.events_*` as s
109 left join customer_id_mapping as m
110 on m.ga_client_id = s.user_pseudo_id
111 where
112 _table_suffix between format_date('%Y%m%d',date_sub(current_date(), interval lookback_window day))
113 and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
114 group by
115 user_pseudo_id,
116 session_id
117)
118
119-- aggegrate to user level
120select
121 * except (item_interactions),
122 array_to_string(
123 array(select x.item_id from unnest(item_interactions) x where x.event_name = 'view_item' group by x.item_id order by max(x.event_timestamp) desc limit 5),
124 '|'
125 ) as last_viewed_items,
126 array_to_string(
127 array(select x.item_id from unnest(item_interactions) x where x.event_name = 'view_item' group by x.item_id order by count(1) desc limit 5),
128 '|'
129 ) as most_viewed_items,
130 array_to_string(
131 array(select x.item_category from unnest(item_interactions) x where x.event_name = 'view_item' group by x.item_category order by count(1) desc limit 5),
132 '|'
133 ) as most_viewed_categories
134from (
135 select
136 --ga_client_id, -- uncomment when grouping on device-id instead of customer-id
137 customer_id, -- remove when grouping on device-id
138 min(datetime(session_start_first_touch)) as first_touch,
139 min(datetime(session_start)) as first_session,
140 max(datetime(session_start)) as last_session,
141 max(datetime(if(transactions > 0, session_start, null))) as last_transaction,
142 count(session_id) as sessions,
143 countif(date_diff(current_date, datetime(session_start), day) between 0 and 7) as sessions_1d_7d,
144 countif(date_diff(current_date, datetime(session_start), day) between 8 and 30) as sessions_7d_30d,
145 countif(date_diff(current_date, datetime(session_start), day) between 31 and 90) as sessions_30d_90d,
146 countif(session_engaged = '1') as sessions_engaged,
147 safe_divide(sum(engagement_time), countif(session_engaged = '1')) as engagement_time_avg,
148 sum(session_login) as logins,
149 sum(transactions) as total_transactions,
150 ifnull(sum(transaction_value),0) as total_transaction_value,
151 count(distinct device) as devices,
152 count(distinct first_channel_in_session) as channels,
153 date_diff(current_date(),cast((max(session_start)) as date),day) as days_since_last_session,
154 date_diff(current_date(),cast((max(if(transactions > 0, session_start, null))) as date),day) as days_since_last_transaction,
155 string_agg(device order by session_start asc limit 1) as device_first,
156 string_agg(device order by session_start desc limit 1) as device_last,
157 most_freq(array_agg(device)) as device_most_freq,
158 string_agg(os order by session_start desc limit 1) as os_last,
159 most_freq(array_agg(os)) as os_most_freq,
160 string_agg(browser order by session_start desc limit 1) as browser_last,
161 most_freq(array_agg(browser)) as browser_most_freq,
162 channel_grouping(min(traffic_source_first), min(traffic_medium_first), null) as channel_acquisition,
163 string_agg(first_channel_in_session order by session_start asc limit 1) as channel_first,
164 string_agg(first_channel_in_session order by session_start desc limit 1) as channel_last,
165 most_freq(array_agg(first_channel_in_session ignore nulls)) as channel_most_freq,
166 array_concat_agg(item_interactions limit 200) as item_interactions
167 from
168 sessions
169 where -- remove when grouping on device-id
170 customer_id is not null -- remove when grouping on device-id
171 group by
172 customer_id -- remove when grouping on device-id
173 --ga_client_id --uncomment when grouping on device-id instead of customer-id
174)
That's it!
Closing thoughts & next steps
- Building a customer view enables a lot of marketing and customer service use cases. But, make sure you have the appropriate consent and be transparent to your users about how you process and use the data collected.
- You can run the query daily or multiple times per day. Based on the amount of data you have in GA4, keep in mind that this is probably a rather resource intensive query. You could split up the query and create an intermediate table for the session, an append data to this table instead of generating for the complete date range every time.
- If your looking to activate audiences (based on the customer view) in your (marketing) channels, you could check out this article where I describe how to send an audience from BigQuery to Google Ads.
- Lastly, if you want to personalize your website or app using (customer) data, you will need some sort of mechanism that can provide this data to your website fast. Using BigQuery directly is not a solid solution. In this article I describe how to set this up, using Redis as a fast-read database and Google Tag Manager Server to serve the data.