Data Engineering

Create a User Mapping table based on the Google Analytics 4 BigQuery dataset

When joining data sources on user level, in many occasions the same user has different types of identifiers (or multiple identifiers of the same type). Think of different device (cookie)-IDs, customer-IDs or lead-IDs. Grouping all these identifiers into a matching or mapping table (identity graph) can be useful when you want to stitch online behavior to customers and build a 360° customer view.

This tutorial will show you how to create an identity graph containing all default and custom user identifiers that can be collected in Google Analytics 4, using the BigQuery exports.

  • Group all available user identifiers in Google Analytics 4 into one BigQuery table.
  • Create sub tables based on this master table for joining other data easily.
  • Go directly to the SQL code

Identity Resolution: What is an Identity Graph?

An identity graph is a a unified collection of user-IDs (like customer- or device-IDs) that can be associated to a user, based on their interactions with a website, app or platform.

Platforms like Facebook or Google have their own identity graphs, but the context of this article is to build an identity graph based on your first-party sources, so the identity graph will be specific for your organization.


Identity Graph of first-party identifiers found in the GA4 dataset

Different types of identifiers

The mapping table we're creating in this article, will relate all the (configured) default and custom user-IDs to a single user. In most cases, these identifiers will be of the following types (also see the figure in the previous chapter):

ID typeGA4 parameterDescription
Device-ID / Cookie-IDsuser_pseudo_idIdentifies the same user over different visits (sessions). A cookie-ID (GA client-ID) for web of instance-ID for apps.
Custom User-IDsuser_id / custom_paramFirst-party user-IDs such as customer-IDs or lead-IDs. Should be configured manually (customer-ID has to be made available in your web/app data layer and send to GA4).
Click-IDsgclidGoogle Click-ID (GCLID/DCLID) or Facebook Click-ID (FBCLID). Can be used to send conversion data to these platforms for example. In most cases this value can be retrieved on the landingpage (URL parameters).
Transaction-IDsecommerce.transaction_idIdentifying transactions belonging to a user / customer.

Since we're basically grouping all online behavior based on cookie-IDs (device/browser) to (deterministic) user identifiers such as a customer-ID, online behavior can be joined with other data sets as well, like CRM data.

I can't emphasize enough, but make sure you have the appropriate user consent based on what you're doing with the data (e.g. statistics vs marketing purposes).

There are some things you should keep in mind when working with online behavioral data in GA4:

  • Devices and browsers can be shared between multiple users. You can't be 100% sure that you're dealing with the same user for every session in Google Analytics 4. Devices / browsers can be shared for example.
  • Practical example: when customer #1 logs in on browser #1, a customer-ID is tied to the cookie-ID of browser #1. If that same browser is used by another user, this behavior is tied to customer #1 as well.
  • IDs can be shared (unintentionally). For example sharing links with with click-IDs (?gclid=) or some other form of user specific identifiers (commonly used in email links)
  • Cookies (device-IDs) can be deleted. Both by tracking prevention mechanisms such as Apple's ITP or manually.

Use cases & business value

The identity graph can be used to;

  • Join historical (browsing) behavior, thus giving insights in the path leading to a conversion
  • Provide insights in cross-device user behavior (and attribution)
  • Input for modelling like purchase or churn predictions
  • Joining CRM or other data sources with online behavior (thus creating the base for a 360° customer view)

Google Analytics 4 - Prerequisites & definitions

This tutorial assumes you have an Google Analytics 4 account and enabled the data exports to BigQuery.

Regarding the user identifiers that are collected in GA4:

  • The user_pseudo_id is the device/browser-ID identifier and collected automatically.
  • You need to 'manually' populate the user_id variable with a business specific user identifier like a customer-ID. This identifier can tie together cross-device and cross-browser behavior. Not collected automatically.
  • Additionally you could add other identifiers in custom parameters such as secondary customer-IDs, lead-IDs or click-IDs (and make sure you have the appropriate user consent).
  • The ecommerce.transaction_id parameter is populated when you use the ecommerce parameters of GA4 within the purchase event.

The mapping table explained

The mapping table consists of two parts:

  1. Collecting all identifiers > Collecting all the configured user identifiers hidden away within the events of the GA4 dataset. The query has a variable called lookback_window to set the window in which to look for identifiers. You could query the complete dataset every time (resource intensive) or daily update a table with this data.
  2. Grouping the identifiers > Make one identifier the unique key (in this tutorial, the customer-ID). Other identifiers will be aggregated into an array using the ARRAY_AGG function, including the timestamp of last recognition.


Table: Example result

We've set a limit of 100 unique alternative identifiers per customer-ID / row, since "snowballing" of user-identifiers can occur (one identifiers is mapped to too many device identifiers for example) or due to some tracking issues (device-ID is reset on every page for example).

As mentioned, for each identifier an array of structs is created, containing all the unique identifiers (using the custom DEDUP() function), including the timestamp of the last event in which the identifier was recognized. This can be used to performs checks on "freshness" or expiration. GCLIDs can expire, for example.

Now, let's create the table!

Query: The mapping table

This query will collect all the (configured) user-IDs within the GA4 dataset for all the configured user-ID and groups the results on the customer_id.

Keep in mind, this is a rather resource intensive query, so when used on large datasets make sure you select a small amount of data tot test. In production you could split up the query, one process to update the recognized identifiers (daily) and a second query that groups the data on the customer-ID (or other identifiers).

An example result (when parsed into JSON format):

2  "customer_id": "123", 
3  "ga_client_id": [
4     { "id": "789", "timestamp": "123123" },
5     { "id": "189", "timestamp": "123123" }
6  ],
7  "customer_id_secondary": [
8      { "id": "897", "timestamp": "123123" }
9  ],
10  "gclid": [
11      { "id": "89134", "timestamp": "123123" },
12      { "id": "72347", "timestamp": "123123" }
13  ],
14  "...": []

The mapping table SQL:

Also available on GitHub

1declare lookback_window int64 default 90; -- how many days to lookback into the dataset to search for ids (compared to today)
3-- udf: deduplicate array of struct 
4create temp function dedup(arr any type) as ((
5    select 
6        array_agg(t)
7    from (
8        select max(a.timestamp) as timestamp, from unnest(arr) a group by order by timestamp desc limit 100
9    ) t
12with ga_user_ids as (
13  -- select all user-ids from the ga4 dataset in the specified time period
14    select
15        *
16    from (
17        select
18            user_pseudo_id as ga_client_id, -- device-id
19            user_id as customer_id, -- custom defined user-id (e.g. customer-id)
20            (select value.string_value from unnest(user_properties) where key = 'relay_id') as customer_id_secondary, -- secondary custom user-id
21            (select value.string_value from unnest(event_params) where event_name = 'page_view' and key = 'gclid') as gclid, -- click-id
22            ecommerce.transaction_id as transaction_id, -- order-ids
23            max(event_timestamp) as event_timestamp -- timestamps
24        from `<your-project>.analytics_<your-dataset>.events_*`
25        where
26            _table_suffix between format_date('%Y%m%d',date_sub(current_date(), interval lookback_window day))
27            and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))
28        group by
29           1,2,3,4,5
30    )
31    -- filter out users without any additional user-ids besides ga_client_id
32    where
33        customer_id is not null
34        or customer_id_secondary is not null
35        or gclid is not null
36        or transaction_id is not null
39-- select all identifiers and group on customer-id.
41    customer_id, 
42    max(timestamp_micros(event_timestamp)) as timestamp_last_visit,
43    -- aggegrate customer-ids to an array of structs
44    dedup(array_agg(
45        if(ga_client_id is not null,    
46            struct(timestamp_micros(event_timestamp) as timestamp, ga_client_id as id),
47            null
48        ) ignore nulls order by event_timestamp desc
49    )) as ga_client_id,
50    -- aggegrate secondary customer-ids to an array of structs
51    dedup(array_agg(
52        if(customer_id_secondary is not null,
53            struct(timestamp_micros(event_timestamp) as timestamp, customer_id_secondary as id),
54            null
55        )ignore nulls order by event_timestamp desc
56    )) as customer_id_secondary,
57    -- aggegrate gclids to an array of structs
58    dedup(array_agg(
59        if(gclid is not null, 
60            struct(timestamp_micros(event_timestamp) as timestamp, gclid as id), 
61            null
62        ) ignore nulls order by event_timestamp desc
63    )) as gclid,
64    -- aggegrate transaction-ids to an array of structs
65    dedup(array_agg(
66        if(transaction_id is not null, 
67            struct(timestamp_micros(event_timestamp) as timestamp, transaction_id as id), 
68            null
69        ) ignore nulls order by event_timestamp desc
70    )) as transaction_id,
71from ga_user_ids
73    customer_id is not null
74group by 1

Query: Create an ID specific match table

Based on the result of the previous table, you could create a match table that can quickly provide a customer-ID for a device-ID (user_pseudo_id).


2 AS ga_client_id,
3    customer_id
4from `<your-project>.<your-dataset>.<your-mapping-table>`, 
5    unnest(ga_client_id) as ga_id
6group by 1, 2

Summary & next steps

The examples provided in this tutorial can serve as input for a lot of interesting use cases as mentioned earlier. Again, make sure you have the appropriate consent. Some next steps:

  • Group mapping table on other user identifiers (for example on device-ID)
  • Automate the generation of the mapping table. As mentioned, my advise would be to add new identification events to a master table (daily) and from there build your mapping table
  • Use the mapping table to create a single customer view or single user view, aggregating data into a user profile, which can be used for targeting or personalization.
Did you like this article? Stay hydrated and get notified when new articles, tutorials or other interesting updates are published. No spam and you can unsubscribe at any time. You can also check my Twitter account for updates and other interesting material.