Automation & Activation

Create a User Profile and Recommender service using BigQuery, Redis and GTM Server

Learn how to serve user-level data from BigQuery to your website or app, using Redis (for low latency), Cloud Functions and Google Tag Manager Server. The user-level profile data can be used for on-site personalization or data enrichment within GTM Server.

  • Provide user level data to the client (e.g. browser or app).
  • Data can be used for personalization, enrichment or act as a persistent user or session store.
  • Low-latency response by using a Redis cache server (Google Memorystore) acting as the storage mechanism.
  • Google Cloud Functions are used to load and fetch data from the Redis server.
  • Google Tag Manager Server (GTM) is used to serve the data using your own domain (+ optionally enrich or distribute the data).

Caution! Make sure you have the appropriate user consent to collect user level data in the first place and do not expose sensitive user data to the client.

Architectural overview

Google BigQuery is a great tool to combine and analyze large amounts of data. However, it's not suited to provide data directly to (high volume) production environments.

The solution proposed in this tutorial solves that problem by using Redis as a low-latency storage mechanism that can be used in high-volume environments. An overview:

[Image: google-tag-manager-bigquery-redis-user-store.png — architectural overview]

  1. Google BigQuery holds the master data table, for example (product) recommendations or audience membership on user level, calculated once or multiple times per day. The table should have a unique key (e.g. some sort of user ID).
  2. This BigQuery master table is exported to a newline-delimited JSON file in a Google Cloud Storage bucket (JSON export is available within the default export options).
  3. Google Cloud Function #1 is triggered by the creation of new files in this bucket and puts the contents into the Redis cache (Google Memorystore).
  4. Google Cloud Function #2 is triggered by HTTP, requests a specified key (query parameter) in the Redis cache and returns the JSON response.
  5. To be able to connect the serverless Cloud Functions to the Redis instance, a serverless VPC connector needs to be set up.
  6. A custom Client in Google Tag Manager Server is configured to serve the response from Cloud Function #2 under your own domain.
  7. Using Google Tag Manager enables you to make this service available under your own domain, and you can enrich or distribute the data directly in GTM Server. It adds some latency, however.

Disclaimer: I did not test this setup in a large-volume production environment.

Why a user (profile) store?

If you want to personalize your website or app based on (historical) user data, you will need a service that provides this data. Most likely you will need data on user level, think of:

  • Product recommendations that are pre-calculated every day (or hour).
  • Customer behavior or audience membership (for example loyalty member yes/no). You could build this on top of a 360° customer view.
  • A/B testing groups

But also:

  • Build a persistent session storage (instead of writing data to cookies)
  • User-ID management and profile "stitching".

In many cases, this data is available somewhere in your data warehouse (we assume Google BigQuery). However, if we want to personalize on-site or in-app, or enrich data in real-time, the data service should have the following requirements:

  • Low latency (input for personalization should arrive quickly, so content can be generated as fast as possible)
  • Scalable: should be able to handle a lot of reads / requests

As mentioned, Google BigQuery does not fulfill these requirements. So we need another storage mechanism.

I've researched the option of using GTM Server as a cache directly. GTM has a cache mechanism in the form of the templateStorage API. You can persist data with this API, but it's not shared between the different instances of your GTM Server cluster. Besides, it runs in the memory of your GTM Server (there is no dedicated in-memory cache), so you don't want to add too much data using templateStorage, because it can hurt the stability of your setup. Since it's a sandboxed environment, we also can't connect to Redis directly. So at this point in time, it's unsuited for the use cases we have in mind. However, if such functionality is released to GTM Server in the future, the same mechanism used to feed Redis can be used to provide data to GTM.

The following illustration gives a nice overview of the storage possibilities within GCP:

[Image: which-database-google-cloud.jpeg — Google Cloud database options. Image by thecloudgirl.dev]

Since Cloud Bigtable has somewhat higher entry barriers (not possible to set up a low-cost or pay-per-use instance), we're opting for Memorystore, in our case Redis.

While I was writing this article, I stumbled upon a great article by Lukas Oldenburg using Google Firestore. However, in terms of latency, Firestore doesn't seem to be the best choice when using the data for personalization or persistent session storage in low-latency, high-read environments.

That's that. Let's set up the Redis server.

1. Create and export a BigQuery table to JSON

This article assumes you want to create some sort of fast-access user storage system for personalization or data enrichment. The mechanism described here, however, is not limited to user data. You can use all sorts of data (think of product data).

The first step is to create a BigQuery master table that is the source of your user store. It should have a unique key for every user (device-ID or user-ID for example), that is used as the key in the Redis store.

You need to export this table to a newline-delimited JSON file (you could automate this, once or multiple times per day) in your Cloud Storage bucket, which we will create later on.

[Image: bigquery-export-to-json.PNG — BigQuery export options]

Also see the example / test file.

{"key": "123", "audience_1": "1", "audience_2": "1", "ab_test_1": "3" }
{"key": "234", "audience_1": "0", "audience_2": "1", "ab_test_2": "2" }
{"key": "456", "audience_1": "1", "audience_2": "0", "ab_test_3": "1" }
{"key": "456", "audience_1": "1", "audience_2": "0", "ab_test_4": "3" }
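To illustrate the contract between the export file and the Redis store, here is a minimal sketch in plain Node.js of how Cloud Function #1 (described later) interprets such a file: each non-empty line is parsed, the "key" property becomes the Redis key, and the raw line is stored as the value. A Map stands in for Redis here; the sample records are from the test file above.

```javascript
// Newline-delimited JSON, as exported from BigQuery (sample records).
const ndjson = [
  '{"key": "123", "audience_1": "1", "audience_2": "1", "ab_test_1": "3" }',
  '{"key": "234", "audience_1": "0", "audience_2": "1", "ab_test_2": "2" }',
  '{"key": "456", "audience_1": "1", "audience_2": "0", "ab_test_3": "1" }',
].join('\n');

// A Map stands in for the Redis SET calls of Cloud Function #1.
const store = new Map();
for (const line of ndjson.split('\n')) {
  if (!line.trim()) continue;      // skip empty lines
  const record = JSON.parse(line);
  store.set(record.key, line);     // last write wins for duplicate keys
}
```

Note that duplicate keys (like "456" appearing twice in the test file) simply overwrite each other: the last record in the file wins.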

2. Create the Memorystore Redis service

Unfortunately, Google Memorystore is not available on a pay-per-use model, so we have to create a minimal setup. Google estimates the monthly costs for this setup around $40 (when using 1GB of memory).

  • In GCP, navigate to: Memorystore > Redis > Create Instance
  • Name: Whatever you like (e.g. gmc-user-store)
  • Tier: Basic
  • Region: Should be the same region as your Cloud Function! I've used europe-west3 (EU, Frankfurt)
  • Capacity: 1 GB (lowest possible).

To connect the Redis server to your Google Cloud Function, both have to be in the same region.

3. Set up a Serverless VPC connector

Now, for the Cloud Functions to be able to connect to the Memorystore Redis server, they need to be added to the VPC network of Memorystore. However, since Cloud Functions are serverless, we need to set up a serverless VPC connector.

  • Navigate to: Networking > VPC network > Serverless VPC access
  • Enable the Serverless VPC Access API (if not already)
  • Click "Create connector"
  • Name: vpc-redis-to-gcf (or use your own name)
  • Region: Same region as Memorystore / Redis and your Cloud Functions (I've used europe-west3)
  • Network: default (or select another network if you have configured a different VPC network)
  • Subnet: 10.8.0.0/28
  • Min. instances: 2
  • Max instances: 3 (minimal first setup; adjust to your situation).
  • Instance-type: e2-micro

When creating your Cloud Function later on, you can select the VPC connector you just created in the settings.

The VPC connector also generates some monthly costs. Estimated monthly costs for this setup: between $15 (using min. number of instances which is 2) and $80 (when using 10 instances).

Also see this fine article about setting up the serverless VPC connector and other services automatically.

4. Create a Storage bucket

Create a Storage bucket:

  • The BigQuery master table is exported to this bucket in JSON format.
  • A Cloud Function trigger fires when new files are created within this bucket (you will configure this in the next chapter).
  • Create an outbound folder (defaults to /outbound). Cloud Function #1 will check this folder for new files.
  • Cloud Function #1 will automatically create a /processed folder, to which processed files are moved.

5. Cloud Function #1 - Cloud Storage to Redis

The first function triggers when a new file is placed into the Google Cloud Storage bucket (in the /outbound folder). This JSON file is processed and uploaded to the Redis server. I've used the Cloud Function described in this article and made some modifications (listening to a specific folder within the bucket and the option to move the file after processing).

  • Name: Whatever you like (e.g. gcf-bigquery-to-redis)
  • Region: Should be the same region as your Memorystore Redis service! I've used europe-west3 (EU, Frankfurt)
  • Trigger: Cloud Storage
  • Event type: Finalize/Create (so the function is triggered when a new file, in this case the profile updates, is uploaded to the Google Cloud Storage bucket).
  • Bucket: Select your Google Cloud Bucket created in the previous chapter.
  • Retry on failure: Uncheck
  • Runtime: Node.js 14
  • Entrypoint: loadCloudStorageToRedis

Runtime, build, connections and security settings

  • (Runtime) Memory allocated: 1GB (also depends on the type and size of files you want to process)
  • (Runtime) Timeout: 300 seconds (since the process could take some time for large files). Adjust to your situation.
  • (Connections) Set VPC connector to vpc-redis-to-gcf (or the VPC connector name of your own choosing)

Runtime environment variables

  • REDISHOST: Your Redis host (can be found in the settings screen of your created Memorystore instance).
  • REDISPORT: Can be found in the Memorystore settings as well. Defaults to 6379
  • EXPIRATION: Expiration time of records (in seconds). When uploading user profiles daily for example, you could use this to clean up old records
  • FILE_PATH_OUTBOUND: Folder in bucket to listen to. Defaults to /outbound
  • FILE_PATH_PROCESSED: Folder to place processed files. Defaults to /processed
  • FILE_PREFIX: Filename prefix to filter.

gcf-cloud-storage-to-redis/index.js (Cloud Function #1) - View code on Github

Make sure you upload both the index.js and package.json to your cloud function

'use strict';

const Redis = require('ioredis');
const {Storage} = require('@google-cloud/storage');
const split = require('split');

const REDISHOST = process.env.REDISHOST || 'localhost';
const REDISPORT = process.env.REDISPORT || 6379;
const EXPIRATION = process.env.EXPIRATION || 259200; // Expiration of records in seconds
const FILE_PATH_OUTBOUND = process.env.FILE_PATH_OUTBOUND || 'outbound';
const FILE_PATH_PROCESSED = process.env.FILE_PATH_PROCESSED || 'processed';
const FILE_PREFIX = process.env.FILE_PREFIX || '';

const redisClient = new Redis({
  host: REDISHOST,
  port: REDISPORT,
});

/**
 * Triggered from a change to a Cloud Storage bucket.
 *
 * @param {!Object} info Event payload.
 * @param {!Object} context Metadata for the event.
 */
exports.loadCloudStorageToRedis = async (info, context) => {

  const path = info.name.split('/');
  const fileName = path[path.length - 1];

  if (info.metageneration === '1' && info.name.startsWith(FILE_PATH_OUTBOUND) && fileName.startsWith(FILE_PREFIX)) {

    console.log(`New file upload: gs://${info.bucket}/${info.name}`);

    const storage = new Storage();
    const bucket = storage.bucket(info.bucket);
    const file = bucket.file(info.name);

    let keysWritten = 0;

    // Wrap the stream in a promise, so the function only resolves
    // when the whole file has been processed.
    return new Promise((resolve, reject) => {
      file.createReadStream()
        .on('error', error => reject(error))
        .pipe(split())
        .on('data', record => {
          if (!record || record === '') return;
          keysWritten++;
          const data = JSON.parse(record);
          redisClient.set(data.key, record, 'EX', EXPIRATION);
        })
        .on('end', async () => {
          console.log(`Successfully written ${keysWritten} keys to Memorystore Redis.`);

          // Move the file to the processed folder
          await bucket.file(info.name).move(`${FILE_PATH_PROCESSED}/${fileName}`);
          console.log(`File moved to: ${info.bucket}/${FILE_PATH_PROCESSED}/${fileName}`);
          resolve();
        })
        .on('error', error => reject(error));
    }).catch(e => {
      console.log(`Error importing ${fileName} to Redis: ${e}`);
    });
  }
};

6. Cloud Function #2 - Serving data from Redis

The second Cloud Function is triggered by an HTTP request, fetches the data (user store) from the Redis server and returns it in a JSON response. Later on, GTM Server passes the response of this function to the client.

  • Name: Whatever you like (e.g. gcf-server-user-store)
  • Region: Should be the same region as your Memorystore Redis service! I've used europe-west3 (EU, Frankfurt)
  • Trigger: HTTP
  • Require HTTPS: Checked
  • Runtime: Node.js 14
  • Entrypoint: httpServeFromRedis

Runtime, build, connections and security settings

  • (Runtime) Memory allocated: 512MB (depends on your specific situation)
  • (Runtime) Timeout: 5 seconds (you need this data fast, so if the function can't fetch the data within 5 seconds, fail).
  • (Connections) Set VPC connector to vpc-redis-to-gcf

Runtime environment variables

  • REDISHOST: Your Redis host (can be found in the settings screen of your created Memorystore instance).
  • REDISPORT: Can be found in the Memorystore settings as well. Defaults to 6379
  • QUERY_REDIS_KEY: Which query parameter in the GET request holds the Redis key (defaults to id)
  • QUERY_SECRET_KEY: Which query parameter in the GET request holds the secret key (defaults to secret)
  • SECRET: The value to check the secret query parameter against. If left blank, no secret key filter is applied.

To make the HTTP function publicly accessible (caution!), select the "Permissions" tab in the function settings and add allUsers with the role Cloud Function Invoker. Make sure you understand the consequences of enabling this. In the Cloud Function, we've added the option to check a secret key before returning the response. I know this is not the best mechanism, and you should look into alternative authentication mechanisms.

Make some choices about the contents and naming of the (JSON) response. If the response is openly available to the client, you might want to consider masking or renaming some properties, for example.
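Such renaming or masking could be done in the Cloud Function before the response is sent. A hypothetical sketch (the helper and the rename map are my own; the property names are examples):

```javascript
// Rename properties of the Redis record before returning it to the client,
// so internal names are not exposed as-is. Properties without a public
// name in the map are dropped entirely.
function renameProperties(data, renameMap) {
  const out = {};
  for (const [key, value] of Object.entries(data)) {
    if (key in renameMap) out[renameMap[key]] = value;
  }
  return out;
}

const masked = renameProperties(
  { key: '123', audience_1: '1', ab_test_1: '3' },
  { audience_1: 'a1', ab_test_1: 't1' } // "key" intentionally omitted
);
// masked → { a1: '1', t1: '3' }
```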

gcf-http-serve-from-redis/index.js (Cloud Function #2) - View code on Github

Make sure you upload both the index.js and package.json to your cloud function

'use strict';

const Redis = require('ioredis');

const REDISHOST = process.env.REDISHOST || 'localhost';
const REDISPORT = process.env.REDISPORT || 6379;
const QUERY_REDIS_KEY = process.env.QUERY_REDIS_KEY || 'id';
const QUERY_SECRET_KEY = process.env.QUERY_SECRET_KEY || 'secret';
const SECRET = process.env.SECRET || '';

const redisClient = new Redis({
  host: REDISHOST,
  port: REDISPORT,
});

/**
 * HTTP function
 *
 * @param {Object} req Cloud Function request context.
 * @param {Object} res Cloud Function response context.
 */
exports.httpServeFromRedis = (req, res) => {

    res.set('Access-Control-Allow-Origin', '*');
    res.set('Access-Control-Allow-Methods', 'GET');

    if (SECRET === '' || req.query[QUERY_SECRET_KEY] === SECRET) {
        if (typeof req.query[QUERY_REDIS_KEY] !== 'undefined') {
            const key = req.query[QUERY_REDIS_KEY];

            redisClient.get(key, function (err, result) {
                if (err) {
                    console.error(err);
                    res.status(400).json({ error: 'Error' });
                } else {
                    if (result !== null) {
                        const data = JSON.parse(result);
                        res.status(200).json(data);
                    } else {
                        res.status(400).json({ error: 'Key not found' });
                    }
                }
            });
        } else {
            res.status(400).json({ error: 'No key specified' });
        }
    } else {
        res.status(401).json({ error: 'Unauthorized' });
    }
};

7. Test your functions

  1. Upload a test file (link to GitHub) containing newline-delimited JSON records (also see chapter 1) to the /outbound folder.
  2. Check the logs of Cloud Function #1 and make sure the file was processed (you can also check the /processed folder in the Cloud Storage bucket).
  3. Make a request to Cloud Function #2, using one of the keys in the JSON test file. E.g. https://<your-cloud-function-url>?id=123&secret=<your-secret>
  4. The function should return the record from the test file for the key you specified.
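The test URL from step 3 can be built like this in Node.js (the function URL below is a placeholder; the parameter names match the defaults of QUERY_REDIS_KEY and QUERY_SECRET_KEY):

```javascript
// Build a test request URL for Cloud Function #2.
// The base URL is a placeholder for your own Cloud Function trigger URL.
const baseUrl = 'https://europe-west3-your-project.cloudfunctions.net/gcf-serve-user-store';
const params = new URLSearchParams({ id: '123', secret: 'your-secret' });
const requestUrl = `${baseUrl}?${params.toString()}`;
// requestUrl ends with "?id=123&secret=your-secret"
```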

8. Proxy the Cloud Function in GTM Server

The last step uses Google Tag Manager Server as a proxy to serve the user data (provided by the Cloud Function) from your own domain. Additionally, you can fire an event and run the GTM container with this data, which can be used for data enrichment (e.g. in Google Analytics).

Hopefully, GTM Server will receive some form of persistent storage functionality in the future, so you could also enrich your event data with (historical) pre-calculated or even session storage data.

I've created a custom Client template for GTM Server to fetch the data from the Google Cloud Function and return the JSON response through GTM. You can find the template here.

Import the template.tpl file to your GTM container as a custom template and configure accordingly:

[Image: gtm-server-client-http-proxy.PNG — Client template configuration]

Most important settings:

  • Request path: The path the Client needs to listen to (defaults to /profile).
  • External HTTP endpoint: Your Cloud Function trigger URL. When using a secret key, also include it as a query parameter (e.g. ?secret=123)
  • Pass query parameters: Enable. You need this so you can select a key from the Redis cache based on a query parameter (e.g. ?id=).
  • Pass response data to event: Pass the response as an event and run the GTM container (so you can use the data in your tag templates)

Also see this article by Simo Ahava about creating GTM Server Client templates.

And the last step, publish your container!

Now, when you request the correct path, it should return the data from Redis:

  • <your-gtm-server-url>/profile?id=<your-redis-key>

Example response (when data is passed to container):

[Image: gtm-user-store-event-data.PNG — example response event data]

That's it! You can now export your master table from BigQuery to the Storage bucket and the framework will handle the rest.

If you have any questions or comments, please let me know (Twitter / LinkedIn).

Closing thoughts and next steps

  • You need to create an automated process that exports a BigQuery table to a newline-delimited JSON file in the storage bucket. Exporting to JSON is a default option within BigQuery. See this post for example.
  • Double check the data you're making available in the client. Do you have the appropriate consent, or are you exposing sensitive data? You could also adjust the GTM Client template by creating a whitelist mechanism (only return properties from the JSON response that are specified in the whitelist).
  • Keep in mind that the Cloud Functions and Redis server are configured in a specific region. This could be troublesome when you want to serve globally and scale.
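The whitelist mechanism mentioned above could be sketched in plain JavaScript like this (the helper and property names are hypothetical; in a GTM template you would use the sandboxed JavaScript equivalents):

```javascript
// Only return whitelisted properties from the Redis JSON response.
function applyWhitelist(data, whitelist) {
  return Object.fromEntries(
    Object.entries(data).filter(([key]) => whitelist.includes(key))
  );
}

const filtered = applyWhitelist(
  { key: '123', audience_1: '1', internal_score: '0.87' },
  ['audience_1'] // expose only audience membership to the client
);
// filtered → { audience_1: '1' }
```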

Another solution we would like to explore in the future is AWS DynamoDB (instead of Redis). DynamoDB has a caching mechanism (called DAX) to serve low-latency requests and could be interesting for these use cases as well.

Lastly, instead of using a custom GTM client to pass the data to the client, you could also use this setup to enrich the data (for example by modifying the Google Analytics Client).
