Data Engineering

Backup your valuable BigQuery Views and Scheduled Queries using Python

Making data backups is important. But what about your valuable SQL code, hidden away in views and scheduled queries? This tutorial will show you how to backup all the SQL code into a Git repository using Python.

When a lot users are actively participating in the development of your Google Cloud Platform (GCP) environment, it's hard to track changes to views and scheduled queries. This solution will;

  • Backup your BigQuery views and scheduled queries to .sql files on your local filesystem.
  • Ability to commit changes to a Git repository. When running the script regularly, you can track change history.
  • Supports multiple GCP projects.
  • Go directly to the code

Preparation

We will use a Service Account to authenticate with your Google Cloud project(s).

You can use the same service account for all the projects by adding this service account with the correct role(s) to all the GCP projects you would like to backup.

If you're not familiar with setting up a Service Account, read the Google documentation first.

  • Create a Service Account and enable the BigQuery User role for the account.
  • Generate and download the JSON keyfile for the Service Account.
  • Enable the Data Transfer API within your GCP project

How the script works

  • The Python script will fetch the SQL code of all the views and the scheduled queries within the GCP projects configured in the script.
  • Files are downloaded to your local filesystem.
  • When you enable the REPO_COMMIT setting, the script will clone a (remote) repository first, fetches the views and scheduled query SQL and commits the changes to the repo. Before it will clone the remote repository, it will delete an old version of the repo first

When using a Git repository to backup your SQL and track change history, you can use a SSH key for authentication. See instructions for GitHub or Bitbucket

File structure

EntityFolder
Views<backup-folder>/<project-id>/bq_views/<dataset-name>/<table-name>.sql
Scheduled queries<backup-folder>/<project-id>/bq_scheduled_queries/<scheduled-query-name>.sql

bigquery-backup_file_structure.PNG

Installation and running the code

Installing packages

The script requires some dependencies. Use pip to install these packages:

  • pip install google-auth
  • pip install google-cloud-bigquery
  • pip install GitPython

Last step is to adjust the settings in the script:

SettingDescription
LOCAL_PATH_BACKUPLocal path to your backup (or local repository) folder. Use a trailing slash
REPO_COMMITWhen set to True, the script will clone the Git repository specified in the REPO_LINK setting. Then it will delete all the folders in the repo, download the latest version of the views and scheduled queries to the repo and commits the changes.
REPO_KEY_PATHPath to your SSH key. Leave blank when not using an SSH key.
GCP_JSON_KEYPATHPath to the Service Account JSON keyfile
GCP_PROJECTS_BACKUPSpecify the GCP project-IDs (and locations) that the script has to check for views and queries. For the location value see the region column in the GCP Scheduled Query interface

Now you're all set to run the code:

  • python backup_views_scheduled_queries.py

The Python script

Save the code to a file manually or download the code directly from our GitHub repository.

1############################################################
2# Author    Krisjan Oldekamp / Stacktonic.com              
3# Email     krisjan@stacktonic.com                         
4# Article   https://stacktonic.com/article/backup-your-valuable-big-query-views-and-scheduled-queries-using-python
5############################################################
6
7import os
8import git
9import google.oauth2.service_account
10from google.cloud import bigquery
11from google.cloud import bigquery_datatransfer # Also enable the Data Transfer API in the GCP console
12
13############################################################
14# Settings
15
16LOCAL_PATH_BACKUP = "./gcp-bigquery-sql-backup/" # Backup directory (end with a trailing /). When REPO_COMMIT is set to True, this folder will also be used to clone and commit to the Git repository
17
18REPO_COMMIT = True # When set to True, the script will clone the Git repository specified in the REPO_LINK setting. Then it will delete all the folders in the repo, download the latest version of the views and scheduled queries to the repo and commits the changes.
19REPO_LINK = "git@bitbucket.org..." # Repository link
20REPO_KEY_PATH = "../.ssh/..." # Path to SSH private key used for authentication with the repository. Leave empty when not using an SSH key.
21
22GCP_JSON_KEYPATH = "your-keyfile.json" # Path to JSON keyfile for Service Account
23
24# You can specify multiple projects. The location / region is required for downloading scheduled query SQL.
25GCP_PROJECTS_BACKUP = [
26    {
27        "project_id": "your-gcp-project-id",
28        "location": "europe" # See the "region" column in the GCP scheduled query interface
29    }
30]
31
32# End of settings
33############################################################
34
35# Authenticate use a Service account / JSON keyfile
36def get_credentials(keyfile_json):
37    scopes=['https://www.googleapis.com/auth/bigquery']
38
39    return (
40        google.oauth2.service_account.Credentials.from_service_account_file(keyfile_json, scopes=scopes)
41    )
42
43# Format scheduled query name so it can be used in a filename
44def format_name(name):
45    return name.lower().replace(" ", "_")
46
47# Push commit to repository
48def git_push_commit(repo, repo_key_path, repo_local_path):
49    try:
50        if repo_key_path == "":
51            repo.git.add(all=True)
52            repo.index.commit("Automated commit containing changes to BigQuery views and scheduled queries.")
53            repo.remotes.origin.push()
54        else:
55            with git.Git().custom_environment(GIT_SSH_COMMAND="ssh -i " + repo_key_path):
56                repo.git.add(all=True)
57                repo.index.commit("Automated commit containing changes to BigQuery views and scheduled queries.")
58                repo.remotes.origin.push()
59    except:
60        print("Some error occured while pushing the commit")    
61
62# Save Google BigQuery views to local filesystem
63def save_bigquery_views(credentials, project_id, path):
64    
65    client_bq = bigquery.Client(credentials=credentials, project=project_id)
66    datasets = list(client_bq.list_datasets())
67    cnt_views = 0
68
69    # Loop datasets
70    if datasets: 
71        for dataset in datasets: 
72            dataset_name = dataset.dataset_id
73            dataset_ref = dataset.reference
74            tables = list(client_bq.list_tables(dataset_ref))
75
76            # Loop tables in dataset
77            for table in tables:
78                # Only select VIEW
79                if table.table_type == "VIEW":
80                    table_name = table.table_id
81                    table_ref = dataset_ref.table(table_name)
82                    table = client_bq.get_table(table_ref)
83                    
84                    backup_directory = path + project_id + "/bq_views/" + dataset_name
85                    if not os.path.exists(backup_directory):
86                        os.makedirs(backup_directory) # Create directory when it doesn't exist
87                    
88                    # Save view SQL to file
89                    f = open(backup_directory + "/" + table_name + ".sql", "w+") # Save view SQL to file
90                    f.write(table.view_query)
91                    cnt_views+=1
92
93    return cnt_views
94
95# Save Google BigQuery scheduled queries to local filesystem   
96def save_bigquery_scheduled_queries(credentials, project_id, location, path):
97    
98    client_bq_df = bigquery_datatransfer.DataTransferServiceClient(credentials=credentials)
99    parent = "projects/{}/locations/{}".format(project_id, location)
100    cnt_scheduled_queries = 0
101
102    # Loop all Data Transfer elements in project
103    for element in client_bq_df.list_transfer_configs(parent):
104        
105        scheduled_query_name = format_name(element.display_name)
106        scheduled_query_sql = ""
107        
108        # Loop over Data Transfer parameters, select scheduled queries and get query SQL
109        params = element.params.items()
110        for key, value in params: 
111            if key == "query":
112                scheduled_query_sql = value
113
114        backup_directory = path + project_id + "/bq_scheduled_queries/"
115        if not os.path.exists(backup_directory): # Create directory when it doesn't exist
116            os.makedirs(backup_directory)
117
118        # Write to file
119        f = open(backup_directory + "/" + scheduled_query_name + ".sql", "w+")
120        f.write(scheduled_query_sql)
121        cnt_scheduled_queries+=1
122
123    return cnt_scheduled_queries
124
125def execute():
126    
127    credentials = get_credentials(GCP_JSON_KEYPATH)
128    
129    print ("Found {} GCP projects in settings".format(len(GCP_PROJECTS_BACKUP)))
130
131    if REPO_COMMIT:
132        # Clone repository if Git commits are enabled
133        print("Git commits enabled. Cloning repository {} to {}".format(REPO_LINK, LOCAL_PATH_BACKUP))
134        
135        # Delete folder / repository when existing
136        if os.path.exists(LOCAL_PATH_BACKUP):
137            git.rmtree(LOCAL_PATH_BACKUP)
138        
139        if REPO_KEY_PATH == "":
140            repo = git.Repo.clone_from(REPO_LINK, LOCAL_PATH_BACKUP)     
141        else:
142            repo = git.Repo.clone_from(REPO_LINK, LOCAL_PATH_BACKUP, env={"GIT_SSH_COMMAND": "ssh -i " + REPO_KEY_PATH}) # Use SSH key
143    
144        # Remove old backups in repository
145        repo_dirs = next(os.walk(LOCAL_PATH_BACKUP))[1]
146        for dir in repo_dirs:
147            if not dir.startswith("."):
148                git.rmtree(os.path.join(LOCAL_PATH_BACKUP, dir))
149    else:
150        # Only download to local filesystem. Create directory when it doesn't exist
151        if not os.path.exists(LOCAL_PATH_BACKUP):
152            os.makedirs(LOCAL_PATH_BACKUP)
153
154    # Loop through GCP project and save views and scheduled queries
155    for i, project in enumerate(GCP_PROJECTS_BACKUP):
156
157        print ("-- Starting backup for project: {}".format(project["project_id"]))
158        
159        views = save_bigquery_views(
160            credentials, 
161            project["project_id"], 
162            LOCAL_PATH_BACKUP
163        )
164        print("# {} views saved..".format(views))
165
166        scheduled_queries = save_bigquery_scheduled_queries(
167            credentials, project["project_id"], 
168            project["location"], 
169            LOCAL_PATH_BACKUP
170        )
171        print("# {} scheduled queries saved..".format(scheduled_queries))
172
173    # Push code to remote repository
174    if REPO_COMMIT:
175        git_push_commit(repo, REPO_KEY_PATH, LOCAL_PATH_BACKUP)
176        print("Pushed code to repository..")
177
178    print("Done.")
179       
180execute()

Some last tips

  • You probably want to schedule this script daily. You can use your favorite scheduling tool like Apache Airflow or good ol' cronjobs.
  • On windows, configuring SSH keys can give you some headaches. See the GitHub and the Bitbucket documentation for troubleshooting.
Did you like this article? Stay hydrated and subscribe to a monthly roundup of our newest articles and tutorials. No spam and you can unsubscribe at any time. You can also checkout my Twitter account for updates.