Backup your valuable BigQuery Views and Scheduled Queries using Python
Making data backups is important. But what about your valuable SQL code, hidden away in views and scheduled queries? This tutorial will show you how to back up all that SQL code into a Git repository using Python.
When many users are actively participating in the development of your Google Cloud Platform (GCP) environment, it's hard to track changes to views and scheduled queries. This solution will:
- Back up your BigQuery views and scheduled queries to `.sql` files on your local filesystem.
- Optionally commit changes to a Git repository. When running the script regularly, you can track change history.
- Support multiple GCP projects.
- Go directly to the code
Preparation
We will use a Service Account to authenticate with your Google Cloud project(s).
You can use the same Service Account for all projects by adding it, with the correct role(s), to every GCP project you would like to back up.
If you're not familiar with setting up a Service Account, read the Google documentation first.
- Create a Service Account and grant it the `BigQuery User` role.
- Generate and download the JSON keyfile for the Service Account.
- Enable the `Data Transfer API` within your GCP project (a `gcloud` sketch of these steps follows below).
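If you prefer the command line, the steps above can be sketched with the `gcloud` CLI. The Service Account name, project ID and keyfile path below are placeholders, not values from this article:

```
# Create the Service Account (name is illustrative)
gcloud iam service-accounts create bq-sql-backup \
    --project=your-gcp-project-id \
    --display-name="BigQuery SQL backup"

# Grant the BigQuery User role
gcloud projects add-iam-policy-binding your-gcp-project-id \
    --member="serviceAccount:bq-sql-backup@your-gcp-project-id.iam.gserviceaccount.com" \
    --role="roles/bigquery.user"

# Generate and download the JSON keyfile
gcloud iam service-accounts keys create your-keyfile.json \
    --iam-account=bq-sql-backup@your-gcp-project-id.iam.gserviceaccount.com

# Enable the Data Transfer API
gcloud services enable bigquerydatatransfer.googleapis.com --project=your-gcp-project-id
```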
How the script works
- The Python script will fetch the SQL code of all the views and scheduled queries within the GCP projects configured in the script.
- Files are downloaded to your local filesystem.
- When you enable the `REPO_COMMIT` setting, the script will first delete any old local copy of the repository, then clone the (remote) repository, fetch the view and scheduled-query SQL, and commit the changes to the repo.
When using a Git repository to back up your SQL and track change history, you can use an SSH key for authentication. See the instructions for GitHub or Bitbucket.
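Generating a key pair for the script is a one-liner; the filename and comment below are placeholders:

```
# Generate an SSH key pair for the backup script (path and comment are illustrative)
ssh-keygen -t ed25519 -C "bq-sql-backup" -f ~/.ssh/bq_backup_key

# Then add the public key (~/.ssh/bq_backup_key.pub) as an access key / deploy key
# in your GitHub or Bitbucket repository settings.
```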
File structure
Entity | Path |
---|---|
Views | `<backup-folder>/<project-id>/bq_views/<dataset-name>/<table-name>.sql` |
Scheduled queries | `<backup-folder>/<project-id>/bq_scheduled_queries/<scheduled-query-name>.sql` |
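For example, after backing up a single project, the backup folder could look like this (the dataset, view and query names are made up for illustration):

```
gcp-bigquery-sql-backup/
└── your-gcp-project-id/
    ├── bq_views/
    │   └── marketing/
    │       └── daily_sessions.sql
    └── bq_scheduled_queries/
        └── import_campaign_costs.sql
```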
Installation and running the code
Installing packages
The script requires some dependencies, including the BigQuery Data Transfer client that the script imports. Use `pip` to install these packages:

```
pip install google-auth
pip install google-cloud-bigquery
pip install google-cloud-bigquery-datatransfer
pip install GitPython
```
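Alternatively, the same packages can be listed in a `requirements.txt` and installed with `pip install -r requirements.txt` (a sketch; add version pins to match your environment):

```
google-auth
google-cloud-bigquery
google-cloud-bigquery-datatransfer
GitPython
```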
The last step is to adjust the settings in the script:
Setting | Description |
---|---|
`LOCAL_PATH_BACKUP` | Local path to your backup (or local repository) folder. Use a trailing slash. |
`REPO_COMMIT` | When set to `True`, the script will clone the Git repository specified in the `REPO_LINK` setting. It will then delete all the folders in the repo, download the latest version of the views and scheduled queries into the repo, and commit the changes. |
`REPO_LINK` | SSH link to the remote Git repository (used when `REPO_COMMIT` is `True`). |
`REPO_KEY_PATH` | Path to your SSH private key. Leave blank when not using an SSH key. |
`GCP_JSON_KEYPATH` | Path to the Service Account JSON keyfile. |
`GCP_PROJECTS_BACKUP` | The GCP project IDs (and locations) that the script has to check for views and scheduled queries. For the location value, see the region column in the GCP scheduled-query interface. See the example below. |
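Because the script supports multiple projects, `GCP_PROJECTS_BACKUP` accepts more than one entry. The project IDs and locations in this sketch are placeholders:

```python
# Hypothetical example: back up two projects in different regions
GCP_PROJECTS_BACKUP = [
    {
        "project_id": "marketing-prod",
        "location": "europe"
    },
    {
        "project_id": "warehouse-prod",
        "location": "us"
    }
]
```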
Now you're all set to run the code:

```
python backup_views_scheduled_queries.py
```
The Python script
Save the code to a file manually or download it directly from our GitHub repository.
```python
############################################################
# Author    Krisjan Oldekamp / Stacktonic.com
# Email     krisjan@stacktonic.com
# Article   https://stacktonic.com/article/backup-your-valuable-big-query-views-and-scheduled-queries-using-python
############################################################

import os
import git
import google.oauth2.service_account
from google.cloud import bigquery
from google.cloud import bigquery_datatransfer # Also enable the Data Transfer API in the GCP console

############################################################
# Settings

LOCAL_PATH_BACKUP = "./gcp-bigquery-sql-backup/" # Backup directory (end with a trailing /). When REPO_COMMIT is set to True, this folder will also be used to clone and commit to the Git repository

REPO_COMMIT = True # When set to True, the script will clone the Git repository specified in the REPO_LINK setting. Then it will delete all the folders in the repo, download the latest version of the views and scheduled queries into the repo and commit the changes.
REPO_LINK = "git@bitbucket.org..." # Repository link
REPO_KEY_PATH = "../.ssh/..." # Path to SSH private key used for authentication with the repository. Leave empty when not using an SSH key.

GCP_JSON_KEYPATH = "your-keyfile.json" # Path to JSON keyfile for Service Account

# You can specify multiple projects. The location / region is required for downloading scheduled query SQL.
GCP_PROJECTS_BACKUP = [
    {
        "project_id": "your-gcp-project-id",
        "location": "europe" # See the "region" column in the GCP scheduled query interface
    }
]

# End of settings
############################################################

# Authenticate using a Service Account / JSON keyfile
def get_credentials(keyfile_json):
    scopes = ["https://www.googleapis.com/auth/bigquery"]
    return google.oauth2.service_account.Credentials.from_service_account_file(keyfile_json, scopes=scopes)

# Format scheduled query name so it can be used in a filename
def format_name(name):
    return name.lower().replace(" ", "_")

# Commit and push all changes to the remote repository
def git_push_commit(repo, repo_key_path, repo_local_path):
    try:
        if repo_key_path == "":
            repo.git.add(all=True)
            repo.index.commit("Automated commit containing changes to BigQuery views and scheduled queries.")
            repo.remotes.origin.push()
        else:
            # Use the SSH key for authentication when pushing
            with repo.git.custom_environment(GIT_SSH_COMMAND="ssh -i " + repo_key_path):
                repo.git.add(all=True)
                repo.index.commit("Automated commit containing changes to BigQuery views and scheduled queries.")
                repo.remotes.origin.push()
    except Exception as e:
        print("An error occurred while pushing the commit: {}".format(e))

# Save Google BigQuery views to the local filesystem
def save_bigquery_views(credentials, project_id, path):

    client_bq = bigquery.Client(credentials=credentials, project=project_id)
    datasets = list(client_bq.list_datasets())
    cnt_views = 0

    # Loop datasets
    for dataset in datasets:
        dataset_name = dataset.dataset_id
        dataset_ref = dataset.reference
        tables = list(client_bq.list_tables(dataset_ref))

        # Loop tables in dataset
        for table in tables:
            # Only select views
            if table.table_type == "VIEW":
                table_name = table.table_id
                view = client_bq.get_table(dataset_ref.table(table_name))

                backup_directory = path + project_id + "/bq_views/" + dataset_name
                if not os.path.exists(backup_directory):
                    os.makedirs(backup_directory) # Create directory when it doesn't exist

                # Save view SQL to file
                with open(backup_directory + "/" + table_name + ".sql", "w+") as f:
                    f.write(view.view_query)
                cnt_views += 1

    return cnt_views

# Save Google BigQuery scheduled queries to the local filesystem
def save_bigquery_scheduled_queries(credentials, project_id, location, path):

    client_bq_df = bigquery_datatransfer.DataTransferServiceClient(credentials=credentials)
    parent = "projects/{}/locations/{}".format(project_id, location)
    cnt_scheduled_queries = 0

    # Loop all Data Transfer elements in the project
    for element in client_bq_df.list_transfer_configs(parent=parent):

        scheduled_query_name = format_name(element.display_name)
        scheduled_query_sql = ""

        # Loop over Data Transfer parameters, select scheduled queries and get the query SQL
        for key, value in element.params.items():
            if key == "query":
                scheduled_query_sql = value

        backup_directory = path + project_id + "/bq_scheduled_queries/"
        if not os.path.exists(backup_directory): # Create directory when it doesn't exist
            os.makedirs(backup_directory)

        # Write to file
        with open(backup_directory + scheduled_query_name + ".sql", "w+") as f:
            f.write(scheduled_query_sql)
        cnt_scheduled_queries += 1

    return cnt_scheduled_queries

def execute():

    credentials = get_credentials(GCP_JSON_KEYPATH)

    print("Found {} GCP projects in settings".format(len(GCP_PROJECTS_BACKUP)))

    if REPO_COMMIT:
        # Clone repository when Git commits are enabled
        print("Git commits enabled. Cloning repository {} to {}".format(REPO_LINK, LOCAL_PATH_BACKUP))

        # Delete the folder / repository when it already exists
        if os.path.exists(LOCAL_PATH_BACKUP):
            git.rmtree(LOCAL_PATH_BACKUP)

        if REPO_KEY_PATH == "":
            repo = git.Repo.clone_from(REPO_LINK, LOCAL_PATH_BACKUP)
        else:
            repo = git.Repo.clone_from(REPO_LINK, LOCAL_PATH_BACKUP, env={"GIT_SSH_COMMAND": "ssh -i " + REPO_KEY_PATH}) # Use SSH key

        # Remove old backups in the repository
        repo_dirs = next(os.walk(LOCAL_PATH_BACKUP))[1]
        for repo_dir in repo_dirs:
            if not repo_dir.startswith("."):
                git.rmtree(os.path.join(LOCAL_PATH_BACKUP, repo_dir))
    else:
        # Only download to the local filesystem. Create directory when it doesn't exist
        if not os.path.exists(LOCAL_PATH_BACKUP):
            os.makedirs(LOCAL_PATH_BACKUP)

    # Loop through the GCP projects and save views and scheduled queries
    for project in GCP_PROJECTS_BACKUP:

        print("-- Starting backup for project: {}".format(project["project_id"]))

        views = save_bigquery_views(
            credentials,
            project["project_id"],
            LOCAL_PATH_BACKUP
        )
        print("# {} views saved..".format(views))

        scheduled_queries = save_bigquery_scheduled_queries(
            credentials,
            project["project_id"],
            project["location"],
            LOCAL_PATH_BACKUP
        )
        print("# {} scheduled queries saved..".format(scheduled_queries))

    # Push code to the remote repository
    if REPO_COMMIT:
        git_push_commit(repo, REPO_KEY_PATH, LOCAL_PATH_BACKUP)
        print("Pushed code to repository..")

    print("Done.")

execute()
```
Some last tips
- You probably want to schedule this script to run daily. You can use your favorite scheduling tool, like Apache Airflow, or good ol' cronjobs (see the example crontab entry below).
- On Windows, configuring SSH keys can give you some headaches. See the GitHub and Bitbucket documentation for troubleshooting.
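A minimal crontab entry for a daily run could look like this. The paths are placeholders; note the `cd`, since the script uses relative paths for the backup folder and keyfile:

```
# Run the backup every day at 03:00 (interpreter and paths are illustrative)
0 3 * * * cd /path/to/backup-script && /usr/bin/python3 backup_views_scheduled_queries.py
```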