Google Cloud Workflows: Export data from BigQuery to Cloud Storage
For a lot of data projects, Apache Airflow is the orchestration tool of choice. However, Airflow can be a bit overkill for smaller projects and tasks. One alternative is Google Workflows, where you can create orchestration flows using simple .yaml files. In this tutorial I will show you how to export data from BigQuery to a Google Cloud Storage bucket using Google Workflows.
Contents:
- Introduction Google Workflows
- Workflow example: Exporting Google BigQuery data to Google Cloud Storage as a CSV (two methods).
Why Google Workflows?
Google Workflows is a lightweight and serverless workflow orchestration service where you can connect Google Cloud and external services (using APIs). As mentioned, Apache Airflow is used in a lot of projects, but it can be a bit overkill since you have to maintain the Airflow cluster (or pay to maintain it). It can also have some overlap with other tools like dbt.
Triggered by a post from Robert Sahlin on modern data workflow orchestration and The Unbundling of Airflow by Gorkem Yurtseven, I decided to give Google Workflows a try for some workflows, mainly related to Google Cloud. Currently, I'm using Workflows successfully in some smaller projects.
Google Workflows basics
Basically, Google Workflows uses a .yaml file in which you build your workflow. The file contains steps that are executed in order. You can upload this .yaml file and Google will create a visual representation of your flow. Flows can be scheduled using Cloud Scheduler, executed manually, or triggered via an API. Logs are also collected.
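To give an idea of the structure, here is a minimal made-up workflow with three steps: assign a variable, write it to Cloud Logging with the built-in sys.log call, and return it as the workflow result. The step names and message are only for illustration.

- init:
    assign:
      - message: "Hello from Workflows"

# Write the message to Cloud Logging
- log-message:
    call: sys.log
    args:
      text: ${message}

# Return the message as the workflow result
- finish:
    return: ${message}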
There are already a bunch of connectors / APIs available and the list is growing. For an overview, check out the official connector list. Make sure you also check out the introduction by Google.
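For example, calling an external service is a single step with the built-in http.get call (the URL below is only a placeholder), while Google Cloud services have dedicated connector calls like the BigQuery ones used later in this tutorial.

- call-external-api:
    call: http.get
    args:
      url: https://example.com/api/status
    result: api_response

# Return the response body of the external call
- return-status:
    return: ${api_response.body}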
As of writing, Google Workflows is fairly new, so example code and documentation are somewhat limited.
Exporting data from Google BigQuery to a Cloud Storage bucket is an operation that appears in a lot of the pipelines I have developed, so I'm using it to build an example workflow. There are two methods:
Method 1: BigQuery to Cloud Storage (intermediate table)
The first method creates an intermediate table from your query and exports that table as a CSV to Google Cloud Storage. Adding this step enables you to export large amounts of data into a single file (up to a maximum of 1 GB per file). If you have a limited amount of data, or it doesn't matter that the data is always exported as multiple CSV files / file parts, you can also use method 2.
Workflow steps
- Run a query and save the result in an intermediate table
- Export the intermediate table to a CSV
Workflow .yaml code
- init:
    assign:
      - project_id: "<your-project-id>"
      - dataset_export: "<your-bigquery-dataset>"
      - table_export: "<your-bigquery-table>"
      - query: >
          select * from <your-table>
      - bucket: "<your-cloud-bucket>"
      - folder: "<your-export-folder-within-cloud-bucket>"
      - filename: ${string(sys.now()) + "-<your-filename>.csv"}

# Run the query and write the result to an intermediate table
- bigquery-create-export-table:
    call: googleapis.bigquery.v2.jobs.insert
    args:
      projectId: ${project_id}
      body:
        configuration:
          query:
            query: ${query}
            destinationTable:
              projectId: ${project_id}
              datasetId: ${dataset_export}
              tableId: ${table_export}
            createDisposition: "CREATE_IF_NEEDED"
            writeDisposition: "WRITE_TRUNCATE"
            allowLargeResults: true
            useLegacySql: false

# Export the intermediate table to a CSV file in the Cloud Storage bucket
- bigquery-table-to-gcs:
    call: googleapis.bigquery.v2.jobs.insert
    args:
      projectId: ${project_id}
      body:
        configuration:
          extract:
            compression: NONE
            destinationFormat: "CSV"
            destinationUris: ['${"gs://" + bucket + "/" + folder + "/" + filename}']
            fieldDelimiter: ","
            printHeader: true
            sourceTable:
              projectId: ${project_id}
              datasetId: ${dataset_export}
              tableId: ${table_export}
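If you don't want the intermediate table to linger after the export, you could append a cleanup step. This is only a sketch, assuming the BigQuery connector exposes the tables.delete method and reusing the variables from the init step.

# Optional: remove the intermediate table after the export has finished
- bigquery-delete-export-table:
    call: googleapis.bigquery.v2.tables.delete
    args:
      projectId: ${project_id}
      datasetId: ${dataset_export}
      tableId: ${table_export}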
Method 2: BigQuery to Cloud Storage (direct)
This method will export a query result directly to a CSV. However, it will create multiple file parts in most situations.
Workflow .yaml code
- init:
    assign:
      - project_id: "<your-project-id>"
      - query: >
          select * from <your-table>
      - bucket: "<your-cloud-bucket>"
      - folder: "<your-export-folder-within-cloud-bucket>"
      - filename: "<your-filename>"

# Run an EXPORT DATA statement that writes the query result straight to the bucket
- bigquery-table-to-gcs:
    call: googleapis.bigquery.v2.jobs.query
    args:
      projectId: ${project_id}
      body:
        query: ${"EXPORT DATA OPTIONS( uri='gs://" + bucket + "/" + folder + "/'||current_date()||'-" + filename + "-part-*.csv', format='CSV', overwrite=true, header=true) AS " + query}
        useLegacySql: false
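BigQuery jobs can fail transiently (quota, network), so it can help to wrap the export step in a try/retry block. A sketch with example retry settings, reusing the same EXPORT DATA statement as above:

- bigquery-table-to-gcs:
    try:
      call: googleapis.bigquery.v2.jobs.query
      args:
        projectId: ${project_id}
        body:
          # same EXPORT DATA statement as in the step above
          query: ${"EXPORT DATA OPTIONS( uri='gs://" + bucket + "/" + folder + "/'||current_date()||'-" + filename + "-part-*.csv', format='CSV', overwrite=true, header=true) AS " + query}
          useLegacySql: false
    retry:
      predicate: ${http.default_retry_predicate}
      max_retries: 3
      backoff:
        initial_delay: 10
        max_delay: 60
        multiplier: 2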
Closing thoughts
Google Workflows is an excellent service for small projects and tasks, but it can also be used in larger projects. You can create workflows using the interface, the CLI / API, or Terraform, which makes it more manageable in larger projects.
There are some caveats, though. Think of pipelines that need to be able to perform backfills: by default, this is harder to achieve within Workflows. Or more complex flows that can't easily be written in a .yaml configuration file. In those cases, Airflow or Prefect are also good alternatives.
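One way to soften the backfill limitation is to accept runtime arguments, so you can pass a date per execution and re-run the workflow for historical dates. A minimal sketch, assuming a hypothetical export_date argument and event_date column; the sys.log step is only a placeholder for the export steps shown earlier.

main:
  params: [args]
  steps:
    - init:
        assign:
          # export_date is passed per execution, e.g. {"export_date": "2024-01-15"}
          - export_date: ${args.export_date}
          - query: ${"select * from <your-table> where event_date = '" + export_date + "'"}
    - log-query:
        # placeholder: plug ${query} into the export steps from method 1 or 2
        call: sys.log
        args:
          text: ${query}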