Optimizing BigQuery Table Partitions

Thomas Spicer
Openbridge
Jun 30, 2021


Different partitioning strategies within BigQuery have various pros and cons. A typical strategy is to partition by DAY on the field _PARTITIONTIME. For context, _PARTITIONTIME is a pseudo column that contains the UTC day on which the partitioned data was loaded. In many cases, this is a good default. However, there may be cases where you want to optimize a table beyond Google's defaults.
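
For example, querying a day-partitioned table with a filter on _PARTITIONTIME limits the scan to a single day's partition. A minimal sketch, using a hypothetical table name:

-- Scan only the partition loaded on 2021-06-01 (table name is hypothetical)
SELECT *
FROM `my-google-project.mws.example_ingestion_partitioned_table`
WHERE _PARTITIONTIME = TIMESTAMP("2021-06-01")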

What Tables Should You Optimize?

Like most things, it depends. For example, if you have a high volume of loads for a common data source, using a different strategy can lower query costs. Be aware that some partitioning strategies, while improving costs, may increase the complexity of your queries.

If you want to explore which tables are candidates for optimization based on the volume of data and number of rows, you can run this query in BigQuery:

DECLARE dataset_names ARRAY<STRING>;
DECLARE batch ARRAY<STRING>;
DECLARE batch_size INT64 DEFAULT 25;

CREATE TEMP TABLE results (
  project_id STRING,
  dataset_id STRING,
  table_id STRING,
  row_count INT64,
  size_bytes INT64
);

SET dataset_names = (
  SELECT ARRAY_AGG(SCHEMA_NAME)
  FROM `region-us.INFORMATION_SCHEMA.SCHEMATA`
);

LOOP
  IF ARRAY_LENGTH(dataset_names) < 1 THEN
    LEAVE;
  END IF;

  SET batch = (
    SELECT ARRAY_AGG(d)
    FROM UNNEST(dataset_names) AS d WITH OFFSET i
    WHERE i < batch_size
  );

  EXECUTE IMMEDIATE (
    SELECT """INSERT INTO results """ || STRING_AGG(
      (SELECT """SELECT project_id, dataset_id, table_id, row_count, size_bytes FROM `""" || s || """.__TABLES__`"""),
      " UNION ALL "
    )
    FROM UNNEST(batch) AS s
  );

  SET dataset_names = (
    SELECT ARRAY_AGG(d)
    FROM UNNEST(dataset_names) AS d
    WHERE d NOT IN (SELECT * FROM UNNEST(batch))
  );
END LOOP;

SELECT * FROM results;

The results will list the row count and size of each table across your datasets.

There is no universal approach for all use cases. In this post, we detail how to migrate from one type of partitioned table to another.

Before proceeding, please review the documentation from Google on the topic: https://cloud.google.com/bigquery/docs/partitioned-tables

Convert Partitioning On An Existing BigQuery Table

In our example, we will show you how to convert partitioning from _PARTITIONTIME to a different field, ob_date. We will use an Amazon Advertising table called amzadvertising_sp_productads_v5 for reference.

Step 1. Back-Up Your SOURCE Table

The first step is to make sure you back up your table. If something goes wrong, having a backup will be critical. We suggest you place the file(s) in an organized, easy-to-understand structure within Google Cloud Storage. For example, here is a structure:

  • gs://<bucket-name>/<parent-directory>/<child-directory>/<file-name>

Using the data source name usually makes sense for these. For example, you create a bucket called “openbridge-exports.” Next, you define a parent directory called amazonadvertising. The child directory would be the specific table name for the source, so use amzadvertising_sp_productads_v5. Lastly, the file name should be a combination of the child directory name and the export date.

  • gs://openbridge-exports/amazonadvertising/amzadvertising_sp_productads_v5/amzadvertising_sp_productads_v5_06012021

Once you have your structure planned out, you are ready to back up your table:

  • Go to the BigQuery WebUI.
  • Select the amzadvertising_sp_productads_v5 table for export.
  • Click on Export Table in the top-right.
  • Select the Export format (CSV) and Compression (GZIP).
  • Adjust the Google Cloud Storage path to match the bucket, directories, and file name you want to use.
  • Click OK and wait for the job to complete.

Once you verify the export is completed, proceed to Step 2.
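
If you prefer to script the backup rather than use the WebUI, BigQuery's EXPORT DATA statement can write the same compressed CSV files to Cloud Storage. A minimal sketch, assuming the example project, dataset, and bucket layout above (the URI must include a * wildcard so BigQuery can shard the output):

-- Back up the SOURCE table as GZIP-compressed CSV files in Cloud Storage
EXPORT DATA OPTIONS (
  uri = 'gs://openbridge-exports/amazonadvertising/amzadvertising_sp_productads_v5/amzadvertising_sp_productads_v5_06012021-*.csv.gz',
  format = 'CSV',
  compression = 'GZIP',
  overwrite = true
) AS
SELECT * FROM `my-google-project.mws.amzadvertising_sp_productads_v5`;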

Step 2: Create Temp Table Based On SOURCE

After we have data safely backed up to Cloud Storage, we need to create an empty TEMP table reflecting the table you want to convert.

The TEMP table should follow the same schema definition as your SOURCE table. For amzadvertising_sp_productads_v5, the schema and create table definition would be as follows.

The BigQuery UI allows you to inspect the schema, so you have access to all the columns and data types needed. Just follow the formatting for each in our example CREATE TABLE statement:

CREATE TABLE `my-google-project.mws.amzadvertising_sp_productads_TEMP01` (
  ad_group_id INT64,
  ad_group_name STRING,
  ad_id INT64,
  asin STRING,
  attributed_conversions14d INT64,
  attributed_conversions14d_same_sku INT64,
  attributed_conversions1d INT64,
  attributed_conversions1d_same_sku INT64,
  attributed_conversions30d INT64,
  attributed_conversions30d_same_sku INT64,
  attributed_conversions7d INT64,
  attributed_conversions7d_same_sku INT64,
  attributed_sales14d FLOAT64,
  attributed_sales14d_same_sku FLOAT64,
  attributed_sales1d FLOAT64,
  attributed_sales1d_same_sku FLOAT64,
  attributed_sales30d FLOAT64,
  attributed_sales30d_same_sku FLOAT64,
  attributed_sales7d FLOAT64,
  attributed_sales7d_same_sku FLOAT64,
  attributed_units_ordered14d INT64,
  attributed_units_ordered14d_same_sku INT64,
  attributed_units_ordered1d INT64,
  attributed_units_ordered1d_same_sku INT64,
  attributed_units_ordered30d INT64,
  attributed_units_ordered30d_same_sku INT64,
  attributed_units_ordered7d INT64,
  attributed_units_ordered7d_same_sku INT64,
  campaign_id INT64,
  campaign_name STRING,
  clicks INT64,
  cost FLOAT64,
  currency STRING,
  impressions INT64,
  profile_id INT64,
  sku STRING,
  ob_date DATE,
  ob_transaction_id STRING,
  ob_file_name STRING,
  ob_processed_at STRING,
  ob_modified_date DATETIME
)
PARTITION BY ob_date;

Notice we are using ob_date as the PARTITION BY field. BigQuery should confirm the table was created, and you should see it in your tables list.
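
If you want to double-check which column the new table is partitioned on, one option is to query INFORMATION_SCHEMA.COLUMNS for the dataset (dataset name taken from the example above):

-- Should return ob_date for the TEMP table
SELECT column_name, is_partitioning_column
FROM `my-google-project.mws.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'amzadvertising_sp_productads_TEMP01'
  AND is_partitioning_column = 'YES';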

Step 3: Load TEMP Table From SOURCE

Next, we want to load all the data from SOURCE into the TEMP. To do this, we need to SELECT everything and set the TEMP as the target.

SELECT * FROM `my-google-project.mws.amzadvertising_sp_productads_v5`

Don’t run it yet! Go to Query Settings as you need to set a target destination for the results (“Set a destination table for query results”).

Follow the process they define: https://cloud.google.com/bigquery/docs/writing-results#writing_large_results_using_legacy_sql
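
Alternatively, since the TEMP table already exists with a matching schema, you can skip the destination-table settings and load it with a DML statement. A sketch using the example table names, assuming the column order in TEMP matches SOURCE:

-- Load every row from SOURCE into the ob_date-partitioned TEMP table
INSERT INTO `my-google-project.mws.amzadvertising_sp_productads_TEMP01`
SELECT * FROM `my-google-project.mws.amzadvertising_sp_productads_v5`;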

Once complete, the process will load everything from the SOURCE table to TEMP. Spot-check the table details to make sure the size of SOURCE and TEMP are the same. Also, it would be best if you ran some test queries to confirm parity between TEMP and SOURCE.
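
A simple parity check is to compare row counts between the two tables (a sketch using the example table names):

-- source_rows and temp_rows should match before you proceed
SELECT
  (SELECT COUNT(*) FROM `my-google-project.mws.amzadvertising_sp_productads_v5`) AS source_rows,
  (SELECT COUNT(*) FROM `my-google-project.mws.amzadvertising_sp_productads_TEMP01`) AS temp_rows;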

Step 4: Delete the Source Table

Did you confirm TEMP and SOURCE are the same? Good. BigQuery does not allow you to rename a table, so we need to delete SOURCE so we can recreate it based on TEMP. See this doc for more info: https://cloud.google.com/bigquery/docs/managing-tables
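
You can delete the table from the BigQuery UI, or with a DROP TABLE statement (using the example table name — only run this after the backup and the TEMP checks above):

-- Remove the old ingestion-time partitioned SOURCE table
DROP TABLE `my-google-project.mws.amzadvertising_sp_productads_v5`;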

Step 5: Copy TEMP into New SOURCE

In BigQuery UI, expand your project and dataset, then select the newly created amzadvertising_sp_productads_TEMP01 table, which holds all your data.

In the details panel, click Copy table. In the Copy table dialog, under Destination, set the Project name and Dataset name. You want to make sure that both TEMP and SOURCE are in the same project and dataset. It will not work otherwise.

For the Table name, enter a name for the new table. In this case, we will want to use the original table name we deleted: amzadvertising_sp_productads_v5.

All set? Click Copy to start the copy job.
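
If you prefer SQL over the Copy dialog, one alternative sketch is a CREATE TABLE ... AS SELECT that recreates SOURCE from TEMP with the same partitioning (using the example table names):

-- Recreate the SOURCE table, partitioned by ob_date, from the TEMP copy
CREATE TABLE `my-google-project.mws.amzadvertising_sp_productads_v5`
PARTITION BY ob_date
AS
SELECT * FROM `my-google-project.mws.amzadvertising_sp_productads_TEMP01`;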

Confirm that your new amzadvertising_sp_productads_v5 matches your TEMP table amzadvertising_sp_productads_TEMP01. Run the same checks as Step 3 to verify everything looks good. Feel free to delete TEMP when you are confident everything is aligned.

That is it! Over the next few days, you can check billing to see if there are cost improvements using the new PARTITION strategy. If you are not seeing the expected impact, try a different strategy and observe the results.

References

Here is a query to check the query compute usage by the user:

SELECT
  COUNT(*) TOTAL_QUERIES,
  SUM(total_slot_ms/TIMESTAMP_DIFF(end_time,creation_time,MILLISECOND)) AVG_SLOT_USAGE,
  SUM(TIMESTAMP_DIFF(end_time,creation_time,SECOND)) TOTAL_DURATION_IN_SECONDS,
  AVG(TIMESTAMP_DIFF(end_time,creation_time,SECOND)) AVG_DURATION_IN_SECONDS,
  SUM(total_bytes_processed*1e-12) TOTAL_PROCESSED_TB,
  EXTRACT(DATE FROM creation_time) AS EXECUTION_DATE,
  user_email AS USER
FROM `XXXXXX.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE state = 'DONE'
  AND statement_type = 'SELECT'
  AND creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY) AND CURRENT_TIMESTAMP()
GROUP BY EXECUTION_DATE, USER
ORDER BY EXECUTION_DATE, AVG_SLOT_USAGE

Here is a query to determine table size:

SELECT TABLE_ID, SUM(SIZE_BYTES)/(1024*1024*1024) AS SIZE_GB
FROM XXXXXXX.OPENBRIDGE_DATASET.__TABLES__
GROUP BY TABLE_ID ORDER BY SIZE_GB DESC;

About Openbridge: Make data work for you

Unify data in a trusted, private industry-leading data lake or cloud warehouses like Amazon Redshift, Amazon Redshift Spectrum, Google BigQuery, Snowflake, Azure Data Lake, Ahana, and Amazon Athena. Data is always fully owned by you.

Take control, and put your data to work with your favorite analytics tools. Explore, analyze, and visualize data to deliver faster innovation while avoiding vendor lock-in using tools like Google Data Studio, Tableau, Microsoft Power BI, Looker, Amazon QuickSight, SAP, Alteryx, dbt, Azure Data Factory, Qlik Sense, and many others.

Get a 30-day free trial so you can try Openbridge for yourself.
