BigQuery Omni: Distributed Query Engine Comes To Google Cloud

BigQuery’s distributed query engine extends support for multi-cloud data lakes

Thomas Spicer
5 min read · Jul 30, 2020

Google Cloud released BigQuery Omni, a service that provides a federated query engine for executing standard SQL queries against data stored in AWS and Microsoft Azure data lakes.

Unlike query engine offerings from AWS or Azure, the BigQuery Omni application platform fully embraces multi-cloud with a serverless query engine that can execute SQL across different cloud data lakes.

According to the announcement at Google Cloud Next ’20, the first release started as a private alpha for Amazon Web Services, while a subsequent version will support Microsoft Azure.

BigQuery Omni, Anthos & Dremel

What makes Omni different from other solutions? Rather than run the full stack of BigQuery cloud services in GCP, Google “localized” the Dremel engine compute resources to the AWS or Azure cloud platform that holds the data. Taking this approach solves a few technical hurdles and also avoids some significant costs for BigQuery Omni customers.

For example, suppose Google required data resident in AWS S3 to be moved temporarily to Google Cloud Storage for Omni to work. The transfer would be necessary so that BigQuery could execute queries within GCP. As a result, your data would leave AWS, which charges for outbound network traffic. In this example, the first 10 TB per month costs $0.15 per gigabyte.

Let’s assume you have 1 TB of CSV files on AWS S3. An Omni SQL query would need to copy 1 TB of data from AWS to Google. At roughly 1,000 GB multiplied by $0.15 per gigabyte, the transfer cost for this one query would be close to $150. Ouch! If moving data like this were required, it would not be a viable multi-cloud analytics solution. However, Google localizes a query to AWS or Azure, which eliminates the need for this type of data transfer.
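The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: the $0.15 per gigabyte rate and the 10 TB tier are the figures quoted above, not a current price quote, and the use of decimal terabytes (1 TB = 1,000 GB) and the function name `egress_cost_usd` are assumptions made for the example.

```python
# Egress-cost arithmetic for the hypothetical "copy the data to GCP" scenario.
# Rate and tier size are the figures quoted in the article, not live AWS pricing.
EGRESS_RATE_USD_PER_GB = 0.15
FIRST_TIER_LIMIT_TB = 10
GB_PER_TB = 1000  # assumption: decimal terabytes


def egress_cost_usd(terabytes):
    """Cost of copying `terabytes` out of the source cloud once, within the first tier."""
    if terabytes < 0 or terabytes > FIRST_TIER_LIMIT_TB:
        # The article only gives a rate for the first 10 TB per month.
        raise ValueError("only the first-tier rate is known for this sketch")
    return terabytes * GB_PER_TB * EGRESS_RATE_USD_PER_GB


cost = egress_cost_usd(1)
print(f"Copying 1 TB out of AWS once: ${cost:,.2f}")
```

Because the charge applies every time the data crosses the cloud boundary, a query that has to copy its input first pays this cost on each run, which is the penalty a localized query engine avoids.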

By running in containers in a Google-managed AWS or Azure environment, Omni avoids data transfer costs by placing compute resources in the same cloud and region where the data resides. As a result, efforts to analyze data do not incur significant cost penalties.

The idea of multi-cloud business insights using the Omni query engine becomes viable.

BigQuery Omni vs

Industry-wide, the move to distributed query engines is gaining steam. Amazon Athena, Amazon Redshift, PrestoDB, and others support this model. Google BigQuery supported this pattern before Omni as well, except that it could only query data resident within GCP.

However, what is novel about Google’s approach is not just the distribution of a query but where the compute resources that execute those queries reside. Truly distributed queries that use compute resources native to each cloud are a unique offering. Assuming Omni works as advertised, this will extend the reach of current BigQuery customers that operate in AWS and Azure.

One caveat: while Google supports a federated compute model, it does not change the need to have your data lake contents optimized for Omni. If you are running Omni against unoptimized data lakes, the performance and cost implications are significant. This is true for any query service, not just Omni.

BigQuery Omni Opportunities & Challenges

Based on Google Anthos and Dremel, Omni supports Avro, CSV, JSON, ORC, and Parquet. Google says there is no need to format or transform your data to run hybrid and multi-cloud queries against AWS or Azure from Omni; this is marketing, not technical advice.

It is not uncommon for Google, AWS, and Azure to promote a conceptual ease-of-use model while downplaying the realities associated with it.

If you are attempting to use Omni to query data objects in an unoptimized AWS or Azure environment, performance and cost will become a significant concern.

Vendor posts, guides, documentation, and tweets describe setting up a data lake for query engines as creating an AWS S3 bucket, some paths, and then dropping files in. Voilà, you have a data lake ready for the query service to get to work! Not really.

Most Omni users will learn the same hard lessons Athena, Spectrum, and Presto users learned: distributed query engines are only as good as the data lakes they query.

BigQuery Omni Best Practices

Optimizing and automating the configuration, processing, and loading of data to your private Azure or Amazon data lake is critical for Omni to operate efficiently.

Here are key considerations for Omni optimization when using your AWS or Azure data lake:

  • Automatic partitioning of data: With data partitioning, you minimize the amount of data scanned by each Omni query, thus improving performance and reducing the cost of querying data stored in Azure or AWS S3.
  • Automatic conversion to Apache Parquet: Convert data into Apache Parquet, an efficient, optimized, open-source columnar format. This lowers costs when you execute queries, as Parquet’s columnar format is highly optimized for interactive query services like Omni.
  • Automatic data compression: With data in Apache Parquet, compression is performed column by column using Google Snappy. This not only supports query optimizations but also reduces the size of the data stored in your Azure Data Lake Storage or Amazon S3 bucket, which reduces costs.
  • Automated data catalogs, database, table, and view creation: As upstream data changes, the use of a data catalog can ensure that changes in your data lake automatically version tables and views within Omni. Data is analyzed, and the system is “trained” to infer schemas to automate the creation of databases, views, and tables in Omni.
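To make the partitioning point concrete, here is a minimal Python sketch of partition pruning. It is an illustration under stated assumptions rather than a description of how Omni is implemented: the sample rows, the field names, and the Hive-style `order_date=YYYY-MM-DD` key layout (a common convention among data lake query engines) are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows; the field names are illustrative only.
ROWS = [
    {"order_id": 1, "order_date": date(2020, 7, 28), "amount": 20.0},
    {"order_id": 2, "order_date": date(2020, 7, 28), "amount": 35.5},
    {"order_id": 3, "order_date": date(2020, 7, 29), "amount": 12.0},
    {"order_id": 4, "order_date": date(2020, 7, 29), "amount": 8.25},
    {"order_id": 5, "order_date": date(2020, 7, 30), "amount": 99.0},
    {"order_id": 6, "order_date": date(2020, 7, 30), "amount": 41.0},
]


def partition_key(day):
    """Hive-style partition path segment, e.g. 'order_date=2020-07-30'."""
    return f"order_date={day.isoformat()}"


def partition_rows(rows):
    """Group rows by day, as a partitioned data lake would group files by prefix."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row["order_date"])].append(row)
    return dict(partitions)


def full_scan(rows, day):
    """Unpartitioned data: every row is read, then filtered."""
    matches = [row for row in rows if row["order_date"] == day]
    return matches, len(rows)


def pruned_scan(partitions, day):
    """Partitioned data: only the partition for the requested day is read."""
    matches = partitions.get(partition_key(day), [])
    return matches, len(matches)


partitions = partition_rows(ROWS)
target = date(2020, 7, 30)
unpartitioned_result, unpartitioned_scanned = full_scan(ROWS, target)
partitioned_result, partitioned_scanned = pruned_scan(partitions, target)

print(sorted(partitions))
print(f"rows scanned without partitions: {unpartitioned_scanned}")
print(f"rows scanned with partitions: {partitioned_scanned}")
```

Both scans return the same rows, but the pruned scan reads only the partition it needs. In a query service that bills or slows down in proportion to data scanned, that difference is where the performance and cost savings come from.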

Getting Started With BigQuery Omni & Data Lakes

Does a data lake have to be complex to set up for Omni? No! With a data lake formation process, you can get up and running more quickly.

It has never been easier to take advantage of an “analytics-ready” data lake with a serverless query service like Omni.

The Openbridge data lake service automates the configuration, processing, and loading of data to Google BigQuery, so users can return query results quickly and cost-effectively.

With our zero-administration data lake service, you push data from supported data sources, and our service automatically loads it into BigQuery Omni.

Want to get started with Omni and data lakes? Sign up for a 14-day no cost trial!
