Amazon Redshift Federated Queries: Rise Of Query Engines

Thomas Spicer
Published in Openbridge, Aug 3, 2020


Here come the SQL query engines!

A few years ago AWS added query services to Redshift under the “Spectrum” name. Spectrum enabled users to query an S3 data lake from within Redshift.

However, with the latest federated query updates, AWS is bringing Amazon Redshift in line with competing query services, not only from Google and Microsoft but from other AWS services too. For example, the new capabilities let users analyze data in an external system, such as a Postgres database, from within their Amazon Redshift cluster.

What are federated queries?

Facebook popularized the concept of distributed SQL query engines when it open-sourced PrestoDB back in 2013.

Over the past couple of years, AWS, Google, Microsoft, and many others in the industry have accelerated the adoption of a distributed query engine model within their products. For example, AWS developed Amazon Athena on top of the Presto code base.

Here is how PrestoDB describes what it allows users to do:

Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

Like PrestoDB and other query engine services, Amazon Redshift now supports federated queries that let customers query data across different databases, data warehouses, or data lakes. This follows earlier support for federated queries in Amazon Athena.

AWS Redshift Federated Query Use Cases

The use cases that applied to Redshift Spectrum still apply today. The primary difference is the expanded set of sources you can query. For example, a single query can span data in Amazon RDS for PostgreSQL, Amazon Redshift, and an Amazon S3 data lake or data lakehouse. This lets Redshift customers incorporate live data from remote systems such as PostgreSQL and Amazon Aurora into their existing Redshift data stack.

Federated querying also lets you apply lightweight transformations on the fly and load the results into target tables.
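To make the "transform on the fly and load" pattern concrete, here is a minimal sketch that runs locally. It uses Python's built-in sqlite3 module as a stand-in: the main database plays the role of Redshift and an attached database plays the role of the external PostgreSQL schema. The names pg_live, orders, and orders_reporting are hypothetical. In Redshift the same INSERT ... SELECT shape would read from a federated external schema instead.

```python
import sqlite3

# Local stand-in only: the main database plays the role of Redshift, and an
# attached in-memory database named pg_live plays the role of the external
# (federated) schema. All table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS pg_live")

# "Remote" operational table with raw values.
conn.execute(
    "CREATE TABLE pg_live.orders (order_id INTEGER, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO pg_live.orders VALUES (?, ?, ?)",
    [(1, 1250, "shipped"), (2, 900, "cancelled"), (3, 4300, "shipped")],
)

# "Local" reporting table that receives the transformed rows.
conn.execute("CREATE TABLE orders_reporting (order_id INTEGER, amount_usd REAL)")

# Lightweight transformation on the fly: keep shipped orders only and convert
# cents to dollars while loading into the target table.
LOAD_SQL = """
INSERT INTO orders_reporting (order_id, amount_usd)
SELECT order_id, amount_cents / 100.0
FROM pg_live.orders
WHERE status = 'shipped'
"""
conn.execute(LOAD_SQL)

rows = conn.execute(
    "SELECT order_id, amount_usd FROM orders_reporting ORDER BY order_id"
).fetchall()
print(rows)  # [(1, 12.5), (3, 43.0)]
```

The point of the sketch is that no separate extract step is needed: the filter and the unit conversion happen inside the same statement that loads the target table.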

This is good news for current Redshift users as this adds new features that keep the service competitive with other AWS offerings, PrestoDB, Google BigQuery Omni, and other SQL query engine services.

How do the Amazon Redshift Federated Queries work?

First, you will need to do some setup to configure the service. AWS offers a tutorial that shows you how to get started with Redshift federated queries using AWS CloudFormation.
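At the SQL level, the setup step comes down to registering the remote database as an external schema. The sketch below builds a CREATE EXTERNAL SCHEMA ... FROM POSTGRES statement as a string. The helper name and every value passed to it (schema names, host, role ARN, secret ARN) are placeholders invented for illustration, and you would still need to run the statement through your own Redshift client.

```python
def create_external_schema_sql(local_schema, remote_db, remote_schema,
                               host, iam_role_arn, secret_arn, port=5432):
    """Build a CREATE EXTERNAL SCHEMA statement for a PostgreSQL source.

    Hypothetical helper: values are interpolated as-is, so only pass trusted,
    hard-coded configuration here, never user input.
    """
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {local_schema}\n"
        f"FROM POSTGRES\n"
        f"DATABASE '{remote_db}' SCHEMA '{remote_schema}'\n"
        f"URI '{host}' PORT {port}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SECRET_ARN '{secret_arn}';"
    )


# Every identifier below is a placeholder, not a real resource.
statement = create_external_schema_sql(
    local_schema="pg_live",
    remote_db="appdb",
    remote_schema="public",
    host="pg.example.com",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
    secret_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:example-pg-creds",
)
print(statement)
```

Once a schema like this exists, tables in the remote database can be referenced from Redshift as pg_live.table_name. The database credentials themselves live in AWS Secrets Manager, which is why the statement carries a secret ARN rather than a password.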

From a technical perspective, Amazon Redshift includes a query optimizer that determines the most efficient way to execute a federated query. Redshift pushes a portion of the query down into the remote database to speed up query performance. This approach reduces the volume of data that has to move over the network.
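The effect of sending part of a query to the source can be illustrated with a toy example. This is not how the Redshift optimizer is implemented. It only shows, under simplified assumptions, why filtering at the source moves less data. An in-memory SQLite database stands in for the remote PostgreSQL database, and the events table and its contents are made up.

```python
import sqlite3

# Stand-in for the remote database, filled with hypothetical data:
# 1,000 events, of which every tenth one is from the US.
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE events (id INTEGER, country TEXT)")
remote.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, "US" if i % 10 == 0 else "DE") for i in range(1000)],
)


def fetch_without_pushdown(country):
    """Pull every row across the 'network', then filter locally."""
    transferred = remote.execute("SELECT id, country FROM events").fetchall()
    matches = [row for row in transferred if row[1] == country]
    return len(transferred), matches


def fetch_with_pushdown(country):
    """Send the filter to the remote database so only matching rows come back."""
    transferred = remote.execute(
        "SELECT id, country FROM events WHERE country = ?", (country,)
    ).fetchall()
    return len(transferred), transferred


moved_all, result_a = fetch_without_pushdown("US")
moved_filtered, result_b = fetch_with_pushdown("US")

# Same answer either way, but far fewer rows are transferred with pushdown.
print(moved_all, moved_filtered)  # 1000 100
```

Both functions return the same rows. The difference is how many rows had to leave the source to get there.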

Reducing network overhead is an important strategy given the performance constraints associated with large data sets. This is why Google BigQuery Omni actually runs part of the query engine directly within AWS or Azure.

If you are planning to query the contents of an AWS data lake, we suggest making sure you follow the best practices we detailed for Athena, which apply to Redshift as well.

Amazon Redshift Federated Queries Vs

Amazon Redshift Spectrum already let you query your AWS data lake. In a sense, Redshift has had a form of federated queries for some time. However, the scope was limited to an AWS data lake.

The new capabilities follow an industry trend toward query engines supporting diverse data stores for data ingestion. For example, Amazon Athena, which is based on PrestoDB, has supported the concept of a federated query engine for some time. PrestoDB was conceived by Facebook as a federated SQL query engine.

Support for a federated query engine model is a must-have, not a nice-to-have, feature for Redshift to remain relevant as a service.

Who should use the Redshift Federated Query Service?

The value proposition is targeted at existing Redshift users. If you are already using a different federated query engine service, such as Amazon Athena, there is no compelling reason to switch.

In a previous post, we discussed the Redshift Spectrum vs Athena use case.

On the plus side, Amazon Redshift and Amazon Athena can access the same AWS data lake. This means you can pilot Redshift by running queries against the same data lake used by Athena. Of course, this type of flexibility and efficiency assumes a properly architected data lake.

If you are a Redshift user, Amazon Redshift Federated Queries offer flexibility, especially when deciding if you need to scale or add capacity to the system. For example, you can save big dollars by adding a lifecycle process to move data out of Redshift to a data lake or by leaving data in place within RDS.

Why pay to store that data in Redshift when storing data in a lake or querying data in place is possible? As a result, these new Redshift query capabilities can give users more technical options and cost optimization opportunities. For example, you can minimize the need to scale Redshift with a new node, which can be an expensive proposition.
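One way such a lifecycle step could look is an UNLOAD of older rows from Redshift to S3 in Parquet format, after which the data can be queried from the lake. The sketch below only builds the statement text. The helper name, table, date cutoff, bucket, and role ARN are all hypothetical, and this is one possible approach rather than a prescribed process.

```python
def unload_to_lake_sql(select_sql, s3_prefix, iam_role_arn):
    """Build a Redshift UNLOAD statement that writes query results to S3 as Parquet.

    Hypothetical helper: single quotes inside the SELECT are doubled because
    UNLOAD takes the query as a quoted string.
    """
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS PARQUET;"
    )


# Placeholder table, cutoff date, bucket, and role; none of these are real.
archive_sql = unload_to_lake_sql(
    select_sql="SELECT * FROM sales WHERE sale_date < '2019-01-01'",
    s3_prefix="s3://example-data-lake/sales/archive_",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
)
print(archive_sql)
```

After the unload is verified, the archived rows could be deleted from the Redshift table, which is where the storage savings would come from.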

Getting Started With Amazon Redshift Federated Queries

A well-architected data lake will ensure your Redshift federated queries run quickly and incur minimal costs. The Openbridge zero administration data lake service is a perfect pairing for Redshift Federated Queries. Push data from supported data sources, and our service automatically handles the data ingestion to a Redshift-supported AWS data lake.

Want to discuss Redshift federated querying or data lakes for your organization? Need a platform and team of experts to kickstart your data and analytics efforts? We can help! Getting traction with new technologies can be hard, especially when it means your team is working in different and unfamiliar ways. This is especially true in a self-service-only world. If you want to discuss a proof-of-concept, pilot, project, or any other effort, the Openbridge platform and team of data experts are ready to help.

Reach out to us at hello@openbridge.com. Prefer to talk to someone? Set up a call with our team of data experts.
