Amazon Redshift Federated Queries: Rise Of Query Engines

Published in Openbridge · 6 min read · Aug 3, 2020

Here come the SQL query engines!

A few years ago AWS added query services to Redshift under the “Spectrum” name. Spectrum enabled users to query an S3 data lake from within Redshift.

However, with the latest federated query updates, AWS is bringing Amazon Redshift in line with competing query services, not only from Google and Microsoft but from other AWS services too. For example, the new capabilities allow users to analyze data in an external system, like a Postgres database, from within their Amazon Redshift cluster.

What are federated queries?

Facebook popularized the concept of distributed SQL query engines when it open-sourced PrestoDB back in 2013.

Over the past couple of years, AWS, Google, Microsoft, and many others in the industry have accelerated the adoption of a distributed query engine model within their products. For example, AWS developed Amazon Athena on top of the Presto code base.

Here is how PrestoDB describes what it allows users to do:

Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

Like PrestoDB and other query engine services, Amazon Redshift now supports federated queries that let customers query data across different databases, data warehouses, or data lakes. This follows previous support for federated queries in AWS Athena:

AWS Data Lake And Amazon Athena Federated Queries

Supercharge your AWS data lake architecture with new query service extensions

blog.openbridge.com

AWS Redshift Federated Query Use Cases

The use cases that applied to Redshift Spectrum still apply today. The primary difference is the expansion of sources you can query. For example, you can run a single query across data in Amazon RDS for PostgreSQL, Amazon Redshift, and an AWS S3 data lake or data lakehouse. This allows Redshift customers to incorporate live data from remote systems like PostgreSQL and Amazon Aurora into their existing Redshift data stack.
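To make the multi-source idea concrete, here is a minimal sketch in Python that assembles one such query. It assumes a federated external schema named pg_live (PostgreSQL) and a Spectrum external schema named lake (S3) already exist. Those schema names, the tables, and the columns are all hypothetical and only serve the illustration.

```python
def build_cross_source_query(local_table: str, federated_table: str, lake_table: str) -> str:
    """Return one SELECT that joins three sources from inside Redshift:
    a local Redshift table, a live PostgreSQL table reached through a
    federated external schema, and an S3 data lake table reached through
    a Spectrum external schema. All table and column names are hypothetical."""
    return (
        "SELECT c.customer_id, c.segment, o.order_status, COUNT(e.event_id) AS events\n"
        f"FROM {local_table} AS c\n"
        f"JOIN {federated_table} AS o ON o.customer_id = c.customer_id\n"
        f"LEFT JOIN {lake_table} AS e ON e.customer_id = c.customer_id\n"
        "GROUP BY c.customer_id, c.segment, o.order_status;"
    )


# Hypothetical names: public.* is local, pg_live.* is federated, lake.* is Spectrum.
cross_source_sql = build_cross_source_query(
    "public.customers", "pg_live.orders", "lake.click_events"
)
print(cross_source_sql)
```

From the point of view of the person writing the SQL, the three sources look like ordinary schemas. Where each table physically lives is decided by how its schema was defined.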

Federated querying also allows you to apply lightweight transformations on the fly and load the results into target tables.
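A transform-and-load step of this kind could be sketched as a single INSERT ... SELECT that reads from the federated schema. The Python helper below only builds the statement. The target table, the federated table, and the columns (total_cents, created_at) are assumptions made up for the example.

```python
from datetime import date


def build_transform_and_load_sql(target_table: str, federated_table: str, since: date) -> str:
    """Build an INSERT ... SELECT that reads live rows from a federated table,
    applies two light transformations (cents to dollars, timestamp to date),
    and loads the result into a local Redshift table.
    Table and column names are hypothetical."""
    return (
        f"INSERT INTO {target_table} (order_id, customer_id, order_total_usd, order_date)\n"
        "SELECT order_id,\n"
        "       customer_id,\n"
        "       ROUND(total_cents / 100.0, 2) AS order_total_usd,\n"
        "       CAST(created_at AS DATE) AS order_date\n"
        f"FROM {federated_table}\n"
        f"WHERE created_at >= '{since.isoformat()}';"
    )


load_sql = build_transform_and_load_sql(
    "analytics.orders_daily", "pg_live.orders", date(2020, 8, 1)
)
print(load_sql)
```

Passing the cutoff as a date object rather than a raw string keeps the literal in a predictable YYYY-MM-DD form.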

This is good news for current Redshift users, as the new features keep the service competitive with other AWS offerings, PrestoDB, Google BigQuery Omni, and other SQL query engine services.

How do Amazon Redshift Federated Queries work?

First, you will need to do some setup to configure the service. AWS offers a tutorial that shows you how to get started with Redshift federated queries using AWS CloudFormation.
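Whichever way you provision the infrastructure, the Redshift side of the setup comes down to an external schema that points at the PostgreSQL database. Below is a minimal Python sketch that builds the CREATE EXTERNAL SCHEMA ... FROM POSTGRES statement. The host name, the ARNs, and the schema names are placeholders we made up, and the IAM role is assumed to be allowed to read the database credentials from AWS Secrets Manager.

```python
def build_external_schema_sql(
    schema_name: str,
    pg_database: str,
    pg_schema: str,
    host: str,
    iam_role_arn: str,
    secret_arn: str,
    port: int = 5432,
) -> str:
    """Build the statement that exposes a remote PostgreSQL schema inside
    Redshift under schema_name. Credentials are not embedded; Redshift
    reads them from the Secrets Manager secret referenced by secret_arn."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema_name}\n"
        "FROM POSTGRES\n"
        f"DATABASE '{pg_database}' SCHEMA '{pg_schema}'\n"
        f"URI '{host}' PORT {port}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SECRET_ARN '{secret_arn}';"
    )


# Every value below is a placeholder, not a real endpoint or ARN.
schema_sql = build_external_schema_sql(
    schema_name="pg_live",
    pg_database="appdb",
    pg_schema="public",
    host="example-pg.abc123.us-east-1.rds.amazonaws.com",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-federated",
    secret_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:example-pg-creds",
)
print(schema_sql)
```

Once a schema like this exists, tables in the remote database can be referenced as pg_live.table_name in ordinary Redshift SQL.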

From a technical perspective, Amazon Redshift includes a query optimizer that determines the most efficient way to execute a federated query. Redshift pushes a portion of the query down into the remote database to speed up query performance. This approach reduces the volume of data that has to move over the network.
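The effect of pushing work to the remote database can be shown with a toy simulation. This is not how Redshift is implemented. It uses an in-memory SQLite database as a stand-in for the remote PostgreSQL system and simply counts how many rows would cross the "network" with and without the filter being pushed down. The table and its contents are invented for the example.

```python
import sqlite3


def make_remote() -> sqlite3.Connection:
    """Create a stand-in 'remote' database with 1,000 orders, 100 of them open."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT)")
    rows = [(i, "open" if i % 10 == 0 else "closed") for i in range(1000)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


def fetch_without_pushdown(remote: sqlite3.Connection):
    """Pull every row across, then filter on the local side."""
    all_rows = remote.execute("SELECT order_id, status FROM orders").fetchall()
    matching = [row for row in all_rows if row[1] == "open"]
    return len(all_rows), matching


def fetch_with_pushdown(remote: sqlite3.Connection):
    """Send the filter to the remote side so only matching rows come back."""
    rows = remote.execute(
        "SELECT order_id, status FROM orders WHERE status = ?", ("open",)
    ).fetchall()
    return len(rows), rows


remote = make_remote()
moved_all, result_local_filter = fetch_without_pushdown(remote)
moved_filtered, result_pushdown = fetch_with_pushdown(remote)

print("rows moved without pushdown:", moved_all)    # 1000
print("rows moved with pushdown:", moved_filtered)  # 100
```

Both approaches return the same 100 open orders, but the second moves a tenth of the rows. The gap grows with the size of the remote table and the selectivity of the filter.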

Reducing network overhead is an important strategy given the performance constraints associated with large data sets. This is why Google BigQuery Omni actually runs part of the query engine directly within AWS or Azure.

If you are planning to query the contents of an AWS data lake, we suggest making sure you are following the best practices we detailed for Athena, which apply to Redshift as well:

How To Create A Serverless, Zero Infrastructure, Zero Administration Data Lake With Amazon S3…

We are going to describe a serverless solution stack of Amazon S3, Apache Parquet and Amazon Athena for your data lake

blog.openbridge.com

Amazon Redshift Federated Queries Vs

Amazon Redshift Spectrum already allowed you to query your AWS data lake. In a sense, Redshift has had a form of federated queries for some time. However, the scope was limited to an AWS data lake.

The new capabilities follow an industry trend toward query engines supporting diverse data stores for data ingestion. For example, Amazon Athena, which is based on PrestoDB, has supported the concept of a federated query engine for some time. PrestoDB was conceived by Facebook as a federated SQL query engine.

Support for a federated query engine model is a must-have, not a nice-to-have, for Redshift to remain relevant as a service.

Who should use the Redshift Federated Query Service?

The value proposition is targeted at existing Redshift users. If you are already using a different federated query engine service, such as Amazon Athena, there is no compelling reason to switch.

In a previous post, we discussed the Redshift Spectrum vs Athena use case.

On the plus side, AWS Redshift and AWS Athena can access the same AWS data lake. This means you can pilot Redshift by running queries against the same data lake used by Athena. Of course, this type of flexibility and efficiency assumes a properly architected data lake.

If you are a Redshift user, Amazon Redshift Federated Queries offer flexibility, especially when deciding if you need to scale or add capacity to the system. For example, you can save big dollars by adding a lifecycle process to move data out of Redshift to a data lake or by leaving data in place within RDS.

Why pay to store that data in Redshift when storing data in a lake or querying data in place is possible? As a result, these new Redshift query capabilities can give users more technical options and cost optimization opportunities. For example, you can minimize the need to scale Redshift with a new node, which can be an expensive proposition.
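As a rough sketch of what such a lifecycle step might look like, the Python helper below builds a Redshift UNLOAD statement that writes older rows to S3 as Parquet, where they can still be queried as part of the data lake. The table name, the order_date column, the bucket, and the IAM role ARN are all placeholders. Removing the unloaded rows from Redshift would be a separate step that is not shown here.

```python
from datetime import date


def build_unload_sql(table: str, cutoff: date, s3_prefix: str, iam_role_arn: str) -> str:
    """Build an UNLOAD that exports rows older than cutoff to S3 as Parquet.
    UNLOAD takes the SELECT as a quoted string, so single quotes inside the
    SELECT are doubled. The order_date column is a hypothetical example."""
    select = f"SELECT * FROM {table} WHERE order_date < '{cutoff.isoformat()}'"
    quoted_select = select.replace("'", "''")
    return (
        f"UNLOAD ('{quoted_select}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Placeholder table, bucket, and role; substitute your own.
unload_sql = build_unload_sql(
    "public.orders_history",
    date(2019, 1, 1),
    "s3://example-data-lake/orders_history/",
    "arn:aws:iam::123456789012:role/example-redshift-unload",
)
print(unload_sql)
```

Parquet is used here because columnar files tend to keep scan costs down when the same data is later queried from the lake.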

Getting Started With Amazon Redshift Federated Queries

A well-architected data lake will ensure your Redshift federated queries run quickly and incur minimal costs. The Openbridge zero administration data lake service is a perfect pairing for Redshift Federated Queries. Push data from supported data sources, and our service automatically handles the data ingestion to a Redshift-supported AWS data lake.

Want to discuss Redshift federated querying or data lakes for your organization? Need a platform and team of experts to kickstart your data and analytics efforts? We can help! Getting traction with new technologies can be difficult, especially if it means your team is working in different and unfamiliar ways. This is especially true in a self-service-only world. If you want to discuss a proof-of-concept, pilot, project, or any other effort, the Openbridge platform and team of data experts are ready to help.

Reach out to us at hello@openbridge.com. Prefer to talk to someone? Set up a call with our team of data experts.

References

What is a data lake? Strategy & success depend on practical data lake solutions

Data lake vs. data warehouse, which is better? It is not uncommon to see a data lake framed as just “storage” or claims…

www.openbridge.com

Data Lakes? Big Myths About Architecture, Strategy, and Analytics

What is a data lake? Get a leg up to becoming a data-driven enterprise

blog.openbridge.com

AWS Lake Formation: Accelerating Data Lake Adoption

If you read about data lakes, you will often come across a post, guide, documentation, or tweet that will describe…

blog.openbridge.com

Adobe Data Feeds: How to use a data lake and Amazon Athena for analytic insights

4 Steps to configure your Adobe Data Feeds for a data lake using Amazon S3 and Amazon Athena

blog.openbridge.com

Data lake vs. data warehouse? Modern data management strategies

We have referenced some of them on this page as well as on our blog. Yes, but it depends on what you mean by “all” …

www.openbridge.com

Best practices for Amazon Redshift Federated Query | Amazon Web Services

This post discusses ten best practices to help you maximize the benefits of Federated Query when you have large…

aws.amazon.com


Written by Thomas Spicer