How is AWS Redshift Spectrum different than AWS Athena?

Which data lake SQL query engine? Redshift Spectrum vs. Athena

Thomas Spicer
Published in
6 min readAug 31, 2017

--

This question about interactive query services AWS Athena and Redshift Spectrum database has come up a few times in various posts and forums. However, most of the discussion focuses on the technical difference between these Amazon Web Services resources.

Athena & Redshift Spectrum are excellent choices for their respective use cases. Rather than try to decipher technical differences, the post frames the choice as a buying, or value, question.

Here are four questions you can ask yourself to help frame which may work best for your situation:

1. You Are An Existing Redshift Customer

If you are already a Redshift customer, Amazon Redshift Spectrum can help you balance the need for adding capacity to the system. Spectrum allows you to extend beyond typical data warehousing and dense storage by directly querying a data lake. This can save you big dollars since you can get lifecycle data out of Redshift to S3.

For example, let’s say you have a 100 GB transactional table of infrequently accessed data within one of your Redshift Operational Databases. Why pay to store that data in Redshift, adjusting cluster size to hold more data when moving it to external tables on AWS S3 and query data with Spectrum is an option? This approach can minimize the need to scale Redshift requires a new node for improving consistent performance for both a simple or complex query, which can be expensive!

Note: You are still paying “per query” for the amount of data scanned via Spectrum the same as Athena. In both cases, you pay for each terabyte of data scanned. The benefit of this approach is offloading data so you can be more efficient with local storage in Redshift. Also, complex analytic queries will work equally well in either system and will have the same cost profile for compute resources. One caveat is that you still have constraints for concurrent users in a Redshift for various analytical workloads compared to Athena.

Quick Pick? Stick With Redshift vs Athena

As an existing Redshift user, I would be less inclined to use Athena because of existing investments in Redshift. This includes any ancillary data operations that process data into a Redshift warehouse or unique and complex data types supported by Redshift that may break existing workflows.

Keep in mind that Spectrum data resides in an external schema. Not a big deal, but make sure any ETL or ELT data processing for use within Spectrum should account for external tables. Any connector you have loading data to Redshift will need to make sure it supports the file format and workflows associated with Spectrum

Redshift federated queries were released in 2020. These new capabilities may tip the scales in favor of sticking with Redshift.

Lastly, remember that a key difference from Athena is that access to Spectrum requires an active, running Redshift instance. Redshift Spectrum is not an option without Redshift. Access to the “Redshift+Redshift Spectrum” tandem has costs that might not be worthwhile (right now) if you are NOT an AWS Redshift customer.

2. You Are NOT An Existing Redshift Customer

If you are not a Redshift customer, then it becomes more interesting. AWS Athena is based on Facebook Presto and includes some Apache Hive goodness. Assuming you have objects on S3 that Athena can consume, then you might start with Athena vs. spinning up Redshift clusters.

A query in Athena and Spectrum generally has the same compute cost of $5 per terabyte scanned. One significant difference is that Spectrum requires Redshift, which must be factored into your total cost.

Have data in locations other than your data lake? You can extend Athena via federated query services.

Using Athena opens up an opportunity to create a full serverless data analytics stack.

Quick Pick? Explore Athena

Check out the post Building A Serverless Business Intelligence Stack With Apache Parquet, Tableau, and Amazon Athena for inspiration. Athena can be an exceptional value when implemented correctly, especially when paired with analytics services that support data caches like Tableau and storage formats like Parquet.

Lastly, Athena is an on-demand, serverless query engine that also has the capability to operate as federated query engines. This may be attractive given the fact many teams may not want to run, maintain, or pay for a set of Amazon Redshift clusters. If Athena is a fit for your workload and you want to follow a separation of storage from compute, the obvious choice is well within reach.

3. Do Your Analytic Tools Support Amazon Athena?

Check your tools, can then access Athena via ODBC or JDBC? It might be the case that your analytic tool of choice does not support Athena, but does support Redshift.

For example, Tableau 10.3 officially released support for Athena. Looker also released support for Athena. Did you know that Power BI supports Athena via ODBC connections?

However, there may be tools that don’t support Athena that you rely on. The flip side is they might not support Spectrum either!

Quick Pick? It depends!

So this gets back to the first point (1) around what your current stack includes. If you went down the Athena path, your tool choices might be more limited than Redshift. This could be a deal-breaker for some. The Redshift path may give you more data and analytics tooling options.

We suggest that you test a tool that works with Athena, Redshift, and Redshift Spectrum.

Lastly, regardless of a Spectrum or Athena choice, do not overlook data formats optimizations external tables need to drive efficiency and downward costs.

4. Can I use Amazon Athena AND Redshift Spectrum?

This does not have to be an AWS Athena vs. Redshift choice. Why? Nothing stops you from using both Athena or Spectrum. As query service compute engines, both AWS Redshift Spectrum and AWS Athena can both access the same data lake!

I can query a 1 TB Parquet file on S3 in Athena the same as Spectrum. Data optimized on S3 in the Apache Parquet format is well-positioned for Athena AND Spectrum.

The transition between the two becomes somewhat trivial. This is also true in moving for Apache Parquet data from S3 Data Lake to a Microsoft Azure Data Lake!

However, if you are using both together, you should look closely at your architecture if this occurs. If you are not careful, you could have increased the costs of maintaining this kind of stack.

If you are asking yourself, “What is a Data Lake?” we cover the basics, which may help you pick the right path. Athena and Redshift Spectrum provide compelling, cost-effective solutions to query the contents of your lake.

Getting Started With Athena or Spectrum

Building data platforms and data infrastructure is hard work. Whether you are a team of one or a group of 100, the last thing you need is to fly blind and get stuck with self-service (aka, no service) solutions.

It has never been easier to take advantage of an analytics-ready data lake with Amazon Athena and Redshift Spectrum interactive query services. Our AWS lake formation service optimizes and automates the configuration, processing, organizing, and loading of data for use in Athena and Spectrum.

With our zero administration, data lake service, you push data from supported data sources, and our service automatically handles the data ingestion.

Our approach unlocks how quickly users can undertake data ingestion to a data lake so they can return query results rapidly.

References:

--

--