Introduction To Amazon Redshift Spectrum — Redefining Performance

Openbridge
Openbridge
Published in
5 min readApr 26, 2017

--

Amazon Web Services (AWS) released a companion to Redshift called Amazon Redshift Spectrum, a feature that enables running SQL queries against the data residing in a data lake using Amazon Simple Storage Service (Amazon S3). Amazon says that with Redshift Spectrum, users can query unstructured data without having to load or transform it.

We have launched a code-free, zero-admin, fully automated data pipeline that automates database, table creation, Parquet file conversion, Snappy compression, partitioning and more.

Get started with Amazon Redshift Spectrum for free!

Key Features

With Redshift Spectrum, you will have the freedom to store your data in a multitude of formats, so that it is available for processing whenever you need it. Here are a few of Spectrum’s key features:

  • instant queries within your favorite BI tools without needing to load and transform your data stored in Amazon S3
  • scaling processing across thousands of nodes with separated cluster storage and computing
  • fast results with Amazon Redshift query optimizer that minimizes data scanned in AWS S3 to improve query speed
  • a pay-per-query price which allows you to pay only for queries you run

How Does Amazon Redshift Spectrum Work?

Amazon Redshift Spectrum operates on data stored on AWS S3 which means that you can process the data using other AWS services. Actually, Amazon Athena data catalogs are used by Spectrum by default. So if you use Athena, all you need to do to get started with Spectrum is provide an authorization to access your data files in S3 and data catalog in Athena. The data files are the same ones that you would use for other applications and AWS services.

There are a few things you need to take into consideration when starting to work with data files for queries in Amazon Redshift Spectrum.

1. External Tables

To query data on Amazon S3, Spectrum uses external tables, so you’ll need to define those. Syntax to query external tables is the same SELECT syntax that is used to query other Amazon Redshift tables. One thing to mention is that you can join created an external table with other non-external tables residing on Redshift using JOIN command.

For more information about external tables for Redshift Spectrum visit: http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html.

2. File formats supported by Spectrum

Amazon Redshift Spectrum supports structured and semi-structured data formats which include Parquet, Textfile, Sequencefile and Rcfile. Amazon recommends using a columnar format because it will allow you to choose only the columns you need to transfer data from S3.

3. Data compression

Compressing your data files allows you to reduce storage space, improve performance, and minimize costs and recommended by Amazon. Spectrum supports gzip, snappy and bz2 compression files.

4. Optimization for massively parallel processing (MPP)

In order to effectively work with complex queries running on large amounts of data, you need to optimize your data for parallel processing. For best performance, Amazon suggests to break large files into many smaller chunks (from 100 MB to 1 GB) and store them in the same folder. It is also recommended to keep them all about the same size because a big difference in file sizes may cause uneven distribution of the workload.

For more information visit: http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html.

To see Amazon Spectrum in action visit a Jeff Barr’s blog post: https://aws.amazon.com/blogs/aws/amazon-redshift-spectrum-exabyte-scale-in-place-queries-of-s3-data/.

Spectrum: Reducing time-to-insight

A number of enterprises are already leveraging Redshift Spectrum for their data efforts. For example, Lyft, Nasdaq, TripAdvisor, Yahoo! and Yelp are now able to analyze all of their data stored in Amazon S3 “data lakes” just by running standard Redshift SQL queries. Multiple teams can query the same data sets even if data collected through different services, for example, Amazon Redshift and EMR.

However, there may be some limitations to be mindful of and others you will likely discover along the way. For example, the external tables used to query data on S3 are read-only.

Lastly, take a look at our post about Amazon Redshift Spectrum and Amazon Athena. Query services are gaining steam.

With Redshift Spectrum capabilities you will no longer need to think about storage of data as a constraint. You can now run SQL queries directly on vast amounts of data you collected within Amazon S3.

We have launched a code-free, zero-admin, fully automated data pipeline that automates database, table creation, Parquet file conversion, Snappy compression, partitioning and more.

Get started with Amazon Redshift Spectrum for free!

If you’d like to learn more about how Openbridge can help you to streamline your data to Redshift so you can take advantage of Spectrum’s capabilities and win query battle, contact us.

DDWant to discuss how to leverage Amazon Redshift Spectrum for your organization? Need a platform and team of experts to kickstart your data and analytic efforts? We can help! Getting traction adopting new technologies, especially if it means your team is working in different and unfamiliar ways, can be a roadblock for success. This is especially true in a self-service only world. If you want to discuss a proof-of-concept, pilot, project or any other effort, the Openbridge platform and team of data experts are ready to help.

Reach out to us at hello@openbridge.com. Prefer to talk to someone? Set up a call with our team of data experts.

Visit us at www.openbridge.com to learn how we are helping other companies with their data efforts.

--

--

Openbridge is a Data Logistics Platform (DLP) designed to collect, discover and act upon data simply, quickly and smartly