Apache Parquet + Amazon S3 + Amazon Athena + Tableau

Building A Serverless Business Intelligence Stack With Apache Parquet, Tableau, and Amazon Athena

Thomas Spicer
Published in Openbridge
5 min read · Nov 12, 2018


What do you get when you use Apache Parquet, an Amazon S3 data lake, Amazon Athena, and Tableau’s new Hyper Engine?

You have yourself a powerful, on-demand, and serverless analytics stack.

The Architecture

The basic premise of this model is that you store data in Parquet files within a data lake on S3. Then, you wrap AWS Athena (or AWS Redshift Spectrum) as a query service on top of that data. Lastly, you leverage Tableau to run scheduled queries that will store a “cache” of your data within the Tableau Hyper Engine.

This approach allows you to optimize performance while reducing costs because you let Tableau Server perform select queries to Athena to pull data locally to the Hyper Engine. All subsequent queries in a Tableau visualization happen against the data resident in the Hyper Engine rather than Athena.

Since you only pay for the queries you run in Athena, this approach not only leverages the native Hyper Engine performance optimizations, it also significantly reduces query costs. Leveraging Tableau schedules, you could be making as few as one or two queries a day.
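To see why the schedule matters, here is a back-of-the-envelope sketch of Athena's per-query billing (priced per terabyte of data scanned; $5/TB is Athena's published rate at the time of writing). The scan sizes and query counts below are hypothetical numbers chosen purely for illustration:

```python
# Back-of-the-envelope Athena cost model: you pay per TB of data scanned.
# The $5/TB rate is Athena's published price; scan sizes are hypothetical.
PRICE_PER_TB = 5.00

def monthly_athena_cost(gb_scanned_per_query, queries_per_day, days=30):
    """Estimate a month of Athena spend for a fixed query schedule."""
    tb_scanned = gb_scanned_per_query / 1024 * queries_per_day * days
    return tb_scanned * PRICE_PER_TB

# Dashboards hitting Athena directly: 50 queries/day over 10 GB each.
direct = monthly_athena_cost(gb_scanned_per_query=10, queries_per_day=50)

# Hyper-cached approach: 2 scheduled extract refreshes/day, same 10 GB.
cached = monthly_athena_cost(gb_scanned_per_query=10, queries_per_day=2)

print(f"direct: ${direct:.2f}/month, cached: ${cached:.2f}/month")
```

With the Hyper cache absorbing the interactive traffic, Athena only sees the scheduled refreshes, so the spend scales with the refresh schedule rather than with how often analysts click around a dashboard.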

Why Apache Parquet?

Apache Parquet is a self-describing data format that embeds the schema, or structure, within the data itself. The result is a file optimized for query performance and minimal I/O. Parquet also supports very efficient compression and encoding schemes. The great thing is that it is an open-source project of the Apache Software Foundation, available for any project to use.
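The I/O saving comes largely from Parquet's columnar layout. This toy sketch (not Parquet's actual on-disk format, which adds encodings, compression, and metadata on top) shows the core idea: to answer a single-column query, a row-oriented file like CSV forces a scan of every byte, while a columnar layout lets you read only that column's bytes:

```python
import csv
import io

# Toy dataset: 3 columns, many rows. Not Parquet's real format -- just an
# illustration of why columnar layout minimizes I/O for column-level queries.
rows = [{"user_id": i, "country": "US", "revenue": i * 0.1} for i in range(1000)]

# Row-oriented (CSV-like): columns are interleaved, so answering
# "SUM(revenue)" still requires scanning every byte of the file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "country", "revenue"])
writer.writeheader()
writer.writerows(rows)
row_oriented_bytes = len(buf.getvalue().encode())

# Column-oriented: each column is stored contiguously, so the same query
# reads only the "revenue" column's bytes.
revenue_column = ",".join(str(r["revenue"]) for r in rows)
column_bytes_read = len(revenue_column.encode())

print(f"row-oriented scan: {row_oriented_bytes} bytes")
print(f"columnar scan:     {column_bytes_read} bytes")
```

Since Athena bills per byte scanned, this layout difference shows up directly on your bill as well as in query latency.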

You could use CSV files, but why would you?

If time is money, your analysts could be spending close to 5 minutes waiting for a query to complete simply because you are using raw CSV. If you are paying someone $150 an hour and they run that query once a day for a year, they will spend about 30 hours simply waiting for queries to finish. That is roughly $4,500 in unproductive “wait” time, or “time to go get a coffee while my query completes”. Total wait time for the Apache Parquet user? About 42 minutes, or roughly $100.
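The arithmetic behind those figures, using the numbers above (a 5-minute CSV query once a day for a year at $150/hour, versus about 42 minutes of total Parquet wait time over the same year):

```python
# Reproduces the back-of-the-envelope "wait time" cost above: one query
# per day for a year, at an analyst rate of $150/hour.
HOURLY_RATE = 150
QUERIES_PER_YEAR = 365

def yearly_wait_cost(minutes_per_query):
    """Return (hours spent waiting, dollar cost) over a year of daily queries."""
    hours_waiting = minutes_per_query * QUERIES_PER_YEAR / 60
    return hours_waiting, hours_waiting * HOURLY_RATE

csv_hours, csv_cost = yearly_wait_cost(5)        # raw CSV: ~5 min per query
pq_hours, pq_cost = yearly_wait_cost(42 / 365)   # Parquet: ~42 min total/year

print(f"CSV:     {csv_hours:.0f} hours waiting, ~${csv_cost:.0f}")
print(f"Parquet: {pq_hours * 60:.0f} minutes waiting, ~${pq_cost:.0f}")
```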

Why Tableau?

Tableau provides desktop, server, and hosted software that allows users to connect, explore, and visualize their data. Tableau has native connectors that enable it to query relational databases, cloud databases, flat files, and spreadsheets. One of the more recent connectors Tableau added was for Amazon Athena. This opens the door to leveraging a serverless stack.

Why Amazon Athena?

Not long ago, Amazon Web Services (AWS) introduced Amazon Athena, a service that uses ANSI-standard SQL to query data directly from a data lake within Amazon’s Simple Storage Service, or Amazon S3. This makes it easy to analyze big data (or any data) directly in S3 using standard SQL. One of the key elements of the Athena model is that you only pay for the queries you run. This is an attractive feature since there is no hardware to set up, manage, or maintain.

How Do You Query Your Parquet Files With Tableau?

Is your data loaded into Athena as Apache Parquet? If so, you are ready to connect Tableau. The process is straightforward and resembles other database connections. Select the Amazon Athena data source and then connect using your credentials and host info:

Configure Amazon Athena for Tableau

If you get stuck, take a look at the Tableau docs for more detail on making a connection to Athena.

If you want to dig deeper into how you can leverage Tableau and Apache Parquet, take a look at this post:

How Do You Query Your Parquet Files With Amazon Athena?

If you don’t have Tableau, you can also query your Parquet data within the Amazon Athena interface:

Query Apache Parquet files via Amazon Athena
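For reference, registering existing Parquet files with Athena is typically done with a `CREATE EXTERNAL TABLE` statement pointed at their S3 location; after that, standard SQL works against them. The bucket path, table, and column names below are hypothetical placeholders:

```sql
-- Hypothetical example: register Parquet files in S3 as an Athena table,
-- then query them with standard SQL.
CREATE EXTERNAL TABLE sales (
  order_id   BIGINT,
  country    STRING,
  revenue    DOUBLE
)
STORED AS PARQUET
LOCATION 's3://your-data-lake-bucket/sales/';

SELECT country, SUM(revenue) AS total_revenue
FROM sales
GROUP BY country;
```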

Getting Started

It has never been easier to get your data into Apache Parquet and Amazon Athena. Our Apache Parquet service optimizes and automates the configuration, processing, and loading of data to AWS Athena, unlocking fast query results in Tableau. With our new zero-administration AWS Athena service, you simply push data from supported data sources, and our service automatically loads it into your AWS Athena database for use in Tableau Desktop and Tableau Server.

If you want to get started with Apache Parquet but lack the time or expertise, this is the solution for you!

We have launched a code-free, zero-admin, fully automated data pipeline that handles database and table creation, Parquet file conversion, Snappy compression, partitioning, and more.

Get started with AWS Redshift Spectrum or AWS Athena for free!

Want to discuss how to leverage Apache Parquet for your organization? Need a platform and team of experts to kickstart your data and analytic efforts? We can help! Getting traction adopting new technologies, especially if it means your team is working in different and unfamiliar ways, can be a roadblock to success. This is especially true in a self-service-only world. If you want to discuss a proof-of-concept, pilot, project, or any other effort, the Openbridge platform and team of data experts are ready to help.

Reach out to us at hello@openbridge.com. Prefer to talk to someone? Set up a call with our team of data experts.

Visit us at www.openbridge.com to learn how we are helping other companies with their data efforts.
