Get started with pre-built Amazon data lakes

How To Create A Serverless, Zero Infrastructure, Zero Administration Data Lake With Amazon S3, Amazon Athena and Apache Parquet

Thomas Spicer
Openbridge
5 min read · Jan 2, 2019


If you have read about data lakes, they seem like a viable approach to fusing together all of the organizational data locked away in silos of web, mobile, marketing, social, CRM, and other systems. In fact, teams are using innovative serverless data lakes to fuel their data analytics efforts with tools like Tableau.

Note: If you are wondering “what is a data lake?”, you can start with our post on the topic.

While promising, we know building a data lake can seem like a daunting proposition. It is difficult to know exactly what is needed to get started given all the different, and sometimes costly, options on the market.

Getting Started With Data Lakes

We are going to describe a serverless solution stack of Amazon S3, Apache Parquet, and Amazon Athena for your data lake. This approach means no infrastructure to build or manage. You also do not need to worry about configuration, software updates, failures, or scaling your infrastructure as your datasets and number of users grow. This will free your team to focus on use cases of data consumption via tools like Tableau, not the underlying system development.

Why Amazon S3 and Amazon Athena For Your Data Lake?

When teams realize “quick wins”, they build the confidence needed to establish organizational velocity. This lets teams explore creating low-cost, low-risk data lakes and focus on having data ready for analysis, reporting, or other business activities.

Using AWS offers a number of benefits when creating a serverless stack:

  • Athena is easy to query: Amazon Athena uses Presto, an open-source, distributed SQL query engine created by Facebook. It is optimized for low-latency, ad hoc analysis of data. This means you can run queries against large datasets in Amazon S3 using SQL and a wide variety of BI tools like Tableau and Microsoft Power BI.
  • Athena and S3 are cost-efficient by design: With Amazon Athena, you pay only for the queries that you run. You are charged based on the amount of data scanned by each query on S3. You can realize significant cost savings and performance gains by compressing, partitioning, and converting your data to Apache Parquet because it reduces the amount of data that Athena needs to scan to execute a query.
  • Convenience with performance: With Amazon Athena and S3, you don’t have to worry about managing or tuning clusters to get fast performance. Athena is optimized for fast performance with Amazon S3. Combined with data in Apache Parquet format, Athena will efficiently and automatically execute queries in parallel across your data in S3. This will deliver query results in seconds, even on large datasets.
  • Built-in durability and availability: Using Amazon S3 as the underlying data store ensures your data is highly available and durable. Amazon S3 is designed for 99.999999999% (11 nines) of durability. Your data is redundantly stored across multiple facilities and multiple devices in each facility. Peace of mind and a huge time saver.
  • Secure: Access to data is controlled using AWS Identity and Access Management (IAM) policies, access control lists (ACLs), and Amazon S3 bucket policies. With IAM policies, you can grant fine-grained control to your S3 buckets. By controlling access to data in S3, you can restrict users from querying it in Athena. Athena also allows you to easily query encrypted data stored in Amazon S3 and write encrypted results back to your S3 bucket. Both server-side and client-side encryption are supported.
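As a sketch of what querying looks like in practice, the snippet below uses the boto3 Athena client to submit a SQL query against data in S3 and poll for its completion. The bucket, database, table, and column names here are hypothetical placeholders, and running it requires configured AWS credentials.

```python
import time

# Hypothetical names -- substitute your own database, table, and results bucket.
DATABASE = "my_data_lake"
OUTPUT = "s3://my-athena-results/"
QUERY = """
SELECT campaign, SUM(clicks) AS total_clicks
FROM marketing_events
WHERE event_date >= DATE '2019-01-01'
GROUP BY campaign
ORDER BY total_clicks DESC
"""

def run_query(query=QUERY, database=DATABASE, output=OUTPUT):
    """Submit a query to Athena and poll until it reaches a terminal state."""
    import boto3  # requires AWS credentials to be configured

    athena = boto3.client("athena")
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    query_id = execution["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)
        status = state["QueryExecution"]["Status"]["State"]
        if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return status, query_id
        time.sleep(1)

if __name__ == "__main__":
    print(run_query())
```

Note that you pay per query based on data scanned, which is why the compression and partitioning steps discussed above matter so much for cost.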

Get A Code-free, Ready-to-go Serverless Data Lake Stack

If you want to get started with a data lake but do not have the time or expertise, we have a solution for you.

We offer a pre-built data lake that automatically kickstarts your efforts. Creating a private, powerful data lake is perfect for those who want exposure to this approach without dealing with the technical complexities of architecture, design, systems operations, and infrastructure.

Our service optimizes and automates the configuration, processing, and loading of data to your private Amazon data lake. Not only do you get the benefits of Amazon S3 and Athena, but also fully automated, efficient, and optimized data pipelines:

  • Automatic partitioning of data — With data partitioning, we minimize the amount of data scanned by each query, thus improving performance and reducing the cost of the queries you run against data stored in S3
  • Automatic conversion to Apache Parquet — We convert data into an efficient and optimized open-source columnar format, Apache Parquet. This lowers query costs because Parquet's columnar format is highly optimized for interactive query services like Athena
  • Automatic data compression — With data in Apache Parquet, compression is performed column by column using Snappy. This not only supports query optimizations, it also reduces the size of the data stored in your Amazon S3 bucket, which reduces costs
  • Automated database and table creation — As upstream data changes, we automatically version tables and views. Data is analyzed and the system “trained” to infer schemas, automating the creation of databases, views, and tables in Athena
  • No coding required — Using the Openbridge web interface, users can create and configure an Amazon data lake using S3, IAM, and AWS Athena
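The partition layout Athena relies on is the Hive-style `key=value` prefix scheme in S3. As an illustrative sketch (the prefix and field names are made up), this helper builds the S3 key a record would land under, so that queries filtered on year, month, or day only scan matching prefixes:

```python
from datetime import date

def partitioned_key(prefix: str, event_date: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key, e.g.
    events/year=2019/month=01/day=02/part-0000.snappy.parquet
    Athena prunes partitions by matching these key=value path segments,
    so a WHERE clause on year/month/day avoids scanning other prefixes."""
    return (
        f"{prefix}/year={event_date.year:04d}"
        f"/month={event_date.month:02d}"
        f"/day={event_date.day:02d}/{filename}"
    )

key = partitioned_key("events", date(2019, 1, 2), "part-0000.snappy.parquet")
print(key)  # events/year=2019/month=01/day=02/part-0000.snappy.parquet
```

The `.snappy.parquet` suffix in the example reflects the compression and format choices described above; the naming itself is a convention, not a requirement.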

This means you can focus on using the data, not managing, building, and deploying infrastructure.

Start Small, Build On Success

We know it is not always easy to kickstart a data lake effort in a self-service world. This is why we advocate the use of pre-built data lakes for pilots or proofs-of-concept. What better way to demonstrate the value of a data lake than to show how it helps your team avoid wasting the time of valuable people, such as analysts, engineers, or data scientists, on manual data wrangling efforts.

We have launched a code-free, zero-admin, fully automated data pipeline that handles database and table creation, Parquet file conversion, Snappy compression, partitioning, and more.

Get started with Amazon Redshift Spectrum or Amazon Athena for free!

Want to discuss a solution like this for your organization? Need a platform and team of experts to kickstart your data and analytics efforts? We can help! Getting traction adopting new technologies, especially if it means your team is working in different and unfamiliar ways, can be a roadblock to success. This is especially true in a self-service-only world. If you want to discuss a proof-of-concept, pilot, project, or any other effort, the Openbridge platform and team of data experts are ready to help.

Reach out to us at hello@openbridge.com. Prefer to talk to someone? Set up a call with our team of data experts.

Visit us at www.openbridge.com to learn how we are helping other companies with their data efforts.
