
Cloud Data Lake Best Practices: Data Lake vs. Data Warehouse

Why an open, flexible, and agile data lake architecture makes a difference between success or swamp

Thomas Spicer
Published in Openbridge
Jun 29, 2020 · 6 min read


Before we jump into best practices around lake formation, architecture, analytics, and other aspects of data lakes, we need to establish a baseline: what exactly is a data lake?

As we have detailed in a prior post, there are numerous misconceptions and myths about data lakes. To set a baseline, this is how Pentaho co-founder and CTO James Dixon, who coined the term, frames it:

This situation is similar to the way that old school business intelligence and analytic applications were built. End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a data mart.

This method works fine until you have a new question to ask. The Data Lake approach solves this problem. You store all of the data in a lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the lake for new questions.

The statement does not frame the discussion as data lake vs. data warehouse vs. data mart, but as a lake fueling and coexisting with a mart or warehouse. As James Dixon stated, these systems are not mutually exclusive. The approach is actually very much the opposite of “vs.”

The beauty and elegance of a lake lie in its simplicity, agility, and flexibility. The innovative, strategic discussion is how a lake and warehouse are designed to work in tandem.

Specific jobs and tasks traditionally done in a warehouse can be offloaded to a data lake for cost efficiencies. For example, unloading a seldom-used table reduces the amount of data resident in the warehouse, a cost-effective strategy for systems like AWS Redshift, Oracle, and others with local storage constraints.
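As a rough sketch of what that offload can look like on Redshift, the UNLOAD command exports a table to S3 in an open format. The cluster, table, bucket, and IAM role names below are placeholders, not anything prescribed here; adapt them to your environment.

import os
import psycopg2

# Placeholder connection details for an illustrative Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password=os.environ["REDSHIFT_PASSWORD"],
)
conn.autocommit = True

with conn.cursor() as cur:
    # UNLOAD exports the table to S3 as Parquet, an open columnar format
    # that lake-side engines (Athena, Redshift Spectrum, Presto) can read.
    cur.execute("""
        UNLOAD ('SELECT * FROM rarely_used_events')
        TO 's3://my-data-lake/archive/rarely_used_events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        FORMAT AS PARQUET
    """)
    # Once the files are verified in the lake, dropping the warehouse copy
    # reclaims local storage on the cluster.
    cur.execute("DROP TABLE rarely_used_events")

The seldom-used data is still queryable from the lake, but it no longer consumes warehouse storage.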

With data stored in a lake, pairing the two can ensure flexibility across data engineers, data scientists, and business users. Done well, this can improve both data governance and data quality for machine learning, data visualizations, BI, or reporting.

Unfortunately, several data lake vendors and solution providers promote complex, heavy-handed, closed architectures. These closed, heavy-handed architectures represent one of the greatest threats to an efficient, lean, productive data lake.

Overdesigned, Heavy-handed “Closed” Data Lakes

What is an example of a closed or heavy-handed approach? A typical pattern is to wrap the lake in, or embed it within, a vendor product. In our case below, “System X” reflects a vendor solution that creates a closed “wrapper” ecosystem around a lake.

System X forces any data ingress to or egress from the lake to be tightly coupled to the requirements set by the vendor, not to data lake best practices.

This type of embedded, closed model is a data lake in name only. By design, the vendor-driven architecture promotes lock-in: analytic, SQL, ETL, and other tools can access the lake only if System X blesses them. As a result, data access is governed by proprietary or esoteric drivers that all too often create compatibility nightmares for everything else.

Technical Debt, Big Price Tags

In addition to considerable technical debt, solutions like System X come with even bigger price tags. How so? Specialized training, expensive support contracts, and avoidance of open standards all contribute to a challenging data lake environment for people to work in. This is not the hallmark of an efficient, lean, productive data lake.

Data Swamps

If you were wondering how a data lake becomes a swamp, it is precisely a function of the crushing weight of unnecessary and ancillary people, processes, and technologies that are placed in and around it.

As an architecture and operational model, this vendor lock-in approach is the antithesis of a data lake. It ensures your data lake will be unresponsive to change and overly rigid about who consumes data and how.

As your team gets fed up with this model, they will demand you start looking at alternatives. Sound familiar?

Let’s look at some guiding principles to help you avoid vendors layering an oppressive ecosystem of tools and systems on data lakes.

Agile, Open Data Lake Access

Going back to System X: it should not be the manifestation of the lake itself. Rather than encapsulating the lake, System X should be a consumer of, or contributor to, the lake.

A healthy lake ecosystem is one where you follow a well-abstracted architecture. Doing so minimizes the risk of artificially limiting the opportunity lakes represent, as System X did. An open lake leaves sophisticated modeling, data transformation, and wrangling jobs to those who need to consume the data, with whatever tools they deem appropriate.
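One way to preserve that abstraction is to land raw data in an open format on object storage and describe it with a plain external-table definition in a shared catalog, so no single tool owns the schema. A sketch under assumed names follows; the lake database, table, and S3 paths are illustrative only.

import boto3

# Illustrative DDL: register raw Parquet files in S3 as an external table.
# Any engine that reads the shared Glue/Hive catalog (Athena, Redshift
# Spectrum, Presto) can then query the same files; no vendor wrapper needed.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS lake.web_events (
    event_time timestamp,
    user_id    string,
    url        string
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/raw/web_events/'
"""

athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "lake"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)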

For example, you can pair a data lake with an open-source, standards-based distributed SQL query engine. This can be Facebook Presto, Amazon Athena, Redshift Spectrum, or all three at the same time. This gives you access and consumption flexibility to undertake analysis with your preferred ELT/ETL, data science, BI, or query tools.
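As a minimal sketch of that on-demand consumption, reusing the hypothetical web_events table and bucket from above, an ad-hoc query can run through Athena from Python with nothing to provision:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; Athena reads the Parquet files directly from S3.
qid = athena.start_query_execution(
    QueryString="SELECT url, COUNT(*) AS hits FROM web_events GROUP BY url",
    QueryExecutionContext={"Database": "lake"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes; the service is serverless, so there is
# no cluster to manage on either side of this call.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])

The same files could be queried by Redshift Spectrum or a Presto cluster without any change to the data itself, which is the point of keeping the lake open.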

An open, on-demand data lake strategy means you can run queries directly against your raw structured and unstructured data from a wide variety of tools like Tableau, Microsoft Power BI, Looker, Amazon QuickSight, and many others.

Following a well-abstracted architecture and open standards ensures you have the data you need at the ready to fuel the tools your team loves, today or tomorrow.

Don’t go it alone solving the toughest data strategy, engineering, and infrastructure challenges

Building data platforms and data infrastructure is hard work. Whether you are a team of one or a group of 100, the last thing you need is to fly blind and get stuck with self-service (aka, no service) solutions.

It has never been easier to take advantage of an “analytics-ready” data lake with a serverless query service like Amazon Athena or Redshift Spectrum.

Our service optimizes and automates the configuration, processing, and loading of data for AWS Athena and Redshift Spectrum, so users can get to query results faster. With our zero-administration data lake service, you push data from supported data sources, and it is automatically loaded into your data lake, ready for AWS Athena and Redshift Spectrum to query.

Need strategy or engineering support? If you have a project, we have the expertise. Let’s put it to work for you!
