
Cloud Data Lake Best Practices: Data Lake vs. Data Warehouse

Why an open, flexible, and agile data lake architecture makes a difference between success or swamp

Thomas Spicer
Published in Openbridge
Jun 29, 2020 · 6 min read


Before we jump into best practices around lake formation, architecture, analytics, and other aspects of data lakes, we need to establish a baseline: what exactly is a data lake?

As we have detailed in a prior post, there are numerous misconceptions and myths about data lakes. To set a baseline, this is how Pentaho co-founder and CTO James Dixon, who coined the term, frames it:

This situation is similar to the way that old school business intelligence and analytic applications were built. End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a data mart.

This method works fine until you have a new question to ask. The Data Lake approach solves this problem. You store all of the data in a lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the lake for new questions.

The statement does not frame the discussion as data lake vs. data warehouse vs. data mart, but as a lake fueling and coexisting with a mart or warehouse. As James Dixon stated, these systems are not mutually exclusive. The approach is actually very much the opposite of “vs.”

The beauty and elegance of a lake lie in its simplicity, agility, and flexibility. The innovative, strategic discussion is how a lake and warehouse are designed to work in tandem.

Specific jobs and tasks traditionally done in a warehouse can be offloaded to a data lake for cost efficiencies. For example, unloading a seldom-used table reduces the amount of data resident in the warehouse, a cost-effective strategy for systems like AWS Redshift, Oracle, and others with local storage constraints.
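As a rough sketch of what that offload can look like on Redshift, the UNLOAD command exports a table to S3 in an open format. The cluster, table, bucket, and IAM role names below are placeholders, not anything prescribed here; adapt them to your environment.

import os
import psycopg2

# Placeholder connection details for an illustrative Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password=os.environ["REDSHIFT_PASSWORD"],
)
conn.autocommit = True

with conn.cursor() as cur:
    # UNLOAD exports the table to S3 as Parquet, an open columnar format
    # that lake-side engines (Athena, Redshift Spectrum, Presto) can read.
    cur.execute("""
        UNLOAD ('SELECT * FROM rarely_used_events')
        TO 's3://my-data-lake/archive/rarely_used_events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        FORMAT AS PARQUET
    """)
    # Once the files are verified in the lake, dropping the warehouse copy
    # reclaims local storage on the cluster.
    cur.execute("DROP TABLE rarely_used_events")

The seldom-used data is still queryable from the lake, but it no longer consumes warehouse storage.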

With data stored in a lake, pairing the two can ensure flexibility across data engineers, data scientists, and business users. Done well, this can improve both data governance and data quality for machine learning, data visualizations, BI, or reporting.

Unfortunately, several data lake vendors and solution providers promote complex, heavy-handed, closed architectures. These closed, heavy-handed architectures represent one of the greatest threats to an efficient, lean, productive data lake.

Overdesigned, Heavy-handed “Closed” Data Lakes

What is an example of a closed or heavy-handed approach? A typical pattern is to wrap the lake in, or embed it within, a vendor product. In our case below, “System X” reflects a vendor solution that creates a closed “wrapper” ecosystem around a lake.

System X forces any data ingress to or egress from the lake to be tightly coupled to the requirements set by the vendor, not to data lake best practices.

This type of embedded, closed model is a data lake in name only. By design, the vendor-driven architecture promotes lock-in: analytic, SQL, ETL, and other tools can access the lake only if System X blesses them. As a result, data access is governed by proprietary or esoteric drivers that all too often create compatibility nightmares for everything else.

Technical Debt, Big Price Tags

In addition to considerable technical debt, solutions like System X come with even bigger price tags. How so? Specialized training, expensive support contracts, and avoidance of open standards all contribute to a challenging data lake environment for people to work in. This is not the hallmark of an efficient, lean, productive data lake.

Data Swamps

If you were wondering how a data lake becomes a swamp, it is precisely a function of the crushing weight of unnecessary and ancillary people, processes, and technologies that are placed in and around it.

As an architecture and operational model, this vendor lock-in approach is the antithesis of a data lake. It ensures your data lake will be unresponsive to change and overly rigid about who consumes data and how.

As your team gets fed up with this model, they will demand you start looking at alternatives. Sound familiar?

Let’s look at some guiding principles to help you avoid vendors layering an oppressive ecosystem of tools and systems on data lakes.

Agile, Open Data Lake Access

Going back to System X: it should not be the manifestation of the lake itself. Rather than encapsulating the lake, System X should be a consumer of, or contributor to, the lake.

A healthy lake ecosystem is one where you follow a well-abstracted architecture. Doing so minimizes the risk of artificially limiting the opportunity lakes represent, as System X did. An open lake leaves sophisticated modeling, data transformation, and wrangling jobs to those who need to consume the data, with whatever tools they deem appropriate.
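One way to preserve that abstraction is to land raw data in an open format on object storage and describe it with a plain external-table definition in a shared catalog, so no single tool owns the schema. A sketch under assumed names follows; the lake database, table, and S3 paths are illustrative only.

import boto3

# Illustrative DDL: register raw Parquet files in S3 as an external table.
# Any engine that reads the shared Glue/Hive catalog (Athena, Redshift
# Spectrum, Presto) can then query the same files; no vendor wrapper needed.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS lake.web_events (
    event_time timestamp,
    user_id    string,
    url        string
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/raw/web_events/'
"""

athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "lake"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)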

For example, you can pair a data lake with an open-source, standards-based distributed SQL query engine. This can be Facebook Presto, Amazon Athena, Redshift Spectrum, or all three at the same time. This gives you access and consumption flexibility to undertake analysis with your preferred ELT/ETL, data science, BI, or query tools.
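As a minimal sketch of that on-demand consumption, reusing the hypothetical web_events table and bucket from above, an ad-hoc query can run through Athena from Python with nothing to provision:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; Athena reads the Parquet files directly from S3.
qid = athena.start_query_execution(
    QueryString="SELECT url, COUNT(*) AS hits FROM web_events GROUP BY url",
    QueryExecutionContext={"Database": "lake"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes; the service is serverless, so there is
# no cluster to manage on either side of this call.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])

The same files could be queried by Redshift Spectrum or a Presto cluster without any change to the data itself, which is the point of keeping the lake open.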

An open, on-demand data lake strategy means you can run queries directly against your raw structured and unstructured data from a wide variety of tools like Tableau, Microsoft Power BI, Looker, Amazon QuickSight, and many others.

Following a well-abstracted architecture and open standards ensures you have the data you need at the ready to fuel the tools your team loves, today or tomorrow.

Don’t go it alone solving the toughest data strategy, engineering, and infrastructure challenges

Building data platforms and data infrastructure is hard work. Whether you are a team of one or a group of 100, the last thing you need is to fly blind and get stuck with self-service (aka, no service) solutions.

It has never been easier to take advantage of an “analytics-ready” data lake with a serverless query service like Amazon Athena or Redshift Spectrum.

Our service optimizes and automates the configuration, processing, and loading of data for AWS Athena and Redshift Spectrum, so users can get to query results faster. With our zero-administration data lake service, you push data from supported data sources, and it is automatically loaded into your data lake, ready for AWS Athena and Redshift Spectrum to query.

Need strategy or engineering support? If you have a project, we have the expertise. Let’s put it to work for you!
