Data Lakes? 8 Architecture, Strategy, & Analytic Myths

What is a data lake? How does a data lake fit within enterprise data strategies?

Thomas Spicer
Jun 5, 2019 · 18 min read


This post explains what a data lake is and how it fits into enterprise data strategies, including traditional data warehouses.

Historically, the technical and business questions around data lakes have been met with confusing and opaque answers, thanks to conflicting advice from consultants and vendors.

Debunking the common myths around strategy, interplay with source systems, cloud architecture, data analytics, and implementation makes it much easier to explain data lakes in straightforward terms.

Breaking down these myths will help you understand why data lakes fail and the challenges teams often face getting started. It will also shed light on why vendors and consultants may be promoting a path that runs contrary to data lake or data lakehouse best practices.

So Much Conflicting Advice. It Gets Confusing, Fast

Unfortunately, confusing and misleading advice leads people to ask questions in the context of technology platforms rather than strategy or business outcomes. Focusing only on technology-driven decision making is an attempt to make an inherently subjective conversation look objective.

For example, vendors may push narrow technical questions like “What is an Amazon data lake?” or “What is the best data lake software?”. Maybe a pushy vendor is promoting buzzword-laden HIPAA-compliant data lakes in a healthcare context.

All too often, the advice presents a binary choice among traditional data warehouses, cloud warehouses, data lakehouses, or lakes as a means to access data. They make it a definitive choice between a warehouse or a lake for self-service analytics, advanced data queries, batch data processing, and business agility. The “lakes vs. data warehouse” battle is a false choice.

As a result, the conversation around lakes can get bewildering for those trying to figure out how to capture value from data source systems for their advanced analytics, data science, and insights efforts.

1. Data Lake vs. Data Warehouse

The data lake vs. data warehouse debate frames conversations about a centralized repository, analytics architecture, business analytics, or efforts centered on data-driven insights. In these use cases, a lake will often be framed as a losing proposition.

Conversely, the same is true when people start the conversation by stating:

  • Data warehouses are obsolete
  • Any data pipeline or integration from multiple sources into a lake creates a swamp, resulting in almost no data quality
  • There is no need to worry about data types or organizing data; the lake does it all
  • Toss out your current physical and warehouse architectures
  • A lake must be the single source of an all-powerful unified data model

Anyone framing these questions as absolutes is leading you down the wrong path to an optimal solution.

Typically, the “vs.” argument occurs when a company or person has some form of technical investment in a particular design pattern. For example, a vendor or consultant will claim that certain analytics operations can, or must, happen in a cloud data warehouse, and that the “warehouse” is the one true platform for all types of analytics.

They will then frame these operations as a limitation and risk of a lake. What is an example of a limitation vendors will promote?

A vendor will say a lake is limited because it cannot scale compute resources efficiently on demand like a warehouse, or because it is nothing more than data storage. This is true but misleading.

Complaining about a lack of compute resources is like complaining that Tom Brady has never hit a home run as a football player. Since Tom Brady is a football player, would you expect him to be dropping dingers over the Green Monster at Fenway (well, maybe over the Pesky Pole)? No. Lakes don’t have compute resources because they are well abstracted from query services, by design!

Why are vendors and consultants applying data warehouse compute concepts here? Framing the fact that a lake does not have compute resources as a weakness is FUD. Someone is likely trying to promote a warehouse as the panacea for your data.

Lakes don’t scale compute resources because there are no compute resources to scale. We have written about this topic separately.

Separate compute resources are a core abstraction a modern data architecture embraces. This is why analytic engines like Google Cloud BigQuery Omni, Redshift Spectrum, Presto, and Athena exist.

Amazon Web Services Athena is an example. Athena is not a warehouse but an on-demand query engine based on the open-source Presto project that originated at Facebook.

As a service, Athena (and Presto) provides on-demand compute resources to query data. Like Athena, Amazon Redshift Spectrum can query data in a lake using resources separate from those in a Redshift cluster.

Google Cloud just released BigQuery Omni, further illustrating the point of abstracting compute resources from AWS or Microsoft Azure data lake storage.

By design, this model is supposed to be well abstracted from the services that consume the information residing within the lake. Whether you have an Amazon data lake, Oracle data lake, or Azure data lake, the model is similar. Lake data sets containing marketing, social media, and real-time sales feeds… will be accessed with a query engine like Athena or a “warehouse” like Azure Synapse Analytics, Redshift, BigQuery, or Snowflake. For example, see the Databricks Lakehouse best practices, which reinforce this pattern using the “lakehouse” model.

These services provide the compute resources for querying external resources, not the lake.
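
To make this abstraction concrete, here is a minimal sketch, using Python and boto3, of how an on-demand engine like Athena runs SQL directly against data sitting in S3-based lake storage. The database, table, and bucket names are hypothetical placeholders, not part of any specific product setup.

    import time
    import boto3

    # Hypothetical lake catalog database and results bucket; replace with your own.
    LAKE_DATABASE = "marketing_lake"
    RESULTS_LOCATION = "s3://example-lake-query-results/athena/"

    athena = boto3.client("athena", region_name="us-east-1")

    # Athena supplies the compute; the lake only supplies the storage being queried.
    query = "SELECT channel, SUM(revenue) AS revenue FROM orders_landing GROUP BY channel"

    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": LAKE_DATABASE},
        ResultConfiguration={"OutputLocation": RESULTS_LOCATION},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the on-demand query finishes, then print the result rows.
    state = "RUNNING"
    while state in ("QUEUED", "RUNNING"):
        time.sleep(2)
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]

    if state == "SUCCEEDED":
        results = athena.get_query_results(QueryExecutionId=query_id)
        for row in results["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])

Nothing in the lake itself scales up or down here; the query service brings compute to the data and releases it when the work is done.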

Rather than framing the dialogue as one or the other, the proper discussion for most enterprises is how the two can coexist for the benefit of business analysts, different business units, and business processes.

When someone argues that you need to choose one or the other, they likely have an agenda that aligns with their product offering or commercial partnership. This thinking suggests you forgo a lake and dump everything into a warehouse.

2. The “Productive” Advanced Data Platform Architecture

Vendors and consultants advocate that the warehouse is your new data lake. They actively promote a modern data architecture pattern that says a data warehouse offers faster insights and is a far more flexible solution with unlimited cloud-scale object storage (just like a lake).

Various vendors and consultants will suggest that schemas (or other physical and logical constructs) be used to denote the lifecycle of data from “raw data” to usable unstructured and structured data in a cloud warehouse. Any data maturity or data quality needed by the business will be done directly within the confines of the warehouse.

Redefining the Role Of A “Warehouse”

Traditionally, the role of a data warehouse is to reflect the source of truth for a business, not its transitory operational or transactional systems. A settled truth represents a collection of agreed-upon facts about the enterprise.

For example, a settled truth may provide authoritative facts about revenue, orders, “best customers,” and other domains.

However, in the dump everything in the warehouse model, the warehouse holds everything, including transient and volatile raw data. The suggested repackaging of all raw data into a warehouse looks more like an operational data store (ODS) or a data mart than a warehouse.

Can you dump everything into a warehouse? Yes. Just because you can do something technically does not make it the right architecture.

The suggestion for putting everything in a warehouse says a “truth” is simply a function of the data’s logical organization. Who defines that logical organization and disseminates it within an enterprise is glossed over, misunderstood, or, worse, ignored.

The warehouse rules-all approach is almost a textbook definition of a data swamp, often attributed to lakes. Treating a warehouse as a lake is a data-wrangling nightmare, not reflective of a best-of-breed architecture for insights efforts.

This model locks you into warehouse technology and an operational model. It embraces a mindset that now requires you to dump everything into the warehouse. If you like vendor lock-in, artificial constraints, reduced data literacy, and technical debt, this approach is undoubtedly for you.

The Role of a Lake

Done right, the basic architecture of a lake minimizes technical debt while accelerating an enterprise team’s data consumption. The lake can be an authoritative source of “landed,” enterprise-ready data.

Given an accelerating rate of change in the data warehouse, query engine, and data analytics market, minimizing risk and technical debt should be a core part of your strategy.

3. Hadoop

You will often find discussions and examples where lakes are synonymous with Apache Hadoop or Hadoop-related vendor technology stacks.

To be clear, the lake approach does not imply the use of a Hadoop data lake.

Promoting a Hadoop-centric model gives the impression that a lake is tightly bound to specific technologies. While Hadoop technologies are used in some lakes, they do not reflect a strategy and architecture. It is essential to recognize that, first and foremost, a lake should reflect an approach, strategy, and architecture, not a technology.

Pentaho co-founder and CTO James Dixon, who coined the term, said:

This situation is similar to the way that old school business intelligence and analytic applications were built. End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a data mart. This method works fine until you have a new question to ask. The approach solves this problem. You store all of the data in a lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the lake for new questions.

Hadoop, like any other technology, supports the enablement of a strategy and architecture. If you were building a lake today, you would have many non-Hadoop choices, even if those choices leverage Hadoop-related technologies under the covers.

For example, your lake may act as an accessible data repository supporting warehouse solutions such as Snowflake or query in place with AWS Athena, Redshift Spectrum, or BigQuery, all at the same time. Don’t think an on-premise or cloud data lake tightly binds you to Hadoop. If you follow a well-abstracted lake architecture, you minimize the risk of artificially limiting the opportunity they represent.

4. Lakes as Storage

In this scenario, a lake is just a place to store all your stuff. It is described as a super-sized hard drive on your laptop where you drop all your files into the “Untitled folder.” The impression this gives is that all you need to do is dump data in it and declare victory! Welcome to your new auto-magic data management model.

So is a lake simply a massive “cloud storage service” for all types of data and file formats? Not precisely, but storage is part of it.

Vendors frame lakes to be synonymous with storage. For example, Microsoft packages its product as Azure Data Lake Storage Gen2. Amazon says to create an AWS account, configure an Amazon S3 storage bucket, launch Amazon EMR, and you can use your favorite analytic service from there.
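
Taken at face value, that storage-first framing really is simple. Here is a minimal sketch, assuming boto3 and a hypothetical bucket name, of standing up the storage layer vendors are describing:

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    # Hypothetical bucket acting as the lake's storage layer.
    bucket = "example-company-data-lake"

    s3.create_bucket(Bucket=bucket)

    # Turn on default server-side encryption for everything landed in the lake.
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )

    # Versioning guards against accidental overwrites of landed files.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

A bucket like this is the beginning of a lake, not the whole of one, which is exactly the point of the next section.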

More Than Storage

Lakes provide storage, but a characterization that they are “just” storage is off the mark. Vendors are oversimplifying for marketing purposes. They are not just cloud storage services and are not mutually exclusive of a warehouse or other data and analytics stack aspects. Quite the opposite.

Framing Your Stack

Your lake should be viewed as a strategic element of a broader enterprise data stack, including the data acquisition process that fuels aggregation pipelines from source systems. A lake may be contributing to a settled truth in a downstream system like a warehouse or supporting data consumption in tools like Tableau or Oracle ETL.

Most lakes in nature are dynamic ecosystems, not static, closed systems. One job a lake can have is an active source of data that fuels a warehouse. However, the opposite is also true where specific warehouse workloads can be offloaded to reduce costs and improve efficiencies. A lake may play a role as a landing zone for real-time data or third-party data.

Packaged and structured correctly, a lake can deliver downstream value to those consuming data from it, including a warehouse.

We have a customer who uses a lake to perform quality-control analysis of tagging across dozens of websites and third-party properties. This allows them to identify possible gaps or implementation errors by the different teams responsible for that work.

We also have another customer who uses a lake to reconcile potentially inaccurate or duplicate multi-channel orders across various internal, third-party, and partner systems before delivering to an EDW.

Both of these examples highlight that lakes play a dynamic role in ensuring that the downstream settled truth meets enterprise expectations and norms.

As the folks at McKinsey said,

“…lakes ensure flexibility not just within technology stacks but also within business capabilities”

The data lake as a service model is about delivering business value, not just storage. We agree.

5. Effective Data Lake Strategy and Architecture

Linked to the “dump everything into the warehouse” approach, you will often hear that lakes don’t add value because only raw data is resident there. Their argument goes something like this: “If lakes only deal with raw data, then don’t bother with a lake, just dump all your data, raw or processed, into a warehouse”.

As we stated previously, this contradicts the fundamental premise that a warehouse is meant to reflect the business’s settled truth. A better historical comparison is not between a warehouse and a lake but between an ODS and a lake.

Historically, it was an ODS, not a warehouse, ingesting rough and volatile raw data from upstream data sources. An ODS typically held a narrow time horizon of data, maybe 90 days worth. The ODS also may have had a more limited focus, say for a particular domain of data.

On the other hand, a lake will often have no time constraints for data retention and deliver a broader scope. So are lakes just for raw data from a random assortment and variety of sources? No.

Data Ingestion and Curation

By design, lakes should have some level of curation for data ingress (i.e., data coming in). If you have no sense of your data ingress patterns, you likely have problems elsewhere in your technology stack. This is also true for a data warehouse or any data system: garbage in, garbage out.

Landing Zones

Best practices should embrace a model where you have a landing zone to optimize (or curate), however minimally, for downstream consumption. Consumption might be within an analytics tool like Tableau or Power BI. Still, it also might be an application that handles loading to a warehouse, such as Azure SQL database, Snowflake, Redshift, or BigQuery.
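
As a rough sketch of what “minimal curation” can mean in practice, a landing zone is often just a consistent, partitioned layout that downstream loaders and query engines can rely on. The bucket, dataset name, and prefixes below are hypothetical assumptions, not a prescribed standard:

    import json
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical lake bucket and dataset name.
    bucket = "example-company-data-lake"
    dataset = "adobe_events"

    def land_records(records):
        """Write a batch of raw records into a date-partitioned landing zone."""
        now = datetime.now(timezone.utc)
        key = (
            f"landing/{dataset}/"
            f"year={now:%Y}/month={now:%m}/day={now:%d}/"
            f"batch-{now:%H%M%S}.json"
        )
        body = "\n".join(json.dumps(r) for r in records)
        s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
        return key

    # Example: land two raw events; Athena, Redshift Spectrum, or a warehouse
    # loader can later consume the partitioned prefix without guessing the layout.
    print(land_records([{"event": "page_view"}, {"event": "purchase"}]))

Because the layout is predictable, a warehouse loader, a query engine, or a BI tool can all consume the same landed data without bespoke discovery work.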

We worked with a customer that would send Adobe event data to an AWS data lake to support an enterprise Oracle Cloud environment. Why AWS to Oracle? It was the most efficient and cost-effective data consumption pattern for the Oracle BI environment, especially considering the agility and economics of using an AWS lake and Athena as the on-demand query service.

By maximizing data’s effectiveness and efficiency, you minimize the downstream processing costs experienced by data consumers.

6. Go Big Or Go Home

If you spend any time reading materials on lakes, you would think there is only one type, and it would look like the Caspian Sea. Yes, the Caspian is a lake despite the “sea” in the name!

People describe lakes as massive, all-encompassing entities designed to hold all knowledge. There is only an enterprise big data lake or something synonymous with big data architecture. Unfortunately, the “big data” angle gives the impression that lakes are only for Caspian scale data endeavors. This certainly makes the lake concept intimidating.

As a result, describing things in such massive terms makes lakes inaccessible to those who could benefit from them. The data lake vs. big data overlay can make the use of a lake seem overwhelming.

Different Sizes, Different Use Cases

Lakes in nature come in all different shapes and sizes. So do data lakes. Each has a natural state, often reflecting ecosystems of data, just like those in nature reflect ecosystems of fish, birds, or other organisms.

Here are a few examples:

  • The Great Caspian: Just as the Caspian is a large body of water, this type of lake is a large, broad repository of semi-structured and unstructured data. This extensive collection of diverse data reflects information from across the enterprise.
  • Temporary “Ephemeral”: Just as deserts can have small, temporary lakes, an ephemeral data lake exists for a short period. They may be used for a project, pilot, PoC, or a point solution, and they are turned off almost as quickly as they are turned on.
  • Domain “Project”: These lakes, like ephemeral ones, are often focused on specific knowledge domains. However, unlike an ephemeral lake, this lake will persist over time. These may also be shallow, meaning they focus on a narrow domain of data such as media, social, web analytics, email, or similar data sources.

We have a customer that described their project as the “Tableau data lake”, which started off as a small pilot effort that scaled enterprise-wide.

By design, all types should embrace an abstraction that minimizes risk and affords you greater flexibility. They should also be structured for easy consumption, independent of size. What matters is an environment built for easy data consumption, not an artificial size delineation, whether it is used by a data scientist, a business user, or Python code.
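
As a small illustration of that “easy consumption” test, a well-structured lake should let a few lines of Python read a landed data set as readily as a BI tool does. A minimal sketch, assuming the data is stored as Parquet in S3 and that pandas with pyarrow and s3fs is installed; the path and column names are placeholders:

    import pandas as pd

    # Hypothetical partitioned Parquet data set in the lake's landing zone.
    path = "s3://example-company-data-lake/landing/orders/year=2019/month=06/"

    # pandas (via pyarrow/s3fs) reads straight from lake storage; no cluster,
    # warehouse load, or compute provisioning is required for a quick look.
    orders = pd.read_parquet(path)

    # "channel" and "revenue" are example column names.
    print(orders.head())
    print(orders.groupby("channel")["revenue"].sum())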

Whether your use case is artificial intelligence, machine learning, visualization, real-time analytics, feeding a warehouse, or a data mart, thinking differently about size may unlock new opportunities to employ these solutions.

7. No Security

The myth here is that lakes are an insecure collection of data objects, available to anyone in an organization who wants to take a dip and leave with whatever they want.

There is some truth that people rely on implicit technology solutions (i.e., automatic AWS S3 AES object encryption) rather than explicitly having an architecture and downstream use cases that govern security.

This can lead to security gaps. However, this can be said of many systems and is not unique to a lake per se. The notion that lakes are inherently insecure is not accurate.

Security can and should be a first-class citizen. Here are a few areas of consideration:

  • Access: It is not uncommon to have well-defined access policies for the underlying data. Within AWS, these would be defined in your IAM policies for S3 and related services (see the sketch after this list). In addition to AWS, Microsoft documents an Azure data lake architecture that describes similar security policies.
  • Tools: The tools and systems that consume data also offer a level of security. For example, there can be table- and column-level access control, depending on the query engine. Furthermore, data consumption tools such as Tableau or Power BI can set access controls on the lake’s data.
  • Encryption: Lakes often expect (or enforce) encryption in transit and at rest.
  • Partitioning: Lakes can have a level of logical and physical partitioning that further facilitates a security strategy. For example, teams may ETL data from a “raw” landing zone to another location so they can anonymize sensitive data for downstream consumption.
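
As referenced in the access bullet above, here is a minimal sketch of a scoped, read-only policy created with boto3. It limits an analyst role to a single curated prefix; the bucket, prefix, and policy name are hypothetical:

    import json

    import boto3

    iam = boto3.client("iam")

    # Hypothetical bucket and curated prefix that analysts may read.
    bucket = "example-company-data-lake"
    prefix = "curated/marketing/"

    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListCuratedPrefixOnly",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadCuratedObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }

    # Attachable, read-only policy for downstream data consumers.
    iam.create_policy(
        PolicyName="lake-curated-marketing-read",
        PolicyDocument=json.dumps(policy_document),
    )

Combined with default encryption and partitioned zones, policies like this make access an explicit design decision rather than an afterthought.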

One could argue the merits of these different strategies, but to say lakes are intrinsically insecure would be incorrect.

8. Nothing But Swamps

A critique of data lakes often devolves into accusations that they always end up as data swamps. Why? Because they are just storage, lack curation, have no data governance, no lifecycle or retention policies, and no metadata. There is no data quality or data discovery; it is just a fire swamp from hell.

In the extreme, there is a level of truth to this. If you treat a lake like a generic “Untitled folder” on your laptop where you dump files, yes, you will likely end up with a swamp. So this is a risk. However, anyone going down the path of dumping files this way is not seriously invested in being successful, and that, more than anything, is what separates success from failure.

Understanding Data Swamps

So what are the actual data swamps? These “so-called data lakes” are a reason customers do not achieve positive outcomes; they drive the cost of ownership up and are antithetical to successful data lakes.

A lake becomes a swamp not so much from a lack of curation, management, lifecycle policies, and metadata as from the sprawling ecosystem of tools, roles, responsibilities, and systems bolted on, ostensibly to prevent exactly that from happening.

“Dumping” files is an issue, but the crushing weight of ancillary people, processes, and technologies that are placed in and around a lake is the true villain in your swamp story.

If you thought your enterprise data warehouse was a slog, your lake would start to look very familiar. This is the data swamp you need to be very wary of. It is expensive, time-consuming, and will fail to meet anyone’s expectations. Sound familiar?

Avoiding Swamps

To those planning or who have deployed a data lake, be cautious of role and feature creep. It is not uncommon to see vendors (pushed by customers?) pull forward features and functionality found in a traditional warehouse or other ETL products to be an “in-lake” capability.

Yes, it is technically possible for you to perform complex in-lake data processing. However, you may already have workflows, tools, people, and technology that perform these functions outside of the lake.

Not all data activities may be appropriate, given your context. Think long and hard about the risks of cascading complexity these choices represent.

Clean, Consistent, Lake Formation

Part of the beauty of lake formation architecture is a level of simplicity, agility, and flexibility. When significant business logic and process occur in-lake, you run the risk of creating a solution that lacks clarity, is not responsive to change, and is overly rigid by design.

Be cognizant if a current or planned data lake starts to look more like an amalgamation of traditional ETL tools and data warehouses. If you have suffered through an overly complicated EDW effort, this will be easy to spot.

Everyone Is A Critic

Unfortunately, critiques devolve into broad statements of lakes “not being successful” or “data lakes equal swamp,” or they are too tightly linked to specific technologies like Hadoop. They may also complain that the semantic definition of a lake is overly opaque and changing.

Criticism is a necessary part of growth with any technology. However, a key to growth is taking a step back to develop perspective. In doing so, these criticisms are not unique.

These critiques can apply to just about any technology endeavor generally and to data projects specifically. For example, the term “data warehouse” suffers from the same opaque and shifting definition as “data lake.” Search Google for “failed data warehouse,” and you will find stories about projects that were not successful.

Does this mean that we should forgo the phrase data warehouse or stop the pursuit of those projects? No.

Too often, consultants or vendors that deride lakes provide products and services which offer the magic elixir for implementation. It is odd that a consultant or vendor who does not believe in the model would turn around and engage in a solution that they don’t think has merit.

Entrusting any work to these very same consultants or vendors may be why lake initiatives are not successful.

Build “Manageable” Momentum

Start small and be agile in pilot projects. Here are a few tips as you think about how to get your data lake implementation rolling:

  • Focus: Seek opportunities where you can deploy an “Ephemeral” or “Project” solution. Reduce risk, and overcome technical and organizational challenges so your team can build confidence with lakes. Hone in on the types of insights, critical business outcomes, and business applications that deliver business value from a delta lake, data lakehouse architecture or similar.
  • Passion: Make sure you have an “evangelist” or “advocate” internally, someone passionate about the solution and its adoption who can promote it to business leaders. It would be best if you had a person or team who is excited about the business impact. Get someone experienced using the lake with a tool like Tableau to demonstrate value. If not, you will find your lake is just as productive as a gym membership four weeks after the New Year.
  • Simple: Embrace simplicity and agility, and put people, process, and technology choices through this lens. The lack of complexity should not be seen as a deficiency but a byproduct of thoughtful design. For example, getting a baseline Azure data lake analytics environment ready for testing can be done quickly, without unnecessary infrastructure costs or adverse impacts on any legacy corporate applications.
  • Narrow: Keep the scope tight and well-defined by limiting your lake to data you understand, say exports from ERP, CRM, point-of-sale, marketing, or advertising systems. Data literacy at this stage will help you understand the workflow around data structure, lower operational costs, business-critical outcomes, critical analytics workloads, and testing.
  • Experiment: Pair your solution with modern BI and analytics tools like Tableau, Power BI, Amazon Quicksight, or Looker. Allow non-technical users an opportunity to experiment and explore data access via a lake. Engage a different user base that can assess performance bottlenecks, and discover opportunities for improvements, possible linkages to existing EDW systems (or other data systems), and additional candidate data sources. Allow discovering data lake tools that make sense for your team and where best to invest resources into data lake automation.

Being a successful early adopter means taking a business-value approach rather than chasing a technology outcome. Focusing on the business value a lake affords, rather than industry analysts’ talking points, provides an opportunity to frame your efforts in the context of holistic data and analytics strategies. Increasing velocity helps you achieve your data lake goals and measure progress in business performance.

About Openbridge: Make data work for you

Unify data in a trusted, private, industry-leading data lake or cloud warehouses like Amazon Redshift, Amazon Redshift Spectrum, Google BigQuery, Snowflake, Azure Data Lake, Ahana, and Amazon Athena. Data is always wholly owned by you.

Take control, and put your data to work with your favorite analytics tools. Explore, analyze, and visualize data to deliver faster innovation while avoiding vendor lock-in using tools like Google Data Studio, Tableau, Microsoft Power BI, Looker, Amazon QuickSight, SAP, Alteryx, dbt, Azure Data Factory, Qlik Sense, and many others.
