Data Lake Definition: Velocity, Agility, and Openness By Design

Defining data lakes in terms of velocity, agility, and openness delivers successful business outcomes

Thomas Spicer
3 min read, Aug 18, 2020


Data lake definitions can take many shapes, mainly because different vendors promote definitions that align with their own product offerings.

With so many definitions in circulation, confusion is common when people ask, “what is a data lake?” Building a baseline definition helps create a shared vocabulary for conversations that are otherwise overly technical, abstract, and vendor-driven.

Define “Data Lake”

Rather than rely on an AWS, Google, or Azure data lake definition, here are a few essentials to set some baselines. Pentaho co-founder and CTO James Dixon framed it this way:

“This situation is similar to the way that old school business intelligence and analytic applications were built. End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a data mart.

“This method works fine until you have a new question to ask. The Data Lake approach solves this problem. You store all of the data in a lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the lake for new questions.”

The statement lays out a broad philosophy of the role of a data lake within a stack.
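
As a rough illustration of the pattern Dixon describes, the sketch below stores every raw record in a "lake" (here just a temporary local directory), builds a narrow data mart for a question known in advance, and then answers a new question directly from the raw data. The event fields, file name, and function names are all hypothetical, chosen only for this example; a real lake would sit on object storage and use a query engine rather than plain Python loops.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical raw events; the field names are illustrative only.
RAW_EVENTS = [
    {"order_id": 1, "region": "east", "amount": 20.0, "device": "mobile"},
    {"order_id": 2, "region": "west", "amount": 35.0, "device": "desktop"},
    {"order_id": 3, "region": "east", "amount": 15.0, "device": "mobile"},
]


def write_to_lake(lake_dir, events):
    """Store every event in full as JSON lines; no attributes are dropped."""
    path = Path(lake_dir) / "orders.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    return path


def read_lake(path):
    """Read all raw events back from the lake file."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def build_revenue_mart(path):
    """Data mart for a known question: total revenue by region."""
    totals = defaultdict(float)
    for event in read_lake(path):
        totals[event["region"]] += event["amount"]
    return dict(totals)


def adhoc_orders_by(path, attribute):
    """Ad-hoc query on the raw data: count orders by any stored attribute."""
    counts = defaultdict(int)
    for event in read_lake(path):
        counts[event[attribute]] += 1
    return dict(counts)


def run_example():
    with tempfile.TemporaryDirectory() as lake_dir:
        path = write_to_lake(lake_dir, RAW_EVENTS)
        mart = build_revenue_mart(path)            # the traditional need
        by_device = adhoc_orders_by(path, "device")  # the new question
    return mart, by_device
```

The revenue mart never sees the `device` attribute, so a mart-only design could not answer "how many orders came from each device?" without reloading data. Because the lake keeps the raw records, the new question needs no change to ingestion.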

A fundamental tenet of this data lake definition is an architecture that promotes open, agile access to the data in a lake. Open access is the foundation of an agile data lake pattern.

For data science or business users, accessibility means they can run many types of processes on the contents of a lake as needed, including visualization, real-time analytics, transformation, and machine learning. For example, they can connect a wide variety of tools, such as Tableau, Microsoft Power BI, Looker, and Amazon QuickSight, to explore trends or develop insights.

A data lake definition must transcend AWS, Google, or Azure marketing definitions. For example, if you focus your architecture on a data lake in Google Cloud without an overall architectural pattern or philosophy, you may unknowingly create an outcome tightly coupled to vendor-specific technologies and tools. A vendor-driven definition may preclude you from leveraging certain types of tools or technologies or, in the worst case, prevent you from moving your lake to other clouds.
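
One way to picture the coupling risk is a minimal sketch in which pipeline code depends only on a small storage interface rather than on any vendor SDK. Everything here is an assumption made for illustration: the `ObjectStore` interface, the two stand-in backends, and the key layout are hypothetical and do not describe any particular cloud or product.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal storage interface the pipeline depends on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class LocalStore:
    """Stand-in backend that writes objects under a local directory."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, key, data):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        return (self.root / key).read_bytes()


class InMemoryStore:
    """Second stand-in backend, used to show that backends are swappable."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


def land_csv(store: ObjectStore, key: str, rows) -> None:
    """Pipeline step written against the interface, not a vendor SDK."""
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    store.put(key, body.encode("utf-8"))


def count_rows(store: ObjectStore, key: str) -> int:
    """Count the lines (header included) of a stored CSV object."""
    return len(store.get(key).decode("utf-8").splitlines())


def run_portability_check():
    rows = [("order_id", "amount"), (1, 20.0), (2, 35.0)]
    key = "raw/orders/2020-08-18.csv"
    results = []
    with tempfile.TemporaryDirectory() as root:
        # The same pipeline functions run unchanged against both backends.
        for store in (LocalStore(root), InMemoryStore()):
            land_csv(store, key, rows)
            results.append(count_rows(store, key))
    return results
```

A cloud-backed adapter would implement the same two methods. It is left out here because its details would depend on a specific vendor SDK, which is exactly the dependency the interface is meant to isolate.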

This is not a dig at Google, Azure, AWS, or anyone else; it merely reflects a focus on establishing a vendor-agnostic philosophy that favors open, agile data lakes.

This is not to say you should not invest time in learning the semantics of the AWS, Google, or Azure data lake approaches. However, it is essential to be grounded in a foundational understanding of what an open, agile, integrated data lake looks like and its role within a data stack.

Zero admin, code-free data ingestion pipelines to Azure or Amazon data lakes

Do you need to pipeline Amazon Seller Central data, or marketing data from Instagram, Facebook, Amazon Advertising, or Google Ads, to your data lake?

Are you looking to extend your data lake with batch exports from internal systems? Collect event data from webhooks?

The Openbridge lake formation service offers business or enterprise data teams the ability to harness more data, from more sources, in less time and at a lower cost.
