Data Lake Definition: Velocity, Agility, and Openness By Design

Defining data lakes in terms of velocity, agility, and openness delivers successful business outcomes

Published in Openbridge, Aug 18, 2020 (3 min read)

Data lake definitions take many shapes, mainly because different vendors promote definitions that align with their own product offerings.

With so many competing definitions, even a simple question like “what is a data lake?” can cause confusion. A baseline definition helps create a shared vocabulary for conversations that are otherwise overly technical, abstract, and vendor-driven.

Define “Data Lake”

Rather than rely on an AWS, Google, or Azure data lake definition, here are a few essentials to set some baselines. Pentaho co-founder and CTO James Dixon framed it this way:

This situation is similar to the way that old school business intelligence and analytic applications were built. End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a data mart.
This method works fine until you have a new question to ask. The Data Lake approach solves this problem. You store all of the data in a lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the lake for new questions.

The statement lays out a broad philosophy of the role of a data lake within a stack.
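
To make the pattern in the quote concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Dixon's implementation or any vendor's API: the function names, the JSON Lines layout, and the sample orders are all hypothetical. Raw events land in the lake untouched, a data mart is derived for a question known up front, and a new question is answered directly against the raw data.

```python
import json
from collections import defaultdict
from pathlib import Path
from tempfile import TemporaryDirectory


def land_raw_events(lake_dir, events):
    """Append every event, unmodified, to a JSON Lines file in the lake."""
    path = Path(lake_dir) / "orders.jsonl"
    with path.open("a", encoding="utf-8") as handle:
        for event in events:
            handle.write(json.dumps(event) + "\n")
    return path


def read_raw_events(path):
    """Read the raw events back exactly as they were landed."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def build_revenue_mart(raw_events):
    """Data mart for a question known up front: revenue per country."""
    totals = defaultdict(float)
    for event in raw_events:
        totals[event["country"]] += event["amount"]
    return dict(totals)


def ad_hoc_query(raw_events, predicate):
    """Answer a new question directly against the raw data."""
    return [event for event in raw_events if predicate(event)]


# Made-up sample data, for illustration only.
sample_events = [
    {"order_id": 1, "country": "US", "amount": 10.0, "device": "mobile"},
    {"order_id": 2, "country": "US", "amount": 20.0, "device": "desktop"},
    {"order_id": 3, "country": "DE", "amount": 5.0, "device": "mobile"},
]

# A temporary directory stands in for the lake's storage layer.
with TemporaryDirectory() as lake_dir:
    raw_path = land_raw_events(lake_dir, sample_events)
    raw_events = read_raw_events(raw_path)

# Traditional need: a mart built for a predefined question.
revenue_by_country = build_revenue_mart(raw_events)

# New question: the mart never kept "device", but the raw data still has it.
mobile_orders = ad_hoc_query(raw_events, lambda e: e.get("device") == "mobile")
```

The mart keeps only the attributes its question needed, so the device question could not be answered from it. Because the raw events were stored in full, the new question needs no change to ingestion.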

A fundamental tenet of this definition is an architecture that promotes open, agile access to the data in a lake. Open access is the foundation of an agile data lake pattern.

For data science or business users, accessibility means they can run many types of processes on the contents of a lake as needed, including visualization, real-time analytics, transformation, and machine learning. For example, they can connect tools such as Tableau, Microsoft Power BI, Looker, Amazon QuickSight, and many others to explore trends or develop insights.

A data lake definition must transcend AWS, Google, or Azure marketing definitions. For example, if you focus your architecture on a data lake in Google Cloud without an overall architectural pattern or philosophy, you may unknowingly create an outcome tightly coupled to vendor-specific technologies and tools. A vendor-driven definition may preclude you from using certain tools or technologies or, in the worst case, prevent you from moving your lake to another cloud.

This is not a dig at Google, Azure, AWS, or anyone else; it merely reflects a focus on establishing a vendor-agnostic philosophy that favors open, agile data lakes.

This is not to say you should not invest time in learning the semantics that AWS, Google, or Azure data lake approaches entail. However, it is essential to be grounded in a foundational understanding of what an open, agile, integrated data lake looks like and its role within a data stack.
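
One way to picture the vendor-agnostic idea is a thin storage interface that the rest of the stack depends on, with each backend hidden behind it. The sketch below is hypothetical throughout: `ObjectStore`, `InMemoryStore`, `LocalDirStore`, and `copy_lake` are illustrative names, not part of any cloud SDK, and a real lake would put an S3, Google Cloud Storage, or Azure backend behind the same interface.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from tempfile import TemporaryDirectory


class ObjectStore(ABC):
    """Hypothetical minimal interface that lake code depends on."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        ...

    @abstractmethod
    def get(self, key: str) -> bytes:
        ...

    @abstractmethod
    def list_keys(self) -> list:
        ...


class InMemoryStore(ObjectStore):
    """Stand-in for one storage backend."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list_keys(self):
        return sorted(self._objects)


class LocalDirStore(ObjectStore):
    """Stand-in for a second backend, backed by a local directory."""

    def __init__(self, root):
        self._root = Path(root)

    def put(self, key, data):
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        return (self._root / key).read_bytes()

    def list_keys(self):
        files = (p for p in self._root.rglob("*") if p.is_file())
        return sorted(p.relative_to(self._root).as_posix() for p in files)


def copy_lake(source: ObjectStore, target: ObjectStore) -> int:
    """Move a lake between backends using only the shared interface."""
    keys = source.list_keys()
    for key in keys:
        target.put(key, source.get(key))
    return len(keys)


# Made-up objects, for illustration only.
source = InMemoryStore()
source.put("raw/orders/2020-08-18.csv", b"order_id,amount\n1,10.0\n")
source.put("raw/events/2020-08-18.jsonl", b'{"event": "click"}\n')

with TemporaryDirectory() as target_dir:
    target = LocalDirStore(target_dir)
    copied = copy_lake(source, target)
    target_keys = target.list_keys()
    round_trip = target.get("raw/orders/2020-08-18.csv")
```

Because `copy_lake` only touches the interface, moving between backends is a loop over keys rather than a rewrite. The same reasoning applies to file formats: open formats keep the stored bytes readable by whichever tool or cloud comes next.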

Zero admin, code-free data ingestion pipelines to Azure or Amazon data lakes

Do you need to pipeline Amazon Seller Central data, or marketing data from Instagram, Facebook, Amazon Advertising, or Google Ads, to your data lake?

Are you looking to extend your data lake with batch exports from internal systems? Collect event data from webhooks?

The Openbridge lake formation service offers business or enterprise data teams the ability to harness more data, from more sources, in less time and at a lower cost.

References

4 Steps To Create a Serverless Analytics Stack with Tableau and Amazon Athena

Combining the simplicity of Tableau, a data lake, and the power of Athena can deliver a cost-efficient, high-performance…

blog.openbridge.com

How To Create A Serverless, Zero Infrastructure, Zero Administration Data Lake With Amazon S3…

We are going to describe a serverless solution stack of Amazon S3, Apache Parquet, and Amazon Athena for your data lake

blog.openbridge.com

Adobe Data Feeds: How to use a data lake and Amazon Athena for analytic insights

4 Steps to configure your Adobe Data Feeds for a data lake using Amazon S3 and Amazon Athena

blog.openbridge.com

Cloud Data Lake Best Practices: Data Lake vs. Data Warehouse

Why an open, flexible, and agile data lake architecture makes a difference between success or swamp

blog.openbridge.com

Written by Thomas Spicer