This article provides an introductory overview of data lakes. We’ll begin with a brief history, highlight some benefits of a data lake, and define some terms you may hear in various information technology (IT) circles. By the end of this article, you should have a solid understanding of data lakes and why people are embracing this approach to creating a more data-driven culture within their organizations.
History of Data Lakes
Since the topic of this article is data lakes, let’s define that term first and then provide a brief history. (We’ll define more data lake-related terms later in this article, once we understand the topic at hand.) Gartner defines “data lake” as follows:
“A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse).”
As a follow-up to that textbook definition, here’s an analogy from Pentaho’s Chief Technology Officer (CTO) James Dixon (who is credited with coining the term “data lake” as far back as 2010) that provides a more colloquial explanation (Marr, 2018):
“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
Data lakes originated from the need to build a more comprehensive repository capable of meeting many data requirements across a much broader user community, instead of focusing only on the end state of cleansed data feeding enterprise reports and dashboards. Traditionally, an operational data store (ODS), or at least a staging schema in the enterprise data warehouse (EDW), was used to stage data for the EDW. A handful of advanced users might query the ODS (or staging schema) for raw data, but most users access the EDW through an end-user layer (e.g., Business Objects, MicroStrategy, Cognos, OBIEE) to run ad hoc queries and reports/dashboards against transformed data that’s considered clean and ready for consumption.
In contrast to data lakes, an ODS and EDW are purpose-built solutions focused on extracting, transforming, and loading (ETL) data from source systems into data marts, with little to no concern about anything outside of that objective. Although that approach works and is a valuable information delivery mechanism for well-defined reporting requirements, a lot of data needs go unmet, leaving user communities to their own devices in trying to meet those needs outside of the formal data ecosystem. Those needs are usually addressed with low-level, localized (desktop) solutions using primitive tools that tend to be time-consuming, error-prone, and unscalable.
Enter data lakes! A data lake, unlike a purpose-built ODS/EDW, provides an opportunity to make more data available to more user personas, empowering an organization to extract more value throughout the entire data lifecycle. Novices can access fully cleansed data in ready-to-use artifacts (using canned reports), power users can access slightly transformed data to examine information (using SQL), and data scientists can experiment with raw data to discover hidden value (using statistics).
Benefits of a Data Lake
There are many benefits to building a data lake, but the biggest benefit, by far, is the ability to serve data-hungry user communities (and systems) with a single, integrated, trusted, reliable, and well-managed data repository composed of all types of data. This data repository can service enterprise systems (including an EDW), data scientists, subject matter experts, power users, casual users, analysts, managers, executives, external consumers (via secure APIs), and others.
Other notable benefits leading to the pursuit of a data lake include the following:
- Integrating new data sources without waiting for lengthy release cycles that need to curate requirements, design systems, develop scripts/routines, test code, and deploy fully baked solutions that may or may not prove to be valuable in the end.
- Reducing redundant data stores by providing a comprehensive data platform that can accommodate a much wider range of requirements, not just structured/cleansed data.
- Accommodating all types of data by allowing unstructured and semi-structured data to co-exist with structured data for more complete and comprehensive analyses.
- Implementing big data processes to handle data volume, velocity, and variety in a much more effective manner. (And yes, those are the original V’s of big data that have plagued data communities for years.)
- Executing faster test-and-learn cycles by accessing raw data, performing analyses, creating statistical models, and iterating to separate signal from noise much earlier in the analytical process.
- Improving agility so the organization can operate at the speed of business, delivering what’s needed when it’s needed through constant adaptation.
- Expanding user adoption through a service-oriented architecture designed to meet the needs of an organization with information requirements spanning everything from raw unstructured data to fully cleansed structured data.
Data Lake Terms
As with other technology trends over the years, metaphors are used to help describe data lakes. There are a variety of terms and definitions in use (some more than others), but we’ll focus on the more common terms and widely accepted definitions in the industry.
Data Ponds (or Puddles) are small subsets of the data lake focused on a specific domain or system’s dataset (Quay, n.d.). As the name implies, ponds (puddles) are much smaller than a lake and provide more meaningful boundaries that can better facilitate data management and security.
Data Swamps are failed attempts to create a data lake, resulting in data that is unorganized, mismanaged, and simply a mess (Feldman, 2015). Data swamps generally result from an imbalance of functional, technical, and political priorities. For example, if all of the time and energy is spent pumping data into the data lake from anywhere with reckless abandon, without the proper controls and documentation, then the lake will inevitably become a swamp.
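One way to picture the “proper controls and documentation” mentioned above is a simple gate at ingestion time. The following is only a minimal sketch: the required field names (`owner`, `description`, `source_system`) and the functions `missing_metadata` and `admit_to_lake` are hypothetical and are not taken from any of the cited sources or from a specific product.

```python
# Hypothetical minimum documentation a dataset must carry before ingestion.
REQUIRED_FIELDS = ("owner", "description", "source_system")


def missing_metadata(metadata: dict) -> list:
    """Return the required fields that are absent or blank."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not (metadata.get(field) or "").strip()
    ]


def admit_to_lake(name: str, metadata: dict) -> bool:
    """Admit a dataset only when it is documented; otherwise reject it."""
    missing = missing_metadata(metadata)
    if missing:
        raise ValueError(f"{name} rejected; missing metadata: {', '.join(missing)}")
    return True


documented = {
    "owner": "finance-team",
    "description": "Daily general ledger postings",
    "source_system": "erp",
}
undocumented = {"owner": "finance-team"}

print(missing_metadata(documented))
print(missing_metadata(undocumented))
```

A real lake would likely enforce rules like this in its ingestion tooling or data catalog; the point here is simply that undocumented data is stopped at the shoreline rather than cleaned up later.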
Data Reservoirs are successful implementations where clean data is well-organized, properly managed, harmonized, and documented, leading to a valuable data trove (Kaptain, 2015). One could argue that a data lake and data reservoir can peacefully co-exist where both serve a valuable purpose; the data lake, as a whole, provides all of the data, whereas the data reservoir is a special area reserved for the cleanest of data.
Data Graveyards (or Junkyards) are giant repositories of unused data (Branch, 2017). In a world where storage is inexpensive, some organizations tend to store everything just because they can. But that approach incurs more overhead and maintenance, and the unused data remains in the data lake, polluting it with dead (or junk) data. Additionally, unused data creates confusion and uncertainty about what to use when, if at all.
Data Streams consist of real-time data in motion that is collected in transit, analyzed, and reduced to important data that is committed to storage (Shacklett, 2014). Streams focus on extracting value from data as it comes in, storing what’s useful and discarding what’s not. This approach not only shifts focus from storage resources to computing resources but also ensures that only valuable data is permitted to flow into the lake.
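To make the idea of unused data a bit more concrete, here is a small sketch of how a team might flag graveyard candidates. It assumes a catalog that records when each dataset was last accessed; the `DatasetRecord` class, the `find_stale_datasets` function, the sample dataset names, and the 365-day threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalog entry; a real catalog would hold more metadata."""
    name: str
    last_accessed: date


def find_stale_datasets(catalog, today, max_idle_days=365):
    """Return the names of datasets not accessed within max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return [record.name for record in catalog if record.last_accessed < cutoff]


catalog = [
    DatasetRecord("sales_orders", date(2024, 5, 1)),
    DatasetRecord("legacy_crm_export", date(2021, 3, 15)),
    DatasetRecord("web_clickstream", date(2024, 6, 10)),
]

stale = find_stale_datasets(catalog, today=date(2024, 6, 30))
print(stale)  # ['legacy_crm_export']
```

Flagging is deliberately separate from deleting: a stale dataset may still need review by its owner before it is archived or removed.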
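The collect, analyze, and reduce pattern described above can be sketched with a plain Python generator. This is an illustration under stated assumptions rather than a streaming framework: the sensor events, the `temperature_c` field, the 80.0 threshold, and the `reduce_stream` function are all made up for the example.

```python
from typing import Iterable, Iterator


def reduce_stream(events: Iterable[dict], threshold: float) -> Iterator[dict]:
    """Yield only the events worth storing, trimmed to the fields we keep."""
    for event in events:
        # Analyze in transit: readings below the threshold are discarded.
        if event["temperature_c"] >= threshold:
            yield {
                "sensor": event["sensor"],
                "temperature_c": event["temperature_c"],
            }


# Hypothetical events arriving in real time.
incoming = [
    {"sensor": "a1", "temperature_c": 21.5, "debug": "ok"},
    {"sensor": "a2", "temperature_c": 88.0, "debug": "ok"},
    {"sensor": "a1", "temperature_c": 22.1, "debug": "ok"},
    {"sensor": "a3", "temperature_c": 91.3, "debug": "fan stalled"},
]

# Only the reduced events would be committed to storage in the lake.
stored = list(reduce_stream(incoming, threshold=80.0))
print(len(incoming), "events received,", len(stored), "stored")
```

Because the generator handles one event at a time, the work happens in compute as data passes through, and only the reduced result ever needs to be stored.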
Zones are logical or physical separations of data that keep the lake organized and secure. Although zones may vary to meet the needs of an organization, here are some common zones used in a data lake as described by Patel, Wood, and Diaz (2017):
- Transient Zone contains temporary copies or short-lived data prior to data ingestion/storage.
- Raw Zone contains data as-is, untouched in any way.
- Trusted Zone contains high-quality data that is considered the “source of truth” and can be used for downstream systems.
- Refined Zone contains data that has been enriched through manipulation.
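As a rough illustration of zones as logical separations, the sketch below maps each zone to its own path prefix. The `/lake` root, the `zone/source/dataset` layout, and the linear promotion order are assumptions made for this example; the zone names come from the list above, and organizations may arrange or promote data differently.

```python
from enum import Enum
from pathlib import PurePosixPath


class Zone(Enum):
    TRANSIENT = "transient"
    RAW = "raw"
    TRUSTED = "trusted"
    REFINED = "refined"


# Assumed order in which data is promoted; not prescribed by the source.
PROMOTION_ORDER = [Zone.TRANSIENT, Zone.RAW, Zone.TRUSTED, Zone.REFINED]


def zone_path(zone: Zone, source: str, dataset: str) -> PurePosixPath:
    """Build a storage path that keeps each zone physically separate."""
    return PurePosixPath("/lake") / zone.value / source / dataset


def next_zone(zone: Zone):
    """Return the zone that follows in the assumed order, or None at the end."""
    index = PROMOTION_ORDER.index(zone)
    if index + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[index + 1]
    return None


print(zone_path(Zone.RAW, "crm", "customers"))  # /lake/raw/crm/customers
print(next_zone(Zone.RAW))                      # Zone.TRUSTED
```

Keeping the zone in the path makes it straightforward to apply different access rules per zone, for example restricting the raw zone to data scientists while opening the refined zone to a wider audience.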
Some people consider data lakes to be data warehousing 2.0, but in reality, data lakes are a next-generation data platform extending well beyond, and separate from, data warehousing. Data warehouses will continue to exist and have a place in the enterprise, mainly for structured data requirements; however, it behooves an organization to strongly consider building a data lake to increase its analytical maturity. Users need access to raw data, purified data, and everything in between. Data lakes meet those needs in a more controlled and managed fashion across the entire data spectrum.
References
Branch, M. (June 7, 2017). “Do You Have a Big Data Graveyard?” Geotab. Retrieved from https://www.geotab.com/blog/big-data-graveyard
Feldman, N. (July 28, 2015). “Data Lake or Data Swamp?” NVISIA. Retrieved from https://www.nvisia.com/insights/data-swamp
Gartner. (n.d.) “Data Lake.” In Gartner IT Glossary online. Retrieved from https://www.gartner.com/it-glossary/data-lake
Kaptain. (December 12, 2015). “Data Reservoir and Data Lakes.” WisdomSchema. Retrieved from https://wisdomschema.com/2015/12/data-reservoir-and-data-lakes
Marr, B. (August 27, 2018). “What Is A Data Lake? A Super-Simple Explanation for Anyone.” Forbes. Retrieved from https://www.forbes.com/sites/bernardmarr/2018/08/27/what-is-a-data-lake-a-super-simple-explanation-for-anyone/#5097e45f76e0
Patel, P., Wood, G., & Diaz, A. (April 25, 2017). “Data Lake Governance Best Practices.” DZone. Retrieved from https://dzone.com/articles/data-lake-governance-best-practices
Quay Consulting News. (n.d.) “Big Data, Data Lakes and Data Ponds: A quick reference guide.” Quay Consulting. Retrieved from https://www.quayconsulting.com.au/news/big-data-data-ponds-and-data-lakes-a-quick-reference-guide
Shacklett, M. (July 22, 2014). “Data lakes vs. data streams: Know the difference to save on storage costs.” TechRepublic. Retrieved from https://www.techrepublic.com/article/data-lakes-vs-data-streams-know-the-difference-to-save-on-storage-costs
Written By: Mark DeRosa, Director of Data Analytics and Modernization