If you’ve been in the data world for a while, you’ve probably come across some mysterious terms: Data Lake, Data Warehouse, Data Lakehouse, and Data Mesh. Would you like to untangle this jargon and get a clearer understanding of what these concepts actually mean? Or maybe you already know them inside out but still wonder how to explain them to someone who’s not familiar with the field?
Whatever your situation, this article should pique your interest. I’ll walk you through clear and concise definitions of Data Lake, Data Warehouse (and Data Mart), Data Lakehouse and Data Mesh. To make these concepts easier to understand and explain later, I’ll also share a simple and relatable analogy for each one.
Get ready to dive into the fascinating depths of the data world. Let’s go!
Data Lake
A Data Lake is a space where a company stores all of its structured and unstructured data. The goal is to centralize data from the information system quickly and efficiently, which makes the data easier to explore and combine and reduces the time between a business need and its implementation.
The Data Lake is closely linked to the Big Data world, as it typically handles large volumes of data (we’re talking terabytes and petabytes), wide variety (structured, semi-structured, unstructured), and sometimes high velocity (streaming, IoT). Data is stored in its raw form as well as in a prepared version (with normalization, enrichment, etc.) to make it easier to use. Often, enriched data is pre-processed in the Data Lake and then sent to the Data Warehouse for easier access and usage.
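To make the raw and prepared versions concrete, here is a minimal sketch that uses a local folder as a stand-in for the lake. The zone names (raw, prepared), the sensor fields, and both helper functions are assumptions made up for this illustration; a real Data Lake would normally sit on object storage or a distributed file system.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake: Path, source: str, records: list[dict]) -> Path:
    """Store records exactly as received, one JSON object per line."""
    target = lake / "raw" / source / "events.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return target


def prepare(lake: Path, source: str) -> Path:
    """Write a normalized copy in the prepared zone; the raw file is left untouched."""
    raw_file = lake / "raw" / source / "events.jsonl"
    target = lake / "prepared" / source / "events.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with raw_file.open(encoding="utf-8") as src, target.open("w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            # Normalization: one key name for the sensor, a numeric temperature.
            sensor = record.get("sensor") or record.get("sensor_id")
            clean = {
                "sensor_id": str(sensor).strip().lower(),
                "temperature_c": float(record["temp"]),
            }
            dst.write(json.dumps(clean) + "\n")
    return target


# Hypothetical IoT readings arriving with inconsistent keys and types.
lake_root = Path(tempfile.mkdtemp())
raw_path = land_raw(
    lake_root,
    "iot",
    [{"sensor": " A1 ", "temp": "21.5"}, {"sensor_id": "b2", "temp": 19}],
)
prepared_path = prepare(lake_root, "iot")
prepared_rows = [
    json.loads(line)
    for line in prepared_path.read_text(encoding="utf-8").splitlines()
]
```

Keeping the raw file alongside the prepared one means the normalization rules can be changed and replayed later without asking the source system for the data again.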
💡 Analogy with Lake Annecy
Imagine a large lake surrounded by mountains, like Lake Annecy. The mountains represent your various applications such as IoT sensors, operational databases, CRM systems, or web apps. From these mountains flow several rivers like the Eau-Morte, the Ire, and the Laudon. Each river represents a stream of data from one of these applications, feeding into your Data Lake.
Some rivers carry low volumes of data, others carry large volumes. Some flow calmly in batches, others rush in real time as streaming data. These rivers bring fish, representing structured data, but they also carry debris and leftover materials, representing unstructured data.
Now imagine you are diving into this lake, exploring its depths to find beautiful fish or hidden treasures. Or think of yourself as a fisherman, examining each catch, evaluating its size and species. Depending on its quality, you may release it or keep it for consumption.
This is how a Data Lake works. It gathers raw data of all types, allowing you to explore, clean, and process what’s valuable for your business.
Data Warehouse
It is a centralized space where a company stores all of its structured data for analysis and reporting purposes. It provides a view of the current state (more or less in real time), as well as a historical perspective on certain datasets. The data is organized based on specific business use cases, which helps optimize the queries run against it.
The Data Warehouse can be fed from a Data Lake or directly from various systems within the information system, with preparation and transformation work required beforehand.
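As a rough illustration of data organized around a business use case, the sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse engine. The table names, columns, and sample rows are invented for the example, and the index is simply one plausible choice for a "revenue per store over time" report.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE dim_store (
        store_id INTEGER PRIMARY KEY,
        city     TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_date TEXT NOT NULL,     -- ISO date, keeps the history
        store_id  INTEGER NOT NULL REFERENCES dim_store(store_id),
        amount    REAL NOT NULL
    );
    -- Index chosen for the reporting use case: revenue per store over time.
    CREATE INDEX idx_sales_store_date ON fact_sales (store_id, sale_date);
""")
warehouse.executemany(
    "INSERT INTO dim_store VALUES (?, ?)",
    [(1, "Annecy"), (2, "Lyon")],
)
warehouse.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [("2024-01-05", 1, 120.0), ("2024-01-06", 1, 80.0), ("2024-02-01", 2, 50.0)],
)

# Reporting query: total revenue per city across the whole history.
revenue_by_city = warehouse.execute("""
    SELECT s.city, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
```

Unlike the lake, nothing enters these tables without already matching the declared columns, which is the preparation and transformation work mentioned above.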
💡 Analogy with a supermarket
Imagine the Data Warehouse as your local supermarket where you do your shopping. All the products are neatly arranged in aisles and grouped by type: pasta, various kinds of rice, sauces and more. Everything is well organized to make the customer’s life easier and to encourage consumption.
Fresh products are regularly restocked. Each product has a different expiration date and once it is past, the product is removed from the shelves.
But supermarkets are large and it’s not always easy to find your way around. That’s why they are divided into specific sections, just like Data Marts in a Data Warehouse.
Data Mart
It is a sub-area of the Data Warehouse. It is often structured around organizational divisions to meet the specific needs of different departments within a company.
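One simple way to picture a Data Mart is as a set of department-specific views over shared warehouse tables. The sketch below assumes this view-based approach, again using sqlite3 as a stand-in; the orders table and the two mart names are hypothetical, and in practice marts are often separate schemas or physical tables.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL,
        campaign TEXT               -- NULL when no campaign is attached
    );
    INSERT INTO orders VALUES
        (1, 'alice', 120.0, 'spring'),
        (2, 'bob',    80.0, 'spring'),
        (3, 'alice',  50.0, NULL);

    -- Finance mart: revenue per customer.
    CREATE VIEW mart_finance_revenue AS
        SELECT customer, SUM(amount) AS revenue
        FROM orders
        GROUP BY customer;

    -- Marketing mart: number of orders per campaign.
    CREATE VIEW mart_marketing_campaigns AS
        SELECT campaign, COUNT(*) AS order_count
        FROM orders
        WHERE campaign IS NOT NULL
        GROUP BY campaign;
""")

finance = warehouse.execute(
    "SELECT customer, revenue FROM mart_finance_revenue ORDER BY customer"
).fetchall()
marketing = warehouse.execute(
    "SELECT campaign, order_count FROM mart_marketing_campaigns ORDER BY campaign"
).fetchall()
```

Both departments read the same underlying orders, but each sees only the shape of data that matches its own questions.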
💡 Analogy with a supermarket (part two 😉)
Back in your favorite supermarket, take a moment to observe how the space is divided into zones and aisles. There’s the fruit and vegetable section, the dairy area, and the frozen food zone. In our analogy, each zone represents a Data Mart. Depending on the area, the way the aisles (or data) are organized varies to match the nature of the products and how they are consumed. Fruits and vegetables are displayed in open crates, dairy products are kept in large refrigerators, and dry goods like rice and pasta are stacked on multi-level shelves.
To help you find your way around the store, you’ll see signs hanging from the ceiling marking each section, with additional signs in each aisle indicating the product category, and even smaller ones on the shelves to separate each product type. Lastly, detailed information is available directly on the product packaging. A small nod here to Data Governance, which we won’t dive into in this article, but which remains essential in any information system to ensure clarity and proper use.
Data Lakehouse
Each cloud provider offers its own Data Lakehouse solution, but other options that avoid hyperscaler lock-in are also available:
• GCP with Google BigQuery
• Azure with Azure Synapse Analytics
• AWS with Amazon Redshift
• Snowflake
• And Databricks
💡 Analogy with a DIY store
Let’s take the example of your favorite DIY store. You can walk into the store and browse the aisles looking for your items. Some products are readily available on the shelves, some are showcased in demo spaces, while others are not directly accessible. To get them, you need to place an order and pick them up from the depot a few minutes later. This is real progress compared to the past when you had to order the item and wait several days for delivery or collection.
In the materials yard, you’ll also find ready-to-use bags of concrete, or loose sand and gravel in piles for you to scoop. In both cases, you end up with concrete. The first option involves preparation done in advance, saving you time and ensuring a consistent result. In the second, you’re using raw materials and choosing how to mix and measure them to get the desired outcome.
With a Data Lakehouse, it’s the same idea. You can easily query the structured data that has been planned and optimized for that purpose (without duplication), but you can also access other data, structured or not, even if it wasn’t specifically prepared for querying. This makes analysis and exploration much easier.
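The toy sketch below tries to capture that idea: the same aggregation runs over a curated table and over a raw file read in place, with no copy into a separate store. The file names, fields, and functions are all assumptions for the example, and it only mimics the principle; actual Lakehouse platforms rely on table formats and query engines that are far more capable.

```python
import csv
import json
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Iterable, Iterator


def scan_curated(path: Path) -> Iterator[dict]:
    """Curated table: fixed columns, already cleaned, only types to restore."""
    with path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {"customer": row["customer"], "amount": float(row["amount"])}


def scan_raw(path: Path) -> Iterator[dict]:
    """Raw file read in place: the schema is applied while reading."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if "amount" in event:  # skip events that carry no amount
                yield {
                    "customer": event.get("user", "unknown"),
                    "amount": float(event["amount"]),
                }


def total_by_customer(rows: Iterable[dict]) -> dict:
    """The same query logic, whatever the origin of the rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


# Hypothetical storage holding one prepared table and one raw event file.
storage = Path(tempfile.mkdtemp())
curated = storage / "orders.csv"
curated.write_text("customer,amount\nalice,120.0\nbob,80.0\n", encoding="utf-8")
raw = storage / "clickstream.jsonl"
raw.write_text(
    '{"user": "alice", "amount": "5.5"}\n{"user": "bob", "page": "/home"}\n',
    encoding="utf-8",
)

curated_totals = total_by_customer(scan_curated(curated))
raw_totals = total_by_customer(scan_raw(raw))
```

The curated path corresponds to the ready-to-use bag of concrete, while the raw path is the pile of sand and gravel: more work at read time, but nothing had to be prepared in advance.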
Data Mesh
Data Mesh goes against the idea of centralization and instead promotes a federated approach to data. In this model, a company’s data is split into several independent but interconnected domains, all governed by a shared framework. It forms a mesh of different systems that can include Data Warehouses, Data Lake and Data Warehouse combinations, or Data Lakehouses.
Instead of consolidating everything into one large system, the company operates multiple smaller systems that need to be managed and can exchange data with one another. Internal data exchange interfaces become crucial, which also facilitates external sharing. Data is therefore intentionally distributed but remains open, shareable, and governed to ensure it can be easily used by other systems.
In a Data Mesh, organization, governance, and processes are critical. Data is treated as a true product, complete with its own lifecycle, and must be managed accordingly within the enterprise.
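One way to picture "data as a product" is a small object that carries an owner and an explicit interface contract that is checked on publication. The classes, field names, and the example domain below are purely illustrative assumptions, not a standard Data Mesh API.

```python
from dataclasses import dataclass, field


@dataclass
class DataContract:
    """What a domain promises its consumers: required fields and their types."""
    fields: dict

    def violations(self, record: dict) -> list:
        problems = []
        for name, expected_type in self.fields.items():
            if name not in record:
                problems.append(f"missing field: {name}")
            elif not isinstance(record[name], expected_type):
                problems.append(f"wrong type for {name}")
        return problems


@dataclass
class DataProduct:
    """A dataset published by one domain, with a named owner and a contract."""
    name: str
    owner: str
    contract: DataContract
    records: list = field(default_factory=list)

    def publish(self, record: dict) -> None:
        # Shared governance rule: nothing is published unless it honors the contract.
        problems = self.contract.violations(record)
        if problems:
            raise ValueError(f"{self.name}: {', '.join(problems)}")
        self.records.append(record)


# Hypothetical product owned by a sales domain.
orders = DataProduct(
    name="sales.orders",
    owner="sales-team",
    contract=DataContract({"order_id": int, "amount": float}),
)
orders.publish({"order_id": 1, "amount": 99.0})

try:
    orders.publish({"order_id": "2"})
    rejected = False
    rejection_message = ""
except ValueError as error:
    rejected = True
    rejection_message = str(error)
```

Each domain stays free to store and process its data however it likes, as long as what it exposes to the others respects the contract.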
💡 Analogy with our personal cars
I like to compare the Data Mesh to our personal cars. Each car has its own driver (the product owner) and carries passengers (the data). For all the cars to safely share the road, a highway code, traffic signs and signals (governance) have been put in place. And the cars are equipped with lights and a horn to communicate (interface contract) with other vehicles, even if they’re different.
💡 Analogy with an orchestra
For this final analogy, think of an orchestra. It is made up of individually talented musicians playing different instruments. Each musician follows their own score, but they all work together to produce a harmonious piece of music. They are synchronized by the conductor, who ensures that everything stays in tune.
In a data-driven company adopting a Data Mesh approach, you should picture the organization as an orchestra, where each independent system listens to the others and stays in sync to form a high-performing, harmonious whole, guided by a unifying governance.
In summary
Because a picture is worth a thousand words, I created this illustration as a summary of the article to show how each system relates to the others.
I hope this article helped you better understand these different concepts. What about you—what system is in place in your company? We can support you in evolving from a Data Lake + Data Warehouse to a Data Lakehouse, or help you transform your IT system toward a Data Mesh architecture. Contact us or leave a comment below the article 😉