Amazon Web Services (AWS) defines a data lake as “a centralized repository that allows you to store all your structured and unstructured data at any scale.” This differs from a hierarchical data warehouse, which stores data in files or folders; a data lake uses a flat architecture instead.

Each element in a data lake is assigned a unique identifier and tagged with a set of extended metadata tags, so that the data lake can be queried for relevant data when a business question arises. It also means that a smaller data set can be analysed to get the most relevant answer to the original question.
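To make the idea of identifiers and metadata tags concrete, here is a minimal Python sketch. It is only an illustration: the `TinyLake` class, its `put` and `find` methods and the tag names (`source`, `kind`, `year`) are all made up for this example and do not correspond to any real data lake product.

```python
import uuid


class TinyLake:
    """Hypothetical flat store: every object sits at the same level,
    keyed by a unique identifier and described by metadata tags."""

    def __init__(self):
        self._objects = {}  # identifier -> raw payload, stored as received
        self._tags = {}     # identifier -> metadata tags

    def put(self, payload, **tags):
        """Store a payload untouched and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = payload
        self._tags[object_id] = tags
        return object_id

    def find(self, **wanted):
        """Return (identifier, payload) pairs whose tags match every filter."""
        return [
            (object_id, self._objects[object_id])
            for object_id, tags in self._tags.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]


lake = TinyLake()
lake.put('{"order": 1, "total": 20.5}', source="web_shop", kind="order", year=2023)
lake.put("Great service, will buy again", source="social", kind="review", year=2023)
lake.put(b"\x00\x01sensor-frame", source="iot", kind="telemetry", year=2022)

# A business question about 2023 reviews only touches the matching subset.
reviews = lake.find(kind="review", year=2023)
```

The tags do the work here: the question is answered by filtering on metadata first, so only a small slice of the lake needs to be analysed.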

The term data lake is often associated with Hadoop-based storage, as Hadoop allows business data to reside on clusters of commodity computers. However, as with “big data”, the term “data lake” should not be used to describe any product that happens to support Hadoop. Instead, it applies to any large data pool in which data requirements and schema are not defined until the data is queried.

Data lake is a term that describes a data storage strategy and not a specific piece of technology, despite being frequently used in connection with Hadoop (a specific piece of technology!). Data warehouse is a similar term in this respect: it is often used to refer to a specific technology or to relational databases, when it actually refers to a broader data management strategy.

How does a data lake differ from a data warehouse?

Although both data lakes and data warehouses deal with big data storage, there is a vital difference between the two. In a data warehouse, there tends to be a plan for the data before it is entered into the database – so the schema is pre-set, and the data is primarily structured. This isn’t the case for a data lake as it can house both structured and unstructured data and so does not tend to use a predetermined schema.
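The difference is easiest to see at the moment data arrives. The sketch below is a simplified illustration and not how any particular product works: an in-memory SQLite table stands in for the warehouse, a plain Python list stands in for the lake, and the `orders` table and sample records are invented for the example.

```python
import json
import sqlite3

# Warehouse style: the schema is fixed before any data arrives.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER NOT NULL, total REAL NOT NULL)"
)

# Lake style: records are appended exactly as received, whatever their shape.
lake = []

incoming = [
    {"order_id": 1, "total": 20.5},
    {"review": "Great service"},  # does not fit the orders schema
]

rejected = []
for record in incoming:
    lake.append(json.dumps(record))  # the lake accepts everything
    try:
        warehouse.execute(
            "INSERT INTO orders (order_id, total) VALUES (?, ?)",
            (record.get("order_id"), record.get("total")),
        )
    except sqlite3.IntegrityError:
        # Missing fields become NULL and violate the NOT NULL constraints.
        rejected.append(record)

warehouse_rows = warehouse.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The warehouse ends up holding only the record that matched its plan, while the lake holds both and leaves the question of structure for later.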

The clue is really in the names. A warehouse is a man-made structure with designated places for things to be put inside it, as it stores curated goods from specific sources. A lake is largely unstructured and shifting and is fed from various sources.

This core difference shows up in several ways, such as:

  • Agility. Data warehouses are less agile than data lakes, as data lakes can be configured and reconfigured as needs change.
  • Data quality. Data in a data warehouse tends to be more reliable than that in a data lake because it is already processed.
  • Performance/cost. Data lakes are designed to be lower cost than data warehouses but, in the past, have been less reliable. Data warehouses are more expensive, especially for high volumes of data, but they offer faster query results, higher performance and more reliability.
  • Schema. The schema for a data warehouse is set before the data is entered into the warehouse, whereas the schema for a data lake does not exist until someone accesses the data to use it for something.
  • Source of the data. Data stored in a data warehouse is extracted from various online transaction processing applications to support business analytics queries and data marts. On the other hand, data lakes extract relational and non-relational data from corporate applications, IoT devices, mobile apps and social media.
  • The technology used to host the data. A data warehouse is usually a relational database housed in the cloud or on an enterprise mainframe server. A data lake is usually housed in a Hadoop environment or similar big data repository.
  • Users and purpose. Data warehouses tend to be used by businesses when they have a massive amount of data that needs to be readily available for analysis. Data lakes are used when businesses need somewhere to store their data but do not yet have a defined purpose for it. Also, because data lake data often originates from sources outside of a business’s operational systems and is often uncurated, it is better suited to data scientists than to the average business analytics user.

These differences, and the fact that data lakes are newer, mean that many businesses currently use data lakes and data warehouses side by side, either to create an archive for data rolled off from the main data warehouse or to accommodate new data sources.

The architecture of a data lake

As we have discussed above, the actual architecture of a data lake can vary massively as it is a strategy that can be applied to multiple technologies. So the architecture of a data lake using Amazon S3 will be different to that of a data lake using Hadoop.

Having said that, three things make up the basic architecture of a data lake and distinguish it from other big data storage methods:

  • All data is loaded from various sources and retained
  • Data is stored as it was received from the source – untransformed
  • Data is then transformed and fitted into the schema as it is needed
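The three points above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than a real implementation: `raw_zone`, `ingest` and `read_with_schema` are hypothetical names, the payloads are invented, and a real lake would sit on something like Amazon S3 or Hadoop rather than a list in memory.

```python
import json

raw_zone = []  # everything loaded is retained here, untransformed


def ingest(source, payload):
    """Load data from any source and keep it exactly as it was received."""
    raw_zone.append({"source": source, "payload": payload})


def read_with_schema(schema):
    """Fit the raw data into a schema only when it is needed.

    `schema` maps a field name to a converter, e.g. {"total": float}.
    Raw items that cannot satisfy the schema are skipped, not deleted.
    """
    rows = []
    for item in raw_zone:
        try:
            record = json.loads(item["payload"])
        except (TypeError, ValueError):
            continue  # not JSON, e.g. free text
        if not isinstance(record, dict):
            continue
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


ingest("web_shop", '{"customer": "ada", "total": "20.50"}')
ingest("support_inbox", "free-text complaint about a late delivery")
ingest("mobile_app", '{"customer": "bob", "total": 5, "channel": "app"}')

# The schema is decided by the question being asked, at query time.
sales = read_with_schema({"customer": str, "total": float})
```

Note that the free-text complaint is still sitting in `raw_zone` after the query. A different question tomorrow could apply a different schema to the same raw data.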

What benefits do data lakes offer businesses?

A data lake can offer a business many benefits, including:

  • Its agility, which lets data scientists and developers easily configure a given data application, model or query
  • No inherent structure, which means the data is not locked into one predefined format and any user can access it
  • Ability to support various user levels, from those who want a daily report to those who want to answer entirely new questions with the data
  • Low cost of implementation, as the technologies used to manage them are usually open source and can be installed on low-cost hardware
  • Scalable due to their lack of structure

Talk to Agile Recruit today if you want to further your career as a data scientist and help businesses with their data lakes, or you are looking for some data talent to strengthen your existing team.
