Amazon Web Services (AWS) defines a data lake as “a centralized repository that allows you to store all your structured and unstructured data at any scale.” Whereas a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture.
Each element in a data lake is assigned a unique identifier and tagged with a set of extended metadata tags, so that the lake can be queried for relevant data when a business question arises. Only that smaller, relevant set of data then needs to be analysed to answer the original question.
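To make the idea of identifiers and metadata tags more concrete, here is a minimal sketch in Python. It is only an illustration: the in-memory `lake` dictionary, the `put_object` and `find_objects` helpers and the tag names (`source`, `kind`, `year`) are all hypothetical and do not correspond to any particular product.

```python
import uuid

# Hypothetical in-memory stand-in for a data lake's flat store.
lake = {}  # object_id -> {"data": raw payload, "tags": metadata tags}


def put_object(data, **tags):
    """Store a raw payload under a new unique identifier, with metadata tags."""
    object_id = str(uuid.uuid4())
    lake[object_id] = {"data": data, "tags": tags}
    return object_id


def find_objects(**wanted):
    """Return the IDs of objects whose tags match every requested key/value pair."""
    return [
        object_id
        for object_id, entry in lake.items()
        if all(entry["tags"].get(key) == value for key, value in wanted.items())
    ]


# Three very different payloads sit side by side; there are no folders.
put_object(b"order_id,total\n1,9.99\n", source="web_shop", kind="orders", year=2023)
put_object('{"device": "t-17", "temp_c": 21.4}', source="iot", kind="sensor", year=2023)
put_object("Great service, will buy again", source="social", kind="review", year=2022)

# A business question about 2023 only needs the objects tagged with that year.
matches = find_objects(year=2023)
```

Here the tags do the work that a folder hierarchy would otherwise do: the question is answered by filtering on metadata, and only the matching objects need to be analysed.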
The term data lake is most often associated with Hadoop-oriented storage, because Hadoop allows business data to reside on clusters of commodity computers. However, as with big data, the term should not be used to describe just any product that supports Hadoop. Instead, it applies to any large data pool in which the data requirements and schema are not defined until the data is actually queried.
A data lake is a data storage strategy, not a specific piece of technology, despite the term being frequently used in connection with Hadoop (which is a specific piece of technology). Data warehouse is a similar term in this respect: it is often used to refer to a specific technology, the relational database, when it actually refers to a broader data management strategy.
How does a data lake differ from a data warehouse?
Although both data lakes and data warehouses deal with the storage of big data, there is a vital difference between the two. In a data warehouse, there tends to be a plan for the data before it is entered into the database – so the schema is pre-set and the data is primarily structured. This isn’t the case for a data lake as it can house both structured and unstructured data and so does not tend to use a predetermined schema.
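The contrast at load time can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: `WAREHOUSE_SCHEMA`, `load_into_warehouse` and `load_into_lake` are hypothetical names, and plain lists stand in for the real storage systems.

```python
import json

# Hypothetical warehouse schema, fixed before any data is loaded.
WAREHOUSE_SCHEMA = {"order_id": int, "customer": str, "total": float}

warehouse_rows = []  # structured rows that passed the schema check
lake_objects = []    # raw payloads, stored exactly as received


def load_into_warehouse(record):
    """Reject any record that does not match the preset schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")
    warehouse_rows.append(record)


def load_into_lake(payload):
    """Accept any payload as-is; no schema is applied at load time."""
    lake_objects.append(payload)


structured = {"order_id": 1, "customer": "Ada", "total": 9.99}
unstructured = "Customer called to say the parcel arrived damaged."

load_into_warehouse(structured)          # fits the plan, so it is accepted
load_into_lake(json.dumps(structured))   # the lake takes structured data...
load_into_lake(unstructured)             # ...and unstructured data alike

try:
    load_into_warehouse({"note": unstructured})
    rejected = False
except ValueError:
    rejected = True  # free text has no place in the preset schema
```

The warehouse function refuses anything that was not planned for, while the lake function has no opinion about what it is given.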
The clue is really in the name when it comes to the difference between a data lake and a data warehouse. A warehouse is a man-made structure with designated places for the things put inside it, and it stores curated goods from specific sources. A lake is largely unstructured and shifting, and is fed from a variety of sources.
This core difference is shown in many different ways, such as:
- Agility. Data warehouses are less agile than data lakes, which can be configured and reconfigured as needed.
- Data quality. Data in a data warehouse tends to be more reliable than that in a data lake, because it has already been processed.
- Performance/cost. Data lakes are designed to be lower cost than data warehouses but have, in the past, been less reliable. A data warehouse is more expensive, especially for high volumes of data, but it does offer faster query results, higher performance and more reliability.
- Schema. The schema for a data warehouse is set before the data is entered, whereas the schema for a data lake is not applied until someone accesses the data to use it.
- Source of the data. Data stored in a data warehouse is extracted from various online transaction processing applications to support business analytics queries and data marts. On the other hand, data lakes extract relational and non-relational data from corporate applications, IoT devices, mobile apps and social media.
- The technology used to host the data. A data warehouse is typically a relational database housed in the cloud or on an enterprise mainframe server. A data lake is usually housed in a Hadoop environment or similar big data repository.
- Users. Data warehouses tend to be used by businesses that have a massive amount of data they need readily available for analysis. Data lakes are used when businesses need somewhere to store all of their data but do not yet have a purpose for it. Also, because the data in a data lake often originates from sources outside of a business’s operational systems, and is often uncurated, data lakes are better suited to data scientists than to the average business analytics user.
These differences, and the fact that data lakes are newer, mean that many businesses currently use both data lakes and data warehouses at the same time, either to create an archive for data that rolls off the main data warehouse or to accommodate new data sources.
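As a rough sketch of the archive scenario, the Python below moves older rows out of a warehouse table and into a lake. Everything in it is an assumption made for illustration: an in-memory SQLite database stands in for the warehouse, a list of JSON documents stands in for the lake, and the `sales` table, the `roll_off` function and the cutoff date are hypothetical.

```python
import json
import sqlite3

# In-memory SQLite database standing in for the main data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sale_id INTEGER, sale_date TEXT, total REAL)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "2019-03-01", 10.0), (2, "2021-07-15", 25.0), (3, "2023-01-20", 40.0)],
)

lake_archive = []  # stand-in for the data lake: one JSON document per rolled-off row


def roll_off(cutoff_date):
    """Copy rows older than cutoff_date into the lake, then delete them from the warehouse."""
    old_rows = warehouse.execute(
        "SELECT sale_id, sale_date, total FROM sales WHERE sale_date < ?",
        (cutoff_date,),
    ).fetchall()
    for sale_id, sale_date, total in old_rows:
        document = {"sale_id": sale_id, "sale_date": sale_date, "total": total}
        lake_archive.append(json.dumps(document))
    warehouse.execute("DELETE FROM sales WHERE sale_date < ?", (cutoff_date,))
    warehouse.commit()
    return len(old_rows)


# ISO-formatted dates compare correctly as text, so a string cutoff is enough here.
archived = roll_off("2022-01-01")
remaining = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

The warehouse keeps only the recent rows it needs for fast analysis, while the older rows remain available in the lake if a new question ever calls for them.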
The architecture of a data lake
As discussed above, the actual architecture of a data lake can vary massively, because it is a strategy that can be applied to multiple technologies. The architecture of a data lake built on Amazon S3 will be different to that of one built on Hadoop.
Having said that, three things make up the basic architecture of a data lake and distinguish it from other big data storage methods:
- All data is loaded in from various sources and retained
- Data is stored as it was received from the source – untransformed
- Data is then transformed and fitted into the schema as it is needed
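The three points above can be sketched as a small Python example. It is a minimal illustration under stated assumptions, not a description of any specific platform: the `raw_store` list, the `ingest` and `read_orders` functions, the source names and the sample payloads are all hypothetical.

```python
import csv
import io
import json

raw_store = []  # every payload is retained exactly as received


def ingest(source, payload):
    """Points 1 and 2: load from any source and keep the payload untransformed."""
    raw_store.append({"source": source, "payload": payload})


def read_orders():
    """Point 3: fit raw payloads into an (order_id, total) schema only when queried."""
    orders = []
    for item in raw_store:
        if item["source"] == "web_shop_csv":
            for row in csv.DictReader(io.StringIO(item["payload"])):
                orders.append({"order_id": int(row["order_id"]), "total": float(row["total"])})
        elif item["source"] == "mobile_app_json":
            event = json.loads(item["payload"])
            orders.append({"order_id": int(event["id"]), "total": float(event["amount"])})
        # Payloads from other sources stay in the store, untouched, for future questions.
    return orders


ingest("web_shop_csv", "order_id,total\n1,9.99\n2,20.00\n")
ingest("mobile_app_json", '{"id": 3, "amount": "5.50", "platform": "ios"}')
ingest("social_media", "Loving the new app!")

orders = read_orders()
total_revenue = sum(order["total"] for order in orders)
```

Nothing is cleaned or reshaped on the way in. The schema lives in `read_orders`, so a different question later on can apply a different schema to the same retained data.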
What benefits do data lakes offer businesses?
A data lake can offer a business many benefits, including:
- Its agility, which gives data scientists and developers the ability to easily configure a given data application, model or query
- No inherent structure means that any user can access the data in the data lake
- Ability to support various user levels from those who just want a daily report, to those who want to answer entirely new questions with the data
- Cheap to implement, as the technologies used to manage it are usually open source and can be installed on low-cost hardware
- Scalable due to its lack of structure