Managing big data is a full-on career choice. From one day to the next, you can run analytics reports, move data from one repository to another, and even create data for your company’s important new web application.

Data managers therefore need tools to help with these varied tasks, and fortunately, plenty of cloud computing services can help with data reporting. One of the main providers of these services is Amazon Web Services, more commonly known as AWS.

One of the tools that AWS offers, AWS Glue, is attracting significant attention. In the data industry, it is known as a managed ETL (extract, transform and load) service because it extracts data, transforms it and loads it so that it is ready for analysis.

AWS Glue also provides a data catalogue, which stores metadata in a central repository. Once AWS Glue is pointed at the data stored in AWS, it automates the ETL work and records what it finds in the catalogue. This makes the data searchable and queryable for any cloud analytics and reporting you may need.

Before diving into AWS Glue and its benefits, it helps to understand ETL and data lakes. We summarise both here.

What is a data lake?

A data lake is a centralised repository of information that can contain both structured and unstructured data. On AWS, the data is usually stored in Amazon S3 (Simple Storage Service) and can be used for analytics, data exploration, reporting, artificial intelligence and machine learning.

Using a data lake means the data is then available for many users throughout the business to analyse for insights. However, because the data is heterogeneous (has a high variability of formats and types), it needs to be transformed before it can be analysed.

This is where ETL comes into play. ETL preps all data stored in the data lake, making it ready for analytics and reporting. This saves time compared to moving all data, isolating the data you need, and then running queries in preparation for analytics and reporting.
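To make the idea concrete, here is a minimal sketch of the three ETL steps in plain Python. It is not how AWS Glue itself is implemented. The two sample inputs, the field names and the `warehouse` list are all hypothetical stand-ins for files in a data lake and an analytics store.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two differently shaped files in a data lake.
CSV_ORDERS = "order_id,amount,currency\n1,19.99,gbp\n2,5.00,GBP\n"
JSON_ORDERS = '[{"id": 3, "total": "12.50", "cur": "gbp"}]'


def extract():
    """Read the raw records from both sources, keeping their original shapes."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_ORDERS)))
    json_rows = json.loads(JSON_ORDERS)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Map both shapes onto one schema with consistent names and types."""
    unified = []
    for row in csv_rows:
        unified.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "currency": row["currency"].upper(),
        })
    for row in json_rows:
        unified.append({
            "order_id": int(row["id"]),
            "amount": float(row["total"]),
            "currency": row["cur"].upper(),
        })
    return unified


def load(rows, target):
    """Append the cleaned rows to the target store (a plain list in this sketch)."""
    target.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(*extract()), warehouse)
print(loaded, "rows loaded")
```

After the transform step, every record has the same field names and types, which is what makes the data straightforward to query.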

How does AWS Glue work?

AWS Glue can generate the ETL code you need in either Python or Scala. It works by:

  • Pointing the Glue crawler at the data source
  • Creating a data catalogue containing enough information to recreate the dataset. It does this by using classifiers that recognise formats such as CSV, JSON and Parquet.
  • Performing the ETL through Glue jobs (which are run on-demand or using triggers)
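The steps above could be driven from Python roughly as follows. This is a sketch under stated assumptions, not a definitive setup: the `glue` argument is assumed to behave like `boto3.client("glue")`, the ETL job is assumed to exist already, and every name, role ARN and S3 path you pass in is your own.

```python
import time


def build_crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def wait_for_crawler(glue, name, poll_seconds=15):
    """Block until the crawler reports the READY state again."""
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)


def catalogue_then_run_job(glue, crawler_request, job_name):
    """Crawl the data source, wait for the catalogue update, then run an ETL job.

    `glue` is assumed to behave like boto3.client("glue"); `job_name` must
    refer to a Glue job that has already been defined.
    """
    glue.create_crawler(**crawler_request)
    glue.start_crawler(Name=crawler_request["Name"])
    wait_for_crawler(glue, crawler_request["Name"])
    # On-demand run; a Glue trigger could start the job on a schedule instead.
    return glue.start_job_run(JobName=job_name)["JobRunId"]
```

With AWS credentials configured, you would pass `boto3.client("glue")` as `glue`. Taking the client as a parameter is a design choice that lets the same functions be exercised against a stub during testing.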

Once AWS Glue has finished cataloguing the data, it is ready for analytics. You can then use tools such as Amazon Athena to analyse and process the data, or view analytical results quickly using Amazon QuickSight.
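As an illustration of querying catalogued data, the sketch below runs a SQL query through Athena and returns the rows as dictionaries. It assumes the `athena` argument behaves like `boto3.client("athena")`; the SQL, database name and S3 output location are whatever you supply, and only the first page of results is read.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=2):
    """Run a query and return its result rows as a list of dicts.

    `athena` is assumed to behave like boto3.client("athena").
    """
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a final state is reached.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Only the first page of results is read in this sketch.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    cells = [[col.get("VarCharValue") for col in row["Data"]] for row in rows]
    if not cells:
        return []
    # For SELECT queries, the first row holds the column names.
    header, data = cells[0], cells[1:]
    return [dict(zip(header, values)) for values in data]
```

Athena returns every value as a string, so any numeric conversion is left to the caller.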

What are the benefits of AWS Glue?

  • No need for an on-premises server, your own data centre, local data management stores or even a dedicated data management employee. AWS Glue is serverless and runs as a managed ETL service.
  • AWS Glue can crawl various data sources, identify their format, and suggest how they should be used. It can also generate the code you need for any data processes, queries or transformations.
  • Frees up employees’ time. AWS Glue does all of its ETL processing in the cloud, so employees do not have to do any of the data management and prep that is usually required – tasks such as managing endpoint security, configuring the data beforehand, moving the data to the right repository and so on.

AWS Glue is being heralded as a tool that can change the data exploration process, making it much easier for people to manage because queries can be automated and repeated easily.

If the world of data analytics intrigues you and you would like to learn more about your next career move, or you want to strengthen your existing data analytics team, don’t hesitate to get in touch with one of our expert consultants.
