AWS GLUE

Everything you need to know about AWS Glue

Written on October 26, 2020 by Jason Horwood

Managing big data is a full-on career choice. From one day to the next you can be running analytics reports, moving data from one repository to another, and even creating data for your company’s important new web application.

Data managers therefore need tools to help them manage their various tasks. Fortunately, there are plenty of cloud computing services that can help with data reporting. One of the main providers of these services is Amazon Web Services (AWS).

One of the tools that AWS offers is AWS Glue, which is attracting a significant amount of attention at the moment. It is a managed ETL (Extract, Transform and Load) service: it extracts data from its sources, transforms it, and loads it where it can be analysed.

At the heart of AWS Glue is a data catalogue: a central repository that stores metadata about your data. Once Glue is pointed at data stored in AWS, it automates much of the ETL work and records where the data lives and what it looks like. This makes the data searchable and queryable for any of the cloud analytics and reporting you may need.

It is essential to understand ETL and data lakes before you dive into AWS Glue and its benefits, so we will summarise them here.

What is a data lake?

A data lake is a centralised repository of information that can hold both structured and unstructured data. On AWS, the data is usually stored in Amazon S3 (Simple Storage Service), where it can be used for analytics, data exploration, reporting, artificial intelligence and machine learning.

Using a data lake makes the data available to many users throughout the business, who can analyse it for insights. However, because the data is heterogeneous (it varies widely in format and type), it needs to be transformed before it can be analysed.

This is where ETL comes into play. ETL preps all data stored in the data lake, making it ready for analytics and reporting. This saves time compared to having to move all data, isolate the data you need, and then run queries in preparation for analytics and reporting.
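To make the three ETL stages concrete, here is a minimal sketch in plain Python. It is not how AWS Glue is implemented. The sample records, the `orders` table and the function names are all invented for illustration, and an in-memory SQLite database stands in for the analytics store.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs in two formats, standing in for files in a data lake.
RAW_CSV = "order_id,amount,country\n1,19.99,gb\n2,5.00,us\n"
RAW_JSON = '[{"order_id": 3, "amount": "42.50", "country": "DE"}]'


def extract():
    """Read records from both sources into plain dictionaries."""
    rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
    rows.extend(json.loads(RAW_JSON))
    return rows


def transform(rows):
    """Normalise types and casing so every record has the same shape."""
    return [
        (int(r["order_id"]), float(r["amount"]), str(r["country"]).upper())
        for r in rows
    ]


def load(records):
    """Write the cleaned records to a queryable store (in-memory SQLite here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


conn = load(transform(extract()))
total = conn.execute("SELECT ROUND(SUM(amount), 2) FROM orders").fetchone()[0]
print(total)
```

The mixed CSV and JSON inputs mirror the heterogeneous data described above: only after the transform step do all records share one shape that a single query can read.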

How does AWS Glue work?

AWS Glue can generate the ETL code you need in Python or Scala. It works by:

  • Pointing the Glue crawler at the data source
  • Creating a data catalogue entry that describes the dataset's schema and format. The crawler works this out using classifiers for formats such as CSV, JSON and Parquet.
  • Performing the ETL through Glue jobs (which are run on-demand or using triggers)
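The steps above can be sketched with boto3, the AWS SDK for Python. Treat this as an outline under stated assumptions rather than a finished script: the crawler name, IAM role ARN, database name, bucket path and job name are all placeholders, and the ETL job is assumed to exist already.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for Glue's CreateCrawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def catalogue_and_run(request, job_name):
    """Create and start a crawler, then start an existing ETL job."""
    import boto3  # imported here so the helper above works without the AWS SDK

    glue = boto3.client("glue")
    glue.create_crawler(**request)            # step 1: point the crawler at the data
    glue.start_crawler(Name=request["Name"])  # step 2: populate the data catalogue
    # The crawler runs asynchronously; a real script would wait for it to finish
    # (for example by polling get_crawler) before starting the job.
    run = glue.start_job_run(JobName=job_name)  # step 3: run the ETL job on demand
    return run["JobRunId"]


# Placeholder values for illustration only.
request = crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_lake",
    s3_path="s3://example-bucket/sales/",
)
```

Calling `catalogue_and_run(request, "sales-etl-job")` needs AWS credentials and permissions for Glue, so it is left out of the snippet itself.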

Once AWS Glue has finished cataloguing the data, it is ready to be used for analytics. You can then use tools such as Amazon Athena to query and process the data, or view analytical results quickly using Amazon QuickSight.
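As a rough illustration of querying catalogued data, the sketch below submits a SQL query to Athena through boto3. The database name, the `orders` table, its columns and the results bucket are hypothetical. Athena needs an S3 location for its results, which is why one is passed explicitly here.

```python
def athena_request(database, sql, output_location):
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution ID (requires AWS credentials)."""
    import boto3  # imported here so the helper above works without the AWS SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Placeholder database, table and bucket names for illustration only.
query = athena_request(
    database="sales_lake",
    sql="SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country",
    output_location="s3://example-bucket/athena-results/",
)
```

Athena runs queries asynchronously, so `run_query(query)` returns an ID straight away and the results are fetched once the query has completed.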

What are the benefits of AWS Glue?

  • No need for an on-premises server, your own data centre, local data management stores or even a dedicated data management employee. AWS Glue is serverless and runs as a managed ETL service
  • AWS Glue can crawl a variety of different data sources, identify their formats, and suggest schemas for the data. It can also generate the code you need for data processes, queries or transformations
  • Frees up employees' time. AWS Glue does all of its ETL processing in the cloud, so employees do not have to do the data management and prep that is usually required: tasks such as managing endpoint security, configuring the data beforehand, moving the data to the right repository and so on.

AWS Glue is being heralded as a tool that can change the data exploration process by making it much easier to manage, because queries can be automated and repeated easily.

If the world of data analytics intrigues you and you would like to find out more about your next career move, or you want to strengthen your existing data analytics team, please contact one of our expert consultants.