When we think about data warehousing, we think about Extract, Transformation and Load (ETL). From as far back as we can remember, data warehouse analysts and engineers have striven for increased maintainability by using no-code ETL solutions.
However, this doesn’t mean that Structured Query Language (SQL) for ETL workflows is not used anymore. Far from it. There are many cases where SQL is the go-to choice for performing extractions and transformations – but will this always be the case?
Those who work in the world of ETL know that building an ETL workflow will never be fully automated, as every problem is unique. Data from the source is dynamic and diverse, and these multiple data sources require different connectors to bring the data to its final destination.
Skills needed to build ETL pipelines
The role of an ETL developer asks them not just to be able to create ETL workflows but also to have a complete understanding of the available data pipelines, data sources and warehouses. This means that not only do they need a good knowledge of database engines but also a good grasp of SQL skills.
Knowledge of other programming languages, such as Python or Scala, is also advantageous, as is at least a basic understanding of cloud components and how they work.
The benefit of SQL is that it is not just limited to traditional relational database systems (RDBMS) and data warehousing solutions. Still, it can run on top of distributed file systems to help build data lakes and warehouses and process big data.
Therefore, it is easy to see why many think that knowledge of SQL should be an essential part of any ETL developer’s tool kit.
The downside to ETL solutions
Many people find ETL solutions great to work with, as they mostly consist of drag-and-drop elements that can be moved to transform the datasets. However, there are some downsides to the use of ETL solutions that one should keep in mind:
- The choice of an ETL tool is something to think carefully about. If you end up using databases that are not cloud-native on a cloud-native ETL solution, performance can deteriorate in certain cases. So, make sure you choose a tool compatible with your source.
- Most available ETL tools do not consider the source tables’ definition or examine the query plan for the transformation you have assigned to the workflow beforehand – and so won’t suggest any improvements.
- Staging tables dynamically created by ETL tools may not be optimised for transformations to be run on them, leading to slow pipeline runtimes.
- There is a high probability that the ETL action results could be inaccurate compared to the results you were expecting due to how NULLs are handled in different ETL solutions, for example.
- Unless staging tables are clearly defined, there is no way to recover data missed during pipeline failures or in transit.
How can SQL help?
SQL can help by providing an edge over other ETL actions and solving complex transformations. This does not mean that ETL solutions are useless; they are just better at data orchestration and providing maintainable and reusable data pipeline structures that can be easily shared with other team members.
So, a hybrid approach would seem to be the best solution – combining the best of both worlds regarding ETL and SQL.
The current ETL solutions are equipped with SQL actions, which can run any SQL queries – from DDLS and Index Creations to Stored Procedures. This feature of ETL tools is critical for building a hybrid and robust data pipeline.
Many data cleaning and transformation tasks for high-volume data require more distributed computing power for the transformation to be processed. Those ETL tools capable of using Hadoop or Spark-based executors can run processing jobs containing SQL-based data transformations on top of the object store (using external executors for a nominal price).
By moving the transformation logic, you can clearly separate the ETL pipeline and the SQL scripts, helping ease running tests on the SQL scripts and allowing for easier maintainability.
However, the work of SQL doesn’t end when the ETL process is completed. SQL is the tool that drives business intelligence as it obtains the Key Performance Indicators (KPIs) by effectively grouping the required columns.
So, ETL tools are never complete without SQL, and so those looking to work in data warehousing and ETL should consider upgrading their SQL skills to help them to upgrade their companies’ data pipelines.