When we think about data warehousing, we think about Extract, Transform and Load (ETL). For as long as we can remember, data warehouse analysts and engineers have striven for greater maintainability through the use of no-code ETL solutions.
However, this doesn’t mean that Structured Query Language (SQL) is no longer used for ETL workflows. Far from it. There are plenty of cases where SQL is the go-to choice for performing both extractions and transformations – but will this always be the case?
Those who work in the world of ETL know that building an ETL workflow will never be a fully automated process, as every problem is unique. Source data is dynamic and diverse, and each data source requires a different connector to bring the data to its final destination.
Skills needed to build ETL pipelines
The role of an ETL developer requires not just the ability to create ETL workflows, but also a complete understanding of the available data pipelines, data sources and warehouses. This means they need not only a good knowledge of database engines, but also strong SQL skills.
Knowledge of other programming languages such as Python or Scala is also advantageous, as is at least a basic understanding of cloud components and how they work.
The benefit of SQL is that it is not limited to traditional relational database management systems (RDBMS) and data warehousing solutions: it can also run on top of distributed file systems to help build data lakes and data warehouses and to process big data.
Therefore, it is easy to see why many think that knowledge of SQL should be an essential part of any ETL developer’s tool kit.
The downside to ETL solutions
Many people find ETL solutions great to work with, as they mostly consist of drag and drop elements that can be moved to transform the datasets. However, there are some downsides to the use of ETL solutions that one should keep in mind:
- The choice of an ETL tool is something to think about carefully: if you end up pairing databases that are not cloud-native with a cloud-native ETL solution, performance can deteriorate in certain cases. Make sure you choose a tool that is compatible with your source.
- Most available ETL tools do not take the definition of the source tables into consideration or examine the query plan for the transformation you have assigned to the workflow beforehand – and so won’t suggest any improvements.
- Staging tables that are dynamically created by ETL tools may not be optimised for transformations to be run on them, leading to slow pipeline runtimes.
- There is a high probability that the results of ETL actions will differ from the results you were expecting, due, for example, to the way NULLs are handled in different ETL solutions.
- Unless staging tables are clearly defined, there is no way to recover data that is lost during pipeline failures or in transit.
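To illustrate the NULL-handling pitfall mentioned above, here is a minimal sketch using Python’s built-in sqlite3 module (the table and column names are hypothetical). Two tools that make different choices about NULLs can both claim a “correct” answer:

```python
import sqlite3

# In-memory database standing in for a staging area (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100.0), (2, None), (3, 50.0)])

# Standard SQL aggregates silently skip NULLs: AVG divides by 2, not 3.
avg_skip_nulls = conn.execute("SELECT AVG(amount) FROM orders").fetchone()[0]

# Treating NULL as zero gives a different answer -- a tool that coalesces
# NULLs before aggregating will report a different result.
avg_null_as_zero = conn.execute(
    "SELECT AVG(COALESCE(amount, 0)) FROM orders").fetchone()[0]

print(avg_skip_nulls)    # 75.0
print(avg_null_as_zero)  # 50.0
```

If the ETL tool coalesces NULLs (or filters those rows) without telling you, the downstream numbers quietly diverge from what you expected.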
How can SQL help?
SQL can help by providing an edge over other ETL actions when it comes to solving complex transformations. This does not mean that ETL solutions are useless; they are simply better at data orchestration and at providing maintainable, reusable data pipeline structures that can easily be shared with other members of the team.
So, a hybrid approach would seem to be the best solution – a way of combining the best of both worlds when it comes to ETL and SQL.
Most current ETL solutions come equipped with SQL actions, which can be used to run arbitrary SQL queries – from DDL statements and index creation to stored procedures. This feature of ETL tools is critical for building a hybrid, and robust, data pipeline.
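As a sketch of what the body of such a SQL action might contain, here is some DDL for a hypothetical staging table plus an index, run through sqlite3 (the names are illustrative, not from any particular tool):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A hypothetical SQL action body: DDL for a staging table, plus an index
# to speed up the transformation that reads it afterwards.
conn.executescript("""
    CREATE TABLE stg_customers (
        customer_id INTEGER PRIMARY KEY,
        region      TEXT,
        revenue     REAL
    );
    CREATE INDEX idx_stg_customers_region ON stg_customers (region);
""")

conn.executemany("INSERT INTO stg_customers VALUES (?, ?, ?)",
                 [(1, "EMEA", 10.0), (2, "APAC", 20.0)])

# Verify the index exists in the catalogue.
indexes = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx%'")]
print(indexes)  # ['idx_stg_customers_region']
```

Because the SQL lives in a plain script, you control the staging table definition and its indexes yourself instead of relying on whatever the tool generates.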
Many data cleaning and transformation tasks on high-volume data require distributed computing power for the transformation to be processed. ETL tools that can use Hadoop or Spark-based executors can therefore run processing jobs containing SQL-based data transformations on top of the object store (using external executors for a nominal price).
By moving the transformation logic into SQL scripts, you get a clear separation between the ETL pipeline and the SQL, which makes it easier to test the scripts and improves maintainability.
However, the work of SQL doesn’t end when the ETL process is completed. SQL is the tool that drives business intelligence, as it is used to derive Key Performance Indicators (KPIs) by grouping and aggregating the required columns.
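A typical KPI query of this kind is a GROUP BY over the warehouse tables; here is a small sketch against a hypothetical sales table, again using sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 100.0), ("EMEA", 50.0), ("APAC", 70.0)])

# A typical KPI query: total revenue per region, largest first.
kpis = conn.execute("""
    SELECT region, SUM(amount) AS total_revenue
    FROM sales
    GROUP BY region
    ORDER BY total_revenue DESC
""").fetchall()
print(kpis)  # [('EMEA', 150.0), ('APAC', 70.0)]
```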
So, ETL tools are never complete without SQL, and those looking to work in data warehousing and ETL should consider sharpening their SQL skills to help them upgrade their companies’ data pipelines.