I’ll open this category with an overview of the “ETL” versus “ELT” methods of refreshing the enterprise data warehouse. Today there is some confusion about these terms, and when to apply one versus the other. This post examines the respective roles of ETL versus ELT, and introduces the ETLT Framework.
Some ETL History
In the early days of data warehousing, DW technology was great for queries, but dog slow for transforming and loading data. As a result, a class of technology commonly referred to as ETL (extract, transform, and load) evolved. ETL tools relieved the performance problem by preparing the data with traditional, high-performance batch processing on UNIX or Windows servers. These tools extract data from source systems, process data files row by row applying transforms to the rows and columns, sort and merge the data, perform look-ups against the data warehouse or in-memory files, and aggregate data. Drag-and-drop graphical interfaces insulate developers from the underlying procedural programming code. At the end of the ETL processing, the data is loaded into the target data warehouse, ready for reports and user queries. This basic approach is still used today for most data warehouses. Leading ETL tools include Informatica, DataStage, and Ab Initio.
As ETL tools have matured, so too has data warehouse technology. Today, high-performance MPP data warehouse appliances can transform data very efficiently using set-level SQL. Data warehouse loader utilities have become so fast that it is practical to move data into the data warehouse at an early point in the transformation process. Consequently, the performance bottleneck is shifting from the data warehouse to the ETL servers. DW appliances also impose less administrative overhead than the ETL servers they can offload. We need to step back and holistically rebalance the end-to-end architecture.
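To make "set-level SQL" concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for an MPP appliance (the table and column names are illustrative, not from any real system). The point is that one declarative statement transforms the whole staging set at once, rather than looping row by row the way a traditional ETL server would.

```python
import sqlite3

# SQLite stands in for the appliance; a real MPP platform would
# parallelize this same statement across many nodes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE stg_sales (region TEXT, amount REAL)")
cur.execute("CREATE TABLE dw_sales_summary (region TEXT, total REAL)")
cur.executemany("INSERT INTO stg_sales VALUES (?, ?)",
                [("EAST", 100.0), ("EAST", 50.0), ("WEST", 75.0)])

# One set-level statement transforms the entire staging table --
# no row-by-row processing in the application tier.
cur.execute("""
    INSERT INTO dw_sales_summary (region, total)
    SELECT region, SUM(amount) FROM stg_sales GROUP BY region
""")
conn.commit()

rows = dict(cur.execute("SELECT region, total FROM dw_sales_summary"))
print(rows)
```

The same INSERT...SELECT pattern is how the in-database "T" work described below is typically expressed.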
Enter the ETLT Framework
Onto this scene enters a new approach to refreshing the enterprise data warehouse: the ETLT Framework (extract, format transformation, load, referential transformation). The idea is to combine ETL and ELT, exploiting the strengths of each approach to achieve optimum performance and scalability. The goal is to move certain transformation processing onto the data warehouse platform to take advantage of its inherent parallelism and the relational integration of data. This architectural shift is critical to the scalability of the EDW.
The Dynamic Duo
ETL and on-board ELT processing can work together in an ETLT framework, exploiting the strengths of each approach. In this arrangement, the ETL tool is responsible for extracting and consolidating data from the source systems, performing scalar data cleansing, and formatting the change data to look much like the target tables. At this point the “ET” phase data is loaded into the staging area of the data warehouse appliance. The loading operation is performed by the appliance’s high-speed loader. Some ETL tools can leverage the high-speed loader through ODBC/JDBC or platform-specific connectors; if not, the ETL tool creates flat files for loading. Once the data is in the appliance, the remaining “LT” transformation is completed using SQL-based set-level operations. The ETLT Framework Manager handles end-to-end control, including task dependencies, data archiving, restart/recovery, error handling, and event logging. Sitting above the ETLT Framework is the enterprise scheduler, which coordinates dependencies between the source systems, the ETL run-time server, and the ETLT Framework Manager. Balancing the roles of ETL and ELT in this manner optimizes throughput across the entire ETLT flow.
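The control flow above can be sketched in a few lines of Python. Everything here is a hypothetical outline, not a real product API: the function names (extract_and_format, bulk_load, sql_transform, run_flow) are illustrative placeholders for the “ET”, “L”, and “T” phases, and run_flow plays the Framework Manager’s role of sequencing tasks, handling errors, and logging events.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etlt")

def extract_and_format(source):
    """'ET' phase (ETL tool): extract and consolidate change data,
    apply scalar cleansing, shape rows to resemble the target tables.
    Returns stub rows here; a real tool would read from the source."""
    return [{"id": 1, "amount": 10.0, "source": source}]

def bulk_load(rows, staging_table):
    """'L' phase: hand formatted rows to the appliance's high-speed
    loader (via a connector, or flat files when none is available)."""
    log.info("loaded %d rows into %s", len(rows), staging_table)
    return len(rows)

def sql_transform(staging_table, target_table):
    """'T' phase: set-level SQL inside the appliance applies the
    remaining referential transformations."""
    log.info("transformed %s into %s", staging_table, target_table)

def run_flow(sources):
    """Framework-manager role: task dependencies, error handling,
    and event logging; restart/recovery hooks would live here too."""
    for source in sources:
        try:
            rows = extract_and_format(source)
            bulk_load(rows, "stg_orders")
            sql_transform("stg_orders", "dw_orders")
        except Exception:
            log.exception("flow failed for %s; flagging for restart", source)
            raise

run_flow(["crm", "billing"])
```

An enterprise scheduler would sit above this, invoking run_flow only after its upstream source-system dependencies are satisfied.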
ETL and ELT collaborate as effectively to get data OUT of the EDW as they do to get data IN. It is common practice to export data from the data warehouse and then reload it into downstream data cube platforms and data marts. In these scenarios, the ETLT Framework performs the “heavy lifting” preparation of data for export, and the ETL tool handles the interfacing and data movement to disparate downstream platforms. That said, a better strategy is to consolidate data marts and data cubes directly into the EDW. After all, the reason for cubes and data marts in the first place was to improve reporting performance and flexibility. If the EDW can satisfy those requirements, why incur the costs and latency of extra data layers? An EDW-centric strategy will streamline the information infrastructure to meet the demands of the 21st century.
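A small sketch of the export division of labor, again using sqlite3 as a stand-in for the EDW (table names are made up for illustration): the warehouse does the set-level “heavy lifting” by aggregating before export, so the ETL tool is left moving a small, already-prepared result rather than raw detail rows.

```python
import csv
import io
import sqlite3

# SQLite stands in for the EDW; dw_sales is an illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO dw_sales VALUES (?, ?)",
                 [("EAST", 100.0), ("WEST", 75.0), ("EAST", 25.0)])

# Heavy lifting happens inside the warehouse: aggregate before export.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM dw_sales GROUP BY region ORDER BY region"
).fetchall()

# The ETL tool's remaining job is simple movement of the prepared file
# to the downstream mart or cube platform.
buf = io.StringIO()
csv.writer(buf).writerows([("region", "total"), *rows])
print(buf.getvalue())
```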
Watch for future posts concerning how to implement CDC and other transforms within the ETLT Framework. If you have opinions on the ETL-ELT topic, or if your organization faces challenges that the framework might solve, please submit your comments back to this post. I very much want to incorporate reader ideas into edwblog.com content.