ETL (Extract, Transform, Load) is one of the most widely used patterns for moving and integrating data across systems. In the extract phase, data is read from source systems: operational databases, SaaS APIs, flat files, or event streams. In the transform phase, raw data is cleaned, reshaped, deduplicated, enriched, and validated against business rules. In the load phase, the transformed data is written to the destination: a data warehouse, a data lake, or another operational database.
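The three phases can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the CSV payload, the `orders` table, and the function names `extract`, `transform`, `load`, and `run_pipeline` are all hypothetical, and an in-memory SQLite database stands in for the destination warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file export: messy whitespace, one bad value, one duplicate.
RAW_CSV = """id,email,amount
1, Alice@Example.com ,10.50
2,bob@example.com,not-a-number
1, Alice@Example.com ,10.50
3,carol@example.com,7
"""


def extract(text):
    """Extract: read rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean, validate, and deduplicate before loading."""
    seen_ids = set()
    records = []
    for row in rows:
        try:
            record = (
                int(row["id"]),
                row["email"].strip().lower(),
                float(row["amount"]),
            )
        except ValueError:
            continue  # validation: drop rows whose fields do not parse
        if record[0] in seen_ids:
            continue  # deduplication on the id column
        seen_ids.add(record[0])
        records.append(record)
    return records


def load(records, conn):
    """Load: write the transformed records to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, email TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform, and load in order, then read back the result."""
    load(transform(extract(RAW_CSV)), conn)
    return conn.execute(
        "SELECT id, email, amount FROM orders ORDER BY id"
    ).fetchall()
```

Calling `run_pipeline(sqlite3.connect(":memory:"))` loads only the two rows that survive validation and deduplication. Note that the transformation happens in application code before anything reaches the destination, which is the defining trait of ETL.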
ETL has been the backbone of enterprise data warehousing since the 1990s. Tools like Informatica, IBM DataStage, and Microsoft SSIS built entire product categories around it. The pattern works well for stable, well-understood data models where transformation logic is complex and the destination schema is rigid. Its weakness is latency: traditional ETL pipelines run on schedules, typically hourly or daily, so downstream analytics can be as stale as the interval between runs.
Modern data platforms have pushed ETL in two directions. ELT (Extract, Load, Transform) emerged with cloud data warehouses: raw data is loaded first, then transformed inside the warehouse using SQL, often with a tool such as dbt. This approach uses the warehouse's own compute and removes the need for a separate transformation server. Streaming ETL based on change data capture (CDC) replaces the scheduled extract with continuous reading of the source database's transaction log, capturing each change in near real time and enabling sub-minute latency without the repeated table scans that batch extraction often requires.
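Both variants can be sketched with SQLite standing in for the warehouse. This is an illustrative sketch under stated assumptions: the table names (`raw_orders`, `orders`, `replica`), the function names, and especially the change-event dictionaries are hypothetical. Real CDC tools emit their own event formats, and dbt models are SQL files managed by dbt rather than inline strings as shown here.

```python
import sqlite3


def elt(conn, raw_rows):
    """ELT: load raw text values untouched, then transform with SQL in place."""
    # Load: land the data exactly as extracted (everything is TEXT).
    conn.execute("CREATE TABLE raw_orders (id TEXT, email TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)
    # Transform: the destination's SQL engine does the cleaning and deduplication.
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT DISTINCT
            CAST(id AS INTEGER)  AS id,
            LOWER(TRIM(email))   AS email,
            CAST(amount AS REAL) AS amount
        FROM raw_orders
        """
    )
    return conn.execute(
        "SELECT id, email, amount FROM orders ORDER BY id"
    ).fetchall()


def apply_changes(conn, events):
    """CDC-style apply: replay change events in log order onto a replica table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS replica (id INTEGER PRIMARY KEY, email TEXT)"
    )
    for event in events:
        if event["op"] in ("insert", "update"):
            # INSERT OR REPLACE acts as an upsert keyed on the primary key.
            conn.execute(
                "INSERT OR REPLACE INTO replica (id, email) VALUES (?, ?)",
                (event["id"], event["email"]),
            )
        elif event["op"] == "delete":
            conn.execute("DELETE FROM replica WHERE id = ?", (event["id"],))
    conn.commit()
    return conn.execute("SELECT id, email FROM replica ORDER BY id").fetchall()
```

In `elt`, the raw table is kept alongside the cleaned one, so transformations can be revised and rerun without extracting again. In `apply_changes`, only the rows named in the events are touched, which is why a log-based approach avoids rescanning the source table on every run.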