Data Engineering

ETL Pipelines

Extract, Transform, Load - the foundational data integration pattern, how it compares to ELT and CDC, and the tools used to build modern ETL pipelines.

What is ETL?

ETL - Extract, Transform, Load - is the long-established pattern for moving and integrating data across systems. In the extract phase, data is read from source systems: operational databases, SaaS APIs, flat files, or event streams. In the transform phase, raw data is cleaned, reshaped, deduplicated, enriched, and validated against business rules. In the load phase, the transformed data is written to the destination - a data warehouse, data lake, or another operational database.
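The three phases can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the CSV string, the `payments` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the destination warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (stand-in for a database or API).
RAW_CSV = """id,email,amount
1, Alice@Example.com ,10.50
2,bob@example.com,not-a-number
1, Alice@Example.com ,10.50
3,carol@example.com,7
"""

def extract(text):
    """Extract: read rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean, validate, and deduplicate before loading."""
    seen, out = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # validation: drop rows with a non-numeric amount
        key = int(row["id"])
        if key in seen:
            continue  # deduplicate on the primary key
        seen.add(key)
        out.append((key, row["email"].strip().lower(), amount))
    return out

def load(conn, rows):
    """Load: write the already-transformed rows to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, email TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(conn, text=RAW_CSV):
    """Run extract -> transform -> load and return what landed in the target."""
    load(conn, transform(extract(text)))
    return conn.execute(
        "SELECT id, email, amount FROM payments ORDER BY id"
    ).fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` loads only the rows that survive cleaning; the invalid and duplicate rows never reach the destination, which is the defining property of transforming before the load.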

ETL has been the backbone of enterprise data warehousing since the 1990s. Tools like Informatica, IBM DataStage, and Microsoft SSIS built entire product categories around it. The pattern works well for stable, well-understood data models where transformation logic is complex and the destination schema is rigid. Its weakness is latency: traditional ETL pipelines run on schedules - hourly or daily - so downstream analytics can lag the source by up to a full schedule interval.

Modern data platforms have pushed ETL in two directions. ELT (Extract, Load, Transform) emerged with cloud data warehouses: raw data is loaded first, then transformed inside the warehouse using SQL, typically with dbt. This approach leverages the warehouse's compute and eliminates the need for a separate transformation server. CDC-based streaming ETL replaces the scheduled extract with continuous log-reading, capturing every database change in near real time and enabling sub-minute latency without the full table scans of traditional extraction.
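For contrast with ETL, the ELT ordering might look like the sketch below: raw rows are landed untouched, and the cleanup runs as SQL inside the warehouse engine. SQLite stands in for a cloud data warehouse here, and the table names, sample rows, and `run_elt` helper are hypothetical. In a dbt project the `SELECT` would typically live in a model file rather than a Python string.

```python
import sqlite3

# Hypothetical raw records; sqlite3 stands in for a cloud data warehouse.
RAW_ORDERS = [
    (1, " Alice@Example.com ", "10.50"),
    (2, "bob@example.com", "oops"),
    (1, " Alice@Example.com ", "10.50"),
    (3, "carol@example.com", "7"),
]

def load_raw(conn, rows):
    """Load step: land the data as-is in a staging table, no cleanup yet."""
    conn.execute("CREATE TABLE raw_orders (id INTEGER, email TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

# Transform step: plain SQL executed by the warehouse engine itself.
TRANSFORM_SQL = """
CREATE TABLE orders AS
SELECT DISTINCT
    id,
    lower(trim(email)) AS email,
    CAST(amount AS REAL) AS amount
FROM raw_orders
WHERE amount GLOB '[0-9]*'   -- crude validation: amount must start with a digit
"""

def run_elt(conn, rows=RAW_ORDERS):
    """Load first, then transform in place; return the modelled table."""
    load_raw(conn, rows)
    conn.execute(TRANSFORM_SQL)
    return conn.execute("SELECT id, email, amount FROM orders ORDER BY id").fetchall()
```

Because the raw table is kept, the transformation can be changed and re-run later without going back to the source system.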

ETL vs ELT vs CDC

| Dimension | ETL | ELT | CDC Streaming |
| --- | --- | --- | --- |
| Transform step | Before load | After load (in DWH) | In-stream or after load |
| Extraction method | Scheduled query / polling | Scheduled query / API | Transaction log (continuous) |
| Latency | Hours | Minutes to hours | Seconds |
| Deletes captured | Only with soft deletes | Only with soft deletes | Yes - hard deletes captured |
| Source DB load | High (full table scan) | High (full table scan) | Low (log read only) |
| Typical tools | Informatica, SSIS, NiFi | Fivetran, Airbyte, dbt | Debezium, GoldenGate, Kafka |

ETL, ELT, and CDC are complementary rather than competing. Most mature data platforms use all three: ETL/ELT for SaaS sources and file-based ingestion, CDC for operational databases where real-time sync and delete capture are required.
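The delete-capture difference is easy to reproduce. The sketch below assumes a hypothetical `customers` table with an `updated_at` column and a polling extract that uses a timestamp watermark; SQLite stands in for the source database. A hard delete leaves no row for the poll to find, so the target keeps a stale copy.

```python
import sqlite3

def poll_changes(source, last_seen):
    """Scheduled extract: fetch rows modified since the previous run."""
    return source.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY id",
        (last_seen,),
    ).fetchall()

def demo():
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    source.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)", [(1, "Ann", 100), (2, "Bo", 100)]
    )

    # First poll copies both rows to the target (a dict keyed by id).
    target = {row[0]: row[1] for row in poll_changes(source, 0)}

    # The source then updates one row and hard-deletes the other.
    source.execute("UPDATE customers SET name = 'Anna', updated_at = 200 WHERE id = 1")
    source.execute("DELETE FROM customers WHERE id = 2")

    # Second poll sees the update, but the deleted row simply never appears.
    for row_id, name, _ in poll_changes(source, 100):
        target[row_id] = name
    return target
```

After `demo()` the target still contains customer 2 even though the source no longer does. A soft-delete flag or a log-based CDC feed is needed to propagate the removal.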

CDC as a Modern ETL Source

CDC eliminates the most expensive part of traditional ETL - the scheduled full-table extract - replacing it with continuous log-based streaming. This site focuses on Oracle database sources.

Debezium - CDC Source for ETL Pipelines

Replace scheduled Oracle queries with continuous log streaming. Debezium reads Oracle redo logs and streams change events to Kafka, HTTP, or cloud targets - the real-time extract layer for modern ELT pipelines.
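On the consuming side, a change event can be applied to a target with a small dispatch on the operation type. The sketch below assumes Debezium's documented event envelope (`op`, `before`, `after`, where `op` is `c` for create, `u` for update, `d` for delete, and `r` for a snapshot read). The `apply_change` function, the `ID` key column, and the in-memory dict used as the target are hypothetical, and real events carry additional fields such as `source` and `ts_ms`.

```python
import json

def apply_change(target, event_json, key_field="ID"):
    """Apply one Debezium-style change event to an in-memory target table."""
    payload = json.loads(event_json)
    # With schemas enabled in the JSON converter, the envelope sits under "payload".
    payload = payload.get("payload", payload)
    op = payload["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, or update: upsert the new row image.
        row = payload["after"]
        target[row[key_field]] = row
    elif op == "d":
        # Hard delete: only the "before" image is populated.
        target.pop(payload["before"][key_field], None)
    return target

# Hypothetical sample events for a CUSTOMERS table.
SAMPLE_EVENTS = [
    json.dumps({"payload": {"op": "c", "before": None, "after": {"ID": 1, "NAME": "Ann"}}}),
    json.dumps({"op": "u", "before": {"ID": 1, "NAME": "Ann"}, "after": {"ID": 1, "NAME": "Anna"}}),
    json.dumps({"op": "c", "before": None, "after": {"ID": 2, "NAME": "Bo"}}),
    json.dumps({"op": "d", "before": {"ID": 2, "NAME": "Bo"}, "after": None}),
]
```

Writing the apply step as an upsert plus a tolerant delete means replaying the same event leaves the target unchanged, which matters because Kafka consumers commonly see redelivered messages.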

Debezium Guides

Oracle CDC Methods

LogMiner, XStream, GoldenGate, and trigger-based approaches for extracting change data from Oracle - the comparison every data engineer needs before building an Oracle ETL pipeline.

Oracle CDC Guide

Guides & Tools - Coming Soon

Practical ETL pipeline guides focused on Oracle database sources, streaming extraction, and modern ELT patterns are in progress.

CDC Solution Evaluation Criteria - coming soon

Decision framework for choosing between ETL polling, CDC log-reading, and hybrid approaches for Oracle data integration projects.

CDC Architecture Cost Estimation - coming soon

Cost modelling for CDC pipeline infrastructure: Debezium vs GoldenGate licensing, Kafka cluster sizing, and total cost of ownership comparison with traditional ETL tools.

Debezium to Iceberg Configuration Guide - coming soon

End-to-end configuration for streaming Oracle CDC events from Debezium into Apache Iceberg tables - the streaming ELT pipeline for Oracle data lake ingestion.

CDC Data Engineering Best Practices - coming soon

Schema evolution handling, idempotent consumers, exactly-once semantics, and observability patterns for production CDC-based ETL pipelines.

Frequently Asked Questions