Data Pipelines

Architecture, tools, and best practices for moving data from source systems to destinations - covering batch, streaming, and change data capture (CDC) pipeline patterns.

What is a Data Pipeline?

A data pipeline is a series of automated steps that collect data from one or more source systems, transform it into a usable form, and deliver it to a destination for storage, analysis, or further processing. Pipelines are the backbone of modern data infrastructure - they feed analytics dashboards, train machine learning models, synchronise microservices, and populate data lakes and warehouses.

At its simplest, a pipeline has three stages: extract (read from the source), transform (clean, enrich, reshape), and load (write to the destination). Real pipelines add more - multiple destinations, schema evolution handling, error queues, exactly-once delivery - but the same core shape holds.
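The three stages can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the inline CSV, the `payments` table, and the function names are all hypothetical, and SQLite stands in for the destination.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from a file, API, or database.
RAW_CSV = """id,email,amount
1, Alice@Example.com ,10.50
2,bob@example.com,not-a-number
3,carol@example.com,7
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise emails, parse amounts, set aside rows that fail."""
    clean, errors = [], []
    for row in rows:
        try:
            clean.append(
                (int(row["id"]), row["email"].strip().lower(), float(row["amount"]))
            )
        except ValueError:
            errors.append(row)  # stand-in for an error queue
    return clean, errors


def load(conn, rows):
    """Load: write cleaned rows to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, email TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps the load idempotent if the same batch is re-run.
    conn.executemany("INSERT OR REPLACE INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract, transform, and load in order; return loaded count and rejects."""
    clean, errors = transform(extract(text))
    load(conn, clean)
    return len(clean), errors
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` loads the two valid rows and returns the malformed one separately, which is the simplest form of the error-queue idea mentioned above.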

The two fundamental pipeline archetypes are batch and streaming. Batch pipelines run on a schedule and move large volumes of data at once - daily warehouse loads, nightly report refreshes. Streaming pipelines process events continuously as they arrive, with end-to-end latencies from milliseconds to seconds. Change Data Capture is a streaming pipeline pattern that reads database transaction logs as the event source, capturing every committed row change in the order it occurred.

Batch vs Streaming Pipelines

| Dimension | Batch | Streaming (CDC) |
|---|---|---|
| Latency | Minutes to hours | Seconds or less |
| Deletes captured | Only with soft-delete columns | Yes - full delete events |
| Source DB load | High (full table scan) | Low (log read only) |
| Before images | Not available | Available per event |
| Complexity | Lower | Higher |
| Best for | Reporting, warehousing, bulk loads | Event-driven apps, real-time sync, data lakes |
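The "Deletes captured" row is easiest to see with a small experiment. The sketch below assumes a common batch technique, incremental extraction by an `updated_at` watermark, against a hypothetical `customers` table in SQLite. It is an illustration of the limitation, not a recommended extractor.

```python
import sqlite3


def incremental_extract(conn, last_watermark):
    """Batch-style extract: fetch rows modified since the previous run."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY id",
        (last_watermark,),
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Ada", 100), (2, "Grace", 100)],
    )

    # First scheduled run picks up both rows.
    first_run = incremental_extract(conn, last_watermark=0)

    # Between runs: one update and one hard delete on the source.
    conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 200 WHERE id = 1")
    conn.execute("DELETE FROM customers WHERE id = 2")

    # Second run sees the update, but nothing signals that id 2 is gone.
    second_run = incremental_extract(conn, last_watermark=100)
    return first_run, second_run
```

The second run returns only the updated row for `id = 1`. The deleted row simply stops appearing, so the destination keeps a stale copy unless the source uses soft deletes or the pipeline reads delete events from the log.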

Most production data platforms use both patterns. Batch pipelines handle historical loads, large aggregations, and analytical workloads where a few hours of latency is acceptable. Streaming pipelines power operational dashboards, microservice event feeds, and fraud detection where stale data has a real cost.

CDC pipelines occupy a specific niche in the streaming space: they are driven by committed database transactions rather than application events. This makes them the right choice when the source of truth is a relational database and you need to replicate its state continuously without modifying the application that writes to it.

Pipeline Architecture Patterns

Lambda Architecture

Runs a batch layer (complete, accurate historical results) and a speed layer (low-latency real-time results) in parallel, merging outputs in a serving layer. Handles both real-time and historical queries but is operationally expensive - two codepaths to maintain, two consistency models to reason about.
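A toy version of the three layers, assuming page-view events with a hypothetical `ts` timestamp and a `cutoff` marking the last completed batch run. In a real deployment the two layers are separate systems, and the speed layer updates incrementally rather than rescanning events.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: recompute complete counts from all events up to the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] <= cutoff)


def speed_view(events, cutoff):
    """Speed layer: counts for events that arrived after the last batch run."""
    return Counter(e["page"] for e in events if e["ts"] > cutoff)


def serve(batch, speed):
    """Serving layer: merge both views to answer a query."""
    return dict(batch + speed)
```

Even at this size the cost is visible: the same counting logic exists twice, and the result is only correct if both layers agree on where the cutoff falls.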

Kappa Architecture

Simplifies Lambda by using a single streaming pipeline for both real-time and historical processing. Historical reprocessing is done by replaying the event log from the beginning. Apache Kafka's log retention and Apache Flink's stateful processing make Kappa practical at scale. CDC fits naturally - change events captured from the database log are written to a retained topic, which then serves as the durable event source for replay.
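The core idea, one processing function used for both live events and replay, can be sketched with a plain Python list standing in for the retained log. The event shape (`account`, `delta`) is hypothetical. A real system would read from a Kafka topic and keep state in a stream processor.

```python
def apply_event(state, event):
    """Single processing codepath, used for both live events and replay."""
    account = event["account"]
    state[account] = state.get(account, 0) + event["delta"]
    return state


def replay(log, from_offset=0):
    """Rebuild state from scratch by re-reading the retained log."""
    state = {}
    for event in log[from_offset:]:
        apply_event(state, event)
    return state
```

Because `replay` calls the same `apply_event` that handles live traffic, reprocessing after a logic change means deploying the new function and replaying from offset 0, with no separate batch job to keep in sync.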

CDC-Driven Microservice Sync

A database-centric pattern where a CDC tool (Debezium, GoldenGate) reads the authoritative database log and publishes change events to a message broker (Kafka, Kinesis). Downstream microservices consume those events to maintain their own read models. No point-to-point API calls, no polling, no dual-write - the database log is the single source of truth.
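On the consuming side, maintaining a read model comes down to applying each change event in order. The sketch below assumes a simplified Debezium-style envelope with `op`, `before`, and `after` fields, where `op` is `c` (create), `u` (update), `d` (delete), or `r` (snapshot read). Real events carry more metadata and may be wrapped in a schema/payload structure depending on converter settings. The `id` key and the in-memory dictionary are hypothetical.

```python
def apply_change(read_model, event, key="id"):
    """Apply one change event to an in-memory read model keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Creates, snapshot reads, and updates all upsert the after-image.
        row = event["after"]
        read_model[row[key]] = row
    elif op == "d":
        # Deletes carry only a before-image; pop() tolerates redelivery.
        read_model.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return read_model
```

Upserts and tolerant deletes make the handler idempotent, which matters because most brokers deliver at least once: replaying an event the service has already seen leaves the read model unchanged.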

CDC-Based Data Pipelines

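Log-based Change Data Capture is generally the most robust way to build real-time pipelines from relational databases: it captures every committed change, including deletes, without polling the source tables. This site focuses on Oracle and open-source CDC tools.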

Debezium - Open-Source CDC for Oracle & more

Debezium reads Oracle redo logs via LogMiner or XStream and streams change events to Kafka, HTTP endpoints, or cloud message services. It is among the most widely deployed open-source CDC tools for building real-time data pipelines from Oracle.
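When Debezium runs on Kafka Connect, a connector is registered by POSTing a JSON document to the Connect REST API. The helper below builds such a document for the Oracle connector. Property names follow the Debezium 2.x documentation as best understood here and should be checked against the version in use. Every value (host, user, topic names, table list) is a placeholder, and the helper function itself is hypothetical. XStream needs additional settings that are not shown.

```python
import json


def oracle_connector_request(name, host, service_name, tables, adapter="logminer"):
    """Build the JSON body for registering a Debezium Oracle connector."""
    if adapter not in ("logminer", "xstream"):
        raise ValueError("adapter must be 'logminer' or 'xstream'")
    config = {
        "connector.class": "io.debezium.connector.oracle.OracleConnector",
        "tasks.max": "1",
        "database.hostname": host,
        "database.port": "1521",
        "database.user": "c##dbzuser",          # placeholder account
        "database.password": "change-me",       # placeholder; use a secret store
        "database.dbname": service_name,
        "database.connection.adapter": adapter,  # LogMiner or XStream
        "topic.prefix": name,                    # prefix for change-event topics
        "table.include.list": ",".join(tables),
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": f"schema-history.{name}",
    }
    return json.dumps({"name": name, "config": config}, indent=2)
```

The returned string would be sent to the Connect worker's `/connectors` endpoint. Keeping the payload construction in a pure function makes it easy to review and test before anything touches a live cluster.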

Debezium Guides & Tools

Oracle CDC Methods

LogMiner, XStream, GoldenGate, and trigger-based approaches for extracting change data from Oracle Database - compared by license cost, latency, and pipeline target.

Oracle CDC Guide

Kafka CDC Pipelines

Apache Kafka as the backbone for CDC pipelines - using Kafka Connect and Debezium to stream database changes into event-driven architectures and data lakes.

Kafka CDC Guide

Guides & Tools - Coming Soon

In-depth pipeline guides are in progress, covering Kafka streaming infrastructure, Apache Iceberg data lake ingestion, and cloud-native pipeline patterns.

Kafka vs Redpanda for Data Pipelines - coming soon

Architecture trade-offs, operational complexity, and migration considerations for choosing a Kafka-compatible event broker for CDC pipelines.

Iceberg CDC Pipeline Tutorial - coming soon

End-to-end guide for streaming Oracle CDC events into an Apache Iceberg table using Debezium, Kafka, and an Iceberg sink connector.

Multi-Tenant Pipeline Strategy - coming soon

Topic naming, ACL design, and schema isolation for running CDC pipelines from multiple Oracle schemas on shared Kafka and Iceberg infrastructure.

Oracle CDC Across Cloud Boundaries - coming soon

Networking patterns, latency considerations, and security architecture for CDC pipelines that cross from on-premises Oracle to cloud-native targets.

Frequently Asked Questions