What is the difference between batch and streaming data pipelines?

Batch pipelines process data at scheduled intervals (hourly or daily, for example) or on demand, moving large volumes at once. Streaming pipelines process events continuously as they arrive, with latencies measured in milliseconds to seconds. CDC-based pipelines are a form of streaming pipeline that reads database transaction logs to capture changes in near real time.

What tools are used to build data pipelines?

Common tools include Apache Kafka and Kafka Connect for streaming pipelines, Apache Spark and Flink for large-scale processing, dbt for SQL-based transformations, Apache Airflow for orchestration, and CDC tools like Debezium and Oracle GoldenGate for database change streaming. Cloud providers offer managed equivalents: AWS Glue, Azure Data Factory, Google Cloud Dataflow.

How does Change Data Capture fit into a data pipeline?

CDC replaces the 'extract' step of a traditional pipeline. Instead of querying the source database on a schedule, a CDC tool reads the database's transaction log and streams changes as events. This gives the pipeline near-real-time data with lower load on the source database, full visibility of deletes, and before/after row images for each change.

What is a Lambda architecture?

Lambda architecture is a data pipeline pattern that runs a batch layer (for complete, accurate historical results) and a speed layer (for low-latency real-time results) in parallel, merging their outputs in a serving layer. It handles both real-time and historical queries but is operationally complex. Kappa architecture simplifies this by using a single streaming pipeline for both real-time and reprocessed historical data.

Data Pipelines - Architecture, Tools & Best Practices

Q: What is a data pipeline?

A data pipeline is a series of automated steps that move data from one or more source systems to a destination, transforming it along the way. Pipelines can be batch-oriented (scheduled, bulk transfers) or streaming (continuous, event-by-event). They underpin analytics, machine learning, data lakes, and real-time applications.

What is a Data Pipeline?

A data pipeline is a series of automated steps that collect data from one or more source systems, transform it into a usable form, and deliver it to a destination for storage, analysis, or further processing. Pipelines are the backbone of modern data infrastructure - they feed analytics dashboards, train machine learning models, synchronise microservices, and populate data lakes and warehouses.

At its simplest, a pipeline has three stages: extract (read from the source), transform (clean, enrich, reshape), and load (write to the destination). Real pipelines add more - multiple destinations, schema evolution handling, error queues, exactly-once delivery - but the same core shape holds.
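
The three stages can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the sample rows, the field names, and the in-memory dictionary standing in for the destination are all assumptions made for the example. It also shows one of the extras mentioned above, an error queue for records that fail transformation.

```python
# Hypothetical source rows; a real pipeline would read from a database or API.
SOURCE_ROWS = [
    {"id": 1, "email": " Alice@Example.com ", "amount": "10.50"},
    {"id": 2, "email": "bob@example.com", "amount": "not-a-number"},
    {"id": 3, "email": "carol@example.com", "amount": "7"},
]


def extract(rows):
    """Extract: read records from the source one at a time."""
    yield from rows


def transform(records, errors):
    """Transform: clean and reshape; route unparseable records to an error queue."""
    for record in records:
        try:
            yield {
                "id": record["id"],
                "email": record["email"].strip().lower(),
                "amount_cents": round(float(record["amount"]) * 100),
            }
        except ValueError:
            errors.append(record)


def load(records, destination):
    """Load: upsert by primary key so that re-running the load is idempotent."""
    for record in records:
        destination[record["id"]] = record


def run_pipeline():
    destination, errors = {}, []
    load(transform(extract(SOURCE_ROWS), errors), destination)
    return destination, errors
```

Because the stages are chained generators, each record flows through extract, transform, and load without the whole dataset being materialised between stages.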

The two fundamental pipeline archetypes are batch and streaming. Batch pipelines run on a schedule and move large volumes of data at once - daily warehouse loads, nightly report refreshes. Streaming pipelines process events continuously as they arrive, with end-to-end latencies from milliseconds to seconds. Change Data Capture is a streaming pipeline pattern that reads database transaction logs as the event source, capturing every committed row change in the order it occurred.
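
As a rough sketch of the difference, the same aggregation can be written both ways. The event data and the names `batch_totals` and `StreamingTotals` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical order events as (customer, amount) pairs.
EVENTS = [("alice", 30), ("bob", 20), ("alice", 5)]


def batch_totals(events):
    """Batch: process the whole accumulated set in one scheduled run."""
    totals = defaultdict(int)
    for customer, amount in events:
        totals[customer] += amount
    return dict(totals)


class StreamingTotals:
    """Streaming: update state per event, so the result is current after each one."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, customer, amount):
        self.totals[customer] += amount
        return dict(self.totals)
```

Both arrive at the same final totals; the streaming version simply makes an up-to-date answer available after every event instead of after the next scheduled run.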

Batch vs Streaming Pipelines

Dimension	Batch	Streaming (CDC)
Latency	Minutes to hours	Seconds or less
Deletes captured	Only with soft-delete columns or a full refresh	Yes - full delete events
Source DB load	Higher (full table scans or repeated queries)	Low (log read only)
Before images	Not available	Available per event
Complexity	Lower	Higher
Best for	Reporting, warehousing, bulk loads	Event-driven apps, real-time sync, data lakes
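
The "Deletes captured" row is easy to demonstrate. The sketch below assumes a hypothetical source table with an `updated_at` column and a timestamp-based incremental extract, which is one common batch approach. The change event shape (`op`, `before`, `after`) is an illustrative simplification, not the exact format of any particular tool.

```python
import copy


def incremental_extract(source, since):
    """Batch-style extract: return rows modified after the previous run."""
    return [row for row in source.values() if row["updated_at"] > since]


def compare_delete_handling():
    source = {
        1: {"id": 1, "name": "widget", "updated_at": 1},
        2: {"id": 2, "name": "gadget", "updated_at": 2},
    }
    batch_replica = copy.deepcopy(source)
    cdc_replica = copy.deepcopy(source)

    # Row 2 is hard-deleted at the source; a log-based tool would emit this event.
    deleted = source.pop(2)
    change_events = [{"op": "d", "before": deleted, "after": None}]

    # Batch run: the deleted row no longer exists, so the query cannot return it.
    for row in incremental_extract(source, since=2):
        batch_replica[row["id"]] = row

    # CDC run: the delete arrives as an explicit event carrying the old row.
    for event in change_events:
        if event["op"] == "d":
            cdc_replica.pop(event["before"]["id"], None)

    return batch_replica, cdc_replica
```

After the run, the batch replica still contains row 2 while the CDC replica does not, which is why timestamp-based batch extracts need soft-delete columns or periodic full comparisons.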

Most production data platforms use both patterns. Batch pipelines handle historical loads, large aggregations, and analytical workloads where a few hours of latency is acceptable. Streaming pipelines power operational dashboards, microservice event feeds, and fraud detection where stale data has a real cost.

CDC pipelines occupy a specific niche in the streaming space: they are driven by committed database transactions rather than application events. This makes them the right choice when the source of truth is a relational database and you need to replicate its state continuously without modifying the application that writes to it.
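
A minimal sketch of continuous state replication follows. It assumes change events with an `op` code (`c` for create, `u` for update, `d` for delete) plus `before` and `after` row images, loosely modelled on the envelope Debezium emits; the `id` primary key and the sample rows are hypothetical.

```python
def apply_change(replica, event):
    """Apply one committed change to a replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


def changed_columns(event):
    """Use the before/after images to list the columns a change touched."""
    before = event.get("before") or {}
    after = event.get("after") or {}
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))


# Hypothetical events, in commit order.
CREATE = {"op": "c", "before": None, "after": {"id": 1, "name": "Ada", "tier": "free"}}
UPDATE = {
    "op": "u",
    "before": {"id": 1, "name": "Ada", "tier": "free"},
    "after": {"id": 1, "name": "Ada", "tier": "pro"},
}
DELETE = {"op": "d", "before": {"id": 1, "name": "Ada", "tier": "pro"}, "after": None}
```

Applying the events in commit order keeps the replica identical to the source after each transaction, and the writing application never needs to know the replica exists.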

Pipeline Architecture Patterns

Lambda Architecture

Runs a batch layer (complete, accurate historical results) and a speed layer (low-latency real-time results) in parallel, merging outputs in a serving layer. Handles both real-time and historical queries but is operationally expensive - two codepaths to maintain, two consistency models to reason about.
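
The shape of the pattern can be sketched with page-view counts. This is an illustration under simplifying assumptions: events are hypothetical `(timestamp, page)` pairs, the batch layer has processed everything up to a cutoff, and the speed layer covers only what arrived afterwards.

```python
from collections import Counter

# Hypothetical page-view events as (timestamp, page) pairs.
EVENTS = [(1, "/home"), (2, "/pricing"), (3, "/home"), (4, "/home"), (5, "/docs")]


def batch_view(events, cutoff):
    """Batch layer: recompute complete counts for everything up to the cutoff."""
    return Counter(page for ts, page in events if ts <= cutoff)


def speed_view(events, cutoff):
    """Speed layer: count only events newer than the last batch run."""
    return Counter(page for ts, page in events if ts > cutoff)


def serve(batch, speed):
    """Serving layer: merge both views to answer a query."""
    return dict(batch + speed)
```

Even in this toy version the counting logic exists twice, once per layer, which is the maintenance cost the pattern is known for.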

Kappa Architecture

Simplifies Lambda by using a single streaming pipeline for both real-time and historical processing. Historical reprocessing is done by replaying the event log from the beginning. Apache Kafka's log retention and Apache Flink's stateful processing make Kappa practical at scale. CDC fits naturally - change events captured from the database log and retained in the broker become the durable event source for replay.
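
Reprocessing by replay can be sketched with an in-memory list standing in for a retained event log. The event fields and the two handler versions are hypothetical; the point is that fixing the logic means re-running one pipeline from offset 0, not maintaining a second batch codepath.

```python
# Hypothetical retained event log; list index plays the role of the offset.
LOG = [
    {"user": "alice", "type": "purchase", "amount": 30},
    {"user": "bob", "type": "purchase", "amount": 20},
    {"user": "alice", "type": "refund", "amount": 30},
    {"user": "alice", "type": "purchase", "amount": 5},
]


def revenue_v1(state, event):
    """Original logic: adds every amount, so refunds are counted as revenue."""
    state[event["user"]] = state.get(event["user"], 0) + event["amount"]


def revenue_v2(state, event):
    """Corrected logic: refunds subtract from revenue."""
    sign = -1 if event["type"] == "refund" else 1
    state[event["user"]] = state.get(event["user"], 0) + sign * event["amount"]


def run(log, handler, from_offset=0, state=None):
    """Single streaming codepath: the same loop serves live and replayed events."""
    state = {} if state is None else state
    for event in log[from_offset:]:
        handler(state, event)
    return state
```

Calling `run(LOG, revenue_v2)` rebuilds the corrected state from the start of the log, while passing a later `from_offset` and an existing `state` continues processing live events with the same function.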

CDC-Driven Microservice Sync

A database-centric pattern where a CDC tool (Debezium, GoldenGate) reads the authoritative database log and publishes change events to a message broker (Kafka, Kinesis). Downstream microservices consume those events to maintain their own read models. No point-to-point API calls, no polling, no dual-write - the database log is the single source of truth.
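
A downstream consumer in this pattern might look like the sketch below. It assumes a hypothetical `orders` table whose change events carry `before` and `after` row images, and it tracks the last applied offset so that redelivered events are ignored; a real service would read from the broker and persist both the read model and the offset.

```python
class OpenOrdersReadModel:
    """Maintains open-order counts per customer from order change events."""

    def __init__(self):
        self.open_by_customer = {}
        self.next_offset = 0

    def _adjust(self, row, delta):
        # Only rows in the 'open' status contribute to the count.
        if row and row["status"] == "open":
            customer = row["customer"]
            self.open_by_customer[customer] = self.open_by_customer.get(customer, 0) + delta

    def handle(self, offset, event):
        if offset < self.next_offset:
            return  # already applied; skip the redelivered event
        self._adjust(event["before"], -1)  # undo the old row's contribution
        self._adjust(event["after"], +1)   # add the new row's contribution
        self.next_offset = offset + 1


# Hypothetical events from the orders topic, in offset order.
ORDER_EVENTS = [
    {"op": "c", "before": None, "after": {"id": 1, "customer": "alice", "status": "open"}},
    {"op": "c", "before": None, "after": {"id": 2, "customer": "alice", "status": "open"}},
    {
        "op": "u",
        "before": {"id": 1, "customer": "alice", "status": "open"},
        "after": {"id": 1, "customer": "alice", "status": "shipped"},
    },
]
```

The before image is what lets the service adjust its counts on updates and deletes without calling back to the source database.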

CDC-Based Data Pipelines

Change Data Capture is one of the most robust ways to build real-time pipelines from relational databases. This site focuses on Oracle and open-source CDC tools.

Debezium - Open-Source CDC for Oracle & more

Debezium reads Oracle redo logs via LogMiner or XStream and streams change events to Kafka, HTTP endpoints, or cloud message services. It is among the most widely deployed open-source CDC tools for building real-time data pipelines from Oracle.
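
For orientation, the sketch below builds the JSON payload that would be posted to the Kafka Connect REST API (`POST /connectors`) to register a Debezium Oracle connector. The property names follow the Debezium 2.x Oracle connector documentation and should be checked against the version in use. Every value (host, credentials, database, table, and topic names) and the helper `oracle_connector_payload` are placeholders invented for this example.

```python
import json


def oracle_connector_payload(name, adapter="logminer"):
    """Build a Kafka Connect registration payload; all values are placeholders."""
    if adapter not in ("logminer", "xstream"):
        raise ValueError(f"unsupported adapter: {adapter!r}")
    config = {
        "connector.class": "io.debezium.connector.oracle.OracleConnector",
        "database.hostname": "oracle.example.internal",  # placeholder host
        "database.port": "1521",
        "database.user": "c##dbzuser",                   # placeholder user
        "database.password": "CHANGE_ME",                # placeholder secret
        "database.dbname": "ORCLCDB",
        "topic.prefix": "inventory",
        "table.include.list": "INVENTORY.ORDERS",
        "database.connection.adapter": adapter,
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.inventory",
    }
    if adapter == "xstream":
        # XStream reads from a named outbound server instead of mining redo logs.
        config["database.out.server.name"] = "dbzxout"  # placeholder name
    return {"name": name, "config": config}


if __name__ == "__main__":
    print(json.dumps(oracle_connector_payload("orders-cdc"), indent=2))
```

Switching between LogMiner and XStream is a configuration change in this sketch, although XStream also carries Oracle licensing implications that are outside the scope of the example.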

Debezium Guides & Tools

Oracle CDC Methods

LogMiner, XStream, GoldenGate, and trigger-based approaches for extracting change data from Oracle Database - compared by license cost, latency, and pipeline target.

Oracle CDC Guide

Kafka CDC Pipelines

Apache Kafka as the backbone for CDC pipelines - using Kafka Connect and Debezium to stream database changes into event-driven architectures and data lakes.

Kafka CDC Guide

CDC vs Full Refresh ETL: Cost, Latency, Deletes

Compare CDC and full refresh ETL on compute cost, data freshness, and delete capture. Includes a savings calculator and migration readiness quiz.

Interactive App How-To Guide

Guides & Tools - Coming Soon

In-depth pipeline guides are in progress, covering Kafka streaming infrastructure, Apache Iceberg data lake ingestion, and cloud-native pipeline patterns.

Kafka vs Redpanda for Data Pipelines - coming soon

Architecture trade-offs, operational complexity, and migration considerations for choosing a Kafka-compatible event broker for CDC pipelines.

Iceberg CDC Pipeline Tutorial - coming soon

End-to-end guide for streaming Oracle CDC events into an Apache Iceberg table using Debezium, Kafka, and an Iceberg sink connector.

Multi-Tenant Pipeline Strategy - coming soon

Topic naming, ACL design, and schema isolation for running CDC pipelines from multiple Oracle schemas on shared Kafka and Iceberg infrastructure.

Oracle CDC Across Cloud Boundaries - coming soon

Networking patterns, latency considerations, and security architecture for CDC pipelines that cross from on-premises Oracle to cloud-native targets.