changedatacapture.net

CDC Migration Toolkit Full Refresh vs CDC

Calculate savings, assess migration readiness, and explore the dlt + DuckDB implementation stack.

CDC vs Full Refresh ETL: Cost, Latency, Deletes

Full refresh ETL reprocesses every row on every run, whether it changed or not. At scale this wastes compute, introduces overnight data latency, and silently misses hard deletes. Log-based Change Data Capture (CDC) reads only what changed from the database transaction log, delivering sub-ten-second freshness, explicit delete events, and 30-60% lower compute costs on most production workloads.

ETL compute savings estimator

Full refresh reprocesses data that did not change. Estimate your monthly compute savings from migrating to an incremental CDC pipeline.

10% change rate

Estimated CDC monthly cost

$2,500

You save roughly $2,500/mo


New data freshness

< 10 seconds

Plus automatic hard-delete capture (op = 'd').

Should you migrate?

CDC is not free. Answer these four structural questions to find out if full refresh still wins for your workload, or if you are a clear candidate for log-based CDC.

1. How big is the source table?

2. What is the nature of the data changes?

3. Do downstream systems need to act on hard deletes?

4. Does the source database expose a transaction log?

Architecture and code explorer

Select your tier to view the recommended stack and implementation code.

dlt replication + DuckDB

For PostgreSQL sources with fewer than 10 downstream consumers. Pure Python, no Java, no Kafka. Uses pg_replication via the built-in pgoutput plugin.

Python: extract and load

import dlt
from pg_replication import replication_resource
from pg_replication.helpers import init_replication

CREDENTIALS = "postgresql://postgres:postgres@localhost:5432/appdb"

pipeline = dlt.pipeline(
    pipeline_name="cdc_pipeline", destination="duckdb", dataset_name="staging"
)

# 1. Capture initial snapshot
snapshot = init_replication(
    slot_name="cdc_slot", pub_name="cdc_pub", schema_name="public",
    table_names=["orders"], credentials=CREDENTIALS,
    persist_snapshots=True, reset=True
)
pipeline.run(snapshot)

# 2. Stream ongoing changes (inserts, updates, deletes)
changes = replication_resource("cdc_slot", "cdc_pub", credentials=CREDENTIALS)
pipeline.run(changes)

DuckDB: apply changes (both tiers)

-- Apply CDC events: upsert updated/inserted rows, remove deleted rows
MERGE INTO orders AS target
USING (
  SELECT * FROM cdc.staging_staging.orders
  WHERE _dlt_load_id = (SELECT MAX(_dlt_load_id) FROM cdc.staging_staging.orders)
) AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source.deleted_ts IS NOT NULL THEN
  DELETE
WHEN MATCHED AND source.deleted_ts IS NULL THEN
  UPDATE SET status = source.status, amount = source.amount, updated_at = source.updated_at
WHEN NOT MATCHED AND source.deleted_ts IS NULL THEN
  INSERT (order_id, status, amount, updated_at)
  VALUES (source.order_id, source.status, source.amount, source.updated_at);
Part of: ETL Pipelines Data Pipelines