ETL compute savings estimator

Full refresh reprocesses data that did not change. Estimate your monthly compute savings from migrating to an incremental CDC pipeline.

Current monthly compute cost ($)

Average % of table changing daily

10% change rate

Current data latency

Estimated CDC monthly cost

$2,500

You save roughly $2,500/mo

New data freshness

< 10 seconds

Plus automatic hard-delete capture (op = 'd').

Should you migrate?

CDC is not free. Answer these four structural questions to find out if full refresh still wins for your workload, or if you are a clear candidate for log-based CDC.

1. How big is the source table?

Above a few million rows Small reference table (under 1M rows)

2. What is the nature of the data changes?

Data is frequently updated or deleted Append-only (e.g., event logs)

3. Do downstream systems need to act on hard deletes?

Yes, we have silent delete blindspots No, not required for our use case

4. Does the source database expose a transaction log?

Yes (e.g., Postgres WAL, MySQL Binlog) No (e.g., SaaS API, restricted managed DB)

Architecture and code explorer

Select your tier to view the recommended stack and implementation code.

dlt replication + DuckDB

For PostgreSQL sources with fewer than 10 downstream consumers. Pure Python, no Java, no Kafka. Uses pg_replication via the built-in pgoutput plugin.

Python: extract and load

import dlt
from pg_replication import replication_resource
from pg_replication.helpers import init_replication

CREDENTIALS = "postgresql://postgres:postgres@localhost:5432/appdb"

pipeline = dlt.pipeline(
    pipeline_name="cdc_pipeline", destination="duckdb", dataset_name="staging"
)

# 1. Capture initial snapshot
snapshot = init_replication(
    slot_name="cdc_slot", pub_name="cdc_pub", schema_name="public",
    table_names=["orders"], credentials=CREDENTIALS,
    persist_snapshots=True, reset=True
)
pipeline.run(snapshot)

# 2. Stream ongoing changes (inserts, updates, deletes)
changes = replication_resource("cdc_slot", "cdc_pub", credentials=CREDENTIALS)
pipeline.run(changes)

Debezium + Redpanda

The standard production stack for durable replay, non-Postgres sources, or more than 10 downstream consumers. Downstream DuckDB MERGE logic stays identical to Tier 1.

Docker Compose: infrastructure

services:
  redpanda:
    image: docker.redpanda.com/redpandadata/redpanda:latest
    command:
      - redpanda start
      - --kafka-addr internal://0.0.0.0:9092,external://0.0.0.0:19092

  debezium:
    image: quay.io/debezium/connect:2.5
    environment:
      - BOOTSTRAP_SERVERS=redpanda:9092
      - GROUP_ID=1
      - CONFIG_STORAGE_TOPIC=my_connect_configs
    depends_on:
      - redpanda

DuckDB: apply changes (both tiers)

-- Apply CDC events: upsert updated/inserted rows, remove deleted rows
MERGE INTO orders AS target
USING (
  SELECT * FROM cdc.staging_staging.orders
  WHERE _dlt_load_id = (SELECT MAX(_dlt_load_id) FROM cdc.staging_staging.orders)
) AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source.deleted_ts IS NOT NULL THEN
  DELETE
WHEN MATCHED AND source.deleted_ts IS NULL THEN
  UPDATE SET status = source.status, amount = source.amount, updated_at = source.updated_at
WHEN NOT MATCHED AND source.deleted_ts IS NULL THEN
  INSERT (order_id, status, amount, updated_at)
  VALUES (source.order_id, source.status, source.amount, source.updated_at);

CDC vs Full Refresh ETL: Cost, Latency, Deletes

ETL compute savings estimator

Should you migrate?

Architecture and code explorer

dlt replication + DuckDB

Debezium + Redpanda