Kafka CDC

Change Data Capture with Apache Kafka - using Kafka Connect and Debezium to stream database changes into event-driven pipelines, data lakes, and microservices.

What is Kafka CDC?

Kafka CDC is the practice of using Apache Kafka as the central nervous system for Change Data Capture pipelines. A CDC source connector reads committed changes from a source database's transaction log and publishes them as immutable events to Kafka topics. Downstream consumers - stream processors, data lake writers, microservice event handlers - subscribe to those topics and react to changes in near real time.

The key advantage of routing CDC events through Kafka is decoupling. The source database doesn't need to know about downstream consumers, and new consumers can be added without touching the database or the CDC connector. Kafka's retention provides a durable replay buffer: a consumer that falls behind or restarts can resume from its last committed offset, and a new consumer can replay any part of the change history that is still within the topic's retention window. Polling-based integration cannot offer this, because each poll sees only the current state of a row and misses intermediate updates and hard deletes.
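The replay behaviour can be illustrated with a toy model. The sketch below is not the Kafka client API: `TopicLog` and the consumer group names are hypothetical, and a plain Python list stands in for one topic partition. It only shows why per-consumer offsets over a retained log let a restarted consumer resume and a late-joining consumer replay.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy stand-in for one Kafka topic partition: an append-only list of events."""
    events: list = field(default_factory=list)
    committed: dict = field(default_factory=dict)  # consumer group -> next offset to read

    def append(self, event):
        """Append an event and return its offset."""
        self.events.append(event)
        return len(self.events) - 1

    def poll(self, group, max_events=100):
        """Return unread events for a group, starting at its committed offset."""
        start = self.committed.get(group, 0)
        return self.events[start:start + max_events]

    def commit(self, group, count):
        """Advance the group's offset after it has processed `count` events."""
        self.committed[group] = self.committed.get(group, 0) + count


log = TopicLog()
for i in range(5):
    log.append({"op": "c", "after": {"id": i}})

# The (hypothetical) search indexer processes two events, commits, then "crashes".
batch = log.poll("search-indexer", max_events=2)
log.commit("search-indexer", len(batch))

# After a restart it resumes at offset 2 without going back to the database.
resumed = log.poll("search-indexer")

# A consumer group added later has no committed offset, so it replays from offset 0.
replayed = log.poll("warehouse-loader")
```

In a real cluster the log is bounded by the topic's retention settings, so replay reaches back only as far as the oldest retained event.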

Kafka's ecosystem - Kafka Streams, ksqlDB, and sink connectors - lets you enrich, join, filter, and route CDC events wherever they need to go: relational databases, search indexes, data warehouses, object storage, or HTTP APIs. That's what makes it a natural backbone when multiple systems all need to stay current with the same database.
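As a rough illustration of the enrich, filter, and route steps, here is a plain-Python sketch of the logic only. It does not use Kafka Streams, ksqlDB, or any sink connector API. The `ROUTES` table, the sink names, and the assumption that every row has an `id` primary key column are all hypothetical.

```python
# Hypothetical routing table: source table name -> sinks that should receive its changes.
ROUTES = {
    "products": ["search-index", "warehouse"],
    "orders": ["warehouse"],
}


def enrich(event):
    """Build a flat view of a change event without mutating the original."""
    # Deletes carry the last row image in `before`; other operations use `after`.
    row = event["before"] if event["op"] == "d" else event["after"]
    return {
        "table": event["source"]["table"],
        "op": event["op"],
        "key": row["id"],  # assumes an `id` primary key column
        "row": row,
    }


def dispatch(events, sinks):
    """Filter out unrouted tables and append each enriched event to its sinks."""
    for event in events:
        targets = ROUTES.get(event["source"]["table"], [])
        for name in targets:
            sinks.setdefault(name, []).append(enrich(event))
    return sinks


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "kettle"},
     "source": {"table": "products"}},
    {"op": "d", "before": {"id": 7, "total": 30}, "after": None,
     "source": {"table": "orders"}},
    {"op": "u", "before": {"id": 3}, "after": {"id": 3},
     "source": {"table": "audit_log"}},  # no route configured, so it is dropped
]

sinks = dispatch(events, {})
```

Keeping routing in a lookup table rather than in the source connector mirrors the decoupling described above: adding a destination changes only the consumer side.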

Kafka Connect & Debezium

Kafka Connect is the integration framework built into Apache Kafka for moving data between Kafka and external systems. Source connectors read from external systems and publish to Kafka; sink connectors read from Kafka and write to external systems. Debezium connectors run as Kafka Connect source plugins.

When a Debezium connector runs in Kafka Connect, it reads the database transaction log and publishes one change event per committed row change, by default to a separate Kafka topic for each captured table. The event payload follows Debezium's envelope schema: a top-level op field (c, u, d, or r for create, update, delete, or snapshot read), before and after structs holding the row's column values (before is null for a create, after is null for a delete), and a source block with database metadata such as the commit timestamp and, where the connector exposes one, the transaction ID.
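To make the envelope concrete, the following sketch applies a short sequence of change events to an in-memory copy of a table. The sample events are hand-written for illustration, and they assume JSON serialization with schemas disabled, so the envelope fields sit at the top level of the message. The `customers` table and its `id` key column are hypothetical.

```python
import json


def apply_change(replica, raw_event):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    event = json.loads(raw_event)
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, and update all carry the new row image in `after`.
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        # Delete carries the last known row image in `before`.
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unhandled op: {op!r}")
    return replica


# Hand-written sample events; real events carry more `source` metadata.
raw_events = [
    '{"op": "c", "before": null, "after": {"id": 1, "email": "a@example.com"},'
    ' "source": {"table": "customers"}, "ts_ms": 1700000000000}',
    '{"op": "u", "before": {"id": 1, "email": "a@example.com"},'
    ' "after": {"id": 1, "email": "b@example.com"},'
    ' "source": {"table": "customers"}, "ts_ms": 1700000001000}',
    '{"op": "c", "before": null, "after": {"id": 2, "email": "c@example.com"},'
    ' "source": {"table": "customers"}, "ts_ms": 1700000002000}',
    '{"op": "d", "before": {"id": 2, "email": "c@example.com"}, "after": null,'
    ' "source": {"table": "customers"}, "ts_ms": 1700000003000}',
]

replica = {}
for raw in raw_events:
    apply_change(replica, raw)
```

Treating `c`, `r`, and `u` as the same upsert keeps the consumer idempotent, which helps when events are redelivered after a restart.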

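Kafka Connect's distributed mode provides fault tolerance: if a worker fails, the connectors and tasks it was running are reassigned to the remaining workers and resume from the last committed offset. Scaling across workers depends on the connector. Debezium's MySQL and PostgreSQL connectors, for example, read their log with a single task, so spreading a large table set across workers generally means running several connector instances, each capturing a subset of tables.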
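In distributed mode, connectors are created by sending a JSON document to the Kafka Connect REST API (`POST /connectors`). The sketch below builds such a request for a Debezium PostgreSQL connector without sending it. The host names, credentials, connector name, topic prefix, and table list are placeholders, and the property names follow Debezium 2.x conventions, so check them against the version you run.

```python
import json
import urllib.request


def build_connector_request(connect_url, name, config):
    """Build (but do not send) a POST /connectors request for Kafka Connect."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values throughout; property names assume Debezium 2.x.
config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "plugin.name": "pgoutput",
    "database.hostname": "db.internal.example",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "change-me",
    "database.dbname": "inventory",
    "topic.prefix": "inventory",
    "table.include.list": "public.customers,public.orders",
}

request = build_connector_request(
    "http://localhost:8083",  # assumed Connect worker address
    "inventory-connector",    # hypothetical connector name
    config,
)

# urllib.request.urlopen(request) would submit it; omitted so the sketch runs offline.
```

Any worker in the cluster can accept the request, since connector configuration is shared across the workers in distributed mode.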

Kafka Connect vs Debezium Server

| Dimension | Kafka Connect | Debezium Server |
| --- | --- | --- |
| Requires Kafka | Yes | No |
| Targets | Kafka topics | HTTP, Kinesis, Pub/Sub, Redis… |
| Fault tolerance | Distributed, multi-worker | Single process (restart via K8s) |
| Best for | Kafka-native architectures | Kafka-free or cloud-native pipelines |

Guides & Research

Kafka CDC guides covering broker selection, topic strategy, disaster recovery, and multi-tenant patterns are in progress.

Kafka vs Redpanda for CDC - coming soon

Architecture comparison, operational trade-offs, and migration considerations for using Redpanda as a Kafka-compatible CDC broker.

Kafka Topic Strategy for CDC Pipelines - coming soon

Table-per-topic vs schema-per-topic patterns, partitioning by primary key, compaction policy, and retention sizing for CDC workloads.

Kafka Disaster Recovery for CDC - coming soon

MirrorMaker 2, cross-cluster replication, and recovery strategies for CDC pipelines that cannot tolerate data loss or reordering.

Multi-Tenant Kafka CDC Patterns - coming soon

Topic naming conventions, ACL design, and schema registry configuration for running CDC pipelines from multiple source databases on a shared Kafka cluster.

Frequently Asked Questions