The CDC data flow
Oracle Database (LGWR)
The primary operational database. Frequent log switches force LGWR (Log Writer) and the checkpoint process to work harder, which can temporarily stall database operations and raise I/O latency. This directly affects the stability of the redo log stream that Debezium reads.
Key configuration parameters
| Parameter | Target | Rationale |
|---|---|---|
| Switch frequency | 4-6 per hour at peak | Avoids checkpoint stalls; manageable crash recovery time |
| Log file size | 300MB - 1GB | Derived from switch frequency target; tune per workload |
| Log groups | 3-5 minimum | Prevents archiver stalls during log switches |
| Members per group | 2 minimum | Eliminates single-disk failure as outage cause |
| log.mining.strategy | online_catalog | No dictionary flush redo; simpler operations |
| Archive retention | 24-48 hours of peak redo | Covers typical connector downtime without triggering an SCN not found error |
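The log file size and archive retention rows both follow from the peak redo rate. The arithmetic can be sketched as below; the 3,000 MB/hour workload and both function names are illustrative assumptions, not measurements from a real system.

```python
def redo_log_size_mb(peak_redo_mb_per_hour: float, switches_per_hour: float) -> float:
    """Size of one redo log file so that it fills `switches_per_hour` times per hour at peak."""
    if peak_redo_mb_per_hour <= 0 or switches_per_hour <= 0:
        raise ValueError("rates must be positive")
    return peak_redo_mb_per_hour / switches_per_hour


def archive_retention_mb(peak_redo_mb_per_hour: float, retention_hours: float) -> float:
    """Disk space needed to keep `retention_hours` of archive logs at the peak redo rate."""
    if peak_redo_mb_per_hour <= 0 or retention_hours <= 0:
        raise ValueError("rate and retention must be positive")
    return peak_redo_mb_per_hour * retention_hours


# Hypothetical workload: 3,000 MB of redo generated per hour at peak.
peak = 3000.0

# 6 switches/hour gives the smallest file, 4 switches/hour the largest.
size_range = (redo_log_size_mb(peak, 6), redo_log_size_mb(peak, 4))
print(size_range)                       # (500.0, 750.0)
print(archive_retention_mb(peak, 48))   # 144000.0
```

For this assumed workload the result lands inside the 300MB - 1GB range from the table; a much busier or quieter database would fall outside it, which is why the size is tuned per workload.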
System performance simulation
Database I/O latency and CDC extraction lag over 60 minutes. Target: 4-6 log switches per hour (one every 10-15 minutes).
Checkpoint not complete: what it looks like
When redo logs fill too fast, Oracle's LGWR must wait for DBWR to flush dirty buffers before switching. You'll see this in the alert log:
    Thread 1 cannot allocate new log, sequence 892
    Checkpoint not complete
      Current log# 3 seq# 892 mem# 0: /oracle/redo/redo3.log
      Current log# 3 seq# 892 mem# 1: /oracle/redo/redo3b.log
Increasing the log file size from 100MB to 500MB and adding more groups typically eliminates these waits on high-volume databases. After resizing, monitor V$ARCHIVED_LOG to verify that the switch frequency has dropped to the target range.
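One way to check switch frequency is to bucket log start times by hour. The sketch below assumes the timestamps were exported from the FIRST_TIME column of V$ARCHIVED_LOG, one row per log sequence; the sample data and both function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def switches_per_hour(first_times):
    """Count log switches per clock hour from each log's start timestamp."""
    buckets = Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in first_times
    )
    return dict(sorted(buckets.items()))


def hours_over_target(per_hour, max_switches=6):
    """Return the hours whose switch count exceeds the target ceiling."""
    return [hour for hour, count in per_hour.items() if count > max_switches]


# Hypothetical export: 8 switches in the 09:00 hour, 5 in the 10:00 hour.
sample = [datetime(2024, 1, 1, 9, m) for m in range(0, 60, 8)]
sample += [datetime(2024, 1, 1, 10, m) for m in range(0, 60, 12)]

per_hour = switches_per_hour(sample)
for hour, count in per_hour.items():
    print(hour.strftime("%H:%M"), count)

# Hours that exceed 6 switches and may need larger redo logs.
print([h.strftime("%H:%M") for h in hours_over_target(per_hour)])
```

If archive logs are written to more than one destination, each sequence appears more than once in the view, so the export would need to be de-duplicated by sequence number first.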
The most critical failure in an Oracle CDC pipeline is the SCN not found error: Debezium's last-known System Change Number (SCN) is no longer present in the available redo or archive logs. There are two resolution paths.
Root causes
- Short retention: Archive log policy purges logs before Debezium can read them after downtime.
- Long downtime: Connector was offline longer than the retention window.
- Stale SCN: Monitored tables have low activity, so the connector's stored offset stays put while the database SCN keeps advancing; the logs containing that offset eventually age out.
- Log relocation: DBAs moved archive logs to a destination not configured in log.mining.archive.destination.name.
Resolution path 1: restore the log
Zero data loss. Preferred approach.
- Identify the missing SCN from Kafka Connect logs.
- DBA identifies the specific archive log sequence containing that SCN.
- DBA restores the archive log from backup to the original configured destination.
- Restart the Debezium connector.
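The second step above, mapping an SCN to an archive log sequence, can be sketched as a range lookup. This assumes the DBA has exported each log's sequence number and SCN range (for example the SEQUENCE#, FIRST_CHANGE# and NEXT_CHANGE# columns of V$ARCHIVED_LOG); the `ArchivedLog` type, the function name and the sample values are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class ArchivedLog(NamedTuple):
    sequence: int
    first_change: int  # lowest SCN recorded in this log (inclusive)
    next_change: int   # first SCN of the following log (exclusive)


def find_log_for_scn(logs: Sequence[ArchivedLog], scn: int) -> Optional[ArchivedLog]:
    """Return the archive log whose SCN range contains `scn`, or None if no log covers it."""
    for log in logs:
        if log.first_change <= scn < log.next_change:
            return log
    return None


# Hypothetical catalog of archived logs with contiguous SCN ranges.
catalog = [
    ArchivedLog(sequence=890, first_change=1000, next_change=2000),
    ArchivedLog(sequence=891, first_change=2000, next_change=3500),
    ArchivedLog(sequence=892, first_change=3500, next_change=5000),
]

missing_scn = 2750  # assumed to be taken from the Kafka Connect error message
match = find_log_for_scn(catalog, missing_scn)
print(match.sequence if match else "no archived log covers this SCN")  # 891
```

A `None` result means the SCN is outside every known log, in which case restoring from backup cannot help and the re-snapshot path is the remaining option.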
Resolution path 2: re-snapshot
Use only when logs cannot be restored. This causes downtime and may produce duplicate events downstream.
Prevention checklist
✓ Size redo logs for 4-6 switches per hour at peak (one switch every 10-15 minutes).
✓ Set archive log retention to at least 24-48 hours of peak redo generation.
✓ Configure log.mining.strategy=online_catalog to avoid dictionary flush overhead.
✓ Set heartbeat.interval.ms=300000 to prevent stale SCN on low-traffic databases.
✓ Enable supplemental logging: minimal logging at the database level, plus ALL COLUMNS logging on each captured table.
✓ Monitor V$ARCHIVED_LOG switch frequency after any log resizing.
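The connector-side items in this checklist can be checked mechanically. The sketch below compares a connector configuration against the two property values recommended above; the `deviations` helper and the sample configuration are hypothetical, and a real configuration would carry many more properties.

```python
# Property values taken from the checklist above.
RECOMMENDED = {
    "log.mining.strategy": "online_catalog",
    "heartbeat.interval.ms": "300000",
}


def deviations(config, recommended=RECOMMENDED):
    """Map each property that is missing or different to (actual, recommended)."""
    return {
        key: (config.get(key), wanted)
        for key, wanted in recommended.items()
        if config.get(key) != wanted
    }


# Hypothetical connector configuration with one different value and one missing.
connector_config = {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "log.mining.strategy": "redo_log_catalog",
}

for key, (actual, wanted) in deviations(connector_config).items():
    print(f"{key}: found {actual!r}, checklist recommends {wanted!r}")
```

Values are compared as strings because Kafka Connect configurations are usually submitted as JSON with string values; a configuration that stores the heartbeat interval as a number would need normalizing first.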