What is the difference between batch and streaming ingestion?

Batch ingestion moves data in scheduled bulk loads - hourly, daily, or on demand. Streaming ingestion processes events continuously as they are generated, with end-to-end latencies from milliseconds to seconds. CDC-based ingestion is a streaming pattern that uses the database transaction log as the event source, which typically makes it the most efficient way to ingest relational database changes into a data lake.

How does CDC relate to data ingestion?

Change Data Capture (CDC) is typically the most efficient method for ingesting data from relational databases. Instead of querying tables on a schedule, CDC reads the database's transaction log and streams each committed change as an event. This enables real-time data lake updates, captures hard deletes, and eliminates the full table scans that make batch ingestion expensive on large operational databases.

What tools are used for data ingestion?

Common tools include Apache Kafka and Kafka Connect for streaming ingestion pipelines, Debezium for CDC-based database ingestion, Apache Spark for large-scale batch ingestion, Fivetran and Airbyte for managed SaaS-to-warehouse ingestion, and Amazon Kinesis or Google Cloud Pub/Sub for cloud-native streaming. For Oracle specifically, Debezium (open-source) and Oracle GoldenGate (enterprise) are the leading CDC ingestion tools.

What is the best way to ingest Oracle data into a data lake?

The recommended approach is CDC-based streaming ingestion: Debezium reads Oracle redo logs via LogMiner, streams change events to Apache Kafka, and a sink connector writes them to Apache Iceberg tables in the data lake. This captures all changes, including deletes, in near real time and avoids repeated table scans on Oracle, although log mining itself still consumes some database resources. For batch ingestion, scheduled Spark jobs reading from Oracle via JDBC are a simpler alternative, at the cost of higher latency and heavier source load.
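As a rough illustration of the first hop in that pipeline, the sketch below builds the JSON payload that would be sent to the Kafka Connect REST endpoint (`POST /connectors`) to register a Debezium Oracle connector. The property names follow the Debezium 2.x Oracle connector documentation and should be checked against the version in use. The function name, host, service name, credentials, Kafka address, and table list are all placeholders, and the Iceberg sink side is not shown.

```python
import json


def oracle_connector_payload(name, host, service, tables):
    """Build a Kafka Connect registration payload for a Debezium Oracle source.

    All concrete values here are illustrative placeholders, not recommendations.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.oracle.OracleConnector",
            "database.hostname": host,
            "database.port": "1521",
            "database.user": "c##dbzuser",        # placeholder account
            "database.password": "<secret>",      # placeholder, inject securely
            "database.dbname": service,
            # LogMiner is the adapter discussed above
            "database.connection.adapter": "logminer",
            # Prefix for the Kafka topics that receive change events
            "topic.prefix": name,
            "table.include.list": ",".join(tables),
            # Debezium stores DDL history in its own Kafka topic
            "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
            "schema.history.internal.kafka.topic": f"schema-history.{name}",
        },
    }


payload = oracle_connector_payload(
    name="oracle-orders",                 # hypothetical connector name
    host="oracle.example.internal",       # hypothetical host
    service="ORCLCDB",                    # hypothetical service name
    tables=["INVENTORY.ORDERS", "INVENTORY.CUSTOMERS"],
)
print(json.dumps(payload, indent=2))
```

The payload is only constructed and printed here; sending it requires a running Kafka Connect cluster with the Debezium Oracle connector installed.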

Data Ingestion - Methods, Platforms & Architecture

What is Data Ingestion?

Data ingestion is the process of importing data from source systems into a target platform - a data lake, data warehouse, analytics engine, or operational store - where it can be stored, queried, and processed. It is the first and most foundational step of any data pipeline: if ingestion fails, everything downstream breaks. If ingestion is slow, all analytics are stale. If ingestion misses records, analysis produces incorrect results.

Ingestion connects the operational world (where data is created) to the analytical world (where data is used). Sources include relational databases, SaaS applications, event streams, log files, message queues, and external APIs. Each source type has different characteristics - schema stability, change frequency, volume, access patterns - that determine the right ingestion strategy. A REST API polled every minute is a very different ingestion problem from an Oracle database with thousands of committed transactions per second.

The choice of ingestion method has cascading effects: it determines data freshness (minutes vs seconds), what changes are captured (inserts only, or also updates and deletes), how much load is placed on the source system, and what recovery looks like when something goes wrong. Getting ingestion architecture right is often more impactful than optimising the transformation or serving layers that follow it.

Ingestion Methods

Batch Ingestion

Data is extracted from sources in scheduled bulk loads - hourly, nightly, or on demand. SQL queries read from source tables, flat files are imported, or API endpoints are polled on a schedule. Batch ingestion is simple to implement and well supported by traditional ETL tools. Drawbacks: incremental queries miss hard deletes (a deleted row simply stops appearing in the result set), extraction creates peak load on source systems, and the data is only ever as fresh as the last completed run.
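The hard-delete drawback is easiest to see in a small example. The sketch below uses an in-memory SQLite database as a stand-in for the source and a timestamp watermark for incremental extraction, which is one common batch pattern among several. The table, column names, and `incremental_extract` helper are hypothetical.

```python
import sqlite3


def incremental_extract(conn, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# Stand-in source database with two rows
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 101)],
)

target = {}      # stand-in for the data lake table, keyed by id
watermark = 0

# Batch run 1: picks up both rows
rows, watermark = incremental_extract(source, watermark)
for row_id, name, _ in rows:
    target[row_id] = name

# Between runs: one update and one hard delete on the source
source.execute("UPDATE customers SET name = 'Ada L.', updated_at = 200 WHERE id = 1")
source.execute("DELETE FROM customers WHERE id = 2")

# Batch run 2: sees the update, but the deleted row is simply absent
rows, watermark = incremental_extract(source, watermark)
for row_id, name, _ in rows:
    target[row_id] = name

print(target)  # {1: 'Ada L.', 2: 'Grace'}
```

The target still contains row 2 after the second run, because nothing in the query result signals that the row was removed. Detecting the delete in batch mode would take a full snapshot comparison or soft-delete flags in the source schema.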

Streaming Ingestion

Events are ingested continuously as they are produced - from application event buses, message queues (Kafka, Kinesis, Pub/Sub), or IoT sensors. Streaming ingestion into a data lake writes many small files, which table formats such as Apache Iceberg and Delta Lake keep manageable through compaction; compaction typically has to be scheduled or enabled as a maintenance job. Latency is typically measured in seconds. Streaming requires stream processing infrastructure, and exactly-once delivery is harder to guarantee than in batch loads.
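One common way to approximate exactly-once results is to accept at-least-once delivery and make the sink idempotent. The sketch below is a simplified illustration of that idea under stated assumptions: it skips redelivered events by tracking the highest offset seen per partition, and it buffers events into micro-batches so that fewer, larger files are written. The `IdempotentSink` class is hypothetical, and a real sink would persist the offsets atomically with the data commit rather than keep them in memory.

```python
class IdempotentSink:
    """Toy sink: deduplicates by (partition, offset) and writes micro-batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.committed_batches = []   # each batch stands in for one data file
        self.last_offset = {}         # partition -> highest offset accepted

    def write(self, partition, offset, value):
        """Accept an event unless its offset was already seen; return True if accepted."""
        if offset <= self.last_offset.get(partition, -1):
            return False              # redelivered event, skip it
        self.last_offset[partition] = offset
        self.buffer.append(value)
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return True

    def flush(self):
        """Commit the buffered events as one batch."""
        if self.buffer:
            self.committed_batches.append(list(self.buffer))
            self.buffer.clear()


sink = IdempotentSink(batch_size=3)

# (partition, offset, value); offset 1 is delivered twice, as after a retry
deliveries = [(0, 0, "a"), (0, 1, "b"), (0, 1, "b"), (0, 2, "c"), (0, 3, "d")]
for partition, offset, value in deliveries:
    sink.write(partition, offset, value)
sink.flush()

print(sink.committed_batches)  # [['a', 'b', 'c'], ['d']]
```

The duplicate delivery of offset 1 is dropped, so each event lands in the target exactly once even though the transport delivered it twice.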

CDC-Based Ingestion

Change Data Capture reads the database transaction log directly and streams each committed row change as an event. This is usually the best-suited ingestion method for relational databases: it captures inserts, updates, and hard deletes with before/after row images (on Oracle, full row images depend on supplemental logging being enabled), places little load on the source (log reads rather than table scans), and delivers changes within seconds. Tools like Debezium (open-source) and Oracle GoldenGate (enterprise) implement CDC for Oracle and other databases.
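To make the before/after row images concrete, the sketch below applies a handful of change events to an in-memory replica. The event shape is a simplified version of the Debezium envelope (`op`, `before`, `after`); real events also carry source metadata, timestamps, and usually a schema. The `apply_change` helper and the `ID` key column are hypothetical.

```python
def apply_change(table, event, key="ID"):
    """Apply one simplified CDC event to a dict-based replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):          # create, snapshot read, update
        row = event["after"]
        table[row[key]] = row
    elif op == "d":                    # delete: only the before image is populated
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


events = [
    {"op": "c", "before": None, "after": {"ID": 1, "NAME": "Ada"}},
    {"op": "c", "before": None, "after": {"ID": 2, "NAME": "Grace"}},
    {"op": "u", "before": {"ID": 1, "NAME": "Ada"},
     "after": {"ID": 1, "NAME": "Ada L."}},
    {"op": "d", "before": {"ID": 2, "NAME": "Grace"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)

print(replica)  # {1: {'ID': 1, 'NAME': 'Ada L.'}}
```

Unlike a polling query, the delete arrives as an explicit event, so the replica drops row 2 without any snapshot comparison.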

Ingesting Oracle Database Data

Oracle is one of the most common data sources in enterprise environments. CDC via LogMiner or XStream is the recommended ingestion approach for real-time data lake feeds.

Oracle CDC Methods

A comparison of LogMiner (built-in, no extra license), XStream (lower latency, requires a GoldenGate license), GoldenGate (enterprise), and trigger-based approaches for ingestion use cases.


Debezium Oracle Connector

Open-source CDC ingestion from Oracle to Kafka, HTTP, or cloud targets. Interactive guides covering SCN recovery, performance tuning, XStream configuration, and concurrent reader scaling.


Kafka as Ingestion Backbone

Using Apache Kafka and Kafka Connect as the durable buffer between Oracle CDC events and downstream ingestion targets - data lakes, warehouses, and analytical engines.


Guides & Tools - Coming Soon

In-depth ingestion guides are in progress, focusing on Oracle to Iceberg pipelines, multi-tenant ingestion patterns, and cloud boundary architectures.

Debezium to Iceberg Ingestion Guide - coming soon

End-to-end configuration for streaming Oracle CDC events from Debezium through Kafka into Apache Iceberg tables - the complete real-time Oracle data lake ingestion pipeline.

Iceberg Sink Schema & Key Design - coming soon

Schema mapping from Debezium Oracle events to Iceberg table layouts, primary key strategy, partition design, and handling schema evolution in the ingestion layer.

Multi-Tenant CDC Ingestion - coming soon

Ingesting data from multiple Oracle schemas or databases into a shared Iceberg data lake with tenant isolation, topic strategy, and ACL design.

Oracle CDC Across Cloud Boundaries - coming soon

Networking, security, and latency patterns for ingesting Oracle on-premises changes into cloud data lakes - covering private connectivity, encryption, and data sovereignty constraints.

Oracle LOBs to Iceberg via CDC - coming soon

Strategies for ingesting Oracle CLOB, BLOB, and XMLTYPE columns through Debezium into Iceberg tables - handling large object serialisation and storage trade-offs.