A data pipeline is a series of automated steps that collect data from one or more source systems, transform it into a usable form, and deliver it to a destination for storage, analysis, or further processing. Pipelines are the backbone of modern data infrastructure: they feed analytics dashboards, train machine learning models, synchronise microservices, and populate data lakes and warehouses.
At its simplest, a pipeline has three stages: extract (read from the source), transform (clean, enrich, reshape), and load (write to the destination). Real pipelines add more, such as multiple destinations, schema evolution handling, error queues, and exactly-once delivery, but the same core shape holds.
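The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the sample CSV, the field names, and the in-memory destination are all assumptions made for the example, and the `errors` list stands in for a real error queue.

```python
import csv
import io
import json

# Hypothetical source data: stands in for a file, API, or database export.
RAW_CSV = """id,email,amount
1,Alice@Example.com,10.50
2,,3.00
3,bob@example.com,not-a-number
4,carol@example.com,7.25
"""


def extract(source_text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(source_text)))


def transform(rows):
    """Transform: clean and reshape rows; collect rejects separately."""
    clean, errors = [], []
    for row in rows:
        try:
            if not row["email"]:
                raise ValueError("missing email")
            clean.append({
                "id": int(row["id"]),
                "email": row["email"].strip().lower(),
                # Store money as integer cents to avoid float drift downstream.
                "amount_cents": round(float(row["amount"]) * 100),
            })
        except ValueError as exc:
            # Bad rows go to an error list instead of stopping the run.
            errors.append({"row": row, "error": str(exc)})
    return clean, errors


def load(records, destination):
    """Load: write records to the destination as JSON lines."""
    for record in records:
        destination.write(json.dumps(record) + "\n")
    return len(records)


# Hypothetical destination: an in-memory buffer instead of a warehouse table.
destination = io.StringIO()
rows = extract(RAW_CSV)
clean, errors = transform(rows)
loaded = load(clean, destination)
```

Keeping each stage as a separate function means the transform logic can be tested without touching a real source or destination, and rejected rows remain available for inspection rather than being silently dropped.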
The two fundamental pipeline archetypes are batch and streaming. Batch pipelines run on a schedule and move large volumes of data at once, for example daily warehouse loads or nightly report refreshes. Streaming pipelines process events continuously as they arrive, typically with end-to-end latencies ranging from milliseconds to seconds. Change Data Capture (CDC) is a streaming pipeline pattern that reads database transaction logs as the event source, capturing every committed row change in the order it occurred.
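To make the Change Data Capture idea concrete, the sketch below applies a short list of change events to a replica in commit order. The event shape (`lsn`, `op`, `key`, `row`) and the in-memory `CHANGE_LOG` are hypothetical stand-ins for a real transaction log and do not follow any particular tool's format.

```python
# Hypothetical change events in commit order. "lsn" is an assumed
# monotonically increasing log position used to resume after a restart.
CHANGE_LOG = [
    {"lsn": 1, "op": "insert", "key": 1, "row": {"name": "alice", "plan": "free"}},
    {"lsn": 2, "op": "insert", "key": 2, "row": {"name": "bob", "plan": "free"}},
    {"lsn": 3, "op": "update", "key": 1, "row": {"name": "alice", "plan": "pro"}},
    {"lsn": 4, "op": "delete", "key": 2, "row": None},
]


def read_changes(log, after_lsn=0):
    """Yield events one at a time, skipping those already processed."""
    for event in log:
        if event["lsn"] > after_lsn:
            yield event


def apply_change(replica, event):
    """Apply a single row change to the destination replica."""
    if event["op"] == "delete":
        replica.pop(event["key"], None)
    else:
        # Inserts and updates both become an upsert on the replica.
        replica[event["key"]] = event["row"]


replica = {}
last_lsn = 0
for event in read_changes(CHANGE_LOG):
    apply_change(replica, event)
    last_lsn = event["lsn"]  # checkpoint: where to resume next time

# Resuming from the checkpoint yields nothing new, so no event is reapplied.
replayed = list(read_changes(CHANGE_LOG, after_lsn=last_lsn))
```

Order matters here: applying the delete for key 2 before its insert would leave a row in the replica that no longer exists at the source. Tracking the last applied position is one simple way to avoid reprocessing events after a restart.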