The search for streaming ETL vs CDC usually starts after a data integration project outgrows its ticket. A database replication job now feeds customer analytics. A stream-processing pipeline has become input to a machine learning feature store. A few connectors have turned into a platform, and the team must decide whether to capture raw database changes, transform events in motion, or do both.
That choice is not a vocabulary exercise. Streaming ETL and Change Data Capture solve different parts of the same production problem: how to move changing data through a Kafka-compatible platform without losing semantics or overloading source systems. CDC is strongest when the source of truth is a database and consumers need committed row-level changes. Streaming ETL is strongest when the platform needs to validate, enrich, filter, aggregate, or reshape events before downstream use.
The hard cases sit between those definitions. A CDC stream may need transformation before it is usable. A streaming ETL pipeline may depend on database commit order. Choose the path by the contract the platform can defend under replay, not by which connector is easiest to deploy this week.
Why streaming ETL vs CDC matters in production
CDC became popular because it avoids a familiar batch failure mode: repeatedly querying operational databases and hoping the polling interval and source load are acceptable. A CDC connector reads database logs or equivalent change streams and publishes committed changes into Kafka topics. Downstream systems receive inserts, updates, deletes, and transaction-aware changes without querying the source directly.
Streaming ETL starts from a different pressure. The issue is not only how to extract changes, but how to make events usable while they are fresh. A streaming ETL job may join reference data, mask sensitive fields, normalize schemas, route records, calculate aggregates, or produce a domain-level event. The value is turning a stream into a governed product before downstream systems depend on it.
That difference becomes visible when the platform team asks what must remain true during a replay:
- CDC replay should preserve source-system semantics. Deletes, primary keys, transaction boundaries, ordering, snapshots, and connector offsets matter because consumers may reconstruct state from the change log.
- Streaming ETL replay should preserve transformation semantics. The same input should produce the same output unless code, schema, or enrichment data has intentionally changed.
- Hybrid replay must preserve both. If a CDC topic feeds streaming ETL, the platform needs evidence for source correctness and transformation correctness, not one or the other.
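The transformation half of that replay contract can be made testable by fingerprinting a job's output and comparing runs. The following is a minimal sketch, not a prescribed method: `output_fingerprint`, `mask_email`, and the record fields are hypothetical names introduced for illustration, and it assumes records are JSON-serializable and replayed in the same order.

```python
import hashlib
import json
from typing import Callable, Iterable, Optional


def output_fingerprint(
    records: Iterable[dict],
    transform: Callable[[dict], Optional[dict]],
) -> str:
    """Hash the ordered output of `transform` so two replays can be compared."""
    digest = hashlib.sha256()
    for record in records:
        result = transform(record)
        if result is None:  # the transform filtered this record out
            continue
        # sort_keys keeps the serialization independent of dict insertion order
        digest.update(json.dumps(result, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def mask_email(record: dict) -> Optional[dict]:
    """Hypothetical transform: drop test accounts, mask the email local part."""
    if record.get("is_test"):
        return None
    local, _, domain = record["email"].partition("@")
    return {"customer_id": record["customer_id"], "email": f"{local[:1]}***@{domain}"}


# Hypothetical replay input; in practice this would be a bounded slice of a topic.
replay_input = [
    {"customer_id": 1, "email": "ana@example.com", "is_test": False},
    {"customer_id": 2, "email": "qa@example.com", "is_test": True},
]

first_run = output_fingerprint(replay_input, mask_email)
second_run = output_fingerprint(replay_input, mask_email)
replay_is_deterministic = first_run == second_run
```

A transform that reads the wall clock or a live lookup table would produce different fingerprints across runs, which is the kind of drift this check is meant to surface before consumers notice it.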
This is why the question often belongs to platform engineering rather than only data engineering. Once CDC or streaming ETL becomes shared infrastructure, the operating model includes schemas, lag, retention, access control, incidents, cost, and migration planning. The connector is only the visible edge.
The production constraints behind the choice
The first constraint is event semantics. CDC events usually describe record changes in a source database. They may include before-and-after values, operation codes, source metadata, transaction positions, and tombstones depending on the connector and serialization format. That is valuable when consumers need to rebuild database state or process row-level changes. It is less pleasant when consumers want CustomerRiskScoreUpdated and instead receive a raw table update that exposes database design decisions.
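To make the state-rebuilding case concrete, here is a minimal sketch of a consumer applying change events to an in-memory table. The envelope is an assumption: it is loosely modeled on the `op`, `before`, and `after` fields that Debezium-style connectors emit, and real field names depend on the connector and serialization format. `apply_change` and the sample rows are hypothetical.

```python
from typing import Optional


def apply_change(state: dict, key: str, event: Optional[dict]) -> None:
    """Apply one change event to an in-memory table keyed by primary key."""
    if event is None:
        # Tombstone: a null value for the key, so drop any remaining row.
        state.pop(key, None)
        return
    op = event["op"]  # assumed operation codes: c, r, u, d
    if op in ("c", "r", "u"):  # create, snapshot read, update
        state[key] = event["after"]
    elif op == "d":  # delete
        state.pop(key, None)
    else:
        raise ValueError(f"unknown operation code: {op!r}")


# Hypothetical change stream for a customers table, in commit order.
changes = [
    ("42", {"op": "c", "before": None, "after": {"id": 42, "risk": "low"}}),
    ("42", {"op": "u", "before": {"id": 42, "risk": "low"}, "after": {"id": 42, "risk": "high"}}),
    ("7", {"op": "c", "before": None, "after": {"id": 7, "risk": "low"}}),
    ("7", {"op": "d", "before": {"id": 7, "risk": "low"}, "after": None}),
    ("7", None),  # tombstone following the delete
]

table = {}
for key, event in changes:
    apply_change(table, key, event)
```

After the loop, `table` holds only the latest row for key `42`. Note that the consumer has to know the table's primary key and column layout to do this, which is the coupling a domain event would hide.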
Streaming ETL can repair that gap by converting low-level changes into curated events, but the transformation layer becomes part of the correctness boundary. If a job joins reference data, filters records, or masks columns, the result is no longer a neutral copy of the database log. It is a derived data product, and its owner must also own versioning, test data, schema evolution, and replay behavior.
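As an illustration of that conversion, the sketch below maps a raw row update to a `CustomerRiskScoreUpdated` event. Everything except the event name is an assumption made for the example: the `op`/`before`/`after` envelope, the `id` and `risk_score` columns, and the `SCHEMA_VERSION` tag are hypothetical, and a real job would run inside a stream processor rather than as a plain function.

```python
from typing import Optional

SCHEMA_VERSION = 1  # hypothetical version tag owned by the transformation team


def to_risk_score_event(change: dict) -> Optional[dict]:
    """Map a raw row update to a domain event, or None when nothing relevant changed."""
    if change.get("op") != "u":
        return None  # inserts and deletes are out of scope for this event
    before, after = change["before"], change["after"]
    if before["risk_score"] == after["risk_score"]:
        return None  # some other column changed, so no domain event
    return {
        "type": "CustomerRiskScoreUpdated",
        "schema_version": SCHEMA_VERSION,
        "customer_id": after["id"],
        "previous_score": before["risk_score"],
        "new_score": after["risk_score"],
    }


# Hypothetical raw updates from a customers table.
score_change = {
    "op": "u",
    "before": {"id": 42, "risk_score": 10, "email": "ana@example.com"},
    "after": {"id": 42, "risk_score": 35, "email": "ana@example.com"},
}
email_change = {
    "op": "u",
    "before": {"id": 42, "risk_score": 35, "email": "ana@example.com"},
    "after": {"id": 42, "risk_score": 35, "email": "ana@example.org"},
}

domain_event = to_risk_score_event(score_change)
ignored = to_risk_score_event(email_change)
```

The filtering decision (`ignored` is `None` for an email-only update) is exactly the kind of logic that makes the output a derived product: a consumer can no longer rebuild the source table from it, so the rule itself needs an owner and a version.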
The second constraint is source-system pressure. CDC can reduce query load compared with repeated extraction, but it does not make the source irrelevant. Initial snapshots, log retention, connector restarts, and long outages still depend on source database configuration. Streaming ETL may move work away from the source, but it can create pressure in state stores, enrichment lookups, downstream writes, and Kafka retention windows.
The third constraint is platform cost. Both paths create traffic and retained bytes in Kafka. CDC topics are often long-lived because teams want replayable history. Streaming ETL topics may multiply volume because one raw stream produces several curated streams. In traditional Kafka, retention, replication, and broker replacement all affect infrastructure cost and operations.
| Decision dimension | CDC is usually better when... | Streaming ETL is usually better when... |
|---|---|---|
| Source of truth | A database log is the canonical history. | The output is a curated stream, not a database mirror. |
| Event meaning | Consumers can work with row-level changes. | Consumers need domain events or normalized records. |
| Replay goal | Rebuild state from committed source changes. | Recompute derived outputs from stable input contracts. |
| Governance owner | Database and connector owners define the contract. | Platform or domain owners define the transformation contract. |
| Failure mode to test | Snapshot, log retention, ordering, deletes, offsets. | Determinism, enrichment drift, state recovery, output schemas. |
The table is not a ranking. Mature platforms often use CDC to capture source changes and streaming ETL to turn those changes into consumable products. The risk comes from hiding that two-step contract behind one label.
Architecture patterns teams usually compare
The simplest pattern is direct CDC into Kafka, followed by independent consumers. This fits when downstream systems understand change events and need broad access to source updates. Data warehouse ingestion, cache invalidation, search indexing, audit trails, and near-real-time replication often start here. The platform team should still decide whether raw CDC topics are public products or internal inputs, because that determines how much schema drift downstream teams absorb.
A second pattern is CDC followed by streaming ETL. The CDC layer captures committed changes, while a stream-processing job converts them into curated topics, materialized views, data lake tables, or service-specific outputs. This separates extraction from transformation, but teams must monitor the connector, processing job, intermediate topics, output topics, schemas, and replay behavior across the chain.
A third pattern is event-first streaming ETL, where applications publish domain events directly and the platform transforms or routes them as needed. This avoids exposing database internals and can produce cleaner contracts. It works best when application teams model events intentionally, and poorly when events are only thin wrappers around tables.
The fourth pattern is table-oriented streaming, where Kafka topics are converted into analytical table formats such as Apache Iceberg. This can be fed by CDC or curated streams. The key question is whether the table receives raw row changes, transformed records, or both through separate layers. A table sink that accepts deletes, upserts, schema evolution, and late data needs a clearer contract than a simple append-only stream.
These patterns are often discussed at the connector layer, but the broker layer quietly shapes what is practical. With traditional Kafka, broker-local storage makes retained data part of broker state. Long CDC retention, repeated backfills, and large fanout can increase the cost of scaling and replacing brokers because data placement matters. Shared storage changes that model by separating durable log storage from broker compute.
Evaluation checklist for platform teams
Start with semantics because every other decision depends on it. For CDC, list source tables, primary keys, delete behavior, snapshot mode, transaction expectations, schema format, connector offsets, and consumer assumptions. For streaming ETL, list input topics, transformation code, enrichment sources, state stores, output schemas, error policy, and replay procedure. If the team cannot explain what a replay should produce, it is too early to choose the path.
Then test the awkward cases rather than the happy path. A CDC proof of concept that only inserts five rows proves almost nothing about production. Use updates, deletes, schema changes, connector restarts, source failover, snapshot interruption, lag, and backfill. A streaming ETL proof of concept should test bad records, late data, enrichment changes, state restoration, output schema evolution, and duplicates. The point is to fail while the blast radius is small.
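Duplicates are one of the easier awkward cases to script. The sketch below assumes each event carries a per-key source position that only increases, such as a log sequence number; the `key` and `position` field names and the `dedupe_by_position` helper are hypothetical. Under that assumption, a redelivered event after a connector restart can be recognized and dropped.

```python
from typing import Iterable, Iterator


def dedupe_by_position(events: Iterable[dict]) -> Iterator[dict]:
    """Yield only events whose position advances past the last one seen for that key."""
    last_seen = {}
    for event in events:
        key, position = event["key"], event["position"]
        if position <= last_seen.get(key, -1):
            continue  # duplicate or stale redelivery
        last_seen[key] = position
        yield event


# Hypothetical stream with a redelivered event and a stale replayed event.
incoming = [
    {"key": "a", "position": 1},
    {"key": "a", "position": 2},
    {"key": "a", "position": 2},  # redelivered after a restart
    {"key": "b", "position": 1},
    {"key": "a", "position": 1},  # stale replay of an older change
]

accepted = [(e["key"], e["position"]) for e in dedupe_by_position(incoming)]
```

This approach also discards genuinely late, out-of-order events for a key, so it suits state reconstruction better than pipelines that must process every late record. That trade-off is worth deciding explicitly during the proof of concept.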
Governance deserves a separate pass. CDC can leak database structure if raw table topics become public APIs. Streaming ETL can hide complexity, but it can also hide ownership if no team is accountable for transformation logic. Make topic ownership, schema compatibility, access control, and lineage visible before pipelines multiply.
Cost and scalability should be evaluated by workload shape:
- CDC-heavy platforms often need durable retention, compacted topics, large snapshots, and backfills after downstream outages. Storage growth can dominate the cost model.
- Streaming ETL-heavy platforms often need more compute, state management, and output topics. Write amplification and fanout can dominate the cost model.
- Hybrid platforms need both sets of controls. A CDC stream feeding multiple transformations may require raw retention for replay and curated retention for consumers.
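A back-of-envelope model helps compare these shapes before any benchmark exists. The sketch below assumes broker-local storage at steady state and ignores compression, compaction, and index overhead; the function names and all the numbers are hypothetical inputs, not measurements.

```python
def retained_bytes(ingress_bytes_per_sec, retention_hours, replication_factor=3):
    """Approximate bytes held on broker disks for one topic at steady state."""
    return ingress_bytes_per_sec * retention_hours * 3600 * replication_factor


def curated_ingress(raw_bytes_per_sec, output_ratios):
    """Total write rate of curated topics, given each output's size relative to the raw stream."""
    return sum(raw_bytes_per_sec * ratio for ratio in output_ratios)


# Hypothetical hybrid workload: one raw CDC stream feeding three curated topics.
raw_rate = 5_000_000                # 5 MB/s of raw change events
raw_retained = retained_bytes(raw_rate, retention_hours=168)  # 7 days for replay

curated_rate = curated_ingress(raw_rate, [1.0, 0.5, 0.25])
curated_retained = retained_bytes(curated_rate, retention_hours=24)

write_amplification = curated_rate / raw_rate
```

With these inputs the raw topic retains about 9.07 TB and the curated topics about 2.27 TB, while the transformation layer adds 1.75 bytes of curated writes for every raw byte. Changing the retention hours or the output ratios shows quickly whether storage growth or fanout dominates for a given workload.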
Migration planning should include the consumer contract, not only the producer path. If a team moves from batch ETL to CDC, downstream consumers may need to handle deletes, tombstones, reordering expectations, and schema evolution differently. If a team moves from raw CDC to streaming ETL, consumers may need to trust new derived-event contracts. If a team changes the Kafka-compatible platform underneath, offsets, consumer group behavior, topic configuration, security settings, and rollback evidence must be validated together.
Where AutoMQ changes the operating model
Once the evaluation reaches retention, replay, broker replacement, and cloud cost, Kafka storage architecture becomes part of the decision. Traditional Kafka's Shared Nothing design binds partition data to brokers. That model is proven, but it can turn storage-heavy CDC and replay-heavy streaming ETL into broker sizing problems. Adding compute may add storage, replacing brokers may move partition data, and retention growth may require capacity planning even when CPU demand is stable.
AutoMQ fits into a different category: a Kafka-compatible, cloud-native streaming platform built on Shared Storage architecture. It keeps Kafka protocol compatibility while replacing broker-local log storage with S3Stream, where data is written through WAL storage and uploaded to S3-compatible object storage. AutoMQ is relevant when a team wants Kafka semantics and ecosystem compatibility, but the operating pain is caused by broker-local storage, partition movement, or cloud elasticity rather than the connector framework itself.
That does not make AutoMQ a universal answer to streaming ETL vs CDC. The pattern decision still starts with semantics: CDC for committed database changes, streaming ETL for curated event products. AutoMQ changes the substrate those patterns run on. Stateless brokers, object-storage-backed durability, faster partition reassignment, and customer-controlled deployment options can make integration workloads easier to scale and migrate when the bottleneck is the Kafka data plane.
AutoMQ's Table Topic capability is also worth understanding here. Table Topic can convert Kafka topic data into Apache Iceberg tables, with configuration for schema parsing, partitioning, and upsert behavior. For lakehouse ingestion, CDC or curated streams can feed Kafka topics while the platform handles table-oriented output. The same production rules still apply: deletes, primary keys, schema compatibility, and replay behavior must be tested.
Decision table and FAQ
The decision should be explicit enough that data engineers, application teams, SREs, and platform owners can disagree using the same language. A CDC advocate may focus on source correctness. A streaming ETL advocate may focus on consumer usability. An SRE may focus on replay, lag, and incidents. A CTO may focus on whether the platform can support more teams without bespoke pipelines.
| If your main pressure is... | Prefer this path | Watch for |
|---|---|---|
| Capturing committed database changes with minimal source queries | CDC into Kafka | Snapshot load, log retention, deletes, ordering, connector offsets |
| Publishing consumer-friendly events from messy inputs | Streaming ETL | Transformation ownership, test data, schema evolution, deterministic replay |
| Feeding analytics tables from operational changes | CDC plus table-oriented sink | Upserts, deletes, primary keys, schema registry, compaction |
| Reducing raw database coupling for downstream teams | CDC followed by streaming ETL | Intermediate retention, dual contracts, replay across both layers |
| Scaling replay-heavy integration workloads in the cloud | Kafka-compatible shared storage evaluation | Latency profile, object storage behavior, migration, rollback evidence |
Back to the original search: streaming ETL vs CDC is the wrong question if it is treated as a tool comparison. It is the right question if it forces the team to name the event contract, replay contract, governance owner, and operating model. Choose CDC for faithful committed source changes. Choose streaming ETL for curated, validated, transformed events. Use both when the organization needs source truth and consumer-ready products, but make the boundary visible.
If the integration path exposes broker storage, retention, and migration limits, evaluate the Kafka-compatible platform underneath the pipelines with the same rigor. AutoMQ's shared-storage architecture is one option to review when the team wants Kafka semantics with a cloud-native data plane. A practical next step is to map one representative workload against the AutoMQ migration documentation and validate semantics, offsets, replay, and rollback before production cutover.
FAQ
Is CDC the same as streaming ETL?
No. CDC captures committed changes from a source system, usually a database, and publishes them as change events. Streaming ETL transforms, validates, enriches, filters, or routes events while they are in motion. CDC can feed streaming ETL, but the two layers have different correctness responsibilities.
When should CDC be preferred?
Prefer CDC when downstream systems need a faithful stream of database changes, when polling would overload the source, or when consumers need to reconstruct state. Test snapshots, log retention, connector offsets, delete handling, schema evolution, and source failover before relying on it in production.
When should streaming ETL be preferred?
Prefer streaming ETL when consumers need curated domain events, normalized schemas, enrichment, masking, routing, aggregation, or table-ready output. The key requirement is deterministic replay: the team should know whether the same input produces the same output after code or schema changes.
Can a platform use both CDC and streaming ETL?
Yes. Many production platforms capture source changes with CDC, then use streaming ETL to turn raw changes into governed event products or analytical outputs. Keep the raw CDC contract and the transformed-output contract separate, with clear ownership and replay procedures for both.
Does a shared-storage Kafka architecture change the CDC vs streaming ETL decision?
It does not change the semantic decision. It changes the operating model underneath it. A Kafka-compatible shared-storage platform such as AutoMQ can be relevant when CDC or streaming ETL workloads are constrained by broker-local storage, long retention, partition movement, replay cost, or cloud elasticity.