Blog

A Platform Engineer's Guide to CDC-to-table Pipelines

Teams do not search for cdc to table pipeline kafka because they need another diagram with a database on the left and a table format on the right. They search for it when a change data capture pipeline has become part of the production contract. Product analytics, risk systems, lakehouse tables, operational search, and machine learning features are reading the same stream, and the platform team has to keep offsets, schema changes, retention, replay, and cloud cost under control at the same time.

That pressure changes the question. A CDC-to-table pipeline moves row changes from operational databases into Apache Iceberg, Delta Lake, or another table format, but the harder decision is where the durable streaming log lives. The team also has to decide how long it can retain change history, how failures are recovered, and who owns the boundary between database logs, Kafka topics, stream processing jobs, table commits, and governance.

The strongest design starts with a plain thesis: Kafka remains a good coordination layer for CDC-to-table workloads, but the storage architecture behind Kafka determines whether the pipeline stays elastic as data volumes, retention windows, and replay traffic grow.

Why teams search for cdc to table pipeline kafka

The search intent is practical. A platform engineer wants to know whether Kafka should sit between a CDC source connector and a table sink, whether Flink should own the table writes, whether Kafka Connect is enough, and how to keep downstream tables correct when the upstream database emits updates, deletes, transaction boundaries, and schema changes. The answer depends less on one tool choice and more on the guarantees the team needs to preserve.

A production CDC-to-table architecture has to manage several contracts at once:

  • Ordering and offsets. Kafka partitions provide ordered logs, and consumers track offsets as they process records. Table writers must translate those offsets into checkpointed commits so a retry does not create gaps or duplicate table files.
  • Consumer group behavior. Kafka consumer groups distribute partitions across members. That is useful for scaling, but the table sink still needs deterministic commit logic when group membership changes.
  • Schema and governance. CDC records carry database schema evolution into the streaming layer. The table layer needs compatible evolution rules, catalog ownership, access control, and auditability.
  • Replay and retention. A table backfill or repair job may need to reread hours, days, or months of change history. That makes Kafka retention a first-class design input, not a cleanup setting.
  • Failure boundaries. Connectors, brokers, processors, catalogs, buckets, and network paths can fail independently. The runbook must say where recovery restarts and what state is authoritative.

These contracts are why Kafka is still attractive. Its protocol, topic model, offsets, transactions, and Kafka Connect ecosystem give CDC pipelines a common language for capture, fan-out, processing, and replay. But using Kafka as the center of the design also means inheriting Kafka's operational assumptions.

The production constraint behind the problem

Traditional Apache Kafka uses a Shared Nothing architecture: each broker owns local storage, and partitions are replicated between brokers for durability and availability. This design is coherent and battle-tested. It keeps recent data close to the broker that serves it, and Kafka's replication model gives operators a clear mental model for leader, follower, in-sync replicas, and failure recovery.

CDC-to-table workloads stress the parts of that model that bind durable bytes to broker-local storage. A table repair may trigger catch-up reads from old offsets. A schema bug may require replaying several topics. A retention policy may expand for audit needs. Each event looks different to the application team, but the Kafka cluster sees a familiar shape: more retained data, more replay traffic, more partitions to balance, and more careful capacity planning.

In a broker-local design, capacity changes are not pure compute changes. Adding brokers can require partition reassignment and data movement. Replacing unhealthy nodes requires knowing which replicas were local to that node. Increasing retention changes disk sizing and recovery exposure. These are manageable problems, but they become part of the CDC-to-table product surface.

Shared Nothing vs Shared Storage Operating Model

Tiered Storage helps some Kafka deployments by moving older log segments to object storage while recent data remains on broker-local disks. The trade-off is architectural: tiering still leaves the broker-local log in the primary write and recovery path. For a CDC-to-table platform with frequent replay and table repair workflows, the question becomes whether tiering is enough or whether the durable log should be designed around shared storage.

Architecture options and trade-offs

Platform teams usually land on one of four patterns. The right answer depends on ownership boundaries as much as technology. A central data platform usually needs a stronger contract for retention, governance, rollback, and multi-tenant operation.

OptionWhen it fitsMain trade-off
CDC connector directly to table storageNarrow use case, limited fan-out, one owning teamLess reusable event stream; harder to support multiple consumers and replay paths
CDC connector to Kafka, then Kafka Connect sinkFamiliar Kafka operations and connector ecosystemSink correctness, schema evolution, and table commits still need careful validation
CDC connector to Kafka, stream processor to tableComplex transformations, enrichment, deduplication, and exactly-once-style workflowsMore moving parts: checkpoints, state, job upgrades, and table commit coordination
Kafka-compatible shared-storage streaming layerLong retention, elastic capacity, replay-heavy operations, and cloud-native ownershipRequires validating the storage implementation, WAL profile, observability, and migration plan

Production systems rarely stay in one box. A team may start with Kafka Connect and later add Flink for deduplication. A platform may keep raw CDC topics for audit and write curated Iceberg tables for analytics. The key is to separate the pipeline contract from the first implementation.

Evaluation should begin with failure and growth scenarios rather than tool names. What happens when the sink job commits files but fails before committing Kafka offsets? Can the team replay a topic without starving low-latency consumers? Can a broker replacement happen during a table repair window? If the answers depend on heroic manual coordination, the architecture is still a prototype.

CDC to Table Pipeline Kafka Decision Map

Evaluation checklist for platform teams

A CDC-to-table pipeline deserves the same design review as any production data platform. The Kafka cluster is one part of that review, but it often decides whether the rest of the system has room to recover.

AreaQuestions to answerWhy it matters
CompatibilityWhich Kafka clients, connectors, authentication modes, ACLs, and admin workflows must continue to work?CDC platforms usually inherit existing Kafka clients and operational tooling.
Commit semanticsWhere are Kafka offsets, processor checkpoints, table snapshots, and catalog commits coordinated?The pipeline must recover without corrupting table state or losing change records.
Retention and replayHow much raw CDC history is kept, and who is allowed to trigger replay?Table repair and audit workflows can turn cold history into active workload.
ScalingCan compute scale independently from retained data volume?CDC bursts and table maintenance windows often do not grow at the same rate.
Cost modelWhich costs come from compute, storage, object requests, PrivateLink or VPC endpoints, and inter-AZ traffic?A pipeline that is cost-effective at steady state can surprise teams during replay.
Security boundaryWhich account, VPC, bucket, catalog, key, and IAM role owns the data path?CDC records often contain sensitive operational data and require clear ownership.
ObservabilityAre connector lag, consumer lag, table commit latency, broker health, object storage errors, and replay throughput visible together?Split dashboards hide the moment when streaming and table systems drift apart.
RollbackCan producers, consumers, sinks, and tables return to the previous path without offset confusion?Migration risk is mostly state risk, not endpoint risk.

This checklist should produce a scorecard, not a yes-or-no answer. A local-storage Kafka cluster may be acceptable if retention is short, replay is rare, and the team already has mature capacity automation. A shared-storage design becomes more compelling when the pipeline needs long retention, elastic capacity, customer-controlled cloud boundaries, and predictable recovery from broker changes.

The important shift is to treat Kafka storage as an architectural decision, not a background implementation detail.

How shared storage changes the operating model

Once the evaluation reaches storage architecture, AutoMQ enters as a Kafka-compatible shared-storage option rather than as a different pipeline pattern. AutoMQ keeps Kafka protocol compatibility while replacing broker-local log storage with S3Stream, a shared streaming storage layer backed by WAL storage, Data caching, and S3-compatible object storage.

That distinction matters because the hard operational questions are often about retained bytes, not request routing. In AutoMQ's Shared Storage architecture, AutoMQ Brokers are stateless with respect to durable partition data. Object storage is the primary repository, and WAL storage absorbs the latency and I/O mismatch between streaming writes and object storage APIs. Broker replacement, scaling, and partition reassignment no longer require the same broker-local data movement that dominates many Shared Nothing operations.

For a platform engineer, this changes the checklist in three ways. First, long-retention CDC topics can be evaluated against object-storage-backed durability instead of broker disk envelopes. Second, scaling decisions can focus more directly on broker compute, cache, and throughput rather than retained data placement. Third, BYOC-style deployment boundaries become easier to reason about because the data path can remain inside the customer's cloud account, VPC, object storage buckets, and IAM model.

AutoMQ also has a table-oriented feature to evaluate. Table Topic is designed to stream data into Apache Iceberg tables without operating a separate ETL pipeline for that path. It is not a replacement for every Flink or Kafka Connect use case; transformations and enrichment may still belong in a stream processor. But when the first goal is to land Kafka-compatible streams into governed table storage, Table Topic is worth testing.

A CDC-to-table system cannot be moved by changing a bootstrap address and hoping state follows. The source cluster, target cluster, producer path, consumer groups, offsets, and rollback plan all have to be explicit. AutoMQ Kafka Linking is relevant because it is designed for migration from Apache Kafka-compatible sources to AutoMQ while preserving topic data and consumer progress.

Readiness Checklist

A practical rollout path

The safest rollout starts with raw CDC topics, not table sinks. Pick one representative database source, one high-value topic, and one downstream table workload. Measure connector lag, broker throughput, consumer lag, table commit latency, object storage behavior, and replay speed under normal load and under a repair scenario. That scenario reveals whether the design can recover from bad schema mapping, missed deletes, incorrect primary keys, or a failed sink deployment.

Then test migration and rollback before treating the new platform as a production dependency. Create the target topics, synchronize the needed data, validate offsets and consumer group behavior, run the table writer against the target path, and rehearse promotion. A good runbook names the authoritative state at every point: source connector position, Kafka offset, processor checkpoint, table snapshot, and catalog commit.

Finally, make the cost model observable. Track steady-state write traffic, retention growth, replay volume, object storage requests, endpoint traffic, and inter-AZ paths as separate signals. CDC-to-table pipelines are long-lived infrastructure; their cost profile is shaped by rare but necessary operations, not only by average ingestion.

If your team is evaluating a Kafka-compatible foundation for CDC-to-table pipelines, review AutoMQ's architecture and run a workload-specific test in your own cloud boundary: start from AutoMQ BYOC. The useful result is a scorecard showing how your connectors, offsets, table commits, replay jobs, and runbooks behave together.

FAQ

Is Kafka required for every CDC-to-table pipeline?

No. A direct CDC-to-table path can work when one team owns the whole workflow and there is little need for fan-out, replay, or reuse. Kafka becomes valuable when multiple consumers need the same change stream, when raw history must be retained, or when processing and table commits need a durable buffer between systems.

Kafka Connect is a good fit for connector-centered pipelines with limited transformation logic. Flink is usually a better fit when the pipeline needs stateful processing, enrichment, deduplication, windowing, or stronger checkpoint coordination with table commits. Many production platforms use both for different classes of workload.

How does shared storage affect CDC replay?

Shared storage does not remove the need to test replay behavior, but it changes the operating model. Retained data is no longer primarily tied to broker-local disks, so replay-heavy workloads can be evaluated around object storage, cache behavior, consumer throughput, and table writer capacity instead of broker disk placement alone.

Where does AutoMQ Table Topic fit?

Table Topic fits when the goal is to land Kafka-compatible stream data into Apache Iceberg tables with less ETL infrastructure. It should still be tested against schema evolution, catalog integration, table maintenance, and workload-specific latency requirements.

References

Newsletter

Subscribe for the latest on cloud-native streaming data infrastructure, product launches, technical insights, and efficiency optimizations from the AutoMQ team.

Join developers worldwide who leverage AutoMQ's Apache 2.0 licensed platform to simplify streaming data infra. No spam, just actionable content.

I'm not a robot
reCAPTCHA

Never submit confidential or sensitive data (API keys, passwords, credit card numbers, or personal identification information) through this form.