MSK Connect and Replicator Cost Checks for Data Platform Teams

Teams usually search for “MSK Cost” after the first Kafka bill stops matching the mental model. The cluster may still look right on paper: three brokers, sensible retention, enough partitions, and a managed control plane. Then a new line item appears because data is being moved out of the cluster, copied into another region, processed by a connector, or routed through a private connectivity path. At that point, the question is no longer “How much does a broker cost?” It is “Which data paths are we paying for every time a record moves?”

That distinction matters because Amazon MSK is often only one part of the production streaming footprint. MSK Connect runs managed Kafka Connect workers for source and sink integration. MSK Replicator moves data between MSK clusters for same-region or cross-region replication. Private connectivity, cross-region transfer, storage throughput, and consumer fan-out can all sit around the cluster. Each service is reasonable in isolation, but the combined bill follows workload shape more than cluster size.

The practical way to evaluate MSK cost is to separate three layers: the base Kafka cluster, the data movement services around it, and the cloud network boundary those services cross. This gives platform, SRE, and FinOps teams a shared worksheet. It also keeps the architecture discussion even-handed. A managed Kafka service can be the right answer for many AWS-native teams, while a Kafka-compatible shared-storage engine can be worth evaluating when the expensive part is repeated data movement rather than Kafka protocol support.

The Cost Unit Is the Data Path

The base MSK cluster has visible cost drivers: broker instance hours, storage, optional provisioned storage throughput, and, for some modes, data in or data out. Those are the line items most teams model first because they map directly to capacity planning. If write throughput grows, broker and storage planning changes. If retention grows, storage planning changes. If partition count grows, the operational envelope changes.

MSK Connect and MSK Replicator add a different pattern. MSK Connect is priced around connector workers and MSK Connect Units, or MCUs. AWS documentation describes each MCU as 1 vCPU and 4 GiB of memory, with connector capacity determined by workers multiplied by MCUs per worker. MSK Replicator pricing is based on data processed, and for cross-region replication AWS also applies standard cross-region data transfer charges. In active-active configurations using identical topic names, AWS notes that each Replicator can process twice the usual data because it must filter records to prevent loops.
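
The capacity arithmetic described above (workers multiplied by MCUs per worker, with each MCU taken as 1 vCPU and 4 GiB of memory) can be sketched in a few lines of Python. The class name `ConnectorCapacity` and its fields are hypothetical names chosen for illustration, not part of any AWS SDK.

```python
from dataclasses import dataclass

# Per-MCU resources as described in the text (1 vCPU, 4 GiB of memory).
MCU_VCPUS = 1
MCU_MEMORY_GIB = 4


@dataclass(frozen=True)
class ConnectorCapacity:
    """Hypothetical helper: total capacity of one MSK Connect connector."""
    workers: int
    mcus_per_worker: int

    @property
    def total_mcus(self) -> int:
        # Connector capacity = workers x MCUs per worker.
        return self.workers * self.mcus_per_worker

    @property
    def total_vcpus(self) -> int:
        return self.total_mcus * MCU_VCPUS

    @property
    def total_memory_gib(self) -> int:
        return self.total_mcus * MCU_MEMORY_GIB


capacity = ConnectorCapacity(workers=3, mcus_per_worker=2)
print(capacity.total_mcus, capacity.total_vcpus, capacity.total_memory_gib)  # 6 6 24
```

The point of writing it down is that worker count and MCUs per worker are separate inputs: two connectors with the same total MCUs can still behave differently if one spreads them across more workers.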

Those details turn a pricing review into a topology review. A connector that scales during business hours has a time-weighted compute profile. A replication flow that copies topics across regions has a throughput profile. A private connectivity path has a network-processing profile. All three may be attached to the same Kafka estate, but they do not scale in the same way.

Cost surface	Primary question	What to measure before estimating
MSK cluster	What capacity must stay online?	Broker type, broker count, storage, throughput, partition count, retention
MSK Connect	How many workers run for how long?	Worker count, MCUs per worker, autoscaling range, task parallelism, connector throughput
MSK Replicator	How much data is processed and where does it go?	Source throughput, replicated topics, filters, same-region or cross-region path
Network boundary	Which bytes cross billable boundaries?	Cross-AZ, cross-VPC, PrivateLink, cross-region transfer

The table is deliberately plain. Cost mistakes usually happen when teams skip one row, not when they fail to build a detailed calculator. If the connector is low cost but causes additional read traffic on the source cluster, the cluster row changes. If replication is affordable at steady state but doubles during backfill, the Replicator row changes. If the target VPC lives behind a private connectivity pattern, the network row changes.

Connect Cost Checks: Workers, MCUs, and Backpressure

MSK Connect is attractive because it removes a lot of undifferentiated Kafka Connect operations. You do not have to run a separate Connect cluster or patch worker hosts. The trade is that worker capacity becomes a managed service input, and that input has to be sized against connector behavior rather than average Kafka throughput.

Start with the connector’s limiting resource. Some connectors are CPU-bound because they serialize, transform, or encrypt records. Some are memory-sensitive because batching and buffering dominate. Others are limited by the external system: object layout, JDBC writes, SaaS API quotas, or search indexing. Increasing workers helps only when the connector can parallelize the work. If the source topic has too few partitions or the sink imposes strict ordering, more workers can raise cost without raising useful throughput.

Capacity mode is the next check. Provisioned capacity is easier to model because workers and MCUs are explicit. Autoscaled capacity is better for variable load, but the cost model must use time spent at each worker level, not the average throughput for the month. AWS’s MSK Connect pricing example follows that pattern: model worker-hours first, then apply current regional pricing.
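
A minimal sketch of the time-weighted model for autoscaled capacity follows. It assumes billing can be approximated as MCU-hours multiplied by a per-MCU-hour rate; `PRICE_PER_MCU_HOUR` is a placeholder value, not a real AWS price, and the function names and schedule are illustrative.

```python
from typing import List, Tuple

# Placeholder rate, NOT a real AWS price. Use the current regional value
# from the official pricing page.
PRICE_PER_MCU_HOUR = 0.10


def mcu_hours(schedule: List[Tuple[int, float]], mcus_per_worker: int) -> float:
    """Sum MCU-hours over (worker_count, hours_at_that_count) pairs."""
    return sum(workers * hours for workers, hours in schedule) * mcus_per_worker


def connect_cost(schedule: List[Tuple[int, float]], mcus_per_worker: int,
                 price_per_mcu_hour: float = PRICE_PER_MCU_HOUR) -> float:
    """Estimated cost under the assumed per-MCU-hour rate."""
    return mcu_hours(schedule, mcus_per_worker) * price_per_mcu_hour


# A 720-hour month: 2 workers off-peak for 500 h, 6 workers at peak for 220 h.
schedule = [(2, 500), (6, 220)]
total = mcu_hours(schedule, mcus_per_worker=1)
print(total)                  # 2320
print(round(total / 720, 2))  # 3.22 time-weighted average workers
```

In this example the connector spends most of the month at two workers, yet the time-weighted average is above three. An estimate built from the off-peak level alone would undercount by roughly a third.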

There are also operational constraints hiding behind the price input:

Worker IP usage matters. AWS documents that MSK Connect workers consume IP addresses in customer-provided subnets. A connector fleet that autoscales across many teams can turn subnet capacity into a deployment constraint.
tasks.max is not a pure cost knob. In autoscaled mode, MSK Connect can override the connector’s task allocation based on worker capacity. The limit should follow topic partitioning and external-system parallelism, not an arbitrary budget target.
Backfills should be modeled separately from steady state. A connector that looks small during incremental processing may need a different profile when a new sink is bootstrapped from existing topics.
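
The point about tasks.max can be illustrated with a small sketch for a sink connector. It assumes useful parallelism is bounded by the smallest of three numbers: topic partitions, the configured task limit, and the number of parallel writers the external system accepts. The function names and the `sink_parallel_limit` parameter are hypothetical.

```python
def effective_parallelism(partitions: int, tasks_max: int,
                          sink_parallel_limit: int) -> int:
    """Useful parallel tasks for a sink connector under the stated assumption."""
    return min(partitions, tasks_max, sink_parallel_limit)


def idle_tasks(partitions: int, tasks_max: int, sink_parallel_limit: int) -> int:
    """Tasks that are configured but cannot add throughput."""
    useful = effective_parallelism(partitions, tasks_max, sink_parallel_limit)
    return max(0, tasks_max - useful)


# 12 partitions, tasks.max of 32, and a sink that accepts 8 parallel writers.
print(effective_parallelism(12, 32, 8))  # 8
print(idle_tasks(12, 32, 8))             # 24
```

If the idle count is large, adding workers to host those tasks raises cost without raising useful throughput.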

The cleanest FinOps artifact is a connector inventory with five columns: connector name, source and sink, steady-state worker profile, backfill worker profile, and downstream service limit. That inventory turns cost review into an engineering conversation. If a sink cannot accept more parallel writes, the answer is not more MCUs. If a connector is oversized because it once handled migration, the answer may be a smaller steady-state profile.
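
The five-column inventory could be kept as structured data so the review questions become simple checks. The sketch below is one possible shape, not a prescribed schema. The field names, the example connector, and the assumption that the downstream limit is expressed in the same unit as workers are all illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConnectorEntry:
    name: str
    source_and_sink: str
    steady_state_workers: int
    backfill_workers: int
    downstream_limit: int  # assumed: max workers the sink can usefully absorb


def review(entry: ConnectorEntry) -> List[str]:
    """Return plain-language findings for one inventory row."""
    findings = []
    if entry.steady_state_workers > entry.downstream_limit:
        findings.append("steady state exceeds downstream limit; more MCUs will not help")
    if entry.steady_state_workers >= entry.backfill_workers:
        findings.append("steady state is as large as backfill; check for leftover migration sizing")
    return findings


entry = ConnectorEntry(
    name="orders-to-object-store",  # hypothetical connector
    source_and_sink="orders topic -> object storage",
    steady_state_workers=6,
    backfill_workers=6,
    downstream_limit=4,
)
for finding in review(entry):
    print(f"{entry.name}: {finding}")
```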

Replicator Cost Checks: Direction, Filters, and Recovery Goals

MSK Replicator is often introduced for a good reason: regional resilience, migration, data locality, or platform consolidation. The cost review should not challenge the need for replication. It should make the replication objective precise enough that the data movement bill matches the business requirement.

Three questions prevent most surprises. First, is the flow same-region or cross-region? AWS documentation states that cross-region replication also incurs cross-region data transfer charges, while same-region replication does not have that specific charge. Second, how much data is processed rather than only delivered to the target? Filtering and active-active loop prevention can change the processed byte count. Third, is replication continuous, temporary, or bursty?

Replication also changes the recovery model. A target cluster that receives every topic but has no tested consumer failover path is not a recovery system; it is a copy. A target cluster that can be promoted but lags under peak traffic may satisfy analytics locality but not operational continuity. Cost and readiness have to be checked together.

The right replication design is the smallest one that satisfies the recovery objective, data locality requirement, and rollback plan. Anything broader should be justified as platform simplification, not hidden inside a default “replicate everything” policy.

For active-active designs, be explicit about topic naming and loop prevention. AWS notes that identical topic name replication can cause each Replicator to process additional data for filtering. That may still be the correct design, but the extra processing belongs in the estimate. For migration designs, separate the one-time copy from the post-cutover steady state.
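
The three questions and the active-active note can be combined into a rough estimator. This is a sketch under stated assumptions: both rates are placeholders rather than real AWS prices, identical-name active-active replication is modeled as doubling processed data (following the AWS note cited above), and transferred bytes are assumed to equal replicated bytes. The function and parameter names are hypothetical.

```python
SECONDS_PER_30_DAYS = 30 * 24 * 3600

# Placeholder rates, NOT real AWS prices.
PRICE_PER_GB_PROCESSED = 0.10
PRICE_PER_GB_CROSS_REGION = 0.02


def replicator_estimate(source_mb_per_s: float, replicated_fraction: float,
                        cross_region: bool,
                        active_active_same_names: bool = False) -> dict:
    """Estimate monthly GB replicated, processed, and transferred, plus cost."""
    replicated_gb = source_mb_per_s * replicated_fraction * SECONDS_PER_30_DAYS / 1000
    # Assumption: loop-prevention filtering doubles processed data.
    processed_gb = replicated_gb * (2 if active_active_same_names else 1)
    # Assumption: only cross-region flows incur the transfer charge.
    transfer_gb = replicated_gb if cross_region else 0.0
    cost = (processed_gb * PRICE_PER_GB_PROCESSED
            + transfer_gb * PRICE_PER_GB_CROSS_REGION)
    return {"replicated_gb": replicated_gb, "processed_gb": processed_gb,
            "transfer_gb": transfer_gb, "cost": cost}


# 10 MB/s source, 25% of topics replicated, cross-region, identical topic names.
result = replicator_estimate(10, 0.25, cross_region=True,
                             active_active_same_names=True)
print(result["replicated_gb"], result["processed_gb"], result["transfer_gb"])
# 6480.0 12960.0 6480.0
```

Running the same inputs with a smaller `replicated_fraction` shows how much of the bill comes from topic scope rather than from the replication decision itself.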

Network Boundaries Decide Whether the Estimate Holds

Many MSK cost reviews focus on the managed service page and miss the network path. That is understandable because the network path is usually spread across VPC, EC2 data transfer, PrivateLink, NAT, and service-specific pages. It is also where a technically correct Kafka design can become financially noisy.

AWS pricing pages show the pattern: data processing and data transfer charges can apply when traffic crosses boundaries such as NAT gateways, VPC peering across Availability Zones, multi-VPC private connectivity, or cross-region transfer. MSK Replicator’s pricing page explicitly includes data transfer for cross-region replication. None of this is unusual for cloud infrastructure, but Kafka amplifies it because the same record may be written, replicated, read, replayed, and copied again.

The review should trace bytes from the producer to the final consumer, not from the producer to the broker. A useful worksheet looks like this:

Path	Charge risk	Engineering check
Producer to cluster	Cross-AZ or cross-VPC writes	Are clients pinned or routed to same-AZ endpoints where possible?
Broker replication	Inter-broker data movement	How does the architecture store durable copies, and where do replicas live?
Connector reads	Extra source reads	Does each sink create a new full read path from Kafka?
Region replication	Cross-region transfer and processing	Is replication topic-scoped and recovery-tested?
Consumer replay	Read fan-out and retained data access	How often do consumers replay historical data?
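
The worksheet rows can be turned into a rough byte count per path. The sketch below only counts how many times written data moves; it does not decide which movements are billable, since that depends on where clients, brokers, and targets sit. The parameter names and the example figures are assumptions.

```python
def byte_paths(write_gb: float, replication_factor: int, sink_connectors: int,
               consumer_groups: int, replicated_fraction: float,
               replay_gb: float = 0.0) -> dict:
    """GB moved on each worksheet path for a given volume of written data."""
    return {
        "producer_to_cluster": write_gb,
        # Each extra replica is one more copy moved between brokers.
        "broker_replication": write_gb * (replication_factor - 1),
        # Assumes each sink connector reads the full written volume once.
        "connector_reads": write_gb * sink_connectors,
        "region_replication": write_gb * replicated_fraction,
        "consumer_reads_and_replay": write_gb * consumer_groups + replay_gb,
    }


paths = byte_paths(write_gb=1000, replication_factor=3, sink_connectors=2,
                   consumer_groups=3, replicated_fraction=0.25, replay_gb=500)
total_gb = sum(paths.values())
print(total_gb)         # 8750.0
print(total_gb / 1000)  # 8.75 GB moved per GB written
```

In this example every gigabyte written is moved almost nine times, which is why tracing only the producer-to-broker hop underestimates the network row.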

This worksheet is also where architecture alternatives become concrete. Traditional Kafka and managed Kafka deployments usually rely on broker-local storage and replication across brokers for durability. Tiered storage can reduce the amount of older data kept on primary broker storage, but the broker still owns the hot log and replication path. Shared-storage Kafka-compatible systems change the equation by moving durable log storage into a shared storage layer and making brokers more stateless. That does not remove every cost, but it changes which bytes have to move between compute nodes.

Where AutoMQ Fits the Evaluation

After the data paths are visible, AutoMQ becomes relevant as an architectural option rather than a slogan. AutoMQ is a Kafka-compatible cloud-native streaming system that replaces Kafka’s broker-local log storage with the S3Stream shared-storage architecture. Its documentation describes a design where brokers become stateless, object storage is the primary durable repository, and a WAL layer handles write efficiency before data is stored in object storage.

That design matters most when the cost problem is tied to cloud replication and operations. If every broker stores local durable replicas, the platform pays for storage, movement, and rebalancing in the broker layer. If durable data lives in shared object storage and brokers are stateless compute, scaling and recovery can be handled with less broker-to-broker data movement. AutoMQ’s documentation also describes AZ-aware routing patterns intended to reduce cross-AZ traffic in multi-AZ deployments.

This does not mean every MSK estate should migrate. A small AWS-native deployment with stable throughput and a few managed connectors may be better served by staying close to the existing managed service. The evaluation changes when one or more of these conditions are true:

The largest bill items are data movement, replication, and retained-data access rather than base broker hours.
Platform teams need faster scaling or recovery than broker-local data movement allows.
Multiple teams run connectors and replication flows without a central inventory of byte paths.
Cross-AZ or cross-region traffic is becoming a recurring FinOps issue.
Kafka API compatibility is required, but the storage architecture is open for reconsideration.

In those cases, compare MSK and AutoMQ with the same workload inputs: write throughput, read fan-out, retention, partition count, connector inventory, replication topology, recovery target, and network boundary. The result is more useful than a generic vendor comparison because it shows which architecture pays for which movement.

A Practical Cost Review Sequence

The most reliable review sequence is intentionally boring. First, freeze the workload assumptions for one representative month: write throughput, read fan-out, retention, partitions, and expected backfills. Second, inventory all Connect and Replicator flows, including temporary migration links. Third, draw the network path across AZs, VPCs, accounts, and regions. Fourth, apply current regional pricing from the official pages. Fifth, rerun the model for failure and migration events, not only steady state.
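
The fifth step, rerunning the model for failure and migration events, can be sketched as a steady-state set of cost lines plus per-event deltas. All amounts below are made-up placeholders in an arbitrary currency unit, and the line names and event names are illustrative.

```python
def total(lines: dict) -> float:
    """Sum all cost lines for one scenario."""
    return sum(lines.values())


def with_event(steady: dict, event_deltas: dict) -> dict:
    """Return a copy of the steady-state lines with an event's deltas added."""
    merged = dict(steady)
    for line, delta in event_deltas.items():
        merged[line] = merged.get(line, 0) + delta
    return merged


# Placeholder monthly figures, not derived from real pricing.
steady = {"cluster": 4000, "connect": 600, "replicator": 900, "network": 700}
events = {
    "sink backfill": {"connect": 450, "cluster": 200},
    "regional failover test": {"replicator": 300, "network": 500},
}

print("steady state", total(steady))  # steady state 6200
for name, deltas in events.items():
    print(name, total(with_event(steady, deltas)))
# sink backfill 6850
# regional failover test 7000
```

Keeping events as separate deltas makes it easy to show which months need a temporary budget without inflating the assumed run rate.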

That last step is where teams often find the gap. A steady-state connector profile may be fine, while a backfill profile needs a temporary budget. A cross-region Replicator may be acceptable for selected operational topics, while replicating every analytics topic may fail the cost test. A private connectivity pattern may be required for governance, but the consuming team should see the per-GB implication before it becomes a shared platform bill.

The output should be a decision record, not only a spreadsheet. It should state which flows are required, optional, temporary, or candidates for architectural change. That record gives procurement, FinOps, SRE, and application teams the same language. It also prevents a familiar failure mode: optimizing broker size while ignoring the data paths that move the bill.

If your review shows that broker-local replication, cross-AZ movement, or connector sprawl is the cost driver, evaluate whether a Kafka-compatible shared-storage architecture changes the curve. AutoMQ’s documentation is a good next step for that comparison: review the architecture and deployment options with your own throughput, retention, and network assumptions.

References

Amazon MSK pricing: https://aws.amazon.com/msk/pricing/
Amazon MSK Connect capacity documentation: https://docs.aws.amazon.com/msk/latest/developerguide/msk-connect-capacity.html
Amazon MSK Replicator pricing documentation: https://docs.aws.amazon.com/msk/latest/developerguide/msk-replicator-pricing.html
AWS VPC pricing: https://aws.amazon.com/vpc/pricing/
Apache Kafka Connect documentation: https://kafka.apache.org/documentation/#connect
Apache Kafka replication documentation: https://kafka.apache.org/documentation/#replication
AutoMQ shared storage architecture: https://docs.automq.com/automq/architecture/overview?utm_source=blog&utm_medium=reference&utm_campaign=gs100-0085
AutoMQ cross-AZ traffic cost guide: https://docs.automq.com/automq-cloud/best-practice/save-cross-az-traffic-costs-with-automq?utm_source=blog&utm_medium=reference&utm_campaign=gs100-0085

FAQ

Is MSK Connect billed by connector throughput?

MSK Connect pricing is centered on worker usage and MSK Connect Units rather than a single connector-throughput number. Throughput still matters because it determines how many workers, MCUs, and tasks you need, and how long autoscaled capacity stays at each level.

What makes MSK Replicator cost hard to estimate?

The hard part is processed data, not only replicated data. Same-region and cross-region designs have different network implications, filters can change the processed byte count, and active-active configurations may require extra processing to prevent loops.

Should teams include backfills in monthly estimates?

Yes. Backfills, migrations, and replay events can dominate a billing period even when steady-state streaming looks modest. Model them separately so temporary migration capacity does not become the assumed steady-state run rate.

Does tiered storage remove Kafka replication cost?

Tiered storage can reduce the amount of older data held in primary broker storage, but it does not automatically remove the broker-local hot log, inter-broker replication, or client read paths. Teams should evaluate the hot path and retained-data path separately.

When should AutoMQ be evaluated alongside MSK?

AutoMQ is worth evaluating when Kafka compatibility is required but the main pain is cloud data movement, broker-local storage operations, cross-AZ traffic, or slow scaling tied to local data. The comparison should use the same workload and network assumptions for both architectures.
