How to Replace Amazon MSK When Scaling Costs Hit a Wall

Amazon MSK is often the sensible first move for Kafka on AWS. It removes a large part of the undifferentiated cluster-management burden, keeps teams inside the AWS operating model, and gives application developers the Kafka interface they already understand. If your workload fits the shape of MSK, replacing it is usually a distraction.

The search for an Amazon MSK replacement starts when that fit breaks. A cluster that looked clean at launch can become expensive once retention grows, cross-AZ traffic becomes visible, peak capacity sits idle most of the day, or scaling a busy Kafka estate turns into a rebalancing project. At that point, the question is no longer "Can AWS run Kafka for us?" It is "Is the Kafka storage and scaling model still aligned with the workload?"

This distinction keeps the replacement decision honest. MSK can still be the right platform. But when the pressure comes from storage, data movement, and elasticity rather than basic administration, another round of tuning may only delay the architecture conversation.

When MSK Is Still the Right Choice

MSK deserves a fair defense before anyone starts planning a migration. It is AWS-native, integrates with familiar security and networking patterns, and supports multiple deployment choices. As of May 20, 2026, AWS positions MSK Provisioned with Standard and Express brokers, MSK Serverless for capacity-abstracted workloads, and related managed services such as MSK Connect and MSK Replicator. Express brokers are especially important to acknowledge because AWS describes them as purpose-built for higher throughput per broker and faster scaling than Standard brokers.

That means a replacement is not the default answer to every MSK complaint. If the issue is a small configuration mistake, an outdated broker type, an undersized cluster, or a retention policy that nobody has revisited, staying on MSK may be the lower-risk path. The same is true when the team strongly prefers AWS-managed operations and the workload has predictable throughput, moderate retention, and stable consumer behavior.

MSK is usually still a strong fit when:

The team wants a managed Kafka service inside AWS and is comfortable with AWS's cluster model.
The workload is steady enough that provisioned capacity does not create a large idle-capacity gap.
Retention is modest, so local broker storage does not dominate the bill.
Scaling events are infrequent, planned, and tolerable within the team's maintenance process.
The organization values AWS-native service ownership more than changing Kafka's storage architecture.

Those are not small advantages. They are the reason many teams start with MSK in the first place. Replacement only becomes interesting when the recurring pain is structural.

The Symptoms That Point Beyond Tuning

Most teams do not wake up and decide to move off MSK because a comparison table looked persuasive. They arrive there through repeated symptoms that tuning did not remove. The same issues show up in cost reviews, incident retrospectives, and capacity-planning meetings: the cluster is healthy enough to run, but expensive or slow to change in ways the business can feel.

Symptom	What it usually means	Why tuning may not be enough
Storage keeps growing faster than traffic	Retention, replay, audit, or log workloads are driving retained bytes	Changing broker size does not change the fact that Kafka data is tied to broker storage
Cross-AZ data movement is hard to explain	Producers, consumers, replication, or dependent services cross AZ boundaries	Placement and client routing help, but multi-AZ Kafka still replicates every write across AZs over the network
Scaling creates operational drag	Partition movement, capacity changes, or broker replacement take too much coordination	Stateful brokers make scaling a data movement problem, not only a compute problem
Peak capacity sits idle	Traffic spikes require provisioning for the high-water mark	Provisioned clusters often pay for readiness even when demand drops
Rebalancing is avoided until it hurts	The team fears operational risk from partition reassignment	Avoiding rebalancing can leave hotspots and wasted capacity in place

These symptoms point to the same underlying pattern: traditional Kafka binds compute, storage, and partition ownership tightly to brokers. MSK manages that model for you, but it does not remove the model. If retained data lives on broker-attached storage and partition movement requires data movement, scaling remains coupled to state.

That is why an MSK replacement decision should start with diagnosis. If the pain comes from a wrong broker class or a missing lifecycle policy, optimize MSK first. If it comes from the cost of retained data, the blast radius of rebalancing, and the inability to scale compute separately from storage, the team should evaluate a Kafka-compatible platform with a different storage architecture.

Optimize MSK First: What to Try

Before replacing Amazon MSK, exhaust the changes that keep the current platform intact. This is not busywork. It gives the team a cleaner baseline and prevents a migration from becoming a cover story for unexamined workload design.

Start with the workload shape. Retention, partition count, ingress rate, egress fan-out, compression ratio, and peak-to-average traffic ratio matter more than the cluster label on an invoice. Then map those numbers to the current MSK deployment choice. Provisioned Standard, Provisioned Express, Serverless, tiered storage, Connect, and Replicator each move a different part of the tradeoff, and AWS pricing is split across broker usage, storage, writes, throughput, data transfer, and managed add-on services depending on the selected mode.
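A minimal sketch of that first step, assuming the workload can be summarized by a handful of averages. The `WorkloadShape` class and its field names are hypothetical, the sample numbers are illustrative rather than measured, and the storage figure assumes all retained data sits on broker storage with no tiering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadShape:
    """Hypothetical container for the workload inputs discussed above."""
    avg_ingress_mb_s: float    # average compressed write rate, MB/s
    peak_ingress_mb_s: float   # highest sustained write rate, MB/s
    retention_hours: float     # topic retention window
    replication_factor: int    # Kafka replicas per partition
    consumer_fanout: float     # average number of times each byte is read

    def retained_gb(self) -> float:
        # Logical data inside the retention window, counted once.
        return self.avg_ingress_mb_s * 3600 * self.retention_hours / 1000

    def replicated_gb(self) -> float:
        # Broker-attached storage holds one copy per replica.
        return self.retained_gb() * self.replication_factor

    def peak_to_average(self) -> float:
        # A high ratio means provisioned capacity idles most of the day.
        return self.peak_ingress_mb_s / self.avg_ingress_mb_s

    def avg_egress_mb_s(self) -> float:
        # Read traffic scales with how many consumers read each byte.
        return self.avg_ingress_mb_s * self.consumer_fanout


# Illustrative inputs only; replace with measured values.
shape = WorkloadShape(
    avg_ingress_mb_s=50,
    peak_ingress_mb_s=200,
    retention_hours=72,
    replication_factor=3,
    consumer_fanout=4,
)
print(shape.retained_gb(), shape.replicated_gb(), shape.peak_to_average())
```

Keeping these numbers in one place makes it easier to see which term dominates before mapping the workload to a specific MSK mode.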

The useful optimization pass usually includes four checks:

Right-size the MSK mode. Compare Standard, Express, and Serverless against the actual workload rather than against a generic "managed Kafka" requirement. Express may improve throughput and scaling characteristics for some provisioned workloads, while Serverless may fit teams that want AWS to abstract more capacity management.
Revisit retention and topic design. Long retention on hot broker storage can be expensive when the real requirement is replay or audit access. Tiered storage may help some MSK Provisioned workloads, but it should be tested against read patterns and operational expectations.
Reduce unnecessary data movement. Review producer, consumer, connector, and downstream placement by AZ. Cross-AZ traffic pricing is region- and path-dependent, so the right answer requires the team's actual topology and the current AWS data transfer page, not a copied example.
Measure rebalance and recovery behavior. Do not evaluate scaling only by whether a cluster can add capacity. Measure how long the workload takes to become balanced, how much data moves, and what happens to client latency and lag during the event.
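For the data-movement check, a rough model of cross-AZ volume can help frame the review before pricing enters the picture. The sketch below is a simplification under stated assumptions: clients and partition leaders are spread evenly across AZs, each replica lives in a different AZ, and rack-aware consumers can always read from a replica in their own AZ. The function name and parameters are hypothetical, and the output is traffic volume, not cost.

```python
def cross_az_mb_s(
    ingress_mb_s: float,
    consumer_fanout: float,
    az_count: int = 3,
    replication_factor: int = 3,
    rack_aware_consumers: bool = False,
) -> dict:
    """Estimate cross-AZ traffic in MB/s for a multi-AZ Kafka cluster.

    Assumes even client and leader placement, and one replica per AZ.
    """
    if replication_factor > az_count:
        raise ValueError("model assumes at most one replica per AZ")

    # A producer's partition leader sits in another AZ (az_count - 1) times
    # out of az_count when placement is uniform.
    produce = ingress_mb_s * (az_count - 1) / az_count

    # Every follower replica is in a different AZ from the leader.
    replication = ingress_mb_s * (replication_factor - 1)

    # Consumers reading from the leader cross AZs at the same rate as
    # producers; rack-aware reads from a local replica avoid it.
    if rack_aware_consumers:
        consume = 0.0
    else:
        consume = ingress_mb_s * consumer_fanout * (az_count - 1) / az_count

    return {
        "produce": produce,
        "replication": replication,
        "consume": consume,
        "total": produce + replication + consume,
    }


# Illustrative inputs only.
print(cross_az_mb_s(ingress_mb_s=60, consumer_fanout=2))
```

Comparing the estimate with observed transfer metrics shows whether the traffic is inherent to replication or a side effect of client placement that routing changes could remove.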

If these steps solve the problem, staying on MSK is a good outcome. A replacement project should earn its place by changing the cost or scaling curve, not by sounding like a fresher SKU.

Replacement Criteria for Kafka Workloads on AWS

When optimization is not enough, the replacement scorecard should be strict. The target platform has to keep the Kafka application contract while improving the specific constraints that made MSK painful. A platform that reduces one bill line but forces application rewrites, weakens operational visibility, or breaks offset continuity has not really solved the problem.

Criterion	What to verify	Why it matters
Kafka compatibility	Client versions, protocol behavior, ACLs, consumer groups, transactions if used	Application rewrites are the fastest way to turn a platform migration into a product delay
Storage model	Where durable data lives, how retention scales, how reads behave for older data	The storage model determines whether retained bytes remain tied to broker capacity
Scaling model	Whether adding or removing brokers requires data movement	Elasticity is only useful if it does not create a rebalancing project every time demand changes
Network economics	Cross-AZ paths for clients, replication, connectors, and storage	Kafka costs often hide in traffic paths rather than broker line items
Migration path	MirrorMaker 2, MSK Replicator, AutoMQ Kafka Linking, offset strategy, rollback	A replacement is safe only when data and consumer progress can be validated before cutover
Operations boundary	AWS account, VPC, IAM, monitoring, patching, incident ownership	The team must know which parts are managed and which parts remain their responsibility

This is also where migration tooling needs careful language. MSK Replicator is an AWS-managed feature for replication between MSK clusters and, under supported conditions, from self-managed Kafka into MSK Provisioned clusters with Express brokers. Apache Kafka teams also use MirrorMaker 2 for cluster-to-cluster replication. AutoMQ documents MirrorMaker 2 for open-source migration paths and Kafka Linking in commercial editions for byte-to-byte synchronization with offset consistency under supported conditions. These tools are not interchangeable; the right one depends on source cluster, authentication, network path, offset requirements, and rollback plan.

The migration plan should therefore be designed around the riskiest workload, not the easiest topic. Pick a workload with meaningful retention, real consumer groups, normal authentication, and enough traffic variation to expose scaling behavior. A proof of concept that only produces and consumes a test topic proves very little about replacing MSK in production.

Where AutoMQ Fits as an MSK Replacement

AutoMQ fits the MSK replacement conversation when the goal is to keep Kafka semantics while changing the storage and elasticity model. AutoMQ is designed as a Kafka-compatible streaming platform built on object storage. Its documentation describes compatibility with Kafka clients, connectors, proxies, and ecosystem components, while the storage layer is redesigned around shared storage rather than broker-local disks.

That architecture targets the exact failure mode that makes some MSK workloads feel stuck. In traditional Kafka, brokers own local log segments and replication is handled at the Kafka layer. Scaling or replacing brokers can involve partition reassignment and data movement. AutoMQ offloads durable stream data to S3-compatible object storage through its shared-storage architecture, making brokers stateless from the perspective of durable data. Stateless brokers can be added, removed, or replaced with far less data movement because the persistent log is not trapped on the broker.
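To make the coupling concrete, here is a back-of-the-envelope estimate of how much data a stateful cluster has to copy when it scales out and rebalances to an even spread. It is a sketch, not a model of any specific product: the function names are hypothetical, the inputs are illustrative, and it assumes data is evenly distributed before and after the change and that reassignment runs at a fixed aggregate throttle.

```python
def gb_to_move_on_scale_out(
    replicated_gb: float, brokers_before: int, brokers_after: int
) -> float:
    """Data that must be copied to new brokers to reach an even spread."""
    if brokers_after <= brokers_before:
        raise ValueError("this sketch only covers scale-out")
    # After rebalancing, each broker holds replicated_gb / brokers_after,
    # and every new broker starts empty.
    new_brokers = brokers_after - brokers_before
    return replicated_gb * new_brokers / brokers_after


def hours_to_move(gb: float, aggregate_throttle_mb_s: float) -> float:
    """Time to copy `gb` at a fixed cluster-wide reassignment throttle."""
    return gb * 1000 / aggregate_throttle_mb_s / 3600


# Illustrative inputs: about 38.9 TB replicated, growing from 6 to 9 brokers.
moved = gb_to_move_on_scale_out(38880, brokers_before=6, brokers_after=9)
print(moved, hours_to_move(moved, aggregate_throttle_mb_s=300))
```

In a shared-storage design as described above, the durable log is not copied between brokers, so this is the term the architecture aims to shrink. How close it gets to zero for a given workload is something to measure in a PoC rather than assume.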

For AWS teams, the evaluation usually comes down to four practical questions:

Can existing Kafka clients keep their contract? Test producers, consumers, Connect jobs, ACLs, observability, and any transaction or compaction assumptions against the target AutoMQ version.
Does object-storage-native retention change the cost curve? If the MSK bill is dominated by retained data and broker storage, S3-backed shared storage can change the economics more directly than another broker-size adjustment.
Does stateless scaling match the traffic pattern? Workloads with sharp peaks, frequent capacity changes, or painful rebalancing are better candidates than flat workloads that rarely change.
Can the migration preserve progress and rollback options? AutoMQ's documented migration paths include MirrorMaker 2 and Kafka Linking, but the team still needs to validate authentication mode, offset behavior, and cutover ownership for its specific MSK estate.

AutoMQ should not be positioned as a generic "MSK is bad" argument. MSK is useful, mature, and deeply integrated with AWS. AutoMQ is relevant when the team's bottleneck is the stateful Kafka storage model itself: retained bytes, broker-bound capacity, cross-AZ data movement, and slow operational changes.

PoC Checklist

A strong first PoC is not the biggest cluster. It is the workload that makes the architectural tradeoff visible. High retention, uneven traffic, cross-AZ consumers, frequent rebalancing, or a painful peak-to-average gap will teach the team more than a clean synthetic benchmark.

Use this checklist before production traffic moves:

Select one workload with real producers, real consumers, and representative retention.
Capture current MSK baseline metrics: ingress, egress, retained bytes, broker utilization, consumer lag, rebalance duration, and cross-AZ traffic paths.
Recreate topic configuration, authentication, ACLs, monitoring, and alerting in the target environment.
Replicate data with the chosen migration tool and validate offsets, ordering assumptions, and lag behavior.
Run a scaling event in the target architecture and compare operational steps, data movement, and client impact.
Define rollback before cutover: owner, trigger, source-cluster state, client reversal, and maximum acceptable dual-run window.
Build a cost model using the same workload inputs for both platforms, with AWS region and pricing pages dated in the worksheet.
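The last checklist item can be sketched as a small worksheet that records where each rate came from. Everything here is an assumption for illustration: the `PriceSheet` and `Footprint` names are hypothetical, the three cost lines are a simplification of real billing dimensions, and the rates are placeholders that must be replaced with values copied from the current pricing pages for the team's region.

```python
from dataclasses import dataclass
from datetime import date

HOURS_PER_MONTH = 730  # common billing approximation


@dataclass(frozen=True)
class PriceSheet:
    """Rates for one platform, tagged with region and retrieval date."""
    platform: str
    region: str
    as_of: date
    broker_per_hour: float        # placeholder, not a real price
    storage_per_gb_month: float   # placeholder, not a real price
    cross_az_per_gb: float        # placeholder, not a real price


@dataclass(frozen=True)
class Footprint:
    """Resources one platform needs to serve the same workload inputs."""
    broker_count: int
    billed_storage_gb: float
    cross_az_gb_per_month: float


def monthly_cost(prices: PriceSheet, fp: Footprint) -> dict:
    compute = fp.broker_count * prices.broker_per_hour * HOURS_PER_MONTH
    storage = fp.billed_storage_gb * prices.storage_per_gb_month
    transfer = fp.cross_az_gb_per_month * prices.cross_az_per_gb
    return {
        "platform": prices.platform,
        "region": prices.region,
        "as_of": prices.as_of.isoformat(),
        "compute": compute,
        "storage": storage,
        "transfer": transfer,
        "total": compute + storage + transfer,
    }


# Placeholder rates and footprint, for shape only.
sheet = PriceSheet("platform-a", "example-region", date(2026, 5, 20), 1.0, 0.1, 0.01)
print(monthly_cost(sheet, Footprint(6, 1000, 5000)))
```

Running the same function with a second `PriceSheet` and `Footprint` for the target platform keeps the comparison tied to identical workload inputs, which is the point of the checklist item.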

Do not let the PoC end at "the client connected." The whole reason to replace Amazon MSK is to change the behavior under storage growth, scaling pressure, and operational load. The PoC should make those differences measurable.

FAQ

When should a team replace Amazon MSK instead of tuning it?

Replace MSK only when the recurring pain comes from structural workload constraints: storage growth, retained data economics, cross-AZ traffic, slow stateful scaling, or peak over-provisioning that tuning cannot reasonably remove. If the issue is broker selection, retention configuration, topic design, or workload placement, optimize MSK first.

Does MSK Express remove the need to evaluate another Kafka platform?

Not always. MSK Express can improve throughput and scaling characteristics for certain MSK Provisioned workloads, and it should be part of the optimization pass. A different platform becomes relevant when the team needs to change the underlying storage and elasticity model, not only the broker type.

Can you move off MSK without rewriting Kafka applications?

Often, yes, if the target is Kafka-compatible and the applications rely on standard Kafka client behavior. You still need to test client versions, authentication, ACLs, consumer offset handling, connector dependencies, transaction usage, and rollback behavior before calling the move application-transparent.

What workloads are good candidates for an AutoMQ proof of concept?

Start with workloads where MSK pressure is visible: long retention, high retained bytes, uneven peaks, frequent scaling or rebalancing, heavy cross-AZ access, or cost reviews where broker storage and idle capacity dominate the discussion. A flat, low-retention workload may not show much architectural difference.

Which migration tool should be used when moving from MSK?

It depends on the source and target platform, authentication mode, networking, offset requirements, and downtime tolerance. Options to evaluate include MirrorMaker 2, MSK Replicator for supported AWS scenarios, and AutoMQ Kafka Linking for supported AutoMQ commercial migrations. Validate the exact path in a PoC before committing to a production cutover.
