Move Off MSK: Signs Your AWS Kafka Architecture Needs a Rethink

Teams rarely decide to move off MSK because Amazon MSK is "bad." The original choice is usually reasonable: AWS-native operations, managed broker lifecycle, VPC integration, IAM options, and compatibility with the Apache Kafka ecosystem. The rethink starts when workload shape, retention, cost control, or governance requirements move beyond what a broker-local Kafka architecture can hide.

That distinction matters. A frustrating Kafka month does not automatically justify replacement. AWS publishes concrete guidance for right-sizing Standard brokers, watching partition count per broker, maintaining CPU headroom, tuning larger instances, and using tiered storage where it fits. Those are real levers. A team should use them before treating migration as the first answer.

The useful question is narrower: when is MSK tuning no longer enough? The answer usually appears as a pattern across cost, scaling, reliability, and control. One signal may be an operations backlog. Three or four signals together are an architecture smell.

Figure: Move Off MSK Signal Scorecard

Why Teams Decide to Move Off MSK

Amazon MSK keeps the Kafka operating model familiar. In Provisioned mode, you still choose broker type, broker count, storage settings, networking, security configuration, and Kafka version. MSK Serverless changes capacity management for workloads that fit its quotas and IAM access-control model. Express brokers change the performance profile for some Provisioned clusters. Tiered storage can reduce pressure from long retention on Standard brokers by moving older data into a lower-cost remote tier.

Those options are valuable, but they do not erase Kafka's shape. Apache Kafka organizes data into topic partitions, stores records in logs, and uses replication for fault tolerance. Broker placement, partition leadership, replica movement, retained bytes, and client locality still affect performance and cost.

Signal category | Usually optimize MSK first | Consider replacement when
Cost | A few topics have runaway retention or poor placement | Kafka cost grows faster than business volume after retention, partitioning, and traffic locality are corrected
Scaling | Broker size or partition distribution is wrong | Scaling requires disruptive data movement or sustained over-provisioning
Storage | Retention policy is inconsistent | Retained data is strategic and keeps expanding beyond broker-local economics
Networking | A few clients cross AZs unnecessarily | Cross-AZ and private connectivity design becomes a permanent platform tax

Seven Signs MSK Tuning Is No Longer Enough

1. Kafka Cost Grows Faster Than the Business Metric It Supports

A healthy Kafka cost curve should roughly follow a workload driver: events per second, retained GiB, number of applications, or analytical read volume. The first warning sign is divergence. The business grows 30%, but the Kafka bill grows 2x because retention doubled, consumers replay more history, and the team keeps broker headroom high for maintenance events.

AWS MSK pricing separates broker instance usage, storage, provisioned throughput where applicable, Serverless usage dimensions, MSK Connect, MSK Replicator, and surrounding AWS networking charges. Normalize that into a workload model before replacing anything: logical bytes written, consumer fan-out, retained GiB-months, replication factor, DR volume, cross-AZ traffic, and idle headroom.
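The normalization step can be sketched in a few lines of Python. This is a minimal illustration, not an AWS pricing calculator: the `Workload` fields and the `normalize` function are hypothetical names, the model assumes all retained data sits on broker-local disks (no tiered storage), and it deliberately stops at physical quantities so that you attach your own current unit prices.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400
BYTES_PER_GIB = 1024 ** 3


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of one Kafka workload; every input is yours to supply."""
    logical_write_bytes_per_sec: float  # bytes producers send, before replication
    consumer_fan_out: float             # average number of times each byte is read
    retention_days: float               # topic retention window
    replication_factor: int             # Kafka replication factor


def normalize(w: Workload) -> dict:
    """Turn workload inputs into the physical quantities that pricing dimensions attach to."""
    logical_gib_per_day = w.logical_write_bytes_per_sec * SECONDS_PER_DAY / BYTES_PER_GIB
    return {
        "logical_write_gib_per_day": logical_gib_per_day,
        # Followers copy each written byte (replication_factor - 1) more times.
        "replication_gib_per_day": logical_gib_per_day * (w.replication_factor - 1),
        "consumer_read_gib_per_day": logical_gib_per_day * w.consumer_fan_out,
        # Steady-state bytes held on broker disks across all replicas (assumes no tiering).
        "retained_gib_all_replicas": (
            logical_gib_per_day * w.retention_days * w.replication_factor
        ),
    }
```

Running `normalize` per topic family, rather than per cluster, tends to show which workloads drive retained bytes and which drive read fan-out, and that is the split the tune-or-replace discussion needs.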

If the model exposes a fix, take it. If it shows the dominant cost comes from the architecture itself, the discussion changes. Broker-local Kafka couples compute, storage, and recovery. In a cloud environment, that coupling can make you pay for capacity in the most expensive shape: provisioned before you need it, replicated before you read it, and moved again when you rebalance it.

2. Cross-AZ Traffic Becomes a Design Constraint

Multi-AZ Kafka is the right default for production availability, but it makes traffic locality part of architecture. Producers may connect from one Availability Zone while partition leaders sit in another. Followers replicate data across zones. Consumers may read from a remote replica unless client and broker placement are deliberately tuned.

The warning sign is not the existence of cross-AZ traffic. The warning sign is that workload reviews start with AZ placement, client routing, and private connectivity exceptions. Kafka has mitigation options, including rack awareness, client placement, and follower fetching where supported. But if the dominant debate is how to avoid data movement caused by broker-local replicas, replacement candidates should be evaluated on their data path, not only on hourly broker rates.
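To make the data path concrete, here is a rough Python estimate of how many bytes cross an AZ boundary for each logical byte written. It is a sketch under simplifying assumptions that may not hold for your cluster: clients and partition leaders are spread evenly across AZs, each replica of a partition lives in a different AZ, and follower fetching (when enabled) always finds the in-zone replica if one exists. The function name and parameters are hypothetical.

```python
def cross_az_bytes_per_logical_byte(
    az_count: int,
    replication_factor: int,
    consumer_fan_out: float,
    follower_fetching: bool = False,
) -> dict:
    """Estimate cross-AZ bytes moved for every logical byte produced."""
    if not 1 <= replication_factor <= az_count:
        raise ValueError("model assumes at most one replica per AZ")

    # Chance that a randomly placed client is in a different AZ from the leader.
    remote_leader = (az_count - 1) / az_count

    produce = remote_leader
    # Every follower sits in another AZ, so each follower copy crosses a boundary.
    replicate = float(replication_factor - 1)
    if follower_fetching:
        # A consumer only crosses AZs when its own zone holds no replica.
        consume_fraction = 1 - replication_factor / az_count
    else:
        consume_fraction = remote_leader
    consume = consumer_fan_out * consume_fraction

    return {
        "produce": produce,
        "replicate": replicate,
        "consume": consume,
        "total": produce + replicate + consume,
    }
```

Under these assumptions the replication term does not shrink with client tuning, which is why the paragraph above points at the data path of replacement candidates and not only at client placement.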

3. Scaling Is Planned Like a Maintenance Project

AWS recommends maintaining broker CPU utilization under 60% so MSK clusters have headroom for broker failure, patching, and rolling upgrades. The same guidance explains that broker-size updates happen in a rolling fashion and typically take 10-15 minutes per broker. Adding brokers and reassigning partitions can be more disruptive because it creates additional replication load.
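Those two numbers are easy to turn into a planning check. The sketch below uses only the figures quoted above (the 60% CPU guideline and the 10-15 minutes per broker range). The function names and the shape of the input dictionary are hypothetical, and the CPU values are whatever utilization percentage your monitoring already reports per broker.

```python
from typing import Dict, List, Tuple

CPU_GUIDELINE_PCT = 60.0           # headroom guideline quoted above
MINUTES_PER_BROKER = (10, 15)      # typical rolling-update range quoted above


def rolling_update_window_minutes(broker_count: int) -> Tuple[int, int]:
    """Return the (low, high) estimate for a rolling broker-size update."""
    if broker_count < 1:
        raise ValueError("broker_count must be at least 1")
    low, high = MINUTES_PER_BROKER
    return broker_count * low, broker_count * high


def brokers_without_headroom(cpu_pct_by_broker: Dict[str, float]) -> List[str]:
    """List brokers at or above the guideline, sorted by broker id."""
    return sorted(
        broker for broker, cpu in cpu_pct_by_broker.items() if cpu >= CPU_GUIDELINE_PCT
    )
```

For a six-broker cluster the window comes out at 60 to 90 minutes, before any partition reassignment is added on top.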

Those are reasonable operational facts. The issue is what they force the platform team to do. If every meaningful scale event becomes a spreadsheet, a traffic freeze, a partition reassignment plan, and a watch window, your Kafka architecture is pushing capacity management back onto humans. If the business needs frequent elasticity while the cluster scales like stateful storage, the gap is architectural.

4. Retention Is No Longer a Kafka Setting

Retention starts as an operational parameter. Later, it becomes part of the product: fraud analysis wants replay windows, ML pipelines want reproducible feature streams, compliance wants history, and downstream systems want rebuildable state.

MSK tiered storage is designed for part of this problem. AWS describes it as a low-cost storage tier for Standard brokers that scales to virtually unlimited storage, with older data moved from primary broker storage to the tiered layer until Kafka topic retention limits apply. AWS also documents constraints: it applies only to Provisioned mode clusters, has version and broker requirements, does not support compacted topics, and has topic-level limitations. Tiered storage can be the right answer when your hot working set is bounded and older reads are occasional. It is not the same as making object storage the primary durable repository.
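A quick way to test whether the hot working set is really bounded is to split retained data into the part that stays on broker disks and the part that only the remote tier has to serve. The Python sketch below is a simplified model, not a description of how MSK meters storage: it assumes a steady write rate, that local copies expire after a local retention window, and that older data is counted once in the remote tier. Actual remote usage can be higher because segments may exist in both tiers for a while. All names are hypothetical.

```python
def tiered_split_gib(
    write_gib_per_day: float,
    retention_days: float,
    local_retention_days: float,
    replication_factor: int,
) -> dict:
    """Split steady-state retained data into broker-local and remote-only portions."""
    if not 0 <= local_retention_days <= retention_days:
        raise ValueError("local retention must be between 0 and total retention")

    # Hot data is replicated across brokers, so it is multiplied by the replication factor.
    local_gib = write_gib_per_day * local_retention_days * replication_factor
    # Data older than the local window, counted as one logical copy (assumption).
    remote_only_gib = write_gib_per_day * (retention_days - local_retention_days)
    # Baseline for comparison: everything kept on broker disks, no tiering.
    untiered_local_gib = write_gib_per_day * retention_days * replication_factor

    return {
        "local_gib": local_gib,
        "remote_only_gib": remote_only_gib,
        "untiered_local_gib": untiered_local_gib,
    }
```

If `local_gib` keeps growing because consumers regularly read beyond the local window, the workload is closer to the "object storage as primary repository" case than to the tiering case.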

5. Disaster Recovery and Replication Outgrow the Cluster

Kafka DR is not one feature. It is a collection of choices about RPO, RTO, topic selection, offset handling, identity, failover, DNS, application restart behavior, and post-failback reconciliation. MSK Replicator can reduce operational burden for supported scenarios, but AWS documents quotas around replicators, topic scope, throughput, and record size.

The replacement signal appears when DR stops being a cluster add-on and becomes a second platform. If your team needs multiple replication tools, custom offset procedures, manual producer cutovers, and separate governance around every critical topic, ask whether the current architecture gives you a low-risk recovery model.

6. MSK Modes Help Workloads but Fragment the Platform

It is tempting to frame the decision as Provisioned vs Serverless vs Express. That is too narrow. These are different operating modes within the Amazon MSK family, and each can be useful. Serverless automatically provisions and scales capacity for eligible workloads. Express brokers come with their own throughput and partition guidance from AWS. Provisioned Standard brokers remain familiar for teams that need direct sizing control.

The signal for replacement is not that one mode has a limitation. Every platform has limits. The signal is that your Kafka estate now contains several operating models because no one model fits the work.

7. The Control Boundary No Longer Matches the Business

MSK is AWS-native, which is often the point. The data plane runs in AWS, integrates with AWS networking, and fits AWS security and procurement workflows. Over time, control requirements can sharpen. A regulated team may need stricter data-plane ownership. A FinOps team may want infrastructure cost and vendor fee separated. A platform team may want Kafka compatibility without committing every scaling and storage decision to broker-local infrastructure.

This is where BYOC, or bring your own cloud, enters the evaluation. BYOC is not automatically better than a fully AWS-managed service; it changes the responsibility boundary. The data plane stays in the customer's cloud account, while the provider automates lifecycle, monitoring, upgrades, and support around it.

Tune vs Replace: A Decision Path

The safest migration is the one you do not need. Start by proving whether MSK can be made healthy with ordinary engineering work. Use AWS best practices to validate broker CPU, partition count per broker, connection patterns, storage settings, and client failover configuration. Review quotas before assuming that a service mode can absorb future growth.

Figure: Tune vs Replace Decision Path

If the cluster still looks unhealthy after that pass, separate symptoms from structural causes:

  • Tactical tuning problem: hot partitions, poor client placement, too little CPU headroom, incorrect retention, missing monitoring, or an avoidable VPC path.
  • Architecture problem: compute and storage are coupled in a way that forces over-provisioning, slow scaling, expensive replication, or complex DR across important workloads.
  • Governance problem: data-plane control, cost attribution, or security boundaries no longer match the organization's operating model.

If MSK optimization can meet the next 12-18 months of requirements with acceptable cost and risk, optimize. If not, evaluate replacement with a workload-specific proof of concept.
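The decision path can be written down as a small Python function so that the outcome is recorded next to the evidence. This is only a sketch of the reasoning in this article: the sign names are hypothetical labels for the seven sections above, and the threshold of three follows the "three or four signals together" rule of thumb, not any AWS or AutoMQ guidance.

```python
from typing import Dict, Iterable

SEVEN_SIGNS = frozenset({
    "cost_divergence",
    "cross_az_constraint",
    "scaling_as_project",
    "strategic_retention",
    "dr_outgrows_cluster",
    "mode_fragmentation",
    "control_boundary",
})
SMELL_THRESHOLD = 3  # "three or four signals together are an architecture smell"


def assess(active_signs: Iterable[str], tuning_covers_horizon: bool) -> Dict[str, object]:
    """Summarize the tune-vs-replace outcome for one MSK estate.

    tuning_covers_horizon: True if MSK optimization can meet the next
    12-18 months of requirements with acceptable cost and risk.
    """
    active = set(active_signs)
    unknown = active - SEVEN_SIGNS
    if unknown:
        raise ValueError(f"unknown signs: {sorted(unknown)}")

    if tuning_covers_horizon:
        next_step = "optimize MSK"
    else:
        next_step = "evaluate replacement with a workload-specific proof of concept"

    return {
        "active_count": len(active),
        "architecture_smell": len(active) >= SMELL_THRESHOLD,
        "next_step": next_step,
    }
```

Keeping the horizon question separate from the sign count is deliberate: several signs can be present and tuning may still be the right call for the next planning cycle.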

What to Evaluate Before Replacing MSK

A replacement plan should be stricter than the complaint that triggered it. Kafka is usually embedded in application bootstraps, connector configs, stream processors, ACLs, dashboards, incident runbooks, and procurement models.

Evaluation area | Questions to answer
Kafka compatibility | Which client versions, APIs, authentication modes, ACL patterns, Connect workloads, and stream jobs must keep working?
Migration path | How are topics, data, offsets, producers, and consumers migrated? What is the rollback plan?
Storage architecture | Is object storage a tier, a backup target, or the primary durable repository? What remains on broker-local disks?
Elasticity | What happens when brokers are added, removed, upgraded, or replaced under load?
Network cost | How does the architecture handle cross-AZ produce, replication, and consume paths?

Start with your hardest realistic workload: the topic family with awkward retention, the consumer group that replays history, the connector that produces backpressure, or the business domain with the hardest cutover requirement.

How AutoMQ Helps AWS Kafka Teams Rethink the Architecture

The architectural alternative to broker-local Kafka is not "Kafka, but managed by someone else." It is Kafka compatibility with a different storage model. AutoMQ fits this category: a Kafka-compatible, cloud-native streaming platform that replaces Kafka's broker-local storage layer with S3-based shared storage while preserving Kafka protocol compatibility for clients and ecosystem tools.

AutoMQ's public documentation describes a shared-storage architecture built around S3Stream, where object storage serves as the primary repository and a WAL layer absorbs low-latency writes before data is stored in object storage. That design makes brokers stateless because durable log data is no longer tied to broker-local disks. AutoMQ also documents migration workflows based on topic and consumer-group batches, synchronization, consumer cutover, producer cutover, and rollback stages.

Figure: MSK Replacement Architecture with AutoMQ

The practical consequence is a different set of operational levers:

  • Storage and compute can be scaled more independently because brokers are not the long-term owners of retained log data.
  • Broker replacement and partition movement are less tied to copying large volumes between broker disks.
  • Object storage becomes the durable foundation, which can change retention economics and reduce dependence on over-provisioned broker storage.
  • In BYOC deployments, the data plane can remain in the customer's cloud account while managed automation handles much of the platform lifecycle.
  • Cross-zone traffic patterns can change because replica movement between brokers is no longer the central durability mechanism.

Those benefits still need workload validation. If you rely on a Kafka feature, connector, ACL behavior, or operational integration, test it under your real producer acks settings, batch sizes, consumer fetch behavior, and failure scenarios.

Next-Step Migration Checklist

A low-risk move off MSK starts as an assessment, not a cutover. The first milestone is a shared fact base: cost model, topic inventory, consumer groups, client versions, replication requirements, and operational pain.

  1. Inventory topics by business domain, retention, partition count, write volume, read fan-out, compaction policy, and owner.
  2. Identify workloads that should not be first: critical paths, unknown owners, compacted topics with special semantics, or applications without restart windows.
  3. Create a target AutoMQ environment in the required AWS account or deployment model, then validate networking, authentication, monitoring, and access control.
  4. Run batch synchronization for a small topic group and monitor replication lag, consumer offset handling, and application behavior.
  5. Switch consumers and producers according to the migration plan, keeping rollback steps explicit at each stage.
  6. Measure cost, latency, throughput, failure behavior, and operator workload against the baseline MSK cluster.
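Steps 1 and 2 of the checklist lend themselves to a small script over the topic inventory. The Python sketch below is illustrative only: the `Topic` fields are a hypothetical inventory schema, not an AutoMQ or MSK API, and it conservatively defers every compacted topic, whereas the checklist only singles out compacted topics with special semantics.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Topic:
    """One row of a hypothetical topic inventory."""
    name: str
    domain: str                 # business domain that owns the data
    owner: Optional[str]        # None when nobody claims the topic
    compacted: bool
    critical_path: bool
    has_restart_window: bool    # can its applications be restarted for cutover?


def defer_reasons(topic: Topic) -> List[str]:
    """Return the reasons a topic should not be in the first migration wave."""
    reasons = []
    if topic.critical_path:
        reasons.append("critical path")
    if topic.owner is None:
        reasons.append("unknown owner")
    if topic.compacted:
        reasons.append("compacted topic")  # conservative: review semantics first
    if not topic.has_restart_window:
        reasons.append("no restart window")
    return reasons


def first_wave_batches(topics: Iterable[Topic]) -> Dict[str, List[str]]:
    """Group topics with no defer reasons into batches keyed by business domain."""
    batches: Dict[str, List[str]] = defaultdict(list)
    for topic in topics:
        if not defer_reasons(topic):
            batches[topic.domain].append(topic.name)
    return {domain: sorted(names) for domain, names in sorted(batches.items())}
```

Batching by business domain keeps producers, consumers, and owners of related topics in the same cutover and rollback conversation.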

If the evidence shows a fixable MSK problem, stay and tune. If it shows a structural mismatch between your AWS Kafka workload and broker-local storage economics, "move off MSK" becomes an architecture decision with a test plan.

For teams that want to validate the shared-storage path, start with AutoMQ's migration documentation and run a workload-specific assessment: AutoMQ migration guide and AutoMQ architecture overview.

FAQ

Is moving off MSK always the right answer for high AWS Kafka cost?

No. Many MSK cost problems come from fixable issues: excessive retention, poor partition design, unnecessary cross-AZ client paths, or missing CPU headroom. Replacement is worth evaluating when the dominant driver is the coupling of compute, storage, replication, and recovery.

Should we try MSK Serverless before replacing MSK?

Often, yes. MSK Serverless can fit workloads that need on-demand capacity and match its quotas, security model, and feature constraints. It is less useful as a universal answer for strict compatibility, high partition scale, specialized networking, or long-retention economics.

Does MSK tiered storage solve the retention problem?

It can solve an important part of the retention problem for eligible Provisioned Standard broker clusters and topic patterns. AWS documents benefits such as virtually unlimited low-cost storage, but also documents constraints around versions, broker types, topic policies, compacted topics, and operational visibility. Treat it as a strong optimization path, not as primary shared storage.

What makes AutoMQ different from ordinary Kafka tiered storage?

AutoMQ is designed around shared storage: object storage is the durable foundation, with a WAL layer for write efficiency, and brokers become stateless. Tiered storage keeps broker-local primary storage in the architecture and moves older data to a remote tier.

How risky is an MSK-to-AutoMQ migration?

The risk depends on topic ownership, client versions, consumer offset behavior, producer cutover, connector dependencies, and rollback design. AutoMQ documents a staged process with synchronization, lag monitoring, consumer and producer switching, and rollback options. The safer approach is batch migration by business domain.
