Amazon MSK is a credible answer to a real problem: most teams do not want to run broker replacement workflows and patch windows for every Kafka cluster. A managed service removes a lot of undifferentiated work. The risk starts when an organization turns that relief into a platform standard without asking where the service boundary ends.
The phrase "MSK limitations" is easy to misuse. Some limits are Amazon MSK product quotas, some are Apache Kafka architectural trade-offs, and some are cloud economics that appear only after multi-AZ traffic, retention, and private connectivity are added. A useful review separates those layers before a CTO, platform team, or FinOps lead commits to MSK as the default Kafka substrate.
MSK is managed Kafka, not magic Kafka
Amazon MSK manages broker provisioning, replacement, patching, monitoring integration, and several security and connectivity concerns. It does not remove the Kafka data model. Topics still have partitions. Partitions still have leaders and followers. Data still sits behind broker ownership rules. Clients still need resilient configuration. AWS explicitly tells MSK users to configure clients for high availability, include brokers from multiple Availability Zones in connection strings, and performance-test client settings against application objectives.
That distinction matters because many standardization programs treat "managed" as a synonym for "abstracted." MSK abstracts infrastructure operations, but it does not abstract every capacity and topology decision. AWS service quotas still apply. Kafka partition count still affects broker load and metadata operations. Storage and throughput settings still need sizing. For a company-wide Kafka standard, the hard work moves from "who patches brokers?" to "who owns the guardrails that keep clusters inside safe envelopes?"
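The client-side advice above, listing brokers from more than one Availability Zone in the connection string, is one guardrail a platform team can check mechanically. The following is a minimal sketch, assuming the team maintains its own broker-to-AZ inventory; the hostnames, zone labels, port, and function names are hypothetical and are not MSK outputs.

```python
from typing import Dict, Set

# Hypothetical inventory: broker hostname -> Availability Zone label.
# In practice this would come from your own cluster inventory.
BROKER_AZ: Dict[str, str] = {
    "b-1.example.internal": "az-a",
    "b-2.example.internal": "az-b",
    "b-3.example.internal": "az-c",
}


def azs_in_bootstrap(bootstrap_servers: str, broker_az: Dict[str, str]) -> Set[str]:
    """Return the AZs represented by a comma-separated host:port bootstrap string."""
    zones: Set[str] = set()
    for entry in bootstrap_servers.split(","):
        # Drop the port; unknown hosts are ignored rather than guessed.
        host = entry.strip().rsplit(":", 1)[0]
        if host in broker_az:
            zones.add(broker_az[host])
    return zones


def spans_multiple_azs(bootstrap_servers: str, broker_az: Dict[str, str], minimum: int = 2) -> bool:
    """True when the bootstrap string lists brokers from at least `minimum` AZs."""
    return len(azs_in_bootstrap(bootstrap_servers, broker_az)) >= minimum
```

A check like this could run in a client configuration linter, so a single-AZ bootstrap string is flagged before it reaches production.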
A practical pre-standardization review should group risks into six buckets:
- Cost: broker hours, provisioned EBS storage, optional provisioned storage throughput, tiered storage, inter-VPC or internet data transfer, PrivateLink processing, monitoring, and replication tools.
- Scaling: broker size changes, broker count changes, partition redistribution, partition ceilings, connection limits, and quota increase lead time.
- Storage: local primary storage, tiered storage constraints, compaction requirements, cooldown windows, retention expectations, and backfill latency.
- Networking: multi-AZ topology, client placement, PrivateLink, cross-account access, inter-region replication, and data movement outside MSK.
- Operations: upgrade windows, client failover behavior, CPU headroom, CloudWatch metric coverage, incident runbooks, and ownership of topic lifecycle.
- Migration and lock-in: authentication, network paths, topic settings, tiered storage choices, and managed replication patterns, each of which shapes how reversible the decision is. Kafka API compatibility is not the same as zero migration cost.
None of these categories means MSK is a poor service. They mean MSK is still a distributed log running inside AWS, and standardization should be a risk decision, not a procurement shortcut.
Cost limitations: the bill is not only broker hours
The visible MSK cost model starts with broker instance hours. AWS pricing then adds storage and, depending on the broker mode, storage throughput, data written, and connectivity charges. For Standard brokers, AWS lists broker instance usage, provisioned storage, and optional provisioned storage throughput. For Express brokers, AWS describes broker instance usage, storage used, and a per-GB rate for data written. AWS also states that broker replication traffic is not charged, while standard AWS data transfer charges apply for data transferred in and out of MSK clusters.
That last point is where many architecture reviews get too shallow. Kafka workloads rarely consist of one producer, one cluster, and one consumer in the same subnet. A realistic estate of clients, stream processors, connectors, observability exporters, migration tools, and downstream systems creates data paths that fall outside the broker-replication exemption. Private connectivity adds another layer: AWS pricing states that MSK private connectivity, powered by AWS PrivateLink, includes hourly charges and per-GB data processing charges, plus standard AWS PrivateLink charges for managed VPC connections.
The cost review should not ask whether MSK is "expensive" in the abstract. It should ask which cost terms scale with the workload metric the business actually cares about.
| Cost area | What to verify before standardizing | Why it matters |
|---|---|---|
| Broker compute | Broker count and size by cluster tier | Idle capacity becomes a platform tax when many teams receive dedicated clusters. |
| Primary storage | Provisioned EBS storage per Standard broker | Standard broker storage is provisioned, so retention buffers can create committed spend. |
| Storage throughput | Optional provisioned throughput on larger Standard brokers | Throughput planning becomes a capacity product, not a background setting. |
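| Tiered storage | Low-cost tier usage, first-byte backfill latency, and topic eligibility | It can reduce local storage pressure, but it excludes compacted topics, imposes a minimum retention period in the low-cost tier, and cannot be re-enabled on a topic once disabled. |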
| Network and connectivity | PrivateLink, inter-VPC, internet, cross-region, and client egress paths | Network line items can sit outside the MSK console mental model. |
| Replication and migration | MSK Replicator, MirrorMaker2, dual writes, and validation environments | Migration paths often need temporary duplicate capacity. |
The useful output is a pricing worksheet tied to workload classes: internal events, clickstream, compacted state topics, long-retention audit logs, and cross-account streams. Without that segmentation, a central MSK standard can look fine in a reference cluster and drift under production diversity.
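One way to start such a worksheet is a small model that keeps each cost term as its own line item per workload class. The sketch below is illustrative only: the class names, fields, and the 730-hour month are assumptions, and every rate is a caller-supplied placeholder to be filled from the current AWS pricing page, not a real price.

```python
from dataclasses import dataclass
from typing import Dict

HOURS_PER_MONTH = 730  # common billing approximation; an assumption of this sketch


@dataclass(frozen=True)
class Rates:
    """Placeholder unit prices; fill these from the AWS pricing page."""
    broker_hour: float
    storage_gb_month: float
    privatelink_gb: float
    data_transfer_gb: float


@dataclass(frozen=True)
class WorkloadClass:
    name: str
    brokers: int
    storage_gb_per_broker: float
    privatelink_gb_month: float
    data_transfer_gb_month: float


def monthly_cost(workload: WorkloadClass, rates: Rates) -> Dict[str, float]:
    """Return one line item per cost area, plus a total."""
    lines = {
        "broker_compute": workload.brokers * HOURS_PER_MONTH * rates.broker_hour,
        "primary_storage": workload.brokers * workload.storage_gb_per_broker * rates.storage_gb_month,
        "privatelink_processing": workload.privatelink_gb_month * rates.privatelink_gb,
        "data_transfer": workload.data_transfer_gb_month * rates.data_transfer_gb,
    }
    lines["total"] = sum(lines.values())
    return lines
```

Keeping the line items separate makes it visible which term grows with retention, which with traffic, and which with the number of dedicated clusters.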
Storage limitations: EBS, cooldowns, and tiered storage constraints
Standard MSK brokers persist data on storage volumes. AWS documents that storage I/O is consumed when producers write, when data is replicated between brokers, and when consumers read data that is not in memory. Provisioned storage throughput is available only for clusters whose brokers are kafka.m5.4xlarge or larger and whose storage volume is at least 10 GiB. That is not a flaw; it is a sign that storage remains a first-class capacity dimension.
The most direct storage limitation is one-way scaling. AWS states that you can increase EBS storage per broker but cannot decrease it. After a storage scaling event, the cluster enters a cooldown period before storage can be scaled again. AWS documents that this period ranges from a minimum of 6 hours to more than 24 hours, depending on storage size, utilization, and traffic. For steady workloads, this is manageable. For teams that standardize MSK across unpredictable tenants, it limits how quickly they can correct bad capacity assumptions: an oversized volume cannot be shrunk, and a volume that was just scaled and is still too small cannot be grown again until the cooldown ends.
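Because storage only grows and each increase starts a cooldown, it helps to estimate the target size before the first scaling request. The following back-of-envelope sketch assumes evenly distributed partitions and ignores compression, index files, and segment rounding; the function names and the 20% headroom default are illustrative assumptions, and the 6-hour constant is only the documented minimum, not the actual cooldown.

```python
from datetime import datetime, timedelta

# Minimum cooldown cited above; the real period can exceed 24 hours.
MIN_COOLDOWN = timedelta(hours=6)


def storage_per_broker_gb(
    ingress_mb_s: float,
    retention_days: float,
    replication_factor: int,
    brokers: int,
    headroom: float = 0.2,
) -> float:
    """Rough primary storage needed per broker, in GB (1 GB = 1000 MB)."""
    if brokers <= 0:
        raise ValueError("brokers must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    # Every retained byte is stored once per replica.
    total_mb = ingress_mb_s * 86_400 * retention_days * replication_factor
    return total_mb / brokers / 1000 / (1 - headroom)


def inside_minimum_cooldown(last_scaled_at: datetime, now: datetime) -> bool:
    """True if another scale-up is certainly still blocked by the minimum cooldown."""
    return now - last_scaled_at < MIN_COOLDOWN
```

For example, 10 MB/s of ingress retained for 7 days at replication factor 3 on 3 brokers with 20% headroom comes to roughly 7,560 GB per broker under these assumptions.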
Tiered storage helps with long retention, but it is not a full replacement for a Kafka platform built on Shared Storage architecture. AWS describes MSK tiered storage as moving data from primary storage into a lower-cost tier after Kafka topic retention limits are reached. The same documentation notes several constraints: tiered storage applies only to provisioned clusters, does not support t3.small, has a 3-day minimum retention period in low-cost storage, does not support compacted topics, and cannot be re-enabled for a topic after being disabled.
This is the subtle point: tiered storage reduces the amount of data that must remain on primary storage, but Kafka still retains a local primary tier and broker ownership model. Apache Kafka's KIP-405 explains the same design pattern at the Kafka level: local storage remains the active tier, remote storage holds completed log segments, and followers still need local and remote log metadata to maintain lineage. That is a strong improvement for retention-heavy workloads, but it is not the same as making brokers stateless.
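The tiered storage constraints listed above can be turned into a pre-flight check that runs before a topic is approved for tiering. This is a sketch under stated assumptions: the `TopicPlan` fields and the `"provisioned"` label are hypothetical names for this example, and the rules encode only the constraints named in this article, so they should be re-verified against the AWS documentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TopicPlan:
    name: str
    cleanup_policy: str              # Kafka cleanup.policy, e.g. "delete" or "compact"
    low_cost_retention_days: float   # planned retention in the low-cost tier
    tiering_disabled_before: bool = False


def tiered_storage_blockers(topic: TopicPlan, cluster_type: str, broker_type: str) -> List[str]:
    """Return the reasons a topic cannot use tiered storage; empty means no known blocker."""
    blockers: List[str] = []
    if cluster_type != "provisioned":
        blockers.append("tiered storage applies only to provisioned clusters")
    if broker_type.endswith("t3.small"):
        blockers.append("t3.small brokers are not supported")
    policies = [p.strip() for p in topic.cleanup_policy.split(",")]
    if "compact" in policies:
        blockers.append("compacted topics are not supported")
    if topic.low_cost_retention_days < 3:
        blockers.append("low-cost tier has a 3-day minimum retention period")
    if topic.tiering_disabled_before:
        blockers.append("tiered storage cannot be re-enabled after being disabled")
    return blockers
```

Returning reasons rather than a boolean lets the platform team show tenants exactly which constraint blocked a request.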
Scaling limitations: partitions move slower than decisions
MSK gives teams several scaling paths: change broker size, add brokers, adjust partitions, tune thread counts, use tiered storage, or choose Serverless or Express brokers for specific workloads. The limitation is not that scaling is impossible. The limitation is that each path has a different operational consequence.
AWS recommends keeping Standard broker CPU utilization under 60% so the cluster has headroom for broker failures, patching, and rolling upgrades. When CPU is high, AWS lists several options, including broker size updates and broker expansion. It also notes that adding brokers and reassigning existing partitions with kafka-reassign-partitions.sh requires the cluster to replicate data from broker to broker, which can increase load at first. That is the Kafka architecture showing through the managed surface.
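The 60% guideline is simple to encode as an alerting rule. The sketch below assumes that total CPU is read as user time plus system time per broker, which should be confirmed against the AWS best-practices page; the sample tuple layout and function name are hypothetical.

```python
from typing import Iterable, List, Tuple

CPU_CEILING_PERCENT = 60.0  # guideline cited above for Standard brokers


def brokers_over_cpu_ceiling(
    samples: Iterable[Tuple[str, float, float]],
    ceiling: float = CPU_CEILING_PERCENT,
) -> List[str]:
    """Return broker ids whose user + system CPU is not under the ceiling.

    Each sample is (broker_id, cpu_user_percent, cpu_system_percent).
    """
    return [broker for broker, user, system in samples if user + system >= ceiling]
```

A broker flagged here has less room to absorb a neighbor's failure or the extra replication load that follows a partition reassignment.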
Partition count is another standardization trap. AWS publishes recommended partition counts per Standard broker and maximum counts that support update operations. For example, several kafka.m5 and kafka.m7g sizes are listed with recommended values from 1,000 to 4,000 partitions per broker and update-operation maximums from 1,500 to 6,000, depending on broker size. AWS also warns that if partitions per broker exceed the maximum allowed value and the cluster becomes overloaded, some operations can be blocked, including configuration updates and downsizing.
For a single application, these constraints can be handled with design review. For a platform standard, they become product policy. You need rules for topic creation, partition growth, compaction, retention, and tenant isolation. A developer can create 1,000 topics faster than a platform team can unwind a bad partition strategy.
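A topic-creation guardrail can compare the projected partition load per broker with the limits for the broker size in use. In this sketch the limits are caller-supplied and should be copied from the AWS table for the actual broker type; it also assumes that the counts include both leader and follower replicas and that partitions are spread evenly, both of which are assumptions to verify. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class BrokerLimits:
    recommended: int   # recommended partitions per broker for this broker size
    update_max: int    # maximum that still supports update operations


def replicas_per_broker(topics: Iterable[Tuple[int, int]], brokers: int) -> float:
    """Average partition replicas per broker; topics are (partitions, replication_factor)."""
    if brokers <= 0:
        raise ValueError("brokers must be positive")
    return sum(partitions * rf for partitions, rf in topics) / brokers


def partition_status(per_broker: float, limits: BrokerLimits) -> str:
    """Classify the load as 'ok', 'above_recommended', or 'above_update_max'."""
    if per_broker > limits.update_max:
        return "above_update_max"
    if per_broker > limits.recommended:
        return "above_recommended"
    return "ok"
```

Running the check against the projected state, existing topics plus the requested one, lets the platform reject a request before the cluster crosses a limit that would block later configuration updates.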
Networking limitations: multi-AZ availability has a data-path shape
MSK clusters are commonly deployed across multiple Availability Zones for availability. AWS recommends three-AZ MSK Provisioned clusters, a replication factor of at least 3, and min.insync.replicas no higher than RF - 1 so rolling updates do not block producers. This is the right posture for availability, but it also means the platform has to reason about where clients run, how they connect, and which traffic leaves the protected MSK replication path.
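The availability guidance above reduces to three comparisons that a cluster or topic review can automate. A minimal sketch, with a hypothetical function name and message wording:

```python
from typing import List


def durability_findings(az_count: int, replication_factor: int, min_insync_replicas: int) -> List[str]:
    """Return deviations from the three-AZ, RF >= 3, min.insync.replicas <= RF - 1 guidance."""
    findings: List[str] = []
    if az_count < 3:
        findings.append("cluster spans fewer than three Availability Zones")
    if replication_factor < 3:
        findings.append("replication factor is below 3")
    if min_insync_replicas > replication_factor - 1:
        # With min.insync.replicas == RF, one broker restart blocks acks=all producers.
        findings.append("min.insync.replicas is higher than RF - 1")
    return findings
```

For example, a replication factor of 3 with `min.insync.replicas` set to 3 is flagged, because taking any single replica offline during a rolling update would leave fewer in-sync replicas than producers using `acks=all` require.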
Private connectivity is a good example. It solves a real enterprise need: clients in one or more VPCs can connect privately to an MSK cluster in a different VPC. It also introduces design questions that belong in the standard:
- Which teams are allowed to consume across VPCs, accounts, or regions?
- Who pays the hourly and per-GB PrivateLink charges?
- Are producers and stream processors placed near the cluster, or does every client path cross a network boundary?
- How are DNS, authentication schemes, security groups, and certificate rotation handled?
- What is the exit strategy if a future platform standard moves away from AWS-only networking?
These are governance questions, not only implementation details. A central Kafka platform often becomes a network product. If the standard does not include placement rules and chargeback tags, MSK cost and troubleshooting can land in different teams' queues.
Operations limitations: managed service does not mean ownerless service
MSK reduces the operational surface. It does not eliminate Kafka ownership. AWS documentation still asks users to test client configurations, maintain client failover behavior, right-size brokers, monitor CPU, and keep clusters within partition guidelines. It also warns that high partition counts can cause Kafka metrics to go missing from CloudWatch and from Prometheus scrapes.
The operational question is therefore not "Can AWS run the brokers?" It can. The question is "Who runs Kafka as a product inside the company?" That includes topic lifecycle, client libraries, ACL conventions, schema ownership, incidents, cost allocation, upgrade testing, and migration rehearsal. If those responsibilities stay implicit, AWS owns the infrastructure, but no team fully owns the workload contract.
This is also where MSK Serverless and Express brokers should be evaluated carefully. MSK Serverless removes broker sizing and storage provisioning from the user-facing model, but AWS publishes per-cluster quotas such as 200 MBps maximum ingress, 400 MBps maximum egress, 3,000 client connections, 500 consumer groups, 2,400 leader partitions for non-compacted topics, and 120 leader partitions for compacted topics. Express brokers have their own throughput, storage, and partition quotas. These products can be excellent fits for the right workload class. They are not drop-in replacements for every Kafka estate.
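A workload-fit check against the MSK Serverless quotas quoted above can be sketched as a dictionary comparison. The quota values are the ones cited in this article and should be re-checked against the current AWS quota page; the key names and function are hypothetical.

```python
from typing import Dict, Optional

# Per-cluster quotas as cited above; verify against the AWS quota page before use.
SERVERLESS_QUOTAS: Dict[str, float] = {
    "ingress_mbps": 200,
    "egress_mbps": 400,
    "client_connections": 3000,
    "consumer_groups": 500,
    "leader_partitions_non_compacted": 2400,
    "leader_partitions_compacted": 120,
}


def serverless_quota_breaches(
    workload: Dict[str, float],
    quotas: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Return the workload dimensions that exceed a quota, with the workload's value."""
    limits = SERVERLESS_QUOTAS if quotas is None else quotas
    return {
        key: workload[key]
        for key, limit in limits.items()
        if workload.get(key, 0) > limit
    }
```

An empty result does not prove the workload fits, since it covers only the quotas listed here, but a non-empty one is a clear signal to evaluate Standard or Express brokers instead.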
Migration lock-in: Kafka API compatibility is only the first layer
Kafka compatibility keeps producers and consumers from being rewritten. It does not make every platform decision reversible. Once an organization standardizes on MSK, it tends to standardize adjacent decisions too: IAM authentication, VPC connectivity, monitoring dashboards, topic naming, backup and replication patterns, tiered storage settings, quota processes, and Terraform modules. Those choices create useful consistency, but they also become migration surface area.
Before standardizing, create a migration scorecard for each workload tier:
- Protocol and client: Are clients using standard Kafka APIs, or do they depend on AWS-specific authentication and network assumptions?
- Data movement: Can the team run dual writes, MirrorMaker2, MSK Replicator, or another replication path without violating latency or cost targets?
- Offset continuity: Is consumer group state part of the migration plan, and how will cutover be validated?
- Topic features: Are compacted topics, transactions, tiered storage, large messages, and retention policies used in ways that narrow destination choices?
- Rollback: Can the platform move traffic back, or is rollback only a rebuild plan?
The goal is not to avoid MSK-specific features. The goal is to know when you are choosing them. Deliberate dependency is an architecture decision. Accidental dependency is future incident material.
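The scorecard can be kept as structured data so that open questions are visible per workload tier. This is a minimal sketch: the criterion keys mirror the five bullets above, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# One key per scorecard bullet above.
CRITERIA = (
    "protocol_and_client",
    "data_movement",
    "offset_continuity",
    "topic_features",
    "rollback",
)


@dataclass
class MigrationScorecard:
    workload_tier: str
    # True means the criterion has been reviewed and has a tested answer.
    answers: Dict[str, bool] = field(default_factory=dict)

    def open_items(self) -> List[str]:
        """Criteria that are unanswered or answered negatively, in scorecard order."""
        return [c for c in CRITERIA if not self.answers.get(c, False)]

    def complete(self) -> bool:
        """True when every criterion has a verified answer."""
        return not self.open_items()
```

A tier whose scorecard still lists `rollback` as open is a tier where dependency on MSK-specific choices has not yet been made deliberately.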
When Shared Storage BYOC Kafka deserves a look
The pattern emerging across these limitations is not that MSK lacks features. It is that Kafka's traditional Shared Nothing architecture still ties broker compute to local primary storage and partition movement. Tuning helps when the problem is inside an MSK envelope: right-size brokers, reduce partitions, adjust clients, enable tiered storage for eligible topics, use Serverless for bounded workloads, or use Express brokers where their quotas and pricing model fit.
Architecture change becomes relevant when the main pain is structural:
- Capacity decisions are dominated by retention and storage throughput rather than compute.
- Rebalancing and broker replacement are too slow because data movement follows broker ownership.
- Multi-tenant clusters need elastic compute without long partition reassignment cycles.
- Cross-AZ or cross-VPC traffic is hard to govern under a central platform standard.
- Teams want Kafka compatibility while keeping the data plane in their own cloud account.
This is where AutoMQ enters as an architecture category, not as a slogan. AutoMQ is a Kafka-compatible, cloud-native streaming platform that replaces Kafka's local log storage with a Shared Storage architecture on object storage. Its documentation describes stateless brokers, S3Stream, WAL storage, object storage as the primary repository, and BYOC deployment options where the data plane runs in the customer's cloud environment.
That design changes which risks need tuning. If brokers are stateless, scaling compute is less coupled to moving partition data. If object storage is the primary repository, retention planning is less tied to broker-local disks. If the platform is deployed as AutoMQ BYOC, the organization can keep the operational control and data placement model inside its own cloud account while still using Kafka APIs. Those are meaningful differences for teams whose MSK pain comes from storage, elasticity, or governance. They are less important for a small, stable workload that fits cleanly into MSK Serverless or a well-sized provisioned cluster.
The most defensible standard may not be "MSK everywhere" or "replace MSK everywhere." A stronger platform policy is workload-based: MSK for AWS-native teams that fit its quotas and managed-service model; MSK Serverless or Express for bounded profiles where the published limits fit; Shared Storage BYOC Kafka for workloads where retention, elasticity, and data-plane control dominate the risk model.
References
- AWS, Amazon MSK quota
- AWS, Amazon MSK pricing
- AWS, Best practices for Standard brokers
- AWS, Scale up Amazon MSK Standard broker storage
- AWS, Provision storage throughput for Standard brokers
- AWS, Tiered storage for Standard brokers
- Apache Kafka, KIP-405: Kafka Tiered Storage
- Apache Kafka, Documentation
- AutoMQ, Architecture overview
FAQ
What are the main MSK limitations?
The main MSK limitations to review are cost structure, partition and broker quotas, storage scaling behavior, tiered storage constraints, networking charges, client failover requirements, and migration surface area. Some are Amazon MSK service limits, while others come from Apache Kafka's broker-local storage and partition ownership model.
Is Amazon MSK worth it?
Amazon MSK is often worth it when teams want managed Kafka inside AWS and their workloads fit the published quotas, pricing model, and operational envelope. It is less compelling when the core requirement is fast elasticity, long retention without broker-local storage planning, or a data-plane model that must remain portable across cloud environments.
What is the difference between MSK Standard, MSK Serverless, and MSK Express?
MSK Standard gives the most control over broker sizing, storage, and Kafka configuration. MSK Serverless removes broker and storage provisioning from users but has per-cluster quotas such as throughput, connection, consumer group, and partition limits. MSK Express brokers use a different pricing and quota model and are designed for higher per-broker throughput and lower operational work than Standard brokers, but they still need workload-fit validation.
Does MSK tiered storage remove Kafka storage limits?
MSK tiered storage can reduce primary storage pressure and support longer retention by moving older data to a low-cost tier. It does not make brokers fully stateless, and AWS documents constraints including provisioned-cluster scope, topic eligibility, compacted topic restrictions, and low-cost tier retention behavior.
When should a team consider an Amazon MSK alternative?
Consider an alternative when the limiting factor is structural rather than operational: slow data rebalancing, retention-driven overprovisioning, frequent elasticity needs, cross-AZ or cross-VPC governance issues, or a requirement to keep Kafka-compatible data planes under BYOC control. In those cases, Kafka platforms built on Shared Storage architecture, such as AutoMQ, may deserve evaluation alongside MSK.
Is AutoMQ a drop-in replacement for MSK?
AutoMQ is Kafka-compatible, but no platform migration should be treated as a blind drop-in replacement. Teams should validate client compatibility, authentication, topic features, offset migration, observability, and rollback. The architectural difference is that AutoMQ uses Shared Storage and stateless brokers, which changes the scaling and storage trade-offs compared with broker-local Kafka deployments.
Ready to test whether your workload is a better fit for managed Kafka tuning or Shared Storage architecture? Start with the AutoMQ BYOC environment and compare it against your highest-risk MSK workload class, not a toy benchmark.