Blog

Kafka Scaling Issues: Why Adding Brokers Does Not Always Fix the Bottleneck

The frustrating version of Kafka scaling starts with a reasonable move: the team adds brokers, waits for the cluster to settle, and expects latency or consumer lag to drop. Instead, p99 produce latency stays high, one topic still has a growing backlog, and broker dashboards show that the added nodes are not carrying much of the pain. Nothing feels broken enough to call an incident, but the cluster clearly did not scale the way the team expected.

That gap between expectation and result is where many Kafka scaling issues begin. Broker count is only one dimension of Kafka scalability. A cluster can have extra CPU while one partition leader is overloaded. It can have idle brokers while disks on two old brokers are saturated. It can have enough network bandwidth in aggregate while replication, catch-up fetches, and consumer fanout all converge on the same leaders. Adding capacity helps only when the bottleneck can actually use that capacity.

The useful question is not "How many brokers should Kafka have?" It is "Which part of the system is the limiting unit right now?" For SREs and platform engineers, that means diagnosing partitions, leaders, disks, network, replication, consumers, and controller work before deciding whether the answer is tuning, rebalancing, repartitioning, or replacing the architecture.

Kafka Scaling Bottleneck Map

The Symptoms of Kafka Scaling Problems

Kafka scaling problems usually show up as symptoms that look similar but have different causes. Producer latency can rise because brokers are CPU-bound, because request queues are backing up, because followers are slow to acknowledge writes, or because disks cannot flush and read at the same time. Consumer lag can grow because brokers are slow, but it can also grow because the consumer group has fewer active consumers than partitions, because a few hot partitions dominate the workload, or because downstream processing is the actual limiter.

This is why "add brokers" is often too broad as a first response. A broker adds potential CPU, memory, disk, and network capacity, but Kafka traffic is organized by partition leadership and replica placement. If the hot partitions remain on the same leaders, the added broker may be mostly a spectator. If the topic has too few partitions, the workload cannot spread across more leaders or more consumers without repartitioning. If consumer processing is the bottleneck, the broker layer may scale perfectly while application lag keeps rising.

Look for the mismatch between aggregate health and local pressure. Averages hide Kafka scaling problems because Kafka is a set of partition-level logs running on broker-level resources. The cluster can look acceptable in aggregate while a small number of leaders, disks, or network paths dominate the user-visible failure mode.

Start with these signals before changing topology:

  • Produce and fetch latency by broker. Request latency points to where clients experience the problem, not only where resource graphs look busy.
  • Leader distribution and partition traffic. A balanced partition count does not guarantee balanced bytes per partition.
  • Disk utilization and disk I/O wait. Local storage saturation can make every other resource look confusing.
  • Network in/out per broker. Replication and consumer reads can overwhelm a broker even when producer writes look modest.
  • Under-replicated partitions, ISR shrink events, and follower lag. Slow replication can turn scaling into a durability and latency problem.
  • Consumer lag per partition, not only per group. One hot partition can hold back an otherwise healthy consumer group.

The goal is to find the narrowest bottleneck. Once you know that, the next move becomes much less emotional.

The Bottlenecks Hidden Behind Broker Count

Kafka broker scaling works when more brokers can take ownership of meaningful work. That work is not evenly distributed by magic. It is shaped by the number of partitions, the traffic per partition, which brokers lead those partitions, where follower replicas sit, and where clients run.

Partition count is the first constraint. Each partition has one leader at a time, and producer writes for that partition go through that leader. If a high-throughput topic has only a few partitions, the maximum write parallelism is capped no matter how many brokers exist. Adding brokers increases available slots, but it does not split existing partitions into smaller streams. Repartitioning can help, but it changes ordering boundaries and may require application or topic migration work.

Leader skew is the second constraint. A cluster can have a reasonable replica count but still place too many busy leaders on a few brokers. That often happens after organic growth: topics were created at different times, brokers were added later, and traffic patterns changed after the original partition plan. Rebalancing leaders may reduce pressure without adding hardware, but only if the problem is leadership placement rather than partition count or disk throughput.

Disk and network are the third constraint, and they are easy to underestimate because they interact. A broker may be reading historical data for consumers, writing new segments, serving follower fetches, and copying data during reassignment at the same time. When local disks are saturated, adding a broker helps only after enough partition data moves to it. When network is saturated, adding a broker can even create temporary pressure if the cluster starts copying retained data to rebalance.

BottleneckCommon SymptomWhy More Brokers May Not HelpBetter First Move
Too few partitionsLag remains even with idle brokersExisting partitions cannot use more leaders or consumersRepartition or split the workload
Hot leadersOne or two brokers show high latencyNew brokers do not own the busy leadersLeader rebalance and placement review
Disk I/O saturationProduce and fetch latency rise togetherCapacity helps only after data movesReduce local disk pressure or move partitions
Replication lagISR churn or under-replicated partitionsMore brokers add targets but not faster followersInspect follower fetch, disk, and network paths
Consumer parallelismLag isolated to a few partitionsBroker capacity is not the limiterScale or redesign consumers within partition limits
Controller operationsSlow metadata or reassignment progressMore brokers can increase metadata surface areaReduce churn and sequence changes carefully

This table also explains why Kafka cluster scaling becomes a troubleshooting exercise rather than a capacity spreadsheet. The same visible symptom, lag or latency, can come from different layers. Treating every symptom as a broker-count problem creates expensive clusters that still feel fragile.

Why Partition and Storage Coupling Makes Scaling Slow

The deeper reason traditional Kafka scaling can feel stubborn is that brokers are not only compute workers. They also own persistent log data on local storage. That design is powerful and proven, but it means cluster scaling is often coupled to data placement. An added broker is not useful for an existing hot partition until leadership, replica placement, or partition assignment changes. For retained data, that change can involve copying log segments across brokers.

This is where Kafka broker scaling differs from scaling a stateless service. In a stateless service, adding instances can immediately absorb new requests after routing changes. In traditional Kafka, adding a broker creates a place where work could run, but the durable partition data and leadership responsibilities still have to move or be reassigned. If the cluster has long retention, large topics, or cross-AZ placement, that movement can take time and create its own disk and network load.

Add Broker but Still Bottlenecked

Kafka's own operational documentation reflects this reality: expanding a cluster is not only a matter of starting another broker; partition replicas need to be reassigned if you want the new broker to carry existing topic load. That is a normal part of operating Kafka, not a bug. The pain appears when the reassignment traffic competes with the production workload that scaling was supposed to rescue.

Replication can amplify the same effect. With replicated partitions, the leader must serve followers, and followers must keep up to remain in sync. If a follower falls behind because of disk or network pressure, producer latency may rise depending on acknowledgment and in-sync replica settings. Adding another broker does not automatically make those followers faster. It may create a better destination for future placement, but the current slow path still needs to be diagnosed.

The architecture is doing exactly what it was designed to do: keep durable logs close to brokers and replicate them for fault tolerance. The scaling tradeoff is that compute, storage, and data movement are tied together. When growth is gradual, careful reassignment and balancing can be enough. During spikes, migrations, or repeated broker replacement, that coupling becomes much harder to hide.

A Decision Flow for Tuning, Repartitioning, or Replacing

A good diagnosis flow separates immediate relief from structural change. If the production system is under pressure, first reduce risk: avoid launching a large reassignment into an already saturated network, avoid changing partition counts without understanding ordering and consumer behavior, and avoid lowering replication safeguards as a cost or throughput shortcut.

Use the bottleneck to choose the smallest effective move:

  1. If the issue is producer or fetch latency concentrated on a few brokers, inspect leader distribution, request queues, disk I/O, and network before adding brokers.
  2. If the issue is isolated to one topic with too few partitions, plan repartitioning as an application-level scaling change, not a broker-level fix.
  3. If the issue is consumer lag, compare lag per partition with active consumer count and downstream processing time.
  4. If the issue appears during scaling or maintenance, inspect reassignment, recovery, and follower catch-up traffic.
  5. If the issue repeats whenever brokers are added, replaced, or rebalanced, treat storage ownership as part of the problem.

The last case is the architectural turning point. Tuning can improve request handling. Rebalancing can improve placement. Repartitioning can unlock more parallelism. But when the recurring pain comes from moving retained data every time compute capacity changes, the team is no longer dealing with a single bad configuration. It is dealing with a scaling model where local persistent data makes elasticity expensive to exercise.

That does not mean every Kafka cluster should be replaced. A stable workload with predictable throughput, good partition design, and infrequent topology changes may be well served by conventional operations. The replacement discussion becomes rational when the cluster's growth pattern keeps producing the same failure mode: more brokers are added, data movement starts, hot partitions remain hot, and the platform team spends the next maintenance window nursing the rebalance.

How AutoMQ Approaches Kafka Scaling Differently

If the root issue is the coupling between broker compute and broker-local durable data, the alternative architecture needs to separate those responsibilities. AutoMQ is a Kafka-compatible cloud-native streaming platform that keeps Kafka protocol semantics while moving durable log storage to object storage and making brokers more stateless. The important scaling difference is not a slogan about "more capacity"; it is that adding compute does not have to mean moving persistent partition data in the same way a shared-nothing Kafka cluster does.

In this model, brokers can be added for serving traffic and ownership can shift without treating retained log data as broker-local cargo. Self-balancing and reassignment become more about compute placement, metadata, and traffic distribution. Persistent data remains in shared storage, so the expensive part of many traditional scaling events is reduced or avoided.

Stateless Broker Scale-Out Flow

This distinction should be scoped carefully. Stateless broker scale-out does not eliminate every bottleneck. Poor partition design still matters. Consumer parallelism still matters. Client placement, object storage paths, and workload-specific latency requirements still need to be evaluated. What changes is the failure mode of broker scaling: an added broker can become useful without first becoming the owner of a large local copy of historical data.

For teams troubleshooting Kafka scalability during growth, that changes the decision framework. If the cluster needs occasional tuning, keep tuning. If the cluster needs a leader rebalance, do the rebalance. If the topic needs more partitions, plan repartitioning with application owners. But if Kafka scaling problems keep returning because every capacity change turns into a data movement project, it is time to evaluate a Kafka-compatible architecture where compute and storage can scale on different timelines.

The original frustration was simple: more brokers were added, but latency and lag did not improve. The fix is not always to add even more brokers. The fix is to identify the real limiting unit, then decide whether the current architecture lets that unit scale cleanly.

References

FAQ

Why did adding Kafka brokers not reduce consumer lag?

Consumer lag may be limited by partition count, hot partitions, consumer processing speed, or downstream systems rather than broker capacity. Check lag per partition, active consumers per group, and broker fetch latency before assuming the cluster needs more brokers.

Does Kafka scale horizontally?

Yes, Kafka can scale horizontally, but its scaling unit is not only the broker. Throughput depends on partition count, leader placement, replication, disk and network capacity, and consumer parallelism. Horizontal scaling works best when those units can use the added brokers.

When should I add partitions instead of brokers?

Add partitions when the workload needs more parallelism than the current partition count allows. This is especially relevant when a topic has too few partitions for producer throughput or consumer group parallelism. Review ordering requirements and application behavior before changing partition strategy.

Can broker rebalancing fix Kafka scaling problems?

Broker and leader rebalancing can help when load is unevenly placed across the cluster. It will not fix too few partitions, slow consumers, saturated disks, or an architecture where every scaling event creates too much data movement.

How is AutoMQ different from traditional Kafka scaling?

Traditional Kafka ties broker compute to broker-local persistent log data. AutoMQ uses a Kafka-compatible model with shared object storage and more stateless brokers, so adding compute capacity can be separated from moving persistent partition data.

Newsletter

Subscribe for the latest on cloud-native streaming data infrastructure, product launches, technical insights, and efficiency optimizations from the AutoMQ team.

Join developers worldwide who leverage AutoMQ's Apache 2.0 licensed platform to simplify streaming data infra. No spam, just actionable content.

I'm not a robot
reCAPTCHA

Never submit confidential or sensitive data (API keys, passwords, credit card numbers, or personal identification information) through this form.