Blog

Diagnosing Kafka Saturation Before Adding More Brokers

A Kafka cluster under pressure often creates the same first reaction: add brokers, expand disks, and rebalance partitions before the next incident review. When producer latency rises, consumer lag refuses to drain, or disk usage climbs toward a policy limit, the fastest visible lever is more capacity. The problem is that Kafka saturation is rarely a single capacity number. It is a compound signal from client behavior, partition layout, broker CPU, disk I/O, network replication, retention policy, and the cloud bill underneath it.

That is why a serious kafka performance bottleneck review should start before a purchase order or infrastructure change. More brokers can help when compute is the real constraint. They can also make the operating model heavier when the real constraint is broker-local storage, inter-broker replication, unbalanced partitions, cross-AZ traffic, or a migration boundary that nobody has rehearsed. The review has to answer a harder question: is the cluster short of resources, or is the architecture turning every scaling event into data movement?

Kafka saturation review decision map

Why teams search for kafka performance bottleneck review

The search usually starts after a familiar sequence. The platform team sees broker-level alerts first, often disk, CPU, request latency, or under-replicated partitions. Application teams see the same event through different symptoms: produce requests take longer, consumers fall behind, batch jobs miss windows, and retry traffic increases the noise. Finance sees the infrastructure side later, when storage, network, and reserved headroom move in the wrong direction at the same time.

Those symptoms point to Kafka, but they do not identify the bottleneck by themselves. A producer latency spike can come from broker CPU, overloaded request handlers, slow disks, aggressive acks=all settings under replication stress, metadata churn, or network paths that cross availability zones. Consumer lag can be a consumer-side processing issue, a partition imbalance issue, a fetch sizing issue, or a broker read path issue. Disk growth can reflect legitimate retention, slow compaction, or a design that keeps three replicated copies of data on expensive block storage.

A useful review separates the signals into five lanes:

  • Client pressure. Check producer batching, compression, linger, acknowledgement settings, consumer processing time, fetch size, and retry behavior before blaming the broker tier.
  • Partition distribution. Look for hot partitions, uneven leader placement, skewed topic traffic, and consumer groups with fewer active consumers than partitions.
  • Broker resources. Compare request latency, network throughput, CPU, page cache behavior, disk I/O, and JVM pressure across brokers instead of averaging the whole cluster.
  • Storage and retention. Treat retention hours, segment size, compaction, tiering strategy, and disk expansion policy as performance variables, not capacity settings alone.
  • Cloud topology. In multi-AZ deployments, replication and client traffic can become a cost and latency variable. AWS describes data transfer charges as architecture-dependent, which means the bill is also an architecture signal.

The review has to keep those lanes separate long enough to avoid a false fix. If consumers are slow, adding brokers only hides lag until the next peak. If one topic has a hot key, adding brokers may create more idle partitions without moving the hot partition. If broker disks are the bottleneck, adding brokers can reduce pressure, but it also expands the local disks, replica placements, and rebalance operations the team must govern.

The production constraint behind the problem

Traditional Kafka was designed around a Shared Nothing architecture. Each broker owns local storage for the partitions it leads or follows, and durability comes from replicating partition data across brokers. This model is proven, well understood, and deeply integrated with Kafka operations. It also makes storage placement part of broker identity. When the cluster scales, fails over, or rebalances, data has to move because the durable copy is tied to broker-local storage.

That coupling is the reason saturation reviews become operationally expensive. A broker is not only a compute unit that handles requests. It is also a storage node, a replication participant, a network source, and a recovery boundary. Adding brokers can trigger partition and replica movement. Expanding disks buys more local storage before the team knows whether retention, write throughput, or read fanout caused the pressure. Changing instance types changes compute, memory, network, and storage behavior together.

Shared Nothing versus Shared Storage operating model

This is not an argument against Kafka. It is an argument against treating every Kafka symptom as a broker-count problem. Apache Kafka's documentation exposes the operational surfaces that matter: producer and consumer configuration, consumer groups, offset management, transactions, replication, and KRaft metadata. A bottleneck review has to follow those surfaces instead of reducing the system to a single utilization graph.

The cloud makes the constraint more visible. In a private data center, inter-node replication consumes network capacity, but it may not create a separate line item for every byte copied between fault domains. In public cloud deployments, storage class, block volume size, inter-AZ transfer, private networking, and managed service boundaries all affect Kafka cost. AWS documents data transfer cost patterns separately from compute and storage, so a stable cluster can still scale the wrong cost driver.

Architecture options and trade-offs

Once the team has evidence, the next step is not to choose a vendor. It is to choose the operating model that fits the workload. There are several valid paths, and each one has a different failure mode.

OptionWhen it helpsWhat to verify before choosing it
Add brokersBroker CPU, network, or request handling is the dominant constraintPartition skew, rebalance duration, rack awareness, and whether added brokers reduce the hot path
Resize brokersThe bottleneck maps cleanly to CPU, memory, network, or disk bandwidthInstance family limits, storage throughput, maintenance window, and rollback plan
Tune clientsProducer batching, compression, fetch behavior, or consumer processing is the dominant issueEnd-to-end latency budget, retry behavior, ordering expectations, and consumer group stability
Change retention or storage strategyDisk growth or historical reads are driving the incidentRecovery objectives, compliance retention, tiered storage semantics, and read latency expectations
Revisit architectureScaling events keep causing data movement, cost spikes, or operational riskKafka compatibility, migration path, ownership boundary, observability, and rollback design

Tiered storage deserves careful handling. Apache Kafka's tiered storage work, including KIP-405, addresses a real problem: keeping older log segments in remote storage so local disks do not carry the full retention burden. That can reduce local storage pressure and improve retention economics for some workloads. It does not automatically make brokers stateless, remove every rebalance concern, or separate compute from storage on the write path. Teams should evaluate it as a storage lifecycle feature, not as a complete operating model replacement.

The same discipline applies to diskless or shared-storage designs. Moving durable data away from broker-local disks changes the scaling equation, but teams still need to evaluate write latency, metadata behavior, compatibility, recovery, cloud permissions, and observability. The right question is whether it moves the bottleneck out of the part of Kafka operations that has been limiting the team.

Evaluation checklist for platform teams

A practical review should produce an artifact that SRE, platform engineering, security, and finance can all read. It should start with the incident symptom, connect it to measured evidence, then decide whether the fix changes capacity, configuration, or architecture.

Production readiness checklist

Use this checklist before approving more brokers:

  • Compatibility. Confirm the Kafka protocol surface your applications rely on, including producers, consumers, Kafka Connect, MirrorMaker-style replication, ACLs, SASL/TLS, transactions, offset commits, and observability integrations.
  • Cost decomposition. Split the bill into broker compute, persistent storage, object storage if used, inter-AZ or inter-region transfer, public egress, managed service fees, and idle headroom. Averages hide the part that is growing fastest.
  • Elasticity boundary. Identify whether compute can scale without moving historical data. If every scale event triggers large partition movement, elasticity is constrained by storage placement.
  • Failure recovery. Review broker failure, zone failure, metadata failure, client retry storms, and recovery time. Do not count a design as resilient until the team has tested how traffic resumes.
  • Migration risk. List the topics, offsets, ACLs, schemas, connectors, and consumers that must move together. The hardest part of migration is often not copying bytes. It is preserving application progress and rollback options.
  • Governance. Decide who owns cluster policy, cloud account access, encryption, network boundaries, audit trails, and upgrade windows.
  • Observability. Define the metrics that prove the bottleneck moved or disappeared. Without that proof, the next incident will reopen the same argument.

This checklist also prevents a common procurement mistake. A platform can be Kafka-compatible at the API level while still changing the operational model underneath. The review should distinguish protocol compatibility, storage architecture, deployment boundary, and day-2 operations instead of treating them as one yes-or-no question.

How AutoMQ changes the operating model

If the review shows that the limiting factor is broker-local storage and data movement, a Shared Storage architecture becomes worth evaluating. AutoMQ fits in that category: it is a Kafka-compatible streaming system that keeps the Kafka protocol and ecosystem surface while changing the storage layer underneath. Instead of making each broker the long-term owner of local log data, AutoMQ uses stateless brokers with a write-ahead log layer and object storage for durable stream data.

That distinction matters during saturation events. In a Shared Nothing cluster, adding brokers changes where partition replicas live, so scaling often creates data movement. In AutoMQ's model, durable data is not tied to a failed or newly added broker's local disk. Brokers handle compute and network I/O, while the storage layer provides the durable backing. The result is an operating model where compute and storage can be reasoned about separately.

AutoMQ's public documentation describes several capabilities that map directly to the checklist: Kafka compatibility for existing clients and tools, a Shared Storage architecture, WAL options for the write path, object-storage-backed retention, BYOC deployment boundaries, Self-Balancing, AutoMQ Linking for migration, and approaches for reducing cross-AZ traffic in supported cloud layouts. These are not magic switches. They are design choices that change which operational tasks remain with the platform team.

The most important evaluation points are concrete:

  • Will existing clients work without code changes? Kafka compatibility should be tested with the team's actual client versions, security settings, admin tools, and observability stack.
  • Does the write path meet the workload's latency budget? A WAL-backed design should be evaluated with the right WAL option, produce settings, and durability requirements.
  • Can migration preserve progress? Topic data is only part of the move. Consumer offsets, ACLs, cutover order, and rollback behavior determine whether the migration is operationally acceptable.
  • Does the deployment boundary match governance requirements? BYOC and software deployment models are valuable when data plane ownership, cloud account control, and network policy matter.
  • Can the team prove cost movement? Object storage, reduced local disk dependence, and reduced cross-AZ replication paths should be modeled against the actual workload, not a generic benchmark.

For teams that have confirmed the problem is architectural, AutoMQ's pricing calculator is a practical next step because it forces the discussion into workload inputs: write throughput, read fanout, retention, and cloud assumptions. If your Kafka saturation review shows that storage growth, replica movement, and cloud network cost are recurring constraints, model the workload with the AutoMQ pricing calculator before approving another round of broker expansion.

A review workflow that works under pressure

The best time to build the review process is before the cluster is paging the team every hour, but the workflow also works during an incident if the scope is tight. Start with a single workload or topic family, not the whole Kafka estate. Capture the symptom window, then compare producer latency, broker request latency, disk I/O, network throughput, consumer lag, and partition leadership during the same time range.

After that, run two decision loops. The first is tactical: tune clients, rebalance hot partitions, adjust quotas, resize brokers, or change retention where the evidence supports it. The second is architectural: if the same bottleneck keeps returning because storage and compute are coupled, evaluate a model that separates them. Keeping these loops separate avoids endless tuning when the architecture is the issue, and unnecessary platform migration when a client setting would have solved the incident.

Use a short scorecard:

Review areaEvidence to collectDecision threshold
Client behaviorProduce latency, batch size, compression, retries, consumer processing timeTune before scaling if clients create the bottleneck
Partition balanceLeader skew, hot keys, topic throughput by partitionRebalance or repartition if a minority of partitions dominate load
Broker healthCPU, request queues, disk I/O, network, JVM pressureAdd or resize brokers if pressure is broad and resource-bound
Storage modelRetention growth, rebalance data volume, recovery timeRevisit architecture if storage placement drives repeated incidents
Cloud economicsStorage cost, cross-AZ transfer, idle headroomRe-model platform cost if spend grows faster than useful throughput

That scorecard gives buyers a better basis for action than a utilization dashboard. It also helps SREs explain why "more brokers" is sometimes right and sometimes a more expensive version of the same problem.

References

FAQ

Should we always add brokers when Kafka is saturated?

No. Add brokers when evidence shows broad broker-side pressure that additional compute or network capacity will relieve. If the issue is client behavior, hot partitions, retention growth, or broker-local data movement, adding brokers can delay the next incident without fixing the root cause.

Is tiered storage the same as a Shared Storage architecture?

No. Tiered storage can move older log segments to remote storage and reduce local disk pressure, but brokers may still own local hot data and operational placement. A Shared Storage architecture changes the broker's relationship to durable data more directly by making brokers stateless or near-stateless, depending on implementation.

What metrics matter most in a Kafka performance bottleneck review?

Start with producer request latency, broker request latency, network throughput, disk I/O, CPU, partition leadership, consumer lag, and topic-level throughput. Then compare those metrics over the same incident window. The relationship between them matters more than any single graph.

Where does AutoMQ fit in the decision?

AutoMQ fits after the review shows that broker-local storage, data movement, cloud network cost, or slow elasticity is a recurring constraint. Its Kafka-compatible Shared Storage architecture is designed to separate compute from durable storage, so teams should evaluate it as an operating model change rather than as a simple broker replacement.

Newsletter

Subscribe for the latest on cloud-native streaming data infrastructure, product launches, technical insights, and efficiency optimizations from the AutoMQ team.

Join developers worldwide who leverage AutoMQ's Apache 2.0 licensed platform to simplify streaming data infra. No spam, just actionable content.

I'm not a robot
reCAPTCHA

Never submit confidential or sensitive data (API keys, passwords, credit card numbers, or personal identification information) through this form.