Hot Partitions in Kafka: Detection, Mitigation, and Architecture Choices

Searches for kafka hot partitions rarely come from curiosity. They usually start after a dashboard shows one broker busier than the rest, one consumer group falling behind on a few partitions, or one tenant pushing enough traffic to distort the whole cluster. The averages look calm, but the tail tells a different story: producer latency rises, consumer lag clusters around a small key range, and adding brokers does not cool the overloaded path.

That is the uncomfortable part of partition skew. Kafka is designed around partitions, so the partition is both the unit of ordering and the unit of parallelism. A hot partition is not only "too much traffic." It is a sign that the workload's key distribution, broker placement, consumer parallelism, and storage path are no longer aligned. The fix can be as small as changing a producer key or as large as rethinking how much state a broker should own.

Hot Partitions Are a Shape Problem, Not a Cluster Average Problem

Kafka topics are split into partitions. Producers write records to partition leaders, consumers read by partition assignment, and consumer groups coordinate ownership of those partitions. This gives Kafka its clean ordering model: order is guaranteed within a partition, not across an entire topic. It also means a single partition can become the narrowest point in a much larger system.

The first trap is looking only at broker-level or topic-level averages. A topic can have healthy aggregate throughput while one partition carries a disproportionate share of bytes, requests, or expensive keys. A broker can show spare CPU while one leader handles the busiest traffic path. A consumer group can have enough instances on paper while the lag is concentrated on fewer partitions than active consumers.

Useful detection starts at the partition level:

Produce and fetch rate by partition. Compare bytes in, records in, bytes out, and request latency across partitions instead of relying on topic totals.
Leader concentration. Check whether the busiest partitions share the same broker, availability zone, rack, or storage volume.
Consumer lag distribution. Look for lag that grows on specific partitions while other partitions drain normally.
Key cardinality and key skew. Inspect whether a small set of tenants, accounts, devices, or customer IDs dominates the write path.
Recovery and backfill overlap. Separate normal live traffic from replay jobs, connector snapshots, compaction pressure, and partition reassignment traffic.

The last point matters because hot partitions often appear during operational events. A workload may be balanced during normal ingest, then become skewed when a historical replay fetches from a few retained partitions or a connector snapshot writes bursts for one source table. If the investigation starts after the incident, the team may mistake a temporary replay shape for a permanent keying problem.

Why Key Choice Becomes an Operations Contract

Partition keys are often chosen for application correctness first. That is reasonable. If all events for one order, account, or device must be processed in order, the key has to preserve that invariant. The trouble begins when the same key also carries an uneven workload. A large marketplace seller, a high-volume game title, or a popular IoT device family can turn a clean ordering rule into a cluster-level bottleneck.

Changing the key after production traffic exists is harder than choosing it up front. Consumers may depend on partition-local order. Stream processors may hold keyed state. Compacted topics may rely on the key as the identity of the latest value. Downstream systems may have assumptions about idempotency, ordering, and replay. A hot partition is therefore not only a Kafka issue; it is an application contract issue.

Three mitigation patterns are common, and each has a different blast radius:

Mitigation path	What it changes	When it helps	Hidden risk
Rebalance leaders or replicas	Broker placement, not the key	A hot partition is made worse by leader concentration or broker imbalance	Traditional reassignment can move data and compete with live traffic
Increase partition count	Parallelism ceiling	The topic has too few partitions for expected throughput and consumer concurrency	Existing keys may still map unevenly, and key-to-partition mapping can change
Change the keying strategy	Workload distribution	One key or key range dominates traffic	Ordering, compaction, and stateful processing semantics may change

This is where quick fixes can create a second incident. Increasing partitions may give future records more possible destinations, but it does not magically spread a single hot key. Salting the key can spread load, but it also breaks strict per-key order unless the application adds a reconciliation layer. Moving leaders can relieve a broker, but it does not change the shape of the workload.

The Detection Workflow Should Separate Four Bottlenecks

Treat the first investigation as a classification exercise. The team is not yet trying to prove which platform is better. It is trying to identify what kind of constraint the hot partition is exposing.

Key-space bottleneck. One business key or a small key range dominates traffic. Fixes usually live in producer design, aggregation strategy, sharding rules, or application semantics.
Partition-count bottleneck. The topic has fewer partitions than the required write, read, or processing parallelism. Fixes involve partition expansion, topic redesign, or workload splitting.
Placement bottleneck. Hot leaders or replicas are concentrated on the same broker, disk, zone, or network path. Fixes involve leader balancing, partition reassignment, rack awareness, and capacity headroom.
Storage-coupling bottleneck. The cluster can identify what should move, but moving it is slow or risky because partition data is tied to broker-local storage. Fixes require a deeper architecture decision.

These categories often overlap. A skewed key can create a hot leader, and a hot leader can expose local disk latency, and a recovery event can make the same broker look overloaded for reasons unrelated to the key. The point of the workflow is to avoid treating all of those as "add brokers." More brokers help only when the workload can actually spread to them and when the data movement needed to use them is acceptable.

For SREs, the strongest signal is the difference between where the work should run and where the bytes already live. If moving leadership is enough, the fix is operational. If moving replicas or retained data dominates the maintenance window, the problem is architectural.

Stateful Brokers Make Rebalancing a Data Movement Event

Traditional Kafka follows a Shared Nothing architecture. Each broker owns local log replicas, followers replicate from leaders, and the cluster uses partition assignment to distribute leadership and durability. That model is proven and understandable, but it makes balancing more than a scheduler decision. When a partition's durable data is local to a broker, changing placement can require copying data, catching up replicas, and watching network, disk, and controller behavior while production traffic continues.

Tiered Storage changes part of this equation. Apache Kafka KIP-405 introduced remote log storage so older log segments can be moved to a remote tier while the broker keeps recent active data locally. For long retention, that can reduce local disk pressure and make historical storage less expensive to operate. It is a useful tool when the main problem is retained bytes.

Tiered Storage does not make every hot-partition operation stateless. Active leaders, recent segments, local cache behavior, replication, and consumer fetch patterns still matter. If the hot partition is dominated by live writes or tail reads, the broker that owns the active path still has to serve the pressure. If the team needs to rebalance hot partitions during a spike, the active placement path still shapes how fast added capacity becomes useful.

This distinction explains why some teams keep tuning Kafka and still see the same incident pattern. They can identify the overloaded partitions. They can add brokers. They can plan reassignment. But the act of making added capacity useful still touches local data, replication traffic, and operational risk.

A Vendor-Neutral Evaluation Framework

The right mitigation depends on what you are protecting: ordering, throughput, recovery time, cost control, or operational simplicity. A stable Kafka cluster with predictable traffic, mature balancing automation, and bounded retention may not need a different architecture. The stronger case for architectural change appears when the same failure mode repeats: a few partitions get hot, the cluster has spare aggregate capacity, and the useful fix requires slow or risky data movement.

Use the following questions before choosing a path:

Question	Stay with current Kafka tuning	Repartition or redesign workload	Evaluate a Shared Storage architecture
Is the hot path caused by a single business key?	Maybe, if the key is rare and bounded	Yes, if the keying contract can change	Only after application semantics are understood
Is broker load uneven because leaders are poorly placed?	Yes, if reassignment is safe and fast enough	Usually no	Consider if rebalancing keeps becoming a storage event
Is consumer lag limited by partition count?	Add consumers only if partitions allow it	Yes, partition expansion or topic split may help	Architecture helps less than fixing the parallelism ceiling
Does scaling require moving large retained data sets?	Acceptable for infrequent, planned changes	Partial help	Strong signal for Shared Storage architecture
Do hot partitions affect cost or availability during bursts?	Add headroom and guardrails	Reduce skew at source	Evaluate if elastic compute and storage independence reduce the incident window

This table intentionally avoids a single universal answer. Hot partitions are not one problem. They are the place where application semantics, Kafka placement, and storage architecture meet.

Where AutoMQ Changes the Operating Model

Once the investigation reaches the storage-coupling bottleneck, the architecture question becomes sharper: should partition data remain broker-local, or should durable stream storage live outside the broker fleet? AutoMQ sits in the second category. It is a Kafka-compatible streaming platform built around a Shared Storage architecture, where brokers are stateless and stream data is stored in S3-compatible object storage through S3Stream and WAL storage.

For hot partitions, the important change is not that key skew disappears. It does not. If one tenant sends most of the traffic and strict ordering pins that tenant to one partition, the application still has a key-design problem. The difference is in the operational response after the team knows what must move. In a Shared Storage architecture, changing broker ownership does not have to copy the partition's retained local log between brokers in the same way a broker-local model does.

That can change the practical rebalancing loop:

Detection remains Kafka-aware. You still measure partition-level traffic, leader placement, lag, and client impact.
Mitigation has a smaller storage side effect. Broker changes focus more on ownership, metadata, cache, WAL safety, and object-store access than on moving large local log replicas.
Capacity becomes easier to reason about. Compute scale and retained data growth are less tightly coupled, which matters when bursts and long retention coexist.
Migration can be incremental. Kafka compatibility lets teams test producer and consumer behavior before moving the most skew-sensitive workloads.

The honest boundary is that architecture does not replace workload design. A Shared Storage architecture makes certain operations less tied to retained bytes, but it cannot invent ordering parallelism that the keying model forbids. It is most relevant when the team has already done the Kafka hygiene work and still finds that storage movement is the slowest part of responding to hot partitions.

A Practical Runbook for Platform Teams

A useful hot-partition runbook should avoid vague actions such as "rebalance the cluster" until the cause is classified. Start with evidence, then choose the smallest fix that preserves application semantics.

First, capture the partition distribution for the incident window: top partitions by bytes in, bytes out, records, produce latency, fetch latency, and consumer lag. Pair that with leader placement and broker resource metrics. The goal is to identify whether the same few partitions explain the user-visible symptom.

Second, inspect the key space. If the hot partition maps to a high-volume key or a small group of keys, bring the application owner into the discussion before changing Kafka layout. The safest Kafka operation can still be wrong if it breaks per-key ordering or compaction semantics.

Third, test the operational fix outside the incident path. Leader movement, partition reassignment, partition expansion, and topic split strategies should be rehearsed with production-like traffic. A happy-path test with empty topics proves little about a real hot partition, where live traffic, retained data, and catch-up reads compete for the same resources.

Finally, decide whether the incident is tactical or structural. Tactical issues can be fixed with better keys, more partitions, better leader balance, quotas, and alerting. Structural issues keep returning because the cluster's storage model makes every useful balancing action too slow, too risky, or too expensive to perform when the workload changes.

If your current Kafka-compatible platform keeps turning hot-partition response into a data movement project, compare that operating model with a Shared Storage architecture and test it against your own skewed workload. AutoMQ's architecture overview and GitHub project are good next stops for a production-style evaluation.

References

FAQ

What is a hot partition in Kafka?
A hot partition is a partition that receives or serves a disproportionate share of traffic compared with other partitions in the same topic. It can overload one leader, increase producer latency, concentrate consumer lag, or make one broker look busy while the cluster average appears healthy.

Do more brokers fix Kafka hot partitions?
More brokers help only if the hot workload can move to them. If the issue is a single hot key, too few partitions, or slow partition data movement, adding brokers may leave the overloaded path mostly unchanged.

Should I increase the partition count to fix hot partitions?
Increasing partitions can raise the parallelism ceiling, but it does not automatically split a single hot key. Check ordering requirements, key distribution, consumer assignments, and downstream state before changing partition counts.

Does Tiered Storage solve hot partitions?
Tiered Storage can reduce local disk pressure for older retained data. It does not fully remove the active leader, recent log, cache, and live write path constraints that often define a hot partition incident.

When should Shared Storage architecture be evaluated?
Evaluate Shared Storage architecture when the team can identify hot partitions but the response keeps requiring slow broker-local data movement. That is the point where the operating model, not only the workload shape, deserves scrutiny.

Hot Partitions in Kafka: Detection, Mitigation, and Architecture Choices

Hot Partitions Are a Shape Problem, Not a Cluster Average Problem

Why Key Choice Becomes an Operations Contract

The Detection Workflow Should Separate Four Bottlenecks

Stateful Brokers Make Rebalancing a Data Movement Event

A Vendor-Neutral Evaluation Framework

Where AutoMQ Changes the Operating Model

A Practical Runbook for Platform Teams

References

FAQ

Trusted by teams running Kafka at scale

Grab

Tencent

LG U+

Hot Partitions in Kafka: Detection, Mitigation, and Architecture Choices

Hot Partitions Are a Shape Problem, Not a Cluster Average Problem

Why Key Choice Becomes an Operations Contract

The Detection Workflow Should Separate Four Bottlenecks

Stateful Brokers Make Rebalancing a Data Movement Event

A Vendor-Neutral Evaluation Framework

Where AutoMQ Changes the Operating Model

A Practical Runbook for Platform Teams

References

FAQ

Trusted by teams running Kafka at scale

Grab

Tencent

LG U+

Newsletter