Kafka Troubleshooting Runbooks for Platform Teams

A Kafka incident rarely starts with a clean root cause. It starts with consumer lag, producer retries, a stuck rebalance, a broker disk alert, a busy controller, or a bill that suddenly makes the platform team unpopular. A page of commands for each alert decays quickly because Kafka problems cross application, broker, storage, network, metadata, and cost boundaries. A useful Kafka troubleshooting runbook has to guide the operator through evidence, not only commands.

The better model is an operating contract: symptom -> hypothesis -> evidence -> mitigation -> rollback. That contract gives an on-call engineer structure to move quickly without improvising every incident. It also shows platform leads which incidents are local defects and which ones are architecture signals. If the same runbook keeps ending in broker replacement, partition reassignment, disk expansion, and cross-zone traffic review, the incident is feedback from the platform design.

Figure: Kafka troubleshooting runbook decision loop

Runbooks Should Encode Decisions, Not Folklore

Most Kafka teams already have troubleshooting material: Grafana links, shell snippets, JMX metrics, kafka-consumer-groups.sh commands, and old incident threads. That toolbox still does not tell the next responder which evidence is trustworthy, or when a mitigation creates a worse second-order problem.

A production runbook needs five fields for every common symptom:

  • Observable symptom. Name the user-visible or SLO-visible failure: lag, produce latency, collapsed fetch rate, delayed controller requests, storage pressure, or unexplained spend.
  • Likely hypotheses. List competing explanations instead of the favorite one: slow consumer, rebalance loop, hot partition, throttling, storage read amplification, network loss, or downstream backpressure.
  • Evidence to collect. Specify the metrics, logs, traces, state, and recent changes that separate one hypothesis from another.
  • Mitigation. Explain the immediate action and its blast radius.
  • Rollback. State the condition that tells the team to stop or reverse.
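One way to keep these fields honest is to store each runbook entry as structured data instead of free text. The sketch below is a minimal illustration, not a prescribed schema: the `RunbookEntry` class, its field names, and the disk-pressure values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RunbookEntry:
    """One symptom-level entry: symptom -> hypothesis -> evidence -> mitigation -> rollback."""
    symptom: str
    hypotheses: list[str]
    evidence: dict[str, str]  # hypothesis -> what to check to confirm or rule it out
    mitigation: str
    blast_radius: str
    rollback_trigger: str

    def missing_evidence(self) -> list[str]:
        """Return hypotheses with no evidence step, so gaps show up in review."""
        return [h for h in self.hypotheses if h not in self.evidence]


# Hypothetical entry for a broker disk alert; every value here is illustrative.
disk_pressure = RunbookEntry(
    symptom="Broker disk usage above alert threshold",
    hypotheses=["retention growth", "uneven placement", "failed cleanup", "traffic spike"],
    evidence={
        "retention growth": "log dir usage by topic, recent retention overrides",
        "uneven placement": "partition count and bytes per broker",
        "failed cleanup": "cleanup metrics and broker logs",
    },
    mitigation="Lower retention on the largest noncritical topic",
    blast_radius="Consumers of that topic lose replay history beyond the new retention",
    rollback_trigger="Stop if a consumer owner reports it still needs the older data",
)

print(disk_pressure.missing_evidence())  # ['traffic spike']
```

A check like `missing_evidence` can run in review or CI so that a hypothesis without a way to test it is caught before an incident, not during one.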

This structure matters because Kafka incidents often look simpler than they are. A broker disk alert may be caused by retention, uneven placement, failed cleanup, a traffic spike, or an architecture that requires too much local headroom. A runbook that jumps from "disk high" to "expand disk" hides those distinctions.

The Symptom Map Platform Teams Actually Need

The most useful runbooks are organized around failure modes responders can recognize quickly. Platform teams should start with high-value playbooks, then refine them after real incidents. The goal is to keep the first minutes from becoming a debate about where to look.

| Symptom | First hypotheses | Evidence that separates causes | Fast mitigation | Architecture signal |
|---|---|---|---|---|
| Consumer lag grows | Slow processing, hot partition, rebalance loop, storage read latency | Lag by partition, processing time, rebalance count, fetch latency, downstream queue depth | Pause low-priority consumers, increase processing capacity, isolate hot keys | Partitioning and storage read path may be limiting recovery |
| Producer errors rise | Broker saturation, request throttling, ISR pressure, network loss | Produce request latency, error type, broker CPU, network retransmits, ISR changes | Reduce batch pressure, route traffic, scale brokers if safe | Write path and replication model may be too tightly coupled |
| Broker storage pressure | Retention growth, uneven placement, failed cleanup, replay storm | Log dir usage by topic, partition skew, cleanup metrics, recent topic changes | Adjust retention, move partitions, expand capacity | Broker-local storage may be driving repeated emergency work |
| Metadata/controller pressure | Too many partitions, topic churn, controller failover, slow brokers | Controller queue time, metadata request rate, partition count, recent automation | Stop churn, reduce automation loops, stabilize controller quorum | Topic lifecycle governance is missing |
| Rebalance instability | Client churn, cooperative/eager mismatch, session settings, slow processing | Rebalance duration, group generation churn, client logs, heartbeat timing | Drain unstable members, tune client settings, reduce deploy churn | Deploy process and consumer design need ownership |
| Network or AZ failure | Cross-zone path, DNS, load balancer, inter-broker traffic, client placement | Zone-level latency, packet loss, broker rack, client rack, cloud network events | Shift clients, isolate zone, lower noncritical traffic | Data placement and network topology need redesign |
| Schema or serialization breakage | Bad producer rollout, incompatible schema, poison message | Error logs, schema version, topic offset, producer deployment timeline | Stop bad producer, quarantine offsets, replay after fix | Contract testing and rollout gates are weak |
| Stream processing backpressure | State store pressure, sink bottleneck, checkpoint delay, skew | Task-level lag, checkpoint duration, sink latency, input skew | Scale tasks, pause input, shed noncritical output | Kafka runbook must include downstream systems |
| Cost anomaly | Retention expansion, cross-zone traffic, cold reads, replication, consumer fan-out | Bill line items, topic growth, read/write path, zone placement | Cap retention, fix routing, separate heavy readers | Cost should be part of operational SLOs |

The architecture signal column prevents the runbook from becoming a loop of temporary fixes. If lag incidents consistently end with more brokers and reserved disk, the platform may be using capacity as a substitute for diagnosis. If every storage alert ends with partition movement, the team is paying a tax created by local broker storage.
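The "loop of temporary fixes" can be made visible by counting how often the same data-movement or capacity mitigation closes an incident. This is a rough sketch under assumptions: the incident records, the mitigation labels, and the `min_repeats` threshold of 3 are hypothetical and would come from your own incident tracker.

```python
from collections import Counter

# Hypothetical labels for mitigations that imply data movement or reserved capacity.
ARCHITECTURE_MITIGATIONS = {"broker replacement", "partition reassignment", "disk expansion"}


def architecture_signals(incidents, min_repeats=3):
    """Return architecture-related mitigations that closed at least min_repeats incidents."""
    counts = Counter(incident["mitigation"] for incident in incidents)
    return {
        mitigation: count
        for mitigation, count in counts.items()
        if mitigation in ARCHITECTURE_MITIGATIONS and count >= min_repeats
    }


# Illustrative incident history, not real data.
history = [
    {"symptom": "storage pressure", "mitigation": "partition reassignment"},
    {"symptom": "storage pressure", "mitigation": "partition reassignment"},
    {"symptom": "consumer lag", "mitigation": "consumer fix"},
    {"symptom": "storage pressure", "mitigation": "partition reassignment"},
    {"symptom": "storage pressure", "mitigation": "disk expansion"},
]

print(architecture_signals(history))  # {'partition reassignment': 3}
```

The threshold is a judgment call. The point is that the review asks the question on a schedule instead of waiting for someone to notice the pattern.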

Consumer Lag Is a Workflow, Not a Metric

Consumer lag is the canonical Kafka troubleshooting example because it is easy to measure and easy to misunderstand. A lag graph says that consumers are behind. It does not say whether the consumer is slow, the partition is hot, the broker is serving reads slowly, the group is rebalancing, the sink is blocked, or deployment churn keeps destabilizing the group.

A strong consumer-lag runbook separates partition-level lag from group-level lag. If one partition is behind while the rest are healthy, suspect key skew, a poison record, or single-partition processing pressure. If every partition is behind, suspect shared processing capacity, broker-side read latency, downstream backpressure, network path, or a coordinated deploy. Then compare processing time against fetch latency, check rebalance frequency, inspect recent changes, and validate sink throughput.
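The partition-level versus group-level split can be sketched as a small classifier over per-partition lag. The function name, the `behind_threshold` of 1000 records, and the `skew_ratio` of 0.5 are assumptions for illustration; real thresholds depend on topic throughput and SLOs.

```python
def classify_lag(lag_by_partition, behind_threshold=1000, skew_ratio=0.5):
    """Classify lag scope from a mapping of partition id -> lag in records.

    Returns (scope, behind_partitions), where scope is 'healthy',
    'partition-level', or 'group-wide'.
    """
    behind = [p for p, lag in lag_by_partition.items() if lag > behind_threshold]
    if not behind:
        return "healthy", behind
    # A minority of partitions behind suggests skew, a poison record, or one slow partition.
    if len(behind) / len(lag_by_partition) <= skew_ratio:
        return "partition-level", behind
    # Most partitions behind suggests shared capacity, broker reads, or downstream pressure.
    return "group-wide", behind


print(classify_lag({0: 12, 1: 40, 2: 250_000, 3: 8}))
# ('partition-level', [2])
print(classify_lag({0: 90_000, 1: 85_000, 2: 110_000, 3: 95_000}))
# ('group-wide', [0, 1, 2, 3])
```

The classification only narrows the hypothesis list. Processing time, fetch latency, and rebalance frequency still decide which explanation holds.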

The mitigation should match the evidence. Adding consumers will not fix a topic where one hot partition owns the backlog. Increasing max.poll.records may help a CPU-light consumer but hurt a saturated sink. Restarting a group may clear a stuck state, but it can also trigger another rebalance.
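As a hedged illustration of matching mitigation to evidence, the check below asks whether adding consumers can plausibly help. It assumes a classic consumer group, where each partition is assigned to at most one group member, so members beyond the partition count sit idle. The function name and the `hot_share` cutoff of 0.8 are hypothetical.

```python
def adding_consumers_helps(lag_by_partition, consumer_count, hot_share=0.8):
    """Return (helps, reason) for scaling out a classic consumer group."""
    total = sum(lag_by_partition.values())
    if total == 0:
        return False, "no backlog"
    if consumer_count >= len(lag_by_partition):
        # Extra members would be idle: one partition maps to at most one member.
        return False, "every partition already has its own consumer"
    if max(lag_by_partition.values()) / total >= hot_share:
        return False, "one partition owns most of the backlog"
    return True, "backlog is spread across partitions that share consumers"


print(adding_consumers_helps({0: 100, 1: 120, 2: 95_000}, consumer_count=2))
# (False, 'one partition owns most of the backlog')
print(adding_consumers_helps({0: 30_000, 1: 28_000, 2: 31_000, 3: 29_000}, consumer_count=2))
# (True, 'backlog is spread across partitions that share consumers')
```

A `True` result is necessary, not sufficient: if the sink is saturated, more consumers only move the queue downstream.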

Producer, Broker, and Controller Runbooks Need Different Evidence

Producer-side incidents require a different mental model. When a producer reports timeouts, the team needs to know whether records are failing before they reach the broker, waiting in the request queue, blocked by replication behavior, throttled by quota, or delayed by network path. Evidence should include client error type, request latency, request handler saturation, network metrics, and ISR or leader election changes. The rollback path might revert a batch configuration or pause a new deployment.
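A first pass over client error types can be sketched as a lookup from exception name to the hypothesis worth testing first. The exception names below are classes from the Java client's `org.apache.kafka.common.errors` package; the mapping itself is an assumption for illustration, not an authoritative diagnosis table, and quota throttling usually appears in throttle-time metrics instead of as an exception.

```python
# Hypothetical first-guess mapping; confirm each guess against broker-side evidence.
FIRST_HYPOTHESIS = {
    "TimeoutException": "records expired in the client buffer or the request timed out",
    "NotEnoughReplicasException": "ISR shrank below min.insync.replicas",
    "NotLeaderOrFollowerException": "leader election or stale client metadata",
    "NetworkException": "network path or broker connection loss",
    "RecordTooLargeException": "producer rollout changed message size",
}


def triage(error_counts):
    """Return (error, count, first hypothesis) tuples, most frequent error first."""
    ranked = sorted(error_counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, count, FIRST_HYPOTHESIS.get(name, "unmapped: read client logs"))
        for name, count in ranked
    ]


# Illustrative counts, as might be aggregated from producer logs.
for row in triage({"NetworkException": 3, "NotEnoughReplicasException": 40}):
    print(row)
```

Ranking by frequency keeps the responder from chasing a rare error while the dominant one points at replication or leadership changes.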

Broker incidents are more physical. Disk, page cache, CPU, network, and partition placement shape what is possible. In traditional Kafka, broker-local disks are where durable partitions live, so storage pressure is tied to placement, recovery, and data movement. Expanding disk may buy time, but it does not explain why the broker became a hotspot.

Controller and metadata incidents sit in another layer. Apache Kafka's KRaft mode removes the dependency on ZooKeeper, but metadata scale still matters: topic churn, partition counts, controller quorum health, broker registration, and metadata request rates. A controller runbook should favor stabilization: stop automation loops, freeze topic creation, and avoid broad tuning during the incident.

Storage Architecture Changes the Runbook

Many Kafka troubleshooting runbooks assume a shared-nothing cluster where each broker owns local partition data and brokers replicate data to each other. That model is familiar and proven, but it shapes incident response: broker failure recovery involves leadership movement and replica catch-up, capacity changes involve partition reassignment, and retention growth consumes local disk.

Figure: Stateful broker storage versus shared-storage Kafka-compatible architecture

Tiered storage changes part of this story by placing older log segments in remote storage, but the hot write path and broker ownership model remain important. Shared-storage Kafka-compatible architectures go further by separating broker compute from durable data placement. Brokers can behave more like stateless compute nodes, while a write-ahead log layer and object storage provide durability. Teams still need observability, quotas, client discipline, and rollback, but a broker incident is less tied to emergency partition movement or recovery from broker-attached disks.

This is where AutoMQ fits naturally into the decision framework. AutoMQ is a Kafka-compatible streaming platform that redesigns Kafka storage around shared storage and stateless brokers while preserving Kafka protocol and ecosystem compatibility. In BYOC-style deployments, it can run inside the customer's cloud boundary. For troubleshooting, the question is whether the architecture removes recurring incident classes: broker-local disk pressure, slow data rebalancing, capacity over-reservation, and cross-zone paths that are hard to reason about under stress.

Build Runbooks as Evidence Pipelines

Platform teams can make runbooks more reliable by treating them like evidence pipelines. Each step should reduce uncertainty: classify the symptom, define scope, list hypotheses, collect evidence, choose a reversible mitigation, and set the rollback trigger. Add one post-incident field: did the incident require data movement, reserved capacity, manual balancing, or cross-zone traffic that should be redesigned? That field turns operations into platform engineering.
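The steps above can be sketched as an incident record that refuses to skip ahead, so a mitigation cannot be logged before evidence exists. The step names mirror the paragraph; the `IncidentRecord` class and the sample notes are assumptions for illustration.

```python
from dataclasses import dataclass, field

STEPS = [
    "classify_symptom",
    "define_scope",
    "list_hypotheses",
    "collect_evidence",
    "choose_reversible_mitigation",
    "set_rollback_trigger",
]


@dataclass
class IncidentRecord:
    notes: dict = field(default_factory=dict)
    # Post-incident field: data movement, reserved capacity, manual balancing, cross-zone traffic.
    architecture_note: str = ""

    def next_step(self):
        """Return the first step without a note, or None when all steps are done."""
        for step in STEPS:
            if step not in self.notes:
                return step
        return None

    def record(self, step, note):
        """Store a note for the next expected step; reject out-of-order steps."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        self.notes[step] = note


incident = IncidentRecord()
incident.record("classify_symptom", "consumer lag on a hypothetical orders group")
incident.record("define_scope", "one partition is behind, the rest are healthy")
print(incident.next_step())  # list_hypotheses
```

Enforcing order is a design choice. A team that finds it too rigid could log out-of-order steps with a warning instead of an exception.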

Governance Belongs in the Runbook

Kafka runbooks fail when they stop at brokers. Schema changes, topic creation, retention overrides, ACL changes, consumer group ownership, and stream processing deployments all create production risk. A platform team that owns Kafka as a service needs governance checks inside the troubleshooting process, not after it.

For schema and serialization incidents, the runbook should identify the producing application, schema version, deployment event, affected offsets, and consumer behavior. Mitigation may mean stopping the producer, quarantining records, deploying a compatibility fix, or replaying from a known offset. For stream processing backpressure, include the processor and sink; Kafka may be healthy while the state store, database, object store, or external API is the bottleneck.
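Quarantining records can be sketched as splitting a batch into decodable records and poison offsets, so the offsets are known before any replay. This example assumes JSON payloads purely for illustration; a production topic more likely uses Avro or Protobuf with a schema registry, and the `split_poison` helper is hypothetical.

```python
import json


def split_poison(records):
    """Split (offset, raw_bytes) pairs into decoded records and quarantined offsets."""
    decoded, quarantined = [], []
    for offset, raw in records:
        try:
            decoded.append((offset, json.loads(raw)))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            # Keep the offset and error type so the replay range can be chosen later.
            quarantined.append({"offset": offset, "error": type(exc).__name__})
    return decoded, quarantined


# Illustrative batch: the record at offset 101 is truncated.
batch = [
    (100, b'{"order_id": 1}'),
    (101, b'{"order_id": '),
    (102, b'{"order_id": 3}'),
]
ok, bad = split_poison(batch)
print([offset for offset, _ in ok])  # [100, 102]
print(bad)  # [{'offset': 101, 'error': 'JSONDecodeError'}]
```

Recording the offset with the failure gives the runbook its "affected offsets" field without anyone having to reconstruct it from consumer logs.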

Figure: Kafka production readiness checklist for troubleshooting runbooks

This is also where platform ownership becomes visible. A runbook should name who can pause a topic, change retention, approve a schema rollback, shift traffic across zones, and declare that a cluster has crossed from "tune it" to "redesign it."

When to Tune, When to Redesign

Not every Kafka incident justifies a new architecture. Many problems are better solved by disciplined client configuration, better partition keys, clearer topic ownership, retention cleanup, deploy gates, or more precise alerts. A good runbook protects teams from overreacting and from normalizing pain.

Use the following decision table after repeated incidents:

| Pattern after multiple incidents | Tune the current platform | Redesign the operating model |
|---|---|---|
| One application causes lag after a deploy | Add deploy gates, canaries, and consumer ownership | Redesign only if many applications repeat the same pattern |
| One topic has hot partitions | Fix partition key, producer routing, or workload shape | Redesign if the workload cannot be partitioned cleanly and recovery windows are unacceptable |
| Storage alerts recur across brokers | Improve retention and balancing policy | Evaluate shared-storage architecture if local disk and data movement dominate incidents |
| Broker replacement takes too long | Improve automation and replacement procedure | Evaluate stateless brokers if recovery remains tied to local partition data |
| Cross-zone traffic is hard to explain | Audit rack awareness and client placement | Revisit architecture if durability or fan-out creates persistent zone traffic |
| Cost anomalies follow retention or fan-out growth | Add cost alerts and ownership tags | Revisit storage model if long retention and heavy reads are strategic requirements |

AutoMQ should be evaluated in the redesign column when the recurring pain is architectural rather than procedural. Its Kafka-compatible API keeps producers, consumers, and ecosystem tools on familiar interfaces. Its shared-storage design changes the broker failure and scaling model, while BYOC is relevant when the platform team wants customer-controlled cloud boundaries. None of that removes the need for runbooks. It changes what the runbooks spend their time doing.

FAQ

What should a Kafka troubleshooting runbook include?

A production Kafka troubleshooting runbook should include the symptom, scope, competing hypotheses, evidence to collect, mitigation options, rollback trigger, and post-incident architecture note. The architecture note is important because repeated incidents often point to storage, scaling, governance, or network design problems.

How do I troubleshoot Kafka consumer lag?

Start by separating group-wide lag from partition-level lag. Then compare consumer processing time, broker fetch latency, rebalance frequency, downstream sink health, and recent deployments. Avoid adding consumers until you know whether the bottleneck is parallelism, hot partitions, broker reads, or downstream backpressure.

Why does broker-local storage make Kafka incidents harder?

Broker-local storage ties durable data to specific broker machines. During failures or capacity changes, teams often need replica catch-up, partition movement, disk expansion, or rebalancing. Those actions can consume network and storage bandwidth during the incident, which makes mitigation more complex.

Does a shared-storage Kafka-compatible platform remove the need for runbooks?

No. Teams still need observability, client governance, schema controls, rollback procedures, and incident ownership. Shared storage changes the operating model by reducing the amount of incident response tied to broker-local data movement and disk capacity management.

When should a team evaluate AutoMQ?

Evaluate AutoMQ when repeated Kafka incidents are caused by architecture constraints rather than isolated misconfiguration: broker-local disk pressure, slow broker replacement, expensive cross-zone replication paths, capacity over-reservation, or data movement during scaling. Start with the AutoMQ GitHub repository, then compare the architecture against your own incident history.
