Kafka on Kubernetes 2026 | Stateless Brokers Guide

May 2, 2026
AutoMQ Team
9 min read

Should you run Apache Kafka on Kubernetes in 2026? Yes, if you want standard scheduling, declarative operations, and a common platform model for infrastructure. But traditional Kafka remains a stateful system. Putting it inside Kubernetes does not make broker-local disks, partition replicas, and data rebalancing behave like a stateless microservice.

That is the tension behind most Kafka Kubernetes projects. Kubernetes is good at replacing pods. Kafka brokers are not ordinary pods; they own partition logs, participate in ISR replication, carry broker identity, and often bind to PersistentVolumes. The result is a platform that looks cloud native from the outside, while still carrying a local-disk storage model inside.

Kafka on Kubernetes pain matrix

Why Traditional Kafka Feels Heavy on Kubernetes

Kafka was designed around a Shared Nothing architecture. Each broker owns local log segments for the partitions assigned to it. Replication happens between brokers, and each replica has to catch up before it can safely participate in the in-sync replica set. This model is mature, battle-tested, and still the foundation of many large Kafka deployments.

The mismatch appears when the unit Kubernetes wants to manage is a pod, while the unit Kafka cares about is a broker with durable partition state. A Deployment can replace a pod and let the application recover through external state. A traditional Kafka broker is the external state. If the broker disappears, the replacement process involves more than starting a container; it has to preserve or rebuild the broker's relationship with its data.

Kubernetes StatefulSet exists for exactly this class of workload. The Kubernetes documentation describes StatefulSet pods as having a stable identity: stable, unique network identifiers and stable, persistent storage. That is useful, but it is also the clue. Kafka on Kubernetes works by asking Kubernetes to respect Kafka's statefulness, not by removing that statefulness.

The problem is not Kubernetes. The problem is asking Kubernetes to manage a storage architecture that was designed around local disks.

Kafka Kubernetes Pain Points: Where the Complexity Shows Up

The first pain point is broker identity. In a typical Kafka Kubernetes deployment, broker identity maps naturally to StatefulSet ordinals: kafka-0, kafka-1, kafka-2, and so on. That gives every broker a stable name and a stable storage attachment. It also means the pods are not interchangeable. If kafka-1 is unhealthy, replacing it has a different operational meaning from replacing a web server pod behind a Service.
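To make that identity mapping concrete, here is a minimal sketch of what a StatefulSet ordinal actually pins down for one broker. The service, namespace, and claim names are illustrative assumptions, not tied to any particular operator:

```python
# Sketch: the stable names a Kafka broker inherits from a StatefulSet ordinal.
# Service/claim naming below is an assumption for illustration.

def broker_identity(ordinal: int,
                    statefulset: str = "kafka",
                    headless_service: str = "kafka-headless",
                    namespace: str = "default") -> dict:
    """Derive the stable identifiers Kubernetes guarantees for one broker pod."""
    pod = f"{statefulset}-{ordinal}"
    return {
        "pod": pod,                 # stable pod name, e.g. kafka-1
        "broker_id": ordinal,       # Kafka broker id conventionally mapped from the ordinal
        "dns": f"{pod}.{headless_service}.{namespace}.svc.cluster.local",
        "pvc": f"data-{pod}",       # the claim follows the pod identity, not the node
    }

for b in (broker_identity(i) for i in range(3)):
    print(b["pod"], "->", b["dns"], "/", b["pvc"])
```

The point of the sketch is the coupling: the pod name, the DNS name, and the volume claim all hang off the same ordinal, which is exactly why kafka-1 cannot be treated as interchangeable with kafka-2.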

PersistentVolumes make the next layer visible. Teams have to think about storage classes, volume expansion, zonal disks, reclaim policies, node affinity, backup policy, and what happens when a volume is slow or unhealthy. None of this is Kubernetes being difficult. It is Kubernetes exposing the fact that partition data still lives next to the broker.

Scaling is where the difference becomes operationally expensive. Adding brokers creates more compute capacity, but it does not automatically move existing partition replicas to the additional brokers. Strimzi's documentation states this directly: after adding a broker, partitions must be reassigned before that broker shares existing load. Confluent for Kubernetes describes the same high-level sequence: provision broker resources, start brokers, then rebalance partitions so the expanded broker set carries part of the workload.
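The reassignment step can be sketched with a deliberately naive planner. Real tools (kafka-reassign-partitions.sh, Cruise Control) also weigh rack placement, leadership, and throttles; this toy version only shows why a new broker stays idle until a plan physically moves replicas onto it:

```python
# Naive round-robin reassignment planner, for illustration only.

def reassignment_plan(assignment: dict, brokers: list) -> dict:
    """Spread each partition's replica set round-robin over the new broker list."""
    plan = {}
    n = len(brokers)
    for i, (partition, replicas) in enumerate(sorted(assignment.items())):
        rf = len(replicas)
        plan[partition] = [brokers[(i + j) % n] for j in range(rf)]
    return plan

# 4 partitions, RF=2, originally on brokers 0-2; broker 3 is newly added.
current = {"t-0": [0, 1], "t-1": [1, 2], "t-2": [2, 0], "t-3": [0, 2]}
plan = reassignment_plan(current, [0, 1, 2, 3])

# Each replica that lands on a broker it did not previously live on must be
# copied over the network before broker 3 shares any existing load.
moved = sum(len(set(new) - set(current[p])) for p, new in plan.items())
print(plan)
print("replicas that must physically move:", moved)
```

Even in this tiny cluster, the plan is only the cheap part; every moved replica means copying log segments and waiting for catch-up.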

Scale-in is harder because it has a safety precondition. A broker that still hosts partition replicas cannot be removed like a stateless replica. The replicas have to move first, leadership has to settle, throttles may be needed, and operators have to monitor whether the reassignment is complete. With large topics, hot partitions, or limited disk and network headroom, this becomes the part of the operation that determines the maintenance window.
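The safety precondition can be sketched the same way: before a broker is removed, every replica it hosts must be handed to a survivor. The least-loaded-target policy below is an illustrative assumption, not how any particular rebalancer chooses destinations:

```python
# Sketch of the scale-in precondition: a broker is removable only once it
# hosts zero replicas. Target selection here is illustrative.

def drain_broker(assignment: dict, departing: int) -> dict:
    """Plan moves that empty the departing broker onto the least-loaded survivors."""
    survivors = {b for reps in assignment.values() for b in reps} - {departing}
    load = {b: sum(b in reps for reps in assignment.values()) for b in survivors}
    plan = {}
    for partition, replicas in sorted(assignment.items()):
        if departing not in replicas:
            continue  # nothing to move for this partition
        candidates = [b for b in survivors if b not in replicas]
        target = min(candidates, key=lambda b: load[b])
        load[target] += 1
        plan[partition] = [target if b == departing else b for b in replicas]
    return plan  # removal is safe only after all these moves complete

current = {"t-0": [0, 1], "t-1": [1, 2], "t-2": [2, 0]}
plan = drain_broker(current, departing=2)
print(plan)
assert all(2 not in reps for reps in plan.values())
```

The final assertion is the operational gate: until it holds in the real cluster (reassignment finished, replicas in sync), the broker cannot leave.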

| Kubernetes operation | Why traditional Kafka adds extra work |
| --- | --- |
| Replace a pod | Broker identity and attached storage must remain consistent. |
| Expand storage | PVC, storage class, and disk layout become part of Kafka operations. |
| Scale out brokers | Existing partitions must be redistributed before load improves. |
| Scale in brokers | Replicas must move away before a broker can be removed safely. |
| Recover from failure | Recovery involves ISR, replica state, disk health, and controller metadata. |

Operators automate much of this. That automation matters. It reduces hand-built scripts, standardizes rolling upgrades, manages certificates, coordinates Kafka resources, and integrates with tools such as Cruise Control or self-balancing features. But automation does not change what has to happen underneath. If the broker owns local partition data, scaling still means moving or rebalancing data ownership.

Operators Help, But They Do Not Change the Architecture

Kafka operators are often presented as the answer to Kafka on Kubernetes. That is true at the workflow layer. A good operator turns a fragile sequence of manual steps into a repeatable declarative process. Platform teams should value that, especially when they manage many clusters across environments.

The architectural limit is more important than the operator interface. Operators automate the playbook; they do not remove the stateful constraints. StatefulSet still gives each broker stable identity. PersistentVolumeClaims still represent durable broker-local storage. Partition reassignment still exists because data placement still matters.

This distinction is easy to miss because the YAML looks deceptively simple. Increasing replicas from 6 to 9 may be a one-line change. The system work behind that line includes broker startup, metadata updates, replica movement, catch-up, leader balance, throttling, and verification. Kubernetes accepts the request quickly; Kafka still has to make the storage layout safe.
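A back-of-envelope model makes the gap between the one-line change and the system work visible. The data size and throttle below are assumptions chosen only to show the shape of the cost, not measurements from any cluster:

```python
# Rough cost of scaling a broker-local-storage cluster from 6 to 9 brokers.
# total_tb and throttle_mb_s are illustrative assumptions.

def rebalance_estimate(total_tb: float, old: int, new: int,
                       throttle_mb_s: float) -> tuple:
    """Bytes that must move for an even per-broker share, and hours at the throttle."""
    moved_tb = total_tb * (new - old) / new   # share the new brokers must receive
    hours = moved_tb * 1e6 / throttle_mb_s / 3600
    return moved_tb, hours

moved, hours = rebalance_estimate(total_tb=30, old=6, new=9, throttle_mb_s=100)
print(f"~{moved:.0f} TB to move, ~{hours:.0f} h at 100 MB/s aggregate")
```

With 30 TB of retained data, the "one-line" scale-out implies moving about a third of the cluster's bytes, which is why the rebalance, not the pod startup, defines the timeline.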

That is why many Kafka Kubernetes operational guides eventually converge on the same advice: use StatefulSets carefully, choose storage classes deliberately, configure rack awareness, protect brokers with PodDisruptionBudgets, and use a rebalancing tool. These are sound practices. They make stateful Kafka safer on Kubernetes, but they do not make it stateless.

Stateless Brokers: The Architecture Kubernetes Was Waiting For

A stateless broker model changes the premise. The broker is still responsible for Kafka protocol handling, request processing, partition leadership, and coordination with the rest of the cluster. What changes is the location of hard state. Durable partition data no longer lives as local broker-owned disk. It is moved into a shared storage layer such as object storage, while the broker becomes a replaceable compute process.

That shift matters because Kubernetes is fundamentally better at managing replaceable compute than managing application-level data placement. If a broker can be rescheduled without dragging its persistent partition logs with it, the operational model starts to look much closer to the rest of the platform. Replacing a failed broker becomes a compute recovery path. Scaling out adds processing capacity without requiring terabytes of historical data to migrate first. Scaling in is no longer blocked by the same broker-local ownership model.
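The contrast can be sketched in a few lines. In a shared-storage model, reassigning a partition rewrites ownership metadata while the log bytes stay put; the record shape and names below are illustrative, not any system's actual metadata format:

```python
# Sketch: partition reassignment when durable data lives in shared storage.
# Only the ownership metadata changes; the log is not copied.

partition_meta = {"topic": "orders", "partition": 7,
                  "owner": "broker-2", "log_uri": "s3://bucket/orders-7/"}

def reassign_stateless(meta: dict, new_owner: str) -> dict:
    """Hand the partition to another broker by rewriting metadata; no data path."""
    return {**meta, "owner": new_owner}  # log_uri is untouched

moved = reassign_stateless(partition_meta, "broker-5")
print(moved["owner"], "now serves", moved["log_uri"])
```

Compare this with the stateful case, where the same operation means copying every log segment the partition has retained before the new owner can serve it.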

The right way to describe this is not that the system becomes effortless. Kafka still needs metadata, leadership, client routing, quotas, security, observability, and careful operations. The difference is that the hardest state moves out of the broker layer. Once that happens, Kubernetes can do what it is good at: schedule, replace, roll, and scale compute.

This is also why "Kafka on Kubernetes" and "Kafka as a Kubernetes-native system" are not the same thing. The first is a deployment target. The second is an architecture.

AutoMQ: Kafka on Kubernetes with Stateless Brokers

AutoMQ approaches the problem by preserving Kafka's protocol and compute behavior while replacing the storage layer with a diskless design built on a Shared Storage architecture. Its documentation describes broker nodes as stateless because Kafka storage is offloaded through S3Stream to cloud storage. In practice, that means broker-local persistent data is removed from the center of the operational model.

For Kubernetes teams, the important part is not the product label. It is the changed failure and scaling behavior:

  • Broker data is not bound to a local PersistentVolume in the same way traditional Kafka data is.
  • Compute and storage can scale independently, so adding brokers is not primarily a data-copy operation.
  • Partition movement becomes a metadata-light operation compared with moving broker-local log data.
  • A failed broker can be replaced while durable data remains in shared storage.
  • Multi-Availability Zone (multi-AZ) deployments can reduce broker-to-broker replication traffic because durability is delegated to the storage layer rather than recreated by every broker replica.
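The last bullet can be put in rough numbers. The model below counts only broker-to-broker replication bytes leaving the leader's AZ under rack-aware placement; storage-side traffic and pricing are out of scope, and the figures are illustrative:

```python
# Rough model of cross-AZ broker replication traffic. With RF=3 spread across
# three AZs, each produced byte crosses an AZ boundary twice on the replication
# path; when durability is delegated to the storage layer, brokers stop copying
# to each other (modeled here as rf=1). Storage-side transfer is not modeled.

def cross_az_replication_bytes(produced_gb: float, rf: int, azs: int) -> float:
    """Broker-to-broker bytes leaving the leader's AZ under rack-aware placement."""
    followers_in_other_azs = min(rf - 1, azs - 1)
    return produced_gb * followers_in_other_azs

print(cross_az_replication_bytes(1000, rf=3, azs=3))  # stateful brokers
print(cross_az_replication_bytes(1000, rf=1, azs=3))  # durability delegated to storage
```

For 1 TB produced, the broker layer alone generates 2 TB of cross-AZ replication in the traditional model and none when the storage layer provides durability, which is the traffic the bullet refers to.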

JD.com's public AutoMQ case study shows why this matters at production scale. JDQ, JD.com's Kafka-based real-time data platform, used Kubernetes and StatefulSet-based orchestration, but the team still faced the limitations of Kafka's Shared Nothing architecture on Kubernetes. Their write-up describes scale-out as a process that required partition reassignment planning, impact evaluation, off-peak execution, and post-reassignment verification. After adopting AutoMQ's Shared Storage architecture on CubeFS, JD.com reported that Kubernetes scaling efficiency improved from hours to minutes and that partition reassignment became a metadata-oriented operation.

That case is useful because it does not frame Kubernetes as the problem. JD.com was already using Kubernetes. The friction came from running a broker-local storage architecture inside it. AutoMQ changed the storage premise, which allowed the Kubernetes platform to act more like a true elastic compute substrate.

Kafka on Kubernetes Needs a Stateless Foundation

Kafka on Kubernetes is no longer unusual. The more relevant question is what kind of Kafka architecture you are asking Kubernetes to operate. If each broker owns local partition data, the platform has to respect that ownership through StatefulSets, PersistentVolumes, rebalancing workflows, and careful disruption control. Those are valid engineering tools, but they are not a cure for stateful broker design.

Stateless brokers change the operating model by moving durable data out of the broker layer. That does not make Kafka trivial, and it does not remove the need for expertise. It does make the system align more closely with the way Kubernetes already manages the rest of your infrastructure: replace compute, preserve durable state elsewhere, and scale capacity without treating every broker change as a data migration project.

Running Kafka on Kubernetes is a deployment decision. Running Kafka with stateless brokers is an architecture decision. The second one is what makes the first one feel natural. To evaluate that path, start with an AutoMQ deployment assessment or review the Stateless Broker architecture.
