KRaft Era Kafka: What Changes for Cloud-Native Platforms

Kafka platform teams are asking what KRaft means for cloud-native Kafka because the move to KRaft raises a bigger question than "How do we remove ZooKeeper?" KRaft changes the control plane: Kafka metadata is managed by a controller quorum instead of an external ZooKeeper ensemble. That is a major modernization step, especially for teams that operate Kafka as a long-lived platform rather than a one-off cluster. But it does not automatically make a Kafka deployment cloud-native in the way architects usually mean that term.

Cloud-native Kafka has a harder bar. It has to run across availability zones, scale when traffic changes, recover from broker failure, preserve Kafka protocol semantics, control infrastructure cost, and fit enterprise governance. KRaft helps with one important part of that picture: metadata management. The rest still depends on how the data plane stores logs, moves partitions, handles retention, and exposes operational boundaries to the teams that own production risk.

That distinction matters because a bad KRaft migration plan can look clean on paper while leaving the old operating model intact. You remove ZooKeeper, upgrade brokers, and declare success. Then the next traffic spike still requires overprovisioned brokers, the next broker replacement still depends on local disk recovery, and the next retention increase still turns into a storage planning exercise. KRaft is necessary progress, not the whole architecture decision.

Figure: KRaft Kafka cloud-native decision framework

Why KRaft and Cloud-Native Kafka Matter Now

KRaft became the default strategic direction for Kafka because managing cluster metadata inside Kafka removes an external dependency that every large deployment had to operate. In KRaft mode, Kafka servers can act as brokers, controllers, or both, and dedicated controllers participate in a metadata quorum. Apache Kafka documentation recommends separating broker and controller roles for critical environments, because controller isolation allows operators to roll or scale the control plane independently from the data plane.

That change is bigger than a dependency cleanup. It makes Kafka's control plane more aligned with how platform teams already think about Kubernetes, cloud instances, and managed infrastructure: small quorums for metadata, separate pools for serving data, explicit bootstrap configuration, and operational tooling for quorum status. A platform team no longer has to maintain ZooKeeper as a separate distributed system with separate failure modes.
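As a rough illustration of the explicit role and quorum settings described above, the sketch below renders minimal property fragments for a dedicated controller and a broker. The property names (`process.roles`, `node.id`, `controller.quorum.voters`, `controller.listener.names`) are standard KRaft settings; the hostnames, node IDs, port, and helper functions are hypothetical. It uses the static `controller.quorum.voters` form, and newer Kafka releases also offer `controller.quorum.bootstrap.servers`, so check the documentation for your version. The fragments are deliberately incomplete: listeners, log directories, and security settings are omitted.

```python
# Hypothetical controller inventory: (node_id, host, port). Names are illustrative only.
CONTROLLERS = [
    (1, "ctrl-1.example.internal", 9093),
    (2, "ctrl-2.example.internal", 9093),
    (3, "ctrl-3.example.internal", 9093),
]


def quorum_voters(controllers):
    """Format the controller.quorum.voters value as id@host:port,..."""
    return ",".join(f"{node_id}@{host}:{port}" for node_id, host, port in controllers)


def render_properties(role, node_id, controllers):
    """Render a minimal, incomplete server.properties fragment for one node.

    role is 'controller' or 'broker' (dedicated roles, no combined mode).
    """
    if role not in ("controller", "broker"):
        raise ValueError("role must be 'controller' or 'broker'")
    props = {
        "process.roles": role,
        "node.id": str(node_id),
        "controller.quorum.voters": quorum_voters(controllers),
        "controller.listener.names": "CONTROLLER",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    print(render_properties("controller", 1, CONTROLLERS))
    print()
    print(render_properties("broker", 4, CONTROLLERS))
```

Keeping the controller inventory in one place makes it harder for brokers and controllers to drift apart on who the quorum members are.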

The catch is that many cloud pain points live below the control plane. Kafka still writes records to partition logs. Brokers still serve producers and consumers. Retention still consumes storage. Replication still determines durability and availability. Partition placement still decides how much data moves when the cluster changes. KRaft improves how Kafka coordinates those decisions; it does not erase the physical cost of the decisions themselves.

This is where the question becomes practical. Teams are not only asking whether KRaft works. They are asking whether a KRaft-based Kafka platform can behave like the rest of their cloud-native estate.

The old shared-nothing Kafka model is easy to understand: each broker owns local storage, partitions are assigned to brokers, and replicas are placed on other brokers for durability and availability. That design has served Kafka well for years. It also explains why cloud deployments can become expensive and operationally heavy. When storage and compute are tied to a broker, a capacity problem in either dimension often becomes a broker problem.

A KRaft-era platform has to answer several production questions before it can be called cloud-native:

  • Control plane resilience: How many controllers will run, where will they be placed, and what failure budget does the metadata quorum tolerate?
  • Data plane elasticity: Can brokers be added and removed without moving large volumes of partition data during normal scaling events?
  • Storage economics: Does retention scale with broker-attached disks, block storage, remote log storage, or object storage?
  • Governance boundary: Who owns the data plane, encryption keys, network access, audit trails, and cloud account permissions?
  • Migration safety: Can workloads move without client rewrites, offset loss, or unclear rollback points?

The first question is where KRaft shines. The remaining questions are architecture choices around the data plane. They are also the questions that usually determine whether the platform feels cloud-native to SREs and application teams.

KRaft Changes Metadata, Not Storage Physics

KRaft introduces a cleaner way for Kafka to manage cluster metadata. Controllers maintain the metadata log and form a quorum. Brokers and controllers discover the active controller through quorum bootstrap configuration. For production environments, Apache Kafka's KRaft operations documentation calls out dedicated controller roles, quorum sizing, and majority availability. With 3 controllers, the controller cluster can tolerate 1 controller failure; with 5 controllers, it can tolerate 2.

Those numbers are useful because they make control-plane failure planning explicit. They also show why KRaft is not a storage architecture. The controller quorum protects metadata availability. It does not decide whether topic data lives on local NVMe, cloud block storage, a remote tier, or object storage. It does not remove the need to size brokers for traffic, page cache, network, disk throughput, or failover headroom.
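The failure budgets above follow from majority voting: a quorum stays available while more than half of its voters are reachable. A minimal sketch of that arithmetic, with function names chosen here for illustration:

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    if voters < 1:
        raise ValueError("a quorum needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can fail while a majority is still available."""
    return voters - majority(voters)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, majority(n), tolerated_failures(n))
    # prints:
    # 3 2 1
    # 4 3 1
    # 5 3 2
```

Note that 4 controllers tolerate no more failures than 3, which is why odd quorum sizes are the usual choice.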

Kafka Tiered Storage is a separate feature that extends the storage story. Apache Kafka documentation describes tiered storage as a 2-tier model: local storage remains on brokers for active log segments, while completed segments can be moved to remote storage such as S3 or HDFS. That is valuable for long retention and historical reads. It also has operational details: broker-level enablement, topic-level configuration, remote storage manager implementation, local retention settings, and documented limitations.
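To make the 2-tier split concrete, here is a back-of-the-envelope sizing sketch. It assumes steady write throughput, that broker disks hold the local retention window for every replica, and that remote storage holds roughly one copy of the full retention window (ignoring compression, segment roll timing, and the object store's own redundancy). The function name and the example workload numbers are hypothetical.

```python
def tiered_storage_estimate(write_mb_per_s, total_retention_hours,
                            local_retention_hours, replication_factor):
    """Rough storage estimate in GB (1 GB = 1000 MB) under the stated assumptions."""
    if local_retention_hours > total_retention_hours:
        raise ValueError("local retention cannot exceed total retention")
    mb_per_hour = write_mb_per_s * 3600
    # Broker disks: every replica keeps the local retention window.
    local_gb = mb_per_hour * local_retention_hours * replication_factor / 1000
    # Remote tier: assumed to hold about one copy of the full retention window.
    remote_gb = mb_per_hour * total_retention_hours / 1000
    # Baseline without tiering: every replica keeps the full retention window.
    untiered_gb = mb_per_hour * total_retention_hours * replication_factor / 1000
    return {"local_gb": local_gb, "remote_gb": remote_gb, "untiered_gb": untiered_gb}


if __name__ == "__main__":
    # Hypothetical workload: 50 MB/s writes, 7 days total, 6 hours local, RF 3.
    print(tiered_storage_estimate(50, 168, 6, 3))
```

In this example the broker-attached footprint drops sharply, but it does not reach zero: the active window still lives on broker disks, which is the point of the paragraph above.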

The practical conclusion is simple: KRaft modernizes coordination; tiered storage extends retention; neither one automatically creates stateless brokers. If cloud-native behavior is the goal, platform teams need to evaluate the control plane and the data plane separately.

Figure: Stateful brokers vs stateless brokers

Architecture Patterns Teams Usually Compare

Most KRaft-era decisions fall into 4 architecture patterns. The right answer depends on workload shape, risk tolerance, and the amount of operational ownership the team wants to keep.

| Pattern | What changes | What stays hard |
| --- | --- | --- |
| Self-managed Kafka with KRaft | ZooKeeper is removed, and Kafka metadata moves into a controller quorum. | Broker disks, partition reassignment, cross-zone replication, upgrades, and capacity planning remain the team's responsibility. |
| Managed Kafka with KRaft | The provider operates the control plane and much of the broker lifecycle. | Cost model, data-plane boundary, feature availability, and migration constraints depend on the service. |
| Kafka with tiered storage | Older log segments can move to remote storage, reducing pressure from long retention. | Active segments and broker-local behavior still matter; remote storage implementation and limitations must be understood. |
| Shared-storage Kafka-compatible systems | Brokers are designed around external shared storage, often making the compute layer more elastic. | Teams must validate Kafka compatibility, latency profile, deployment model, and operational tooling. |

This table is intentionally not a vendor ranking. It is a way to avoid mixing unrelated decisions. A team may only need KRaft migration because its current Kafka estate is stable and cost-efficient. Another team may need a managed service because staffing is the bottleneck. A third team may have outgrown broker-attached storage because retention, replay, and cross-zone movement dominate the bill.

The mistake is treating KRaft as the deciding factor for all 3 cases. KRaft is an important baseline. The platform decision still turns on data movement, storage ownership, scaling behavior, and governance.

Evaluation Checklist For Platform Teams

A strong KRaft cloud-native review should start with workload physics, not product names. Write throughput, read fan-out, retention, partition count, peak-to-average ratio, multi-AZ placement, recovery objectives, and compliance boundaries will shape the answer more than the label on the service.

Figure: KRaft-era Kafka production readiness checklist

Use this checklist before committing to a path:

  • KRaft readiness: Confirm controller quorum sizing, controller isolation, bootstrap configuration, metadata log storage, monitoring, and upgrade path. Treat controller nodes as a first-class production tier.
  • Client compatibility: Inventory client versions, security protocols, idempotent producers, transactions, consumer groups, Kafka Connect, Kafka Streams, Schema Registry usage, and operational tooling.
  • Storage model: Separate active-log behavior from long-retention behavior. Tiered storage can help retention, but active segment writes and broker recovery still determine day-to-day operations.
  • Scaling model: Ask what happens when brokers are added, removed, or replaced. If scaling triggers large partition-data movement, the platform may remain stateful even after KRaft.
  • Cost model: Compare compute, storage, cross-zone traffic, remote storage requests, managed-service fees, support, and on-call effort. Avoid precise estimates until workload assumptions are explicit.
  • Governance boundary: Map where data resides, which cloud account owns storage, how encryption keys are managed, and how private networking is enforced.
  • Migration and rollback: Define source cluster, target cluster, topic mapping, offset handling, consumer cutover, producer cutover, validation, and rollback criteria before touching production traffic.

The checklist also exposes when a KRaft migration should be a narrow control-plane project. If the current platform has predictable traffic, acceptable retention cost, and mature operations, removing ZooKeeper may be enough for the next phase. Architecture replacement becomes relevant when the recurring pain is not ZooKeeper at all.
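For the "Scaling model" item in the checklist, a quick estimate of how much data a scale-out moves can ground the discussion. The sketch below assumes partition data is spread evenly before and after the change, that the reassignment moves only the minimum needed to rebalance, and that the throttle is an aggregate cluster-wide rate. The function name and example numbers are hypothetical.

```python
def rebalance_estimate(total_data_gb, brokers_before, brokers_added, throttle_mb_per_s):
    """Estimate data moved (GB) and duration (hours) when adding brokers.

    With an even spread, each new broker ends up holding 1/brokers_after of the
    data, so the moved fraction is brokers_added / brokers_after.
    """
    if brokers_before < 1 or brokers_added < 1 or throttle_mb_per_s <= 0:
        raise ValueError("broker counts must be positive and throttle must be > 0")
    brokers_after = brokers_before + brokers_added
    moved_gb = total_data_gb * brokers_added / brokers_after
    hours = moved_gb * 1000 / throttle_mb_per_s / 3600
    return moved_gb, hours


if __name__ == "__main__":
    # Hypothetical cluster: 90 TB on 6 brokers, adding 3, 300 MB/s aggregate throttle.
    moved_gb, hours = rebalance_estimate(90_000, 6, 3, 300)
    print(f"moved: {moved_gb:.0f} GB, duration: {hours:.1f} h")
```

If the answer is measured in many hours or days, the platform is still operating as a stateful system regardless of how its metadata is managed.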

Where AutoMQ Changes The Operating Model

Once the review separates metadata from storage, a different class of option becomes visible: Kafka-compatible systems built around shared storage. AutoMQ fits in this category. It keeps Kafka API compatibility as the application-facing contract, but redesigns the storage layer around object storage and stateless brokers so that compute can scale with less broker-local data movement.

That positioning matters. AutoMQ is not a reason to skip KRaft thinking, and it should not be evaluated as a generic "Kafka replacement" detached from workload requirements. It is most relevant when the team's pain comes from the shared-nothing data plane: long retention on broker-attached disks, slow partition reassignment, broker replacement tied to local recovery, cloud network charges from replication paths, or the need to keep the data plane inside the customer's cloud boundary.

In AutoMQ's architecture, object storage becomes the primary durable storage layer. The official AutoMQ documentation describes WAL (write-ahead log) storage as the component that provides low-latency durable writes and supports broker failover recovery for data not yet flushed to S3. The key cloud-native effect is that brokers no longer need to own durable partition logs in the traditional way.

That changes the operating model in several concrete ways:

  • Scaling becomes less data-bound: Adding or removing brokers does not require the same kind of large local-log migration because durable data is externalized.
  • Broker replacement becomes less dramatic: A failed broker can be treated more like replaceable compute when persistent data is not trapped on its local disk.
  • Storage and compute can be reasoned about separately: Retention growth does not have to map directly to larger broker disks.
  • Deployment control can remain customer-owned: In BYOC or self-managed patterns, the data plane can stay in the customer's cloud account and network boundary.

Those benefits still require validation. Latency-sensitive workloads should test the selected WAL option and storage backend. Teams should verify client compatibility, operational tooling, observability, security controls, and disaster recovery behavior against their own production assumptions. The point is not that every KRaft-era Kafka team needs shared storage. The point is that KRaft makes the control plane cleaner, while shared-storage designs address a different layer of the cloud-native problem.

Decision Table: Optimize, Migrate, Or Rethink Architecture

Architecture decisions get easier when the team names the actual constraint. The table below is a practical way to sort the next move.

| If this is the main problem | Likely next move | Why |
| --- | --- | --- |
| ZooKeeper operational burden | Migrate to KRaft on the current platform | The pain is control-plane complexity, not necessarily data-plane architecture. |
| Long retention cost | Evaluate tiered storage or shared-storage systems | Retention is a storage problem; KRaft alone will not change stored bytes. |
| Slow broker scaling or replacement | Compare stateful broker and stateless broker models | The question is how much data must move when compute changes. |
| Cloud bill dominated by cross-zone replication | Revisit placement, replication paths, and storage architecture | The expensive path is often created by how Kafka achieves durability and availability. |
| Strict data residency and procurement requirements | Prefer customer-owned data-plane options | Governance can matter as much as raw infrastructure cost. |
| Application migration risk | Prioritize Kafka protocol compatibility and rollback design | A lower-cost platform is not useful if clients, offsets, or tooling break during cutover. |

The KRaft era is a good time to do this review because every serious Kafka estate already has to reason about control-plane modernization. Use that moment to avoid a narrow upgrade mindset. A clean metadata quorum is valuable, but the cloud-native operating model is decided by the full platform: metadata, storage, compute, networking, governance, and migration safety.

If your team is evaluating whether a shared-storage Kafka-compatible architecture belongs in that review, AutoMQ's architecture documentation is a reasonable next stop. Treat it as one option in the decision table: useful when broker-local storage is the constraint, less relevant when a straightforward KRaft migration already solves the problem.

FAQ

Does KRaft make Kafka cloud-native by itself?

No. KRaft modernizes Kafka's metadata management by replacing ZooKeeper with a Kafka-native controller quorum. Cloud-native operation also depends on the data plane: storage model, broker scaling, retention behavior, networking, governance, and recovery.

Is Kafka Tiered Storage the same as a shared-storage Kafka architecture?

No. Tiered Storage moves completed log segments to remote storage while local broker storage remains part of the model for active data. A shared-storage architecture is designed so durable stream data lives outside the broker-local disk model, making brokers more stateless.

How many KRaft controllers should a production Kafka cluster run?

Apache Kafka documentation commonly discusses 3 or 5 controllers, depending on failure tolerance and cost. A 3-controller quorum can tolerate 1 controller failure, while 5 controllers can tolerate 2. Critical deployments should keep controller roles isolated from broker roles.

When should a team consider AutoMQ in a KRaft-era review?

Consider AutoMQ when the recurring platform pain is tied to broker-local storage: slow partition reassignment, expensive retention, cloud network amplification, broker replacement complexity, or a need for customer-owned deployment boundaries. If the only pain is ZooKeeper operations, a focused KRaft migration may be the cleaner first move.

What should be tested before migrating production workloads?

Test client compatibility, authentication, transactions, consumer group behavior, offset migration, Kafka Connect and Kafka Streams usage, observability, failure recovery, latency under representative load, and rollback. The migration plan should define success criteria before producer or consumer cutover begins.
