Moving from ZooKeeper to KRaft: Operational Checklist

Searching for "ZooKeeper to KRaft migration" usually means the team no longer treats Kafka metadata as a background detail. The question has moved from "how do we keep ZooKeeper healthy?" to "what should our production Kafka operating model look like after ZooKeeper leaves the picture?" That is a good question, but it is also broader than the migration procedure itself.

KRaft changes the metadata layer. It replaces the external ZooKeeper ensemble with Kafka's own Raft-based controller quorum, so broker metadata, controller leadership, and cluster state management move into the Kafka architecture. That can simplify a real operational burden. It does not automatically make brokers stateless, remove partition data movement, eliminate disk capacity planning, or validate every client and ecosystem dependency.

That distinction is where many migration plans become too narrow. A technically correct KRaft migration can still leave the platform team with the same recurring problems: storage and compute scale together, broker replacement requires careful data movement, retention growth turns into disk planning, and every major change depends on application-level validation. The checklist should protect the migration window, but it should also help the team decide whether the migration is the right moment to revisit the architecture underneath Kafka.

[Figure: ZooKeeper to KRaft migration decision framework]

Why ZooKeeper to KRaft migration matters now

ZooKeeper served Kafka well for years because Kafka needed a separate coordination system for cluster metadata, broker registration, controller election, and configuration changes. The cost was operational complexity. A production team had to operate two distributed systems, observe two failure domains, secure two access paths, and understand how ZooKeeper health affected Kafka behavior. KRaft removes that two-system operational model by moving metadata management into Kafka's own controller quorum.

That is the easy part to understand. The harder part is migration timing. A production Kafka estate rarely consists of only brokers and controllers. It includes producers, consumers, Kafka Streams applications, Connect workers, Schema Registry dependencies, ACLs, quotas, observability integrations, runbooks, backup expectations, and incident muscle memory. When the metadata plane changes, those dependencies need evidence, not assumptions.

Platform teams should separate three layers before choosing a path:

  • Metadata migration. Can the cluster move from ZooKeeper-based metadata to KRaft while preserving cluster state, topic configuration, security behavior, and controller stability?
  • Application compatibility. Can producers, consumers, admin clients, connectors, stream processors, and automation continue to behave as expected after the change?
  • Operating-model fit. Does the resulting Kafka platform still match the team's expectations for elasticity, cost, governance, rollback, and incident response?

The first layer is a migration project. The second is a platform validation project. The third is an architecture decision. Treating all three as one runbook makes the work look smaller than it is, which is comforting right up to the moment a critical consumer group behaves differently than expected.

KRaft removes ZooKeeper from Kafka's metadata path, but it does not erase the production constraints created by Kafka's Shared Nothing architecture. In traditional Kafka, brokers own partition replicas on local or attached storage. Producers write to partition leaders, followers replicate from the leader and are tracked in the ISR (In-Sync Replicas) set, and consumers read from broker-held log segments. That design is mature and widely understood, but it makes broker state a central part of every operational change.

During a ZooKeeper to KRaft migration, the team is already touching the control layer. That creates a natural temptation to bundle in upgrades, cluster resizing, topic cleanup, security changes, or cloud migration. Bundling can be rational when the organization has a strong staging environment and a clear rollback plan. It becomes risky when the plan mixes unrelated unknowns and then calls the whole thing "KRaft migration."

A useful checklist starts by naming the constraints that could turn a metadata migration into a production incident:

| Constraint | Why it matters during migration | Evidence to collect |
| --- | --- | --- |
| Controller quorum design | KRaft health depends on a correctly sized and placed controller quorum. | Controller role layout, voter configuration, failure-domain mapping, controller metrics |
| Client behavior | Kafka clients may depend on protocol versions, timeouts, retries, transactions, or admin APIs. | Client versions, feature usage, staging test results, error-rate baselines |
| Topic and partition state | Partition count, replica placement, retention, and compaction shape restart and recovery behavior. | Topic export, partition distribution, retained bytes, under-replicated partition history |
| Security and governance | ACLs, authentication, certificates, audit paths, and network policy must survive the change. | ACL export, identity mapping, certificate inventory, audit evidence |
| Rollback boundary | Some changes are reversible in practice; others are reversible only on paper. | Cutover criteria, backup state, restore test, decision owner |

The table is intentionally operational rather than theoretical. Teams do not fail migrations because they forgot that KRaft uses Raft. They fail because the staging cluster did not represent production, a script assumed ZooKeeper-based metadata access, a connector lagged behind, a rollback path was not tested, or a quorum placement decision created a new failure mode.
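One of those failure modes, a script that still assumes ZooKeeper-based metadata access, can be searched for before the migration window. The following is a minimal sketch, not a complete audit: the pattern list is illustrative, and the function name `find_zookeeper_assumptions` and the sample script are hypothetical.

```python
import re

# Patterns that suggest a script or config still talks to ZooKeeper directly.
# The list is illustrative, not exhaustive; extend it for your own tooling.
ZK_PATTERNS = [
    re.compile(r"--zookeeper\b"),           # legacy flag on older Kafka CLI tools
    re.compile(r"\bzookeeper\.connect\b"),  # broker property pointing at the ensemble
    re.compile(r"\bzookeeper-shell\b"),     # direct znode inspection
]


def find_zookeeper_assumptions(name, text):
    """Return (file name, line number, stripped line) for each suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in ZK_PATTERNS):
            hits.append((name, lineno, line.strip()))
    return hits


# Hypothetical script content used only to demonstrate the scan.
sample = """\
#!/bin/sh
kafka-topics.sh --zookeeper zk1:2181 --list
kafka-topics.sh --bootstrap-server broker1:9092 --list
"""

for hit in find_zookeeper_assumptions("list_topics.sh", sample):
    print(hit)
```

In practice the same function could be run over every file in the runbook and automation repositories, with each hit assigned an owner before cutover.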

Architecture patterns teams usually compare

The simplest path is an in-place migration to KRaft within the existing Kafka estate. This keeps endpoints, application contracts, and most operational tooling familiar. It is a strong option when the current platform is healthy, the team has validated clients, and the main goal is to remove ZooKeeper while staying close to the current broker storage model.

A second path is a broader managed Kafka transition. That can reduce some infrastructure lifecycle work, but it should still be evaluated against the actual workload. Managed operations do not automatically remove every Kafka concern around partition count, retention, networking, client compatibility, cost attribution, and migration cutover. They change the service boundary, and that boundary may be valuable, but the team should be precise about what moves to the provider and what remains with the platform team.

A third path is to keep Kafka semantics while changing the storage architecture. This is where Shared Storage architecture becomes relevant. Instead of binding durable partition data to broker-local disks, a Kafka-compatible shared-storage system keeps the Kafka client contract while moving durable data into object storage and making brokers closer to compute workers. The point is not that KRaft is insufficient. The point is that KRaft solves metadata coordination, while shared storage addresses a different class of operating problems.

[Figure: Stateful Kafka brokers and stateless shared-storage brokers]

The decision usually turns on where the team's pain comes from. If the main pressure is ZooKeeper operations, an in-place KRaft migration may be enough. If the pressure is service ownership, a managed platform may be attractive. If the pressure is storage growth, broker replacement, partition movement, or cloud elasticity, the team should evaluate whether the broker storage model is the real bottleneck.

That last point matters because KRaft and stateless brokers are sometimes blended together in vendor conversations. They are not the same thing. KRaft changes how Kafka manages metadata. Stateless brokers change how the data plane behaves when compute and durable storage are separated. A platform can use KRaft and still have stateful brokers. A migration plan should keep those concepts separate so the architecture decision stays honest.

Evaluation checklist for platform teams

Start with compatibility, because no architecture choice is useful if applications break. The migration plan should enumerate producer and consumer versions, transactional workloads, idempotent producer usage, consumer group patterns, admin API automation, Connect tasks, stream-processing checkpoints, schema behavior, and monitoring dependencies. Each item needs a test owner and a pass condition. "Works in staging" is not enough when staging has one topic, one consumer group, and none of the awkward production scripts.
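The "test owner and pass condition" requirement can be made mechanical. Below is a small sketch of a compatibility inventory that flags items not yet ready for sign-off. The `InventoryItem` type, the team names, and the pass conditions are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class InventoryItem:
    """One compatibility concern that must be validated before migration."""
    name: str
    owner: str = ""
    pass_condition: str = ""


def unready_items(items):
    """Return names of items that lack a test owner or a pass condition."""
    return [
        item.name
        for item in items
        if not item.owner.strip() or not item.pass_condition.strip()
    ]


# Hypothetical inventory entries for illustration only.
inventory = [
    InventoryItem(
        "transactional producers",
        owner="payments-team",
        pass_condition="no unexpected aborts during a controller failover in staging",
    ),
    InventoryItem("Connect tasks", owner="data-platform"),
    InventoryItem("admin API automation"),
]

print(unready_items(inventory))  # ['Connect tasks', 'admin API automation']
```

An empty result is a necessary condition for scheduling the migration window, not a sufficient one: the pass conditions themselves still have to be met on a staging cluster that resembles production.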

Then evaluate the control plane. KRaft controller nodes should be placed across appropriate failure domains, monitored explicitly, and protected from noisy broker workloads if the deployment separates roles. The team should understand quorum loss behavior, controller metrics, disk and network requirements for metadata, and the operational procedure for replacing a controller. KRaft reduces the need to operate ZooKeeper, but the controller quorum becomes a first-class production component.
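Quorum loss behavior follows from majority voting: a Raft quorum of n voters needs n // 2 + 1 of them available. The sketch below checks whether a controller layout keeps a majority after losing any single failure domain. It assumes a statically configured quorum written in the `id@host:port` format used by `controller.quorum.voters`; the host names and the zone mapping are hypothetical.

```python
from collections import Counter


def parse_voters(voters):
    """Parse an 'id@host:port,id@host:port' string into {node_id: host}."""
    result = {}
    for entry in voters.split(","):
        node_id, endpoint = entry.strip().split("@", 1)
        host, _port = endpoint.rsplit(":", 1)
        result[int(node_id)] = host
    return result


def majority(voter_count):
    """Smallest number of voters that forms a Raft majority."""
    return voter_count // 2 + 1


def survives_domain_loss(voter_domains):
    """True if losing any single failure domain still leaves a majority.

    voter_domains maps node id -> failure domain (for example an AZ name).
    """
    total = len(voter_domains)
    per_domain = Counter(voter_domains.values())
    return all(total - lost >= majority(total) for lost in per_domain.values())


# Hypothetical three-controller layout.
voters = parse_voters("1@ctrl-a:9093,2@ctrl-b:9093,3@ctrl-c:9093")
spread = {1: "az-a", 2: "az-b", 3: "az-c"}
stacked = {1: "az-a", 2: "az-a", 3: "az-b"}

print(sorted(voters))                 # [1, 2, 3]
print(survives_domain_loss(spread))   # True
print(survives_domain_loss(stacked))  # False: losing az-a leaves 1 of 3 voters
```

The second layout shows how a placement decision can introduce a new failure mode even though the voter count looks correct.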

The data plane deserves a separate checklist. This is where the team decides whether the migration is only about metadata or also about the Kafka storage model:

  • Storage growth. If retention keeps increasing faster than compute demand, the current broker sizing model may continue to over-provision compute or disk after KRaft.
  • Rebalancing behavior. If broker additions, removals, or replacements regularly consume maintenance windows, the partition movement model should be part of the architecture review.
  • Cross-AZ traffic. If cloud network charges are material, inspect producer, follower, consumer, and replication paths instead of assuming metadata migration changes data movement.
  • Rollback and coexistence. If the migration needs a phased cutover, define how messages, offsets, consumer groups, and ownership move between source and target clusters.
  • Governance boundary. If data residency or customer-cloud control matters, evaluate where the data plane, object storage, network path, and management plane run.
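The storage-growth item lends itself to back-of-the-envelope arithmetic. The sketch below estimates retained bytes as ingest rate × retention × replication factor and derives how many brokers disk alone would require. It ignores compression, compaction, and uneven partition distribution, and the workload figures and the 70% headroom factor are assumptions chosen for illustration.

```python
import math

SECONDS_PER_DAY = 86_400


def retained_bytes(ingest_bytes_per_sec, retention_days, replication_factor):
    """Approximate bytes held on broker disks at steady state."""
    return ingest_bytes_per_sec * retention_days * SECONDS_PER_DAY * replication_factor


def brokers_needed_for_disk(total_bytes, usable_bytes_per_broker, headroom=0.7):
    """Brokers required if each disk is filled only up to the headroom fraction."""
    return math.ceil(total_bytes / (usable_bytes_per_broker * headroom))


# Hypothetical workload: 50 MB/s ingest, replication factor 3, 4 TB usable per broker.
ingest = 50_000_000
disk = 4_000_000_000_000

for days in (7, 14):
    total = retained_bytes(ingest, days, 3)
    print(days, total, brokers_needed_for_disk(total, disk))
```

Doubling retention roughly doubles the broker count in this model even though ingest, and therefore compute demand, is unchanged. That is the coupling the checklist asks the team to examine, and KRaft does not alter it.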

This list deliberately includes items that KRaft does not solve. That is the point. Good migration planning protects the target cluster; good platform planning asks whether the target cluster will still be expensive to operate once the migration is done.

Where AutoMQ changes the operating model

Once the evaluation reaches storage growth, partition movement, and cloud operating boundaries, AutoMQ becomes relevant as a Kafka-compatible cloud-native streaming platform built around Shared Storage architecture. AutoMQ keeps the Kafka protocol and ecosystem surface while replacing Kafka's local log storage with S3Stream, a storage layer that uses WAL (Write-Ahead Log) storage for low-latency persistence and S3-compatible object storage as the durable repository. Brokers become stateless in the sense that persistent partition data is no longer bound to broker-local disks.

That design changes the migration conversation. A team evaluating AutoMQ is not merely asking whether ZooKeeper disappears. It is asking whether the post-migration Kafka-compatible platform can scale compute and storage independently, reduce data movement during broker changes, and fit better into customer-controlled cloud environments. AutoMQ BYOC runs the control plane and data plane in the customer's cloud account and VPC, while AutoMQ Software is designed for private data center environments. That boundary matters for teams that cannot treat streaming data as a generic hosted service.

AutoMQ Linking is also relevant when the move is cross-cluster rather than an in-place metadata migration. AutoMQ documentation describes Kafka Linking for AutoMQ commercial editions as a migration tool that supports byte-to-byte message synchronization with offset consistency, while AutoMQ Open Source recommends MirrorMaker2 for cluster migration. The practical question is not which tool sounds smoother. It is whether the chosen migration path can preserve the application contract that producers and consumers experience: records, offsets, consumer progress, cutover timing, and rollback evidence.
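Whatever tool is chosen, consumer progress can be checked as evidence rather than assumed. The following sketch compares committed offsets captured from the source and target clusters. It assumes the offsets have already been exported into plain dictionaries keyed by (group, topic, partition); how they are exported is left to the team's own admin tooling, and the sample values are hypothetical. Strict equality is only the right test for a path that claims offset preservation; a tool that translates offsets would need a different comparison.

```python
def offset_mismatches(source, target):
    """Return {key: (source_offset, target_offset)} wherever the two differ.

    Keys are (group, topic, partition) tuples. A key missing on one side
    is reported with None for that side.
    """
    mismatches = {}
    for key in source.keys() | target.keys():
        src, tgt = source.get(key), target.get(key)
        if src != tgt:
            mismatches[key] = (src, tgt)
    return mismatches


# Hypothetical snapshots taken while producers are paused for cutover.
source_offsets = {
    ("billing", "invoices", 0): 1200,
    ("billing", "invoices", 1): 1187,
    ("audit", "events", 0): 98,
}
target_offsets = {
    ("billing", "invoices", 0): 1200,
    ("billing", "invoices", 1): 1180,
}

for key, pair in sorted(offset_mismatches(source_offsets, target_offsets).items()):
    print(key, pair)
```

A non-empty result at cutover time is the kind of evidence that should block the switch or trigger the rollback path.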

[Figure: Production readiness checklist for KRaft migration]

This is the right point in the process to introduce product evaluation, not at the beginning. If the team's only goal is to retire ZooKeeper on a healthy Kafka estate, AutoMQ may be a future architecture option rather than the immediate migration target. If the KRaft project has exposed deeper problems around broker-local storage, scaling, and cloud cost, then evaluating a Kafka-compatible shared-storage architecture is a logical extension of the checklist.

Decision table and FAQ

The migration decision should be explicit enough that different stakeholders can disagree productively. A platform engineer may focus on failure domains, a FinOps reviewer on storage and network cost, an application owner on offsets, and a security team on cloud boundaries. The table below gives them a shared language.

| If your main pressure is... | Prefer this path | Watch for |
| --- | --- | --- |
| Removing ZooKeeper while keeping the current Kafka estate stable | In-place ZooKeeper to KRaft migration | Controller quorum design, client validation, rollback criteria |
| Reducing infrastructure lifecycle work | Managed Kafka evaluation | Data boundary, private networking, feature coverage, migration cutover |
| Long retention and broker disk pressure | Shared Storage architecture evaluation | Latency profile, WAL choice, object storage behavior, replay testing |
| Broker replacement and scaling friction | Stateless broker evaluation | Compatibility, traffic routing, operational ownership, migration tool fit |
| Customer-cloud or private-environment control | BYOC or software deployment evaluation | Control plane location, data plane location, IAM, audit, network paths |

Back to the original search: "ZooKeeper to KRaft migration" is a metadata migration keyword, but production teams are rarely searching for metadata alone. They are searching because a core platform is about to change, and that change is an opportunity to decide what should remain familiar and what should be redesigned. If the right answer is an in-place KRaft migration, make it boring with evidence. If the evidence points to a deeper storage and operations problem, evaluate AutoMQ's Kafka-compatible Shared Storage architecture with the same rigor you would apply to any production Kafka migration. A practical next step is to review the AutoMQ migration documentation and map it against your own compatibility inventory.

FAQ

Does KRaft make Kafka brokers stateless?

No. KRaft replaces ZooKeeper for metadata coordination, but it does not by itself move durable partition data away from broker-local or broker-attached storage. Traditional Kafka brokers still own log segments and participate in replication. Stateless broker behavior requires a different storage architecture, such as a Kafka-compatible Shared Storage architecture where durable data is kept in shared object storage.

Is ZooKeeper to KRaft migration only a Kafka version upgrade?

No. Version planning is part of the work, but the migration also affects metadata operations, controller quorum design, tooling assumptions, security validation, observability, and rollback planning. Treat it as a production platform change rather than a package update.

When should a team evaluate AutoMQ during a KRaft migration project?

Evaluate AutoMQ when the KRaft project exposes deeper pressure around broker storage, partition movement, retention cost, scaling, cloud network traffic, or customer-controlled deployment boundaries. If the only goal is removing ZooKeeper from a healthy Kafka estate, an in-place KRaft migration may be the immediate path.

Can existing Kafka clients work with AutoMQ?

AutoMQ is designed as a Kafka-compatible streaming platform and uses the Apache Kafka compute layer with storage-layer changes. Production teams should still validate their own clients, connectors, security configuration, admin tooling, consumer group behavior, and monitoring integrations before migration.

What should be tested before a production cutover?

Test controller quorum behavior, broker restart behavior, client compatibility, producer retries, transactional workloads, consumer group progress, Connect tasks, stream-processing checkpoints, ACLs, certificates, dashboards, alerts, rollback procedures, and migration tooling. The goal is to prove the application contract, not only prove that the target cluster starts.
