Kafka Rebalancing at Scale: How Grab Reduced Partition Reassignment from 6 Hours to Under 1 Minute

Kafka rebalancing should be a routine capacity operation: add brokers, redistribute partitions, and give the cluster more room to breathe. In large production clusters, it can become something heavier. A partition reassignment can turn into a storage migration, a network event, a disk I/O event, and a maintenance-window debate at the same time.

That is why "Kafka partition reassignment slow" is such a familiar operator search. Teams usually ask it after doing the responsible work: planning capacity, watching disk pressure, scheduling a quiet window, and still seeing a scaling task stretch across hours. The hard part is not that Kafka cannot move partitions. The hard part is that broker-local Kafka has to move the bytes attached to those partitions.

AutoMQ's public Grab case study gives this problem a concrete production shape. Grab's Coban team manages a real-time streaming platform that serves as a critical ingestion point for the company's data lake. According to the case study, traffic reached terabytes per hour, and partition reassignment in the legacy Kafka architecture could take more than 6 hours. After Grab adopted AutoMQ, reassignment dropped to under 1 minute, and the case study also reports a 3x increase in single-core throughput and a 3x improvement in overall cost efficiency.

[Figure: Kafka reassignment before and after]

Why Kafka Reassignment Becomes the Scaling Bottleneck

Most Kafka teams discuss scaling in terms of brokers, partitions, and throughput. Underneath those visible units sits a less forgiving model: traditional Kafka ties partition data to broker-local storage. When partition ownership moves, the cluster often has to copy log data, keep replicas consistent, preserve availability, and avoid overwhelming the same disks and network links serving live traffic.

Adding brokers gives the cluster more compute capacity, but it does not make existing partition data appear on additional brokers. Reassignment has to copy data, and the larger the cluster, the more that copy process competes with ingest, fetch, replication, and catch-up reads.

The public Grab case calls this out directly: moving partitions between brokers required physical data replication, and rebalancing tasks could drag on for up to 6 hours. The same data movement saturated network and disk I/O, creating performance jitter that threatened downstream analytics and services. Slow reassignment is not just an operator inconvenience; it can become instability for the workloads that depend on the stream.

For teams diagnosing their own clusters, the pattern is familiar:

  • Reassignment windows expand as retained data grows.
  • Broker disk and network pressure rise during the operation.
  • Teams over-provision for peaks and maintenance headroom.
  • Scale-down becomes harder because it triggers another data movement event.

Kafka's shared-nothing design made sense for broker-local storage, replication across brokers, and predictable hardware ownership. The friction appears when cloud infrastructure changes the economics. Compute can be elastic and object storage can be durable and pay-as-you-go, but broker-local Kafka storage still asks scaling operations to carry data with the broker.

Grab's Production Kafka Workload

Grab is a Singapore-based super-app spanning ride-hailing, food and grocery delivery, and digital payments. In that setting, a streaming platform moves operational data into analytics and downstream services. The platform team has to keep that path scalable without turning every growth event into a risky maintenance project.

The public case identifies Grab's Coban team as the group managing the platform. It also says the platform served as a critical ingestion point for the company's data lake and had to handle terabytes-per-hour traffic. That combination explains why rebalancing latency mattered. A six-hour reassignment constrains when and how the team can adapt to demand.

This was not only a raw scaling problem. Grab's team also cared about efficiency and operational flexibility. The public quote from Grab's Data Engineering Platform Team says their focus was "improving the efficiency and scalability" of the streaming data platform, and that AutoMQ helped by using cloud-native storage and eliminating replication between brokers.

That distinction matters. A team can tune Kafka reassignment concurrency, split large topics, rebalance gradually, or add hardware ahead of time. Those tactics help within the existing model. Grab's improvement, from 6+ hours to under 1 minute, points to a change in what reassignment means.
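As one illustration of "rebalance gradually" within the existing model, the sketch below splits a list of partition moves into smaller plans. It assumes the JSON layout accepted by Kafka's `kafka-reassign-partitions.sh --reassignment-json-file` option; the topic name, partition count, broker IDs, and the `batch_reassignment_plan` helper are hypothetical and are not taken from the Grab case.

```python
import json
from typing import Dict, List


def batch_reassignment_plan(moves: List[Dict], batch_size: int) -> List[str]:
    """Split partition moves into JSON plans of at most batch_size moves each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    plans = []
    for start in range(0, len(moves), batch_size):
        # Each plan is a standalone reassignment document.
        plan = {"version": 1, "partitions": moves[start:start + batch_size]}
        plans.append(json.dumps(plan, indent=2))
    return plans


# Hypothetical example: move partitions 0-4 of topic "orders" onto brokers 4, 5, 6.
moves = [
    {"topic": "orders", "partition": p, "replicas": [4, 5, 6]}
    for p in range(5)
]
plans = batch_reassignment_plan(moves, batch_size=2)
print(f"{len(plans)} plans generated")
print(plans[0])
```

Smaller batches limit how much data is in flight at once, but they do not reduce the total bytes that must eventually be copied, which is why the tactic helps only within the broker-local model.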

The Architectural Reason Scaling Took Hours

Traditional Kafka couples compute, storage capacity, and data placement. A broker is both a request handler and a place where partition logs live. When a cluster grows, the added broker becomes useful only after it receives partition data from existing brokers.

That makes scaling a data-placement problem before it becomes a compute-scheduling problem. The controller can decide where partitions should live, but the storage model decides how expensive that decision is to execute. If the data has to move across brokers, the window is bounded by bytes copied, disk throughput, network throughput, replication safety, and the team's tolerance for jitter.

Scaling Question | Stateful Kafka Broker-Local Storage | AutoMQ Shared Storage Model
What moves during reassignment? | Partition ownership and partition data | Primarily metadata and ownership
What competes with live traffic? | Broker disk I/O, network I/O, replication work | Far less inter-broker data movement
Why does the window grow? | More retained data means more bytes to copy | Durable data is already in shared storage
What can operators scale separately? | Compute and storage are tightly coupled | Compute and storage can be scaled independently

In a broker-local model, the unit of reassignment is the partition role plus the data required for that role to be served from another broker. In a shared storage model, durable data is not owned by a broker-local disk in the same way, so reassignment can become closer to a metadata operation.
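The bytes-copied bound in the broker-local model can be turned into a back-of-envelope estimate. The sketch below is a simplification: it assumes a single sustained aggregate copy rate and ignores replication catch-up, throttling changes, and contention with live traffic. The 10 TB and 500 MB/s figures are illustrative assumptions, not numbers from the Grab case.

```python
TB = 1e12  # bytes, decimal units
MB = 1e6   # bytes, decimal units


def estimate_copy_hours(bytes_to_move: float, copy_bytes_per_sec: float) -> float:
    """Lower-bound estimate of reassignment time when data must be copied."""
    if copy_bytes_per_sec <= 0:
        raise ValueError("copy_bytes_per_sec must be positive")
    return bytes_to_move / copy_bytes_per_sec / 3600


# Hypothetical cluster: 10 TB to move at a sustained 500 MB/s aggregate rate.
hours = estimate_copy_hours(10 * TB, 500 * MB)
print(round(hours, 2))  # 5.56

# Doubling retained data doubles the window at the same copy rate.
print(round(estimate_copy_hours(20 * TB, 500 * MB), 2))  # 11.11
```

In a shared storage model the `bytes_to_move` term is close to zero for durable data, so this estimate stops being the dominant factor in the window.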

AutoMQ describes this as a Shared Storage architecture. It keeps Kafka protocol compatibility while replacing Kafka's native log storage with S3Stream, using WAL storage for efficient writes and object storage as the primary data repository. AutoMQ's documentation describes stateless broker nodes, second-level partition reassignment, automatic scaling, and continuous traffic rebalancing as outcomes of that model.

The operational consequence is the point. If reassignment no longer requires copying large amounts of partition data from broker disk to broker disk, the scaling bottleneck moves away from bulk data migration. That is the architectural reason Grab's result is plausible: the operation changed categories.

How Stateless Brokers Changed the Operation

In Grab's case, AutoMQ did not ask the team to abandon Kafka clients or rebuild the streaming ecosystem around a different protocol. The public page says AutoMQ offered 100% Kafka protocol compatibility, passed Grab's rigorous test suites, and integrated with the team's existing Kubernetes operator, Strimzi. Adoption therefore did not require changing operational workflows or client code.

That compatibility point matters because rebalancing pain often lives inside mature platforms. The more important Kafka is, the harder it becomes to change it. Producers, consumers, connectors, observability tools, and incident playbooks all assume Kafka semantics. A replacement that solves reassignment while breaking the surrounding operating model creates a different kind of risk.

AutoMQ's architecture shifts the storage layer while preserving the Kafka-facing surface. Durable data is offloaded to cloud storage, and brokers become effectively stateless compute nodes. In the public Grab case, this meant cluster expansion or partition migration involved metadata updates rather than physical data copying between brokers. That is the short path from "rebalancing takes hours" to "reassignment completes in under 1 minute."
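The difference between the two operations can be sketched with a toy model. This is a conceptual illustration only, not how Kafka or AutoMQ is implemented: the class names are hypothetical, and the model ignores replication, WAL handling, and caching. It shows only that one model pays per byte retained while the other updates ownership.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BrokerLocalCluster:
    """Toy model: partition data lives on the owning broker's local disk."""
    owner: Dict[str, int]       # partition -> owning broker ID
    size_bytes: Dict[str, int]  # partition -> retained bytes on that broker

    def reassign(self, partition: str, target: int) -> int:
        """Move a partition and return the bytes copied between brokers."""
        if self.owner[partition] == target:
            return 0
        self.owner[partition] = target
        return self.size_bytes[partition]  # data has to follow ownership


@dataclass
class SharedStorageCluster:
    """Toy model: durable data lives in shared object storage."""
    owner: Dict[str, int]       # partition -> owning broker ID
    size_bytes: Dict[str, int]  # partition -> retained bytes in shared storage

    def reassign(self, partition: str, target: int) -> int:
        """Update ownership only; no inter-broker copy of durable data."""
        self.owner[partition] = target
        return 0


sizes = {"orders-0": 500 * 10**9}  # hypothetical 500 GB partition
local = BrokerLocalCluster(owner={"orders-0": 1}, size_bytes=dict(sizes))
shared = SharedStorageCluster(owner={"orders-0": 1}, size_bytes=dict(sizes))

print(local.reassign("orders-0", 4))   # 500000000000
print(shared.reassign("orders-0", 4))  # 0
```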

[Figure: Scaling operation timeline]

There is also an operational flexibility angle. The case says Grab is planning to use Spot Instances for further cost savings, a strategy that had been too risky with legacy Kafka. Spot capacity is attractive when the workload can tolerate node churn; stateless brokers reduce the fear that a broker leaving the fleet implies a long data-recovery or partition-copy operation.

Results and Lessons for Kafka Teams

The headline result is direct: Grab reduced cluster partition reassignment time from more than 6 hours to under 1 minute. The supporting results are also important: 3x single-core throughput and 3x overall cost efficiency, according to the public case. The public explanation links this to eliminating inter-broker replication traffic and optimizing for cloud storage.

The lesson is not that every Kafka team should replace its cluster the moment reassignment gets slow. The better lesson is that reassignment time tells you something about architecture. If a routine scale-out event behaves like a bulk migration, your storage ownership model is part of the scaling path.

For large Kafka clusters, Grab's case suggests a practical decision frame:

  • If reassignment is rare and maintenance windows are comfortable, tuning may be enough.
  • If reassignment creates jitter, consumes operator time, or blocks elasticity, the bottleneck is architectural.
  • If storage growth forces compute growth, the cluster is carrying a coupling cost.
  • If you want elastic or interruptible compute, broker-local state defines how much churn you can tolerate.

Benchmarks can tell you that a system is fast under specific conditions. Grab's case shows a production team changing the operational shape of a real platform: a scaling task that used to be planned around hours became short enough to fit into normal operational flow.

What to Check in Your Own Kafka Cluster

The quickest way to apply this story is to look at the last few times your team changed cluster capacity. Start with the messy record: how long the reassignment took, which resources saturated, whether consumers saw lag, whether operators had to throttle the plan, and whether anyone avoided a needed change because the maintenance window was too expensive.

[Figure: Rebalancing pain checklist]

A useful review should include technical and organizational signals:

  • Partition movement: How much retained data had to be copied?
  • Resource contention: Did reassignment compete with producer traffic, consumer fetches, broker replication, or catch-up reads?
  • Operational behavior: Did the team postpone scaling, over-provision, or avoid scale-down?
  • Architecture coupling: Could the team add compute without adding storage?
  • Failure tolerance: Would node churn create a routine scheduling event or a storage recovery event?
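One way to make that review repeatable is to record each past capacity change in a small structure and flag the signals above. The sketch below is a hypothetical helper: the field names, the one-hour default threshold, and the example values are assumptions to adapt to your own records, not figures from the Grab case.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CapacityChangeReview:
    """One past capacity change, recorded from the team's own notes."""
    duration_hours: float        # wall-clock time of the reassignment
    bytes_copied: int            # retained data moved between brokers
    consumer_lag_observed: bool  # did consumers fall behind during the move?
    plan_throttled: bool         # did operators have to slow the plan down?
    change_postponed: bool       # was a needed change delayed or avoided?


def coupling_signals(review: CapacityChangeReview,
                     max_window_hours: float = 1.0) -> List[str]:
    """Return the checklist signals that this record triggers."""
    signals = []
    if review.bytes_copied > 0 and review.duration_hours > max_window_hours:
        signals.append("partition movement: window driven by bytes copied")
    if review.consumer_lag_observed or review.plan_throttled:
        signals.append("resource contention: reassignment competed with live traffic")
    if review.change_postponed:
        signals.append("operational behavior: scaling was postponed or avoided")
    return signals


# Hypothetical record of a single scale-out event.
example = CapacityChangeReview(
    duration_hours=6.0,
    bytes_copied=40 * 10**12,
    consumer_lag_observed=True,
    plan_throttled=True,
    change_postponed=False,
)
for signal in coupling_signals(example):
    print(signal)
```

A record that triggers several signals across repeated events is a stronger indication of architectural coupling than any single slow reassignment.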

The uncomfortable answer may be that your cluster is not short on capacity alone. It may be short on elasticity because its storage model makes capacity changes expensive. Kafka compatibility solves the ecosystem problem, but stateless brokers solve the reassignment problem at the layer where the pain starts.

AutoMQ will not remove the need for careful production engineering. Teams still need load tests, migration planning, observability, rollback thinking, and customer-approved acceptance criteria. What it can change is the shape of the operation: from moving partition data across broker disks to reassigning ownership over shared storage. For teams whose Kafka scaling windows have grown from minutes into hours, that distinction is the difference between "we should scale" and "we can scale when the business needs it."

FAQ

Why does Kafka partition reassignment take hours in large clusters?

In traditional Kafka, partition logs live on broker-local storage. Moving partition ownership often means copying that partition's log data between brokers while the cluster serves production traffic. As retained data and partition count grow, reassignment can become limited by disk I/O, network throughput, and operational risk.

What did Grab achieve with AutoMQ?

According to AutoMQ's public Grab case, Grab reduced cluster partition reassignment time from more than 6 hours to under 1 minute. The same case reports 3x single-core throughput and 3x overall cost efficiency.

Did Grab have to change Kafka clients?

According to the public case, no. It says AutoMQ offered 100% Kafka protocol compatibility, passed Grab's test suites, and integrated with Grab's existing Strimzi-based Kubernetes workflow, so adoption did not require changing client code.

Are stateless Kafka brokers the same as Kafka Tiered Storage?

No. Tiered Storage keeps a primary broker-local storage layer and moves older data to a secondary layer. That can reduce storage pressure, but reassignment can still involve primary-storage data movement. AutoMQ's Shared Storage architecture replaces Kafka's storage layer with S3Stream and stores durable data in shared object storage so brokers can operate as stateless compute nodes.

When should a team consider a shared-storage Kafka architecture?

Consider it when reassignment windows, broker disk pressure, network saturation, or maintenance risk block normal scaling. The stronger signal is a repeated pattern where changing compute becomes risky because partition data is tied to broker-local storage.

Where can I learn more about AutoMQ for this use case?

Start with the Grab case study and AutoMQ's Shared Storage architecture documentation.
