"At the Grab Data Engineering Platform team, we focus on improving the efficiency and scalability of our streaming data platform. By adopting AutoMQ, the platform leverages cloud-native storage and eliminates the need for replication between brokers. This enhances broker performance, reduces storage and network resource usage, and enables us to scale compute and storage resources to meet evolving demands."
Grab Data Engineering Platform Team
The Challenge
The Coban team at Grab manages a massive real-time data streaming platform that serves as the critical ingestion point for the company's data lake. However, as traffic volumes surged to terabytes per hour, the legacy Kafka architecture hit hard limitations:
- The "6-Hour" Rebalancing Bottleneck: Scaling the cluster was a heavy, data-intensive operation. Moving partitions between brokers required physical data replication, causing rebalancing tasks to drag on for up to 6 hours.
- Operational Risk & Jitter: This heavy data movement wasn't just slow; it saturated network and disk I/O, causing performance jitter that threatened the stability of downstream analytics and services.
- Inflexible Resource Coupling: The team faced a dilemma: if they needed more storage, they had to add more brokers (wasting compute power) or vertically scale disks (complex and risky). This led to significant over-provisioning, where expensive resources sat idle during off-peak hours just to be safe for peaks.
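To see why copy-based rebalancing runs for hours, a back-of-envelope estimate helps. The sketch below is illustrative only: the function name, the 10 TB volume, and the throttle rates are hypothetical values chosen for the example, not figures from Grab's clusters.

```python
def reassignment_hours(data_tb: float, throttle_mb_per_s: float) -> float:
    """Hours needed to copy `data_tb` terabytes at a sustained replication throttle."""
    if throttle_mb_per_s <= 0:
        raise ValueError("throttle must be positive")
    megabytes = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return megabytes / throttle_mb_per_s / 3600


# Hypothetical inputs: 10 TB of partition data at three throttle settings.
for throttle in (250, 500, 1000):
    print(f"{throttle} MB/s -> {reassignment_hours(10, throttle):.1f} h")
```

Raising the throttle shortens the copy but consumes more of the network and disk bandwidth that live traffic needs, which is the trade-off behind the jitter described above.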
Why AutoMQ
True Cloud-Native Architecture without Compromise
Grab chose AutoMQ to transition from a hardware-dependent design to a cloud-service-dependent design.
- S3-First Storage with High Performance: AutoMQ offloads data persistence to object storage (S3) while maintaining single-digit-millisecond write latency. It does so by using a small, fixed-size (10 GB) EBS volume purely as a Write-Ahead Log (WAL) and by using Direct I/O to bypass file system overhead.
- Stateless & Instant Elasticity: Because data lives in shared object storage rather than on broker disks, the brokers are effectively stateless. Expanding the cluster or migrating partitions involves only metadata updates; no data is copied.
- Seamless Integration: The solution offered 100% Kafka protocol compatibility, passing all of Grab's rigorous test suites. Crucially, it integrated easily with their existing Kubernetes operator (Strimzi), allowing the team to adopt the new tech without changing their operational workflows or client code.
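The idea behind the first two points can be shown with a toy model. This is a simplified sketch, not AutoMQ's implementation: the `SharedStore`, `Broker`, and `reassign` names are hypothetical, in-memory lists stand in for the WAL volume and for S3, and flushing the WAL before handing over ownership is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Stand-in for object storage: segments are keyed by partition, not by broker."""
    segments: dict = field(default_factory=dict)

    def upload(self, partition: str, records: list) -> None:
        self.segments.setdefault(partition, []).extend(records)


@dataclass
class Broker:
    name: str
    store: SharedStore
    wal: list = field(default_factory=list)  # stand-in for the small WAL volume

    def append(self, partition: str, record: bytes) -> None:
        # The write is considered durable once it is in the WAL.
        self.wal.append((partition, record))

    def flush(self) -> None:
        # Move buffered records to shared storage, then trim the WAL.
        for partition, record in self.wal:
            self.store.upload(partition, [record])
        self.wal.clear()


def reassign(owners: dict, partition: str, source: Broker, target: Broker) -> None:
    """Hand a partition to another broker without copying its data between brokers."""
    source.flush()                   # assumption: nothing may live only on the source
    owners[partition] = target.name  # the reassignment itself is a metadata update


store = SharedStore()
a, b = Broker("broker-a", store), Broker("broker-b", store)
owners = {"orders-0": "broker-a"}
a.append("orders-0", b"r1")
a.append("orders-0", b"r2")
reassign(owners, "orders-0", a, b)
print(owners, store.segments)
```

In this model the cost of a reassignment does not grow with the amount of data a partition holds, because the target broker reads the same objects the source wrote.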
The Results
From Bottleneck to Competitive Advantage
The migration to AutoMQ has turned Grab's streaming infrastructure into one of the most efficient fleets in their ecosystem.
Key Metrics
Less than 1 minute: partition reassignment (down from 6 hours)
3x: increase in throughput per CPU core
3x: improvement in overall cost efficiency
Strimzi Operator compatible
- Operational Agility: Partition reassignment for the entire cluster now takes less than 1 minute (down from 6 hours). Reassignment is now fast enough that the team plans to use Spot Instances for further cost savings, a strategy deemed too risky with legacy Kafka.
- 3x Efficiency Gains: By eliminating inter-broker replication traffic and optimizing for cloud storage, Grab observed a 3x increase in throughput per CPU core and a corresponding 3x improvement in overall cost efficiency.
- Future-Ready Architecture: With the stability issues resolved, the Coban team is now looking ahead to leverage AutoMQ's S3 Table Topics to write data directly in Iceberg format, further simplifying their data lake pipelines.
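As a rough illustration of the inter-broker replication traffic mentioned above: in classic Kafka, each follower replica fetches a full copy of what producers write to the leader. The sketch below assumes a hypothetical produce rate and replication factor, and it ignores consumer traffic and uploads to object storage; it is not a derivation of the 3x figures.

```python
def inter_broker_mb_per_s(produce_mb_per_s: float, replication_factor: int) -> float:
    """Follower-fetch traffic between brokers under classic Kafka replication."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    # Every replica beyond the leader pulls one full copy of the produced data.
    return produce_mb_per_s * (replication_factor - 1)


# Hypothetical workload: 300 MB/s of produce traffic.
classic = inter_broker_mb_per_s(300, replication_factor=3)
shared = inter_broker_mb_per_s(300, replication_factor=1)  # durability delegated to storage
print(f"classic: {classic} MB/s, shared storage: {shared} MB/s")
```

With durability handled by the storage layer, brokers no longer spend network and CPU on copying data to each other, which is the saving the efficiency bullet refers to.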
Ready to eliminate rebalancing storms?
See how AutoMQ can turn your scaling operations from hours to seconds—just like Grab. Get a personalized demo and see the difference.


