Grab Case Study

"At the Grab Data Engineering Platform team, we focus on improving the efficiency and scalability of our streaming data platform. By adopting AutoMQ, the platform leverages cloud-native storage and eliminates the need for replication between brokers. This enhances broker performance, reduces storage and network resource usage, and enables us to scale compute and storage resources to meet evolving demands."

Grab Data Engineering Platform Team

The Challenge

The Coban team at Grab manages a massive real-time data streaming platform that serves as the critical ingestion point for the company's data lake. However, as traffic volumes surged to terabytes per hour, the legacy Kafka architecture hit hard limitations:

The "6-Hour" Rebalancing Bottleneck: Scaling the cluster was a heavy, data-intensive operation. Moving partitions between brokers required physical data replication, causing rebalancing tasks to drag on for up to 6 hours.
Operational Risk & Jitter: This heavy data movement wasn't just slow; it saturated network and disk I/O, causing performance jitter that threatened the stability of downstream analytics and services.
Inflexible Resource Coupling: The team faced a dilemma: to get more storage, they had to either add brokers (wasting compute power) or vertically scale disks (complex and risky). This led to significant over-provisioning, with expensive resources sitting idle during off-peak hours just to cover peak load.

Why AutoMQ

True Cloud-Native Architecture without Compromise

Grab chose AutoMQ to transition from a hardware-dependent design to a cloud-service-dependent design.

S3-First Storage with High Performance: AutoMQ offloads data persistence to object storage (S3) while keeping write latency low. It does this with a small, fixed-size write-ahead log (WAL): for example, a 10 GB EBS volume accessed with Direct I/O to bypass file system overhead for sub-millisecond latency, or, in AutoMQ Open Source, an S3 WAL that writes directly to object storage without requiring local block storage.
Stateless & Instant Elasticity: Because storage is offloaded to a shared service, the brokers are effectively stateless. Expanding the cluster or migrating partitions involves only metadata updates; no data is copied between brokers.
Seamless Integration: The solution offered 100% Kafka protocol compatibility, passing all of Grab's rigorous test suites. Crucially, it integrated easily with their existing Kubernetes operator (Strimzi), allowing the team to adopt the new tech without changing their operational workflows or client code.
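
To make the "metadata only" point concrete, here is a minimal toy model in Python. It is a sketch for illustration, not AutoMQ or Kafka code: the Partition and Cluster classes, the broker names, and the 50 GiB partition size are all hypothetical. It contrasts how many bytes must move between brokers when a partition changes owner under broker-local storage versus shared object storage.

```python
from dataclasses import dataclass, field
from typing import Dict

GIB = 1024 ** 3


@dataclass
class Partition:
    name: str
    size_bytes: int  # data currently retained for this partition
    owner: str       # broker currently serving the partition


@dataclass
class Cluster:
    # True models shared object storage; False models broker-local disks.
    shared_storage: bool
    partitions: Dict[str, Partition] = field(default_factory=dict)

    def add(self, name: str, size_bytes: int, owner: str) -> None:
        self.partitions[name] = Partition(name, size_bytes, owner)

    def reassign(self, name: str, new_owner: str) -> int:
        """Change the partition's owner; return bytes copied between brokers."""
        p = self.partitions[name]
        if p.owner == new_owner:
            return 0
        # With local disks the new owner needs its own copy of the data.
        # With shared storage only the ownership record changes.
        copied = 0 if self.shared_storage else p.size_bytes
        p.owner = new_owner
        return copied


# Hypothetical example: one 50 GiB partition moved from broker-1 to broker-2.
local = Cluster(shared_storage=False)
shared = Cluster(shared_storage=True)
for cluster in (local, shared):
    cluster.add("orders-0", 50 * GIB, "broker-1")

local_copied = local.reassign("orders-0", "broker-2")
shared_copied = shared.reassign("orders-0", "broker-2")
print(local_copied // GIB, "GiB copied with local disks")
print(shared_copied // GIB, "GiB copied with shared storage")
```

In this model the cost of a reassignment under local disks grows with the amount of retained data, while under shared storage it stays constant, which is the property the section attributes to stateless brokers.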

The Results

From Bottleneck to Competitive Advantage

The migration to AutoMQ has turned Grab's streaming infrastructure into one of the most efficient fleets in their ecosystem.

Operational Agility: Partition reassignment for the entire cluster now takes less than 1 minute (down from 6 hours). Reassignment is now fast enough that the team plans to use Spot Instances for further cost savings, a strategy deemed too risky with legacy Kafka.
3x Efficiency Gains: By eliminating inter-broker replication traffic and optimizing for cloud storage, Grab observed a 3x increase in throughput per CPU core and a corresponding 3x improvement in overall cost efficiency.
Future-Ready Architecture: With the stability issues resolved, the Coban team is now looking ahead to leverage AutoMQ's S3 Table Topics to write data directly in Iceberg format, further simplifying their data lake pipelines.

Ready to eliminate rebalancing storms?

See how AutoMQ can cut your scaling operations from hours to seconds, just like Grab. Get a personalized demo and see the difference.


Case Study

From 6 Hours to Seconds: How Grab Achieved 3x Data Streaming Efficiency with AutoMQ
