
Kafka in Production | Grab, JD.com, Tencent Case Studies

Every large Kafka team eventually asks the same uncomfortable question: is the next scaling problem really a tuning problem, or is it a storage architecture problem? Partition reassignment slows down because data must move. Kubernetes feels awkward because brokers own state. Cloud bills grow because Kafka replicates data across brokers while the cloud storage layer is already doing its own durability work. These are not edge cases; they are the normal pressure points of Kafka in production.

That is why production case studies matter for diskless Kafka. A clean architecture diagram can make stateless brokers and object storage look inevitable, but platform teams need a stronger signal before they move real workloads. Grab, JD.com, and Tencent Cloud EMR are useful examples because they adopted AutoMQ in different environments: data engineering platforms, e-commerce core infrastructure, and a cloud-provider EMR service.

Real-world AutoMQ architecture patterns

The shared pattern is simple to state and hard to retrofit into traditional Kafka: keep the Kafka protocol and ecosystem, but stop treating each broker as the long-term owner of partition data. AutoMQ keeps Kafka compatibility at the API layer while replacing the storage layer with a diskless architecture backed by object storage and a write-ahead log path.

The Production Pattern Behind the Case Studies

Traditional Kafka was designed around shared-nothing brokers. That design made sense when local disks were the primary persistence layer and machine-to-machine replication was the normal way to build durability. In cloud and Kubernetes environments, the same design can create duplicated work: Kafka replicates data between brokers, the storage platform may replicate data again, and operators still have to move partition data whenever capacity changes.

Diskless Kafka changes the ownership model. Brokers remain responsible for Kafka semantics, client traffic, fetch paths, consumer groups, and metadata participation. Object storage becomes the persistent data layer, while broker instances become replaceable compute capacity. That distinction is what connects the three case studies.
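The ownership split can be sketched in a few lines. This is an illustrative model, not AutoMQ's actual data structures: the point is that durable data lives at an object-storage path, while a broker "owns" a partition only through a metadata mapping that is cheap to change.

```python
# Illustrative sketch (not AutoMQ's implementation): partition data and
# partition ownership are tracked separately, so ownership can move without
# the data moving.

# Partition -> object-storage prefix: durable data, independent of any broker.
partition_data = {
    ("orders", 0): "s3://bucket/orders-0/",
    ("orders", 1): "s3://bucket/orders-1/",
}

# Partition -> serving broker: pure metadata, the only thing reassignment touches.
partition_owner = {
    ("orders", 0): "broker-1",
    ("orders", 1): "broker-2",
}

def reassign(partition, new_broker):
    """Move serving responsibility without copying any log data."""
    partition_owner[partition] = new_broker  # metadata-only update
    # partition_data is untouched: the object-storage prefix does not move.

reassign(("orders", 0), "broker-3")
```

When a broker dies or scales away, only the second mapping has to change; the first one is the durable record.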

| Production case | Starting pressure | Architecture pattern | Public result signal |
| --- | --- | --- | --- |
| Grab | Broker performance pressure, storage usage, and network resource usage | Data engineering platform adopts a diskless broker model | Higher broker performance with lower storage and network resource use |
| JD.com | Storage and network redundancy in core infrastructure | Kubernetes-native Kafka with stateless brokers | Seconds-level elasticity for e-commerce traffic peaks |
| Tencent Cloud EMR | Massive data streams need elastic processing inside EMR | AutoMQ integrated into Tencent Cloud EMR with storage-compute separation | Users process large-scale streams through an elastic architecture |

The table is not meant to rank the customers. It shows why diskless Kafka keeps appearing in unrelated production environments. When the durable log is separated from broker lifecycle, teams get a smaller operational surface and a cleaner way to map Kafka onto cloud-native infrastructure.

Grab: Rebalancing Becomes a Metadata Problem

Grab's case is the cleanest example of the operational pain that appears when Kafka scale meets cloud elasticity. In a traditional Kafka cluster, partition reassignment requires physical data movement. That turns routine capacity changes into events that need planning.

The deeper problem is resource coupling. If a workload needs more storage, a traditional Kafka cluster often grows by adding brokers, even when compute is not the bottleneck. If a workload needs more compute, it may still trigger partition movement because the broker is both the compute endpoint and the storage owner.

AutoMQ's role in the Grab pattern is to keep Kafka-facing applications on familiar protocol semantics while moving storage responsibility away from broker-local disks. The practical result is that reassignment and scaling can become closer to metadata operations than bulk data-copy operations. That is why Grab's adoption is described in terms of improved broker performance and reduced storage and network resource use.

The win is not that Kafka magically becomes lightweight. The win is that brokers no longer need to carry durable partition ownership in the same way, so elastic operations become less disruptive.
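Rough arithmetic makes the difference concrete. The numbers below are made up for illustration, not Grab's figures: the shape of the result is what matters.

```python
# Back-of-envelope sketch with assumed numbers (not Grab's measurements):
# compare moving a partition's retained data versus moving only its metadata.

partition_size_gib = 500           # retained data on the partition
copy_throughput_gib_s = 0.1        # ~100 MiB/s of replication bandwidth

# Traditional Kafka: reassignment must physically copy the retained data.
copy_seconds = partition_size_gib / copy_throughput_gib_s

# Diskless Kafka: data stays in object storage; only ownership metadata moves.
metadata_seconds = 1               # order of a metadata round-trip, not hours

print(f"data-copy reassignment: ~{copy_seconds / 3600:.1f} hours")
print(f"metadata reassignment:  ~{metadata_seconds} second(s)")
```

Under these assumptions a single large partition takes over an hour to copy, and that copy competes with production traffic for the same disks and network. The metadata path has no such contention.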

JD.com: Kubernetes Works Better When Brokers Stop Owning Data

JD.com's case starts from a different infrastructure shape: core e-commerce infrastructure running on Kubernetes. In that environment, shared-nothing Kafka can duplicate durability work across layers while still making scaling depend on data movement.

The storage pattern explains the architectural mismatch. Traditional Kafka commonly uses multiple broker replicas, while the underlying storage layer may also replicate data for durability. Cost then becomes a consequence of two durability systems being stacked on top of each other.
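The stacking effect is easy to quantify. The replication factors below are assumptions for illustration, not JD.com's measurements, but they show why the copies multiply.

```python
# Illustrative arithmetic (assumed replication factors, not JD.com's numbers):
# Kafka replication layered over replicated storage multiplies physical copies.

logical_data_tib = 100
kafka_replication_factor = 3       # broker-level replicas
storage_replication_factor = 3     # e.g. a replicated block-storage layer

# Stacked model: every Kafka replica is itself replicated by the storage layer.
stacked_stored_tib = (
    logical_data_tib * kafka_replication_factor * storage_replication_factor
)

# Diskless model: object storage provides durability once, below the API;
# its internal redundancy is the provider's concern and is billed as one copy.
diskless_stored_tib = logical_data_tib

print(stacked_stored_tib, diskless_stored_tib)
```

With these assumptions, 100 TiB of logical data becomes 900 TiB of physically stored bytes in the stacked model. The exact multiplier varies by storage platform, but the compounding is the mismatch the section describes.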

AutoMQ changes the placement of responsibility. The Kafka-compatible compute layer continues to serve producers and consumers, while durable data is written into shared storage. That reduces the storage and network redundancy that appears when Kafka's own replication is layered over an already durable storage substrate. In public accounts, JD.com is the proof point for seconds-level elasticity during e-commerce traffic peaks.

For Kubernetes teams, the important lesson is the lifecycle model. Stateful Kafka brokers make Kubernetes behave like a careful storage migration system because each pod carries partition ownership. Stateless brokers make the deployment model closer to ordinary cloud-native services.

Tencent Cloud EMR: Streaming Becomes Part of the Data Platform

Tencent Cloud EMR shows a third pattern: diskless Kafka as part of a broader analytics service rather than a standalone Kafka replacement project. EMR compute can scale quickly, but a rigid streaming layer can become the bottleneck between real-time ingestion and downstream analytics. The operational burden shifts from broker capacity alone to the handoff between streams and compute.

AutoMQ is integrated with Tencent Cloud EMR as part of an elastic data platform. This is a different kind of production validation from Grab and JD.com. It is less about a single Kafka cluster getting faster, and more about the streaming layer becoming elastic enough to fit the rest of the data platform.

The architectural point is still the same. Once storage is decoupled from brokers, the streaming layer can align with cloud object storage and table formats more naturally. That reduces a common source of friction: running Kafka as a stateful island beside an otherwise elastic compute and lakehouse environment.

Extracting the Common Patterns

The three cases are different enough to be useful. Grab is a data engineering platform case. JD.com is an e-commerce core infrastructure case. Tencent Cloud EMR is a provider-side integration. If a pattern survives across those environments, it is probably not a one-off implementation trick.

Common pattern extraction matrix

The reusable patterns are the ones platform teams can test against their own environment:

  • Kubernetes elasticity becomes realistic when brokers are stateless. Traditional Kafka can run on Kubernetes, but broker-local state makes scaling and recovery heavy.
  • Storage and network redundancy can be reduced at the architecture layer. If object storage already provides durability, Kafka does not need to duplicate the same responsibility in the same way.
  • Rebalancing shifts from data movement toward metadata movement. This is why Grab's reassignment result is so important. It shows the operational effect of changing where partition data lives.
  • Multi-cloud and provider integration become easier to standardize. Bambu Lab reinforces this pattern through multi-cloud standardization, while Tencent Cloud EMR shows the same architectural substrate fitting into a managed data platform.
  • Cost savings come from removing work, not from hiding work. The strongest results come from reducing redundant storage, redundant traffic, and over-provisioned capacity.

Poizon is also worth mentioning because it validates a different workload profile: observability data at high throughput. Public accounts describe Poizon replacing a 1,280-core observability cluster, handling peak throughput above 40 GiB/s, running for nearly three years, and cutting infrastructure cost by half. That does not mean every observability pipeline will see the same outcome.

What AutoMQ Contributes as the Technical Substrate

AutoMQ's role in these stories is not that every customer adopted the same deployment shape. The common substrate is a Kafka-compatible diskless architecture: the Kafka protocol and ecosystem stay intact, while the storage layer is rebuilt around object storage and a WAL path. That matters because enterprises rarely have the luxury of replacing Kafka clients, connectors, monitoring, schemas, and stream processing jobs all at once.

The architecture has a useful division of labor. Brokers handle Kafka compute and traffic. Object storage holds durable log data. The WAL path absorbs write-path requirements. Operational automation then treats brokers as elastic capacity rather than long-lived storage appliances.
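That division of labor can be sketched as a toy log. This is a simplification, not AutoMQ's implementation: in the sketch, the WAL absorbs the low-latency write path and a background step batches records into durable object-storage segments.

```python
# Minimal sketch of the division of labor (illustrative only, not AutoMQ's
# code): producers are acknowledged off the WAL; object storage holds the
# durable long-term log; brokers become replaceable compute around both.

class DisklessLogSketch:
    def __init__(self):
        self.wal = []            # fast write-ahead log (stands in for a real WAL)
        self.object_store = {}   # durable segments, keyed by object path

    def append(self, record: bytes) -> int:
        """Producer path: acknowledge once the WAL has the record."""
        self.wal.append(record)
        return len(self.wal) - 1          # offset assigned at append time

    def upload_segment(self, topic: str, base_offset: int) -> str:
        """Background path: batch WAL records into an object-storage segment."""
        key = f"{topic}/segment-{base_offset}"
        self.object_store[key] = b"".join(self.wal)
        self.wal.clear()                  # WAL entries reclaimable once uploaded
        return key

log = DisklessLogSketch()
log.append(b"order-created")
log.append(b"order-paid")
key = log.upload_segment("orders", 0)
```

Because the broker instance holds no durable state of its own in this model, losing it loses neither the WAL-acknowledged writes nor the uploaded segments.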

That is also where production case studies beat abstract claims. Grab validates the broker-performance and resource-efficiency pattern. JD.com validates the Kubernetes elasticity pattern. Tencent Cloud EMR validates the managed analytics integration pattern.

A Production Readiness Checklist for Diskless Kafka

A strong production story should make your evaluation sharper, not looser. The wrong conclusion is "these companies use it, so we can skip due diligence." The right conclusion is that the architecture has enough production proof to deserve a serious test plan.

Production readiness checklist

Use the checklist as a starting point:

| Readiness area | What to verify | Why it matters |
| --- | --- | --- |
| Throughput profile | Write rate, read fan-out, catch-up reads, and partition count | Diskless storage changes the bottlenecks; test the workload you actually run |
| Latency envelope | Producer latency, consumer fetch latency, and tail behavior during cold reads or backlog recovery | Some topics tolerate storage efficiency trade-offs; others need stricter write-path choices |
| Failure recovery | Broker loss, AZ failure, object storage errors, WAL recovery, and rolling upgrades | Stateless brokers help only if recovery semantics are proven under controlled faults |
| Cost model | Compute, storage, cross-zone traffic, retention, and scale-in behavior | Savings should come from removed redundancy and elasticity, not spreadsheet assumptions |
| Migration path | Client compatibility, offsets, connectors, processors, and rollback plan | Kafka-compatible does not mean migration planning disappears |
| Operations | Metrics, alerts, capacity policy, Kubernetes behavior, and support boundaries | A simpler storage model still needs mature day-2 operations |
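One way to keep a pilot honest is to treat the checklist as a sign-off gate. The area names below come from the checklist; the pass/fail structure is our own illustration, not a prescribed process.

```python
# Sketch of a pilot sign-off gate built from the checklist above (the gate
# structure is illustrative; adapt the areas and criteria to your environment).

READINESS_AREAS = {
    "throughput profile", "latency envelope", "failure recovery",
    "cost model", "migration path", "operations",
}

def pilot_gaps(verified: set) -> set:
    """Return the checklist areas the pilot has not yet verified."""
    return READINESS_AREAS - verified

# Example: a pilot that benchmarked performance and cost but skipped fault
# drills, migration rehearsal, and day-2 operations.
gaps = pilot_gaps({"throughput profile", "latency envelope", "cost model"})
print(sorted(gaps))
```

An empty result means every area has at least been exercised; a non-empty one names exactly what "skip due diligence" would have skipped.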

The most useful pilots are deliberately boring. Pick workloads with known traffic shape, known pain, and known success criteria. For Grab-like environments, focus on broker efficiency and resource use. For JD.com-like environments, focus on Kubernetes scaling and redundancy. For Tencent-like environments, focus on how streaming data flows into analytics systems.

The Practical Lesson

The production lesson from Grab, JD.com, and Tencent Cloud EMR is not that every Kafka team should run the same topology. The lesson is that the old assumption "Kafka brokers must own durable data" is no longer a requirement for Kafka-compatible production systems.

Once that assumption changes, familiar Kafka problems start to move. Rebalancing becomes less tied to bulk data movement. Kubernetes stops being forced to preserve broker identity at all costs. Storage cost stops compounding across Kafka replicas and storage replicas.

That is the real value of these case studies. They do not ask you to trust a diagram. They show the same architectural bet surviving data engineering, e-commerce infrastructure, and EMR integration. For platform teams evaluating Kafka in production, diskless Kafka has crossed the line from interesting architecture to production-validated option.

If your current Kafka roadmap is mostly about surviving the next rebalance, provisioning for the next traffic peak, or explaining another duplicated storage bill, study the architecture rather than another tuning checklist. Start with the AutoMQ Diskless Engine, read the Grab, JD.com, and Tencent Cloud EMR cases, then test the pattern against your own production constraints.
