
Kafka in Production | Grab, JD.com, Tencent Case Studies

Every large Kafka team eventually asks the same uncomfortable question: is the next scaling problem really a tuning problem, or is it a storage architecture problem? Partition reassignment slows down because data must move. Kubernetes feels awkward because brokers own state. Cloud bills grow because Kafka replicates data across brokers while the cloud storage layer is already doing its own durability work. These are not edge cases; they are the normal pressure points of Kafka in production.

That is why production case studies matter for diskless Kafka. A clean architecture diagram can make stateless brokers and object storage look inevitable, but platform teams need a stronger signal before they move real workloads. Grab, JD.com, and Tencent Cloud EMR are useful examples because they adopted AutoMQ in different environments: data engineering platforms, e-commerce core infrastructure, and a cloud-provider EMR service.

Real-world AutoMQ architecture patterns

The shared pattern is simple to state and hard to retrofit into traditional Kafka: keep the Kafka protocol and ecosystem, but stop treating each broker as the long-term owner of partition data. AutoMQ keeps Kafka compatibility at the API layer while replacing the storage layer with a diskless architecture backed by object storage and a write-ahead log path.

The Production Pattern Behind the Case Studies

Traditional Kafka was designed around shared-nothing brokers. That design made sense when local disks were the primary persistence layer and machine-to-machine replication was the normal way to build durability. In cloud and Kubernetes environments, the same design can create duplicated work: Kafka replicates data between brokers, the storage platform may replicate data again, and operators still have to move partition data whenever capacity changes.

Diskless Kafka changes the ownership model. Brokers remain responsible for Kafka semantics, client traffic, fetch paths, consumer groups, and metadata participation. Object storage becomes the persistent data layer, while broker instances become replaceable compute capacity. That distinction is what connects the three case studies.
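The ownership split can be sketched in a few lines. This is an illustrative model, not AutoMQ's actual data structures: the point is that durable data lives at an object-storage path, while a broker "owns" a partition only through a metadata mapping that is cheap to change.

```python
# Illustrative sketch (not AutoMQ's implementation): partition data and
# partition ownership are tracked separately, so ownership can move without
# the data moving.

# Partition -> object-storage prefix: durable data, independent of any broker.
partition_data = {
    ("orders", 0): "s3://bucket/orders-0/",
    ("orders", 1): "s3://bucket/orders-1/",
}

# Partition -> serving broker: pure metadata, the only thing reassignment touches.
partition_owner = {
    ("orders", 0): "broker-1",
    ("orders", 1): "broker-2",
}

def reassign(partition, new_broker):
    """Move serving responsibility without copying any log data."""
    partition_owner[partition] = new_broker  # metadata-only update
    # partition_data is untouched: the object-storage prefix does not move.

reassign(("orders", 0), "broker-3")
```

When a broker dies or scales away, only the second mapping has to change; the first one is the durable record.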

| Production case | Starting pressure | Architecture pattern | Public result signal |
| --- | --- | --- | --- |
| Grab | Broker performance pressure, storage usage, and network resource usage | Data engineering platform adopts a diskless broker model | Higher broker performance with lower storage and network resource use |
| JD.com | Storage and network redundancy in core infrastructure | Kubernetes-native Kafka with stateless brokers | Seconds-level elasticity for e-commerce traffic peaks |
| Tencent Cloud EMR | Massive data streams need elastic processing inside EMR | AutoMQ integrated into Tencent Cloud EMR with storage-compute separation | Users process large-scale streams through an elastic architecture |

The table is not meant to rank the customers. It shows why diskless Kafka keeps appearing in unrelated production environments. When the durable log is separated from broker lifecycle, teams get a smaller operational surface and a cleaner way to map Kafka onto cloud-native infrastructure.

Grab: Rebalancing Becomes a Metadata Problem

Grab's case is the cleanest example of the operational pain that appears when Kafka scale meets cloud elasticity. In a traditional Kafka cluster, partition reassignment requires physical data movement. That turns routine capacity changes into events that need planning.

The deeper problem is resource coupling. If a workload needs more storage, a traditional Kafka cluster often grows by adding brokers, even when compute is not the bottleneck. If a workload needs more compute, it may still trigger partition movement because the broker is both the compute endpoint and the storage owner.

AutoMQ's role in the Grab pattern is to keep Kafka-facing applications on familiar protocol semantics while moving storage responsibility away from broker-local disks. The practical result is that reassignment and scaling can become closer to metadata operations than bulk data-copy operations. That is why Grab's adoption is described in terms of improved broker performance and reduced storage and network resource use.

The win is not that Kafka magically becomes lightweight. The win is that brokers no longer need to carry durable partition ownership in the same way, so elastic operations become less disruptive.
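Rough arithmetic makes the difference concrete. The numbers below are made up for illustration, not Grab's figures: the shape of the result is what matters.

```python
# Back-of-envelope sketch with assumed numbers (not Grab's measurements):
# compare moving a partition's retained data versus moving only its metadata.

partition_size_gib = 500           # retained data on the partition
copy_throughput_gib_s = 0.1        # ~100 MiB/s of replication bandwidth

# Traditional Kafka: reassignment must physically copy the retained data.
copy_seconds = partition_size_gib / copy_throughput_gib_s

# Diskless Kafka: data stays in object storage; only ownership metadata moves.
metadata_seconds = 1               # order of a metadata round-trip, not hours

print(f"data-copy reassignment: ~{copy_seconds / 3600:.1f} hours")
print(f"metadata reassignment:  ~{metadata_seconds} second(s)")
```

Under these assumptions a single large partition takes over an hour to copy, and that copy competes with production traffic for the same disks and network. The metadata path has no such contention.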

JD.com: Kubernetes Works Better When Brokers Stop Owning Data

JD.com's case starts from a different infrastructure shape: core e-commerce infrastructure running on Kubernetes. In that environment, shared-nothing Kafka can duplicate durability work across layers while still making scaling depend on data movement.

The storage pattern explains the architectural mismatch. Traditional Kafka commonly uses multiple broker replicas, while the underlying storage layer may also replicate data for durability. Cost then becomes a consequence of two durability systems being stacked on top of each other.
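The stacking effect is easy to quantify. The replication factors below are assumptions for illustration, not JD.com's measurements, but they show why the copies multiply.

```python
# Illustrative arithmetic (assumed replication factors, not JD.com's numbers):
# Kafka replication layered over replicated storage multiplies physical copies.

logical_data_tib = 100
kafka_replication_factor = 3       # broker-level replicas
storage_replication_factor = 3     # e.g. a replicated block-storage layer

# Stacked model: every Kafka replica is itself replicated by the storage layer.
stacked_stored_tib = (
    logical_data_tib * kafka_replication_factor * storage_replication_factor
)

# Diskless model: object storage provides durability once, below the API;
# its internal redundancy is the provider's concern and is billed as one copy.
diskless_stored_tib = logical_data_tib

print(stacked_stored_tib, diskless_stored_tib)
```

With these assumptions, 100 TiB of logical data becomes 900 TiB of physically stored bytes in the stacked model. The exact multiplier varies by storage platform, but the compounding is the mismatch the section describes.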

AutoMQ changes the placement of responsibility. The Kafka-compatible compute layer continues to serve producers and consumers, while durable data is written into shared storage. That reduces the storage and network redundancy that appears when Kafka's own replication is layered over an already durable storage substrate. In public accounts, JD.com is the proof point for seconds-level elasticity during e-commerce traffic peaks.

For Kubernetes teams, the important lesson is the lifecycle model. Stateful Kafka brokers make Kubernetes behave like a careful storage migration system because each pod carries partition ownership. Stateless brokers make the deployment model closer to ordinary cloud-native services.

Tencent Cloud EMR: Streaming Becomes Part of the Data Platform

Tencent Cloud EMR shows a third pattern: diskless Kafka as part of a broader analytics service rather than a standalone Kafka replacement project. EMR compute can scale quickly, but a rigid streaming layer can become the bottleneck between real-time ingestion and downstream analytics. The operational burden shifts from broker capacity alone to the handoff between streams and compute.

AutoMQ is integrated with Tencent Cloud EMR as part of an elastic data platform. This is a different kind of production validation from Grab and JD.com. It is less about a single Kafka cluster getting faster, and more about the streaming layer becoming elastic enough to fit the rest of the data platform.

The architectural point is still the same. Once storage is decoupled from brokers, the streaming layer can align with cloud object storage and table formats more naturally. That reduces a common source of friction: running Kafka as a stateful island beside an otherwise elastic compute and lakehouse environment.

Extracting the Common Patterns

The three cases are different enough to be useful. Grab is a data engineering platform case. JD.com is an e-commerce core infrastructure case. Tencent Cloud EMR is a provider-side integration. If a pattern survives across those environments, it is probably not a one-off implementation trick.

Common pattern extraction matrix

The reusable patterns are the ones platform teams can test against their own environment:

  • Kubernetes elasticity becomes realistic when brokers are stateless. Traditional Kafka can run on Kubernetes, but broker-local state makes scaling and recovery heavy.
  • Storage and network redundancy can be reduced at the architecture layer. If object storage already provides durability, Kafka does not need to duplicate the same responsibility in the same way.
  • Rebalancing shifts from data movement toward metadata movement. This is why Grab's reassignment result is so important. It shows the operational effect of changing where partition data lives.
  • Multi-cloud and provider integration become easier to standardize. Bambu Lab reinforces this pattern through multi-cloud standardization, while Tencent Cloud EMR shows the same architectural substrate fitting into a managed data platform.
  • Cost savings come from removing work, not from hiding work. The strongest results come from reducing redundant storage, redundant traffic, and over-provisioned capacity.

Poizon is also worth mentioning because it validates a different workload profile: observability data at high throughput. Public accounts describe Poizon replacing a 1,280-core observability cluster, handling peak throughput above 40 GiB/s, running for nearly three years, and cutting infrastructure cost by half. That does not mean every observability pipeline will see the same outcome.

What AutoMQ Contributes as the Technical Substrate

AutoMQ's role in these stories is not that every customer adopted the same deployment shape. The common substrate is a Kafka-compatible diskless architecture: the Kafka protocol and ecosystem stay intact, while the storage layer is rebuilt around object storage and a WAL path. That matters because enterprises rarely have the luxury of replacing Kafka clients, connectors, monitoring, schemas, and stream processing jobs all at once.

The architecture has a useful division of labor. Brokers handle Kafka compute and traffic. Object storage holds durable log data. The WAL path absorbs write-path requirements. Operational automation then treats brokers as elastic capacity rather than long-lived storage appliances.
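That division of labor can be sketched as a toy log. This is a simplification, not AutoMQ's implementation: in the sketch, the WAL absorbs the low-latency write path and a background step batches records into durable object-storage segments.

```python
# Minimal sketch of the division of labor (illustrative only, not AutoMQ's
# code): producers are acknowledged off the WAL; object storage holds the
# durable long-term log; brokers become replaceable compute around both.

class DisklessLogSketch:
    def __init__(self):
        self.wal = []            # fast write-ahead log (stands in for a real WAL)
        self.object_store = {}   # durable segments, keyed by object path

    def append(self, record: bytes) -> int:
        """Producer path: acknowledge once the WAL has the record."""
        self.wal.append(record)
        return len(self.wal) - 1          # offset assigned at append time

    def upload_segment(self, topic: str, base_offset: int) -> str:
        """Background path: batch WAL records into an object-storage segment."""
        key = f"{topic}/segment-{base_offset}"
        self.object_store[key] = b"".join(self.wal)
        self.wal.clear()                  # WAL entries reclaimable once uploaded
        return key

log = DisklessLogSketch()
log.append(b"order-created")
log.append(b"order-paid")
key = log.upload_segment("orders", 0)
```

Because the broker instance holds no durable state of its own in this model, losing it loses neither the WAL-acknowledged writes nor the uploaded segments.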

That is also where production case studies beat abstract claims. Grab validates the broker-performance and resource-efficiency pattern. JD.com validates the Kubernetes elasticity pattern. Tencent Cloud EMR validates the managed analytics integration pattern.

A Production Readiness Checklist for Diskless Kafka

A strong production story should make your evaluation sharper, not looser. The wrong conclusion is "these companies use it, so we can skip due diligence." The right conclusion is that the architecture has enough production proof to deserve a serious test plan.

Production readiness checklist

Use the checklist as a starting point:

| Readiness area | What to verify | Why it matters |
| --- | --- | --- |
| Throughput profile | Write rate, read fan-out, catch-up reads, and partition count | Diskless storage changes the bottlenecks; test the workload you actually run |
| Latency envelope | Producer latency, consumer fetch latency, and tail behavior during cold reads or backlog recovery | Some topics tolerate storage efficiency trade-offs; others need stricter write-path choices |
| Failure recovery | Broker loss, AZ failure, object storage errors, WAL recovery, and rolling upgrades | Stateless brokers help only if recovery semantics are proven under controlled faults |
| Cost model | Compute, storage, cross-zone traffic, retention, and scale-in behavior | Savings should come from removed redundancy and elasticity, not spreadsheet assumptions |
| Migration path | Client compatibility, offsets, connectors, processors, and rollback plan | Kafka-compatible does not mean migration planning disappears |
| Operations | Metrics, alerts, capacity policy, Kubernetes behavior, and support boundaries | A simpler storage model still needs mature day-2 operations |
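One way to keep a pilot honest is to treat the checklist as a sign-off gate. The area names below come from the checklist; the pass/fail structure is our own illustration, not a prescribed process.

```python
# Sketch of a pilot sign-off gate built from the checklist above (the gate
# structure is illustrative; adapt the areas and criteria to your environment).

READINESS_AREAS = {
    "throughput profile", "latency envelope", "failure recovery",
    "cost model", "migration path", "operations",
}

def pilot_gaps(verified: set) -> set:
    """Return the checklist areas the pilot has not yet verified."""
    return READINESS_AREAS - verified

# Example: a pilot that benchmarked performance and cost but skipped fault
# drills, migration rehearsal, and day-2 operations.
gaps = pilot_gaps({"throughput profile", "latency envelope", "cost model"})
print(sorted(gaps))
```

An empty result means every area has at least been exercised; a non-empty one names exactly what "skip due diligence" would have skipped.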

The most useful pilots are deliberately boring. Pick workloads with known traffic shape, known pain, and known success criteria. For Grab-like environments, focus on broker efficiency and resource use. For JD.com-like environments, focus on Kubernetes scaling and redundancy. For Tencent-like environments, focus on how streaming data flows into analytics systems.

The Practical Lesson

The production lesson from Grab, JD.com, and Tencent Cloud EMR is not that every Kafka team should run the same topology. The lesson is that the old assumption "Kafka brokers must own durable data" is no longer a requirement for Kafka-compatible production systems.

Once that assumption changes, familiar Kafka problems start to move. Rebalancing becomes less tied to bulk data movement. Kubernetes stops being forced to preserve broker identity at all costs. Storage cost stops compounding across Kafka replicas and storage replicas.

That is the real value of these case studies. They do not ask you to trust a diagram. They show the same architectural bet surviving data engineering, e-commerce infrastructure, and EMR integration. For platform teams evaluating Kafka in production, diskless Kafka has crossed the line from interesting architecture to production-validated option.

If your current Kafka roadmap is mostly about surviving the next rebalance, provisioning for the next traffic peak, or explaining another duplicated storage bill, study the architecture rather than another tuning checklist. Start with the AutoMQ Diskless Engine, read the Grab, JD.com, and Tencent Cloud EMR cases, then test the pattern against your own production constraints.
