Kafka on Kubernetes at 40 GiB/s: How JD.com Reduced Replication Waste with AutoMQ

Running Kafka on Kubernetes often looks cloud-native from the outside: brokers are packaged as containers, scheduled as pods, monitored through the same platform, and managed with the same deployment workflow. The harder question appears when the workload grows: is the storage model as elastic as the scheduler around it? For JD.com, that question mattered because its real-time streaming platform was not a side system. It served more than 1,400 business lines, processed trillion-scale records every day, and had to absorb e-commerce traffic spikes during events such as 618 and Double 11, according to the public JD.com customer case.

JD.com's case is useful because it does not frame Kubernetes as a magic answer to Kafka operations. The team was already moving core infrastructure from physical machines to Kubernetes, but Apache Kafka® carried a storage assumption from the broker-local disk era: each broker owns local state, and durability is achieved through inter-broker replication. Put that model on top of a reliable shared storage layer, and Kubernetes can orchestrate pods while Kafka still asks the platform to move a lot of data. That is the Kubernetes paradox for Kafka teams: the control plane becomes elastic before the data plane does.

Running Kafka on Kubernetes is not the same as cloud-native Kafka

Kubernetes is good at making compute elastic. A HorizontalPodAutoscaler can adjust a scalable workload based on metrics such as CPU, memory, or custom signals. That model works well when a pod can be replaced without carrying durable local state with it. Web services, stateless APIs, and many stream processors benefit because the scheduler can add or remove capacity without negotiating the physical location of historical data.

Kafka is different because its storage layout is part of the broker identity. In traditional Kafka, partitions have replicas on brokers, and those replicas exist as local log segments. When a cluster expands, shrinks, or rebalances, the system often needs to copy partition data across brokers so that placement matches the desired topology. That behavior fits servers with attached disks and deliberate capacity changes better than frequent autoscaling.

The mismatch grows sharper in private-cloud environments that already provide durable shared storage. JD.com used CubeFS, described in the customer case as its S3-compatible storage foundation. CubeFS publicly describes support for S3-compatible object access, which makes it a plausible shared-storage substrate for cloud-native data systems. If Kafka keeps 3 ISR replicas and the storage layer also keeps 3 durable copies, a single logical record can expand into 9 physical copies. The issue is not that either layer is irrational; the issue is that both layers are trying to solve durability at the same time.

That double-redundancy pattern created three practical pressures for JD.com's platform team:

Storage footprint grew beyond what the logical data size suggested.
Network bandwidth was spent on replication traffic that did not directly serve producers or consumers. At peak scale, that replication tax matters because every unnecessary copy competes with business traffic.
Scaling remained operationally heavy because brokers were still stateful. Adding capacity was not the same as adding compute; it also meant waiting for data movement and partition reassignment.

For teams searching for "Kafka on Kubernetes at scale," this is the architectural lesson hiding inside the JD.com story. Containerizing Kafka changes how brokers are deployed. It does not automatically change the fact that Kafka's storage is bound to brokers.

JD.com's real-time platform scale

JD.com is a useful customer story because the numbers are large enough to expose hidden costs that smaller clusters can tolerate. The public AutoMQ case lists more than 200 AutoMQ pods, 250 TB of data under management, and 40 GiB/s peak throughput during 618 and Double 11 events. At that scale, a platform team cannot treat broker expansion and partition movement as occasional background work. Capacity must arrive when the business needs it, and unused peak capacity becomes expensive when the platform has to reserve it long before the surge.

Production anchor	Publicly stated JD.com case detail	Why it matters
Peak throughput	40 GiB/s	Replication overhead becomes a material bandwidth concern.
AutoMQ footprint	200+ pods	Kubernetes operations must work at fleet scale, not as a lab pattern.
Data managed	250 TB	Storage efficiency affects real infrastructure planning.
Storage copies	9 to 3	Durability moved from double replication to storage-layer responsibility.
Bandwidth cost	33%+ reduction	Removing redundant replication had a measurable network impact.

The table is deliberately limited to public claims. It does not assume JD.com's internal topic count, partition layout, client mix, or exact migration timeline beyond what the customer page states. For engineers, that restraint matters. The credibility comes from a clear mechanism connected to approved facts, not from filling gaps with invented operational drama.

Where replication waste came from

The cleanest way to understand the waste is to separate logical durability from physical copies. A Kafka topic with replication factor 3 gives the application-level log three broker-side replicas, which protects against broker and disk failure when brokers own their disks. A distributed storage layer with its own replication also protects data by writing multiple copies below the application. Put one on top of the other, and the platform can end up paying for both.

In JD.com's environment, the public case describes that combination as Kafka's 3-replica ISR mechanism on top of CubeFS' internal 3-replica consistency. The resulting footprint was 3 × 3 = 9 actual copies for a single write. Moving to AutoMQ changed the ownership boundary: AutoMQ treated CubeFS as the source of durability and removed inter-broker replication from the steady-state storage path. The result was not "less durable Kafka" in the story JD.com published; it was durability handled at the storage layer instead of duplicated at the broker layer.

This distinction is also where shared storage differs from tiered storage. Tiered storage can reduce how much historical data remains on primary broker disks, but the primary Kafka log and broker-local replica layout still exist. Scaling and reassignment still have to care about which broker owns which partition data. AutoMQ's Shared Storage architecture replaces Kafka's local log storage with S3Stream and stores data in shared storage, which makes brokers stateless from an operations perspective.

That shift matters more than the storage medium itself. Object storage is not sprinkled under Kafka as a lower-cost archive; it becomes the durable data plane. Brokers keep serving the Kafka protocol, but they no longer need to be the long-term owners of the data they serve. Kubernetes operations then start to look much closer to the way platform teams expect compute to behave.

Shared storage and stateless brokers in production

JD.com did not need a Kafka-like system that required every business line to rewrite clients. The public case states that AutoMQ preserved Kafka protocol compatibility, which mattered because the platform served more than 1,400 applications. The storage layer could change, but the application contract had to stay stable enough for the organization to adopt it.

AutoMQ's role in the JD.com story was to keep that Kafka-facing contract while changing where state lives. Its stateless broker documentation describes the storage-compute separation model: broker nodes become stateless because Kafka log storage is offloaded through S3Stream. The S3 storage documentation further explains that shared storage enables partition reassignment without data duplication, which is the mechanism behind faster scaling and traffic balancing.

In JD.com's production case, this showed up as seconds-level elasticity through Kubernetes HPA and broker monitoring metrics. That statement is narrower and more useful than a generic claim that "Kafka became cloud-native." The scaling path changed from data movement to metadata-oriented operations because brokers no longer carried durable local partition state. HPA could add pods, AutoMQ brokers could join as compute capacity, and partition placement could adjust without waiting for large log copies to settle first.

There are still engineering questions a team should ask before applying this pattern to its own platform:

What storage system is the source of durability, and what durability model does it provide?
Which metrics should drive scale-out and scale-in decisions for brokers: CPU, network, throughput, partition load, or custom broker-level signals?
How will traffic rebalancing interact with producer and consumer behavior during peak events?
Which operational workflows depend on Kafka's traditional broker-local storage assumptions?

Those questions do not weaken the case for shared storage. They make the case more practical. JD.com's lesson is not that every Kafka-on-Kubernetes deployment should copy its exact environment; storage ownership has to be designed intentionally when Kubernetes becomes the operating layer.

Results: elasticity, network cost, and operational simplification

The public results map directly back to the original pain. Storage efficiency improved because the platform reduced the replica footprint from 9 copies to 3 copies. Network bandwidth cost dropped by more than 33%, which the case connects to removing unnecessary replication traffic and reducing the need for heavy over-provisioning. Scaling operations that previously took hours could complete in seconds because partition reassignment no longer required copying broker-local data.

The most interesting outcome is not any single number. It is the way the numbers reinforce the same architectural point. A 40 GiB/s peak workload stresses the replication path. A 250 TB managed dataset stresses storage footprint. A 200+ pod deployment stresses Kubernetes operations. JD.com's case sits at the intersection of all three, which is why it is more persuasive than a generic "run Kafka on Kubernetes" tutorial.

Lessons for private-cloud Kafka teams

JD.com's story is especially relevant for private-cloud teams because many of them already run durable storage platforms beneath their application stack. The same pattern appears with S3-compatible object stores, distributed file systems, and storage systems that provide their own replication. When Kafka sits above those systems with application-level replication unchanged, the platform may be paying twice for durability.

For teams evaluating AutoMQ, the JD.com case offers a concrete decision pattern:

If Kubernetes is your operating layer, ask whether brokers can be scaled like compute.
If shared storage already provides durability, ask whether Kafka replication is duplicating that responsibility.
If peak traffic drives capacity planning, ask whether scaling requires moving historical data or updating metadata.
If many applications depend on Kafka, ask whether the migration preserves the Kafka protocol and ecosystem surface.

AutoMQ is not a shortcut around architecture work. It is a storage-architecture change that made JD.com's Kafka-compatible streaming platform better aligned with the infrastructure the company was already building. That is why the case is useful: the product decision followed the platform strategy, not the other way around.

The next time a Kafka-on-Kubernetes plan looks complete because the brokers run as pods, check where the data lives. JD.com's experience shows that the real cloud-native boundary is not the container image. It is the point where storage stops anchoring compute in place.

FAQ

What problem did JD.com solve with AutoMQ?

JD.com used AutoMQ to address storage and network redundancy created by running Kafka-style broker replication on top of CubeFS, a reliable S3-compatible storage layer. The public case reports that this reduced the data footprint from 9 copies to 3 copies and helped lower network bandwidth cost by more than 33%.

Why is Kafka on Kubernetes still hard at large scale?

Kubernetes can scale pods, but traditional Kafka brokers own local partition data. When scaling requires moving that data between brokers, adding pods is not enough. Large clusters still face partition reassignment, rebalance time, disk planning, and replication traffic.

How did AutoMQ make brokers more compatible with Kubernetes operations?

AutoMQ separates compute from storage and offloads Kafka log storage to shared storage through S3Stream. In the JD.com case, this meant broker pods could act more like stateless compute units, enabling seconds-level elasticity through Kubernetes HPA and broker metrics.

Did JD.com need to rewrite Kafka applications?

The public case states that AutoMQ remained compatible with the Kafka protocol and allowed more than 1,400 applications to migrate without application code changes. That compatibility was important because JD.com's streaming platform served many internal business lines.

Is shared storage the same as Kafka tiered storage?

No. Tiered storage adds a secondary storage layer while Kafka's primary broker-local log and ISR replication model remain in place. AutoMQ's shared storage model replaces the local log storage layer so data is not bound to broker disks, which changes how reassignment and scaling work.

Kafka on Kubernetes at 40 GiB/s: How JD.com Reduced Replication Waste with AutoMQ

Running Kafka on Kubernetes is not the same as cloud-native Kafka

JD.com's real-time platform scale

Where replication waste came from

Shared storage and stateless brokers in production

Results: elasticity, network cost, and operational simplification

Lessons for private-cloud Kafka teams

FAQ

What problem did JD.com solve with AutoMQ?

Why is Kafka on Kubernetes still hard at large scale?

How did AutoMQ make brokers more compatible with Kubernetes operations?

Did JD.com need to rewrite Kafka applications?

Is shared storage the same as Kafka tiered storage?

Trusted by teams running Kafka at scale

Grab

Tencent

LG U+

Kafka on Kubernetes at 40 GiB/s: How JD.com Reduced Replication Waste with AutoMQ

Running Kafka on Kubernetes is not the same as cloud-native Kafka

JD.com's real-time platform scale

Where replication waste came from

Shared storage and stateless brokers in production

Results: elasticity, network cost, and operational simplification

Lessons for private-cloud Kafka teams

FAQ

What problem did JD.com solve with AutoMQ?

Why is Kafka on Kubernetes still hard at large scale?

How did AutoMQ make brokers more compatible with Kubernetes operations?

Did JD.com need to rewrite Kafka applications?

Is shared storage the same as Kafka tiered storage?

Trusted by teams running Kafka at scale

Grab

Tencent

LG U+

Newsletter