Diskless Kafka in Production: 6 Real-World Case Studies

The hard question about diskless Kafka is not whether the architecture is elegant. It is whether an SRE can be paged for it at 3 a.m., trust the failure model, and defend the decision in a post-incident review.

That is where many architecture discussions become too abstract. Object storage sounds attractive because it changes the cost curve for Kafka retention and removes the worst parts of broker-local disks. But Kafka is usually not a side project. It sits behind observability pipelines, ecommerce events, billing records, device telemetry, and data lake ingestion. A storage redesign has to survive traffic spikes, upgrades, broker replacement, partition movement, consumer replay, and the ordinary messiness of production.

Diskless Kafka is past the "interesting lab architecture" stage. The more useful question is narrower: which production patterns prove that it is ready, and what should a platform team verify before moving a serious workload?

[Figure: Production case study matrix]

What Production-Ready Means for Kafka Infrastructure

"Production-ready" does not mean a vendor can run a benchmark or provision a demo cluster. Kafka owns both hot-path ingestion and retained history, so the bar is higher. A credible diskless Kafka implementation should be tested against at least five conditions.

  • Kafka compatibility. Existing producers, consumers, connectors, stream processors, security controls, and observability tools should keep working without an application rewrite.
  • Durable write path. A diskless broker cannot mean volatile data. The design needs a clear write-ahead log and recovery model, plus an object storage path that makes persistent data independent from broker-local disks.
  • Operational recovery. Broker failure, node replacement, partition reassignment, and rolling upgrades should avoid large data-copy loops. This is where stateless brokers matter most.
  • Elastic economics. The architecture should let compute and storage scale separately. If the team still has to provision peak compute for retained data, the storage redesign has not solved the cloud operating problem.
  • Evidence under real workloads. Production evidence should cover different workload types, not only one friendly analytics pipeline. Observability, ecommerce peaks, telecom logs, data lake ingestion, and multi-cloud IoT each stress different parts of the system.

Traditional Kafka made a sensible trade-off for the data center era: brokers own local disks, partitions are tied to brokers, and replication is handled inside the Kafka cluster. In the cloud, that model often duplicates what the cloud already provides. A single message can be replicated by Kafka across brokers, stored on redundant block volumes, and moved again during reassignment. Diskless Kafka attacks that duplication by making object storage the durable data layer and brokers the compute layer.

That does not make every implementation equivalent. Some systems are Kafka-compatible but not built from Apache Kafka. Some rely only on object storage for the hot path. Some add a WAL layer for lower-latency persistence and recovery. Some are managed-only, while others support BYOC or self-managed deployment. Production readiness depends on those details.

AutoMQ is relevant because its public materials describe a Kafka-compatible platform that reuses the Apache Kafka compute layer while replacing the storage layer with S3Stream, a shared streaming storage engine built on object storage and a WAL path. Its documentation also describes stateless brokers, partition reassignment that completes in seconds, and BYOC environments where both control plane and data plane can run inside the customer's cloud account or VPC.

Six Real-World Diskless Kafka Deployments

The following cases are not interchangeable logo references. Each one answers a different production-readiness question: peak throughput, operational risk, Kubernetes fit, existing observability tools, cloud-provider integration, or multi-cloud standardization.

| Company | Production workload | Public scale or result | What it proves |
|---|---|---|---|
| Grab | Data engineering platform for lake ingestion | Reassignment reduced from 6+ hours to under 1 minute; 3x efficiency gain | Stateless brokers can remove data-copy bottlenecks from cluster operations |
| JD.com | Ecommerce real-time data platform on Kubernetes | 40 GiB/s peak throughput in the customer page; 200+ AutoMQ pods; 250 TB data under management | Diskless Kafka can support high-throughput retail peaks on Kubernetes |
| Poizon | Observability pipeline for flash-sale traffic | 40 GiB/s observability peaks; around 50% cost reduction; up to 85% cold-data storage savings | Elastic compute plus object storage fits bursty log and metrics workloads |
| Tencent Cloud EMR | Cloud provider EMR service | First-party EMR integration; under 2-minute cluster provisioning | Diskless Kafka can be productized inside a managed big-data platform |
| LG U+ | Telecom log pipeline on AWS ECS | 2.2B daily messages; compatibility with Fluentd, Sumo Logic, and OpenSearch | Stateless Kafka can fit container platforms and existing observability stacks |
| Bambu Lab | Multi-cloud IoT and platform streaming | AWS and GCP unified architecture; 50% infrastructure cost reduction | A shared architecture can reduce multi-cloud Kafka fragmentation |

Grab: Production Agility Is a Reassignment Problem

Grab's public case is a useful starting point because the pain was not only cost. The Coban team runs a real-time data streaming platform that serves as an entry point into the company's data lake. In the legacy Kafka architecture, partition reassignment could run for 6+ hours, consuming network and disk I/O that the production workload also needed.

AutoMQ changed the operation from data movement to metadata movement. The published case reports reassignment under 1 minute and a 3x improvement in single-core throughput and overall cost efficiency. The important lesson is that adding compute capacity no longer implies copying large retained logs between brokers.
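
To see why copy-based reassignment stretches into hours, a back-of-the-envelope estimate helps. The sketch below is only an illustration: the function name, the 10 TiB data volume, and the 500 MiB/s throttle are hypothetical inputs, not figures from the Grab case.

```python
def copy_reassignment_hours(data_gib: float, throttle_mib_s: float) -> float:
    """Hours needed to copy `data_gib` of partition data at a fixed throttle."""
    if throttle_mib_s <= 0:
        raise ValueError("throttle must be positive")
    seconds = (data_gib * 1024) / throttle_mib_s  # GiB -> MiB, then MiB / (MiB/s)
    return seconds / 3600


# Hypothetical inputs: 10 TiB of retained partition data, with replication
# throttled to 500 MiB/s so the copy does not starve production traffic.
hours = copy_reassignment_hours(data_gib=10 * 1024, throttle_mib_s=500)
print(f"copy-based reassignment: about {hours:.1f} h")  # about 5.8 h
```

The copy time grows linearly with retained data, which is why long retention and fast scaling pull against each other on broker-local disks. A metadata-only reassignment has no term that depends on data volume.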

That matters for cloud cost as well. Once brokers stop owning durable data, they become better candidates for elastic compute strategies. The Grab story notes interest in Spot Instances, which would be much harder to justify when losing a broker also means recovering or moving local partition data.

JD.com: Kubernetes Becomes Real When Brokers Stop Owning Disks

JD.com's case stresses the architecture in a different way. The public customer page describes a streaming platform serving more than 1,400 business lines and handling 40 GiB/s peak throughput during major ecommerce events. It also describes 200+ AutoMQ pods, 250 TB of data under management, and Kubernetes-based operation.

The core problem was double redundancy. JD.com uses CubeFS as an S3-compatible storage layer with its own replication. Traditional Kafka then added ISR replication on top, multiplying physical copies and network traffic. AutoMQ's shared storage model let JD.com rely on the storage layer for durability while reducing broker-level replication overhead. The customer page reports a 66% reduction in storage footprint by reducing replicas from 9 to 3, plus a 33%+ reduction in network bandwidth costs.
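
The replica arithmetic can be sketched in a few lines. The customer page reports the 9-to-3 change; the split into a Kafka replication factor of 3 on top of 3 storage-layer replicas is an assumption made here for illustration, and the function name is hypothetical.

```python
def physical_copies(kafka_replication_factor: int, storage_replicas: int) -> int:
    """Physical copies of each message when both layers replicate independently."""
    return kafka_replication_factor * storage_replicas


# Assumed split: Kafka ISR replication of 3 stacked on 3 storage-layer replicas.
before = physical_copies(kafka_replication_factor=3, storage_replicas=3)
# With shared storage, the broker layer keeps a single logical copy.
after = physical_copies(kafka_replication_factor=1, storage_replicas=3)

reduction = 1 - after / before
print(f"{before} -> {after} copies, {reduction:.1%} less raw storage")
# 9 -> 3 copies, 66.7% less raw storage
```

Under that assumption, the result lines up with the roughly 66% storage reduction the customer page reports.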

For platform teams, the Kubernetes point may be the more durable lesson. Stateful Kafka can run on Kubernetes, but disks, node affinity, pod disruption, and rebalance operations all need special handling. A stateless broker model makes the Horizontal Pod Autoscaler (HPA) and pod replacement realistic tools for a Kafka-compatible service.

Poizon: Observability Workloads Need Elasticity More Than Perfect Symmetry

Observability is one of the cleanest production fits for diskless Kafka because the workload is large, bursty, and retention-heavy. Poizon's public case describes flash-sale observability peaks at 40 GiB/s, around 50% cost reduction from elastic scaling, and up to 85% savings on cold data through object storage tiering.

Observability pipelines punish static capacity planning. During normal periods, a cluster sized for flash-sale logs wastes compute. During peak events, a cluster sized for average traffic becomes an incident amplifier. Traditional Kafka teams often answer this with over-provisioning because scaling down is risky and scaling up can involve slow reassignment.

Diskless Kafka gives the platform another control surface. Compute can expand for the ingestion spike, while retained data can live in object storage with a different cost profile. Logs, metrics, and traces often tolerate a different latency envelope from transaction processing, but they generate enough volume that storage and network duplication become painful quickly.
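
The gap between peak-sized and elastic capacity can be estimated with a simple broker-hours comparison. This is a minimal sketch with a hypothetical traffic profile and per-broker capacity; none of the numbers come from the Poizon case, and the function names are made up for illustration.

```python
import math


def brokers_needed(throughput: float, per_broker: float) -> int:
    """Brokers required to serve `throughput`, given per-broker capacity."""
    return max(1, math.ceil(throughput / per_broker))


def broker_hours(hourly_throughput: list[float], per_broker: float, elastic: bool) -> int:
    """Total broker-hours for a day, sized per hour (elastic) or for the peak."""
    if elastic:
        return sum(brokers_needed(t, per_broker) for t in hourly_throughput)
    return brokers_needed(max(hourly_throughput), per_broker) * len(hourly_throughput)


# Hypothetical profile in GiB/s: 20 quiet hours at 10, a 4-hour burst at 40.
profile = [10.0] * 20 + [40.0] * 4
static_hours = broker_hours(profile, per_broker=1.0, elastic=False)
elastic_hours = broker_hours(profile, per_broker=1.0, elastic=True)
print(static_hours, elastic_hours)  # 960 360
```

The sketch ignores scale-up lag, safety headroom, and minimum cluster size, so real savings will be smaller than the raw ratio suggests. It is useful mainly for checking whether a workload is spiky enough for elasticity to matter.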

Tencent Cloud EMR: Productization Is a Different Kind of Proof

Tencent Cloud EMR is proof that a cloud provider can integrate AutoMQ as a first-party service inside a broader data platform. The public case describes native selection from the EMR console, under 2-minute AutoMQ cluster provisioning, Tencent Cloud Object Storage as the storage layer, and Table Topics for querying streams as Iceberg tables.

That matters because production readiness is also about lifecycle integration. A service embedded in EMR has to fit provisioning, security, VPC isolation, monitoring, billing, and the expectations of teams that use Spark, Flink, Trino, and lakehouse tools.

The interesting pattern here is stream-to-lake convergence. Diskless Kafka already places durable data in object storage. When the platform can expose streams as queryable table data, it reduces connector and ETL infrastructure between Kafka and the data lake.

LG U+: Existing Toolchains Are Part of the Production Surface

LG U+ is a useful counterweight to pure throughput stories. The public case describes 2.2 billion log messages per day on AWS ECS, with AutoMQ maintaining compatibility with Fluentd, Sumo Logic, and OpenSearch. For a telecom log pipeline, that compatibility is the difference between changing the streaming foundation and forcing a redesign of upstream and downstream integrations.

The deployment environment also matters. AWS ECS is built around stateless service management, rolling updates, and task replacement. AutoMQ's stateless broker design let LG U+ treat the Kafka-compatible layer closer to the rest of its cloud-native services, while S3 handled persistent storage.

The lesson is simple: a storage redesign should reduce operational uniqueness, not introduce a new island. If a diskless Kafka system requires proprietary clients, unusual observability paths, or a separate operations model, migration risk moves rather than disappears.

Bambu Lab: Multi-Cloud Kafka Needs One Operating Model

Bambu Lab's public case highlights a problem that rarely shows up in single-cluster benchmarks: multi-cloud operational drift. The company operates across AWS and Google Cloud, and the customer page describes AutoMQ as a way to standardize streaming architecture across clouds with stateless brokers, Kubernetes-native scaling, and 100% Kafka API compatibility. It also reports a 50% reduction in Kafka infrastructure costs and scaling in seconds.

Multi-cloud Kafka is easy to draw and hard to operate. Each cloud's managed Kafka service, storage primitive, network model, and upgrade workflow can create a different playbook. That becomes expensive when platform teams need consistent incident response, capacity planning, and deployment automation.

Diskless Kafka does not remove cloud differences, but it can narrow them. If the durable layer is object storage and brokers are stateless compute, the platform team can build a more uniform model around Kubernetes, object buckets, IAM, metrics, and Kafka-compatible APIs.

What These Deployments Have in Common

The six cases span different regions, industries, and deployment styles, but the production pattern is consistent.

[Figure: Diskless Kafka pattern extraction]

First, the winning workloads expose the weakness of broker-owned storage. Grab and JD.com needed faster reassignment and scaling. Poizon needed to absorb observability bursts without sizing the whole platform for the worst minute of the day. LG U+ needed cloud-native task replacement and rolling operations. Bambu Lab needed a consistent multi-cloud platform. In each case, the pain is not "Kafka is bad." The pain is that local disk ownership makes cloud operations heavier than they need to be.

Second, the cases do not treat object storage as a cold archive bolted onto Kafka. Tiered storage helps reduce long-term retention cost, but the broker still owns the hot log and still moves data during many operational changes. Diskless Kafka changes the source of truth. AutoMQ's documentation describes S3 storage as the actual data location, with the WAL used for write acceleration and fault recovery. That is why partition reassignment can become metadata-only rather than a data-copy operation.
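
A toy model makes the metadata-only idea concrete. The classes below are hypothetical and deliberately simplified: they are not AutoMQ's implementation, and they leave out the WAL, caching, and fencing. They only show that when record data lives in shared storage, moving a partition is a change to an ownership map.

```python
class SharedStore:
    """Stand-in for object storage: the only place record data is kept."""

    def __init__(self) -> None:
        self.segments: dict[str, list[bytes]] = {}

    def append(self, partition: str, record: bytes) -> None:
        self.segments.setdefault(partition, []).append(record)

    def read(self, partition: str) -> list[bytes]:
        return list(self.segments.get(partition, []))


class Cluster:
    """Brokers hold no data; the cluster only tracks who serves each partition."""

    def __init__(self, store: SharedStore, brokers: list[str]) -> None:
        self.store = store
        self.brokers = set(brokers)
        self.owner: dict[str, str] = {}

    def assign(self, partition: str, broker: str) -> None:
        if broker not in self.brokers:
            raise ValueError(f"unknown broker: {broker}")
        self.owner[partition] = broker

    def produce(self, partition: str, record: bytes) -> None:
        if partition not in self.owner:
            raise KeyError(f"unassigned partition: {partition}")
        self.store.append(partition, record)

    def reassign(self, partition: str, new_broker: str) -> int:
        """Change ownership and return the bytes copied between brokers."""
        self.assign(partition, new_broker)
        return 0  # nothing to copy: the data never lived on the old broker


store = SharedStore()
cluster = Cluster(store, brokers=["broker-1", "broker-2"])
cluster.assign("orders-0", "broker-1")
for i in range(1000):
    cluster.produce("orders-0", f"event-{i}".encode())

copied = cluster.reassign("orders-0", "broker-2")
print(cluster.owner["orders-0"], copied, len(store.read("orders-0")))
# broker-2 0 1000
```

In a broker-local design, `reassign` would have to stream every retained record to the new owner before the move could complete.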

Third, compatibility is part of the architecture, not a footnote. Grab integrated with Strimzi workflows. JD.com had to support a large application estate. LG U+ kept Fluentd, Sumo Logic, and OpenSearch integrations. Bambu Lab kept Kafka APIs while standardizing across clouds. A diskless Kafka migration has to make the storage layer less visible to application teams, not more visible.

Fourth, the public evidence is not all the same kind. Some cases publish exact throughput numbers. Some publish cost reductions. Tencent Cloud EMR publishes product integration and provisioning behavior. A serious evaluation should match each proof point to its own workload.

When Diskless Kafka Is a Fit

Diskless Kafka is a strong candidate when the current Kafka pain is caused by one or more of these operating facts:

  • Retained data volume is large enough that replicated broker-local disks dominate the cost model.
  • Scaling brokers is slow because reassignment copies too much partition data.
  • Traffic is spiky, and the cluster is sized for peaks that happen only part of the time.
  • Kubernetes or container operations are standard everywhere except Kafka.
  • Multi-cloud or BYOC requirements make managed Kafka fragmentation expensive.
  • Long retention, observability, or stream-to-lake workloads make object storage economics attractive.

The fit is less clear-cut when the workload has unusual latency constraints, runs at the edge without object storage, or depends on Kafka features the target platform has not verified. In those cases, validate with your own producer and consumer mix, replay behavior, retention pattern, connector set, failure drills, and cost model.

For AutoMQ specifically, the evaluation should include the WAL option. The documentation describes S3 WAL for the open-source path and lower-latency WAL options for production deployments that need tighter latency.

Production-Readiness Checklist

[Figure: Production readiness checklist]

Use this checklist before moving a meaningful Kafka workload to any diskless Kafka platform.

| Readiness area | What to verify | Why it matters |
|---|---|---|
| Client and ecosystem compatibility | Producers, consumers, Kafka Connect, Flink, Kafka Streams, Schema Registry paths, ACLs, and monitoring tools | Migration risk usually hides in the ecosystem, not in basic produce and consume tests |
| Write durability and recovery | WAL behavior, object storage commit path, broker crash recovery, AZ failure assumptions | "Diskless" should mean no broker-local persistence, not weaker durability |
| Scaling and reassignment | Scale-out, scale-in, partition reassignment, hotspot handling, and rollback behavior | This is where diskless Kafka should beat broker-local storage |
| Cost model | Compute, object storage, WAL storage, requests, network traffic, support fees, and retention growth | Object storage reduces one cost class while introducing others that still need modeling |
| Operational tooling | Metrics, logs, alerts, runbooks, upgrades, backup posture, IaC, and access controls | A platform is production-ready only when operations can own it |
| Migration path | MirrorMaker2, Kafka Linking, dual-write strategy, offset preservation, cutover plan, and backout plan | Compatibility lowers risk, but migrations still need rehearsals |
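
The cost model row is the easiest one to under-specify, so it helps to write the line items down explicitly. The sketch below is one possible shape for that model. Every field name and every price is a placeholder to be replaced with your provider's actual rates and your own measurements.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCostInputs:
    broker_count: int
    broker_hourly_usd: float
    stored_tib: float
    storage_usd_per_tib_month: float
    wal_usd: float                 # WAL volumes or WAL bucket, flat monthly figure
    put_requests_millions: float
    put_usd_per_million: float
    get_requests_millions: float
    get_usd_per_million: float
    cross_az_tib: float
    cross_az_usd_per_tib: float
    support_usd: float = 0.0


def monthly_cost(c: MonthlyCostInputs, hours_per_month: float = 730.0) -> dict[str, float]:
    """Break the bill into the cost classes a diskless design shifts around."""
    lines = {
        "compute": c.broker_count * c.broker_hourly_usd * hours_per_month,
        "object_storage": c.stored_tib * c.storage_usd_per_tib_month,
        "wal": c.wal_usd,
        "requests": (c.put_requests_millions * c.put_usd_per_million
                     + c.get_requests_millions * c.get_usd_per_million),
        "network": c.cross_az_tib * c.cross_az_usd_per_tib,
        "support": c.support_usd,
    }
    lines["total"] = sum(lines.values())
    return lines


# Placeholder inputs only; none of these are quoted prices.
example = MonthlyCostInputs(
    broker_count=6, broker_hourly_usd=0.5,
    stored_tib=100, storage_usd_per_tib_month=23.0,
    wal_usd=200.0,
    put_requests_millions=500, put_usd_per_million=5.0,
    get_requests_millions=1000, get_usd_per_million=0.4,
    cross_az_tib=50, cross_az_usd_per_tib=20.0,
)
for name, usd in monthly_cost(example).items():
    print(f"{name:>15}: {usd:10.2f}")
```

Keeping request and network charges as separate lines matters because they scale with traffic shape rather than with retained volume. Re-run the model with projected retention growth before comparing it against the current cluster's bill.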

AutoMQ's strongest production argument is not a single metric. It is the combination of Apache 2.0 openness, Kafka-compatible APIs, BYOC and software deployment options, object-storage-backed durability, and public customer evidence across several workloads. That gives platform teams room to evaluate the architecture without giving up data-plane control or forcing application teams into a new streaming protocol.

FAQ

Is diskless Kafka production-ready?

Yes, when the implementation has a durable write path, verified Kafka compatibility, operational tooling, and evidence under workloads similar to yours. Public AutoMQ customer stories show diskless Kafka running in production for Grab, JD.com, Poizon, Tencent Cloud EMR, LG U+, and Bambu Lab. The more precise answer is that production readiness should be evaluated per workload, not assumed from the architecture alone.

Does diskless Kafka mean there are no brokers?

Not necessarily. In AutoMQ's architecture, brokers still exist, but they are stateless compute nodes rather than owners of durable local log data. The durable data lives in object storage through S3Stream, and the WAL handles low-latency persistence and recovery. Some market discussions use "brokerless" and "diskless" loosely, so always check the exact architecture.

Is diskless Kafka the same as Kafka tiered storage?

No. Tiered storage typically offloads older segments to object storage while brokers still own the hot log on local disks. Diskless Kafka makes shared storage the primary durable layer, so broker replacement and partition movement do not require the same local data-copy workflow.

What workloads are the best fit for diskless Kafka?

High-volume observability, ecommerce event streams, data lake ingestion, long-retention pipelines, Kubernetes-native platforms, and multi-cloud streaming are strong candidates. The common pattern is that storage growth, reassignment time, or peak over-provisioning has become more painful than the migration effort.

What should I test before migrating?

Test your real client versions, message sizes, partition counts, consumer lag behavior, connectors, retention settings, failure drills, and cost model. Also test offset preservation, cutover, and backout. A small benchmark will not expose the same risks as a realistic replay and failover test.
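
The offset-preservation check in particular is easy to automate once committed offsets have been captured on both sides of a rehearsal cutover. The helper below is a hypothetical sketch. It assumes the migration path is meant to keep offsets identical, and that the snapshots were already collected as plain dictionaries keyed by (topic, partition). If your tooling translates offsets instead of preserving them, the comparison rule needs to change.

```python
from typing import Optional

TopicPartition = tuple[str, int]


def offset_regressions(
    before: dict[TopicPartition, int],
    after: dict[TopicPartition, int],
) -> dict[TopicPartition, tuple[int, Optional[int]]]:
    """Return partitions whose committed offset is missing or moved backwards."""
    problems: dict[TopicPartition, tuple[int, Optional[int]]] = {}
    for tp, committed in before.items():
        migrated = after.get(tp)
        if migrated is None or migrated < committed:
            problems[tp] = (committed, migrated)
    return problems


# Hypothetical snapshots for one consumer group, taken before and after cutover.
before = {("orders", 0): 1500, ("orders", 1): 980, ("clicks", 0): 42}
after = {("orders", 0): 1500, ("orders", 1): 900}

for tp, (old, new) in offset_regressions(before, after).items():
    print(f"{tp}: committed {old} before cutover, {new} after")
```

A regression means consumers would reprocess records after cutover, and a missing entry means the group would fall back to its reset policy. Both are worth catching in rehearsal rather than in production.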

Where does AutoMQ fit in the diskless Kafka market?

AutoMQ is a Kafka-compatible, diskless streaming platform that stores persistent stream data in object storage and makes brokers stateless. It is open source under Apache License 2.0 and supports deployment models including BYOC and software deployment. Its public customer stories are the main reason it deserves attention in production-readiness evaluations.
