Diskless Kafka Explained | S3-Native, 80% Cost Savings

May 1, 2026
AutoMQ Team
12 min read

What Is Diskless Kafka?

Diskless Kafka is an architecture paradigm that replaces local disk storage with cloud object storage (like Amazon S3) as the primary data layer. Instead of replicating data across local disks on multiple brokers, a diskless Kafka cluster stores data once in object storage — leveraging its built-in durability (99.999999999%, or 11 nines) and eliminating the need for application-layer replication entirely.

The idea is straightforward: if S3 already guarantees your data won't be lost, why is Kafka replicating it three times across expensive EBS volumes? That question has been simmering in the Kafka community for years, but 2023 marked the turning point. AutoMQ and WarpStream both shipped production-ready diskless Kafka implementations within months of each other. By March 2026, the Apache Kafka community formally accepted KIP-1150 — a proposal to add native diskless topic support to Kafka itself. The direction is no longer debatable. Diskless is where Kafka is headed.

This article explains how diskless Kafka works, why the entire streaming industry is converging on this architecture, and what enterprises should consider — both the benefits and the tradeoffs — before adopting it.

Diskless Kafka vs Traditional Kafka: Architecture Comparison

The Industry Is Converging on Diskless

Diskless Kafka is not one company's invention. It is the streaming industry's collective response to a simple economic reality: running disk-based replication on cloud infrastructure is wasteful.

A Timeline of Convergence

The shift happened faster than most people expected:

  • 2023 H2 — AutoMQ and WarpStream launched within months of each other, both replacing local disk storage with S3. Two independent teams, arriving at the same architectural conclusion simultaneously, signaled that the idea's time had come.
  • Late 2024 — Confluent acquired WarpStream, validating the commercial viability of diskless Kafka. The acquisition confirmed that even the largest Kafka vendor saw object-storage-native architecture as strategically important.
  • 2025 — IBM acquired Confluent for $11 billion, folding WarpStream into the IBM portfolio. Meanwhile, Aiven began developing Inkless — its own diskless implementation based on KIP-1150 — and the Apache Kafka community formally accepted the KIP-1150 Diskless Topics proposal.
  • 2025–2026 — BufStream, StreamNative, and Redpanda (with its Cloud Topics beta) all announced or shipped object-storage integrations. The trickle became a flood.

When this many independent players converge on the same architectural direction, it stops being a trend and starts being an inevitability.

Why the Entire Industry Is Moving This Way

The driving force is economics. Cloud object storage has reached a point where its cost, durability, and elasticity make local disk replication look like an anachronism:

  • Durability: S3 provides 11 nines of durability out of the box. Kafka's 3× replication was designed for an era when individual disks failed regularly and there was no shared durable storage layer. On the cloud, that replication is redundant.
  • Cost: S3 Standard costs $0.023/GB-month, while EBS gp3 costs $0.08/GB-month, and Kafka keeps three copies, typically provisioned with ~50% utilization headroom. That puts the effective storage cost of disk-based Kafka at roughly $0.48 per GB of actual data, more than 20× what S3 charges for the same byte.
  • Cross-AZ traffic: This is the hidden cost that breaks budgets. Every message Kafka replicates across availability zones incurs $0.01–0.02/GB in network transfer fees. For a 300 MB/s cluster, cross-AZ traffic alone can exceed $50,000/month. S3 is a regional service — writes to S3 don't cross AZ boundaries, so this entire cost category disappears.
  • Elasticity: S3 has unlimited capacity and requires no provisioning. Disk-based Kafka forces teams to pre-provision storage for peak + failure scenarios, leaving 50% or more of capacity idle most of the time.

The math is unambiguous. Continuing to run disk-based replication on cloud infrastructure means paying a premium for a durability guarantee that the underlying platform already provides for free.
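To make the storage arithmetic concrete, here is a back-of-the-envelope comparison in Python using the list prices cited above. The 50% utilization figure is the provisioning headroom mentioned in the elasticity bullet; exact ratios vary by region and discounts.

```python
# Back-of-the-envelope effective storage cost per GB of user data,
# using the list prices cited in the article (monthly, us-east-1).
EBS_GP3_PER_GB = 0.08       # $/GB-month, gp3 baseline
S3_STANDARD_PER_GB = 0.023  # $/GB-month, S3 Standard
REPLICATION_FACTOR = 3      # Kafka's usual production replication factor
UTILIZATION = 0.50          # disks pre-provisioned with ~50% headroom

disk_based = EBS_GP3_PER_GB * REPLICATION_FACTOR / UTILIZATION
diskless = S3_STANDARD_PER_GB  # one logical copy; S3 replicates internally

print(f"disk-based: ${disk_based:.2f}/GB")          # $0.48/GB
print(f"diskless:   ${diskless:.3f}/GB")            # $0.023/GB
print(f"ratio:      {disk_based / diskless:.0f}x")  # ~21x
```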

How Diskless Kafka Works: Architecture Deep Dive

Separation of Compute and Storage

Traditional Kafka tightly couples compute and storage inside every broker. Each broker owns its data on local disk, manages replication to follower brokers, and handles both message processing and storage I/O. This makes brokers stateful, heavy, and slow to scale.

Diskless Kafka separates these concerns. Brokers become stateless compute nodes — they process messages, manage consumer groups, and handle protocol logic, but they don't store data locally. All partition data lives in object storage. The analogy to databases is direct: Aurora decoupled MySQL's compute from its storage layer, and the result was elastic scaling, simplified operations, and lower cost. Diskless Kafka applies the same principle to stream processing.

The Write Path: WAL + Asynchronous Flush

When a producer sends a message to a diskless Kafka broker, the write path looks like this:

  1. Broker receives the record and writes it to a Write-Ahead Log (WAL) — a small, fixed-capacity buffer (~10 GiB per broker) that provides low-latency durability.
  2. The WAL acknowledges the write as durable. The producer gets its ACK.
  3. In the background, WAL data is batched and uploaded to object storage asynchronously.

The WAL is the key innovation that makes diskless Kafka practical. Writing directly to S3 on every produce request would add hundreds of milliseconds of latency — acceptable for log aggregation, but not for real-time workloads. The WAL absorbs writes at low latency (sub-10ms with EBS or NFS backends) and decouples the producer-facing latency from the object storage upload latency.

One particularly clever optimization is partition-aggregated writes. Traditional Kafka maintains one log file per partition on disk. A diskless implementation like AutoMQ aggregates data from all partitions into a single WAL object. This keeps S3 API call frequency constant regardless of partition count — whether you're running 10 or 10,000 partitions, write costs remain flat. It's an O(1) cost model instead of O(n).
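The write path above can be sketched in a few lines of Python. This is a toy model, not AutoMQ's actual implementation: the buffer threshold, record format, and upload call are illustrative stand-ins. What it shows is why aggregating all partitions into one WAL object keeps the S3 PUT count independent of partition count.

```python
import io

class DisklessWAL:
    """Toy model of the diskless write path: records from all partitions
    land in one WAL buffer and flush to object storage as a single object."""

    def __init__(self, flush_bytes=8 * 1024 * 1024):
        self.buffer = []          # (topic, partition, payload) triples
        self.buffered_bytes = 0
        self.flush_bytes = flush_bytes
        self.uploads = 0          # stand-in counter for S3 PUTs issued

    def append(self, topic, partition, payload: bytes):
        # Steps 1-2: the record is durable in the WAL; the producer is ACKed.
        self.buffer.append((topic, partition, payload))
        self.buffered_bytes += len(payload)
        if self.buffered_bytes >= self.flush_bytes:
            self.flush()

    def flush(self):
        # Step 3: one aggregated object regardless of partition count -> O(1) PUTs.
        obj = io.BytesIO()
        for topic, partition, payload in self.buffer:
            obj.write(payload)
        self.upload_to_object_storage(obj.getvalue())
        self.buffer.clear()
        self.buffered_bytes = 0

    def upload_to_object_storage(self, blob: bytes):
        self.uploads += 1  # a real implementation would PUT `blob` to S3 here

wal = DisklessWAL(flush_bytes=1024)
for p in range(1000):                  # 1,000 partitions...
    wal.append("events", p, b"x" * 10)  # ...one 10-byte record each
wal.flush()
print(wal.uploads)  # 10 uploads total, independent of partition count
```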

The Read Path: Dual-Cache Architecture

Reading from object storage on every fetch request would be impractical, so diskless Kafka implementations use caching to serve reads efficiently:

  • WAL Cache (FIFO): Serves tailing reads — the most common pattern where consumers read the latest messages. Data is served directly from the WAL cache with zero S3 latency. Since most consumers are reading near the tip of the log, the majority of reads never touch object storage at all.
  • Block Cache (LRU): Handles catch-up reads — when consumers need to read historical data. Data streams from S3 in parallel, and the throughput is unbounded by local disk I/O. Catch-up reads can actually be faster than traditional Kafka, because S3 supports massive parallel reads that a single broker's disk cannot match.

This dual-cache design also solves a notorious Kafka problem: cache pollution. In traditional Kafka, a single catch-up consumer reading large volumes of historical data can thrash the OS PageCache and spike P99 latency for every other producer and consumer on the same broker. With separate caches for tailing and catch-up reads, the two workloads don't interfere with each other.
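The dual-cache idea can be sketched with a bounded deque as the FIFO WAL cache and an OrderedDict as the LRU block cache. Sizes, record types, and the fetch callback are illustrative, not what any real broker uses; the point is that catch-up reads populate only the LRU cache and cannot evict the tailing reads' working set.

```python
from collections import OrderedDict, deque

class DualCache:
    """Toy dual-cache read path: a FIFO WAL cache serves tailing reads and
    an LRU block cache serves catch-up reads, so neither evicts the other."""

    def __init__(self, wal_slots=4, block_slots=4):
        self.wal_cache = deque(maxlen=wal_slots)  # FIFO: newest records
        self.block_cache = OrderedDict()          # LRU: historical blocks
        self.block_slots = block_slots

    def on_produce(self, offset, record):
        self.wal_cache.append((offset, record))   # oldest entry drops off

    def read(self, offset, fetch_block):
        # Tailing read: hit the WAL cache, zero object-storage latency.
        for off, record in self.wal_cache:
            if off == offset:
                return record, "wal-cache"
        # Catch-up read: consult the LRU block cache, else fetch from S3.
        if offset in self.block_cache:
            self.block_cache.move_to_end(offset)  # refresh LRU position
            return self.block_cache[offset], "block-cache"
        record = fetch_block(offset)              # simulated S3 GET
        self.block_cache[offset] = record
        if len(self.block_cache) > self.block_slots:
            self.block_cache.popitem(last=False)  # evict least recently used
        return record, "s3"

cache = DualCache()
for off in range(10):
    cache.on_produce(off, f"r{off}")
_, src_tail = cache.read(9, lambda o: f"r{o}")  # recent offset
_, src_cold = cache.read(0, lambda o: f"r{o}")  # old offset, gone from WAL cache
_, src_warm = cache.read(0, lambda o: f"r{o}")  # now held by the block cache
print(src_tail, src_cold, src_warm)  # wal-cache s3 block-cache
```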

Cross-AZ Optimization: Stay Local, Store Regional

One of the most impactful architectural differences is how diskless Kafka handles multi-AZ deployments:

  • Brokers don't replicate data between availability zones. There is no ISR-based cross-AZ replication traffic — the single largest cost driver in traditional Kafka.
  • S3 is a regional service. It handles cross-AZ durability internally, at no additional charge to the user.
  • Clients stay in their AZ. Rack-aware routing ensures producers and consumers connect to brokers in the same availability zone.
  • Only metadata crosses AZ boundaries. Brokers exchange a small amount of coordination metadata — negligible compared to Kafka's full data replication.
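In Apache Kafka's configuration terms, rack awareness is expressed with the standard broker.rack and client.rack settings; diskless implementations can use the same signals to keep traffic AZ-local. The zone names below are placeholders.

```properties
# Broker side (server.properties): declare each broker's availability zone.
broker.rack=us-east-1a

# Client side (e.g. consumer.properties): declare the client's zone so
# fetches are routed to a broker in the same AZ.
client.rack=us-east-1a
```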

The result: a 300 MB/s traditional Kafka cluster on AWS incurs roughly $50,000/month in cross-AZ transfer fees. A diskless cluster running the same workload incurs close to zero.
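That figure can be roughly reproduced. The sketch below assumes a 3-replica cluster, 2× consumer fan-out, producers and consumers spread evenly across three AZs, and roughly $0.02/GB end-to-end cross-AZ pricing; real bills depend on topology and negotiated rates.

```python
# Rough monthly cross-AZ transfer bill for a traditional 3-replica Kafka
# cluster at 300 MB/s produce with 2x consumer fan-out on AWS. Cross-AZ
# transfer is billed on each side of the boundary, ~$0.02/GB end to end.
GB_PER_MONTH = 0.3 * 86_400 * 30  # 300 MB/s ~= 777,600 GB/month
COST_PER_GB = 0.02

replication = 2 * GB_PER_MONTH            # leader -> 2 followers in other AZs
produce = (2 / 3) * GB_PER_MONTH          # producer hits a remote-AZ leader 2/3 of the time
consume = (2 / 3) * 2 * GB_PER_MONTH      # 2x fan-out, same 2/3 remote-AZ odds

total = (replication + produce + consume) * COST_PER_GB
print(f"${total:,.0f}/month")  # ~$62,000/month, in line with the table above
```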

Benefits: Why Enterprises Are Adopting Diskless Kafka

Dramatically Lower Cloud Costs

The cost reduction is not incremental — it's structural. A pricing calculator comparison for a representative workload (300 MB/s write throughput, 2× read fan-out, 72-hour retention on AWS us-east-1) tells the story:

Component                 Traditional Kafka              Diskless Kafka (AutoMQ)
Compute                   $5,151                         $1,430
Storage                   $36,450 (EBS × 3 replicas)     $1,722 (S3, no replication)
Cross-AZ Traffic          $61,594                        $0
S3 API + Platform Fees    $0                             $18,652
Total Monthly             $103,195                       $21,804

That's a 79% cost reduction — and the gap widens at higher throughput, because cross-AZ traffic and storage replication scale linearly with volume in traditional Kafka but remain near-zero in a diskless architecture.

The savings come from three structural changes, not from squeezing margins:

  • Storage: S3 at $0.023/GB-month with a single copy replaces EBS at $0.08/GB-month with 3 copies and 50% utilization headroom, a 20×+ reduction in effective storage cost.
  • Network: Zero cross-AZ traffic eliminates what is typically 50–64% of a traditional Kafka cluster's total bill.
  • Compute: Stateless brokers scale elastically to actual demand, eliminating the chronic over-provisioning that disk-based Kafka requires.

Simplified Operations

Stateless brokers change the operational model fundamentally. There are no disks to manage, no replica sync to monitor, no capacity planning spreadsheets to maintain. Partition reassignment — the operation that takes hours in traditional Kafka because it requires moving terabytes of data between brokers — becomes a metadata-only operation that completes in seconds.

Broker failure recovery changes from "wait for the replacement broker to rebuild its data from replicas" to "another broker takes over immediately, because there's no local data to rebuild." This means brokers can even run on Spot or Preemptible instances for further cost savings — losing a broker is a non-event.

For teams that have spent years building expertise in Kafka capacity planning, rebalancing, and disk management, this is a genuine reduction in operational burden. The complexity doesn't move somewhere else; it disappears, because the root cause — stateful brokers with local data — no longer exists.

Instant Elasticity

Traditional Kafka clusters only grow; they rarely shrink. Removing a broker requires migrating all its partition data to other brokers, which generates massive internal traffic that impacts production workloads. In practice, most teams never scale down. Brokers accumulate, utilization drops, and costs keep rising.

Diskless Kafka breaks this pattern. Scaling from 3 to 30 brokers takes seconds — and scaling back down is just as fast. There's no data to move, because brokers don't hold data. This enables true pay-for-what-you-use economics: scale up for traffic spikes, scale down when they pass, and stop paying for idle capacity.

Risks and Tradeoffs: What to Consider Before Adopting

Diskless Kafka is not a free lunch. The architecture makes deliberate tradeoffs, and understanding them is essential for making an informed adoption decision.

Produce Latency Increases (Depending on WAL Choice)

The most significant tradeoff is write latency. Object storage is not designed for low-latency random writes, and this affects the produce path:

  • S3 WAL mode: P99 produce latency is approximately 400–500ms. This is perfectly acceptable for log aggregation, batch ETL, data lake ingestion, and metrics collection — workloads where throughput matters more than latency.
  • EBS or NFS WAL mode: P99 produce latency drops to roughly 10–30ms, comparable to traditional Kafka. This covers real-time analytics, fraud detection, microservice communication, and most production Kafka workloads.

The WAL backend is pluggable in implementations like AutoMQ, so teams can choose the right tradeoff per use case. But this is a decision that needs to be made consciously — you need to understand your latency requirements and test your streaming pipeline's tolerance for the change.

Metadata Architecture Varies by Implementation

Not all diskless Kafka implementations are created equal. Some — like WarpStream and Aiven Inkless — route every message through a centralized metadata service. This introduces an additional availability dependency: if the metadata service goes down, message production stops.

Other implementations, like AutoMQ, use a leader-based architecture consistent with Apache Kafka's design, where brokers handle message routing directly without a per-message metadata lookup. The availability characteristics are closer to traditional Kafka.

When evaluating diskless Kafka platforms, pay attention to the metadata architecture. For workloads with strict availability requirements, the difference matters.

Mixed Latency Requirements May Need Careful Planning

If your organization runs both latency-sensitive workloads (transaction processing, real-time bidding) and latency-tolerant workloads (log aggregation, analytics ETL) on the same Kafka infrastructure, a pure S3 WAL deployment won't satisfy both.

Some implementations address this with pluggable WAL backends — allowing different topics or clusters to use different WAL tiers within the same platform. AutoMQ, for example, supports S3 WAL, EBS WAL, and NFS WAL, letting teams cover multiple latency profiles with a single architecture. Others may require separate clusters or even separate vendors for different latency tiers.

Cloud Service Dependency

Diskless Kafka requires cloud object storage. It is not designed for bare-metal or fully offline deployments. For the vast majority of Kafka deployments — which already run on AWS, GCP, or Azure — this is not a practical concern. S3's availability (99.99%) and durability (11 nines) exceed what most organizations achieve with self-managed disk infrastructure. But if your deployment must run entirely on-premises without any cloud dependency, diskless Kafka in its current form is not the right fit.

Getting Started with Diskless Kafka

The diskless Kafka landscape is evolving rapidly, but the technology is already production-proven at scale. AutoMQ is the most mature diskless Kafka implementation available, with production deployments at companies including Grab, JD.com (200+ pods, 40 GiB/s peak throughput), LG U+, Tencent, and HubSpot. It is built on the Apache Kafka codebase, delivers 100% Kafka protocol compatibility, and is open source under the Apache License 2.0.

If you want to understand the architecture in depth, the AutoMQ Diskless Engine page is the most detailed technical reference available.

If you want to calculate how much diskless Kafka could save for your specific workload, the AutoMQ pricing calculator lets you input your throughput, fan-out, and retention parameters and see a side-by-side cost comparison.

And if you want to try it: AutoMQ supports BYOC (Bring Your Own Cloud) deployment on AWS, GCP, Azure, and OCI, as well as self-managed software deployment. The documentation and GitHub repository are the best places to start.

The question is no longer whether Kafka will go diskless — KIP-1150's acceptance settled that. The question is whether you'll wait years for the upstream implementation to mature, or start capturing the cost and operational benefits now.
