"Kafka without disks" sounds precise until an architecture review begins. One team may mean no broker-local persistent data. Another may mean no Kubernetes PersistentVolume. A third may mean no local storage at all, including temporary cache. Those are different requirements with different failure modes, and mixing them together is how a cloud-native Kafka design becomes fragile.
Traditional Apache Kafka is built around broker-local log directories. Producers append to partition logs, consumers read from those logs, replicas protect those logs, and retention removes data from those logs. If a broker still owns durable partition data on attached storage, the deployment is stateful even when the broker runs in a Kubernetes pod. A diskless Kafka design has to answer a harder question: where does acknowledged data become durable when the broker does not keep the primary log on local disks?
For Kubernetes platform teams, the useful definition is operational rather than literal. Kafka without disks means broker-local storage is not the durable source of truth. A broker may still use memory, ephemeral local space, or a node-local cache for performance, but losing that local storage must not lose acknowledged records or force a large partition-data recovery from the failed broker.
Define "Without Disks" First
Before choosing a design, separate three meanings that often get collapsed into one phrase.
| Definition | What it removes | What may remain | Operational consequence |
|---|---|---|---|
| No local persistent data | Broker disks are not the durable partition-log home | WAL elsewhere, object storage, metadata, cache | Broker replacement should not require copying durable log data from that broker |
| No PersistentVolume | Kubernetes pods do not depend on PV-bound broker storage | Ephemeral volumes, memory cache, shared storage services | Scheduling and replacement become less tied to volume attachment |
| No temporary disk | Brokers avoid local filesystem use beyond the container runtime | Memory cache or remote cache | Strictest form; can raise read-path and buffering pressure |
Most production conversations are about the first two rows. Teams usually want Kafka with no durable data on broker-local disks, or Kafka on Kubernetes without PersistentVolumes so that brokers schedule freely. They rarely mean that a broker process cannot touch any local filesystem path. That distinction matters because cache is not the same as durability. Removing every cache can make the architecture cleaner on paper and worse under real replay traffic.
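To make the distinction concrete, here is a minimal Python sketch that checks a broker's storage profile against the three meanings in the table. The `BrokerStorageProfile` type and its fields are hypothetical. They summarize a deployment and do not read from any real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrokerStorageProfile:
    """Hypothetical summary of where a broker keeps bytes (illustration only)."""
    durable_log_on_broker_disk: bool  # acknowledged partition data lives on broker-attached storage
    uses_persistent_volume: bool      # the broker pod mounts PV-backed storage
    uses_local_filesystem: bool       # any local path: PV, emptyDir cache, scratch space


def definitions_met(p: BrokerStorageProfile) -> set:
    """Return which meanings of 'without disks' this profile satisfies."""
    met = set()
    if not p.durable_log_on_broker_disk:
        met.add("no local persistent data")
    if not p.uses_persistent_volume:
        met.add("no PersistentVolume")
    # The strictest meaning only holds if nothing local is used at all.
    if not (p.durable_log_on_broker_disk or p.uses_persistent_volume or p.uses_local_filesystem):
        met.add("no temporary disk")
    return met


# A stateless broker with an ephemeral cache meets the first two meanings, not the third.
print(sorted(definitions_met(BrokerStorageProfile(False, False, True))))
# A broker that drops the PV but keeps its log on a node-local path meets only one.
print(sorted(definitions_met(BrokerStorageProfile(True, False, True))))
```

The second case is the one that tends to surprise reviewers. Removing the PersistentVolume does not, by itself, remove broker-local durable state.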
A practical diskless Kafka evaluation should therefore start with a three-layer model.
- The durability layer decides where acknowledged writes survive broker loss.
- The cache layer decides how hot reads and replay reads avoid turning every fetch into a remote storage call.
- The broker compute layer decides how producers, consumers, partition leadership, protocol handling, and scaling behave when brokers are replaceable.
Once those layers are explicit, "Kafka without disks" becomes a set of architecture patterns instead of a slogan.
Pattern 1: Reduce Local Retention With Tiered Storage
The first pattern is not fully diskless, but it is often the least disruptive step for an existing Kafka estate. Kafka Tiered Storage, described in Apache Kafka documentation as remote log storage, lets brokers copy eligible log segments to a remote tier. Local disks still hold the active log and enough data to serve the hot path, while older segments can live in remote storage.
This pattern helps when the pressure is retention size. If teams keep many days or weeks of Kafka data because downstream consumers may replay old offsets, local broker disks become a capacity planning problem. Tiered storage reduces that pressure by moving colder data away from broker-attached volumes. It can also make historical replay less dependent on keeping every retained byte on premium block storage.
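A rough capacity estimate shows why tiered storage relieves this pressure. The setting names below come from Apache Kafka's tiered storage (KIP-405) configuration. The broker must also enable remote log storage and configure a remote storage plugin. The workload numbers are hypothetical, and the estimate deliberately ignores compression, indexes, the active segment, and headroom.

```python
# Topic-level tiered storage settings. The broker additionally needs
# remote.log.storage.system.enable=true and a configured remote storage plugin.
topic_config = {
    "remote.storage.enable": "true",
    "retention.ms": str(7 * 24 * 3600 * 1000),    # total retention: local + remote
    "local.retention.ms": str(12 * 3600 * 1000),  # how long segments stay on broker disk
}


def local_mb_per_broker(ingress_mb_s: float, replication_factor: int,
                        retention_ms: int, brokers: int) -> float:
    """Rough broker-local log size in MB for a given retention window.

    Ignores compression, indexes, the active segment, and free-space headroom.
    """
    seconds = retention_ms / 1000
    return ingress_mb_s * seconds * replication_factor / brokers


# Hypothetical workload: 10 MB/s ingress, replication factor 3, six brokers.
full = local_mb_per_broker(10, 3, int(topic_config["retention.ms"]), 6)
tiered = local_mb_per_broker(10, 3, int(topic_config["local.retention.ms"]), 6)
print(f"all retention local: {full:,.0f} MB per broker")
print(f"12h local window:    {tiered:,.0f} MB per broker")
```

Under these assumptions, the local footprint shrinks by the ratio of the two retention windows (14x here), and the remote tier holds the rest. The broker still needs that 12-hour window on attached storage, which is why this pattern counts as reduced local retention rather than diskless.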
The tradeoff is that broker-local disks still matter. Producer acknowledgements still depend on the broker's local log. The active segment is local. Broker failure and partition movement still involve stateful Kafka behavior. Kubernetes can manage the pods, but the cluster still has to operate the storage behind those pods.
Use tiered storage when:
- Existing Kafka clusters need longer retention without expanding broker disks at the same rate.
- The team wants an incremental storage change while preserving the classic broker model.
- Most operational pain comes from cold data capacity rather than broker replacement or elastic scaling.
Do not call this pattern diskless in the strong sense. It is better described as reduced local retention. That language helps SREs set the right expectations: object storage is present, but it is not yet the primary durable home for active Kafka writes.
Pattern 2: Use Object Storage as Primary Durable Storage
The second pattern changes the storage contract. Object storage is no longer an archive for older log segments; it becomes the durable repository for stream data. The broker stops being the long-term owner of partition bytes and becomes a protocol, append, fetch, cache, and coordination participant over shared storage.
This shift is attractive because object storage has properties that match long-lived Kafka data: shared access, elastic capacity, and separation from broker lifecycle. If durable data is organized in object storage with accurate metadata, replacing a broker does not require recovering the broker's local partition directories. Scaling compute and retaining data become different decisions.
Object storage alone does not solve the write path. Kafka producers expect an acknowledgement boundary with ordered appends and well-defined durability. A diskless architecture needs a write-ahead log or equivalent durable write path before data is packed, indexed, and uploaded into object storage. Without that layer, the design either acknowledges too early or pays too much latency for every write.
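The ordering matters more than the mechanism. Below is a minimal Python sketch that assumes a WAL which survives broker loss (`ToyWal`), with a dict standing in for an object store. Neither is AutoMQ's API or any other system's. A real implementation would also record offsets and object locations in metadata.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWal:
    """Stand-in for a durable WAL that lives outside the broker (hypothetical)."""
    records: list = field(default_factory=list)

    def append(self, offset: int, value: bytes) -> None:
        self.records.append((offset, value))

    def trim_through(self, offset: int) -> None:
        # Entries already uploaded to object storage no longer need the WAL.
        self.records = [r for r in self.records if r[0] > offset]


class ToyBroker:
    """Append path: WAL first, acknowledge, then batch-upload to object storage."""

    def __init__(self, wal: ToyWal, object_store: dict, batch_size: int = 4):
        self.wal = wal
        self.object_store = object_store
        self.batch_size = batch_size
        self.next_offset = 0
        self.pending = []  # acknowledged records not yet uploaded

    def produce(self, value: bytes) -> int:
        offset = self.next_offset
        self.wal.append(offset, value)  # 1. make the record durable
        self.next_offset += 1
        self.pending.append((offset, value))
        if len(self.pending) >= self.batch_size:
            self.flush()
        return offset                   # 2. acknowledge only after the WAL append

    def flush(self) -> None:
        if not self.pending:
            return
        first, last = self.pending[0][0], self.pending[-1][0]
        # 3. pack pending records into one object, then drop them from the WAL
        self.object_store[f"stream-0/{first:08d}-{last:08d}"] = list(self.pending)
        self.wal.trim_through(last)
        self.pending = []


wal, store = ToyWal(), {}
broker = ToyBroker(wal, store)
acked = [broker.produce(f"r{i}".encode()) for i in range(6)]
print(list(store), wal.records)
```

Every acknowledged offset is either in an uploaded object or still in the WAL, and that is the invariant the write path has to protect. Acknowledging before the WAL append would break it. Uploading one object per record would preserve it, but every write would pay object-storage latency and request cost.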
This is where AutoMQ fits, as one implementation of the pattern. AutoMQ is a Kafka-compatible streaming system that keeps Kafka protocol compatibility while moving durable stream storage away from broker-local disks. Its architecture uses S3Stream as a shared storage layer, object storage as the primary repository, and WAL storage for the acknowledged write path. That is a different design point from adding remote storage to a local-log broker after the fact.
When evaluating this pattern, ask concrete questions rather than accepting the phrase "Kafka object storage" at face value:
- Does object storage hold the primary durable data, or only remote copies of older segments?
- What storage system receives writes before producer acknowledgements?
- How are offsets, stream ranges, object locations, and ownership tracked?
- What happens to unflushed data when a broker disappears?
- Which local resources are cache, and which resources are required for durability?
The answers determine whether the architecture is diskless in the operational sense. A broker can use local cache and still be stateless with respect to durable partition data. A broker that owns acknowledged local log segments is not stateless, even if it later uploads old segments elsewhere.
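The last two sentences reduce to a rule that can be applied to a resource inventory. This sketch assumes you have already answered the questions above for each resource and that the WAL does not sit on broker-local persistent storage. The `Resource` type and the example entries are illustrative and are not taken from any product's documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical inventory entry for one place a broker touches data."""
    name: str
    broker_local: bool           # lost when the broker pod or its node is lost
    needed_for_durability: bool  # required to recover acknowledged records


def broker_durable_state(resources: list) -> list:
    """Names of broker-local resources that acknowledged data depends on.

    An empty result means the broker is stateless with respect to durable data,
    even if it still uses local cache.
    """
    return [r.name for r in resources if r.broker_local and r.needed_for_durability]


object_primary = [
    Resource("memory cache", broker_local=True, needed_for_durability=False),
    Resource("emptyDir block cache", broker_local=True, needed_for_durability=False),
    Resource("WAL", broker_local=False, needed_for_durability=True),
    Resource("object storage", broker_local=False, needed_for_durability=True),
]
tiered_classic = [
    Resource("log.dirs (active segments)", broker_local=True, needed_for_durability=True),
    Resource("remote log storage", broker_local=False, needed_for_durability=True),
]
print(broker_durable_state(object_primary))
print(broker_durable_state(tiered_classic))
```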
Pattern 3: Run Stateless Brokers on Kubernetes
The third pattern is the Kubernetes expression of the second pattern. Once durable Kafka data moves to shared storage and the WAL is not tied to broker-local persistent volumes, brokers can be treated more like replaceable compute units. That does not remove the need for careful operations. It changes what Kubernetes is responsible for.
In a local-disk Kafka deployment, Kubernetes has to coordinate pods with persistent volumes, node affinity, storage attachment, and the Kafka cluster's own partition placement. A failed broker is expensive not because the container is hard to restart, but because the broker's data identity is tied to its storage. The more durable state a pod owns, the less replaceable it is.
In a stateless broker design, Kubernetes schedules broker compute, while durability is handled by a shared storage layer and WAL design. This supports a more elastic operating model: brokers can be replaced, scaled, or rescheduled without moving the primary durable log from one broker disk to another. The control plane still has to maintain metadata and leadership safely, but broker compute is no longer the place where the cluster's long-lived bytes live.
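The difference shows up in what a broker removal actually moves. Below is a minimal sketch that assumes ownership is a plain map from partition to broker, with round-robin placement. Real controllers weigh load, racks, and leadership. The sketch treats each reassignment in the local-disk model as a full copy of the partition, which is what rebuilding a replica on a new owner requires. The partition names and sizes are made up.

```python
def reassign(ownership: dict, removed: str, remaining: list,
             sizes_gb: dict, shared_storage: bool):
    """Move partitions owned by `removed` to `remaining` brokers, round-robin.

    Returns (new_ownership, gb_copied). With shared storage only the ownership
    metadata changes; with broker-local logs each new owner needs its own copy
    of the partition bytes.
    """
    new_ownership = dict(ownership)
    copied = 0
    orphaned = sorted(p for p, b in ownership.items() if b == removed)
    for i, partition in enumerate(orphaned):
        new_ownership[partition] = remaining[i % len(remaining)]
        if not shared_storage:
            copied += sizes_gb[partition]
    return new_ownership, copied


ownership = {"orders-0": "b1", "orders-1": "b2", "clicks-0": "b1"}
sizes_gb = {"orders-0": 500, "orders-1": 700, "clicks-0": 300}
new_shared, copied_shared = reassign(ownership, "b1", ["b2", "b3"], sizes_gb, shared_storage=True)
new_local, copied_local = reassign(ownership, "b1", ["b2", "b3"], sizes_gb, shared_storage=False)
print(new_shared, copied_shared)
print(new_local, copied_local)
```

The placement result is identical in both models. Only the local-disk model pays for it in copied bytes.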
WAL Choices
WAL design is the most important technical decision in Kafka without disks. It defines write latency, failure recovery, and the real durability boundary. Common choices include an object-storage-based WAL, cloud block storage, regional block storage, or shared file storage. Each changes the latency and failure profile.
An object-storage-based WAL can minimize infrastructure dependencies and keep the architecture close to the storage system that will hold the data. It may fit log, observability, analytics, and other workloads where throughput and elasticity matter more than the lowest possible tail latency. Block- or file-based WAL options can reduce write-path latency or fit cloud-provider durability models, but they add storage services that must be operated and observed.
The right question is not whether a WAL uses a disk somewhere in the cloud provider. It is whether that WAL is broker-local durable state. If the WAL can survive broker loss and be recovered by another broker or storage component, the broker can still be diskless from a Kubernetes scheduling perspective.
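Recovery is where the location of the WAL becomes visible. Below is a minimal sketch that assumes a WAL service another broker can read after the original broker is gone. `SharedWal`, `take_over`, and the object-key layout are hypothetical, and the set of uploaded offsets stands in for the metadata a real system would keep.

```python
from dataclasses import dataclass, field


@dataclass
class SharedWal:
    """Hypothetical WAL service whose contents outlive any single broker."""
    logs: dict = field(default_factory=dict)

    def append(self, broker_id: str, offset: int, value: bytes) -> None:
        self.logs.setdefault(broker_id, []).append((offset, value))

    def read(self, broker_id: str) -> list:
        return list(self.logs.get(broker_id, []))


def take_over(failed_broker: str, wal: SharedWal, object_store: dict, uploaded: set) -> list:
    """Upload records the failed broker acknowledged but never uploaded.

    Safe to run twice: offsets already in `uploaded` are skipped.
    """
    replay = sorted((o, v) for o, v in wal.read(failed_broker) if o not in uploaded)
    if replay:
        key = f"recovered/{failed_broker}/{replay[0][0]}-{replay[-1][0]}"
        object_store[key] = replay
        uploaded.update(o for o, _ in replay)
    return [o for o, _ in replay]


wal, store, uploaded = SharedWal(), {}, {0, 1, 2}
for offset in range(5):
    wal.append("broker-1", offset, f"r{offset}".encode())
# broker-1's pod and local disk are gone; the replacement only needs the WAL.
recovered = take_over("broker-1", wal, store, uploaded)
print(recovered)
```

If the WAL had lived only on the failed pod's local disk, records 3 and 4 would be acknowledged but unrecoverable. That is exactly the broker-local durable state this pattern is trying to remove.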
Cache Choices
Cache is where many diskless designs either become practical or painful. Kafka workloads often include tail reads, fan-out consumers, and catch-up replays. If every fetch has to hit object storage directly, read latency, request cost, and throttling risk become visible quickly.
A healthy cache strategy separates source of truth from acceleration. Memory cache can serve the hottest data. Ephemeral local cache can absorb repeated reads without becoming durable state. Remote cache can help when node churn is frequent or when local cache size is constrained. None of these caches should be required to recover acknowledged records.
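A read-through LRU cache captures this separation between acceleration and source of truth. The sketch below is generic and uses a dict in place of the object store. It does not show how any particular system implements its block cache.

```python
from collections import OrderedDict


class ReadThroughCache:
    """LRU cache in front of a slower, authoritative store."""

    def __init__(self, capacity: int, fetch):
        self.capacity = capacity
        self.fetch = fetch                   # reads from the source of truth
        self.entries = OrderedDict()
        self.backend_reads = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)    # mark as most recently used
            return self.entries[key]
        self.backend_reads += 1
        value = self.fetch(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

    def drop_all(self) -> None:
        """Simulates losing the cache, e.g. when the pod is replaced."""
        self.entries.clear()


object_store = {f"block-{i}": f"data-{i}" for i in range(10)}
cache = ReadThroughCache(capacity=3, fetch=object_store.__getitem__)

for _ in range(2):                           # two fan-out consumers reading the tail
    for key in ("block-7", "block-8", "block-9"):
        cache.get(key)
print("backend reads after fan-out:", cache.backend_reads)

cache.drop_all()                             # cache loss costs reads, not data
print(cache.get("block-9"), "backend reads:", cache.backend_reads)
```

The second consumer is served entirely from cache, and dropping the cache only adds backend reads. A catch-up consumer reading far behind the tail would miss on every fetch, which is why catch-up replays after broker replacement deserve their own test.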
This is also why "no temporary disk" is a stricter requirement than most teams need. Removing durable broker disks is valuable. Removing every local cache may be counterproductive unless the workload is light, latency-insensitive, or backed by a strong remote cache design.
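The "no PersistentVolume" and "no temporary disk" meanings can be checked separately against a workload manifest. Below is a minimal sketch that assumes the manifest has already been loaded into Python dicts (for example, from YAML). The broker manifests are trimmed, hypothetical fragments, and the mount path is made up. `volumeClaimTemplates`, `persistentVolumeClaim`, `ephemeral`, and `emptyDir` are standard Kubernetes fields.

```python
def storage_findings(workload: dict) -> dict:
    """Classify a workload's volumes by how they tie the pod to storage."""
    spec = workload.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    volumes = pod_spec.get("volumes", [])
    return {
        # PVCs that outlive the pod: these bind the broker to a PersistentVolume.
        "persistent": [t["metadata"]["name"] for t in spec.get("volumeClaimTemplates", [])]
                      + [v["name"] for v in volumes if "persistentVolumeClaim" in v],
        # Generic ephemeral volumes are PVC-backed but deleted with the pod.
        "pod_scoped_pvc": [v["name"] for v in volumes if "ephemeral" in v],
        # emptyDir lives on the node and disappears with the pod.
        "local_scratch": [v["name"] for v in volumes if "emptyDir" in v],
    }


stateless_broker = {
    "kind": "StatefulSet",
    "spec": {"template": {"spec": {
        "containers": [{"name": "broker",
                        "volumeMounts": [{"name": "cache", "mountPath": "/var/cache/broker"}]}],
        "volumes": [{"name": "cache", "emptyDir": {"sizeLimit": "20Gi"}}],
    }}},
}
classic_broker = {
    "kind": "StatefulSet",
    "spec": {
        "template": {"spec": {"containers": [{"name": "broker"}]}},
        "volumeClaimTemplates": [{"metadata": {"name": "kafka-logs"}}],
    },
}
print(storage_findings(stateless_broker))
print(storage_findings(classic_broker))
```

The stateless broker passes the PersistentVolume check but not the no-temporary-disk check, because its cache sits on an emptyDir volume. A manifest check only covers scheduling. Whether losing that pod loses acknowledged data still has to be tested.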
Failure Recovery
Failure recovery should be tested as a layered process, not as a broker restart script. A stateless Kafka broker has to recover protocol service, leadership or assignment, WAL state, cache warming, and object-storage metadata. The design succeeds when a replacement broker can resume service without reconstructing durable partition data from a failed node's local disk.
Test the failure modes that match the architecture:
- Kill a broker while producers are writing and verify acknowledged data remains readable.
- Replace a pod on a different Kubernetes node and verify no PV-bound state blocks recovery.
- Throttle or impair object storage and observe producer, consumer, cache, and WAL behavior.
- Run catch-up consumers after broker replacement and watch whether cache misses overload the backend.
- Scale brokers up and down and measure whether the operation triggers large data movement.
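The first test reduces to an invariant check that is easy to automate. This sketch assumes the harness records each acknowledged (partition, offset) and value from the producer's delivery callbacks (client-specific, not shown) and later reads the same partitions back. `verify_acked` is a hypothetical helper and is not part of any Kafka client.

```python
def verify_acked(acked: dict, consumed: list):
    """Compare records the producer saw acknowledged with what a consumer read back.

    acked:    {(partition, offset): value} for every successful delivery
    consumed: [((partition, offset), value), ...] read after the failure
    Returns (missing, mismatched) as sorted lists of (partition, offset).
    """
    seen = dict(consumed)
    missing = sorted(k for k in acked if k not in seen)
    mismatched = sorted(k for k in acked if k in seen and seen[k] != acked[k])
    return missing, mismatched


acked = {("orders-0", 0): b"a", ("orders-0", 1): b"b", ("orders-0", 2): b"c"}
consumed = [(("orders-0", 0), b"a"), (("orders-0", 2), b"c"), (("orders-0", 3), b"d")]
missing, mismatched = verify_acked(acked, consumed)
print("missing:", missing, "mismatched:", mismatched)
```

Extra records that were never acknowledged, such as offset 3 here, do not count as loss. Only acknowledged records must survive.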
These tests are more useful than a generic "stateless" label. They show whether the system is stateless under the failure patterns that platform teams actually care about.
Choosing the Right Pattern
There is no universal winner among the three patterns. They solve different operational problems.
| Goal | Better starting pattern | Why |
|---|---|---|
| Keep classic Kafka but reduce disk growth | Tiered Storage | It preserves the local broker model while moving colder data remote |
| Make object storage the durable stream repository | Object storage as primary storage | It separates retained data from broker lifecycle |
| Run Kafka-compatible brokers without PV-bound durable state | Stateless brokers on Kubernetes | It aligns broker replacement and scaling with cloud-native scheduling |
| Avoid local filesystem use entirely | Strict no-temporary-disk design | It may be needed for policy reasons, but it requires stronger cache and buffering design |
For many teams, the path is evolutionary. Tiered Storage addresses retention pressure. Object-storage-primary storage addresses broker state. Stateless brokers on Kubernetes address scheduling, elasticity, and recovery. The ladder matters because each step removes a different kind of disk dependency.
AutoMQ is relevant in the second and third patterns because it starts from Kafka-compatible stateless brokers and object-storage-backed shared storage rather than treating object storage as a cold tier alone. That does not remove the need to evaluate WAL mode, cache sizing, object storage limits, or workload latency. It does make the architecture conversation cleaner: durable Kafka data is not owned by broker-local persistent volumes, so Kubernetes can operate brokers more like compute.
The safest decision process is to define "without disks" in one sentence, map it to the three layers, and then test the failure path. If the requirement is no broker-local persistent data, focus on WAL durability and shared storage metadata. If the requirement is no Kubernetes PV, focus on pod replacement and scheduling. If the requirement is no temporary disk, focus on cache alternatives and read-path load. Kafka without disks is viable only when the removed disk responsibility is replaced by a clearly designed durability, cache, and recovery model.
References
- Apache Kafka documentation, broker configuration log.dirs: https://kafka.apache.org/documentation/#brokerconfigs_log.dirs
- Apache Kafka documentation, tiered storage: https://kafka.apache.org/documentation/#tiered_storage
- Kubernetes documentation, Persistent Volumes: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
- Kubernetes documentation, Ephemeral Volumes: https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/
- AutoMQ documentation, architecture overview: https://docs.automq.com/automq/architecture/overview?utm_source=blog&utm_medium=content&utm_campaign=kafka-without-disks-architecture-patterns-for-cloud-native-kafka
- AutoMQ documentation, S3Stream shared streaming storage: https://docs.automq.com/automq/architecture/s3stream-shared-streaming-storage/overview?utm_source=blog&utm_medium=content&utm_campaign=kafka-without-disks-architecture-patterns-for-cloud-native-kafka
- AutoMQ documentation, WAL storage: https://docs.automq.com/automq/architecture/s3stream-shared-streaming-storage/wal-storage?utm_source=blog&utm_medium=content&utm_campaign=kafka-without-disks-architecture-patterns-for-cloud-native-kafka
FAQ
What does Kafka without disks mean?
It usually means brokers do not store durable partition data on local persistent disks. The phrase can also mean no Kubernetes PersistentVolume or no temporary local disk, so teams should define the requirement before selecting an architecture.
Can Apache Kafka run with no local broker disk?
Traditional Apache Kafka uses broker-local log directories for partition data. It can reduce local retention through tiered storage, but a fully diskless model requires a different storage architecture where durable writes, object storage, metadata, and cache replace the broker-local log.
Is object storage enough to make Kafka diskless?
No. Object storage can hold durable stream data, but the architecture still needs a safe write path, usually a WAL or equivalent mechanism, plus metadata and cache. Without those layers, object storage is a repository, not a complete Kafka storage architecture.
Does stateless Kafka on Kubernetes need PersistentVolumes?
A stateless broker design should not require PVs for durable partition data. It may still use ephemeral volumes or memory for cache. The important test is whether deleting a broker pod and its local storage loses acknowledged data or blocks recovery.
Where does AutoMQ fit in Kafka without disks?
AutoMQ fits the object-storage-primary and stateless Kubernetes patterns. It is Kafka-compatible, uses object-storage-backed shared storage, and separates durable stream data from broker-local disks while using WAL storage and cache to handle write and read paths.