Kafka architecture discussions often compress three different ideas into one phrase: remote storage, tiered storage, and shared storage. The compression is convenient, but it creates bad design assumptions. A team may hear "Kafka stores data in S3" and conclude that broker disks are no longer part of the durability model. Another team may enable remote log storage and expect scale-out to behave like a stateless service. Both conclusions skip the architectural boundary that matters most.
Remote storage and tiered storage primarily answer a retention question: how can a Kafka cluster keep more historical data without growing broker-local disks at the same rate? Shared storage answers a different question: can persistent Kafka data be separated from broker identity so brokers behave more like replaceable compute?
That difference affects capacity planning, failure recovery, rebalance time, cloud cost modeling, and the proof you should demand before adopting a platform. The storage medium may be the same object store. The system contract is not the same.
Why the terms get confused
The confusion starts because all three terms involve data outside broker-local disks. In Apache Kafka's tiered storage model, a cluster uses a local tier and a remote tier. Kafka documentation describes the local tier as the normal broker disk used for log segments, while the remote tier stores completed log segments in an external system such as HDFS or S3. KIP-405 uses the term remote log storage as part of the tiered storage design.
That sounds close to shared storage, especially when the remote tier is object storage. But the path from "some segments are remote" to "brokers are stateless" is not automatic. Kafka still has active partitions, leaders, followers, local log directories, controller metadata, leader epochs, and replica state. A remote tier can reduce how much data needs to remain local, but it does not by itself redefine the broker as a pure compute process.
The distinction is easiest to see by asking what the architecture treats as primary:
| Term | Main purpose | Typical broker implication |
|---|---|---|
| Remote log storage | Store copied log segments outside the broker | Brokers may still own active local replicas |
| Tiered storage | Keep recent data local and older segments remote | Local disk remains part of the hot path |
| Shared storage | Put durable log state in a shared layer | Brokers can be designed as stateless compute |
The phrases are not moral categories. Tiered storage is useful and often pragmatic. The problem appears when architects use tiered storage language to reason about shared-storage failure behavior, or use shared-storage expectations to evaluate a retention-offload feature.
What Kafka remote log storage does
Remote log storage is best understood as an extension of Kafka retention management. Kafka workloads are usually tail-heavy: producers append to the end of the log, and many consumers read data near that same end, close to the most recent records. Older records still matter for backfills, replay, audits, and downstream recovery, but they are not always read at the same intensity as fresh data.
Tiered storage exploits that access pattern. Recent active segments stay on local broker storage. Completed segments can be copied to the remote tier and later removed from the local tier according to local retention settings. Kafka exposes configuration around this boundary, such as enabling remote storage for a topic and setting local retention separately from total retention.
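As a rough illustration of that boundary, the sketch below expresses a tiered topic's settings as a plain Python dictionary and checks that local retention does not exceed total retention. The property names (`remote.storage.enable`, `retention.ms`, `local.retention.ms`) are Kafka topic configs, and tiered storage also has to be switched on at the broker level with `remote.log.storage.system.enable`. The retention values and the `check_retention` helper are assumptions made up for this example.

```python
# Illustrative topic-level settings for a tiered topic.
# Property names are Kafka's; the values are assumptions for this sketch.
DAY_MS = 24 * 60 * 60 * 1000

tiered_topic_config = {
    "remote.storage.enable": "true",         # copy completed segments to the remote tier
    "retention.ms": str(30 * DAY_MS),        # total retention across local and remote tiers
    "local.retention.ms": str(1 * DAY_MS),   # how long segments stay on broker disk
}


def check_retention(config):
    """Hypothetical helper: return a list of problems found in a tiered topic config."""
    problems = []
    if config.get("remote.storage.enable") != "true":
        problems.append("remote storage is not enabled for this topic")
    total = int(config.get("retention.ms", "-1"))
    # -2 is Kafka's "use the retention.ms value" marker for local.retention.ms.
    local = int(config.get("local.retention.ms", "-2"))
    # -1 on retention.ms means unlimited, so any local value fits under it.
    if local != -2 and total != -1 and local > total:
        problems.append("local.retention.ms exceeds retention.ms")
    return problems


print(check_retention(tiered_topic_config))
```

The check mirrors the idea that the local window is a subset of the total window: everything older than the local window is expected to be served from the remote tier.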
That model changes the retention economics and operational pressure in several ways:
- Longer retention becomes easier to justify. Historical segments do not have to consume broker-local disk for their full retention window.
- Backfill reads can use the remote tier. Consumers that need older data may fetch records that are no longer present locally.
- Local disk sizing can focus more on hot data. The cluster still needs enough local capacity and throughput for active segments, replication, page cache behavior, and operational headroom.
- The remote tier needs its own reliability model. Operators must understand the RemoteStorageManager implementation, metadata consistency, object lifecycle, permissions, and failure modes.
This is valuable work. It is also narrower than a shared-storage architecture. If a broker is still the durable owner of active local replicas, then scale-out, broker replacement, and reassignment still have a local-data dimension. Remote storage may reduce the amount of local data involved, but it does not erase the broker-data relationship.
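The local disk sizing point can be made concrete with back-of-the-envelope arithmetic. The sketch below is a simplification under stated assumptions: steady ingest, every replica keeping the full local window, and a flat headroom factor. The function name, the 30% headroom, and the workload numbers are all hypothetical.

```python
def local_disk_bytes(ingest_bytes_per_sec, local_retention_sec,
                     replication_factor, headroom=0.3):
    """Estimate cluster-wide broker-local disk for the locally retained window.

    Assumes steady ingest and that each replica holds the full local window.
    """
    raw = ingest_bytes_per_sec * local_retention_sec * replication_factor
    return raw * (1 + headroom)


# Hypothetical workload: 50 MB/s ingest, replication factor 3.
INGEST = 50_000_000
without_tiering = local_disk_bytes(INGEST, 7 * 24 * 3600, 3)  # 7 days kept locally
with_tiering = local_disk_bytes(INGEST, 6 * 3600, 3)          # 6 hours kept locally

print(f"local disk, 7 days local: {without_tiering / 1e12:.1f} TB")
print(f"local disk, 6 hours local: {with_tiering / 1e12:.1f} TB")
```

The smaller number is still not zero. The local window, replication, and headroom remain broker-local concerns, which is the broker-data relationship described above.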
What shared storage changes
Shared storage changes the binding between broker compute and persistent log data. In a shared-storage Kafka architecture, the durable log is externalized from broker-local disks into a storage layer that can be accessed by authorized brokers under strict ownership and fencing rules. Brokers still serve Kafka clients. They still handle protocol requests, leadership, fetches, caching, and runtime coordination. The durable data, however, is not permanently pinned to one broker's local disk.
That architectural shift is why shared storage is associated with stateless brokers. Stateless does not mean the process has no state. A broker still has network connections, request queues, metadata, cache, leadership assignments, metrics, and transient buffers. It means the broker should not be the long-lived owner of irreplaceable log state. If the broker disappears, the cluster should reassign compute responsibility rather than rebuild the authoritative log from that machine.
Data ownership
In broker-local Kafka, data ownership is physically intuitive. A partition replica lives in a broker log directory. The leader appends records locally, followers replicate records, and high availability depends on the in-sync replica set. The controller can move replicas, but durable data placement is still expressed through broker-local replicas.
In shared storage, data ownership is mediated by metadata and storage-layer rules. The system must know which broker is allowed to append, where the next offset begins, how leader epochs are fenced, and how readers find the correct records. This is not a looser model. It is stricter in a different place. When multiple brokers can reach the same durable storage substrate, correctness depends on preventing stale writers and ambiguous ownership.
For architects, the first review question is not "does it use S3?" It is "where is the authoritative copy of the log, and what prevents the wrong broker from writing to it?"
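One way to picture the answer to that second question is an epoch check on every append. The toy model below is a minimal sketch, not the mechanism of Kafka, AutoMQ, or any specific product: the class and method names are hypothetical, and an in-memory list stands in for the shared durable layer.

```python
class StaleEpochError(Exception):
    """Raised when a writer that no longer owns the partition tries to append."""


class SharedLogPartition:
    """Toy shared log: any broker can reach it, only the current owner may append."""

    def __init__(self):
        self.owner_epoch = 0   # bumped every time ownership is reassigned
        self.records = []      # stand-in for durable shared storage

    def grant_ownership(self):
        """Control-plane action: hand the partition to a new owner, return its epoch."""
        self.owner_epoch += 1
        return self.owner_epoch

    def append(self, writer_epoch, record):
        """Append only if the writer holds the current epoch; return the record offset."""
        if writer_epoch != self.owner_epoch:
            raise StaleEpochError(
                f"writer epoch {writer_epoch} is fenced by epoch {self.owner_epoch}"
            )
        self.records.append(record)
        return len(self.records) - 1


partition = SharedLogPartition()
old_owner = partition.grant_ownership()
partition.append(old_owner, "a")

new_owner = partition.grant_ownership()   # ownership moves to another broker
try:
    partition.append(old_owner, "late write")
except StaleEpochError as err:
    print("rejected:", err)
partition.append(new_owner, "b")
```

In a real system the epoch comparison has to be enforced atomically by the storage layer or a coordination service. A check held only in broker memory would not stop a stale writer.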
Scaling behavior
Scaling is where the terminology gap becomes operational. In a tiered design, adding brokers can still require moving partition replicas or changing replica placement for balance. If hot data lives locally, the system must account for local disk content, catch-up, and replica movement. Remote storage reduces retention pressure, but active placement still matters.
In a shared-storage design, scale-out can be shaped more like compute scheduling. A new broker joins, the control plane assigns partition responsibility, ownership is fenced, traffic is rebalanced, and caches warm as runtime state. The durable log does not need to be copied as a prerequisite for the broker to become useful.
This does not make scale-out free. The new broker may need metadata, cache warm-up, network bandwidth, quota headroom, and read-path preparation. Object storage request patterns and the write-ahead path still have to be engineered for latency and efficiency. The important distinction is whether scaling is dominated by moving durable history between broker disks or by assigning compute responsibility over shared durable data.
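The difference between the two scaling shapes can be sketched with simple arithmetic. Every number below is made up for illustration, both helper functions are hypothetical, and the model ignores cache warm-up, metadata propagation, and traffic draining. It only contrasts a cost that grows with bytes moved against a cost that grows with the number of ownership handoffs.

```python
def replica_move_seconds(bytes_to_move, throttle_bytes_per_sec):
    """Time to copy local replica data to a new broker at a fixed throttle."""
    return bytes_to_move / throttle_bytes_per_sec


def reassignment_seconds(partitions, per_partition_handoff_sec):
    """Time to hand over partition ownership when no durable data is copied."""
    return partitions * per_partition_handoff_sec


# Hypothetical inputs: 2 TB of local replica data moved at 100 MB/s,
# versus 500 partitions handed over at an assumed 50 ms each.
copy_time = replica_move_seconds(2e12, 100e6)
handoff_time = reassignment_seconds(500, 0.05)

print(f"data-copy shaped scale-out: ~{copy_time / 3600:.1f} hours")
print(f"assignment shaped scale-out: ~{handoff_time:.0f} seconds")
```

The useful question for a real evaluation is which of these two terms dominates under production load, and that has to be measured rather than assumed.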
Failure recovery
Broker failure also exposes the difference. With remote log storage, the cluster may have older completed segments in the remote tier, but the failure path still depends on the replication and local-log model for active data. Kafka can elect another in-sync replica, and operators may later restore the desired replica layout. Remote segments help reduce what must remain local, but they do not necessarily remove local replica recovery from the operational story.
With shared storage, broker recovery should be closer to replacement of a failed compute node. The controller detects the loss, assigns responsibility to healthy brokers, and those brokers continue from shared durable state after fencing and metadata checks. Local cache loss may hurt performance temporarily, but cache loss is different from durable log loss.
The hard part is proving the write path. A shared-storage system needs a precise answer to when a producer acknowledgment becomes durable. Some designs use a write-ahead layer to provide low-latency persistence before data is organized into object storage. Others may use different storage primitives. Either way, the architecture must explain how acknowledged records survive broker loss, storage delay, controller failover, and leadership change.
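The acknowledgment rule can be illustrated with a toy model in which a record is acknowledged only after it reaches storage that outlives the broker process. This is a sketch under heavy assumptions, not a description of any vendor's write path: `ToyBroker`, `recover`, and the two Python lists standing in for a durable WAL and an object store are all hypothetical.

```python
class ToyBroker:
    """Broker compute that holds no durable state of its own."""

    def __init__(self, wal, object_store):
        self.wal = wal                    # assumed durable and reachable by other brokers
        self.object_store = object_store  # assumed durable, holds organized history

    def produce(self, record):
        # Persist first, acknowledge second: the ack is returned only after the
        # record sits in storage that survives this broker process.
        self.wal.append(record)
        return "ack"

    def upload(self):
        # Move WAL contents into the object store, then trim the WAL.
        self.object_store.extend(self.wal)
        self.wal.clear()


def recover(wal, object_store):
    """Records a replacement broker can serve after the original broker is gone."""
    return list(object_store) + list(wal)


wal, store = [], []
broker = ToyBroker(wal, store)
for record in ["r0", "r1", "r2"]:
    broker.produce(record)
broker.upload()
broker.produce("r3")   # acknowledged, but not yet uploaded to the object store
del broker             # simulate losing the broker process

print(recover(wal, store))
```

The model leaves out the hard parts named above, such as storage delay, controller failover, and fencing during leadership change. Its only purpose is to show why the position of the acknowledgment relative to the durable write is the thing to verify.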
Where AutoMQ fits
AutoMQ enters this discussion as an example of the shared-storage direction rather than as another name for tiered storage. It is Kafka-compatible at the protocol layer, but its storage architecture replaces Kafka's broker-local log storage with S3Stream, a shared streaming storage layer built on object storage. AutoMQ documentation describes brokers as stateless because storage and compute are separated and Kafka log storage is offloaded to cloud storage through S3Stream.
The useful architectural point is not that AutoMQ uses object storage. Tiered storage can also use object storage. The more important point is that AutoMQ treats object storage as the primary durable data location, while WAL storage and caching address the latency, write efficiency, and recovery needs that raw object APIs do not solve alone. In that model, brokers are closer to Kafka protocol compute nodes over shared durable state than to fixed owners of local persistent replicas.
This is also why "diskless broker" should be interpreted carefully. A broker may still use local resources for cache, temporary files, or runtime acceleration depending on deployment choices. The architectural claim is about durable log ownership: persistent Kafka data should not be tied to the lifetime of one broker's local disk.
AutoMQ should not be evaluated by assuming every shared-storage benefit appears automatically because an object bucket exists. The right evaluation is workload-shaped:
- Can the system preserve Kafka client expectations for produce, fetch, offsets, metadata, and operational tooling?
- Where does the write-ahead path acknowledge durability?
- How does broker replacement behave during high produce traffic and consumer catch-up?
- What happens when object storage has elevated latency, throttling, or request failures?
- How much cache is required to meet tail-read latency goals?
Those are architecture questions, not marketing questions. They are also the questions that separate a true shared-storage design from a retention offload feature.
Questions to ask before choosing
The practical decision is not "remote storage bad, shared storage good." It is "which problem dominates this Kafka estate?"
Choose remote or tiered storage when long retention is the main pain and the current broker-local architecture is otherwise acceptable. If your cluster is stable, scaling events are infrequent, broker replacement is well rehearsed, and the pressure is mostly historical data occupying expensive local disks, tiered storage may be the direct fix. It lets you keep Kafka's familiar operating model while changing the storage lifecycle for completed segments.
Evaluate shared storage when the storage-compute coupling itself is the pain. This often appears in cloud environments with elastic workloads, frequent cluster resizing, many partitions, operational pressure around broker replacement, or data movement that makes maintenance windows too expensive. In those cases, the goal is not only to keep more history. The goal is to stop treating each broker as the permanent home of a slice of durable history.
A short review checklist helps keep the decision grounded:
- Write path: At what exact point is a produce acknowledgment durable across broker failure?
- Read path: Which reads come from cache, local hot data, WAL, remote tier, or shared object storage?
- Broker failure: Does replacement require reconstructing local replicas, or can another broker attach to durable state?
- Scale-out: Does adding brokers trigger data copying, metadata reassignment, cache warm-up, or all three?
- Retention: Is the objective longer history, lower broker disk pressure, or a different broker ownership model?
- Operational proof: Can the vendor or implementation show failure drills under production-shaped load?
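The read-path item in the checklist can be made concrete by writing down which layer serves a given fetch offset. The sketch below assumes a tiered layout in which offsets older than the local log start offset have to come from the remote tier. The function, its parameters, and the optional cache range are hypothetical simplifications.

```python
def fetch_source(offset, log_start, local_log_start, log_end, cached=range(0)):
    """Classify where a fetch at `offset` would be served from.

    log_start:       oldest offset still retained anywhere
    local_log_start: oldest offset still present on broker-local disk
    log_end:         next offset to be written (exclusive upper bound)
    cached:          range of offsets assumed to be held in memory
    """
    if not (log_start <= offset < log_end):
        return "out_of_range"
    if offset in cached:
        return "cache"
    if offset >= local_log_start:
        return "local"
    return "remote"


# Hypothetical partition: offsets 0..999 retained, 800..999 on local disk,
# 950..999 held in cache.
for offset in (10, 850, 990, 1000):
    print(offset, fetch_source(offset, 0, 800, 1000, cached=range(950, 1000)))
```

Filling in the same kind of table for a candidate platform, with real latency numbers per source, is a practical way to answer the read-path question during evaluation.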
The most common architecture mistake is evaluating the storage service instead of the system contract. S3, Azure Blob, Google Cloud Storage, another object store, or a distributed file system such as HDFS can participate in different designs. What matters is how Kafka semantics are preserved above that storage and how the broker behaves when infrastructure changes.
References
- Apache Kafka Tiered Storage documentation
- Apache Kafka Tiered Storage configuration
- Apache Kafka KIP-405: Kafka Tiered Storage
- Apache Kafka documentation
- AutoMQ S3Stream overview
- AutoMQ S3 Storage documentation
- AutoMQ Stateless Broker documentation
FAQ
Is Kafka remote storage the same as shared storage?
No. Kafka remote storage usually means log segments are copied to an external storage system, often as part of tiered storage. Shared storage means the durable log layer is externalized from broker-local disks so broker compute can be reassigned without treating local disk data as the authoritative home.
Is Kafka tiered storage stateless?
Not by itself. Tiered storage can reduce the amount of historical data retained on broker disks, but active local replicas can still make brokers stateful. A stateless broker architecture requires durable log ownership to be independent of broker-local storage.
Why does shared storage need fencing?
If multiple brokers can reach the same durable storage layer, the system must prevent stale leaders from appending after ownership changes. Fencing, leader epochs, metadata consistency, and write-path rules are essential to preserving Kafka ordering and durability semantics.
Does shared storage remove the need for local cache?
No. Shared storage changes where durable data lives, but local cache can still be important for tail reads, repeated fetches, and latency control. The difference is that cache should be rebuildable runtime state, not the authoritative log.
When is tiered storage enough?
Tiered storage may be enough when the main issue is long retention and broker-local operations are otherwise acceptable. It is a targeted way to reduce local disk pressure while keeping the traditional Kafka operating model.
Where does AutoMQ fit in this comparison?
AutoMQ is a Kafka-compatible shared-storage architecture that uses S3Stream to store durable stream data on object storage, with WAL storage and caching used to support low-latency writes and reads. It is best evaluated when elasticity, broker replacement, and storage-compute separation are central requirements.