Kafka on S3 | Architecture Options Beyond Tiered Storage

"Kafka on S3" is a useful search phrase and a dangerous architecture phrase. One team may mean a Kafka Connect sink that copies topic data into an S3 bucket. Another may mean Apache Kafka tiered storage, where older log segments move to remote storage while brokers keep the active local log. A third may mean a Kafka-compatible system where S3-compatible object storage is the durable storage layer, not a sidecar archive.

Those designs all include S3, but they do not solve the same problem. Backup and archive patterns serve downstream analytics and disaster-recovery workflows. Tiered storage keeps long retention from consuming as much broker-local disk. Shared-storage or diskless Kafka changes the ownership model: brokers become closer to stateless compute because durable log data is no longer primarily tied to their local disks.

[Figure: Kafka-on-S3 architecture ladder]

The distinction matters because S3 is not magic dust sprinkled over Kafka. It is a storage service with its own durability model, latency profile, request pricing, retrieval behavior, and network paths. Whether it improves your Kafka architecture depends on where S3 sits in the data path.

What People Mean By Kafka On S3

The phrase usually bundles four patterns into one label. The first is export: Kafka continues to run normally, and a connector or consumer writes a copy of events into S3 for a data lake, audit store, or recovery copy. This is often the lowest-risk use of S3 because it leaves Kafka's storage model untouched.

The second pattern is archive. Events leave the Kafka log and land in S3 as Parquet, JSON, or Avro files. That is valuable for analytics, but it is not the same as preserving Kafka's topic-partition-offset abstraction for online consumers. Once data becomes an external archive, replay semantics move out of Kafka and into the tooling that reads the archive.

The third pattern is tiered storage, introduced to Apache Kafka by KIP-405 and often discussed as remote log storage. Kafka keeps a local tier on brokers for active segments and uses remote storage for eligible completed segments. S3 can be the backing object store, depending on the implementation. The important point is that brokers still have local state and still own the active part of the log.

The fourth pattern is shared-storage or diskless Kafka. In this model, object storage is designed into the primary durable storage path. Brokers serve Kafka protocol traffic, but durable data is written through a log layer and persisted into shared object storage. This is the pattern AutoMQ follows with S3Stream, WAL storage, and S3-compatible object storage.

| Pattern | What S3 Stores | Kafka Broker Impact | Main Problem Solved |
| --- | --- | --- | --- |
| Backup/export | A copied stream or selected topics | No change to broker storage | Data lake ingestion, audit copy, DR export |
| Archive | Historical data outside Kafka's live log | Kafka may no longer serve archived data as offsets | Long-term analytics retention |
| Tiered storage | Older completed log segments | Active log and broker state remain local | Broker disk pressure from retention |
| Shared-storage / diskless Kafka | Durable streaming data | Brokers become less tied to local disks | Elastic compute, replacement, scale-in/out |

This table is the first filter for any "Kafka on S3" plan. If the team cannot say which row it means, the design is still a slogan.

Backup And Archive Patterns

Backup and archive patterns are popular because they work with Kafka instead of changing Kafka. A sink connector, stream processing job, lakehouse ingestion pipeline, or custom consumer reads from topics and writes files into S3. Kafka still stores the operational log on broker-attached storage, and S3 receives a copy.

That design has real value. S3 is widely used for durable object storage: AWS states that Amazon S3 is designed for 99.999999999% (11 nines) object durability, and most of its storage classes store data redundantly across multiple Availability Zones. It also gives teams storage classes, lifecycle policies, and integration with analytics tools. For compliance, audit, and data lake workloads, exporting Kafka data to S3 can be exactly the right answer.

But it does not make Kafka itself lower-cost or more elastic by default. Brokers still need enough local storage for retention, enough replica capacity for durability, and enough headroom for failures and rebalancing. If the cluster stores seven days locally and exports a copy to S3, the S3 copy adds capability; it does not remove the local Kafka storage requirement unless retention and recovery assumptions change.
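The arithmetic behind that point can be sketched in a few lines. This is a rough model under stated assumptions, not a sizing tool: the function name `storage_footprint` and its parameters are hypothetical, and it ignores compression, index files, and failure headroom.

```python
SECONDS_PER_DAY = 86_400


def storage_footprint(ingress_bytes_per_sec, kafka_retention_days,
                      replication_factor, s3_copy_retention_days=0):
    """Rough byte counts for a cluster that may also export a copy to S3.

    Assumption: steady ingress, no compression, no headroom.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    logical_bytes_per_day = ingress_bytes_per_sec * SECONDS_PER_DAY
    # Every replica holds the full retained log on broker-attached storage.
    broker_local = logical_bytes_per_day * kafka_retention_days * replication_factor
    # The export is an additional copy; it does not replace the local log.
    s3_copy = logical_bytes_per_day * s3_copy_retention_days
    return {"broker_local_bytes": broker_local, "s3_copy_bytes": s3_copy}


# Illustrative inputs only: 10 MB/s ingress, replication factor 3.
without_export = storage_footprint(10_000_000, 7, 3)
with_export = storage_footprint(10_000_000, 7, 3, s3_copy_retention_days=365)
shorter_retention = storage_footprint(10_000_000, 2, 3, s3_copy_retention_days=365)
```

In this sketch, adding the export leaves `broker_local_bytes` unchanged. Only shortening Kafka retention, which is a change in retention and recovery assumptions, reduces the broker-local figure.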

The honest checklist for this pattern is short:

  • Can consumers replay from Kafka or only from S3? If replay moves to S3, the application contract changes. Offset handling, ordering, filtering, and read tooling must be designed explicitly.
  • Is S3 the recovery source or an analytics sink? A recovery archive needs validation, restore procedures, schema compatibility, and time-to-recover tests. An analytics sink has a different bar.
  • Are duplicate bytes acceptable? Export usually means Kafka retains data and S3 retains data. That may be intentional, but it should appear in the cost model.
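To make the first checklist item concrete, here is a minimal sketch of what "replay from S3" can require from read tooling. It assumes a hypothetical key layout, `topics/<topic>/partition=<n>/<start offset>.json`, in which each object name carries the first Kafka offset it contains. Real sink connectors choose their own layouts, so treat the pattern and function names as illustrative.

```python
import bisect
import re

# Hypothetical layout: topics/<topic>/partition=<n>/<zero-padded start offset>.json
KEY_PATTERN = re.compile(
    r"^topics/(?P<topic>[^/]+)/partition=(?P<partition>\d+)/(?P<start>\d+)\.json$"
)


def build_index(keys, topic, partition):
    """Return (start_offset, key) pairs for one topic-partition, sorted by offset."""
    index = []
    for key in keys:
        match = KEY_PATTERN.match(key)
        if match and match["topic"] == topic and int(match["partition"]) == partition:
            index.append((int(match["start"]), key))
    index.sort()
    return index


def object_for_offset(index, offset):
    """Return the key of the object with the largest start offset <= offset.

    Returns None when the offset precedes every archived object. This does not
    prove the offset exists inside the object; gaps must be checked on read.
    """
    starts = [start for start, _ in index]
    position = bisect.bisect_right(starts, offset) - 1
    if position < 0:
        return None
    return index[position][1]


keys = [
    "topics/orders/partition=0/0000001000.json",
    "topics/orders/partition=0/0000000000.json",
    "topics/orders/partition=1/0000000000.json",
]
index = build_index(keys, "orders", 0)
target = object_for_offset(index, 1500)
```

Even this small lookup shows the contract change: ordering and offset resolution are now the reader's responsibility rather than something the Kafka consumer protocol provides.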

Backup and archive patterns are not inferior. They are precise. The mistake is expecting them to fix broker-local storage mechanics they were never designed to touch.

Tiered Storage And Remote Log Storage

Tiered storage moves the conversation one layer deeper. Apache Kafka's tiered storage documentation describes a local tier and a remote tier: broker-local storage serves active data, while older log segments can be moved to remote storage. The goal is to reduce pressure on broker disks and make longer retention more practical without forcing every retained byte to stay on local volumes.

This is a strong fit when the workload has a large cold tail. A fraud topic retained for a few hours and read constantly may not benefit much. A compliance topic retained for months and replayed rarely is a better candidate. Tiered storage lets the cluster keep the hot working set local while pushing older segments into remote object storage.
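A quick way to judge whether a topic has a large cold tail is to compare its total retention with the portion kept on the local tier. The sketch below assumes steady ingress and uses the topic-level settings `retention.ms` and `local.retention.ms` as inputs. The function name `tier_split` is hypothetical, and the model ignores the active segment, replication, and the fact that a rolled segment may exist in both tiers for a while.

```python
def tier_split(ingress_bytes_per_sec, retention_ms, local_retention_ms):
    """Estimate how many retained bytes stay local versus remote-only.

    Assumptions: steady ingress, one logical copy, no compression.
    """
    if retention_ms <= 0:
        raise ValueError("retention_ms must be positive")
    if not 0 <= local_retention_ms <= retention_ms:
        raise ValueError("local_retention_ms must be between 0 and retention_ms")
    total = ingress_bytes_per_sec * retention_ms // 1000
    local = ingress_bytes_per_sec * local_retention_ms // 1000
    remote_only = total - local
    return {
        "local_bytes": local,
        "remote_only_bytes": remote_only,
        "remote_fraction": remote_only / total if total else 0.0,
    }


DAY_MS = 86_400_000

# Compliance-style topic: retained for 30 days, 1 day kept local.
compliance = tier_split(1_000_000, 30 * DAY_MS, 1 * DAY_MS)

# Hot topic: retained for 6 hours, all of it local.
hot = tier_split(1_000_000, DAY_MS // 4, DAY_MS // 4)
```

In this model the compliance topic pushes roughly 29/30 of its retained bytes off broker disks, while the short-lived hot topic moves nothing, which matches the intuition in the paragraph above.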

The architectural boundary is as important as the benefit. Tiered storage does not make Kafka brokers stateless. Brokers still own partitions, leaders still serve active traffic, the local tier still matters for hot reads and writes, and operational events still involve broker placement and partition ownership. Remote storage reduces one kind of pressure; it does not erase the stateful broker model.

That boundary explains why tiered storage improves some operational stories but leaves others mostly intact:

  • Retention cost: Strong fit. Older data can move away from broker-local disk capacity.
  • Broker replacement: Partial fit. Historical segments may be remote, but active state and leadership still need normal Kafka recovery behavior.
  • Scale-out: Partial fit. The cluster may avoid moving large cold segments, but traffic, leaders, and active partitions still need balancing.
  • Scale-in: Limited fit. Removing brokers still has to respect partition ownership, capacity, and the local active tier.
  • Latency-sensitive reads: Depends on hot/cold access. Remote reads can behave differently from local reads, so cold replay should be benchmarked rather than assumed.

Tiered storage is often the pragmatic upgrade for Apache Kafka estates. It is not the same thing as building Kafka on shared storage, and treating those two as synonyms leads to bad architecture reviews.

Shared-Storage And Diskless Kafka

Shared-storage Kafka starts from a different question. Instead of asking when old log segments can leave the broker, it asks why durable log data should be owned by individual broker disks at all. That shift changes the shape of scaling, replacement, and capacity planning.

In a diskless or shared-storage design, brokers become compute nodes in front of a durable shared storage layer. A write path still needs low-latency persistence; sending every producer write straight to object storage without an acceleration layer would collide with the latency expectations of many Kafka workloads. That is why these architectures usually include a write-ahead log or similar fast persistence layer before data is compacted or uploaded into object storage.
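The shape of that write path can be illustrated with a toy, in-memory model. This is not how any particular product implements it: `ToyLogLayer`, its threshold, and the object key format are all hypothetical, the "WAL" is a Python list, and the "object store" is a dict standing in for an S3-compatible bucket.

```python
class ToyLogLayer:
    """Toy write path: acknowledge after the WAL append, upload batches later."""

    def __init__(self, upload_threshold_bytes):
        self.upload_threshold_bytes = upload_threshold_bytes
        self.wal = []            # stand-in for a fast persistence layer
        self.wal_bytes = 0
        self.object_store = {}   # stand-in for an S3-compatible bucket
        self.next_offset = 0

    def append(self, record: bytes) -> int:
        """Persist to the WAL and return the assigned offset (the 'ack')."""
        offset = self.next_offset
        self.wal.append((offset, record))
        self.wal_bytes += len(record)
        self.next_offset += 1
        if self.wal_bytes >= self.upload_threshold_bytes:
            self.flush()
        return offset

    def flush(self):
        """Upload everything in the WAL as one object, then trim the WAL."""
        if not self.wal:
            return
        first, last = self.wal[0][0], self.wal[-1][0]
        key = f"log/{first:020d}-{last:020d}"
        self.object_store[key] = list(self.wal)
        self.wal.clear()
        self.wal_bytes = 0

    def read(self, offset: int) -> bytes:
        """Serve recent data from the WAL, older data from uploaded objects."""
        for stored_offset, record in self.wal:
            if stored_offset == offset:
                return record
        for batch in self.object_store.values():
            for stored_offset, record in batch:
                if stored_offset == offset:
                    return record
        raise KeyError(offset)


layer = ToyLogLayer(upload_threshold_bytes=10)
for payload in (b"aaaa", b"bbbb", b"cccc", b"d"):
    layer.append(payload)
```

The design point the toy captures is that the producer-facing acknowledgement depends on the fast layer, while object storage receives larger batched writes. A real system also has to handle WAL replication, crash recovery, and concurrent readers, none of which is modeled here.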

The operational payoff is not "S3 is lower cost" in isolation. The payoff is that durable state is no longer primarily a broker-local asset. When compute and durable storage separate, the platform has more freedom to replace a failed broker, add capacity for a burst, or scale in after traffic drops without copying the full retained log between broker disks.

[Figure: Problem-solution coverage matrix]

This is also where the S3 discussion needs the most discipline. S3 pricing depends on region, storage class, requests, retrieval, data transfer, and other dimensions. AWS publishes those components separately on its pricing page. A serious Kafka-on-S3 model should keep the same separation: broker compute, WAL or log acceleration storage, object storage capacity, object operations, network paths, product fees, and migration overlap should each have their own line.
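One way to enforce that separation is to give each component its own field in the model, so nothing gets folded into a single "storage" number. The sketch below is a hypothetical structure, not a pricing calculator: the class name `MonthlyCostModel` is invented for illustration, and every amount is a placeholder the reader must replace with figures for their own region, storage class, and contract.

```python
from dataclasses import dataclass, fields


@dataclass
class MonthlyCostModel:
    """One field per cost line named in the text; all values are user-supplied."""

    broker_compute: float = 0.0
    wal_storage: float = 0.0
    object_storage_capacity: float = 0.0
    object_operations: float = 0.0
    network_paths: float = 0.0
    product_fees: float = 0.0
    migration_overlap: float = 0.0

    def total(self) -> float:
        return sum(getattr(self, f.name) for f in fields(self))

    def breakdown(self) -> dict:
        """Share of the total per line, so no single line hides the others."""
        total = self.total()
        if total == 0:
            return {f.name: 0.0 for f in fields(self)}
        return {f.name: getattr(self, f.name) / total for f in fields(self)}


# Placeholder amounts in arbitrary currency units, not real prices.
model = MonthlyCostModel(
    broker_compute=100.0,
    wal_storage=20.0,
    object_storage_capacity=50.0,
    object_operations=30.0,
)
shares = model.breakdown()
```

Keeping object operations and network paths as separate lines matters because they scale with request volume and topology rather than with stored bytes, so they can move independently of capacity.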

Shared storage is powerful, but it is not a shortcut around benchmarking. Teams still need to measure produce latency, consumer lag, cold replay, failure recovery, object request volume, and network topology in the target environment. The architecture reduces a class of broker-local data movement; it does not suspend physics.

How AutoMQ Uses S3-Compatible Storage

AutoMQ is an example of the shared-storage direction. It keeps Kafka protocol compatibility while replacing the broker-local storage center of gravity with an object-storage-based design. AutoMQ documentation describes S3Stream as a shared streaming storage layer and describes WAL storage as the low-latency persistence path used before data is uploaded to object storage.

That architecture is different from adding an S3 sink or turning on a remote tier. Object storage is in the primary durable storage model, and brokers are designed to be closer to stateless compute. For teams already operating Kafka, the important question is not whether S3 exists in the diagram; it is whether S3 is beside Kafka, behind Kafka's cold tier, or inside the primary storage path.

[Figure: AutoMQ S3Stream write path]

AutoMQ's approach is especially relevant when the pain is broader than retention. Long retention is one driver, but platform teams usually feel the deeper problem during operations: broker replacement takes planning, scaling requires data movement, partitions become uneven, and capacity is provisioned for failure cases rather than normal load. A shared-storage architecture targets that statefulness directly.

The evaluation should still be practical:

  • Kafka compatibility: Validate clients, connectors, security, quotas, topic configuration, observability, and operational workflows.
  • WAL placement: Understand where the WAL lives, how it is replicated, and what latency envelope it provides.
  • Object storage behavior: Model storage class, request volume, read patterns, and network paths instead of assuming all S3 bytes behave the same.
  • Migration shape: Plan dual-running, rollback, data validation, and application cutover as part of the architecture, not as an afterthought.

AutoMQ belongs in this article because it makes the shared-storage pattern concrete. The broader lesson applies beyond any one product: object storage only changes Kafka operations when it moves from a side destination into the durable storage design.

Decision Guide

Start by naming the problem. If the problem is "we need Kafka data in a lake," use an export or archive pipeline. If the problem is "long retention is filling broker disks," evaluate tiered storage. If the problem is "stateful brokers make scaling, replacement, and capacity planning painful," evaluate shared-storage or diskless Kafka.
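The triage in that paragraph is simple enough to write down as a function, which can be useful as a checklist in a design review. The names `Pattern` and `candidate_patterns` are hypothetical, and the output is a list of patterns to evaluate, not a recommendation.

```python
from enum import Enum
from typing import List


class Pattern(Enum):
    EXPORT_OR_ARCHIVE = "export or archive pipeline"
    TIERED_STORAGE = "tiered storage"
    SHARED_STORAGE = "shared-storage or diskless Kafka"


def candidate_patterns(needs_lake_copy: bool,
                       retention_fills_broker_disks: bool,
                       stateful_brokers_are_painful: bool) -> List[Pattern]:
    """Map the named problems to the patterns worth evaluating, in article order."""
    candidates = []
    if needs_lake_copy:
        candidates.append(Pattern.EXPORT_OR_ARCHIVE)
    if retention_fills_broker_disks:
        candidates.append(Pattern.TIERED_STORAGE)
    if stateful_brokers_are_painful:
        candidates.append(Pattern.SHARED_STORAGE)
    return candidates


# A team that only needs Kafka data in a lake:
lake_only = candidate_patterns(True, False, False)

# A team with both retention pressure and painful broker operations:
retention_and_ops = candidate_patterns(False, True, True)
```

An empty result is itself informative: if none of the three problems applies, the plan has not yet named what S3 is supposed to fix.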

The table below is a useful first-pass guide, not a procurement decision:

| Requirement | Backup / Archive | Tiered Storage | Shared-Storage / Diskless Kafka |
| --- | --- | --- | --- |
| Data lake copy | Strong fit | Not the primary goal | Possible, but not the reason to adopt |
| Long Kafka retention | Weak unless Kafka retention changes | Strong fit | Strong fit |
| Keep Kafka replay semantics for old data | Weak to partial | Strong fit | Strong fit, subject to implementation |
| Reduce broker-local disk pressure | Weak | Strong fit for cold data | Strong fit by design |
| Make brokers easier to replace | Weak | Partial | Strong fit |
| Scale compute independently from retained data | Weak | Partial | Strong fit |
| Minimize migration blast radius | Strong fit | Medium | Requires deeper validation |

The cleanest rule is this: use S3 beside Kafka for copies, behind Kafka for cold segments, and inside Kafka-compatible storage only when broker-local ownership is the problem you are trying to remove. A lot of confusion disappears once those three positions are separated.

As a search term, "Kafka on S3" will continue to mean all of them. For architecture work, it should never be allowed to stay that vague.

FAQ

Is Kafka on S3 the same as Kafka tiered storage?

No. Tiered storage is one Kafka-on-S3 pattern among several. Kafka data can also be exported to S3 for analytics, archived outside the Kafka log, or stored in a shared-storage Kafka-compatible architecture where object storage is part of the primary durable data path.

Can Apache Kafka use S3 as its main storage layer?

Apache Kafka's standard architecture is still broker-centric: brokers own partitions and local log segments. Kafka tiered storage can use remote storage for older segments, but the active local tier remains part of the design. Using S3 as the main durable storage layer requires a Kafka-compatible architecture built around shared storage, such as AutoMQ's S3Stream-based design.

Does S3 make Kafka lower-cost?

S3 can lower the cost pressure of retained data when it replaces or reduces expensive broker-local storage, especially for long retention. It can also add request, retrieval, network, and operational costs. The answer depends on whether S3 is used for a copy, a remote tier, or the primary durable storage layer.

What is diskless Kafka?

Diskless Kafka usually means a Kafka-compatible architecture where brokers are not the long-term owners of durable log data on local disks. A shared storage layer holds durable data, and a WAL or similar persistence layer handles low-latency writes and recovery.

When should I use tiered storage instead of diskless Kafka?

Use tiered storage when the main problem is long retention and cold data on broker disks. Evaluate diskless or shared-storage Kafka when the pain includes broker replacement, scale-in/out, over-provisioning, rebalancing, or repeated data movement during operations.

What should I benchmark before adopting Kafka on S3?

Benchmark produce latency, p99 behavior, consumer lag, cold replay, failure recovery, broker replacement, scale-out, scale-in, object request volume, and network paths. Also validate client compatibility, security, observability, operational runbooks, and rollback procedures before treating the change as storage-only.
