Choosing Between Kafka Export, Lakehouse Ingestion, and Shared Storage

Teams usually search for "Kafka on S3" when the local-disk assumptions behind their Kafka platform stop matching the economics of their cloud environment. The phrase sounds specific, but it hides several different decisions. One team may want to copy Kafka records into Amazon S3 for retention and analytics. Another may want a lakehouse ingestion path that turns streams into governed tables. A third may be asking whether Kafka itself can stop treating broker-attached disks as the durable center of the system.

Those are not variations of the same architecture. They answer different ownership questions: who owns the durable log, who owns replay, who owns schema evolution, and who pays for replication and cross-zone movement. Treating them as one bucket leads to weak evaluations. A connector pipeline can be a good export mechanism and still do nothing for broker recovery time. Tiered storage can reduce cold-retention pressure and still leave the active log tied to local disks. Shared storage can change the broker operating model, but it has to prove Kafka compatibility, write-path latency, cache behavior, and failure recovery.

[Figure: Kafka on S3 decision map]

Why "Kafka on S3" means more than one thing

Object storage is attractive because it is elastic, durable, and already central to data lakes. Amazon S3 is documented as object storage for any amount of data across a wide range of use cases, which is why platform teams naturally ask whether Kafka data should land there too. The hard part is that Kafka does more than produce files. It is a log with ordering, offsets, consumer groups, transactions, ACLs, compaction, retention, and client expectations built over years of production use.

That distinction separates three common paths:

  • Kafka export. Kafka remains the online system of record, while records are copied to S3 for archive, replay outside Kafka, audit, or batch analytics.
  • Lakehouse ingestion. Kafka feeds table formats, query engines, and governance workflows. The destination becomes optimized for analytical reads, not necessarily for Kafka-native fetch semantics.
  • Kafka-compatible shared storage. The streaming platform keeps the Kafka-facing contract while moving durable stream storage away from broker-local disks and into a shared storage design.

The first two paths are data movement patterns. The third is a data plane architecture choice. Mixing them in one comparison is the fastest way to buy the wrong thing.

Export is useful, but it is not a Kafka storage architecture

Exporting Kafka records to S3 is often the right answer when the question is downstream access. You may need immutable raw events for audit. You may want a lower-cost historical copy for ad hoc analytics. You may want a buffer between operational streams and a lakehouse where query engines, catalog policies, and lifecycle rules are better suited than Kafka consumers.

The tradeoff is that export creates a second representation of the data. The Kafka cluster still needs enough broker storage and replication to satisfy the online workload. The exported objects usually have their own partitioning, file sizes, schema handling, retry behavior, and compaction rules. That is useful for analytics, but it does not remove the operational work of maintaining Kafka as a stateful broker fleet.

The evaluation should start with failure ownership. If a broker dies, can the export copy rebuild Kafka service quickly, or is it mainly useful after the fact? If a consumer needs low-latency replay from a committed offset, does it read Kafka or a lake object? If a producer uses transactions, idempotence, or strict ordering assumptions, where are those semantics verified? Export can be excellent, but it is not a substitute for the Kafka log unless the consuming applications are also redesigned around the exported format.

Lakehouse ingestion changes the reader, not always the writer

Lakehouse ingestion is stronger than raw export when the target is analytical value. The pipeline can normalize schemas, produce table metadata, compact files, and make events queryable by engines that were never designed to consume Kafka partitions directly. For teams building fraud analytics, feature pipelines, clickstream reporting, or compliance archives, this can be the right center of gravity.

But lakehouse ingestion still leaves two questions outside the table layer. First, the producer path still lands in Kafka before it reaches the lakehouse. Second, operational replay may still depend on Kafka offsets and consumer group behavior rather than table snapshots. The lakehouse becomes a durable analytical store, while Kafka remains the online coordination point for many applications.

That difference matters during incidents. A platform team can have a perfect S3-backed lakehouse and still spend a weekend expanding Kafka broker disks, moving partitions, or managing cross-zone traffic. The lakehouse has solved the analytics problem. It has not necessarily solved the Kafka platform problem.

Tiered storage is a bridge, not the same as shared storage

Apache Kafka's tiered storage work, described in KIP-405, moves older log segments to remote storage while keeping the active Kafka model intact. That is valuable. It can reduce pressure from long retention windows and make historical fetches less dependent on local disk capacity. For many teams, tiered storage is a practical step because it extends the Kafka architecture instead of replacing it.

The architectural boundary is still important. Tiered storage usually keeps the hot path and active segments local to brokers, then offloads older segments. Shared storage makes the shared layer central to durable stream storage and designs the broker around that premise. One extends broker-local storage; the other changes what a broker owns.
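To make that boundary concrete, here is a rough sizing sketch. It is not taken from KIP-405 or any Kafka tool. The function name, the parameters, and the assumption that the remote tier ends up holding roughly one logical copy of the full retention window are all illustrative simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetentionFootprint:
    broker_local_gb: float  # summed across all replicas on broker disks
    remote_gb: float        # one logical copy in remote storage (assumption)


def retention_footprint(write_mb_per_s: float,
                        total_retention_h: float,
                        local_retention_h: float,
                        replication_factor: int,
                        tiered: bool = True) -> RetentionFootprint:
    """Rough steady-state footprint for one topic (decimal units, 1 GB = 1000 MB)."""
    gb_per_hour = write_mb_per_s * 3600 / 1000
    if not tiered:
        # Without tiering, every replica keeps the full retention window locally.
        return RetentionFootprint(gb_per_hour * total_retention_h * replication_factor, 0.0)
    # With tiering, replicas keep only the local window on broker disks.
    broker_local = gb_per_hour * local_retention_h * replication_factor
    # Assumption: the remote tier holds about one copy of the full window.
    remote = gb_per_hour * total_retention_h
    return RetentionFootprint(broker_local, remote)


if __name__ == "__main__":
    print(retention_footprint(1.0, 168, 24, 3, tiered=False))
    print(retention_footprint(1.0, 168, 24, 3, tiered=True))
```

Even with tiering enabled, the broker-local figure does not drop to zero. That remainder is the state that still has to move when brokers scale or fail.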

[Figure: Architecture tradeoffs for Kafka and object storage]

The difference shows up in operations:

| Evaluation point | Export or lakehouse path | Tiered storage path | Shared storage path |
| --- | --- | --- | --- |
| Primary Kafka write path | Broker-local Kafka first | Broker-local active log first | WAL/cache plus shared durable storage |
| Main benefit | Downstream analytics and archive | Lower pressure from historical retention | Less coupling between brokers and retained data |
| Risk to test | Duplicate data contracts | Hot tier sizing and remote fetch behavior | WAL latency, cache efficiency, compatibility, and recovery |
| Incident question | Can the copy help applications recover? | How much local state still moves? | Can brokers be replaced without rebuilding large local logs? |

None of these paths is universally right. A platform that mainly needs lake analytics should not pretend that a shared-storage Kafka engine will replace table governance. A platform that mainly struggles with broker-local state should not pretend that an export pipeline will make scaling or broker replacement a routine operation. The point is to separate the job each layer is doing.

The cost model has to follow the bytes

Cost discussions around Kafka and S3 often start with storage price, but storage price is only one line item. Kafka platforms also pay for provisioned broker disks, replica traffic, cross-zone movement, retrieval patterns, object requests, connector compute, monitoring, and the human work of capacity planning. The lowest per-GB storage line can still lead to a costly architecture if it multiplies bytes across zones or keeps operators busy moving partitions.

For a practical review, follow each byte through the system:

  • Write amplification. Does one produced byte become multiple broker replicas, an exported object, and a table file?
  • Network placement. Does replication or consumer traffic cross availability zones, VPC boundaries, or regions?
  • Retention ownership. Is long retention paid through broker-attached disks, remote object storage, or both?
  • Replay path. Do catch-up consumers read Kafka, remote segments, table files, or a separate restore workflow?
  • Operational buffers. How much spare disk, broker capacity, and connector capacity must be reserved to stay safe?

AWS publishes separate pricing dimensions for services such as S3 storage and EC2 data transfer. The exact numbers vary by region and date, so a serious review should use the current pricing pages and the team's own traffic profile. The useful model is not "S3 is lower cost." The useful model is "which bytes still move, which bytes are stored more than once, and which capacity has to be reserved before demand arrives?"
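One way to keep that model honest is to write the byte path down before attaching any prices. The sketch below is a minimal illustration. The class and field names are hypothetical, and it deliberately contains no pricing numbers, since those should come from the current pricing pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BytePath:
    replication_factor: int   # broker replicas per produced byte
    cross_zone_replicas: int  # replicas placed in a different zone than the leader
    exported_copies: int = 0  # raw object copies written by an export pipeline
    table_copies: int = 0     # table-file copies written by lakehouse ingestion

    def stored_copies(self) -> int:
        """How many times one produced byte ends up stored."""
        return self.replication_factor + self.exported_copies + self.table_copies

    def cross_zone_gb(self, produced_gb: float) -> float:
        """Replication traffic that crosses zones for a given produced volume."""
        return produced_gb * self.cross_zone_replicas


if __name__ == "__main__":
    path = BytePath(replication_factor=3, cross_zone_replicas=2,
                    exported_copies=1, table_copies=1)
    print(path.stored_copies())      # copies stored per produced byte
    print(path.cross_zone_gb(100))   # cross-zone GB for 100 GB produced
```

Consumer fetch traffic, object requests, and connector compute are left out on purpose. A real review would add them as further fields and multiply each by the team's own rates.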

A production evaluation framework

A good "Kafka on S3" review should force each option through the same gates. This keeps the conversation away from product labels and close to the operational facts that will decide whether the platform works.

[Figure: Production readiness scorecard]

Start with the Kafka contract. If applications depend on standard Kafka clients, consumer group behavior, ACLs, transactions, compaction, or existing observability integrations, the proposed architecture has to prove those details with real client tests. "Kafka-compatible" should mean compatibility with the client and ecosystem behavior your workloads use, not only produce and consume in a demo.

Then test the write path. Object storage has different latency and request economics from local or block storage. Any architecture that puts object storage near the hot path needs a credible answer for batching, write-ahead logging, caching, request reduction, and recovery of data that has not yet reached the main object store. Without that layer, "Kafka on S3" becomes a slogan rather than a production design.

Recovery is the next gate because it exposes hidden state. Kill a broker. Remove a node. Fill a disk. Simulate an availability zone event. The test is not only whether the cluster survives, but what has to be copied, rebuilt, reassigned, or manually approved before service returns to normal. Architectures that look similar in steady state often diverge sharply during recovery.
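A simple piece of arithmetic helps frame the drill. It estimates how long a replacement broker needs to recopy its local state at a sustained copy rate. The function name and inputs are illustrative. Real recovery also depends on throttles, partition counts, and competing production traffic, none of which are modeled here.

```python
def rebuild_hours(local_state_gb: float, copy_rate_mb_per_s: float) -> float:
    """Hours needed to recopy broker-local state (decimal units, 1 GB = 1000 MB)."""
    if copy_rate_mb_per_s <= 0:
        raise ValueError("copy rate must be positive")
    seconds = local_state_gb * 1000 / copy_rate_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical inputs: 2 TB of local log data recopied at 100 MB/s.
    print(round(rebuild_hours(2000, 100), 2))
```

Running the same calculation against the amount of state each architecture keeps on the broker shows where designs that look alike in steady state start to diverge.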

Finally, inspect governance and boundaries. In regulated environments, the question is not only where data is stored. It is also which account owns the bucket, which IAM roles can access it, how encryption keys are managed, where metrics and logs go, and whether the control plane ever needs access to customer message data. A cloud-native streaming platform should make these boundaries clear enough for security and FinOps teams to review together.
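The four gates above can be recorded as a small checklist so that every option is judged in the same order. This is only a sketch. The gate identifiers and the rule that a missing result counts as a failure are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional

# Gates in the order the review applies them.
GATES: List[str] = ["kafka_contract", "write_path", "recovery", "governance"]


def first_failed_gate(results: Dict[str, bool]) -> Optional[str]:
    """Return the first gate that failed or was never tested, else None."""
    for gate in GATES:
        if not results.get(gate, False):
            return gate
    return None


if __name__ == "__main__":
    # Hypothetical result: recovery and governance drills were not run yet.
    option = {"kafka_contract": True, "write_path": True}
    print(first_failed_gate(option))
```

Treating an untested gate as a failure keeps a good steady-state benchmark from standing in for a failure drill that was never run.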

How AutoMQ fits this evaluation

Once the framework is clear, AutoMQ fits as one option in the shared-storage category rather than as a replacement for every export or lakehouse pattern. AutoMQ is a Kafka-compatible streaming platform that keeps Kafka protocol and ecosystem expectations while replacing Kafka's broker-local log storage with S3Stream, a shared streaming storage layer built around WAL storage, object storage, and caching.

The important architectural claim is not that object storage is magically fast. Object storage has latency and request-cost characteristics that are a poor match for naive per-partition log writes. AutoMQ addresses that gap with a WAL layer that durably accepts writes before data is uploaded to object storage, mixes writes from multiple partitions to reduce IOPS pressure, and uses cache behavior for tailing and catch-up reads. In this model, object storage is the durable storage layer, while brokers act more like replaceable compute nodes for Kafka protocol handling, leadership, routing, caching, and scheduling.
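The request-reduction idea can be illustrated with a toy example. This is not AutoMQ's S3Stream implementation. It is a minimal sketch of why mixing records from several partitions into shared batches needs fewer storage requests than writing each record separately. All names and the size threshold are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, bytes]  # (partition name, payload)


def naive_request_count(records: List[Record]) -> int:
    """One storage request per record, regardless of partition."""
    return len(records)


def mixed_batches(records: List[Record], max_batch_bytes: int) -> List[List[Record]]:
    """Pack records from any partition into shared batches, in arrival order."""
    batches: List[List[Record]] = []
    current: List[Record] = []
    size = 0
    for record in records:
        record_size = len(record[1])
        # Flush the current batch when adding this record would exceed the limit.
        if current and size + record_size > max_batch_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(record)
        size += record_size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    records = [("p0", b"x" * 10), ("p1", b"x" * 10),
               ("p2", b"x" * 10), ("p0", b"x" * 10)]
    print(naive_request_count(records))       # requests without batching
    print(len(mixed_batches(records, 25)))    # requests with shared batches
```

A production design also has to make each batch durable before acknowledging producers and to index records by partition for later reads. The toy leaves both out.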

That changes the questions a platform team can ask. Instead of sizing every broker as both a compute node and a long-lived data owner, the review can separate live throughput, cache needs, WAL choice, and retained storage. Instead of treating partition reassignment as a large data movement project, the review can focus on ownership, metadata, leadership, and traffic. Instead of accepting cross-zone broker replication as a fixed tax, the review can examine whether a shared storage design reduces broker-to-broker replication paths for the target deployment.

AutoMQ is not the right answer to every "Kafka on S3" question. If the immediate problem is governed SQL access to historical events, a lakehouse pipeline may be the center of the design. If the problem is long cold retention on an otherwise stable Kafka fleet, tiered storage may be enough. AutoMQ becomes relevant when the core pain is the Kafka data plane itself: broker-local storage economics, slow scaling, partition movement, recovery work, and the desire to keep Kafka-facing applications mostly unchanged while changing the storage architecture underneath.

For teams at that stage, the next step should be concrete rather than theoretical: choose one workload, document its write rate, read fan-out, retention, recovery objective, client features, security boundary, and cloud bill shape. Then run the same workload through the readiness gates above. If the review points toward Kafka-compatible shared storage, start with the AutoMQ architecture overview and use it as a basis for a focused proof of concept, not a generic benchmark.

FAQ

Is Kafka export to S3 the same as running Kafka on S3?

No. Export copies records from Kafka into object storage for archive, analytics, or downstream processing. Kafka still owns the online log, offsets, broker storage, and recovery behavior. Running a Kafka-compatible platform on shared storage changes the Kafka data plane itself.

When is lakehouse ingestion the better answer?

Lakehouse ingestion is the better answer when the primary goal is analytical access, table governance, schema evolution, and query performance over historical events. It should be evaluated as a downstream data product path, not as a complete replacement for Kafka's online streaming responsibilities.

Is Kafka tiered storage enough for long retention?

It can be enough when the main pain is cold historical retention and the active Kafka cluster is otherwise operationally healthy. Teams should still test hot-tier sizing, remote fetch behavior, recovery, and the amount of broker-local state that remains during scaling or failure.

What should teams test before choosing shared storage?

Test Kafka client compatibility, producer latency, read fan-out, catch-up reads, WAL behavior, object storage access, failure recovery, security boundaries, and real cloud cost under a workload that resembles production. Shared storage changes important operating assumptions, so the proof should include failure drills rather than only steady-state throughput.

Where does AutoMQ fit in a Kafka on S3 evaluation?

AutoMQ fits when the team wants to keep Kafka protocol and ecosystem behavior while moving durable stream storage into a shared storage architecture. It is most relevant when broker-local disk cost, cross-zone replication pressure, slow scaling, or partition movement are central platform problems.
