Data Plane Boundaries Behind Kafka on S3 Architectures

Kafka on S3 sounds like a storage decision, but production teams usually discover that it is a boundary decision. The first question may be narrow: should a team write its own Kafka consumer and use an object-storage SDK, or should it use Kafka Connect? That question is practical, and it deserves a practical answer. Yet it also points to a larger issue. Once Kafka records move toward S3, the team has to decide which system owns the live log, which network path carries customer data, which account stores durable bytes, and which recovery procedure is still trustworthy when something fails.

That is why a good Kafka on S3 architecture review starts with the data plane. Kafka records are not ordinary files after they enter production. They carry offsets, ordering assumptions, retention expectations, consumer group behavior, schema contracts, access controls, and replay obligations. S3 can be a destination, a remote tier, or the durable foundation of a Kafka-compatible system. Those designs can all be valid, but they do not place the data plane in the same location.

Figure: Kafka on S3 data plane boundary map

Why teams search for Kafka on S3

Most teams do not begin with a formal architecture taxonomy. They begin with an operational annoyance. Broker disks are growing faster than compute demand. Long retention is making cluster changes painful. Analytics teams want events in S3. A compliance team wants durable archives. A platform team wants Kafka semantics without a fleet of heavy stateful brokers. Finance wants to understand why storage and cross-zone traffic appear in different parts of the cloud bill.

The phrase Kafka on S3 compresses those concerns into one search. That compression is useful for discovery and dangerous for design. A sink connector that writes objects into a bucket leaves Kafka's primary storage model intact. Apache Kafka tiered storage moves eligible log segments to a remote tier, while the local tier still matters for active traffic. A Kafka-compatible shared-storage architecture changes the ownership model more deeply: brokers serve the Kafka protocol and compute path, while durable stream storage is backed by object storage and a write-ahead path.

The connector question is a good example. Writing a custom consumer can look direct because the APIs are familiar: poll Kafka, batch records, write files, commit offsets. The non-obvious work begins after the happy path. The team must handle retries without corrupting output, define file boundaries, preserve ordering where it matters, deal with schema evolution, manage credentials, monitor lag, make idempotency decisions, and prove restart behavior. Kafka Connect exists because those concerns recur across sink workloads. But even a well-built sink is still an export pattern, not a redesign of Kafka storage.
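
To make the "after the happy path" work concrete, here is a minimal sketch of the custom-consumer pattern, assuming a single partition, count-based batches, and an in-memory stand-in for the bucket. Every name in it (`Record`, `FakeObjectStore`, `object_key`, `flush_batch`, `run_sink`) is hypothetical and does not come from any Kafka or object-storage SDK. It illustrates only two of the decisions listed above: deriving the object key from the batch's first offset so a retried write overwrites instead of duplicating, and committing the offset only after the write succeeds.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    topic: str
    partition: int
    offset: int
    value: bytes


@dataclass
class FakeObjectStore:
    """In-memory stand-in for a bucket (hypothetical, not a real SDK client)."""
    objects: dict = field(default_factory=dict)

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body


def object_key(topic: str, partition: int, start_offset: int) -> str:
    """Deterministic key: the same batch always maps to the same object."""
    return f"{topic}/partition={partition}/{start_offset:020d}.jsonl"


def flush_batch(batch, store, committed) -> None:
    """Write one batch, then advance the committed offset.

    The commit happens only after the put returns, so a crash between the
    two steps replays the batch instead of losing it.
    """
    if not batch:
        return
    first, last = batch[0], batch[-1]
    key = object_key(first.topic, first.partition, first.offset)
    store.put(key, b"\n".join(r.value for r in batch))
    committed[(first.topic, first.partition)] = last.offset + 1


def run_sink(records, store, committed, batch_size=3):
    """Batch records from one partition by count and flush full batches."""
    batch = []
    for rec in records:
        if rec.offset < committed.get((rec.topic, rec.partition), 0):
            continue  # already exported before a restart
        batch.append(rec)
        if len(batch) == batch_size:
            flush_batch(batch, store, committed)
            batch = []
    return batch  # unflushed tail; it is replayed after a restart
```

The overwrite is only idempotent because batch boundaries are deterministic here: a restart resumes from the committed offset and cuts batches at the same record counts. Time-based or size-based flushing would need a different rule, which is part of what a sink framework has to decide on the team's behalf.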

The three data plane placements

The cleanest way to evaluate Kafka on S3 is to ask where S3 sits relative to the live Kafka data plane. There are three common placements.

  • Beside Kafka: S3 receives a copy of records through a connector, stream processor, lake ingestion job, or custom consumer. Kafka remains the system of record for live producers and consumers.
  • Behind Kafka: S3 or another remote store holds older completed segments through tiered storage. Kafka brokers still own the active local tier and serve the Kafka protocol.
  • Inside the Kafka-compatible storage path: object storage becomes part of the primary durable storage architecture. Brokers are less tied to local disks, and the platform needs a purpose-built write path, cache, metadata model, and recovery model.

These placements change the blast radius of a decision. Beside-Kafka export can often be introduced with limited application risk because clients continue using the existing cluster. Behind-Kafka tiering changes retention economics and cold-read behavior while preserving much of the broker-centric operating model. Inside-the-storage-path designs change the recovery and scaling premise, so they demand deeper validation of compatibility, latency, governance, and failure behavior.

Apache Kafka's tiered storage documentation is useful because it makes one boundary explicit: the system has a local tier and a remote tier. That does not make brokers stateless. It means remote storage can reduce local disk pressure for eligible data while active log behavior remains part of the broker model. KIP-1150 pushes the conversation further by discussing diskless topics, which shows that the Kafka community is actively treating broker-local storage as an architectural boundary, not only as a capacity planning detail.
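
The local-versus-remote boundary can be sketched as a small classification function. This is a deliberately simplified model rather than Kafka's implementation: it assumes a rolled segment is uploaded immediately, ignores size-based retention, and treats the two thresholds as stand-ins for the topic settings `local.retention.ms` and `retention.ms`. The `Placement` enum and `segment_placement` function are hypothetical names.

```python
from enum import Enum


class Placement(Enum):
    LOCAL_ONLY = "local only"
    LOCAL_AND_REMOTE = "local and remote"
    REMOTE_ONLY = "remote only"
    DELETED = "deleted"


def segment_placement(age_ms: int, is_active: bool,
                      local_retention_ms: int, retention_ms: int) -> Placement:
    """Classify where a log segment lives under a simplified tiering model."""
    if is_active:
        # The active segment is still being written and stays on the broker.
        return Placement.LOCAL_ONLY
    if age_ms >= retention_ms:
        return Placement.DELETED
    if age_ms >= local_retention_ms:
        # The local copy has expired; reads are served from the remote tier.
        return Placement.REMOTE_ONLY
    return Placement.LOCAL_AND_REMOTE
```

Even in this toy model the active segment never leaves the broker, which is the reason tiering reduces disk pressure without making brokers stateless.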

What the cost model must include

Kafka on S3 cost is often framed as a storage price comparison. That misses the expensive parts. S3 pricing includes storage, requests, retrieval, and data transfer dimensions. EC2 and network pricing add their own surfaces. Kafka adds replication, broker sizing, retained bytes, read fan-out, operational labor, and migration overlap. A cost model that compares only one GiB of object storage with one GiB of broker disk is too thin for a platform decision.

Figure: Kafka on S3 cost and recovery surfaces

Start by separating byte paths:

  • Write path: producer traffic, acknowledgments, replication or WAL writes, object commits, and metadata updates.
  • Read path: tailing consumers, catch-up consumers, replay jobs, analytics reads, and cache misses.
  • Recovery path: broker replacement, leader movement, failed writes, restore operations, and regional or zonal failure drills.
  • Export path: copied objects, compaction jobs, format conversion, duplicate retention, and downstream query scans.
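
One way to keep those paths separate in a cost model is to price each one on its own dimensions before summing. The sketch below assumes three billing dimensions (stored GiB, cross-zone GiB, and requests). The `BytePath` and `Prices` types are hypothetical, and any unit prices passed in are placeholders to be replaced with figures from your own cloud bill, not real list prices.

```python
from dataclasses import dataclass


@dataclass
class BytePath:
    name: str
    gib_stored: float = 0.0      # GiB held for the month on this path
    gib_cross_zone: float = 0.0  # GiB transferred across zones
    requests: int = 0            # object-storage requests issued


@dataclass
class Prices:
    # Placeholder unit prices; substitute values from your own bill.
    per_gib_month: float
    per_gib_cross_zone: float
    per_1k_requests: float


def path_cost(path: BytePath, prices: Prices) -> float:
    """Monthly cost of one byte path across the three dimensions."""
    return (path.gib_stored * prices.per_gib_month
            + path.gib_cross_zone * prices.per_gib_cross_zone
            + path.requests / 1000 * prices.per_1k_requests)


def cost_by_path(paths, prices: Prices) -> dict:
    """Per-path breakdown, so each cost lands with the team that owns it."""
    return {p.name: path_cost(p, prices) for p in paths}
```

Keeping the result keyed by path rather than collapsing it into one total makes it easier to see when the recovery or export path, not steady-state storage, dominates the bill.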

Each path has a different owner. SREs care about latency and failure isolation. FinOps teams care about resource attribution and transfer charges. Security teams care about accounts, VPCs, IAM roles, private endpoints, encryption, and audit logs. Application teams care about whether offsets, ordering, transactions, consumer groups, and retention behave the way their code expects.

This is where data plane boundaries become concrete. If Kafka records move from a customer VPC into a vendor-owned account, that is a governance decision. If broker replication crosses Availability Zones, that is a network-cost and failure-domain decision. If old segments move to S3 but hot partitions remain on local disks, that is a retention decision rather than a stateless-broker decision. If object storage is the durable foundation, then WAL placement, cache behavior, and metadata consistency become core production questions.

Production questions that slogans hide

Architecture slogans are rarely wrong; they are usually incomplete. Kafka on S3 can mean lower local disk pressure, better long-term retention, lakehouse integration, or compute/storage separation. The question is which promise applies to the design in front of you.

The first production question is compatibility. Kafka clients depend on more than the wire protocol. They depend on broker discovery, partition leadership, error codes, offset commits, consumer group rebalancing, ACL behavior, transactions if used, compaction semantics, monitoring conventions, and tooling integrations. A platform can be Kafka-compatible enough for one workload and still need careful testing for another. Compatibility should be validated with real clients, not only a produce-and-consume smoke test.

The second question is the durability boundary. In traditional Kafka, durability is commonly reasoned about through broker-local logs, replication factor, acks, ISR behavior, and disk persistence. In tiered storage, a remote tier enters the lifecycle after segments become eligible. In shared-storage systems, the write path must define when a record is durable, how the write-ahead layer is protected, how object commits are coordinated, and how metadata points readers to the right bytes.
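
For the traditional case, the interaction between `acks` and the in-sync replica set can be written down as a small predicate. This is a simplified model of documented Kafka behavior, not broker code: with `acks=all`, a write is accepted only while the ISR holds at least `min.insync.replicas` members, while `acks=1` waits for the leader alone and `acks=0` does not wait at all. The function name is hypothetical.

```python
def write_is_acknowledged(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Simplified model of when a producer sees a successful write."""
    if acks == "0":
        # The producer does not wait for any broker response.
        return True
    if acks == "1":
        # Only the partition leader has to append the record.
        return isr_size >= 1
    if acks in ("all", "-1"):
        # The broker rejects the write when the ISR is too small.
        return isr_size >= min_insync_replicas
    raise ValueError(f"unknown acks setting: {acks!r}")
```

A shared-storage or tiered design needs an equally explicit statement of its own acknowledgment rule, because this predicate says nothing about WAL writes or object commits.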

The third question is recovery. Export pipelines need replay and idempotency rules. Tiered storage needs cold-segment read behavior and local-tier recovery expectations. Shared-storage designs need broker replacement, cache warm-up, object-store access, and metadata recovery drills. A design that looks cost-effective during steady state can become expensive during recovery if every failure creates large cross-zone reads or long catch-up windows.
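
The catch-up window is simple arithmetic worth doing before a drill: a consumer shrinks its backlog only at the rate by which its read throughput exceeds ongoing ingest. The sketch below assumes constant rates, and both function names are hypothetical.

```python
import math


def catch_up_seconds(backlog_mib: float, read_mib_s: float, ingest_mib_s: float) -> float:
    """Seconds until a lagging consumer reaches the log head at constant rates."""
    if backlog_mib <= 0:
        return 0.0
    headroom = read_mib_s - ingest_mib_s
    if headroom <= 0:
        # Reading no faster than producers write: the lag never closes.
        return math.inf
    return backlog_mib / headroom


def catch_up_read_mib(backlog_mib: float, read_mib_s: float, ingest_mib_s: float) -> float:
    """Total MiB read during catch-up, including data that arrives meanwhile."""
    seconds = catch_up_seconds(backlog_mib, read_mib_s, ingest_mib_s)
    return read_mib_s * seconds
```

The second function matters for cost: the bytes read during recovery exceed the original backlog, and if those reads cross zones or miss the cache, they are billed on a different path than steady-state traffic.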

The fourth question is operational ownership. A SaaS service, a BYOC deployment, and a self-managed cluster can all expose Kafka endpoints, but their control plane and data plane boundaries differ. Buyers should ask where brokers run, where records persist, who can access the environment, which telemetry leaves the account, who owns cloud resources, and how emergency support works. Those details matter more than the label on the architecture diagram.

A technical evaluation framework for platform teams

The evaluation should produce an architecture decision, not a label-matching exercise. Start with the workload that hurts today. A data lake ingestion problem does not need the same answer as a broker replacement problem. A compliance archive has different success criteria from a low-latency trading stream. A bursty platform with uneven retention needs a different proof point from a stable cluster with predictable traffic.

Use a worksheet with five sections:

| Evaluation area | Questions to answer | Evidence to collect |
| --- | --- | --- |
| Data plane placement | Is S3 beside Kafka, behind Kafka, or inside the durable path? | Architecture diagram, account boundary, write/read paths |
| Kafka semantics | Which client behaviors must remain unchanged? | Client tests, connector tests, ACL and transaction checks |
| Cost surface | Which bytes are stored, copied, read, and transferred? | Cloud bill model, request estimates, migration overlap |
| Recovery behavior | What happens during broker loss, zone failure, and cold replay? | Failure drills, RTO/RPO targets, rollback plan |
| Operations | Who owns upgrades, metrics, scaling, incident access, and deletion? | Runbooks, IAM review, support model, observability plan |

The worksheet forces a useful discipline: do not compare products before naming the boundary you want to change. If the problem is lake ingestion, evaluate connector reliability and file layout. If the problem is retained data on broker disks, evaluate tiered storage behavior and cold-read economics. If the problem is that brokers are too stateful for elastic cloud operations, evaluate shared-storage Kafka-compatible systems and their write path.

Figure: Kafka on S3 production readiness scorecard

The proof should include at least one failure drill. Steady-state benchmarks are necessary, but they are not enough. Test a broker disappearing, a consumer group catching up from older offsets, a hot partition shifting, object storage access being denied by policy, and a rollback to the previous platform. These tests expose boundary mistakes faster than a month of dashboard watching.
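
A team that wants to track the worksheet mechanically could represent it as a mapping from evaluation area to collected evidence and refuse to call the proof complete until every area has evidence and at least one failure drill has run. The sketch below is one hypothetical way to do that. The area names come from the worksheet, while the function names and the dictionary shape are assumptions.

```python
AREAS = (
    "Data plane placement",
    "Kafka semantics",
    "Cost surface",
    "Recovery behavior",
    "Operations",
)


def missing_evidence(worksheet: dict) -> list:
    """Return the evaluation areas that have no evidence recorded yet."""
    return [area for area in AREAS if not worksheet.get(area)]


def proof_is_complete(worksheet: dict, failure_drills_run: int) -> bool:
    """Complete only with evidence in every area and at least one drill."""
    return not missing_evidence(worksheet) and failure_drills_run >= 1
```

For example, a worksheet with only `{"Cost surface": ["cloud bill model"]}` filled in reports the other four areas as missing, and it stays incomplete no matter how many benchmarks have run.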

How AutoMQ fits the evaluation

AutoMQ belongs in the shared-storage category of this framework: a Kafka-compatible streaming system that keeps the Kafka-facing protocol and ecosystem surface while replacing the traditional broker-local storage center of gravity with S3Stream shared storage, WAL storage, and cache components. In this model, object storage is not only an export destination or a cold tier. It is part of the durable stream storage design, while brokers are designed as stateless compute nodes.

That makes AutoMQ most relevant when the evaluation points to broker statefulness as the real constraint. If the team only needs to land topic data in S3 for analytics, a sink pipeline may be the right tool. If the team primarily needs longer retention on an existing Apache Kafka estate, tiered storage may be the right first evaluation. If the team wants Kafka compatibility while reducing the operational weight of broker disks, making scale-out and scale-in less tied to retained data movement, and keeping the data plane inside a cloud boundary such as BYOC, AutoMQ is worth testing with the same worksheet.

The practical validation does not change because AutoMQ is Kafka-compatible. Teams should still test real producers and consumers, Kafka Connect jobs, schema workflows, ACLs, topic operations, tail latency, catch-up reads, failure drills, object storage policies, private networking, and observability integration. The difference is the hypothesis being tested: durable stream data no longer has to be permanently owned by broker-local disks for Kafka-compatible workloads.

For teams reviewing Kafka on S3 through data plane boundaries, the next step is not to accept a category label. Build the boundary diagram, run the workload proof, and compare the failure behavior. If shared-storage Kafka compatibility is the path worth validating, talk to the AutoMQ team with one representative workload, its retention profile, and the failure drills your platform team already trusts.

FAQ

Is Kafka on S3 the same as writing Kafka data to S3?

No. Writing Kafka data to S3 is one export pattern. It can be the right answer for lakehouse ingestion, audit copies, or offline processing, but it does not change Kafka's primary broker-local storage model by itself. Kafka on S3 can also refer to tiered storage or a Kafka-compatible architecture where object storage is part of the primary durable path.

When should I use Kafka Connect instead of a custom consumer?

Use Kafka Connect when the job is a repeatable source or sink pipeline and the team wants managed offset handling, connector lifecycle, task parallelism, error handling, and operational conventions. A custom consumer can fit specialized logic, but the team then owns batching, retries, idempotency, schema handling, monitoring, restart behavior, and file layout.

Does tiered storage make Kafka brokers stateless?

No. Tiered storage can move eligible completed log segments to remote storage, which helps reduce local disk pressure for retained data. Brokers still own active partitions and the local tier remains part of the operating model.

What should a Kafka on S3 cost comparison include?

Include broker compute, local or WAL storage, object storage capacity, object requests, retrieval behavior, data transfer, cross-zone traffic, duplicate export retention, migration overlap, support fees, and operational labor. The most important step is mapping byte paths before assigning prices.

Where does AutoMQ fit in a Kafka on S3 architecture review?

AutoMQ fits when the target path is Kafka-compatible shared storage rather than only export or tiered retention. It uses S3Stream, WAL storage, cache components, and stateless brokers to move durable stream storage away from broker-local disks while preserving Kafka-compatible access for clients and ecosystem tools.
