S3-Compatible Kafka: Designing Kafka for Private Cloud Object Storage

Private cloud teams often reach the same point in their Kafka planning: broker disks are expensive to grow, long retention keeps pushing capacity upward, and every broker replacement carries the operational weight of retained log data. Object storage looks like the natural release valve. It is durable, elastic, and already part of many private cloud platforms.

The phrase "S3-compatible" makes that choice sound cleaner than it is. S3-compatible storage is not AWS S3. It means the storage system exposes an API surface shaped like Amazon S3, but the implementation behind that API may have different latency, consistency, throttling, metadata, networking, observability, and failure behavior. Kafka architecture cannot treat those differences as cosmetic.

For Kafka, object storage is not a passive bucket. It can be a sink target, a remote tier for older log segments, or the shared storage substrate behind Kafka-compatible brokers. Those three modes place very different pressure on private cloud object storage, so they should not share one validation checklist.

S3-Compatible Does Not Mean AWS S3

S3 compatibility is an interface claim. It tells you that common operations such as object writes, reads, range reads, multipart upload, deletes, authentication, and bucket operations may be available through familiar SDKs. It does not guarantee AWS S3's operational envelope, regional design, durability model, request scaling behavior, or monitoring surface.

This distinction matters because Kafka uses storage as part of a correctness and latency system. A data lake job may tolerate occasional retry storms or slow object listing. Kafka producers, consumers, and operators feel those differences immediately through acknowledgment latency, fetch latency, replay speed, recovery time, and page-cache behavior.

A useful design rule is this: do not ask whether the object store is S3-compatible in the abstract. Ask whether it is Kafka-ready for the workload, failure model, and operational team that will run it.

Three Ways Kafka Uses S3-Compatible Storage

The phrase "Kafka on object storage" hides several architectures. Before evaluating private cloud storage, decide which mode you are designing.

Mode	What object storage does	Kafka dependency level	Main risk to validate
Sink storage	Kafka Connect or application consumers write events into object storage.	Kafka remains disk-based; object storage is downstream.	Connector throughput, file layout, delivery semantics, and backpressure.
Tiered storage	Kafka keeps hot data on broker disks and moves completed log segments to remote storage.	Object storage serves older reads and long retention.	Remote fetch latency, metadata correctness, request cost, and recovery behavior.
Shared storage	Kafka-compatible brokers use object storage as the durable storage layer, with compute separated from retained data.	Object storage is part of the core data path.	Acknowledgment durability, cache design, failure domains, and storage SLOs.

Sink storage is the most forgiving mode. If a connector is slow, Kafka can buffer until retention or lag limits become a problem. The object store affects downstream analytics and archival, but it is not the primary persistence mechanism for Kafka itself.

Tiered storage raises the stakes. Apache Kafka's tiered storage work describes a local tier for current broker logs and a remote tier for completed segments, with old reads served from remote storage. The idea is powerful because local retention can shrink while total retention grows. It also introduces new metadata and remote read paths that must be tested under replay and failure.

Shared storage changes the architecture more deeply. Instead of treating object storage as a cold extension, a Kafka-compatible system can separate compute from storage and place durable log data in shared object storage. AutoMQ is in this category: it keeps Kafka protocol and semantic compatibility while using S3-compatible storage as a shared storage substrate, so broker replacement and scaling are less tied to moving retained local logs.

API Compatibility: Test the Calls Kafka Actually Needs

A private object store may pass common SDK smoke tests and still surprise Kafka. The test plan should include the operations used by the storage layer, not a generic upload-download demo.

Start with the basics: PUT, GET, HEAD, range reads, multipart upload, abort multipart upload, delete, list, and error responses. Then test the unpleasant edges: expired credentials, clock skew, partial uploads, network interruption, retry behavior, object overwrite rules if used by the implementation, and how quickly metadata visibility catches up after writes and deletes.

Kafka-oriented storage layers usually try to avoid expensive or ambiguous object-store patterns. They prefer immutable objects, append-like batching at the storage layer, explicit metadata, and bounded listing. Even so, recovery often depends on finding objects, validating indexes, or reconciling metadata. If the object store has weak list visibility, unstable error codes, or gateway-specific retry behavior, those details can become Kafka incidents.

API validation should produce a compatibility bill of materials: required operations and headers, SDK and authentication method, multipart limits, range-read behavior, delete visibility, and retry error mapping. Kafka operators should not discover an object-store API edge case during a broker restart.
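One way to keep such a bill of materials honest is to record it as data and compare it against probe results. The sketch below is illustrative only: `CompatibilityBom` and its methods are hypothetical names, the operation list is taken from the basics above, and the probe results are recorded by hand rather than gathered from a real S3 SDK.

```python
from dataclasses import dataclass, field


@dataclass
class CompatibilityBom:
    """Hypothetical record of what the Kafka storage layer needs from the store."""
    required_operations: set[str]
    observed_operations: dict[str, bool] = field(default_factory=dict)

    def record(self, operation: str, passed: bool) -> None:
        """Store the outcome of one probe against the object store."""
        self.observed_operations[operation] = passed

    def gaps(self) -> list[str]:
        """Operations that are required but failed or were never probed."""
        return sorted(
            op for op in self.required_operations
            if not self.observed_operations.get(op, False)
        )


bom = CompatibilityBom(required_operations={
    "PUT", "GET", "HEAD", "range GET", "multipart upload",
    "abort multipart upload", "DELETE", "LIST",
})
bom.record("PUT", True)
bom.record("GET", True)
bom.record("range GET", False)  # assumed failure, e.g. a gateway ignoring Range
print(bom.gaps())
```

Treating an unprobed operation as a gap is deliberate in this sketch: an operation nobody tested is as risky at broker restart as one that failed.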

Latency and Throughput: Measure Workload Shape, Not Vendor Claims

Object storage performance numbers are meaningful only when they match Kafka's workload shape. Kafka writes are ordered appends at partition level. Object storage writes are immutable object operations. The architecture between them must batch, index, cache, and recover without violating the point at which producers receive acknowledgments.

For sink workloads, measure connector flush latency, object creation rate, partition-to-file mapping, and downstream read efficiency. A connector that creates too many tiny objects can satisfy Kafka consumption while creating painful storage metadata pressure.
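The tiny-object problem is easy to estimate before any test runs. The arithmetic below assumes the simplest possible layout, one object per partition per flush; real connectors may batch differently, and the partition count and flush intervals are made-up inputs.

```python
def sink_objects_per_day(partitions: int, flush_interval_s: float) -> int:
    """Objects created per day, assuming one object per partition per flush."""
    flushes_per_day = 86_400 / flush_interval_s
    return int(partitions * flushes_per_day)


# 200 partitions flushing every minute versus every ten minutes.
print(sink_objects_per_day(partitions=200, flush_interval_s=60))   # 288000
print(sink_objects_per_day(partitions=200, flush_interval_s=600))  # 28800
```

Under these assumptions, a tenfold longer flush interval cuts object creation tenfold, at the price of larger files and later visibility downstream.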

For tiered storage, test the split between hot and historical reads. Tail consumers should usually hit broker memory, page cache, or local storage. Historical replay should use remote reads efficiently, with good range reads and index lookup behavior. If tail consumers frequently touch object storage, the tier is doing hot-path work it was not meant to carry.

For shared storage, latency testing must include acknowledgment durability. A broker can acknowledge only when the architecture has met its durability contract. That contract may involve a write-ahead log, replicated commit layer, object storage, or a combination of mechanisms. The names vary by product, but the test question does not: what must complete before the producer receives success?
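The test question can be made concrete by writing the durability contract down as a list of steps and marking which ones gate the acknowledgment. This is a toy model of a hypothetical product, not a description of any specific system: the step names and latencies are invented, and required steps are assumed to run one after another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurabilityStep:
    name: str
    latency_ms: float
    required_before_ack: bool


def ack_latency_ms(steps: list[DurabilityStep]) -> float:
    """Ack latency if the required steps run sequentially (an assumption)."""
    return sum(s.latency_ms for s in steps if s.required_before_ack)


steps = [
    DurabilityStep("write-ahead log append", 2.0, True),
    # Treated as asynchronous in this sketch, so it does not gate the ack.
    DurabilityStep("object storage upload", 80.0, False),
]
print(ack_latency_ms(steps))  # 2.0
```

Flipping `required_before_ack` on the upload step shows how much of the object store's latency would land directly on producers under a different contract.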

Throughput testing should include sustained ingest, catch-up reads, consumer group rebalances, broker restart, and cold-cache recovery. Private cloud object stores often deliver adequate aggregate throughput but uneven per-node, per-gateway, or per-bucket behavior. Kafka will find those uneven parts because it creates steady pressure and sudden replay bursts.

Consistency and Metadata Are Correctness Issues

AWS documents strong read-after-write consistency for Amazon S3, but that statement does not transfer automatically to every S3-compatible implementation. Some private object stores provide strong semantics for reads but weaker behavior for listing. Some rely on gateway caches. Some behave differently during node recovery or erasure-code rebuild.

Kafka storage designs should minimize dependence on ambiguous list operations, but they cannot ignore consistency. Tiered storage needs to know which remote segments exist, which indexes correspond to them, and which offsets have moved out of the local tier. Shared storage systems need an even tighter relationship between object data, stream metadata, cache state, and crash recovery.

The consistency test should cover immediate key reads, range reads, concurrent writes, prefix listing, retention deletion, interrupted multipart uploads, and broker or object-store restarts while traffic continues.
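A minimal sketch of the first two checks, immediate key reads and prefix listing, is shown here. `LaggingListStore` is a hypothetical in-memory stand-in that simulates a store whose listing trails its writes; a real probe would issue the same sequence through an S3 SDK against the actual endpoint.

```python
class LaggingListStore:
    """In-memory stand-in for a store whose LIST lags behind writes."""

    def __init__(self, list_lag_writes: int = 1):
        self._objects: dict[str, bytes] = {}
        self._listed: list[str] = []
        self._pending: list[str] = []
        self._lag = list_lag_writes

    def put(self, key: str, body: bytes) -> None:
        self._objects[key] = body
        self._pending.append(key)
        # A key becomes listable only after `lag` newer writes have landed.
        while len(self._pending) > self._lag:
            self._listed.append(self._pending.pop(0))

    def get(self, key: str):
        return self._objects.get(key)

    def list(self, prefix: str) -> list[str]:
        return [k for k in self._listed if k.startswith(prefix)]


def probe_consistency(store, key: str, body: bytes) -> dict[str, bool]:
    """Write one object, then check it is readable and listable immediately."""
    store.put(key, body)
    prefix = key.rsplit("/", 1)[0] + "/"
    return {
        "read_after_write": store.get(key) == body,
        "list_after_write": key in store.list(prefix),
    }


print(probe_consistency(LaggingListStore(1), "segments/topic-0/0000.log", b"x"))
```

With a lag of one write, the probe reports a successful read but a failed listing, which is the pattern a recovery path that depends on listing needs to survive.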

These tests are not academic. If object metadata visibility lags behind Kafka metadata, operators can see missing segments, failed remote fetches, replay stalls, or recovery loops. The storage system may still be "S3-compatible," but it may not be acceptable for the Kafka role you assigned to it.

Failure Domains: Draw Storage, Network, and Broker Boundaries Together

Private cloud object storage is often deployed as a distributed system inside the same data center estate as Kafka. That makes topology design more important, not less. A Kafka cluster and an object store may share racks, switches, power domains, Kubernetes nodes, storage networks, load balancers, or DNS dependencies. If those boundaries are not drawn, the architecture can accidentally concentrate risk.

The failure-domain review should map four layers on one diagram: Kafka brokers, write-ahead or cache storage, object-storage gateways, and object-storage backend nodes. Then add network paths and operational systems such as metrics, logging, DNS, and identity.

The uncomfortable questions are the useful ones. If one rack fails, do brokers and object-store shards fail together? If an S3 gateway is overloaded, do producers see higher acknowledgment latency or only replay jobs slow down? If the object store rebuilds after a disk loss, does Kafka stay within SLO?
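The rack question can be answered mechanically once placement is written down. The sketch below assumes a hand-maintained inventory mapping each component to a layer and a failure domain; the component names, layer labels, and racks are all invented for illustration.

```python
from collections import defaultdict


def shared_failure_domains(
    placement: dict[str, tuple[str, str]],
) -> dict[str, set[str]]:
    """Return failure domains that host more than one architectural layer.

    `placement` maps component name -> (layer, failure domain).
    """
    layers_by_domain: dict[str, set[str]] = defaultdict(set)
    for _component, (layer, domain) in placement.items():
        layers_by_domain[domain].add(layer)
    return {d: layers for d, layers in layers_by_domain.items() if len(layers) > 1}


placement = {
    "broker-1": ("kafka-broker", "rack-a"),
    "broker-2": ("kafka-broker", "rack-b"),
    "s3-gw-1": ("object-gateway", "rack-a"),
    "store-node-1": ("object-backend", "rack-c"),
}
print(shared_failure_domains(placement))
```

Here rack-a hosts both a broker and an object-storage gateway, so a single rack failure would hit two layers at once. The same function works for switches, power domains, or Kubernetes nodes if those are used as the domain label.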

Tiered storage and shared storage respond differently to these failures. In a tiered design, hot data may still be on broker disks while old reads suffer. In a shared storage design, object storage and the write-ahead path are part of core durability and recovery, so failure isolation and backpressure must be designed with greater care.

Observability: Kafka Metrics Need Object-Store Context

Kafka teams already watch producer request latency, consumer lag, under-replicated partitions, broker disk, controller health, and network throughput. Once S3-compatible storage enters the architecture, those metrics are not enough. The object store becomes part of Kafka's behavior, so storage-level signals must be correlated with Kafka-level symptoms.

At minimum, collect per-operation latency and error rates for object PUT, GET, range GET, multipart upload, list, and delete. Track request volume, object count growth, gateway saturation, backend rebuild state, cache hit ratio, remote fetch latency, and storage throttling. Dashboards should answer operator questions: whether consumer lag comes from Kafka compute, cache misses, object-store latency, or network saturation.
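As a rough illustration of the triage such a dashboard should support, the sketch below classifies a single metrics snapshot. The `Snapshot` fields, the thresholds, and the order of checks are assumptions chosen for the example, not recommended values; a real setup would derive them from the team's own SLOs and metrics system.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time metrics taken while consumer lag is rising."""
    object_get_p99_ms: float
    cache_hit_ratio: float
    network_utilization: float


def likely_lag_cause(
    s: Snapshot,
    *,
    get_p99_limit_ms: float = 200.0,   # placeholder threshold
    min_cache_hit: float = 0.9,        # placeholder threshold
    max_network_util: float = 0.8,     # placeholder threshold
) -> str:
    """First-pass guess at where consumer lag comes from."""
    if s.network_utilization > max_network_util:
        return "network saturation"
    if s.object_get_p99_ms > get_p99_limit_ms:
        return "object-store latency"
    if s.cache_hit_ratio < min_cache_hit:
        return "cache misses"
    return "Kafka compute"


print(likely_lag_cause(Snapshot(450.0, 0.97, 0.4)))  # object-store latency
```

The function only works if Kafka-side and storage-side signals are collected on the same timeline, which is the correlation requirement described above.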

Without this correlation, S3-compatible Kafka becomes hard to operate. Kafka teams see symptoms; storage teams see resource charts; neither side sees the record path end to end.

Cost: Private Cloud Still Has a Storage Economics Problem

Private cloud teams may not pay a public cloud bill for every object request, but cost still exists. It appears as storage nodes, disks, SSD cache, network fabric, load balancers, operational staff, support contracts, power, rack space, and opportunity cost. A design that creates many small objects or high list volume consumes real capacity even when no monthly S3 invoice arrives.

Evaluate cost by workload path: ingest write amplification, tail-read cache hit ratio, replay range-read efficiency, recovery object reads after broker failure, and retention object count.
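Two of these paths reduce to simple arithmetic. The sketch below estimates retention object count and ingest write amplification; the ingest rate, retention period, and object sizes are made-up inputs, and the model ignores compaction and index objects.

```python
def retained_object_count(
    ingest_mb_per_s: float, retention_days: float, avg_object_mb: float
) -> float:
    """Objects held at steady state for a given ingest rate and retention."""
    retained_mb = ingest_mb_per_s * 86_400 * retention_days
    return retained_mb / avg_object_mb


def write_amplification(bytes_written_to_store: float, bytes_produced: float) -> float:
    """Bytes the object store absorbs per byte that producers send."""
    return bytes_written_to_store / bytes_produced


# 100 MB/s ingest kept for 7 days, stored as 64 MB versus 1 MB objects.
print(retained_object_count(100, 7, 64))  # 945000.0
print(retained_object_count(100, 7, 1))   # 60480000.0
print(write_amplification(150, 100))      # 1.5
```

The same retained bytes cost 64 times more metadata entries with 1 MB objects than with 64 MB objects, which is where private cloud capacity quietly goes.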

The most important number is not cost per stored TB. It is cost per useful Kafka behavior: sustained ingest within SLO, replay within target time, broker replacement without long data movement, and retention growth without expanding compute in lockstep.

A PoC Plan That Catches the Right Risks

A private cloud proof of concept should be designed to disprove assumptions. A happy-path demo with one topic and a short upload test does not tell you whether the object store can carry Kafka responsibilities.

Use representative topics and client behavior. Include large partitions, small messages, high-throughput producers, tail consumers, slow consumers, replay jobs, and retention cleanup. Run the test long enough to observe object count growth, cache churn, compaction behavior, and storage-node background work.

The PoC should include three categories of gates. The first is correctness: no acknowledged data loss, no offset inconsistency, no missing remote segments, and predictable recovery after interruption. The second is performance: produce latency, fetch latency, replay throughput, and recovery time stay inside the target envelope. The third is operability: metrics, logs, alerts, and runbooks let the team identify bottlenecks without guessing.
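The first two categories can be expressed as pass/fail checks over data the PoC already collects. The sketch below covers one correctness gate and one performance gate; operability is a human judgment and is left out. The function names, the nearest-rank percentile method, and the latency limit are assumptions for illustration.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_gates(
    produce_latency_ms: list[float],
    acked_offsets: set[int],
    consumed_offsets: set[int],
    p99_limit_ms: float,
) -> dict[str, bool]:
    """Evaluate one correctness gate and one performance gate."""
    return {
        # Every offset acknowledged to a producer must be readable afterwards.
        "no acknowledged data loss": acked_offsets <= consumed_offsets,
        "produce p99 within limit": percentile(produce_latency_ms, 99) <= p99_limit_ms,
    }


latencies = [float(ms) for ms in range(1, 101)]  # synthetic samples, 1..100 ms
print(evaluate_gates(latencies, {0, 1, 2}, {0, 1}, p99_limit_ms=120.0))
```

In this synthetic run the performance gate passes while the correctness gate fails because offset 2 was acknowledged but never read back, which is exactly the kind of result a happy-path demo would not surface.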

Run fault injection after the steady-state test, not before it. Kill brokers. Restart object-store gateways. Interrupt network paths. Force cache cold starts. Trigger object-store rebuild if the platform supports a controlled test. Then rerun replay and ingest together. Many architectures pass isolated tests and fail when recovery and workload pressure overlap.

Where AutoMQ Fits

The private cloud conclusion is not that every Kafka platform should immediately move to object storage. The conclusion is narrower and more useful: if retained Kafka data should stop being anchored to broker-local disks, the storage layer must be designed as part of Kafka rather than bolted on as a bucket.

AutoMQ follows that shared-storage direction. It is a Kafka-compatible streaming platform that separates compute from storage and uses S3-compatible object storage as the durable shared storage substrate. In private cloud or data center deployments, that means the object store can support Kafka-compatible brokers without turning every broker lifecycle event into retained-log migration.

That architecture still depends on validation. S3-compatible storage must meet the required API, latency, throughput, consistency, failure-domain, observability, and cost envelope. The difference is that the design starts from shared storage as a first-class Kafka substrate rather than treating object storage only as a downstream archive.

If your private cloud team is evaluating this path, the right next step is not a generic object-store benchmark. Build a Kafka workload PoC around your own retention, replay, failover, and operational goals, then compare sink, tiered storage, and shared storage modes on the same evidence.

FAQ

Is S3-compatible storage the same as AWS S3 for Kafka?

No. S3-compatible storage usually means the API resembles Amazon S3, but Kafka also depends on latency, consistency, throttling, metadata visibility, failure behavior, and observability. Those properties vary by private cloud implementation and must be tested.

When is object storage a sink rather than Kafka storage?

It is a sink when Kafka Connect or an application consumes Kafka records and writes them to object storage for analytics, archival, or lake ingestion. Kafka still relies on broker storage for its own log durability.

How is tiered storage different from shared storage?

Tiered storage keeps recent data on broker-local storage and moves older completed segments to remote storage. Shared storage uses object storage as the durable substrate behind Kafka-compatible brokers, so retained data is less tied to individual broker disks.

What should a private cloud Kafka PoC measure first?

Measure producer acknowledgment latency, tail consumer latency, replay throughput, object request behavior, cache hit ratio, recovery after broker loss, object-store endpoint failure, and whether operators can correlate Kafka symptoms with object-store metrics.

Where does AutoMQ fit in S3-compatible Kafka architecture?

AutoMQ is a Kafka-compatible shared storage architecture that uses S3-compatible object storage as the durable storage substrate. It is relevant when teams want Kafka semantics while reducing the operational coupling between brokers and retained local log data.
