Confluent Cloud Storage and Retention Cost: Why Long-Lived Kafka Data Changes the Math

Kafka budgets usually start with throughput. Producers write this much, consumers read that much, and the platform team maps the workload to a cluster type. That works for short-lived streams where most data is consumed near the tail and deleted after hours or days. It becomes incomplete when Kafka turns into an audit log, replay source, CDC backbone, or observability store. In those systems, the expensive question is not only "How fast is the stream?" It is also "How long does every byte remain queryable, replicated, and available for a future read?"

Confluent Cloud makes this worth modeling explicitly. Its billing documentation states that usage includes data transferred, storage, compute units, and add-on services, and that storage is charged on the total post-replication volume stored on a Kafka cluster. Confluent Cloud also documents a built-in replication factor of three, while topic retention settings such as retention.ms and retention.bytes control how much data remains. That is a reasonable availability model, but it means long retention multiplies the storage line in a way teams often underweight during renewal planning.

[Figure: Retention cost growth curve]

Why Retention Changes Kafka TCO

Retention changes the unit of planning from throughput to inventory. A stream writing 100 MiB/s sounds like a throughput problem, but over one day it creates about 8.24 TiB of logical data before compression, compaction, deletes, and protocol overhead. Over 30 days, the same write rate creates roughly 247 TiB of logical retained data. If the storage model keeps three replicated copies at the Kafka service layer, the post-replication footprint becomes roughly 741 TiB. The cluster may process the same 100 MiB/s each day, but the retained inventory keeps accumulating until the retention window reaches steady state.
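The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming binary units (1 TiB = 1024 × 1024 MiB), a constant write rate, and no compression, compaction, or protocol overhead. The function name `retained_tib` is illustrative and is not part of any Kafka or Confluent tooling.

```python
SECONDS_PER_DAY = 86_400
MIB_PER_TIB = 1024 * 1024


def retained_tib(write_mib_per_s: float, retention_days: float, copies: int = 1) -> float:
    """Steady-state retained volume in TiB for a constant write rate."""
    total_mib = write_mib_per_s * SECONDS_PER_DAY * retention_days * copies
    return total_mib / MIB_PER_TIB


daily = retained_tib(100, 1)                  # one day of logical data
monthly = retained_tib(100, 30)               # 30 days of logical data
replicated = retained_tib(100, 30, copies=3)  # 30 days at three copies

print(f"1 day: {daily:.2f} TiB")
print(f"30 days: {monthly:.1f} TiB logical, {replicated:.1f} TiB at 3 copies")
```

The inventory grows linearly with the retention window, so doubling retention doubles the steady-state footprint even though the write rate is unchanged.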

That is why a Kafka retention cost estimate needs four inputs, not one:

  • Write volume: sustained MiB/s or GiB/day, ideally separated by topic class instead of averaged across the whole estate.
  • Retention window: the actual period required for replay, audit, debugging, or legal hold, not the default topic setting.
  • Storage model: whether the platform charges or provisions for replicated broker storage, tiered storage, object storage, or another shared-storage design.
  • Read pattern: mostly tailing consumers, periodic catch-up reads, full historical replays, or multiple fan-out consumers reading older offsets.

The read pattern matters because long retention is rarely passive. A compliance team may ask for a historical slice. A data engineering team may rebuild a lake table after schema correction. A service owner may reset offsets after a bad deployment. A model that counts only retained bytes misses the behavior that makes those bytes valuable.

Storage Questions to Ask About Confluent Cloud

Confluent Cloud is a managed service, so teams should start from Confluent's current pricing pages, cluster type documentation, cost estimator, and billing APIs rather than copying a spreadsheet from a previous renewal. The question is whether your estimate reflects how Confluent measures retained volume and how your topics actually age.

| Question | Why it changes the estimate |
| --- | --- |
| What is the logical write rate after compression? | Kafka stores compressed record batches when producers use compression, so effective retained size can differ from raw application payload size. |
| Which topics use delete retention, compaction, or both? | cleanup.policy=delete ages data by time or size, while compaction keeps the latest value per key and has tombstone behavior. |
| Is retention set per topic or inherited from defaults? | Confluent Cloud documents editable topic-level settings and separate Dedicated-cluster defaults, so defaults may not describe the topics that drive cost. |
| Are teams using Infinite Storage? | Confluent Cloud documents retention.ms=-1 as Infinite Storage; useful for some workloads, but it requires a deliberate cost model. |
| How often do historical reads happen? | A cluster sized for tailing reads can behave differently during a backfill or replay from older offsets. |
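One way to start answering the retention questions is to classify each topic by the settings that decide how its data ages. The sketch below assumes topic configs have already been exported as plain string dictionaries, for example from an admin API or a CLI dump. `cleanup.policy`, `retention.ms`, and `retention.bytes` are real Kafka topic settings; the fallback values are Apache Kafka's defaults and may not match your cluster's defaults. The function name and the topic names are hypothetical.

```python
MS_PER_DAY = 86_400_000


def classify_retention(config: dict[str, str]) -> str:
    """Summarize how a topic ages data, based on topic-level settings."""
    policies = {p.strip() for p in config.get("cleanup.policy", "delete").split(",")}
    if "delete" not in policies:
        # Compact-only topics are bounded by key cardinality, not by time.
        return "compact-only"

    # Assumed fallbacks: Apache Kafka defaults (7 days, no size limit).
    retention_ms = int(config.get("retention.ms", "604800000"))
    retention_bytes = int(config.get("retention.bytes", "-1"))
    if retention_ms == -1 and retention_bytes == -1:
        return "unbounded"

    kind = "compact+delete" if "compact" in policies else "delete"
    limits = []
    if retention_ms != -1:
        limits.append(f"{retention_ms / MS_PER_DAY:g} days")
    if retention_bytes != -1:
        limits.append(f"{retention_bytes} bytes per partition")
    return f"{kind}: " + ", ".join(limits)


# Hypothetical topic inventory for illustration only.
topics = {
    "audit.events": {"cleanup.policy": "delete", "retention.ms": "-1"},
    "cdc.orders": {"cleanup.policy": "compact,delete", "retention.ms": "2592000000"},
    "customer.profile": {"cleanup.policy": "compact"},
    "app.debug": {},
}
labels = {name: classify_retention(cfg) for name, cfg in topics.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Topics labeled `unbounded` are the ones that need an explicit cost model, because their footprint keeps growing until someone changes the setting.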

Those questions should be asked before comparing vendors. Otherwise the comparison collapses into a generic "managed Kafka is expensive" claim. A long-retention workload can be costly on any architecture if the team stores unnecessary data, keeps duplicate topics, or allows uncontrolled replays. The architectural question is narrower: which storage model makes necessary retention economical without turning every broker into a growing disk estate?

A Concrete Retention Scenario

Consider a topic family that writes 100 MiB/s continuously. Assume the data is already measured after producer compression because that is the closest input to Kafka storage behavior. Assume delete retention, no compaction benefit, no cross-region disaster recovery copy, and a read pattern dominated by tailing consumers plus one monthly replay of 10% of the retained window. This is a model, not a vendor quote.

| Retention window | Logical data written | Post-replication storage at 3 copies |
| --- | --- | --- |
| 7 days | ~57.7 TiB | ~173 TiB |
| 30 days | ~247 TiB | ~741 TiB |
| 90 days | ~741 TiB | ~2.17 PiB |

The table explains why storage can overtake the conversation. Moving from 7 to 90 days does not change the daily write rate, but it multiplies steady-state retained inventory by almost 13. The monthly replay assumption adds another dimension: old data must remain readable for recovery, recomputation, or investigation. That is where tiering, caching, prefetch, and object layout start to matter.
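The scenario can be recomputed, including the monthly replay volume that the table leaves out. This is a sketch under the stated assumptions: a constant 100 MiB/s measured after compression, three copies, binary units, and one replay per month that reads 10% of the logical retained window once. The constant and key names are illustrative.

```python
SECONDS_PER_DAY = 86_400
MIB_PER_TIB = 1024 ** 2

WRITE_MIB_PER_S = 100
COPIES = 3
REPLAY_FRACTION = 0.10  # assumed: one monthly replay of 10% of the retained window

rows = []
for days in (7, 30, 90):
    logical_tib = WRITE_MIB_PER_S * SECONDS_PER_DAY * days / MIB_PER_TIB
    rows.append({
        "days": days,
        "logical_tib": logical_tib,
        "replicated_tib": logical_tib * COPIES,
        "replay_tib": logical_tib * REPLAY_FRACTION,  # read once from logical data
    })

for r in rows:
    print(f"{r['days']:>2} days: {r['logical_tib']:.1f} TiB logical, "
          f"{r['replicated_tib']:.1f} TiB stored, "
          f"{r['replay_tib']:.1f} TiB replayed per month")

# Ratio of 90-day to 7-day steady-state inventory.
growth = rows[-1]["logical_tib"] / rows[0]["logical_tib"]
print(f"90-day inventory is {growth:.1f}x the 7-day inventory")
```

Under these assumptions, the replay volume scales with the retention window too, so a longer window raises both the stored footprint and the size of each historical read.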

The same model should be built per workload class. Audit streams may have low read volume but strict retention. CDC streams may need replay after a downstream correction. Observability streams may combine high write volume with uneven searches into older periods. Treating them as one blended Kafka estate hides the topics that drive the storage line.

[Figure: Long-retention Kafka workload map]

Workloads Where Retention Cost Dominates

Audit streams are the cleanest example because retention is a requirement, not a convenience. Security events, entitlement changes, payment state transitions, and administrative actions often need to remain available long after operational consumers have moved on. The platform team can reduce payload size and separate noisy diagnostics from controlled audit topics, but it cannot always shorten the retention window. In that case, the storage model becomes part of the compliance architecture.

CDC and replay-heavy pipelines are different. Their retention windows are often justified by operational recovery: rebuild a derived table, reprocess after a connector bug, or let a new consumer bootstrap from history. The cost risk is the habit of keeping "a little extra" retention because re-ingestion from the source database is risky or impossible. Kafka becomes the recovery substrate, so the team needs storage that fits both steady retention and bursty catch-up reads.

Observability and log streams push the problem from another angle. They produce high volume, have spiky investigation patterns, and often overlap with dedicated log analytics systems. Keeping every debug event in Kafka for 90 days may be unnecessary, but some platform teams retain operational events because downstream systems need replay semantics. The question is whether Kafka is being asked to act as a long-lived event store and a low-latency streaming fabric at the same time.

Tiered Storage Helps, But It Is Not the Same as Primary Object Storage

Object storage is attractive because cloud providers built it for durable, elastic data. Kafka, however, was not originally designed around object APIs. Traditional Kafka writes to broker-local logs and relies on replication across brokers for durability. Tiered storage changes part of that equation by offloading older log segments from broker disk to object storage while preserving Kafka retention semantics. Confluent Platform documentation describes this pattern as moving older segments to object storage so broker capacity does not grow with long-term retention.

That is a useful architectural move, especially for self-managed or platform-operated Confluent deployments. It reduces broker-local storage for older data and separates warm historical storage from compute resources that serve the active stream. But tiered storage is still a two-tier model: fresh data lands in the primary Kafka storage layer first, and older segments are moved out based on policy. The hotset, object upload path, and fetch path still need sizing.
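A rough way to size the two tiers is to split the retention window at the hotset boundary. The sketch below is a simplification, not a description of any product's accounting. It assumes the hotset is kept on broker disks at three copies, that older segments exist once in object storage, and that there is no overlap between the tiers. In practice segments may be uploaded before they leave the hotset. The function and parameter names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
MIB_PER_TIB = 1024 ** 2


def tiered_split_tib(write_mib_per_s: float, retention_days: float,
                     hotset_hours: float, local_copies: int = 3) -> tuple[float, float]:
    """Return (broker-local TiB, object-storage TiB) at steady state."""
    per_day_tib = write_mib_per_s * SECONDS_PER_DAY / MIB_PER_TIB
    hotset_days = min(hotset_hours / 24, retention_days)
    local = per_day_tib * hotset_days * local_copies     # replicated hot data
    remote = per_day_tib * (retention_days - hotset_days)  # assumed single logical copy
    return local, remote


local_tib, remote_tib = tiered_split_tib(100, retention_days=30, hotset_hours=12)
print(f"broker-local: {local_tib:.1f} TiB, object storage: {remote_tib:.1f} TiB")
```

With a 12-hour hotset and 30-day retention, almost all retained bytes sit in the object tier, which is why the upload and fetch paths deserve their own sizing.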

Primary object storage takes a different position. Instead of treating object storage as a secondary destination for cold segments, the Kafka-compatible storage layer is built around shared object storage as the durable repository. That does not mean writing every producer request directly to S3 in the naive way; object storage has higher per-request latency and different I/O economics than local disk. It means the system has to redesign the write path, read path, metadata layout, and recovery mechanism around object storage.

Where AutoMQ Fits

AutoMQ belongs in that second category: Kafka-compatible streaming storage that changes the storage layer rather than attaching a cold archive to a conventional broker log. Its public documentation describes S3Stream as the component that replaces Kafka's native log storage with a storage-compute separated design. Data is written through a Write-Ahead Log for low-latency persistence and fault recovery, then stored in object storage through S3Stream; the docs distinguish this from tiered storage by saying object storage is the primary storage service, while WAL is a lightweight acceleration and recovery layer.

[Figure: AutoMQ WAL and object storage path]

This distinction matters for retention economics. If the durable repository is object storage, long-lived data is no longer tied to broker-local disks in the same way. Brokers can become more stateless because retained data is not owned by a specific broker filesystem. Reassignment, recovery, and scaling can then focus on metadata and compute responsibility rather than copying large retained partitions between disks.

There are trade-offs. Object storage access patterns must be engineered carefully, especially for small writes, high fan-out, and catch-up reads. AutoMQ addresses this through WAL, cache, and S3Stream object layout rather than pretending object storage has the same latency profile as local disk. The cost advantage is not magic pricing by itself; it comes from shifting long retention toward the infrastructure layer built for durable capacity.

For teams evaluating Confluent Cloud storage cost, AutoMQ should enter the shortlist when the workload has all three properties: retention measured in weeks or months, replay as a real operational requirement, and enough retained volume that post-replication storage becomes a first-class budget line. It is less compelling when topics are short-lived, consumers stay near the tail, and managed service simplicity dominates every other factor.

How to Build a Better Retention Cost Model

A practical model starts with topic groups, not clusters. Group topics by retention reason: audit, CDC recovery, observability, product analytics, ML feature replay, and transient operational streams. For each group, calculate logical GiB/day after compression, retention days, compaction expectation, and replay behavior. Then map that to the provider's billing dimensions or the infrastructure components in a self-managed design.

The model should expose assumptions:

  • Storage footprint: logical GiB/day x retention days x replication or storage-model factor.
  • Tailing reads: normal consumer fan-out near the head of the log.
  • Catch-up reads: expected historical replay volume per month or quarter.
  • Network path: same-region, cross-AZ, cross-region, private networking, and any connector or cluster-linking movement.
  • Operational buffer: temporary retention increases during migrations, incident investigations, or backfills.
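The assumptions above can be captured as one record per topic group so that they stay visible next to the numbers. This is a minimal sketch: the class, field names, and sample figures are hypothetical, the operational buffer is modeled as extra retention days, a month is taken as 30 days, and the network path is left out because it depends on provider-specific billing dimensions.

```python
from dataclasses import dataclass


@dataclass
class TopicGroup:
    name: str
    gib_per_day: float             # logical volume after producer compression
    retention_days: float
    storage_factor: float = 3.0    # replication or storage-model factor
    fanout: int = 1                # consumer groups tailing the log
    replay_fraction_per_month: float = 0.0  # share of the window re-read monthly
    buffer_days: float = 0.0       # temporary extra retention (migrations, incidents)

    def storage_gib(self) -> float:
        return self.gib_per_day * (self.retention_days + self.buffer_days) * self.storage_factor

    def tailing_read_gib_per_month(self) -> float:
        return self.gib_per_day * 30 * self.fanout

    def catchup_read_gib_per_month(self) -> float:
        return self.gib_per_day * self.retention_days * self.replay_fraction_per_month


# Hypothetical groups; replace with figures from your own topic inventory.
groups = [
    TopicGroup("audit", gib_per_day=200, retention_days=365, replay_fraction_per_month=0.01),
    TopicGroup("cdc", gib_per_day=500, retention_days=30, fanout=3,
               replay_fraction_per_month=0.10, buffer_days=7),
    TopicGroup("debug-logs", gib_per_day=2000, retention_days=3, fanout=2),
]

ranked = sorted(groups, key=lambda g: g.storage_gib(), reverse=True)
for g in ranked:
    print(f"{g.name}: {g.storage_gib():,.0f} GiB stored, "
          f"{g.tailing_read_gib_per_month():,.0f} GiB tailing reads/month, "
          f"{g.catchup_read_gib_per_month():,.0f} GiB catch-up reads/month")
```

In this illustrative inventory the lowest-volume group drives the storage line while the highest-volume group drives reads, which is the pattern a blended per-cluster average hides.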

This is where FinOps and platform engineering need the same spreadsheet. FinOps can identify the spend line, but platform engineers know why retention exists. Cutting 90 days to 7 days may look attractive until a downstream rebuild fails and the source database can no longer provide the missing history. Keeping 90 days everywhere may feel safe until the storage bill funds data nobody can explain.

Long-lived Kafka data changes the math because it changes what Kafka is being asked to be. If Kafka is only a fast-moving buffer, throughput dominates. If Kafka is the replayable system of record for events, storage architecture dominates. Before the next renewal or migration plan, build the retention model with write volume, retention days, storage model, and replay behavior side by side. Then evaluate whether the current platform is priced and architected for the data you need to keep, not only the data you need to move today.

For teams exploring a Kafka-compatible architecture where object storage is the primary durable layer, the next useful step is to review AutoMQ's S3Stream architecture and run the same retention model against your own topic inventory: AutoMQ documentation.

FAQ

Does Confluent Cloud charge for Kafka storage after replication?

Yes. Confluent Cloud's billing documentation says storage is charged on the total post-replication volume stored on a Kafka cluster, and it documents a built-in replication factor of three. Teams should verify current rates and regional details on Confluent's pricing and billing pages before committing to a budget.

Is Infinite Storage the same as free retention?

No. Confluent Cloud documents Infinite Storage as a retention setting where retention.ms is set to -1. It changes the retention behavior, but retained data still needs to be modeled against the service's storage billing dimensions and the workload's read pattern.

Does compaction eliminate retention cost?

Compaction can reduce retained data for keyed topics by keeping newer records for each key, but it is not a universal substitute for delete retention. Tombstones, key cardinality, compaction lag, and topics that require full event history can still produce substantial retained storage.

Is object storage-backed Kafka only useful for cold archives?

No. A cold archive stores historical data outside the active Kafka storage path. Object-storage-backed Kafka designs such as AutoMQ use object storage as the primary durable storage layer and add WAL, caching, and stream-aware object layout to support Kafka workloads.

When should a team evaluate AutoMQ against Confluent Cloud for retention-heavy workloads?

Evaluate it when retention is measured in weeks or months, replay is operationally important, and post-replication storage is a visible part of the Kafka bill. For short-retention workloads where managed service convenience is the primary goal, the current Confluent Cloud model may remain the simpler choice.
