Capacity Planning Questions for Storage Compaction Economics

Someone searching for "storage compaction economics kafka" is usually past the easy stage of Kafka capacity planning. The cluster is not failing because retention was forgotten, and the team is not asking what log compaction does. They are asking why a topic that should keep only the latest value per key still creates storage pressure, cleaner backlog, broker headroom debates, and a bill that finance cannot map back to a single owner.

That question matters because compacted topics are often the topics the business least wants to lose. They hold account state, device shadows, feature flags, inventory views, service metadata, and changelog streams for stateful processing. These topics look smaller than append-only event history when viewed through the logical data model, but Kafka still has to write every update, preserve enough dirty segments for compaction, clean them later, and keep the broker stable while producers and consumers continue running. The economics are therefore not "latest value only." They are the cost of absorbing all versions until the storage engine can safely discard the obsolete ones.

[Figure: Storage compaction economics Kafka decision map]

The planning mistake is to treat compaction as a storage discount. It is better treated as an operating workload with its own write amplification, disk pressure, recovery behavior, and governance requirements. Once you see it that way, the capacity conversation changes from "how much data will remain after compaction" to "how much infrastructure do we need while compaction catches up?"

Why teams search for "storage compaction economics kafka"

Log compaction in Apache Kafka keeps at least the latest value for each key within each partition of a compacted topic. That model is valuable because consumers can rebuild state without replaying every historical change since the beginning of the system. It also creates a subtle budgeting trap: the logical size of a state table and the physical size of its Kafka log are not the same thing.
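To make the gap between logical and physical size concrete, here is a toy in-memory model of compaction. It is only a sketch of the "latest value per key" idea, not how Kafka's log cleaner is implemented. The tuple layout and the `compact` helper are illustrative assumptions.

```python
def compact(log):
    """Keep only the last record per key; surviving records keep their original offsets."""
    last_offset = {key: offset for offset, key, _ in log}
    return [(offset, key, value) for offset, key, value in log if last_offset[key] == offset]

# Each record is (offset, key, value). A value of None stands in for a tombstone.
log = [
    (0, "a", "v1"),
    (1, "b", "v1"),
    (2, "a", "v2"),
    (3, "a", "v3"),
    (4, "b", None),
]

compacted = compact(log)
print(f"physical records written: {len(log)}, records after compaction: {len(compacted)}")
print(compacted)
```

All five records had to be written and stored before three of them could be discarded, and the tombstone for key "b" is still present after this pass.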

A compacted topic can have a small current keyspace and a large update stream. If 20 million keys change once per day, the physical log behaves very differently than if two million hot keys change every few seconds. Both topics may represent a "current state" table, but the second topic creates more dirty bytes, more cleaner work, more segment churn, and more pressure on broker-local disk. The right capacity question is therefore tied to update rate, key skew, delete behavior, and the cleaner's ability to make progress under peak load.
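The two scenarios above can be put into rough numbers. This is a back-of-the-envelope sketch: the 1,000-byte record size and the 5-second update interval are assumptions chosen for illustration, and replication is ignored.

```python
def daily_update_bytes(keys, updates_per_key_per_day, record_bytes):
    """Bytes appended to the log per day, before replication and before any cleaning."""
    return int(keys * updates_per_key_per_day * record_bytes)

RECORD_BYTES = 1_000   # assumed average record size
GB = 10**9

# Scenario 1: 20 million keys, each updated once per day.
slow = daily_update_bytes(20_000_000, 1, RECORD_BYTES)

# Scenario 2: 2 million hot keys, each updated every 5 seconds (assumed interval).
fast = daily_update_bytes(2_000_000, 86_400 / 5, RECORD_BYTES)

print(f"slow topic: {slow / GB:,.0f} GB/day written, {20_000_000 * RECORD_BYTES / GB:,.0f} GB of current state")
print(f"fast topic: {fast / GB:,.0f} GB/day written, {2_000_000 * RECORD_BYTES / GB:,.0f} GB of current state")
```

Under these assumptions the topic with the smaller keyspace writes 1,728 times more bytes per day, which is why update rate belongs in the capacity model alongside state size.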

Platform teams usually arrive at this search after one of five symptoms appears:

  • Cleaner lag grows during business peaks, and storage usage remains elevated long after produce traffic drops.
  • Broker disks are sized for compaction backlog rather than for the steady-state size of the compacted data.
  • Partition moves or broker replacement windows become hard to schedule because compacted logs still bind data to specific brokers.
  • Chargeback models undercount compacted topics because they measure retained keyspace but ignore write rate and operational headroom.
  • Governance teams need longer recovery or audit windows, which changes delete.retention.ms, tombstone handling, and restore behavior.

None of these symptoms means Kafka compaction is broken. They mean compaction is doing real work, and real work needs a capacity model.
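The chargeback symptom is easy to illustrate. The sketch below splits a monthly bill by a blend of write share and retained share. The topic names, the figures, and the 0.6 write weight are hypothetical, and the blending rule is one possible model rather than an established standard.

```python
def allocate_cost(topics, monthly_cost, write_weight=0.6):
    """Split cost by a weighted blend of written-bytes share and retained-bytes share."""
    total_written = sum(t["written_gb"] for t in topics.values())
    total_retained = sum(t["retained_gb"] for t in topics.values())
    shares = {}
    for name, t in topics.items():
        blended = (write_weight * t["written_gb"] / total_written
                   + (1 - write_weight) * t["retained_gb"] / total_retained)
        shares[name] = round(monthly_cost * blended, 2)
    return shares

topics = {
    "device-shadow": {"retained_gb": 50, "written_gb": 9_000},   # small state, hot updates
    "order-events": {"retained_gb": 950, "written_gb": 1_000},   # large append-only history
}

by_retained_only = allocate_cost(topics, 10_000, write_weight=0.0)
blended = allocate_cost(topics, 10_000, write_weight=0.6)
print("retained only:", by_retained_only)
print("blended:      ", blended)
```

With retained bytes as the only driver, the hot compacted topic is charged 5% of the bill. Once write volume is weighted in, its share rises to 56%.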

The production constraint behind the problem

Traditional Kafka uses a Shared Nothing architecture. Each broker owns local log segments for the partitions assigned to it, and replication copies partition data across brokers for durability. This model is familiar and battle-tested, but it turns compaction economics into a broker-local problem. If a broker receives many high-update compacted partitions, its disk, I/O, cleaner threads, and recovery profile become local bottlenecks even when the cluster has spare capacity elsewhere.

The cleaner also has to respect correctness. Kafka cannot discard old key versions until it can safely rewrite segments and preserve the guarantees that consumers rely on. Tombstones need to remain visible long enough for deletes to propagate. Offsets must remain meaningful even as older records are removed from the compacted view. Transactional and idempotent workloads add more care around visibility and ordering. The result is a background workload that competes with foreground produce and fetch traffic without giving operators a single universal ratio such as "compaction saves 80%."

[Figure: Shared Nothing vs Shared Storage operating model]

Broker-local storage also changes the shape of operational risk. If disks fill faster than the cleaner can reclaim space, the team has to decide whether to add brokers, increase disk, change topic configuration, throttle producers, split hot partitions, or accept a shorter safety margin. Each option touches a different team. Application owners control keys and update rates. Platform teams control partitions and broker shape. SREs own incident response. Finance sees the cloud bill after those decisions have already happened.

This is why compacted topics deserve separate treatment in budget reviews. A generic Kafka cost model that counts retained GiB, broker count, and network transfer is useful, but it misses the part of the workload where transient bytes are the budget driver.

Architecture options and trade-offs

There are three broad ways to handle compaction pressure in a Kafka-compatible estate. The first is to tune the existing Shared Nothing cluster. That can work well when the root cause is a specific hot topic, a small number of under-partitioned changelogs, or cleaner settings that were never revisited after traffic changed. Tuning keeps the operational model familiar, but it does not remove the coupling between partition data and broker-local storage.
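When tuning is the chosen path, the review usually starts with a handful of topic-level settings. The sketch below collects them and renders a `kafka-configs.sh` command. The config names are standard Apache Kafka topic configs, but the values, the topic name, and the bootstrap address are placeholders to review, not recommendations.

```python
# Topic-level settings that shape compaction behavior (values are placeholders).
compaction_settings = {
    "cleanup.policy": "compact",
    # Dirty fraction of the log at which it becomes eligible for cleaning.
    "min.cleanable.dirty.ratio": "0.5",
    # Maximum time a record can stay ineligible for compaction.
    "max.compaction.lag.ms": str(6 * 60 * 60 * 1000),
    # How long tombstones remain readable after cleaning.
    "delete.retention.ms": str(24 * 60 * 60 * 1000),
    # The active segment is not cleaned, so segment roll time affects how soon data is cleanable.
    "segment.ms": str(60 * 60 * 1000),
}

def alter_topic_command(bootstrap, topic, settings):
    """Render a kafka-configs.sh invocation that applies the settings to one topic."""
    pairs = ",".join(f"{key}={value}" for key, value in settings.items())
    return (f"kafka-configs.sh --bootstrap-server {bootstrap} --alter "
            f"--entity-type topics --entity-name {topic} --add-config {pairs}")

print(alter_topic_command("localhost:9092", "device-shadow", compaction_settings))
```

Broker-level cleaner settings such as `log.cleaner.threads` and `log.cleaner.io.max.bytes.per.second` are reviewed separately because they apply to every compacted topic on the broker.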

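The second option is to use Tiered Storage for older Kafka data. Apache Kafka's Tiered Storage moves older log segments to remote storage while brokers retain the active local log. This can help long retention workloads because the entire historical log no longer has to sit on broker disks. It is not the same thing as making brokers stateless. At the time of writing, Apache Kafka's Tiered Storage documentation also lists compacted topics as unsupported, so check the limitations for your version: in practice the feature relieves disk pressure from delete-retention topics that share brokers with compacted ones rather than from the compacted logs themselves. The active log, cleaner work, leader placement, and broker recovery model still matter for compacted topics, especially when the update stream is hot and the working set remains active.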

The third option is a Shared Storage architecture, where brokers keep the Kafka API surface but persistent data is placed in shared object storage instead of being tied to each broker's local disks. This does not make compaction free. It changes where the capacity pressure lands. The platform can reason about compute, WAL storage, object storage, metadata, and cache as separate layers instead of buying broker-local disk as one bundled unit.

| Planning question | Shared Nothing Kafka | Tiered Storage | Shared Storage architecture |
| --- | --- | --- | --- |
| What absorbs dirty bytes before compaction? | Broker-local disks and cleaner capacity | Active broker-local log plus remote segments | WAL storage, object storage, and background object management |
| What happens during broker replacement? | Partition data movement or replica catch-up is central | Local active data still matters | Ownership can move without copying all persistent data |
| How do teams scale for peaks? | Add broker capacity that includes compute and disk | Reduce some historical disk pressure | Scale compute and storage capacity more independently |
| What must be governed? | Topic config, key design, cleaner backlog, disk headroom | Same, plus remote tier behavior | Same, plus object storage policy, WAL type, cache, and metadata scale |

The table is deliberately neutral. There are workloads where tuning a traditional cluster is the most practical move. There are retention-heavy topics where Tiered Storage helps. There are also cloud workloads where buying broker-local storage for compaction backlog becomes the wrong abstraction because the cost driver is not limited to retained data. It is the operational need to survive write bursts, rebalance safely, and recover without turning every storage event into a broker event.

Evaluation checklist for platform teams

A useful evaluation starts with the workload, not with the vendor. For each compacted topic family, collect enough information to separate logical state size from physical write pressure. The numbers do not need to be perfect on the first pass, but they need to be explicit enough that teams can argue about assumptions instead of arguing about opinions.

Use this checklist before changing architecture:

  • Keyspace and update rate: Count active keys, update frequency, key skew, tombstone rate, and peak-to-average write ratio. Compaction economics are dominated by updates, not only by final state size.
  • Cleaner backlog and reclaim window: Track how long dirty bytes remain on disk, how cleaner throughput behaves during peak produce traffic, and whether backlog clears before the next peak.
  • Broker-local failure margin: Model what happens when one broker is lost, one disk is slower than expected, or a partition reassignment overlaps with cleaner backlog.
  • Consumer restore behavior: Measure how long critical consumers take to rebuild from compacted topics, especially after tombstones and schema changes.
  • Governance and ownership: Decide who owns topic compaction settings, delete retention, key design, rollback windows, and cost allocation.
  • Migration reversibility: If architecture changes, define how producers, consumers, offsets, ACLs, schemas, and rollback paths will be validated before cutover.
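The "cleaner backlog and reclaim window" item in the checklist above can be explored with a simple hourly simulation. This is a sketch under stated assumptions: the traffic shape is hypothetical, every incoming byte is treated as eventually reclaimable, and cleaner throughput is assumed to drop during peak hours.

```python
def simulate_backlog(dirty_gb_per_hour, cleaned_gb_per_hour, start_gb=0.0):
    """Return the dirty-byte backlog (GB) at the end of each hour."""
    backlog, history = start_gb, []
    for dirty, cleaned in zip(dirty_gb_per_hour, cleaned_gb_per_hour):
        backlog = max(0.0, backlog + dirty - cleaned)
        history.append(backlog)
    return history

# Hypothetical day: 8 peak hours followed by 16 off-peak hours.
dirty_in = [100.0] * 8 + [20.0] * 16
# Assumption: the cleaner reclaims less while brokers are busy with peak traffic.
cleaned = [60.0] * 8 + [45.0] * 16

history = simulate_backlog(dirty_in, cleaned)
peak_backlog = max(history)
clears_before_next_peak = history[-1] == 0.0

print(f"peak backlog: {peak_backlog:.0f} GB")
print(f"backlog clears before the next peak: {clears_before_next_peak}")
```

In this example the backlog peaks at 320 GB and takes 13 off-peak hours to drain, so the disk has to be sized for that peak even though the backlog is empty by the end of the day.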

[Figure: Readiness checklist for compacted Kafka workloads]

The most useful artifact from this exercise is not a single TCO number. It is a sensitivity model. What happens if update rate doubles but keyspace stays flat? What happens if governance requires a longer delete-retention window? What happens if the cloud team changes the preferred storage class or network boundary? Compacted workloads punish averages because the cost is often paid during the interval between "bytes were written" and "bytes were safely reclaimed."
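A sensitivity model does not need to be elaborate to be useful. The sketch below uses a deliberately simple formula, live state plus dirty bytes waiting to be reclaimed, multiplied by replication and divided by a target utilization. The formula, the parameter names, and all figures are assumptions for illustration.

```python
def required_disk_gb(live_state_gb, dirty_gb_per_hour, reclaim_hours,
                     replication_factor=3, max_utilization=0.7):
    """Rough cluster-wide disk need: live state plus dirty bytes awaiting reclaim."""
    peak_logical = live_state_gb + dirty_gb_per_hour * reclaim_hours
    return peak_logical * replication_factor / max_utilization

baseline = required_disk_gb(live_state_gb=200, dirty_gb_per_hour=50, reclaim_hours=6)
doubled_updates = required_disk_gb(live_state_gb=200, dirty_gb_per_hour=100, reclaim_hours=6)
longer_window = required_disk_gb(live_state_gb=200, dirty_gb_per_hour=50, reclaim_hours=12)

for label, value in [("baseline", baseline),
                     ("update rate doubles", doubled_updates),
                     ("reclaim window doubles", longer_window)]:
    print(f"{label:>24}: {value:,.0f} GB ({value / baseline:.2f}x baseline)")
```

With these inputs, doubling the update rate and doubling the reclaim window both raise the disk requirement by 60% while the keyspace stays flat, which is the kind of result an average-based model hides.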

How AutoMQ changes the operating model

After the evaluation framework is clear, AutoMQ becomes relevant as a specific architectural answer: a Kafka-compatible streaming platform that keeps Kafka protocol semantics while replacing broker-local persistent storage with a Shared Storage architecture backed by object storage.

In AutoMQ, brokers are stateless with respect to persistent data. Writes go through S3Stream, where WAL (Write-Ahead Log) storage provides durable write buffering and S3-compatible object storage serves as the main storage layer. This matters for compaction economics because capacity planning no longer has to buy broker-local disk, compute, and failure headroom as one inseparable unit. The platform can scale brokers for traffic, use object storage for durable data, and reason about WAL type based on latency and deployment needs.

That architecture changes four parts of the planning conversation:

  • Storage is not stranded on a broker. Persistent data lives in shared object storage, so broker replacement and partition movement are less tied to copying local log data.
  • Compute and storage can be scaled with different intent. A peak in cleaner or produce activity does not automatically mean every unit of capacity must include more local disk.
  • Cross-AZ traffic can be designed out of the data path. AutoMQ's shared-storage design and Inter-Zone traffic routing are built to reduce cross-AZ replication traffic that appears in many cloud Kafka deployments.
  • Deployment boundaries remain customer controlled. AutoMQ BYOC runs the control plane and data plane in the customer's cloud account and VPC, which is relevant when compacted topics contain sensitive operational state.

This does not remove the need to design keys well, monitor compaction backlog, or test consumer restore time. Poor key design can still create hot partitions. A delete-heavy workload still needs careful tombstone retention. Object storage request patterns and metadata scale still deserve attention. The difference is that compaction pressure no longer has to be managed primarily by adding stateful brokers with larger disks and longer data movement windows.

For teams evaluating migration, compatibility should be tested as a first-class workstream. Kafka clients, admin operations, consumer group behavior, offsets, transactions, Kafka Connect, and schema-dependent applications should be validated before any production cutover. AutoMQ's Kafka-compatible API lowers the application rewrite burden, and Kafka Linking can help with migration paths that preserve message bytes and offset continuity, but a production migration still needs a rollback plan and a measured acceptance gate.

The practical decision is not "compaction or no compaction." It is whether the infrastructure model matches the economics of compacted workloads. If the expensive part of your compacted topics is broker-local headroom, slow data movement, and capacity tied to peak dirty bytes, a shared-storage Kafka-compatible platform deserves a serious evaluation.

FAQ

Is Kafka log compaction a cost optimization feature?

It can reduce the retained logical history for keyed data, but it should not be budgeted as a simple discount. Kafka still writes every update before older versions can be cleaned. Capacity planning must include update rate, cleaner backlog, tombstones, broker I/O, and recovery behavior.

Does Tiered Storage solve compaction economics?

Tiered Storage helps when older log segments dominate broker disk usage. It does not make brokers stateless, and it does not remove the active-log and cleaner considerations for hot compacted topics. Treat it as one option in the architecture matrix, not as a universal answer.

Which metrics should FinOps teams ask platform teams for?

Ask for active key count, update throughput, dirty-byte backlog, cleaner throughput, disk headroom, tombstone rate, restore time, and partition movement history. Those metrics connect the logical business object to the physical Kafka cost drivers.

Where does AutoMQ fit in the decision?

AutoMQ fits when the team wants Kafka compatibility but does not want persistent data tied to broker-local disks. Its Shared Storage architecture, stateless brokers, WAL storage, and object-storage-backed durability change how teams plan compute, storage, recovery, and cross-AZ traffic.

What is a reasonable next step before migration?

Pick one compacted topic family and build a readiness scorecard. Validate client compatibility, offset behavior, consumer group cutover, rollback, observability, object storage policy, and cost assumptions in a proof of concept before moving production traffic.
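A readiness scorecard can start as a small, explicit data structure. The sketch below is one hypothetical shape: the `ReadinessScorecard` class, the topic family name, and the pass/fail values are illustrative, with the check names taken from the list above.

```python
from dataclasses import dataclass, field

@dataclass
class ReadinessScorecard:
    topic_family: str
    checks: dict = field(default_factory=dict)

    def record(self, name, passed):
        """Store the outcome of one validation item."""
        self.checks[name] = passed

    def blockers(self):
        """Names of the checks that have not passed."""
        return [name for name, passed in self.checks.items() if not passed]

    def ready(self):
        """Ready only when at least one check exists and none are failing."""
        return bool(self.checks) and not self.blockers()

card = ReadinessScorecard("device-shadow")
card.record("client compatibility", True)
card.record("offset behavior", True)
card.record("consumer group cutover", True)
card.record("rollback", False)          # hypothetical: rollback not yet rehearsed
card.record("observability", True)

print(f"{card.topic_family}: ready={card.ready()}, blockers={card.blockers()}")
```

An empty scorecard is treated as not ready, so a topic family cannot pass the gate simply because nobody recorded any checks.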

If your compacted topics are forcing every budget discussion back to broker disks and failure headroom, test the architecture assumption directly. Start an AutoMQ evaluation through AutoMQ Cloud with one high-update compacted workload and compare the operating model against your current cluster.
