Teams usually start researching Kafka log compaction at scale after a compacted topic has stopped behaving like a tidy configuration choice. The topic may back a CDC pipeline, a feature store, a user profile cache, a device registry, or the internal offsets of a platform component. At small scale, log compaction is easy to explain: keep the latest record for each key, delete older values, and preserve tombstones long enough for downstream consumers to see deletes. At production scale, that definition is only the beginning.
The hard part is not whether Kafka supports compaction. Apache Kafka documents log compaction as a retention mechanism that retains at least the latest value for each key, and the cleanup.policy=compact topic setting is the familiar entry point. The hard part is what compaction does to storage locality, broker recovery, rebalancing, cloud cost, and migration planning when the number of keys and partitions keeps growing. The architecture question is no longer "Does this platform support compacted topics?" It is "Where does the compacted log live, and what operational work follows from that answer?"
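The small-scale definition can be made concrete with a toy model. The sketch below is an illustration only, not Kafka's cleaner: it treats the log as an in-memory list, and the `Record` alias and `compact` function are hypothetical names chosen for this example. It keeps the highest-offset record per key, preserves the original offsets, and optionally drops tombstones to mimic what happens once the tombstone retention window has passed. The real cleaner works segment by segment and leaves the active segment untouched, which this model ignores.

```python
from typing import Optional

# (offset, key, value); a value of None models a tombstone (delete marker).
Record = tuple[int, str, Optional[str]]


def compact(log: list[Record], drop_tombstones: bool = False) -> list[Record]:
    """Keep only the highest-offset record per key, in offset order."""
    latest: dict[str, Record] = {}
    for record in log:
        _offset, key, _value = record
        latest[key] = record  # a later record for the same key replaces the earlier one
    survivors = sorted(latest.values(), key=lambda r: r[0])
    if drop_tombstones:
        # Models the state after the tombstone retention window has expired.
        survivors = [r for r in survivors if r[2] is not None]
    return survivors


log: list[Record] = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@example"),
    (2, "user-1", "alice@new.example"),
    (3, "user-2", None),  # tombstone: user-2 was deleted
    (4, "user-3", "carol@example"),
]

compacted = compact(log)
after_tombstone_expiry = compact(log, drop_tombstones=True)
```

Note that the surviving records keep their original offsets, so the compacted log has gaps. Consumers that assume dense offsets are one of the things a production review has to catch.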
Why Teams Search For Kafka Log Compaction At Scale
Compacted topics attract important data because they look like a natural fit for "latest state" workloads. A CDC connector can publish updates keyed by primary key. A risk engine can keep the latest account state. A recommendation system can keep the latest item attributes. A platform team can keep service metadata or offsets in compacted internal topics. The pattern is useful because consumers can rebuild a current view without reading an infinite append-only history.
That usefulness creates an uncomfortable scaling pattern. Compaction does not make the topic small in a predictable way; it makes retention key-aware. If the key cardinality grows, if updates arrive unevenly, if tombstones must remain visible for correctness, or if consumers occasionally replay from old offsets, the broker still needs enough local storage and I/O headroom to keep the log healthy while the cleaner works. The cleaner runs outside the request path, but its scans and rewrites still compete with the foreground path for disk, CPU, page cache, and recovery time.
For platform teams, four symptoms tend to appear together:
- Broker storage becomes a planning constraint. Compacted topics can hold a large keyspace even when each key keeps only the latest value. Capacity planning has to account for segment layout, cleaner lag, tombstone retention, and replay windows, not only daily ingest.
- Operational changes become data movement projects. Adding brokers, replacing brokers, changing partition placement, and recovering failed nodes still involve broker-local log state in a Shared Nothing architecture.
- Cloud networking shows up in the architecture review. Replicas across Availability Zones are good for durability, but they also create repeated inter-zone data movement in many cloud deployments.
- Migration risk concentrates in topic semantics. A cutover that preserves topic names but changes offsets, tombstone behavior, or consumer progress is not equivalent for compacted workloads.
These symptoms are easy to misread as tuning problems. Tuning matters, but it cannot remove the storage boundary that the architecture enforces. If the durable log is owned by broker-local disks, compaction scale is tied to broker-local storage operations.
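A rough sizing model shows why retained keyspace, not daily ingest, tends to drive the storage number. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a capacity planning tool: the class name, its fields, and every input value are hypothetical. It also assumes the log can grow to about `clean_bytes / (1 - dirty_ratio)` before the cleaner catches up, which is a simplification of how a dirty-ratio threshold behaves.

```python
from dataclasses import dataclass


@dataclass
class CompactedTopicEstimate:
    """Hypothetical inputs for a rough compacted-topic storage estimate."""

    live_keys: int            # keys that currently hold a non-deleted value
    avg_record_bytes: int     # average size of one retained record
    pending_tombstones: int   # tombstones still inside their retention window
    tombstone_bytes: int      # average size of one tombstone record
    dirty_ratio: float        # assumed share of the log that may be uncompacted
    replication_factor: int

    def __post_init__(self) -> None:
        if not 0.0 <= self.dirty_ratio < 1.0:
            raise ValueError("dirty_ratio must be in [0, 1)")

    def clean_bytes(self) -> int:
        """Bytes needed if the log were fully compacted right now."""
        return (
            self.live_keys * self.avg_record_bytes
            + self.pending_tombstones * self.tombstone_bytes
        )

    def peak_bytes_per_replica(self) -> int:
        """Assumed upper bound before the cleaner reclaims dirty data."""
        return int(self.clean_bytes() / (1.0 - self.dirty_ratio))

    def peak_cluster_bytes(self) -> int:
        """Every replica stores its own copy on broker-local disk."""
        return self.peak_bytes_per_replica() * self.replication_factor


estimate = CompactedTopicEstimate(
    live_keys=200_000_000,
    avg_record_bytes=500,
    pending_tombstones=5_000_000,
    tombstone_bytes=100,
    dirty_ratio=0.5,
    replication_factor=3,
)
```

With these made-up inputs, the fully compacted data is about 100 GB, but the cluster-wide peak is roughly six times that once the uncompacted share and replication are included. Replay windows and cleaner lag would push the real figure higher.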
The Production Constraint Behind The Problem
Traditional Kafka is a Shared Nothing architecture. Each broker owns local log segments for the partitions assigned to it, and durability comes from replication across brokers through ISR (In-Sync Replicas). That design is not a historical mistake. It is a strong design for a world where local disks are the natural persistence layer and moving data between machines is part of operating a distributed log.
Compaction puts pressure on that model because the compacted topic is both a stream and a state reconstruction source. The broker must accept incoming writes, serve active consumers, retain enough historical records for correctness, compact old segments, and keep replicas in sync. When the keyspace becomes large, the "latest value per key" rule does not eliminate the need to store, scan, and rewrite log segments. It changes which records survive.
That difference matters during operations. A partition reassignment does not move an abstract topic; it moves ownership of data that was physically organized around broker-local storage. A broker replacement is not only a compute event; it is also a storage recovery event. A hot partition is not only a traffic imbalance; it is a local disk and cleaner imbalance. Tiered Storage can reduce part of the historical retention burden by moving completed segments to remote storage, but Kafka's Tiered Storage documentation still describes a local tier and a remote tier. In Apache Kafka 4.2 documentation, compacted topics are listed as a limitation for Tiered Storage, so teams should verify their exact Kafka version and distribution before treating tiering as a compaction answer.
The consequence is a simple but often missed distinction: compaction policy is a topic setting, while compaction scale is an architecture property. If your production runbooks are dominated by broker disk sizing, partition movement, cross-zone replica traffic, and recovery windows, changing one topic knob will not change the operating model.
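One way to see the difference between a topic knob and an operating model is to estimate how long a data-copying reassignment takes. The sketch below is simple arithmetic under assumptions: the function name is hypothetical, the sizes and the throttle are invented for illustration, and it ignores catch-up traffic from writes that arrive while the copy is running.

```python
def reassignment_copy_seconds(bytes_to_move: int, throttle_bytes_per_sec: int) -> float:
    """Lower-bound copy time when a reassignment must move partition data."""
    if bytes_to_move < 0:
        raise ValueError("bytes_to_move must be non-negative")
    if throttle_bytes_per_sec <= 0:
        raise ValueError("throttle_bytes_per_sec must be positive")
    return bytes_to_move / throttle_bytes_per_sec


# Assumed numbers: 2 TiB of compacted partition data, copied under a
# 50 MiB/s replication throttle chosen to protect foreground traffic.
TIB = 1024 ** 4
MIB = 1024 ** 2

seconds = reassignment_copy_seconds(2 * TIB, 50 * MIB)
hours = seconds / 3600  # about 11.65 hours under these assumed numbers
```

No topic-level compaction setting changes that figure, because it depends on how many bytes the broker owns and how fast they can be moved safely.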
Architecture Options And Trade-Offs
There is no universal answer for compacted topics at scale. A local-storage Kafka cluster can be the right answer when the workload is latency-sensitive, the team has mature broker operations, and the keyspace fits comfortably within the storage and cleaner budget. A managed Kafka service can be the right answer when the main goal is to reduce operational ownership while preserving the traditional broker model. Tiered Storage can be the right answer when long retention is the main pain and compacted-topic support is confirmed for the chosen distribution.
The mistake is evaluating these options with a feature checklist that stops at "supports compaction." A useful architecture review asks what each option does when the compacted workload becomes large, uneven, and operationally important.
| Evaluation question | Why it matters for compacted topics | What to verify |
|---|---|---|
| Where is durable data stored? | Broker-local disks make scaling and recovery storage-aware. Shared storage changes that boundary. | Local disk, remote tier, or primary shared storage. |
| What happens during reassignment? | Compacted topics can be large even when write rate is moderate. | Whether reassignment copies data or changes ownership metadata. |
| How are tombstones handled? | Delete semantics depend on consumers observing tombstones within the configured window. | Topic configs, cleaner behavior, migration behavior, and replay tests. |
| Does Tiered Storage apply? | Tiering can help retention but may not support every compacted workload. | Official docs for the exact Kafka version or service. |
| How are offsets preserved in migration? | Reprocessing from the wrong offset can rebuild the wrong state. | Byte-for-byte topic copy, Consumer group progress, and rollback plan. |
| Who controls data boundaries? | Governance teams care where records, logs, metrics, and control metadata live. | VPC, bucket, IAM, encryption, and observability paths. |
This table is deliberately architectural. It does not assume that shared storage is always better, and it does not dismiss local storage. It forces the decision to surface the cost of ownership. If compacted topics are small and stable, the traditional model may be perfectly reasonable. If compacted topics are becoming the state backbone for multiple systems, the storage boundary deserves first-class attention.
Evaluation Checklist For Platform Teams
A practical review starts with inventory, not diagrams. List every compacted topic, its owner, key format, update rate, retained bytes, tombstone policy, consumer groups, and recovery expectation. Then look at the operational events that have already hurt the team: broker replacement, disk expansion, partition reassignment, cluster migration, AZ failover, and replay after a downstream outage. The workload tells you which architecture questions matter.
Use the following scorecard before choosing a Kafka-compatible streaming platform:
- Compatibility. Validate producers, consumers, admin clients, transactions, idempotent producers, ACLs, Kafka Connect jobs, and monitoring tools against the target platform. Compacted topics often sit behind stateful systems, so a thin "produce and consume one record" test is not enough.
- Compaction semantics. Test updates, deletes, tombstones, long-idle keys, late consumers, and replay from older offsets. The test should use real serializers and representative key cardinality.
- Storage operations. Model what happens when a broker is added, removed, replaced, or isolated. If the answer involves copying a large amount of compacted data, put that time in the runbook.
- Cost and network locality. Separate compute, broker storage, object storage, object requests, cross-zone traffic, and migration overlap. A compacted topic with modest ingest can still become expensive if its retained keyspace drives disk and recovery requirements.
- Governance. Confirm where business records, object segments, metrics, logs, and control-plane metadata live. The cleanest architecture story is weak if it does not match your security boundary.
- Rollback. Define how producers, consumers, offsets, and compacted topic state return to the previous cluster if cutover fails. Rollback is part of migration design, not a meeting after the incident.
The checklist is also a useful way to avoid overbuying architecture. If the current pain is cleaner tuning on a handful of topics, start there. If the pain is that every capacity change turns into storage choreography, the next section becomes relevant.
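The compaction semantics item in the scorecard can be phrased as two checks: a consumer that replays the compacted log must rebuild the same state as one that replayed the full history, and a consumer that resumes after tombstones have been removed can be left with stale keys. The sketch below illustrates both with an in-memory toy log. The helper names (`compact`, `resume`, `materialize`) are hypothetical, and a real test would run against the target cluster with real serializers rather than Python lists.

```python
from typing import Optional

Record = tuple[int, str, Optional[str]]  # (offset, key, value); None = tombstone


def compact(log: list[Record], keep_tombstones: bool = True) -> list[Record]:
    """Toy compaction: latest record per key, optionally without tombstones."""
    latest = {key: (offset, key, value) for offset, key, value in log}
    survivors = sorted(latest.values(), key=lambda r: r[0])
    if not keep_tombstones:
        survivors = [r for r in survivors if r[2] is not None]
    return survivors


def resume(state: dict[str, str], records: list[Record], from_offset: int) -> dict[str, str]:
    """Apply records at or after from_offset on top of an existing state."""
    new_state = dict(state)
    for offset, key, value in records:
        if offset < from_offset:
            continue
        if value is None:
            new_state.pop(key, None)  # tombstone removes the key
        else:
            new_state[key] = value
    return new_state


def materialize(records: list[Record]) -> dict[str, str]:
    """Rebuild state from the beginning of the given records."""
    return resume({}, records, 0)


log: list[Record] = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@example"),
    (2, "user-1", "alice@new.example"),
    (3, "user-2", None),
    (4, "user-3", "carol@example"),
]

# Check 1: replaying the compacted log yields the same state as the full log.
full_state = materialize(log)
compacted_state = materialize(compact(log))

# Check 2: a consumer that stopped after offset 1 and resumes from offset 2.
snapshot = materialize(log[:2])
on_time = resume(snapshot, compact(log), 2)
too_late = resume(snapshot, compact(log, keep_tombstones=False), 2)
stale_keys = set(too_late) - set(full_state)
```

In this toy run the late consumer still holds `user-2`, because the delete marker was gone before it resumed. That is the failure mode the tombstone and late-consumer tests are meant to surface before a migration, not after.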
How AutoMQ Changes The Operating Model
Once the review points to storage-bound operations, Shared Storage architecture becomes worth evaluating. AutoMQ is a Kafka-compatible streaming platform that keeps Kafka protocol semantics while replacing broker-local log storage with S3Stream, a shared streaming storage layer built on object storage and WAL storage. The important point is not that object storage exists somewhere in the system. The important point is that durable data is no longer owned by a broker's local disk.
In AutoMQ, brokers are stateless from a persistent-data perspective. Produce requests are written durably through WAL storage and then uploaded to S3 storage through S3Stream. Object storage becomes the primary durable repository, while brokers handle protocol processing, partition leadership, caching, scheduling, and read serving. AutoMQ documentation describes this as a move from Shared Nothing architecture to Shared Storage architecture, with stateless brokers and partition reassignment that can complete in seconds because reassignment no longer requires copying the partition's full data set between brokers.
That changes how compacted workloads are operated:
- Scaling changes from moving data to moving ownership. When persistent data sits in shared storage, adding or replacing brokers is less tied to copying compacted log segments across broker disks.
- Recovery depends less on a failed broker's local state. The durable record of the stream is in S3 storage, with WAL storage used for write acceleration and recovery.
- Cost modeling becomes more explicit. Teams model compute, WAL storage, S3 storage, request patterns, cache behavior, and network locality instead of treating broker disk as the central storage unit.
- Migration can be tested around Kafka semantics. For AutoMQ commercial editions, Kafka Linking is designed for byte-to-byte topic synchronization and Consumer group progress migration, which is especially relevant when compacted topics back stateful consumers.
Shared storage does not remove the need to test compaction behavior. You still need to validate tombstones, cleaner behavior, replay, offsets, client compatibility, and observability. It also does not mean every write goes directly to object storage without an acceleration layer; the WAL exists because streaming workloads need a durable write path that matches Kafka expectations. The architectural change is narrower and more powerful than a slogan: brokers stop being the place where durable topic data must live.
For teams evaluating Kafka log compaction at scale, that is the decision point. If compacted topics are mainly a topic configuration concern, tune the current cluster and keep the design simple. If compacted topics are forcing storage-heavy runbooks, slow recovery, and expensive over-provisioning, evaluate whether a Kafka-compatible Shared Storage architecture changes the shape of the work.
If compacted topics are becoming a storage operations problem rather than a topic configuration problem, review the AutoMQ deployment options and test your largest compacted workload against a Shared Storage architecture before the next capacity event forces the decision.
FAQ
Is log compaction the same as retention?
No. Time- or size-based retention (cleanup.policy=delete) removes old records once they exceed an age or size limit, regardless of key. Log compaction (cleanup.policy=compact) retains at least the latest value for each key and removes older values for the same key over time. A topic can use compaction, deletion, or both, depending on the configured cleanup policy and workload requirements.
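The distinction shows up directly in topic configuration. The sketch below models a topic's settings as a plain dictionary and classifies the cleanup behavior. The config keys are standard Kafka topic-level settings, but the values are placeholders rather than recommendations, and the helper functions are hypothetical names for this example. Verify names and defaults against the documentation for your Kafka version or service.

```python
VALID_POLICIES = {"compact", "delete"}

# Placeholder values for illustration; tune per workload.
example_topic_config: dict[str, str] = {
    "cleanup.policy": "compact",
    "min.cleanable.dirty.ratio": "0.5",  # how dirty the log gets before cleaning
    "delete.retention.ms": "86400000",   # how long tombstones stay visible
    "min.compaction.lag.ms": "0",        # minimum age before a record can be compacted
    "segment.ms": "604800000",           # roll interval; the active segment is not compacted
}


def cleanup_policies(config: dict[str, str]) -> set[str]:
    """Parse cleanup.policy, which may be a comma-separated list."""
    raw = config.get("cleanup.policy", "delete")  # Kafka's default policy is delete
    policies = {p.strip() for p in raw.split(",") if p.strip()}
    unknown = policies - VALID_POLICIES
    if not policies or unknown:
        raise ValueError(f"invalid cleanup.policy: {raw!r}")
    return policies


def describe_cleanup(config: dict[str, str]) -> str:
    """Summarize which cleanup behavior a topic config asks for."""
    policies = cleanup_policies(config)
    if policies == {"compact", "delete"}:
        return "compacts by key and also deletes by time or size"
    if policies == {"compact"}:
        return "compacts by key only"
    return "deletes by time or size only"
```

A check like this can run over an exported topic inventory to confirm that every topic expected to be compacted is actually configured that way on both the source and target clusters.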
Does Tiered Storage solve compacted topic scaling?
It depends on the Kafka version or managed service. Tiered Storage can help with long retention by moving eligible completed segments to remote storage, but it still has a local tier. Apache Kafka 4.2 documentation lists compacted topics as a limitation for Tiered Storage, so compacted workloads need explicit validation rather than an assumption.
Does shared storage remove Kafka compaction?
No. Shared storage changes where durable data lives and how broker operations behave. Kafka-compatible platforms still need to preserve compaction semantics for compacted topics. The benefit is that broker replacement, scaling, and reassignment are less coupled to broker-local persistent data.
What should teams test before migrating compacted topics?
Test real keys, tombstones, serializers, topic configs, Consumer group progress, replay from older offsets, connector behavior, and rollback. Include the largest and most operationally important compacted topics, not only a synthetic small topic.
Where should AutoMQ fit in an evaluation?
AutoMQ fits when the team wants Kafka compatibility but wants to move durable storage out of broker-local disks. It is most relevant when scaling, recovery, partition reassignment, and cloud storage cost are becoming architecture-level concerns.
References
- Apache Kafka documentation: Log compaction
- Apache Kafka documentation: cleanup.policy
- Apache Kafka documentation: Tiered Storage
- AutoMQ documentation: Compatibility with Apache Kafka
- AutoMQ documentation: S3Stream Shared Streaming Storage
- AutoMQ documentation: WAL storage
- AutoMQ documentation: Stateless broker
- AutoMQ documentation: Partition reassignment in seconds