Near-real-time data marts tend to fail in a place that looks boring on an architecture review: retention. The pipeline has Kafka topics, a stream processor, a lakehouse table, and a dashboard that refreshes fast enough for the business. Then the first backfill arrives, a data contract changes, or a downstream mart needs to replay the last several days of events. The team discovers that the topic retention setting was treated as a storage cleanup knob, not as part of the mart's correctness model.
That distinction matters. A near-real-time mart is not a batch table with a faster ingestion path. It is a continuously refreshed serving layer that depends on stream history, table snapshots, consumer offsets, and operational recovery all staying aligned. If any of those windows drift apart, the mart can look fresh while silently losing the ability to correct itself.
The practical question behind near-real-time marts on Kafka is therefore not whether Kafka can feed a mart. Kafka can do that. The harder questions are how long the stream should remain available, who owns that lifecycle, and whether the infrastructure can support the answer without turning replay safety into runaway broker capacity.
Why Near-Real-Time Marts Stress Retention
Traditional warehouse marts usually hide retention decisions behind batch partitions. A fact table might keep raw landing data for a defined period, curated tables for longer, and aggregated marts according to business policy. Near-real-time marts move part of that lifecycle into the stream. The event log becomes both the ingestion path and the first recovery surface.
That creates a mismatch between how teams often configure Kafka and how marts actually behave. Kafka topic retention is commonly set by platform teams using broad defaults: hours for high-volume telemetry, days for business events, longer windows for audit-heavy streams. Data teams, on the other hand, think in terms of correction windows, schema evolution, and how far back a mart needs to rebuild after a bad transformation. Those are related timelines, but they are not the same timeline.
A streaming-aware retention policy ties the timelines together:
- Operational replay window. How far back can a consumer group reset offsets and rebuild state after a failed deployment, connector outage, or stream processing bug?
- Data correction window. How long can late events, CDC corrections, duplicate suppression, and contract fixes be applied without relying on an external archive?
- Table maintenance window. How do Kafka topic retention, lakehouse snapshot expiration, compaction, and file cleanup align so queryable tables do not outlive the stream evidence used to build them?
- Cost and capacity window. What does a longer replay horizon do to broker storage, network traffic, scaling time, and cross-zone movement?
The last item is where the argument becomes uncomfortable. A longer retention window is often the simplest answer for data correctness, but it is not free in a broker-local architecture. Keeping more log history means provisioning more disk, moving more data during partition reassignment, and carrying more headroom for recovery events that happen irregularly but matter when they happen.
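One way to tie the first two windows together is to derive the topic's `retention.ms` from them instead of picking a default. The sketch below is illustrative only: the function names, the 12-hour safety margin, and the example windows are assumptions, not values from any particular deployment.

```python
from datetime import timedelta


def required_retention(operational_replay: timedelta,
                       data_correction: timedelta,
                       safety_margin: timedelta = timedelta(hours=12)) -> timedelta:
    """Return a retention window covering the longest dependency plus a margin.

    The margin is a hypothetical buffer for detection delay: an incident is
    rarely noticed at the moment it starts.
    """
    return max(operational_replay, data_correction) + safety_margin


def to_retention_ms(window: timedelta) -> int:
    """Convert a window into the millisecond value used by retention.ms."""
    return int(window.total_seconds() * 1000)


# Example: a 2-day operational replay window and a 5-day correction window.
window = required_retention(timedelta(days=2), timedelta(days=5))
print(to_retention_ms(window))  # 5.5 days expressed in milliseconds
```

The point of the calculation is that the longer of the two windows wins. A topic sized only for operational replay would quietly cut the correction window short.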
The Lakehouse Freshness Constraint
Lakehouse teams usually describe freshness as a time target: dashboards should reflect events within minutes, feature tables should update before model scoring, or operational analytics should track a business process while the process is still active. That framing is useful, but it is incomplete. Freshness without replay is fragile because the mart can stay current while losing the ability to repair its history.
Apache Iceberg, for example, gives lakehouse tables a strong foundation for snapshots, schema evolution, partition evolution, and time travel. Query engines can reason over table metadata instead of treating object storage as a pile of files. But the stream feeding those tables still has to answer questions that table metadata alone cannot solve. Was an event transformed with the right code version? Did a consumer skip a range? Can the team rebuild from the source topic rather than from an already-corrupted table snapshot?
This is why retention belongs in the same design conversation as freshness. A mart that refreshes every few minutes may still need days of stream history if the business correction window is measured in days. A CDC-based mart may need topic compaction for entity state, but also delete retention and tombstone handling that match downstream table cleanup. A feature mart may need a shorter operational window, yet stricter auditability around offset commits and transformation versions.
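For the CDC case, the relevant Kafka topic settings can be written down next to the downstream assumption they depend on. The configuration keys below (`cleanup.policy`, `retention.ms`, `delete.retention.ms`, `min.compaction.lag.ms`) are standard Kafka topic configs; the values, the helper function, and the "maximum sink catch-up time" are hypothetical and only illustrate the kind of check a team might run.

```python
DAY_MS = 24 * 60 * 60 * 1000

# Hypothetical settings for a CDC changelog topic feeding a mart.
cdc_topic_config = {
    # Compaction keeps the latest value per key; time retention bounds total history.
    "cleanup.policy": "compact,delete",
    "retention.ms": str(14 * DAY_MS),
    # How long tombstones (delete markers) remain readable after compaction.
    "delete.retention.ms": str(3 * DAY_MS),
    # Minimum time a record stays uncompacted, so consumers can see every change.
    "min.compaction.lag.ms": str(1 * DAY_MS),
}


def tombstones_outlive_sink(config: dict, max_sink_catchup_ms: int) -> bool:
    """True if delete markers stay in the topic at least as long as the slowest
    expected sink catch-up (an assumed figure supplied by the data team)."""
    return int(config["delete.retention.ms"]) >= max_sink_catchup_ms


print(tombstones_outlive_sink(cdc_topic_config, 2 * DAY_MS))
print(tombstones_outlive_sink(cdc_topic_config, 4 * DAY_MS))
```

If the second check fails, a sink that falls behind by four days could miss deletes entirely, and the table would keep rows that the source system has already removed.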
The right answer is rarely "keep everything forever." That shifts the problem from correctness to cost and governance. A better answer is to classify streams by the role they play in the mart:
| Stream role | Typical mart dependency | Retention design question |
|---|---|---|
| Raw event source | Rebuild and audit | How long must the original event history remain independently replayable? |
| CDC changelog | Entity state and deletes | Should the topic be compacted, time-retained, or both? |
| Derived stream | Intermediate transformation | Can this be regenerated from raw topics, or is it a recovery boundary? |
| Mart sink feed | Queryable table ingestion | Does table snapshot retention match stream replay and validation needs? |
This table is not a substitute for policy. It is a way to stop treating every topic as a storage object with a TTL. The more a topic participates in correctness, the more its retention policy should be reviewed with the data product owner, not only with the Kafka operator.
Stream-to-Table Architecture Options
A near-real-time mart usually lands in one of three patterns. The first is Kafka to stream processor to table, where Flink, Spark Structured Streaming, Kafka Streams, or another engine transforms the stream and writes to the lakehouse. The second is Kafka Connect or a sink connector writing directly into an analytical store. The third is a built-in stream-to-table path where the streaming platform itself exposes a table-oriented ingestion feature.
Each pattern can work. The difference is where retention risk shows up. In the stream-processor pattern, the processing engine adds checkpoints, state stores, and deployment versioning to the retention discussion. In the connector pattern, task failure, offset management, and sink idempotency become central. In the built-in stream-to-table pattern, the platform boundary becomes narrower, but the team still needs clarity on catalog integration, table cleanup, and rollback.
Kafka's own semantics are a good anchor for this evaluation. Consumer groups track offsets. Producers and consumers can be configured for stronger delivery guarantees. Transactions can help coordinate writes across records and partitions in supported workflows. Topic retention, compaction, and offset retention are separate controls, so a mart design has to reason about more than one lifecycle setting.
The hidden danger is assuming that the table is the recovery boundary because it is the analytical object users query. In many production incidents, the table is where the symptom appears, not where recovery should begin. If the ingestion job wrote malformed records, the team needs to replay the source range. If a schema change broke a derived mart, the team needs the last known-good transformation and enough stream history to rebuild. If a connector stalled, the team needs offset visibility and a retained source log.
That makes retention an architecture decision rather than a topic setting. The policy should be written as a contract:
- Source topics retain enough history to cover the maximum operational repair window for the marts they feed.
- Derived topics declare whether they are disposable intermediates or independent recovery boundaries.
- Compacted topics document tombstone and delete semantics in the same place as downstream table cleanup rules.
- Consumer offset retention is checked against the longest expected pause, migration, or rollback scenario.
- Lakehouse snapshot expiration is coordinated with stream replay so table history and source evidence do not contradict each other.
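A contract like the one above can also be checked mechanically. The following is a minimal sketch, assuming the team records each window in hours; the class, field names, and rules are hypothetical. The snapshot rule in particular encodes one possible reading of "coordinated": the table should be able to roll back at least as far as the replay would start.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MartRetentionContract:
    source_retention_hours: int        # topic retention.ms, expressed in hours
    repair_window_hours: int           # maximum operational repair window
    offsets_retention_hours: int       # broker offsets.retention.minutes, in hours
    longest_consumer_pause_hours: int  # longest expected pause, migration, or rollback
    snapshot_expiration_hours: int     # lakehouse snapshot expiration


def violations(c: MartRetentionContract) -> List[str]:
    """Return human-readable contract violations (empty list means aligned)."""
    found = []
    if c.source_retention_hours < c.repair_window_hours:
        found.append("source topic expires before the repair window ends")
    if c.offsets_retention_hours < c.longest_consumer_pause_hours:
        found.append("committed offsets may expire during a planned pause")
    if c.snapshot_expiration_hours < c.repair_window_hours:
        found.append("table cannot roll back to where the replay would start")
    return found


contract = MartRetentionContract(
    source_retention_hours=72,
    repair_window_hours=120,
    offsets_retention_hours=168,
    longest_consumer_pause_hours=48,
    snapshot_expiration_hours=240,
)
for problem in violations(contract):
    print(problem)
```

Running a check of this kind in CI or during a design review surfaces the mismatch while it is still a configuration change.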
This level of detail may feel excessive until a failed mart rebuild turns into a multi-team incident. Retention is much less painful when it is reviewed before the incident than when it is reverse-engineered from missing offsets and expired log segments.
Evaluation Checklist for Platform Teams
Platform teams need a way to evaluate retention decisions without becoming the owner of every data product. The clean boundary is to ask data teams for correctness requirements and then translate those requirements into infrastructure consequences. That translation should include compatibility, storage, scaling, governance, and migration risk.
Start with Kafka compatibility. If the mart stack depends on existing clients, connectors, schema tooling, consumer group management, or offset reset workflows, the streaming platform should preserve those operational habits. A platform migration that changes client behavior can be more expensive than the storage problem it was meant to solve.
Then model the retention cost. Do not stop at bytes stored. Broker-local Kafka retention affects disk provisioning, replication traffic, partition reassignment time, and recovery headroom. In cloud deployments, cross-zone traffic and replicated block storage can become part of the effective cost of keeping a longer replay horizon. Object storage can change that equation, but teams still need to account for request patterns, WAL design, and query-side table maintenance.
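A back-of-the-envelope model makes the broker-local part of that cost visible. The sketch below assumes steady ingress, no compression, evenly distributed partitions, and a fixed reassignment throttle, so treat the output as an order-of-magnitude estimate. All function names and inputs are hypothetical.

```python
def retained_bytes(ingress_bytes_per_sec: int,
                   retention_days: int,
                   replication_factor: int) -> int:
    """Bytes held on broker disks: ingress rate x retention x replicas."""
    return ingress_bytes_per_sec * retention_days * 86_400 * replication_factor


def broker_drain_hours(total_bytes: int,
                       broker_count: int,
                       throttle_bytes_per_sec: int) -> float:
    """Hours to move one broker's share of retained data at a given throttle."""
    per_broker = total_bytes / broker_count
    return per_broker / throttle_bytes_per_sec / 3600


# Example: 50 MB/s ingress, 7 days of retention, replication factor 3.
total = retained_bytes(50_000_000, 7, 3)
print(total / 1e12)                               # about 90.7 TB on disk
print(broker_drain_hours(total, 6, 100_000_000))  # 42.0 hours at 100 MB/s
```

Doubling the retention window doubles both numbers in this model. The storage bill grows linearly, but so does the time a broker replacement or rebalance keeps the cluster in a degraded state.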
The governance questions are equally concrete. Who can shorten retention? Who approves a topic deletion? What happens when a privacy deletion requirement conflicts with a replay requirement? Can an auditor trace a mart row back to the stream and transformation that produced it? If these questions are answered only in tickets and chat threads, the policy will drift.
A practical readiness review should produce a short scorecard:
| Area | Passing signal | Failure signal |
|---|---|---|
| Compatibility | Existing Kafka clients and offset workflows remain valid | Migration requires custom client behavior |
| Cost | Retention growth is modeled across storage and network | Policy is chosen from a default TTL |
| Scaling | Longer history does not make rebalancing operationally risky | Partition movement grows with retained bytes |
| Governance | Data owners approve lifecycle rules | Platform team changes retention alone |
| Migration | Replay, rollback, and validation are rehearsed | Cutover depends on hope and dashboard checks |
| Observability | Lag, offsets, sink status, and table freshness are tracked together | Each team watches a different dashboard |
The point is not to slow every mart project with bureaucracy. The point is to make retention visible before it becomes a silent dependency. Once the scorecard exists, teams can reuse it across marts and tune policy by workload class.
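The observability row can be made concrete by evaluating replay headroom and table freshness in the same check. This is a sketch under assumptions: the metric names, the 25% headroom threshold, and the way each value is collected are hypothetical and would come from whatever monitoring the team already runs.

```python
from dataclasses import dataclass
from typing import List

HOUR_MS = 3_600_000
MINUTE_MS = 60_000


@dataclass
class MartHealth:
    retention_ms: int               # topic retention.ms
    oldest_unprocessed_age_ms: int  # age of the oldest record the sink has not committed
    table_staleness_ms: int         # now minus the newest event time visible in the table
    freshness_target_ms: int        # agreed freshness target for the mart


def assess(h: MartHealth, min_headroom_ratio: float = 0.25) -> List[str]:
    """Combine stream-side and table-side signals into one list of findings."""
    findings = []
    headroom = h.retention_ms - h.oldest_unprocessed_age_ms
    if headroom <= 0:
        findings.append("unprocessed records may already have expired")
    elif headroom < h.retention_ms * min_headroom_ratio:
        findings.append("replay headroom is low")
    if h.table_staleness_ms > h.freshness_target_ms:
        findings.append("table is staler than its freshness target")
    return findings


# A sink that is 60 hours behind on a 72-hour topic, while the table still looks fresh.
print(assess(MartHealth(72 * HOUR_MS, 60 * HOUR_MS, 2 * MINUTE_MS, 5 * MINUTE_MS)))
```

The example shows why the signals belong together: a freshness dashboard alone would report this mart as healthy while its recovery margin is almost gone.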
How AutoMQ Changes the Operating Model
At this point, the platform requirement is clear: keep Kafka-compatible semantics and operational workflows, but avoid binding every extra day of replay history to broker-local storage pressure. This is where a shared storage architecture becomes relevant.
AutoMQ is a Kafka-compatible streaming platform that separates broker compute from durable stream storage. Its Shared Storage architecture moves the storage layer onto object storage with a WAL layer for write efficiency, while brokers become more stateless than in a traditional broker-local model. The important shift is not a marketing label. It changes what retention means operationally.
In a broker-local Kafka cluster, retained history lives with the broker replicas that host the partitions. Longer retention therefore increases the amount of data tied to broker lifecycle operations. Scaling, partition movement, and recovery planning all have to account for where the bytes sit. In a shared storage model, durable stream history is backed by object storage, so compute changes do not require the same style of local log movement.
For near-real-time marts, that matters in four ways. First, replay history can be treated more like a data lifecycle requirement and less like a broker disk reservation problem. Second, elastic scaling becomes easier to reason about because broker compute is less coupled to retained bytes. Third, cross-zone traffic can be reduced when storage and fetch paths are designed around cloud-native data locality. Fourth, stream-to-table features such as AutoMQ Table Topic can be evaluated as part of the same lifecycle discussion as Kafka retention and lakehouse table maintenance.
This does not remove policy work. Teams still need to decide how long raw sources, CDC changelogs, derived streams, and table snapshots should live. They still need to verify catalog behavior, delete handling, access control, observability, and rollback. Shared storage changes the cost and operations envelope in which those decisions are made; it does not make governance disappear.
That is the right role for infrastructure. It should make a sound policy affordable to run, not decide the policy on behalf of the data owners. If your marts need longer replay windows than your broker-local cluster can comfortably carry, evaluate whether a Kafka-compatible shared storage design gives you the operational room to align retention with correctness.
For teams exploring that direction, the AutoMQ architecture overview is a useful next step: review the Shared Storage model in the AutoMQ documentation.
FAQ
Is Kafka retention enough to make a near-real-time mart reliable?
Kafka retention is necessary, but it is not sufficient by itself. A reliable mart also needs offset management, transformation versioning, table snapshot policy, sink observability, and a tested rebuild path. Retention gives the team room to recover; the rest of the design determines whether recovery is predictable.
Should every mart source topic keep data for the same duration?
No. Retention should follow the role of the stream. Raw sources often need longer replay windows because they are the recovery base. Derived streams can sometimes be shorter if they can be regenerated from raw topics. CDC topics may need compaction plus carefully managed delete retention. A single default usually hides these differences.
How does shared storage help with retention?
Shared storage reduces the coupling between retained stream history and broker-local disk operations. That can make longer replay windows easier to operate because scaling and broker lifecycle tasks are less tied to moving retained bytes between brokers. Teams still need governance and lifecycle policy; shared storage changes the operating model that supports those policies.
Where should AutoMQ fit in the evaluation?
AutoMQ should be evaluated after the team has defined compatibility, replay, cost, governance, and migration requirements. If the workload needs Kafka-compatible clients, object-storage-backed durability, stateless brokers, and a tighter path from streams to lakehouse tables, AutoMQ is a relevant option to test against those requirements.
