Kafka KIP Review Workflow for Storage Architecture Decisions

KIP-1150 puts a familiar Kafka question under pressure: should durable topic data keep living primarily on broker-local disks, or should Kafka move toward an object-storage-backed model where brokers become lighter compute nodes? That is not a cosmetic implementation detail. It changes how platform teams reason about acknowledgments, recovery, network topology, retention, and cloud cost.

The Apache Kafka proposal is marked Accepted, and the page records that detailed implementation work is split into follow-up KIPs such as Diskless Core and Diskless Coordinator. That status signals architectural direction without giving buyers a completed production checklist. A platform owner still has to decide what evidence is needed before a storage architecture change reaches production.

[Figure: KIP review workflow]

Why KIP-1150 Belongs in an Architecture Review

Traditional Kafka gives operators a concrete mental model. A partition leader appends records, followers replicate them, in-sync replicas define the durability boundary, and broker disks hold the active log. The model is operationally expensive in many cloud deployments, especially when retention grows and replication crosses Availability Zones, but the failure semantics are well understood by SREs and application teams.

Diskless Topics challenge that model by making broker disks less central to durable user data. The KIP is explicit that "diskless" does not mean there are no disks anywhere; disks may still be used for cache, staging, metadata, or system needs. The architectural change is narrower and more important: broker-local disks are no longer the main abstraction operators depend on for durable topic data.

That shift is why a KIP review should not start with a pricing spreadsheet. The first question is whether the storage contract preserves the behaviors applications already rely on. Cost comes later because savings are only useful if the platform can still prove ordering, replay, retention, transaction behavior, and recovery after failure.

Separate the KIP From the Decision

A Kafka Improvement Proposal is a design artifact, not a procurement decision. KIP-1150 defines motivation, requirements, and direction for Diskless Topics in Apache Kafka. It also leaves implementation details to follow-up work, which is normal for a change with this much surface area. A common buyer mistake is to treat the accepted direction as if it already answered every production question.

The cleaner workflow is to read the KIP as a category definition, then create a separate decision record for your environment. The KIP tells you what problem the ecosystem is trying to solve. Your decision record should say whether your workloads, controls, and incident process are ready for that kind of storage boundary.

Use four review layers:

  • Kafka semantics: producer acknowledgments, fetch behavior, offsets, consumer groups, transactions, compaction, and admin APIs.
  • Storage durability: WAL behavior, object storage commit path, metadata ordering, cache invalidation, and recovery rules.
  • Cloud economics: block storage, object storage, cross-zone traffic, request costs, cache sizing, and recovery movement.
  • Operations and governance: observability, IAM, encryption, lifecycle policy, deletion, backup, and rollback ownership.

This layered review prevents a common failure mode: arguing about object storage cost while leaving semantic evidence vague. An attractive architecture can still be unready for a compacted changelog topic or a transactional workload.

Step 1: Classify the Workload Before Scoring the Architecture

The same storage design can be a strong fit for one topic class and a poor fit for another. Append-heavy audit streams, long-retention event archives, telemetry topics, compacted state stores, transactional outbox patterns, and Kafka Streams changelogs do not carry the same risk. Treating them as one average Kafka workload hides the exact details that should drive the decision.

Start by classifying topics by behavior rather than team ownership. Retention, write rate, read fan-out, compaction, transaction use, latency sensitivity, and recovery expectations are better indicators than the application name. A payments topic and an observability topic may both be "critical," but they fail in different ways when the storage path changes.

| Workload class | Review emphasis | Why it changes the decision |
| --- | --- | --- |
| Append-heavy ingestion | Acknowledgment path, retention, cold reads, and cross-zone byte movement | Often benefits from shared storage, but still needs replay proof |
| Long-retention audit | Object lifecycle, deletion, encryption, and restore drills | Governance and recovery matter more than median latency |
| Compacted topics | Tombstones, key churn, restore behavior, and cache misses | Compaction exposes edge cases in remote storage and metadata |
| Transactional pipelines | Producer fencing, transaction markers, offset commits, and rollback | Small semantic gaps can break exactly-once workloads |
| Stream-processing state | Changelog restore speed, lag spikes, and failure replay | Recovery time is often the real SLO |

This classification turns the review into a staged adoption plan. A team might approve diskless storage for append-heavy analytics topics while holding compacted and transactional workloads until more evidence exists. That is a disciplined result, not a partial failure.
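As a rough illustration of classifying by behavior rather than ownership, the sketch below maps a topic profile onto the five workload classes from the table. The `TopicProfile` fields, the `LONG_RETENTION_DAYS` threshold, and the precedence order are assumptions made for this example, not something KIP-1150 prescribes; only `cleanup.policy` is a real Kafka topic config.

```python
from dataclasses import dataclass
from enum import Enum


class WorkloadClass(Enum):
    APPEND_HEAVY = "append-heavy ingestion"
    LONG_RETENTION_AUDIT = "long-retention audit"
    COMPACTED = "compacted topics"
    TRANSACTIONAL = "transactional pipelines"
    STREAM_STATE = "stream-processing state"


@dataclass(frozen=True)
class TopicProfile:
    name: str
    cleanup_policy: str      # value of the topic's cleanup.policy config
    retention_days: float
    uses_transactions: bool  # gathered from producer owners, not from topic config
    is_changelog: bool       # for example, a Kafka Streams changelog topic


# Hypothetical threshold; choose one that matches your own retention tiers.
LONG_RETENTION_DAYS = 90


def classify(topic: TopicProfile) -> WorkloadClass:
    # Check the highest semantic risk first, so a topic lands in the
    # strictest class it qualifies for.
    if topic.uses_transactions:
        return WorkloadClass.TRANSACTIONAL
    if topic.is_changelog:
        return WorkloadClass.STREAM_STATE
    if "compact" in topic.cleanup_policy:
        return WorkloadClass.COMPACTED
    if topic.retention_days >= LONG_RETENTION_DAYS:
        return WorkloadClass.LONG_RETENTION_AUDIT
    return WorkloadClass.APPEND_HEAVY


if __name__ == "__main__":
    for t in [
        TopicProfile("payments-outbox", "delete", 7, True, False),
        TopicProfile("clickstream", "delete", 3, False, False),
    ]:
        print(t.name, "->", classify(t).value)
```

The precedence order is a design choice: a transactional topic with long retention is reviewed as transactional, because that is where a semantic gap would hurt most.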

Step 2: Define the Semantic Contract

The semantic contract is the set of behaviors application teams assume without thinking about the storage engine. In Kafka, those assumptions include ordered records within a partition, durable replay by offset, predictable consumer group coordination, ACL and quota behavior, retention enforcement, and correct handling of idempotent or transactional producers. A storage architecture review has to test those assumptions directly.

The producer path deserves the first pass. In classic Kafka, acks=all means the leader acknowledges a write only after the current in-sync replicas have it, so durability is tied to replica replication. In a diskless design, the review must identify which WAL, metadata, and object storage operations complete before the broker acknowledges a write. If a leader fails after acknowledgment, recovery must make the record readable at the same partition and offset the client was given.

The read path needs the same scrutiny. Consumers do not care whether a record is served from memory, local cache, staging storage, or object storage. They care that offsets remain stable, reads are correct, lag is explainable, and cold replay does not turn an incident into a storage archaeology project. That is why the review should include lagging consumers and historical fetches, not only steady-state tail reads.

A useful KIP review asks: "Which user-visible Kafka promises are preserved, and where can we observe the proof during failure?"

That question keeps the review practical. Instead of debating whether the storage design is elegant, the team has to show evidence for each promise it plans to keep.
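One way to turn a promise into evidence is to have a test harness log every acknowledged write, trigger a failure, and then compare what can be read back. The sketch below assumes the harness has already collected both lists; the `Record` shape and the function names are hypothetical, and how the records are produced and fetched is left to whatever client tooling the team already uses.

```python
from typing import Dict, List, NamedTuple, Tuple


class Record(NamedTuple):
    partition: int
    offset: int
    checksum: str  # hash of the payload, computed by the test harness


def missing_or_changed(acked: List[Record], read_back: List[Record]) -> List[Record]:
    """Return acknowledged records that are absent or differ after recovery."""
    seen: Dict[Tuple[int, int], str] = {
        (r.partition, r.offset): r.checksum for r in read_back
    }
    return [r for r in acked if seen.get((r.partition, r.offset)) != r.checksum]


def offsets_are_ordered(read_back: List[Record]) -> bool:
    """Check that offsets strictly increase within each partition, in read order."""
    last: Dict[int, int] = {}
    for r in read_back:
        if r.partition in last and r.offset <= last[r.partition]:
            return False
        last[r.partition] = r.offset
    return True


if __name__ == "__main__":
    acked = [Record(0, 0, "a1"), Record(0, 1, "b2"), Record(1, 0, "c3")]
    after_recovery = [Record(0, 0, "a1"), Record(1, 0, "c3")]
    print("lost or altered:", missing_or_changed(acked, after_recovery))
    print("ordered:", offsets_are_ordered(after_recovery))
```

An empty result from `missing_or_changed` after a broker-loss drill is the kind of observable proof the review question asks for; a non-empty one names the exact offsets to investigate.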

Step 3: Model Cost as Changed Byte Paths

Cost analysis should start after the semantic contract is written, not because cost is secondary, but because the cost model depends on the behavior you are willing to accept. Classic Kafka cost is dominated by broker compute, block storage, replication, retention, and network movement. In cloud environments, cross-AZ replication can become a visible line item because replicated bytes may move through billable paths.

Diskless storage changes the byte paths. Broker disks may become cache or staging space instead of the long-term data home. Object storage may absorb retained data. Recovery may shift from copying partition replicas between brokers to reconstructing ownership from shared storage and metadata. Those changes can be economically powerful, but they also introduce object storage requests, WAL capacity, cache sizing, and storage-service dependency questions.

[Figure: Storage architecture trade-off map]

The review should force every cost claim into a mechanism:

  • Which block storage allocation decreases, and which WAL or cache capacity remains?
  • Which replication or cross-zone path disappears, and which producer or consumer path still crosses zones?
  • Which object storage request types appear under tail reads, catch-up reads, compaction, and retention cleanup?
  • Which recovery operations move data, metadata, or ownership during broker replacement?
  • Which metrics prove the expected cache hit rate and remote-read profile under real load?
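The questions above can be sketched as a small byte-path model in which every cost line is a named path with a volume and a unit rate. All rates and volumes below are placeholders rather than real prices, and the replication formula assumes every follower replica sits in a different zone from its leader; substitute figures from your own billing data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BytePath:
    name: str
    gb_per_month: float
    usd_per_gb: float  # placeholder rate; take the real figure from your bill

    @property
    def monthly_usd(self) -> float:
        return self.gb_per_month * self.usd_per_gb


def replication_cross_zone_gb(ingest_gb: float, replication_factor: int) -> float:
    # Assumes every follower replica is in a different zone from the leader,
    # so each ingested byte crosses a zone boundary (replication_factor - 1) times.
    return ingest_gb * (replication_factor - 1)


def path_deltas(before: List[BytePath], after: List[BytePath]) -> Dict[str, float]:
    """Monthly cost change per named path; negative means the path got cheaper."""
    b = {p.name: p.monthly_usd for p in before}
    a = {p.name: p.monthly_usd for p in after}
    return {n: a.get(n, 0.0) - b.get(n, 0.0) for n in sorted(set(a) | set(b))}


if __name__ == "__main__":
    ingest_gb = 10_000.0  # illustrative monthly ingest
    classic = [
        BytePath("cross-zone replication", replication_cross_zone_gb(ingest_gb, 3), 0.02),
        BytePath("block storage", 30_000.0, 0.08),
    ]
    candidate = [
        BytePath("block storage", 2_000.0, 0.08),    # WAL and cache that remain
        BytePath("object storage", 10_000.0, 0.023),
    ]
    for name, delta in path_deltas(classic, candidate).items():
        print(f"{name}: {delta:+.2f} USD/month")
```

Keeping the paths named makes it obvious when a claimed saving has no mechanism behind it: a path that disappears shows up as a negative delta, and a path that appears (object storage, requests, cache) shows up as a positive one. Request-count costs would need a separate per-request path, which this sketch leaves out.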

This is also where Tiered Storage and Diskless Topics should be separated. Kafka Tiered Storage, introduced through KIP-405, moves older log segments to remote storage while the active log can still depend on broker-local disks. Diskless Topics move the primary durability model closer to shared storage. Both can use object storage, but they do not have the same write path, recovery model, or cost structure.

Step 4: Review Migration and Rollback Before the Pilot

Migration risk is often described as a data transfer problem, but storage-architecture migration is mostly a control problem. The team must control producer cutover, consumer offsets, topic configuration, observability, backout triggers, and the point at which the target storage path becomes the recovery authority. Data movement is only one part of that chain.

Document migration by topic class. For each class, define the pilot scope, success metrics, rollback trigger, and owner. A low-risk pilot might use append-only topics with clear replay requirements and limited transaction or compaction behavior. A later phase might include compacted topics after restore, deletion, and failure behavior are proven.

Strong rollback plans answer uncomfortable questions in advance. If the pilot stops, which cluster is the source of truth? How are offsets reconciled? Are transactional markers or compacted tombstones preserved? Can application owners restart consumers without custom code? If these answers depend on one engineer's memory, the migration is not ready.
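The offset question in particular can be rehearsed with a simple comparison. The sketch below assumes committed offsets have been exported from both clusters into plain dictionaries (the export step is not shown and depends on your admin tooling), and that the migration is expected to preserve offsets one-to-one. If your migration path translates offsets instead, the comparison would need a mapping step.

```python
from typing import Dict, List, Tuple

# (consumer group, topic, partition) -> committed offset
OffsetMap = Dict[Tuple[str, str, int], int]


def offset_mismatches(source: OffsetMap, target: OffsetMap) -> List[str]:
    """List every group/topic/partition whose committed offset differs."""
    problems: List[str] = []
    for key in sorted(set(source) | set(target)):
        src, dst = source.get(key), target.get(key)
        if src is None:
            problems.append(f"{key}: committed on target only ({dst})")
        elif dst is None:
            problems.append(f"{key}: missing on target (source at {src})")
        elif src != dst:
            problems.append(f"{key}: source at {src}, target at {dst}")
    return problems


if __name__ == "__main__":
    source = {("billing", "invoices", 0): 120, ("billing", "invoices", 1): 98}
    target = {("billing", "invoices", 0): 120, ("billing", "invoices", 1): 95}
    for line in offset_mismatches(source, target):
        print(line)
```

Running a check like this during the rehearsal, and attaching its output to the rollback plan, moves the answer out of one engineer's memory and into the record.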

Step 5: Convert the Review Into Production Gates

A storage decision becomes real when it changes runbooks. SREs need to monitor more than broker disk and CPU because the durable path includes WAL pressure, object storage latency, metadata commits, cache behavior, and remote read patterns. Security teams also need to treat buckets, IAM roles, encryption keys, lifecycle rules, and audit logs as part of the streaming platform boundary.

Production gates should be observable, repeatable, and tied to workload class. A gate that says "performance looks good" is not useful. A gate that says "broker loss after acknowledged writes recovers readable offsets within the agreed recovery window for append-heavy topics" is reviewable.

[Figure: Production readiness scorecard]

The scorecard should include:

  • Compatibility gate: existing Kafka clients, ACLs, schemas, admin tooling, quotas, and monitoring work without application rewrites.
  • Failure gate: broker loss, zone impairment, object storage throttling, cache misses, and metadata recovery are tested with acknowledged writes.
  • Cost gate: the changed byte paths are measured with cloud billing dimensions, not only estimated from architecture diagrams.
  • Governance gate: data residency, encryption, retention, deletion, and audit requirements are mapped to storage and metadata controls.
  • Rollback gate: cutover and backout procedures are rehearsed with offsets, consumers, and operational owners named.

These gates are intentionally concrete. They let a platform team say yes to a narrow workload class while saying not yet to workloads with heavier semantic risk.
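A minimal sketch of what "observable and tied to workload class" can look like in practice: each gate result carries a measured value and an agreed limit, and approval is computed per class. The gate names, numbers, and units below are illustrative assumptions, not recommended thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GateResult:
    gate: str
    workload_class: str
    observed: float
    limit: float  # agreed upper bound for this workload class
    unit: str

    @property
    def passed(self) -> bool:
        return self.observed <= self.limit


def blocked_classes(results: List[GateResult]) -> List[str]:
    """Workload classes with at least one failed gate."""
    return sorted({r.workload_class for r in results if not r.passed})


if __name__ == "__main__":
    results = [
        GateResult("broker-loss recovery", "append-heavy ingestion", 240, 300, "seconds"),
        GateResult("changelog restore", "stream-processing state", 900, 600, "seconds"),
    ]
    for r in results:
        status = "pass" if r.passed else "FAIL"
        print(f"{r.workload_class} / {r.gate}: {r.observed} {r.unit} (limit {r.limit}) {status}")
    print("not yet:", blocked_classes(results))
```

Every gate here is expressed as "lower is better" against a single limit, which keeps the sketch short; gates such as cache hit rate would need the comparison reversed.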

How AutoMQ Fits the Review

After the workflow is clear, AutoMQ can be evaluated as a Kafka-compatible cloud-native streaming platform built around Shared Storage architecture. AutoMQ uses S3Stream to store stream data in S3-compatible object storage, with WAL storage and data caching bridging Kafka-like latency expectations and object storage durability. That architecture maps directly to Kafka compatibility, storage durability, network byte paths, recovery behavior, and operational evidence.

The useful way to test AutoMQ is not to assume that a shared-storage diagram answers every question. Put it through the same decision record. Use existing clients, representative topic classes, real retention settings, cross-AZ traffic measurements, failure drills, cold reads, and rollback rehearsals. If the motivation for reviewing KIP-1150 is broker-disk pressure, cross-AZ replication cost, slow recovery, or retention growth, AutoMQ provides a concrete implementation to compare against native Kafka roadmap options and managed Kafka services.

AutoMQ BYOC and AutoMQ Software also matter for governance because the control plane and data plane run in the customer's own environment. That boundary is relevant when security teams ask where Kafka records, object storage, metrics, IAM permissions, and operational control live. The architecture still needs validation, but the review can be anchored in existing cloud account or private environment constraints.

Decision Record Template

The final output should be a decision record that can survive a production incident review. It should not be a one-line approval. The record should describe which workloads are in scope, which semantic guarantees were tested, which cost paths changed, and who owns the remaining risks.

| Decision field | Required evidence |
| --- | --- |
| Workload class | Topic behavior, retention, throughput, compaction, transactions, fan-out, and latency target |
| Architecture option | Classic Kafka, Tiered Storage, Diskless Topics, Kafka-compatible shared storage, or managed service |
| Semantic tests | Producer acknowledgments, fetch by offset, consumer groups, compaction, transactions, retention, and deletion |
| Cost model | Storage allocation, object storage usage, cross-zone traffic, request patterns, cache, and recovery movement |
| Operational gates | Metrics, alerts, failure drills, recovery windows, runbooks, and on-call ownership |
| Governance controls | IAM, encryption, audit logs, lifecycle policy, data residency, and deletion proof |
| Rollback plan | Cutover trigger, backout trigger, offset handling, data reconciliation, and accountable owner |
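The template can also be kept as a structured record so that gaps are detectable before sign-off. The sketch below mirrors the seven fields of the table; the class name, field types, and the rule that an empty field counts as missing evidence are assumptions for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DecisionRecord:
    workload_class: str = ""
    architecture_option: str = ""
    semantic_tests: List[str] = field(default_factory=list)
    cost_model: str = ""
    operational_gates: List[str] = field(default_factory=list)
    governance_controls: List[str] = field(default_factory=list)
    rollback_plan: str = ""

    def missing_evidence(self) -> List[str]:
        """Names of fields that are still empty, in template order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    record = DecisionRecord(
        workload_class="append-heavy ingestion",
        architecture_option="Kafka-compatible shared storage",
        semantic_tests=["acknowledged writes readable after broker loss"],
    )
    print("still missing:", record.missing_evidence())
```

A record with any entry in `missing_evidence()` is closer to a one-line approval than to something that would survive an incident review.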

KIP-1150 makes Kafka's storage future more explicit, but the decision still belongs to the operator. Start with semantics, convert cost into byte paths, test migration before the pilot, and require production gates that SREs can observe. To compare a Kafka-compatible shared-storage implementation with your own workload, try the AutoMQ Cloud deployment path with one well-classified topic class and a written rollback plan.

FAQ

Is KIP-1150 available as a production feature in Apache Kafka?

KIP-1150 is marked Accepted, which records agreement on the direction and requirements for Diskless Topics. That is different from saying every implementation detail, release artifact, migration path, and production operating model is complete. Review the follow-up KIPs and Apache Kafka release notes before treating it as an available feature.

Does diskless Kafka mean brokers have no disks?

No. Diskless means broker disks are not the primary durable storage for user topic data. Brokers may still use disks for application logs, metadata, staging, cache, operating system files, and implementation-specific needs.

How is Diskless Topics different from Tiered Storage?

Tiered Storage moves older log segments to remote storage while the active log can still rely on broker-local storage. Diskless Topics shift the primary durability model for topic data toward shared storage, which changes the review surface for acknowledgments, recovery, and operations.

Which workload should be reviewed first?

Start with a narrow topic class that has clear semantics and measurable costs, such as append-heavy ingestion or long-retention audit streams. Delay compacted, transactional, and stateful stream-processing workloads until the team has stronger evidence for restore, rollback, and failure behavior.

Where does AutoMQ fit in a KIP-1150 evaluation?

AutoMQ is not a substitute for reading the KIP. It is a Kafka-compatible shared-storage implementation that can be tested with the same workflow: semantic evidence first, then cost, migration, governance, and operations. Use representative workloads rather than synthetic averages.
