Kafka on OpenShift: Storage and Architecture Guide

Teams searching for Kafka on OpenShift are usually past the first question. They know Kafka can run on Kubernetes, and they know OpenShift can schedule stateful workloads through operators, StatefulSets, PersistentVolumeClaims, storage classes, and topology controls. The harder question is whether the operating model will still be sane after the first year of production growth.

Kafka is not difficult because a pod can write to a disk. It is difficult because broker-local log directories become part of the availability, capacity, scaling, and incident-response model. OpenShift gives platform teams a strong control plane, but it does not erase Kafka's storage semantics. If every broker still owns local partitions on a PersistentVolume, then broker replacement, partition reassignment, rack awareness, and storage expansion remain Kafka operations, even when OpenShift automates the pod lifecycle.

That is why the useful OpenShift Kafka discussion starts with storage. Operators decide how the cluster is declared and upgraded. StatefulSets provide stable identities and storage attachments. OpenShift Data Foundation, Ceph, or another Container Storage Interface layer decides how volumes and object endpoints are provisioned. Kafka still decides where the durable log lives.

[Figure: Kafka on OpenShift Storage Options]

What OpenShift Changes for Kafka Operations

OpenShift changes the control plane around Kafka more than it changes Kafka itself. Instead of hand-built brokers, the platform provides Kubernetes APIs, role-based access control, admission policies, image management, networking, storage classes, monitoring integrations, and operator lifecycle tooling. For regulated private cloud teams, that is a major advantage because Kafka can be managed through the same platform contracts as other stateful services.

The gain is strongest when responsibilities are explicit:

  • The platform team owns OpenShift clusters, node pools, security policies, ingress, certificates, namespaces, and operator installation.
  • The storage team owns OpenShift Data Foundation, Ceph, external SAN/NAS integrations, storage classes, capacity, replication policy, and storage SLOs.
  • The Kafka team owns broker sizing, topic policy, replication factor, rack awareness, client quotas, upgrades, and application-facing SLOs.
  • Application teams own producer and consumer behavior, traffic bursts, replay patterns, schema evolution, and failure expectations.

This split sounds administrative, but it is architectural. If a producer latency incident crosses Kafka, CSI, Ceph OSDs, RGW gateways, OpenShift networking, and node pressure, the incident cannot be debugged from a single dashboard. Kafka on OpenShift is easier to operate when each team understands which part of the data path it owns.

The mistake is to treat the operator as the architecture. An operator can reconcile resources, roll brokers, manage certificates, and encode a supported lifecycle. It does not automatically choose the right storage backend, isolate failure domains, or make broker-local data disappear. Those decisions still need design work.

Operator Choices and Lifecycle Management

Most OpenShift Kafka deployments are built around an operator. Strimzi is the upstream open source operator widely used for Kafka on Kubernetes, while Red Hat Streams for Apache Kafka provides a Red Hat-supported distribution for OpenShift environments. Both follow the same broad pattern: the user declares Kafka resources, the operator reconciles Kubernetes objects, and the cluster runs as broker pods with stable identities (managed through StatefulSets in older releases and through StrimziPodSets in current Strimzi), together with Services, ConfigMaps, Secrets, and PersistentVolumeClaims.

The operator layer should be evaluated on lifecycle behavior, not only installation convenience. Ask how it handles rolling upgrades, certificate rotation, listener configuration, broker configuration drift, storage changes, node pool separation, Kafka version transitions, metrics integration, and recovery from partial failures. A good operator reduces manual toil. It should also make the stateful parts more visible, not hide them behind a green status field.

For production clusters, the operating model should include at least four checks:

| Area | Architecture question | Why it matters |
|---|---|---|
| Reconciliation | What does the operator change automatically, and what requires human approval? | Kafka changes can affect client availability and partition movement. |
| Upgrade path | Are Kafka, operator, OpenShift, and storage upgrades sequenced and tested together? | Independent platform upgrades can still disturb the data path. |
| Security | How are TLS, SASL, ACLs, certificates, and secret rotation handled? | OpenShift policy is not a substitute for Kafka authentication and authorization. |
| Observability | Can teams correlate broker, pod, volume, node, and storage-layer metrics? | Storage symptoms often surface first as Kafka lag or produce latency. |

The operator is therefore a governance tool as much as a deployment tool. It should encode the desired Kafka shape, but the platform still needs a review process for changes that alter storage, topology, listeners, or broker identity.
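One way to make that review process concrete is a small gate that compares the currently declared Kafka shape with a proposed one and flags only the changes that touch sensitive areas. The sketch below is illustrative: the spec layout, the `REVIEW_KEYS` set, the `ceph-rbd` class name, and the function names are assumptions for this example, not part of Strimzi or any operator API.

```python
# Hypothetical keys whose changes should go through human review.
# They mirror the areas named above: storage, topology, listeners, broker identity.
REVIEW_KEYS = {"storage", "rack", "listeners", "replicas"}

_MISSING = object()


def flatten(spec, prefix=()):
    """Flatten nested dicts into {("a", "b"): value} pairs; lists stay leaf values."""
    flat = {}
    for key, value in spec.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def changes_needing_review(old_spec, new_spec):
    """Return dotted paths that changed and pass through a review-gated key."""
    old_flat, new_flat = flatten(old_spec), flatten(new_spec)
    flagged = []
    for path in sorted(old_flat.keys() | new_flat.keys()):
        changed = old_flat.get(path, _MISSING) != new_flat.get(path, _MISSING)
        if changed and REVIEW_KEYS.intersection(path):
            flagged.append(".".join(path))
    return flagged


# Illustrative specs: a volume resize plus a retention tweak.
current = {
    "replicas": 3,
    "storage": {"class": "ceph-rbd", "size": "500Gi"},
    "config": {"log.retention.hours": 168},
}
proposed = {
    "replicas": 3,
    "storage": {"class": "ceph-rbd", "size": "1Ti"},
    "config": {"log.retention.hours": 72},
}

print(changes_needing_review(current, proposed))  # ['storage.size']
```

The retention change passes through without review in this sketch, while the volume resize is flagged. Which keys belong in the gated set is a policy decision for the platform and Kafka teams.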

Persistent Storage Design

Kafka on OpenShift commonly starts with StatefulSets and PersistentVolumeClaims. Kubernetes StatefulSets provide stable pod identities and stable storage associations, while PersistentVolumes and StorageClasses allow the platform to bind broker pods to durable storage. This model maps cleanly to conventional Kafka because each broker has one or more log directories that remain attached to its identity.

That fit is also the constraint. StatefulSet storage makes Kafka portable across OpenShift nodes, but it preserves the broker-local log model. A broker restart can reattach the same volume. A broker replacement may recover its identity. Yet partition data is still tied to broker ownership, and scaling out still requires Kafka to move replicas or elect leaders according to its own rules.

For private cloud architects, storage design usually splits into three patterns.

Block Storage for Broker Logs

The most familiar pattern is block storage for broker logs. Each Kafka broker receives a persistent block volume, often through Ceph RBD via OpenShift Data Foundation or another CSI-backed storage class. Kafka writes local log segments to the mounted filesystem. Replication is still handled by Kafka's own in-sync replica (ISR) mechanism, independently of any replication the storage layer performs underneath. Consumers and producers see ordinary Kafka behavior.

This approach is attractive because it keeps the Kafka mental model intact. Existing tools, runbooks, topic configuration, and client assumptions continue to apply. OpenShift can reschedule pods, the CSI layer can provision volumes, and storage administrators can enforce platform-wide policies for encryption, replication, capacity, and access.

The tradeoff is that the broker is still stateful in the Kafka sense. A network-backed block device is not the same thing as shared Kafka storage. Rebalancing partitions still copies data through Kafka. Scaling storage still tends to be bound to broker volume sizing. A degraded storage pool can appear as broker latency, ISR churn, or controller instability.
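The cost of copying data through Kafka can be estimated with simple arithmetic before a scale-out is approved. The sketch below is a rough model under stated assumptions: replica data is evenly spread before and after the change, and the copy runs at a constant aggregate rate. The function names and the example figures (12 TiB, 6 to 8 brokers, 100 MiB/s) are hypothetical.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def bytes_to_move_on_scale_out(total_replica_bytes, old_brokers, new_brokers):
    """Bytes that must land on the added brokers to even out replica data.

    Assumes data is evenly spread across brokers before and after the change.
    """
    if not 0 < old_brokers < new_brokers:
        raise ValueError("expected 0 < old_brokers < new_brokers")
    return total_replica_bytes * (new_brokers - old_brokers) / new_brokers


def copy_hours(bytes_to_move, aggregate_bytes_per_sec):
    """Hours needed to copy the data at a constant aggregate rate."""
    if aggregate_bytes_per_sec <= 0:
        raise ValueError("copy rate must be positive")
    return bytes_to_move / aggregate_bytes_per_sec / 3600


# Illustrative numbers only: 12 TiB of replica data, growing from 6 to 8 brokers.
moved = bytes_to_move_on_scale_out(12 * TIB, old_brokers=6, new_brokers=8)
hours = copy_hours(moved, 100 * MIB)
print(f"{moved / TIB:.1f} TiB to copy, about {hours:.1f} hours")
```

Real reassignments compete with producer and consumer traffic on the same volumes and network, so a figure like this is better read as a lower bound than as a forecast.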

This pattern deserves workload-shaped testing. Measure producer latency, follower fetch lag, consumer catch-up, broker restart time, and OpenShift node drain behavior while the storage system is under expected background activity. If the storage layer performs well only when idle, the production architecture is not ready.
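A simple way to express "performs well only when idle" is to compare tail latency from two test runs: one against an idle storage layer and one while background activity such as recovery or scrubbing is running. The sketch below uses a nearest-rank percentile. The sample values and the `max_ratio` threshold are made-up assumptions, not recommended limits.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile of latency samples (pct is an integer, 1 to 100)."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_degradation(idle_ms, loaded_ms, pct=99):
    """Ratio of loaded to idle latency at the chosen percentile."""
    return percentile(loaded_ms, pct) / percentile(idle_ms, pct)


def storage_ready(idle_ms, loaded_ms, max_ratio=2.0, pct=99):
    """True when tail latency under background load stays within max_ratio."""
    return tail_degradation(idle_ms, loaded_ms, pct) <= max_ratio


# Hypothetical produce latencies in milliseconds from two test runs.
idle_run = [4, 5, 5, 6, 7, 5, 6, 4, 5, 9]
loaded_run = [5, 6, 7, 30, 8, 6, 7, 5, 6, 45]

print(tail_degradation(idle_run, loaded_run))  # 5.0
print(storage_ready(idle_run, loaded_run))     # False
```

The same comparison applies to follower fetch lag, consumer catch-up time, and broker restart time. Only the input samples change.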

Object Storage for Tiering or Shared Storage

Ceph and OpenShift Data Foundation can also expose object storage through S3-compatible APIs. That matters for Kafka, but the architectural meaning depends on how the object store is used.

In Apache Kafka tiered storage, object storage is typically a remote tier for completed log segments while brokers keep a local hot tier. This can reduce pressure from long retention and replay workloads, but the active broker log remains important. The object store extends the storage hierarchy; it does not by itself make brokers stateless.
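The effect on broker disk sizing can be sketched with a back-of-the-envelope model. The assumptions are deliberately simple and should be checked against the actual deployment: constant ingest, no compression or compaction, broker disks hold only the local retention window multiplied by the replication factor, and the remote tier holds one copy of the full retention window (the object store's own redundancy is not counted). The function name and the example numbers are hypothetical.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4
HOUR = 3600
DAY = 24 * HOUR


def storage_footprint(ingest_per_sec, replication_factor, retention_sec,
                      local_retention_sec=None):
    """Estimate steady-state bytes on broker disks and in the remote tier.

    local_retention_sec=None models a cluster without tiered storage.
    """
    if local_retention_sec is None:
        # No tiering: every replica keeps the full retention window locally.
        local = ingest_per_sec * replication_factor * retention_sec
        return {"broker_local": local, "remote": 0}
    if local_retention_sec > retention_sec:
        raise ValueError("local retention cannot exceed total retention")
    local = ingest_per_sec * replication_factor * local_retention_sec
    remote = ingest_per_sec * retention_sec
    return {"broker_local": local, "remote": remote}


# Illustrative workload: 50 MiB/s ingest, replication factor 3, 7 days retention.
untiered = storage_footprint(50 * MIB, 3, 7 * DAY)
tiered = storage_footprint(50 * MIB, 3, 7 * DAY, local_retention_sec=6 * HOUR)

for name, result in (("untiered", untiered), ("tiered", tiered)):
    print(name, {k: round(v / TIB, 1) for k, v in result.items()})
```

Even in this simplified model the broker-local figure never reaches zero, which is the point made above: tiering shrinks the hot tier but does not make brokers stateless.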

In a shared-storage Kafka architecture, object storage has a deeper role. Durable log data is designed to live outside broker-local disks, and brokers act more like replaceable compute nodes that serve protocol requests, cache hot data, and coordinate ownership. AutoMQ fits this category: it is Kafka-compatible at the client and ecosystem layer, while its architecture separates compute from storage and uses S3-compatible object storage as the durable substrate.

The distinction is critical for OpenShift. Ceph block storage for broker logs, Ceph RGW for tiered storage, and Ceph RGW as the backend for a Kafka-compatible shared-storage system are three different designs. They may all involve Ceph, but they do not have the same recovery, scaling, or ownership model.

Zone and Rack Awareness

Kafka placement and OpenShift placement need to agree about failure domains. Kafka can use broker rack awareness so replicas are distributed across logical racks or zones. OpenShift can schedule pods across labeled nodes, zones, or worker pools. Ceph has its own CRUSH hierarchy and placement rules for replicas across hosts, racks, or other failure domains.

These layers do not automatically compose. Three Kafka replicas on three OpenShift nodes can still depend on the same storage network, the same Ceph failure domain, or the same ingress path. A Ceph pool can keep data durable while degraded recovery still increases latency enough to affect Kafka. A pod topology spread constraint can distribute brokers while their volumes depend on a concentrated storage backend.
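A dependency check along these lines can be sketched in a few lines of Python. For each partition, it looks at every layer its replica brokers depend on and reports any layer where all replicas resolve to the same value. The layer names and the inventory below are hypothetical, and describing a broker volume by a single Ceph failure domain is a simplification for illustration.

```python
def shared_dependencies(replica_brokers, broker_deps):
    """Return {layer: value} for layers where every replica depends on one value."""
    layers = set()
    for broker in replica_brokers:
        layers.update(broker_deps[broker])
    shared = {}
    for layer in sorted(layers):
        values = {broker_deps[broker].get(layer) for broker in replica_brokers}
        if len(values) == 1:
            shared[layer] = values.pop()
    return shared


# Hypothetical inventory: three brokers on three nodes in three zones,
# but all volumes behind one storage network and one Ceph failure domain.
broker_deps = {
    0: {"node": "worker-a", "zone": "z1",
        "ceph_failure_domain": "rack-1", "storage_network": "net-1"},
    1: {"node": "worker-b", "zone": "z2",
        "ceph_failure_domain": "rack-1", "storage_network": "net-1"},
    2: {"node": "worker-c", "zone": "z3",
        "ceph_failure_domain": "rack-1", "storage_network": "net-1"},
}

print(shared_dependencies([0, 1, 2], broker_deps))
# {'ceph_failure_domain': 'rack-1', 'storage_network': 'net-1'}
```

In this example the replicas look well spread at the node and zone level, yet two layers are common to all of them. Collecting an accurate inventory is the hard part and usually needs input from the platform, storage, and Kafka teams.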

The review should map the dependency graph:

  • Which OpenShift nodes can run brokers, ZooKeeper (if still used), KRaft controllers, entity operators, and monitoring components?
  • Which StorageClass backs each broker volume, and what physical failure domains does it depend on?
  • How are Kafka rack labels derived, and are they aligned with OpenShift node labels and storage topology?
  • Are object gateways, load balancers, and object storage endpoints distributed separately from broker compute?
  • What happens during a node drain, storage pool recovery, zone outage, certificate rotation, or operator upgrade?
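The rack-label question in the list above lends itself to an automated check. The sketch below assumes three plain mappings have already been collected by some other means: each broker's Kafka rack (the `broker.rack` setting), the node each broker pod runs on, and each node's zone label (for example `topology.kubernetes.io/zone`). The function name and the sample data are hypothetical.

```python
def rack_mismatches(broker_racks, broker_nodes, node_zones):
    """Return {broker: (kafka_rack, node_zone)} where the two disagree.

    A broker with no known node or zone is reported with zone None.
    """
    mismatches = {}
    for broker, rack in sorted(broker_racks.items()):
        node = broker_nodes.get(broker)
        zone = node_zones.get(node)
        if rack != zone:
            mismatches[broker] = (rack, zone)
    return mismatches


# Hypothetical snapshot: broker 2 claims zone-c but runs on a zone-b node.
broker_racks = {0: "zone-a", 1: "zone-b", 2: "zone-c"}
broker_nodes = {0: "worker-1", 1: "worker-2", 2: "worker-3"}
node_zones = {"worker-1": "zone-a", "worker-2": "zone-b", "worker-3": "zone-b"}

print(rack_mismatches(broker_racks, broker_nodes, node_zones))
# {2: ('zone-c', 'zone-b')}
```

A mismatch like this means Kafka believes its replicas span three zones while two of them actually share one, so a single zone outage could take out more replicas than the replication factor was chosen to tolerate.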

Reliable Kafka on OpenShift is not achieved by counting replicas in each layer independently. It comes from making sure the replicas do not share the same hidden dependency.

[Figure: OpenShift Kafka Responsibility Boundary]

How AutoMQ Fits Private Cloud Kafka on OpenShift

When the goal is to run conventional Kafka on OpenShift, an operator plus persistent block storage can be a pragmatic path. The architecture is familiar, supported by Kubernetes primitives, and compatible with existing Kafka operations. Many teams should start there when their main objective is standardization rather than a change in storage semantics.

When the goal is storage-compute separation, the evaluation changes. The question is no longer "which StorageClass should broker log directories use?" It becomes "where should the authoritative Kafka log live, and how replaceable should broker compute be?" That is where a Kafka-compatible shared-storage architecture becomes relevant.

AutoMQ can be evaluated as an option for private cloud Kafka on OpenShift when the platform already has, or plans to provide, an S3-compatible object storage layer. In that model, OpenShift still provides orchestration, isolation, policy, and lifecycle controls. Object storage provides durable shared storage. Brokers become less tied to long-lived local PVs, which can reduce the operational weight of broker replacement, scaling, and data movement.

This is not a claim that every OpenShift Kafka deployment should use shared storage. The right choice depends on workload and organization:

  • Use operator-managed Apache Kafka with block PVs when compatibility with conventional Kafka operations and minimum architectural change are the priorities.
  • Use tiered storage when the main problem is long retention or historical replay pressure while keeping local brokers for hot data.
  • Evaluate AutoMQ or another Kafka-compatible shared-storage design when broker elasticity, faster recovery, private cloud object storage, and separation of compute from durable data are strategic requirements.

The evaluation should be concrete. Test the object endpoint under Kafka-shaped write and read patterns. Review TLS, authentication, bucket policy, endpoint availability, monitoring, and network isolation. Decide whether the platform team, the storage team, or the Kafka team owns the object storage dependency during incidents. A shared-storage architecture can simplify broker operations, but it raises the importance of the object storage SLO.

Architecture Review Checklist

The safest OpenShift Kafka reviews start from failure and growth, not from installation steps. A cluster that installs cleanly can still be expensive to operate if every scaling event triggers partition movement, every storage issue becomes a Kafka incident, and every upgrade crosses unclear ownership lines.

Use this checklist before production approval:

[Figure: OpenShift Kafka Architecture Review Checklist]

| Review item | Decision to record |
|---|---|
| Operator scope | Which Kafka lifecycle actions are automated, gated, or manual? |
| Storage model | Are broker logs on block PVs, tiered to object storage, or externalized through shared storage? |
| Failure domains | Are OpenShift zones, Kafka racks, Ceph placement, gateways, and networks aligned? |
| Scaling path | Does growth require broker disk expansion, partition reassignment, object storage capacity, or compute-only scaling? |
| Observability | Can teams correlate Kafka, OpenShift, CSI, Ceph or ODF, object gateway, and client metrics? |
| Incident ownership | Who owns the bridge between Kafka symptoms and platform or storage root causes? |

The central decision is durable log ownership. If durable Kafka data remains broker-local, design OpenShift storage and placement around stateful broker identity. If durable data moves into shared object storage through a Kafka-compatible architecture, design the object layer as production streaming infrastructure, not as a passive backup target.

OpenShift is a strong platform for standardizing Kafka operations, but it rewards precise architecture. Operators manage lifecycle. StatefulSets preserve identity. PersistentVolumes provide durable attachments. ODF and Ceph provide block and object storage primitives. Kafka architecture decides whether brokers are long-lived storage owners or replaceable compute. That decision should be made deliberately, before production traffic makes it expensive to change.

FAQ

Can Kafka run on OpenShift?

Yes. Kafka can run on OpenShift through operators such as Strimzi or Red Hat Streams for Apache Kafka, usually with StatefulSets and PersistentVolumeClaims. The important design work is not basic deployment, but the long-term storage, topology, upgrade, observability, and incident model.

Should Kafka broker logs use Ceph block storage on OpenShift?

Ceph block storage can be a reasonable backend for broker PersistentVolumes when the storage and network layers are designed and tested for Kafka's latency, throughput, and recovery behavior. Treat this as conventional stateful Kafka on networked block storage. It does not remove Kafka's broker-local log ownership model.

Is OpenShift Data Foundation object storage the same as shared-storage Kafka?

No. Object storage is a substrate. Apache Kafka tiered storage may use object storage as a remote tier while brokers keep local hot logs. Shared-storage Kafka requires a Kafka-compatible storage architecture that externalizes durable log ownership, coordinates access, and makes brokers more replaceable.

Where does AutoMQ fit in an OpenShift private cloud architecture?

AutoMQ fits when a team wants Kafka protocol compatibility plus storage-compute separation on S3-compatible object storage. In OpenShift private clouds, it is most relevant when broker-local PV operations, partition movement, recovery time, or compute-storage coupling have become strategic constraints. The object storage layer should be tested and operated as part of the streaming platform SLO.
