Kafka for Manufacturing & IoT | Edge to Cloud Streaming

Factory data has a boundary problem. The signal starts beside a physical process: a motor current spike, a robot fault, a quality measurement, a temperature drift, a changeover event, a shipment scan. The first system that sees it is often a PLC, SCADA system, historian, gateway, or MES layer. The teams that need the same signal later may sit in operations, quality, planning, maintenance, customer support, finance, or data science.

Kafka fits this boundary because it turns factory events into durable streams that multiple consumers can read without each team building its own polling path into OT systems. But manufacturing is not one uniform Kafka workload. A plant-floor alert and a multi-day replay of production history both look like "streaming" from a distance, but their latency, retention, security, and cost profiles are very different.

That difference is where many industrial Kafka architectures get expensive. Teams place Kafka close to machines to reduce latency, then stretch the same pattern into cloud analytics where retained data, fan-out, and replay dominate cost. A better design treats Kafka as an edge-to-cloud fabric: deterministic processing stays close to the plant, while high-volume retained streams move into a Kafka-compatible cloud layer whose storage economics are built around object storage.

[Figure: OT/IT data flow map]

Manufacturing Needs Event Streaming, Not Another Point-to-Point Mesh

Industrial integration has always had layers. ISA-95, also known as IEC 62264, frames enterprise-control integration as an interface between manufacturing control systems and business systems, with Level 3 covering manufacturing operations and Level 4 covering business planning and logistics. That model still matters because it reminds architects that a machine signal is not automatically ready for ERP, analytics, or AI workflows.

The data has to cross semantic, security, and ownership boundaries. A vibration sensor may be meaningful to maintenance, but only after it is mapped to an asset hierarchy. A quality measurement may matter to traceability, but only when it is tied to a work order, batch, recipe, or serial number. A line-stop event may be urgent to plant operations and also valuable months later when process engineers analyze downtime patterns.

Kafka is useful in manufacturing because it gives those events a shared transport and retention model. Instead of forwarding the same data through brittle chains of custom integrations, teams can publish normalized events once and let different consumers read at their own pace:

  • Operations consumers need alerts, state changes, and production signals while the process is still active. They care about latency, ordering, and isolation from cloud outages.
  • Quality and traceability systems need durable event history connected to product, batch, recipe, and equipment context. They care about replay and completeness.
  • Maintenance and reliability teams need time-series signals, fault codes, and work-order context. They care about joining live data with historical patterns.
  • Planning and analytics teams need aggregated production, inventory, and throughput signals across plants. They care about consistent schemas and scalable cloud retention.
  • AI and optimization teams need features, labels, and feedback loops. They care about repeatable replay because model development rarely reads a stream only once.
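To make "publish normalized events once" concrete, here is a minimal sketch in Python. The `MachineEvent` fields, the topic naming scheme, and the `to_record` helper are hypothetical and only illustrate the idea; a real deployment would follow the site's own asset model and schema tooling. Keying by asset ID relies on Kafka's default partitioner sending records with the same key to the same partition, which keeps per-asset ordering as long as the partition count does not change.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MachineEvent:
    """Hypothetical normalized event; field names are illustrative."""
    site: str
    line: str
    asset_id: str
    event_type: str            # e.g. "vibration", "line_stop"
    ts_ms: int                 # event time in epoch milliseconds
    value: float
    work_order: Optional[str] = None


def to_record(event: MachineEvent):
    """Return (topic, key, value) ready to hand to any Kafka producer client."""
    topic = f"plant.{event.site}.{event.event_type}"   # assumed naming scheme
    key = event.asset_id.encode("utf-8")               # same asset -> same key
    value = json.dumps(asdict(event), sort_keys=True).encode("utf-8")
    return topic, key, value


event = MachineEvent("plant-a", "line-3", "press-07", "vibration",
                     1_700_000_000_000, 4.2, work_order="WO-1001")
topic, key, value = to_record(event)
print(topic, key)
```

Operations, quality, and analytics consumers can then read the same record from their own consumer groups instead of each polling the source system.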

The mistake is forcing every one of these consumers through the same plant-local or cloud-central architecture. OT systems have safety and availability constraints that data lake teams do not have. Cloud analytics has storage and replay pressure that a local SCADA alarm stream should not carry.

Edge Latency and Cloud Retention Pull Kafka in Opposite Directions

NIST SP 800-82 Rev. 3 defines OT broadly as programmable systems and devices that interact with the physical environment, and emphasizes that OT security work must respect performance, reliability, and safety requirements. That is the core difference between industrial streaming and ordinary SaaS event ingestion. A plant network is not only a source of data. It is part of the operating environment.

This does not mean Kafka should run inside every machine cell or replace industrial protocols. OPC UA PubSub, MQTT, AMQP, fieldbus protocols, historians, and vendor-specific interfaces all have their place. OPC Foundation describes OPC UA PubSub as a way to extend OPC UA communication across transports such as MQTT and AMQP while preserving OPC UA's security and standardized information-modeling strengths. Kafka usually sits above or beside these protocols as the durable event backbone, not underneath every control loop.

The placement decision should start with a workload matrix, not a vendor preference.

Edge vs cloud workload matrix

| Dimension | Edge or plant layer | Regional or cloud layer |
| --- | --- | --- |
| Latency | Used when actions must stay close to equipment, operators, or MES workflows. | Used when millisecond-level streaming is enough and durability, fan-out, or replay is more important. |
| Retention | Usually short because local infrastructure is constrained and operational data is narrowed to active use. | Often longer because analytics, traceability, troubleshooting, and AI workflows read historical streams repeatedly. |
| Connectivity | Must tolerate WAN disruption and plant network segmentation. | Can assume cloud networking, managed identity, object storage, and broader observability. |
| Ownership | Shared with OT, plant engineering, automation vendors, and site reliability teams. | Usually owned by platform, cloud, data, or enterprise architecture teams. |
| Cost driver | Edge hardware, operational support, and site-by-site deployment complexity. | Broker storage, cross-zone movement, retention, replay, and peak capacity. |

Traditional Kafka can run in either place, but the same broker-attached storage model carries different trade-offs. On the plant side, local disks may be acceptable because retention is short and the cluster is scoped to a factory or line. On the cloud side, the retained stream can become a large, shared, replayable system of record. Apache Kafka's replication model stores partition replicas across brokers, with one leader and one or more followers for each replicated partition. That model is reliable, but it also couples durable storage, broker compute, and replication traffic.

Cloud manufacturing workloads hit that coupling hard. Quality records, telemetry, equipment state, and traceability events may need to be retained for days or longer. Multiple teams read the same topics for dashboards, production intelligence, predictive maintenance, data lake ingestion, and model features. Peak production windows can force capacity planning around the busiest hours even when average traffic is lower.
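The coupling between retention and replicated broker storage is easy to put in numbers. The sketch below is plain arithmetic, not a sizing tool: the 1 GiB/s ingress, three-day retention, and replication factor of 3 are illustrative assumptions, and the calculation ignores compression, indexes, and disk headroom.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def retained_bytes(ingress_bytes_per_s, retention_days, copies=1):
    """Bytes held at steady state: ingress rate x retention window x copies."""
    seconds = retention_days * 24 * 60 * 60
    return ingress_bytes_per_s * seconds * copies


# Assumed workload: 1 GiB/s sustained ingress, 3 days of retention.
one_copy = retained_bytes(1 * GIB, 3)
three_replicas = retained_bytes(1 * GIB, 3, copies=3)  # replication factor 3

print(one_copy / TIB)        # 253.125
print(three_replicas / TIB)  # 759.375
```

With broker-attached storage, every extra day of retention grows the replicated figure, and that capacity has to be provisioned on broker volumes whether or not anyone replays the data.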

A Reference Architecture for OT/IT Streaming

A practical architecture starts by accepting that "edge to cloud" is a set of layers, not a single pipe. The plant layer collects and normalizes data close to machines. The site or regional layer buffers, enriches, and routes events across local systems. The cloud layer handles enterprise-scale retention, analytics fan-out, and replay. Kafka can appear in more than one layer, but each layer should have a clear reason to exist.

[Figure: Edge-to-cloud Kafka architecture]

At the plant edge, Kafka should be evaluated against existing OT patterns. Some sites only need protocol gateways and a local historian feeding a regional Kafka cluster. Other sites benefit from a plant-local Kafka cluster because multiple local consumers need the same event stream. The key is to keep the blast radius small: a plant outage should not corrupt cloud history, and a cloud outage should not stop a safety-adjacent local process.
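One common way to keep that blast radius small is a store-and-forward buffer at the plant edge. The sketch below is a simplified, in-memory illustration; the class name, the drop-oldest policy, and the `send` callback are assumptions made for the example. A production gateway would normally persist the buffer to disk and make the overflow policy an explicit decision per stream.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Bounded FIFO that holds events while the uplink is unavailable."""

    def __init__(self, max_events):
        self._queue = deque()
        self._max = max_events
        self.dropped = 0  # count of events discarded on overflow

    def append(self, event):
        # Assumed policy: when full, drop the oldest event to admit the newest.
        if len(self._queue) >= self._max:
            self._queue.popleft()
            self.dropped += 1
        self._queue.append(event)

    def flush(self, send):
        """Forward events in order; stop at the first failed send.

        `send(event)` is a caller-supplied function returning True on success.
        An event leaves the buffer only after a successful send.
        """
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break
            self._queue.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._queue)


buf = StoreAndForwardBuffer(max_events=1000)
buf.append({"asset_id": "press-07", "state": "stopped"})
buf.flush(lambda event: False)   # uplink down: event stays buffered
print(len(buf))                  # 1
```

Local consumers keep working from the plant-side stream while the buffer absorbs the outage, and the cloud side receives events in their original order once the link returns.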

In the cloud layer, the architecture changes. The problem is no longer how to shave every local millisecond from a machine event. The problem is how to retain and serve large industrial event streams without building a stateful Kafka estate that becomes expensive to expand, rebalance, or operate. This is where diskless Kafka becomes relevant.

Diskless Kafka is not a claim that storage disappears. It means Kafka-compatible brokers no longer hold the durable log primarily on broker-attached disks. Durable data moves to cloud object storage, while brokers act more like elastic serving and coordination compute. In AutoMQ, the storage layer is implemented through S3Stream, with data written to WAL storage and uploaded to object storage in near real time. AutoMQ's architecture documentation describes broker nodes as stateless after replacing Kafka's native log storage with S3Stream.

That distinction matters for manufacturing because the most expensive cloud-side streams are often not the most latency-sensitive ones. A predictive-maintenance feature stream, a batch traceability stream, or a fleet of machine telemetry topics may need Kafka compatibility, multi-consumer fan-out, and replay. They do not necessarily need every retained byte to live on replicated broker volumes.

Where Diskless Kafka Helps Manufacturing Teams

Diskless architecture helps when the workload is dominated by retained data, replay, bursty load, and cloud networking rather than deterministic control. That is a narrow claim, but it is the right claim. No serious architect should route a hard real-time safety loop through a cloud Kafka cluster because an article said "real-time." In manufacturing, real-time has layers.

The strongest cloud-side fits usually look like this:

  • Traceability streams that connect product, batch, material, recipe, equipment, and inspection events. These streams are read by quality systems, compliance workflows, and analytics teams long after the production step ends.
  • Predictive maintenance streams that combine telemetry, fault codes, work-order events, and historical labels. These workloads benefit from replay because feature definitions evolve.
  • Production intelligence streams that feed OEE, throughput, scrap, downtime, and energy dashboards across sites. These often fan out to several consumers with different aggregation windows.
  • Connected product or fleet streams where manufacturing, after-sales, and customer-facing systems need a durable event backbone outside the factory walls.
  • Data lake ingestion streams where Kafka is the controlled handoff between OT-normalized events and open table formats or warehouse ingestion.

AutoMQ fits this zone because it keeps the Kafka protocol surface while changing the storage economics behind it. Its public documentation describes a shared-storage architecture with WAL storage plus object storage, and its benchmark page reports a 1 GiB/s, three-day-retention, multi-AZ comparison where AutoMQ had much lower monthly infrastructure cost than Apache Kafka under the benchmark assumptions. That number should not be pasted into a manufacturing business case as a universal result. The useful point is the mechanism: separating broker compute from durable object storage reduces the pressure to size every broker around long-lived local log segments.

The customer evidence is also relevant, though it should be used carefully. AutoMQ's Geely case study says Geely's connected vehicle platform supports more than 10 million vehicles and uses AutoMQ as a streaming backbone in a hybrid-cloud setting. Connected vehicles are not the same as plant-floor control, but they are a credible industrial-scale event-streaming reference: high-volume telemetry, retained history, cloud integration, and operational reliability all matter. That is closer to the cloud-side manufacturing pattern than a small demo cluster.

AutoMQ is not a replacement for MES, SCADA, historians, OPC UA, or plant data models. It is a Kafka-compatible streaming layer for the points where teams need durable fan-out, cloud-native retention, elastic operations, and compatibility with the Kafka ecosystem. That makes it most natural in the regional and cloud layers, and selectively useful at the plant layer when the site has enough Kafka operational maturity.

Implementation Checklist: Build the Boundary Before the Cluster

The hardest part of industrial Kafka architecture is not creating topics. It is deciding which boundary each topic crosses. A clean design starts with the physical process and works outward, because the first mistake in OT/IT integration is pretending that all factory data has the same risk profile.

Start with a workload inventory. For each stream, write down the source system, asset context, required ordering, local action requirement, retention window, replay expectation, consumer list, and data owner. If nobody owns the semantics, the stream will become a faster version of the same old integration mess.
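The inventory can live in a spreadsheet, but a typed record makes gaps easy to detect. This is a minimal sketch; the `StreamInventoryEntry` fields mirror the attributes listed above, while the class name, field names, and the `inventory_gaps` check are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StreamInventoryEntry:
    """One row of the workload inventory (illustrative field names)."""
    name: str
    source_system: str          # e.g. PLC gateway, historian, MES
    asset_context: str          # where the stream sits in the asset hierarchy
    ordering_key: str           # what ordering must be preserved by
    needs_local_action: bool    # must something react inside the plant?
    retention_days: int
    replay_expected: bool
    consumers: List[str] = field(default_factory=list)
    data_owner: str = ""


def inventory_gaps(entries):
    """Return {stream name: [problems]} for entries missing an owner or consumers."""
    gaps = {}
    for entry in entries:
        problems = []
        if not entry.data_owner.strip():
            problems.append("no data owner")
        if not entry.consumers:
            problems.append("no consumers listed")
        if problems:
            gaps[entry.name] = problems
    return gaps


entries = [
    StreamInventoryEntry("line3.vibration", "PLC gateway", "plant-a/line-3/press-07",
                         "asset_id", False, 90, True,
                         consumers=["maintenance"], data_owner="reliability-team"),
    StreamInventoryEntry("line3.scrap", "MES", "plant-a/line-3",
                         "work_order", False, 365, True),
]
print(inventory_gaps(entries))
```

Running a check like this before any topic is created turns "nobody owns the semantics" into a visible list of streams to fix.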

Then assign placement:

  1. Machine and control layer: Keep deterministic control paths in the OT domain. Use Kafka only when the event is meant to leave the control loop and serve multiple consumers.
  2. Plant edge layer: Normalize protocols, enrich with asset context, buffer through WAN disruptions, and serve local applications that need shared events.
  3. Regional layer: Aggregate sites, enforce schemas, route events across plants, and isolate cloud dependencies from factory operations.
  4. Cloud streaming layer: Use Kafka-compatible storage for durable fan-out, replay, analytics, AI features, and data lake integration.
  5. Object storage and lake layer: Treat long-retention industrial history as a storage problem as much as a streaming problem.
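The placement steps above can be sketched as a small rule function. The rules, the layer names, and the 30-day `lake_after_days` threshold are illustrative assumptions, not a standard; the point is that placement follows from stream attributes rather than from where a cluster happens to exist.

```python
def assign_layers(in_control_loop, needs_local_action, cross_site,
                  replay_expected, retention_days, lake_after_days=30):
    """Return the layers a stream should appear in, ordered from plant to cloud."""
    # Deterministic control paths never leave the OT domain.
    if in_control_loop:
        return ["machine_and_control"]

    layers = []
    if needs_local_action:
        layers.append("plant_edge")
    if cross_site:
        layers.append("regional")
    if replay_expected or cross_site:
        layers.append("cloud_streaming")
    if retention_days >= lake_after_days:   # assumed threshold
        layers.append("object_storage_and_lake")

    # Default: a stream with no wider requirement stays at the plant edge.
    return layers or ["plant_edge"]


# A line-stop alert that operators act on locally, with short retention.
print(assign_layers(False, True, False, False, 1))
# Cross-plant telemetry that is replayed for model development.
print(assign_layers(False, False, True, True, 90))
```

A stream can legitimately land in several layers; what matters is that each entry in the result has a reason attached to it.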

After placement, define the operating model. Topic ownership should map to plant or domain ownership, not only to infrastructure teams. Schemas should encode production meaning, not only field names. Network rules should respect OT segmentation, with explicit paths from gateways to brokers and from brokers to cloud consumers. Disaster recovery should describe what happens during plant isolation, not only what happens when a cloud zone fails.

Finally, model cost before migration. Use measured ingress, egress, retention, partition count, consumer fan-out, peak-to-average ratio, and zone placement. Compare broker-attached storage with object-storage-backed Kafka using the same durability and retention assumptions. Scenario-based modeling is more credible than a single headline number because manufacturing plants differ. A factory with small streams and short retention may not justify architectural change. A global operation retaining high-volume telemetry and traceability data across many sites may find that storage architecture is the difference between a strategic platform and a recurring budget fight.
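A scenario model does not need to be elaborate to be useful. The sketch below compares storage cost only, under loudly stated assumptions: the unit prices are placeholders and not real cloud prices, broker-attached storage is billed for every replica, object storage is billed for one logical copy, and each consumer group reads the full stream once. It leaves out compute, cross-zone traffic, request charges, and WAL storage, all of which belong in a real comparison.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Scenario:
    name: str
    avg_ingress_mib_s: float   # measured average ingress
    peak_to_avg: float         # peak-to-average traffic ratio
    retention_days: float
    fan_out: int               # consumer groups reading the stream


def derived_load(s):
    """Turn measured inputs into the quantities that drive sizing."""
    retained_gib = s.avg_ingress_mib_s * SECONDS_PER_DAY * s.retention_days / 1024
    return {
        "peak_ingress_mib_s": s.avg_ingress_mib_s * s.peak_to_avg,
        "avg_read_mib_s": s.avg_ingress_mib_s * s.fan_out,
        "retained_gib_one_copy": retained_gib,
    }


def monthly_storage_cost(s, block_price, object_price, replication_factor=3):
    """Storage-only monthly cost; prices are per GiB-month placeholders."""
    gib = derived_load(s)["retained_gib_one_copy"]
    return {
        "broker_attached": gib * replication_factor * block_price,
        "object_backed": gib * object_price,
    }


scenarios = [
    Scenario("single plant, short retention", 2, 1.5, 1, 2),
    Scenario("multi-site telemetry", 200, 3.0, 30, 6),
]
for s in scenarios:
    # 0.10 and 0.02 are hypothetical prices used only to show the shape.
    print(s.name, derived_load(s), monthly_storage_cost(s, 0.10, 0.02))
```

Replacing the placeholder prices and scenarios with measured values from each site is what turns this from an illustration into a business case.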

What Good Looks Like

The goal is not to make every factory event pass through one giant Kafka cluster. The goal is to stop rebuilding the OT/IT bridge for every use case. Kafka gives manufacturing teams a shared event backbone, but the architecture has to respect the physical world at one end and cloud economics at the other.

At the edge, success looks like local resilience, clear semantics, and controlled data movement. In the cloud, success looks like durable replay, many consumers, elastic operations, and storage costs that do not punish every extra day of retention. Diskless Kafka is useful because it changes the second half of that equation while keeping Kafka compatibility intact.

If your manufacturing Kafka plan starts with cluster sizing, pause. Start with the boundary: which data must stay near the machine, which data must become enterprise memory, and which layer should pay the storage bill. The right Kafka architecture follows from that answer.

FAQ

Is Kafka a good fit for manufacturing and Industrial IoT?

Kafka is a strong fit when manufacturing data needs durable event ingestion, fan-out to multiple consumers, replay, and integration with analytics, maintenance, quality, or data lake systems. It is not a replacement for PLC control logic, safety systems, SCADA, MES, historians, OPC UA, or MQTT. In most industrial architectures, Kafka sits above or beside those systems as the durable event backbone.

Should Kafka run at the edge or in the cloud?

Both can be valid, but for different workloads. Edge or plant-local Kafka makes sense when local consumers need shared events and WAN independence matters. Cloud Kafka makes sense for cross-plant analytics, long retention, replay, AI features, and enterprise integration. Many manufacturing teams need a layered architecture rather than one cluster location.

What is diskless Kafka in manufacturing architecture?

Diskless Kafka is a Kafka-compatible architecture where durable log storage is moved away from broker-local disks and into cloud object storage. Brokers still serve Kafka clients, but they are less tied to long-lived local log segments. For manufacturing, this is most useful for cloud-side retained streams such as telemetry, traceability, maintenance, and analytics events.

Does diskless Kafka replace OPC UA, MQTT, or historians?

No. OPC UA, MQTT, and historians solve different parts of the industrial data problem. OPC UA provides industrial information modeling and interoperability, MQTT is widely used for lightweight publish-subscribe messaging, and historians are common for process history. Kafka-compatible streaming is most useful when normalized events need durable fan-out, replay, and integration across many consumers.

Where does AutoMQ fit in an IIoT architecture?

AutoMQ fits best in regional or cloud Kafka layers where teams need Kafka compatibility, object-storage-backed retention, elastic brokers, and lower operational coupling between compute and storage. It can also fit selected plant-layer deployments when Kafka is already appropriate there, but hard real-time control paths should remain in the OT systems designed for them.

How should teams estimate Kafka cost for industrial IoT?

Measure ingress, egress, retained bytes, consumer fan-out, retention period, partition count, peak-to-average traffic, and cross-zone placement. Then compare a broker-attached Kafka design with an object-storage-backed Kafka design under the same durability and retention assumptions. Avoid universal savings claims; the result depends on workload shape and cloud architecture.
