
How OpenAI, Robinhood, and Disney+ Use Kafka - And Why Kafka Infrastructure Gets Expensive

Kafka has become the default backbone for companies moving high-volume operational data without turning every downstream system into a point-to-point integration project. Public talks from OpenAI, Robinhood, Disney+ Hotstar, and JioCinema show where Kafka keeps appearing when scale, replay, and fan-out matter at the same time.

The uncomfortable part is that a working Kafka architecture is not the same thing as an optimized Kafka cost model. At small scale, the bill is hidden inside brokers, EBS volumes, and a managed service line item. At large scale, the bill becomes a map of replicated local disks, cross-zone traffic, broker ownership of partitions, peak-capacity provisioning, and rebalancing work.

This analysis uses public information only. It does not claim private knowledge of OpenAI, Robinhood, Disney, Hotstar, JioCinema, or their current vendor decisions. The practical question is narrower: when internet-scale teams publish Kafka workload patterns, what should other platform teams learn before their own bill reaches the same shape?

[Figure: Big-company Kafka cost driver map]

What Public Architectures Reveal

OpenAI's session at the Current New Orleans 2025 conference describes StreamLink as a real-time streaming ingestion platform for a data lakehouse, powered by Apache Flink. The conference abstract places it in the context of high-quality data products for real-time applications, analytics, and AI. That is a familiar AI infrastructure pattern: models and product systems need fresh data, but the durable record often lands in a lakehouse where it can be replayed, joined, governed, and audited.

Robinhood's public WarpStream customer story is a different shape. The workload is logging and observability, with traffic tied to U.S. market hours. Robinhood reported more than 14 million monthly active users and over 10 TB of data processed per day, then described why classic Kafka was awkward for a workload that spikes during market activity and quiets down at night or on weekends. The same post reports a 45% saving after moving the logging workload to WarpStream.

Media workloads add another dimension: extreme fan-out and event bursts. Disney+ Hotstar's Kafka Summit talk describes Kafka Connect as a central integration layer across analytics, inter-service communication, workflow management, alerting, logging, monitoring, and user engagement. The public session page says Disney+ Hotstar ran over 10 Connect clusters and more than 2000 connectors. JioCinema's Kafka Summit Bangalore session goes even further on traffic intensity, describing ingestion at 10M+ messages per second for streaming data at scale.

Those examples should not be flattened into one generic "Kafka at big companies" story. They represent three workload families:

  • AI and lakehouse ingestion: data must arrive fast enough to power downstream intelligence, but it also needs durable replay and clean handoff into analytical storage.
  • Fintech observability and logs: bursts are predictable in broad shape but volatile in detail, so fixed broker capacity tends to be idle for long windows and stressed during market events.
  • Media streaming and user engagement: live events create sharp traffic peaks, while Connect, analytics, recommendation, and personalization systems multiply the number of consumers.

Kafka is good at these patterns because it decouples producers from consumers and retains records in a durable log. The Apache Kafka documentation describes Kafka as taking storage seriously, with clients controlling their read position and the system acting like a specialized distributed filesystem for log storage, replication, and propagation. That feature is the point, and it is also where the cost starts.

[Figure: Public workload pattern table]

Why Scale Turns Kafka Cost Into Architecture Cost

Kafka's classic durability model is based on replicated partitions. Under normal operation a partition has one leader and zero or more followers; with a replication factor of 3, Kafka keeps three copies of the partition across the cluster. That design is reliable and well understood, but cloud billing makes every copy visible.

The first visible line is storage. A broker-backed Kafka cluster typically stores hot data on block storage such as EBS, with provisioned capacity sized for retention, throughput, and headroom. AWS lists gp3 storage at $0.08/GB-month on its EBS volume types page, while S3 Standard pricing is presented separately as object storage priced by GB-month. The exact comparison changes by region and storage class, but the architectural difference is stable: Kafka traditionally pays for replicated broker-local storage.
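As a rough illustration of how replication and headroom multiply the storage line, here is a minimal sketch. The function name, the 25% headroom default, and the 10 TB example topic are assumptions for illustration; the only figure taken from the text is the $0.08/GB-month gp3 price, which should be checked against the current rate for your region.

```python
def replicated_storage_cost(
    logical_gb: float,
    replication_factor: int = 3,
    headroom_fraction: float = 0.25,   # assumed free-space margin, not a recommendation
    price_per_gb_month: float = 0.08,  # gp3 figure quoted above; verify per region
) -> float:
    """Monthly block-storage cost for the retained data of one topic or cluster."""
    # Every replica stores a full copy, and disks are provisioned above actual usage.
    provisioned_gb = logical_gb * replication_factor * (1 + headroom_fraction)
    return provisioned_gb * price_per_gb_month


# Hypothetical example: 10 TB of retained (post-compression) data.
monthly = replicated_storage_cost(logical_gb=10_000)
print(f"${monthly:,.2f} per month")
```

With these assumed inputs, 10 TB of logical data becomes 37.5 TB of provisioned disk, which is why retention and replication settings are usually the first thing to inspect.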

The second visible line is cross-zone traffic. Production Kafka clusters are usually spread across Availability Zones so a zone failure does not take the cluster down. AWS states that data transfer between Availability Zones in the same Region is a distinct charge category, and the EC2 pricing page is the place to verify the region-specific rate. For Kafka, this matters because replication traffic is not an occasional background operation; it is on the write path.
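The effect of replication sitting on the write path can be estimated with a small model. This is a sketch under stated assumptions: one replica per zone, producers and leaders spread evenly across zones, and consumer traffic ignored. The function names and the placeholder rate are hypothetical; the real per-GB rate has to come from the pricing page for your region.

```python
from typing import Optional


def cross_az_transfer_gb(
    produced_gb: float,
    replication_factor: int = 3,
    zones: int = 3,
    producer_remote_fraction: Optional[float] = None,
) -> float:
    """Estimate GB that cross a zone boundary for a given produced volume."""
    if replication_factor > zones:
        raise ValueError("this sketch assumes at most one replica per zone")
    if producer_remote_fraction is None:
        # Assumption: with evenly spread producers and leaders, a producer
        # reaches a leader in another zone (zones - 1) / zones of the time.
        producer_remote_fraction = (zones - 1) / zones
    # Each follower sits in a different zone and fetches a full copy from the leader.
    replication_gb = produced_gb * (replication_factor - 1)
    producer_gb = produced_gb * producer_remote_fraction
    return replication_gb + producer_gb


def cross_az_cost(transfer_gb: float, price_per_gb: float) -> float:
    """Multiply estimated transfer by a rate the caller has verified."""
    return transfer_gb * price_per_gb


PLACEHOLDER_RATE = 0.02  # hypothetical value, NOT a quoted AWS price
monthly_gb = cross_az_transfer_gb(produced_gb=300_000)
print(f"{monthly_gb:,.0f} GB cross-zone, ${cross_az_cost(monthly_gb, PLACEHOLDER_RATE):,.2f}")
```

The point of the model is the multiplier, not the dollar figure: under these assumptions every produced gigabyte generates more than two gigabytes of inter-zone traffic before any consumer reads it.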

The third line is capacity that exists for peaks but bills during valleys. Robinhood's logging pattern is a clean public example because market-hours traffic creates a predictable daily shape. Media platforms see the same pattern around live sports. If broker disks and broker ownership of data make scale-down slow or risky, teams provision for the top of the curve and pay for the bottom.
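To see how much capacity bills during the valleys, one can compare the average of an hourly traffic curve against capacity sized for its peak. The sketch below is illustrative only: the function name, the 20% safety margin, and the sample curve (7 busy hours, 17 quiet hours) are assumptions, not figures from any of the companies discussed.

```python
from typing import Dict, Sequence


def provisioning_profile(
    hourly_mb_per_s: Sequence[float],
    safety_margin: float = 0.2,  # assumed margin above observed peak
) -> Dict[str, float]:
    """Compare peak-sized capacity with the average load it actually serves."""
    if not hourly_mb_per_s:
        raise ValueError("need at least one traffic sample")
    peak = max(hourly_mb_per_s)
    mean = sum(hourly_mb_per_s) / len(hourly_mb_per_s)
    provisioned = peak * (1 + safety_margin)
    utilization = mean / provisioned
    return {
        "provisioned_mb_per_s": provisioned,
        "mean_mb_per_s": mean,
        "utilization": utilization,
        "idle_fraction": 1 - utilization,
    }


# Hypothetical market-hours shape: 7 busy hours, 17 quiet hours.
curve = [100.0] * 9 + [400.0] * 7 + [100.0] * 8
profile = provisioning_profile(curve)
print(f"utilization: {profile['utilization']:.0%}, idle: {profile['idle_fraction']:.0%}")
```

A low utilization figure is not waste by itself; it becomes a cost question only when the architecture prevents capacity from following the curve.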

Here is the cost anatomy most large Kafka teams eventually recognize:

| Cost driver | Why it grows | What to inspect |
| --- | --- | --- |
| Replicated broker storage | Retention multiplies by replication factor and headroom | Topic retention, replication factor, compression, actual bytes retained |
| Cross-AZ replication | Followers often live in different zones from leaders | Inter-zone transfer line items, replica placement, producer locality |
| Consumer fan-out | Each additional consumer group can reread the same data | Read throughput by group, analytical replay patterns, sink connectors |
| Peak overprovisioning | Brokers and disks are sized for spikes | Hourly traffic curve, broker CPU/network idle time, storage free space |
| Rebalancing operations | Data belongs to brokers, so scaling can move data | Partition reassignment time, incident history, operational freeze windows |

The table is deliberately vendor-neutral because the first step is not buying a new platform. The first step is admitting that Kafka cost is not one number. It combines storage semantics, network topology, workload shape, and operational constraints.

The Pattern Behind OpenAI, Robinhood, and Media Workloads

The OpenAI public session sits at the boundary between operational streaming and analytical storage. A streaming ingestion platform for a lakehouse has to absorb events, preserve ordering where needed, and make data available downstream. But the long-term value often comes after the first write: replay, enrichment, feature generation, audits, and batch/stream convergence. If Kafka is used as the shock absorber before data reaches Iceberg or another lakehouse table format, the cluster may carry high write volume and meaningful retention even when the business query path lives elsewhere.

Robinhood's logging workload shows the opposite pressure. Logs are high volume, but they are often latency-tolerant compared with payment authorization or order placement. The public write-up quotes the team describing logging as a strong fit because latency sensitivity is lower, and it reports produce latency moving from 0.2 seconds to 0.45 seconds after batch tuning. That trade-off would be wrong for many workloads, but it can be acceptable for observability pipelines where cost, elasticity, and operational simplicity matter more than single-digit millisecond producer latency.

Disney+ Hotstar and JioCinema show why fan-out changes the math. A media platform does not only ingest play events. It feeds analytics, personalization, user engagement, monitoring, experimentation, ad systems, and downstream data stores. Kafka Connect makes that ecosystem manageable because connectors turn integration into configuration and reusable runtime behavior. But every sink, replay, and real-time consumer creates read pressure. With enough downstream systems, Kafka becomes less like a queue and more like a shared data distribution layer.

These examples point to a decision framework:

  • Keep classic Kafka where low-latency local log behavior is the primary requirement. Some workloads need the mature behavior and performance envelope of broker-local storage, and replacing that path should not be treated as a casual optimization.
  • Evaluate diskless Kafka for high-volume, replay-heavy, or latency-tolerant streams. Logs, analytics ingestion, clickstream, lakehouse ingestion, and some Connect-heavy workloads often fit this profile.
  • Separate "Kafka API compatibility" from "Kafka storage architecture." The client protocol can remain familiar while the storage layer changes from broker-owned disks to shared object storage.
  • Model the whole workload, not only broker count. A cluster with fewer brokers can still be expensive if it drives heavy cross-zone traffic or forces high storage replication.

That last point is where teams often misread the problem. Kafka optimization is not only tuning batch.size, linger.ms, retention, or partitions. Those knobs help, but they do not change the fact that classic Kafka binds data ownership to brokers. When the data lives in shared storage, scaling becomes more of a compute placement problem.

Where Diskless Kafka Changes the Model

Diskless Kafka is not "Kafka without storage." It is Kafka-compatible streaming where durable data is moved out of broker-local disks and into shared object storage, while compute nodes become more stateless. Common implementations use S3-compatible object storage as the durable layer, then run Kafka-compatible agents or brokers that handle the protocol, caching, batching, metadata coordination, and read/write paths.

AutoMQ is one example of this architectural direction. AutoMQ keeps Kafka protocol compatibility while replacing Kafka's native log storage with S3Stream, a shared streaming storage layer built around object storage and write-ahead log acceleration. In AutoMQ's documentation, object storage is the primary data repository, and broker nodes are designed to be stateless relative to the durable stream data.

The cost model changes in four places:

  1. Storage moves from replicated broker disks to shared object storage. Instead of treating every broker as a storage owner, the durable stream data lives in object storage. The mental model changes from "replicated disks per broker" to "shared durable storage plus compute."
  2. Cross-zone replica traffic can shrink. If the platform writes durable data to regional object storage instead of replicating every partition across brokers on the write path, the large cross-AZ replication stream that classic Kafka creates can be reduced or avoided for suitable workloads.
  3. Compute can scale closer to traffic. Stateless or near-stateless brokers are easier to add and remove because they do not own large local logs. That is especially relevant for market-hours logs, live-event media traffic, and bursty AI ingestion.
  4. Rebalancing becomes less painful. If partition movement does not require copying large broker-local logs, operational actions that used to be planned around data movement can become faster and less risky.
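The rebalancing point can be made concrete with a back-of-envelope estimate. In a broker-local design, moving partitions means copying their log data over the network, usually under a bandwidth throttle so the copy does not starve live traffic. The sketch below assumes decimal units (1 GB = 1000 MB) and a single sustained throttle rate; the function name and the example numbers are hypothetical.

```python
def reassignment_hours(data_to_move_gb: float, throttle_mb_per_s: float) -> float:
    """Hours needed to copy partition data at a sustained throttle rate."""
    if throttle_mb_per_s <= 0:
        raise ValueError("throttle must be positive")
    seconds = data_to_move_gb * 1000 / throttle_mb_per_s  # assumes 1 GB = 1000 MB
    return seconds / 3600


# Hypothetical example: 2 TB of broker-local logs moved at 100 MB/s.
print(f"{reassignment_hours(2_000, 100):.1f} hours")
```

When durable data lives in shared storage, the amount a node has to copy before it can serve a partition is much smaller, so the same formula yields a correspondingly shorter window. How much smaller depends on the implementation and should be measured rather than assumed.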

This is where precision matters. Diskless Kafka is not automatically better for every topic. It can introduce different latency, request, caching, metadata, and object-store behavior. Latency-sensitive transactional streams, long-retention observability topics, lakehouse ingestion topics, and fan-out analytics topics may deserve different storage paths.

[Figure: Diskless savings scenario model]

Scenario Analysis By Workload Type

For an OpenAI-like ingestion pattern, the question is not whether Kafka is useful. It clearly is useful when many producers need to feed real-time pipelines and durable analytical stores. The better question is how long Kafka should remain the expensive system of record before the lakehouse takes over. If a topic primarily buffers ingestion into object-backed analytical storage, storing the same data on replicated broker disks for extended retention may duplicate cost.

For a Robinhood-like logging pattern, the public evidence is stronger because the migration story is already about diskless Kafka. Robinhood's published case says the team moved logging workloads from Kafka to WarpStream and saved 45%. The details matter more than the headline: cyclical traffic, tolerance for higher produce latency after batching, and the desire to reduce inter-AZ networking and fixed compute. If logs are high volume, bursty, replayed by multiple consumers, and not ultra-latency-sensitive, they are a clear place to evaluate diskless Kafka.

For Disney+ Hotstar or JioCinema-like media patterns, the fit depends on which part of the platform you are analyzing. Live user-facing control loops may need very low latency and careful isolation. Analytics, watch events, engagement streams, connector-heavy sink pipelines, and replay workloads often have more room for batching and object-backed durability. The public JioCinema session's 10M+ messages per second figure is a reminder that tiny per-message inefficiencies become material at that scale.

A practical model separates topics into four classes:

| Topic class | Typical examples | Storage architecture question |
| --- | --- | --- |
| Latency-critical operational streams | order events, auth decisions, fraud responses | Does the topic need classic broker-local latency characteristics? |
| High-volume observability | logs, traces, metrics envelopes, audit events | Can batching and object storage reduce cost without harming users? |
| Lakehouse ingestion | clickstream, product events, feature inputs | Is Kafka retaining data longer than the ingestion buffer requires? |
| Fan-out analytics and connectors | S3 sinks, warehouse feeds, personalization events | Are repeated reads and connector fleets driving hidden capacity? |

The answer can be mixed. Many mature teams will run multiple Kafka profiles: classic Kafka for latency-critical paths, diskless Kafka for cost-sensitive high-volume streams, and data lake formats for long-term analytical storage.
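One way to apply the four classes is a simple rule-based first pass over a topic inventory. The sketch below is illustrative only: the `TopicProfile` fields, the 50 ms latency cutoff, and the fan-out cutoff of five consumer groups are hypothetical choices, and any real classification should be reviewed by the topic owner.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TopicClass(Enum):
    LATENCY_CRITICAL = "latency-critical operational stream"
    OBSERVABILITY = "high-volume observability"
    LAKEHOUSE_INGESTION = "lakehouse ingestion"
    FANOUT_ANALYTICS = "fan-out analytics and connectors"


@dataclass
class TopicProfile:
    name: str
    p99_latency_target_ms: float  # hypothetical field: producer-to-consumer target
    is_telemetry: bool            # logs, traces, metrics, audit events
    consumer_groups: int
    feeds_lakehouse: bool


def classify(
    topic: TopicProfile,
    latency_cutoff_ms: float = 50.0,  # assumed threshold
    fanout_cutoff: int = 5,           # assumed threshold
) -> Optional[TopicClass]:
    """First-pass classification; returns None when manual review is needed."""
    # A strict latency contract dominates every other property.
    if topic.p99_latency_target_ms <= latency_cutoff_ms:
        return TopicClass.LATENCY_CRITICAL
    if topic.is_telemetry:
        return TopicClass.OBSERVABILITY
    if topic.consumer_groups >= fanout_cutoff:
        return TopicClass.FANOUT_ANALYTICS
    if topic.feeds_lakehouse:
        return TopicClass.LAKEHOUSE_INGESTION
    return None


orders = TopicProfile("orders", 10, False, 3, False)
app_logs = TopicProfile("app-logs", 2000, True, 2, True)
print(classify(orders).value, "|", classify(app_logs).value)
```

The ordering of the rules encodes a design choice: latency requirements are checked first so that a cost-driven rule never overrides a latency contract.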

Cost Checklist For Large Kafka Teams

The fastest way to improve a Kafka bill is to stop looking at Kafka as a single cluster and start looking at it as a portfolio of workload contracts. Each contract says how fresh the data must be, how long it must be retained, how many systems read it, how much replay is expected, and what failure behavior the business needs.

Start with the billing and metrics that map directly to architecture:

  • Storage retained by topic: include replication factor, compression ratio, free-space headroom, and retention overrides. Expensive topics often survive because nobody owns the retention policy anymore.
  • Cross-zone data transfer: inspect regional transfer usage and correlate it with Kafka replication, cross-zone clients, load balancers, and sink paths. Do not assume the network line item belongs to "the cloud" in the abstract.
  • Read amplification: count consumer groups and connector sinks, then identify topics where the same data is replayed by many downstream systems.
  • Peak-to-median traffic ratio: workloads with sharp peaks and long valleys are poor fits for fixed broker capacity unless the latency contract truly requires it.
  • Operational drag: measure how often teams delay scaling, broker replacement, partition reassignment, or version upgrades because data movement is risky.
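Two of the checklist items, read amplification and the peak-to-median traffic ratio, reduce to simple arithmetic once the metrics are exported. A minimal sketch, assuming per-topic byte counters and a list of throughput samples are already available from your monitoring system; the function names are hypothetical.

```python
from statistics import median
from typing import Sequence


def read_amplification(bytes_read: float, bytes_written: float) -> float:
    """How many times each written byte is read by downstream consumers."""
    if bytes_written <= 0:
        raise ValueError("bytes_written must be positive")
    return bytes_read / bytes_written


def peak_to_median(samples: Sequence[float]) -> float:
    """Ratio of the highest throughput sample to the median sample."""
    mid = median(samples)
    if mid <= 0:
        raise ValueError("median throughput must be positive")
    return max(samples) / mid


# Hypothetical topic: 100 GB written, 600 GB read, one sharp traffic spike.
print(read_amplification(600, 100))
print(peak_to_median([100, 100, 100, 400, 100]))
```

Median is used rather than mean so that a few extreme spikes do not mask how quiet the topic is most of the time.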

Then classify topics by migration risk. The low-drama candidates are usually high-volume observability, analytics, lakehouse ingestion, and connector-heavy fan-out streams. The high-scrutiny candidates are payment, ordering, user-facing control loops, and anything with strict tail-latency expectations. Move the low-drama candidates first, and migrate high-scrutiny topics only after stricter benchmarking and failure-mode testing.

AutoMQ fits into this decision as a Kafka-compatible, cloud-native streaming platform for teams that want Kafka APIs with an object-storage-centered cost and elasticity model. This article does not present it as a universal replacement. It presents it as a concrete implementation of the architectural idea: keep the Kafka ecosystem surface area where it helps, but stop assuming broker-local replicated disks are the only durable storage model for every stream.

FAQ

Do OpenAI, Robinhood, Disney+ Hotstar, or JioCinema use AutoMQ?

This article does not claim that they use AutoMQ. It analyzes public Kafka, Flink, Connect, and diskless Kafka architecture references, then applies the cost lessons to similar workload patterns. AutoMQ is discussed as one Kafka-compatible implementation of shared-storage streaming architecture.

Is diskless Kafka always lower cost than classic Kafka?

No. Diskless Kafka can reduce storage, cross-zone replication, and overprovisioning costs for suitable workloads, but the result depends on write volume, retention, read amplification, latency targets, object-store request patterns, cache behavior, and vendor pricing. The right comparison is topic by topic.

Which Kafka workloads are usually strong candidates?

High-volume logs, observability pipelines, clickstream, lakehouse ingestion, replay-heavy analytics, and connector-heavy fan-out streams are often good candidates. Low-latency operational streams can still be candidates, but they need stricter benchmarking and failure-mode testing.

Does Kafka tiered storage solve the same problem?

Tiered storage helps move older segments to object storage, but the hot path and primary partition ownership can still remain tied to brokers. Shared-storage or diskless designs go further by making object storage the primary durable layer for stream data. That difference matters for cross-zone replication, elasticity, and rebalancing.

How should a team start a Kafka cost review?

Pick the top 10 topics by retained bytes, write throughput, and read throughput. For each topic, document retention, replication factor, consumer groups, peak-to-median traffic ratio, latency target, and business owner. That inventory usually reveals which topics should stay on classic Kafka and which deserve a diskless Kafka proof of concept.
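The inventory step can be sketched as a small ranking script. This reads "top 10" as the top 10 by each metric and then takes the union, which is one interpretation of the advice above. The `TopicStats` fields and the sample data are hypothetical; real values would come from broker metrics or a monitoring export.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class TopicStats:
    name: str
    retained_bytes: int
    write_bytes_per_s: float
    read_bytes_per_s: float


METRICS = ("retained_bytes", "write_bytes_per_s", "read_bytes_per_s")


def review_candidates(topics: Iterable[TopicStats], n: int = 10) -> Dict[str, List[str]]:
    """Map each topic in the top n of any metric to the metrics it ranks in."""
    topic_list = list(topics)
    selected: Dict[str, List[str]] = {}
    for metric in METRICS:
        ranked = sorted(topic_list, key=lambda t: getattr(t, metric), reverse=True)
        for topic in ranked[:n]:
            selected.setdefault(topic.name, []).append(metric)
    return selected


# Hypothetical inventory with n=1 to keep the output short.
inventory = [
    TopicStats("app-logs", 900, 10.0, 20.0),
    TopicStats("clickstream", 300, 50.0, 40.0),
    TopicStats("orders", 50, 5.0, 90.0),
    TopicStats("heartbeats", 1, 0.1, 0.1),
]
for name, reasons in review_candidates(inventory, n=1).items():
    print(name, reasons)
```

Topics that appear under several metrics are natural first entries for the review, since a change to their retention or storage path affects more than one cost driver.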

