Robinhood Saved 45% with Diskless Kafka | When 80% Is Realistic

Robinhood's public WarpStream migration is useful because it does not sound like a lab benchmark. It is a logging and observability workload at a financial services company, with traffic tied to market hours, quiet nights, weekend troughs, and sudden spikes when the business gets noisy. That is exactly the kind of Kafka workload where cloud waste hides in plain sight: provisioned brokers sized for peaks, EBS volumes that cannot shrink with traffic, inter-AZ replication that repeats every byte, and operator time spent keeping a storage-heavy cluster healthy.

The headline number was 45% savings versus Kafka for the logging workload. The more interesting lesson is not the number itself. It is the shape of the workload that made the number possible: high write volume, latency tolerance, variable demand, and storage economics that mattered more than sub-10ms produce latency. Once those variables are explicit, 45% becomes a reference point rather than a ceiling.

Figure: Robinhood-inspired workload profile

There is a trap here. A team can read the Robinhood story, ask "Can we save 80%?", and jump straight into vendor comparison. That usually produces bad math. A better starting point is to ask which cost centers are actually present in your current Kafka bill. If your cluster is already right-sized, single-AZ, short-retention, and lightly read, diskless Kafka may still simplify operations but it will not magically remove 80% of the bill. If your cluster is multi-AZ, replication-heavy, storage-heavy, overprovisioned for peaks, and mostly used for logs or analytics, the savings envelope changes dramatically.

What Robinhood Proved

The Current New Orleans 2025 session by Robinhood engineers Ethan Chen and Renan Rueda described a migration from Kafka to WarpStream for logging. The session page frames the problem as the cost and complexity of logging at scale, and the slide deck gives more texture: traffic follows market hours, low-traffic nights and weekends exist, sudden spikes are normal, and logging capacity does not need to be static all day. That is not a generic Kafka problem. It is a particular cost profile.

The deck also names the tradeoffs. Diskless Kafka reduced or removed important infrastructure costs, especially inter-AZ networking, and elasticity allowed capacity to scale down during non-market hours. At the same time, Robinhood treated latency as a workload selection question. The deck says logging could tolerate increased latency, and later discusses tuning batch settings where larger batches reduce S3 requests but can increase latency while batches fill. That is a mature engineering decision: do not make every Kafka workload carry the same latency budget if the business does not need it.
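
The batching tradeoff is easy to see with a little arithmetic. The sketch below is a simplified model, not WarpStream's actual flush logic: it assumes a single writer that flushes only when a batch is full, and the function name and parameters are made up for illustration. Real systems usually also flush on a time limit, which caps the fill delay.

```python
def batch_tradeoff(throughput_mib_s: float, batch_mib: float) -> dict:
    """Object-store write requests per second and batch fill time for one writer.

    Assumes a flush happens only when the batch is full (no time-based flush).
    """
    if throughput_mib_s <= 0 or batch_mib <= 0:
        raise ValueError("throughput and batch size must be positive")
    return {
        # Fewer, larger objects mean fewer PUT requests for the same bytes.
        "put_requests_per_s": throughput_mib_s / batch_mib,
        # The first record in a batch waits this long before the flush.
        "fill_time_ms": batch_mib / throughput_mib_s * 1000.0,
    }


# Hypothetical writer receiving 32 MiB/s.
for batch_mib in (1, 4, 16):
    r = batch_tradeoff(throughput_mib_s=32, batch_mib=batch_mib)
    print(f"{batch_mib:>2} MiB batch: {r['put_requests_per_s']:.1f} PUT/s, "
          f"fill time {r['fill_time_ms']:.0f} ms")
```

Request count falls in proportion to batch size while fill time rises in proportion, which is why the tuning only makes sense for a workload with latency slack.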

The reported results are strong enough to validate the architecture category:

  • Inter-AZ networking became the main win. The Robinhood deck reports inter-AZ networking down by 99% over time and 99% lower versus Kafka on the measured comparison.
  • Compute improved, but not to zero. The same deck reports compute down by 47% over time and 36% lower versus Kafka. Diskless does not remove compute; it makes compute more elastic.
  • Storage improved less than networking. The deck reports storage down by 34% over time and 13% lower versus Kafka, which is a reminder that object storage savings depend on retention, compression, object layout, request patterns, and vendor metering.
  • Operations mattered. Robinhood also called out fewer broker-maintenance tasks, no more increasing EBS volumes, and less on-call burden.

Those bullets are not a universal promise. They are a workload fingerprint. The savings came from matching a diskless architecture to a logging system that had large, variable traffic and enough latency slack to use object storage efficiently. For a payment authorization stream with strict p99 latency, the same architecture choice would need a different WAL, storage, or deployment strategy.

Why 45% May Not Be the Upper Bound

Traditional Kafka was designed around brokers that own partitions and local disks. In a cloud multi-AZ deployment, that model turns durability into repeated movement. Producers may write across zones, brokers replicate records across zones, consumers may fetch across zones, and every replica needs attached storage sized for retention and peak throughput. The cluster works, but the cloud bill sees the same logical data in several physical forms.
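
To make the "repeated movement" concrete, here is a rough estimate, not a billing formula. It assumes producers, consumers, and partition leaders are spread evenly across zones, that every follower replica sits in a different zone from its leader, and that consumers fetch from leaders unless told otherwise. The function name and parameters are illustrative.

```python
def cross_az_gib(write_gib: float, replication_factor: int = 3, zones: int = 3,
                 consumer_groups: int = 1,
                 fetch_from_local_replica: bool = False) -> dict:
    """Estimate cross-zone GiB moved for a given volume of produced data."""
    if zones < 1 or not 1 <= replication_factor <= zones:
        raise ValueError("this sketch assumes 1 <= replication_factor <= zones")
    # A client reaches a leader in another zone (zones - 1) / zones of the time.
    produce = write_gib * (zones - 1) / zones
    # Each follower copy crosses a zone boundary once.
    replicate = write_gib * (replication_factor - 1)
    # Treated as zero only if every consumer has a replica in its own zone.
    consume = 0.0 if fetch_from_local_replica else (
        write_gib * consumer_groups * (zones - 1) / zones)
    return {"produce": produce, "replicate": replicate, "consume": consume,
            "total": produce + replicate + consume}


# Hypothetical month: 900 GiB produced, RF 3, three zones, one consumer group.
estimate = cross_az_gib(900)
print(estimate)
print(f"cross-zone GiB per GiB written: {estimate['total'] / 900:.2f}")
```

Under these assumptions, each GiB written moves a little over three GiB across zone boundaries, and replication is the largest share. That ratio, multiplied by your provider's inter-AZ transfer price, is the line item a diskless design targets.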

Diskless Kafka attacks that cost structure by moving durable log data into object storage and making compute nodes easier to scale. This is different from tiered storage. Tiered storage usually keeps the active log on broker storage and offloads older segments. A diskless design treats object storage as the primary durable layer, so the system avoids much of the broker-local replication and data rebalancing that made the original cloud bill heavy.

The reason 80% sometimes appears in diskless Kafka cost discussions is not magic pricing. It usually means several levers are active at once:

| Cost lever | What changes in a diskless design | When it matters most |
| --- | --- | --- |
| Cross-AZ replication | Broker-to-broker replica traffic is reduced or removed because object storage handles durable storage replication. | Multi-AZ Kafka clusters with replication factor 3 and high write volume. |
| Broker storage | Large EBS volumes for retained log replicas are replaced by object storage plus smaller WAL or metadata storage. | Long-retention logs, metrics, audit, replay, and analytics topics. |
| Compute headroom | Stateless or near-stateless brokers can scale with demand instead of peak-sizing the whole cluster. | Cyclical workloads, market-hour traffic, campaign spikes, and weekend troughs. |
| Operational labor | Rebalancing, disk expansion, broker replacement, and partition movement become less frequent or less painful. | Teams spending SRE time on storage-heavy Kafka maintenance. |
| Vendor meter | The managed-service billing unit can either preserve or consume some infrastructure savings. | High-compression workloads and large logical write volumes. |

The last row is the one buyers often miss. A diskless architecture can reduce cloud infrastructure cost while the vendor bill reintroduces a different meter. WarpStream's billing documentation describes BYOC dimensions such as cluster-minutes, uncompressed GiB written, and uncompressed GiB stored. AutoMQ's usage-based BYOC documentation describes data ingress, egress, and retention as measured after compression. That difference is not cosmetic for log analytics, where compression ratios can be high and raw logical volume can be much larger than compressed bytes on the wire.
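
A small sketch shows how much the metering basis matters. It contains no vendor prices, and the 5:1 compression ratio and volume are hypothetical. The two meter labels stand in for "billed on uncompressed bytes" and "billed on compressed bytes" as described above.

```python
def metered_gib(logical_gib: float, compression_ratio: float, meter: str) -> float:
    """Volume that a usage meter would count for the same logical data.

    compression_ratio is logical bytes divided by compressed bytes (5.0 = 5:1).
    """
    if compression_ratio < 1:
        raise ValueError("compression_ratio must be >= 1")
    if meter == "uncompressed":
        return logical_gib
    if meter == "compressed":
        return logical_gib / compression_ratio
    raise ValueError(f"unknown meter: {meter!r}")


# Hypothetical log topic: 10,000 GiB of logical writes per month at 5:1.
logical = 10_000
for meter in ("uncompressed", "compressed"):
    print(f"{meter:>12}: {metered_gib(logical, 5.0, meter):,.0f} GiB metered")
```

The unit price still has to be multiplied in, so a larger metered volume is not automatically a larger bill. The point is that the two meters diverge by exactly the compression ratio, so logs that compress well are the workloads where the basis deserves the closest reading.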

So the question is not "Is 45% or 80% the true number?" The question is: which meters does your workload hit, and who gets the benefit when your data compresses well?

WarpStream vs AutoMQ Cost Levers

WarpStream deserves credit for proving that a Kafka-compatible, object-storage-backed architecture can run real logging workloads. Its documentation describes stateless Agents, object storage, a cloud metadata store, and a separation of data plane from control plane. That design is attractive for teams that want to stop managing stateful Kafka brokers and can accept the operating boundary of a vendor-managed control plane.

AutoMQ starts from a different set of tradeoffs. It is an Apache 2.0 open-source project that reworks Kafka's storage layer around S3Stream shared storage while keeping Kafka API compatibility. AutoMQ's broker nodes become stateless because storage is offloaded to cloud storage, and its WAL layer absorbs writes before data is uploaded to object storage. The WAL choice matters: AutoMQ documentation describes S3 WAL for latency-insensitive logging and monitoring, Regional EBS WAL for general Kafka use cases, and NFS WAL for low-latency scenarios.

That is the architectural reason AutoMQ fits this conversation without pretending Robinhood used AutoMQ. Robinhood validated the workload category. AutoMQ gives teams another way to pursue the same cloud-native economics, with a different posture on openness, billing, control boundary, and latency options.

Figure: WarpStream vs AutoMQ decision matrix

The comparison becomes clearer when you separate five decisions:

| Decision | WarpStream angle | AutoMQ angle |
| --- | --- | --- |
| Source availability | Proprietary service and agent model. | Apache 2.0 open-source project on GitHub. |
| Billing unit | BYOC docs describe uncompressed GiB written/stored and cluster-minutes. | Usage-based BYOC docs describe compressed ingress, egress, retention, and cluster uptime. |
| Control boundary | Agents run in the customer environment; metadata/control plane is vendor-managed according to WarpStream docs. | BYOC and software deployment options are designed around customer cloud control; selected model determines exact boundary. |
| Latency strategy | Object-storage batching, tuning, and options such as S3 Express for lower-latency cases. | WAL-backed storage path with S3 WAL, Regional EBS WAL, or NFS WAL depending on workload. |
| Exit risk | Kafka protocol compatibility helps clients, but proprietary internals still matter for long-term platform risk. | Open-source codebase gives teams inspection, self-hosting, and a clearer fork or self-maintenance option. |

No row makes one platform automatically right. A team standardized on the Confluent ecosystem may value a single vendor relationship, since WarpStream is now part of Confluent. A regulated platform team may value open-source inspectability and tighter data-control posture. A logging-only cluster may prefer the lowest-cost latency-tolerant path. A mixed platform running logs, fraud signals, CDC, and transaction events may need a broader latency envelope.

A Worksheet for 45%, 60%, or 80%

The clean way to evaluate savings is to model your current Kafka bill by cost center before modeling a replacement platform. Start with a 30- to 90-day window, not a single peak day. Then split the workload into at least three families: logging and observability, analytics or replay, and latency-sensitive application events. Combining them into one average Kafka workload hides the very differences that make diskless Kafka compelling.

Figure: 45% vs 80% cost lever breakdown

For each workload family, collect these inputs:

  • Write volume and compression ratio. Keep both logical and compressed volume. Logical bytes matter for some vendor meters; compressed bytes matter for network and storage paths.
  • Read fan-out and replay behavior. Logs read once into a search system behave differently from topics read by many independent services.
  • Retention by topic class. A three-day hot log topic and a 30-day audit topic should not share one storage assumption.
  • Current replication topology. Note replication factor, availability zones, producer placement, consumer placement, and whether clients fetch from followers or leaders.
  • Elasticity window. Measure how many hours per week the cluster is meaningfully below peak and whether capacity can safely scale down.
  • Latency envelope. Define p95 and p99 tolerance by workload. Logging may accept hundreds of milliseconds. Transactional event flows often cannot.
  • Operational cost. Count planned work such as broker upgrades, disk expansion, partition reassignment, incident response, and capacity planning.
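
One way to keep these inputs honest is to record them per workload family in a small structure and derive the obvious quantities from them. The sketch below is only a suggested shape: every field name is hypothetical, and the example values are invented rather than taken from any real cluster.

```python
from dataclasses import dataclass

HOURS_PER_WEEK = 168


@dataclass
class WorkloadFamily:
    name: str
    logical_write_gib_per_day: float   # before compression
    compression_ratio: float           # logical / compressed, e.g. 5.0
    consumer_groups: int               # read fan-out
    retention_days: float
    replication_factor: int
    zones: int
    off_peak_hours_per_week: float     # hours meaningfully below peak
    p99_latency_budget_ms: float
    ops_hours_per_month: float

    @property
    def compressed_write_gib_per_day(self) -> float:
        return self.logical_write_gib_per_day / self.compression_ratio

    @property
    def retained_compressed_gib(self) -> float:
        # One logical copy of the retained log, as compressed bytes.
        return self.compressed_write_gib_per_day * self.retention_days

    @property
    def replicated_disk_gib(self) -> float:
        # Broker-local storage must hold every replica, before any headroom.
        return self.retained_compressed_gib * self.replication_factor

    @property
    def off_peak_fraction(self) -> float:
        return self.off_peak_hours_per_week / HOURS_PER_WEEK


# Invented example values for a logging family.
logs = WorkloadFamily("logging", 10_000, 5.0, 1, 3, 3, 3, 98, 500, 20)
print(logs.compressed_write_gib_per_day, logs.replicated_disk_gib,
      round(logs.off_peak_fraction, 2))
```

Filling in one record per family makes it harder to average a three-day log topic and a 30-day audit topic into a single misleading number.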

Now map the result to a savings band. Around 45% is plausible when the workload has meaningful cross-AZ and storage cost, but some compute, storage, or vendor meters remain large. Around 60% becomes realistic when the cluster also has strong elasticity windows and expensive EBS overprovisioning. Around 80% usually requires most levers to fire together: high replicated write volume, long retention, high compression benefit preserved by the billing model, minimal cross-AZ traffic after migration, and a latency-tolerant workload that can use cost-optimized WAL or object-storage paths.
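
The band a workload lands in is mostly a weighted average: each cost center's share of the current bill multiplied by how much that center shrinks. The sketch below uses the versus-Kafka reductions from the Robinhood deck quoted earlier (99% networking, 36% compute, 13% storage) as example reduction factors. The two cost mixes are hypothetical, and vendor fees and operational labor are deliberately left out.

```python
def blended_savings(cost_mix: dict[str, float],
                    reductions: dict[str, float]) -> float:
    """Overall savings fraction given per-cost-center spend and reductions.

    Cost centers with no entry in `reductions` are assumed unchanged.
    """
    total = sum(cost_mix.values())
    if total <= 0:
        raise ValueError("cost_mix must have a positive total")
    saved = sum(cost * reductions.get(center, 0.0)
                for center, cost in cost_mix.items())
    return saved / total


# Reduction factors versus Kafka as reported in the Robinhood deck.
ROBINHOOD_VS_KAFKA = {"inter_az_network": 0.99, "compute": 0.36, "storage": 0.13}

# Hypothetical bills, expressed as percentage shares of current spend.
network_heavy = {"inter_az_network": 60, "compute": 25, "storage": 15}
already_lean = {"inter_az_network": 10, "compute": 60, "storage": 30}

for label, mix in (("network-heavy", network_heavy), ("already lean", already_lean)):
    print(f"{label}: {blended_savings(mix, ROBINHOOD_VS_KAFKA):.1%}")
```

The same per-lever reductions produce very different totals depending on where the money currently goes, which is why the inventory has to come before the vendor comparison.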

The AutoMQ pricing calculator is helpful because it exposes the moving parts rather than hiding them behind a single TCO number. As of May 22, 2026, one default calculator scenario on AutoMQ's pricing page shows an AutoMQ monthly estimate of $1,758 versus $5,098 for Apache Kafka, or 66% savings, under the displayed assumptions. AutoMQ's benchmark documentation also publishes a high-throughput 1 GiB/s scenario with a much larger cost-efficiency claim and a breakdown where cross-availability-zone traffic dominates the Apache Kafka side. These are not substitutes for your workload model, but they show why the savings curve steepens when throughput and replication traffic rise.
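
The calculator figures are easy to sanity-check, and the same arithmetic shows what a given target would require. This is plain percentage math on the two numbers quoted above; the helper names are made up, and the 80% target is used only as an illustration.

```python
def savings_pct(baseline: float, replacement: float) -> float:
    """Percentage saved when moving from baseline cost to replacement cost."""
    return (baseline - replacement) / baseline * 100


def max_replacement_cost(baseline: float, target_pct: float) -> float:
    """Highest replacement cost that still meets a savings target."""
    return baseline * (1 - target_pct / 100)


kafka_monthly, automq_monthly = 5098, 1758
print(round(savings_pct(kafka_monthly, automq_monthly), 1))   # 65.5
print(round(max_replacement_cost(kafka_monthly, 80), 2))      # 1019.6
```

The quoted 66% is the rounded form of about 65.5%. Against the same baseline, an 80% result would need the replacement to cost roughly $1,020 per month or less, which is only reachable if the workload's own cost mix allows it.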

The most honest answer is also the most useful: 80% savings is a scenario, not a slogan. It is credible when the current Kafka architecture is paying cloud-era penalties that a diskless design actually removes. It is not credible when the baseline is already lean or when the replacement vendor meter captures most of the infrastructure savings.

Why Open Source Changes the Risk Calculation

Cost gets the meeting started. Control decides whether the platform becomes strategic.

Kafka is rarely a side component in a fintech, marketplace, gaming, adtech, or logistics stack. It is the event substrate for logs, transactions, fraud signals, service integration, analytics, and sometimes customer-facing behavior. Once a team moves that substrate to a Kafka-compatible alternative, the question is not only whether producers and consumers still work. It is also whether the team can inspect the implementation, run it in the deployment model they need, and keep leverage if procurement or roadmap assumptions change.

AutoMQ's Apache 2.0 license is not a marketing footnote in that context. It gives engineering teams a concrete option to inspect the code, understand the storage path, test failure behavior, and self-host. The GitHub repository also makes the project shape visible: code, releases, issues, and community signal are part of the evaluation surface. Open source does not remove execution risk, but it changes the buyer's downside if the commercial relationship changes.

AutoMQ also has public production references that make the architecture feel less theoretical. JD.com's AutoMQ case study describes JDQ, a Kafka-based platform that serves many business lines, handles up to 15 trillion records daily, and reaches peak outbound bandwidth of 1 TB/s. It then discusses how AutoMQ reduced redundant storage and network overhead in a cloud-native architecture. AutoMQ's public materials also reference customers such as Grab, Tencent Music, and HubSpot. The point is not to import someone else's scale into your sizing model. The point is to verify that the architecture has been exercised outside a benchmark harness.

For a Robinhood-inspired logging workload, AutoMQ's S3 WAL path is the natural place to start because the workload can tolerate more latency in exchange for lower cost. For a mixed platform, Regional EBS WAL or NFS WAL may be the safer production default because they preserve lower write latency while still moving the durable log architecture away from broker-local disks. That flexibility is where AutoMQ differs from a one-latency-profile design.
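
In planning terms, this is a per-topic-family lookup rather than one cluster-wide decision. The sketch below encodes the mapping described in AutoMQ's documentation as a simple table. The enum, the helper, and the topic family names are hypothetical, and the WAL strings are descriptive labels, not AutoMQ configuration values.

```python
from enum import Enum


class LatencyClass(Enum):
    TOLERANT = "latency-insensitive logging and monitoring"
    GENERAL = "general Kafka use cases"
    LOW = "low-latency scenarios"


# Starting points as described in the text; labels only, not config keys.
WAL_BY_CLASS = {
    LatencyClass.TOLERANT: "S3 WAL",
    LatencyClass.GENERAL: "Regional EBS WAL",
    LatencyClass.LOW: "NFS WAL",
}


def plan_wal(topic_families: dict[str, LatencyClass]) -> dict[str, str]:
    """Map each topic family to a candidate WAL type to evaluate first."""
    return {family: WAL_BY_CLASS[cls] for family, cls in topic_families.items()}


# Hypothetical mixed platform.
plan = plan_wal({
    "app-logs": LatencyClass.TOLERANT,
    "cdc": LatencyClass.GENERAL,
    "fraud-signals": LatencyClass.LOW,
})
print(plan)
```

Treat the output as a shortlist to test against your own p99 budgets, not as a sizing decision.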

How to Decide

Treat the Robinhood result as a prompt, not a procurement answer. It proves that a serious engineering team found enough value in diskless Kafka to move a logging workload and report substantial savings. It does not prove that every Kafka cluster should move to the same product, the same WAL path, or the same control-plane boundary.

The decision should come down to a short but demanding checklist:

  • Workload fit: Which topics are logging or observability, and which topics are latency-sensitive application flows?
  • Savings source: Is the current bill dominated by cross-AZ traffic, EBS, compute headroom, managed-service fees, or operations?
  • Billing sensitivity: Are you billed on logical uncompressed volume, compressed volume, cluster uptime, partitions, or provisioned capacity?
  • Control boundary: Where do data, metadata, credentials, and operational authority reside?
  • Migration path: Can you preserve offsets, roll back, and migrate topic families gradually?
  • Failure model: What happens during object-storage latency, control-plane issues, broker or agent loss, AZ failure, and scale-down events?

The teams most likely to exceed Robinhood's 45% result are not the teams chasing a bigger headline. They are the teams that do the boring inventory: topology, bytes, compression, retention, fan-out, latency, and operational toil. Once that inventory is visible, diskless Kafka stops being a trend and becomes an accounting exercise with architectural consequences.

If your logging bill looks like Robinhood's workload profile, a diskless design deserves serious evaluation. If you also care about Apache 2.0 openness, compressed-data billing, BYOC data control, and the ability to choose a WAL path by workload, AutoMQ belongs on the shortlist.

FAQ

Did Robinhood use AutoMQ?

No. Robinhood's public Current session and WarpStream's writeup describe a migration from Kafka to WarpStream for logging workloads. AutoMQ is discussed here as an open-source diskless Kafka alternative for teams evaluating the same workload category.

Is 80% Kafka cost savings guaranteed?

No. 80% savings is realistic only under specific workload and baseline assumptions, such as high multi-AZ replication traffic, large broker storage, strong compression, long retention, elastic traffic, and a replacement architecture whose billing model preserves those savings. Teams should model their own workload before using any percentage in a business case.

Why are logging workloads a good fit for diskless Kafka?

Logging and observability pipelines often have high throughput, variable demand, long retention, and more latency tolerance than transactional event streams. Those traits make it easier to trade broker-local disks and cross-AZ replication for object storage and elastic compute.

What is the biggest cost lever in cloud Kafka?

For many multi-AZ deployments, cross-AZ traffic is the hidden lever. Producer writes, broker replication, and consumer reads can all cross availability zones. Storage is also important, especially when replicated EBS volumes are sized for long retention and peak throughput.

How is AutoMQ different from WarpStream?

Both are Kafka-compatible diskless streaming platforms, but they differ in implementation and operating model. WarpStream uses stateless Agents with object storage and a vendor-managed metadata/control-plane model. AutoMQ is Apache 2.0 open source, uses S3Stream shared storage, and adds WAL options such as S3 WAL, Regional EBS WAL, and NFS WAL to cover different latency and cost profiles.

Should fintech teams choose BYOC for Kafka?

BYOC can be attractive for fintech teams because it can keep data-plane resources in the customer's cloud account. The details still matter: metadata location, control-plane dependency, auditability, IAM boundaries, network egress, and operational access should all be reviewed before treating BYOC as equivalent across vendors.
