Observability data has a way of turning Kafka capacity planning into a permanent argument with the future. The write path is bursty because incidents, campaigns, releases, and traffic spikes do not ask permission before producing more logs and traces. The retention path is heavy because the data may be needed after the hot window, especially when an engineer is debugging a slow service, a payment flow, or a noisy dependency. Then the cold-read path arrives at the worst possible time: someone needs old data quickly, and the storage layer that looked economical for writes suddenly becomes a bottleneck for investigation.
That combination is why observability pipelines are one of Kafka's hardest production workloads. They behave like infrastructure insurance: quiet enough to be ignored on good days, but expensive and performance-sensitive exactly when the business needs them most. POIZON, the fashion marketplace also known as Dewu in China, ran into this problem at a scale most teams only see in planning documents. According to AutoMQ's public POIZON case study, the company's observability platform reached peak throughput above 40 GiB/s. POIZON adopted AutoMQ for this Kafka-compatible workload and reported roughly 50% infrastructure cost reduction, 5x cold-read performance improvement, and nearly 3 years of zero downtime in production.
Those numbers matter, but the more useful lesson is architectural. POIZON's story is not "Kafka got expensive, so change vendors." It is a case study in why observability workloads expose the mismatch between traditional broker-local storage and cloud infrastructure, and why separating hot ingestion from durable retention changes the capacity model.
Why Observability Workloads Are Hard for Kafka
Kafka is often a natural choice for observability pipelines because it already sits between producers and downstream systems. Logs, metrics, traces, and diagnostic events can be buffered, replayed, and consumed by multiple tools without forcing every service to know every destination.
The pressure starts when observability traffic grows beyond normal application-event assumptions. A retail or fashion marketplace can see sharp peaks around promotions, releases, and user behavior changes. During those windows, the observability pipeline is not optional; it is the instrument panel for the engineering organization. Dropping data or slowing ingestion means teams lose visibility when visibility is most valuable.
Traditional Kafka makes that requirement expensive because storage and compute are tightly coupled inside the broker. A broker is not only accepting writes and serving reads. It is also holding replicated local data, consuming disk throughput, and absorbing retention requirements. When the pipeline must survive a higher write peak, teams often add brokers, cores, and disks together even if the long-term need is mostly durable storage or occasional read-back.
For observability teams, the common cost drivers tend to reinforce each other:
- Peak writes set the floor for provisioned capacity. If the cluster must handle the largest spike safely, steady-state utilization can look wasteful for most of the day.
- Retention expands the disk footprint. Keeping logs and traces available for debugging means data lives longer than the hot ingestion window.
- Cold reads punish slow storage paths. Historical investigations often scan older data, and weak catch-up read performance turns storage savings into operational pain.
- Replication multiplies infrastructure. Traditional Kafka replication protects durability, but it also increases network and disk work across brokers.
- Operational elasticity is limited by data movement. Scaling broker-local storage often triggers partition reassignment, rebalancing, and operational risk.
This is the over-provisioning trap. The team pays for a cluster sized for the worst moment, then keeps that cluster around because moving data and resizing Kafka safely is work. In a business where observability volume keeps rising, the trap becomes a budget line.
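The way these drivers compound can be made concrete with a little arithmetic. The sketch below is illustrative only: the `Workload` fields, the function names, and the example numbers are assumptions chosen for readability, not figures from POIZON's deployment, and the model ignores compression, headroom, and uneven partition load.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; every field is an illustrative assumption."""
    peak_write_gib_s: float      # largest write burst the cluster must absorb
    avg_write_gib_s: float       # typical steady-state write rate
    retention_hours: float       # how long data must stay readable
    replication_factor: int = 3  # copies kept on broker-local disks


def steady_state_utilization(w: Workload) -> float:
    """Share of peak-sized write capacity that the average rate actually uses."""
    return w.avg_write_gib_s / w.peak_write_gib_s


def retained_disk_gib(w: Workload) -> float:
    """Broker-local disk needed for the retention window, replicas included."""
    retention_seconds = w.retention_hours * 3600
    return w.avg_write_gib_s * retention_seconds * w.replication_factor


# Made-up numbers: a 4x burst over a 1 GiB/s baseline, kept for one day.
example = Workload(peak_write_gib_s=4.0, avg_write_gib_s=1.0, retention_hours=24)

print(f"steady-state utilization: {steady_state_utilization(example):.0%}")
print(f"replicated disk footprint: {retained_disk_gib(example):,.0f} GiB")
```

With these inputs, average traffic uses a quarter of the peak-sized write capacity, while a single day of retention already implies roughly 253 TiB of replicated broker disk.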
POIZON's High-Throughput Trace and Log Pipeline
POIZON's public case describes an observability workload built for large-scale trace and log processing. AutoMQ's write-up says the platform ingests petabytes of trace data daily and handles trillions of span records, with peak traffic above 40 GiB/s. The goal was to support a production observability system where reliability, cost, and debugging performance all had to hold at the same time.
The scale also explains why this case is useful for teams far outside fashion e-commerce. Once observability data reaches this level, the pipeline becomes a shared platform. Many application teams depend on it, downstream storage and analysis systems depend on it, and incident response depends on it. A small inefficiency in Kafka architecture gets amplified by volume; a painful cold-read path gets amplified by every investigation that needs old traces.
POIZON's public numbers provide a compact picture of the workload:
| Production signal | Publicly reported value | Why it matters |
|---|---|---|
| Peak throughput | 40+ GiB/s | Capacity must absorb bursty observability writes. |
| Compute replaced | 1,280 cores | The prior capacity footprint was large enough to make elasticity economically meaningful. |
| Infrastructure cost reduction | About 50% | The result was not only storage optimization; it changed the broader resource model. |
| Cold-read performance | 5x improvement | Historical trace/log analysis remained practical after moving to object storage-backed retention. |
| Production stability | Nearly 3 years of zero downtime | The architecture had to prove itself under ongoing production traffic. |
The table is deliberately narrow because customer stories become less credible when they fill in blanks with imagined detail. The public case does not give every internal design decision, every migration step, or every benchmark condition a platform engineer might ask for. What it does give is enough evidence to study the shape of the problem: a high-throughput Kafka-compatible observability pipeline where cost and cold reads both mattered.
The Over-Provisioning Trap
The reason Kafka gets expensive in this pattern is not that Kafka is poorly designed. Kafka's original storage model was built around broker-local disks and replicated logs, which made sense for the infrastructure assumptions of its time. The broker owned the data path. Replicas lived on other brokers. Scaling throughput, retention, and availability meant scaling the machines that carried those responsibilities.
Cloud infrastructure changes the economics underneath that model. Object storage already provides durable, elastic storage. Compute can be added and removed quickly. But a traditional Kafka cluster still tends to bind compute capacity to the amount of data sitting on broker disks. That binding is manageable at modest scale; at observability scale, it turns short traffic peaks and long retention windows into permanent capacity.
Tiered storage can reduce some pressure by moving older segments to object storage, but it usually keeps the broker-local storage model as the hot path. The offload helps, but it does not fully remove the coupling between broker state, partition movement, and operational elasticity. The team may still need enough broker and disk capacity to handle hot writes, replication, and read workloads before data ages out.
POIZON's case sits in the place where this distinction matters. A trace/log pipeline has heavy hot ingestion and meaningful cold reads. Solving only one side creates a different problem: optimize hot writes and cold data becomes expensive, or optimize retention and historical queries become slow.
Separating Hot Ingestion From Cold Retention
AutoMQ's architecture changes the capacity model by separating compute from storage while preserving Kafka protocol compatibility. Brokers can remain responsible for Kafka-facing ingestion and serving, while durable data is persisted through AutoMQ's shared storage layer on object storage. AutoMQ's documentation describes S3Stream as a shared streaming storage layer designed to use object storage as the primary storage medium for Kafka-compatible streaming workloads.
That design is different from bolting object storage onto the side of a traditional Kafka cluster. The important shift is that broker capacity no longer has to scale in lockstep with retained data. A team can size compute for active ingestion and serving, while object storage absorbs durable retention.
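One way to see the shift is to compare how broker count is derived in each model. This is a deliberately simplified sketch, not AutoMQ's sizing guidance: the function names, per-broker limits, and workload numbers are hypothetical, and real sizing would also account for replication, headroom, and caching.

```python
import math


def brokers_coupled(peak_mib_s: float, retained_tib: float,
                    per_broker_mib_s: float, per_broker_disk_tib: float) -> int:
    """Broker-local storage: the cluster must satisfy throughput AND disk needs."""
    by_throughput = math.ceil(peak_mib_s / per_broker_mib_s)
    by_storage = math.ceil(retained_tib / per_broker_disk_tib)
    return max(by_throughput, by_storage)


def brokers_decoupled(peak_mib_s: float, per_broker_mib_s: float) -> int:
    """Shared object storage: retained data no longer drives broker count."""
    return math.ceil(peak_mib_s / per_broker_mib_s)


# Hypothetical inputs: 2,000 MiB/s peak, 600 TiB retained,
# 100 MiB/s and 10 TiB of disk per broker.
coupled = brokers_coupled(2000, 600, per_broker_mib_s=100, per_broker_disk_tib=10)
decoupled = brokers_decoupled(2000, per_broker_mib_s=100)

print(f"coupled: {coupled} brokers, decoupled: {decoupled} brokers")
```

In this toy case retention, not throughput, sets the size of the coupled cluster. Lengthening retention in the decoupled model changes the object storage footprint but not the broker count.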
For observability workloads, the practical benefits are easy to understand:
- Hot ingestion stays close to compute. Brokers handle the active write/read path without turning every retained byte into local disk pressure.
- Cold retention moves to object storage. Older logs and traces can sit in an elastic storage layer instead of occupying expensive broker-local disks.
- Catch-up reads remain part of the design target. AutoMQ's public materials emphasize improved catch-up read efficiency, which matters when consumers or investigations need historical data.
- Scaling becomes less tied to data migration. Stateless broker elasticity reduces the operational weight of adding or removing compute capacity.
This is the architectural reason the POIZON result is interesting. A 50% cost reduction in this kind of workload is not only a procurement win. It suggests the team changed which resources had to be provisioned for the worst case.
Results: Cost, Cold Reads, and Production Stability
AutoMQ's POIZON case reports three outcomes that should be read together rather than separately. First, POIZON replaced 1,280 cores and reduced infrastructure costs by about 50%. Second, it reduced cold-data costs by 85% while improving cold-read performance by 5x. Third, the deployment ran for nearly 3 years with zero downtime in production.
The second point is the one observability platform teams should look at twice. Storage savings are easy to claim if cold data is allowed to become slow or painful to access. That trade-off is dangerous for observability because old data often becomes important during incident review or customer-impact investigations.
There is a useful lesson here for any team running Kafka as an observability buffer: the metric that matters is not cost per stored byte in isolation. It is the cost of retaining enough data while preserving the ability to read it when the organization needs answers. A lower-cost cold tier that nobody wants to query is not a reliability platform; it is an archive with a nice invoice.
The nearly 3-year zero-downtime claim also matters because observability infrastructure is not a side system during an incident. If it disappears, the team loses the evidence trail. That does not mean every migration is effortless or that every workload will reproduce POIZON's exact numbers. It means the public case passed a higher bar than a lab benchmark: long-running production use under a demanding trace/log workload.
Checklist for Observability Kafka Teams
POIZON's workload is unusually large, but the decision framework scales down. If your team is using Kafka for logs, traces, metrics, or security events, the same questions will show up long before you reach 40 GiB/s.
Start with the shape of the data rather than the current cluster size. Observability workloads are usually defined by the gap between peak ingestion, average ingestion, retention period, and investigation-time reads. Use this checklist before adding another round of brokers and disks:
- What percentage of provisioned Kafka capacity is only there for peak observability bursts? If steady-state utilization is low, the cluster may be sized for fear rather than normal work.
- How much retained data is rarely read but must remain queryable? This is where object storage-backed retention can change the economics.
- What happens when a consumer needs to catch up from older data? Cold-read performance should be tested as a first-class path, not treated as an edge case.
- How much operational effort does resizing require? If scaling triggers large partition movements, the real cost includes human risk and change windows.
- Can you separate hot-path compute from durable retention? If not, every retention decision becomes a compute and disk decision too.
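Two of these questions, catch-up reads and resize effort, lend themselves to back-of-the-envelope estimates before any production test. The helpers below are a rough sketch with hypothetical names and made-up inputs; they assume constant rates and ignore contention, so treat the results as a starting point for testing rather than a prediction.

```python
import math


def catch_up_seconds(backlog_gib: float, read_gib_s: float, ingest_gib_s: float) -> float:
    """Time for a lagging consumer to reach the head of the log while writes continue."""
    net_rate = read_gib_s - ingest_gib_s
    if net_rate <= 0:
        return math.inf  # the consumer never catches up at these rates
    return backlog_gib / net_rate


def reassignment_hours(data_to_move_tib: float, throttle_mib_s: float) -> float:
    """Time to copy partition data between brokers at a fixed replication throttle."""
    data_mib = data_to_move_tib * 1024 * 1024
    return data_mib / throttle_mib_s / 3600


# Made-up inputs: a 3,600 GiB backlog read at 2 GiB/s against 1 GiB/s of new writes,
# and 10 TiB of partition data moved at a 100 MiB/s throttle.
print(f"catch-up: {catch_up_seconds(3600, 2.0, 1.0) / 60:.0f} minutes")
print(f"reassignment: {reassignment_hours(10, 100):.1f} hours")
```

The catch-up estimate depends on the gap between read rate and ingest rate, not on read rate alone, which is why a cold-read path that barely outpaces live writes can leave an investigation waiting for hours.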
The point is not that every observability pipeline needs the same architecture as POIZON's. The point is that Kafka cost problems often look like pricing problems from a distance and architecture problems up close. Once logs and traces become a platform workload, the storage model decides how much of tomorrow's uncertainty you have to buy today.
For teams evaluating this path, AutoMQ's public POIZON customer case is a useful reference point, and the AutoMQ documentation goes deeper into the storage architecture behind it. Observability data will keep arriving in bursts, and engineers will keep asking old data hard questions during the worst moments. The better capacity plan is the one that stops making every rare moment define the permanent size of the cluster.
FAQ
Why is Kafka expensive for observability pipelines?
Kafka observability pipelines are expensive because logs, traces, and metrics combine bursty writes, long retention, and occasional historical reads. In traditional broker-local Kafka architectures, teams often provision brokers, disks, and cores for peak ingestion while also paying to keep retained data available on the cluster.
What did POIZON use AutoMQ for?
POIZON used AutoMQ for a large-scale Kafka-compatible observability workload involving trace and log data. AutoMQ's public case reports peak throughput above 40 GiB/s, replacement of 1,280 cores, roughly 50% infrastructure cost reduction, 85% cold-data cost reduction, 5x cold-read performance improvement, and nearly 3 years of zero downtime.
Does AutoMQ replace Kafka clients?
That is not the intent. AutoMQ is Kafka-compatible, so the aim is to preserve the Kafka protocol surface, and with it existing clients, while changing the storage architecture underneath. Teams should still validate client compatibility, operational tooling, and migration steps for their own environment.
Is object storage too slow for Kafka cold reads?
Object storage can be slow when used as an afterthought, which is why cold-read design matters. AutoMQ's public POIZON case reports a 5x cold-read performance improvement, and AutoMQ documentation describes catch-up read efficiency as a technical advantage of its storage architecture.
Should every Kafka observability pipeline move to object storage-backed architecture?
Not automatically. The strongest fit is usually a pipeline with high peak writes, growing retention, costly broker-local disks, and meaningful cold reads. Teams with small, short-retention workloads may not feel the same pressure. The decision should start with workload shape and production tests, not with a generic rule.
Sources
- POIZON customer case, AutoMQ
- POIZON observability platform blog, AutoMQ
- Dewu trillion-level monitoring system blog, AutoMQ
- S3Stream shared streaming storage overview, AutoMQ Docs
- 5x catch-up read efficiency, AutoMQ Docs
- Continuous self-balancing, AutoMQ Docs