Pulsar Storage Cost: BookKeeper, Tiered Storage, and Cloud Bills

Pulsar storage cost rarely looks scary in the first architecture diagram. A broker receives messages, BookKeeper stores ledgers, old data can move to object storage, and consumers read at their own pace. The bill becomes harder to explain once retention grows, consumer backlog accumulates, replicas multiply the hot path, and cloud networking starts charging for data movement that the logical diagram draws as a single line.

That is why "GB stored" is a weak way to estimate Apache Pulsar cost. A streaming platform pays for storage capacity, write amplification, read fan-out, request patterns, inter-zone traffic, operational headroom, and the time engineers spend keeping the storage layer healthy. Pulsar's architecture gives teams useful separation between brokers and storage, but it does not make storage cost disappear. It changes where the cost lands.

[Figure: Pulsar storage cost flow]

Why Streaming Storage Cost Is Hard To Estimate

A database cost model can often start with provisioned capacity. A streaming cost model starts with time. The same 1 TB/day workload can be inexpensive or painful depending on whether data is retained for 24 hours, 7 days, 30 days, or until a slow consumer catches up. Pulsar adds another dimension because retained messages, unacknowledged backlog, and offloaded ledgers can coexist across different storage services.
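To make the time dimension concrete, here is a minimal Python sketch of steady-state retained volume for the 1 TB/day workload above. It assumes constant ingest and treats the effective window as the larger of the retention setting and the age of the oldest unacknowledged message, which simplifies how Pulsar actually trims ledgers. The function names are illustrative and not part of any Pulsar API.

```python
# Illustrative sketch: logical (pre-replication) bytes kept at steady state.
# Assumes constant ingest; function names are hypothetical, not a Pulsar API.

def effective_window_days(retention_days: float, oldest_backlog_days: float) -> float:
    """Data stays addressable while it is within retention OR still unacknowledged."""
    return max(retention_days, oldest_backlog_days)


def retained_tb(daily_ingest_tb: float, window_days: float) -> float:
    """Steady-state logical volume for a constant daily ingest."""
    return daily_ingest_tb * window_days


if __name__ == "__main__":
    daily_ingest_tb = 1.0  # the 1 TB/day workload from the text
    for retention in (1, 7, 30):
        print(f"retention {retention:>2} d -> {retained_tb(daily_ingest_tb, retention):.0f} TB")

    # A consumer that is 10 days behind overrides a 1-day retention setting.
    window = effective_window_days(retention_days=1, oldest_backlog_days=10)
    print(f"1 d retention, 10 d backlog -> {retained_tb(daily_ingest_tb, window):.0f} TB")
```

The last case is the one that tends to surprise budgets: the retention policy says one day, but the slowest subscription decides how much data actually stays on disk.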

The uncomfortable part is that each layer has a different billing shape:

  • Hot storage is provisioned for write and read pressure. BookKeeper bookies need disks, IOPS, throughput, CPU, memory, and network capacity. The storage bill is tied to the active ledger path, not only the final retained bytes.
  • Cold storage is billed by capacity and operations. Tiered storage can move older data to S3-compatible object storage, but object storage still charges for stored data and requests, and retrieval can matter for replay-heavy workloads.
  • Network is charged by movement, not intent. Data copied across availability zones, regions, VPC boundaries, or service endpoints may appear as transfer fees even when the application sees a single topic.
  • Operations become part of cost when storage is stateful. Bookie replacement, ledger recovery, disk balancing, compaction, and capacity planning consume engineering time that rarely appears in a list-price comparison.

This is also why a Pulsar storage estimate built from one cloud pricing page is usually incomplete. You need to know which bytes stay in BookKeeper, which bytes are offloaded, how often old data is read, and where each component runs.

How Pulsar Stores Data

Pulsar separates serving from durable storage. Brokers handle producers and consumers, while Apache BookKeeper stores persistent message data as ledgers. Pulsar's documentation describes this as a managed ledger abstraction, where topics are backed by ledgers stored in BookKeeper. In practical cost terms, that means the broker fleet and the bookie fleet can scale independently, but both fleets still need capacity planning.

BookKeeper is not a passive disk bucket. Bookies participate in the write quorum and store entries for ledgers. A Pulsar topic with sustained throughput writes data into ledgers, rolls ledgers over time, and retains data according to topic or namespace policy. When consumers fall behind, backlog keeps data alive. When retention is configured for replay, data may remain even after all consumers acknowledge it. These are product features, not accounting mistakes.

The storage cost model starts with four questions:

| Input | Why it matters | Cost surface |
| --- | --- | --- |
| Ingest rate | Determines how fast ledgers grow. | Bookie disk, network, write throughput |
| Ensemble, write quorum, ack quorum | Determines how entries are replicated across bookies. | Hot storage amplification and bookie capacity |
| Retention and backlog | Determines how long messages stay addressable. | BookKeeper capacity and offload volume |
| Replay and catch-up reads | Determines whether old data is idle or repeatedly scanned. | Bookie read load, object storage requests, network |

The table is deliberately mechanical. Pulsar cost is not only about selecting larger or smaller disks. It is about how application behavior turns into ledger growth, replication, and reads from either the hot or offloaded path.

Where BookKeeper Cost Comes From

BookKeeper cost begins with capacity, but the first-order risk is usually throughput. Bookies need to absorb producer writes while also serving reads, handling ledger recovery, and maintaining enough spare capacity for failures. If disks are sized only by retained bytes, the cluster can run out of I/O before it runs out of storage.
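One way to check whether capacity or throughput is the binding constraint is to size the fleet both ways and take the larger answer. The sketch below does that under stated assumptions: the per-bookie limits are hypothetical numbers you would replace with measured values, and both workload inputs are physical figures, meaning they already include replication.

```python
import math

# Hypothetical sizing helper; per-bookie limits are assumptions to be
# replaced with measured values for your instance and disk types.

def bookies_needed(physical_tb: float, physical_write_mb_s: float,
                   usable_tb_per_bookie: float, write_mb_s_per_bookie: float) -> dict:
    """Size the fleet by capacity and by throughput, then keep the larger count."""
    by_capacity = math.ceil(physical_tb / usable_tb_per_bookie)
    by_throughput = math.ceil(physical_write_mb_s / write_mb_s_per_bookie)
    return {
        "by_capacity": by_capacity,
        "by_throughput": by_throughput,
        "required": max(by_capacity, by_throughput),
    }


if __name__ == "__main__":
    # 12 TB of replicated hot data, 900 MB/s of replicated peak writes.
    result = bookies_needed(physical_tb=12, physical_write_mb_s=900,
                            usable_tb_per_bookie=4, write_mb_s_per_bookie=150)
    print(result)
```

In this made-up case the disks would be satisfied by three bookies while the write path needs six, which is the "out of I/O before out of storage" situation described above. The sketch ignores read load and failure headroom, both of which push the number higher.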

Replication multiplies that pressure. Pulsar's BookKeeper configuration uses an ensemble size, a write quorum, and an ack quorum to decide where entries are written and when they are acknowledged. A larger write quorum improves durability and availability, but the hot storage layer stores and moves more copies of every entry. On cloud infrastructure, those copies consume disk capacity, write throughput, and sometimes inter-zone network. The pricing page does not know that those bytes came from a ledger quorum. It only sees block storage, compute, and transfer.
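As a rough sketch of that multiplication, the Python below models the three quorum settings and derives the hot-storage footprint. The class and function names are illustrative, not Pulsar or BookKeeper APIs. The model assumes every entry is stored on exactly `write_quorum` bookies and ignores journal, index, and compaction overhead.

```python
from dataclasses import dataclass

# Illustrative model of BookKeeper quorum settings; names are hypothetical.


@dataclass(frozen=True)
class QuorumConfig:
    ensemble: int      # bookies a ledger's entries are striped across
    write_quorum: int  # copies written for each entry
    ack_quorum: int    # acknowledgments awaited before a write is confirmed

    def __post_init__(self) -> None:
        if not (self.ensemble >= self.write_quorum >= self.ack_quorum >= 1):
            raise ValueError("expected ensemble >= write_quorum >= ack_quorum >= 1")


def hot_storage_tb(logical_tb: float, cfg: QuorumConfig) -> float:
    """Physical bytes on bookie disks for a given logical volume."""
    return logical_tb * cfg.write_quorum


def bookie_write_mb_s(ingest_mb_s: float, cfg: QuorumConfig) -> float:
    """Aggregate write rate the bookie fleet must absorb."""
    return ingest_mb_s * cfg.write_quorum


def per_bookie_share(cfg: QuorumConfig) -> float:
    """Fraction of a ledger's entries that lands on each bookie in the ensemble."""
    return cfg.write_quorum / cfg.ensemble


if __name__ == "__main__":
    cfg = QuorumConfig(ensemble=3, write_quorum=2, ack_quorum=2)
    print(hot_storage_tb(7.0, cfg), "TB physical for 7 TB logical")
    print(bookie_write_mb_s(100.0, cfg), "MB/s hitting bookies for 100 MB/s ingest")
```

Note that the ensemble size spreads load across more bookies but does not add copies. The write quorum is the amplification factor, and the ack quorum affects latency and durability guarantees rather than stored bytes.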

There is also a fragmentation problem that shows up with many topics or uneven traffic. Some bookies may carry more active ledgers, some workloads may generate many small segments, and some consumers may create long-lived backlog. The platform team then needs enough headroom to survive bookie failures and recovery without pushing the cluster into a second incident. That headroom is a real cost even when it is not used every hour.

A good BookKeeper budget separates these line items:

  • Steady-state storage: provisioned capacity for active ledgers, replicated entries, and safety margin.
  • Steady-state I/O: disk and network throughput for writes, reads, acknowledgments, and background work.
  • Failure headroom: spare capacity for bookie loss, ledger recovery, and maintenance windows.
  • Operational tooling: monitoring, alerting, automation, and on-call time for the storage fleet.

This framing keeps teams from treating BookKeeper as "the disk part" of Pulsar. It is a distributed storage system inside the streaming platform, and distributed storage cost is shaped by load, replication, and recovery behavior.
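The failure headroom item in particular is easy to put a number on. The following sketch assumes data is spread evenly across bookies and that, after losing some of them, the survivors must hold everything while staying under a utilization ceiling. The function name and the 70% ceiling are assumptions for illustration.

```python
# Hypothetical headroom calculation: how much raw capacity to provision so
# the fleet survives bookie loss without exceeding a utilization ceiling.

def provisioned_capacity_tb(hot_tb: float, bookies: int,
                            tolerated_failures: int, max_utilization: float) -> float:
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in (0, 1]")
    if not 0 <= tolerated_failures < bookies:
        raise ValueError("tolerated_failures must be in [0, bookies)")
    survivors = bookies - tolerated_failures
    # Each surviving bookie must fit its share of the data under the ceiling.
    per_bookie_tb = hot_tb / survivors / max_utilization
    return per_bookie_tb * bookies


if __name__ == "__main__":
    total = provisioned_capacity_tb(hot_tb=21.0, bookies=6,
                                    tolerated_failures=1, max_utilization=0.7)
    print(f"provision about {total:.1f} TB to hold 21.0 TB of replicated data")
```

The gap between the stored bytes and the provisioned figure is the headroom line item. It is paid every month whether or not a bookie fails.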

What Tiered Storage Changes

Pulsar tiered storage offloads older ledgers from BookKeeper to external storage such as S3-compatible object storage. That can reduce pressure on the hot BookKeeper layer when long retention is required. It is especially relevant for audit streams, observability data, data lake replay, or analytics pipelines where old data must remain available but is not read continuously.

The important word is "older." Tiered storage is not the same as making object storage the primary write path. New data still enters through Pulsar and BookKeeper before it is offloaded according to policy. BookKeeper remains the hot storage layer, and the tiering system adds its own configuration, monitoring, and failure cases. Used well, tiered storage is a cost control mechanism. Used casually, it can turn one storage bill into two bills plus an operational surprise.

[Figure: Storage cost inputs worksheet]

Tiered storage changes the cost equation in three ways. First, long-retention bytes can move from higher-cost hot storage to object storage. Second, replay from offloaded data may generate object storage request and data retrieval patterns that must be estimated. Third, the team now needs a lifecycle policy: when to offload, how long to retain, how to test restore paths, and how to monitor offload lag.
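The first of those three effects can be sketched in a few lines of Python. The model below is deliberately simple and rests on assumptions: ledgers leave BookKeeper after a fixed number of "hot days", object storage is billed for one logical copy, and ingest is constant. In a real cluster the offload trigger and any delay before offloaded ledgers are deleted from bookies decide where that boundary sits.

```python
# Simplified split of retained data between BookKeeper and object storage.
# Assumes a clean time boundary between hot and offloaded ledgers.

def tiered_footprint(daily_ingest_tb: float, retention_days: float,
                     hot_days: float, write_quorum: int) -> tuple:
    """Return (hot_tb, cold_tb): replicated bookie bytes and offloaded bytes."""
    hot_days = min(hot_days, retention_days)
    hot_tb = daily_ingest_tb * hot_days * write_quorum
    cold_tb = daily_ingest_tb * (retention_days - hot_days)
    return hot_tb, cold_tb


if __name__ == "__main__":
    no_tiering = tiered_footprint(1, 30, hot_days=30, write_quorum=3)
    with_tiering = tiered_footprint(1, 30, hot_days=2, write_quorum=3)
    print("no tiering   (hot, cold) TB:", no_tiering)
    print("with tiering (hot, cold) TB:", with_tiering)
```

The sketch only covers capacity. Replay requests and the operational work of running the offload lifecycle are separate line items that this calculation does not capture.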

The right question is not "Does tiered storage reduce cost?" It is "Which part of our storage profile does tiered storage reduce, and what new cost surfaces does it introduce?" For a workload with high ingest, short retention, and active consumers, tiered storage may have limited impact. For a workload with modest ingest and long replay windows, it can matter a lot. For a workload with frequent historical reprocessing, object storage request and transfer patterns deserve a closer look.
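For the reprocessing case, a back-of-the-envelope request estimate is often enough to see whether replay cost deserves attention. The sketch below is a rough model, not a description of how the Pulsar offloader issues reads: the read size per GET, the request price, and the transfer price are all placeholder parameters that you would replace with your own configuration and your provider's current pricing.

```python
import math

# Rough replay estimate. All prices are PLACEHOLDERS, not quotes from any
# provider, and read_size_mib is an assumed average bytes-per-GET.

def replay_estimate(replay_tb: float, read_size_mib: float,
                    usd_per_1000_gets: float, usd_per_gb_transfer: float) -> dict:
    replay_mib = replay_tb * 1024 * 1024
    get_requests = math.ceil(replay_mib / read_size_mib)
    request_usd = get_requests / 1000 * usd_per_1000_gets
    transfer_usd = replay_tb * 1024 * usd_per_gb_transfer
    return {
        "get_requests": get_requests,
        "request_usd": request_usd,
        "transfer_usd": transfer_usd,
        "total_usd": request_usd + transfer_usd,
    }


if __name__ == "__main__":
    # One 10 TB replay, 1 MiB per GET, placeholder prices.
    estimate = replay_estimate(replay_tb=10, read_size_mib=1,
                               usd_per_1000_gets=0.0004, usd_per_gb_transfer=0.01)
    for key, value in estimate.items():
        print(f"{key}: {value}")
```

Multiply the result by how often the replay runs. A job that reprocesses history nightly has a very different profile from one that runs after an incident twice a year.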

Cloud Billing Items To Include

Cloud bills split the streaming storage architecture into categories that do not map cleanly to Pulsar component names. A BookKeeper cluster on virtual machines might show up as compute instances, attached block volumes, provisioned IOPS or throughput, snapshots, data transfer, and monitoring. Offloaded ledgers might show up as object storage capacity, PUT and GET requests, lifecycle operations, and data retrieval. Cross-zone placement can add another set of charges.

On AWS, for example, EBS pricing and S3 pricing are separate pages with different units and assumptions. Data transfer pricing is another page again. The same logic applies on other clouds: block storage, object storage, and network transfer are priced differently, and the streaming architecture decides how many times a byte touches each category.

A practical estimate should include:

| Billing category | Pulsar-related driver | What to estimate |
| --- | --- | --- |
| Bookie compute | BookKeeper write and read load | Instance count, CPU, memory, network headroom |
| Block storage | Active ledgers and replicated hot data | GB-month, IOPS, throughput, snapshots |
| Object storage | Offloaded ledgers and long retention | GB-month, PUT/GET, lifecycle, retrieval |
| Network transfer | Replication, reads, cross-zone access, remote consumers | In-zone, cross-zone, cross-region, internet egress |
| Operations | Stateful storage maintenance | Engineering hours, incident frequency, automation |

The last row is the easiest to omit and the hardest to explain after an incident. A platform that needs manual bookie replacement, disk rebalancing, and careful offload tuning can still be the right platform, but the people cost belongs in the same decision model as the infrastructure cost.
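A small script can keep all five categories in one place so that none of them is silently dropped. The version below is a sketch: the category keys are invented for this example, and every unit price is a placeholder rather than a real quote. The only design choice worth noting is that it refuses to produce a total if a quantity has no matching price.

```python
# Sketch of a monthly estimate keyed by billing category.
# Keys are hypothetical; every price below is a PLACEHOLDER.

def monthly_estimate(quantities: dict, unit_prices: dict) -> dict:
    """Multiply each quantity by its unit price; fail loudly on gaps."""
    missing = sorted(set(quantities) - set(unit_prices))
    if missing:
        raise KeyError(f"no unit price for: {', '.join(missing)}")
    return {name: qty * unit_prices[name] for name, qty in quantities.items()}


if __name__ == "__main__":
    quantities = {
        "bookie_instance_hours": 6 * 730,   # six bookies for one month
        "block_gb_month": 36_000,
        "object_gb_month": 28_000,
        "object_requests_thousands": 10_486,
        "cross_zone_gb": 60_000,
        "ops_hours": 20,
    }
    placeholder_prices = {
        "bookie_instance_hours": 0.50,
        "block_gb_month": 0.08,
        "object_gb_month": 0.02,
        "object_requests_thousands": 0.0004,
        "cross_zone_gb": 0.01,
        "ops_hours": 100.0,
    }
    lines = monthly_estimate(quantities, placeholder_prices)
    for name, usd in lines.items():
        print(f"{name:<28} {usd:>12.2f}")
    print(f"{'total':<28} {sum(lines.values()):>12.2f}")
```

Putting engineering hours in the same dictionary as GB-months is intentional. It makes the operations row as visible as the storage rows when the estimate is reviewed.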

How AutoMQ's Object-Storage-Backed Design Differs

Once the cost model is clear, the architectural contrast becomes sharper. Pulsar uses BookKeeper as the primary hot storage layer and can offload older ledgers to object storage. AutoMQ takes a different path for Kafka workloads: it keeps Kafka protocol compatibility while replacing Kafka's broker-local log storage with S3Stream, an object-storage-backed shared streaming storage layer. In AutoMQ's architecture, WAL storage accelerates writes and supports recovery, while object storage is the primary durable data location.

That distinction matters because tiering and primary shared storage solve different problems. Tiering asks, "When can older data leave hot storage?" Primary shared storage asks, "Why should durable log data be bound to broker-local or bookie-local disks in the first place?" AutoMQ's answer is specific to Kafka-compatible workloads: preserve Kafka clients and ecosystem behavior, but move durable storage to object storage so brokers can become stateless compute nodes.

[Figure: Object storage design comparison]

This does not make AutoMQ a drop-in answer for every Pulsar deployment. Teams using Pulsar-specific subscription semantics, namespace models, or client APIs should evaluate the migration surface carefully. The natural fit is a team asking a narrower question: "We need Kafka-compatible streaming, long retention, cloud elasticity, and lower storage friction. Do we really want to operate a stateful hot storage layer for that?" In that case, an object-storage-first Kafka-compatible architecture deserves a place in the evaluation.

The cost implication is not a magic percentage. It is a change in what must be provisioned. Object storage becomes the durable data layer, WAL absorbs the low-latency write requirement, and brokers are less tied to local data movement. That can reduce over-provisioning and make scaling or recovery less dependent on copying retained data across nodes. The exact savings still depend on ingest rate, retention, read fan-out, cloud region, and deployment model.

Storage Cost Worksheet

The most useful Pulsar storage estimate is a worksheet, not a slogan. Start with workload behavior, translate it into storage mechanics, then map those mechanics to cloud billing categories. If a number cannot be traced through those three steps, it should not drive the decision.

Use these inputs before choosing a storage architecture:

  1. Daily ingest: compressed and uncompressed bytes per day, plus peak write rate.
  2. Retention target: required replay window after acknowledgments, separated from consumer backlog.
  3. Replication settings: BookKeeper ensemble, write quorum, ack quorum, and placement across zones.
  4. Backlog behavior: typical and worst-case consumer lag, including incident recovery scenarios.
  5. Read fan-out: number of subscriptions (the Pulsar counterpart of Kafka consumer groups), historical replay frequency, and catch-up speed requirements.
  6. Tiering policy: offload threshold, offload destination, lifecycle rules, and restore expectations.
  7. Cloud placement: zones, regions, VPC boundaries, private connectivity, and internet egress paths.
  8. Operational model: who owns BookKeeper, how failures are handled, and what automation exists.
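If it helps to keep the worksheet honest, the eight inputs can be captured as a small record that reports which values are still unknown. This is a hypothetical structure for illustration only. The field names are invented here, and a real worksheet would likely carry more detail per input, such as peak rates and zone placement.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

# Hypothetical worksheet record; one simplified field per input in the list.


@dataclass
class StorageWorksheet:
    daily_ingest_tb: Optional[float] = None          # 1. daily ingest
    retention_days: Optional[float] = None           # 2. retention target
    write_quorum: Optional[int] = None               # 3. replication settings
    worst_case_backlog_days: Optional[float] = None  # 4. backlog behavior
    subscriptions: Optional[int] = None              # 5. read fan-out
    offload_after_days: Optional[float] = None       # 6. tiering policy
    zones: Optional[int] = None                      # 7. cloud placement
    storage_owner: Optional[str] = None              # 8. operational model

    def missing_inputs(self) -> List[str]:
        """Names of inputs that have not been filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


if __name__ == "__main__":
    sheet = StorageWorksheet(daily_ingest_tb=1.0, retention_days=30, write_quorum=3)
    print("still unknown:", sheet.missing_inputs())
```

An estimate produced while `missing_inputs()` is non-empty rests on guesses, which is usually worth stating explicitly when the number is shared.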

After those inputs are clear, compare architectures on the same basis. Pulsar with BookKeeper and tiered storage may be the right answer when Pulsar semantics are central to the application. A managed Pulsar provider may be the right answer when the team wants Pulsar features without operating the storage layer. A Kafka-compatible object-storage-backed platform such as AutoMQ may be the right answer when the goal is to keep Kafka applications intact while changing the storage cost structure.

The bill was never only about GB-month. It was about how many times each byte is written, copied, retained, read, and recovered. Once you model those paths explicitly, Pulsar storage cost becomes less mysterious, and architecture choices become easier to defend in front of engineering and finance. For Kafka-compatible workloads where storage cost is the main pressure, the next useful step is to run the same worksheet against an object-storage-backed Kafka design and compare the actual cloud inputs.
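The "how many times each byte is touched" idea can be written down as a simple tally per logical byte. The sketch below is a simplification under stated assumptions: it counts one dispatch read per subscription even though tailing reads may be served from broker cache rather than from bookies, and it leaves out recovery traffic, which depends on how often bookies fail. The names are illustrative.

```python
# Tally of how often one logical byte is written or read along its path.
# A simplification: recovery copies and cache hits are not modeled.

def touches_per_logical_byte(write_quorum: int, subscriptions: int,
                             offloaded: bool, replays: float = 0.0) -> dict:
    touches = {
        "bookie_writes": write_quorum,       # one copy per write-quorum member
        "dispatch_reads": subscriptions,     # one delivery per subscription
        "offload_writes": 1 if offloaded else 0,
        "replay_reads": replays,             # average historical re-reads
    }
    touches["total"] = sum(touches.values())
    return touches


if __name__ == "__main__":
    print(touches_per_logical_byte(write_quorum=3, subscriptions=2,
                                   offloaded=True, replays=0.5))
```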

FAQ

Is Pulsar storage cost mostly BookKeeper cost?

For hot data, yes, BookKeeper is usually the center of the storage cost model because it stores active ledgers and handles replicated writes and reads. For long retention, tiered storage and object storage costs can become significant. The split depends on ingest rate, retention policy, backlog, and replay behavior.

Does Pulsar tiered storage remove the need for BookKeeper capacity?

No. Tiered storage offloads older ledgers, but BookKeeper remains the hot storage layer for active data. You still need enough bookie capacity, I/O, and failure headroom for the write path and recent reads.

What is the biggest hidden cost in Pulsar storage?

The hidden cost is often not one line item. It is the combination of replicated hot storage, cross-zone data movement, replay traffic, and operational headroom. Teams often budget retained bytes but under-budget recovery scenarios and slow consumer backlog.

Is object storage always lower cost than block storage for Pulsar?

Object storage is often cost-effective for long-retention cold data, but the full answer depends on request volume, retrieval patterns, transfer fees, and operational complexity. It should be modeled as part of the workload, not assumed from storage price alone.

How is AutoMQ different from Pulsar tiered storage?

Pulsar tiered storage moves older ledgers from BookKeeper to object storage. AutoMQ uses object storage as the primary durable storage layer for Kafka-compatible streaming, with WAL storage used for write acceleration and recovery. That is a different architecture, not only a different retention policy.
