
MSK Disaster Recovery: Cross-Region Kafka on AWS

An Amazon MSK disaster recovery plan can look complete on a whiteboard: one MSK cluster in the primary AWS Region, another cluster in a secondary Region, and a replication arrow between them. That diagram is useful, but it hides what decides whether recovery works: topic naming, consumer offsets, producer cutover, connector state, schemas, ACLs, DNS, IAM, network reachability, and rehearsed runbooks.

AWS provides managed building blocks for this design. Amazon MSK Replicator can replicate data and selected metadata across MSK clusters in the same or different AWS Regions. Apache Kafka also has MirrorMaker 2 for cross-cluster mirroring when teams want more direct control over replication infrastructure. Neither option turns RPO and RTO into fixed constants. The outcome depends on workload shape, partition count, replication lag, client behavior, offset synchronization, target capacity, and operational automation.

Figure: MSK cross-region disaster recovery architecture

What Disaster Recovery Means for Kafka

AWS Well-Architected defines RTO as the maximum acceptable delay between interruption and restoration, and RPO as the maximum acceptable time since the last recovery point. For Kafka, those definitions become stream questions. How many acknowledged records can the business tolerate losing if the primary Region becomes unavailable? How long can producers be blocked? Can consumers reprocess duplicate records safely? Can downstream systems tolerate out-of-order replay after failover?

Those questions matter because Kafka provides a durable ordered log within a cluster, while cross-region DR is an inter-cluster system. Replication inside a single Kafka cluster depends on leader and follower replicas, in-sync replicas, and settings such as the producer's acks=all and the topic's min.insync.replicas. Cross-region replication is a different layer. It consumes records from one cluster and produces them to another, so it has its own lag, failure modes, and offset mapping.
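
The intra-cluster durability settings named above can be checked before any cross-region layer is added. The following is a minimal sketch, not an official tool: the function name `durability_gaps` and the dictionary shapes are assumptions made for illustration, while `acks` and `min.insync.replicas` are standard Kafka configuration names.

```python
from __future__ import annotations


def durability_gaps(producer_cfg: dict, topic_cfg: dict, replication_factor: int) -> list[str]:
    """Return human-readable gaps in intra-cluster durability settings.

    Hypothetical helper: it only inspects values passed in as plain dicts
    and requires `acks` to be set explicitly rather than relying on defaults.
    """
    gaps = []

    acks = str(producer_cfg.get("acks", "")).lower()
    if acks not in ("all", "-1"):
        gaps.append("producer acks is not explicitly 'all'")

    # The broker-side default for min.insync.replicas is 1.
    min_isr = int(topic_cfg.get("min.insync.replicas", 1))
    if min_isr < 2:
        gaps.append("min.insync.replicas is below 2")
    if min_isr >= replication_factor:
        gaps.append("min.insync.replicas is not below the replication factor")

    return gaps
```

An empty result only says the source cluster is configured for durable acknowledged writes. It says nothing about whether those writes have reached the secondary Region.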

That distinction is easy to miss when the standby cluster has the same topics and appears healthy in dashboards. A Region-level recovery event forces the standby cluster to become the write path rather than just a copy of the data. The target cluster needs enough broker capacity, storage, and partitions, plus the ACLs, security configuration, and connectivity, to serve production traffic. A DR design that only proves "records arrived eventually" has not proved that the business can run from the secondary Region.

MSK Cross-Region Replication Options

For most MSK-to-MSK scenarios, the first option to evaluate is MSK Replicator. AWS describes it as a fully managed MSK feature for asynchronous replication between clusters in the same or different Regions, including data and metadata such as topic configurations, ACLs, and consumer group offsets when configured. MSK-to-MSK replication still has account, quota, and Region constraints.

MSK Replicator reduces operational work because teams do not have to run Kafka Connect workers or MirrorMaker processes for the replication service itself. It still requires design choices that affect recovery behavior:

  • Starting position. AWS notes that MSK Replicator defaults to replicating from the latest offset unless configured to start from the earliest offset during creation. That decision cannot be treated as a cosmetic setup choice because it determines whether historical data is copied.
  • Topic naming. Prefixed topic names can reduce loop risk and make provenance visible. Identical topic names are often more convenient for failover and active-active patterns, but they change how filtering and loop prevention are handled.
  • Offset synchronization. MSK Replicator can translate consumer offsets, but AWS documents important conditions. Legacy unidirectional offset sync does not provide automatic failback; bidirectional enhanced offset synchronization requires a pair of replicators and identical topic name replication.
  • Quotas. AWS lists MSK Replicator quotas, including limits on the number of replicators, topics replicated per replicator, ingress throughput, and record size. Large Kafka estates should validate these limits before making a DR promise.
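
The topic naming trade-off can be made concrete with a small sketch. It assumes a prefix convention of the form `<source alias>.<topic>` and two hypothetical cluster aliases, `primary` and `secondary`; the function names are illustrative and this is not how any managed replicator is implemented internally.

```python
def replicated_name(topic: str, source_alias: str, prefixed: bool = True) -> str:
    """Name a topic would carry on the target cluster under each naming mode."""
    return f"{source_alias}.{topic}" if prefixed else topic


def should_replicate(topic: str, target_alias: str) -> bool:
    """Prefix-based loop guard for a pair of replicators.

    A topic on the source whose name starts with the target's alias was
    itself copied from the target, so sending it back would create a loop.
    """
    return not topic.startswith(f"{target_alias}.")


if __name__ == "__main__":
    print(replicated_name("orders", "primary"))            # prefixed mode
    print(replicated_name("orders", "primary", False))     # identical-name mode
    print(should_replicate("secondary.orders", "secondary"))
```

With identical topic names the prefix guard has nothing to inspect, which is why that mode shifts loop prevention and filtering to the replication tooling and needs to be verified in a drill.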

MirrorMaker 2 remains relevant when teams need cross-vendor replication, self-managed control, custom deployment topology, or a migration path that does not fit MSK Replicator. Apache Kafka documents MirrorMaker 2 as a Kafka Connect-based tool that replicates topics, configurations, ACLs, and consumer groups including offsets. That control comes with operational responsibility: worker sizing, connector configuration, monitoring, upgrades, failure handling, and security on both sides.

The choice is less "managed versus open source" than "where do you want operational complexity to live?" MSK Replicator moves replication infrastructure into the managed service boundary. MirrorMaker 2 keeps more knobs in your hands. Both still need application failover.

RPO/RTO Questions to Answer Before You Promise Anything

RPO for MSK DR is bounded by what has been acknowledged in the primary cluster and then successfully copied to the secondary cluster. With asynchronous replication, records may exist in the source cluster but not yet in the target. AWS explicitly states that MSK Replicator replication latency varies based on factors such as Region distance, source and target cluster throughput capacity, and partition count. A responsible RPO statement is conditional: "under the measured workload, with target capacity sized to absorb peak throughput, observed replication lag stayed within X during drills." Without that condition, the number is theater.
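
One way to turn drill measurements into a conditional RPO statement is to summarize observed replication lag against the agreed target. This is a sketch under stated assumptions: lag samples are collected in seconds by whatever monitoring you already run, and the names `percentile` and `rpo_evidence` are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no lag samples collected")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rpo_evidence(lag_seconds, rpo_target_seconds):
    """Summarize measured replication lag against an RPO target."""
    worst = max(lag_seconds)
    return {
        "p99_lag_s": percentile(lag_seconds, 99),
        "max_lag_s": worst,
        # Judge against the worst observation, not the average:
        # a Region failure can happen at the worst moment.
        "within_rpo": worst <= rpo_target_seconds,
    }
```

Judging against the maximum is a deliberately conservative choice. Averages hide the peak-load spikes that define actual data loss.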

RTO has the same problem. It is the time from declaring the primary unavailable to having producers, consumers, connectors, and downstream dependencies operating acceptably in the secondary Region. That window includes decision-making, DNS or bootstrap-server changes, client metadata refresh, IAM and secret availability, connector restart, consumer group stabilization, and validation that the secondary cluster is not falling behind.
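
Because RTO is a sum of steps, it helps to record each phase separately during a drill. The sketch below assumes the phases run one after another, which is a worst-case simplification since some steps may overlap in practice; the phase names and the function `rto_breakdown` are illustrative.

```python
from __future__ import annotations


def rto_breakdown(phases: dict[str, float], rto_target_s: float) -> dict:
    """Total measured phase durations and name the largest contributor."""
    total = sum(phases.values())
    dominant = max(phases, key=phases.get)
    return {
        "total_s": total,
        "dominant_phase": dominant,
        "within_rto": total <= rto_target_s,
    }


if __name__ == "__main__":
    # Durations below are made-up drill timings, not benchmarks.
    drill = {"decision": 300, "dns_update": 120, "consumer_rebalance": 90}
    print(rto_breakdown(drill, rto_target_s=600))
```

Naming the dominant phase shows where automation would shorten recovery most.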

Figure: RPO and RTO validation checklist for MSK DR

Producer Failover

Producer failover is hard to automate cleanly because Kafka clients are configured with bootstrap servers, security material, and topic assumptions. In active-passive DR, producers normally write only to the primary Region until failover. The runbook must define how they discover the secondary cluster, how long DNS changes take to propagate, what happens to in-flight writes, and whether retries can produce duplicates after a timeout.
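
Two of those runbook questions, cluster discovery and duplicate-safe retries, can be sketched in a few lines. Everything here is an assumption for illustration: the `Envelope` type, the `bootstrap_for` function, and the endpoint strings are hypothetical, and no Kafka client library is involved.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Envelope:
    """A record plus a stable ID assigned once, before the first send.

    Retrying the same Envelope after a timeout or cutover reuses the ID,
    so downstream consumers can recognize the duplicate.
    """
    payload: bytes
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def bootstrap_for(active_region: str, endpoints: dict[str, str]) -> str:
    """Resolve bootstrap servers from an externally declared active Region."""
    try:
        return endpoints[active_region]
    except KeyError:
        raise ValueError(f"no bootstrap servers configured for {active_region!r}") from None
```

Keeping the active Region as an explicit input, for example from a parameter store or feature flag, makes the cutover decision auditable instead of leaving it implicit in DNS caches.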

Active-active is more demanding. It can reduce user-facing latency and improve regional autonomy, but only for workloads that can handle concurrent writes, conflict avoidance, and loop prevention. Kafka topics are ordered per partition inside a cluster, not globally across Regions.

Consumer Offsets

Consumer offsets are where many DR tests stop being clean demos. MSK Replicator's offset sync documentation explains that replicated messages can have different offsets on source and target clusters, so offset translation is needed if consumers should resume near where they left off. Because offset translation is asynchronous, consumers may re-read some records after failover.
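
To see why translation leads to replay, consider a simplified model in which the replication layer periodically records pairs of matching source and target offsets. This sketch is an illustration of the idea only, not the algorithm used by MSK Replicator or MirrorMaker 2; the checkpoint format and the name `translate_offset` are assumptions.

```python
import bisect


def translate_offset(committed_source_offset, checkpoints):
    """Map a committed source offset to a safe resume offset on the target.

    `checkpoints` is a list of (source_offset, target_offset) pairs sorted
    by source_offset. The latest checkpoint at or before the committed
    position is used, so the consumer may re-read records but never skips.
    Returns None when no checkpoint is old enough.
    """
    sources = [source for source, _ in checkpoints]
    index = bisect.bisect_right(sources, committed_source_offset) - 1
    if index < 0:
        return None
    return checkpoints[index][1]
```

In this model a consumer that had committed source offset 250, with checkpoints at 200 and 300, resumes from the target offset paired with 200 and re-reads the records in between.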

Replay after failover is a design reality to plan for. Kafka consumers in a DR architecture should be idempotent, and downstream writes should use deduplication keys or transactional safeguards where the business requires them. The real test is whether consumers resume from an acceptable position and downstream systems stay correct when replay happens.
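
A deduplication key check can be as small as the sketch below. It is an in-memory illustration with a bounded window; the class name `Deduplicator` is hypothetical, and a production consumer would more likely keep seen keys in the downstream store itself so the check survives restarts.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember the most recent keys and report whether a key is new."""

    def __init__(self, capacity: int = 10_000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def first_time(self, key: str) -> bool:
        if key in self._seen:
            self._seen.move_to_end(key)  # keep recently seen keys longest
            return False
        self._seen[key] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the least recently seen key
        return True


def process(records, dedup, apply):
    """Apply each (key, value) record once, skipping replayed keys."""
    for key, value in records:
        if dedup.first_time(key):
            apply(key, value)
```

The window size has to cover the largest replay the offset translation can cause, which is another number that should come from drills.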

Connector Recovery

Kafka Connect and MSK Connect add another layer. A source connector might have its own offsets against an upstream database. A sink connector might have pending writes to a downstream warehouse. A connector running in the secondary Region might need network access to systems that were never tested from that Region. Schema Registry, secret stores, DLQ topics, and external credentials all belong in the runbook.

This is why connector recovery should be tested separately from broker failover. A Kafka cluster can be available while a connector estate is still unable to read, write, or authenticate. If the business process depends on those connectors, that is still downtime.
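
A connector preflight from the secondary Region can start with plain reachability. The sketch below is a hypothetical helper, with the probe passed in so it can be swapped or faked; a successful TCP connection does not prove that credentials, TLS, or IAM permissions work, so it is a first gate rather than a full test.

```python
from __future__ import annotations

import socket
from typing import Callable, Iterable


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable(dependencies: Iterable[tuple[str, int]],
                probe: Callable[[str, int], bool] = tcp_probe) -> list[str]:
    """List the dependencies the probe could not reach, as 'host:port'."""
    return [f"{host}:{port}" for host, port in dependencies if not probe(host, port)]
```

Run it from the same subnets and security groups the connector workers would use, since reachability from an operator laptop proves little.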

Cost Drivers in MSK DR

Cross-region MSK disaster recovery creates a second cost model, not only a second cluster. AWS MSK pricing states that MSK Replicator includes an hourly charge and a per-GB data processed charge, and that cross-region replication also incurs standard cross-region data transfer charges. The AWS pricing example for replicating 50 MB/s for 30 days from Oregon to N. Virginia totals $13,647 under the listed assumptions. That example is not a universal quote, but it is a useful reminder: replication traffic can dominate the DR bill.
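
The volume behind such an example is simple arithmetic, and it is worth redoing for your own throughput. The sketch below assumes 1 GB = 1000 MB and takes every rate as an input; the function names are hypothetical and no AWS prices are embedded, so current rates must be read from the pricing page.

```python
def replicated_gb(throughput_mb_per_s: float, days: float) -> float:
    """Data volume replicated over a period, in decimal GB."""
    seconds = days * 24 * 60 * 60
    return throughput_mb_per_s * seconds / 1000


def replication_cost(gb: float, per_gb_processing: float, per_gb_transfer: float,
                     replicator_hourly: float, hours: float) -> float:
    """Replication-only cost: data processed, data transfer, and hourly charge.

    Excludes the standby cluster, storage, and private connectivity.
    """
    return gb * (per_gb_processing + per_gb_transfer) + replicator_hourly * hours


if __name__ == "__main__":
    volume = replicated_gb(50, 30)
    print(volume)  # 129600.0 GB for 50 MB/s sustained over 30 days
```

At 50 MB/s the per-GB terms are multiplied by roughly 130,000 GB a month, which is why small differences in per-GB rates matter more than the hourly charge.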

Figure: DR cost stack for cross-region MSK

The cost stack usually has four layers:

| Cost layer | What to validate | Why it changes the DR decision |
| --- | --- | --- |
| Standby compute | Broker type, broker count, storage mode, target headroom | A cold or undersized target may lower idle cost but increase RTO and failover risk. |
| Duplicate storage | Retention, replicated topics, compaction, target replication factor | DR often copies data that may never be read unless failover happens. |
| Replication and transfer | Replicator data processed, cross-region transfer, private connectivity | High-throughput topics make continuous DR materially different from backup. |
| Operations | Drills, monitoring, security drift, schema and connector parity | The hidden cost is keeping the standby path actually usable. |

FinOps should be involved before the architecture is blessed. Teams may replicate only critical topics, shorten secondary retention, keep some consumers inactive until failover, or split workloads by criticality. Those are business trade-offs, not purely Kafka settings.

Where AutoMQ Fits in AWS Kafka Resilience Planning

At this point, it is tempting to ask whether replacing MSK changes the DR equation. Sometimes it does, but not in the simplistic sense of "shared storage removes disaster recovery." Cross-region DR still requires a second failure domain, a replication or migration path, client failover, and validation. Storage architecture can reduce single-region operational pain while leaving region-level recovery as a separate design problem.

AutoMQ is relevant when the DR discussion becomes a broader AWS Kafka architecture review. AutoMQ is a Kafka-compatible, object-storage-backed streaming platform that separates broker compute from durable storage. Its documentation describes a cloud-native design built on S3-compatible object storage while preserving Kafka protocol and ecosystem compatibility. In BYOC deployments, AutoMQ Cloud documentation states that the service runs in the user's cloud account so data stays within the user's VPC boundary. Those properties do not eliminate RPO/RTO planning, but they can change the target architecture used after migration.

For example, a team replacing an MSK estate may care about three DR-adjacent outcomes:

  • Recovery inside a Region becomes less tied to broker-local disks. Stateless brokers can reduce data movement during broker replacement and scaling. That helps day-2 resilience, even though cross-region failover still needs explicit design.
  • Cost analysis can move from over-provisioning to workload-based capacity. If storage sits in object storage and compute scales separately, the economics of standby sizing, retention, and rebalancing can change.
  • Data control remains central. BYOC matters for regulated teams because the target platform can run in the customer's AWS account while still offering a managed operating model.

AutoMQ migration documentation recommends MirrorMaker 2 for open-source migration scenarios and describes Kafka Linking in commercial editions for byte-to-byte synchronization and migration without application changes. For an MSK DR project, that means AutoMQ should enter as a target-platform option during an architecture review, not as a drop-in "turn on DR" switch. If you are already funding a second region, rewriting failover runbooks, and testing client cutover, it may be the right moment to ask whether the primary Kafka architecture is still the one you want to carry forward. A practical next step is to compare your MSK DR cost stack with an AutoMQ target architecture using the same retention, throughput, and recovery conditions.

A Practical DR Drill Checklist

A credible MSK disaster recovery drill should produce evidence, not only a meeting note. Measure replication lag before the test, during peak write load, and during failover. Record the last offsets producers wrote to critical topics in the primary Region and compare them with what is actually available on the target cluster. Move a representative consumer group, verify translated offsets, and document any replay. Restart at least one connector from the secondary Region and prove it can reach its dependencies.

The drill should also test the boring pieces that matter during an outage: IAM roles, TLS certificates, KMS access, Route 53 or service-discovery behavior, security group paths, alert routing, dashboards, and who can declare failover. If the runbook requires console access to the impacted Region, treat that as a risk. Recovery paths should rely on the data plane rather than the control plane, a principle AWS also emphasizes in its reliability guidance.

When the drill is done, update the RPO/RTO statement with conditions. A useful statement sounds like this: "For critical topics, with target brokers pre-scaled to N capacity, the last two game days showed replication lag within the agreed RPO during steady state; failover RTO was driven by application restart and DNS refresh." The exact numbers must come from your environment. The structure of the claim is what makes it trustworthy.

FAQ

Is Amazon MSK multi-region by default?

No. Amazon MSK clusters are regional resources. Cross-region disaster recovery requires another cluster, replication, networking, security, client failover, and runbooks. MSK Replicator can manage important replication work, but it does not remove the need to design the whole recovery path.

Does MSK Replicator guarantee zero data loss?

Do not assume that. AWS describes MSK Replicator as asynchronous replication, and its replication latency varies by Region distance, cluster throughput capacity, partition count, and other factors. Your RPO should be based on measured replication lag under your workload and failover process.

Should I use MSK Replicator or MirrorMaker 2?

Use MSK Replicator first when your topology fits its supported MSK patterns and you want managed replication. Use MirrorMaker 2 when you need cross-vendor replication, custom topology, self-managed operational control, or compatibility with a broader Kafka migration plan. In both cases, test offset behavior, topic naming, and failback.

Can consumer offsets be replicated across Regions?

Yes, but the details matter. MSK Replicator supports consumer group offset synchronization when configured, and AWS documents both legacy unidirectional and enhanced bidirectional patterns. MirrorMaker 2 also has checkpoint and offset sync configuration. Because offset mapping is asynchronous, consumers should be prepared for some replay after failover.

Does AutoMQ remove the need for disaster recovery?

No. AutoMQ changes the storage and compute architecture of a Kafka-compatible platform, which can improve elasticity, data control, and some recovery mechanics inside a deployment. Region-level DR still needs explicit replication or migration design, standby capacity, client failover, and drills.

What should FinOps watch first?

Start with replicated GB per month, cross-region transfer, standby broker capacity, retained data in the target Region, and private connectivity charges. Then add the operational cost of keeping security, schemas, connectors, and runbooks synchronized.
