Moving off Redpanda is rarely a question of copying bytes from one log to another. The hard part is the surface area around the log: Kafka protocol behavior, topic configuration, consumer offsets, schemas, ACLs, connector state, and application habits. Redpanda is intentionally Kafka API-compatible for common client workloads, which is useful during migration, but compatibility is not a substitute for validation.
The practical goal is to turn the migration from a calendar event into a sequence of gates. Each gate answers a narrower question: do clients behave the same way, does the target contain the right data, can consumers resume from an approved position, and can the team roll back before downstream systems are damaged? No responsible plan should promise a fixed downtime, RPO, or RTO before these answers are measured in your environment.
AutoMQ fits the Kafka-compatible shared-storage target category: it keeps Kafka protocol compatibility while moving durable data into object storage, with stateless brokers in front. That changes the post-migration operating model, especially for scaling and recovery, but it does not remove the need for a careful Redpanda migration plan.
What Makes Redpanda Migrations Risky
Redpanda migrations are deceptively familiar because most teams know the Kafka ecosystem. Producers publish to topics, consumers commit offsets, ACLs grant topic and group access, and Schema Registry holds subject versions. That familiarity can hide vendor-specific behavior, operational assumptions, and tooling differences. A client that works against Redpanda does not prove that all admin operations, security flows, and data-plane edge cases behave identically against the target.
The risk usually appears in five places:
- Protocol and client behavior. Validate the client versions, producer settings, transactions, idempotence, compression codecs, and admin APIs your applications use.
- Metadata and configuration drift. Topic names are visible, but topic-level configs are where migration mistakes hide: retention, compaction, partition count, replication assumptions, min in-sync replicas, message size, and cleanup policies.
- Consumer position. A replicated topic is not enough if consumers cannot resume at the right offsets or if offset translation changes the operational semantics for stream processors.
- Security and identity. Redpanda ACLs model principals, resources, operations, and hosts. A target Kafka-compatible platform may express equivalent permissions differently, especially when cloud IAM, SASL, mTLS, or BYOC boundaries are involved.
- Adjacent services. Schema Registry, Kafka Connect, sink connectors, source connectors, stream processors, dead-letter topics, and alerting rules often have state outside the topic log.
These risks do not mean migration is unsafe. They mean the plan needs to be conditional. Choose among replication, dual-write, and application-level cutover based on the workload's tolerance for duplicate events, reordering, replay, and temporary lag.
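The configuration drift risk in particular can be checked mechanically. The following is a minimal sketch, assuming topic configs have already been exported from both clusters into plain dictionaries; the snapshot shape, the `diff_topic_configs` helper, and the sample values are hypothetical, and the compared keys are standard Kafka topic-level config names that may still need target-specific review.

```python
# Hypothetical config snapshots; in practice these would be exported from
# each cluster's admin tooling rather than typed by hand.
COMPARED_KEYS = (
    "cleanup.policy",
    "retention.ms",
    "retention.bytes",
    "max.message.bytes",
    "min.insync.replicas",
)


def diff_topic_configs(source, target, keys=COMPARED_KEYS):
    """Return {topic: {key: (source_value, target_value)}} for every mismatch.

    A topic that is missing on the target is reported under "__topic__".
    """
    drift = {}
    for topic, src_cfg in source.items():
        tgt_cfg = target.get(topic)
        if tgt_cfg is None:
            drift[topic] = {"__topic__": ("present", "missing")}
            continue
        mismatches = {
            key: (src_cfg.get(key), tgt_cfg.get(key))
            for key in ("partitions", *keys)
            if src_cfg.get(key) != tgt_cfg.get(key)
        }
        if mismatches:
            drift[topic] = mismatches
    return drift


source = {
    "orders": {"partitions": 12, "cleanup.policy": "delete", "retention.ms": "604800000"},
    "customers": {"partitions": 6, "cleanup.policy": "compact"},
}
target = {
    "orders": {"partitions": 12, "cleanup.policy": "delete", "retention.ms": "86400000"},
}

drift = diff_topic_configs(source, target)
for topic, mismatches in sorted(drift.items()):
    print(topic, mismatches)
```

A non-empty result is not automatically a failure. Some differences may be intentional on the target, but each one should be explained before replication starts.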
Inventory Your Redpanda Environment
Start with an inventory boring enough to be trusted. Generate it from Redpanda and client configuration, not memory. Redpanda's documentation covers Kafka client compatibility, ACL management, Schema Registry APIs, Kafka Connect deployment, and rpk administration; those are the surfaces to inspect before writing a migration runbook.
At minimum, capture:
| Area | What to Collect | Why It Matters |
|---|---|---|
| Topics | Names, partitions, retention, compaction, message size, cleanup policy | Topic shape must match the target before replication and cutover. |
| Consumer groups | Group IDs, active members, committed offsets, lag, reset policy | Consumer cutover depends on known resume points. |
| Producers | Client libraries, versions, idempotence, transactions, keys, partitioners | Producer semantics define duplicate and ordering risk. |
| Security | Users, principals, ACLs, SASL/mTLS settings, network allowlists | Access must be recreated before applications move. |
| Schemas | Subjects, versions, compatibility levels, references | Serialization compatibility must survive the move. |
| Connectors | Connector configs, task counts, offsets, DLQs, secrets | Connectors often need separate migration or restart logic. |
| Observability | SLOs, lag alerts, error-rate alerts, dashboards | Validation needs operational thresholds, not gut feel. |
Treat the inventory as a contract between teams. Platform engineering owns the cluster plan, but application owners know whether a topic is replayable, whether duplicates are acceptable, and whether a consumer can restart from an earlier offset. Put ownership next to every topic and group.
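One way to make that contract enforceable is to store the inventory as data and refuse to sign it off while ownership or replay questions are open. This is only a sketch: the `InventoryEntry` fields and the `open_questions` helper are illustrative assumptions, not part of any Redpanda or AutoMQ tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InventoryEntry:
    kind: str                              # "topic" or "consumer_group"
    name: str
    owner: str = ""                        # team accountable for cutover decisions
    replayable: Optional[bool] = None      # None means "not answered yet"
    duplicates_ok: Optional[bool] = None   # None means "not answered yet"


def open_questions(entries):
    """List (name, question) pairs that block the inventory from sign-off."""
    gaps = []
    for entry in entries:
        if not entry.owner.strip():
            gaps.append((entry.name, "owner"))
        if entry.replayable is None:
            gaps.append((entry.name, "replayable"))
        if entry.duplicates_ok is None:
            gaps.append((entry.name, "duplicates_ok"))
    return gaps


entries = [
    InventoryEntry("topic", "orders", owner="checkout-team",
                   replayable=True, duplicates_ok=False),
    InventoryEntry("topic", "audit-log", replayable=True),
    InventoryEntry("consumer_group", "payments-settler", owner="payments-team"),
]

for name, question in open_questions(entries):
    print(f"{name}: missing answer for '{question}'")
```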
Validate Compatibility Before Moving Data
Compatibility validation should happen before replication at scale. Redpanda's official docs state compatibility with Kafka clients version 0.11 or later and document the validated clients and known exceptions. That helps migration planning, but your target still needs workload-specific tests using the same client libraries, serializers, security settings, and retry behavior as production.
A useful compatibility test is not a synthetic "produce one message, consume one message" check. It should cover the failure modes that would make rollback painful:
- Produce with real key distribution and compression settings, then verify partition assignment and ordering expectations.
- Run consumer groups with the same rebalance protocol, session timeout, and commit cadence.
- Exercise admin operations used by deployment pipelines: topic creation, config updates, ACL changes, and group inspection.
- Register and retrieve representative schemas, including compatibility checks for subjects that evolve frequently.
- Test connector startup against the target with non-production credentials and verify sink-side idempotency.
For AutoMQ as the target, evaluate compatibility and architecture separately. Kafka compatibility answers whether clients and ecosystem tools can connect. Shared storage answers whether durable data should remain tied to broker-local disks after migration. AutoMQ replaces Kafka's broker-local log storage with WAL storage and object storage, making brokers stateless from the perspective of durable data placement.
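The ordering check in the list above can be made concrete with a test producer that stamps each record with a per-key sequence number, and a verifier that inspects what a consumer on the target actually received. The sketch below covers only the verifier and assumes records arrive as `(key, seq)` pairs with sequence numbers that increase by one per key; that framing and the `check_per_key_order` name are assumptions for illustration, not a feature of any client library.

```python
def check_per_key_order(records):
    """Check (key, seq) pairs in the order a consumer received them.

    Returns a report with three lists:
      duplicates   - a record repeating the latest sequence number for its key
      out_of_order - a record older than the latest sequence number for its key
      gaps         - (key, first_missing, last_missing) ranges that were skipped
    """
    last_seen = {}
    report = {"out_of_order": [], "duplicates": [], "gaps": []}
    for key, seq in records:
        prev = last_seen.get(key)
        if prev is None:
            last_seen[key] = seq
        elif seq == prev:
            report["duplicates"].append((key, seq))
        elif seq < prev:
            report["out_of_order"].append((key, seq))
        else:
            if seq > prev + 1:
                report["gaps"].append((key, prev + 1, seq - 1))
            last_seen[key] = seq
    return report


received = [("a", 0), ("b", 0), ("a", 1), ("a", 1), ("b", 2), ("a", 3), ("a", 2)]
report = check_per_key_order(received)
print(report)
```

A gap followed later by the missing sequence number shows up as both a gap and an out-of-order record, which is usually what you want to see when comparing source and target behavior under retries.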
Choose a Migration Pattern
There is no universal Redpanda migration pattern. The right pattern depends on write criticality, read semantics, stateful stream processing, and how much application change you can tolerate.
| Pattern | Use When | Watch For |
|---|---|---|
| Topic replication | Consumers can move after the target has caught up, and the replication tool preserves enough metadata for your use case. | Offset translation, lag, topic config parity, and ACL parity. |
| Dual-write | Producers can write to Redpanda and the target during a validation window. | Duplicate records, ordering differences, error handling, and rollback after partial writes. |
| Application-level cutover | Applications can be updated in controlled waves and have strong idempotency. | Long migration windows, mixed consumer states, and operational coordination. |
| Batch replay | Historical topics can be replayed into the target outside the hot path. | Time range boundaries, retention gaps, and downstream duplicate handling. |
Apache Kafka's MirrorMaker 2 is the default reference point for cross-cluster replication in the Kafka ecosystem. It can be useful, but its behavior around topic naming, offset handling, checkpoints, and failover needs to be understood before using it as a cutover mechanism. AutoMQ's documentation also describes Kafka Linking for migration to AutoMQ, including source cluster configuration, mirror topics, consumer group synchronization, and producer proxy concepts. Those capabilities are relevant when AutoMQ is the target, but still require validation against Redpanda source behavior, authentication mode, topic shape, and application semantics.
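Topic naming is a good example of why that understanding matters. As far as I know, MirrorMaker 2's default replication policy prefixes mirrored topics with the source cluster alias and a separator, while the identity policy keeps names unchanged. The sketch below models that in plain Python so you can list which consumer subscriptions would need to change. The `redpanda` alias and both helper functions are hypothetical, and the behavior should be confirmed against the Kafka version you actually run.

```python
def mirrored_topic_name(topic, source_alias, policy="default", separator="."):
    """Model how a mirrored topic is expected to be named on the target."""
    if policy == "default":
        # Assumed behavior of the default policy: "<alias><separator><topic>".
        return f"{source_alias}{separator}{topic}"
    if policy == "identity":
        # Assumed behavior of the identity policy: name is preserved.
        return topic
    raise ValueError(f"unknown policy: {policy!r}")


def subscription_changes(topics, source_alias, policy="default"):
    """Return {old_name: new_name} for topics whose name changes on the target."""
    changes = {}
    for topic in topics:
        new_name = mirrored_topic_name(topic, source_alias, policy)
        if new_name != topic:
            changes[topic] = new_name
    return changes


topics = ["orders", "customers"]
print(subscription_changes(topics, "redpanda", policy="default"))
print(subscription_changes(topics, "redpanda", policy="identity"))
```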
The pattern decision should be made with explicit acceptance criteria:
- Data acceptance: target topic contains the expected record count or checksum sample for the migration window.
- Offset acceptance: consumer groups can resume from a position that the application owner has approved.
- Behavior acceptance: shadow consumers produce the same business-level outputs or acceptable differences.
- Security acceptance: application identities have the least permissions required on the target.
- Rollback acceptance: producers and consumers can return to Redpanda without corrupting downstream state.
That last gate is the one teams skip when the replication dashboard looks healthy. A rollback plan that has never been rehearsed is a hope, not a plan.
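The five acceptance criteria can be recorded as explicit gate results so that a missing rollback rehearsal blocks the cutover instead of being overlooked. This is a minimal sketch; the gate names mirror the list above, while `GateResult`, `cutover_decision`, and the evidence strings are illustrative assumptions.

```python
from dataclasses import dataclass

GATES = ("data", "offset", "behavior", "security", "rollback")


@dataclass
class GateResult:
    passed: bool
    evidence: str  # note or link pointing at the measurement behind the result


def cutover_decision(results):
    """Return (go, blockers). Unevaluated or unevidenced gates count as blockers."""
    blockers = []
    for gate in GATES:
        result = results.get(gate)
        if result is None:
            blockers.append(f"{gate}: not evaluated")
        elif not result.passed:
            blockers.append(f"{gate}: failed ({result.evidence})")
        elif not result.evidence.strip():
            blockers.append(f"{gate}: passed without evidence")
    return (not blockers, blockers)


results = {
    "data": GateResult(True, "sampled checksum comparison, migration window"),
    "offset": GateResult(True, "resume positions approved by topic owners"),
    "behavior": GateResult(True, "shadow consumer output diff"),
    "security": GateResult(True, "ACL review"),
    # "rollback" is intentionally missing: it has not been rehearsed yet.
}

go, blockers = cutover_decision(results)
print("GO" if go else "NO-GO", blockers)
```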
Replicate, Shadow, and Measure
Once the target topics and permissions exist, start with a limited replication scope. Pick topics that represent different shapes: high throughput, compacted, long retention, schema-heavy, connector-fed, and stream-processor input. Avoid beginning with your most critical topic when a lower-risk workload has similar characteristics.
During replication, measure both infrastructure and application signals:
- Replication lag per topic and partition.
- Produce and consume error rates on the target.
- Consumer lag for shadow consumers.
- Record count or sampled checksum comparison between source and target.
- Schema lookup and serializer/deserializer failures.
- Authorization failures by principal and resource.
- Connector task failures, retries, and dead-letter volume.
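For the record count and sampled checksum comparison, one possible approach is to sample by key hash so that both clusters select the same records regardless of offsets. The sketch below assumes records have already been read from each side as `(key, value)` byte pairs for the same time window. All function names are hypothetical, and null keys or values are treated as empty bytes, which means a tombstone is not distinguished from an empty value.

```python
import hashlib
from collections import Counter


def record_digest(key, value):
    """Digest one record; the key-length prefix keeps key/value boundaries distinct."""
    h = hashlib.sha256()
    h.update(len(key).to_bytes(4, "big"))
    h.update(key)
    h.update(value)
    return h.hexdigest()


def in_sample(key, rate_percent):
    """Deterministic sampling: the same key is selected on both clusters."""
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % 100
    return bucket < rate_percent


def summarize(records, rate_percent):
    """Return (total record count, Counter of digests for sampled records)."""
    count = 0
    digests = Counter()
    for key, value in records:
        key = key or b""
        value = value or b""
        count += 1
        if in_sample(key, rate_percent):
            digests[record_digest(key, value)] += 1
    return count, digests


def compare(source_records, target_records, rate_percent=100):
    s_count, s_digests = summarize(source_records, rate_percent)
    t_count, t_digests = summarize(target_records, rate_percent)
    return {
        "count_delta": t_count - s_count,
        "missing_on_target": sum((s_digests - t_digests).values()),
        "unexpected_on_target": sum((t_digests - s_digests).values()),
    }


source_records = [(b"k1", b"v1"), (b"k2", b"v2"), (b"k3", b"v3")]
target_records = [(b"k1", b"v1"), (b"k2", b"v2"), (b"k2", b"v2")]
result = compare(source_records, target_records)
print(result)
```

The example shows why a count alone is not enough: the totals match, yet one record is missing on the target and another was written twice.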
Shadow consumers test read behavior without taking ownership of production side effects. They can deserialize records, run validation logic, write to a quarantine topic, or compare aggregate counts without committing business actions. For Flink, Kafka Streams, or custom processors, include how state will be restored, whether offsets are preserved or translated, and what replay range is acceptable.
Cut Over Consumers Before Producers, When Possible
Many teams move producers first because it feels simpler: change bootstrap servers, deploy, and watch writes appear on the target. That can work for idempotent, append-only streams, but it can also create a split-brain write path if consumers still read from Redpanda. For many workloads, consumer cutover is the safer first production gate because it proves the target can serve reads while Redpanda remains the write authority.
A consumer-first cutover typically looks like this:
- Replicate Redpanda topics into the target.
- Validate topic config parity and target ACLs.
- Start shadow consumers against the target.
- Freeze the selected consumer group, record source offsets, and map or synchronize the approved resume position.
- Start the real consumer group against the target.
- Watch business metrics, lag, and error budgets before moving the next group.
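The step that maps a frozen group's source offset to an approved resume position is where most of the risk sits. The sketch below shows one conservative way to do it, assuming you have per-partition checkpoints that pair a source offset with the target offset of the same record. The checkpoint format and the `approved_resume_offset` helper are hypothetical and do not describe how any specific replication tool translates offsets.

```python
from bisect import bisect_right


def approved_resume_offset(committed_source_offset, checkpoints):
    """Map a committed source offset to a target offset for ONE partition.

    checkpoints: list of (source_offset, target_offset) pairs sorted by
    source_offset. The latest checkpoint at or before the committed offset is
    used, so the consumer may reprocess some records but never skips any.
    Returns None when no usable checkpoint exists.
    """
    source_offsets = [source for source, _ in checkpoints]
    idx = bisect_right(source_offsets, committed_source_offset) - 1
    if idx < 0:
        return None
    return checkpoints[idx][1]


checkpoints = [(0, 0), (1000, 940), (2000, 1935)]
resume_at = approved_resume_offset(1500, checkpoints)
print(resume_at)
```

Resuming at or before the committed position trades duplicates for completeness, so it only suits consumers whose owners have accepted replay. A `None` result should stop the cutover for that partition rather than fall back to a reset policy.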
Producer cutover comes after the read path is trusted. It can use DNS, config management, service discovery, a proxy layer, or application redeploys. The mechanism matters less than the gate: no group of producers should move until the team knows what happens if the deployment is pointed back at Redpanda.
Connectors deserve their own cutover plan because source connector restarts can create duplicates and sink idempotency varies. Move connectors in batches, and only after testing offset storage, task rebalance behavior, secrets, DLQ topics, and sink-side rollback.
Define Rollback Gates Before the Cutover Window
Rollback is not a button. It is a set of preconditions that remain valid long enough to reverse the last step. If you keep Redpanda writable while the target is warming up, rollback is simpler. If you switch producers to the target and continue writing there for a long period, rollback becomes a data reconciliation project.
Define rollback gates in incident-process language:
- Before producer cutover: Redpanda remains the source of truth; rollback means stopping replication or redirecting reads back.
- During producer cutover: write routing is controlled by a reversible mechanism, and the team has a clear rule for pausing writes if both sides diverge.
- After producer cutover: rollback requires a decision on records written only to the target, including replay, discard, or reconciliation.
- After consumer cutover: rollback requires consumers to resume from approved source offsets or a known replay point.
Fixed RPO and RTO claims are dangerous before testing. A stateless broker architecture can improve recovery and scaling characteristics after migration, but migration RPO and RTO depend on replication lag, application idempotency, offset handling, runbook quality, and approval paths.
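The rollback gates above can be written down as preconditions per phase and checked before each step, which keeps "can we still roll back?" from becoming a judgment call during an incident. This is a sketch only: the phase keys follow the list above, while the precondition labels and the `rollback_ready` helper are illustrative names that a real runbook would replace with its own checks.

```python
# Preconditions that must hold for rollback to remain possible in each phase.
ROLLBACK_PRECONDITIONS = {
    "before_producer_cutover": ["source_is_write_authority"],
    "during_producer_cutover": [
        "write_routing_reversible",
        "divergence_pause_rule_defined",
    ],
    "after_producer_cutover": ["target_only_records_decision"],
    "after_consumer_cutover": ["approved_source_resume_offsets"],
}


def rollback_ready(phase, satisfied):
    """Return (ready, missing) for the given phase.

    satisfied: set of precondition labels the team has verified right now.
    """
    required = ROLLBACK_PRECONDITIONS[phase]
    missing = [name for name in required if name not in satisfied]
    return (not missing, missing)


ready, missing = rollback_ready(
    "during_producer_cutover",
    satisfied={"write_routing_reversible"},
)
print(ready, missing)
```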
Where AutoMQ Fits as the Target Platform
AutoMQ is a natural target when the migration is also about changing the long-term Kafka operating model. It remains Kafka-compatible at the client and ecosystem layer, while its shared-storage architecture moves durable data away from broker-local disks. Brokers become stateless compute nodes in front of WAL storage and object storage, which changes the mechanics of scaling, partition movement, and recovery.
That architecture is relevant if your reason for leaving Redpanda includes elasticity, cloud cost structure, multi-AZ storage design, or a desire to run the data plane in your own cloud account. AutoMQ BYOC can keep the data plane inside the customer's cloud boundary. In migration terms, the target still needs topics, ACLs, schemas, connectors, and validation gates. In steady state, it reduces the broker-local data movement that makes traditional Kafka operations heavy.
Use AutoMQ as the target after answering the same questions you would ask of any Kafka-compatible platform:
- Which Kafka client features are required by production applications?
- Which Redpanda topic configs map directly, and which need target-specific review?
- Which consumer groups require exact offset preservation, and which can replay?
- Which security identities need to be recreated or redesigned?
- Which workloads need BYOC data control, and which can use a managed model?
If those answers point toward shared storage, AutoMQ becomes more than a place to land the data. It becomes the architectural reason to do the migration at all.
References
- Redpanda Documentation, Kafka Compatibility
- Redpanda Documentation, Access Control Lists
- Redpanda Documentation, Use the Schema Registry API
- Apache Kafka Documentation, Geo-replication and cross-cluster data mirroring
- Apache Kafka Documentation, Authorization and ACLs
- AutoMQ Documentation, Migrate to AutoMQ overview
- AutoMQ Documentation, Architecture overview
- AutoMQ GitHub, AutoMQ/automq
FAQ
Can I migrate from Redpanda to Kafka without changing producers and consumers?
Often, yes, if the applications use Kafka-compatible clients and features that the target supports. You still need compatibility tests for the specific client versions, security settings, serializers, admin operations, and producer semantics used in production.
Should I use MirrorMaker 2 for a Redpanda migration?
MirrorMaker 2 is a standard Kafka ecosystem option for cross-cluster replication, but it is not automatically the right cutover tool for every workload. Validate topic naming, offset handling, checkpoint behavior, lag, and failover semantics before treating it as the migration path.
Is dual-write safer than replication?
Dual-write is safer only for applications that can handle duplicate records, partial write failures, and ordering differences. It gives you target-side validation under live traffic, but it moves complexity into application logic and rollback.
Can a migration guarantee zero downtime?
Not in the abstract. Downtime depends on the cutover mechanism, application behavior, replication lag, authentication changes, consumer offset handling, and rollback rehearsal. Some tools can reduce or remove planned producer interruption for certain patterns, but the claim must be proven per workload.
Why consider shared storage as the target?
Shared storage separates durable data placement from broker compute. For a Kafka-compatible platform such as AutoMQ, that means applications can keep using Kafka clients while the target architecture reduces future broker-local data movement during scaling, recovery, and rebalancing.
What is the first production workload to move?
Choose a workload that is representative but not the most business-critical. A good first batch has real throughput, real schemas, real consumers, and clear rollback ownership, while still allowing the team to pause or replay without a major customer incident.
The next step is not to pick a cutover weekend. Build the inventory, validate the compatibility surface, and rehearse rollback with a non-critical workload. If the target is AutoMQ, start with the migration overview and the shared-storage architecture docs, then bring measured results into an architecture review.