AWS MSK Maintenance and Rebalancing: How Avia Games Migrated to AutoMQ with Zero Downtime

For a real-time gaming platform, Kafka maintenance is not background noise. A broker restart can become a leader election, a leader election can become a latency spike, and a latency spike can become something players feel during a live tournament. Infrastructure events and user experience are closer together than most architecture diagrams admit.

Avia Games ran its core game event streaming infrastructure on AWS MSK, a managed Kafka service that removed part of the operational burden of running brokers by hand. The public case describes competitive mobile gaming with millions of players participating in real-time tournaments. In that setting, predictable backend behavior matters because tournament traffic is not evenly distributed.

The story is not that MSK was a bad starting point. Managed Kafka often is the reasonable starting point when a team wants Kafka semantics without owning every broker lifecycle task. The story is what happens when the workload outgrows the control surface that a managed service gives you.

[Figure: MSK to AutoMQ cutover path]

Why Kafka Maintenance Is Different for Real-Time Gaming

Many Kafka workloads can absorb maintenance noise with a little lag and a little patience. A batch analytics pipeline may tolerate a temporary backlog if it catches up before the next reporting window. Real-time gameplay does not have the same kind of buffer.

In a gaming event stream, Kafka sits close to a chain of user-visible actions: game events, scoring, tournament state, fraud checks, leaderboard updates, and downstream telemetry. The exact topology for Avia Games needs customer approval before publication, but the public case is clear about the sensitivity of the workload: maintenance operations could cause latency spikes and stability risks during live tournament periods.

The tricky part is that a broker restart is rarely only a restart. Kafka clients need to handle leadership changes, brokers need to recover, and replicas need to catch up. If the cluster is already carrying tournament traffic, background operations compete with foreground traffic for network and I/O.

AWS describes MSK maintenance as including operating system updates, security updates, and other software updates required to maintain cluster health, security, and performance. AWS also notes that during broker reboots, clients must be configured to handle leadership failover and may see transient disconnect errors. Those statements are sensible from a managed-service perspective, but the application team still has the real work: make sure the game does not care.
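
As a rough illustration of what "configured to handle leadership failover" can mean on the producer side, the sketch below builds a settings dictionary from standard Apache Kafka producer property names and checks one documented constraint between them. The helper names and every value are assumptions chosen for illustration; they are not Avia Games' settings or an AWS recommendation.

```python
# Hypothetical helpers. The property names are standard Apache Kafka producer
# configs; the values are illustrative and must be tuned per workload.

def failover_tolerant_producer_config(bootstrap_servers: str) -> dict:
    """Return producer settings that retry through a leader election."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "acks": "all",                  # wait for all in-sync replicas
        "enable.idempotence": True,     # retries do not create duplicates
        "retries": 2147483647,          # retry until the delivery timeout
        "retry.backoff.ms": 100,
        "request.timeout.ms": 30000,
        "linger.ms": 5,
        "delivery.timeout.ms": 120000,  # total time budget for one send
    }


def check_timeouts(config: dict) -> None:
    """Kafka expects delivery.timeout.ms >= linger.ms + request.timeout.ms."""
    budget = config["linger.ms"] + config["request.timeout.ms"]
    if config["delivery.timeout.ms"] < budget:
        raise ValueError(
            f"delivery.timeout.ms must be at least {budget} ms for this config"
        )


config = failover_tolerant_producer_config("broker-1:9092")
check_timeouts(config)  # passes for the illustrative values above
```

The delivery timeout is the setting that decides whether a broker reboot is absorbed by retries or surfaces as a failed send, so it is worth sizing against the longest failover the team has actually observed.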

What Made AWS MSK Hard to Control

Avia Games' public case points to three connected problems: maintenance timing, broker restarts, and partition reassignment. None is unusual for Kafka; the issue is how they combine when the team cannot fully control timing.

| Operational event | Why it matters in Kafka | Why it mattered for Avia Games |
| --- | --- | --- |
| Broker restart | Leadership changes and client reconnects can create short-lived instability. | Live tournament traffic has little tolerance for player-visible latency spikes. |
| Partition reassignment | Moving data can consume network, disk, and broker resources for an extended period. | Background data movement can interfere with the event stream at exactly the wrong time. |
| Maintenance scheduling | Managed services reduce work, but the provider still controls parts of the lifecycle. | The team wanted maintenance to happen during actual low-traffic windows, not provider-driven moments. |

This is the point where many Kafka teams reach for tuning before they reach for migration. They raise broker headroom, tune client retries, add alerts, and rehearse maintenance playbooks. Those steps are useful, but they are bounded by the core Kafka storage model: partition data is tied to broker-local storage, so operational changes often become data movement problems.

The public Avia Games case uses a crisp phrase for the desired change: regain control. That is an operator's goal, not a marketing goal. The team did not need a more dramatic maintenance page; it needed a Kafka-compatible architecture where broker lifecycle events stopped carrying so much data gravity.

"AutoMQ allowed us to regain control." — Engineering Team, Avia Games

That quote matters because it names the practical requirement. The migration was about making Kafka operations less coupled to the worst possible moment in the traffic curve.

Migration Requirements Before Touching Production Traffic

A zero-downtime Kafka migration sounds tidy after it is finished. Before it starts, it is mostly a list of things that can go wrong: topics, partitions, active producers, consumer offsets, ACLs, client configuration, monitoring, and rollback behavior.

For Avia Games, the public source confirms the migration from AWS MSK to AutoMQ and says the transition caused zero downtime. Runbook-level details still need customer approval: topic count, partition count, consumer group count, migration batch size, cutover order, validation queries, rollback gates, and the exact definition of "zero downtime" used internally.

Even without those private details, the migration shape is clear enough to explain the engineering logic:

  • Compatibility first. The target system had to preserve Kafka client behavior so gameplay services were not rewritten during an infrastructure migration.
  • Offset continuity second. Consumers needed a path to resume from the right place after cutover.
  • Rollback third. A gaming platform cannot treat cutover as a one-way door.
  • Observability throughout. Lag, error rate, consumer health, broker health, and application latency need to be watched during the migration.
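
The "observability throughout" requirement can be made concrete as a gate that is evaluated before each migration batch proceeds. The following is a minimal sketch under assumed metric names and thresholds; none of these numbers come from the Avia Games case, and a real gate would read its inputs from the team's monitoring system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HealthSnapshot:
    """Point-in-time metrics; field names are hypothetical."""
    max_consumer_lag: int        # messages behind, worst consumer group
    producer_error_rate: float   # failed sends / total sends
    p99_latency_ms: float        # application-level latency


@dataclass(frozen=True)
class GateThresholds:
    """Illustrative limits only; pick real ones from baseline data."""
    max_consumer_lag: int = 10_000
    max_error_rate: float = 0.001
    max_p99_latency_ms: float = 50.0


def gate_decision(
    snapshot: HealthSnapshot, limits: GateThresholds = GateThresholds()
) -> Tuple[str, List[str]]:
    """Return ("proceed" or "hold", names of breached metrics)."""
    breaches = []
    if snapshot.max_consumer_lag > limits.max_consumer_lag:
        breaches.append("consumer_lag")
    if snapshot.producer_error_rate > limits.max_error_rate:
        breaches.append("producer_error_rate")
    if snapshot.p99_latency_ms > limits.max_p99_latency_ms:
        breaches.append("p99_latency_ms")
    return ("hold" if breaches else "proceed", breaches)


decision, reasons = gate_decision(HealthSnapshot(1_200, 0.0002, 18.5))
print(decision, reasons)
```

Returning the list of breached metrics, rather than a bare boolean, keeps the rollback conversation specific: the operator sees which signal stopped the batch.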

This is where the Fresha-style engineering lesson applies: the interesting part of a migration is not that a replacement system exists, but whether the migration plan respects the old system while gradually proving the new one. A careful Kafka migration feels more like a rolling upgrade than a database dump.

How AutoMQ Linking Supported a Zero-Downtime Migration

AutoMQ Linking is designed for that rolling-upgrade style. AutoMQ's documentation describes Kafka Linking as a migration tool for moving from Apache Kafka or Kafka distributions such as AWS MSK to AutoMQ without application changes, with byte-to-byte replication, synchronized consumer progress, and a producer proxy path. The same documentation notes an MSK-specific constraint: Kafka Linking supports AWS MSK migration, but not MSK IAM Role authentication mode.

The important part is not the feature name. The important part is what the feature changes in the migration risk model. Traditional Kafka migration tools often force a hard coordination point: stop writers, wait for replication to settle, move clients, restart, and hope the offset story is clean. That can be acceptable for internal pipelines with generous windows. It is a poor fit for real-time game infrastructure.

AutoMQ Linking breaks the work into smaller, observable steps:

  1. Link the source MSK cluster to the target AutoMQ cluster.
  2. Synchronize selected topics and consumer progress.
  3. Switch producers in controlled batches.
  4. Switch consumers after progress is aligned.
  5. Promote topics when AutoMQ is ready to become active.
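
One way to reason about the five steps above is as a per-topic state machine that only ever moves forward one step at a time. The sketch below is a bookkeeping model, not AutoMQ's API: the phase names and the `TopicMigration` class are hypothetical and exist only to show why small, ordered steps are easier to observe than a single cutover.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Hypothetical phase names mirroring the five steps in the text."""
    LINKED = 1
    TOPICS_SYNCED = 2
    PRODUCERS_SWITCHED = 3
    CONSUMERS_SWITCHED = 4
    PROMOTED = 5


class TopicMigration:
    """Tracks one topic and refuses to skip or repeat a step."""

    def __init__(self, topic: str) -> None:
        self.topic = topic
        self.phase = Phase.LINKED
        self.history = [Phase.LINKED]

    def advance(self, target: Phase) -> None:
        # Only the immediately following phase is a legal transition.
        if target != self.phase + 1:
            raise ValueError(
                f"{self.topic}: cannot go from {self.phase.name} to {target.name}"
            )
        self.phase = target
        self.history.append(target)


migration = TopicMigration("game-events")  # topic name is made up
migration.advance(Phase.TOPICS_SYNCED)
migration.advance(Phase.PRODUCERS_SWITCHED)
print(migration.phase.name)
```

Rejecting out-of-order transitions encodes the rule that consumers move only after producers, which is the ordering the list describes.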

The public AutoMQ migration documentation describes two paths inside Kafka Linking: a byte-to-byte replication path and a producer proxy path. The replication path keeps topic partition counts, messages, and offsets aligned; the proxy path supports producer rolling-switch behavior. For a team like Avia Games, that turns migration from one large operational bet into a series of smaller decisions.
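
Because the replication path is described as keeping offsets aligned, a simple pre-switch check is to compare log end offsets on both clusters, partition by partition. The sketch below assumes the offsets have already been fetched (for example with an admin client) into plain dictionaries; the function names and the sample numbers are illustrative assumptions.

```python
from typing import Dict, Optional, Tuple

TopicPartition = Tuple[str, int]


def replication_gaps(
    source_end: Dict[TopicPartition, int],
    target_end: Dict[TopicPartition, int],
) -> Dict[TopicPartition, Optional[int]]:
    """Messages the target still trails by; None if the partition is missing."""
    gaps: Dict[TopicPartition, Optional[int]] = {}
    for tp, source_offset in source_end.items():
        target_offset = target_end.get(tp)
        gaps[tp] = None if target_offset is None else source_offset - target_offset
    return gaps


def aligned(gaps: Dict[TopicPartition, Optional[int]], max_gap: int = 0) -> bool:
    """True when every partition exists on the target and trails by <= max_gap.

    A negative gap means the target is ahead of the source, which would
    break offset alignment, so it also counts as not aligned.
    """
    return all(g is not None and 0 <= g <= max_gap for g in gaps.values())


source = {("game-events", 0): 1500, ("game-events", 1): 900}
target = {("game-events", 0): 1500, ("game-events", 1): 880}
gaps = replication_gaps(source, target)
print(gaps, aligned(gaps), aligned(gaps, max_gap=50))
```

With live producers the gap is rarely exactly zero, so the tolerance passed as `max_gap` is a judgment call that belongs in the runbook.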

The customer case says the migration was seamless and that AutoMQ's Kafka Linking worked as promised, with zero downtime. That statement should not be stretched into invented detail. It tells us the outcome that mattered: live gaming services were not disrupted during the transition.

What Changed After Brokers Became Stateless

The migration solved the immediate cutover problem, but the deeper change was architectural. AutoMQ separates compute from storage and stores persistent data in shared object storage rather than binding it to broker-local disks. In that model, brokers are largely stateless from the perspective of durable log data. Restarting, replacing, or scaling brokers no longer implies moving large amounts of partition data from one broker disk to another.

That difference is why Avia Games' case keeps coming back to operational control. Traditional Kafka operations often inherit the cost of local state: partition reassignment, data replication, and catch-up work can dominate a maintenance or scaling event. AutoMQ changes the job: instead of moving data so that a broker can own it, the cluster updates ownership, metadata, leadership, and traffic placement while the data stays in shared storage.
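
A back-of-the-envelope calculation shows why broker-local state turns operations into data movement. The figures below are assumptions picked for round arithmetic, not measurements from Avia Games or from any specific cluster.

```python
def reassignment_seconds(bytes_to_move: float, throttle_bytes_per_sec: float) -> float:
    """Lower bound on reassignment time: data volume / replication throttle."""
    if throttle_bytes_per_sec <= 0:
        raise ValueError("throttle must be positive")
    return bytes_to_move / throttle_bytes_per_sec


TIB = 1024 ** 4
MIB = 1024 ** 2

# Assumed: 2 TiB of partition data leaves a broker, throttled to 100 MiB/s
# so that foreground traffic keeps some headroom.
seconds = reassignment_seconds(2 * TIB, 100 * MIB)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
```

Under those assumptions the copy alone takes close to six hours, and the duration grows with the data volume. When the log lives in shared storage, the equivalent operation does not have a term proportional to the data size, which is the architectural point the section makes.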

[Figure: Maintenance blast radius map]

The public case says AutoMQ turned heavy maintenance operations into metadata updates and made cluster upgrades invisible to players. It also says scaling operations complete in seconds to tens of seconds, which allowed Avia Games to schedule maintenance during actual low-traffic windows and scale capacity around tournament launches.
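
Scheduling maintenance in "actual low-traffic windows" presupposes that the team can identify them from its own traffic data. The following is a minimal sketch that finds the quietest contiguous window in an hourly series, wrapping around midnight. The function name and the sample traffic numbers are made up for illustration.

```python
from typing import List


def quietest_window(hourly_events: List[int], window_hours: int) -> int:
    """Return the start index of the contiguous window with the lowest total.

    The series is treated as circular, so a window may wrap past the end.
    Ties resolve to the earliest start index.
    """
    n = len(hourly_events)
    if not 1 <= window_hours <= n:
        raise ValueError("window_hours must be between 1 and len(hourly_events)")
    best_start, best_total = 0, None
    for start in range(n):
        total = sum(hourly_events[(start + i) % n] for i in range(window_hours))
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start


# Made-up hourly event counts (thousands) for hours 00:00 through 23:00.
traffic = [50, 40, 10, 5, 5, 20, 60, 80, 90, 100, 100, 110,
           120, 130, 120, 110, 100, 150, 200, 250, 300, 200, 120, 80]
print(quietest_window(traffic, window_hours=2))
```

A window chosen this way is only useful if the operation fits inside it, which is why the seconds-scale scaling described in the public case matters for scheduling.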

For another platform team, the lesson is sharper than "migrate tomorrow." If the pain comes from Kafka's broker-local state, managed service convenience may not remove it. It may only move some controls out of your hands.

MSK Migration Checklist for Platform Teams

The right decision starts with an honest inventory. A team mildly annoyed by MSK maintenance may need better client settings and clearer runbooks. A team seeing player-visible latency or recurring fear around broker operations may need to question the storage model itself.

Use this checklist before choosing whether to optimize MSK, redesign the workload, or evaluate an AWS MSK alternative:

  • Maintenance control: Can you confidently keep broker maintenance away from peak business events, or do provider-driven operations still create scheduling risk?
  • Rebalancing blast radius: When brokers restart, scale, or change, does the cluster spend meaningful time moving partition data?
  • Client behavior: Are producers and consumers configured to handle leadership failover, reconnects, retries, and transient disconnects without user-visible impact?
  • Migration readiness: Do you know the topics, consumer groups, ACLs, authentication mode, client owners, observability checks, and rollback gates?
  • Business tolerance: If Kafka latency spikes during a live event, is that an internal SRE problem or a customer experience problem?
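
The migration-readiness item in the checklist above can be kept honest with a small inventory record that reports what is still missing. The sketch below is an assumption-heavy illustration: the field names and the auth-mode labels are hypothetical, and the only rule taken from the article is that Kafka Linking does not support MSK IAM Role authentication mode.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MigrationInventory:
    """Hypothetical inventory of what a team knows before migrating."""
    topics: List[str] = field(default_factory=list)
    consumer_groups: List[str] = field(default_factory=list)
    acls_exported: bool = False
    source_auth_mode: str = ""          # e.g. "IAM", "SASL_SCRAM", "TLS"
    client_owners_known: bool = False
    observability_checks_defined: bool = False
    rollback_gates_defined: bool = False


def readiness_gaps(inv: MigrationInventory) -> List[str]:
    """Return human-readable gaps; an empty list means the inventory is complete."""
    gaps = []
    if not inv.topics:
        gaps.append("topic list is empty")
    if not inv.consumer_groups:
        gaps.append("consumer group list is empty")
    if not inv.acls_exported:
        gaps.append("ACLs not exported")
    if not inv.source_auth_mode:
        gaps.append("source authentication mode unknown")
    elif inv.source_auth_mode.upper() == "IAM":
        gaps.append("Kafka Linking does not support MSK IAM Role authentication")
    if not inv.client_owners_known:
        gaps.append("client owners not identified")
    if not inv.observability_checks_defined:
        gaps.append("observability checks not defined")
    if not inv.rollback_gates_defined:
        gaps.append("rollback gates not defined")
    return gaps


inventory = MigrationInventory(topics=["game-events"], source_auth_mode="IAM")
for gap in readiness_gaps(inventory):
    print("-", gap)
```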

[Figure: Migration readiness checklist]

Avia Games' answer was to migrate from AWS MSK to AutoMQ. The public story gives the outcome: zero downtime during migration, more predictable maintenance, reduced operational risk, and a Kafka-compatible architecture where broker operations no longer carried the same data-movement penalty. The missing private details should stay missing until the customer approves them.

If your team is already searching for "migrate from MSK" or "AWS MSK alternative," you probably do not need another abstract explanation of Kafka. You need to know whether your pain comes from configuration, process, or architecture. Avia Games' story is a reminder that sometimes the maintenance window is only the symptom; the deeper issue is that every broker still owns too much state.

Read the original Avia Games case or talk to AutoMQ if you want to evaluate a Kafka-compatible migration path for an MSK workload.

FAQ

Why did Avia Games migrate from AWS MSK to AutoMQ?

The public case says Avia Games faced operational unpredictability from MSK maintenance, broker restarts, and automatic partition rebalancing. Those events could create latency spikes and stability risks for real-time competitive gaming workloads, so the team wanted more control over when and how Kafka operations affected production.

Did Avia Games have downtime during the migration?

The public customer case states that Avia Games migrated from AWS MSK to AutoMQ with zero downtime using AutoMQ Linking. The exact migration batches, validation checks, and internal definition of "zero downtime" should be confirmed with customer-approved details before adding runbook-level claims.

What is AutoMQ Linking?

AutoMQ Linking is AutoMQ's Kafka migration capability. AutoMQ documentation describes it as supporting byte-to-byte replication, synchronized consumer progress, and producer rolling-switch behavior for Kafka migrations without application changes.

Is this article saying AWS MSK is unreliable?

No. MSK can be a good fit for many Kafka teams. The Avia Games case is about a workload where provider-driven maintenance timing, broker restarts, and partition movement created too much operational uncertainty for real-time gaming. The right conclusion depends on the workload's latency tolerance, traffic shape, and operational control requirements.

What details still need customer approval before publication?

Topic count, partition count, consumer group count, latency impact, migration duration, rollback plan, validation checks, and any internal operator quote beyond the public case should be approved by Avia Games before publication.
