
Founded in 2010 and headquartered in Switzerland, FunPlus is a leading global mobile game publisher. With offices in Singapore, China, Portugal, and Spain, the company employs over 2,000 people and operates a portfolio of hit titles. State of Survival alone has surpassed 150 million downloads. Other flagship games include King of Avalon, Guns of Glory, and Sea of Conquest, with daily active players spanning multiple regions worldwide.
Player logins, in-game events, and real-time operational decisions all depend on a data pipeline running behind the scenes. FunPlus uses Apache Kafka® as the backbone for two core systems:
- Game observability platform: Processes server-side logs, performance metrics, and alerts in real time to ensure game service stability across multiple global regions
- Real-time analytics platform: Powers player behavior analysis, operational dashboards, and in-game recommendations to drive real-time decision-making for operations teams
These pipelines process billions of messages per day. Any disruption directly impacts the player experience. When Kafka cluster costs started spiraling out of control, the FunPlus infrastructure team had to rethink their architecture choices.
The Hidden Cost Buried in the Bill
FunPlus originally ran Kafka clusters on Amazon MSK in the AWS us-west-2 region. On the surface, MSK costs looked straightforward—instance fees and storage fees were itemized and within budget. But when the infrastructure team conducted a cost audit and dug into the AWS billing structure, they uncovered a surprising fact: cross-Availability Zone (AZ) data transfer fees accounted for a significant portion of the MSK cloud bill.
This cost had gone unnoticed because it never appeared in the MSK bill itself. AWS categorizes cross-AZ traffic fees under "EC2-Other" or "Data Transfer," lumped together with network costs from dozens of other services in the account. Isolating how much Kafka contributed was nearly impossible.
How do cross-AZ traffic costs accumulate? Kafka's replica replication, producer writes, and consumer reads all move data between AZs. AWS bills this traffic at $0.01/GB in each direction, which works out to $0.02/GB per transfer. With FunPlus's clusters processing roughly seven billion messages per day, these three sources added up fast, making cross-AZ data transfer the single largest line item. For a detailed breakdown and calculation of cross-AZ traffic costs, see our earlier article Is Kafka on S3 Files a Good Idea?

The gaming industry amplifies this problem: high-throughput data pipelines, multi-AZ deployments that are mandatory for a stable player experience, and global multi-region operations all multiply cross-AZ costs. The team tried configuration optimizations such as Fetch from Follower to reduce consumer-side traffic, but producer writes and replica replication account for the bulk of cross-AZ transfers. That is a consequence of Kafka's storage architecture, and no configuration change can remove it. Worse, every new game launch and every wave of player growth increased cross-AZ costs proportionally. FunPlus needed a fundamentally different approach.
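The three traffic sources described above, and the limited reach of Fetch from Follower, can be sketched as a back-of-the-envelope model. This is an illustration under stated assumptions, not FunPlus's actual billing: it assumes three AZs with brokers and clients spread evenly, a replication factor of 3 with one replica per AZ, and an average message size and consumer fan-out that the article does not state. The names `Workload`, `daily_cross_az_gb`, and `daily_cost_usd` are hypothetical.

```python
from dataclasses import dataclass

CROSS_AZ_PRICE_PER_GB = 0.02  # USD per GB, the rate quoted in the article


@dataclass
class Workload:
    messages_per_day: float      # e.g. 7e9, the volume quoted in the article
    avg_message_bytes: float     # assumption: not stated in the article
    consumer_fanout: float       # assumption: how many times each message is read
    az_count: int = 3            # assumption: brokers and clients spread evenly over 3 AZs
    replication_factor: int = 3  # assumption: one replica per AZ


def daily_cross_az_gb(w: Workload, fetch_from_follower: bool = False) -> dict:
    """Estimate daily cross-AZ traffic in GB for each of the three sources."""
    ingress_gb = w.messages_per_day * w.avg_message_bytes / 1e9
    # Chance that a client and the partition leader sit in different AZs.
    remote_fraction = (w.az_count - 1) / w.az_count
    produce = ingress_gb * remote_fraction
    # The leader ships every byte to each follower, and every follower is in another AZ.
    replicate = ingress_gb * (w.replication_factor - 1)
    # Fetch from Follower lets consumers read from a replica in their own AZ.
    consume = 0.0 if fetch_from_follower else ingress_gb * w.consumer_fanout * remote_fraction
    return {"produce": produce, "replicate": replicate, "consume": consume}


def daily_cost_usd(traffic_gb: dict) -> float:
    """Convert the per-source traffic estimate into a daily cross-AZ charge."""
    return sum(traffic_gb.values()) * CROSS_AZ_PRICE_PER_GB


if __name__ == "__main__":
    workload = Workload(messages_per_day=7e9, avg_message_bytes=1000, consumer_fanout=1)
    for enabled in (False, True):
        traffic = daily_cross_az_gb(workload, fetch_from_follower=enabled)
        print(enabled, traffic, round(daily_cost_usd(traffic), 2))
```

Setting `fetch_from_follower=True` zeroes only the `consume` term. The `produce` and `replicate` terms are unchanged, which is the part of the bill that configuration tuning cannot reach in this model.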
Why AutoMQ
When evaluating alternatives, FunPlus had two non-negotiable requirements:
- Significantly reduce Kafka costs through an architecture upgrade with no side effects
- Zero impact on production workloads during migration—a cluster processing seven billion messages per day leaves no room for error
AutoMQ met both. On the architecture side, AutoMQ's Diskless architecture moves data persistence from Elastic Block Store (EBS) volumes attached to each broker to S3, which eliminates broker-to-broker replica replication and removes cross-AZ traffic costs at the root (see Diskless Engine Technical Deep Dive). Storage costs also dropped sharply, because three EBS replicas were replaced by a single copy in S3.
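The storage side of that change is simple arithmetic, sketched below. The per-GB prices are assumed list prices for gp3 EBS and S3 Standard, not figures from FunPlus's bill, and the sketch ignores S3 request fees and any other storage AutoMQ may use. The function names are hypothetical.

```python
# Assumed list prices in USD per GB-month; check current AWS pricing before relying on them.
EBS_GP3_PRICE = 0.08
S3_STANDARD_PRICE = 0.023


def ebs_replicated_cost(retained_gb: float, replication_factor: int = 3,
                        disk_utilization: float = 1.0) -> float:
    """Monthly cost when every byte is stored once per replica on EBS.

    disk_utilization below 1.0 models volumes provisioned larger than the data they hold.
    """
    provisioned_gb = retained_gb * replication_factor / disk_utilization
    return provisioned_gb * EBS_GP3_PRICE


def s3_single_copy_cost(retained_gb: float) -> float:
    """Monthly cost when the same data is stored once in S3 (request fees ignored)."""
    return retained_gb * S3_STANDARD_PRICE


if __name__ == "__main__":
    retained_gb = 10_000  # hypothetical retained data volume
    print(f"EBS x3: ${ebs_replicated_cost(retained_gb):,.2f}/month")
    print(f"S3 x1:  ${s3_single_copy_cost(retained_gb):,.2f}/month")
```

Under these assumed prices, the gap comes from two multipliers: the replica count and the per-GB price difference between block and object storage.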
On the migration side, AutoMQ Linking provides zero-downtime migration with byte-level replication and 1:1 offset consistency. Migration can proceed topic by topic, and any stage can be rolled back without data loss. For a large-scale cluster like FunPlus's, where migration requires coordination with multiple business teams, this meant migration no longer had to be a company-wide mobilization effort.
AutoMQ also maintains 100% Kafka protocol compatibility. FunPlus's existing producers, consumers, Flink jobs, analytics pipelines, and observability systems all continued working without a single line of code change.
The architecture solves the cost problem. AutoMQ Linking solves the migration risk. Protocol compatibility ensures zero changes upstream and downstream. With all three conditions met, FunPlus decided to move forward.
Migration: The Step Harder Than Choosing the Technology
Teams across the industry are well aware of Kafka's cost problem and know better architectures exist. What stops them is the migration itself.
A production Kafka cluster isn't an isolated component. It's a collection of dozens—sometimes hundreds—of topics, each connected to a different business team. Real-time risk controls, data lake ingestion, player behavior tracking—each topic has a different owner and a different tolerance for disruption. Migration means coordinating with every stakeholder: When does your topic switch? Will messages be lost during the cutover? Will consumer offsets survive? Will Flink checkpoints break?
These questions can't be answered by the infrastructure team alone. They require cross-team coordination, impact assessment for each topic, and rollback plans. Often, the effort of mapping out "which topics can move first and which absolutely can't fail" is enough to shelve the migration plan for months.
Traditional migration approaches make these concerns entirely justified. MirrorMaker2 is the most common community tool, but it has three hard limitations:
- Offsets aren't preserved: MirrorMaker2 consumes records from the source and re-produces them to the target, so offsets on the target cluster don't match the source. Committed consumer offsets can't be reused as-is, offsets stored in Flink checkpoints no longer point to the same records, and affected jobs have to reprocess historical data. For a cluster handling seven billion messages per day, this cost is unacceptable
- "Stop-the-world" cutover: All clients must switch during a single maintenance window, requiring every business team to coordinate downtime. For a 24/7 gaming service, finding a window everyone agrees on is a negotiation in itself
- Rollback is extremely costly: If something goes wrong after the switch, returning to the source cluster means performing another reverse migration—and newly produced data will be lost
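The offset problem in the first limitation can be shown with a toy model. This is a sketch only: it does not use MirrorMaker2 or any Kafka client, and it assumes a source partition whose oldest records were already removed by retention. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Record = Tuple[int, bytes]  # (offset, payload)


def reproduce_copy(source_log: List[Record]) -> List[Record]:
    """Consume-and-re-produce copy: the target assigns fresh offsets starting at 0."""
    return [(new_offset, payload) for new_offset, (_, payload) in enumerate(source_log)]


def byte_level_copy(source_log: List[Record]) -> List[Record]:
    """Byte-for-byte copy: records keep the offsets they had on the source."""
    return list(source_log)


def record_at(log: List[Record], offset: int) -> Optional[bytes]:
    """Return the payload stored at `offset`, or None if the log has no such offset."""
    for record_offset, payload in log:
        if record_offset == offset:
            return payload
    return None


# Hypothetical source partition: retention has removed offsets 0..999,
# so the first surviving offset is 1000.
source = [(1000, b"login"), (1001, b"purchase"), (1002, b"logout")]
checkpointed_offset = 1001  # position saved by a consumer or a Flink checkpoint

print(record_at(byte_level_copy(source), checkpointed_offset))  # b'purchase'
print(record_at(reproduce_copy(source), checkpointed_offset))   # None
```

A saved position is only meaningful on a log that kept its numbering. After a re-producing copy, the same number either points at nothing or at a different record.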
It's not that teams don't want to migrate. They can't.
AutoMQ Linking: Turning Migration into a Routine Deployment
FunPlus used AutoMQ Linking, AutoMQ's built-in zero-downtime migration capability. It is not an external tool but part of the product, designed to solve the "can't migrate" and "afraid to migrate" problems described above. It has been battle-tested in production at numerous leading enterprises and has significantly lowered the barrier for customers migrating to AutoMQ.

AutoMQ Linking's core design goal is to address the concerns that most often stall a Kafka migration:
| Stakeholder concern | AutoMQ Linking's answer | How it works |
|---|---|---|
| "Will offsets change?" | Strict 1:1 consistency | Replicates the raw byte stream from the source cluster without re-serialization—offsets remain identical, with no impact on existing Flink, Spark, or other offset-dependent jobs |
| "Can we roll back if something goes wrong?" | Lossless rollback at any stage | Smart write-forwarding mechanism—during the transition period, writes to the new cluster are transparently proxied back to the source cluster |
| "Will there be duplicate consumption or lost messages?" | Ordered handoff, guaranteed | Consumer coordination logic prevents new consumers from fetching data until old consumers have disconnected |
| "Will the business notice the switch?" | Transparent to the business | Producers and consumers roll over gradually—true zero-downtime migration |
| "Do we have to switch everything at once?" | One topic at a time | Supports topic-level and consumer group-level granularity |
| "Does it depend on external migration components?" | No external dependencies | Built into AutoMQ, fully managed |
These six capabilities together minimize migration risk. Consistent offsets mean Flink, Spark, and other offset-dependent jobs are unaffected. Zero-downtime rolling cutover means business teams notice nothing. Topic-level granularity means teams can proceed at their own pace. Lossless rollback at any stage means no one has to "bet the farm." From the business team's perspective, the Kafka cluster before and after migration looks identical—only the underlying infrastructure has changed.
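The "ordered handoff" row in the table above can be pictured as a gate on the new cluster. The sketch below is a toy model of that idea under our own assumptions, not AutoMQ's implementation; `HandoffGate`, `old_member_left`, and `fetch_on_new_cluster` are hypothetical names.

```python
from typing import Iterable, List


class HandoffGate:
    """Tracks which old-cluster consumers of a group are still connected."""

    def __init__(self, old_members: Iterable[str]) -> None:
        self._old_members = set(old_members)

    def old_member_left(self, member_id: str) -> None:
        """Record that one old consumer has disconnected."""
        self._old_members.discard(member_id)

    def may_fetch(self) -> bool:
        # New consumers are served only after every old consumer has disconnected.
        return not self._old_members


def fetch_on_new_cluster(gate: HandoffGate, log: List[bytes], position: int,
                         max_records: int = 10) -> List[bytes]:
    """Return records from `position`, or nothing while the gate is closed."""
    if not gate.may_fetch():
        return []
    return log[position:position + max_records]


if __name__ == "__main__":
    gate = HandoffGate(["old-consumer-1", "old-consumer-2"])
    log = [b"e0", b"e1", b"e2"]
    print(fetch_on_new_cluster(gate, log, position=1))  # gate closed
    gate.old_member_left("old-consumer-1")
    gate.old_member_left("old-consumer-2")
    print(fetch_on_new_cluster(gate, log, position=1))  # gate open
```

Because the two groups of consumers never read at the same time in this model, no record is delivered by both clusters during the rollover.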
Phased Migration: Test the Waters, Then Push Forward
With topic-level granularity, FunPlus's migration didn't require an all-or-nothing gamble or simultaneous coordination with every business team. The team adopted a phased approach:
- Migrate non-critical workloads first: Monitoring and logging topics moved to AutoMQ first to validate data integrity and latency. If anything went wrong, the blast radius was contained
- Gradually migrate core pipelines: After the first batch ran stably, real-time analytics and player behavior data pipelines were switched over in stages—each batch followed by an observation period before proceeding
- Rolling client updates: Upstream producers and downstream consumers switched through standard rolling updates, with no coordinated downtime window required
Throughout the transition, AutoMQ Linking provided a critical safety net: if a producer had already switched to the new cluster and started writing, but other components had not yet migrated, those writes were automatically proxied back to the source cluster. This kept data consistent between old and new clusters at all times. If an issue surfaced at any stage, the team could cut back to the source cluster immediately, with no data loss and without business teams noticing that a switch had happened.
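The write-forwarding behavior described above can be sketched as a toy model. It illustrates the idea that the source cluster remains the single place where offsets are assigned during the transition. It is not AutoMQ's implementation, and the class and method names are hypothetical.

```python
from typing import List


class SourceCluster:
    """Toy source partition: the list index of a record is its offset."""

    def __init__(self) -> None:
        self.log: List[bytes] = []

    def append(self, payload: bytes) -> int:
        self.log.append(payload)
        return len(self.log) - 1


class LinkedTarget:
    """Toy target partition that forwards writes and mirrors the source log."""

    def __init__(self, source: SourceCluster) -> None:
        self.source = source
        self.log: List[bytes] = []

    def produce(self, payload: bytes) -> int:
        # Forward the write so the source stays the only writer of offsets.
        offset = self.source.append(payload)
        self.sync()
        return offset

    def sync(self) -> None:
        # Copy records the target has not seen yet, in order, keeping their offsets.
        self.log.extend(self.source.log[len(self.log):])


if __name__ == "__main__":
    source = SourceCluster()
    target = LinkedTarget(source)
    source.append(b"from producer still on the old cluster")
    offset = target.produce(b"from producer already on the new cluster")
    print(offset, target.log == source.log)
```

In this model the source log always holds every record, whichever cluster the producer was connected to, so cutting back to the source loses nothing.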
The entire migration completed with zero downtime and zero application changes. No "stop-the-world" cutover, no 3 AM maintenance window, no ten-team meeting to coordinate switch timing.
Looking back, the biggest change wasn't technical; it was psychological. When offsets are guaranteed to stay the same, rollback is available at any time, and topics can move one at a time, migration changes from a company-wide mobilization into a routine deployment that teams can execute at their own pace.
Production Results
FunPlus's AutoMQ cluster runs in AWS us-west-2, powering both the game observability platform and the real-time analytics platform across its global game portfolio.

FunPlus reports a cost reduction of more than 60% (based on FunPlus production data), driven by two architectural factors:
- Elimination of cross-AZ data transfer costs—previously the largest single cost item, now reduced to near zero because there is no broker-to-broker replication
- Storage model shift—from three EBS replicas to a single S3 copy, significantly reducing storage costs
This isn't a one-time saving from downsizing or reserved instances—it's structural, and it scales with the cluster. Downstream systems (Flink jobs, analytics pipelines, observability systems, and all game client integrations) continue running as before, connected to what looks—from their perspective—like a standard Kafka cluster.
Future Outlook
With Kafka infrastructure costs decoupled from business growth, FunPlus can confidently scale its data pipelines to support new game launches and growing player bases. The team is exploring further optimizations enabled by AutoMQ's stateless broker architecture, including using Spot Instances to reduce compute costs.
For any team running mid-to-large-scale Kafka clusters on AWS or GCP, whether in gaming, e-commerce, fintech, or SaaS, cross-AZ traffic costs deserve a close look on the bill. The number hidden under "EC2-Other" is often larger than expected.
Multiple leading gaming companies have deployed AutoMQ at scale in production. If you are evaluating Kafka infrastructure cost optimization, contact us to learn more about gaming industry deployments.
