About XPENG Motors
XPENG Motors, founded in 2014, is a technology company focused on future mobility. The company has consistently invested heavily in R&D to build full-stack, in-house core capabilities. Today, XPENG Motors is one of China's leading smart electric vehicle companies.
Business Background
XPENG Motors utilizes Apache Kafka® to address the log collection, processing, and analysis needs of various application systems on its cloud platform. Log data from each business application is collected through a unified channel, delivered to Kafka, and then distributed by Kafka to downstream components for consumption and processing.
This pipeline currently supports multiple core scenarios and systems such as online monitoring and alerting, log retrieval, business operations data analysis, and security audit compliance.
Before using AutoMQ, the cloud platform business relied on Kafka. However, as the business continued to grow, two significant issues emerged:
High resource costs: Using Kafka on the cloud led to escalating cluster bills as the business scale increased.
Heavy operational burden for scaling: Rapid business growth frequently required scaling the Kafka clusters, which placed a heavy burden on the operations team. Each scaling operation required careful planning of partition reassignment and traffic rebalancing.
Evaluating and Selecting AutoMQ
Because of these cost and operational burdens, XPENG Motors' cloud platform team began evaluating projects in the data streaming space, aiming to find a cost-effective and easy-to-maintain Kafka alternative.
Cost Optimization
During the evaluation, the XPENG Motors team found that Kafka's costs stem mainly from two sources:
High storage costs: Kafka uses the ISR (In-Sync Replicas) mechanism to ensure data durability, which requires storing multiple copies of the data. In a public cloud environment, keeping three replicas on ESSD cloud disks is very expensive. The unit price of object storage is roughly 1/6 that of ESSD cloud disks, and because the ISR mechanism stores three replicas, the actual gap in storage cost is even larger.
Idle resource waste: As business load changes, Kafka clusters need to scale in and out frequently. If a cluster cannot be scaled in promptly, resources reserved for peak usage sit idle, and the waste accumulates over time.
| Cost Comparison Item | ESSD Cloud Disk | Object Storage |
|---|---|---|
| Storage pricing | 1 RMB/GB/month | 0.15 RMB/GB/month |
Data: Cloud ESSD Disk Price vs Object Storage Price
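The gap becomes concrete with a little arithmetic. The sketch below uses the unit prices from the table and the three-replica layout described above. The 10,000 GB data volume is a hypothetical figure chosen only for illustration, and the model assumes object storage is billed for a single logical copy.

```python
# Unit prices from the table above (RMB per GB per month).
ESSD_PRICE = 1.00
OBJECT_STORAGE_PRICE = 0.15


def monthly_storage_cost(data_gb, unit_price, copies=1):
    """Monthly bill for storing `data_gb` of messages `copies` times."""
    return data_gb * unit_price * copies


# Hypothetical data volume, for illustration only.
data_gb = 10_000

kafka_cost = monthly_storage_cost(data_gb, ESSD_PRICE, copies=3)
object_cost = monthly_storage_cost(data_gb, OBJECT_STORAGE_PRICE)
ratio = kafka_cost / object_cost

print(f"Kafka on ESSD, 3 replicas: {kafka_cost:,.0f} RMB/month")
print(f"Object storage, 1 copy:    {object_cost:,.0f} RMB/month")
print(f"Ratio: {ratio:.1f}x")
```

Under these assumptions the three-replica ESSD layout costs about 20 times as much per stored gigabyte as a single copy in object storage, regardless of the data volume chosen.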
AutoMQ targets both costs by replacing Kafka's storage layer with object storage, which gives it a stateless compute layer, partition reassignment in seconds, automatic elasticity, and traffic self-balancing. According to data published on AutoMQ's official website, it stores the same message data at a significantly lower cost than native Kafka.
Partition Reassignment in Seconds and AutoScaling
As described in the cost optimization section, cloud platform logging applications experience significant traffic fluctuations. For example, the business data write volume may be 80 MB/s during off-peak times but rise to 120-150 MB/s during peak periods, nearly double the off-peak rate. Reserving resources based solely on peak values would waste a large share of them; on the other hand, frequent scaling in and out poses a significant challenge to architecture and operations teams.
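To put a number on that waste, the sketch below computes how much provisioned capacity sits idle off-peak when a cluster is sized for peak traffic. The throughput figures are the ones quoted above; treating cluster capacity as proportional to write throughput is a simplifying assumption.

```python
def idle_fraction(current_mbps, provisioned_mbps):
    """Share of provisioned throughput capacity that is unused right now."""
    if provisioned_mbps <= 0:
        raise ValueError("provisioned capacity must be positive")
    return max(0.0, 1.0 - current_mbps / provisioned_mbps)


OFF_PEAK_MBPS = 80            # off-peak write volume quoted in the text
PEAK_RANGE_MBPS = (120, 150)  # peak write volume range quoted in the text

for peak in PEAK_RANGE_MBPS:
    idle = idle_fraction(OFF_PEAK_MBPS, peak)
    print(f"Sized for {peak} MB/s: {idle:.0%} idle off-peak")
```

Under this simplification, roughly a third to nearly half of a peak-sized cluster is idle during off-peak hours.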
Systems built on local storage, such as Apache Kafka®, store message data in partitions spread across Broker nodes, so scaling requires manual intervention to reassign and copy that data. The time required for reassignment grows with the data volume and can range from minutes to hours, which makes fast, automated scaling infeasible.
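A back-of-the-envelope estimate shows why copy-based reassignment is slow. The sketch assumes reassignment time is dominated by copying partition data between Brokers at a fixed rate. The data volumes and the 100 MB/s copy rate are hypothetical, and operators often throttle the copy further to protect live traffic.

```python
def reassignment_seconds(data_gb, copy_rate_mbps):
    """Time to copy `data_gb` of partition data at `copy_rate_mbps` (MB/s)."""
    if copy_rate_mbps <= 0:
        raise ValueError("copy rate must be positive")
    return data_gb * 1024 / copy_rate_mbps


COPY_RATE_MBPS = 100  # hypothetical inter-Broker copy bandwidth

for data_gb in (10, 100, 1000):  # hypothetical data volumes to move
    minutes = reassignment_seconds(data_gb, COPY_RATE_MBPS) / 60
    print(f"{data_gb:>5} GB -> {minutes:,.1f} min")
```

Even in this idealized model, moving a terabyte of partition data takes hours, while a few gigabytes take minutes.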
AutoMQ offloads data to object storage, making the Brokers nearly stateless. During scaling or failover, only metadata changes and the upload and recovery of a small amount of WAL data are needed, which allows partition reassignment in seconds.
Due to the nearly stateless computation layer, AutoMQ can configure an elastic scaling group in the cloud and set auto-scaling rules based on CPU, memory, and network throughput metrics. This achieves automatic horizontal scaling. During this process, partition reassignment and balancing are automated without requiring manual intervention.
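The idea behind a throughput-based scaling rule can be sketched as follows. This is not AutoMQ's or any cloud provider's actual policy engine. It considers network throughput only, and the per-Broker capacity, target utilization, and minimum Broker count are hypothetical parameters chosen for illustration.

```python
import math


def desired_brokers(write_mbps, per_broker_mbps=40.0, target_util=0.75, min_brokers=3):
    """Broker count that keeps each Broker at or below the target utilization.

    All default parameter values are hypothetical.
    """
    usable_mbps = per_broker_mbps * target_util
    return max(min_brokers, math.ceil(write_mbps / usable_mbps))


for load in (80, 120, 150):  # write volumes quoted in the text (MB/s)
    print(f"{load} MB/s -> {desired_brokers(load)} brokers")
```

Because the Brokers hold almost no local state, moving between the Broker counts such a rule produces does not involve copying partition data, which is what makes applying it automatically practical.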
Implementation and Deployment of AutoMQ at XPENG Motors
Migration Plan
Because AutoMQ's architecture replaces only the storage layer and fully reuses the Apache Kafka® code for the compute layer, XPENG Motors encountered no compatibility issues when migrating its existing Kafka business clusters to AutoMQ.
Since the current application scenarios are not sensitive to consumption latency, the migration plan is simple and reliable:
Upstream Production Switch: The Kafka access point used by the log collection side is switched directly to AutoMQ, so that all write traffic goes straight to the AutoMQ cluster.
Wait for Source Cluster Consumption Completion: Downstream businesses continue consuming from the source cluster until all remaining messages have been consumed.
Downstream Consumption Switch: Once all downstream consumption from the source cluster is complete, the consumers' Kafka access point is switched directly to AutoMQ, and consumption continues from the AutoMQ cluster.
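The "consumption complete" condition amounts to checking that consumer lag is zero on every partition of the source cluster. The sketch below shows that check on plain dictionaries. The topic name and offsets are made-up sample data, and in practice the offsets would come from the cluster's admin tooling.

```python
def partition_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset.

    A partition with no committed offset is treated as entirely unconsumed.
    """
    return {
        tp: max(0, end - committed_offsets.get(tp, 0))
        for tp, end in end_offsets.items()
    }


def safe_to_switch_consumers(end_offsets, committed_offsets):
    """True once the consumer group has caught up on every source partition."""
    lags = partition_lag(end_offsets, committed_offsets)
    return all(lag == 0 for lag in lags.values())


# Hypothetical snapshot of the source cluster after producers were switched away.
end_offsets = {("app-logs", 0): 1200, ("app-logs", 1): 980}
committed = {("app-logs", 0): 1200, ("app-logs", 1): 950}

print(partition_lag(end_offsets, committed))
print(safe_to_switch_consumers(end_offsets, committed))
```

Because producers already write to AutoMQ at this point, the source cluster's end offsets no longer advance, so once the lag reaches zero it stays there and the consumer switch can proceed without losing messages.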
Operations & Observability
AutoMQ provides out-of-the-box Prometheus metrics on the cloud. You can configure the export of cluster metrics to a Prometheus cluster with a single click. Combined with Grafana and alert templates, this gives you production-level observability for your clusters.
AutoMQ can also leverage the Elastic Scaling Service (ESS) from cloud providers to dynamically scale. During peak times, cloud servers are launched as Brokers, and during off-peak times, they are scaled down, providing the capability to adjust resources according to business load fluctuations.
Benefits and Outlook
Compared with the Kafka managed service used previously, migrating to AutoMQ has brought significant savings in both compute and storage costs. Instances of the same scale can support greater business traffic, resulting in overall cost savings of approximately 50% or more.
In the future, we plan to gradually extend AutoMQ to more business scenarios, such as vehicle data reporting. In these scenarios, message throughput can reach about 300 MB/s to 500 MB/s, with more pronounced traffic peaks and valleys, so AutoMQ's automatic elastic scaling capability will become even more important.