Kafka Retention at IoT Scale: How Geely Supports 10 Million Connected Vehicles with AutoMQ

Kafka retention looks like a configuration choice until the workload gets big enough. Then it becomes an architecture decision. A team can set retention.ms, add disk, rebalance partitions, and tune segment sizes, but the harder question is whether every additional day of history should force the streaming layer to carry more broker-local state.

Connected vehicles make that question difficult to dodge. Telemetry, remote control events, vehicle health data, firmware update signals, and downstream analytics all depend on a reliable stream of records moving from vehicles to cloud systems and then into data platforms. The data is operational in the moment, but it can also become useful later for reprocessing, diagnostics, compliance, BI, and product improvement. When a team settles for a short retention window, it is usually because the storage model made a longer one too painful, not because the history stopped being valuable.

Geely's public AutoMQ case is useful because the scale is concrete. Geely operates a connected vehicle platform supporting more than 10 million vehicles, and AutoMQ has run in production there for over 3 years. The published production metrics include 57+ nodes, 400+ TB of data under management, and 3+ GiB/s peak throughput. Those figures are the boundary conditions for a hybrid-cloud vehicle data backbone where retention, elasticity, and operational consistency meet in the same place.

[Figure: Connected vehicle streaming architecture]

The Hidden Tradeoff Behind Kafka Retention

Apache Kafka gives operators several levers for retention. The broker-level defaults include settings such as log.retention.hours, log.retention.ms, and log.retention.bytes, while topics can override equivalent policies. On paper, this is flexible. In production, the flexibility runs into the shape of the storage layer.
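To make the layering of those settings concrete, here is a minimal Python sketch of how an effective time-based retention could be resolved. It follows the precedence Apache Kafka documents (log.retention.ms over log.retention.minutes over log.retention.hours, with a topic-level retention.ms overriding the broker default). The function names and the sample configs are illustrative assumptions, not part of any Kafka client library.

```python
MS_PER_MINUTE = 60 * 1000
MS_PER_HOUR = 60 * MS_PER_MINUTE


def broker_default_retention_ms(broker_config: dict) -> int:
    """Resolve the broker-wide time retention default in milliseconds.

    Precedence: log.retention.ms, then log.retention.minutes,
    then log.retention.hours (Kafka's default is 168 hours).
    """
    if broker_config.get("log.retention.ms") is not None:
        return int(broker_config["log.retention.ms"])
    if broker_config.get("log.retention.minutes") is not None:
        return int(broker_config["log.retention.minutes"]) * MS_PER_MINUTE
    return int(broker_config.get("log.retention.hours", 168)) * MS_PER_HOUR


def effective_retention_ms(broker_config: dict, topic_config: dict) -> int:
    """A topic-level retention.ms, when set, overrides the broker default."""
    if topic_config.get("retention.ms") is not None:
        return int(topic_config["retention.ms"])
    return broker_default_retention_ms(broker_config)


# Hypothetical configs: a 2-day broker default and one topic asking for 5 days.
broker = {"log.retention.hours": 48}
telemetry_topic = {"retention.ms": 5 * 24 * MS_PER_HOUR}
audit_topic = {}  # no override, so it inherits the broker default

print(effective_retention_ms(broker, telemetry_topic))  # 432000000
print(effective_retention_ms(broker, audit_topic))      # 172800000
```

The resolution itself is trivial, which is the point: asking for 5 days is a one-line change, while providing the disk behind it is not.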

Traditional Kafka keeps topic partitions on broker-attached disks. That coupling is part of the original design that made Kafka fast and durable in data-center environments. The tradeoff is that retention growth becomes broker growth. More retained data means more disk pressure, more partition movement, more rebalance planning, and often more compute capacity than the workload actually needs.

For IoT and vehicle data, the uncomfortable part is timing. Peaks can come from rush hours, vehicle launches, weather events, coordinated updates, or changes in user behavior. If a Kafka cluster is already tight on disk and a peak arrives, operators may narrow the retention window to protect the live service. Geely's earlier Kafka environment hit that class of compromise: the public V2X writeup says the team had reduced retention from 5 days to 2 days because of Kafka elasticity and scaling challenges, which affected historical consumption scenarios such as reruns.

That detail is the center of the story. A 5-day to 2-day reduction is not an abstract "Kafka cost" problem. It means the infrastructure could not keep the desired operational history without creating unacceptable scaling pressure. Once retention becomes a defensive maneuver, the streaming layer starts shaping what the business can do with its own data.
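A back-of-the-envelope calculation shows why the gap between 2 and 5 days is felt on the brokers. The sketch below assumes steady ingest, no compression effects, and evenly balanced brokers. The ingest rate, replication factor, and broker count are hypothetical values chosen only to make the arithmetic visible; they are not Geely's numbers.

```python
def retained_bytes(ingest_bytes_per_sec: float, retention_days: float,
                   replication_factor: int) -> float:
    """Steady-state bytes held on broker disks for one retention window."""
    seconds = retention_days * 24 * 60 * 60
    return ingest_bytes_per_sec * seconds * replication_factor


GIB = 1024 ** 3
TIB = 1024 ** 4

# Hypothetical inputs, for illustration only.
avg_ingest = 0.5 * GIB   # sustained average ingest in bytes per second
replication = 3          # a common Kafka replication factor
brokers = 30             # brokers sharing the retained data evenly

for days in (2, 5):
    total = retained_bytes(avg_ingest, days, replication)
    per_broker = total / brokers
    print(f"{days} days: {total / TIB:.1f} TiB total, "
          f"{per_broker / TIB:.2f} TiB per broker")
```

Under these assumptions the retained footprint grows linearly with the window, so moving from 2 to 5 days multiplies the disk each broker must carry by 2.5, before any headroom for peaks is added.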

Why Connected Vehicle Data Is Different

Geely's connected-vehicle architecture is not a single ingestion path feeding a dashboard. Public AutoMQ and EMQX material describes a hybrid architecture where vehicles report telemetry via MQTT from telematics control units to cloud-based EMQX clusters. TSP applications process the data, AutoMQ receives and distributes TSP application data in the public cloud, and data then flows over dedicated lines into Geely's private-cloud GDMP AutoMQ cluster. Flink, Spark, Kafka consumers, and a data lake handle downstream processing and analytics.

That architecture gives the streaming layer several jobs at once:

  • It has to ingest high-volume telemetry and service events from a large fleet without turning every peak into a manual capacity project.
  • It has to preserve a Kafka-compatible interface so existing applications, tools, and downstream consumers do not need a new integration model.
  • It has to bridge public-cloud and private-cloud environments while respecting data security, cost, and operational consistency constraints.
  • It has to support historical consumption paths, because reruns and downstream analytics depend on retained event history.

Together, these jobs turn Kafka retention into a system design problem. If retention lives only on broker-local disks, then the fleet's data growth and the cluster's compute lifecycle remain tied together. The public case page frames the workload in mission-critical terms: real-time telemetry, remote control commands, and vehicle health monitoring. The takeaway is clear: Geely needed a streaming backbone stable enough for connected-vehicle operations at 10M+ vehicle scale.

Hybrid Cloud Makes Stateful Kafka Harder

Hybrid cloud is sometimes discussed as a deployment preference, but in connected-vehicle systems it can be an operating requirement. The public Geely V2X article names two drivers: existing private-cloud data infrastructure can make hybrid cloud more cost-effective overall, and some key data is stored in Geely's own data center for data privacy and security.

Traditional Kafka makes that reality harder because state follows the broker. Expanding storage can involve adding disks, mounting volumes, changing configuration, moving partitions, and draining traffic during lower-peak periods. The Geely V2X writeup describes the capacity bind directly: reserve too much and resources sit underused; reserve too little and the platform cannot expand in time when traffic grows.

The pattern is familiar to many Kafka teams:

Requirement | Desired outcome | Broker-local storage pressure
Longer retention | Keep more history for reruns, diagnostics, and analytics | Add disk, rebalance partitions, and manage larger stateful brokers
Traffic peaks | Scale throughput when the fleet or application workload surges | Pre-provision capacity or accept high-risk expansion work
Hybrid cloud | Use one operational model across public and private environments | Reconcile different storage systems, network paths, and operational runbooks
Kafka compatibility | Preserve existing clients and downstream applications | Avoid migrations that change the application contract

Kafka remains the application contract many teams want. The question is whether the storage architecture behind that contract can change.

Shared Storage as the Retention Foundation

AutoMQ's role in the Geely story is not that it replaces the Kafka API with a different interface. The public Geely material emphasizes 100% Kafka compatibility, so existing applications, tools, and client configurations could continue using the Kafka ecosystem. The architectural difference is underneath that interface: AutoMQ separates compute from storage and uses S3-compatible object storage as the durable storage foundation.

That separation changes the retention problem. Brokers no longer need to be the long-term home for all retained records. Data persistence is offloaded to shared object storage, while brokers behave more like a stateless compute layer. Scaling throughput no longer has to be dominated by moving retained partition data between stateful brokers.

[Figure: Retention tradeoff curve]

For Geely, the public case describes this as "unlimited" data retention on object storage and says retention is decoupled from expensive local storage. The V2X writeup gives the operational reason this matters: the earlier retention reduction had affected historical consumption scenarios such as data reruns. Once retention stops competing directly with broker-local capacity, history can become part of the data platform instead of a temporary buffer waiting to be deleted.

There is also an elasticity angle. AutoMQ's public Geely page says stateless brokers scale in seconds, and the V2X article describes second-level partition migration capability because scaling no longer requires the same partition data replication pattern as traditional Kafka. Teams still need to validate throughput, consumer behavior, network paths, and downstream backpressure, but the scaling bottleneck changes.
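A rough way to see why the bottleneck changes is to estimate how long a reassignment spends copying retained data. The sketch below is a deliberately simplified model, not a description of AutoMQ internals. The data volume, the share of data reassigned, and the throttle are all hypothetical figures.

```python
def rebalance_seconds(bytes_to_move: float, bandwidth_bytes_per_sec: float) -> float:
    """Time to copy partition data at a fixed, throttled bandwidth."""
    return bytes_to_move / bandwidth_bytes_per_sec


TIB = 1024 ** 4
MIB = 1024 ** 2

# Hypothetical figures, not taken from the Geely material.
data_per_broker = 8 * TIB   # retained data held on one broker's local disks
share_to_move = 0.25        # fraction reassigned when a new broker joins
throttle = 100 * MIB        # reassignment throttle in bytes per second

local_disk_move = data_per_broker * share_to_move
hours = rebalance_seconds(local_disk_move, throttle) / 3600
print(f"broker-local reassignment: about {hours:.1f} hours of copying")

# With shared object storage the retained data stays where it is, so in this
# simplified model a reassignment copies no retained bytes.
shared_storage_move = 0
seconds = rebalance_seconds(shared_storage_move, throttle)
print(f"shared-storage reassignment: {seconds:.0f} s of copying")
```

The copy time in the broker-local case scales with retention, which is why longer history and faster scaling pull against each other there. In the shared-storage case the remaining cost is whatever the platform needs for metadata and ownership changes, which this model does not attempt to estimate.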

Production Results and Reliability Boundaries

The strongest part of the Geely case is that the public numbers point in the same direction as the architecture. The customer page reports over 3 years of stable production operation, more than 10 million connected vehicles supported, 57+ production nodes, 400+ TB of data under management, and 3+ GiB/s peak throughput. It also identifies the deployment as BYOC across cn-hangzhou and eu-central-1.

Those metrics support a specific conclusion: Geely did not use object-storage-backed Kafka as a cost experiment. It became part of a production connected-vehicle platform spanning hybrid cloud. The public sources do not say AutoMQ is the entire platform or publish a before-and-after bill. They do say AutoMQ became the streaming engine and data bus for important parts of the connected-vehicle architecture, integrated with EMQX, GDMP, downstream processing systems, and a data lake.

[Figure: Hybrid cloud consistency map]

That distinction matters because connected-vehicle reliability cannot be reduced to a single Kafka metric. Vehicle connectivity involves device networks, MQTT access, TSP applications, cloud and private-cloud links, stream processing, storage, and downstream services. AutoMQ addresses the Kafka-compatible streaming layer in that chain, where retention and broker capacity had been fighting each other.

The public quote from Geely's connected vehicle platform team captures that operational shift as moving from "operational anxiety" toward competitive advantage. The phrase is about what happens when the data platform team stops treating every traffic tide or retention requirement as a future scaling incident.

What IoT Kafka Teams Should Validate

The practical lesson for other IoT teams is not to copy Geely's architecture line by line. A fleet platform, smart-device backend, manufacturing telemetry pipeline, or logistics network may have different latency targets, regulatory constraints, consumer groups, and cloud boundaries. The transferable lesson is the evaluation frame: retention, elasticity, and hybrid-cloud consistency should be tested together.

Before choosing a Kafka-compatible platform for long-retention IoT workloads, validate these areas:

  • Retention under peak ingest. Test the retention window you actually need, not the shortest window that keeps disks comfortable during a benchmark.
  • Scaling behavior during uneven traffic. Measure what happens when brokers are added or removed while producers and consumers are active.
  • Historical reads and reruns. Run downstream replay jobs against retained data and watch impact on live ingestion and consumer latency.
  • Hybrid-cloud operating model. Confirm whether public-cloud and private-cloud deployments use the same Kafka API, storage assumptions, monitoring model, and recovery playbooks.
  • Migration compatibility. Check clients, ACLs, topic settings, consumer groups, connectors, and observability tools before assuming Kafka compatibility covers every operational detail.
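For the migration compatibility check, one simple starting point is to diff topic settings exported from the source and target clusters. The sketch below works on plain dictionaries and assumes you have already exported the configs with whatever tooling you use. The function name and the sample values are hypothetical.

```python
def diff_topic_configs(source: dict, target: dict) -> dict:
    """Return {key: (source_value, target_value)} for every setting that differs."""
    keys = set(source) | set(target)
    return {
        key: (source.get(key), target.get(key))
        for key in sorted(keys)
        if source.get(key) != target.get(key)
    }


# Hypothetical exports of one topic's settings from each cluster.
source_topic = {
    "retention.ms": "172800000",
    "cleanup.policy": "delete",
    "min.insync.replicas": "2",
}
target_topic = {
    "retention.ms": "432000000",
    "cleanup.policy": "delete",
}

for key, (old, new) in diff_topic_configs(source_topic, target_topic).items():
    print(f"{key}: source={old} target={new}")
```

A difference is not automatically a problem. A longer retention.ms on the target may be the whole reason for the move, while a setting that is missing on one side deserves a deliberate decision rather than a silent default.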

Geely's case puts a concrete workload behind an architectural point many Kafka operators already suspect. Retention is not merely how long Kafka keeps log segments. At IoT scale, retention is how much optionality the business preserves for diagnostics, analytics, compliance, and product improvement.

For connected-vehicle platforms, a short retention compromise ages badly. Fleet size grows, data types multiply, downstream teams ask for more history, and hybrid-cloud boundaries rarely disappear. Geely's move to AutoMQ shows one way out: keep the Kafka contract that applications already understand, but move durable retention to S3-compatible shared storage so the streaming layer can scale with the workload rather than with yesterday's disk assumptions.

If you are evaluating Kafka retention for an IoT or connected-vehicle workload, start with the window your business actually needs. Then ask whether your current architecture lets you keep it without turning every traffic peak into a storage project. AutoMQ's Geely customer case is a good reference point, and teams with similar hybrid-cloud streaming requirements can talk to AutoMQ's Kafka team to pressure-test their own design.

FAQ

What is Kafka retention?

Kafka retention controls how long or how much data Kafka keeps before deleting old log segments. Operators commonly configure retention by time, such as log.retention.ms or retention.ms, and by size, such as log.retention.bytes or retention.bytes. The hard part at scale is not setting the parameter; it is providing enough storage and operational headroom to keep the desired history safely.
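As an illustration of how the time and size limits interact, here is a simplified Python model of segment selection. It is a sketch of the general idea (delete closed segments from the oldest end while a limit is breached, never the active segment) and not Kafka's actual implementation. The Segment class and function name are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    largest_timestamp_ms: int


def segments_to_delete(segments: List[Segment], now_ms: int,
                       retention_ms: int, retention_bytes: int) -> List[Segment]:
    """Pick closed segments, oldest first, that breach either retention limit.

    A value of -1 disables the corresponding limit. The last segment in the
    list is treated as the active segment and is never selected.
    """
    excess = sum(s.size_bytes for s in segments) - retention_bytes
    doomed = []
    for seg in segments[:-1]:
        expired = retention_ms >= 0 and now_ms - seg.largest_timestamp_ms > retention_ms
        # Size-based deletion only removes a segment if the log would still
        # hold at least retention_bytes afterwards.
        oversized = retention_bytes >= 0 and excess - seg.size_bytes >= 0
        if not (expired or oversized):
            break  # deletion only proceeds from the oldest end of the log
        excess -= seg.size_bytes
        doomed.append(seg)
    return doomed


HOUR = 60 * 60 * 1000
now = 10 * HOUR
log = [
    Segment(0, 100, 1 * HOUR),
    Segment(100, 100, 4 * HOUR),
    Segment(200, 100, 8 * HOUR),
    Segment(300, 50, 10 * HOUR),  # active segment
]

by_time = segments_to_delete(log, now, retention_ms=5 * HOUR, retention_bytes=-1)
by_size = segments_to_delete(log, now, retention_ms=-1, retention_bytes=200)
print([s.base_offset for s in by_time])  # [0, 100]
print([s.base_offset for s in by_size])  # [0]
```

Because deletion works on whole segments, the data actually removed depends on segment boundaries as well as on the configured limits.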

Why is Kafka retention difficult for connected-vehicle and IoT workloads?

Connected-vehicle and IoT workloads combine high ingest volume, uneven traffic, many downstream consumers, and business demand for historical data. When retained records live on broker-local disks, a longer retention window can force storage expansion, partition reassignment, and more stateful operations. Geely's public V2X writeup says its earlier Kafka environment reduced retention from 5 days to 2 days because of elasticity and scaling challenges.

How does AutoMQ change the retention model?

AutoMQ keeps the Kafka-compatible API while separating compute from storage. Durable data is stored on S3-compatible object storage, and brokers can scale more like a compute layer. For retention-heavy workloads, that means history no longer has to be limited by broker-attached disk capacity in the same way as traditional Kafka.

What should teams verify before moving IoT Kafka workloads to object-storage-backed Kafka?

Teams should test retention at realistic ingest volume, scaling during traffic peaks, historical replay behavior, hybrid-cloud operations, and client compatibility. Object storage changes the storage foundation, but production readiness still depends on workload-specific validation across producers, consumers, networking, monitoring, and recovery procedures.
