
Biweekly #5: Validation of 100,000 partitions at PB storage scale

List of Community Contributors

AutoMQ Kafka Trunk Update

Clusters now support 100,000 partitions at 1 PB of storage.

https://github.com/AutoMQ/automq-for-kafka/issues/600

  • S3Object now supports DataBlockGroup. During compaction, DataBlockGroups can be merged directly into larger DataBlockGroups using CopyPart. The index size scales linearly with the size of the S3Object: a 1MB index block can support a 28GB S3Object;

  • StreamObject compaction can now delete outdated data based on the stream's start offset. The upper limit for StreamObject compaction has been raised to 10GB with no additional storage amplification. StreamObject metadata for a cluster with 1 PB of storage is expected to be no more than 70MB;

  • S3StreamSetObjectRecord data is now compressed for storage, and the KRaft checkpoint size has been reduced by 50%.
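
The sizing figures above can be sanity-checked with a little arithmetic. The following is a rough sketch only: it assumes binary units (1 GB = 1024 MB), which the notes do not specify, and it simply extrapolates the quoted 1MB-to-28GB index ratio linearly. The function names are illustrative and are not part of AutoMQ.

```python
# Back-of-envelope arithmetic for the figures quoted above.
# Assumption: binary units (1 GB = 1024 MB); the notes do not say which are used.
MB = 1024 ** 2
GB = 1024 ** 3
PB = 1024 ** 5

INDEX_BYTES = 1 * MB      # quoted: a 1MB index block ...
OBJECT_BYTES = 28 * GB    # ... can support a 28GB S3Object


def index_size_bytes(object_bytes: int) -> float:
    """Linear extrapolation of index size from the quoted 1MB : 28GB ratio."""
    return object_bytes * INDEX_BYTES / OBJECT_BYTES


def min_object_count(total_bytes: int, object_bytes: int) -> int:
    """Fewest objects of size object_bytes needed to hold total_bytes."""
    return -(-total_bytes // object_bytes)  # ceiling division


if __name__ == "__main__":
    print("data bytes covered per index byte:", OBJECT_BYTES // INDEX_BYTES)
    print(f"index for a 10GB object: {index_size_bytes(10 * GB) / MB:.2f} MB")
    # Lower bound only: it assumes every StreamObject reaches the 10GB limit.
    print("10GB objects needed for 1 PB:", min_object_count(1 * PB, 10 * GB))
```

The object count is a lower bound, since real clusters will also hold StreamObjects smaller than the 10GB compaction limit.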

Support for 2C16G configurations with 5K partitions & 50MB write & 150MB consumption & 4.5K QPS

https://github.com/AutoMQ/automq-for-kafka/issues/621

Read-path enhancements include pre-checks for tail reads, a fast path for last-segment reads, avoiding the creation of small objects in Lambda, and memory optimization for Metrics. In a scenario with 5K partitions, 64MB write, and 192MB consumption, CPU usage dropped from 92% to 82%.

Metrics transition from JMX to OpenTelemetry

https://github.com/AutoMQ/automq-for-kafka/issues/666

Kafka JMX metrics are now converted to OpenTelemetry for reporting. This unifies metrics collection for Kafka and S3Stream and provides a data source for future monitoring and alerting configuration in Grafana.
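
To make the idea of the conversion concrete, here is a minimal sketch of how a JMX MBean name could be turned into a dotted metric name plus attributes, which is the general shape OpenTelemetry metrics take. The naming scheme and the function names (`camel_to_snake`, `jmx_to_otel`) are assumptions for illustration; they are not the mapping AutoMQ actually uses, and the sketch uses only the Python standard library rather than an OpenTelemetry SDK.

```python
import re
from typing import Dict, Tuple


def camel_to_snake(name: str) -> str:
    """Convert e.g. 'MessagesInPerSec' to 'messages_in_per_sec'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def jmx_to_otel(mbean: str) -> Tuple[str, Dict[str, str]]:
    """Split an MBean name into a dotted metric name and attributes.

    Hypothetical scheme: '<domain>.<type>.<name>' becomes the metric name and
    every remaining key property (topic, partition, ...) becomes an attribute.
    """
    domain, _, props = mbean.partition(":")
    pairs = dict(p.split("=", 1) for p in props.split(",") if p)
    metric_type = pairs.pop("type", "")
    metric_name = pairs.pop("name", "")
    parts = [domain] + [camel_to_snake(x) for x in (metric_type, metric_name) if x]
    return ".".join(parts), pairs


if __name__ == "__main__":
    name, attrs = jmx_to_otel(
        "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=orders"
    )
    print(name, attrs)
```

Keeping per-topic or per-partition key properties as attributes rather than folding them into the metric name is what lets a dashboard aggregate or filter one metric across topics.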

More Things

Chaos quality validation framework

AutoMQ Kafka has added a 24/7 Chaos quality validation framework. The Kafka cluster runs around the clock while the framework periodically injects faults such as 10-second network delays, process termination with `kill -9`, and process pauses. After each fault injection, it verifies data integrity, send and consume SLAs, and fault self-healing time.
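
The notes do not describe how the framework is built, so the following is only a sketch of what one fault-injection round might look like. It builds (but does not run) the shell commands for the three fault types mentioned and estimates self-healing time from health probes. The use of `tc netem` for the network delay, `kill -STOP`/`kill -CONT` for the pause, the interface name `eth0`, and all function names are assumptions made for illustration.

```python
import itertools
from typing import List, Optional, Tuple


def fault_commands(fault: str, pid: int, iface: str = "eth0") -> List[List[str]]:
    """Return the inject command, followed by a recover command where one applies."""
    if fault == "network_delay":
        return [
            ["tc", "qdisc", "add", "dev", iface, "root", "netem", "delay", "10000ms"],
            ["tc", "qdisc", "del", "dev", iface, "root"],
        ]
    if fault == "kill":
        return [["kill", "-9", str(pid)]]  # recovery is left to the supervisor
    if fault == "pause":
        return [["kill", "-STOP", str(pid)], ["kill", "-CONT", str(pid)]]
    raise ValueError(f"unknown fault: {fault}")


def round_robin(faults: List[str], rounds: int) -> List[str]:
    """Cycle through the fault types for a fixed number of rounds."""
    return list(itertools.islice(itertools.cycle(faults), rounds))


def self_healing_seconds(
    fault_at: float, probes: List[Tuple[float, bool]]
) -> Optional[float]:
    """Seconds from fault injection to the first healthy probe after an outage.

    probes are (timestamp, healthy) pairs. Returns 0.0 if no probe ever failed
    and None if the cluster had not recovered by the last probe.
    """
    seen_unhealthy = False
    for ts, healthy in sorted(probes):
        if ts < fault_at:
            continue
        if not healthy:
            seen_unhealthy = True
        elif seen_unhealthy:
            return ts - fault_at
    return None if seen_unhealthy else 0.0


if __name__ == "__main__":
    for fault in round_robin(["network_delay", "kill", "pause"], 4):
        print(fault, fault_commands(fault, pid=1234))
```

A real harness would execute these commands on the target host, then run the data-integrity and SLA checks before the next round begins.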