
Kafka on Kubernetes: Deploy & Best Practices

Overview

The integration of Apache Kafka with Kubernetes creates a powerful platform for scalable, resilient streaming applications. This post explores deployment strategies, architectural considerations, and best practices for running Kafka on Kubernetes, addressing common challenges and ways to optimize performance.

Understanding Kafka on Kubernetes Architecture

Apache Kafka has become a cornerstone for building robust, scalable, and reliable data streaming platforms. When deployed on Kubernetes, Kafka leverages the container orchestration capabilities to enhance its scalability and availability[1]. The fundamental architecture of Kafka on Kubernetes involves several key components working together to ensure high performance and resilience.

At the core of this architecture are Kafka brokers, which are responsible for storing and managing data streams. Each broker receives and stores specific data partitions and serves them upon request[8]. These brokers are typically deployed as StatefulSets in Kubernetes, which provide stable, unique network identifiers, persistent storage, and ordered deployment and scaling[4].

Traditionally, Kafka relied on ZooKeeper for metadata management and coordination. ZooKeeper maintains information about topics, brokers, partitions, and consumer groups[8]. However, the introduction of KRaft (Kafka Raft) has simplified this architecture by integrating metadata coordination directly into Kafka brokers, eliminating the need for a separate ZooKeeper ensemble[8]. This streamlined approach reduces operational overhead and improves efficiency.

The networking aspect of Kafka on Kubernetes requires special attention. Kubernetes distributes network traffic among multiple pods of the same service, but this approach doesn't work optimally for Kafka. Clients often need to reach the specific broker that hosts the leader of a partition directly[3]. To address this, headless services are used to give each pod running a Kafka broker a unique identifier, facilitating direct communication[3].
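As a minimal sketch of that pattern, a headless Service (one with `clusterIP: None`) gives every broker Pod its own stable DNS name instead of a single load-balanced virtual IP. The service name, labels, and port below are illustrative assumptions, not values from any particular deployment:

```yaml
# Hypothetical headless Service for Kafka brokers.
apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
spec:
  clusterIP: None        # headless: no load-balanced virtual IP
  selector:
    app: kafka           # matches the broker pods' label
  ports:
    - name: broker
      port: 9092
```

With this in place, each broker pod managed by a StatefulSet named `kafka` becomes resolvable as `kafka-0.kafka-headless.<namespace>.svc.cluster.local`, so clients can address the partition leader directly.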

Deployment Methods and Options

Using Helm Charts for Kafka Deployment

Helm charts provide a package manager approach for deploying Kafka on Kubernetes. They allow defining, installing, and managing complex Kubernetes applications using pre-configured packages called charts[20]. The deployment process typically involves:

  1. Setting up a Kubernetes cluster with sufficient resources

  2. Installing Helm and adding the required repositories

  3. Configuring deployment values

  4. Deploying Kafka using the Helm chart[12]

For example, to add Confluent's Helm repository:


helm repo add confluentinc https://packages.confluent.io/helm
helm repo update

And to deploy Kafka using Bitnami's chart:


helm install my-kafka bitnami/kafka
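Chart defaults can be overridden with a values file passed via `-f values.yaml`. The keys below follow the general shape of Bitnami-style Kafka charts, but they vary between chart versions, so treat this as a hedged sketch and verify each key against the chart you actually install:

```yaml
# values.yaml -- illustrative overrides; confirm key names against
# the documentation of your chart version before using them.
replicaCount: 3          # number of Kafka brokers
persistence:
  enabled: true
  size: 100Gi            # per-broker persistent volume size
resources:
  requests:
    cpu: "1"
    memory: 2Gi
```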

Using Kafka Operators

Operators take Kubernetes management to a higher level by incorporating domain-specific knowledge to automate complex operations. Unlike Helm charts, which primarily handle installation, operators provide continuous management throughout the application lifecycle[20].

Several Kafka operators are available:

  1. Strimzi Kafka Operator - An open-source Kubernetes operator for Apache Kafka

  2. Confluent Operator - Provides enterprise-grade Kafka deployment and management

  3. KUDO Kafka - Offers out-of-the-box optimized Kafka clusters on Kubernetes[15]

Operators handle advanced tasks like broker failover, scaling, updates, and monitoring, reducing the operational burden significantly[20]. For instance, the Strimzi operator can be deployed using Helm:


helm repo add strimzi https://strimzi.io/charts/
helm install my-strimzi-operator strimzi/strimzi-kafka-operator

Manual Deployment with Kubernetes Resources

For those who need more control, manual deployment using native Kubernetes resources is possible. This approach typically involves:

  1. Creating network policies for Kafka communication

  2. Deploying ZooKeeper as a StatefulSet

  3. Creating ZooKeeper services

  4. Deploying Kafka brokers as StatefulSets

  5. Creating Kafka headless services[5]

This method provides the most flexibility but requires deeper understanding of both Kafka and Kubernetes internals[5].
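The steps above can be sketched as a trimmed broker StatefulSet; the image tag, labels, paths, and storage size are placeholder assumptions for illustration only:

```yaml
# Hypothetical minimal Kafka broker StatefulSet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless     # must match the headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: apache/kafka:3.7.0     # placeholder image/tag
          ports:
            - containerPort: 9092
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:               # one PersistentVolumeClaim per broker
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

The `volumeClaimTemplates` section is what ties each broker's identity to its own persistent volume across rescheduling.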

Best Practices for Kafka on Kubernetes

Using Separated Storage and Compute in Kafka for Better Operations and Scaling

Kubernetes is primarily designed for cloud-native, stateless applications. The main challenge of running Kafka on Kubernetes is that its architecture couples compute and storage and depends heavily on local disks, which makes Kafka difficult to manage and scale on Kubernetes. As the Kafka ecosystem evolves, you can now choose next-generation storage-compute separated Kafka solutions like AutoMQ. AutoMQ is built entirely on S3, with complete separation of compute and storage; its stateless brokers significantly reduce the management complexity of Kafka on Kubernetes.

High Availability Configuration

For fault tolerance and high availability, several strategies should be implemented:

  1. Deploy Kafka brokers across multiple availability zones to protect against zone failures

  2. Configure a replication factor of at least 2 (3 is the common production default) for each partition to ensure data durability[4]

  3. Use pod anti-affinity rules to distribute Kafka brokers across different nodes

  4. Implement proper leader election strategies to minimize downtime during failures[17]

Kubernetes adds an additional layer of availability by automatically recovering failed pods and placing them on new nodes. Configuring liveness and readiness probes, using the Horizontal Pod Autoscaler (HPA), and enabling cluster autoscaling further improve the durability of Kubernetes-based Kafka clusters[4].
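The anti-affinity and probe recommendations above can be combined in the broker pod template. This fragment is an illustrative sketch: the port, labels, and probe timings are assumptions to adapt, not prescribed values:

```yaml
# Hypothetical pod template fragment: spread brokers across nodes and
# let Kubernetes detect unhealthy brokers via TCP probes.
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: kafka
          topologyKey: kubernetes.io/hostname   # at most one broker per node
  containers:
    - name: kafka
      readinessProbe:
        tcpSocket:
          port: 9092
        initialDelaySeconds: 30
        periodSeconds: 10
      livenessProbe:
        tcpSocket:
          port: 9092
        initialDelaySeconds: 60
        periodSeconds: 15
```

Using `topologyKey: topology.kubernetes.io/zone` instead spreads brokers across availability zones rather than individual nodes.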

Resource Management and Performance Tuning

Proper resource allocation is critical for Kafka performance on Kubernetes:

  1. Set appropriate CPU and memory requests and limits in Kubernetes manifests to prevent resource contention

  2. Configure JVM heap size according to available container memory (typically 50-70% of container memory)[10]

  3. Adjust producer settings like batch size, linger time, and compression to optimize throughput

  4. Optimize consumer configurations including fetch size and max poll records[10]
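Points 1 and 2 above can be sketched in a container spec. Kafka's startup scripts honor the `KAFKA_HEAP_OPTS` environment variable for heap sizing; the specific CPU/memory numbers here are illustrative assumptions:

```yaml
# Hypothetical container fragment: requests/limits plus a fixed heap
# sized to roughly half of container memory, leaving the remainder
# for the operating system's page cache.
containers:
  - name: kafka
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        memory: 8Gi      # hitting the memory limit OOM-kills the pod
    env:
      - name: KAFKA_HEAP_OPTS    # read by Kafka's startup scripts
        value: "-Xms4g -Xmx4g"
```

Setting `-Xms` equal to `-Xmx` avoids heap resizing pauses, and deliberately leaving memory headroom matters because Kafka serves most reads from the page cache rather than the JVM heap.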

It's important to note that Kafka relies heavily on the filesystem cache for performance. On Kubernetes, where multiple containers run on a node accessing the filesystem, this means less cache is available for Kafka, potentially affecting performance[13].

Storage Configuration

Kafka's performance and reliability depend significantly on storage configuration:

  1. Use persistent volumes for data retention to maintain data across pod rescheduling

  2. Select appropriate storage class based on performance requirements

  3. Consider volume replication for faster recovery after node failures

  4. Implement proper storage monitoring to detect and address issues proactively[4]
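As one way to apply points 1 and 2, a dedicated StorageClass can back the brokers' volume claims. The provisioner shown is GCE's CSI driver purely as an example; on other platforms the provisioner and parameters differ:

```yaml
# Hypothetical SSD-backed StorageClass for Kafka volumes
# (provisioner and parameters are cloud-specific; GKE shown here).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kafka-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
allowVolumeExpansion: true              # lets claims grow in place
volumeBindingMode: WaitForFirstConsumer # bind in the scheduled pod's zone
```

The brokers' `volumeClaimTemplates` would then reference it via `storageClassName: kafka-ssd`. `WaitForFirstConsumer` matters in multi-zone clusters: it delays volume creation until the pod is scheduled, so the volume lands in the same zone as the broker.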

When a Kafka broker fails and moves to another node, access to its data is critical. Without proper storage configuration, the new broker might need to replicate data from scratch, resulting in higher I/O and increased cluster latency during the rebuild[4].

Network Configuration and Connectivity

Networking is perhaps the most challenging aspect of running Kafka on Kubernetes:

  1. Use headless services for broker discovery within the cluster

  2. Configure advertised listeners correctly for both internal and external communication

  3. Address the "bootstrap server" challenge for external clients

  4. Consider using NodePort or LoadBalancer services for external access[3][13]

A common challenge occurs when a producer outside the Kubernetes cluster attempts to connect to Kafka brokers. The broker might return internal Pod IPs that aren't accessible externally, or return no IP at all. Solutions include properly configuring external access through services and correctly setting up advertised listeners[13].
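A hedged sketch of the dual-listener setup described above, using the `KAFKA_*` environment-to-configuration convention common to popular Kafka container images (verify the exact variable names against your image). The external address and ports are placeholders; in practice the per-broker external address is usually injected dynamically, e.g. by an operator:

```yaml
# Illustrative broker env fragment: one listener for in-cluster
# clients, one advertised at an externally reachable address.
env:
  - name: KAFKA_LISTENERS
    value: "INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9094"
  - name: KAFKA_ADVERTISED_LISTENERS
    # placeholder external address, e.g. a NodePort-reachable node IP
    value: "INTERNAL://kafka-0.kafka-headless:9092,EXTERNAL://203.0.113.10:31094"
  - name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
    value: "INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT"
  - name: KAFKA_INTER_BROKER_LISTENER_NAME
    value: "INTERNAL"
```

The key point is that `advertised.listeners` is what the broker hands back to clients after the bootstrap connection, so it must contain addresses the client can actually reach.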

Security Implementation

Security for Kafka on Kubernetes should be implemented at multiple levels:

  1. Encrypt data in transit using TLS/SSL

  2. Implement authentication using SASL or mutual TLS

  3. Configure authorization with Access Control Lists (ACLs)

  4. Use Kubernetes secrets for credential management

  5. Implement network policies to control traffic flow[11]

Kafka clusters should be isolated within private networks, with strict firewall rules limiting inbound and outbound traffic. Only necessary ports should be exposed, and all communication should be encrypted[11].
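Point 5 above can be sketched with a NetworkPolicy that admits only labeled client pods to the broker port. The labels and port here are illustrative assumptions:

```yaml
# Hypothetical NetworkPolicy: only pods labeled as Kafka clients
# may reach the brokers on the client port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kafka-allow-clients
spec:
  podSelector:
    matchLabels:
      app: kafka          # applies to the broker pods
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              kafka-client: "true"
      ports:
        - protocol: TCP
          port: 9092
```

Note that NetworkPolicies are default-deny only for the traffic directions they select: once this policy selects the broker pods for ingress, any traffic not matching a rule is dropped.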

Common Challenges and Solutions

Running Apache Kafka directly on Kubernetes presents several challenges. To address these issues, we recommend following best practices and considering next-generation solutions like AutoMQ, which use storage-compute separation and shared storage architecture.

Managing Stateful Workloads on Kubernetes

Running stateful applications like Kafka on Kubernetes presents unique challenges:

  1. Ensuring persistent identity and storage for Kafka brokers

  2. Handling pod rescheduling without data loss

  3. Managing upgrades without service disruption[4]

To address these challenges, use StatefulSets and headless Services. StatefulSets provide a stable identity for each pod, so if a pod is rescheduled it keeps the same network identity (its DNS hostname) rather than receiving a random new one. This is crucial because the address on which clients connect to brokers should remain consistent[4].

Handling Scaling Operations

Scaling Kafka on Kubernetes requires careful planning:

  1. Properly configure partition reassignment during scaling to redistribute load

  2. Manage leader rebalancing to prevent performance degradation

  3. Plan for increased network traffic and disk I/O during scaling operations[16]

When increasing or decreasing the number of brokers, ensure that partitions are redistributed evenly to maintain balanced load across the cluster[16].
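With an operator, scaling the broker count is a declarative change. The sketch below follows the general shape of a Strimzi `Kafka` custom resource (a ZooKeeper-based example; newer Strimzi releases use KRaft and node pools, so check your version's schema). Editing `spec.kafka.replicas` adds brokers, but moving partitions onto them remains a separate step (e.g. Cruise Control or `kafka-reassign-partitions`):

```yaml
# Illustrative Strimzi-style Kafka resource, scaled from 3 to 5 brokers.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 5                 # was 3; the operator adds two brokers
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: persistent-claim
      size: 100Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
```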

Monitoring and Troubleshooting

Effective monitoring is essential for maintaining healthy Kafka clusters on Kubernetes:

  1. Implement comprehensive metrics collection using tools like Prometheus and Grafana

  2. Monitor key metrics including broker health, consumer lag, and partition status

  3. Set up alerts for critical conditions

  4. Collect and analyze logs for troubleshooting[7]

Common troubleshooting areas include connectivity issues between Kafka and monitoring tools, configuration problems, and resource constraints[7].
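As one hedged example of point 1, Prometheus can discover broker pods through Kubernetes service discovery. This assumes each broker exposes metrics via the Prometheus JMX exporter on port 9404; the label and port are assumptions to adapt to your setup:

```yaml
# Illustrative Prometheus scrape config for Kafka broker pods.
scrape_configs:
  - job_name: kafka
    kubernetes_sd_configs:
      - role: pod              # discover individual pods
    relabel_configs:
      # keep only pods labeled app=kafka
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: kafka
      # rewrite the scrape target to the JMX exporter port
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(?::\d+)?
        replacement: $1:9404
        target_label: __address__
```

From there, consumer lag is typically tracked with a dedicated exporter (such as a Kafka lag exporter) rather than broker JMX metrics alone.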

Pod Scheduling Affects Kafka's Performance

Apache Kafka's impressive throughput relies heavily on the operating system's page cache. Because the page cache lives in the node's kernel rather than inside the container, when a Pod moves between Nodes the cache must be re-warmed on the new Node[8], degrading Kafka's performance. This impact is particularly noticeable during peak business periods. As a result, Kafka users concerned about performance become reluctant to let Kafka broker Pods move freely between Nodes. But when Pods cannot move quickly and freely, Kubernetes loses much of its scheduling flexibility, and its orchestration advantages go underused. The figure below illustrates how disk reads from an un-warmed page cache affect Kafka performance during broker Pod movement.

Conclusion: Choosing the Right Approach

Deploying Kafka on Kubernetes offers significant benefits in terms of scalability, resilience, and operational efficiency. However, it requires careful planning and consideration of various factors including deployment method, resource allocation, networking, and storage.

The choice between Helm charts and operators depends on specific requirements:

  1. Operators provide more sophisticated management with automation of day-to-day operations, making them suitable for production environments with complex requirements

  2. Helm charts offer simplicity and flexibility, giving more control over the configuration but requiring more manual management[20]

Regardless of the deployment method chosen, following best practices for high availability, performance tuning, and security will ensure a robust Kafka deployment that can handle the demands of modern streaming applications. By understanding the unique challenges of running Kafka on Kubernetes and implementing appropriate solutions, organizations can build reliable, scalable streaming platforms that drive their data-intensive applications.

If you find this content helpful, you might also be interested in our product AutoMQ. AutoMQ is a cloud-native alternative to Kafka that decouples durability to S3 and EBS: 10x more cost-effective, no cross-AZ traffic cost, autoscaling in seconds, and single-digit-millisecond latency. AutoMQ's source code is now available on GitHub, and large companies worldwide are already using it. Check out the case studies to learn more.

References:

  1. Setting up Confluent Kafka on Kubernetes for High Availability

  2. Conduktor Gateway Kubernetes Tutorial

  3. Managing Clusters in K8s Streaming Data

  4. Expert's Guide to Kafka on Kubernetes

  5. Kafka on Kubernetes Guide

  6. Containerized Apache Kafka on Kubernetes with Viktor Gamov

  7. Troubleshooting Kafka Monitoring Setup in Kubernetes

  8. Kafka Architecture and Cluster Guide

  9. Best Practices for Deploying Confluent Kafka & Spring Boot Distributed SQL-based Streaming Apps on Kubernetes

  10. Kafka Performance Tuning Best Practices & Tips

  11. Securing Kafka in the Cloud: Best Practices and Considerations

  12. How to Deploy Kafka on Kubernetes

  13. Kafka on Kubernetes: Deployment Pros and Cons

  14. Using Helm with Strimzi

  15. KUDO Kafka Custom Configuration Guide

  16. Deploying and Scaling Apache Kafka on Amazon EKS

  17. Kafka High Availability on Kubernetes Guide

  18. The Expert's Guide to Running Apache Kafka on Kubernetes

  19. Top 7 Mistakes to Avoid When Integrating Kafka with Kubernetes

  20. Difference Between Kafka Operator vs Kafka Helm Chart

  21. Kafka on Kubernetes: Integration Strategies and Best Practices

  22. Red Hat Developer's Kafka on Kubernetes Tutorial

  23. Understanding Kafka on Kubernetes

  24. Kafka on Kubernetes Video Tutorial

  25. Getting Started with Conduktor on Kubernetes

  26. Running Redpanda on Kubernetes

  27. Running Stateful Workloads with Kafka on GKE

  28. Issues to Anticipate When Running Kafka on Kubernetes

  29. Conduktor Gateway on Kubernetes Guide

  30. Deploy Kafka on Kubernetes with Confluent and OpenShift on AWS

  31. Elastically Auto-scaling Confluent on Kubernetes

  32. Building Data Pipelines with Kubernetes and Go

  33. Getting Started with Apache Kafka on Kubernetes

  34. Design and Deployment Considerations for Apache Kafka on AWS

  35. Getting Started with Conduktor Platform Installation

  36. Deploying Redpanda Connectors on Kubernetes

  37. Don't Let Apache Kafka on Kubernetes Get You Fired

  38. Developing and Deploying Kafka Streams Applications on Kubernetes

  39. Apache Kafka Architecture, Deployment and Ecosystem Guide 2025

  40. How to Deploy Apache Kafka with Kubernetes