Kafka to Iceberg Without Tableflow | Open Source Guide

Kafka teams are being pulled toward the lakehouse for a practical reason: the most valuable operational events already pass through Kafka, while analysts, ML pipelines, and governance tools increasingly expect Apache Iceberg tables. That creates a deceptively small question with architectural consequences: how should Kafka topics become queryable Iceberg tables without turning the streaming layer into another vendor-controlled data estate?

Confluent Tableflow is attractive because it makes that bridge feel productized. In Confluent Cloud, a topic can be materialized as an Iceberg or Delta Lake table, with schema handling, CDC materialization, catalog publishing, and table maintenance packaged behind a managed feature. For teams already standardized on Confluent Cloud, that convenience has real value.

But convenience is not the same as inevitability. Iceberg is an open table format. Kafka Connect is an open integration framework. Flink can write streaming data into Iceberg with exactly-once semantics. If your goal is an open Kafka-to-Iceberg pipeline, Tableflow is one option, not the architecture itself.

[Figure: Open Kafka-to-Iceberg pipeline]

The question is not "Can Tableflow do this?" It can. The better question is "Where should the control plane for streaming-to-lakehouse data live?" If you want to keep connector choice, catalog choice, storage ownership, deployment model, and cloud account control in your own hands, an open stack is often the more durable path.

Why Kafka-to-Iceberg Matters

Kafka is where operational reality shows up first. Orders, clicks, payments, device telemetry, fraud signals, game events, logs, and CDC updates often arrive in Kafka before they reach a warehouse or lake. Iceberg solves a different problem: it gives object storage a table abstraction that query engines can understand, with schema evolution, hidden partitioning, time travel, snapshot isolation, and metadata-driven planning.

That pairing is powerful because it lets the same event stream serve two time horizons. Applications and stream processors consume Kafka for low-latency behavior. Analysts and AI pipelines consume Iceberg for repeatable queries over durable history. The architecture works best when each side stays good at its job instead of forcing Kafka to become a warehouse or forcing the lakehouse to behave like an event bus.

The hard part is not moving bytes from one place to another. The hard part is preserving the semantics that make the data useful:

  • Schema discipline: Iceberg tables need stable column names and types. Kafka topics may contain Avro, Protobuf, JSON Schema, schemaless JSON, or compacted CDC records with deletes and tombstones.
  • Write correctness: Streaming writers need commit coordination so query engines do not observe partial table updates.
  • Small-file control: Continuous ingestion can create many tiny files unless the pipeline manages file sizing and compaction.
  • Catalog consistency: Glue, Hive Metastore, REST catalogs, Polaris-compatible catalogs, Nessie, and vendor catalogs all impose operational choices.
  • Ownership boundaries: The storage bucket, IAM model, encryption keys, and data retention policy usually belong to the platform team, not to a single streaming product feature.

That is why the Kafka-to-Iceberg decision should be treated as platform architecture, not a connector checkbox. The topic is the front door. The table is the contract. Everything in between becomes production infrastructure.
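To make the write-correctness point concrete, here is a toy model of snapshot-style commits. It is a sketch only: `ToyTable`, `Snapshot`, `stage`, `commit`, and `scan` are hypothetical names invented for illustration, not Iceberg's API. The idea it shows is that writers can finish files at any time, while readers only see them once a single commit publishes them together.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable list of data files that readers are allowed to see."""
    snapshot_id: int
    files: tuple


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table with snapshot commits."""
    current: Snapshot = Snapshot(0, ())
    staged: list = field(default_factory=list)

    def stage(self, path: str) -> None:
        # Staging a finished file does not change what readers see.
        self.staged.append(path)

    def commit(self) -> Snapshot:
        # One pointer swap publishes every staged file together.
        self.current = Snapshot(
            self.current.snapshot_id + 1,
            self.current.files + tuple(self.staged),
        )
        self.staged = []
        return self.current

    def scan(self) -> tuple:
        # Readers only plan against the committed snapshot.
        return self.current.files


table = ToyTable()
table.stage("part-0.parquet")
table.stage("part-1.parquet")
before = table.scan()  # () because nothing has been committed yet
table.commit()
after = table.scan()   # both files become visible in the same commit
```

A real writer also has to handle retries, concurrent committers, and failures between staging and commit, which is the work that commit coordination in a sink takes on.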

What Tableflow Solves, and What It Centralizes

Tableflow packages a lot of pipeline work. Confluent's documentation describes it as a Confluent Cloud feature that materializes Kafka topics as Iceberg or Delta Lake tables and automates tasks such as schematization, type conversion, schema evolution, CDC stream materialization, catalog publishing, and table maintenance. Those are real jobs. Anyone who has operated a homegrown Kafka-to-lake pipeline knows the unpleasant parts: mis-sized files, schema drift, connectors that need babysitting, and downstream teams asking why yesterday's table snapshot looks different from the topic.

The tradeoff is that Tableflow also centralizes the pipeline inside Confluent Cloud. Its own documentation lists important boundaries: it is not available for Confluent Platform, it has cloud and cluster limitations, topics without schemas are unsupported, Confluent-managed storage has specific catalog limitations, and billing is based on topic-hours plus GB processed, with additional managed-storage charges when applicable. Those details do not make Tableflow bad. They make it a managed product with a product surface, a billing model, and an operational boundary.

That boundary matters most when your platform requirements look like this:

| Decision Area | Managed Tableflow Path | Open Kafka-to-Iceberg Path |
| --- | --- | --- |
| Kafka runtime | Confluent Cloud topic is the source | Apache Kafka-compatible runtime, including AutoMQ |
| Table writer | Managed by Tableflow | Kafka Connect Iceberg sink, Flink Iceberg sink, or custom writer |
| Storage | Confluent Managed Storage or supported BYOS patterns | Your object storage bucket and storage policy |
| Catalog | Supported integrations and Tableflow constraints | Glue, REST, Hive, Nessie, Polaris-compatible catalogs, or other supported Iceberg catalogs |
| Cost model | Product billing dimensions plus storage/request/data-transfer charges | Infrastructure, connector compute, catalog, and operational cost you can inspect |
| Portability | Convenient inside the product boundary | More work, but fewer assumptions tied to one vendor feature |

The open path asks you to operate more pieces. In exchange, it lets you choose those pieces deliberately. That matters for teams that already run Kafka at scale, own their data lake buckets, or need the same architecture across cloud providers and private environments.

The Open-Source Kafka-to-Iceberg Architecture

An open Kafka-to-Iceberg pipeline has five layers. The names vary by implementation, but the responsibilities do not.

First, Kafka remains the operational stream. Producers write events to topics, consumers and stream processors continue to use the Kafka protocol, and retention is sized for operational replay rather than analytical history. If you use AutoMQ as the Kafka-compatible layer, existing Kafka clients, Kafka Connect, Kafka Streams, and the surrounding ecosystem can keep using Kafka APIs while the storage layer changes underneath.

Second, schemas need an enforceable contract. The contract can come from Schema Registry-compatible tooling, application-level schemas, or a CDC source connector that emits well-structured records. The important point is that Iceberg tables should not inherit every accident that appears in a topic. A pipeline that writes malformed event payloads into a table has not created a lakehouse; it has created a slower incident report.

Third, a writer materializes the stream into Iceberg. The Apache Iceberg Kafka Connect sink is designed for writing Kafka records into Iceberg tables and includes features such as commit coordination, exactly-once delivery semantics, multi-table fan-out, automatic table creation, and schema evolution. Flink is another serious option when you need stream processing before the write; Iceberg's Flink integration supports batch and streaming writes, and its Flink sink guarantees exactly-once semantics.

Fourth, the table lands in object storage and is registered in a catalog. On AWS that might be S3 plus Glue or an Iceberg REST catalog. In other environments it might be GCS, Azure object storage, Hive Metastore, Nessie, or a catalog compatible with your governance system. Iceberg's value is precisely that multiple compute engines can read the same table abstraction, so the catalog choice should follow your broader data platform, not only the streaming product.

Fifth, query engines consume the table. Trino, Spark, Flink, Presto, Hive, Impala, and many other engines can participate in the Iceberg ecosystem. The streaming team does not need to own every downstream query engine, but it does need to publish tables with a predictable partitioning strategy, snapshot retention policy, and maintenance plan.

That architecture is straightforward, but not casual. You still need to handle deletes, late events, compaction, catalog credentials, upgrades, backfills, and disaster recovery. The difference is that each concern remains visible. You can replace the writer without replacing Kafka, change catalogs without changing producers, and run the pipeline in your cloud account.

Where AutoMQ Changes the Storage Economics

Most Kafka-to-Iceberg diagrams hide a quiet inefficiency: the same event may be copied through several storage systems before it becomes useful for analytics. Traditional Kafka stores replicated logs on broker-attached disks. A sink job reads those logs, writes Parquet files to object storage, and publishes Iceberg metadata. The lakehouse ends in object storage, but the streaming layer still pays the operational cost of local disks, broker state, partition reassignment, and cross-AZ replication.

AutoMQ changes that foundation by preserving the Kafka protocol while moving persistent stream data to object storage. The product documentation describes AutoMQ as a cloud-native, fully Kafka-compatible streaming platform built on S3, with stateless brokers and a storage-compute separated design. The open-source repository is under the Apache 2.0 license, and the public README notes a Table Topic feature that combines stream and table functionality with support for Apache Iceberg and catalog services such as AWS Glue, Hive Metastore, and REST catalogs.

[Figure: Shared object storage foundation]

This matters for Kafka-to-Iceberg because the streaming log and the analytical table now share the same architectural substrate: object storage. They are still different abstractions. A Kafka log is not the same thing as an Iceberg table, and pretending otherwise creates correctness problems. But the platform no longer has to treat object storage as a distant sink at the end of the pipeline. Object storage becomes the durable foundation for both the streaming tier and the lakehouse tier.

The practical benefits show up in three places:

  • Elastic operations: Stateless brokers reduce scaling and maintenance pain because data is not pinned to broker-local disks. AutoMQ's AviaGames case study describes eliminating rebalancing storms caused by forced MSK maintenance and achieving zero-downtime migration via Kafka Linking.
  • Cost transparency: AutoMQ's published benchmark compares a 1 GiB/s workload with 72-hour retention and reports materially lower monthly TCO than an Apache Kafka baseline under the stated assumptions. Treat the benchmark as a scenario, not a universal promise, but the mechanism is clear: less broker-local storage, less cross-AZ replication traffic, and more elastic capacity.
  • Data-control fit: AutoMQ Cloud supports BYOC deployment where resources run in the user's VPC, and the docs state that data does not leave the user's VPC. For lakehouse teams that already care about bucket ownership, IAM boundaries, and encryption policy, that deployment model is a better fit than sending the pipeline through an opaque external boundary.
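The storage side of that mechanism can be sanity-checked with simple arithmetic. The sketch below uses the 1 GiB/s and 72-hour figures from the benchmark scenario, but the replication factor of 3 is an assumption of this example (a common Kafka setting), and the function name is hypothetical. It estimates retained volume only and does not reproduce the benchmark's TCO figures.

```python
GIB_PER_TIB = 1024


def retained_tib(throughput_gib_per_s: float, retention_hours: float, copies: int = 1) -> float:
    """Data held at steady state, in TiB, for a given number of stored copies."""
    seconds = retention_hours * 3600
    return throughput_gib_per_s * seconds * copies / GIB_PER_TIB


logical = retained_tib(1.0, 72)               # 253.125 TiB of logical log data
replicated = retained_tib(1.0, 72, copies=3)  # 759.375 TiB if brokers keep 3 copies
```

Under these assumptions, broker-attached disks have to hold roughly three times the logical volume, and the extra copies typically have to cross availability zones to get there.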

AutoMQ should not be inserted into this architecture as a magic shortcut. You still need table design, writer semantics, catalog governance, and operational discipline. Its role is more specific: it removes the old assumption that Kafka must be a disk-heavy, stateful cluster sitting in front of an object-storage lakehouse. Once that assumption disappears, Kafka-to-Iceberg becomes a layered design on one cloud-native storage foundation.

Tableflow vs Open Stack: The Real Tradeoff

The Tableflow decision is easiest when you separate feature convenience from platform control. If your team is already committed to Confluent Cloud, uses supported schemas and catalogs, accepts the billing model, and wants a managed path for exposing topics as read-only tables, Tableflow can remove a lot of work. Buying that work from a vendor can be rational.

If your team runs self-managed Kafka or an Apache Kafka-compatible alternative, or has private networking requirements, custom catalog standards, or multi-cloud lakehouse infrastructure, the open path deserves a serious look. It is not "free" in the operational sense: you will operate Connect or Flink jobs, track Iceberg compatibility, tune files and commits, and own the catalog integration. In return, you will understand where failures happen and how to replace components when requirements change.

[Figure: Tableflow vs open stack comparison]

A useful decision rule is to ask where you want abstraction and where you want control:

  • Choose a managed materialization feature when speed, low operational effort, and product integration matter more than portability.
  • Choose an open pipeline when storage ownership, catalog flexibility, deployment location, and long-term architectural optionality matter more than a few-click setup.
  • Choose AutoMQ as the Kafka-compatible foundation when your bigger problem is not only Kafka-to-Iceberg conversion but also the cost and operational drag of running stateful Kafka in the cloud.

The point is not to turn every platform team into an Iceberg connector maintainer. The point is to avoid mistaking a managed feature for the only viable architecture. The lakehouse ecosystem is built on open formats because data outlives compute engines and product packaging. Your streaming-to-lakehouse pipeline should respect that same principle.

Implementation Checklist

Start with the table contract, not the connector. Define the Iceberg table schema, partition strategy, primary-key or identifier-field expectations, delete semantics, and snapshot retention policy. Then map Kafka topics into that contract. If the source events cannot support the table contract, fix the source or add a Flink processing layer before the Iceberg write.
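One way to make the table contract tangible is to write it down as data and check sample records against it before choosing a connector. This is a minimal sketch: `TableContract`, `violations`, and the `orders` fields are hypothetical, and a real pipeline would more likely enforce this through registered schemas than through Python type checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableContract:
    """Hypothetical description of what the Iceberg table promises."""
    columns: dict             # column name -> expected Python type
    identifier_fields: tuple  # columns that identify a row
    partition_field: str
    snapshot_retention_days: int


def violations(record: dict, contract: TableContract) -> list:
    """Return human-readable reasons a record does not fit the contract."""
    problems = []
    required = set(contract.identifier_fields) | {contract.partition_field}
    for name, expected in contract.columns.items():
        if record.get(name) is None:
            # Only identifier and partition columns are mandatory here.
            if name in required:
                problems.append(f"missing required field: {name}")
            continue
        if not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    for name in record:
        if name not in contract.columns:
            problems.append(f"unexpected field: {name}")
    return problems


orders = TableContract(
    columns={"order_id": str, "amount_cents": int, "event_date": str, "note": str},
    identifier_fields=("order_id",),
    partition_field="event_date",
    snapshot_retention_days=7,
)

good = {"order_id": "o-1", "amount_cents": 1250, "event_date": "2024-05-01"}
bad = {"amount_cents": "12.50", "event_date": "2024-05-01", "coupon": "X"}

good_problems = violations(good, orders)  # []
bad_problems = violations(bad, orders)    # three problems, listed in the test below
```

Records that fail this kind of check are the ones to fix at the source or route through a processing layer before the write.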

Choose the write path based on transformation needs:

  • Kafka Connect Iceberg sink: Best when records are already close to table-ready and you want a connector-driven pipeline. Validate exactly-once requirements, commit coordination, dynamic routing, schema evolution settings, and connector packaging.
  • Flink to Iceberg: Best when you need filtering, joining, enrichment, deduplication, CDC normalization, or custom stateful processing before writing the table.
  • Native table-topic capability: Worth evaluating when your Kafka-compatible platform can expose table semantics directly, but still test catalog compatibility, metadata behavior, and operational maturity before production use.
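For the connector-driven option, the sketch below builds a registration payload for the Apache Iceberg Kafka Connect sink. The property names reflect my reading of the Iceberg Kafka Connect documentation and should be verified against the connector version you deploy. The helper name, topic, table, task count, and catalog URL are all placeholders.

```python
import json


def iceberg_sink_config(topic: str, table: str, catalog_uri: str) -> dict:
    """Build a Kafka Connect payload (name + config) for an Iceberg sink."""
    return {
        "name": f"iceberg-sink-{topic}",
        "config": {
            "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
            "tasks.max": "2",
            "topics": topic,
            # Target table as namespace.table; placeholder value.
            "iceberg.tables": table,
            "iceberg.catalog.type": "rest",
            "iceberg.catalog.uri": catalog_uri,
            # Convenience switches; many teams prefer to disable them in
            # production and manage table DDL explicitly.
            "iceberg.tables.auto-create-enabled": "true",
            "iceberg.tables.evolve-schema-enabled": "true",
            # How often the coordinator commits; shorter intervals mean
            # fresher tables but more, smaller files.
            "iceberg.control.commit.interval-ms": "300000",
        },
    }


payload = iceberg_sink_config(
    "orders", "analytics.orders", "https://catalog.example.internal"
)
print(json.dumps(payload, indent=2))
```

The printed JSON has the name-plus-config shape that the Kafka Connect REST API accepts on its `/connectors` endpoint. Converter, credential, and warehouse settings are omitted and depend on your environment.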

Treat small files as a production concern. Streaming creates many incremental writes by nature. Decide where compaction runs, who owns failures, and how compaction interacts with query latency. The cleaner the write path, the less downstream query engines pay for ingestion convenience.
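A rough estimate shows how quickly files pile up. This sketch assumes a worst case where every writer task produces one file per touched partition on every commit. The function names and all input numbers are hypothetical.

```python
def files_per_day(commit_interval_s: int, writer_tasks: int, partitions_touched: int) -> int:
    """Upper-bound file count: one file per task per partition per commit."""
    commits = 86_400 // commit_interval_s
    return commits * writer_tasks * partitions_touched


def average_file_mib(daily_gib: float, file_count: int) -> float:
    """Mean data file size in MiB for a given daily volume."""
    return daily_gib * 1024 / file_count


count = files_per_day(commit_interval_s=300, writer_tasks=4, partitions_touched=3)
avg = average_file_mib(daily_gib=100, file_count=count)
# count is 3456 files per day; avg is about 29.6 MiB per file
```

At roughly 30 MiB per file, this hypothetical table sits well below Iceberg's default write target of 512 MB, which is the gap compaction has to close.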

Lock down security and ownership early. The object-storage bucket should have clear IAM boundaries, encryption policy, retention rules, and audit logging. The catalog needs a recoverable backup and an upgrade plan. If you use BYOC infrastructure such as AutoMQ Cloud BYOC, verify that the control plane, data plane, and object-storage permissions match your security model.

Backfill deliberately. Historical replay from Kafka into Iceberg may compete with live ingestion, and a CDC topic may require special handling to rebuild the latest table state. Run backfills through a separate job or controlled mode, publish a clear cutover timestamp, and validate row counts, key counts, and snapshot metadata before downstream teams depend on the table.
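The CDC case and the validation step can be sketched together. The example below folds a keyed change log into its latest state, treating a `None` value as a delete or tombstone, then compares that expected state with rows read back from the table. The record shape, function names, and sample data are assumptions for illustration. In practice the table rows would come from a query engine.

```python
def latest_state(cdc_records):
    """Fold (key, value) records in log order; a None value deletes the key."""
    state = {}
    for key, value in cdc_records:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


def backfill_report(expected: dict, table_rows: list, key_field: str) -> dict:
    """Compare expected keys with what the backfilled table actually holds."""
    table_keys = [row[key_field] for row in table_rows]
    return {
        "row_count_matches": len(table_rows) == len(expected),
        "duplicate_keys": len(table_keys) - len(set(table_keys)),
        "missing_keys": sorted(set(expected) - set(table_keys)),
        "unexpected_keys": sorted(set(table_keys) - set(expected)),
    }


log = [
    ("u1", {"id": "u1", "plan": "free"}),
    ("u2", {"id": "u2", "plan": "pro"}),
    ("u1", {"id": "u1", "plan": "pro"}),  # update overwrites the earlier u1
    ("u2", None),                         # tombstone removes u2
    ("u3", {"id": "u3", "plan": "free"}),
]
expected = latest_state(log)
table_rows = [{"id": "u1", "plan": "pro"}, {"id": "u3", "plan": "free"}]
report = backfill_report(expected, table_rows, "id")
```

A clean report has matching row counts, zero duplicate keys, and empty missing and unexpected lists. Anything else is a reason to hold the cutover.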

Finally, publish an operational contract. A Kafka-to-Iceberg pipeline is only successful when downstream users know what freshness to expect, what schema changes are allowed, how deletes are represented, who owns table maintenance, and what happens during incident recovery. Without that contract, the table may be open format but still operationally mysterious.

A Reference Architecture That Stays Open

For many teams, the most balanced architecture looks like this: AutoMQ or another Kafka-compatible runtime handles the streaming tier; Kafka Connect or Flink performs materialization; Iceberg stores table data and metadata on object storage; a catalog such as Glue or REST coordinates table discovery; Trino, Spark, Flink, or warehouse engines query the result.

That architecture is not as effortless as clicking an enable button, but it is honest about where the moving parts live. The streaming team keeps Kafka compatibility. The data platform team keeps bucket and catalog ownership. Query teams use Iceberg as an open table format rather than a vendor-specific export surface.

The result is a pipeline that can evolve. You can start with Kafka Connect and move heavy transformations into Flink later. You can adopt AutoMQ to reduce Kafka infrastructure drag without changing producers and consumers. You can add table maintenance automation when ingestion grows and change query engines without asking the streaming vendor to support every downstream preference.

That is the architectural promise of Kafka plus Iceberg: not a single product feature, but a clean contract between real-time operations and analytical history. Tableflow proves the demand is real. An open stack lets you meet that demand on your own terms.

FAQ

Is Confluent Tableflow required to write Kafka data to Apache Iceberg?

No. Tableflow is a managed Confluent Cloud feature for materializing Kafka topics as Iceberg or Delta Lake tables. Open alternatives include the Apache Iceberg Kafka Connect sink and Flink's Iceberg integration. The right choice depends on whether you prefer managed convenience or control over storage, catalog, deployment, and connector operations.

What is the best open-source Kafka-to-Iceberg connector?

For connector-driven ingestion, start with the Apache Iceberg Kafka Connect sink because it is developed and documented as part of the Apache Iceberg project and supports Iceberg-specific commit coordination and exactly-once delivery semantics. For pipelines that need joins, deduplication, enrichment, or CDC normalization, Flink writing to Iceberg is usually the stronger pattern.

Can AutoMQ replace Kafka in a Kafka-to-Iceberg pipeline?

AutoMQ is Kafka-compatible, so Kafka clients and ecosystem tools can continue using Kafka APIs while AutoMQ changes the storage architecture underneath. That makes it a candidate foundation for Kafka-to-Iceberg pipelines, especially when the team wants object-storage-backed streaming, stateless broker operations, BYOC deployment, and lower cloud infrastructure overhead under the right workload assumptions.

Does using Iceberg remove the need for Kafka retention?

No. Kafka retention and Iceberg table retention serve different purposes. Kafka retention supports operational replay, consumer recovery, and stream processing. Iceberg retention supports analytical history, reproducible queries, and table snapshots. You can shorten Kafka retention after reliable materialization, but you should not treat the Iceberg table as a drop-in replacement for Kafka's operational log.

What are the main risks in an open Kafka-to-Iceberg pipeline?

The main risks are schema drift, malformed CDC semantics, small-file accumulation, catalog misconfiguration, credential sprawl, and unclear ownership between streaming and data platform teams. These are manageable, but they need explicit design. A managed product hides some of this work; an open stack makes it visible.

When should a team choose Tableflow instead?

Choose Tableflow when you are already standardized on Confluent Cloud, your topics and catalogs fit its supported model, and you value managed materialization more than infrastructure portability. It is a productized path for a real problem. The open path is stronger when your architecture needs to span self-managed Kafka-compatible systems, BYOC environments, custom catalogs, or stricter data-control requirements.
