Apache Kafka
- Summary
-
Discussion
- Why do we need Kafka?
- What's the architecture of Apache Kafka?
- How is Kafka used in the context of messaging?
- How is Kafka used in the context of storage?
- Could you list a few companies that use Kafka and their respective use cases?
- What are the ways to interact with Kafka?
- What are some key components in Apache Kafka ecosystem?
- Could you describe some specific features of Kafka?
- Milestones
- References
- Further Reading
- Article Stats
- Cite As
Any change of state can be considered an event. Event streaming is really a sequence of events collected, stored and processed in a timely manner, often in real time. Apache Kafka is an open-source distributed event streaming platform.
Apache Kafka offers the following three primary capabilities for organizations to implement event streaming:
- To produce and consume event streams to and from a variety of systems
- To store these events for the configured time and in a reliable manner
- To process the event streams as they happen or in a batch-oriented way
Apache Kafka offers these capabilities in a distributed, scalable, fault-tolerant, and secure way.
Kafka can be deployed on bare-metal hardware, virtual machines, or containers, both on-premise or on the cloud. Deployments can be self-managed or fully managed by third-party vendors.
Discussion
-
Why do we need Kafka? In today's Internet applications, performing analytics based on user activity or device/sensor data has become a trend. The intent is to provide useful recommendations in real time. A few of such analytical applications are recommendations based on past user actions or advertisements to users based on their current proximity. The real-time use of these events collected from production applications and IoT systems has become a challenge because of the volume of data collected and processed. Apache Kafka works well for such use cases by unifying real-time and batch processing requirements.
Apache Kafka facilitates parallel processing in batch processing systems such as Hadoop. It also provides the ability to partition online consumption of data over a cluster of machines. Kafka has gained popularity because of its exceptional performance. This is because clients and servers communicate with a simple, high-performance, language agnostic TCP protocol.
Kafka is scalable, reliable, and robust. It's now trusted by 20,360 companies.
-
What's the architecture of Apache Kafka? A producer in Kafka architecture is a client application that produces events or messages to a Kafka topic. A topic is similar to a directory in an operating system's filesystem, and the messages are the files in that directory. Kafka topics are created on a Kafka broker that operates as a Kafka server. A Kafka cluster consists of one or more Kafka brokers running Kafka. It's the Kafka broker where the produced messages are stored.
A consumer subscribes to one or more interested Kafka topics to get the messages. A consumer group is a set of consumers who coordinate to consume data from the subscribed topics. A message within a topic is consumed by a single consumer within the group.
A topic can be partitioned to spread the messages over several buckets in the Kafka brokers. Every partition holds the messages in an ordered, immutable sequence. Brokers and consumers use Zookeeper to get the Kafka server's state information and track offsets of the messages consumed. An offset is a unique sequential number assigned to a message in the topic partition.
-
How is Kafka used in the context of messaging? In the context of messaging systems, Kafka acts as a message broker that enables services, applications, and systems to communicate with each other and exchange information, even if they're written in different languages or implemented on different platforms.
A design principle of Kafka is to decouple producers and consumers via the message brokers. Broker buffers unprocessed messages. Producers don't wait for consumers to read messages. This decoupling is a key aspect of Kafka's scalability.
Consider an e-commerce site. An order service validates an order and sends a message to the Order Validated topic. A payment service is a consumer of this topic. It processes the payment and sends a message to the Payment Processed topic. Others services who are consumers of this topic continue the workflow. Due to Kafka's decoupling, each service does its work without even being aware of other services.
Kafka has better throughput, built-in partitioning, replication, and fault-tolerance when compared to other message broker solutions in the market. This makes Apache Kafka an ideal solution for large-scale message processing applications.
-
How is Kafka used in the context of storage? Kafka is different from traditional messaging systems, where the messages get deleted once consumed by the consumers. Kafka can be configured to store the messages in the Kafka broker for a particular period. Theoretically, this retention period can be set to retain the data forever.
However, Kafka can't be used as a traditional database as its core library lacks support for random data lookup based on joins or
where
conditions. However, ksqlDB, a library in the Kafka ecosystem, gives the developer the ability to build event streaming applications by leveraging the familiarity with relational databases. -
Could you list a few companies that use Kafka and their respective use cases? The following information is sourced from Apache's Confluence page:
- LinkedIn: Kafka is applied for streaming user activity data and operational metrics. LinkedIn Newsfeed and LinkedIn Today are two products that use this.
- DataSift: Kafka is used as a collector to monitor events and as a tracker of users’ consumption of data streams in real time.
- Simple: Kafka is used for log aggregation and to power their analytics infrastructure.
- Foursquare: Kafka powers online-to-online and online-to-offline messaging. It's used to integrate Foursquare's monitoring and production systems with Foursquare's Hadoop-based offline infrastructures.
- SocialTwist: Kafka is used in SocialTwist as part of their reliable email queueing system.
- Hotels.com: Hotels.com uses Kafka to collect real-time events from multiple sources and sends them to HDFS.
- Cisco: At Cisco, Kafka is used as part of their OpenSOC (Open Security Operations Center) project.
-
What are the ways to interact with Kafka? A developer can interact with Kafka via command-line scripts or APIs.
Command-line scripts are used to:
- Make ad hoc requests to a Kafka cluster such as starting up a Kafka service, creating topics in Kafka brokers, etc.
- Perform list operations such as listing the topics or the consumer groups created in the cluster, etc.
- Monitor consumer group lags or offsets of each partition in a Kafka topic
- Produce messages to a topic and consume messages from a topic
Kafka offers the following core APIs with which developers can build their client applications:
- The Producer API allows applications to send messages to topics in the Kafka cluster
- The Consumer API allows applications to consume messages from topics in the Kafka cluster
- The Streams API allows transforming messages from one topic to another
- The Connect API allows implementing connectors to pull from some source system into Kafka or push from Kafka into some sink system
- The Admin API allows managing and examining topics, brokers, and other Kafka entities
-
What are some key components in Apache Kafka ecosystem? The ecosystem around Apache Kafka is quite vast. There are plenty of libraries and frameworks to help developers integrate third-party software, log and monitor clusters, collect metrics, and distribute or package solutions. We mention key ones that help in creating enterprise-level solutions:
- Kafka Connect: Streams or batch transfers data to and from Kafka, with ready-to-use connectors.
- Kafka Streams: A client library for creating applications and microservices, where the input and output data are persisted in a Kafka cluster.
- Schema Registry: Persists the schema structure of an event that is pushed to and consumed from a topic. Helps to validate the events against the registered schema.
- Kafka REST Proxy: Offers a RESTful interface to a Kafka cluster. It facilitates another way with which the applications can interact with the Kafka cluster without using the Kafka client libraries or Kafka's native protocol.
-
Could you describe some specific features of Kafka? Replication in Kafka assures that the events will be published and consumed even in the case of broker failure by maintaining specified number of copies of data across various brokers in the Kafka cluster. The unit of replication is the partition. The replication factor configuration is set on a topic level. Kafka replication feature has been available since Kafka version 0.8.0.
Retention happens at Kafka brokers, which are configured with a default retention period per topic. The default is 7 days. It can be set using the topic-level config token
retention.ms
. The admin can also configure it by size (in bytes) using the topic-level config tokenretention.bytes
. Once these limits are reached, the corresponding events are expired and deleted. However, the developers or Kafka admin can choose Log Compaction by setting the tokenlog.cleanup.policy = compact
. This tells Kafka to retain only the last message produced for each message key in a topic. This can be useful for changelog type of data, where we need to retain only the last update.
Milestones
2011
2013
Apache Kafka v0.8.0 is released. In addition to bug fixes and other improvements, this adds a key feature called intra-cluster replication support. Also for the first time, the Kafka release JAR file gets published to a public Maven repository so that developers can conveniently download and use the software.
2014
2015
2017
2018
2019
2020
2022
2022
References
- Apache Confluence. 2016. "Powered By." Wiki, Confluence, Apache Software Foundation, July 25. Accessed 2022-01-14.
- Apache Confluence. 2017. "System Tools." Wiki, Confluence, Apache Software Foundation, October 27. Accessed 2022-01-14.
- Apache Confluence. 2021. "Ecosystem." Wiki, Confluence, Apache Software Foundation, February 19. Accessed 2022-02-01.
- Apache Incubator. 2022. "Kafka Incubation Status." Apache Incubator, Apache Software Foundation. Accessed 2022-01-13
- Apache Kafka. 2013. "Release Notes - Kafka - Version 0.8.0." Release Notes, v0.8.0, Apache Kakfa, December 3. Accessed 2022-01-13.
- Apache Kafka. 2015. "Release Notes - Kafka - Version 0.8.2.0." Release Notes, v0.8.2.0, Apache Kakfa, February 2. Accessed 2022-01-13.
- Apache Kafka. 2018. "Release Notes - Kafka - Version 2.0.0." Release Notes, v2.0.0, Apache Kakfa, July 30. Accessed 2022-01-13.
- Apache Kafka. 2019. "Release Notes - Kafka - Version 2.4.0." Release Notes, v2.4.0, Apache Kakfa, December 16. Accessed 2022-01-13.
- Apache Kafka. 2020. "Release Notes - Kafka - Version 2.6.0." Release Notes, v2.6.0, Apache Kakfa, August 3. Accessed 2022-01-13.
- Apache Kafka. 2022. "Release Notes - Kafka - Version 3.1.0." Release Notes, v3.1.0, Apache Kakfa, January 24. Accessed 2022-01-29.
- Apache Kafka. 2022a. "Apache Kafka Quickstart." Apache Kafka, Apache Software Foundation. Accessed 2022-01-15.
- Apache Kafka. 2022b. "Kafka - Download." Apache Kakfa, Apache Software Foundation, January 24. Accessed 2022-01-29.
- Apache Kafka. 2022c. "Introduction." Apache Kafka, Apache Software Foundation. Accessed 2022-01-13.
- Apache Kafka. 2022d. "Homepage." Apache Kafka, Apache Software Foundation. Accessed 2022-01-13.
- Apache Kafka. 2023. "KIP-833: Mark KRaft as Production Ready." Wiki, Apache, January 12. Accessed 2023-02-26.
- Apache Kafka Docs. 2022a. "Apache Kafka Documentation." Kafka 3.1, Apache Software Foundation. Accessed 2022-01-13.
- Buick, Benjamin. 2021. "How to implement Batch Processing with Apache Kafka." Xeotek Blog, May 4. Accessed 2022-01-14.
- Cloudera Docs. 2022. "kafka-consumer-groups." Documentation, Cloudera Runtime 7.2.10, Cloudera. Accessed 2022-01-15
- Cloudurable. 2017. "Kafka Tutorial: Using Kafka from the Command line." Blog, Cloudurable, May 13. Accessed 2022-01-15.
- Cloudurable. 2017a. "Kafka Consumer Architecture - Consumer Groups and subscriptions." Blog, Cloudurable, May 12. Accessed 2022-02-01.
- Cloudurable. 2017b. "The Kafka Ecosystem - Kafka Core, Kafka Streams, Kafka Connect, Kafka REST Proxy, and the Schema Registry." Blog, Cloudurable, May 11. Accessed 2022-02-01.
- Confluent Docs. 2022a. "Schema Registry Overview." Documentation, Confluent Platform v7.0.1, Confluent. Accessed 2022-01-15.
- Confluent Docs. 2022b. "Confluent REST APIs." Documentation, Confluent Platform v7.0.1, Confluent. Accessed 2022-01-15.
- Daza, Lucio. 2020. "Using Kafka as a Temporary Data Store and Data-loss Prevention Tool in The Data Lake." Towards Data Science, on Medium, June 6. Accessed 2022-01-14
- Dearden, Nick. 2019. "Architecture Patterns for Event Streaming." Slides, London 2019 Confluent Streaming Event, November. Accessed 2022-01-12.
- Enlyft. 2022. "Companies using Apache Kafka." Enlyft. Accessed 2022-01-14.
- Garg, Nishant. 2015. "Learning Apache Kafka." Second Edition, February. Packt Publishing. Accessed 2022-01-13.
- IBM. 2020. "Message Brokers." IBM Cloud Learn Hub, IBM Corporation, January 23. Accessed 2022-01-14
- Itzkovich, Sefi. 2019. "Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?" Otonomo, May 20. Accessed 2022-01-14
- Primack, Dan. 2014. "LinkedIn engineers spin out to launch ‘Kafka’ startup Confluent." Fortune, November 6. Accessed 2022-01-13.
- Rao, Jun. 2011. "Open-sourcing Kafka, LinkedIn's distributed message queue." LinkedIn Blog, January 11. Accessed 2022-01-13.
- Sekhar, Neethu C. 2018. "Apache Kafka: A Framework for Handling Real-Time Data Feeds." OpenSource For You, August 10. Accessed 2022-01-29.
- Shafer, Ian C. 2017. "Apache Kafka reaches milestone with version 1.0.0." SD Times, November 1. Accessed 2022-01-13.
- Stopford, Ben. 2018. "Designing Event-Driven Systems." First Edition, April. O'Reilly Media. Accessed 2022-02-01.
- TIBCO Software. 2022. "What is Event Streaming?" Reference Center, TIBCO Software. Accessed 2022-01-12.
- Vanlightly, Jack. 2018. "RabbitMQ vs Kafka Part 6 - Fault Tolerance and High Availability with Kafka." Blog, September 5. Accessed 2022-02-01.
- jsancio. 2022. "What's New in Apache Kafka 3.3." Blog, Apache Kafka, October 3. Accessed 2023-02-26.
- ksqldb. 2022. "ksqlDB" ksqldb. Accessed 2022-01-14.
Further Reading
- Apache Kafka Docs. 2022a. "Apache Kafka Documentation." Kafka 3.1, Apache Software Foundation. Accessed 2022-01-13.
- Confluent Docs. 2022. "What is Confluent Platform?" Documentation, Confluent Platform v7.0.1, Confluent. Accessed 2022-01-15.
- Stopford, Ben. 2018. "Designing Event-Driven Systems." First Edition, April. O'Reilly Media. Accessed 2022-02-01.