Apache Kafka

Article Info

Contributed by
2 authors

Last updated on
2023-02-26 03:45:23

ksqlDB
Kafka Connect
Kafka Streams
Kafka Schema Registry
Kafka REST Proxy
Event Streaming

Kafka, an event streaming platform. Source: Dearden 2019, slide 19.

Any change of state can be considered an event. Event streaming is the practice of capturing, storing and processing a sequence of such events in a timely manner, often in real time. Apache Kafka is an open-source distributed event streaming platform.

Apache Kafka offers the following three primary capabilities for organizations to implement event streaming:

- To produce and consume event streams to and from a variety of systems
- To store these events for the configured time and in a reliable manner
- To process the event streams as they happen or in a batch-oriented way

Apache Kafka offers these capabilities in a distributed, scalable, fault-tolerant, and secure way.

Kafka can be deployed on bare-metal hardware, virtual machines, or containers, either on-premises or in the cloud. Deployments can be self-managed or fully managed by third-party vendors.

Discussion

Why do we need Kafka?
In today's Internet applications, it has become common to perform analytics on user activity or device/sensor data. The intent is often to provide useful recommendations in real time. Examples include recommendations based on past user actions and advertisements shown to users based on their current proximity. Using these events in real time is challenging because of the volume of data that production applications and IoT systems generate. Apache Kafka works well for such use cases by unifying real-time and batch processing requirements.
Apache Kafka facilitates parallel processing in batch processing systems such as Hadoop. It also provides the ability to partition online consumption of data over a cluster of machines. Kafka has gained popularity in part because of its performance: clients and servers communicate over a simple, high-performance, language-agnostic TCP protocol.
Kafka is scalable, reliable, and robust. It's now trusted by 20,360 companies.
What's the architecture of Apache Kafka?
Kafka single broker architecture. Source: Garg 2015, fig. 3.1.
A producer in Kafka architecture is a client application that publishes events or messages to a Kafka topic. A topic is similar to a directory in an operating system's filesystem, and the messages are the files in that directory. Topics are created on a Kafka broker, which operates as a Kafka server and stores the produced messages. A Kafka cluster consists of one or more brokers.
A consumer subscribes to one or more Kafka topics of interest to get the messages. A consumer group is a set of consumers that coordinate to consume data from the subscribed topics. Each message within a topic is consumed by a single consumer within the group.
A topic can be partitioned to spread the messages over several buckets in the Kafka brokers. Every partition holds the messages in an ordered, immutable sequence. An offset is a unique sequential number assigned to a message in the topic partition. In the architecture shown, brokers and consumers use ZooKeeper to get the Kafka server's state information and to track offsets of the messages consumed. Later releases move consumer offset management into Kafka itself (v0.8.2.0) and remove the ZooKeeper dependency through KRaft mode (v3.3.0).
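To make topics, partitions, offsets and consumer groups more concrete, here is a minimal in-memory sketch. It's a toy model and not Kafka's API: the class `ToyTopic`, the function `assign_partitions`, the topic name and the use of CRC32 for hashing are all assumptions made for illustration. Kafka's own partitioner and group coordination are more involved.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a partitioned topic (hypothetical, not Kafka's API)."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # A stable hash sends the same key to the same partition every time.
        # CRC32 is only a stand-in for whatever hash a real partitioner uses.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        # Reading never removes messages; it returns everything from offset on.
        return self.partitions[partition][offset:]


def assign_partitions(num_partitions, consumers):
    """Give each partition to exactly one consumer of a group, round-robin."""
    assignment = defaultdict(list)
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return dict(assignment)


topic = ToyTopic("user-clicks", num_partitions=3)
first = topic.append("user-42", "login")
second = topic.append("user-42", "view-item")
print(first, second)
print(assign_partitions(3, ["consumer-a", "consumer-b"]))
```

Both messages for `user-42` land in the same partition with consecutive offsets, which is why ordering is guaranteed within a partition but not across a whole topic.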
How is Kafka used in the context of messaging?
Kafka for messaging. Source: Itzkovich 2019.
In the context of messaging systems, Kafka acts as a message broker that enables services, applications, and systems to communicate with each other and exchange information, even if they're written in different languages or implemented on different platforms.
A design principle of Kafka is to decouple producers and consumers via the message brokers. The broker buffers unprocessed messages, so producers don't wait for consumers to read them. This decoupling is a key aspect of Kafka's scalability.
Consider an e-commerce site. An order service validates an order and sends a message to the Order Validated topic. A payment service is a consumer of this topic. It processes the payment and sends a message to the Payment Processed topic. Other services that are consumers of this topic continue the workflow. Due to Kafka's decoupling, each service does its work without even being aware of other services.
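The e-commerce workflow can be sketched with a toy in-memory broker. This is only an illustration of decoupling under simplifying assumptions: `ToyBroker`, the service functions, the topic names `order-validated` and `payment-processed`, and the message fields are all hypothetical, and there is no networking, partitioning or persistence.

```python
from collections import defaultdict


class ToyBroker:
    """Toy broker: topics are append-only lists, offsets are tracked per group."""

    def __init__(self):
        self.topics = defaultdict(list)
        self.offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        # The producer returns immediately; it doesn't wait for any consumer.
        self.topics[topic].append(message)

    def poll(self, group, topic):
        # Return this group's unread messages and advance its offset.
        start = self.offsets[(group, topic)]
        messages = self.topics[topic][start:]
        self.offsets[(group, topic)] = start + len(messages)
        return messages


def order_service(broker, order):
    if order["quantity"] > 0:  # trivial stand-in for order validation
        broker.publish("order-validated", order)


def payment_service(broker):
    for order in broker.poll("payment-service", "order-validated"):
        broker.publish("payment-processed", {"order_id": order["order_id"], "paid": True})


def shipping_service(broker):
    return [m["order_id"] for m in broker.poll("shipping-service", "payment-processed")]


broker = ToyBroker()
order_service(broker, {"order_id": 1, "quantity": 2})
order_service(broker, {"order_id": 2, "quantity": 1})
# Messages wait in the broker until the payment service gets around to them.
payment_service(broker)
shipped = shipping_service(broker)
print(shipped)
```

Each service only knows the topics it reads from and writes to. A new consumer group, such as an audit service, could read `order-validated` from the start without any change to the order service.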
Compared to other message broker solutions on the market, Kafka offers better throughput along with built-in partitioning, replication, and fault tolerance. This makes Apache Kafka well suited to large-scale message processing applications.
How is Kafka used in the context of storage?
Kafka is different from traditional messaging systems, where the messages get deleted once consumed by the consumers. Kafka can be configured to store the messages in the Kafka broker for a particular period. Theoretically, this retention period can be set to retain the data forever.
However, Kafka can't be used as a traditional database since its core library lacks support for random data lookups based on joins or WHERE conditions. That said, ksqlDB, part of the Kafka ecosystem, lets developers build event streaming applications using concepts familiar from relational databases.
Could you list a few companies that use Kafka and their respective use cases?
The following information is sourced from Apache's Confluence page:
- LinkedIn: Kafka is applied for streaming user activity data and operational metrics. LinkedIn Newsfeed and LinkedIn Today are two products that use this.
- DataSift: Kafka is used as a collector to monitor events and as a tracker of users’ consumption of data streams in real time.
- Simple: Kafka is used for log aggregation and to power their analytics infrastructure.
- Foursquare: Kafka powers online-to-online and online-to-offline messaging. It's used to integrate Foursquare's monitoring and production systems with Foursquare's Hadoop-based offline infrastructures.
- SocialTwist: Kafka is used in SocialTwist as part of their reliable email queueing system.
- Hotels.com: Kafka is used to collect real-time events from multiple sources and send them to HDFS.
- Cisco: At Cisco, Kafka is used as part of their OpenSOC (Open Security Operations Center) project.
What are the ways to interact with Kafka?
A developer can interact with Kafka via command-line scripts or APIs.
Command-line scripts are used to:
- Make ad hoc requests to a Kafka cluster such as starting up a Kafka service, creating topics in Kafka brokers, etc.
- Perform list operations such as listing the topics or the consumer groups created in the cluster, etc.
- Monitor consumer group lags or offsets of each partition in a Kafka topic
- Produce messages to a topic and consume messages from a topic
Kafka offers the following core APIs with which developers can build their client applications:
- The Producer API allows applications to send messages to topics in the Kafka cluster
- The Consumer API allows applications to consume messages from topics in the Kafka cluster
- The Streams API allows transforming messages from one topic to another
- The Connect API allows implementing connectors to pull from some source system into Kafka or push from Kafka into some sink system
- The Admin API allows managing and examining topics, brokers, and other Kafka entities
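As an illustration of the producer and consumer roles, here is a minimal sketch that assumes the third-party `kafka-python` client, which is one of several non-Java clients and not part of Apache Kafka itself. The broker address `localhost:9092`, the topic `orders`, the group `order-readers` and the helper names are assumptions. The client import is done lazily inside the functions, so the JSON helpers can be used even where the library or a broker isn't available.

```python
import json


def encode_event(event):
    """Serialize a dict to UTF-8 JSON bytes for use as a message value."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def decode_event(raw):
    """Inverse of encode_event."""
    return json.loads(raw.decode("utf-8"))


def produce_order(order, topic="orders", servers="localhost:9092"):
    # Lazy import: requires the kafka-python package and a reachable broker.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=servers, value_serializer=encode_event)
    # Using the order id as the key keeps one order's messages in one partition.
    producer.send(topic, key=str(order["order_id"]).encode("utf-8"), value=order)
    producer.flush()  # block until buffered messages have been sent
    producer.close()


def consume_orders(topic="orders", servers="localhost:9092", group="order-readers"):
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        group_id=group,
        auto_offset_reset="earliest",  # start from the oldest retained message
        value_deserializer=decode_event,
        consumer_timeout_ms=5000,  # stop iterating after 5 s without messages
    )
    try:
        for record in consumer:
            print(record.partition, record.offset, record.value)
    finally:
        consumer.close()
```

With a broker running, calling `produce_order({"order_id": 1, "item": "book"})` followed by `consume_orders()` should print the partition, offset and decoded value of each message read.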
What are some key components in Apache Kafka ecosystem?
The Apache Kafka ecosystem. Source: Cloudurable 2017b.
The ecosystem around Apache Kafka is quite vast. There are plenty of libraries and frameworks to help developers integrate third-party software, log and monitor clusters, collect metrics, and distribute or package solutions. We mention key ones that help in creating enterprise-level solutions:
- Kafka Connect: Streams or batch transfers data to and from Kafka, with ready-to-use connectors.
- Kafka Streams: A client library for creating applications and microservices, where the input and output data are persisted in a Kafka cluster.
- Schema Registry: Persists the schema structure of an event that is pushed to and consumed from a topic. Helps to validate the events against the registered schema.
- Kafka REST Proxy: Offers a RESTful interface to a Kafka cluster. It gives applications another way to interact with the cluster without using the Kafka client libraries or Kafka's native protocol.
Could you describe some specific features of Kafka?
Replication in Apache Kafka. Source: Vanlightly 2018.
Replication in Kafka ensures that events can be published and consumed even if a broker fails. It does this by maintaining a specified number of copies of the data across brokers in the Kafka cluster. The unit of replication is the partition. The replication factor is configured at the topic level. Kafka's replication feature has been available since v0.8.0.
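The idea of partition-level replication can be sketched as follows. This is a simplified model, not Kafka's actual placement or leader election logic, which also tracks in-sync replicas and rack awareness. The function names and broker ids are hypothetical.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on distinct brokers, round-robin.

    The first broker in each list is treated as the partition's leader.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def fail_broker(assignment, failed):
    """Drop a failed broker; the next surviving replica becomes the leader."""
    return {p: [b for b in replicas if b != failed] for p, replicas in assignment.items()}


assignment = assign_replicas(num_partitions=3, brokers=["b1", "b2", "b3"], replication_factor=2)
after_failure = fail_broker(assignment, "b1")
for p in sorted(after_failure):
    print(p, "leader:", after_failure[p][0], "replicas:", after_failure[p])
```

With a replication factor of 2 on three brokers, every partition still has a surviving replica after one broker fails, so producers and consumers can continue.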
Retention happens at Kafka brokers, which are configured with a default retention period of 7 days. This can be overridden per topic using the topic-level config retention.ms. The admin can also limit retention by size (in bytes) using the topic-level config retention.bytes. Once these limits are reached, the oldest events are expired and deleted. Alternatively, developers or Kafka admins can choose log compaction by setting log.cleanup.policy=compact at the broker level or cleanup.policy=compact at the topic level. This tells Kafka to retain only the latest message produced for each message key in a topic. This is useful for changelog-type data, where only the latest update needs to be retained.
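The difference between time-based retention and log compaction can be seen in a small sketch. It works on individual records for simplicity, whereas Kafka applies these policies to log segments in the background. The record layout and the function names are assumptions made for illustration.

```python
def expire_by_time(records, now_ms, retention_ms):
    """Keep only records younger than the retention period."""
    return [r for r in records if now_ms - r["timestamp_ms"] < retention_ms]


def compact(records):
    """Keep only the latest record per key, preserving offset order."""
    latest = {}
    for r in records:
        latest[r["key"]] = r  # later records overwrite earlier ones
    return sorted(latest.values(), key=lambda r: r["offset"])


DAY_MS = 24 * 60 * 60 * 1000
log = [
    {"offset": 0, "key": "user-1", "value": "addr=A", "timestamp_ms": 0},
    {"offset": 1, "key": "user-2", "value": "addr=B", "timestamp_ms": 3 * DAY_MS},
    {"offset": 2, "key": "user-1", "value": "addr=C", "timestamp_ms": 6 * DAY_MS},
]

# Ten days in, with 7-day retention, only the newest record is young enough.
kept = expire_by_time(log, now_ms=10 * DAY_MS, retention_ms=7 * DAY_MS)
# Compaction ignores age and keeps the last value seen for every key.
compacted = compact(log)
print([r["offset"] for r in kept])
print([r["offset"] for r in compacted])
```

Time-based retention drops the only record for `user-2` once it's old enough, while compaction keeps it because it's still the latest value for that key.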

Milestones

Jan
2011

Kafka is open sourced by LinkedIn.

Jul
2011

The Kafka project enters the Apache Incubator. In October 2012, Kafka graduates from the Apache Incubator.

Dec
2013

Apache Kafka v0.8.0 is released. In addition to bug fixes and other improvements, this adds a key feature called intra-cluster replication support. Also for the first time, the Kafka release JAR file gets published to a public Maven repository so that developers can conveniently download and use the software.

Nov
2014

Key engineers who built Kafka at LinkedIn announce the formation of a new company called Confluent with a focus on Apache Kafka.

Feb
2015

Apache Kafka v0.8.2.0 is released. This includes built-in consumer offset management. This release also includes other improvements and bug fixes.

Nov
2017

Apache Kafka v1.0.0 is released. The first full version number indicates that Kafka is ready for enterprise adoption.

Jul
2018

Apache Kafka v2.0.0 is released. Major improvements are in the areas of security, stability and reliability.

Dec
2019

Apache Kafka v2.4.0 is released. This includes an upgraded version of MirrorMaker, MirrorMaker 2.0. This is a new multi-cluster, cross-datacenter replication engine.

Aug
2020

Apache Kafka v2.6.0 is released. This improves the performance of brokers dealing with a large number of partitions. New monitoring metrics provide better operational insights. This release also includes other improvements and bug fixes.

Jan
2022

Apache Kafka v3.1.0 is released. Among the new features are support for Java 17, extending SASL/OAUTHBEARER with support for OIDC, addition of new broker count metrics, and support for the usage of custom partitioners in foreign-key joins. The eager rebalance protocol is deprecated.

Oct
2022

Apache Kafka v3.3.0 is released. KRaft mode becomes production ready in this release. KRaft allows Kafka to self-manage metadata without relying on Apache ZooKeeper. KRaft mode was first previewed in Kafka v3.0. ZooKeeper is to be deprecated in Kafka v3.5 (planned for April 2023) and removed in v4.0 (planned for 2024).

References


Cite As

Devopedia. 2023. "Apache Kafka." Version 29, February 26. Accessed 2023-11-13. https://devopedia.org/apache-kafka

