Kafka Interview Questions


What do you mean by Kafka schema registry?

A Schema Registry sits alongside a Kafka cluster and stores the schemas (typically Avro) used by producers and consumers. Keeping schemas in a central registry allows compatibility rules to be enforced between producers and consumers, which makes serialization and deserialization reliable. The Kafka Schema Registry is used to ensure that the schema a consumer uses to read a record matches the schema the producer used to write it. When using the Confluent Schema Registry, producers only need to embed a schema ID in each message rather than the whole schema; the consumer then looks up the matching schema in the Schema Registry using that schema ID.
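
As a minimal sketch (assuming a Confluent Schema Registry running at its default address, localhost:8081, and a hypothetical topic my-topic), the registry's REST API can be used to inspect registered schemas:

# List all subjects (by convention, one per topic key or value)
$ curl -s http://localhost:8081/subjects

# Fetch the latest schema registered for the topic's value
$ curl -s http://localhost:8081/subjects/my-topic-value/versions/latest

# Resolve a schema by the ID embedded in a message
$ curl -s http://localhost:8081/schemas/ids/1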

What do you understand about a consumer group in Kafka?

A consumer group in Kafka is a collection of consumers that work together to consume data from the same topic or set of topics. A consumer group essentially represents the name of an application. Kafka distributes the partitions of a topic among the consumers in a group, so each message is processed by only one member of the group. The '--group' option must be used to consume messages as part of a consumer group.
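
For example (a sketch assuming a local broker on localhost:9092 and a hypothetical topic my-topic), two console consumers started with the same --group value split the topic's partitions between them:

# Start two consumers in the same group (run each in its own terminal)
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic my-topic --group my-app

# Inspect the group's members, partition assignments, and lag
$ bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group my-app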

What does it mean if a replica is not in-sync replica for a long time?

A replica that stays out of the ISR for a long time indicates that the follower cannot fetch data as fast as the leader is accumulating it. The leader removes a follower from the ISR when the follower has not caught up within the configured replica.lag.time.max.ms, so a persistently out-of-sync replica usually points to a slow or failed follower, a network bottleneck, or an overloaded broker.
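
The ISR of each partition can be checked with the topics tool; in the --describe output (sketched here for a hypothetical topic my-topic), a broker listed under Replicas but missing from Isr has fallen out of sync:

# Output columns include Leader, Replicas, and Isr for every partition
$ bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic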

What is the maximum size of a message that Kafka can receive?

By default, the maximum size of a Kafka message is 1MB, controlled by the message.max.bytes broker setting; the limit can also be overridden per topic with max.message.bytes. Kafka is, however, optimized for small messages of around 1KB.
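
As a sketch (assuming a local broker and a hypothetical topic my-topic), the limit can be raised for a single topic with the configs tool; the producer's max.request.size and the consumer's max.partition.fetch.bytes usually need to be raised to match:

# Raise the per-topic message size limit to 2 MB
$ bin/kafka-configs.sh --bootstrap-server localhost:9092 \
    --entity-type topics --entity-name my-topic \
    --alter --add-config max.message.bytes=2097152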

How do you start a Kafka server?

Firstly, we extract Kafka once we have downloaded the most recent version. We must make sure that our local environment has Java 8+ installed in order to run Kafka. The following commands must be run in order, so that all services start in the correct sequence:

- Start the ZooKeeper service:
$ bin/zookeeper-server-start.sh config/zookeeper.properties

- In a new terminal, start the Kafka broker service:
$ bin/kafka-server-start.sh config/server.properties
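
Once both services are up, a quick smoke test (a sketch using a hypothetical topic named quickstart, following the official quickstart) confirms the broker accepts reads and writes:

# Create a test topic
$ bin/kafka-topics.sh --create --topic quickstart --bootstrap-server localhost:9092

# Produce a message (type a line, then Ctrl-C to exit)
$ bin/kafka-console-producer.sh --topic quickstart --bootstrap-server localhost:9092

# Read it back from the beginning of the topic
$ bin/kafka-console-consumer.sh --topic quickstart --from-beginning --bootstrap-server localhost:9092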

What is a Replication Tool in Kafka? Explain some of the replication tools available in Kafka.

The Kafka replication tools are used to create and maintain the high-level design of the replica maintenance process. Some of the available replication tools are listed below (see the command sketch after this list):

- Preferred Replica Leader Election Tool: partitions are spread across many brokers in a cluster, and each copy is known as a replica. The leader is frequently referred to as the preferred replica. The brokers normally distribute the leader role evenly across the cluster for the various partitions, but failures, planned shutdowns, and other events can create an imbalance over time. This tool can be used to restore the balance in such situations by reassigning the preferred replicas, and hence the leaders.
- Topics tool: the Kafka topics tool is in charge of all administrative operations related to topics, including listing and describing topics, creating topics, modifying topics, adding partitions to a topic, and deleting topics.
- Reassign Partitions tool: the replicas assigned to a partition can be changed with this tool, i.e., followers can be added to or removed from a partition.
- StateChangeLogMerger tool: collects state-change data from the brokers in a cluster, merges it into a central log, and aids in troubleshooting state-change issues. Sometimes there are problems electing a leader for a particular partition; this tool can be used to figure out what is causing the issue.
- Change Topic Configuration tool: used to add new configuration options, modify existing configuration options, and delete configuration options.
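
As a sketch against a local broker, the first and third of these map onto the following commands in recent Kafka releases (older releases shipped kafka-preferred-replica-election.sh instead of kafka-leader-election.sh; reassign.json is a hypothetical placeholder for the reassignment plan):

# Re-elect the preferred leader for every partition
$ bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
    --election-type preferred --all-topic-partitions

# Move replicas according to a JSON reassignment plan
$ bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
    --reassignment-json-file reassign.json --execute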

What are some of the features of Kafka?

- Kafka is a messaging system built for high throughput and fault tolerance.
- Kafka has a built-in partition system known as a Topic.
- Kafka includes a replication feature as well.
- Kafka provides a queue that can handle large amounts of data and move messages from one sender to another.
- Kafka can also save the messages to storage and replicate them across the cluster.
- For coordination and synchronization with other services, Kafka collaborates with ZooKeeper.
- Apache Spark is well supported by Kafka.

What are some of the disadvantages of Kafka?

- Kafka's performance degrades if messages are tweaked; it works well when messages do not need to be updated.
- Wildcard topic selection is not supported by Kafka; the exact topic name must be matched.
- When dealing with huge messages, brokers and consumers reduce Kafka's performance by compressing and decompressing the messages. This has an impact on Kafka's throughput and performance.
- Certain message paradigms, including point-to-point queues and request/reply, are not supported by Kafka.
- Kafka does not have a complete set of monitoring tools.

What are the use cases of Kafka monitoring?

- Track system resource consumption: monitoring can be used to keep track of system resources such as memory, CPU, and disk utilization over time.
- Monitor threads and JVM usage: Kafka relies on the Java garbage collector to free up memory, and monitoring ensures that garbage collection runs as expected so that the Kafka cluster stays healthy.
- Keep an eye on broker, controller, and replication statistics, so that the statuses of partitions and replicas can be adjusted as needed.
- Find out which applications are causing excessive demand and identify performance bottlenecks, which helps solve performance issues rapidly.

What do you mean by zookeeper in Kafka and what are its uses?

Apache ZooKeeper is a naming registry for distributed applications as well as a distributed, open-source configuration and synchronization service. It keeps track of the status of the Kafka cluster nodes, as well as of Kafka topics, partitions, and so on. Kafka brokers use ZooKeeper to maintain and coordinate the Kafka cluster. When the topology of the Kafka cluster changes, such as when brokers or topics are added or removed, ZooKeeper notifies all nodes: for example, it notifies the cluster when a new broker joins, and likewise when a broker fails. ZooKeeper also runs leader elections among brokers for topic-partition pairs, determining which broker will be the leader for a given partition (and serve read and write operations from producers and consumers), as well as which brokers hold replicas of the same data. When the brokers receive a notification from ZooKeeper, they immediately begin to coordinate with one another and elect any new partition leaders that are required. This safeguards against the unexpected absence of a broker.
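
As an illustration (a sketch assuming ZooKeeper on its default port 2181), the broker metadata that ZooKeeper tracks can be browsed with the shell bundled with Kafka:

# List the IDs of the brokers currently registered in the cluster
$ bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids

# Show the registration data for broker 0 (host, port, endpoints)
$ bin/zookeeper-shell.sh localhost:2181 get /brokers/ids/0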

Tell me some of the real-world usages of Apache Kafka

Following are some of the real-world usages of Apache Kafka:

- As a message broker: due to its high throughput, Kafka can manage a huge volume of similar types of messages or data. Kafka can be used as a publish-subscribe messaging system that allows data to be read and published in a convenient manner.
- To monitor operational data: Kafka can be used to keep track of metrics related to certain technologies, such as security logs.
- Website activity tracking: Kafka can be used to verify that data is transferred and received successfully by websites. Kafka can handle the massive amounts of data created by websites for each page and for the activities of users.
- Data logging: Kafka's data replication between nodes can be used to restore data on nodes that have failed. Kafka can also collect data from a variety of logs and make it available to consumers.
- Stream processing with Kafka: Kafka can be used to handle streaming data, i.e., data that is read from one topic, processed, and then written to another. Users and applications then have access to a new topic containing the processed data.

What are the traditional methods of message transfer? How is Kafka better from them?

Following are the traditional methods of message transfer:

- Message queuing: a point-to-point technique is used in the message queuing pattern. A message in the queue is destroyed once it has been consumed, similar to how a message is removed from the server once it has been delivered in the Post Office Protocol. These queues allow asynchronous messaging: if a network problem delays a message's delivery, for example because a consumer is unavailable, the message is held in the queue until it can be sent. This means that messages aren't always sent in the same order; instead, they are delivered on a first-come, first-served basis, which can improve efficiency in some situations.
- Publish-subscribe model: the publish-subscribe pattern entails publishers producing ("publishing") messages in multiple categories and subscribers consuming published messages from the various categories to which they are subscribed. Unlike point-to-point messaging, a message is only removed once it has been consumed by all category subscribers.

Kafka caters to a single consumer abstraction that encompasses both of the above: the consumer group. Following are the benefits of using Kafka over the traditional messaging transfer techniques:

- Scalable: a cluster of machines is used to partition and streamline the data, thereby scaling up the storage capacity.
- Fast: a single Kafka broker can serve thousands of clients, as it can manage megabytes of reads and writes per second.
- Durable and fault-tolerant: the data is kept persistent and tolerant of hardware failures by copying it across the cluster.

What are the major components of Kafka?

Following are the major components of Kafka:

- Topic: a topic is a category or feed in which records are saved and published. Topics are used to organize all of Kafka's records: consumer applications read data from topics, whereas producer applications write data to them. Records published to the cluster remain in the cluster for the duration of a configurable retention period. Kafka keeps records in the log, and it is up to the consumers to keep track of where they are in the log (the "offset"). As messages are read, a consumer typically advances the offset in a linear fashion. The consumer, however, is in charge of its own position and can consume messages in any order; when reprocessing records, for example, a consumer can reset to an older offset (see the sketch after this list).
- Producer: a Kafka producer is a data source for one or more Kafka topics that optimizes, writes, and publishes messages. Partitioning allows Kafka producers to serialize, compress, and load-balance data among brokers.
- Consumer: consumers read data by consuming messages from the topics to which they have subscribed. Consumers are organized into groups, and each consumer in a consumer group is responsible for reading a subset of the partitions of each topic to which the group has subscribed.
- Broker: a Kafka broker is a server that works as part of a Kafka cluster (in other words, a Kafka cluster is made up of a number of brokers). Multiple brokers typically work together to build a Kafka cluster, which provides load balancing, reliable redundancy, and failover. The cluster is managed and coordinated by brokers using Apache ZooKeeper. Without sacrificing performance, each broker instance can handle read and write volumes of hundreds of thousands of messages per second (and gigabytes of messages). Each broker has its own ID and can be in charge of one or more topic log partitions. ZooKeeper is also used by Kafka brokers for leader elections, in which a broker is chosen to lead the handling of client requests for a given partition of a topic. Connecting to any broker brings a client up to speed with the entire Kafka cluster. A minimum of three brokers should be used to achieve reliable failover; the higher the number of brokers, the more reliable the failover.
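
Since consumers own their offsets, an older position can be replayed; a sketch using the consumer-groups tool (with a hypothetical group my-app and topic my-topic):

# Rewind the group to the earliest retained offset, then re-run the consumer
# (the group must have no active members while offsets are being reset)
$ bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --group my-app --topic my-topic \
    --reset-offsets --to-earliest --execute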

What do you mean by geo-replication in Kafka?

Geo-replication is a Kafka feature that allows messages in one cluster to be copied across multiple data centers or cloud regions. It entails replicating all of the files and storing them across the globe if necessary. Geo-replication can be accomplished with Kafka's MirrorMaker tool, and it is a technique for ensuring data backup.
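
As a sketch, the legacy MirrorMaker tool mirrors topics from a source cluster into a destination cluster using ordinary consumer and producer configurations (the two .properties files below are hypothetical placeholders pointing at the source and destination clusters):

# Mirror every topic from the source cluster into the destination cluster
$ bin/kafka-mirror-maker.sh \
    --consumer.config source-cluster.properties \
    --producer.config destination-cluster.properties \
    --whitelist ".*"

Newer releases also ship MirrorMaker 2 (bin/connect-mirror-maker.sh), which is built on Kafka Connect.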

Describe partitioning key in Kafka.

In Kafka terminology, messages are referred to as records. Each record has a key and a value, with the key being optional. The record's key is used for record partitioning. There will be one or more partitions for each topic. A partition is a straightforward data structure: an append-only sequence of records, ordered chronologically by the time they were appended. Once a record is written to a partition, it is given an offset, a sequential ID that reflects the record's position in the partition and uniquely identifies it within it. Partitioning is done using the record's key: by default, the Kafka producer uses the record's key to determine which partition the record should be written to, and it will always choose the same partition for two records with the same key. This is important because we may have to deliver records to consumers in the same order that they were produced. When a customer purchases an eBook from your webshop and subsequently cancels the transaction, you want these events to arrive in the order they were created. If you receive a cancellation event before a purchase event, the cancellation will be rejected as invalid (since the purchase has not yet been registered in the system), and the system will then record the purchase and ship the product to the client (losing you money). You might use a customer ID as the key of these Kafka records to solve this problem and ensure ordering. This guarantees that all of a customer's purchase events land in the same partition.
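
A sketch of keyed writes from the shell (assuming a hypothetical topic orders): the console producer can parse a key out of each line, and every record sharing a key lands in the same partition:

$ bin/kafka-console-producer.sh --bootstrap-server localhost:9092 \
    --topic orders --property parse.key=true --property key.separator=:
# Then type records as key:value, e.g.
#   customer-42:purchase
#   customer-42:cancellation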

Explain the concept of leader and follower in Kafka

In Kafka, each partition has one server that acts as the leader and one or more servers that act as followers. The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers takes over as leader. As a result, load is balanced across the servers.

Can we use Kafka without zookeeper ?

Kafka can now be used without ZooKeeper as of version 2.8. The release of Kafka 2.8.0 in April 2021 made it possible to try Kafka without ZooKeeper (so-called KRaft mode), although this version was not yet considered production-ready. In previous versions, bypassing ZooKeeper and connecting directly to the Kafka broker was not possible, because when ZooKeeper is down the cluster cannot fulfill client requests.
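
A sketch of starting a broker in ZooKeeper-less (KRaft) mode, following the KRaft quickstart shipped with recent releases:

# Generate a cluster ID and format the storage directory with it
$ KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
$ bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties

# Start the broker; no ZooKeeper process is needed
$ bin/kafka-server-start.sh config/kraft/server.properties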

What are the benefits of using clusters in Kafka?

A Kafka cluster is basically a group of multiple brokers, used to maintain load balance. Because Kafka brokers are stateless, they rely on ZooKeeper to keep track of the cluster state. A single Kafka broker instance can manage hundreds of thousands of reads and writes per second, and each broker can handle terabytes of messages without compromising performance. ZooKeeper is also used to elect the leader among the Kafka brokers. Thus, having a cluster of Kafka brokers heavily increases performance.

What do you mean by a partition in Kafka?

Kafka topics are separated into partitions, each of which contains records in a fixed order. A unique offset is assigned to each record in a partition. Multiple partition logs can be found in a single topic, which allows several consumers to read from the same topic at the same time. Topics can be parallelized via partitions, which split the data of a single topic among numerous brokers. Replication in Kafka is done at the partition level: a replica is the redundant element of a topic partition, and each partition usually has one or more replicas, meaning that partitions contain messages that are duplicated across several Kafka brokers in the cluster. One server acts as the leader of each partition (replica), while the others act as followers. The leader replica handles all read-write requests for the partition, and if the leader fails, one of the followers takes over as leader, with each broker leading an equal number of partitions.

What do you mean by multi-tenancy in Kafka?

Multi-tenancy is a software operation mode in which many instances of one or more programs operate in a shared environment independently of one another. The instances are physically integrated yet logically separated. The level of logical isolation in a system that supports multi-tenancy must be comprehensive, but the level of physical integration can vary. Kafka is multi-tenant because it allows many topics to be configured for data consumption and production on the same cluster.

What is the purpose of partitions in Kafka?

Partitions allow a single topic to be split across numerous servers from the perspective of the Kafka broker. This allows you to store more data in a single topic than a single server can hold. If you have three brokers and need to store 10 TB of data in a topic, one option is to create a topic with only one partition and store all 10 TB on one broker. Another is to create a topic with three partitions and distribute the 10 TB of data among all the brokers. From the consumer's perspective, a partition is a unit of parallelism.

Explain the four core API architecture that Kafka uses.

- Producer API: the producer API in Kafka allows an application to publish a stream of records to one or more Kafka topics.
- Consumer API: an application can subscribe to one or more Kafka topics using the Kafka consumer API. It also enables the application to process the streams of records produced to those topics.
- Streams API: the Kafka Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, thereby transforming the input streams into output streams.
- Connect API: the Kafka Connect API connects Kafka topics to external applications and data systems. This opens up possibilities for constructing and managing the operations of producers and consumers, as well as establishing reusable links between these solutions. A connector, for example, may capture all database updates and ensure that they are made available in a Kafka topic.
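
As a small illustration of the Connect API from the command line (a sketch using the standalone worker and the file-source example configs that ship with Kafka), a connector can stream the lines of a file into a topic with no custom producer code:

# Run a standalone Connect worker with the bundled file source connector
$ bin/connect-standalone.sh config/connect-standalone.properties \
    config/connect-file-source.properties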

Differentiate between Rabbitmq and Kafka.

RabbitMQ: RabbitMQ is a general-purpose message broker; request/reply, point-to-point, and pub-sub communication patterns are all used by it. It has a smart broker / dumb consumer model: the broker delivers messages to consumers at a roughly consistent speed and monitors the consumers' status. It is a mature platform with well-supported client libraries for Java, .NET, Ruby, and Node.js, and it offers a variety of plugins. Communication can be synchronous or asynchronous, and it provides options for distributed deployment. Kafka: Kafka is a message and stream platform for high-volume publish-subscribe messages and streams. It is durable, quick, and scalable. It provides a durable message store, similar to a log, that runs in a server cluster and keeps streams of records in topics (categories). Messages are made up of three components: a value, a key, and a timestamp. It has a dumb broker / smart consumer model: the broker does not track which messages have been read by consumers, and instead stores all messages for a configured amount of time regardless of consumption. Kafka requires external services to run, including Apache ZooKeeper in some circumstances.

Why is topic Replication important in Kafka? What do you mean by ISR in Kafka?

Topic replication is critical for constructing Kafka deployments that are both durable and highly available. When one broker fails, topic replicas on other brokers remain available, ensuring that data is not lost and that the Kafka deployment is not disrupted. The replication factor specifies the number of copies of a topic that are kept across the Kafka cluster. It takes place at the partition level and is defined at the topic level. A replication factor of two, for example, will keep two copies of each partition of the topic. Each partition has an elected leader, and other brokers store a copy that can be used if necessary. Logically, the replication factor cannot be more than the cluster's total number of brokers. An in-sync replica (ISR) is a replica that is up to date with the partition's leader.
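
A sketch of setting the replication factor at creation time (with a hypothetical topic name; a factor of 3 requires at least three brokers in the cluster):

# Create a topic whose partitions are each replicated on 3 brokers
$ bin/kafka-topics.sh --create --topic payments \
    --partitions 3 --replication-factor 3 \
    --bootstrap-server localhost:9092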

