Apache Kafka Fundamentals
Why is a replication factor of 3 a good idea?
* Allows one broker to be taken down for maintenance
* Allows another broker to go down unexpectedly at the same time
What is the default maximum message size a Kafka broker can receive?
About 1 MB (broker setting message.max.bytes; the exact default byte count differs slightly between Kafka versions)
What are ZooKeeper's default ports?
2181 (clientPort in ZooKeeper's zoo.cfg; the port at which clients connect.) 2888 (Port used by followers to connect to the leader.) 3888 (Port used by ZooKeeper peers for leader election.)
What is a key?
A key is an optional part of a message. Send one when you need message ordering for a specific field (e.g. truck_id): all messages with the same key go to the same partition.
What is Event Sourcing?
An architectural style or approach to maintaining an application's state by capturing all changes as a sequence of time-ordered, immutable events.
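A minimal sketch of the idea, assuming a hypothetical bank-account domain (the event kinds and the `apply` and `replay` helpers are illustrative and not part of any library). State is never stored directly; it is rebuilt by replaying the immutable, time-ordered log.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)  # frozen: an event cannot be modified once recorded
class Event:
    kind: str    # hypothetical event kinds: "deposited" or "withdrew"
    amount: int


def apply(balance: int, event: Event) -> int:
    """Fold a single event into the current state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrew":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, initial: int = 0) -> int:
    """Rebuild state by applying every event in order."""
    return reduce(apply, events, initial)


# The append-only log is the source of truth.
log = [Event("deposited", 100), Event("withdrew", 30), Event("deposited", 5)]

print(replay(log))      # current state
print(replay(log[:2]))  # state as it was after the second event
```

Because the log is kept whole, any past state can be recomputed by replaying a prefix of it.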
What does --describe --topic my_topic do?
It shows what is going on with a topic: all its partitions, the leader each partition is assigned to, the replicas, and the ISRs.
Kafka calculates the partition by taking the sum of the key modulo the number of partitions. True or False?
False. Kafka calculates the partition by taking the hash of the key modulo the number of partitions. So even with only 2 partitions, you aren't guaranteed an even distribution of records across partitions; it depends on the hash values of the keys.
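A rough sketch of hash-modulo partitioning. Note the assumption: Kafka's default partitioner hashes the serialized key with murmur2, while this example substitutes CRC32 from the standard library only to stay self-contained. The `partition_for` helper and the `truck_N` keys are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (assumption): real Kafka uses murmur2, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"truck_{i}" for i in range(10)]
counts = Counter(partition_for(k, 2) for k in keys)

# Records per partition; the split is not guaranteed to be even.
print(dict(counts))

# The same key always lands on the same partition.
print(partition_for("truck_1", 2) == partition_for("truck_1", 2))
```

The mapping is deterministic per key, which is what gives per-key ordering, but how evenly the keys spread depends entirely on their hash values.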
(StaticMembership) If a consumer is restarted or killed due to a transient failure, the broker coordinator will inform other consumers that a rebalance is necessary until session.timeout.ms is reached. One reason for that is that consumers will not send LeaveGroup request when they are stopped. True or False?
False. With static membership, the coordinator will not inform other consumers that a rebalance is necessary until session.timeout.ms is reached.
Kafka chooses two brokers' replicas of a partition as leader using ZooKeeper. True or False?
False. Kafka chooses one broker's replica of a partition as leader using ZooKeeper.
When using acks=all, it's strongly not recommended to update min.insync.replicas as well. True or False?
False. It is strongly recommended.
When a consumer starts, it sends a first ________________ request to obtain the Kafka broker coordinator which is responsible for its group. Then, it initiates the rebalance protocol by sending a _________ request.
FindCoordinator, JoinGroup
What are consumers?
Consumers read data from a topic (identified by a name). Consumers know which broker to read from. In case of broker failures, consumers know how to recover. Data is read in order within each partition.
What are consumer groups?
Consumers read data in consumer groups. Each consumer within a group reads from exclusive partitions. If you have more consumers than partitions, some consumers will be inactive. (Consumers automatically use a GroupCoordinator and a ConsumerCoordinator to assign consumers to partitions.)
What are Controller Brokers?
Controller Broker (KafkaController) is a Kafka service that runs on every broker in a Kafka cluster, but only one can be active (elected) at any point in time. The process of promoting a broker to be the active controller is called Kafka Controller Election. Quoting Kafka Controller Internals: In a Kafka cluster, one of the brokers serves as the controller, which is responsible for managing the states of partitions and replicas and for performing administrative tasks like reassigning partitions. The Kafka Controller registers handlers to be notified about changes in ZooKeeper and propagates them across brokers in a Kafka cluster.
What are Kafka default port numbers?
Default port: 9092. Default Ambari port: 6667.
What is the default Message retention policy?
Default is 168 hours or 7 days.
As its name suggests, the _____ __________ ________ is in charge of the coordination of members within a group. The clients participating in a group execute a sequence of requests/responses with a Kafka broker that acts as coordinator.
Group Membership Protocol
When should producer idempotence be enabled?
If you already use acks=all, there is no reason not to enable this feature. It works without any additional complexity for the application developer, so it is a no-brainer decision.
When a consumer in a group has processed data received from Kafka what should it do? (Does this by default)
It should commit its offsets to the __consumer_offsets topic.
What does unclean.leader.election.enable do?
Setting unclean.leader.election.enable to true allows out-of-sync replicas to become leaders. When that happens we lose messages, effectively losing important data and making our customers very angry.
When a broker receives a producer message, where does it commit it to?
To the topic's log (the commit log of the target partition), where the message waits to be read by consumers.
The KafkaConsumer Fetcher communicates with the cluster through the Consumer Network Client. With the client open and sending TCP packets, the consumer sends heartbeats so the cluster knows it is still alive, and receives Metadata so it knows the topics, partitions, etc. True or False?
True
The main idea behind static membership is that each consumer instance is attached to a unique identifier configured with group.instance.id. The membership protocol has been extended so that ids are propagated to the broker coordinator through the JoinGroup request. True or False?
True
There is a setting that can affect the pattern of duplication and overall throughput called queue.buffering.max.ms with alias linger.ms. This producer configuration allows us to accumulate messages before sending them, creating larger or smaller batches. Larger batches increase throughput, while also increasing latency as messages are accumulated in memory for a period before sending. True or False?
True
Kafka chooses one broker's partition's replicas as leader using zookeeper. True or False?
True. (The broker that has the partition leader handles all reads and writes of records for the partition. If a partition leader fails, Kafka chooses a new leader from the ISR.)
The more partitions the greater the Zookeeper overhead, true or false?
True. With a large number of partitions, ensure proper ZooKeeper capacity.
Does each partition have a leader server and zero or more follower servers, True or False?
True. Also, the leader handles all read and write requests for its partition.
Setting acks to -1 is the same as setting it to all, True or False?
True. This is a producer config.
What two duties does the ConsumerCoordinator have?
Using the information about the cluster from the Metadata in KafkaConsumer, the ConsumerCoordinator takes responsibility for coordinating with the consumer. It has two main duties. First: being aware of automatic or dynamic partition reassignment and notifying the SubscriptionState object of assignment changes. Second: committing offsets to the cluster. Confirmation of a commit updates the SubscriptionState so that it stays aware of the status of topics and partitions.
What is every Kafka broker called?
bootstrap server
Which settings increase the chance of batching for a Kafka Producer?
compression.type (Valid values are none, gzip, snappy, lz4, or zstd.), batch.size, linger.ms
The fewer the partitions, the longer the leader fail-over time. True or False?
False. The more partitions, the longer the leader fail-over time.
fetch.min.bytes fetch.max.wait.ms max.partition.fetch.bytes max.poll.records
fetch.min.bytes * Defines the minimum number of bytes required before Kafka sends data to the consumer. When the consumer polls for data and the minimum number of bytes is not yet available, Kafka waits until that size is reached (or until fetch.max.wait.ms elapses) and then sends the data.
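The interplay between the minimum fetch size and the maximum wait can be sketched as a single predicate. This is a simplification of the broker's behaviour, and `should_respond` is a hypothetical helper; the default arguments mirror the documented defaults of fetch.min.bytes (1) and fetch.max.wait.ms (500).

```python
def should_respond(available_bytes: int, waited_ms: int,
                   fetch_min_bytes: int = 1, fetch_max_wait_ms: int = 500) -> bool:
    """Answer the fetch once enough data is available or the wait has expired."""
    return available_bytes >= fetch_min_bytes or waited_ms >= fetch_max_wait_ms


# Not enough data yet and the timer has not expired: keep waiting.
print(should_respond(100, 0, fetch_min_bytes=1024))
# Enough data: respond immediately.
print(should_respond(2048, 0, fetch_min_bytes=1024))
# Still too little data, but the maximum wait has elapsed: respond anyway.
print(should_respond(100, 500, fetch_min_bytes=1024))
```

Raising fetch.min.bytes trades latency for fewer, larger fetch responses; the wait limit bounds that added latency.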
The first consumer within the group receives the list of active members and the selected assignment strategy and acts as the _____ ______, while the others receive an empty response. The _____ ______ is responsible for executing the partition assignments locally.
group leader (both blanks)
How do you configure a console consumer to handle primitive types?
kafka-console-consumer --topic example --bootstrap-server broker:9092 \
  --from-beginning \
  --property print.key=true \
  --property key.separator=" : " \
  --key-deserializer "org.apache.kafka.common.serialization.LongDeserializer" \
  --value-deserializer "org.apache.kafka.common.serialization.DoubleDeserializer"
The same pattern works for the other primitive types: Double, Long, Float, Integer and Short.
After connecting to any broker (bootstrap broker) will you be connected to the entire cluster?
yes
Can you connect to one broker and be connected to the entire cluster?
yes
Does each broker in a cluster know about all brokers, topics, and partitions (metadata)?
yes
Does increasing partitions generally increase throughput?
yes
Each topic has to have at least one partition because each partition is a physical representation of a commit log stored on one or more brokers. On the broker's file system the log is stored under /tmp/kafka-logs/{topic}-{partition}, and inside that directory are the .index and .log files. Is that true?
yes
Does each RecordBatch get configured from the Producer with the configuration of batch.size, buffer.memory and max.block.ms?
Yes. batch.size: the maximum size, in bytes, of a batch of records destined for one partition. If batch.size is 32*1024, a batch can hold up to 32 KB. buffer.memory: if the producer cannot send batches to the broker fast enough (say the broker is down), it accumulates them in buffer memory (default 32 MB). Once the buffer is full, the producer blocks for up to max.block.ms (default 60,000 ms) waiting for the buffer to clear, and then throws an exception.
What are brokers identified with?
An integer
What are the 3 delivery semantics for consumers?
At most once
* Offsets are committed as soon as the message is received.
* If the processing goes wrong, the message will be lost (it won't be read again).
At least once (usually preferred)
* Offsets are committed after the message is processed.
* If the processing goes wrong, the message will be read again.
* This can result in duplicate processing of messages. MAKE SURE your processing is idempotent (processing the messages again won't impact your system).
Exactly once
* Can be achieved for Kafka => Kafka workflows using the Kafka Streams API.
* For Kafka => External System workflows, use an idempotent consumer.
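The difference between the first two semantics comes down to whether the commit happens before or after processing. The toy simulation below uses no Kafka client; `run_session` and the crash point are hypothetical constructs for illustration. It crashes a consumer between the two steps and then restarts it from the committed offset.

```python
def run_session(records, committed, commit_first, crash_at=None):
    """Consume from the committed offset; optionally crash between the two steps."""
    processed = []
    for offset in range(committed, len(records)):
        if commit_first:
            committed = offset + 1             # step 1: commit
            if offset == crash_at:
                break                          # crash before processing
            processed.append(records[offset])  # step 2: process
        else:
            processed.append(records[offset])  # step 1: process
            if offset == crash_at:
                break                          # crash before committing
            committed = offset + 1             # step 2: commit
    return processed, committed


records = ["a", "b", "c"]

# At most once: commit first, crash at offset 1, then restart.
first, committed = run_session(records, 0, commit_first=True, crash_at=1)
second, _ = run_session(records, committed, commit_first=True)
at_most_once = first + second

# At least once: process first, crash at offset 1, then restart.
first, committed = run_session(records, 0, commit_first=False, crash_at=1)
second, _ = run_session(records, committed, commit_first=False)
at_least_once = first + second

print(at_most_once)   # "b" is never processed
print(at_least_once)  # "b" is processed twice
```

The duplicate in the second run is why at-least-once processing needs to be idempotent.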
What is returned by a producer.send() call in the Java API?
A Future<RecordMetadata> object.
All members (consumers) send a _________ request to the coordinator. The group leader attaches the computed assignments, while the others simply send an empty request.
SyncGroup
The _________ request contains some consumer client configuration, such as the _______.timeout.ms and the ___.poll.interval.ms. These properties are used by the coordinator to kick members out of the group if they don't respond.
JoinGroup, session, max
What is the definition of latency in Kafka?
Latency is the time it takes for data to be transferred between its original source and its destination.
What is acks=all?
Leader + replicas acknowledgment (no data loss)
Can Kafka work without Zookeeper?
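Not in the ZooKeeper-based deployments these notes describe. Newer Kafka versions can run without ZooKeeper in KRaft mode.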
Does Zookeeper have a leader (handle read) and the rest of the servers are followers (handles writes)?
No, it is the other way around: the leader handles writes and the followers handle reads.
Is KSQL ANSI SQL compliant?
No
Does Zookeeper not care about brokers?
No, it manages them.
When an offset is read by a consumer, does that mean it is committed?
No. It depends on the offset management mode (configuration properties):
enable.auto.commit = true (default) * automatic vs. manual commits
auto.commit.interval.ms = 5000 (default) * lengthening it gives record processing more time to finish, but commits can then lag behind
auto.offset.reset = "latest" (default) * other values: earliest, none
What is acks=1?
Producer will wait for a leader acknowledgment (limited data loss)
What is acks=0?
Producer won't wait for acknowledgment (possible data loss)
________________ gives the Producer its ability to micro-batch records intended to be sent at high volumes and high frequencies.
RecordAccumulator. Once a producer record has been assigned to a partition by the partitioner, it is handed over to the record accumulator, where it is added to a collection of record batch objects kept for each topic-partition combination needed by the producer instance.
How does Confluent Schema Registry use the rebalance protocol?
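A toy model of that per-partition batching. The `Accumulator` class is a hypothetical simplification: the real RecordAccumulator also accounts for linger.ms, compression and memory limits, none of which appear here. Only the size bound, in the spirit of batch.size, is modelled.

```python
from collections import defaultdict


class Accumulator:
    """Groups records into size-bounded batches per (topic, partition)."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size              # max bytes per batch
        self.batches = defaultdict(lambda: [[]])  # (topic, partition) -> batches

    def append(self, topic: str, partition: int, value: bytes) -> None:
        batches = self.batches[(topic, partition)]
        current = batches[-1]
        used = sum(len(v) for v in current)
        if current and used + len(value) > self.batch_size:
            current = []             # current batch is full: open a new one
            batches.append(current)
        current.append(value)


acc = Accumulator(batch_size=10)
for value in [b"aaaa", b"bbbb", b"cccc"]:
    acc.append("my_topic", 0, value)
acc.append("my_topic", 1, b"dddd")

for key, batches in acc.batches.items():
    print(key, batches)
```

Each topic-partition gets its own queue of batches, so records for different partitions never share a batch.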
Relies on rebalancing to elect a leader node
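If the key=null, how is the data sent to 3 brokers in a cluster?
Round robin: 101, 102, then 103. (Newer producer versions use sticky partitioning for keyless records, filling a batch for one partition before moving to the next.)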
What is RoundRobinAssignor?
The RoundRobinAssignor can be used to distribute available partitions evenly across all members. Like the RangeAssignor, it puts partitions and consumers in lexicographic order before assigning each partition.
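A simplified sketch of the round-robin idea, assuming every consumer subscribes to every topic. The `round_robin_assign` helper and the consumer and topic names are hypothetical; this is not the client library's implementation.

```python
def round_robin_assign(consumers, partitions):
    """Deal the sorted partitions to the sorted consumers one at a time."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for i, topic_partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(topic_partition)
    return assignment


partitions = [("A", 0), ("A", 1), ("B", 0), ("B", 1), ("B", 2)]
rr = round_robin_assign(["c2", "c1"], partitions)
print(rr)
```

With five partitions and two consumers, the counts differ by at most one, which is the even spread the strategy aims for.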
What is StickyAssignor?
The StickyAssignor is pretty similar to the RoundRobinAssignor, except that it tries to minimize partition movements between two assignments while still ensuring a uniform distribution.
What is the RangeAssignor (the default strategy)?
The aim of this strategy is to co-locate partitions of several topics. This is useful, for example, to join records from two topics which have the same number of partitions and the same key-partitioning logic. To do this, the strategy first puts all consumers in lexicographic order using the member_id assigned by the broker coordinator. Then it puts the available topic-partitions in numeric order. Finally, for each topic, the partitions are assigned starting from the first consumer.
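The three steps can be sketched as follows. This is a simplified illustration that assumes all consumers subscribe to all topics; `range_assign` and the topic names `orders` and `payments` are hypothetical.

```python
def range_assign(consumers, topics):
    """topics maps a topic name to its number of partitions."""
    members = sorted(consumers)                  # lexicographic member order
    assignment = {member: [] for member in members}
    for topic in sorted(topics):
        per_member, extra = divmod(topics[topic], len(members))
        start = 0
        for i, member in enumerate(members):
            # The first `extra` consumers each take one additional partition.
            count = per_member + (1 if i < extra else 0)
            assignment[member] += [(topic, p) for p in range(start, start + count)]
            start += count
    return assignment


ranges = range_assign(["c2", "c1"], {"orders": 3, "payments": 3})
print(ranges)
```

Partition 0 of both topics ends up on the same consumer, which is the co-location that makes joins across equally partitioned topics possible. The cost is that the first consumers receive more partitions when the division is uneven.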
In a consumer group, what happens if there are more consumers than partitions?
The number of active consumers in a group cannot exceed the number of partitions in the topic, so the extra consumers stay inactive until a partition becomes available.
What do brokers that are not the leader do with the data?
They replicate (synchronize) the data. Replicas that are caught up with the leader are the in-sync replicas (ISR).
What is the definition of throughput?
Throughput measures how many events arrive within a specific amount of time. Most systems are optimized for either latency or throughput. Kafka is balanced for both. A well-tuned Kafka system has just enough brokers to handle topic throughput, given the latency required to process information as it is received.
Each consumer periodically sends a Heartbeat request to the broker coordinator to keep its session alive, and if a rebalance is in progress, the coordinator uses the Heartbeat response to indicate to consumers that they need to rejoin the group. True or False?
True
If you currently use acks=0 or acks=1 for reasons of latency and throughput, then you might consider staying away from idempotence. It requires acks=all, which increases latencies and latency variability. If you already use acks=0 or acks=1, then you probably value the performance benefits over data consistency. True or False?
True
Kafka calculates the partition by taking the hash of the key modulo the number of partitions, True or False?
True
Kafka stores the committed offsets in a special topic called __consumer_offsets on the cluster? True or False?
True
On the producer side, after receiving base64 data, the REST Proxy will convert it into bytes and then send that bytes payload to Kafka. Therefore consumers reading directly from Kafka will receive binary data. True or False?
True
What is CommitSync and CommitAsync?
Use commitSync for process control, when you don't want to retrieve new records until the current ones are committed. It blocks, trading throughput and performance for control over consistency, and adds overall latency. * It retries until it succeeds or an unrecoverable error occurs; retry.backoff.ms (default: 100) sets the wait between retries.
Use commitAsync to control when the message is considered truly processed without blocking. Because it is asynchronous, you don't know whether the commit has succeeded. * Non-blocking but non-deterministic * No retries * Possible duplication of records * Throughput and overall performance will be better (Don't do this unless you register a callback and handle it correctly.)
How does Kafka Streams use the rebalance protocol?
Uses it to assign tasks and partitions to the application streams instances
How does Kafka Connect use the rebalance protocol?
Uses it to distribute tasks and connectors among the workers
Are messages appended to a topic-partition in the order they are sent?
Yes
As long as the number of partitions remains constant for a topic (no new partitions), the same key will always go to the same partition?
Yes
Do consumers read messages in the order stored in a topic-partition?
Yes
Does Kafka store the offsets at which a consumer group has been reading?
Yes
Does Zookeeper have a leader (handles writes) and the rest of the servers are followers (handle reads)?
Yes
With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down?
Yes
If a consumer dies will it be able to read from where it left off and if so how?
Yes! Thanks to the committed consumer offsets!
What does ZooKeeper do when a new topic is created?
ZooKeeper scans its registry of brokers and assigns a broker as the leader for the topic that was just created. (On the broker there is a log directory called my_topic-0; inside that directory are the index and log files.)
What happens when a broker goes down?
ZooKeeper will find another broker to take its place. The metadata used for work distribution for producers and consumers gets updated, and the system goes on. Thanks to the replication factor, the data won't be lost and the cluster keeps humming along.
What is the name of the Kafka topic where committed consumer offsets live?
__consumer_offsets
In Kafka Streams, by what value are internal topics prefixed?
application.id
List some notifications ZooKeeper sends to Kafka in case of changes.
New topic, broker dies, broker comes up, topic deleted, etc.
Is data randomly assigned to a partition when there is a key provided?
no
Once data is written to a partition can it be changed?
No, it is immutable.
Does ZooKeeper by design operate with an odd or even number of servers?
odd (3, 5, 7) etc.
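The preference for odd sizes follows from majority quorums: the ensemble stays available only while more than half of its servers are up. A small arithmetic sketch (`tolerated_failures` is a hypothetical helper):

```python
def tolerated_failures(servers: int) -> int:
    """Servers that can fail while a strict majority still remains."""
    quorum = servers // 2 + 1
    return servers - quorum


for n in (3, 4, 5, 6, 7):
    print(n, "servers tolerate", tolerated_failures(n), "failure(s)")
```

Four servers tolerate no more failures than three, and six no more than five, so the extra even-numbered server adds cost without adding fault tolerance.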
Once the coordinator responds to all SyncGroup requests, each consumer receives their assigned partitions, invokes the __________________________ on the configured listener and, then starts fetching messages.
onPartitionsAssigned method
How many brokers can be a leader for a given partition?
one
What amount of time is data kept for a partition (default)?
one week
ZooKeeper helps in _______ _______ _______ for partitions?
performing leader election
If each consumer application can process only 50 MB/s and our customer wants to achieve a target of 2 GB/s throughput through a single topic, how many partitions do we need?
At least 40. We need at least 40 consumers, each consuming one partition, so that the 50 * 40 = 2000 MB/s target is achieved.
acks is a producer setting; min.insync.replicas is a topic or broker setting and is only effective when acks=all. True or False?
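The same sizing arithmetic as a small helper, assuming 2 GB/s is taken as 2000 MB/s and that each consumer reads exactly one partition (`partitions_needed` is a hypothetical name):

```python
import math


def partitions_needed(target_mb_per_s: float, per_consumer_mb_per_s: float) -> int:
    """Round up: a fraction of a partition still needs a whole consumer."""
    return math.ceil(target_mb_per_s / per_consumer_mb_per_s)


print(partitions_needed(2000, 50))
```

Rounding up matters when the division is not exact: at 60 MB/s per consumer, the same target would need 34 partitions rather than 33.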
true
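A simplified model of how the two settings interact. `write_accepted` is a hypothetical helper, and it assumes the leader is alive and counted among the in-sync replicas. With acks=all, the broker rejects a write when too few replicas are in sync; with acks=0 or acks=1 the minimum is not consulted.

```python
def write_accepted(acks: str, in_sync_replicas: int, min_insync_replicas: int) -> bool:
    """Return True if a produce request would be accepted under this model."""
    if acks in ("all", "-1"):
        return in_sync_replicas >= min_insync_replicas
    return True  # acks=0 and acks=1 ignore min.insync.replicas


# Replication factor 3, min.insync.replicas=2, two brokers down (ISR of 1):
print(write_accepted("all", 1, 2))  # rejected under acks=all
print(write_accepted("1", 1, 2))    # accepted, but with weaker durability
```

This is why a replication factor of 3 is commonly paired with min.insync.replicas=2: writes keep succeeding with one broker down, and fail loudly rather than silently losing durability when two are down.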
Schema Registry identities are stored in ZooKeeper and are made up of a hostname and port. If multiple listeners are configured, the first listener's port is used for its identity. Is this true or false?
true