CCDAK Kafka Theory (Need to know)
max.poll.records = 500 (default) does what?
(Consumer Poll Behavior) Controls the maximum number of records returned per poll request. Increase it if your messages are very small and you have a lot of available RAM (combined with fetch.min.bytes in the sketch after the next card).
fetch.min.bytes = 1 (default) does what?
(Consumer Poll Behavior) Controls the minimum amount of data you want to receive per fetch request. Increasing it improves throughput and decreases the number of requests, at the cost of latency.
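A minimal Java consumer sketch tuning both poll settings above (the group id, serializers, and the values 1000 and 64KB are illustrative assumptions, not defaults):
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
public class PollTuningConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-first-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1000"); // more records per poll() than the default 500
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "65536"); // wait for 64KB per fetch: throughput over latency
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.close();
    }
}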
What is records-lag-max for a Consumer in Kafka?
(monitoring metrics) The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
Segments come with two index files; what are they?
1. An offset-to-position index (.index file): allows Kafka to find where in the segment to read to locate a message 2. A timestamp-to-offset index (.timeindex file): allows Kafka to find messages by timestamp
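For example, a partition directory on disk holds one set of these files per segment (hypothetical listing for partition 0 of my-first-topic, assuming the quickstart default log.dirs=/tmp/kafka-logs):
> ls /tmp/kafka-logs/my-first-topic-0
00000000000000000000.log
00000000000000000000.index
00000000000000000000.timeindex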
If a producer writes at 1 GB/sec and each consumer consumes at 250 MB/sec, how many partitions are required?
At least 4: 1 GB/sec ÷ 250 MB/sec = 4 consumers working in parallel, and each consumer in a group needs at least one partition of its own.
Replication factor = 3 and partitions = 2: how many partition replicas are distributed across the Kafka cluster?
6 partition replicas. Each partition will have 1 leader and 2 follower replicas (ISRs when in sync).
What is the Schema Registry Port?
8081
What is the REST Proxy port?
8082
What is the KSQL Port?
8088
List Broker Port:
9092
Start Consuming messages from kafka topic my-first-topic
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-first-topic --from-beginning
hello drew
learning kafka
Start Consuming messages in a consumer group from kafka topic my-first-topic
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-first-topic --group my-first-consumer-group --from-beginning
Produce messages to Kafka topic my-first-topic
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-first-topic --producer-property acks=all
>hello drew
>learning kafka
>^C
Shift offsets by 2 (backward) as another strategy
> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-first-consumer-group --reset-offsets --shift-by -2 --execute --topic my-first-topic
Shift offsets by 2 (forward) as another strategy
> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-first-consumer-group --reset-offsets --shift-by 2 --execute --topic my-first-topic
Describe consumer group
> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-first-consumer-group
Reset offset of consumer group to replay all messages
> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-first-consumer-group --reset-offsets --to-earliest --execute --topic my-first-topic
List all consumer groups
> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
acks=all must be used in conjunction with min.insync.replicas. Where can min.insync.replicas be set?
Broker or topic level
Who has defaults for all topic configuration parameters?
Brokers
Is every broker in Kafka a bootstrap server? If so, what does it know?
Every broker in Kafka is a "bootstrap server" that knows about all brokers, topics and partitions (metadata). This means a Kafka client (e.g. producer, consumer) only needs to connect to one broker in order to reach the entire cluster. At all times, exactly one broker must act as the controller of the cluster.
When a consumer in a group has processed the data received from Kafka, it commits the offset to a Kafka topic named _consumer_commit, which is used when a consumer dies so it can read back from where it left off. True or False?
False. The Kafka topic is named __consumer_offsets.
When do old segments get deleted?
Depending on the log.retention.hours or log.retention.bytes rules.
What happens when a producer sends data with a null key?
The data is sent round-robin across partitions.
Where is min.insync.replicas = 2 set from?
Set at broker or topic level (Safe Producer Config)
Replication factor cannot be greater than the number of brokers in the Kafka cluster. If a topic has a replication factor of 3, each partition will live on 3 different brokers. True or False?
True
Does adding a partition to a topic break the guarantee that the same key goes to the same partition?
Yes. Keys are mapped to partitions modulo the partition count, so adding partitions changes the mapping.
With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down. True or False?
True
min.insync.replicas = 2 implies that at least 2 brokers that are ISRs (including the leader) must acknowledge. True or False?
True
What do you expect with enable.auto.commit=true & synchronous processing of batches? (Consumer Offset commit strategy)
With auto commit, offsets are committed automatically for you at a regular interval (auto.commit.interval.ms=5000 by default) every time you call .poll(). If you don't use synchronous processing, you will be in "at most once" behavior, because offsets will be committed before your data is processed.
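A sketch of the safe auto-commit pattern, processing each batch synchronously before the next .poll() (topic, group id and the processing logic are placeholders):
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
public class AutoCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-first-consumer-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "true");      // offsets committed for you on poll()
        props.put("auto.commit.interval.ms", "5000"); // at most every 5 seconds (default)
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singleton("my-first-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                // finish processing BEFORE the next poll(), or this degrades to at-most-once
                System.out.println(record.key() + " -> " + record.value());
            }
        }
    }
}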
Create a kafka topic with name my-first-topic
bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --create --replication-factor 1 --partitions 1
Delete kafka topic my-first-topic
bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --delete (Note: This will have no impact if delete.topic.enable is not set to true)
Describe kafka topic my-first-topic
bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --describe
Start a zookeeper at default port 2181
bin/zookeeper-server-start.sh config/zookeeper.properties
Deleted records can still be seen by consumers for a period determined by what?
delete.retention.ms=24 hours (default)
The offset of a message is _________.
immutable
What configuration makes the cleaner check for work every 15 seconds?
log.cleaner.backoff.ms
What configuration deletes records based on keys for a topic, removing old duplicates of a key after the active segment is committed? (Kafka default for the topic __consumer_offsets)
log.cleanup.policy=compact
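For example, compaction could be enabled per topic at creation time (hypothetical topic name):
> bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-compacted-topic --create --replication-factor 1 --partitions 1 --config cleanup.policy=compact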
What configuration is used to delete data based on the age of the data (default is 1 week)?
log.cleanup.policy=delete
What configuration is used for the max size in bytes for each partition?
log.retention.bytes = -1 (infinite default)
What configuration sets the number of hours to keep data for?
log.retention.hours = 168 (1 week, default)
What is the configuration for the time Kafka will wait before closing a segment if it is not full?
log.roll.ms / log.roll.hours = 1 week (default; segment.ms at the topic level)
What is the configuration for the max size of a single segment in bytes?
log.segment.bytes = 1 GB (default)
As long as the number of partitions remains constant for a topic (no new partitions), will the same key always go to the same partition?
Yes
Each partition has its own offsets, starting from _.
0
max.partition.fetch.bytes = 1MB (default)
Maximum data returned by the broker per partition. If you read from 100 partitions, you will need a lot of memory (RAM).
Delivery semantics: what are the three different types and the definition of each?
At most once: Offsets are committed as soon as the message batch is received. If the processing goes wrong, the message is lost (it won't be read again).
At least once (default): Offsets are committed after the message is processed. If the processing goes wrong, the message will be read again, which can result in duplicate processing. Make sure your processing is idempotent (i.e. re-processing the message won't impact your systems); most applications use this and ensure processing is idempotent (see the sketch after this card).
Exactly once: Can only be achieved for Kafka=>Kafka workflows using the Kafka Streams API. For Kafka=>Sink workflows, use an idempotent consumer.
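A minimal at-least-once sketch: disable auto commit and call commitSync() only after the batch is processed. Topic, group id and processing are placeholders, and the processing is assumed idempotent:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-first-consumer-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit manually, after processing
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singleton("my-first-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.key() + " -> " + record.value()); // idempotent processing
            }
            consumer.commitSync(); // offsets committed only after the whole batch is processed
        }
    }
}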
What is heartbeat.interval.ms = 3 seconds (default)?
Heartbeats are sent at a 3-second interval. Usually set to 1/3 of session.timeout.ms.
What does the consumer heartbeat thread do?
It sends heartbeats; the heartbeat mechanism is used to detect if a consumer application is dead.
What is session.timeout.ms = 10 seconds (default)?
If no heartbeat is sent within the 10-second period, the consumer is considered dead. Set a lower value for faster consumer rebalances.
ZooKeeper servers will be deployed on multiple nodes. This is called an ensemble. An ensemble is a set of 2n + 1 ZooKeeper servers where n is any number greater than 0. The odd number of servers allows ZooKeeper to perform majority elections for leadership. At any given time, there can be up to n failed servers in an ensemble and the ZooKeeper cluster will keep quorum. If at any time, quorum is lost, the ZooKeeper cluster will go down. In Zookeeper multi-node configuration, initLimit and syncLimit are used to govern how long following ZooKeeper servers can take to initialize with the current leader and how long they can be out of sync with the leader.
If tickTime=2000, initLimit=5 and syncLimit=2 then a follower can take (tickTime*initLimit) = 10000ms to initialize and may be out of sync for up to (tickTime*syncLimit) = 4000ms
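A hypothetical three-node zoo.cfg showing these settings together (hostnames and dataDir are placeholders; 2888 and 3888 are the peer and leader-election ports):
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888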
auto.offset.reset=none does what?
It will throw an exception if no offset is found
When producing to a topic that does not exist and auto.create.topics.enable = true, how does it get created?
Kafka creates the topic automatically using the broker settings num.partitions and default.replication.factor.
What is batch.size = 32KB or 64KB and what is its purpose?
The maximum number of bytes that will be included in a batch (default 16KB). Any message bigger than the batch size will not be batched. (High Throughput Producer using compression and batching)
Which errors does the producer automatically recover from?
LEADER_NOT_AVAILABLE, NOT_LEADER_FOR_PARTITION, REBALANCE_IN_PROGRESS
What are the non-retriable errors for a producer?
MESSAGE_TOO_LARGE
What does max.poll.interval.ms = 5 minute (default) do?
The max amount of time between two .poll() calls before the consumer is declared dead. If processing a message batch generally takes longer in your application, increase the interval.
What is linger.ms = 20 and what is its purpose?
The number of milliseconds a producer is willing to wait before sending a batch out (default 0). Increasing linger.ms increases the chance of batching. (High Throughput Producer using compression and batching)
Describe acks=0
The producer does not wait for an ack (possible data loss)
Describe acks=1
The producer waits for the leader ack (limited data loss)
Describe acks=all
The producer waits for the leader and replica acks (no data loss)
auto.offset.reset=latest does what?
Reads from the end of the log (consumer offset)
Consumer offsets can be lost if a consumer hasn't read new data in 7 days. What controls this?
This is controlled by the broker setting offsets.retention.minutes
If a key is sent, will all messages for that key always go to the same partition? If so, why?
Yes. This can be used to order the messages for a specific key, since ordering is guaranteed within a partition. Keys are hashed using the murmur2 algorithm by default, then taken modulo the number of partitions (see the sketch below).
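A sketch of that mapping using Kafka's own murmur2 helper (the key "drew" and partition count 3 are illustrative; this mirrors, but is not, the DefaultPartitioner source):
import org.apache.kafka.common.utils.Utils;
public class KeyToPartition {
    public static void main(String[] args) {
        byte[] keyBytes = "drew".getBytes(); // illustrative key
        int numPartitions = 3;               // illustrative partition count
        // hash the key, force the hash positive, then take it modulo the partition count
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        System.out.println("key 'drew' -> partition " + partition);
    }
}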
Are keys hashed by using "murmur2" algorithm by default?
True
Consumers read messages in the order they are stored in the topic-partition. True or False?
True
Example: replication.factor = 3, min.insync.replicas = 2, acks = all can only tolerate 1 broker going down; otherwise the producer will receive a NOT_ENOUGH_REPLICAS exception on send. True or False?
True
Messages are appended to a topic-partition in the order they are sent. True or False?
True
What is compression.type=snappy?
The value can be none (default), gzip, lz4 or snappy. Compression is enabled at the producer level and doesn't require any config change on brokers or consumers. Compression is more effective on bigger batches of messages sent to Kafka. (High Throughput Producer using compression and batching)
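A high-throughput producer sketch combining compression.type, linger.ms and batch.size from these cards (the values are the ones the cards suggest, not client defaults):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
public class HighThroughputProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");              // compress whole batches at the producer
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");                         // wait up to 20ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32 * 1024)); // 32KB batches
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}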
Can a producer choose to send a key with a message?
Yes
Can a topic have one or more partition?
Yes
Is the poll mechanism also used to detect if the consumer application is dead?
Yes
Is enable.auto.commit=false with manual committing of offsets the recommended strategy? (Consumer Offset commit strategy)
Yes
Is ordering guaranteed within a partition, and is data immutable once written to a partition?
Yes
Is it true that a partition of a topic cannot be deleted once created? Yes or No?
Yes
Kafka takes bytes as input without even loading them into memory. What is this called?
Zero Copy
List the default ports for Zookeeper
Zookeeper client port: 2181, Zookeeper peer port: 2888, Zookeeper leader election port: 3888
Start a kafka server at default port 9092
bin/kafka-server-start.sh config/server.properties
Find out all the partitions without a leader
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --unavailable-partitions
List all kafka topics
bin/kafka-topics.sh --zookeeper localhost:2181 --list
What is max.in.flight.requests.per.connection?
The number of producer requests that can be made in parallel (default is 5). (Safe Producer Config)
What is retries = MAX_INT?
The number of retries by the producer in case of a transient failure/exception (default is 0). (Safe Producer Config)
Per thread, ___ consumer is the rule.
one. The consumer must not be multi-threaded.
At a time, only ___ segment is active in a ________.
one; partition
What is enable.idempotence = true?
The producer sends a producer ID with each message so Kafka can identify duplicates. When Kafka receives a duplicate message with the same producer ID that it has already committed, it does not commit it again but still sends an ack to the producer (default is false). (Safe Producer Config)
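A sketch pulling together the (Safe Producer Config) settings from these cards; the broker/topic side additionally needs min.insync.replicas=2, and newer clients set several of these by default:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");                   // broker de-duplicates by producer ID
        props.put(ProducerConfig.ACKS_CONFIG, "all");                                  // leader + in-sync replicas must ack
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE)); // retry transient errors
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");          // parallel requests per connection
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}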
auto.offset.reset=earliest does what?
Reads from the start of the log (consumer offset)
The broker holding the leader replica of a partition is called the leader of that partition, and only the leader can ________ and _____ data for the partition.
receive, serve
Partitions are made of ________ (.log files)
segments
The log cleanup happens on partition ________. Smaller/more segments mean the log cleanup will happen more often!
segments