kafka specific
offsets
Unique, incremental IDs of messages within a partition are called offsets; an offset only has meaning for that specific partition.
consumers
a consumer is an application (often a java application) that reads data from a topic. just like producers, consumers recover automatically from broker failures. data is read in order within each partition.
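a rough sketch of a minimal consumer with the standard Java client (the broker address, group id, and topic name are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class SimpleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // any broker in the cluster
            props.put("group.id", "my-application");            // the consumer group this consumer belongs to
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("my-topic"));

            while (true) {
                // records come back in order within each partition
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }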
zookeeper
manages the brokers and helps perform leader election for partitions; it keeps track of which replica is the leader and which are the ISRs (in-sync replicas).
what's the distributed part of kafka
each broker holds only some of a topic's partitions. when you create a topic, its partitions are automatically distributed across multiple brokers.
exactly once
the holy grail of message delivery. it can only be achieved in kafka-to-kafka workflows, using the Kafka Streams API. for kafka-to-external-system workflows you must use an idempotent consumer to avoid dupes in the final DB.
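a rough sketch of how that is switched on in Kafka Streams (the application id and broker address are placeholders; on older Streams versions the property value is "exactly_once" instead of "exactly_once_v2"):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class ExactlyOnceConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");      // placeholder app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
            // turn on exactly-once processing for a kafka-to-kafka topology
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once_v2");
            // ... build the topology and start a KafkaStreams instance with these props
        }
    }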
acks strategy
how the producer writes data to kafka. acks can be 0, 1, or all:
acks=0: no acknowledgement; the producer doesn't wait, so data loss is possible
acks=1: the default; the producer waits for the leader's acknowledgement, so limited data loss
acks=all: the leader and all in-sync replicas must acknowledge, so no data loss
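a sketch of where acks is set on the producer side (the broker address is a placeholder):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class AcksExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            // "0" = don't wait, "1" = wait for the leader, "all" = wait for leader + in-sync replicas
            props.put(ProducerConfig.ACKS_CONFIG, "all");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.close();
        }
    }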
cluster
multiple brokers. if you connect to a single broker, you are connected to the entire cluster.
at most once
not a great method of committing consumer offsets. offsets are committed as soon as the message is received. if processing goes wrong (i.e. the consumer goes down), the message will be lost.
consumer groups
a consumer group represents an application. each consumer within the group reads from an exclusive set of partitions, so the group as a whole reads the entire topic.
broker
think of this as a server that holds topics and their associated partitions (each broker holds only some of the partitions).
at least once
this is the preferred method of committing consumer offsets. offsets are committed after the message is processed: you read the data, do something with it, then commit the offset. if processing goes wrong (the consumer goes down), the message will be read again.
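a sketch of the at-least-once loop, assuming a consumer configured like the one above but with enable.auto.commit set to "false" (handle() is a hypothetical stand-in for whatever you do with the data):

    // assumes: props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        for (ConsumerRecord<String, String> record : records) {
            handle(record); // hypothetical processing step: "do something with the data"
        }
        // commit only after the batch is processed; a crash before this line means the batch is read again
        consumer.commitSync();
    }

committing before the processing loop instead of after it would give you at-most-once behaviour instead.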
how is kafka fault tolerant
through the topic replication factor. take topic A with 2 partitions and a replication factor of 2: topic A, partition 0 will be replicated on two brokers. if we lose one of those brokers, the surviving broker can still serve the data.
Caveat to 'at least once'
to avoid duplicate processing of messages, you need to ensure your processing is idempotent (i.e. processing the same message twice does not impact your system).
topics
a particular stream of data (sort of like a table in a database). Topics are split into partitions.
kafka guarantees
1. messages are appended to a topic-partition in the order they are sent
2. consumers read messages in the order they are stored in a topic-partition
3. with a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down
4. as long as the number of partitions for a topic remains constant (no new partitions), the same key will always go to the same partition (see the sketch below)
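a small sketch of guarantee 4, assuming a producer configured as in the acks example (topic name and key are placeholders; in real code the checked exceptions from get() need handling):

    ProducerRecord<String, String> first  = new ProducerRecord<>("my-topic", "user-123", "login");
    ProducerRecord<String, String> second = new ProducerRecord<>("my-topic", "user-123", "logout");
    int p1 = producer.send(first).get().partition();   // blocking get() just to read the metadata
    int p2 = producer.send(second).get().partition();
    System.out.println(p1 == p2); // true: same key, same partition, as long as the partition count is unchanged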
To produce data to a topic, a producer must provide the Kafka client with...
any broker from the cluster and the topic name. kafka clients will route your data to the appropriate brokers and partitions for you.
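a minimal producer sketch along those lines (broker address and topic name are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // any broker from the cluster is enough to bootstrap the connection
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // only the topic name is needed; the client routes to the right broker and partition
            producer.send(new ProducerRecord<>("my-topic", "hello kafka"));
            producer.flush();
            producer.close();
        }
    }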
consumer offsets
kafka stores the offsets at which a consumer group has been reading. they're committed live to a kafka topic named __consumer_offsets, so if a consumer dies it will read back from where it left off. consumers choose when to commit offsets.
producer : round robin
when no key is provided, the data is sent round robin across the topic's partitions (partition 0, then 1, then 2, ...), so it is spread over the brokers rather than sent to every broker.
messages
within each partition you have messages, which are individual entries of data. each message is an atomic unit (a single entry) in a log (a partition of a topic).
producer
producers write data to topics (they get data into kafka). producers automatically recover when a broker fails.
when you create a topic what requirements must be met?
you must specify the number of partitions and a replication factor. Note that your replication factor cannot exceed the number of brokers that you have in your cluster
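a sketch of creating a topic programmatically with the AdminClient (broker address and topic name are placeholders; 3 partitions with a replication factor of 2 means the cluster needs at least 2 brokers):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic topic = new NewTopic("my-topic", 3, (short) 2); // name, partitions, replication factor
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }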