This section presents frequently asked Kafka interview questions with curated answers to sharpen your knowledge and support interview preparation.

1.

What are Znodes in Kafka Zookeeper? How many types of Znodes are there?

Answer»

The nodes in a ZooKeeper tree are called znodes. Each znode maintains a stat structure containing version numbers for data changes and ACL changes, along with timestamps. ZooKeeper uses the version number and timestamp to validate its cache and to coordinate updates. Each time a znode's data changes, its version number increases.

There are three different types of znodes (a CLI sketch follows the list):

  • Persistence Znode: These znodes remain in existence even after the client that created them has disconnected. Unless otherwise specified, all znodes are persistent by default.
  • Ephemeral Znode: Ephemeral znodes are active only while the client that created them is alive. When that client disconnects from the ZooKeeper ensemble, the ephemeral znodes are automatically removed. They play a significant role in leader election.
  • Sequential Znode: When a znode is created, ZooKeeper can be asked to append a monotonically increasing counter to the end of its path. The counter is unique to the parent znode. Sequential znodes can be either persistent or ephemeral.
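For illustration, here is how each znode type can be created with the standard ZooKeeper CLI (zkCli.sh); the paths and data are hypothetical:

# Persistent znode (the default)
create /sample-app "config-data"

# Ephemeral znode (-e): removed automatically when this client session ends
create -e /sample-app/live-broker "broker-1"

# Sequential znode (-s): ZooKeeper appends an increasing counter to the path,
# e.g. /sample-app/request-0000000001
create -s /sample-app/request- "payload"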

2.

Differentiate between Kafka Streams and Spark Streaming.

Answer»
Kafka Streams | Spark Streaming
Kafka is fault-tolerant because of partitions and their replicas. | Spark can restore partitions using the cache and RDDs (Resilient Distributed Datasets).
It is only capable of handling real-time streams. | It is capable of handling both real-time and batch tasks.
Messages in the Kafka log are persistent. | To keep the data durable, you'll need to use a DataFrame or another data structure.
There are no interactive modes in Kafka; the broker simply consumes the data from the producer and waits for the client to read it. | Interactive modes are available.
3.

How will you change the retention time in Kafka at runtime?

Answer»

A topic's retention time can be configured in Kafka. The default retention time for a topic is seven days. We can set the retention time while creating a new topic. When a topic is created, the broker property log.retention.hours is used to set its retention time. When the configuration of a currently running topic needs to be modified, the change must be made from the command line.

The right command depends on the Kafka version in use (an example follows the list):

  • Up to version 0.8.2, use kafka-topics.sh --alter.
  • Starting with version 0.9.0, use kafka-configs.sh --alter.
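For example, on versions from 0.9.0 onward, the retention period of an existing topic can be changed at runtime as follows; the topic name and retention value are placeholders, and newer releases accept --bootstrap-server in place of --zookeeper:

./bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name sample-topic \
  --add-config retention.ms=604800000   # seven days, in milliseconds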
4.

What do you mean by BufferExhaustedException and OutOfMemoryException in Kafka?

Answer»

When the producer cannot allocate memory for a record because the buffer is full, a BufferExhaustedException is thrown. The exception is thrown if the producer is in non-blocking mode and the rate of production exceeds, for long enough, the rate at which data is drained from the buffer, exhausting the allocated buffer.
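The buffer behaviour is controlled by producer settings such as the following; this is a sketch, and the values shown are the usual defaults rather than recommendations:

# producer.properties (sketch)
buffer.memory=33554432   # total bytes the producer may use for buffering (32 MB)
max.block.ms=60000       # how long send() may block when the buffer is full
batch.size=16384         # per-partition batch size, in bytes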

If producers are sending huge messages, or if there is a spike in the number of messages sent at a rate quicker than the rate of downstream processing, an OutOfMemoryException may arise. As a result, the message queue fills up, consuming memory space.

5.

Can the number of partitions for a topic be changed in Kafka?

Answer»

Currently, Kafka does not allow you to reduce the number of partitions for a topic. The partition count can be increased but not decreased. The alter command in Apache Kafka allows you to change the behavior of a topic and its associated configurations; to add partitions, use the alter command.

To increase the number of partitions to five, use the following command:

./bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic sample-topic --partitions 5
6.

What do you mean by graceful shutdown in Kafka?

Answer»

The Kafka cluster automatically detects any broker shutdown or failure and elects new leaders for the partitions that broker was hosting. This can happen because of a server failure, or even when a server is brought down intentionally for maintenance or configuration changes. When a server is taken down on purpose, Kafka provides a graceful method of stopping the server rather than killing it. (A configuration sketch follows the list below.)

When a server is switched off:

  • It syncs all of its logs to disk, so that no log recovery is needed when it is restarted. Since log recovery takes time, this speeds up intentional restarts.
  • Prior to shutting down, leadership of all partitions for which the server is the leader is migrated to the replicas. This makes the leadership transfer faster and reduces the time each partition is unavailable to a few milliseconds.
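Controlled shutdown is governed by broker settings like the following; this is a minimal sketch, and the values shown are the usual defaults:

# server.properties (sketch)
controlled.shutdown.enable=true           # transfer leadership before shutting down
controlled.shutdown.max.retries=3         # retry attempts for a controlled shutdown
controlled.shutdown.retry.backoff.ms=5000 # wait between retries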
7.

How will you expand a cluster in Kafka?

Answer»

To add a server to a Kafka cluster, it only needs to be assigned a unique broker id, and Kafka must be started on that server. However, a new server will not be assigned any data partitions until a new topic is created. Hence, when a new machine is added to the cluster, some existing data must be migrated to it. The partition reassignment tool is used to move some partitions to the new broker. Kafka makes the new server a follower of the partitions it is migrating, allowing it to fully replicate the data in those partitions. When all of the data has been replicated, the new server joins the ISR, and one of the existing replicas deletes its data for those partitions.
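A sketch of that workflow with the partition reassignment tool; the file names and broker ids are placeholders, and newer releases accept --bootstrap-server in place of --zookeeper:

# 1. Generate a candidate assignment that includes the new broker (id 5)
./bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --topics-to-move-json-file topics.json --broker-list "1,2,3,5" --generate

# 2. Save the proposed assignment to expand.json, then execute it
./bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file expand.json --execute

# 3. Verify that the reassignment has completed
./bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file expand.json --verify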

8.

What do you mean by an unbalanced cluster in Kafka? How can you balance it?

Answer»

Adding new brokers to an existing Kafka cluster is as simple as assigning them a unique broker id, listeners, and log directory in the server.properties file. However, these brokers will not be assigned any data partitions from the cluster's existing topics, so they won't perform much work unless partitions are moved to them or new topics are created. A cluster is referred to as unbalanced if it has any of the following problems:

Leader Skew: 

Consider the following scenario: a topic with three partitions and a replication factor of three across three brokers. 

The leader handles all reads and writes for a partition. Followers send fetch requests to the leader to receive its most recent messages; they exist solely for redundancy and fail-over purposes.

Consider the case where a broker fails. The failed broker may have hosted multiple leader partitions. For each of the failed broker's leader partitions, a follower on another broker is promoted to leader. The follower must be in sync with the leader to be promoted, since fail-over to an out-of-sync replica is not allowed.

If another broker then goes down, all of the leaders end up on the same broker, leaving no redundancy.

When brokers 1 and 3 come back online, the partitions regain some redundancy, but the leaders remain concentrated on broker 2.

As a result, the Kafka brokers have a leader imbalance. A cluster is in a leader-skewed state when a node leads more partitions than (number of partitions / number of brokers).

Solving the leader skew problem:

Kafka offers the ability to reassign leaders to the desired replicas in order to tackle this problem. This can be accomplished in one of two ways:

  • The broker configuration auto.leader.rebalance.enable=true allows the controller node to transfer leadership back to the preferred replica leaders, restoring the even distribution.
  • Running kafka-preferred-replica-election.sh selects the preferred replica for all partitions (see the example after this list). The utility takes a JSON file with a mandatory list of ZooKeeper hosts and an optional list of topic partitions; if no list is provided, it retrieves all of the cluster's topic partitions from ZooKeeper. The kafka-preferred-replica-election.sh utility can be time-consuming to use; custom scripts that render only the required topics and partitions can automate the process across the cluster.
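For example, the election tool can be run as follows; the JSON file is optional and its path here is hypothetical:

./bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181 \
  --path-to-json-file topic-partitions.json   # omit to elect for all partitions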

Broker Skew:

Let us consider a Kafka cluster with nine brokers. Let the topic name be "sample_topic." The following is how the brokers are assigned to the topic in our example:

Broker Id | Number of Partitions | Partitions | Is Skewed?
0 | 3 | (0, 7, 8) | No
1 | 4 | (0, 1, 8, 9) | No
2 | 5 | (0, 1, 2, 9, 10) | No
3 | 6 | (1, 2, 3, 9, 10, 11) | Yes
4 | 6 | (2, 3, 4, 10, 11, 12) | Yes
5 | 6 | (3, 4, 5, 11, 12, 13) | Yes
6 | 5 | (4, 5, 6, 12, 13) | No
7 | 4 | (5, 6, 7, 13) | No
8 | 3 | (6, 7, 8) | No

The topic "sample_topic" is skewed on brokers 3, 4, and 5, because a broker is considered skewed if it holds more partitions of a given topic than the average.

Solving the broker skew problem:

The following steps can be used to solve it:

  • Generate the candidate assignment configuration using the partition reassignment tool (kafka-reassign-partitions.sh) with the --generate option. This shows the current and proposed replica assignments.
  • Create a JSON file with the proposed assignment (a sketch of the file follows this list).
  • Run the partition reassignment tool with the --execute option to update the metadata for balancing.
  • Once the partition reassignment is complete, run the kafka-preferred-replica-election.sh tool to finish the balancing.
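The reassignment JSON file passed to the tool has roughly the following shape; the topic name, partitions, and broker ids are placeholders:

{
  "version": 1,
  "partitions": [
    {"topic": "sample_topic", "partition": 3,  "replicas": [0, 1, 2]},
    {"topic": "sample_topic", "partition": 11, "replicas": [6, 7, 8]}
  ]
}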
9.

What are the guarantees that Kafka provides?

Answer»

Kafka provides the following guarantees (a producer-side sketch follows the list):

  • Messages appear in the same order in which they were published by the producer; message order is maintained.
  • The replication factor determines the number of replicas. With a replication factor of N, the Kafka cluster tolerates up to N-1 server failures.
  • Per partition, Kafka provides "at-least-once" delivery semantics: a published message is guaranteed to reach a consumer at least once, though it may be delivered more than once.
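On the producer side, these guarantees interact with a few settings; a sketch, with illustrative values:

# producer.properties (sketch)
acks=all                 # wait until the full in-sync replica set acknowledges the write
retries=2147483647       # retry transient failures; retries can duplicate messages,
                         # which is why the default semantics are "at least once"
enable.idempotence=true  # de-duplicates retries for exactly-once writes per partition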
10.

What do you understand about log compaction and quotas in Kafka?

Answer»

Log compaction is a mechanism by which Kafka ensures that, for each topic partition, at least the last known value for each message key within the log is retained. This allows the restoration of state after an application crash or a system failure, and allows caches to be refreshed after an application restarts during operational maintenance. Because of log compaction, any consumer processing the log from the beginning will see at least the final state of all records in the order in which they were written.

As of Kafka 0.9, a Kafka cluster can apply quotas to produce and fetch requests. Quotas are byte-rate limits defined per client-id. A client-id logically identifies an application making requests, so a single client-id can span multiple producer and consumer instances; the quota applies to them all as a single entity. Quotas prevent a single application from monopolizing broker resources and causing network saturation by consuming extremely large amounts of data.
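As a sketch, compaction is enabled per topic and quotas are set per client-id with kafka-configs.sh; the topic name, client-id, and rates below are placeholders:

# Enable log compaction for a topic
./bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name sample-topic \
  --add-config cleanup.policy=compact

# Limit a client-id to 1 MB/s produce and 2 MB/s fetch throughput
./bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type clients --entity-name sample-client \
  --add-config producer_byte_rate=1048576,consumer_byte_rate=2097152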

11.

Tell me about some of the use cases where Kafka is not suitable.

Answer»

Following are some of the use cases where Kafka is not suitable:

  • Kafka is designed to manage large amounts of data. Traditional messaging systems would be more appropriate if only a small number of messages need to be processed per day.
  • Although Kafka includes a streaming API, it is not sufficient for executing complex data transformations. Kafka should be avoided for ETL (extract, transform, load) jobs.
  • There are superior options, such as RabbitMQ, for scenarios where a simple task queue is required.
  • Kafka is not a good choice if long-term storage is necessary; it only retains data for a configured retention period.
12.

Describe message compression in Kafka. What is the need of message compression in Kafka? Also mention if there are any disadvantages of it.

Answer»

In Kafka, producers transmit data to brokers in JSON format. The JSON format stores data as strings, which can result in many duplicate records in a Kafka topic and increased disk usage. Therefore, data is compressed before messages are delivered to Kafka, to save disk space. Because message compression is performed on the producer side, no changes to the consumer or broker configuration are required (see the sketch below).
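Enabling compression is a single producer-side setting; a sketch, where the choice of codec depends on the workload:

# producer.properties (sketch)
compression.type=snappy   # other valid values: gzip, lz4, zstd, none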

Message compression is advantageous for the following reasons:

  • It reduces the size of messages sent to Kafka, decreasing their latency.
  • Producers can send more messages to the broker with the same bandwidth.
  • When Kafka data is stored on paid cloud platforms, compression can reduce costs.
  • Compressed data takes less space on disk, allowing faster read and write operations.

Message compression has the following disadvantages:

  • Producers must spend some CPU cycles on compression.
  • Consumers must spend some CPU cycles on decompression.
  • Together, compression and decompression place a higher load on the CPU.
13.

What do you mean by Confluent Kafka? What are its advantages?

Answer»

Confluent is a data streaming platform based on Apache Kafka: a full-scale streaming platform capable of not just publish-and-subscribe, but also data storage and processing within the stream. Confluent Kafka is a more comprehensive distribution of Apache Kafka. It extends Kafka's integration capabilities, adding tools for optimizing and managing Kafka clusters as well as ways to ensure the security of the streams. The Confluent Platform makes Kafka easy to set up and operate. Confluent's software comes in three varieties:

  • A free, open-source streaming platform that makes it simple to get started with real-time data streams;
  • An enterprise-grade version with additional administration, operations, and monitoring tools;
  • A premium cloud-based version.

Following are the advantages of Confluent Kafka:

  • It features practically all of Kafka's characteristics, as well as a few extras.
  • It greatly simplifies the administrative operations procedures.
  • It relieves data managers of the burden of thinking about data relaying.
14.

Differentiate between Kafka and Flume.

Answer»

Apache Flume is a dependable, distributed, and available system for efficiently aggregating, collecting, and transporting massive amounts of log data. Its architecture is versatile and simple, based on streaming data flows. It is written in Java. It has its own query processing engine, allowing it to transform each fresh batch of data before sending it to its intended sink. It is designed to be adaptable.

The following table illustrates the differences between Kafka and Flume:

Kafka | Flume
Kafka is a distributed data system. | Apache Flume is an available, dependable, and distributed system.
It essentially functions as a pull model. | It essentially functions as a push model.
It is made for ingesting and analysing real-time streaming data. | It efficiently collects, aggregates, and moves large amounts of log data from many sources to a centralised data repository.
It is resilient to node failure and facilitates automatic recovery. | If the flume-agent fails, events in the channel are lost.
Kafka operates as a cluster that manages incoming high-volume data streams in real time. | Flume is a tool for collecting log data from distributed web servers.
It is a fault-tolerant, efficient, and scalable messaging system. | It is made specifically for Hadoop.
It is simple to scale. | It is less scalable than Kafka.
15.

What do you understand about Kafka MirrorMaker?

Answer»

MirrorMaker is a standalone utility for copying data from one Apache Kafka cluster to another. It reads data from topics in the source cluster and writes it to topics with the same name in the destination cluster. The source and destination clusters are separate entities that can have different partition counts and offset values.
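The legacy MirrorMaker tool is invoked roughly as follows; the configuration file names are placeholders, with the consumer config pointing at the source cluster and the producer config at the destination:

./bin/kafka-mirror-maker.sh \
  --consumer.config source-cluster-consumer.properties \
  --producer.config target-cluster-producer.properties \
  --whitelist ".*"   # regular expression selecting the topics to mirror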

16.

Differentiate between Kafka and Java Messaging Service (JMS).

Answer»

The following table illustrates the differences between Kafka and Java Messaging Service:

Java Messaging Service (JMS) | Kafka
Messages are delivered using a push model; consumers receive messages continuously. | A pull mechanism is used for delivery; consumers pull messages when they are ready to process them.
Once the JMS queue receives confirmation from the consumer that the message has been received, it is permanently deleted. | Messages are retained for a configured length of time, even after the consumer has read them.
JMS is better suited to multi-node clusters in very complicated systems. | Kafka is better suited to handling large volumes of data.
JMS is a FIFO queue and does not support any other type of ordering. | Kafka guarantees that messages within a partition are delivered in the order in which they were appended.
17.

Describe in what ways Kafka enforces security.

Answer»

The security provided by Kafka is made up of three components (a configuration sketch follows the list):

  • Encryption: All communications sent between the Kafka broker and its clients are encrypted. This prevents data from being intercepted by other clients; all messages are shared between the components in an encrypted format.
  • Authentication: Applications that use the Kafka broker must be authenticated before they can connect to Kafka. Only approved applications are allowed to send or receive messages. Authorized applications have unique ids and passwords to identify themselves.
  • Authorization: After authentication, authorization is carried out. Once a client has been authenticated, it can be permitted to publish or consume messages. Authorization also ensures that write access can be restricted per application, to prevent data contamination.
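A minimal broker-side sketch combining the three layers, assuming SASL over TLS; the host, paths, and passwords are placeholders:

# server.properties (sketch)
listeners=SASL_SSL://broker1:9093                    # encrypted, authenticated listener
ssl.keystore.location=/path/to/kafka.broker.keystore.jks
ssl.keystore.password=changeit
sasl.enabled.mechanisms=PLAIN                        # authentication mechanism
# ACL-based authorization (the class name differs on Kafka versions before 2.4)
authorizer.class.name=kafka.security.authorizer.AclAuthorizer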
18.

Differentiate between Redis and Kafka.

Answer»

The following table illustrates the differences between Redis and Kafka:

Redis | Kafka
Push-based message delivery is supported: messages published to Redis are distributed to the consumers automatically. | Pull-based message delivery is supported: messages published to the broker are not pushed; consumers pull them when they are ready.
Message retention is not supported; messages are destroyed once they have been delivered to the recipients. | Kafka retains messages in its log for a configurable period.
Parallel processing is not supported. | Because of Kafka's partitioning, multiple consumers in a consumer group can consume partitions of a topic concurrently.
Redis cannot manage vast amounts of data, because it is an in-memory database. | Kafka can handle massive amounts of data, since it uses disk as its primary storage.
Because Redis is an in-memory store, it is much faster than Kafka. | Because Kafka stores data on disk, it is slower than Redis.
19.

What are the parameters that you should look for while optimising Kafka for optimal performance?

Answer»

Two major measurements are taken into account while tuning for optimal performance: latency, the amount of time it takes to process one event, and throughput, the number of events that can be processed in a given length of time. Most systems are tuned for either latency or throughput, whereas Kafka can balance both.

The following stages are involved in optimizing Kafka's performance:

  • Kafka producer tuning: Data that producers must send to brokers is held in a batch, and the producer transmits the batch to the broker when it is ready. Two parameters must be considered to tune the producers for latency and throughput: batch size and linger time (a configuration sketch follows this list). The batch size must be chosen with care: if the producer is constantly sending messages, a bigger batch size maximizes throughput, but if the batch size is set too large, it may never fill up, or take a long time to do so, hurting latency. The batch size should be selected based on the volume of messages the producer sends. The linger time adds a delay while more records are added to the batch, allowing larger batches to be transmitted. A longer linger time lets more messages be sent in one batch, but latency may suffer; a shorter linger time sends fewer messages faster, yielding lower latency but also lower throughput.
  • Tuning the Kafka broker: Each partition in a topic has a leader, and each leader has zero or more followers. It is critical that the leaders are balanced properly, so that some nodes are not overworked compared to others.
  • Tuning Kafka consumers: To ensure that consumers keep up with producers, the number of partitions for a topic should equal the number of consumers. The partitions are divided among the consumers in the same consumer group.
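A producer-side tuning sketch; the values are illustrative and should be tuned against measured latency and throughput:

# producer.properties (sketch)
batch.size=65536       # larger batches favour throughput
linger.ms=10           # a small delay lets batches fill; 0 favours latency
compression.type=lz4   # smaller payloads on the wire, at some CPU cost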
20.

Differentiate between RabbitMQ and Kafka.

Answer»

Following are the differences between Kafka and RabbitMQ:

Based on Architecture:

RabbitMQ:

  • RabbitMQ is a general-purpose message broker; request/reply, point-to-point, and pub-sub communication patterns are all supported.
  • It has a smart broker / dumb consumer model: the broker monitors the consumer's state and delivers messages to consumers consistently, at roughly the same speed.
  • It is a mature platform with well-supported client libraries for Java, .NET, Ruby, and Node.js, and it offers a variety of plugins.
  • Communication can be synchronous or asynchronous, and it provides options for distributed deployment.

Kafka:

  • Kafka is a message and stream platform for high-volume publish-subscribe messages and streams. It is durable, quick, and scalable.
  • It is a durable message store, similar to a log; it runs in a server cluster and maintains streams of records in topics (categories).
  • Messages are made up of three components: a value, a key, and a timestamp.
  • It has a dumb broker / smart consumer model: the broker does not track which messages have been read by consumers; it simply retains all messages for a configured amount of time.
  • External services are required to run it, including Apache ZooKeeper in some deployments.

Manner of Handling Messages:

Basis | RabbitMQ | Kafka
Ordering of messages | Message ordering is not supported. | Partitions enable message ordering; message keys are used when sending messages to a topic.
Lifetime of messages | Since RabbitMQ is a message queue, messages are removed once they have been consumed and acknowledged. | Since Kafka is a log, messages are always present; a retention policy controls how long they are kept.
Prioritizing the messages | Priorities can be specified for messages, and messages can be consumed according to their priority. | Prioritizing messages is not possible in Kafka.
Guarantee of delivering the messages | Atomicity is not guaranteed, even when the transaction involves a single queue. | Kafka guarantees that a whole batch of messages in a partition is either sent successfully or fails.

Based on Approach:

  • Kafka: The pull model is used by Kafka. Consumers request batches of messages from a given offset. When there are no messages past the offset, Kafka allows long polling, which eliminates tight loops. A pull model makes sense because of Kafka's partitions: within a partition with no competing consumers, Kafka provides message ordering. This allows users to take advantage of message batching for more efficient delivery and higher throughput.
  • RabbitMQ: RabbitMQ operates on a push model, which prevents consumers from becoming overwhelmed by imposing a prefetch limit on them. This can be used for low-latency messaging. The push model's goal is to distribute messages individually and promptly, ensuring that work is parallelized evenly and messages are handled roughly in the order they arrived in the queue.

Based on Performance:

  • Kafka: Compared to message brokers like RabbitMQ, Kafka provides significantly better performance. It boosts performance by using sequential disk I/O, making it a good choice for implementing queues. With limited resources it can achieve high throughput (millions of messages per second), which is essential for big data use cases.
  • RabbitMQ: RabbitMQ can also handle a million messages per second, but it does so at the expense of more resources (around 30 nodes). RabbitMQ can be used for many of the same applications as Kafka, but it must be used in conjunction with other technologies, such as Apache Cassandra.
21.

What is a Replication Tool in Kafka? Explain some of the replication tools available in Kafka.

Answer»

The Kafka replication tool is used to create a high-level design for the replica maintenance process. The following are some of the replication tools available:

  • Preferred Replica Leader Election Tool: Partitions are spread across many brokers in a cluster, each copy known as a replica; the leader is often referred to as the preferred replica. The brokers normally distribute the leader role evenly across the cluster for various partitions, but owing to failures, planned shutdowns, and other factors, an imbalance can develop over time. This tool can be used to restore the balance in these situations by reassigning the preferred replicas, and hence the leaders.
  • Topics tool: The Kafka topics tool is in charge of all administration operations relating to topics, including:
    • Listing and describing topics.
    • Creating topics.
    • Modifying topics.
    • Adding partitions to a topic.
    • Deleting topics.
  • Tool to reassign partitions: The replicas assigned to a partition can be changed with this tool, i.e., followers can be added to or removed from a partition.
  • StateChangeLogMerger tool: The StateChangeLogMerger tool collects data from the brokers in a cluster, formats it into a central log, and aids in the troubleshooting of state change issues. Sometimes there are issues with the election of a leader for a particular partition; this tool can be used to figure out what is causing the problem.
  • Change topic configuration tool: Used to add new configuration options, modify existing ones, and delete configuration options.
22.

What do you mean by multi-tenancy in Kafka?

Answer»

Multi-tenancy is a software operation mode in which multiple instances of one or more programs operate independently of one another in a shared environment. The instances are physically separate yet logically connected. The level of logical isolation in a system that supports multi-tenancy must be comprehensive, but the level of physical integration can vary. Kafka is multi-tenant because it allows many topics to be configured for data consumption and production on the same cluster.

Conclusion:

In this article, we discussed the most frequently asked interview questions on Kafka. It should be clear why Kafka is such an effective streaming platform. Kafka is a useful solution for scenarios that require real-time data processing, application activity tracking, and monitoring. At the same time, Kafka should not be utilized for on-the-fly data conversions, data storage, or when a simple task queue is all that is required.