1.

If Reducers Do Not Start Before All Mappers Finish, Then Why Does The Progress On A MapReduce Job Show Something Like Map(50%) Reduce(10%)? Why Is The Reducer Progress Percentage Displayed When The Mappers Have Not Finished Yet?

Answer»

Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account the data transfer performed by the reduce process, so the reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer. Although the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.

2.

When Are The Reducers Started In A MapReduce Job?

Answer»

In a MapReduce job, reducers do not start executing the reduce method until all the map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer-defined reduce method is called only after all the mappers have finished.

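The point at which reducers are launched and begin copying is tunable. A minimal sketch, assuming the mapreduce.job.reduce.slowstart.completedmaps property (older releases call it mapred.reduce.slowstart.completed.maps); the class name is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerSlowstartConfig {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        // Do not launch reducers (and start the copy phase) until 80% of
        // the map tasks have completed.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
        return Job.getInstance(conf, "reducer slowstart example");
    }
}
```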

3.

What Are IdentityMapper And IdentityReducer In MapReduce?

Answer»
  • org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value.
  • org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value (see the sketch below).
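
A minimal configuration sketch for the older org.apache.hadoop.mapred API described above; the class name IdentityJobConfig is illustrative, and setting the identity classes explicitly has the same effect as omitting the calls:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityJobConfig {
    public static JobConf configure() {
        JobConf conf = new JobConf(IdentityJobConfig.class);
        // These two calls only restate the defaults: if setMapperClass and
        // setReducerClass are never invoked, the framework falls back to
        // IdentityMapper and IdentityReducer, so input records pass through
        // the job unchanged.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        return conf;
    }
}
```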

4.

What Are The Writable & WritableComparable Interfaces?

Answer»
  • org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically provide a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.
  • org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using comparators (see the sketch below).
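
As a sketch of what implementing these interfaces looks like, here is a hypothetical key type (YearKey is an illustrative name, not a Hadoop class):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public YearKey() { }                       // no-arg constructor required by the framework
    public YearKey(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                    // serialize the field
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                   // deserialize the field
    }

    @Override
    public int compareTo(YearKey other) {      // ordering used during the sort phase
        return Integer.compare(year, other.year);
    }

    // Conventional static read method mentioned in the answer above.
    public static YearKey read(DataInput in) throws IOException {
        YearKey key = new YearKey();
        key.readFields(in);
        return key;
    }
}
```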

5.

What Are Combiners? When Should I Use A Combiner In My Mapreduce Job?

Answer»

Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally, on each mapper's output, which reduces the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.

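As a sketch of the commutative-and-associative case, here is a word-count job that reuses its reducer as the combiner; all class names are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountWithCombiner {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Summing is commutative and associative, so the same class can serve
    // as both the combiner and the reducer.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static Job buildJob(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // Hadoop may run this zero or more times
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job;
    }
}
```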

6.

Where Is The Mapper Output (Intermediate Key-Value Data) Stored?

Answer»

The mapper output (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

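A minimal sketch of inspecting that location from a client configuration; the property name mapred.local.dir is the classic Hadoop 1.x name (later releases use mapreduce.cluster.local.dir), and the class name is illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class LocalDirInspector {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Comma-separated list of local-disk directories on each slave node
        // where intermediate map output is written.
        String localDirs = conf.get("mapred.local.dir", "${hadoop.tmp.dir}/mapred/local");
        System.out.println("Intermediate map output directories: " + localDirs);
    }
}
```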

7.

Can I Set The Number Of Reducers To Zero?

Answer»

Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the number of reducers to zero, no reducers will be executed, and the output of each mapper will be stored in a separate file on HDFS. [This is different from the condition when reducers are set to a number greater than zero, and the mapper output (intermediate data) is written to the local file system (not HDFS) of each mapper slave node.]

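A minimal sketch of a map-only job, assuming the newer org.apache.hadoop.mapreduce API; the class name is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJobConfig {
    public static Job buildJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "map-only job");
        // With zero reducers the shuffle/sort phase is skipped and each
        // mapper's output is written directly to a part file on HDFS.
        job.setNumReduceTasks(0);
        return job;
    }
}
```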

8.

Does Mapreduce Programming Model Provide A Way For Reducers To Communicate With Each Other? In A Mapreduce Job Can A Reducer Communicate With Another Reducer?

Answer»

No, the MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.

9.

How Namenode Handles Data Node Failures?

Answer»

The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a data node after a certain amount of time, the data node is marked as dead. Since its blocks will then be under-replicated, the system begins replicating the blocks that were stored on the dead datanode. The NameNode orchestrates the replication of data blocks from one datanode to another. The replication data transfer happens directly between datanodes, and the data never passes through the NameNode.

10.

What Is The Difference Between Hdfs And Nas ?

Answer»

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.

Following are the differences between HDFS and NAS:

  • In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware.
  • HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computations.
  • HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.

11.

What Is The Configuration Of A Typical Slave Node On A Hadoop Cluster? How Many JVMs Run On A Slave Node?

Answer»
  • A single instance of a TaskTracker is run on each slave node. The TaskTracker is run as a separate JVM process.
  • A single instance of a DataNode daemon is run on each slave node. The DataNode daemon is run as a separate JVM process.
  • One or multiple Task Instances are run on each slave node. Each task instance is run as a separate JVM process. The number of task instances can be controlled by configuration. Typically a high-end machine is configured to run more task instances.

12.

How Many Daemon Processes Run On A Hadoop System?

Answer»

Hadoop comprises five separate daemons, each of which runs in its own JVM. The following three daemons run on the master nodes:

NameNode: This daemon stores and maintains the metadata for HDFS.

Secondary NameNode: Performs housekeeping functions for the NameNode.

JobTracker: Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker.

The following two daemons run on each slave node:

DataNode: Stores actual HDFS data blocks.

TaskTracker: Responsible for instantiating and monitoring individual Map and Reduce tasks.

13.

What Is A Task Instance In Hadoop? Where Does It Run?

Answer»

Task instances are the actual map and reduce tasks which are run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. Each Task Instance runs in its own JVM process. There can be multiple task instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default a new task instance JVM process is spawned for each task.

14.

What Is A Task Tracker In Hadoop? How Many Instances Of TaskTracker Run On A Hadoop Cluster?

Answer»

A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. Only one TaskTracker process runs on any Hadoop slave node, and it runs in its own JVM process. Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

15.

How Jobtracker Schedules A Task?

Answer»

The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

16.

How Can I Install Cloudera Vm In My System?

Answer»

When you enrol for the Hadoop course at Edureka, you can download the Hadoop Installation steps.pdf file from our dropbox.

17.

Can Hadoop Be Compared To Nosql Database Like Cassandra?

Answer»

Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database; it's a file system (HDFS) and a distributed programming framework (MapReduce).

18.

Why Is 'Reading' Done In Parallel And 'Writing' Is Not In HDFS?

Answer»

Reading is done in parallel because by doing so we can access the data quickly. But we do not perform the write operation in parallel, because doing so might result in data inconsistency. For example, if you have a file and two nodes are trying to write data into it in parallel, then the first node does not know what the second node has written and vice-versa. So it becomes unclear which data is to be stored and accessed.

19.

Which Are The Two Types Of 'writes' In Hdfs?

Answer»

There are two types of writes in HDFS: posted and non-posted writes. A posted write is when we write it and forget about it, without worrying about the acknowledgement; it is similar to our traditional Indian post. In a non-posted write, we wait for the acknowledgement; it is similar to today's courier services. Naturally, a non-posted write is more expensive than a posted write, though both writes are asynchronous.

20.

Is A Job Split Into Maps?

Answer»

No, a job is not split into maps. A split is created for the file. The file is placed on datanodes in blocks. For each split, a map is needed.

21.

Why Is The Number Of Splits Equal To The Number Of Maps?

Answer»

The number of maps is equal to the number of input splits because we want the key and value pairs of all the input splits.

22.

Do We Require Two Servers For The Namenode And The Datanodes?

Answer»

Yes, we need two different servers for the Namenode and the datanodes. This is because the Namenode requires a highly configured system, as it stores information about the location details of all the files stored in different datanodes, whereas datanodes require only low-configuration systems.

23.

Is Map Like A Pointer?

Answer»

No, map is not like a pointer.

24.

What Is The Difference Between Mapreduce Engine And Hdfs Cluster?

Answer»

The HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. The MapReduce Engine is the programming module which is used to retrieve and analyze data.

25.

What Is 'key Value Pair' In Hdfs?

Answer»

A key-value pair is the intermediate data generated by maps and sent to reducers for generating the final output.

26.

Can You Explain How Do 'map' And 'reduce' Work?

Answer»

The Namenode takes the input, divides it into parts and assigns them to data nodes. These datanodes process the tasks assigned to them, make key-value pairs and return the intermediate output to the reducer. The reducer collects these key-value pairs from all the datanodes, combines them and generates the final output.

27.

What Is The Difference Between Gen1 And Gen2 Hadoop With Regards To The Namenode?

Answer»

In Gen 1 Hadoop, the Namenode is the single point of failure. In Gen 2 Hadoop, we have what is known as an Active and Passive Namenode kind of structure. If the active Namenode fails, the passive Namenode takes over the charge.

28.

What Is A Secondary Namenode? Is It A Substitute To The Namenode?

Answer»

The secondary Namenode constantly reads the data from the RAM of the Namenode and writes it into the hard disk or the file system. It is not a substitute to the Namenode, so if the Namenode fails, the entire Hadoop system goes down.

29.

What If Rack 2 And Datanode Fails?

Answer»

If both rack 2 and the datanode present in rack 1 fail, then there is no chance of getting data from them. In order to avoid such situations, we need to replicate that data more times instead of replicating only thrice. This can be done by changing the value of the replication factor, which is set to 3 by default.

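A minimal sketch of changing the replication factor, either for files created by a client through the dfs.replication property (the cluster-wide default normally lives in hdfs-site.xml) or for an existing file through the FileSystem API; the path shown is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created by this client.
        conf.setInt("dfs.replication", 5);

        // Replication can also be raised for a file that already exists.
        FileSystem fs = FileSystem.get(conf);
        fs.setReplication(new Path("/data/critical/events.log"), (short) 5); // hypothetical path
        fs.close();
    }
}
```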

30.

Do We Need To Place 2nd And 3rd Data In Rack 2 Only?

Answer»

Yes, this is to avoid datanode failure.

31.

On What Basis Data Will Be Stored On A Rack?

Answer»

When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. The client then consults the Namenode and gets 3 datanodes for every block of the file, which indicates where each block should be stored. While placing the blocks on the datanodes, the key rule followed is "for every block of data, two copies will exist in one rack, the third copy in a different rack". This rule is known as the "Replica Placement Policy".

32.

What Is A Rack?

Answer»

A rack is a storage area with all the datanodes put together: a physical collection of datanodes stored at a single location. There can be multiple racks in a single location, and different racks can be physically located at different places.

33.

What Is The Communication Channel Between Client And Namenode/datanode?

Answer»

The mode of communication is SSH.

34.

Is Client The End User In Hdfs?

Answer»

No, the client is an application which runs on your machine and is used to interact with the Namenode (job tracker) or a datanode (task tracker).

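As a hedged illustration of such a client, here is a small program that uses the FileSystem API to ask the Namenode for a file's blocks and stream them from the datanodes; the path is hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client contacts the Namenode for block locations, then reads
        // the data directly from the datanodes that hold the blocks.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/input.txt"); // hypothetical path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```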

35.

Who Is A 'user' In Hdfs?

Answer»

A user is like you or me, who has some query or who needs some kind of data.

36.

Doesn't Google Have Its Very Own Version Of Dfs?

Answer»

Yes, Google owns a DFS known as the “Google File System (GFS)”, developed by Google Inc. for its own use.

37.

On What Basis Namenode Will Decide Which Datanode To Write On?

Answer»

As the Namenode has the metadata (information) related to all the data nodes, it knows which datanode is free.

38.

Does Hadoop Always Require Digital Data To Process?

Answer»

Yes. Hadoop always requires digital data to be processed.

39.

When We Send A Data To A Node, Do We Allow Settling In Time, Before Sending Another Data To That Node?

Answer»

Yes, we do.

40.

Are Job Tracker And Task Trackers Present In Separate Machines?

Answer»

Yes, the job tracker and task trackers are present on different machines. The reason is that the job tracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

41.

If Datanodes Increase, Then Do We Need To Upgrade Namenode?

Answer»

While installing the Hadoop system, the Namenode is determined based on the size of the cluster. Most of the time, we do not need to upgrade the Namenode because it does not store the actual data, but just the metadata, so such a requirement rarely arises.

42.

If A Data Node Is Full, How Is It Identified?

Answer»

When data is stored in a datanode, the metadata of that data will be stored in the Namenode. So the Namenode will identify if the data node is full.

43.

How Indexing Is Done In Hdfs?

Answer»

Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data, which will say where the next part of the data will be. In fact, this is the base of HDFS.

44.

If We Want To Copy 10 Blocks From One Machine To Another, But Another Machine Can Copy Only 8.5 Blocks, Can The Blocks Be Broken At The Time Of Replication?

Answer»

In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the Master node will figure out the actual amount of space required, how many blocks are being used and how much space is available, and it will allocate the blocks accordingly.

45.

What Are The Benefits Of Block Transfer?

Answer»

A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.

46.

If A Particular File Is 50 Mb, Will The Hdfs Block Still Consume 64 Mb As The Default Size?

Answer»

No, not at all! 64 MB is just a unit where the data will be stored. In this particular situation, only 50 MB will be consumed by an HDFS block and 14 MB will be free to store something else. It is the MasterNode that does data allocation in an efficient manner.

47.

What Is A 'block' In Hdfs?

Answer»

A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, particularly to minimize the cost of seeks.

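A minimal sketch of overriding the block size for files written by a client; dfs.blocksize is the newer property name (older releases use dfs.block.size), the class name is illustrative, and 128 MB is just an example value:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeConfig {
    public static Configuration withLargerBlocks() {
        Configuration conf = new Configuration();
        // Use 128 MB blocks instead of the 64 MB default mentioned above.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        return conf;
    }
}
```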

48.

Are Namenode And Job Tracker On The Same Host?

Answer»

No, in a practical environment, the Namenode is on a separate host and the job tracker is on a separate host.

49.

What Is A Heartbeat In Hdfs?

Answer»

A heartbeat is a signal indicating that a node is alive. A datanode sends a heartbeat to the Namenode, and a task tracker sends its heartbeat to the job tracker. If the Namenode or the job tracker does not receive a heartbeat, they will decide that there is some problem in the datanode, or that the task tracker is unable to perform the assigned task.

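The datanode heartbeat interval is configurable. A minimal sketch, assuming the dfs.heartbeat.interval property (in seconds, commonly defaulting to 3); the class name is illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How often (in seconds) each datanode sends a heartbeat to the Namenode.
        long interval = conf.getLong("dfs.heartbeat.interval", 3L);
        System.out.println("Datanode heartbeat interval: " + interval + "s");
    }
}
```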

50.

Is The Namenode Machine The Same As The Datanode Machine In Terms Of Hardware?

Answer»

It depends upon the cluster you are trying to create. The Hadoop VM can be there on the same machine or on another machine. For instance, in a single node cluster, there is only one machine, whereas in the development or in a testing environment, the Namenode and the data nodes are on different machines.
