
Explain the Job or Application ID. How does the Job History Server handle job details? Briefly describe logging and log files.

Answer:

After a job is submitted, a Job ID is generated by the JobTracker in Hadoop 1; in Hadoop 2/3, an Application ID is generated by the ResourceManager instead. The Application ID or Job ID is a globally unique identifier for an application or job.

Example: job_1410450250506_0002 / application_1410450250506_0002
1410450250506 ==> the start time of the ResourceManager, known as the "cluster timestamp"
0002 ==> a counter maintained by the ResourceManager to keep track of the order in which applications were submitted

Task IDs are formed by replacing the job (or application) prefix of the ID with a task prefix and appending a suffix that identifies the task within the job.
Example: task_1410450250506_0002_m_000002

In the example above, _m_000002 identifies the third map task of the job "job_1410450250506_0002" (task numbering starts at zero, and m marks a map task).
A task may be executed more than once, for example after a task failure, so to identify different instances of task execution, each task attempt is given its own unique ID.

Example: attempt_1410450250506_0002_m_000002_0
The trailing _0 marks the first attempt of the task task_1410450250506_0002_m_000002 (attempt numbering also starts at zero).
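
Hadoop ships ID classes that parse and compose these identifiers, which is handy when you script against job or task IDs. Here is a minimal sketch using the org.apache.hadoop.mapreduce ID classes with the example IDs from above:

    import org.apache.hadoop.mapreduce.JobID;
    import org.apache.hadoop.mapreduce.TaskAttemptID;
    import org.apache.hadoop.mapreduce.TaskID;

    public class IdParsing {
        public static void main(String[] args) {
            // Parse a job ID: cluster timestamp + per-cluster counter.
            JobID job = JobID.forName("job_1410450250506_0002");
            System.out.println(job.getJtIdentifier()); // 1410450250506 (cluster timestamp)
            System.out.println(job.getId());           // 2 (second job on this cluster)

            // Parse a task attempt ID: job ID + task type + task number + attempt number.
            TaskAttemptID attempt =
                TaskAttemptID.forName("attempt_1410450250506_0002_m_000002_0");
            TaskID task = attempt.getTaskID();
            System.out.println(task.getTaskType()); // MAP
            System.out.println(task.getId());       // 2, i.e. the third map task (numbering starts at 0)
            System.out.println(attempt.getId());    // 0, i.e. the first attempt
        }
    }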

When you open the Job History web UI, you can see the job state, that is, whether the job succeeded or failed, how many mappers and reducers were launched, and whether all of the mappers and reducers completed.

Job History Server:

When you click a job ID in the Job History Server, you get an overview page for that job with much the same information as above.

Hadoop Counters:

Counters are the most useful option for examining job performance. Hadoop provides several built-in counters, and you can also define custom counters for your own requirements (a minimal sketch of a custom counter follows the list below). Counters help you check things such as:

  • Whether the correct number of mappers and reducers were launched and completed
  • Whether the expected number of input bytes were read and the expected number of output bytes were written
  • Whether the correct number of records were read and written, both on the local file system and on HDFS
  • Whether CPU usage and memory consumption are appropriate for the job
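
A custom counter is just a Java enum whose constants you increment through the task context; the enum becomes a counter group of its own. The sketch below is illustrative only: the Quality enum and the comma-separated record check are made-up examples, not part of Hadoop's API.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // User-defined counter group: each constant becomes a counter.
        enum Quality { GOOD_RECORDS, MALFORMED_RECORDS }

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 2) {
                // Count the bad record instead of failing the task.
                context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
                return;
            }
            context.getCounter(Quality.GOOD_RECORDS).increment(1);
            context.write(new Text(fields[0]), ONE);
        }
    }

The counter values show up in the Job History UI and in the job's counter dump alongside the built-in groups described below.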

Hadoop provides three groups of built-in counters:

  1. File system counters
  2. Job Counters
  3. MapReduce framework counters

In addition to these, Hadoop provides three more counter groups by default:

  1. Shuffle error counters
  2. File input format counters 
  3. File output format counters.

File system counters:

The file system counters give you information about read and write operations on both the local file system and HDFS. The total number of bytes read and written depends on the compression algorithm in use. Here are a few key counters:

  • FILE_BYTES_READ: The total number of bytes read from the local file system by the MapReduce tasks.
  • FILE_BYTES_WRITTEN: The total number of bytes written to the local file system. During the map phase, map tasks write their intermediate results to the local file system; during the shuffle phase, reduce tasks also write to the local file system when they spill intermediate results while sorting.
  • HDFS_BYTES_READ: The total number of bytes read from HDFS.
  • HDFS_BYTES_WRITTEN: The total number of bytes written to HDFS.
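
To read these values from driver code after a job completes, Hadoop 2+ exposes a per-file-system-scheme lookup on the Counters object. A sketch, assuming you already hold a finished org.apache.hadoop.mapreduce.Job handle (the printFsCounters helper name is ours):

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.FileSystemCounter;
    import org.apache.hadoop.mapreduce.Job;

    public class FsCounterReport {
        // Prints local (FILE) and HDFS byte counters for a completed job.
        static void printFsCounters(Job job) throws Exception {
            Counters counters = job.getCounters();
            long fileRead  = counters.findCounter("FILE", FileSystemCounter.BYTES_READ).getValue();
            long fileWrite = counters.findCounter("FILE", FileSystemCounter.BYTES_WRITTEN).getValue();
            long hdfsRead  = counters.findCounter("HDFS", FileSystemCounter.BYTES_READ).getValue();
            long hdfsWrite = counters.findCounter("HDFS", FileSystemCounter.BYTES_WRITTEN).getValue();
            System.out.printf("FILE bytes read/written: %d / %d%n", fileRead, fileWrite);
            System.out.printf("HDFS bytes read/written: %d / %d%n", hdfsRead, hdfsWrite);
        }
    }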

Job Counters:

Under the job counters you get job-level information about the mappers and reducers. The following are the key job counters.

  • DATA_LOCAL_MAPS: The number of map tasks that ran on the same node as their input data (data-local execution).
  • TOTAL_LAUNCHED_MAPS: The total number of launched map tasks, including failed ones. It is basically the same as the number of input splits for the job.
  • TOTAL_LAUNCHED_REDUCES: The total number of reduce tasks launched for the job.
  • NUM_KILLED_MAPS: Number of killed map tasks.
  • NUM_KILLED_REDUCES: Number of killed reduce tasks.
  • MILLIS_MAPS: The total time (in milliseconds) spent by all map tasks running for the job.
  • MILLIS_REDUCES: The total time (in milliseconds) spent by all reduce tasks running for the job.
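
These counters are exposed as the org.apache.hadoop.mapreduce.JobCounter enum, so you can check them from driver code once the job has finished. A short sketch (the printJobCounters helper name is ours):

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;

    public class JobCounterReport {
        // Reads built-in job counters from a completed job.
        static void printJobCounters(Job job) throws Exception {
            Counters c = job.getCounters();
            System.out.println("Launched maps:    " + c.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue());
            System.out.println("Launched reduces: " + c.findCounter(JobCounter.TOTAL_LAUNCHED_REDUCES).getValue());
            System.out.println("Data-local maps:  " + c.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue());
            System.out.println("Map time (ms):    " + c.findCounter(JobCounter.MILLIS_MAPS).getValue());
        }
    }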

MapReduce Framework counters:

You will find all the statistics of a MapReduce job under the MapReduce framework counters. They will help you with performance tuning of the job.

  • MAP_INPUT_RECORDS: The total number of input records read during the map phase.
  • MAP_OUTPUT_RECORDS: The total number of records emitted by the map tasks.
  • CPU_MILLISECONDS: CPU time spent by all the tasks.
  • GC_TIME_MILLIS: The total time the task JVMs spent in garbage collection, i.e. automatically reclaiming memory that is no longer in use at run time.
  • PHYSICAL_MEMORY_BYTES: Total physical memory used by all tasks.
  • REDUCE_SHUFFLE_BYTES: The total number of output bytes copied from map tasks to reduce tasks during the shuffle phase.
  • SPILLED_RECORDS: The total number of records spilled to disk across all map and reduce tasks.
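
The framework counters are available through the org.apache.hadoop.mapreduce.TaskCounter enum. The sketch below pulls out the spill-related values, since a SPILLED_RECORDS count far above MAP_OUTPUT_RECORDS is a common sign that the map-side sort buffer (mapreduce.task.io.sort.mb) is too small (the helper name is ours):

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class FrameworkCounterReport {
        // Framework counters useful for spotting spill and GC overhead.
        static void printFrameworkCounters(Job job) throws Exception {
            Counters c = job.getCounters();
            long inputRecords  = c.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
            long outputRecords = c.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
            long spilled       = c.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
            long gcMillis      = c.findCounter(TaskCounter.GC_TIME_MILLIS).getValue();
            System.out.printf("Map records in/out: %d / %d%n", inputRecords, outputRecords);
            System.out.printf("Spilled records: %d, GC time: %d ms%n", spilled, gcMillis);
        }
    }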

Other counters are as follows:

  1. Shuffle error counters: Details of errors during the shuffle phase, such as BAD_ID, IO_ERROR, WRONG_MAP, and WRONG_REDUCE.
  2. File input format counters: The bytes read by each map task through its input format.
  3. File output format counters: The bytes written by each map and reduce task through its output format.
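
The format counters also have enum handles: FileInputFormatCounter in org.apache.hadoop.mapreduce.lib.input and FileOutputFormatCounter in org.apache.hadoop.mapreduce.lib.output. A short sketch, again assuming a completed Job handle (the helper name is ours):

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter;

    public class FormatCounterReport {
        // Bytes seen by the input format vs. written by the output format.
        static void printFormatCounters(Job job) throws Exception {
            Counters c = job.getCounters();
            System.out.println("Input format bytes read:     " + c.findCounter(FileInputFormatCounter.BYTES_READ).getValue());
            System.out.println("Output format bytes written: " + c.findCounter(FileOutputFormatCounter.BYTES_WRITTEN).getValue());
        }
    }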

