
What is an InputFormat and Record Reader in Hadoop? What are the various Input Formats in Hadoop?

Answer:

A MapReduce job reads a file through an InputFormat, which defines how the input is split up and read. The InputFormat in turn defines a RecordReader, which is responsible for reading the actual records from the input files. Each split computed by the InputFormat is processed by one map task; the map task uses the RecordReader corresponding to that InputFormat to read the data within its split and turn it into key-value pairs.
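To make the split-to-key-value idea concrete, here is a minimal sketch in plain Java (no Hadoop dependency; the class name and method are hypothetical) that mimics what TextInputFormat's LineRecordReader does over one split: it walks the bytes of the split and emits (byte offset, line) pairs, which correspond to the LongWritable/Text pairs a real record reader would hand to the map task. It assumes single-byte characters and `\n` line endings for simplicity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineRecordReaderSketch {

    // Mimics LineRecordReader: returns offset -> line pairs for a split.
    // Simplification: assumes ASCII text, so 1 char == 1 byte, and '\n' endings.
    public static Map<Long, String> readRecords(String split) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : split.split("\n")) {
            records.put(offset, line);        // key = byte offset, value = line
            offset += line.length() + 1;      // +1 for the newline byte
        }
        return records;
    }

    public static void main(String[] args) {
        Map<Long, String> recs = readRecords("hello world\nhadoop mapreduce\n");
        recs.forEach((k, v) -> System.out.println(k + "\t" + v));
        // prints:
        // 0    hello world
        // 12   hadoop mapreduce
    }
}
```

The real LineRecordReader also handles lines that straddle split boundaries (the reader for one split reads past its end to finish the last line, and the next reader skips its first partial line); that logic is omitted here.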

The main types of InputFormat in Hadoop are:

  • FileInputFormat – Base class for all file-based input formats.
  • TextInputFormat – The default input format in a MapReduce job. It treats each line of the input as a separate record: the key is the byte offset of the line within the file and the value is the content of the line. LineRecordReader is its default record reader.
  • KeyValueTextInputFormat – Similar to TextInputFormat, but it breaks each line into a key and a value. The key is the text up to the first tab (\t) character, and everything after the tab is the value (the separator character is configurable).
  • SequenceFileInputFormat – An input format for reading sequence files, Hadoop's binary key-value file format.
  • NLineInputFormat – A variant of TextInputFormat in which each split contains a fixed, configurable number of lines (N), so every mapper receives exactly N lines of input.
  • DBInputFormat – An input format for reading data from a relational database using JDBC. It is, however, suited to reading small datasets only, since the parallel reads can put a heavy load on the database.
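As a small illustration of the KeyValueTextInputFormat behavior described above, here is a dependency-free Java sketch (the class name and method are hypothetical, not Hadoop API) that splits one input line at the first tab, the way that format's record reader does. When a line has no tab, the whole line becomes the key and the value is empty.

```java
import java.util.AbstractMap;
import java.util.Map;

public class KeyValueLineSketch {

    // Splits a line into (key, value) at the first tab character,
    // mimicking KeyValueTextInputFormat's record reader.
    public static Map.Entry<String, String> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No separator: entire line is the key, value is empty.
            return new AbstractMap.SimpleEntry<>(line, "");
        }
        return new AbstractMap.SimpleEntry<>(
                line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = parse("user42\tclicked ad");
        System.out.println("key=" + kv.getKey() + " value=" + kv.getValue());
        // prints: key=user42 value=clicked ad
    }
}
```

In a real job you would simply select the format in the driver with `job.setInputFormatClass(KeyValueTextInputFormat.class)` and let Hadoop's own record reader do this parsing.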

