1. What are the file formats supported by Hadoop? Briefly describe each.

Answer:

Below are the file formats supported by Hadoop:

  1. Text (e.g., CSV (comma-separated values) and TSV (tab-separated values))
  2. JSON (JavaScript Object Notation)
  3. AVRO
  4. RC (Record Columnar)
  5. ORC (Optimized Row Columnar)
  6. Parquet

  • Text file format (e.g., CSV (comma-separated values) and TSV (tab-separated values))

Text formats were very common before Hadoop, and they remain common in Hadoop environments as well. Data is presented as lines, with each line terminated by a newline character (\n); fields within a line may be separated by a delimiter such as a tab (\t).

CSV stands for comma-separated values, so data fields are separated (delimited) by commas. For example, suppose we have the following row in a spreadsheet:

Name    Class    Section    Subject
Bibhu   7        A          English

The above data would appear in a CSV-formatted file as follows:

Bibhu,7,A,English
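To make this concrete, here is a minimal Hive DDL sketch for a table backed by such comma-delimited text data (the table and column names are illustrative, not part of the original answer):

CREATE TABLE school_csv (
  name    STRING,   -- student name
  class   INT,      -- class number
  section STRING,   -- section letter
  subject STRING    -- subject name
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','   -- fields delimited by commas, as in CSV
STORED AS TEXTFILE;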

  • JSON 

JSON stands for JavaScript Object Notation. It is a human-readable format for structuring data, used primarily to transfer data from a server to a web application. It can be used as an alternative to XML. In JSON, data is represented as key/value pairs. The key is always a string, enclosed in quotation marks; the value can be a string, number, Boolean, array, or object.

The basic syntax is a key, followed by a colon, followed by a value.
Example: "Name" : "Bibhu"
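Extending that single pair into a complete JSON record (the extra fields are illustrative, reusing the spreadsheet example above):

{
  "Name": "Bibhu",
  "Class": 7,
  "Section": "A",
  "Subject": "English"
}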

  • AVRO

AVRO stores its schema in JSON format, which is easy to read and understand. The data itself is stored in a binary format, which makes it compact and efficient; each value is stored without any metadata other than a small schema identifier of 1 to 4 bytes. Avro is able to split a large data set into subsets, which makes it very well suited to MapReduce processing.

In Hive, the following command is used to create a table stored as Avro:

CREATE TABLE avro_school
  (column_specs)
STORED AS AVRO;
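A filled-in version might look like the following sketch (the column names are illustrative; recent Hive versions, 0.14 and later, support STORED AS AVRO directly):

CREATE TABLE avro_school (
  name  STRING,   -- illustrative column
  class INT       -- illustrative column
)
STORED AS AVRO;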

  • RC

RC stands for Record Columnar, which is one type of binary file format. It provides high compression on top of rows, or on multiple rows at a time, on which we want to perform some operation. RC files consist of binary key/value pairs. The RC file format first partitions the rows horizontally into row splits, and then stores each row split vertically, in a columnar way. Please find the example below:

Step 1
First, partition the rows horizontally into row splits (here, for illustration, two splits of two rows each):

Row split 1:
501  502  503  504
505  506  507  508

Row split 2:
509  510  511  512
513  514  515  516

Step 2
Within each row split, store the values vertically, column by column:

Row split 1 (columnar): 501 505 | 502 506 | 503 507 | 504 508
Row split 2 (columnar): 509 513 | 510 514 | 511 515 | 512 516

The RC file format combines multiple functions, such as data storage formatting, data compression, and data access optimization. It is able to meet all four of the below requirements of data storage (a Hive example follows the list):

  1. Fast data storing
  2. Improved query processing
  3. Optimized storage space utilization
  4. Dynamic data access patterns
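As with the other formats, Hive can create an RC-backed table directly; a minimal sketch (the table and column names are illustrative):

CREATE TABLE rc_school (
  name  STRING,   -- illustrative column
  class INT       -- illustrative column
)
STORED AS RCFILE;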
  • ORC (Optimized Row Columnar)

The ORC file provides a more efficient way to store relational data than the RC file, reducing the storage requirement by up to 75% of the original. Compared to the RC file, the ORC file takes less time to access the data and less space to store it. It internally divides the data again, into stripes with a default size of 250 MB.

In Hive, the following commands are used to work with the ORC file format:

CREATE TABLE ... STORED AS ORC;
ALTER TABLE ... SET FILEFORMAT ORC;
SET hive.default.fileformat=ORC;
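A filled-in version of the first command might look like this sketch (the table and column names are illustrative):

CREATE TABLE orc_school (
  name  STRING,   -- illustrative column
  class INT       -- illustrative column
)
STORED AS ORC;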

  • Parquet

Parquet is another column-oriented storage format, like the RC and ORC formats, but it is very good at handling nested data and at query scans over a particular column in a table. In Parquet, new columns can be added at the end of the structure. It handles compression using codecs such as Snappy and gzip; Snappy is currently the default. Parquet is supported by Cloudera and optimized for Cloudera Impala.

Hive Parquet file format example:

CREATE TABLE parquet_school_table
  (column_specs)
STORED AS PARQUET;
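Since the section mentions Snappy and gzip compression, the following sketch shows one common way to pick the codec explicitly via a table property (the parquet.compression property and the column names are assumptions for illustration, not from the original answer):

CREATE TABLE parquet_school (
  name  STRING,   -- illustrative column
  class INT       -- illustrative column
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');   -- assumed property; selects the Snappy codec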


