InterviewSolution

1. How do the HIVE database and IMPALA work together in CLOUDERA?
Answer» Most Hive tables contain millions or even billions of records, and any computation over them runs through mappers and reducers, which consumes significant time and memory. The following optimization techniques help Hive queries perform better.

1. Use Tez to speed up execution: Apache Tez is an execution engine that runs queries faster than the default MapReduce engine. Tez allows a single ApplicationMaster per session to serve multiple jobs, provided the jobs are small enough to reuse that Tez memory. Switch the processing engine from the default MapReduce to Tez with:

set hive.execution.engine=tez;

If you are using Cloudera/Hortonworks, you will also find a Tez option in the Hive query editor.

2. Enable compression in Hive: Compression reduces the size of the data being transferred, so it cuts down the data moved between mappers and reducers. It is not advisable if your data is already compressed, because the output file may end up larger than the original. For the best result, enable compression on the mapper and reducer sides separately. Among the many available formats, gzip uses more CPU than Snappy or LZO but gives a higher compression ratio; note that gzip output is not splittable. Other formats include Snappy, LZO, and bzip2. You can enable compression of map output with:

set mapred.compress.map.output = true;

You can also set the following properties in hive-site.xml and mapred-site.xml for a permanent effect (on YARN/MR2 the equivalent properties are mapreduce.map.output.compress and mapreduce.map.output.compress.codec):

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
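As an illustrative sketch, a Hive session that compresses both the intermediate (shuffle) data and the final query output with Snappy might combine the Hive-level switches with the Hadoop codec properties like this (property names are the stock Hive/Hadoop 2 ones; verify against your distribution's defaults):

```sql
-- Compress intermediate map output to reduce shuffle traffic
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final output written by the job
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```

Snappy is chosen here for its low CPU cost; swap in a gzip or LZO codec if a higher compression ratio matters more than speed.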
3. Use the ORC file format: ORC (Optimized Row Columnar) is a well-suited format for Hive performance tuning; query performance improves easily with ORC. You can use the ORC format for any kind of table, partitioned or not, and in return you get faster computation and a smaller, compressed file size.
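A minimal sketch of an ORC-backed table (the table and column names here are made up for illustration):

```sql
-- Hypothetical table stored as ORC; the orc.compress table property
-- selects the internal compression codec (ZLIB is the common default).
CREATE TABLE employee_orc (
  id     INT,
  name   STRING,
  salary DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- An existing text-format table can be migrated by inserting into it:
-- INSERT OVERWRITE TABLE employee_orc SELECT * FROM employee_text;
```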
4. Optimize your joins: If your tables hold large amounts of data, it is not advisable to rely on the plain joins we use in SQL. Hive offers other join strategies, such as map joins and bucket joins, that can improve query performance.

5. Use a map join: A map join is beneficial when one of the joined tables is small compared to the other, so that it can fit into memory. Hive has a property that converts a common join into a map join automatically when enabled; set it from the command line or in hive-site.xml:

set hive.auto.convert.join=true;

<property>
  <name>hive.auto.convert.join</name>
  <value>true</value>
  <description>Whether Hive enables the optimization about converting common join into mapjoin based on the input file size</description>
</property>

6. Bucketed map join: If the tables are bucketed on a particular column, you can use a bucketed map join to improve query performance. Set the following two properties to enable it:

<property>
  <name>hive.optimize.bucketmapjoin</name>
  <value>true</value>
  <description>Whether to try bucket mapjoin</description>
</property>
<property>
  <name>hive.optimize.bucketmapjoin.sortedmerge</name>
  <value>true</value>
  <description>Whether to try sorted bucket merge map join</description>
</property>

7. Use partitions: Partitioning is always helpful for huge data sets. It segregates a large table on certain columns so that the data is divided into small chunks; each partition is stored as a sub-directory under the table's directory. Choosing the partition columns is a critical decision, and you need to consider future data and data volume as well. For example, if you have location-based data, you can partition the table by state.
You can also partition the data month-wise; define the partition columns based on your requirement. Here is the syntax to create a partitioned table:

CREATE TABLE countrydata_partition (
  id INT,
  countryname STRING,
  population INT,
  description STRING
)
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

There are two types of partition in Hive.
By default, partitioning in Hive is static. With static partitioning you declare the partition columns up front, as in " PARTITIONED BY (department String) ", and name the target partition explicitly when loading; static partitioning is preferred when loading big files into Hive. A single INSERT that loads a partitioned table from a non-partitioned table is known as a dynamic partition insert: Hive derives the partition values from the data itself. Dynamic partitioning is also suitable when you do not know in advance how many partition values your data contains. To use dynamic partitioning in Hive, you need to enable it with Hive's dynamic-partition properties.
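The standard dynamic-partition settings can be enabled per session like this (the staging-table name in the commented example is hypothetical):

```sql
-- Allow dynamic partitioning at all
SET hive.exec.dynamic.partition=true;
-- "nonstrict" lets every partition column be derived dynamically;
-- the default "strict" mode requires at least one static partition value.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Illustrative load: Hive derives country/state per row from the SELECT.
-- INSERT OVERWRITE TABLE countrydata_partition PARTITION (country, state)
-- SELECT id, countryname, population, description, country, state
-- FROM countrydata_staging;
```

Note that the dynamically derived partition columns must come last in the SELECT list, in the same order as in the PARTITION clause.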
8. Use vectorization: A standard query is executed one row at a time. Vectorized query execution instead processes blocks of 1,024 rows at a time, which improves the performance of operations such as scans, aggregations, filters, and joins. You can enable vectorization with the parameter below.
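Vectorization is controlled by the stock Hive setting below (in older Hive releases it only takes effect for tables stored in ORC format):

```sql
-- Turn on vectorized execution: operators work on batches of 1,024 rows
SET hive.vectorized.execution.enabled=true;
-- Optionally also vectorize the reduce side (supported in newer Hive versions)
SET hive.vectorized.execution.reduce.enabled=true;
```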