1.

In your MapReduce job you consistently see that map tasks on your cluster are running slowly because of excessive JVM garbage collection. How do you increase the JVM heap size property to 3 GB to optimize performance?

Answer»
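On MRv2/YARN, the map-task heap is typically raised through the mapreduce.map.java.opts property (on classic MRv1 the equivalent is mapred.child.java.opts), keeping the container size mapreduce.map.memory.mb above the heap. A minimal sketch in a job driver, assuming a Hadoop 2+ cluster and a hypothetical job name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class HeapSizedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Raise the map-task JVM heap to 3 GB (MRv2/YARN property name)
        conf.set("mapreduce.map.java.opts", "-Xmx3072m");
        // Keep the YARN container above the heap, e.g. 3.5 GB (assumed value)
        conf.setInt("mapreduce.map.memory.mb", 3584);
        Job job = Job.getInstance(conf, "heap-sized-job");
        // ... configure mapper, reducer, input and output paths as usual ...
        // job.waitForCompletion(true);
    }
}

The same properties can also be set cluster-wide in mapred-site.xml or per run via -D options on the hadoop jar command line.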

Hive works on structured data and provides a SQL-like layer on top of HDFS; a MapReduce job is executed for each Hive query that computes over HDFS data. Impala is a massively parallel processing (MPP) SQL query engine capable of handling huge volumes of data. Impala is faster than Hive because it does not store intermediate query results on disk; it processes SQL queries in memory without running MapReduce.

Below are the main Hive components:

  • Hive Clients
  • Hive Services

1. Hive Clients:

Hive clients are how users connect to Hive to run queries. There are three types of clients:

  • Thrift Clients
  • JDBC Clients
  • ODBC Clients
    • Thrift Clients: Apache Thrift is a cross-language RPC framework that connects clients and servers. Apache Hive uses Thrift to allow remote users to connect to HiveServer2 (the Thrift server) and submit queries. Thrift bindings exist for different languages such as C++, Java, and Python, so users can query the same source from different languages.
    • JDBC Clients: Apache Hive allows Java applications to connect to Hive using the JDBC driver, defined by the class org.apache.hive.jdbc.HiveDriver for HiveServer2 (the older HiveServer1 driver was org.apache.hadoop.hive.jdbc.HiveDriver); a minimal sketch follows this list.
    • ODBC Clients: The ODBC driver allows applications that support the Open Database Connectivity protocol to connect to Hive.
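A minimal sketch of the JDBC route, assuming a HiveServer2 instance on a hypothetical host hive-host at the default port 10000 (the username and empty password are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // hive-host and the credentials are placeholders
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        // Run a simple statement and print each returned row
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}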

2. Hive Services

Apache Hive provides the services below.

  • CLI (Command Line Interface): This is the default Hive shell, where users can run queries and commands directly.
  • Web Interface: Hive also provides a web-based GUI for executing Hive queries and commands, for example HUE in Cloudera.
  • Hive Server/Thrift Server: Different clients submit their requests to Hive through it and get the results back.
  • Hive Driver: Once queries are submitted from Thrift/JDBC/ODBC/CLI/Web UI, the driver receives them and passes them through a compiler, an optimizer, and an executor.

The compiler verifies the query syntax with the help of the schema present in the metastore; the optimizer then generates an optimized logical plan in the form of a Directed Acyclic Graph (DAG) of MapReduce and HDFS tasks. The executor runs these tasks after the compilation and optimization steps, interacting directly with the Hadoop JobTracker to schedule them.
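To see the plan these components produce, a query can be prefixed with EXPLAIN in the Hive shell; the output shows the stage DAG generated by the optimizer (the table name below is hypothetical):

Example : EXPLAIN SELECT COUNT(*) FROM orders;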

  • Metastore: The metastore is the central repository of Hive metadata, such as table schemas and data locations.

Impala has three main components: 1. Impala Daemon (Impalad), 2. Impala Statestore, 3. Impala Catalog Service.

  • Impala Daemon (Impalad): The Impala daemon runs on every node where Impala is installed. Impalad accepts queries from various interfaces such as impala-shell and Hue, and processes them to produce the result.

Whenever a query is submitted to any Impala daemon, that node becomes the "central coordinator node" for the query. After accepting the query, the Impalad logically divides it into smaller parallel fragments and distributes them to the other nodes in the Impala cluster. All the Impalads send their intermediate results back to the central coordinator node, which constructs the final query output from them.

  • Impala Statestore: Impala daemons continuously communicate with the Statestore to identify which nodes are healthy and capable of accepting new work; the Statestore conveys this information back to the daemons. If any node fails for any reason, the Statestore notifies all other nodes of the failure, and once this notification reaches the other Impalads, no Impala daemon assigns any further queries to the affected node.
  • Impala Catalog Service: The Catalog Service relays metadata changes made through Impala SQL statements to all the Impala daemons in the cluster. Impala uses the data stored in Hive, so it refers to the Hive Metastore to reach the databases and tables created in Hive. When a table is created through the Hive shell, before it is available for Impala queries we need to invalidate the metadata so that Impala reloads the corresponding metadata before the query is processed.

Example : INVALIDATE METADATA [[db_name.]table_name];

REFRESH [db_name.]table_name;
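For instance, after creating a hypothetical table sales_db.orders through the Hive shell, you would run INVALIDATE METADATA sales_db.orders; in impala-shell before querying it. REFRESH sales_db.orders; is the lighter-weight statement to use when only new data files have been added to an already-known table.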


