InterviewSolution
This section offers curated multiple-choice questions to sharpen your knowledge and support exam preparation. Choose a topic below to get started.
51. Apache Hadoop Development Tools is an effort undergoing incubation at _________
(a) ADF (b) ASF (c) HCC (d) AFS

Answer» Right answer is (b) ASF. The explanation is: the effort is sponsored by the Apache Incubator PMC of the Apache Software Foundation (ASF).

52. __________ method tells LoadFunc which fields are required in the Pig script.
(a) pushProjection() (b) relativeToAbsolutePath() (c) prepareToRead() (d) none of the mentioned

Answer» Right choice is (a) pushProjection(). To explain: Pig uses the column index requiredField.index to communicate to the LoadFunc which fields are required by the Pig script.

53. _________ function is responsible for consolidating the results produced by each of the Map() functions/tasks.
(a) Reduce (b) Map (c) Reducer (d) All of the mentioned

Answer» The correct option is (a) Reduce. To explain I would say: the Reduce function collates the work and resolves the results.

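The map-then-consolidate flow behind this question can be sketched outside Hadoop. The following Python toy (an illustration, not Hadoop's actual Java API) shows per-split Map() outputs being grouped by key in a shuffle step and then consolidated by a single Reduce function:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # emit (word, 1) pairs, as a word-count mapper would
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # consolidate all values produced for one key
    return (key, sum(values))

splits = ["hadoop maps data", "hadoop reduces data"]
# shuffle phase: collect and sort all intermediate pairs by key
intermediate = sorted(
    (pair for line in splits for pair in map_fn(line)),
    key=itemgetter(0),
)
# reduce phase: one reduce_fn call per distinct key
result = dict(
    reduce_fn(k, [v for _, v in group])
    for k, group in groupby(intermediate, key=itemgetter(0))
)
print(result)  # {'data': 2, 'hadoop': 2, 'maps': 1, 'reduces': 1}
```

The sort before grouping stands in for Hadoop's shuffle, which guarantees each reduce call sees all values for its key.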
54. The output of the _______ is not sorted in the MapReduce framework for Hadoop.
(a) Mapper (b) Cascader (c) Scalding (d) None of the mentioned

Answer» Right choice is (d) None of the mentioned. Best explanation: the output of the reduce task is typically written to the FileSystem, and the output of the Reducer is not sorted.

55. GraphX provides an API for expressing graph computation that can model the __________ abstraction.
(a) GaAdt (b) Spark Core (c) Pregel (d) None of the mentioned

Answer» The correct option is (c) Pregel. To elaborate: GraphX exposes a Pregel-like API for iterative graph-parallel computation.

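The Pregel abstraction named here is vertex-centric: in each superstep, vertices receive messages, update their value, and send messages to their neighbors until no vertex changes. A minimal Python sketch of that loop (a toy, not GraphX's actual Scala API), propagating the maximum vertex value through a graph:

```python
def pregel_max(neighbors, values):
    # neighbors: vertex -> list of adjacent vertices (undirected here)
    # values: vertex -> initial integer value
    values = dict(values)
    active = set(values)            # vertices that changed last superstep
    while active:
        # each active vertex sends its current value to its neighbors
        inbox = {v: [] for v in values}
        for v in active:
            for n in neighbors[v]:
                inbox[n].append(values[v])
        # vertices update to the max of their value and incoming messages
        active = set()
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                active.add(v)
    return values

graph = {1: [2], 2: [1, 3], 3: [2]}
print(pregel_max(graph, {1: 5, 2: 1, 3: 9}))  # {1: 9, 2: 9, 3: 9}
```

The computation halts when every vertex votes to halt (here: when no value changed), which is exactly the Pregel termination rule.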
56. Point out the wrong statement.
(a) ConcurScheduler detects whether the index is on SSD or not
(b) Memory index supports payloads
(c) Auto-IO-throttling has been added to ConcurrentMergeScheduler, to rate limit IO writes for each merge depending on incoming merge rate
(d) The default codec has an option to control BEST_SPEED or BEST_COMPRESSION for stored fields

Answer» Right choice is (a). To explain: it is ConcurrentMergeScheduler, not "ConcurScheduler", that detects whether the index is on an SSD and does a better job defaulting its settings.

57. Spark is packaged with higher level libraries, including support for _________ queries.
(a) SQL (b) C (c) C++ (d) None of the mentioned

Answer» Right answer is (a) SQL. The explanation: standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.

58. The ___________ can also be used to distribute both jars and native libraries for use in the map and/or reduce tasks.
(a) DataCache (b) DistributedData (c) DistributedCache (d) All of the mentioned

Answer» Correct choice is (c) DistributedCache. To explain I would say: the child JVM always has its current working directory added to the java.library.path and LD_LIBRARY_PATH.

59. The _____________ can also be used to distribute both jars and native libraries for use in the map and/or reduce tasks.
(a) DistributedLog (b) DistributedCache (c) DistributedJars (d) None of the mentioned

Answer» Correct choice is (b) DistributedCache. Easiest explanation: cached libraries can be loaded via System.loadLibrary or System.load.

60. Through the ____________ method, the RecordReader associated with the InputFormat provided by the LoadFunc is passed to the LoadFunc.
(a) getNext() (b) relativeToAbsolutePath() (c) prepareToRead() (d) all of the mentioned

Answer» The correct choice is (c) prepareToRead(). The best explanation: the RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to Pig.

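The loader lifecycle touched on in questions 52 and 60 — pushProjection() telling the loader which columns the script needs, prepareToRead() handing it a record reader, getNext() returning one tuple at a time — can be mimicked in Python. This is an illustrative stand-in, not Pig's actual Java LoadFunc interface:

```python
class ToyLoadFunc:
    """Mimics Pig's LoadFunc call sequence: pushProjection -> prepareToRead -> getNext."""

    def __init__(self):
        self.required = None   # column indexes requested by the script
        self.reader = None     # record reader supplied by the framework

    def pushProjection(self, field_indexes):
        # the framework tells the loader which fields the script actually uses
        self.required = field_indexes

    def prepareToRead(self, reader):
        # the framework passes in the record reader for the current split
        self.reader = reader

    def getNext(self):
        # return the next tuple, trimmed to the projected fields, or None at EOF
        try:
            record = next(self.reader)
        except StopIteration:
            return None
        fields = record.split(",")
        if self.required is not None:
            fields = [fields[i] for i in self.required]
        return tuple(fields)

loader = ToyLoadFunc()
loader.pushProjection([0, 2])                  # script only needs columns 0 and 2
loader.prepareToRead(iter(["a,b,c", "d,e,f"]))
print(loader.getNext(), loader.getNext())      # ('a', 'c') ('d', 'f')
```

Dropping unprojected columns at the loader, as pushProjection enables, avoids deserializing data the script never touches.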
61. Apache _________ is a project that enables development and consumption of REST style web services.
(a) Wives (b) Wink (c) Wig (d) All of the mentioned

Answer» Right choice is (b) Wink. To explain: the core server runtime is based on the JAX-RS (JSR 311) standard.

62. The output descriptor for the table to be written is created by calling ____________
(a) OutputJobInfo.describe (b) OutputJobInfo.create (c) OutputJobInfo.put (d) None of the mentioned

Answer» Correct choice is (b) OutputJobInfo.create. To explain I would say: the implementation of Map takes HCatRecord as an input and the implementation of Reduce produces it as an output.

63. Point out the wrong statement.
(a) Mahout is distributed under a commercially friendly Apache Software license
(b) Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm
(c) Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms
(d) None of the mentioned

Answer» Correct choice is (d) None of the mentioned, since all three statements are accurate. The explanation is: the goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases.

64. PostingsFormat now uses a __________ API when writing postings, just like doc values.
(a) push (b) pull (c) read (d) all of the mentioned

Answer» Right answer is (b) pull. To explain: this is powerful because you can do things in your postings format that require making more than one pass through the postings, such as iterating over all postings.

65. SolrJ now has first class support for __________ API.
(a) Compactions (b) Collections (c) Distribution (d) All of the mentioned

Answer» Right option is (b) Collections. Explanation: Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene.

66. _________ is the output produced by TextOutputFormat, Hadoop's default OutputFormat.
(a) KeyValueTextInputFormat (b) KeyValueTextOutputFormat (c) FileValueTextInputFormat (d) All of the mentioned

Answer» The correct answer is (b) KeyValueTextOutputFormat. Explanation: to interpret such files correctly, KeyValueTextInputFormat is appropriate.

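The round trip behind this question — TextOutputFormat writes one key-value pair per line separated by a tab, and KeyValueTextInputFormat splits each line back at the first tab — is easy to simulate. The Python below is a sketch of that file format, not the Hadoop classes themselves:

```python
def write_text_output(pairs):
    # like TextOutputFormat: one "key<TAB>value" line per pair
    return "\n".join(f"{k}\t{v}" for k, v in pairs)

def read_key_value_text(text):
    # like KeyValueTextInputFormat: split each line at the FIRST tab only,
    # so tabs inside the value survive
    return [tuple(line.split("\t", 1)) for line in text.splitlines()]

pairs = [("apple", "3"), ("pear", "7")]
assert read_key_value_text(write_text_output(pairs)) == pairs
```

Splitting on only the first tab is the important detail: it keeps keys and values unambiguous even when values contain tabs.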
67. Which of the following class provides a subset of features provided by the Unix/GNU Sort?
(a) KeyFieldBased (b) KeyFieldComparator (c) KeyFieldBasedComparator (d) All of the mentioned

Answer» Correct choice is (c) KeyFieldBasedComparator. To explain I would say: Hadoop has a library class, KeyFieldBasedComparator, that is useful for many applications.

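The Unix-sort-style behaviour alluded to here — ordering records on a selected key field, optionally numerically or in reverse — can be imitated with Python's sort key functions. This is an analogy to `sort -k`, not the Hadoop class itself:

```python
def sort_by_field(lines, field, numeric=False, reverse=False):
    # emulate `sort -k<field>` on whitespace-separated records (1-based field index)
    def key(line):
        value = line.split()[field - 1]
        return float(value) if numeric else value
    return sorted(lines, key=key, reverse=reverse)

records = ["alice 42", "bob 7", "carol 19"]
print(sort_by_field(records, field=2, numeric=True))
# ['bob 7', 'carol 19', 'alice 42']
```

Note that without `numeric=True` the same call would order "19" before "42" before "7", the classic string-versus-number sorting pitfall these options exist to solve.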
68. ___________ executes the pipeline as a series of MapReduce jobs.
(a) SparkPipeline (b) MRPipeline (c) MemPipeline (d) None of the mentioned

Answer» The correct answer is (b) MRPipeline. For explanation I would say: every Crunch data pipeline is coordinated by an instance of the Pipeline interface.

69. __________ is a log collection and correlation software with reporting and alarming functionalities.
(a) Lucene (b) ALOIS (c) Imphal (d) None of the mentioned

Answer» The correct answer is (b) ALOIS. Explanation: this project's Incubator activity was later transferred to another Incubator project – ODE.

70. _____________ will skip the nodes given in the config with the same exit transition as before.
(a) ActionMega handler (b) Action handler (c) Data handler (d) None of the mentioned

Answer» The correct answer is (b) Action handler. Best explanation: currently there is no way to remove an existing configuration; it can only be overridden by passing a different value in the input configuration.

71. Drill analyzes semi-structured/nested data coming from _________ applications.
(a) RDBMS (b) NoSQL (c) NewSQL (d) None of the mentioned

Answer» Correct choice is (b) NoSQL. The best I can explain: modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amounts of data than traditional transactional applications.

72. Point out the wrong statement.
(a) Oozie provides a unique callback URL to the task; the task should invoke the given URL to notify its completion
(b) All computation/processing tasks triggered by an mechanism node are remote to Oozie
(c) Oozie workflows can be parameterized
(d) None of the mentioned

Answer» The correct answer is (b). Easy explanation: it is action nodes, not "mechanism nodes", that trigger computation/processing tasks, and all such tasks are executed by the Hadoop MapReduce framework.

73. Hive uses _________ for logging.
(a) logj4 (b) log4l (c) log4i (d) log4j

Answer» Correct choice is (d) log4j. For explanation I would say: by default Hive will use hive-log4j.default in the conf/ directory of the Hive installation.

74. Which of the following data type is not supported by Hive?
(a) map (b) record (c) string (d) enum

Answer» The correct option is (d) enum. The explanation: Hive has no concept of enums.

75. New ____________ type enables indexing and searching of date ranges, particularly multi-valued ones.
(a) RangeField (b) DateField (c) DateRangeField (d) All of the mentioned

Answer» The correct answer is (c) DateRangeField. Explanation: a new ExitableDirectoryReader extends FilterDirectoryReader and enables exiting requests that take too long to enumerate over terms.

76. Mahout provides ____________ libraries for common and primitive Java collections.
(a) Java (b) Javascript (c) Perl (d) Python

Answer» Correct option is (a) Java. The explanation: math operations are focused on linear algebra and statistics.

77. __________ has the world's largest Hadoop cluster.
(a) Apple (b) Datamatics (c) Facebook (d) None of the mentioned

Answer» Correct answer is (c) Facebook. To explain I would say: Facebook has many Hadoop clusters, the largest among them being the one used for data warehousing.

78. ______________ class allows the Map/Reduce framework to partition the map outputs based on certain key fields, not the whole keys.
(a) KeyFieldPartitioner (b) KeyFieldBasedPartitioner (c) KeyFieldBased (d) None of the mentioned

Answer» Right choice is (b) KeyFieldBasedPartitioner. For explanation I would say: the primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting.

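The partition-on-primary-key, sort-on-(primary, secondary) idea in the explanation above can be sketched in a few lines. This is a Python simulation of the concept; the real KeyFieldBasedPartitioner is a Hadoop Java library class:

```python
def partition_and_sort(records, num_reducers):
    # route each (primary, secondary, value) record by hashing ONLY the primary key,
    # then sort each partition on the full (primary, secondary) key, as the
    # framework would before handing records to a reducer
    buckets = [[] for _ in range(num_reducers)]
    for rec in records:
        primary = rec[0]
        buckets[hash(primary) % num_reducers].append(rec)
    return [sorted(b) for b in buckets]

records = [("us", "b", 1), ("us", "a", 2), ("uk", "z", 3)]
parts = partition_and_sort(records, 2)
# all "us" records land in the same partition, ordered by secondary key
assert any(p[:2] == [("us", "a", 2), ("us", "b", 1)] for p in parts if len(p) >= 2)
```

Because only the primary key feeds the hash, every record sharing that key reaches the same reducer, while the secondary key still controls the order within it.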
79. Lucene provides scalable, high-performance indexing over ______ per hour on modern hardware.
(a) 1 TB (b) 150 GB (c) 10 GB (d) None of the mentioned

Answer» Right choice is (b) 150 GB. Easy explanation: Lucene offers powerful features through a simple API.

80. A __________ represents a distributed, immutable collection of elements of type T.
(a) PCollect (b) PCollection (c) PCol (d) All of the mentioned

Answer» Right choice is (b) PCollection. To elaborate: a PCollection<T> is Crunch's distributed, immutable collection of elements of type T.

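The defining property of a PCollection — each transformation returns a new immutable collection instead of mutating the old one — can be illustrated with a toy Python stand-in for Crunch's Java type. The name parallelDo is borrowed from the Crunch API; everything else here is an assumption for illustration:

```python
class ToyPCollection:
    # immutable wrapper: transformations build new collections
    def __init__(self, elements):
        self._elements = tuple(elements)   # frozen contents

    def parallelDo(self, fn):
        # apply fn to every element, yielding a NEW collection;
        # the original is left untouched
        return ToyPCollection(fn(e) for e in self._elements)

    def materialize(self):
        return list(self._elements)

words = ToyPCollection(["crunch", "pig"])
upper = words.parallelDo(str.upper)
print(words.materialize(), upper.materialize())
# ['crunch', 'pig'] ['CRUNCH', 'PIG']
```

Immutability is what lets a planner safely fuse, reorder, and rerun transformations: no step can observe another step's side effects.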
81. Using Hadoop Archives in __________ is as easy as specifying a different input filesystem than the default file system.
(a) Hive (b) Pig (c) MapReduce (d) All of the mentioned

Answer» Correct choice is (c) MapReduce. Easy explanation: because Hadoop Archives are exposed as a file system, MapReduce is able to use all the logical input files in Hadoop Archives as input.

82. ___________ property allows us to specify a custom dir location pattern for all the writes, and will interpolate each variable.
(a) hcat.dynamic.partitioning.custom.pattern (b) hcat.append.limit (c) hcat.pig.storer.external.location (d) hcatalog.hive.client.cache.expiry.time

Answer» Correct answer is (a) hcat.dynamic.partitioning.custom.pattern. Best explanation: hcat.append.limit, by contrast, allows an HCatalog user to specify a custom append limit.

83. Apache _________ provides direct queries on self-describing and semi-structured data in files.
(a) Drill (b) Mahout (c) Oozie (d) All of the mentioned

Answer» Right answer is (a) Drill. For explanation: users can explore live data on their own as it arrives, versus spending weeks or months on data preparation, modeling, ETL and subsequent schema management.

84. Nodes in the config _____________ must be completed successfully.
(a) oozie.wid.rerun.skip.nodes (b) oozie.wf.rerun.skip.nodes (c) oozie.wf.run.skip.nodes (d) all of the mentioned

Answer» The correct option is (b) oozie.wf.rerun.skip.nodes. Easiest explanation: if no configuration is passed, the existing coordinator/workflow configuration will be used.

85. Point out the wrong statement.
(a) Falcon promotes Javascript Programming
(b) Falcon does not do any heavy lifting but delegates to tools within the Hadoop ecosystem
(c) Falcon handles retry logic and late data processing, and records audit, lineage and metrics
(d) All of the mentioned

Answer» Right answer is (a). Best explanation: Falcon promotes Polyglot Programming, not Javascript Programming.

86. A __________ in a social graph is a group of people who interact frequently with each other and less frequently with others.
(a) semi-cluster (b) partial cluster (c) full cluster (d) none of the mentioned

Answer» Correct option is (a) semi-cluster. To explain: semi-clustering differs from ordinary clustering in that a vertex may belong to more than one semi-cluster.

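The distinguishing feature named in the explanation — a vertex may belong to several semi-clusters at once — is easy to demonstrate with overlapping sets. The data below is illustrative only; this is not the Pregel semi-clustering algorithm itself:

```python
# each semi-cluster is a set of people; note "bob" appears in two of them,
# which ordinary (disjoint) clustering would forbid
semi_clusters = [
    {"alice", "bob", "carol"},
    {"bob", "dave"},
    {"erin", "frank"},
]

def memberships(person):
    # count how many semi-clusters a person belongs to
    return sum(person in c for c in semi_clusters)

assert memberships("bob") == 2     # overlapping membership is allowed
assert memberships("alice") == 1
```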
87. __________ is a REST API for HCatalog.
(a) WebHCat (b) WbHCat (c) InpHCat (d) None of the mentioned

Answer» Right choice is (a) WebHCat. To explain: REST stands for "representational state transfer", a style of API based on HTTP verbs.

88. Hive, Pig, and Cascading all use a _________ data model.
(a) value centric (b) columnar (c) tuple-centric (d) none of the mentioned

Answer» The correct choice is (c) tuple-centric. The best I can explain: Crunch allows developers considerable flexibility in how they represent their data, which makes Crunch the best pipeline platform for developers.

89. ____________ is a distributed machine learning framework on top of Spark.
(a) MLlib (b) Spark Streaming (c) GraphX (d) RDDs

Answer» Correct answer is (a) MLlib. Explanation: MLlib implements many common machine learning and statistical algorithms to simplify large scale machine learning pipelines.

90. Hive also supports custom extensions written in ____________
(a) C# (b) Java (c) C (d) C++

Answer» The correct answer is (b) Java. Easiest explanation: Hive supports custom extensions written in Java, including user-defined functions (UDFs) and serializer-deserializers for reading and optionally writing custom formats.

91. Point out the wrong statement.
(a) Elastic MapReduce (EMR) is Facebook's packaged Hadoop offering
(b) Amazon Web Service Elastic MapReduce (EMR) is Amazon's packaged Hadoop offering
(c) Scalding is a Scala API on top of Cascading that removes most Java boilerplate
(d) All of the mentioned

Answer» Correct answer is (a); EMR is Amazon's offering, not Facebook's. Best explanation: rather than building Hadoop deployments manually on EC2 (Elastic Compute Cloud) clusters, users can spin up fully configured Hadoop installations using simple invocation commands, either through the AWS Web Console or through command-line tools.

92. ________ is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets.
(a) Pig Latin (b) Oozie (c) Pig (d) Hive

Answer» Correct choice is (c) Pig. Best explanation: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

93. ________________ is a complete FTP Server based on the MINA I/O system.
(a) Giraph (b) Gereition (c) FtpServer (d) Oozie

Answer» The correct choice is (c) FtpServer. For explanation I would say: Giraph, by contrast, is a large-scale, fault-tolerant, Bulk Synchronous Parallel (BSP)-based graph processing framework.

94. ___________ provides Java-based indexing and search technology.
(a) Solr (b) Lucene Core (c) Lucy (d) All of the mentioned

Answer» Correct answer is (b) Lucene Core. The best I can explain: Lucene provides spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

95. _____________ is a software distribution framework based on OSGi.
(a) ACE (b) Abdera (c) Zeppelin (d) Accumulo

Answer» Correct answer is (a) ACE. Easy explanation: ACE allows you to manage and distribute artifacts.

96. Falcon provides ___________ workflow for copying data from source to target.
(a) recurring (b) investment (c) data (d) none of the mentioned

Answer» Right option is (a) recurring. Best explanation: Falcon instruments workflows for dependencies, retry logic, table/partition registration, notifications, etc.

97. Falcon promotes decoupling of data set location from ___________ definition.
(a) Oozie (b) Impala (c) Kafka (d) Thrift

Answer» The correct option is (a) Oozie. Explanation: Falcon uses declarative processing with simple directives, enabling rapid prototyping.

98. In Distributed Mode, the machines are mapped in the __________ file.
(a) groomservers (b) grervers (c) grsvers (d) groom

Answer» The correct choice is (a) groomservers. To explain I would say: Distributed Mode is used when you have multiple machines.

99. Workflow with id __________ should be in SUCCEEDED/KILLED/FAILED.
(a) wfId (b) iUD (c) iFD (d) all of the mentioned

Answer» The correct option is (a) wfId. The best I can explain: a workflow with id wfId should exist.

100. Point out the wrong statement.
(a) Storm is difficult and can be used with only Java
(b) Storm is fast: a benchmark clocked it at over a million tuples processed per second per node
(c) Storm is scalable, fault-tolerant, and guarantees your data will be processed
(d) All of the mentioned

Answer» Right choice is (a). The best explanation: Storm is simple and can be used with any programming language.