
351.

__________ is a fully integrated, state-of-the-art analytic database architected specifically to leverage the strengths of Hadoop. (a) Oozie (b) Impala (c) Lucene (d) BigTop

Answer» Correct answer is (b) Impala

To elaborate: Impala provides low-latency, interactive SQL queries on data stored in Hadoop, adding scalability and flexibility to the platform.
352.

Which of the following companies shipped Impala? (a) Amazon (b) Oracle (c) MapR (d) All of the mentioned

Answer» Correct option is (d) All of the mentioned

The explanation is: Impala is shipped by Cloudera, MapR, Oracle, and Amazon.
353.

Amazon EMR uses Hadoop processing combined with several __________ products. (a) AWS (b) ASQ (c) AMR (d) AWES

Answer» Correct answer is (a) AWS

For explanation I would say: Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to process large amounts of data efficiently.
354.

Impala is integrated with native Hadoop security and Kerberos for authentication via __________ module. (a) Sentinue (b) Sentry (c) Sentinar (d) All of the mentioned

Answer» Correct option is (b) Sentry

Explanation: Via the Sentry module, you can ensure that the right users and applications are authorized for the right data.
355.

The Amazon EMR default input format for Hive is __________ (a) org.apache.hadoop.hive.ql.io.CombineHiveInputFormat (b) org.apache.hadoop.hive.ql.iont.CombineHiveInputFormat (c) org.apache.hadoop.hive.ql.io.CombineFormat (d) All of the mentioned

Answer» Right option is (a) org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

The best I can explain: You can set the hive.base.inputformat option in Hive to select a different input format.
356.

Impala on Amazon EMR requires _________ running Hadoop 2.x or greater. (a) AMS (b) AMI (c) AWR (d) All of the mentioned

Answer» Correct choice is (b) AMI

The explanation: Impala on Amazon EMR runs on AMIs with Hadoop 2.x or greater; Impala itself is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax.
357.

InfoSphere ___________ provides you with the ability to flexibly meet your unique information integration requirements. (a) Data Server (b) Information Server (c) Info Server (d) All of the mentioned

Answer» Right answer is (b) Information Server

To explain I would say: IBM InfoSphere Information Server is a market-leading data integration platform which includes a family of products that enable you to understand, cleanse, monitor, transform, and deliver data.
358.

Spark is engineered from the bottom up for performance, running ___________ faster than Hadoop by exploiting in-memory computing and other optimizations. (a) 100x (b) 150x (c) 200x (d) None of the mentioned

Answer» Right choice is (a) 100x

The best explanation: Spark is fast on disk too; it currently holds the world record in large scale on-disk sorting.
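As a rough illustration of the in-memory point, here is a minimal sketch of a Spark job in Java that caches an RDD so a second pass over the data stays in memory instead of re-reading from disk (the application name and HDFS path are made up for the example):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CachedLogCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CachedLogCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input path; replace with a real dataset.
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");

        // cache() keeps the RDD in memory, so the second count below
        // does not re-read the file from disk.
        lines.cache();

        long errors = lines.filter(l -> l.contains("ERROR")).count();
        long warnings = lines.filter(l -> l.contains("WARN")).count();
        System.out.println(errors + " errors, " + warnings + " warnings");

        sc.stop();
    }
}
```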
359.

Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that effectively allows you to process text data like the Unix ______ utility. (a) Copy (b) Cut (c) Paste (d) Move

Answer» Correct option is (b) Cut

The best I can explain: The map function defined in the class treats each input key/value pair as a list of fields.
360.

Point out the correct statement. (a) The sequence file also can contain a “secondary” key-value list that can be used as file metadata (b) SequenceFile formats share a header that contains some information which allows the reader to recognize its format (c) There are key and value class names that allow the reader to instantiate those classes, via reflection, for reading (d) All of the mentioned

Answer» Correct option is (d) All of the mentioned

The best I can explain: In contrast with other persistent key-value data structures like B-Trees, you can’t seek to a specified key to edit, add or remove it.
361.

How many formats of SequenceFile are present in Hadoop I/O? (a) 2 (b) 3 (c) 4 (d) 5

Answer» Right choice is (b) 3

For explanation I would say: SequenceFile has three available formats: an “Uncompressed” format, a “Record-Compressed” format and a “Block-Compressed” format.
362.

Apache Hadoop ___________ provides a persistent data structure for binary key-value pairs. (a) GetFile (b) SequenceFile (c) Putfile (d) All of the mentioned

Answer» Correct option is (b) SequenceFile

Explanation: SequenceFile is append-only.
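To make the append-only, key-value nature of SequenceFile concrete, here is a small sketch using the standard SequenceFile.Writer/Reader API; the output path numbers.seq is just an example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("numbers.seq");   // hypothetical output path

        // Write key-value pairs; a SequenceFile can only be appended to.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < 5; i++) {
                writer.append(new IntWritable(i), new Text("value-" + i));
            }
        }

        // Read the pairs back in the order they were written.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}
```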
363.

Hadoop ___________ is a utility to support running external map and reduce jobs. (a) Orchestration (b) Streaming (c) Collection (d) All of the mentioned

Answer» Right choice is (b) Streaming

The best explanation: These external jobs can be written in various programming languages such as Python or Ruby.
364.

___________ was created to allow you to flow data from a source into your Hadoop environment. (a) Impala (b) Oozie (c) Flume (d) All of the mentioned

Answer» Right choice is (c) Flume

Explanation: In Flume, the entities you work with are called sources, decorators, and sinks.
365.

Point out the wrong statement. (a) TusCAN is a Service Component Architecture implementation (b) Tobago is a JSF based framework for web-applications (c) Traffic Server is a scalable and extensible HTTP proxy server and cache (d) None of the mentioned

Answer» The correct choice is (a) TusCAN is a Service Component Architecture implementation

Best explanation: The project is named Tuscany, not TusCAN; Apache Tuscany provides the Service Component Architecture implementation.
366.

___________ is a distributed data warehouse system for Hadoop. (a) Stratos (b) Tajo (c) Sqoop (d) Lucene

Answer» Correct choice is (b) Tajo

The best explanation: Tajo is a relational, distributed data warehouse system for Apache Hadoop; Sqoop, by contrast, is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
367.

Many people use Kafka as a replacement for a ___________ solution. (a) log aggregation (b) compaction (c) collection (d) all of the mentioned

Answer» Correct answer is (a) log aggregation

Best explanation: Log aggregation typically collects physical log files off servers and puts them in a central place.
368.

Kafka uses key-value pairs in the ____________ file format for configuration. (a) RFC (b) Avro (c) Property (d) None of the mentioned

Answer» Right option is (c) Property

For explanation I would say: These key values can be supplied either from a file or programmatically.
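As a sketch of supplying Kafka's key-value configuration programmatically, the snippet below builds a Properties object and hands it to a producer; the broker address and topic name are assumptions for the example, and the same keys could equally come from a .properties file:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaConfigDemo {
    public static void main(String[] args) {
        // Key-value configuration supplied programmatically.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // "demo-topic" is a hypothetical topic name.
            producer.send(new ProducerRecord<>("demo-topic", "key1", "hello"));
        }
    }
}
```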
369.

Point out the wrong statement. (a) CDH contains the main, core elements of Hadoop (b) In October 2012, Cloudera announced the Cloudera Impala project (c) CDH may be downloaded from Cloudera’s website at no charge (d) None of the mentioned

Answer» The correct option is (d) None of the mentioned

To explain: CDH may be downloaded from Cloudera’s website at no charge, but without technical support or Cloudera Manager.
370.

Point out the wrong statement. (a) InfoSphere DataStage also facilitates extended metadata management and enterprise connectivity (b) Real-Time Integration pack can turn server or parallel jobs into SOA services (c) In 2012 the suite was renamed to InfoSphere Information Server and the product was renamed to InfoSphere DataStage (d) None of the mentioned

Answer» The correct option is (c) In 2012 the suite was renamed to InfoSphere Information Server and the product was renamed to InfoSphere DataStage

Easiest explanation: In 2006 the product was released as part of the IBM Information Server under the Information Management family but was still known as WebSphere DataStage.
371.

The ________ method in the ModelCountReducer class “reduces” the values the mapper collects into a derived value. (a) count (b) add (c) reduce (d) all of the mentioned

Answer» Correct option is (c) reduce

Explanation: In some cases, it can be a simple sum of the values.
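ModelCountReducer itself is not shown here, but a generic reducer of the same shape illustrates the idea: the reduce() method collapses all the values collected for a key into one derived value, in this sketch a simple sum:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Generic stand-in for a reducer like ModelCountReducer: all values the
// mappers emitted for one key are folded into a single derived value.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}
```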
372.

____________ is an open-source version control system. (a) Stratos (b) Kafka (c) Sqoop (d) Subversion

Answer» The correct answer is (d) Subversion

Best explanation: Apache Subversion (SVN) is an open-source version control system; the Apache Software Foundation has historically hosted project source code, including Hadoop’s, in Subversion.
373.

To configure short-circuit local reads, you will need to enable ____________ on local Hadoop. (a) librayhadoop (b) libhadoop (c) libhad (d) none of the mentioned

Answer» Correct choice is (b) libhadoop

Explanation: Short-circuit reads make use of a UNIX domain socket.
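For illustration, the settings an administrator would normally place in hdfs-site.xml can also be set on a client Configuration object; this sketch assumes a typical domain-socket path and that the native libhadoop library is installed on both the client and the DataNode:

```java
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConfigDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Turn on short-circuit local reads.
        conf.setBoolean("dfs.client.read.shortcircuit", true);

        // Path of the UNIX domain socket shared by the DataNode and clients
        // (an assumed, typical location).
        conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");

        System.out.println("socket path = " + conf.get("dfs.domain.socket.path"));
    }
}
```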
374.

__________ is an abstraction over Apache Hadoop YARN that reduces the complexity of developing distributed applications. (a) Wave (b) Twill (c) Usergrid (d) None of the mentioned

Answer» Right choice is (b) Twill

To elaborate: Twill allows developers to focus more on their business logic.
375.

Kafka maintains feeds of messages in categories called __________ (a) topics (b) chunks (c) domains (d) messages

Answer» Correct answer is (a) topics

The explanation is: We’ll call processes that publish messages to a Kafka topic producers.
376.

Applications can use the _________ provided to report progress or just indicate that they are alive. (a) Collector (b) Reporter (c) Dashboard (d) None of the mentioned

Answer» Right choice is (b) Reporter

To explain I would say: In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task.
377.

Point out the wrong statement. (a) The Mapper outputs are sorted and then partitioned per Reducer (b) The total number of partitions is the same as the number of reduce tasks for the job (c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format (d) None of the mentioned

Answer» Right choice is (d) None of the mentioned

Easiest explanation: All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
378.

The right number of reduces seems to be ____________ (a) 0.90 (b) 0.80 (c) 0.36 (d) 0.95

Answer» Right answer is (d) 0.95

The best explanation: The right number of reduces seems to be 0.95 or 1.75 multiplied by the number of nodes times the maximum number of reduce slots per node.
379.

The number of maps is usually driven by the total size of ____________ (a) inputs (b) outputs (c) tasks (d) None of the mentioned

Answer» Correct option is (a) inputs

To explain I would say: Total size of inputs means the total number of blocks of the input files.
380.

Cloudera ___________ includes CDH and an annual subscription license (per node) to Cloudera Manager and technical support. (a) Enterprise (b) Express (c) Standard (d) All of the mentioned

Answer» Correct option is (a) Enterprise

Easiest explanation: CDH includes the core elements of Apache Hadoop plus several additional key open source projects.
381.

The number of reduces for the job is set by the user via _________ (a) JobConf.setNumTasks(int) (b) JobConf.setNumReduceTasks(int) (c) JobConf.setNumMapTasks(int) (d) All of the mentioned

Answer» Correct choice is (b) JobConf.setNumReduceTasks(int)

To elaborate: Reducer has 3 primary phases: shuffle, sort and reduce.
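A minimal old-API driver sketch showing where setNumReduceTasks() fits; the job name, the choice of 10 reducers, and the commented-out mapper/reducer classes are placeholders rather than anything mandated by the API:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The user chooses the number of reduce tasks for the job here.
        conf.setNumReduceTasks(10);

        // Hypothetical user-supplied classes would be registered like this:
        // conf.setMapperClass(WordCountMapper.class);
        // conf.setReducerClass(WordCountReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```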
382.

__________ maps input key/value pairs to a set of intermediate key/value pairs. (a) Mapper (b) Reducer (c) Both Mapper and Reducer (d) None of the mentioned

Answer» Right option is (a) Mapper

The explanation is: Maps are the individual tasks that transform input records into intermediate records.
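A minimal sketch of a Mapper that turns each input record (a line of text) into intermediate (word, 1) pairs, the classic word-count transformation; the framework then groups those intermediate pairs by key for the reducers:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input record is one line; emit an intermediate pair per word.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}
```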
383.

Applications can use the ____________ to report progress and set application-level status messages. (a) Partitioner (b) OutputSplit (c) Reporter (d) All of the mentioned

Answer» The correct answer is (c) Reporter

Best explanation: Reporter is also used to update Counters, or just indicate that they are alive.
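A sketch of how an old-API map task might use the Reporter it is handed; the counter enum and the status text are invented for the example:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SlowRecordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    enum Records { PROCESSED }   // hypothetical counter group

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // ... some expensive per-record work would go here ...

        reporter.setStatus("processing offset " + key.get()); // status message
        reporter.incrCounter(Records.PROCESSED, 1);           // update a counter
        reporter.progress();                                  // "still alive" signal

        output.collect(new Text(value), new IntWritable(1));
    }
}
```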
384.

The right level of parallelism for maps seems to be around _________ maps per node. (a) 1-10 (b) 10-100 (c) 100-150 (d) 150-200

Answer» Right answer is (b) 10-100

The best explanation: Task setup takes a while, so it is best if the maps take at least a minute to execute.
385.

Point out the wrong statement. (a) If V2 and V3 are the same, you only need to use setOutputValueClass() (b) The overall effect of a Streaming job is to perform a sort of the input (c) A Streaming application can control the separator that is used when a key-value pair is turned into a series of bytes and sent to the map or reduce process over standard input (d) None of the mentioned

Answer» The correct answer is (d) None of the mentioned

Explanation: If a combine function is used then it is the same form as the reduce function, except its output types are the intermediate key and value types (K2 and V2), so they can feed the reduce function.
386.

An input _________ is a chunk of the input that is processed by a single map. (a) textformat (b) split (c) datanode (d) all of the mentioned

Answer» The correct option is (b) split

For explanation: Each split is divided into records, and the map processes each record—a key-value pair—in turn.
387.

Point out the correct statement. (a) The reduce input must have the same types as the map output, although the reduce output types may be different again (b) The map input key and value types (K1 and V1) are different from the map output types (c) The partition function operates on the intermediate key (d) All of the mentioned

Answer» Right answer is (d) All of the mentioned

To elaborate: In practice, the partition is determined solely by the key (the value is ignored).
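As a sketch of that last point, a custom Partitioner computes the partition from the intermediate key alone and ignores the value, in the same spirit as the default HashPartitioner:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// The partition is derived solely from the intermediate key; the value
// plays no part, mirroring what the default HashPartitioner does.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```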
388.

__________ is a variant of SequenceFileInputFormat that converts the sequence file’s keys and values to Text objects. (a) SequenceFile (b) SequenceFileAsTextInputFormat (c) SequenceAsTextInputFormat (d) All of the mentioned

Answer» Correct choice is (b) SequenceFileAsTextInputFormat

Best explanation: SequenceFileAsTextInputFormat performs the conversion by calling toString() on the sequence file’s keys and values, which makes the format convenient for Streaming programs.
389.

___________ generates keys of type LongWritable and values of type Text. (a) TextOutputFormat (b) TextInputFormat (c) OutputInputFormat (d) None of the mentioned

Answer» Correct answer is (b) TextInputFormat

For explanation I would say: With TextInputFormat, the key is the byte offset of the line within the file (a LongWritable) and the value is the contents of the line (a Text).
390.

With ______ we can store data and read it easily with various programming languages. (a) Thrift (b) Protocol Buffers (c) Avro (d) None of the mentioned

Answer» The correct option is (c) Avro

The explanation: Avro is optimized to minimize the disk space needed by our data and it is flexible.
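A small sketch of writing an Avro data file from Java using a hypothetical two-field schema; because the schema travels with the file, an Avro library in another language can read the records back:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical record schema with two fields.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Write a container file that embeds the schema alongside the data.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}
```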
391.

NiFi is a dataflow system based on the concepts of ________ programming. (a) structured (b) relational (c) set (d) flow-based

Answer» Right option is (d) flow-based

Explanation: NiFi automates and manages the flow of data between systems; it was developed at the NSA before being donated to the Apache Incubator.
392.

____________ is a query processing and optimization system for large-scale, distributed data analysis. (a) MRQL (b) NiFi (c) OpenAz (d) ODF Toolkit

Answer» Correct answer is (a) MRQL

For explanation: MRQL is built on top of Apache Hadoop, Hama, Spark, and Flink.
393.

Point out the wrong statement. (a) Hadoop has a library package called Aggregate (b) Aggregate allows you to define a mapper plugin class that is expected to generate “aggregatable items” for each input key/value pair of the mappers (c) To use Aggregate, simply specify “-mapper aggregate” (d) None of the mentioned

Answer» Correct option is (c) To use Aggregate, simply specify “-mapper aggregate”

The best I can explain: Statement (c) is wrong because to use Aggregate you specify “-reducer aggregate”, while supplying your own mapper that emits the aggregatable items.
394.

________ is a columnar storage format for Hadoop. (a) MRQL (b) NiFi (c) OpenAz (d) Parquet

Answer» Right choice is (d) Parquet

The explanation is: Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework or programming language.
395.

__________ is a columnar storage format for Hadoop. (a) Ranger (b) Parquet (c) REEF (d) None of the mentioned

Answer» The correct option is (b) Parquet

The explanation is: The Ranger project is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
396.

To set an environment variable in a streaming command use ____________ (a) -cmden EXAMPLE_DIR=/home/example/dictionaries/ (b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/ (c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ (d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/

Answer» Correct answer is (c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/

For explanation I would say: Environment variables are set using the -cmdenv option.
397.

Which of the following is only for storage with limited compute? (a) Hot (b) Cold (c) Warm (d) All_SSD

Answer» The correct choice is (b) Cold

Easiest explanation: When a block is cold, all replicas are stored in the ARCHIVE.
398.

During the execution of a streaming job, the names of the _______ parameters are transformed. (a) vmap (b) mapvim (c) mapreduce (d) mapred

Answer» Right answer is (d) mapred

For explanation: The dots (.) in the parameter names become underscores (_), so to read the values in a streaming job’s mapper/reducer you use the names with underscores, e.g. mapred_job_id instead of mapred.job.id.
399.

When a block is warm, some of its replicas are stored in DISK and the remaining replicas are stored in _________ (a) ROM_DISK (b) ARCHIVE (c) RAM_DISK (d) All of the mentioned

Answer» Correct answer is (b) ARCHIVE

Easiest explanation: Warm storage policy is partially hot and partially cold.
400.

The standard output (stdout) and error (stderr) streams of the task are read by the TaskTracker and logged to _________ (a) ${HADOOP_LOG_DIR}/user (b) ${HADOOP_LOG_DIR}/userlogs (c) ${HADOOP_LOG_DIR}/logs (d) None of the mentioned

Answer» Correct answer is (b) ${HADOOP_LOG_DIR}/userlogs

The explanation: The TaskTracker captures each task’s stdout and stderr and writes them under ${HADOOP_LOG_DIR}/userlogs; the child JVM also always has its current working directory added to java.library.path and LD_LIBRARY_PATH.