InterviewSolution
This section offers curated multiple-choice questions on Hadoop-ecosystem topics (HDFS, MapReduce, Hive, Cassandra, Kafka, Thrift, Chukwa, Flume, and related projects) to sharpen your knowledge and support exam preparation.
151. Point out the correct statement.
(a) Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers
(b) Cassandra has a “masterless” architecture, meaning all nodes are the same
(c) Cassandra also provides customizable replication, storing redundant copies of data across nodes that participate in a Cassandra ring
(d) All of the mentioned

Answer: (d) All of the mentioned
Explanation: Cassandra provides automatic data distribution across all nodes that participate in a “ring” or database cluster.

152. _________ can be configured per table for non-QUORUM consistency levels.
(a) Read repair
(b) Read damage
(c) Write repair
(d) None of the mentioned

Answer: (a) Read repair
Explanation: If the replicas are inconsistent, the coordinator issues writes to the out-of-date replicas to update the row to the most recent values. This process is known as read repair.

153. Cassandra uses a protocol called _______ to discover location and state information.
(a) gossip
(b) intergos
(c) goss
(d) all of the mentioned

Answer: (a) gossip
Explanation: Gossip is used for internode communication.

154. There are _________ types of read requests that a coordinator can send to a replica.
(a) two
(b) three
(c) four
(d) all of the mentioned

Answer: (b) three
Explanation: A coordinator can send a direct read request, a digest request, or a background read repair request; the direct read request goes to exactly one replica node.

155. Cassandra searches the __________ to determine the approximate location on disk of the index entry.
(a) partition record
(b) partition summary
(c) partition search
(d) all of the mentioned

Answer: (b) partition summary
Explanation: If the Bloom filter does not rule out the SSTable, Cassandra checks the partition key cache; on a cache miss, it searches the partition summary to determine the approximate location of the index entry on disk.

156. For each SSTable, Cassandra creates a _________ index.
(a) memory
(b) partition
(c) in memory
(d) all of the mentioned

Answer: (b) partition
Explanation: The partition index is a list of partition keys and the start position of rows in the data file (on disk).

157. The type of __________ strategy Cassandra performs on your data is configurable and can significantly affect read performance.
(a) compression
(b) collection
(c) compaction
(d) decompression

Answer: (c) compaction
Explanation: Using the SizeTieredCompactionStrategy or DateTieredCompactionStrategy tends to cause data fragmentation when rows are frequently updated.

158. Authorization capabilities for Cassandra use the familiar _________ security paradigm to manage object permissions.
(a) COMMIT
(b) GRANT
(c) ROLLBACK
(d) None of the mentioned

Answer: (b) GRANT
Explanation: Once authenticated into a database cluster using internal authentication, the next security issue to be tackled is permission management.

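As a concrete illustration, object permissions are managed with GRANT/REVOKE statements once authentication is in place. The sketch below assumes the DataStax Java driver 3.x API, internal authentication enabled with the default superuser, and a hypothetical keyspace `sales` and role `analyst`:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class GrantExample {
    public static void main(String[] args) {
        // Connect as a superuser; credentials assume internal authentication is enabled.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withCredentials("cassandra", "cassandra")
                .build();
             Session session = cluster.connect()) {
            // Create a role, then manage its object permissions GRANT-style.
            session.execute("CREATE ROLE IF NOT EXISTS analyst WITH PASSWORD = 's3cret' AND LOGIN = true");
            session.execute("GRANT SELECT ON KEYSPACE sales TO analyst");
        }
    }
}
```
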
159. The _____________ allows external processes to watch the stream of chunks passing through the collector.
(a) LocalWriter
(b) SeqFileWriter
(c) SocketTeeWriter
(d) All of the mentioned

Answer: (c) SocketTeeWriter
Explanation: SocketTeeWriter listens on a port (specified by the conf option chukwaCollector.tee.port, defaulting to 9094).

160. Point out the correct statement.
(a) Chukwa supports two different reliability strategies
(b) chukwaCollector.asyncAcks.scantime affects how often collectors will check the filesystem for commits
(c) chukwaCollector.asyncAcks.scanperiod defaults to thrice the rotation interval
(d) all of the mentioned

Answer: (a) Chukwa supports two different reliability strategies
Explanation: The first, default strategy is as follows: collectors write data to HDFS, and as soon as the HDFS write call returns success, they report success to the agent, which advances its checkpoint state.

161. Point out the wrong statement.
(a) Filters use the same syntax as the Dump command
(b) “RAW” will send the internal data of the Chunk, without any metadata, prefixed by its length encoded as a 32-bit int
(c) Specifying “WRITABLE” will cause the chunks to be written using the Hadoop Writable serialization framework
(d) None of the mentioned

Answer: (d) None of the mentioned
Explanation: “HEADER” is similar to “RAW”, but with a one-line header in front of the content.

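To make the RAW format concrete, here is a minimal sketch of an external watcher. It assumes a collector with SocketTeeWriter enabled on the default port 9094 and a hypothetical "RAW all" filter handshake; only the 32-bit length prefix is taken from the RAW framing described above.

```java
import java.io.DataInputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class TeeWatcher {
    public static void main(String[] args) throws Exception {
        // Connect to the collector's tee port (chukwaCollector.tee.port, default 9094).
        try (Socket socket = new Socket("localhost", 9094);
             DataInputStream in = new DataInputStream(socket.getInputStream())) {
            // Hypothetical handshake: request RAW chunks matching a filter.
            socket.getOutputStream().write("RAW all\n".getBytes(StandardCharsets.UTF_8));
            while (true) {
                int length = in.readInt();   // RAW data is prefixed by a 32-bit length
                byte[] chunk = new byte[length];
                in.readFully(chunk);         // read exactly one chunk body
                System.out.println("got chunk of " + length + " bytes");
            }
        }
    }
}
```
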
162. Conceptually, each _________ emits a semi-infinite stream of bytes, numbered starting from zero.
(a) Collector
(b) Adaptor
(c) Compactor
(d) LocalWriter

Answer: (b) Adaptor
Explanation: A Chunk is a sequence of bytes, with some metadata; several of the metadata fields are set automatically by the Agent or Adaptors.

163. Point out the wrong statement.
(a) The framework calls the reduce method for each <key, (list of values)> pair in the grouped inputs
(b) The output of the Reducer is re-sorted
(c) The reduce method reduces values for a given key
(d) None of the mentioned

Answer: (b) The output of the Reducer is re-sorted
Explanation: The output of the Reducer is not re-sorted.

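For reference, a minimal word-count style reducer shows the framework calling reduce once per <key, (list of values)> pair; note that nothing here re-sorts the reducer's output.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once for each <key, (list of values)> pair in the grouped inputs.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // The framework writes this output as-is; it is not re-sorted.
        context.write(key, new IntWritable(sum));
    }
}
```
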
164. _____________ is used to read data from byte buffers.
(a) write()
(b) read()
(c) readwrite()
(d) all of the mentioned

Answer: (b) read()
Explanation: The readFully method can also be used instead of the read method.

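A small sketch of the difference: read() may return fewer bytes than requested, while readFully blocks until the buffer is completely filled (or throws EOFException).

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class ReadExample {
    public static void main(String[] args) throws IOException {
        byte[] raw = {0, 0, 0, 42, 7, 7, 7, 7};
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            byte[] buffer = new byte[4];
            int n = in.read(buffer);  // may read fewer than buffer.length bytes
            in.readFully(buffer);     // fills the buffer completely or throws EOFException
            System.out.println("first read returned " + n + " bytes");
        }
    }
}
```
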
165. The _________ collocation identifier is integrated into the process that is used to create vectors from sequence files of text keys and values.
(a) lbr
(b) lcr
(c) llr
(d) lar

Answer: (c) llr
Explanation: llr stands for log-likelihood ratio; the --minLLR option can be used to control the cutoff that prevents collocations below the specified LLR score from being emitted.

166. ____________ generates NGrams and counts frequencies for ngrams, head and tail subgrams.
(a) CollocationDriver
(b) CollocDriver
(c) CarDriver
(d) All of the mentioned

Answer: (b) CollocDriver
Explanation: Each call to the mapper passes in the full set of tokens for the corresponding document using a StringTuple.

167. The _________ uses just the value field in append(value); the key is a LongWritable that contains the record number, count + 1.
(a) SetFile
(b) ArrayFile
(c) BloomMapFile
(d) None of the mentioned

Answer: (b) ArrayFile
Explanation: The SetFile, instead of append(key, value), uses just the key field in append(key); the value is always the NullWritable instance.

168. The ________ method adds the deprecated key to the global deprecation map.
(a) addDeprecits
(b) addDeprecation
(c) keyDeprecation
(d) none of the mentioned

Answer: (b) addDeprecation
Explanation: addDeprecation does not override any existing entries in the deprecation map.

169. The _________ method clears all keys from the configuration.
(a) clear
(b) addResource
(c) getClass
(d) none of the mentioned

Answer: (a) clear
Explanation: getClass is used to get the value of the name property as a Class.

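The two methods from questions 168 and 169 side by side, as a minimal sketch (the property names are the standard MapReduce deprecation pair):

```java
import org.apache.hadoop.conf.Configuration;

public class ConfExample {
    public static void main(String[] args) {
        // Map the deprecated key to its replacement; existing entries are not overridden.
        Configuration.addDeprecation("mapred.job.name", "mapreduce.job.name");

        Configuration conf = new Configuration();
        conf.set("mapred.job.name", "demo");                 // resolved through the deprecation map
        System.out.println(conf.get("mapreduce.job.name")); // prints "demo"

        conf.clear();                                        // removes all keys from the configuration
        System.out.println(conf.get("mapreduce.job.name")); // now null
    }
}
```
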
170. Point out the wrong statement.
(a) With Thrift, it is not possible to define a service and change the protocol and transport without recompiling the code
(b) Thrift includes server infrastructure to tie protocols and transports together, like blocking, non-blocking, and multi-threaded servers
(c) Thrift supports a number of protocols for service definition
(d) None of the mentioned

Answer: (d) None of the mentioned
Explanation: The underlying I/O part of the stack is implemented differently for different languages.

171. Which of the following is a more compact binary format?
(a) TCompactProtocol
(b) TDenseProtocol
(c) TBinaryProtocol
(d) TSimpleJSONProtocol

Answer: (a) TCompactProtocol
Explanation: TCompactProtocol is typically more efficient to process as well.

172. Which of the following is a straightforward binary format?
(a) TCompactProtocol
(b) TDenseProtocol
(c) TBinaryProtocol
(d) TSimpleJSONProtocol

Answer: (c) TBinaryProtocol
Explanation: TBinaryProtocol is not optimized for space efficiency.

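To see the size difference in practice, the same object can be serialized with both protocols. This is a sketch: `User` stands for any hypothetical class generated by the Thrift compiler, with the usual generated setters.

```java
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TCompactProtocol;

public class ProtocolSizes {
    public static void main(String[] args) throws TException {
        User user = new User();   // hypothetical Thrift-generated struct
        user.setName("ada");
        user.setId(42);

        TSerializer binary = new TSerializer(new TBinaryProtocol.Factory());
        TSerializer compact = new TSerializer(new TCompactProtocol.Factory());

        // TBinaryProtocol: straightforward, not optimized for space.
        byte[] b = binary.serialize(user);
        // TCompactProtocol: denser encoding of the same data.
        byte[] c = compact.serialize(user);
        System.out.println("binary=" + b.length + " bytes, compact=" + c.length + " bytes");
    }
}
```
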
173. Point out the wrong statement.
(a) The Kafka cluster does not retain all published messages
(b) A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients
(c) Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization
(d) Messages are persisted on disk and replicated within the cluster to prevent data loss

Answer: (a) The Kafka cluster does not retain all published messages
Explanation: The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time.

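The retention period is an ordinary topic configuration. A sketch with the Kafka AdminClient, assuming a broker on localhost:9092 and a hypothetical topic name:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Keep messages for 7 days, consumed or not.
            NewTopic topic = new NewTopic("events", 3, (short) 2)
                    .configs(Collections.singletonMap(
                            "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```
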
174. __________ is one of many possible IAuthorizer implementations, and the one that stores permissions in the system_auth.permissions table to support all authorization-related CQL statements.
(a) CassandraAuth
(b) CassandraAuthorizer
(c) CassAuthorizer
(d) All of the mentioned

Answer: (b) CassandraAuthorizer
Explanation: Configuration consists mainly of changing the authorizer option in cassandra.yaml to use the CassandraAuthorizer.

175. Avro is said to be the future _______ layer of Hadoop.
(a) RMC
(b) RPC
(c) RDC
(d) All of the mentioned

Answer: (b) RPC
Explanation: When Avro is used in RPC, the client and server exchange schemas in the connection handshake.

176. Thrift resolves possible conflicts through _________ of the field.
(a) Name
(b) Static number
(c) UID
(d) None of the mentioned

Answer: (b) Static number
Explanation: Avro, by contrast, resolves possible conflicts through the name of the field.

177. Point out the wrong statement.
(a) HBase provides only sequential access to data
(b) HBase provides high latency batch processing
(c) HBase internally provides serialized access
(d) All of the mentioned

Answer: (c) HBase internally provides serialized access
Explanation: HBase internally uses hash tables and provides random access.

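Random access by row key is the core HBase read path. A minimal sketch with the standard client API, using a hypothetical users table and column family:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomRead {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Random access: fetch a single row directly by key, no sequential scan needed.
            Get get = new Get(Bytes.toBytes("row-42"));
            Result result = table.get(get);
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(email == null ? "not found" : Bytes.toString(email));
        }
    }
}
```
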
178. The Apache Jenkins server runs the ______________ job whenever code is committed to the trunk branch.
(a) “Bigtop-trunk”
(b) “Bigtop”
(c) “Big-trunk”
(d) None of the mentioned

Answer: (a) “Bigtop-trunk”
Explanation: The Jenkins server in turn runs several test jobs.

179. Apache ________ is a lightweight server for ActivityStreams.
(a) Sirona
(b) Taverna
(c) Slider
(d) Streams

Answer: (d) Streams
Explanation: Taverna, by contrast, is a domain-independent suite of tools used to design and execute data-driven workflows.

180. Point out the wrong statement.
(a) HiveServer2 has a new JDBC driver
(b) CSV and TSV output formats are maintained for forward compatibility
(c) HiveServer2 supports both embedded and remote access to HiveServer2
(d) None of the mentioned

Answer: (b) CSV and TSV output formats are maintained for forward compatibility
Explanation: CSV and TSV output formats are maintained for backward compatibility, not forward compatibility.

181. Which of the following is used to set the transaction isolation level?
(a) --incremental=[true/false]
(b) --isolation=LEVEL
(c) --force=[true/false]
(d) --truncateTable=[true/false]

Answer: (b) --isolation=LEVEL
Explanation: This sets the transaction isolation level to TRANSACTION_READ_COMMITTED or TRANSACTION_SERIALIZABLE.

182. Point out the correct statement.
(a) --helpusage displays a usage message
(b) The JDBC connection URL format has the prefix jdbc:hive:
(c) Starting with Hive 0.14, there are improved SV output formats
(d) None of the mentioned

Answer: (c) Starting with Hive 0.14, there are improved SV output formats
Explanation: The available output formats are DSV, CSV2, and TSV2.

183. Hive-specific commands can be run from Beeline when the Hive _______ driver is used.
(a) ODBC
(b) JDBC
(c) ODBC-JDBC
(d) All of the mentioned

Answer: (b) JDBC
Explanation: Hive-specific commands are the same as Hive CLI commands.

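Because Beeline rides on the HiveServer2 JDBC driver, any JDBC client can issue the same commands. A sketch, assuming HiveServer2 on localhost:10000 (note the jdbc:hive2:// prefix, unlike the wrong option in question 182):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // The HiveServer2 JDBC URL uses the jdbc:hive2:// prefix.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("set hive.exec.parallel=true");  // a Hive-specific command, as in the CLI
            try (ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```
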
184. To force Hive to be more verbose, it can be started with ___________
(a) hive --hiveconf hive.root.logger=INFO,console
(b) hive --hiveconf hive.subroot.logger=INFO,console
(c) hive --hiveconf hive.root.logger=INFOVALUE,console
(d) All of the mentioned

Answer: (a) hive --hiveconf hive.root.logger=INFO,console
Explanation: This will emit orders of magnitude more information to the console and will likely include any information the AvroSerde is trying to give you about what went wrong.

185. _______ supports a new command shell, Beeline, that works with HiveServer2.
(a) HiveServer2
(b) HiveServer3
(c) HiveServer4
(d) None of the mentioned

Answer: (a) HiveServer2
Explanation: The Beeline shell works in both embedded mode and remote mode.

186. The need for data replication can arise in various scenarios like ____________
(a) Replication Factor is changed
(b) DataNode goes down
(c) Data Blocks get corrupted
(d) All of the mentioned

Answer: (d) All of the mentioned
Explanation: Data is replicated across different DataNodes to ensure a high degree of fault tolerance.

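Changing the replication factor of an existing file is one such scenario. A sketch with the FileSystem API (path and factor are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Raise the replication factor for one file; the NameNode schedules
        // re-replication of its blocks across DataNodes.
        fs.setReplication(new Path("/data/important.log"), (short) 5);
    }
}
```
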
187. Which of the following scenarios may not be a good fit for HDFS?
(a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
(b) HDFS is suitable for storing data related to applications requiring low latency data access
(c) HDFS is suitable for storing data related to applications requiring low latency data access
(d) None of the mentioned

Answer: (a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
Explanation: HDFS is, however, well suited for storing archive data, since it allows storing the data on low-cost commodity hardware while ensuring a high degree of fault tolerance.

188. HDFS works in a __________ fashion.
(a) master-worker
(b) master-slave
(c) worker/slave
(d) all of the mentioned

Answer: (a) master-worker
Explanation: The NameNode serves as the master and each DataNode serves as a worker/slave.

189. Point out the wrong statement.
(a) Replication Factor can be configured at a cluster level (default is set to 3) and also at a file level
(b) Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode
(c) User data is stored on the local file system of DataNodes
(d) DataNode is aware of the files to which the blocks stored on it belong to

Answer: (d) DataNode is aware of the files to which the blocks stored on it belong to
Explanation: It is the NameNode, not the DataNode, that is aware of the files to which the blocks belong.

190. __________ mode is a NameNode state in which it does not accept changes to the name space.
(a) Recover
(b) Safe
(c) Rollback
(d) None of the mentioned

Answer: (b) Safe
Explanation: In safe mode, the NameNode accepts no changes to the namespace; safe mode can be entered, checked, or left with the hdfs dfsadmin -safemode command.

191. Point out the wrong statement.
(a) classNAME displays the class name needed to get the Hadoop jar
(b) balancer runs a cluster balancing utility
(c) An administrator can simply press Ctrl-C to stop the rebalancing process
(d) None of the mentioned

Answer: (a) classNAME displays the class name needed to get the Hadoop jar
Explanation: classpath prints the class path needed to get the Hadoop jar and the required libraries.

192. ________ NameNode is used when the Primary NameNode goes down.
(a) Rack
(b) Data
(c) Secondary
(d) None of the mentioned

Answer: (c) Secondary
Explanation: The Secondary NameNode is used to improve the availability and reliability of the cluster.

193. On the write side, it is expected that the user pass in valid _________ with correctly typed data.
(a) HRecords
(b) HCatRecos
(c) HCatRecords
(d) None of the mentioned

Answer: (c) HCatRecords
Explanation: In some cases where a user of HCatalog (such as some older versions of Pig) does not support all the data types supported by Hive, a few config parameters are provided to handle data promotions/conversions, allowing such clients to read data through HCatalog.

194. Point out the correct statement.
(a) The framework groups Reducer inputs by keys
(b) The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged
(c) Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values
(d) All of the mentioned

Answer: (d) All of the mentioned
Explanation: If the equivalence rules for keys while grouping the intermediates are different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class), as sketched below.

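A sketch of the old-API knobs mentioned in (c) and the explanation; MyKeyComparator and MyGroupingComparator are hypothetical RawComparator implementations:

```java
import org.apache.hadoop.mapred.JobConf;

public class SecondarySortSetup {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Controls how intermediate keys are sorted before reduction.
        conf.setOutputKeyComparatorClass(MyKeyComparator.class);
        // Controls which keys are grouped into a single reduce call,
        // simulating a secondary sort on values.
        conf.setOutputValueGroupingComparator(MyGroupingComparator.class);
    }
}
```
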
195. In order to read any file in HDFS, an instance of __________ is required.
(a) filesystem
(b) datastream
(c) outstream
(d) inputstream

Answer: (a) filesystem
Explanation: An FSDataInputStream, obtained from the FileSystem instance, is then used to read data from the file.

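Putting it together: obtain a FileSystem instance, then read through the FSDataInputStream it returns (the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());  // the required FileSystem instance
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
            byte[] buffer = new byte[4096];
            int n = in.read(buffer);                          // FSDataInputStream does the reading
            System.out.println("read " + n + " bytes");
        }
    }
}
```
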
196. Point out the correct statement.
(a) All hadoop commands are invoked by the bin/hadoop script
(b) Hadoop has an option parsing framework that employs only parsing generic options
(c) Archive command creates a hadoop archive
(d) All of the mentioned

Answer: (a) All hadoop commands are invoked by the bin/hadoop script
Explanation: Running the hadoop script without any arguments prints the description for all commands.

197. In ___________ mode, the NameNode will interactively prompt you at the command line about possible courses of action you can take to recover your data.
(a) full
(b) partial
(c) recovery
(d) commit

Answer: (c) recovery
Explanation: Because recovery mode can cause you to lose data, you should always back up your edit log and fsimage before using it.

198. Reducer is input the grouped output of a ____________
(a) Mapper
(b) Reducer
(c) Writable
(d) Readable

Answer: (a) Mapper
Explanation: In the shuffle phase, the framework fetches the relevant partition of the output of all the Mappers for each Reducer, via HTTP.

199. Point out the wrong statement.
(a) Version 1.4.0 is the fourth Flume release as an Apache top-level project
(b) Apache Flume 1.5.2 is a security and maintenance release that disables SSLv3 on all components in Flume that support SSL/TLS
(c) Flume is backwards-compatible with previous versions of the Flume 1.x codeline
(d) None of the mentioned

Answer: (d) None of the mentioned
Explanation: Apache Flume 1.3.1 is a maintenance release for the 1.3.0 release, and includes several bug fixes and performance enhancements.

200. A number of ____________ source adapters give you the granular control to grab a specific file.
(a) multimedia file
(b) text file
(c) image file
(d) none of the mentioned

Answer: (b) text file
Explanation: A number of predefined source adapters are built into Flume.