InterviewSolution
This section includes InterviewSolutions, each offering curated interview questions with answers to sharpen your knowledge and support exam preparation. Choose a topic below to get started.
1. Mention the consequences of Distributed Applications.

Answer» Building an application as a set of distributed components has some well-known consequences: the components run concurrently, there is no global clock to order events across machines, and individual components can fail independently of one another. Applications therefore have to cope with concurrency, ordering, and partial-failure issues that do not arise in a single-process program.
2. Explain the architecture of Flume.

Answer» In general, the Apache Flume architecture is composed of the following components:

- Flume Source: receives events from an external data generator such as a web server.
- Flume Channel: a transient store that buffers events between the source and the sink.
- Flume Sink: takes events from the channel and delivers them to the centralized store, for example HDFS.
- Flume Agent: the JVM process that hosts a source, a channel, and a sink.
- Flume Event: the basic unit of data transported through the agent.
3. What is Apache Flume in Hadoop?

Answer» Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store. Flume is a very reliable, distributed, and configurable tool. It is generally designed to copy streaming data (log data) from various web servers to HDFS.
4. What is the default file format to import data using Apache Sqoop?

Answer» There are basically two file formats in which Sqoop allows data to be imported:

- Delimited text file format: the default format, which can also be requested explicitly with the --as-textfile argument.
- SequenceFile format: a binary format, requested with the --as-sequencefile argument.
5. Where is table data stored in Apache Hive by default?

Answer» By default, table data in Apache Hive is stored under the HDFS warehouse directory hdfs://namenode_server/user/hive/warehouse, which is controlled by the hive.metastore.warehouse.dir property.
6. If the source data gets updated every now and then, how will you synchronize the data in HDFS that is imported by Sqoop?

Answer» If the source data gets updated at short intervals, the data imported into HDFS by Sqoop is synchronized with the help of incremental imports. We use incremental import in append mode when the table is continuously refreshed with new rows: Sqoop examines the values of a chosen column and inserts only the rows whose value is greater than the last imported value. Similarly, in lastmodified mode the source has a date column that is examined, and all records modified after the last import (based on that column) are brought over so the updated values land in HDFS. A programmatic sketch follows below.
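A minimal sketch of an incremental append import driven from Java via Sqoop's programmatic entry point; the JDBC URL, table, check column, and last value are placeholder assumptions, and the same arguments can be passed to the sqoop CLI directly:

```java
import org.apache.sqoop.Sqoop;

public class IncrementalImportExample {
    public static void main(String[] args) {
        // Hypothetical connection details; replace with your own database and table.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",
            "--username", "etl_user",
            "--table", "orders",
            "--target-dir", "/data/orders",
            "--incremental", "append",     // only fetch rows newer than the last run
            "--check-column", "order_id",  // monotonically increasing column to compare
            "--last-value", "100000"       // highest value imported by the previous run
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.out.println("Sqoop exit code: " + exitCode);
    }
}
```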
7. How do you differentiate an inner bag and an outer bag in Pig?

Answer» An outer bag is simply a Pig relation, i.e. a bag of tuples, such as the result of a LOAD or GROUP statement. An inner bag is a bag that is nested inside a tuple of another bag, for example the bag of grouped records that appears inside each tuple produced by a GROUP operation.
8. Why do we need to perform partitioning in Hive?

Answer» Apache Hive organizes tables into partitions. Partitioning is the manner in which a table is split into related parts depending on the values of particular columns, such as date, city, or department. Every table in Hive can have one or more partition keys to identify a distinct partition. With the help of partitions, it is effortless to run queries on slices of the data, because only the relevant partition directories need to be read. A small sketch follows below.
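As an illustration (not part of the original answer), a minimal sketch that creates and queries a date-partitioned table through the Hive JDBC driver; the HiveServer2 URL, table, and column names are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HivePartitionExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint; adjust host, port, and database for your cluster.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // Each distinct dt value becomes its own directory under the table location.
            stmt.execute("CREATE TABLE IF NOT EXISTS orders (id BIGINT, amount DOUBLE) "
                       + "PARTITIONED BY (dt STRING)");

            // Restricting the query to one partition only scans that slice of the data.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT count(*) FROM orders WHERE dt = '2024-01-01'")) {
                while (rs.next()) {
                    System.out.println("rows in partition: " + rs.getLong(1));
                }
            }
        }
    }
}
```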
9. Are multiline comments supported in Hive? Why?

Answer» No. As of now, multiline comments are not supported in Hive; only single-line comments (starting with --) are supported.
10. Compare the differences between a Local Metastore and a Remote Metastore.

Answer» In the local metastore configuration, the metastore service runs in the same JVM as the Hive service and connects to a database (for example MySQL) that runs in a separate process, on the same machine or on a remote machine. In the remote metastore configuration, the metastore service runs in its own separate JVM, and Hive and other clients connect to it over the network through the Thrift API.
11. Explain a metastore in Hive.

Answer» The metastore is used to store the metadata information. It typically uses an RDBMS together with an open-source ORM layer that converts the object representation into a relational schema. It is the central repository of Apache Hive metadata: it stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and it gives clients access to this information through the metastore service API. Disk storage for the Hive metadata is separate from the HDFS storage that holds the table data itself.
12. What applications are supported by Apache Hive?

Answer» Client applications written in languages such as Java, PHP, Python, Ruby, and C++ are supported by Apache Hive, connecting through the Thrift, JDBC, and ODBC interfaces it exposes.
13. Give a brief on how Spark is good at low latency workloads like graph processing and Machine Learning.

Answer» Apache Spark stores data in memory for faster processing. Developing a machine learning model may require running an algorithm over the same data for many iterations and several intermediate steps before an optimized model is reached, and graph algorithms likewise traverse all the nodes and edges repeatedly. Because the working dataset can be cached in memory instead of being re-read from disk on every pass, these iteration-heavy, low-latency workloads see a large performance improvement. A brief sketch follows below.
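For example, a minimal sketch (assuming a local Spark setup and made-up data) of an iterative computation that benefits from caching the RDD in memory:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeCachingExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Toy dataset standing in for training data; cache() keeps it in memory
            // so every iteration below reuses it without re-reading or recomputing it.
            JavaRDD<Double> points = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0)).cache();

            double weight = 0.0;
            for (int i = 0; i < 10; i++) {
                final double w = weight;
                // Each pass over the cached RDD is served from memory.
                double gradient = points.map(p -> (w * p - p) * p).reduce(Double::sum);
                weight -= 0.01 * gradient;
            }
            System.out.println("final weight: " + weight);
        }
    }
}
```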
14. Explain the Resilient Distributed Datasets in Spark.

Answer» A Resilient Distributed Dataset (RDD) is the basic data structure of Apache Spark, provided by Spark Core. RDDs are immutable and fault-tolerant. They are generated either by transforming already existing RDDs or by loading an external dataset from reliable storage such as HDFS or HBase. Since they are distributed collections of objects, they can be operated on in parallel: an RDD is divided into partitions so that its pieces can be processed on different nodes of a cluster. A short sketch follows below.
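A minimal sketch (again assuming a local Spark context and toy data) showing that transformations produce new RDDs rather than modifying the original, and that the data is split into partitions:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasicsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Create an RDD from an in-memory collection, split into 4 partitions.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);

            // Transformations return new RDDs; 'numbers' itself is never modified.
            JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);
            JavaRDD<Integer> squares = evens.map(n -> n * n);

            System.out.println("partitions: " + numbers.getNumPartitions());
            System.out.println("squares of even numbers: " + squares.collect());
        }
    }
}
```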
15. What are the basic parameters of a mapper?

Answer» The primary parameters of a mapper are LongWritable and Text, followed by Text and IntWritable. The first two represent the input key/value parameters, and the other two signify the intermediate output key/value parameters. A word-count style mapper using these types is sketched below.
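A minimal word-count mapper (a standard illustration, not taken from the original answer) showing where those four types appear:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value: LongWritable (byte offset of the line) and Text (the line itself).
// Output key/value: Text (a word) and IntWritable (the count 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit (word, 1) as the intermediate output
        }
    }
}
```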
16. List the actions that happen when a DataNode fails.

Answer» When a DataNode fails:

- The NameNode stops receiving heartbeats from that DataNode and, after the timeout, marks it as dead.
- The NameNode identifies the blocks that were stored on the failed node and schedules them to be re-replicated to other DataNodes from the surviving replicas, so the configured replication factor is restored.
- Any tasks that were running against data on the failed node are rescheduled on other nodes.
17. Explain the distributed Cache in MapReduce framework.

Answer» Distributed Cache is a significant feature provided by the MapReduce framework, used when you want to share files across all nodes in a Hadoop cluster. These files can be jar files or simple properties files. Hadoop's MapReduce framework provides the facility to cache small to moderate read-only files such as text files, zip files, jar files, etc., and distribute them to all the DataNodes (worker nodes) where MapReduce jobs are running. Every DataNode gets a local copy of the file, which is sent by the Distributed Cache. A short sketch follows below.
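A minimal sketch (the file paths and names are assumptions) of registering a cache file on the job and reading its local copy inside a mapper's setup method:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheExample {

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void setup(Context context) throws IOException {
            // Each task node receives a local copy of the cached file; the "#lookup"
            // fragment used below makes it available under that name in the working directory.
            try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // load the small read-only lookup data into memory here
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-cache-demo");
        job.setJarByClass(DistributedCacheExample.class);
        job.setMapperClass(LookupMapper.class);
        // Ship a small HDFS file to every node that runs this job's tasks.
        job.addCacheFile(new URI("/data/reference/lookup.txt#lookup"));
        // ... set input/output paths and formats, then submit the job ...
    }
}
```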
18. Explain the actions followed by a JobTracker in Hadoop.

Answer» When a client submits a job, the JobTracker:

- Accepts the job from the client application.
- Talks to the NameNode to determine where the blocks of the input data are located.
- Locates TaskTracker nodes with available slots, preferably close to the data, and assigns the map and reduce tasks to them.
- Monitors the TaskTrackers through periodic heartbeats; if a TaskTracker fails or times out, its tasks are rescheduled on other nodes.
- Updates the job status once the work is complete and reports it back to the client.
19. Explain the purpose of the dfsadmin tool.

Answer» The dfsadmin tools are a specific set of tools designed to help you root out information about your Hadoop Distributed File System (HDFS), for example with hdfs dfsadmin -report. As a bonus, you can use them to perform some administration operations on HDFS as well, such as entering and leaving safemode.
20. Which command is used to find the status of the blocks and file-system health?

Answer» The command used to find the status of the blocks and the health of the file system is:

hdfs fsck <path> -files -blocks
21. Where are the two types of metadata that the NameNode server stores?

Answer» The two types of metadata that the NameNode server stores are kept on disk and in RAM. On disk it keeps the FsImage and the EditLog; in RAM it keeps the live, in-memory image of the file system namespace that it serves to clients. For example, once a file is deleted from HDFS, the NameNode immediately records this change in the EditLog.
22. How can you skip the bad records in Hadoop?

Answer» Hadoop provides an option where a particular set of bad input records can be skipped when processing map inputs. Applications can manage this feature through the SkipBadRecords class. A short sketch follows below.
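A minimal sketch using the older org.apache.hadoop.mapred API, where the SkipBadRecords helpers live; the thresholds shown are arbitrary assumptions:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipBadRecordsExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Begin skipping mode after the same task attempt has failed twice.
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);

        // Allow up to 10 bad records around a failure to be skipped on the map side.
        SkipBadRecords.setMapperMaxSkipRecords(conf, 10);

        // ... configure mapper/reducer and input/output paths, then submit the job ...
    }
}
```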
23. What is the default replication factor?

Answer» By default, the replication factor is 3. No two copies of a block will be placed on the same DataNode. Usually, two of the copies end up on the same rack, and the third copy is placed on a different rack. It is advised to keep the replication factor at least three so that one copy is always safe, even if something happens to an entire rack. A small sketch follows below.
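For reference, a minimal sketch (the file path is a placeholder) of checking and changing the replication factor of an existing HDFS file through the Java FileSystem API; the cluster-wide default comes from the dfs.replication property:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default replication factor (3 unless overridden in hdfs-site.xml).
        System.out.println("default replication: " + conf.getInt("dfs.replication", 3));

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/orders/part-00000"); // placeholder path
            FileStatus status = fs.getFileStatus(file);
            System.out.println("current replication: " + status.getReplication());

            // Raise the replication factor of this one file to 3.
            fs.setReplication(file, (short) 3);
        }
    }
}
```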
24. Why are blocks in HDFS huge?

Answer» By default, the size of an HDFS data block is 128 MB. The ideas behind the large size of blocks are:

- To minimize the cost of seeks: with large blocks, the time spent transferring data from disk dominates the time spent seeking to the start of the block, so a large file can be read at close to the disk transfer rate.
- To reduce metadata overhead on the NameNode: fewer, larger blocks mean fewer block entries for the NameNode to keep in memory.

A small sketch of inspecting block sizes follows below.
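A minimal sketch (the file path is a placeholder) of reading the configured default block size and the block size that an existing file was written with:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/orders/part-00000"); // placeholder path

            // Default block size used for newly created files (128 MB unless overridden).
            System.out.println("default block size: " + fs.getDefaultBlockSize(file) + " bytes");

            // Block size the file was actually written with.
            FileStatus status = fs.getFileStatus(file);
            System.out.println("file block size: " + status.getBlockSize() + " bytes");
        }
    }
}
```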