This section presents curated Big Data interview questions with answers to sharpen your knowledge and support exam and interview preparation.
1. Steps for Data preparation.

Answer» Typical steps for data preparation are: gathering the raw data from its sources, discovering and profiling it to understand its structure and quality, cleansing it to remove errors and inconsistencies, structuring and transforming it into the required format, enriching it with additional attributes where useful, validating the result, and finally storing or publishing the prepared data set.
2. What is data preparation?

Answer» Data preparation is the process of cleansing and transforming raw data before it is processed and analyzed. It is a crucial preliminary step that usually involves reformatting data, correcting errors in it, and consolidating data sets to enrich them. Data preparation is an ongoing task for data specialists and business users, but it is essential for putting data into context so that it yields insights and eliminates the bias introduced by poor data quality. For instance, the data preparation process typically includes standardizing data formats, enriching source data, and removing outliers.
3. How do you convert unstructured data to structured data?

Answer» This is an open-ended question and there are many ways to achieve it. Common approaches include programmatically parsing the raw data (for example with regular expressions or dedicated parsers), applying machine learning or natural language processing to extract entities from free text, defining a target schema and mapping the extracted fields onto it with an ETL tool, and using manual tagging or annotation where automation falls short.
4. Explain the Pros and Cons of Big Data?

Answer» Pros of Big Data include better and faster decision-making, deeper insight into customers and markets, cost optimization, improved operational efficiency, and support for developing new products and services. On the other hand, implementing big data analytics is not as easy as it may seem; there are difficulties as well. Cons of Big Data include data quality and consistency problems, privacy and security concerns, the need for specialized and scarce skills, significant infrastructure and storage costs, and the rapid pace at which tools and platforms change.
5. Explain Persistent, Ephemeral and Sequential Znodes.

Answer» Persistent znodes remain in ZooKeeper until they are explicitly deleted, even after the client session that created them ends; they are the default znode type. Ephemeral znodes live only as long as the session of the client that created them is active and are removed automatically when that session ends, which makes them useful for detecting node failures. Sequential znodes are persistent or ephemeral znodes whose names are suffixed with a monotonically increasing counter maintained by ZooKeeper, which is handy for ordering, queues, and leader election.
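To make the three znode types concrete, here is a minimal, hypothetical sketch using the Apache ZooKeeper Java client; the connection string, paths and data are placeholders and not part of the original answer.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeExamples {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; a real client would also wait for the
        // connection to be established before issuing requests.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // Persistent znode: survives until explicitly deleted.
        zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this session closes.
        zk.create("/live-worker-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: ZooKeeper appends an increasing counter to the name,
        // e.g. /task-0000000001.
        String path = zk.create("/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("Created sequential znode: " + path);

        zk.close();
    }
}
```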
6. What is Distcp?

Answer» DistCp (distributed copy) is a tool used for copying very large amounts of data to and from Hadoop file systems in parallel. It uses MapReduce to carry out its distribution, error handling, recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list. For example, a command of the form hadoop distcp hdfs://namenode1/source hdfs://namenode2/destination copies the source directory between two clusters in parallel.
7. Explain Outliers.

Answer» Outliers are data points that lie very far from the rest of the group and do not belong to any cluster. They can affect the behavior of a model: predictions may be wrong or the model's accuracy may be very low. Therefore outliers must be handled carefully, because they may also contain useful information. The presence of outliers may mislead a Big Data model or a machine learning model, resulting in poor model fit, skewed statistics such as the mean and variance, biased parameter estimates, and degraded predictive accuracy.
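As an illustration only (the data and threshold are invented for this sketch), the following self-contained Java snippet flags values that lie more than a chosen number of standard deviations from the mean, one simple way of spotting outliers:

```java
import java.util.ArrayList;
import java.util.List;

public class OutlierDetector {
    // Returns the values lying more than 'threshold' standard deviations from the mean.
    public static List<Double> findOutliers(double[] values, double threshold) {
        double mean = 0.0;
        for (double v : values) mean += v;
        mean /= values.length;

        double variance = 0.0;
        for (double v : values) variance += (v - mean) * (v - mean);
        double stdDev = Math.sqrt(variance / values.length);

        List<Double> outliers = new ArrayList<>();
        for (double v : values) {
            if (stdDev > 0 && Math.abs(v - mean) / stdDev > threshold) {
                outliers.add(v);
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        double[] data = {10, 12, 11, 13, 12, 11, 250};  // 250 is an obvious outlier
        // With a threshold of 2 standard deviations, only 250.0 is flagged.
        System.out.println(findOutliers(data, 2.0));
    }
}
```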
8. How can you skip bad records in Hadoop?

Answer» Hadoop provides an option whereby a particular set of bad input records can be skipped while processing map inputs. The SkipBadRecords class offers an optional mode of execution in which bad records are detected and skipped after multiple failed attempts. Such failures can occur because of bugs in the map function that the user cannot always fix, for example when the bug lies in a third-party library. With this feature only a small amount of data is lost, which is usually acceptable when dealing with very large data sets.
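For illustration, here is a minimal configuration sketch using the older org.apache.hadoop.mapred API and its SkipBadRecords helper; the thresholds are arbitrary example values, not recommendations:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkippingModeSetup {
    // Configures skipping mode on a JobConf-based (old API) MapReduce job.
    public static void enableRecordSkipping(JobConf conf) {
        // Switch to skipping mode after two failed attempts of a task (example value).
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        // Tolerate up to 10 bad records around each failure on the map side (example value).
        SkipBadRecords.setMapperMaxSkipRecords(conf, 10);
        // Tolerate up to 10 bad key groups on the reduce side (example value).
        SkipBadRecords.setReducerMaxSkipGroups(conf, 10);
    }
}
```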
9. Mention the main configuration parameters that have to be specified by the user to run MapReduce.

Answer» The chief configuration parameters that the user of the MapReduce framework needs to specify are: the input location of the job in HDFS, the output location of the job in HDFS, the input and output formats, the classes containing the map and reduce functions, and the JAR file containing the mapper, reducer and driver classes.
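The sketch below shows where those parameters are typically set in a driver class using the org.apache.hadoop.mapreduce API; the job name, the placeholder paths, and the choice of the library TokenCounterMapper and IntSumReducer classes are illustrative, not part of the original answer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class JobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example-job");

        // JAR containing the mapper, reducer and driver classes.
        job.setJarByClass(JobDriver.class);

        // Classes containing the map and reduce functions
        // (library classes are used here to keep the sketch self-contained).
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);

        // Input and output formats.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Key/value types of the job's output.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS (placeholder paths).
        FileInputFormat.addInputPath(job, new Path("/user/example/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/example/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```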
10. What are the things to consider when using distributed cache in Hadoop MapReduce?

Answer» A few points are worth keeping in mind: the files to be cached should already be available in HDFS (or another shared file system) before the job is submitted; cached files are treated as read-only and should not be modified while the job is running, otherwise tasks may see inconsistent data; the cache is intended for small to moderate-sized files such as lookup tables, properties files and JARs, because every node copies them locally and very large files defeat the purpose; and the local cache space on each node is bounded, so unused files should be cleaned up rather than accumulated.
11. What are missing values in Big Data? And how do you deal with them?

Answer» Missing values in Big Data refer to values that are absent from a column where one would be expected; in the worst case they lead to erroneous data and incorrect results. Several techniques are used to deal with missing values: dropping the affected rows or columns when the proportion of missing data is small, imputing numerical values with the mean or median and categorical values with the mode, predicting the missing values with a model trained on the other columns, carrying the last observed value forward in time-series data, or using algorithms that handle missing values natively.
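As a simple, generic illustration (the column values are invented), the Java sketch below replaces missing entries, encoded here as Double.NaN, with the mean of the observed values in a column:

```java
public class MeanImputer {
    // Replaces NaN entries with the mean of the non-missing values.
    public static double[] imputeWithMean(double[] column) {
        double sum = 0.0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        double mean = count > 0 ? sum / count : 0.0;

        double[] result = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            result[i] = Double.isNaN(column[i]) ? mean : column[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] ages = {25, Double.NaN, 31, 40, Double.NaN};
        // Prints [25.0, 32.0, 31.0, 40.0, 32.0]: the NaNs are replaced by the mean 32.0.
        System.out.println(java.util.Arrays.toString(imputeWithMean(ages)));
    }
}
```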
12. What is the use of the -compress-codec parameter?

Answer» The -compress-codec parameter is generally used to get the output files of a Sqoop import in a compression format other than the default .gz, by specifying which Hadoop compression codec should be applied.
13. How can you restart NameNode and all the daemons in Hadoop?

Answer» The following commands will help you restart the NameNode and all the daemons: you can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again with the ./sbin/hadoop-daemon.sh start namenode command. You can stop all the daemons with the ./sbin/stop-all.sh command and then start them again with the ./sbin/start-all.sh command.
14. Explain Feature Selection.

Answer» During processing, Big Data may contain a large amount of data that is not required at a particular time, so we may need to select only the specific features we are interested in. The process of extracting only the needed features from Big Data is called feature selection. Feature selection methods fall into three broad groups: filter methods, which rank features with statistical measures such as correlation, the chi-square test or information gain before any model is trained; wrapper methods, which search over feature subsets using the model's own performance, as in forward selection, backward elimination and recursive feature elimination; and embedded methods, which perform selection as part of model training, for example LASSO regularization or the feature importances of tree-based models.
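For illustration (not part of the original answer), here is a small, generic Java sketch of a filter-style method that ranks candidate features by the absolute Pearson correlation between each feature column and the target:

```java
public class CorrelationRanker {
    // Pearson correlation between two equally sized vectors.
    static double correlation(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n; meanY /= n;

        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            cov  += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
            varY += (y[i] - meanY) * (y[i] - meanY);
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Two candidate features (columns) and a target variable (toy data).
        double[] feature1 = {1, 2, 3, 4, 5};
        double[] feature2 = {5, 1, 4, 2, 3};
        double[] target   = {2, 4, 6, 8, 10};

        // Feature 1 correlates perfectly with the target, feature 2 barely does,
        // so a filter method would rank feature 1 first.
        System.out.printf("feature1 score = %.2f%n", Math.abs(correlation(feature1, target)));
        System.out.printf("feature2 score = %.2f%n", Math.abs(correlation(feature2, target)));
    }
}
```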
15. What is partitioning in Hive?

Answer» Partitioning in Hive is a logical division of a table into parts based on the values of partition columns such as date, city or department. These partitions can be further subdivided into buckets, giving the data extra structure that can be used for more efficient querying. As an example, consider a table named Table1 that holds client details such as id, name, dept and year of joining, and suppose we need the details of all clients who joined in 2014. Without partitioning, the query has to scan the whole table for the required data; but if the client data is partitioned by year and stored in separate files (for instance by declaring the table with a PARTITIONED BY clause on the year column), only the relevant partition is read, which reduces query processing time.
16. Write the command used to copy data from the local system onto HDFS.

Answer» The command used for copying data from the local system to HDFS is hadoop fs -copyFromLocal <local-source-path> <hdfs-destination-path>, or equivalently hadoop fs -put <local-source-path> <hdfs-destination-path>.
17. Mention features of Apache Sqoop.

Answer» Notable features of Apache Sqoop include: parallel import and export of data between Hadoop and relational databases; support for full loads as well as incremental loads; the ability to import the results of an arbitrary SQL query; compression of the imported data; connectors for all major RDBMS vendors; integration with Kerberos security authentication; and the ability to load data directly into Hive or HBase.
18. What is the default replication factor in HDFS?

Answer» By default, the replication factor is 3, and no two copies of a block are placed on the same data node. Typically two of the copies end up on one rack and the third on a different rack. It is advisable to keep the replication factor at three or higher so that one copy stays safe even if something happens to an entire rack. The replication factor can be set as a default for the whole file system and also for each file and directory individually; it can be lowered for files that are not essential, while critical files should keep a high replication factor.
19. What is a Zookeeper? What are the benefits of using a Zookeeper?

Answer» ZooKeeper is a centralized coordination service for distributed applications. Hadoop's most effective technique for addressing big data challenges is its ability to divide and conquer, and ZooKeeper supports that approach: after the problem has been divided, the conquering relies on distributed and parallel processing across the Hadoop cluster. Benefits of using ZooKeeper include: a simple, shared hierarchical namespace that distributed processes can use for coordination; reliable synchronization and locking between nodes; configuration management and naming services for the cluster; support for leader election and cluster membership management; ordered, atomic updates; and high availability through replication of the ZooKeeper service itself.
20. Explain overfitting in big data. How can it be avoided?

Answer» Overfitting is a modeling error that occurs when a model is fitted too tightly to the data, i.e. when a modeling function is closely tuned to a limited data set. Overfitting reduces the predictive power of such models: their ability to generalize decreases, so they fail when applied outside the sample data. Several methods help to avoid overfitting, including cross-validation (evaluating the model on held-out folds of the data), training on more data, removing irrelevant features, regularization, early stopping of training, and ensembling techniques such as bagging and boosting.
21. Explain the Distributed Cache in the MapReduce framework.

Answer» Distributed Cache is a significant feature of the MapReduce framework, used when you want to share files across all nodes in a Hadoop cluster. These files can be JAR files or simple properties files. Hadoop's MapReduce framework can cache small to moderate-sized read-only files such as text files, zip files and JAR files and distribute them to all the DataNodes (worker nodes) on which MapReduce tasks are running. Each DataNode gets a local copy of the file sent through the Distributed Cache.
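A minimal sketch of how this is typically used with the org.apache.hadoop.mapreduce API follows; the HDFS path, symlink name and record format are placeholders:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    // In the driver: register the file; the "#lookup" fragment gives it a local symlink name.
    public static void addLookupFile(Job job) throws Exception {
        job.addCacheFile(new URI("/user/example/lookup.txt#lookup"));
    }

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Set<String> lookup = new HashSet<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The cached file is available in the task's working directory under its symlink name.
            try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lookup.add(line.trim());
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Only emit records whose first field appears in the cached lookup set.
            String field = value.toString().split(",")[0];
            if (lookup.contains(field)) {
                context.write(new Text(field), value);
            }
        }
    }
}
```

The "#lookup" fragment makes the cached file appear under a stable local name in each task's working directory, so the mapper does not need to know the original HDFS path.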
22. Mention the core methods of Reducer.

Answer» The core methods of a Reducer are: setup(), which is called once at the start of the task and is used to read configuration and initialize resources; reduce(), which is called once per key with the associated list of values and is the heart of the reducer; and cleanup(), which is called once at the end of the task to release resources and clean up temporary state.
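A minimal Java skeleton showing the three methods, here in a reducer that sums integer counts per key (the type parameters and logic are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // Called once before any keys are processed: read configuration, open resources, etc.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key: aggregate all values that share this key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Called once after all keys are processed: release resources, flush state, etc.
    }
}
```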
23. When to use MapReduce with Big Data?

Answer» MapReduce is a programming model created for distributed computation on big data sets in parallel. A MapReduce job has a map function that performs filtering and sorting, and a reduce function that serves as a summary operation. MapReduce is an important part of the Apache Hadoop open-source ecosystem and is widely used for querying and selecting data in the Hadoop Distributed File System (HDFS). A variety of queries can be expressed with the broad range of MapReduce algorithms available for creating data selections. MapReduce is also suited to iterative computation on large quantities of data that require parallel processing, because it represents a data flow rather than a procedure. The more data we produce and accumulate, the greater the need to process all of it to make it usable, and MapReduce's parallel processing model is a good tool for making sense of big data.
24. What is MapReduce in Hadoop?

Answer» Hadoop MapReduce is a software framework for processing enormous data sets. It is the main data-processing component of the Hadoop framework: it divides the input data into several parts and runs a program on each part in parallel. The word MapReduce refers to two separate tasks. The first is the map operation, which takes a set of data and transforms it into another collection of data in which individual elements are broken down into key-value tuples. The reduce operation then consolidates those tuples based on the key and modifies the value associated with each key accordingly.
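The classic word-count example is a convenient way to see the two phases; the sketch below is illustrative, with conventional class and type names rather than anything taken from the original answer:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: turn each input line into (word, 1) tuples.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: consolidate the tuples for each word into a total count.
    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```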
25. What are the different big data processing techniques?

Answer» Big Data processing methods analyze data sets at massive scale. Offline batch processing typically runs at full power and full scale and can tackle arbitrary BI scenarios, whereas real-time stream processing is performed on the most recent slice of data for tasks such as data profiling, picking out outliers, fraud detection and security monitoring. The most challenging task is fast, ad-hoc analytics over a large, comprehensive data set, which essentially means scanning tons of data within seconds; that is only possible when the data is processed with a high degree of parallelism. The main techniques of Big Data processing are: batch processing (for example MapReduce), real-time or stream processing (for example Spark Streaming, Storm or Flink), interactive and ad-hoc query processing (for example Hive, Impala or Presto), and in-memory processing (for example Apache Spark).