InterviewSolution
1. Explain data preparation in Big Data?
Answer» Data preparation involves collecting, combining, organizing and structuring data so that it can be analyzed for patterns, trends, and insights. Big Data needs to be preprocessed, cleansed, validated and transformed. For this, the required data is pulled in from different sources, internal or external. One of the major focuses of data preparation is ensuring that the data under consideration for analysis is consistent and accurate, because only accurate data will produce valid results. When data is first collected, it is usually not complete: it may have missing values, outliers, etc. Data preparation is therefore a major and very important activity in any Big Data project. Most of the time, the data resides in silos, in different databases and in different formats, so it needs to be reconciled. There are five D's associated with the process of data preparation.
Much of the process of data preparation can be automated. Various machine learning algorithms can be used for data preparation tasks such as filling missing values, renaming fields, ensuring consistency, removing redundancy, etc. Several terms are associated with data preparation, such as data cleansing, transforming variables, removing outliers, data curation, data enrichment, data structuring and modeling, etc. These terms actually name the various activities that are carried out as part of data preparation. The time spent on data preparation is generally more than the time required for data analysis. Though the methods used for data preparation are automated, preparing the data still takes a lot of time because the volume of data is very large and tends to grow continuously.
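To make these activities concrete, here is a minimal sketch in plain Python of a few of the steps named in the answer: reconciling field names across two sources, ensuring consistency, removing redundancy, dropping an implausible value and filling missing values. Everything in it is an assumption made for illustration: the sample records, the field names, the 0 to 120 age range and the choice of the median as the fill value. A real Big Data project would typically run such steps in a distributed framework rather than on in-memory lists.

```python
from statistics import median

# Hypothetical records from two sources that use different field names.
crm_rows = [
    {"cust_id": 1, "Age": 34, "city": "Pune"},
    {"cust_id": 2, "Age": None, "city": "delhi "},
    {"cust_id": 2, "Age": None, "city": "delhi "},  # redundant copy
]
web_rows = [
    {"customer": 3, "age": 29, "city": "Mumbai"},
    {"customer": 4, "age": 410, "city": "PUNE"},    # implausible age
    {"customer": 5, "age": 41, "city": "Delhi"},
]

# Assumed mapping from source-specific field names to one common schema.
RENAMES = {"cust_id": "customer_id", "customer": "customer_id", "Age": "age"}
# Assumed plausibility range used to validate ages.
AGE_MIN, AGE_MAX = 0, 120


def rename_fields(row):
    """Reconcile field names so every source shares the same schema."""
    return {RENAMES.get(key, key): value for key, value in row.items()}


def normalize(row):
    """Ensure consistency: trim whitespace and unify capitalization."""
    out = dict(row)
    out["city"] = out["city"].strip().title()
    return out


def deduplicate(rows):
    """Remove redundancy: keep the first copy of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def prepare(*sources):
    """Combine the sources, then cleanse, validate and fill the records."""
    rows = [normalize(rename_fields(r)) for src in sources for r in src]
    rows = deduplicate(rows)
    # Validate: drop records whose age lies outside the plausible range.
    rows = [r for r in rows
            if r["age"] is None or AGE_MIN <= r["age"] <= AGE_MAX]
    # Fill missing ages with the median of the remaining known ages.
    fill = median(r["age"] for r in rows if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]}
            for r in rows]


cleaned = prepare(crm_rows, web_rows)
for record in cleaned:
    print(record)
```

With this sample data, four records remain: the duplicate of customer 2 and the record with age 410 are dropped, and the missing age is filled with the median of the known ages. The range check stands in for outlier removal only because the sample is tiny; on real data a statistical rule or a learned model may be more appropriate.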