InterviewSolution

1. What do you mean by Data Cleansing in Big Data?
Answer» Data cleansing is the process of identifying which of your data records or entries are incomplete, inaccurate, incorrect, or irrelevant. In other words, data cleansing means identifying the inaccuracies and redundancies in a dataset. The data we collect usually comes with certain issues, and these issues must be resolved or rectified before we can apply any kind of processing or analysis to the data. If the data remains unclean, it will give wrong insights; to get good results, the input data must also be good. This is why data cleansing is a very important and necessary step in any Big Data project. Without cleansing the data, you should not proceed further; otherwise, you may end up with incorrect information. The issues that an input dataset may contain include the following (a small detection sketch follows this list):

- Incomplete records (missing values)
- Inaccurate or incorrect values
- Irrelevant entries
- Redundant (duplicate) records
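As a rough illustration, the sketch below shows how two of these issues (incomplete and duplicate records) might be detected with pandas. The DataFrame and its column names are hypothetical and used only for demonstration.

```python
import pandas as pd

# Hypothetical sample dataset; the column names are illustrative only.
df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Bob"],
    "age":  [34, 29, 41, 29],
})

# Incomplete records: rows with at least one missing value.
incomplete = df[df.isnull().any(axis=1)]
print(incomplete)

# Redundant records: exact duplicate rows.
duplicates = df[df.duplicated()]
print(duplicates)
```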
There are various methods to identify these issues:
- Visualization: take a random sample of the data and inspect whether it is correct or not.
- Outlier analysis: find extreme or odd values that are not expected in a particular feature. For example, in the 'age' column we cannot expect a value like 200 or 350 (see the sketch after this list).
- Validation code: write code that can check whether the data or values under consideration are valid or not (also shown in the sketch below).

Once the issues are identified, we can apply the corresponding methods to correct them. Cleansing Big Data can become a time-consuming and cumbersome process, so it is always suggested to start with a small random sample of the data. Developing the cleansing rules on a small, valid sample will speed up the time required to get the required insights, because it reduces the latency associated with the iterative analysis and exploration of Big Data.
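A minimal sketch of the outlier-analysis and validation ideas, again assuming a hypothetical pandas DataFrame. The 0-120 bounds for 'age' and the rule function `is_valid_age` are illustrative assumptions, not fixed standards.

```python
import pandas as pd

df = pd.DataFrame({"age": [34, 29, 200, 41, -3, 350]})

# Outlier analysis: flag values outside a plausible range for the feature.
# The 0-120 range for 'age' is an illustrative assumption.
outliers = df[(df["age"] < 0) | (df["age"] > 120)]
print(outliers)

# Validation code: a simple rule that reports whether a value is acceptable.
def is_valid_age(value) -> bool:
    """Return True if the value looks like a plausible human age."""
    try:
        return 0 <= float(value) <= 120
    except (TypeError, ValueError):
        return False

df["age_valid"] = df["age"].apply(is_valid_age)
print(df)

# Develop and test the rules on a small random sample first,
# so each iteration over the data stays fast.
sample = df.sample(n=3, random_state=42)
print(sample)
```

Working against the small `sample` first keeps each exploration cycle cheap; once the rules look right, they can be applied to the full dataset.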