Answer» There are different techniques to correct/balance imbalanced data: the number of samples in the minority classes can be increased, or the number of samples in the classes with extremely many data points can be decreased. The following are some approaches used to balance data:
- Use the right evaluation metrics: With imbalanced data, it is very important to use evaluation metrics that provide valuable information:
- Precision: Indicates the fraction of selected instances that are relevant (TP / (TP + FP)). Note that specificity is a different metric: the fraction of actual negatives that are correctly identified (TN / (TN + FP)).
- Sensitivity (recall): Indicates the fraction of relevant instances that are selected (TP / (TP + FN)).
- F1 score: It represents the harmonic mean of precision and sensitivity.
- MCC (Matthews correlation coefficient): It represents the correlation coefficient between observed and predicted binary classifications.
- AUC (Area Under the Curve): The area under the ROC curve, which plots the true positive rate against the false positive rate.
For example, consider a training set in which 99.9% of the samples belong to class "0". If we measure the model purely by its accuracy at predicting "0"s, a model that always outputs "0" achieves a very high accuracy of 99.9%, yet it provides no valuable information about the minority class. In such cases, the evaluation metrics stated above should be applied instead, as the sketch below illustrates.
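A minimal sketch with scikit-learn, assuming a toy 1-in-1000 imbalance chosen to mirror the 99.9% example above (the random scores are purely illustrative): a model that always predicts "0" looks near-perfect on accuracy, while every class-aware metric exposes it.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([0] * 999 + [1])                # heavily imbalanced labels
y_pred = np.zeros_like(y_true)                    # model that always predicts "0"
y_score = np.random.default_rng(0).random(1000)   # uninformative probabilities

print(accuracy_score(y_true, y_pred))                    # 0.999 -- misleading
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0 (sensitivity)
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
print(matthews_corrcoef(y_true, y_pred))                 # 0.0
print(roc_auc_score(y_true, y_score))                    # ~0.5, no skill
```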
- Training Set Resampling: It is also possible to balance the data by constructing a different dataset, and this can be achieved by resampling. There are two resampling approaches, chosen based on the use case and the requirements (see the sketch after this list):
- Under-sampling: This balances the data by reducing the size of the abundant class and is used when the quantity of data is sufficient. A new, balanced dataset is obtained this way and can be used for further modeling.
- Over-sampling: This is used when the quantity of data is not sufficient. It balances the dataset by increasing the number of minority samples: instead of getting rid of extra samples, new samples are generated and introduced through methods such as repetition and bootstrapping.
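A minimal sketch of both resampling approaches using scikit-learn's resample utility; the toy DataFrame, its class ratio (95 vs. 5), and the "label" column name are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(100), "label": [0] * 95 + [1] * 5})
df_majority = df[df["label"] == 0]
df_minority = df[df["label"] == 1]

# Under-sampling: randomly drop majority rows down to the minority count.
df_majority_down = resample(df_majority, replace=False,
                            n_samples=len(df_minority), random_state=42)
under_sampled = pd.concat([df_majority_down, df_minority])
print(under_sampled["label"].value_counts())   # 5 rows of each class

# Over-sampling: bootstrap (sample with replacement) the minority class
# up to the majority count instead of discarding data.
df_minority_up = resample(df_minority, replace=True,
                          n_samples=len(df_majority), random_state=42)
over_sampled = pd.concat([df_majority, df_minority_up])
print(over_sampled["label"].value_counts())    # 95 rows of each class
```

Libraries such as imbalanced-learn also provide synthetic over-samplers (e.g., SMOTE) that generate new minority samples rather than repeating existing ones.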
- Perform K-fold cross-validation correctly: Cross-validation needs to be applied properly when using over-sampling. The data should be split into folds before over-sampling, and the over-sampling applied only to the training folds; if the whole dataset is over-sampled first, duplicated minority samples leak into the validation folds and the model effectively overfits to a specific result. To make results robust, the resampling can be repeated with different ratios, as sketched below.
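A minimal sketch of the correct ordering, assuming synthetic data and a logistic-regression model purely for illustration: the cross-validation split happens first, and bootstrapping is applied only inside each training fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.array([0] * 180 + [1] * 20)   # assumed 9:1 imbalance for illustration

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Over-sample ONLY the training fold; the validation fold stays untouched,
    # so no duplicated minority samples leak into the evaluation.
    minority = np.where(y_tr == 1)[0]
    majority = np.where(y_tr == 0)[0]
    boot = resample(minority, replace=True,
                    n_samples=len(majority), random_state=42)
    idx = np.concatenate([majority, boot])

    model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))

print(np.mean(scores))   # honest estimate, free of resampling leakage
```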