InterviewSolution
| 1. |
You have observed outliers in your dataset. What approaches will you consider to make your model more robust to outliers? |
|
Answer» Response: There are multiple ways to make a model more robust to outliers, from different aspects either from data PREPARATION perspective or from a model-building perspective. An outlier is assumed as being unwanted, unexpected, or a must-be-incorrect value to the human's knowledge so far (e.g. no one can live longer than 150 years of age) rather than a rare EVENT which is possible but rare. Outliers are usually defined as the sample distribution. Hence, outliers could be removed in the pre-processing step (before any learning phase happens), by using standard deviations(sd) such as (Mean +/- 2*sd), it can be used for normality. Otherwise, interquartile ranges from Q1 - Q3, where Q1 - is the "middle" value in the FIRST half of the rank-ordered data set, Q3 - is the "middle" value in the second half of the rank-ordered data set. It can be used for not normal/unknown as threshold levels. Below diagram shows typical outliers encircled with red circles for sample illustration purposes. Image Ref Additionally, data TRANSFORMATION (e.g. log transformation) may help if data have a noticeable tail. When outliers are related to the sensitivity of the collecting INSTRUMENT which may not precisely record small values, Winsorization may be useful. Winsorizing or winsorization is the transformation of statistics by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers. This type of transformation has the same effect as clipping signals (i.e. replaces extreme data values with less extreme values). Another option to reduce the influence of outliers is using mean absolute difference rather mean squared error. For model building purposes, some models are resistant to outliers (e.g. tree-based approaches) or non-parametric tests. Tree models typically divide each node into two parts in each split, which is similar to the median effect. Therefore, at each split, all data points in a bucket could be equally treated regardless of the extreme values they may have. |
|