1.

What are outliers?

Answer»

Outliers are data points or values that lie far from the rest of the data. They do not belong to any particular group or cluster.

The presence of outliers may affect the behavior of a model, so care must be taken to identify them and treat them properly.

Outliers may also contain valuable information, so they should be handled carefully. Most of the time they are considered bad data points, but the reason for their presence in the data set should still be investigated.

Outliers in the input data may skew the results and mislead the training process of machine learning algorithms. This can result in:

  1. Longer Training Time
  2. Less Accurate Models
  3. Poor Results

It is observed that many machine learning models are sensitive to:

  1. The range of attribute values
  2. Distribution of attribute values

Outliers may produce misleading representations of the collected data, which in turn lead to misleading interpretations.

In descriptive statistics, for example, outliers may skew the mean and standard deviation of the attribute values. The effects can be observed in plots such as scatterplots and histograms.
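
As a small illustration of this effect, the sketch below compares the mean, median, and standard deviation of a list of values with and without one extreme value. The sample values and the helper name summarize are made up for this example and do not come from any particular data set.

```python
import statistics

def summarize(values):
    """Return (mean, median, sample standard deviation) of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),
    )

# Hypothetical attribute values; 95 is the single extreme value we add.
clean = [10, 12, 11, 13, 12, 11, 12]
with_outlier = clean + [95]

clean_mean, clean_median, clean_std = summarize(clean)
out_mean, out_median, out_std = summarize(with_outlier)

# The mean moves from about 11.57 to 22, while the median stays at 12.
print("mean:  ", clean_mean, "->", out_mean)
print("median:", clean_median, "->", out_median)
# The standard deviation grows from under 1 to about 29.5.
print("stdev: ", clean_std, "->", out_std)
```

The median barely reacts to the extra value, which is one reason it is often reported next to the mean when outliers are suspected.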

For some problems, the outliers themselves are the points of interest. For example, anomalies matter in:

  1. Fraud detection
  2. Computer security

Some of the outlier detection methods are:

  1. Extreme Value Analysis: Here we determine the statistical tails of the distribution of the data. For example, statistical methods such as z-scores on univariate data.
  2. Probabilistic and Statistical Models: Here we determine the unlikely instances from a probabilistic model of the data. For example, Gaussian mixture models fitted using expectation-maximization.
  3. Linear Models: Using linear correlations, the data is modeled in lower dimensions. For example, data points with large residual errors can be outliers.
  4. Proximity-based Models: Here, data instances that are isolated from the mass of the data are identified by cluster, density, or nearest-neighbor analysis.
  5. Information-Theoretic Models: Here, outliers are detected as data instances that increase the complexity (minimum code length) of the dataset.
  6. High-Dimensional Outlier Detection: Here we search subspaces for outliers, because distance-based measures become less reliable in higher dimensions.
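
Two of the methods above can be sketched in a few lines of Python using only the standard library: z-scores on univariate data (method 1) and a nearest-neighbor distance score (method 4). This is a minimal sketch, not a definitive implementation. The function names zscore_outliers and knn_distance_scores, the threshold of 3.0, the choice k=2, and the sample data are all assumptions made for illustration.

```python
import math
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`.

    The threshold of 3.0 is a common convention, assumed here for illustration.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing can stand out
    return [v for v in values if abs(v - mean) / std > threshold]

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor.

    A larger score means the point is more isolated from the rest.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical univariate data. In very small samples a lone extreme value
# cannot reach a z-score above 3, so this sample has 13 values.
values = [10, 12, 11, 13, 12, 11, 12, 10, 13, 11, 12, 11, 95]
print(zscore_outliers(values))  # [95]

# Hypothetical 2-D data: four points close together and one far away.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
scores = knn_distance_scores(points, k=2)
most_isolated = points[scores.index(max(scores))]
print(most_isolated)  # (10, 10)
```

The z-score check suits a single attribute whose values are roughly bell-shaped, while the neighbor-distance score makes no assumption about the distribution but compares every pair of points, so it becomes slow on large data sets.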

