1.

What are missing values in Big Data, and how do you deal with them?

Answer:

Missing values in Big Data refer to values that are absent from a particular column. If left unhandled, they can corrupt an analysis and produce incorrect results. Several techniques are used to deal with missing values:

  • Mean or Median Imputation:
    When data is missing at random, we can use list-wise or pair-wise deletion of the missing observations. However, there are several reasons why this may not be the most workable option:
    • There may not be enough observations with non-missing data to produce a reliable analysis
    • In predictive analytics, missing data can prevent forecasts for the observations that have missing values
    • External factors may require specific observations to be part of the analysis
      In such cases, we impute values for the missing data. A simple technique is to use the mean or median of the non-missing observations. This can be useful when the number of missing observations is low. However, with many missing values, mean or median imputation reduces the variation in the data, and more sophisticated imputation methods are preferable.
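Mean and median imputation need nothing beyond base R. The sketch below uses a small hypothetical `age` vector to show the mechanics:

```r
# Minimal sketch of mean/median imputation in base R on a toy 'age' column.
age <- c(23, 31, NA, 45, NA, 52)

# Replace each NA with the mean (or median) of the non-missing values.
age_mean   <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)
age_median <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

age_mean    # NAs replaced by 37.75
age_median  # NAs replaced by 38
```

Note that both imputed vectors have less spread than the original non-missing values, which is exactly the loss of variation described above.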
  • Multivariate Imputation by Chained Equations (MICE):
    MICE assumes that the missing data are Missing at Random (MAR). It imputes data on a variable-by-variable basis by specifying an imputation model per variable. MICE uses predictive mean matching (PMM) for continuous variables, logistic regression for binary variables, Bayesian polytomous regression for factor variables, and the proportional odds model for ordered variables.
    To set up the data for MICE, note that the algorithm uses all the variables in the data for predictions. Variables that are not useful for predictions, such as an ID variable, should therefore be removed before running the algorithm.

    Data$ID <- NULL   # drop the ID column so it is not used as a predictor
    Secondly, as mentioned above, the algorithm treats different variable types differently, so all categorical variables should be converted to factors before running MICE.

    Data$year <- as.factor(Data$year)
    Data$gender <- as.factor(Data$gender)
    Then you can implement the algorithm using the mice library in R:
    library(mice)
    init <- mice(Data, maxit = 0)        # dry run to extract the defaults
    method <- init$method                # default imputation method per variable
    predMat <- init$predictorMatrix      # which variables predict which
    set.seed(101)                        # for reproducible imputations
    imputed <- mice(Data, method = method, predictorMatrix = predMat, m = 5)
    completedData <- complete(imputed)   # extract the first completed dataset
    You can also exclude some variables as predictors, or skip a variable from being imputed, using the mice library in R. Additionally, the library lets you set the imputation method discussed above per variable, depending on its nature.
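These exclusions are expressed through the `method` vector and `predictorMatrix` that the dry run returns. The sketch below builds toy versions of these objects by hand (the variable names are hypothetical) to show the mechanics; with real data you would modify `init$method` and `init$predictorMatrix` directly:

```r
# Toy versions of mice's control objects (hypothetical variable names).
# predictorMatrix is a 0/1 matrix: a 0 in row i, column j means variable
# j is NOT used when imputing variable i.
vars <- c("ID2", "year", "gender", "income")
predMat <- matrix(1, 4, 4, dimnames = list(vars, vars))
diag(predMat) <- 0                       # a variable never predicts itself

method <- setNames(c("pmm", "polyreg", "logreg", "pmm"), vars)

predMat[, "ID2"] <- 0                    # never use ID2 as a predictor
method["ID2"] <- ""                      # an empty method skips imputing ID2
# imputed <- mice(Data, method = method, predictorMatrix = predMat, m = 5)
```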
  • Random Forest:
    Random forest is a non-parametric imputation method suited to multiple variable types, and it works well with data both missing at random and not missing at random. Random forest uses several decision trees to impute missing values and outputs out-of-bag (OOB) imputation error estimates.
    One caveat is that random forest works best on large datasets; on small datasets it carries a risk of overfitting. The extent to which overfitting leads to false imputations depends on how closely the distribution of the predictor variables for non-missing data resembles their distribution for missing data. For example, if the distribution of race/ethnicity for non-missing data is similar to the distribution of race/ethnicity for missing data, overfitting is not likely to throw off results. However, if the two distributions differ, the accuracy of the imputations will suffer.
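As a sketch of this approach: the answer does not name a specific library, but the missForest package (an assumption here, installable via install.packages("missForest")) wraps random-forest imputation in a single call:

```r
# Sketch using the missForest package (assumed available); iris is a
# built-in dataset into which we inject some missing values.
library(missForest)

set.seed(81)
df <- iris
df[sample(nrow(df), 15), "Sepal.Length"] <- NA

imp <- missForest(df)   # fits one random forest per variable, iterating to convergence
sum(is.na(imp$ximp))    # completed dataset: no missing values remain
imp$OOBerror            # out-of-bag imputation error (NRMSE / PFC)
```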

