Interview Solutions

This section includes interview questions with curated answers on predictive modeling to sharpen your knowledge and support interview preparation.

1. Explain Collinearity Between Continuous And Categorical Variables. Is VIF A Correct Method To Compute Collinearity In This Case?

Answer» Collinearity between categorical and continuous variables is very common. The choice of reference category for dummy variables affects multicollinearity, which means changing the reference category of the dummy variables can avoid collinearity; pick the reference category with the highest proportion of cases. VIF is not a correct method in this case: VIFs should only be run for continuous variables. A t-test can be used instead to check collinearity between a continuous variable and a dummy variable, as sketched below.

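A minimal sketch of the t-test approach on synthetic data (the names income and is_urban are hypothetical, not from the original answer): if the mean of the continuous predictor differs significantly between the two levels of the dummy, the two predictors are associated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
is_urban = rng.integers(0, 2, size=200)                # 0/1 dummy variable
income = 50 + 10 * is_urban + rng.normal(0, 5, 200)    # continuous, built to correlate with the dummy

# Split the continuous variable by dummy level and run Welch's t-test.
t_stat, p_value = stats.ttest_ind(income[is_urban == 0],
                                  income[is_urban == 1],
                                  equal_var=False)

# A small p-value suggests the continuous variable's mean differs by dummy
# level, i.e. the two predictors are collinear.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```
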
2. Explain Important Model Performance Statistics?

Answer» For classification models, commonly reported statistics include the area under the ROC curve (AUC), the KS statistic, the Gini coefficient, and confusion-matrix metrics such as accuracy, precision, and recall. For regression models, common statistics are R-squared, adjusted R-squared, RMSE (root mean squared error), and MAPE (mean absolute percentage error).

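A minimal sketch computing two of these statistics with scikit-learn (the arrays are synthetic placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error

# Classification: AUC from true labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.2, 0.4, 0.8, 0.7, 0.9, 0.3])
print("AUC:", roc_auc_score(y_true, y_score))

# Regression: RMSE from observed and fitted values.
y_obs = np.array([10.0, 12.5, 9.0, 14.0])
y_fit = np.array([9.5, 12.0, 10.0, 13.5])
print("RMSE:", np.sqrt(mean_squared_error(y_obs, y_fit)))
```
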
3. What Is P-value And How It Is Used For Variable Selection?

Answer» The p-value is the lowest level of significance at which you can reject the null hypothesis. In the case of independent variables, it indicates whether the coefficient of a variable is significantly different from zero. For variable selection, variables whose coefficients have p-values above the chosen significance level (commonly 0.05) are candidates for removal from the model.

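A minimal sketch with statsmodels on synthetic data, where the noise predictor x2 is a hypothetical example: its coefficient's p-value should be large, flagging it for removal.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)            # pure noise, unrelated to y
y = 3 + 2 * x1 + rng.normal(size=n)

# Fit OLS with an intercept and inspect coefficient p-values.
X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()
print(model.pvalues)               # drop terms with p-value > 0.05
```
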
4. Do We Remove Intercepts While Calculating Vif?

Answer» No. VIF depends on the intercept because there is an intercept in the auxiliary regression used to determine VIF. If the intercept is removed, R-squared is not meaningful: it may be negative, in which case one can get VIF < 1, implying that the standard error of a variable would go up if that independent variable were uncorrelated with the other predictors, which makes no sense.

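A minimal sketch of the standard practice in statsmodels, whose variance_inflation_factor helper does not add a constant itself, so the intercept column should be present in the design matrix passed to it (the data here are synthetic):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = 0.8 * df["x1"] + rng.normal(scale=0.5, size=100)

# Keep the intercept in the auxiliary regressions by adding a constant column.
X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```
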
5. How Vif Is Calculated And Interpretation Of It?

Answer» VIF measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity. If the VIF of a predictor variable were 9 (√9 = 3), this means the standard error for the coefficient of that predictor variable is 3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.

Steps of calculating VIF:
1. Regress the k-th predictor on all the other predictors in the model.
2. Take the R-squared of that auxiliary regression, R²_k.
3. Compute VIF_k = 1 / (1 - R²_k).

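A minimal sketch of these steps computed by hand with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.9 * X[:, 0] + rng.normal(scale=0.3, size=200)  # induce collinearity

for k in range(X.shape[1]):
    others = np.delete(X, k, axis=1)                        # step 1: all other predictors
    r2 = LinearRegression().fit(others, X[:, k]).score(others, X[:, k])  # step 2: R²
    print(f"VIF for x{k}: {1.0 / (1.0 - r2):.2f}")          # step 3: 1 / (1 - R²)
```
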
6. What Is Multicollinearity And How To Deal It?

Answer» Multicollinearity implies high correlation between independent variables, and its absence is one of the assumptions in linear and logistic regression. It can be identified by looking at the VIF scores of the variables: VIF > 2.5 implies a moderate collinearity issue, and VIF > 5 is considered high collinearity. It can be handled by an iterative process (see the sketch below): first remove the variable with the highest VIF, then check the VIFs of the remaining variables; if any remaining VIF is still above 2.5, repeat the first step until all VIFs are <= 2.5.

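A minimal sketch of that iterative elimination, assuming a pandas DataFrame of continuous predictors and the 2.5 threshold from the answer:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(df: pd.DataFrame, threshold: float = 2.5) -> pd.DataFrame:
    """Repeatedly drop the predictor with the highest VIF until all VIFs <= threshold."""
    cols = list(df.columns)
    while True:
        X = sm.add_constant(df[cols])       # keep the intercept (see question 4)
        vifs = {c: variance_inflation_factor(X.values, i)
                for i, c in enumerate(X.columns) if c != "const"}
        worst, worst_vif = max(vifs.items(), key=lambda kv: kv[1])
        if worst_vif <= threshold:
            return df[cols]                 # all remaining VIFs are acceptable
        cols.remove(worst)                  # drop the worst offender and re-check
```
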
7. Explain Dimensionality / Variable Reduction Techniques?

Answer» The techniques fall into two families (a PCA sketch follows below):

Unsupervised methods (no dependent variable), for example principal component analysis (PCA) and factor analysis.

Supervised methods (with respect to the dependent variable):
- For a binary / categorical dependent variable, for example information value / weight of evidence.
- For a continuous dependent variable, for example correlation with the target and stepwise selection.

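A minimal sketch of one unsupervised technique, PCA, with scikit-learn (synthetic data; the 95% explained-variance target is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))

# PCA is scale-sensitive, so standardize first.
X_std = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape, pca.explained_variance_ratio_)
```
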
8. How To Treat Outliers?

Answer» There are several methods to treat outliers:
- Cap or floor them (winsorizing), for example at the 1st and 99th percentiles.
- Cap or remove observations outside the interquartile-range (IQR) fences.
- Apply a variance-reducing transformation such as a log transform.
- Treat them as missing values and impute, or model them as a separate segment.

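A minimal sketch of the first two treatments with pandas (the series name "amount" and the fences are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
s = pd.Series(np.append(rng.normal(100, 10, 200), [400, -150]), name="amount")

# Percentile capping (winsorizing) at the 1st and 99th percentiles.
capped = s.clip(lower=s.quantile(0.01), upper=s.quantile(0.99))

# IQR fences: cap anything beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
fenced = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(capped.max(), fenced.max())   # the 400 outlier is pulled in by both
```
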
9. How To Handle Missing Values?

Answer» We fill/impute missing values using methods such as mean, median, or mode imputation, or model-based imputation (for example k-nearest-neighbours or regression). Alternatively, make the missing values a separate category (for categorical variables).

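A minimal sketch of two of these options with pandas (the columns age and city are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "city": ["NY", None, "LA", "NY", None],
})

# Numeric variable: impute with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical variable: make missing a separate category.
df["city"] = df["city"].fillna("Missing")

print(df)
```
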
10. Difference Between Linear And Logistic Regression?

Answer» Two main differences are as follows:
1. Linear regression requires the dependent variable to be continuous, i.e. numeric values (no categories or groups), while binary logistic regression requires the dependent variable to be binary: two categories only (0/1). Multinomial or ordinal logistic regression can have a dependent variable with more than two categories.
2. Linear regression is based on least squares estimation, which says the regression coefficients should be chosen in such a way that they minimize the sum of the squared distances of each observed response to its fitted value, while logistic regression is based on maximum likelihood estimation, which says the coefficients should be chosen in such a way that they maximize the probability of Y given X (the likelihood).

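A minimal sketch fitting one of each on synthetic data with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))

# Continuous target -> linear regression (least squares estimation).
y_cont = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)
lin = LinearRegression().fit(X, y_cont)

# Binary 0/1 target -> logistic regression (maximum likelihood estimation).
y_bin = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
logit = LogisticRegression().fit(X, y_bin)

print(lin.coef_, logit.coef_)
```
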
11. Explain The Problem Statement Of Your Project. What Are The Financial Impacts Of It?

Answer» Cover the objective or main goal of your predictive model. Compare the monetary benefits of the predictive model vs. no model, and also highlight the non-monetary benefits (if any).

12. What Are The Applications Of Predictive Modeling?

Answer» Predictive modeling is mostly used in areas such as credit risk scoring, fraud detection, customer churn prediction, marketing campaign response and cross-sell/up-sell targeting, and demand forecasting.

13. What Are The Essential Steps In A Predictive Modeling Project?

Answer» It consists of the following steps:
1. Define the business objective.
2. Collect and prepare the data (cleaning, missing-value and outlier treatment).
3. Explore the data and engineer/select variables.
4. Build and estimate candidate models.
5. Validate performance on hold-out data.
6. Deploy the model and monitor its performance over time.