107 + Interview Questions in Machine Learning in Data Science Page 1 InterviewSolution

1.

We have below data with 10 transactions. What is the “Lift Ratio” for “if white then blue”?

Answer»

MISSING data in the training data set can reduce the power/fit of a model or can lead to a biased model because we have not analyzed the behavior and relationship with other variables correctly. It can lead to incorrect prediction or classification. Below is a SIMPLE example to illustrate this.

Name	Weight	Gender	Play Golf or Not
AA	55	M	Yes
BB	62	F	Yes
CC	58	F	No
DD	54		No
EE	54	M	No
FF	66	F	Yes
GG	56		Yes
HH	56	M	Yes

Figure 1

Gender	# Count	# Play Golf	% Play Golf
F	3	2	66.67%
M	3	2	66.67%
Missing/Blank	2	1	50%

Figure 2

Please NOTE the missing values in the table shown above: in figure1, we have not treated missing values for our analysis in Figure 2. The INFERENCE from this data set is that the chances of playing golf by females and MALES are similar.

On the other hand, if you look at Figure. 4, which shows data after treatment of missing values (based on gender), we can see that females have higher chances of playing cricket compared to males.

Name	Weight	Gender	Play Golf or Not
AA	55	M	Yes
BB	62	F	Yes
CC	58	F	No
DD	54	M	No
EE	54	M	No
FF	66	F	Yes
GG	56	M	Yes
HH	56	M	Yes

Figure 3

Gender	# Count	# Play Golf	% Play Golf
F	3	2	66.67%
M	5	3	60%

Figure 4

15.	What is the difference between one hot encoding and label encoding? Explain.
Answer» USING one hot encoding, the dimensionality (i.e. FEATURES) in a dataset get increased because it creates a new variable for each level present in categorical variables. For example: let’s say we have a variable ‘color’. The variable has 3 levels NAMELY Red, Blue, and Green. One hot encoding ‘color’ variable will generate three new variables as Color.Red, Color.Blue and Color.Green containing 0 and 1 value. In label encoding, the levels of categorical variables get ENCODED as 0 and 1, so no new variable is created. Label encoding is majorly used for binary variables.

45.	Which of the following is used to assist the quantitative trader in the development?(a) quantmod(b) quantile(c) quantity(d) mboostI got this question in an interview for internship.This is a very interesting question from Model Based Prediction in chapter Machine Learning of Data Science
Answer» The correct choice is (a) quantmod The BEST EXPLANATION: Quandl PACKAGE is similar to quantmod.

46.	Which of the following is not a machine learning algorithm?(a) SVG(b) SVM(c) Random forest(d) None of the mentionedThis question was posed to me during a job interview.This intriguing question originated from Cross Validation topic in portion Machine Learning of Data Science
Answer» The correct OPTION is (a) SVG Easiest explanation - SVM stands for SCALABLE VECTOR MACHINE.

47.	Which of the following methods are present in caret for regularized regression?(a) ridge(b) lasso(c) relaxo(d) all of the mentionedThis question was addressed to me at a job interview.My question is based upon Model Based Prediction topic in portion Machine Learning of Data Science
Answer» Right answer is (d) all of the mentioned Easiest EXPLANATION - In caret ONE can TUNE over the no of predictors to retain instead of defined values for PENALTY.

48.	Which of the following function can be used to flag predictors for removal?(a) searchCorrelation(b) findCausation(c) findCorrelation(d) none of the mentionedThe question was asked by my college director while I was bunking the class.My doubt stems from caret topic in chapter Machine Learning of Data Science
Answer» Correct ANSWER is (C) findCorrelation The explanation is: Some models THRIVE on CORRELATED PREDICTORS.

49.	Which of the following can be used to create the most common graph types?(a) qplot(b) quickplot(c) plot(d) all of the mentionedI have been asked this question by my school teacher while I was bunking the class.My enquiry is from Cross Validation in section Machine Learning of Data Science
Answer» The correct CHOICE is (a) QPLOT Explanation: qplot() is SHORT for a QUICK PLOT.

50.	Which of the following is a common error measure?(a) Sensitivity(b) Median absolute deviation(c) Specificity(d) All of the mentionedI got this question in an online interview.I'd like to ask this question from Cross Validation topic in portion Machine Learning of Data Science
Answer» Correct OPTION is (d) All of the mentioned Explanation: SENSITIVITY and SPECIFICITY are statistical measures of the PERFORMANCE of a binary classification test, also known in statistics as classification FUNCTION.

Explore topic-wise InterviewSolutions in .

We have below data with 10 transactions. What is the “Lift Ratio” for “if white then blue”?

Transactions#FaceplateColorsPurchased1redwhitegreen2whiteorange3whiteblue4redwhiteorange5redblue6whiteblue7whiteorange8redwhitebluegreen9redwhiteblue10yellow

We have below data with 10 transactions. What is the performance measure “Confidence” for “if white then blue”?

Transaction#FaceplateColorsPurchased1redwhitegreen2whiteorange3whiteblue4redwhiteorange5redblue6whiteblue7whiteorange8redwhitebluegreen9redwhiteblue10yellow

We have below data with 10 transactions. What is the performance measure “Support” for “if white then blue”?

SATAverageSAT score of new freshmenTop10% new freshmen in top 10% of highschool classAccept% of applicants accepted SFRatioStudent to faculty ratioSExpensesEstimated annual expensesGradRateGraduation Rate(%)

Consider universities dataset below. Data for 25 undergraduate programs at business schools in US universities in 1995. The dataset excludes image variables (student satisfaction, employer satisfaction, dean’s opinions, etc.). Given this

We have trained/executed our model with the given dataset. We have noticed that we have used a regression model and it is suffering from multicollinearity. Is it possible to improvise on our model without losing any information?

When is Ridge regression used and when is Lasso regression (ideally)?

What are the key methods for variable selection? Explain briefly.

What is the difference between Random Forest and Gradient Boosting algorithms? Explain briefly.

We have got a dataset where a number of variables is greater than the number of observations or rows. Can we use classical Regression techniques here? How would you deal with this situation?

We have developed a Random Forest model with 10000 trees. We have got training error as 0. However, the validation error seems to be around 34~35. Any thoughts? Do you feel the model has not trained appropriately?

What is the difference between one hot encoding and label encoding? Explain.

We have time series data provided to us. What cross-validation techniques are to be followed?

We have a dataset comprising of variables having more than 30% missing values. Let’s say, for example, we have 100 variables and 16 variables have missing values of more than 30%. How will you deal with this scenario?

What are the parameters to evaluate Logistic Regression? Explain briefly.

What is the difference between OLS and Maximum Likelihood? Explain briefly.

What is Bias-Variance trade-off? Explain.

There are multiple algorithms available in machine learning – supervised, unsupervised and other learning. How do you determine which one to use?

How is the logistic regression model evaluated? Explain at least 3 points.

What is the difference between Type 1 and Type 2 Error? Explain briefly.

There is an ask to evaluate a regression model based on parameters such as R square, Adjusted R square, and Tolerance? Explain what will be the criteria.

What is the difference between kNN and k means clustering?

The model is suffering from low bias and high variance. What approach should be used to tackle this scenario and why?

What is the Jitter Plot? Explain with an example.

What is the Kolmogorov And Smirnov Test?

What is the Wilcoxon Signed Rank Test?

How do we test if a time series data stationary or not programmatically?

How will you detrend a time series?

What is auto-correlation and partial auto-correlation?

What is a stationary time series?

Provide five assumptions of Linear regression.

Provide at least three ways to detect outliers in a dataset?

What impact outliers have in a dataset? Explain with an example.

What is kNN imputation and what are its pros &amp; cons?

What is the difference between “listwise deletion” and “pairwise deletion”?

Missing values in data can cause issues and there are different strategies to handle missing values. What are the different types of missing values at the time of data collection? Explain.

Why missing values treatment is required?

What type of bivariate analysis will you perform if variables are categorical and continuous?

What is the chi-square test? When do we use this?

What is the difference between univariate and bivariate analysis? Explain briefly.

What is CRISP-DM? Explain various stages

Which of the following is used to assist the quantitative trader in the development?(a) quantmod(b) quantile(c) quantity(d) mboostI got this question in an interview for internship.This is a very interesting question from Model Based Prediction in chapter Machine Learning of Data Science

Which of the following is not a machine learning algorithm?(a) SVG(b) SVM(c) Random forest(d) None of the mentionedThis question was posed to me during a job interview.This intriguing question originated from Cross Validation topic in portion Machine Learning of Data Science

Which of the following methods are present in caret for regularized regression?(a) ridge(b) lasso(c) relaxo(d) all of the mentionedThis question was addressed to me at a job interview.My question is based upon Model Based Prediction topic in portion Machine Learning of Data Science

Which of the following can be used to create the most common graph types?(a) qplot(b) quickplot(c) plot(d) all of the mentionedI have been asked this question by my school teacher while I was bunking the class.My enquiry is from Cross Validation in section Machine Learning of Data Science

Which of the following is a common error measure?(a) Sensitivity(b) Median absolute deviation(c) Specificity(d) All of the mentionedI got this question in an online interview.I'd like to ask this question from Cross Validation topic in portion Machine Learning of Data Science

What is kNN imputation and what are its pros & cons?