Basic

Learn Basic in SQLPad's Cracking the Machine Learning Fundamentals Interview course with practical examples and guided lessons.

Supervised learning is a set of problems where the ground truth (e.g., types of objects, historical prices) is known in the training data. A supervised learning problem is either a classification or a regression problem, depending on whether the variable you are trying to predict is discrete (true or false, class labels) or continuous (prices, cost). Many business problems can be formulated as supervised learning problems.

Top Basic Supervised Learning interview questions and answers.

15 questions

Q1: What is your favorite machine learning algorithm, and why?

This is a warm-up question to kick off the conversation. There is no right or wrong answer, as long as you pick an algorithm that you know inside and out, because the interviewer may dive deep into the details.

Sample answer:

Thank you, that's a very interesting question. My favorite machine learning algorithm is Random Forest, because of its ease of use and strong performance.

1. Based on my past experience in building machine learning models, Random Forest has consistently generated good results in both classification and regression problems.

2. Random Forest also trains quickly and needs little feature engineering, which saves a lot of time and lets me build a quick prototype and communicate my results to the business.

3. Because Random Forest randomly samples the training data and the variables for each tree and then averages over many trees, it is less sensitive to outliers and less prone to over-fitting than a single decision tree.

4. Lastly, Random Forest also has some bells and whistles, such as variable importance scores that give us an idea of how important each variable is, which makes it easy to explore the data and intuitively validate our understanding of it.

Q2: What is n-fold cross-validation? Walk me through the process, what is a common value of n?

N-fold cross-validation is a re-sampling technique, commonly used to evaluate a machine learning model's performance.

The raw data set is randomly partitioned into n disjoint subsets (folds). One of the subsets is used as the testing data, while the rest are used as the training data for modeling.

The process is repeated n times, so that each subset serves as the testing data exactly once, and the performance metrics such as accuracy are aggregated by taking the average of the n accuracies.

A common value of n is 5, so that in each round 80% of the samples are used for training and 20% for testing.
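
The process above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names `k_fold_indices` and `cross_validate`, and the `fit_and_score` callback, are hypothetical names chosen for this example.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for n-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition of the samples
    # Fold i takes every n_folds-th shuffled index, so the folds are disjoint.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, n_folds, fit_and_score):
    """Average the score returned by fit_and_score(train_idx, test_idx) over all folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)

# With 10 samples and 5 folds, each round trains on 8 samples and tests on 2.
splits = list(k_fold_indices(n_samples=10, n_folds=5))
```

In practice `fit_and_score` would train a model on the training indices and return a metric such as accuracy on the test indices.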

Q3: What is over-fitting, and how do we know if our model is over-fitting? How do you handle over-fitting? Name a few methods.

Over-fitting happens when a machine learning model "tries too hard" to make a perfect prediction on training data.

For example, a single decision tree may grow many levels and try to predict a perfect label for every single data point, until every terminal node (leaf) holds only one sample.

A typical over-fitting pattern can be detected by applying the model to a hold-out sample. If the performance on the hold-out sample is much worse than the performance on the training data, the model is most likely over-fitted, which means it won't generalize well.

How to handle over-fitting depends on the model. For example, in linear regression models, we can use regularization such as the L1 (Lasso) or L2 (Ridge) methods to shrink the coefficients and reduce the impact of having too many parameters.

For a decision tree, we can grow a large (deep) tree, and then use cross-validation to prune it back to avoid over-fitting.

For a neural network, dropout is an effective way to help avoid over-fitting.
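
The detection step described above (comparing training performance with hold-out performance) can be illustrated with a toy model. This is a sketch under assumptions: the 1-D data is made up, one training label is deliberately noisy, and a 1-nearest-neighbor predictor stands in for "a model that memorizes the training data".

```python
def nearest_neighbor_predict(train, x):
    """Predict the label of the closest training point (memorizes the training set)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(model, data):
    """Fraction of (x, label) pairs that the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

# Made-up data: the true rule is label = 1 when x >= 5; the point (3, 1) is label noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
holdout = [(2.6, 0), (3.4, 0), (4.6, 0), (5.5, 1), (7.5, 1)]

def memorizer(x):
    return nearest_neighbor_predict(train, x)

train_accuracy = accuracy(memorizer, train)      # perfect: every point is its own neighbor
holdout_accuracy = accuracy(memorizer, holdout)  # lower: the noisy point misleads its neighbors
gap = train_accuracy - holdout_accuracy          # a large gap suggests over-fitting
```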

Q4: What is under-fitting, and how do we improve it?

Under-fitting usually happens when the model is not 'sophisticated' enough to fit the training data. Usually, we can increase the model complexity to improve its performance.

For example, we can introduce more variables/features, or, if we are building a linear regression model, include higher-order terms of the independent variables (such as squared terms).
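
As a small illustration of the higher-order-term idea, the sketch below fits two single-feature least-squares models (no intercept) to made-up data whose true relationship is quadratic. The helper names `fit_single_feature` and `mse` are hypothetical, and the setup is deliberately simplified.

```python
def fit_single_feature(features, targets):
    """Least-squares weight w for the model y = w * feature (no intercept)."""
    return sum(f * y for f, y in zip(features, targets)) / sum(f * f for f in features)

def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]  # made-up data: the true relationship is quadratic

# Model 1: only the raw feature x. It is too simple and under-fits.
w_linear = fit_single_feature(xs, ys)
mse_linear = mse([w_linear * x for x in xs], ys)

# Model 2: the higher-order feature x^2. It can represent the relationship.
squared = [x ** 2 for x in xs]
w_quad = fit_single_feature(squared, ys)
mse_quad = mse([w_quad * s for s in squared], ys)
```

The first model has a large error even on its own training data, which is the signature of under-fitting; the second fits the same data exactly.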

Q5: How do you handle missing values?

Generally speaking, in practice, there are two ways to deal with missing values.

1. Skip those samples

If there is no evidence that the values are missing for a non-random reason, and the rows/samples with missing data are only a small percentage of your training data, you can safely remove those samples.

If, however, missingness is strongly correlated with other variables (e.g., in a survey, people who earn less than 50k per year may feel uncomfortable sharing their income bucket), removing those samples introduces sample bias, and your training data is no longer a good representation of the overall population. Instead, we can keep those rows and fill the missing income bucket variable with a new value of "missing".

2. Imputation

If the values are missing at random, with no strong correlation between missingness and other variables, we can replace the missing values with the most likely value, in the spirit of maximum likelihood.

For example, we can fill the median value for continuous variables, or the most common non-missing value for categorical variables.
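
The fill strategies described in this answer can be sketched with the Python standard library. This is a minimal example on made-up values, where `None` marks a missing entry; the function names are hypothetical.

```python
from collections import Counter
from statistics import median

def impute_median(values):
    """Replace None with the median of the observed values (continuous variables)."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None with the most common observed value (categorical variables)."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

def flag_missing(values, label="missing"):
    """Keep the row and mark the gap as its own category (non-random missingness)."""
    return [label if v is None else v for v in values]

ages = impute_median([25, None, 40, 31, None])
colors = impute_mode(["red", None, "blue", "red"])
income = flag_missing(["50-75k", None, "75-100k"])
```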

Q6: How do you handle extremely unbalanced data?

When your samples are extremely unbalanced, for example in a binary classification problem where the ratio of class 1 vs class 2 is 99% vs 1% (like click event prediction for display ads), we can usually down-sample the larger group to achieve a 50/50 ratio (or another ratio, such as 70/30) in the new training data.

After training the model on the new training data (50/50 of class 1 vs class 2), the key to making sure your down-sampling works is to test your model on a hold-out sample without any resampling.

By doing this, you will be able to evaluate the model performance in the real production environment.

Since the hold-out sample has 99% class 1 data, a dummy predictor that simply predicts everything to be class 1 will still have 99% accuracy.

So, instead of using overall accuracy, we should use a metric such as the F1 score, which is the harmonic mean of precision and recall, to check that the model is actually useful for the rare class.
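
The sketch below illustrates both points on made-up labels: down-sampling the majority class to a 50/50 split, and the dummy predictor that scores 99% accuracy but an F1 score of 0 on the rare class. The function names are hypothetical, and `down_sample` assumes the majority class is at least as large as the minority class.

```python
import random

def down_sample(rows, labels, majority_label, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    kept = random.Random(seed).sample(majority, len(minority)) + minority
    return [rows[i] for i in kept], [labels[i] for i in kept]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive):
    """F1 for the class named `positive`; defined as 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

labels = [1] * 99 + [2]   # made-up 99% / 1% split
rows = list(range(100))   # stand-in for feature rows
balanced_rows, balanced_labels = down_sample(rows, labels, majority_label=1)

dummy = [1] * 100         # always predict the majority class
dummy_accuracy = accuracy(labels, dummy)
dummy_f1 = f1_score(labels, dummy, positive=2)
```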

Q7: How do you split your data into training and testing if the data is time-dependent?

This is a typical question to test your understanding of the model evaluation process.

The key idea with time-dependent training data is to think about what data will actually be available in the real world at prediction time.

For example, if you are building a time-series model to predict next month's web traffic, you can't use future months' data to train it.

So instead of randomly splitting your raw data into training and testing subsets, you will need to separate them by time, e.g., use the month 1 to month 13 data to train your model, and use month 14 to test its performance.
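
A time-based split like the one described can be sketched as follows. The traffic numbers are made up, and `time_split` is a hypothetical helper name.

```python
def time_split(records, last_train_month):
    """Split (month, value) records so the test set is strictly later than the training set."""
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] <= last_train_month]
    test = [r for r in ordered if r[0] > last_train_month]
    return train, test

# Made-up monthly web traffic for months 1..14.
traffic = [(month, 1000 + 50 * month) for month in range(1, 15)]
train, test = time_split(traffic, last_train_month=13)
```

Because the split is by time rather than random, no record in `train` is later than any record in `test`.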

Q8: What does model regularization help to achieve?

The main goal of regularization is to help us avoid over-fitting. When the number of variables and the complexity of the model are high, the model can achieve near-perfect performance on the training data, but because it does not generalize, it will perform poorly on unseen new data.

Regularization helps us reduce the complexity of the model, and avoid 'over-learning' on the training data, so it can perform better on unseen data points.
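
As a small worked example of how a penalty reduces complexity, consider a one-feature L2-regularized (ridge) regression without an intercept. Minimizing $\sum_i (y_i - w x_i)^2 + \lambda w^2$ gives $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$, so a larger penalty shrinks the weight toward zero. The data and the name `ridge_weight` are made up for illustration.

```python
def ridge_weight(xs, ys, penalty):
    """Closed-form weight for y = w * x minimizing sum((y - w*x)^2) + penalty * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # made-up data with exact slope 2

# penalty = 0 is ordinary least squares; larger penalties shrink the weight.
weights = {penalty: ridge_weight(xs, ys, penalty) for penalty in (0.0, 1.0, 14.0)}
```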

Q9: Explain the bias vs. variance tradeoff

Theoretical explanation:

The expected squared error can be decomposed into three parts:

$\operatorname {E} _{D,\varepsilon }{\Big [}{\big (}y-{\hat {f}}(x;D){\big )}^{2}{\Big ]}={\Big (}\operatorname {Bias} _{D}{\big [}{\hat {f}}(x;D){\big ]}{\Big )}^{2}+\operatorname {Var} _{D}{\big [}{\hat {f}}(x;D){\big ]}+\sigma ^{2}$

These are the (squared) bias part, the variance part, and the irreducible noise $\sigma ^{2}$, which no model can remove.

Bias:

$\operatorname {Bias} _{D}{\big [}{\hat {f}}(x;D){\big ]}=\operatorname {E} _{D}{\big [}{\hat {f}}(x;D){\big ]}-f(x)$

Variance:

$\operatorname {Var} _{D}{\big [}{\hat {f}}(x;D){\big ]}=\operatorname {E} _{D}{\Big [}{\big (}\operatorname {E} _{D}[{\hat {f}}(x;D)]-{\hat {f}}(x;D){\big )}^{2}{\Big ]}$

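The bias measures how far the model's average prediction (averaged over possible training sets) is from the true value; high bias typically shows up as under-fitting, with poor performance even on the training sample. The variance measures how much the prediction changes from one training set to another; high variance typically shows up as over-fitting, with poor generalization to future unseen samples.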
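
The two definitions can be checked numerically with a small simulation. This is a sketch under assumptions: the true value, the noise level, and the two toy "models" (one that ignores the data, one that averages it) are made up for illustration, and many training sets are drawn to approximate the expectation over $D$.

```python
import random
from statistics import mean, pvariance

TRUE_VALUE = 2.0   # f(x) at one fixed x (made-up)
NOISE_STD = 1.0    # standard deviation of the observation noise (made-up)
rng = random.Random(0)

def draw_dataset(size=5):
    """One training set D: noisy observations of the true value."""
    return [TRUE_VALUE + rng.gauss(0.0, NOISE_STD) for _ in range(size)]

def constant_model(dataset):
    return 0.0            # ignores the data: high bias, zero variance

def sample_mean_model(dataset):
    return mean(dataset)  # follows the data: low bias, some variance

def bias_and_variance(model, n_datasets=2000):
    """Estimate Bias_D and Var_D of the prediction by resampling training sets."""
    predictions = [model(draw_dataset()) for _ in range(n_datasets)]
    bias = mean(predictions) - TRUE_VALUE
    variance = pvariance(predictions)
    return bias, variance

constant_bias, constant_var = bias_and_variance(constant_model)
mean_bias, mean_var = bias_and_variance(sample_mean_model)
```

The constant model has a bias of exactly -2 and no variance, while the sample-mean model has a bias near 0 and a variance near the theoretical value of 1/5 for five observations.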

Intuitive explanation:

Imagine we are shooting at a target. The best scenario is that we consistently hit the bullseye.

Chart 1: low bias, low variance (best scenario).

Chart 2: high bias, low variance.

If we are consistent but hit far away from the bullseye every time, it's the high bias, low variance scenario.

Similarly, we also have the low bias, high variance (chart 3) and high bias, high variance (chart 4) scenarios:

Chart 3: low bias, high variance.

Chart 4: high bias, high variance.

Q10: How do you deal with categorical features?

The most common way to represent categorical features is the one-hot encoding method, which replaces the feature with one binary (0/1) column per category.
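
A minimal sketch of one-hot encoding in plain Python follows. The function name `one_hot_encode` and the color values are made up for illustration; the columns are ordered alphabetically by category.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

categories, encoded = one_hot_encode(["red", "green", "blue", "green"])
# categories -> ['blue', 'green', 'red']
# encoded    -> [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```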

Q11: Explain the SVM algorithm. How does it work?

SVM stands for support vector machine. In its basic form it is a linear algorithm that tries to find the best separation of two classes of samples for a binary classification problem.

It is formalized as an optimization problem, and the goal is to find a line (in 2D) or a hyperplane (in a higher-dimensional space) that gives the largest separation, or margin, between the two classes of samples.

The hyperplane that maximizes the margin is called the maximum-margin hyperplane, and the training samples closest to it, which determine its position, are called the support vectors.
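
To make the margin concrete, the sketch below takes a given separating hyperplane $w \cdot x + b = 0$ and measures the distance from each training point to it; the training points closest to the hyperplane are the support vectors. The four points and the values of `w` and `b` are made up, and the code only evaluates a candidate hyperplane rather than solving the SVM optimization problem.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from a point to the hyperplane w . x + b = 0."""
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / math.sqrt(sum(wi * wi for wi in w))

# Made-up 2-D points with labels -1 / +1, and a candidate hyperplane x1 + x2 = 4.
points = [((1.0, 1.0), -1), ((0.0, 0.0), -1), ((3.0, 3.0), 1), ((4.0, 4.0), 1)]
w, b = (0.5, 0.5), -2.0

# label * signed distance is positive when a point is on the correct side.
distances = [label * signed_distance(p, w, b) for p, label in points]
min_distance = min(distances)
margin_width = 2 * min_distance  # gap between the two classes around the hyperplane

# The support vectors are the points that sit exactly on the margin.
support_vectors = [p for (p, _), d in zip(points, distances) if math.isclose(d, min_distance)]
```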

Q12: What are some of the most common kernels in SVM?

Linear
Polynomial
RBF (radial basis function)
Sigmoid
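
For reference, the four kernels listed above can be written as small Python functions. The formulas are the commonly used ones; the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions chosen for this sketch, not settings taken from any particular library.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return dot(x, z)

def polynomial_kernel(x, z, degree=2, gamma=1.0, coef0=1.0):
    """K(x, z) = (gamma * x . z + coef0) ** degree"""
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    """K(x, z) = tanh(gamma * x . z + coef0)"""
    return math.tanh(gamma * dot(x, z) + coef0)

x, z = (1.0, 2.0), (3.0, 0.0)
kernel_values = {
    "linear": linear_kernel(x, z),
    "polynomial": polynomial_kernel(x, z),
    "rbf": rbf_kernel(x, z),
    "sigmoid": sigmoid_kernel(x, z),
}
```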

Q13: When should we prefer a False Positive over a False Negative? Give me an example.

A False Positive is a case where we mistakenly predict a negative case as a positive case.

Accepting more False Positives usually increases the Recall, and it is preferred when the potential loss/penalty of missing a true positive is very high.

For example: in extreme weather prediction, we'd rather mistakenly predict the weather as a potential hurricane, than not treat it as a potential risk.
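
The trade-off can be seen by moving the decision threshold of a classifier that outputs scores. In this sketch the labels and scores are made up: a lenient (low) threshold catches every positive at the cost of more false positives, while a strict (high) threshold avoids false positives but misses positives.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (true positives, false positives, false negatives) at a given threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, predictions))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, predictions))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, predictions))
    return tp, fp, fn

def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

# Made-up labels (1 = positive) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

strict = confusion_counts(y_true, scores, threshold=0.8)    # few false positives
lenient = confusion_counts(y_true, scores, threshold=0.25)  # few false negatives

strict_recall = recall(strict[0], strict[2])
lenient_recall = recall(lenient[0], lenient[2])
lenient_precision = precision(lenient[0], lenient[1])
```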

Q14: When should we prefer a False Negative over a False Positive? Give me an example.

A False Negative is a case where we mistakenly predict a positive case as a negative case.

This preference is common in recruiting. A company often tends to be very strict in making an offer to a candidate, so that it won't hire someone who turns out to be a poor employee.

It would rather reject a qualified candidate (a False Negative) than accept an unqualified one (a False Positive).

Q15: Where does the 'Naive' in Naive Bayes come from?

The "Naiveness" or "Naivety" in a Naive Bayes algorithm stems from the assumption that every pair of features in X is conditionally independent given the class variable y. This assumption significantly simplifies the computation, but in the real world it is rarely true.
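
The simplification is that the class score factorizes as $P(y)\prod_i P(x_i \mid y)$, so each feature only needs its own per-class counts. The sketch below is a minimal categorical Naive Bayes on made-up data; the function names are hypothetical, and it uses raw counts without smoothing, so it assumes at least one class has a non-zero score for the query.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and, per (class, feature index), the value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def posterior(row, class_counts, feature_counts):
    """P(y | row) under the naive assumption: P(y) * product of P(x_i | y), normalized."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                                   # P(y)
        for i, value in enumerate(row):
            score *= feature_counts[(label, i)][value] / count  # P(x_i | y)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Made-up training data: (weather, temperature) -> decision.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["play", "play", "stay", "play"]

class_counts, feature_counts = train_naive_bayes(rows, labels)
result = posterior(("rainy", "mild"), class_counts, feature_counts)
```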