Generative Model
Companies: Amazon
Difficulty: Medium
Frequency: Low
Question
Can you explain what a generative model is?
Answer
Generative models learn the joint probability P(X, Y), where X denotes the data points and Y the labels, while discriminative models learn the conditional probability P(Y | X). For example, given a dataset of cars, a generative model could generate new car examples, whereas a discriminative model could only differentiate between a car and a bicycle.
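To make the distinction concrete, here is a minimal stdlib-Python sketch (the toy 1-D "car vs bicycle" data and all names are illustrative, not from the original answer): the generative side models P(X | Y) per class and can sample new examples, while the discriminative side only learns a decision boundary.

```python
import random
import statistics

# Toy 1-D data: lengths in metres of "cars" (class 0) and "bicycles" (class 1).
cars = [4.2, 4.5, 4.8, 4.1, 4.6]
bikes = [1.6, 1.8, 1.7, 1.9, 1.5]

# Generative view: model P(X | Y) for each class (here a Gaussian), which lets
# us sample brand-new examples of a class.
car_mu, car_sigma = statistics.mean(cars), statistics.stdev(cars)
random.seed(0)
new_car_length = random.gauss(car_mu, car_sigma)  # a "generated" car

# Discriminative view: only learn the decision boundary for P(Y | X); here a
# threshold halfway between the class means is enough to separate them.
threshold = (statistics.mean(cars) + statistics.mean(bikes)) / 2

def classify(length):
    return "car" if length > threshold else "bicycle"
```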
Inconsistent Metrics
Companies: Amazon
Difficulty: Medium
Frequency: Low
Question
How would you solve an inconsistency between an offline metric and online A/B testing?
Answer
Most likely, either the offline metric is not representative of the online metric, or vice versa. If the online metric is correct, we should select an offline metric that is better aligned with it. On the other hand, if the online metric is not representative of the problem the model is trying to solve, we should update the online metric.
BERT vs Transformer
Companies: Amazon
Difficulty: High
Frequency: Medium
Question
What is the difference between BERT and transformers?
Answer
A transformer is an encoder-decoder neural network that uses attention (which determines which words in a sentence are important) and contains no recurrent connections. BERT, on the other hand, uses only the encoder part of the transformer as the main building block of its architecture.
SGD vs Adam
Companies: Amazon
Difficulty: Medium
Frequency: Low
Question
What is the difference between stochastic gradient descent and Adam?
Answer
In SGD (in the strictest definition), we compute the gradient on a single data point and then update the weights; because this is very noisy, in practice we usually update the weights using a mini-batch of the data. Adam, on the other hand, adds momentum and adaptive per-parameter learning rates, which lets it converge faster than vanilla gradient descent.
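The two update rules can be sketched side by side on a toy 1-D objective (a minimal illustration; the objective, learning rates, and hyperparameter values are arbitrary choices, not prescribed by the answer):

```python
import math

# Minimise f(w) = (w - 3)^2; its gradient is 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

# Vanilla (S)GD: step straight down the gradient.
w_sgd = 0.0
for _ in range(100):
    w_sgd -= 0.1 * grad(w_sgd)

# Adam: momentum (first moment m) plus an adaptive per-parameter
# learning rate (second moment v), with bias correction.
w_adam, m, v = 0.0, 0.0, 0.0
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 101):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w_adam -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

Both runs end near the optimum w = 3; on real, noisy objectives Adam's momentum and per-parameter scaling typically give the faster, more stable convergence described above.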
Word Embeddings Comparison
Companies: Amazon
Difficulty: Medium
Frequency: Low
Question
What is the difference between word2vec, GloVe, and fastText?
Answer
All three learn dense word vectors, but from different signals: word2vec (skip-gram/CBOW) learns embeddings by predicting words from their local context windows; GloVe fits embeddings to global word co-occurrence statistics over the whole corpus; fastText extends word2vec by representing each word as a bag of character n-grams, so it can build vectors for rare and out-of-vocabulary words and handles morphology better.
Large Number of Features
Companies: Netflix
Difficulty: Easy
Frequency: Medium
Question
What would happen when the number of features is far greater than the number of data points?
Answer
This can lead to overfitting; we can reduce it with regularization techniques (L1/L2 regularization), dimensionality reduction, or feature selection.
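As a small illustration of how an L2 penalty counteracts overfitting by shrinking weights (a toy 1-D least-squares sketch; the data and lambda values are made up for illustration):

```python
# 1-D least squares with an optional L2 penalty:
#   loss(w) = sum_i (w * x_i - y_i)^2 + lam * w^2
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

def fit(lam, steps=2000, lr=0.01):
    w = 0.0
    for _ in range(steps):
        # Gradient of the squared error plus the L2 penalty term 2 * lam * w.
        g = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) + 2 * lam * w
        w -= lr * g
    return w

w_plain = fit(lam=0.0)    # unregularised fit
w_ridge = fit(lam=10.0)   # the L2 penalty shrinks the weight toward 0
```

The regularised weight is strictly smaller in magnitude, which is exactly the mechanism that limits model complexity when features outnumber data points.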
Small Number of Features
Companies: Netflix
Difficulty: Easy
Frequency: Medium
Question
What would happen when the number of data points is far greater than the number of features?
Answer
With far more data points than features, overfitting is much less of a concern: parameter estimates have low variance, and simple models (e.g. linear models) often work well. The main risks shift to underfitting, where the model has too little capacity to capture the signal, and to longer training times due to the data volume.
Model Explanations
Companies: Netflix
Difficulty: Easy
Frequency: High
Question
Choose a few of the ML models you are most familiar with and introduce their advantages and disadvantages.
Answer
Take a look at the many algorithms on our ML Questions pages.
Ad Titles Similarity
Companies: Microsoft
Difficulty: Easy
Frequency: Low
Question
Given two different titles for two separate ads, how would you measure the similarity of the two titles?
Answer
There are a few ways we can approach this problem:

The simplest would be to count the frequency of each character in each title and then use a similarity metric like cosine similarity to determine how close the titles are to each other.

Another would be to convert each word to a vector (TF-IDF, word vectors, etc.) and use cosine similarity to determine the titles' similarity.
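Both approaches reduce to comparing vectors with cosine similarity; here is a minimal stdlib sketch using word-frequency vectors (the example titles and helper names are illustrative):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

title1 = "cheap flights to london"
title2 = "cheap london flights"

# Word-frequency vectors; a character-level variant would use Counter(title1).
sim = cosine_similarity(Counter(title1.split()), Counter(title2.split()))
```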
Limited Labelled Data
Companies: Microsoft
Difficulty: Medium
Frequency: Medium
Question
Assume you have to train a classification model with 50k labelled data points and 200k unlabelled data points. What would be your approach to training the classifier?
Answer
There are a few approaches we can take:

Use active learning to select the most informative data points from the unlabelled set. We can then manually label this small set and add it to our labelled data points.

Use clustering to group the unlabelled data. We can then measure how close each unlabelled point is to the labelled data and assign it the label associated with its closest cluster.
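The clustering-based idea can be sketched in miniature (1-D toy data with nearest-centroid pseudo-labelling; all names and values are illustrative, not from the original answer):

```python
# Labelled points (feature, label) and unlabelled features, in 1-D for brevity.
labelled = [(1.0, "A"), (1.2, "A"), (8.0, "B"), (8.3, "B")]
unlabelled = [1.1, 7.9, 8.5]

# Per-class centroids of the labelled data stand in for the clusters.
centroids = {}
for label in {"A", "B"}:
    vals = [x for x, y in labelled if y == label]
    centroids[label] = sum(vals) / len(vals)

# Pseudo-label each unlabelled point with the label of the nearest centroid;
# the pseudo-labelled points can then be added to the training set.
pseudo_labels = [
    min(centroids, key=lambda c: abs(x - centroids[c])) for x in unlabelled
]
```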
Negative Sampling
Companies: Microsoft
Difficulty: Easy
Frequency: Low
Question
Can you explain negative sampling and why it is useful?
Answer
Negative sampling is a technique used when training word2vec (skip-gram). Instead of updating the weights for every word in the vocabulary through a full softmax, we update the weights for the observed (positive) context word plus a small number of randomly sampled "negative" words. This turns one expensive multi-class update into a handful of binary classifications, which makes training on large vocabularies dramatically faster.
Generative Adversarial Networks
Companies: Microsoft, Snapchat
Difficulty: Medium
Frequency: Low
Question
What are Generative Adversarial Networks?
Answer
A Generative Adversarial Network consists of two networks trained against each other: a generator that maps random noise to synthetic examples, and a discriminator that tries to tell real examples from generated ones. The generator is trained to fool the discriminator, and the discriminator to catch the generator; at equilibrium, the generator produces samples that resemble the real data distribution.
Gaussian Mixture Models
Companies: Microsoft
Difficulty: Medium
Frequency: Low
Question
Explain Gaussian mixture models.
Answer
A Gaussian mixture model (GMM) is a probabilistic model that represents a dataset as a combination of multiple normally distributed components. GMMs are unsupervised and learn these components automatically (similar to k-means, but with soft, probabilistic assignments).
Calculate MLE
Companies: Microsoft
Difficulty: Easy
Frequency: Low
Question
How do you calculate the MLE?
Answer
Write down the likelihood of the observed data as a function of the model parameters, take the log (to turn products into sums), differentiate with respect to the parameters, set the derivatives to zero, and solve (or maximize numerically if no closed form exists). For example, the MLE of the mean of a Gaussian is the sample mean.
RMSE
Companies: Microsoft
Difficulty: Easy
Frequency: Low
Question
Can you explain and write out the formula of RMSE. When will RMSE be used?
Answer
RMSE = sqrt((1/n) * Σ (y_i − ŷ_i)²), where y_i is the true value and ŷ_i the prediction. It is used to evaluate regression models. Because errors are squared before averaging, RMSE penalizes large errors more heavily than MAE, and it is expressed in the same units as the target variable.
Output CNN
Companies: Microsoft
Difficulty: Medium
Frequency: Low
Question
How do you calculate the output size of a convolution layer? Write the complete formula.
Answer
The formula is:
Output Size = floor((W − K + 2P) / S) + 1
where:
W is the input size (width or height)
K is the kernel size
P is the padding
S is the stride
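A quick sanity check of the formula in code (the 224/7/3/2 example matches the first convolution of a typical ResNet, used here purely as an illustration):

```python
def conv_output_size(w, k, p, s):
    """Spatial output size of a convolution: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

# A 224x224 input with a 7x7 kernel, padding 3, stride 2 gives 112x112.
out = conv_output_size(224, 7, 3, 2)
```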
Transformer vs LSTM
Companies: Microsoft
Difficulty: Medium
Frequency: Medium
Question
What are the differences between a transformer and an LSTM?
Answer
An LSTM processes tokens sequentially, carrying information through recurrent hidden states; this limits parallelism and makes very long-range dependencies hard to learn. A transformer has no recurrence: it processes all tokens in parallel and uses self-attention (with positional encodings) to model dependencies between any pair of positions, which makes training faster and long-range context easier to capture.
Compare Models
Companies: Microsoft
Difficulty: Easy
Frequency: Low
Question
Given a set of ground truths and 2 models, how can you be confident that one model is better than the other?
Answer
We should first look at offline metrics relevant to the problem. Second, we should perform A/B testing and compare the online metrics, checking that any difference is statistically significant. In addition, we can take other factors into account, such as the latency and size of each model.
EM Algorithm
Companies: Microsoft
Difficulty: Medium
Frequency: Low
Question
Can you explain at a high level what is the EM Algorithm?
Answer
The EM algorithm is an iterative approach that cycles between two steps:

E-step. Estimate the expected values of the latent (missing) variables, given the current model parameters.

M-step. Re-estimate the model parameters by maximizing the likelihood, given the latent-variable estimates from the E-step.
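The two steps can be sketched for a 1-D two-component Gaussian mixture (a minimal illustration; the synthetic data, initial guesses, and iteration count are arbitrary):

```python
import math
import random

random.seed(0)
# Synthetic samples from two well-separated Gaussians (means ~0 and ~5).
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(5.0, 1.0) for _ in range(200)])

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu = [-1.0, 6.0]   # initial guesses for the component means
var = [1.0, 1.0]   # ... variances
pi = [0.5, 0.5]    # ... mixing weights

for _ in range(30):
    # E-step: responsibility of each component for each data point.
    resp = []
    for x in data:
        p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
        total = sum(p)
        resp.append([pk / total for pk in p])
    # M-step: re-estimate parameters by maximising the expected likelihood.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
        pi[k] = nk / len(data)
```

After a few iterations the estimated means recover the two underlying clusters.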
LDA & QDA
Companies: Microsoft
Difficulty: Medium
Frequency: Low
Question
Can you explain LDA and QDA?
Answer
LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) are based on Bayes' theorem. To classify a new data point we take the following steps:

Determine the distribution of the training data for each class.

Use Bayes' theorem to calculate the probability P(Y = class | X = data point).
When a linear decision boundary suffices we use LDA (which assumes a shared covariance across classes); when we need a quadratic (non-linear) decision boundary we use QDA (which fits a covariance per class). The more separable the classes and the closer the inputs are to normally distributed, the better these algorithms perform.
Random Forest
Companies: Amazon, Microsoft, Google
Difficulty: Easy
Frequency: High
Question
Can you explain the Random Forest algorithm?
Answer
A random forest is an ensemble of decision trees trained with bagging: each tree is fit on a bootstrap sample of the data, and at each split only a random subset of the features is considered. Predictions come from a majority vote (classification) or an average (regression) over the trees, which reduces the variance of any single tree.
Severe Data Imbalance
Companies: Microsoft
Difficulty: Easy
Frequency: Medium
Question
Suppose we have a binary-class test dataset of 1 million examples. There are 1,000 negative examples and the rest are positive, and the test accuracy of a binary classification model is 99.9%. Is this model well trained?
Answer
No, accuracy is not a good metric for imbalanced data. A better metric is the F1 score, which balances precision and recall. If the problem specifically requires focusing on recall or precision, use those instead.
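A quick numerical illustration using the numbers from the question, treating the rare negative class as the class of interest (the degenerate "predict everything positive" model is an assumption for illustration):

```python
# 1,000,000 test examples: 999,000 positive, 1,000 negative.
# A degenerate model predicts "positive" for every single example.
n_pos, n_neg = 999_000, 1_000

accuracy = n_pos / (n_pos + n_neg)   # 0.999 -- looks impressive

# Scored on the rare negative class (the one we presumably care about):
tp = 0        # negatives correctly flagged: none
fp = 0        # examples wrongly flagged as negative: none (nothing is flagged)
fn = n_neg    # every true negative is missed
recall = tp / (tp + fn)
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)
```

Accuracy is 99.9% even though the model never identifies a single negative example; recall and F1 on that class are both zero.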
Clustering Methods
Companies: Amazon
Difficulty: Easy
Frequency: Medium
Question
What are some common methods of clustering, and how do you measure the results of clustering?
Answer
Common methods include k-means, hierarchical (agglomerative) clustering, DBSCAN, and Gaussian mixture models. Clustering quality can be measured with internal metrics such as the silhouette score or the Davies-Bouldin index, or, when ground-truth labels are available, with external metrics such as the adjusted Rand index.
Cross Entropy and MLE
Companies: Google
Difficulty: Medium
Frequency: Low
Question
What is the relationship between cross entropy and MLE?
Answer
Maximum likelihood estimation (MLE) estimates the parameters of a probability distribution by maximizing the likelihood function, i.e. it finds the parameters that are most probable given the data. Cross-entropy, on the other hand, measures the amount of information needed to identify an example when it is drawn from the estimated distribution (our model) rather than the actual distribution of the data; we therefore want to minimize it.
In practice, minimizing the cross-entropy between the empirical data distribution and the model is equivalent to MLE: MLE frames model training (estimating the parameters), while cross-entropy serves as the loss function that measures how close we are to the ground truth.
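A tiny numerical check of this relationship for a Bernoulli model (the coin-flip data is illustrative):

```python
import math

# Observed coin flips (1 = heads). The MLE for Bernoulli(p) is the sample mean.
flips = [1, 1, 1, 0, 1, 0, 1, 1]
p_mle = sum(flips) / len(flips)   # 0.75

def cross_entropy(p):
    """Average negative log-likelihood of the data under Bernoulli(p):
    the cross entropy between the empirical distribution and the model."""
    return -sum(math.log(p) if x else math.log(1 - p) for x in flips) / len(flips)

# The MLE parameter is exactly the one that minimises the cross entropy;
# any other choice of p gives a strictly larger loss.
```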
No Gradient
Companies: Google
Difficulty: Medium
Frequency: Low
Question
If you don't use gradient descent or its variants, is there any way to optimize the model?
Answer
We can use simulated annealing, which approximates the global optimum of a given function but usually takes longer to converge; other derivative-free options include genetic/evolutionary algorithms and Nelder-Mead. If gradients are available but vanilla gradient descent is unsuitable, alternative gradient-based techniques include:

Levenberg-Marquardt Algorithm (LMA)

Non-linear Conjugate Gradient

Limited-memory BFGS (L-BFGS)
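As an illustration, here is a minimal simulated-annealing sketch in Python (the objective, proposal width, and cooling schedule are toy choices, not from the original answer):

```python
import math
import random

random.seed(1)

def f(x):
    return (x - 2.0) ** 2 + 1.0   # global minimum at x = 2

x = 10.0       # start far from the optimum
temp = 5.0     # initial temperature
for _ in range(5000):
    candidate = x + random.uniform(-0.5, 0.5)
    delta = f(candidate) - f(x)
    # Always accept downhill moves; accept uphill moves with probability
    # exp(-delta / T), which lets the search escape local minima early on.
    if delta < 0 or random.random() < math.exp(-delta / temp):
        x = candidate
    temp = max(temp * 0.999, 1e-3)   # geometric cooling schedule
```

No gradient of f is ever evaluated; the search relies only on function values and the temperature-controlled acceptance rule.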
Batch Norm
Companies: Amazon, Google
Difficulty: Easy
Frequency: Medium
Question
What is batch normalization and why is it useful?