Generative Model

Companies: Amazon

Difficulty: Medium

Frequency: Low

Question

Can you explain what a generative model is?

Answer

Generative models model the joint probability P(X, Y), where X is the data and Y is the labels, while discriminative models model the conditional probability P(Y | X). Because a generative model learns the full data distribution, it can sample new data points from it. For example, given a dataset of cars, a generative model could generate new car examples, whereas a discriminative model could only distinguish a car from a bicycle.
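
As a concrete sketch, naive Bayes is generative (it models P(X | Y) and P(Y), hence the joint distribution) while logistic regression is discriminative (it models P(Y | X) directly). A minimal comparison using scikit-learn on made-up data might look like this (attribute names assume a recent scikit-learn version):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models P(X | Y) and P(Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y | X) directly

# Toy data: two features, two classes (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

gen = GaussianNB().fit(X, y)
disc = LogisticRegression().fit(X, y)

# Both can classify...
print(gen.predict(X[:5]), disc.predict(X[:5]))

# ...but only the generative model gives us per-class distributions (means/variances)
# from which we could sample a new point for a given class.
new_class1_point = rng.normal(gen.theta_[1], np.sqrt(gen.var_[1]))
print(new_class1_point)
```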

 
Inconsistent Metrics

Companies: Amazon

Difficulty: Medium

Frequency: Low

Question

How would you solve an inconsistency between an offline metric and online A/B testing?

Answer

The problem is most likely that the offline metric is not representative of the online metric, or vice versa. If the online metric is the correct measure of success, we should select an offline metric that is better aligned with it. On the other hand, if the online metric is not representative of the problem the model is trying to solve, we should update the online metric.

 
BERT vs Transformer

Companies: Amazon

Difficulty: High

Frequency: Medium

Question

What is the difference between BERT and transformers?

Answer

The transformer is an encoder-decoder neural network architecture that relies on attention (which determines how important each word in the sequence is to the others) and contains no recurrent connections. BERT, on the other hand, uses only the encoder part of the transformer as the main component of its architecture.
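
As a rough illustration, assuming the Hugging Face transformers package is installed (the first call downloads pre-trained weights), we can see that BERT is essentially a stack of transformer encoder layers:

```python
from transformers import BertModel

# Load a pre-trained BERT model (encoder-only).
model = BertModel.from_pretrained("bert-base-uncased")

# The model is a stack of transformer encoder layers; there is no decoder module.
print(type(model.encoder))        # BertEncoder
print(len(model.encoder.layer))   # 12 encoder layers for bert-base
print(hasattr(model, "decoder"))  # False: BERT has no decoder
```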

 
SGD vs Adam

Companies: Amazon

Difficulty: Medium

Frequency: Low

Question

What is the difference between stochastic gradient descent and Adam?

Answer

In SGD (in the strictest definition) we use only one data point at a time to compute the gradient and update the weights; because this is very noisy, in practice we usually update the weights using a mini-batch of data. Adam, on top of this, uses momentum and adaptive per-parameter learning rates, which typically lets it converge faster than vanilla (stochastic) gradient descent.
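
A minimal sketch of the two update rules in NumPy (simplified; the hyperparameter values shown are the common defaults):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain (stochastic) gradient descent: step in the direction of the negative gradient.
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum: exponential moving average of the gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Adaptive learning rate: exponential moving average of the squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized averages (t is the step count, starting at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```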

 
 
Large Number of Features

Companies: Netflix

Difficulty: Easy

Frequency: Medium

Question

What would happen when the number of features is far greater than the number of data points?

Answer

This can lead to overfitting. We can reduce it with regularization techniques (L1/L2 regularization), dimensionality reduction, or feature selection.
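
For instance, L1 regularization tends to drive many coefficients to exactly zero, effectively performing feature selection. A small sketch with scikit-learn on synthetic data where features vastly outnumber samples (the values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

# 50 samples but 5,000 features: far more features than data points.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# L1 regularization (Lasso) shrinks most coefficients to exactly zero.
model = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))  # only a handful survive
```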

 
 
Model Explanations

Companies: Netflix

Difficulty: Easy

Frequency: High

Question

Choose a few of the ML models you are most familiar with and introduce their advantages and disadvantages.

Answer

Take a look at the many algorithms on our ML Questions pages.

 
Ad Titles Similarity

Companies: Microsoft

Difficulty: Easy

Frequency: Low

Question

Given two different titles for two separate ads, how would you measure the similarity of the two titles?

Answer

There are a few ways we can approach this problem:

  • The simplest would be to count the frequency of each character in each title and then use a measure such as cosine similarity to determine how close the titles are to each other.

  • Another would be to convert each title into a vector representation (TF-IDF, word embeddings, etc.) and use cosine similarity to determine their similarity, as shown in the sketch below.
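
A minimal sketch of the second approach, using scikit-learn's TF-IDF vectorizer (the example titles are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical ad titles.
title_a = "Buy cheap running shoes online"
title_b = "Discount running shoes for sale online"

# Represent both titles as TF-IDF vectors over a shared vocabulary.
vectors = TfidfVectorizer().fit_transform([title_a, title_b])

# Cosine similarity is 1.0 for identical titles and 0.0 for titles sharing no terms.
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"similarity: {score:.2f}")
```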

 
Unlabelled Data

Companies: Microsoft

Difficulty: Medium

Frequency: Medium

Question

Assume that you have to train a classification model and you have 50k labelled data points and 200k unlabelled data points. What would be your approach to training the classifier?

Answer

There are a few approaches we can take:

  • Use active learning to identify the most informative data points in the unlabelled dataset. We can then manually label this small set and add it to our labelled data points.

  • Use clustering to group the unlabelled data together with the labelled data. We can then check how close each unlabelled data point is to the labelled points and propagate the label of its closest cluster to it (pseudo-labelling). A related self-training sketch follows below.
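
A related semi-supervised option is self-training, where a classifier trained on the labelled data iteratively pseudo-labels the unlabelled points it is most confident about. A minimal sketch with scikit-learn, assuming unlabelled points are marked with -1 (the toy data, base model, and threshold are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# X: all feature vectors; y: real labels for a minority of points, -1 for the unlabelled ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))  # toy stand-in for the real data
y = np.where(rng.random(1000) < 0.2, (X[:, 0] > 0).astype(int), -1)

base = LogisticRegression()
# Iteratively add unlabelled points whose predicted probability exceeds the threshold.
model = SelfTrainingClassifier(base, threshold=0.9).fit(X, y)
print(model.predict(X[:5]))
```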

 
Negative Sampling

Companies: Microsoft

Difficulty: Easy

Frequency: Low

Question

Can you explain negative sampling and why it is useful?

Answer

 
Generative Adversarial Networks

Companies: Microsoft, Snapchat

Difficulty: Medium

Frequency: Low

Question

What are Generative Adversarial Networks?

Answer

 
Gaussian Mixture Models

Companies: Microsoft

Difficulty: Medium

Frequency: Low

Question

Explain Gaussian mixture models.

Answer

A Gaussian mixture model (GMM) is a probabilistic model that represents a dataset as a combination of several normally distributed components (regions). GMMs are unsupervised and learn these components automatically, similar to k-means but with soft, probabilistic cluster assignments.
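
A minimal sketch with scikit-learn, fitting a two-component GMM to toy data (all values are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data drawn from two different normal distributions.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 0.5, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2).fit(X)
print(gmm.means_.ravel())        # recovered component means (roughly 0 and 5)
print(gmm.predict_proba(X[:3]))  # soft assignments: probability of each component
```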

 
Calculate MLE

Companies: Microsoft

Difficulty: Easy

Frequency: Low

Question

How to calculate MLE?

Answer

 
RMSE

Companies: Microsoft

Difficulty: Easy

Frequency: Low

Question

Can you explain and write out the formula for RMSE? When would RMSE be used?

Answer

 
Output CNN

Companies: Microsoft

Difficulty: Medium

Frequency: Low

Question

How to calculate the output size of the convolution layer? Write the complete formula.

Answer

The formula is:

 

Output Size = [(W−K+2P)/S]+1.

 

where:

W is the input size (the width or height of the input volume)

K is the kernel (filter) size

P is the padding

S is the stride

and the square brackets denote rounding down (floor division).
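
A quick sanity check of the formula in Python (the example numbers are arbitrary; the optional PyTorch verification assumes torch is installed):

```python
def conv_output_size(w, k, p, s):
    # [(W - K + 2P) / S] + 1, with floor division
    return (w - k + 2 * p) // s + 1

# Example: 224x224 input, 7x7 kernel, padding 3, stride 2 -> 112
print(conv_output_size(224, 7, 3, 2))

# Optional check against PyTorch:
# import torch
# x = torch.zeros(1, 3, 224, 224)
# conv = torch.nn.Conv2d(3, 8, kernel_size=7, padding=3, stride=2)
# print(conv(x).shape)  # torch.Size([1, 8, 112, 112])
```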

 
 
Compare Models

Companies: Microsoft

Difficulty: Easy

Frequency: Low

Question

Given a set of ground truths and two models, how can you be confident that one model is better than the other?

Answer

We should first look at offline metrics that are relevant to the problem and check whether the difference between the two models is statistically significant (for example with a paired test or bootstrap resampling). Second, we can perform A/B testing and compare the online metrics to see which model performs better in production. In addition, we can take other factors into account, such as latency and model size.
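
A minimal sketch of a bootstrap comparison of two models' accuracy on the same ground-truth set (the array names and the number of resamples are illustrative):

```python
import numpy as np

def bootstrap_accuracy_wins(y_true, pred_a, pred_b, n_boot=10_000, seed=0):
    """Estimate how often model A beats model B on resampled test sets."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample the test set with replacement
        acc_a = np.mean(pred_a[idx] == y_true[idx])
        acc_b = np.mean(pred_b[idx] == y_true[idx])
        wins += acc_a > acc_b
    return wins / n_boot  # close to 1.0 -> strong evidence A is better

# Usage (hypothetical arrays):
# print(bootstrap_accuracy_wins(y_true, model_a_preds, model_b_preds))
```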

 
EM Algorithm

Companies: Microsoft

Difficulty: Medium

Frequency: Low

Question

Can you explain, at a high level, what the EM algorithm is?

Answer

 
LDA & QDA

Companies: Microsoft

Difficulty: Medium

Frequency: Low

Question

Can you explain LDA and QDA?

Answer

LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) are based on Bayes' theorem. To classify a new data point, we take the following steps:

  1. Estimate the distribution of the training data for each class (assumed to be Gaussian).

  2. Use Bayes' theorem to calculate the probability P(Y = class | X = data point) and assign the most probable class.

LDA assumes all classes share the same covariance matrix, which gives a linear decision boundary; QDA fits a separate covariance matrix per class, which gives a quadratic (non-linear) decision boundary. The more separable the classes and the closer the inputs are to normally distributed, the better these algorithms perform. A small sketch follows below.
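
A minimal comparison with scikit-learn (the data is synthetic and illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Toy data: class 1 has a different covariance structure than class 0.
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], 200)
X1 = rng.multivariate_normal([2, 2], [[3, 1], [1, 2]], 200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance -> linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance -> quadratic boundary
print(lda.score(X, y), qda.score(X, y))
```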

 
Random Forest

Companies: Amazon, Microsoft, Google

Difficulty: Easy

Frequency: High

Question

Can you explain the Random Forest algorithm?

Answer

 
Severe Data Imbalance

Companies: Microsoft

Difficulty: Easy

Frequency: Medium

Question

Suppose we have a binary classification test dataset of 1 million examples, of which 1,000 are negative and the rest are positive. The test accuracy of a binary classification model is 99.9%. Is this model well trained?

Answer

No, accuracy is not a good metric for imbalanced data: a model that simply predicts every example as positive would already achieve 99.9% accuracy on this test set. A better metric would be the F1 score, which balances precision and recall. If precision or recall matters more for the use case, we should focus on that metric instead.
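
A quick illustration of the point with scikit-learn metrics (the "model" below always predicts the positive class):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1,000,000 test examples: 1,000 negatives (0) and 999,000 positives (1).
y_true = np.concatenate([np.zeros(1_000, dtype=int), np.ones(999_000, dtype=int)])
y_pred = np.ones_like(y_true)  # a useless model that always predicts "positive"

print(accuracy_score(y_true, y_pred))         # 0.999 -> looks great
print(f1_score(y_true, y_pred, pos_label=0))  # 0.0   -> completely misses the minority class
```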

 
Clustering Methods

Companies: Amazon

Difficulty: Easy

Frequency: Medium

Question

What are some common methods of clustering, and how do you measure the results of clustering?

Answer

 
Cross Entropy and MLE

Companies: Google

Difficulty: Medium

Frequency: Low

Question

What is the relationship between cross entropy and MLE?

Answer

Maximum likelihood estimation (MLE) estimates the parameters of a probability distribution by maximizing the likelihood function, i.e. it finds the parameters under which the observed data is most probable. Cross-entropy, on the other hand, measures the amount of information needed to encode samples from the true data distribution using the estimated distribution (our model); we therefore want to minimize it.

The two are directly linked: minimizing the cross-entropy between the empirical data distribution and the model's predicted distribution is equivalent to maximizing the likelihood of the data under the model. So MLE is the estimation principle we use for training (estimating the parameters), while cross-entropy serves as the loss function that tells us how close we are to the ground truth.
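
A tiny numerical check that the cross-entropy loss equals the negative log-likelihood for a classifier (the labels and predicted probabilities are made up):

```python
import numpy as np

# Predicted class probabilities for 3 examples and their true labels.
probs = np.array([[0.7, 0.3],
                  [0.2, 0.8],
                  [0.9, 0.1]])
labels = np.array([0, 1, 0])

# Negative log-likelihood of the labels under the model (what MLE maximizes, negated)...
nll = -np.sum(np.log(probs[np.arange(3), labels]))

# ...equals the summed cross-entropy between the one-hot label distribution and the model.
one_hot = np.eye(2)[labels]
cross_entropy = -np.sum(one_hot * np.log(probs))
print(nll, cross_entropy)  # identical
```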

 
No Gradient

Companies: Google

Difficulty: Medium

Frequency: Low

Question

If you don't use gradient descent or its derivatives, is there any way to optimize the model?

Answer

We can use simulated annealing, which approximates the global optimum of a given function but usually takes longer to converge. We can also use gradient-based optimization techniques that are not plain gradient descent (a small SciPy sketch follows the list), such as:

  • Levenberg-Marquardt Algorithm (LMA)

  • Nonlinear Conjugate Gradient

  • Limited-memory BFGS 
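
A minimal sketch using SciPy, minimizing a toy loss both with a gradient-free method (Nelder-Mead) and with limited-memory BFGS (the objective and starting point are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def loss(w):
    # Toy objective standing in for a model's training loss.
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

w0 = np.zeros(2)

# Gradient-free simplex method: no derivatives needed at all.
res_nm = minimize(loss, w0, method="Nelder-Mead")

# Limited-memory BFGS: gradient-based, but not gradient descent.
res_lbfgs = minimize(loss, w0, method="L-BFGS-B")

print(res_nm.x, res_lbfgs.x)  # both converge near [3, -1]
```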

 
 
Batch Norm

Companies: Amazon, Google

Difficulty: Easy

Frequency: Medium

Question

What is batch normalization and why is it useful?

Answer