Classification Evaluation Metrics

Companies: Amazon, Facebook, Microsoft

Difficulty: Easy

Frequency: High

Question

What metrics would you use for classification?

Answer

You should speak about accuracy first since it is the most obvious metric. Then you can mention that accuracy only works well with fairly balanced data. For imbalanced datasets you should mention precision, recall, F1 score and ROC curves.

 

To impress your interviewer, you should mention when you would use each of these metrics and their formulas:

  • Accuracy: when the dataset is balanced and not skewed

  • Precision: when we want to be confident that our positive predictions are correct (few false positives)

  • Recall: when we want to capture as many of the actual positive samples as possible (few false negatives)

  • F1: when we want to balance precision and recall
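The four metrics above can be computed directly from the confusion matrix counts. Here is a minimal pure-Python sketch; in practice you would likely reach for sklearn.metrics (accuracy_score, precision_score, recall_score, f1_score) instead:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy example: one false positive and one false negative.
acc, prec, rec, f1 = classification_metrics([1, 0, 1, 1, 0, 0],
                                            [1, 0, 0, 1, 1, 0])
```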

Word2Vec Types

Companies: Facebook, Microsoft

Difficulty: Easy

Frequency: Low

Question

What is n-gram, skip-gram and continuous bag of words?

Answer

An n-gram is a sequence of n consecutive words taken from a given sequence, e.g. given the sentence "The boy jumped over the chair", we have the following 3-grams: "The boy jumped", "boy jumped over", "jumped over the", "over the chair".

Word2vec uses skip-gram to try to predict the neighbours of a given word, e.g. if the sentence is "The boy jumped over the chair", then given the word "over" and a window size of 2, we would like to get (boy, jumped, the, chair).

Lastly, CBOW is also used in word2vec and does the opposite of skip-gram: given a sentence with a missing word, we try to predict that word, e.g. if the sentence is "The boy jumped [token] the chair", we hope to predict [token] to be "over".
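Both ideas are easy to show concretely. A minimal sketch (hypothetical helpers, not any library's API) that extracts n-grams and skip-gram (target, context) pairs from a tokenised sentence:

```python
def ngrams(tokens, n):
    """All windows of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgram_pairs(tokens, window):
    """(target, context) pairs: each word paired with its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the boy jumped over the chair".split()
trigrams = ngrams(sentence, 3)
# Context words for "over" with a window of 2:
over_ctx = [c for t, c in skipgram_pairs(sentence, 2) if t == "over"]
```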

 
Distributed Deep Learning

Companies: Facebook

Difficulty: Hard

Frequency: Low

Question

How would you use distributed training for deep learning models?

Answer

The main way to use distributed training is to partition a large dataset across machines; this is called data parallelism. Each partition runs on its own worker machine (server), so we should partition based on the number of workers we have. Each worker trains on its own partition of the data and generates a set of parameter updates, and the workers synchronise these updates between themselves.
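The synchronous data-parallel loop can be illustrated in a few lines. This is just a toy sketch of the core idea (shard the data, compute gradients per worker, average, apply one shared update); real systems would use something like PyTorch's DistributedDataParallel or Horovod:

```python
def shard(data, num_workers):
    """Partition the dataset evenly across workers."""
    return [data[i::num_workers] for i in range(num_workers)]

def local_gradient(w, part):
    """Gradient of mean squared error for a 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in part) / len(part)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
w, lr = 0.0, 0.05
shards = shard(data, num_workers=2)
for _ in range(100):
    grads = [local_gradient(w, s) for s in shards]  # parallel on each worker
    w -= lr * sum(grads) / len(grads)               # synchronised update
```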

 
Logistic Regression

Companies: Amazon, Apple, Google

Difficulty: Easy

Frequency: High

Question

Can you explain logistic regression in as much detail as you can?

Answer

 
K-means

Companies: Apple, Google

Difficulty: Easy

Frequency: High

Question

Can you explain the k-means algorithm?

Answer

 
 
Vanishing Gradient

Companies: Apple

Difficulty: Easy

Frequency: Medium

Question

What is the vanishing gradient problem? How would you solve this?

Answer

In deep neural networks, as we add more and more layers with certain activation functions, the gradient of the loss function tends towards 0 as we backpropagate it. A gradient of 0 means that our network does not learn anything, so vanishing gradients lead to models that are difficult to train.

In order to solve this we can:

  • Use activation functions which are not as prone to this problem, e.g. ReLU.

  • Use residual networks which provide residual connections to earlier layers so gradients can skip many activation functions.
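A quick numerical illustration of why ReLU helps: backpropagating through many sigmoid layers multiplies together many derivatives that are at most 0.25, so the gradient shrinks towards 0 with depth, while ReLU's derivative is 1 for positive inputs. A toy sketch under the simplifying assumption that every layer contributes only its activation derivative:

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25, at x = 0."""
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)

depth = 20
# Best case for sigmoid (derivative 0.25 at every layer) still vanishes:
sigmoid_chain = sigmoid_grad(0.0) ** depth
# ReLU's derivative is 1 for positive inputs, so the product does not decay:
relu_chain = 1.0 ** depth
```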

 
Imbalanced Data

Companies: Amazon, Apple

Difficulty: Easy

Frequency: Medium

Question

How would you handle imbalanced data in classification?

Answer

 
Gradient Descent Rescale

Companies: Amazon, Apple

Difficulty: Medium

Frequency: Low

Question

Should we rescale features before gradient descent? Why?

Answer

Yes, we should rescale our features because it brings their values within the same range so that no single feature dominates when computing the gradient. Along features with a small range the gradient descends quickly, and vice versa for features with a large range, so if the feature scales are very different we will oscillate inefficiently towards the optimum point.
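A common rescaling is standardization: subtract each feature's mean and divide by its standard deviation, so every feature has zero mean and unit variance. A minimal sketch (sklearn's StandardScaler does the same thing):

```python
def standardize(column):
    """Scale a feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return [(x - mean) / var ** 0.5 for x in column]

ages = [20, 30, 40]                 # small range
incomes = [20000, 30000, 40000]     # range 1000x larger
scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes)
# After scaling both features span the same range, so neither
# dominates the gradient computation.
```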

 
 
 
Bias And Variance

Companies: Amazon, Apple, Facebook, Google

Difficulty: Easy

Frequency: High

Question

What is the bias-variance tradeoff?

Answer

 
 
Cold Start

Companies: Apple

Difficulty: Easy

Frequency: Medium

Question

What is one issue that you need to be aware of in collaborative filtering problems?

Answer

One issue to be aware of is the cold-start problem. This arises in collaborative filtering when there is a new product: how do we know whether to recommend it to a user, when there are not yet any users who like this product to compare against?

 

In order to solve this we can:

  • Switch to content-based filtering

  • Can approximate the features of the new product based on previous similar products

  • Use Weighted Alternating Least Squares as an objective function

 
XGBoost

Companies: Apple

Difficulty: Medium

Frequency: High

Question

Can you explain the xgboost algorithm?

Answer

 
Word2Vec

Companies: Amazon, Apple

Difficulty: Medium

Frequency: Medium

Question

What’s the word2vec model? 

Answer

The idea behind word2vec is to convert a word into a vector representation where similar words have a similar representation. There are two methods to train a word2vec model: Continuous Bag of Words and Skip-Gram.

Word2vec uses skip-gram to try to predict the neighbours of a given word, e.g. if the sentence is "The boy jumped over the chair", then given the word "over" and a window size of 2, we would like to get (boy, jumped, the, chair).

 

CBOW is also used in word2vec and does the opposite of skip-gram: given a sentence with a missing word, we try to predict that word, e.g. if the sentence is "The boy jumped [token] the chair", we hope to predict [token] to be "over".

During training, the vectors of words that appear in similar contexts are brought closer together by the backpropagation step.
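Generating the CBOW training pairs makes the "opposite of skip-gram" point concrete: the model sees the surrounding words and must predict the middle word. A hypothetical helper (not the gensim/word2vec API):

```python
def cbow_pairs(tokens, window):
    """(context, target) pairs: surrounding words predict the middle word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        pairs.append((context, target))
    return pairs

sentence = "the boy jumped over the chair".split()
pairs = cbow_pairs(sentence, window=2)
# For the target "over", the context with window 2 is the words
# ['boy', 'jumped', 'the', 'chair'].
```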

 
 
CNNs

Companies: Apple

Difficulty: Easy

Frequency: Medium

Question

Can you explain how a CNN works?

Answer

 
Classifying Kitchen Utensils

Companies: Apple

Difficulty: Easy

Frequency: Low

Question

You've started on a new project and you've trained a new network to classify different kitchen utensils. You've achieved a certain accuracy, but now want to improve its performance. Would you begin by adding more layers to the network to achieve higher accuracy? If so, why? If not, why not?

Answer

 
 
Validation Accuracy

Companies: Apple

Difficulty: Medium

Frequency: Low

Question

You submitted a job for training. After running for 100 epochs, you see that the loss stopped decreasing soon after training started, but validation accuracy is close to 0. Why is that? What can you do to fix it?

Answer

 
Cross-Entropy Loss

Companies: Amazon, Apple

Difficulty: Medium

Frequency: Low

Question

Can you explain what is Cross-Entropy Loss?

Answer

Cross-entropy loss, also called log loss, measures the performance of a classification model whose output is a probability between 0 and 1. The loss grows as the predicted probability diverges from the actual label: a perfect model has a loss of 0, while a confident wrong prediction produces a very large loss (the loss is unbounded above, not capped at 1).
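For the binary case this is easy to write out. A minimal sketch of log loss (sklearn.metrics.log_loss computes the same quantity):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over the samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

good = log_loss([1, 0], [0.9, 0.1])  # confident and correct -> small loss
bad = log_loss([1, 0], [0.1, 0.9])   # confident and wrong -> large loss
```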

 
Softmax

Companies: Apple

Difficulty: Medium

Frequency: Low

Question

What is softmax?  Why aren't we using max to simply pick the most likely output instead?

Answer

Taking the max of a vector simply returns the largest value (or, with argmax, the most likely class) and discards everything else. Softmax instead produces a probability distribution over the vector's entries while preserving their ordering, so the most likely output is unchanged, but we also get a strength associated with each prediction. In addition, this "normalization" via the softmax function is differentiable, unlike a hard max, which is what allows the network to be trained with gradient-based methods.
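A minimal, numerically stable softmax sketch (subtracting the max does not change the output but avoids overflow in exp; scipy.special.softmax does this too):

```python
import math

def softmax(logits):
    """Exponentiate and normalise so the outputs sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sums to 1, and the largest logit keeps the largest probability,
# so argmax(softmax(x)) == argmax(x).
```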

 
Replace Softmax

Companies: Apple

Difficulty: Hard

Frequency: Low

Question

Why not replace softmax with a 0-1 standardization?

Answer

When you compare softmax to 0-1 standardization, the main difference is that standardization only depends on each value's position between the minimum and the maximum, ignoring how large the gaps between values actually are, while the softmax function is sensitive to these absolute differences.

Let us take a look at an example:

Suppose we have the vector [1, 2]; then the softmax function would return [0.27, 0.73] and 0-1 standardization would return [0, 1]. However, if the vector is [10, 20], softmax returns roughly [0.00005, 0.99995], indicating that we are very sure the second item is correct, whereas standardization would still return [0, 1], exactly as before.
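The example above can be reproduced in a few lines, assuming "0-1 standardization" means min-max scaling:

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def minmax(v):
    """0-1 (min-max) standardization: position between min and max."""
    lo, hi = min(v), max(v)
    return [(x - lo) / (hi - lo) for x in v]

mm_small, mm_big = minmax([1, 2]), minmax([10, 20])  # both [0.0, 1.0]
sm_small = softmax([1, 2])                            # ~[0.269, 0.731]
sm_big = softmax([10, 20])                            # ~[0.0000454, 0.9999546]
```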

 
Logits Formula

Companies: Apple

Difficulty: Medium

Frequency: Low

Question

Write the formula of logits. What is the range of input and output of logits function.

Answer

The logit function is logit(p) = ln(p / (1 - p)), the log of the odds p / (1 - p). Its input is a probability in (0, 1) and its output ranges over (-inf, inf); it is the inverse of the sigmoid function.
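A quick sketch confirming the formula and the sigmoid/logit inverse relationship:

```python
import math

def logit(p):
    """ln(p / (1 - p)): maps a probability in (0, 1) to the real line."""
    return math.log(p / (1 - p))

def sigmoid(x):
    """Maps the real line back into (0, 1); inverse of logit."""
    return 1 / (1 + math.exp(-x))

midpoint = logit(0.5)            # 0.0 -- even odds
roundtrip = sigmoid(logit(0.8))  # 0.8 -- they invert each other
```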

 
Cross-Entropy vs Softmax

Companies: Apple

Difficulty: Medium

Frequency: Low

Question

Can you explain the difference between cross entropy loss and softmax?

Answer

The softmax function turns its input vector into a probability distribution: values in [0, 1] that sum to 1.
 

Cross-entropy loss compares two distributions (predicted and actual) to determine how close they are.

You would use the softmax function as an activation layer in a neural network and then, depending on the use case, apply the cross-entropy loss to its output at the end to get the loss score.
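A minimal sketch of how the two fit together: softmax turns the network's raw logits into probabilities, then cross-entropy compares that distribution against the true class (here a one-hot label, indexed by its position):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_index])

logits = [2.0, 0.5, -1.0]       # raw network outputs
probs = softmax(logits)         # activation layer
loss = cross_entropy(probs, 0)  # loss against true class 0
```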