RNN, CNN, LSTM

Companies: Amazon, Apple

Difficulty: Easy

Frequency: Medium

Question

What are the differences between an RNN, a CNN, and an LSTM? What are the advantages of an LSTM compared to an RNN?

Answer

LSTMs are a modification of RNNs, so the more meaningful comparison is between CNNs and RNNs. A CNN transforms its input by sliding small kernels across it spatially: each kernel weight is shared across positions and is connected only to a local patch of the input rather than to every input. An RNN instead shares its weights through time: at each time step the current input is multiplied by one set of weights, the previous hidden state is multiplied by another set, and the two are combined to produce the new hidden state; this repeats for every step of the sequence.

Because of these differences, CNNs are better able to recognise patterns across space (e.g. images), while RNNs are better suited to sequential/temporal problems (e.g. text prediction, forecasting).

LSTMs are modified versions of vanilla RNNs. They add gates that control what information is written to, kept in, or "forgotten" from an internal cell state. Because of this gated cell state, LSTMs largely avoid the vanishing-gradient problems that tend to occur in vanilla RNNs, so they can learn much longer-range dependencies.
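
To make the weight-sharing and gating difference concrete, here is a minimal numpy sketch of a single vanilla RNN step next to a single LSTM step. The function names, shapes and layout of the weight matrices are illustrative assumptions, not taken from any particular library.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # Vanilla RNN: the same W_x and W_h are reused at every time step.
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps [x_t, h_prev] to the four gate pre-activations at once.
    z = np.concatenate([x_t, h_prev]) @ W + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    c = f * c_prev + i * np.tanh(g)                # cell state: what to keep / what to add
    h = o * np.tanh(c)                             # hidden state passed to the next step/layer
    return h, c

The forget gate is what lets gradients flow through the cell state largely unchanged, which is the intuition behind why LSTMs suffer far less from vanishing gradients than vanilla RNNs.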

Fully Connected vs Convolutional Layer

Companies: Apple

Difficulty: Easy

Frequency: Medium

Question

What is the difference between a fully connected layer and a convolutional layer?

Answer

 
SVMs

Companies: Amazon, Apple, Google

Difficulty: Easy

Frequency: Medium

Question

What are SVMs? What is a kernel in SVMs?

Answer

 
Regularization

Companies: Amazon, Apple, Facebook, Google, Netflix

Difficulty: Easy

Frequency: High

Question

What are some regularization methods you are aware of?

Answer

 
Cross Validation

Companies: Apple

Difficulty: Easy

Frequency: Medium

Question

Can you explain k-fold cross-validation?

Answer

Cross-validation tends to be used because it gives a less biased estimate of model accuracy than a single train/test split.

The algorithm is as follows:

  1. Shuffle the dataset

  2. Determine how many groups you want to split your dataset into; this is k

  3. Split the dataset into k groups

  4. Choose one group to act as the held-out test set

  5. Train the model on the remaining k-1 groups

  6. Evaluate the trained model on the test group

  7. Repeat steps 4-6 so that each group is used as the test set exactly once, then average the k scores to get the final model score (a minimal implementation follows this list)
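
A minimal Python sketch of those steps, using numpy only. X and y are assumed to be numpy arrays, and train_and_score is a placeholder you would replace with your own model fitting and evaluation.

import numpy as np

def k_fold_scores(X, y, k, train_and_score, seed=0):
    # Steps 1-3: shuffle the indices and split them into k roughly equal groups.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)

    scores = []
    for i in range(k):
        # Step 4: one group is the test set; step 5: the rest form the training set.
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Step 6: evaluate the model trained on the training folds.
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    # Step 7: the final estimate is the average score across the k iterations.
    return float(np.mean(scores))

In practice, scikit-learn's KFold and cross_val_score implement the same procedure out of the box.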
 
Multi-class SVM

Companies: Apple, Microsoft

Difficulty: Easy

Frequency: Low

Question

How would you perform multi-class classification with an SVM?

Answer

 
 
Regularization In Decision Trees

Companies: Amazon, Apple

Difficulty: Medium

Frequency: Low

Question

What methods of regularization are available to tree-based models?

Answer

  1. Shrinkage: scale each tree's contribution by a learning rate in the update step to slow down the learning

  2. Stochastic gradient boosting: fit each base learner on a subsample of the training set drawn at random without replacement

  3. We can limit the minimum number of observations allowed in a leaf

  4. We can penalize complex trees, for example by limiting their depth or pruning them (the sketch after this list maps these options to common gradient-boosting hyperparameters)
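
For example, in scikit-learn's GradientBoostingClassifier the four ideas above correspond roughly to the hyperparameters below. The parameter names are scikit-learn's; the values are arbitrary illustrations and should be tuned with cross-validation.

from sklearn.ensemble import GradientBoostingClassifier

# Illustrative values only; tune them on your own data.
model = GradientBoostingClassifier(
    learning_rate=0.05,    # 1. shrinkage: scale down each tree's contribution
    subsample=0.8,         # 2. stochastic gradient boosting: random subsample per tree
    min_samples_leaf=20,   # 3. lower bound on observations per leaf
    max_depth=3,           # 4. cap tree complexity directly...
    ccp_alpha=0.0,         #    ...or penalize it via cost-complexity pruning
    n_estimators=200,
)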

 
Loss - KL Divergence

Companies: Amazon, Google

Difficulty: Hard

Frequency: Low

Question

What is KL divergence and how does it work as a loss function?

Answer

KL divergence measures how one probability distribution differs from another (it is not symmetric, so it is not a true distance). When we use it as a loss function, we compare the probability distribution returned by our model, P(x), with the ground-truth probability distribution Q(x).

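For two discrete distributions P and Q, the standard definition (written here from the usual formulation rather than from the original screenshot) is:

D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

When the ground-truth distribution is fixed, minimizing the KL divergence between it and the model's predictions is equivalent to minimizing cross-entropy, which is why the two losses are often used interchangeably in classification.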
 
Loss - CBOW and Skip-Gram

Companies: Apple, Google

Difficulty: Medium

Frequency: Low

Question

What is the loss function for the skip-gram model and the CBOW model in word2vec?

Answer

Some loss functions we can use in word2vec are:

  • Softmax: use the softmax function to compute the probability of predicting the output word given the input word (or surrounding words).

  • Hierarchical Softmax: faster to compute than the full softmax. We build a binary tree whose leaves are the words; each internal node holds the probability of branching to each of its children, and the probability of a word is the product of the branch probabilities along the path from the root to that leaf.

  • Negative Sampling: tries to maximize the probability of the correct output word and minimize the probability of randomly (and incorrectly) selected words. In other words, we push the encodings for similar words together and dissimilar words further away.
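
For concreteness, the standard skip-gram formulations (following Mikolov et al.; here v and v' denote input and output embedding vectors, w_I the input/centre word, w_O the output/context word, V the vocabulary size, and P_n the noise distribution for the k negative samples) are, with full softmax and with negative sampling:

p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}, \qquad L_{\text{softmax}} = -\log p(w_O \mid w_I)

L_{\text{neg}} = -\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) - \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]

CBOW uses the same losses but swaps the roles: the input vector is the average of the context-word vectors and the target is the centre word.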

 
CNN vs NN

Companies: Amazon

Difficulty: Easy

Frequency: Medium

Question

What is the difference between a CNN and an NN?

Answer

 
Logistic Regression vs Linear Regression

Companies: Amazon

Difficulty: Easy

Frequency: Medium

Question

What are the differences between logistic and linear regression?

Answer

 
Bagging vs Boosting

Companies: Amazon

Difficulty: Easy

Frequency: High

Question

Can you explain bagging and boosting? Does boosting require the same training set for each iteration?

Answer

 
 
Dimensionality Reduction

Companies: Amazon, Microsoft

Difficulty: Medium

Frequency: High

Question

Talk about some methods of dimensionality reduction and explain PCA.

Answer

GD vs SGD vs Momentum

Companies: Amazon

Difficulty: Medium

Frequency: High

Question

Explain the difference between gradient descent, stochastic gradient descent and momentum.

Answer

In GD we pass through the entire dataset before we update the weights. In SGD (in the strictest definition) we use only one data point per update; because this is very noisy, in practice we usually update the weights using a small subset of the data, which is sometimes called minibatch gradient descent (MGD). Momentum is an extension of GD/SGD that keeps a decaying running average of past gradients and uses it in the update, which smooths out the noise and accelerates optimization along directions that are consistently downhill.
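
A minimal numpy sketch of the three update rules. grad_fn, lr, batch_size and beta are placeholder names, and the momentum form shown is the common "heavy ball" variant; other variants (e.g. Nesterov) differ slightly.

import numpy as np

def gd_step(w, X, y, grad_fn, lr):
    # Full-batch GD: one update per pass over the whole dataset.
    return w - lr * grad_fn(w, X, y)

def sgd_step(w, X, y, grad_fn, lr, batch_size=32, rng=None):
    # (Minibatch) SGD: update from a random subset, so steps are noisier but cheaper.
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return w - lr * grad_fn(w, X[idx], y[idx])

def momentum_step(w, v, X, y, grad_fn, lr, beta=0.9):
    # Momentum: accumulate an exponentially decaying average of past gradients
    # so the update accelerates along consistently downhill directions.
    v = beta * v + grad_fn(w, X, y)
    return w - lr * v, v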

 
Min Loss In Linear Regression

Companies: Amazon

Difficulty: Medium

Frequency: Medium

Question

How do you minimize the loss function when training a linear regression model?

Answer

 
 
Random Forest vs GBT

Companies: Amazon

Difficulty: Easy

Frequency: Medium

Question

What are random forest and gradient boosting trees?

Answer

 
 
 
k-fold Bias/Variance

Companies: Amazon

Difficulty: Easy

Frequency: Low

Question

If we use 10 folds versus 2 folds for k-fold cross-validation, which will have higher variance? Which will have higher bias?

Answer

The first (10 folds) would have higher variance and the second (2 folds) would have higher bias. With 10 folds, each model is trained on 90% of the data, so its performance is close to that of a model trained on the full dataset and the estimate has low bias; however, the ten training sets overlap heavily and each test fold is small, so the fold scores are highly correlated and the averaged estimate has high variance. With 2 folds, each model only sees half of the data, so the performance estimate is more pessimistic (higher bias), but the two training sets do not overlap and each test fold is large, which lowers the variance of the estimate.

 
LSTM

Companies: Amazon

Difficulty: Easy

Frequency: Medium

Question

Describe LSTM

Answer

 
Pooling

Companies: Amazon

Difficulty: Easy

Frequency: Medium

Question

Tell me about pooling and what is the role of pooling from a practical perspective? (Hint: besides reducing the number of parameters, the answer the interviewer usually wants is translational invariance.)

Answer

Pooling (commonly max or average pooling) is a technique in CNNs that downsamples (reduces the size of) the feature maps by summarizing the values within each region of the feature map.

The benefits of pooling are:

  • Reduce the spatial dimensions of the feature maps, and therefore the computation and parameters needed in later layers

  • Provide a degree of translational invariance, i.e. a feature can still be detected even if its position in the image shifts slightly
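
A minimal numpy sketch of 2x2 max pooling with stride 2; the function name and example values are made up for illustration.

import numpy as np

def max_pool_2d(feature_map, size=2, stride=2):
    # Slide a size x size window over the feature map and keep only the max
    # in each window, shrinking the spatial dimensions by the stride.
    h, w = feature_map.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[1, 3, 2, 0],
              [4, 6, 1, 1],
              [0, 2, 9, 7],
              [1, 0, 3, 8]])
print(max_pool_2d(x))   # [[6. 2.] [2. 9.]]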

 
Attention Networks

Companies: Amazon, Microsoft

Difficulty: Hard

Frequency: Medium

Question

Can you explain attention networks? What is self-attention?

Answer

In a vanilla neural network, inputs flow through one hidden layer after another: each layer's output is passed through an activation function and becomes the input to the next layer, and nothing else is looked at. An attention mechanism breaks this strictly linear flow: at a given step, the layer computes a weighted combination over a whole set of hidden states (for example, all of the encoder's states in a sequence-to-sequence model), where the weights reflect how relevant each state is to the current step. Self-attention applies the same idea within a single sequence: each position attends over all the other positions of the same input rather than over the states of a separate encoder.

To better understand this difference, think of translating a sentence: self-attention models the relationships between the words within one sentence (in a single language), while encoder-decoder (cross-)attention models the relationships between the source sentence and the target sentence. As a result, cross-attention is at the core of classic encoder-decoder models, while self-attention is the basic building block of transformer architectures.
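
A minimal numpy sketch of the core computation, scaled dot-product attention as used in transformers. For self-attention, Q, K and V are all projections of the same layer's inputs; in an encoder-decoder model, Q would come from the decoder and K, V from the encoder. Shapes and names here are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Each query scores every key; the scores become weights over the values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)         # attention weights sum to 1 per query
    return weights @ V                         # weighted sum of values

# Self-attention: queries, keys and values all come from the same sequence.
X = np.random.randn(5, 8)                      # 5 tokens, 8-dim embeddings
W_q, W_k, W_v = (np.random.randn(8, 8) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                               # (5, 8)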

 
BERT

Companies: Amazon

Difficulty: Hard

Frequency: Low

Question

Can you explain the basis of BERT?

Answer

BERT is an architecture made up of multiple stacked transformer encoders. Each input token is represented by combining three embeddings: a token/word embedding, a segment embedding and a position embedding; the encoder input is simply the element-wise sum of the three.
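
A minimal numpy sketch of how that input representation is formed. The vocabulary size, dimensions, lookup tables and token ids here are toy stand-ins (BERT-base actually uses roughly a 30k-token vocabulary, 512 positions and 768-dimensional embeddings, all learned).

import numpy as np

vocab_size, max_len, n_segments, d_model = 1000, 128, 2, 16

# Randomly initialized lookup tables; in BERT these are learned parameters.
token_emb    = np.random.randn(vocab_size, d_model) * 0.02
position_emb = np.random.randn(max_len, d_model) * 0.02
segment_emb  = np.random.randn(n_segments, d_model) * 0.02

token_ids   = np.array([2, 57, 912, 3])   # e.g. [CLS] w1 w2 [SEP] in a toy vocabulary
segment_ids = np.array([0, 0, 0, 0])      # all tokens belong to "sentence A"
positions   = np.arange(len(token_ids))

# The encoder input is the element-wise sum of the three embeddings.
inputs = token_emb[token_ids] + position_emb[positions] + segment_emb[segment_ids]
print(inputs.shape)   # (4, 16)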

BERT is then pre-trained on two unsupervised tasks, which means we can use a lot of data since it does not have to be labelled:

  1. Masked Language Modeling: some tokens are masked and the model is trained to predict them in a bi-directional manner, i.e. it uses both the left context and the right context of each masked word.

  2. Next Sentence Prediction: given two sentences A and B, the model is trained to determine whether B is the sentence that actually follows A.

 
Loss Functions

Companies: Amazon

Difficulty: Easy

Frequency: Medium

Question

What is a loss function? What is a cost function?

Answer

A loss function is calculated for a single training example. A cost function, on the other hand, is the average loss over the entire training dataset. Some examples of loss functions are Mean Squared Error, Mean Absolute Error, Hinge Loss and Cross-Entropy Loss.
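
A small numpy example of the distinction, using squared error; the names and values are illustrative.

import numpy as np

def squared_error_loss(y_true, y_pred):
    # Loss: computed for a single training example (here vectorized over examples).
    return (y_true - y_pred) ** 2

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

per_example_losses = squared_error_loss(y_true, y_pred)   # [0.25 0.25 0.   1.  ]
cost = per_example_losses.mean()                          # MSE over the dataset: 0.375
print(per_example_losses, cost)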