RNN, CNN, LSTM
Companies: Amazon, Apple
Difficulty: Easy
Frequency: Medium
Question
What are the differences between RNN, CNN, and LSTM? What are the advantages of LSTM compared to RNN?
Answer
LSTMs are a modification of RNNs, so the core comparison is between CNNs and RNNs. A CNN slides small kernels over the input and applies them spatially, which means each weight in a kernel is connected to only a local patch of the input rather than to every input. An RNN, by contrast, applies its weights through time: at each time step the current input is multiplied by one set of weights, the previous hidden state is multiplied by a different set, and the two are combined to produce the new output; this repeats for every step in the sequence.
Because of these differences, CNNs are better able to recognise patterns across space (images) while RNNs are used more for temporal problems (text prediction, forecasting).
LSTMs are modified versions of vanilla RNNs. They add gates that control whether information should be added to or "forgotten" from the cell state. These gates also mitigate the vanishing gradient problem that tends to occur in vanilla RNNs.
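To make the weight sharing through time concrete, here is a minimal NumPy sketch of a single vanilla RNN layer (the dimensions, initialization, and input sequence are illustrative, not taken from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8

# Hypothetical weights for one recurrent layer.
W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input -> hidden
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # previous hidden -> hidden
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One RNN time step: combine the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# The same weights are reused at every time step.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a toy sequence of 5 inputs
    h = rnn_step(x_t, h)
print(h.shape)  # (8,)
```

Note that `W_x` and `W_h` are applied at every time step, unlike a CNN kernel, which is applied at every spatial position.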
Fully Connected vs Convolutional Layer
Companies: Apple
Difficulty: Easy
Frequency: Medium
Question
What is the difference between a fully connected layer and a fully convolutional layer?
Answer
SVMs
Companies: Amazon, Apple, Google
Difficulty: Easy
Frequency: Medium
Question
What are SVMs? What is a kernel in SVMs?
Answer
Regularization
Companies: Amazon, Apple, Facebook, Google, Netflix
Difficulty: Easy
Frequency: High
Question
What are some regularization methods you are aware of?
Answer
Cross Validation
Companies: Apple
Difficulty: Easy
Frequency: Medium
Question
Can you explain k-fold cross validation?
Answer
Cross-validation tends to be used because it gives a less biased estimate of model accuracy.
The algorithm is as follows:

1. Shuffle the dataset.

2. Determine how many groups you want to split your dataset into; this is the k.

3. Split the dataset into k groups.

4. Choose one group; this will be the test dataset.

5. The remaining groups will be used to train the model.

6. Evaluate the model on the test dataset.

7. Repeat steps 4–6 for each group, and average the score of each iteration as your final model score.
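The steps above can be sketched in NumPy; the `train_and_score` callback and the mean-predicting toy model below are hypothetical stand-ins for a real model:

```python
import numpy as np

def k_fold_scores(X, y, k, train_and_score):
    """k-fold cross validation. `train_and_score(X_tr, y_tr, X_te, y_te)`
    is any function that fits a model and returns its score on the held-out fold."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))          # step 1: shuffle
    folds = np.array_split(idx, k)         # steps 2-3: split into k groups
    scores = []
    for i in range(k):                     # step 7: repeat for each fold
        test_idx = folds[i]                # step 4: one group is the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return np.mean(scores)                 # average as the final model score

# Toy usage: a "model" that predicts the training mean, scored by negative MSE.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = X.ravel() * 2.0
score = k_fold_scores(X, y, k=5,
                      train_and_score=lambda Xtr, ytr, Xte, yte:
                          -np.mean((yte - ytr.mean()) ** 2))
print(score)
```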
Multiclass SVM
Companies: Apple, Microsoft
Difficulty: Easy
Frequency: Low
Question
How would you perform multiclass classification with an SVM?
Answer
Rare Feature
Companies: Apple
Difficulty: Easy
Frequency: Medium
Question
How do you deal with a rare feature value?
Answer
Regularization In Decision Trees
Companies: Amazon, Apple
Difficulty: Medium
Frequency: Low
Question
What methods of regularization are available to tree-based models?
Answer

Shrinkage: use a learning rate in the update step to slow down the learning.

Stochastic gradient boosting: fit each base learner on a subsample of the training set drawn at random without replacement.

Limit the minimum number of observations in the leaves.

Penalize trees that are complex, e.g. by pruning or limiting depth.
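As an illustrative sketch of the first two methods, here is a toy NumPy gradient-boosting loop with depth-1 stumps; the data, hyperparameters, and helper functions are made up for illustration, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem (values made up for illustration).
X = rng.uniform(0, 10, size=200)
y = np.sin(X) + rng.normal(0, 0.1, size=200)

def fit_stump(X, residual):
    """Fit a depth-1 regression tree (stump) to the current residuals."""
    best = None
    for t in np.quantile(X, np.linspace(0.1, 0.9, 9)):
        left, right = residual[X <= t], residual[X > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda x, t=t, lv=lv, rv=rv: np.where(x <= t, lv, rv)

def boost(X, y, n_rounds=50, learning_rate=0.1, subsample=0.8):
    """Gradient boosting for squared loss with shrinkage and subsampling."""
    pred = np.full_like(y, y.mean())
    for _ in range(n_rounds):
        # Stochastic gradient boosting: subsample without replacement.
        idx = rng.choice(len(X), size=int(subsample * len(X)), replace=False)
        stump = fit_stump(X[idx], y[idx] - pred[idx])
        # Shrinkage: scale each stump's contribution by the learning rate.
        pred = pred + learning_rate * stump(X)
    return pred

pred = boost(X, y)
print(np.mean((y - pred) ** 2))  # training MSE
```

Lowering `learning_rate` or `subsample` slows learning down, which is exactly the regularizing effect these methods provide.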
Loss: KL Divergence
Companies: Amazon, Google
Difficulty: Hard
Frequency: Low
Question
What is KL divergence and how does it work as a loss function?
Answer
KL divergence measures how one probability distribution differs from another. It is not a true distance, since it is not symmetric: in general KL(P || Q) ≠ KL(Q || P). When we use it as a loss function, we compare the probability distribution returned by our model against the ground-truth probability distribution; minimizing the KL divergence pushes the model's predicted distribution towards the ground truth.
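A minimal NumPy sketch of the formula D_KL(P || Q) = Σ P(x) log(P(x)/Q(x)), with an `eps` guard added for numerical safety; the example distributions are made up:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)).
    eps guards against log(0) and division by zero."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = np.array([0.7, 0.2, 0.1])   # e.g. a ground-truth distribution
q = np.array([0.6, 0.3, 0.1])   # e.g. a model's predicted distribution
print(kl_divergence(p, q))      # small positive value
print(kl_divergence(p, p))      # 0: identical distributions
```

The asymmetry is easy to check: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.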
Loss: CBOW and Skip-Gram
Companies: Apple, Google
Difficulty: Amazon
Frequency: Low
Question
What is the loss function for the skip-gram model and the CBOW model in word2vec?
Answer
Some loss functions we can use in word2vec are:

Softmax: use the softmax function to compute the probability of predicting the output word given the input word (or surrounding words).

Hierarchical Softmax: faster to compute than the full softmax. We create a binary tree where the leaves are the words and each internal node holds the probabilities of choosing its children. The probability of a word is the product of the probabilities along the path from the root to that leaf (word).

Negative Sampling: tries to maximize the probability of the correct output word and minimize the probability of randomly (and incorrectly) selected words. In other words, we push the encodings for similar words together and dissimilar words further away.
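The full-softmax and negative-sampling losses can be sketched in NumPy for the skip-gram case; the embedding matrices, vocabulary size, and word indices below are made-up toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4

# Hypothetical embedding matrices: input (center) and output (context) vectors.
W_in = rng.normal(size=(vocab_size, dim))
W_out = rng.normal(size=(vocab_size, dim))

def skipgram_softmax_loss(center, context):
    """Full-softmax skip-gram loss: -log p(context | center)."""
    scores = W_out @ W_in[center]                  # one score per vocab word
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return -log_probs[context]

def skipgram_neg_sampling_loss(center, context, negatives):
    """Negative sampling: push the true pair's score up, sampled negatives down."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = -np.log(sigmoid(W_out[context] @ W_in[center]))
    neg = -np.sum(np.log(sigmoid(-(W_out[negatives] @ W_in[center]))))
    return pos + neg

print(skipgram_softmax_loss(2, 5))
print(skipgram_neg_sampling_loss(2, 5, negatives=[1, 7, 9]))
```

Note how negative sampling only touches the sampled words, while the full softmax sums over the entire vocabulary, which is what makes it expensive for large vocabularies.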
CNN vs NN
Companies: Amazon
Difficulty: Easy
Frequency: Medium
Question
What is the difference between a CNN and NN?
Answer
Logistic Regression vs Linear Regression
Companies: Amazon
Difficulty: Easy
Frequency: Medium
Question
What are the differences between logistic and linear regression?
Answer
Bagging vs Boosting
Companies: Amazon
Difficulty: Easy
Frequency: High
Question
Can you explain bagging and boosting? Does boosting require the same training set for each iteration?
Answer
Dimensionality Reduction
Companies: Amazon, Microsoft
Difficulty: Medium
Frequency: High
Question
Talk about some methods of dimensionality reduction and explain PCA.
Answer
GD vs SGD vs Momentum
Companies: Amazon
Difficulty: Medium
Frequency: High
Question
Explain the difference between gradient descent, stochastic gradient descent and momentum.
Answer
In GD we pass through the entire dataset before we update the weights. In SGD (in the strictest definition) we use only one data point per update; however, this is very noisy, so we generally update the weights using a small subset of the data, which is sometimes called minibatch gradient descent (MGD). Momentum is an extension of GD that adds a fraction of the accumulated historical gradients to each update, which accelerates optimization in directions of persistent descent and damps oscillations.
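The three update rules can be compared on a toy least-squares problem; all data and hyperparameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem (made-up data): minimize ||Xw - y||^2.
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def grad(w, Xb, yb):
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def gd(lr=0.1, steps=200):
    w = np.zeros(3)
    for _ in range(steps):
        w -= lr * grad(w, X, y)          # full-batch gradient
    return w

def sgd(lr=0.05, steps=500, batch=10):
    w = np.zeros(3)
    for _ in range(steps):
        idx = rng.choice(len(X), batch)  # minibatch: noisy gradient estimate
        w -= lr * grad(w, X[idx], y[idx])
    return w

def momentum(lr=0.05, beta=0.9, steps=200):
    w, v = np.zeros(3), np.zeros(3)
    for _ in range(steps):
        v = beta * v + grad(w, X, y)     # accumulate historical gradients
        w -= lr * v
    return w

for f in (gd, sgd, momentum):
    print(f.__name__, np.round(f(), 3))
```

All three recover weights close to `w_true`; SGD's trajectory is noisier because each step sees only a minibatch.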
Min Loss In Linear Regression
Companies: Amazon
Difficulty: Medium
Frequency: Medium
Question
How to minimize the loss function in linear regression training?
Answer
Feature Scaling Decision Trees
Companies: Amazon
Difficulty: Easy
Frequency: High
Question
Is feature scaling necessary for training decision trees?
Answer
Random Forest vs GBT
Companies: Amazon
Difficulty: Easy
Frequency: Medium
Question
What are random forest and gradient boosting trees?
Answer
Base Tree GBT
Companies: Apple
Difficulty: Easy
Frequency: Low
Question
Is the base tree in gradient boosting trees deep or shallow?
Answer
Gradient Fixing
Companies: Amazon
Difficulty: Easy
Frequency: High
Question
Answer
k-fold Bias/Variance
Companies: Amazon
Difficulty: Easy
Frequency: Low
Question
If we divide the data into 10 folds versus 2 folds for k-fold cross-validation, which will have higher variance? Which one will have higher bias?
Answer
The first (10 folds) would have higher variance and the second (2 folds) would have higher bias. With 10 folds, each model trains on 90% of the data, so the performance estimate is close to what the model would achieve when trained on the full dataset, giving low bias; but the training sets overlap heavily, so the fold estimates are highly correlated and their average has higher variance. With 2 folds, each model trains on only 50% of the data, so the estimate is pessimistically biased (the models see much less data than the final model would), but the two training sets do not overlap, so the fold estimates are less correlated and the variance is lower.
LSTM
Companies: Amazon
Difficulty: Easy
Frequency: Medium
Question
Describe LSTM
Answer
Pooling
Companies: Amazon
Difficulty: Easy
Frequency: Medium
Question
Tell me about pooling. What is the role of pooling from a practical perspective? (Hint: the number of parameters can be reduced; the answer the interviewer wants is that it helps with translational invariance.)
Answer
Pooling (commonly max or average) is a technique in CNNs that downsamples (reduces the size of) the feature maps by summarizing the values within each region of the feature map.
The benefits of pooling are:

Reduce the dimensionality of the network, and with it the number of parameters in later layers.

Provide a degree of translational invariance, i.e. a feature can still be detected if its position in the image shifts slightly.
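A minimal NumPy sketch of max pooling with a 2×2 window and stride 2; the feature map values are made up:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling on a 2-D feature map (no padding)."""
    h, w = x.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the maximum within each pooling window.
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 9, 5],
                 [1, 0, 3, 7]], dtype=float)
print(max_pool2d(fmap))
# [[6. 2.]
#  [2. 9.]]
```

Shifting the strongest activation by one pixel within its window leaves the pooled output unchanged, which is the source of the (local) translational invariance.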
Attention Networks
Companies: Amazon, Microsoft
Difficulty: Hard
Frequency: Medium
Question
Can you explain attention networks? What is self-attention?
Answer
In neural networks, inputs pass through a hidden layer and then an activation function is applied to return the output of this layer. For deep networks, this output becomes the input to the next hidden layer. This strictly sequential flow is how many vanilla neural network architectures work. In attention models, however, a layer that uses attention can look back at previous layers and selectively use their activations. Self-attention instead looks at the inputs within the same layer rather than at previous layers.
To better understand this difference, think of translating a sentence: you can apply self-attention to model the lexical patterns between words within a single language's sentence, while attention between layers models the patterns between the two different sentences. As a result, attention tends to be used between the encoder and decoder in encoder-decoder models, while self-attention is central to transformer architectures.
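At its core, self-attention is scaled dot-product attention in which the queries, keys, and values all come from the same sequence. A minimal NumPy sketch with toy dimensions and no learned projection matrices (a real transformer layer would project the inputs first):

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V — each query attends to every key."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Self-attention: Q, K, V all derived from the same sequence of token vectors.
x = rng.normal(size=(5, 8))            # 5 tokens, dimension 8
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape, w.sum(axis=-1))       # (5, 8); each row of weights sums to 1
```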
BERT
Companies: Amazon
Difficulty: Hard
Frequency: Low
Question
Can you explain the basis of BERT?
Answer
BERT is an architecture made up of multiple stacked encoders. To train this model, each input token is represented by combining three embeddings: position, segment and token/word. The input is created by summing these three embeddings.
BERT is then pretrained on two unsupervised tasks, this means that we can use a lot of data since it does not have to be labelled:

Masked Language Modeling: the model is trained in a bidirectional manner, i.e., to predict a masked word it captures patterns from both the right context and the left context of that word.

Next Sentence Prediction: we train the model to determine, given two sentences A and B, whether B is the actual next sentence.
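The three-embedding input construction can be sketched in NumPy; the table sizes, token ids, and special-token conventions here are illustrative toy values, not BERT's real vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, n_segments, dim = 100, 16, 2, 8

# Hypothetical embedding tables for the three BERT input embeddings.
token_emb = rng.normal(size=(vocab, dim))
pos_emb = rng.normal(size=(max_len, dim))
seg_emb = rng.normal(size=(n_segments, dim))

def bert_input(token_ids, segment_ids):
    """BERT input = token + position + segment embeddings, summed elementwise."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]

# Two-sentence input: segment 0 for sentence A, segment 1 for sentence B.
tokens = np.array([2, 17, 5, 3, 42, 8, 3])     # hypothetical token ids
segments = np.array([0, 0, 0, 0, 1, 1, 1])
print(bert_input(tokens, segments).shape)      # (7, 8)
```

The segment embedding is what lets the Next Sentence Prediction task distinguish sentence A from sentence B in a single concatenated input.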
Loss Functions
Companies: Amazon
Difficulty: Easy
Frequency: Medium
Question
What is a loss function? What is a cost function?
Answer
A loss function is calculated for a single training example. A cost function, on the other hand, is the average loss over the entire training dataset. Some examples of loss functions are Mean Squared Error, Mean Absolute Error, Hinge Loss, and Cross-Entropy Loss.
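The distinction can be shown in a few lines, using squared error as the example loss and made-up predictions:

```python
import numpy as np

def squared_loss(y_true, y_pred):
    """Loss: computed for a single training example."""
    return (y_true - y_pred) ** 2

def mse_cost(y_true, y_pred):
    """Cost: the average of the per-example losses over the whole dataset."""
    return np.mean([squared_loss(t, p) for t, p in zip(y_true, y_pred)])

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(squared_loss(y_true[0], y_pred[0]))  # 0.25  (one example)
print(mse_cost(y_true, y_pred))            # (0.25 + 0 + 1) / 3 ≈ 0.4167
```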