Machine Learning Final
202F- Consider the following neural network. Assume that outputs O1 and O2 are outputs of sigmoid activation functions. Target1 and Target2 are the ground truth for the outputs O1 and O2. Mean squared error is our loss function. Calculate the derivative of the loss with respect to w5.
...
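The figure is not reproduced here. A minimal worked sketch, assuming the standard two-layer example in which w5 connects the hidden output h1 to the neuron producing O1, and taking the loss as E = ½ Σ (Targetᵢ − Oᵢ)² (a common convention for this exercise):

```latex
\frac{\partial E}{\partial w_5}
= \frac{\partial E}{\partial O_1}\,
  \frac{\partial O_1}{\partial \mathit{net}_1}\,
  \frac{\partial \mathit{net}_1}{\partial w_5}
= (O_1 - \mathit{Target}_1)\; O_1(1 - O_1)\; h_1
```

The middle factor, O1(1 − O1), is the derivative of the sigmoid; the last factor is the input that w5 multiplies.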
194- What are the two major characteristics of an activation function?
Answer: 1- Non-linear 2- Differentiable
182A- What is the role of kernels in SVM classifiers?
Answer: A kernel function implicitly maps the training data into a higher-dimensional space, where a decision surface that is non-linear in the original space becomes a linear hyperplane.
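A minimal scikit-learn sketch of the effect of the kernel choice (the dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by a line in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)       # struggles on this data
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)  # maps to a space where a hyperplane works

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:", rbf_svm.score(X, y))
```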
174- When do we expect to have low generalization error?
Answer: A larger margin would result in better test results and a lower generalization error. Hence, regularization is performed to avoid overfitting and keep a larger margin.
178- What is the role of parameter C in the cost function of SVM?
Answer: A larger value of C forces the learning process to reduce the total cost as much as possible. A large C results in a smaller margin and a higher chance of overfitting. Hence, C controls the tradeoff between the simplicity of the model (a larger margin) and the misclassification of the training data.
207B- What is the main difference between the basic gradient descent and an adaptive optimizer?
Answer: Adaptive optimizers use a decay mechanism to gradually reduce the effective learning rate as the number of epochs increases. For example, the AdaGrad optimizer divides the learning rate by the square root of the sum of all previously squared gradients. Hence, as the epochs increase, this ratio gets smaller.
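A minimal NumPy sketch of the AdaGrad update rule (the toy loss, learning rate, and initial values are illustrative):

```python
import numpy as np

lr = 0.1                          # base learning rate
eps = 1e-8                        # avoids division by zero
w = np.array([0.5, -0.3])         # parameters
grad_sq_sum = np.zeros_like(w)    # running sum of squared gradients

for step in range(100):
    grad = 2 * w                  # gradient of a toy loss f(w) = w.w
    grad_sq_sum += grad ** 2      # accumulate squared gradients
    # Effective learning rate shrinks as squared gradients accumulate.
    w -= lr / (np.sqrt(grad_sq_sum) + eps) * grad
```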
230- What is an autoencoder?
Answer: An autoencoder is a neural network model that seeks to learn a compressed representation of the input. An autoencoder is a network that is trained to copy the input to the output.
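A minimal tf.keras sketch of a dense autoencoder (the layer sizes are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(784,))                        # e.g., flattened 28x28 images
encoded = layers.Dense(32, activation="relu")(inputs)        # compressed representation
decoded = layers.Dense(784, activation="sigmoid")(encoded)   # reconstruction of the input

autoencoder = Model(inputs, decoded)
# Trained to copy the input to the output, so the 32-unit bottleneck
# is forced to learn a compressed representation.
autoencoder.compile(optimizer="adam", loss="mse")
```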
206- In the context of deep learning, what is an optimizer?
Answer: An optimizer finds optimum values of coefficients during the backpropagation operation.
193- How do we optimize the weights in a neural network?
Answer: By backpropagation. Inputs are sent through the network in a feedforward manner. Loss is calculated as the difference between the calculated output and the ground truth label. Then, the gradient of the loss is back-propagated. Weights are tuned based on the gradient of the loss.
163- How do we calculate the loss in logistic regression?
Answer: Cross entropy is used.
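Written out for m training points, with ℎθ(x⁽ⁱ⁾) the predicted probability of label 1, the cross-entropy loss is:

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}
\Big[\, y^{(i)} \log h_\theta(x^{(i)})
  + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big) \Big]
```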
186- What are dendrites and axons as parts of neuron cells?
Answer: Dendrites bring information to neurons, and axons send out information.
195- What is a fully connected neural network?
Answer: Each neuron's output in the network is connected to the inputs of all neurons in the next layer.
157- Logistic regression (LR) is a multi-class classifier.
Answer: False. This algorithm has extensions that are used for multi-class problems, but basic logistic regression is a two-class (binary) classifier.
207A- What are two examples of optimizers?
Answer: The gradient descent algorithm and Adam are examples of optimizers.
181- In the following figure, what are H0, H1, H2, d-, and d+? What is the distance between H1 and H2?
Answer: H0 is the separating hyperplane; H1 and H2 are the planes that touch the support vectors. The distance between H1 and H2 is the margin, which is the sum of d- and d+.
179- What happens if C in SVM is very small?
Answer: High misclassification would occur.
200- What is the Perceptron convergence theorem?
Answer: If the data labels are linearly separable, the Perceptron learning algorithm will converge and halt after a finite number of iterations.
171- What is a "margin" in SVM?
Answer: If the training data is linearly separable, we can select two parallel hyperplanes that separate the two classes of data so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin," and the maximum margin hyperplane is the hyperplane that lies halfway between them.
192- What happens if we initialize the weights by all zeros?
Answer: If all weights are initialized to zero, every neuron in a layer computes the same output and receives the same gradient update, so the neurons stay identical and the network cannot learn distinct features. Random initialization breaks this symmetry and also reduces the chance of getting stuck in a poor local minimum.
197B- What is the main difference between classic machine learning and deep learning?
Answer: In the classic machine learning methods, features are manually extracted. In deep learning, the network learns to extract appropriate features.
172- What is the relationship between the magnitude of noise in the data and the size of margin in SVM?
Answer: Increasing the noise would result in the reduction of the margin: more noise, less margin.
189- What types of layers are in an MLP?
Answer: Input, hidden, and output.
182B- What does the RBF kernel do?
Answer: It computes a Gaussian function of the distance between support vectors and data points, K(x, x′) = exp(−γ‖x − x′‖²). A hyperplane is then calculated for these computed similarities.
188- What is a binary step activation function?
Answer: It compares ∑(𝑥𝑖𝑤𝑖) + 𝑏 with a threshold. If this sum is above the threshold, the output will be 1; otherwise, the output will be zero.
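A minimal sketch in Python (the threshold and the sample inputs are illustrative assumptions):

```python
import numpy as np

def binary_step(x, w, b, threshold=0.0):
    """Output 1 if the weighted sum plus bias exceeds the threshold, else 0."""
    return 1 if np.dot(x, w) + b > threshold else 0

# 1.0*0.5 + 2.0*(-0.25) + 0.1 = 0.1 > 0, so the output is 1.
print(binary_step(np.array([1.0, 2.0]), np.array([0.5, -0.25]), b=0.1))
```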
209- What is the Hessian matrix?
Answer: It is a matrix of all possible second-order derivatives of the loss function with respect to the network's coefficients.
183A- What type of SVM kernel is the following expression? What is the purpose of γ? How does the size of γ affect the behavior of SVM?
Answer: It is an RBF kernel. Gamma is a parameter that controls the effect of the data points on the hyperplane. If Gamma is small, then data points far from the support vectors also influence the shape of the hyperplane.
166- Based on what principle, the following expression is expanded in the second line?
Answer: It is assumed that there are only classes 0 and 1. So, we have a Bernoulli distribution:
180- What does the following expression show?
Answer: It says that 𝜃ᵀ𝑥 is the dot product of the two vectors 𝜃 and 𝑥: the magnitudes of the two vectors multiplied by the cosine of the angle between them. Equivalently, the magnitude of 𝜃 is multiplied by the projection of vector 𝑥 onto vector 𝜃.
176- What is the purpose of the following expression in the SVM classification?
Answer: It shows the cost for the total of m data points. Each data point has a true label of either 𝑦⁽ⁱ⁾ = 1 or 𝑦⁽ⁱ⁾ = 0. The cost for each class-1 data point is max(0, 1 − 𝜃ᵀ𝑥).
214- What are the main layers in a CNN?
Answer: Convolutional (CONV), pooling (POOL), and fully connected (FC) layers. A complete pipeline is typically: input → convolution → pooling → FC → softmax → output.
204- What is the "model" formed by the following code? What is the task of the "Flatten()" layer? What does the Dropout layer do? What does "Dense" mean?
Answer: It uses TensorFlow to form an MLP classifier with one hidden layer of 512 neurons and an output layer of 10 neurons. The Flatten layer converts the input matrix into a one-dimensional vector. The Dropout layer randomly deactivates outputs of 20 percent of the neurons. A Dense layer is a fully connected layer.
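The code itself is not reproduced here; a plausible reconstruction consistent with the description (tf.keras is assumed):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 28x28 matrix -> 784-element vector
    tf.keras.layers.Dense(512, activation="relu"),    # hidden layer of 512 neurons
    tf.keras.layers.Dropout(0.2),                     # randomly zeroes 20% of activations
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer, 10 classes
])
```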
231- What is an LSTM network?
Answer: LSTMs are intended to extract information from sequential data. For example, it can be used to predict future stock prices based on previous values. LSTMs use a series of 'gates' that control what part of the sequence is related to which part. LSTM is used for time series prediction, speech recognition, music composition, and handwriting recognition.
158- What is the role of the logit (i.e., 𝑓(𝐱) = w₀ + w₁𝑥₁ + ⋯ + wᵣ𝑥ᵣ) in logistic regression?
Answer: Logistic regression is a linear classifier, and the logit is a linear function that is applied to the attributes of a data point. Variables w₀, w₁, ..., wᵣ are the learned coefficients of the model.
162- Why is MSE not used in logistic regression?
Answer: MSE for the logistic function is not convex, and the gradient descent may get trapped in a local minimum.
225- What will be the outcome of a max-pooling process with a kernel of T×T applied to an N×N feature map?
Answer: Max-pooling chooses the maximum element in a window as its output. With stride = 1, the output of the max-pooling will be (𝑁 − 𝑇 + 1) × (𝑁 − 𝑇 + 1): after the first placement of the kernel, it can slide (𝑁 − 𝑇) more positions in each direction, giving 𝑁 − 𝑇 + 1 positions per dimension. For example, a 2×2 window over a 4×4 feature map yields a 3×3 output.
174- What is the consequence of regularization in SVM?
Answer: A larger margin and a lower chance of overfitting.
162- What is the main difference between Naïve Bayes and logistic regression?
Answer: Naïve Bayes is a generative model, and logistic regression is a discriminative model.
198- What is an epoch?
Answer: One complete presentation of the training set to the network during training.
185- What are unstructured data?
Answer: Qualitative data that cannot easily be classified and queried is unstructured. Examples of unstructured data include images, videos, webpages, PDF files, PowerPoint presentations, emails, blog entries, and word processing documents.
192- How do we initialize weights in a neural network?
Answer: Randomly.
201- Which of the following two sets of data are suitable for classification by MLP?
Answer: Set A is linearly separable and feasible for classification by MLP. Linearly separable data is data that, if graphed in two dimensions, can be separated by a straight line.
184- What is structured data?
Answer: Structured data elements can readily be labeled, queried, and analyzed. Examples of structured data elements include age, height, price, numeric data, currency, alphabetic, names, dates, addresses.
168- What does SVM stand for?
Answer: Support vector machine.
203- What does the following code do?
Answer: The above code scales the train and test data. It uses an MLP with two hidden layers of ten neurons each. It uses the alpha regularization factor; increasing alpha could reduce overfitting.
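The code is not reproduced here; a plausible reconstruction consistent with the description (scikit-learn is assumed; the dataset and alpha value are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit the scaler on the train data only
X_test = scaler.transform(X_test)         # apply the same scaling to the test data

# Two hidden layers of ten neurons each; alpha is the L2 regularization factor.
clf = MLPClassifier(hidden_layer_sizes=(10, 10), alpha=1e-4, max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```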
205- If the input of the above MLP is a 28 by 28 matrix, how many parameters does each layer have?
Answer: The first hidden layer has 28×28×512 + 512 = 401,920 parameters. The output layer has 10×512 + 10 = 5,130 parameters.
197A- What is the input layer in the following artificial neural network?
Answer: The input layer is not an actual layer. The input layer consists of the outside inputs (x1, x2, and x3) fed into the network. The first actual layer is the first hidden layer. In this diagram, there is one hidden layer and one output layer.
175- Explain the following curves that are for the SVM cost function.
Answer: The left curve is the cost for label 1: it is zero when 𝜃ᵀ𝑥 ≥ 1 and increases linearly as 𝜃ᵀ𝑥 decreases. Hence, we want 𝜃ᵀ𝑥 > 0 when the data point is of class 1, and 𝜃ᵀ𝑥 < 0 for class 0.
219D- Why do we use the log of the softmax output as the loss?
Answer: The negative log is a good representation of the loss. If the probability assigned to the correct label 𝑘 is low (𝑝𝑘 is low), then the loss −log(𝑝𝑘) is a large positive number. Also, the derivative of the loss with respect to the input is easy to compute.
219C- How do we calculate the loss for the output of a softmax layer?
Answer: The loss is the negative natural log of the softmax output for the true class (cross-entropy).
202E- What happens if we replace ReLu with an identity function (𝑓(𝑥) = 𝑥)?
Answer: The network won't be able to approximate non-linear functions: a stack of linear layers collapses into a single linear transformation.
173- What happens in SVM if we don't perform regularization?
Answer: The magnitude of the coefficients would grow, and overfitting would occur: larger coefficients, more overfitting.
190- What is an "input function" in an MLP layer?
Answer: The operation performed on the layer's inputs using its weights and bias. Usually, a dot product of the layer's input vector with the weight vector is computed; then the bias is added to the result.
218B- Suppose an input of size m×m=100×100 without zero-padding is convolved with a k×k= 3×3 kernel. Then a sliding pooling layer with a window size of p×p = 8×8 is applied. What is the size of the output matrix?
Answer: The output of the convolution is 98×98. The output of the sliding pooling window is 91×91. Output of the convolution = (m − k + 1) × (m − k + 1) = c × c; output of the pooling layer = (c − p + 1) × (c − p + 1).
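A small Python helper that checks these formulas (stride 1 and no zero-padding are assumed throughout):

```python
def valid_out(m, k):
    """Output size for an m x m input and a k x k sliding window, stride 1."""
    return m - k + 1

c = valid_out(100, 3)   # convolution with a 3x3 kernel: 98
p = valid_out(c, 8)     # sliding 8x8 pooling window behaves the same way: 91
print(c, p)             # 98 91
```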
202A- What are the following functions used for and called?
Answer: They are all activation functions. A is sigmoid, B is tanh, C is ReLu, and D is leaky ReLu.
199- Consider the following code: What will the three graphs represent? Which graph has the steepest slope?
Answer: They show sigmoid functions, which can be used as activation functions. The blue graph has the steepest slope, and the green curve has the shallowest slope.
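The code is not reproduced here; a plausible reconstruction (the slope values and colors are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 200)
for k, color in [(2.0, "blue"), (1.0, "red"), (0.5, "green")]:
    plt.plot(x, 1 / (1 + np.exp(-k * x)), color=color, label=f"k={k}")
# Larger k gives a steeper sigmoid: blue (k=2) is steepest, green (k=0.5) shallowest.
plt.legend()
plt.show()
```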
177- What is the purpose of the following expression in the SVM classification?
Answer: This expression is used for training the SVM model. It uses a regularizer to control the size of the coefficients. It also uses the C parameter, which controls the effect of the sum of all losses on the total cost function.
167- What is the following expression used for in logistic regression?
Answer: This is the cross-entropy, which is used for the computation of the loss. Here ℎ𝜃(𝑥⁽ⁱ⁾) is the probability of the data point having label 1.
218A- What is the purpose of pooling layers?
Answer: To downsample the feature maps and generate features with a broader view (receptive field).
229- What is "invariance to translation"?
Answer: Invariance to translation means that if we translate the input, the CNN will still be able to detect the class to which the input belongs. Translational invariance is largely a result of the pooling operation.
164- What is the following expression used for in logistic regression?
Answer: It shows the likelihood of data with attributes x belonging to class y, based on the trained coefficients θ. Here we assume we have n data points. We want to train the model coefficients θ to maximize this likelihood.
232- What is the difference between autoencoders and variational autoencoders?
Answer: Key differences:
Autoencoder (AE):
- Used to generate a compressed transformation of the input in a latent space.
- The latent variable is not regularized; picking a random latent variable will generate garbage output.
- The latent space has discontinuities.
- The latent variable is a set of deterministic values.
- The latent space lacks generative capability.
Variational autoencoder (VAE):
- Regularizes the latent variable by pushing its distribution toward a unit (standard) Gaussian.
- The encoder outputs the latent variable in compressed form as a mean and a variance.
- The latent space is smooth and continuous: a random value of the latent variable generates meaningful output at the decoder.
- The input of the decoder is stochastic: it is sampled from a Gaussian with the mean and variance output by the encoder.
- The latent space is regularized and has generative capabilities.
160- Consider the following model for logistic regression: P (y =1|x, w)= g(w0 + w1x), where g(z) is the logistic function. In this equation, what would be the range of P?
Answer: The logistic function gives an output in the range of 0 to 1. Hence, the range of P is (0, 1).
161- Assume that 𝑓(𝑥) = 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2, with 𝑤0 = 5, 𝑤1 = 0, and 𝑤2 = −1. Draw the 𝑓(𝑥) line on a two-dimensional graph with 𝑥1 and 𝑥2 axes. Also, show on the graph which side of 𝑓(𝑥) will be classified as 0 and which side will be labeled 1.
Answer: Notice that this graph only shows x1 and x2; the value of f(x) is not shown. We want to analyze 𝑓(𝑥) = 5 − 𝑥2. If x2 is greater than 5, then f(x) becomes negative, and the 0 label is chosen for the data point; otherwise, the label is 1.
187- What is the output of the following artificial neuron?
Answer: Output = 𝑓(𝑥1𝑤1 + 𝑥2𝑤2 + 𝑥3𝑤3 + 𝑏), where 𝑓() is the activation function.
224- Suppose an M×M matrix is convolved with a K×K filter. What is the output size?
Answer: With stride 1 and no zero-padding, the output is (𝐾 − 1)/2 smaller on every side compared to the input matrix, i.e., (𝑀 − 𝐾 + 1) × (𝑀 − 𝐾 + 1).
169- What is a hyperplane in SVM?
Answer: A hyperplane is a decision plane that separates objects of different classes.
156- What is imbalanced classification?
Answer: A classification problem is imbalanced when the classes are not represented equally in the data, i.e., one class has many more samples than the others. (A balanced dataset has roughly equal samples from the different classes.)
170- What is a support vector in SVM?
Answer: A data point in a class that is closest to the hyperplane is called a support vector.
196- Consider the following MLP network. What are the elements pointed by letters A, B, C, D, and E?
Answer: A: the activation function of the output layer. B: the bias of the output layer. C: it is equal to 1. D: the output of the second neuron of the hidden layer. E: the coefficient of the third input of the output neuron.
211- What is the "early stopping" algorithm, and what is its purpose?
Answer: Early stopping is a form of algorithmic regularization used to avoid overfitting when training a learner iteratively with gradient descent. Its main purpose is to monitor the generalization (validation) error of the model and stop training when that error starts to increase.
212- What is the difference between dropout and drop-connect methods?
Answer: To apply DropOut, we randomly select a subset of the units and clamp their output to zero, regardless of the input; this effectively removes those units from the model. A different subset of units is randomly selected every time we present a training example. DropConnect works similarly, except that we disable individual weights (i.e., set them to zero) instead of nodes, so a node can remain partially active.
210- What is data augmentation, and why is it performed?
Answer: Data augmentation is a method used to increase the volume of data by adding modified copies of existing data, creating new synthetic samples. Its main objective is to improve model performance and generalization by enlarging and diversifying the training dataset. Synthetic data generation, in contrast, creates an entirely artificial dataset from the original data.
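A minimal tf.keras sketch of image augmentation (the transform parameters and the random image batch are illustrative assumptions):

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # mirror images left-right
    tf.keras.layers.RandomRotation(0.1),        # rotate by up to +/-10% of a full turn
    tf.keras.layers.RandomZoom(0.1),            # zoom in or out by up to 10%
])

# Applying the pipeline to a batch yields modified copies of the originals.
images = tf.random.uniform((8, 224, 224, 3))    # stand-in for a real image batch
augmented = augment(images, training=True)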
208- What is the difference between stochastic, batch, and mini-batch gradient descent methods?
Answer: While training a machine learning model, we need an algorithm to minimize the value of the loss function; gradient descent is one such optimization algorithm. There are three main variants:
1. Batch gradient descent uses the entire dataset for each weight update. It calculates the loss for every data point in the training set but updates the model weights (parameters) only after all training data points have been evaluated. For example, if we have one thousand data points, the weight update happens after all one thousand points are evaluated.
2. Mini-batch gradient descent splits the training dataset into small batches, which are used to calculate the model loss and update the model coefficients. For example, one thousand data points could be split into 10 batches of 100 points each.
3. Stochastic gradient descent calculates the loss and updates the model for each data point in the training dataset. It uses a single data point at a time, no matter how many data points we have.
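A minimal NumPy sketch of the mini-batch variant (the data, learning rate, and batch size are illustrative; setting batch_size to 1 or to the full dataset size gives the stochastic and batch variants):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # 1000 data points, 3 features
y = X @ np.array([1.0, -2.0, 0.5])        # toy linear targets

w = np.zeros(3)
lr, batch_size = 0.01, 100                # batch_size=1 -> stochastic; 1000 -> batch

for epoch in range(20):
    idx = rng.permutation(len(X))         # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on this batch
        w -= lr * grad                    # one update per mini-batch

print(w)   # approaches [1.0, -2.0, 0.5]
```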
202B- What are the main advantages of the ReLu activation function?
Answer: 1- ReLu causes representational sparsity: any neuron whose weighted input is negative outputs zero and is effectively ignored. This accelerates the learning process, since such neurons contribute nothing to the forward pass or the gradient. 2- ReLu mitigates the vanishing gradient problem: for positive inputs its gradient is constant (one), rather than shrinking with the activation output.
202D- What is the vanishing gradient problem?
Answer: To update any network parameter, we need to calculate the gradient of the loss with respect to that parameter. If the gradient is zero or close to it, the update process takes a long time or may even stop. Furthermore, as the number of layers grows, the possibility of a vanishing gradient increases. Sigmoid and tanh activation functions can cause a vanishing gradient. For example, the gradient of the sigmoid as a function of its input is ∂𝑓(𝑥)/∂𝑥 = 𝑓(𝑥)(1 − 𝑓(𝑥)), which depends on the sigmoid's own output: whenever the output saturates near 0 or 1, the gradient becomes small (it is at most 0.25). In contrast, the gradient of the ReLu function for positive inputs is one, irrespective of the value of the cell's output.
183C- What is the gradient of the loss function in SVM?
Answer: We know that the gradient is the slope of the function. The hinge loss function has a gradient of zero for any data point with 1 label that produces θᵀx ≥ 1. The gradient is -x for these data points if they produce θᵀx < 1.
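For a data point with true label 1, the hinge loss and its (sub)gradient with respect to θ can be written as:

```latex
L(\theta) = \max\!\big(0,\; 1 - \theta^{T}x\big), \qquad
\frac{\partial L}{\partial \theta} =
\begin{cases}
0 & \text{if } \theta^{T}x \ge 1,\\[2pt]
-x & \text{if } \theta^{T}x < 1.
\end{cases}
```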
165- What does the following expression describe?
Answer: We want to maximize the likelihood of the n data points belonging to correct classes by training the model coefficients θ. Log of the likelihood is used to change the product of the probabilities to a summation.
219B- What is the purpose of the softmax layer?
Answer: It is usually used at the output of a multi-class classifier. The softmax layer uses the formula 𝑝𝑘 = 𝑒^𝑓𝑘 / ∑𝑗 𝑒^𝑓𝑗, where 𝑝𝑘 is the 𝑘th output and 𝑓𝑘 is the 𝑘th input of the softmax layer. The sum of all outputs is one. Hence, the 𝑘th output of the classifier represents the probability that the network's input belongs to class 𝑘.
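A minimal NumPy sketch tying together the softmax and the cross-entropy loss from 219C/219D (the logits and the true class are illustrative):

```python
import numpy as np

f = np.array([2.0, 1.0, 0.1])        # logits: the k-th input of the softmax layer
p = np.exp(f) / np.sum(np.exp(f))    # p_k = e^{f_k} / sum_j e^{f_j}
print(p, p.sum())                    # probabilities summing to one

true_class = 0
loss = -np.log(p[true_class])        # cross-entropy: large when p_k is small
print(loss)
```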
159- What is the logistic function applied to 𝑓(𝐱)?
Answer: 𝑔(𝑓(𝐱)) = 1 / (1 + 𝑒^(−𝑓(𝐱)))
191- Why do we perform backpropagation during the test phase?
Answer: We don't. Backpropagation is only performed during training.
228- What is parameter sharing?
Answer: Parameter sharing is the sharing of weights by all neurons in a particular feature map. A convolutional neural network learns features in images that are useful for classifying the image, and sharing parameters gives the network the ability to look for a given feature everywhere in the image, rather than in just a certain area. This is extremely useful when the object of interest could be anywhere in the image. Relaxing the parameter sharing allows the network to look for a given feature only in a specific area. For example, if the training data consists of centered faces, the network could end up looking for eyes, a nose, and a mouth in the center of the image, a curve towards the top, and shoulders towards the bottom. Training data in which useful features always appear in the same area is uncommon, so this is not seen often.
218C-Suppose you have five convolutional kernels of size 7×7 and stride 1 in the first layer of a convolutional neural network. You pass an input of dimension 224 × 224 × 3 through this layer. What are the dimensions of the data which the next layer will receive?
Answer: The kernel is placed inside the input, so 3 points are lost on each side: 224 − 7 + 1 = 218. With five kernels, the output will be 218 × 218 × 5.
222- Backpropagation works by first calculating the gradient of ___ and then propagating it backward.
Answer: The gradient of the loss (e.g., the sum of squared errors) with respect to the weights.
202C- Draw the shape of the ReLu function and its derivative.
Answer: ReLu is 𝑓(𝑥) = max(0, 𝑥); its derivative is 0 for 𝑥 < 0 and 1 for 𝑥 > 0. The derivative is not defined at 𝑥 = 0.
183B- How do we find the hyperplane of an SVM classifier?
Answer: The training data has x features and y labels. We initialize the θ values and calculate the loss function. Then we use the gradient descent algorithm to find the gradient of the loss function. Finally, we update the parameters using the gradients until the process converges.
213A- What are the five main differences between CNN and MLP?
Answer: a) Local receptive fields b) Parameter sharing c) Pooling layers d) Sparse connectivity e) Equivariance to translation
213B- Which of the following options can be used to reduce overfitting in deep learning models?
• Add more data • Use data augmentation • Use architecture that generalizes well • Add regularization • Reduce architectural complexity
213C- Which of the following is a data augmentation technique used in image recognition tasks?
• Horizontal flipping • Random cropping • Random scaling • Color jittering (randomly change the brightness, contrast, and saturation) • Random translation