COMP 6630 Quiz Questions
Logistic regression is used for ___?
Classification
In a complex financial modeling project, you are tasked with predicting the likelihood of a customer defaulting on a loan based on various financial attributes such as income, debt, and credit history. What type of machine learning problem is this, and what are the key considerations in choosing an appropriate approach?
Classification, considering the binary nature of default prediction. Key considerations include model interpretability and handling imbalanced datasets.
For Ridge Regression(L2), if the regularization parameter = 0, what does it mean?
Large coefficients are not penalized. Overfitting problems are not accounted for. The loss function is the same as the ordinary least square loss function.
For Ridge Regression(L2), if the regularization parameter is very high, which options are true?
Large coefficients are significantly penalized. This can lead to a model that is too simple and ends up underfitting the data.
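Both extremes of the regularization parameter can be seen in a one-feature sketch. It assumes a model y ≈ w·x with no intercept, for which the ridge solution has the closed form w = Σxy / (Σx² + λ). The data values and the helper name `ridge_slope` are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge solution for y ~ w * x (no intercept).

    Minimizes sum((y - w*x)^2) + lam * w^2.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Illustrative data lying roughly on y = 2x (assumed, not from the quiz).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w_ols = ridge_slope(xs, ys, lam=0.0)     # no penalty: same as ordinary least squares
w_mild = ridge_slope(xs, ys, lam=1.0)    # small penalty: slight shrinkage
w_heavy = ridge_slope(xs, ys, lam=1e6)   # huge penalty: slope pushed toward 0

print(w_ols, w_mild, w_heavy)
```

With λ = 0 the penalty term vanishes and the result is the least-squares slope; with a very large λ the slope is close to zero, so the model underfits.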
Which parameter determines the size of the improvement step to take on each iteration of Gradient Descent?
Learning Rate
Which of the following statements accurately describes the relationship between generalization error and learning error?
Learning error is the error on the training data, and generalization error is the expected error on new, unseen data.
Why does using linear activation in all layers of a Multi-Layer Perceptron (MLP) effectively make it behave like a linear model?
Linear activation functions produce a linear combination of the inputs and weights, resulting in a linear transformation at each layer.
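A small numeric check can make this concrete. The sketch below uses hypothetical weight matrices `W1` and `W2`, omits biases for brevity, and treats the activation as the identity. It shows that two stacked linear layers give the same output as one layer whose weight matrix is the product W2·W1.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical weights for a 2-layer network with identity activations
# (biases omitted to keep the sketch short).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 3 hidden units, 2 inputs
W2 = [[2.0, 1.0, -1.0]]                      # 1 output, 3 hidden units
x = [[0.5], [4.0]]                           # one input as a column vector

two_layer = matmul(W2, matmul(W1, x))   # layer-by-layer forward pass
W_single = matmul(W2, W1)               # collapse both layers into one matrix
one_layer = matmul(W_single, x)         # equivalent single linear layer

print(two_layer, one_layer)
```

Because the composition of linear maps is itself linear, adding more such layers never increases what the network can represent.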
In a logistic regression model, the decision boundary can be ___.
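Linear or non-linear: linear in the given features, and non-linear in the original inputs when the features are transformed (for example, with polynomial terms).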
When dealing with a binary classification problem (predicting classes 0 or 1), why is it generally not advisable to use linear regression for classification tasks?
Linear regression outputs unbounded real numbers, making it difficult to interpret as class probabilities.
Why can't the cost function used for linear regression be used for logistic regression?
Linear regression uses mean squared error as its cost function. If this is used for logistic regression, it becomes a non-convex function of the parameters. Gradient descent is guaranteed to converge to the global minimum only if the function is convex.
When deciding between using gradient descent and Newton's method for optimizing logistic regression, what consideration is crucial?
Newton's method is generally more robust but can be computationally expensive for large datasets.
The residuals of observations over time must be random, with no trends or patterns suggesting that there is no correlation between current observations and previous observations. This assumption of an Ordinary Least Square (OLS) model is called ______?
No Auto-correlation
Does the size of the feature map always reduce upon applying the filters? Explain why or why not.
No. Without padding, the convolution operation shrinks the matrix of pixels (the input image) only if the size of the filter is greater than 1, i.e., f > 1. When we apply a 1×1 filter, there is no reduction in the size of the image and hence no loss of information.
Is it possible to completely eliminate all errors in a machine learning model?
No, unavoidable error (irreducible error) exists in all machine learning problems.
Why are activation functions in the hidden layers of a neural network typically chosen to be both differentiable and non-linear? Select all that apply.
Non-linearity introduces complexity into the model, enabling it to learn and represent non-linear relationships in the data, which is crucial for the success of deep learning architectures. Non-linearity prevents the network from learning simple features, ensuring that it can learn complex representations of the data. Differentiability ensures that the gradients can be computed during backpropagation, allowing for efficient optimization of the network's parameters.
In a linear regression model, what techniques can find the coefficients?
Ordinary Least Squares (non-iterative, using the normal equations); Gradient Descent (iterative); Regularization.
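As a rough illustration of the non-iterative route, the sketch below solves the normal equations for the single-feature case y = a·x + b, where they reduce to the familiar covariance-over-variance formula. The function name `fit_ols` and the data are assumptions made for the example.

```python
def fit_ols(xs, ys):
    """Least-squares fit of y = a*x + b for one feature.

    For a single feature the normal equations reduce to
    a = cov(x, y) / var(x) and b = mean(y) - a * mean(x).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b


# Illustrative data generated from y = 2x + 1 (assumed for the example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = fit_ols(xs, ys)
print(a, b)
```

No learning rate or iteration count is needed here, which is the practical difference from gradient descent.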
Disadvantages of OLS
Outliers: An outlier point will also be accommodated in the OLS equation, which will drag the line towards that outlier, thus rendering the contribution of many valid points useless. Normality Assumption: With the OLS method, the test statistics can give unreliable results if the data is not normally distributed. OLS is usually computationally heavy with an order of complexity of O(n³) for an n*n matrix, where 'n' is the number of features.
In a multiclass classification problem with four classes (A, B, C, D), you are considering using binary classifiers for the task. How many classifiers do you need when you use one-vs-all (OvA) strategy and a one-vs-one (OvO) strategy?
OvA: 4 (one classifier per class). OvO: 6 (one classifier per pair of classes, 4 × 3 / 2).
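The two counts can be checked by enumerating the classifiers each strategy would need. This is only a counting sketch; the `"rest"` label is a placeholder, and no model is trained.

```python
from itertools import combinations

classes = ["A", "B", "C", "D"]

# One-vs-all: one binary classifier per class (that class vs. everything else).
ova_tasks = [(c, "rest") for c in classes]

# One-vs-one: one binary classifier per unordered pair of classes.
ovo_tasks = list(combinations(classes, 2))

print(len(ova_tasks), len(ovo_tasks))
```

In general, k classes need k classifiers under OvA and k(k − 1)/2 under OvO.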
What are the limitations of Lasso Regression?
If the number of features (p) is greater than the number of observations (n), Lasso will pick at most n features as non-zero, even if all features are relevant. If there are two or more highly collinear feature variables, Lasso regression selects one of them randomly, which is not good for the interpretation of the data.
Suppose you are using gradient descent to minimize the cost function in linear regression. During the gradient descent iterations, you observe that the cost function decreases, but the algorithm converges very slowly. What could be a possible solution to improve convergence?
Increase the learning rate.
Consider a linear regression model with a single input feature x. Instead of using a linear basis function, you decide to use a polynomial basis function of degree 3. Which of the following statements is true regarding the impact of increasing the degree of the polynomial basis function?
Increasing the degree of the polynomial basis function can lead to overfitting.
What is a key property of the logistic sigmoid function that makes it suitable for logistic regression?
It is differentiable, allowing the use of gradient-based optimization algorithms.
What is the primary benefit of parameter sharing in convolutional neural networks (CNNs)?
Parameter sharing reduces the computational complexity of the network by reusing the same set of weights across different parts of the input.
Consider the mathematical entities: scalars, vectors, matrices, and tensors. What accurately describes the distinction between them?
Scalars are single values, vectors are one-dimensional arrays, matrices are two-dimensional arrays, and tensors are multi-dimensional arrays.
What's the penalty term for the Ridge regression(L2)?
The square of the magnitude of the coefficients
What does the Selectivity-Invariance dilemma in neural networks refer to?
The trade-off between the network's ability to recognize specific features and its ability to generalize across variations of those features.
What is the difference between the validation set and the test set in machine learning?
The validation set is used for model evaluation during training, while the test set is reserved for the final assessment of the model's generalization performance.
In linear regression, when might it be appropriate to use basis functions to transform input features?
When the input features exhibit a complex, nonlinear relationship with the output.
Can we use CNN to perform Dimensionality Reduction? If Yes then which layer is responsible for dimensionality reduction particularly in CNN?
Yes, a CNN does perform dimensionality reduction. A pooling layer is used for this. The main objective of pooling is to reduce the spatial dimensions of the feature maps. To do so, it performs a down-sampling operation and creates a pooled feature map by sliding a filter matrix over the input matrix.
You are predicting whether an email is spam or not. Based on the features, you obtained an estimated probability to be 0.75. What's the meaning of this estimated probability?
There is a 75% chance that the email is spam, and a 25% chance that it is not spam.
For a linear regression model, start with random values for each coefficient. The sum of the squared errors is calculated for each pair of input and output values. A learning rate is used as a scale factor and the coefficients are updated in the direction of minimizing the error. The process is repeated until a minimum sum squared error is achieved or no further improvement is possible. This technique is called ______?
Gradient Descent
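The procedure described in the question can be sketched for a single feature as follows. The data, the learning rate, and the iteration count are assumptions chosen for the example. The coefficients start at zero rather than at random values so that the run is deterministic.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_iters=2000):
    """Fit y = a*x + b by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0  # zeros instead of random values, for a repeatable sketch
    n = len(xs)
    for _ in range(n_iters):
        # Prediction error for every (input, output) pair.
        errors = [(a * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to a and b.
        grad_a = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Illustrative data generated from y = 2x + 1 (assumed for the example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = gradient_descent(xs, ys)
print(a, b)
```

A learning rate that is too small makes this loop converge slowly, while one that is too large can make the updates diverge.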
One of the major assumptions of linear regression is that the variance around the regression line is the same for all values of the predictor variable. This assumption is called _____?
Homoscedasticity
Why do we prefer Convolutional Neural networks (CNN) over Artificial Neural networks (ANN) for image data as input?
1. A feedforward neural network can learn a single feature representation of the image, but for complex images an ANN fails to give good predictions because it cannot learn the pixel dependencies present in the image. 2. A CNN can learn multiple layers of feature representations of an image by applying filters, or transformations. 3. In a CNN, the number of parameters to learn is significantly lower than in a multilayer neural network, because filter weights are shared and each unit connects only to a small region of the input, which reduces the chance of overfitting. 4. A CNN also considers the context information in a small neighborhood, which is very important for good predictions on data like images. Since digital images are large grids of pixel values, it makes sense to use a CNN to analyze them: it reduces their dimensionality, which makes training cheaper computationally with little loss of information.
Explain the different layers in CNN.
1. Input Layer: The input layer in a CNN contains the image data, which is represented by a three-dimensional matrix. We have to reshape the image into a single column. For example, for an MNIST image of dimension 28 x 28 = 784, you need to convert it into 784 x 1 before feeding it into the input. If there are "k" training examples in the dataset, then the dimension of the input will be (784, k). 2. Convolutional Layer: This layer performs the convolution operation, sliding several small windows (filters) over the data. 3. ReLU Layer: This layer introduces non-linearity to the network and converts all the negative values to zero. The final output is a rectified feature map. 4. Pooling Layer: Pooling is a down-sampling operation that reduces the dimensionality of the feature map. 5. Fully Connected Layer: This layer identifies and classifies the objects in the image. 6. Softmax / Logistic Layer: The softmax or logistic layer is the last layer of a CNN. It sits at the end of the FC layer. Logistic is used for binary classification problems and softmax for multi-class classification problems. 7. Output Layer: This layer contains the label in the form of a one-hot encoded vector.
Consider a convolutional layer in a neural network with the following characteristics: Input image size: 32x32x3 (32 pixels height, 32 pixels width, 3 color channels) Filter/kernel size: 5x5 Number of filters: 10 Stride: 1 Padding: 2 What are the dimensions of the output activation map?
32x32x10
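The answer follows from the standard output-size formula for a convolution, floor((n + 2p − f) / s) + 1 per spatial dimension, with the depth equal to the number of filters. A minimal sketch, with the helper name `conv_output_size` chosen for illustration:

```python
def conv_output_size(n, f, padding, stride):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - f) // stride + 1


# Values from the question: 32x32x3 input, 5x5 filters, 10 filters,
# stride 1, padding 2.
side = conv_output_size(n=32, f=5, padding=2, stride=1)
output_shape = (side, side, 10)  # depth equals the number of filters

# A 1x1 filter with no padding leaves the spatial size unchanged.
side_1x1 = conv_output_size(n=32, f=1, padding=0, stride=1)

print(output_shape, side_1x1)
```

The number of input channels (3) does not appear in the output shape, because each filter spans all input channels and produces one output channel.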
Consider a convolutional layer in a neural network with the following characteristics:
4864
Consider a binary classification problem where a model is trained to identify fraudulent transactions. The dataset consists of 505 transactions, with 10 of them being fraudulent. The model's performance is evaluated with the following confusion matrix: True Positives (correctly identified fraudulent transactions): 6; True Negatives (correctly identified non-fraudulent transactions): 480; False Positives (incorrectly identified as fraudulent): 15; False Negatives (incorrectly identified as non-fraudulent): 4. What is the model's accuracy?
96.2%
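The figure can be checked directly from the four confusion-matrix counts in the question: accuracy is the share of all predictions that are correct, (TP + TN) / total.

```python
# Confusion-matrix counts from the question.
tp, tn, fp, fn = 6, 480, 15, 4

total = tp + tn + fp + fn            # 505 transactions
accuracy = (tp + tn) / total         # correct predictions / all predictions
accuracy_pct = round(accuracy * 100, 1)

print(total, accuracy_pct)
```

Note that only 10 of the 505 transactions are fraudulent, so a high accuracy says little about how well the fraud cases themselves are detected.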
Which of the following best describes a "one-hot" vector?
A vector with a single element set to 1 and the rest set to 0, representing the presence or absence of a specific category.
Explain the role of the flattening layer in CNN
After a series of convolution and pooling operations on the feature representation of the image, we then flatten the output of the final pooling layers into a single long continuous linear array or a vector. The process of converting all the resultant 2-d arrays into a vector is called Flattening. Flatten output is fed as input to the fully connected neural network having varying numbers of hidden layers to learn the non-linear complexities present with the feature representation.
Which one is the correct Linear regression assumption?
All below are correct. Linear regression assumes the input and output variables are not noisy. Linear regression will over-fit your data when you have highly correlated input variables. The residuals (true target value − predicted target value) of the data are normally distributed and independent from each other
True or False? If we take the weighted sum of inputs as the output as we do in Linear Regression, the value can be more than 1 but we want a value between 0 and 1. That's why Linear Regression can't be used for classification tasks. Logistic Regression is a generalized Linear Regression in the sense that we don't output the weighted sum of inputs directly, but we pass it through a function that can map any real value between 0 and 1. The value of the sigmoid function always lies between 0 and 1
All true.
True or False? Ridge regression decreases the complexity of a model but does not reduce the number of variables, since it never shrinks a coefficient to exactly zero but only reduces its magnitude. Ridge regression is not good for feature reduction. As the regularization parameter increases, the values of the coefficients tend towards zero. This leads to both low variance (as some coefficients have a negligible effect on the prediction) and low bias (minimization of the coefficients reduces the dependency of the prediction on a particular variable).
All true.
True or False? Lasso regression stands for Least Absolute Shrinkage and Selection Operator. The difference between ridge and lasso regression is that lasso tends to shrink coefficients to exactly zero, whereas ridge never sets the value of a coefficient to exactly zero. Lasso can be used to select important features of a dataset.
All true.
When a violation of homoscedasticity affects the fit of the linear model, how can the influence of this violation be reduced?
Apply a transformation on the target variable. Use weighted least squares.
How does the violation of homoscedasticity affect the fit of the linear model?
As the variance of the residuals increases, the model becomes more sensitive to such data points than the areas with less variance. The data points in the high variance region act as influential outliers.
Why do we use a Pooling layer in a CNN?
CNNs use pooling layers to reduce the size of the input representation, which speeds up the computation of the network. Pooling (or spatial pooling) is also called subsampling or downsampling. It is applied after the convolution and ReLU operations and reduces the dimensionality of each feature map while retaining the most important information. This matters because the number of hidden layers required to learn the complex relations present in an image would otherwise be large. As a result of pooling, even if the picture were a little tilted, the largest number in a certain region of the feature map would still be recorded, and hence the feature would be preserved. As another benefit, reducing the size by a significant amount uses less computational power. Pooling is therefore also useful for extracting dominant features.
What do you understand by shared weights in CNN?
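CNNs work by passing a filter over the image input. For the trivial example of a 4×4 image and a 2×2 filter with a stride size of 2, the filter (which has four weights, one per pixel) is applied four times. Without weight sharing, each of the four positions would have its own weights, making for 16 weights total. With shared weights, the same four weights are reused at every position, so only four weights have to be learned.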
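The 4×4 image and 2×2 filter with stride 2 can be traced in a short sketch. The pixel and kernel values below are arbitrary, chosen only for illustration. The point is that one set of four weights is applied at all four positions.

```python
# Arbitrary 4x4 "image" and 2x2 kernel (illustrative values only).
image = [[1, 2, 0, 1],
         [0, 1, 3, 1],
         [2, 1, 0, 0],
         [1, 0, 1, 2]]
kernel = [[1, 0],
          [0, -1]]
stride = 2
k = len(kernel)

output = []
applications = 0
for i in range(0, len(image) - k + 1, stride):
    row = []
    for j in range(0, len(image[0]) - k + 1, stride):
        # The same kernel weights are reused at every position (i, j).
        value = sum(kernel[di][dj] * image[i + di][j + dj]
                    for di in range(k) for dj in range(k))
        row.append(value)
        applications += 1
    output.append(row)

shared_weights = k * k                    # weights learned with sharing
unshared_weights = applications * k * k   # weights if each position had its own

print(output, applications, shared_weights, unshared_weights)
```

The filter is applied four times, but only the four shared weights would be learned; separate weights per position would require sixteen.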
For Lasso Regression(L1), if the regularization parameter is very high, which options are true?
Can be used to select important features of a dataset. Shrinks the coefficients of less important features to exactly 0.
Logistic Regression is a Machine Learning algorithm that is used to predict the probability of a ___?
Categorical dependent variable.
True or False: Logistic regression and linear regression have the same update rule when using gradient descent because the derivative of the logistic sigmoid function is equivalent to the identity function.
False
True or False: Logistic regression inherently learns non-linear decision boundaries because the logistic sigmoid function, being non-linear, introduces non-linearity in the model.
False
True or False: While convolutional neural networks (CNNs) excel at learning hierarchical representations from raw input data, they require manual feature engineering to achieve optimal performance in complex tasks.
False
Briefly explain the two major steps of CNN
Feature learning is the step in which the algorithm learns the features of the dataset. Components like Convolution, ReLU, and Pooling do this work, applied repeatedly. Once the features are known, classification happens using the Flattening and Full Connection components.
In logistic regression, the least squares (sum of squared errors) objective function is not commonly used. Why is least squares not an appropriate choice for logistic regression compared to linear regression?
Logistic regression involves a non-linear transformation of the output (into a probability), making the squared error less meaningful.
For a Linear Regression model, we choose the coefficients and the bias term by minimizing the _____.
Loss function / error function / cost function
For a regression line through the data, the vertical distance from each data point to the regression line is called the residual. (i) Square each residual, and (ii) sum all of the squared errors together. This is the quantity that ordinary least squares seeks to _____?
Minimize
Explain the significance of "Parameter Sharing" and "Sparsity of connections" in CNN.
Parameter sharing: In convolutions, we share the parameters while convolving over the input. The intuition is that a feature detector that is useful in one part of the image may also be useful in another part of the image. So, by convolving a single filter over the entire input, the parameters are shared. For example, if we used just a fully connected layer, the number of parameters would be 32*32*3*28*28*6, which is roughly 14 million and is impractical. With a convolutional layer, the number of parameters is (5*5 + 1) * 6 (if there are 6 filters), which equals 156. Convolutional layers therefore reduce the number of parameters and speed up the training of the model significantly. Sparsity of connections: For each layer, each output value depends on a small number of inputs instead of taking into account all the inputs.
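The two parameter counts in this answer can be reproduced with plain arithmetic. The sketch follows the formulas exactly as written in the answer: a 32×32×3 input fully connected to a 28×28×6 output, versus six 5×5 filters with one bias each.

```python
# Fully connected: every input value connects to every output value.
n_inputs = 32 * 32 * 3
n_outputs = 28 * 28 * 6
fc_params = n_inputs * n_outputs

# Convolutional, as counted in the answer: 5*5 weights plus 1 bias per filter.
n_filters = 6
conv_params = (5 * 5 + 1) * n_filters

print(fc_params, conv_params)
```

The gap between the two numbers is the saving that parameter sharing and sparse connections buy.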
Imagine a medical test to detect a rare disease where only 1 in 1000 individuals are actually affected. A classifier is trained to identify the disease based on certain features. The classifier is evaluated on a dataset of 10,000 individuals, with the following results: True Positives (correctly identified cases): 5; True Negatives (correctly identified non-cases): 9,900; False Positives (incorrectly identified as cases): 20; False Negatives (incorrectly identified as non-cases): 75. Which of the following metrics are appropriate for evaluating this classifier's performance? Choose all that apply.
Precision, Recall
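Computing the metrics from the counts in the question shows why accuracy alone is misleading here. The sketch uses the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN).

```python
# Confusion-matrix counts from the question.
tp, tn, fp, fn = 5, 9900, 20, 75

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted cases, how many are real
recall = tp / (tp + fn)      # of the real cases, how many were found

print(accuracy, precision, recall)
```

Accuracy is above 99% because the non-cases dominate, while precision (0.2) and recall (0.0625) reveal that the classifier finds very few of the actual cases.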
Explain the significance of the RELU Activation function in Convolution Neural Network.
ReLU Layer: After each convolution operation, the ReLU operation is applied. ReLU is a non-linear activation function. It is applied to each pixel and replaces all the negative values in the feature map with zero. Images are usually highly non-linear, meaning the pixel values vary widely, which makes it difficult for a purely linear model to make correct predictions. Because convolution is a linear operation, a non-linear activation function like ReLU is applied after it to introduce the non-linearity the network needs to model such data. This layer therefore helps in the detection of features, and setting negative values to zero also allows variations of features to be detected.
What is the difference between shallow (neural network) and deep (neural network) classifiers?
Shallow classifiers are shallow in terms of the number of layers but may have high-dimensional feature representations, whereas deep classifiers are deep in terms of both the number of layers and the complexity of features learned.
What's the cost function of the logistic regression?
Log loss (binary cross-entropy), computed on the output of the sigmoid / logistic function.
Which of the following statements about activation functions in the output layer of neural networks is TRUE? Check all that apply.
The Sigmoid activation function is suitable for binary classification tasks and is often used in the output layer to produce probabilities between 0 and 1. The Softmax activation function is frequently employed in the output layer for multi-class classification tasks, as it normalizes the outputs into a probability distribution.
What's the penalty term for the Lasso regression?
The sum of the absolute values of the coefficients
What is the role of the Fully Connected Layer in CNN?
The aim of the fully connected layer is to use the high-level features of the input image produced by the convolutional and pooling layers to classify the image into various classes based on the training dataset. Fully connected means that every neuron in the previous layer is connected to every neuron in the next layer. When a softmax activation function is used in the output layer, the output probabilities of the fully connected layer sum to 1. The softmax function takes a vector of arbitrary real-valued scores and transforms it into a vector of values between 0 and 1 that sum to 1.
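The softmax transformation mentioned in this answer can be sketched in a few lines. The input scores are arbitrary example values. Subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Map real-valued scores to probabilities that sum to 1."""
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary class scores, e.g. the raw outputs of a fully connected layer.
probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))
```

Each output lies strictly between 0 and 1, the outputs sum to 1, and the largest score receives the largest probability.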
Disadvantages of Linear Regression
The assumption of linearity between the dependent variable and the independent variables: in the real world, the data is not always linearly separable. Linear regression is very sensitive to outliers. Before applying linear regression, multicollinearity should be removed, because the model assumes that there is no relationship among the independent variables.
During the backpropagation algorithm, what role does the chain rule play?
The chain rule is used to propagate errors backward through the network by computing the gradient of the loss function with respect to each layer's inputs.
For Lasso Regression(L1), if the regularization parameter = 0, what does it mean?
The loss function is the same as the ordinary least squares loss function.
Suppose you are training a machine learning model to predict house prices based on various features such as square footage, number of bedrooms, and location. After training the model, you notice that it performs poorly on both the training and validation sets. The predictions are consistently off, and the model doesn't capture the underlying patterns in the data effectively. What might be the issue, and how could you address it?
The model is underfitting. To address this, increase the complexity of the model by adding more features.
A linear regression model assumes "a linear relationship between the input variables and the single output variable." What's the meaning of this assumption?
The output variable can be calculated from a linear combination of the input variables
What's the hypothesis of logistic regression?
To limit the output between 0 and 1, by passing the weighted sum of the inputs through the sigmoid function.
In the context of linear regression, when might you prefer using the normal equations over gradient descent, or vice versa, to find its optimal parameters?
Use gradient descent when computational efficiency is a priority and the dataset is too large to fit into memory; use the normal equations when the dataset is small and can be stored entirely in memory.
In a simple linear regression problem, a single input variable (x) and a single output variable (y), the linear equation would be y = ax + b; where a and b are _______ and ________ respectively
Feature coefficient and bias; equivalently, slope and y-intercept.
