DL Mid
In gradient descent with momentum, how is the velocity update typically calculated? A) By multiplying the gradient by a constant factor. B) By using a weighted sum of the previous velocity and the current gradient. C) By adding the current gradient directly to the weights. D) By subtracting the learning rate from the current gradient.
B) By using a weighted sum of the previous velocity and the current gradient.
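A minimal sketch of the weighted-sum update in answer B, assuming the common form m <- beta*m + (1 - beta)*g followed by phi <- phi - lr*m. The names momentum_step, beta, and lr are illustrative and not from the source; some formulations drop the (1 - beta) factor.

```python
def momentum_step(params, velocity, grads, lr=0.1, beta=0.9):
    """One gradient-descent-with-momentum update on lists of floats."""
    # Velocity is a weighted sum of the previous velocity and the current gradient.
    new_velocity = [beta * v + (1.0 - beta) * g for v, g in zip(velocity, grads)]
    # Parameters move against the smoothed gradient, scaled by the learning rate.
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


params, velocity = momentum_step([1.0], [0.0], [2.0])
```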
What is the primary goal of fitting a model in the context of machine learning? A) Achieving perfect performance on the training data B) Finding model parameters that produce the best possible mapping from input to output for the given task C) Minimizing the training loss D) Finding the global minimum of the loss function
B) Finding model parameters that produce the best possible mapping from input to output for the given task
Which type of layer is used to transform the multi-dimensional output of a convolutional layer into a format suitable for input to a fully connected layer? A) Pooling Layer B) Flatten layer C) Batch normalization layer D) Convolutional Layer
B) Flatten layer
Which of the following is the most commonly used activation function? A) Sigmoid B) ReLU C) Linear D) Tanh
B) ReLU
Which of the following statements is FALSE regarding the comparison between shallow and deep networks? A) Both can approximate any continuous function given enough capacity B) Deep networks are often easier to train in practice C) Deep networks are always more efficient at representing any function D) Deep networks produce more linear regions per parameter
C) Deep networks are always more efficient at representing any function
The core concept behind supervised learning is: A) Clustering data into groups. B) Maximizing rewards in an environment. C) Defining a mapping from input data to an output prediction. D) Learning from unlabeled data
C) Defining a mapping from input data to an output prediction.
The book categorizes machine learning methods into three broad areas. Which of the following is NOT one of them? A) Supervised Learning B) Reinforcement Learning C) Evolutionary Learning D) Unsupervised Learning
C) Evolutionary Learning
Which type of pooling operation is more robust to small translations in the input? A) Sum pooling B) No pooling C) Max pooling D) Average pooling
C) Max pooling
In reinforcement learning, the entity that performs actions within an environment is called: A) The environment B) The actor C) The agent D) The observer
C) The agent
Which of the following best describes a computation graph in the context of neural networks? A) A decision tree used to choose the optimal learning rate for the network. B) A network visualization tool to analyze the performance of the model. C) A graph representing the architecture of the neural network layers. D) A directed acyclic graph (DAG) where nodes represent operations, and edges represent data flow, used to compute the loss and gradients.
D) A directed acyclic graph (DAG) where nodes represent operations, and edges represent data flow, used to compute the loss and gradients.
In the context of the 1D linear regression example, what does the model represent? A) A family of parabolas. B) A single fixed line. C) A family of exponential curves. D) A family of straight lines.
D) A family of straight lines.
What is a convolutional kernel in a CNN? A) A fully connected network applied to an image. B) A pooling operation applied to reduce dimensionality. C) A set of filters used for averaging pixel values. D) A matrix of weights used to process specific regions of the input.
D) A matrix of weights used to process specific regions of the input.
In backpropagation through a CNN, what is updated during training? A) The strides B) The input image C) The number of channels D) The kernels
D) The kernels
In the context of supervised learning, what does the term 'inference' refer to? A) The process of evaluating a model's performance on a test dataset. B) The process of adjusting model parameters to minimize the loss. C) The process of using a trained model to make predictions on new input data. D) The process of collecting and labeling training data.
C) The process of using a trained model to make predictions on new input data.
What is an epoch in the context of model training? A) A complete pass through the entire training dataset B) A single iteration of the optimization algorithm C) A fixed number of training examples D) The process of evaluating the model on the test data
A) A complete pass through the entire training dataset
How does data augmentation help improve the performance of a CNN? A) By increasing the effective size of the training set B) By increasing the network's depth C) By adjusting the learning rate during training D) By reducing the model complexity
A) By increasing the effective size of the training set
What does the ReLU activation function do in neural networks? A) It clips the negative inputs to zero B) It creates exponential growth in outputs C) It multiplies input by a constant factor D) It maps input to the range 0..1.
A) It clips the negative inputs to zero
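As a small illustration of answer A, ReLU can be written as max(0, z). The function name relu is chosen here for the example only.

```python
def relu(z):
    """Return max(0, z): negative inputs are clipped to zero, others pass through."""
    return max(0.0, z)


outputs = [relu(z) for z in (-2.0, -0.5, 0.0, 1.5)]
```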
How does automatic differentiation leverage the structure of a computation graph to compute gradients? A) It first performs a forward pass to compute the output, then applies the chain rule in the reverse direction (backpropagation) to compute gradients at each node. B) It uses forward propagation only to compute the gradients at each node. C) It computes gradients by traversing the graph from input to output, using finite differences to approximate derivatives. D) It computes gradients by analyzing the symbolic form of the function without considering data flow.
A) It first performs a forward pass to compute the output, then applies the chain rule in the reverse direction (backpropagation) to compute gradients at each node.
What is the universal approximation theorem? A) It states that a neural network with enough capacity can approximate any continuous function B) It states that shallow networks cannot approximate continuous functions C) It applies only to networks with multiple layers D) It states that deep networks can approximate any function
A) It states that a neural network with enough capacity can approximate any continuous function
The fundamental characteristic of unsupervised learning is: A) The absence of labeled output data. B) The focus on sequential decision-making. C) The use of rewards to guide learning. D) The presence of labeled output data.
A) The absence of labeled output data.
What happens when a stride of 2 is used in a convolutional layer? A) The output size becomes half of the input size. B) The number of channels is reduced. C) The output size is doubled. D) The output size remains unchanged.
A) The output size becomes half of the input size.
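The "half" in answer A is exact only for some kernel and padding choices. A sketch using the standard output-size formula floor((n + 2p - k) / s) + 1 shows the general case; the function name and argument names are illustrative.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    return (n + 2 * padding - kernel) // stride + 1


half = conv_output_size(32, kernel=3, stride=2, padding=1)   # stride 2 halves 32
same = conv_output_size(32, kernel=3, stride=1, padding=1)   # "same" padding keeps 32
valid = conv_output_size(32, kernel=3, stride=1, padding=0)  # "valid" padding shrinks it
```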
What is the purpose of the learning rate in gradient descent? A) To control the step size of parameter updates during training B) To determine the number of epochs C) To control the size of the training batch D) To measure the model's performance
A) To control the step size of parameter updates during training
What is the main advantage of using pooling layers in CNNs? A) To decrease the spatial dimensions B) To decrease the number of channels C) To increase the learnable parameters D) To introduce non-linearity
A) To decrease the spatial dimensions
What is the main purpose of using transfer learning in CNNs? A) To improve training efficiency B) To reduce the number of model parameters C) To add more model layers D) To train a model from scratch
A) To improve training efficiency
What type of padding should be used if we want to reduce the output size? A) Valid padding B) Reflective padding C) Zero padding D) Same padding
A) Valid padding
In the context of model evaluation, what does 'underfitting' mean? A) The model is too simple and cannot capture the true underlying relationship between inputs and outputs. B) The model has been trained for too many epochs. C) The model has too few parameters. D) The model is too complex and captures noise in the training data, leading to poor generalization.
A) The model is too simple and cannot capture the true underlying relationship between inputs and outputs.
What is the number of weights in a convolutional layer with a kernel size of 3, a stride of 2, an input of 5 channels, an output of 10 channels, and an image size of 32x32? A) 1350 B) 450 C) 150 D) 3200
B) 450
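The count in answer B can be checked with a short calculation, assuming a 2D convolution with a square 3x3 kernel and counting weights only (no biases). Stride and image size do not affect the number of weights. The function name is illustrative.

```python
def conv2d_weight_count(kernel_size, in_channels, out_channels):
    """Weights in a 2D conv layer: one k x k x in_channels kernel per output channel."""
    return kernel_size * kernel_size * in_channels * out_channels


n_weights = conv2d_weight_count(kernel_size=3, in_channels=5, out_channels=10)
```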
What does the loss function represent in the model fitting process? A) The learning rate of the optimization algorithm B) A measure of the mismatch between the network predictions and the ground truth for a training set C) The number of parameters in the model D) The complexity of the model
B) A measure of the mismatch between the network predictions and the ground truth for a training set
In a deep network, how does each layer contribute to the complexity of the function? A) By changing the activation function B) By introducing new joints/folds through clipping C) By adding more parameters D) By increasing the number of hidden units
B) By introducing new joints/folds through clipping
Which of the following is a commonly used optimization algorithm for fitting neural networks? A) Genetic algorithms B) Gradient descent C) Exhaustive search D) Simulated annealing
B) Gradient descent
Which of the following is a reason to use Nesterov Accelerated Momentum over traditional momentum? A) It eliminates the oscillation problem completely. B) It anticipates future changes in velocity by using a lookahead step, leading to more accurate updates. C) It uses a variable learning rate, unlike momentum. D) It only adjusts parameters at the end of each epoch, unlike momentum.
B) It anticipates future changes in velocity by using a lookahead step, leading to more accurate updates.
Which of the following is true about the receptive field as we go deeper in a CNN? A) It becomes smaller B) It becomes larger C) It depends on the activation function D) It remains the same
B) It becomes larger
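A sketch of why the receptive field grows with depth, assuming plain (undilated) convolutions described by hypothetical (kernel_size, stride) pairs. Each layer adds (kernel - 1) input positions, scaled by the product of the strides of the layers before it.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) after a stack of (kernel_size, stride) layers."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the view by (k - 1) steps
        jump *= stride                # striding spreads later steps further apart
    return field


# Receptive field after one, two, and three 3x3 stride-1 layers.
sizes = [receptive_field([(3, 1)] * depth) for depth in (1, 2, 3)]
```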
What effect does increasing the number of filters in a convolutional layer have? A) It increases the parameter reuse B) It increases the depth of the output C) It increases the receptive field D) It increases the spatial dimensions of the output
B) It increases the depth of the output
What are hyperparameters in the context of deep networks? A) The learned weights and biases B) Quantities like the number of layers and hidden units, chosen before training C) The type of activation function D) The input and output data
B) Quantities like the number of layers and hidden units, chosen before training
In the matrix notation, what do the vectors βk represent? A) The weights connecting layers B) The biases at each layer C) The output data D) The activations at each layer
B) The biases at each layer
What is the "width" of a neural network? A) The total number of parameters B) The number of hidden units in each layer C) The number of hidden layers D) The size of the input data
B) The number of hidden units in each layer
In the matrix notation for deep networks, what do the matrices Ωk represent? A) The activations at each layer B) The weights connecting layers C) The biases at each layer D) The input data
B) The weights connecting layers
In a deep network, how are the outputs of one layer used in subsequent layers? A) They are multiplied by a constant B) They serve as inputs to the next layer C) They are discarded D) They are averaged with previous outputs
B) They serve as inputs to the next layer
What is the role of momentum in optimization algorithms like SGD? A) To prevent the algorithm from getting stuck in saddle points B) To accelerate the convergence of the optimization process, especially in the presence of high curvature or noisy gradients C) To regularize the model and prevent overfitting D) To determine the optimal learning rate
B) To accelerate the convergence of the optimization process, especially in the presence of high curvature or noisy gradients
What is the purpose of the bias parameters in a neural network? A) To introduce nonlinearity B) To add an offset to the linear functions C) To control the slope of the linear functions D) To initialize the weights
B) To add an offset to the linear functions
What is the purpose of a 1x1 convolution in CNNs? A) To perform pooling B) To reduce the number of channels C) To apply non-linearity D) To increase the spatial dimensions of the feature map
B) To reduce the number of channels
Which of the following is NOT a hyperparameter of a shallow neural network? A) Number of hidden units B) Weights connecting the input to the hidden layer C) Number of layers D) Type of activation function
B) Weights connecting the input to the hidden layer
In supervised learning, the distinction between regression and classification problems lies in: A) The amount of training data required. B) Whether the output is a continuous value or a categorical assignment. C) The type of input data used. D) The complexity of the model.
B) Whether the output is a continuous value or a categorical assignment.
Which parameter determines how much a convolutional kernel moves during the convolution operation? A) Pooling B) Stride C) Dilation D) Padding
B) Stride
The primary focus of the field of AI is to: A) Develop faster computing hardware. B) Solve complex mathematical equations. C) Create systems that mimic intelligent behavior. D) Analyze large datasets.
C) Create systems that mimic intelligent behavior.
What is one of the effects of using a dilated convolution in a convolutional neural network? A) Increases the stride B) Increases the receptive field C) Decreases the receptive field D) Reduces the number of parameters
B) Increases the receptive field
Which of the following are key components of the backpropagation algorithm? A) Learning rate annealing and gradient noise to improve optimization. B) Gradient clipping and L2 regularization to prevent overfitting. C) Forward pass and reverse pass to propagate the gradients. D) Activation functions and bias updates during the forward pass only.
C) Forward pass and reverse pass to propagate the gradients.
Machine Learning (ML) distinguishes itself from other AI approaches by: A) Relying solely on logical reasoning. B) Focusing exclusively on structured data. C) Its ability to learn from experience and improve its decision-making capabilities. D) Being entirely dependent on human programming.
C) Its ability to learn from experience and improve its decision-making capabilities.
The loss function in supervised learning is used to: A) Control the learning rate during training. B) Determine the optimal number of hidden layers in a neural network. C) Quantify the mismatch between the model's predictions and the true outputs in the training data. D) Generate new data samples.
C) Quantify the mismatch between the model's predictions and the true outputs in the training data.
How does stochastic gradient descent (SGD) differ from standard gradient descent? A) SGD updates parameters only once per epoch B) SGD uses a fixed learning rate C) SGD computes the gradient on a small random subset of the training data at each iteration D) SGD is only suitable for shallow neural networks
C) SGD computes the gradient on a small random subset of the training data at each iteration
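A sketch of how the random subsets in answer C might be drawn, assuming sampling without replacement so that the batches together cover the dataset once per epoch. The helper name minibatches and the batch size are illustrative.

```python
import random


def minibatches(data, batch_size, rng):
    """Yield random batches that together cover the dataset once (one epoch)."""
    indices = list(range(len(data)))
    rng.shuffle(indices)  # a fresh random order each epoch
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


# Each batch would drive one gradient step; ten examples give three steps here.
batches = list(minibatches(list(range(10)), batch_size=4, rng=random.Random(0)))
```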
The primary goal of generative unsupervised models is to A) Identify outliers in a dataset. B) Classify data into predefined categories. C) Synthesize new data examples that resemble the training data. D) Predict future values in a time series
C) Synthesize new data examples that resemble the training data.
Which of the following describes a feature map in a CNN? A) The output of a pooling layer B) The collection of learned filters C) The spatial representation of the learned features after applying a convolution D) The number of output channels
C) The spatial representation of the learned features after applying a convolution
What is the 'bias-variance trade-off' in supervised learning? A) The trade-off between the learning rate and the batch size during training. B) The trade-off between the accuracy of the model on the training data and its complexity. C) The trade-off between the model's ability to fit the training data perfectly and its ability to generalize to new data. D) The trade-off between the number of layers and the number of hidden units in a neural network.
C) The trade-off between the model's ability to fit the training data perfectly and its ability to generalize to new data.
What is the purpose of backpropagation in neural network training? A) To adjust the learning rate during training B) To evaluate the model's performance C) To efficiently compute the gradients of the loss function with respect to all the model parameters D) To initialize the model parameters
C) To efficiently compute the gradients of the loss function with respect to all the model parameters
What is the purpose of a test dataset in supervised learning? A) To fine-tune the hyperparameters of the model. B) To train the model's parameters. C) To evaluate how well the trained model generalizes to new, unseen data. D) To generate new data samples.
C) To evaluate how well the trained model generalizes to new, unseen data.
What is the purpose of the softmax function in the output layer of a CNN? A) To increase the number of output channels B) To apply non-linearity C) To normalize the output into a probability distribution D) To reduce overfitting
C) To normalize the output into a probability distribution
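A minimal softmax sketch for answer C. Subtracting the maximum logit before exponentiating is a common numerical-stability step and does not change the result; the example logits are made up.

```python
import math


def softmax(logits):
    """Map real-valued scores to positive values that sum to one."""
    shift = max(logits)  # stabilizes exp() without changing the output
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```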
What is the primary modification made by the Adam optimization algorithm compared to basic gradient descent with momentum? A) Adam performs updates using only the second moment (squared gradients) to adjust the learning rate. B) Adam uses a constant learning rate across all parameters. C) Adam uses a constant learning rate across all parameters. D) Adam combines momentum with adaptive learning rates by maintaining moving averages of both the gradients and the squared gradients.
D) Adam combines momentum with adaptive learning rates by maintaining moving averages of both the gradients and the squared gradients.
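A scalar sketch of the update described in answer D, assuming the usual bias-corrected form of Adam with its commonly quoted default hyperparameters. The function name adam_step and the single-parameter setup are simplifications for illustration.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad  # moving average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# The first step has size close to lr whether the gradient is small or large.
small, _, _ = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(1.0, 200.0, m=0.0, v=0.0, t=1)
```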
How do deep networks help with hierarchical feature learning? A) By reducing the number of neurons in each layer B) By forcing inputs to be linearly separable C) By using complex matrix operations in each layer D) By progressively learning more abstract representations at deeper layers
D) By progressively learning more abstract representations at deeper layers
How do shallow neural networks handle multivariate inputs? A) By scaling the input features B) By combining inputs into a single scalar C) By creating separate networks for each input D) By using multiple input units to handle each dimension of the input
D) By using multiple input units to handle each dimension of the input
The specific term used when a deep neural network is fitted to data is: A) Machine Learning. B) Supervised Learning. C) Artificial Intelligence. D) Deep Learning.
D) Deep Learning.
Which of the following statements is true about a fully connected network? A) It cannot approximate nonlinear functions B) It is only suitable for image data C) None of these D) Every element in one layer connects to every element in the next layer
D) Every element in one layer connects to every element in the next layer
The primary goal of a learning algorithm in supervised learning is to: A) Increase the variance of the model's predictions. B) Minimize the number of parameters in the model. C) Maximize the complexity of the model. D) Find the parameters that minimize the loss function.
D) Find the parameters that minimize the loss function.
What is the primary method used for training (or fitting) the linear regression model in the example? A) Randomly assigning parameter values. B) Using a closed-form solution to directly compute the optimal parameters. C) Exhaustive search over all possible parameter combinations. D) Gradient descent, starting from an initial guess and iteratively improving the parameters.
D) Gradient descent, starting from an initial guess and iteratively improving the parameters.
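A sketch of the iterative procedure in answer D for the line y = phi0 + phi1 * x. It assumes a mean (rather than summed) squared-error loss so the step size does not depend on dataset size; the data, learning rate, and step count are made up for illustration.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = phi0 + phi1 * x by gradient descent on the mean squared error."""
    phi0, phi1 = 0.0, 0.0  # initial guess
    n = len(xs)
    for _ in range(steps):
        residuals = [phi0 + phi1 * x - y for x, y in zip(xs, ys)]
        grad0 = 2.0 * sum(residuals) / n
        grad1 = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        phi0 -= lr * grad0  # step downhill along each parameter
        phi1 -= lr * grad1
    return phi0, phi1


# Noise-free points on y = 2x + 1, so the fit should approach intercept 1, slope 2.
intercept, slope = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```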
During backpropagation, how are the gradients calculated and updated in each layer of a neural network? A) Gradients are calculated and updated layer by layer in the forward direction of the network. B) Gradients are calculated and updated only for the output layer. C) Gradients are calculated using only the loss function without considering the activation functions in each layer. D) Gradients are calculated from the output layer back to the input layer, using the chain rule to update parameters for all layers.
D) Gradients are calculated from the output layer back to the input layer, using the chain rule to update parameters for all layers.
How does He Initialization relate to ReLU activations? A) He Initialization sets all weights to a fixed value to optimize convergence with ReLU activations. B) He Initialization is specifically designed to prevent vanishing gradients when using sigmoid activation functions. C) He Initialization is used only for output layers with softmax activations. D) He Initialization adjusts the variance of the weights based on the number of input units, making it particularly effective for ReLU activations to avoid vanishing or exploding gradients.
D) He Initialization adjusts the variance of the weights based on the number of input units, making it particularly effective for ReLU activations to avoid vanishing or exploding gradients.
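A sketch of the variance rule in answer D, assuming the normal-distribution variant of He initialization with weight variance 2 / fan_in. The function name he_init and the layer sizes are illustrative.

```python
import math
import random


def he_init(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


weights = he_init(fan_in=200, fan_out=100, rng=random.Random(0))
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)  # close to 2 / 200 = 0.01
```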
What impact does adding more hidden layers to a neural network have? A) It decreases the risk of overfitting B) It simplifies the model's architecture C) It adds more linear regions and increases expressiveness D) It always improves performance
C) It adds more linear regions and increases expressiveness
How does the number of output channels relate to the number of kernels used in a convolutional layer? A) It has no relation to the number of kernels. B) It is double the number of kernels. C) It is always one less than the number of kernels. D) It is equal to the number of kernels.
D) It is equal to the number of kernels.
What is the primary contribution of the backpropagation algorithm to neural network training? A) It eliminates the need for weight initialization in neural networks. B) It ensures that the learning rate decreases with each epoch. C) It allows for the computation of gradients for a single layer only. D) It provides an efficient method to calculate the gradients of the loss function with respect to all model parameters, enabling multilayer neural network training.
D) It provides an efficient method to calculate the gradients of the loss function with respect to all model parameters, enabling multilayer neural network training.
What advantage does reverse-mode automatic differentiation (used in backpropagation) offer compared to forward-mode automatic differentiation in deep learning? A) Forward-mode is the default choice for all deep learning models due to its computational efficiency. B) Reverse-mode is used to approximate gradients, while forward-mode computes exact gradients. C) Forward-mode is generally faster for deep learning models with many layers. D) Reverse-mode is more efficient for computing gradients in neural networks with many input variables and few outputs.
D) Reverse-mode is more efficient for computing gradients in neural networks with many input variables and few outputs.
What is the "depth" of a neural network? A) The number of hidden units in each layer B) The total number of parameters C) The size of the output data D) The number of hidden layers
D) The number of hidden layers
The chapter mentions several potential ethical concerns associated with AI. Which of the following is NOT one of them? A) Bias and fairness in algorithmic decision-making. B) The risk of AI being weaponized. C) The explainability of AI systems and their decision-making processes. D) The potential for AI to surpass human intelligence.
D) The potential for AI to surpass human intelligence.
In the 1D linear regression example, what is the least squares loss function designed to minimize? A) The sum of the absolute deviations between the model's predictions and the true values. B) The maximum deviation between the model's predictions and the true values. C) The product of the deviations between the model's predictions and the true values. D) The sum of the squares of the deviations between the model's predictions and the true values.
D) The sum of the squares of the deviations between the model's predictions and the true values.
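The loss in answer D can be written directly for the 1D line y = phi0 + phi1 * x. The parameter names phi0 and phi1 and the sample data are assumptions for this example.

```python
def least_squares_loss(phi0, phi1, xs, ys):
    """Sum of squared deviations between the line's predictions and the true values."""
    return sum((phi0 + phi1 * x - y) ** 2 for x, y in zip(xs, ys))


xs, ys = [0.0, 1.0], [1.0, 3.0]
perfect = least_squares_loss(1.0, 2.0, xs, ys)  # the line passes through both points
poor = least_squares_loss(0.0, 0.0, xs, ys)     # predicts zero everywhere
```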
What advantage do deep networks provide when processing structured data like images? A) They reduce the size of the input B) They require fewer hidden units C) They eliminate the need for feature engineering D) They can process local features in a hierarchical fashion
D) They can process local features in a hierarchical fashion
What role do parameters play in a machine learning model? A) They are used to measure the mismatch between model predictions and true outputs. B) They are hyperparameters chosen before training. C) They represent the fixed structure of the model equation. D) They determine the specific relationship between inputs and outputs within the model's family of possible relationships.
D) They determine the specific relationship between inputs and outputs within the model's family of possible relationships.
What is the primary purpose of automatic differentiation in deep learning frameworks? A) To perform symbolic differentiation for simple functions. B) To replace backpropagation with a more complex method of gradient computation. C) To approximate gradients using finite differences. D) To compute exact gradients efficiently during the training of neural networks by automatically applying the chain rule.
D) To compute exact gradients efficiently during the training of neural networks by automatically applying the chain rule.
Why is proper parameter initialization important in training deep neural networks? A) To avoid overfitting during training. B) To ensure that the gradients vanish during backpropagation. C) To ensure that the learning rate is kept constant throughout training. D) To help the optimization algorithm converge faster and avoid vanishing or exploding gradients.
D) To help the optimization algorithm converge faster and avoid vanishing or exploding gradients.
What is the purpose of using multiple convolutional layers in sequence? A) To increase the receptive field B) To increase the number of parameters C) To reduce the spatial dimensions of the input D) To increase feature complexity
D) To increase feature complexity
What is the purpose of an activation function in a neural network? A) To compute the loss B) To perform inference C) To initialize the parameters D) To introduce nonlinearity
D) To introduce nonlinearity
What is a common problem with very deep CNNs, and how can it be addressed? A) Overfitting; solved by adding more layers B) Underfitting; solved by reducing the number of parameters C) Overfitting; solved by removing dropout D) Vanishing gradients; solved by using skip connections (residual connections)
D) Vanishing gradients; solved by using skip connections (residual connections)