Stats 315 Final Exam

Consider a convolutional layer that transforms an image X of size h_1 x w_1 using a kernel K of size h_2 x w_2. Assume no bias and no padding is used. What is the size of the output of such a layer?

(h_1 - h_2 + 1) x (w_1 - w_2 + 1)

How many trainable parameters does a max pooling layer with a pool size of (5,5) have

0, a max pooling layer doesn't have any parameters at all

What is the process of applying filters to extract features (convolutions)

1. Apply set of weights to extract local features 2. Use multiple filters to extract different features 3. Spatially share parameters of each filter

What is the architecture of LeNet

1. a convolutional encoder consisting of two convolutional layers 2. a dense block consisting of three fully connected layers

How to implement self attention with neural networks

1. encode position information 2. extract query, key, value for search 3. compute attention weighting 4. extract features with high attention

If I have a 100 x 100 image and a kernel of shape 5 x 5, and I use padding of 4 along both the height and width dimensions, what will the shape of the output be

100 x 100. Output size = image size + padding - kernel size + 1 = 100 + 4 - 5 + 1 = 100 in each dimension (the padding of 4 is the total added along a dimension, i.e. 2 on each side)

if a 100 x 100 pixel grayscale image is to be converted into a hidden representation of dimension 100 using a fully connected layer, roughly how many parameters will be needed in that layer

10^6: inputs * outputs = (100 x 100) * 100 = 10^4 * 10^2 = 10^6

Suppose a Conv2D layer uses a kernel of spatial dimensions 5 x 5 and no biases. Suppose the number of input channels is 6 and the number of output channels is 16. What is the total number of trainable parameters in such a layer?

2400 (25 * 6 * 16). The number of weights in a convolutional layer is the patch size (5 x 5) * input channels * output channels
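
The two parameter counts above (fully connected and convolutional) can be checked with a short sketch. The helper names `dense_params` and `conv2d_params` are made up for illustration and are not part of any library; biases are off by default to match the cards.

```python
def dense_params(n_inputs, n_outputs, use_bias=False):
    """Weights in a fully connected layer: one per (input, output) pair."""
    weights = n_inputs * n_outputs
    return weights + (n_outputs if use_bias else 0)


def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, use_bias=False):
    """Weights in a 2D convolutional layer: patch size * input channels * output channels."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if use_bias else 0)


print(dense_params(100 * 100, 100))  # 1000000
print(conv2d_params(5, 5, 6, 16))    # 2400
```

Note that the convolutional count does not depend on the image size at all, which is why it is so much smaller than the fully connected count.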

If I have a 100 x 100 image and a kernel of shape 5 x 5, and I use padding of 4 and a stride of 2 along both the height and width dimensions, what will the shape of the output be

50 x 50. Output size = floor((image size + padding - kernel size) / stride) + 1 = floor((100 + 4 - 5) / 2) + 1 = 49 + 1 = 50 in each dimension
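
The output-shape cards in this set (no padding; padding of 4; padding of 4 with stride 2) all follow one formula. Below is a minimal sketch of it. The function name `conv_output_size` is hypothetical, and it assumes `padding` is the total number of rows or columns added along a dimension (so 4 means 2 on each side), which is the convention the answers above use.

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Output length along one dimension of a convolutional layer.

    n: input size, k: kernel size,
    padding: TOTAL padding added along this dimension (assumed convention),
    stride: step size of the sliding window.
    """
    return (n + padding - k) // stride + 1


print(conv_output_size(100, 5))                       # 96
print(conv_output_size(100, 5, padding=4))            # 100
print(conv_output_size(100, 5, padding=4, stride=2))  # 50
```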

What is a manifold

A lower dimensional subspace of some parent space that is locally similar to a linear (Euclidean) space

What is regularization

Actively impeding the model's ability to fit perfectly to training data with the goal of making the model perform better during validation

If you notice your neural network model has high variance, how might you try to fix this.

Add regularization, get more training data

what is the manifold hypothesis

All natural data lies on a low dimensional manifold within the high dimensional space where it is encoded (this is the reason machine learning works)

What is the advantage of pooling

Alleviates the excessive sensitivity of the convolutional layer to location

What is LeNet

Among the first published CNNs to capture wide attention for its performance on computer vision tasks. It matched the performance of support vector machines and was later adapted to recognize digits for processing deposits in ATMs

When working with timeseries data, how should you split the data for validation, training, testing, etc.

An example of a data split for a weather data set: first 50% for training, following 25% for validation, last 25% for testing. For timeseries data, validation and test data should be more recent than the training data (you want to predict the future given the past)
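
A minimal sketch of such a chronological split, assuming the samples are already ordered from oldest to newest. The function name `chronological_split` and the default fractions are illustrative, taken from the 50/25/25 example above.

```python
def chronological_split(series, train_frac=0.5, val_frac=0.25):
    """Split an ordered sequence without shuffling: oldest -> train, newest -> test."""
    n = len(series)
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test


readings = list(range(20))  # stand-in for 20 time-ordered measurements
train, val, test = chronological_split(readings)
print(len(train), len(val), len(test))  # 10 5 5
```

The key design choice is that nothing is shuffled, so every validation and test sample is more recent than every training sample.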

What does setting padding = 'same' do when creating a conv2D layer in keras

Pads the input so that the output has the same spatial dimensions as the input image (with a stride of 1)

When do you apply non-linearity during convolutional neural network

Apply after every convolution operation

convolution definition

Apply filters with learned weights to generate feature maps

what is classification (timeseries)

Assign one or more categorical labels to a timeseries: given the timeseries of the activity of a visitor on a website, classify whether the visitor is a human or a bot

What is recurrent layer stacking

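Stacking several recurrent layers on top of one another, so that each layer processes the full output sequence of the layer below it. A classic way to build more powerful recurrent networks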

What is iterated k-fold validation with shuffling

Consists of applying k-fold validation multiple times and shuffling the data every time before splitting it k ways. The final score is the average of the scores obtained at each run of k-fold validation. It ends up training and evaluating k * p models, where p is the number of iterations, so it is very expensive. It is for situations with relatively little data and a need to evaluate the model very precisely

The LeNet architecture we saw consisted of 4 types of layers: Convolution, pooling, flatten, and dense. The majority of the trainable parameters resided in which of these types?

Dense, dense neural networks have way more parameters than convolution. For pooling and flatten there are no parameters at all.

What is the convolution operation

During convolution we slide the filter (a small matrix of learned weights) over each patch of the input image, multiply the filter and the patch element-wise, and sum the products to produce one entry of the feature map
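
The sliding-window operation can be sketched in plain Python as follows. This is a minimal illustration with no padding and a stride of 1; the name `corr2d` is borrowed from the card about Alice and Bob later in this set, and the operation shown is cross-correlation (the kernel is not flipped).

```python
def corr2d(X, K):
    """Slide kernel K over image X (both lists of lists); no padding, stride 1."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    Y = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the kernel with the patch at (i, j), then sum.
            Y[i][j] = sum(X[i + a][j + b] * K[a][b]
                          for a in range(h) for b in range(w))
    return Y


X = [[0, 1, 2],
     [3, 4, 5],
     [6, 7, 8]]
K = [[0, 1],
     [2, 3]]
print(corr2d(X, K))  # [[19, 25], [37, 43]]
```

A 3 x 3 input with a 2 x 2 kernel gives a 2 x 2 feature map, matching the (h_1 - h_2 + 1) x (w_1 - w_2 + 1) rule.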

Why do we use three sets of data: training, validation, and test as opposed to two sets that just train and then test?

Every ML model requires configuration: choosing the number of layers, adjusting the size of the layers, etc. Feedback from the validation set guides this tuning, so the test set stays untouched until the final evaluation

How do you do feature-wise normalization of a dataset

For each feature in the input data, subtract the mean of the feature and divide it by the standard deviation
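
A minimal sketch of feature-wise normalization using only the standard library. The function name `normalize_features` is hypothetical; the sketch assumes rows are samples, columns are features, the population standard deviation is used, and no feature is constant (which would cause a division by zero).

```python
from statistics import mean, pstdev


def normalize_features(rows):
    """Return a copy of `rows` where each column has mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # assumes every std is non-zero
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


data = [[1.0, 10.0],
        [2.0, 20.0],
        [3.0, 30.0]]
normalized = normalize_features(data)
print(normalized)
```

In practice the means and standard deviations are usually computed on the training data only and then reused for validation and test data.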

What are the types of timeseries tasks

Forecasting, classification, event detection, anomaly detection

Class imbalance

Consider a binary classification problem where 90% of samples belong to class A and 10% to class B. A baseline classifier that just predicts A every time will achieve a validation accuracy of 90%. In this case a model is only useful if its validation accuracy beats that 90% baseline; an accuracy that sounds high but is below 90% is worse than always guessing A

what is the price we pay for massively reducing the number of parameters in a convolutional layer (compared to fully connected layer)

If data does not follow the assumption of translation invariance and locality our models might struggle even to fit our training data

For feature learning during CNNs for classification, what order should the operations be done in

Input --> convolution + relu --> pooling --> convolution + relu --> pooling --> flatten --> fully connected --> softmax

What does the "le" in "LeNet" stand for

It's taken from the name of LeNet's inventor Yann LeCun

What is the main lesson learned from seeing that the Dense Neural Network does worse than the non-ML baseline on the Jena temperature forecasting task

Just because a good solution technically exists in our hypothesis space doesn't mean we'll be able to find it via gradient descent.

what is the difference between max pooling and average pooling

Maximum pooling calculates the maximum value of the elements in the pooling window, whereas average pooling uses the average value of elements in the pooling window

what happens when the lambda coefficient of the regularization term is increased

NN weights are driven close to 0

What does locality mean in the context of designing neural network architectures for processing images?

Network should focus on local regions, without regard for the contents of the image in distant regions

what does translation invariance mean in the context of designing neural network architectures for processing images

Network should respond similarly to the same image patch regardless of where the patch appears in the image

Suppose we design a convolutional kernel K with P parameters/weights for use with a 1024 x 1024 grayscale image. Then the image shape is changed to 1024 x 1024 x 3 to accommodate 3 color channels. Keeping the spatial dimensions of K the same, we design a new kernel K' to operate on color images and to output 10 channels. Let P' be the number of parameters/weights in K'. What is the relationship between P' and P.

P' = 30P: with the spatial dimensions fixed, the parameter count scales with input channels * output channels = 3 * 10 = 30

What are the types of regularization

Reducing the size of the network, weight regularization (more commonly used in small networks), dropout (more commonly used on large, complex models)

Suppose f and g are functions over the integers. Which of the following expression correctly computes their convolution evaluated at an integer i

Sum over a of f(i-a)g(a), the definition of convolution
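
The sum can be evaluated directly when g is non-zero on only finitely many integers. This is a small sketch under that assumption; the name `convolve_at` and the example functions `f` and `g` are made up for illustration.

```python
def convolve_at(f, g, i, support):
    """(f * g)(i) = sum over a of f(i - a) * g(a), with a ranging over `support`.

    `support` must contain every integer a where g(a) is non-zero.
    """
    return sum(f(i - a) * g(a) for a in support)


def f(n):
    return 1 if 0 <= n <= 2 else 0  # box of width 3


def g(n):
    return n if 0 <= n <= 3 else 0  # ramp 0, 1, 2, 3


print([convolve_at(f, g, i, range(0, 4)) for i in range(6)])  # [0, 1, 3, 6, 5, 3]
```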

What is k-fold validation?

Technique for evaluating predictive models. The dataset is divided into k subsets (folds). The model is then trained and evaluated k times, using a different fold as the validation set each time. Performance metrics from each fold are averaged for the final estimate of the model's generalization performance. It is for situations with relatively little data, e.g. the Boston housing dataset
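
The procedure can be sketched as follows. Both function names are hypothetical, and `train_and_evaluate` stands in for whatever code fits a fresh model on the training indices and returns its validation score; the sketch assumes the data has already been shuffled if shuffling is appropriate.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover samples.
        stop = start + fold_size if fold < k - 1 else n_samples
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx


def k_fold_score(n_samples, k, train_and_evaluate):
    """Average the validation score over all k folds."""
    scores = [train_and_evaluate(train_idx, val_idx)
              for train_idx, val_idx in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


def dummy_train_and_evaluate(train_idx, val_idx):
    # Placeholder: a real version would fit a model on train_idx and score it on val_idx.
    return len(val_idx) / (len(train_idx) + len(val_idx))


print(k_fold_score(10, 5, dummy_train_and_evaluate))
```

Every sample is used for validation exactly once and for training in the other k - 1 runs.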

consider the manifold of digit images that lies within the space of all 28x28 pixel images with pixel values in the integer 0 through 255. What does the manifold hypothesis say in this particular case.

The dimension of the manifold of digit images will be much smaller compared to 784

in typical CNN architectures what tends to happen to spatial dimensions as we move from earlier layers of processing to later ones.

They tend to decrease: as we move from input to output, the spatial dimensions decrease while the number of channels increases.

what is the sequence modeling design criteria

To model sequences we need to 1. handle variable length sequences 2. track long-term dependencies 3. maintain information about order 4. share parameters across the sequence

evaluating an ML model always boils down to splitting the data into what three sets

Training, validation, and testing

If you're able to get training started, but training loss doesn't go down. What do you do

Try changing gradient descent parameters

If LeNet was to be redesigned using modern knowledge not available in the 1980s, what is one change you would expect to see.

Use ReLU instead of sigmoid activation

Important considerations in manual feature extraction

Viewpoint variation, scale variation, illumination conditions, deformation, background clutter, occlusion, intra-class variation

what is the difference in the loss metric when regularization is used vs. when it's not

When regularization is used, loss consists of normal prediction losses as well as losses from the regularization of layers, whereas when regularization isn't used loss is simply the prediction loss
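
As a small numeric illustration, the sketch below adds a weight penalty to a prediction loss. It assumes an L2 penalty (lambda times the sum of squared weights), which is one common choice and is not specified by the card; the function names are hypothetical.

```python
def l2_penalty(weights, lam):
    """Regularization loss: lambda * sum of squared weights (L2 is an assumed choice)."""
    return lam * sum(w * w for w in weights)


def total_loss(prediction_loss, weights, lam=0.0):
    """Training loss = prediction loss + regularization loss (zero when lam is 0)."""
    return prediction_loss + l2_penalty(weights, lam)


weights = [0.5, -1.0, 2.0]
print(total_loss(0.30, weights))            # no regularization: just the prediction loss
print(total_loss(0.30, weights, lam=0.01))  # prediction loss plus 0.01 * 5.25
```

Increasing `lam` makes large weights more costly, which is why a larger lambda drives the weights closer to 0.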

When is it a good idea to use k-fold cross validation

When we don't have enough training examples to create a single large validation set

on a loss value graph of the validation curve and training curve over training time, where is there a robust fit/where does under fitting stop and overfitting begin

Where the validation curve just begins to separate from the training curve

What kernel can detect vertical edges in a grayscale image

[2,-2], want to detect changes from white to black. We need to detect white slice and then black slice. So one positive number and one negative number to maximize contrast between black and white.

what is a vanilla neural network

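a standard feed-forward network that maps one input to one output (the "one to one" case in sequence modeling), as opposed to a recurrent network that processes a sequence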

timeseries definition

any data obtained via measurements at regular intervals: daily price of stock, hourly electricity consumption of a city, weekly sales of a store

How to apply recurrent dropout to fight overfitting in recurrent neural networks

Applying normal dropout before a recurrent layer hinders learning rather than helping with regularization. Instead, the same dropout mask should be applied at every timestep, rather than a mask that varies from timestep to timestep, and it should also be applied to the inner recurrent activations of the layer

what is the vanishing gradient problem

As you keep adding layers (or timesteps) to a network, the gradients shrink as they are propagated backward, and the network eventually becomes untrainable. LSTM and GRU layers are designed to address this issue

What do regression problems predict

continuous values (temperature tomorrow given meteorological data)

complete the analogy, Dense layer : multilayer perceptrons :: convolutional layer :

convolutional neural networks

What is anomaly detection (timeseries)

detect anything unusual happening in a continuous stream: unusual activity on your corporate email network --> attacker

What is dropout regularization

dropout, applied to a layer, consists of randomly setting a number of output features of the layer to 0 during training. The layer output will then have a few 0s distributed at random. The dropout rate (fraction of the features set to 0) is usually between 0.2 and 0.5.
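
A minimal sketch of dropout on a list of output features. The function name is hypothetical. One detail here is an assumption not stated in the card: the surviving features are scaled up by 1 / (1 - rate) during training (often called inverted dropout) so that nothing needs to change at inference time.

```python
import random


def dropout(features, rate, rng, training=True):
    """Zero each feature with probability `rate` during training only."""
    if not training or rate == 0:
        return list(features)
    keep_scale = 1.0 / (1.0 - rate)  # assumed "inverted dropout" rescaling
    return [0.0 if rng.random() < rate else x * keep_scale for x in features]


rng = random.Random(0)  # seeded so the example is repeatable
layer_output = [1.0] * 10
print(dropout(layer_output, rate=0.5, rng=rng))                  # mix of 0.0 and 2.0
print(dropout(layer_output, rate=0.5, rng=rng, training=False))  # unchanged
```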

true or false: best practice for tuning hyperparameters is to do it once at the beginning of the project and leave them fixed, since changes to the model do not affect the optimal choices.

false

What is FCN

fully convolutional network

What is the main goal of machine learning

generalization

what is the difference between hidden layers and hidden states

hidden layers are layers that are hidden from view on the path from input to output. hidden states are inputs to whatever a given step is doing and can only be computed by looking at data at previous time steps.

generalization

how well the model performs on unseen data

what is event detection (timeseries)

identify the occurrence of a specific expected event within a continuous data stream: useful in hotword detection where a model monitors an audio stream and detects words like "ok google", "alexa", and "hey siri"

translation invariance

in the earliest layers our network should respond similarly to the same patch regardless of where it appears in the image

if the model fits the training data but you cannot get it to overfit (validation loss never starts to rise)

increase model capacity

overfitting is particularly likely to occur when your data...

is noisy, involves uncertainty, includes rare features

Cross correlation of a single channel image X with a kernel K is equivalent to the convolution of X with which of the following kernels

Kernel K flipped both horizontally and vertically. Cross-correlation with K is equivalent to convolution with the kernel flipped along both axes

What are the types of weight regularization and when do you use them

kernel_regularizer is used for regularizing model weights/parameters, bias_regularizer is used for regularizing biases, activity_regularizer is used for regularizing the output of the layer

If the model meaningfully generalizes, but you can't get a better accuracy than the set common baseline

leverage better architecture priors

Alice generates output Y by computing corr2D(X,K) for an image X and kernel K. She gives the image X and output Y to Bob but doesn't tell him what K she used. What function should Bob run gradient descent on to approximately recover K? With respect to which variable should gradients be taken during gradient descent

loss(corr2d(X, K), Y), with gradients taken with respect to K
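
Bob's procedure can be sketched in plain Python. This is an illustration under several assumptions that the card does not state: Bob knows the kernel's shape, the loss is mean squared error, and the learning rate and step count were picked by hand for this toy image. The name `recover_kernel` is hypothetical.

```python
def corr2d(X, K):
    """Cross-correlation with no padding and stride 1."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]


def recover_kernel(X, Y, kernel_shape, lr=0.5, steps=200):
    """Gradient descent on mean((corr2d(X, K) - Y)^2) with respect to K."""
    h, w = kernel_shape
    K = [[0.0] * w for _ in range(h)]
    n = len(Y) * len(Y[0])
    for _ in range(steps):
        Y_hat = corr2d(X, K)
        # d loss / d K[a][b] = mean over (i, j) of 2 * error[i][j] * X[i + a][j + b]
        grad = [[sum(2 * (Y_hat[i][j] - Y[i][j]) * X[i + a][j + b]
                     for i in range(len(Y)) for j in range(len(Y[0]))) / n
                 for b in range(w)]
                for a in range(h)]
        K = [[K[a][b] - lr * grad[a][b] for b in range(w)] for a in range(h)]
    return K


# Alice's side: an image with two vertical edges and a secret kernel.
X = [[1, 1, 0, 0, 0, 0, 1, 1] for _ in range(4)]
K_true = [[1, -1]]
Y = corr2d(X, K_true)

# Bob's side: only X and Y are available.
K_est = recover_kernel(X, Y, kernel_shape=(1, 2))
print(K_est)  # approximately [[1.0, -1.0]]
```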

If training gets stuck what should you try

lowering or increasing the learning rate, increasing batch size

What is an LSTM (long short-term memory) network

LSTM cells are gated; the gates control information flow: forget, store, update, output. LSTM cells are able to track information throughout many steps

what does batch normalization do

normalizes each feature over the current batch to zero mean and unit standard deviation (followed by a learned scale and shift); can be used before or after the activation

what does it mean for data to be noisy

noisy data is meaningless data. Any data that is corrupt/can't be interpreted by machines

what is pooling in the context of CNNs

pooling downsamples the input feature map while preserving spatial invariance

What is forecasting (timeseries)

predicting what will happen next in a series: forecast revenue a few months in advance in order to plan your budget

optimization

process of adjusting a model to get the best performance possible on the training data

what does a validation set do in training a model

provides an evaluation of the model on data it was not trained on, which is used to tune hyperparameters during development

What is the process of fighting overfitting when getting more data isn't possible

regularization: control the model's complexity by adding constraints on the smoothness of the model curve

What does setting stride = (2,2) do when creating a Conv2D layer in Keras?

spatial dimensions of output will be smaller than those of input, stride= (2,2) means we jump two steps at a time. This will have nothing to do with channels since this is stride.

What is one way to reduce overfitting

stopping the training process early

What are the two essential RNN layers available in Keras

the LSTM layer and the GRU layer

What is concatenation in the context of machine learning

the appending of vectors or matrices to form a new vector or matrix

When applying k-fold validation what is the final score

the average of the final validation scores for each fold

locality principle

the earliest layers of the network should focus on local regions without regard for the contents of the image in distant regions

which piece of an RNN captures historical information

the hidden state

What effect does a pooling layer have on the number of channels

the number of output channels is the same as the number of input channels

how does pooling work

the pooling window starts from the upper left of the input and slides across it from left to right and top to bottom. Then, at each location that the window hits, it computes either the maximum or average value of the subtensor
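
The sliding pooling window can be sketched as below. This is a minimal illustration that moves the window one step at a time (stride 1) with no padding; Keras pooling layers default to a stride equal to the pool size instead. The function name `pool2d` and the `mode` argument are illustrative.

```python
def pool2d(X, pool_h, pool_w, mode="max"):
    """Slide a pool_h x pool_w window over X with stride 1 and no padding."""
    out_h, out_w = len(X) - pool_h + 1, len(X[0]) - pool_w + 1
    Y = []
    for i in range(out_h):          # top to bottom
        row = []
        for j in range(out_w):      # left to right
            window = [X[i + a][j + b]
                      for a in range(pool_h) for b in range(pool_w)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        Y.append(row)
    return Y


X = [[0, 1, 2],
     [3, 4, 5],
     [6, 7, 8]]
print(pool2d(X, 2, 2, "max"))  # [[4, 5], [7, 8]]
print(pool2d(X, 2, 2, "avg"))  # [[2.0, 3.0], [5.0, 6.0]]
```

The function has no weights to learn, which is why a pooling layer contributes zero trainable parameters.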

what is the fundamental issue w machine learning

the tension between optimization and generalization (essentially the issue of overfitting)

What are the parameters of an RNN

the weights and the bias of the hidden layer and the weights and bias of the output layer. Even at different time steps, RNNs always use these model parameters, so the cost of an RNN doesn't grow as the number of timesteps increases
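
A small sketch of the hidden-layer recurrence shows why the parameter count does not depend on sequence length: the same weights are reused at every timestep. The function name, the scalar inputs, and the tanh activation are simplifying assumptions for illustration, and the output layer is omitted.

```python
import math


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence of scalar inputs.

    h_t = tanh(W_xh * x_t + W_hh @ h_{t-1} + b_h), starting from h_0 = 0.
    The same W_xh, W_hh and b_h are used at every timestep.
    """
    hidden_size = len(b_h)
    h = [0.0] * hidden_size
    for x in inputs:
        h = [math.tanh(W_xh[i] * x
                       + sum(W_hh[i][j] * h[j] for j in range(hidden_size))
                       + b_h[i])
             for i in range(hidden_size)]
    return h  # final hidden state: a summary of the whole sequence


W_xh = [0.5, -0.3]
W_hh = [[0.1, 0.2],
        [0.0, 0.4]]
b_h = [0.0, 0.1]

print(rnn_forward([1.0, 2.0, 3.0], W_xh, W_hh, b_h))  # 3 timesteps
print(rnn_forward([1.0] * 10, W_xh, W_hh, b_h))       # 10 timesteps, same parameters
```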

What is perplexity used for

to evaluate the quality of language models.

How do you use training set, validation set, and testing set to evaluate a model

train the model on the training data, evaluate the loss of the model on the validation data, adjust based on the evaluation, and then test the model once more on the testing data which aims to be as close to the production data as possible

What happens as training epochs proceed, if you choose a learning rate of zero

training loss will not go down

true or false: maxpooling, combined with a stride larger than 1 can be used to reduce the spatial dimensions like width and height

true

True or false: a model should always be able to overfit

true, if you're not able to overfit it's an issue with the representational power of the model: need bigger model w/ more capacity by increasing number of layers, using bigger layers, using the appropriate kind of layer for the objective

what is a recurrent neural network

A type of neural network for sequential or time series data; a neural network with hidden states

When should you use RNN as opposed to dense or CNN

When the task involves data that comes in sequences, such as a sentence (temporal information). CNNs and dense networks have no memory. RNNs keep a memory by looping internally, processing each element relative to what they have seen so far

Why is interpolation a good source of generalization in ML models

with interpolated data points you can start making sense of points you've never seen before. You can relate new points to other points that lie close on the manifold

