Deep Learning Exam 1
Federated coefficient = sum(coef * n/n_total)
Hosp 1 (n=122): 23.1 - 37.9age + 165.7male + 60.3dose
Hosp 2 (n=236): 1698.4 - 22.8age - 37.3male + 24.7dose
Federated coefficients (n_total = 358):
intercept = 23.1*(122/358) + 1698.4*(236/358) = 1127.5
age = -37.9*(122/358) - 22.8*(236/358) = -27.9
male = 165.7*(122/358) - 37.3*(236/358) = 31.9
dose = 60.3*(122/358) + 24.7*(236/358) = 36.8
Federated model = 1127.5 - 27.9age + 31.9male + 36.8dose
Prediction for the female patient (male = 0) with age 60 and dose_gy 75:
1127.5 - 27.9*60 + 31.9*0 + 36.8*75 = 2213.5
Build a federated model and make a prediction on length of survival for the following patient: Female patient with age of 60 and dose_gy of 75.
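A minimal numpy sketch of the weighted averaging and the requested prediction, assuming the hospital sample sizes of 122 and 236 used in the calculation above:

```python
import numpy as np

# Per-hospital regression coefficients: [intercept, age, male, dose_gy]
hosp1 = np.array([23.1, -37.9, 165.7, 60.3])    # n1 = 122 patients
hosp2 = np.array([1698.4, -22.8, -37.3, 24.7])  # n2 = 236 patients
n1, n2 = 122, 236

# Federated coefficient = sample-size-weighted average of the local coefficients
fed = ((hosp1 * n1 + hosp2 * n2) / (n1 + n2)).round(1)
print(fed)               # [1127.5  -27.9   31.9   36.8]

# Prediction for a 60-year-old female (male = 0) receiving 75 Gy
x = np.array([1, 60, 0, 75])
print(fed @ x)           # about 2213.5 (predicted length of survival)
```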
p = last-layer neuron output / sum of last-layer neuron outputs
cat: p1 = .07/(.07 + 1.28 + 5.25) = .011
dog: p2 = 1.28/(.07 + 1.28 + 5.25) = .194
deer: p3 = 5.25/(.07 + 1.28 + 5.25) = .795
Calculate final probability values (p1, p2, and p3) for the following network.
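A quick numpy check of the normalization above (this is the simple output/sum rule defined on the card, not a softmax):

```python
import numpy as np

# Raw outputs of the last-layer neurons for cat, dog, deer (values from the card)
outputs = np.array([0.07, 1.28, 5.25])

# p_i = output_i / sum of all last-layer outputs
p = outputs / outputs.sum()
print(p.round(3))   # [0.011 0.194 0.795]
```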
Cons:
-oftentimes, limited computational power on the user side limits the size of the model
-sensitive info can still be revealed to a third party or the central server during federation
-many algorithmic and technical challenges in addressing the heterogeneity of the incoming models and end nodes
-nodes can send different numbers of updates, and send them at different times
Cons of federated learning
MSE evaluates how well the predictions for a continuous target variable match the true data (ex: if the target variable is quality). Accuracy evaluates how well the model classifies the data (ex: if the classification is cat/dog).
Deep learning model validation- MSE vs Accuracy
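A short numpy sketch of the two metrics; the prediction values are made up purely for illustration:

```python
import numpy as np

# MSE for a continuous target (e.g., a wine quality score)
y_true = np.array([5.0, 6.0, 7.0])
y_pred = np.array([5.5, 6.0, 6.0])
print(np.mean((y_true - y_pred) ** 2))   # about 0.417 -> lower is better

# Accuracy for a classification target (e.g., cat vs dog)
labels = np.array(["cat", "dog", "cat", "dog"])
preds  = np.array(["cat", "dog", "dog", "dog"])
print((labels == preds).mean())          # 0.75 -> fraction classified correctly
```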
NOTE: This is an ACCURACY vs epoch graph.
Line 1: Training accuracy (keeps increasing at high epochs)
Line 2: Validation accuracy (decreases at high epochs)
The overfitting line goes past the intersection of the two lines: once training accuracy is higher than validation accuracy, overfitting begins.
Draw a vertical line to mark the initial epoch where the overfitting starts to occur.
NOTE: This is a LOSS vs epoch graph.
Line 1: Validation loss (increases at high epochs)
Line 2: Training loss (decreases at high epochs)
The overfitting line goes past the intersection of the two lines: once validation loss is higher than training loss, overfitting begins.
Draw a vertical line to mark the initial epoch where the overfitting starts to occur.
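A matplotlib sketch of the rule above, using synthetic curves (the shapes are illustrative only, not real training output): find the first epoch where validation loss exceeds training loss and draw the vertical line there.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical loss curves: training loss keeps falling, validation loss turns upward
epochs = np.arange(1, 51)
train_loss = 2.0 / epochs
val_loss = 0.2 + 0.01 * epochs

# Overfitting starts at the first epoch where validation loss is higher than training loss
overfit_epoch = epochs[np.argmax(val_loss > train_loss)]

plt.plot(epochs, train_loss, label="training loss")
plt.plot(epochs, val_loss, label="validation loss")
plt.axvline(overfit_epoch, linestyle="--", color="k",
            label=f"overfitting starts (epoch {overfit_epoch})")
plt.xlabel("epoch"); plt.ylabel("loss"); plt.legend(); plt.show()
```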
Piecewise regression does not have an on/off switch the way a neural activation function does. Activation functions can be combined and fitted to non-linear patterns, like a spiral, which is why deep learning works well for complex data. With piecewise functions the pieces cannot harmonize: only one function is active at a time, so the model cannot trace a non-linear boundary like the spiral.
Explain why piecewise regression is not able to perform the classification task shown below:
Total loss = (1 - probability of the correct label) + sum of the incorrect probabilities = sum of the per-label losses.
Loss is the difference between the true value and the probability; the correct label has a true value of 1 and the other labels have a true value of 0.
cat: loss1 = .011 - 0 = .011
dog: loss2 = .194 - 0 = .194
deer: loss3 = 1 - .795 = .205
total loss = .011 + .194 + .205 = .410
Given that the correct label is deer, calculate loss values for the following network.
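The same calculation in numpy, treating each per-label loss as the absolute difference between the true value and the predicted probability, as defined above:

```python
import numpy as np

p = np.array([0.011, 0.194, 0.795])   # probabilities for cat, dog, deer (previous card)
truth = np.array([0, 0, 1])           # correct label is deer

losses = np.abs(truth - p)            # per-label loss = |true value - probability|
print(losses.round(3))                # [0.011 0.194 0.205]
print(round(losses.sum(), 3))         # 0.41 (the .410 total above)
```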
16 weights + 1 bias = 17 betas
How many beta(s) does the boxed in neuron have?
4 neurons in previous layer plus intercept (beta0) = 5 betas
How many betas does the boxed in neuron have?
Global features: more layers are needed (a high-order polynomial is needed to capture them). Example: face shape
Layers and global features
Local features: few layers are needed (a low-order polynomial is enough to capture them). Example: eyebrows
Layers and local features
MSE: truth in the training data set on which your model is built (training accuracy/loss)
PMSE: truth in real life, i.e. in new incoming data (validation accuracy/loss)
The best model has the smallest PMSE.
PMSE vs MSE
Pros:
-improvement in model performance when training data quantity is limited
-parallelization of computing power
-complete decentralization
-the centralized server can continuously improve various models (like voice or face recognition) without transferring data
Pros of federated learning
Hyperparameters 4-9 affect the flexibility of the model. Increasing the number of layers and the number of neurons increases the number of parameters, and more parameters make the model more flexible.
The following table has 10 hyperparameters you can adjust. List out the hyperparameters affecting the flexibility of your model and explain how they affect the flexibility.
N = 2000, K = 5, so 2000/5 = 400 images in each group.
1 group is for testing and 4 groups are for training: 400 testing images and 1600 training images.
TRUE
There are 1000 dog images and 1000 cat images in your data set. TRUE/FALSE: Model training using the K-fold validation method with K=5 should result in your model trained on 1600 images
N=2000 K=1000 Leave-one-out validation has N folds, so 2000 folds. K=1000 is not the same as K=2000. FALSE
There are 1000 dog images and 1000 cat images in your data set. TRUE/FALSE: Results from the K fold validation with K=1000 should be equivalent to ones from the leave-one-out validation
60/20/20 is train/test/validate. 60% of 1000 cat images = 1000 x 0.6 = 600 training cat images.
There are 1000 images of dog and 1000 images of cat in your dataset. Given that you are using 60/20/20 for the splitting, on how many cat images will your model be trained?
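A scikit-learn sketch of the 60/20/20 split; the stratify argument (my assumption about how the split is done) keeps the cat/dog balance so exactly 600 cat images land in the training set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(2000)                                   # stand-in for 2000 image indices
labels = np.array(["cat"] * 1000 + ["dog"] * 1000)

# 60% train, then split the remaining 40% in half for the two 20% sets
X_train, X_rest, y_train, y_rest = train_test_split(
    X, labels, test_size=0.4, stratify=labels, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print((y_train == "cat").sum())   # 600 cat images in the training set
```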
N = 2400, K = 5, so 2400/5 = 480 images in each group.
1 group is for testing and 4 groups are for training: 480 testing images and 1920 training images, not 1800.
FALSE
There are 1200 dog images and 1200 cat images in your data set. TRUE/FALSE: Model training using the K-fold validation method with K=5 should result in your model trained on 1800 images
N=2400 K=1200 Leave-one-out validation has N folds, so 2400 folds. K=1200 is not the same as K=2400. FALSE
There are 1200 dog images and 1200 cat images in your data set. TRUE/FALSE: Results from the K fold validation with K=1200 should be equivalent to ones from the leave-one-out validation
Compared to polynomial regression, piecewise regression is more flexible, easier to calculate, and easier to interpret.
What are the advantages of piecewise regression as compared to polynomial regression?
Pros:
-learns faster
Cons:
-may not get the best parameters
-may not converge properly, or at all, because drastic updates lead to divergent behavior
What are the pros and cons of using a large learning rate?
Pros:
-will get good, optimal parameters
Cons:
-will take longer to train the model
-training may stall and the model may not learn
What are the pros and cons of using a small learning rate?
You split the data into K groups, keep 1 group for testing, and use the others for training. The number of images in each group is the total number of images divided by the number of folds (N/K).
Example: For K=5, 1 set is for testing and 4 sets are for training. If there were 1000 total images (N=1000), there would be 200 images in each set (1000/5): 1 set (200 images) for testing and the other 4 sets (800 images) for training.
What is K-fold validation?
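A scikit-learn sketch of the K=5 example above; the array is just a stand-in for 1000 image indices:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1000)                            # N = 1000 images
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, test_idx in kf.split(X):
    print(len(train_idx), len(test_idx))       # 800 training, 200 testing in every fold
```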
It is difficult to determine the number of splines
What is a con of piecewise regression?
The number of sets equals the total number of images in the dataset (K = N). One set (a single observation) is kept for testing and the others are used for training.
Example: For a data set with 1000 total images (N=1000), the number of sets is K=1000: 1 observation for testing and 999 for training.
What is leave-one-out validation?
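The same idea with scikit-learn's LeaveOneOut, again using an index array as a stand-in for the 1000 images:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(1000).reshape(-1, 1)      # N = 1000 images
loo = LeaveOneOut()
print(loo.get_n_splits(X))              # 1000 folds, i.e. K = N

train_idx, test_idx = next(iter(loo.split(X)))
print(len(train_idx), len(test_idx))    # 999 for training, 1 for testing
```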
The key component is masking: it lets you see which variable is the most impactful by hiding ("masking") one variable at a time. You can then see how the impact of the other variables changes when a specific variable is masked.
What is the key component of TabNet which improves its explainability and how is the improvement achieved?
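A minimal sketch of the masking idea described above: hide one variable at a time and measure how much the predictions change. The linear scorer is a hypothetical stand-in; TabNet itself learns sparse feature masks during training rather than applying them by hand like this.

```python
import numpy as np

def scorer(X):
    # Hypothetical stand-in model over 3 tabular features
    return X @ np.array([0.5, 2.0, 0.1])

X = np.random.default_rng(0).normal(size=(100, 3))
baseline = scorer(X)

# Mask ("hide") one variable at a time and compare against the unmasked predictions
for j in range(X.shape[1]):
    X_masked = X.copy()
    X_masked[:, j] = 0.0
    impact = np.mean(np.abs(scorer(X_masked) - baseline))
    print(f"feature {j}: mean prediction change = {impact:.3f}")   # feature 1 is most impactful
```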
-Using what was learned from a particular task to solve a different task
-Example: taking a model that classifies cats and dogs and retraining it to classify raccoons and deer
There are multiple ways:
-use the model as-is and train it further with more data
-use the model as part of a new network and then train it further with more data
What is transfer learning?
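A hedged Keras sketch of the second approach (use the model as part of a new network, then train it further); MobileNetV2, the 160x160 input size, and the two-class head are arbitrary choices for illustration:

```python
import tensorflow as tf

# Pretrained base network; freeze its weights so only the new head trains at first
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False

# New classification head for the new task (e.g., raccoon vs deer)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(new_images, new_labels, epochs=5)   # then train further with the new data
```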
1st layer: 64 biases (one beta0 per neuron)
2nd layer: 32 biases
3rd layer: 100 biases
Total: 196 biases
You are building an ANN model for wine quality prediction. How many biases does your model have?
1st layer: 7 input variables * 64 neurons = 448
2nd layer: 64 neurons * 32 neurons = 2048
3rd layer: 32 neurons * 100 neurons = 3200
Biases: 196
Total: 448 + 2048 + 3200 + 196 = 5892 betas
You are building an ANN model for wine quality prediction. How many parameters does your model have?
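A Keras sketch that reproduces the count, assuming the 7-input, 64/32/100-neuron architecture described above:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(7,)),    # 7 input variables
    tf.keras.layers.Dense(64),     # 7*64 weights   + 64 biases  = 512
    tf.keras.layers.Dense(32),     # 64*32 weights  + 32 biases  = 2080
    tf.keras.layers.Dense(100),    # 32*100 weights + 100 biases = 3300
])
model.summary()                    # Total params: 5,892 (196 of them are biases)
```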
Each neuron has a single bias (beta0).
1st layer: 16 neurons = 16 biases
2nd layer: 8 neurons = 8 biases
3rd layer: 2 neurons = 2 biases
Total biases: 26
You are building an ANN model which takes 200-by-200 RGB images. The following is the overview of your ANN model structure. How many biases does your model have?
Parameters = biases + (input size)*(1st layer neurons) + (1st layer neurons)*(2nd layer neurons) + ... + (n-1 layer neurons)*(n layer neurons)
Input size = 200*200*3 (RGB) = 120000
Biases: 26
1st layer: 120000*16 = 1920000
2nd layer: 16*8 = 128
3rd layer: 8*2 = 16
Total: 26 + 1920000 + 128 + 16 = 1920170 betas
You are building an ANN model which takes 200-by-200 RGB images. The following is the overview of your ANN model structure. How many parameters does your model have?
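A Keras sketch of the same count for the image model, assuming the 200x200x3 input is flattened before the 16/8/2 dense layers:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(200, 200, 3)),
    tf.keras.layers.Flatten(),     # 200*200*3 = 120,000 inputs
    tf.keras.layers.Dense(16),     # 120,000*16 + 16 = 1,920,016
    tf.keras.layers.Dense(8),      # 16*8 + 8        = 136
    tf.keras.layers.Dense(2),      # 8*2 + 2         = 18
])
model.summary()                    # Total params: 1,920,170 (26 of them are biases)
```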
Input size = 300*300*3 (RGB) = 270000
Biases = 20 + 12 + 3 = 35
1st layer: 270000*20 = 5400000
2nd layer: 20*12 = 240
3rd layer: 12*3 = 36
Total parameters: 35 + 5400000 + 240 + 36 = 5400311 betas
You are building an ANN model which takes 300-by-300 RGB images. The following is the overview of your ANN model structure. How many parameters does your model have?
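A plain-Python helper (hypothetical, just for checking the arithmetic) that counts weights and biases for any stack of fully connected layers:

```python
def count_dense_params(input_size, layer_widths):
    """Count weights + biases for a stack of fully connected layers."""
    params, prev = 0, input_size
    for width in layer_widths:
        params += prev * width + width   # weights into the layer + one bias per neuron
        prev = width
    return params

print(count_dense_params(300 * 300 * 3, [20, 12, 3]))   # 5400311
print(count_dense_params(200 * 200 * 3, [16, 8, 2]))    # 1920170
```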