Model Evaluation, Metrics (Week 8)
sum of the weights
Penalizing the sum of the (absolute) weights penalizes small values more than penalizing the squared weights does
What if classifier has "non‐tunable" parameters?
A parameter is "non‐tunable" if tuning (or training) it on the training data leads to overfitting
Recall μ (micro-averaged)
Effectiveness of a classifier at identifying class labels, calculated from sums of per-text decisions
sum of the squared weights
Penalizing the squared weights penalizes large values more
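The two penalty cards (sum of the weights, sum of the squared weights) can be compared numerically. This is a minimal sketch with made-up weight vectors; the function names are illustrative and the sum of weights is taken over absolute values, which is an assumption about what the card intends.

```python
def l1_penalty(weights):
    """Sum of the absolute weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of the squared weights."""
    return sum(w * w for w in weights)

# Hypothetical weight vectors, chosen only for illustration.
small = [0.1, 0.1]
large = [3.0, 3.0]

# For weights below 1 in magnitude the absolute penalty is the bigger one;
# for weights above 1 the squared penalty dominates.
print(l1_penalty(small), l2_penalty(small))
print(l1_penalty(large), l2_penalty(large))
```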
Basically there are two methods to overcome overfitting ____
1. Reduce the model complexity (e.g. PCA) 2. Regularization
Regularizer
- A regularizer is an additional criterion added to the loss function to avoid overfitting
Root Mean Squared Error (RMSE)
- RMSE is the most popular evaluation metric used in regression problems. It assumes that errors are unbiased and follow a normal distribution. - RMSE is highly affected by outlier values. Hence, make sure you've removed outliers from your data set prior to using this metric. - Compared to mean absolute error, RMSE gives more weight to large errors and so punishes them more heavily.
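The outlier sensitivity described on this card can be illustrated with a small sketch. The data are made up and the helper names `mae` and `rmse` are illustrative, not taken from any particular library.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))

# Hypothetical data: every prediction in `clean` is off by exactly 1.
actual = [10.0, 12.0, 14.0, 16.0]
clean = [11.0, 13.0, 15.0, 17.0]
# Same predictions, except the last one is off by 9 (one outlier error).
with_outlier = [11.0, 13.0, 15.0, 25.0]

print(mae(actual, clean), rmse(actual, clean))
print(mae(actual, with_outlier), rmse(actual, with_outlier))
```

With equal-sized errors the two metrics agree; with one large error RMSE rises further than MAE, because the error is squared before averaging.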
Bias and Variance Tradeoff
- Complex models (many parameters) usually have lower bias, but higher variance. - Simple models (few parameters) have higher bias, but lower variance.
When do we use regularization?
- When the predictors show multicollinearity, so check for it first
Gain and Lift charts
- Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model. - The greater the area between the lift curve and the baseline, the better the model
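One common way to compute lift, sketched under assumptions: cases are ranked by model score, and the response rate in the top fraction (with the model) is divided by the overall response rate (without the model). The function name `lift_at`, the scores, and the labels are all hypothetical.

```python
def lift_at(scores, labels, fraction):
    """Lift of the top `fraction` of cases, ranked by model score.

    Lift = response rate among the selected cases / overall response rate.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_selected = max(1, int(len(ranked) * fraction))
    top_labels = [label for _, label in ranked[:n_selected]]
    rate_with_model = sum(top_labels) / n_selected
    rate_without_model = sum(labels) / len(labels)
    return rate_with_model / rate_without_model

# Hypothetical scores and 0/1 responses for ten cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(lift_at(scores, labels, 0.2))  # top 20% of cases
print(lift_at(scores, labels, 1.0))  # selecting everyone gives the baseline, 1.0
```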
A binary classification problem is really a trade-off between sensitivity and specificity.
- Sensitivity is the true positive rate, also called recall. It is the proportion of instances from the positive (first) class that were actually predicted correctly. - Specificity is also called the true negative rate. It is the proportion of instances from the negative (second) class that were actually predicted correctly.
Shrinkage
- This shrinkage (also known as regularization) has the effect of reducing variance. - Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. - This approach fits a model involving all p predictors; however, the estimated coefficients are shrunken towards zero relative to the least squares estimates.
K-FOLD CROSS VALIDATION
1. Split the sample into k subsets of equal size 2. For each fold, estimate a model on all the subsets except one 3. Use the left out subset to test the model, by calculating a CV metric of choice 4. Average the CV metric across subsets to get the CV error
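The four steps can be sketched in plain Python. To keep the example self-contained, the "model" is assumed to be a simple mean predictor and the CV metric is assumed to be MSE; both are stand-ins, and `k_fold_cv` is an illustrative name. In practice the sample is usually shuffled before splitting, which this sketch skips.

```python
def k_fold_cv(y, k):
    """Average held-out MSE of a mean-only model over k equal-sized folds."""
    n = len(y)
    fold_size = n // k  # step 1: assumes n is divisible by k
    fold_errors = []
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = y[start:stop]
        train = y[:start] + y[stop:]
        # Step 2: "estimate a model" on all subsets except the held-out one.
        prediction = sum(train) / len(train)
        # Step 3: score the held-out subset with the chosen metric (MSE here).
        mse = sum((value - prediction) ** 2 for value in test) / len(test)
        fold_errors.append(mse)
    # Step 4: average the metric across the k subsets.
    return sum(fold_errors) / k

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical responses
print(k_fold_cv(y, k=3))  # 6.25
```

Setting k equal to the sample size turns the same loop into leave-one-out cross-validation.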
Regression Metrics
1. Mean Absolute Error. 2. Mean Squared Error. 3. R^2
Classification Metrics
1. Classification Accuracy. 2. Logarithmic Loss. 3. Area Under ROC Curve. 4. Confusion Matrix. 5. Classification Report.
Confusion Matrix
A confusion matrix is an N x N matrix, where N is the number of classes being predicted. - Accuracy: the proportion of the total number of predictions that were correct. - Positive Predictive Value or Precision: the proportion of predicted positive cases that are actually positive. - Negative Predictive Value: the proportion of predicted negative cases that are actually negative. - Sensitivity or Recall: the proportion of actual positive cases that are correctly identified. - Specificity: the proportion of actual negative cases that are correctly identified.
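For the binary case (N = 2), the five quantities on this card can be computed from the four cells of the matrix. A minimal sketch with made-up labels, where 1 is assumed to be the positive class and the function names are illustrative:

```python
def binary_confusion(actual, predicted):
    """Count TP, FP, TN, FN for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn

def summarize(tp, fp, tn, fn):
    """Derive the card's metrics from the four confusion-matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),    # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "sensitivity": tp / (tp + fn),  # recall, true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical labels: 4 actual positives, 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

cells = binary_confusion(actual, predicted)
print(cells)  # (3, 1, 5, 1)
print(summarize(*cells))
```

Precision and NPV divide by the number of predicted cases in a class, while sensitivity and specificity divide by the number of actual cases.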
What if adding predictors is not actually improving the model's fit?
Use adjusted R-squared (the proportion of total variance explained by the model, adjusted for the number of predictors)
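Adjusted R-squared is usually computed from plain R^2, the sample size n, and the number of predictors p as 1 - (1 - R^2)(n - 1)/(n - p - 1). A short sketch with hypothetical numbers, showing that a tiny gain in R^2 from many extra predictors can lower the adjusted value:

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical fits on the same 50 observations.
few = adjusted_r2(0.800, n_samples=50, n_predictors=5)
many = adjusted_r2(0.805, n_samples=50, n_predictors=15)

print(few, many)  # R^2 went up slightly, adjusted R^2 went down
```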
Precision μ (micro-averaged)
Agreement of the data class labels with those of a classifier, calculated from sums of per-text decisions
Precision M (macro-averaged)
An average per-class agreement of the data class labels with those of a classifier
Recall M (macro-averaged)
An average per-class effectiveness of a classifier at identifying class labels
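The two averaging schemes on these cards (sums of per-text decisions versus a per-class average) are commonly called micro- and macro-averaging. A minimal sketch with made-up labels; the function names are illustrative, and the code assumes every class is both present and predicted at least once, so no denominator is zero.

```python
def per_class_counts(actual, predicted):
    """Return {class: (tp, fp, fn)} using one-vs-rest counts for each class."""
    pairs = list(zip(actual, predicted))
    counts = {}
    for c in sorted(set(actual) | set(predicted)):
        tp = sum(1 for a, p in pairs if a == c and p == c)
        fp = sum(1 for a, p in pairs if a != c and p == c)
        fn = sum(1 for a, p in pairs if a == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts

def micro_scores(counts):
    """Pool the counts over all classes first, then compute precision, recall."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return tp / (tp + fp), tp / (tp + fn)

def macro_scores(counts):
    """Compute precision and recall per class first, then average them."""
    precisions = [tp / (tp + fp) for tp, fp, fn in counts.values()]
    recalls = [tp / (tp + fn) for tp, fp, fn in counts.values()]
    return sum(precisions) / len(counts), sum(recalls) / len(counts)

# Hypothetical three-class problem with one dominant class.
actual = ["a"] * 6 + ["b"] * 2 + ["c"] * 2
predicted = ["a"] * 5 + ["b"] + ["b", "a"] + ["c", "c"]

counts = per_class_counts(actual, predicted)
print(micro_scores(counts))
print(macro_scores(counts))
```

Micro-averaging weights every decision equally, so large classes dominate; macro-averaging weights every class equally.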
Area Under ROC Curve
Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems.
How do metrics influence the performance of a machine learning algorithm?
Choice of metrics influences how the performance of machine learning algorithms is measured and compared
Classification Accuracy
Classification accuracy is the number of correct predictions made as a ratio of all predictions made.
Leave one out
Downside: expensive. Upside: doesn't waste data.
Test set
Downside: may give an unreliable estimate of future performance. Upside: cheap.
What is a good model percentage for Gini coefficient?
A Gini coefficient above 60% is generally taken to indicate a good model
Gini Coefficient
Gini is the ratio of the area between the ROC curve and the diagonal line to the area of the triangle above the diagonal (equivalently, Gini = 2 x AUC - 1)
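Since the triangle above the diagonal has area 0.5, the Gini coefficient works out to (AUC - 0.5) / 0.5. The sketch below computes AUC by pairwise comparison (the fraction of positive/negative pairs in which the positive case gets the higher score, ties counting half), which equals the area under the ROC curve. The scores, labels, and function names are made up for illustration.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    """Area between ROC curve and diagonal, divided by the triangle's area."""
    return (auc(scores, labels) - 0.5) / 0.5

# Hypothetical scores and 0/1 outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

print(auc(scores, labels), gini(scores, labels))
```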
Why do models lose stability?
High error on the training data (underfitting); low training error with good generalization of the relationship; zero training error (overfitting)
What is cross validation?
In cross-validation the original sample is split into two parts. One part is called the training (or derivation) sample, and the other part is called the validation (or validation + testing) sample.
Model Interpretability
Irrelevant variables lead to unnecessary complexity in the resulting model. By removing them (setting coefficient = 0) we obtain a more easily interpretable model. However, using OLS makes it very unlikely that the coefficients will be exactly zero.
What is R^2?
It is one minus the ratio of the error in a model (residual sum of squares) to the total variance in the dependent variable (the lower the error, the higher the R^2)
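R^2 is commonly computed as 1 - SS_res / SS_tot, where SS_res is the sum of squared errors of the model and SS_tot is the sum of squared deviations of the response from its mean. A minimal sketch with hypothetical values:

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical observations and predictions.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

print(r_squared(actual, predicted))
# Always predicting the mean explains none of the variance.
print(r_squared(actual, [3.0] * 5))  # 0.0
```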
What does a large coefficient signify?
It means putting a lot of emphasis on that feature, i.e. the particular feature is a good predictor for the outcome.
What portion of the sample should be in each part? (Cross validation)
Large sample (Split it 50/50), Small sample (2/3 training, 1/3 testing & validation)
Logarithmic Loss
Logarithmic loss (or logloss) is a performance metric for evaluating the predictions of probabilities of membership to a given class.
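For the binary case, log loss is the negative average log-probability assigned to the true class. The sketch below uses made-up probabilities; the clipping constant `eps` is an assumption added only to avoid taking log(0).

```python
import math

def log_loss(labels, probabilities, eps=1e-15):
    """Binary log loss; `probabilities` are predicted P(class = 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]
uninformative = [0.5, 0.5, 0.5, 0.5]
confident_wrong = [0.1, 0.9, 0.1, 0.9]

for probs in (confident_right, uninformative, confident_wrong):
    print(log_loss(labels, probs))
```

Confident correct predictions score close to zero, and confident wrong predictions are penalized heavily.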
For which types of machine learning problems are metrics demonstrated?
Metrics are demonstrated for both classification and regression type machine learning problems.
What are the consequences of overfitting?
Overfitted models will have high R^2 values, but will perform poorly in predicting out-of-sample cases or test cases
Variance has what kind of fitting ____
Overfitting
What is RMSE good for?
RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction.
Advantage Of ROC
ROC curve is almost independent of the response rate
Fscore M (macro-averaged)
Relations between data's positive labels and those given by a classifier based on a per-class average
Fscore μ (micro-averaged)
Relations between data's positive labels and those given by a classifier based on sums of per-text decisions
RIDGE REGRESSION
Ridge Regression is a technique used when the data suffers from multicollinearity (independent variables are highly correlated).
How do we determine if one model is predicting better than another model?
Take the difference between observed (y) and predicted values (f), when applying the model to unseen data
Which kind of Cross Validation?
Test set, leave one out
Mean Absolute Error
The Mean Absolute Error (or MAE) is the average of the absolute differences between predictions and actual values. It gives an idea of how wrong the predictions were.
How to check if a model fit is good?
The R^2 statistic has become the almost universally standard measure for model fit in linear models.
R^2 Metric
The R^2 (or R Squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values (coefficient of determination)
Why is ridge regression better than least squares?
The advantage is apparent in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases. This leads to decreased variance, with a smaller increase in bias. - In ridge regression, properly tuning λ trades a small amount of bias for less variance, which can yield a lower MSE than least squares.
Mean error (ME)
The average dollar amount or percentage points by which forecasts differ from outcomes
Mean Absolute Error (MAE)
The average absolute dollar amount or percentage points by which a forecast differs from an outcome
Mean Absolute Percentage Error (MAPE)
The average absolute percentage by which forecasts differ from outcomes
Mean Percentage Error (MPE)
The average of percentage errors by which forecasts differ from outcomes
Mean Squared Error (MSE)
The average of squared errors over the sample period
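The five forecast-error cards above (ME, MAE, MAPE, MPE, MSE) differ only in whether errors are signed, absolute, squared, or expressed as percentages. A minimal sketch with made-up numbers; the sign convention (forecast minus outcome) and the function name are assumptions, and percentage errors are taken relative to the outcome.

```python
def forecast_errors(actual, forecast):
    """Return ME, MAE, MPE, MAPE and MSE for paired outcomes and forecasts."""
    n = len(actual)
    errors = [f - a for a, f in zip(actual, forecast)]  # assumed sign convention
    pct_errors = [100.0 * e / a for e, a in zip(errors, actual)]
    return {
        "ME": sum(errors) / n,                        # signed, can cancel out
        "MAE": sum(abs(e) for e in errors) / n,       # absolute
        "MPE": sum(pct_errors) / n,                   # signed percentage
        "MAPE": sum(abs(p) for p in pct_errors) / n,  # absolute percentage
        "MSE": sum(e * e for e in errors) / n,        # squared
    }

# Hypothetical outcomes and forecasts (e.g. dollar amounts).
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

print(forecast_errors(actual, forecast))
```

Here the signed percentage errors (+10% and -10%) cancel, so MPE is zero even though MAPE is not; signed averages measure bias, not accuracy.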
Error Rate
The average per-class classification error
Average Accuracy
The average per-class effectiveness of a classifier
What is the impact of model complexity on the magnitude of coefficients?
The size of the coefficients increases exponentially as model complexity increases
Shrinkage also performs variable selection
The two best-known techniques for shrinking the coefficient estimates towards zero are the ridge regression and the lasso.
Subset Selection
This approach identifies a subset of the p predictors that we believe to be related to the response
Bias has what kind of fitting ____
Underfitting
What is Maximum likelihood estimation (MLE)?
A method of estimating the parameters by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable
Ridge regression
Similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity: the residual sum of squares plus a penalty on the squared coefficients.
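For a single predictor with no intercept, the quantity ridge regression minimizes is sum((y - b*x)^2) + λ * b^2, which has the closed-form solution b = sum(x*y) / (sum(x^2) + λ). The sketch below uses this simplified one-predictor case as an assumption to keep the example short; the data and the name `ridge_slope` are hypothetical.

```python
def ridge_slope(x, y, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * b^2 (one predictor, no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Hypothetical data lying exactly on y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

for lam in (0.0, 10.0, 100.0):
    print(lam, ridge_slope(x, y, lam))
```

With λ = 0 the result is the least squares slope; larger λ shrinks the slope towards zero without ever reaching it.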
Disadvantage of RIDGE REGRESSION
It includes all p predictors in the final model. The penalty term will shrink many of the coefficients close to zero, but never exactly to zero. This isn't generally a problem for prediction accuracy, but it can make the model more difficult to interpret.
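The contrast with the lasso can be seen in a textbook special case, assumed here for illustration: when the predictors form an identity design, ridge divides each least squares coefficient by (1 + λ), while the lasso soft-thresholds it by λ/2. The coefficients and function names below are hypothetical.

```python
def ridge_shrink(ols_coef, lam):
    """Ridge in the identity-design special case: proportional shrinkage."""
    return ols_coef / (1 + lam)

def lasso_shrink(ols_coef, lam):
    """Lasso in the identity-design special case: soft-thresholding by lam / 2."""
    if ols_coef > lam / 2:
        return ols_coef - lam / 2
    if ols_coef < -lam / 2:
        return ols_coef + lam / 2
    return 0.0

ols = [3.0, -1.5, 0.4]  # hypothetical least squares coefficients
lam = 1.0

print([ridge_shrink(b, lam) for b in ols])  # [1.5, -0.75, 0.2]
print([lasso_shrink(b, lam) for b in ols])  # [2.5, -1.0, 0.0]
```

Ridge keeps the smallest coefficient in the model at a reduced size, while the lasso sets it exactly to zero.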
What does F-Test determine?
The F-test determines whether the proposed relationship between the response variable and the set of predictors is statistically reliable
