ML - Statistical Learning Intro
Non-Parametric Model
- Do not assume a functional form for the relationship between X and Y; instead, estimate the relationship directly from the data
- Best for prediction because they are more flexible and can capture more complicated relationships
- Ex: KNN, Decision Trees, Random Forests, Kernel Density Estimation, SVM, K-Means Clustering, Hierarchical Clustering
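A minimal sketch of a non-parametric fit using KNN regression (assuming scikit-learn is available; the toy data and k = 5 are illustrative, not from the notes):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: Y depends on X non-linearly; KNN assumes no functional form.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

# n_neighbors controls flexibility: small k -> more flexible, large k -> smoother.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(knn.predict([[3.0]]))  # prediction = average of the 5 nearest training responses
```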
Classification or Regression? What are n and p? We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product, we have recorded whether it was a success or failure, the price charged for the product, the marketing budget, the competition price, and ten other variables.
Classification (Success or Failure); samples (n) = 20; predictors (p) = 13
Types of Cross Validation
- K-Fold Cross Validation
- Leave-One-Out Cross Validation (LOOCV)
K-Fold Cross Validation
Data is split into K equal-sized folds. For each fold k = 1, ..., K:
- Train the model on all data except the kth fold
- Evaluate the model on the kth fold
Average performance over the K iterations. Good for small datasets or when the model has a large number of parameters that need to be tuned.
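A minimal k-fold sketch with scikit-learn (synthetic data; K = 5 chosen arbitrarily):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Train on K-1 folds, evaluate on the held-out fold, average over the K iterations.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error")
print("average CV MSE:", -scores.mean())
```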
Training and Test MSE
See the plot of Flexibility vs. MSE:
- Training MSE decreases with flexibility
- Test MSE is U-shaped over flexibility
- Test MSE is almost always larger than Training MSE
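A sketch of that pattern using polynomial degree as the flexibility knob (scikit-learn, synthetic data; the degrees are arbitrary):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in (1, 3, 10, 20):  # increasing flexibility
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))  # keeps decreasing
    test_mse = mean_squared_error(y_te, model.predict(X_te))   # typically U-shaped
    print(degree, round(train_mse, 3), round(test_mse, 3))
```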
Bias Variance Tradeoff
The relationship between model complexity and the model's ability to generalize to new, unseen data. In general, more complex models tend to have lower bias but higher variance, while simpler models tend to have higher bias but lower variance. The goal is to find the balance between bias and variance for the problem at hand. Tuning hyperparameters and using cross-validation for performance evaluation are ways to do this.
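One way to search for that balance is cross-validated hyperparameter tuning; a sketch with GridSearchCV (scikit-learn, toy data, arbitrary grid):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

# Small k = low bias / high variance; large k = high bias / low variance.
grid = {"n_neighbors": [1, 5, 15, 50]}
search = GridSearchCV(KNeighborsRegressor(), grid, cv=5,
                      scoring="neg_mean_squared_error").fit(X, y)
print(search.best_params_)  # the k that balances bias and variance on this data
```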
Prediction Accuracy vs Model Interpretability
High interpretability implies a simpler, more inflexible model (e.g., a linear model), but also lower prediction accuracy, because the model is likely underfitting due to high bias. Low interpretability implies a more flexible model (e.g., a non-linear model), which tends to overfit: accuracy on the training data is high, but not necessarily on the test set, so the variance is high.
Unsupervised Clustering (K-Means)
- Assigns observations to k clusters, where k must be chosen in advance
- Assumes hyperspherical clusters
- Assignment is based only on distance from the cluster centers and does not take the distribution of points into account
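A minimal K-Means sketch (scikit-learn; the three 2-D blobs are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three roughly spherical blobs -- the setting K-Means handles best.
centers = [(0, 0), (5, 5), (0, 5)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # estimated cluster centers
print(km.labels_[:10])      # each point assigned to its nearest center
```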
Leave-One-Out Cross Validation (LOOCV)
- Each observation is used once as the test set
- The model is trained on the remaining observations, and performance estimates are averaged over all observations
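A LOOCV sketch with scikit-learn (synthetic data; n = 30, so the model is fit 30 times):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.5, -1.0]) + rng.normal(scale=0.3, size=30)

# Each observation is held out once; the error estimates are averaged over all n.
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print("LOOCV MSE estimate:", -scores.mean())
```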
Advantages of Parametric Model
- Easier to estimate, since only a set of parameters needs to be fit
- Requires less training data
Disadvantages of Parametric Model
- If the chosen functional form is too far from the truth, prediction and inference results can be poor
- Prone to underfitting
Advantages of Non-Parametric
- Potential for a good fit, even if the input-output relationship is complex
Regression vs Classification
- Regression: Response is Quantitative - Classification: Response is Qualitative
Disadvantages of Non-Parametric
- Requires lots of training data - Risk of overfitting due to complexity
Parametric Model
- Better for inference (since they make strong assumptions about functional form)
- Assume a known, specific functional form or distribution for the relationship between X and Y
- Ex: Linear Regression assumes the relationship between X and Y is linear
- Others: Logistic Regression, LDA, QDA
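A parametric sketch: linear regression assumes y is roughly beta_0 + beta_1*x1 + beta_2*x2, so only the coefficients need to be estimated (scikit-learn, toy data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 + 1.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

lr = LinearRegression().fit(X, y)
# The fitted parameters are the whole model -- handy for inference about X vs. Y.
print(lr.intercept_, lr.coef_)  # estimates of beta_0 and (beta_1, beta_2)
```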
Should we use parametric or non-parametric for the following?
1. The functional relationship between X and Y is known and we have many training samples
2. The functional relationship between X and Y is known and we have few training samples
3. The functional relationship between X and Y is unknown and we have many training samples
4. The functional relationship between X and Y is unknown and we have few training samples
1. Parametric or Non-Parametric (either can work)
2. Parametric
3. Non-Parametric
4. Parametric, because with only a few training samples a simple functional form is safer
Which MSE to select?
Select the model with the lowest Test MSE, because the test data has not been seen during training and we want the error on unseen data to be minimized.
What is reducible error?
Can be reduced by applying more appropriate learning techniques and models, or by adding more or better training data. It should be minimized.
What is irreducible error?
Cannot be reduced, because the error term ε may contain:
- unmeasured but relevant inputs
- unmeasurable variation (noise)
What is Cross Validation
Dividing the dataset into training, validation, and testing sets. The training set is used to train the model, the validation set is used to select the best hyperparameters, and the testing set is used to evaluate the final performance of the model.
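A sketch of that split with scikit-learn (the 60/20/20 proportions are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = rng.normal(size=500)

# Carve off the test set first, then split the remainder into train and validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 300, 100, 100
```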
Performance Metric for Classification
Error Rate
Error Rate
Error Rate = (number of misclassified instances) / (total number of instances)
It is the fraction of instances that the model predicted incorrectly. The Test Error Rate is the metric used.
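A minimal error-rate computation (numpy only; the labels are made up):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

error_rate = np.mean(y_true != y_pred)  # misclassified / total
print(error_rate)                       # 0.25 here (2 of 8 wrong)
```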
Prediction
Estimating the relationship between X and Y (f_hat) in order to get a good PREDICTION y_hat (the estimate of y).
Inference
Estimating the relationship between X and Y (f_hat) in order to get an UNDERSTANDING of how X_1, X_2, ..., X_p relate to Y, e.g., via hypothesis testing.
Mean Squared Error (MSE)
MSE = (1/n) * Σ(y_i - ŷ_i)^2
MSE is a good metric for regression model performance because it measures how well the model fits the data: the lower the MSE, the better. It penalizes large errors more than small ones, i.e., it gives more weight to observations that are far from the predicted value, which is where the model performs worst.
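A minimal MSE computation (numpy only; the numbers are made up):

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y - y_hat) ** 2)  # (1/n) * sum((y_i - y_hat_i)^2)
print(mse)                       # 0.375 here
```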
How can model error be estimated?
Mean Squared Error
Classification or Regression? What are n and p? We collect a set of data on the top 500 firms in the U.S. For each firm, we record profit, number of employees, industry, and the CEO Salary. We are interested in understanding which factors affect CEO salary.
Regression (CEO Salary is quantitative); samples (n) = 500; predictors (p) = 3
Classification or Regression? What are n and p? We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
Regression (predicting a quantitative % change); samples (n) = 52 weeks; predictors (p) = 3
Variance
The amount by which the estimated function would change if we estimated it using a different training sample. High variance means small changes in the training data result in large changes in the estimated relationship between X and Y. High flexibility = high variance: the model captures noise and random fluctuations in the data instead of the underlying patterns (overfitting).
For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
(a) The sample size n is extremely large, and the number of predictors p is small.
(b) The number of predictors p is extremely large, and the number of observations n is small.
(c) The relationship between the predictors and response is highly non-linear.
(d) The variance of the error terms, i.e. σ² = Var(ϵ), is extremely high.
(a) FLEXIBLE. The sample size is large and the number of predictors is small, so there is enough data to fit a flexible model; if the true relationship is close to linear, there are still enough data points for the fit to come out roughly linear. Overfitting could still occur, however.
(b) INFLEXIBLE. With few observations and many predictors, a flexible model would fit the noise, so a simpler model is better.
(c) FLEXIBLE. The relationship is known to be highly non-linear, so a flexible method can follow the trend in the data, whereas an inflexible model cannot.
(d) INFLEXIBLE. High variance in the error terms is noise, and a flexible model would chase it (high variance, overfitting); an inflexible model counteracts this.
Bias
The error introduced by approximating a real-life problem with too simple a model.
- Inflexible models have high bias (underfit) and fail to capture patterns in the data
- More flexible models have low bias
