Machine Learning (CH 1)
Validation Set
A data set for which the inputs are provided and the desired outputs are known, so that it can be determined how well an ML algorithm is working. Why not just reuse the test set: if you measure the generalization error many times on the test set and adapt the model and hyperparameters to reduce it, the model will overfit the test set.
What is the difference between a model parameter and a learning algorithm's hyperparameter?
A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).
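A minimal sketch of this distinction, assuming scikit-learn and tiny made-up data: alpha is a hyperparameter of the Ridge learning algorithm, while the fitted slope and intercept are model parameters.

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = Ridge(alpha=1.0)  # alpha: hyperparameter, set before training and constant during it
model.fit(X, y)
print(model.coef_, model.intercept_)  # slope and intercept: model parameters found by training
```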
What is a test set and why would you want to use it?
A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
What is the purpose of a validation set?
A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
Feature Engineering
As the saying goes: garbage in, garbage out. The goal is to come up with a good set of features to train on and to drop irrelevant predictors:
- feature selection: select the most useful features to train on (see the sketch below)
- feature extraction: combine existing features into more useful ones, aka dimensionality reduction
- create new features by gathering new data
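A short feature-selection sketch, assuming scikit-learn and synthetic data in which only the first of ten features is informative:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))             # 10 raw features
y = 3.0 * X[:, 0] + rng.normal(size=100)   # only the first feature drives the target

# Keep the 3 features most correlated with the target (feature selection)
X_selected = SelectKBest(f_regression, k=3).fit_transform(X, y)
print(X_selected.shape)  # (100, 3)
```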
What is cross-validation and why would you prefer it to a validation set?
Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data. Advantages: data is the fuel for building machine learning models, and a dataset large enough to spare a big hold-out set is rare. With an ordinary validation split, roughly 20-30% of the data is given up for validation; with k-fold cross-validation there is no such reduction, since every instance is used for training in some folds and for validation in another. Averaging over the folds also gives a more reliable estimate and helps avoid overfitting the hyperparameters to a single validation split.
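A minimal sketch of k-fold cross-validation with scikit-learn's cross_val_score, assuming the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# Score the model on 5 train/validation splits; no fixed validation set needed
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # average accuracy across the 5 folds
```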
Feature Extraction
Unsupervised Learning; dimensionality reduction that merges several correlated features into one. Why: reducing the dimensionality of the training data before feeding it into another ML algorithm means that algorithm will run faster, the data will take up less memory, and in some cases the algorithm will perform better.
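A small sketch of feature extraction via PCA, assuming scikit-learn and synthetic data where ten features are driven by two underlying factors:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10))   # 10 features driven by 2 hidden factors

# Merge the correlated features into 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # (200, 10) -> (200, 2)
```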
Association Rule Learning
Unsupervised Learning; examining data to discover new and interesting relationships among attributes that can be stated as business rules.
Anomaly Detection
Unsupervised Learning; the process of identifying rare or unexpected items or events in a data set
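One possible sketch, assuming scikit-learn's IsolationForest (one of several anomaly-detection algorithms) on a synthetic dataset with an obvious outlier:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),  # typical instances
               [[8.0, 8.0]]])                    # one obvious anomaly

labels = IsolationForest(random_state=0).fit_predict(X)  # -1 marks anomalies
print(np.where(labels == -1)[0])  # indices of the flagged instances
```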
Non-representative data a. Sampling Noise b. Sampling Bias
a. Sampling noise: a sample that is too small will be non-representative by chance. b. Sampling bias: even a very large sample can be non-representative if the sampling method is flawed (e.g., nonresponse bias, such as fewer than 25% of people answering a survey, or sampling on only a few criteria).
Regularization
Constraining a model to make it simpler and reduce the risk of overfitting. The amount of regularization to apply can be controlled by a hyperparameter (see the sketch below).
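A sketch of how that hyperparameter works, assuming scikit-learn's Ridge and synthetic data: as alpha grows, the model is constrained more strongly and its coefficients shrink toward zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([5.0, -3.0, 0.0, 0.0, 0.0]) + rng.normal(size=50)

for alpha in (0.01, 1.0, 100.0):     # the regularization hyperparameter
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs, 2))  # coefficients shrink toward 0 as alpha grows
```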
Testing and Validating
Generalization Error (aka out-of-sample error): the error rate a trained model makes on new cases, estimated by evaluating on the test set rather than the training set. If the training error is low but the generalization error is high, the model is overfitting.
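A minimal sketch of measuring this gap, assuming scikit-learn and its built-in diabetes dataset; an unconstrained decision tree makes the gap obvious:

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)  # no constraints
print(mean_squared_error(y_train, model.predict(X_train)))  # low training error
print(mean_squared_error(y_test, model.predict(X_test)))    # much higher: overfitting
```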
What are challenges in ML?
Some of the main challenges in Machine Learning are: 1. lack of data, 2. poor data quality, 3. nonrepresentative data, 4. uninformative features, 5. excessively simple models that underfit the training data, and 6. overly complex models that overfit the data.
If your model performs great on the training data but generalizes poorly to new instances, what is happening? What are the solutions?
If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are 1. getting more data, 2. simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), and 3. reducing the noise in the training data.
What can go wrong if you tune hyperparameters using the test set?
If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).
Supervised Learning
Tasks:
1. Classification: predict a class label.
2. Regression: predict a 'target' numeric value given a set of 'features' called 'predictors'.
Algorithms (see the sketch below):
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- SVMs
- Decision Trees & Random Forests
- Neural Networks
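A tiny sketch of both tasks, assuming scikit-learn and its built-in datasets:

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification: predict a class label
X_c, y_c = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_c, y_c)
print(clf.predict(X_c[:1]))  # predicted class for one instance

# Regression: predict a numeric target
X_r, y_r = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_r, y_r)
print(reg.predict(X_r[:1]))  # predicted numeric value
```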
Hyperparameter
is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training.
Cross-validation
the training set is split into complementary subsets, and each model is trained on a different combination of these subsets and validated against the remaining parts.
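The same splitting written out explicitly, assuming scikit-learn's KFold (cross_val_score automates this loop):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
# Each pass trains on 4 subsets and validates on the remaining one
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    print(model.score(X[val_idx], y[val_idx]))  # validation accuracy for this fold
```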
Overfitting
Analogous to overgeneralizing in humans: the model performs well on the training data but does not generalize to new instances. Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Solutions (see the sketch below):
- simplify the model: fewer parameters, fewer attributes, or more regularization
- gather more training data
- reduce noise in the training data (e.g., fix data errors and remove outliers)
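A sketch of overfitting in action, assuming scikit-learn and synthetic quadratic data: a degree-15 polynomial fits the 30 training points closely but typically does much worse on fresh data than a degree-2 model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=30)  # quadratic signal + noise
X_new = np.linspace(-3, 3, 100).reshape(-1, 1)
y_new = X_new.ravel() ** 2                           # noise-free ground truth

for degree in (2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    print(degree,
          mean_squared_error(y, model.predict(X)),          # training error
          mean_squared_error(y_new, model.predict(X_new)))  # error on fresh data
```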
Machine Learning
the science of programming computers so they can learn from data.
Unsupervised Learning
Training data is unlabeled. Algorithms:
a. Clustering
- K-Means
- Hierarchical Cluster Analysis (HCA)
- Expectation Maximization
b. Visualization and dimensionality reduction
- Principal Component Analysis (PCA)
- Kernel PCA
- Locally Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
c. Association rule learning
- Apriori
- Eclat
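A minimal clustering sketch, assuming scikit-learn's KMeans on synthetic blobs; note that the labels are never shown to the algorithm:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # labels discarded
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # learned cluster centers
print(kmeans.predict(X[:5]))    # cluster assigned to each instance
```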
Fitness Function / Cost Function
A fitness (or utility) function measures how good your model is; a cost function measures how bad it is.
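A sketch of a typical cost function, mean squared error, in plain NumPy (the names here are illustrative):

```python
import numpy as np

def mse_cost(theta, X, y):
    """Mean squared error of the linear predictions X @ theta against targets y."""
    return np.mean((X @ theta - y) ** 2)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + one feature
y = np.array([3.0, 5.0, 7.0])
print(mse_cost(np.array([1.0, 2.0]), X, y))  # 0.0: perfect fit, lowest cost
print(mse_cost(np.array([0.0, 0.0]), X, y))  # large cost: a bad model
```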
Underfitting
When a model is too simple to learn the underlying structure of the data; for example, a linear model is prone to underfit data with a complex structure. Solutions (see the sketch below):
- select a more powerful model, with more parameters
- feed better features to the learning algorithm (feature engineering)
- reduce the constraints on the model (e.g., reduce the regularization hyperparameter)
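A sketch of underfitting and one fix, assuming scikit-learn and synthetic quadratic data: a plain line underfits, while adding polynomial features (better features) captures the structure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(100, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.2, size=100)  # quadratic data

print(LinearRegression().fit(X, y).score(X, y))  # low R^2: the line underfits
poly = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, y)
print(poly.score(X, y))  # near 1.0: the richer features capture the structure
```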