Machine Learning (CH 1)

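Nonrepresentative data: a. Sampling Noise b. Sampling Bias

a. Sampling noise: the sample is too small, so it is nonrepresentative by chance. b. Sampling bias: even a very large sample is nonrepresentative if the sampling method is flawed (e.g., nonresponse bias when fewer than 25% of the people polled answered, or sampling on only a few criteria).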
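
The difference between the two can be sketched with a small simulation. The population, its mean of about 50, and the sample sizes below are made up for illustration; they are not from the chapter.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 values with a true mean close to 50.
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

def spread_of_sample_means(sample_size, trials=200):
    """Standard deviation of the sample mean over repeated random samples."""
    means = [statistics.fmean(random.sample(population, sample_size))
             for _ in range(trials)]
    return statistics.pstdev(means)

# Sampling noise: small random samples give unstable estimates.
noise_small = spread_of_sample_means(10)
noise_large = spread_of_sample_means(1_000)

# Sampling bias: a huge sample drawn only from the top half of the population.
biased_sample = sorted(population)[50_000:]
biased_mean = statistics.fmean(biased_sample)

print(f"spread of the mean, n=10:   {noise_small:.2f}")
print(f"spread of the mean, n=1000: {noise_large:.2f}")
print(f"true mean {true_mean:.2f} vs biased mean {biased_mean:.2f}")
```

The biased sample has 50,000 instances, yet its mean is far from the true mean; adding more data does not repair a flawed sampling method.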

Supervised Learning

The training data includes the desired solutions, called labels. Typical tasks: 1. Classification: predict a class. 2. Regression: predict a numeric 'target' value given a set of 'features' called 'predictors'.
Algorithms:
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- SVM
- Decision Trees & Random Forests
- Neural Networks
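
As a minimal sketch of a regression task, the snippet below fits a one-feature linear regression by ordinary least squares. The housing data (size as the feature, price as the target) is hypothetical and chosen so that the fit is exact.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical labeled training data: feature = size in m^2,
# target = price in thousands (here exactly 3 * size).
sizes = [50, 60, 80, 100, 120]
prices = [150, 180, 240, 300, 360]

slope, intercept = fit_line(sizes, prices)

# Predict the target for a new instance the model has not seen.
predicted = slope * 90 + intercept
print(f"predicted price for 90 m^2: {predicted:.1f}")
```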

Underfitting

The model is too simple to learn the underlying structure of the data. For example, a linear model is prone to underfit data with a complex, nonlinear structure.
Solutions:
- select a more powerful model, with more parameters
- feed better features to the learning algorithm (feature engineering)
- reduce the constraints on the model (e.g., reduce the regularization hyperparameter)
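
A small sketch of the second solution, under the assumption that the true relationship is y = x^2 (a made-up example): a straight line fitted on x underfits badly, while the same linear learner fed the engineered feature x^2 fits the data exactly.

```python
def fit_line(features, targets):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(features)
    mean_f = sum(features) / n
    mean_t = sum(targets) / n
    sft = sum((f - mean_f) * (t - mean_t) for f, t in zip(features, targets))
    sff = sum((f - mean_f) ** 2 for f in features)
    slope = sft / sff
    return slope, mean_t - slope * mean_f

def mse(features, targets, slope, intercept):
    """Mean squared error of the line on the given data."""
    return sum((slope * f + intercept - t) ** 2
               for f, t in zip(features, targets)) / len(targets)

xs = list(range(-5, 6))
ys = [x * x for x in xs]          # nonlinear structure

# Too simple: a line on the raw feature x cannot follow the curve.
slope, intercept = fit_line(xs, ys)
underfit_mse = mse(xs, ys, slope, intercept)

# Better feature: the same linear learner, now fed x^2.
squared = [x * x for x in xs]
slope2, intercept2 = fit_line(squared, ys)
better_mse = mse(squared, ys, slope2, intercept2)

print(f"MSE with x:   {underfit_mse:.1f}")
print(f"MSE with x^2: {better_mse:.1f}")
```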

If your model performs great on the training data but generalizes poorly to new instances, what is happening? What are the solutions?

If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are 1. getting more data, 2. simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), 3. reducing the noise in the training data.

What can go wrong if you tune hyperparameters using the test set?

If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).

What are challenges in ML?

Some of the main challenges in Machine Learning are 1. lack of data, 2. poor data quality, 3. nonrepresentative data, 4. uninformative features, 5. excessively simple models that underfit the training data, and 6. excessively complex models that overfit the data.

Overfitting

The model overgeneralizes from the training data: it performs well on the training data but does not generalize well to new instances. Overfitting happens when the model is too complex relative to the amount and noisiness of the training data.
Solutions:
- simplify the model: select one with fewer parameters, reduce the number of attributes in the training data, or constrain the model
- gather more training data
- reduce noise in the training data (e.g., fix data errors and remove outliers)
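
The symptom can be reproduced with a deliberately over-flexible model. In this sketch (synthetic data, an assumed relationship y = 2x + noise), a 1-nearest-neighbor predictor memorizes the training set and reaches zero training error, but does worse on fresh data than a plain line.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Noisy samples of the hypothetical relationship y = 2x + noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def nearest_neighbor_predict(train_x, train_y, x):
    """Overly flexible model: copy the target of the closest training point."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

train_x, train_y = make_data(200)
test_x, test_y = make_data(1000)

slope, intercept = fit_line(train_x, train_y)
line_train = mse([slope * x + intercept for x in train_x], train_y)
line_test = mse([slope * x + intercept for x in test_x], test_y)

memo_train = mse([nearest_neighbor_predict(train_x, train_y, x) for x in train_x], train_y)
memo_test = mse([nearest_neighbor_predict(train_x, train_y, x) for x in test_x], test_y)

print(f"line:       train {line_train:.2f}, test {line_test:.2f}")
print(f"memorizer:  train {memo_train:.2f}, test {memo_test:.2f}")
```

The memorizer has learned the noise in the training targets, which is why its training error says nothing about its error on new instances.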

Unsupervised Learning

The training data is unlabeled.
Algorithms:
a. Clustering
- K-means
- Hierarchical Cluster Analysis
- Expectation Maximization
b. Visualization and dimensionality reduction
- Principal Component Analysis (PCA)
- Kernel PCA
- Locally-Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
c. Association rule learning
- Apriori
- Eclat
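
To make clustering concrete, here is a bare-bones K-means sketch on one-dimensional points. The points and starting centroids are hypothetical; a real implementation would also handle random initialization and a convergence check.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Unlabeled data: no targets, only the points themselves.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])

print(centroids)
print(clusters)
```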

Feature Engineering

As the saying goes: garbage in, garbage out. Feature engineering means coming up with a good set of features to train on and dropping irrelevant predictors:
- feature selection: select the most useful features among the existing ones
- feature extraction: combine existing features into a more useful one (dimensionality reduction algorithms can help)
- create new features by gathering new data

Testing and Validating

Generalization error (aka out-of-sample error): the error rate of a trained model on new cases, estimated by evaluating the model on the test set. If the training error is low but the generalization error is high, the model is overfitting the training data.

Machine Learning

the science of programming computers so they can learn from data.

Cross-validation

The training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts.
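
The splitting step can be sketched as an index generator. This is a simplified k-fold scheme without shuffling; the function name and the choice of 10 samples and 5 folds are illustrative.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

splits = list(k_fold_splits(10, 5))
for train, validation in splits:
    print("train:", train, "validate:", validation)
```

Each instance lands in exactly one validation fold and in the training part of every other fold.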

Fitness Function / Cost Function

A utility function (or fitness function) measures how good your model is; a cost function measures how bad it is.
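
As one possible pairing, mean squared error can serve as a cost function and the coefficient of determination (R^2) as a utility function. These two metrics are chosen here for illustration; the card does not name specific ones, and the prediction lists are made up.

```python
def mse_cost(predictions, targets):
    """Cost function: mean squared error, lower is better."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def r2_utility(predictions, targets):
    """Utility function: coefficient of determination, higher is better."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for p, t in zip(predictions, targets))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1 - ss_res / ss_tot

targets = [1.0, 2.0, 3.0, 4.0]
good_predictions = [1.1, 1.9, 3.2, 3.8]   # close to the targets
bad_predictions = [4.0, 3.0, 2.0, 1.0]    # reversed order

for name, preds in (("good", good_predictions), ("bad", bad_predictions)):
    print(name, "cost:", round(mse_cost(preds, targets), 3),
          "utility:", round(r2_utility(preds, targets), 3))
```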

What is the difference between a model parameter and a learning algorithm's hyperparameter?

A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).

Validation Set

A held-out part of the training data, with known inputs and desired outputs, used to measure how well candidate models and hyperparameters perform. Why: if you measure the generalization error multiple times on the test set and adapt the model and hyperparameters to reduce it, the model ends up overfitting the test set.

What is a test set and why would you want to use it?

A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.

What is the purpose of a validation set?

A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
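
A three-way split might look like the sketch below. The 60/20/20 proportions, the function name, and the seed are assumptions made for the example, not values prescribed by the chapter.

```python
import random

def train_val_test_split(data, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle a copy of the data and cut it into three disjoint parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)
    test = shuffled[:n_test]                  # touched once, at the very end
    val = shuffled[n_test:n_test + n_val]     # used to compare models
    train = shuffled[n_test + n_val:]         # used to fit each candidate
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Candidate models are fitted on `train` and compared on `val`; only the chosen model is evaluated on `test`, so that estimate stays untouched by the tuning.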

What is cross-validation and why would you prefer it to a validation set?

Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data. Advantages: data is the fuel for machine learning models, and a very large dataset is rarely available. With a separate hold-out validation set, a sizable share of the training data (often around 20-30%) is never used for training. With k-fold cross-validation, every instance is used for both training and validation across the folds, and the final model can be retrained on the full training set. Averaging the validation scores over the folds also gives a more reliable performance estimate, which makes overfitting easier to detect.

Feature Extraction

Unsupervised Learning; a form of dimensionality reduction that merges several correlated features into one. Why: reducing the dimension of the training data before feeding it to another ML algorithm makes that algorithm run faster and take up less memory, and in some cases perform better.
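
A minimal sketch of merging two correlated features into one, using the first principal component of 2-D data (PCA restricted to two features so it fits in closed form). The car data and the name `wear` are hypothetical, and in practice the features would usually be scaled first.

```python
import math

def first_principal_component(points):
    """Return (mean, unit direction) of the first principal axis of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    cyy = sum((y - mean_y) ** 2 for _, y in points) / n
    cxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed-form angle of the leading eigenvector of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    return (mean_x, mean_y), (math.cos(angle), math.sin(angle))

def project(points, mean, direction):
    """Replace each 2-D point by its coordinate along the principal axis."""
    return [(x - mean[0]) * direction[0] + (y - mean[1]) * direction[1]
            for x, y in points]

# Hypothetical cars: (age in years, mileage in units of 10,000 km).
cars = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 3.9), (5, 5.0)]
mean, direction = first_principal_component(cars)
wear = project(cars, mean, direction)   # one merged feature per car

print([round(w, 2) for w in wear])
```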

Association Rule Learning

Unsupervised Learning; examining data to discover new and interesting relationships among attributes that can be stated as business rules.
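
Two measures commonly used to score such rules are support and confidence; they are not named on this card and are introduced here as an assumption, along with the made-up shopping baskets. The sketch scores the rule "bbq sauce and chips imply steak".

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical supermarket baskets.
transactions = [
    {"bbq sauce", "chips", "steak"},
    {"bbq sauce", "chips", "steak"},
    {"bbq sauce", "chips"},
    {"chips", "soda"},
    {"steak"},
]

rule_support = support(transactions, {"bbq sauce", "chips", "steak"})
rule_confidence = confidence(transactions, {"bbq sauce", "chips"}, {"steak"})

print(f"support: {rule_support:.2f}, confidence: {rule_confidence:.2f}")
```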

Anomaly Detection

Unsupervised Learning; the process of identifying rare or unexpected items or events in a data set
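
One very simple way to flag unexpected values is a z-score rule: mark anything far from the mean in units of standard deviation. This is only a basic statistical stand-in, not the method the chapter has in mind, and the readings and the threshold of 2.0 are assumptions.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical sensor readings with one unusual value.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
anomalies = find_anomalies(readings)

print(anomalies)
```

Because the outlier itself inflates the mean and standard deviation, this rule becomes unreliable when anomalies are frequent or extreme.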

Regularization

Constraining a model to make it simpler and reduce the risk of overfitting. The amount of regularization is controlled by a hyperparameter.
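
A small sketch of the idea, assuming a one-parameter model y ~ w * x with an L2 penalty (ridge-style regularization, chosen here for illustration). The slope `w` is the model parameter found by learning; `alpha` is the hyperparameter set before training. The data is made up.

```python
def fit_slope(xs, ys, alpha=0.0):
    """Slope w minimizing sum((y - w*x)**2) + alpha * w**2 (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Larger alpha constrains the model more and pulls the slope toward zero.
slopes = {alpha: fit_slope(xs, ys, alpha) for alpha in (0.0, 10.0, 100.0)}
for alpha, w in slopes.items():
    print(f"alpha={alpha:>5}: w={w:.3f}")
```

With `alpha=0` this reduces to an unregularized least-squares fit through the origin; very large values push the model toward a flat line and risk underfitting.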

Hyperparameter

is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training.
