Fundamentals of Machine Learning
What is a labeled training set?
A labeled training set is a training set in which every instance comes with its desired outcome (the label). It is used to train the model.
Name and define three unsupervised tasks.
Anomaly detection: finds data points that do not fit the norm and flags them as anomalies. Novelty detection: detects new instances that look different from all instances in the training set. Association rule learning: tries to identify relations between attributes.
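As a rough illustration of anomaly detection, here is a minimal sketch that assumes one-dimensional data and a simple z-score rule. The function name `find_anomalies` and the threshold of 2.0 are illustrative choices, not part of these notes; real detectors such as Isolation Forest work differently.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# No labels are given: the rule only looks at how the data is distributed.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 25.0, 10.0]
anomalies = find_anomalies(readings)
print(anomalies)
```

Only the reading of 25.0 lies far enough from the rest to be flagged.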
What is the difference between a model parameter and a learning algorithm's hyperparameter?
A model has one or more model parameters that determine what it predicts for a new instance. The learning algorithm sets the values of these parameters during training so that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model, such as the amount of regularization to apply. It is set before training and stays constant while the model is trained.
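The distinction can be sketched with a tiny ridge-style regression, assuming a one-feature model y = w * x with no intercept. Here `w` is the model parameter that training finds, and `alpha` is the hyperparameter chosen beforehand; the function name `fit_ridge_slope` is made up for this example.

```python
def fit_ridge_slope(xs, ys, alpha):
    """Fit y = w * x by minimizing sum((w*x - y)**2) + alpha * w**2.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + alpha).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# alpha is fixed before training; w is what training produces.
w_plain = fit_ridge_slope(xs, ys, alpha=0.0)    # no regularization
w_shrunk = fit_ridge_slope(xs, ys, alpha=30.0)  # strong regularization
print(w_plain, w_shrunk)
```

With more regularization the learned slope is pulled toward zero, which is the constraint on the model that regularization is meant to apply.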
Hyperparameter
A parameter of the learning algorithm.
Whether the algorithm works by comparing new data points to known data points, or instead detects patterns in the training data and builds a predictive model.
Instance-based vs model-based
Regularization
The process of constraining the model to reduce the possibility of overfitting.
Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
A supervised learning problem, because each email in the training set is labeled as spam or ham.
What is a test set, and why would you want to use it?
A test set is a subset of the data that is held out from training and used to estimate the generalization error the model will make on new instances.
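A minimal sketch of holding out a test set, assuming a simple random split. The 20% ratio, the seed, and the function name `train_test_split` are illustrative choices for this example.

```python
import random

def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and hold out the first `test_ratio` share as the test set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```

The model is trained only on `train_set`; its error on `test_set` is the estimate of how it will do on new instances.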
What is the purpose of a validation set?
A validation set (also called a development set or dev set) is a held-out part of the training data used to compare candidate models, so that we can select the best one and tune its hyperparameters.
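The sketch below shows one way a validation set could be used to pick a hyperparameter, assuming a toy k-nearest-neighbors regressor and hand-picked data. The helper names and the candidate values of `k` are hypothetical.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, data, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

# Toy data: y is roughly 2 * x with small hand-picked noise.
train = [(0, 0.1), (1, 2.2), (2, 3.9), (3, 6.1), (4, 7.8), (5, 10.2), (6, 11.9), (7, 14.1)]
validation = [(1.5, 3.0), (3.5, 7.0), (5.5, 11.0)]

# Each candidate k is evaluated on the validation set, never on the test set.
candidates = [1, 2, 8]
val_errors = {k: mean_squared_error(train, validation, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)
print(best_k, val_errors)
```

Once the best candidate is chosen this way, the test set is still untouched and can give an honest estimate of the generalization error.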
What type of algorithm would you use to segment your customers into multiple groups?
If you do not know in advance how to group your customers, use a clustering algorithm, which finds groups of similar customers on its own. This is unsupervised learning. If you already know which groups you want, use a classification algorithm instead: give it many examples of customers from each group, and it will learn to assign new customers to those groups. This is supervised learning.
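As an illustration of the clustering case, here is a minimal one-dimensional K-Means sketch. It assumes a single feature (annual spend) and fixed starting centroids; the names and numbers are made up for the example.

```python
def k_means_1d(values, centroids, n_iters=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(n_iters):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

annual_spend = [100, 120, 110, 900, 950, 1000]
centroids, clusters = k_means_1d(annual_spend, centroids=[0, 500])
print(centroids, clusters)
```

No group labels were supplied; the low spenders and high spenders are separated purely by similarity.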
Semi-Supervised Learning
An approach in which the algorithm learns from a data set that contains both labeled and unlabeled instances, usually a few labeled ones and many unlabeled ones. Most semi-supervised learning algorithms combine supervised and unsupervised algorithms.
What is an online learning system?
An online learning system learns incrementally: data is fed to it sequentially, either one instance at a time or in mini-batches, and each learning step is fast and cheap. This lets the model adapt quickly to changing data.
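A minimal sketch of online learning, assuming a one-parameter model y = w * x updated by stochastic gradient descent. The class name `OnlineSlopeModel`, the learning rate, and the synthetic stream are all hypothetical.

```python
class OnlineSlopeModel:
    """Learns w in y = w * x one instance at a time with stochastic gradient descent."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.w = 0.0

    def learn_one(self, x, y):
        error = self.w * x - y
        self.w -= self.learning_rate * error * x  # gradient of 0.5 * error**2 w.r.t. w

    def predict(self, x):
        return self.w * x

model = OnlineSlopeModel()
for step in range(100):   # first regime: y = 3x
    x = 1.0 + step % 2
    model.learn_one(x, 3.0 * x)
w_before = model.w

for step in range(100):   # the data drifts: y = 5x
    x = 1.0 + step % 2
    model.learn_one(x, 5.0 * x)
w_after = model.w
print(w_before, w_after)
```

Each instance is used once and can then be discarded, and the model follows the data when the underlying relationship changes.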
Can you name four common unsupervised tasks?
Clustering, Visualization, Dimensionality Reduction, and Association Rule Learning
Name some important unsupervised learning algorithms
Clustering: K-Means, DBSCAN, Hierarchical Cluster Analysis (HCA). Anomaly detection and novelty detection: One-Class SVM, Isolation Forest. Visualization and dimensionality reduction: Principal Component Analysis (PCA), Kernel PCA, Locally Linear Embedding (LLE), t-Distributed Stochastic Neighbor Embedding (t-SNE). Association rule learning: Apriori, Eclat.
Can you name four types of problems where Machine Learning shines?
Replacing solutions that rely on long lists of hand-built rules. Solving complex problems for which there is no good algorithmic solution. Helping humans learn from data. Building systems that adapt quickly to fluctuating environments.
What type of learning algorithm relies on a similarity measure to make predictions?
An instance-based learning algorithm. It learns the training data by heart, then uses a similarity measure to find the learned instances most similar to a new instance and uses them to make its prediction.
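A minimal sketch of instance-based learning, assuming a k-nearest-neighbors classifier with Euclidean distance as the similarity measure. The data points and labels are made up for the example.

```python
from collections import Counter
from math import dist

def knn_classify(training_set, new_point, k=3):
    """Predict the majority label among the k stored instances closest to new_point."""
    neighbors = sorted(training_set, key=lambda item: dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# "Training" is just storing the instances; all the work happens at prediction time.
training_set = [
    ((1.0, 1.0), "small"), ((1.5, 2.0), "small"), ((2.0, 1.5), "small"),
    ((8.0, 8.0), "large"), ((8.5, 9.0), "large"), ((9.0, 8.5), "large"),
]
prediction = knn_classify(training_set, (2.0, 2.0))
print(prediction)
```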
Name some important supervised learning algorithms
K-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines (SVM), Decision Trees and Random Forests, and Neural Nets.
How would you define Machine Learning?
Machine learning is about building systems that learn from data: they get better at a task, according to some performance measure, as they are given more data.
What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
Model-based learning algorithms search for the values of the model parameters that make the model generalize well to new instances. The most common strategy is to train the system by minimizing a cost function that measures how bad the model is at making predictions on the training data. To make a prediction, we feed the features of the new instance into the model, which uses the parameter values found during training.
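These three steps can be sketched with a small linear model, assuming mean squared error as the cost function and plain batch gradient descent as the search procedure. The learning rate, epoch count, and toy data are illustrative assumptions.

```python
def mse_cost(w, b, xs, ys):
    """Cost function: how bad the model y = w*x + b is on the training data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train_linear_model(xs, ys, learning_rate=0.05, n_epochs=5000):
    """Search for the parameters (w, b) that minimize the MSE cost."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
w, b = train_linear_model(xs, ys)

# Prediction: plug a new instance's feature into the trained model.
prediction = w * 5.0 + b
print(w, b, prediction)
```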
Two kinds of learning
Online learning and batch learning
What is out-of-core learning?
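Out-of-core learning trains a model on data sets that do not fit into the computer's main memory. The algorithm loads part of the data, runs a training step on it, and repeats until it has processed all of the data, using online learning techniques.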
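A minimal sketch of the out-of-core loop, assuming the data arrives as a stream of rows and the model is updated with one mini-batch gradient step per chunk. The generator `row_stream` is a hypothetical stand-in for reading a large file from disk.

```python
from itertools import islice

def read_in_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows, so only one chunk is in memory at a time."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk

def row_stream(n_rows):
    """Hypothetical stand-in for a large file on disk: yields (x, y) rows with y = 4x."""
    for i in range(n_rows):
        x = 1.0 + i % 3
        yield x, 4.0 * x

w = 0.0               # single model parameter in y = w * x
learning_rate = 0.2
chunks_seen = 0
for chunk in read_in_chunks(row_stream(1000), chunk_size=100):
    chunks_seen += 1
    # One gradient step on the average squared-error gradient of this chunk.
    gradient = sum((w * x - y) * x for x, y in chunk) / len(chunk)
    w -= learning_rate * gradient
print(chunks_seen, w)
```

At no point does the full data set need to be held in memory; each chunk is dropped once its training step is done.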
What are the two most common supervised tasks?
Regression and classification
What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?
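Reinforcement Learning is likely the best fit for this type of problem: the robot learns by trial and error, getting rewards or penalties for its actions. AlphaGo is a well-known example of Reinforcement Learning.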
Can you name four of the main challenges in Machine Learning?
Some of the challenges include: a lack of data, uninformative data, nonrepresentative data, poor data quality, and models that underfit or overfit the data.
What are the four types of learning, categorized by the amount of human supervision?
Supervised, Unsupervised, Semi-supervised, and Reinforcement Learning
What does it mean if your training error is low but your generalization error is high?
The model is overfitting the data.
If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
The model is probably overfitting the training data. Possible solutions are to obtain more data, to simplify the model (by choosing a simpler model, reducing the number of model parameters, or regularizing it), or to reduce the noise in the training set.
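The effect can be sketched on toy data, assuming noisy training targets around y = 2x. A model that memorizes the training set (1-nearest-neighbor) is compared with a simpler one-parameter line; all names and numbers are made up for the example.

```python
def mse(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

# Training targets are y = 2x plus alternating noise of +0.9 / -0.9.
train = [(1, 2.9), (2, 3.1), (3, 6.9), (4, 7.1), (5, 10.9), (6, 11.1)]
# New instances follow the true relationship y = 2x.
new_instances = [(1.4, 2.8), (2.4, 4.8), (3.4, 6.8), (4.4, 8.8), (5.4, 10.8)]

def memorizer(x):
    """1-nearest-neighbor: returns the target of the closest training point, noise included."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Simpler model: least-squares line through the origin, a single parameter.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def simple_model(x):
    return slope * x

memorizer_train_error = mse(memorizer, train)
memorizer_new_error = mse(memorizer, new_instances)
simple_train_error = mse(simple_model, train)
simple_new_error = mse(simple_model, new_instances)
print(memorizer_train_error, memorizer_new_error)
print(simple_train_error, simple_new_error)
```

The memorizer has zero training error but a much larger error on new instances, while the simpler model has a higher training error and generalizes better.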
What is the idea behind Dimensionality Reduction?
To simplify the data without losing too much information.
What can go wrong if you tune hyperparameters using the test set?
If you tune hyperparameters using the test set, you risk overfitting the test set. The generalization error you measure will then be too optimistic, and the model may perform worse on new data than you expect.
What is Feature Extraction?
A form of dimensionality reduction in which correlated features are merged into one. For example, a car's mileage and age can be combined into a single feature that represents the car's wear and tear.
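One naive way to sketch this merge is to standardize both features and average them. This is only an assumption made for illustration (a method such as PCA would normally choose the combination); the feature values are made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

mileage_km = [20_000, 60_000, 100_000, 140_000]
age_years = [1, 3, 5, 7]

# Two correlated features become one "wear" feature per car.
wear = [(m + a) / 2 for m, a in zip(standardize(mileage_km), standardize(age_years))]
print(wear)
```

The data set now has one column instead of two, and cars that are older and have more mileage get a higher wear score.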
Unsupervised Learning
When you feed the algorithm a training set that DOES NOT include the desired solutions, so the training data is unlabeled.
Supervised Learning
When you feed the algorithm a training set that includes the desired solution, called labels.