Machine Learning - Midterm Study


What are non parametric models

"Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don't want to worry too much about choosing just the right features."

What separates Adaline from the perceptron?

-Perceptron uses the class labels to learn the model coefficients
-Adaline uses continuous predicted values (from the net input) to learn the model coefficients, which is more "powerful" since it tells us by "how much" we were right or wrong
-Adaline uses a linear activation instead of a step-wise activation
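
A minimal sketch of the difference in the error signal, assuming class labels in {-1, 1} and made-up weights (the function names and numbers are illustrative, not from the lecture). The perceptron measures its error after thresholding, while Adaline measures it on the continuous net input.

```python
def net_input(weights, bias, x):
    """Weighted sum of the inputs plus the bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def step(z):
    """Threshold function: map the net input to a class label in {-1, 1}."""
    return 1 if z >= 0.0 else -1

def perceptron_error(weights, bias, x, y):
    # Perceptron: the error uses the thresholded class label.
    return y - step(net_input(weights, bias, x))

def adaline_error(weights, bias, x, y):
    # Adaline: the error uses the continuous net input (linear activation),
    # so it also reflects "how much" the prediction was off.
    return y - net_input(weights, bias, x)

# Illustrative values only (assumed for this sketch).
w, b = [0.5, -0.2], 0.1
x, y = [1.0, 2.0], 1

print("perceptron error:", perceptron_error(w, b, x, y))
print("adaline error:   ", adaline_error(w, b, x, y))
```

Here the sample is already on the correct side of the boundary, so the perceptron sees no error, while Adaline still sees a non-zero error because the net input is not yet equal to the target.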

How is an Adaptive Linear Neuron (Adaline) like a perceptron

-They are classifiers for binary classification
-They have a linear decision boundary
-They can learn iteratively, sample by sample
-They use a threshold function

What are the steps of LDA?

1. Compute the d-dimensional mean vectors for each class.
2. Compute the scatter matrices (within-class and between-class).
3. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues to form a d×k matrix.
5. Use this matrix to transform the samples onto the new subspace.

How many steps are there in the machine learning flow?

6

Definition of machine learning

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."

Typically don't keep features where...

-Almost all instances have the same value (no information)
-Almost all instances have unique values (SSN, phone numbers)
-The feature is highly correlated with another feature

Give an example of how each of the three could be used other than the ones given in the lecture

k-Nearest Neighbors could be used in a genome study to classify samples based on their characteristics. A decision tree could predict whether a person likes video games by stepping through different characteristics to reach the decision. A Support Vector Machine could classify whether an image shows Cinderella or Alice in Wonderland, using the closest points as support vectors.

What is logistic regression?

-Classification model, not regression (despite what the name implies)
-Parametric model
-Widely used
-Linear model that can be extended to multiclass classification
-Uses the logistic function (think big S)
-Overcomes the non-convergence shortfall of Perceptron & Adaline
-Uses the odds p/(1-p), where p is the probability of a certain event
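
A short sketch of how the logistic function and the odds p/(1-p) fit together. The function names are chosen for this example. The log of the odds (the logit) is the inverse of the logistic function, which is why the model is linear in the log-odds.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds; the inverse of the logistic function."""
    return math.log(odds(p))

z = 1.5                      # an arbitrary net input, assumed for illustration
p = sigmoid(z)               # probability of the positive class
print("probability:", p)
print("odds:       ", odds(p))
print("log-odds:   ", logit(p))   # recovers z up to rounding
```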

Examples of unsupervised learning

Clustering, anomaly detection, Neural networks

Limitations of Parametric models

Constrained - limited to the functional form that is chosen
Limited complexity - better suited to simpler problems
Poor fit - in practice the methods are unlikely to match the underlying mapping function

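What does the C parameter in scikit-learn do?

In an SVM it controls the trade-off between a wide margin and margin violations, which is how outliers are handled. A small C gives a softer margin that tolerates more violations; a large C approaches a hard margin.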

What is Supervised learning

Input data includes labels (the desired solution), which are used to "train" the system.
•Prediction of future cases: Use the rule to predict the output for future inputs
•Knowledge extraction: The rule is easy to understand
•Compression: The rule is simpler than the data it explains
•Outlier detection: Exceptions that are not covered by the rule, e.g., fraud

What is unsupervised learning

Input data is not labeled.
•Learning "what normally happens"
•No output
•Clustering: Grouping similar instances
•Example applications
-Customer segmentation in CRM
-Image compression: Color quantization
-Bioinformatics: Learning motifs

What is machine learning?

It is the idea that a computer learns from experience. With each run of the task, the machine can adjust its parameters so that the model better predicts what the next case will be.

How is PCA done?

1. Mean center the data
2. Compute the covariance matrix Σ
3. Calculate the eigenvalues and eigenvectors of Σ
4. The eigenvector with the largest eigenvalue λ1 is the 1st principal component (PC)
5. The eigenvector with the kth largest eigenvalue λk is the kth PC
6. λk / Σi λi = proportion of variance captured by the kth PC
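
The steps above can be sketched for 2-D data using only the standard library. This is an illustrative sketch, not the lecture's implementation: the function name `pca_2d`, the sample points, and the use of the closed-form eigen-decomposition of a 2×2 symmetric matrix are all assumptions made for the example.

```python
import math

def pca_2d(points):
    """PCA on 2-D points via the closed-form eigen-decomposition of a 2x2 matrix."""
    n = len(points)
    # Step 1: mean center the data.
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Step 2: sample covariance matrix [[a, b], [b, d]].
    a = sum(cx * cx for cx, _ in centered) / (n - 1)
    d = sum(cy * cy for _, cy in centered) / (n - 1)
    b = sum(cx * cy for cx, cy in centered) / (n - 1)

    # Step 3: eigenvalues of a symmetric 2x2 matrix, largest first.
    mid = (a + d) / 2.0
    spread = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    eigenvalues = [mid + spread, mid - spread]

    # Steps 4-5: unit eigenvectors, ordered like the eigenvalues.
    if b != 0.0:
        components = []
        for lam in eigenvalues:
            norm = math.hypot(b, lam - a)
            components.append((b / norm, (lam - a) / norm))
    elif a >= d:
        components = [(1.0, 0.0), (0.0, 1.0)]
    else:
        components = [(0.0, 1.0), (1.0, 0.0)]

    # Step 6: proportion of variance captured by each PC.
    total = sum(eigenvalues)
    ratios = [lam / total for lam in eigenvalues]
    return centered, eigenvalues, components, ratios

# Made-up, positively correlated sample points.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7)]
centered, eigenvalues, components, ratios = pca_2d(points)

pc1 = components[0]
# Project each centered point onto the first principal component.
scores = [cx * pc1[0] + cy * pc1[1] for cx, cy in centered]
print("explained variance ratios:", ratios)
print("first PC direction:", pc1)
```

For more than two features, the same steps are usually done with a numerical linear algebra library rather than a closed form.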

Support Vector Machines - General

-Objective is to find a hyperplane in an N-dimensional space (N is the number of features) that distinctly classifies the data points
-Linear and non-linear classification
-Can handle outlier detection

In K-nearest neighbor how could one attribute be given more importance than other attributes?

One direct way is to weight (or rescale) each attribute inside the distance measure, so that a heavily weighted attribute contributes more to the distance. A related idea is to assign weights to the contributions of the neighbors, so that the nearest neighbors contribute more to the vote or average than the distant ones.
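
As a hedged illustration, one way to make a single attribute count more is to weight it inside the distance function. The function name, weights, and points below are made up for the example.

```python
import math

def weighted_distance(a, b, attribute_weights):
    """Euclidean distance in which each attribute has its own weight."""
    return math.sqrt(
        sum(w * (ai - bi) ** 2 for ai, bi, w in zip(a, b, attribute_weights))
    )

query = (0.0, 0.0)
p1 = (1.0, 0.0)   # differs from the query only in attribute 0
p2 = (0.0, 1.5)   # differs from the query only in attribute 1

equal = (1.0, 1.0)
favor_attr0 = (4.0, 1.0)   # attribute 0 now matters more

# With equal weights p1 is the nearer neighbor.
print(weighted_distance(query, p1, equal), weighted_distance(query, p2, equal))
# With attribute 0 weighted up, p2 becomes the nearer neighbor.
print(weighted_distance(query, p1, favor_attr0), weighted_distance(query, p2, favor_attr0))
```

Raising the weight on attribute 0 makes differences in that attribute more costly, which changes which training example counts as nearest.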

DT ▪Which attribute should be tested at the root? ▪Gain(S, Outlook) = 0.246 ▪Gain(S, Humidity) = 0.151 ▪Gain(S, Wind) = 0.048 ▪Gain(S, Temperature) = 0.029

Outlook, because it has the highest information gain and so provides the best prediction for the target
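
The gains for Outlook and Humidity can be checked with a short sketch. The class counts below are an assumption: they are taken from the classic 14-example PlayTennis data set (9 Yes, 5 No), which these attribute names and gain values appear to come from. The function names are illustrative.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Expected reduction in entropy after splitting the parent into children."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder

# (yes, no) counts, assumed from the classic PlayTennis data set.
S = (9, 5)
outlook = [(2, 3), (4, 0), (3, 2)]    # Sunny, Overcast, Rain
humidity = [(3, 4), (6, 1)]           # High, Normal

gain_outlook = information_gain(S, outlook)
gain_humidity = information_gain(S, humidity)
print("Gain(S, Outlook)  =", gain_outlook)
print("Gain(S, Humidity) =", gain_humidity)
```

Under these assumed counts the computed gains agree with the values on the card up to rounding, and Outlook comes out ahead of Humidity.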

PCA vs LDA

PCA: Component axes that maximize the variance
LDA: Component axes that maximize the separation between classes

Are parametric models supervised or unsupervised learning? defend your answer

The parametric models covered here (perceptron, Adaline, logistic regression) are supervised learning models. They are trained on examples whose outcomes are already known, and the weights are adjusted based on the inputs and their known outcomes.

When to use decision trees?

Problem characteristics:
▪Instances can be described by attribute-value pairs
▪Target function is discrete valued
▪Disjunctive hypothesis may be required
▪Possibly noisy training data samples
▪Robust to errors in training data
▪Missing attribute values

Common measures to compare models

RMSE, MAE, R^2
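
A minimal sketch of the three measures using only the standard library. The sample values are made up for illustration.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in the data's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative observed and predicted values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print("MAE :", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("R^2 :", r_squared(y_true, y_pred))
```

Lower is better for MAE and RMSE, while R^2 is better the closer it gets to 1.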

Benefits of dimensionality reduction

Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
Improves Accuracy: Less misleading data means modeling accuracy improves.
Reduces Training Time: Fewer features reduce algorithm complexity, so algorithms train faster.

Options for dimensionality reduction

Regularization, Feature selection, feature extraction

What does R^2 measure? (formula)

Represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model. ... It may also be known as the coefficient of determination. Formula: R^2 = 1 - SS_res / SS_tot, where SS_res = Σi (yi - ŷi)^2 and SS_tot = Σi (yi - ȳ)^2.

SVM Regression

Reverses the objective of classification. Instead of trying to fit the largest possible street between two classes while limiting margin violations, it tries to fit as many instances as possible ON the street while limiting margin violations (instances off the street). The width of the street is a hyperparameter (epsilon).
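
One common way to express "on the street costs nothing, off the street is penalized" is the epsilon-insensitive loss. The sketch below is illustrative and treats epsilon as the distance from the center of the street to its edge, which is an assumption about how the width is measured.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the street; grows linearly once the error exceeds epsilon."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

epsilon = 0.5   # illustrative hyperparameter value

on_street = epsilon_insensitive_loss(1.0, 1.3, epsilon)    # error 0.3, inside
off_street = epsilon_insensitive_loss(1.0, 2.0, epsilon)   # error 1.0, outside

print("on street: ", on_street)
print("off street:", off_street)
```

A larger epsilon widens the street, so more instances fall inside it and contribute no loss.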

In dimensionality reduction, how can significant improvements first be achieved?

Significant improvements can be achieved by first mapping (projecting) the data into a lower-dimensional space.

Benefits of parametric models

Simple - easier to understand
Speed - fast learning from data
Less data - do not require much data, and can work well even if the fit to the data is not perfect

What is semi supervised learning

Some input data is labeled and some is not. By contrast, reinforcement learning involves:
•Learning a policy: A sequence of outputs
•No supervised output but delayed reward
•Credit assignment problem
•Game playing
•Robot in a maze
•Multiple agents, partial observability, ...

Strengths and weaknesses of LDA

Strengths
-Supervised, which can (but doesn't always) improve the predictive performance of the extracted features
-Offers variations (e.g. quadratic LDA) to tackle specific roadblocks
Weaknesses (same as PCA)
-New features are not easily interpretable
-The number of components to keep must be set or tuned manually

What are the categories of machine learning?

Supervised learning, unsupervised learning, Semisupervised learning, Reinforcement learning

What is LDA

-Supervised method that can only be used with labeled data
-Dependent on scale, so normalize the data set first
-Can be used as a classification algorithm itself

What is reinforcement learning?

System interacts and receives rewards or penalties based on decisions to determine a strategy (called a policy)

How does the perceptron work?

Takes a set of inputs and a set of weights, computes their weighted sum (the net input), and makes a decision by passing that sum through the activation (threshold) function
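
A minimal sketch of a perceptron learning the logical AND function. The 0/1 labels, the learning rate, and the function names are assumptions made for the example.

```python
def predict(weights, bias, x):
    """Weighted sum of the inputs followed by a step activation."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z >= 0.0 else 0

def train_perceptron(samples, labels, eta=1.0, epochs=10):
    """Sample-by-sample perceptron learning rule, starting from zero weights."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)   # -1, 0, or 1
            weights = [w + eta * error * xi for w, xi in zip(weights, x)]
            bias += eta * error
    return weights, bias

# Logical AND: linearly separable, so the perceptron can learn it.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
print("weights:", weights, "bias:", bias)
print("predictions:", predictions)
```

The weights only change when a sample is misclassified, so once every sample is classified correctly the updates stop.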

What are the three parametric models discussed?

The 3 models discussed in the videos are Perceptrons, Adaline, and Logistic Regression.

SVM concepts

-Widest possible "street" between classes
-Known as the largest margin classifier
-Support vectors are the points closest to the street (the circled points)
-Off-street examples won't change the decision
-Goal is the largest possible margin at whatever angle achieves it; restricting the boundary to a standard vertical or horizontal orientation can give a smaller margin

ML Checklist: Get Data (Explore) - know general ideas

✔Create a copy and use the copy to explore ✔Create a Jupyter notebook (if using) ✔Study attributes and characteristics •Name, type, % missing, noisiness & type, type of distribution ✔For supervised learning, identify target attributes ✔Visualize the data ✔Study correlations between attributes ✔Consider how to solve manually ✔Identify potential transformations ✔Determine if additional data will be needed (and repeat step if so) ✔Document! Document! Document!

ML Checklist: Get Data (Fine-Tune & Test) - know general ideas

✔Fine tune any hyperparameters ✔Try combining models (ensemble learning) ✔Measure the model on the test set of data

ML Checklist: Get Data (Prepare Data) - know general ideas

✔Fix or remove outliers if desired ✔Fill in missing values ✔Drop any attributes that aren't useful ✔If needed, make continuous features discrete ✔Decompose complex features ✔Add any desired mathematical transformations ✔Aggregate features if possible ✔Standardize or normalize features ✔Document! Document! Document!

ML Checklist: Get Data (GET) - know general ideas

✔List what is needed ✔Find the data ✔Check space limitations ✔Verify legal concerns ✔Get access authorization if necessary ✔Create a workspace ✔Get the data ✔Understand data ✔Convert to make easy to manipulate ✔Ensure security/privacy ✔Verify time/size units ✔Take sample for testing and put aside ✔Document! Document! Document!

ML Checklist: Get Data (Train the system) - know general ideas

✔Train on several algorithms in a prototype approach ✔Measure and compare performance ✔Analyze most significant variables for each algorithm ✔Analyze the types of errors ✔Refine feature selection ✔Repeat above as needed ✔Identify the top 3 to 5 promising models

ML Checklist: Getting Started (general ideas)

✔What is the objective in business terms? ✔Understand how your solution will be used ✔Are there current solutions/workarounds? ✔What categorization? (supervised/unsupervised, etc) ✔How will performance be measured ✔Does the performance measure match the business objective? ✔What's the minimum acceptable performance ✔Any reuse possible? ✔Is human expertise available? ✔What would be the manual solution? ✔Are there any assumptions? (Verify if possible) ✔Document! Document! Document!

What is PCA

•PCA is a linear transformation, so if the data is highly non-linear then the transformed data will be less informative -Non linear dimensionality reduction techniques needed -PCA is good at removing redundant correlated features •Caution: Not a "cure all" •Can lose important info in some cases -How would you know if it is effective? -Just compare accuracies of original vs transformed data set •Unsupervised learning - is there perhaps value in the label?

Methods for feature selection

•Search -Exponential -Backward, Forward, Genetic, others •Variance Threshold •Correlation Threshold •Scikit-Learn includes some tools

What is PCA

•Seek new set of bases that correspond to the highest variance in the data •Transform n-dimensional data to a new n-dimensional basis •The new dimension with the most variance is the first principal component •The next is the second principal component, etc. •Note z1 combines/fuses significant information from both x1 and x2 •Drop dimensions for which there is little variance

When to stop building a decision tree?

•The process is repeated for each successor node until all the examples are classified correctly or there are no attributes left

How to use feature importance

•Use the feature importance property of the model.
•Provides a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable.
•Feature importance is built into tree-based classifiers (in scikit-learn, the feature_importances_ attribute).

What are examples of non parametric models?

•k-Nearest Neighbors •Decision Trees •Support Vector Machines

DT which attribute is best classifier?

▪A statistical property called information gain, measures how well a given attribute separates the training examples ▪Information gain uses the notion of entropy, commonly used in information theory ▪Information gain = expected reduction of entropy

Information gain uses the notion of...

▪entropy, commonly used in information theory ▪Information gain = expected reduction of entropy

What is information gain in a Decision Tree?

▪measures how well a given attribute separates the training examples

SVM (classify Linear and Kernel as parametric or non parametric)

○Linear is parametric ○Kernel is non-parametric.

Non Parametric (parameter amount, data set dependence)

●No fixed number of parameters; the number of parameters depends on the data set

What are the limitations of non parametric models

•More data: Require a lot more training data to estimate the mapping function. •Slower: A lot slower to train as they often have far more parameters to train. •Overfitting: More of a risk to overfit the training data and it is harder to explain why specific predictions are made.

In the end to end process what is a common step to all of our items?

Documentation

Decision tree representation

Each node represents a test on an attribute of the instance to be classified, and each outgoing arc a possible outcome, leading to a further test. The leaves correspond to classification actions (a binary classification in this case). The instance <Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong> is classified as No.
●The tree first looks at Outlook, which determines the first choice, and then looks at different factors from there.
•Decision trees represent a disjunction of conjunctions of constraints on the values of attributes: (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

Entropy

Entropy measures the amount of information in a random variable: Entropy(S) = -Σi pi log2 pi, where pi is the proportion of examples in class i. Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute. The higher the information gain, the more effective the attribute is in classifying the training data.

What is PCA graphically?

The first PC is the projection direction that maximizes the variance of the projected data. The second PC is the projection direction that is orthogonal to the first PC and maximizes the variance of the projected data.

Dimensionality theory vs reality

From a theoretical point of view, increasing the number of features should lead to better performance. In practice, the inclusion of more features leads to worse performance (i.e., curse of dimensionality). The number of training examples required increases exponentially with dimensionality.

Go beyond k = 1 with KNN

Given a query instance xq,
•first locate the k nearest training examples
•if the target function is discrete valued, take a vote among the k nearest neighbors
•if the target function is real valued, take the mean of the f values of the k nearest neighbors
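
The procedure above can be sketched as follows, assuming Euclidean distance and made-up training data (the function names are illustrative).

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training examples (x, target) closest to the query."""
    return sorted(train, key=lambda example: math.dist(example[0], query))[:k]

def knn_classify(train, query, k):
    """Discrete-valued target: majority vote among the k nearest neighbors."""
    votes = Counter(target for _, target in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k):
    """Real-valued target: mean of the k nearest neighbors' values."""
    neighbors = k_nearest(train, query, k)
    return sum(target for _, target in neighbors) / len(neighbors)

# Made-up labeled points for classification.
labeled = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((6, 6), "B"), ((7, 6), "B")]
predicted_class = knn_classify(labeled, (2, 2), k=3)

# Made-up one-dimensional data for regression.
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
predicted_value = knn_regress(valued, (2.0,), k=3)

print(predicted_class, predicted_value)
```

Because all the work happens at query time, there is no separate training step, which is why k-NN is called a lazy learner.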

Given a training set of examples what approach is used on DT's?

Greedy top down

Basic algorithm for learning DT's

ID3

What are the advantages of non-parametric models?

The advantages are that they are flexible and capable of fitting a large number of functional forms, powerful in that they make no assumptions about the underlying function, and they can result in higher performance models for prediction.

What are the advantages of parametric models?

The advantages of parametric models are three-fold: they are easier to understand, and their results are easier to interpret for the business; they train quickly because they learn fast from data; and they don't require as much data as other models, and can work well even when the fit to the data is not perfect.

What are three disadvantages of parametric models?

There are 3 disadvantages of parametric models. The first is that we are constrained to the functional form we have chosen. The second is that, while these models may be easier to explain than some machine learning models, they have limited complexity for the same reason. The third is that these models are unlikely to fit the data as well as others might.

How are artificial intelligence, machine learning and deep learning related and how are they different?

Machine learning is a subset of artificial intelligence, and deep learning is a subset of machine learning; all three are forms of computer-aided decision making. Moving from artificial intelligence to machine learning to deep learning, the computer makes more of the decisions itself and runs more testing, and the models can become more precise.

Why is data visualization important?

It is a way to show trends to the customers of our analysis. It can also be helpful at the beginning to quickly spot outliers in a dataset.

One of the "Getting started" steps is define performance measures, how are these used?

To measure the error rate of each model

Cost function / optimize by minimize

Too many features/dimensions can lead to decrease in performance

When a model doesn't perform as well as expected which of the following might be a reasonable solution to consider?

Tune hyperparameters, consider ensemble learning

What is the difference between LDA and PCA

Unlike PCA, LDA doesn't maximize explained variance; instead it maximizes the separability between classes

PCA

-Unsupervised linear transformation
-Identifies correlation between features
-Assumes that the high-dimensional data actually resides in an inherent low-dimensional space
-Additional dimensions are just random noise
-Goal is to recover these inherent dimensions and discard the noise dimensions

What is a K-nearest neighbor model?

•Multiclass classifier (also called multinomial)
•Instance-based learning
•Lazy learning
•Needs: training data, a distance measure, and a value of k
•1-nearest neighbor: given a query instance x, first locate the nearest training example y, then set f(x) := f(y)

What does MAE measure? (formula)

A measure of errors between paired observations expressing the same phenomenon. ... The mean absolute error uses the same scale as the data being measured. Formula: MAE = (1/n) Σi |yi - ŷi|.

Parametric (parameter amount, data set dependence)

fixed number of parameters, independent of data set

What does RMSE measure? (formula)

A frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. Formula: RMSE = sqrt((1/n) Σi (yi - ŷi)^2).

What are the three non-parametric models discussed?

k-Nearest Neighbors, Decision Trees, Support Vector Machines

Examples of supervised learning

Linear regression, logistic regression, neural networks, decision trees, Support Vector Machines (SVM), random forests, naive Bayes, and k-nearest neighbors.

Examples of parametric models

perceptron, adaline, logistic regression, linear regression, Naive bayes, simple neural nets, Linear discriminant analysis

What is a critical concept in SVM?

scaling

Fundamental question for DT's

•"which attribute should be tested next? Which question gives us more information?"

What is the Covariance matrix?

•Considering the sign (rather than the exact value) of the covariance:
-A positive value means that as one feature increases or decreases the other does also (positively correlated)
-A negative value means that as one feature increases the other decreases and vice versa (negatively correlated)
-A value close to zero means the features are uncorrelated (no linear relationship)
•If highly covariant, are both features necessary?
•The covariance matrix is an n × n matrix containing the covariance values for all pairs of features in a data set with n features (dimensions)
•The diagonal contains the covariance of a feature with itself, which is the variance (the square of the standard deviation)
•The matrix is symmetric
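
A small sketch that builds the covariance matrix by hand and shows the properties listed above. The data set is made up: the second feature rises with the first and the third falls as the first rises.

```python
def covariance_matrix(rows):
    """Sample covariance matrix for a data set given as a list of rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
            for j in range(dims)
        ]
        for i in range(dims)
    ]

# Made-up data: feature 1 = 2 * feature 0, feature 2 = 11 - 2 * feature 0.
data = [(1.0, 2.0, 9.0), (2.0, 4.0, 7.0), (3.0, 6.0, 5.0), (4.0, 8.0, 3.0)]
cov = covariance_matrix(data)

for row in cov:
    print(row)
# cov[0][1] is positive, cov[0][2] is negative, cov[i][i] is the variance of
# feature i, and cov[i][j] equals cov[j][i].
```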

What is regularization?

•Cost function in models penalizes more complex models

What is feature extraction?

•Create new features based on original features

Feature selection searching

•Exhaustive Search - Exhausting
•Forward Search - O(n^2 · learning/testing time) - Greedy
1. Score each feature by itself and add the best feature to the initially empty Feature Set (FS)
2. Try each subset consisting of the current FS plus one remaining feature and add the best feature to FS
3. Continue until you stop getting significant improvement
•Backward Search - O(n^2 · learning/testing time) - Greedy
1. Score the initial complete set FS
2. Try each subset consisting of the current FS minus one feature in FS and drop the feature whose removal causes the least decrease in accuracy
3. Continue until you begin to get significant decreases in accuracy
•Branch and Bound and other heuristic approaches are available
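
The forward search can be sketched as a greedy loop. Everything concrete here is hypothetical: the feature names, the `usefulness` table, and the toy `score` function stand in for actually training and testing a model on each candidate subset.

```python
def forward_search(features, score, min_improvement=1e-6):
    """Greedy forward selection: keep adding the single best remaining feature."""
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Score the current set plus each remaining feature, one at a time.
        candidates = [(score(selected + [f]), f) for f in remaining]
        new_score, best_feature = max(candidates)
        if new_score - best_score < min_improvement:
            break   # no significant improvement, so stop
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = new_score
    return selected, best_score

# Hypothetical stand-in for "train and test a model on this subset".
usefulness = {"age": 0.30, "income": 0.25, "zip": 0.0, "id": 0.0}

def score(subset):
    # Useful features raise the score; every feature carries a small cost.
    return sum(usefulness[f] for f in subset) - 0.01 * len(subset)

selected, final_score = forward_search(list(usefulness), score)
print(selected, final_score)
```

Backward search follows the same pattern in reverse: start from the full set and repeatedly drop the feature whose removal hurts the score least.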

What is feature selection?

•Filtering irrelevant or redundant features from the dataset •Done automatically as part of some algorithms •Use a subset of the original features

Benefits of non parametric models

•Flexibility: Capable of fitting a large number of functional forms. •Power: No assumptions (or weak assumptions) about the underlying function. •Performance: Can result in higher performance models for prediction.

What is dimensionality?

•In machine learning, dimensionality refers to the number of features (i.e. input variables) in your dataset. •When number of features is very large relative to the number of observations in your dataset, certain algorithms struggle to train effective models -Called the "Curse of Dimensionality" -Especially relevant for clustering algorithms that rely on distance calculations.

How to use logistic regression to train

•Output of the logistic (sigmoid) function is a probability
•Use a threshold to convert the probability into a class label
•Gives us a quasi confidence measure
•Use the log likelihood as the cost function (maximize it, or equivalently minimize its negative)
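
A minimal sketch of training with the negative log-likelihood as the cost, using plain gradient descent on one feature. The data, learning rate, epoch count, and function names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, xs, ys):
    """Average negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def train(xs, ys, eta=0.1, epochs=1000):
    """Batch gradient descent on the average negative log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= eta * sum(e * x for e, x in zip(errors, xs)) / n
        b -= eta * sum(errors) / n
    return w, b

def predict(w, b, x, threshold=0.5):
    """Convert the predicted probability into a class label."""
    return 1 if sigmoid(w * x + b) >= threshold else 0

# Made-up one-dimensional data: negatives on the left, positives on the right.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = train(xs, ys)
print("loss before:", log_loss(0.0, 0.0, xs, ys))
print("loss after: ", log_loss(w, b, xs, ys))
print("predictions:", [predict(w, b, x) for x in xs])
```

The probability itself acts as the quasi confidence measure: values near 0 or 1 are confident predictions, values near the threshold are not.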

