Machine Learning - Andrew Ng

What is overfitting?

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Gradient Descent algorithm

- Gradient descent is an optimization algorithm for finding a local minimum of a function.
- At each step, it moves in proportion to the negative of the gradient of the function at the current point.
- All parameters are updated simultaneously at each step.
- It is susceptible to getting stuck in a local optimum, depending on the initialization.
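
The update rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function `gradient_descent`, the example function f(x, y) = (x - 3)^2 + (y + 1)^2, and the values of `alpha` and `steps` are all assumptions chosen for the demo.

```python
def gradient_descent(grad, start, alpha=0.1, steps=200):
    """Repeatedly step against the gradient, updating all parameters at once."""
    params = list(start)
    for _ in range(steps):
        g = grad(params)  # gradient at the current point, computed before any update
        # Simultaneous update: every parameter uses the same, old gradient values.
        params = [p - alpha * gi for p, gi in zip(params, g)]
    return params


# Hypothetical example: f(x, y) = (x - 3)^2 + (y + 1)^2, whose minimum is at (3, -1).
def grad_f(p):
    x, y = p
    return [2 * (x - 3), 2 * (y + 1)]


minimum = gradient_descent(grad_f, start=[0.0, 0.0])
```

Computing the whole gradient first and only then overwriting `params` is what makes the update simultaneous.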

What is a simple linear regression?

Simple linear regression is the least squares estimator of a linear regression model with a single explanatory variable. In other words, it fits a straight line through the set of n points so that the sum of squared residuals of the model (that is, the vertical distances between the data points and the fitted line) is as small as possible.
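
For a single explanatory variable, the least squares line has a well-known closed form: the slope is the sum of products of deviations divided by the sum of squared x deviations. A small sketch follows; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the fitted line passes through the means
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```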

What does theta typically represent in stat/ML?

Quite often θ stands for the set of parameters of a distribution.

Regression vs Classification?

- Regression: the output variable takes continuous values. Example: the price of a house given its size.
- Classification: the output variable takes class labels (discrete values). Example: whether a breast tumor is malignant or benign.

What is the downside of using an alpha (learning rate) that is too big?

Gradient descent can overshoot the minimum and it may fail to converge or even diverge.

Synonym for output variable

Targets

Hypothesis model

h_theta(x)

Cost function vs Gradient Descent

- A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set.
- Gradient descent is a method for finding the minimum of a function of multiple variables, so you can use it to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. In gradient descent, you follow the negative of the gradient to the point where the cost is a minimum.
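
To make the split concrete, the sketch below keeps the two roles in separate functions: `cost` is the thing being minimized and `gradient` drives the minimization. The halved mean squared error, the data, and the settings `alpha = 0.1` with 1000 iterations are assumptions for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Half the mean squared error of the line theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient(theta0, theta1, xs, ys):
    """Partial derivatives of the cost with respect to theta0 and theta1."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d0 = sum(errors) / m
    d1 = sum(e * x for e, x in zip(errors, xs)) / m
    return d0, d1


# Made-up data lying on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

theta0, theta1, alpha = 0.0, 0.0, 0.1
cost_before = cost(theta0, theta1, xs, ys)
for _ in range(1000):
    d0, d1 = gradient(theta0, theta1, xs, ys)
    # Follow the negative of the gradient, updating both parameters together.
    theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
cost_after = cost(theta0, theta1, xs, ys)
```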

How to make sure gradient descent is working properly?

- Create an automatic convergence test: declare convergence when J(Theta) decreases by less than some small threshold in one iteration.
- Plot J(Theta) on the y axis against the number of iterations on the x axis.
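
Both checks can be sketched together: record J at every iteration and stop once the decrease falls below a threshold. The helper `run_until_converged`, the threshold `tol = 1e-9`, and the toy cost (theta - 4)^2 are assumptions for this example.

```python
def run_until_converged(step, cost, params, tol=1e-9, max_iters=10_000):
    """Apply `step` repeatedly until the cost decreases by less than `tol`."""
    history = [cost(params)]
    for _ in range(max_iters):
        params = step(params)
        history.append(cost(params))
        # Stop when the last iteration barely lowered the cost. A rise in cost
        # also ends the loop here, which is worth inspecting by hand.
        if history[-2] - history[-1] < tol:
            break
    return params, history


# Toy problem (an assumption for the demo): J(theta) = (theta - 4)^2.
def toy_cost(theta):
    return (theta - 4) ** 2


def toy_step(theta, alpha=0.1):
    return theta - alpha * 2 * (theta - 4)


theta, history = run_until_converged(toy_step, toy_cost, params=0.0)
```

The `history` list is what you would plot against the iteration number to inspect the curve by eye.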

Standardization vs normalization

- Normalization rescales the values to a range of [0, 1]: Xchanged = (X - Xmin) / (Xmax - Xmin). This might be useful in cases where all parameters need to have the same positive scale, but it is sensitive to outliers: a single extreme value squeezes the remaining values into a narrow part of the range.
- Standardization rescales data to have a mean of 0 and a standard deviation of 1 (unit variance): Xchanged = (X - mean) / sd.
For most applications, standardization is recommended. In the business world, "normalization" typically means that the range of values is rescaled to run from 0.0 to 1.0, while "standardization" typically means that each value is expressed as the number of standard deviations it lies from the mean. However, not everyone would agree with that. It's best to explain your definitions before you use them. In any case, your transformation needs to provide something useful.
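
The two formulas translate directly into code. In this sketch the function names and the data are made up. It also assumes the population standard deviation (`statistics.pstdev`); some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 using (x - mean) / sd."""
    mu, sd = mean(values), pstdev(values)  # population sd: an assumption
    return [(v - mu) / sd for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
normalized = min_max_normalize(data)
standardized = standardize(data)
```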

Why do we square instead of using the absolute value when calculating variance and standard deviation?

- Often we want to minimize our error. When the error is a sum of squares, we are minimizing something quadratic, which is easily accomplished by solving linear equations. So using the mean squared error is a matter of convenience rather than a conceptual necessity.

Stochastic

- Randomly determined.
- Having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely.

What is normalization?

- Simple case: adjusting values measured on different scales to a notionally common scale, often prior to averaging.
- Complicated cases: sophisticated adjustments where the intention is to bring the entire probability distributions of adjusted values into alignment.
- In the case of normalization of scores in educational assessment, there may be an intention to align distributions to a normal distribution.
- A different approach to normalization of probability distributions is quantile normalization, where the quantiles of the different measures are brought into alignment.

What does the derivative of a function tell us?

- The derivative of a function of a real variable measures the sensitivity to change of a quantity (a function value or dependent variable) which is determined by another quantity (the independent variable).
- Derivatives are a fundamental tool of calculus. For example, the derivative of the position of a moving object with respect to time is the object's velocity: this measures how quickly the position of the object changes when time is advanced.
- The derivative of a function of a single variable at a chosen input value, when it exists, is the slope of the tangent line to the graph of the function at that point. The tangent line is the best linear approximation of the function near that input value. For this reason, the derivative is often described as the "instantaneous rate of change", the ratio of the instantaneous change in the dependent variable to that of the independent variable.

Classifier

A classifier is a hypothesis or discrete-valued function that is used to assign (categorical) class labels to data points. In the email classification example, this classifier could be a hypothesis for labeling emails as spam or non-spam.

What are contour plots?

A contour plot is a graphical technique for representing a 3D surface by plotting constant z slices, called contours, in a two-dimensional format. That is, given a value for z, lines are drawn connecting the (x, y) coordinates where that z value occurs. The closed curves in a contour plot are called level sets: the function J takes the same value everywhere along a given curve. In ML, the center of the innermost contour is typically the minimum of the cost function.

What is "hypothesis" in machine learning?

A hypothesis is a certain function that we believe (or hope) is similar to the true function, the target function that we want to model. In the context of email spam classification, it would be the rule we came up with that allows us to separate spam from non-spam emails.

Training sample

A training sample is a data point x in an available training set that we use for tackling a predictive modeling task. For example, if we are interested in classifying emails, one email in our dataset would be one training sample.

What does it mean for an algorithm to converge?

An iterative algorithm is said to converge when as the iterations proceed the output gets closer and closer to a specific value. In some circumstances, an algorithm will diverge; its output will undergo larger and larger oscillations, never approaching a useful result.

Why is it unnecessary to change alpha over time to ensure that the gradient descent converges to a local minimum?

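As we approach a local minimum, gradient descent automatically takes smaller steps, because the derivative of the cost function J shrinks toward zero there. Each step is alpha times the derivative, so the steps shrink even when alpha stays fixed.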
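
A quick numerical check of this behaviour, assuming the toy cost J(theta) = theta^2 and a fixed alpha of 0.1 (both chosen only for illustration):

```python
alpha = 0.1   # fixed learning rate, never changed during the run
theta = 5.0   # arbitrary starting point
step_sizes = []
for _ in range(20):
    derivative = 2 * theta      # dJ/dtheta for J(theta) = theta^2
    step = alpha * derivative   # the step shrinks only because the derivative shrinks
    theta -= step
    step_sizes.append(abs(step))
```

Every entry of `step_sizes` is smaller than the one before it, although `alpha` never changes.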

Why use features that are on a similar scale?

When features are on very different scales, the contours of the cost function become extremely elongated (very thin or very fat ellipses), so gradient descent converges slowly. Aim to bring every feature into roughly the range -1 <= x <= 1.

What happens to the cost function J(Theta) when a sufficiently small learning rate alpha is used?

The cost function J(Theta) should decrease with EVERY iteration.

What is cross-validation?

Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. The goal of cross validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.

"Batch" Gradient Descent (BGD or GD)

In batch gradient descent, each step uses all the training examples. This distinguishes it from stochastic gradient descent (SGD) and mini-batch gradient descent (MB-GD). Because the cost gradient is computed over the complete training set, plain GD is sometimes also called batch GD. For very large datasets, batch GD can be quite costly, since it takes only a single step per pass over the training set: the larger the training set, the more slowly the algorithm updates the weights and the longer it may take to converge to the global cost minimum.
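
The difference in update frequency can be seen by counting weight updates during one pass over the data. The sketch below assumes a one-parameter model y = w * x with squared error; the function names, the data, and `alpha = 0.01` are made up for the example.

```python
import random


def batch_epoch(w, data, alpha):
    """One pass over the data gives one update, using the full-set gradient."""
    grad = sum((w * x - y) * x for x, y in data) / len(data)
    return w - alpha * grad, 1


def stochastic_epoch(w, data, alpha, rng):
    """One pass over the data gives one update per training example."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    updates = 0
    for x, y in shuffled:
        w -= alpha * (w * x - y) * x  # gradient of a single example's squared error
        updates += 1
    return w, updates


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # made-up points on y = 3x
rng = random.Random(0)
w_batch, n_batch = batch_epoch(0.0, data, alpha=0.01)
w_sgd, n_sgd = stochastic_epoch(0.0, data, alpha=0.01, rng=rng)
```

After one pass, the batch version has updated `w` once and the stochastic version once per example.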

What is feature scaling?

Feature scaling is a method used to standardize the range of independent variables or features of data. It is also known as data normalization. If the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
- Another reason to apply feature scaling is that gradient descent converges much faster with it than without it.
- Some methods are rescaling, standardization, and scaling to unit length.
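
The distance argument can be checked with a tiny made-up dataset. In this sketch the house data and the helper names are hypothetical, and min-max rescaling stands in for whichever scaling method is used.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def rescale_columns(rows):
    """Min-max rescale each column (feature) to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
            for row in rows]


# Hypothetical houses: (size in square feet, number of bedrooms).
houses = [[2000.0, 1.0], [2010.0, 5.0], [2400.0, 1.0]]

raw_01 = euclidean(houses[0], houses[1])  # differ mostly in bedrooms
raw_02 = euclidean(houses[0], houses[2])  # differ only in size

scaled = rescale_columns(houses)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])
```

On the raw data the size feature dominates, so house 1 looks far closer to house 0 than house 2 does. After rescaling, the two distances are nearly equal because both features contribute.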

Synonym for Input variable

Features

What letter is typically used to depict a cost function?

Function J

Derive SSE

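Given a linear regression model y = mx + b, the difference between the correct value and the predicted value at each point i is diff_i = y_i - (m x_i + b). Squaring these differences and summing them over all n points gives SSE = sum over i of (y_i - (m x_i + b))^2.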
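
As a worked example, the sum of squared differences for a candidate line can be computed as below. The function name `sse`, the data points, and the line parameters are made up for illustration.

```python
def sse(xs, ys, m, b):
    """Sum of squared differences between each y_i and the prediction m * x_i + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 4.0]

# The line y = 1.5x + 1 predicts 1.0, 2.5 and 4.0, so only the middle point is off (by 0.5).
total = sse(xs, ys, m=1.5, b=1.0)
```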

What is the downside of using an alpha (learning rate) that is too small?

Gradient descent can be way too slow.

K-Fold cross validation

In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimation.
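
The partitioning step might be sketched as follows. The helper `k_fold_indices` is hypothetical, not taken from any particular library, and it only produces index splits; fitting and scoring a model on each split is left out.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random partition of the sample
    folds = [indices[i::k] for i in range(k)]  # k subsamples of near-equal size
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n=10, k=5))
```

Each index lands in the validation part of exactly one split. The k validation scores would then be averaged into a single estimate.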

Model

In the machine learning field, the terms hypothesis and model are often used interchangeably. In other sciences, they can have different meanings: the hypothesis would be the "educated guess" by the scientist, and the model would be the manifestation of this guess that can be used to test the hypothesis.

Target Function

In predictive modeling, we are typically interested in modeling a particular process; we want to learn or approximate a particular function that, for example, lets us distinguish spam from non-spam email. The target function f(x) = y is the true function f that we want to model. The target function is the (unknown) function which the learning problem attempts to approximate.

What does increasing cost function J(Theta) tell you about your gradient descent?

It tells you that gradient descent is not working, most likely because the learning rate alpha is too large. Use a smaller alpha. With a learning rate that is too big on a bowl-shaped cost curve, each step overshoots the minimum, and you may move farther and farther away from convergence.
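
A rising cost caused by an oversized learning rate is easy to reproduce on a bowl-shaped curve. The sketch assumes J(theta) = theta^2 and two arbitrary values of alpha; none of these numbers come from the course.

```python
def cost_history(alpha, theta=1.0, steps=10):
    """Costs J(theta) = theta^2 recorded over a few gradient descent steps."""
    costs = [theta ** 2]
    for _ in range(steps):
        theta -= alpha * 2 * theta  # the derivative of theta^2 is 2 * theta
        costs.append(theta ** 2)
    return costs


too_large = cost_history(alpha=1.1)  # each step overshoots to the far side of the bowl
smaller = cost_history(alpha=0.1)    # each step moves closer to the minimum
```

With `alpha=1.1` the recorded cost grows at every iteration. With `alpha=0.1` it shrinks at every iteration.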

Clustering

Method of unsupervised learning - Way of discovering unknown relationships in datasets. Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

SSE

Sum of squared errors

Supervised learning

Supervised learning is a type of machine learning algorithm that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values. From it, the supervised learning algorithm seeks to build a model that can make predictions of the response values for a new dataset. A test dataset is often used to validate the model. Using larger training datasets often yields models with higher predictive power that can generalize well to new datasets.

What happens if you initialize a parameter at a local minimum and attempt to use gradient descent on it?

At a local minimum the derivative is zero, because the tangent there is a flat line. The update is alpha times the derivative, so whatever the value of alpha it is multiplied by zero and the parameter does not change.
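
This can be confirmed numerically. The sketch assumes a hypothetical cost J(theta) = (theta^2 - 1)^2, which has minima at theta = -1 and theta = +1, and starts exactly at one of them.

```python
def derivative(theta):
    # d/dtheta of the hypothetical cost J(theta) = (theta^2 - 1)^2
    return 4 * theta * (theta ** 2 - 1)


def descend(theta, alpha, steps=100):
    for _ in range(steps):
        theta -= alpha * derivative(theta)  # alpha times zero when started at a minimum
    return theta


stuck_small_alpha = descend(1.0, alpha=0.01)
stuck_large_alpha = descend(1.0, alpha=10.0)
```

Both runs return the starting value 1.0 unchanged, no matter how large `alpha` is.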

Learning algorithm

The goal is to find or approximate the target function, and the learning algorithm is a set of instructions that tries to model the target function using our training dataset. A learning algorithm comes with a hypothesis space, the set of possible hypotheses it can come up with in order to model the unknown target function by formulating the final hypothesis.

3D Surface Plot - How can it be used to plot the cost function?

In a univariate linear regression, Theta 0 and Theta 1 can be plotted on the x and y axes. The z axis then indicates the cost J(Theta 0, Theta 1).
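
The values behind such a plot are just the cost evaluated on a grid of parameter pairs. The sketch below builds that grid without drawing it. The halved mean squared error, the data, and the grid spacing are assumptions for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Half the mean squared error of the line theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Made-up data lying on y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

theta0_axis = [i * 0.5 for i in range(-4, 9)]  # x axis: -2.0 to 4.0
theta1_axis = [i * 0.5 for i in range(-4, 9)]  # y axis: -2.0 to 4.0

# surface[i][j] is the height (z axis) above the point (theta0_axis[i], theta1_axis[j]).
surface = [[cost(t0, t1, xs, ys) for t1 in theta1_axis] for t0 in theta0_axis]

# The lowest grid point, as a (cost, theta0, theta1) tuple.
lowest = min(
    (surface[i][j], theta0_axis[i], theta1_axis[j])
    for i in range(len(theta0_axis))
    for j in range(len(theta1_axis))
)
```

A plotting library could render `surface` as a 3D surface or as contours. The lowest grid point here sits at the parameters that generated the data.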

What does this hypothesis represent? h_theta(x) = theta_0 + theta_1 x

Univariate linear regression model

Unsupervised learning

Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning. Unsupervised learning is closely related to the problem of density estimation in statistics. However unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data.

What to do when the cost function J(Theta) is moving up and down in waves?

Use a smaller learning rate, alpha.

Advantage of k-fold CV over repeated random sub-sampling

All observations are used for both training and validation, and each observation is used for validation exactly once.

How to avoid overfitting?

Cross-validation, regularization, early stopping, pruning.

Stratified k-fold cross-validation

The folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.
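
For class labels, one simple way to get such folds is to deal the indices of each class out round-robin. This is a minimal sketch: the helper `stratified_folds` is hypothetical, the labels are made up, and shuffling within each class is omitted for brevity.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal out round-robin within each class
    return folds


labels = ["benign"] * 8 + ["malignant"] * 4  # made-up labels with a 2:1 class ratio
folds = stratified_folds(labels, k=4)
```

Each of the four folds ends up with two benign and one malignant example, matching the overall 2:1 ratio.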
