Machine Learning - Andrew Ng

x^i_j notation in ML means?

The value of the jth feature (input variable) in the ith training example of the training set: i indexes the training example and j indexes the feature.

multivariate linear regression

ask: why is the notation shorter and what does that convenience notation indicate?

SSE formula?

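The sum, over all observations, of the squared difference between the observed value and the predicted value. When the prediction is simply the mean, this is (observation - mean) squared, summed over every observation.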

Cocktail party effect/problem

The cocktail party effect is the phenomenon of being able to focus one's auditory attention on a particular stimulus while filtering out a range of other stimuli, much the same way that a partygoer can focus on a single conversation in a noisy room. Example of source separation.

For a sufficiently small alpha...

J(Theta) should decrease EVERY iteration

Standardization vs normalization

Normalization rescales the values to the range [0,1]. This might be useful in some cases where all parameters need to have the same positive scale, but it is sensitive to outliers, which squeeze the remaining values into a narrow part of the range. Xchanged = (X - Xmin)/(Xmax - Xmin). Standardization rescales data to have a mean of 0 and a standard deviation of 1 (unit variance). Xchanged = (X - mean)/sd. For most applications standardization is recommended. In the business world, "normalization" typically means that the range of values is "normalized" to be from 0.0 to 1.0. "Standardization" typically means that the values are "standardized" to measure how many standard deviations each value is from the mean. However, not everyone would agree with that. It's best to explain your definitions before you use them. In any case, your transformation needs to provide something useful.
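
The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample data are made up for this card, and the standard deviation used here is the population one (`pstdev`), which is an assumption since the card does not say which variant is meant.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, so the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("all values are identical, so the deviation is zero")
    return [(v - mu) / sd for v in values]


# Hypothetical house sizes; the last one is an outlier.
sizes = [50.0, 80.0, 100.0, 120.0, 400.0]
normalized = min_max_normalize(sizes)
standardized = standardize(sizes)
```

With the outlier present, the first four normalized values all land in the lower fifth of [0,1], which is the sensitivity to outliers mentioned above.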

Regression vs Classification?

Regression: the output variable takes continuous values. - Price of house given a size. Classification: the output variable takes class labels, or discrete value output - Breast cancer, malignant or benign? Almost like quantitative vs categorical

What does it mean for an algorithm to converge?

An iterative algorithm is said to converge when, as the iterations proceed, the output gets closer and closer to a specific value. In some circumstances, an algorithm will diverge; its output will undergo larger and larger oscillations, never approaching a useful result. The phrase "converge to a global optimum" refers to the fact that an algorithm may converge, but not to the "optimal" value (e.g. a hill-climbing algorithm which, depending on initial conditions, may converge to a local maximum, never reaching the global maximum).

Why is it unnecessary to change alpha over time to ensure that the gradient descent converges to a local minimum?

As we approach a local minimum, gradient descent automatically takes smaller steps, because the derivative (the steepness of the cost function J) shrinks toward zero. With a fixed, sufficiently small alpha there is no need to decrease alpha over time or to worry about divergence.

What is cross-validation?

Cross-validation, sometimes called rotation estimation,[1][2][3] is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset).[4] The goal of cross validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.

"Batch" Gradient Descent (BGD or GD)

Each step of gradient descent uses all the training examples. Batch GD is different from stochastic gradient descent (SGD) and mini-batch gradient descent (MB-GD). In GD optimization, we compute the cost gradient based on the complete training set; hence, we sometimes also call it batch GD. In case of very large datasets, using GD can be quite costly since we are only taking a single step for one pass over the training set; thus, the larger the training set, the slower our algorithm updates the weights and the longer it may take until it converges to the global cost minimum (note that the SSE cost function is convex).
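
A minimal sketch of batch gradient descent for a univariate linear model h(x) = theta0 + theta1 * x, assuming the usual halved mean squared error cost. The function name, the learning rate, the iteration count and the toy data are all choices made for this illustration, not something the card specifies.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x using every training example per step."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # One pass over the complete training set yields one gradient.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x, so the fit should approach (1, 2).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = batch_gradient_descent(xs, ys)
```

Stochastic or mini-batch variants would compute `errors` from one example or a small subset per step instead of the full lists.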

What is feature scaling?

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. Because the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance. Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it. Some methods are rescaling, standardization, and scaling to unit length.

Synonym for Input variable?

Features

Learning algorithm?

Learning algorithm: Our goal is to find or approximate the target function, and the learning algorithm is a set of instructions that tries to model the target function using our training dataset. A learning algorithm comes with a hypothesis space, the set of possible hypotheses it can come up with in order to model the unknown target function by formulating the final hypothesis.

Higher derivatives?

Let f be a differentiable function, and let f ′(x) be its derivative. The derivative of f ′(x) (if it has one) is written f ′′(x) and is called the second derivative of f. Similarly, the derivative of a second derivative, if it exists, is written f ′′′(x) and is called the third derivative of f. Continuing this process, one can define, if it exists, the nth derivative as the derivative of the (n-1)th derivative. These repeated derivatives are called higher-order derivatives. The nth derivative is also called the derivative of order n.

Model definition?

Model: In the machine learning field, the terms hypothesis and model are often used interchangeably. In other sciences, they can have different meanings, i.e., the hypothesis would be the "educated guess" by the scientist, and the model would be the manifestation of this guess that can be used to test the hypothesis.

Supervised learning

Supervised learning is a type of machine learning that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values. From it, the supervised learning algorithm seeks to build a model that can make predictions of the response values for a new dataset. A test dataset is often used to validate the model. Using larger training datasets often yields models with higher predictive power that can generalize well to new datasets. It is called supervised learning BECAUSE the data is labeled with the "correct" responses.

Target Function definition?

Target function: In predictive modeling, we are typically interested in modeling a particular process; we want to learn or approximate a particular function that, for example, lets us distinguish spam from non-spam email. The target function f(x) = y is the true function f that we want to model. The target function is the (unknown) function which the learning problem attempts to approximate.

Synonym for output variable?

Targets

What happens if you initialize a parameter at a local minimum and attempt to use gradient descent on it?

The derivative is zero there because the tangent is a flat line, so the update term (alpha times the derivative) is zero regardless of alpha, and the parameter does not change.

Training sample definition?

Training sample: A training sample is a data point x in an available training set that we use for tackling a predictive modeling task. For example, if we are interested in classifying emails, one email in our dataset would be one training sample. Sometimes, people also use the synonymous terms training instance or training example.

What to do when J(Theta) is moving up and down in waves?

Use a smaller Alpha!!

What does theta typically represent in stat/ML?

quite often θ stands for the set of parameters of a distribution.

Gradient Descent for linear regression?

(Review again)

What are contour plots?

A contour plot is a graphical technique for representing a 3D surface by plotting constant z slices, called contours, in a 2-dimensional format. That is, given a value for z, lines are drawn connecting the (x, y) coordinates where that z value occurs. The circles in a contour plot are called level sets: the function J has the same value everywhere along one of them. In ML, the center of the contour plot is typically the minimum of the cost function.

Cost function vs Gradient Descent?

A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).

Classifier?

Classifier: A classifier is a special case of a hypothesis (nowadays, often learned by a machine learning algorithm). A classifier is a hypothesis or discrete-valued function that is used to assign (categorical) class labels to particular data points. In the email classification example, this classifier could be a hypothesis for labeling emails as spam or non-spam. However, a hypothesis need not be synonymous with a classifier. In a different application, our hypothesis could be a function for mapping study time and educational backgrounds of students to their future SAT scores.

Why do we square instead of using the absolute value when calculating variance and standard deviation?

The short answer is "Because of Jensen's inequality." See http://en.wikipedia.org/wiki/Jen... and the rest of the article for context. It says in particular that for a concave function ... What about the more general question, "Why variance?" I don't believe there is any compelling conceptual reason to use variance as a measure of spread. If forced to choose, my guess is that most people would say more robust measures like interquartile range or MAD better capture the concept of "spread" in most cases. But variance (and more generally "sum of squares") has some attractive properties, many of which flow from the Pythagorean theorem one way or another. Here are some of them, without much math: (1) We can decompose sums of squares into meaningful components like "between-group variance" and "within-group variance." (2) To generalize the above point, when a random variable Y is partly explained by another random variable X there is a useful decomposition of the variance of Y into the part explained by X and the unexplained part. (See http://en.wikipedia.org/wiki/Law...). (3) If we think more broadly about mean squared error, this too can be decomposed into the sum of variance and squared bias. It is easy to interpret this total error as the sum of "systematic error" and "noise." (4) Often we want to minimize our error. When the error is a sum of squares, we are minimizing something quadratic. This is easily accomplished by solving linear equations. So yes, variance and mean squared error are conveniences rather than conceptual necessities. But they are convenient conveniences.

What letter is typically used to depict a cost function?

Function J

Derive SSE

Given a linear regression model, the difference between the correct value and the predicted value at each point is diff_i = y_i - (m x_i + b). Squaring each difference and summing over all points gives the SSE.
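
Written out as a formula, and assuming n data points (x_i, y_i) with slope m and intercept b as in the card (the symbols n and e_i are introduced here only for the illustration), the derivation is:

```latex
e_i = y_i - (m x_i + b), \qquad
\mathrm{SSE}(m, b) = \sum_{i=1}^{n} e_i^{2}
                   = \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^{2}
```

Squaring keeps positive and negative differences from cancelling and makes the result a smooth quadratic in m and b.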

What is the downside of using an alpha (learning rate) that is too small?

Gradient descent can be way too slow.

What is the downside of using an alpha (learning rate) that is too big?

Gradient descent can overshoot the minimum and it may fail to converge or even diverge.
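
A small sketch of both failure modes, using the toy cost J(theta) = theta^2 (derivative 2 * theta, minimum at 0). The cost function, the starting point and the three alpha values are assumptions picked purely for illustration.

```python
def run_gradient_descent(alpha, theta=1.0, steps=20):
    """Minimize the toy cost J(theta) = theta**2, whose derivative is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta


slow = run_gradient_descent(alpha=0.001)     # too small: barely moves from 1.0
good = run_gradient_descent(alpha=0.1)       # reasonable: close to the minimum at 0
diverged = run_gradient_descent(alpha=1.1)   # too big: overshoots farther each step
```

Each update multiplies theta by (1 - 2 * alpha), so the iterates shrink toward 0 only when that factor has magnitude below 1; for this toy cost, that means alpha below 1.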

What is "hypothesis" in machine learning?

Hypothesis: A hypothesis is a certain function that we believe (or hope) is similar to the true function, the target function that we want to model. In context of email spam classification, it would be the rule we came up with that allows us to separate spam from non-spam emails.

Gradient Descent algorithm

Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. It is used with simultaneous updates of all the parameters, and it is susceptible to falling into a local optimum depending on initialization.

K-Fold?

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used,[7] but in general k remains an unfixed parameter. When k=n (the number of observations), the k-fold cross-validation is exactly the leave-one-out cross-validation. In stratified k-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels. Ultimately this helps fix the problem that we want to maximize both the training and test sets in cross-validation.
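
The partitioning step can be sketched as follows. This is a plain, non-stratified split written only for illustration; the function name, the fixed seed and the fold-assignment scheme are assumptions, and a real project would more likely use a library routine.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Return k (training, validation) index splits over n observations."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # random partition, reproducible via seed
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        # Every fold except the i-th one is used for training.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_indices(n=10, k=5)
```

Each observation appears in exactly one validation fold, and setting k equal to n gives leave-one-out cross-validation.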

Convex functions?

In mathematics, a real-valued function defined on an interval is called convex (or convex downward or concave upward) if the line segment between any two points on the graph of the function lies above or on the graph, in a Euclidean space (or more generally a vector space) of at least two dimensions. Equivalently, a function is convex if its epigraph (the set of points on or above the graph of the function) is a convex set. Well-known examples of convex functions include the quadratic function x^2 and the exponential function e^x for any real number x. Convex functions play an important role in many areas of mathematics. They are especially important in the study of optimization problems where they are distinguished by a number of convenient properties. For instance, a (strictly) convex function on an open set has no more than one minimum.

What are first order methods in numerical analysis?

In numerical analysis, methods that have at most linear local error are called first order methods. They are frequently based on finite differences, a local linear approximation.

What is normalization?

In statistics and applications of statistics, normalization can have a range of meanings.[1] In the simplest cases, normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging. In more complicated cases, normalization may refer to more sophisticated adjustments where the intention is to bring the entire probability distributions of adjusted values into alignment. In the case of normalization of scores in educational assessment, there may be an intention to align distributions to a normal distribution. A different approach to normalization of probability distributions is quantile normalization, where the quantiles of the different measures are brought into alignment.

What is overfitting?

In statistics and machine learning, one of the most common tasks is to fit a "model" to a set of training data, so as to be able to make reliable predictions on general untrained data. In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data. How to avoid overfitting? Cross-validation, regularization, early stopping, pruning, Bayesian priors on parameters or model comparison, and more!

What is a simple linear regression?

In statistics, simple linear regression is the least squares estimator of a linear regression model with a single explanatory variable. In other words, simple linear regression fits a straight line through the set of n points in such a way that makes the sum of squared residuals of the model (that is, vertical distances between the points of the data set and the fitted line) as small as possible. In short: minimize the squared error.
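
For a single explanatory variable the least squares line has a well-known closed form, sketched below. The function names and the five data points are invented for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_simple_linear_regression(xs, ys)
best_sse = sum_squared_errors(xs, ys, slope, intercept)
```

Any other slope or intercept gives a larger value from `sum_squared_errors` on the same data, which is what "least squares" means.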

Inflection points

Inflection points are where the function changes concavity. Since concave up corresponds to a positive second derivative and concave down corresponds to a negative second derivative, when the function changes from concave up to concave down (or vice versa) the second derivative, if it exists at that point, must equal zero there. So a zero second derivative is necessary for an inflection point, but it is not sufficient. Don't get excited yet: you have to make sure that the concavity actually changes at that point. For example, f(x) = x^4 has f''(0) = 0 but is concave up on both sides of 0, so x = 0 is not an inflection point.

What does the derivative of a function tell us?

The derivative of a function of a real variable measures the sensitivity to change of a quantity (a function value or dependent variable) which is determined by another quantity (the independent variable). Derivatives are a fundamental tool of calculus. For example, the derivative of the position of a moving object with respect to time is the object's velocity: this measures how quickly the position of the object changes when time is advanced. The derivative of a function of a single variable at a chosen input value, when it exists, is the slope of the tangent line to the graph of the function at that point. The tangent line is the best linear approximation of the function near that input value. For this reason, the derivative is often described as the "instantaneous rate of change", the ratio of the instantaneous change in the dependent variable to that of the independent variable.

What is SSE?

The sum of squared error

3D Surface Plot - how can it be used to plot the cost function?

Theta 0 and Theta 1 in a univariate linear regression can be plotted on the x and y axes; the z axis then indicates the actual cost J(Theta 0, Theta 1).

Unsupervised learning

Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning. Unsupervised learning is closely related to the problem of density estimation in statistics.[1] However unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data.

Clustering

a method of unsupervised learning - a good way of discovering unknown relationships in datasets. Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

What does this hypothesis represent? h_theta(x) = theta_0 + theta_1 x

univariate linear regression model

why use features that are on a similar scale?

Contour plots with differently scaled features will be extremely thin or extremely fat, resulting in very slow gradient descent (convergence is slower). Get every feature into roughly a -1 <= x <= 1 range. Poorly scaled means a range that is too large (-100 to 100) or too small (-0.00001 to 0.00001).

How to make sure gradient descent is working properly?

Create an automatic convergence test: declare convergence based on the amount by which J(Theta) decreases in one iteration. Also plot J on a graph, with the y axis being J(Theta) and the x axis being the number of iterations.
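
One possible shape for such a test, assuming the cost J(Theta) is recorded after every iteration. The function name and the default threshold of 1e-3 are assumptions for this sketch; a suitable threshold depends on the problem.

```python
def has_converged(cost_history, tolerance=1e-3):
    """Declare convergence when the last iteration lowered J by less than tolerance."""
    if len(cost_history) < 2:
        return False  # need at least two recorded costs to measure a decrease
    decrease = cost_history[-2] - cost_history[-1]
    # A negative decrease means J went up, which is a problem, not convergence.
    return 0 <= decrease < tolerance


still_improving = has_converged([10.0, 5.0, 4.0])
converged = has_converged([10.0, 5.0, 4.9995])
cost_went_up = has_converged([4.0, 5.0])
```

Plotting `cost_history` against the iteration number remains a useful companion check, because a single threshold is hard to choose in advance.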

What does J(Theta) increasing tell you about your gradient descent?

It's not working. Use a smaller alpha: if alpha is too big, each step can overshoot the minimum of the bowl-shaped cost curve, and you might be moving farther away from convergence.

Definition of stochastic?

randomly determined; having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely.

hypothesis model

Remember that the HYPOTHESIS MODEL (or hypothesis set) is what is denoted by h_theta(x).

