Exam 2: Part 2

test data vs. training data

Within a dataset, the training set is used to build a model, while the test (or validation) set is used to validate the model that was built. Data points in the training set are excluded from the test (validation) set (see the sketch below).
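
A minimal sketch of such a split, assuming scikit-learn (the iris data here is just a stand-in for any labeled dataset):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Hold out 30% of the rows as the test set; the model never sees them
    # during training, so they can be used to validate it afterward.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )
    print(len(X_train), len(X_test))  # 105 45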

training data

data that is used to train a predictive model and that therefore must have known values for the target variable of the model

Generalization is the property of a model or modeling process, whereby the model applies to

data that were not used to build the model.

Finding chance occurrences in data that look like interesting patterns, but which do not generalize, is called

overfitting the data

A complex model may be necessary if the phenomenon producing the data is itself complex, but complex models run the risk of

overfitting the training data, i.e., modeling details of the data that are not found in the general population.

A model that has been overfitted has poor predictive performance, as it

overreacts to minor fluctuations in the training data.

learning performance

the predictive performance a model achieves as a result of training; in machine learning it is typically tracked on a learning curve, as a function of experience or the amount of training data used.

The most common method for focusing on clusters themselves is to

represent each cluster by its cluster center (centroid).

Issues with Nearest-Neighbor Methods include:

• Intelligibility; • Dimensionality and Domain Knowledge; • Computational Efficiency

Why is overfitting bad?

• The short answer is that as a model gets more complex, it is allowed to pick up harmful spurious correlations. • These correlations are idiosyncrasies of the specific training set used and do not represent characteristics of the population in general. • The harm occurs when these spurious correlations produce incorrect generalizations in the model. • This is what causes performance to decline when overfitting occurs. • On a fitting graph, as model complexity increases along the x-axis, holdout accuracy on the y-axis eventually declines, because the added complexity captures noise with no predictive power.

Recognizing Overfitting

• Use the fitting graph, which shows the accuracy of a model as a function of complexity. • We need to "hold out" some data; the held-out data is then considered a TEST SET. - Creating holdout data is like creating a "lab test" of generalization performance. - We simulate the use scenario on the holdout data: we hide from the model (and possibly the modelers) the actual values of the target variable on the holdout data. - The model then predicts those values, and we compare its predictions to the hidden answers (see the sketch below).
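
A minimal holdout evaluation, assuming scikit-learn (the iris data and the decision tree are stand-ins; any model and labeled dataset would do):

    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.3, random_state=0
    )

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # The true labels y_hold stay hidden from the model; it sees only X_hold.
    predictions = model.predict(X_hold)
    print("training accuracy:", model.score(X_train, y_train))
    print("holdout accuracy:", accuracy_score(y_hold, predictions))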

Overfitting in tree induction

• We build tree-structured models for classification. • If we continue to split the data, eventually the subsets will be pure; all instances in any chosen subset will have the same value for the target variable. • A procedure that grows trees until the leaves are pure tends to overfit. • Past a certain point (the sweet spot), the tree starts to overfit: it acquires details of the training set that are not characteristic of the population in general, as represented by the holdout set (see the depth sweep sketched below).
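
A sketch of a fitting-graph-style sweep, assuming scikit-learn; max_depth serves as the complexity knob, and the bundled breast-cancer data is a stand-in:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

    # As depth grows, training accuracy keeps rising, while holdout accuracy
    # peaks at a "sweet spot" and then degrades as the tree overfits.
    for depth in range(1, 15):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_train, y_train)
        print(depth,
              round(tree.score(X_train, y_train), 3),
              round(tree.score(X_hold, y_hold), 3))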

Cross-validation begins by

• splitting a labeled dataset into k partitions called folds; typically k is five or ten. (In five-fold cross-validation, the original dataset is split randomly into five equal-sized pieces.) • Each piece is then used in turn as the test set, with the other four used to train a model. • The result is five different accuracy estimates, which can then be used to compute the average accuracy and its variance (see the sketch below).
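
A minimal sketch using scikit-learn's cross_val_score, assuming five folds and a decision tree as the model:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # cv=5: split the data into five folds; each fold serves once as the
    # test set while the other four train the model.
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    print(scores)                        # five accuracy results
    print(scores.mean(), scores.var())   # average accuracy and its variance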

mathematical functioning of overfitting

As we add more attributes x1, x2, ..., xn, the function f(x) = w0 + w1*x1 + w2*x2 + ... + wn*xn becomes more and more complicated; each attribute xi has a corresponding weight wi, so every added attribute gives the model another parameter with which to fit the training data.
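
A minimal numeric sketch of such a weighted linear function; the weights and attribute values below are made up for illustration:

    import numpy as np

    # f(x) = w0 + w1*x1 + ... + wn*xn: every attribute x_i gets a weight w_i,
    # so adding attributes adds parameters and makes the function more complex.
    w0 = 0.5
    w = np.array([1.2, -0.7, 3.0])   # illustrative weights
    x = np.array([2.0, 1.0, 0.5])    # one instance's attribute values

    f_x = w0 + np.dot(w, x)
    print(f_x)  # 0.5 + 2.4 - 0.7 + 1.5 = 3.7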

Hierarchical cluster analysis

Following microarray analysis, hierarchical clustering allows groups of genes with similar functions or patterns of regulation to be compared with one another.

How many neighbors and How Much Influence?

If we increase k to the maximum possible (so that k = n), the entire dataset is used for every prediction, and the classifier simply predicts the majority class every time; here k is the number of nearest neighbors consulted.
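
A small sketch of varying k, assuming scikit-learn (the bundled breast-cancer data is a stand-in; note that with its default uniform weighting, k = n reduces to a majority-class vote):

    from sklearn.datasets import load_breast_cancer
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    n = len(X)  # 569 instances

    # With k = n, every prediction is a vote over the entire dataset, so the
    # classifier predicts the overall majority class regardless of the input.
    for k in (1, 5, n):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        print(k, knn.predict(X[:1]))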

Nearest neighbor classification

The point to be classified, labeled with a question mark, would be classified + because the majority of its three nearest neighbors are + (a from-scratch version of this vote is sketched below).
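
A from-scratch sketch of the 3-NN majority vote; the coordinates and labels are made up for illustration:

    import numpy as np
    from collections import Counter

    # Labeled points in a 2-D instance space (illustrative values).
    points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0]])
    labels = ['+', '+', '-', '-', '+']
    query = np.array([2.0, 2.0])  # the point marked "?"

    # Euclidean distance from the query to every labeled point.
    dists = np.sqrt(((points - query) ** 2).sum(axis=1))

    # Take the three nearest neighbors and classify by majority vote.
    nearest = np.argsort(dists)[:3]
    votes = Counter(labels[i] for i in nearest)
    print(votes.most_common(1)[0][0])  # '+'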

What is the fundamental trade-off involved in overfitting?

There is a fundamental trade-off between model complexity and the possibility of overfitting.

There is no guarantee that a single run of the k-means algorithm will result in

a good clustering, so k-means is usually run many times and the best clustering is kept.

Cross-validation makes better use of

a limited dataset.

A learning curve is a

plot of model learning performance over experience or time.

On a typical fitting graph, each point on a curve represents a(n)

accuracy estimation of a model with a specified complexity (as indicated on the horizontal axis).

We account for overfitting in regression analysis by looking at the

adjusted R square, which accounts for the number of independent variables.
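
A minimal sketch of the adjustment, assuming scikit-learn's LinearRegression and its bundled diabetes data (the adjusted R-square formula itself is standard):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression

    X, y = load_diabetes(return_X_y=True)
    n, p = X.shape  # n observations, p independent variables

    r2 = LinearRegression().fit(X, y).score(X, y)

    # Adjusted R^2 penalizes the plain R^2 for each added independent variable:
    # R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    print(round(r2, 4), round(r2_adj, 4))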

generalization (data)

applies to a wider population

The centroid is the

average of the values for each feature.
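
A tiny sketch of computing a centroid with NumPy; the cluster members below are made up for illustration:

    import numpy as np

    # Three instances with two features each (illustrative values).
    cluster = np.array([[1.0, 2.0],
                        [3.0, 4.0],
                        [5.0, 6.0]])

    # The centroid averages each feature across the cluster's members.
    centroid = cluster.mean(axis=0)
    print(centroid)  # [3. 4.]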

The fundamental concept of similarity between data items runs throughout

data mining.

holdout data

data not used in building the model, but for which we do know the actual value of the target variable.

Similarity underlies many

• Similarity underlies many data science methods and solutions to business problems. • Shared characteristics help data mining procedures by enabling similar data items to be grouped together. • Similarity can be used for classification and regression analysis, and for grouping similar entities into clusters.

Test set

data set used to estimate accuracy of final model on unseen data

A very common proxy for the similarity of two entities is the

distance between them in the instance space defined by their feature vector representation.

There is no single choice or technique to

eliminate overfitting.

To have a fair evaluation of the generalization performance of a model, we should

estimate its accuracy on holdout data (data not used in building the model, but for which we do know the actual value of the target variable).

The basic idea of clustering is that we want to

find groups of objects (consumers, businesses, etc.) where the objects within the groups are similar.

When we train a machine learning model, we don't just want it to learn to model the training data; we want it to

generalize to data it hasn't seen before.

Overfitting occurs when a model is excessively complex, such as

having too many parameters relative to the number of observations.

A dendrogram is a type of tree diagram showing

hierarchical clustering — relationships between similar sets of data.
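
A minimal sketch of producing a dendrogram, assuming SciPy and Matplotlib; the six random 2-D points are a stand-in for real data:

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, linkage
    import matplotlib.pyplot as plt

    # Six points in 2-D (illustrative); linkage() builds the merge hierarchy.
    rng = np.random.default_rng(0)
    points = rng.random((6, 2))

    Z = linkage(points, method='ward')  # Ward's minimum-variance linkage
    dendrogram(Z)                       # tree diagram of the nested merges
    plt.show()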

Cross-validation is a more sophisticated procedure for

training and testing with holdout data.

Generalization (also known as the out-of-sample error) is a measure of

how accurately an algorithm is able to predict outcome values for previously unseen data.

Accuracy estimates on training data and testing data vary differently based on

how complex we allow a model to be.

The accuracy of a model depends upon:

how complex we want it to be

Unlike splitting the data into one training and one holdout set, cross-validation computes

its estimates over all the data by performing multiple splits and systematically swapping out samples for testing.

The resulting figure of a Euclidean distance calculation is

just a number; it has no units and no meaningful interpretation on its own. It is useful only for comparing the similarity of one pair of objects to another (for example, to find nearest neighbors).

The most popular centroid-based clustering algorithm is called

k-means. In k-means, the "means" are the centroids (a centroid represents the average of the values for each feature).
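
A minimal sketch using scikit-learn's KMeans; the random 2-D data is a stand-in, and n_init addresses the point above that a single run is not guaranteed to find a good clustering:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.random((100, 2))  # illustrative unlabeled data

    # n_init=10 reruns k-means from 10 random starts and keeps the best
    # clustering found, since any single run may end up in a poor solution.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)  # the three centroids (feature-wise means)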

What is the best strategy with regard to overfitting?

learn to recognize overfitting and manage complexity in a principled way. The goal is to reduce, rather than eliminate, overfitting.

Learning curves are a widely used diagnostic tool in

machine learning for algorithms that learn from a training dataset incrementally.

Data mining involves a fundamental trade-off between

model complexity and the possibility of overfitting.

When the model is not allowed to be complex enough, it is ______________. As the model gets too complex, it looks very accurate on the training data but is __________.

not very accurate; overfitting.

Overfitting occurs when

our learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both. In that case, the learner has overfit.

All data mining procedures have the tendency to

overfit to some extent (some methods more so than others).

All model types can be

overfit.

Clustering is another application in the study of

similarity.

Geometric Interpretation

Instances are represented as points in an instance space defined by their feature vectors, so that the similarity of two instances corresponds to the distance between their points.

Distortion in clustering represents the

sum of squared distances of points from their cluster centers; it decreases as the number of clusters increases (see the sketch below).
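
A sketch of that decrease, assuming scikit-learn, where KMeans exposes the distortion as inertia_; the random data is a stand-in:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.random((200, 2))  # illustrative unlabeled data

    # inertia_ is the distortion: the sum of squared distances of points to
    # their nearest cluster center. It decreases as k grows.
    for k in range(1, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 3))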

The nearest neighbors method can be used to find

the companies most similar to our best corporate customers, or the online consumers most similar to our best retail customers.

Euclidean distance

the straight-line distance, or shortest possible path, between two points. If A is at coordinates (xA, yA) and B is at (xB, yB), we can draw a right triangle between the two objects whose base is the difference in the x's, (xA - xB), and whose height is the difference in the y's, (yA - yB). The Euclidean distance is then the hypotenuse: d(A, B) = sqrt((xA - xB)^2 + (yA - yB)^2).

