Exam 1

If there are 30 variables and 5% of the values are missing, how much of the data must be omitted?

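0.95^30 ≈ 0.215 of the rows are complete, so 1 - 0.215 ≈ 0.785: roughly 80% of the rows must be omitted (assuming values are missing independently across variables)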
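
The arithmetic above can be checked with a short sketch. It assumes each of the 30 variables is missing independently with probability 0.05 and that any row with at least one missing value is dropped; the variable names are illustrative only.

```python
# Fraction of rows lost to listwise deletion, assuming each variable is
# missing independently with the same probability (an assumption).
n_variables = 30
missing_rate = 0.05

# A row is kept only if all 30 values are present.
complete_fraction = (1 - missing_rate) ** n_variables
omitted_fraction = 1 - complete_fraction

print(f"complete rows: {complete_fraction:.3f}")
print(f"omitted rows:  {omitted_fraction:.3f}")
```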

In hierarchical clustering methods you have to pre-specify the number of clusters. (T/F)

False

Statistical models can overfit more easily than machine learning models? (T/F)

False (machine learning models are more flexible and overfit more easily)

Euclidean distance is heavily influenced by the scale of the data. (T/F)

True
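
A minimal sketch of why scale matters, using two made-up records of (age in years, income). The numbers and the `euclidean` helper are hypothetical and only illustrate the point.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records: (age in years, income in dollars).
p = (25, 50_000)
q = (55, 51_000)
d_dollars = euclidean(p, q)  # dominated by the 1,000-dollar income gap

# Same records with income expressed in thousands of dollars.
p_k = (25, 50.0)
q_k = (55, 51.0)
d_thousands = euclidean(p_k, q_k)  # now dominated by the 30-year age gap

print(round(d_dollars, 2), round(d_thousands, 2))
```

Changing only the unit of one variable changes which variable drives the distance, which is why variables are usually standardized before distance-based clustering.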

k-Means clustering is more computationally efficient in large data sets than hierarchical clustering methods. (T/F)

True

Multicollinearity can lead to overfitting? (T/F)

True (using multicollinear variables can result in overfitting)

supervised learning differs from unsupervised learning in that supervised learning requires

at least one output variable

What is the most appropriate plot to visualize the frequencies of a single binary variable?

bar chart

What is the most appropriate plot to visualize the relationship between a quantitative variable and a categorical variable?

box plot (side by side)

What is the most appropriate plot to get a summary of the likely values of a single quantitative variable?

histogram

supervised learning and unsupervised clustering require at least one...

input/predictor variable

RASE

is based on the sum of the squared errors: it is the square root of the average squared error. (The percent of variation in the output variable explained by the inputs is a separate measure, R-squared.)
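
A small sketch of how RASE (root average squared error) is computed, with R-squared alongside for contrast. The actual and predicted values are invented for illustration.

```python
import math

# Hypothetical actual and predicted values of the output variable.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

# Sum of squared errors, average squared error, and its square root.
sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ase = sse / len(actual)
rase = math.sqrt(ase)

# R-squared: share of variation in the output explained by the model.
mean_actual = sum(actual) / len(actual)
sst = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1 - sse / sst

print(rase, r_squared)
```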

Clustering

is the method used for data reduction.

The Euclidean distance is a measure:

of distance between observations or clusters

(Rule of thumb) 3 standard deviations away from the mean

outlier
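
The rule of thumb can be sketched as follows. The data are made up, and the sample standard deviation is used here as an assumption; with very small samples no point can reach 3 standard deviations, so the example uses 20 values.

```python
from statistics import mean, stdev

# Hypothetical measurements with one extreme value.
values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
          10, 11, 9, 10, 12, 8, 10, 11, 9, 50]

m = mean(values)
s = stdev(values)  # sample standard deviation (an assumption)

# Flag values more than 3 standard deviations from the mean.
outliers = [v for v in values if abs(v - m) > 3 * s]

print(outliers)
```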

What technique should be used for modeling rare events?

oversampling/undersampling
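
A minimal sketch of random oversampling and undersampling with the standard library. The 5-versus-95 class split and the (row_id, label) records are hypothetical stand-ins for a real data set.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical labels: 1 = rare event (5 rows), 0 = common case (95 rows).
labels = [1] * 5 + [0] * 95
rows = list(enumerate(labels))  # (row_id, label) pairs stand in for records

rare = [r for r in rows if r[1] == 1]
common = [r for r in rows if r[1] == 0]

# Oversampling: draw rare rows with replacement until both classes match.
oversampled = common + random.choices(rare, k=len(common))

# Undersampling: keep only as many common rows as there are rare rows.
undersampled = rare + random.sample(common, k=len(rare))

over_counts = Counter(label for _, label in oversampled)
under_counts = Counter(label for _, label in undersampled)
print(over_counts, under_counts)
```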

Principal Component Analysis

provides a few variables that are weighted linear combinations of the original variables and that retain the majority of the information in the full original set of variables.
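
As an illustration, here is a sketch of principal component analysis for just two variables, using the closed-form eigenvalues of a 2x2 covariance matrix. The data are invented, and real work would normally use a library routine instead of this hand calculation.

```python
import math
from statistics import mean

# Hypothetical, strongly correlated variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Center each variable.
dx = [v - mean(x) for v in x]
dy = [v - mean(y) for v in y]

# Entries of the sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
var_x = sum(d * d for d in dx) / (n - 1)
var_y = sum(d * d for d in dy) / (n - 1)
cov_xy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix.
half_trace = (var_x + var_y) / 2
spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
lambda1 = half_trace + spread
lambda2 = half_trace - spread

# Unit eigenvector for lambda1 (this form assumes cov_xy != 0):
# the weights of the first principal component.
norm = math.hypot(cov_xy, lambda1 - var_x)
w1 = (cov_xy / norm, (lambda1 - var_x) / norm)

# Scores: weighted linear combination of the centered variables.
scores = [a * w1[0] + b * w1[1] for a, b in zip(dx, dy)]

# Share of total variance retained by the first component.
explained = lambda1 / (lambda1 + lambda2)
print(w1, round(explained, 4))
```

Here a single component keeps almost all of the variance of the two original variables, which is the sense in which PCA reduces dimension.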

dimension reduction

reducing the number of variables; a common initial step

The supervised learning technique for numeric (continuous) response:

regression

What is the most appropriate plot to visualize the relationship between two quantitative (continuous) variables?

scatter plot

Standardizing (normalizing) a variable

subtract the mean from each value then divide by standard deviation of the variable.
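
A short sketch of that calculation on made-up values. It assumes the sample standard deviation; the population version would differ only in the divisor.

```python
from statistics import mean, stdev

# Hypothetical raw values of one variable.
values = [2.0, 4.0, 6.0, 8.0]

m = mean(values)
s = stdev(values)  # sample standard deviation (an assumption)

# z-score: subtract the mean, then divide by the standard deviation.
z_scores = [(v - m) / s for v in values]

# After standardizing, the variable has mean 0 and standard deviation 1.
print(z_scores)
```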

Determine which is the best approach for this problem: Determine whether a credit card transaction is valid or fraudulent

supervised learning

Determine which is the best approach for this problem: Do single men play more golf than married men?

supervised learning

Determine which is the best approach for this problem: What is the average weekly salary of all female employees under 40?

supervised learning

Dimensionality

the number of independent or input variables used by a model

Data reduction

the process of consolidating a large number of records into a smaller set.

eigenvectors

the weights that are used to project the original data onto the two new directions

Determine which is the best approach for this problem: Determine the characteristics of a successful used car salesperson

unsupervised learning

Determine which is the best approach for this problem: Develop a profile for credit card customers likely to carry an average balance of more than $1000

unsupervised learning

Determine which is the best approach for this problem: Develop an algorithm to group students according to their interests

unsupervised learning

The most common way to evaluate performance of a predictive model is to use the:

validation ASE (or RASE)

