Exam 1
If there are 30 variables and 5% of the values are missing, how much of the data must be omitted?
About 80% of the records: 0.95^30 ≈ 0.215 of records are complete, so 1 - 0.215 ≈ 0.785 have at least one missing value.
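The arithmetic above can be checked with a short sketch. It assumes each of the 30 values in a record is missing independently with probability 5%, and that any record with at least one missing value is dropped; the function name is hypothetical.

```python
def fraction_of_records_dropped(n_variables: int, missing_rate: float) -> float:
    """Expected share of records with at least one missing value,
    assuming values are missing independently across variables."""
    # Probability that a record has no missing value in any variable.
    complete = (1.0 - missing_rate) ** n_variables
    return 1.0 - complete


if __name__ == "__main__":
    print(round(fraction_of_records_dropped(30, 0.05), 3))
```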
In hierarchical clustering methods you have to pre-specify the number of clusters. (T/F)
False
Statistical models overfit more easily than machine learning models. (T/F)
False (machine learning models are the ones more prone to overfitting)
Euclidean distance is heavily influenced by the scale of the data. (T/F)
True
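A small sketch of why scale matters for Euclidean distance. The two records and their units (age in years, income in dollars) are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Hypothetical records: (age in years, income in dollars).
    in_dollars = euclidean((25, 50_000), (60, 51_000))
    # Same records with income expressed in thousands of dollars.
    in_thousands = euclidean((25, 50), (60, 51))
    # With income in dollars the income difference dominates the distance;
    # in thousands, the age difference dominates instead.
    print(in_dollars, in_thousands)
```

Standardizing the variables before computing distances is the usual way to keep one variable's units from dominating.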
k-Means clustering is more computationally efficient in large data sets than hierarchical clustering methods. (T/F)
True
Multicollinearity can lead to overfitting? (T/F)
True (using multicollinear variables can result in overfitting)
Supervised learning differs from unsupervised learning in that supervised learning requires...
at least one output variable
What is the most appropriate plot to visualize the frequencies of a single binary variable?
bar chart
What is the most appropriate plot to visualize the relationship between a quantitative variable and a categorical variable?
box plot (side by side)
What is the most appropriate plot to get a summary of the likely values of a single quantitative variable?
histogram
Supervised learning and unsupervised clustering require at least one...
input/predictor variable
RASE
is based on the sum of the squared errors: it is the square root of the average squared error. (The percent of variation in the output variable explained by the inputs is measured by R-squared, not RASE.)
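A minimal sketch of the calculation, taking ASE to mean the average squared error and RASE its square root (an assumption about the abbreviations). The actual and predicted values are invented for illustration.

```python
import math


def ase(actual, predicted):
    """Average squared error: sum of squared errors divided by n."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


def rase(actual, predicted):
    """Root average squared error, in the same units as the output."""
    return math.sqrt(ase(actual, predicted))


if __name__ == "__main__":
    # Hypothetical validation-set values.
    actual = [3.0, 5.0, 7.0, 9.0]
    predicted = [2.0, 5.0, 8.0, 11.0]
    print(ase(actual, predicted), rase(actual, predicted))
```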
Clustering
is the method used for data reduction.
The Euclidean distance is a measure:
of distance between observations or clusters
(Rule of thumb) more than 3 standard deviations away from the mean
outlier
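The rule of thumb can be sketched as follows, assuming the sample standard deviation is used; the function name and data are hypothetical.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` sample standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]


if __name__ == "__main__":
    # Hypothetical data: 19 typical values and one extreme value.
    values = [10] * 10 + [11] * 9 + [50]
    print(flag_outliers(values))
```

In small samples an extreme value inflates the mean and standard deviation it is compared against, so this rule can miss outliers when there are only a handful of observations.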
What technique should be used for modeling rare events?
oversampling/undersampling
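One simple form of oversampling is to resample the rare class with replacement until the classes are the same size. The sketch below assumes that approach; the function name, the `label_of` parameter, and the records are hypothetical.

```python
import random


def oversample_minority(records, label_of, seed=0):
    """Randomly duplicate records of smaller classes (with replacement)
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for record in records:
        by_label.setdefault(label_of(record), []).append(record)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the shortfall from the same class, with replacement.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


if __name__ == "__main__":
    # Hypothetical transactions: 3 fraudulent, 97 valid.
    records = [("fraud", i) for i in range(3)] + [("valid", i) for i in range(97)]
    balanced = oversample_minority(records, label_of=lambda r: r[0])
    print(len(balanced))
```

Resampling is typically applied to the training data only, so that validation results still reflect the true event rate.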
Principal Component Analysis
provides a few variables that are weighted linear combinations of the original variables and that retain the majority of the information in the full original set of variables.
Dimension reduction
reducing the number of variables; a common initial step
The supervised learning technique for numeric (continuous) response:
regression
What is the most appropriate plot to visualize the relationship between two quantitative (continuous) variables?
scatter plot
Standardizing (normalizing) a variable
subtract the mean from each value then divide by standard deviation of the variable.
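A minimal sketch of the calculation, assuming the sample standard deviation (n - 1 in the denominator); the function name and values are hypothetical.

```python
import statistics


def standardize(values):
    """Subtract the mean from each value, then divide by the
    sample standard deviation of the variable."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


if __name__ == "__main__":
    z_scores = standardize([2.0, 4.0, 6.0, 8.0])
    # The standardized variable has mean 0 and standard deviation 1.
    print(z_scores)
```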
Determine which is the best approach for this problem: Determine whether a credit card transaction is valid or fraudulent
supervised learning
Determine which is the best approach for this problem: Do single men play more golf than married men?
data query (a comparison of group averages; no learning model is needed)
Determine which is the best approach for this problem: What is the average weekly salary of all female employees under 40?
data query (a filtered average; no learning model is needed)
Dimensionality
the number of independent or input variables used by a model
Data reduction
the process of consolidating a large number of records into a smaller set.
eigenvectors
the weights that are used to project the original data onto the two new directions
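For two variables, the eigenvectors and eigenvalues of the sample covariance matrix can be worked out in closed form. The sketch below assumes PCA is run on the covariance matrix of two variables with nonzero covariance; the function names and data are hypothetical.

```python
import math


def pca_2d(xs, ys):
    """PCA for two variables via the 2x2 sample covariance matrix.

    Returns (eigenvalues, eigenvectors), largest eigenvalue first.
    Assumes the two variables have nonzero covariance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + radius, half_trace - radius]

    eigenvectors = []
    for lam in eigenvalues:
        # (sxy, lam - sxx) satisfies the eigenvector equation when sxy != 0.
        vx, vy = sxy, lam - sxx
        norm = math.hypot(vx, vy)
        eigenvectors.append((vx / norm, vy / norm))
    return eigenvalues, eigenvectors


def project(xs, ys, eigenvector):
    """Scores: centered data weighted by one eigenvector."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    wx, wy = eigenvector
    return [wx * (x - mean_x) + wy * (y - mean_y) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    xs = [2.0, 4.0, 6.0, 8.0]
    ys = [1.0, 3.0, 2.0, 6.0]
    values, vectors = pca_2d(xs, ys)
    # Share of total variance retained by the first principal component.
    print(values[0] / sum(values))
    print(project(xs, ys, vectors[0]))
```

Each eigenvalue is the variance of the scores along its eigenvector, which is why the ratio printed above reads as the share of information retained by the first component.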
Determine which is the best approach for this problem: Determine the characteristics of a successful used car salesperson
supervised learning (success is a known output variable)
Determine which is the best approach for this problem: Develop a profile for credit card customers likely to carry an average balance of more than $1000
supervised learning (whether the balance exceeds $1000 is a known output variable)
Determine which is the best approach for this problem: Develop an algorithm to group students according to their interests
unsupervised learning
The most common way to evaluate performance of a predictive model is to use the:
validation ASE (or RASE)