Data Analytics Final
In a box plot, the box includes 50% of the data, the horizontal line represents (i)____________, and the top and bottom of the box represent (ii)________, respectively.
(i) the median (50th percentile), (ii) 75th and 25th percentiles
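The box-plot landmarks above can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the sample (the integers 1 to 101) is a made-up illustration, and the "inclusive" quantile method is one of several conventions, so other tools may report slightly different quartiles.

```python
import statistics

# Hypothetical sample (not from the original question): integers 1..101.
data = list(range(1, 102))

# With n=4, statistics.quantiles returns the three quartile cut points.
# method="inclusive" treats the sample as the whole population.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
median = statistics.median(data)

# Share of observations that fall inside the box [q1, q3].
inside_box = sum(q1 <= x <= q3 for x in data) / len(data)

print(f"bottom of box (25th percentile): {q1}")
print(f"line in box (median):            {median}")
print(f"top of box (75th percentile):    {q3}")
print(f"share of data inside the box:    {inside_box:.2f}")
```

The share inside the box comes out close to one half, which matches the statement that the box covers the middle 50% of the data.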
How would the correlations change if we normalized the data first?
Correlations will not change, because computing a correlation already standardizes the data
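The invariance stated above can be verified directly. This is a minimal sketch, assuming "normalized" means z-scoring each variable (subtract the mean, divide by the standard deviation); the height and weight values are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def zscore(xs):
    """Normalize a sequence to mean 0 and sample standard deviation 1."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return [(x - m) / sd for x in xs]

# Hypothetical measurements, assumed only for this example.
height_cm = [150.0, 160.0, 165.0, 172.0, 180.0, 191.0]
weight_kg = [52.0, 58.0, 63.0, 70.0, 79.0, 88.0]

r_raw = pearson(height_cm, weight_kg)
r_norm = pearson(zscore(height_cm), zscore(weight_kg))

print(f"correlation on raw data:        {r_raw:.6f}")
print(f"correlation on normalized data: {r_norm:.6f}")
```

Both values agree up to floating-point rounding, because the correlation formula already centers each variable and divides by its spread.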
Which of the following are characteristics of Naive Bayes Classifier? (2 correct answers)
Data driven; makes no assumptions about the distribution of the data
What are the characteristics of k-NN algorithm? (2 correct answers)
Makes no assumptions about the data; data-driven, not model-driven
Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate than model B on the training data, but slightly less accurate than model B on the validation data. Which model are you more likely to consider for final deployment?
Model B
When a model is fit to training data, zero error with those data is not necessarily good. This special case is called ______.
Overfitting
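The two cards above (preferring the model that does better on validation data, and zero training error as a warning sign) can be illustrated with a toy k-NN classifier. This is a sketch under stated assumptions: the one-dimensional data set, its two deliberately mislabeled training records (x = 4 and x = 5), and the function names are all hypothetical, and the gap between the two models is exaggerated compared with the "slightly less accurate" wording of the question.

```python
def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training records closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda rec: abs(rec[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(train_x, train_y, xs, ys, k):
    """Fraction of records in (xs, ys) that k-NN misclassifies."""
    wrong = sum(knn_predict(train_x, train_y, x, k) != y for x, y in zip(xs, ys))
    return wrong / len(xs)

# Hypothetical data: the underlying rule is "label 1 when x > 4.5",
# but the training labels at x=4 and x=5 are noisy.
train_x = [1, 2, 3, 4, 5, 6, 7, 8]
train_y = [0, 0, 0, 1, 0, 1, 1, 1]
valid_x = [2.2, 3.9, 5.1, 6.8]
valid_y = [0, 0, 1, 1]

# Model A: k=1 memorizes the training data, noise included.
a_train = error_rate(train_x, train_y, train_x, train_y, k=1)
a_valid = error_rate(train_x, train_y, valid_x, valid_y, k=1)

# Model B: k=3 smooths over the noisy records.
b_train = error_rate(train_x, train_y, train_x, train_y, k=3)
b_valid = error_rate(train_x, train_y, valid_x, valid_y, k=3)

print(f"Model A (k=1): training error {a_train:.2f}, validation error {a_valid:.2f}")
print(f"Model B (k=3): training error {b_train:.2f}, validation error {b_valid:.2f}")
```

Model A reaches zero training error only by fitting the noise, and it pays for that on the validation records; Model B makes training errors but generalizes better, which is the reason to prefer it for deployment.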
Which of the following are true about Principal Component Analysis (PCA)? (2 correct answers)
PCA is intended for use with quantitative variables; the idea of PCA is to find a linear combination of the two variables that contains most, even if not all, of the information, so that this new variable can replace the two original variables.
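The "linear combination of the two variables" described above can be computed by hand for the two-variable case. The following is a minimal sketch, assuming PCA on the covariance matrix (no rescaling) and using invented data; it relies on the closed-form eigen-decomposition of a 2x2 symmetric matrix rather than a statistics library.

```python
import math

# Hypothetical, strongly correlated quantitative variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Sample variances and covariance (entries of the covariance matrix).
sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
syy = sum((b - my) ** 2 for b in y) / (n - 1)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Eigenvalues of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
half_trace = (sxx + syy) / 2
radius = math.hypot((sxx - syy) / 2, sxy)
lam1, lam2 = half_trace + radius, half_trace - radius

# Unit-length weights of the first principal component.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
w = (math.cos(theta), math.sin(theta))

# The new variable: a linear combination of the centered originals.
pc1 = [w[0] * (a - mx) + w[1] * (b - my) for a, b in zip(x, y)]
var_pc1 = sum(v ** 2 for v in pc1) / (n - 1)  # pc1 has mean zero

explained = lam1 / (lam1 + lam2)
print(f"weights of first component: ({w[0]:.3f}, {w[1]:.3f})")
print(f"share of total variance captured: {explained:.4f}")
```

Because the two variables are almost perfectly correlated here, the single new variable retains nearly all of the total variance, which is what makes it a reasonable replacement for the original pair.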
Which of the following are the methods that we use for dimension reduction? (4 correct answers)
Removing one of the variables in pairs that have a very strong correlation; Logistic Regression; Multiple Linear Regression; Principal Component Analysis
Which of the following are advantages of Naive Bayes Method? (3 correct answers)
Simple and computationally efficient; handles purely categorical data well; works well with very large data sets
Identify whether the task required is supervised or unsupervised learning: Printing of custom discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought previously.
This is unsupervised learning
True or False: Bar charts are useful for comparing a single statistic (e.g. average, count, percentage) across groups. The height of the bar represents the value of the statistic, and different bars correspond to different groups.
True
True or False: The Naive Bayes method relies on the assumption of independence between predictor variables within each class.
True
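The independence assumption above is what lets Naive Bayes multiply one conditional probability per predictor. This is a minimal sketch for two categorical predictors; the records, the column meanings (outlook, wind, play) and the function name are hypothetical, and no smoothing is applied, so an unseen category would yield a zero probability.

```python
from collections import Counter

# Hypothetical training records: (outlook, wind, class label).
records = [
    ("sunny", "weak", "yes"),
    ("sunny", "strong", "no"),
    ("rainy", "weak", "yes"),
    ("rainy", "strong", "no"),
    ("sunny", "weak", "yes"),
    ("rainy", "weak", "no"),
]

def naive_bayes_posterior(records, outlook, wind):
    """Posterior class probabilities, assuming the two predictors are
    independent within each class (the naive assumption)."""
    class_counts = Counter(label for _, _, label in records)
    total = len(records)
    scores = {}
    for label, count in class_counts.items():
        prior = count / total
        p_outlook = sum(1 for o, _, l in records if l == label and o == outlook) / count
        p_wind = sum(1 for _, w, l in records if l == label and w == wind) / count
        # Independence within the class: multiply the per-predictor terms.
        scores[label] = prior * p_outlook * p_wind
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}

posterior = naive_bayes_posterior(records, "sunny", "weak")
for label, prob in sorted(posterior.items()):
    print(f"P({label} | sunny, weak) = {prob:.3f}")
```

For the class "yes" the score is 1/2 x 2/3 x 1 = 1/3, and for "no" it is 1/2 x 1/3 x 1/3 = 1/18, so the normalized posterior for "yes" is 6/7.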
True or False: Pairs of variables that have a very strong (positive or negative) correlation contain duplicative information. Therefore, we want to omit the variables that are strongly correlated to others to avoid multicollinearity (when fitting models).
True
True or False: The classification matrix, also called confusion matrix, gives estimates of the true classification and misclassification rates.
True
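The rates that a classification (confusion) matrix provides can be tallied in a few lines. The following is a minimal sketch for a binary outcome; the actual and predicted labels are invented for illustration.

```python
# Hypothetical actual and predicted classes (1 = positive, 0 = negative).
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives
fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives
tn = sum(a == 0 and p == 0 for a, p in pairs)  # true negatives

accuracy = (tp + tn) / len(actual)
misclassification_rate = (fp + fn) / len(actual)

print("              predicted 1  predicted 0")
print(f"actual 1      {tp:11d}  {fn:11d}")
print(f"actual 0      {fp:11d}  {tn:11d}")
print(f"accuracy:               {accuracy:.2f}")
print(f"misclassification rate: {misclassification_rate:.2f}")
```

These rates are only estimates for the data they were computed on; an honest estimate of future error requires records that were not used to fit the model.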
True or False: k-NN is a "lazy learner": the time-consuming computation is deferred to the time of prediction. For every record to be predicted, we compute its distances from the entire set of training records only at the time of prediction. This behavior prohibits using this algorithm for real-time prediction of a large number of records simultaneously.
True
To obtain an honest estimate of future classification error, we use the classification matrix that is computed from ________.
Validation Data
Scatter plots play an important role in prediction; the next step can be developing a model. Scatter plots provide information about relationships (linear or non-linear) between variables. The variables in a scatter plot ________.
must be numerical
The density ellipsoid in a scatterplot matrix is a good graphical indicator of the correlation between two variables. The ellipsoid collapses diagonally as the correlation between the two variables approaches either 1 or -1. The ellipsoid is more circular if the two variables are more correlated. (TRUE or FALSE?)
False
True or False: Sensitivity and Specificity are plotted on an ROC Curve.
False
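A likely reason the statement is marked False is that an ROC curve conventionally plots sensitivity against 1 - specificity (the false positive rate), not against specificity itself. The sketch below computes those points for a handful of cutoffs; the scores and labels are hypothetical.

```python
# Hypothetical predicted probabilities and actual classes (1 = positive).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]

positives = sum(labels)
negatives = len(labels) - positives

roc_points = []  # (1 - specificity, sensitivity) per cutoff
for threshold in sorted(set(scores), reverse=True):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, labels))
    tn = sum(p == 0 and a == 0 for p, a in zip(predicted, labels))
    sensitivity = tp / positives
    specificity = tn / negatives
    roc_points.append((1 - specificity, sensitivity))

for fpr, tpr in roc_points:
    print(f"x = 1 - specificity = {fpr:.2f}, y = sensitivity = {tpr:.2f}")
```

Lowering the cutoff moves the point up and to the right: sensitivity rises while specificity falls, ending at (1, 1) when every record is classified as positive.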
True or False: The test data are used to build models, or to further tweak the model or improve its fit.
False
True or False: To implement the k-NN algorithm successfully in JMP Pro, one has to normalize the continuous predictors first.
False
True or False: The k-NN algorithm can only be used for classification (of a categorical outcome).
False
True or False: The Naive Bayes platform fits a model to predict the value of a numerical variable as well as the value of a categorical variable.
False
Which of the following are the most popular visualization tools in JMP Pro? (3 correct answers)
Fit Y by X; Distribution; Graph Builder
Identify whether the task required is supervised or unsupervised learning: Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers).
This is supervised learning
In JMP, a diamond is displayed in the box, where the center of the diamond is ________.
The mean
