DATS FINAL EXAM QUESTIONS
Briefly explain the difference between a sampling error and a non-sampling error.
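Sampling error is the difference between a sample statistic (such as the sample mean) and the corresponding population parameter; it arises because only part of the population is observed. Non-sampling error, on the other hand, comes from how the data are collected or processed, for example measurement mistakes, nonresponse, or biased questions.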
What is the formula for Accuracy? a. (TP + TN) / Total b. TP / (TP + FP) c. TP / (TP + FN) d. TN / (TN + FP)
a. (TP + TN) / Total
Peter Parker got stranded and is now living alone on an island without any technology. To make his life more interesting, he has been writing daily journals. To find out whether there is a connection between the locations he travels to and the fruits he gathers each day, he conducts a test. The location is recorded as North, East, South, West, NorthEast, NorthWest, SouthEast, or SouthWest, and the fruits are recorded as 'color', 'taste', 'texture', 'edible', and 'poisonous'. What test should he run? a. Chi-square test b. ANOVA test c. Pearson test d. Spearman test
a. Chi-square
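To make the chi-square answer concrete, here is a minimal sketch of the test statistic for two categorical variables, computed by hand with the standard library. The counts are hypothetical (they are not from the question), and the function name `chi_square_statistic` is an illustrative choice.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows = two locations, columns = edible / poisonous
observed = [[10, 20], [20, 10]]
stat, dof = chi_square_statistic(observed)
print(round(stat, 3), dof)
```

With these made-up counts the statistic is about 6.667 on 1 degree of freedom, which exceeds the 0.05 critical value of 3.841, so independence would be rejected.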
For regular Linear Regression models, what do we use to evaluate the model? Choose all that apply: a. Coefficients' p-values b. RMSE root-mean-squared-error c. P-statistics for overall model significance d. Coefficient of Determination R², for percentage explained e. Chi-Squared tests f. all of the above
a. Coefficients' p-values, b. RMSE root-mean-squared-error, d. Coefficient of Determination R², for percentage explained
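Two of the metrics in the answer, RMSE and R², can be computed directly from the residuals. The sketch below fits a one-predictor least-squares line by hand on hypothetical data; the function names are illustrative, and RMSE here divides by n (some software reports a residual standard error that divides by n − 2 instead).

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse_and_r2(xs, ys, slope, intercept):
    """Return (RMSE, R squared) for the fitted line."""
    preds = [intercept + slope * x for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained variation
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)             # total variation
    rmse = math.sqrt(ss_res / len(ys))
    r2 = 1 - ss_res / ss_tot
    return rmse, r2


# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)
rmse, r2 = rmse_and_r2(xs, ys, slope, intercept)
print(slope, intercept, rmse, r2)
```

For this data the fitted line explains 60% of the variation in y (R² = 0.6).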
Which of the following statements best describes the purpose of the ANOVA test? A) ANOVA is used to compare means of two independent samples. B) ANOVA is used to determine if two independent samples are correlated. C) ANOVA is used to determine if there are significant differences among the means of three or more groups. D) ANOVA is used to compare means of two dependent samples.
C) ANOVA is used to determine if there are significant differences among the means of three or more groups.
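As an illustration of what one-way ANOVA compares, the following sketch computes the F statistic (between-group variance over within-group variance) for three hypothetical groups. The function name and data are assumptions made for the example.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical measurements for three groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

A large F means the group means differ by more than the within-group scatter would suggest; the p-value then comes from the F distribution with (df_between, df_within) degrees of freedom.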
Normal distribution is mathematically defined as ______ distribution with exponential tails exp(−x²/(2σ²)) symmetrically on both sides, total area = unity.
Gaussian distribution
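The two properties named in the question (symmetry and total area of one) can be checked numerically. This is a rough sketch using a simple trapezoid rule; the integration range of ±8 standard deviations and the step count are arbitrary choices for the example.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def trapezoid_area(f, a, b, steps=10000):
    """Approximate the integral of f over [a, b] with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


area = trapezoid_area(normal_pdf, -8, 8)
print(area)                                # very close to 1
print(normal_pdf(1.5), normal_pdf(-1.5))   # equal by symmetry
```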
When would you use a logistic regression model, and what is its primary purpose? a. To predict continuous outcomes, such as house prices or stock prices. b. To classify data into two or more discrete categories based on predictor variables. c. To estimate the mean response of a continuous variable at different levels of predictor variables. d. To identify clusters within data points based on their similarity.
b. To classify data into two or more discrete categories based on predictor variables.
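A minimal sketch of how a fitted logistic regression turns a predictor into a class: the linear score is passed through the sigmoid to get a probability, which is then thresholded. The coefficients below are hypothetical, not estimated from data, and the 0.5 threshold is a common default rather than a requirement.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))


def predict_class(x, intercept, slope, threshold=0.5):
    """Return (probability of class 1, predicted class) for one predictor value."""
    p = sigmoid(intercept + slope * x)
    return p, int(p >= threshold)


# Hypothetical coefficients, chosen only for illustration
intercept, slope = -4.0, 1.0
for x in (2, 4, 6):
    print(x, predict_class(x, intercept, slope))
```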
Which measure of central tendency is most affected by outliers in a data set? A. Mean B. Median C. Mode D. Range
A. Mean
The __________ is the value that occurs most frequently in a data set
mode
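The two answers above can be checked with Python's `statistics` module. The data are hypothetical; a single extreme value is appended to show how each measure reacts.

```python
import statistics

data = [2, 3, 3, 4, 5]
with_outlier = data + [100]  # one extreme value added

for label, values in (("original", data), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", statistics.mean(values),
        "median:", statistics.median(values),
        "mode:", statistics.mode(values),
    )
```

The mean jumps from 3.4 to 19.5, while the median only moves from 3 to 3.5 and the mode stays at 3.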
If a data set has a bell-shaped distribution, which of the following statements is true regarding the mean and median? A. The mean is greater than the median. B. The mean is less than the median. C. The mean and median are approximately equal. D. There is no relationship between the mean and median.
C. The mean and median are approximately equal.
True or False: The standard error is the standard deviation of the sampling distribution.
True
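For the sample mean, the standard error is usually estimated as the sample standard deviation divided by the square root of the sample size. A short sketch on a hypothetical sample:

```python
import math
import statistics


def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical sample
sample = [2, 4, 4, 4, 5, 5, 7, 9]
se = standard_error(sample)
print(round(se, 4))
```

Here `statistics.stdev` is the sample standard deviation (n − 1 in the denominator), so the result estimates how much the sample mean would vary from sample to sample.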
List 2 common metrics used to analyze a confusion matrix
Any 2 of the following metrics: accuracy, precision, recall (sensitivity, or true positive rate), specificity, F1 score
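All of these metrics follow from the four cells of the confusion matrix. The sketch below uses hypothetical counts and an illustrative function name.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute common metrics from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity or true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 4))
```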
What does a boxplot show and what library do you have to load to create a boxplot?
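The boxplot shows how the data is distributed (median, quartiles, and range) and it also shows any outliers. In R, load library(ggplot2) to draw one with geom_boxplot(); base R's boxplot() function needs no extra library.
True or False: Binomial distributions come from a binary choice (Y/N, Head/Tail, 0/1, T/F).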
True
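As a small illustration, the binomial probability of k successes in n independent binary trials can be computed with `math.comb`. The coin-flip numbers are just an example.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# Probability of exactly 2 heads in 4 fair coin flips
print(binomial_pmf(2, 4, 0.5))

# The probabilities over all possible k sum to 1
print(sum(binomial_pmf(k, 4, 0.5) for k in range(5)))
```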
True or False: If the p-value is greater than 0.05, you fail to reject the null hypothesis (at the 0.05 significance level).
True
When would you use a Pearson test? A Spearman test?
Use Pearson correlation when you have two continuous variables with a linear relationship and normally distributed data. Use Spearman correlation when you have ordinal data, or when the relationship between the variables is monotonic but non-linear or the data are not normally distributed.
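The difference shows up clearly on a monotonic but non-linear relationship. In this sketch both coefficients are written by hand (Spearman is Pearson applied to the ranks, with tied values given their average rank); the data y = x³ are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]  # monotonic but not linear
print(round(pearson(xs, ys), 4), round(spearman(xs, ys), 4))
```

Spearman reports a perfect monotonic association, while Pearson falls below 1 because the points do not lie on a straight line.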