Intro Statistics Midterm
The standard deviation of the distribution of sample averages is called the standard error of the mean. A) True B) False
A) True
When the data include a lot of variation, a scatterplot is better than a line graph. A) True B) False
A) True
In a scatterplot, adding the ______ can provide a better visual cue about the relationship between the variables. A) trend line B) error bar C) legend D) probability distribution
A) trend line
Standard deviation is a measure of central tendency. A) True B) False
B) False
A ___ distribution is flatter than a ____ distribution. Note: I fixed this question, so only one answer is correct :P A) Leptokurtic; platykurtic B) Platykurtic; leptokurtic C) Mesokurtic; platykurtic D) Leptokurtic; mesokurtic
B) Platykurtic; leptokurtic
If Cramer's V = .5, it indicates a ______ effect size. A) weak B) positive C) strong D) moderate
C) strong
If the median value for a variable is less than its mean, and its mode value is less than its median, then the distribution of values of this variable tends to be... A) Negatively skewed B) None of the above C) Symmetrically distributed D) Positively skewed
D) Positively skewed
Why does traditional null hypothesis significance testing not tell us what we really want to know? A) Because it gives us the probability of the data having arisen given the null hypothesis is true, but we are really interested in the probability of the null hypothesis being true. B) Because there is no way for us to make inferences about the true state of nature. C) Because it gives us the probability of the data having arisen given the null hypothesis is false, but we are really interested in the probability of the data having arisen given the null hypothesis is true. D) Because it gives us the probability of the null hypothesis being true given the data, but we are really interested in the probability of the data given the null hypothesis is true.
A) Because it gives us the probability of the data having arisen given the null hypothesis is true, but we are really interested in the probability of the null hypothesis being true.
What are two benefits of using R? A) It's free and open source B) It's inexpensive and open source C) It's free and includes responsive customer service D) It's open source and a single platform
A) It's free and open source
Which of the following statements about the p-value do you believe to be true? A) The p-value is the probability of obtaining the observed or more extreme results if the null hypothesis is true B) The p-value is the probability of obtaining the observed or more extreme results if the null hypothesis is true C) The p-value is the probability of obtaining the observed or more extreme results if the alternative hypothesis is true D) The p-value is the probability that the alternative hypothesis is true
A) The p-value is the probability of obtaining the observed or more extreme results if the null hypothesis is true
What is the null hypothesis in a chi-square test? A) The rows and columns in the table are not associated B) The rows and columns in the table are associated C) The rows and columns in the table are not the same D) The rows and columns in the table are the same
A) The rows and columns in the table are not associated
A p-value less than .05 is usually considered statistically significant in the biological sciences. A) True B) False
A) True
Assumptions usually refer to the lists of requirements that must be met so that the statistical tests are valid. A) True B) False
A) True
Both the standard deviation and the interquartile range can measure the spread around a measure of central tendency. A) True B) False
A) True
Boxplots can inform whether your data are skewed. A) True B) False
A) True
Even if the values in a sample from a population are not normally distributed, the sample means that may be obtained from the same population are always approximately normally distributed. A) True B) False
A) True
For distributions that are not normally distributed, the median is usually a better measure of central tendency than the mean. A) True B) False
A) True
If your data are skewed, then the mean may not represent the central tendency of the data well. A) True B) False
A) True
Probability density function corresponds to the area under the curve for a continuous probability distribution. A) True B) False
A) True
The chi square test assesses whether there is a statistical relationship between two categorical variables. A) True B) False
A) True
The probability of each real value of some variable is always positive and no greater than 1. A) True B) False
A) True
You can use the degrees of freedom to find the population standard deviation for a chi square distribution. A) True B) False
A) True
A z-score can indicate whether an observation has a value above or below the mean. A) True B) False
A) True
Simply speaking, variance can be understood as ___. A) an average of the squared differences between observations and the mean B) a sum of the differences between observations and the mean C) an average of the differences between observations and the median D) an average of the squared differences between observations and the mode
A) an average of the squared differences between observations and the mean
Which of the following data types includes the data that cannot be logically included in calculations? A) character B) integer C) factor D) numeric
A) character
To calculate the Cramer's V statistic, you need the ______ statistic. A) chi square B) z-score C) alpha D) t-score
A) chi square
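As a rough illustration of how Cramer's V builds on the chi square statistic, here is a minimal base R sketch. The table of counts is made up for illustration, and the formula used is the common one, V = sqrt(chi square / (n * (k - 1))), where k is the smaller of the number of rows and columns.

```r
# Hypothetical 2 x 3 table of counts (made-up numbers, for illustration only)
counts <- matrix(c(30, 10, 20, 20, 10, 30), nrow = 2)

# Chi square test of independence; the statistic is stored in $statistic
test <- chisq.test(counts)
chi_sq <- unname(test$statistic)

n <- sum(counts)                      # total number of observations
k <- min(nrow(counts), ncol(counts))  # smaller of the row and column counts

# Cramer's V rescales chi square to a 0-1 effect size
cramers_v <- sqrt(chi_sq / (n * (k - 1)))
cramers_v
```

With these made-up counts, V comes out a little above .4.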
The c() function stands for ______. A) combine B) complete C) contrast D) conclude
A) combine
A teacher monitored the number of people texting during class each day and calculated the corresponding probability distribution. What type of probability distribution did the teacher use? A) discrete B) continuous C) logical D) categorical
A) discrete
Before conducting inferential analyses to understand a population with sample data, you should start with knowing the sample data using descriptive statistics and graphics. This step is often called ______. A) exploratory data analysis B) analysis of first trial C) data exploration using chi square D) primary attempt
A) exploratory data analysis
Using the palmerpenguins data that you are familiar with, which of the ggplot() commands below would produce a graph that is appropriate for the data types chosen and does not generate an error? A) ggplot(data = penguins, aes(x = species, y = bill_length_mm, fill = sex)) + geom_point() B) ggplot(data = penguins, aes(x = 'species', y = 'bill_length_mm')) + geom_bar(aes(fill = sex), stat = "identity") C) ggplot(data = penguins, aes(x = species, y = bill_length_mm)) + geom_bar(aes(fill = 'sex'), stat = "identity") D) ggplot(data = penguins, aes(x = species, y = bill_length_mm)) + geom_bar(color = 'blue', stat = "identity")
A) ggplot(data = penguins, aes(x = species, y = bill_length_mm, fill = sex)) + geom_point() *this was regraded from the midterm because a lot of people missed it. None of these produced a perfect graph, but the first one came closest. My mistake for not providing better code samples with a clear winner.
Which graph shows the overall shape of the data? A) histogram B) boxplot C) pie chart D) bar graph
A) histogram
The chi square statistic ______. A) is positive or zero B) is negative or zero C) can be positive, negative, or zero D) is negative and cannot be zero
A) is positive or zero
When there is no relationship between the two variables, the probability of a chi square value being a given value or higher in the sample is called the ______. A) p-value B) t-statistic C) z-score D) confidence interval
A) p-value
The expected value of a binomial random variable is calculated as ______. A) the sample size * the probability of success B) the probability of success + the probability of failure C) the probability of failure/the sample size D) the sample size/population size
A) the sample size * the probability of success
The probability density function of the chi square distribution shows the probability of a value of chi square occurring when ______. A) there is no relationship between the two variables contributing to the chi square B) the two variables are linearly correlated C) the value is larger than the mean D) there is a positive relationship between the two variables
A) there is no relationship between the two variables contributing to the chi square
A ______ confidence interval means that 5% of observations are in the tails of the distribution. A) 105% B) 95% C) 75% D) 5%
B) 95%
If the p-value of a given test statistic is 0.03, then: A) At the 0.05 significance level, you would reject the null hypothesis. B) All of the above C) Assuming the null hypothesis is true, there is a 3% chance of getting a more extreme test statistic. D) It is unlikely that the extremeness of the test statistic is due to chance.
B) All of the above
A boxplot usually consists of a line representing the mode. A) True B) False
B) False
A variable that is normally distributed can still be right- or left-skewed. A) True B) False
B) False
Assume that the 95% confidence interval of the mean of a variable is -0.1 to 1.7. The probability that the true mean is greater than 0 is at least 95%. A) True B) False
B) False
The chi square independence test determines what makes the relationship significant between two categorical variables. A) True B) False
B) False
Compared to alpha = 0.05, when alpha = 0.001, the difference between observed and expected would have to be smaller to reach statistical significance. A) True B) False
B) False *this was regraded from the midterm because a lot of people missed it. Smaller significance levels (alpha) mean lower chance of type I errors (claiming that the null hypothesis is false when in fact it is true). Being more certain about not making this mistake requires larger effect sizes, all else being equal.
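To make the note above concrete, here is a small base R sketch comparing the chi square critical values at the two alpha levels. The choice of 1 degree of freedom is an assumption for illustration only; the same pattern holds for other degrees of freedom.

```r
df <- 1  # assumed degrees of freedom, for illustration only

# Critical value: the chi square a sample must exceed to reach significance
crit_05  <- qchisq(1 - 0.05, df)   # about 3.84
crit_001 <- qchisq(1 - 0.001, df)  # about 10.83

crit_05
crit_001
```

The smaller alpha has the larger critical value, so the observed counts have to differ more from the expected counts, not less.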
The central limit theorem indicates that the population standard deviation can be estimated using repeated random sampling. A) True B) False
B) False *this was regraded from the midterm because a lot of people missed it. The central limit theorem is about the distribution of sample means around the population mean. While we can calculate a standard deviation across the means of repeated random samples, this would not be the population standard deviation; it would be conceptually similar to the standard error, which indicates how much confidence we can place in our estimate of the mean.
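The distinction in the note above can be checked with a short simulation. This is only a sketch: the exponential population, the sample size of 50, the 5000 repetitions, and the seed are all arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n <- 50      # size of each random sample
sigma <- 1   # standard deviation of an Exponential(rate = 1) population

# Draw 5000 random samples and keep the mean of each one
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

sd(sample_means)  # spread of the sample means
sigma / sqrt(n)   # theoretical standard error, about 0.141
```

The standard deviation of the sample means lands near sigma / sqrt(n), well below the population standard deviation of 1, even though the population itself is strongly right-skewed.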
What does a typical R function act on? A) library B) argument C) package D) help
B) argument
Which of the following lines of code will not remove NA from the column "bill_length_mm" in the palmerpenguins data, assuming the dataframe is called "palmerpenguins"? A) cleaned_data <- na.omit(palmerpenguins$bill_length_mm) B) cleaned_data <- palmerpenguins %>% filter(!is.na(palmerpenguins$bill_length_mm)) C) cleaned_data <- drop_na(palmerpenguins, bill_length_mm) D) cleaned_data <- palmerpenguins %>% filter(!is.na(bill_length_mm))
B) cleaned_data <- palmerpenguins %>% filter(!is.na(palmerpenguins$bill_length_mm)) *this was regraded from the midterm because a lot of people missed it. Notice the "NOT" in the question. B uses the dplyr pipe (%>%), which does not require the data frame name inside the filter() function.
______ computes the probability that an exact number of successes happens for a discrete random variable. A) pbinom() B) dbinom() C) qbinom() D) ebinom()
B) dbinom()
For a categorical variable, a ______ shows the number of observations in each category. A) median B) frequency distribution C) percentile term D) scattering distribution
B) frequency distribution
If you want to set the color of all points in a scatter plot as green, which of the following codes is correct? A) ggplot(mpg, aes(displ, hwy), color = "green") + geom_point() B) ggplot(mpg, aes(displ, hwy)) + geom_point(color = "green") C) ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = "green")) D) ggplot(mpg, aes(displ, hwy, color = "green")) + geom_point()
B) ggplot(mpg, aes(displ, hwy)) + geom_point(color = "green")
One of the assumptions of conducting a chi square test is that the observations must be ______. A) depending on the sample size B) independent of the others C) depending on the previous value D) a fixed value
B) independent of the others
What do you need to calculate first before calculating kurtosis? A) mean and median B) mean and standard deviation C) mode and mean D) median and mode
B) mean and standard deviation
The degrees of freedom is the ______ of the chi square distribution. A) sample mean B) population mean C) standard errors D) variance
B) population mean *this was regraded from the midterm because a lot of people missed it. The df are used to calculate both standard deviation and variance of the chi-squared distribution but that is not the question.
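One way to check the relationship between the degrees of freedom and the mean and variance of the chi square distribution is to integrate its density numerically in base R. The value df = 4 is an arbitrary choice for this sketch.

```r
df <- 4  # arbitrary degrees of freedom, for illustration only

# Mean: integral of x * density over all possible chi square values
chi_mean <- integrate(function(x) x * dchisq(x, df), 0, Inf)$value

# Variance: E[X^2] - (E[X])^2
chi_second <- integrate(function(x) x^2 * dchisq(x, df), 0, Inf)$value
chi_var <- chi_second - chi_mean^2

chi_mean       # approximately df
chi_var        # approximately 2 * df
sqrt(chi_var)  # standard deviation, approximately sqrt(2 * df)
```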
The shape of the chi square distribution is usually ______. A) bell shape B) right-skewed C) left-skewed D) a straight line
B) right-skewed
A statistic used to make a decision about the hypothesis of interest is called A) critical region B) test statistic C) significance levels D) statement of hypothesis
B) test statistic
Which of the following statements is not true? In repeated samples, the sample average will tend to be closer to the true population mean when ___ A) the sample size is big. B) the population variance is large. C) the population mean is large relative to the population standard deviation. D) the standard error of the sample average is small. E) more observations are sampled.
B) the population variance is large.
In a two-way contingency table of count data, differences between observed values and expected values indicate that ______. A) the sample size needs to be larger B) there may be a relationship between the variables C) there is a calculation error D) the expected values are too high
B) there may be a relationship between the variables
A normally distributed variable has approximately 95% of values within ______ standard deviations of the mean. A) one B) two C) three D) four
B) two
The standardized residual is distributed similarly to ______. A) t-score B) z-score C) chi square D) alpha
B) z-score
______ is the sum of the values divided by the number of values. A) Mode B) Standard deviations C) Mean D) Median
C) Mean
The null hypothesis usually ______. A) computes the standard errors B) tests the statistical significance of results C) claims there is no difference or no relationship between variables D) shows the z-score
C) claims there is no difference or no relationship between variables
The shape of chi square distribution is defined by ______. A) sample size B) chi square C) degrees of freedom D) standard deviation
C) degrees of freedom
The Yates' continuity correction can be used when ______. A) the sample size is too small B) the difference between observed and expected is large C) each variable has only two categories D) the z-score is close to the mean
C) each variable has only two categories
In a bar graph, adding ______ can inform the spread of the data. A) median line B) probability function C) error bar D) legend
C) error bar
If one observation is extremely large compared to other observations of the variable, which of the following statistics may be influenced the most? A) median B) mode C) mean D) minimum
C) mean
The ______ is the middle number when the values of the variable are in order from smallest to largest. A) mode B) minimum C) median D) mean
C) median
From the code below, ___ is the data frame, and ____ is a variable in the data. pd_cleaned %>% group_by(Sex) %>% summarize(Frequency = n()) %>% mutate(Percent = 100 * (Frequency / sum(Frequency)), `Valid Percent` = ifelse(!is.na(Sex), 100 * (Frequency / (sum(Frequency[!is.na(Sex)]))), NA)) A) pd_cleaned; Frequency B) pd; Sex C) pd_cleaned; Sex D) Sex; pd_cleaned
C) pd_cleaned; Sex
A large chi square value suggests ______. A) only positive relationship between two variables B) both variables have large degrees of freedom C) relationship between two variables D) no relationship between two variables
C) relationship between two variables
The binomial distribution is defined by ______. A) the number of observations B) the probability of failure C) the number of observations and the probability of success D) the probability of success and the probability of failure
C) the number of observations and the probability of success
For the data frame format, ______. A) it can be only imported from Excel spreadsheet B) no more than one data frame can be saved in the environment at the same time C) the rows are observations and the columns are variables D) it contains only numerical data type
C) the rows are observations and the columns are variables
For a normal distribution, what is the number of standard deviations away from the mean containing 95% of observations? A) 1.85 B) 2.15 C) 3.0 D) 1.96
D) 1.96
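The 1.96 figure, and the "approximately two" rule of thumb, can both be recovered from the standard normal distribution in base R. This is a minimal sketch using qnorm() and pnorm().

```r
# Number of standard deviations that leaves 2.5% in each tail (95% in the middle)
z_95 <- qnorm(0.975)
z_95  # about 1.96

# Share of a normal distribution within exactly 2 standard deviations of the mean
within_two <- pnorm(2) - pnorm(-2)
within_two  # about 0.954, which is why "two" is only approximate
```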
You have two categorical variables. The first variable has two categories and the second variable has four categories. The chi square distribution would have ______ degrees of freedom. A) 8 B) 7 C) 4 D) 3
D) 3
The lowest expected value in each contingency table cell in order for a chi-square test to be effective is... A) 4 B) 2 C) 3 D) 5
D) 5
Assume that the null hypothesis is false. Which of the following statements is true? A) A study with a larger sample is MORE likely than a smaller study to get the result that p>0.05 B) A study with a larger sample is EQUALLY likely than a smaller study to get the result that p>0.05 C) None of the above D) A study with a larger sample is LESS likely than a smaller study to get the result that p>0.05
D) A study with a larger sample is LESS likely than a smaller study to get the result that p>0.05
What best describes a bin in the histogram? A) It indicates the average of each group B) None of the above C) It shows the maximum value of each group D) It contains a certain proportion of the observations E) It includes the extreme values of the distribution
D) It contains a certain proportion of the observations
______ measures how many observations are in the tails of a distribution. A) Standard deviation B) Range C) Skewness D) Kurtosis
D) Kurtosis
Statistical power is ... A) The probability of concluding there is no effect when in fact there is none. B) The probability of concluding there is no effect when in fact there is one. C) The probability of concluding there is an effect when in fact there is none. D) The probability of concluding there is an effect when in fact there is one.
D) The probability of concluding there is an effect when in fact there is one.
If you know the z-score, standard deviation (s), and mean (M), what formula would you use to compute the value of the single observation (raw score, X)? A) z = (M - X) / s B) M = z * s + X C) X = (z + s) / M D) X = z * s + M
D) X = z * s + M
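The formula can be verified with a quick round trip in R. The mean, standard deviation, and raw score below are hypothetical values chosen only for illustration.

```r
m <- 100  # hypothetical mean (M)
s <- 15   # hypothetical standard deviation (s)
x <- 130  # hypothetical raw score (X)

# Standardize: how many standard deviations is x from the mean?
z <- (x - m) / s
z

# Reverse the calculation to recover the raw score from the z-score
x_back <- z * s + m
x_back
```

A positive z means the observation is above the mean; a negative z means it is below.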
As compared with a 95% confidence interval, a 99% confidence interval would have ______. A) a range that cannot be determined B) a smaller range of values C) the same range of values D) a wider range of values
D) a wider range of values
For density plots, the area under the curve ______. A) means the value of observations B) is separated into bins C) does not count the outliers of the data D) adds up to 100% of the data
D) adds up to 100% of the data
For binomial distribution, a ______ function determines the probability of success over a range of values. A) probability mass B) sample C) probability density D) cumulative distribution
D) cumulative distribution *this was regraded from the midterm because a lot of people missed it. The probability mass function calculates the probability of an exact value of x. The probability density function applies to a continuous outcome distribution like the normal distribution. Only the CDF does what is being asked here.
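The difference between the probability mass function and the cumulative distribution function described in the note above can be seen with dbinom() and pbinom(). The 10 trials and 0.5 probability of success are assumed values for this sketch.

```r
n <- 10   # assumed number of trials
p <- 0.5  # assumed probability of success

# Probability mass function: exactly 3 successes
exactly_3 <- dbinom(3, size = n, prob = p)

# Cumulative distribution function: 3 or fewer successes (a range of values)
at_most_3 <- pbinom(3, size = n, prob = p)

# The CDF is the sum of the PMF over the range 0 to 3
summed <- sum(dbinom(0:3, size = n, prob = p))

# Expected value of a binomial variable: sample size * probability of success
expected <- n * p

exactly_3
at_most_3
summed
expected
```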
Probability mass function corresponds to ___ distributions, while probability density function corresponds to ____ distributions. A) continuous; logical B) continuous; discrete C) logical; continuous D) discrete; continuous
D) discrete; continuous
The strength of a relationship between two variables in statistics is referred to as ______. A) significance B) degrees of freedom C) confidence D) effect size
D) effect size
The interquartile range (IQR) is the difference between ___ and ____ quartiles. A) third and fourth B) second and fourth C) first and fourth D) first and third
D) first and third
The vertical y-axis in histogram is ______. A) mean B) probability density C) percentile D) frequency
D) frequency
What do you need to calculate first before calculating skewness? A) mean and median B) median and mode C) mode and mean D) mean and standard deviation
D) mean and standard deviation *this was regraded from the midterm because a lot of people missed it. We did not cover the formula for calculating skew, only the R package to do so. Visual assessment and comparison of mean, median, mode is OK here. Calculation of sample skewness requires mean and standard deviation (see textbook).
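For reference, here is a base R sketch of one common way to compute sample skewness, showing why the mean and standard deviation come first. The data are made up, and this is not necessarily the exact formula used by the textbook or by any particular R package, which may apply slightly different corrections.

```r
# Made-up data with one large value pulling the right tail out
x <- c(1, 2, 2, 3, 3, 3, 4, 10)

# Step 1: the mean and standard deviation are needed to standardize the data
z <- (x - mean(x)) / sd(x)

# Step 2: skewness as the average of the cubed standardized values
skew <- mean(z^3)
skew  # positive, so the data are right (positively) skewed

# Quick visual-style check: the mean is pulled above the median
mean(x)
median(x)
```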
The vertical y-axis in density plot is ______. A) probability mass B) frequency C) counts D) probability density
D) probability density
The shape of the normal distribution is determined by ______. A) the probability of success and the variance B) the variance and the mean probability C) the sample size and the probability of success D) the mean and the standard deviation
D) the mean and the standard deviation
In a boxplot, what do the error bars (whiskers) represent? A) the standard deviation around the mean B) the range of values between minimum and maximum C) none of the above D) the range of values within 1.5 times the interquartile range E) the interquartile range (IQR)
D) the range of values within 1.5 times the interquartile range
What do you need to calculate the standard error of the mean? A) the population variance B) the population size and standard deviation of the population C) the population distribution D) the sample size and standard deviation of a sample
D) the sample size and standard deviation of a sample
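A minimal base R sketch of the calculation, using a hypothetical sample: the standard error of the mean is the sample standard deviation divided by the square root of the sample size.

```r
# Hypothetical sample of 8 observations
x <- c(4, 8, 6, 5, 3, 7, 9, 6)

n <- length(x)  # sample size
s <- sd(x)      # sample standard deviation

# Standard error of the mean
se <- s / sqrt(n)
se
```

For this made-up sample the standard deviation is 2, so the standard error is 2 / sqrt(8), roughly 0.71.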
What is the first step to conducting Null Hypothesis Significance Testing? A) calculate test statistic B) compute z-score C) create a bar graph D) write the null and alternate hypotheses
D) write the null and alternate hypotheses