Stat superset
Central Limit Theorem
"As sample size increases, the distribution of the sample mean, x̄, becomes more and more normally distributed, with mean μ and standard deviation σ/√n." For a relatively large sample size, the sample mean is approximately normally distributed regardless of the distribution of the variable under consideration. The approximation improves with increasing sample size: the histogram of sample means looks more and more bell-shaped.
Aᶜ
("A complement") The complement of the event A
A∩B
("A intersection B" or "A and B") The intersection of events A and B
A∪B
("A union B" or "A or B") The union of events A and B.
(N) (n)
("N choose n") The number of combinations in which n objects are chosen from the N possible objects
P(A|B)
("Probability of A given B") The probability of A given that event B has occurred
P(A)
("Probability of A") The probability that event A occurs
S
("Script capital S" or "bold capital S") The sample space
η
("eta") The population median
µ
("mu") The population mean
p̂
("p-hat") the sample proportion
σ²
("sigma squared") The population variance
σ
("sigma") The population standard deviation
A,B,C
(Capital letters at beginning of the alphabet) an event in probability theory
Central Tendency
- Mean - Median - Mode
Variability / Dispersion
- Variance - Standard Deviation
If X ~ unif(-2.5, 1.6), what is the mean of X?
-0.45
Obtain the z-score for which the area under the standard normal curve to its left is 0.025.
-1.96
Assumptions for ANOVA (5)
- Data come from independent samples - The k populations are normal - Population variances are equal - The treatment effects τj are unknown constants - The τj sum to 0, and the errors eij are normally distributed with mean 0 and variance σ²
Space Shuttles. The National Aeronautics and Space Administration (NASA) compiles data on space-shuttle launches and publishes them on its Web site. The table below displays a frequency distribution for the number of crew members on each shuttle mission from April 1981 to July 2000. Let X denote the crew size of a randomly selected shuttle mission between April 1981 and July 2000. Find P(X = 4). Crew size (frequency): 2 (4), 3 (1), 4 (2), 5 (36), 6 (18), 7 (33), 8 (2)
.0208
Determine the area under the standard normal curve that lies between -2.0 and -1.5.
.0440
Determine the area under the standard normal curve that lies to the left of -1.56.
.0594
Determine the area under the standard normal curve that lies either to the left of -2.12 or to the right of 1.67.
.0645
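The table lookups in these area problems can be cross-checked numerically. The sketch below assumes Python 3.8 or later and uses the standard library's `statistics.NormalDist`; the helper names `area_left`, `area_right`, and `area_between` are illustrative only. Results may differ from table-based answers in the last decimal place, because tables round intermediate values.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def area_left(z):
    """Area under the standard normal curve to the left of z."""
    return Z.cdf(z)


def area_right(z):
    """Area under the standard normal curve to the right of z."""
    return 1.0 - Z.cdf(z)


def area_between(a, b):
    """Area under the standard normal curve between a and b (a < b)."""
    return Z.cdf(b) - Z.cdf(a)


print(f"between -2.0 and -1.5:            {area_between(-2.0, -1.5):.4f}")
print(f"left of -1.56:                    {area_left(-1.56):.4f}")
print(f"left of -2.12 or right of 1.67:   {area_left(-2.12) + area_right(1.67):.4f}")
```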
Determine the area under the standard normal curve that lies between 1.1 and 4.2.
.1357
Determine the area under the standard normal curve that lies between 0.59 and 1.51.
.2121
Determine the area under the standard normal curve that lies to the right of 0.6.
.2743
Pinworm infestation, commonly found in children, can be treated with the drug pyrantel pamoate. According to the Merck Manual, the treatment is effective in 90% of cases. Suppose that a simple random sample of n = 10 children with pinworm infestation are given pyrantel pamoate. What is the probability that exactly 9 of the children are cured?
.3874
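The binomial probabilities in the pinworm cards can be reproduced from the formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). This is a minimal sketch assuming Python 3.8 or later (for `math.comb`); the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.9  # ten children, 90% cure rate (values from the card)

p_exactly_9 = binomial_pmf(9, n, p)
# "At least 9" means 9 or 10 successes.
p_at_least_9 = binomial_pmf(9, n, p) + binomial_pmf(10, n, p)

print(f"P(X = 9)  = {p_exactly_9:.4f}")
print(f"P(X >= 9) = {p_at_least_9:.4f}")
```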
Determine z₀.₃₃
.44
The number of tropical cyclones that make landfall on the U.S. coast in the month of August follows a Poisson distribution with parameter λ = 0.5814. What is the probability that at least one tropical cyclone will strike the coast in a randomly selected August?
.4409
Determine the area under the standard normal curve that lies to the right of 0.
.5
The number of tropical cyclones that make landfall on the U.S. coast in the month of August follows a Poisson distribution with parameter λ = 0.5814. What is the probability that no tropical cyclones will strike the coast in a randomly selected August?
.5591
The number of tropical cyclones that make landfall on the U.S. coast in the month of August follows a Poisson distribution with parameter λ = 0.5814. How many tropical cyclones are expected to make landfall on the U.S. coastline in any randomly selected August?
.5814
Find the z-score that has an area of 0.75 to its left under the standard normal curve
.67
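Finding a z-score from an area is the inverse of the area lookup. The sketch below assumes Python 3.8 or later and uses `statistics.NormalDist.inv_cdf`; the helper names `z_with_left_area` and `z_alpha` are illustrative, with `z_alpha` following the convention that z with subscript α has area α to its right.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal


def z_with_left_area(area):
    """z-score with the given area to its left."""
    return Z.inv_cdf(area)


def z_alpha(alpha):
    """z-score with area alpha to its right (the z-sub-alpha notation)."""
    return Z.inv_cdf(1.0 - alpha)


print(f"area 0.025 to the left: z = {z_with_left_area(0.025):.2f}")
print(f"area 0.75 to the left:  z = {z_with_left_area(0.75):.2f}")
print(f"z_0.33 = {z_alpha(0.33):.2f}")
```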
Pinworm infestation, commonly found in children, can be treated with the drug pyrantel pamoate. According to the Merck Manual, the treatment is effective in 90% of cases. Suppose that a simple random sample of n = 10 children with pinworm infestation are given pyrantel pamoate. What is the probability that at least 9 of the children are cured?
.7361
The number of tropical cyclones that make landfall on the U.S. coast in the month of August follows a Poisson distribution with parameter λ = 0.5814. What is the standard deviation of the number of tropical cyclones that make landfall on the U.S. coastline in any randomly selected August?
.7625
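The tropical cyclone cards all follow from the Poisson formula P(X = k) = e^(-λ) λ^k / k!, with mean λ and standard deviation √λ. A minimal sketch, using only the standard library; `poisson_pmf` is an illustrative name.

```python
from math import exp, factorial, sqrt


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


lam = 0.5814  # parameter given in the cards

p_none = poisson_pmf(0, lam)
p_at_least_one = 1.0 - p_none  # complement of "no cyclones"
expected = lam                 # mean of a Poisson variable
std_dev = sqrt(lam)            # standard deviation of a Poisson variable

print(f"P(X = 0)  = {p_none:.4f}")
print(f"P(X >= 1) = {p_at_least_one:.4f}")
print(f"mean = {expected:.4f}, standard deviation = {std_dev:.4f}")
```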
Determine the area under the standard normal curve that lies either to the left of 0.63 or to the right of 1.54.
.7975
Determine the area under the standard normal curve that lies to the right of -1.07.
.8577
Determine the area under the standard normal curve that lies between -2.18 and 1.44.
.9105
Pinworm infestation, commonly found in children, can be treated with the drug pyrantel pamoate. According to the Merck Manual, the treatment is effective in 90% of cases. Suppose that a simple random sample of n = 10 children with pinworm infestation are given pyrantel pamoate. What is the standard deviation of the number of children in the sample who will be effectively treated using pyrantel pamoate?
.9487
Determine the area under the standard normal curve that lies to the right of 4.2.
0
In the simple linear regression model, what is the assumed expected value of the error variable?
0
Values of R-squared close to ____ indicate a poor fit
0
An event that cannot occur has probability ___ and is called a _____ event.
0 impossible
Determine the area under the standard normal curve that lies to the left of -4.00.
0.0000
Determine the area under the standard normal curve that lies to the left of 0.
0.5
Determine the area under the standard normal curve that lies to the left of 2.24.
0.9875
If y = -3x + 7, then y decreases by ____ when x increases by 4
12 (the slope is -3, so y changes by -3 × 4 = -12)
In simple linear regression analysis, if the residual sum of squares is zero, then the coefficient of determination r^2 must be _____
1
Values of R-squared close to ____ indicate a good fit
1
An event that must occur has probability ____ and is called a ____ event
1 certain
Practice
One SRS of size n is selected, in the hope that it represents the population. Sample -> calculated statistic -> decision made about the population
Empirical Rule
1) about 68% of the measurements will fall within 1 standard deviation of the mean 2) about 95% will fall within 2 standard deviations of the mean 3) about 99.7% will fall within 3 standard deviations of the mean
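For a distribution that is exactly normal, the Empirical Rule percentages can be checked directly. The sketch below assumes Python 3.8 or later and uses `statistics.NormalDist`; `fraction_within` is an illustrative helper name. Real data that are only roughly mound-shaped will match these figures only approximately.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.4f}")
```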
characteristics of a poisson random variable
1) the experiment consists of counting the number of times a certain event occurs in a given unit of time or in a given area, volume, weight, distance, etc. 2) the probability that an event occurs in a given unit of time, area, or volume is the same for all units 3) the number of events that occur in one unit of time, area, or volume is independent of the number that occur in other units 4) the mean number of events in each unit is denoted by lambda
characteristics of a binomial random variable
1) exp. consists of n identical trials 2) only 2 possible outcomes for each trial, one denoted by S (for success) and one by F (for failure) 3) probability of S remains the same from trial to trial; this probability = p, and the probability of F = (1-p) = q 4) the trials are independent 5) the binomial random variable X is the number of S's in n trials
Chebyshev's Rule
1) it is possible that very few of the measurements will fall within one standard deviation of the mean 2) at least 3/4 of the measurements will fall within two standard deviations of the mean 3) at least 8/9 of the measurements will fall within three standard deviations of the mean 4) generally, for any number k greater than 1, at least (1 - 1/(k^2)) of the measurements will fall within k standard deviations of the mean
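The general bound in item 4 is a one-line computation. A minimal sketch follows; the function name `chebyshev_lower_bound` is an illustrative choice, and the guard reflects the rule's requirement that k be greater than 1.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule gives a useful bound only for k > 1")
    return 1 - 1 / k**2


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.4f} of the measurements")
```

Unlike the Empirical Rule, this bound holds for any data set, which is why it is weaker (at least 3/4 within two standard deviations, versus about 95% for mound-shaped data).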
requirements for the probability distribution of a discrete random variable
1) p(x) is equal to or greater than 0 for all values of x 2) the sum of all the p(x) for all the possible values of x is equal to 1
properties of the sampling distribution of x-bar
1) the mean of the sampling distribution of x-bar equals the mean of the sampled population 2) the standard deviation of the sampling distribution of x-bar is equal to sigma (subscript x-bar) = sigma / square root of n --> this is called the standard error of the mean
To make an inference about a population parameter, we use the sample statistic with a sampling distribution that is
1) unbiased, and 2) has smaller standard deviation (standard error) than any other statistic with a sampling distribution that is unbiased
3 Steps to Check Assumptions for Regression Model:
1. Look at the residual plot (should be random about 0); verifies assumptions 1-3. 2. Look at the Q-Q plot of the residuals (points should cluster closely around a straight line); verifies assumption 4. 3. Check for outliers (compare each residual to +/-3(Root MSE)).
Assumptions of Regression Model:
1. Relationship between Independent and Dependent Variable is linear. 2. Errors are independent. 3. Errors have Constant Variance. 4. Errors are normally distributed.
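The quantities used when checking these assumptions (residuals, Root MSE, and r^2) can be computed by hand for a small data set. The sketch below is a plain least squares fit using only the standard library; the data values and the function names `fit_line`, `residuals`, and `r_squared` are hypothetical and chosen only for illustration.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed y minus fitted y for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def r_squared(ys, resid):
    """Coefficient of determination: 1 - SSE / SST."""
    y_bar = sum(ys) / len(ys)
    sse = sum(e**2 for e in resid)
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
resid = residuals(xs, ys, b0, b1)
root_mse = sqrt(sum(e**2 for e in resid) / (len(xs) - 2))
# Flag residuals beyond +/- 3 * Root MSE as possible outliers.
outliers = [e for e in resid if abs(e) > 3 * root_mse]

print(f"fitted line: y = {b0:.3f} + {b1:.3f} x")
print(f"r^2 = {r_squared(ys, resid):.4f}, Root MSE = {root_mse:.4f}")
print(f"possible outliers: {outliers}")
```

Plotting `resid` against `xs` would give the residual plot described in the checking steps; the plotting itself is left out to keep the sketch dependency-free.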
1. T/F: The mean is always the best measure of central tendency. 2. If we are comparing the spread of two samples, which sample is the more variable of the two?
1. False; for skewed data or data with outliers, the median is often a better measure of central tendency. 2. The sample with the larger standard deviation (or variance) is the more variable.
Match the sampling schemes to their respective experimental design equivalents 1. simple random sample 2. stratified random sample 3. cluster sample a. randomized complete block design b. no experimental design taught in this course c. completely randomized design
1. c 2. a 3. b
If X ~ unif(-2.5, 1.6), what is the standard deviation of X?
1.1836
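For X ~ unif(a, b), the mean is (a + b)/2 and the standard deviation is (b - a)/√12. A short check of the two uniform cards, using only the standard library; the function names are illustrative.

```python
from math import sqrt


def uniform_mean(a, b):
    """Mean of a continuous uniform distribution on [a, b]."""
    return (a + b) / 2


def uniform_sd(a, b):
    """Standard deviation of a continuous uniform distribution on [a, b]."""
    return (b - a) / sqrt(12)


a, b = -2.5, 1.6
print(f"mean = {uniform_mean(a, b):.2f}")
print(f"standard deviation = {uniform_sd(a, b):.4f}")
```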
Space Shuttles. The National Aeronautics and Space Administration (NASA) compiles data on space-shuttle launches and publishes them on its Web site. The table below displays a relative frequency distribution for the number of crew members on each shuttle mission from April 1981 to July 2000. Let X denote the crew size of a randomly selected shuttle mission between April 1981 and July 2000. Find the standard deviation of the number of crew members on a space shuttle mission between April 1981 and July 2000. Crew size (relative frequency): 2 (.0417), 3 (.0104), 4 (.0208), 5 (.3750), 6 (.1875), 7 (.3438), 8 (.0208)
1.26
Determine 5!
120
How many samples of size 5 are possible from a population of size 70?
12103014
Determine the value of 15C4
1365
In the Pick 6 lottery, six integers are randomly selected out of 50. How many different combinations of integers can occur?
15890700
The authors of the article "Adjuvant Radiotherapy and Chemotherapy in Node-Positive Premenopausal Women with Breast Cancer" (New Engl. J. of Med., 1997: 956-962) reported on the results of an experiment designed to compare treating cancer patients with chemotherapy only to treatment with a combination of chemotherapy and radiation. Of the 154 individuals who received the chemotherapy-only treatment, 76 survived at least 15 years, whereas 98 of the 164 patients who received the hybrid treatment survived at least that long. Find a 99% confidence interval for the difference between proportions of those who, when treated with just chemotherapy, survive at least 15 years and the analogous proportion for the hybrid treatment.
Two-proportion z confidence interval
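The card names the procedure without carrying it out. As a hedged sketch of how the interval could be computed, the code below uses the usual large-sample formula (p̂1 - p̂2) ± z* √(p̂1(1 - p̂1)/n1 + p̂2(1 - p̂2)/n2), with z* taken from `statistics.NormalDist` (Python 3.8 or later). The function name `two_proportion_z_interval` is an illustrative choice, not a library routine.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_interval(x1, n1, x2, n2, confidence):
    """Large-sample confidence interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    # Critical value with (1 - confidence) / 2 in each tail.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    std_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * std_error, diff + z_star * std_error


# Chemotherapy only: 76 of 154 survived; hybrid treatment: 98 of 164 survived.
lower, upper = two_proportion_z_interval(76, 154, 98, 164, 0.99)
print(f"99% CI for p1 - p2: ({lower:.4f}, {upper:.4f})")
```

Because the resulting interval contains 0, these data alone would not show a difference between the two survival proportions at the 99% confidence level.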
AML and the Cost of Labor. Active Management of Labor (AML) was introduced in the 1960s to reduce the amount of time a woman spends in labor during the birth process. The following data displays the costs, in dollars, of eight randomly sampled AML deliveries. What is the sample mean cost of these AML deliveries? 3141, 2873, 2116, 1684, 3470, 1799, 2539, 3093
2589.375
AML and the Cost of Labor. Active Management of Labor (AML) was introduced in the 1960s to reduce the amount of time a woman spends in labor during the birth process. The following data displays the costs, in dollars, of eight randomly sampled AML deliveries. What is the sample median cost of these AML deliveries? Costs (in dollars) of Eight AML Deliveries 3141 2873 2116 1684 3470 1799 2539 3093
2706
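The sample mean and median of the eight AML delivery costs can be verified with the standard library's `statistics` module. With an even number of observations, `median` averages the two middle values after sorting.

```python
from statistics import mean, median

# Costs (in dollars) of the eight AML deliveries from the card.
costs = [3141, 2873, 2116, 1684, 3470, 1799, 2539, 3093]

sample_mean = mean(costs)
sample_median = median(costs)  # average of the 4th and 5th sorted values

print(f"sample mean   = {sample_mean}")
print(f"sample median = {sample_median}")
```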
Consider the normal probability density function given below. For this normal distribution, the mean is ___ and the variance is ___. ƒ(x) = (12π)^(-1/2) exp{-(x-3)^2/12}, -∞ < x < ∞.
3 6
Determine the value of 15P4
32760
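The counting cards (factorials, combinations, permutations) map directly onto functions in Python's `math` module. This sketch assumes Python 3.8 or later, where `math.comb` and `math.perm` are available.

```python
from math import comb, factorial, perm

five_factorial = factorial(5)   # 5!
samples_of_5 = comb(70, 5)      # samples of size 5 from a population of 70
fifteen_c_4 = comb(15, 4)       # 15C4: order does not matter
pick_6 = comb(50, 6)            # Pick 6 lottery: 6 integers out of 50
fifteen_p_4 = perm(15, 4)       # 15P4: order matters

print(five_factorial, samples_of_5, fifteen_c_4, pick_6, fifteen_p_4)
```

Note that 15P4 = 15C4 × 4!, since each combination of 4 objects can be arranged in 4! orders.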
Space Shuttles. The National Aeronautics and Space Administration (NASA) compiles data on space-shuttle launches and publishes them on its Web site. The table below displays a relative frequency distribution for the number of crew members on each shuttle mission from April 1981 to July 2000. Let X denote the crew size of a randomly selected shuttle mission between April 1981 and July 2000. Find the mean number of crew members on a space shuttle mission between April 1981 and July 2000. Crew size (relative frequency): 2 (.0417), 3 (.0104), 4 (.0208), 5 (.3750), 6 (.1875), 7 (.3438), 8 (.0208)
5.77
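The mean and standard deviation of a discrete random variable come from μ = Σ x p(x) and σ² = Σ (x - μ)² p(x). The sketch below applies these to the crew-size table and also checks the two requirements for a discrete probability distribution; the variable and function names are illustrative.

```python
from math import sqrt

# Relative frequency distribution of crew size, taken from the card.
crew_distribution = {2: 0.0417, 3: 0.0104, 4: 0.0208, 5: 0.3750,
                     6: 0.1875, 7: 0.3438, 8: 0.0208}


def is_valid_distribution(dist, tol=1e-3):
    """Check p(x) >= 0 for all x and that the probabilities sum to 1 (within rounding)."""
    return all(p >= 0 for p in dist.values()) and abs(sum(dist.values()) - 1) < tol


def discrete_mean(dist):
    return sum(x * p for x, p in dist.items())


def discrete_sd(dist):
    mu = discrete_mean(dist)
    return sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))


print(f"valid distribution: {is_valid_distribution(crew_distribution)}")
print(f"mean = {discrete_mean(crew_distribution):.2f}")
print(f"standard deviation = {discrete_sd(crew_distribution):.2f}")
```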
Pinworm infestation, commonly found in children, can be treated with the drug pyrantel pamoate. According to the Merck Manual, the treatment is effective in 90% of cases. Suppose that a simple random sample of n = 10 children with pinworm infestation are given pyrantel pamoate. How many children of the 10 are expected to be effectively treated by the drug?
9
Suppose for a sample size of 75, we have a mean of 13 and a standard deviation of 2. Then a 96% confidence interval for the mean of the sampled population would have left and right end points 13-(2.06)(2/√75) and 13+(2.06)(2/√75), respectively. If we constructed such a confidence interval for each of 100 samples, we could expect about __________ of them to contain the population mean.
96. With a 96% confidence level, about .96 x 100 = 96 of the 100 intervals are expected to contain the population mean.
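The end points quoted in the card can be evaluated numerically. This sketch takes the critical value 2.06 exactly as given in the card rather than recomputing it; the function name `mean_confidence_interval` is illustrative.

```python
from math import sqrt


def mean_confidence_interval(x_bar, s, n, critical_value):
    """Interval x_bar +/- critical_value * s / sqrt(n)."""
    margin = critical_value * s / sqrt(n)
    return x_bar - margin, x_bar + margin


# Values from the card: n = 75, mean 13, standard deviation 2, critical value 2.06.
left, right = mean_confidence_interval(13, 2, 75, 2.06)
expected_hits = 0.96 * 100  # intervals out of 100 expected to contain the mean

print(f"96% CI: ({left:.4f}, {right:.4f})")
print(f"expected number of intervals containing the mean: {expected_hits:.0f}")
```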
equation for finding the area to the right of a number "a" for an exponential distribution
A = P(x ≥ a) = e^(-a/θ)
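The right-tail area for an exponential distribution is a single call to `math.exp`. In the sketch below, the mean θ = 2 and the cutoff a = 3 are hypothetical values chosen only to illustrate the formula.

```python
from math import exp


def exponential_right_tail(a, theta):
    """P(X >= a) for an exponential distribution with mean theta, a >= 0."""
    if a < 0 or theta <= 0:
        raise ValueError("requires a >= 0 and theta > 0")
    return exp(-a / theta)


# Hypothetical example: mean theta = 2, area to the right of a = 3.
area = exponential_right_tail(3, 2)
print(f"P(X >= 3) = {area:.4f}")
```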
pth percentile
A number such that p% of the measurements fall below that number and (100-p)% fall above it
The probability of an intersection of two events can be calculated using the a. multiplicative rule b. additive rule c. subtractive rule d. conditional rule
A. Multiplicative rule
Which of the following is not true of Chebychev's rule? a. it can be used for any number k greater than 0 b. it can be applied to any data set c. it tells us how many measurements fall within k standard deviations of the mean d. all of the above are true of Chebyshev's rule
A. is not true; it can be used for any number k greater than 1
Type II Error
Failing to reject the null hypothesis when it is false and should have been rejected
Using the _____________________ , it is possible to obtain the probability of the union of two events
Additive Rule of Probability
"A union B" is also referred to as "A and B" A) True B) False
B) False, "A union B" is referred to as "A or B", and "A intersection B" is also referred to as "A and B"
When rolling two dice, what is the probability of having at least one roll a 5? a. 11/36 b. 1/12 c. ¼ d. ¾
A. 11/36. Rolling two dice gives 6 x 6 = 36 equally likely outcomes. The probability that neither die shows a 5 is (5/6)(5/6) = 25/36, so the probability of at least one 5 is 1 - 25/36 = 11/36. Equivalently, 5 outcomes have a 5 on the first die only, 5 have a 5 on the second die only, and 1 has a 5 on both, for 11 favorable outcomes out of 36.
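An "at least one" probability for two dice is small enough to check by listing every outcome. The sketch below enumerates all 36 ordered rolls with `itertools.product` and compares the count with the complement rule; exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes in which at least one die shows a 5.
favorable = [roll for roll in outcomes if 5 in roll]
p_at_least_one_5 = Fraction(len(favorable), len(outcomes))

# Complement rule: 1 - P(no 5 on either die).
p_by_complement = 1 - Fraction(5, 6) ** 2

print(f"favorable outcomes: {len(favorable)} of {len(outcomes)}")
print(f"P(at least one 5) = {p_at_least_one_5}")
```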
Which of the following is NOT a symbol used to represent the "mean"? a) X-bar b) Eta c) Mu d) A&C
B. Eta, which is used to denote the median of a population
Application of Theory
Based on Practice
An SRS of size n=150 was taken from a population with unknown distribution. If the normal quantile plot of the sample has an "S" shape, what can be concluded about the distribution?
Based on the "S" shape of the points on the normal quantile plot, we can conclude that the population is symmetric, but not normal.
An SRS of size n=150 was taken from a population with unknown distribution. If the normal quantile plot of the sample has a concave shape, what can be concluded about the distribution?
Based on the concave shape of the points on the normal quantile plot, we can conclude that the population is extremely right skewed.
An SRS of size n=150 was taken from a population with unknown distribution. If the normal quantile plot of the sample has a convex shape, what can be concluded about the distribution?
Based on the convex shape of the points on the normal quantile plot, we can conclude that the population is moderately left skewed.
Inferential Stats
Based on theory
Repeated trials of an experiment are called ___________ trials if the following three conditions are satisfied: 1. Each trial of the experiment has two possible outcomes, denoted s for _____, and f for ________. 2. The trials are _________. 3. The probability of success, called the __________ and denoted p remains the same from trial to trial.
Bernoulli success failure independent success probability
What name is given to the distribution that provides a formula for finding the probabilities associated with the number of successes in a sequence of n independent Bernoulli trials, each having the same probability of success, p?
Binomial distribution
Descriptive statistics a. Utilizes sample data to make estimates b. Is a statement about the degree of uncertainty associated with statistical inference c. Summarizes and describes important features of the data d. Is a measurement that cannot be measured on a natural numerical scale
C is the correct answer. Descriptive statistics is used to describe/summarize any set of data. Using variability and central tendency, a person can conclude what the results mean. This is an example of descriptive statistics.
What is the probability that you will roll an odd number on a 6 sided die? A) 25% B) 50% C) 0.5 D) 0.25
C) 0.5; 50% expresses the same value, but probabilities are conventionally reported as numbers between 0 and 1 rather than as percentages
Which of the following is NOT sensitive to extreme values? a. The range b. The standard deviation c. The interquartile range
C. The Interquartile range is not sensitive to extreme values, because it measures the middle 50% of the data. Therefore, a few outliers do not really affect it.
3 Standard Deviations rule
Calculate the standard deviation of the sample. For approximately normal data, a value has about a 99.74% chance of being within 3 standard deviations of the mean.
Mass of Sampling Distribution
Centered around µ
theorem 6.2 (central limit theorem)
Consider a random sample of n observations selected from a population (any population) with mean μ and standard deviation σ. Then, when n is sufficiently large, the sampling distribution of X̄ will be approximately normal with mean μ and standard deviation σ_X̄ = σ/√n. The larger the sample size, the better will be the normal approximation to the (true) sampling distribution of X̄.
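A small simulation can make the theorem concrete. The sketch below is an illustration under stated assumptions, not part of the theorem: it draws repeated samples from an exponential population with mean 1 and standard deviation 1 (a strongly right-skewed choice), and the sample size, number of samples, and seed are arbitrary. The mean of the sample means should land near μ, and their standard deviation near σ/√n.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

n = 50              # size of each sample (arbitrary choice)
num_samples = 5000  # number of repeated samples (arbitrary choice)
mu, sigma = 1.0, 1.0  # exponential population with rate 1 has mean 1 and sd 1

sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

print(f"mean of sample means: {mean(sample_means):.3f} (theory: {mu})")
print(f"sd of sample means:   {stdev(sample_means):.3f} (theory: {sigma / sqrt(n):.3f})")
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is skewed.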
uniform distribution
Continuous random variables that appear to have equally likely outcomes over a range of possible values have a uniform probability distribution.
What are the 3 principles of experimental design?
Control, randomization, replication
Sally plans to visit Europe this summer. She heard from a friend that a shirt she wanted cost 50 euros in Europe. The same shirt from a different store in America cost 55 American dollars. Which of the following would be the most useful for Sally to see whether buying the shirt in America or Europe would be cheaper? A. A box plot B. Stem-and-leaf plot C. Chebychev's Rule D. z-score E. Empirical Rule
D. Z-Score as it would provide an equal unit to measure the two different currency values.
Which set of fences appear on the boxplot? a) inner fences b) outer fences c) both inner and outer fences d) neither inner or outer fences
D. Neither inner nor outer fences appear on the box plot. They are only used to help construct the plot and are not drawn on it.
Increase in Sample Size
Decrease in the standard deviation (standard error) of the sample mean; decrease in the spread of the sampling distribution
__________ statistics consists of methods for organizing and summarizing information
Descriptive
What are the 2 major branches of statistics?
Descriptive and inferential
Mike is a primatologist who studies monkeys. He wants to understand the behavior of monkeys and their interactions with the environment. Specifically, he wants to know whether primates have a fixed fear of snakes and whether this behavior may explain humans' fear of snakes. He does this by placing a monkey near a fake snake and examining its behavior and reaction. What would we call this method of collecting data? (a) Published source (b) Designed experiment (c) Observational study (d) All of the above
Designed experiment, since Mike has control of the treatment.
One is always more likely to make a Type 1 error than a Type 2 error using statistical hypothesis testing.
F
The population mean, population median, and population mode are all the same value.
F
The same test statistic is used to test all hypotheses for all population parameters.
F
The three assumptions which are needed in order for the one factor ANOVA to yield exact results (exact p-values) are normality of the populations, homoscedasticity of the population variances, and the independence of the observations.
F
Unfortunately, one cannot test hypotheses concerning a single population proportion.
F
A normal distribution always has an expected value of zero.
F A standard normal distribution always has an expected value of zero.
The null distribution is the distribution of the appropriate test statistic, assuming the alternative hypothesis is true.
F Assuming the null hypothesis is true
The Central Limit Theorem is important because, provided the sample size is sufficiently large, it can be applied for determining the sampling distribution of the sampling median without assuming the distribution of the population of interest is known.
F CLT determines distribution of sample mean, not median
A disadvantage of using a 99% confidence level rather than a 95% confidence level for a parameter estimation is that larger samples must be taken for establishing higher levels of confidence.
F Confidence level is independent of sample size.
The p-value is the probability of getting the resulting value of the test statistic that is from the sample data, or one more rare, if, in fact, the alternative hypothesis is true.
F If, in fact, the null hypothesis is true
Increasing the confidence level decreases the width of the interval estimate for the population mean μ, given that the sample data remains the same.
F Increasing confidence level increases the width
In statistical hypothesis testing, if one fails to reject the null hypothesis, then there is evidence to support the null hypothesis.
F The null hypothesis is assumed to be true
The Central Limit Theorem plays a key role in developing a confidence interval for the population mean μ provided the sample size is not too large.
F The sample size must be sufficiently large to apply the Central Limit Theorem.
A 95% confidence interval has a probability of .95 that the parameter μ will be contained in the confidence interval.
F There is no probability associated with μ.
There is only one symmetric bell shaped density function that has mean 0 and standard deviation 1.
F There is only one standard normal, but there are many symmetric bell-shaped densities with mean 0 and standard deviation 1.
One factor ANOVA is applied when we wish to use a hypothesis test to determine if there are differences in the population means of three or more normally distributed or approximately normal populations with equal or approximately equal variances.
F There must only be 2 sources of variability
Every estimator has a sampling distribution having parameters that are totally unrelated to the population of interest from which the data was sampled.
F They are often related
The alternative hypothesis in statistical hypothesis testing should be specified after sample data has been gathered and reviewed.
F before sample data has been gathered and reviewed
The Central Limit Theorem is not applicable to samples taken from non-normally distributed samples.
F applies to all with sufficiently large sample size
According to the Central Limit Theorem, the standard error of (X bar) decreases as the population variance increases.
F as population variance decreases
An estimator is said to be consistent if the estimator gets closer to the parameter it is estimating as the population variance increases.
F as the sample size increases
The alternative hypothesis should be specified after the sample data has been gathered and reviewed.
F before sample data has been gathered and reviewed
The one sample t-test for a single population mean is always a two tailed test.
F can be one-sided
For the p-value method of hypothesis testing, the decision rule is to compare the p-value to the critical value, which is a particular t-score.
F compare the p-value to alpha (the significance level)
One factor ANOVA is a statistical method that is used to detect differences in population variances.
F differences in population means
According to the Central Limit Theorem, the sampling distribution of the sample mean for sufficiently large sample sizes will be very similar to the probability density function of the population of interest.
F distribution is always approx. normal
Consider a population with mean μ and population standard deviation σ. The Central Limit Theorem states that for a sufficiently large sample size, the sampling distribution of the sample mean (X bar) is approximately a normal distribution with expected value (mean) μ and standard error (standard deviation) σ.
F for (X bar), standard deviation is σ/Sqrt (n)
In one factor ANOVA, the appropriate test statistic is MST/MSE and it has a t-distribution if the null hypothesis is true.
F has an F-distribution
In one factor fixed effects ANOVA, the appropriate test statistic is MST/MSE and it has a t-distribution if the null hypothesis is true.
F has an F-distribution
There is only one F distribution.
F infinitely many F distributions
A point estimate is preferred over an interval estimate from a statistical point of view.
F interval estimate accounts for sampling error
Any function of the sample data is an estimator.
F is a statistic
One factor ANOVA analysis involves no hypothesis test.
F it does
The p-value method of statistical hypothesis testing fails to take sampling error into account.
F it does take sampling error into account
By choosing the significance level, alpha, one is able to control or choose the probability of making a Type 2 error in statistical hypothesis testing.
F making a Type 1 error
The null hypothesis in the hypothesis test in one factor ANOVA is that all of the population means are different.
F means are the same
When a hypothesis test results in rejection of the null hypothesis, the null hypothesis has been proven (with a probability of 1)to be false.
F null hypothesis is never proven to be false, but we have evidence to counter it
The test statistic MST/MSE in one factor ANOVA is distributed as an F statistic when the alternative hypothesis is true.
F null hypothesis is true
The probabilities of Type 1 and Type 2 errors should be set equal before conducting a statistical hypothesis test.
F one can only control the Type 1 error using alpha
P-values for hypothesis tests can only be found and used when the null distribution of the test statistic is normally distributed.
F one distribution is normal, but there are many
Tukey's HSD multiple comparison method is the only multiple comparison method for one factor fixed effects ANOVA.
F one of many methods
One factor fixed effects ANOVA assumes that exactly three sources of variability affect the dependent variable.
F only 2 sources of variability
In one factor fixed effects ANOVA, the quantities MST and MSE estimate the same quantity, sigma squared.
F only if null hypothesis is true
The population mean is an example of an estimator.
F pop mean is a parameter
The Central Limit Theorem yields the approximate sampling distribution of the population mean provided the sample size is sufficiently large.
F population mean is a parameter, and parameters have no distributions.
The two-sample t-test is used to detect differences in the two sample means.
F population means
According to the Central Limit Theorem, the sampling distribution of the sample mean for sufficiently large sample sizes will be identical to the distribution of the sampled (parent) population.
F regardless of population of interest, distribution of (X bar) will be approximately normal.
All student's t based confidence intervals for a population mean μ will be of equal width provided the level of confidence is the same for each interval.
F sample size, level of confidence, and standard deviation must be the same for each interval.
The sample mean is identical to the population mean.
F sampling error
Every random sample is a representative sample.
F sampling error is likely
Parameters are used to estimate statistics.
F statistics are used to estimate parameters
In one factor fixed effects ANOVA, we assume that there are only two sources of variability in the data: one is error and the other is regression.
F the other is treatment
Confidence intervals give no information regarding the precision of an estimator.
F the smaller the interval, the more precise the estimator
In statistical hypothesis testing, the sum of the probabilities of making Type 1 and Type 2 errors is one.
F they are inversely related
Statistical inference occurs when population parameters are used to infer the values of statistics.
F use estimators to infer the values of parameters
For small sample sizes, the two-sample t-test for two populations means assumes that the populations of interest are at least approximately bell shaped and have approximately equal population variances.
F variances don't have to be equal
The null hypothesis will be rejected when the p-value is sufficiently large.
F when the p-value is sufficiently low
The Central Limit Theorem states that for a sufficiently large sample size, the sampling distribution of the sample mean (X bar) is N(μ,σ) where μ is the mean of the population of interest, σ is the standard deviation of the population, and n is the sample size.
F σ/Sqrt(n)
Test for ANOVA
F-test with an F distribution with (k-1, n-k) degrees of freedom
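The test statistic MST/MSE and its degrees of freedom can be computed directly from grouped data. The sketch below uses only the standard library and hypothetical data for k = 3 groups; the function name `one_way_anova_f` is illustrative. Obtaining a p-value would additionally require the F distribution, which is not in the standard library and is left out here.

```python
def one_way_anova_f(groups):
    """Return (F, (df_treatment, df_error)) for a one factor ANOVA."""
    k = len(groups)                       # number of treatments
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Treatment sum of squares: variation of group means about the grand mean.
    ss_treatment = sum(len(g) * (m - grand_mean) ** 2
                       for g, m in zip(groups, group_means))
    # Error sum of squares: variation of observations about their group mean.
    ss_error = sum((y - m) ** 2
                   for g, m in zip(groups, group_means) for y in g)

    mst = ss_treatment / (k - 1)
    mse = ss_error / (n - k)
    return mst / mse, (k - 1, n - k)


# Hypothetical data for three treatments, for illustration only.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, (df1, df2) = one_way_anova_f(groups)
print(f"F = {f_stat:.3f} with ({df1}, {df2}) degrees of freedom")
```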
T/F: If the residuals are all large in magnitude, then much of the variability in observed y values appears to be due to the linear relationship between x and y, whereas many smaller residuals suggest quite a bit of inherent variability in y relative to the amount due to the linear relation
False
T/F: In the simple linear regression model, it must not be assumed that the values of the error term are independent of one another
False
T/F: In the simple linear regression model, it must not be assumed that the variance of e is the same for all values of independent variable x
False
T/F: Let Y^ = B_0^ + B_1^x^0, where x^0 is some fixed value of x, then the mean value of Y^ is E(Y^) = B_0^ _ B_1^x^0
False
T/F: Regression analysis is the part of statistics that deals with investigation of the relationship between two or more variables related in a deterministic relationship
False
T/F: The coefficient of determination, denoted by r^2, is interpreted as the proportion of observed y variation that cannot be explained by the simple linear regression model
False
T/F: The least squares estimates are always the unique solution to the system of normal equations
False
T/F: The least squares estimates are never the unique solution to system of normal equations
False
T/F: The simple linear regression model should not be used for further inferences (estimates of mean value or predictions of future values) unless the model utility test results in acceptance of H_0: B_1 = 0 for a suitably small significance level alpha
False
T/F: The slope B_1 of the population regression line is the true average change in the independent variable x associated with a 1-unit increase in the dependent variable y
False
T/F: The square root of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model
False
T/F: The value of r (the sample correlation coefficient) depends on which of the two variables under study is labeled x and which is labeled y
False
T/F: The variable T = [Y^-(B_0 + B_1x^0)] / S_i has a t distribution with n - 1 degrees of Freedom, where n is the sample size and x^0 is a specified value of the independent variable x
False
T/F: The variance of B_1^, the slope of the least squares line, equals the variance sigma^2 of the random error e divided by sqrt(S_xx), where S_xx = Sum[(x_i - xbar)^2]
False
TV Viewing Times. The A.C. Nielsen Company collects and publishes information on the television viewing habits of Americans. Data from a sample of Americans yields estimates of average TV viewing time per week. The results are generalized to all Americans. (For more information, see Exercise 1.7 on page 8.) TRUE or FALSE: This is an example of a descriptive study.
False. The results are generalized to all Americans, which makes this an inferential study.
T/F: If you conduct a random sample, then the sample is guaranteed to be a representative sample.
False because you could still by chance pick an unrepresentative sample like in the class example of height of basketball players.
T/F In A union B, if sets A and B both have six, you count the six twice in the A union B set.
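False. An element that belongs to both A and B appears only once in A union B. When counting, you add the sizes of A and B and subtract the size of A intersection B (not A union B), so the shared six is counted once.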
T/F - If all data points are whole numbers, the mean must also be a whole number.
False. The mean can be a fraction or a decimal. For example, the mean of 1 and 2 is 1.5. (Adding the middle two numbers and dividing by two gives the median of an even-sized data set, not the mean.)
T/F: Sample points will always be numbers.
False. Sample points do not have to be numbers; they can represent qualitative outcomes. For example, when tossing a coin twice the sample points are HH, HT, TH, and TT.
(T/F) The symbol "sigma" is the notation for the population variance.
False. Sigma (σ) means standard deviation within a population & sigma squared (σ²) means population variance.
T/F: The 50th percentile is also representative of the mean.
False. The 50th percentile is the median. It equals the mean only when the median and mean coincide, as they do for a symmetric distribution.
True or False: Taking the average of the deviations is a good way to measure the variability of a dataset.
False. The deviations from the mean always sum to 0, so their average is always 0 and says nothing about variability.
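The cancellation of deviations can be checked directly. The sketch below uses the data set 2, 2, 1, 3, 4, 15 from another card in this set; the variable names are illustrative only.

```python
from statistics import mean, pvariance

data = [2, 2, 1, 3, 4, 15]
center = mean(data)

# Deviations from the mean: positive and negative values cancel exactly.
deviations = [x - center for x in data]
average_deviation = sum(deviations) / len(data)

# Squaring before averaging prevents the cancellation, which is why
# variance (and its square root, the standard deviation) is used instead.
variance = pvariance(data)

print(average_deviation, variance)
```

The average deviation comes out as zero for any data set, while the variance is positive whenever the values are not all equal.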
The central limit theorem states that for sufficiently large sample sizes (30 or more) the sampling distribution of the sample mean, x̄, is N(μ, σ/n) where μ is the mean of the parent population, σ is the standard deviation of the parent population, and n is the sample size. T/F
False; it should be σ/√n
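A small simulation can make the σ/√n claim concrete. This is only a sketch: the uniform parent population, the sample size of 30, the number of repetitions, and the seed are all assumptions chosen for illustration.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed parent population: uniform on [0, 1], whose sigma is sqrt(1/12).
sigma = (1 / 12) ** 0.5
n = 30        # sample size
reps = 5000   # number of simulated samples

# Draw many samples of size n and record each sample mean.
sample_means = [mean([random.random() for _ in range(n)]) for _ in range(reps)]

observed_sd = pstdev(sample_means)   # spread of the simulated sample means
predicted_sd = sigma / n ** 0.5      # what the Central Limit Theorem predicts
wrong_sd = sigma / n                 # the incorrect value from the card

print(observed_sd, predicted_sd, wrong_sd)
```

The observed spread of the sample means should land close to σ/√n and far from σ/n.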
The central limit theorem yields the approximate sampling distribution of the population mean provided the sample size is sufficiently large. true or false
False; this statement would be true if it read, "the sampling distribution of the SAMPLE mean."
What does R-squared measure?
How well the model fits the data
theorem 6.1
If a random sample of n observations is selected from a population with a normal distribution, the sampling distribution of X̄ will be a normal distribution
________ statistics consists of methods for drawing and measuring the reliability of conclusions about a population based on the information obtained from a sample of the population
Inferential
This test is used when: -Testing more than 2 Population Means. -Do NOT have normally distributed populations.
Kruskal-Wallis Test
The _____________ provides a long-run interpretation for the mean of a population. It states, "In a ______ of independent observations of a random variable X, the average value of those observations will approximately equal the mean, μ, of X. The _______ the ______ of observations, the ________ the average tends to be to μ."
Law of Large Numbers large number larger number closer
Sampling Distributions: Properties
Mean Standard Deviation Shape
In the set 2, 2, 1, 3, 4, and 15, is the mean an appropriate measure? Why or why not?
No it is not. A better approach to the problem would be to find the median of the set. The mean is affected by outliers, and would give an unreliable average of the data. The Median is not affected by outliers and would not change due to an extremely small or large number.
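The effect of the outlier 15 on each measure of center can be seen with a short calculation. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

data = [2, 2, 1, 3, 4, 15]
trimmed = [x for x in data if x != 15]  # same data with the outlier removed

# The mean shifts a lot when the outlier is dropped; the median barely moves.
mean_with, mean_without = mean(data), mean(trimmed)
median_with, median_without = median(data), median(trimmed)

print(mean_with, mean_without)      # 4.5 versus 2.4
print(median_with, median_without)  # 2.5 versus 2
```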
Null Hypothesis
No relationship between X & Y (what we assume)
Not Normal Population
Not Normal Sample Distribution
Influential Observations
An observation that, although not an outlier, still influences the regression line.
probability p(x) for a uniform distribution
P(a ≤ X ≤ b) = (b - a)/(d - c), where c ≤ a < b ≤ d
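The formula above translates into a one-line function. In this sketch, `uniform_prob` and `uniform_mean` are hypothetical helper names, and the interval (-2.5, 1.6) is taken from the unif(-2.5, 1.6) card earlier in this set.

```python
def uniform_prob(a, b, c, d):
    """P(a <= X <= b) for X ~ unif(c, d), requiring c <= a < b <= d."""
    if not (c <= a < b <= d):
        raise ValueError("need c <= a < b <= d")
    return (b - a) / (d - c)


def uniform_mean(c, d):
    """Mean of X ~ unif(c, d): the midpoint of the interval."""
    return (c + d) / 2


# Probability that X falls in [-1, 0] when X ~ unif(-2.5, 1.6).
p = uniform_prob(-1.0, 0.0, -2.5, 1.6)
m = uniform_mean(-2.5, 1.6)
print(p, m)
```

The midpoint calculation reproduces the mean of -0.45 given on the earlier card.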
Probabilities for a random variable X that has a Poisson distribution are computed using the formula below. In this formula, λ is a positive real number, and e is the base of the natural logarithm. The random variable X is called a _________ random variable and is said to have the ___________ with ________ λ. P(X = x) = e^(-λ) λ^x / x!, x = 0, 1, 2, ...
Poisson Poisson distribution parameter
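The Poisson formula can be evaluated with the standard library alone. This is a minimal sketch; `poisson_pmf` is a hypothetical helper name and λ = 2.5 is an arbitrary illustrative value.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x! for x = 0, 1, 2, ..."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


lam = 2.5  # assumed rate parameter, chosen only for illustration
probabilities = [poisson_pmf(x, lam) for x in range(60)]

# The probabilities over x = 0, 1, 2, ... should add up to (essentially) 1.
total = sum(probabilities)
print(probabilities[:4], total)
```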
What name is given to the distribution used to model the frequency of a specified event occurring at rate λ during a particular period of time?
Poisson distribution
Sample Distribution
The probability distribution of all possible values of a statistic.
Sampling Distribution of the Mean
Probability distribution that models all values of X across all SRS's.
Type I Error
Rejecting the Null Hypothesis when it is actually true (most common)
Alternative Hypothesis
Relationship between X & Y (what we are trying to prove)
Theory
Sample is Randomly selected Many possible SRS's each w/ different value for the stat which depends on which sample chosen = Random Variable
{}
Set notation. The collection of what is inside the brackets.
A SRS of size n=150 was taken from a population with unknown distribution. If the normal quantile plot of the sample has a roughly linear shape, what can be concluded about the distribution?
Since the points on the normal quantile plot are roughly linear, we can conclude the population distribution is normal, or approximately normal.
A key characteristic of statistical analysis is that it takes sampling error into account.
T
A multiple comparison method in one factor ANOVA preserves the overall level of confidence that all confidence intervals simultaneously hold.
T
A multiple comparisons method for constructing pairwise confidence intervals is not needed if we are estimating the difference in the means of two populations.
T
A statistical test is performed under the assumption that the null hypothesis is true.
T
A test statistic is always performed under the assumption that the null hypothesis is true.
T
An estimator is said to be unbiased if the expected value of the estimator is exactly the parameter which the estimator is estimating.
T
Assuming the null hypothesis is true , we know that the statistics MST and MSE both estimate the common population variance (sigma squared).
T
Every estimator is a statistic.
T
Every estimator of a population parameter is a random variable which has its own density function.
T
In one factor fixed effects ANOVA, the factor may have more than two treatments.
T
In statistical hypothesis testing, all of the information in determining whether or not there is sufficient evidence to support the alternative hypothesis is given in the p-value and the significance level, alpha.
T
In statistical hypothesis testing, the significance level, α, is the definition of a rare event and the probability of making a Type 1 error.
T
Parameters describe some characteristic of the population of interest.
T
Sampling error is a source of error for every estimator.
T
Statistical inference permits us to draw conclusions concerning a population and its population parameters based on sample data.
T
The Central Limit Theorem applies to populations which are modeled as discrete random variables.
T
The Central Limit Theorem can be applied without regard to the shape of the population density function provided the sample size is sufficiently large
T
The Central Limit Theorem is important because it can always be applied without assuming the distribution of the population of interest, provided the sample size is sufficiently large and the population variance is finite.
T
The empirical F score for the hypothesis test in a one factor fixed effects ANOVA is inversely related to the p-value.
T
The expected value of the sample mean (X bar) is the same as the population mean of the sampled population.
T
The expected value of the sample mean is identical to the population mean.
T
The null distribution is the distribution of the test statistic, if, in fact, the null hypothesis is true.
T
The null distribution of a test statistic must be known, or at least approximately known, in order for the p-value to be determined.
T
The p-value is the probability of getting the resulting value of the test statistic that is found using the sample data, or one more rare, if, in fact, the null hypothesis is true.
T
The p-value is the probability of getting the resulting value of the test statistic that is found, or one more rare, if, in fact, the null hypothesis is true.
T
The probabilities of Type 1 and Type 2 errors are inversely related if the sample size or sample sizes are fixed.
T
The purpose of a multiple-comparison method in ANOVA is to detect differences among population means and to estimate the differences in population means, provided differences exist.
T
The purpose of multiple-comparison method in ANOVA is to determine which of the population means differ and to estimate the difference in the population means that are, indeed, different.
T
The significance level, alpha, is not only the definition of a rare event, but also the probability of making a Type 1 error.
T
The standard error of the sample mean (X bar) does not depend on the expected value of the population of interest that is being sampled.
T
The terms "random variable" and "population of interest" are often interchanged in statistical analysis.
T
The width of a confidence interval for the population mean μ is dependent on the sample standard deviation.
T
The width of a confidence interval for μ decreases as the sample size increases for a fixed confidence level, provided the sample standard deviation remains constant.
T
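The dependence of interval width on sample size can be sketched numerically. The code below is a simplification under stated assumptions: it uses a z critical value in place of the t critical value (a large-sample shortcut), and `ci_half_width` plus the inputs s = 10 and n = 25, 100 are hypothetical.

```python
from statistics import NormalDist


def ci_half_width(s, n, confidence=0.95):
    """Approximate half-width z * s / sqrt(n) of a confidence interval for the mean."""
    # Assumption: z critical value used instead of t, reasonable for large n.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * s / n ** 0.5


# Same sample standard deviation and confidence level, two sample sizes.
narrow = ci_half_width(s=10, n=100)
wide = ci_half_width(s=10, n=25)
print(wide, narrow)
```

Quadrupling the sample size halves the half-width, because n enters only through √n.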
The width of a confidence interval for μ depends only on the parameter value it is attempting to estimate.
F
IQR
The interquartile range
QL
The lower quartile. Same as the 25th percentile (aka Q₁)
A student decides to make a campus wide survey in order to determine which style snack was most favored. With a total of 20 thousand students on campus, he decides to take a sample size of 20 from 3 different housing facilities which only contained males. After finishing his survey, he concluded that spicy snacks were the most favored. What is the population of interest and was this method an effective way to accurately represent that population? Explain.
The population of interest is all 20 thousand students on campus; the males surveyed from the 3 housing facilities are only the sample. This method does not accurately represent the population because the sample is too narrow. The student intended to make a campus-wide survey but sampled only one gender and only 3 housing facilities, which makes the sample biased. His conclusion will not accurately represent the campus as a whole.
p
The population proportion
N
The population size
probability density function definition
The probability distribution function for a continuous random variable X can be represented by a smooth curve - a function of x, denoted by f(x). The curve is called a probability density function. The probability that X falls between two values, a and b, i.e.,P(a≤X≤b) is the area under the curve between a and b.
What does the probability 0 represent? What does the probability 1 represent?
A probability of 0 represents no chance of the event happening, and a probability of 1 represents no chance of the event not happening. Probabilities range from 0 to 1, inclusive.
x̄
The sample mean
M
The sample median
n
The sample size
s
The sample standard deviation
s²
The sample variance
The model utility test tests whether _______.
The slope of the regression line is nonzero
Qu
The upper quartile. Same as the 75th percentile (aka Q₃)
True or False: Histograms provide good visual descriptions of data sets and let us identify individual measurements.
This is false, because histograms group several values together, instead of plotting them individually (which is exactly the purpose a stem-and-leaf plot serves). It is much easier to observe data on a stem-and-leaf plot, because you can clearly see how/where each individual measurement is distributed.
T/F: Standard deviation is always positive.
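This is true in the sense that standard deviation is never negative: it is the nonnegative square root of the variance, and the variance is built from squared deviations, which cannot be negative. It equals 0 only when every value in the data set is the same.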
If we were conducting a study and chose only participants whose last names that started with the letter L, would this be a random & representative sample? Why or why not?
This would be a representative sample because it is a small group of people that is meant to accurately reflect the general population. However, it would not be a random sample, since everyone chosen for the study was specifically selected for having a last name starting with the letter L.
Critical Region
To make sure our conclusion isn't due to random chance or variability, we establish a critical region. If the outcome falls inside it, we can reject the Null Hypothesis.
Geography Performance Assessment. In an article entitled, "Teaching and Assessing Information Literacy in a Geography Program" (Journal of Geography, Vol. 104, No. 1, pp. 17-23), Dr. Mary Kimsey and S. Lynn Cameron reported results from an on-line assessment instrument given to senior geography students at one institution of higher learning. The results for level of performance of 22 senior geography majors in 2003 and 29 senior geography majors in 2004 were presented. (For more information, see Exercise 1.9 on page 8.) TRUE or FALSE: This is a descriptive study.
True
T/F: A 100(1 - a)% confidence interval for u_y*x^0, the expected value of Y when x = x^0, is given by (B_0^ + B_1^x^0) +- t_a/2,n-2 * S_[B_0^ + B_1^x^0] = y^ +- t_a/2,n-2 * S_y^, where n is the sample size
True
T/F: A t variable obtained by standardizing B_0^ + B_1^x^0 leads to a confidence interval and test procedure concerning u_i*x^0 (the expected value of Y when x = x^0)
True
T/F: A value of the sample correlation coefficient r = .5 is considered weak, but r^2 = .25 implies that in a regression of y on x, only 25% of observed y variation would be explained by the model
True
T/F: A value of the sample correlation coefficient r near 0 is not evidence of the lack of a strong relationship, but only the absence of a linear relationship, so that such a value of r must be interpreted with caution
True
T/F: B_0^ + B_1^x^0 is an unbiased estimator for B_0 + B_1x^0
True
T/F: Before the least squares estimates B_0^ and B_1^ are computed, a scatter plot should be examined to see whether a linear probabilistic model is plausible
True
T/F: For a fixed x value x0, B_0^ + B_1^ * x0 (the height of the estimated regression line above x0) gives either a point estimate of the expected value of Y when x = x0 or a point prediction of Y value that will result from a single new observation made at x = x0
True
T/F: If the coefficient of determination is small, an analyst will usually want to search for an alternative model (either a nonlinear model or a multiple regression model that involves more than a single independent variable)
True
T/F: In some situations, a confidence level for a set of K Bonferroni intervals is guaranteed to be at least 100(1 - K a)%
True
T/F: In the simple linear regression model, it must be assumed that the error term is normally distributed
True
T/F: In the simple linear regression model, it must be assumed that the expected value of e is zero
True
T/F: Inferences about the slope B_1 of the population regression line are based on thinking of the slope B_1^ of the least squares line as a statistic and investigating its sampling distribution
True
T/F: Saying that variables x and y are deterministically related means that once we are told the value of x, the value of y is completely specified
True
T/F: The equation σ^2_y*xi = V(B_0 + B_1x* + e) states that the amount of variability in the distribution of Y values is the same at each different value of x (homogeneity of variance)
True
T/F: The assumptions of the simple linear regression model imply that the standardized variable T = (B_1^ - B_1) / S_B_1^ has a t distribution with n - 2 degrees of freedom
True
T/F: The confidence interval for u_y*x; the expected value of Y when x = x^0 is centered at the point estimate u_x*y^0, and extends out to each side by an amount that depends on the confidence level and on the extent of variability in the estimator on which the point estimate is based
True
T/F: The denominator of the slope B_1^ of the least squares line is S_xx = Sum[(x_i - xbar)^2], which is constant since it depends only on the x_i's and not on the Y_i's
True
T/F: The distribution of the slope B_1^ of the least squares line is always centered at the value of the slope B_1 of the population regression line
True
T/F: The error sum of squares is the sum of squared deviations about the least squares line y = B_0^ + B_1^*x
True
T/F: The estimated standard error of B_1^; namely S_B_1^, will tend to be small when there is little variability in the distribution of B_1^ and large otherwise
True
T/F: The estimate B_0^ + B_1^x^0 for u_1*x^0 is more precise when x^0 is near the center of the x_i's than when it is far from the x values at which observations have been made
True
T/F: The height of the true regression line y = B_0 + B_1(x) above any particular x is the expected value of Y for that value of x
True
T/F: The higher the value of the coefficient of determination, the more successful is the simple linear regression model in explaining y variation
True
T/F: The joint or simultaneous confidence level for a set of K bonferroni intervals is guaranteed to be at least 100(1 - K a )%
True
T/F: The least squares regression line should not be used to make a prediction for an x value much beyond the range of the data x values
True
T/F: The model utility test is the test of H_0: B_1 = 0 versus H_a: B_1 != 0, in which case the test statistic value is the t ratio t = B_1^ / S_B_1^
True
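The t ratio for the model utility test can be computed by hand from the least squares quantities. This is only a sketch: the (x, y) values are made up for illustration, and the formulas assumed are the standard ones, s² = SSE/(n - 2) and estimated standard error of the slope s/√S_xx.

```python
import math

# Hypothetical data, invented for this illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = s_xy / s_xx           # least squares slope
b0 = y_bar - b1 * x_bar    # least squares intercept
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

s = math.sqrt(sse / (n - 2))   # estimate of sigma, using n - 2 degrees of freedom
se_b1 = s / math.sqrt(s_xx)    # estimated standard error of the slope
t_ratio = b1 / se_b1           # test statistic for H_0: slope = 0

print(b1, se_b1, t_ratio)
```

The t ratio would then be compared with a critical value from the t distribution with n - 2 degrees of freedom, which this sketch does not compute.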
T/F: The most commonly encountered pair of hypotheses about the slope B_1 of the population regression line is H_0: B_1 = 0 versus H_1: B_1 != 0
True
T/F: The null hypothesis H_0: B_1 = 0 can be tested against the alternative hypothesis H_a: B_1 != 0 by constructing an ANOVA table and rejecting H_0 if the test statistic value f >= F_a,1,n-2, where n is the sample size
True
T/F: The objective of regression analysis is to exploit the relationship between two (or more) variables so that we can gain information about one of them through knowing values of the others
True
T/F: The predicted value y_i^ is the height of the estimated regression line above the value x_i for which the ith observation was made
True
T/F: The predicted value y_i^ is the value of y that we would predict or expect when using the estimated regression line with x = x_i
True
T/F: The proportion of variation in the dependent variable explained by fitting the simple linear regression model does not depend on which variable is treated as the dependent variable
True
T/F: The ratio of the error sum of squares to total sum of squares is the proportion of total variation that cannot be explained by the simple linear regression model
True
T/F: The residual y_i - y_i^ is the difference between the observed y_i and the predicted y_i^
True
T/F: The residuals are the vertical deviations y1 - y1^, y2 - y2^, ... yn - yn^ from the estimated regression line
True
T/F: The sample correlation coefficient r can be used to make various inferences about the population correlation coefficient p (rho)
True
T/F: The set of pairs (x, y) for which y = B_0 + B_1(x) determines a straight line with slope B_1 and y-intercept B_0
True
T/F: The simplest deterministic mathematical relationship between two variables x and y is a linear relationship y = B_0 + B_1(x)
True
T/F: The slope of a line y = B_0 + B_1(x) is the change in y per 1-unit increase in x
True
T/F: The slope B_1 of the true regression line y = B_0 + B_1(x) is interpreted as the expected change in Y associated with a 1-unit increase in the value of X
True
T/F: The slope B_1^ of the least squares line is a linear function of the "independent" random variables Y1,Y2,...Yn, each of which is normally distributed
True
T/F: The slope B_1^ of the least squares line is an unbiased estimator of the slope coefficient B_1 of the true regression line
True
T/F: The sum of squared deviations about the least squares regression line is always smaller than the sum of squared deviations about any other line
True
T/F: The total sum of squares is the sum of squared deviations about the sample mean of the observed y values.
True
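Several of the regression cards above (residuals, error sum of squares, total sum of squares, coefficient of determination) fit together in one short calculation. The sketch below uses invented (x, y) values, and `sse_for_line` is a hypothetical helper for comparing the least squares line against other lines.

```python
# Hypothetical data, invented for this illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = s_xy / s_xx           # least squares slope
b0 = y_bar - b1 * x_bar    # least squares intercept


def sse_for_line(intercept, slope):
    """Sum of squared vertical deviations of the data about a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed minus predicted

sse = sse_for_line(b0, b1)                   # error sum of squares
sst = sum((y - y_bar) ** 2 for y in ys)      # total sum of squares
r_squared = 1 - sse / sst                    # proportion of y variation explained

print(sse, sst, r_squared)
```

Nudging the intercept or slope away from the least squares values, for example `sse_for_line(b0 + 0.1, b1)`, gives a larger sum of squared deviations.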
T/F: The true regression line y = B_0 + B_1(x) is the line of mean values
True
T/F: The value of r (the sample correlation coefficient) equals -1 if all (x1, y1) pairs lie on a straight line with negative slope
True
T/F: The value of r (the sample correlation coefficient) equals 1 if all (x1, y1) pairs lie on a straight line with positive slope
True
T/F: The value of r (the sample correlation coefficient) is always between -1 and 1, inclusive
True
T/F: The value of r (the sample correlation coefficient) is independent of the units in which x and y are measured.
True
T/F: The y-intercept of a line y = B_0 + B_1(x) is the height at which the line crosses the vertical axis and is obtained by setting x = 0 in the equation
True
T/F: There is an estimated standard error for the statistic B_0^ from which a confidence interval for the intercept B_0 of the population regression line can be calculated
True
T/F: Values of x_i all close to one another imply a highly variable estimator B_1^ of the slope B_1 of the true regression line
True
T/F: Values of x_i that are quite spread out result in a more precise estimator B_1^ of the slope B_1 of the true regression line
True
T/F: We refer to an interval of plausible values for a future Y as a prediction interval rather than a confidence interval, since a future value of Y is a random variable
True
The Music People Buy. Results of monthly telephone surveys yield the percentage estimates of all music expenditures. TRUE or FALSE: This is an example of an inferential study.
True
The central limit theorem is important because it can always be applied without assuming the distribution of the population of interest, provided the sample size is sufficiently large and the population variance is finite. true or false
True
The expected value of the sample mean, x̄, is the same as the population mean of the sampled population.
True
Confidence Interval for ANOVA
Tukey's Test
If X ~ unif(a,b), what is the shape of the distribution?
Uniform or rectangular
Given that A={2,4,6,8,10} and B={1,2,3,4,5}, identify the union and the intersection
Union: A ∪ B = {1, 2, 3, 4, 5, 6, 8, 10}. Intersection: A ∩ B = {2, 4}.
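Python's built-in sets give a quick way to check this answer. The sketch also confirms the counting rule that shared elements are counted once; the variable names are illustrative.

```python
A = {2, 4, 6, 8, 10}
B = {1, 2, 3, 4, 5}

union = A | B          # elements in A or B (or both), each listed once
intersection = A & B   # elements in both A and B

# Counting rule: size of union = size of A + size of B - size of intersection.
counted = len(A) + len(B) - len(intersection)

print(sorted(union), sorted(intersection), counted)
```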
Lurking Variable
Variable that you have not considered in your analysis, but influences your dependent variable in a regression.
This test is used when: -Testing more than 2 Population Means. -Do NOT have Equal Variances.
Welch ANOVA Test
When is a value "statistically significant"?
When the t statistic is greater than the critical value, that is, when it falls in the critical region
When examining stem-and-leaf plots and histograms, what characteristics should we look for?
When examining stem-and-leaf plots and histograms, look for both central tendency and variability of the data. It is important to notice which numbers the data is centered towards, as well as the range of how the data is spread out.
Circle One: When the data is from a symmetric, mound-shaped distribution we use the (Empirical Rule/Chebyshev's Rule)
When the data is from a symmetric, mound-shaped distribution, either rule can be used. The Empirical Rule gives a more exact approximation than Chebyshev's Rule, so it is the better choice, but both apply.
This test is used when: -Testing 2 Population Means. -Populations are NOT Normal. -Population Variances are Equal. *A test on the Medians*
Wilcoxon Rank Sum Test
In simple Linear Regression: -Is X independent or dependent? Y? -What are residuals?
X: Independent Y: Dependent Residuals: Difference between observed values and Predicted values
Is IQR a measure of variability?
Yes. IQR is the distance between the lower & upper quartiles.
I have 20 crayons in a box and want to draw a picture using 20 colors. How many different ways can I choose 20 crayons from the box?
You could choose 20 crayons from a box of 20 only one way. Because order doesn't matter, you would use the Combinations rule to do this problem.
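The Combinations rule is available in the standard library as `math.comb`. The second call, choosing 3 crayons from 20, is an extra illustrative case not taken from the card.

```python
import math

# "N choose n": number of ways to choose n objects from N when order does not matter.
all_twenty = math.comb(20, 20)   # choosing every crayon: exactly one way
three_of_twenty = math.comb(20, 3)

print(all_twenty, three_of_twenty)
```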
Dow Jones Industrial Averages. The table below provides the closing values of the Dow Jones Industrial Averages as of the end of December for the years 2000-2008. The study is a. a descriptive observational study of a population b. an inferential study from a designed experiment applied to a sample c. a descriptive study from a designed experiment applied to a sample d. an inferential study from an observational study applied to a sample
a
Which of the following is NOT a guideline for grouping quantitative data? a. The relative frequencies of each class should be as close as possible. b. The number of classes should be small enough to provide an effective summary, but large enough to display the relevant characteristics of the data. c. Whenever feasible, all classes should have the same width d. Each observation must belong to one, and only one, class
a
Select all of the options that are a property or a correct statement about the probability density function (PDF) and cumulative distribution function (CDF) of a continuous random variable. a. The CDF is a nondecreasing function of x b. The CDF is the 1st derivative of the PDF c. The area under the PDF is one, so the height of the PDF must be less than one. d. The PDF and CDF are both continuous functions of x
a and d
Pareto diagram
a bar graph in which the bars are arranged left to right by descending bar height
variable
a characteristic or property of an individual experimental (or observational) unit in the population
observational study
a data collection method where the experimental units sampled are observed in their natural setting; no attempt is made to control the characteristics of the experimental units sampled (ex: opinion polls and surveys)
designed experiment
a data collection method where the researcher exerts full control over the characteristics of the experimental units sampled; these experiments typically involve a group of experimental units that are assigned to the treatment and an untreated (control) group
parameter
a descriptive measure of the population
statistic
a descriptive summary of a sample
standard normal distribution
a normal distribution with mu = 0 and sigma = 1
sample statistic
a numerical descriptive measure of a sample; calculated from the observations in the sample
discrete random variable
a random variable that can assume a countable (listable) number of values
continuous random variable
a random variable that can assume values corresponding to any of the points contained in an interval
standard normal random variable
a random variable with a standard normal distribution, denoted by Z
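Standard normal areas and z-scores can be looked up with `statistics.NormalDist` instead of a table. The sketch reproduces two values that appear on other cards in this set: the z-score with area 0.025 to its left, and the area to the right of 0.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal distribution

# z-score whose left-tail area is 0.025 (about -1.96).
z_left_025 = Z.inv_cdf(0.025)

# Area to the right of 0 is one half, by symmetry.
area_right_of_zero = 1 - Z.cdf(0)

print(round(z_left_025, 2), area_right_of_zero)
```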
a point estimator of a population parameter
a rule or formula that tells us how to use the sample data to calculate a single number that can be used as an estimate of the population parameter
minimum-variance unbiased estimator (MVUE)
a sample statistic with a sampling distribution that is both unbiased and has a smaller standard deviation than any other statistic with an unbiased sampling distribution
normal probability plot
a scatterplot with the ranked data values on one axis and their corresponding expected z-scores from a standard normal distribution on the other axis
population
a set of all units (usually people, things, transactions, or events) that we are interested in studying
simple random sample
a simple random sample of n experimental units is a sample selected from the population in such a way that every different sample of size n has an equal chance of selection
measure of reliability
a statement (usually quantitative) about the degree of uncertainty associated with a statistical inference
sample
a subset of the units of a population
nonresponse bias
a type of selection bias that results when data on all experimental units in a sample are not obtained
random variable
a variable that assumes numerical values associated with the random outcomes of an experiment, where one (and only one) numerical value is assigned to each sample point
Which of the selections are a condition of statistical independence? For all conditions listed, you may assume that P(A) > 0 and P(B) > 0. (Select all that apply.) a. P(A∩B)=P(A)P(B) b. P(A|B)=P(A) c. P(B|A)=P(B) d. none of the above
a, b, and c
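All three conditions can be verified on a concrete example. In this sketch the experiment (rolling two fair dice) and the events (first die even, sum equals 7) are assumptions chosen for illustration; exact fractions avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

# Sample space: 36 equally likely outcomes from rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a true/false function of an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def first_even(o):
    return o[0] % 2 == 0


def sum_is_seven(o):
    return o[0] + o[1] == 7


p_a = prob(first_even)
p_b = prob(sum_is_seven)
p_ab = prob(lambda o: first_even(o) and sum_is_seven(o))

p_a_given_b = p_ab / p_b   # P(A|B)
p_b_given_a = p_ab / p_a   # P(B|A)

print(p_ab == p_a * p_b, p_a_given_b == p_a, p_b_given_a == p_b)
```

For these two events all three checks hold, so they are independent.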
what is a continuity correction?
adding or subtracting 0.5 to the X term in the z-score equation depending on the probability being measured
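A worked comparison shows what the 0.5 adjustment buys. This sketch assumes a binomial setting with n = 100, p = 0.5 and asks for P(X ≤ 55); those numbers are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

n, p, k = 100, 0.5, 55               # assumed binomial setting and cutoff
mu = n * p                           # binomial mean
sigma = math.sqrt(n * p * (1 - p))   # binomial standard deviation

# Exact binomial probability P(X <= k).
exact = sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

Z = NormalDist()
approx_plain = Z.cdf((k - mu) / sigma)            # no correction
approx_corrected = Z.cdf((k + 0.5 - mu) / sigma)  # add 0.5 because the event is X <= k

print(exact, approx_plain, approx_corrected)
```

For a "less than or equal to" event the 0.5 is added to k; for a "greater than or equal to" event it would be subtracted.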
Standard Error of Sample Mean
aka the standard deviation of the sampling distribution of the sample mean, σ/√n
Standard Deviation of Random Variable X
aka Standard Error of X: σ ÷ √(n). The standard error of the random variable/population = the sample std dev / square root of the sample size
to which data sets does Chebyshev's Rule apply?
all data sets, regardless of the shape of the frequency distribution of the data
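Chebyshev's Rule says that at least 1 - 1/k² of the observations lie within k standard deviations of the mean, for any k > 1. The sketch below checks that bound on the skewed data set 2, 2, 1, 3, 4, 15 from another card; `chebyshev_bound` and `proportion_within` are hypothetical helper names.

```python
from statistics import mean, pstdev

data = [2, 2, 1, 3, 4, 15]  # deliberately not mound-shaped
center = mean(data)
spread = pstdev(data)       # standard deviation of the data set itself


def chebyshev_bound(k):
    """Minimum proportion of observations within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2


def proportion_within(k):
    """Observed proportion of the data within k standard deviations of the mean."""
    inside = [x for x in data if abs(x - center) <= k * spread]
    return len(inside) / len(data)


for k in (1.5, 2, 3):
    print(k, chebyshev_bound(k), proportion_within(k))
```

The observed proportion meets or exceeds the bound at every k, even though the data are far from symmetric.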
experimental unit (observational unit)
an object (person, thing, transaction, event, etc.) about which we collect data
Mean of Random Variable X
an unbiased estimator of population mean, µ -the mean of the random variable/population = the sample mean (precise estimator)
Observational studies can reveal only ___________. Designed experiments can help establish ___________.
association causation
where are the inner fences of a box plot?
at distances of 1.5(IQR) from the hinges
where are the outer fences of a box plot?
at distances of 3(IQR) from the hinges
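The inner and outer fences can be computed from the quartiles. This sketch uses the eight AML delivery costs from another card in this set, and it makes one assumption worth flagging: textbooks define hinges and quartiles in slightly different ways, and here the quartiles come from `statistics.quantiles` with the "inclusive" method as a stand-in for the hinges.

```python
from statistics import quantiles

costs = [3141, 2873, 2116, 1684, 3470, 1799, 2539, 3093]

# Assumption: "inclusive" quartiles are used in place of the textbook hinges.
q1, q2, q3 = quantiles(costs, n=4, method="inclusive")
iqr = q3 - q1   # interquartile range

inner_fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outer_fences = (q1 - 3 * iqr, q3 + 3 * iqr)

# Observations beyond the inner fences are suspected outliers.
suspects = [x for x in costs if not inner_fences[0] <= x <= inner_fences[1]]

print(q1, q2, q3, iqr, inner_fences, outer_fences, suspects)
```

None of the eight costs falls outside the inner fences, so no observation is flagged.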
Provided that __________, the least squares estimates are the unique solution to the system of normal equations
at least two of the xi values are different
AML and the Cost of Labor. Active Management of Labor (AML) was introduced in the 1960s to reduce the amount of time a woman spends in labor during the birth process. The following data displays the costs, in dollars, of eight randomly sampled AML deliveries. What is the sample mode cost of these AML deliveries? Costs (in dollars) of Eight AML Deliveries 3141 2873 2116 1684 3470 1799 2539 3093 a. 2589.375 b. the data has no mode c. 3470 d. 2706
b
Which measure of center is appropriate for qualitative data? a. mean b. mode c. median d. all means of center are appropriate for qualitative data
b
Which normal distribution is flatter and more spread out? a. Mean 0, sd 0.5 b. Mean 3, sd 5 c. Mean 10, sd 2 d. Mean 3, sd 2
b
Which of the following is not necessarily a measure of center? a. mean b. parameter c. median d. mode
b
The area to the right of 0 in a standard normal distribution is a. equal to .5 minus the area to the left of 0 b. .5 c. equal to the area to the right of 0 in any normal distribution d. undetermined unless μ and σ are known e. 1
b. .5
What is the difference between a bar graph and a histogram?
Bar graph: spaces between the bars, qualitative data. Histogram: no spaces between the bars, quantitative data.
Let X denote the total number of "successes" in n independent Bernoulli trials each having success probability p. Then the probability distribution of the random variable X is given by the formula below. The random variable X is called a __________ random variable and is said to have a ___________ with parameters ___ and ___. P(X = x) = nCx p^x (1-p)^(n-x), x = 0, 1, ..., n.
binomial binomial distribution n p
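A minimal sketch of the binomial formula on this card, using the standard-library `math.comb` for nCx. The function name `binomial_pmf` and the values n = 10, p = 0.9 are illustrative assumptions.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Binomial(n, p): nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Illustrative values (assumed): n = 10 trials, success probability p = 0.9.
n, p = 10, 0.9
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(pmf)  # probabilities over x = 0, 1, ..., n should sum to 1
print(f"P(X = 10) = {pmf[10]:.4f}, total = {total:.4f}")
```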
We often need to group and analyze data from two variables on a population. Data from two variables of a population are called ______________ , and a frequency distribution for bivariate data is called a ___________ or a ___________.
bivariate data contingency table two-way table
Experimental units that are similar in ways that are expected to affect the response are placed into groups called _______. Then within each of these groups, experimental units are randomly assigned to treatments. This design is called a __________.
blocks randomized complete block design
Drug Use. The U.S. Substance Abuse and Mental Health Services Administration collects and publishes data on nonmedical drug use, by type of drug and age group, in National Household Survey on Drug Abuse. The data are the percentages that are estimates for the entire nation based on information obtained from the sample. a. This is an example of descriptive statistics based on data from a designed experiment. b. This is an example of inferential statistics based on data from a designed experiment. c. This is an example of inferential statistics based on observational data. d. This is an example of descriptive statistics based on observational data.
c
Professional Athlete Salaries. In the Statistical Abstract of the United States, average professional athletes' salaries in baseball, basketball, and football were compiled and compared for the years 1993 and 2003. The study is a. a descriptive study from a designed experiment applied to a population b. an inferential study from a designed experiment applied to a sample c. a descriptive observational study of a population d. an inferential study from an observational study applied to a sample
c
Which of the following is NOT a graph or chart that can be used for organizing and summarizing qualitative data? a. frequency bar chart b. pie chart c. relative frequency histogram d. relative frequency bar chart
c
Suppose for a sample size of 75, we have a mean of 13 and a standard deviation of 2. Then a 96% confidence interval for the mean of the sampled population would have left and right end-points 13-(2.06)(2/Sqrt(75)) and 13+(2.06)(2/Sqrt(75)) respectively. If we constructed such a confidence interval for 100 samples, we could expect about _____ of them to contain the mean population. a. 4 b. 2 c. 96 d. 2.58 e. 75
c. 96 96% = 96/100
The standard normal distribution a. is skewed b. is a discrete probability distribution c. always has a mean equal to 0 and a standard deviation equal to 1 d. has a mean (μ) equal to 1 and a variance (σ²) equal to 0 e. is symmetric about σ²
c. always has a mean equal to 0 and a standard deviation equal to 1
The areas under a probability density function of a continuous random variable correspond to probabilities for the random variable X. This implies that a. X is a discrete random variable b. X is a normal random variable. c. the total area under the probability density function equals 1 d. the probability distribution is always mound shaped e. all of the above
c. the total area under the probability density function equals 1
X
capitalized because it's a random variable that depends on the randomly chosen SRS
Descriptive measures that indicate where the center, or most typical value, of a data set lies are called measures of _____ tendency, or more simply, measures of _____.
central center
The article "Racial Stereotypes in Children's Television Commercials" (J. of Adver. Res., 2008: 80-93) reported the following frequencies with which ethnic characters appeared in recorded commercials that aired on Philadelphia television stations. Ethnicity: African-American Asian Caucasian Hispanic Frequency: 57 11 330 6 The 2000 census proportions for these four ethnic groups are .177, .032, .734, and .057, respectively. Does the data suggest that the distribution of ethnic characters in commercials is different from the census proportions? Use a 1% significance level.
chi-square goodness of fit test
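A rough sketch of the goodness-of-fit statistic for the card above, using its frequencies and census proportions. The critical value 11.345 is the usual chi-square table entry for 3 degrees of freedom at the 1% level, hard-coded here rather than computed; variable names are illustrative.

```python
# Observed frequencies and hypothesized (census) proportions from the card.
observed = [57, 11, 330, 6]
census = [0.177, 0.032, 0.734, 0.057]

n = sum(observed)
expected = [n * p for p in census]  # expected count = n * hypothesized proportion

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumption: table value for df = 4 - 1 = 3 at alpha = 0.01.
critical = 11.345
reject = chi_sq > critical
print(f"chi-square = {chi_sq:.2f}, reject H0: {reject}")
```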
Each individual in a random sample of high school and college students was cross-classified with respect to both political views and marijuana usage, resulting in the data displayed in the accompanying two-way table ("Attitudes About Marijuana and Political Views," Psychological Reports, 1973: 1051-1054). Does the data support the hypothesis that political views and marijuana usage level are independent within the population? Test the appropriate hypotheses using level of significance .01. Usage Level Never Rarely Frequently Liberal 479 173 119 Political views Conservative 214 47 15 Other 172 45 85
chi-square test for independence
Groups of the population that are selected via a simple random sample are called what?
clusters
A __________ of r objects from a collection of m objects is any unordered arrangements of r of the m objects
combination
The _____ of an event E is the event consisting of every outcome in the sample space that is not in E.
complement
sample std. dev.
consistent estimator of population standard deviation
what is the best way to determine whether data are from an approximately normal distribution?
construct a normal probability plot of the data and see whether the points fall roughly along a straight line
A __________ random variable is a random variable whose possible values are represented by some type of interval
continuous
A quantitative variable whose possible values form some interval of numbers is classified as _________
continuous
The properties of a PDF are - f(x) is ____ - f(x) ≥ ____ - The total area under f(x) is equal to ____ - It is not necessarily the case that f(x) ≤ ___
continuous 0 1 1
AML and the Cost of Labor. Active Management of Labor (AML) was introduced in the 1960s to reduce the amount of time a woman spends in labor during the birth process. The following data displays the costs, in dollars, of eight randomly sampled AML deliveries. Suppose the sample mean of the data is 2590 (it's not, but suppose it is). Which of the following is correct? 3141, 2873, 2116, 1684, 3470, 1799, 2539, 3093 a. the sum of the eight values divided by eight is 2590. b. the mean is 2590 c. for the eight AML deliveries included, the average estimated cost is $2590 d. for the eight AML deliveries included in this study, the average cost is $2590
d
Consider the data set containing the values 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. If any one of these values is replaced by 100, which measure of center is affected the most? a. the mode b. all are equally affected c. the median d. the mean
d
Pinworm infestation, commonly found in children, can be treated with the drug pyrantel pamoate. According to the Merck Manual, the treatment is effective in 90% of cases. Suppose that a simple random sample of n = 10 children with pinworm infestation are given pyrantel pamoate. Which choice is the correct description of this experiment? a. A binomial experiment with probability of success p = 0.90. We cannot know the value of n, since we don't know how many children have pinworm infestation. A success is that a child included in the study has pinworm infestation. b. A binomial experiment with probability of success p = 0.90. We cannot know the value of n, since we don't know how many children have pinworm infestation. However, a success is that a child with pinworm infestation is effectively treated using the drug pyrantel pamoate c. A binomial experiment with probability of success p = 0.90. We cannot know the value of n, since we don't know how many children have pinworm infestation. A success is that a child with pinworm infestation is included in the study. d. A binomial experiment with n = 10 Bernoulli trials and probability of success p = 0.90. A success is that a child with pinworm infestation is effectively treated using the drug pyrantel pamoate.
d
Which statement correctly compares a frequency histogram to a relative frequency histogram? a. They look almost identical. The only difference is in the spacing of the bars. In a frequency histogram, there is a small space between the bars. But in a relative frequency histogram, there is no space between the bars. b. There is no difference. The two are basically the same. c. A frequency histogram is used for qualitative data, while a relative frequency histogram can only be used for quantitative data. So the two are very different. d. They look almost identical. The only difference is the scale on the vertical axis. For a frequency histogram, the vertical axis is a count. For a relative frequency histogram, the vertical axis is a proportion.
d
The values of a variable for one or more people or things yield _____. Each individual piece of data is called an __________, and the collection of all observations is called a ____________
data observation data set
quantitative data
data that are measured on a naturally occurring numerical scale
measures of relative standing
descriptive measures of the relationship of a measurement to the rest of the data
In a ____________, researchers impose treatments and controls and then observe characteristics and take measurements
designed experiment
A __________ random variable is a random variable whose possible values can be represented in some type of list.
discrete
A quantitative variable whose possible values can be listed is classified as _______
discrete
Counting the number of siblings in a family is an example of a _______ variable. The weight of a newborn baby is an example of a _________ variable. (both are types of quantitative variables)
discrete continuous
interquartile range (IQR)
distance between the upper and lower quartiles (upper quartile - lower quartile)
A probability __________ is a listing of the possible values and corresponding probabilities of a discrete random variable, or a formula for the probabilities.
distribution
The ________ of a data set is a table, _____, or ________ that provides the values of the observations and how often they occur
distribution graph formula
AML and the Cost of Labor. Active Management of Labor (AML) was introduced in the 1960s to reduce the amount of time a woman spends in labor during the birth process. The following data displays the costs, in dollars, of eight randomly sampled AML deliveries. Suppose the sample median of the data is 2700 (it's not, but suppose it is). Which of the following statements is correct? Costs (in dollars) of Eight AML Deliveries 3141 2873 2116 1684 3470 1799 2539 3093 a. Half of all AMLs have an avg cost of at least n=$2700 b. For the 8 AMLs in the study, the estimated avg cost of half of them is at most $2700 c. Since the sample size is even, the median is the average of the two middle observations, after sorting the observations from smallest to largest. d. The estimated average cost of all AMLs is at least half of $2700.00. e. For the eight AMLs in the study, 50% (4) of them had a cost of at most M = $2,700.00.
e
representative sample
exhibits characteristics typical of those possessed by the target population
When discussing the mean of a random variable, the terms _________ and expectation are commonly used in place of the term mean.
expected value
An ________ is an action whose outcome cannot be predicted with certainty. An _____ is some specified result that may or may not occur when an experiment is performed
experiment event
In a designed experiment, the individuals or items on which the experiment is performed are called ___________. When the (previous blank) are humans, the term _________ is often used instead.
experimental units subjects
probability density function for an exponential distribution
f(x) = (1/theta) (e raised to (-x/theta)), x is greater than 0
uniform distribution probability density function equation
f(x) = 1 / (d-c), (c is less than or equal to x, which is less than or equal to d)
Suppose a random variable has a uniform distribution with a lower bound a = -2.5, and upper bound b = 1.6. The probability density function (PDF) of X is given by the following. f(x)=(_____-_____)⁻¹=____⁻¹=______
f(x)=( 1.6 -(-2.5))⁻¹ = 4.1⁻¹=.2439
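A small sketch checking the arithmetic on this card, alongside the uniform mean (c + d)/2 and standard deviation (d - c)/√12 given on other cards in this set. The variable names are illustrative.

```python
from math import sqrt

# Bounds from the card: X ~ unif(-2.5, 1.6).
c, d = -2.5, 1.6

density = 1 / (d - c)      # f(x) = 1 / (d - c) on [c, d]
mean = (c + d) / 2         # mu = (c + d) / 2
sd = (d - c) / sqrt(12)    # sigma = (d - c) / sqrt(12)

print(f"f(x) = {density:.4f}, mean = {mean:.2f}, sd = {sd:.4f}")
```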
(T/F) If a distribution is skewed, it has at least one mode.
false
(T/F) Mutually exclusive events are always independent
false
(T/F) Since most statistical procedures are valid for any type of data, it is not too important that you are able to classify the data
false
(T/F) The bars in a bar chart do not touch because the quantitative values represented on the horizontal axis aren't all the same
false
(T/F) The value -0.201 is a valid probability.
false
(T/F) The value 3.5 is a valid probability.
false
All Student's t-based confidence intervals for a population mean μ will be of equal width provided that the level of confidence is the same for each interval. T/F
false
according to the central limit theorem the standard error of x bar decreases as the population variance increases
false
every random sample is a representative sample
false
statistical inference occurs when population parameters are used to infer the values of statistics.
false
the population mean and population median are the same value.
false
there is only one symmetric bell shaped density function that has mean zero and standard deviation 1.
false
the sample mean is another term for the population mean
false; the sample mean is x-bar, the population mean is mu
confidence intervals give no information regarding the precision of an estimator.
false; the width of a confidence interval reflects precision: for a given confidence level (e.g., 95%), a narrower interval means a more precise estimate
the CLT is important because, provided the sample size is sufficiently large, it can be applied for determining the sampling distribution of the sample median without assuming the distribution of the population of interest is known.
false, the sample MEAN would make this statement true
According to the Central Limit Theorem, the sampling distribution of the sample mean for sufficiently large sample sizes will be identical to the distribution of the sampled parent population. T/F
false. For sufficiently large samples, the sampling distribution of the sample mean is approximately normal (with mean μ and standard deviation σ/√n), whatever the shape of the parent population.
an estimator is said to be consistent if the estimator gets closer to the parameter it is estimating as the population variance increases
false; consistency means the estimator gets closer to the parameter as the sample size increases
the central limit theorem is not applicable to samples taken from non-normally distributed populations.
false; it IS applicable to non-normally distributed populations
the population mean is an example of an estimator
false; it is a parameter
the population mean and the sample mean are usually the same value
false; they are usually close but rarely exactly the same value
Variance
for each value, find the difference between it and the mean, square those differences, add up squared differences and divide by "n"
a statistic is said to be an unbiased estimator of the parameter if
the sampling distribution of the sample statistic θ̂ has a mean equal to the population parameter θ the statistic is intended to estimate
A probability _________ is a graph of the probability distribution that displays the possible values of the discrete random variable on the x axis and the probabilities of those values on the vertical axis.
histogram
Baylor University has decided to give students a z score at the end of each semester, rather than a GPA. The mean and standard deviation of all students' cumulative GPAs, on which the z scores are based, are 3.0 and .6 respectively. What are these students' GPAs? i. Z= 1.5 ii. Z=-.50 iii. Z= -1.3
i: GPA is 3.9 ii: GPA is 2.7 iii: GPA is 2.22 To solve: Z = (X - Average) / (Standard Deviation) Solve for X (GPA)
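A short sketch of the "solve for X" step on this card: rearranging Z = (X - Average) / (Standard Deviation) gives X = Average + Z × (Standard Deviation). The function name `gpa_from_z` is an illustrative assumption.

```python
# Mean and standard deviation of cumulative GPAs, from the card.
mean_gpa, sd_gpa = 3.0, 0.6

def gpa_from_z(z: float) -> float:
    """Invert z = (x - mean) / sd to recover x."""
    return mean_gpa + z * sd_gpa

for z in (1.5, -0.50, -1.3):
    print(f"z = {z:+.2f} -> GPA = {gpa_from_z(z):.2f}")
```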
when is it appropriate to use the normal approximation to the binomial distribution?
if 0 is less than or equal to np - 3(square root of npq) and if np + 3(square root of npq) is less than n
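A minimal sketch of the rule on this card, with q = 1 - p. The function name `normal_approx_ok` and the two test cases are illustrative assumptions; the inequalities follow the card's wording.

```python
from math import sqrt

def normal_approx_ok(n: int, p: float) -> bool:
    """Check that np - 3*sqrt(npq) >= 0 and np + 3*sqrt(npq) < n."""
    q = 1 - p
    spread = 3 * sqrt(n * p * q)
    return 0 <= n * p - spread and n * p + spread < n

# Illustrative cases (assumed): a small skewed binomial and a larger symmetric one.
print(normal_approx_ok(10, 0.9))   # interval 9 +/- 2.85 runs past n = 10
print(normal_approx_ok(100, 0.5))  # interval 50 +/- 15 lies inside 0..100
```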
n choose x meaning
the number of different ways to choose x of the n trials when order doesn't matter. ex) If set = {1, 2, 3}, n = 3, and x = 2: there are 3 items to choose from and 2 of them are selected, giving the combinations {1, 2}, {1, 3}, {2, 3} --> 3 different combinations
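The small example on this card can be sketched in Python: `itertools.combinations` lists the unordered selections and `math.comb` counts them. Variable names are illustrative.

```python
from itertools import combinations
from math import comb

# The card's example: choose x = 2 items from the set {1, 2, 3}.
items = [1, 2, 3]
pairs = list(combinations(items, 2))  # unordered selections of size 2
count = comb(len(items), 2)           # n! / (x! (n - x)!)

print(pairs, count)
```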
measurement error
inaccuracies in the values of the data collected; in surveys, the error may be due to ambiguous or leading questions and the interviewer's effect on the respondent
Two events A and B are said to be statistically ________________ if knowing that one event occurs does not change the probability that the other occurs.
independent
The _______ of two events A and B is the event consisting of every outcome that is in both A and B
intersection (A & B)
alternative hypothesis
the claim that there is a statistically significant change or difference; what the test looks for evidence to support
What name is given to the product of the first k positive integers (counting numbers)? This quantity is denoted by k! = k(k−1)...(2)(1). Additionally 0! = 1.
k factorial
k!
k factorial
Smaller Standard Error
larger sample sizes
A procedure used to estimate the regression parameters B_0 and B_1, and to find the least squares line which provides the best approximation for the relationship between the explanatory variable x and the response variable y is known as the _________.
least squares method
A probability that corresponds to an event represented in the margin of a contingency table is called a __________. A probability that corresponds to an event represented in the intersection of a row and column (a cell) of a contingency table is called a ____________.
marginal probability joint probability
The _____ of a data set is the sum of the observations divided by the number of observations
mean
The _______ of a ______ random variable X is denoted µx, or when no confusion will arise, simply µ. It is defined by µ = ∑x P(X=x)
mean discrete
The ______ and ________ of a Poisson random variable are both equal to the value of the rate parameter λ
mean variance
Unbiased Estimator of a parameter
mean of all its possible values = the parameter
Biased Estimator of a parameter
mean of all its possible values does not equal the parameter
census of a population
measurement of a variable for every unit of a population
qualitative data (categorical data)
measurements that cannot be measured on a natural numerical scale; they can only be classified into one of a group of categories
The _______ of a data set is the number that divides the bottom 50% of sorted data from the top 50%
median
Median
middle value in a data set
The ______ of a data set is the value that occurs most frequently
mode
exponential distribution
models the length of time or distance between occurrences of events (when the number of occurrences of those events in that length of time or distance is modeled using a Poisson distribution)
Mode
most frequent value in a data set
population mean
mu
mean for a uniform distribution
mu = (c + d) / 2
mean for a poisson distribution
mu = lambda
mean for an exponential distribution
mu = theta
Two or more events are ______________ if no two of them have outcomes in common
mutually exclusive
n choose x equation
n! / (x! (n-x)! )
"Relatively Large"
n>30
Low-back pain (LBP) is a serious health problem in many industrial settings. The article "Isodynamic Evaluation of Trunk Muscles and Low-Back Pain Among Workers in a Steel Factory" (Ergonomics, 1995: 2107-2117) reported a summary of data on lateral range of motion (degrees) for a sample of workers without a history of LBP and another sample with a history of this malady. For the No LBP sample, the standard deviation was 2.5; for the LBP sample, the standard deviation was 7.8. Calculate a 90% confidence interval for the difference between population mean extent of lateral motion for the two conditions.
nonpooled t conf interval
Quantitative noninvasive techniques are needed for routinely assessing symptoms of peripheral neuropathies, such as carpal tunnel syndrome (CTS). The article "A Gap Detection Tactility Test for Sensory Deficits Associated with Carpal Tunnel Syndrome" (Ergonomics, 1995: 2588-2601) reported on a test that involved sensing a tiny gap in an otherwise smooth surface by probing with a finger; this functionally resembles many work-related tactile activities, such as detecting scratches or surface defects. When finger probing was not allowed, the sample average gap detection threshold for normal subjects was 1.71 mm, and the sample standard deviation was .13; for CTS subjects, the sample mean and sample standard deviation were 2.53 and .87, respectively. Does this data suggest that the true average gap detection threshold for CTS subjects exceeds that for normal subjects? Use a significance level of .01.
nonpooled t test
In the simple linear regression model, what is the assumed distribution of the error?
normal
A variable is said to be a ___________ distributed variable, or to have a _______ distribution if the PDF is given by f(x) = (2πσ²)^(-1/2) exp{-(x - μ)²/(2σ²)}, -∞ < x < ∞.
normally normal
In an __________, researchers simply observe characteristics and take measurements.
observational study
highly suspect outliers
observations beyond the outer fences (beyond + or - 3(IQR))
suspect outliers
observations falling between the inner and outer fences (between 1.5(IQR) and 3(IQR))
outliers (in relation to z-scores)
observations with z-scores greater than 3 in absolute value
High concentration of the toxic element arsenic is all too common in groundwater. The article "Evaluation of Treatment Systems for the Removal of Arsenic from Groundwater" (Practice Periodical of Hazardous, Toxic, and Radioactive Waste Mgmt., 2005: 152-157) reported that for a sample of n=5 water specimens selected for treatment by coagulation, the sample mean arsenic concentration was 24.3 mg/L, and the sample standard deviation was 4.1. Calculate and interpret a 95% CI for true average arsenic concentration in all such water specimens.
one mean t conf interval
Minor surgery on horses under field conditions requires a reliable short-term anesthetic producing good muscle relaxation, minimal cardiovascular and respiratory changes, and a quick, smooth recovery with minimal aftereffects so that horses can be left unattended. The article "A Field Trial of Ketamine Anesthesia in the Horse" (Equine Vet. J., 1984: 176-179) reports that for a sample of n=73 horses to which ketamine was administered under certain conditions, the sample average lateral recumbency (lying-down) time was 18.86 min and the standard deviation was 8.6 min. Does this data suggest that true average lateral recumbency time under these conditions is less than 20 min? Test at level of significance .10.
one mean t test
A random sample of 110 lightning flashes in a certain region resulted in a sample average radar echo duration of .81 sec and a sample standard deviation of .34 sec ("Lightning Strikes to an Airplane in a Thunderstorm," J. of Aircraft, 1984: 607-611). Calculate a 99% confidence interval for the true average echo duration. Assume the population standard deviation of radar echo duration is .25.
one mean z conf interval
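A rough sketch of the one-mean z interval for the card above, x-bar ± z* · σ/√n, using the card's numbers and the known population standard deviation .25. The critical value comes from the standard-library `statistics.NormalDist`; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values from the card: n flashes, sample mean, known population sd, confidence level.
n, xbar, sigma, conf = 110, 0.81, 0.25, 0.99

# z* leaves (1 - conf)/2 in each tail of the standard normal curve.
z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
margin = z_star * sigma / sqrt(n)
lower, upper = xbar - margin, xbar + margin

print(f"z* = {z_star:.3f}, 99% CI = ({lower:.4f}, {upper:.4f})")
```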
Water samples are taken from water used for cooling as it is being discharged from a power plant into a river. It has been determined that as long as the mean temperature of the discharged water is at most 150°F, there will be no negative effects on the river's ecosystem. To investigate whether the plant is in compliance with regulations that prohibit a mean discharge water temperature above 150°, 50 water samples will be taken at randomly selected times and the temperature of each sample recorded. At the 5% significance level, is there enough evidence to suggest that the plant is not in compliance with regulations? Assume the population standard deviation of the discharge water temperature is 10 degrees.
one mean z test
class
one of the categories into which qualitative data can be classified
The technology underlying hip replacements has changed as these operations have become more popular (over 250,000 in the United States in 2008). Starting in 2003, highly durable ceramic hips were marketed. Unfortunately, for too many patients the increased durability has been counterbalanced by an increased incidence of squeaking. The May 11, 2008, issue of the New York Times reported that in one study of 143 individuals who received ceramic hips between 2003 and 2005, 10 of the hips developed squeaking. Calculate a 95% confidence interval for the true proportion of such hips that develop squeaking.
one proportion z conf interval
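A rough sketch of the one-proportion z interval for the card above, assuming the usual large-sample form p̂ ± z* · √(p̂(1 - p̂)/n) is the one intended. The critical value comes from `statistics.NormalDist`; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values from the card: 10 squeaking hips out of 143, 95% confidence.
successes, n, conf = 10, 143, 0.95

p_hat = successes / n
z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"p-hat = {p_hat:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```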
A random sample of 150 recent donations at a certain blood bank reveals that 82 were type A blood. Does this suggest that the actual percentage of type A donations differs from 40%, the percentage of the population having type A blood? Use a 5% significance level.
one proportion z test
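A rough sketch of the one-proportion z test for the card above, with H0: p = 0.40 against a two-sided alternative. The standard error uses the null value p0, and the p-value comes from `statistics.NormalDist`; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values from the card: 82 type A donations out of 150, null proportion 0.40.
successes, n, p0, alpha = 82, 150, 0.40, 0.05

p_hat = successes / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # test statistic under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
reject = p_value < alpha

print(f"z = {z:.3f}, p-value = {p_value:.5f}, reject H0: {reject}")
```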
a data set is said to be skewed when
one tail of the distribution has more extreme observations than the other tail
to what distribution(s) does the Empirical Rule apply?
only to normal distributions (mound shaped, symmetrical)
for any set of n measurements (arranged in ascending or descending order), the pth percentile is a number such that
p% of the measurements fall below that number and (100-p)% fall above it
Cushing's disease is characterized by muscular weakness due to adrenal or pituitary dysfunction. To provide effective treatment, it is important to detect childhood Cushing's disease as early as possible. Age at onset of symptoms and age at diagnosis (months) for 15 children suffering from the disease were given in the article "Treatment of Cushing's Disease in Childhood and Adolescence by Transphenoidal Microadenomectomy" (New Engl. J. of Med., 1984: 889). Calculate a 95% confidence interval for the population mean difference between age at onset of symptoms and age at diagnosis.
paired t conf interval
Are textbooks actually cheaper online? Here we compare the price of textbooks at the University of California, Los Angeles (UCLA's) bookstore and prices at Amazon.com. Seventy-three UCLA courses were randomly sampled in Spring 2010, representing less than 10% of all UCLA courses. At the 5% significance level, do the data provide enough evidence to determine that, on average, there is a difference between Amazon's price for a book and the UCLA bookstore's price?
paired t test
A __________ of r objects from a collection of m objects is any ordered arrangement of r of the m objects
permutation
Does treatment using embryonic stem cells (ESCs) help improve heart function following a heart attack? Summary statistics for an experiment to test ESCs in sheep that had a heart attack are calculated. Each of these sheep was randomly assigned to the ESC or control group, and the change in their hearts' pumping capacity was measured in the study. A positive value corresponds to increased pumping capacity, which generally suggests a stronger recovery. Find a 99% confidence interval for the difference in pumping capacity between the ESC group and the control group. The sample standard deviation for the ESC group is 5.17 and the sample standard deviation for the control group is 3.76.
pooled t conf interval
A study examined a random sample of 150 cases of mothers and their newborns in North Carolina over a year. The data consisted of the weight of the newborn and whether the mother smoked during pregnancy. The sample standard deviation of newborn weights is 1.43 for mothers who did not smoke during pregnancy and the sample standard deviation is 1.60 for mothers who did smoke during pregnancy. At the 10% significance level, is there convincing evidence that newborns from mothers who smoke have an average birth weight smaller than newborns from mothers who don't smoke?
pooled t test
The _____ is the collection of all individuals or items under consideration in a statistical study
population
The values of a variable for the entire population is called _______ data, or _______ data.
population census
The distribution of population is called the ___________ distribution or the distribution of the ________
population variable
mu
population mean
Gk letter eta
population median
Parent Population
population where samples are generated Normally distributed w/ μ=0, σ=1
sample mean
precise estimator of population mean
In many cases, knowing that one event has occurred can affect the _________ that another event will occur. The probability that event B occurs if the event A has already occurred is called the ________.
probability conditional probability
The function f(x) is more accurately called a ___________________.
probability density function
A normal _______________ is a plot of the observed values of the variable versus the normal _____, which are the observations expected for a variable having a ________ normal distribution.
probability plot scores standard
A ___________ is a sample selected by a procedure that uses a random device to decide which members of the population will constitute the sample
probability sample
In stratified random sampling, the strata should be sampled proportional to their size. This is called what?
proportional allocation
A __________ variable is a non-numerically valued characteristic that varies from one person or thing to another
qualitative/categorical
A _________ variable is a numerically valued characteristic that varies from one person or thing to another
quantitative
A ________ variable is a quantitative variable whose value depends on chance.
random
bivariate relationship
relationship between 2 variables
A ____________ is a sample that reflects as closely as possible the relevant characteristics of the population
representative sample
The ___________ is the characteristic of the experimental outcome that is to be measured or observed. A ______ is a variable whose effect on the response variable is of interest in the experiment. The possible values of a factor are the _______. The _______ is each experimental condition
response variable factor levels treatment
The principle of least squares results in values B_0^ and B_1^ that minimize the sum of squared deviations between the observed values of the _________ and the _________.
response variable y, estimated values y^
Sampling Error
results from using a sample to describe a population characteristic. -why sampling distributions are needed.
To assess the normality of a variable using sample data, construct a normal probability plot. If the plot is _____________, you can assume that the variable is approximately normally distributed.
roughly linear
The _______ is that part of the population from which the information is obtained
sample
The distribution of sample data is called a ________ distribution
sample
The values of a variable for a sample of the population is called ________ data
sample
x-bar
sample mean
M
sample median
The ____________ is the collection of all possible outcomes for an experiment. An ______ is a collection of outcomes for the experiment, that is, any subset of the sample space.
sample space event
When applying counting rules to sampling from a population, the number of possible ______ of size n from a _________ of size N is NCn.
samples population
Descriptive statistics can be applied to _____ and to __________. There are _______ or __________ methods.
samples populations graphical numerical
Inferential statistics can only be performed on _______. Conclusions from the ________ are inferred to the population. Inferential methods are typically _______, and conclusions include at least one __________.
samples sample numerical probability
what is the best way to describe the relationship between two variables?
scatterplot
Probability sampling may still yield a nonrepresentative sample. However, it eliminates unintentional _______. Furthermore, the use of probability sampling guarantees that the techniques of ____________ can be applied.
selection bias statistical inference
selection bias
selection bias occurs when a subset of experimental units in the population has little or no chance of being selected for the sample
If the variable of interest is qualitative, the bars on a bar graph (should/should not) touch.
should not
population standard deviation
sigma
standard deviation for a uniform distribution
σ = (d − c)/√12, for X ~ unif(c, d)
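As a quick check of the uniform formulas, the sketch below uses the unif(-2.5, 1.6) example from earlier in this set. The helper names `uniform_mean` and `uniform_sd` are made up for illustration.

```python
import math


def uniform_mean(c, d):
    """Mean of a uniform distribution on [c, d]."""
    return (c + d) / 2


def uniform_sd(c, d):
    """Standard deviation of a uniform distribution on [c, d]."""
    return (d - c) / math.sqrt(12)


c, d = -2.5, 1.6
mean_x = uniform_mean(c, d)  # agrees with the -0.45 answer in this set
sd_x = uniform_sd(c, d)
print(mean_x, sd_x)
```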
standard deviation for an exponential distribution
sigma = theta
In the simple linear regression model, what is the assumed variance of the error variable?
sigma squared
variance for a poisson distribution
sigma squared = lambda
A ________ is a sample chosen in such a way that each possible ________ of a given size is _______ likely to be the one obtained
simple random sample sample equally
Standard Deviation
the square root of the variance (roughly, the typical distance of the observations from the mean)
The _______ of a ______ random variable X is denoted σ_X, or, when no confusion will arise, simply σ. It is defined as σ = √( ∑ (x − μ)² P(X = x) ).
standard deviation discrete
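The definition can be checked numerically. The sketch below assumes a fair six-sided die as the discrete random variable (an illustrative choice, not from the source) and applies the square root of ∑(x − μ)² P(X = x) directly.

```python
import math

# Hypothetical discrete distribution: a fair six-sided die
pmf = {x: 1 / 6 for x in range(1, 7)}

# Probabilities of a discrete random variable must sum to 1
total_prob = sum(pmf.values())

# Mean: sum of x * P(X = x)
mu = sum(x * p for x, p in pmf.items())

# Variance: sum of (x - mu)^2 * P(X = x); the standard deviation is its square root
variance = sum((x - mu) ** 2 * p for x, p in pmf.items())
sigma = math.sqrt(variance)

print(mu, variance, sigma)
```

For the fair die the mean is 3.5 and the variance is 35/12, so σ is about 1.71.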
standard error of an estimator
the standard deviation of the estimator's sampling distribution
If X is a normal random variable with mean µ and standard deviation σ, then the __________ version Z=(X-µ)/σ has a ________ normal distribution. The mean of the distribution of Z is always ___ and the standard deviation of the distribution of Z is always ___
standardized standard 0 1
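A brief sketch of standardization using Python's standard-library `statistics.NormalDist`. The values μ = 100, σ = 15, and x = 130 are hypothetical; the 0.025 left-tail area matches the z-score question earlier in this set.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # hypothetical normal population
x = 130.0

# Standardized version: Z = (X - mu) / sigma
z = (x - mu) / sigma

# The standard normal distribution has mean 0 and standard deviation 1
std_normal = NormalDist(mu=0.0, sigma=1.0)

# P(X <= x) computed two equivalent ways
p_direct = NormalDist(mu, sigma).cdf(x)
p_standardized = std_normal.cdf(z)

# z-score with area 0.025 to its left
z_025 = std_normal.inv_cdf(0.025)

print(z, p_direct, p_standardized, z_025)
```

The two probabilities agree, which is why a single standard normal table suffices for any normal variable.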
null hypothesis
status quo
In stratified random sampling, the population is first divided into subpopulations called what?
strata
Mean
sum all values and divide by "n"
This test is used when: - Testing 2 population means - Populations are normal - Population variances are NOT equal
t'-test with Satterthwaite's degrees of freedom
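For reference, a minimal sketch of the commonly used Satterthwaite approximation for the degrees of freedom, together with the unpooled t statistic. The function names and the sample summaries are hypothetical.

```python
import math


def satterthwaite_df(s1, n1, s2, n2):
    """Approximate degrees of freedom when the two variances are not assumed equal."""
    v1 = s1 ** 2 / n1
    v2 = s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


def unpooled_t(x1_bar, s1, n1, x2_bar, s2, n2):
    """t' statistic using separate (unpooled) variance estimates."""
    return (x1_bar - x2_bar) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Hypothetical sample summaries: mean, standard deviation, size
df = satterthwaite_df(3.0, 12, 6.0, 8)
t_stat = unpooled_t(20.0, 3.0, 12, 24.0, 6.0, 8)
print(df, t_stat)
```

The approximate degrees of freedom always fall between min(n1 − 1, n2 − 1) and n1 + n2 − 2, and are usually not an integer.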
population variance (sigma ^2)
the average of the squared deviations from the mean (mu) of all the measurements on all units in the population
class relative frequency
the class frequency divided by the total number of observations in the data set; class frequency/n
class percentage
the class relative frequency multiplied by 100; (class relative frequency) x 100
range (of a quantitative data set)
the largest measurement minus the smallest measurement
modal class
the measurement class containing the largest relative frequency
mode (of a data set)
the measurement that occurs most frequently in the data set; for grouped data, the midpoint of the modal class
median (of a quantitative data set)
the middle number when the measurements are arranged in ascending (or descending) order
class frequency
the number of observations in the data set that fall into a particular class
population standard deviation (sigma)
the positive square root of the population variance
sample standard deviation (s)
the positive square root of the sample variance (s^2)
sampling distribution (of a sample statistic calculated from a sample of n measurements)
the probability distribution of the statistic
variability (of a set of measurements)
the spread of the data
mean (of a set of quantitative data)
the sum of the measurements, divided by the number of measurements contained in the data set
sample variance (for a sample of n measurements) is equal to
the sum of the squared deviations from the mean, divided by (n-1); represented by s^2
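The numerical summaries defined on these cards can be computed with Python's standard-library `statistics` module. The data set below is hypothetical, and the manual variance calculation is included only to make the (n − 1) divisor visible.

```python
import statistics

data = [2, 4, 4, 5, 7, 8]  # hypothetical sample of n = 6 measurements
n = len(data)

mean = statistics.mean(data)        # sum of measurements divided by n
median = statistics.median(data)    # middle value (average of the two middle values when n is even)
mode = statistics.mode(data)        # most frequent measurement
data_range = max(data) - min(data)  # largest minus smallest

# Sample variance: sum of squared deviations from the mean, divided by (n - 1)
s2_manual = sum((x - mean) ** 2 for x in data) / (n - 1)
s2 = statistics.variance(data)
s = statistics.stdev(data)          # positive square root of the sample variance

print(mean, median, mode, data_range, s2, s)
```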
central tendency (of a set of measurements)
the tendency of the data to cluster, or center, about certain numerical values
(T/F) For an SRS, the sample distribution approximates the population distribution. The larger the sample size, the better the approximation.
true
(T/F) For a continuous random variable, P(X=x) is always equal to 0
true
(T/F) For any discrete random variable X, the sum of the probabilities across all values of X is always equal to 1.
true
(T/F) If P(A) > 0 and P(B) > 0, then it is always true that P(A|B) = P(A) is equivalent to P(A∩B) = P(A)P(B).
true
(T/F) Randomly selecting an individual from a population of interest is like selecting an SRS of size one.
true
(T/F) The mean of a normal distribution has no effect on the distribution's shape.
true
(T/F) The value 0 is a valid probability.
true
(T/F) The value 0.462 is a valid probability.
true
(T/F) Two data sets that have identical frequency distributions have identical relative frequency distributions.
true
Consider two normal distributions, one with mean 4 and standard deviation 3; the other with mean 4 and standard deviation 6. TRUE or FALSE. The two normal distributions are centered at the same place, but have different shapes.
true
T/F: A bell curve is symmetric.
true
an estimator is said to be unbiased if the expected value of the estimator is exactly the parameter which the estimator is attempting to estimate
true
any function of the sample data is a statistic
true
every estimator of a population parameter is a random variable that has its own density function
true
parameters describe characteristics of the population (or a random variable) of interest
true
the central limit theorem applies to populations which are modeled as discrete random variables
true
the standard error of the sample mean X bar does not depend on the expected value mu of the population of interest that is being sampled
true
(T/F) The width of a confidence interval for the population mean μ is dependent on the sample standard deviation.
true
every estimator is a statistic
true; but not every statistic is an estimator
The article "Aspirin Use and Survival After Diagnosis of Colorectal Cancer" (J. of the Amer. Med. Assoc., 2009: 649-658) reported that of 549 study participants who regularly used aspirin after being diagnosed with colorectal cancer, there were 81 colorectal cancer-specific deaths, whereas among 730 similarly diagnosed individuals who did not subsequently use aspirin, there were 141 colorectal cancer-specific deaths. Does this data suggest that the regular use of aspirin after diagnosis will decrease the incidence rate of colorectal cancer-specific deaths? Use a 1% significance level.
two-proportion z-test
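One way the calculation could be carried out is sketched below, using the counts quoted in the question. It assumes a lower-tailed test (H0: p1 = p2 versus Ha: p1 < p2, with group 1 the aspirin users) and a pooled standard error; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Counts from the question: deaths / group size
x1, n1 = 81, 549    # regular aspirin users
x2, n2 = 141, 730   # non-users
alpha = 0.01

p1_hat = x1 / n1
p2_hat = x2 / n2

# Pooled proportion under H0: p1 = p2
p_pooled = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

z = (p1_hat - p2_hat) / se

# Lower-tailed p-value, since Ha is p1 < p2
p_value = NormalDist().cdf(z)
reject_h0 = p_value < alpha

print(z, p_value, reject_h0)
```

Under these assumptions z is roughly -2.1 and the p-value is a little under 0.02, so the null hypothesis is not rejected at the 1% level even though it would be at 5%.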
if a data set is skewed to the left
typically the mean is less than the median
if a data set is skewed to the right
typically the median is less than the mean
A distribution is ________ if it has one peak; ________ if it has two peaks and _________ if it has 3 or more peaks
unimodal bimodal multimodal
The _______ of two events A and B is the event consisting of all outcomes that are in A, in B, or in both A and B.
union (A or B)
where are the hinges of a box plot?
upper quartile and lower quartile
descriptive statistics
utilizes numerical and graphical methods to look for patterns in a data set, to summarize the information revealed in a data set, and to present that information in a convenient form
inferential statistics
utilizes sample data to make estimates, decisions, predictions, or other generalizations about a larger set of data
What is the term for "a characteristic that varies from one person or thing to another"?
variable
when should you use a continuity correction?
when you're approximating a discrete distribution with a continuous distribution
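To illustrate, the sketch below approximates a binomial probability with a normal distribution, with and without the continuity correction. The setting (n = 50, p = 0.4, P(X ≤ 20)) is a hypothetical example.

```python
import math
from statistics import NormalDist

n, p, k = 50, 0.4, 20  # hypothetical binomial setting; target is P(X <= k)

# Exact binomial probability
exact = sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

# Normal distribution with the same mean and standard deviation as the binomial
approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))

uncorrected = approx.cdf(k)      # ignores that X is discrete
corrected = approx.cdf(k + 0.5)  # continuity correction: extend to k + 0.5

print(exact, uncorrected, corrected)
```

In this example the corrected value lands much closer to the exact probability than the uncorrected one.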