Statistics Final Exam
ANOVA (Analysis of variance)
Analysis of variance: a test used for designs with three or more sample means. It asks: are there one or more significant differences anywhere among these samples? The observed values are divided among three or more groups. Simple (one-way) ANOVA: used when one variable (such as race) is being explored and this variable has more than two categories (white, black, Latino, Asian); there is only one grouping dimension (race); the variance in the outcome (education) is separated into variance due to differences between groups and variance due to differences within groups. Factorial design: used to explore more than one independent variable (race + gender + income).
expected values from chi-square
expected values: values if there was no relationship between the dependent and independent variables (expected = (R * C) / N R = row total C = column total N = total N)
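The expected-count formula can be sketched in a few lines of Python. This is only an illustration: the function name `expected_counts` and the 2x2 table are made up, not taken from the course material.

```python
def expected_counts(observed):
    """Return the table of expected counts under no relationship.

    observed: list of rows, each row a list of cell counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # expected = (R * C) / N for every cell
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2x2 crosstab (made-up counts, for illustration only)
observed = [[30, 20],
            [20, 30]]
expected = expected_counts(observed)
print(expected)
```

Every row and column total here is 50 and N is 100, so each expected count is 50 * 50 / 100 = 25.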
central limit theorem
for random samples with a sufficiently large sample size, the distribution of many common sample statistics can be approximated with a normal distribution.
conditions for the t-distribution
for small samples, the use of the t-distribution requires that the population distribution be approximately normal. if the sample size is small, we need to check that the data are relatively symmetric and have no huge outliers that might indicate a departure from normality in the population... In practice, we avoid using the t-distribution if the sample is small (say less than 20) and the data contains clear outliers or skewness. (If the sample size is small and the data are heavily skewed or contain extreme outliers, the t-distribution should not be used.)
central limit theorem for sample means
if the sample size n is large, the distribution of sample means from a population with mean μ and standard deviation σ is approximately normally distributed with mean μ and standard deviation of σ / √n
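One way to see this is a small simulation. The sketch below assumes a right-skewed exponential population with mean 1 and standard deviation 1; the sample size, number of repetitions, and seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

n = 40       # size of each sample (arbitrary)
reps = 2000  # number of samples drawn (arbitrary)

# Population: exponential with rate 1, so mu = 1 and sigma = 1 (right-skewed)
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

theoretical_se = 1.0 / math.sqrt(n)  # sigma / sqrt(n)
print("mean of sample means:", statistics.mean(sample_means))
print("sd of sample means:  ", statistics.stdev(sample_means))
print("sigma / sqrt(n):     ", theoretical_se)
```

Even though the population is skewed, the sample means should center near 1 with a spread close to σ / √n.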
linear regression
method for predicting the value of a dependent variable Y, based on the value of an independent variable X
chi-square test for association
To test for an association between two categorical variables, based on a two-way table that has r rows as categories for variable A and c columns as categories for variable B: 1. set up hypotheses; 2. compute the expected count for each cell in the table using expected count = (row total * column total) / sample size; 3. compute the value of the chi-square statistic; 4. find a p-value using the upper tail of a chi-square distribution with (r-1)(c-1) degrees of freedom. The chi-square distribution is appropriate if the expected count is at least five in each cell.
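A minimal sketch of the statistic itself, assuming the usual formula (sum over cells of (observed - expected)^2 / expected), which the card does not spell out. The function name and the table are hypothetical.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected count
            chi2 += (obs - exp) ** 2 / exp

    df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r-1)(c-1)
    return chi2, df


# Hypothetical 2x2 table (made-up counts)
chi2, df = chi_square_statistic([[30, 20], [20, 30]])
print(chi2, df)
```

Each expected count in this table is 25, so each cell contributes (5^2)/25 = 1 and the statistic is 4 with 1 degree of freedom. The statistic would then be compared with a critical value from a chi-square table.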
Regression Cautions
1 - avoid trying to apply a regression line to predict values far from those that were used to create it 2- plot the data! although the regression line can be calculated for any set of paired quantitative variables, it is only appropriate to use a regression line when there is a linear trend in the data 3 - outliers can have a strong influence on the regression line, just as we saw for correlation. In particular, data points for which the explanatory value is an outlier are often called influential points because they exert an overly strong effect on the regression line
What are the steps in hypothesis testing
1. state the null and alternative hypotheses 2. set the significance level 3. select the appropriate test statistic 4. compute the test statistic value 5. determine the value needed for rejection of the null hypothesis using appropriate table of critical values for the particular statistic 6. compare the obtained value and the critical value 7. state the decision, conclude, and interpret
Assumption about Variances
Equal Variances Assumed: When the two independent samples are assumed to be drawn from populations with identical population variances Equal Variances not Assumed: When the two independent samples are assumed to be drawn from populations with unequal variances (typically pick this one)
t-test for two independent sample proportions
H0: pi1 = pi2 = pi. H1: pi1 not equal to pi2, pi1 < pi2, or pi1 > pi2. Pooled estimate: p = (n1 * p1 + n2 * p2) / (n1 + n2). Standard error: σp = square root of ((p(1-p)/n1) + (p(1-p)/n2)), using the pooled estimate p in place of the unknown pi. Test statistic: t = ((p1 - p2) - 0) / σp. df = n1 + n2 - 2
Saint Peter's Basilica main altar below the dome, with the Baldachino, interior creator: Moderno, Carlo (italian architect); Buonarroti, Michelangelo (Italian sculptor, painter, and architect); Bernini, Gian Lorenzo (Italian sculptor and draftsman) location: Rome, Roma, Lazio, Italy; Holy See period: Baroque 17th century St. Peter's is famous as a place of pilgrimage and for its liturgical functions. (St. Peter's has many historical associations, with the Early Christian Church, The Papacy, the Protestant Reformation and Catholic counter-reformation and numerous artists, especially Michelangelo.)
Hypothesis tests are used to investigate claims about population parameters. In statistical inference, we use data from a sample to draw conclusions about a population. The two main areas of statistical inference are estimation and testing. Hypothesis testing lets us test whether a claim about a population is plausible (we have a hunch that the population mean is equal to some value, we collect a sample to check the hunch, and we use a hypothesis test to see whether the data are consistent with it).
(to help me with writing) characteristics of Baroque art
The Baroque style is characterized by exaggerated motion and clear detail to produce drama, exuberance, and grandeur in sculpture, painting, architecture, literature, dance, and music. Baroque art was meant to evoke emotion and passion instead of the calm rationality that had been prized during the Renaissance.
example problem. tell which test and do it "A sample of 25 adults in Company A found that they spent an average of 35 minutes on Facebook per day, with a standard deviation of 10. A sample of 30 adults at Company B found that they spent an average of 38 minutes per day on Facebook with a standard deviation of 8. Test whether Company A spends less time on Facebook than Company B. Use an alpha of 0.01."
***t-test for two independent sample means. Null: employees at Companies A and B spend an equal amount of time on Facebook. H0: μA = μB. Alternative: employees at Company A spend less time on Facebook than employees at Company B. H1: μA < μB. Standard error: SE = square root of (s1^2/n1 + s2^2/n2) = square root of (10^2/25 + 8^2/30) = 2.48. Test statistic: t = ((sample mean 1 - sample mean 2) - 0)/SE = (35 - 38)/2.48 = -1.21. Degrees of freedom: df = n1 + n2 - 2 = 53. For a left-tailed test at alpha = 0.01, the critical value is about -2.40. Since our test statistic of -1.21 does not fall in the critical region, we fail to reject the null hypothesis. We do not have enough evidence to say that employees at Company A spend less time on Facebook than employees at Company B.
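The arithmetic for this problem can be checked with a short script. The numbers come from the problem statement; the variable names are just for illustration.

```python
import math

# Summary statistics from the problem statement
n1, mean1, s1 = 25, 35.0, 10.0  # Company A
n2, mean2, s2 = 30, 38.0, 8.0   # Company B

se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)  # standard error
t = ((mean1 - mean2) - 0) / se               # test statistic
df = n1 + n2 - 2                             # degrees of freedom

print(round(se, 2), round(t, 2), df)
```

The test statistic comes out at about -1.21, matching the card; the decision still requires comparing it with the critical value from a t table.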
example problem. tell which test and do it "A sample of 342 people living in an urban area found that 54% had received a college degree. A sample of 412 people living in rural areas found that 43% had received a college degree. Test for whether a larger proportion of people living in urban areas have a college degree than people living in rural areas. Use an alpha of 0.01."
***t-test for two independent sample proportions. Null: an equal proportion of people in urban and rural areas have college degrees. H0: piU = piR. Alternative: a larger proportion of people living in urban areas have a college degree than people living in rural areas. H1: piU > piR. Pooled estimate using p = (n1 * p1 + n2 * p2) / (n1 + n2): p = 0.48. Standard error using the equation = 0.0365. Test statistic using the equation = 3.014. df = n1 + n2 - 2 = 752, so for a right-tailed test at alpha = 0.01, cv = 2.326. Since our test statistic of 3.014 falls in the critical region, we reject the null in favor of the alternative.
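The same steps as a short script, using the numbers from the problem statement (variable names are illustrative). Carrying full precision gives a test statistic of about 3.01; the card's 3.014 comes from rounding the pooled estimate and standard error first.

```python
import math

# Sample sizes and sample proportions from the problem statement
n1, p1 = 342, 0.54  # urban
n2, p2 = 412, 0.43  # rural

# Pooled estimate of the common proportion under H0
p_pooled = (n1 * p1 + n2 * p2) / (n1 + n2)

# Standard error using the pooled estimate
se = math.sqrt(p_pooled * (1 - p_pooled) / n1 + p_pooled * (1 - p_pooled) / n2)

t = ((p1 - p2) - 0) / se  # test statistic
df = n1 + n2 - 2          # degrees of freedom

print(round(p_pooled, 2), round(se, 4), round(t, 2), df)
```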
Randomization distribution
*A randomization distribution simulates a distribution of sample statistics for a population in which the null hypothesis is true simulates sampling *shows the distribution of statistics that would be observed if the null hypothesis is true (therefore, no relationship) centered around the null hypothesis value (*the randomization distribution will be centered at the value indicated by the null hypothesis and shows what values of the sample statistic are likely to occur by random chance, if the null hypothesis is true.) it is a distribution of statistic just based on random chance
Chi-square statistic (and distribution)
*We want to compare our observed values with values when there is no relationship The chi-square statistic is found by comparing the observed counts from a sample with expected counts derived from a null hypothesis. similar to a t-distribution, the chi-square distribution has a degrees of freedom parameter that is determined by the number of categories (cells) in the table In general, for a goodness-of-fit test based on a table with k cells, we use a chi-square distribution with k-1 degrees of freedom. If our chi-square statistic is less than the critical value, p-value will be greater than our alpha and vice versa
When do we use a chi-square test?
- the sampling method is based on random sampling - the variables under study are each categorical - if sample data are displayed in a contingency table or crosstabs, the expected frequency count for each cell of the table is at least 5
Standard Normal Distribution
Because all the normal distributions look the same except for the horizontal scale, another common way to use normal distributions is to convert everything to one specific standard normal scale. *The standard normal distribution has mean zero and standard deviation equal to one To convert from an X value on a N(μ, σ) scale to a Z value on a N(0,1) scale, we standardize values with the z-score: Z = (X-μ)/σ
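A quick sketch of standardizing a value, using Python's standard library. The numbers (X = 115 on an N(100, 15) scale) are made up for illustration.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Convert X on an N(mu, sigma) scale to Z on the N(0, 1) scale."""
    return (x - mu) / sigma


# Hypothetical example: X = 115 from a population with mu = 100, sigma = 15
z = z_score(115, 100, 15)

# Proportion of the standard normal distribution at or below z
proportion_below = NormalDist(mu=0, sigma=1).cdf(z)

print(z, round(proportion_below, 4))
```

Here the value sits exactly one standard deviation above the mean, so Z = 1 and roughly 84% of the distribution lies below it.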
what is crosstabs and How do we present data in crosstabs or contingency tables?
Another method of analyzing relationships between nominal/ordinal variables We create a table with the dependent variable as the rows and the independent variable as the columns Dependent variable depends on another variable Independent variable influences the dependent variable
simple (one-way) ANOVA
Compare the variability of values within groups with the variability of values between groups, using an F-test. H0: μ1 = μ2 = μ3 = μ4. H1: at least one mean differs from the others.
how to calculate an ANOVA (also definitions of SST and all terms)
First, calculate the sum of squared differences between each point and the overall mean; this is the total sum of squares (SST) (on the test you will not have to hand calculate this). SST = SSB + SSW. Next, calculate the sum of squared deviations of each point from its group mean; this is the within sum of squares (SSW, also written SSE). Finally, calculate the sum of squared deviations of the group means from the overall mean, counting each group mean once per observation in that group; this is the between sum of squares (SSB, also written SSG). Then the F-statistic is F = (SSB/dfB)/(SSW/dfW), where dfB = K-1 and dfW = N-K, K is the number of groups, and N is the total sample size. Then use the table, with dfW in rows and dfB in columns, to find the critical value. If our F-statistic is greater than our critical value, reject the null in favor of the alternative; if our F-statistic is less than our critical value, fail to reject the null.
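A minimal sketch of these sums of squares on a tiny made-up data set (three hypothetical groups of three values each). The function name is illustrative; SSB and SSW here are the same quantities other cards call SSG and SSE.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (SST, SSB, SSW, F) for a list of groups of values."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)

    # Total: each point vs. the overall mean
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    # Within: each point vs. its own group mean
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Between: each group mean vs. the overall mean, weighted by group size
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

    k = len(groups)      # number of groups
    n = len(all_values)  # total sample size
    f = (ssb / (k - 1)) / (ssw / (n - k))
    return sst, ssb, ssw, f


# Hypothetical data, for illustration only
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
sst, ssb, ssw, f = one_way_anova(groups)
print(sst, ssb, ssw, f)
```

For this data SST = 30, SSB = 24, and SSW = 6, so SST = SSB + SSW holds, and F = (24/2)/(6/6) = 12 with dfB = 2 and dfW = 6.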
Formal Statistical decisions
If the p-value is < alpha: Reject Ho This means that the results are significant and we have convincing evidence that Ha is true If the p-value greater than or equals alpha: Do not reject Ho This means the results are not significant and we do not have convincing evidence that Ha is true
paired sample t-test
Sometimes called the dependent sample t-test. A statistical procedure used to determine whether the mean difference between two sets of observations is zero. Each subject is measured twice, resulting in pairs of observations. Common applications of the paired sample t-test include case-control studies and repeated measures designs. The null hypothesis assumes that the true mean difference (μd) is equal to zero: H0: μd = 0. The two-tailed alternative hypothesis assumes that μd is not equal to zero: H1: μd is not equal to 0 (two-tailed).
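A sketch of the paired test statistic, assuming the standard form t = (mean of differences) / (sd of differences / √n) with n - 1 degrees of freedom; the card itself does not give the formula. The before/after scores are made up.

```python
import math
from statistics import mean, stdev

# Hypothetical paired measurements on the same five subjects
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 15]

# Work with the within-pair differences
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
mean_d = mean(diffs)              # sample mean difference
se_d = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference

t = (mean_d - 0) / se_d  # test statistic under H0: mu_d = 0
df = n - 1

print(diffs, round(t, 2), df)
```

The differences are 2, 2, 1, 2, 2, giving a mean difference of 1.8, a standard error of 0.2, and t = 9 on 4 degrees of freedom.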
degrees of freedom
The degrees of freedom in a statistical calculation represent how many values involved in a calculation have the freedom to vary there are more degrees of freedom with a larger sample size
What do degrees of freedom for a chi-square test indicate?
The number of values we need to know in order to fill out the rest of the table. df = (r-1)(c-1) r= number of rows c = number of columns
The Regression Line
The process of fitting a line to a set of data is called linear regression and the line of best fit is called the regression line. The regression line provides a model of a linear association between two variables, and we can use the regression line on a scatterplot to give a predicted value of the response variable, based on a given value of the explanatory variable.
What does Cramer's V measure?
The strength of a statistically significant relationship between categorical variables. Cramer's V ranges from 0 to 1 0=no relationship 1= perfect relationship 0-0.3 weak 0.3-0.7 moderate 0.7-1 strong
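The card does not give the formula, so the sketch below assumes the standard definition V = square root of (chi-square / (N * min(r-1, c-1))). The function name and the input values are hypothetical.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V from a chi-square statistic (assumed standard formula)."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Hypothetical values: chi-square of 4.0 from a 2x2 table with N = 100
v = cramers_v(chi2=4.0, n=100, n_rows=2, n_cols=2)
print(round(v, 2))
```

A value of 0.2 would fall in the "weak" range on the scale above.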
Chi-square test
The test is applied when you have two categorical (nominal or ordinal) variables from a single population It is used to determine whether there is a statistically significant association between the two variables null hypothesis is always that there is no relationship between the two variables in the population alternative hypothesis is always that there is a relationship between the two variables in the population (this does not necessarily mean that one variable causes the other)
What is our objective in doing a t-test?
To test whether the sample mean/proportion is statistically different from a known or hypothesized population mean/proportion
ANOVA calculations
Total Variability = SSTotal = deviations of the data from the grand mean SSTotal = SSG + SSE variability between groups = SSG = deviations of the group means from the grand mean variability within groups = SSE = deviations of the data from their group mean when to use an F-distribution (ANOVA to compare means): sample sizes are large (each ni greater than or equal to 30) or data are relatively normally distributed, and variability is similar in all groups
R-Squared
Useful for interpreting the results of certain statistical analyses (e.g., ANOVA and regression) Represents the percentage of variation in the dependent variable that is explained by its relationship with one or more independent variables (*equals the proportion of variation in the dependent variable explained by the independent variable) r^2 = SSB / SST So with an r-squared of 0.098, political party explained 9.8% of the variation in the liberalness score. (in social science, it is rare to get an r-squared over 0.5)
critical region
We indicate a "critical region" in the theoretical distribution where the alternative hypothesis for us will seem more acceptable than the null hypothesis.
The distribution of sample means using the sample standard deviation (t-distribution)
When choosing random samples of size n from a population with mean μ, the distribution of the sample means is centered at the population mean, μ, and has a standard error estimated by SE = s / √n, where s is the standard deviation of a sample. The standardized sample means approximately follow a t-distribution with n - 1 degrees of freedom (df). For small sample sizes (n < 30), the t-distribution is only a good approximation if the underlying population has a distribution that is approximately normal.
why use the t-distribution?
You must use the t-distribution table when the population standard deviation (σ) is not known and the sample size is small (n<30) since people often prefer to use the normal, and since the t-distribution becomes equivalent to the normal when the number of cases becomes large (Central Limit Theorem), common practice often is: *if σ known, then use normal *if σ not known: -if n is large, then use normal -if n is small then use t-distribution
when to use t-test for independent samples
used when we wish to compare means between two groups
Null and Alternative Hypothesis (nondirectional and directional) hypotheses two types of statistical hypothesis and meanings
null hypothesis (denoted by H0): usually the hypothesis that sample observations result purely from chance. It is a statement of equality: what we hypothesize the population parameter to be equal to, e.g. H0: μ = 85. We assume the null hypothesis is true unless we find evidence to contradict it. Examples: "there is no relationship between __"; "tomato plants show no difference in growth rates when planted in compost rather than soil". *Claim that there is no effect or no difference. The purpose of the null hypothesis: -acts as a starting point because it is the state of affairs that is accepted as true in the absence of any other information -provides a benchmark against which observed outcomes can be compared to see if these differences are due to some factor. alternative hypothesis (denoted by H1 or Ha): the hypothesis that sample observations are influenced by some non-random cause, such as the treatment in an experiment. It is a statement of inequality (not equal to, less than, greater than) giving possible alternative values of the population parameter. Examples: "__ and __ are positively related"; "the rates of growth of tomato plants in compost are higher than those in soil". *Claim for which we seek significant evidence; in general, it uses notation indicating greater than (>), not equal to, or less than (<), depending on the question of interest. The purpose of the alternative hypothesis: -it is the hypothesis that is directly tested -results of this test are compared with what you expect by chance alone (null hypothesis) to see which of the two is the more attractive explanation for any differences between groups or variables you might observe
dummy variable
one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome to create a dummy variable for gender, we choose one category to represent in the regression the other category will be the "reference category" the reference category is coded 0 and new variable coded 1
p-value
relates to sample data. *probability of an observed or more extreme result arising by chance (*the p-value is the proportion of samples, when the null hypothesis is true, that would give a statistic as extreme as (or more extreme than) the observed sample). The smaller the p-value, the stronger the statistical evidence is against the null hypothesis and in favor of the alternative. It is what we obtain from various statistical tests (t-test, chi-square, ANOVA).
p-value and alternative hypothesis
right-tailed test: if Ha contains >, the p-value is the proportion of simulated statistics greater than or equal to the observed statistic, in the right tail. left-tailed test: if Ha contains <, the p-value is the proportion of simulated statistics less than or equal to the observed statistic, in the left tail. two-tailed test: if Ha contains "not equal to", we find the proportion of simulated statistics in the smaller tail at or beyond the observed statistic, and then, to find the p-value, we double this proportion to account for the other tail.
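These three rules can be sketched as one small function. The function name, the labels for the alternative, and the list of simulated statistics are all hypothetical; a real randomization distribution would contain thousands of simulated statistics.

```python
def randomization_p_value(simulated, observed, alternative):
    """p-value from a list of simulated statistics.

    alternative: "greater" (Ha contains >), "less" (Ha contains <),
    or "two-sided" (Ha contains "not equal to").
    """
    n = len(simulated)
    right = sum(s >= observed for s in simulated) / n  # right-tail proportion
    left = sum(s <= observed for s in simulated) / n   # left-tail proportion

    if alternative == "greater":
        return right
    if alternative == "less":
        return left
    if alternative == "two-sided":
        # Double the smaller tail; cap at 1 since a p-value is a proportion
        return min(1.0, 2 * min(left, right))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


# Tiny made-up randomization distribution, for illustration only
simulated = [-3, -2, -1, -1, 0, 0, 0, 1, 1, 2]
print(randomization_p_value(simulated, 2, "greater"))
print(randomization_p_value(simulated, 2, "two-sided"))
```

With an observed statistic of 2, only one of the ten simulated statistics is at or beyond it in the right tail, so the right-tailed p-value is 0.1 and the two-tailed p-value is 0.2.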
significance level (α/Alpha level)
risk associated with not being sure that what you observe in an experiment is due to the treatment or what was being tested; the researcher defines a level of risk that he or she is willing to take: "this could have occurred by chance alone - something else is going on". (*The significance level for a test of hypotheses is a boundary below which we conclude that a p-value shows statistically significant evidence against the null hypothesis.) Common significance levels are alpha = 0.01, 0.05, or 0.10 (if it's not specified, use 0.05). Denoted by alpha (α). ***Represents the tolerable probability of making a type I error.
t-test for two independent sample means
tests whether two groups differ on our variable, using independent samples. H0: μ1 = μ2. SE = square root of (s1^2/n1 + s2^2/n2). t = ((sample mean 1 - sample mean 2) - 0)/SE. df = n1 + n2 - 2
why not use z-test/t-test with more than two samples
the greater the number of separate tests we make, the greater is the likelihood that we'll make a type I error (rejecting a true null hypothesis) somewhere among them
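A rough sketch of how quickly the chance of at least one false rejection grows. It assumes the separate tests are independent, which pairwise tests on the same groups are not exactly, so treat the result as an approximation. The function name is illustrative.

```python
import math


def chance_of_false_rejection(num_groups, alpha=0.05):
    """Approximate chance of rejecting at least one true null hypothesis
    when every pair of groups is tested separately.

    Assumes the pairwise tests are independent (a simplification).
    """
    num_tests = math.comb(num_groups, 2)  # number of pairwise comparisons
    return 1 - (1 - alpha) ** num_tests


# Four groups need 6 pairwise tests
print(math.comb(4, 2), round(chance_of_false_rejection(4), 3))
```

With four groups and alpha = 0.05, the six pairwise tests give roughly a 26% chance of at least one false rejection, compared with 5% for a single test. ANOVA avoids this by testing all the means at once.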
Type I and Type II Error
there are two fundamental mistakes we can make when we make a decision in hypothesis testing: - type I: rejecting a null hypothesis when there really is no difference between groups (the null is true) - type II: failing to reject a false null hypothesis
regression line (and residual sum of squares)
this line gives the predicted value of the dependent variable for each value of the independent variable; it is the line of best fit. The line fits these data because it minimizes the squared vertical distances between the individual points and the regression line: regression finds the line that minimizes the residual sum of squares (OLS = ordinary least squares). y = a + bx, where a is the intercept and b is the coefficient (slope). The intercept is the value of y when x is 0; the coefficient is the change in y for each one-unit increase in x.
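A minimal sketch of fitting y = a + bx by ordinary least squares, using the standard closed-form solution for one predictor. The function name and the five data points are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (a, b, rss): intercept, slope, and residual sum of squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: co-variation of x and y divided by variation in x
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar  # the line passes through (x_bar, y_bar)
    # Residual sum of squares: squared gaps between observed and predicted y
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, rss = fit_line(xs, ys)
print(round(a, 2), round(b, 2), round(rss, 2))
```

For this data the fitted line is y = 2.2 + 0.6x: the predicted y is 2.2 when x is 0, and it rises by 0.6 for each one-unit increase in x. No other straight line gives a smaller residual sum of squares than 2.4 for these points.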