Stats test 3 material

conditions for chi-square test

- the chi-square test for two-way tables looks for evidence of association between two categorical variables (factors) in the sample data
- must be SRSs
- very few (no more than 1 in 5) expected counts can be less than 5
- all expected counts are greater than or equal to 1

Chapter 18 two sample situations

--> need to determine if the 2 samples are independent or not
- independent samples: the individuals in both samples are chosen separately ("independently")
- matched pairs samples: the individuals in both samples are related (for ex. the same subjects assessed twice, or siblings/pairs)

conditions for inference

- data must come from an SRS
- the observations in the pop. must have a normal dist
- the sample mean x bar has a normal distribution with standard deviation σ/√n
- when σ is unknown, we estimate it using s. the estimated standard deviation of x bar, known as the standard error, is then s/√n = SEM

handling multiple comparisons statistically

- the first step in examining multiple populations statistically is to test for overall statistical significance as evidence of any difference among the parameters we want to compare: the ANOVA F test
- if that overall test shows statistical significance, then a detailed follow-up analysis can examine all pairwise parameter comparisons to determine which parameters differ from which and by how much

Conditions for the chi-square test: THE MULTINOMIAL SETTING

1. There is a fixed number n of observations.
2. The n observations are all independent. That is, knowing the result of one observation does not change the probabilities we assign to other observations.
3. Each observation falls into just one of a finite number k of complementary and mutually exclusive outcomes.
4. The probability of a given outcome is the same for each observation.

CONDITIONS FOR INFERENCE ABOUT A PROPORTION

1. We can regard our data as a simple random sample (SRS) from the population. As usual, this is the most important condition.
2. The population is at least 20 times as large as the sample. This ensures independence of successive trials in the random sampling.
3. The sample size n is large enough to ensure that the sampling distribution of ˆp is close to Normal. We will see that different inference procedures require different answers to the question "How large is large enough?"

conditions for conducting inference between two population proportions

1. When the samples are large, the distribution of ˆp1−ˆp2 is approximately Normal.
2. The mean of the sampling distribution of ˆp1−ˆp2 is p1−p2. That is, the difference between sample proportions is an unbiased estimator of the difference between population proportions.
3. The standard deviation of the distribution is √(p1(1−p1)/n1 + p2(1−p2)/n2).

*cocaine example

1. create a 2-way table for ACTUAL OBSERVATIONS with treatment on the rows and relapse (yes or no) on the columns
2. create an EXPECTED OBSERVATIONS table based on our formula: expected count = row total × column total / table total
   ex: for desipramine | no: (25 x 26) / 74 = 8.78
3. create a table of the chi-square components, (observed − expected)^2 / expected, for each cell
   ex: for desipramine | no: (15 - 8.78)^2 / 8.78 = 4.40
4. chi-square statistic = sum of the entire chart; df = (rows - 1) x (columns - 1)
5. P value = chisq.dist.rt(chi stat, df)
6. chi-square test/check: =chisq.test(select all the data from actual observations [not the totals], select all data from expected observations) gives the same P value
7. Ho: no association btw relapse and treatment; Ha: the null hypothesis is incorrect, evidence of association between rows and columns
8. which treatments appear to be the most effective? highlight the largest components in the last table: desipramine has the highest value for no relapse, while placebo has the least success
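
A minimal Python/scipy sketch of the same two-way-table test. The counts below are placeholders, not the actual study table; substitute the observed counts from the Excel sheet:

```python
# Chi-square test of association for a treatment-by-relapse two-way table.
# The observed counts here are illustrative placeholders only.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [10, 15],   # desipramine: relapse yes, relapse no (placeholder)
    [18,  8],   # lithium (placeholder)
    [20,  4],   # placebo (placeholder)
])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)    # each cell = row total x column total / table total
print(chi2, dof)   # chi-square statistic; df = (rows - 1) x (columns - 1)
print(p_value)     # right-tail area, like Excel's CHISQ.DIST.RT / CHISQ.TEST
```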

*excel pregnancy example (you always use the right tail for chi-square!)

1. make columns for: observed, expected, difference, difference squared, and diff squared/expected; the sum of the diff squared/expected column = the chi-square statistic
2. degrees of freedom = k - 1, where n = number of observations and k = number of levels (7 bc 7 days of the week), so df = k - 1 = 6
3. for a z score the P value is normally =norm.s.dist(z score, true); chi-square has the same format: =chisq.dist(chi square stat, df, true) gives the left-tail area (close to 1 here), so subtract from 1 (1 - left tail = 0.00397), or to get the right tail simply do =chisq.dist.rt(chi stat, df)
4. speculate: what days are different? weekdays are more, weekends are less.
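
The same goodness-of-fit calculation sketched in scipy; the seven observed counts are placeholders for the daily birth totals in the spreadsheet:

```python
# Chi-square goodness-of-fit test: are births equally likely on each weekday?
from scipy.stats import chisquare

observed = [33, 41, 63, 63, 47, 56, 47]   # Sun..Sat, placeholder counts
n = sum(observed)
expected = [n / 7] * 7                    # H0: each day has probability 1/7

stat, p_value = chisquare(f_obs=observed, f_exp=expected)   # df = k - 1 = 6
print(stat, p_value)                      # p_value is the right-tail area
```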

steps to obtain P value:

1. pooled proportion ^p
2. test statistic z; you can use excel norm.s.dist(z stat, true)

A test of hypotheses requires a few steps:

1. stating the null hypothesis Ho
2. deciding on a one-sided or two-sided alternative Ha
3. choosing a significance level α
4. calculating t and its degrees of freedom
5. finding the area under the curve with Table C or software
6. stating the P value and the conclusion

ANOVA assumptions

1. the k samples must be independent SRSs
2. each pop. represented by the k samples must be normally distributed
3. the ANOVA F-test requires that all k populations have the same standard deviation
   if the largest SD is more than 2 times the smallest SD: conclusion: the data do not conform, and you cannot use the ANOVA F test (he will ask this once)

expected counts

A categorical variable has k possible outcomes, with probabilities p1, p2, p3, ..., pk of occurring. That is, pi is the probability of the ith outcome. We have n independent observations from this categorical variable. To test the null hypothesis that the probabilities have specified values, H0: p1 = p10, p2 = p20, ..., pk = pk0, first compute an expected count for each of the k outcomes as follows: expected count of outcome i = n pi0

the one-sample t CI

C is the area between −t* and t*. We find t* in the line of Table C for n − 1 degrees of freedom. The margin of error m is m = t* × s/√n, and the interval is x bar ± m.

*frizzled bug example

CHECK =chisq.test(observed row, expected row)

approximate level C confidence interval

CI: p^ ± m, where m = z* × SE = z*√(p^(1−p^)/n)
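
A short scipy sketch of this large-sample interval, using the HPV counts from the Chapter 19 notes below (515 positives out of n = 1921) as the example data:

```python
# Large-sample 95% confidence interval for a population proportion.
from scipy.stats import norm

successes, n = 515, 1921
p_hat = successes / n
se = (p_hat * (1 - p_hat) / n) ** 0.5   # standard error based on p-hat
m = norm.ppf(0.975) * se                # z* for 95% confidence is about 1.96
print(p_hat - m, p_hat + m)
```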

HYPOTHESIS TEST FOR COMPARING TWO PROPORTIONS

Draw an SRS of size n1 from a large population having proportion p1 of successes, and draw an independent SRS of size n2 from another large population having proportion p2 of successes. To test the hypothesis H0: p1 = p2, obtain the z statistic z = (ˆp1 − ˆp2) / √(ˆp(1−ˆp)(1/n1 + 1/n2)), where ˆp is the pooled proportion of successes in both samples combined. In terms of a variable Z having the standard Normal distribution, the P-value for a test of H0 against Ha: p1 > p2 is P(Z ≥ z), against Ha: p1 < p2 is P(Z ≤ z), and against Ha: p1 ≠ p2 is 2P(Z ≥ |z|).
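
A sketch of the pooled two-proportion z test in scipy; the success counts and sample sizes are placeholders:

```python
# Two-proportion z test using the pooled proportion (placeholder counts).
from scipy.stats import norm

x1, n1 = 45, 100   # successes, sample size in group 1 (placeholder)
x2, n2 = 30, 100   # successes, sample size in group 2 (placeholder)

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                       # pooled proportion
se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
z = (p1_hat - p2_hat) / se
print(z, 2 * norm.sf(abs(z)))                        # two-sided P-value for Ha: p1 != p2
```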

LARGE-SAMPLE CONFIDENCE INTERVAL FOR COMPARING TWO PROPORTIONS

Draw an SRS of size n1 from a large population having proportion p1 of successes, and draw an independent SRS of size n2 from another large population having proportion p2 of successes. When n1 and n2 are large, an approximate level C confidence interval for p1−p2 is (ˆp1 − ˆp2) ± z*√(ˆp1(1−ˆp1)/n1 + ˆp2(1−ˆp2)/n2).

PLUS FOUR CONFIDENCE INTERVAL FOR COMPARING TWO PROPORTIONS *don't need to do this calculation

Draw independent SRSs from two populations with population proportions of successes p1 and p2. To get the plus four confidence interval for the difference p1−p2, add four imaginary observations: one success and one failure in each of the two samples. Then use the large-sample confidence interval with the new sample sizes (actual sample sizes + 2) and counts of successes (actual counts + 1). Use this interval when the sample size is at least 5 in each group, with any counts of successes and failures.

* excel example 17.3: pulse wave velocity (PWV) is measured in children with progeria. It is thought that PWV is higher in these children than the reference mean of 6.6.

H0: no difference from the mean population value 6.6 (mu = 6.6); Ha: population mean is greater than 6.6 (mu > 6.6)
1. descriptive statistics of the data set: summary statistics, confidence level at 95; this gives you the confidence level (margin of error) for 2 tails
1a. calculate the CI: sample mean; standard error = std deviation/√count; t crit = t.inv(0.95, n-1) (this is for one tail); lower bound and upper bound = mean +/- (t crit * SE) --> for 2 tails you can just use the value for the confidence level as your m
1b. ask if the number you are wondering about (6.6) falls within the interval --> no: reject the null; this group is statistically significantly different from the expected pop. mean
2. calculation of the P value
2a. calculate the t stat: (mean - 6.6)/SE = 6.82 --> 6.82 SE's away from the hypothesized mean!
2b. P value = t.dist.rt(6.82, df) [if left tail, just use t.dist; if two tail use t.dist.2t] = basically 0; the P value is less than 5%, so reject the null
shortcut: assume everything is 6.6 for each observation (mean = 6.6, sd = 0); data analysis --> t test: paired two sample for means; range 1 = actual observations, range 2 = the 6.6 column
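
A scipy version of the same one-sided, one-sample t test; the pwv list is a placeholder for the measured values in the spreadsheet:

```python
# One-sample t test of H0: mu = 6.6 vs Ha: mu > 6.6 (placeholder data).
from scipy.stats import ttest_1samp

pwv = [12.6, 13.5, 10.9, 14.1, 12.2, 13.0, 11.8]   # placeholder PWV readings
t_stat, p_value = ttest_1samp(pwv, popmean=6.6, alternative='greater')
print(t_stat, p_value)   # a small P-value means reject H0
```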

hypotheses for two way tables

Ho states that there is no association between the row and column variables in the table. Ha: there is a relationship. We compare the actual counts from the sample data with the expected counts given the null hypothesis of no relationship.

tb bionic pancreas example with shortcut (matched pairs)

Ho: mu diff = 0; Ha: mu diff < 0
data analysis tool pack --> t test: paired two sample for means; range 1 = control, range 2 = variable (bionic)
bionic mean - control mean is a negative number, which is less than 0. P value is less than 0.05. we reject the null
CI: add a difference column and subtract bionic - control; data analysis tool pack --> descriptive statistics of the difference column
lower bound = mean - (tcrit*SE); upper bound = mean + (tcrit*SE)
zero does not fall in the CI, so reject
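
A matched-pairs sketch in scipy; bionic and control are placeholder arrays of paired measurements:

```python
# Paired t test of H0: mu_diff = 0 vs Ha: mu_diff < 0 (bionic minus control).
from scipy.stats import ttest_rel

bionic  = [128, 141, 133, 119, 150, 137]   # placeholder paired readings
control = [139, 158, 140, 131, 162, 145]

t_stat, p_value = ttest_rel(bionic, control, alternative='less')
print(t_stat, p_value)   # P < 0.05 would mean reject H0
```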

Hypothesis tests help us decide if the effect we see in the samples is really there in the populations. The null hypothesis says that there is no difference between the two population proportions of the outcome of interest (so that p1−p2=0):

Ho: p1 = p2. The alternative hypothesis says what kind of difference we expect.

hypothesis testing

Ho: there is no association between the row and column variables Ha: Ho is not true

hypotheses for gecko example seeing whether male toepads are larger than female toepads

Ho: μmale - μfemale = 0; Ha: μmale - μfemale > 0

Expected counts are hypothetical counts and do not have to be round numbers.

However, because the population proportions pi under H0 must sum to 1, the expected counts must sum to the total number of observations n in the random sample. That is, the expected counts are simply a statement of how these n observations would be expected to fall into the k possible outcomes (levels) of the categorical variable, according to H0.

The chi-square test for goodness of fit

Hypothesis tests require a set of null and alternative hypotheses, a test statistic, and a P-value that gives the probability, if the null hypothesis were true, of obtaining a test statistic at least as extreme as the one computed.

P-VALUE OF THE CHI-SQUARE GOODNESS-OF-FIT TEST

In a chi-square goodness-of-fit test involving k outcomes, when the null hypothesis is true, the test statistic X2 has the chi-square distribution with k−1 degrees of freedom. The P-value of the chi-square test is the area to the right of the calculated X2 statistic under this chi-square distribution. software finds P values for us.

P value of the chi-square test

The P-value of the chi-square test for a two-way table of counts with r rows and c columns is the area to the right of the test statistic X2, under the chi-square distribution with (r−1)(c−1) degrees of freedom. Use this P-value to decide whether to reject or fail to reject H0.

A hypothesis test for H0: μ1 = μ2 is based on the test statistic t = (xbar1 − xbar2)/SE, where SE = √(s1^2/n1 + s2^2/n2).

The area under the t distribution with the computed degrees of freedom corresponding to values at least as extreme (in the direction of the alternative) as the computed test statistic provides a very accurate P-value.

THE CHI-SQUARE DISTRIBUTIONS

The chi-square distributions are a family of distributions that take only positive values and are skewed to the right. A specific chi-square distribution is specified by giving its degrees of freedom.

To compare observed and expected counts in a goodness-of-fit test and ask whether they differ significantly, we need a test statistic. That test statistic is the chi-square statistic, X^2.

The chi-square statistic is a measure of how far the observed counts in a random sample are from the expected counts defined by the null hypothesis. The formula for the statistic is X^2 = sum of (observed count − expected count)^2 / expected count, where the sum runs over the k different outcomes that the categorical variable can take. Each of the k terms in the sum is called a chi-square component.

Chapter 22: The Chi-Square Test for Two-Way Tables

The chi-square test for a two-way table tests the null hypothesis H0 that there is no relationship between the row variable and the column variable. The alternative hypothesis Ha says that there is some relationship but does not say what kind.

Expected counts and chi-square statistic

The expected count, in n independent trials, for a particular outcome with probability p of occurring is simply np

CI for plus four method

The formula for the confidence interval is exactly as before, with the new sample size and the new number of successes. You do not need software that offers the plus four interval—just enter the new sample size (actual size + 4) and the new number of successes (actual number + 2) into the large-sample procedure. Use this interval when the confidence level is at least 90% and the sample size n is at least 10, with any counts of success and failure.

sample size for desired margin of error -I guess this is like working backwards -he won't ask us to do this

The level C confidence interval for a population proportion p will have a margin of error approximately equal to a specified value m when the sample size is n = (z*/m)^2 x p*(1−p*) where p* is a guessed value for the sample proportion. The margin of error will be less than or equal to m if you take the guess p∗ to be 0.5.
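
A one-line check of this formula; z* = 1.96 (95% confidence), m = 0.03, and the conservative guess p* = 0.5 are just example inputs:

```python
# Sample size needed so the margin of error is at most about m = 0.03.
import math

z_star, m, p_star = 1.96, 0.03, 0.5
n = (z_star / m) ** 2 * p_star * (1 - p_star)
print(math.ceil(n))   # round up to the next whole observation
```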

Ch. 19: inference about a population proportion

The sample proportion ˆp: We are interested in the unknown proportion p of a population that has some outcome; this outcome = a "success." Ex: the population is women aged 14 to 59, and the parameter p is the proportion who currently test positive for HPV. To estimate p, NHANES used a nationally representative sample of 1921 women. In this sample, 515 women tested positive for HPV. The statistic that estimates the parameter p is the sample proportion ˆp = 515/1921 = 0.268.

chi-square test

The test compares the observed counts of observations in the cells of the table with the counts that would be expected if H0 was true. The expected count in any cell is expected count = row total × column total / table total

The chi-square statistic is

X^2 = sum of (observed count - expected count)^2 / expected count

the X^2 statistic is summed over all r x c cells in the table:

X^2 = sum of (observed count - expected count)^2 / expected count df = (rows - 1) x (columns -1)

EXPECTED COUNTS REQUIREMENT

You can safely use the chi-square test for goodness of fit for the distribution of one categorical variable with k possible outcomes (levels) when most of the k expected counts—at least 80%, or 4 out of 5—have a value of 5.0 or greater and all k expected counts are 1.0 or greater.

the goodness of fit test assesses...

a variety of null hypotheses that reflect how we expect one categorical variable to be distributed in the target population. Because the variable is categorical, the parameters in H0 are population proportions for the k possible outcomes (levels) making up the variable. The only constraint on the null hypothesis for a goodness-of-fit test is that these k proportions sum to 1, so that H0 accounts for all possible outcomes

example: Antibiotic resistance. A sample of 1714 cultures from individuals in Florida diagnosed with a strep infection was tested for resistance to the antibiotic penicillin; 973 cultures showed partial or complete resistance to the antibiotic.4 Describe the population and explain in words what the parameter p is. Give the numerical value of the statistic ˆp that estimates p.

a) The population is cultures from individuals in Florida diagnosed with a strep infection; p = the proportion resistant to penicillin in that population. b) p^ = 973/1714 = 0.568

chi square distribution

an approximation of the distribution of X^2. You can safely use this approximation when all expected cell counts are at least equal to 1.0 and no more than 20% of them are less than 5.0.

very large X^2 values

are significant!

two sample t confidence interval

because we have 2 independent samples, we use the difference between the sample averages (xbar1 - xbar2) to estimate (μ1 - μ2). C is the area between -t* and t*; find t* in the line of Table C for the computed degrees of freedom. margin of error = t* × SE

Chapter 21: Chi-square test for goodness of fit --> no confidence intervals for this chapter

chi-square test for goodness of fit. The underlying idea is that the test assesses whether the observed counts "fit" the distribution outlined by H0. The alternative hypothesis is that Ho is not true.

*excel lab rat problem

data --> data analysis tool pack --> anova: single factor; input range: all the data, highlight everything
MSG (between groups) = SS/df = 63399/2 = 31699
MSE (within groups) = SS/df = 139173.6/47 = 2961
F = first number / second number = 10.7 (also given)
P value is less than 0.05; reject the null --> one of the means is statistically significantly different from the others
check the SD condition: take the square root of each variance to get the SDs, find the largest and smallest, and divide the two --> that ratio needs to be less than 2
ratio = 1.277, so yes
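
A scipy sketch of the same one-way ANOVA; the three groups are placeholder data standing in for the lab-rat measurements:

```python
# One-way ANOVA F test across k = 3 placeholder groups.
from scipy.stats import f_oneway

group1 = [310, 290, 360, 405, 330]   # placeholder measurements
group2 = [440, 475, 410, 500, 465]
group3 = [420, 385, 450, 395, 430]

f_stat, p_value = f_oneway(group1, group2, group3)
print(f_stat, p_value)   # P < 0.05 -> at least one mean differs
```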

Black blue dress vs. white and gold dress

data analysis toolpack --> t test: two sample assuming unequal variances; input the two ranges. we get a t stat almost 6 SE's away from the mean.
--> Confidence interval:
center = sample 1 mean - sample 2 mean
calculate SE from the variances (sd^2):
1. variance of BB / n(BB)
2. variance of WG / n(WG)
3. add these together (this is the estimated variance of the difference in means)
4. SE = √(sum from 3)
5. lower bound = center - tcrit*SE
6. upper bound = center + tcrit*SE
--> zero is outside of the CI; reject the null
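
A sketch of the unequal-variances (Welch) two-sample test plus a confidence interval built the same way as the spreadsheet steps; bb and wg are placeholder response vectors:

```python
# Welch two-sample t test and a 95% CI for the difference in means.
import numpy as np
from scipy import stats

bb = [7.2, 6.8, 7.9, 8.1, 6.5, 7.4, 7.7]   # placeholder data, "black/blue" group
wg = [5.1, 5.9, 4.8, 6.2, 5.5, 5.0, 5.7]   # placeholder data, "white/gold" group

t_stat, p_value = stats.ttest_ind(bb, wg, equal_var=False)   # Welch's test
print(t_stat, p_value)

center = np.mean(bb) - np.mean(wg)
se = (np.var(bb, ddof=1) / len(bb) + np.var(wg, ddof=1) / len(wg)) ** 0.5
df = min(len(bb), len(wg)) - 1             # conservative df; Excel reports Welch's df
t_crit = stats.t.ppf(0.975, df)
print(center - t_crit * se, center + t_crit * se)
```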

lack of significance

does not imply that H0 is true. a non-significant P-value is not conclusive: Ho could be true or not. it only shows that the data are not inconsistent with what we would expect under Ho

plus four method

improves the accuracy of the confidence interval add four imaginary observations, two successes and two failures. With the added observations, the plus four estimate of p is ˜p = (number of successes in the sample +2)/ (n+4)
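
A quick sketch of the plus-four estimate and interval; the counts are hypothetical:

```python
# Plus-four interval: add two successes and two failures, then proceed as usual.
from scipy.stats import norm

successes, n = 10, 25                     # hypothetical sample
p_tilde = (successes + 2) / (n + 4)       # plus-four estimate of p
se = (p_tilde * (1 - p_tilde) / (n + 4)) ** 0.5
m = norm.ppf(0.975) * se                  # 95% margin of error
print(p_tilde - m, p_tilde + m)
```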

marginal distributions

in the "margins" of the table, summarize each factor separately. ex: marginal distribution of parental smoking: P(both parents) = 1780/5375 = 33.1%; P(neither parent) = 1356/5375 = 25.2%

mean of sampling distribution ˆp

is the true value of the population proportion p. ˆp is an unbiased estimator of p. The standard deviation of ˆp is √(p(1−p)/n)

margin of error

m = z*SE of the difference

chapters 17 and 18:

one sample and two samples

*excel example

sample proportion = number of successes / total observations
z stat: numerator = sample proportion (19/20 = 0.95) - population proportion (0.5) = 0.45
denominator = square root of (p0 (0.5) x (1 - p0) (0.5) / 20)
num/denom = 4.02
one sample with a z stat, so use norm.s.dist(z stat, true); P value = 1 - 0.999972 = 0.000028
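
The same arithmetic checked in Python, using the numbers from the example above:

```python
# One-proportion z test: p-hat = 19/20, null value p0 = 0.5, n = 20.
from scipy.stats import norm

p_hat, p0, n = 19 / 20, 0.5, 20
se = (p0 * (1 - p0) / n) ** 0.5   # SE uses the null proportion p0
z = (p_hat - p0) / se             # about 4.02
print(z, norm.sf(z))              # right-tail P-value, about 0.000028
```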

when n is large,

s is a good estimate of σ and the t distribution is close to the standard normal distribution

*chapter 19 example

sample size 100; marketable strawberries = 77; we expect this ratio to be 2/3
we are assessing the population proportion (p), not a population mean (mu)
null hypo: Ho: p = 2/3; alt hypo: Ha: p > 2/3
check conditions: SRS, population is 20x larger than sample, sample size is large enough that the sampling distribution is approximately normal
1. calculate the z statistic. this gives us the P value which tells us to reject or not reject the null hypo. look at the formula for z and fill in the blanks, ALL USING THE POPULATION PROPORTION:
   p is the population proportion = 2/3 = 0.66667; p hat is the sample proportion = 77/100 = 0.77
   numerator = phat - p = 0.10333
   denominator = SE = square root of [(prob of success)(prob of failure)/sample size] = √(p(1−p)/n) = √(0.66667 x (1 - 0.66667)/100) = √0.0022222 = 0.047
   z stat = 0.10333/0.04714 = 2.192 --> more than 2 SE's away from the center, so P will be relatively small
2. P value: norm.s.dist(z stat, true) = 0.985; don't forget to subtract from 1! 1 - 0.985 = 0.0142 = 1.42%, less than 5%, so we reject the null hypothesis in favor of the alt. hypothesis
3. test the confidence interval, USING THE SAMPLE INFO: center +/- margin of error
   center = phat (sample proportion) = 0.77
   margin of error = zcrit * SE; zcrit = norm.s.inv(0.975) = 1.96 (this value is also in Table C for 95% confidence)
   SE (based on phat): the previous SE was calculated from the population proportion; this SE uses the sample proportion, so prob of success = 0.77, prob of failure = 0.23, SE = √(p^(1−p^)/n) = √(0.77 x (1 - 0.77)/100) = √0.00177 = 0.042
   lower bound: 0.77 - (1.96 x 0.042) = 0.68752; upper bound: 0.77 + (1.96 x 0.042) = 0.852
   --> the interval does not include 0.667; reject the null hypothesis.
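
The strawberry example as a Python sketch, using the numbers given above (p0 = 2/3, phat = 77/100):

```python
# Hypothesis test and 95% CI for the marketable-strawberry proportion.
from scipy.stats import norm

p0, x, n = 2 / 3, 77, 100
p_hat = x / n

# Test H0: p = 2/3 vs Ha: p > 2/3; the SE uses the null proportion p0.
se0 = (p0 * (1 - p0) / n) ** 0.5
z = (p_hat - p0) / se0            # about 2.19
print(z, norm.sf(z))              # right-tail P-value, about 0.014

# 95% confidence interval; this SE uses the sample proportion p_hat.
se_hat = (p_hat * (1 - p_hat) / n) ** 0.5
m = norm.ppf(0.975) * se_hat
print(p_hat - m, p_hat + m)       # about (0.688, 0.852); 2/3 is outside
```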

X^2 is a measure of the relative distance of the observed counts from the expected counts.

small values of X^2 do not provide sufficient evidence to reject H0. Large X^2 are evidence against Ho because they indicate that the observed counts are far from what we would expect if Ho was true.

when n is small,

the t distribution with df = n - 1 has more area in the tails than does the standard normal distribution: it is shorter and fatter, so the critical t value is larger (further from the mean) than the corresponding z value

calculate two-sample t test

t = (xbar1 - xbar2)/SE; the t stat is also just given in excel

tb example 17.2 testosterone of adolescent obese males one sample t test

t = about 2 SE's away from the mean; a t statistic far from the mean gives a low P value.

*excel t test

t.inv(0.95, df) = one-tail critical value (95% of the area to its left); t.inv.2t(0.05, df) = two-tail critical value for 95% confidence

the F statistic

the analysis of variance F test compares the variation due to specific sources (levels of the factor) with the variation among individuals who should be similar (individuals in the same sample). the larger the F, the smaller the P value

conditional distribution

the cells of the two-way table represent the intersection of a given level of one factor with a given level of the other factor. they can be used to compute the conditional distributions.
ex: conditional distribution of student smoking for different parental smoking statuses:
P(student smokes | both parents) = 400/1780 = 22.5%
P(student smokes | one parent) = 416/2239 = 18.6%
P(student smokes | neither parent) = 188/1356 = 13.9%

Chapter 20: you compare the populations by doing inference about... [we use ch 20 when comparing categorical data from two distinct populations. For the CI you use the sample proportion of one minus the sample proportion of the other; the SE of the difference uses the pooled SE in the denominator] *don't need to know the formula for this. we won't have to calculate pooled stuff

the difference p1-p2 between the population proportions

matched pair t procedures

the individuals in one sample are related to those in the other sample
- pre-test and post-test studies look at data collected on the same sample elements before and after some experiment is performed
- twin studies often try to sort out the influence of genetic factors by comparing a variable between sets of twins
- conceptually this is a test for one population mean
- hypothesis testing is written differently: the variable studied becomes xbar diff, the average difference, and Ho: mu diff = 0; Ha: mu diff > 0 (or < 0, or not equal to 0)

when the X^2 test is statistically significant:

the largest components indicate which conditions are most different from Ho. You can also compare the observed and expected counts, or compare the computed proportions in a graph. ex: desipramine has the highest success rate

to obtain a confidence interval, we replace the population proportions p1 and p2 in the standard deviation of the sampling distribution with the two sample proportions.

the result is the standard error of the statistic ^p1 - ^p2: SE = √(ˆp1(1−ˆp1)/n1 + ˆp2(1−ˆp2)/n2)

example: genetics of seed color

the seed color of 429 second-generation (F2) amaranth plants was recorded after crossing black-seeded and pale-seeded populations. The null hypothesis, based on a dominant epistatic model of genetic inheritance, assumed Ho: pblack = 3/4, pbrown = 3/16, and ppale = 1/16.
--> The test statistic is X2 = 0.8026 and the test P-value is 0.669. This P-value is not statistically significant, so we cannot reject the null hypothesis. The P-value 0.669 is the area to the right of X2 = 0.8026 under the chi-square distribution with k − 1 = 2 degrees of freedom.
P = chisq.dist.rt(test stat X^2, df) (check)
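
A quick check of this P-value; the X^2 value and degrees of freedom come from the example above (the observed seed counts themselves are not listed here):

```python
# Right-tail P-value for X^2 = 0.8026 with k - 1 = 2 degrees of freedom,
# plus the expected counts implied by H0 for n = 429 plants.
from scipy.stats import chi2

print(chi2.sf(0.8026, df=2))        # about 0.669
for p in (3 / 4, 3 / 16, 1 / 16):   # H0 proportions; they sum to 1
    print(429 * p)                  # expected counts: 321.75, 80.4375, 26.8125
```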

robustness

the t procedures are robust to small deviations from normality, but:
- the sample must be a random sample from the pop.
- outliers and skewness strongly influence the mean and therefore the t procedures. Their impact diminishes as the sample size gets larger because of the central limit theorem

Because of the central limit theorem...

the two-sample t procedures are approximately correct for non-Normal population distributions when the sample sizes are large enough. Equal sample sizes are recommended for optimal robustness against the assumption of Normality.

*matched pairs excel

treated and untreated students; work with the difference
H0: mu diff = 0; Ha: mu diff not equal to 0 (you hope the difference is less than 0, but we do 2 tail just to see the difference overall)
descriptive statistics on the difference column: summary stats, confidence level 95
1. CI: sample mean; standard error; t crit = t.inv.2t(0.05, df); margin of error m = t crit * SE (this value should equal the "confidence level" value from descriptive statistics)
   --> 0 is not in the CI; reject the null; statistically significantly different; the difference is not equal to 0
2. calculate the P value: t stat = (mean - 0)/SE; then =t.dist.2t(absolute value of t stat, df)
shortcut: data analysis --> t test: paired two sample for means; first range = untreated, second range = treated; alpha = 0.05

if the chi square test finds a statistically significant relationship between the row and column variables in a two way table, ...

use descriptive data analysis to explain the nature of the relationship. You can do this by comparing well-chosen percents, comparing the observed counts with the expected counts, and looking for the largest components of the chi-square statistic.

how to estimate a population proportion p:

use the sample proportion p^ (large sample method). If the conditions for inference apply, the sampling distribution of ˆp is close to Normal, with mean p and standard deviation √(p(1−p)/n)

t distribution for 2 independent samples

we have two independent SRSs coming from 2 populations with μ1, σ1 unknown and μ2, σ2 unknown. we use xbar1, s1 and xbar2, s2 to estimate μ1, σ1 and μ2, σ2 respectively.
- in excel, you have to choose the two-sample test for UNEQUAL variances
- we also have to manually calculate the SE for this test

A hypothesis test for H0:p=p0 is based on the value of the z statistic

The statistic is z = (ˆp − p0)/√(p0(1−p0)/n), which has approximately the standard Normal distribution when H0 is true. The test P-value, computed as an area under the standard Normal distribution, tells us whether to reject or fail to reject H0. Use this test in practice when np0≥10 and n(1−p0)≥10.

Chapter 24: One-way ANOVA: comparing several means

when comparing more than 2 populations, the question is not only whether each population mean is different from the others, but also whether they are significantly different when taken as a group

hypothesis tests for p

when testing Ho: p = po, if Ho is true, the sampling distribution is known. the test statistic is the standardized value of p^

matched pairs

you find x bar diff; sdiff; and SEMdiff = sdiff/ √n

example of hypothesis test

you use the pooled proportion p^ to obtain the z

pooled sample proportion:

ˆp = number of successes in both samples combined / number of individuals in both samples combined

a level C confidence interval is:

ˆp ± z*√(ˆp(1−ˆp)/n). Since we don't know p, we use ˆp in the standard error of ˆp. This interval can only be used when the numbers of successes and failures in the sample are each at least 15.

