HS301 Fund. of Biostatistics Exam 2

2 step approach of ANOVA test

We take a two-step approach. First we run an overall test of the group means to see if any differences exist; this is called Analysis of Variance. The null hypothesis for ANOVA is that all the group means are equal. We consider variance (hence the name "analysis of variance") and calculate an F-test for this. If we reject our null hypothesis that all the means are equal, we then do a Post Hoc test of the pairwise comparisons to find the differences. Post Hoc means "after this": after we look for an overall difference in the means, we then look for individual mean differences. We adjust for multiple comparisons when we do this.

What is the coefficient of determination for a correlation coefficient of 0.65?

.42 (.65 squared)
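As a quick check, the arithmetic can be reproduced in two lines of Python:

```python
# The coefficient of determination r^2 is the square of the correlation coefficient r
r = 0.65
r_squared = r ** 2   # 0.4225, which rounds to .42
```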

An ANOVA uses pooled variance?

True

Degrees of freedom are calculated based on sample size?

True

It is necessary to run descriptive statistics prior to running an ANOVA test?

True

Outliers have a large impact on correlations especially when the sample size is small?

True

1 sample t-test

A. Hypotheses. H0: µ = µ0 vs. Ha: µ ≠ µ0 (two-sided) [Ha: µ < µ0 (left-sided) or Ha: µ > µ0 (right-sided)] B. Test statistic. t = (x̄ − µ0) / (s/√n), where s/√n is the SE, with df = n − 1 C. P-value. Convert the t-stat to a P-value [Table C or software]. Small P ⇒ strong evidence against H0 D. Significance level (optional). See Ch 9 for guidelines. This is just like a one-sample z-test except we have to calculate df, which equals n − 1. You find the t-stat and associated P-value, and if your P-value is small you reject the null. ***This has one group and no concurrent control group; comparisons are made to an external population
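The steps above can be sketched with the Python standard library; the sample values and µ0 below are hypothetical, chosen only to illustrate the arithmetic:

```python
import math
import statistics

# Hypothetical sample; test H0: mu = 100 (two-sided)
sample = [102, 98, 105, 110, 95, 101, 99, 107]
mu0 = 100

n = len(sample)
xbar = statistics.mean(sample)   # sample mean
s = statistics.stdev(sample)     # sample standard deviation
se = s / math.sqrt(n)            # standard error of the mean
t_stat = (xbar - mu0) / se       # t = (x-bar - mu0) / (s / sqrt(n))
df = n - 1                       # degrees of freedom
```

The t-stat would then be converted to a P-value with Table C or software.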

Conducting an ANOVA on two groups will provide the same results as conducting what kind of t-test?

An independent t-test w/ equal variances

Dummy Variable

not all the independent variables in a multiple regression have to be continuous variables - we can fit categorical variables too. However, to do this we have to convert them into 0/1 indicator variables. These are referred to as "dummy variables". For binary variables, code the variable 0 for "no" and 1 for "yes". - For categorical variables with k categories, use k-1 dummy variables
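A minimal sketch in Python, using a hypothetical smoking-status variable with k = 3 categories (so k − 1 = 2 dummies):

```python
# Hypothetical categorical variable with 3 categories; "never" is the reference
status = ["never", "former", "current", "never", "current"]

# k - 1 = 2 dummy variables, one per non-reference category
dummy_former  = [1 if s == "former" else 0 for s in status]
dummy_current = [1 if s == "current" else 0 for s in status]
```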

summary statistics

notation used in comparing means and standard deviations from independent groups

What ANOVA stands for

ANalysis Of VAriance (here, one-way analysis of variance)

paired samples t-test

Each point in one sample is matched to a unique point in the other sample. Pairs can be achieved via sequential samples within individuals (e.g., pre-test/post-test), cross-over trials, and matching procedures. Also called "matched-pairs" and "dependent samples". **Two samples with each data point in one sample uniquely matched to a point in the other; we analyze within-pair differences, or deltas (again see Lesson 13). A group of individuals is measured at the beginning of a period of time and again later, and the measurements are compared. Test statistic: t = d̄ / SE, where SE = sd/√n and df = n − 1 (n = number of pairs). When? Measure a clinical outcome for the same group of study subjects who received one treatment vs. another. Advantages: minimizes differences between subjects so that the differences between therapies can be evaluated; this minimizes background noise. Here's an example of a study that addresses whether oat bran reduces LDL cholesterol with a cross-over design. Subjects "cross over" from a cornflake diet to an oat bran diet. Half the subjects start on CORNFLK, half on OATBRAN. They spend two weeks on diet 1 with a measure of LDL cholesterol, then a washout period (where no treatment is given). Then they switch diets and spend two weeks on diet 2, again with an LDL cholesterol measure.
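The within-pair delta approach can be sketched with hypothetical pre/post data:

```python
import math
import statistics

# Hypothetical pre/post measurements on the same four subjects
pre  = [120, 115, 130, 125]
post = [112, 117, 121, 119]

deltas = [a - b for a, b in zip(pre, post)]   # within-pair differences
n = len(deltas)                               # number of pairs
dbar = statistics.mean(deltas)                # mean difference
sd = statistics.stdev(deltas)                 # SD of the differences
t_stat = dbar / (sd / math.sqrt(n))           # t = d-bar / (sd / sqrt(n))
df = n - 1                                    # df = number of pairs - 1
```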

A correlation coefficient of 0.79, 95% Confidence Interval -0.02 to 0.98, is statistically significant?

False

What statistics are used to compare 3 or more means

side-by-side stemplots, side-by-side boxplots, simple dot plots, group summary statistics (mean, standard deviation, and sample size)

Measure vitamin content in loaves of bread and see if the average meets national standards

single sample t test

F-test

named for R. A. Fisher, the brilliant mathematician and statistician who developed this approach.

The LSD is a more conservative post hoc test than the Bonferroni?

False

The tails of the t-distribution when df=5 are skinnier than the tails of a t-distribution when df=15?

False

You should do post hoc analysis following an ANOVA where you accept the null hypothesis?

False

With 5 different groups of patients with BMI values, how many pair-wise comparisons would you need to make?

10
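The count of pairwise comparisons is "5 choose 2", which Python can verify:

```python
from math import comb

# number of pairwise comparisons among 5 groups: C(5, 2)
pairs = comb(5, 2)
```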

What t-test would be used for: students' scores on the first quiz vs. the same students' scores on the second quiz?

A paired-samples t-test

What are Residuals?

- The distance from the data point to the line

Calculate the delta for the following data: Subject 1: Pre-test = 458, Post-test=542; Subject 2: Pre-test 543, Post-test=645; Subject 3: Pre-test=356, Post-test=589.

-84, -102, -233
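Computing the deltas as pre-test minus post-test reproduces the answer:

```python
# Subjects 1-3 from the question; delta = pre-test minus post-test
pre  = [458, 543, 356]
post = [542, 645, 589]
deltas = [p - q for p, q in zip(pre, post)]
```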

Exploratory and Descriptive methods

-Compare group shapes, locations and spreads Examples of applicable techniques Side-by-side stemplots (right) Side-by-side boxplots (next slide)

The problem of multiple comparisons

When conducting multiple analyses on the same DV, the chance of committing a Type I error increases

What is Simple Regression?

considers the relation between a single explanatory variable X and response variable Y. Another common terminology is to refer to X as the independent variable and y as the dependent variable

What is Multiple Regression?

considers the relation between multiple explanatory variables (X1, X2, ..., Xk) and response variable Y. The multiple regression model helps "adjust out" the effects of confounding factors (i.e., extraneous lurking variables). Simple linear concepts (lesson 16) must be mastered before tackling multiple regression.

Post Hoc Comparison

If you run the F test for ANOVA and you retain the null hypothesis, you end there. If you reject the null (accept the alternative hypothesis), then you need to run a _____ test to see which of the means differ. The ANOVA Ha says "at least one population mean differs" but does not delineate which differ. It's important to remember that you only run the _____ test if you reject the null.

when to use a paired t test

It is based on groups of individuals who experience both conditions of the variable of interest

Why do You Use Residuals?

- to assess departures from normality and equal variance

multiple regression analysis

A regression model is applied (regress FEV on SMOKE). The least squares regression line is ŷ = 2.566 + 0.711X The intercept (2.566) = the mean FEV of group 0 The slope = the mean difference = 3.277 − 2.566 = 0.711 tstat = 6.464 with 652 df, P ≈ 0.000 (same as equal variance t test) The 95% CI for slope is 0.495 to 0.927 (same as the 95% CI for μ1 − μ2)

You would like to determine if mean systolic blood pressure differs in 4 groups of women, each group (n=100) with a different level of physical activity (none, low, moderate, high). Which statistical test would you use?

ANOVA

A correlation of 0.25 cannot be statistically significant?

False

A slope of 1.45 95% CI= -0.005 to 5.24 is statistically significant?

False

If two factors are highly correlated, it's correct to always imply that one may cause the other (imply causation)?

False

When would you use a 1 sample t test

It is designed to test whether the mean of a distribution differs significantly from some preset value.

Independent two sample t test

In this t-test there are two separate groups, with no matching or pairing, and we are comparing these separate groups. For example, we compare the mean of one group with the mean of the second group.

LSD methods

there are many different post hoc tests and they differ in the way they adjust the p-value to account for multiple comparisons. We will briefly discuss two common post hoc tests: the Least Significant Difference (or LSD) and the Bonferroni. The LSD is the least conservative test and should only be used after a significant ANOVA test.

use of t-test

to estimate sample size and power. _____ is the probability of rejecting the null when you should reject the null, i.e., the ability to detect the difference

What t-test would be used for: Does a course offered to college seniors result in a GRE score greater or equal to 1200?

A one-sample t-test

pooled variance

A weighted average of the two estimates of variance, one from each sample, that is calculated when conducting an independent-samples t test. The average is computed so the larger sample carries more weight in determining the final value. *The pooled variance is used to calculate the standard error estimate, which is used in the confidence interval and the test statistic. By using the pooled variance estimate you can use a higher df value (38 in the example). Therefore, you are more likely to reject the null hypothesis.
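A sketch of the weighted-average formula; the sample sizes and standard deviations below are hypothetical (two groups of n = 20 give the 38 df mentioned above):

```python
# Hypothetical group sizes and standard deviations
n1, s1 = 20, 4.0
n2, s2 = 20, 6.0

# pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
df = n1 + n2 - 2   # pooled degrees of freedom
```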

A correlation coefficient of 1.0 is stronger than a correlation coefficient of -1.0?

False

In comparing two approaches to calculating degrees of freedom (df Welch method = 35.4 and df conservative method = 19), you have a greater chance of rejecting the null with the conservative method?

False

It is easier to reject the null hypothesis when the df for t is small?

False

You would like to compare heights in 1 group of children with type 1 diabetes (n =50) with a known national average. Which t-test would you use?

One sample t-test

You would like to determine if eating 50 grams of walnuts a day for 14 days significantly decreases weight in 120 women. Which of the following would you use?

Paired t-test

Compare vitamin content of bread loaves immediately after baking versus values in same loaves 3 days later

Paired sample t test

confidence interval

can be constructed to estimate how large the mean difference is. The confidence level refers to the success rate of the method. Generally the interval spans about 2 standard errors, corresponding to 95% confidence, but there are other levels of confidence. The (1−α)100% confidence interval for µ1 − µ2 is (x̄1 − x̄2) ± (t df, 1−α/2)(SE x̄1−x̄2)

variance within groups

quantifies the variability of data points in a group around its mean. Formula: MSW = SSW (sum of squares within) / degrees of freedom within (N − k), where SSW = (n1 − 1)(s1)² + (n2 − 1)(s2)² + ... e.g., (15 − 1)(standard deviation)² for each group of 15, summed.
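The MSW formula can be sketched with hypothetical group summaries (three groups of n = 15 each, so N − k = 42):

```python
# Hypothetical group sizes and standard deviations
ns  = [15, 15, 15]
sds = [3.0, 4.0, 5.0]

ssw = sum((n - 1) * s**2 for n, s in zip(ns, sds))   # sum of squares within
df_within = sum(ns) - len(ns)                        # N - k
msw = ssw / df_within                                # mean square within
```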

Conditions for inference

t procedures require these conditions or assumptions. First, that the data come from a simple random sample (SRS), either the individual observations or the DELTAs. Second, that the information is valid (no bias or errors). Finally, that the data come from a normal population or a large sample (central limit theorem).

t tests

All of them involve testing a continuous (or quantitative) outcome variable (like "age" in years or "cholesterol").

mean difference

Use a categorical IV to divide the DV into different groups, then compare the means of the DV groups. If there are differences, determine whether the differences are statistically significant. The paired t-test is similar to the one-sample t test. The null hypothesis is that the population mean difference is 0. (μ0 is usually set to 0, representing "no mean difference", i.e., H0: μ = 0.) This can be interpreted as: the first measures and the second measures are similar, hence the mean difference = 0. The test statistic is t = (mean difference − 0) / standard error of the mean difference, with df = n − 1.

What is Simple Linear Regression?

- It describes the relationship in the data with a line that predicts the average change in Y per unit X

What are the Conditions for Inference?

- Linearity: the relationship should be linear, not curvilinear - Independent observations: observations for x not related to observations for y (no bias) - Normality at each level of X: normally distributed, no outliers - Equal variance at each level of X: plot residuals to test our assumptions

Why would a paired-samples t-test be used for students' scores on the first quiz vs. the same students' scores on the second quiz?

A paired-samples t-test would be used for the same students' scores on the first quiz vs. the same students' scores on the second quiz because the same group experiences both levels of the variable.

population parameters and sample statistics

A parameter is a value that describes a characteristic of an entire population, such as the population mean. A statistic is a characteristic of a sample; if you collect a sample and calculate the mean and standard deviation, these are sample statistics. Parameters (population): Group 1: µ1, σ1, N1; Group 2: µ2, σ2, N2. Statistics (sample): Group 1: x̄1, s1, n1; Group 2: x̄2, s2, n2. The sample mean difference x̄1 − x̄2 is the point estimator for the population mean difference µ1 − µ2.

You are conducting a correlation analysis of body mass index and exercise time on a treadmill. You find the correlation is -0.57. Which of the following is the correct interpretation of this finding?

As body mass index increases treadmill time decreases.

Relationship between df and t distribution

As the df get bigger the t distribution approaches, or starts to resemble, the z distribution. t distributions are similar to z distributions but with broader tails. As df increases → t tails get skinnier → t become more like z. The df are determined based on the sample size (n). So as n increases the t distribution looks like the z distribution.

r

Correlation coefficient r quantifies linear relationship with a number between −1 and 1. When all points fall on a line with an upward slope, r = 1. When all data points fall on a line with a downward slope, r = −1 When data points trend upward, r is positive; when data points trend downward, r is negative. The closer r is to 1 or −1, the stronger the correlation.

What are the Components of the Regression Equation?

Equation: ŷ = a + bx, where ŷ ≡ the predicted value of Y, a = the intercept of the line, and b = the slope of the line
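The intercept and slope can be computed by least squares; the (x, y) data below are hypothetical:

```python
# Hypothetical (x, y) data; fit y-hat = a + b*x by least squares
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
# slope: sum of cross-deviations over sum of squared x-deviations
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar   # intercept
```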

In this example the _____ is 14.08 with 2 df between and 42 df within. It is simply the ratio of the MSB to the MSW: a signal-to-noise ratio evaluated on a family of F distributions. Convert the _____ to a P-value with a computer program or Table D. The P-value corresponds to the area in the right tail beyond the statistic.

F-stat

Before you determine the p-value of an independent t test you should first determine if the variances are equal. You have a greater chance of rejecting the null if the variances are treated as unequal?

False

Plot data and look for trends

Form: straight line = linear; curved; random. Direction: upward trend = positive association; downward trend = negative association; flat trend = no association. Strength: how closely the data points adhere to an imaginary trend line; refers to the degree to which points adhere to a trend line. The eye is not a good judge of strength. Look for outliers: outliers are really important in correlation, especially with small sample sizes, because a single outlier can affect the strength of the association. As always, it's important to determine if the outlier is a valid data point or a data error

variance between groups

It's calculated by looking at the variability between and within groups. The variability of the group means around the grand mean provides a "signal" of group difference. Based on a statistic called the Mean Square Between (MSB).

How are Regression Lines Fitted?

The least squares method is a procedure to determine the best-fit line for the data; the proof uses simple calculus and linear algebra. The basic problem is to find the best-fit straight line ŷ = a + bx by minimizing the sum of the squared vertical distances (residuals) from the data points to the line.

In an independent t-test, we calculate a mean difference between the groups (group 1 - group 2) and we use this to test against the null. If the mean difference and 95% confidence intervals are 6.1 (95% CI -2.5 to 12.0), would you reject the null hypothesis?

No

Analysis of Variance

One-way ANalysis Of VAriance (ANOVA) has a categorical explanatory variable (groups) and a quantitative response variable (continuous variable). ANOVA tests group means for a significant difference. The statistical hypotheses are H0: μ1 = μ2 = ... = μk; Ha: at least one of the μi differs. Method: compare variability between groups to variability within groups (F statistic). There are two kinds of df for the F statistic, one for between-group and one for within-group variation.
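A minimal sketch of the F statistic with hypothetical data for k = 3 groups:

```python
# Hypothetical data: k = 3 groups of n = 3
groups = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / N
means = [sum(g) / len(g) for g in groups]

# between-group and within-group sums of squares
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
ssw = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))

msb = ssb / (k - 1)   # mean square between (df between = k - 1)
msw = ssw / (N - k)   # mean square within (df within = N - k)
f_stat = msb / msw    # signal-to-noise ratio
```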

You would like to compare muscle strength before and after a new treatment in 80 children with type 1 diabetes. Which t-test would you use?

Paired t-test

notations

SSB ≡ sum of squares between; dfB ≡ degrees of freedom between; k ≡ number of groups; x̄ ≡ grand mean; x̄i ≡ mean of group i

Levene's Test of Variances

So before you run an ANOVA you have to make sure the variances in the groups you are comparing are equal. You should do this by graphical exploration, comparing variance spreads visually with side-by-side plots. You should run descriptive statistics, and if a group's standard deviation is more than twice that of another, be alerted to possible heteroscedasticity. And you should statistically compare the variances using a test for variances: test the hypothesis that the variance of group 1 = the variance of group 2, and so forth. The test statistic is a particular type of F-stat, which is converted to a P-value by the computational program. Interpretation of P is routine: a small P is evidence against H0, suggesting heteroscedasticity. So if you run the test and you do not reject the null, you can go ahead and conduct an ANOVA.

Normality condition for using a t-test

The Normality condition applies to the sampling distribution of the mean, not the population. Therefore, it is OK to use t procedures when: The population is Normal or the population is not Normal but is symmetrical and n is at least 5 to 10. The population is skewed and the n is at least 30 to 100 (depending on the extent of the skew).

Correlation Assumptions

The assumptions are as follows: level of measurement, related pairs, absence of outliers, normality of variables, linearity, and homoscedasticity. Level of measurement refers to each variable. For a Pearson correlation, each variable should be continuous. Related pairs refers to the pairs of variables. Each participant or observation should have a pair of values. So if the correlation was between weight and height, then each observation used should have both a weight and a height value. Absence of outliers refers to not having outliers in either variable. Having an outlier can skew the results of the correlation by pulling the line of best fit formed by the correlation too far in one direction or another. Typically, an outlier is defined as a value that is more than 3.29 standard deviations from the mean, i.e., a standardized value beyond ±3.29. Linearity and homoscedasticity refer to the shape of the values formed by the scatterplot. For linearity, a "straight line" relationship between the variables should be formed. If a line were to be drawn between all the dots going from left to right, the line should be straight and not curved. Homoscedasticity refers to the distance between the points and that straight line. The shape of the scatterplot should be tube-like. If the shape is cone-like, then homoscedasticity would not be met.

Correlation is used

The goal of correlation is to determine if there is a significant association between two variables. It is used in the statistical analysis of quantitative continuous data. For correlation and simple regression we have a quantitative response variable Y ("dependent variable") and a quantitative explanatory variable X ("independent variable").

standard error of the mean

The square root law says the SE is inversely proportional to the square root of the sample size: SE = σ/√n (where σ is the standard deviation). That means as the sample size increases, the SE gets smaller (more precise). Putting it all together: the sampling distribution of x̄ is Normal with mean µ and standard deviation (SE) = σ/√n (when the population is Normal or n is large). These facts make inferences about µ possible. Example: let X represent Wechsler adult intelligence scores: X ~ N(100, 15). Take an SRS of n = 10. SE = σ/√n = 15/√10 = 4.7, so x̄ ~ N(100, 4.7). Using the 68 part of the 68-95-99.7 rule, 68% of sample means under these sampling conditions will fall in the range µ ± SE = 100 ± 4.7 = 95.3 to 104.7.
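The WAIS example works out in a few lines:

```python
import math

# WAIS example: X ~ N(100, 15), SRS of n = 10
sigma, n, mu = 15, 10, 100
se = sigma / math.sqrt(n)      # standard error of the mean
low, high = mu - se, mu + se   # range holding ~68% of sample means
```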

conditions for inference

These are the conditions (or assumptions) for using t-procedures. "Validity conditions" include good information (no information bias or data errors), a good sample (no selection bias), and no confounding. "Sampling conditions" include independence (no biased sample) and a normal sampling distribution.

The F test in ANOVA is the mean squares between / mean square within?

True

The Mean Squares Between is the variation of the group means around the grand mean?

True

You would like to compare the body mass index of 2 groups, one containing data from children with a diagnosis of type 1 diabetes, one containing data from children with no diagnosis of type 1 diabetes. All the children are boys (and total n=100). Which t-test would you use?

Two sample (independent) t-test

1 sided vs 2 sided test

Use a 1-sided test when you know the direction of the alternative hypothesis; it is appropriate if you only want to determine whether there is a difference between groups in a specific direction. However, 1-sided testing is always used for chi-squared tests and F tests. 2-sided testing is the default; the vast majority of hypothesis tests that analysts perform are two-tailed because they can detect effects in both directions, i.e., differences in either group's favor.

use the F test to determine whether you can pool the variance estimates

We should check for homoscedasticity (homogeneity of variances) first with Levene's test / the F test. If the variances are homogeneous, we should use the t test with pooled variances. If there is statistically significant heterogeneity, we should use another version of the t-test (suitable for cases with unequal variance) with Welch-Satterthwaite degrees of freedom.

What can affect your interpretation of a correlation

- outliers - non-linear relationships - confounding variables (lurking variables that can make your data look like there's a correlation) Beware of lurking variables, also known as confounding. Here's a great example: a near perfect negative correlation (r = −.987) was seen between cholera mortality and elevation above sea level during a 19th-century epidemic. We now know that cholera is transmitted by water. The observed relationship between cholera and elevation was confounded by the lurking variable of proximity to polluted water.

family-wise error rate

When comparing means of multiple groups pair by pair using a t-test, this error rate is the probability of falsely rejecting the null hypothesis in at least one of those tests. Ex.: assume three null hypotheses are true. At α = 0.05, Pr(retain all three H0s) = (1−0.05)³ = 0.857. Therefore, Pr(reject at least one) = 1 − 0.857 = 0.143; this is the family-wise error rate. In other words it's not a 5% chance of falsely rejecting a null (Type I error), it's about 14%. The family-wise error rate is much greater than intended. This is "The Problem of Multiple Comparisons".
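The family-wise error rate calculation can be checked directly:

```python
alpha, m = 0.05, 3               # three true null hypotheses, each tested at alpha
p_retain_all = (1 - alpha) ** m  # probability all three H0s are retained
fwer = 1 - p_retain_all          # probability of at least one false rejection
```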

Why would a one-sample t-test be used to test if a course offered to college seniors results in a GRE score greater or equal to 1200?

because the sample mean is compared to a single fixed value.

r2

coefficient of determination. This statistic quantifies the proportion of the variance in Y [mathematically] "explained" by X. For the illustrative data, r = 0.737 and r2 = 0.54. Therefore, 54% of the variance in Y is explained by X.

Compare vitamin content of bread loaves immediately after baking versus loaves that have been on the shelf for 3 days

independent t test

Exploratory data analysis

is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Bonferroni's method

is much more conservative. It ensures that the family-wise error rate is less than or equal to alpha after all possible pairwise comparisons have been made. It adjusts the p-value more to account for multiple comparisons; therefore it's more difficult to detect statistical differences among the means compared with the LSD. P-values from Bonferroni are higher and confidence intervals are broader than with the LSD method, reflecting its conservative approach.
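One common form of the Bonferroni adjustment divides α by the number of comparisons; the m = 3 below is hypothetical:

```python
alpha = 0.05
m = 3                        # hypothetical number of pairwise comparisons
alpha_per_test = alpha / m   # each test is judged against this stricter level
```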

Equal Variance Assumption

is really important in ANOVA. Equal variance is called homoscedasticity (unequal variance = heteroscedasticity). Homoscedasticity allows us to pool group variances to form the MSW. 3 methods of assessing group variance are recommended: Graphical exploration with side-by-side stemplots or boxplots; widely discrepant hinge spreads (IQRs) warn the user of heteroscedasticity. Summary statistics; standard deviations can be used to assess group variances, and one rule of thumb suggests that if one group's standard deviation is more than double another's, heteroscedasticity should be suspected. Hypothesis test of variance; many such tests exist, but only Levene's test is widely applicable in non-normal populations.

power

is the probability of rejecting the null hypothesis when in fact it is false. Power is the probability of making a correct decision (to reject the null hypothesis) when the null hypothesis is false. Power is the probability that a test of significance will pick up on an effect that is present.

"family" of t distribution

t distributions are a family of distributions identified by "Student" (William Sealy Gosset) in 1908. t distributions are a family of distributions whose members share common characteristics.

degrees of freedom (df)

t family members are identified by their degrees of freedom, df = n − 1. With independent t-tests there are a few different ways to calculate the degrees of freedom. Two of the most common are the Welch method (don't worry, the computer will calculate this) and a conservative method. We will also learn about equal-variance tests that allow you to use a pooled variance estimate to increase df. The conservative method ends up with lower degrees of freedom. Remember the families of the t-distribution? The lower the df, the more flattened out the curve; that means it's more difficult to reject the null hypothesis. In general, you want to use df that are higher. Here, the Welch method gives 35.4 df and the conservative method gives 19 df.
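The conservative method uses the smaller group's single-sample df; the group sizes below are hypothetical, chosen so the conservative df matches the 19 in the example:

```python
# Hypothetical group sizes for an independent t-test
n1, n2 = 20, 36

# conservative df: the smaller of the two single-group dfs, min(n1-1, n2-1)
df_conservative = min(n1 - 1, n2 - 1)
```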

Use of a Data Table

•One column for the response variable (chol) •One column for the explanatory variable (group)

The t-distribution starts to look like a z-distribution when n is large?

True

The Mean Squares Within is the average amount of variation within the groups?

True

The degrees of freedom (df) are used to determine which t-distribution to use?

True

Correlation

The strength of the linear relationship between two quantitative variables

standard error of the mean difference

the standard deviation of the distribution of differences between the means of group 1 and group 2

