Regressions Exam 1

1) There was a significant main effect for sex, F(1, 650) = 27.12, p < .0001, and for smoking, F(1, 650) = 53.84, p < .0001. There was a marginally significant interaction, F(1, 650) = 3.77, p = .053. 2) The cell sizes, means, and standard deviations for the 2x2 factorial design are presented in Table 1. The main effect of sex was significant, F(1, 650) = 27.12, p < .0001, as was the main effect of smoking status, F(1, 650) = 53.84, p < .0001. The interaction of sex and smoking, however, was marginally significant, F(1, 650) = 3.77, p = .053. 3) The FEV volume change scores were subjected to a two-way analysis of variance having two levels of sex (female, male) and two levels of smoking status (smoking, non-smoking). Both main effects were statistically significant at the .05 significance level. The main effect of sex yielded an F ratio of F(1, 650) = 27.12, p < .0001, indicating that the mean change score was significantly greater for males (M = 2.81, SD = 1.00) than for females (M = 2.45, SD = 0.65). The main effect of smoking status yielded an F ratio of F(1, 650) = 53.84, p < .0001, indicating that the mean change score was significantly higher in the smoking group (M = 3.28, SD = 0.75) than in the non-smoking group (M = 2.57, SD = 0.85). The interaction effect was non-significant or marginally significant, F(1, 650) = 3.77, p = .053.

Report the following information for a two-way ANOVA in APA style

1) Full model: SBP=105+22.32bpmeds1+1.67Female+0.83age32_70+0.04totchol1+0.06glucose1 2) Male: SBP=105+22.32bpmeds1+0.83age32_70+0.04totchol1+0.06glucose1 3) Female: SBP= (105+1.67)+22.32bpmeds1+0.83age32_70+0.04totchol1+0.06glucose1 4) No BP intervention: SBP=105+1.67Female+0.83age32_70+0.04totchol1+0.06glucose1 5) BP intervention: SBP=(105+22.32)+1.67Female+0.83age32_70+0.04totchol1+0.06glucose1

State the predictive models for the following main effect model.

I'll not simply delete those outliers from the sample. I'll use diagnostic statistics as a tool to identify those extreme cases and further examine their age, waist circumference and total cholesterol.

Summarize the participants with the 5 largest leverage scores, the 5 largest Cook's distance values, and the 5 largest dfbeta values for bmxwaist. How do you suggest handling those participants?

1) F, 2) T, 3) T, 4) F

True or False: 1. As the number of groups increases, the modified significance level for pairwise tests increases as well. 2. As the total sample size increases, the degrees of freedom for the residuals increase as well. 3. The constant variance condition can be somewhat relaxed when the sample sizes are relatively consistent across groups. 4. The independence assumption can be relaxed when the total sample size is large.

Yes, t tests are significant.

Using the quadratic term, are all independent variables statistically significant?

•Residual analysis •Normality •Linearity •Homoskedasticity

Which linear regression assumptions apply to the dependent variable?

•Influence statistics for continuous independent variables •Multicollinearity

Which linear regression assumptions apply to the independent variable?

No. Pearson's correlation is independent of unit or scale change.

After data manipulation (subtracting the mean from the original variable), does the r-value change?
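The invariance claimed in this card can be checked numerically. The following is a minimal sketch using only the Python standard library; the height and FEV values are hypothetical illustrative numbers, not data from the course, and `pearson_r` is a helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative values (assumption: not taken from the exam data).
height_in = [46.0, 50.5, 54.0, 57.5, 61.0, 64.5, 68.0]
fev = [1.1, 1.6, 1.9, 2.4, 2.7, 3.4, 3.9]

mean_height = sum(height_in) / len(height_in)
centered = [h - mean_height for h in height_in]   # subtract the mean
height_cm = [h * 2.54 for h in height_in]         # change the unit

r_raw = pearson_r(height_in, fev)
r_centered = pearson_r(centered, fev)
r_rescaled = pearson_r(height_cm, fev)

print(r_raw, r_centered, r_rescaled)
```

All three values agree up to floating-point error, because centering shifts the deviations by a constant that cancels and rescaling multiplies the numerator and denominator by the same factor.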

They are all statistically significant because Student's t tests are significant and p<0.001.

Are gender, age, and waist circumference significant? Why?

1) Yes, Dietary study group. H0: β(SV) = β(LV) = β(NOR) = β and β(m) ≠ β(f). •F(2, 745) = (51806.4/2)/195.9 = 132.24, p = 0.001. Reject H0. 2) Yes, Sex. H0: β(m) = β(f) = β and β(SV) ≠ β(LV) ≠ β(NOR). •F(1, 745) = 13056.3/195.9 = 66.65, p = 0.001. Reject H0.

Are the F-tests for Dietary and Sex significant?

The white group has the largest variance and the Latinx group has the smallest variance.

Are the variances of the total cholesterol among 4 groups similar?

1) We conclude that there is a very strong positive correlation between FEV and height, as r=0.8681 2) We conclude that r2=0.702 (or about 70%) of the variation in FEV can be explained by the linear relationship between FEV and height in children. This implies that about 30% of the variation in FEV cannot be explained by children's height.

Based on the following pairwise correlation, what can we determine?

1) β1 = r × SDY/SDX = √0.572 × 0.867/2.954 = 0.756 × 0.294 = 0.222 (unstandardized coefficient; Stata reports the standardized coefficient only with the ", beta" option) 2) β0 = Ȳ - β1 × X̄ = 2.637 - 0.222 × 9.931 = 0.432 •Predicted FEV (y-hat) = 0.432 + 0.222 × Age 3) t test for the slope: •H0: β1 (age) = 0, that is, there is no association between FEV and age. •t test: t = β1/SE(β1) = 0.222/0.0075 = 29.53, df = 652, p < 0.001. Reject H0. •SE(β1) = √(Res MS) / (SDx × √(n - 1)) = √0.322 / (2.954 × √(654 - 1)) = 0.0075 •Note: √F = √872.18 = 29.53 = t. Thus, F = t^2. The average FEV increases by 0.222 L/s as age increases by one year. 4) 95% CI [β1 ± z(α/2) × SE(β1)]: (0.222 - 1.96 × 0.0075, 0.222 + 1.96 × 0.0075) = (0.207, 0.237), which does not include zero. We are 95% confident that the true slope (the rate of increase in the population) falls between 0.207 and 0.237. 5) Conclusion: As children grow from 3 to 19 years, their lung function increases.

Calculate B1, B0, and do a t-test for the slope. What is the conclusion?
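The arithmetic in this card can be reproduced from the summary statistics alone. The sketch below uses only the numbers quoted in the answer (r² = 0.572, the two standard deviations, the two means, Res MS = 0.322, n = 654); the variable names are chosen for this example.

```python
import math

# Summary statistics quoted in the answer (FEV regressed on age).
r_squared = 0.572            # squared Pearson correlation
sd_y, sd_x = 0.867, 2.954    # SD of FEV and SD of age
mean_y, mean_x = 2.637, 9.931
res_ms, n = 0.322, 654       # residual mean square and sample size

r = math.sqrt(r_squared)
slope = r * sd_y / sd_x                               # beta1
intercept = mean_y - slope * mean_x                   # beta0
se_slope = math.sqrt(res_ms) / (sd_x * math.sqrt(n - 1))
t_stat = slope / se_slope                             # df = n - 2
ci_low = slope - 1.96 * se_slope
ci_high = slope + 1.96 * se_slope

print(f"slope={slope:.3f} intercept={intercept:.3f} "
      f"SE={se_slope:.4f} t={t_stat:.2f} CI=({ci_low:.3f}, {ci_high:.3f})")
```

Squaring `t_stat` gives roughly the F statistic reported in the card (872.18), which is the F = t² identity for a single predictor.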

Yes, the homoskedasticity assumption is met since the residuals do not follow a particular pattern.

Check if the homoskedasticity assumption for bpxsy2 by bmxwaist is met and show the residual's plot by the predictor. What is your conclusion?

The linearity assumption seems to be met since the residuals are scattered roughly evenly around a flat line at zero, with no curved pattern.

Check if the linearity assumption for the ordinary linear regression is met and show the plot of the residuals by predicted bpxsy2. What is your conclusion?

The normality assumption is met since residuals follow a normal distribution

Check if the normality assumption for the dependent variable is met and show the histogram of the residuals. What is your conclusion?

1) For both, the smaller the value, the better (opposite of R-squared) 2) BIC is better as it takes sample size into account more than AIC, which can make it more useful 3) No statistical test between AIC and BIC differences (smaller=better, but you cannot tell how small is significant)

Compare Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC).

The R-squared increases from 5% to 25.6%, suggesting the multiple linear regression model can explain more variance of systolic blood pressure.

Compared to the simple linear regression model (r^2 = 0.0490), how much does R-squared increase?

The reduction in Res SS = 1 - (1341757/1835571)=27%

Compared to the simple linear regression model, what is the proportion of reduction in the residual sum of squares? (original ResSS = 1835571)
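As a quick check of the proportion quoted in this card, the following sketch recomputes it from the two residual sums of squares given in the text. The variable names are chosen for this example.

```python
# Residual sums of squares quoted in the card.
res_ss_simple = 1835571      # simple model (waist circumference only)
res_ss_multiple = 1341757    # multiple linear regression model

# Proportional reduction in unexplained variation.
reduction = 1 - res_ss_multiple / res_ss_simple

print(f"{reduction:.1%}")
```

The result rounds to the 27% stated in the answer.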

There appears an interaction between age and waist circumference. The relationship between total cholesterol and waist circumference is modified by age. Specifically, in the age 18 group, as waist circumference increases, the total cholesterol also increases. In contrast, for people older than 28 years old, as waist circumference increases, the total cholesterol decreases. The older people are, the steeper the downward trend is.

Demonstrate how you interpret the way age modifies the effect of waist circumference on total cholesterol.

There are apparent nonlinear relationships between the total cholesterol and age as well as between the total cholesterol and waist circumference. The two curvilinear relationships are different. For people younger than 48 (=30+18) years old, as age increases, the total cholesterol increases. However, for people older than 48 years old, as age increases, the total cholesterol decreases at an accelerating rate. As waist circumference increases, the total cholesterol increases when waist circumference is smaller than 96.4 (=40+56.4) cm. However, the total cholesterol declines at an accelerating rate when waist circumference is greater than 96.4 cm.

Demonstrate how you interpret the association of total cholesterol with age and waist circumference.

Although male's SBP is higher than female's in both non-obese and obese groups, the difference between male and female's SBP in the non-obese group is much greater than that in the obese group.

Demonstrate how you interpret the gender by BMI interaction effect on SBP.

As waist circumference increases, the SBP in both genders increases. However, males' SBP rises more slowly than females' after adjusting for age. Among people whose waist circumference is smaller than 113.4 cm (=56.4+57), females have lower SBP than males after controlling for age. However, among people whose waist circumference is greater than 113.4 cm, females have higher SBP than males after controlling for age.

Demonstrate how you interpret the way gender modifies the effect of waist circumference on SBP.

The p-value for the Bartlett's test was 0.06. Failed to reject H0 of equal variances.

Does Bartlett's test result support equal variances or unequal variances?

The p-value for Bartlett's test is less than 0.001, rejecting the null hypothesis of equal variances among groups.

Does Bartlett's test result support equal variances or unequal variances?

Adjusted R-squared indicates that 25.56% of variance in systolic blood pressure can be explained by the model. The R-squared value is similar.

How accurate is the model?

R-squared indicates that 4.9% of variance in systolic blood pressure can be explained by the model that includes waist circumference.

How accurate is the model? Name the statistic.

R2=0.01. The model explains 1% of variance in total cholesterol. The effect size is pretty small.

How accurately does the model explain the variation in total cholesterol?

R2=SSB/SST=18187/11106231=0.002. About 0.2% of the variance in the total cholesterol can be explained by race.

How accurately does the model explain the variation in total cholesterol?

R2=SSB/SST=83028/8396274=0.01. About 1% of the variance in the total cholesterol can be explained by marital status.

How accurately does the model explain the variation in total cholesterol?

It can be used to solve the structural collinearity and make the intercept meaningful

How can the mean centering technique be used in multiple linear regression?
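To illustrate the structural collinearity point, the sketch below compares the correlation between a predictor and its square before and after mean centering. The ages 3 to 19 are a hypothetical, evenly spaced set chosen for illustration, and the VIF is computed with the two-predictor identity VIF = 1/(1 - r²).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ages (assumption: evenly spaced, not the course data).
ages = list(range(3, 20))
mean_age = sum(ages) / len(ages)
centered = [a - mean_age for a in ages]

# Correlation between the linear and quadratic terms.
r_raw = pearson_r(ages, [a ** 2 for a in ages])
r_centered = pearson_r(centered, [c ** 2 for c in centered])

# With exactly two predictors, VIF = 1 / (1 - r^2) for both of them.
vif_raw = 1 / (1 - r_raw ** 2)
vif_centered = 1 / (1 - r_centered ** 2)

print(f"raw: r={r_raw:.3f}, VIF={vif_raw:.1f}")
print(f"centered: r={r_centered:.3f}, VIF={vif_centered:.1f}")
```

For this symmetric set of ages the centered correlation is zero, so the VIF drops to 1; with real, skewed data the correlation would shrink but usually not vanish. Centering also makes the intercept the predicted value at the mean age rather than at age zero.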

1) A significant interaction means that the height effect on FEV depends on age or the age effect on FEV depends on height. As height increases, the negative age effect decreases, which is difficult to interpret. 2) As age increases, the height effect increases. Since age and height do not equal zero in this sample, we can center those two variables around their means (mean-centering approach).

How can this model be improved? What does it say?

•Distance from center of x distribution •Effect on predicted values •Effect on model coefficients

How do you assess the influence of continuous predictors?

Variance Inflation Factor (VIF)

How do you detect collinearity?

Analysis of variance showed a main effect of lead exposure (lead_grp) on the number of finger-wrist taps (maxfwt) per 10 seconds, a measure of neurological function, F(2, 96) = 5.28, p = .007, R2 = .099. Post-hoc analyses using Tukey's HSD indicated that the mean maximum finger-wrist tapping score was significantly lower for currently exposed children than for non-exposed children (p = .005), but did not differ significantly between currently exposed and previously exposed children (p = .177) or between previously exposed and non-exposed children (p = .671). We conclude that current exposure to lead is associated with poorer development of neurological function in children. Since the sample size for the previously lead-exposed group is relatively small, more research is needed.

How do you report ANOVA results with the following outputs?
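When only the F statistic and its degrees of freedom are at hand, the effect size for a one-way ANOVA can be recovered as a consistency check, since R² = SSB/SST and F = (SSB/df_between)/(SSW/df_within). The sketch below applies that identity to the F(2, 96) = 5.28 reported in this card; the function name is made up for this example.

```python
def r_squared_from_f(f_value, df_between, df_within):
    """R^2 = SSB / (SSB + SSW), rewritten in terms of F and its df."""
    explained = f_value * df_between      # proportional to SSB
    return explained / (explained + df_within)

r2 = r_squared_from_f(5.28, 2, 96)
print(f"R^2 = {r2:.3f}")
```

This gives an R² of about .10, so lead exposure group accounts for roughly a tenth of the variance in tapping scores.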

Bartlett's test (if data are normal) or Levene's test (if data are not normal and balanced). Can correct with a square root or logarithm transformation.

How do you test for the homogeneity of variances in one-way ANOVA?

Histogram by group and Shapiro-Wilk test (can correct with a square root or logarithm transformation). We do not want the Shapiro-Wilk test to be significant. Lack of normality has very little effect on the significance level of the F test.

How do you test if the dependent variable is approximately normally distributed for each group within one-way ANOVA?

1) R2 increases from 77.4% (main effect model) to 78.2%. 2) Age and age^2 are highly correlated. This structural correlation is not a problem for linear regression since two variables are not correlated by chance as a consequence of some unknown confounder. 3) Age has a non-linear relationship with FEV. 4) c.Age declares Age is a continuous variable. (the quadratic function)

How does adding the quadratic regression function affect the data?

1) Slopes are not changed; the relationships between FEV and predictors remain unchanged. 2) Intercept: now a female child whose height is 61.14 inches at 9.9 years has an FEV of 2.55 L/s. 3) For every year above 9.9 years old, a child's FEV is expected to be 0.06 L/s higher after controlling for sex and height. For every year below 9.9 years old, a child's FEV is expected to be 0.06 L/s lower after controlling for other variables.

How does the mean-centered multiple linear regression data change?

The Residual Plot - Points should form a cloud rather than a cone or hourglass shape The residuals have a constant standard deviation (or equal variance) across values of predictors.

How is homoscedasticity assessed?

A residual plot •Plot standardized predicted/fitted (y) values on the x-axis and standardized residuals on the y-axis. •Points should not show any sort of trend or pattern. Correction •Squared, square root, or log transformation of the dependent variable

How is linearity assessed?

Histogram of the residuals. Residuals or standardized residuals should be evenly distributed above and below 0.

How is normality assessed?

The Res SS=1341757

How large is the residual sum of squares?

Res SS=1835570.83

How large is the residual sum of squares?

F=MSB/MSW = (18187/3) / (11088044/6734) = 6062 / 1647 = 3.68

How to calculate the between mean sum of squares and within sum of squares?

F(gender)=MSB/MSW = (49932/1) / (10978499/6730) = 49932 / 1631 = 30.61 F(race)=MSB/MSW = (18715/3) / (10978499/6730) = 6238 / 1631 = 3.82 F(gender by race)=MSB/MSW = (41823/3) / (10978499/6730) = 13941 / 1631 = 8.55

How to calculate the between mean sum of squares, within mean sum of squares, and the F value for gender and race, and the interaction term, respectively?
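The three F values in this card all follow the same recipe: divide each sum of squares by its degrees of freedom, then divide by the within (residual) mean square. The sketch below reruns that recipe with the sums of squares quoted in the answer; the function and dictionary names are chosen for this example.

```python
def f_ratio(ss_effect, df_effect, ss_within, df_within):
    """F = (SS_effect / df_effect) / (SS_within / df_within)."""
    ms_effect = ss_effect / df_effect
    ms_within = ss_within / df_within
    return ms_effect / ms_within

# Sums of squares and degrees of freedom quoted in the card.
ss_within, df_within = 10978499, 6730
effects = {
    "gender": (49932, 1),
    "race": (18715, 3),
    "gender by race": (41823, 3),
}

f_values = {
    name: f_ratio(ss, df, ss_within, df_within)
    for name, (ss, df) in effects.items()
}

for name, f in f_values.items():
    print(f"F({name}) = {f:.2f}")
```

The same function covers the one-way case by passing a single between-groups sum of squares.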

Adjusted R-squared indicates that 25.56% of variance in systolic blood pressure can be explained by the model.

How to interpret adjusted R-squared?

White females have the highest total cholesterol, while white males have the lowest total cholesterol. Females in other races have the second-highest total cholesterol, while Black males have the second-lowest total cholesterol.

How to interpret margins for two-way ANOVA?

The number of finger-wrist taps per 10 seconds in the current exposure group (2) is significantly lower, by 10.44 taps, than that in the non-exposure group (1), but not significantly different from that in the previous exposure group (3). The number of finger-wrist taps in Group 2 is not significantly lower than that in Group 3.

How would you interpret the following t test in Fisher's Least Significant Difference (LSD) procedure?

Either a linear or a non-linear correlation, as you would need to visualize the data to understand if there is a pattern.

If an r for two variables is found to be -0.21, which of the following conclusions is most appropriate? 1) A weak linear correlation 2) A non-linear correlation 3) Either a linear or a non-linear correlation 4) No correlation

The gender differences in mean total cholesterol level vary depending on race.

If the F test for the interaction term is significant, how do you interpret it?

All pairwise comparison tests except for the married vs. divorced group are significant according to both Bonferroni and Tukey's HSD test. The separated/Divorced/Widowed group was 12.43 mg/dL higher than the never married group. The married group is 9.33 mg/dL higher than the never married group. However, the married group was not significantly different from the divorced group.

If the F test is significant, where are the exact differences using the Bonferroni and Tukey's HSD test, respectively?

All pairwise comparison tests show that Blacks and Latinx have significantly lower total cholesterol level than the others group, according to both Bonferroni and Tukey's HSD test. Other pairwise comparison tests are not significant.

If the F test is significant, where are the exact differences using the Bonferroni and Tukey's HSD test, respectively?

A) F, B) T, C) T, D) F

If the null hypothesis that the means of four groups are all the same is rejected using ANOVA at a 5% significance level, then ... true or false: A. We can then conclude that all the means are different from one another. B. The standardized variability between groups is higher than the standardized variability within groups. C. The pairwise analysis will identify at least one pair of means that are significantly different. D. The appropriate α* to be used in pairwise comparisons is 0.05/4=0.0125 since there are four groups.
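Statement D is false because the Bonferroni correction divides α by the number of pairwise comparisons, not by the number of groups. The sketch below counts the comparisons for four groups and computes the corrected level; the variable names are chosen for this example.

```python
import math

alpha = 0.05
n_groups = 4

# Number of distinct pairs among k groups: k * (k - 1) / 2.
n_comparisons = math.comb(n_groups, 2)

# Bonferroni-corrected significance level for each pairwise test.
alpha_star = alpha / n_comparisons

print(n_comparisons, round(alpha_star, 4))
```

With four groups there are six pairs, so the corrected level is 0.05/6, which is smaller than the 0.0125 proposed in statement D.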

Females' predicted systolic pressure is on average 1.43 mmHg lower than males' when age, waist circumference, and total cholesterol are held constant. The predicted systolic pressure increases by 0.48 mmHg for every additional year in age when sex, waist circumference, and total cholesterol are held constant. The predicted systolic pressure increases by 0.15 mmHg for each additional centimeter in waist circumference when sex, age, and total cholesterol are held constant. The predicted systolic pressure increases by 0.04 mmHg for every unit increase in total cholesterol when sex, age, and waist circumference are held constant.

Interpret each significant slope in the multiple linear regression model.

1) For males, the follow-up systolic blood pressure increased by 0.55 mmHg for each additional year in age since age 32 years, after controlling for other covariates. •SBP=112+21.32bpmeds1+0.55age32_70+0.03totchol1+0.06glucose1 2) For females, the follow-up systolic blood pressure increased by 1.08 (=0.55+0.53) mmHg for each additional year in age since age 32 years after adjusting for other covariates. •SBP=(112-7.57)+21.32bpmeds1+(0.55+0.53)age32_70 +0.03totchol1+0.06glucose1 •0.53 means the increment to the slope of age in female.

Interpret the following for Males and Females (Interpretation is focused on the continuous predictor by dummy predictor)
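The two equations in this card can be written as one function with a sex indicator, which makes the meaning of the interaction coefficient explicit. The sketch below uses the coefficients quoted in the answer; the function name and the default covariate values of zero are assumptions made only for illustration.

```python
def predicted_sbp(female, years_since_32, bpmeds=0, totchol=0.0, glucose=0.0):
    """Predicted follow-up SBP from the coefficients quoted in the card.

    female: 1 for females, 0 for males.
    years_since_32: age minus 32 (the age32_70 variable).
    Covariate defaults of 0 are illustrative, not realistic values.
    """
    intercept = 112 - 7.57 * female          # females start 7.57 mmHg lower
    age_slope = 0.55 + 0.53 * female         # 0.53 is the extra slope for females
    return (intercept + 21.32 * bpmeds + age_slope * years_since_32
            + 0.03 * totchol + 0.06 * glucose)

# Slope of age within each sex (change in SBP per additional year).
slope_male = predicted_sbp(0, 1) - predicted_sbp(0, 0)
slope_female = predicted_sbp(1, 1) - predicted_sbp(1, 0)

# Age at which the two lines cross: -7.57 + 0.53 * t = 0.
crossover_age = 32 + 7.57 / 0.53

print(round(slope_male, 2), round(slope_female, 2), round(crossover_age, 1))
```

Because females start lower but rise faster, the two fitted lines cross at about 46 years of age when the other covariates are held equal.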

1) Two sex groups have different intercepts and slopes. 2) Females have a lower follow-up systolic blood pressure (SBP) before 46 years old than males and a higher follow-up SBP after 46 years old 3) The beta coefficient of age for male was significantly smaller than that for female.

Interpret the following margins plots.

Yes, it is because the p<0.001.

Is F test significant? Why?

Model 2 provides a better fit than Model 1 because Model 2 is parsimonious with similar adjusted R2 and has lower AIC and BIC.

Is Model 1 (full model) or Model 2 (reduced model) better?

(1) H0 for gender is rejected by the F test since the p-value for the F test (F[1,6730]=30.61) is less than 0.05. (2) H0 for the race is rejected by the F test since the p-value for the F test (F[3,6730]=3.82) is equal to 0.01 and less than 0.05. (3) H0 for the gender by race interaction is rejected by the F test (F[3,6730]=8.55) since the p-value for the F test is less than 0.05.

Is each null hypothesis rejected based on the F test? Why

The sample selection may bias these trends. There are apparent cohort effects. People with high total cholesterol tend to be older and have higher morbidity and mortality of cardiovascular disease and stroke. Those people are not included in the NHANES. Survivors are more likely to have normal cholesterol.

Is it possible that the patterns displayed in the graph are biased?

The F test is significant because p<0.0001

Is the F test significant? Why?

The interaction term is statistically significant because Student's t test is significant and p<0.001.

Is the interaction term between age and waist circumference significant? Why?

No, it is not significant since the p-value for Student's t test is greater than 0.05. As the p-value is close to 0.05 and the model is relatively simple, there may be a significant interaction if other important factors are included in the model.

Is the interaction term between gender and BMI significant? Why?

Yes, it is significant since the Student's t test is significant.

Is the interaction term between gender and waist circumference significant? Why?

H0 is rejected because the F test is significant, as p<0.0001

Is the null hypothesis rejected based on the F test? Why

H0 is rejected because the F test is significant, as p<0.05

Is the null hypothesis rejected based on the F test? Why?

Yes, because p<0.0001

Is the r value significant? Why?

There is no evidence suggesting high correlations among the independent variables since all VIF scores are close to 1.

Is there any evidence of multicollinearity among predictors based on VIF? Summarizing VIF values in a table.

Yes, the average effect of the waist circumference reduced from 0.26 to 0.15 mmHg, almost a half. It means that the waist circumference effect was inflated in the simple model without considering other risk factors.

Is there any evidence suggesting that the waist circumference effect is partially confounded by gender, age, and total cholesterol? Why?

Yes, it is because the Student t test is significant, p<0.001.

Is waist circumference a significant predictor for systolic blood pressure? Why?

The comprehensive model has smaller AIC and BIC than the reduced model. Thus, the comprehensive model is better.

Which model is better?

Analysis of variance showed a significant main effect of gender on the total cholesterol (lbxtc), F(1,6730) = 30.61, p < .0001. The other main effect of race on the total cholesterol was also significant (F[3,6730] = 3.82, p = .01). The interaction term between gender and race was significant (F[3,6730]=8.55), suggesting that the mean differences in total cholesterol level between genders differed across all racial groups. Specifically, white males' average total cholesterol level was the lowest, while white females had the highest total cholesterol. However, the total cholesterol levels in Latinx were about the same between males and females. The average total cholesterol levels were lower in Black and other racial males than their counterparts in females. We conclude whether Black and Latinx have a higher total cholesterol level than Other race groups depends on gender in adults.

Utilizing the following data, report the findings in APA format for the question: how the participant's gender and race affect the total cholesterol level in the survey? 1) (lbxtc), F(1,6730) = 30.61, p < .0001 2) In image

The means of FEV vary by sex and smoking.

What does this graph indicate?

Measure the influence of a single observation on each regression coefficient

What are DFBETAs?

•Influential points are generally points that fall far outside the range of the x or y values of the other points in the data set •Also do not follow trend of other observations

What are influential points?

Bonferroni, Scheffé, Šidák, Fisher's LSD, and Tukey's HSD tests. The Bonferroni and Scheffé tests are very conservative, and the Šidák test is less strict.

What are post hoc tests? Which are more conservative?

1) Random Sample 2) Linearity 3) Conditional normal distribution of the dependent variable 4) Homoscedasticity 5) No outliers

What are the assumptions for the simple linear regression?

1) Random sample 2) Linearity (Residuals versus predictor plot) 3) Normal distribution of the dependent variable (Residuals versus predicted values plot & Histogram of residuals) 4) Homoscedasticity (Residuals versus predictor plot) 5) No outliers (Influence statistics to detect through Leverage, Cook's distance, dfbeta) 6) No multicollinearity (Variance inflation factors (VIF)<5 or 10)

What are the assumptions of linear regression?

1) Continuous dependent variable and nominal independent variables (factors) - not testable 2) Independence of observations (there is no relationship between observations in each group or between groups themselves) - not testable 3) No significant outliers in the continuous variable - testable via boxplot 4) The dependent variable should be approximately normally distributed in each group 5) Homogeneity of variances

What are the assumptions of one-way ANOVA?

1) Estimate the relationship between two or more independent variables and one dependent variable 2) Control confounding effects 3) Compare effect sizes among multiple independent variables 4) Develop a predictive model 5) Explain the variation of the dependent variable as much as possible

What are the benefits of multiple linear regression?

•Principle of scientific parsimony •Reducing the number of predictors improves the n/k ratio •"unnecessary terms in the model yield less precise inferences" (Ramsey and Schafer, p. 325)

What are the benefits of working with a smaller number of predictors?

1) Correlation does not mean that there is a causal relationship between x and y; we might have a "lurking variable" that is not included in our study 2) Data is based on averages: Averages suppress individual variability and heterogeneity within the data; because of that, the value of r can be artificially large if we use averages 3) A large r value can indicate a strong correlation between two variables, but it could be a nonlinear relationship

What are the common errors with using r?

•1 to infinity •1 means no correlation •2-5: moderate correlation, •>5 or 10: problematic •Tolerance =1 / VIF

What are the criteria for assessing collinearity?

1) The SBP readings for SV and LV are lower than that for the normal group - diet main effect •SV: Ȳ1. - Ȳ.. = 106.3 - 113.5 = -7.2 ≠ 0? •LV: Ȳ2. - Ȳ.. = 110.4 - 113.5 = -3.1 ≠ 0? •Normal: Ȳ3. - Ȳ.. = 123.95 - 113.5 = 10.45 ≠ 0? 2) The male SBP is higher than the female SBP - sex main effect •Ȳ.1 - Ȳ.. = 117.9 - 113.5 = 4.4 ≠ 0? •Ȳ.2 - Ȳ.. = 109.1 - 113.5 = -4.4 ≠ 0? 3) The differences in SBP among diet groups vary by sex, a possible interaction? •Male: (Ȳ11 - Ȳ1.) - (Ȳ.1 - Ȳ..) = (109.9 - 106.3) - (117.9 - 113.5) = -0.8 ≠ 0? •Female: (Ȳ32 - Ȳ3.) - (Ȳ.2 - Ȳ..) = (119.6 - 123.95) - (109.1 - 113.5) = 0.05 ≠ 0?

What are the diet main effect, sex main effect, and possible interaction within this chart?

1) Continuous by continuous predictor 2) Binary by continuous predictor 3) Binary by binary predictor

What are the three types of interaction terms?

1) For children shorter than 55 inches, younger children had better FEV than older children. Older children had a stronger relationship between FEV and height. 2) For children taller than 55 inches, the older children had better FEV than younger children and had a stronger relationship between FEV and height.

What conclusions can be drawn from this margins plot?

1) The FEV was 2.49 (intercept) for girls (sex=0) who were 10 years old and 61 inches tall. 2) A significant interaction means that the age and height effects on FEV depend on each other. 3) Given average height (centerhgt=0), each additional year in age is associated with 0.05 L/s more FEV. As height increases, the age effect increases (0.05+0.01*Height). 4) Given average age (centerage=0), each additional inch in height is associated with 0.12 L/s more FEV. As age increases, the height effect increases (0.12+0.01*Age).

What conclusions can be drawn from this mean-centered model?

1) H0(Sex): males and females have an equal FEV, but not for smoke. •F(1,650)=27.12, p<0.001, rejecting H0(Sex). 2) H0(Smoke): smokers and non-smokers have an equal FEV, but not for sex. •F(1,650)=53.84, p<0.001, rejecting H0(Smoke). 3) H0(Sex by smoke): the smoking effect on FEV does not vary by sex. •F(1,650)=3.77, p=0.0527, fail to reject H0(Sex by smoke). •F=MSB/MSW=2.51/0.67=3.77 •If the interaction term were significant, it would mean that the smoking effect differs between genders. 4) The model explains 11.72% of the variance in FEV.

What conclusions can be drawn from this partial F test?

1) A 30-year-old who weighs 80 kilograms has a 30% chance of having hypertension, while a 30-year-old who weighs 100 kilograms has a 50-60% chance of having hypertension. 2) Knowing a person's age alone does not give us enough information to predict their probability of having hypertension. We also need to know their weight.

What conclusions can be drawn from this visualizing interaction plot?

•Theoretically and/or practically meaningful •Parsimonious •Significant F and/or t test results for all independent variables •Lowest AIC and BIC •Highest R2 or adjusted R2 •Cross-validated or replicable/generalizable

What criteria should the best-fitting linear regression model meet?

F distribution

What distribution is this?

The distributions of the total cholesterol in the whites, blacks, and Latinx groups are skewed to the right. The distribution of the total cholesterol in the other group is approximately normal.

What do the following distributions indicate?

The distributions of the total cholesterol in the never married and married groups are skewed to the right. The distribution of the total cholesterol in the divorced group is approximately normal.

What do the following histograms indicate in terms of marital status and total cholesterol?

1) The gender differences in waist circumference in Blacks (reversed) and in Other Hispanics (equal) were each significantly different from that in Whites. 2) In males, non-Whites had significantly smaller waists than Whites. The descending order in waist is White, Other Hispanic, Black, Mexican, and Other race. In females, the order in waist is White, Black, Other Hispanic, Mexican, and Other race.

What do the following say about gender, race, and waist circumference?

1) Age effect on systolic blood pressure by sex and anti-hypertension medication. Different intercepts and same slopes.

What do the two margin plots indicate?

1) Main Effects - How different levels of the two independent variables affect the dependent variable? 2) Whether levels of one independent variable affect the dependent variable in the same way across the levels of the second independent variable; The two independent variables affecting each other

What does Two-Way ANOVA determine in regards to main and interaction effects?

The coefficient of determination shows the proportion of the variance explained by one variable or reliability

What does r^2 determine?

1) Cell Means Model: The underlying population means are equal 2) Factor Effects Model: The underlying population means are equal

What does the Cell Means Model in one-way ANOVA do? The Factor Effects Model?

Tests if there is a linear correlation between two independent variables

What does the T-test determine?

1) Children currently exposed (group 2) to lead had significantly poorer neurological function than non-exposed children (group 1). H0 is rejected. 2) There is no significant difference between the mean scores for the currently (2) and previously exposed (3) groups, or between the mean scores for the previously exposed (3) and non-exposed (1) groups. These H0s are not rejected.

What does the Tukey's Honest Significant Difference (HSD) test indicate?

The first category becomes the reference category in which all following categories are compared

What does the factor of variable notation (i.) do for categorical variables?

1) H0: The fit of the intercept-only model and that of the age-sex-height model are equal; H0 is rejected by the F test result (p<0.0001) 2) The current model fits the data well, but it may not be the best model; Adjusted R2 = 1-(Res MS)⁄(SST/(N-1)) = 1-0.17/0.75 = 0.7736 penalizes the number of independent variables and indicates that 77.36% of the variance in FEV can be explained by the model. Critical thinking: 22.64% of the variation remains unexplained, suggesting that relevant variables are missing. 3) Height has the largest effect on FEV according to the standardized estimates

What does the following data suggest about the multiple linear model?

In general, females have higher total cholesterol than males. The difference in total cholesterol between the Black and Others group is constant in both genders. The gender difference in total cholesterol is minimal. In contrast, the gender difference in total cholesterol in whites is the largest.

What does the following graph indicate about mean total cholesterol level by gender and race?

1) H0 (µ1=µ2=µ3=µ) is rejected because the p-value of the F test is < 0.05; therefore, at least one mean among the lead exposure groups is different from the others 2) Homogeneity of variance is met, as Bartlett's test gives p=0.925 3) About 10% of the variance of maxfwt scores is explained by lead exposure, as R^2 = SSB/SST = 1600/16154 = 0.099

What does the following one-way ANOVA output mean in terms of lead exposure groups and FEV?

The 3 variances are not similar. The divorced group has the largest variance. The homogeneity assumption is not met.

What does the following output indicate about the variances of the total cholesterol among the 3 groups?

1) The FEV is the lowest at 8 years when other variables are held constant. 2) After 8 years old, the FEV increases at an accelerating rate when other variables are held constant.

What does the following polynomial linear regression model state in terms of FEV?

1) The number of finger-wrist taps per 10 seconds in the current exposure group (2) is significantly lower, by 10.44, than that in the non-exposure group (1), but it is not significantly different from that in the previous exposure group (3). 2) Conclusion: Children currently exposed to lead had poorer neurological function than non-exposed children. However, neurological function is not significantly different between non-exposed children and previously exposed children. More data are needed to compare these two groups in the future.

What do the following post hoc tests mean in regards to maxfwt and exposure groups?

Compared to the White or male group, other Hispanics and Black effects on waist circumference are significantly different by gender.

What does the following say about effects within the model?

The model fit is good. At least one of the effects (study group or sex) is significant. A total of 29.89% of the variations in SBP can be explained by the dietary groups and sex.

What does the following say about model fit?

1) The model fits the data well based on the F test. 2) Approximately 22% of the variance in baseline systolic blood pressure (SBP) can be explained by the model. 3) Total cholesterol and age were positively associated with baseline SBP. 4) The regression coefficient of 7.11 mmHg is the increment to the effect of current smoking status in the anti-hypertension medication group, or equivalently the increment to the effect of anti-hypertension medication among current smokers.

What does the following say about the interaction between Current Smoking Status and Hypertension intervention Group on Systolic Blood Pressure?

1) The table shows that smokers' average SBP in the anti-hypertension medication group was much higher than that in the other groups. 2) The graph shows: •Smokers (red) in the non-medication group had a lower baseline SBP than non-smokers (blue). •The difference in baseline SBP between smokers and non-smokers was larger in the anti-hypertension medication group than in the non-medication group.

What does the following say about the interaction between Current Smoking Status and Hypertension intervention Group on Systolic Blood Pressure?

1) Compared to Mexican Americans, people in the other race (β= -10.44) had a significantly smaller waist circumference. 2) The intercept is the Mexican Americans' mean waist circumference (99.249 cm)

What does the following table indicate in terms of using country of birth to predict the adult's waist circumference (cm) in the U.S. general population?

1) Compared to white people, Hispanics (β= -2.272) and other race (β= -10.903) had a significantly smaller waist circumference, respectively. 2) Intercept is the whites' mean waist circumference.

What does the following table indicate in terms of using country of birth to predict the adult's waist circumference (cm) in the U.S. general population?

1) predicted waist =105.371-5.910×(Country of birth) 2) H0 : β1(country of birth)=0. Or there is no difference in the waist circumference between countries of birth. 3) H0 for the slope is rejected by the t test because p<0.001 4) The model explains 2.7% of the variations in waist circumference. 5) The average waist decreased by 5.9 cm for one unit increase in the country of birth (i.e., from 1=being born in the U.S. to 2=others) 6) 95% CI: we are 95% confident that the true slope falls between -6.866 and -4.953. 7) Conclusion: In 2011-2012, the participants who were born outside of the U.S. had a significantly smaller waist circumference than those born in the U.S.

What does the following table indicate in terms of using country of birth to predict the adult's waist circumference (cm) in the U.S. general population?
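As a quick check of the fitted equation in the answer above, the sketch below plugs in the two codes of country of birth (1 = born in the U.S., 2 = others). The function name and variable names are illustrative assumptions; the coefficients are copied from the answer.

```python
# Coefficients copied from the answer: predicted waist = 105.371 - 5.910 * (country of birth)
INTERCEPT = 105.371
SLOPE = -5.910


def predicted_waist(born4: int) -> float:
    """Predicted waist circumference in cm; born4 is 1 (U.S.-born) or 2 (others)."""
    return INTERCEPT + SLOPE * born4


us_born = predicted_waist(1)
foreign_born = predicted_waist(2)

# With a single two-level predictor, the predictions are the two group means,
# and their difference is the slope.
print(round(us_born, 2), round(foreign_born, 2), round(foreign_born - us_born, 2))
```

The two predictions agree with the group means of 99.46 cm and 93.55 cm reported for born4=1 and born4=2.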

•The average waist circumference was 99.46 cm for people born in the U.S. (born4=1) and 93.55 cm for those born in foreign countries (born4=2).

What does the following table indicate in terms of using country of birth to predict the adult's waist circumference (cm) in the U.S. general population?

•H0: The fit of the reduced model and that of the full model are the same, i.e., the reduced model is true. •If H0 is not rejected by the partial F test (p>0.05), the reduced model is preferred. •Repeat the above process until the partial F test becomes significant, then stop. This is not feasible with a large number of predictors.

What does the partial F test determine? What is the process?

Standardized β1 = r = 0.756=√(0.572) A 1 standard deviation increase in age is predicted to result in a 0.76 standard deviation increase in the FEV.

What does the standardized beta indicate?

1) H0: The fit of the intercept-only model and that of the age model are equal; H0 is rejected by the F test result (p<0.0001) 2) Prediction accuracy by R2 (coefficient of determination) indicates that 57% of the variance in FEV can be explained by age. 3) Adjusted R2 = 1-(Res MS)⁄(SD_y)^2 = 1-0.322/0.867^2 = 0.5716

What does this indicate about the model fit?
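The adjusted R2 arithmetic in the answer above can be reproduced in a few lines. This sketch reads the denominator as the squared standard deviation of FEV (0.867 squared), which is an assumption that recovers the stated result; the variable names are illustrative.

```python
res_ms = 0.322  # residual mean square, from the answer
sd_y = 0.867    # standard deviation of FEV, from the answer (assumed reading)

# Adjusted R2 = 1 - (residual mean square) / (sample variance of y)
adj_r2 = 1 - res_ms / sd_y ** 2
print(round(adj_r2, 4))
```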

β3 = increment to the effect of xi1 from one unit change in xi2 OR β3 = increment to the effect of xi2 from one unit change in xi1.

What does β3 mean in this model?
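To make the "increment to the effect" reading concrete, here is a minimal sketch that assumes the model has the usual form y = β0 + β1*x1 + β2*x2 + β3*x1*x2. The coefficient values are hypothetical and chosen only for illustration.

```python
def predicted_y(x1, x2, b0=1.0, b1=2.0, b2=3.0, b3=0.5):
    """Prediction from a two-predictor model with an interaction term (hypothetical coefficients)."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def effect_of_x1(x2):
    """Change in the prediction for a one-unit change in x1 at a fixed x2."""
    return predicted_y(1, x2) - predicted_y(0, x2)


# The effect of x1 grows by b3 for each one-unit change in x2.
increment = effect_of_x1(1) - effect_of_x1(0)
print(effect_of_x1(0), effect_of_x1(1), increment)
```

By symmetry, the same increment appears in the effect of x2 for each one-unit change in x1.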

The inverse of a square matrix exists only if its columns are linearly independent. Since the vector of regression estimates β depends on (X'X)^-1, the parameter estimates β0, β1, and so on cannot be uniquely determined if some of the columns of X are linearly dependent. Therefore, you will run into trouble when trying to estimate the regression equation: •The coefficient estimates (i.e., magnitudes and signs) can swing wildly based on which other independent variables are in the model. •The precision of the estimated coefficients is reduced (i.e., large SEβ), so the p-values become questionable. •Slightly different models lead to very different conclusions.

What happens if a square matrix C has dependent columns?
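A small sketch of the singularity problem, using made-up design matrices with an intercept column and two predictors. When one predictor is an exact multiple of another, the determinant of X'X is zero, so (X'X)^-1 does not exist. The helper names `gram` and `det3` are hypothetical.

```python
def gram(X):
    """Return X'X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Columns: intercept, x1, x2 (hypothetical data).
X_dependent = [[1, 1, 2], [1, 2, 4], [1, 3, 6], [1, 4, 8]]    # x2 = 2 * x1 exactly
X_independent = [[1, 1, 2], [1, 2, 3], [1, 3, 7], [1, 4, 5]]  # no exact linear relation

print(det3(gram(X_dependent)))    # zero: X'X cannot be inverted
print(det3(gram(X_independent)))  # nonzero: unique estimates exist
```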

1) F test, MS (between, within), p-value, R^2 (coefficient of determination) 2) Tukey's HSD test

What information do you need for the APA citation of one-way ANOVA?

1) The predicted FEV increases by 0.06 as age increases by one year when sex and height are held constant. The age effect decreases from 0.22 to 0.06. 2) The predicted FEV increases by 0.10 as height increases by one inch, after controlling for the age and sex effects. 3) The predicted FEV increases by 0.16 for a one-unit increase in sex (0=F, 1=M). Alternatively, the predicted male FEV is 0.16 higher than the female FEV after adjusting for age and height. 4) Intercept: the average FEV is -4.45 for a female (sex=0) whose height is zero (height=0) at age zero (age=0). This interpretation has no practical meaning. Conclusion: We are 95% confident that the true height effect is between 0.10 and 0.11 after controlling for the other covariates.

What interpretations can be drawn from this?

Cook's distance measures the effect of a single observation on the predicted/fitted values of all subjects

What is Cook's Distance?

•Y = a + bX + e •ln(Y) = a1 + b1X + e1 •b1 means that ln(Y) changes by b1 for a one-unit change in X. •exp[ln(Y)] = Y, so Y is multiplied by exp(b1) for a one-unit change in X.

What is Log Transformation?
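A short sketch of the back-transformation, using hypothetical coefficients a1 and b1. It shows that a one-unit change in X multiplies the predicted Y by exp(b1), whatever the starting value of X.

```python
import math

a1, b1 = 0.5, 0.2  # hypothetical coefficients of the log-scale model


def predicted_y(x):
    """Back-transform the log-scale prediction ln(Y) = a1 + b1 * x to the Y scale."""
    return math.exp(a1 + b1 * x)


ratio_low = predicted_y(1) / predicted_y(0)
ratio_high = predicted_y(11) / predicted_y(10)

# Both ratios equal exp(b1): the effect on Y is multiplicative.
print(round(ratio_low, 4), round(ratio_high, 4), round(math.exp(b1), 4))
```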

•Y = a + bX + e •√Y = a1 + b1X + e1 •b1 means that √Y changes by b1 for a one-unit change in X. •(√Y)^2 = Y = (a1 + b1X)^2, so the change in Y for a one-unit change in X depends on the starting value of X; it equals (b1)^2 only when a1 + b1X = 0.

What is Square Root Transformation?

Investigates the effects of two or more independent variables on one continuous dependent variable (Two-way ANOVA)

What is a Factorial Design?

The difference between the observed and the predicted/fitted y value

What is a residual?

Begins with full model

What is backward stepwise selection?

1) Randomly divide the sample into 75% for model construction and 25% for inference. 2) Perform variable selection to develop a model with the 75%. 3) Fit the above final model to the 25% and proceed with inference using that fit.

What is cross validation?
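Step 1 of the procedure above, the random 75%/25% split, might look like the following sketch. The function name, the seed, and the sample of 100 subject IDs are all hypothetical.

```python
import random


def split_sample(rows, construction_fraction=0.75, seed=0):
    """Shuffle a copy of rows and split it into construction and inference sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * construction_fraction)
    return shuffled[:cut], shuffled[cut:]


subject_ids = list(range(1, 101))  # hypothetical sample of 100 subjects
construction, inference = split_sample(subject_ids)
print(len(construction), len(inference))
```

Variable selection would then use only `construction`, and the final model would be refit on `inference` for the reported estimates and p-values.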

Begins with empty model

What is forward stepwise selection?

Leverage is a measure of how far the x values of an observation are from all other observations

What is leverage? (hii)
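For simple linear regression with an intercept, leverage has the closed form h_ii = 1/n + (x_i - mean of x)^2 / Sxx. The sketch below applies it to made-up x values; the function name is hypothetical.

```python
def leverages(xs):
    """Leverage h_ii of each observation in a simple linear regression with an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


xs = [1, 2, 3, 4, 10]  # hypothetical predictor values; 10 lies far from the rest
h = leverages(xs)
print([round(v, 2) for v in h])
```

The observation with x = 10 has by far the largest leverage, and the leverages sum to 2, the number of estimated coefficients (intercept and slope).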

When two or more independent variables are highly correlated, changes in one variable are associated with shifts in another variable.

What is multicollinearity?

Analysis of variance showed a main effect of marital status (marital) on total cholesterol (lbxtc), F(2, 4929) = 24.61, p < .001, R2 = .01. Post-hoc analyses using Tukey's HSD indicated that the average total cholesterol level was significantly higher for the separated/divorced/widowed group than the other two groups (p < .001). The average total cholesterol level was significantly higher for the married/living together group than for the never married group (p < .001). However, the average cholesterol levels were not different between the married/living together group and the separated/divorced/widowed group (p = .08). We conclude that marital status is associated with the total cholesterol level in adults. Specifically, people who have ever been married or living together have a higher total cholesterol level.

What is the APA ANOVA analysis for a model showing: 1) F(2, 4929) = 24.61, p = .000, R2 = .01 (from Analysis of Variance) 2) See image for Tukey's HSD

The least squares procedure minimizes the sum of the squared errors of prediction (the residuals). Many regression lines can be drawn, and each one is associated with a sum of squared residuals. Only the best-fit line has the smallest sum of squared residuals.

What is the Least Squares Procedure?
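A minimal sketch of the idea for one predictor, on made-up data: the closed-form least squares line has a smaller sum of squared residuals than a nearby line with a slightly different slope or intercept. Function names are hypothetical.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared differences between observed and fitted y values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5]                 # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares(xs, ys)

best = sum_squared_residuals(xs, ys, a, b)
steeper = sum_squared_residuals(xs, ys, a, b + 0.1)
shifted = sum_squared_residuals(xs, ys, a + 0.1, b)
print(round(a, 2), round(b, 2), best < steeper, best < shifted)
```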

1) Start with an equation that includes all predictors. 2) The partial F is calculated for every predictor, treated as though it were the last predictor to enter the equation. 3) The smallest partial F value is compared with a preselected critical value (e.g., the F value for a significance level of 0.1 or 0.2). If the partial F is less than the preselected F, remove that predictor and recompute the equation with the remaining variables.

What is the backward sequential method?

The relationship between the outcome and a predictor can be partially or fully explained by the relationship between the outcome and another predictor.

What is the confounding effect?

It is an iterative procedure. The first predictor to enter the equation is the one with the largest simple correlation with y. Then the predictor with the largest partial correlation with y is considered, and so on. Predictors entered earlier remain in the equation.

What is the forward sequential method?

The intercept means that the average systolic blood pressure is 127 mmHg for a man who is 50 years old with 100-inch waist circumference and 187 mg/dL total cholesterol.

What is the meaning of the intercept?

(1) All mean total cholesterol levels between genders are equal after considering the race effect. (2) All mean total cholesterol levels between racial groups are equal after considering the gender effect. (3) The mean differences in total cholesterol levels between genders are equal across all racial groups. OR The mean differences in total cholesterol level among racial groups are similar between genders.

What is the null hypothesis of the F test for gender, race, and the interaction term, respectively?

All mean total cholesterol levels are equal.

What is the null hypothesis of the F test?

All racial groups have an equal mean total cholesterol.

What is the null hypothesis of the F test?

Predicted SBP = 81.24 - 1.43*2+0.48*50+0.15*120 + 0.04*200 = 128 mmHg

What is the predicted systolic blood pressure for a woman who is 50 years old with a 120-inch waist circumference and 200 mg/dL total cholesterol?
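The arithmetic in the answer above can be checked with a short sketch. The coefficients and the coding of gender as 2 for a woman are taken from the answer; the function and argument names are illustrative.

```python
def predicted_sbp(gender, age, waist, total_chol):
    """Predicted SBP (mmHg) from the fitted equation; gender is coded 2 for a woman, as in the answer."""
    return 81.24 - 1.43 * gender + 0.48 * age + 0.15 * waist + 0.04 * total_chol


sbp = predicted_sbp(gender=2, age=50, waist=120, total_chol=200)
print(round(sbp, 2), round(sbp))
```

The unrounded value is 128.38, which rounds to the 128 mmHg given in the answer.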

predicted FEV = -4.45 + 0.06(age_i) + 0.16(sex_i) + 0.10(height_i)

What is the predictive model from the table?

Predicted SBP = 100.2 + 0.26*Waist circumference

What is the predictive model?

1) State the hypotheses 2) Set the criterion for rejecting H0 3) Check assumptions: normality and homogeneity of variance 4) Calculate the F statistic representing a ratio of variances 5) Find the p-value from the F test 6) Decide whether to reject or fail to reject H0 7) Estimate effect size 8) Post hoc testing for detecting where the differences are 9) Interpret the results 10) Report the results

What is the procedure for conducting one-way ANOVA?
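The computational steps (the F statistic as a ratio of variances, and R2 as the effect size) can be sketched on made-up data as follows. The function name and the three groups are hypothetical; the p-value step is left out because it needs the F distribution, which the Python standard library does not provide.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, R2) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ssb / df_between) / (ssw / df_within)  # ratio of mean squares
    r2 = ssb / (ssb + ssw)                           # SSB / SST
    return f_stat, df_between, df_within, r2


groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]  # hypothetical scores for three groups
f_stat, df_b, df_w, r2 = one_way_anova(groups)
print(round(f_stat, 2), df_b, df_w, round(r2, 3))
```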

1) Calculate sample r 2) Interpret r and r2 3) Conduct a t test for r 4) Interpret r and r2

What is the process for seeing if there is a linear correlation between two independent variables?
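The first and third steps can be sketched as below, on made-up data. The t statistic uses the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom; the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic for H0: population correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
t = t_for_r(r, len(xs))
print(round(r, 3), round(r ** 2, 3), round(t, 3))
```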

1) Use the F test to determine goodness of fit 2) Estimate regression coefficient or the slope based on the least-squares procedure 3) Estimate the intercept 4) t test for the slope

What is the process for simple linear regression?

To compare multiple means simultaneously using ratios of estimates of variance

What is the purpose of the F test?

r = 0.878 There is a strong, positive correlation between bmxwaist and bmxhip

What is the r value? What does it mean?

It is a variation on the forward selection procedure. At each stage, a test is made of the least useful predictor. The importance of each predictor is constantly reassessed (making it the best)

What is the stepwise method? (the best)

Simple Linear Regression

What model would you use to answer the following question: Does the country of birth predict the adult's waist circumference (cm) in the U.S. general population?

•Select one independent variable. •Linearly combine the independent variables, such as adding them together. •Factor analysis or principal components analysis

What solutions are there should collinearity exist?

Multiple Linear Regression

What type of model should we use to answer the following question: Forced expiratory volume (FEV), the volume of air (in liters) exhaled in the first second during forced exhalation after maximal inspiration, is a standard measure of pulmonary function. A study collected data on FEV and height for 654 boys in the 3-19 age group residing in Tecumseh, MI. We previously found that age was significantly associated with FEV. However, the model only explained 57% of the variation in FEV. Is this association confounded by other variables? We want to strengthen the model by adding sex and height to it.

gender, age, waist circumference, total cholesterol are significant risk factors.

What variables are significant predictors for systolic blood pressure? Why?

One-way ANOVA

What would you use to determine the answer of this question: whether the participant's marital status is related to the total cholesterol level in the survey

A regression model that includes many insignificant predictors can have a much higher R2 or adjusted R2 than the regression model that is reduced to a handful of significant predictors. Thus, R2 or adjusted R2 can be easily inflated, and the validity is debatable in the model selection.

Why is R^2, or the coefficient of determination, debatable when talking about model fit?

The t test from corr cannot be converted into useful confidence intervals, because once the population correlation is not zero, the sampling distribution of the estimate r is substantially skewed, even for large sample sizes. A confidence interval can instead be obtained through Fisher's transformation and the normal distribution.

Why is r not useful for 95% CIs?
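A sketch of the Fisher transformation approach mentioned in the answer: transform r with the inverse hyperbolic tangent, build a normal-based interval on that scale with standard error 1/sqrt(n - 3), and transform back. The value r = 0.878 is the waist-hip correlation quoted elsewhere in this set; the sample size n = 100 is purely hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via Fisher's z transformation."""
    z = math.atanh(r)            # Fisher's z = 0.5 * ln((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


r = 0.878
low, high = fisher_ci(r, n=100)  # n is a hypothetical sample size
print(round(low, 3), round(high, 3))
```

The resulting interval is not symmetric around r: the lower limit lies further from r than the upper limit, reflecting the skewed sampling distribution.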

The single most important tool in selecting a subset of variables for use in a model is the analyst's knowledge of the substantive area under study

Why is substantive knowledge beneficial?

Increasing Type I error (i.e., false positive)! For each t-test there is a chance that we will commit a Type I error, that is, reject the null hypothesis when it is actually true. This probability is typically 5%. (After three t-tests, the chance of at least one Type I error is 14.26%.) One-way ANOVA controls this and keeps it around 5%.

Why not use the Student's t test instead of ANOVA?
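The inflation of the Type I error rate follows from 1 - (1 - alpha)^m for m tests, assuming the tests are independent (an idealization; pairwise t-tests on the same groups are not strictly independent). A minimal sketch with a hypothetical function name:

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one Type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


fwer = familywise_error(0.05, 3)
print(round(fwer, 4))
```

With alpha = 0.05 and three tests this gives roughly 14.3%, compared with 5% for a single test.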

1) Avoid an increased risk of committing a Type 1 error (False Positive) 2) Interactions

Why use two-way ANOVA over multiple one-way ANOVA when using two or more independent variables on one continuous dependent variable?

Analysis of variance showed a main effect of race on the total cholesterol (lbxtc), F(3, 6,734) = 3.68, p = .01, R2 = .001. Post-hoc analyses using Tukey's HSD indicated that the average total cholesterol level was significantly lower for the Black and Latinx groups than the other race groups (p < .05). However, the average cholesterol levels were not different between the white group and other groups (p > .05). We conclude that race is associated with the total cholesterol level in adults.

With the following data, report the full test results using the APA style: 1) F(3, 6,734) = 3.68, p = .01, R2 = .001 2) Image

Predicted SBP = 108 + 0.49*minage - 7.85*Female + 0.07*minwaist +0.15*riagender-by-minwaist

Write down the predictive model.

Predicted SBP = 109 + 0.52*minage - 2.69*Female + 3.28*BMI +1.73*riagender-by-bmi

Write down the predictive model.

Predicted SBP = 81.24 - 1.43*gender+0.48*age+0.15*waist circumference + 0.04*total cholesterol

Write down the predictive model.

Predicted_total_cholesterol = 149.45 + 6.84*Female + 1.21*age + 0.67*waist - 0.02*age_by_waist
