Design and Analysis of Experiments


How much confidence can we have in Lady Bristol's true skill if Fisher's experiment is conducted with 6 cups?

- 6 cups (3 milk-first, 3 tea-first) have 20 different arrangements of administering milk or tea first. - 1 divided by 20 is 5%. - 100% - 5% = 95%, meaning we have 95% confidence that she did it by skill.
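The arithmetic above can be checked directly in R:

```r
# Number of ways to choose which 3 of 6 cups had milk first
n.arrangements <- choose(6, 3)    # 20
p.chance <- 1 / n.arrangements    # 0.05: probability of guessing all cups correctly
confidence <- 1 - p.chance        # 0.95
```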

Understand the standard normal distribution density function graph on this slide. Within one standard deviation of the mean (z between -1 and +1), what is the probability percentage that the value lies within this range? Outside of it?

- 68.2% within - 15.9% outside (in each tail)
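These percentages can be reproduced from the standard normal CDF in R:

```r
within <- pnorm(1) - pnorm(-1)   # ~0.682: area between z = -1 and z = +1
one.tail <- pnorm(-1)            # ~0.159: area in each tail outside that range
```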

What is Kurtosis?

- A measure of how steep or how flat the distribution is (the pointiness).

What to consider when doing the Shapiro-Wilk test on residuals?

- A p-value below 0.1 means the normality assumption should be used with caution!

If we don't know the standard deviation, but have to estimate it, what kind of distribution are we undertaking?

- A t-distribution as opposed to a normal distribution.

What kind of statistical model is ANOVA written as? (May be good to investigate how to employ other statistical models) (Know!)

- ANOVA is written as a linear model

Describe what a Type 1 error is.

- Accepting the H1 hypothesis and rejecting the null hypothesis when, in fact, the null hypothesis is true. - Saying there is an effect when in reality there is no effect.

Describe what a Type II error is.

- Accepting the null hypothesis when in fact the H1 hypothesis is correct - i.e. accepting H0 when in fact H0 is wrong.

What is the purpose of randomization?

- It aims at controlling the effect of potential unknown confounders; since not every confounder can be known, randomization mitigates their influence.

Since we cannot take the mean of the deviation, what do we do instead?

- All the deviations are squared - and then they are all summed up.
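A minimal R sketch of the sum of squared deviations, using made-up values:

```r
x <- c(2, 4, 6, 8)
dev <- x - mean(x)   # raw deviations; these always sum to zero
ss <- sum(dev^2)     # squaring first makes the sum informative: 20
```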

Regarding alternative = "less", provide the interpretation of this in hypothesis form.

- Alternative = "less" Interpretation: If you have low TST2 you have high BF. - Alternative = "greater" Interpretation: If you have high TST2, you have high BF.

How many subjects per block does the GRBD require?

- As many subjects per block as there are treatments

Within pwr.norm.test(), what is "d"? What is the formula?

- Assumed effect size - d = effect delta / sigma (sd)

Give an example of a partial correlation.

- BMI (grey), energy intake, and physical activity. Energy intake and physical activity are correlated (i.e. they share some common variance; physical activity has some influence on energy intake or vice versa). BMI is correlated with physical activity in the same respect, and with energy intake in the same respect. The overlap is what is identified.

Why is phi(1) or phi(2) not simply 50% + 2.3% or 50% + 15.9%?

- Because 50% + 15.9% + 2.3% only equals about 68% AUC, which doesn't make sense. - To get phi(1), compute 1-phi(-1), also known as 100% - 15.9% = 84.1%. - To get phi(2), compute 1-phi(-2), also known as 100% - 2.3% = 97.7%.

Why does T Dist density function rely on the degrees of freedom?

- Because the degrees of freedom are the basis of the comparisons done within the t-test

In the model for the F-emp determination, why are k-1 (k = number of regression coefficients) and n-k used? (Know!)

- Because the model follows an F distribution: - k-1 is the degrees of freedom for the model - n-k is the degrees of freedom for the error.

What is an example of a random variable?

- Body height. ***Remember how to construct its frequency distribution.

What are the main Post Hoc tests?

- Bonferroni Correction - Tukey's HSD

Review this picture. Explain what is going on. How do you close the gap?

- Close the gap by increasing sample size; this pulls both curves up at the top, pinching them and making the curves narrower, until eventually the blue (reject null) and green (reject H1) areas meet.

How to determine Tau value?

- Concordant - discordant. If the number of concordant pairs is greater than discordant, the correlation is positive. If D is greater than C, it is negative.

What does a control group serve as? (As in, what are its base functions?)

- Controls for potential errors in the course of the experiment - a reference for the effect of the true treatment levels. - Ultimately this is possible because the control group represents a known outcome.

In H1... View the picture and explain: why, if alpha(r) does not equal alpha(t), is SSmodel high and SSerror low?

- If the alpha values are different, this means that the subgroup means differ from the grand mean. - SSmodel will be quite high due to this (therefore creating a sloped line). - SSerror will be low, as the line now reduces the SSerror (which is determined from the regression line).

Explain how the formula Yi = B0 + B1*Xi + error can accomplish the removal of excess variables? Use an example in which Energy Intake is Y. Do this again with Physical activity.

In example: Energy Intake(i) = B0 + B1*BMI(i) + Error(i) - The residuals are by definition uncorrelated with BMI (they represent the part of energy intake that is not affected by BMI). PA(i) = B0 + B1*BMI(i) + Error(i) - The residuals (the error) are by definition uncorrelated with BMI (they show the PA(i) values not affected by BMI).

Interpret the multiple regression analysis on the left corner.

Interpretation: - If we increase PA by 1 unit, we decrease BMI by 19.5 - If we increase EI by 1 unit, we increase BMI by 2.6 - *None of these results are significant.

When may blocking be useful?

Lab: containers or shipments of a substance may be blocked by age, in relation to the experiment. ELISA: blocking for well plates.

Lecture 3

Lecture 4a - Regression Analysis

Give an example of H1 hypothesis with graph and power.

Let's assume that H1 is true (with +40µm being the expected difference/mean group effect). - Values close to 40 would be likely. - Values that substantially deviate from 40 (particularly to the left) would be quite unlikely, as this would speak against H1. - So if you observe small differences that are not consistent with the expected 40µm, H1 is likely rejected, or you are experiencing a Type II error. - ΔCrit.1 = u1 + Zβ x σ/sqrt(n); this formula gives the ΔCrit.1 value, which determines how far to the left values can deviate from the expected 40µm (or better: how far from zero the boundary lies). - Ex: if you get a ΔCrit.1 value of 21, then if the mean group effect of your actual experiment falls below 21 (deviating negatively by more than 19 from the expected 40), it can be deemed a rejection of H1 or can result in a Type II error.

(Know!) How do you mathematically get MSSmodel? MSSerror? What is I? What is n?

MSSmodel = SSmodel/(I-1); MSSerror = SSerror/(n-I) #not the number 1, it's the letter "I". I = number of levels of the classification variable (i.e. Q1, Q2, etc.). n = number of observations across all levels of the classification variable.

Define the X bar (mean) and the sd (standard deviation) for the normal distribution.

Mean = 0 Sd = 1

In an F-value distribution, if you have multiple groups curves: what would a low F -value distribution look like vs a high F-value distribution graph?

- Low F-value: all curves lie closely together, as there is large within-group variation (residuals) and low variation between groups (MSSmodel). - High F-value: the curves are separated, because variation between groups is high and within-group variation is low.

What form of SS is required for finding the F value? (Know!)

- MSSmodel - MSSerror (not SSmodel or SSerror)

What are the advantages of experimental research (as opposed to observational)?

- Manipulate the environment (almost any variable/target) to any degree - control for confounding variables (a typical problem of observational studies, where we cannot fully ensure that someone doesn't drink vodka once instead of only tequila).

What are the assumptions of Pearson's correlation?

- Measured on continuous variables (interval or ratio scaled) - r can be validly estimated regardless of how the variables are distributed - but the significance tests (i.e. t-tests) require the variables to be normally distributed.

What is the F-value? (MSSmodel/MSSerror)

- Measures the ratio of two variances, which are mean sums of squares: "variation between sample means / variation within samples".

What is the blocking principle?

- Measuring potential confounders within an experiment and including them in the statistical model

Analyzing contrasts between groups using ANOVA is nothing but performing ______________ _____________ analysis using particularly defined classification variables.

- Multiple Regression Analysis (these are basically the same, except MR does not need categorical variables).

So far, we've only done simple regression (i.e. only considered one outcome variable and one predicting variable). What is it called when you extend your regression model to include two or more predictors?

- Multiple regression analysis

In order for ANOVA to be used, what is a pre-requisite for this to occur? Measure with what method?

- Needs homoscedasticity - fligner.test()

Can you compare standard deviations across different variables? Why or why not?

- Not really, because the unit of measurement remains in the standard deviation. - If you are comparing the sd of a variable against variables that have the same measurement units, this is acceptable. An sd in height of 6 cm cannot be said to show wider variation than a BMI sd of 5.2 kg/m².

What is the structure of an incidence matrix?

- The number of treatments needs to be constant - the number of blocks needs to be constant - Rk, the number of replications, needs to be equal across all treatments - the number of subjects in each block is allowed to vary.

Within pwr.norm.test(), what needs to be considered before running it?

- One of the values within the call needs to be NULL, because that is the one we are solving for.

Residuals can _____ be expected to be normally distributed in _____________ designs.

- Only - Balanced designs

Then what is the partial correlation in respect to the model given of Energy intake and Physical activity? Explain.

- Only section II represents the partial correlation between energy intake and physical activity. - "The partial correlation between two variables X and Y is the correlation between X and Y after elimination of the effect of any further variables (i.e. BMI)."

Does the data have to be normally distributed to do a multiple regression analysis? Or do the residuals have to be normally distributed? (Know!!)

- The only thing required to be normally distributed is the residuals!!

How to determine the amount of treatment combinations? Is there any blocking? Is a priori information required?

- P levels of one treatment - q levels of a second treatment - There will be p x q treatment combinations. (If there is a third treatment, it will be p x q x z for example.) - There is no blocking - There is no a priori information required.

Interpret this z.score shown in the picture, measuring probability of a woman being 165cm. Z = (xi[165] - mean[164])/sd[6.79] = 0.147 How do you find the probability of individuals falling under this z score? Then how do you find the probability of individuals within a given range of (165 < height < 170).

- PHI(0.147) = 0.558, or - pnorm(0.147) = 0.558 - Interpretation: 55.8% of females fall below 165cm. - PHI(zscore(170)) - PHI(zscore(165)) = 0.811 - 0.558 = 25.3%.
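The same numbers can be reproduced directly in R from the slide's mean (164) and sd (6.79):

```r
z165 <- (165 - 164) / 6.79       # 0.147
pnorm(z165)                      # ~0.558: proportion of women below 165cm
z170 <- (170 - 164) / 6.79       # ~0.884
pnorm(z170) - pnorm(z165)        # ~0.253: proportion between 165 and 170cm
```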

How does Kendall's correlation work?

- Pairwise comparison of data - Sort/rank X and Y by the values of variable X (for each observation there is an X and a Y score), with the lowest value of X ranked 1. - The first pair, with the lowest X value, is compared to all the other observational pairs; then the second pair is compared to all the pairs from the following observations (ex: #1 is compared to the other 9 obs. out of 10, #2 to the remaining 8, #3 to 7 observations, etc.).

Rank Pearson, Spearman, and Kendall in terms of highest to lowest correlation score on the same data.

- Pearson would be the highest (not necessarily the most accurate) - Spearman the second highest (maybe a bit more accurate) - Kendall the lowest, but perhaps more accurate than either of the first two.

How to overcome the retention of scale unit that is inherent with measuring by covariance?

- Perform a z-transformation of the variables X and Y and then calculate the covariance of these transformed variables.
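A quick R check, on made-up data, that the covariance of the z-transformed variables equals the (unit-free) Pearson correlation:

```r
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
cz <- cov(scale(x), scale(y))         # covariance of the z-scores (1x1 matrix)
all.equal(as.numeric(cz), cor(x, y))  # TRUE: identical to the correlation
```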

(Know!) Now that we know homoscedasticity is good, and that normal distribution is good, how do we do a multiple comparison of groups? (This is in the light that we know AT LEAST two groups out of the 4 are significantly different in terms of body fat.)

- Planned contrast - Post Hoc tests

What is the difference between Planned Contrasts and Post Hoc tests?

- Planned contrast: if there are specific hypotheses to be tested - Post Hoc test: if there are no particular hypothesis to be tested.

The treatment effect has to be distinguished from what?

- Random effects - Confounding factors

What is a Latin square called in which the first row and first column are arranged in alphabetical order? What can you do with this?

- Reduced Latin square - It is possible to randomize all rows and columns. ***This is done by randomizing rows first, then columns.

How can the partial correlations be identified? What is the method for eliminating the effect of a third (or any other) variable?

- Regression

Regarding the significance of individual regression coefficients, how can this be determined? What does this explain?

- Using a T-test - Assesses whether the individual regression coefficients (b0, b1) significantly contribute to an improvement of the model of interest, as compared to the simplest model.

In Biostatistics, we often want to know how closely measurements are related (i.e. glucose & plasma insulin; energy intake & body fat mass). In order to determine "relation", what is required?

- Variation - such as a sample of 6 men who are standardized to body height (say 180cm) but vary with respect to age, hair style, BMI.

(Know!) Within an ANOVA, what are the two 2 explanations that variance is broken down into?

- Variation due to classification variable(s) (ie variation between groups) - Variation due to random effects (i.e. Variation within groups)

How is the significance of an individual regression coefficient such as Anxiety (B1) determined? (Remember that this is through the coefficients pathway!)

- Via b/SEb = t-value; the empirical t-value then tests whether adding the coefficient significantly improves on the simplified model.

What are the 5 different kind of results from Kendall's tau value?

- Concordant: both values in the pair increase or decrease together with the compared pair (i.e. they all go up, or all go down). - Discordant: there is a discrepancy between the two, as in Xi decreases compared to Xj, but Yi increases compared to Yj. - Tie in X - Tie in Y - Tie in X and Y
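The pairwise logic can be sketched by brute force on tie-free made-up data, where tau = (C - D) / number of pairs, and checked against cor():

```r
x <- 1:5
y <- c(2, 1, 4, 3, 5)
C <- 0; D <- 0
for (i in 1:(length(x) - 1)) {
  for (j in (i + 1):length(x)) {
    s <- sign(x[j] - x[i]) * sign(y[j] - y[i])
    if (s > 0) C <- C + 1   # concordant pair
    if (s < 0) D <- D + 1   # discordant pair
  }
}
tau <- (C - D) / choose(length(x), 2)  # (8 - 2) / 10 = 0.6
cor(x, y, method = "kendall")          # 0.6: same value when there are no ties
```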

What is another way that you can prove that anxiety has a significant effect on exam scores?

- confint(Exam.lm) - which produces a table showing the values at 2.5% and 97.5% (quantile-like); we see that anxiety is negative at both, meaning the interval doesn't cross zero.

What is the R function to find the correlation of two covariables? How do you deal with NAs present, as in: (1) you want the result to be NA if even a single value is missing; (2) exclude all data from an observation if at least one value is missing; (3) pairwise.complete.obs uses all data available (excluding cases pairwise, i.e. only that pair is marked NA instead of the entire column)?

- cor() - cor(X, method, use = ...) use = "everything" is the default and yields NA if a single value is missing. use = "complete.obs" excludes all data from any observation with an NA present. use = "pairwise.complete.obs" uses all available data, excluding only the affected pair in the single instance where an NA comes up.

How to determine Tau value in R?

- cor(X, Y, method = "kendall") - the result gives the tau value.

How can partial correlations differ from normal correlation?

- cor(my.data) shows a BMI~EI correlation of 0.10. - pcor(my.data) shows a BMI~EI of 0.62. - pcor value is quite a bit higher, as we are now removing the effects of physical activity.

What are the cor.test() details in R?

- cor.test(X, method, alternative = ...)

Using corr.test(), give the suffixes for a variable that was assigned this test. Describe what it means.

- corr.exam$r (correlation coefficients) - corr.exam$n (number of cases) - corr.exam$t (t-values) - corr.exam$p (p-values)

How do you compute Pearson correlation coefficient?

- cov(X,Y) / (sd(X) * sd(Y))
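Checking this formula against R's built-in cor() with made-up data:

```r
set.seed(42)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)
r <- cov(x, y) / (sd(x) * sd(y))  # Pearson r from covariance and the two sds
all.equal(r, cor(x, y))           # TRUE
```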

If on a test you want to use randomization, but you need to get the same answer the professor has, use the CRD with a fixed seed:

- design.crd(Music, r = Subjects, randomization = TRUE, seed = 123)$book

Where does the z-score obtain its data of quantiles to convert into z-score? (Remember, all this is assuming it is normally distributed).

- The cumulative distribution function - remember pnorm(x) gives the area under the curve!!! This is important! dnorm() is used for the curve line (density) only, and qnorm() is the inverse of pnorm(), turning a probability back into a quantile.
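The division of labour between the three functions, in a short sketch:

```r
pnorm(0)           # 0.5    : cumulative area under the curve left of z = 0
dnorm(0)           # ~0.399 : height of the density curve at z = 0
qnorm(0.975)       # ~1.96  : quantile, the inverse of pnorm
qnorm(pnorm(1.5))  # 1.5    : the two functions undo each other
```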

What is the goal of replication?

- estimate random error - control random effects

What if it is just a single value that you need to find a probability for? (I.e., only 165cm)

- First convert the heights into z-scores - then use pnorm(z) to find the cumulative AUC distribution. Find the PHI value you are looking for, then subtract the portion below it as needed.

The lower the alpha value the _______ the power?

- lower (a stricter alpha makes it harder to reject H0, which increases β and so reduces power)

When is "r" used?

- in biostatistics, we usually work with samples rather than whole populations, therefore the true standard deviations (sdX and sdY) cannot be known (though this was evidenced as the equation for Pearson's correlation earlier). - so instead we have to estimate the Pearson correlation coefficient, and this estimate is called the "r" value.

What does B1 actually mean? (Know!)

- It indicates the average change of the outcome variable Y if the predicting variable X changes by one unit. (In other words: if X changes by one unit, repeatedly, the average effect on Yi is B1.) !!Which actually means it is the slope of Y over X, determined such that the sum of deviations = 0 and the sum of squared deviations is minimal.

Interpret the results of summary.lm(bodyfat.aov) in the picture

- Intercept: overall grand mean body fat is 20.195% of body mass. - Mean fat mass difference between Q1 and the other 3 quartiles is 4 x 2.2583 = 9.03%. Significant. - Mean fat difference between Q2 and c(Q3, Q4) is 3 x 1.2967 = 3.89%. - Mean fat mass difference between Q3 and Q4 is 2 x 1.0100 = 2%, yet is insignificant.

If you have a graph in which the outcome is on the X axis and the residuals are on the Y axis, and the residual spread is low when the outcome value is low but high when the outcome value is high: would this have homoscedasticity? Why or why not?

- It would not have homoscedasticity, because taking all the residuals together, their spread varies quite a bit across the outcome. - But if you take a sample just from the high-outcome group, you will likely have homoscedasticity within it.

Create a completely randomized factorial design using R. 1. Fat content of the diet: 15, 40, 65 2. PUFA content of diet: low, medium, high 3. Fiber content of diet: low, high

- library(agricolae) - Fat <- c(15, 40, 65); PUFA <- c("low", "medium", "high"); Fiber <- c("low", "high") - Design <- design.ab(trt = c(length(Fat), length(PUFA), length(Fiber)), r = 1, design = "crd", randomization = FALSE)$book # r = replicates

Use R and agricolae to create a Randomized complete block design, in which just the design like the visualization before is realized. Groups = Music (rock, country, hip-hop, classic) *We assume that music preference is affected by gender. Therefore gender is to be integrated into the experimental design as a block in which b=2

- library(agricolae) - Music <- c("rock", "country", "hip-hop", "classic") #notice there is no repetition, because replication comes from the blocking into females and males. - Block <- c("males", "females") - Design <- design.rcbd(Music, 2, randomization = TRUE)$sketch - rownames(Design) <- Block **Notice that the second argument of design.rcbd isn't "Block" but the number of blocks; the row names are then applied afterwards using the information from the "Block" variable.

What are the disadvantages of experimental research (as opposed to observational)?

- Manipulation to a degree beyond real life - a simplification beyond what is achievable in real life - possibly biased by the scientist - ethical considerations

What is the mean of the normal distribution? What is the sd of the normal distribution?

- mean = 0 - sd = 1

The question: How do we estimate the regression line? How is this determined?

- The method of least squares - the line is determined such that the sum of deviations is zero and, if we square the deviations, their sum is minimal.

How do you test for significance in a Kendall correlation?

- n <- length(X) - Var <- (4*n + 10) / (9*n*(n-1)) #z-score: Z.tau <- tau/sqrt(Var) Normal.dis <- pnorm(Z.tau) #This gives the probability that a value falls to the left of the Z.tau score, i.e. the area under the curve to the left of Z.tau. #Significance, directional (tau > 0): p1 <- 1 - Normal.dis (the probability of this result if H0 were true) Non-directional: p2 <- 2*p1

What is spearman's correlation?

- No assumptions on distribution - transforms the data slightly by ranking: the lowest value is ranked 1, the second lowest 2, etc. - This keeps the correlation from being fooled by outliers. - Also, the score tends to be a bit lower than the Pearson correlation. - The "r" value is calculated the same way as Pearson's, just using the ranked data.

Are the blocks randomized within the GRBD?

- No, blocks are set in an a priori fashion; prior knowledge is needed to say we want to split by gender, etc.

In a one-way ANOVA, how many independent variables are there?

- only one independent variable.

What constitutes defining the components of the experimental design of observational studies?

- Operationalization of the outcome (like assessment of health status) - measurement technique (LC-MS, etc.)

What is the symbol difference between phi and PHI?

- phi = the lowercase letter (slanted circle with a line through it) - PHI = the uppercase letter (upright circle with a line through it)

Run a test using cor.test() and interpret the results. What are the $ suffixes for a variable that has just been assigned a cor.test() result?

- variable$statistic (gives the t-value) - variable$p.value (gives the p-value) - variable$estimate (gives the correlation)

How can you make a categorical value such as (male, female) listed as qualitative data? Why is it not that you add a predictor for male and for female? (Know!)

- You can convert categorical data like male/female to numeric using as.numeric(), then recode to -1 or whatever makes it easier to understand. - You do not add one predictor for male and another for female because they are redundant; you cannot have redundant information!

What is a Type II error denoted as?

- β

-------2 way ANOVA

------------ Part 2b of Lecture 1 --------------------------

-------Partial Correlation (4b)---------------

----5a One way ANOVA--

----5b One way ANOVA---

---Multiple Regression------

In the last graph, what is the area under the curve at -2 z-score? -1? 0? 1? (This assumes normal distribution) *know!

-2: represents 2.3% AUC
-1: 15.9%
0: 50%
+1: 84.1%
+2: 97.7%

What is the multiple regression mathematical equation?

Yi = B0 + B1*X1i + B2*X2i + ... + Error(i) - Remember, any B beyond B0 (i.e. B1, B2, etc.) is used to "weight", or scale the influence of, the variable value of X1 or X2 respectively (X1 and X2 are unique variables, not overlapping).

......randomized block design.....

Explain the calculation of Power.

1 - (the probability of incorrectly accepting the null hypothesis), i.e. 1 - β

What is the calculation for Power?

1 - β

(Know!) What does the Q-Q plot accomplish?

1. Calculates the quantiles of an empirical variable (i.e. the residuals). 2. Calculates the respective quantiles of a theoretical normal distribution. 3. Then plots both sets of quantiles against each other. 4. A straight line will be seen if the empirical variable follows the expected theoretical distribution.
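A minimal sketch of those steps in R, using a made-up linear model so there are residuals to plot:

```r
set.seed(7)
d <- data.frame(x = 1:30)
d$y <- 2 * d$x + rnorm(30)           # made-up linear data
res <- residuals(lm(y ~ x, data = d))
qqnorm(res)  # plots empirical quantiles against theoretical normal quantiles
qqline(res)  # reference line: points hug it if the residuals are normal
```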

What are the most common measures of effect sizes?

1. Cohen's d (the difference between two means divided by a standard deviation for the data). 2. Pearson's correlation coefficient "r". 3. Odds ratio (usually obtained by means of logistic regression).
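Cohen's d can be computed by hand; the helper below is a hypothetical sketch using the pooled standard deviation on made-up values:

```r
cohens.d <- function(x1, x2) {
  n1 <- length(x1); n2 <- length(x2)
  # pooled standard deviation of the two groups
  sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  (mean(x1) - mean(x2)) / sp
}
cohens.d(c(10, 12, 14), c(7, 9, 11))  # 1.5: means differ by 1.5 pooled sds
```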

(Know!) What are the rules when choosing contrasts?

1. Contrasts must not interfere with each other; they must test unique hypotheses (i.e. be independent). 2. "Only two chunks": each contrast should only compare two chunks of variation, such as E1,E2,E3 vs. E4, or E1,E2,E3,E4 vs. C1. 3. I-1: the number of contrasts required to complete an analysis is one less than the number of groups.

How do you find out if you have a linear, quadratic, or cubic trend in R?

1. contrasts(bodyfat$TST4) <- contr.poly(4) # 4 = number of levels of the factor 2. bodyfat.aov2 <- aov(Fat ~ TST4, data = bodyfat) 3. summary.lm(bodyfat.aov2) *this indicates which trend (linear, quadratic, cubic) is significant.

What are two strategies for separating treatment effects from confounding effects (error control design)

1. Controlled conditions (keeping everything but the treatment constant) 2. Measuring potential confounders and including them in the statistical model (blocking principle)

What are the assumptions of Post Hoc tests?

1. They cope pretty well with some deviation from normal distribution. 2. They can't cope with heteroscedasticity, particularly in an unbalanced design.

In R, how can I quickly find the probability, or the area under the curve, that is present to the left of the Z.tau score? If the H0 was that tau=0, then how would find the significance value of this result? And, if in the case H1 was t does not = 0, and H0 = 0 (aka non-directional), then how do you find significance? (Know!)

1. cor.test(X, Y, method = "kendall", alternative = "g") gives tau = 0.182. #or alternative = "t" 2. pnorm() of the z-score of tau gives PHI = 0.645 (the area under the curve to its left). 3. 1 - 0.645 = 0.355, which means p = 0.355.

What are two suitable statistical tools to determine the correlation coefficient of two covariables?

1. Correlation analysis 2. Regression analysis

How do you complete a Pearson correlation in simple terms?

1. Compute the deviations (Xi - Xbar) and (Yi - Ybar). 2. Find the covariance: the average product of these deviations. 3. Divide by sd(X)*sd(Y). (Equivalently: z-transform both variables and take their covariance.)

(Know!) How do we prevent violation of model assumptions? What tests can we use in the case of heteroscedasticity or non-normally distributed data?

1. Data transformation (log/ranks). Tests: - Heteroscedasticity: Welch's ANOVA, which is oneway.test() - Non-normal distribution: kruskal.test() - Bootstrapping
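Both fallback tests are base-R one-liners; the grouped data below is made up:

```r
set.seed(3)
g <- factor(rep(c("A", "B", "C"), each = 10))
y <- rnorm(30, mean = rep(c(0, 1, 2), each = 10))
welch <- oneway.test(y ~ g)   # Welch's ANOVA: does not assume equal variances
kw <- kruskal.test(y ~ g)     # rank-based: does not assume normality
```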

Explain the ANOVA linear model Y(i,j) = mu + alpha(i) + error(i,j) (Know!)

Mu = overall grand mean. Alpha(i): in our example there are 4 groups; each group i has an alpha value, which means its effect. To get this, take the difference of the group 1 mean from the grand mean, then the difference of the group 2 mean from the grand mean, etc. Error(i,j): error term of the i-th level of the classification variable and the j-th measurement (j = 1,...,n).
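The decomposition can be sketched numerically on made-up balanced data (three groups standing in for the quartiles):

```r
fat <- c(18, 17, 19, 20, 21, 22, 25, 26, 27)   # made-up values
grp <- factor(rep(c("Q1", "Q2", "Q3"), each = 3))
mu <- mean(fat)                       # grand mean
alpha <- tapply(fat, grp, mean) - mu  # group effects: group mean - grand mean
sum(alpha)                            # ~0: effects cancel in a balanced design
```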

How can you create a partial correlation coefficient table in which the upper triangle (right) is p-values, and the lower triangle (left) is r.

My.corr <- pcor(my.data) R <- My.corr$estimate Upper <- upper.tri(R, diag = TRUE) * R **upper.tri() returns TRUE/FALSE values; with diag = TRUE the diagonal is TRUE as well. TRUE becomes 1 and the lower triangle (FALSE) becomes 0, so multiplying by R fills only the upper triangle. P <- My.corr$p.value Lower <- lower.tri(P) * P Mat.r.p <- Upper + Lower

What is the null hypothesis for a correlation investigation in terms of r?

Null hypothesis: r = 0

What is the general regression equation?

Outcome variable Y = f(predicting variable X) + error E. # f may be more than one predicting variable. # so outcome variable is a function of predicting variable and an error term.

Regarding a Fligner test, if there was a p-value of .92, interpret this in terms of probability.

A p-value of 0.92 on the Fligner test means there would be a 92% chance of error in assuming H1 (which is that the data are heteroscedastic), so we retain the assumption of homoscedasticity.

What is the p-value range?

P-value range: [0,1]

When is a planned test utilized vs a post-hoc test?

Planned test: specific hypothesis to be tested Post hoc test: No particular hypotheses to be tested

(Know!) How to determine and translate cook's distance?

plot(cooks.distance(bodyfat.aov), ylim = c(0, 1)) abline(h = 1, lty = 2) - We see that none of the individuals exert undue influence, as none of the Cook's distances are greater than 1.

Give R code on how to create box plot comparison of the 4 quartiles in relation to body fat.

Quantiles <- quantile(bodyfat$tricep) bodyfat$TST4 <- cut(bodyfat$tricep, breaks = Quantiles, include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4")) #include.lowest keeps the lowest data point that is named by the quantiles boxplot(Fat ~ TST4, data = bodyfat, col = rainbow(4))

Regression analysis

What is the formula for R^2?

SSmodel / SStotal
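A quick check, on made-up data, that SSmodel/SStotal matches R's reported R²:

```r
set.seed(5)
d <- data.frame(x = 1:20)
d$y <- 3 * d$x + rnorm(20, sd = 4)
fit <- lm(y ~ x, data = d)
ss.total <- sum((d$y - mean(d$y))^2)    # total variation in the outcome
ss.error <- sum(residuals(fit)^2)       # variation left unexplained
r2 <- (ss.total - ss.error) / ss.total  # SSmodel / SStotal
all.equal(r2, summary(fit)$r.squared)   # TRUE
```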

Regarding multiple groups sum of squares, what is the general mathematical model for this (to create SStotal, SSmodel, SSerror)? Alternative method to get SS model?

SStotal = sum of squared deviations of each observation from the grand mean; SSerror = sum of squared deviations of each observation from its group mean. Alternative: SSmodel = SStotal - SSerror.

(Know!) In H0.. View this picture. Explain why alpha 1 = alpha 2 = alpha 3 ect would produce a low SSmodel value and a high SSerror value

SSmodel low: because the mean of each subgroup is more or less the same as the grand mean. SSerror high: if one is low, the other needs to be high. A high SSerror under H0 is likely a product of the regression line being more or less flat, while the individuals comprising the subgroups vary around it.

What are the similarities and differences between the density functions of normal and T distribution?

Similarities: - bell shaped - symmetric - Maximum at x=0 Dissimilarities: - T Distribution is flatter and wider than normal distribution - T distribution depends on the degrees of freedom

Why do we take the square root of the variance (to get the standard deviation)?

Squared values are hard to interpret. - Taking the square root brings the measure back to the original unit range.

Standard Normal Distribution

What are the strengths and weaknesses of correlation research?

Strengths: - Describes the strength of a relationship - Quick and easy Weaknesses: - Correlation does not equal causation - Correlations can be misused.

T-Tests

Another version of creating randomized sample group...

tea.cups <- rep(c("T", "M"), 4) Design <- rbind(1:8, sample(tea.cups))

Regarding SStotal, SSmodel, and SSerror, what are each measuring?

Variation of: - Response variable - Classification variable - Error terms

What is a completely randomized factorial design? When is a completely randomized factorial design appropriate for an experiment?

When there are multiple treatments, and multiple levels within treatments. 1. There are at least two different treatments 2. Each treatment has at least two levels 3. All levels of each treatment are investigated in combination with all levels of every other treatment.

How do you use Wilcoxon test in R?

wilcox.test(Fat ~ TST2, alternative = "less") - remember we are directing the hypothesis (of "less") at the predictor, which is TST2. - The result here, showing a significant p-value and the W statistic, means that "low TST2 is directly associated with low BF".

(KNOW!!!) Describe the picture. What is the Wilcoxon test used for? When is Welch's test used for?

Wilcoxon: when data is not normally distributed Welch's: If there is normal distribution, but not equal variance.

Explain a linear model of a completely randomized design trial. What is the formula, and explain the variables.

Yij = u + alpha_i + epsilon_ij
u = grand mean of the outcome variable
alpha_i = effect of the i-th level of the treatment variable (i = 1, ..., I)
epsilon_ij = error term of the j-th replicate (i = 1, ..., I; j = 1, ..., M)
In plain terms: Yij = the grand mean of the outcome variable (across all treatments) + the effect of the treatment group the observation belongs to + the error term of the specific individual j.

(KNOW!!!) What are the assumptions in a regression analysis?

1. Variable types - predictors are quantitative (continuous, e.g. EI) or categorical (e.g. male or female); outcome is quantitative and continuous 2. Non-zero variance 3. No perfect multicollinearity - predictors must not have perfect correlation 4. No omitted variables 5. Homoscedasticity (residuals should have constant variance)!! 6. Normal distribution (residuals should be normally distributed!) 7. No redundancy in predictors!

Notice the structure of the results, how do we make this nicer in which we separate the subjects into individual columns marked by group? (This also works for data results that come in the same format).

1. design <- design.crd(Music, r = Subjects, randomization = TRUE, seed = 123)$book 2. design <- unstack(design, plots ~ Music)

What are the advantages / disadvantages of Between-groups design? Give an example.

Advantages: - Simplicity - No order effect (from practicing or fatigue, ect.) - Required for treatments that are irreversible, or so highly invasive (like removing an organ). Disadvantages: - More subjects required - Reduced sensitivity (as group inherently need more subjects, thus more random error noise ect. occurs. Even genetically identical mice are not exactly alike). Example: - Post-test only/ control group (typical)

What are the advantages / disadvantages of Within-subjects design? Give an example.

Advantages: - Budget (each subject is used at least twice) - Higher sensitivity Disadvantages: - Carry-over effects (order effects, which are induced by increasing experience, boredom, fatigue, etc.) - The treatments must be reversible (return to baseline - you can't measure the speed of sheep that jump off a cliff) (requires reversibility) - Non-invasive measurements Examples: - Pretest / post-test design - Longitudinal study (assesses changes of outcome over time) - Crossover study

What is the "alternative" in the R cor.test code?

alternative = "two.sided" - non-directional; alternative = "greater" - directional; alternative = "less" - directional

In short, what is the equation for estimating b0?

b0 = ybar - b1*xbar Interpretation: the mean of the outcome variable minus the slope times the mean of the predictor variable X.

In R, how do you determine B1? B0? R^2? (Know!) (without using lm)

B1: b1 <- cov(Exam, Anxiety) / var(Anxiety)
B0: b0 <- mean(Exam) - b1 * mean(Anxiety)
R^2:
n <- length(Exam)
e <- Exam.Anx.lm$residuals
SS.total <- var(Exam) * (n - 1)
SS.error <- var(e) * (n - 1)
SS.model <- SS.total - SS.error
R.2 <- SS.model / SS.total
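
A sketch verifying these hand computations against lm(), using hypothetical data in place of the course's Exam/Anxiety variables:

```r
# hypothetical data standing in for the course data set
set.seed(1)
Anxiety <- rnorm(50, mean = 60, sd = 10)
Exam <- 100 - 0.5 * Anxiety + rnorm(50, sd = 5)
Exam.Anx.lm <- lm(Exam ~ Anxiety)

# slope and intercept by hand
b1 <- cov(Exam, Anxiety) / var(Anxiety)
b0 <- mean(Exam) - b1 * mean(Anxiety)

# R^2 by hand via the sums of squares
n <- length(Exam)
e <- Exam.Anx.lm$residuals
SS.total <- var(Exam) * (n - 1)
SS.error <- var(e) * (n - 1)   # equals sum(e^2): residuals have mean 0
SS.model <- SS.total - SS.error
R.2 <- SS.model / SS.total

# both match lm()'s output
all.equal(unname(coef(Exam.Anx.lm)), c(b0, b1))  # TRUE
all.equal(summary(Exam.Anx.lm)$r.squared, R.2)   # TRUE
```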

How does a multiple regression analysis function look in R?

BMI.lm <- lm(BMI~ Energy + PA) - Energy + PA are the predicting variables. *remember, use summary(BMI.lm) as this provides more critical information than BMI.lm alone

What is the difference between a "Between-groups Design" and a "Within-subjects Design"?

Between-groups Design: Two or more clearly separated groups of subjects (Treatment 1 vs. Treatment 2 vs. Control) Within-subjects Design: Each individual subject is exposed to all treatment types (subjects are not split up)

(Know!!) Within the bodyfat data, columns TQ1, TQ2, TQ3, and TQ4 have been made to store a binary 1 or 0 indicating the corresponding TST4 quartile. This is the beginning of creating a method for weighting. If we want to compare ANOVA vs. a multiple regression analysis, create the R formula for the regression. Explain why it is this way.

Bodyfat.lm <- lm(Fat ~ Q2 + Q3 + Q4) - Q1 was not included because there are only 4 groups: if an observation is not Q2, Q3, or Q4, it must be Q1. Therefore we do not need to include Q1; it is represented by the intercept.

Give an example of coding contrast for weights, using the 4 groups of Q1 Q2 Q3 and Q4.

Contrast 1: -3, 1, 1, 1 Contrast 2: 0, -2, 1, 1 Contrast 3: 0, 0, -1, 1
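
A quick R check that these weights behave like proper planned contrasts - each contrast's weights sum to zero, and each pair is orthogonal:

```r
# the three contrast weight vectors for the four groups above
c1 <- c(-3, 1, 1, 1)
c2 <- c(0, -2, 1, 1)
c3 <- c(0, 0, -1, 1)

c(sum(c1), sum(c2), sum(c3))                 # each sums to 0
c(sum(c1 * c2), sum(c1 * c3), sum(c2 * c3))  # all pairwise products sum to 0 (orthogonal)
```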

(Know!) In a 'planned contrast' example of 4 groups, such as the quartiles by TST4, explain how many contrasts there are, and what they consist of.

Contrast 1: Determine if the control group is different from all experimental groups (C1 vs. E1, E2, E3). Contrast 2: The grouping of E1 and E2 vs. E3; we determine whether E1 and E2 together are different from E3. (If there were more than 4 groups, say 6 experimental groups, contrast 2 would be E1,E2,E3,E4,E5 vs. E6.) Contrast 3: The remaining groups, E1 and E2, are compared. (In a 6-experimental-group situation, contrast 3 would be E1,E2,E3,E4 vs. E5. This continues with further contrasts until the groups are completely broken down.)

What is the mathematical formula for covariance? *understand this***

cov(X,Y) = sum((xi - xbar)(yi - ybar)) / n - this is very similar to variance: the variance is just the covariance of a variable with itself, cov(X,X) = sum((xi - xbar)^2) / n.

What is the denotation of effect size? What is the formula for effect size? What is effect size?

Denotation: - "d" Formula: - E(Δ) / σ (mean effect change divided by standard deviation) Effect size expresses the mean effect while taking the standard deviation into account, so it constitutes the strength and reliability of the observed effect. With a high standard deviation the effect size will be lower, because the variability is so high. With a small standard deviation and a high mean effect change, you get a high effect size value: a large mean effect with low variability, indicating the 'realness' of the effect.

How can the empirical t-value be interpreted? For directional hypothesis and non-directional hypothesis. (Know!!)

Directional hypothesis (i.e. H1: p > 0 or p < 0): - Pearson's r is significantly greater than zero if t-emp > t(1-alpha; n-2) - Pearson's r is significantly less than zero if t-emp < t(alpha; n-2) *t(1-alpha; n-2) is not a variant of the t-emp formula; it is the critical value (quantile) of the t-distribution with n-2 degrees of freedom at level 1-alpha. Non-directional hypothesis (i.e. H1: p does not equal 0): - p is significantly different from zero if abs(t-emp) > t(1-alpha/2; n-2)

In R, how does one create a regression analysis? Regarding the "~", which is the outcome variable and which is the predicting variable?

Exam.anx.lm <- lm(Exam ~ Anxiety) * lm(outcome variable ~ predictor variable) ("Exam is predicted by (~) Anxiety)

------- 2C design of Experiments -------

Examples of Classical Experimental designs; Completely randomized designs

(Know!) Regarding an ANOVA test, how does the F value determine whether H1 or H0 is supported? How is the F value obtained? (Note: H0 can be rejected based on the F-distribution - if F-emp is greater than the critical value of the F-distribution, H0 can be rejected.)

F = MSSmodel/MSSerror H1 supported if: F > 1 H0 supported if: F < 1

How is the F-test, which determines significance of R^2 (which determines the accuracy of the regression model), completed? (Know!) #write down

F-emp = (SSmodel/(k-1)) / (SSerror/(n-k)) = MSSmodel/MSSerror ***Which means that in order to determine F-emp we need the Mean Sum of Squares for the model divided by the Mean Sum of Squares for the error. k = number of regression coefficients that are estimated (here k = 2) n = total number of observations

Calculate F (and sig) and T stat using R, without using lm. (Know!)

F-emp:
MSS.error <- SS.error / (n - 2)
MSS.model <- SS.model / 1
F.emp <- MSS.model / MSS.error
# significance
p.F <- 1 - pf(F.emp, 1, n - 2)
T-value:
SSx <- var(Anxiety) * (n - 1)
se.b1 <- sqrt(MSS.error / SSx)
t.b1 <- b1 / se.b1
p.b1 <- 2 * (1 - pt(abs(t.b1), n - 2))
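
A sketch with hypothetical data, checking the hand-computed F and t against summary(lm); note that t^2 equals F in a simple regression:

```r
# hypothetical data standing in for Exam and Anxiety
set.seed(2)
Anxiety <- rnorm(40, mean = 50, sd = 8)
Exam <- 90 - 0.4 * Anxiety + rnorm(40, sd = 6)
fit <- lm(Exam ~ Anxiety)

n <- length(Exam)
e <- fit$residuals
SS.total <- var(Exam) * (n - 1)
SS.error <- var(e) * (n - 1)
SS.model <- SS.total - SS.error

MSS.error <- SS.error / (n - 2)
MSS.model <- SS.model / 1
F.emp <- MSS.model / MSS.error
p.F <- 1 - pf(F.emp, 1, n - 2)

SSx <- var(Anxiety) * (n - 1)
se.b1 <- sqrt(MSS.error / SSx)
b1 <- cov(Exam, Anxiety) / var(Anxiety)
t.b1 <- b1 / se.b1
p.b1 <- 2 * (1 - pt(abs(t.b1), n - 2))

all.equal(F.emp, unname(summary(fit)$fstatistic["value"]))  # TRUE
all.equal(t.b1^2, F.emp)                                    # TRUE
```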

What is the difference between a fixed variable and random variables?

Fixed: - show no fluctuation - provide a constant single number - ex: mass of planet Earth. Random: - show a certain fluctuation - provide a range of varying numbers - ex: length of the small intestine of mice.

Display the relationship of ~ within the fligner.test() function.

fligner.test(outcome variable ~ classification variable)

Using R, create a GRBD, in which using Music (rock, country, hip hop, classic), blocked with female and male, with 10 subjects per female and male groups.

GRBD
Music <- c("Rock", "Country", "Hip-hop", "Classic")
Block <- c("male", "female")
Treatment <- rep(Music, each = 10)
design <- design.rcbd(Treatment, 2, randomization = TRUE, seed = 3)$sketch
rownames(design) <- Block

What is the definition of generalized randomized block design?

GRBD = b blocks of r x t r = experimental units t = treatments b = blocks - there are t treatments, subdivided into b blocks, further subdivided into r experimental units per treatment-block combination; each combination has exactly the same number r of units.

What is the hypotheses of ANOVA? (H0, H1) (Know!)

H0: alpha 1 = alpha 2 etc. (Remember alpha corresponds with group effect; so the null hypothesis indicates that there is no difference between groups.) H1: alpha(r) does not equal alpha(t) (they are different, simply)

(Know!) Regarding an ANOVA test, when is H1 supported, and when is H0 supported?

H1 supported: MSSmodel > MSSerror H0 supported: MSSmodel < MSSerror

How can you combine T-test with correlation tables or partial correlation tables? What does it mean?

I don't know.

What does dnorm mean?

In a more precise sense, the PDF is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. This probability is given by the integral of this variable's PDF over that range—that is, it is given by the area under the density function but above the horizontal axis and between the lowest and greatest values of the range. The probability density function is nonnegative everywhere, and its integral over the entire space is equal to one.
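
In R terms, a minimal illustration of dnorm (density height), pnorm (cumulative probability), and qnorm (quantile, the inverse of pnorm):

```r
dnorm(0)               # height of the standard normal density at x = 0 (about 0.399)
pnorm(1) - pnorm(-1)   # probability of falling within one sd of the mean (about 0.683)
qnorm(pnorm(1.5))      # qnorm inverts pnorm, so this returns 1.5
```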

Interpret the R^2, in which it is 0.1865 in regards Exam ~ anxiety.

"Anxiety has an 18% influence on test scores, out of 100% of possible influence." - This indicates that there are other variables that are affecting the exam score.

How to find significance of Kendall's tau using R, with respect to directional vs. non-directional?

#directional: - cor.test(X, Y, method = "kendall", alternative = "g") #non-directional - cor.test(X, Y, method = "kendall", alternative = "t")

When viewing the data of a regression analysis in R, as in just calling the variable (i.e. Exam.anx.lm), we get data that says: Coefficients: (Intercept) = 106.07 (Anxiety) = -0.6658 Interpret these results. (Know!!)

(Intercept) = 106.07 - "if anxiety is zero, the exam score is 106.07" (Anxiety) = -0.6658 - "if we increase anxiety by 1 unit, the exam score will decrease by 0.6658"

What does the dnorm function do?

****Restart on slide 16

Explain b1 using mathematics. This is easy. Also explain B0.

*b1 = cov(X,Y)/var(X), also known as cov(X,Y)/sx^2 - the covariance of X and Y divided by the variance of X. B0 = ybar - b1*xbar (the mean of Y minus the slope times the mean of X).

See the picture, ensure how to use cut() function!!

*ensure to recapitulate this function

With regression analysis, the regression coefficients (what are those), the R^2 (what is this), are both estimated by what, sample data or population data? What does this imply?

- "B0" and "B1" - "Coefficient of determination" - Implies: true values in the total population are unknown. - Thus we must wonder, are the obtained estimates just random results, or are the obtained estimates significant (as in can be applied to the population)

What are the three different kind of correlation methods?

- "Pearson" - "spearman" - "Kendall"

What is the R^2? (Know!!)

- "The coefficient of determination" - Can be considered as a measure of error reduction, comparing the regression model to the simplest form. - How close the actual data are fitted to the regression line. - It is the percentage of the response variable variation that is explained by a linear model. - R^2 = explained variation (by model) / total variation of actual data (know this last part!)

If we want to see whether two variables are associated, we have to look at whether they _____________.

- "co-vary"

What is the estimate value of the Pearson correlation called?

- "r"

What is the suffix (last little modification) to use when creating a CRD with agricolae?

- $book

If I call the variable using summary(Exam.anx.lm), I get: Residuals: (in quantile form) Coefficients: Estimate, Std. Error, t value, Pr (of both intercept and anxiety) Residual standard error: Multiple R-squared and adjusted R-squared: And F-statistics: What do these mean?

- *not sure what residuals are yet. - Estimate: original value from lm (see last question and interpretation) - Std. Error: used in conjunction with the slope (anxiety) to determine the t-value *(t-emp = b/SEb) (SEb1 = sqrt((SSerror/(n-k))/SSx)) *and from the t-value, we can determine the significance of the coefficient (p-value)! R^2 = the coefficient of determination, as in: is the model a good fit to the actual data. F-stat = (from MSSmodel/MSSerror) has a p-value of <0.05, which indicates the model is a good fit.

What is a correlation coefficient? What do they range from? (Know!!)

- -1 to +1 - Correlation coefficient describes the strength of the correlation *So when we say something is correlated, we mean that (2, or more) covariables have a negative or positive correlation, which results in a correlation coefficient score.

How do you find variance for Kendall's tau, which is used as an assumption for normal distribution?

- (4n + 10) / (9n(n - 1)) (also, the mean is assumed to be zero)
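
As a small sketch (the function name is mine, not from the slides):

```r
# variance of Kendall's tau under H0, with the mean assumed to be zero
tau.var <- function(n) (4 * n + 10) / (9 * n * (n - 1))

tau.var(10)    # about 0.0617
tau.var(100)   # about 0.0046 - the variance shrinks as n grows
```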

For example, how can we use all our data knowledge of womens' heights (assuming normal distribution), to create a probability of a woman's height lying between 165cm and 170cm? (Slide 25 - lecture 3)

- Convert the interval of heights into z scores. - then obtain the requested probabilities from the density function of the standard normal distribution. - this would be pnorm(z-score of interest) - and if the probability needs to be indicative of a range like 165-170cm, (look at the list of z scores in the earlier questions), then you use PHI (pnorm). You look at the list of PHI, find the maximal value of your selected range, and then subtract all PHI values before your minimal value of selected range. This will give you a percentage probability that it lies within a specific range.
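
A sketch of this in R, with hypothetical parameters (mean 165 cm, sd 7 cm) standing in for the slide's actual values:

```r
# hypothetical population parameters for women's height
mu <- 165
sigma <- 7

# convert the interval boundaries into z-scores
z.lo <- (165 - mu) / sigma
z.hi <- (170 - mu) / sigma

# probability of a height between 165 and 170 cm: PHI(upper) - PHI(lower)
p <- pnorm(z.hi) - pnorm(z.lo)

# equivalent shortcut without z-transforming by hand
p2 <- pnorm(170, mean = mu, sd = sigma) - pnorm(165, mean = mu, sd = sigma)
```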

Currently, we have doubts about homoscedasticity and normal distribution. What does cook's distance measure?

- Cook's distance measures the effect of deleting a given observation.

What is the difference between a correlation analysis and a regression analysis? (Know!!)

- Correlation: measures the strength of the relationship between two variables. - Regression: 1. analyzes the type of relationship between two variables; 2. distinguishes dependent vs. independent (outcome and predictors); 3. investigates the effect of one or more predictors on an outcome variable; 4. predicts the outcome based on data from predictors; 5. adds a line to a scatter plot.

What is correlation?

- Correlation simply describes the relationship between two variables in statistical terms, but is not a method itself.

In the picture attached (slide 11 of 58), how is each dot determined within the graph, i.e. how is one dot created from two variables?

- Covariance: (Xi - Xbar)(Yi-Ybar) ***I think... Because obviously it's placed as an X Y intercept dependent on the two variable values, but perhaps this is what gives the individual correlation score?

What constitutes the formulation of a subject matter model?

- Defining the outcome: i.e. the growth of a tree - Listing factors that might affect the outcome: i.e. treatment factors (main objective of the experiment); or factors you cannot control (heat, light, gender, etc.)

What is the difference between dependent variables and independent variables?

- Dependent variables: the experimental outcome, the effects that result from the independent variables - Independent variables: treatment variables, factors, predictors.

Being aware that the mean is a simplification of the whole data, we would like to assess the accuracy of this model. Thus, for every observation, we calculate the ______________ of the model from the truth

- Deviation

What does Skewness mean?

- Deviation from symmetry. "A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean." - as in: one really high value may affect the mean, producing a positive or negative skewness.

What is deviation?

- Deviation is an individual value's distance from the mean (xbar).

If the Pearson correlation is >0, the variables are ___________ proportional. If the Pearson correlation coefficient is <0, the variables are _______________ proportionate.

- Directly proportional - Inversely proportionate

What would a scatter plot look like if cov(X,Y) = 0. What does this mean?

- Dots would be scattered randomly across the whole graph. - When investigated by placing additional axes at the X mean and Y mean, you will see that the points reside in all 4 quadrants almost equally.

(Know!) What value of Cook's distance "Dr" suggests that the r-th observation has undue influence on the results?

- Dr > 1 - The r-th observation, if it has a cook's distance (D) greater than 1, then it is likely contributing undue influence to the results

Typically, you want your treatment levels to range from null to "levels close to an overdose". How might you choose which dosages are to be given between?

- Equidistant - Equidistant with omissions - Geometric grading (as in exponential)

How do you get Yhat value from lm() analysis? What about residuals?

- Exam.anx.lm$fitted.values - Exam.anx.lm$residuals

How to test for significance for R^2 value? For B0 and B1?

- F-test - t.test

(Know!) Why are multiple t-tests, such as group 1 vs. group 2, group 1 vs. group 3, group 1 vs. group 4, group 2 vs. group 3, group 2 vs. group 4, group 3 vs. group 4, incorrect to use?

- For each T-test, an alpha significance level of .05 is used (falsely rejecting H0 - type I error - is 5%). - Thus the probability of no such type I error is 95% for each test. - If we consider all 6 t-tests together, we will have a p-value probability of .95^6. Which actually equals 0.735. - Conclusion: So the probability of at least one type 1 error is 1-.735 = 26.5%. - So multiple t-tests are inappropriate!
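
The arithmetic on this card can be checked directly in R:

```r
alpha <- 0.05
k <- choose(4, 2)             # 6 pairwise t-tests among 4 groups

p.no.error <- (1 - alpha)^k   # probability of no Type I error in any test (about 0.735)
1 - p.no.error                # probability of at least one Type I error (about 0.265)
```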

Within the GRBD, if r = 1, what do you expect?

- GRBD with r=1 will be exactly the same as a RCBD.

What is the H1 hypothesis vs. the H0 hypothesis?

- H0 is the null hypothesis - H1 is the actual research hypothesis (ie. that some form of result will occur).

Regarding which statistical method to use for measuring 'if two groups are different': which method do we use when there is homoscedasticity? Which method do we use when there is not homoscedasticity (heteroscedasticity)?

- Homoscedasticity: then we use T-test - Heteroscedasticity: then we use Welch's test

(Know!) If the fligner.test() p-value is less than 0.05, what does this mean? If it is higher than 0.05?

- If <0.05, that means H1 is supported, and the data display heteroscedasticity. - If >0.05, we fail to reject H0, meaning we can assume homoscedasticity. THEREFORE WE CAN ASSUME THE VARIANCE OF BODY FAT WITHIN BOTH GROUPS OF TST IS EQUAL.

(Know!!) If we find out that the F value for the ANOVA is significant (p < 0.05), then what can we conclude?

- If F is statistically significant, we can conclude only that AT LEAST 2 groups within the 4 possibilities (Q1, Q2, Q3, Q4) HAVE significantly different body fat %s. - But we do not know which ones!

What are co-variables?

- Unlike experiments which have an independent variable and a dependent variable, correlations are described in terms of covariables. - Co-variables are the (2, or more) variables used to create a correlation score.

Why does the curve narrow when there is a higher sample size?

- If you have a larger sample size, you will end up with a larger value after the square root of the sample size. - The larger square-root value will make σ/sqrt(n) smaller (considering σ remains constant). - With this effect, u0 + Z(1-α) x σ/sqrt(n) will ultimately result in a smaller value (given everything else remains constant). - The resulting ΔCrit.0 will be smaller compared to the larger ΔCrit.0 when the sample size is smaller, meaning that the mean group effect does not need to be as far removed from zero to reject the null hypothesis.
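
A small sketch of this effect, with hypothetical values for u0 and sigma:

```r
# critical mean difference: u0 + z(1 - alpha) * sigma / sqrt(n)
crit <- function(n, u0 = 0, sigma = 10, alpha = 0.05) {
  u0 + qnorm(1 - alpha) * sigma / sqrt(n)
}

crit(10)    # wider critical threshold with few subjects
crit(100)   # narrower threshold with many subjects - easier to reject H0
```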

How can H0 (as in no improvement) be rejected?

- If: F-emp > F(1-alpha; k-1; n-k)

(Know!) In analyzing contrasts via ANOVA, what is the difference between the Xi (i.e. X1i, X2i, etc.) in ANOVA vs. multiple regression?

- In multiple regression, the X#i represent unique variable values, which are multiplied by the predicted slope. - In ANOVA, the X#i are defined as dummies, used for coding membership in a classification group (i.e. instead of multiplying B1 against a value, we are using B1 to determine the "weight" or "influence" of the group).

Regarding SStotal, how is this determined for the outcome variable Y? What is SSerror? What is SSmodel?

- SStotal: all (individual y scores - ybar) are squared and then summed up. - SSerror: the individual y scores minus the yhat score of the individual (as in, with error minus without error), each value squared and then summed up over all individuals. - SSmodel = SStotal - SSerror

The GRBD requires at least as many subjects per block as there are treatment levels, but in cases where there are not enough subjects available, which experiment design is used?

- Incomplete block design

(Know!!) Interpret the Intercept of coefficients in the picture. Interpret what TST4Q2 (aka Beta 1) means? What about Beta2? Beta 3?

- Intercept: 13.42 is the mean body fat composition of Q1, obtained when all beta values are 0. *As there are 4 quartiles and only 3 beta values. - TST4Q2 = 6.44: moving from Q1 to Q2 increases mean body fat by 6.44 (Q2 mean = 19.86). - TST4Q3 = 9.32: moving from Q1 to Q3 increases mean body fat by 9.32 (Q3 mean = 22.74). - TST4Q4 = 11.34: moving from Q1 to Q4 increases mean body fat by 11.34 (Q4 mean = 24.76). Note each coefficient is measured relative to Q1 (the intercept), not cumulatively.

What questions does the fact that p (also "r") are estimated cause? (Know!)

- Is this only related to this sample? - If you use other samples, would it also be similar? Ultimately: is "r" statistically significantly different from zero?

What is the really good/important thing (besides eliminating measurement units) that z-transforming values before a multiple regression analysis allows?

- It allows for ranking of the predictors, as in which predictors are most influential /primary!

How can we identify the significance of "r"? In that, the correlation is statistically significant.

- It can be checked using a t-test

A Completely Randomized Design (CRD) is easy, but does not consider what? Which may cause what in the long run of the experiment? If confounding factors are properly taken into account, then the e_______ could be reduced which would improve the s__________ of the experiment leading to improvements in things like ______?

- It does not consider confounding factors - It may cause subtle treatment effects to be unnoticed. - If confounding factors are properly taken into account, then the ERROR could be reduced which would improve the SENSITIVITY of the experiment leading to improvements in things like POWER.

So when you say it has a correlation of -1 or +1 etc., what are you really saying?

- It has an r value of -1 or +1, the extremes of the correlation coefficient's range (r always lies between -1 and +1).

What is the basic idea of "planned contrast"?

- It is assumed that the variability explained by the model is due to participants being assigned to different groups (such as TST4). - The variability can be broken down into different ways to test specific hypotheses about which groups might differ. - The variance is broken down according to hypotheses made a priori. "It is assumed variability comes from the assigning of a priori-determined grouping; and the variability can be analyzed in a way that will show which differ."

In short, what is a partial correlation?

- It is the correlation between two variables after eliminating common correlations with other unspecified variables.

Why does this graph, using cumulative distribution (pnorm) look the way it does?

- It looks the way it does, open-ended, because it is showing the AUC as if you are reading from left to right within the density function. - phi(1) is 1 - phi(-1), which equals 84%. phi(2) = 97.7%. So you can see that comparing phi(-2), which equals 2.3%, vs. phi(2) creates a cumulative score, where the right is always higher.

Know! If p.res is 0.950, what does this mean? 0.050?

- It means that 95% of the residuals will fall below 2.94 - It means that 5% of the residuals will fall below -6.06

What does a "negative correlation" mean? (I.e. -1)

- It means that if X goes up then Y goes down. - Or, if X goes down, Y goes up. - Always opposite

What is the disadvantage of covariance?

- It retains the scale of measure (the units of measurement)

What is the R function to calculate power/replicate/sig values needed for an experiment? What is the structure of the function? Which package is this in?

- library(pwr) "power package" - pwr.norm.test "the function" - pwr.norm.test(d, n, sig.level, power, alternative)

What quadrants, after additional axes are created representing X mean and Y mean, will most dots reside in if there is a negative correlation coefficient?

- Likely the majority will reside within the 2nd and 4th quadrant

2 levels of treatment may indicate a _______ relationship. 3 levels of treatment may indicate a _________ relationship.

- Linear - Parabolic

Using the plot just created, interpret it, as if this was the only information you were given, along with the intercept and Revise coefficients.

- There looks to be a positive slope: with more revision time comes an improvement in exam score. - Per unit increase in revision time, there is only a 0.56-point increase in exam score. - If you multiply Revise time (20 hours) x 0.56, you see that it takes 20 hours of revision for an increase of roughly 11 points in exam score.

Explain how the equation SSmodel / SStotal = R^2

- Remember: it is the explained variation (by the proposed regression model/line) / the total variation of the actual data. - Therefore SStotal represents the total variation of the actual data, and SSmodel represents the variation explained by the linear regression model. - So then, we determine whether the variation between the actual data and the regression line is large or small. - With small variation of the points around the line, the R^2 value will be high (the regression line captures the data well, so it is a good predictor). - If there is large variation of the actual data from the regression line, you can imagine that the R^2 value will be low.

It still holds that SStotal = SSmodel + SSerror. What can SSmodel be further broken down into?

- SSmodel = SSfactor1 + SSfactor2 + SSinteraction

Explain again.

- SStotal: determined by the difference between the observed data and the mean value of Y. - SSerror: determined by the difference between the observed data and the NEW regression line. - SSmodel: determined by the difference between the mean value of Y (for all observations) and the regression line.

Explain the interpretation of a scatter plot on a BIPLOT, which results when you create additional X/Y axes within the original graph to display the mean X value and mean Y value. What does it mean? (Know!!)

- Scatterplot can visibly demonstrate correlation - When you add extra axes which represent the mean X and the mean Y, we see that if a majority fall in the first and third quadrant, then therefore it is a positive correlation, as if the Xi value is above the average, the Yi value is likely above the average. Similarly if the Xi value is below the average, the Yi value is likely below the Y average.

What errors can occur during sampling design?

- Selection / recruiting of experimental subjects - Between-groups designs vs. within-subjects (know the difference!) - Sample size (and number of replications required)

(Know!) How to test for normal distribution? (Remember this can be used for all data sets, and even residuals, for ensuring multiple regression residuals are normally distributed.) The resulting p-values are all above 0.05 (p > 0.05). What does this mean? *The goal is to identify the relationship between Tricep Skinfold Thickness and Body Fat, in which TST is the predictor and body fat the outcome. Within the data "bodyfat", the columns TST and Fat indicate both.

- Shapiro Wilk Test - As p > 0.05, this means that we can assume body fat does NOT significantly deviate from normal distribution. Confirming the H0 hypothesis. - If p was less than 0.05, this would confirm the H1 hypothesis, that it is not normally distributed.

Given the previous explanation, how is the partial correlation determined?

- Simply through the Pearson correlation of the residual error terms for both EI and PA. (Ie. cor(EIerror, PAerror); in which BMI and those variables were regressed.

How do you find the T-density function? How do you find the T-distribution function? How do you find the T-values (which is basically z-score)?

- T-density: dt(x, df = 5) - need to account for degrees of freedom #like phi - T-distribution: pt(x, df = 5) #like PHI - T-score: qt(T-dist, df = 5) #just like qt(PHI, df = 5)

Regarding treatment levels, the more treatment levels the clearer the __________ of relationship between treatment and outcome will be.

- TYPE (TYPE OF RELATIONSHIP)

So if there is a data set that is not normally distributed, what does this mean in regards to the t-test derived statistical significance?

- That you cannot tell whether it is significant or not - the t-test assumes normal distribution, so its p-value may not be valid.

What determines how many contrasts are used?

- The amount of contrasts used follows the rule of #of groups - 1.

Simply, explain what dnorm (aka probability density function) AUC means?

- The area under the curve of the density function can be interpreted as probability

What is important to keep in mind when interpreting regression coefficients? What does this mean for all the interpretation we have done so far? And how I have to manage multiple regressions in the future?

- The coefficients from a multiple regression have to be interpreted based on their respective measurement units. - Coefficients of different predictors are hardly comparable, and their relative influence cannot be evaluated. - Use z-transformations in the future before running a regression analysis.

(Know!) If you plot a histogram of the data, you will get a non-normally distributed looking histogram. Yet, if the Shapiro test says it is normally distributed, how do you explain the discrepancy?

- The discrepancy is likely caused by the number of observations. - It is much more difficult to obtain a p-value <0.05 when there are few observations. - Since the number of observations affects p-value significance, and p < 0.05 indicates non-normal distribution, a small sample can produce a misleading p-value: the p > 0.05 "normal distribution" verdict may be false.

What is the difference between the two strategies to separate treatment effects?

- The first keeps everything constant but the treatment - Often it is difficult to keep everything constant, so the second strategy (blocking) includes and accounts for the effect in the model.

Picture a density curve, draw the difference between a T-distribution and Normal distribution with Df=1, also Df=10, also Df=100. What do you notice? (Know!)

- The lower the degrees of freedom, the flatter and wider the curve is. - The higher the degrees of freedom, the narrower and steeper the curve becomes. - So, for example, a t-distribution curve with a high number of degrees of freedom closely resembles a normal distribution.

What can we use to reconstruct a frequency distribution? What are the requirements? *know!

- The mean and the standard deviation (or variance) - The requirements are for the variable to be normally distributed.

What is the second quartile identical with?

- The median

What is "degrees of freedom"?

- The number of independent pieces of information used for estimating a parameter. (For example, when estimating the mean height of German women, the degrees of freedom will be the number of subjects - 1.)

What is a way to get past the shortcomings of RCBD?

- Using a Generalized Random Block Design (GRBD)

What is the shortcoming of RCBD?

- The number of treatments (t) is fixed, so the only thing that can affect n (the number of subjects) is the number of blocks (b), since n = t x b. *So the number of subjects in the experiment is determined by the number of treatment levels and the number of block levels (envision the RCBD visualization).
- For example, it is hard to increase the number of subjects when using a blocking variable like gender, because there are only two genders; since t is fixed, n is completely dependent on t x b.
- **Also, the RCBD does not include an interaction term, so the combined effect of treatment and block factor cannot be investigated.

If both Effect Delta, and standard deviation are mis-estimated by a common factor then...

- The optimal number of replicates remains the same - It's actually only d = delta/sd that matters when calculating optimal sample size.

What is the probability of a Type I error called? And how is it denoted?

- The probability of a Type I error is called level of significance and denoted as α.

Why does confidence in results increase as more replications are performed?

- The probability that the result has occurred by chance is diminished. (Think of the Tea Lady with 4 cups vs. 6 cups: we would be more confident if she correctly identified all 6.)

What is range?

- The spread of values from the lowest value to the highest value (maximum - minimum).

What is the meaning of ΔCrit.0?

- The resulting value from the formula is compared to the mean group effect. If the mean group effect is larger than the determined ΔCrit.0, then the null hypothesis may be rejected. - With a higher subject number, the curve narrows.

What is the definition of variance?

- The square of each individual deviation from the mean, summed up, and then divided by the number of subjects.

What is the definition of standard deviation?

- The standard deviation is the square root of the variance.

What does B0 actually mean?

- The theoretical value of the outcome variable Y if the predictor X is 0, as this would leave only the error term and the B0 value in the equation Yi = B0 + B1*Xi + error term.

If Effect Delta is underestimated (assumption) then...

- The true density function of H1 (green curve) is actually further to the right - The probability of Type II error decreases - The optimal number of replicates decreases

Describe what a two-sided test is against a one sided? What is a likely consideration when using a two-sided hypothesis?

- A two-sided test covers both ends of a hypothesis. - If the alternative hypothesis is that HFD will increase villi length, we use a one-sided 'greater' test; use 'less' for decreased villi length. - If the alternative hypothesis states that villi length is altered, without giving a direction, we can use a two-sided test. - A two-sided test typically requires an increased sample size.

In a situation in which only B0 and B1 determine the regression line, what is the difference in significance between the t-test and the F-test?

- There will be no difference; they will be exactly the same. - If there were more regression coefficients (B2, B3, etc.), then the t-test significance (given for each coefficient) would differ from the F-test significance.

What do post hoc tests do?

- They compare each group mean against the means of (all) the other groups. - In general, post hoc tests use a stricter criterion than alpha to accept an effect as significant; this is why they control the familywise error rate.

Why is var.equal = TRUE, used within the t.test function?

- It tells t.test() to assume homoscedasticity (equal variances in the groups), i.e., to run a classic Student's t-test rather than Welch's test.

What are the properties of the density function of the standard normal distribution? *know!

- Total area under curve = 1 - Variance of distribution = 1 - Function is symmetric around X=0 - Function attains its maximum at X=0 - The function has inflection points at X=1 and X=-1

In regards to planning your statistical model, which main effects should be included ?

- Treatment effects - Error-control effects

If standard error assumption is underestimated then...

- True density functions of H0 (blue curve) and H1 (green curve) are flatter - The blue and green areas representing the Type I and Type II errors grow larger (the flatter curves overlap more) - The optimal number of replicates increases

If standard error assumption is overestimated then...

- True density functions of H0 (blue curve) and H1 (green curve) are steeper - The blue and green areas representing the Type I and Type II errors shrink (the steeper curves overlap less) - The optimal number of replicates decreases

If Effect Delta is overestimated (assumption) then...

- The true density function of H1 (green curve) is actually further to the left - The probability of a Type II error increases (beta value) - The optimal number of replicates increases *this holds when the beta value is fixed at 20%

Reiterate what Type I and Type II errors are. How do conservative tests try to reduce the probability of a type I error? What does this mean?

- Type I: rejecting a true null hypothesis - Type II: accepting a false null hypothesis - Conservative tests reduce the probability of a Type I error by lowering the p-value threshold for rejecting the null hypothesis (e.g., p < 0.01) - Conservative tests are therefore prone to TYPE II ERRORS

Let's say you choose less than 10 for the sample size in the picture; what type of error would this likely mean? How do we know 10 is the correct number of replicates (or participants)?

- A Type II error, because beta could not reach the desired value. - Because there would be no gap between Delta.Crit1 and Delta.Crit0.

What constitutes defining the components of the experimental design of treatment designs?

- Types of treatments (treatment factors and confounding factors) - Number and types of treatment levels (eg. low vs. high dosage of treatment factor ) - Combination of various treatments

What is the formula for Tukey HSD In R?

- TukeyHSD(bodyfat.aov2)

After the multiple graphs created by using plot(bodyfat.aov), in which bodyfat.aov <- aov(fat~TST4), we see that there are some further discrepancies in residual homogeneity. On the residuals vs. fitted graph, we see a crooked mean-of-residuals line. Importantly, this function allows us to identify potentially confounding outliers that may affect our data disproportionately. How can we see if these outliers are indeed affecting our distribution to a disproportionate (undue) degree? Also called trying to identify "influential outliers" that create an "undue" influence on the distribution/results.

- We can use Cook's distance to identify the outliers (e.g., plot(bodyfat.aov, which = 4) plots Cook's distance).

(Know!!) Now that we know the fat data is normally distributed, what tests are next? How do we do this in R?

- We need to test for HOMOSCEDASTICITY - which gives us the answer if there is homogeneity of variance. In R: fligner.test(Fat ~ TST2)

If, in the very beginning, we found out that our data was not normally distributed via the Shapiro test, what statistical comparison test could we use? (Remember this skips a step and goes directly to comparing the groups).

- We use the Wilcoxon test if the Shapiro test showed that a normal distribution is not present.

(Know!) If there is heteroscedasticity (no homoscedasticity), what statistical method would you use to identify if the groups are different? How is this done in R?

- Welch's test - t.test(Fat~TST2, var.equal = FALSE, alternative = "less") - The way to do a Welch's test in R is to ensure that var.equal = FALSE, as this is saying the variance is not equal (heteroscedasticity) in the data.

Interpret this graph as well. Be sure to explain what the intercept means in this case.

- With no anxiety, revision, and all females (genderMale=0), we can expect an exam score of 86.9%. (This is not to be taken seriously!) - If there is no anxiety, but revision increases by one unit, we can expect an increase in performance by 0.25% in female students (genderMale = 0). - If revision time is held constant, a unit increase in anxiety would lead to a reduction of .48% on exam scores of female students (genderMale=0) (only significant one) - If revision and anxiety were held constant, male students performed slightly better than female students (1.03%).

In regards to planning your statistical model, do interactions have to be considered? Which ones?

- Yes they should be included in the statistical analysis - Treatment x Treatment - Treatment x Error-Control (maybe it works only in males and not in females) - Error-control x Error-control (interaction between age and gender)

Are the Bonferroni correction and Tukey's HSD both conservative? In terms of power, who has more, Bonferroni or Tukey HSD? (2 answers)

- Yes, Bonferroni and Tukey HSD are both conservative. - Bonferroni has more power IF the number of comparisons is quite limited. - Tukey's HSD has more power than the Bonferroni correction if the full set of comparisons is performed.

In theory, if you did a correlation graph, which included: Exam scores, Anxiety levels, residuals, and Y hat scores. All in one data frame. Then a correlation (normal Pearson) was run on this. Is it possible to get the R^2 value from this?

- Yes, by squaring the entire correlation matrix, you can get the R^2 value of the model in relation to anxiety score, by looking at the squared y-hat value against the outcome variable. *Remember that the normal correlation coefficient is r, so squaring it gives R^2.

When reviewing this graph, what is Yhati? As compared to Yi?

- Yhat_i = b0 + b1*Xi, i.e., the value the regression line predicts for individual i's predictor value. Graphically, Yhat_i is the y-coordinate where a vertical line from the individual's X value intersects the regression line, whereas Yi is the individual's actually observed outcome.

Within class, what is the package that was used to easily make randomized designs?

- agricolae - library(agricolae)

What is a Latin square? When is a Latin square possible? How many blocking factors are used?

- An arrangement of b rows and b columns such that every Latin letter occurs exactly once in each row and once in each column (b = number of Latin letters). - A Latin square is only possible if exactly two blocking factors are to be considered, and both blocks have an identical number of levels. - Therefore, the rows and columns represent the two blocking factors, whereas the Latin letters represent the treatment levels.

The experimental setup/material is divided into what? What is the defining feature of the RCBD? What does RCBD stand for? Is the design completely randomized or not? Why?

- Into b sets of t experimental units each, where t is the number of treatments. - The experimental units are spread among those b sets such that each treatment occurs exactly once in each block. - Defining feature: each block sees each treatment exactly once. - RCBD: Randomized Complete Block Design. - The RCBD is not completely randomized, because we assign the experimental units to sets based on a priori information (as compared to the CRD).

Why is taking the mean of deviation inappropriate?

- Because the deviations from the mean always sum to zero: the positive and negative deviations cancel each other out.
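A one-line check of this in R (any numeric vector works; the values here are made up):

```r
x <- c(2, 4, 6, 8, 10)        # arbitrary data
dev.sum <- sum(x - mean(x))   # deviations from the mean cancel out: 0
```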

Why is the sum of squares hard to interpret?

- because the value of the SS depends on the number of observations (the higher the number of observations, the more likely the SS will be higher.)

What is corr.test()? (Has an extra r)

- Combines all the advantages of cor() and cor.test(). - It can work with complete data frames or matrices. - Computes various types of correlation coefficients (i.e., Pearson, Kendall, Spearman). - Provides t-test information (including p-values of two-tailed significance tests).

The more treatment levels the more ____________ the experiment; and the ____________ sample size required to maintain the statistical power of the experiment.

- complex - larger (sample size)

What does cor.test() accomplish? What does it do? What are the attributes?

- computes correlation (just as cor) - provides t-test information (including p-values and confidence intervals) - works only with variables, not with complete data frames.

With two-way ANOVA, how many independent variables are there? What design study type is this?

- Two independent variables - Factorial design

Explain the connection between phi, PHI, and z. Look at the image.

- phi(x) is the density function; each value in the sequence gives the probability density at that value (look at the density curve: instead of reading left to right, you are identifying how likely an unknown individual is to obtain that exact score). - PHI(x) is the cumulative distribution function; as the sequence continues, the probabilities add up (notice that z-score 2.0 has probability 97.7%, the same as in the earlier questions). Quantiles are determined from this function. - z-score: obtained using qnorm(PHI). qnorm is the quantile function of the normal distribution, so it uses quantiles (cumulative probabilities) to return a z-score.

How to calculate probability for any z score that is normally distributed?

- pnorm(z.score)

Within a data set of seq(-2, 2, 0.2), how do you find the z score from PHI? **-2 to 2 is separated by .2

- qnorm(PHI)
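A round-trip sketch in R showing that qnorm() inverts pnorm() for exactly this sequence:

```r
z <- seq(-2, 2, 0.2)   # the z-scores, -2 to 2 in steps of 0.2
PHI <- pnorm(z)        # cumulative probability for each z-score
z.back <- qnorm(PHI)   # qnorm() inverts pnorm() and recovers the z-scores
```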

What is the mathematical formula for a t-test that checks if "r" is statistically significant (T-emp)?

- t = r * sqrt((n - 2) / (1 - r^2)) - n - 2 is actually the degrees of freedom. - It is n - 2 since r is based on the estimates of sdX and sdY.
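As a sketch with made-up values for r and n (not from the slides), the formula and its p-value from the t-distribution:

```r
r <- 0.6; n <- 20                        # illustrative correlation and sample size
t.emp <- r * sqrt((n - 2) / (1 - r^2))   # empirical t-value, df = n - 2
p.val <- 2 * pt(-abs(t.emp), df = n - 2) # two-sided p-value
```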

What is the prerequisite for determining the significance of regression coefficients using a t-test? Does it follow a specific distribution?

- r^2 has to be significant - it follows a t-distribution with (n-k) degrees of freedom

How can the significance of a regression analysis be determined? As in, how can the significance of the r^2, and regression coefficients be measured? (Know!)

- r^2: F-test - B0 and B1: t-tests

What are B0 and B1 called, besides the intercept and the slope?

- regression coefficients (These regression coefficients are unknown, but they can be estimated as B0 and B1 using the available sample data on outcome and predictors).

How do we overcome this standard deviation unit discrepancy problem? How is this accomplished?

- Relate the standard deviation to the respective mean. - Divide the standard deviation by the respective mean (this is the coefficient of variation).
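A minimal sketch in R (heights are made-up values): dividing sd by the mean gives a unitless ratio, so it is comparable across variables measured in different units.

```r
height <- c(160, 164, 170, 158, 168)  # made-up heights in cm
cv <- sd(height) / mean(height)       # coefficient of variation: unitless
```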

What is homoscedasticity?

- residuals have a constant variance

Which parameters does dnorm require in order to be completed?

- sample mean - sample standard deviation

What is the function for randomization of a set; for example provide the function to randomize a sequence 1:10. Now create a table with 2 treatment groups assigned and control assigned for participants 1-15. This is assigning ***"treatment groups to subjects"***

- sample(10) # e.g. 3, 5, 2, 7, ...
- subjects <- 1:15
- groups <- rep(c("C", "T1", "T2"), each = 5)
- treatment <- sample(subjects)
- design <- rbind(treatment, groups)
OR (preferred, because the subjects stay in order):
- subjects <- 1:15
- groups <- rep(c("C", "T1", "T2"), each = 5)
- treatment <- sample(groups)
- design <- rbind(subjects, treatment)

What is the primary way to plot two covariables in which you should be able to see if there is a correlation or not? Give R code. Regarding the R code, which is the X axis and which the Y axis when "~" is used?

- scatter plot - plot(Exam ~ Anxiety, data=ExamAnx) - Y axis ~ X axis

Within pwr.norm.test, write the sig.level part of the structure. Do the same with the power part. Interpret the power part if resulting power value is 0.95

- sig.level = 0.05 - power = 1 - beta - A power of 0.95 means beta = 0.05; the usual convention aims for beta = 0.20, i.e., an R power value of 0.80.

When is Kendall's correlation (Tau) recommended?

- small data set - If, when ranking, there are a large number of tied ("shared" ranks)

Now, using 20 participants, randomly "assign them to 5 treatment groups".

- subjects <- 1:20 - rand.subj <- sample(subjects) - groups <- rep(c("C", "T1", "T2", "T3", "T4"), each = 4) #creates 5 treatment groups; enough for all subjects - design <- split(rand.subj, groups) - (optional) lapply(design, sort) **Remember split(); it plays the role rbind() did before. Know split()!

From this last question, what is the sum of squared deviations called?

- sum of squares

Interpret the returned data from using the t.test() function.

- The t-value is -4.3303. - This is a significant difference, evidenced by the p-value = 0.0002. - The confidence interval confirms significance, as it does not cross zero. - Then compare the mean values of both groups. We conclude: low tricep skinfold thickness is significantly associated with lower percentages of body fat. #Not sure whether a causal claim can be made, or how that would even work.

How do you use t-test in R to measure if the TST2 groups are different in terms of representing body fat?

- t.test(fat~ TST2, var.equal = TRUE, alternative = "less")

Is there a specific method that allows for the frequency reconstruction?

- the DENSITY FUNCTION - (requires normal distribution)

What is the subject matter model crucial for?

- the choice of experimental design - the selection of the target population - the evaluation of the complexity and feasibility of the experiment

Using a Q-Q plot, how will we know that the residuals follow a normal distribution?

- The points will fall on a straight line.

If histogram shows not normally distributed, but shapiro and qqnorm graph show it is normally distributed, what is going on?

- there is likely some problematic outliers.

Why would I use a z-transformation on regression coefficients? What is the terminology of the resulting value? What can this be used for later on?

- They have no measurement units. - "Standardized regression coefficients" (usually denoted as Beta). - They can be used for ranking the predictors.

How do you calculate density/frequency reconstruction in R? ****understand how to do this in R****

- this is done through the dnorm function

Look at this graph, interpret what the F value is.

- The F-value tests whether the model as a whole is significantly better than a basic (null) model. - It is based on the R^2 value.

What is a general purpose of creating a mean?

- to simplify and summarize data. (Such as all women in Germany are 164cm tall, by mean).

In R, how would you calculate a partial correlation? Say we are trying to find the partial correlation between EI and BMI, in which we want to exclude the variable PA. (Know!)

1. my.data <- data.frame(Energy, BMI, PA) 2. energy.lm <- lm(Energy ~ PA) 3. bmi.lm <- lm(BMI ~ PA) #energy.lm residuals: the part of EI not accounted for by PA #bmi.lm residuals: the part of BMI not accounted for by PA #this effectively removes the influence of PA 4. cor(resid(energy.lm), resid(bmi.lm)) - This is very similar to cor(Energy, BMI), yet now the influence of physical activity is removed.

How can goodness of fit be evaluated? Give 4 steps.

1. Determining the deviations of the predicted values of the outcome from the actual measured values (predicted - actual). 2. Squaring these deviations: positive and negative values do not cancel out, and bigger deviations receive relatively stronger weight than small deviations. 3. Summing the squared deviations: creating a single measure. 4. Comparing the sum of squares of the model against a very basic model.

in R, with "Exam Anxiety.Rdata".

1. load("Exam Anxiety.Rdata") #loads the ExamAnx data frame into the workspace; load() restores objects rather than returning the data frame 2. ExamAnx$Gender <- as.numeric(ExamAnx$Gender) - 1 #creates a dummy variable, i.e., gives a numeric value of 1 or 0 3. exam.z <- scale(ExamAnx) 4. z.lm <- lm(Exam ~ Revise + Anxiety + Gender, data = data.frame(exam.z)) #included data = ..., as ExamAnx is likely attached to the workspace 5. summary(z.lm) ***In short, it is just scaling the data before running the lm function.

What are the preparatory steps to running an experiment?

1. Formulation of questions and hypotheses 2. Formulation of a subject matter model 3. Definition of the components of the experimental design 4. Translation of the experimental design into a statistical model

(Know!) What are the rules for coded planned contrast?

1. Groups coded with positive weights are compared to groups coded with negative weights 2. The sum of weights should be zero per contrast 3. A weight of zero is assigned if a group is not involved in a particular comparison 4. For a given contrast, the weights assigned to the group(s) in one chunk of variation should equal the number of groups in the opposite chunk of variation 5. If a group is singled out in a comparison, it should no longer be used in any subsequent contrasts.

(Know!) What are the required assumptions for ANOVA?

1. Homogeneity of Variance (Fligner test already performed before ANOVA) 2. Normal distribution of RESIDUALS 3. Check for influential outliers

What is an easier way to create partial correlations on R? My.data <- cbind(Energy, BMI, PA)

1. install.packages("ppcor"); library(ppcor) 2. pcor(my.data) 3. View the returned data, or assign pcor(my.data) to a variable and then call Variable$estimate 4. View the $p.value data as well.

Explain what this means. " I = different treatments And at total of n = I x M subjects. Let those n subjects be partitioned randomly with equal probability into... - Those I groups - of M subjects in each group Let all the I treatments be assigned to all those I groups of the subject, such that the "i-th" treatment is applied to each of the m subjects in group I. "

1. Let the n subjects be partitioned equally into the I treatment groups. - Example: I = 2, n = 24, so m = 12 subjects per group. 2. Every subject in group i receives the single treatment i, so each subject-treatment pairing represents 1/24 of the entire experimental design.

Now create a Completely Randomized Design (CRD) using agricolae. Groups = Music (rock, country, hip-hop, classic) # of participants in each group = 9

1. library(agricolae) 2. music <- c("rock", "country", "hip-hop", "classic") 3. design <- design.crd(music, r = 9, randomization = TRUE)$book #r = 9 participants per group, 36 in total

Using R, design a Latin square. Music (Treatment) = rock, country, hip hop, classical Region (BF-1) = North, South, East, West Age (BF-2) = <18yo, 18-20yo, 21-23yo, 24+ yo

1. library(agricolae) 2. treatment <- c("rock", "country", "hip-hop", "classic") 3. region <- c("North", "South", "East", "West") 4. age <- c("<18yo", "18-20yo", "21-23yo", "24+yo") 5. design <- design.lsd(treatment, randomization = TRUE, seed = 5)$sketch 6. rownames(design) <- age 7. colnames(design) <- region

(Know!) What are the 4 different kinds of line trends? Explain what they look like.

1. Linear trend 2. Quadratic trend 3. Cubic trend 4. Quartic trend

In the example of women's height, give the r code for creating a histogram with a density curve.

1. n <- length(female.height) #to count how many 2. mean(female.height) 3. sd(female.height) 4. z.height <- (female.height - mean(female.height)) / sd(female.height) 5. hist(z.height, prob = TRUE, ...) then curve(dnorm(x), add = TRUE) #prob = TRUE puts the histogram on a density scale so the dnorm curve can be overlaid

How to make the previous graph look nicer?

1. names(design) <- c("ID", "Replicate", "Fat", "PUFA", "Fiber") 2. design$Fat <- factor(design$Fat, labels = Fat) 3. design$PUFA <- factor(design$PUFA, labels = PUFA) 4. design$Fiber <- factor(design$Fiber, labels = Fiber)

In R, create a regression analysis plot.

1. plot(Exam ~ Revise) 2. ExamRev.lm <- lm(Exam ~ Revise) 3. abline(ExamRev.lm)

In R, construct a completely randomized design of assigning 36 participants into 4 music groups (rock, country, Hip-hop, classic).

1. subjects <- sample(1:36) 2. music <- rep(c("rock", "country", "hip-hop", "classic"), each = 9) 3. design <- split(subjects, music) 4. design <- lapply(design, sort)

When evidencing a multiple regression analysis by table, what is important to report? (Know!)

1. The R^2 and F-value and related significance 2. Predictors, their b, SEb, Beta, and p-value

List 5 situations within an experiment which require MORE replicates?

1. The smaller the treatment effect to be detected (effect size) 2. Greater the variability of outcome 3. Lower the desired level of significance ( = α) (eg. .05) 4. The higher the desired power ( 1 - β ) 5. The more factors or factor levels under consideration

How to report correlation coefficients? Give example using cor(my.data) regarding EI and PA. And another example

1. Type of correlation 2. Size of correlation (two decimal places is fine) 3. Significance level (p-value).

Explain the linear model of GRBD.

Yijk = u + Betai + Tk + (BetaT)ik + Epsilonijk
i = index of the block under consideration
k = treatment index
j = index of the experimental unit
Yijk = outcome of experimental unit j in block i under treatment k
u = grand mean of the outcome variable
Betai = effect of the i-th block, i.e., the deviation of the i-th block mean from the overall mean outcome u
Tk = effect of treatment k across all blocks, i.e., the deviation of the mean of the k-th treatment from u
(BetaT)ik = interaction of treatment level k and block level i, i.e., the deviation of the treatment-k effect in block i from the overall treatment-k effect ****This interaction term is the difference from the RCBD.
Epsilonijk = error term of experimental unit j in block i under treatment k, i.e., the deviation of unit j from the mean outcome of its treatment-block cell (so this is not compared against u)

Describe the linear model of the Randomized Complete Block Design.

Yik = u + Betai + Tk + Epsilonik
i = index of the block under consideration
k = index of the treatment level within the block
Yik = outcome of the experimental unit (subject) in block i under treatment level k
u = grand mean of the outcome variable
Betai = effect of the i-th block (i.e., the deviation of the mean of the i-th block from the overall mean outcome u)
Tk = effect of the k-th level of the treatment variable (i.e., the deviation of the mean outcome under treatment k, across all blocks, from u)
Epsilonik = error term of the experimental unit (subject) in block i under treatment level k

Any normal distributed variable can be turned into a standard normal distribution. (Note the importance of each portion of that sentence, "normal distributed variable" and "normal distribution", as in these are separate but work together). If this is the case, how do we obtain a z-score?

Z-score: - Take the individual's value and subtract the group mean from it. - Then divide that value by the standard deviation. - This creates the z-score. (This is the same as the function scale(X, center = TRUE, scale = TRUE).)
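A minimal check in R that the manual calculation matches scale() (the heights are made-up values):

```r
x <- c(160, 164, 170, 158, 168)                      # made-up heights
z.manual <- (x - mean(x)) / sd(x)                    # subtract mean, divide by sd
z.scaled <- as.vector(scale(x, center = TRUE, scale = TRUE))
```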

What is the expression for the sum of squared deviations:

Σ(xi - x̄)^2 - So: each individual deviation from the mean is squared, and then all are summed up. - This is the same quantity used in the variance, but for the variance it is additionally divided by the number of observations.

What is the difference between phi and PHI?

phi: dnorm (density function) PHI: pnorm (distribution function)

What is the mathematical formula for empirical correlation coefficient, aka "r"?

r = n * cov(X,Y)/sqrt(ssX*ssY)

What is the difference between random errors and systematic errors?

Random errors: - Cause: investigation of samples rather than whole populations. - Limited reliability of results (e.g., because of an unrepresentative sample); mitigated through confidence intervals. Systematic errors: - confounding factors - bias of scientists - measurement errors

What is the interquartile range?

upper quartile - lower quartile (i.e., 3rd quartile - 1st quartile)

If assuming a null hypothesis, what is the formula to determine ΔCrit.0?

ΔCrit.0 = u0 + Z(1-α) x σ/sqrt(n)
u0 = 0 (under the null hypothesis, this is the mean change expected)
Z(1-α) = the (1-α) quantile of the standard normal distribution, as we want a 95% significance level (e.g., qnorm(0.95) ≈ 1.645 for α = 0.05)
σ = the standard deviation
sqrt(n) = the square root of the number of participants
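Plugging illustrative numbers (made up: σ = 10, n = 16, α = 0.05, one-sided) into the formula in R:

```r
mu0 <- 0; sigma <- 10; n <- 16; alpha <- 0.05
delta.crit0 <- mu0 + qnorm(1 - alpha) * sigma / sqrt(n)
# observed mean effects larger than delta.crit0 lead to rejection of H0
```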

If assuming the H1 hypothesis, what is the formula?

ΔCrit.1= u1 + Zβ x σ /sqrt(n) Or ΔCrit.1= u1 - Z(1-β) x σ /sqrt(n)

If the subject number increases, why does this narrow the curve in the H1 formula

ΔCrit.1 = u1 + Zβ x σ/sqrt(n), or equivalently ΔCrit.1 = u1 - Z(1-β) x σ/sqrt(n)
- Increasing the subject number increases the ΔCrit.1 value.
- The Zβ value is negative (e.g., qnorm(0.20) ≈ -0.84).
- With a larger number of subjects, sqrt(n) is larger, so σ/sqrt(n) becomes smaller.
- That smaller value is multiplied by the negative Zβ, giving a smaller negative number than the identical formula with fewer subjects.
- When this smaller negative number is added to u1 (the expected change), ΔCrit.1 drops only slightly below u1 (e.g., it drops only to 35 instead of 30, because less is being subtracted from u1 than in the formula with fewer subjects).
- The resulting higher ΔCrit.1 value (compared to the formula with few subjects) indicates that the observed mean effect does not need to fall as far below the expected mean effect.

Define: σ^2 s^2

σ^2: square of each individual difference from the mean, and then summed up, and then divided by the number of observations (n). s^2: square of each individual difference from the mean, then summed up, and then divided by the degrees of freedom (n-1)
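A sketch of both definitions in R (x is made-up data); note that R's built-in var() computes s^2, i.e., it divides by the degrees of freedom n - 1:

```r
x <- c(4, 8, 6, 5, 3)                                # made-up observations
pop.var  <- sum((x - mean(x))^2) / length(x)         # sigma^2: divide by n
samp.var <- sum((x - mean(x))^2) / (length(x) - 1)   # s^2: divide by n - 1
```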

When is σ^2 used? When is s^2 used?

σ^2: using data that cover the total population. s^2: using data from a sample of the population (when conclusions about the whole population are to be drawn from the sample).

