Statistics and Experimental Design


getting the question of effect right

"is there an effect" vs "how large is the effect" -QWERTY ex. fails to address the magnitude of the effect ex. "large effect" used to describe something that's statistically highly significant ex. "significant difference" used to describe an effect that's numerically large without it having been statistically tested... want to know whether the effect is robust - more so than how strong it is

which model of linear regression gives a better fit?

(i.e. the smaller set of residuals)? = the more complex model, always. (The complex model could always posit a coefficient of about zero and so do at least as well as the simpler one.) - the line of best fit is almost surely not going to be completely flat, given our data; that's inevitable given a finite set of data, whether or not there's really a relation between the variables - so the line of best fit might not reflect the underlying reality - risk of over-fitting

wilcoxon signed rank test

alternative to the paired t-test for small samples of paired, non-normal data
*look at the size of the differences, and rank them from smallest to largest; under the null we expect positive and negative differences to be well intermixed
-if there's a preference for sentence A, the differences in favour of A should be clustered at the top of this list, and those in favour of B at the bottom
in effect, we'll proceed by awarding points to each sentence as follows:
-remove any data showing no difference (adjusting sample size)
-assign one point to the data pair with the smallest difference, two points to the next smallest difference, and so on
-share the points in case of ties (bottom two tie: each gets 1.5 points, and so on)
-credit those points to the sentence that "won" each data pair

model comparison of linear regression

- can use the t-test to perform a significance test on our simple regression model
- linear regression proposes that any measurement of the d.v. is explainable as the sum of three things: ▪ a multiple of the measurement of the i.v. ▪ a constant ▪ an error term
- that's a hypothesis about how the y-value comes to be what it is
- H0 proposes that a measurement of the d.v. is explicable as the sum of: ▪ a constant ▪ an error term (i.e. the independent variable has no relation to the dependent variable)

violating independence in an experiment

- can't have same participant doing the same experiment several times - what one person does multiple times in the same experiment is not independent data

nominal variables with 3+ levels

- can't represent a nominal variable with three levels (e.g. M, F, N; training A/B/C) with a single "dummy variable": if two of the levels are given values 0 and 1, we wouldn't know what value to assign to the third level (or any subsequent ones)
• in this situation, we need additional binary variables
ex. two variables: one interpretable as "is this M?" (1=yes, 0=no) and another as "is this F?" (1=yes, 0=no)
M would be represented by (1, 0); F by (0, 1); N by (0, 0)
coefficient of "is this M?" would correspond to the increase in the d.v. for M compared to N
coefficient of "is this F?" would correspond to the increase in the d.v. for F compared to N
N gets no coefficient of its own: it is the "baseline" condition, absorbed into the constant
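The coding scheme on this card can be sketched in a few lines of Python. This is only an illustration: the helper name `dummy_code` and its arguments are hypothetical, not taken from any statistics package.

```python
def dummy_code(level, levels, baseline):
    """Return one 0/1 indicator per non-baseline level (hypothetical helper)."""
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    # One "is this X?" variable for every level except the baseline.
    return tuple(1 if level == other else 0
                 for other in levels if other != baseline)


LEVELS = ("M", "F", "N")  # the three levels from the card; N is the baseline
codes = {level: dummy_code(level, LEVELS, baseline="N") for level in LEVELS}
print(codes)
```

With three levels this yields two indicator variables, and the baseline level is the one coded as all zeros.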

use of multiple regression

- explore multiple quantitative and qualitative factors simultaneously -relevant if we want to explore the effects of several variables of interest in the same experiment - relevant if we want to control for things that we're not directly interested in but we're worried that might be exerting an effect gender would be one plausible ex., or age, but there are many other possibilities - enter that variable as a predictor and see whether or not it turns out to have any predictive utility

regression

-behind a lot of statistical methods
-drawing a line of best fit
-used for the question of whether different word frequencies (or participant age or gender) lead to a different outcome
-use coefficient b: how large it is (a measure of effect size) and whether it's reliably non-zero (basis for a hypothesis test)
-least-squares regression: regression line calculated to minimize the sum of squares of the residuals (the vertical distances between the points and the line)
-residuals correspond to how (un)successful the equation is at predicting the value of y, given the value of x
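A minimal sketch of least-squares fitting for one predictor, using the standard closed-form solution (gradient = sum of cross-products of deviations divided by sum of squared x-deviations). The data points are hypothetical, invented only to make the arithmetic easy to follow.

```python
def least_squares(xs, ys):
    """Fit y = a + b*x by minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # gradient
    a = mean_y - b * mean_x     # intercept: the line passes through the means
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, residuals


# Hypothetical data, not from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, residuals = least_squares(xs, ys)
ss_residuals = sum(r ** 2 for r in residuals)
print(a, b, ss_residuals)
```

The residuals are the vertical distances the card mentions; any other straight line through these points would give a larger sum of squared residuals.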

recipe for calculating the mean and standard deviation (of a population or a sample):

-calculate the mean of your n data points -for each data point, calculate the difference between it and the mean, and square that difference -add together the squared differences for all the data points -divide that by n (for a population) or n-1 (for a sample) -that's the variance: take the square root to get to the standard deviation
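The recipe translates directly into code. A small sketch, with hypothetical data chosen so the numbers come out round; the function name `mean_and_sd` is an illustrative choice.

```python
import math


def mean_and_sd(data, population=True):
    """Follow the recipe: mean, squared differences, divide, square root."""
    n = len(data)
    mean = sum(data) / n
    squared_diffs = [(x - mean) ** 2 for x in data]
    total = sum(squared_diffs)
    divisor = n if population else n - 1   # n for a population, n-1 for a sample
    variance = total / divisor
    return mean, math.sqrt(variance)


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data points
pop_mean, pop_sd = mean_and_sd(values, population=True)
_, sample_sd = mean_and_sd(values, population=False)
print(pop_mean, pop_sd, sample_sd)
```

The only difference between the two versions is the divisor, so the sample SD is always a little larger than the population SD computed from the same numbers.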

other uses of chi square test

-the same method can test whether the data fit other null-hypothesised distributions
ex. suppose we were worried that participants were just choosing at random in condition A
H0: choice is at random
2x4 table (observed and expected rows); the 4th column is predictable from the other three, so 1 x 3 = 3 df
observed data (A): SNP 37, LABOR 31, CON 26, OTHER 16 (total 110)
expected data (A): SNP 27.5, LABOR 27.5, CON 27.5, OTHER 27.5
calculation the same as before: for each category take observed minus expected, square it, divide by expected, and add the results up across the four categories
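A short sketch of this goodness-of-fit calculation, using the usual chi-square sum of (observed - expected)² / expected. The critical value 7.815 (3 df, 5% level) is taken from standard chi-square tables rather than from the card.

```python
observed = {"SNP": 37, "LABOR": 31, "CON": 26, "OTHER": 16}

total = sum(observed.values())        # 110 choices in condition A
expected = total / len(observed)      # 27.5 per category if choice is random

chi_square = sum((count - expected) ** 2 / expected
                 for count in observed.values())
df = len(observed) - 1                # 3 degrees of freedom

CRITICAL_5PCT_DF3 = 7.815             # table value, an addition to the card
reject_h0 = chi_square > CRITICAL_5PCT_DF3
print(round(chi_square, 3), df, reject_h0)
```

On these numbers the statistic comes out a little above the tabled critical value, so the "choosing at random" hypothesis would be rejected at the 5% level.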

why assume normal distribution?

1. The normal distribution has special mathematical properties 2. The normal distribution turns up a lot in real life - a lot of the things we might want to measure are roughly normal not everything is normally distributed... -tend to look at the data to see whether it's compatible with the hypothesis of the population being underlyingly normally distributed: we can significance-test that too -even completely different distributions, like the binomial, rather come to resemble the normal distribution in shape (which is not a coincidence) -can use the normal distribution as an approximation of other distributions

testing difference between two groups

H0 states that there's no difference between the means of the two groups
previously we had pairs of observations, ex. primed and unprimed
-under the null hypothesis, both are drawn from the same population
-we also know something about the difference between two independent random variables: if X ~ N(μ, σ²) and Y ~ N(μ, σ²), then X - Y ~ N(0, 2σ²)
-so under the null hypothesis, we expect these differences to be drawn from a normally distributed population with mean zero

example of CLT and

H0: average exam mark is 60
actual outcome: n = 41, mean = 56.15, sample s = 11.27
SEM = 11.27 / sqrt(41) = 1.76
t = (56.15 - 60) / 1.76 = -2.19
critical value of t with 40 df = 2.021; reject the null because |t| = 2.19 exceeds it
H1: the average exam mark isn't 60
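The same one-sample t calculation, sketched in Python with the figures from this card. The variable names are illustrative; the critical value 2.021 is the one quoted on the card.

```python
import math

n = 41
sample_mean = 56.15
sample_sd = 11.27
mu_0 = 60                      # mean proposed by H0

sem = sample_sd / math.sqrt(n)         # standard error of the mean
t = (sample_mean - mu_0) / sem         # negative: the sample mean is below 60

CRITICAL_T40 = 2.021                   # value quoted on the card
reject_h0 = abs(t) > CRITICAL_T40
print(round(sem, 2), round(t, 2), reject_h0)
```

The sign of t only says which side of 60 the sample mean falls on; the decision uses its magnitude.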

correlation and outliers

PMCC is vulnerable to outliers
ex. r = -0.64 without the highest point (an outlier), a moderate correlation; r = -0.44 with the point, which weakens the correlation
appropriate treatment of outliers is subjective - in a physics experiment, might think that outliers were actually indicative of the breakdown of a linear relationship
-are our outliers more likely to be "errors" in some way? there's a risk of omitting outliers just because they don't fit our theory
-must have a strategy for dealing with outliers before running the experiment ex. "delete all data points more than 2.5 SD from the appropriate mean" or "replace extreme values with mean + 2 SD" ...
-or use methods that are robust to outliers ex. non-parametric tests

violation of independence

Panel data: why would it make a difference if the pool of potential panelists is finite?
-when drawing from a finite pool, each man drawn lowers the number of men still available, so the probability of drawing another man goes down
Ebola vaccine trial: under what circumstances might this 16-0 split not be so unlikely?
-substrains of Ebola exist, therefore you can't assume all cases of Ebola are independent of one another
-the delayed 16 cases could have been related cases
*this violation of independence makes the 5-0 panel stronger evidence against H0, but would make the 16-0 case split weaker evidence... so the direction of the effect can vary

example mann-whitney u test

RTs from eight participants in each of two conditions, trained and untrained
▪ H0: training has no effect on RT
▪ Data: Trained: 432, 379, 400, 321, 476, 356, 388, 446; Untrained: 450, 489, 340, 430, 466, 480, 370, 499
place these in a single ranked list as follows:
• 321 (T), 340 (U), 356 (T), 370 (U), 379 (T), 388 (T), 400 (T), 430 (U), 432 (T), 446 (T), 450 (U), 466 (U), 476 (T), 480 (U), 489 (U), 499 (U)
ranks of trained responses: 1, 3, 5, 6, 7, 9, 10, 13; sum = 54
ranks of untrained responses: 2, 4, 8, 11, 12, 14, 15, 16; sum = 82
the rank sum R is placed in the formula U = R - n(n+1)/2
trained group: R = 54, n = 8, U = 54 - (8 x 9 / 2) = 18
untrained group: R = 82, n = 8, U = 82 - 36 = 46
each U equals the number of values in the other group exceeded by values in that group
the two values add up to 64, because each value in the trained group "competes" once with each value in the untrained group: 8 x 8 = 64 "competitions", of which the trained group "wins" 18
take the smaller value, 18, as the test statistic
critical value of U for 8x8 = 13; 18 exceeds 13, so cannot reject the null
normal approximation (dubious for small n): Z = 1.42, p = 0.156
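The ranking and the two U values can be checked with a short script. A sketch under two assumptions that are not stated on the card: the data contain no ties (true here), and the normal approximation uses a continuity correction of 0.5, which is what reproduces the quoted Z of about 1.42.

```python
import math

trained = [432, 379, 400, 321, 476, 356, 388, 446]
untrained = [450, 489, 340, 430, 466, 480, 370, 499]

# Rank the pooled values (no ties in this data, so rank = sorted position).
pooled = sorted(trained + untrained)
rank = {value: position + 1 for position, value in enumerate(pooled)}

r_trained = sum(rank[v] for v in trained)
r_untrained = sum(rank[v] for v in untrained)

n1, n2 = len(trained), len(untrained)
u_trained = r_trained - n1 * (n1 + 1) // 2
u_untrained = r_untrained - n2 * (n2 + 1) // 2

# Cross-check: count the pairwise "competitions" in which the trained RT is larger.
wins = sum(1 for t in trained for u in untrained if t > u)

# Normal approximation with continuity correction (an assumption, see above).
u = min(u_trained, u_untrained)
mean_u = n1 * n2 / 2
sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (abs(u - mean_u) - 0.5) / sd_u
p_two_tailed = math.erfc(z / math.sqrt(2))
print(r_trained, r_untrained, u_trained, u_untrained, wins)
print(round(z, 2), round(p_two_tailed, 3))
```

The pair-counting loop is the "competitions" idea from the card made literal, and it agrees with the rank-sum formula.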

z-test for difference

Z = (423.6 - 390.4) / 8.94 = 3.71
mean unprimed RT = 423.6 ms; mean primed RT = 390.4 ms; divide the difference by SEM = 8.94
we would also get this by plugging the relevant numbers into our standard formula Z = (x̄ - μ) / (σ / √n)
here x̄ = 33.2 ms (the observed mean difference), μ = 0 (the hypothesised difference), n = 10 (the number of pairs), σ = √800 (the hypothesised SD of the differences)
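A sketch of where the 8.94 comes from, assuming (as in the response-time example elsewhere in these notes) a known population SD of 20 ms for each individual observation. Variable names are illustrative.

```python
import math

sigma = 20            # assumed known SD of a single observation, in ms
n_pairs = 10

var_difference = 2 * sigma ** 2                 # 800: variance of one primed-unprimed difference
var_mean_difference = var_difference / n_pairs  # 80: variance of the mean of 10 differences
sem = math.sqrt(var_mean_difference)            # about 8.94

observed_difference = 423.6 - 390.4             # mean unprimed minus mean primed RT
hypothesised_difference = 0

z = (observed_difference - hypothesised_difference) / sem
print(round(sem, 2), round(z, 2))
```

Dividing √800 by √10 gives the same standard error, which is why the two routes on the card agree.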

z-score

a measure of how many standard deviations you are away from the norm (average or mean) reject our null hypothesis (at p < 0.05) if the score exceeds 1.96 in magnitude don't usually know what the standard deviation is supposed to be, for instance: we'll usually want to estimate this too

inferential statistics

allow us to draw broader, probabilistic inferences about what's going on in the world, given the data that we observe numerical data that allow one to generalize- to infer from sample data the probability of something being true of a population

model comparison

analysis of the whole model is a form of model comparison - comparing the predictive power of the whole model with that of an "empty" model (y = constant + error) • conceptually similar to the comparison of any two nested models ex. in linear regression using predictor variable to see if there's a sufficiently good improvement to justify having it nested - the predictors in one model are a subset of the predictors in the other model idea: look at how much of the variance is explained by the simple model, -see how much leftover variance is explained by the more complex model - form an opinion about whether the additional complexity is worth including in the model

ANOVA

analysis of variance
t-test can be used to compare two group means; could also do this using a regression analysis with one dummy-coded variable
both are the same thing: we're asking whether it's worth including "group" as a factor in the analysis
could compare more than two group means by using multiple dummy variables!
- same as a multiple regression: compare the full model with an empty model, like z-test
- can evaluate the usefulness of this model as a whole, compared to not coding group membership, using an F-test
- if the critical value is exceeded, the F-test indicates that the more complex model explains enough variability to be worthwhile
- this is the ANOVA: a generalization of the t-test to multiple groups, and a special case of multiple regression in which all the variables are dummies encoding group membership on a nominal variable

assumptions for linear regression

assume that there is a linear relationship between the variables need residuals to be normally distributed, with variance in the d.v. the same across values of the i.v. called "homoscedasticity," necessary for the parameter estimates to be reliable (although regression is reasonably robust to violations of this, i.e. "heteroscedasticity") we don't need the independent variable to be normally distributed - it can even be nominal regression with one two-level IV equivalent to a t-test

independence

assumption that observations do not influence one another: knowing one outcome tells you nothing about another ex. if P wins Wimbledon, M will be runner-up - these are not independent observations

difference in response time z-test

assuming normality, primed minus unprimed time is
-normally distributed
-with mean 0
-and variance 2σ², where σ² is the population variance
if the population variance is known we could do a Z-test
ex. σ = 20
-then variance for each observation = 400
-for each difference, variance = 800
-for the sum of the 10 differences, variance = 8000
-for the mean of the 10 differences, variance = 80 (SD ≈ 8.94)

ANOVA f-score

between groups mean square divided by within groups mean square ratio describing how much better the model predictors are at explaining the variance than just arbitrary predictors would be how much better model is than chance admits a significance test (result in final column) - f test is the ratio of the explained variance to the unexplained variance - in general, arbitrary predictors would be somewhat effective in accounting for variance - could account for all the data if we expended n-1 degrees of freedom enables us to state the values of n-1 data points, from which the last one would be inferable offers one possible basis for comparison: how much better (if any) is our proposed i.v. in accounting for variance in the d.v. than just stating the value of one data point would be?

ANOVA degrees of freedom

between groups: number of groups minus 1 (equivalently, the number of predictors in the model)
within groups: what's left, i.e. # participants minus 1, minus the between-groups df
total: one less than the number of data points (the mean uses 1 df)

eliminating systematic errors

bias -make sure that participants are given the same instructions in each condition - e.g. whether to respond "carefully" or "quickly" - also watch for order effects

r ANOVA process

calculate between-groups variability, BGSS
grand mean = 9; n = 6 participants; 3 conditions
take each condition mean away from the grand mean, square the differences, sum, and multiply by n:
((condition 1 - 9)^2 + (condition 2 - 9)^2 + (condition 3 - 9)^2) x n = (2^2 + 1^2 + 3^2) x 6 = 84
calculate within-group variability, WGSS
subtract the condition mean from each datapoint, square, and sum; then add conditions A + B + C:
(1 + 0 + 4 + 4 + 0 + 1) + (4 + 1 + 1 + 0 + 4 + 4) + (9 + 1 + 9 + 4 + 1 + 0) = (10 + 14 + 24) = 48
want to explain the variability within participants, so calculate between-subjects variability, BSSS
take each participant's mean, (A + B + C) / 3, subtract the grand mean, square, sum, and multiply by the # of conditions:
((9 - 9)^2 + ...) x 3 = (0 + 0 + 0 + 0 + 1 + 1) x 3 = 6
error = WGSS - BSSS = 48 - 6 = 42
df for BGSS: one less than the number of conditions, 3 - 1 = 2
df for the error term: (# of conditions - 1) x (n - 1) = (3 - 1) x (6 - 1) = 10
mean sum of squares between groups = 84 / 2 = 42
mean sum of squares for error = 42 / 10 = 4.2
F-value = 42 / 4.2 = 10
•compare this to the CV of F(2,10) in the usual fashion: significant
*significant difference because it's a repeated measure; would find this in paired t-tests too; effect driven by scores in the "after" condition being higher
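The bookkeeping above can be sketched in Python. The raw scores are not given on the card, so this starts from the squared deviations listed there; all variable names are illustrative.

```python
n_participants = 6
n_conditions = 3

# Squared deviations as listed on the card (raw scores are not given).
condition_sq_dev = [2 ** 2, 1 ** 2, 3 ** 2]      # condition mean vs grand mean
within_sq_dev = [
    [1, 0, 4, 4, 0, 1],                          # condition A: datapoint vs condition mean
    [4, 1, 1, 0, 4, 4],                          # condition B
    [9, 1, 9, 4, 1, 0],                          # condition C
]
subject_sq_dev = [0, 0, 0, 0, 1, 1]              # participant mean vs grand mean

bgss = n_participants * sum(condition_sq_dev)    # between-groups sum of squares
wgss = sum(sum(row) for row in within_sq_dev)    # within-groups sum of squares
bsss = n_conditions * sum(subject_sq_dev)        # between-subjects sum of squares
error_ss = wgss - bsss                           # what remains after removing subjects

df_between = n_conditions - 1
df_error = (n_conditions - 1) * (n_participants - 1)

ms_between = bgss / df_between
ms_error = error_ss / df_error
f_value = ms_between / ms_error
print(bgss, wgss, bsss, error_ss, df_between, df_error, f_value)
```

Removing the between-subjects variability from the within-group term is what distinguishes this from the independent-groups calculation: the error term shrinks, so the same condition differences give a larger F.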

regression coefficients

can use t-test on each coefficient; can perform some kind of model comparison: the more complex model will always predict the data better, but does it do well enough to justify its extra complexity? in complex models it's sometimes difficult to establish the effect of any one variable ex. we want to explore the effects of age and years in school, which are completely correlated with one another

model comparison one way ANOVA

comparing a model which represents group membership with one which does not • use of the F-test -examine average amount of variance explained by each of the (n-1) predictor variables, compared to chance - reject the null hypothesis (that these variables are not explanatory of variance) if they do well enough if significant, conclude that the variables are justified as part of the model, meaning that group membership influences the d.v.

one way ANOVA

comparing means of more than two groups, using the same logic as multiple regression
H0: variables are not explanatory of variance/difference
- generalizes the idea of t-tests to more than two groups, offering an OMNIBUS analysis
design: more than two groups -balanced in size -one observation per participant ex. each group receives a different training, and the dependent variable is how they do in a test
assumptions: normality of residuals; homoscedasticity; independence of observations ▪ (might be able to transform data to do something about normality, and ANOVA is quite robust to violations in any case... but fixing independence violations won't be so easy)
with n groups, we use n-1 dummy variables, each with two levels, to represent the group membership
ex. for four groups A-D: variable "is this data from group A?", variable "is this data from group B?", variable "is this data from group C?"
identify group membership from the variable settings (A=1,0,0; B=0,1,0; C=0,0,1; D=0,0,0)
• recast "is there a difference in the means?" as "does knowing about group membership help you in predicting the observed values of the dependent variable?"

example one way ANOVA

comparison of 4 different modalities on 4 different groups of subjects (n = 6 observations in each group)
calculate the mean per group and take a grand mean
BGSS = squared differences between each group mean and the grand mean, multiplied by n each time = 6 x (1 + 1 + 4 + 4) = 60
WGSS = sum of squares of each observation minus its own group mean = 60
degrees of freedom: between groups, 3 (= 4 - 1); within groups, 20 (= 4 x (6 - 1))
between-group mean square = 60/3 = 20
within-group mean square = 60/20 = 3
F-score = 20/3 = 6.67
critical value of F(3, 20) = 3.10 (0.05 level), 4.94 (0.01 level)
H0 rejected: significant difference between the groups
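A sketch of the one-way ANOVA arithmetic from raw scores. The card does not list its raw data, so the three small groups below are hypothetical; the last lines simply re-use the sums of squares quoted on the card.

```python
def one_way_anova(groups):
    """Return (BGSS, WGSS, df_between, df_within, F) for a list of groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]
    # Between groups: squared distance of each group mean from the grand mean,
    # counted once per observation in that group.
    bgss = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: squared distance of each observation from its own group mean.
    wgss = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (bgss / df_between) / (wgss / df_within)
    return bgss, wgss, df_between, df_within, f


# Hypothetical scores for three groups (not the data behind the card).
demo = one_way_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(demo)

# The card's own summary figures: BGSS = 60 on 3 df, WGSS = 60 on 20 df.
card_f = (60 / 3) / (60 / 20)
print(round(card_f, 2))
```

In the hypothetical data the third group sits well away from the other two, which is why the between-groups mean square dwarfs the within-groups one.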

critical regions

consist of the extreme sample outcomes that are "very unlikely" reject the H0 if the statistic we calculate lies in the critical region

chi square and data

could also use this method to analyze ordinal, interval or ratio data, suitably recoded
ex. in the exam marks example, could replace the raw scores with grades, and see whether the distribution of grades had changed
*however, a few limitations to be aware of
-we'd be losing power in recoding ratio data in that way
-chi-square test not reliable if the expected values are too small in too many cells, as that would inflate the test statistic
-need a correction in that case, and also for 2x2 tables: Yates's correction is popular (reduce the absolute value of each difference by 0.5 before squaring); this is automatically done

linear regression and non-linear relationships

could transform the data and then fit a straight line ex. x -> log x ("taking logarithms", "log-transformation") - various options for treating percentage data (e.g. arcsine transformation, "log-odds" transformation)

dealing with non-normality

could use non-parametric tests, which make no distributional assumptions ex. sign test; more sophisticated examples to follow -could transform the data, in order to bring it closer to a normal distribution - usually involves taking logarithms -but, for a sufficiently large sample, may still be able to use parametric tests: the sample mean is, in some relevant sense, "better-behaved" than the individual data points

critical values of t-test

depend on the degrees of freedom, n-1
2 pairs in a paired t-test - 1 degree of freedom
3 pairs - 2 degrees of freedom
10 pairs - 9 degrees of freedom
with a larger sample the critical value becomes smaller; with a small sample we need a more extreme result to reject the null
for a large enough sample, critical values are almost identical to those for the Z-score, essentially because it becomes almost certain that the sample SD is an accurate estimate of the population SD at that stage

blue monday experiment

determining most depressing day of the year through travel time, delays, time spent on cultural activities, time spent relaxing, time spent sleeping, time spent in a state of stress, time spent packing, time spent in preparation - complex outcomes (such as most of the things we measure) tend to have many causes / contributory factors want to explore many distinct factors simultaneously, discrete, or continuous

binomial distribution

doesn't consider the magnitude of the differences; used for paired data: just sees which member of each pair is larger, +/-; same as sign test

generality of central limit theorem

doesn't just apply to a "fair", symmetrical binomial distribution
- could approximate the number of 6s you get when throwing a fair die n times: as the sample gets bigger, the skewness goes away
- could even approximate the number of jackpots you win when buying a lottery ticket for n consecutive draws... given a large enough sample (which in this case is huge - the distribution's obviously totally asymmetrical for small n)
*given a large enough sample, we end up with something like a normal distribution, regardless of what's underlying
- useful, because often we really don't know what the underlying distribution looks like: maybe it's just a big mess...
- we can use our sample to estimate the variance, and appeal to the CLT to tell us that our sample mean is approximately normally distributed

r ANOVA assumptions

don't assume independence of observations, because the measures are repeated; otherwise as normal: approximate normality of residuals, etc.; an appropriate "pairing" of observations; require sphericity - equal variance of the differences between all combinations of related groups; SPSS applies a correction if this fails

dummy variable

don't really measure anything; not really numbers - there isn't a sense in which "male" or "female" corresponds to one unit on the measure of "gender" or "sex" - decision to label these as (say) zero and one is completely arbitrary *only crucial thing is that they have different values - stand-in for qualitative facts that need to be represented in this quantitative model - a variable for which all cases falling into a specific category take the value 1, and all cases not falling into that category take the value 0

cause and effect

effects have multiple causes - we don't usually know which is important in any given instance we don't know whether the cause is the thing we were trying to manipulate ex. suppose we find that longer words take longer to read. what's the obvious explanation? what are others?

two way ANOVA process

equal variances between groups, normal distribution of error terms, independence of observations, etc... - we partition up the sum of squares of the residuals, just as before, but now over two variables (and their interaction) df will be calculated in the usual way conduct a set of distinct F-tests, the baseline being the mean square for "error" ratio of mean square for each variable to that mean square is the appropriate F-statistic each time can tell whether each i.v. separately - and the interaction term - are useful predictors given the number of degrees of freedom that they consume

using a linear model

equation of a straight line: y = a + bx
b = gradient - number of units that y increases by when x increases by one unit
a = (y-)intercept - that is, the value that y takes when x = 0
H0: gradient = 0; H1: gradient is non-zero
- could use a regression analysis as a basis for predicting the value of y given the value of x
ex. what is the increased mortality risk associated with smoking some specified number of cigarettes per day?
CVD risk: intercept = 605.4, slope = 14.5 - the intercept indicates that the risk is already high, not near 0, showing people are already susceptible to CVD; the slope shows an increase in risk of 14.5 for every 10 cigarettes a day
lung cancer: intercept = 1.6, slope = 11.26 - the intercept indicates a low initial risk of lung cancer, so the steep slope shows that cigarettes are directly related to the increased risk of lung cancer

standard error of the mean

equivalent to the SD of the sample mean; estimated as the sample SD divided by √n; tells us which sample means are likely to occur under the null

chi squared test for ratio data

ex. could test whether the classes allocated match expectation/target
class: 1, 2.1, 2.2, 3 - target: 30%, 40%, 15%, 15%
this throws away a lot of important data! (raw marks 49, 50, 45....)
*order of classes makes no difference: 30/30/15/25 and 30/30/25/15 give the same chi2 value against this target, and each adds to 100%
chi2 doesn't account for the complexity of the data...

mixing qualitative and quantitative linear regression

ex. explore effects of gender and age in a regression model - like trying to fit two parallel regression lines, one to the female data and one to the male data "Is there an effect of age?" = "do the lines have non-zero gradient?" "Is there an effect of gender?" = "Is there a non-zero distance between the lines, in the y-direction?" // ("Or could we get away with just one line?") ("Is there an interaction?" ~ "What if the lines didn't have to be parallel - would that be better?")

repeated measures

experiments in which an individual subject is exposed to more than one level of an experimental treatment collecting multiple data points from the same participant isn't the same as using different participants • V1: I ask 500 people whether that sentence is grammatical • V2: I ask one person 500 times...

t-test

expresses how many units the sample mean is away from the H0 population mean
units here are the estimated SD of the sample mean (again, sometimes called SEM): we use the given data to estimate the variance
because we're estimating, with a low sample size we need a large t-score to reject H0; a larger sample makes the estimate more reliable

confounds

factors that undermine the ability to draw causal inferences from an experiment where there's a different explanation for an outcome than the one we're interested in ex. the jungle versus a jungle...

type II error

failure to reject a false null

normal distribution

family of probability distributions with two parameters, mean (μ) and SD (σ) - we can use this to work out how likely it is that a datapoint will fall within a certain range ex. there is a 68% chance it will lie within one s.d. of the mean -a 95% chance it will lie within 2 s.d.s of the mean (more precisely, 1.96) -a 99.7% chance it will lie within 3 s.d.s of the mean (all irrespective of what the mean and s.d. are) - critical value is 1.96 at p = 0.05 (two-tailed), the 95% level

normal approximation to binomial

for a single trial - two equiprobable outcomes, 1 and 0, p = 0.5 - expected value 0.5 - variance 0.25
• distribution of the total of n such variables - approximately normally distributed with mean n/2, variance n/4
• mean of n such variables - approximately normally distributed with mean 1/2, variance 1/(4n)
• test for a coin's fairness by spinning it 1000 times and counting the number of 'heads' - mean outcome ~ N(1/2, 1/4000): SD = 0.0158 - reject H0 (coin is fair) if more than 1.96 SDs away from the mean in either direction ex. if the mean is outside (0.469, 0.531) - same region as binomial critical regions
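The acceptance region for the coin example can be reproduced in a few lines. A sketch only; `reject_fair` is a hypothetical helper name, and the head counts passed to it are made-up inputs.

```python
import math

n_spins = 1000
p = 0.5                                   # probability of heads under H0

var_single = p * (1 - p)                  # 0.25 for one spin
var_mean = var_single / n_spins           # 1/4000 for the mean of 1000 spins
sd_mean = math.sqrt(var_mean)             # about 0.0158

lower = p - 1.96 * sd_mean
upper = p + 1.96 * sd_mean


def reject_fair(heads, n=n_spins):
    """Reject H0 (fair coin) if the proportion of heads falls outside the region."""
    proportion = heads / n
    return proportion < lower or proportion > upper


print(round(sd_mean, 4), round(lower, 3), round(upper, 3))
print(reject_fair(540), reject_fair(520))
```

So, under these assumptions, 540 heads in 1000 spins would lead to rejecting fairness at the 5% level, while 520 would not.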

linear regression coefficients

gradient interpreted as a measure of effect size - basis for a statistical test can calculate a t-statistic from it and run a t-test -H0: no relation (gradient = 0) - i.e. the t-statistic is the coefficient (minus the null hypothesized value of it, usually zero) divided by its standard error - standard error can be expressed in terms of the squares of the residuals

homoscedasticity

homo means "same" and -scedastic means "scattered" therefore homoscedasticity assumption of equal variances the constancy of the variance of a measure over the levels of the factors under study.

z-test

hypothesis-testing procedure in which there is a single sample and the population variance is known - critical values at the 5% level are +/- 1.96, and so on for other levels: +/- 2.58 (1% level) - standard for testing whether a sample comes from a population of known mean/variance

example wilcoxon signed rank test

ignoring the zero, now have n=11
-calculate differences between groups A and B, and rank them by size; tied differences share points (take the average of the tied ranks)
{+1, +1, -1, -1} - 1st, 2nd, 3rd, 4th each get 2.5 points (10 / 4)
{+2, -2, +2} - 5th, 6th, 7th each get 6 points (18 / 3)
{+3} - 8th gets 8 points
{+4, +4, +4} - 9th, 10th, 11th each get 10 points (30 / 3)
sum of the positive ranks: 2.5+2.5+6+6+8+10+10+10 = 55
sum of the negative ranks: 2.5+2.5+6 = 11
total = 66 (correct for n=11, since 11 x 12 / 2 = 66)
test statistic is the smaller sum, 11
for n=11, two-tailed test at 5% level, critical value = 10: reject only if the statistic is at or below this, so here do not reject H0
NB: normal approximation gives Z = 1.956 (sample small here)
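The point-sharing procedure can be sketched as follows. The list of differences is reconstructed from the groups shown on the card, with one zero added back in to show it being dropped; the function name is illustrative, and the Z value uses the plain normal approximation with no tie or continuity correction, which is an assumption chosen because it matches the quoted 1.956.

```python
import math


def signed_rank_sums(differences):
    """Return (positive rank sum, negative rank sum, n) with ties sharing ranks."""
    nonzero = [d for d in differences if d != 0]      # drop "no difference" pairs
    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        shared = (i + 1 + j) / 2      # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = shared
        i = j
    positive = sum(r for d, r in zip(ordered, ranks) if d > 0)
    negative = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return positive, negative, len(ordered)


differences = [+1, +1, -1, -1, +2, -2, +2, +3, +4, +4, +4, 0]
pos_sum, neg_sum, n = signed_rank_sums(differences)
statistic = min(pos_sum, neg_sum)

# Plain normal approximation (assumption: no tie or continuity correction).
mean_w = n * (n + 1) / 4
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (statistic - mean_w) / sd_w
print(pos_sum, neg_sum, n, statistic, round(abs(z), 3))
```

The two rank sums always add up to n(n+1)/2, which gives a quick check on the hand calculation.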

independence of coefficients multiple regression

in mult. reg. the coefficient estimated for a given variable will depend upon what else is in the model says whether one explanation looks good depends on what other explanations are being considered single variable might look like a significant predictor when it's on its own (no other explanation for the outcome), but turn out non- significant in a more complex model trying to ascribe the variability in the outcome to the various different possible causes

chance

in that sense would correspond to f = 1; critical value will be somewhere above this

OMNIBUS analysis

includes all the data - evaluation of the whole model versus a null model
significance means it's better to include group membership in the model than to exclude it
when there are just two groups (analogous to the t-test), this amounts to saying that there's a significant difference in the means
with more than two groups, we know that there's a difference somewhere... we just don't know exactly where!
follow up with pairwise comparisons using t-tests - running the omnibus test first lessens the risk of type I error - safer than looking through all the possible pairwise combinations for something that might be "significant"

a specific linear model

introduces a specific set of potential contributory explanations for the observed value of the d.v. ex. effects of age, gender and frequency on reading time - means that we will take the variability that is observed in reading time, in our experiment, and sort it into variability due to age, gender, frequency, and "error" (not otherwise explained) - if one of the first three categories (age, gender, frequency) fails to explain "enough" of the variability, we'll drop it from the model and abandon that explanation

multiple regression

like simple linear regression...but a bit more difficult to explain on a graph: trying to minimize residuals - one independent variable, posit that the d.v. is the sum of: ▪ a multiple of the independent variable ▪ a constant ▪ an error term (assumed to be normally distributed, mean 0) - two independent variables, posit that the d.v. is the sum of ▪ a multiple of one independent variable ▪ a multiple of the other independent variable ▪ a constant ▪ an error term (same assumptions) ▪y = b1x1 + b2x2 + b3x3 + ... + constant + error

imprecision of t-test

limitation is that we're estimating the SD SEM estimate could be particularly unreliable if we have a very small sample ex. I sample two exam marks from the class at random. I might get {64, 65}, which would make it look like there's very little variance...or {50, 80}, making it look like there's masses of it... therefore need large sample size it's possible to get relatively extreme t-scores under the null with a small sample, because the estimate of the variance might be much too low need very large t-scores to reject the null when the sample size is low

hypothesis

links two variables

limitations of two way ANOVA

many levels of each i.v. interaction term can end up associated with many degrees of freedom potential of "false negatives" - no evidence of interaction, if the interaction exists only in one place ex. medication that's effective only against one condition doesn't really arise in the 2x2 case

heteroscedasticity

meaning unequal variances

kurtosis

measure of how "heavy-tailed" the distribution is, compared with a normal, bell-shaped curve - normal distribution has kurtosis = 3 - heavy-tailed distribution has kurtosis > 3 ("leptokurtic") - light-tailed distribution has kurtosis < 3 ("platykurtic")

skewness

measure of the (a)symmetry of the distribution zero skewness = symmetrical distribution negative skew: left tail longer/weightier, mass to the right positive skew: right tail longer/weightier, mass to the left
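
A minimal sketch of how skewness and kurtosis can be computed from central moments, assuming the population (divide-by-n) form in which a normal distribution has kurtosis 3. The two small data sets are made up purely for illustration, and the function names are not from any particular package.

```python
def central_moment(data, k):
    """k-th central moment, dividing by n (population form)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    # third central moment scaled by the SD cubed
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    # fourth central moment scaled by the variance squared
    # (normal distribution = 3 under this definition)
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]       # illustrative data, no skew
right_tailed = [1, 1, 1, 2, 10]   # illustrative data, long right tail

print(skewness(symmetric), skewness(right_tailed), kurtosis(symmetric))
```

The symmetric set gives zero skewness, the set with one large value gives positive skew, and mirroring that set would flip the sign.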

Pearson's correlation

measure of the extent to which two factors vary together; tells how well either factor predicts the other. correlation is useful in expressing the relationship between continuous variables; use Pearson's test to measure this -correlated data appear in opposite quadrants (relative to the two means) positive correlation: when the x-value is above the mean, so is the y-value, and vice versa negative correlation: when the x-value is above the mean, the y-value is below the mean, and vice versa. PMCC takes values in range [-1,1]: -1 perfect negative correlation, +1 perfect positive correlation, 0 no correlation ex. r = -0.775 is a negative correlation, slope goes \ ; r = 0.554 is a large correlation. usually descriptive, sometimes inferential descriptive: strength/robustness of relationship inferential: whether there is a linear relationship at all. r = 0.1 as small effect, r = 0.3 medium, r = 0.5 large. r^2 = proportion of variance in d.v. explained by i.v. r^2 = 1 (d.v. value predictable from i.v. value) ex. a 43 year old will be able to hear 140000 hz -based on the linear plot r^2 = 0 (d.v. value unpredictable from i.v. value)
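
A minimal sketch of the product-moment correlation coefficient, computed from deviations about the two means. The age and frequency values are invented for illustration only (they are not data from the course), and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # sum of products of deviations: positive when x and y are
    # on the same side of their means, negative otherwise
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# made-up example: age (years) vs highest frequency heard (Hz)
ages = [20, 30, 40, 50, 60]
max_freq = [18000, 17000, 15500, 14000, 12000]

r = pearson_r(ages, max_freq)
r_squared = r ** 2   # proportion of variance in the d.v. explained by the i.v.
print(r, r_squared)
```

With these invented numbers r comes out negative and close to -1, so r^2 is close to 1.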

p-value

measure of the strength of the evidence against the null hypothesis probability of getting result as extreme or more than the actual result, under the assumption that the null hypothesis is true not probability that the null hypothesis is true

ratio data

measured along an objective scale twice the number represents twice the quantity ex. weight, no negative value

main effect vs. interaction

must be cautious in affirming the existence of these effects in general based on the overall trend when that appears to be due to interactions only

linear regression and degrees of freedom

n-2 degrees of freedom: t(n-2) is the distribution of interest when finding critical values. degrees of freedom explained: if you have two values with a known mean, then knowing one value determines the other, so this system has one degree of freedom. in the regression case we "lose" 2 df because we have to calculate 2 coefficients from our data, the gradient and the y-intercept of the line

descriptive statistics for data types

nominal data can still have a modal value, but median and mean don't make sense. ordinal data can have a median, but the mean may not make sense in the same terms as the data, ex. education level, Likert scale: we can compute the mean of Likert scale data, but it's not always easy to interpret. ratio data - mean, median or mode -can also compute a rich set of measures of dispersion

mann-whitney u test

non parametric, used for non-paired data, small sample idea: two samples, no difference between them, rank order, the two samples should be intermixed if there is a difference, we'll probably have the data from one sample clustering at the top and the other clustering at the bottom -can use for independent variables with many levels -use for ordinal data, but it's also suitable for small samples of non-normal ratio data ex. understand attitudes towards pay discrimination based on gender (i.e., your dependent variable would be "attitudes towards pay discrimination" and your independent variable would be "gender", which has two groups: "male" and "female"). ex. if salaries differed based on educational level (i.e., your dependent variable would be "salary" and your independent variable would be "educational level", which has two groups: "high school" and "university").

ordinal data

not quantitative! ordered data, ex. sometimes, rarely, never sometimes not clear whether our data is ordinal or interval, for instance if we have a Likert scale but we label all the points ("completely/mostly/somewhat unacceptable...")

z-tests and linguistics

not usually what we need although there are exceptions, ex. when we're using this as an approximation to something else two points of difference: -we don't usually know what the standard deviation is "supposed to be" for our data -we don't usually know what the mean is "supposed to be" ex. we could run a test to see whether our reaction time data might be drawn from a population of mean 500ms and SD 50ms - but why would we think that it was??

number of levels

number of different values that a variable can take - applies to "categorical" / "discrete" variables, rather than "continuous" ones e.g. two for right/wrong, three for grammatical gender in German

ex. ANOVA degrees of freedom

observed data are {3, 4, 5, 6, 7, 7, 10}. summary: they have a mean of 6. residuals (by how much each datapoint differs from the mean): {-3, -2, -1, 0, 1, 1, 4}; sum of squares of residuals = 32. alternative description: one datapoint is 3 and the others have a mean of 6.5 - this explains only one datapoint exactly, because the other datapoints still vary. residuals: {0, -2.5, -1.5, -0.5, 0.5, 0.5, 3.5}; sum of squares of residuals = 21.5 - we've spent a degree of freedom and cut the sum of squares by 10.5. ex. if instead we say one of them is 6 and the others have a mean of 6, we'd still have the same residuals we started with, despite spending a degree of freedom. doing this at random (picking a datapoint) will sometimes cut the residual sum of squares a lot, sometimes not so much, occasionally not at all... how effective is the linear model compared to an arbitrary model? is it just chance? use the f-test
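
The arithmetic in this example can be checked with a short sketch. `residual_ss` is a hypothetical helper; the data are the seven values given above.

```python
def residual_ss(values, fitted):
    """Sum of squared differences between observed and fitted values."""
    return sum((v - f) ** 2 for v, f in zip(values, fitted))

data = [3, 4, 5, 6, 7, 7, 10]

# description 1: every datapoint is modelled by the overall mean
grand_mean = sum(data) / len(data)
ss_one_mean = residual_ss(data, [grand_mean] * len(data))

# description 2: the first datapoint is fitted exactly (one df spent),
# the remaining six are modelled by their own mean
rest = data[1:]
rest_mean = sum(rest) / len(rest)
ss_split = residual_ss(data, [data[0]] + [rest_mean] * len(rest))

print(ss_one_mean, ss_split, ss_one_mean - ss_split)
```

Picking a different datapoint to fit exactly (for instance the 6) would change `ss_split`, which is the point of the example: an arbitrary extra parameter sometimes reduces the residual sum of squares a lot and sometimes not at all.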

interactions

often treated in the same way, as just another variable ex. age and gender start with a model in which A and B are posited as independent predictor variables if both significant, consider adding a term (designated A:B or A*B) which encodes the interaction between A and B way of capturing interaction in model is a "variable" if predictive, if not useful predictive power, discard yet, can have significant interactions in the absence of significant "main effects" - exert some caution in interpreting these results...

one tailed vs two tailed

a one-tailed test is weaker than a two-tailed test, for a given size of critical region -if an outcome is in the critical region for a two-tailed test, it will also be in the critical region for the corresponding one-tailed test (in that direction) - so one-tailed is usually regarded as a less stringent criterion: it is like choosing a larger p-value as critical in a two-tailed test - one-tailed is better for looking at bias in one direction, ex. men vs women - can't choose any other critical values

r-ANOVA differences

one way ANOVA generalizes the idea of t-tests to more than two groups, offering an omnibus analysis by "t-tests" there we mean unpaired two-sample t-tests generalizing other kinds of test? ex. we want to evaluate whether some training has immediate effects and/or delayed effects - administer a test prior to training, and a (different) test after before and after comparison straightforward... classic setup for the paired t-test (or nonparametric alternatives) want to consider performance a /week/ later, or a month later, etc.? - run repeated paired t-tests, but again we would like to do an omnibus analysis to avoid inflated type I error

over-fitting

posit an extra term in a misguided attempt to explain away variance that is actually due to some other cause must evaluate whether our "improvement" in fitting is good enough in explaining the variance to justify positing the extra term

collinearity

predictor variables are correlated, can't tell which of them is associated with the effect overall model will still "work", but standard errors of coefficients will be very large - lower t-values, less potential to reject H0 that the coefficient might actually be zero extreme case: one independent variable is perfectly correlated with another - then impossible to tell which factor is relevant or causing the effect problem with the data, rather than with the method ex. want to test whether reading time was predicted by education level and/or vocabulary...who would you need to recruit? - solvable by having appropriate control groups, given our research question...

discrete distribution

probabilities of individual outcomes can be read off the y-axis (bars add up to 1) ex. binomial distribution

continuous distribution

probabilities of outcome ranges correspond to area under the graph (total area = 1) ex. normal distribution

power

probability of not making a type II error if the H0 is false choosing a lower p-value as the threshold for significance lowers the risk of Type I error but raises the risk of Type II error can make tests more powerful by having larger samples

effect size in linear regression

r = 1 or r = -1 means that all the points lie on a straight line of non-zero gradient in principle this could be any non-zero gradient, even a very small one actual effect of the i.v. on the d.v. could, in principle, be small even with a very strong correlation • In any case, inferential statistics would just aim to show that r is reliably non-zero wouldn't say anything about the strength of the correlation (r could be very small)

linear regression example reading time

reading time: mean in condition A = 450ms, mean in condition B = 500ms, overall mean = 475ms • pick a participant at random - what's our best guess as to what reading time they recorded? if condition unknown, guess 475ms; if condition known, either 450ms or 500ms - the latter would be a better guess. if a new participant is added to the study, would our best guess about their reading time depend on which condition they were assigned to? ex. how tall is Jo? vs. what color eyes does Jo have? ▪ info on gender would influence our guess on height, but not on eye color

type I error

reject a true null hypothesis, if the result ends up falling in the critical region p-value we take as the threshold of statistical significance determines how probable this is to happen, if the null is true ex. require p < 0.05 for significance - then we have a 5% probability of rejecting the null even if it's true

central limit theorem

relevant to why the normal distribution is so prevalent in reality (it is a limit distribution for others) - handy in cases where it is possible in principle, but laborious in practice, to calculate critical values, as in the binomial case • the mean of these variables converges to a normal distribution as n gets bigger - so if we take the mean of a large enough sample, it's approximately normally distributed - can work out the mean and SD of that distribution based on the individual variables specifically, it has mean μ and variance σ^2 /n (just like the case where the population is normally distributed too)

B versus β

report "beta", the corresponding standardized coefficient value that B would take if all the variables had been standardized before being entered into the model standardized here means "multiplied by a constant to make the variances equal to 1" beta therefore predicted change in SDs, in the d.v., per SD-sized change in the i.v. can be used as an indication of the relative size of the effects

interval data

scales we often use for gradient judgements, with a finite number of points ex. Likert scales - also don't yield objective data ex. 1 = fully unacceptable and 7 = fully acceptable: that is not "seven times as acceptable"

gambler's fallacy

should we expect improbable sequences of events to be cancelled out by subsequent events? ex. after a coin comes up 'heads' six times, or ten, will the following 1000 tosses make up for it? no - under H0 (a fair coin) we still expect about 500 heads and 500 tails in the following 1000 tosses, whatever came before

within-subjects design

show different experimental conditions to the same people - when we can change the level of the i.v. for a single participant, we can potentially do this type of study - same participant receives multiple different "treatments", corresponding to different levels of the i.v. *offers greater power, but more likely to give rise to issues around a lack of independence of observations

between-subjects design

show only one experimental condition to any given participant - each participant is assigned to a group; different groups get different treatments (for instance, seeing different linguistic materials)

loss of power

sign test ignores the relative size of the differences under H0 we don't only expect that pluses are as likely as minuses, but that +1 is as likely as -1, +2 as -2, and so on -argument is by symmetry, independent of distribution - however likely we are to get a 1-7 result under the null, we're just as likely to get a 7-1 result • If all the big differences were in the same direction, that would constitute potential evidence against the null - majority of differences being in that direction as well would also constitute supporting evidence

interpretation of r ANOVA

significant effect indicating that there's a difference somewhere between conditions - follow up with pairwise comparisons to find where this difference originates - no guarantee that there will be any significant differences in the t-tests, here or elsewhere

two way ANOVA

similar to one way ANOVA if we have two separate independent variables ex. 2 categorical independent variables, each with at least 2 levels d.v. on a ratio scale (and with convenient distributional properties) observations independent (e.g. one observation per participant) ex. wish to explore which of two kinds of training leads to better performance on a test, and also to consider whether gender has an effect and interacts with this - fully-crossed or factorial design is one in which we test all the possible groups in such a design: in this case, four groups - interested in whether there is a difference between the means

random error

small fluctuations between measurements one source of difference between measurements that has nothing to do with our variable of interest we expect differences of some kind even if no manipulation e.g. how long it takes this person to read this word - and note that we're not measuring it ideally, because of variations in the person's mental state (or environmental factors)

welch's t-test

so far homoscedasticity (equal variances) has been assumed; Welch's test is used to test whether two populations have unequal means when the variances are /different/ - slightly different way of combining the variance estimates - different number of degrees of freedom (actually non-integer, derived from a formula) - if we have a statistically significant difference under Welch's test, that can't be because of a difference in the group variances: it can only be because of a difference in the group means - more reliable when the two samples have unequal variances and unequal sample sizes
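
A hedged sketch of Welch's statistic and its non-integer degrees of freedom. The card does not give the formula, so the Welch-Satterthwaite approximation used here is an assumption; the marks are the 2016 and 2017 exam results from the two-sample t-test card, and `welch_t` is a hypothetical helper name.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t and approximate df (Welch-Satterthwaite, assumed here)."""
    # per-group variance of the mean: s^2 / n, kept separate (not pooled)
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

marks_2016 = [54, 58, 52, 66, 40, 65, 58, 70, 43, 55]
marks_2017 = [66, 42, 44, 47, 51, 70, 43, 42, 55, 53, 54, 63]

t_welch, df_welch = welch_t(marks_2016, marks_2017)
print(t_welch, df_welch)
```

Because the two samples have very similar variances, the result is close to the pooled test: t is about 0.87 and the degrees of freedom fall just below the pooled value of 20.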

individual predictors

so far, all treated equally - SPSS effectively enters them all at once into the model. possibility: stepwise entry -assuming uncorrelated predictors, SPSS performs repeated linear regressions, eliminating the weakest predictor each time -could also enter variables in an order informed by theoretical considerations ex. start with the variables of theoretical interest, or those considered especially likely to yield an effect - once established effects associated with these, could explore if adding potentially confounding variables improves fit; if potential confounds are not relevant, could (perhaps justifiably) remove them, in effect collapsing data along that dimension ▪ (that goes for theoretically interesting effects too...)

variable

something that we can measure and that can take different values - can have some number of levels

standard deviation

square root of the variance critical tool for measuring dispersion

variance

standard deviation squared

alternate hypothesis (H1)

statistical hypothesis that offers an alternative to the null hypothesis when the null is rejected. this hypothesis may take on various forms depending upon the nature of the statistical test (t-test, ANOVA, correlation, etc) and the "direction" of the test (one or two tails).

significant

sufficiently strong evidence which allows rejection of the null

ANOVA mean square

sum of squares divided by df: the amount of variance that we've accounted for per degree of freedom expended. big mean square - lots of variance accounted for per degree of freedom, i.e. the model is effective; small mean square - little variance accounted for per degree of freedom, i.e. the model is ineffective

descriptive statistics

summarize data efficiently; statistical procedures used to describe characteristics and responses of groups of subjects ex. mean, median, standard deviation

dispersion

suppose we have data obtained from two test conditions, A and B if the data within each group is tightly clustered (low), and there's a big gap between the typical values in group A and B, intuitively we might feel that there's a difference between the groups by contrast, if the data within each group is spread out and the groups overlap a lot in value, we might doubt that there's a difference

single-sample t-test

suppose we want to check that some results are compatible with a target in which the average mark is supposed to be 60 (no view on the SD) results: {54, 58, 52, 66, 40, 65, 58, 70, 43, 55} mean = 56.1 H0: mean = 60 sample SD = 9.61 n = 10 t = (56.1 - 60) / (9.61 / sqrt(10)) = -1.28 critical value for 9 df at the 5% level (two-tailed) = 2.26 |t| < 2.26, so we cannot reject H0: we cannot rule out the possibility that the average mark of the population from which the sample comes is 60
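
A minimal sketch of the single-sample t statistic, computed directly from the ten marks listed above with the standard library. The critical value 2.262 (two-tailed, 5%, 9 df) is hard-coded from a t table rather than computed, and `one_sample_t` is a hypothetical helper name.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic for H0: population mean equals mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)        # sample SD (n - 1 in the denominator)
    return (mean - mu0) / (sd / math.sqrt(n))

marks = [54, 58, 52, 66, 40, 65, 58, 70, 43, 55]
t_stat = one_sample_t(marks, 60)

CRITICAL_T9 = 2.262                      # from a t table: two-tailed, 5% level, 9 df
reject_h0 = abs(t_stat) > CRITICAL_T9
print(statistics.mean(marks), statistics.stdev(marks), t_stat, reject_h0)
```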

ex. pitch perception and age

test effect of age on the highest frequency perceived various ages recruited and their capability for pitch perception measured (various methods are possible) how to analyze? divide into two groups, "younger" and "older"... or many groups... either way, we would be ignoring differences within categories, and thus throwing away relevant information.. would like to take each data point into consideration, including its precise value (both age and frequency) -use scatterplot to represent -expect linear relationship whether its negative or positive H0 scatterplot is random, no gradient

ex. language and voting

test the relationship between linguistic and political identities with a simple experiment: participants hear a political speech read in either a Scottish or Southern British English accent, then state their voting intention - code voting intention as SNP/Lab/Con/Other. H0: language encountered makes no difference to voting intention. goodness-of-fit test - test whether our data are a sufficiently good fit with that expectation to allow us to retain the null, or whether they fit it badly and require us to reject the null ▪ Total of 110 in condition A, 120 in condition B ▪ Total of 70 for SNP, 59 Lab, 57 Con, 44 Other ▪ Hence e.g. 110/230 = 47.8% of participants were in condition A, 70/230 = 30.4% voted SNP • Null hypothesis says that these are independent proportions ▪ That is, 47.8% of 30.4% of the total should be both condition A and SNP. why not other tests? paired t-test - no, data aren't normally distributed, too few data points, and above all, we already know the means... binomial test (like sign test) - no, we don't know in advance how many people will be in each voting group, plus we want to test all the groups at once ("omnibus analysis")

significance level

the alpha level established before the beginning of a study, p < 0.05 probability of rejecting when H0 is true

how large a sample size?

for the normal approximation to the binomial, it is often said that np and n(1-p) must both reach 5 (or 10) - fair coin, p=0.5: need either 10 or 20 trials minimum - fair die, p=1/6: need either 30 or 60 trials minimum • too small a sample, and the approximation may be inaccurate - consequently, the estimates of the critical values might be inaccurate ex. suppose we want to test whether a die is fair, with respect to the number of 6s that get thrown, based on 10 throws: total number of 6s ~ N(10/6, 50/36) approximately - upper critical value well-defined, lower critical value below zero (in this case, that might lead us to the right conclusion...)
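
A small sketch of the rule of thumb and of the die example, assuming a two-tailed 5% test so that the critical values sit 1.96 SDs either side of the mean. Exact fractions are used for p to avoid rounding trouble at the threshold; the function names are illustrative only.

```python
import math
from fractions import Fraction

def approximation_ok(n, p, threshold=5):
    """Rule of thumb: np and n(1-p) must both reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

def normal_approx_bounds(n, p, z=1.96):
    """Mean, variance and two-tailed critical values of N(np, np(1-p))."""
    mean = n * p
    var = n * p * (1 - p)
    sd = math.sqrt(float(var))
    return mean, var, float(mean) - z * sd, float(mean) + z * sd

p_six = Fraction(1, 6)
mean, var, lower, upper = normal_approx_bounds(10, p_six)
print(approximation_ok(10, p_six), approximation_ok(30, p_six))
print(mean, var, lower, upper)
```

With only 10 throws the lower critical value is negative, which is impossible for a count, showing why the approximation should not be trusted at this sample size.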

nested

the predictors in one model are a subset of the predictors in the other model

use of r ANOVA and questions

the standard way of applying ANOVA to repeated measures designs broader question about the potential for parallelism between how we treat subjects and items, it also depends which participant you are... fact of life! want to do the same kind of generalization over items that we want to do over subjects? - think of the items as being drawn from a population of potential items just as the participants are drawn from a population of potential participants generalize from subjects to the population same goes for the times ex. reading time on words, using just a sample of words generalizing to the population - draw conclusions

paired t-test

the standard way of testing for a difference in means between paired data -assuming normality of distribution (especially when applied to small data sets) mean of differences = 33.2ms (as before) sample SD for differences = 28.8ms (can do in Excel...) t = (33.2 - 0) / (28.8 / sqrt(10)) = 3.64 critical value for t9 at the 5% level (two-tailed) = 2.26 3.64 > 2.26, so reject the null: there is a difference between the means
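
The calculation above can be sketched from the summary figures alone (mean difference 33.2ms, SD of differences 28.8ms, n = 10). `paired_t_from_summary` is a hypothetical helper, and the critical value is taken from the card rather than computed.

```python
import math

def paired_t_from_summary(mean_diff, sd_diff, n, mu0=0.0):
    """Paired t statistic: mean difference over its standard error."""
    standard_error = sd_diff / math.sqrt(n)
    return (mean_diff - mu0) / standard_error

t_paired = paired_t_from_summary(33.2, 28.8, 10)
CRITICAL_T9 = 2.26          # two-tailed 5% value for 9 df, as quoted on the card
print(t_paired, t_paired > CRITICAL_T9)
```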

null hypothesis (H0)

the statistical hypothesis tested by the statistical procedure; usually a hypothesis of no difference or no relationship

ANOVA sum of squares

total: sum of squared differences between each data point and the mean (i.e. the thing we calculate en route to variance); residuals are the differences between the mean and each datapoint, which exist because there's variability. composed of two categories: variance explained by the regression model (between groups) and residual variance (within groups) - small in a good model. if between-groups variance is high, many differences are explained by group membership in the model; if residual variance is high, the variance is explained by other factors

sign test

turn data into set of positive and negative signs. used to test whether or not two groups are equally sized

order effects

type of systematic error occur when the order in which the participants experience conditions in an experiment affects the results of the study -can use counterbalancing, make sure that half the participants see condition A then condition B, while the other half see B then A

testing for normality

uncertain whether or not we actually have a normal distribution, based on a sample. first approach: could look at a histogram - does the resulting curve look like a normal distribution, ex. in its symmetry, general shape, etc.? could also look at certain measures of the distribution: how much of the data falls within one SD of the mean? 2 SDs? we expect the data to lie mostly within 2 SD. Q-Q plots useful here - if the points lie along the diagonal line, the data are approximately normally distributed. can also apply tests (e.g. Kolmogorov-Smirnov, Shapiro-Wilk)

assumptions about population data

underlying distribution is normal

nominal data

unordered categories ex. favorite sports team: couldn't really place an ordering on

ANOVA (analysis of variance)

used for a broad class of statistical tests examining equality of means. appropriately named: variance ~ variability of d.v., analysis ~ breaking up the variance... into sums of squares, df, and mean squares

r ANOVA

used for situations with (more than two) repeated measures, e.g. in longitudinal experiment - taking the variance and trying to break it down into separate components, as always - paired t-test vs unpaired would ignore the difference between the numbers and just look at whether or not there's a difference, ignores the variation, just looks at if there's improvement - variability due to condition (time: which instance of the measurement we're looking at) - within-group variability, multiple observations for the conditions variability between the participants - b/c of repeated measures we have a way of identifying consistent differences between the participants variability due to error (inexplicable) if we can explain away the variability through the conditions compared to variability from error- the model is good,* or if variability from the error explains away the variability from the conditions - adding this extra subcategory (per-participant effects) explains away more of the variability, at a cost of some number of degrees of freedom

chi-square test

used to compare the observed data with the data expected under the null hypothesis -reject if the chi-square value is very high, i.e. exceeds the critical value -calculate the observed and expected values for each cell and thus compute this statistic - just testing for a violation of independence somewhere in the table. calculation of expected values: (column total / total) x (row total / total) x total ex. (70/230) x (110/230) x 230 = 33.5 = expected count for condition A, SNP. statistic: chi^2 = sum over all cells of (observed - expected)^2 / expected, i.e. (observed A,SNP - expected A,SNP)^2 / expected A,SNP + (observed A,Lab - expected A,Lab)^2 / expected A,Lab + ... result: 0.37 + 0.27 + 0.06 + 1.21 + 0.34 + 0.25 + 0.05 + 1.11 = 3.66. df = (2 - 1) x (4 - 1) = 3; critical value for chi^2 with 3 df at the 5% level = 7.81; 3.66 < 7.81, so do not reject the null
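
A sketch of the expected-count calculation, using the marginal totals from the language and voting example (110/120 by condition, 70/59/57/44 by party). The observed cell counts are not given in these notes, so the statistic is only re-added from the eight per-cell contributions quoted above; `chi_square` is a generic helper shown for completeness.

```python
row_totals = {"A": 110, "B": 120}
col_totals = {"SNP": 70, "Lab": 59, "Con": 57, "Other": 44}
grand_total = sum(row_totals.values())

# expected count under independence: row total x column total / grand total
expected = {
    (cond, party): row_totals[cond] * col_totals[party] / grand_total
    for cond in row_totals
    for party in col_totals
}

def chi_square(observed, expected):
    """Sum over cells of (observed - expected)^2 / expected."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)

# per-cell contributions as reported on the card
contributions = [0.37, 0.27, 0.06, 1.21, 0.34, 0.25, 0.05, 1.11]
chi2_reported = sum(contributions)

df = (len(row_totals) - 1) * (len(col_totals) - 1)
CRITICAL_CHI2_3DF = 7.81    # 5% level, as quoted on the card
print(expected[("A", "SNP")], chi2_reported, df, chi2_reported > CRITICAL_CHI2_3DF)
```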

bonferroni correction

used to avoid type I error, take the critical values p = 0.05 and divide by number of tests keeping the chance of type I error consistently small
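
A short sketch of why the correction is needed and what it does. The familywise error formula assumes the tests are independent, which is a simplification, and both function names are illustrative.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

def familywise_error(alpha_per_test, n_tests):
    """Chance of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha_per_test) ** n_tests

n_tests = 5
uncorrected = familywise_error(0.05, n_tests)
corrected = familywise_error(bonferroni_alpha(0.05, n_tests), n_tests)
print(uncorrected, corrected)
```

With five tests at 0.05 each, the chance of at least one false positive is roughly 23%; dividing the threshold by five brings it back just under 5%.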

f-test

used to evaluate the differences between the group means in ANOVA reject H0 if the f-value exceeds the critical value of the appropriate f-distribution at the desired level - one-tailed test only interested in whether the model is better than chance at predicting the d.v.

two-sample t-test

useful for comparing the means of the two groups we're manipulating, when we /don't/ know the variance. general idea - estimate the variance based on the variance within each sample: pooled estimate from the two groups, weighted when they are unequal in size. degrees of freedom = total number of observations minus 2, n-2 ex. comparing groups of n = 13 and n = 11: 13 + 11 = 24, 24 - 2 = 22 df. ex. was there a change in exam performance between years? different sample sizes, different people. 2016 results: {54, 58, 52, 66, 40, 65, 58, 70, 43, 55} 2017 results: {66, 42, 44, 47, 51, 70, 43, 42, 55, 53, 54, 63} 2016 mean = 56.1, 2017 mean = 52.5, diff between means = 3.6. SEM estimate = 4.12 (takes the sample sizes into account). t = 3.6/4.12 = 0.87 with 10 + 12 - 2 = 20 df. critical values for t20 (two-tailed) = 2.09 at the 5% level and 2.85 at the 1% level, so not significant: cannot reject H0
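
A minimal sketch of the pooled two-sample t-test applied to the two sets of exam marks above. `pooled_t` is a hypothetical helper, and the comparison against a critical value is left to a t table.

```python
import math
import statistics

def pooled_t(a, b):
    """Two-sample t with a pooled variance estimate (equal variances assumed)."""
    na, nb = len(a), len(b)
    # weight each sample variance by its own degrees of freedom
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / standard_error
    return t, standard_error, na + nb - 2

marks_2016 = [54, 58, 52, 66, 40, 65, 58, 70, 43, 55]
marks_2017 = [66, 42, 44, 47, 51, 70, 43, 42, 55, 53, 54, 63]

t_pooled, se_pooled, df_pooled = pooled_t(marks_2016, marks_2017)
print(t_pooled, se_pooled, df_pooled)
```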

dependent variable (d.v.)

variable that corresponds to the output

independent variable (i.v.)

variable that we are manipulating change the level of the variables as part of the experiment, and compare results gender, language background, handedness, etc. arrange the recruitment process so as to obtain pool of participants with the required distribution of values for age, we can do studies with the same participant at different levels - often called "longitudinal" studies

residuals

vertical distances between a point and the least squares regression line.

r ANOVA ex. frequency vs RT

we'd like to know how readers respond to words of different frequencies (in general, readers and word frequency variability) yet this experiment will just use a sample of a few words, in practice - like to know how much of the variability is due to random variation between the words, in their "readability", irrespective of their frequency - see whether a word is read quickly or slowly by readers as a whole if multiple different items represent a given condition, can do analysis both by subjects (F1) and by items (F2) - makes logical sense, but tricky to interpret - F1/F2 analysis (somewhat) superseded by mixed models, which implement the same kind of idea in a more sophisticated way - want to generalize from study to the population

estimating mean and SD of population from sample

what are they used for? question: is it possible that our two samples actually come from the same population? is it at all probable that they come from populations with the same mean?

effect size on r ANOVA

what proportion of the overall sum of squares is between groups: measure called eta squared (η^2) - analogous to r^2 in its meaning + interpretation. standard for rANOVA is something slightly different called partial eta squared

nominal independent variables

what's meant by "a multiple of" when the variable is nominal? if the variable is binary, can effectively consider the levels of the variable to have numerical values, say 0 and 1 ex. in the smoking-mortality regression, could encode gender as 0 = male, 1 = female: y = 600 + 15c + 10g, where c = number of cigarettes and g = gender, the value of the dummy variable - coefficient of 10 attached to this variable would signify an increased risk of 10 units for females compared to males - coefficient of -20 would signify a reduced risk of 20 units for females compared to males. if we changed the encoding, the coefficient would change, but the ultimate interpretation would remain the same • a regression analysis with a single binary variable is the same as a t-test (ANOVA is the generalization to more than two groups)
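
A small sketch of the dummy-coding point, using the coefficients from the example equation y = 600 + 15c + 10g. The choice of 20 cigarettes and the recoded intercept are worked out here for illustration; `predicted_risk` is a hypothetical helper.

```python
def predicted_risk(cigarettes, gender_code, intercept=600, b_cig=15, b_gender=10):
    """Prediction from y = intercept + b_cig * c + b_gender * g."""
    return intercept + b_cig * cigarettes + b_gender * gender_code

# original encoding: 0 = male, 1 = female
risk_male = predicted_risk(20, 0)
risk_female = predicted_risk(20, 1)

# reversed encoding: 1 = male, 0 = female
# the gender coefficient flips sign and the intercept absorbs the difference
risk_male_recoded = predicted_risk(20, 1, intercept=610, b_gender=-10)
risk_female_recoded = predicted_risk(20, 0, intercept=610, b_gender=-10)

print(risk_male, risk_female, risk_male_recoded, risk_female_recoded)
```

Both encodings give the same predictions for each group, which is the sense in which the coefficient changes but the interpretation does not.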

replication

when locating an interaction somewhere in a large design, might want to do a new experiment homing in on it ex. removing the irrelevant/inactive conditions conceptual difference between (partial) replications and post hoc analyses - difference in looking at the same data and focusing on one condition, or planning a new analysis that focuses on that condition in 1st case, deliberately picked out the data that's most interesting, and something interesting was bound to happen somewhere (given a large enough experiment) in 2nd case, no guarantee that anything interesting will happen anywhere: in that sense, it's a fair test want to avoid the danger of building theories that explain only the data we already have, not those that will come next

Likert scale data

which sentence is more natural? A. Mary gave the book written by the acclaimed novelist to Bill. B. Mary gave Bill the book written by the acclaimed novelist. • elicit ratings on a seven-point scale 1 = totally unnatural, 7 = totally natural within-subjects manipulation, with lots of fillers controlling for order effects in some way -suppose further that we only have a small number of participants, say 12... not great want to test for a difference between these two sentences' ratings: how do we proceed? sign test

risks in linear mixed modeling analysis

widely used powerful method, but not particularly well-understood possibility of trying to apply the method when the data don't meet the assumptions can lead to over-casual experimental design, on the expectation that the analysis will fix the problem don't need all of the complexity of multiple reg. simple and transparent statistical methods are often the appropriate option

z-test vs t-test

z-test testing whether a sample could be from a distribution of known mean and variance z-distribution: unit normal distribution t-test: testing whether a sample could be from a distribution of known mean, unknown variance -estimating variance based on the sample -t-distribution: slightly more spread out than Z-distribution -precise version of t-distribution to use depends on sample size

partial eta squared

partial η^2 = BGSS / (BGSS + error SS), where BGSS is the between-groups sum of squares; accounts for variance caused by group membership
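
A sketch contrasting eta squared with partial eta squared. The sums of squares are invented numbers, and the three-way split (conditions, participants, error) is an assumption based on the repeated-measures decomposition described in these notes.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of the total sum of squares that is between groups."""
    return ss_between / ss_total

def partial_eta_squared(ss_between, ss_error):
    """Between-groups SS relative to between-groups SS plus error SS."""
    return ss_between / (ss_between + ss_error)

# made-up repeated-measures decomposition
ss_conditions = 30.0      # between groups (conditions)
ss_participants = 50.0    # consistent per-participant differences
ss_error = 20.0           # unexplained
ss_total = ss_conditions + ss_participants + ss_error

eta2 = eta_squared(ss_conditions, ss_total)
partial_eta2 = partial_eta_squared(ss_conditions, ss_error)
print(eta2, partial_eta2)
```

Partial eta squared leaves the per-participant variability out of the denominator, so it is never smaller than eta squared for the same effect.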

standard procedure

▪ We calculate the results that we would expect to observe, under the assumption that the null hypothesis is true ▪ To do this, we need to appeal to probability theory (or, more usually, someone else's calculations...) ▪ We run the experiment and collect the data ▪ If the results are sufficiently "extreme", under the null hypothesis, we reject the null hypothesis and adopt the alternative hypothesis • p-value used as a measure of the strength of the evidence against the null hypothesis ▪ It's the probability of getting a result as extreme or more extreme than the actual result, under the assumption that the null hypothesis is true ▪ Hence, it's not (for instance) the probability that the null hypothesis is true

