STATS
Large samples are over
30
Minimizing error
= the parameter value that has the least error given the data -Also called ordinary least squares (OLS) when error is the sum of squared deviations -Not always accurate, just not as terrible as the alternatives
What are Type I and Type II errors?
A Type I error occurs when we believe that there is a genuine effect in our population, when in fact there isn't. A Type II error occurs when we believe that there is no effect in the population when, in reality, there is.
What is a test statistic and what does it tell us?
A test statistic is a statistic for which we know how frequently different values occur. The observed value of such a statistic is typically used to test hypotheses, or to establish whether a model is a reasonable representation of what's happening in the population.
Mean is center of
the confidence interval; s = SD (the sample standard deviation)
Central limit theorem:
as samples get large, the sampling distribution of the mean becomes normal, with a mean equal to the population mean and a standard deviation of σ_x̄ = s/√N
Standard error of the mean (SE):
the standard deviation of sample means
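A minimal simulation sketch of both cards (assuming Python with NumPy; the exponential population and all numbers are hypothetical): the standard deviation of many sample means lands near σ/√N even though the population itself is skewed.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical skewed population: exponential with mean 2 and SD 2
n, n_samples = 100, 10_000

# Draw many samples of size n and record each sample's mean
sample_means = rng.exponential(scale=2.0, size=(n_samples, n)).mean(axis=1)

print(sample_means.mean())   # close to the population mean (2.0)
print(sample_means.std())    # close to SE = sigma / sqrt(N) = 2 / 10 = 0.2
```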
Outcome = (model) + error
The data we observe can be predicted from the model we choose to fit plus some degree of error
What is the mean and how do we tell if it's representative of our data?
The mean is a simple statistical model of the centre of a distribution of scores. A hypothetical estimate of the 'typical' score. We use the variance, or standard deviation, to tell us whether it is representative of our data. The standard deviation is a measure of how much error there is associated with the mean: a small standard deviation indicates that the mean is a good representation of our data.
Bonferroni correction
applied to the alpha level to control the overall Type I error rate when multiple significance tests are carried out -Criterion of significance = the alpha level / # of tests conducted -Too strict when lots of tests are performed
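A quick sketch of the correction (assuming Python with NumPy; the p-values are made up):

```python
import numpy as np

alpha = 0.05
p_values = np.array([0.003, 0.012, 0.040, 0.049, 0.210])  # hypothetical p-values

# Bonferroni: compare each p against alpha / number of tests
criterion = alpha / len(p_values)   # 0.05 / 5 = 0.01

# Only tests whose p-value beats the stricter criterion stay significant
print(p_values < criterion)        # [ True False False False False]
```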
Confidence interval
boundaries within which we believe the true population value will fall
Deviance is another word for
error
Linear models
models based on straight lines
Some overlap is
needed for the means to plausibly come from the same population (no overlap suggests the means are from different populations)
Null hypothesis means
no relationship between variables -p is a long-run probability
Two Tailed:
non-directional -Changing between one- and two-tailed tests after the fact is cheating
Statistical significance
p < 0.05: our model explains a sufficient amount of variation to reflect a genuine effect
t-distribution:
a probability distribution that changes shape as sample size increases
Interval estimate
the sample value as the midpoint w/ an upper and lower limit
Sampling variation
samples vary bc they contain different members of population
Point estimate
single value from sample
Large samples
small differences can be significant
Sample
small sub-set of the population
Interval is
small, sample mean is close to true mean
Average error is
SS/N (N = the number of observations)
SS can be used to assess
total error
Population
total set of observations that can be made
P value:
the criterion is usually 0.05 (Fisher never stated this as the magic number, though) -5% chance = the threshold of confidence
We predict ______ of an _______ ________ based on a model
values of an outcome variable
Degrees of freedom
(df) = the number of scores that are free to vary; adjusted because we are trying to estimate the population (N − 1)
Parameters
(usually) constants believed to represent some fundamental truth about the relations between variables in the model
What do the sum of squares, variance and standard deviation represent? How do they differ?
All of these measures tell us something about how well the mean fits the observed sample data. Large values (relative to the scale of measurement) suggest the mean is a poor fit of the observed scores, and small values suggest a good fit. They are also, therefore, measures of dispersion, with large values indicating a spread-out distribution of scores and small values showing a more tightly packed distribution. These measures all represent the same thing, but differ in how they express it. The sum of squared errors is a 'total' and is, therefore, affected by the number of data points. The variance is the 'average' variability but in units squared. The standard deviation is the average variation but converted back to the original units of measurement. As such, the size of the standard deviation can be compared to the mean (because they are in the same units of measurement).
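A short worked sketch of the three measures (assuming Python with NumPy; the scores are hypothetical):

```python
import numpy as np

scores = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical sample
deviations = scores - scores.mean()

ss = np.sum(deviations ** 2)        # sum of squared errors: a total, grows with N
variance = ss / (len(scores) - 1)   # 'average' variability, in squared units (df = N - 1)
sd = np.sqrt(variance)              # same idea, back in the original units

print(ss, variance, sd)             # 40.0 10.0 ~3.16
```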
Types of Hypothesis
Alternative: the one with 1 (H1); the effect will be present -Null: the one with 0 (H0); the effect is absent -Hypotheses are not proven true or untrue; they are weighed in terms of probability -Directional hypothesis: an effect will occur and its direction is stated -Non-directional hypothesis: an effect will occur but its direction is not stated
What is statistical power?
Power is the ability of a test to detect an effect of a particular size (a value of 0.8 is a good level to aim for).
Statistical Power
Probability a given test will find an effect assuming one exists -Depends on: ~how big the effect is ~how strict we are on significance ~sample size -Power = 1 − β (β = the Type II error rate), not 1 − p -Can use to calculate the sample size necessary to achieve a given level of power
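A sketch of the sample-size use (assuming the statsmodels library is available; the medium effect size d = 0.5 is a hypothetical choice):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a medium effect (d = 0.5)
# at alpha = .05 with power = .8
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))   # about 64 per group
```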
What's the difference between the standard deviation and the standard error?
The standard deviation tells us how much observations in our sample differ from the mean value within our sample. The standard error tells us not about how the sample mean represents the sample itself, but how well the sample mean represents the population mean. The standard error is the standard deviation of the sampling distribution of a statistic. For a given statistic (e.g. the mean) it tells us how much variability there is in this statistic across samples from the same population. Large values, therefore, indicate that a statistic from a given sample may not be an accurate reflection of the population from which the sample came.
Why do we use samples?
We are usually interested in populations, but because we cannot collect data from every human being (or whatever) in the population, we collect data from a small subset of the population (known as a sample) and use these data to infer things about the population as a whole.
Intervals
Typically use a 95% or sometimes a 99% CI -Interpretation: across repeated samples, 95% of the CIs constructed this way will contain the population value (5% of the time they won't)
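A sketch of computing a 95% CI (assuming Python with NumPy and SciPy; the data are hypothetical):

```python
import numpy as np
from scipy import stats

data = np.array([5.1, 4.8, 6.2, 5.5, 4.9, 5.8, 5.3, 6.0])  # hypothetical scores
se = stats.sem(data)   # standard error of the mean

# 95% CI from the t-distribution (appropriate when the SD is estimated from the sample)
low, high = stats.t.interval(0.95, df=len(data) - 1, loc=data.mean(), scale=se)
print(low, high)
```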
Fit
the degree to which a statistical model represents the data collected (good/moderate/poor)
One tailed:
directional (if results fall in the opposite direction, you MUST accept the null hypothesis) -Changing between one- and two-tailed tests after the fact is cheating
Non-directional hypothesis:
an effect will occur but its direction is not stated
Directional hypothesis:
effect will occur and the direction is stated
Test stats
= effect/error; i.e., systematic variance / unsystematic variance
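A sketch of this ratio in practice (assuming Python with NumPy and SciPy; the two groups are simulated, hypothetical data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)   # hypothetical control group
group_b = rng.normal(loc=11.5, scale=2.0, size=30)   # hypothetical treatment group

# t is effect/error: the group difference (systematic variance)
# scaled by within-group variability (unsystematic variance)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
```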
Experimentwise error rate
the error rate across statistical tests conducted on the same data = 1 − 0.95^n
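A quick check of the formula (plain Python; the n values are arbitrary examples):

```python
# Familywise Type I error rate across n independent tests, each at alpha = .05
for n in (1, 2, 5, 10):
    print(n, round(1 - 0.95 ** n, 3))   # 0.05, 0.097, 0.226, 0.401
```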
Sampling distribution
frequency distribution of sample means from the same population
Any parameter that can be estimated in a sample has a
hypothetical sampling distribution and standard error
Small samples
large differences can be non-significant
To calculate confidence interval we need to know the
the limits w/in which 95% of sample means will fall (the sample mean ± 1.96 × SE for a 95% CI)
null hypothesis
the one with 0 (H0); the effect is absent
alternative hypothesis
the one with 1 (H1); the effect will be present