Scope & Methods Exam II
With the mean difference and the standard error of the mean difference, we can easily calculate the Z or t statistic
Z or t = (mean under Ha - mean under H0) / SE of the mean difference
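A minimal sketch of this calculation in Python; the 7-point difference and 2.5 standard error are borrowed from the worked example later in these notes:

```python
# Z (or t) statistic for a mean difference; the values are illustrative.
mean_difference = 7.0   # observed difference between the sample and hypothesized means
null_difference = 0.0   # difference expected under the null hypothesis
se_difference = 2.5     # standard error of the mean difference

z = (mean_difference - null_difference) / se_difference
print(z)  # 2.8 -> the observed difference sits 2.8 standard errors from the null value
```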
Type 2 error
(false negative) Concluding there is no relationship when there is one
Type 1 error
(false positive) Concluding there is a relationship when there is not one
Lambda Calculated using the following:
(prediction error with no knowledge of the IV - prediction error with knowledge of the IV) / prediction error with no knowledge of the IV
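A minimal sketch of that formula in Python, using a hypothetical 2x2 crosstab (all counts are made up):

```python
# Lambda from a hypothetical crosstab: rows are DV categories, columns are IV
# categories; the counts are made up for illustration.
crosstab = [
    [30, 10],  # DV category A
    [20, 40],  # DV category B
]

n = sum(sum(row) for row in crosstab)

# E1: prediction error with no knowledge of the IV
# (everyone outside the modal DV category).
e1 = n - max(sum(row) for row in crosstab)

# E2: prediction error with knowledge of the IV
# (within each column, everyone outside that column's modal cell).
columns = list(zip(*crosstab))
e2 = sum(sum(col) - max(col) for col in columns)

lam = (e1 - e2) / e1
print(lam)  # 0.25 -> knowing the IV reduces prediction error by 25%
```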
Ideal Sample
A larger sample with lower dispersion leads to less random error and greater confidence in the sample statistics
What is the probability that a randomly chosen mean will be less than -1 standard errors?
Approximately 15.9% (50% - 34.1%)
What is the probability that a randomly chosen mean will be more than +2 standard errors?
Approximately 2.3% (50% - 34.1% - 13.6%)
What is the probability that a randomly chosen mean will NOT be between -1 and 1 standard errors?
Approximately 31.8% (100% - 68.2%)
Lambda
A measure of association that measures the PRE when the researcher uses the value of the IV to predict the value of the DV. Used when at least one of the variables is nominal. Lambda is an asymmetric measure of association, which means that if you flip the IV and DV you will get a different result.
The null hypothesis
A null hypothesis assumes that, in the population, there is no relationship between the independent and dependent variables. Any relationship that we do observe, we assume, is part of random sampling error. The null hypothesis always starts with the assumption that error explains the results. We are not trying to show that our hypothesis is true; rather, we are trying to show that the null hypothesis is implausible. Null hypothesis: H0. Alternative hypothesis: Ha.
Populations
A property of a population (a parameter) will have a central tendency and dispersion. For interval variables we are interested in the mean and the standard deviation.
Confidence intervals
A range around the mean estimate where we are confident that X% of all sample estimates will fall. For 95% confidence intervals, we are looking at a range of +/- approximately 2 standard errors (1.96 to be exact). For 99% confidence intervals, it is 2.576 standard errors. To compute the 95% confidence interval range, we create an upper bound and a lower bound by adding and subtracting 1.96 standard errors from the sample mean.
Samples
A sample is a number of cases or observations drawn from a population. It is easier for us to collect samples and measure their statistics than to measure an entire population.
Sampling frames
A sampling frame is the method for determining your sample. Poor sampling frames lead to selection bias. Selection bias leads to compositional differences between the sample and the population. Estimators with high bias are not going to be valid.
Sample size component
As sample size goes up, random sampling error goes down
Variation component
As variation goes up, random sampling error increases
Why is this Important?
Assuming that the distribution is normal allows us to make precise inferences about the percentage of sample means that will fall within x standard errors of the mean
Normal distribution probability table
Assuming the true population mean is 66, what percentage of the time would a sample mean fall 2.8 standard units lower? Only .26 percent of the time, so it seems pretty safe to say that the true population mean is closer to the sample mean than to the hypothetical mean.
Comparing two sample means: the confidence interval approach
Compare men vs. women on opinions of gun control. Looking at men, we know that 95% of all samples from the same population will fall within +/- 1.96 standard errors of the mean; this range is important. If we calculate the confidence interval for women and find that the two ranges do not overlap, then it is very unlikely that these differences are due to random error. Why? Because the chance that the true value is the same for both is less than 5% (not that it is impossible).
Comparing two sample means: The P-value approach
Confidence intervals and your eyeballs will not steer you wrong, but they lack precision. The p-value approach works from the standard error of the difference between the two sample means. P-values give you the exact probability of obtaining the observed difference if the null hypothesis is correct. When the p-value is less than .05, we can reject the null hypothesis.
The Chi square test of significance
Definition: determines whether the observed dispersal of cases departs significantly from what we would expect to find if the null hypothesis were correct. X² is pronounced "kai square." It works with cross-tabular data. The null hypothesis for a X² test is that the distribution of cases down each column should be the same as the total distribution for all the cases in the sample.
Inference
Drawing a conclusion from facts that are known or assumed. Ex: parachutes.
Why not always a census?
A census is expensive and extensive: every unit of the population must be measured.
Sample Statistic
First of all, what is a statistic? The word statistic refers simply to a measure. A sample statistic is simply a measure of some attribute of a sample. Example: we have a sample of students, and they all have an age; we could compute the sample statistic of mean age. When a sample statistic is used to estimate a population parameter, it is called an estimator.
Calculating standard deviation
First, we calculate each value's difference (or deviation) from the mean: deviation = (individual's value - 𝛍). Second, since some of the values are negative, we square each deviation. Third, we add all the squared deviations together; this is called the total sum of squares (not important now, but it will be later for regression and correlation). Fourth, we calculate the average of the sum of the squared deviations; this is called the variance: variance = total sum of squares / n. Finally, we compute the standard deviation by taking the square root of the variance: standard deviation = √variance.
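The same steps in Python, with made-up values:

```python
import math

# The five steps above, using the population mean (mu); the values are made up.
values = [4, 8, 6, 5, 7]
mu = sum(values) / len(values)

deviations = [x - mu for x in values]          # step 1: deviation from the mean
squared = [d ** 2 for d in deviations]         # step 2: square each deviation
total_sum_of_squares = sum(squared)            # step 3: add them together
variance = total_sum_of_squares / len(values)  # step 4: variance = TSS / n
std_dev = math.sqrt(variance)                  # step 5: square root of the variance
print(std_dev)
```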
Assessing likelihood of hypothetical mean
First, we find the difference between the sample mean and the hypothetical mean; in this case it is 66 - 59 = 7. Second, as we are working in standardized units, we need to convert this raw score into a z-score. In other words, we don't care that there is a difference of 7 points; we want to know what the difference is in units of standard error. We calculate that by taking the difference and dividing by the standard error: 7 / 2.5 = 2.8. In other words, if the population mean were 66, our sample mean would be 2.8 standard errors LOWER.
Determining the standard deviation of the sample
For the population, we calculate variance as summed squared deviations / N. For the sample, we calculate variance as summed squared deviations / (n - 1). So our standard error when using the sample standard deviation will be higher.
Interpreting
How often will random processes give us a mean that is 2.8 standard units below the true mean? We can look this up precisely using a normal distribution probability table, which gives us the proportion of the normal curve above any given value of Z. Take a look (table 6-3): how much of the curve lies above 1.96? Read down the left column, then go across to the right for the second decimal; the number is the percentage. How does this relate to our 95% confidence interval? Is the hypothetical mean wrong? Is 66 probably right or probably wrong? Wrong. How do you know? There is only a .26% chance it is right.
Sample size and sampling error
If a sample is not random, then the size of the sample has no relationship to how accurate the statistic will be. Random samples do not get rid of error; instead, they introduce random sampling error: differences between the population and the sample that occur by chance. We are okay with purposefully introducing this error because we know how to deal with it.
The normal distribution II
If we draw a sample from the population, we can assume it is part of a normal distribution of samples. The standard error of the sample mean expresses how sure we are that the sample mean is the same as the population mean. The larger the sample size, the lower the standard error.
The normal distribution
If we repeat the dice roll activity 100 times and then plot the mean of each try, we will get a normal distribution. When we take a sample from a population, there is error. Some samples will have a lot of error (deviation from the mean); some will have only a little. When we gather enough samples, we can see that the vast majority are going to be close to the mean, but there will still be some that fall at the extremes. What values did you get for the dice roll mean?
The Central Limit Theorem
If we take an infinite number of samples of size n from a population of N, the distribution of the means of the samples will be a normal distribution. The distribution of sample means will have a mean equal to the true population mean. The distribution of sample means will have a standard error equal to the population's standard deviation / √n.
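A minimal simulation sketch of the theorem in Python; the population shape, sample size, and number of samples are arbitrary choices:

```python
import random
import statistics

# Simulation sketch of the central limit theorem: the means of repeated random
# samples cluster around the population mean with spread sigma / sqrt(n).
random.seed(0)
population = [random.uniform(0, 10) for _ in range(10_000)]  # arbitrary population

n = 50
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]

print(statistics.mean(sample_means))             # close to the population mean (~5)
print(statistics.stdev(sample_means))            # close to the standard error below
print(statistics.pstdev(population) / n ** 0.5)  # sigma / sqrt(n), roughly 0.41
```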
Using the t distribution
In a normal distribution, we don't care about sample size, but here the sample size impacts the shape of the curve, so it is extremely important.
95% confidence intervals
In general we work with 95% confidence intervals, but we can have others. Due to the central limit theorem, we know precisely what this interval will look like. If we want an area of the curve where 95% of the sample estimates will fall, how many standard errors are we looking at? For 95% confidence intervals we are looking at a range of +/- approximately 2 standard errors (1.96 to be exact). For 99% CIs it is 2.576 standard errors.
Problems with lambda
Not good when there is low variation. Not good when the within-category modes are the same as the overall mode (it will return a lambda of zero, even if there is a relationship).
Estimating Population Parameters
Population parameter = sample statistic + random sampling error. Random sampling error = (variation component) / (sample size component). What does this suggest? As sample size goes up, sampling error goes down.
Population parameters
Populations have information (characteristics) attached to them; think of these as variables. Example: if my population is all students in this classroom, what are some population parameters? To get the actual population parameter, we would perform a census. In a census, every unit of the population is measured and statistical inference is not necessary.
Proportional reduction in error (PRE)
Prediction-based metric that varies between 0 and 1. The measure tells you precisely how much better you can predict the dependent variable when you know the value of the independent variable. If knowledge of the IV doesn't tell us anything, the value will be 0. If knowledge of the IV gives us perfect prediction, the value will be 1.
Degrees of freedom
Property of many distributions (including Student's t). Equal to the number of observations (sample size) minus the number of parameters being estimated; typically n - 1. The higher the degrees of freedom, the less random sampling error and the more confidence in the sample statistic.
The Variation Component: Standard Deviation
Random sampling error = (variation component) / √n. Nominal and ordinal variables have less precise measures of variation, but for interval variables we can compute the standard deviation.
Properties of samples
Samples are drawn from a population; here we are only interested in random samples. Samples have size (n). Samples have a standard deviation (sigma) that expresses variation. Samples have a standard error (sigma / √n) that measures the accuracy of the sample mean in relation to the population's actual mean. Taking multiple samples from the population: no matter what the original population looks like, no matter the distribution or variation, the distribution of the sample means will look the same way. What will it look like?
Random sampling
Types include simple and stratified. Only random sampling allows for statistical inference: it ensures that each member of the population has an equal chance of being selected into the sample, and it maximizes the likelihood that measurement error is random and not systematic.
Comparing two sample means: the eyeball approach
Simply put, we can quickly compute a confidence interval around a mean difference by adding/subtracting 2 times the standard error. As we are working with standardized scores, we can check to see if 0 is within that interval; if so, we cannot reject the null. We can also simply ask if the mean difference is at least twice the standard error; if not, we cannot reject the null. Obviously not precise, but a useful shortcut.
Confidence intervals II
So far we've been working with point estimates, that is, a precise number (e.g. the mean). In inferential statistics we do not get precise numbers; rather, we get uncertain predictions. We express those predictions with something called a confidence interval. A confidence interval is the range in which X% of the possible sample estimates will fall by chance.
standard error of a mean difference
Square each mean's standard error. Add the two together. Take the square root.
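The three steps in Python; the two standard errors are made up for illustration:

```python
import math

# Standard error of a mean difference from two group standard errors.
se_men, se_women = 1.4, 1.1  # illustrative values

se_difference = math.sqrt(se_men ** 2 + se_women ** 2)  # square each, add, take the root
print(round(se_difference, 2))  # 1.78
```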
Standard Deviation VS. Standard Error
Standard deviation: refers to the data itself. In a normal distribution, 95% of all the data fall within 1.96 standard deviations of the mean. Standard error: refers to the estimate of the mean.
Statistical inference
Still drawing conclusions from facts known or assumed to be true; however, now we have (and include) measures of uncertainty. Statistical inference is when we make estimates (inferences) about a population based on information (conclusions and facts) about a sample. It is a set of tools to help us assess how much confidence we have that what we observe in a sample reflects what exists in the population.
Which error do we want to minimize?
TYPE 1
Frequency tables
Take a sample, write down the frequencies, and plot them as a histogram.
Measures of Association
Tests of significance tell us if a relationship is likely due to random error. Measures of association add more information: they communicate the strength of the relationship between the independent and dependent variable. Statistical significance is often the thing we are most interested in, but a measure of association will allow you to start properly interpreting results.
Interpreting X2
The X² statistic has its own distribution that, like the t-distribution, depends on degrees of freedom. Degrees of freedom are calculated by the following: d.f. = (number of rows - 1)(number of columns - 1). General rule: the larger the X², the lower the probability that the results were obtained by chance. A precise value can be obtained by looking at a critical values of X² table (table 7-8).
The law of large numbers
The average of the results of running an experiment a large number of times will be close to the expected value. For example, we would expect that 50% of the time a coin flip will be heads, but it's still possible to flip five tails in a row. However, if we flip long enough, the mean will approach the expected value.
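A minimal coin-flip sketch in Python; the seed and sample sizes are arbitrary:

```python
import random

# The running mean of coin flips (heads = 1) approaches the expected value 0.5.
random.seed(0)
flips = [random.randint(0, 1) for _ in range(100_000)]

for n in (10, 100, 10_000, 100_000):
    print(n, sum(flips[:n]) / n)  # drifts toward 0.5 as n grows
```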
Difference between means
The difference between means is easily calculated by subtracting the null hypothesis mean from the alternative hypothesis mean
Sample Sizes
The relationship between sample size and error is non-linear, so our sample size component is going to be a transformation of the sample size (n). Random sampling error = (variation component) / √n. Larger samples are always better, but as the sample size increases, the additional observations become marginally less important.
Central limit theorem
The sample means will have a normal distribution and a standard deviation. Sampling must be random. The CLT also tells us some other things: if we were to take an infinite number of samples, the mean of our sample means would converge to the true population parameter, and larger sample sizes will be more accurate.
Compute 95% CI
To compute the 95% confidence interval range, we create an upper bound and a lower bound by adding and subtracting 1.96 standard errors from the sample mean. Sample mean = 75, standard error = 2.2. Lower confidence boundary = 75 - 1.96(2.2). Upper confidence boundary = 75 + 1.96(2.2). 95% confidence interval = [70.7, 79.3]. What does this mean? See the worked sketch below.
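The same bounds computed in Python, using the numbers from the card above:

```python
# 95% confidence interval from the example above.
sample_mean = 75
standard_error = 2.2

lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error
print(round(lower, 1), round(upper, 1))  # 70.7 79.3
```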
Z-Scores
Transform the sample mean into a standard score with a mean of 0. To better make sense of what we are looking at, we can standardize the scores into z-scores. Z-scores have a mean of 0 and a standard deviation of 1. We need to convert our scores to z-scores to properly leverage the power of statistical inference.
Sample proportions
Used for nominal/ordinal data. The same logic applies, but we calculate variation differently. Example: gun control. Sample proportion of individuals who support gun control = p = .72. Sample proportion of individuals who do not support gun control = q = .28. Sample size = 100. Standard error of a sample proportion: √(pq) / √n = .045.
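The same calculation in Python, using the gun control numbers above:

```python
import math

# Standard error of a sample proportion.
p, q, n = 0.72, 0.28, 100

se = math.sqrt(p * q) / math.sqrt(n)  # equivalent to math.sqrt(p * q / n)
print(round(se, 3))                   # 0.045
```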
Student's T
Used when we have a small sample (< 30). The larger the sample, the closer the t distribution resembles the z distribution. We care here about degrees of freedom (sample size - 1). The higher the degrees of freedom, the lower the random sampling error.
What is the standard deviation?
Variation is expressed as a function of the deviations from the mean of a distribution
Inference from distributions
We know that in an infinite number of samples from a population, the mean of the sample statistics will be equal to the population parameter (central limit theorem). Every time we take a sample, we treat it as if it were part of a larger population of infinite samples. We know that if we take all of these samples and plot them, they will have a normal distribution. Any single sample, then, is going to fall somewhere in that normal distribution, some values much more frequently than others. We are trying to determine how likely it is that this particular sample statistic would occur as a result of random sampling error. Remember: at any given time, we can calculate the population parameter by adding the random sampling error to the sample statistic. Random sampling error is a function of variation and sample size.
Inference using the student's T distribution
We use the Student's t distribution when sample sizes are small. The normal distribution always has the same shape, but the Student's t-distribution changes its shape depending on the size of the sample. The smaller the sample, the less confidence it permits us to have. As the sample size gets larger, the Student's t distribution gets closer and closer to being normal; after about 30 degrees of freedom, it is indistinguishable from a normal curve. Instead of using the standard deviation of the population (sigma), we use the standard deviation of the sample (s).
Standard error of a sample mean
We've been discussing random sampling error as the following: random sampling error = (variation component) / (sample size component). This is more commonly known as the standard error of a sample mean. Standard error of the sample mean = sigma / √n.
Lambda values:
Weak: <= 0.1 Moderate: >0.1 but <= 0.2 Moderately strong: > 0.2 but <= 0.3 Strong: > 0.3
Normal estimation
What we just did is called normal estimation. Assume that the hypothetical claim is correct; if so, how often would the sample mean occur by chance? If it is unlikely to occur by chance, we can infer that the hypothetical claim is incorrect.
Cramer's V
When the dependent variable has a low amount of variation, lambda isn't very helpful, so we turn to Cramer's V (based on X²). Cramer's V takes a value of 0 to 1 (no relationship to perfect relationship). This is not a PRE measure, so it does not reflect predictive accuracy (we cannot say that it improves prediction by X%). It does help us see relationships that lambda misses.
Inference using the normal distribution
Why do we do this? Why does it matter? We can say things about the population with a certain degree of certainty. Standardizing gives us the ability to perform inference.
Using the standard error
What is the probability that a randomly chosen mean will be between -1 and +1 standard errors, that is, within one standard error of the mean? Approximately 68.2%.
Problems with X2
X² is very sensitive to sample size. Large sample sizes will often produce statistically significant results even if the relationship doesn't look substantively interesting. When expected frequencies are very small (less than 5), it does not work well at all. Remember: X² only tells us the likelihood that a relationship has occurred by chance (same with p-values); it does NOT help us interpret the data.
Calculating Z score
Z = (deviation from the mean) / (standard unit, i.e. the standard deviation, or the standard error when working with sample means)
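A minimal sketch in Python, reusing the 59 vs. 66 example from these notes, where the standard unit is a standard error of 2.5:

```python
# Z-score: how many standard units below the hypothetical mean does the
# sample mean fall?
sample_mean = 59
hypothetical_mean = 66
standard_error = 2.5  # the "standard unit" here

z = (sample_mean - hypothetical_mean) / standard_error
print(z)  # -2.8
```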
Z - scores
Z-scores are standardized units, expressed in units of deviation from the mean of the distribution. When a z-score is +/- 1.96, the difference between the population and sample mean would only happen by chance less than 5% of the time. We can look up the proportion of the normal curve above the z-score using a z table.
Populations
a population is just the entire universe of cases we are interested in
A raw score:
a value expressed in its original form (e.g. years of age, dollars earned, etc.) also called an unstandardized score
Purposive sampling
expert judgment
A standardized score
is a value expressed in units of standard deviation from the mean. Also called a z-score
𝛍
population mean
𝛔
population standard deviation
Snowball sampling
sample grows from sample itself
Multi-stage
sample multiple levels of analysis
The observed frequency
the actual number of cases falling into each cell
The expected frequency
the hypothetical number of cases that should fall into each cell if Ho is correct
Calculating the chi square test of significance
X² = Σ (fo - fe)² / fe. Note: the smaller the difference between the observed (fo) and the expected (fe) values, the lower the X².
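A minimal sketch of the formula in Python, with hypothetical observed and expected frequencies:

```python
# Chi-square statistic from observed (fo) and expected (fe) cell frequencies.
# The frequencies are made up for illustration.
observed = [18, 22, 40]
expected = [20, 20, 40]  # what H0 predicts for each cell

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
print(chi2)  # 0.4 -> small observed-expected gaps give a small X2
```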