Biostat midterm


use median when

- An interval/ratio variable has a highly skewed distribution - You want to report the central value, i.e. the exact center of the distribution - The variable is measured at the ordinal level (make sure a median makes sense)

continuous variables

- May take any value within a defined range (e.g., age, weight, height, Hgb, WBC, temperature) - Values that are measured - Values with fractions/decimals

use the mean when

- The variable is measured at the interval/ratio level and is not highly skewed (i.e., symmetrical distribution) - The variable is measured at the ordinal level and a mean makes sense - You want to report the typical value - You anticipate additional statistical analyses

discrete variables

- Variables that have a limited set of values (e.g., biologic sex, political preference, treatment received) - Values that are counted - Values can only assume whole numbers (e.g., number of missing teeth, hospital admissions, children in family)

z test basic concept

-Find the z-score for the observed (sample) mean-See where it falls in the sampling distribution-Is it so far from the population mean that we can conclude that it is probably different from our population? OR Is our observed value different enough from expected that we can conclude it occurred by more than just chance alone? Does it fall in the 'rejection' or 'critical' region defined by α?

interpret the confidence interval for: Researchers found that the mean age of a sample of problem gamblers was 36.5 years (95% CI: 33.9, 39.1)

-If we did the study 100 times, in 95 of them the mean would lie between 33.9 and 39.1 years -There is a 95% probability that the CI contains the true population mean -There is a 95% probability that the mean age of problem gamblers in the population is between 33.9 and 39.1 years

Steps to create sampling distribution of a mean

-Sample repeatedly from the population -Calculate the statistic of interest (the mean) -Form a distribution based on the set of means obtained from the samples The set of means obtained will form a new distribution called the sampling distribution

Three factors that influence how representative a sample is of the population:

-Sampling procedure -Sample size -Response

4 types of probability sampling

-Simple random sampling -Systematic sampling -Stratified random sampling -Cluster sampling

Why not 𝑋bar and SD?

-These refer to a group that we are studying, i.e. a sample or study population, rather than the whole population -Using μ and σ reminds us that we are assuming that our variable is normally distributed

When might you sample the entire population?

-When your population is very small -When you have extensive resources -When your outcome is rare

power=

1 -β

Steps for Hypotheses Testing ttest

1. Check assumptions 2. Define null (H0) and alternative hypotheses (Ha) 3. State alpha (typically 0.05; decide 1-sided vs. 2-sided) 4. Define degrees of freedom (df) 5. State decision rule/determine critical value 6. Calculate test statistic and compare it to critical value 7. State results/conclusions

Steps for Hypotheses Testing

1. Define null (H0) and alternative hypotheses (Ha) 2. State alpha (typically 0.05; decide 1-sided vs. 2-sided) 3. State decision rule/determine critical value 4. Calculate test statistic and compare it to critical value 5. State results/conclusions

Steps for Z-test

1. State null and alternative hypotheses 2. Choose α 3. State decision rule/calculate critical value 4. Calculate test statistic (and compare to critical value) 5. State results/conclusions

Given the diet example (slidedeck 4b slide 77) , what is the test statistic? a. 0.06 b. 1.03 c. 2.35

1.03 [Use the formula d/(s/√𝑛)= 7/(13.64/2)]
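The arithmetic in the bracketed formula can be checked with a short sketch. The mean difference (7) and the SD of the differences (13.64) come from the card; the sample size n = 4 is an assumption inferred from the divisor of 2 (that is, √n = 2), since the slide itself is not reproduced here.

```python
import math

# Values from the card; n = 4 is an assumption inferred from the "/2" divisor.
mean_diff = 7.0      # d, the mean of the paired differences
sd_diff = 13.64      # s, the SD of the paired differences
n_pairs = 4          # assumed number of pairs, so sqrt(n) = 2

# Test statistic: d / (s / sqrt(n))
standard_error = sd_diff / math.sqrt(n_pairs)
t_statistic = mean_diff / standard_error

print(round(t_statistic, 2))
```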

The Chi-square Distribution

χ² test tells us only if the variables are independent or not • It does not tell us the pattern or nature of the relationship • For this, we look at the proportions or the ratio of the proportions • In our example - A test of H0 that the proportion of M&M preference among Ugrads (pi ugrad) is the same as the proportion among Grads (pi grad)

In biomedical research, we almost always use

2-tailed or 2-sided tests why? - Much we don't know - More stringent or "conservative" • Not infrequently, a 1-sided test will achieve significance when a 2-sided test does not

Height among US men over 20 is approximately normally distributed with a mean of 68 inches (5' 8") and a standard deviation of 3 inches • What proportion of US men are: -More than 74 inches (6' 2") tall? -Less than 62 inches (5' 2") tall? -Between 65 and 68 inches (5' 5" and 5' 8") tall? -Between 71 and 74 inches (5' 11" and 6' 2") tall?

Mean = 68 inches, SD = 3 inches • >74 in: z = (74 - 68)/3 = 2.0; area between the mean and z = 2.0 is 47.72%, so area above = 50 - 47.72 = 2.28% • <62 in: z = (62 - 68)/3 = -2.0; area below = 50 - 47.72 = 2.28% (about 2.3%) • Between 65 and 68 in: z = -1.0 to 0; area = 34.1% • Between 71 and 74 in: z = 1.0 to 2.0; area = 47.72 - 34.13 = 13.6%
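The four proportions can also be computed directly instead of read from a z table. This is a minimal sketch using only the standard library; the helper name `normal_cdf` is illustrative and not from the course material.

```python
import math

def normal_cdf(x, mean, sd):
    """Proportion of a normal(mean, sd) distribution that lies below x."""
    z = (x - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

MEAN, SD = 68, 3  # height of US men over 20, in inches (from the card)

p_above_74 = 1 - normal_cdf(74, MEAN, SD)
p_below_62 = normal_cdf(62, MEAN, SD)
p_65_to_68 = normal_cdf(68, MEAN, SD) - normal_cdf(65, MEAN, SD)
p_71_to_74 = normal_cdf(74, MEAN, SD) - normal_cdf(71, MEAN, SD)

for label, p in [(">74", p_above_74), ("<62", p_below_62),
                 ("65-68", p_65_to_68), ("71-74", p_71_to_74)]:
    print(f"{label} in: {100 * p:.1f}%")
```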

The standard chi-squared test for a 2 by 2 contingency table is valid only if: a. all the expected frequencies are greater than five b. both variables are continuous c. at least one variable is from a Normal distribution d. all the observed frequencies are greater than five e. the sample is very large

A

What is true about a p-value of 0.03? (choose all that apply) a. It is less than the standard α value that is used to test for significance b. It represents a 3% chance that the difference we found would be equal or greater than what we found if the null hypothesis were false c. It represents a 3% chance that the difference we found would be equal or greater than what we found if the null hypothesis were true d. It proves the null is true e. We would fail to reject the null

A and C

We conduct a test to determine if older and younger men differ on the number of hours they spend watching football during the fall/winter season. We find that older men watch 0.2 hours more football than younger men, with a p-value=0.01. What conclusions can we make? (choose all that apply) a. Older men watch significantly more football than younger men b. The amount of TV older men watch is higher than the amount of TV younger men watch c. The amount of football younger men watch is considerably less than the amount of football older people watch d. We may have a Type 1 error e. We have proven that the null hypothesis is true

A and D

Chi-Squared GOF Method, Helpful Hints

A larger df implies a larger # of cells in the two-way table, and will produce a larger critical value for the test• The bigger the (Obs ─ Exp) difference, the more likely the hypothesized pattern is incorrect• Some expected values will be larger than observed, others will be smaller-Some (Obs -Exp) will be positive, some negative-We square them (like we did for SD) to make them positive• Always compare Expected to Observed values to help with interpretation

A probability sampling is

A probability sampling is one in which every unit in the population has a known chance (greater than zero) of being selected in the sample • 4 Types-Simple random sampling-Systematic sampling-Stratified random sampling-Cluster sampling

Example: SEPTA on time record

A public transit auditor is interested in determining the percent of time SEPTA buses are on time (w/in 10 mins), 10-20 minutes late, or >20 minutes off schedule• SEPTA claims that the 90% are on time, 7% are between 10-20 minutes late, and 3% are >20 minutes late• The auditor sets off to ride the bus and after weeks of riding clocked in the following:- 160 bus rides on time- 30 bus rides 10-20 minutes late- 10 bus rides >20 minutes late• Can SEPTA make the claim that 90% of the buses run on time? Hypotheses:- H0: The proportion of bus runs that are on time, 10-20 mins late, and >20 mins late are similar to what SEPTA claims- H1: The proportion of bus runs that are on time, 10-20 mins late, and >20 mins late are not similar to what SEPTA claims
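The SEPTA example can be worked through with the goodness-of-fit formula, sum of (Observed - Expected)^2 / Expected. This is a sketch using the counts and claimed percentages from the card; the critical value 5.99 is the standard table value for df = 2 at α = 0.05, and the variable names are illustrative.

```python
# Observed bus rides (from the card) and SEPTA's claimed proportions.
observed = {"on time": 160, "10-20 min late": 30, ">20 min late": 10}
claimed = {"on time": 0.90, "10-20 min late": 0.07, ">20 min late": 0.03}

n = sum(observed.values())  # 200 rides in total

# Expected count in each category if SEPTA's claim were correct.
expected = {cat: n * p for cat, p in claimed.items()}

# Chi-square GOF statistic: sum over categories of (O - E)^2 / E.
chi_square = sum((observed[cat] - expected[cat]) ** 2 / expected[cat]
                 for cat in observed)

df = len(observed) - 1          # categories minus 1
critical_value = 5.99           # chi-square table, df = 2, alpha = 0.05

print(f"chi-square = {chi_square:.2f}, df = {df}")
print("reject H0" if chi_square > critical_value else "fail to reject H0")
```

All expected counts (180, 14, and 6) are at least 5, so the expected-value assumption of the test is met.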

Wilcoxon Rank Sum Test

A researcher wants to know if men and women download a similar quantity of movies onto their computer • She takes a random sample of 8 men and 8 women from a college campus • Unpaired data; analogous to the 2-sample t-test • Null hypothesis, H0: median bytes downloaded by men = median for women • Alternative hypothesis, H1: median bytes downloaded by men ≠ median for women • Results - Our sum is ≤ 49, so this extreme of a sum is unlikely to have occurred by chance - We reject the null hypothesis - The difference between men and women in gigabytes of movies on their computers was statistically significant, with men having downloaded more Gbytes than women (Md for men = 103, Md for women = 60.5, p<.05) • Conclusion - We found evidence that men download more movies than women (see slides 35-40 on nonparametric deck)

Variance

A summary measure of the distance of each value from the mean of the distribution• Subtract the mean from each value, square that value, add them up, then divide by N -1 s^2 signifies variance
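The recipe above (subtract the mean, square, add up, divide by N - 1) can be sketched as follows. The data values are hypothetical and chosen only for illustration; the result is compared against the standard library's `statistics.variance`, which also uses the N - 1 denominator.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data for illustration

mean = sum(values) / len(values)

# Squared distance of each value from the mean, summed, divided by N - 1.
squared_deviations = [(x - mean) ** 2 for x in values]
sample_variance = sum(squared_deviations) / (len(values) - 1)

# The SD is the square root of the variance, back in the original units.
sample_sd = math.sqrt(sample_variance)

print(round(sample_variance, 3), round(sample_sd, 3))
print(math.isclose(sample_variance, statistics.variance(values)))
```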

McNemar's test could be used (check all that apply): a. to compare the numbers of cigarette smokers among cancer cases and age and sex matched healthy controls b. to compare for the presence of respiratory symptoms in a group of asthmatics in winter and summer c. to look at the relationship between cigarette smoking and respiratory symptoms in a group of asthmatics d. to examine the change in peak flow rate, a continuous measure of lung function, in a group of asthmatics from winter to summer e. to compare the number of cigarette smokers among a group of cancer cases and a random sample of the general population

AB

For a Fisher's exact test for a contingency table, which of the following are true? (check all that apply) a. applies to 2 by 2 tables b. variables should have mutually exclusive categories c. is suitable when expected frequencies are small d. is difficult to calculate when the expected frequencies are large

ABCD

What is Meant by "Contingency"?

All studies present as a common Contingency Table• In Epi, usually a 4-fold (4-square) or 2x2 table• Called "2x2" because it has 2 rows and 2 columns, but can have any combo of rows/columns (i.e., 3x2, 4x4, etc.) • Called "contingency" because the values in the cells are contingent (or depend) on the marginal values (those at the edges of the table)• Null and alternative hypotheses depend on the study design-Will discuss later

How Do You Decide What to Collect?

Always collect the most detail as possible (continuous)-Not recommended if strong reason to believe subject will not answer truthfully or can't measure accurately (i.e., actual income vs. range)• Can always reduce the detail in the level of measurement during analysis but can never increase-Blood pressure (mmHg) can be re-categorized to normal, borderline, high-You cannot get an actual number later if you collect a category

levels of measurement

Amount of detail/information contained within the variable •Four levels: nominal, ordinal, interval, ratio

How To Test for Outliers and Adjust Your Whiskers

An outlier is a value that is much smaller or larger than other values in the data, usually defined as a value more than 1.5*IQR above Q3 or below Q1 Steps to find outliers: 1. Find IQR = Q3-Q1 2. Multiply IQR by 1.5 3. Add 1.5(IQR) to Q3 4. Subtract 1.5(IQR) from Q1 5. Any values higher than the value in Step 3 or lower than the value in Step 4 are outliers Note: If you have outliers, you would adjust your whiskers inward. In this case, the whiskers would end at the first and last observations in your data that are not considered outliers
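The five steps can be sketched as a small function. The quartiles and data below are hypothetical; the function name `find_outliers` is illustrative, and it takes Q1 and Q3 as inputs rather than computing them.

```python
def find_outliers(data, q1, q3):
    """Return (lower fence, upper fence, outliers, whisker ends)."""
    iqr = q3 - q1                      # Step 1
    step = 1.5 * iqr                   # Step 2
    upper_fence = q3 + step            # Step 3
    lower_fence = q1 - step            # Step 4
    # Step 5: anything beyond either fence is an outlier.
    outliers = [x for x in data if x > upper_fence or x < lower_fence]
    # Whiskers end at the most extreme observations that are not outliers.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    whiskers = (min(inside), max(inside))
    return lower_fence, upper_fence, outliers, whiskers

# Hypothetical data with assumed quartiles Q1 = 10 and Q3 = 20.
data = [-9, 4, 10, 12, 15, 18, 20, 26, 41]
lower, upper, outliers, whiskers = find_outliers(data, q1=10, q3=20)
print(lower, upper, outliers, whiskers)
```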

frequency polygons

Another way to represent interval/ratio level data • Instead of bars that span the interval, place a dot at the midpoint of the interval and connect the dots with a straight line

Continuing on with the soda example, suppose that the data follow a skewed distribution. What is the probability that the contents of an individual can is less than 12 oz? a. 0.007 b. 0.1587 c. 0.9928

Answer: We don't have the tools to compute it at this point. We cannot use the standard normal distribution to compute probabilities here because the distribution of interest is not normally distributed.

Chi-Square Goodness of Fit Test Steps

As with all tests, we follow the basic five steps: 1. Check that assumptions of the test are met 2. State your hypotheses 3. Compute the test statistic 4. Identify critical value 5. Make a decision and interpret the results

chi sq contingency test steps

As with all tests, we follow the five steps: 1. Check that assumptions of the test are met 2. State your hypotheses 3. Compute the test statistic 4. Identify critical value 5. Make a decision and interpret the results

2 sample t test assumptions

Assumes: - Dependent variable is normally distributed - Two unrelated groups (aka independent groups, mutually exclusive) - Data are gathered from simple random samples - Equal variances - Similar sample sizes (>20-30) if last 2 not satisfied, use unequal variances option

Cough first thing in the morning in a group of schoolchildren, as reported by the child and by the child's parents (Bland et al. 1979) Which of the following is true? a. the association between reports by parents and children can be tested by a χ² test b. the difference between symptom prevalence as reported by children and parents can be tested by McNemar's test c. if McNemar's test is significant, the contingency χ² test is not valid d. the contingency χ² test has 2 degrees of freedom

B

In a chi-squared test for a 5 by 3 contingency table: a. variables must be quantitative b. observed frequencies are compared to expected frequencies c. there are 15 degrees of freedom d. all the observed values must be greater than 1

B

The chi-squared GOF fit test is used to: a) Assess the independence of two nominal variables b) Assess if observed frequencies fit a specified distributional pattern c) Assess difference in the means of two groups d) Compare paired nominal variables

B

Why would we choose to use the SD over the variance as a measure of dispersion?

Because the variance is expressed in squared units, making it difficult to interpret. The SD is expressed in the units of the original variable. Also, the SD is a unit of distance that allows us to make comparisons across populations

Best way to ensure a sample will lead to valid and reliable inferences is to

Best way to ensure a sample will lead to valid and reliable inferences is to use probability sampling

χ² Contingency Test: When Might We Use It?

Both the independent and dependent variables are nominal, dichotomous, or ordinal, or can be categorized into a nominal/ordinal variable• To compare proportions in 2 or more groups:-Undergraduate students, graduate students-children, young adult, elderly-Color M&Ms: red, yellow, orange, blue, green, brown

A researcher reports that a test is "significant at 5%." This test will also be a. significant at 1%. b. not significant at 1%. c. significant at 10%.

C

To compute the McNemar χ² test value

Calculate expected values - Use the following equation: χ² = Σ (|Oi - Ei| - 0.5)² / Ei

Characteristics of the Mode

Can be used with all levels of measurement • Only measure of central tendency that can be used for nominal data • For ordinal/IR data, the limitation is that the most common value may not necessarily be the middle value

Calculating the Probabilities from Contingency Tables

Can generate a statistic to calculate probabilities of departure from the null • Uses the chi-square (χ²) statistic • If there is no association between the 2 risk factor groups, we would expect exactly the same proportion in each group to have disease • We can calculate what the expected values would be using the marginal values in our contingency table • How is this done?

central tendency provides information on

Central tendency only provides information on where most data lie Does not inform us of how far apart the data are or variability from one subject to the next

how to calculate a CI

Choose percent (X), usually 95% • Look up the Z value for a 2-tailed test for 5% (100 - X%) - Z = 1.96 • Compute the SEM = s/√n • Calculate the CI: Mean ± 1.96 (SEM)
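The CI recipe can be sketched in a few lines. The inputs (mean 118, SD 10, n = 100) are borrowed from the Dr. Y example on the "How to use the SEM" card; the function name is illustrative.

```python
import math

def confidence_interval_95(mean, sd, n):
    """95% CI for a mean: mean ± 1.96 * SEM, where SEM = s / sqrt(n)."""
    sem = sd / math.sqrt(n)
    margin = 1.96 * sem
    return mean - margin, mean + margin

# Dr. Y's larger sample: mean = 118, SD = 10, n = 100, so SEM = 1.
lower, upper = confidence_interval_95(mean=118, sd=10, n=100)
print(f"95% CI: {lower:.2f}, {upper:.2f}")
```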

To find the critical value:

Compute degrees of freedom as (r - 1)(c - 1), where r is the number of rows and c is the number of columns • Set α • Look up the χ² critical value based on α and df using a table - If the test value > critical value, the test is significant

Levels of Measurement Determines the Test

Continuous Data (z-test, t-tests, sign test, Wilcoxon signed rank, rank sum) • What about Nominal data? -Dichotomous measurements: Yes, no Alive, dead Young, old -More than two categories: Brown, blue, or green eye color Political party preference

step 4: what is our critical value

Critical Value for χ2 with df=1, α=.05 is 3.84‒ We need a test value >3.84 to be significant at .05 level • Can also use X^2 ("chiinv") function in Excel to get critical value • Our test value is 5.625• 5.625 > 3.84, so we reject the H0 • Exposure to M&M promotional marketing is associated with buying M&Ms

Which of the following are true about the null hypothesis? (choose all that apply) a. It refers to the sample b. It asserts that there is a difference between the treatment groups c. It should include the word significant d. It refers to the population e. It is the opposite of the alternative hypothesis

D and E

McNemar's Test: Assumptions

Data are from a random sample • One dichotomous variable • One independent variable linking 2 groups -Pre/post-Parent/child • Two groups that are mutually exclusive • b+c > 25

Comparing a Continuous Variable by 2 groups

Data normally distributed - Do we have 2 independent groups? 2-sample t-test - Is the same subject assessed at 2 time points or subjects matched (paired or related) Paired t-test • Data not normally distributed - Do we have 2 independent groups? Wilcoxon rank sum test - Is the same subject assessed at 2 time points or subjects matched (paired or related) Wilcoxon signed rank test

Descriptive Statistics: Measures of Central Tendency

Describes the 'typical' or the middle value of a variable • Provides one number to summarize the data for a variable -Mean (average) -Median (mid point; half above/half below) -Mode (most common value)

histogram construction:

Determine the difference between lowest & highest value• Weight Example:- Lowest value 93: Highest value 258 - Difference between highest and lowest value is 165 • If we have 1 bar per value, more possible bars than data points• Hard to discern a pattern by eyeballing data• X-axis will be cluttered with many values making it difficult to read• Instead create bars that represent equal intervals, with no more than 20 bars• The midpoint of the interval serves as the point on the x-axis

Computing the χ2 Test Statistic

Determine the total number of subjects, n • Total rows and columns separately (margins) • Determine expected values for each cell using marginal values • Determine the actual (observed) numbers in each cell • For each cell, calculate: (Observedi-Expectedi)^2 / Expectedi • Add the (Oi-Ei)^2/ Ei quantities across all cells to get χ2 test value
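The steps above can be sketched with the M&M preference counts reported on the "State Results and Conclusion chi square" card (Ugrads: 50 plain, 10 peanut, 20 crispy; Grads: 20 plain, 60 peanut, 60 crispy). Each expected value is row total × column total / n. The variable names are illustrative.

```python
# Observed counts: rows = student status, columns = plain, peanut, crispy.
observed = [
    [50, 10, 20],   # Ugrads (n = 80)
    [20, 60, 60],   # Grads  (n = 140)
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count from the marginal totals.
        exp = row_totals[i] * col_totals[j] / n
        chi_square += (obs - exp) ** 2 / exp

# Degrees of freedom: (rows - 1)(columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square = {chi_square:.2f}, df = {df}")
```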

Computing the χ2 GOF Test Statistic

Determine the total number of subjects, n• Based on the expected population percentages, determine the number of subjects that would be in each category if the expectations were correct ("expected values")• Determine the actual (observed) numbers in each category • For each category, calculate: (Observedi-Expectedi)^2 / Expectedi • Add the (Oi-Ei)^2/ Ei quantities across all categories to get χ2 test value

Limitations of Chi Square Tests

Difficult to interpret when variables have many categories - Best when variables have no more than 3 categories • Should not use χ² test with expected cell values <5 - Use Fisher's Exact Test instead • χ² test is particularly sensitive to sample size - As N increases, obtained χ² value increases - With large samples, trivial relationships may be significant - Remember that statistical significance is not the same as clinical/practical significance • McNemar's needs n>25 in b+c

range and limitations of range

Distance between highest and lowest values • Presented as the high and low values of the data (i.e., range: 2, 14 yrs) • Provides lower and upper limits of the data in your sample • Limitations -Uses only the extremes so a single high or low value can drastically alter it Example: Ages of bachelors degree graduates at a university -The more data points, the larger the range since extreme values become more likely

Example 3: What Do We Do with Paired Data?

Does a promotional marketing event on campus lead to increased M&M purchasing among students who were exposed to the promotion?• Before the event, we randomly select and query students on whether they purchased M&Ms in the past week• These students are then exposed to a marketing event on campus• After 1 week, we contact these same students again to see if they have purchased M&Ms in the past week

McNemar's Test - Simplified

Expected values = (b+c)/2. Because of the particular form of the expected values, the McNemar χ² simplifies to: χ²M = (|b - c| - 1)² / (b + c)
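A short check that the general and simplified forms agree, using the off-diagonal counts from the M&M promotion example (b = 28, c = 12). Both should reproduce the test value of 5.625 quoted on the critical-value card. Variable names are illustrative.

```python
b, c = 28, 12                 # the two off-diagonal (discordant) cells

# General form: sum over the two cells of (|O - E| - 0.5)^2 / E,
# with E = (b + c) / 2 for both cells.
expected = (b + c) / 2
general = sum((abs(obs - expected) - 0.5) ** 2 / expected for obs in (b, c))

# Simplified form: (|b - c| - 1)^2 / (b + c).
simplified = (abs(b - c) - 1) ** 2 / (b + c)

critical_value = 3.84         # chi-square table, df = 1, alpha = 0.05

print(general, simplified, simplified > critical_value)
```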

z table

For any z-score, we can look up the corresponding area/proportion of the distribution • The total area = 1.0 or 100%• Different formats of z tables -Some tables gives area between the z score and the mean -Others give area above the z-score -Can always calculate what we need from any z table -Tables give areas for positive z-scores Since symmetric, we know the areas for negative z scores

How can you use the SEM?

For data that is approximately normal, -There is ~68% chance that the true population mean is within 1 SEM of the sample mean -There is a 95% chance that the true population mean is within 2 SEMs of the sample mean

McNemar's Test: When Might We Use It?

For paired dichotomous data-Pre/post-Data matched in some way

For population parameters we use (blank) letters •For sample statistics we use (blank) letters

For population parameters we use Greek letters•For sample statistics we use Roman letters

For sign test, our test statistic must be (blank) than the critical value to reject null

For sign test, our test statistic must be LOWER than the critical value to reject null - Opposite of parametric tests!

For skewed parent distributions, the sampling distribution (blank) as n increases

For skewed parent distributions, the sampling distribution assumes a more normal shape as n increases

The Purpose of Sampling Distributions

Generating a sampling distribution each time we wanted to answer a question would be too time consuming and expensive•Instead statistical theory can be used to determine the sampling distribution•The properties of the sampling distribution are the basis of the Central Limit Theorem

Central Limit Theorem

Given a population with mean μ and standard deviation σ, the sampling distribution of the mean based on repeated samples of size n will have the following 3 properties: -The mean of the sampling distribution is equal to the population mean based on the individual values -The standard deviation in the sampling distribution of the mean is equal to σ/√n. This quantity is called the standard error of the mean -Various notations for it: 𝜎𝑥, SDx, SE, SEM -Regardless of the distribution of the population, the sampling distribution of the mean will be approximately normally distributed when n is sufficiently large

Comparing Two Groups with I/R Data

Histogram Frequency polygon Table Dot plots Box-Whisker plots* Line Graph

Statistics in Decision Making

How many doses of flu vaccine should a practice manager order for next winter? When should they be given?• Does bicycle helmet use reduce the likelihood of head injury in a crash? • Do all women need yearly mammography after age 50?• What is the probability that a certain test result will be correct?

If = sized groups, total df =

If = sized groups, total df =N-2

Converting to the Standard Normal Distribution: Why convert?

If we convert our data on IQ scores, cholesterol levels, etc. to a standard normal distribution, we can easily calculate proportions of subjects with values above, below, and between any given values • Similar in concept to converting from meters to feet, pounds to grams

Simple Random Sampling

In a simple random sample, each element in the population has an equal probability of selection AND each combination of elements has an equal probability of selection • Examples: - Names drawn out of a hat - Random numbers to select elements from an ordered list • Starts with an enumerated list - Examples: medical records, CMS list, houses in a development • Each case has equal probability of selection • Can use random number tables or a computer-generated list of random numbers • In Stata, can use the count and sample command • Example: - Chart review of STD cases in the ED for meeting CDC guidelines for PE, diagnosis, and treatment (N=4400) - Want a sample of 200. Use random numbers generator - 22, 46, 85, 88, 110, 136, 183, 194, 199, 212, 215 . . . 4389

In most cases the population mean can be safely assumed to lie

In most cases the population mean can be safely assumed to lie within 2 SEM's of the sample mean

The 68%, 95% and 99% Rules

In normally distributed data: • About 68% of the values (observations) will be within 1 SD of the mean • About 95% of the values will be within 2 SDs of the mean • About 99.7% (commonly rounded to 99%) of the values will be within 3 SDs of the mean

statistics

In samples, the observed mean or standard deviation (calculated based on sample data) is an estimate of the population mean or standard deviation • These estimates are called statistics

Goodness of Fit Chi-Square Test: Assumptions

Independent random sample • Variable of interest is categorical (nominal, dichotomous, or ordinal) • Mutually exclusive categories • Expected value in each level of the variable (in each cell) is at least 5

2 sample t test ho and h1

Is the difference between our sample means greater than the differences within our sample means?• Are our 2 groups (drug and placebo) from the same population or are they from two different populations?

It is uncommon for a sample mean from any population to be

It is uncommon for a sample mean from any population to be more than 2 SEMs away from the population mean it estimates

How to use the SEM

Mean = 120, SEM = 5 • Based on the sample data - 68% chance that the true population mean is mean ± SEM: 120 ± 5 or between 115 and 125 - 95% chance that the true population mean is mean ± 2 SEM: 120 ± 10 or between 110 and 130 • Dr. Y is disappointed at the wide range • What if Dr. Y sampled 100 patients? - Mean and SD will likely be different, say 118 and 10 - SEM = 10/√100 = 1 • Would Dr. Y be happier with this result?

parameters

Measurements of central tendency and dispersion (mean and standard deviation) are fixed and invariant characteristics in populations We refer to them as parameters

Measures of central tendency and dispersion become (blank) with extreme values

Measures of central tendency and dispersion become unstable with extreme values

measures of dispersion refer to? used for? no measure of dispersion for (blank) data

Measures of dispersion refer to how closely the data cluster around the measure of central tendency -Used for interval/ratio and sometimes ordinal data -No measure of dispersion for nominal data

Mean

Most commonly used measure of central tendency • In "statisticalese" called the arithmetic mean • Usually what is meant by 'average' • Calculated by adding up the value of all subjects and dividing by total number of subjects

Step 3: How Do You Calculate the Test Value, Ei ?

Only interested in the 2 off-diagonal cells-Where "buy" values don't match• If no association (independence exists) between the promotional event and buying, we would expect just as many people buying before and not after as we see buying after but not before • Expected values are simply the sum of the 2 off-diagonal cells/2-Ei= (b+c)/2 • In our case, Ei= (b+c)/2 = (28+12)/2 = 40/2 = 20

Parametric tests

Parametric tests assume Gaussian, or normal distribution - z-test - t-test

Parametric tests can be (blank) w/ non-normal data

Parametric tests can be unreliable and inaccurate w/ non-normal data - Less susceptible with large samples

Randomized controlled trial (RCT)

Patients are assigned to one of two treatment procedures at random• Proportion of successes in each group are compared-We are interested in making a positive change, hopefully, we don't do harm• Example: -Patients with lung cancer are randomly assigned to 2 cancer Rx regimens, we compare the success of the treatment in both groups at the end of the trial

Type I Errors

Probability of error determined by the level of α we set •α = 0.05, we are accepting a 5% chance of a Type I error We will reject H0 5% of the time when it is true •We will say there is a significant effect 1 in 20 times when, in reality, there is no effect •Also known as a False Positive: we say there is a difference, when there is not

Wilcoxon Signed-Rank: Ranking

Rank the differences by absolute value (ignore the sign); ignore zeros • Correct the rank for ties • Apply sign to rank • Sum ranks in +/- groups; Interested in one with lower frequency (here, it is positive ranks) = 8.5
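The ranking procedure can be sketched as below. The paired differences are hypothetical (the slide's data are not reproduced on the card), and the function name `signed_rank_sums` is illustrative. Tied absolute values share the average of the ranks they occupy.

```python
def signed_rank_sums(differences):
    """Return (sum of positive ranks, sum of negative ranks)."""
    nonzero = [d for d in differences if d != 0]      # ignore zeros
    ordered = sorted(nonzero, key=abs)                # rank by absolute value

    # Assign ranks 1..n, giving tied absolute values their average rank.
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        rank_of[abs(ordered[i])] = (i + 1 + j) / 2    # mean of ranks i+1..j
        i = j

    # Apply the sign back to each rank and sum within each sign group.
    positive = sum(rank_of[abs(d)] for d in nonzero if d > 0)
    negative = sum(rank_of[abs(d)] for d in nonzero if d < 0)
    return positive, negative

# Hypothetical paired differences (after minus before).
diffs = [-3, 1, -2, 0, 2, -5]
pos_sum, neg_sum = signed_rank_sums(diffs)
test_statistic = min(pos_sum, neg_sum)   # the smaller of the two rank sums
print(pos_sum, neg_sum, test_statistic)
```

This sketch takes the smaller of the two rank sums as the test statistic, a common convention; the card instead describes it as the sum for the sign with the lower frequency, which gives the same choice in this example.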

Example 1: Z-test

Research question: Do hypertensive smokers differ in cholesterol level from the general population?•Mean serum cholesterol level for general population = 211 mg/100 ml, SD = 46 mg/100 ml•Cholesterol level is approximately normally distributed•We randomly sample 25 hypertensive smokers•Mean serum cholesterol in sample = 217 mg/100 ml•We want to know if our sample is similar to the general population or some other population in terms of mean cholesterol level
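The test statistic for this example follows the signal-over-noise form: (sample mean - population mean) / SEM. This is a sketch using the numbers on the card; the two-sided critical value ±1.96 corresponds to α = 0.05, which is assumed here because the card does not state α.

```python
import math

population_mean = 211    # mg/100 ml, general population
population_sd = 46       # mg/100 ml
sample_mean = 217        # mg/100 ml, hypertensive smokers
n = 25

sem = population_sd / math.sqrt(n)               # noise: 46 / 5 = 9.2
z = (sample_mean - population_mean) / sem        # signal / noise

critical_z = 1.96        # two-sided, alpha = 0.05 (assumed)
decision = "reject H0" if abs(z) > critical_z else "fail to reject H0"

print(f"z = {z:.2f}: {decision}")
```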

State Results and Conclusion chi square

Results:• Ugrads were significantly more likely than Grads to prefer plain M&M's (50/80, 63% vs 20/140, 14%), but Ugrads were significantly less likely than Grads to prefer peanut (10/80, 13% vs 60/140, 43%) and crispy (20/80, 25% vs 60/140, 43%), p<0.05.• Conclusion options:• M&M preference differs by student status, with undergraduates preferring plain M&M's and graduate students preferring peanut and crispy M&M's. • Student status and M&M preference are related; undergraduates prefer plain M&M's while graduate students prefer peanut and crispy M&M's.

Sampling distribution of the mean extremely useful because

Sampling distribution of the mean extremely useful because it allows us to make specific observations about the probability that a specific observation will occur •From our example: -If the mean number of months since previous check-up is really 14, how likely is a random sample of n=2 patients in which the mean is 15 or more months? -From the sampling distribution a mean of ≥15 months occurs 6 out of 25 times, or 24% (not that unusual)

Sign test only accounts for (blank)

Sign test only accounts for the change in direction, not the magnitude of change What if we have big differences in magnitude for the decreases, but only small magnitude in differences for the increases? • Sign test might not be sensitive enough

z test signal and noise

Signal (numerator): Difference between our (observed) sample mean and the (expected) population mean Noise (denominator): SEM, Standard error of the mean Large difference relative to SEM: difference is statistically significant• Small difference relative to SEM: difference is not statistically significant

Calculating the Median

Sort the data from lowest to highest value • Identify median "position" by taking (n+1)/2 • If odd number of values, the median is the middle value - If n=9, the median is the 5th value [(9+1)/2 = 5] • If even number of values, the median is the average of the two middle values - If n=8, the median is between the 4th and 5th values [(8+1)/2 = 4.5] - Take the average of the 4th and 5th values [(X4 + X5)/2] to find the median
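The position rule can be sketched as a small function. The data sets are hypothetical; the result is checked against the standard library's `statistics.median`.

```python
import statistics

def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                  # 1-based position of the median
    if n % 2 == 1:
        return ordered[int(position) - 1]   # odd n: the middle value
    lower = ordered[int(position) - 1]      # even n: value below the position
    upper = ordered[int(position)]          # and the value above it
    return (lower + upper) / 2

odd_data = [7, 1, 9, 3, 5, 11, 2, 8, 4]     # n = 9, median is the 5th value
even_data = [7, 1, 9, 3, 5, 11, 2, 8]       # n = 8, average of 4th and 5th

print(median_by_position(odd_data), median_by_position(even_data))
```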

Paired Sample T-test Steps for Hypothesis testing

Step 1: Check assumptions • Step 2: State the null & alternative hypotheses - H0: μd = 0; H1: μd ≠ 0 • Step 3: Choose α and 1-sided or 2-sided test conditions • Step 4: Define degrees of freedom (n − 1), where n is # pairs • Step 5: Determine critical value - t(critical) = ±(depends on df) • Step 6: Compute the test statistic, and compare to critical value • Step 7: State results and conclusions - Reject or fail to reject the null hypothesis - State what the decision means

iqr calculation

Steps same as finding percentiles: 1. Ordered lowest to highest, k = 25%, n=19 2. Multiply percentile by (n+1) to find location, L=(n+1)*k = 20*0.25 = 5 3. L is a whole number, so we find position 5; this is Q1 4. At k=50%, L= (n+1)*k = 20*0.5 = 10; this is Q2 5. At k=75%, L= (n+1)*k = 20*0.75 = 15; this is Q3
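A small sketch of the location rule L = (n+1)*k for n=19. The 19 data values are hypothetical, and the last step assumes the usual definition IQR = Q3 − Q1.

```python
def percentile_location(n, k):
    """Location L = (n + 1) * k in the ordered data (1-based)."""
    return (n + 1) * k

data = list(range(101, 120))   # hypothetical: 19 ordered values, 101..119
n = len(data)

# For n=19 all three locations are whole numbers (5, 10, 15)
locations = {k: percentile_location(n, k) for k in (0.25, 0.50, 0.75)}
q1 = data[int(locations[0.25]) - 1]   # position 5
q2 = data[int(locations[0.50]) - 1]   # position 10 (the median)
q3 = data[int(locations[0.75]) - 1]   # position 15
iqr = q3 - q1                         # assumes IQR = Q3 - Q1
print(locations, q1, q2, q3, iqr)
```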

Retrospective Sampling

Study groups are identified based on whether they have or don't have the outcome or disease of interest• We look back to determine the proportion of subjects in each Dz group who have some antecedent factor (exposure) of interest• Commonly called the Case-Control study• Example: -Cases with lung cancer are compared to controls (without lung CA) to ascertain the proportion in each group who were smokers in the past

Prospective Sampling

Study groups identified on the basis of the presence or absence of some antecedent factor • We follow over time to determine the proportion in each group that develop a disease or other condition (outcome)• Referred to in epidemiology as a Cohort Study• Example: -We follow smokers and non-smokers to see who develops lung cancer

Associative or Observational Studies

Subjects are sampled and we document the presence or absence of various attributes -characteristics, behaviors, demographics, etc.• Possible associations are made based on comparing proportions for these attributes• Example: -Lead intoxication is associated with hyperactivity by noting presence or absence of lead poisoning in children seen at a pediatric clinic and presence or absence of hyperactivity-CANNOT make causal inferences

Test statistic for sign test, x, is the

Test statistic for sign test, x, is the frequency value for the group with the lowest total number of +/- Count number of increases or decreases in outcome value slide 11 of nonparametric deck
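A minimal sketch of counting the signs. The before/after scores are hypothetical, and leaving out pairs with no change is an assumption here (the card does not say how zero differences are handled).

```python
def sign_test_statistic(before, after):
    """Return (x, increases, decreases); x is the smaller of the two counts."""
    diffs = [a - b for b, a in zip(before, after)]
    increases = sum(1 for d in diffs if d > 0)
    decreases = sum(1 for d in diffs if d < 0)
    # Assumption: pairs with no change (d == 0) are left out of both counts
    return min(increases, decreases), increases, decreases

before = [7, 6, 8, 5, 9, 6, 7, 8]   # hypothetical scores
after = [5, 6, 6, 6, 7, 4, 5, 9]    # hypothetical scores
x, n_up, n_down = sign_test_statistic(before, after)
print(x, n_up, n_down)
```

Only the direction of each change enters the count, which is why the test can miss large differences in magnitude.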

McNemar's Test: What Is It?

Test used on paired dichotomous data to determine whether the row and column marginal frequencies are equal (whether there is "marginal homogeneity")• The null hypothesis of marginal homogeneity states that the two marginal probabilities for each outcome are the same, i.e. pa + pb = pa + pc and pc + pd = pb + pd

Chi Square (X^2) Contingency Test: What Is It?

The χ2 test is a two-sample test • Compares two samples to each other - Analogous to a two-sample t-test - Independent samples t-test compares means of two groups - χ2 compares proportions between two groups - It is the counterpart to the χ2 GOF test for when we have 2 groups, instead of 1 • Also called χ2 test for independence • It is another non-parametric test; does not require normal distribution

Contingency Table for Matched Design Layout

The 4 cells display the 4 possibilities for the pairs• Only interested in 2 off-diagonal cells, expect equal values trying pre and post if really no difference in promotional marketing (lecture 5b slide 58)

Why choose the standard deviation?

The ideal measure of dispersion: • Uses all the information available • Describes the average or typical deviation of the values from the mean • Increases in value as the distribution of values becomes more diverse • Allows us to compare across populations - Can compare dispersion across different datasets with different sample sizes or across groups • Critical measure for tests of statistical significance • Used for interval/ratio variables

The main purpose of the SEM is to

The main purpose of the SEM is to indicate the accuracy with which a sample mean estimates the population mean

positively skewed distribution

The mean is greater than the median

negatively skewed distribution

The mean is less than the median

The Mean as a Balance Point

The mean is the single point of equilibrium (balance) in a data set •The sum of the deviations from the mean always equals zero The mean is affected by all values in the data set -If you change a single value, the mean changes
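Both properties are easy to verify numerically. A quick sketch with hypothetical values:

```python
values = [2, 4, 4, 5, 10]                 # hypothetical data
mean_value = sum(values) / len(values)    # 5.0

# Deviations from the mean cancel out: -3, -1, -1, 0, +5
deviations = [v - mean_value for v in values]
sum_of_deviations = sum(deviations)

# Changing a single value shifts the mean
changed = [2, 4, 4, 5, 20]
changed_mean = sum(changed) / len(changed)
print(sum_of_deviations, mean_value, changed_mean)
```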

The mean of the population and the mean of the sampling distribution of means:

The mean of the population and the mean of the sampling distribution of means have the same value

Wilcoxon Signed-Rank Test: Assumptions

The measures of the dependent variable are matched or paired (correlated or related in some way)• The dependent variable (e.g. pain score) is at least ordinal• There are at least 5 pairs

Characteristics of the Median

The median is robust, or not influenced by extreme values (a few very high or very low values) -Useful measure of central tendency for highly skewed (non-symmetrical) data -Example: NFL player salary, household income • Cannot be used for nominal level data since no order is present • Can be used for ordinal data, but when the range of possible values is small, it may be of limited value

Power of a study:

The probability of correctly rejecting H0 when it is false, i.e. correctly finding a difference or effect when one exists - Power = 1 − β; power of 0.80 (β = 0.20) usually considered sufficient • Important in planning a study so it is large enough to detect the effect that is the subject of the research if an effect exists - Otherwise waste of time and money

What is a p-value?

The probability of seeing results like ours if, in truth, there is no difference, i.e. H0 is true -The probability of seeing a difference as large or larger than what we saw if, in truth, there is no difference, i.e. H0 is true

The shape of the sampling distribution of means:

The shape of the sampling distribution of means, even for a sample size as small as 2, is beginning to approach the shape of a normal distribution This happens despite the fact that the original shape of the population distribution was rectangular

The smaller the p-value, the lower the chance of

The smaller the p-value, the lower the chance of getting a difference as big as the one observed if the null hypothesis is true

(blank) is the main statistic for sample means

The standard error of the mean (SEM) is the main statistic for sample means

Mode

The value that occurs most frequently • In a frequency distribution, look for the highest frequency • In a graph, look for the peak or highest bar in a histogram • Distributions with two peaks are bimodal (have two modes) -Even if the peaks are not exactly the same height

The SD versus SEM

The value σ measures the standard deviation in a population based on measurements of individuals • It tells us how much variability is expected among individuals • The standard error of the mean (SEM), however, is the standard deviation of the means in a sampling distribution • It tells us how much variability can be expected among means in future samples

The variability in the sampling distribution of the means is (blank) than the variability in the original population

The variability in the sampling distribution of the means is less than the variability in the original population
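One way to see this is to list every possible sample of n=2 from a tiny population and compare the spreads. The five population values are hypothetical; sampling is with replacement, which gives 25 ordered samples.

```python
from itertools import product
from statistics import mean, pstdev

population = [2, 4, 6, 8, 10]                   # hypothetical population
samples = list(product(population, repeat=2))   # all 25 samples of n=2
sample_means = [mean(s) for s in samples]

pop_mean = mean(population)
pop_sd = pstdev(population)            # SD of individuals
mean_of_means = mean(sample_means)     # equals the population mean
sd_of_means = pstdev(sample_means)     # SD of the sample means (the SEM)
print(pop_mean, mean_of_means, round(pop_sd, 3), round(sd_of_means, 3))
```

In this setup the SD of the means comes out as the population SD divided by the square root of n.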

Use the mode when

The variable is measured at the nominal level - You want to report the most common value

Theoretical distributions (almost) never

Theoretical distributions-(almost) never generate sampling distributions by actually taking many samples

Sampling Distribution

There is a sampling distribution of the difference between the means, just as there is for a mean •If we selected multiple samples (i.e. did the study multiple times), the difference between the means would vary •If we plotted the differences from each study (set of 2 samples), they would appear normally distributed •The mean difference would be the "true" difference between the groups (i.e., the one we would expect if we were to measure the population)

Usually with n as small as 30, the sampling distribution is (blank)

Usually with n as small as 30, the sampling distribution is well approximated by the normal

median

Value that is the exact middle of the sample•Point at which half of subjects lie above the value and half below it

z score

We can express any normally distributed quantity in terms of standard deviation units, also called a z-score • z = (x − μ)/σ • x is a specific value to be standardized • μ is the mean of the population • σ is the standard deviation of the population

We use smaller samples to

We use smaller samples to infer what would happen in the larger population

How To Interpret Contingency Table

When you stay within columns and only use totals, you are referring to frequency or % of that variable, irrespective of the other variable. When you stay within rows and only use row totals, you are referring to the frequency or % of that variable, irrespective of the other variable. When you look within cells, you need to be sure you understand the question you are asking... • If asking, "What proportion of [var1] are [var2]"? • The marginal value for [var1] will be your denominator, and the individual cell count will be your numerator.

χ2 Contingency Test: What Types of Questions Can We Answer?

Which contraceptive methods are more effective for younger vs. older teens? • Are children less likely to have lifelong peanut allergies with early exposure? • Is violent video game playing associated with aggressive behaviors? • Is probiotic use equally effective at reducing GI symptoms for children and adults? • Does one treatment reduce stroke risk better than another over 1 yr? 5 yrs? • Is there a higher risk for breast cancer if one lives within 5 miles of a fracking site? • Is JUUL use associated with increased risk for lung disease? • What type of M&Ms are preferred by undergraduate vs. graduate students?

Fisher's Exact Test

The χ2 procedure is an approximate method • When any expected cell frequency is <5, the approximation inflates the value of χ2 • Use an alternative test, Fisher's exact test • If table larger than 2x2, categories should be combined to eliminate most of the expected values <5 • If sample size large, the χ2 test will closely approximate Fisher's Exact Test • Complex formula. Computed by Stata.

Example 2: Two-Sample Comparison

You have been asked to assist M&M/Mars, Inc. with deciding which type(s) of M&Ms to sell on campus. They want to identify if undergrad and graduate college students prefer the same or different types of M&Ms; their overall customer base on campus is 75% undergrads. • They are interested in 3 of their best sellers: plain, peanut and crispy. • Your goal is to capture 200 students. • You and a friend ask passersby on 36th & Locust for their top choice, and you document your findings. The research question can be stated in several ways: - Is there an association (or relationship or dependency) between being an undergrad or graduate student and M&M preference? OR - Is there a difference in the proportion of undergraduate and graduate students who prefer plain, peanut, or crispy M&Ms? • Regardless of how you state the research question, we can use the same statistical test: the chi-square Contingency (χ2) test

Z-test, State Results/Conclusions

Ztest = 0.65, which is < 1.96 (Zcritical)• We do not reject H0, which suggests that our mean does not differ from the population mean• The cholesterol level of hypertensive smokers does not differ significantly from that of the general population (217 mg/100ml vs. 211 mg/100ml, z=0.65, p>0.05)-The hypertensive smokers come from the same distribution as the general population

What is the correct degrees of freedom in a study comparing cholesterol levels between two groups that receive either placebo or an experimental medication, where n1=10 and n2=12?

[df for independent samples t-test = (10 + 12) -2 = 20]

Sodas in a can are supposed to contain an average of 12 oz. This particular brand of soda has a standard deviation of 0.2 oz, with an average of 12.1 oz. If the can's contents follow a normal distribution, what is the probability that the mean contents of a case (24 cans) are less than 12 oz? a. 0.007 b. 0.1587 c. 0.9928

a
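The arithmetic behind this answer, sketched with the standard library only. The helper `normal_cdf` is written here for illustration (it builds the standard normal CDF from `math.erf`), and the SEM is taken as σ/√n.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu = 12.1      # average contents per can (oz)
sigma = 0.2    # SD of individual cans (oz)
n = 24         # cans in a case
x_bar = 12.0   # case mean of interest (oz)

sem = sigma / sqrt(n)              # SD of the sampling distribution of means
z = (x_bar - mu) / sem             # about -2.45
p_case_mean = normal_cdf(z)        # P(case mean < 12 oz)
print(round(z, 2), round(p_case_mean, 3))
```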

The manager of an automobile dealership is considering a new bonus plan to increase sales. Currently, the mean sales rate per salesperson is 5 automobiles per month. The correct set of hypotheses to test the effect of the bonus plan increasing sales is a. H0: μ= 5, Ha: μ> 5. b. H0: μ> 5, Ha: μ= 5. c. H0: 𝑥= 5, Ha: 𝑥> 5. d. H0: 𝑥=5, Ha: 𝑥= 5.

a

Individual:

a member of the sample

Sample:

a subset of the study population

Continuing on with the soda example, suppose that the data follow a skewed distribution. What is the probability that the mean contents of a case (24 cans) are less than 12 oz? a. 0.007 b. 0.1587 c. 0.9928

a. 0.007, same as before. This is because the CLT tells us that the distribution of sample means will follow a normal distribution, even with relatively small sample sizes. Therefore, we may use the standard normal distribution to calculate probabilities.

A researcher wants to know if calcium is an effective treatment for lowering blood pressure. He assigns one randomly chosen group of subjects to take calcium supplements; the other group will get placebo. At the end of the treatment period, he measures the difference in blood pressure. The 50 members of the calcium group had blood pressure lowered by an average of 25 points with a sample standard deviation of 10 points. The 50 members of the placebo group had blood pressure lowered by an average of 15 points with a sample standard deviation of 8 points.To analyze this information we will use a. Two-sample t-test b. One-sample z-test c. Paired t-test

a. Two-sample t-test [you have two independent groups and a continuous outcome measure, and no evidence that differences are not normally distributed]

The z-statistic from a sample of n= 19 observations for the two-sided test of H0: μ= 6 and H1: μ≠ 6 has the value z= 1.6. Based on this information: a. We would retain the null hypothesis at α = 0.10. b. We would reject the null hypothesis at α = 0.10. c. 0.025 < p-value < 0.05. d. Both (b) and (c) are correct.

a. We would retain the null hypothesis at α = 0.10. [If you look in the z-table, you'll find that the area under the curve between the mean and z = 1.6 is 44.52%. This means the remaining 5.48% is in the upper tail, with another 5.48% in the lower tail. 5.48% × 2 = 10.96%, which is > 10%. The p-value is greater than alpha (our cutoff), so we fail to reject the null.]
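The same two-sided p-value can be checked without a z-table. This is a sketch using only the standard library; `normal_cdf` is a helper defined here for illustration.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z_observed = 1.6
alpha = 0.10

upper_tail = 1 - normal_cdf(z_observed)   # area above z = 1.6
p_two_sided = 2 * upper_tail              # both tails for a two-sided test
reject_null = p_two_sided < alpha
print(round(p_two_sided, 4), reject_null)
```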

two variables (bivariate)

assume both variables are nominal/ordinal • Examples-Subjects exposed/not exposed to chemical agent and whether they developed respiratory symptoms-Compare college men to women on whether or not they binge drink -Education level and whether or not subjects have a job• 2 Basic Options:-Table-Bar graphs with percentages as bar heights

A consumer believes the veggie manufacturer is under-filling its frozen peas. The true mean weight stated on the package is 500 g. By weighing a sample of nine packages, he will reject the null hypothesis if the sample mean is less than 496.8 g. Assume the weights of packages follow a Normal distribution with a standard deviation of 6 g. A possible Type I error of this test would be to decide that there is a. insufficient evidence to conclude Veggies-R-Us is under-filling its packages when in fact it is under-filling their packages. b. sufficient evidence to conclude Veggies-R-Us is under-filling its packages when in fact it is not under-filling their packages. c. insufficient evidence to conclude Veggies-R-Us is under-filling its packages. d. sufficient evidence to conclude Veggies-R-Us is under-filling its packages when in fact it is under-filling its packages. e. insufficient evidence to conclude Veggies-R-Us is under-filling its packages when in fact package weights equal indicated amounts.

b

A survey interviews 1000 Americans by telephone and asks: "What do you think is the biggest problem facing education today?" The population of interest for this poll is most likely a. American students b. American adults c. American teachers

b

It has been claimed that women live longer than men; however, men tend to be older than their wives. Ages of 16 husbands and their wives from the US were obtained. These data should be analyzed with a a. two-sample t-test. b. paired sample t-test. c. two-sample z-test.

b

It has been claimed that women live longer than men; however, men tend to be older than their wives. Ages of 16 husbands and wives from the US were obtained. The null hypothesis of equality of mean differences is rejected. What conclusion can be made from this study? Based on this data, we a. Have reason to believe that husbands are older than their wives. b. Have reason to believe that US husbands are older than their wives. c. The sample was too small. No conclusions can be made.

b

Sodas in a can are supposed to contain an average of 12 oz. This particular brand of soda has a standard deviation of 0.1 oz, with an average of 12.1 oz. If the can's contents follow a normal distribution, what is the probability that the contents of an individual can is less than 12 oz? a. 0.007 b. 0.1587 c. 0.9928

b

The administration at a large state university is interested in getting the opinions of students on a proposed instructional fee for use of computer labs on campus. The administration selected a simple random sample of 50 freshman, 50 sophomores, 50 juniors, and 50 seniors. This is an example of a a. Systematic sample b. Stratified random sample c. Simple random sample

b

What is the P-value for a test of the hypotheses H0: μ= 10 against Ha: μ≠ 10 if the calculated test statistic is z= 2.56? a. 0.0052 b. 0.0104 c. 0.9948

b

The water diet requires you to drink 2 cups of water every half hour from when you get up until you go to bed but eat anything you want. Four adult volunteers agreed to test this diet. They are weighed prior to beginning the diet and 6 weeks after. Their weights in pounds are: weight before- mean=173.75 weight after- mean=166.75 What is the mean of the differences? a. 170.25 b. 7 c. 13.64

b. 7 [The difference between the means is mean1-mean2]

What is the rejection region at the 5% level of significance? a. t > 2.353 b. t > 3.182 c. t > 1.943

b. t > 3.182 [Find the df, which is n-1 = 4-1 = 3; use the t-distribution table to look up the critical value for 2-sided alpha=0.05]

case control design

begin with determining which group represented cases (people with disease/outcome) vs controls. then look in the past to classify whether each had the risk factor or not, and compare the presence of risk factors in each group.

A market researcher selects 500 people from each of 10 cities. Identify which type of sampling is used. a. Simple random b. Stratified c. Cluster d. Matched pair

c

when to use dispersion vs central tendency

chart on slide 33 of slidedeck 2b

Results may be statistically significant but

clinically unimportant.

What if we have no idea what the expected percentages should be? What if we want to compare 2 groups?

contingency table chi square test

A pollster uses a computer to generate 500 random numbers and then interviews the voters corresponding to those numbers. Which type of sampling is being used? a. Stratified b. Cluster c. Matched pair d. Simple random

d

data values can be

data values can be discrete or continuous

populations use (blank) type of distribution, and sampling use (blank) type of distribution

empirical, theoretical

If we draw equally sized samples from a non-normal distribution, the distribution of the means of these samples will (blank), as long as the (blank)

If we draw equally sized samples from a non-normal distribution, the distribution of the means of these samples will still be approximately normal, as long as the samples are large enough

type 2 error

fail to reject when there actually is a difference false negative

finding the critical value of z

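go to z table; if alpha is 0.05 and the test is 1-sided, find the z score with 5% of the distribution below it (lower tail) or above it (upper tail); for a 2-sided test, put 2.5% in each tail (z = ±1.96)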

sd versus sem

good table on slide 31 of slideset 3b

Very small p values do not necessarily indicate

great discoveries.

populations are a set of (blank), and sampling is a set of (blank)

individuals, means (each calculated from a different sample) visualization of this available on slides 12 and 13 of slide deck 4a

level of measurement practice examples on slide 53 of slide deck 1a

on slide 53 of slide deck 1a

68% and 95% rules for confidence intervals

Population: ~68% of individual values are within 1 SD of the mean (mean ± 1 SD); ~95% of individual values are within 2 SDs (1.96 to be exact) of the mean • Sampling: ~68% of sample means are within 1 SEM of the mean (mean ± 1 SEM); ~95% of sample means are within 2 SEM (1.96 to be exact) of the mean • 95% CI = mean ± 1.96 SEM
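A short sketch of the 95% CI = mean ± 1.96 SEM rule. The ages below are hypothetical, and the SEM is estimated as the sample SD divided by √n.

```python
from math import sqrt
from statistics import mean, stdev

ages = [30, 34, 36, 38, 42, 33, 39, 40, 35, 38]   # hypothetical sample

sample_mean = mean(ages)
sem = stdev(ages) / sqrt(len(ages))    # standard error of the mean
ci_lower = sample_mean - 1.96 * sem
ci_upper = sample_mean + 1.96 * sem
print(round(sample_mean, 1), round(ci_lower, 1), round(ci_upper, 1))
```

With a sample this small a t critical value is often used in place of 1.96; the sketch simply follows the rule as written on the card.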

regardless of the distribution of the population, the sampling distribution of the mean will be

regardless of the distribution of the population, the sampling distribution of the mean will be approximately normally distributed

Basis for making statistical inferences about a population from a sample

sampling distributions

test statistics determine

signal to noise ratio

Box and Whiskers Plot with Outliers

slide 15 of slidedeck 2b • Describes the data distribution • Used for continuous data • Provides easily interpretable visual display, showing: - Median - Q1 - Q3 - Range - IQR - Outliers

Wilcoxon Signed-Rank Calculation

slide 19 of nonparametric slidedeck • Calculate the median and IQR for each group • Compute the differences • Compute the absolute value of the differences • Order by absolute value, smallest to largest • Rank the differences by absolute value (ignore the sign, ties assigned midpoint) • Add the sign of the difference (+ or -) for the corresponding rank • Sum both the - ranks and + ranks; interested in whichever is smaller • The sum of the smaller ranks is the T-value • If calculated sum (T-value) ≤ critical value in table, statistically significant (opposite of z- and t-tests)
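The ranking steps can be sketched as follows. The function name and the pain scores are hypothetical; zero differences are dropped and tied absolute differences share the midpoint rank. The critical value still has to come from a table.

```python
def wilcoxon_signed_rank(before, after):
    """Return (T, sum of + ranks, sum of - ranks) for paired data."""
    # Differences, dropping pairs with no change
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ordered = sorted(diffs, key=abs)          # smallest to largest |d|
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over a run of tied absolute differences
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        midrank = ((i + 1) + (j + 1)) / 2     # average of the tied ranks
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    positive = sum(r for d, r in zip(ordered, ranks) if d > 0)
    negative = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(positive, negative), positive, negative

before = [8, 7, 6, 9, 5, 7, 8]   # hypothetical pain scores
after = [5, 6, 6, 4, 6, 5, 6]    # hypothetical pain scores
t_value, pos_sum, neg_sum = wilcoxon_signed_rank(before, after)
print(t_value, pos_sum, neg_sum)
```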

finding the area above/below a positive z score

slide 27 slidedeck 2c

finding the area between z scores on the same side of the mean

slide 29 slidedeck 2c

Parametric vs Non-Parametric Equivalents

slide 43 of nonparametric deck

Computing Percentile

slide 7 of slidedeck 2b • Arrange scores in order from lowest to highest • Find the percentile location, L = (n+1)*k -Where n=number of observations, k=percentile of interest, L = location • Is L (location) a whole number?

probability sampling methods chart

slidedeck 3a, slide 26

populations use measure of dispersion (blank), and sampling use measure of dispersion (blank)

standard deviation (SD or sigma, σ) ; standard error of the mean (SEM)

cohort design

start with a study population free of the outcome, then classify by exposure (has risk factor or doesn't have risk factor), then measure outcomes (disease or no disease) in each group and compare

associative design

start with study population, measure/classify and compare whether they have the disease/outcome or not, then whether those groups had risk factor or not

independent samples t test formula

t = (xbar1 − xbar2) / sqrt(s1^2/n1 + s2^2/n2) • Calculate the means of each group and subtract to get the difference (numerator) • Calculate the variance of each group: s^2 = Σ(Xi − Xbar)^2 / (n − 1), where n = number in each group • From the variances and sample sizes, calculate the pooled SE of the difference (not the SE of the means for each group) • t = the difference in means divided by pooled SE • Can be positive or negative
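A minimal sketch of the formula as written on this card, with df = n1 + n2 − 2. The function name and both groups of measurements are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group1, group2):
    """Return (t, df): difference in means over the SE of the difference."""
    difference = mean(group1) - mean(group2)              # signal
    se_difference = sqrt(variance(group1) / len(group1)   # noise
                         + variance(group2) / len(group2))
    df = len(group1) + len(group2) - 2
    return difference / se_difference, df

group_a = [35, 38, 32, 36, 34]   # hypothetical measurements
group_b = [27, 29, 25, 28, 26]   # hypothetical measurements
t_stat, df = independent_t(group_a, group_b)
print(round(t_stat, 3), df)
```

`statistics.variance` divides by n − 1, matching the sample variance on the card.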

Student's t-test: Basic Concept

t= variance between 2 groups (signal) / variance within 2 groups (noise) • A large t statistic indicates the groups are different • A small t statistic indicates the groups are similar • If the difference between the means is large compared to the SE (variation) of the difference, the data support that there is a difference between the groups

Paired Samples T-test Formula

t = d / (s / sqrt(n)) - d = mean difference - s = SD of the differences - n = number of pairs
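The paired formula in code, as a sketch. The function name and the before/after weights are hypothetical; df = n − 1, where n is the number of pairs.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Return (t, df) for paired data using t = d / (s / sqrt(n))."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)                 # number of pairs
    d_bar = mean(diffs)            # mean difference
    s = stdev(diffs)               # SD of the differences
    return d_bar / (s / sqrt(n)), n - 1

before = [150, 160, 170, 180]   # hypothetical weights before
after = [146, 158, 163, 175]    # hypothetical weights after
t_stat, df = paired_t(before, after)
print(round(t_stat, 3), df)
```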

Failure to reject the null hypothesis does not guarantee

that differences observed are not real.

Target Population:

the entire group of individuals that we wish to apply our conclusions to

Study Population:

the group of individuals to which we can legitimately apply our conclusions

The size of the p value does not indicate

the importance of the result.

independent variable

the intervention or exposure, characteristic, or what is being manipulated • x variable • treatment variable• exposure variable • predictor variable• risk factor • experimental variable

dependent variable

the outcome of interest, which should change in response to some intervention• y variable • measured variable• outcome variable • explained variable• response variable • observed variable

the smaller the p-value, the

the smaller the p-value, the stronger the evidence against the null hypothesis And the more comfortable we are with rejecting the H0

z-test used for:

to compare a sample mean to a known population mean-Compare "observed" value to "true" or "expected" value

in epi we use tables to

to determine risk for outcomes based on exposure; to assess risk of exposure based on outcomes

false negative

type 2 error fail to reject when there actually is a difference

α = 0.05, we are accepting a 5% chance of

type i error

Example 1: Independent Samples T-Test• A company (Worker Safety, Inc.) has developed a new All In One (AIO) type of protective mask that they would like to test in a working environment• New mask is cheaper to make and the company boasts of its ability to protect both the eyes and respiratory system, and is lighter to wear• Conduct an RCT with 20 randomly selected people, randomly assigned to:-Control group, HEPA mask (n=10) and treatment group, new AIO mask (n=10) • They plan to measure the amount (μg) of particulate matter (PM) collected on the mask after 1 day of work what do we want to know? •What is the independent variable? •Level of measurement? •What is the dependent variable? •Level of measurement? •What measure of central tendency should we use? how do we check if normally distributed research hypothesis null? alternative? df? critical value? t value? interpret conclusion What is the error we could have made? what is this type of error called? what is the probability we made this error?

want to know: • Will the AIO mask be flying off the shelves? • Can we determine how much variation in the particulate matter (PM) arose from differences between masks and how much came from within masks? - With our signal to noise analogy: Is there a large difference in masks between the PM captured compared to the variability in PM captured? • IV: mask type • LOM: nominal • DV: μg particulate matter • LOM: i/r • MCT: mean • normal: look at histograms of each for distribution, compare mean and median • Research Hypothesis: The AIO mask captures more particulate matter compared to the HEPA mask • Null Hypothesis: The mean PM captured on the AIO mask is the same as on the HEPA mask • Alternative Hypothesis: The mean PM captured on the AIO mask is not the same as on the HEPA mask • df: 18 • critical val: 2.101 • t=4.178 • reject the null • The AIO mask captures significantly more particulate matter than the HEPA mask (mean = 35 vs. 27 μg, p=0.0006) • error: the null hypothesis could be true and we rejected it • type 1 (false positive) • prob is 0.0006 or 6 in 10,000

False Positive

we say there is a difference, when there is not type i error

Critical Values for 2-sided z-test, if α=.05

z=-1.96 and z=1.96

α is

α is set by the researcher before the study begins and is the type I error that he/she is willing to live with

chi-square (χ2) test statistic

χ2 = Σ(Observedi-Expectedi)^2 / Expectedi

χ2 GOF Test Statistic and Critical Value

χ2GOF = Σ(Observedi-Expectedi)^2 / Expectedi • This sum is known as the chi-square (χ2) test statistic • To find the critical value: • Compute degrees of freedom as (k - 1), where k is the number of categories • Set α • Look up χ2 critical value based on alpha and df using table - If we have a test value > critical value, we have a significant test
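A sketch of the GOF statistic and the comparison to a critical value. The observed counts and expected proportions are hypothetical; 5.991 is the tabled χ2 critical value for df = 2 at α = 0.05.

```python
def chi_square_gof(observed, expected_proportions):
    """Return (chi-square GOF statistic, df) with df = k - 1."""
    total = sum(observed)
    expected = [total * p for p in expected_proportions]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1

observed_counts = [50, 30, 20]         # hypothetical counts, k = 3
expected_props = [0.40, 0.40, 0.20]    # hypothetical expected pattern
chi2_gof, df = chi_square_gof(observed_counts, expected_props)

critical_value = 5.991                 # table value: df = 2, alpha = 0.05
significant = chi2_gof > critical_value
print(round(chi2_gof, 2), df, significant)
```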

Degrees of Freedom (df) - huh?

• # df has to do with how many numbers in a data set are free to vary; helps estimate variability • Example: 7 hats, wear different one each day- 1st six days can vary, last one is given • Similarly, if we know a mean and sample size, all except one observation in sample can vary- mean*N = sum of all observations • Ex: if mean=3.5 and n=10 ppl; sum=35 - 1st nine values can change (free to vary) - 10th is fixed because all numbers must sum to 35 • Can also think of df as N minus the number of "relations among observations" in the sample, or the # of things you are measuring - In t-test, there is one parameter measured in each sample... the mean - So, df (in each sample) is n-1 - BUT... with Student's t-test, we compare 2 samples • Total df = (n1−1) + (n2−1) = n1+ n2 -2 • If = sized groups, total df = N-2
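The mean=3.5, n=10 example can be played out in code. The nine "free" values below are hypothetical; the tenth is then forced.

```python
mean_value = 3.5
n = 10
required_sum = mean_value * n        # all 10 values must sum to 35

# Nine values chosen freely (hypothetical)
free_values = [1, 2, 3, 4, 5, 6, 2, 3, 4]

# The 10th value is not free: it is whatever makes the total 35
last_value = required_sum - sum(free_values)
all_values = free_values + [last_value]
print(last_value, sum(all_values) / n)
```

Nine values can vary and one cannot, so df = n − 1 = 9.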

What is Inference?

• A generalization made about a large group or population from the study of a sample of that population• Major part of research is to infer, or generalize from a sample to a larger population• The process of inference is accomplished by using statistical methods based on probability

Percentiles

• A number where a certain percentage of values fall below it -Median is the 50th percentile -50% of cases below -If your height is at the 85th percentile, 85% of people are shorter than you -If you score in the 90th percentile on the GRE, you scored higher than 90% of other test takers • Tertiles, quartiles, quintiles, deciles

Important Properties of the Normal Curve

• All normal curves can be described by the mean and SD-Can think of them as a family of curves• Distances along X-axis, if measured in SD, always encompass same exact proportion of total AUC (i.e., same proportion of population) • With normally distributed variable, if we know mean and SD, we can calculate % of subjects between any 2 values or above/below any value • Many statistical tests based on the normal distribution

Student's t-test

• Also called the 2-sample or independent sample t-test - Independent variable nominal dichotomous • Simplest method to compare means of 2 unrelated groups when outcome (DV) is continuous (sometimes ordinal if enough groups) - Whenever the mean is appropriate descriptive measure

Nonparametric Tests

• Also known as distribution-free tests- No assumptions made about distribution• Assume that data for all groups must have similar spread (dispersion)- Just like we did with "equal" variances for t-test• Nonparametric equivalents exist for each test we have seen so far (and others we will learn)• Generally they have less power than parametric tests

Wilcoxon Signed-rank Test

• Analogous to the paired t-test, but used in situations when parametric assumptions are not met • Usually more powerful than Sign test- Except possibly when heavy tailed* • Uses the size as well as the direction of the difference • Tests whether the medians are the same in the 2 different situations null: - The median pain score before minus the median pain score after acupuncture equals 0- The median difference in pain score = 0 • Alternative hypothesis H1: - The median difference in pain score ≠ 0

1-Way Scatter (Dot) Plots

• Another method to display an interval/ratio level variable• Advantage over histograms and frequency polygons is each observation represented individually (no information is lost)• Disadvantage is graph may be difficult to read if many data points lie close together or have the same value

Comparing 2 Nominal/Ordinal Variables

• Are the observations independent and the sample size relatively large? - Contingency table χ2 test • Are the observations independent, but some expected cell frequencies are <5? - Fisher's exact test • Are the data matched or paired? - McNemar's χ2 test • Do we wish to compare an observed pattern of percentages to a theoretical or expected pattern (one sample test)? - Goodness of fit χ2 test

Which is Which? Independent vs. Dependent

• Aspirin compared to placebo to see if aspirin leads to a reduction in heart attacks • The relationship between obesity and the development of diabetes • Asbestos exposure and risk of mesothelioma • Life expectancy and cigarette smoking • The effect of a vaccine program on infant mortality rates in 50 emerging nations • Risk of obesity among diabetics

Interquartile Range (IQR)

• Avoids problems w/ range by focusing only on middle 50% of values • IQR not influenced by extreme values • IQR often reported with the median for skewed data • Limitation: Because IQR is only based on two values, it doesn't give any information about all other values (in Q1 and Q4)

More about Wilcoxon Signed-Rank Test

• Based on ranks, not values • Tests whether the medians are the same or not • A few rules - What if the difference = 0? Subject is not included in analysis - What if >1 subject has the same difference, i.e. a tie? Average the ranks If 2 subjects are tied for 4th place, they would have had ranks 4 and 5 if not tied Both are given the rank of 4.5 If 3 subjects are tied for 4th place, they would have had ranks 4, 5, and 6, if not tied All 3 are given rank of 5

advantages of stratified random sampling

• Can provide greater precision than simple random sample of same size • Ensures sample will be representative on selected trait • Requires smaller sample, costs less • Enhances potential for subgroup analysis

what if both IV and DV are nominal?: Nominal Test Options

• Chi-square Goodness of Fit test• Chi-Square Contingency test• Fisher's Exact test• McNemar's test

general graphing rules

• Clearly label with a title and axis labels• Show units where appropriate• In a publication, graphs should stand alone with minimal referral to text • Keep gridlines to a minimum• Categories ordered by size (for nominal data)• No three-dimensional graphs• If comparing 2 populations of different sample size, use relative frequency (%) instead of absolute frequency (n)• Pay attention to scales being used

Wilcoxon Signed-Rank Test: Computation

• Compute the differences for each pair• Discard all changes that are 0 - It's as if those pairs were not even in the study• Arrange non-zero values in order of increasing absolute values (that is, ignore the sign)• Give each reordered value a rank• Calculate the ranks for any ties• Put signs on the ranks, corresponding to the sign of the difference for each pair• Add all the positive (or negative) ranks, whichever is smaller• Look up critical value for T in table or use Stata• If calculated T is the same or less than critical T from table, test is statistically significant- Remember... this is opposite from parametric tests where the test statistic must be larger than the critical value to be significant (reject the null) • State the results and conclusion
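
The computation steps above can be sketched in plain Python. This is a minimal illustration, not what Stata does internally; the function name `signed_rank_T` and the pain scores are invented for the example, and the critical value still has to be looked up in a table.

```python
def signed_rank_T(before, after):
    """Return (T, n): the smaller signed-rank sum and the number of non-zero pairs."""
    # Compute the difference for each pair and discard the zeros
    diffs = [b - a for b, a in zip(before, after) if b - a != 0]
    # Arrange the non-zero differences by increasing absolute value
    ordered = sorted(diffs, key=abs)
    # Rank them, giving tied absolute values the average of their ranks
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    # Put the signs back on the ranks and sum each side separately
    positive_sum = sum(r for d, r in zip(ordered, ranks) if d > 0)
    negative_sum = sum(r for d, r in zip(ordered, ranks) if d < 0)
    # T is whichever sum is smaller
    return min(positive_sum, negative_sum), len(ordered)


# Hypothetical pain scores for 8 patients (made up for illustration)
before = [12, 10, 11, 14, 9, 13, 8, 12]
after = [6, 7, 11, 8, 10, 6, 5, 9]

T, n = signed_rank_T(before, after)
print(T, n)  # 1.0 7  (one pair had a difference of 0 and was dropped)
```

Compare T with the critical value for n non-zero pairs; the test is significant when T is the same as or smaller than that value.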

Tips for Presenting Data with Two Groups

• Consistent use of x and y-axes across multiple panels• Carefully consider including "0" in your axis-Sometimes, it is essential to include 0-Other times, inclusion of 0 is not necessary• Consistent use of y-scale across multiple panels• Consistent use of colors for different categories• Consistent use of fonts, line widths, box sizes, etc., to avoid distortion• With few categories, a single figure may facilitate comparisons; with many categories, consider multiple panels

Scatter Plots: Comparing 2 Interval Ratio Variables

• Depicts the relationship between 2 different interval/ratio level variables• Each point of the graph represents a pair of values

bar charts

• Display frequency distribution of nominal or ordinal data • Categories are along the horizontal axis • Vertical bars are drawn so heights represent # or % of observations in each category • Bars should be equal in width and separated by a space so as not to imply continuity

interval and ratio level data

• Distances between values can be determined and are meaningful• No true zero - arbitrary- Some interval level variables have a zero point but does not represent the absolute lowest value• Can calculate meaningful differences or intervals (addition/subtraction), but not ratios (multiplication/division)• Example: IQ- Difference between IQ of 70 and 80 is the same as difference between IQ of 120 and 130- However, someone with IQ of 100 is not twice as smart as someone with IQ of 50 ratio: • Has equal intervals between values and has characteristic of true zero• Highest level of measurement and contains greatest amount of information• Can calculate meaningful differences or intervals (addition/subtraction) and ratios (multiplication/division)• From statistical viewpoint interval and ratio variables treated and analyzed the same way -We won't distinguish in the course and will refer to I/R data as continuous

Systematic Random Sampling

• Every kth item is selected • k is determined by dividing the number of items by the desired sample size• Randomly select the first item• Example: - Chart review of STD cases in the ED for meeting CDC guidelines for PE, diagnosis, and treatment (N=4400)- Want to sample 200 cases, so every 22nd STD chart- Select a number randomly between 1 and 22- Then select every 22nd STD chart- If randomly selected 14, then sample would consist of chart numbers, 14, 36, 58, 80, etc.
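
The chart-review example can be reproduced with a short Python sketch. The function name `systematic_sample` is hypothetical, and the starting chart is fixed at 14 here only to match the example; in practice it would be drawn at random.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the 1-based item numbers chosen by systematic random sampling."""
    # k = number of items divided by the desired sample size
    k = population_size // sample_size
    # Randomly select the first item between 1 and k unless one is supplied
    if start is None:
        start = random.Random(seed).randint(1, k)
    # Then take every kth item after the starting point
    return [start + i * k for i in range(sample_size)]


charts = systematic_sample(4400, 200, start=14)
print(charts[:4])   # [14, 36, 58, 80]
print(len(charts))  # 200
```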

disadvantages of stratified random sampling

• Every member of a population being studied must be classified into one, and only one, subgroup or stratum • More administrative effort • May need more complex analysis

Descriptive Statistics

• For presenting, organizing and summarizing data• Think basic counts, percentages, mean, median, mode• Can be used to describe relationships between two or more variables• Examples:-N (%) of students attending class on a given day-Average number of students in MPH program over last 10 yrs-Proportion of people receiving As, Bs, Cs, and Fs in this class-Average blood pressure for different age groups-Table 1

line graphs

• Good for when independent variable is interval/ratio• Not good for nominal/ordinal data with unrelated categories, e.g. race-No need to connect the categories• Good for showing changes over time -stock market-Birth rates-COVID-19 cases over time• Can add multiple lines to compare group trends

advantages of simple random sampling

• Good when population or individual sample sizes are relatively small • Requires little knowledge of population in advance • Works well with homogenous populations

pie chart

• Hard to tell how much larger falls are to animal bites • Could add percentages to each slice but has no natural order • If comparing 2 groups would need 2 pie charts • Other methods better suited to display data

disadvantages of simple random sampling

• If the sampling frame is large, the method may be impracticable • May result in a sample with very few subjects representing minority subgroups of interest (due to overall small percentage in the population), and thus statistical analyses on these groups may be limited

Quick Tips to Identify Type and Level

• If you have distinct categories/counts, your data are discrete• If you have measures, decimals or fractions, your data are continuous• If you have categories, your data are:-Nominal if they have no order-Ordinal if they have order• If you do not have categories, but numbers that are inherently associated with the values (i.e., time, temperature, height), your data are continuous or interval/ratio

Use z-scores to find N at or between certain values

• If you know the N of the population, you can multiply N by the area under the curve to find how many people are in a given range
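
As a rough sketch of this idea, Python's standard library `statistics.NormalDist` can supply the area under the curve in place of a z-table. The population size of 10,000 and the IQ standard deviation of 15 are assumptions chosen for illustration.

```python
from statistics import NormalDist


def count_between(n_population, mu, sigma, low, high):
    """Expected number of people with values between low and high."""
    dist = NormalDist(mu, sigma)
    # Area under the normal curve between the two values
    area = dist.cdf(high) - dist.cdf(low)
    # Multiply N by the area to get the expected count
    return n_population * area


# Assumed: 10,000 people, IQ mean 100 and SD 15, range of mean +/- 1 SD
expected_count = count_between(10_000, 100, 15, 85, 115)
print(round(expected_count))  # 6827, i.e. about 68% of the population
```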

CIs using t-distribution

• In situations when we would use a t-test, we calculate CIs with t, rather than z distribution • Choose percent (X), usually 95% • Compute degrees of freedom: n-1 • Look up the t-value for a 2-tailed test for 5% (100 - X%) and df • Compute the SEM: s/√n • Compute CI: Mean ± t(SEM)
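
The steps above can be sketched in Python as follows. The data are hypothetical, and the t-value is passed in by hand because the standard library has no t-table; 2.262 is the two-tailed 5% value for df = 9.

```python
import math
import statistics


def t_confidence_interval(sample, t_crit):
    """Return (lower, upper) for Mean +/- t(SEM), with t looked up by the caller."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample SD, uses n-1 in the denominator
    sem = s / math.sqrt(n)        # SEM = s / sqrt(n)
    return mean - t_crit * sem, mean + t_crit * sem


# Hypothetical sample of n = 10 values, so df = n - 1 = 9
sample = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6]
lower, upper = t_confidence_interval(sample, t_crit=2.262)
print(round(lower, 2), round(upper, 2))  # 4.42 5.58
```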

Variations on the Two Sample t-test

• In the t-test we just did, the sample sizes of the two groups were equal but they do not have to be• We could have these other scenarios, which complicate the SE formula:1. Sample sizes are unequal2. Variances are unequal-Stata makes us choose whether to do the t-test with equal or unequal variances

X^2 Contingency Test: Assumptions

• Independent random sample• Variables of interest are both categorical (nominal, dichotomous, or ordinal)• Mutually exclusive categories• Expected value in each level of the variables (in each cell) is at least 5

When to Use Nonparametric Tests

• Interval/ratio, ordinal or ranked data• Small sample (n<30) • Your data is more suited to median: think outliers - Example 1, choosing a major: in the mid-1980s at UNC, the average starting salary of geography students was well over $100,000 - Would you have considered a career in geography? - Recall, basketball great Michael Jordan, formerly the world's highest paid athlete, graduated from UNC with a degree in geography - Would you still change your major?

Paired t-test

• Measures differences between two conditions within pairs of observations - Each subject has two data points for the same variable (time 1 vs. time 2) - Each subject is matched to a like subject with the same measured variable (Subj1 vs. Subj2) • Compares continuous outcomes • Assesses the differences within individuals, or paired observations, not between the overall mean values of the two groups - i.e., time 1 - time 2; Subject 1 weight - Subject 2 weight • Tests whether the mean of the differences = 0 - H0: μd=0; H1: μd≠0 • Has differences between the "matched" observations that are assumed to be normally distributed • Individuals can be matched or paired with similar individuals - e.g., RCT matching on age grouping, sex, past Sx hx • Matching reduces biological variability between the two groups • Example: Researchers want to study how much very short children grow when treated with growth hormone (GH) - Age of the child affects the amount of growth - For each child given GH, a similar child matched on sex, age and height receives placebo • Use paired t-test to compare mean differences in growth (cm) between each matched pair (Subj 1 - Subj 2)

quota sampling

• Non-probabilistic stratified sampling • Subjects are non-randomly selected according to some fixed quota • Example: A population has 40% women and 60% men, and a total sample size of 100 is needed-Sampling stops when those percentages are reached • In non-proportional quota sampling, a minimum number of sampled units per category are specified • The idea is not to meet certain proportions, but to include sufficient numbers to enable inference (even small groups in the population)

Wilcoxon Signed-Rank Test: H0 and H1

• Null hypothesis, H0: - The median difference between patients' stress level with NPR and jazz = 0 • Alternative hypothesis, H1: - The median difference between patients' stress level with NPR and jazz ≠ 0

judgmental/purposive sampling (state disadvantages too)

• One or more specific predefined groups are targeted • Example: -People in the mall with clipboards stopping people and asking if they could interview them -Likely collecting a purposive sample for market research -May be seeking white females 30-40 years old, and they size up those passing by and approach likely candidates • Researcher believes some subjects are more fit for the research compared to other individuals (i.e. purposively chosen) • Disadvantages: -High level of subjectivity by the researcher -Limited representation of wider population

Ordinal Level Data (Ordered Categorical)

• One step up from nominal data• As name implies, has inherent order• Data values such as never, sometimes, often, and always have order• Difference in magnitude is not known• Examples: stage of cancer, social class, educational degree, satisfaction, Likert scales, Olympic medals

Tests when IV Nominal/2 groups and DV I/R or Ordinal

• Parametric tests include:- z test- 2-sample t-test- Paired t-test• Non-parametric tests include:- Wilcoxon signed rank- Wilcoxon rank sum

How to Create a Box and Whiskers Plot

• Requires 5 key numbers: - Minimum, Q1, median, Q3, Maximum • Order data from low to high • Identify lowest, highest, and median values • Identify Q1 and Q3 and compute IQR = Q3-Q1 • Prepare a number line going slightly beyond lowest and highest data points • Mark the median, Q1 and Q3 on number line; these are box boundaries• Draw the box • Identify outliers • Draw whiskers to lowest and highest non-outliers
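
A minimal Python sketch of these steps is shown below. Two assumptions are not from the card: quartiles are computed with the "inclusive" method of `statistics.quantiles` (software packages differ in how they interpolate), and outliers are flagged with the common 1.5 x IQR rule. The data are invented.

```python
import statistics


def box_plot_numbers(data):
    """Return the numbers needed to draw a box and whiskers plot."""
    values = sorted(data)  # order data from low to high
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed outlier rule: beyond 1.5 * IQR from the box edges
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "min": values[0], "q1": q1, "median": median, "q3": q3, "max": values[-1],
        "iqr": iqr, "outliers": outliers,
        "whiskers": (inside[0], inside[-1]),  # lowest and highest non-outliers
    }


summary = box_plot_numbers([2, 4, 5, 6, 7, 8, 9, 10, 30])
print(summary["q1"], summary["median"], summary["q3"])  # 5.0 7.0 9.0
print(summary["outliers"], summary["whiskers"])         # [30] (2, 10)
```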

Results and Conclusions for Wilcoxon Signed-Rank

• Results- Computed statistic of 1 is less than our critical value of 3 (so p<0.05)- Reject the null hypothesis- The median pain score after acupuncture is less than before treatment and is statistically significant (11.5 before vs 5.5 after, median difference= -5.5) • Conclusion - We found evidence that acupuncture reduces pain scores in patients with back pain

properties of frequency polygons

• Same two axes as a histogram• Used with interval/ratio data • Usually begin and end with the line touching the x axis • Histograms and frequency polygons convey the same information so personal preference as to what to use • FPs may have better aesthetic value than histogram when >2 groups

Reasons for Sampling

• Samples can be studied more quickly than populations• Study of a sample is less expensive than studying the entire population• Study of an entire population is not possible in most situations• Sample results are often more accurate than results based on a population• Samples can be selected to reduce heterogeneity

histogram

• Similar to a bar chart except it depicts frequency distribution for interval/ratio level data • Horizontal axis displays true limits of each interval -Intervals are of equal size • Vertical axis depicts either the frequency (#) or the relative frequency (%) of observations in each interval

Sign Test

• Similar to paired t-test, but differences of the means do not need to be normally distributed - Pairs are independent; one pair doesn't influence the other - More powerful than t-test with non-normal data • Dependent variable is interval/ratio - Could also be ordinal • Interested in whether there were increases or decreases only - Does not consider the magnitude of changes Conclude: Acupuncture (or variable whatever) did not significantly change back pain (other variable)
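
One common way to carry out the sign test is an exact binomial calculation on the counts of increases and decreases. The sketch below assumes that approach and drops pairs with no change; the function name `sign_test` and the scores are hypothetical.

```python
from math import comb


def sign_test(before, after):
    """Return (increases, decreases, two-sided exact p-value)."""
    # Only the direction of each change matters; pairs with no change are dropped
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    increases = sum(d > 0 for d in diffs)
    decreases = n - increases
    k = min(increases, decreases)
    # Under H0, increases and decreases are equally likely (probability 0.5 each)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return increases, decreases, min(1.0, 2 * tail)


# Hypothetical back pain scores before and after acupuncture
before = [12, 10, 11, 14, 9, 13, 8, 12, 10]
after = [6, 7, 11, 8, 10, 6, 5, 9, 8]
increases, decreases, p_value = sign_test(before, after)
print(increases, decreases, round(p_value, 3))  # 1 7 0.07
```

With these made-up numbers p is above 0.05, so the change would not be called statistically significant even though 7 of 8 patients improved; the test ignores how large the improvements were.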

What is a t-distribution?

• Similar to the standard normal curve - Mean = 0 - Symmetric - Total area = 1 • Different from standard normal curve - More conservative test, developed especially for smaller samples - Extreme values are more common - Different curve for each possible value of 'degrees of freedom' (abbreviated df) - With smaller sample sizes (and smaller df), t distribution has wider tails - With larger samples and as df -> ∞, normal and t distributions are the same

tables

• Simplest means of summarizing a set of observations • Can be used for all types of data • Nominal & ordinal data -Consists of classes/categories with the corresponding numerical count -Relative frequency or proportion/percentage can be displayed in table • Interval/ratio level data -Must create categories, or a series of non-overlapping intervals -If too many intervals, no advantage over raw data -If too few, great deal of information lost

Statistics allows us to describe

• Statistics allows us to describe the "average person" and see if that description fits or doesn't fit other people

What is Statistics?

• Statistics explores the collection, organization, analysis and interpretation of numerical data • Its concepts can be applied to a number of fields including-Business-Psychology-Agriculture • When the focus is on biological and health sciences, we use the term Biostatistics

Presentation of Data (when to use which figure)

• Tables -all types of data • Bar Charts -nominal and ordinal variables • Histograms -interval/ratio variables • Frequency polygon -interval/ratio variables • Dot plots -interval/ratio variables • Scatter plots -interval/ratio variables • Line Charts -often used for trends over time Note: when presenting data for a single variable, a univariate table or plot is used; when comparing data for two variables, a bivariate table/plot is used

standard deviation

• Technically, the units for the variance of the cheese steak data are (cheese steaks)² • That is hard to interpret • We often use the Standard Deviation instead, which is the square root of the variance • Taking the square root of the variance expresses the average deviation in the original units • s = 28.6 cheese steaks/family • Notation: "s" refers to standard deviation (√s² = s), also reported as SD • Calculate by taking the square root of the variance
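
The relationship between variance and standard deviation can be checked with a few lines of Python. The data below are invented and are not the cheese steak data.

```python
import math
import statistics

# Hypothetical counts per family (not the course data)
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.variance(data)  # sample variance s^2, in squared units
sd = math.sqrt(variance)              # s, back in the original units

print(round(variance, 2), round(sd, 2))  # 4.57 2.14
```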

Goodness of Fit: What Is It?

• The GOF is a one sample test• Compares sample data to a known distribution-Analogous to a one sample z-test -Z-test compares to a known mean-Goodness of fit χ² compares to a known pattern of percentages• It is different from two-sample tests that compare 2 or more groups

confidence intervals

• The population has parameters such as means-Usually, we don't and can't know the population mean (or other parameter)• From our sample or study population, we want to estimate the mean (or other population parameter)-Point estimate A single number, in this case the sample mean• We also want to know how good or how precise our estimate is-Confidence interval Range of values that includes our population value, with a level of certainty

The quantity z represents

• The quantity z represents the distance between the specific score and the population mean in units of the standard deviation • z is negative when the specific score is below the mean, positive when above • From the z-score, we can calculate the area under the curve or the proportion of values between any two values

advantages of systematic random sampling

• The sample is easy to select• Adds a level of structure to random sampling• Assures population will be evenly sampled (avoids possible clusters w/random sampling)• Works well with homogenous populations

disadvantages of systematic random sampling

• The sample may be biased if there is a hidden pattern in the population that coincides with that of the selection - Every 5th widget produced is broken because of machine failure - Clinic with provider seeing more acute patients on given day

Which proportion am I looking for?

• Think of the study design & the question being asked - What are we grouping on? That is the proportion of interest • Cohort Design - Group of alcoholics & non-alcoholics followed for 10 years for development of liver cancer - Grouping on alcohol status - The proportion of interest is liver cancer in alcoholics and non-alcoholic drinkers(denominators are alcohol and non-alcohol groups) • Case Control Design - Group of patients with pancreatic cancer and those without cancer are assembled and asked by questionnaire about their coffee drinking - Grouping on disease status - The proportion of interest is whether participants drink coffee or not in each Cancer group(denominators are Cancer+ vs Cancer-) • Associative Design -Proportion of interest can be either variable -Examine what is classified first and measured second -Excessive alcohol use associated with low GPA (alcohol is grouping) -Low GPA associated with excessive alcohol use (low GPA is grouping) • Clinical Trial -Patients with breast cancer randomized to treatment A or B -Grouping on TX/exposure -The proportion of interest is how many survived in each treatment group at the end of 10 years

Converting to the Standard Normal Distribution: how to convert?

• To convert our IQ data so the mean = 0, we shift the whole curve to the left by 100 units (the mean), so it is now centered at 0 - The equivalent of subtracting the mean IQ (100) from each value in the distribution also see slides 17-20 on slidedeck 2c

Finding the Value for a Specific Percentile

• To find the Score above the XX% tile... • Identify the area under the curve (i.e, the proportion of the population) that falls above the percentile • Use this to compute the area between the mean and the percentile of interest • Look in the body of z-table to find the z-score associated with the area between the mean and percentile of interest • Use the z-score value to solve for X in the z-formula
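
The same lookup can be sketched in Python, with `NormalDist().inv_cdf` standing in for the z-table. It works from the area below the percentile rather than the area between the mean and the percentile, which gives the same z-score. The IQ standard deviation of 15 is an assumption for illustration.

```python
from statistics import NormalDist


def score_at_percentile(percentile, mu, sigma):
    """Return the value X that sits at the given percentile of a normal distribution."""
    # z-score with the given proportion of the population below it
    z = NormalDist().inv_cdf(percentile / 100)
    # Solve the z-formula z = (X - mu) / sigma for X
    return mu + z * sigma


# Assumed: IQ with mean 100 and SD 15
x90 = score_at_percentile(90, 100, 15)
print(round(x90, 1))  # 119.2
```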

cluster sampling

• Two stage process in which population divided into clusters -Stage 1: Subset of clusters randomly selected -Stage 2: Within the selected clusters, include all study units or randomly select from within the cluster • Example: Household survey taken in a city -City blocks are the clusters -Randomly sample the city blocks -Randomly sample households within the selected city blocks and administer surveys • Somewhat less efficient than other sampling methods as requires a larger sample size

Characteristics of the Mean

• Typically the optimum choice for numeric, symmetrically distributed data - Uses every individual value • Can use it to get a total value - 100 people have a mean of 3.2 health care visits for back pain per year. We can multiply to get the total number of visits, i.e., 320 • Extreme values have large effect- E.g., one billionaire can have large effect on mean income • Generally limited to interval/ratio level data - Sometimes useful for ordinal level data

Goodness of Fit Chi-Square Test: When Might We Use It?

• Used when interested in comparing percentages in each category to an expected pattern of percentages• To see if categories are equally common - Subjects asked to choose their favorite fruit of apple, orange, peach, and pear - Are results consistent with 25% for each fruit or is there an imbalance? • Commonly used in genetic experiments - Does percentage of red, pink, and white flowers match autosomal recessive pattern of inheritance (25%, 50%, 25%)? • When researchers want to fail to reject the H0 (not find a significant difference) - Merck wants to show that real-world results of a new medicine are what they claimed they would be
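
For the favorite fruit example, the statistic can be sketched in Python using the usual formula χ² = Σ (observed - expected)² / expected. The counts are hypothetical, and 7.815 is the tabled critical value for df = 3 at α = 0.05.

```python
def chi_square_gof(observed, expected_proportions):
    """Return (chi-square statistic, degrees of freedom) for a goodness of fit test."""
    total = sum(observed)
    # Expected count in each category under the hypothesized pattern
    expected = [p * total for p in expected_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return statistic, df


# Hypothetical choices of apple, orange, peach, pear among 100 subjects
observed = [30, 20, 28, 22]
statistic, df = chi_square_gof(observed, [0.25, 0.25, 0.25, 0.25])
print(round(statistic, 2), df)  # 2.72 3
```

Because 2.72 is below 7.815, these made-up counts would be consistent with 25% for each fruit (fail to reject H0).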

sampling- what we can determine and basis of z test

• We can determine where a mean from a sample falls in a standard normal sampling distribution• Basis of Z-test: How far is the sample mean from the true population mean (in SEM units)? Z = (x̄ - μ) / (σ / √N)
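
A short Python sketch of the Z-test formula follows. The sample mean of 105, n of 36, and σ of 15 are hypothetical values, and the two-sided p-value comes from the standard library's `NormalDist` rather than a printed table.

```python
import math
from statistics import NormalDist


def z_test(sample_mean, mu, sigma, n):
    """Return (z, two-sided p-value) for a one sample z-test."""
    sem = sigma / math.sqrt(n)     # standard error of the mean
    z = (sample_mean - mu) / sem   # distance from mu in SEM units
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Hypothetical: sample of 36 with mean IQ 105, population mu = 100, sigma = 15
z, p = z_test(105, 100, 15, 36)
print(round(z, 2), round(p, 3))  # 2.0 0.046
```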

population- what we can determine and basis of z score

• We can determine where an individual value falls in a standard normal distribution• Basis of Z-score: How far is the individual from the population mean (in SD units)? Z = (x - μ) / σ
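
The z-score for a single individual can be sketched the same way. The IQ of 130 and the SD of 15 are assumed values used only for illustration.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Distance of one value from the population mean, in SD units."""
    return (x - mu) / sigma


# Assumed: an individual IQ of 130 in a population with mean 100 and SD 15
z_individual = z_score(130, 100, 15)
proportion_below = NormalDist().cdf(z_individual)  # area under the curve below z
print(z_individual, round(proportion_below, 3))  # 2.0 0.977
```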

Inferential Statistics

• We use statistical tests to compare groups• When we use smaller groups (samples) to make inferences about larger groups (population) • Involves generalizing; sampling frame is important• Examples: -RCT of 100 people, tests which intervention works better-Bringing medicines to market-Comparing concussion outcomes for two types of lacrosse helmets-Measuring contraceptive adherence among adolescent girls

histogram characteristics

• Wider the interval, more information lost• Too small an interval, run the risk of cluttered axis• Natural ordering is important with I/R and ordinal data, e.g. cannot reorder categories on x-axis• No spacing between adjacent bars to convey data are continuous

Standard Normal Distribution

•All use the same scale •Set the mean to 0 •Standard deviation = 1 (the variance = s² = (1)² = 1) •This is also called the Z-distribution

The Normal Distribution

•Also called a Gaussian or bell curve •Single peak •Symmetrical (not skewed) •Mean=median=mode •Tails extend infinitely •Theoretical model, but some empirical distributions close enough to assume normality

Probability

•Basis of statistical inference •Provides precise measurement of the likelihood that an event will occur •Always between 0 and 1 (or 0 and 100%) •Can be based on what we know-1/2 or 0.5 or 50% probability that a fair coin will land on heads if tossed-1/52 probability that the 2 of spades will be picked from a deck of cards (0.02 or 2% probability) •In research, often estimated from data

Z-test summary

•Compares a sample mean to a known population mean•Continuous dependent variable•Requires population distribution to be normal or sample size is large so Central Limit Theorem applies•Requires random sample

2 sample independent ttest

•Compares the means of two independent groups•Continuous variable (DV) must be normally distributed•Random samples

How large is large? It depends.

•If the population is close to normal, then large can be as small as 2•If markedly different than normal, then 10 to 20 may be large enough•To keep it safe, sample size of 30 usually large enough

When do we use the Z-test?

•Level of measurement is interval-ratio (comparing means)•Comparing one sample to a population with a known μ and σ -Z-test is a 'one sample test'•The sample is random•Sampling distribution is normal -Because the variable is normally distributed OR-Because the sample size is sufficiently large, so the Central Limit Theorem applies

paired ttest

•Measures on same subjects at 2 different times or conditions, or uses matched subjects-NOT independent•Continuous variable •Differences must be normally distributed•There should not be any outliers in the differences•Random sample

convenience sampling

•Members of the sample are selected based on their relative ease of access •Convenience samples are biased because researchers may unconsciously approach some types of respondents and avoid others•Respondents who volunteer for a study may differ in unknown but important ways from those who do not volunteer•Example: sample only students in this class, people living on Broad Street, friends, co-workers, or shoppers at the KOP mall

A Few Easy Rules for level of measurement

•Nominal and ordinal data are categorical and always discrete•Ordinal data has more than 2 categories-If low and high are the only options, the data are nominal-If categories are low, medium, high, the data can be considered ordinal•Continuous data are always interval/ratio•Interval/ratio data can be either continuous or discrete-We will not distinguish I/R data after today!

nonprobability sampling

•Not based on probability •Probability that subject is selected is unknown and may reflect selection bias of the investigator •Does not fulfill the requirements of randomness needed to estimate sampling errors •Types -Convenience sampling -Purposive or judgmental sampling -Quota sampling

Random Assignment

•Not the same as random sampling -Random sampling is used when a sample of subjects is selected from a population of possible subjects -This technique is used in observational studies •Experimental studies (e.g. RCT) use random assignment -Subjects first selected for inclusion if they meet study criteria -If they meet study inclusion criteria, they are assigned to different treatment modalities -If done by random methods, referred to as random assignment

stratified random sampling

•Population divided into relevant strata or subgroups •Randomly select from each stratum • Number selected from each strata is dependent on proportion of strata size to population -Example, if men make up 40% of population, 40% of selected subjects should be from male stratum •Guarantees the sample will be representative on the selected trait
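
Proportional allocation can be sketched in Python as below. The function name `stratified_sample`, the population of 1,000, and the fixed seed are all assumptions for illustration; with awkward proportions the rounded stratum sizes may not add up exactly to the requested sample size.

```python
import random


def stratified_sample(strata, sample_size, seed=0):
    """Randomly select from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Number selected depends on the stratum's proportion of the population
        n_stratum = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, n_stratum)
    return sample


# Hypothetical population of 1,000 IDs: 40% men, 60% women
strata = {"men": list(range(400)), "women": list(range(400, 1000))}
chosen = stratified_sample(strata, sample_size=100)
print(len(chosen["men"]), len(chosen["women"]))  # 40 60
```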

Measures of Dispersion examples

•Range •Percentiles, e.g. tertiles, quartiles, quintiles •Interquartile range •Variance •Standard deviation

Nominal Level Data (Categorical)

•Refers to data that can only be put into groups•Values are categories•No category is better than another and differences between categories cannot be determined•Subset of nominal data is dichotomous data (two levels)•Examples: male/female, eye color, political party, nationality, region, health insurance, race

Describe the importance of clinical and statistical significance

•Statistical significance is when the p-value resulting from a statistical test is less than the pre-defined alpha value (e.g., p<0.05) •Clinical significance is when group differences are clinically meaningful (i.e., large enough to be clinically important) •Something can be statistically significant and not clinically significant and vice-versa

Hypothesis Testing in General

•Used to decide if data support a real difference/effect or not•Can think of hypothesis testing as comparing signal to noise ratio-Difference or the effect = signal-Dispersion or the variability = noise•If signal is large compared to noise, we conclude there is an effect•If signal is small compared to noise, we conclude there is no effect

How do you calculate a Population Parameter?

•You don't, unless you have access to the entire population (e.g. total number of heart transplants in 1967) •For the most part never really have access to entire population •So, we calculate a sample statistic (e.g. mean or standard deviation calculated from the sample) to estimate the population parameter

