Stats
• A group of 100 New Yorkers with lung cancer were identified based on a screening questionnaire at a local hospital. These patients were compared to another group that reported no lung cancer. Both groups were questioned about smoking within the past 10 years. The prevalence of smoking was 25% among lung cancer patients and 5% among non-lung cancer patients. Likely questions: • Type of study? • What can be determined? (odds ratio)
(Case-control. You recognize this by how the patients were found: by the presence or absence of lung cancer. Patients with lung cancer are the cases and patients without it are the controls.) (Odds ratio, because that is what you get from case-control studies.)
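As a rough illustration of the odds ratio in this vignette, the sketch below turns the two smoking prevalences (25% in cases, 5% in controls) into odds and divides them. The helper name `odds` is made up for this example.

```python
def odds(p):
    """Convert a proportion (0 < p < 1) into odds: p / (1 - p)."""
    return p / (1 - p)

# Exposure (smoking) prevalence taken from the vignette
p_smoke_cases = 0.25      # lung cancer patients (cases)
p_smoke_controls = 0.05   # patients without lung cancer (controls)

# Odds ratio = odds of exposure in cases / odds of exposure in controls
odds_ratio = odds(p_smoke_cases) / odds(p_smoke_controls)
print(f"Odds ratio = {odds_ratio:.2f}")
```

The result is well above 1, meaning the odds of having smoked are higher among the cases than among the controls.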
• A group of 100 New Yorkers who smoke were identified based on a screening questionnaire at a local hospital. These patients were compared to another group that reported no smoking. Both groups received follow-up surveys asking about development of lung cancer annually for the next 3 years. The prevalence of lung cancer was 25% among smokers and 5% among non-smokers. • Likely questions: • Type of study? • What can be determined?
(Prospective cohort. You know this by how they identified the patients: people who smoke and people who don't. Seeing New Yorkers here may confuse you into thinking this is a cross-sectional study of New Yorkers, but read more carefully: they were looking for smokers and non-smokers and followed them over 3 years, so it can't be a cross-sectional study. Because the patients were identified by a risk factor, you know this is a cohort study, and because they were followed forward in time over 3 years, you know it is a prospective cohort study.) (Relative risk, also called risk ratio.)
• A group of 100 New Yorkers who smoke were identified based on a screening questionnaire at a local hospital. These patients were compared to another group that reported no smoking. Hospital records were analyzed going back 5 years for all patients. The prevalence of lung cancer was 25% among smokers and 5% among non-smokers. • Type of study? • What can be determined?
(Retrospective cohort. Patients were identified by smoking status, that is, by exposure or risk factor, but it is retrospective because the researchers looked back in time.) (Relative risk, or risk ratio, for this type of study.)
• A researcher studies plasma levels of sodium in patients with SIADH and normal patients. The mean value in SIADH patients is 128mg/dl with a standard deviation of 2. The mean value in normal patients is 136mg/dl with a standard deviation of 3. Is this difference significant? • Common questions: • Which test to compare the means? • What p-value indicates significance?
(t-test, because the concentration of Na is a quantitative variable: it is a number that can go up and down in the two groups.) (p < 0.05. Remember that a p-value below 0.05 means that, if there were truly no difference in mean value between the 2 groups, a difference in means this large would show up by chance alone less than 5% of the time.)
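To make the t-test idea concrete, here is a minimal sketch that computes a two-sample t statistic (Welch's form) from the means and SDs in the stem. The stem gives no sample sizes, so `n_per_group = 10` is an assumption made only for illustration, and `welch_t` is a hypothetical helper name.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for two independent groups with unequal variances."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

n_per_group = 10  # ASSUMPTION: the question stem does not give sample sizes

# SIADH group: mean 128, SD 2; normal group: mean 136, SD 3
t = welch_t(128, 2, n_per_group, 136, 3, n_per_group)
print(f"t statistic = {t:.2f}")
```

A statistics package would convert this t statistic into a p-value. The further t is from zero, the smaller the p-value, which reflects the same three ingredients discussed in these notes: distance between the means, scatter, and number of data points.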
What can you calculate from the 2x2 table?
Can calculate many things: • Risk of disease • Risk ratio • Odds ratio • Attributable risk • Number needed to harm THE ABILITY TO CALCULATE ALL OF THESE THINGS IS VERY HIGH YIELD FOR BOARDS.
• Using a national US database, rates of lung cancer were determined among New Yorkers, Texans, and Californians. Lung cancer prevalence was 25% in New York, 30% in Texas, and 20% in California. The researchers concluded that living in Texas is associated with higher rates of lung cancer.
Cross-sectional study • The presence of different groups could make you think of other study types (don't get confused) • However, note the lack of a time frame • The study is just a fancy description of disease prevalence at a snapshot in time in 3 different states. Because the rate is highest in Texas, you can say that living in Texas is associated with higher rates of lung cancer, but you can't say anything about the risk of living in Texas or the odds ratio of living in Texas for lung cancer. That would require studying patients over time, as we will see in other types of epidemiology studies.
How do you determine skew?
Data sets aren't always evenly distributed. Sometimes they are skewed, and the skew can be positive or negative. You determine the skew by looking at the tail: if the tail points toward the Y axis and the negative numbers, the skew is negative; if it points the opposite way, the skew is positive. To figure out the mean, median, and mode in a negatively skewed plot, remember that the mode is the highest point and is always the easiest thing to find, the mean is the furthest away from the mode, toward the tail, and the median is in between. The same holds, mirrored, for a positively skewed plot.
What can't you confuse the confidence interval with?
Don't confuse with standard deviation • Mean +/- 2 SD • 95% of samples fall in this range, and this is very different from the... • Mean +/- CI • 95% chance that a repeated measurement of the mean falls in this range • If you see 95% in a question stem • Read carefully: What are they asking for? • Range of 95% of samples? • 95% confidence interval of mean? (SHY!)
What is the CI in group comparisons?
Group Comparisons • Many studies report differences between groups • Can average differences and calculate CIs • If the CI includes zero, there is no statistically significant difference • Example: • Mean difference between two groups is 1.0 +/- 3.0 • Includes zero • No significant difference between groups • Similar to p>0.05 • (Formal statement: Null hypothesis not rejected) So anytime you see a CI around a mean difference, look at whether or not it includes 0.
What happens if you put a statistical distribution on an x and y axis?
If we plotted blood glucose level on the x axis and the number of subjects with each glucose level on the y axis, we would get a Gaussian curve: the values in the middle are the most common, with fewer people at either extreme. We call this a normal or Gaussian distribution, and a lot of things distribute like this: height, weight, how long you stay in the hospital.
• A researcher studies plasma levels of sodium in patients with SIADH and normal patients. The mean value in SIADH patients is 128mg/dl with a standard deviation of 2. The mean value in normal patients is 136mg/dl with a standard deviation of 3. Is this difference significant? • If the p value is high (non-significant) why might that be the case?
Need more patients (the more patients you have, the more confident you can be that the difference between the two means is real and not due to chance) • Increasing the sample size increases the power to detect differences. The bottom line is that with more samples you can be more confident that the difference in means you are seeing is real. So anytime you get a non-significant p value, you often need to go back and collect more data to determine whether or not you actually have a true difference.
What are the types of data?
Quantitative variables: • 1,2,3,4 (sometimes they can be negative, and they go in order) • Categorical variables: • High, medium, low • Positive, negative • Yes, No • Quantitative variables often reported as number • Mean age was 62 years old • Categorical variables often reported as percentages • 40% of patients take drug A (you are either taking drug A or not taking drug A; this is not a quantitative variable, but they are giving you a number to help you understand the percentage) • 20% of patients are heavy exercisers (you are either a heavy exerciser or you are not; remember that this is not a quantitative variable. This is a point of CONFUSION for students because they see percentages and think of a number that goes from 0 to 100, but it is not a quantitative variable. You have to pause and think about it: taking drug A is a yes or no action and being a heavy exerciser is a yes or no function.) I point all of this out because we use different statistical tests for quantitative versus categorical variables.
What happens if the SD is +1 and -1?
One of the things you definitely need to know for Step 1 is how much of the data you capture if you go 1 or more standard deviations from the mean. Look at this data set, which has the mean labeled: if we go +1 standard deviation and -1 standard deviation, we capture 68% of the data points within that range.
What happens if the SD is -2 and +2?
Then we will capture 95% of the points.
How do different data points affect the +/-1, 2, 3 SD?
These curves have the same median, mean and mode, but the data in the one on the right are much more dispersed than in the one on the left. The rules still hold for the amount of data points captured within each number of SDs.
What should you think when you see the number 95% in a question stem?
Don't get confused! • This value is often confusing • Read carefully: What are they asking for? • Range in which 95% of measurements in a data set fall • Mean +/- 2 SD • But if they say 95% confidence interval of the mean • Mean +/- 1.96*SEM MAKE SURE YOU HAVE THIS STRAIGHT IN YOUR MIND
What are the four possible outcomes of our study?
These are generally written in a table. Reality goes at the top: this is the truth. H1, the alternative hypothesis, means that in reality the two means are different, and H0 means that in reality there is no difference between the means. Experimental findings are written on the side: H1 means we find a difference between the means and H0 means we do not find a difference between the means. If there is a difference between the means in reality and our experiment finds it, that is called the power (it is easier to think of power as a number: a power of 80% means you have an 80% ability to detect a difference that truly exists). Statisticians like to think in terms of the null hypothesis, which says there is no difference; thought of that way, the power is the likelihood that you will appropriately reject the null hypothesis because it is wrong in reality. In the top right we have α, which means we detect a difference but it is not real. In the bottom left we have β error, which is the chance of missing a difference that is really there. 1-power=β (probability of a type II error)
What are case control studies vs cohort?
In a case-control study, remember that patients are identified by disease (the patients with disease are the cases and the patients without disease are the controls). In a cohort study, you identify patients by a risk factor like smoking. Case-control studies give results as an odds ratio and cohort studies use a relative risk. If you have trouble remembering that, look at the second-to-last letter of each study type: contrOl gives Odds ratio and cohoRt gives Relative risk.
How does the number of data points influence the likelihood?
The likelihood that a difference is real increases as more data points fall within a certain range.
What is scatter hypothesis testing?
Let's imagine we plotted MIzyme level on the x axis, and on the y axis we put the number of samples measured at each level. All of the normal patients are represented by black dots, and most of the normal patients have levels down near 1, but all of the MI patients have levels that are much higher, represented by the red dots. It's very likely that the difference between these two groups is real because the two clusters of data are very far apart from one another. Given how far apart the groups are and how tightly the data points are clustered around the means, it's pretty unlikely that MI patients truly have the same MIzyme level as normal patients and we just got lucky pulling all of these high samples from the MI patients.
What is the correlation coefficient of a graph that has pack years smoking (x) and years lived (y)?
Let's imagine you have a data set with a number of patients and you know the number of pack years (packs smoked per day x the number of years a person has smoked) for each patient. Suppose you plotted pack years on the x axis and lifespan on the y axis. We can imagine that the more years you smoke, the fewer years you live, so the plot would have a downward trajectory: people who smoke for a very long time live less than those who have smoked very little. If the only thing that determined lifespan was smoking and nothing else mattered, it would be a perfectly straight line. In reality we understand that many things determine lifespan, so the real data would have a lot of scatter, but there would be a generally downward trend: the longer you smoke, the less time you live. Mathematicians create a best fit line through scattered data like this. This line minimizes the distance between the line and each of the points; if we moved it up or down we would have more total distance from the points. We can characterize the best fit line with a correlation coefficient, also called the Pearson coefficient, which tells us how strongly smoking is related to lifespan.
What happens when the scatter points are far apart?
Let's suppose we have group 1 down here and group 2 up here. The patients in group 1 have results shown by the black dots and group 2 by the red dots. You can see from this image that a real difference between the means of these two groups is pretty likely: the scatter of the data around group 1 is very tight, it is very tight around group 2 as well, and the means are very far apart. It's very unlikely that these two means are the same and we just happened to pick out a bunch of samples for group 2 that are way different. So when there is very little scatter of data within groups and the groups are far apart, it's more likely that the difference is real and not due to chance.
What happens when the data points are widely scattered?
The likelihood that the difference between the two groups is due to chance is very high, because of all of the scattering of the data.
What is the variance?
Most scientific literature reports SD, but once in a while you will see a study that reports variance. Remember that the SD, sigma, is equal to the square root of all that stuff on the inside, while variance is sigma squared: you take all of the stuff on the inside and remove the square root sign. Know that the rules for SD (the 68 and 95 percent) don't apply to variance.
Explain what it means if the coefficient is -0.5, +0.5 and 0
Here are some graphs showing the direction of the relationship, which relates to whether the correlation coefficient is positive or negative. The chart on the left shows r = -0.5. It's negative, so it is an inverse relationship: the more of the x axis variable, the less of the y axis variable. Since it's 0.5, it's a relatively weak relationship and there's a lot of scatter in the data. In the middle we have r = +0.5, so the data are similarly scattered but trend in a positive direction. On the right we have r = 0, which is no relationship: you can see that these data don't go up or down but stay even, because there is no correlation between the x axis and the y axis.
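The sign and strength of the correlation coefficient can be sketched with a hand-written Pearson calculation. All the numbers below are invented for illustration (they are not real patient data), and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

# HYPOTHETICAL data: pack years (x) and years lived (y)
pack_years = [0, 10, 20, 30, 40]
perfect_line = [85, 80, 75, 70, 65]   # smoking alone determines lifespan
scattered = [84, 70, 82, 66, 72]      # many other factors add scatter

r_perfect = pearson_r(pack_years, perfect_line)
r_scattered = pearson_r(pack_years, scattered)
print(f"perfect line: r = {r_perfect:.2f}")
print(f"scattered:    r = {r_scattered:.2f}")
```

The perfectly straight downward line gives r = -1, while the scattered data still give a negative r, but one closer to 0.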
What happens if scattered data is really scattered?
Now if the data are more scattered and overlap a lot, it makes you wonder whether, with this much scatter and overlap, the difference between the two groups is due to chance, and whether collecting more samples would show that the means really aren't different. So the key point I'm trying to show you here is that the scattering of data points influences the likelihood that there's a true difference between means.
What are the 3 tests of significance?
• Three key tests • t-test • ANOVA • Chi-square • Determine likelihood that the difference between means is due to chance • Likelihood of difference due to chance based on • Scatter of data points • How far apart the means are from each other • Number of data points
What is another factor that influences scattered data?
The number of samples. You can see an overlap of data here, but you have a lot of samples, and you can see the high density of normal patients and the high density of MI patients. These two means are more likely to be truly different because we have a lot of clustered samples that are far apart from one another. So the key point is that the number of data points also influences the likelihood that there is a true difference between means, and we can be more confident when we use a lot of samples.
What is standard deviation?
Represented by sigma (σ). You are very unlikely to be asked to calculate standard deviation on boards, but you need to know how to interpret it. In the formula, x − x̄ (x with a line above it means the mean) is the difference between each data point and the mean. You square each of these differences, add them up (so you are summing how far each data point is from the mean, squared), divide by the number of samples minus 1, and then take the square root of that entire thing to get the SD: σ = √( Σ(x − x̄)² / (n − 1) )
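The calculation can be walked through step by step in a short sketch. The five data points are made up for illustration; the final check against Python's built-in `statistics.stdev` is only there to confirm the hand calculation.

```python
import math
import statistics

data = [4, 8, 6, 5, 7]  # HYPOTHETICAL measurements

mean = sum(data) / len(data)                      # x-bar
squared_diffs = [(x - mean) ** 2 for x in data]   # (x - x-bar)^2 for each point
variance = sum(squared_diffs) / (len(data) - 1)   # divide by n - 1
sd = math.sqrt(variance)                          # SD = square root of variance

print(f"mean = {mean}, variance = {variance}, SD = {sd:.2f}")
print(f"matches statistics.stdev: {math.isclose(sd, statistics.stdev(data))}")
```

Removing the final square root step leaves the variance, which is why variance equals the SD squared.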
What is the cohort study?
If you wanted to know whether smoking leads to lung cancer, the obvious way would be to find a group of smokers and a group of non-smokers, follow them over time, and see who gets more cancer. This is an obvious and very powerful way to determine who has more disease. • Compares group with exposure to group without. It's very important that you remember this is how patients are identified: they are identified by exposure, which is the key distinction between a cohort study and a case-control study, where we identify patients by disease. Once we find our group of patients who are smokers and non-smokers, that is our cohort, and we look... • Did exposure change likelihood of disease? • Prospective (they identify patients with and without exposure and monitor them going forward in time) • Monitor groups over time • Retrospective • Look back in time at groups (to see whether they had the disease: once you identify the smokers, you look back through their charts to see, for example, whether they had pneumonia in the last 5 years). Prospective studies are generally more powerful than retrospective studies, because in a retrospective study you are relying on the clinical evidence that was written down. For example, if you were trying to determine retrospectively whether smokers get more pneumonia than non-smokers, it's possible that a smoker had pneumonia and no one ever recorded it in the chart, so your retrospective study will miss it. If you do it prospectively, you can carefully watch patients over time to see whether they get pneumonia, and it's less likely that you will miss things.
What is the 2 x 2 table?
In order to understand risk quantification we have to understand the 2 x 2 table. The results from a case-control or cohort study are typically put in a table like this: on the top we have disease, with patients who have the disease on the left (+) and patients without the disease on the right (-). On the side we write exposure (to smoking, for example), which is (+) in the top row and (-) in the bottom row. So let's suppose we studied 100 patients and 25 of them were exposed and had the disease: that goes in the top left. Say another 30 patients did not have the disease: we would write them in the right-hand column. This is how you make a 2 x 2 table, usually from a cohort or a case-control study.
What is a cross-sectional study?
It's a cross section of a population • Patients studied based on being part of a group • New Yorkers • Women • Tall people • Frequency of disease and risk factors identified • How many have lung cancer? • How many smoke? • Snapshot in time • Patients not followed for months/years (not for 10 years); we are just looking at them right now and seeing how many of them smoke or have cancer, for example.
• A group of 1000 college students is evaluated over ten years. Two hundred are smokers and 800 are non-smokers. Over the 10 year study period, 50 smokers get lung cancer compared with 10 non-smokers.
Let's start by focusing on the number 200: that is the total for the top (exposed) row, so you can write that number to the side. They also tell you 800 patients are non-smokers, which is the sum of the bottom row. 50 smokers get lung cancer, so that's the top left. The 10 non-smokers who get the disease go in the bottom left.
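Filling in the rest of this table (150 smokers and 790 non-smokers without cancer, by subtraction) allows the quantities listed earlier for the 2 x 2 table to be worked out. A minimal sketch, using the conventional cell labels a, b, c, d, which are not named in the notes themselves:

```python
# 2 x 2 table from the college student vignette
#                 cancer   no cancer
a, b = 50, 150   # smokers (200 total)
c, d = 10, 790   # non-smokers (800 total)

risk_exposed = a / (a + b)          # risk of disease in smokers
risk_unexposed = c / (c + d)        # risk of disease in non-smokers
risk_ratio = risk_exposed / risk_unexposed
odds_ratio = (a * d) / (b * c)
attributable_risk = risk_exposed - risk_unexposed
number_needed_to_harm = 1 / attributable_risk

print(f"Risk in smokers:       {risk_exposed:.4f}")
print(f"Risk in non-smokers:   {risk_unexposed:.4f}")
print(f"Risk ratio:            {risk_ratio:.1f}")
print(f"Odds ratio:            {odds_ratio:.1f}")
print(f"Attributable risk:     {attributable_risk:.4f}")
print(f"Number needed to harm: {number_needed_to_harm:.1f}")
```

Because this is a cohort design, the risk ratio is the natural measure to report; the odds ratio is shown only to illustrate that both can be computed from the same four cells.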
How can you determine the mean, median and mode of a Gaussian curve?
Looking at this curve, we are going to figure out the mean, median and mode. You can still figure them out without any numbers, just based on the shape. The mode is simple because it's always the highest point: it's the number that occurs most commonly. If the data are evenly distributed (you see an even number of measurements on the left and right side of the mode), then the mean and median are equal to the mode.
What is the schematic of cohort study?
We find our patients: a cohort of people that is a mixture of smokers and non-smokers. We follow them over time and see how many of the exposed smokers get the disease and how many of the unexposed non-smokers get the disease or don't.
Sample question, • A test is administered to 200 medical students. The mean score is 80 with a standard deviation of 5. The test scores are normally distributed. How many students scored >90 on the test? • 90 is two standard deviations away from mean • 2.5% of students score in this range (1/2 of 5%) • 2.5% of 200 = 5 students
We know that +/-2 SD covers 95% of the data, but you need to realize that about 5% of the data is outside of that range, and that 5% is split so that 2.5% is above the range and 2.5% is below it. This means that 2.5% of students score above 90. This is the type of question that often comes up on stats exams and Step: they give you a data set and ask you to identify how many patients are in some range based on how many SDs they are from the mean.
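The test score question can be checked with a few lines. This sketch shows the rule-of-thumb answer (95% within 2 SD) next to the value from an exact normal distribution via Python's `statistics.NormalDist`; boards-style questions expect the rule-of-thumb number.

```python
from statistics import NormalDist

mean, sd, n_students, cutoff = 80, 5, 200, 90

# How many SDs above the mean is the cutoff?
sds_above_mean = (cutoff - mean) / sd

# Rule of thumb: 95% within +/- 2 SD, so half of the remaining 5% is above
rule_fraction = (1 - 0.95) / 2
rule_count = rule_fraction * n_students

# Exact area above the cutoff under a normal curve
exact_fraction = 1 - NormalDist(mean, sd).cdf(cutoff)
exact_count = exact_fraction * n_students

print(f"Cutoff is {sds_above_mean:.0f} SD above the mean")
print(f"Rule of thumb: {rule_count:.0f} students")
print(f"Exact normal:  {exact_count:.2f} students")
```

The two answers differ slightly because exactly 95% of a normal distribution lies within 1.96 SD, not 2 SD.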
How do you calculate CI95?
What we are saying is that we believe, based on this data set, that 95% of repeated means would fall between 8 and 12. Sometimes it's stated as the upper confidence limit being 12 and the lower confidence limit being 8.
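A minimal sketch of the calculation, using the Mean +/- 1.96*SEM formula given elsewhere in these notes. The inputs (mean 10, SD 4, n = 16) are assumptions picked so that the interval comes out close to the 8 to 12 range mentioned above; the notes do not say which data set produced that range.

```python
import math

def ci95(mean, sd, n):
    """95% confidence interval of the mean: mean +/- 1.96 * SEM."""
    sem = sd / math.sqrt(n)   # standard error of the mean
    margin = 1.96 * sem
    return mean - margin, mean + margin

# ASSUMED values for illustration only
mean, sd, n = 10, 4, 16

lower, upper = ci95(mean, sd, n)
print(f"95% CI of the mean:   {lower:.2f} to {upper:.2f}")

# Contrast: the range that holds about 95% of individual data points
print(f"Mean +/- 2 SD (data): {mean - 2 * sd} to {mean + 2 * sd}")
```

The second line of output is much wider than the first, which is the SD versus CI distinction these notes keep returning to.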
What is the statistical Distribution?
Suppose we randomly select some healthy subjects and measure their glucose levels. We would get a collection of levels like the one on the screen: there would be a few people with low numbers like 85, a couple of people with higher numbers like 112, and a lot of people in the middle.
What are the odds and risk ratios?
These are some specific ways in which we use CIs: we use them with odds and risk ratios, which express the increased risk of disease based on exposure. • Some studies report odds or risk ratios with CIs • If range includes 1.0 then exposure/risk factor does not significantly impact disease/outcome • Example: • Risk of lung cancer among chemical workers studied • Risk ratio = 1.4 +/- 0.5 (that range, 0.9 to 1.9, includes the number 1.0, which means that chemical work is not significantly associated with lung cancer) • Confidence interval includes 1.0 • Chemical work not significantly associated with lung cancer • (Formal statement: Null hypothesis not rejected) The bottom line is that if you are ever shown a risk ratio with a CI, check whether or not the CI includes the number 1. If it does, the exposure or risk factor is not significantly associated with a change in disease.
What are the 3 basic types of studies?
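The check described here fits in a tiny helper. The function name `ci_includes` is made up for this sketch; the first two examples use numbers from these notes and the third uses an invented risk ratio.

```python
def ci_includes(estimate, margin, null_value):
    """True if the interval estimate +/- margin contains the 'no effect' value."""
    lower, upper = estimate - margin, estimate + margin
    return lower <= null_value <= upper

# Risk ratio 1.4 +/- 0.5: "no effect" for a ratio is 1.0
print(ci_includes(1.4, 0.5, null_value=1.0))   # includes 1.0 -> not significant

# Mean difference 1.0 +/- 3.0: "no effect" for a difference is 0
print(ci_includes(1.0, 3.0, null_value=0.0))   # includes 0 -> not significant

# HYPOTHETICAL risk ratio 2.0 +/- 0.5 for contrast
print(ci_includes(2.0, 0.5, null_value=1.0))   # excludes 1.0 -> significant
```

The only thing that changes between the two settings is the null value: 1.0 for ratios, 0 for differences between means.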
these are used to determine the association of exposure/risk with disease • Cross-sectional study • Case-control study • Cohort study (prospective/retrospective)
What is dispersion?
They both have a mode of 10, but the measure of dispersion is what differs between them and characterizes them.
What happens if we go -3 and +3 SD?
We will capture 99.7% of the data points in that range, virtually all of them.
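The 68, 95 and 99.7 percent figures can be reproduced from a standard normal curve using Python's `statistics.NormalDist`. The helper name `fraction_within` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution within +/- k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"+/- {k} SD captures {fraction_within(k) * 100:.1f}% of the data")
```

Because these fractions depend only on the number of SDs, they hold for a tight curve and a widely dispersed curve alike.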
How do you explain the results we got from mizyme normal=1 and MI=10?
When we tested MIzyme levels in normal patients we found a mean level of 1, and when we tested MI patients we found a mean value of 10. There are 2 possible explanations for the results of our experiment • Two possibilities of our test of MIzyme • #1: MIzyme does NOT distinguish between normal/MI • Difference in means was by chance; true means are the same and we just got a fluke measurement. • #2: MIzyme DOES distinguish between normal/MI • Difference in means is real • Null hypothesis (H0)=#1 • Alternative hypothesis (H1)=#2
What is the Z score?
It represents how many standard deviations you are away from the mean • Z score of 0 is the mean • Z score of +1 is 1 SD above mean • Z score of -1 is 1 SD below mean. So each of the dotted lines on the chart that marks an SD from the mean can be relabeled as a z score, in steps of +/-1.
What is the set up of a case control study?
You would find cases, and the cases would include both exposed and unexposed patients (like people with lung cancer who are smokers and non-smokers). Then you have to find a control group, which also has to contain smokers and non-smokers (exposed and unexposed patients). Then you can compare the rates of exposure, and what you are looking to see is whether there are higher rates of smoking among the cases than among the controls.
What is the first step to identifying study types?
• #1: How were patients identified? • Cross-sectional: By location/group (e.g. New Yorkers) • Cohort: By exposure/risk factors (e.g. Smokers) • Case-control: By disease (e.g. Lung cancer)
What is the 2nd step to identifying study types?
• #2: Time period of the study • Cross-sectional: No time period (i.e. snapshot) • Retrospective: Look backward for disease/exposure • Prospective: Follow forward in time by following pt to look for disease/exposure
What is the 3rd step for identifying study types?
• #3:What numbers are determined from study? • Cross-sectional: Prevalence of disease (possibly by group) • Cohort: Relative risk (RR) • Case-control: Odds ratio (OR)
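The three steps can be sketched as a small lookup function. The string labels ("group", "exposure", "disease", "snapshot", "forward", "backward") and the function name are hypothetical, chosen only for this illustration.

```python
def classify_study(identified_by, time_frame):
    """Return (study type, main measure) following the three-step approach.

    identified_by: "group", "exposure", or "disease"
    time_frame:    "snapshot", "forward", or "backward"
    These labels are assumptions made for this sketch.
    """
    if identified_by == "group" and time_frame == "snapshot":
        return ("cross-sectional", "prevalence")
    if identified_by == "disease":
        return ("case-control", "odds ratio")
    if identified_by == "exposure" and time_frame == "forward":
        return ("prospective cohort", "relative risk")
    if identified_by == "exposure" and time_frame == "backward":
        return ("retrospective cohort", "relative risk")
    raise ValueError("combination not covered by this sketch")

# Lung cancer patients vs. patients without lung cancer, asked about past smoking
print(classify_study("disease", "backward"))
# Smokers vs. non-smokers followed for 3 years
print(classify_study("exposure", "forward"))
# New Yorkers surveyed once about smoking and morning cough
print(classify_study("group", "snapshot"))
```

Real vignettes need careful reading before they can be reduced to these labels; the function only encodes the decision rule once that reading is done.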
What is Anova?
• Analysis of variance: the equivalent of a t-test when there are more than 2 groups • Used to compare more than two quantitative means • Consider: • Plasma level of creatinine determined in non-pregnant, pregnant, and post-partum women • Three means determined • Cannot use t-test (two means only) • Use ANOVA • Yields a p-value like t-tests, and you can interpret it in a similar manner
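For intuition only, here is a sketch of the F statistic behind a one-way ANOVA: the variation between group means divided by the variation within groups. The creatinine values are invented for illustration, and turning F into a p-value is left to a statistics package.

```python
def one_way_anova_f(groups):
    """F statistic for one-way ANOVA: between-group / within-group mean square."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, group_means) for x in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# HYPOTHETICAL plasma creatinine values for the three groups
non_pregnant = [0.8, 0.9, 1.0, 0.9]
pregnant = [0.5, 0.6, 0.7, 0.6]
post_partum = [0.7, 0.8, 0.9, 0.8]

f_stat = one_way_anova_f([non_pregnant, pregnant, post_partum])
print(f"F = {f_stat:.2f}")
```

A larger F means the group means are far apart relative to the scatter inside each group, the same logic as the t-test extended to three or more means.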
What is the central tendency?
• Center of normal distribution (the middle of the gaussian curve) • Three ways to characterize: • Mean: Average of all numbers • Median: Middle number of data set when all lined up in order • Mode: Most commonly found number
What is power?
• Chance of finding a difference when one exists (we like to do experiments that have high power, which means they are able to detect real differences) • Or chance of rejecting no difference (because there really is one) • Also called rejecting the null hypothesis (H0) • Power is increased when: • Increased sample size • Large difference of means • Less scatter of data (more precise measurements)
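The three factors that raise power can be seen in a rough normal-approximation sketch. This is an assumption-laden simplification (two equal groups, known common SD, two-sided test at alpha 0.05), not how a formal power analysis for a t-test would be done, and all input numbers are invented.

```python
import math
from statistics import NormalDist

def approx_power(mean_diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    ASSUMPTIONS: normal approximation, equal group sizes, common known SD.
    """
    z = NormalDist()
    standard_error = sd * math.sqrt(2 / n_per_group)
    z_critical = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(mean_diff) / standard_error - z_critical)

baseline = approx_power(mean_diff=5, sd=10, n_per_group=20)
more_subjects = approx_power(mean_diff=5, sd=10, n_per_group=80)
bigger_difference = approx_power(mean_diff=10, sd=10, n_per_group=20)
less_scatter = approx_power(mean_diff=5, sd=5, n_per_group=20)

print(f"baseline:          power = {baseline:.2f}, beta = {1 - baseline:.2f}")
print(f"more subjects:     power = {more_subjects:.2f}")
print(f"bigger difference: power = {bigger_difference:.2f}")
print(f"less scatter:      power = {less_scatter:.2f}")
```

Each change moves power up from the baseline, and beta (the type II error probability) is always 1 minus the power.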
What is a case-control study?
• Compares group with disease to group without • Looks for exposure or risk factors among the two groups (it's like the opposite of a cohort study: we identify disease first and then look for the exposure) • Opposite of cohort study • Better for rare diseases
What is the T-test?
• Compares two MEAN quantitative values • Yields a p-value (p stands for probability) • The p-value is the chance of seeing a difference at least this large if the null hypothesis is correct • Null hypothesis: no difference between means, and any difference that we are seeing is due to chance • If p<0.05 we usually reject the null hypothesis and state that the difference in means is "statistically significant". Keep in mind that a p-value less than 0.05 doesn't mean there is a 0% chance that the means are the same and the differences we are seeing are due to random chance; it just means the chance is low enough that we are willing to accept that risk.
What is the Chi-square?
• Compares two or more categorical variables • Must use this test if results are not hard numbers • When asked to choose a statistical test for a dataset, always ask yourself whether the data are quantitative or categorical • Beware of percentages: often categorical data (so if they say 20% of group 1 are heavy exercisers and 40% of group 2 are heavy exercisers, the way you test for a statistically significant difference is with the chi-square, because heavy exercise is a yes or no categorical variable)
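The heavy exerciser example can be turned into a chi-square statistic by hand. The notes give only percentages, so 100 people per group is an assumption made for this sketch; no continuity correction is applied.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# ASSUMPTION: 100 people per group (the notes give only 20% and 40%)
#            heavy exerciser   not
observed = [[20, 80],   # group 1
            [40, 60]]   # group 2

chi2 = chi_square(observed)
CRITICAL_VALUE = 3.841  # chi-square cutoff for p = 0.05 with 1 degree of freedom

print(f"chi-square = {chi2:.2f}")
print(f"significant at p < 0.05: {chi2 > CRITICAL_VALUE}")
```

With smaller groups the same percentages could fail to reach significance, which ties back to the point that the number of data points matters.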
What don't you want to confuse confidence intervals with?
• Don't confuse SD with confidence intervals (remember that +/-2 SD includes 95% of the data set, and this is completely different from a CI) • Standard deviation is for a given data set • Suppose we have ten samples • These samples have a mean and standard deviation • 95% of these samples fall within +/- 2 SD of the mean • This is a descriptive characteristic of the sample and data set • Confidence intervals • This does not describe the sample • An inferred value of where the true mean lies for the population (how closely our data set represents the population of our experiment)
What is a randomized trial?
• Don't confuse with case-control • Patients may be identified by disease, like case-control (for example, in a randomized trial of a drug for heart attack, patients are identified by the presence of MI, but importantly the exposure is determined randomly: the MI patients are randomly picked to receive the drug or not, which makes it a randomized trial and very different from a case-control study) • Exposure determined randomly
What is the probability of each of the four outcomes being correct based on?
• Each of the four outcomes has a probability of being correct based on: • Difference between means normal/MI • Scatter of data • Number of subjects tested. You don't need to know the math, but you do need to know these three things.
What are the 4 possible outcomes of our experiment?
• Four possible outcomes of our experiment: • #1: There is a difference in reality and our experiment detects it. This means the alternative hypothesis (H1) is found true by our study. • #2: There is no difference in reality and our experiment also finds no difference. This means the null hypothesis (H0) is found true by our study. • #3: There is no difference in reality but our study finds a difference. This is an error! Type 1 (α) error. (which means you believe that there's a difference but there isn't one) • #4: There is a difference in reality but our study misses it. This is also an error! Type 2 (β) error. (which means we think the two means are equal but they are different in reality)
What is the goal of an epidemiology study?
• Goal:Determine if exposure/risk or presence of a risk factor associated with disease • Many real world examples • Hypertension then stroke • Smoking then lung cancer • Exercise then fewer heart attacks • Toxic waste then leukemia and by learning of all of these associations we are then able to focus on exposure to try to decrease the rates of disease.
What is the standard error of the mean?
• How precisely you know the true population mean (in statistics we often take a small sample and use its characteristics to estimate the characteristics of a larger population. For example, we may measure 10 systolic blood pressures, take the mean value, and hope that it represents the mean value of a larger population of 10K people. The standard error of the mean represents how close we are to the true population mean.) • SD divided by square root of n (number of samples) • More samples then less SEM (closer to the true mean, like going from 10 BP readings to 1K readings when estimating the mean of 10K people) • Big σ means big SEM • Need lots of samples (n) for small SEM • Small σ means small SEM • Need fewer samples (n) for small SEM
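A short sketch of how SEM shrinks as the number of samples grows and as σ gets smaller. The SD values of 15 and 5 mmHg are assumptions for illustration; the notes do not give an SD for the blood pressure example.

```python
import math

def sem(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

# ASSUMED standard deviations of systolic blood pressure, in mmHg
for sd in (15, 5):
    for n in (10, 1000):
        print(f"SD = {sd:2d}, n = {n:4d}: SEM = {sem(sd, n):.2f} mmHg")
```

Going from 10 to 1,000 readings cuts the SEM tenfold, because SEM falls with the square root of n.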
What is hypothesis testing?
• Hypothesis testing mathematically calculates probabilities (e.g. 5% chance, 50% chance) that the two means are truly different and not just different by chance in our experiment • Math is complex (you don't need to know it, that's why we have statisticians, BUT YOU DO NEED TO UNDERSTAND THE NEXT 3) • Probabilities by hypothesis testing depend on: • Difference between means normal/MI • Scatter of data • Number of subjects tested
What are the key points for central tendency?
• If distribution is symmetric (normal), mean = mode = median • Mode is always at peak • In skewed data: • Mean is always furthest away from mode toward tail • Median is between mean and mode • Mode is least likely to be affected by outliers (an outlier is one measurement that is very far away from the others.) • Adding one outlier changes mean, median • Only affects mode if it changes most common number • Outliers are unlikely to change most common number
So out of the null hypothesis and alternative hypothesis, what does hypothesis testing represent?
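A small sketch of the outlier point using Python's standard `statistics` module. The readings and the outlier value of 300 are illustrative numbers only.

```python
import statistics

readings = [90, 80, 80, 100, 110, 120]   # illustrative blood pressure readings
with_outlier = readings + [300]          # one hypothetical extreme value

def summarize(values):
    """Return mean, median, and mode for a list of numbers."""
    return {
        "mean": round(statistics.mean(values), 1),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }

before = summarize(readings)
after = summarize(with_outlier)
print(before)
print(after)
```

The mean moves the most, the median moves a little, and the mode stays at the most common number.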
• In reality, either H0 or H1 is correct • In our experiment, either H0 or H1 will be deemed correct • Hypothesis testing determines likelihood our experiment matches with reality
• Sixteen normal subjects have their blood glucose level sampled. The mean blood glucose level is 90 mg/dl with a standard deviation of 4 mg/dl. What is the likelihood that the mean glucose level of another ten subjects would also be 90 mg/dl? • How confident are we in the number 90 mg/dl?
• In scientific literature, means are reported with a confidence interval • Study subjects: Mean glucose was 90 +/- 4 • Authors believe that if the study subjects were re-sampled, the mean result would fall between 86 and 94 for 95% of re-samples • For 5% of re-samples, the result would fall outside of 86 to 94
• New Yorkers were surveyed to determine whether they smoke and whether they have morning cough. The study found a smoking prevalence of 50%. Among responders, 25% reported morning cough. • Likely questions: • Type of study? • What can be determined?
• KEY THING!!! Note the absence of a time period • They never said that they followed the patients for 1 year, etc. (so you should recognize that this is just a snapshot in time) (cross-sectional) (prevalence of disease only and not risk ratios or odds ratios) • Incidence = (new individuals contracting the disease)/(individuals who could potentially contract gonorrhea) • Therefore, the incidence of gonorrhea will decrease because the numerator (new individuals contracting the disease) will decrease, but the denominator (individuals who could potentially contract gonorrhea) will remain constant. • Recall that incidence and prevalence are connected by the formula prevalence = incidence × duration
How do you calculate a confidence interval?
• To calculate a confidence interval you need 2 things • Standard deviation (σ) • Number of subjects tested to find mean value (n) • Most people report the 95% CI; the 99% CI is rarely reported.
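A minimal sketch of how those 2 things turn into an interval. It assumes the usual normal approximation, where the 95% CI is the mean +/- 1.96 × SEM; the 1.96 multiplier and the sample numbers (mean 120, SD 10, n 25) are assumptions for illustration, not from these notes.

```python
import math

def confidence_interval(mean, sd, n, z=1.96):
    """Return (low, high) for mean +/- z * SD / sqrt(n).

    z = 1.96 gives an approximate 95% CI under a normal approximation.
    """
    sem = sd / math.sqrt(n)
    margin = z * sem
    return mean - margin, mean + margin

# Hypothetical sample: mean glucose 120 mg/dl, SD 10 mg/dl, 25 subjects.
low, high = confidence_interval(mean=120.0, sd=10.0, n=25)
print(f"95% CI: {low:.2f} to {high:.2f}")
```

A bigger n or a smaller SD narrows the interval, which matches the SEM idea.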
• Researchers discover a gene that they believe leads to development of diabetes. A sample of 1000 patients is randomly selected. All patients are screened for the gene. Presence or absence of diabetes is determined from a patient questionnaire. It is determined that the gene is strongly associated with diabetes.
• Key points: • Note lack of time frame (they are not following pts for years) • Patients not selected by disease or exposure (random, which is different from how they are selected in cohort and case-control studies) • Just a snapshot in time, looking at what they look like right now
What is the main outcome of a case control study?
• Main outcome measure is odds ratio (you can't calculate a risk ratio for a case-control study like you can for cohort studies because the results are not reliable) • Odds of disease exposed/odds of disease unexposed • A key point to remember is that in a case-control study patients are identified by disease or no disease; this makes them very different from cohort studies, and this is how you are going to recognize the description of a case-control study in a question stem.
What is the main outcome measure of a cohort study?
• Main outcome measure is relative risk (RR) • How much does exposure increase risk of disease • Patients identified by risk factor (i.e. smoking or non-smoking) • Different from case-control (by disease presence/absence) • Example results • 50% smokers get lung cancer within 5 years • 10% non-smokers get lung cancer within 5 years • RR = 50/10 = 5 • Smokers 5 times more likely to get lung cancer
What is the main outcome of a cross-sectional study?
• Main outcome of this study is prevalence (prevalence of disease, prevalence of risk factors; we are just trying to determine how much of it is out there right now) • 50% of New Yorkers smoke • 25% of New Yorkers have lung cancer • May have more than one group (so don't get confused if there is more than one group and think it's a different type of study) • 50% of men have lung cancer, 25% of women have lung cancer (this is still just a snapshot in time of men and women; the groups aren't followed over time like in a cohort study, it's just a subgroup analysis in a cross-sectional study looking at men compared to women.) • But groups not followed over time (i.e. years) • There are many important things that we can't determine from a cross-sectional study that we can determine from other types: • How much smoking increases risk of lung cancer (RR; for this we would identify smokers and nonsmokers and follow them over time) • Odds of getting lung cancer in smokers vs. non-smokers (OR; for this we would identify people with and without lung cancer and compare their smoking history, as in a case-control study) • You are often asked on the boards to identify cross-sectional studies based on a description of what the researchers did in the study!
What does comparing groups involve?
• Many clinical studies compare group means • Often find differences between groups • Different mean ages • Different mean blood levels, etc. • Need to compare differences to determine the likelihood that they are real differences and not due to chance • Are the differences "statistically significant?" (this is how people usually say it)
How do you maximize the power?
• Maximize power to detect a true difference (you don't want to spend a lot of time collecting data and find no difference when actually there was one there and we just missed it) • In study design, you have little/no control over: • Scatter of data • Difference between means • You DO have control over • Number of subjects • Number of subjects chosen to give a high power • This is called a power calculation (usually for a clinical trial of a drug, the researchers estimate what the scatter and the difference between the means will be, and they calculate how many pts they would have to recruit in order to have a high power to detect the difference; nobody wants to conduct a large clinical trial with 1000s of pts over many years and then find out they didn't recruit enough pts and didn't have enough power to see the difference that was actually present.)
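The notes say you don't need the math, but a rough sketch can show how the three ingredients fit together. This uses the standard normal-approximation formula for comparing two means, n per group = 2 × ((z for α/2 + z for power) × SD / difference)², which is an assumption brought in for illustration and not something from these notes. The SD of 10, the differences of 5 and 10, α = 0.05, and 80% power are made-up inputs.

```python
import math
from statistics import NormalDist

def subjects_per_group(sd, difference, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect a difference in means.

    Normal-approximation sketch only; real trials use dedicated software.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided alpha
    z_beta = NormalDist().inv_cdf(power)            # power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sd / difference) ** 2
    return math.ceil(n)

# Hypothetical estimates: scatter (SD) of 10, expected differences of 5 and 10.
print(subjects_per_group(sd=10, difference=5))
print(subjects_per_group(sd=10, difference=10))
```

A smaller expected difference or more scatter pushes the required number of subjects up, which is why the number of subjects is the lever you actually control.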
What are confidence intervals?
• Mean values often reported with 95% CIs • For example, a study of diabetic pts may say that the mean glucose level was 120 mg/dl +/- 5 mg/dl (the +/- 5 is the confidence interval) • This is the range in which 95% of repeated measurements would be expected to fall; it is also saying that if you repeated our study you would get a mean glucose value similar to ours, within +/- 5 • Confidence intervals are for estimating the population mean from a sample data set (we often report the findings from say 10 or 100 diabetics, but we are trying to give an estimate of what the mean would be in an entire population of similar diabetics.) • Suppose we take 10 samples from a population of 1 million people • Mean of 10 samples is X • How sure are we the mean of 1 million people is also X? • Confidence intervals answer this question
What is the correlation coefficient?
• Measure of linear correlation between two variables • Represents strength of association of two variables • Number from -1 to +1 • Closer to +1 or -1, stronger the relationship • (-) number means inverse relationship • More smoking, less lifespan • (+) number means positive relationship • More smoking, more lifespan (only an example of a positive relationship, not true in reality) • 0 means no relationship, and the closer the number is to 0 the weaker the relationship between the variables.
What is the data for risk estimation?
• Obtained by studying: • Presence/absence of risk factor/exposure • In people with and without disease • This data comes primarily from: • Cohort study • Case-control study
What is the median?
• Odd number of data elements in set • 80-90-110 • Middle number is median = 90 • Even number of data elements • 80-90-110-120 • Halfway between middle pair is median = 100 • Note: Must put data set in order to find median (from low to high)
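The two median cases written out by hand as a sketch, with the sort step made explicit. The function name `median` is just a label chosen for this example.

```python
def median(values):
    """Return the median: sort first, then take the middle value(s)."""
    ordered = sorted(values)          # must order the data low to high
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the middle number
    # even count: halfway between the middle pair
    return (ordered[mid - 1] + ordered[mid]) / 2

# Same numbers as the card, deliberately given out of order.
print(median([110, 80, 90]))
print(median([120, 80, 110, 90]))
```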
• A cardiologist discovers a protein level that may be elevated in myocardial infarction called MIzyme. He wishes to use this to detect heart attacks in the ER. He samples levels of MIzyme among 100 normal subjects and 100 subjects with a myocardial infarction. The mean level in normal subjects is 1mg/dl. The mean level in myocardial infarction patients is 10mg/dl. • Can this test be used to detect myocardial infarction in the general population?
• Other way to think about it: Does the mean value of MIzyme in normal subjects truly differ from the mean in myocardial infarction patients? • Or was the difference in our experiment simply due to chance? (meaning that if we repeated it we would not get that difference in means) • Whether this is real or due to chance depends on several factors: • Difference between means normal/MI • Scatter of data • Number of subjects tested
What is the problem with a cohort study?
• Problem: Does not work with rare diseases • Imagine: • 100 smokers, 100 non-smokers • Followed over 1 year • Zero cases of lung cancer in both groups (lung cancer is very rare even in people that smoke, so we might suspect that smoking leads to lung cancer but a cohort study may give us results that we cannot rely on) • In rare diseases need LOTS of patients for a LONG time, and this is very time consuming and expensive. • Easier to find cases of lung cancer first, then compare to controls without lung cancer
What is a case series?
• Purely descriptive study (similar to cross-sectional) • Often used in new diseases with unclear cause (so if bizarre cases of a febrile illness are happening throughout Chicago, researchers may collect a bunch of the cases together and measure the demographics and symptoms to look for clues about what could be causing the disease or how they can prevent it) • Multiple cases of a condition combined/analyzed • Patient demographics (age, gender) • Symptoms • Done to look for clues about etiology/course • No control group (just a collection of cases and a description of what the cases look like)
• Example#3: • 10% smokers get lung cancer • 50% nonsmokers get lung cancer
• RR=0.2 (this is less than 1, and this implies that...) • Smoking protective! (obviously this is not true in reality and this is only an example.)
• Example#1: • 10% smokers get lung cancer • 10% nonsmokers get lung cancer
• RR=1
• Example#2: • 50% smokers get lung cancer • 10% nonsmokers get lung cancer
• RR=5
How do you calculate the risk of disease?
• Risk in exposed group = A/(A+B) • Risk in unexposed group = C/(C+D) • In the 2x2 table, disease is across the top (say lung cancer) and exposure is down the side (say smoking). Among the pts that smoke, say 25 had lung cancer and 75 did not have lung cancer. This means that among the smokers, which is everybody in the top row, 25/100 pts had cancer.
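A sketch of the two risk formulas from the 2x2 table. The top row (A = 25, B = 75) is the smoker example above; the bottom row (C = 5, D = 95) is not in these notes and is made up so that the unexposed risk can be shown too.

```python
def risk(with_disease, without_disease):
    """Risk = people with disease / everybody in that row of the 2x2 table."""
    return with_disease / (with_disease + without_disease)

# Top row (smokers): A = 25 with lung cancer, B = 75 without.
a, b = 25, 75
# Bottom row (non-smokers): C and D are hypothetical values for illustration.
c, d = 5, 95

risk_exposed = risk(a, b)      # A / (A + B)
risk_unexposed = risk(c, d)    # C / (C + D)
print(risk_exposed, risk_unexposed)
```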
What is the risk ratio?
• Risk of disease with exposure vs non-exposure • RR=5 (for smokers developing lung cancer) • Smokers 5x more likely to get lung cancer than nonsmokers according to our study • Usually from a cohort study; they are unreliable when you derive them from a case-control study except in certain special cases • Ranges from zero to infinity • RR = 1 then No increased risk from exposure • RR > 1 then Exposure increases risk • RR < 1 then Exposure decreases risk (this one is sometimes called a protective exposure) • So you take the risk in the top row and divide it by the risk in the bottom row.
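A short sketch that runs the three smoker/nonsmoker examples from these notes through the RR rule and the three interpretations. The function names and the small tolerance used to test for RR = 1 are choices made for this example.

```python
def risk_ratio(risk_exposed, risk_unexposed):
    """RR = risk in exposed group / risk in unexposed group."""
    return risk_exposed / risk_unexposed

def interpret(rr, tolerance=1e-9):
    """Map an RR onto the three cases from the card."""
    if abs(rr - 1) < tolerance:
        return "no increased risk from exposure"
    if rr > 1:
        return "exposure increases risk"
    return "exposure decreases risk (protective)"

# (risk in smokers, risk in nonsmokers) for each example.
examples = {
    "Example #1": (0.10, 0.10),
    "Example #2": (0.50, 0.10),
    "Example #3": (0.10, 0.50),
}
for label, (smokers, nonsmokers) in examples.items():
    rr = risk_ratio(smokers, nonsmokers)
    print(label, round(rr, 2), interpret(rr))
```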
What is the key point of scattered data points?
• Scatter of data points relative to difference in means influences likelihood that difference between means is due to chance • This is how differences between means are tested to determine likelihood that they are different due to chance • Don't need to know the math • Just understand principle
What is Matching?
• Selection of control group (matching) is key to getting good study results • Want controls as close to the disease patients as possible (except for presence or absence of disease) • Matching reduces confounding (confounders are variables like age, race, or comorbid conditions like diabetes that could affect the results) • Want all potential confounders balanced between cases and controls, and to do that you need good matching.
What is the mean and mode?
• Six blood pressure readings: • 90, 80, 80, 100, 110, 120 • Mean=(90+80+80+100+110+120)/6=96.7 • Mode is most frequent number = 80
What's another way CIs are important for group comparisons?
• Some studies report group means with CIs • If ranges overlap, no statistically significant difference • Group 1 mean: 10 +/- 5; Group 2 mean: 8 +/- 4 (the Group 1 range includes the number 8 from the second group) • Confidence intervals overlap • No significant difference between means • Similar to p > 0.05 for comparison of means • Group 1 mean: 10 +/- 5; Group 2 mean: 30 +/- 4 • Confidence intervals do not overlap • Significant difference between means • Similar to p < 0.05 for comparison of means
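The overlap check from the two group examples can be sketched like this. It only tests whether the two ranges share any values, which is the rule of thumb described here and not a formal significance test.

```python
def intervals_overlap(mean1, half_width1, mean2, half_width2):
    """Return True if mean1 +/- half_width1 overlaps mean2 +/- half_width2."""
    low1, high1 = mean1 - half_width1, mean1 + half_width1
    low2, high2 = mean2 - half_width2, mean2 + half_width2
    return low1 <= high2 and low2 <= high1

# Group 1: 10 +/- 5 (5 to 15); Group 2: 8 +/- 4 (4 to 12)
print(intervals_overlap(10, 5, 8, 4))
# Group 1: 10 +/- 5 (5 to 15); Group 2: 30 +/- 4 (26 to 34)
print(intervals_overlap(10, 5, 30, 4))
```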
What is the coefficient of determination?
• Sometimes r2 is reported instead of r • Always positive, so you lose the information about whether the relationship was positive or negative (but we know that smoking negatively impacts lifespan) • Indicates % of variation in y explained by x (so if we have some data with a line through it and the r2 value is 0.6, this means that 60% of the variation on the y axis is explained by x) • So if smoking were on the x axis and lifespan on the y axis, an r2 value of 0.6 would say that 60% of the variation in lifespan is explained by how much you smoke; people like to report numbers like this because they are easy to understand • If r2 is 1, the data fall perfectly on a straight line, and this would mean that 100% of the variation in y is explained by x • Going back to our pneumonia study example: if white blood count were on the x axis and length of stay on the y axis, an r2 of 1 would say that all of the variation in length of stay is explained by the white blood count (that's not going to happen in reality, but just trying to get the point across).
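A tiny sketch of why the sign is lost when r is squared. The r values of +0.5 and -0.5 are illustrative only.

```python
def r_squared(r):
    """Coefficient of determination from a correlation coefficient r."""
    return r ** 2

for r in (0.5, -0.5):
    # Both directions give the same r2, so the sign of the relationship is gone.
    percent_explained = 100 * r_squared(r)
    print(r, r_squared(r), f"{percent_explained:.0f}% of variation in y explained by x")
```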
What are the measures of dispersion?
• Standard deviation(SD) • Variance • Standard error of the mean (SEM) • Z-score • Confidence interval
What does the p value indicate?
• Studies will report relationships with a correlation coefficient (CC) • Example: • Study of pneumonia patients • They looked at the WBC on admission and evaluated its relationship with LOS (length of stay) • r=+0.5 • Higher WBC (x axis, the independent variable) then higher LOS • Sometimes a p value is also reported • p<0.05 indicates significant correlation • p>0.05 indicates no significant correlation (so this is a statistical test to see if the r value that was found is statistically significant)
What is a type 1 statistical error?
• Type 1 (α) error • False positive • Finding a difference/effect when there is none in reality • Rejecting null hypothesis (H0) when you should not have • Example: Researchers conclude a drug benefits patients but it does not help at all. • Null hypothesis generally not rejected unless α < 0.05 because we want the α error to be very low before we reject the null hypothesis. • Similar to (but different from) p value • p value calculated by comparison of the means • α set by study design, but they both operate in similar realms: we would like to see α very low before we accept a study result as true, and in the same way we would like to see a low p value before we accept that the difference is real.
What is a type 2 statistical error?
• Type 2 (β) error • False negative • Finding no difference/effect when there is one in reality • Accepting null hypothesis (H0) when you should not have • Example: Researchers conclude a drug does not benefit patients but a later study finds that it does • Can get type 2 error if too few patients
Why is risk important?
• Understanding of disease causes comes from estimating risk • Smoking increases risk of lung cancer • Exercise decreases risk of heart attacks (by knowing these things we can better advise pts how to avoid disease.) • We know these things from having techniques in epidemiology for quantifying risk • Smoking increases risk of lung cancer X percent • Exercise decreases risk of heart attacks Y percent
What is the odds ratio?
• Usually from case-control study • Odds of exposure in disease/odds of exposure in no-disease • Ranges from zero to infinity just like the risk ratio. • OR = 1 then Exposure equal among disease vs. no-disease • OR > 1 then Exposure increased among disease vs. no-disease • OR < 1 then Exposure decreased among disease vs. no-disease
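A sketch of the odds ratio using the first lung cancer question in these notes (25% of cases smoked, 5% of controls smoked). The notes give 100 cases; the size of the control group is not stated, so 100 controls is an assumption, and the argument names are just labels for this example.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """OR = odds of exposure among cases / odds of exposure among controls."""
    odds_cases = cases_exposed / cases_unexposed
    odds_controls = controls_exposed / controls_unexposed
    return odds_cases / odds_controls

# 100 lung cancer cases: 25 smokers, 75 non-smokers.
# ASSUMED 100 controls: 5 smokers, 95 non-smokers.
or_value = odds_ratio(25, 75, 5, 95)
print(round(or_value, 2))
```

An OR above 1 here means smoking was more common among the lung cancer cases than among the controls.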
• Suppose test grade average (mean) = 79 • Standard deviation = 5 • Your grade = 89 • Calculate the z score
• Your z score = (89-79)/5 = +2, so you are 2 SD above the mean
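The same z score calculation as a sketch. The second call uses a made-up grade of 74 to show a score below the mean.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value sits above (+) or below (-) the mean."""
    return (value - mean) / sd

print(z_score(89, 79, 5))   # the grade from the card
print(z_score(74, 79, 5))   # hypothetical grade below the mean
```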