Intermediate Epidemiology Final 2019
kappa (misclassification slides)
--fraction of the observed agreement not due to chance in relation to the maximum non-chance agreement --K = (observed agreement - expected agreement) / (1.0 - expected agreement) = (P0 - Pe) / (1.0 - Pe)
comparing survival curves (survival analysis slides)
--generalized Wilcoxon (Breslow, Gehan): more sensitive to early differences between the curves -slide shows a graph of this --logrank: more sensitive to late differences between the curves -slide shows a graph of this *--note= if you see curves cross, Cox proportional hazards assumptions are likely to be violated*
methods to control confounding (confounding and mediation slides)
--prevention strategies= attempt to control confounding through the study design itself -randomization -restriction -matching (often used in case-control studies) --analysis strategies -stratification -multivariable techniques (adjustment)
assessing interaction, case control study: observed vs expected approach (interaction and effect modification slides)
*--reference category:* exposure to Z= no -> exposure to X= no -> cases -> controls -> OR= 1.0 *--independent effect of X:* exposure to Z= no -> exposure to X= yes -> cases -> controls -> ORx1z0 *--independent effect of Z:* exposure to Z= yes -> exposure to X= no -> cases -> controls -> ORx0z1 *--observed joint effect:* exposure to Z= yes -> exposure to X= yes -> cases -> controls -> ORx1z1 --note= technically can do additive and multiplicative for this but in general JUST DO multiplicative for case control studies
risk (cumulative incidence) (causation slides)
--probability of developing a given disease *--risk = number of new cases of disease over a time period / number of persons followed at risk over a time period*
how much evidence do we need to act (causation slides)
*--"all scientific work is incomplete-* whether it be observational or experimental... this does not confer upon us a freedom to ignore the knowledge that we already have or to postpone the action that it appears to demand at a given time" *(Hill 1965)* *--note= a lot of times what the intervention is affects this -low risk of an intervention but little evidence= likely to be used (ex= swapping fries for broccoli in school lunches)*
each of the preceding definitions leads to alternative (yet equivalent) strategies to assess the presence or absence of interaction (interaction and effect modification slides)
*--"interaction"* definition= relates to examining whether the observed joint effect of the two risk factors in question is the same as or different from that expected from their independent effects -note= interaction is a statistical term -note= when we think about a linear model: y= β0 + β1(X1) + β2(X2) + e [diabetes yes/ no= β1(X1) and hypertension yes/ no = β2(X2)] -note= with an interaction term: y= β0 + β1(X1) + β2(X2) + β3(X1*X2) + e *--"effect modification"* definition= relates to examining whether the effect of the risk factor is homogeneous or heterogeneous when stratified according to suspected effect modifier -note= rare instance is if 1 stratum = the crude and the other stratum doesn't but even in this case it's still an effect modifier -note= as long as strata are different it's an effect modifier; ex= 1 stratum could be 1 and the other could not (when measuring RR 1 means no association) -note= hypertension yes= y= β0 + β1(X1) + e hypertension no= y= β0 + β1(X1) + e -note= confounder: CRR ≠ a1RR = a2RR; effect modifier: CRR ≠ a1RR ≠ a2RR -note= for effect modifier and confounder you have to use a 2x2 table for the crude, a 2x2 table for a1RR, and a 2x2 table for a2RR
interaction (interaction and effect modification slides)
*--"interaction"* refers to a situation whereby the effects of the risk factor (on the disease outcome) and that of a third factor strengthen (synergism) or weaken each other (antagonism) *-note= not always an effect measure modifier* *-note= interaction is a statistical term* *--"effect modification"* refers to the situation where the effect of the risk factor differs depending on the presence or absence of a third factor (effect modifier); the third variable modifies the effect of the risk factor *-note= not always an interaction* -note= strata are different and crude is different
potential sources of information bias (misclassification slides)
*--respondent:* inability to understand, recall, articulate; unwillingness to disclose or social desirability *--data collector:* unclear or ambiguous questions, lack of a neutral demeanor, insufficiently conscientious, inaccurate transcription, fraud *--data managers:* inaccurate transcription, misreading, miscoding, programming errors *--data analysts:* variable coding and programming errors
evaluating epidemiologic associations (causation slides)
*--1. could the association have been observed by chance* -determined through use of statistical tests; evidence to reject the null hypothesis *--2. could the association be due to bias* -bias refers to systematic errors, i.e., how samples were selected or how data was analyzed *--3. could other confounding variables have accounted for the observed relationship* *--note= questions 1 to 3 are about internal validity* --note= causation step 1- is there an association (is it not 1), step 2- is the association statistically significant (if not repeat study with a larger sample size (sample size numbers can change association numbers)), and step 3- think about confounding (adjust associations if needed) -association also means effect here in the note above *--4. to whom does this association apply* -representativeness of sample -participation rates -generalizability *-note= question 4 is about external validity* *--5. does the association represent a cause-and-effect relationship* -considers criteria of causality *-note= we can still look at causality if external validity isn't good but external validity makes studies reproducible* --note= other concepts we have to look at for causation; establish measure of association; establish internal validity; establish external validity
assessing confounding (misclassification slides)
*--1. is the confounding variable related to both the exposure and the outcome in the study* *--2. does the exposure-outcome association seen in the crude analysis have the same direction and similar magnitude as the associations seen within strata of the confounding variable* *--3. does the exposure-outcome association seen in the crude analysis have the same direction and similar magnitude as that seen after controlling (adjusting) for the confounding variable*
classic Bradford Hill criteria (causation slides)
*--1. strength (effect size)=* a small association does not mean that there is not a causal effect though the larger the association the more likely that it is causal *--2. consistency (reproducibility)=* constant findings observed by different persons in different places with different samples strengthens the likelihood of an effect *--3. specificity (DON'T NEED THIS ANY MORE)=* causation is likely if a very specific population at a specific site and disease with no other likely explanation; the more specific an association between a factor and an effect is the bigger the probability of a causal relationship *--4. temporality=* the effect has to occur after the cause (And if there is an expected delay between the cause and expected effect then the effect must occur after that delay) *--5. biological gradient (dose-response)=* greater exposure should generally lead to greater incidence of the effect however in some cases the mere presence of the factor can trigger the effect in other cases an inverse proportion is observed- greater exposure leads to lower incidence *--6. plausibility=* a plausible mechanism between cause and effect is helpful (but Hill noted that knowledge of the mechanism is limited by current knowledge) *--7. coherence (DON'T NEED ANYMORE)=* coherence between epidemiological and laboratory findings increases the likelihood of an effect however Hill noted that "...lack of such [laboratory] evidence cannot nullify the epidemiological effect on associations" -note= this has been debunked- there's a lot of lab things that can't be done to find out a relationship *--8. experiment=* "occasionally it is possible to appeal to experimental evidence" *--9. analogy (DON'T NEED THIS ANYMORE)=* the effect of similar factors may be considered
criteria for causal inference (Bradford Hill 1965) (causation slides)
*--1. temporality* (exposure came before disease) *--2. strength of association* (the higher the association the more likely for it to be causal) *--3. biological plausibility* *--4. dose-response* (threshold effects can affect this) *--5. replication of findings* (sometimes only causal for a certain population) *--6. consideration of alternate explanations* *--7. cessation of exposure* (sometimes the damage's already done) *--8. coherence with established facts* *--9. specificity of association* --note= there's a lot of debate on this --note= Sir Bradford Hill created this when epidemiology mainly studied infectious diseases
causal pie: sufficient cause (causation slides)
*--a complete causal mechanism --there may be more than one for each disease or condition* --a minimum set of conditions that are sufficient for an outcome to occur --shows a picture of a causal pie on the slide -A, B, C and D on the pie are named known component causes -U on the pie is all unknown causes -all component causes in a sufficient cause are required
causal pie: necessary cause (causation slides)
*--a necessary cause is one that must be present in order to trigger the onset of disease* -often a necessary cause appears as a component cause in all sufficient cause constellations for an outcome --this slide shows pictures of causal pies *--note= whole pie- sufficient cause (complete mechanism) --note= necessary cause- needed in every pie (needed to get disease)*
*****general rule for confounding (confounding and mediation slides) EXTREMELY IMPORTANT TO KNOW AND REMEMBER*****
*--a variable can be a confounder if all the following conditions are met: -the confounding variable is causally associated with the outcome AND -non-causally or causally associated with exposure BUT -is not an intermediate variable in the causal pathway between exposure and outcome* *--NOTE= THIS IS ON THE FINAL!!!!!!!!*
association ≠ causation (causation slides)
*--association ≠ causation* *--causal relationships are directed --associations are not directed* *--note= aka correlation ≠ causation* --note= ex- sepsis does not cause stroke -survival bias (sepsis makes over 50% with it die) -stroke can cause sepsis though (need to find out what comes first (temporality))
regression cautions (multivariate analysis I: continuous outcomes slides)
*--be careful about throwing out those outliers* -if these observations are not the result of gross errors the data without the outliers is not representative of your population --these are not the only options for analyzing skewed continuous variables -but that is another class -note= non-parametric tests, non-parametric regressions, etc.
interaction and confounding (interaction and effect modification slides)
*--biases and confounding effects distort true causal associations* -strategies= avoid, eliminate, reduce, and control -note= should explain *--effect modification is informative* -provides insight into the nature of the relationship between exposure and outcome -may be the most important result of a study -it should be reported and understood *-note= should report*
mediation (confounding and mediation slides)
*--cannot distinguish from confounder in terms of data analysis - either by* *-tabular* *-regression analyses* *--can have partial or full mediation* *--note= can't be 2x2 tables either* --ultimately we need to know going in through our theories and hypotheses and what biologically makes sense to know if something is a confounder or mediator - why --well when we go to do data analysis we can't really distinguish the two --but you can see mediation has the same result as confounding... so we can't distinguish in our modeling - only in our theoretical framework... only distinguishing feature is if there is partial --aka intermediate variables --a variable that represents a step in the causal chain --causes variation in the outcome --caused to vary by the original causal variable --note= can affect your outcome; can have full or partial mediation; can do it with a measure of association and can account for it in different ways --note= DAG air pollution (exposure) -> influenza (mediator (can take out influenza by vaccines)) -> stroke (outcome) air pollution (exposure) -> stroke (outcome) = *the direct effect* air pollution (exposure) -> influenza (mediator) -> stroke (outcome) = *indirect effect* whole DAG= *total effect* examples of DAGs: --directed acyclic graphs (DAGs) illustrating alternative hypotheses on the relation between periodontal disease and risk of coronary heart disease (CHD); in panel A baseline periodontal infection and tooth loss (both observed variables in a study) are assumed to be caused by past periodontal infection; through its relation to diet tooth loss is hypothesized to be an intermediary variable in the association between periodontal infection and risk of CHD; in panel B socioeconomic status (SES) is hypothesized to be a confounder of the association between tooth loss and CHD thus making the latter a collider; unmeasured variables are represented with faded font and in a dotted box in the diagram -A. 
periodontal disease (exposure) -> tooth loss (intermediate variable) --(diet (intermediate variable in this association))-> CHD (outcome) periodontal disease (exposure) --(?)-> CHD (outcome) -B. periodontal disease (exposure) -> tooth loss (intermediate variable) --(diet (intermediate variable in this association))-> CHD (outcome) periodontal disease (exposure) --(?)-> CHD (outcome) tooth loss (intermediate variable) <- SES (collider/ confounder) -> CHD (outcome) --note= may be looking for additional ways to intervene or additional risk factors when looking at mediation --full mediation: E --(M)-> D --partial mediation: E --(M)-> D E ---> D
accepting the validity of published findings (causation slides)
*--conditional on 2 important assumptions* *-1. that each published study used unbiased methods and -2. that published studies constitute an unbiased sample of a theoretical population of unbiased studies* --note= this is a very strong assumption due to all the different biases that could happen
multinomial (aka polytomous) logistic regression models assumptions (multivariate analysis II: categorical outcomes slides)
*--does not assume* -normality -linearity -homoscedasticity *--does assume* *-independence* among the dependent variable choices; membership in one category is not related to membership of another category *-non-perfect separation* if the groups of the outcome variable are perfectly separated by the predictor(s) then unrealistic coefficients will be estimated and *effect sizes will be greatly exaggerated*
causal pie (causation slides)
*--each complete pie is a causal mechanism* -each pie is also a *sufficient cause* a minimal set of factors that produce disease joint action of many component factors/ causes must be present for disease to develop --within the pie a *component* cause is one event or condition that plays a necessary role in the occurrence of some cases of disease --a *necessary* cause is a factor that MUST be present in order for disease to develop *--blocking one component cause within a sufficient cause model prevents disease from occurring by that mechanism/ pathway* --because of this disease can be prevented by blocking one component --this slide shows an image of the causal pie -blocking T in this causal pie prevents disease from occurring *--note= don't have to block every risk factor (just blocking one will decrease disease likelihood)*
absolute vs relative scales (causation slides)
*--effect measures can be on an absolute scale (subtraction) and on a relative scale (division)* --in epidemiology we may be interested in both *--absolute differences* tell us the increase (or decrease) in effect *--relative differences* tell us the relative increase or decrease in effect comparing quantities associated with exposed and unexposed --note= may not have a big impact on our absolute risk
effect modification overview (interaction and effect modification slides)
*--effect modification - NOT a statistical bias* -the strength of the association depends on the presence of one or more covariates *-NOT a bias* *-different from confounding* confounding the 3rd variable distorts the true association between X and Y *-different effects in different groups - due to the disease mechanism not the study*
some notes about terminology... (causation slides)
*--for OR, RR, and IRR if value is >1 typically we say that there is a "positive association," 1 is no association, and <1 is a "negative association"* --of course interpretation fully depends on what is "exposed" and what is "unexposed" -note= ex- if we classify eating vegetables as exposed we may use something different *--remember... the "null" is 1 for relative measures of association and 0 for absolute measures; hence "away from" or "towards" the null*
prevention strategies (causation slides)
*--high risk* *-primary focus on proximate component causes particularly those related to biologic markers of risk* -note= ex- most pregnant women don't have strokes so only focus on pregnant women at risk to intervene on --the "high risk" approach focuses on individuals with severe hypertension; although the relative risk for stroke is high in individuals with severe hypertension (compared to those with normal blood pressure) the prevalence of severe hypertension in the total population is low and thus the associated population attributable risk is low; most cases of stroke originate among those with moderate hypertension the prevalence of which is high --"high risk" approach- identify and treat individuals with severe hypertension --this slide shows a graph *--population wide* *-based on distal or intermediate causes - for example those related to social determinants of disease* -note= ex- American Heart Association's simple 7 tells the entire population to do things to decrease risks of disease --distribution of systolic blood pressure before and after the application of a population wide approach: the prevalence of both moderate and severe hypertension decreases resulting in a decrease of the population attributable risk --this slide shows graphs -distribution of SBP after the application of a population-wide approach -distribution of SBP before the application of a population-wide approach -population density with "moderate + severe" hypertension before and after the application of a population-wide approach --note= want to look at both relative risk and absolute risk due to this
observed vs expected approach: what kinds of interaction in this case control study (interaction and effect modification slides)
*--reference category:* smoker= no -> oral contraceptives= no -> OR for breast cancer= 1.0 *--independent effect of X:* smoker= no -> oral contraceptives= yes -> OR for breast cancer= 4.5 *--independent effect of Z:* smoker= yes -> oral contraceptives= no -> OR for breast cancer= 7.0 *--observed joint effect:* smoker= yes -> oral contraceptives= yes -> OR for breast cancer= 35.0 *--expected OR[additive]= 4.5 + 7.0 - 1 = 10.5* *--expected OR[multiplicative]= 4.5 x 7.0 = 31.5* --additive= yes --multiplicative= yes --positive= yes additive and multiplicative --negative= no --quantitative= yes additive and multiplicative --qualitative= no
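The observed vs expected comparison above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the card; the function name `expected_joint_or` is made up for this example and is not from the slides.

```python
def expected_joint_or(or_x: float, or_z: float) -> dict:
    """Expected joint OR if there were NO interaction, on each scale."""
    return {
        "additive": or_x + or_z - 1.0,       # ORx + ORz - 1
        "multiplicative": or_x * or_z,       # ORx * ORz
    }

# Values from the card: OC only = 4.5, smoking only = 7.0, both = 35.0
expected = expected_joint_or(4.5, 7.0)
observed = 35.0

for scale, value in expected.items():
    if observed > value:
        direction = "positive (synergistic)"
    elif observed < value:
        direction = "negative (antagonistic)"
    else:
        direction = "no"
    print(f"{scale}: expected {value}, observed {observed} -> {direction} interaction")
```

The observed joint OR (35.0) is above both expected values (10.5 and 31.5), which is why the card marks the interaction as positive on both scales.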
assumption 4: equal residual variances (multivariate analysis I: continuous outcomes slides)
*--homoscedasticity=* constant variance of errors at each value of the predictor --otherwise the SEs of regression coefficients are biased --therefore the significance tests are not accurate *--example=* family income and family expenditures --on the slide there's a graph showing *heteroscedasticity- error gets larger as X increases* --note= residual variance *--note= if you're not meeting your assumptions you're impacting your SE (this affects statistical significance)* --using bivariate regression we use family income to predict luxury spending (as expected there is a strong positive association between income and spending) --upon examining the residuals we detect a problem- *the residuals are very small for low values of family income (families with low incomes don't spend much on luxury items) while there is great variation in the size of the residuals for wealthier families (some families spend a great deal on luxury items while some are more moderate in their luxury spending)* --this situation *represents heteroscedasticity because the size of the error varies across values of the independent variable* examining the scatterplot of residuals against the predicted values of the dependent variable would show *the classic cone-shaped pattern of heteroscedasticity* [note= this is not good it'll probably widen SE and cause us to say no association] --this is not good because our goal is to have the smallest standard error
hypotheses and assumptions (multivariate analysis II: categorical outcomes slides)
*--hypotheses* -H0: all betas are equal to 0 -Ha: at least one of the betas is not equal to 0 *--assumptions* -L= the logit of true conditional probabilities is a linear function of independent variables -I= the observations are independent *-N= not applicable* *-E= not applicable; binomial distribution of errors* -O= no influential outliers -M= the independent variables are not linear combinations of one another *-rare event assumption* --note= y= beta0 + beta1X1 + ei [no important variables are excluded] *--multicollinearity=* moderate multicollinearity is fairly common since any correlation among the independent variables is an indication of collinearity; when severe multicollinearity occurs the standard errors for the coefficients tend to be very large (inflated) and *sometimes the estimated logistic regression coefficients can be highly unreliable* --LINE-MO -L= slightly changed -I= yes -N= no longer exists -E= binomial distribution of the errors; errors= Y - "Y hat" -M= yes -O= yes *--other things: -no important variables are omitted -no extraneous variables are included -the independent variables are measured without error* --note= epidemiology mostly uses categorical outcomes not continuous ones -on the rare occasion of a continuous outcome it usually has to do with time -this is why we rarely use linear regression
when is a variable a confounder (misclassification slides)
*--it is a known risk factor for the outcome* *--it is associated with the exposure* -more or less common in the exposed group than in the comparison group *--but not a result of the exposure* -cannot be an intermediate step in the pathway between exposure and outcome (mediator) confounding - diagram --exposure of interest <- potential confounder -> disease/ outcome exposure of interest --(?)-> outcome
examining the association between X and Y (multivariate analysis I: continuous outcomes slides)
*--least squares regression line: "y hat"i = b0 + b1xi* -b0 is the intercept -b1 is the slope -xi is the observed predictor for a given individual i -yi is the observed outcome for a given individual i *-"y hat"i is the predicted outcome for a given individual i for a given value of x*
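A minimal sketch of how b0 and b1 could be computed by hand, using the standard least squares formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The data below are hypothetical, chosen so the points lie exactly on y = 1 + 2x; the function name is made up for this example.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = y_bar - b1 * x_bar      # intercept
    return b0, b1

# Hypothetical data (not from the slides): y = 1 + 2x exactly
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

b0, b1 = least_squares_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]             # predicted outcome for each individual
residuals = [yi - yh for yi, yh in zip(y, y_hat)]
print(b0, b1, residuals)
```

With real data the residuals (yi minus "y hat"i) would not all be zero; they are what the regression assumption checks look at.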
leverage vs residual squared plot (multivariate analysis I: continuous outcomes slides)
*--leverage is a measure of how far away the independent variable values of an observation are from those of the other observations* *--graph of leverage against the normalized residuals squared -points above the horizontal line have higher than average leverage -points to the right of the vertical line have larger than average residuals* --a bit of limited utility when you have a categorical predictor --there is a graph on this slide --lvr2plot, mlabel(z1a_a) -mlabel labels the observations with a particular variable value -I chose age of onset but we could have used the ID --what does this tell us -the leverage is quite small; note (it says this on the slide) the y-axis goes from 0.00115 to 0.0014 -the residuals are quite small; note (it says this on the slide) the x-axis scale goes from 0 to 0.02 --there is a graph on this slide with notes I wrote about where the outliers are
consistency (causation slides)
*--observation of consistent albeit weak associations provides the main rationale for the use of meta-analytic techniques for policy decision making* *--lack of consistency does not necessarily constitute evidence against a causal association* -note= ex- due to biased measure of association or difference in effectiveness of intervention etc *(lack of) consistency* --the reasons why causal associations may not appear consistent -differences in the specific circumstances of exposure -differences in design and analytic strategies -differences in the effectiveness of interventions --differences in the timing of the study with regard to the exposure's latency (incubation) period -note= timing of study can affect it; may happen at a time you can't measure exposure association with disease well --differences in the distribution of component causes *--differences in the stage of the natural history of the underlying process* -smoking seems to be particularly important in later stages of the natural history of atherosclerosis; in the former Yugoslavia earlier atherosclerotic lesions seemed to predominate and thus smoking appeared to be a less important risk factor than in Northern Europe where more advanced lesions likely predominate [note= where we measure smoking and disease can influence measure of association] *--differences in the variability of the risk factor* -intracountry variability in salt intake is not sufficient to allow observation of an association between salt intake and systolic blood pressure; however given inter country variability there is a strong correlation between average salt intake and systolic blood pressure (SBP) when "country" is used as the analytic unit; in the image on this slide small circles denote individuals within each country and large circles denote countries
types of interaction (interaction and effect modification slides)
*--positive* (synergistic)= effect modifier *strengthens* the effect of the exposure *--negative* (antagonistic)= effect modifier *diminishes or eliminates* the effect of the exposure *--quantitative*= interaction exists and the association between X and Y is in the *same direction* in strata of Z *--qualitative*= interaction exists and the association between X and Y are in *different directions* in strata of Z -this is relatively rare -note= 1 stratum protective factor and the other stratum risk factor
three effect measures (causation slides)
*--risk and rate differences - absolute scale* -measure amount a factor adds to risk or rate of a disease *--risk and rate ratio (and prevalence or odds ratio) - relative scale* -measures amount by which a factor multiplies the risk or rate (or prevalence or odds) of disease *--attributable fractions - not covered in this class* -measure fraction of cases due to a factor
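A small sketch of the absolute vs relative measures computed from one 2x2 table. The counts are hypothetical (not from the slides) and the cell labels a, b, c, d follow the usual layout, which is an assumption here: a = exposed with disease, b = exposed without disease, c = unexposed with disease, d = unexposed without disease.

```python
def effect_measures(a: int, b: int, c: int, d: int) -> dict:
    """Risk difference, risk ratio, and odds ratio from a 2x2 table.

    Assumed layout: a = exposed cases, b = exposed non-cases,
                    c = unexposed cases, d = unexposed non-cases.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "risk_difference": risk_exposed - risk_unexposed,  # absolute scale (null = 0)
        "risk_ratio": risk_exposed / risk_unexposed,       # relative scale (null = 1)
        "odds_ratio": (a * d) / (b * c),                   # relative scale (null = 1)
    }

# Hypothetical cohort: 30/100 exposed and 10/100 unexposed develop disease
measures = effect_measures(a=30, b=70, c=10, d=90)
print(measures)
```

In this made-up table the factor adds about 0.20 to the risk (absolute scale) and multiplies the risk by about 3 (relative scale), which is the distinction the card draws.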
statistical interaction -definition (interaction and effect modification slides)
*--statistical interaction* (association measure modification) refers to nonuniformity/ heterogeneity of a (true) measure of association between disease and one exposure over levels of another exposure in the *source population* --thus assuming no confounding at each level of the potential modifier we can use our assessment of statistical interaction to make inferences about effect modification --for example we might compare the *risk ratio (RR) [note= RR= multiplicative interaction]* or the *risk difference (RD) [note= RD= additive difference]* for the exposure-disease association across different categories of the potential modifier --note (the slide says this it is not one of my notes): the terms effect measure modification, statistical interaction, interaction, and interaction effect are often used synonymously; e.g., Rothman & Greenland (pp. 329-32, ME 2) --note= if EMM is continuous interaction= one of only ways to test it
*****when is a variable a confounder (confounding and mediation slides) VERY IMPORTANT TO KNOW THIS*****
*--the 3 confounding criteria* --1. it is a known risk factor for the outcome --2. it is associated with the exposure -more or less common in the exposed group than in the comparison group --3. but not a result of the exposure -cannot be an intermediate step in the pathway between exposure and outcome (mediator)
survival analysis... (survival analysis slides)
*--there are 3 possible "outcomes" in survival analysis* -the event of interest occurs -the participant is lost to follow up (no additional data or death if death not outcome) -the study ends and participant does not experience the event of interest prior to study ending *--those lost to follow up or who remained in study until the end are censored*
as a reminder (confounding and mediation slides)
*--to make causal inferences ideally we compare the risk of the outcome in the exposed (actual) with the risk of the outcome in the same people had they not been exposed (counterfactual)* --but since we can't do this in a real world study we must select different sets of individuals for exposed and a comparison group (unexposed) that are as similar as possible (note= there are other factors that may contribute to the answer (the outcome) due to this) --note= confounding has 3 rules; mediation: in the causal pathway (caused by exposure and contributes to the outcome unlike confounders)
Mantel-Haenszel adjusted OR (misclassification slides)
*--used to create an overall adjusted OR -OR(MH) = Σi[(ai x di) / Ni] / Σi[(bi x ci) / Ni] --is equivalent to the weighted average of the stratum specific ORs* example: *--so the adjusted OR for OC and MI is 3.97 -this is more than twice as large as the crude OR (1.7) -this assumes that the observed between stratum variation is random; likely a reasonable assumption in this case because of the small number of cases -but considering how large some of the stratified ORs are a single adjusted OR might not make sense* -[(4x224 / 292) + (9x390 / 444) + (4x330 / 393) + (6x362 / 442) + (6x301 / 405)] / [(2x62 / 292) + (12x33 / 444) + (33x26 / 393) + (65x9 / 442) + (93x5 / 405)] = 3.97 --note= for each stratum -for malaria would have (mostly outdoors + mostly indoors) / (mostly outdoors + mostly indoors) due to it only having 2 strata *--note= can only adjust for one variable and it can't adjust for continuous variables*
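The worked example above can be checked with a short Python sketch. The five strata are the numbers from the card; which off-diagonal count is b and which is c is an assumption here (the formula only uses the product b x c, so it does not change the result). The function names are made up for this example.

```python
def mantel_haenszel_or(strata):
    """OR(MH) = sum(a*d/N) / sum(b*c/N) over strata of (a, b, c, d)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

def crude_or(strata):
    """OR from the single 2x2 table obtained by collapsing all strata."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    return (a * d) / (b * c)

# (a, b, c, d) per stratum, taken from the OC and MI example on the card;
# the b/c ordering within each stratum is an assumption.
strata = [
    (4, 2, 62, 224),   # N = 292
    (9, 12, 33, 390),  # N = 444
    (4, 33, 26, 330),  # N = 393
    (6, 65, 9, 362),   # N = 442
    (6, 93, 5, 301),   # N = 405
]

or_mh = mantel_haenszel_or(strata)
or_crude = crude_or(strata)
print(round(or_mh, 2), round(or_crude, 2))
```

Comparing the adjusted and crude values side by side is the same crude vs adjusted check used when assessing confounding.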
interaction in case control studies (interaction and effect modification slides)
*--we cannot look at absolute differences in case control studies* *-why= because we're already choosing people based on the outcome [this was my note]* *-note= can only look at multiplicative interaction*
2 key points in survival analysis (survival analysis slides)
*--what is the end point* -death -disease *-note= strictly define outcome* *--when does the clock start* -at birth -at diagnosis -after treatment -note= when does the clock start is it a certain date (ex= January 1st 1981) or birth or something else
examples of DAGs (confounding and mediation slides)
*diagram example of a DAG is in the slides (on page 13 of my notes):* --note= backdoor path opens if you control for platelet aggregation --note= ex of a mediator not a confounder in diagram A
kappa: expected (chance) agreement (misclassification slides)
--expected number of agreements = [(a + c) x (a + b) / (a + b + c + d)] + [(c + d) x (b + d) / (a + b + c + d)] --expected agreement (Pe) = expected number of agreements / (a + b + c + d) = [((a + c) x (a + b)) + ((c + d) x (b + d))] / (a + b + c + d)^2 --note= this starts weighting how much is in true positive (A) I think she said this
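A minimal sketch that puts the expected agreement formula together with K = (P0 - Pe) / (1.0 - Pe). The 2x2 counts are hypothetical (not from the slides); a and d are taken to be the two agreement cells, which is an assumption about the table layout.

```python
def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Kappa for a 2x2 agreement table where a and d are the agreement cells."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Expected (chance) agreement from the row and column totals
    p_expected = ((a + c) * (a + b) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical table: 100 subjects rated twice
kappa = cohen_kappa(a=40, b=10, c=5, d=45)
print(kappa)
```

Here observed agreement is 0.85 and chance agreement is 0.50, so kappa is 0.35 / 0.50 = 0.70.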
what is a cause (causation slides)
--"an event, condition, or characteristic that *preceded the disease event* and without which the disease event either *would not have occurred* at all or would not have occurred until some later time" (Rothman and Greenland 1998) --note= without exposure the disease wouldn't happen; if truly causal the person wouldn't be diseased if didn't have exposure or would've gotten the disease later *--1. theory of causation* -exposure to certain risk factors causes disease -temporality and biological plausibility *--2. formulate a testable hypothesis* -exposure to variable X will be related to disease Y *--3. design and conduct a study* -randomized trials, cohort studies, case-control studies, ecological studies -goal of the study is to minimize bias, minimize random error, collect data on confounding variables *--4. analyze the data --5. interpret the results* *--2, 4 and 5 look at= strength of association, dose response, consideration of alternate explanations, cessation of exposure, coherence with established facts, specificity of association, and replication of findings* -note= in strength of association measures of association are thought about --note= the epidemiology triangle doesn't work as well for diseases like diabetes and hypertension because there's no longer an agent --note= temporality- the exposure came first and know time point of exposure --note= biological plausibility- biologically possible --note= think about alternative explanations too *--note= explain why you are or are not matching other studies' results*
exceptions to the general rule (confounding and mediation slides)
--"confounding" due to random associations --the "confounder" does not cause the outcome but it is a marker of another unmeasured causal factor (note= ex: diet and SES (food deserts may affect it) [be careful with proxies and how you define the proxies due to this]) --the "confounder" as an intermediate variable in the causal pathway of the relationship between exposure and outcome
Kaplan-Meier example 1: time-to-conception for sub-fertile women (survival analysis slides)
--"failure" in this study is conception --38 women were treated for infertility in 1982 --all women were followed for up to 2 years --raw data: time (months) to conception or censoring in 38 sub-fertile women after laparoscopy and hydrotubation [this is a table] -conceived (event) in one column of the table -did not conceive (censored) in the other column -the numbers in the columns represent the time the event or the censoring happened [ex= at time point 1, 6 women conceived so 38 - 6= 32 and 32 / 38= 84.2% of women in the study make it past the 1st time point without conceiving] *-censoring at t=2 indicates survival PAST the 2nd cycle (i.e., we know the woman "survived" her 2nd cycle pregnancy-free); thus for calculating the KM estimator at 2 months this person should still be included in the risk set; think of it as 2+ months, e.g., 2.1 months* --the risk set at 3 months includes 26 women [took out the 2.1 censored person] --risk set at 4 months includes 22 women (1 woman got censored) --risk set at 6 months includes 18 women (1 woman got censored) --the example skipped all the way down to the last follow up in the study (the end of the study) and there were 2 remaining at 16 months (9th event time) [1 woman had conceived at month 16] --Kaplan-Meier survival curve -the slide shows a graph -S(t) is estimated at 9 event times --the example has a Kaplan-Meier survival curve graph example -6 women conceived in 1st month (1st menstrual cycle) therefore 32/38 "survived" pregnancy-free past 1 month -S(t=1) = 32/38= 84.2% -S(t=2) = (time 1)(time 2)= (32/38)(27/32)= (84.2%)(84.4%)= 71.1%; 5 women conceive in the 2nd month; the risk set at event time 2 included 32 women therefore 27/32= 84.4% "survived" event time 2 pregnancy-free -S(t=3) = (time 1)(time 2)(time 3)= (32/38)(27/32)(23/26)= (84.2%)(84.4%)(88.5%)= 62.8%; 3 women conceive in the 3rd month; the risk set at event time 3 includes 26 women; 23/26= 88.5% "survived" event time 3 pregnancy-free -S(t=4)= (time 1)(time 2)(time 3)(time 4)= (32/38)(27/32)(23/26)(19/22)= (84.2%)(84.4%)(88.5%)(86.4%)= 54.2%; 3 women conceive in the 4th month and 1 was censored between months 3 and 4; the risk set at event time 4 included 22 women; 19/22= 86.4% "survived" event time 4 pregnancy-free -S(t=6)= (all times until now)(time 6)= (54.2%)(88.9%)= 48.2%; 2 women conceive in the 6th month of the study and one was censored between months 4 and 6; the risk set at event time 5 included 18 women; 16/18= 88.9% "survived" event time 5 pregnancy-free -S(t=16) = (22%)(2/3)= 15% *tail depicts that 2 women did not conceive (cannot draw many inferences from the end of the K-M curve)* --in the slides there is another example of K-M -the table shows day(t), deaths/ censored, at risk, Pt(d), Pt(survived), S(t) -there is also a graph --in the slides there's a 3rd K-M example -the table shows time(tj), risk set, # events, # censored, and S(tj)
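The running product in this example can be sketched in a few lines. This is only an illustration of the product-limit idea using the first five event times from the notes (risk sets 38, 32, 26, 22, 18 with 6, 5, 3, 3, 2 conceptions); the variable names are assumptions, and the printed values can differ from the notes in the last digit because the notes round at each step.

```python
# (event time in months, number at risk, number of events) from the notes
event_table = [
    (1, 38, 6),
    (2, 32, 5),
    (3, 26, 3),
    (4, 22, 3),
    (6, 18, 2),
]

survival = 1.0
km_estimates = {}
for month, at_risk, events in event_table:
    # conditional probability of "surviving" this event time pregnancy-free
    conditional = (at_risk - events) / at_risk
    survival *= conditional
    km_estimates[month] = survival
    print(f"S({month}) = {survival:.3f}")
```

Censored women leave the risk set after their censoring time without changing S(t), which is why the risk set drops from 32 to 26 between months 2 and 3 even though only 5 women conceived.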
overall caveats to "criteria" (causation slides)
--"none of my... [criteria] can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a *sine qua non"* *-sine qua non= something that is absolutely needed* *--note= these criteria were published in 1965; logistic regression didn't become popular until after 1975; they calculated things by hand back then; now these criteria aren't always needed*
kappa: observed agreement (misclassification slides)
--(a + d) / (a + b + c + d)
percent positive agreement (misclassification slides)
--a / {[(a + c) + (a + b)] / 2} x 100 = 2a / [(a + c) + (a + b)] x 100 = [2a / (2a + b + c)] x 100 --number of occurrences for which both observers report a positive result out of the average number of positives by either observer --note= don't need true negative here
1. counterargument: effect measure modification does NOT supersede control of confounding (interaction and effect modification slides)
--*precision and power for studying interaction is often poor* - much lower than the precision and power for single (main) effects of one of the exposures considered alone (in the same study) --even with no bias, observed interactions may poorly reflect the true degree of effect measure modification in the source population --confidence intervals for the degree of modification are valuable for making the imprecision clear -note= interpreting it --these are most easily computed from confidence intervals for product terms (like C x E, E x F, etc., often called "interaction terms") in regression models; they often reveal that little can be said and done about confounding
ordinary least squares regression assumptions (LINE) (multivariate analysis I: continuous outcomes slides)
--1. *Linearity* --2. *Independence* [note= 1 measure per person; multiple measures of the same person AREN'T independent] --3. *Normality* --4. *Equal Residual Variances (homoscedasticity)* --5. *No Outliers* *--if you do not assess the assumptions...* -if any of these assumptions are violated then all of the estimates, standard errors of estimates, hypothesis testing, and model interpretation may be biased or misleading -this is why I have emphasized testing for normality in the univariable analysis so much -although to be clear linear regression is quite robust; it can be used even if the outcome is not exactly normal --note= if these assumptions aren't met there'll be major weird things seen in the data that are not correct
*****is confounding present (confounding and mediation slides) IMPORTANT CONCEPT*****
--1. evaluate association between confounder and disease (overall and separately among exposed and unexposed) --2. if association is present confounding is possible OR if association is absent no confounding --3. (only look at this if association is present) evaluate association between confounder and exposure --4. if association is present confounding is present OR if association is absent no confounding --note= people get confused by this so be careful
assessing confounding (confounding and mediation slides)
--1. is the confounding variable related to both the exposure and the outcome in the study --2. does the exposure-outcome association seen in the crude analysis have the same direction and similar magnitude as the associations seen within strata of the confounding variable --3. does the exposure-outcome association seen in the crude analysis have the same direction and similar magnitude as that seen after controlling (adjusting) for the confounding variable
4 steps for testing for mediation (confounding and mediation slides)
--1. show that the causal variable is correlated with the outcome --2. show that the causal variable is correlated with the mediator --3. show that the mediator affects the outcome variable --4. to establish that M completely mediates the X-Y relationship the effect of X on Y controlling for M should be 0 -the effects in both steps 3 and 4 are estimated in the same equation --note= correlated: aka associated --note= essentially you'll have 3 equations [1. flu -> stroke, 2. air pollution -> stroke, and 3. air pollution -> flu -> stroke alongside air pollution -> stroke]
Rothman's causality model - example (causation slides)
--1. sufficient cause 1 formed by the component causes H. pylori (a necessary component cause), diet rich in nitrates, high salt intake, and Xz (a) --2. sufficient cause 2 formed by H. pylori, smoking, high salt intake, and Xz (b) --3. sufficient cause 3 formed by H. pylori, vitamin C deficiency, smoking, diet rich in nitrates, and Xz (c) --this slide shows the figures for each sufficient cause --note= Barry Marshall- H. pylori scientist that drank H. pylori to test if it caused ulcers
misclassification in cohort study (misclassification slides)
--2x2 table complete accurate data (RR= 3.3) - truth [cases and non-cases on the top and exposed and unexposed on the side] -A= 100 (cases and exposed), B= 200 (non-cases and exposed), C= 100 (cases and unexposed), and D= 900 (non-cases and unexposed) -note= we know the total number of people for exposed and unexposed for sure (300 exposed, 1,000 unexposed, and 1,300 total) -note= the 3.3, 5, and 1.5 on these tables match the risk ratios (ex= (100/300) / (100/1,000) = 3.3); the corresponding odds ratios would be larger --2x2 table differential misclassification of outcome: exposed disease info not accurate (RR= 5) -A= 150 (cases and exposed), B= 150 (non-cases and exposed), C= 100 (cases and unexposed), and D= 900 (non-cases and unexposed) [A and B are incorrect here so the total cases and non-cases numbers are incorrect too (250 cases and 1,050 non-cases)] -total number of exposed and unexposed is still correct -50 of the 200 non-cases (25%) among exposed misclassified as cases; 0% among unexposed -note= saying there's a stronger association than there really is --2x2 table non-differential misclassification of outcome: both exposed and unexposed disease info not accurate (RR= 1.5) -A= 150 (cases and exposed), B= 150 (non-cases and exposed), C= 325 (cases and unexposed), and D= 675 (non-cases and unexposed) [all of the numbers are incorrect for A, B, C, and D and the total disease numbers are incorrect (cases= 475 and non-cases= 825)] -total number of exposed and unexposed is still correct -25% of non-cases among exposed AND unexposed misclassified as cases -note= saying there's a weaker association than in reality
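A quick check of the three tables, as a sketch: the function names are assumptions, and the cell counts are the ones listed in the notes. Computing both measures suggests that the figures quoted on the tables (3.3, 5, 1.5) correspond to risk ratios, which is a natural measure for a cohort study.

```python
def risk_ratio(a, b, c, d):
    """Risk in exposed (a / (a + b)) divided by risk in unexposed (c / (c + d))."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product ratio ad / bc."""
    return (a * d) / (b * c)


tables = {
    "truth": (100, 200, 100, 900),
    "differential": (150, 150, 100, 900),
    "non-differential": (150, 150, 325, 675),
}

risk_ratios = {name: risk_ratio(*cells) for name, cells in tables.items()}
odds_ratios = {name: odds_ratio(*cells) for name, cells in tables.items()}
for name in tables:
    print(f"{name}: RR = {risk_ratios[name]:.2f}, OR = {odds_ratios[name]:.2f}")
```

The differential table moves the estimate away from the null and the non-differential table moves it toward the null, matching the notes.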
multiplicative interaction, cohort study: observed vs expected approach (interaction and effect modification slides)
--2x2 table for incidence rates/ 1,000; exposure no and yes on the top and factor Z no and yes on the side -A= 10.0 (no E and no Z), B= 30.0 (yes E and no Z), C= 20.0 (no E and yes Z), and D= 60.0 (yes E and yes Z) *-the joint relative risk (6.0) is exactly what is expected by multiplying the individual relative risks [3.0 x 2.0 = 6.0]* --2x2 table for relative risks; exposure no and yes on the top and factor Z no and yes on the side -A= 1.0 (no E and no Z), B= 3.0 (yes E and no Z), C= 2.0 (no E and yes Z), and D= 6.0 (yes E and yes Z) *-no multiplicative interaction present* --2x2 table for incidence rates/ 1,000; exposure no and yes on the top and factor Z no and yes on the side -A= 10.0 (no E and no Z), B= 30.0 (yes E and no Z), C= 20.0 (no E and yes Z), and D= 90.0 (yes E and yes Z) *-the joint relative risk (9.0) is larger than what is expected by multiplying the individual relative risks [3.0 x 2.0 = 6.0]* --2x2 table for relative risks; exposure no and yes on the top and factor Z no and yes on the side -A= 1.0 (no E and no Z), B= 3.0 (yes E and no Z), C= 2.0 (no E and yes Z), and D= 9.0 (yes E and yes Z) *-multiplicative interaction present*
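The observed vs expected comparison on the multiplicative scale can be sketched as follows, using the incidence rates per 1,000 from the two examples above. The function name and the r00/r10/r01/r11 argument names (neither, E only, Z only, both) are assumptions for illustration.

```python
def multiplicative_check(r00, r10, r01, r11):
    """Return (observed joint RR, expected joint RR under multiplicativity)."""
    rr_e_only = r10 / r00   # independent effect of E
    rr_z_only = r01 / r00   # independent effect of Z
    observed_joint = r11 / r00
    expected_joint = rr_e_only * rr_z_only
    return observed_joint, expected_joint


no_interaction = multiplicative_check(10.0, 30.0, 20.0, 60.0)
interaction = multiplicative_check(10.0, 30.0, 20.0, 90.0)
print(no_interaction)  # observed equals expected
print(interaction)     # observed exceeds expected
```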
central problem (survival analysis slides)
--estimation of the survival curve --there's an image on the slide --approaches *-S(t) = probability(survive past t)* *-lifetable method*: grouped in intervals; note= used when you don't know when the event occurred (ex= when dementia started); used for things that take awhile *-Kaplan-Meier*: ungrouped data, small samples *--note= used in times when you're talking to patients, reporters, etc*
additive interaction, cohort study: observed vs expected approach (interaction and effect modification slides)
--2x2 table for incidence rates/ 1,000; top is exposure no and yes and side is factor Z no and yes -A= 10.0 (no E no Z), B= 30.0 (yes E no Z), C= 20.0 (no E yes Z), and D= 40.0 (yes E yes Z) -the *joint attributable risk* (30) is exactly what is expected by *summing the individual attributable risks* [(30 - 10) + (20 - 10)] --2x2 table for attributable risks; exposure no and yes on the top and factor Z no and yes on the side -A= 0.0 (no E no Z), B= 20.0 (yes E no Z), C= 10.0 (no E yes Z), and D= 30.0 (yes E and yes Z) -no additive interaction present --note= similar to slide additive interaction, cohort study: homogeneity of effects approach with the 2 tables in it (this is just put on a 2x2 table) --2x2 table for incidence rates/ 1,000; top is exposure no and yes and side is factor Z no and yes -A= 10.0 (no E no Z), B= 30.0 (yes E no Z), C= 20.0 (no E yes Z), and D= 60.0 (yes E yes Z) don't need to take D into account because this is what we are trying to estimate -the *joint attributable risk* (50) is larger than what is expected by *summing the individual attributable risks* [(30 - 10) + (20 - 10) = 30] --2x2 table for attributable risks; exposure no and yes on the top and factor Z no and yes on the side -A= 0.0 (no E no Z), B= 20.0 (yes E no Z), C= 10.0 (no E yes Z), and D= 50.0 (yes E and yes Z) -additive interaction present
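The same observed vs expected logic on the additive scale, sketched with the incidence rates per 1,000 from the two examples above. The function and argument names (r00 = neither, r10 = E only, r01 = Z only, r11 = both) are assumptions for illustration.

```python
def additive_check(r00, r10, r01, r11):
    """Return (observed joint AR, expected joint AR under additivity)."""
    ar_e_only = r10 - r00   # attributable risk for E alone
    ar_z_only = r01 - r00   # attributable risk for Z alone
    observed_joint = r11 - r00
    expected_joint = ar_e_only + ar_z_only
    return observed_joint, expected_joint


no_interaction = additive_check(10.0, 30.0, 20.0, 40.0)
interaction = additive_check(10.0, 30.0, 20.0, 60.0)
print(no_interaction)  # observed equals expected
print(interaction)     # observed exceeds expected
```

Note that the second table (joint rate 60.0) shows additive interaction even though the same rates show no multiplicative interaction, which is why the scale has to be stated.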
case control study of DDT exposure and breast cancer (interaction and effect modification slides)
--2x2 table for the crude OR with breast cancer cases and controls on top and DDT high and low on the side -A= 500 (high cases), B= 600 (high controls), C= 1,500 (low cases), and D= 3,400 (low controls) [note= D is the sum of the stratum tables below: 2,700 + 700 = 2,600 + 800 = 3,400] -crude OR= 1.9 --age stratified data 2x2 tables -age < 50 years table [note= cases and controls on top and DDT high and low on the side] A= 50 (high cases), B= 300 (high controls), C= 450 (low cases), and D= 2,700 (low controls) stratum specific OR= 1.0 -age 50 years and older table [note= cases and controls on top and DDT high and low on the side] A= 450 (high cases), B= 300 (high controls), C= 1,050 (low cases), and D= 700 (low controls) stratum specific OR= 1.0 -note= age is a confounder due to both stratum specific ORs being the same as each other but different from the crude --age summary -there is no association between DDT levels and BC among women who are younger than 50 years or among those who are 50 years and older -stratum specific odds ratios are different than the crude OR indicating confounding by age is present -lack of difference between the two stratum specific odds ratios indicates that effect modification is absent --stratified by lactation history -2x2 table for never breastfed with cases and controls on top and DDT high and low on the side A= 140 (high cases), B= 300 (high controls), C= 550 (low cases), and D= 2,600 (low controls) stratum specific OR= 2.2 -2x2 table for breastfed with cases and controls on top and DDT high and low on the side A= 360 (high cases), B= 300 (high controls), C= 950 (low cases), and D= 800 (low controls) stratum specific OR= 1.0 -note= EMM= report stratum specific measurements --lactation summary -women who have high DDT levels and never breastfed have 2.2 fold increased odds of breast cancer -women who have high DDT levels but did breastfeed have no elevated risk *-heterogeneity of stratum specific odds ratios indicates* that lactation is an effect modifier of the relationship between DDT exposure level and breast cancer
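A sketch that recomputes the odds ratios in this example. One assumption is flagged here and in the code: the crude table is built by summing the two age strata cell by cell, which gives 3,400 low-DDT controls. The function and variable names are also assumptions.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio: (high cases x low controls) / (high controls x low cases)."""
    return (a * d) / (b * c)


# cells are (high cases, high controls, low cases, low controls)
age_strata = {"<50": (50, 300, 450, 2700), "50+": (450, 300, 1050, 700)}
lactation_strata = {"never breastfed": (140, 300, 550, 2600),
                    "breastfed": (360, 300, 950, 800)}

# assumption: crude table = cell-by-cell sum of the age strata
crude_cells = tuple(sum(cells) for cells in zip(*age_strata.values()))
crude_or = odds_ratio(*crude_cells)
age_ors = {k: odds_ratio(*v) for k, v in age_strata.items()}
lactation_ors = {k: odds_ratio(*v) for k, v in lactation_strata.items()}

print(crude_cells, round(crude_or, 2))
print({k: round(v, 2) for k, v in age_ors.items()})        # equal to each other, differ from crude: confounding
print({k: round(v, 2) for k, v in lactation_ors.items()})  # differ from each other: effect modification
```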
examples: effect measure modification (interaction and effect modification slides)
--2x2 tables 1. exposed (E) and unexposed (U) on top and male (M) and female (F) on the side -risk if all: A= 0.4 (E M), B= 0.2 (U M), C= 0.3 (E F), and D= 0.1 (U F) -effect measures RD= 0.2 for E and U M and 0.2 for E and U F [note= RD are the same] RR= 2.0 for E and U M and 3.0 for E and U F [note= RR are different] *description= uniformity of RD only* 2. exposed (E) and unexposed (U) on top and male (M) and female (F) on the side -risk if all: A= 0.4 (E M), B= 0.2 (U M), C= 0.2 (E F), and D= 0.1 (U F) -effect measures RD= 0.2 for E and U M and 0.1 for E and U F [note= RD are different] RR= 2.0 for E and U M and 2.0 for E and U F [note= RR are the same] *description= uniformity of RR only* 3. exposed (E) and unexposed (U) on top and male (M) and female (F) on the side -risk if all: A= 0.4 (E M), B= 0.2 (U M), C= 0.2 (E F), and D= 0.2 (U F) -effect measures RD= 0.2 for E and U M and 0.0 for E and U F [note= RD are different] RR= 2.0 for E and U M and 1.0 for E and U F [note= RR are different] *description= uniformity of neither RD nor RR* 4. exposed (E) and unexposed (U) on top and male (M) and female (F) on the side -risk if all: A= 0.4 (E M), B= 0.2 (U M), C= 0.4 (E F), and D= 0.2 (U F) -effect measures RD= 0.2 for E and U M and 0.2 for E and U F [note= RD are the same] RR= 2.0 for E and U M and 2.0 for E and U F [note= RR are the same] *description= uniformity of both RD and RR* *--in populations 1 and 2 uniformity of one effect measure between genders implies that the other measure is nonuniform* --gender -modifies the RR (but not the RD) in population 1 -modifies the RD (but not the RR) in population 2 -modifies both effect measures in population 3 -is not a risk factor in population 4 (under either exposure condition) thus it does not modify RR or RD in that population --thus we see that the assessment of effect measure modification is measure or model dependent; and that for a covariate to be a modifier it must have an effect in at least one exposure level *--note (on the side not my own note): we cannot represent effect measure modification in a DAG since DAGs are scale free while effect measure modification is scale dependent*
statistical significance in assessing confounding (confounding and mediation slides)
--BAD IDEA -> a confounder might be ruled out in a cohort study solely because there is no statistically significant difference in the levels of the confounder comparing exposed and unexposed --if a confounder is strongly associated with the outcome even a small difference (not statistically significant because of limited sample size) according to exposure may still induce confounding... and vice versa
kappa interpretation (misclassification slides)
--Landis & Koch (1977)= poor: -0.1 to 0, slight: 0 to 0.2, fair: 0.2 to 0.4, moderate: 0.4 to 0.6, substantial: 0.6 to 0.8, and almost perfect 0.8 to 1 --Altman (1991)= poor: -0.1 to 0.2, fair: 0.2 to 0.4, moderate: 0.4 to 0.6, good: 0.6 to 0.8, and very good: 0.8 to 1 --Fleiss (1981)= poor: -0.1 to 0.4, fair to good: 0.4 to 0.75, and excellent: 0.75 to 1 --Byrt (1996)= no agreement: -0.1 to 0, poor: 0 to 0.2, slight: 0.2 to 0.4, fair: 0.4 to 0.6, good: 0.6 to 0.8, very good: 0.8 to 0.9, and excellent: 0.9 to 1 *--note= depending on the field kappa is chosen* *-note= for public health 0.71 to 1 is good (she personally does 0.8 to 1 as good for public health human research)*
clinical life table (survival analysis slides)
--Ni*= effective # at risk = Ni - (Ci / 2) [Ci= # censored in the ith interval] --qi = P(die in ith interval | alive at start of ith interval) --qi= di / Ni* --pi= 1 - qi = P(survive the ith interval | alive at start of ith interval) --S(0)= 1.0 S(1)= 0.89 S(2)= 0.89(0.77)= 0.68 S(3)= 0.68(0.55)= 0.38 S(4)= 0.38(0.55)= 0.21 S(5)= 0.21(0.71)= 0.15 S(6)= 0.15(0.78)= 0.12 --this slide has a table --note= because you only know the interval, if 8 people were censored you subtract half of that (4 people) on the belief that censoring happened at the midway point of the interval --note= n of i on the table= effective # at risk
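A minimal sketch of one life-table interval, assuming the formulas above. The counts (200 entering the interval, 8 censored, 22 deaths) are hypothetical and chosen only to show the half-censoring adjustment; the function name is also an assumption.

```python
def life_table_interval(n_start, censored, deaths):
    """Return (effective number at risk, q, p) for one life-table interval."""
    # censored people are assumed to leave midway, so each counts as half at risk
    n_effective = n_start - censored / 2
    q = deaths / n_effective   # conditional probability of dying in the interval
    p = 1 - q                  # conditional probability of surviving the interval
    return n_effective, q, p


# hypothetical counts, not taken from the slide's table
n_eff, q, p = life_table_interval(n_start=200, censored=8, deaths=22)
print(n_eff, round(q, 3), round(p, 3))
```

S(t) at the end of each interval is then the running product of the p values, as in the S(1) to S(6) chain above.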
assumption 1: linearity and additivity (multivariate analysis I: continuous outcomes slides)
--OLS regression assumes that the mean value of the outcome is a linear function of the predictor --the predictor variables enter the estimation equation in a linear fashion -nonlinear model: Y= alpha + b1*ln[b2X1] + e -linear model: Y= alpha + b1X1 + e --otherwise the regression coefficients and their SEs are biased *--assumptions 1, 3, and 4 can be checked by examining the model residual errors - regression diagnostic plots* *-(i) linearity and additivity* of the relationship between dependent and independent variables: (a) the expected value of the dependent variable is a straight-line function of each independent variable holding the others fixed (b) the slope of that line does not depend on the values of the other variables (c) the effects of different independent variables on the expected value of the dependent variable are additive
relative risk, i.e., risk ratio (causation slides)
--Rexposed = a / (a+b) --Runexposed= c / (c+d) --RR= [a / (a+b)] / [c / (c+d)]
variance of the survival function Greenwood, 1926 (survival analysis slides)
--Var[S(ti)] = [S(ti)]^2 x Σ(j= 1 to i) dj / (N*j (N*j - dj)) --for S(4) = 0.37(0.55) = 0.21 --Var= (0.21)^2 x [(22 / (196(196 - 22))) + (38 / (164(164 - 38))) + (48 / (107.5(107.5 - 48))) + (19 / (42(42 - 19)))] = 0.0013 -then SE of S(4) = sqrt(0.0013)= 0.036 -95% CI for S(4): 0.21 + or - 2SE = 0.21 + or - 2(0.036) = 0.21 + or - 0.072 = (0.138, 0.282)
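The Greenwood calculation for S(4) can be checked with a short sketch. The effective numbers at risk (196, 164, 107.5, 42) and deaths (22, 38, 48, 19) are the ones appearing in the worked sum above; the variable names are assumptions. The code carries unrounded values, so results differ slightly from the hand calculation, which starts from the rounded S(4) = 0.21.

```python
import math

effective_at_risk = [196, 164, 107.5, 42]   # N*j for intervals 1 to 4
deaths = [22, 38, 48, 19]                   # dj for intervals 1 to 4

survival = 1.0
greenwood_sum = 0.0
for n_star, d in zip(effective_at_risk, deaths):
    survival *= 1 - d / n_star                    # running product of pj
    greenwood_sum += d / (n_star * (n_star - d))  # Greenwood term for interval j

variance = survival ** 2 * greenwood_sum
standard_error = math.sqrt(variance)
ci_low = survival - 2 * standard_error
ci_high = survival + 2 * standard_error

print(f"S(4) = {survival:.3f}, Var = {variance:.4f}, SE = {standard_error:.3f}")
print(f"approx. 95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```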
homogeneity method: what kinds of interaction in this cohort study (interaction and effect modification slides)
--family history= no -> smoking= no -> heart disease incidence/100= 5.0 -> attributable risk= 0 -> relative risk= 1.0 family history= no -> smoking= yes -> heart disease incidence/100= 10.0 -> attributable risk= 5.0 -> relative risk= 2.0 --family history= yes -> smoking= no -> heart disease incidence/100= 20.0 -> attributable risk= 0 -> relative risk= 1.0 family history= yes -> smoking= yes -> heart disease incidence/100= 40.0 -> relative risk= 2.0 --additive= yes [note= attributable risk looks at additive] --multiplicative= no [note= relative risk looks at multiplicative] --positive= yes additive --negative= no additive --quantitative= yes --qualitative= no --typical malarone use= no -> bed net use= no -> malaria incidence/1,000= 50.0 -> attributable risk= 0 -> relative risk= 1.0 typical malarone use= no -> bed net use= yes -> malaria incidence/1,000= 30.0 -> attributable risk= -20.0 -> relative risk= 0.6 --typical malarone use= yes -> bed net use= no -> malaria incidence/1,000= 10.0 -> attributable risk= 0 -> relative risk= 1.0 typical malarone use= yes -> bed net use= yes -> malaria incidence/1,000= 1.0 -> attributable risk= -9.0 -> relative risk= 0.1 --additive= yes [note= attributable risk looks at additive] --multiplicative= yes [note= relative risk looks at multiplicative] --positive= yes multiplicative --negative= yes additive --quantitative= yes additive and multiplicative --qualitative= no --family history= no -> smoking= no -> disease incidence/100= 10.0 -> attributable risk= 0 -> relative risk= 1.0 family history= no -> smoking= yes -> disease incidence/100= 40.0 -> attributable risk= 30.0 -> relative risk= 4.0 --family history= yes -> smoking= no -> disease incidence/100= 40.0 -> attributable risk= 0 -> relative risk= 1.0 family history= yes -> smoking= yes -> disease incidence/100= 100.0 -> attributable risk= 60.0 -> relative risk= 2.5 --additive= yes [note= attributable risk looks at additive] --multiplicative= yes [note= relative risk looks at 
multiplicative] --positive= yes additive only --negative= yes multiplicative only --quantitative= yes additive and multiplicative --qualitative= no
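The homogeneity method in these three tables amounts to comparing stratum-specific attributable risks (additive scale) and relative risks (multiplicative scale). A sketch under that reading, with the incidence values from the notes; the dictionary keys and function name are assumptions for illustration.

```python
def stratum_effects(unexposed, exposed):
    """Return (attributable risk, relative risk) within one stratum."""
    return exposed - unexposed, exposed / unexposed


# each entry: ((incidence unexposed, exposed) in stratum "no", same in stratum "yes")
examples = {
    "heart disease: family history x smoking": ((5.0, 10.0), (20.0, 40.0)),
    "malaria: malarone x bed net": ((50.0, 30.0), (10.0, 1.0)),
    "disease: family history x smoking": ((10.0, 40.0), (40.0, 100.0)),
}

results = {}
for name, (stratum_no, stratum_yes) in examples.items():
    ar_no, rr_no = stratum_effects(*stratum_no)
    ar_yes, rr_yes = stratum_effects(*stratum_yes)
    results[name] = {
        "AR": (ar_no, ar_yes),
        "RR": (rr_no, rr_yes),
        "additive": abs(ar_no - ar_yes) > 1e-9,        # ARs heterogeneous?
        "multiplicative": abs(rr_no - rr_yes) > 1e-9,  # RRs heterogeneous?
    }
    print(name, results[name])
```

For the first table this gives attributable risks of 5.0 and 20.0 with relative risks of 2.0 in both strata, so interaction shows up on the additive scale only.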
non-differential misclassification: continuous exposure (measurement error model) (misclassification slides)
--X = T + e --usual assumptions -e is independent of T (constant error across range of X) and of disease (non-differential) -e and X are normally distributed --result: -if RR(T) = exp(βT) then RR(X) = exp(β*X) and β* = Rβ where R = σ^2_T / σ^2_X = (σ^2_X - σ^2_e) / σ^2_X -R is the reliability index of X as a measure of T [note= σ^2 is the variance] --naïve significance tests are valid and most powerful --same results hold for linear regression and/or logistic regression of a rare disease --note= assumptions are that the error is independent of the true value and of disease (non-differential)
percent agreement (misclassification slides)
--[(a + d) / (a + b + c + d)] x 100 --note= total percent agreement *--the more people in A and D the higher the percent agreement*
Chamberlain's PPA (misclassification slides)
--[a / (a + b + c)] x 100 --Chamberlain's PPA = PPA / (2 - PPA) --number of occurrences for which both observers report a positive result out of the total number of observations for which at least one observer does so
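A sketch that puts the three agreement measures from these cards side by side (percent agreement, percent positive agreement, Chamberlain's PPA) and checks the PPA / (2 - PPA) relation, which holds when PPA is expressed as a proportion rather than a percent. The cell counts are hypothetical and the function names are assumptions.

```python
def percent_agreement(a, b, c, d):
    return (a + d) / (a + b + c + d) * 100


def percent_positive_agreement(a, b, c):
    return 2 * a / (2 * a + b + c) * 100


def chamberlain_ppa(a, b, c):
    return a / (a + b + c) * 100


# hypothetical counts, not from the slides
a, b, c, d = 40, 10, 5, 45
pa = percent_agreement(a, b, c, d)
ppa = percent_positive_agreement(a, b, c)
cham = chamberlain_ppa(a, b, c)

# same quantity via PPA / (2 - PPA), with PPA as a proportion
ppa_prop = ppa / 100
cham_from_ppa = ppa_prop / (2 - ppa_prop) * 100

print(round(pa, 1), round(ppa, 1), round(cham, 1), round(cham_from_ppa, 1))
```

Only percent agreement uses cell d, which is why the positive-agreement measures are useful when true negatives dominate the table.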
causal pie: multiple causes (causation slides)
--a *component* cause is any one of a set of conditions which are necessary for the completion of a sufficient cause --a *necessary cause* is a component cause that is a *member of every sufficient cause* *--a sufficient cause is a set of minimum conditions* or events that inevitably produce disease *(the entire causal pie)* --note= sufficient cause- minimum set of conditions in a pie --this slide shows images of causal pies and labels the component causes, the necessary cause and the sufficient cause (the whole pies)
confounding is not an "all or none" phenomenon (confounding and mediation slides)
--a confounding variable may explain whole or just part of the observed association between a given exposure and a given outcome --note= can explain whole or part of the relationship
overall model fit (multivariate analysis II: categorical outcomes slides)
--a logistic function is estimated by *maximum likelihood* (ML) techniques (instead of OLS) which finds values of parameters that maximize the likelihood function given the sample data *--in MLR we used R-squared to understand model fit however in logistic we use other parameters* -the *log likelihood chi-square* is an omnibus test to see if the model as a whole is statistically significant *--option we will return to when evaluating model fit:* *-Hosmer and Lemeshow's goodness-of-fit test=* predicted and observed frequencies closely match indicating a better model fit -smaller p-values indicate poorer fit --note= pseudo R-squared; R-squared will never be 1; different pseudo R-squares will give very different model fits *--AIC=* Akaike's Information Criterion -the smaller the AIC the better (only in comparison to other models) *--BIC=* Bayesian Information Criterion -the smaller the BIC the better (only in comparison to other models) *--pseudo- or McFadden (pseudo) R-squared* adjusted (not recommended) --note= she usually uses Hosmer or AIC
simple linear regression (multivariate analysis I: continuous outcomes slides)
--a statistical method --examine the associations between two quantitative variables *(Y must be continuous)* *-X denotes the predictor, explanatory, or independent variable* *-Y denotes the outcome, response, or dependent variable* --note= she uses exposure and outcome for x and y names (x= exposure and y= outcome)
component and joint effects (interaction and effect modification slides)
--a way to view the same phenomena separates the combined effects of two exposures into *3 parts* -the *two component effects* of each exposure - i.e., its effect in the absence of (in reference category of) the other exposure -the *joint effect* of the two exposures --suppose we have two dichotomous exposures E and D and suppose we know what the average risk (Rij) in the target population would be under all four combinations of these two exposures - i.e., the i-th category of E (i= 0, 1) and the j-th category of D (j= 0, 1) -2x2 table with D (j= 1) and No D (j= 0) on top and E (i= 1) and No E (i= 0) on the side A= R11 (E D [i= 1 j= 1]), B= R10 (E No D [i= 1 j= 0]), C= R01 (No E D [i= 0 j= 1]), and D= R00 (No E No D [i= 0 j= 0]) --the *component effect of E* is the comparison of R10 with R00 - i.e., RR10 = R10 / R00 or RD10 = R10 - R00 --the *component effect of D* is the comparison of R01 with R00 - i.e., RR01 = R01 / R00 or RD01 = R01 - R00 --the *joint effect of E and D* is the comparison of R11 with R00 - i.e., RR11 = R11 / R00 or RD11 = R11 - R00 -note= component effect yes/ no and no/ yes -note= joint effect yes/ yes and no/ no --all three effects involve the same population with persons unexposed to both E and D (no E and no D) as the joint reference group --to describe the amount of interaction it is common to compare the joint effect with our expectation of the joint effect under *two null conditions:* *-additivity* of effects and *multiplicativity* of effects *--the expected value of the joint effect under each null condition is derived from the component effects; that is, we express additivity and multiplicativity in terms of component and joint effects* *--comment:* it is common to see analyses that examine only multiplicativity; nonetheless additivity turns out to have a closer connection to basic causal models of interaction --note= additive= closer to true causal
risk difference (causation slides)
--additional risk (R) among those exposed when compared to those unexposed *--RD= Rexposed - Runexposed* --ranges from *-1 to +1 has no units* *--null hypothesis: RD= 0* --same formula for *incidence rate difference= rate among exposed - rate among unexposed*
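A small sketch of the risk difference next to the risk ratio for the same cohort 2x2 layout (a, b = exposed with and without disease; c, d = unexposed with and without disease). The counts are hypothetical and the function name is an assumption.

```python
def risk_measures(a, b, c, d):
    """Return (risk in exposed, risk in unexposed, risk difference, risk ratio)."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rd = risk_exposed - risk_unexposed   # null value is 0
    rr = risk_exposed / risk_unexposed   # null value is 1
    return risk_exposed, risk_unexposed, rd, rr


# hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed develop disease
risk_exposed, risk_unexposed, rd, rr = risk_measures(30, 70, 10, 90)
print(round(rd, 2), round(rr, 2))
```

Because both risks lie between 0 and 1, the risk difference is bounded by -1 and +1, as stated above.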
measuring interaction (interaction and effect modification slides)
--additive interaction -attributable risk model AR for those exposed to X varies as a function of Z -"public health interaction" --multiplicative interaction -relative risk model RR for association between X and Y varies as a function of Z
misclassification in confounders (misclassification slides)
--adjustment for a misclassified confounder will only remove part of the confounding effect --when the main exposure is subject to misclassification but the confounder is not then adjustment may over adjust for the confounding effect --note= can still have residual confounding -note= your adjustment may over adjust if things are misclassified
randomization (RCTs) (confounding and mediation slides)
--advantages -permits straightforward data analysis --disadvantages -need control over the exposure and the ability to assign subjects to study groups -need large sample sizes -not always ethical
epidemiologic triad/ triangle (causation slides)
--agent (cause of disease) at the bottom left --host (organism that harbors the disease) at the top --environment (allows for disease transmission) at the bottom right --vector (carries and transmits pathogen) in the middle of triangle *--agent - host - environment* *-vector in the middle* --note= this slide was an infectious disease example --note= fundamental for disease causality --note= diseases are a result of an interaction; very rarely are genetics purely the cause of a disease --note= epidemiology started with infectious diseases --note= vector examples- mosquitos, ticks, birds, pigs, etc
multiple causality (causation slides)
--also referred to as *multifactorial etiology* --"...requirement that more than one factor be present for disease to develop..." --models -epidemiologic triangle -web of causation -pie model
implications of approach selection for health impact (causation slides)
--although the relative risk of stroke associated with stage 2 hypertension is very high (4.0) its attributable risk is less than that associated with prehypertension notwithstanding the latter's much lower relative risk (1.5) -this is because the prevalence of prehypertension (50%) is much higher than that of stage 2 hypertension (approximately 5%) --note= depends on funding, resources, severity of risk, etc --estimates suggest that a 33% decrease in average salt intake in the population at large would result in a 22% reduction in the incidence of stroke -in comparison even if all hypertensive patients were identified and successfully treated this would reduce stroke incidence by only 15% --similar decreases in all other modifiable hypertension component causes would obviously be expected to have even a greater impact on stroke (CHD) incidence than salt reduction alone *-thus when distal or even intermediate component causes are known primary prevention based on these causes is generally more effective than intervention on proximal causes* --note= proximal causes intervention- intervene on someone that already has the thing (I forget whether she said exposure, outcome, or both)
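One way to see why prevalence matters here is Levin's formula for the population attributable fraction, PAF = p(RR - 1) / [1 + p(RR - 1)]. This is a sketch under loud assumptions: the slide says "attributable risk" without giving a formula, so reading it as a population attributable fraction is an assumption, and the formula is applied to each blood pressure category separately as if everyone else were unexposed, which is a simplification. Only the prevalences (50%, about 5%) and relative risks (1.5, 4.0) come from the notes.

```python
def population_attributable_fraction(prevalence, relative_risk):
    """Levin's formula; assumes a single dichotomous exposure (simplification)."""
    excess = prevalence * (relative_risk - 1)
    return excess / (1 + excess)


paf_stage2 = population_attributable_fraction(prevalence=0.05, relative_risk=4.0)
paf_prehypertension = population_attributable_fraction(prevalence=0.50, relative_risk=1.5)

print(f"stage 2 hypertension: {paf_stage2:.1%}")
print(f"prehypertension: {paf_prehypertension:.1%}")
```

Under these assumptions the common, weaker exposure accounts for a larger share of cases than the rare, stronger one, which is the point the slide makes.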
collider bias (confounding and mediation slides)
--association between X and Y are marginally dependent (not necessarily associated) --Z is completely determined by X and Y --X and Y exhibit perfect dependence given Z --DAG X -> Z <- Y X -> Y --note= Y causes Z and X causes Z so we don't want to adjust for a collider because it'll bias outcomes --in this DAG Z is a collider (it is causally influenced by 2 variables) --colliders block the association between the variables that influence it -DOES NOT CREATE AN UNCONDITIONAL ASSOCIATION BETWEEN X AND Y --conditioning on a collider in regression opens the path between X and Y and introduces bias (can introduce associations when there are no associations) --note= conditioning: aka adjust --collider bias: BMI -height is associated with weight although not perfectly weight (exposure) -> height (collider) <- BMI (outcome) weight (exposure) -> BMI (outcome)
association and cause (causation slides)
--association= X - Y [no arrow just a line to link them] -ex= yellow fingers - lung cancer --possible causal structure= X -> Y [cause] or X <- Z -> Y [confounder] -ex= yellow fingers -> lung cancer [cause] or yellow fingers <- smoke -> lung cancer [confounder]
standardized residuals in STATA (multivariate analysis I: continuous outcomes slides)
--attempts to adjust residuals for their standard errors -the theoretical residuals (εi) are assumed to be homoskedastic (i.e., they all have the same variance); this is not actually true for the calculated residuals (ei): Var(ei) = σ^2(1 - hi) where hi are leverage measures *--STATA calculates what Chatterjee & Hadi call internally studentized residuals for the standardized residuals -internally studentized residuals are defined for each observation i= 1...n (note= i= 1 to n) as an ordinary residual divided by an estimate of its standard deviation* -uses the root mean squared error of the regression for σ *--values of 3 or greater (or -3 or less) may be problematic and are considered outliers* --predict stdresid if e(sample), rstandard
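--note= a minimal Python sketch (not from the slides) of the calculation described above, restricted to simple linear regression with one predictor so the leverage has a closed form, hi = 1/n + (xi - mean(x))^2 / Sxx. The data are made up, with one deliberately odd point, and the function name is hypothetical; Stata's rstandard handles the general multivariable case.

```python
import math

def internally_studentized_residuals(x, y):
    """Residual divided by rmse * sqrt(1 - leverage), for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Root mean squared error with n - 2 degrees of freedom (two coefficients).
    rmse = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / (rmse * math.sqrt(1.0 - h)) for e, h in zip(residuals, leverages)]

# Hypothetical data: roughly y = 2x, with one unusual value at x = 9.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 30.0, 20.1]

std_resid = internally_studentized_residuals(x, y)
most_extreme = max(range(len(std_resid)), key=lambda i: abs(std_resid[i]))
print(f"most extreme observation: x = {x[most_extreme]}")
```

One thing worth knowing: internally studentized residuals cannot exceed sqrt(n - p) in absolute value, so the cut-off of 3 can never be reached in a sample this small.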
biases in epidemiologic studies (misclassification slides)
--bias occurs when the estimated association deviates from the true measure of association --types -information bias -selection bias -(confounding) --how do these biases impact our study findings (in other words why do we care about them) -they are *threats to internal validity* --note= misclassification's a form of information bias
survival analysis approach (survival analysis slides)
--bivariable -Kaplan-Meier with log-rank test --multivariable -Cox proportional hazards regression
log transformation (multivariate analysis I: continuous outcomes slides)
--can do it if normality's violated
non-differential misclassification: categorical exposure (misclassification slides)
--can lead to -bias away from the null -bias towards the null -distortion of the dose response relationship --these effects depend on -pattern of misclassification -distribution of exposure among controls -shape of the dose response relationship --true exposure & threshold dose response -graphic visualization on the slides with exposure as the x-axis, percentage of cases and non-cases on the y-axis, and controls and cases as different color bars for each category (this is a bar graph) -category 1= OR: 1.0 [33% controls and 14% cases], category 2= OR: 1.0 [33% controls and 14% cases], category 3= OR: 5.0 [33% controls and 71% cases] --observed association (dose response) -graphic visualization on the slides with exposure as the x-axis, percentage of cases and non-cases on the y-axis, and controls, observed cases, and cases as different color bars for each category (this is a bar graph) -category 1= OR: 1.0 [33% controls, 14% cases, and 14% observed cases], category 2= OR: 1.8 [33% controls, 14% cases, and 26% observed cases], category 3= OR: 4.2 [33% controls, 71% cases, and 60% observed cases] -categories 2 and 3 are misclassified -note= there is no misclassification in category 1 in this example --true exposure & U-shaped dose response -graphic visualization on the slides with exposure as the x-axis, percentage of cases and non-cases on the y-axis, and controls and cases as different color bars for each category (this is a bar graph) -category 1= OR: 1.0 [33% controls and 44% cases], category 2= OR: 0.25 [33% controls and 11% cases], and category 3= OR: 1.0 [33% controls and 44% cases] --observed association (dose response) -graphic visualization on the slides with exposure as the x-axis, percentage of cases and non-cases on the y-axis, and controls, cases, and observed cases as different color bars for each category (this is a bar graph) -category 1= OR: 1.0 [33% controls, 44% cases, and 44% observed cases], category 2= OR: 0.55 [33% controls, 11% cases, and 24% 
observed cases], and category 3= OR: 0.70 [33% controls, 44% cases, and 31% observed cases] -categories 2 and 3 are misclassified -category 1 is not misclassified
therefore causal inference... (causation slides)
--causal inference is not a simple (or quick) process --no single study is sufficient in establishing causal inference --requires critical judgement and interpretation --can one "prove" causal associations
censoring (survival analysis slides)
--censoring occurs when complete information is not available about the survival time of some participant -a type of missing data approach to survival analysis -censoring occurs when a participant is lost to follow up or the study ends -a participant lost to follow up is censored at the time of loss to follow up -participants with no event at the end of the study are censored at the study end point --note= accounts for a person that doesn't complete the study (don't totally get rid of them)
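--note= a minimal Python sketch (not from the slides) of how censored participants still contribute to a Kaplan-Meier estimate: they count in the number at risk until the time they are censored and then drop out without counting as an event. The follow-up times and event flags are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group everyone who leaves the risk set at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving  # events and censored both leave the risk set
    return curve

# Hypothetical follow-up times in months; 0 = censored, 1 = event observed.
times = [2, 3, 3, 5, 6, 8, 8, 10]
events = [1, 1, 0, 1, 0, 1, 0, 0]

km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"month {t}: S(t) = {s:.3f}")
```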
causal pie: component cause (causation slides)
--component causes are a constellation of risk factors that act jointly to form a *sufficient cause* -multiple factors that act jointly to cause a given effect *-no component cause is sufficient to produce the effect* *--a component cause is necessary in its own causal pie but may or may not be necessary in every causal pie* --note= no component cause is sufficient alone to cause the disease -part of the pie -don't need it in every pie
confounders and colliders (confounding and mediation slides)
--confounder -should be controlled for when estimating causal associations X <-> Z -> Y X ---> Y --collider -should NOT be controlled for when estimating causal associations
confounding overview (interaction and effect modification slides)
--confounding= "statistical bias" -non-causal association between an exposure and an outcome is due to the influence of a 3rd variable -we want to account for or control for the confounding variable *the confounding variable must* 1. be associated with exposure 2. be associated with disease in unexposed 3. not be in the causal pathway
example: effect modifier or confounder or both (interaction and effect modification slides)
--consider the effect of smoking on cervical cancer by race (a risk factor for the disease) --the table below shows the number (N, x1,000) of women at risk and the average risk (R, cases per 100,000 ascertained over a one year period) of cervical cancer by smoking status (the exposure) and race (the covariate) in 4 different source cohorts --each cohort is followed for one year and contains 1 million people 30% of whom are smokers and 20% of whom are black --assume that the R values in this table are true risks (not estimates), that there are no other confounders in each source population, and that the smokers form the target --note= moderator= same as EMM (has different names for the same thing) --table (N= x1,000; R= /10^5/ yr) 1. smoker= yes -> black= N 60 and R 40 -> white= N 240 and R 10 -> total= N 300 and R 16 smoker= no -> black= N 140 and R 40 -> white= N 560 and R 10 -> total= N 700 and R 16 2. smoker= yes -> black= N 120 and R 40 -> white= N 180 and R 10 -> total= N 300 and R 22 smoker= no -> black= N 80 and R 40 -> white= N 620 and R 10 -> total= N 700 and R 13.4 3. smoker= yes -> black= N 60 and R 60 -> white= N 240 and R 10 -> total= N 300 and R 20 smoker= no -> black= N 140 and R 30 -> white= N 560 and R 10 -> total= N 700 and R 14 4. smoker= yes -> black= N 120 and R 60 -> white= N 180 and R 10 -> total= N 300 and R 30 smoker= no -> black= N 80 and R 30 -> white= N 620 and R 10 -> total= N 700 and R 12.3
--another table building off of the last one 1. RR (smoking) black= 1.00 and white= 1.00 -> cRR= 1.00 and sRR= 1.00 -> race is= not a confounder and not a modifier 2. RR (smoking) black= 1.00 and white= 1.00 -> cRR= 1.64 and sRR= 1.00 -> race is= a confounder and not a modifier 3. RR (smoking) black= 2.00 and white= 1.00 -> cRR= 1.43 and sRR= 1.43 -> race is= not a confounder and is a modifier 4. RR (smoking) black= 2.00 and white= 1.00 -> cRR= 2.44 and sRR= 1.67 -> race is= a confounder and is a modifier --the table summarizes the race-specific and "overall" effects of smoking in each population and indicates whether race is a confounder or modifier of the smoking effect in each cohort --note that (the slides say this not me) the overall or summary effects are expressed in two ways: -as crude risk ratios (cRR) ignoring race and -as risk ratios internally standardized for race (sRR) *--internal standardization* creates a group of nonsmokers with the same race distribution as the smokers - i.e., *the unexposed group is "made comparable"* to the exposed with respect to race (the potential confounder) --note= cRR= crude risk ratio --note= sRR= standardized risk ratio (internally standardized for race; the stratum-specific RRs are the separate black and white RRs) --to determine *whether race is a modifier of the risk ratio* for smoking effect in each population -we compare the risk ratio for blacks with the risk ratio for whites; if these 2 risk ratios are equal race does not modify the risk ratio in that population; if these 2 risk ratios are different race does modify the risk ratio --unlike confounding, effect measure modification by race *does not* require an association between race and smoking in the source population
*--conclusion:* depending on the pattern of effects and associations in the source population a cause of disease may be either a confounder for the exposure effect, a modifier of the exposure effect, or neither, or both --there is an argument that if there appears to be appreciable effect measure modification between 2 exposures then any confounding of either effect by the other exposure is irrelevant (note= used to be this way but not any more) --for example if race appears to modify the exposure effect we should report race-specific not summary (adjusted) estimates of effect --the primary reason for this view is that any summary estimate of effect is misleading when there is substantial heterogeneity
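--note= a short Python sketch (not from the slides) that recomputes the cRR and sRR columns from the N and R values in the first table. The sRR is computed the way the card describes internal standardization: the nonsmokers' race-specific risks are applied to the smokers' race distribution. The dictionary layout and function names are just illustrative choices.

```python
# Each cell is (N in thousands, R per 100,000 per year), copied from the table.
COHORTS = {
    1: {"smoker": {"black": (60, 40), "white": (240, 10)},
        "nonsmoker": {"black": (140, 40), "white": (560, 10)}},
    2: {"smoker": {"black": (120, 40), "white": (180, 10)},
        "nonsmoker": {"black": (80, 40), "white": (620, 10)}},
    3: {"smoker": {"black": (60, 60), "white": (240, 10)},
        "nonsmoker": {"black": (140, 30), "white": (560, 10)}},
    4: {"smoker": {"black": (120, 60), "white": (180, 10)},
        "nonsmoker": {"black": (80, 30), "white": (620, 10)}},
}

def overall_risk(group):
    """N-weighted average of the race-specific risks in one smoking group."""
    total_n = sum(n for n, _ in group.values())
    return sum(n * r for n, r in group.values()) / total_n

def crude_rr(cohort):
    return overall_risk(cohort["smoker"]) / overall_risk(cohort["nonsmoker"])

def standardized_rr(cohort):
    smokers, nonsmokers = cohort["smoker"], cohort["nonsmoker"]
    total_n = sum(n for n, _ in smokers.values())
    # Risk the nonsmokers would have if they had the smokers' race distribution.
    expected = sum(smokers[race][0] * nonsmokers[race][1] for race in smokers) / total_n
    return overall_risk(smokers) / expected

results = {k: (round(crude_rr(c), 2), round(standardized_rr(c), 2))
           for k, c in COHORTS.items()}
for k, (crr, srr) in results.items():
    print(f"cohort {k}: cRR = {crr:.2f}, sRR = {srr:.2f}")
```

Race is a confounder where cRR and sRR differ (cohorts 2 and 4) and a modifier where the race-specific RRs differ (cohorts 3 and 4).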
residual confounding (confounding and mediation slides)
--controlling for one of several confounding variables does not guarantee that confounding is completely removed --residual confounding may be present because the variable that is controlled for is an imperfect surrogate of the true confounder or because other confounding effects are ignored --note= ex: categorizing continuous variables
oral contraceptives, MI, and age example (misclassification slides)
--crude 2x2 table with cases and controls on top and yes and no to oral contraceptives on the side -A= 29 (cases yes), B= 135 (controls yes), C= 205 (cases no), and D= 1,607 (controls no) OR= 1.7 --ages 25-29 2x2 table with cases and controls on top and yes and no to oral contraceptives on the side -A= 4 (cases yes), B= 62 (controls yes), C= 2 (cases no), and D= 224 (controls no) OR= 7.2 --ages 30-34 2x2 table with cases and controls on top and yes and no to oral contraceptives on the side -A= 9 (cases yes), B= 33 (controls yes), C= 12 (cases no), and D= 390 (controls no) OR= 8.9 --ages 35-39 2x2 table with cases and controls on top and yes and no to oral contraceptives on the side -A= 4 (cases yes), B= 26 (controls yes), C= 33 (cases no), and D= 330 (controls no) OR= 1.5 --ages 40-44 2x2 table with cases and controls on top and yes and no to oral contraceptives on the side -A= 6 (cases yes), B= 9 (controls yes), C= 65 (cases no), and D= 362 (controls no) OR= 3.7 --ages 45-49 2x2 table with cases and controls on top and yes and no to oral contraceptives on the side -A= 6 (cases yes), B= 5 (controls yes), C= 93 (cases no), and D= 301 (controls no) OR= 3.9 --note= ORs for people of different ages --note= all but one of the stratum ORs are larger than the crude OR *--all but one of the stratified ORs are farther away from the null as compared to the crude OR (35-39 year olds are the exception)* *--this suggests negative confounding (underestimation of the true strength of the association)* *--this makes sense because MI risk increases with age and OC use decreases with age*
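--note= a short Python sketch (not from the slides) that recomputes the crude and age-specific ORs from the cell counts above using OR = AD / BC, then lists the strata whose OR is farther from the null than the crude OR. Because every OR here is above 1.0, "farther from the null" is checked simply as "larger than the crude OR"; that shortcut is an assumption that would not hold for protective associations.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio for a 2x2 table with cells A, B, C, D."""
    return (a * d) / (b * c)

# (A, B, C, D) as labeled in the notes above.
crude = (29, 135, 205, 1607)
strata = {
    "25-29": (4, 62, 2, 224),
    "30-34": (9, 33, 12, 390),
    "35-39": (4, 26, 33, 330),
    "40-44": (6, 9, 65, 362),
    "45-49": (6, 5, 93, 301),
}

crude_or = odds_ratio(*crude)
stratum_ors = {age: odds_ratio(*cells) for age, cells in strata.items()}
farther_from_null = [age for age, value in stratum_ors.items() if value > crude_or]

print(f"crude OR = {crude_or:.1f}")
for age, value in stratum_ors.items():
    print(f"ages {age}: OR = {value:.1f}")
print("farther from null than crude:", farther_from_null)
```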
malaria, gender, and occupation example (misclassification slides)
--crude analysis -2x2 table with cases and controls on top and males and females on the side A= 88 (case male), B= 68 (control male), C= 62 (case female), and D= 82 (control female) OR= 1.71 --stratified analysis by occupation (outdoor vs indoor employment) -2x2 table for each stratum (one for outdoor and one for indoor) with cases and controls on top and males and females on the side mostly outdoor= A- 53 (male case), B- 15 (male control), C- 10 (female case), and D- 3 (female control) OR= 1.06 mostly indoor= A- 35 (male case), B- 53 (male control), C- 52 (female case), and D- 79 (female control) OR= 1.00 --because the *stratum specific ORs* are similar to each other but notably different from the crude OR occupation is a confounder --and because the ORs are 1.0 (or close to it) we can conclude that there is no association between gender and malaria *-what might explain this* note= males may work more outside than females --note= occupation's a confounder due to similar OR for the strata but different from the crude
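--note= a minimal Python sketch (not from the slides) of one common way to pool the two occupation strata into a single adjusted OR, the Mantel-Haenszel estimator: sum(AD / n) / sum(BC / n) across strata. The card itself only compares stratum-specific ORs with the crude OR, so using Mantel-Haenszel here is my assumption, not necessarily the method on the slides.

```python
def mantel_haenszel_or(strata):
    """Pooled OR across strata, each given as cells (A, B, C, D)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# (A, B, C, D) = (male cases, male controls, female cases, female controls)
outdoor = (53, 15, 10, 3)
indoor = (35, 53, 52, 79)

# Collapsing the strata recovers the crude table from the notes.
crude = tuple(o + i for o, i in zip(outdoor, indoor))
crude_or = (crude[0] * crude[3]) / (crude[1] * crude[2])
adjusted_or = mantel_haenszel_or([outdoor, indoor])

print(f"crude OR    = {crude_or:.2f}")
print(f"adjusted OR = {adjusted_or:.2f}")
```

The adjusted value lands close to 1.0, in line with the stratum-specific ORs and well below the crude OR.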
direction of confounding (positive association) (confounding and mediation slides)
--crude association between malaria and gender (i.e., a from table 1 on slide 8): + --association between working outdoors and gender (i.e., b from table 1): + --association between working outdoors and malaria (i.e., c from table 1): + --direction of confounding bias: b*c= +*+ = + --triple product: a*b*c= +*+*+ = + -so crude will be greater than adjusted -i.e., further away from null value of 1.00 --note= positive vs negative --example (positive association) -it appears that outside job is strongly associated with both the exposure and outcome *-since both associations are positive (OR > 1.0) we expect positive confounding* *as a result the crude > adjusted and the adjusted OR should be less than the crude OR of 1.71*
validity in epidemiologic studies (confounding and mediation slides)
--degree to which inferences drawn from a study are warranted --two components *-internal validity=* ability to draw sound conclusions about relationship between exposure(s) + outcome observed in a study; is there confounding or other (selection or information) bias (note= are our measures of association valid) *-external validity=* extent to which study results can be generalized to populations beyond the study subjects --based on -unbiased study methods -study sample being representative of target population
validity in epidemiologic studies (misclassification slides)
--degree to which inferences drawn from a study are warranted --two components *-internal validity= ability to draw sound conclusions about relationship between exposure(s) + outcome observed in a study; is there confounding or other (selection, information) bias* *-external validity= extent to which study results can be generalized to populations beyond the study subjects* --based on -unbiased study methods -study sample being representative of target population
causal diagrams (causation slides)
--depict our assumptions about causal relations among the exposure, disease/ outcome, and covariates -X is the exposure -Y is the disease or outcome --directed acyclic graphs (DAGs) are used in epidemiology to describe causal relations -acyclic means there are no feedback loops --note= simplest association is E -> D or X -> Y --X= exposure, and Y= disease or outcome --X directly affects or causes Y -X -> Y *--note= arrow means cause (X precedes Y)*
systematic error: information bias (misclassification slides)
--differential misclassification -misclassification of X differs with respect to categories of Y -may bias effect estimate in any direction --non-differential misclassification -misclassification of X does not differ with respect to categories of Y -usually biases effect estimate toward null except under special circumstances --assessing misclassification -sensitivity= probability that someone who is truly exposed will be classified as exposed -specificity= probability that someone who is not truly exposed will be classified as not exposed -false positive= probability that someone who is truly not exposed will be classified as exposed = 1 - specificity -false negative= probability that someone who is truly exposed will be classified as not exposed = 1 - sensitivity note= may see this in disease with no treatment -note= your study question may make you more concerned with 1 thing over another (false positive vs false negative) --2x2 table with true + and true - on top and obs + and obs - on the side -A= true positive (true + and obs +) -B= false positive (true - and obs +) -C= false negative (true + and obs -) -D= true negative (true - and obs -) -up sensitivity = down false negative -up specificity = down false positive
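--note= a minimal Python sketch (not from the slides) tying the definitions together: sensitivity and specificity from the A/B/C/D cells, then the expected effect of applying the same sensitivity and specificity to cases and controls (non-differential) on an odds ratio. All counts, the 0.8/0.9 values, and the function names are hypothetical.

```python
def sensitivity(a, c):
    """A / (A + C): truly exposed who are classified as exposed."""
    return a / (a + c)

def specificity(b, d):
    """D / (B + D): truly unexposed who are classified as unexposed."""
    return d / (b + d)

def observed_counts(exposed, unexposed, sens, spec):
    """Expected classified counts after imperfect exposure measurement."""
    obs_exposed = exposed * sens + unexposed * (1 - spec)
    obs_unexposed = exposed * (1 - sens) + unexposed * spec
    return obs_exposed, obs_unexposed

# Hypothetical validation table: A=80, B=10, C=20, D=90.
sens = sensitivity(80, 20)   # 0.8
spec = specificity(10, 90)   # 0.9

# Hypothetical true exposure counts (exposed, unexposed).
true_cases = (50, 50)
true_controls = (20, 80)
true_or = (true_cases[0] * true_controls[1]) / (true_cases[1] * true_controls[0])

# Same sens/spec in cases and controls = non-differential misclassification.
obs_cases = observed_counts(*true_cases, sens, spec)
obs_controls = observed_counts(*true_controls, sens, spec)
observed_or = (obs_cases[0] * obs_controls[1]) / (obs_cases[1] * obs_controls[0])

print(f"true OR = {true_or:.2f}, observed OR = {observed_or:.2f}")
```

With a two-level exposure like this one the observed OR is pulled toward 1.0, matching the "usually biases toward null" line.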
recall: the "2x2" table (causation slides)
--disease status on the top (disease and no disease) --exposure status on the side (exposed and not exposed) --total diseased= a+c --total not diseased= b+d --total exposed= a+b --total not exposed= c+d --total= a+b+c+d
reminder of why randomization is gold standard (confounding and mediation slides)
--does coffee drinking cause lung cancer --drink coffee --(?)-> lung cancer *--attempts to ensure equal distributions of the confounding variable in each exposure category* *--if randomization is successful no confounding at baseline*
example of confounding (confounding and mediation slides)
--does diabetes cause dementia --cohort study on risk of dementia among adults with and without diabetes --diabetes (E) <- older age (C) -> more dementia (D) --diabetes (E) --(?)-> more dementia (D) --RR= 3.5 ... is this association real -note= found this out before controlling for confounders confounding by age: --subjects with diabetes were on average older than subjects without diabetes; age is a risk factor for diabetes --increased incidence of dementia among those with diabetes was in part due to their older age --when confounding by age was controlled subjects with diabetes were only twice as likely to develop dementia as those without diabetes -initial results were *exaggerated* by confounding by age --confounding occurred because age was associated with both diabetes and dementia --note= *full confounder:* confounder explains entire relationship (ex= smoking and yellow fingers example) --note= *partial confounder:* confounder explains only some of the relationship (ex= age and dementia example) *--look at slides for the tables for this example!!!!!* 1. calculated risk ratio for dementia and diabetes (RR= 3.5) [2x2 table= dementia top part of table and diabetes side part of table] 2. calculated risk ratio for dementia and age (RR= 8.8) [2x2 table= dementia top part of table and age side part of table] 3. calculated risk ratio for diabetes and age (RR= 2.3) [2x2 table= diabetes top part of table and age side part of table] *association between diabetes (E) and dementia (D) stratified by age (C) on page 17 of the notes* --risk ratios for dementia and diabetes within both age strata (80-99 and 45-79) turned out to be 2.0 *--note= this is a partial confounder (age doesn't describe entire relationship just some of it)*
effect measure modification (interaction and effect modification slides)
--effect measure modification is often a very important concept for several reasons -it is often a key concern when analyzing data and when interpreting statistical results - e.g., to determine whether we should be estimating a common effect measure or how to select predictors for a statistical model -it is sometimes used to operationalize [put into operation or to use] specific hypotheses regarding the effects of two exposures - e.g., we might hypothesize that the effect of one exposure *should* vary in a certain way over categories of another exposure -it is relevant for generalizing results across studies and populations --the assessment of effect modification (and statistical interaction) depends on the specific measure used to quantify the effect --that is the assessment of effect measure modification is *measure dependent* or *model dependent* (thus the term "effect-*measure* modification" which is often simply termed "effect modification") --for example if the *risk ratio* for the E effect is non-null and homogeneous (uniform) across categories (strata) of C the *risk difference* for the E effect will be heterogeneous across C; and if the risk difference is non-null and uniform the risk ratio will be heterogeneous; furthermore the risk difference and the risk ratio could exhibit opposite (or the same) direction of change across C --note= you're not measuring the effect you're measuring the modifier
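--note= a tiny Python sketch (not from the slides) of the measure-dependence point: with made-up risks chosen so the risk ratio is the same in both strata of C, the risk difference comes out different. This relies on the baseline (unexposed) risk differing across C, which is assumed in the example numbers.

```python
# Hypothetical risks by stratum of C: (risk in exposed, risk in unexposed).
strata = {
    "C=0": (0.10, 0.05),
    "C=1": (0.40, 0.20),
}

risk_ratio = {c: exposed / unexposed for c, (exposed, unexposed) in strata.items()}
risk_difference = {c: exposed - unexposed for c, (exposed, unexposed) in strata.items()}

for c in strata:
    print(f"{c}: RR = {risk_ratio[c]:.2f}, RD = {risk_difference[c]:.2f}")
```

On the ratio scale C is not a modifier here, while on the difference scale it is.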
decision analysis (causation slides)
--effectiveness of intervention A compared with intervention B; intervention B is more efficacious (i.e., those who tolerate the drug have lower mortality than under intervention A) but because tolerance to intervention A is higher its overall effectiveness is higher; effectiveness of A (compared with B)= [(37.85% - 28.30%) / 37.85%] x 100 = 25.2% --look at the image on the slide
sufficient causes and prevention (causation slides)
--elimination of even a single component cause in a given sufficient cause constellation is useful for preventive purposes as it will by definition remove the "set of minimal events and conditions" which form that sufficient cause --to effect preventive measures it is not necessary to understand causal mechanisms in their entirety --prevention or cessation of a single risk factor results in a disease risk decrease as it eliminates all sufficient causes of which the risk factor is a component cause --common exposures are often related to rare outcomes in the same populations (e.g., Helicobacter pylori and gastric cancer)
measurement of effect or association (causation slides)
--epidemiologic studies strive to determine difference in measures of disease occurrence between populations --populations typically considered as "exposed" vs "unexposed" --measures of association/ effect measures -determine association between "exposure" and disease "outcome" -quantity that measures the extent to which a factor *affects* frequency/ risk of health outcome --note= are we measuring an effect or an association; differences in populations depend on exposure; calculate risk ratios, odds ratios, etc; end goal= to decrease the outcome through an intervention whether it's screening or prevention
2. counterargument: effect measure modification does NOT supersede control of confounding (interaction and effect modification slides)
--even when covariate is a strong modifier we still can estimate the impact of the exposure on total disease occurrence (e.g., attributable or preventable fraction) by standardizing for the covariate as a confounder - i.e., the presence of effect measure modification does not change the interpretation of the attributable/ preventable fraction --generally we are always averaging effects for unknown or unmeasured effect measure modifiers --regardless ignoring interactions is particularly problematic in certain situations such as when the exposure has a positive effect in one group and an inverse effect in another group, i.e., when there is a *crossover effect* --note= positive effect and inverse effect
decision tree (causation slides)
--example of decision tree with two chance nodes; proportions and probabilities shown in parentheses; SC= social class --slide shows an image of this -starts with a decision node then branches out to intervention A and intervention B then within the interventions there are more branches until the end where it shows the final branches' mortality rates --note= how an intervention can influence mortality
additive interaction case control study: observed vs expected approach (interaction and effect modification slides)
--expected ORx1z1 = ORx1z0 + ORx0z1 - 1.0 --expected ORx1z1 = observed ORx1z1 means no additive interaction --observed ORx1z1 > expected ORx1z1 means positive interaction (synergism) --observed ORx1z1 < expected ORx1z1 means negative interaction (antagonism)
multiplicative interaction, case control study: observed vs expected approach (interaction and effect modification slides)
--expected ORx1z1 = ORx1z0 x ORx0z1 --expected ORx1z1 = observed ORx1z1 means no multiplicative interaction --observed ORx1z1 > expected ORx1z1 means positive interaction (synergism) --observed ORx1z1 < expected ORx1z1 means negative interaction (antagonism)
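--note= a minimal Python sketch (not from the slides) applying both expected-joint-effect formulas, additive (ORx1z0 + ORx0z1 - 1.0) and multiplicative (ORx1z0 x ORx0z1), to one set of made-up ORs. The ORs and function names are hypothetical, and the comparison ignores sampling error, which a real analysis would have to account for.

```python
def expected_joint_or(or_x1z0, or_x0z1):
    """Expected ORx1z1 under no interaction, on each scale."""
    return {
        "additive": or_x1z0 + or_x0z1 - 1.0,
        "multiplicative": or_x1z0 * or_x0z1,
    }

def interaction_direction(observed, expected, tolerance=1e-9):
    if abs(observed - expected) <= tolerance:
        return "none"
    return "positive (synergism)" if observed > expected else "negative (antagonism)"

# Hypothetical stratum ORs relative to the doubly unexposed reference (OR = 1.0).
or_x1z0, or_x0z1, observed_x1z1 = 2.0, 3.0, 5.0

expected = expected_joint_or(or_x1z0, or_x0z1)
verdict = {scale: interaction_direction(observed_x1z1, value)
           for scale, value in expected.items()}

for scale in expected:
    print(f"{scale}: expected = {expected[scale]:.1f}, interaction = {verdict[scale]}")
```

The same observed joint OR can look synergistic on the additive scale and antagonistic on the multiplicative scale, so the scale always has to be stated.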
indirect adjustment (misclassification slides)
--expected number of events is calculated by applying reference ("standard") rates to the number of individuals in each stratum --in each study group the ratio of observed events to expected events provides an estimate of the confounder-adjusted relative risk or rate ratio comparing the study group with the population that served as the source of the reference rates --when used for *mortality*- this is the *standardized mortality ratio (SMR)* --when used for *morbidity* *-standardized incidence ratio (SIR)* *-standardized prevalence ratio (SPR)*
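--note= a minimal Python sketch (not from the slides) of the observed / expected calculation for an SMR. The reference rates, person-years, and observed deaths are all made up; the age bands are just example strata.

```python
# Hypothetical reference ("standard") mortality rates per person-year.
reference_rates = {"40-49": 0.002, "50-59": 0.005, "60-69": 0.012}

# Hypothetical person-years of follow-up in the study group, by stratum.
study_person_years = {"40-49": 5000, "50-59": 4000, "60-69": 1000}

observed_deaths = 63  # hypothetical total observed in the study group

# Expected deaths = reference rate applied to the study group's person-years.
expected_deaths = sum(reference_rates[age] * study_person_years[age]
                      for age in reference_rates)
smr = observed_deaths / expected_deaths

print(f"expected deaths = {expected_deaths:.1f}, SMR = {smr:.2f}")
```

An SMR above 1.0 means more deaths were observed than the reference rates would predict for a group with this age structure.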
*****temporality (causation slides)*****
--exposure *must* precede disease --in disease with long latency periods exposures must precede latency period --in chronic diseases often need long term exposure for disease induction --note= need this --yes= possible cause -> possible effect --no= possible effect -> possible cause --no= possible effect and possible cause happen together -- ---time--->
theory of causation (causation slides)
--exposure to certain risk factors causes disease --temporality (the exposure came first and know the time point of exposure) and biological plausibility (is it biologically possible) --very rarely is one factor a *sufficient cause* of disease -genetic diseases --exposure to risk factor -> disease --defect in BOTH hex-A genes -> tay-sachs --most often diseases are the result of a complex system of many interconnected factors (multiple causes) --this slide shows the causal pie
consideration of alternate hypothesis (causation slides)
--extent to which investigator has ruled out other possible explanations --methodologically sound studies with no potential residual confounding *--caveat=* alternate explanations limited by understanding of biology and sophistication of analysis
effect modifiers or moderators (DAG) (interaction and effect modification slides)
--factor X is a known risk factor for disease Y --factor EM is a direct effect modifier of the causal effect of X on Y --X -> Y and EM -> Y --note= she was taught to do X -> Y with an arrow pointing to that arrow saying EM -note= because EM doesn't have to be associated with X or Y unlike confounders -note= EMM yes x -> y or EMM no x -> y -note= confounder= x <-> c -> y x -> y --drug X treats hypertension (Y) --genotype EM modifies the effect of X on Y --note= genetic and pharmacological epidemiology have a lot of EMM --drug X -> hypertension and genotype EM -> hypertension --note= how the measure of association between exposure and outcome differs across the categories explored
mediator or intermediate variables (confounding and mediation slides)
--factor X is a known risk factor for disease Y --factor M is affected by X --factor M affects disease Y X -> M -> Y --smoking is a known risk factor for heart disease --hypertension is affected by smoking --hypertension affects heart disease --smoking also directly affects heart disease --smoking -> hypertension -> heart disease smoking ---> heart disease
confounders (causation slides)
--factor Z is a known risk factor for disease Y --factor Z is associated with factor X but is not a result of factor X --X <- Z -> Y --a common cause -yellow fingers <-(+)-- smoking --(+)-> lung cancer yellow fingers <------(+)------> lung cancer --adjust for smoking -yellow fingers <-(+)-- smoking --(+)-> lung cancer *--a confounder induces an association between its effects --conditioning on a confounder removes the association --condition = (restrict, stratify, adjust)*
confounders (confounding and mediation slides)
--factor Z is a known risk factor for disease Y --factor Z is associated with factor X but is not a result of factor X --in other words... a noncausal association between an exposure and observed outcome is detected as a result of a third variable or group of variables --diagram= X (exposure) <-> Z (confounder) -> Y (outcome); X ---> Y -note: this is supposed to be in a triangle with Z at the top, X on the right, and Y on the left *confounding in different types of studies:* --study design= experimental (randomized controlled trial); approach= random allocation (A and B group); example= A= vaccine and B= placebo; source of confounding (difference(s) between groups)= random difference(s) --study design= observational (prospective study); approach= nonrandom allocation (A and B group); example= A= smokers and B= nonsmokers; source of confounding (difference(s) between groups)= random difference(s) and factor associated with the exposure of interest
biological plausibility (causation slides)
--for an association to be causal it has to be plausible (i.e., consistent with the laws of biology) --biologic plausibility *may well be one of the most problematic guidelines supporting causality -note= we don't know some things* --the proposed mechanism should be biologically (etiologically) plausible --reference to a "coherent" body of knowledge *--caveat=* we don't know everything yet
experimental evidence (causation slides)
--grade and level of evidence chart --grade A -level: 1a= *systematic review* (with homogeneity) of randomized clinical trials 1b= *individual RCT* (with narrow confidence interval) 1c= *"natural experiments"* i.e., interventions with dramatic effects (e.g., streptomycin for tuberculosis meningitis; insulin for diabetes) 2a= *systematic review* (with homogeneity) of cohort studies 2b= *individual cohort study or randomized clinical trial of lesser quality* (e.g., with < 50% follow up) [1 cohort study or 1 not so good RCT] --grade B -level: 2c= *outcomes research* (based on existing records) 3a= *systematic review* (with homogeneity) of case-control studies 3b= *individual case-control study* [1 case-control study] --grade C [basically just a case series (cohort and case control can fall into here too)] -level: 4a= temporal (before-after) *series with* controls and cohort and case-control studies of lesser quality 4b= temporal (before-after) *series* without controls --grade D -level: 5= *expert opinion* without explicit critical appraisal or not based on logical deduction [don't base things off of just 1 opinion] --note= don't say something's causal based on 1 study alone *--note= anything from 3b down= no change* *--note= on average takes 15 years for medicine and public health to change*
dose-response (causation slides)
--graded pattern (graphs are on this slide) -exposure level= linear; monotonic throughout range of exposure levels (e.g., smoking and lung cancer) --exposure level= J shaped [curve] (e.g., alcohol and cardiovascular disease) --exposure level= backwards L; threshold pattern (e.g., weight and coronary disease sudden death) --note= may have a J-shaped curve due to confounding by indication (something else may be causing the curve) --dose-response relationship -changes in the exposure are related to trend in risk of disease -strong evidence for causal relation suggesting biologic relation *-caveat=* thresholds i.e., no disease level of exposure [note= not everything monotonic]
indirect adjustment method (misclassification slides)
--helpful when... -stratum specific rates or risks are missing in one of the comparison groups -study groups are too small and stratum specific rates or risks are unstable *--the source population for the standardized rates is not the "standard population"*
interaction - defined (interaction and effect modification slides)
--homogeneity/ heterogeneity definition -the effect of a risk factor X on the risk of an outcome Y is not homogeneous in strata of a third variable Z (i.e., the effect modifier) --observed vs expected joint effects definition -the observed joint effect of X and Z differs from that expected based on their independent effects -note= interaction definition (2 variables together and how the beta's looking (the β3(X1*X2) part of the equation))
assessing confounding strategy #1: does the variable meet the criteria to be a confounder (confounding and mediation slides)
--hypothetical case-control study of risk factors for malaria --example of confounding: hypothetical study of male gender as a risk factor for malaria infection (2x2 table with cases and controls on top and males and females on the side: A= 88 (male cases), B= 68 (male controls), C= 62 (female cases), and D= 82 (female controls); OR= 1.71) --is male gender causally related to the risk of malaria? *--possible confounder for a male gender-malaria association* -work environment as a possible confounder of the association between male gender and malaria risk (shows a visual of a DAG= male gender <-> outdoor occupation -> malaria; male gender --(?)-> malaria) *--first criterion: is the confounding variable causally associated with the outcome?* -same DAG visual as above *-is outdoor occupation (or something for which this variable is a marker, e.g., exposure to mosquitoes) causally related to malaria? YES* visual= 2x2 table with malaria cases and controls on top and outdoor and indoor on the side: A= 63 (42.0% (outdoor malaria cases)), B= 18 (12.0% (outdoor controls)), C= 87 (58.0% (indoor malaria cases)), and D= 132 (88.0% (indoor controls)); OR= 5.3 *--second criterion: is the confounding variable causally or non-causally associated with the exposure?* -same DAG visual as above *-is outdoor occupation associated with male gender? YES* visual= 2x2 table with males and females on top and outdoor and indoor on the side: A= 68 (43.5% (outdoor males)), B= 13 (9.0% (outdoor females)), C= 88 (56.5% (indoor males)), and D= 131 (91.0% (indoor females)); OR= 7.8 *--third criterion: is the confounder an intermediate variable in the causal pathway between exposure and outcome? (it should not be)* -same DAG visual as above -note (from the lecture slide not from me): *judgement* and knowledge about the pathophysiology of the disease process are critical to answer this question -outdoor occupation is *probably not* a mediator of the relationship between male gender and malaria --provided that -crude association between male gender and malaria OR= 1.71 -outdoor occupation is more frequent among males and -outdoor occupation is associated with greater risk of malaria... -what would be the expected magnitude of the association between male gender and malaria after controlling for occupation? -the (adjusted) association estimate will be *smaller than 1.71* (other possibilities are equal to 1.71 and greater than 1.71)
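--note (worked example)= a minimal Python sketch that recomputes the three odds ratios quoted above with the cross-product formula OR= ad / bc. The cell counts are copied from the notes; the function and variable names are made up for illustration.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for a 2x2 table with cells a, b (top row) and c, d (bottom row)."""
    return (a * d) / (b * c)

# Cell counts copied from the three 2x2 tables in the notes.
or_gender_malaria = odds_ratio(88, 68, 62, 82)    # exposure vs outcome (crude)
or_outdoor_malaria = odds_ratio(63, 18, 87, 132)  # criterion 1: confounder vs outcome
or_outdoor_gender = odds_ratio(68, 13, 88, 131)   # criterion 2: confounder vs exposure

print(round(or_gender_malaria, 2))   # 1.71
print(round(or_outdoor_malaria, 1))  # 5.3
print(round(or_outdoor_gender, 1))   # 7.8
```

Both confounder associations are much stronger than the crude exposure-outcome association, which is why the adjusted estimate is expected to drop below 1.71.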
3. counterargument: effect measure modification does NOT supersede control of confounding (interaction and effect modification slides)
--if a covariate is a continuous or ordinal variable we might want to treat it both as a confounder and as a modifier, simultaneously --that is first we adjust for the covariate as a confounder within larger strata of the same covariate --then we compare the adjusted estimates among the larger strata to determine whether the covariate is a modifier --note= type II error can happen --note= we still argue if we should control for confounding when something's thought to be both
coherence with established "facts" (causation slides)
--if a relation is causal would expect observed findings to be consistent with other data --hypothesized causal relations need to be consistent with epidemiologic and biologic knowledge *--caveat=* data may not be available yet to directly support proposed mechanism AND science must be prepared to reinterpret existing understanding of disease process in the face of new evidence
how are we measuring "effect" (interaction and effect modification slides)
--if effect is measured on an *additive scale* (e.g., measuring attributable risks) we are assessing the presence or absence of *additive interaction* (attributable risk model) --if effect is measured on a *relative scale* (e.g., measuring relative risks, odds ratios [note= and rate ratios but check if this is correct]) we are assessing the presence or absence of *multiplicative interaction* (relative risk model) --you can have additive interaction [note= this is the + symbol here] without multiplicative interaction [note= this is the x symbol here], you can have x without +, you can have both, and you can have neither
confounding and exchangeability (misclassification slides)
--if we have exchangeability, the rate of disease in the exposed and comparison groups will be identical if the exposure *truly has no effect* on disease occurrence --if not, the exposed and comparison groups can have different disease rates because they may differ on other risk factors for disease -> confounding --note= can calculate our adjusted values --confounding: a failure of the comparison group to reflect the counterfactual experience of the exposed group -risk factors are distributed differently between the exposed and unexposed -put another way there are common causes between the exposure and the outcome --not the fault of the investigator as in the case of systematic error (e.g., selection and information bias) -note= systematic error is the fault of the investigator
leverage (multivariate analysis I: continuous outcomes slides)
--measures how much an observation influences regression coefficients *-high influence if leverage h > 2*k/N* where k is the number of parameters (including the intercept) and N is the sample size *-a rule-of-thumb= leverage goes from 0 to 1; a value closer to 1 or over 0.5 may indicate problems* --predict lev, leverage
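--note (worked example)= a rough Python sketch of the 2*k/N cutoff, assuming a simple linear regression with one predictor, where leverage has the closed form h_i = 1/N + (x_i - mean)^2 / Sxx. The data and names are hypothetical; in Stata the values would come from predict lev, leverage.

```python
def leverages(x):
    """Leverage h_i for simple linear regression (intercept plus one slope)."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical predictor values; 30 sits far from the rest
k = 2                                 # parameters: intercept and slope
cutoff = 2 * k / len(x)               # the 2*k/N rule from the notes
h = leverages(x)
flagged = [xi for xi, hi in zip(x, h) if hi > cutoff]

print(cutoff)   # 0.4
print(flagged)  # [30]
```

The leverages always sum to k, so the cutoff is simply twice the average leverage.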
confounding and exchangeability (confounding and mediation slides)
--if we have exchangeability, the rate of disease in the exposed and comparison groups will be identical if the exposure *truly has no effect* on disease occurrence --if not, the exposed and comparison groups can have different disease rates because they may differ on other risk factors for disease -> confounding *--confounding= a failure of the comparison group to reflect the counterfactual experience of the exposed group* -risk factors are distributed differently between exposed and unexposed -put another way there are common causes between the exposure and outcome -note= exposed people can differ from unexposed people on other factors (can be factors at many levels *(ex= individual, community, or national levels)*); *not the fault of the investigator if they try to account for them (not a bias)* *--not the fault of the investigator as in the case of systematic error (e.g., selection and information bias)* --note= ex: study of CVD in swimmers vs non-swimmers, so confounders: exercise amount, diet, pool access, etc
effect of sensitivity and specificity of disease diagnosis (misclassification slides)
--in a 2x2 table *non-differential misclassification results in bias towards the null* --for an *uncommon event imperfect specificity induces a larger bias than imperfect sensitivity* --for a *common event imperfect sensitivity induces a larger bias than imperfect specificity*
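--note (worked example)= a small Python sketch of the "uncommon event" bullet. It assumes a hypothetical cohort with true risks of 2% (exposed) and 1% (unexposed) and applies the same sensitivity and specificity to both groups (non-differential). All numbers and names are illustrative, not from the slides.

```python
def observed_rr(risk_exposed, risk_unexposed, sensitivity, specificity):
    """Risk ratio seen after non-differential misclassification of the outcome."""
    def observed_risk(true_risk):
        # detected true cases + false positives among the truly non-diseased
        return sensitivity * true_risk + (1 - specificity) * (1 - true_risk)
    return observed_risk(risk_exposed) / observed_risk(risk_unexposed)

# Hypothetical uncommon outcome: true RR = 0.02 / 0.01 = 2.0
rr_imperfect_sens = observed_rr(0.02, 0.01, sensitivity=0.90, specificity=1.00)
rr_imperfect_spec = observed_rr(0.02, 0.01, sensitivity=1.00, specificity=0.90)

print(round(rr_imperfect_sens, 2))  # 2.0
print(round(rr_imperfect_spec, 2))  # 1.08
```

With a rare outcome most people are non-diseased, so even a small loss of specificity floods both groups with false positives and pulls the risk ratio toward the null.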
example: information bias (misclassification slides)
--in a cohort study of environmental toxins and CKD the investigators used a fairly broad definition of exposure -definition identified 99% of exposure among those with and without CKD -definition included some people who were not actually exposed but were classified as exposed: 5% of those with CKD and 15% of those without CKD --note= this is differential misclassification because the likelihood of being wrongly classified as exposed differs by CKD status (can go towards or away from the null)
why ordinary least squares regression (multivariate analysis I: continuous outcomes slides)
--in a linear regression model in which the errors have expectation 0, are uncorrelated, and have equal variances, OLS is the *Best Linear Unbiased Estimator (BLUE)* -Best= smallest variance among linear unbiased estimators -Linear= the estimator is a linear function of the observed Y values (the model itself assumes the association between X and Y is linear) -Unbiased= on average the estimator equals the true coefficient (it does not systematically over- or underestimate it) -Estimator= the regression coefficient
relative risk/ risk ratio and risk difference example (causation slides)
--in a particular study, 20 of 100 exposed people develop disease and 25 of 200 unexposed people develop disease --2x2 table (outcome on top and exposure on side): -A= 20 (diseased and exposed) -B= 80 (no disease and exposed) -C= 25 (diseased and not exposed) -D= 175 (no disease and not exposed) --RR= (20 / 100) / (25 / 200) = 0.2 / 0.125 = 1.60 --RD= (20 / 100) - (25 / 200) = 0.2 - 0.125 = 0.075
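--note (worked example)= the same arithmetic as a short Python sketch, using the cell counts from the card above (variable names are illustrative).

```python
a, b = 20, 80    # exposed: diseased, not diseased
c, d = 25, 175   # unexposed: diseased, not diseased

risk_exposed = a / (a + b)      # 20 / 100
risk_unexposed = c / (c + d)    # 25 / 200

rr = risk_exposed / risk_unexposed   # risk ratio
rd = risk_exposed - risk_unexposed   # risk difference

print(round(rr, 2), round(rd, 3))  # 1.6 0.075
```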
residual stats in STATA with predict (multivariate analysis I: continuous outcomes slides)
--in general predict calculates the requested statistic for all observations possible whether they were used in fitting the model or not --if your regression was not run on all the cases (e.g., you restricted to a subsample) you might want to modify the predict command with "if e(sample)" -this will limit the computations to the cases used by the previous regression
why does causality matter in public health (causation slides)
--inferring whether an association is causal is key to the use of epidemiologic findings in primary prevention and other interventions that aim at modifying the probability of the outcomes of interest --note= public health goal- want people to live a long and happy life *--confounding and bias lead to misleading conclusions from epidemiological studies* --and there are lots of examples --note= people in cohort studies, volunteers for RCTs, and survey takers are usually different from the general population --on the slides the example they give of this is relative risk of coronary heart disease according to the duration of current supplement use among men (vitamin E and CVD) -the slides look at 3 studies that look at if vitamin E goes up in diets then risk CVD goes down -a number of others did it in cohort studies -when someone did RCT it found no association (most settled that both RCT and cohorts were biased) -RCT can have selection bias
quantile regression (multivariate analysis I: continuous outcomes slides)
--instead of log transformation you can use this -when not normally distributed -a lot easier to interpret results here than with log
is misclassification differential or non-differential
--is it misclassification of the exposure or the outcome [note= this is question 1] --if exposure [note= this is question 2] -happened to the same extent among those with and without the outcome -> *non-differential* -happened to a greater extent among those with the outcome compared to those without or vice versa -> *differential* --if outcome [note= this is question 2] -happened to the same extent among the exposed and non-exposed -> *non-differential* -happened to a greater extent among the exposed compared to the unexposed or vice versa -> *differential* *--based on the LIKELIHOOD (percent) of misclassification NOT the number misclassified*
assessment of homogeneity (interaction and effect modification slides)
--is there an association between risk factor X and disease Y --(yes)-> is it due to random variability, confounding, or bias [if yes -> random, confounding, or spurious association] --(no)-> is the association of similar magnitude in subgroups of the population [if no= interaction or effect modification and if yes= no interaction] --note= is there an association between risk factor X and disease Y; if this is no stop you're done (unless you suspect negative confounding)= no means null association here --note= is it due to random variability, confounding, or bias; maybe due to small sample size --note= is the association of similar magnitude in subgroups of the population [if no= interaction or effect modification and if yes= no interaction]; no= EMM= stratum specific estimates differences; yes= confounding (not sure if this is correct= stratum specific estimates similar)
additive interaction, graphically (interaction and effect modification slides)
--look at slides for the graphs --no additive interaction= parallel lines that won't cross [one line being HBV - and the other being HBV +] --additive interaction= the lines are not parallel, so they diverge or converge and may cross [one line being HBV - and the other being HBV +]
measures of central tendency and dispersion (multivariate analysis I: continuous outcomes slides)
--mean and median -when they are pretty close outliers don't mean much but there may still be outliers
Cook's distance (multivariate analysis I: continuous outcomes slides)
--measures how much an observation influences the overall model or predicted values -it is a summary measure of leverage and high residuals *-high influence if D > 4/N* where N is the sample size *-a D > 1 indicates a big outlier problem* --predict cook, cooksd
DFBeta (multivariate analysis I: continuous outcomes slides)
--measures the influence of each observation on the coefficient of a particular independent variable (for example, x1) -this is in standard error terms *--an observation is influential if it has a significant effect on the coefficient -a case is an influential outlier if |DfBeta|>2/SQRT(N)* where N is the sample size -note (it says this in the slide): stata estimates standardized DfBetas --dfbeta x1 (dfbeta for one predictor) --dfbeta (dfbeta for all predictors) --to identify those who exceed the cutoff: -gen cutoffdfbeta = abs(_dfbeta_1) > 2/sqrt(e(N)) & e(sample)
binary exposure and disease (misclassification slides)
--misclassification = moving subjects between cells --note= moves up and down on the 2x2 table based on exposure [A vs C and B vs D] --note= moves side to side on the 2x2 table based on disease [A vs B and C vs D] --OR= ad / bc --RR= (a / (a + b)) / (c / (c + d))
last slide table (confounding and mediation slides)
--confounding columns (model A: X only, unadjusted β= 0.4 in every column; model B: X + Z, adjusted β1) -β1= 0 -> conclude full confounding -β1 ≥ 0.44 -> conclude partial confounding toward the null (unadj < adj = underestimate) -β1 positive but ≤ 0.36 -> conclude partial confounding away from the null (unadj > adj = overestimate) -β1 > 0.36 and < 0.44 -> conclude no confounding (10% rule) --mediation columns (model A: X only, unadjusted β= 0.4 in every column; model C: X + V, adjusted β2) -β2= 0 -> conclude full mediation -β2 positive but ≤ 0.36 -> conclude partial mediation -β2 > 0.4 -> conclude no mediation (unadj < adj)
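--note (worked example)= one way to encode the confounding half of this table as a Python sketch. It assumes a positive unadjusted beta (as in the 0.4 example) and measures the change relative to the unadjusted estimate; the function names and the rounding step are my own choices, not from the slides.

```python
def percent_change(unadjusted, adjusted):
    """Change in the estimate after adjustment, as a percent of the unadjusted value."""
    # Rounded so that boundary values such as 0.44 vs 0.40 land on exactly 10%.
    return round(100 * (adjusted - unadjusted) / unadjusted, 6)

def classify_confounding(unadjusted, adjusted, threshold=10.0):
    """Apply the 10% rule, assuming a positive unadjusted beta."""
    if adjusted == 0:
        return "full confounding"
    change = percent_change(unadjusted, adjusted)
    if abs(change) < threshold:
        return "no confounding (10% rule)"
    if change > 0:
        return "partial confounding toward the null"    # unadj < adj = underestimate
    return "partial confounding away from the null"     # unadj > adj = overestimate

for adjusted_beta in (0.0, 0.44, 0.36, 0.42):
    print(adjusted_beta, classify_confounding(0.4, adjusted_beta))
```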
multicollinearity (multivariate analysis II: categorical outcomes slides)
--moderate multicollinearity is fairly common since any correlation among the independent variables is an indication of collinearity -note= ex- hemoglobin A1C is highly correlated to diabetic status (when looking at stroke and want to look at diabetes they usually choose A1C or diabetes status) --when severe multicollinearity occurs the standard errors for the coefficients tend to be very large (inflated) and sometimes the estimated logistic regression coefficients can be highly unreliable --note= this is an assumption
so what about multivariable survival analysis (survival analysis slides)
--need a regression technique for censored data --Cox (1972) proportional hazards model -relates survival to covariates -can handle incomplete follow up --this slide shows a graph [hazard function: time-specific event rate] -proportional: h_B(t) = K × h_A(t), i.e., the hazard in group B is a constant multiple K of the hazard in group A at every time t
comparing observed vs expected joint effects (interaction and effect modification slides)
--no interaction - A + Z where A+Z= length of expected and length of observed are the same (in the visual) -note= expected and observed are the same --positive interaction (synergism) -A + Z where A+Z= length of expected is shorter than length of observed (in the visual) -note= observed larger than expected --negative interaction (antagonism) -A + Z where A+Z= length of expected is longer than length of observed (in the visual) -note= smaller observed than expected
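--note (worked example)= a hedged Python sketch of the observed vs expected comparison. It assumes the usual textbook formulas for the expected joint effect of X and Z on the odds ratio scale (additive: ORx + ORz - 1; multiplicative: ORx × ORz); the ORs below are hypothetical.

```python
def expected_joint_or(or_x, or_z):
    """Expected joint OR under no interaction, on the additive and multiplicative scales."""
    additive = or_x + or_z - 1.0
    multiplicative = or_x * or_z
    return additive, multiplicative

def compare(observed, expected):
    if observed > expected:
        return "positive interaction (synergism)"
    if observed < expected:
        return "negative interaction (antagonism)"
    return "no interaction"

or_x, or_z, or_joint_observed = 2.0, 3.0, 6.0  # hypothetical independent and joint effects
exp_add, exp_mult = expected_joint_or(or_x, or_z)

print(exp_add, compare(or_joint_observed, exp_add))    # 4.0 positive interaction (synergism)
print(exp_mult, compare(or_joint_observed, exp_mult))  # 6.0 no interaction
```

The same observed joint effect can show interaction on one scale and none on the other, which is why the scale has to be stated.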
information bias and misclassification (misclassification slides)
--non-differential misclassification= exposure: towards null and outcome: towards null --differential misclassification= exposure: away or towards null and outcome: away or towards null
types of survival analysis (survival analysis slides)
--non-parametric method -Kaplan-Meier analysis --semi-parametric method -Cox regression: handles confounding variables; effects given as hazard ratios (HR) [relative risk measures (similar to what you'll calculate for risk ratios)]; assumes proportional hazards --parametric method -not addressed in this course --note= there are a couple of issues with survival analysis
Cox proportional hazards model (survival analysis slides)
--semi-parametric (the slide labels it nonparametric) -does not involve any assumptions as to the form or parameters of the underlying survival distribution --assumptions -non-informative censoring: study design must ensure that the mechanisms giving rise to censoring of individual subjects are not related to the probability of an event occurring -proportional hazards: survival curves for 2 strata have hazard functions that are proportional over time --gives us a hazard ratio (HR) *-interpretation similar to a risk ratio* --can have fixed and time dependent covariates [ex= measure blood pressure every follow up/ at the start] --1st run logistic regression because the Cox model should be very similar to it -slide shows STATA output --22% decreased risk of stroke for people with infection compared to those without it [day 4,313 = last exit] -slide shows STATA output --slide shows more STATA output --slide shows K-M survival estimates graph -note= time interval= days to stroke; crossed lines= bad; explains the decreased stroke risk in the infection group --if death= competing risk it can really influence survival analysis -slide shows more STATA output
assumption 3: normality (multivariate analysis I: continuous outcomes slides)
--normality= the residuals/ errors are normally distributed at each value of the predictor -when this is violated the estimates of regression coefficients may not be biased -however the SEs of regression coefficients are likely biased --if any of these assumptions is violated (i.e., if there are nonlinear relations between dependent and independent variables or the errors exhibit correlation, heteroscedasticity, or non-normality) then the forecasts, confidence intervals, and scientific insights yielded by a regression model may be (at best) inefficient or (at worst) seriously biased or misleading --ideally your statistical software will automatically provide charts and statistics that test whether these assumptions are satisfied for any given model --unfortunately many software packages do not provide such output by default (additional menu commands must be executed or code must be written) and some (such as excel's built-in regression add-in) offer only limited options --note= be careful with excel; its regression algorithm is not good so don't use it --note= SAS you have to download this --note= can't use linear regression on binary outcomes (the outcome has to be continuous (this is why we have logistic regression)) --it is important to assess normality via charts as well as statistics; there are occasions where a model that violates all of the assumptions above would likely be accepted by a naive user on the basis of a large value of R-squared --you will sometimes see additional (or different) assumptions listed such as "the variables are measured accurately" or "the sample is representative of the population," etc --these are important considerations in ANY form of statistical modeling and they should be given due attention although they do not refer to properties of the linear regression equation per se
--that normality assumption... -linear regression is actually quite robust and can handle data that are not quite normally distributed -but when the dependent variable is highly skewed you can have problems
how can we prevent information bias (misclassification slides)
--precise operational definitions of variables --detailed measurement protocols --repeated measurements of exposure and outcome --validation of measurements through multiple sources --training, certification, and re-certification of study staff --masking to subject's disease or exposure status --standardizing questions with close-ended questions --use better information sources, memory aids *--using most objective measures available*
tonight's data (multivariate analysis I: continuous outcomes slides)
--note= multivariate vs multivariable= in statistics these have very different meanings and they aren't interchangeable -multivariate= multiple outcomes -multivariable= multiple exposures --Baltimore Prevention Project --N= 2,311 --RCT of classroom intervention -intervention in 1st and 2nd grade -assessments fall of 1st grade and spring of 1st through 6th grades subsample followed up in middle school everyone followed up at age 18-21 (n= 1,715)
assessing confounding strategy #2: does controlling for the putative confounder change the magnitude of the exposure-outcome association (confounding and mediation slides)
--note= putative means generally considered or reputed to be --stratified analysis of the association between gender and malaria (from exhibit 5.2) according to whether individuals work mainly outdoors or indoors -mostly outdoor occupation: 2x2 table with cases and controls on top and males and females on the side A= 53 (male cases), B= 15 (male controls), C= 10 (female cases), and D= 3 (female controls); OR= 1.06 -mostly indoor occupation: 2x2 table with cases and controls on top and males and females on the side A= 35 (male cases), B= 53 (male controls), C= 52 (female cases), and D= 79 (female controls); OR= 1.00 --note= both stratum-specific ORs are close to 1.0 while the crude OR was 1.71, so controlling for occupation removes most of the association (consistent with confounding)
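--note (worked example)= a Python sketch that recomputes the stratum-specific ORs from the two tables above and the crude OR from the collapsed table. The Mantel-Haenszel summary OR at the end is an addition of mine (the card itself only shows stratum-specific ORs); it is one common way to get a single adjusted estimate.

```python
def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

# (male cases, male controls, female cases, female controls) copied from the notes
strata = {
    "outdoor": (53, 15, 10, 3),
    "indoor": (35, 53, 52, 79),
}

stratum_or = {name: odds_ratio(*cells) for name, cells in strata.items()}

# Collapse the strata cell by cell to recover the crude table (88, 68, 62, 82).
crude_cells = [sum(cells[i] for cells in strata.values()) for i in range(4)]
crude_or = odds_ratio(*crude_cells)

# Mantel-Haenszel summary OR: sum(a*d/n) / sum(b*c/n) over strata (not from the slides).
mh_num = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
mh_den = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
mh_or = mh_num / mh_den

print(round(stratum_or["outdoor"], 2))  # 1.06
print(round(stratum_or["indoor"], 2))   # 1.0
print(round(crude_or, 2))               # 1.71
print(round(mh_or, 2))                  # 1.01
```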
excessive correlation between the confounder and the exposure of interest (confounding and mediation slides)
--note= there are visuals of this in the power point --perfect correlation between dichotomous exposure of interest and confounding factor: there are no cells in which the exposure is present and the confounder is absent, or vice versa [visual shows only A and D on the 2x2 table] --correlation between an exposure of interest and a confounding factor: all four cells for a cross-tabulation of dichotomous categories are represented in this schematic representation; the larger sizes of cells A and D denote the magnitude of the positive correlation between the exposure and the confounder [visual shows the 2x2 table with A and D big and B and C small]
stratification and adjustment (misclassification slides)
--note= use this for confounding control --as a reminder -to make causal inferences ideally we compare the risk of the outcome in the exposed (actual) with the risk of the outcome in the same people had they not been exposed (counterfactual) -but since we can't do this in a real world study we must select different sets of individuals for the exposed group and a comparison group (unexposed) that are as similar as possible [note= different techniques are needed]
survival analysis (survival analysis slides)
--objective= compare risk among exposed to risk among unexposed --analysis= # events and time events occur *--account for time events occur --account for different follow up time* [note= accounts for short, long, mismatch follow up] *--note= HAS TO HAVE TIME* *--modeling time to event data -death or disease development (failure) --the primary function that is estimated is the survival function* *-S(t) = Pr(T > t)* *where t is some time and T is a random variable denoting the time of the event of interest* --note= developed to look at time of death; now it's expanded to look at other things (ex= recovery time) --note= measure of association; incorporates the time people have in the study; similar to incidence; *extra credit on the test= how to calculate survival analysis tables*
web of causation (causation slides)
--occurrence of disease can be explained by a complex web of many interconnected factors including host and environmental determinants --implications -multiple causes -multiple places for intervention but usually more proximal causes are targeted (note= intervene on what's easiest) --note= to describe multiple causes of disease and multiple points of intervention --this slide has a picture of the web of causation
indirect adjustment: SMR (misclassification slides)
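--on the powerpoint there is a table for this --SMR= observed events / expected events = Σi x_Ai / Σi (M_i × n_Ai), where x_Ai= observed events in stratum i of study group A, M_i= rate in stratum i of the reference population, and n_Ai= number of people in stratum i of study group A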
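--note (worked example)= a minimal Python sketch of the SMR as observed over expected events, reading the formula as Σ x_Ai / Σ (M_i × n_Ai). The three strata below are invented for illustration; the slide's own table is not reproduced here.

```python
# Hypothetical strata: (observed events x_Ai, study-group size n_Ai, reference rate M_i)
strata = [
    (10, 1000, 0.005),
    (30, 2000, 0.010),
    (50, 1000, 0.040),
]

observed = sum(x for x, n, m in strata)
# Expected events: apply the reference population's stratum rates to the study group's sizes.
expected = sum(m * n for x, n, m in strata)
smr = observed / expected

print(observed, round(expected, 1), round(smr, 2))  # 90 65.0 1.38
```

An SMR above 1 means the study group had more events than expected had it experienced the reference population's stratum-specific rates.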
assumption 5: no outliers (multivariate analysis I: continuous outcomes slides)
--outliers --distance measures -Mahalanobis (not available in STATA; a user written program is available) -Cook's D -leverage values --note= outliers could influence results; some outliers may be data points entered wrong; sometimes you leave the value as is and sometimes you enter it in as missing (take it out)
the essential problem (confounding and mediation slides)
--outside the context of study (real world) many reasons why people drink coffee --these reasons are often related to outcomes of interest --confounding= when risk factors for lung cancer are associated with drinking coffee (ex= smoking) --if randomization is successful no confounding at baseline
why is survival analysis tricky (survival analysis slides)
--patient 1= 1976 -> 1980; 4+ years -administratively withdrawn alive --patient 2= 1972 -> 1978; 6+ years -lost to follow up --patient 3= 1970 -> 1975; 5 years -dies --a chart is on this slide --note= when death is not the outcome of interest you censor someone when they die *-you can calculate competing risks (had they lived what would their risk be) but you start by censoring*
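--note (worked example)= a small Python sketch of how censoring enters a survival estimate, using the Kaplan-Meier (product-limit) method named on the "types of survival analysis" card. It assumes the "+" follow-up times (patients 1 and 2) are censored and that patient 3's death is the event; the function name and data layout are my own.

```python
def kaplan_meier(records):
    """records: list of (years_followed, event) with event=True for death, False for censored.
    Returns [(event_time, S(t))] from the product-limit estimator."""
    survival = 1.0
    curve = []
    for t in sorted({time for time, event in records if event}):
        at_risk = sum(1 for time, _ in records if time >= t)   # still under follow-up at t
        deaths = sum(1 for time, event in records if event and time == t)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

patients = [
    (4, False),  # patient 1: withdrawn alive at 4+ years (censored)
    (6, False),  # patient 2: lost to follow-up at 6+ years (censored)
    (5, True),   # patient 3: dies at 5 years (event)
]
curve = kaplan_meier(patients)
print(curve)  # [(5, 0.5)]
```

Patient 1 was censored before year 5, so only two people are at risk when the death occurs and the estimate drops to 0.5 rather than 2/3.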
non-differential misclassification: what to do (misclassification slides)
--planning -think about susceptible variables -read up on experience in other studies -devise ways to minimize misclassification --study design -implement a quality control program: observe data collection, review incoming data, conduct interim analysis -consider multiple exposure measures: using same technique, using different approaches -consider a validation sub-study --analysis -assess the reliability of the data -consider adjusting for misclassification if it is both 1. substantial and 2. understood -note= problem with whether or not it is understood is we very rarely know the true extent of the error
types of confounding (confounding and mediation slides)
--positive confounding= when the confounding effect results in an *overestimation* of the effect (i.e., the crude estimate is further away from 1.0 than it would be if confounding were not present) -ex= OR crude is 4.0 -> both strata OR adjusted 2.0 positive (adjusted moves towards the null) -ex= OR crude is 4.0 -> both strata OR adjusted 1.0 positive (adjusted moves towards the null) -ex= OR crude is 0.2 -> both strata OR adjusted 0.9 positive (adjusted moves towards the null) --negative confounding= when the confounding effect results in an *underestimation* of the effect (i.e., the crude estimate is closer to 1.0 than it would be if confounding were not present) -ex= OR crude is 4.0 -> both strata OR adjusted 8.0 negative (adjusted moves away from the null) -ex= OR crude is 1.0 -> both strata OR adjusted 3.0 negative (adjusted moves away from the null) -ex= OR crude is 0.9 -> both strata OR adjusted 0.2 negative (adjusted moves away from the null) --qualitative confounding= an extreme case of confounding when the confounding effect results in an inversion of the direction of the association -note= doesn't happen a lot but it changes the direction of the relationship (risk to protective relationship once the confounder's controlled for) -ex= OR crude is 4.0 -> both strata OR adjusted 0.5 qualitative (reversal of effect)
non-differential misclassification (misclassification slides)
--probability of misclassification is *independent* of disease and exposure status -non-differential misclassification of disease: sensitivity and specificity of diagnosing a disease are the same for exposed and unexposed -non-differential misclassification of exposure: sensitivity and specificity of detecting an exposure are the same for those with (i.e., cases) and without disease *--note= generally results in bias towards null*
publication bias (causation slides)
--publication bias as conventionally defined occurs when assumption 2 (that published studies constitute an unbiased sample of a theoretical population of unbiased studies) cannot be met because besides the quality of the report other factors dictate acceptability for publication --for example -"positive" vs "negative" results -statistical significance -caused by editors and reviewers vs authors -source of support/ funding -language bias --note= now have the journal of negative findings for studies with no association -many journals refuse to publish no association studies so it causes others to believe no studies on that topic have happened
multiplicative interaction, case control study: homogeneity of effects approach (interaction and effect modification slides)
--putative effect modifier= no -> exposure= no -> cases -> controls -> OR= 1.0 -> meaning of OR= referent group --putative effect modifier= no -> exposure= yes -> cases -> controls -> OR -> meaning of OR= effect of exposure in absence of effect modifier --putative effect modifier= yes -> exposure= no -> cases -> controls -> OR= 1.0 -> meaning of OR= referent group --putative effect modifier= yes -> exposure= yes -> cases -> controls -> OR -> meaning of OR= effect of exposure in presence of effect modifier
statistical modeling and statistical tests for interaction (interaction and effect modification slides)
--question: is the observed heterogeneity produced by chance? -interaction term in a regression model -formal statistical tests when using stratification *--statistical tests although helpful are not sufficient to evaluate interaction fully* -note= use p-value that's different for this 0.1 or 0.2 or lower [note= unsure if I wrote those numbers in this note down correctly] -when sample sizes are large even a slight heterogeneity of no practical value or biological importance may be statistically significant -on the other hand, although not statistically significant, relative risk point estimates that are markedly different from each other suggest the possibility of true heterogeneity -ideally such non-statistically significant yet marked heterogeneity should be confirmed by a study with sufficient statistical power to detect it
strength of association (causation slides)
--rationale= it is more difficult to explain away a stronger than a weaker association on the basis of confounding or bias --for public health purposes careful consideration of whether a weak association is causal is justified by the possibility that it may result in a high population attributable risk if the exposure prevalence is high --strong associations are less likely to be caused by chance or bias --a strong association means a very high or very low relative risk *--caveat=* environmental associations with very low relative risks --note= weaker association but is on causal path so can happen
replication of findings (causation slides)
--relations that are demonstrated in multiple studies are more likely to be causal --consistent results found in different populations in different times with different study designs *--caveat=* heterogeneity of effect in different countries [or neighborhoods, or age groups, or subpopulations, etc]
residual vs fitted values (multivariate analysis I: continuous outcomes slides)
--residuals are the difference between the observed and predicted values of y --rvfplot, line(0) --a bit of limited utility when you have a categorical predictor --this slide has a graph
cessation of exposure (causation slides)
--risk of disease expected to decline when exposure to a cause is reduced or eliminated *--caveat=* pathogenic process already started; removal of cause does not reduce disease risk
sensitivity analysis (causation slides)
--sensitivity analysis is used for the quantitative analysis of the impact of errors - systematic or random - on the effect estimates' validity sensitivity analysis and effectiveness: --probability of the event that vaccines standard and new are expected to prevent according to whether or not target population receives the vaccine assuming a prevalence of eligibility associated with the standard vaccine, new, and improved new vaccines of 90%, 70%, and 90% respectively --look at chart on the slides
measurement error: sensitivity and specificity (misclassification slides)
--sensitivity and specificity can directly influence the different types of bias --2x2 table (disease on the top and test on the side): A= true positive (diseased and + test), B= false positive (not diseased and + test), C= false negative (diseased and - test), and D= true negative (not disease and - test) *-sensitivity goes A to C (downward on the side of diseased)* *-specificity goes D to B (upward on the side of not diseased)* *-predictive value positive goes A to B (arrow goes to the right (from A -> B))* *-predictive value negative goes from D to C (arrow goes to the left (C <- D))*
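A minimal sketch of the four measures read off the 2x2 table above, using the same cell labels (A= true positive, B= false positive, C= false negative, D= true negative). The function name and the example counts are made up for illustration and are not from the slides.

```python
def diagnostic_measures(a, b, c, d):
    """Return sensitivity, specificity, PPV and NPV from 2x2 cell counts.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    return {
        "sensitivity": a / (a + c),  # down the diseased column (A to C)
        "specificity": d / (b + d),  # up the non-diseased column (D to B)
        "ppv": a / (a + b),          # across the positive-test row (A to B)
        "npv": d / (c + d),          # across the negative-test row (D to C)
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
measures = diagnostic_measures(a=80, b=30, c=20, d=170)
print(measures)
```

Sensitivity and specificity use the columns (true disease status), while the predictive values use the rows (test result), which is why the predictive values change with disease prevalence.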
sensitivity (misclassification slides)
--sensitivity is the ability of a test to correctly identify those with a disease (or characteristic) of interest *--sensitivity= a / (a + c) = true positive / (true positive + false negative)* --note= calculation is based off of the positive gold standard test column --2x2 table gold standard on the top and test result on the side *-A= true positive (positive test result and positive gold standard), B= false positive (positive test result and negative gold standard), C= false negative (negative test result and positive gold standard), and D= true negative (negative test result and negative gold standard)* --note= way to try to identify measurement error; this 2x2 table is no longer disease vs exposure (it's test 1 vs test 2); true positive
imperfect measure of disease (misclassification slides)
--sensitivity= A moves to B and C moves to D --specificity= B moves to A and D moves to C --note= diseased misclassified
imperfect measure of exposure (misclassification slides)
--sensitivity= A moves to C and B moves to D --specificity= C moves to A and D moves to B --note= exposure misclassified
assessing publication bias (causation slides)
--slide has an image -Begg's funnel plot for assessing publication bias in relation to glutathione S-transferase M1 (GSTM1) null status and bladder cancer risk; the horizontal line corresponds to the meta-analysis pooled odds ratio estimate --slide has an image -funnel plot of odds ratio (OR) of family history of stroke as a risk factor for stroke vs precision (i.e., inverse of the standard error of the OR) in case-control (full circles) and cohort studies (empty circles); note (it says this in the slide) the asymmetry of the plot due to lack of estimates when OR < 1 (i.e., small negative studies) -asymmetry significant at p < 0.0001
confounders: coffee and pancreatic cancer (causation slides)
--smoking -known risk factor for pancreatic cancer -associated with coffee drinking but is not a result of coffee drinking -adjust for smoking and there is no relationship between pancreatic cancer and coffee drinking --coffee consumption <- smoking -> pancreatic cancer
confounders: coffee and pancreatic cancer (confounding and mediation slides)
--smoking -known risk factor for pancreatic cancer -associated with coffee drinking but is not a result of coffee drinking -adjust for smoking and there is no relationship between pancreatic cancer and coffee drinking --coffee consumption <- smoking <-> pancreatic cancer --coffee consumption --> pancreatic cancer
confounding vs effect modification (interaction and effect modification slides)
--sometimes confused because both involve a 3rd variable and both involve a stratified analysis for evaluation *--confounding:* are the stratum-specific estimates different from the crude estimate -note= account for *--effect modification:* do the stratum-specific estimates differ from one another -note= explain *--difference: try to eliminate confounding, try to explain effect modification*
specificity of the association (causation slides)
--specific exposure associated with only one disease *--caveat=* many exposures are linked to multiple diseases
specificity (misclassification slides)
--specificity is the ability to correctly identify those without a disease (or characteristic) of interest *--specificity= d / (b + d) = true negative / (true negative + false positive)* --note= calculation is based off of the negative gold standard test column --2x2 table gold standard on the top and test result on the side *-A= true positive (positive test result and positive gold standard), B= false positive (positive test result and negative gold standard), C= false negative (negative test result and positive gold standard), and D= true negative (negative test result and negative gold standard)* --note= another way to identify measurement error; true negative
measure of accuracy root mean squared error (RMSE) (misclassification slides)
--square root of the average squared deviation of the measured values from the true value --$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_i)^2}$ -$x_i$ = observed value -$\mu_i$ = true value
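The RMSE definition above can be sketched in a few lines. The function name and the example values are illustrative assumptions, not something from the slides.

```python
import math


def rmse(observed, true_values):
    """Root mean squared error: sqrt of the mean squared deviation
    of each observed value x_i from its true value mu_i."""
    if len(observed) != len(true_values) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(x - mu) ** 2 for x, mu in zip(observed, true_values)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical example: every measurement is off by exactly 2 units.
print(rmse([12.0, 8.0, 5.0], [10.0, 10.0, 3.0]))
```

Because deviations are squared before averaging, large errors count more than small ones, and the result is in the same units as the measurement.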
stratification (confounding and mediation slides)
--stratify potential confounder --assess association between the exposure and disease within each stratum of the potential confounder --compare measures of association (RRs, ORs) between different strata --if the observed association between exposure and disease is due to confounding -crude measure of association will differ from adjusted or stratum-specific measures of association *-expect no difference in measure of association between strata* --note= effect modifier: strata different and different from the crude --note= confounding: strata measures same but different from the crude --note= may have trouble with people who don't fit nicely into categories (ex= smoking vs non-smoking (smokes every day or smokes only when drinking)); careful with how you're stratifying confounders; don't want residual confounding due to that
studentized residuals in STATA (multivariate analysis I: continuous outcomes slides)
--studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere --such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model *--STATA calculates what Chatterjee & Hadi call externally studentized residuals for the standardized residuals* -uses the root mean squared error of regression omitting the observation in question for δ *--values of 3 or greater (or -3 or less) may be problematic and are considered outliers* --predict rstudent if e(sample), rstudent
survival curves (survival analysis slides)
--summary statistics -do not give the whole picture -1, 2 year survival rates *-caveat= medians do not describe whole curve* --picture is on the slide
Kaplan-Meier estimate (survival analysis slides)
--survival times (n= 10) 3 4 5+ 6 6 8+ 10+ 12 15 17 (+ = censored) --order survival times S(3)= 9/10= 0.90 S(4)= S(3) x 8/9= 9/10 x 8/9 = 0.80 S(6)= S(4) x 5/7= 9/10 x 8/9 x 5/7 = 0.57 S(12)= 0.57 x 2/3 = 0.38 S(15)= 0.38 x 1/2 = 0.19 S(17)= 0 --the slide has a graph --"limiting life table analysis" --product limit estimate -take intervals so small that you only have individuals with exactly the same survival time in each interval --note= try to use small intervals --product limit estimate -ordered survival times -computed at observed deaths -multiplying conditional probabilities [account for previous calculations for survival] --what happened to the censored data -we used it -because we used conditional probabilities (censored individuals stay in the at-risk denominator until the time they are censored)
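A minimal sketch of the product limit calculation, run on the ten survival times from this example (reading the list as 3, 4, 5+, 6, 6, 8+, 10+, 12, 15, 17 with + meaning censored). The function name and the 1/0 event coding are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Product limit estimate of S(t).

    times  -- survival time for each subject
    events -- 1 if the subject died at that time, 0 if censored
    Returns a list of (time, S(time)) pairs, one per observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for tt, e in data if tt == t]  # everyone leaving at time t
        deaths = sum(tied)
        if deaths > 0:
            # conditional probability of surviving past t, given alive just before t
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        at_risk -= len(tied)  # deaths and censored subjects both leave the risk set
        i += len(tied)
    return curve


times = [3, 4, 5, 6, 6, 8, 10, 12, 15, 17]
events = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1]
for t, s in kaplan_meier(times, events):
    print(f"S({t}) = {s:.2f}")
```

The censored subjects never trigger a step in the curve, but they shrink the at-risk denominator for later death times, which is how their information gets used.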
information bias (misclassification slides)
--systematic error in obtaining information regarding subjects in the study -occurs *after* subjects have entered study [note= selection bias may or may not have happened when choosing sample] -pertains to how data are collected -often results in incorrect classification of participants as either exposed or unexposed *(exposure misclassification)* or as diseased or not diseased *(disease/ outcome misclassification)* i.e., measurement error --types (to name a couple...) -interviewer bias- systematic difference in soliciting information -recall bias- differential level of accuracy in the information provided by compared groups --in what types of studies can information bias occur -any -why is recall bias a particular threat to case control studies -using what type of measure --note= RCT= account for confounding but you can still have bias; ex= misclassification can happen if stroke is defined in a certain way that can't detect all strokes (ex= imaging doesn't always detect stroke)
indirect adjustment example (misclassification slides)
--table on the powerpoint -study group A: potential confounder stratum 1= N- 100, deaths- 10, and rate- 10% stratum 2= N- 500, deaths- 100, and rate- 20% -study group B: potential confounder: stratum 1= N- 500, deaths- 50, and rate- 10% stratum 2= N- 100, deaths- 20, and rate- 20% -external reference rates: stratum 1= 12% stratum 2= 50% -total: study group A= N- 600, deaths- 110, and rate- 18.3% study group B= N- 600, deaths- 70, and rate- 11.7% --expected deaths obtained by applying reference rates to groups A & B potential confounder stratum 1: study group A- 12% x 100 = 12 study group B- 12% x 500 = 60 stratum 2: study group A- 50% x 500 = 250 study group B- 50% x 100 = 50 total number expected: study group A- 262 study group B- 110 SMR (obs / exp): study group A- 110 / 262 = 0.42 study group B- 70 / 110 = 0.64
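The indirect adjustment arithmetic above can be checked with a short sketch. The function name is made up for illustration; the stratum sizes, observed deaths, and reference rates are the ones in the table.

```python
def smr(observed_deaths, stratum_sizes, reference_rates):
    """Standardized mortality ratio = observed / expected deaths,
    where expected deaths apply the external reference rates
    to the study group's own stratum sizes."""
    expected = sum(n * rate for n, rate in zip(stratum_sizes, reference_rates))
    return observed_deaths / expected, expected


reference = [0.12, 0.50]  # external reference rates for strata 1 and 2

smr_a, expected_a = smr(110, [100, 500], reference)
smr_b, expected_b = smr(70, [500, 100], reference)

print(f"A: expected {expected_a:.0f}, SMR {smr_a:.2f}")
print(f"B: expected {expected_b:.0f}, SMR {smr_b:.2f}")
```

Each SMR compares a group with the reference population using that group's own stratum distribution, so the two SMRs are standardized to different weights.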
another direct adjustment example (misclassification slides)
--table on the powerpoint -study group A: potential confounder stratum 1= no.- 100, cases- 20, and incidence[cases / no.]- 20 stratum 2= no.- 200, cases- 100, and incidence[cases / no.]- 50 -study group B: potential confounder stratum 1= no.- 400, cases- 40, and incidence[cases / no.]- 10 stratum 2= no.- 200, cases- 80, and incidence[cases / no.]- 40 -AR (%): stratum 1= 10 stratum 2= 10 -RR: stratum 1= 2.00 stratum 2= 1.25 -older population: potential confounder stratum 1= no.- 100, expected cases if A rates- 20, and expected cases if B rates- 10 stratum 2= no.- 500, expected cases if A rates- 250, and expected cases if B rates- 200 -total: study group A= no.- 300, cases- 120, and incidence[cases / no.]- 40 study group B= no.- 600, cases- 120, and incidence[cases / no.]- 20 AR (%)= 20 RR= 2.00 older population= no.- 600, expected cases if A rates- 270, and expected cases if B rates- 210 -adjusted rate= expected cases if A rates- 45 and expected cases if B rates- 35 -AR= 10% -RR= 1.29
direct adjustment example (misclassification slides)
--table on the powerpoint -study group A: potential confounder stratum 1= no.- 100, cases- 20, and incidence[cases / no.]- 20 stratum 2= no.- 200, cases- 100, and incidence[cases / no.]- 50 -study group B: potential confounder stratum 1= no.- 400, cases- 40, and incidence[cases / no.]- 10 stratum 2= no.- 200, cases- 80, and incidence[cases / no.]- 40 -AR (%): stratum 1= 10 stratum 2= 10 -RR: stratum 1= 2.00 stratum 2= 1.25 -younger population: potential confounder stratum 1= no.- 500, expected cases if A rates- 100, and expected cases if B rates- 50 stratum 2= no.- 100, expected cases if A rates- 50, and expected cases if B rates- 40 -total: study group A= no.- 300, cases- 120, and incidence[cases / no.]- 40 study group B= no.- 600, cases- 120, and incidence[cases / no.]- 20 AR (%)= 20 RR= 2.00 younger population= no.- 600, expected cases if A rates- 150, and expected cases if B rates- 90 -adjusted rate= expected cases if A rates- 25 and expected cases if B rates- 15 -AR= 10% -RR= 1.67
mortality rate in six countries 1986 (confounding and mediation slides)
--tables are on the slides -first looked at unadjusted (is the observed association causal in nature i.e. is there something about living in Costa Rica or Venezuela that makes the population have lower risk of death than the population of Canada or the US) -then looked at possible confounders (note= have to think about confounders when investigating)
DAGs again... (confounding and mediation slides)
--the DAG (also known as "causal diagram") is a formal and more elaborate extension of traditional graphs to represent confounding --in DAGs the direction of the association between the variables of interest and other unknown confounders is explicitly displayed to facilitate and guide the causal inference process (note= indicate the direction of the association) --in DAG jargon the confounding effect is called a "backdoor path" --note= most simple DAG: E -> D
causation (causation slides)
--the backbone of epidemiology --E -> D [if the association is causal] -could the association be real (asked when looking at causation) -ex= ice cream sales are associated with increased homicide but there's no reason for this (it is not a causal relationship (what actually happened was temperatures increased))
residual analysis (multivariate analysis I: continuous outcomes slides)
--the basic idea of residual analysis is to investigate the observed residuals to see if they behave "properly" --we analyze the residuals to see if they support the assumptions of linearity, independence, normality, and equal variances
meta-analysis (causation slides)
--the epidemiologic study of epidemiologic studies -quantitative approach for systematically assessing the results of previous research in order to arrive at conclusions about the body of research *-meta-analysis uses "study" as the unit of analysis rather than "individual"* --slide shows an image of a meta-analysis chart -odds ratios and 95% confidence intervals for glutathione S-transferase M1 (GSTM1) null status and bladder cancer risk; solid circles are proportional in area to the number of cases and the vertical axis is on a log scale
misclassification (misclassification slides)
--the erroneous classification of an individual, a value, or an attribute into a category other than that to which it should be assigned (a dictionary of epidemiology) --occurs when exposed individual is classified as unexposed or vice versa or diseased individual is classified as non-diseased or vice versa --can result in *over or underestimation* of the true association --direction of bias will depend on whether misclassification is *differential vs non-differential* (and how misclassification happened) *-non-differential generally results in bias towards null* *-differential can result in bias toward or away from the null* --can happen in -disease -exposure -confounders --some people use misclassification for errors in categorical variables and measurement error for errors in continuous variables *-be careful not to fall into the trap of ignoring sources of error other than the measurement techniques themselves*
multiplicative interaction, graphically (interaction and effect modification slides)
--the graph is on the slides --no multiplicative interaction= parallel lines that won't cross --multiplicative interaction= lines will cross
assumption 2: independence (multivariate analysis I: continuous outcomes slides)
--the outcome or error/ residual are independent -the outcome or errors are uncorrelated --the justification of this assumption requires knowledge of study design or data collection *-ex= longitudinal studies are likely to have correlated error terms (e.g., autocorrelation)* -this is because measures from a person at t1 are correlated with measures from the same person at t2 --if this assumption is violated the SEs of the regression coefficients are likely biased --therefore the significance tests are not accurate --note= outcome and errors aren't associated with each other --the first 2 assumptions require knowledge of study design or data collection in order to determine the validity of the assumption *--this equates to: no important info (x) being left out of the model*
differential misclassification (misclassification slides)
--the probability of misclassification is *different* in different study groups -differential misclassification of disease: higher sensitivity of detecting disease among exposed than unexposed individuals e.g., surveillance bias -differential misclassification of exposure: higher sensitivity of detecting exposure in the cases than controls e.g., recall bias, interviewer bias *--differential misclassification leads to bias and the magnitude of this bias can be substantial and statistically significant* *--note= can result in bias towards or away from the null*
relative risk (risk ratio) (causation slides)
--the ratio of risks for two populations *--RR = Rexposed / Runexposed* --ranges from *0 to + infinity has no units* *--null hypothesis: RR= 1* --same formula for *incidence rate ratio (IRR) = rate among exposed / rate among unexposed*
interpreting interaction (interaction and effect modification slides)
--there are many reasons why an observed effect of an exposure may differ according to the level or presence of a 3rd variable -heterogeneity due to random variability -heterogeneity due to confounding -heterogeneity due to bias -heterogeneity due to differential intensity of exposure -interaction and host factors *conclusions* --if heterogeneity is present... is there interaction -what is the magnitude of the difference (p-value) -is it qualitative or quantitative -is it biologically plausible --if we conclude that there is interaction what should we do *-report the stratified measures of effect...* the interaction may be the most important finding of the study
Kaplan-Meier (survival analysis slides)
--this is an estimator for obtaining the survival function from data --non-parametric method (don't have to worry about distributions of data) *--the key component of survival analysis is knowing not whether the event occurred but WHEN the event occurred -useful due to censoring* *--note= can't adjust for anything due to this we moved to regression techniques* -only can compare groups I think she said
key assumption: independence of censoring and survival (survival analysis slides)
--those censored at t have the same prognosis as those not censored at t --examples of possible violations -occupational= LTFU --note= don't want LTFU due to exposure or outcome (want LTFU for random things not related to this (things wanted are like someone moved away))
time scale (survival analysis slides)
--time from diagnosis -time from AIDS to death --time from treatment -time from mastectomy to death --time from infection -time from HIV to AIDS --chronological age --note= these are ex of time scales
direct adjustment (misclassification slides)
--traditionally used for age adjustment when comparing morbidity and mortality rates across countries, regions, or time periods *--basic approach -calculate the incidence of the study groups within strata of the potential confounder -identify standard population with a specific number of individuals in each stratum -calculate expected number of cases in each stratum of the standard population -overall sums of expected cases in the standard population is divided by total number of individuals in the standard population* --there's a study group A, study group B, and a standard population --look at table on the powerpoint --adjusted incidence -I[for A] = Σi[I(for Ai) x Wi] / ΣiWi -I[for B] = Σi[I(for Bi) x Wi] / ΣiWi --adjusted AR= I[for A] - I[for B] --adjusted RR= I[for A] / I[for B]
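The weighted-average formula above can be sketched as follows, using the stratum-specific incidences from the direct adjustment examples in these notes (A: 20% and 50%; B: 10% and 40%) and the two standard populations shown there. Function and variable names are made up for illustration.

```python
def directly_adjusted_rate(stratum_rates, standard_weights):
    """Weighted average of stratum-specific rates, with weights W_i
    equal to the number of people in each stratum of the standard population."""
    expected_cases = sum(r * w for r, w in zip(stratum_rates, standard_weights))
    return expected_cases / sum(standard_weights)


rates_a = [20 / 100, 100 / 200]  # study group A, strata 1 and 2
rates_b = [40 / 400, 80 / 200]   # study group B, strata 1 and 2

for label, standard in [("younger", [500, 100]), ("older", [100, 500])]:
    adj_a = directly_adjusted_rate(rates_a, standard)
    adj_b = directly_adjusted_rate(rates_b, standard)
    print(f"{label}: A={adj_a:.2f} B={adj_b:.2f} "
          f"AR={adj_a - adj_b:.2f} RR={adj_a / adj_b:.2f}")
```

In this example the adjusted AR is 10% under either standard because the stratum-specific ARs are equal, while the adjusted RR depends on which standard population is chosen because the stratum-specific RRs differ.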
collider (confounding and mediation slides)
--two causes for coming to hospital yellow fingers --(+)-> hospital <-(+)-- lung cancer --select subjects in hospital yellow fingers --(+)-> hospital <-(+)-- lung cancer yellow fingers <---(- or + and)---> lung cancer --conditioning on a collider induces an association between its causes --"and" and "or" selection leads to different biases -note= generally induces an association when there shouldn't be one
types of causal relationships (causation slides)
--type 1= exposure- 1 and nonexposure- 1 = description- doomed (whether or not the exposure is there you'll get the disease) --type 2= exposure- 1 and nonexposure- 0 = description- exposure is causal --type 3= exposure- 0 and nonexposure- 1 = description- exposure is preventive --type 4= exposure- 0 and nonexposure- 0 = description- immune (whether or not you have the exposure you won't have the disease) --note= basically have 4 types of causal relationships
penalized likelihood criteria: AIC and BIC (multivariate analysis II: categorical outcomes slides)
--used for choosing best predictor subsets in regression and often used for comparing non-nested models --Akaike Information Criterion (AIC) -estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model *-lower AIC means a model is considered to be closer to the truth* --Bayesian Information Criterion (BIC) -estimate of a function of the posterior probability of a model being true under a certain Bayesian setup *-lower BIC means that a model is considered more likely to be the true model -BIC penalizes model complexity more heavily than AIC*
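The slides do not give the formulas, so this sketch assumes the standard definitions: AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations, and ln(L) the maximized log likelihood. The log likelihoods below are hypothetical numbers, not output from a real model.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion (standard definition, assumed here)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion (standard definition, assumed here)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical comparison of two models fit to the same n = 100 observations.
n = 100
small = {"ll": -104.0, "k": 3}
large = {"ll": -100.0, "k": 6}

for name, m in [("small", small), ("large", large)]:
    print(name, aic(m["ll"], m["k"]), round(bic(m["ll"], m["k"], n), 1))
```

With these made-up numbers AIC slightly prefers the larger model while BIC prefers the smaller one, since each extra parameter costs ln(n) in BIC versus 2 in AIC (BIC's penalty is the heavier one whenever n is 8 or more).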
statistical interaction (interaction and effect modification slides)
--using data from a specific study we estimate the amount of statistical interaction between two exposures in the study group by comparing estimated measures of association (e.g., RR, RD) for one exposure across different levels of the other exposure --these estimates however would almost never perfectly represent the true effect parameters in the source population -estimates of statistical interaction may reflect random and nonrandom error (i.e., chance and bias) as well as true effect modification --thus the same accuracy problems that threaten the estimation of exposure effect also threaten the estimation of effect modification --furthermore in the assessment of effect measure modification the amount of error in estimating the exposure effect might vary across levels of covariate (e.g., different levels of confounding by gender) --since effect measure modification and statistical interaction are measure dependent it is important to specify the measure being used for their assessment --note= never going to perfectly represent the general population due to bias and error *--note= selection bias= influence EMM* *--note= info bias= influence EMM* *--note= specify measure used for interaction and EMM*
stratification assumptions (misclassification slides)
--virtually assumption free --however... -assumes strata are meaningful and properly defined there should be homogeneity within stratum -lack of residual confounding --note= all dependent on groups being meaningful and well defined --note= should have homogeneity within your stratum
additive interaction, cohort study: homogeneity of effects approach (interaction and effect modification slides)
--when the *absolute difference* in the risk/ rate between exposed (X+) and unexposed (X-) is not homogeneous in strata of the putative effect modifier (Z) -aflatoxin= no -> HBV= no -> incidence rate= 1.0 -> attributable risk= 0 aflatoxin= no -> HBV= yes -> incidence rate= 5.0 -> attributable risk= 4.0 -aflatoxin= yes -> HBV= no -> incidence rate= 21.0 -> attributable risk= 0 aflatoxin= yes -> HBV= yes -> incidence rate= 25.0 -> attributable risk= 4.0 -note= notice how there is no difference between both yeses so there is no additive interaction *-absolute risk difference between exposed and unexposed= attributable risk* -because ARs associated with HBV are not modified by aflatoxin there is no additive interaction -note= can even be 3.8 and 4.0 because they are close (rarely get the same number but you'll get close numbers) --when the *absolute difference* in the risk/ rate between exposed (X+) and unexposed (X-) is not homogeneous in strata of the putative effect modifier (Z) -aflatoxin= no -> HBV= no -> incidence rate= 1.0 -> attributable risk= 0 aflatoxin= no -> HBV= yes -> incidence rate= 5.0 -> attributable risk= 4.0 -aflatoxin= yes -> HBV= no -> incidence rate= 10.0 -> attributable risk= 0 aflatoxin= yes -> HBV= yes -> incidence rate= 40.0 -> attributable risk= 30.0 -note= there is a difference between yeses so this is heterogeneous (additive interaction) *-attributable risk= absolute risk difference between exposed and unexposed* -because ARs associated with HBV are modified by aflatoxin there is additive interaction
multiplicative interaction, cohort study: homogeneity of effects approach (interaction and effect modification slides)
--when the relative difference in the risk/ rate between exposed (X+) and unexposed (X-) is not homogeneous in strata of the putative effect modifier (Z) -aflatoxin= no -> HBV= no -> incidence rate= 2.0 -> risk ratio= 1.0 aflatoxin= no -> HBV= yes -> incidence rate= 4.0 -> risk ratio= 2.0 -aflatoxin= yes -> HBV= no -> incidence rate= 20.0 -> risk ratio= 1.0 aflatoxin= yes -> HBV= yes -> incidence rate= 40.0 -> risk ratio= 2.0 -note= no differences between the yeses -because RRs associated with HBV are not modified by aflatoxin there is *no multiplicative interaction* --when the relative difference in the risk/ rate between exposed (X+) and unexposed (X-) is not homogeneous in strata of the putative effect modifier (Z) -aflatoxin= no -> HBV= no -> incidence rate= 2.0 -> risk ratio= 1.0 aflatoxin= no -> HBV= yes -> incidence rate= 4.0 -> risk ratio= 2.0 -aflatoxin= yes -> HBV= no -> incidence rate= 10.0 -> risk ratio= 1.0 aflatoxin= yes -> HBV= yes -> incidence rate= 120.0 -> risk ratio= 12.0 -note= the yeses are different -because RRs associated with HBV are modified by aflatoxin there is *multiplicative interaction*
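A minimal sketch of the homogeneity of effects check: compute the attributable risk (difference) and the risk ratio for HBV within each aflatoxin stratum and compare them across strata. The function name and the (aflatoxin, HBV) dictionary layout are assumptions; the incidence rates are the ones from the first multiplicative example above.

```python
def stratum_effects(rates):
    """rates maps (z, x) -> incidence rate, with z = effect modifier (0/1)
    and x = exposure (0/1). Returns AR and RR for x within each level of z."""
    effects = {}
    for z in (0, 1):
        unexposed = rates[(z, 0)]
        exposed = rates[(z, 1)]
        effects[z] = {"ar": exposed - unexposed, "rr": exposed / unexposed}
    return effects


# (aflatoxin, HBV) -> incidence rate, first multiplicative example above
rates = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 20.0, (1, 1): 40.0}

for z, e in stratum_effects(rates).items():
    print(f"aflatoxin={z}: AR={e['ar']:.1f} RR={e['rr']:.1f}")
```

In this table the RRs are homogeneous (2.0 and 2.0) but the ARs are not (2.0 and 20.0), so the same data show no multiplicative interaction yet do show additive interaction, which is why the scale has to be stated.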
dealing with confounding and mediation (confounding and mediation slides)
--y = α + β2(x) + ∑i(γi)(zi) + ∑j(μj)(vj) + e --g(y) = α + β2(x) + ∑i(γi)(zi) + ∑j(μj)(vj) + e --where x = exposure, y = disease, and z = set of confounders, and v = set of mediators --note= how much change in beta is accounted by change in μ --DAG E --(M)-> D E <- ( ) -> D
confounders example (confounding and mediation slides)
--yellow fingers <-(+)-- a common cause smoking --(+)-> lung cancer --yellow fingers <---(+)--->lung cancer --yellow fingers <-(+)-- adjust for smoking smoking --(+)-> lung cancer --no association between yellow fingers and lung cancer after adjusting for the confounder --a confounder induces an association between its effects --conditioning on a confounder removes the association --condition = (restrict, stratify, adjust) --note= *positive confounder:* induce a relationship between exposure and outcome but once you condition (control) for the confounder there's no or a weaker association
is confounding a bias (confounding and mediation slides)
--yes -because if one concludes that a variable is a *causal* risk factor by missing the existence of confounding this is an *erroneous (biased)* conclusion -a confounded estimate is a "biased estimate" of the true *causal* association between exposure and disease --no -because 1. they are essentially different phenomena bias: *error* in the design of the study that results in an estimate of an association that is not really present (or absence of one that really exists) confounding: the observed association is *really present* (e.g., the total mortality is really higher in the US than in Mexico) even if it is not causal --note= if you don't account for confounding it is biased so it's ok to say it's biased in the conclusion if this is the case -the professor does not think confounding is a bias *confounding and bias* --schematic representation of positive confounding and selection bias -in the total reference population confounding factor C is more common in cases than in controls; assuming no random variability the study samples of cases and controls reflect the higher frequency of C in cases -in the total reference population the frequency of factor S is the same in cases and controls; however through the selection process it becomes more common in cases than in controls -thus confounding exists "in nature" whereas selection bias is a result of the sampling process -look at the image on the slide for the pie interpretation --note= how does bias influence our measure of confounding (ex= how does selection bias influence confounding) *is confounding a bias* *--NO* -2.
because of the different practical public health implications --the relationship between type of evidence needed in epidemiologic studies and type of prevention carried out (primary or secondary) -GOAL= *primary prevention:* prevention or cessation of risk factor exposure (e.g., saturated fat intake and atherosclerosis) TYPE OF EVIDENCE NEEDED= causal association *must* be present otherwise intervention on risk factor will not affect disease outcome for example if fat did not cause atherosclerosis a lower fat intake would not affect atherosclerosis risk -GOAL= *secondary prevention:* early diagnosis via selective screening of "high risk" subjects (e.g., identification of individuals with high triglyceride levels) TYPE OF EVIDENCE NEEDED= association may be *either* causal or statistical (the latter must not be biased) that is the association may be *confounded* for example even if hypertriglyceridemic individuals had a higher probability of developing atherosclerosis disease because of the confounding effect of low high-density lipoprotein levels atherosclerosis is *truly* more common in these individuals
confounding - diagram (confounding and mediation slides)
potential confounder ↓ ↓ exposure of interest ---> disease/ outcome ⇡ relationship of interest --note= arrows: association (except for the relationship of interest arrow, which is just there to label the ---> arrow)
residual statistics (multivariate analysis I: continuous outcomes slides)
reminder: residuals --not all of the data in a sample will fall right on the least squares regression line --the vertical distance between any one data point yi and its estimated value "y hat"i is its observed "residual"= *ei = yi - "y hat"i* --each observed residual can be thought of as an estimate of the actual unknown "true error" term= *ei = Yi - E(Yi)* [note= not sure if it was actually an e]
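A minimal sketch of observed residuals from a simple least squares line, ei = yi - "y hat"i. The data points and function names are made up for illustration; the class itself uses STATA (e.g., rvfplot) for this.

```python
def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y):
    """Observed residuals e_i = y_i - y_hat_i (vertical distance to the line)."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data: the points do not fall exactly on the fitted line.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]
print([round(e, 2) for e in residuals(x, y)])
```

When the model includes an intercept, least squares residuals sum to zero, so it is their pattern (against fitted values, for example) rather than their average that is informative.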