Logistic regression, Poisson regression & generalized linear models
Different models
- A binomial model with a logit link leads to logistic regression.
- A Poisson distribution with a log link leads to Poisson regression.
- A binomial distribution with an identity link leads to absolute risk regression.
- A binomial distribution with a log link leads to relative risk regression.
- R> In R we can use glm(..., family = distribution("linkfunction")), e.g. glm(y ~ x, family = binomial("logit")).
Negative binomial regression
- Alternative approach: model the heterogeneity
- Assumptions:
  - The event rate λi has a Gamma distribution with mean μi = exp(β0 + β1xi1 + ... + βp xip) and variance equal to μi²/r.
  - Conditional on λi, the counts Yi have a Poisson distribution with mean λi.
- Under these assumptions, the marginal distribution of Yi is negative binomial.
- The shape parameter r measures the amount of overdispersion and must be estimated from the data.
- The variance is given by Var(Yi) = μi + μi²/r, so Var(Yi) > μi.
Special Observations Outliers
- An outlier is an observation with an unexpected result (Yi value).
- In multivariable linear regression we look at Yi − Ŷi.
- This is not a good measure here, but there are several other residual types:
  - Pearson residual: zi = (Yi − π̂i) / √(π̂i(1 − π̂i))
    (divided by the SD to account for the nonconstant variance)
  - Deviance residual: di = sign(Yi − π̂i) √(−2[Yi log(π̂i) + (1 − Yi) log(1 − π̂i)])
    (contribution of observation i to the deviance)
- The residuals can be 'studentized' to correct for the estimation of πi.
- Studentized residuals should have constant variance.

How to use residuals:
- Identify outliers (compare against the 95th percentile of the normal distribution).
- Look for deviating patterns and get ideas about how to improve the model using several plots: residuals against covariates; residuals against the estimated probability π̂i.
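As a sketch, the two residual formulas can be computed directly in pure Python (the function names are ours, not from the slides):

```python
import math

def pearson_residual(y, pi_hat):
    """z_i = (Y_i - pi_hat) / sqrt(pi_hat * (1 - pi_hat))"""
    return (y - pi_hat) / math.sqrt(pi_hat * (1.0 - pi_hat))

def deviance_residual(y, pi_hat):
    """d_i = sign(Y_i - pi_hat) * sqrt(-2 [Y_i log(pi_hat) + (1 - Y_i) log(1 - pi_hat)])"""
    sign = 1.0 if y - pi_hat >= 0 else -1.0
    return sign * math.sqrt(-2.0 * (y * math.log(pi_hat)
                                    + (1 - y) * math.log(1.0 - pi_hat)))
```

For an observed event (y = 1) with π̂ = 0.8, the Pearson residual is 0.2/0.4 = 0.5; both residuals are positive when the event was more likely than predicted and negative otherwise.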
Guidelines for the number of covariates What to do if there are not enough cases per variable?
- Apply stricter criteria in the variable selection procedure.
- If still at the study design stage: try to remove the need for many adjusting variables by a matched design and conditional logistic regression.
- Advanced statistical techniques (discussed in the next chapter):
  - Exact logistic regression
  - Firth's bias correction
  - Bayesian methods
Poisson distribution
- Assume we want to model the number of falls in elderly people.
- Simple idea: record whether (or not) a person falls within a time period/interval and use logistic regression.
- This becomes more accurate if we use shorter time intervals.
- In the limit, as we make the intervals smaller, the Poisson distribution arises for the total number of falls, provided that:
  - The probability of falling is equal for all time periods.
  - The probability of falling in a period does not depend on falls that occurred in other periods.

When Y has a Poisson distribution: P(Y = y) = (μ^y / y!) exp(−μ)
- The factorial y! is the product of all integers less than or equal to y, i.e. y! = y × (y − 1) × ... × 2 × 1.
- Both the mean and the variance are equal to μ.
- μ is called the rate.

To include the effects of covariates, we can model the rate of a Poisson distribution as: log(μi) = β0 + β1xi1 + ... + βp xip, or μi = exp(β0 + β1xi1 + ... + βp xip).
- This is the Poisson regression model.
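A minimal Python check of the Poisson probabilities and the log-linear rate model (function names are illustrative):

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y) = mu^y / y! * exp(-mu)"""
    return mu ** y / math.factorial(y) * math.exp(-mu)

def rate(beta, x):
    """mu_i = exp(beta0 + beta1*x1 + ... + betap*xp); beta[0] is the intercept."""
    return math.exp(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))
```

Summing y · P(Y = y) over a wide range of y recovers the rate μ, confirming that the mean equals μ.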
Conditional logistic regression Why match?
- By matching at the design stage we can make the distribution of matching variables (confounders) more balanced between cases and controls, thereby increasing efficiency.
- However: the exposure distribution may also change ⇒ this creates a bias.
- Matching alone is not enough to remove confounding.
- Therefore we need to account for the matching in the analysis.
Continuation ratio model
- Can be seen as a kind of discrete survival model.
- Models the odds of being in category j given that the category is at least j.
Complete or quasi-complete separation
- Complete or quasi-complete separation: the outcome separates a covariate (or a linear combination of covariates) completely or almost completely.
- Common instance of separation: only events or only non-events have been observed for a category of a categorical covariate.
- Consequences of complete or quasi-complete separation:
  - The estimated odds ratio is 0 or very large (infinity).
  - Confidence intervals are extremely wide, p-values are not reliable.
  - The problem is not always apparent; some programs do not give a warning.
Summary Deviance versus Pearson χ2-test
- The deviance test is a likelihood ratio test.
- Pearson's χ² is a score test.
- The deviance test usually has better statistical properties.
- However, the components of the Pearson χ² test might be easier to interpret.
- Usually the differences are very small.
Multinomial logistic regression
- Easy extension to K outcomes.
- Modeled as K − 1 logistic models relating outcome k to the baseline level.
Remarks
- Exact inference is also possible in smaller data sets.
- For one event and one non-event per stratum, and one dichotomous covariate, the score test is identical to McNemar's test.
- For one dichotomous covariate, the score test is identical to the Mantel-Haenszel test.
- The stratum variable cannot be included as a covariate, but interactions with the stratum variable can still be taken into account.
Model building steps after selecting the variables in the model:
- Examine the assumption of linearity for continuous covariates.
  - For example by using splines or polynomial expansion.
- Interactions: test a) a few plausible interactions, b) all possible interactions, but conservatively (e.g. with a significance level of 0.01), or c) use global tests.
- Further examine the goodness-of-fit of the model by looking at calibration, residuals and special points.
Embedding tests
- Extend the model in various directions:
  - interaction terms,
  - quadratic terms,
  - etc.
- The model of interest is 'embedded' in the extended model.
- Now we can perform a (likelihood ratio) test.
- This is a common way of examining the fit of a model.
- Be aware of multiple testing.
Guidelines for the number of covariates
- For logistic regression to work 'properly' we must have enough observations relative to the number of events.
- Rule of thumb: '10 cases per variable'.
(Unconditional) logistic regression in case-control studies Interpretation of coefficients
- If we apply logistic regression, we can expect the intercept to be overestimated by the log of the ratio of the selection probabilities, while the estimates of the odds ratios are unchanged.
- This means that we can analyze the data with a logistic regression program as if it were a cohort study!
- The intercept β0 = α + log(P(Z = 1|Y = 1) / P(Z = 1|Y = 0)) will be biased, but we are usually not interested in this coefficient.
- Consequence: the model cannot be used for prediction in new individuals from the population!
Proportional odds model
- Imagine we have an ordinal outcome with K categories.
- Use the following K − 1 logistic regression models: logit P(Y > i) = αi + β1x1 + ... + βp xp, for i = 1, ..., K − 1.
- αi is the log odds of the outcome being greater than category i versus it being category i or lower, when all covariates are zero.
- The βs are the log odds ratios of the outcome being greater than category i, for all i (model assumption: the βs do not depend on category i).
Interaction
- Interaction effect: the effect of a variable X1 on the outcome (log odds) depends on another variable X2.
- Works the same way in logistic regression as in linear regression, but now on the log odds scale.
- Important caveat: interaction on the log-odds scale does not imply interaction on the probability scale, or vice versa.
Review of key points
- Logistic regression and Poisson regression are examples of generalized linear models.
- The response is modeled with a distribution from the exponential family and connected with a linear predictor using a link function.
- The multinomial regression model generalizes the logistic regression model to categorical outcomes with more than two levels.
- With an ordinal response, the proportional odds model or continuation ratio models are popular choices.
- In the proportional odds model the intercepts differ for the different categories, but the odds ratios for a predictor remain the same.
Generalized linear models
- Logistic regression and Poisson regression are part of a larger family of models called generalized linear models.
- A general framework by McCullagh and Nelder for a wide class of models:
  - λ = β0 + β1X1 + ... + βp Xp (linear predictor)
  - g(μ) = λ (link function)
  - Y ∼ F(μ) (distribution)

GLMs are characterized by:
- the distribution F()
- the link function g()
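A minimal sketch of the three GLM building blocks, using the logit link of logistic regression as the example (the function names are ours, not from the slides):

```python
import math

# GLM building blocks: a linear predictor lambda, a link g with g(mu) = lambda,
# and a distribution F for Y. Choosing binomial F with the logit link below
# recovers logistic regression; swapping in a log link and Poisson F gives
# Poisson regression.

def linear_predictor(beta, x):
    """lambda = beta0 + beta1*x1 + ... + betap*xp"""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

def logit(mu):
    """Link function g(mu) = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def inv_logit(lam):
    """Inverse link: maps the linear predictor back to the mean."""
    return 1.0 / (1.0 + math.exp(-lam))
```

The link and its inverse are exact inverses of each other, which is what lets the fitted mean move between the probability scale and the unconstrained linear-predictor scale.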
Review of key points
- Many model diagnostics that are used in linear regression are also available for logistic regression, but they are often defined and calculated differently.
- Grouped/aggregated data: compare the model to a saturated model that has a parameter for each covariate pattern, using the deviance or Pearson χ² test.
- Ungrouped data: the Hosmer and Lemeshow test, which creates groups based on risk quantiles (i.e. calibration), is suitable.
- It is recommended to check for and inspect outliers and leverage points.
- The ROC curve shows all possible combinations of sensitivity and specificity. The area under the ROC curve shows the discriminative ability of the model.
- Separation problems are a common and sometimes neglected issue in logistic regression.
Conditional logistic regression
- Matched case-control studies can be analyzed with conditional logistic regression: a technique for stratified analysis.
- Stratified data: strata s = 1, ..., S.

Examples of stratification:
- Multi-centre clinical trial with randomisation stratified by center.
- Two or more dichotomous outcomes per patient, for example in a cross-over trial. Here the patient is the stratum.

- The strata could be taken into account by giving each stratum its own intercept.
- Problem: if the number of strata is large, too many parameters have to be estimated and the likelihood method becomes unreliable.
- Solution: condition on the number of events in a stratum.
Regression splines for modeling nonlinear effects
- Model assumption of logistic regression: linearity of the log odds.
- Continuous covariates often have nonlinear effects.
- Nonlinearity in logistic regression is not easy to detect graphically: nonlinear effects should be tested in the model.
- Splines are often used to relax the linearity assumption of the log odds.
- X is a continuous predictor.
- The range of X is divided into intervals defined by the 'knots' κ1, ..., κk.
- The effect of X is modeled on each interval using a polynomial, typically linear, quadratic or cubic. At the knots the polynomials are smoothly connected.
- The spline is called restricted if the relation is linear on the outer intervals.
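The simplest case, a linear spline, can be sketched as a set of basis columns added to the model; restricted and cubic variants constrain or add terms but follow the same idea (the function name is ours):

```python
# Basis for a linear spline with knots kappa_1, ..., kappa_k.
# Each (x - kappa)+ term lets the slope change at a knot while the
# fitted curve stays continuous there.

def linear_spline_basis(x, knots):
    """Returns [x, (x - kappa_1)+, ..., (x - kappa_k)+]."""
    return [x] + [max(x - kappa, 0.0) for kappa in knots]
```

Each basis column enters the linear predictor like an ordinary covariate, so testing the spline terms against zero is a test of nonlinearity.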
Review of key points
- The Poisson model is a regression model for counts.
- The expected number of events is proportional to the exponential of the linear predictor: covariates have multiplicative effects on the expected counts.
- Coefficients have an interpretation as log rate ratios; exp() of the coefficients have an interpretation as incidence rate ratios.
- Use an offset to account for differences in exposure (follow-up time) between observations or patients.
- In the Poisson model the variance is equal to the mean, but there is usually overdispersion. Quasi-Poisson or negative binomial regression may be solutions.
- R> In R we use the glm command but with family=poisson or family=quasipoisson, or the glm.nb command in the MASS package.
R2 measures
- R² measures attempt to quantify the proportion of explained 'variation' in the logistic model, similar to the R² in multiple linear regression models.
- R² of Cox & Snell: R²_CS = 1 − (L0/L)^(2/n),
  - where L0 is the likelihood of the model with only a constant, L is the likelihood of the model under consideration, and n is the sample size.
- The problem with Cox & Snell's R² is that it cannot achieve the value of 1; therefore Nagelkerke proposed a modification:
- R² of Nagelkerke: R²_N = R²_CS / R²_MAX, with R²_MAX = 1 − L0^(2/n),
  - where L0, L and n are as above.
Special situations Non-convergence Solutions
- Remove the covariate if you can miss it.
- Leave the model unchanged, but do not report Wald confidence intervals and p-values of the affected covariate(s).
- Exact logistic regression
- Firth's bias correction
- Bayesian methods
Residuals
- Residuals can be defined in a similar way as for the logistic regression model.
- The Pearson residuals are: rP = (yi − μ̂i) / √μ̂i
- The deviance residuals are: rD = sign(yi − μ̂i) √(2[yi log(yi/μ̂i) − (yi − μ̂i)])
(Unconditional) logistic regression in case-control studies Setup and notation Imagine a simple case-control study
- Sample of size n1 with disease (Y = 1)
- Sample of size n0 without disease (Y = 0)
- For each sample we observe the status of the exposure and other covariates X = (X1, ..., Xp).
- We further introduce the random variable Z, which denotes whether a subject from the population is selected in the sample of cases/controls. (We only observe a subject if Z = 1.)
- In a case-control study generally P(Z = 1|Y = 1) > P(Z = 1|Y = 0), but we assume these inclusion probabilities do not depend on the covariates X.
Modeling count data
- Sometimes we encounter data where the outcome of interest is a count variable.
- A count variable can take the values 0, 1, 2, 3, ...
- A count variable typically represents the number of events that have occurred over a period of time.
- Examples:
  - The number of traffic accidents in an area
  - The number of incontinence episodes in patients with an overactive bladder
  - The number of falls of elderly patients
- Linear and logistic regression analysis are not suitable for modeling count data.
Review of key points
- The case-control design can be an efficient study design with rare outcomes.
- Its key characteristic is that cases and controls are sampled separately from the population.
- In unmatched studies we can apply ordinary logistic regression.
- In matched studies conditional logistic regression is most appropriate.
- In conditional logistic regression we condition on the number of events in a stratum and consider the conditional likelihood that the events come from the cases.
Extensions of logistic regression
- The logistic regression model can be extended to several outcome categories.
- We can distinguish between ordered and unordered categories.

Examples of ordered outcomes:
- pain levels (high, medium, low)
- education (high, medium, low)

Examples of unordered outcomes:
- location of tumor
- type of surgery
Special situations Non-convergence
- The optimization routine aims to find the parameter estimates that maximize the likelihood function.
- In some cases, the likelihood function is not maximized for any finite value of the β̂s.
- This leads to non-convergence: the optimization routine will not be able to find unique maximum likelihood parameter estimates, and typically terminates with a warning message.
- Typical cause of non-convergence: complete or quasi-complete separation.
Global goodness-of-fit test for non-grouped data
- We discuss only the Hosmer & Lemeshow (HL) test.
- The power of the HL test is low, limiting its usefulness in practice.
- Subjects are divided into m groups of approximately equal size, according to their predicted probability, and the observed and expected numbers of events are tabulated per group.
- Under H0 of perfect fit, the statistic Ĉ is approximately χ² distributed with df = m − 2.
Case-control studies
- When events are very rare, a regular sample from the population contains few cases → not efficient.
- A solution is a case-control study: separate samples of cases and controls.
- Two main variants: matched and unmatched.
Pearson's χ2-test
Idea: the test looks at the discrepancy between the observed (O) and expected (E) number of events within each covariate pattern: X² = Σ (Oi − Ei)² / Var(Oi − Ei). Note that the denominator is the variance of O − E.
- Under 'H0: model correct', X² is approximately χ² distributed with df = k − p, where k is the number of covariate patterns and p is the number of parameters.
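For grouped binomial data the statistic can be sketched as follows, under the assumption that Var(Oi − Ei) in group i is ni π̂i(1 − π̂i) with π̂i = Ei/ni (the function name is ours):

```python
def pearson_chi2(observed, expected, group_sizes):
    """X^2 = sum (O - E)^2 / Var(O - E), with Var = n * pi_hat * (1 - pi_hat)."""
    x2 = 0.0
    for o, e, n in zip(observed, expected, group_sizes):
        pi_hat = e / n
        x2 += (o - e) ** 2 / (n * pi_hat * (1.0 - pi_hat))
    return x2
```

A perfect fit gives X² = 0, and each group contributes its squared standardized discrepancy, which is what makes the components easy to inspect.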
Goodness-of-fit measures
- Most common goodness-of-fit measure: the likelihood ratio statistic comparing the model with the saturated model, also known as the 'deviance' D.
- For a well-fitting model, the deviance/df should not be much larger than 1.
- In our example, D = 8.25 on df = 7, so D/df = 1.2.
- When the counts are large enough, the deviance D has a χ² distribution.
- Likewise we can define the Pearson statistic.
Global goodness-of-fit tests ('omnibus tests')
I. Global goodness-of-fit tests for aggregated (grouped) data
- There is a limited number of different covariate patterns (k groups). Assumption: k is relatively small compared to n.
  - Pearson χ²-test
  - The likelihood ratio or deviance test

II. Global goodness-of-fit tests for non-grouped data
- Test of Hosmer & Lemeshow
Goodness-of-fit Overview
Goodness-of-fit: are the model assumptions satisfied?
- Global tests
- Embedding tests

Special observations:
- Are there individuals for whom the model fits badly? (Outliers)
- Are there individuals with strongly deviating covariate patterns? (Leverage points)
- Are there individuals with a relatively strong influence on the estimated βs? (Influential points)

Measures of discrimination
Measures of discrimination
How well does the model predict?
- Look at the 2×2 classification table of YPRED vs Y.
- For a certain probability cut-off value c: YPRED = 1 if π̂ > c and 0 otherwise.
- This is called the confusion matrix.

Then:
- sensitivity = % correctly classified among Y = 1
- specificity = % correctly classified among Y = 0
Heterogeneity in the exposure
In a cohort study individuals are followed for the occurrence of a certain event.
- Yi = number of events in category i
- ni = number of person-years in category i
- xi = covariates

Now Yi has a Poisson distribution with mean μi = ni λi, where λi = exp(β0 + β1X1i + ... + βp Xpi) is the event rate per person-year.

To account for variation in exposure/follow-up time between individuals we can use an offset: μi = exp(β0 + β1X1i + ... + βp Xpi + 1 · log(ni)),
- so log(ni) (the offset variable) can be seen as an additional covariate with its β value fixed to 1.
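A small sketch of the offset idea: the expected count is the person-years times the rate implied by the linear predictor, since a coefficient of 1 on log(ni) multiplies the rate by ni (the function name is ours):

```python
import math

def expected_count(beta, x, person_years):
    """mu_i = n_i * exp(beta0 + beta1*x1 + ... + betap*xp)
    = exp(linear predictor + 1 * log(n_i))."""
    rate = math.exp(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))
    return person_years * rate
```

Doubling the person-years doubles the expected count while leaving the rate, and hence the rate-ratio interpretation of the coefficients, unchanged.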
Conditional likelihood
In general, the conditional likelihood with ms cases in stratum s is the probability that the ms observed events come from those particular individuals, given the total number of events in the stratum.
- Note that this likelihood does not depend on the stratum-specific intercepts.
- General likelihood methods can be used.
Violation of the Poisson assumptions: overdispersion Causes
Possible causes of overdispersion:
- Dependence of events, either between or within patients, e.g. when the count is the number of people with an infectious disease
- Events cannot occur for some categories/patients, leading to too many zero counts
- Misspecification in the systematic part of the model (i.e. the linear predictor): nonlinear effects of covariates, omitted interaction effects and unobserved/omitted confounders

Main consequences:
- Standard errors too small
- P-values too small (inflated type I error rate)
Conditional logistic regression A bit of theory
Suppose we have one case and one control in each stratum (1-1 matching).
- What is the probability that individual 1 is the case, given that there is exactly one case in the stratum?
- For the logistic model this probability equals exp(x1'β) / (exp(x1'β) + exp(x2'β)).
- Note that this does not depend on the stratum-specific intercept βs0.
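A sketch of this calculation; the result is unchanged by the stratum intercept, which is exactly why conditioning removes the nuisance parameters. The function name and the explicit intercept argument are ours, added to make the cancellation visible:

```python
import math

def prob_first_is_case(beta, x1, x2, stratum_intercept=0.0):
    """P(individual 1 is the case | one case in the pair).
    The stratum intercept beta_s0 appears in both exponents and cancels."""
    eta1 = stratum_intercept + sum(b * v for b, v in zip(beta, x1))
    eta2 = stratum_intercept + sum(b * v for b, v in zip(beta, x2))
    return math.exp(eta1) / (math.exp(eta1) + math.exp(eta2))
```

Evaluating with two different stratum intercepts gives identical probabilities, so the conditional likelihood contains no information about, and needs no estimate of, the per-stratum intercepts.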
Violation of the Poisson assumptions: overdispersion
The Poisson assumption is correct if:
- The hazard rate (event rate) is constant within a category
- The hazard rate (event rate) does not differ across individuals

These assumptions are often not met, leading to more variation than is compatible with a Poisson distribution (i.e. Var(Yi) > μi).
- This is called overdispersion or extra-Poisson variation.
Violation of the Poisson assumptions: overdispersion Available Solutions
Available solutions:
- Quasi-Poisson regression (quasi-likelihood approach)
- Negative binomial regression
- Zero-inflated Poisson regression models (not further discussed here)

Warning: always do model building first.
- None of these solutions address overdispersion due to model misspecification.
Exact logistic regression
Can be done with the R package elrm (or the commercial program LOGXACT). This is very computationally intensive, and only feasible if the number of events and the number of different covariate patterns are small. The method is a generalization of the exact test and confidence interval for the odds ratio in a 2×2 table.
ROC curve
Definition of ROC curve:
- The ROC curve (receiver operating characteristic) is the sensitivity plotted against (1 − specificity) for all values of c.
- The area under the ROC curve (AUC) is a measure of discrimination.
- The AUC is the probability that, when the π̂ of an arbitrary individual with Y = 1 is compared with the π̂ of an arbitrary individual with Y = 0, the first is higher than the last.

Interpretation of the area under the ROC curve:
- AUC = 0.5: no discrimination (might as well flip a coin)
- AUC ≥ 0.7: acceptable discrimination
- AUC ≥ 0.8: excellent discrimination
- AUC ≥ 0.9: outstanding discrimination

Note: summarizing the curve as a single value causes information loss.
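The concordance interpretation of the AUC translates directly into code; a brute-force sketch comparing every case with every control (ties counted as 1/2, a common convention; the function name is ours):

```python
def auc(pi_cases, pi_controls):
    """AUC as the concordance probability: fraction of (case, control)
    pairs where the case has the higher predicted probability."""
    concordant = 0.0
    for p1 in pi_cases:       # predicted probabilities for Y = 1
        for p0 in pi_controls:  # predicted probabilities for Y = 0
            if p1 > p0:
                concordant += 1.0
            elif p1 == p0:
                concordant += 0.5
    return concordant / (len(pi_cases) * len(pi_controls))
```

Perfect separation of the predicted probabilities gives AUC = 1, and predictions that carry no information give 0.5, matching the interpretation guide above.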
Negative binomial regression versus quasi-Poisson regression
Differences between these two models:
- Negative binomial regression uses a full likelihood function, and parameters are estimated in the classical way.
- Quasi-Poisson regression can model both underdispersion and overdispersion; negative binomial regression only models overdispersion.
- Coefficients in quasi-Poisson regression are the same as in Poisson regression, but they are different in negative binomial regression.
- The interpretation of the coefficients as log incidence rate ratios is the same in all models.
Distributions and links
Distributions we can choose are e.g.: - binomial - Poisson - gamma - Gaussian - inverse Gaussian Link functions we can choose are e.g.: - identity - logit - log - probit (inverse of cumulative distribution function of normal) - square root
Special Observations Influential points
Do the estimates (or standard errors) change substantially when an individual is left out?
- For each covariate, one can compute the change in β̂ if individual i is left out.
- A global measure of the influence of individual i on all βs simultaneously is Ci = hi zi² / (p(1 − hi)²).
- This is the analog of Cook's influence measure in multiple linear regression. (SPSS does not divide by p.)

Recommendation:
- Inspect subjects with a large Ci; guidelines vary from > 4/(n − p) to > 1.
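A sketch of the influence measure as defined above, combining the leverage hi and the Pearson residual zi (the function name is ours; the thresholds mentioned in the slides are guidelines, not part of the formula):

```python
def influence_measure(h_i, z_i, p):
    """C_i = h_i * z_i^2 / (p * (1 - h_i)^2),
    the logistic-regression analog of Cook's distance."""
    return h_i * z_i ** 2 / (p * (1.0 - h_i) ** 2)
```

The measure grows with both the residual (poor fit for individual i) and the leverage (deviant covariate pattern), so influential points are those that are extreme in both respects at once.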
Firth's bias correction
Introduced as a way to reduce small sample bias in logistic regression, but it can also be used to prevent separation (implemented in the R packages logistf and brglm)
Special Observations Leverage Points
Leverage points are individuals with deviant covariate patterns.
- The leverage of individual i is characterized by a number hi.
- The hi's are the diagonal elements of a matrix called the "hat" matrix.
- Large hi: individual i has a covariate vector far from the mean covariate vector (taking correlation into account).
Model building
Model building: the process of specifying the different parts of a statistical model. This includes:
- Choosing the type of model/outcome
- Choosing the covariates/predictors (variable selection)
- Accounting for interaction effects between covariates
- Accounting for nonlinear effects, e.g. using transformations of covariates or possibly the outcome
- Verifying the assumptions and the goodness of fit of the model
Quasi-Poisson regression
Model for the expected value of Yi: μi = exp(β0 + β1xi1 + ... + βp xip)
Model for the variance of Yi: Var(Yi) = αμi
- The 'scale' parameter α (fixed to 1 in standard Poisson regression) will generally be larger than 1, allowing more variation.
- No assumption is made about the shape of the distribution, so standard likelihood estimation is not possible. Instead the quasi-likelihood approach is followed.
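A common moment estimator for the scale parameter is the Pearson statistic divided by the residual degrees of freedom; the slides do not spell out the estimator, so this is a sketch under that assumption (the function name is ours):

```python
def estimate_scale(y, mu, n_params):
    """alpha_hat = sum((y_i - mu_i)^2 / mu_i) / (n - p).
    Values above 1 indicate overdispersion; reported standard errors
    are then inflated by sqrt(alpha_hat)."""
    pearson_x2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return pearson_x2 / (len(y) - n_params)
```

When the fitted means match the data exactly the estimate is 0, and data that vary more than Poisson around their means push the estimate above 1.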
The likelihood ratio or deviance test
This test compares the fitted model with the complete ('saturated') model. Saturated: each group has its own parameter, so the model is always correct. The test statistic is the 'deviance' G.
- Under H0, G is approximately χ² distributed with df = k − p.

Conditions for both tests:
- Groups large enough (say 80% of groups have Ei ≥ 5 and (n − Ei) ≥ 5)
Bayesian methods
Usually we know that ORs higher than say 5 are extremely unlikely. This can be incorporated into a prior to avoid separation. Implemented in the function bayesglm() in R package arm.
(Unconditional) logistic regression in case-control studies Logistic relation
We can now show that the likelihood can be written as the product of the likelihood of the logistic regression model times a factor that does not depend on the βs, but only on the selection probabilities of cases and controls.