GLM
Classification error rate
(FN + FP) / n = 1 - accuracy
Working residual
(Observed - predicted) * first derivative of link function of (predicted)
Accuracy statistic
(TN + TP) / n Weighted average of specificity and sensitivity, where the weights are the proportions of observations belonging to the two classes (positive with sensitivity and negative with specificity)
Pearson statistic
1/n * sum of (y - y hat) ^2 / y hat Used as a performance metric for count GLMs Can be computed on the training or test set Scales the squared discrepancies by the predicted values to avoid the statistic being disproportionately dominated by large predicted values The smaller the statistic for the test set, the better the model
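For illustration, a minimal R sketch of this computation on a test set; the object names count_glm, dat_test, and the claims column are assumptions, not from the source:
pred <- predict(count_glm, newdata = dat_test, type = "response")
pearson <- mean((dat_test$claims - pred)^2 / pred)  # smaller test-set value suggests a better model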
Interpretation in terms of probability for a numeric predictor using a logit link
A unit increase in the numeric predictor changes the predicted probability of the target event by (new prediction - old prediction)
If we are more interested on positive outcomes, we should focus on
A higher sensitivity, not specificity
What type of distribution should we use on aggregate data that is continuous with a large mass at zero
A mixture of discrete and continuous components, so we could use the Tweedie distribution, which is a compound Poisson-gamma distribution, with a log link
Interpretation of coefficient, in terms of odds, for a numerical variable
A unit increase in x is associated with a multiplicative change of e^b in the odds of the target variable, holding all other features fixed
If X is a numeric, and we are using a logit link, interpret the coefficient
A unit increase in a numeric predictor with coefficient B is associated with the multiplicative change of e^B in the odds and a percentage change of 100*(e^B -1)
What is still important for all observations in the GLM
All observations are mutually independent
Link function
Denoted g() Links the target mean to the linear predictor, the linear combination of predictors Examples are identity, log, inverse
Log link
Always ensures positive predictions Easy to interpret ln(u) = n, so u = e^n
Deliverable of GLM
Analytical equation clearly showing how predicted mean of the target variable depends on features g(u) = n = B0 + B1*X1
What are three modeling options for highly skewed, continuous positive target variables
Apply log transformation and fit a normal linear model Build GLM with normal distribution and log link function Build GLM with gamma or inverse Gaussian distribution
What are three considerations for selecting a link function?
Appropriateness of predictions Interpretability Canonical link
If the GLM is fitted appropriately, deviance residuals follow these properties
Approximately normally distributed for most target distributions in the linear exponential family, so we can still look at the QQ plot No systematic patterns when plotted on their own or against predictors Approximately constant variance only if they are the standardized deviance residuals (otherwise heteroskedastic)
When we have ridge regression lambda go to infinity what happens to the intercept
All coefficients other than the intercept shrink toward 0, leaving only the intercept, so every prediction becomes the sample mean of the target variable
Interpretation in terms of probability for a categorical predictor using a logit link
Being in this category changes the target variable probability by (new prediction - old prediction) compared to being in the baseline category
Use dummyVars() to
Binarize categorical predictors with three or more levels to view each level separately and add or remove one level at a time, when doing stepwise selection, instead of having to choose between adding and removing all the dummy variables entirely
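A minimal caret sketch, assuming a data frame dat with hypothetical factors region and vehicle_type:
library(caret)
binarizer <- dummyVars(~ region + vehicle_type, data = dat, fullRank = TRUE)  # fullRank drops one level per factor
dat_bin <- cbind(dat, predict(binarizer, newdata = dat))
dat_bin$region <- NULL         # remove the original factors after attaching the dummy variables
dat_bin$vehicle_type <- NULL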
Why do we use the confusion matrix?
Binary classifiers produce a prediction of the probability that the event of interest occurs, but we want to translate this into predicted classes, so we need a pre-specified cutoff
What are some examples of target variables that will be used in GLM but shouldn't be used in linear models and why?
Binary variables Count variables Skewed, continuous variables These need to use GLM because they are not normally distributed
anova()
Can be used to compare deviance between models If we use test = "LRT" it will also perform a likelihood ratio test and give a p-value
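A minimal sketch, assuming glm_small is nested within glm_big:
anova(glm_small, glm_big, test = "LRT")  # compares deviances and reports a p-value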
Regularized GLMs pros
Categorical predictors: by using model matrices Binarization of categorical variables is done automatically, and each factor level is treated as a separate feature Tuning: can be tuned by cross validation using the same criterion ultimately used to judge the model against unseen test data Variable selection: for Lasso and elastic nets variable selection can be done by making the regularization parameter large
Do we need to binarize variables before linear regression or GLM
Categorical variables should be binarized, but the lm() and glm() functions will do this automatically
If we change a numeric variable with two levels, to a factor variable, what happens
Changing the binary variable will have no effect on the GLM model, but will affect the graphs
What are some examples of positive Continuous, right skewed data?
Claim amounts Incomes Insurance coverage amount
how should we reduce the number of factor levels?
Combine levels that have similar target distributions, or have very few observations to form more populous levels We need to find a home for the smaller groups If the target is similar for a large group of levels, you can just combine them all together in a big level called other
Cons of GLMs
Complex relationships: unable to capture non-linear or non-additive relationships unless additional features are manually incorporated Interpretability: for some link functions, like the inverse, the coefficients may be difficult to interpret
How do we compare predicted performance with ROC curves?
Compute the area under the curve, and the higher the better
Performance metrics for classification targets
Confusion matrix ROC curves (AUC)
If we want to make predictions for aggregate payments, we should
Create a number of claims model and a severity model, and then multiply them together Either use the predict() function, or find the Y hat manually by using the coefficients from the output
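A minimal sketch of the manual multiplication; freq_glm, sev_glm, and dat_test are assumed names:
pred_counts <- predict(freq_glm, newdata = dat_test, type = "response")
pred_sev <- predict(sev_glm, newdata = dat_test, type = "response")
pred_agg <- pred_counts * pred_sev  # predicted aggregate payments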
Deviance mathematically
D = 2 * (l_saturated - l) For a linear model, D = RSS / sigma^2 D = sum of d_i^2
Goodness of fit measures for GLM
Deviance Deviance residuals
Weights
Different observations in grouped data may have different exposures and thus different degrees of precision, so we can attach a higher weight to observations with a larger exposure so that these more credible observations carry more weight in the estimation of the model coefficients Used when data is averaged over homogeneous policies; a larger exposure means a more precise observation with lower variance, and the model gives more weight to these
drop1() function
Displays the AIC using the model along with the AIC that would result if the indicated variable was dropped Factor variables are looked at as wholes
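A minimal sketch, assuming a fitted model named my_glm:
drop1(my_glm)                # AIC if each term (factors treated as wholes) were dropped
drop1(my_glm, test = "LRT")  # optionally adds a likelihood ratio test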
Why are GLMs more flexible than linear models
Distribution of the target variable: GLMs are not confined to normal random variables; the target only has to be a member of the exponential family of distributions, and can be numerical or categorical Relationship between the target mean and linear predictors: GLMs set a function of the target mean to be linearly related to the predictors, instead of equating the mean directly with the linear combination of predictors
Under sampling
Drawing fewer observations from the negative class, and retaining all positive class observations to have more balanced data Drawback is there is now less data and information about the negative class which makes this less robust and more overfit
Grouped data
Each row of the data set corresponds to a collection of policyholders sharing the same set of predictor values rather than a single policyholder
Interpretability of link functions
Easy to interpret if it is easy to form statements that describe the association between the predictors, and the target mean in terms of model coefficient
If we use a gamma GLM with a log link, what does ensure?
Ensures all predictions are positive and easy to interpret which is why we use a log over the canonical link, which is inverse Use gamma if the target is positive, continuous and right skewed
If we use a normal distribution GLM with a log link this ensures what about our predictions?
Ensures positive predictions, but allows for the possibility that the observations of the target are negative
What happens when our cutoff is zero
Everything exceeds the cut off and is predicted positive, so the sensitivity is one and the specificity is zero
What happens when our cutoff is one
Everything is below the cut off and is predicted negative so the specificity is one and the sensitivity is zero
To find if we need an interaction term what should we look at?
Find predictors that you think will interact and look at a graph to confirm
confusionMatrix()
First two arguments need to be factors R will default to taking the alphanumerically first level as the level of interest unless you use the argument positive = ""
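A minimal sketch, assuming predicted probabilities pred_prob, a 0/1 test target dat_test$target, and a 0.5 cutoff:
library(caret)
pred_class <- factor(ifelse(pred_prob > 0.5, "1", "0"), levels = c("0", "1"))
confusionMatrix(pred_class, factor(dat_test$target, levels = c("0", "1")), positive = "1")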
Why may the p values change if you manually binarize the categorical variable
GLM internally binarizes all levels of a factor variable except for the baseline. Therefore, changing the baseline will cause different p-values to be calculated, since they relate to the hypothesis that a given level has a different impact than the baseline level.
What are the differences in GLM and linear models in terms of transforming data
In a GLM the transformation is on the target mean, while in a linear model transformations are on the observations In a GLM the data is modeled directly and the target variable itself is not transformed; only its mean is transformed through the link function
Monotonic relationship
Generally, just increasing or decreasing
par(pty="s")
Generates a square plotting region, which improves the appearance of the ROC plot
3 assumptions of GLMs
Given the predictor variable values, the target variables are independent Given the predictor variable values, the target variable's distribution is a member of the exponential family Given the predictor variable values, the expected value of the target variable equals the inverse of the link function applied to the linear predictor
Deviance
Global goodness of fit measure that measures the extent to which the GLM departs from the most elaborate GLM, or saturated model Null deviance measures when the target is predicted using its sample mean (similar to TSS)
How do we see if we should use some thing as an offset?
Graph the relationships in a scatterplot to see if they have a strong, proportional relationship
Normal canonical link
Identity
How can we check the prediction performance of GLM?
Check whether the area under the ROC curve on the test set increased
How does the cut off sort observations?
If the predicted probability is above the cut off then the event is predicted to occur, and if it is below then it is not predicted to occur
How to interpret the likelihood ratio test
If the test statistic is large, then the bigger model has a better fit to the training data and we should reject the null hypothesis that the extra coefficients should be zero
If we have a log link what is the different interpretations of using E or ln E as an offset
If we use ln E we are assuming that the target mean varies in direct proportion to E If we use E, then we are assuming that the target mean has an exponential relationship with E
What are the two differences between weight and offset?
Impose different structures on the mean and variance of the target variable Impose different structures on the form of the target variable (average or aggregate) Offsets multiply the prediction to get it on the scale we want, which is the total, whereas the target for weights is the ratio (average)
What is the goal of using weights?
Improve the reliability of the fitting procedure
Why should we refit the final model to the whole data set?
In a real world application, new data will be available in the future on a periodic basis and these data points can be combined with the existing data to improve the robustness of the model
What are the different reasons for transformations in linear models and GLM
In linear regression, the main reason for using a log transformation is to reduce skewness and in GLM we usually transform to ensure appropriate predictions and ease of model interpretation, not for skewness, because this can be accommodated directly by a suitable target distribution like gamma
Disadvantages to changing a numeric Variable with order to a factor
Inflates the dimension of the data as this adds multiple dummy variables May result in overfitting Stepwise selection, or regularization can help reduce these levels but the Model fitting may take longer to run
When asked to interpret results for GLM include the following
Interpret the precise values for the coefficients Comment on whether the sign of the coefficients makes sense Relate the findings to the big picture and how they can help clients make better decisions
Canonical link for gamma
Inverse link It's more common to use the log link because the inverse link doesn't guarantee positive predictions and is not easy to interpret
Applicability of the likelihood ratio test is restricted, because
It can only be used to compare one pair of GLMs at a time The simpler GLM must be a special case or nested within the more complex GLM
What is the downside of maximum likelihood estimation
It is occasionally plagued by convergence issues, which may happen when a non-canonical link is used, and no estimates will be produced
What does the link function transform
Just the target mean not the target observations
Over sampling
Keeps all original data, but oversamples with replacement from the positive class so those observations appear more than once Drawback is there is more data, which means more computational burden This should only be performed after we split the training and test data
Likelihood ratio test, mathematically
LRT = 2 * (l1 - l0) = D0 - D1
Why may we need to use a GLM instead of a linear model?
Linear models rely heavily on the target variable being normally distributed and this allows for negative values which would not be allowed if the range of target variables is positive Variance of the target variable depends on the mean Target variable is binary, which does not fit a normal distribution Relationship between the predictor variables and the target may not be linear
Canonical links
Link function associated with each target distribution
Deviance residuals
Local goodness of fit measure that is the signed square root of the contribution of the ith observation to D
If we have a positive mean, what type of link function should we use?
Log link is good since it ensures that the mean of the GLM is always positive and any positive value can be captured Used for Poisson, gamma, and inverse Gaussian, since they have a positive mean that is unbounded above
If we have a number of claims target that is a non-negative integer valued variable with a right skew we should use
Log link with Poisson distribution since it ensures that the model predictions are always non-negative, makes for an interpretable model and is the canonical link
Binomial canonical link
Logit Other links used are probit and cloglog
Binomial distribution can be used with what five different links
Logit - most commonly used and most interpretable Probit - uses the standard normal CDF, hard to interpret Cauchit - uses the Cauchy CDF, hard to interpret Log - not used because predictions won't stay between zero and one Complementary log-log = ln(-ln(1-p)), hard to interpret
If we have a mean restricted to the unit interval (0, 1), what type of link function should we use?
Logit link, since the mean is the probability of the event of interest Used for binary distributions
How to interpret the exploration of the relationship between categorical and numeric pairs
Look at a box plot split by the categorical variable, and look at the mean and median of the numeric variable split by the categorical variable These should vary significantly across different levels if it is an important predictor
How to interpret the exploration of the relationship between two categorical pairs
Look at filled bar charts color-filled by the target, and if the target is binary look at its mean by different levels of the predictors These proportions should vary significantly across different levels if it is an important predictor
How should we compare link functions?
Look at the predictive performance on the test set or if we want to look at the training set we can compare with AIC or BIC, so that the complexity of the model is being taken into consideration
QQ plot interpretation
Looks at the normality of the standardized deviance residuals We want the points to lie on the 45° line; if the points towards the ends deviate, there is more skewness than the model is handling
Difference between accuracy and AUC
Main difference is accuracy looks at the one specific cut off point and the AUC looks at all possible cutoff points
Confusion matrix
Matrix is a 2 x 2 table of four possibilities Prediction is what the model produced and reference is the actual class
Maximum likelihood estimation
Maximizes the likelihood of observing the given data Typically achieved by running optimization algorithms Produces estimates with desirable statistical properties like asymptotic unbiasedness, efficiency, and normality
What do we use to estimate the coefficients in GLM
Maximum likelihood estimation
Poisson distribution
Mean and variance have to be equal When we have over dispersion, the distribution can be tweaked
Test confusion matrix interpretation
Measures how well the classifier performs on unseen data This is what we care about We can compare this to the training matrix to detect overfitting
Regularization of GLMs
Minimizes deviance plus a regularization penalty, where the deviance measures goodness of fit and the regularization penalty measures model complexity Similar to linear models in that, as the regularization parameter increases, the GLM becomes less flexible; we can use cross validation to find the regularization parameter, and we can use lasso and ridge regression
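A minimal glmnet sketch of a lasso-penalized count GLM tuned by cross validation; dat and its claims column are assumed:
library(glmnet)
X <- model.matrix(claims ~ ., data = dat)[, -1]  # binarization handled via the model matrix
cv_fit <- cv.glmnet(X, dat$claims, family = "poisson", alpha = 1)  # alpha = 1 gives the lasso
coef(cv_fit, s = "lambda.min")  # a large enough lambda sets some coefficients exactly to zero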
Advantages of a tweedie distribution
More efficient to fit and maintain a single model Doesn't require the independence between claim counts and claim severity
Canonical link for Poisson
Natural log link
Response residual
The ordinary (raw) residual: observed - predicted
Structure of mean, and variance for offsets
Observations of the target variable should be aggregated over exposure units Exposure is in direct proportion to the mean of the target and leaves the variance unaffected
Structure of mean and variance for weights
Observations of the target variable should be averaged by exposure Variance of each observation is inversely related to the size of the exposure, which will serve as the weight Weights do not affect the mean of the target
Unbalanced data
One class of binary target variable is much more dominant than the other class in terms of proportion of observations
When should oversampling be performed
Only after the training data is split from the test data, because if we do it before we may have the same observations in multiple folds, which defeats the whole purpose of splitting them
What does the interpretation of the GLM coefficients depend on?
Only depends on the link function, since this determines the functional relationship between the target mean and the features Doesn't depend on the target distribution
Gamma distribution
PDF will be right skewed Right skewness will increase as the mean increases Most widely used for claim severity since the mean and variance are positively related Positive outcomes only Need target to be strictly positive
What type of distribution to be used on count data?
Poisson because it assumes a non negative integer value and is possibly right skewed Could also use negative binomial
Tweedie distribution
Poisson sum of gamma random variables Discrete probability mass at zero and a PDF on the positive real line cplm package in R
What is the effect of fixing unbalanced data?
Positive observations become more prevalent, so their predicted probabilities will increase and there will be a higher sensitivity and lower specificity
Appropriateness of predictions when selecting a link function
Range of the values of the target mean implied by GLM is consistent with the range of the values of the target mean in the situation
ROC curves
Receiver operating characteristic curve Graph that plots sensitivity against specificity at each cutoff ranging from 0 to 1 Specificity is plotted on the horizontal axis in reverse order The curve should bend to the upper left and approach the top left corner; the closer to the top left corner, the better the predictive power Used for looking at the performance of classification models (GLMs or trees)
Training confusion matrix interpretation
Reflects how well the classifier matches the training set
GLMs
Relate a function of the target mean linearly to a set of predictors
When we edit our data, what are some things we need to look for?
Remove target leakage variables Remove mysterious variables that we don't know the meaning of Remove observations that have missing values or values of zero where zero should not occur Convert numeric variables to factors if needed Relevel categorical variables so the baseline is the level with the most observations
What do we look for in the residuals versus fitted plot
Residuals are mostly scattered in a structureless manner, with no systematic patterns or fluctuation in spread
AIC and BIC for GLM
Same form as for linear models; since -2l equals the deviance D up to a constant, D can replace -2l in the formulas
Saturated model
Same target distribution and link function, but as many model parameters as the size of the training set Very flexible and overfitted
What happens as our cutoff rises
Sensitivity decreases and specificity increases
Is sensitivity or specificity more important
Sensitivity is more important because our main interest is a positive event
How do we choose a target distribution?
Should choose the one that best aligns with the characteristics of a given target variable
Inverse Gaussian distribution
Similar to gamma but more highly skewed Need target to be strictly positive
Canonical link for inverse Gaussian
Squared inverse link Also use log link
What is the goal of offsets?
Substantial improvement in Model fit
Fisher scoring
Summary will tell you how many iterations of solving the MLE were needed to converge the model Not too useful
Characteristics of a normal distribution
Symmetric about the mean Continuous Can assume all positive and negative values
Compare GLM and linear model flexibility
GLMs are more flexible and have a wider scope of applications
Specificity statistic
TN / (TN + FP) Proportion of negative observations that are correctly negative The larger, the better at identifying negative cases Also known as true negative rate Complement of false positive rate
Precision statistic
TP / (FP + TP) Proportion of positive predictions that truly belong to the positive class Captures how often when the model makes a positive prediction this prediction is correct
Sensitivity statistic
TP / (TP + FN) Proportion of positive observations that are correctly positive The larger, the better at identifying positive cases Also known as true positive rate
roc()
Takes the observed levels of the target and the predicted probabilities, supplied in that order, and produces a list that can be passed to the plot() and auc() functions
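A minimal pROC sketch, assuming observed classes dat_test$target and predicted probabilities pred_prob:
library(pROC)
roc_obj <- roc(dat_test$target, pred_prob)  # observed levels first, predicted probabilities second
par(pty = "s")
plot(roc_obj)
auc(roc_obj)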
Advantages to changing a numeric variable with order to factor (like age)
Taking this as a numeric variable imposes the restriction that it has a global monotonic association with the target across all categories, but this restriction is lifted if the values are treated as categorical predictors with dummy variables Provides more flexibility
Pros of GLMs
Target distribution: good at accommodating a wide variety of distributions for the target variable Interpretability: the model equation clearly shows how the target mean depends on features, and the coefficients are interpretable measures of the directional effect of features Implementation: simple to implement
Regularized GLMs cons
Target distribution: limited and restricted model forms allowed by glmnet Categorical predictors: possible to see some non-intuitive or nonsensical results when only a handful of the levels are selected Interpretability: coefficient estimates are more difficult to interpret because the variables are standardized
Two key components of GLM
Target distributions: choose one that aligns with the characteristics of the target Link function
Convexity
Target mean has to be increasing at a faster rate or decreasing at a slower rate than X increases
What is true if the area under the ROC curve is one
The classifier has perfect discriminatory power, no matter what the cut off is This looks like an upside down L
What is true if the area under the ROC curve is 0.5
The classifier is the naïve classifier that is random and classified based on P and 1 - P arbitrarily so sensitivity = P and specificity = 1-P Looks like the 45° line
Other things equal what is better for accuracy, sensitivity, specificity and precision
The higher these are, the more attractive the classifier
How do we interpret deviance?
The lower the deviance, the closer the GLM is to the model of perfect fit and the better the fit to the training set Can only be used to compare GLMs having the same target distribution Always decreases as new predictors are added, even if prediction performance would not improve, so it should not be used on its own to compare models with different degrees of complexity
What are the advantages of using a canonical link?
The math is simplified for the estimation procedure, and it makes it more likely to converge
Unbalanced data is especially bad if
The minority group is the positive class
In terms of accuracy, how do we know if a model is over fit?
The model is overfit if the accuracy, sensitivity, specificity and AUC are higher for the training set than for the test set
Interpretation of a coefficient, in terms of odds, for a categorical variable
The odds of the target variable for policyholders in this level are e^b times that in the baseline level
When choosing a link, it is more important to consider these factors, then whether it's the canonical link
The predictions provided by the link align with the characteristics of the target variable The resulting GLM is easy to interpret
To get a perfect fit we would need to have
The same number of observations as parameters (including intercept)
When we add an exposure variable, this means
The target is directly proportional to the exposure
If X is a dummy variable and we are using the log link, interpret the coefficient
The target mean when the categorical predictor lies in the non-baseline level is e^B times that when the categorical predictor is in the baseline level, holding all other predictors fixed The target mean at the non-baseline level is 100*(e^B - 1)% higher than that at the baseline level
What are weights and offsets designed to incorporate?
These are designed to incorporate a measure of exposure into a GLM to improve the fitting
Why don't we see R^2 and F stat in summary(GLM)
These are statistics and tests that assume normal distributions
Why are regular residuals not useful in GLM
They are not normally distributed and do not have constant variance
What is true of any reasonable classifier in terms of the ROC curve?
They should have an area under the curve higher than .5
Why do we like that the GLM relates a function of the target mean linearly to the predictors
This allows us to have linearity on a different scale and to analyze relationships between the predictors and the target mean that are more complex than purely additive
What are problems with unbalanced data?
This places more weight on the majority class, and the model tries to match training observations in that class without paying enough attention to the minority class Accuracy will not be a good performance measure
Likelihood ratio test
Traditional model selection method Compares GLMs with the same target distribution and link function Generalization of the t and F tests Captures the difference between the goodness of fit of the two GLMs
Four classifications of a confusion matrix
True positive: predicts that the event occurs and it does True negative: predicts the event does not occur and it doesn't False positive: predicts that the event occurs, but it doesn't (type I error) False negative: predicts that the event does not occur, but it does (type II error)
What are two solutions for unbalanced data
Under sampling Over sampling These are used to pick up the signal from the minority class correctly and translate this into results from the original unbalanced data
What type of distribution should be used on binary data?
Use a binomial or Bernoulli distribution since the mean is modeling the event probability on a classification problem
In practice, how should we approach unbalanced data?
Use a combination of under sampling and over sampling to retain information about the positive class and ease computational burden
If we have count data but have overdispersion what should we do
Use a quasipoisson distribution
What distribution should be used if we have positive, Continuous, right skewed data
Use gamma or inverse Gaussian to capture the skewness of the target variable directly, without using transformations Both need the target variable to be strictly positive
How do we come up with probability based statements for interpreting coefficients using the logit link?
Use p = e^n / (1 + e^n) Create an average policyholder as the baseline case, with the mean values of the numeric predictors and the most common level of the categorical predictors, then change one variable in this baseline case and look at the difference in the predictions
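A minimal sketch of the baseline-case approach, assuming a fitted logistic model logit_glm with hypothetical predictors age and region (most common level "urban"):
base <- data.frame(age = mean(dat$age), region = "urban")
new <- base
new$age <- new$age + 1
predict(logit_glm, newdata = new, type = "response") -
  predict(logit_glm, newdata = base, type = "response")  # change in predicted probability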
Logit link
Used for binary variables This is just the log link applied to the odds instead of the target mean so use the interpretations for the log link phrased in terms of changes in odds Odds = e^n
Offsets
Used when the value of the target variable in each row represents the number of claims aggregated over a group of homogeneous policies; different rows have different numbers of policies, so we can use the number of policies as an offset to better account for the number of claims in different rows Used when rows are aggregated over a group of homogeneous policies; a larger exposure means a larger u The offset's coefficient is known to be 1, so we don't need to estimate it
Over dispersion
Variance exceeds mean
Binarization
Viewing each level separately for categorical variables Has no effect on the model, but will help when we do stepwise selection so we can recognize each level Afterwards, we need to attach these variables to the data set, delete the original categorical variables, and then re-partition the data set into training and test sets
What are we trying to do by transforming the target in GLM
We are not trying to transform the target in GLM to make it more normal. We want it to better fit any exponential family distribution that the data follows. We are only transforming the mean of the target, not the target variable as a whole
How do we compare GLM and OLS methods
We can look at the test RMSE and choose the model with the lowest
How do we decide our cut off?
We need to do a cost benefit analysis about the benefits and cost of correct and incorrect classifications
If you use a log link, what should we remember to do when we use predict
Predictions default to the scale of the linear predictor, so we need to put them back on the original scale by using exp(predict()) or predict(type = "response")
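A minimal sketch, assuming a log-link fit log_glm and test data dat_test; both lines return predictions on the original scale:
exp(predict(log_glm, newdata = dat_test))                # back-transform the linear predictor
predict(log_glm, newdata = dat_test, type = "response")  # equivalent, handled by R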
How do you edit for target leakage
We need to make sure we remove all variables that would not be known until after the claim is observed
What is something we can't forget when using offsets
We need to put the offset on the same scale as the linear predictor, so if we are using the log link we need ln(E)
Why do we need to consider our cut off?
We would like both sensitivity and specificity to be close to one, but this is almost impossible since they are in conflict; the selection involves a trade-off between having high sensitivity and high specificity
When we want to predict claims per policyholder on averaged data, why can't we just use that as the target?
We do use it as the target, but we also need to include a weight because it allows observations averaged over a larger number of policies to make a greater contribution to the parameter estimates
What are things to remember when they ask you to convert to factors?
We want to convert any numeric variable that should be categorical because their values are merely group labels or they have no order We want to re-level, so the baseline has the most observations
Instead of using an offset for Poisson distribution, we could use
A weight to reflect the difference in precision of the observations caused by different exposures This creates essentially the same model, but since the averaged values are no longer integers, there will be warning messages (because we are using the Poisson family) and we won't be able to calculate AIC
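A minimal sketch of the weighted alternative, assuming grouped data dat_grouped with hypothetical columns avg_claims and n_policies:
glm(avg_claims ~ age + region, family = poisson(link = "log"),
    data = dat_grouped, weights = n_policies)  # non-integer response triggers warnings and AIC is unavailable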
If X is a numeric predictor, and we are using the log link, interpret the coefficient
When all other variables are held fixed, a unit increase in X is associated with a multiplicative change in the target mean by a factor of e^B The algebraic change in the target mean associated with a unit increase in X is (e^B - 1) * u, so the proportional change in the mean is 100*(e^B - 1)%
When should we use inverse Gaussian GLM?
When the data is more highly skewed than a gamma distribution can accommodate Use a log link because it is the most interpretable
When is the GLM the same as the linear model?
When the target is normally distributed and the link function is identity
Difference between using a log transformation and a log link
When using a log transformation of the target, the model is multiplicative, with the error also multiplicative and lognormally distributed When using a log link on normally distributed data, Y remains normally distributed with the log of its mean equal to the linear predictor (without an error term)
In what scenario will using weights and offsets give the same results
When we have a Poisson regression with a log link Typically we would choose offsets since this is a count model
When do we usually use offsets?
When we have count data When we use a log link and assume the target mean is directly proportional to the exposure
When we split into training and test sets and check that the target distributions are similar, what should we look at when the data is skewed?
When we have really skewed data, we are more concerned with the medians being similar than means because a mean can be skewed by a couple observations
Exponential family of distributions
A class of distributions that includes a number of discrete and continuous distributions common in practice, such as the normal, Poisson, binomial, gamma, and inverse Gaussian
Deviance residuals, mathematically
di = sign(y - u hat) * sqrt(Di) Positive if y > u hat If y is close to u hat this will be close to 0
Odds
e ^ (B0 + B1*X1) ln(odds) = B0+B1*X1
GLM r function
glm(y ~ x, family = FAMILY(link = ""), data = )
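For example, a minimal sketch of a gamma severity GLM with a log link; dat and its columns are assumed names:
sev_glm <- glm(claim_amount ~ age + vehicle_value, family = Gamma(link = "log"), data = dat)
summary(sev_glm)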
Offsets mathematically
ln(u) = ln(E) + n, where E is the exposure of the ith observation, so u = E * e^n The offset is a special predictor whose regression coefficient is one
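A minimal sketch with the exposure entered as ln(E) to match the log link, assuming grouped data dat_grouped with hypothetical columns claims and n_policies:
freq_glm <- glm(claims ~ age + region, family = poisson(link = "log"),
                offset = log(n_policies), data = dat_grouped)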
Logit link
ln(u / (1 - u)) = ln(odds), where u is the target mean Odds range from zero to infinity Ratio of the probability of occurring to the probability of non-occurrence Easily interpretable The resulting model is called the logistic regression model
To model expected number of events per exposure use
ln(u/E) = n