GLM

Classification error rate

(FN + FP) / n = 1 - accuracy

Working residual

(Observed - predicted) * first derivative of link function of (predicted)

Accuracy statistic

(TN + TP) / n. A weighted average of specificity and sensitivity, where the weights are the proportions of observations belonging to the two classes (the positive class with sensitivity and the negative class with specificity)

Pearson statistic

1/n * sum of (y - y hat)^2 / y hat. Used as a performance metric for count GLMs. Can be computed on the training or test set. Scales the squared discrepancies by the predicted values so the statistic is not disproportionately dominated by large predicted values. The smaller the statistic for the test set, the better the model
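A minimal R sketch of the test-set Pearson statistic, assuming a fitted count GLM count_model and a data frame test with target column claims (hypothetical names):

# Predicted counts on the test set, then the Pearson statistic
pred <- predict(count_model, newdata = test, type = "response")
pearson <- mean((test$claims - pred)^2 / pred)
pearson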

Interpretation in terms of probability for a numeric predictor using a logit link

A % increase in the predictor changes the target variable's probability by (new prediction - old prediction), found by predicting at the old and new values of the predictor

If we are more interested in positive outcomes, we should focus on

A higher sensitivity, not specificity

What type of distribution should we use on aggregate data that is continuous with a large mass at zero

A mixture of discrete and continuous components, so we could use the Tweedie distribution, which is a compound Poisson-gamma distribution, with a log link

Interpretation of coefficient, in terms of odds, for a numerical variable

A unit increase in the log of x (for a log-transformed predictor) is associated with a multiplicative change of e^b in the odds of the target variable, holding all other features fixed

If X is a numeric, and we are using a logit link, interpret the coefficient

A unit increase in a numeric predictor with coefficient B is associated with a multiplicative change of e^B in the odds and a percentage change of 100*(e^B - 1)% in the odds

What is still important for all observations in the GLM

All observations are mutually independent

Link function

Denoted g(). Links the target mean to the linear predictor (the linear combination of predictors), g(u) = n. Examples are identity, log, and inverse

Log link

Always ensures positive predictions Easy to interpret ln u = n

Deliverable of GLM

Analytical equation clearly showing how the predicted mean of the target variable depends on the features: g(u) = n = B0 + B1*X1 + ...

What are three modeling options for highly skewed, continuous positive target variables

Apply log transformation and fit a normal linear model Build GLM with normal distribution and log link function Build GLM with gamma or inverse Gaussian distribution

What are three considerations for selecting a link function?

Appropriateness of predictions Interpretability Canonical link

If the GLM is fitted appropriately, deviance residuals follow these properties

Approximately normally distributed for most target distributions in the linear exponential family, so we can still look at the QQ plot. No systematic patterns on their own or with respect to predictors. Approximately constant variance, but only if they are the standardized deviance residuals (otherwise heteroskedastic)

When we have ridge regression lambda go to infinity what happens to the intercept

As all coefficients shrink to 0 there will only be the intercept so it will be the sample mean of the target variable

Interpretation in terms of probability for a categorical predictor using a logit link

Being in this category changes the target variable's probability by (new prediction - old prediction) compared to being in the baseline category

Use dummyVars() to

Binarize categorical predictors with three or more levels to view each level separately and add or remove one level at a time, when doing stepwise selection, instead of having to choose between adding and removing all the dummy variables entirely
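A hedged sketch with caret's dummyVars(), assuming a data frame dat whose target column is named target (hypothetical names):

library(caret)
# Binarize the factor predictors; fullRank = TRUE drops one level per factor
# so that level can serve as the baseline
binarizer <- dummyVars(~ ., data = dat[, names(dat) != "target"], fullRank = TRUE)
dat_bin <- cbind(dat["target"], as.data.frame(predict(binarizer, newdata = dat)))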

Why do we use the confusion matrix?

Binary classifiers produce a prediction of the probability that the event of interest occurs, but we want to translate this into the predictive classes so we need a pre-specified cut off

What are some examples of target variables that will be used in GLM but shouldn't be used in linear models and why?

Binary variables Count variables Skewed, continuous variables These need to use GLM because they are not normally distributed

anova()

Can be used to compare deviance between models. If we use test = "LRT" it will also do a likelihood ratio test and give a p-value
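For example, assuming two nested fitted GLMs model0 (smaller) and model1 (larger):

# Compare deviances and run a likelihood ratio test
anova(model0, model1, test = "LRT")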

Regularized GLMs pros

Categorical predictors: by using model matrices Binarization of categorical variables is done automatically, and each factor level is treated as a separate feature Tuning: can be tuned by cross validation using the same criterion ultimately used to judge the model against unseen test data Variable selection: for Lasso and elastic nets variable selection can be done by making the regularization parameter large

Do we need to binarize variables before linear regression or GLM

Categorical variables should be binarized; the lm and glm functions will do this automatically

If we change a numeric variable with two levels, to a factor variable, what happens

Changing the binary variable will have no effect on the GLM model, but will affect the graphs

What are some examples of positive Continuous, right skewed data?

Claim amounts Incomes Insurance coverage amount

how should we reduce the number of factor levels?

Combine levels that have similar target distributions, or have very few observations to form more populous levels We need to find a home for the smaller groups If the target is similar for a large group of levels, you can just combine them all together in a big level called other

Cons of GLMs

Complex relationships: unable to capture non-linear or non-additive relationships unless additional features are manually incorporated Interpretability: for some link functions, like the inverse, the coefficients may be difficult to interpret

How do we compare predicted performance with ROC curves?

Compute the area under the curve, and the higher the better

Performance metrics for classification targets

Confusion matrix ROC curves (AUC)

If we want to make predictions for aggregate payments, we should

Create a number-of-claims model and a severity model, and then multiply their predictions together. Either use the predict() function, or find the y hat manually by using the coefficients from the output
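A minimal sketch, assuming fitted models freq_model and sev_model and a test set test (hypothetical names):

freq_pred <- predict(freq_model, newdata = test, type = "response")  # expected claim counts
sev_pred  <- predict(sev_model,  newdata = test, type = "response")  # expected claim sizes
agg_pred  <- freq_pred * sev_pred                                    # expected aggregate payments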

Deviance mathematically

D = 2 * (l saturated - l). For a linear model, D = RSS / sigma^2. D = sum of di^2

Goodness of fit measures for GLM

Deviance Deviance residuals

Weights

Different observations in grouped data may have different exposures and thus different degrees of precision, so we can attach a higher weight to observations with a larger exposure so that these more credible observations carry more weight in the estimation of the model coefficients. Used when data is averaged over homogeneous policies: larger exposure means more precise observations with lower variance, and the model will give more weight to these

drop1() function

Displays the AIC of the model along with the AIC that would result if the indicated variable were dropped. Factor variables are treated as wholes
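For example, assuming a fitted GLM named model:

drop1(model)  # AIC of the full model and of each model with one term dropped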

Why are GLMs more flexible than linear models

Distribution of the target variable: GLM's are not confined to normal random variables, but only have to be a member of the exponential family of distributions, and can be numerical or categorical Relationship between the target mean and linear predictors: GLM's set a function of the target mean to be linearly related to the predictors, instead of equating the mean, directly with the linear combination of predictors

Under sampling

Drawing fewer observations from the negative class, and retaining all positive class observations to have more balanced data Drawback is there is now less data and information about the negative class which makes this less robust and more overfit

Grouped data

Each row of the data set corresponds to a collection of policyholders sharing the same set of predictor values rather than a single policyholder

Interpretability of link functions

Easy to interpret if it is easy to form statements that describe the association between the predictors and the target mean in terms of the model coefficients

If we use a gamma GLM with a log link, what does it ensure?

Ensures all predictions are positive and is easy to interpret, which is why we use the log link over the canonical link (inverse). Use gamma if the target is positive, continuous, and right skewed

If we use a normal distribution GLM with a log link this ensures what about our predictions?

Ensures positive predictions, but allows for the possibility that the observations of the target are negative

What happens when our cut off is zero

Everything exceeds the cut off and is predicted positive, so the sensitivity is one and the specificity is zero

What happens when our cut off is one

Everything is below the cut off and is predicted negative so the specificity is one and the sensitivity is zero

To find if we need an interaction term what should we look at?

Find predictors that you think will interact and look at a graph to confirm

confusionMatrix()

First two arguments need to be factors. R will default to taking the alphanumerically first level as the level of interest unless you use the argument positive = ""
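A sketch assuming predicted probabilities pred_prob, a chosen cutoff, and observed classes test$target coded "0"/"1" (all hypothetical names):

library(caret)
# Convert probabilities to predicted classes, then tabulate against the observed classes
pred_class <- factor(ifelse(pred_prob > cutoff, "1", "0"), levels = c("0", "1"))
confusionMatrix(pred_class, factor(test$target, levels = c("0", "1")), positive = "1")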

Why may the p values change if you manually binarize the categorical variable

GLM internally binarizes all levels of a factor variable except the baseline. Therefore, changing the baseline will cause different p-values to be calculated, since each relates to the hypothesis that a given level has a different impact than the baseline level.

What are the differences in GLM and linear models in terms of transforming data

GLM transformations are on the target mean and linear model transformations are on the observations. In a GLM the data is modeled directly and the target variable is not really transformed; the transformation only applies to the target mean within the GLM

Monotonic relationship

Generally, just increasing or decreasing

par(pty="s")

Generates a square plotting region, which improves the appearance of the ROC plot

3 assumptions of GLMs

Given the predictor variable values, the target variables are independent Given the predictor variable values, the target variable's distribution is a member of the exponential family Given the predictor variable values, the expected value of the target variable equals the inverse of the link function applied to the linear predictor

Deviance

Global goodness of fit measure that measures the extent to which the GLM departs from the most elaborate GLM, or saturated model Null deviance measures when the target is predicted using its sample mean (similar to TSS)

How do we see if we should use something as an offset?

Graph the relationships in a scatterplot to see if they have a strong, proportional relationship

Normal canonical link

Identity

How can we check the prediction performance of GLM?

Check whether the area under the ROC curve (AUC) on the test set increased

How does the cut off sort observations?

If the predicted probability is above the cut off then the event is predicted to occur, and if it is below then it is not predicted to occur

How to interpret the likelihood ratio test

If the test statistic is large, then the bigger model has a better fit to the training data and we should reject the null hypothesis that the extra coefficients should be zero

If we have a log link what is the different interpretations of using E or ln E as an offset

If we use ln E we are assuming that the target mean varies in direct proportion to E If we use E, then we are assuming that the target mean has an exponential relationship with E

What are the two differences between weight and offset?

Impose different structures on the mean and variance of the target variable Impose different structures on the form of the target variable (average or aggregate) Offsets are used to multiply the prediction to get it on the scale we want, which is the total, whereas the target for weights is the ratio (average)

What is the goal of using weights?

Improve the reliability of the fitting procedure

Why should we refit the final model to the whole data set?

In a real world application, new data will be available in the future on a periodic basis and these data points can be combined with the existing data to improve the robustness of the model

What are the different reasons for transformations in linear models and GLM

In linear regression, the main reason for using a log transformation is to reduce skewness and in GLM we usually transform to ensure appropriate predictions and ease of model interpretation, not for skewness, because this can be accommodated directly by a suitable target distribution like gamma

Disadvantages to changing a numeric Variable with order to a factor

Inflates the dimension of the data, as this adds multiple dummy variables. May result in overfitting. Stepwise selection or regularization can help reduce these levels, but the model fitting may take longer to run

When asked to interpret results for GLM include the following

Interpret the precise values for the coefficients Comment on whether the sign of the coefficients makes sense Relate the findings to the big picture and how they can help clients make better decisions

Canonical link for gamma

Inverse link. It's more common to use the log link because the inverse link doesn't guarantee positive predictions and is not easy to interpret

Applicability of the likelihood ratio test is restricted, because

It can only be used to compare one pair of GLMs at a time The simpler GLM must be a special case or nested within the more complex GLM

What is the downside of maximum likelihood estimation

It is occasionally plagued by convergence issues, which may happen when a non-canonical link is used, and no estimates will be produced

What does the link function transform

Just the target mean not the target observations

Over sampling

Keeps all original data, but oversamples with replacement from the positive class so those observations appear more than once. Drawback is there is more data, which means more computational burden. This should only be performed after we split off the training and test data

Likelihood ratio test, mathematically

LRT = 2 * (l1 - l0) = D0 - D1

Why may we need to use a GLM instead of a linear model?

Linear models rely heavily on the target variable being normally distributed and this allows for negative values which would not be allowed if the range of target variables is positive Variance of the target variable depends on the mean Target variable is binary, which does not fit a normal distribution Relationship between the predictor variables and the target may not be linear

Canonical links

Link function associated with each target distribution

Deviance residuals

Local goodness of fit measure that is the signed square root of the contribution of the ith observation to D

If we have a positive mean, what type of link function should we use?

Log link is good since it ensures that the mean of the GLM is always positive and any positive value can be captured Used for Poisson, gamma, and inverse Gaussian, since they have a positive mean that is unbounded from above

If we have a number of claims target that is a non-negative integer valued variable with a right skew we should use

Log link with Poisson distribution since it ensures that the model predictions are always non-negative, makes for an interpretable model and is the canonical link

Binomial canonical link

Logit. Other links used are probit and cloglog

Binomial distribution can be used with what five different links

Logit - most commonly used and most interpretable Probit - uses the standard normal CDF, hard to interpret Cauchit - uses the Cauchy CDF (t-distribution with 1 df), hard to interpret Log - not used because predictions won't stay between zero and one Complementary log-log - ln(-ln(1-p)), hard to interpret
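For illustration, assuming a binary target y and predictor x in a data frame dat (hypothetical names):

glm(y ~ x, family = binomial(link = "logit"),   data = dat)
glm(y ~ x, family = binomial(link = "probit"),  data = dat)
glm(y ~ x, family = binomial(link = "cauchit"), data = dat)
glm(y ~ x, family = binomial(link = "cloglog"), data = dat)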

If we have a mean valued in the unit interval, what type of link function should we use?

Logit link, since the mean is the probability of the event of interest Used for binary distributions

How to interpret the exploration of the relationship between categorical and numeric pairs

Look at a box plot split by the categorical variable, and look at the mean and median of the numeric variable split by the categorical variable. These should vary significantly across different levels if it is an important predictor

How to interpret the exploration of the relationship between two categorical pairs

Look at filled bar charts color-filled by the target, and if the target is binary look at its mean by different levels of the predictors. These proportions should vary significantly across different levels if it is an important predictor

How should we compare link functions?

Look at the predictive performance on the test set or if we want to look at the training set we can compare with AIC or BIC, so that the complexity of the model is being taken into consideration

QQ plot interpretation

Looks at the normality of the standardized deviance residuals. We want the points to lie on the 45° line; if the points towards the ends deviate, there is more skewness than the model is handling

Difference between accuracy and AUC

Main difference is accuracy looks at the one specific cut off point and the AUC looks at all possible cutoff points

Confusion matrix

Matrix is a 2 x 2 table of four possibilities Prediction is what the model produced and reference is the actual class

Maximum likelihood estimation

Maximizes the likelihood of observing the given data Typically achieved by running optimization algorithms Produces estimates with a desirable statistical properties like asymptotic unbiasedness, efficiency, and normality

What do we use to estimate the coefficients in GLM

Maximum likelihood estimation

Poisson distribution

Mean and variance have to be equal When we have over dispersion, the distribution can be tweaked

Test confusion matrix interpretation

Measures how well the classifier performs on unseen data This is what we care about We can compare this to the training matrix to detect overfitting

Regularization of GLMs

Minimizes the deviance plus a regularization penalty, where the deviance measures goodness of fit and the regularization penalty measures model complexity. Similar to linear models in that, as the regularization parameter increases, the GLM becomes less flexible; we can use cross-validation to find the regularization parameter, and we can use lasso and ridge regression
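A minimal sketch with glmnet, assuming a training data frame train with a count target column named target (hypothetical names):

library(glmnet)
X <- model.matrix(target ~ ., data = train)[, -1]  # model matrix binarizes factors automatically
cvfit <- cv.glmnet(X, train$target, family = "poisson", alpha = 1)  # alpha = 1 gives the lasso
coef(cvfit, s = "lambda.min")  # coefficients at the lambda chosen by cross-validation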

Advantages of a Tweedie distribution

More efficient to fit and maintain a single model Doesn't require the independence between claim counts and claim severity

Canonical link for Poisson

Natural log link

Response residual

The ordinary (raw) residual: observed - predicted

Structure of mean, and variance for offsets

Observations of the target variable should be aggregated over exposure units Exposure is in direct proportion to the mean of the target and leaves the variance unaffected

Structure of mean and variance for weights

Observations of the target variable should be averaged by exposure Variance of each observation is inversely related to the size of the exposure, which will serve as the weight Weights do not affect the mean of the target

Unbalanced data

One class of binary target variable is much more dominant than the other class in terms of proportion of observations

When should oversampling be performed

Only after the training data is split from the test data, because if we do it before we may have the same observations in multiple folds, which defeats the whole purpose of splitting them

What does the interpretation of the GLM coefficients depend on?

Only depends on the link function, since this determines the functional relationship between the target mean and the features Doesn't depend on the target distribution

Gamma distribution

PDF will be right skewed Right skewness will increase as the mean increases Most widely used for claim severity since the mean and variance are positively related Positive outcomes only Need target to be strictly positive

What type of distribution to be used on count data?

Poisson because it assumes a non negative integer value and is possibly right skewed Could also use negative binomial

Tweedie distribution

Poisson sum of gamma random variables Discrete probability mass at zero and a PDF on the positive real line cplm package in R

What is the effect of fixing unbalanced data?

Positive data is more prevalent so their predicted probabilities will increase and there will be a higher sensitivity and lower specificity

Appropriateness of predictions when selecting a link function

Range of the values of the target mean implied by GLM is consistent with the range of the values of the target mean in the situation

ROC curves

Receiver operating characteristic curve. Graph that plots sensitivity against specificity at each cutoff ranging from 0 to 1. Specificity is plotted on the horizontal axis in reverse order. The curve should bend to the upper left and approach the top left corner; the closer to the top left corner, the better the predictive ability. Used for looking at the performance of classification models (GLMs or trees)

Training confusion matrix interpretation

Reflects how well the classifier matches the training set

GLMs

Relate a function of the target mean linearly to a set of predictors

When we edit our data, what are some things we need to look for?

Remove target leakage variables Remove mysterious variables that we don't know what they are Remove observations that have missing values or values of zero where there should not be zero Convert numeric variables to factors if needed Re-level categorical variables so the baseline is the level with the most observations

What do we look for in the residuals versus fitted plot

Residuals are mostly scattered in a structureless manner with no systematic fluctuation

AIC and BIC for GLM

Same as for linear models; the goodness-of-fit term -2l can be replaced by the deviance in the formulas (they differ only by a constant for a given target distribution)

Saturated model

Same target distribution and link function, but as many model parameters as the size of the training set Very flexible and overfitted

What happens as our cut off rises

Sensitivity decreases and specificity increases

Is sensitivity or specificity more important

Sensitivity is more important because our main interest is a positive event

How do we choose a target distribution?

Should choose the one that best aligns with the characteristics of a given target variable

Inverse Gaussian distribution

Similar to gamma but more highly skewed Need target to be strictly positive

Canonical link for inverse Gaussian

Squared inverse link Also use log link

What is the goal of offsets?

Substantial improvement in Model fit

Fisher scoring

Summary will tell you how many iterations of solving the MLE were needed to converge the model Not too useful

Characteristics of a normal distribution

Symmetric about the mean Continuous Can assume all positive and negative values

Compare GLM and linear model flexibility

GLMs are more flexible and have a wider scope of applications

Specificity statistic

TN / (TN + FP) Proportion of negative observations that are correctly negative The larger, the better at identifying negative cases Also known as true negative rate Complement of false positive rate

Precision statistic

TP / (FP + TP) Proportion of positive predictions that truly belong to the positive class Captures how often when the model makes a positive prediction this prediction is correct

Sensitivity statistic

TP / (TP + FN) Proportion of positive observations that are correctly positive The larger, the better at identifying positive cases Also known as true positive rate

roc()

Takes the observed levels of the target and the predicted probabilities, supplied in that order, and produces a list that can be passed to the plot() and auc() functions
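For example, with the pROC package, assuming observed classes test$target and predicted probabilities pred_prob (hypothetical names):

library(pROC)
roc_obj <- roc(test$target, pred_prob)  # observed levels first, then predicted probabilities
plot(roc_obj)                           # ROC curve
auc(roc_obj)                            # area under the curve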

Advantages to changing a numeric variable with order to factor (like age)

Treating this as a numeric variable imposes the restriction that it has a global monotonic association with the target across all categories, but this restriction is lifted if the values are treated as categorical predictors with dummy variables Provides more flexibility

Pros of GLMs

Target distribution: good at accommodating a wide variety of distributions for the target variable Interpretability: the model equation clearly shows how the target mean depends on features, and the coefficients are interpretable measures of the directional effect of features Implementation: simple to implement

Regularized GLMs cons

Target distribution: limited and restricted model forms allowed by glmnet Categorical predictors: possible to see some non-intuitive or nonsensical results when only a handful of the levels are selected Interpretability: coefficient estimates are more difficult to interpret because the variables are standardized

Two key components of GLM

Target distributions: choose one that aligns with the characteristics of the target Link function

Convexity

Target mean has to be increasing at a faster rate or decreasing at a slower rate than X increases

What is true if the area under the ROC curve is one

The classifier has perfect discriminatory power, no matter what the cut off is This looks like an upside down L

What is true if the area under the ROC curve is .5

The classifier is the naïve classifier that is random and classified based on P and 1 - P arbitrarily so sensitivity = P and specificity = 1-P Looks like the 45° line

Other things equal what is better for accuracy, sensitivity, specificity and precision

The higher these are, the more attractive the classifier

How do we interpret deviance?

The lower the deviance, the closer the GLM is to the model of perfect fit and the better the fit to the training set Can only be used to compare GLMs having the same target distribution Always decreases as new predictors are added, even if prediction performance would not be better, so it should not be used on its own to compare models with different degrees of complexity

What are the advantages of using a canonical link?

The math is simplified for the estimation procedure, and it makes it more likely to converge

Unbalanced data is especially bad if

The minority group is the positive class

In terms of accuracy, how do we know if a model is over fit?

The model is overfit if the accuracy, sensitivity, specificity, and AUC of the model are higher for the training set than for the test set

Interpretation of a coefficient, in terms of odds, for a categorical variable

The odds of the target variable for policyholders in this level are e^b times those in the baseline level

When choosing a link, it is more important to consider these factors, then whether it's the canonical link

The predictions provided by the link align with the characteristics of the target variable The resulting GLM is easy to interpret

To get a perfect fit we would need to have

The same number of observations as parameters (including intercept)

When we add an exposure variable, this means

The target is directly proportional to the exposure

If X is a dummy variable and we are using the log link, interpret the coefficient

The target mean when the categorical predictor lies in the non-baseline level is e^B times that when the categorical predictor is in the baseline level, holding all other predictors fixed The target mean at the non-baseline level is 100*(e^B - 1)% higher than that at the baseline level

What are weights and offsets designed to incorporate?

These are designed to incorporate a measure of exposure into a GLM to improve the fitting

Why don't we see R^2 and F stat in summary(GLM)

These are statistics and tests that assume normal distributions

Why are regular residuals not useful in GLM

They aren't normally distributed and they don't have constant variance

What is true of any reasonable classifier in terms of the ROC curve?

They should have an area under the curve higher than .5

Why do we like that the GLM allows a function of the target mean to be linearly related to the predictors

This allows us to have linearity on a different scale, and analyze situations with predictors and target mean more complex than additive

What are problems with unbalanced data?

This places more weight on the majority class, and tries to match training observations in that class, without paying enough attention to the minority Accuracy will not be a good indicator of the future

Likelihood ratio test

Traditional model selection method Compares GLMs with the same target distribution and link function Generalization of the t and F tests Captures the difference between the goodness of fit of the two GLMs

Four classifications of a confusion matrix

True positive: predicts that the event occurs and it does True negative: predicts the event does not occur and it doesn't False positive: predicts that the event occurs, but it doesn't (type I error) False negative: predicts that the event does not occur, but it does (type II error)

What are two solutions for unbalanced data

Under sampling Over sampling These are used to pick up the signal from the minority class correctly and translate this into results from the original unbalanced data

What type of distribution should be used on binary data?

Use a binomial or Bernoulli distribution since the mean is modeling the event probability on a classification problem

In practice, how should we approach unbalanced data?

Use a combination of under sampling and over sampling to retain information about the positive class and ease computational burden

If we have count data but have overdispersion what should we do

Use a quasipoisson distribution
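For example, assuming a count target claims and predictor x in a data frame dat (hypothetical names):

# Quasi-Poisson allows the variance to exceed the mean (no AIC is reported)
glm(claims ~ x, family = quasipoisson(link = "log"), data = dat)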

What distribution should be used if we have positive, Continuous, right skewed data

Use gamma or inverse Gaussian to capture the skewness of the target variable directly, without using transformations Both need the target variable to be strictly positive

How do we come up with probability based statements for interpreting coefficients using the logit link?

Use p = e^n / (1 + e^n) Create an average policyholder as the baseline case, with the mean values of the numeric predictors and the most common level of the categorical predictors, then change one variable about this baseline case and look at the difference in predictions
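A minimal sketch, assuming a fitted logistic GLM model trained on data frame train with hypothetical predictors age (numeric) and region (factor with a level "urban"):

baseline <- data.frame(age = mean(train$age), region = "urban")  # "average" policyholder
changed  <- transform(baseline, age = age + 1)                   # change one variable
p_old <- predict(model, newdata = baseline, type = "response")
p_new <- predict(model, newdata = changed,  type = "response")
p_new - p_old  # change in predicted probability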

Logit link

Used for binary variables This is just the log link applied to the odds instead of the target mean so use the interpretations for the log link phrased in terms of changes in odds Odds = e^n

Offsets

Used when the value of the target variable in each row represents the number of claims aggregated over a group of homogeneous policies; different rows have different numbers of policies, so we can use the number of policies as an offset to better account for the number of claims in different rows Used when rows are aggregated over a group of homogeneous policies: larger exposure means larger u The offset's coefficient is already known to be 1, so it doesn't need to be estimated

Over dispersion

Variance exceeds mean

Binarization

Viewing each level separately for categorical variables Has no effect on the model, but will help when we do stepwise selection so we can recognize each level Afterwards, we need to attach these variables to the data set, delete the original categorical variables from the data set, and then re-partition the data set into training and test sets

What are we trying to do by transforming the target in GLM

We are not trying to transform the target in GLM to make it more normal. We want it to better fit any exponential family distribution that the data follows. We are only transforming the mean of the target, not the target variable as a whole

How do we compare GLM and OLS methods

We can look at the test RMSE and choose the model with the lowest

How do we decide our cut off?

We need to do a cost benefit analysis about the benefits and cost of correct and incorrect classifications

If you use a log link, what should we remember to do when we use predict

By default predict() returns values on the linear predictor (log) scale, so we need to put them back on their original scale with exp(predict()), or use predict(type = "response") to get predictions on the original scale directly
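For example, assuming a log-link GLM model and a test set test (hypothetical names):

pred_link <- predict(model, newdata = test)                      # linear-predictor (log) scale
pred_orig <- exp(pred_link)                                      # back to the original scale
pred_orig2 <- predict(model, newdata = test, type = "response")  # equivalent shortcut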

How do you edit for target leakage

We need to make sure we remove all variables that would not be known until after the claim is observed

What is something we can't forget when using offsets

We need to put the offset on the same scale as the linear predictor, so if we are using the log link we need ln(E)

Why do we need to consider our cut off?

We would like both the sensitivity and specificity to be close to one, but this is almost impossible since they are in conflict; the selection involves a trade-off between having high sensitivity and high specificity

When we want to predict claims per policyholder on averaged data why can't we just use that as the target

We can use that as the target, but we also need to include a weight, because it allows observations averaged over a higher number of policyholders to make a greater contribution to the parameter estimates

What are things to remember when they ask you to convert to factors?

We want to convert any numeric variable that should be categorical because their values are merely group labels or they have no order We want to re-level, so the baseline has the most observations

Instead of using an offset for Poisson distribution, we could use

A weight to reflect the difference in precision of the observations caused by different exposure values This will create essentially the same model, but since the target values are no longer integers there will be warning messages (because we are using Poisson) and we won't be able to calculate the AIC
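A sketch of the two equivalent forms, assuming columns claims, exposure, and x in a data frame dat (hypothetical names):

# Offset form: total claim counts, with log(exposure) as the offset
glm(claims ~ x, family = poisson(link = "log"), offset = log(exposure), data = dat)
# Weight form: claims per exposure as the target, with exposure as the weight
glm(claims / exposure ~ x, family = poisson(link = "log"), weights = exposure, data = dat)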

If X is a numeric predictor, and we are using the log link, interpret the coefficient

When all other variables are held fixed, the unit increase in X is associated with the multiplicative increase in the target Variable, by a factor of e^B Algebraic change in the target mean associated with the unit increase in X is (e^B - 1) * u so the proportional change in the mean is 100*(e^B -1)

When should we use inverse Gaussian GLM?

When the data is even more heavily skewed than the gamma distribution can accommodate Use a log link because it is the most interpretable

When is the GLM the same as the linear model?

When the target is normally distributed and the link function is identity

Difference between using a log transformation and a log link

When using a log transformation, the result is a multiplicative model with a multiplicative error that follows a lognormal distribution When using a log link on normally distributed data, Y remains normally distributed with the log of its mean equal to the linear predictor (no error term is transformed)

In what scenario will using weights and offsets give the same results

When we have a poisson regression with log link Typically we would choose offsets since this is a count model

When do we usually use offsets?

When we have count data When we use a log link and assume the target mean is directly proportional to the exposure

When we split into training and test sets and check that the target distributions are similar, what should we look at when the data is skewed?

When we have really skewed data, we are more concerned with the medians being similar than means because a mean can be skewed by a couple observations

Exponential family of distributions

A class of distributions that includes a number of discrete and continuous distributions common in practice, such as the normal, Poisson, binomial, gamma, and inverse Gaussian

Deviance residuals, mathematically

di = sign(y - u hat) * sqrt(Di) Positive if y > u hat If y is close to u hat this will be close to 0

Odds

e ^ (B0 + B1*X1) ln(odds) = B0+B1*X1

GLM R function

glm(y ~ x, family = FAMILY(link = ""), data = )
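For example, a gamma severity model with a log link, assuming a data frame train with columns severity, age, and region (hypothetical names):

sev_model <- glm(severity ~ age + region, family = Gamma(link = "log"), data = train)
summary(sev_model)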

Offsets mathematically

ln u = ln E + n, where E is the exposure of the ith observation and ln E is the offset, so u = E * e^n A special predictor whose regression coefficient is fixed at one

Logit link

ln(u / (1 - u)) = ln(odds), where u is the target mean The odds range from zero to infinity and are the ratio of the probability of occurrence to the probability of non-occurrence Easily interpretable The resulting model is called a logistic regression model

To model expected number of events per exposure use

ln(u/E) = n

