Linear Model Topics

¡Supera tus tareas y exámenes ahora con Quizwiz!

What are the two estimated coefficients in a simple regression model and thier meaning?

(Intercept) = y-intercept "a body weight of 0 kg would result in a heart weight of -0.36 g" VariableX = slope of VariableX "an increase in 1 kg body weight results in an increase of 4 g heart weight"

What is the interpretation of p-values for estimated coefficients for categorical variables?

(Intercept) represents the mean value for the reference value while "speciesLarix" etc. are the difference in intercepts to the reference level p-values for the reference represents how likely it is that the reference mean equals zero p-values for the rest represent how significant the difference between mean to the reference is (e.g. "there is strong evidence that the mean growth rates of the other three species differ from the reference growth rate")

What type of transformation of the response variable (y) should be done depending on the type of data?

1. "Amounts" -> log-transform (log(y)) 2. Counts (# purchased, etc.) -> square-root transformation (sqrt(y) or Poisson model GLM is better) 3. Proportions or percentages (continuous/continuous) -> arcsine-squre-root transformation (asin(sqrt(y)) 4. Binomial data (# successes/# trials) -> arcsine-square-root transformation (asin(sqrt(y)) or binomial model is better) These rules of thumb apply to predictor variables (x) too, however, predictors are often kept untransformed to facilitate interpretation of the regression coefficients

Other peculiarities of modeling count data

1. As opposed to the LM, where there is an analytical solution, for GLMs a numerical approximation is required to estimate the model coefficients (summary output includes number of iterations needed to make the algorithm converge) 2. Testing factors with > 2 levels in a quasipoisson model requires the use of the anova() function and setting the test to "F-test" (not just looking at summary() is based on z-values in this case). In addition, when comparing two quasi-poisson models, the anova function must be used and the test set to "F-test". This is different from comparing two Poisson models - in this case, the anova function should be used and the test set to "Chisq". (Likelihood Ratio Tests) anova(quasi.glm.complaints, quasi.glm.complaints, NoGender, test = "F")

Explain the different plots/analyses (residual analysis) used to make sure the assumptions about errors or residuals of a linear model are fulfilled.

1. Assess normality of errors/residuals with a Q-Q plot. This plots the residuals from sample quantiles vs theoretical quantiles. If residuals lie almost perfectly along the reference line, the residuals are normally distributed. 2. Assess if expected value of residuals is zero with a Tukey-Anscombe plot. This plots residuals against fitted values. If the smoother is along zero, this supports that the expected value is zero and the model is linear. 3. Assess homoscedasticity of residuals (if variability is constant) by plotting the square-root transformed absolute value of the residuals against the fitted values. A smoother that remains relatively flat supports that the variance is constant and the model is linear. 4. Assess the impact of influential observations on estimated regression coefficients by measuring Cook's distance for all observations. Cook's distance estimates the effect that omitting a given observation has on the regression coeffficients. The higher the value, the larger the effect. This can be visualized by plotting standardized residuals vs leverage (distance from center). The rule of thumb is that a Cook's distance > 0.5 should be inspected and a distance > 1 require action. Action can either involve retrieving the true observation (if value is a mistake), including another predictor in the model that explains this value, or removing the value (not recommended because most of the time the next most influential observation pops up) 5. Assess if observations are totally independent (are there any groupings within observations that the model does not take into account with its predictors?). This can be done by plotting residuals against time or over space. If necessary, add another predictor to the model. It is also good practice to plot the residuals against all predictors in a model!

What are the key shortcomings of using polynomials to model non-linear relationships?

1. Collinearity: raw polynomials suffer from collinearity issues. Fortunately, this can be solved by using orthogonal polynomials (poly function). However, using orthogonal polynomials limits interpretability of regression coefficients. 2. Extrapolation is unreliable: if we want to use the model to make predictions outside the data fitting range, the extrapolated values are unreliable and confidence intervals are overconfident (the higher the degree of the polynomial, the less reliable) 3. The degree of a polynomial must be chosen by the user (no simple rule to tell us what degree to choose - too high a degree risks overfitting)

What are three possible solutions to deal with missing values?

1. Do nothing 2. Discard the columns/predictors that contain MANY missing values 3. Impute (i.e. assign a value) missing values (not recommended) "Higher" functions such as lm() have the na.action argument set to na.omit by default. In a linear model, all rows of data that contain a missing value will be dropped which can decrease sample dramatically.

What are four assumptions about error for linear models?

1. Errors follow a normal distribution (a symmetric bell-shaped curve parallel to the y-axis that follows the trend line) 2. Errors expected value is zero (distances between regression line and each value averages to zero) 3. Errors are homoscedastic i.e. variability is constant (spread from regression line is constant) 4. Errors are independent (data not somehow grouped as in the case of measurements all from same person, all students within same class, etc.)

Why should data be log-transformed in a linear model before assuming the model is non-linear?

1. Log-transforming is a simpler method which is preferable 2. If a quadratic model is fit, the outliers/extreme values will completely drive the model

What are the two major problems with adding several indices of biodiversity to a model?

1. Non-significant (larger) p-values are misleading 2. Interpretation of the regression coefficient may be difficult (cannot interpret variable change while assuming all others are constant)

What steps can you take if four assumptions are not fulfilled (in residual analysis)?

1. Normality is tested via the Q-Q plot. If not fulfilled, transform the response variable (log(y)). If this fails, consider robust regression or bootstrapping techniques. 2. Expected error on zero tested via the Tukey-Anscombe plot (residuals vs fitted values). If smoother is not flat on zero, transform response variable (log(y)) or fit a more complex model (add quadratic terms, interactions). 3. Homoscedasticity is tested via the scale-location plot (sqrt-residuals vs fitted values). If smoother is not flat: (1) transform response variable (log(y)) or (2) fit a more complex model (add missing term, quadratic terms, interactions). If both fail, (3) update the model such that a predictor indicates that residuals have different variances (conduct a weighted regression). 4. Influential observations are assessed by plotting residuals vs leverage. If an observation lies at Cook's distance > 0.5, remove observation and refit the model if it is an error. If not an error, (1) transform response variable (log(y)), (2) update model equation (add or drop terms), (3) if all else fails use robust regression, or (4) remove influential observation (never recommended)

What four problems arise from fitting a linear model to binomial data? (For example, insect mortality rate (0-100%) vs. insecticide concentration)

1. Predictions may lay outside of [0,1] 2. Even fitted values may lay outside of [0,1] 3. y is not normally distributed because its distribution is bounded between 0 and 1 while normal distribution is unbounded 4. Variability is not constant as assumed by the normal distribution

Why is running a simple t-test to compare different levels of a factor inferior to fitting a linear model followed by contrasts?

1. Reduced power: fitting model allows us to control the effect of many other predictors (controls for noise, higher precision estimate) 2. Non-independence of the observation: design variables are present and must be considered/included in the model. Testing via t-test would wrongly assume independence of observations. 3. Multiple testing: if several tests are performed, p-value correction should be performed (run all tests together in glht function) 4. Interactions must be taken into account. For example, if species and density interact, conducting a t-test on "low" vs "high" density would have to be done for EACH species.

What properties should an appropriate model for count data have?

1. Simulated and fitted values cannot be negative 2. Simulated values must be integers 3. Variability of the observations increases with expected value (variability in # cigs/day for a heavy smoker > variability for a normal smoker)

What three elements must be specified for a glht() function?

1. The model 2. Which predictor is involved in testing 3. Which levels of the predictor are to be compared glht(lm.trees, linfct = mcp(species = "Quercus - Picea = 0")

Why is it important to conduct residual analysis?

1. To draw valid inference (p-values, confidence intervals) which requires that all model assumptions are fulfilled (inference on regression coefficients is based on the assumption that errors are normally distributed) 2. To better understand the data (insights) 3. To better models (draw better conclusions)

What kind of model should be used to deal with proportion data or binomial/binary data?

A link function and a distribution other than the normal distribution (logit as link function and assume the data follow a binomial distribution)

What is a statistical model and why do we use it?

A mathematical representation of a data-generating process. When using statistical models, we use predictors to model response variables. We use statistical models to extract relevant information from data and simplify complex processes We do this by estimating effects and by quantifying uncertainty of these estimated quantities (b) is the effect, var(b) is its incertitude). Test hypotheses: does smoking affect the probability of having lung cancer? Estimate effects: HOW MUCH does smoking increase the probability of having lung cancer?

Can you have a predictive model with collinear predictors?

A model with collinear predictors loses its natural interpretability in terms of regression coefficients and p-values. However, there is no problem with including collinear predictors if scope of the model is PURELY to make predictions.

What will a plot with binomial model fit look like?

A sigmoid fit smoother that is symmetric around 50%. The steepness of this fit is proportional to the estimated coefficient. In other words, the larger the regression coefficient (for the same insecticide concentration, a higher mortality rate), the steeper the inverse logit function and the steeper the curve

What is a mixed model?

A statistical model that contains both random and fixed effects mod.0 <- lmer(distance ~ age * Sex + (1 | id) + (1 | school)) random effects = id, school Example: clinical trial where a new drug is tested along a gold standard Usually fixed effects are predictors of interest for which we draw conclusions Random effects are predictors that need to be considered in the analysis (design variables e.g. patient) but are usually not of primary interest Whether a variable is fixed on random depends on experiment design and question addressed Mixed models are useful when grouped data is present (arises when an experimental unit is measured more than once e.g. diameter of a tree measured yearly)

What are smoothing splines?

A valid solution to the problem of choosing the correct degree of complexity to model non-linear relationships because it penalizes the wiggliness of estimated smooth function. Thus, the user does not need to select the complexity of the smooth term - it is calculated from the data. Smoothing splines are basically penalized regression splines: (1) do not suffer collinearity issues, (2) have better properties outside the fitting region (model fit and CIs), but degree does NOT have to be chosen by the user (objective, based on data) Coefficients of a linear model are estimated with the least-squares method. With smoothing splines, the best fit is also obtained by minimizing the sums of squares while ALSO adding a penalty term dependent on the second derivative of the smooth function. A very wiggly function with a large second derivative will be penalized more. On the other extreme, linear function has a second derivative that equals zero.

How to account for and estimate overdispersion in a Poisson model? How will your summary output look different when accounting for overdispersion?

Add another parameter (dispersion parameter) by setting the family to "quasipoisson" The dispersion parameter in the summary output then indicates if/by how much the data is overdispersed Regression coefficients in the quasipoisson model will be unchanged, however, the standard errors (and thus the p-values) are different. Usually, by including dispersion you end up with LARGER (less significant) p-values and HIGHER standard error A "negative binomial" distribution is also well suited to deal with this kind of data (count data) without overdispersion

What is the use of the 'facets' argument?

Allows us to create a graph with multiple panels to gather information about different categories separately (e.g. gender)

What is the difference between errors and residuals?

An error is the difference between the true regression line (true value) and an observed value (true values must be known) A residual is the difference between the estimated/predicted regression line (fitted value) and the observed value. Residuals are an "approximation" of the errors. In the real world with real data, we only know the estimated regression line so we can only calculate the residuals, not the errors

How to assess model assumption for a mixed effects model?

As for linear models, fundamental graphics to assess model assumptions are: 1. Tukey-Anscombe plot 2. Scale-location plot 3. Quantile-quantile plots (residuals AND random effects) 4. Residuals vs Leverage plot If required, other further graphics (e.g. residuals vs predictor plots)

p-values in a Binomial model

As in a Poisson model, the p-values are based on z-values rather than t-values (since the results of the inference procedure are only asymptotically valid)

How to address problems 2 and 3 for count data (ensure simulated values are always integers and that variability of observations increases with the mean value)?

Assume a Poisson distribution (Generalized Linear Model GLM) of the data log(y) = b0 + b1*x1 becomes y = Poisson(lambda) The Poisson model fulfills all three requirements for count data In a Poisson distribution, the VARIANCE of the observations increases linearly with the mean value (mean of 5 cigs/day = variance of 5; mean of 20 cigs/day = variance of 20) glm.smokers <- glm(cigarette.count ~ person, family = "poisson")

How many observations are needed to make a Q-Q plot informative?

Between 20-30. This should be taken into account when plotting Q-Q plots for random effects (sometimes not as many as fixed effects).

Binomial distribution

Binomial distribution only works with discrete integers as the response variable (NOT continuous data) For example, providing mice groups with 100 g of food and measuring average proportion of food consumed is NOT binomial However, measuring how many rats consume > 50 g of food IS binomial logit(y) = b0 + b1*x1 y ~ Binomial(n, pi) n = number of trials, pi = success probability

Interpreting coefficients and inference procedure for binary data

Both are exactly the same as in a binomial model

How are smoothing splines estimated?

By minimizing the sums of squares while also adding a penalty term dependent on second derivative of the smooth function. Penalty term is controlled by lambda which is estimated from the data via Generalized Cross Validation (GCV) - this implies that the wiggliness of the smooth functions is estimated in a data-driven and objective way (so user does not need to set the complexity of the smooth term)

What is a common way to evaluate the goodness of fit of a model? What is a potential problem with this method and how to solve it?

Coefficient of determination: R^2 This quantity represents the part of the variation present in the data that is explained by the model (ranges between 0 and 1) The higher the R^2 value, the better the fit (the less variability remains unexplained by the model) Importantly, a model that contains additional parameters will always have a higher R^2 value. This can be solved by using the adjusted R^2 instead (takes into account the number of parameters) Unfortunately, there is no good general threshold for R^2 that indicates if a model is "good" or not and interpretation varies from case to case -> as a result, R^2 is not usually used to compare two models (F-test, Anova)

What is collinearity? How does a model with collinear predictors differ from one without? How can you avoid this when adding quadratic terms and what are the downsides?

Collinearity occurs when predictors in a model have very similar values. For example, two predictors that both quantify biodiversity will be highly correlated. If they are both included in a model, the model will not know to whom to attribute the "diversity effect". In the model that includes both predictors, standard errors will be higher and the p-values will be much larger despite estimated coefficients being the same. In other words, the measure of precision of this estimate will be lower. When adding quadratic terms to a model, quadratic terms can be highly collinear with the corresponding linear term. To avoid this, quadratic terms should be added to the model using the poly() function. This function creates orthogonal polynomials that are not affected by collinearity issues. It is important to note that the use of polynomials implies that we modify a predictor such that several predictors are created and the model clearly becomes more complex. In addition, the price to pay when using orthogonal polynomials is that the regression coefficients lose their interpretability.

Testing continuous vs categorical variables

Continuous variables can be equivalently tested using either summary() function (t-tests) or drop1() function (F-tests) Sometimes the inferential results for continuous variables are best displayed and communicated with confidence intervals Categorical variables with two levels only (dummy variables) can be equivalently tested using summary() function or drop1() function Categorical variables with more than two levels must be tested using drop1() function since the summary() function would only tell us whether intercept differs from zero and whether there is a difference between reference and other levels.

What are fitted values?

Each observation can be forecast using all previous observations (these follow the trend line exactly)

What is rank deficiency of a model and why can it occur?

Error occurs when the model is trying to estimate more parameters than there are groups available. 1. When collinearity between two predictors is too extreme that a model can't be fitted (e.g. one variable is the sum or linear combination of other variables) 2. When there are nested categorical variables (species and sub-species) 3. Several design variables included (building and room) can result in nested variables and rank deficiency

How can you determine if a categorical parameter has an impact on your response variable (e.g. "do species differ in growth rates?")

F-test: 1. Fit a model that considers species to have no effect at all (computes mean growth rate across all species, only contains an intercept) 2. Compare this model to your model with an F-test (anova function) The p-value of this test indicates whether the addition of the categorical parameter to the model is significant (if the addition of this parameter allows model to better fit the data) F-test is important because it takes into account (1) the number of additional parameters and (2) the drop in unexplained variance (RSS). In other words, if the F-test identifies a significant drop in unexplained variance but at the cost of an additional 100 parameters, the p-value would not be significant. Importantly, the F-test just tells us that at least one species differs from the others in growth rate (not useful for comparing rates across the species factor)

How would you interpret coefficients of a binomial model differently for categorical parameter?

Factors must be interpreted differently since the coefficients are reflecting a comparison of odds to the reference level rather than an incremental increase in odds with each additional unit (continuous) exp(coef(glm.insects.factor))["conc.asFactorlow.conc"] = 8.7 The odds of dying for a low insecticide concentration are about 8.7 times higher than the odds for dying with no insecticide

Which distributions should you use to account for over/underdispersion in binomial vs count data?

For binomial data: quasibinomial (except for binary data which must be binomial) For count data: negative binomial or quasipoisson

What are GAMs?

Generalized Additive Models (GAMs) allow the user to fit models where several smoothing splines are fitted simultaneously. In other words, it allows the user to fit models with several predictors where continuous predictors can have non-linear effects. In other words, a GAM allows us to fit a different smoothing spline to several predictors within a model simultaneously. If polynomials are a cheap approximation to model non-linearities, GAMs are an expensive but sophisticated solution. In practice, smoothing splines are simply building blocks of GAMs and in most real settings more than one predictor is available so smoothing splines are not used (people mostly work with low-degree polynomials or GAMs) Do not suffer collinearity issues, have better properties outside the fitting region, degree does not have to be chosen, AND can simultaneously estimate several smooth terms gam.1 <- gam(yy ~ s(x1)) The s() function allows the user to specify smooth terms (effect of x1, the unique predictor present in the model, is modeled with a smooth term) The smooth function has a default degree of complexity equal to 10 and penalization is applied to overfitted degrees of complexity Extrapolation properties of GAMs: (1) Wide confidence intervals (very sensible) (2) GAM assumes that the relationship outside the fitting region continues linearly

What is a GLM?

Generalized Linear Model: extension of the linear model used to deal with particular types of data. Examples: Poisson and Binomial In particular, these models assume a normal distribution of the data and use a link function. Just as for linear models, GLMs can be used for testing hypotheses and estimating effects

How could we use a linear model to make predictions for new data?

Given a fitted model (fitted values), both predicted values and confidence intervals can be calculated for new data

Steps in Analysis

Graphical analysis aka Exploratory Data Analysis: enables us to (1) inspect the data, (2) find mistakes, and (3) better understand data structure (interactions between variables) Usually done by plotting reponse variable against all predictors (determine if relationship appears linear or non-linear) then using coloring or panelling to visually examine interactions between predictors Model fitting then follows as an iterative process (starting model fitted, then residual analysis conducted + model diagnostics taken, then a better model is fitted based on these results)

When should response variables and/or predictors be transformed in a model? What effect does this have?

If a variable is an "amount" (all data > 0 i.e. revenue, age, height), it will likely have to be log-transformed in order to accurately model it. This impements the natural logarithm function. Doing so will reduce the driving force that outliers have on the model and will reduce the skewness of the data (symmetry of data). This will result in a better fit of a linear model. For response variables, amounts should always be log-transformed. For predictors, amounts should sometimes be log-transformed especially if skewness is visible. If transforming a predictor does not visibly improve the model fit, it can be left untransformed to facilitate interpretation of the coefficients.

What is overdispersion and why does it occur in Poisson models? How to check if your data is overdispersed?

In real settings, it is very often the case that count data do NOT follow a Poisson distribution (10 complaints for a doctor = variance of 10, 2 complaints = variance of 2). Most often this type of data show a variability that is LARGER than expected under the Poisson distribution (the mean value). In other words, real settings count data is often found to be overdispersed. To check this, compare the residual deviance and corresponding degrees of freedom in the summary output. If data is truly Poisson distributed, these should be approximately the same. If residual deviance > df = data is overdispersed.

How to test predictors in a Poisson model? How is this different from a Linear model?

In the same way as for a Linear Model, the summary() function provides p-values for all parameters present in the model. For Poisson models, p-values are based on z-values instead of t-values (since results of the inference procedure are "only" asymptotically valid). In the same way as for the LM, summary() function can be used for inference of continuous predictors and factors with 2 levels However, differently from LM, Poisson models should be compared using Likelihood Ratio Tests (LRTs). This means that when drop1() and anova() functions are used, the test applied must be "Chi-squared" rather than "F-test". Factors with > 2 levels cannot be tested using summary() function. For quasi-poisson models, all the same above applies except F-tests should be used rather than Chi-squared.

What is the correct interpretation of regression coefficients from lm(Hwt ~ Bwt + Sex)? Why will these be slightly different from lm(Hwt ~ Bwt)?

In this case, the model assumes that the effect of Bwt on Hwt is the same for males and females and therefore controls for gender. In other words, "by increasing one unit of body weight, while keeping all other predictors fixed, we expect an increase of ... in heart weight." This is why estimated coefficients will be slightly different than for a linear model that does not include other predictors (Hwt ~ Bwt)

How could you better test for interaction between two continuous variables?

Interpreting results when testing for interaction between two continuous variables is not straightforward. A possible alternative is to discretise one of the variables into a factor ("high, low, medium density")

Inference procedure for a mixed effects model: (1) Fixed Effects

It is not entirely clear how statistical inference should be carried out within the framework of mixed effects models - so the model 'summary' does not report p-values. To test statistical significance of fixed effect predictors in a mixed model, Likelihood Ratio Tests (LRTs) are used (statistic is assumed to be chi-squared-distributed) To do this, use anova or drop1() functions (which automatically refit the model using Maximum Likelihood ML instead of REML) drop1(mod.0, test = "Chisq") or anova(mod.0, mod.1, test = "Chisq") These will provide equivalent p-values

Confidence intervals for coefficients in binomial data

It is recommended to compute confidence intervals via likelihood profiling for binomial data (MASS packge) confint(glm.insects)

What problems arise by modeling Count Data with a Linear Model?

Linear model: response variable is a continuous variable that is assumed to follow a normal distribution with constant variability Count data: variability is not stable but increases with the mean value. For example, number of cigarettes smoked per day does not increase linearly from non-smoker -> normal smoker -> heavy smoker Problems that arise when count data is used to simulate a linear model: 1. Negative counts (negative cigarettes smoked/day) 2. Variability is assumed to be the same across groups 3. Non-integer counts arise (22.5 cigarettes/day)

What would the Tukey-Anscombe and scale-location plot look like for untransformed "amount" data?

Many observations overlap in the left-hand side of the graph (because no data < 0 so for low fitted values close to 0, all residuals are positive) and the smoother is not on zero (has a "U-shape"). Variance of residuals also increases with the fitted values (can be seen in Scale-Location plot).

What does the summary output of a GAM model tell us?

Model with two smooth terms: gam.1 <- gam(growth.rate ~ species + s(density.site) + s(diversity.site) "Parametric coefficients" section is identical to that obtained with a linear model. Estimated coefficients, standard errors, t-values, and p-values for parametric variables are provided (in this case only species) "Approximate significance of smooth terms" section shows the complexity of the smooth function for all smooth terms with estimated degrees of freedom (edf). If edf = 1 for a smooth term, effect of this variable is estimated to be linear. If edf = 2.67, the effect of this variable is estimated to be between quadratic and cubic. Unfortunately, no regression coefficients are reported for smooth terms since the model is not estimating a single regression coefficient for each smooth term but rather a smooth function. To get the slope for a predictor, it must be included in the model as a parametric variable (effect taken to be linear).

How can you compare all levels within a categorical parameter at once in a pairwise manner? What is the problem you can run into when doing this?

Multiple testing is an issue that arises by performing a large number of tests at once: some of them will be significant just be chance (in 100 tests under null hypothesis at 5% level, 5 will be significant by chance) Tukey Honest Significant Difference Test controls for the number of tests performed when testing all possible pairs within a categorical parameter ph.test.1 <- glht(model = lm.trees, linfct = mcp(species = "Tukey")) The resulting estimates and p-values will show pairwise comparison between each level and will account for the multiple testing issue

What are non-problematic vs problematic errors?

Non-problematic errors are those that deviate greatly from the regression line but are found in the middle of the data or those that are extreme on the x-axis but lie on the regression line Problematic erorrs are those that are extreme on the x-axis and deviate greatly from the regression line. These will have a large impact on the regression estimates.

What do degrees of freedom tell us?

Number of levels in a parameter - 1 (due to the presence of a reference level)

What is a starting model? What does model "development" look like?

One that contains all predictors and all corresponding interactions that we consider important. Most of the time, the effect of all continuous predictors is assumed to be linear in a starting model. Afterwards, we can check whether all interactions are needed (drop1 function), whether data should be transformed (log), and/or if non-linearities should be added to the model (add a quadratic term and use anova or drop1 to determine if it is needed).

Why is adding a quadratic term to a model a good idea? What are the downfalls?

PROS: The model is closer to reality The model better fits the data There is higher power to detect effects of other variables (more variation in the data can be explained) Model assumptions are better fulfilled CONS: Implies estimating more parameters Assumes that the effects are truly quadratic (GAMs provide more flexible tools for modeling non-linear relationships)

How to determine if you can/should drop a parameter you aren't interested in (i.e. design variable like "Subject ID") from the model?

Plot residuals against the variable - if variability is not consistent across different levels of the variable, the observations are not independent by and this variable has a clear effect. Omitting this parameter from the analysis would not allow us to assume the observations are independent and thus the inference procedure (computing p-values and estimating confidence intervals) would not be valid

What methods can be used to model a non-linear relationship?

Polynomials, regression splines, smoothing splines, Generalized Additive Models (GAM)

What is the risk of testing parameters if sample size is small?

Power to detect any effect, if present, is lower the smaller the sample size (ideally you need 20 observations per parameter to be an acceptable sample size)

Binary data

Proportion data (0 or 1) - binary data is a particular case of binomial data where the sample size equals 1 examples: whether a client visiting a website clicks on an advertisement, whether a cancer patient is alive after "invasive" surgery, presence of invasive species on one unit of grassland, presence of malnutrition given a risk factor

Binomial data

Proportion data (number of successes/number of trials) Refers to integers (e.g. seven rats out of 10 survived) NOT continuous measurements (e.g. rats consume 120 g out of 200 g provided) Binomial data is bounded on the left (0%) and the right (100%) examples: proportion of students that pass an exam, proportion of trains that make it to the station on time in a day, aggregated weekly data for drug compliance (50%, 100%, etc)

Inference procedure for a mixed effects model: Random Effects

Random effects are usually part of the design and should not be removed from the model. We prefer to estimate the INCERTITUDE of variance components instead of testing them. Although it is possible to test random effects with LRTs, doing so fails to fulfill one of the assumptions of LRTs since the tested value lies at the boundary of its space As a result, we prefer to estimate confidence intervals for variance components by using the profiling likelihood prof.0 <- profile(mod.0) confint(prof.0) Profiling is executed for all estimated parameters (random AND fixed effects) -> this can be solved by specifying for random effects: profile(mod.0, which = "theta_")

How to improve the interpretability of the (Intercept) estimated regression coefficient?

Re-center the data 1. Create new predictor variable whose values are all original predictor values subtracted by the mean (d.cats$Bwt.centered <- d.cats$Bwt - mean.Bwt) 2. Re-plot the data using new re-centered predictor variable (Bwt.centered) 3. Fit a linear model using the re-centered predictor variable and interpret coefficients = lm(Hwt ~ Bwt.centered)

What is RSS?

Residual Sums of Squares: variance unexplained by a model - the lower the RSS, the better the model (the more variability in the data is explained)

What are other diagnostic plots (outside of four main for residual analysis)?

Residuals vs predictor plots: Non-linearities, missing interactions (i.e. model equation), constant variance Residuals vs time plots: Time correlation Residuals over space plots: Space correlation ACF, PACF and Variograms to assess independence: Time and space correlation

What are regression splines? (not used in practice)

Similar to polynomials in that they create several continuous predictors from one. However, splines are based on basis functions rather than on polynomials. lm.regressionSplines <- lm(yy ~ ns(x1, df = 3)) If using df of 3, the summary output will include estimated coefficients and p-values for three continuous predictors (creates three continuous predictors from a single continuous predictor - same as cubic polynomial function) Pros: (1) Have better properties than polynomials in terms of reliability outside the fitting region (the fit extrapolates linearly outside the fitting region and has wider confidence intervals to reflect the lack of data) AND (2) do not suffer from collinearity issues Cons: BUT (1) require the user to set the dimension (the degree of complexity) which risks under/overfitting the data

How to interpret coefficients in a Poisson model? How is this different from a Linear Model?

Since a link function is used in a Poisson model (natural logarithm applied to response variable) the estimated coefficient for the parameter of interest must be interpreted by taking its inverse function (exponential function) For categorical variable "gender": exp(coef(glm.complaints)["genderM"] = 1.12 correct interpretation: male doctors get, on average, 12% more complaints than women doctors (multiplicative, not additive) For continuous variable "visits": exp(coef(glm.complaints)["visits"] = 1.001 correct interpretation: for a given doctor, increasing the number of visits by one would result in about 0.1% more complaints. This interpretation can be improved by selecting a more meaningful increase in visits exp(coef(glm.complaints)["visits"]*50) = 0.0475 correct interpretation: for any given doctor, increasing the number of visits by 50 would result in ~5% more complaints

What problems arise when fitting smoothers to small sample sizes?

Smoothers may be unnecessarily wiggly when fitted to a small sample size i.e. non-linearities observed may be due to natural noise present in the data rather than a true non-linear relationship

Why can linear models effectively model non-linear affects?

The adjective "linear" refers to the linearity in coefficients (meaning all coefficients are multiplied by the predictor variables) not to the type of relationship modelled.

What are residuals?

The difference between observed values and fitted values

Principle of Marginality

The drop1() function does not test for main effects that are involved in an interaction. This is the principle of marginality. Main effects must be tested only if they are not involved in interactions or higher-degree terms. More in general, marginality implies the testing of higher-degree terms first.

Explain the difference between these models and their coefficients: (1) lm.cats.2 <- lm(Hwt ~ Bwt + Sex) vs. (2) lm.cats.3 <- lm(Hwt ~ Bwt * Sex) OR lm.cats.3 <- lm(Hwt ~ Bwt + Sex + Bwt:Sex)

The first assumes that the two genders differ in intercepts only (same slope between Bwt and Hwt) whereas the second relaxes the assumption to fit a model that allows the two genders to have different intercepts and different slopes In the first case: (Intercept) refers to intercept of females Bwt refers to the slope for BOTH genders SexM refers to the difference between male and female intercepts In the second case: (Intercept) refers to the intercept of females Bwt refers to the slope for females only SexM refers to the difference between male and female intercepts Bwt:SexM refers to the difference between male and female slopes

How can you interpret the regression coefficients of a model where the response variable and/or predictors are log-transformed?

The inverse transformation of the estimated regression coefficient can be taken and interpreted in a multiplicative manner. For example, if the regression coefficient is 0.04, the inverse natural log of this coefficient (e^0.04) = 1.04. This can be interpreted as: when we increase the predictor by one unit, the reponse variable is MULTIPLIED by 1.04. In other words, the response variable increases by 4%. If both response and predictor are log-transformed, we can directly interpret the regression coefficient as percentage changes for both variables. For example, if you increase the predictor by 1%, the response variable increases by 4%.

What is a likelihood ratio test (LRT)?

The likelihood ratio test (LRT) is a statistical test of the goodness-of-fit between two models - it assesses the goodness of fit of two competing statistical models based on the ratio of their likelihoods. anova(data1, data2, test = "F" or "Chisq") drop1(data, test = "F" or "Chisq")

When should a low-degree polynomial (quadratic or cubic) be used in practice over GAMs?

The main reasons to choose polynomials over GAMS: 1. Simpler function 2. Has fewer parameters that need to be interpreted (although this is not true for high-order polynomials) PROS for polynomials: - simple concepts to explain - coefficients are easy to interpret - can be implemented within the normal linear model "machinery" CONS for polynomials: - involves estimating one more parameter (compared to linear term only) - requires the user to define the complexity of the relationship - makes strong assumptions on the form of the non-linear relationship (frequently, effects such as "hockey stick" or "plateaus" cannot be modeled with low-degree polynomials) PROS for GAMs: - the complexity of the smooth function is determined in an objective way (data-driven) - very little assumptions are made about the form of the non-linear relationship CONS for GAMs: - complex concept and still look "exotic" - fitting a smooth term requires estimating several parameters (samples size must be large enough) - smooth terms are not associated with regression coefficients that can be easily interpreted - require a more complex fitting machinery than linear models (convergence problems) - complex GAMs (many smooth terms, large sample size) may take very long to fit

How can you compare levels within a categorical parameter by their impact on your response variable (e.g. "does Quercus differ from Picea in terms of growth rate?")

These types of tests are called contrasts (glht function) ph.test.1 <- glht(model = lm.trees, linfct = mcp(species = c("Quercus - Picea = 0"))) Summary of ph.test.1 will show the estimate and p-value comparing growth rates between Quercus and Picea (which one grows faster, by how much, and is there strong evidence that the rates differ?) The growth difference (estimate) is nothing other than the difference between the two species coefficients

What does the plot.lm() function do?

This is an object-oriented function (acts differently depending on the object you feed it) that produces the four most important plots required for a residual analysis of a linear model (Q-Q plot, Tukey Ascombe, Scale-Location i.e. square-root-residuals vs fitted values, Residuals vs Leverage)

What are treatment contrasts?

This is how R deals with categorical variables/factors - it sets one level (the first alphabetically) as a reference (e.g. Female is the reference for Sex variable) Comes from clinical studies where different treatments are compared to a gold-standard/reference The reference level can be re-set using the "relevel" function

"Dispersion parameter for poisson family taken to be 1" meaning

This is printed in the summary output for a Poisson model and highlights that by fitting this model, we assume the variance increases linearly with the mean

What is a global F test?

This tests whether any of the predictors in the model have an influence on the response variable. This is done by comparing the full model with the simple model (containing only the intercept). This F-test is actually reported by default at the very end of the "summary" output (F-statistic)

Overdispersion in binomial models

To check for overdispersion, residual deviance and the degrees of freedom can be compared (summary function). If residual deviance is > degrees of freedom, overdispersion is present (variability of the data in reality is higher than expected by the model) This can be corrected by using the "quasibinomial family" when fitting a model to the data IMPORTANTLY, the difference between residual deviance and degrees of freedom is not informative for binary data and, thus, binary data CANNOT BE MODELED WITH A QUASIBINOMIAL MODEL

How to test categorical predictors in a binomial model?

To test the effect of concentration as a factor (no, low, or high concentration), anova() or drop1() can be used To test single levels of a factor (high concentration vs low concentration), the glht() function can be used

How to avoid negative fitted values (count data)?

Use a "link function": in the context of GLM, link function is a transformation applied to the response variable y = b0 + b1*x1 y = exp(b0 + b1*x1) aka log(y) = b0 + b1*x1 Now the fitted values will always be positive

How can you fix the problem of predictions lying outside of [0,1]?

Use link function to constrain predictions to the [0,1] range (default choice for binary or binomial data is the logit function) logit(y) = log(y/1-y) logit(y) = bo + b1*x1 inverse logit function produces a sigmoidal function with 2 asymptotes (0 and 1) As a result, whatever value is fed to the inverse logit function will produce a result bounded between 0 and 1

How to test continuous predictors in a binomial model?

Use the anova() or drop1() functions with test = "Chisq". These methods are preferred to the summary() function for binomial data

How to estimate the strength of a relationship between two variables rather than just if one exists?

Use the estimated regression coefficients and associated confidence intervals: a non-significant p-value corresponds to a confidence interval that contains zero

How can you estimate collinearity in a model?

Use the vif() function to get an estimate for the GVIF (Generalized Variance Inflation Factor) which indicates how severe the issue of collinearity is. If value is above 5, varible should be removed from the model if possible. If value is above 10, the variable or those correlated with it must be removed.

Variability in binomial data

Variability of binomial data is HIGHEST at a success rate of 50% and minimal for the success rates of 0% and 100% (the same is true for proportions). Variability increases with mean but there is a limitation based on the number of trials

What are the components of the Random Effects section in a mixed model summary?

Variance and sd for subject (Intercept): estimate of variability AMONG persons Variance and sd for Residual: estimate of variability WITHIN persons (over time) The difference between these two can be very important to design new experiments (Be able to illustrate this with a graph)

What is a non-identifiable model?

When collinearity is so high that the model cannot be fit. Model is said to be non-identifiable due to rank-deficiency of the design matrix.

What does an overfitted model look like?

When noise in the data is modelled as being structure (fit is too wiggly)

How can you include a design variable (such as Subject ID) in the model without estimating parameters for each level? What will the summary output of such a model show?

When subject is included in the model, estimated coefficients are calculated for each unique ID (very many parameters). We are not interested in the single levels or in any particular contrasts but we DO want to control for this parameter. To solve this, a mixed effect model (lmer) can be used which, rather than estimating an intercept for each subject, estimates "variability among subjects" mod.0 <- lmer(distance ~ age * Sex + (1 | subject) This will by default use Restricted Maximum Likelihood (REML) to fit the model (gives better estimates of the random effects i.e. variance components). The summary of this model separates random and fixed effects - reports the estimated variances and their square root (sd) for the random effects and the estimated regression coefficients for the fixed effects.

How to test for interactions between variables?

When testing interactions between predictors, interacitons must make sense in relatiy and there must be at least a few observations for each combination. Interactions can be tested between 2 categorical variables or between a categorical and continuous variable (most common) as well as between 2 continuous variables (less common) Add a term to the model (age:species) then perform the drop1() function. The Pr(>F) value for the interaction term indicates if there is strong evidence that age interacts with species i.e. if the effect of age is different among species. This is useful because the effect of age may be absent or weak when averaged over all species if some species have a positive relationship between age and growth rate and some have a negative effect. If the interaction term is significant, both predictors are considered to be relevant

When can you not perform a meaningful comparison between levels of a categorical variable (contrasts)?

When the categorical variable is involved in an interaction

When are confidence intervals more helpful than p-values?

When we are not interested in testing a variable but rather on estimating its effect and the precision of this estimation.

Simulating from a binomial model

You can use real data (number of participants trying two diets, number of participants completing each diet to the end) to calculate success rate for each group (diet old, diet new). The success probability and sample size (assume 45 participants in each group at the beginning of a study) can be used to simulate data from a binomial distribution. Variability will be max at 50% success rate and decrease as you move away in either direction. As a result, the variability will be larger for the group with probability parameter closer to 50%. If old diet has 67% success and new diet has 95% success, old diet will have higher variability (larger box plot range)

What is the problem with using the anova function on a single model (not comparing two models)?

anova(lm.trees) This implements the sequential sums of squares Anova (type III sums of squares). The results obtained with this type of analysis depends on the ordering in which predictors enter the formula. This problem does not arise when using the drop1() function.

Explain the meaning of box plot

black line = median (50% of the data falls above and 50% falls below this line) boxes = IQR (50% of values fall within the box) error bars = min and max values

"str(dataset)" function

completely displays internal structure of an object/dataset

What does the drop 1() function do? How can its results be interpreted?

drop1(lm.trees, test = "F") This function can be used for categorical or continuous predictors This function automatically performs F tests for each variable present in a given model. This testing is fully equivalent to the anova() function. The RSS and Pr(>F) values represent the unexplained variability remaining and significance of the effect after removing [predictor] from the model. Importantly, the effect of each variable is computed while accounting for hte efefct of hte other predictors present in the model. THIS is why it is important to fit models with several predictors instead of fitting several models with one predictor - we can estimate the effect of a predictor while removing the effect of all others.

How to fit a binomial model to a dataset and interpret the coefficients (for continuous variables)?

glm.insects <- glm(cbind(dead,alive)) ~ conc, family = "binomial") Because of the link function (logit), the interpretation of coefficients must be adapted. If the exponential function is applied, we can get the increased or decreased risk in odds. exp(coef(glm.insects)) = 3.2 = Increasing the concentration of the insecticide by 1 unit will obtain an increased risk in the odds (p/1-p) of death of about 3 times In words, if at concentration 2 units, the probability of death (mortality rate) is 50% and at 3 units it is 75%, the odds of survival are 1:1 and 1:3 respectively. As. result, increasing the concentration from 2 to 3 (by one unit) results in a three-fold increase in the odds of death (1:1 to 1:3) Another example: exp(coef(glm.xray)["age"] = 1.05 Increasing the age of a child by one year, the odds increase by about 5% (exp(coef(glm.xray)["age"] = 1.05)*10 = 1.57 Increasing the age of a child by 10 years, the odds increase by bout 57%

Predictor, Control, and Design Variables

lm.trees <- lm(growth.rate ~ species*(age + density + diversity + siteID)) This model includes all predictors and interactions (predictors = variables of interest, control variables, design variables) Variable of interest = diversity: aim of the study is to test whether biodiversity has a positive effect on tree growth Control variable = species, age, density: we are aware this variable plays an important role and we want to control for it. We expect all predictors to have a different effect on growth depending on species Design variable = siteID: predictor under control of the experimenter (often relate to logistic constraints of the experiment). Design variables must always be included in a model but should never be tested for significance

What is the practical meaning of the p-value for parameters in a linear model?

p-value quantifies the probability of observing the value of the test statistic, or a more extreme value, under the null hypothesis In other words, p-values refer to the hypothesis that the slope of a given parameter (variable) equals 0 If p-value is < 0.05 for a parameter, this is strong evidence that the slope for this parameter is not flat (i.e. not zero) The p-value associated with the (Intercept) parameter is not of interest (because any linear model should include an intercept) The least meaningful way to treat p-values is to dichotomise them using the 5% threshold - they should be explained using the "amount of evidence against the null hypothesis of a parameter being zero/no effect" Important to remember that p-values rely on several assumptions

What is the problem with evaluating the appropriateness of a model's complexity by only looking at p-values?

p-value significance depends on sample size (the larger the sample size, the smaller the p-value)

Inference procedure for a mixed effects model

p-values and confidence intervals can be computed with the anova() drop1() and confint() functions Fixed effects are usually tested with Likelihood Ratio Tests (LRTs) For random effects we prefer to estimate confidence intervals via profiling likelihood

"head(dataset, n)" function

returns the first n rows of a dataset


Conjuntos de estudio relacionados

HW #3: Categorical Graphs and Summaries

View Set

Smartbook Assignment Chapter 9: Long-Term Liabilities

View Set

Fundamentals of Nursing Ch 2: Theory, Research, and Evidence-Based Practice

View Set

Previous Quiz Questions CSCE 221

View Set