Hair Chapter 4: Multiple Regression

Parameter

A characteristic of the population (for example, mu and sigma squared are the symbols of the population parameters of mean and variance). Estimated from sample data.

Correlation coefficient (r)

A coefficient that indicates the strength of the association between any two metric variables. The sign (+ or -) indicates the direction of the relationship (perfect positive to perfect negative with 0 as no relationship).
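
A minimal numpy sketch of computing r for two illustrative variables (the data are invented for the example):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson r from the off-diagonal of the 2 x 2 correlation matrix.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to +1: strong positive association
```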

Tolerance

A commonly used measure of collinearity and multicollinearity in a regression variate. As the tolerance value gets smaller, the variable is more highly predicted by the other IVs (i.e., there is more collinearity).
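
As an illustration, tolerance can be computed by regressing one IV on the others; this numpy sketch uses synthetic data with a deliberately collinear third IV:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.1, size=n)  # nearly collinear

# Tolerance of x3 = 1 - R-squared from regressing x3 on the other IVs.
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, x3, rcond=None)
r2 = 1 - np.sum((x3 - X @ b) ** 2) / np.sum((x3 - x3.mean()) ** 2)
tolerance = 1 - r2
print(round(tolerance, 4))  # small tolerance means high collinearity
```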

Homoscedasticity

A description of the data for which the variance of error (e) terms appears constant over the range of values of an independent variable; studied by looking at residuals. The assumption of equal variance of the population error (E; where E is estimated from the sample value e) is critical to properly applying linear regression. When the error terms have increasing or modulating variance, the data are heteroscedastic.

Normal probability plot

A graphical comparison of the shape of the sample distribution to the normal distribution. A normal distribution is represented by a straight line at 45 degrees, and the sample distribution is plotted against it so as to see deviations.

Partial regression plot

A graphical representation of the relationship between the DV and a single IV. The scatterplot of points depicts the partial correlation between the two variables with the effects of the other IVs held constant. Helps assess the form of the relationship (linear vs. nonlinear) and the identification of influential observations.

Regression variate

A linear combination of weighted independent variables used collectively to predict the dependent variable.

Coefficient of determination (R-squared)

A measure of the proportion of variance of the dependent variable about its mean that is explained by the independent, or predictor, variables. The coefficient can vary between 0 and 1; the higher the R-squared, the greater the explanatory power of the regression equation and therefore the better prediction of the dependent variable.
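
A short sketch of the computation, using invented actual and predicted values:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])       # actual DV values
y_hat = np.array([3.2, 4.9, 7.1, 8.8, 11.0])   # predicted values from a model

ss_total = np.sum((y - y.mean()) ** 2)   # variation of the DV about its mean
ss_error = np.sum((y - y_hat) ** 2)      # unexplained variation
r_squared = 1 - ss_error / ss_total
print(round(r_squared, 4))  # near 1 indicates high explanatory power
```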

Standard error of the estimate (SEE)

A measure of the variation of the predicted values that can be used to develop confidence intervals around any predicted value. Similar to the standard deviation of a value around its mean, but it describes the expected spread of predicted values that would occur if multiple samples of the data were taken.

All-possible-subsets regression

A method for selecting the variables for inclusion in the regression model that considers all possible combinations of the independent variables. For example, if a researcher had four potential independent variables, this technique would estimate all possible models with one, two, three, and four variables and identify the one with the best overall predictive accuracy.
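
A minimal sketch of the idea using numpy and itertools; the data and the four candidate IVs are synthetic, and fit is compared by R-squared for brevity:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 60
X_full = rng.normal(size=(n, 4))   # four potential IVs
y = 2 + X_full[:, 0] - 0.5 * X_full[:, 2] + rng.normal(scale=0.5, size=n)

def r_squared(cols):
    X = np.column_stack([np.ones(n), X_full[:, list(cols)]])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

# Estimate every model with 1, 2, 3, and 4 IVs; report the best of each size.
for k in range(1, 5):
    best = max(itertools.combinations(range(4), k), key=r_squared)
    print(k, best, round(r_squared(best), 3))
```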

Stepwise estimation

A method of selecting variables for inclusion in a regression model that starts by selecting the best predictor of the DV and then adding additional IVs in order of incremental explanatory power. IVs are added as long as their partial correlation coefficients are still significant. IVs may also be dropped if their power drops to a nonsignificant level when another IV is added.

Forward addition

A method of selecting variables for inclusion in the regression model by starting with no variables in the model and then adding one variable at a time based on its contribution to the prediction.
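
A rough numpy sketch of forward addition on synthetic data; a fixed R-squared gain cutoff stands in for the partial F test an actual implementation would use:

```python
import numpy as np

rng = np.random.default_rng(12)
n = 80
X_all = rng.normal(size=(n, 4))
y = 2 * X_all[:, 0] + X_all[:, 3] + rng.normal(scale=0.5, size=n)

def r_squared(cols):
    X = np.column_stack([np.ones(n)] + [X_all[:, c] for c in cols])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

# Start with no variables; greedily add the IV with the largest R-squared gain.
selected, remaining = [], [0, 1, 2, 3]
while remaining:
    gains = {c: r_squared(selected + [c]) for c in remaining}
    best = max(gains, key=gains.get)
    if gains[best] - (r_squared(selected) if selected else 0.0) < 0.01:
        break  # illustrative cutoff standing in for a partial F test
    selected.append(best)
    remaining.remove(best)
print(selected)  # likely [0, 3]
```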

Backward elimination

A method of selecting variables for inclusion in the regression model that starts by including all independent variables in the model and then eliminating those variables that do not make a significant contribution to the prediction.

Regression coefficient (bn)

A numerical value of the parameter estimate directly associated with an independent variable. Represents the amount of change in the DV for a one-unit change in the IV. In a multiple predictor model, the regression coefficients are partial coefficients because each takes into account not only the relationship between the DV and that IV but also the relationships among the IVs. The coefficient is not limited in range because it is based both on the degree of association and on the scale units of that IV.

Partial F (t) values

A partial F test is a statistical test for the additional contribution of prediction accuracy of a variable above that of the variables already in the equation. A low or insignificant partial F value for a variable not in the equation indicates its low or insignificant contribution to the model as already specified. A t value may be calculated instead of an F value in all instances, with the t value equal to the square root of the F value (t² = F) when a single variable is tested.
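
The relationship between the partial F and t values can be illustrated with a numpy sketch on synthetic data (variable names invented for the example):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 + 0.8 * x2 + rng.normal(scale=0.5, size=n)

def sse(X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

X_reduced = np.column_stack([np.ones(n), x1])    # model without x2
X_full = np.column_stack([np.ones(n), x1, x2])   # model with x2 added

# Partial F for x2: extra explained variation over the full model's error variance.
f = (sse(X_reduced) - sse(X_full)) / (sse(X_full) / (n - 3))
print(round(f, 2), round(np.sqrt(f), 2))  # sqrt(F) equals |t| for one added IV
```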

Null plot

A plot of residuals versus the predicted values that exhibits a random pattern; a null plot is indicative of no identifiable violations of the assumptions underlying regression analysis.

Standardization

A process whereby the original variable is transformed into a new variable with a mean of 0 and a standard deviation of 1. Typical process is to subtract the variable mean from each observation and then divide by the standard deviation. When all variables in a regression variate are standardized, the intercept assumes a value of 0 and the regression coefficients are known as beta coefficients (which allows the researcher to compare directly the effect of each IV on the DV).
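
A minimal numpy sketch (using the sample standard deviation via ddof=1 is an assumption; some treatments use the population formula):

```python
import numpy as np

x = np.array([10.0, 12.0, 15.0, 18.0, 20.0])

# Subtract the mean from each observation, then divide by the standard deviation.
z = (x - x.mean()) / x.std(ddof=1)
print(round(z.mean(), 10), round(z.std(ddof=1), 10))  # mean 0, standard deviation 1
```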

Simple regression

A regression model with a single IV (also called bivariate regression).

Multiple regression

A regression model with two or more independent variables.

Statistical relationship

A relationship based on the correlation of one or more IVs with the DV. Measures of association (usually correlations) represent the relationship.

Beta coefficient

A standardized regression coefficient. Allows for a direct comparison between coefficients as to their relative explanatory power of the dependent variable. Whereas regression coefficients are expressed in the units of the associated variable (which makes comparisons inappropriate), beta coefficients use standardized data and can be directly compared.

Linearity

A term used to express the concept that the model possesses the properties of additivity and homogeneity. Linear models predict values that fall in a straight line by having a constant unit change (slope) of the DV for a constant unit change of the IV.

Transformation

A transformation (e.g., taking the log or square root of a variable) creates a new variable and eliminates an undesirable characteristic (e.g., nonnormality) that detracts from the ability of the correlation coefficient to represent the relationship between it and another variable. Transformations can be applied to IVs, DVs, or both. The need for and type of transformation can be determined on theoretical or empirical grounds.

PRESS statistic

A validation measure obtained by eliminating each observation one at a time and predicting this dependent value with the regression model estimated from the remaining observations.
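
A leave-one-out sketch with numpy on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])

# Eliminate each observation in turn, refit, and predict the held-out value.
press = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press += (y[i] - X[i] @ b) ** 2
print(round(press, 3))  # lower PRESS suggests better out-of-sample prediction
```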

Degrees of freedom (df)

A value calculated from the total number of observations minus the number of estimated parameters. These parameter estimates are restrictions on the data because, once made, they define the population from which the data are assumed to have been drawn. Df provide a measure of how restricted the data are to reach a certain level of prediction; if the df is small, the prediction may be less generalizable because all but a few observations were incorporated in the prediction. If the df is large, the prediction is robust and representative of the overall sample of respondents. One example is a regression model with a single IV: we estimate two parameters, the intercept and the regression coefficient for the IV. In estimating the random error (the sum of the prediction errors -- actual minus predicted dependent values) for all cases, we would have n - 2 degrees of freedom.

Intercept (b0)

A value on the y-axis where the line defined by the regression equation crosses the axis (described by a constant term in the equation). In addition to its role in prediction, it can be used for interpretation (it is the expected value of the DV when all IVs equal zero). In most cases, the data must be centered for the intercept to be meaningfully interpreted, because values of zero on all IVs are often unrealistic.

Part correlation

A value that measures the strength of the relationship between a dependent and a single independent variable when the predictive efforts of all other IVs in the regression model are removed. The objective is to portray the unique predictive effect due to a single IV among the set of IVs. Different from the partial correlation coefficient, which is concerned with incremental predictive effect.

Partial correlation coefficient

A value that measures the strength of the relationship between the DV and a single IV when the effects of the other IVs in the model are held constant. Used in stepwise estimation, forward addition, and backward elimination to identify the IV with the greatest incremental predictive power beyond the IVs already in the regression model.
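
The contrast with the part correlation can be sketched in numpy: for the part correlation only the IV is residualized, while for the partial correlation both the DV and the IV are. The data are synthetic and residualize is a helper written for this example:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x2 = rng.normal(size=n)                 # the IV held constant
x1 = 0.6 * x2 + rng.normal(size=n)      # focal IV, correlated with x2
y = x1 + x2 + rng.normal(size=n)

def residualize(v, on):
    """Return v with the linear effect of 'on' removed."""
    X = np.column_stack([np.ones(n), on])
    b, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ b

x1_res = residualize(x1, x2)
part = np.corrcoef(y, x1_res)[0, 1]                      # only the IV residualized
partial = np.corrcoef(residualize(y, x2), x1_res)[0, 1]  # both residualized
print(round(part, 3), round(partial, 3))
```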

Independent variable

Also called a predictor variable; variable(s) selected as predictors and potential explanatory variables of the dependent variable.

Dependent variable

Also called the criterion variable; the variable being predicted or explained by the set of independent variables.

Significance level (alpha)

Also called the level of statistical significance; represents the probability the researcher is willing to accept that the estimated coefficient will be classified as different from zero when it actually is not (i.e., the probability of a Type I error). Most researchers use .05, but .01 is more conservative and .10 is less conservative (it is easier to find significance).

Moderator effect

Also known as an interactive effect; an effect in which a third independent variable (the moderator) causes the relationship between an IV and DV to change depending on the value of the moderator.
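
A sketch of representing a moderator via an interaction term, with synthetic data in numpy:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 80
x = rng.normal(size=n)   # focal IV
m = rng.normal(size=n)   # moderator
y = 1 + x + 0.5 * m + 1.5 * x * m + rng.normal(scale=0.3, size=n)

# The x*m product term lets the slope of y on x depend on the moderator.
X = np.column_stack([np.ones(n), x, m, x * m])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 2))  # slope of y on x at a given m is b[1] + b[3] * m
```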

Specification error

An error in predicting the DV caused by excluding one or more relevant IVs. This omission can bias the estimated coefficients of the variables included and decrease the overall predictive power of the regression model.

Least squares

An estimation procedure used in simple and multiple regression whereby the regression coefficients are estimated so as to minimize the total sum of the squared residuals.
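
A minimal numpy sketch solving the normal equations (data invented for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 IVs
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=n)

# Normal equations: b = (X'X)^(-1) X'y minimizes the sum of squared residuals.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(b, 2))  # close to the true values [1, 2, -1]
```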

Collinearity

An expression of the relationship between two (collinearity) or more (multicollinearity) independent variables. Complete (multi)collinearity is represented by a correlation coefficient of 1; this is an extreme case called singularity, which occurs when an IV is perfectly predicted by another (or more than one other) IV. Complete lack of (multi)collinearity is represented by a correlation coefficient of 0.

Singularity

An extreme case of (multi)collinearity in which an IV is perfectly predicted (a correlation of -1 or +1) by one or more other IVs. Regression models cannot be estimated when singularity exists; the researcher must eliminate one of the variables.

Dummy variable

An independent variable used to account for the effect that different levels of a nonmetric variable have in predicting the dependent variable. To account for L levels of a nonmetric IV, L-1 dummy variables are needed (for just male vs. female, you don't need two variables because one indicates the lack of the other). Two most common methods are indicator coding (specifies the reference category as 0, so the regression coefficients represent the group differences in the DV from the reference category) and effects coding (specifies the reference category as -1, so the regression coefficients represent group deviations on the DV from the mean of the DV across all groups).
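
A sketch of both coding schemes in numpy for a three-level variable, with "A" as the (assumed) reference category:

```python
import numpy as np

# Nonmetric variable with L = 3 levels; "A" is the reference category,
# so L - 1 = 2 dummy variables are needed.
group = np.array(["A", "B", "C", "B", "A", "C"])

# Indicator coding: the reference category is scored 0 on both dummies.
d_b = (group == "B").astype(float)
d_c = (group == "C").astype(float)

# Effects coding: the reference category is scored -1 on both dummies.
e_b = np.where(group == "A", -1.0, d_b)
e_c = np.where(group == "A", -1.0, d_c)
print(np.column_stack([d_b, d_c, e_b, e_c]))
```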

Variance inflation factor (VIF)

An indicator of the effect that the other IVs have on the standard error of a regression coefficient. The VIF is directly related to the tolerance value. Large VIF values indicate a high degree of (multi)collinearity among the IVs.
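
Because VIF is the reciprocal of tolerance, it can be sketched the same way (synthetic, deliberately collinear data):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=n)   # highly collinear with x1

# Tolerance of x2 given x1, then VIF as its reciprocal.
X = np.column_stack([np.ones(n), x1])
b, *_ = np.linalg.lstsq(X, x2, rcond=None)
r2 = 1 - np.sum((x2 - X @ b) ** 2) / np.sum((x2 - x2.mean()) ** 2)
vif = 1 / (1 - r2)
print(round(vif, 1))  # large VIF signals inflated coefficient standard errors
```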

Suppression effect

An instance in which the expected relationships between IVs and DVs are hidden or suppressed when viewed in a bivariate relationship. When additional IVs are entered, the multicollinearity removes "unwanted" shared variance and reveals the "true" relationship.

Influential observation

An observation that has a disproportionate influence on one or more aspects of the regression estimates. This influence may be based on extreme values of the IV, DV, or both. Influential observations can either be good or bad, depending on how they affect the pattern of the remaining data. An influential observation does not have to be an outlier, but many times outliers are classified as influential observations.

Outlier

An observation that has a substantial difference between the actual value of the DV and the predicted value; also includes observations whose values on the IVs or the DV are substantially different from the rest. The main point is to identify observations that may not be good representations of the population.

Residual (e or E)

Error in predicting the sample data. Random error is always assumed to be present; the sample residual (e) serves as an estimate of the true random error in the population (E), not merely the prediction error of our sample. We assume the population error is distributed with a mean of 0 and a constant (homoscedastic) variance.

Measurement error

The degree to which the data values do not truly measure the characteristic being represented by the variable; they make the data values imprecise. Sources include the participants not answering questions or making errors in estimating an answer to a question.

Prediction error

The difference between the actual and predicted values of the DV for each observation in the sample.

Standard error

The expected distribution of an estimated regression coefficient. Similar to standard deviation, but it denotes the expected range of the coefficient across multiple samples of the data. It is useful in significance tests to see if the coefficient is significantly different from zero at a given level of confidence. The t value of a regression coefficient is the coefficient divided by its standard error.
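
A numpy sketch computing the standard errors from the residual variance and (X'X)^(-1), then the t values (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one IV
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
sigma2 = np.sum(resid ** 2) / (n - 2)                   # residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))  # standard errors

t = b / se   # t value = coefficient divided by its standard error
print(np.round(t, 2))
```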

Sampling error

The expected variation in any estimated parameter (intercept or regression coefficient) that is due to the use of a sample rather than the population. Sampling error is reduced as sample size increases and is used to test whether the estimated parameter differs from zero.

Adjusted R-squared (adjusted coefficient of determination)

The modified measure of the coefficient of determination (R-squared) that takes into account the number of independent variables included in the regression equation and the sample size. Adding independent variables will always cause R-squared to increase, but the adjusted R-squared may fall if the added independent variables have little explanatory power or if the degrees of freedom become too small. A useful statistic for comparing equations with different numbers of independent variables, different sample sizes, or both.
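
The usual adjustment formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - k - 1), can be sketched directly:

```python
def adjusted_r_squared(r2, n, k):
    """n observations, k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R-squared, but the penalty grows with the number of IVs.
print(round(adjusted_r_squared(0.80, 30, 3), 3))   # 0.777
print(round(adjusted_r_squared(0.80, 30, 12), 3))  # 0.659
```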

Studentized residual

The most commonly used form of a standardized residual. Differs from other methods in how it calculates the standard deviation used in standardization; to minimize the effect of any observation on the standardization process, it computes the standard deviation of the residual for observation i from regression estimates omitting the ith observation in the calculation of those estimates.
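
A numpy sketch of the leave-one-out standardization on synthetic data; the per-observation standard deviation and the hat-matrix diagonal follow the standard externally studentized residual formula:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 30
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b                                  # ordinary residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverage (hat) values

# The standard deviation for observation i comes from a fit omitting i.
t = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    bi, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    s_i = np.sqrt(np.sum((y[keep] - X[keep] @ bi) ** 2) / (n - 3))
    t[i] = e[i] / (s_i * np.sqrt(1 - h[i]))
print(np.round(t[:5], 2))
```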

Reference category

The omitted level of a nonmetric variable when a dummy variable is formed from that nonmetric variable.

Power

The probability that a significant relationship will be found if it actually exists; equal to 1 minus the probability of a Type II error (beta). Considered alongside the more widely used significance level (alpha).

Sum of squares regression (SSR)

The sum of the squared differences between the mean and predicted values of the DV for all observations. Represents the amount of improvement in explanation of the DV attributable to the IVs.

Sum of squared errors (SSE)

The sum of the squared prediction errors (residuals) across all observations. Used to denote the variance in the DV not yet accounted for by the regression model.

Total sum of squares (SST)

The total amount of variation that exists to be explained by the IVs. It is a baseline value calculated by summing the squared differences between the mean and actual values for the DV across all observations.
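
The identity SST = SSR + SSE, which holds for least squares with an intercept, can be verified numerically; a numpy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

ss_r = np.sum((y_hat - y.mean()) ** 2)   # explained by the regression
ss_e = np.sum((y - y_hat) ** 2)          # unexplained (residual)
ss_t = np.sum((y - y.mean()) ** 2)       # total variation to be explained
print(np.isclose(ss_r + ss_e, ss_t))     # True: SST = SSR + SSE
```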

Polynomial

The transformation of an IV to represent a curvilinear relationship with the DV. Including a squared term (X²) estimates a single inflection point; each additional higher-order term adds another inflection point.
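
A sketch of adding a squared term to the variate, fit by least squares on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-3, 3, 60)
y = 1 + 0.5 * x - 0.8 * x ** 2 + rng.normal(scale=0.3, size=60)

# Adding a squared term to the variate captures the curvature.
X = np.column_stack([np.ones(60), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 2))  # roughly recovers [1.0, 0.5, -0.8]
```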

Leverage points

Influential observations defined by one aspect of influence, termed leverage. These observations are substantially different on one or more independent variables, so they affect the estimation of one or more regression coefficients.

