CCJ 6706 Final


Relevant Variable

1. A "relevant" variable is excluded from the regression equation because you forgot to gather information on the variable (primary analysis) or the dataset you are analyzing failed to collect information on the relevant variable (secondary analysis). 2. When a relevant variable is excluded from the equation, you not only fail to get an estimate for the missing variable but the missing variable can also bias the beta coefficients for the other independent variables included in the regression equation (specification bias or omitted variable bias). 3. The stronger the correlation between the omitted relevant variable and one or more of the independent variables included in the regression equation the greater the bias of the estimated beta coefficients. 4. Bias in the estimated coefficients will not occur if the omitted variable is totally uncorrelated with the independent variables included in the regression equation. However, this situation is very unlikely to occur because the independent variables are usually correlated among themselves. 5. You can tell if you excluded a relevant variable if all the independent variables in the equation are statistically significant, but the R2 for the model is very small. 6. You might also get a beta coefficient that is in the opposite direction (+ or -) than predicted by theory. 7. The only way to correct for omitted variable bias is to gather information on the omitted variable and then include it in the regression equation.

Testing and Correcting Heteroscedasticity

1. A White test can be used for detecting heteroscedasticity. SPSS does not have this test. 2. Using heteroscedasticity-corrected standard errors is the most popular method to correct for heteroscedasticity. This approach improves the estimation of the model's standard errors (SEs) without having to transform the estimated model. SPSS does not have this option.
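A short sketch (not from the notes) of both steps in Python with statsmodels, which does offer these tools; the data are simulated and the variable names are hypothetical:

```python
# White test for heteroscedasticity and heteroscedasticity-corrected (robust) SEs.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=500)
y = 2 + 3 * x + rng.normal(scale=x, size=500)    # error variance grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(ols.resid, X)
print("White test p-value:", lm_pvalue)          # small p-value -> heteroscedasticity

robust = sm.OLS(y, X).fit(cov_type="HC3")        # heteroscedasticity-corrected SEs
print(ols.bse, robust.bse)                       # same betas, different standard errors
```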

Hypothesis

1. A hypothesis is the verbal representation of a theorized relationship between an independent and dependent variable. A hypothesis can never be proven to be true. It can only be shown to be false.

Problems with R2

1. A problem with R2 is that adding additional independent variables to the regression equation virtually guarantees an increase in R2. There are two ways to deal with this problem. 2. Adjusted R2 can be used to compare regression equations with varying numbers of independent variables. Adjusted R2 penalizes you for each independent variable included in the model. 3. The change in R2 can also be examined to determine whether adding more independent variables to the equation improves the fit of the model. 4. A baseline model is estimated and an R2 is calculated. A second regression model is estimated with one or more additional independent variables and an R2 is calculated (nested model).
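A minimal sketch (not part of the notes) of both approaches in Python/statsmodels, with hypothetical variables; the test on the R2 change between the baseline and nested models uses compare_f_test:

```python
# Adjusted R2 and the change in R2 between a baseline and a nested model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
educ = rng.normal(size=n)
age = rng.normal(size=n)
income = 1 + 2 * educ + 0.5 * age + rng.normal(size=n)

baseline = sm.OLS(income, sm.add_constant(educ)).fit()
nested = sm.OLS(income, sm.add_constant(np.column_stack([educ, age]))).fit()

print(baseline.rsquared, nested.rsquared)            # R2 always rises with more predictors
print(baseline.rsquared_adj, nested.rsquared_adj)    # adjusted R2 penalizes extra predictors
f_stat, p_value, df_diff = nested.compare_f_test(baseline)
print("R2-change F-test p-value:", p_value)
```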

Outliers

1. An outlier is an observation that is numerically distant from the rest of the data. 2. Outliers might occur by chance, by measurement error or by a non-normal distribution. Keep in mind that in large samples, a small number of outliers is to be expected. 3. It is possible that an outlier, if present in the data, might overly influence the drawing of the regression line (called an influential case).

Centering - Creating Interaction Variable

1. Before multiplying the two variables together to create the interaction variable, any interval variable comprising the interaction term should be centered (the variable mean is subtracted from each case) to help reduce multicollinearity. Dichotomous variables do not need to be centered. 2. The two variables that comprise the interaction variable and the interaction variable itself (three variables total) are included in the regression equation. If any of the variables comprising the interaction are centered, the original non-centered variable(s) are excluded from the equation.
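A small sketch (not from the notes) of the centering step in Python/pandas; the column names and values are hypothetical:

```python
# Center the interval variable, then build the interaction term.
import pandas as pd

df = pd.DataFrame({
    "age": [19, 25, 33, 41, 52, 60],
    "male": [1, 0, 1, 0, 1, 0],              # dichotomous: no centering needed
})

df["age_c"] = df["age"] - df["age"].mean()   # center the interval variable
df["age_c_x_male"] = df["age_c"] * df["male"]

# The regression would then include age_c, male, and age_c_x_male;
# the original (uncentered) age variable is left out of the equation.
print(df)
```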

Pure Heteroscedasticity

1. Pure heteroscedasticity occurs in a correctly specified model and typically is caused by measurement error not being uniform across observations. When talking about heteroscedasticity, most people are referring to pure heteroscedasticity (example: age and weight). While pure heteroscedasticity can sometimes be a problem at the individual-level (micro-level) of analysis, it tends to be more likely to occur when aggregate units are analyzed (cities, counties, countries, etc.).

Interaction Variable

1. The effect of an independent variable on the dependent variable depends on the values of another independent variable. 2. An interaction variable is created by multiplying two independent variables together (independent variable x independent variable = interaction variable). 3. Interactions can explain more variation in the dependent variable (higher R2) than the simple sum of the separate effects of the individual variables. 4. An independent variable x another independent variable is called a two-way interaction (race x gender). An independent variable x another independent variable x another independent variable is called a three-way interaction (race x gender x prior record).

Reporting Significance Levels

1. The beta coefficient and significance level for education would be reported as: 732.40***. This means that there is an "extremely significant" relationship between education and income level. More formally, there is a 99.9% chance that for every one-year increase in formal education, a person's income level rises by 732 dollars. 2. If you do not sample cases but analyze the entire population, you can still conduct significance tests by saying that you are "generalizing to a hypothetical population."

One and Two Tailed Test

If you cannot determine the direction of the effect of X on Y, then you use a non-directional or two-tailed test of significance. This is the default in SPSS. It is more difficult to find a statistically significant effect using a two-tailed test. For directional hypotheses, use a one-tailed test.

Null Hypothesis

There is no relationship between formal education and yearly income in the population.

Problems with Heteroscedasticity

a. OLS estimates remain unbiased. b. OLS underestimates the variances and standard errors of the estimated coefficients, so both t-tests and F-tests are not reliable. c. The t-statistics tend to be higher, leading you to reject a null hypothesis that should not be rejected (Type 1 error). d. Heteroscedasticity has to be very marked to impact your regression results adversely.

Statistical Theory

enables us, with some degree of error, to use our single sample to determine the amount of variation that there is in our estimates.

Simple Random Sample/population

means that you randomly select your sample from a given population. You need a complete listing of all cases in the population (a total enumeration of cases), and each case must have an equal chance of selection (sample with replacement). 1. If a simple random sample is used, your results can be generalized to the population from which the sample was drawn. 2. The sample used to generate the coefficients (the betas and the intercept) is only one of a number of different samples that could have been randomly drawn from the population. 3. The coefficients are thus estimates of the actual population parameters and will vary to some degree depending on the sample drawn.

Multicollinearity

1. Collinearity refers to a situation in which two independent variables from a dataset are highly correlated, whereas multicollinearity refers to a situation in which three or more independent variables from a dataset are highly correlated. 2. A high correlation among the independent variables makes it difficult to distinguish the unique effects of one independent variable from another in predicting the dependent variable. 3. A small amount of multicollinearity is expected because the independent variables in a regression analysis are going to be intercorrelated. 4. But when the correlation among the independent variables is too high, the beta coefficients for these variables remain unbiased; however, the beta coefficients are less likely to be statistically significant (Type II error). 5. Example: the effect of education and age on income level.

Cook's Distance

1. Cook's distance is used to determine whether any of the data points overly influence the drawing of the regression line. 2. Data points with a Cook's distance of 1 or greater are considered potentially problematic because these data points are overly influencing the drawing of the regression line. 3. If a case has a Cook's distance of 1 or greater, delete the case and re-estimate the regression equation.
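A brief sketch (not from the notes) of this check in Python/statsmodels, with simulated data and a deliberately planted influential case:

```python
# Flag influential cases with Cook's distance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
x[0], y[0] = 8, -20                       # plant an influential outlier

results = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = results.get_influence().cooks_distance

flagged = np.where(cooks_d >= 1)[0]       # rule of thumb from the notes: D >= 1
print("Potentially influential cases:", flagged)
```

Cases listed here would be deleted and the regression equation re-estimated, per point 3 above.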

Determining the Degree of Multicollinearity

1. Examine the bivariate correlations among the independent variables (correlation matrix). A correlation of .80 or larger may cause problems. However, one of the independent variables might still be a nearly perfect linear combination of two or more of the explanatory variables included in the analysis. a. The correlation coefficient ranges from -1 to 1 and indicates the linear strength and direction of a relationship between two independent variables. b. The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables. c. R2 is the correlation coefficient squared. 2. Multicollinearity may also be a problem if the R2 for the estimated regression model is relatively large, but none or only a few of the independent variables are statistically significant. 3. Multicollinearity may also be a problem if the values of the estimated coefficients change dramatically when an independent variable is added to or dropped from the regression equation. 4. A fourth symptom of severe multicollinearity is a coefficient for a variable that turns out to be in the opposite direction (i.e., positive or negative) than predicted by theory and/or previous research.

Impure Heteroscedasticity

1. Impure heteroscedasticity is caused by specification error, such as the failure to include a relevant variable in the model. The best way to correct for impure heteroscedasticity is to include the missing relevant variable in the model (example: income and vacations).

Dealing with Multicollinearity

1. Increase sample size. A larger sample size will allow more accurate estimates by reducing the variance of the estimated coefficients. But increasing the size of your sample is often not feasible. 2. Drop any independent variable with a VIF of 10 or more from the regression equation. However, omitting a relevant variable causes the estimates of the remaining independent variables to be biased, unless the remaining variables are uncorrelated with the omitted variable. 3. Merge the highly correlated variables into a single composite measure (principal components analysis or factor analysis). 4. Run a baseline model without the problematic variables (variables with a VIF score of 10 or greater). Then run a second regression model and add all the variables with a high VIF score and check the change in R2. You can interpret the R2 change because R2 is not impacted by multicollinearity. Do not interpret the significance tests for the variables that were added to the model.

Irrelevant Variable

1. Irrelevant variable bias occurs when a researcher includes an independent variable in the equation that is not associated with the dependent variable. For example, would you expect that hair color influences income level? 2. The problems associated with adding an irrelevant variable to the equation are less troublesome than excluding a relevant variable from the equation. 3. The inclusion of an irrelevant variable in the regression equation does not bias the beta coefficients of the other independent variables. However, the irrelevant variable can reduce the statistical significance of the other variables. 4. Adding an irrelevant variable will increase R2 and should decrease Adjusted R2. The change in R2 when the irrelevant variable is added to the equation should not be statistically significant.

Perfect Multicollinearity

1. Perfect multicollinearity occurs when one independent variable is a linear function of another. 2. With perfect multicollinearity between two of the independent variables, the computer cannot derive unique estimates for the variables. SPSS will automatically drop one of the variables from the equation. 3. Perfect multicollinearity can transpire when a researcher misspecifies his or her regression equation by mistakenly including a set of independent variables that have a "built-in" linear relationship among them. 4. For example, linear dependency would result if a respondent's "year of birth" and the "respondent's age" were both included as predictors in the same regression equation. 5. Another type of built-in linear relationship occurs when a researcher creates a set of dummy variable measures but forgets to exclude one of the dummy-coded variables (k - 1). 6. Perfect multicollinearity can also result when the number of independent variables is greater than or equal to the number of cases in the sample. 7. As a general rule there should be a minimum of 20 cases for every independent variable (including the intercept) in the regression equation. 8. Perfect multicollinearity can also occur if you include an independent variable in the equation that is defined in the same way as the dependent variable. 9. For example, you include the concealed gun permit rate as an independent variable predicting the rate of crimes committed by people with a concealed weapon permit. The beta coefficient for the concealed permit rate variable would be highly significant because crimes committed by people with a concealed weapon permit cannot occur unless the offender had a concealed permit. However, the other independent variables included in the regression equation will be less likely to be statistically significant in this situation.
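A short numeric sketch (not from the notes) of the year-of-birth/age example, using numpy; the values are made up:

```python
# Why perfect multicollinearity breaks estimation: "age" is an exact linear
# function of "year of birth", so the predictor matrix is rank deficient.
import numpy as np

year_of_birth = np.array([1980., 1985., 1990., 1995., 2000.])
age = 2024 - year_of_birth                        # exact linear function of year of birth
X = np.column_stack([np.ones(5), year_of_birth, age])

print(np.linalg.matrix_rank(X))                   # 2, not 3: the columns are linearly dependent
# The OLS normal equations have no unique solution; lstsq reports the rank deficiency,
# and packages such as SPSS respond by dropping one of the collinear variables.
_, _, rank, _ = np.linalg.lstsq(X, np.array([50., 45., 40., 35., 30.]), rcond=None)
print(rank)                                       # 2 again: no unique set of coefficients
```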

Regression Assumptions

1. The error term has a zero population mean. This assumption is assured as long as the constant is included in the regression equation. 2. All the independent variables are uncorrelated with the error term. The dependent variable cannot cause the error term. The dependent variable cannot cause an independent variable. Theory determines cause and effect. 3. The error term from one observation is independent of the error terms from the other observations (no serial correlation / autocorrelation). With cross-sectional data, serial correlation should not be a problem. 4. The regression model is linear in the coefficients and the error term. This assumption relates to omitted and irrelevant variables in the equation. Additionally, the effect of an independent variable is theorized to be linear. 5. No explanatory variable is a perfect linear function of other explanatory variables (no high multicollinearity). 6. The error term has a constant variance (no heteroskedasticity).

Intercept & Constant

1. The intercept is the point where the regression line touches the Y-axis. 2. The coefficient for the intercept represents the value of Y when the independent variables equal zero. In the textbook (page 19), the regression analysis shows that the income level for a person with zero education is 5,078 dollars. Although needed to estimate the regression line, the intercept is usually of little interpretive value. It is rarely interpreted in studies. Sometimes it is even nonsensical. For example, what is the weight of a person with zero height?

Interpret P value

1. The stronger the relationship between the independent and dependent variable in the population, the smaller the p-value. Thus, a p-value of ≤ .01 signifies a stronger relationship than a p-value of ≤ .05. If we find a relationship (the beta coefficient is statistically significant), we might say that: "We rejected the null hypothesis at the .05 level of analysis." 2. If the beta coefficient is not statistically significant at the .05 level, we might say that: "We failed to reject the null hypothesis at the .05 level of analysis." 3. Thus, if we estimated a p-value of .75 for the beta coefficient for education (1 - .75 = .25; .25 x 100 = 25%), this means that there is only a 25% chance of a relationship between education and income in the population.

Tolerance

1. The tolerance for a variable is 1 - R2 for the regression of that independent variable on all the other independent variables, ignoring the dependent variable. So each variable in the equation has a tolerance value. 2. Tolerance ranges from zero to one. 3. When an independent variable has a tolerance score of .20 or less, that variable is highly correlated with the other variables included in the regression equation, and multicollinearity is a problem.
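A minimal sketch (not from the notes) computing tolerance by hand through the auxiliary regression described above; the data are simulated and the variable names are hypothetical:

```python
# Tolerance = 1 - R2 from regressing one independent variable on all the others.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)     # highly correlated with x1
x3 = rng.normal(size=n)

# Auxiliary regression: x1 on x2 and x3 (the dependent variable is ignored)
aux = sm.OLS(x1, sm.add_constant(np.column_stack([x2, x3]))).fit()
tolerance_x1 = 1 - aux.rsquared
print("Tolerance for x1:", tolerance_x1)     # .20 or less signals a multicollinearity problem
```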

Type 2 Error

1. A Type 2 error occurs when you accept the null hypothesis when it is actually false. You believe there is no relationship when there actually is a relationship. As the significance level gets smaller (.05, .01, or .001), the likelihood of a Type 1 error decreases and the likelihood of a Type 2 error increases. A Type 2 error is more conservative (less problematic) than a Type 1 error.

Variance Inflation Factor (VIF)

1. Variance inflation Factor (VIF) is 1/Tolerance. 2. There is no formal cutoff value to use with VIF, which ranges from 1 to infinity, for determining the presence of multicollinearity. However, VIF values exceeding 10 are generally regarded as indicating that multicollinearity is problematic.
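A short sketch (not from the notes) of the VIF check in Python using statsmodels' variance_inflation_factor; the data are simulated and the variable names are hypothetical:

```python
# VIF = 1 / tolerance; values over 10 are usually taken as problematic.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.2 * rng.normal(size=n)    # nearly a copy of x1
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in zip(range(1, X.shape[1]), ["x1", "x2", "x3"]):
    print(name, variance_inflation_factor(X, i))   # x1 and x2 should exceed 10
```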

Type 1 Error

1. When you reject the null hypothesis when it is actually true (b = 0), it is called a Type 1 error or a false positive. The more analyses you perform on a data set, the more likely your results will meet the conventional significance level "by chance."

Sample Size

1. With an extremely large sample size (5,000+), even a very weak relationship might be statistically significant, whereas in a very small sample even a very strong relationship might not be considered reliable (statistically significant) because aberrations in your results are much more likely to occur with small samples. 2. People often interpret nonsignificant coefficients when analyzing very large samples as being a substantive finding. 3. With a very large sample all the variables in the equation might be significant, but R2 could be small. 4. Determining sample size: you should have about 20 cases for each independent variable included in the regression equation, including the constant.

Curvilinear Relationship

A curvilinear relationship (parabola) between two interval variables can be U-shaped or inverted U-shaped. The shape of the relationship is determined by the signs of the beta coefficients, which will be in opposite directions: a U-shaped relationship if the age-centered variable is negative and the age2 variable is positive; an inverted U-shaped relationship if the age-centered variable is positive and the age2 variable is negative. 1. To test for a curvilinear relationship between age and length of prison sentence, you first need to center the age variable by subtracting the average age of the offender from all cases. 2. Multiply the centered variables together (age-centered x age-centered = age2). 3. Include the age-centered variable and the age2 variable in the equation. The original age variable is excluded from the equation. 4. For there to be a curvilinear relationship, both the age-centered variable and the age2 variable must be statistically significant.
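A minimal sketch (not from the notes) of these steps in Python/statsmodels; the data are simulated so that the true age effect is an inverted U, and the variable names are hypothetical:

```python
# Testing a curvilinear (quadratic) effect of age on sentence length.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
age = rng.uniform(18, 70, size=400)
sentence = 10 + 2.4 * age - 0.02 * age**2 + rng.normal(scale=5, size=400)   # inverted U

age_c = age - age.mean()        # 1. center age first
age_sq = age_c * age_c          # 2. age-centered x age-centered = age2

X = sm.add_constant(np.column_stack([age_c, age_sq]))   # 3. original age is excluded
results = sm.OLS(sentence, X).fit()
print(results.params)           # positive age_c and negative age2 -> inverted U-shape
print(results.pvalues)          # 4. both terms must be significant for a curvilinear effect
```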

F-Test

An F-test is then used to determine whether the change in R2 (the difference between the two R2 values) is statistically significant.
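For reference (the notes do not write the formula out), the F statistic for the change in R2 when q additional predictors are added, with n cases and k independent variables in the larger (full) model, is:

$$
F = \frac{\left(R^2_{\text{full}} - R^2_{\text{baseline}}\right)/q}{\left(1 - R^2_{\text{full}}\right)/(n - k - 1)},
\qquad df = \left(q,\; n - k - 1\right)
$$

If this F exceeds the critical value (or its p-value is below .05), the added variables significantly improve the fit of the model. This is the same test computed by compare_f_test in the nested-model sketch under "Problems with R2" above.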

Affirmative Hypothesis

As formal education increases, yearly income increases in the population.

Imperfect Multicollinearity

Imperfect multicollinearity is a common problem that occurs when correlations among the independent variables are too large to allow for precise estimates of the unique effects of the independent variables. Two types of variation in an independent variable: (1) variation unique to the variable and (2) variation that is common among the independent variables. Regression will only use variation that is unique to a variable in determining its effect (beta coefficient) on the dependent variable. Common variation among the independent variables is ignored. When the correlation among the independent variables is too high, the beta coefficients for these variables remain unbiased but they are less likely to be statistically significant (Type II error). Independent variables not highly correlated among themselves are not impacted. Thus, the significance tests for these variables are correct.

Significance Test ( t-test)

Is used to accept or reject the null hypothesis of no relationship between the independent and dependent variable in the population from which the sample was drawn. 1. A significance test informs us as to how sensitive a beta coefficient is to changes in the composition of the sample. The smaller the probability generated in the significance test, the less variability of the beta coefficient in the population and the more likely it is to reject the null hypothesis of no relationship.

P Value

P-value: .05 = 95%, .01 = 99%, .001 = 99.9%. 1. ≤ .05 means that there is a 95% (1 - .05 = .95 x 100 = 95%) chance of there actually being a relationship between the independent and dependent variable in the population. A 95% chance of there being a relationship also means that there is a 5% chance of no relationship between the independent variable and dependent variable in the population. Thus, out of every 100 significance tests calculated that are statistically significant at the .05 level, the odds are that five of the tests are wrong.

R2

1. R2 is a measure of the amount of error in predicting the dependent variable. The closer the data points are to the regression line in the scatter plot, the higher the R2 and the better your predicted model fits the observed data. 2. R2 ranges between 0 and 1. 3. SPSS gives you R2 as a decimal. Just move the decimal two places to the right (.56 x 100 = 56). An R2 of .56 means that education explains 56% of the variation in income. In the field of criminal justice, R2 typically ranges from around 10-20%.

