Methods II - Midterm
First three assumptions of multiple regression:
1. For any value of X, the errors in predicting Y are normally distributed with a mean of zero. (For each value of X, the errors in predicting Y are normally distributed and the mean of the error is zero.) Importance: Whenever e has a mean of zero and is normally distributed, statisticians have found that sample slopes (b) have a mean equal to the population slope (β) and are distributed as a t distribution with a standard deviation s_b. When the sample size is fairly large (N > 30), the t distribution resembles the normal distribution, and z scores can be used as estimates of t scores. Because the t distribution is flatter than the normal distribution (the t has greater probability in the tails), it is important to use large samples whenever possible.
2. The variance of the error term is constant, regardless of the value of X. (Assumption of homoscedasticity: the errors do not get larger as X gets larger.) Importance: This violation can be serious because the standard error and the t statistic are used for testing the statistical significance of the slope; that is, whether there is a relationship between the independent variable X and the dependent variable Y.
3. The errors are independent of each other. (For any two observations, the error terms are uncorrelated.) Another way of saying this: the size of one error is not a function of the size of any previous errors.
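The first two assumptions can be checked informally by examining residuals. A minimal sketch with made-up data (all values synthetic, for illustration only): fit a bivariate regression, then confirm the residuals average to zero and look roughly normal.

```python
# Sketch: checking residual assumptions on a synthetic dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, 100)  # true slope 1.5, normal errors

slope, intercept, r, p, se = stats.linregress(x, y)
residuals = y - (intercept + slope * x)

print(round(residuals.mean(), 8))        # ~0: assumption 1 (mean of errors)
w_stat, w_p = stats.shapiro(residuals)   # normality check on the residuals
print(round(w_p, 3))                     # a very small p would cast doubt on normality
```

With a real dataset, a plot of residuals against X would also reveal heteroscedasticity (assumption 2) if the spread widens as X grows.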
Last 4 assumptions of multiple regression:
4. Both the independent and dependent variables are interval variables.
5. The relationships between independent and dependent variables are linear.
6. X and Y are measured without error. (Chapter 20 states that the model is specified correctly, meaning a well-specified regression equation contains all or most of the independent variables known to be relevant predictors of the dependent variable.)
7. X is a cause of Y. (The relationship between X and Y has some justifiable theoretical basis.) (Chapter 20 states low collinearity.)
Multicollinearity: the case where two or more independent variables are highly correlated (have a linear relationship).
Importance of collinearity: Multicollinearity makes it difficult for the regression equation to estimate unique partial slopes for each independent variable. Partial slope estimates and the associated t values can be misleading if one independent variable is highly correlated with another. Not only is it difficult to distinguish the effect of one independent variable from another, but high multicollinearity also typically results in partial slope coefficients with inflated standard errors, making it hard to obtain statistically significant results.
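A quick way to see multicollinearity numerically is the correlation between two predictors and the resulting variance inflation factor (VIF). This is a hypothetical sketch with synthetic data; the variable names and numbers are made up.

```python
# Sketch: two highly correlated predictors produce a large VIF,
# which inflates the standard errors of the partial slopes.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)   # x2 is nearly a copy of x1

r12 = np.corrcoef(x1, x2)[0, 1]    # correlation between the predictors
vif = 1 / (1 - r12 ** 2)           # variance inflation factor for x1 on x2

print(round(r12, 3), round(vif, 1))
```

A common rule of thumb treats a VIF above roughly 10 as a sign of problematic collinearity.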
How can you use a simple regression model (or bivariate regression) for testing your null and alternative hypotheses? What test statistic is used, and how is it defined? At the minimum, you should explain the purpose of stating a null hypothesis, and state the rules for determining whether a relationship between the dependent and independent variables is statistically significant.
A simple regression model can be used for testing null and alternative hypotheses when you want to test the relationship between two variables. In a simple regression model, the t-test is used. A null hypothesis and research hypothesis need to be developed to clarify what your assumptions are. The null hypothesis states there is no relationship between the variables, while the research hypothesis assumes there is a relationship, whether positive or negative. The critical t is compared to the calculated t to determine whether the relationship between the two variables is statistically significant. If the calculated t falls outside the critical t, the null hypothesis is rejected, and there is evidence to support the assumption that there is a relationship between the two variables.
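The rule above can be sketched with synthetic data: compute the calculated t for the slope, look up the critical t for a two-tailed test at the .05 level, and compare. All data here are made up for illustration.

```python
# Sketch: testing H0 (slope = 0) against the research hypothesis (slope != 0).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 1.0 + 0.8 * x + rng.normal(0, 1.0, 30)   # a real relationship exists

slope, intercept, r, p_value, se = stats.linregress(x, y)
t_calc = slope / se                           # calculated t for the slope
t_crit = stats.t.ppf(0.975, df=len(x) - 2)    # critical t, two-tailed, alpha = .05

print(abs(t_calc) > t_crit)   # calculated t falls outside critical t: reject H0
print(p_value < 0.05)
```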
Define and contrast variable and concept. What purpose do they serve?
A variable is a measured quantity or characteristic that can take on a variety of values. A variable assigns numerical scores or category labels to each case in the sample on a given characteristic. A concept is the basic building block of research or theory that summarizes a critical characteristic or aspect of a class of events. The difference between the two terms is that concepts capture overarching themes, while variables are used to operationalize and test those concepts. Both are part of statistical analysis.
What values can the goodness-of-fit measures r and R assume? How are these interpreted? What is the interpretation of r^2 and R^2? What undesirable property does R^2 have, and how is this problem overcome?
The goodness-of-fit measures r and R give you the degree to which two variables are related. These values can land anywhere between -1 and +1. The closer the value is to +1 or -1, the more strongly the variables are related. The coefficient of determination (r^2 or R^2) gives you the proportion of the total variation in the dependent variable that is explained by the regression. R^2 will always increase to some degree when new variables are added to the equation, even when they have no effect on the dependent variable. The adjusted R^2 is a measure of how well the independent, or predictor, variables predict the dependent, or outcome, variable; it adjusts R^2 for the sample size and the number of variables in the regression model, so a higher adjusted R^2 indicates a better model.
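The adjustment can be shown with the standard formula, 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the sample size and k the number of independent variables. A small sketch with made-up numbers:

```python
# Sketch: adjusted R^2 penalizes extra predictors that add no fit.
def adjusted_r2(r2, n, k):
    """n = sample size, k = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 0.60
print(round(adjusted_r2(r2, n=50, k=2), 4))   # 0.583
print(round(adjusted_r2(r2, n=50, k=10), 4))  # smaller: more variables, same R^2
```

With the same raw R^2, the model with ten predictors is penalized more heavily than the model with two.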
Discuss the difference between the predicted and actual values of the dependent variable. At the minimum, you should explain how you might calculate predicted values of a regression model and what signifies the difference between the predicted and actual values.
Predicted values are values that are achieved by plugging in a number for the independent variable in a regression equation and taking the output for the dependent variable. This is a "best guess" number for the practical analyst, depicted as "y(hat)". The actual value is the number that would be obtained if that same value of the independent variable were observed in real life with its corresponding output for the dependent variable. This is depicted as "y". For example, a regression line could predict that Santa will stay in a house for 14 minutes when there are 2 kids in the house, when in reality he actually stayed 15 minutes. In this case, the predicted value is 14 and the actual value is 15. For every regression line, there will always be some error between the predicted and actual values. The point of the regression is to give us a tentative guideline for making predictions and assumptions.
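The Santa example above can be written out as a sketch. The intercept and slope here (8 and 3) are hypothetical numbers chosen only so the fitted line reproduces the prediction of 14 minutes for 2 kids:

```python
# Sketch: predicted vs. actual values and the residual between them.
intercept, slope = 8.0, 3.0   # made-up fitted coefficients

kids = 2
y_hat = intercept + slope * kids   # predicted minutes in the house
y_actual = 15                      # observed minutes
residual = y_actual - y_hat        # error = actual - predicted

print(y_hat, residual)             # 14.0 1.0
```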
What is standard deviation and what purpose does it serve?
Standard deviation is a measure of the variability or dispersion of the data. It is the adjusted average of the squared differences between each value and the mean of the data set. Standard deviation is a comparative measure, meaning that we can compare two standard deviations to see which data set is more stable/consistent than the other. Larger standard deviations indicate a wide range of variation, while smaller standard deviations indicate consistently similar data. Standard deviation tells us the typical distance of values from the mean; it is important because it expresses how spread out the data set is from the average or expected value.
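The "adjusted average of squared differences" can be computed step by step (the sample version divides by N - 1, which is the adjustment). The data here are made up:

```python
# Sketch: sample standard deviation from first principles.
import math

data = [4, 8, 6, 5, 7]
mean = sum(data) / len(data)               # 6.0
sq_dev = [(x - mean) ** 2 for x in data]   # squared differences from the mean
variance = sum(sq_dev) / (len(data) - 1)   # "adjusted average": divide by N - 1
sd = math.sqrt(variance)

print(round(sd, 3))                        # 1.581
```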
What is statistical significance?
Statistical significance assesses how probable the observed result would be if the null hypothesis were true. It focuses on compatibility between the null hypothesis and the data provided. If the test statistic is far from zero, it indicates the data are not compatible with the null hypothesis. This means there is evidence to suggest that the research hypothesis in this particular situation may be correct. For tests of significance, the researcher has to choose a confidence level appropriate to the question being asked. For example, a question regarding whether more cops present on the highway leads to lower average speeds of motorists may be fine with a 95% confidence level. However, in child abuse cases or juvenile delinquency, a 99% confidence level is more appropriate to ensure the decisions made are based on solid data.
Explain what is meant by the terms "dependent" and "independent" samples. What purpose do they serve? Illustrate your answers with examples.
The difference between dependent and independent samples really comes into play when discussing t-tests. A dependent t-test looks at the same sample twice to see if something has changed. For example, the results of a quiz taken before a training and the results of the same quiz after the training look at the same sample twice: Did the training increase knowledge? An independent t-test looks for a non-random difference between two separate samples to see if it is possible they came from the same population. An example of this would be measuring the speed of 50 vehicles before cops are stationed on the highway and the speed of a different 50 vehicles afterward: Did the police presence lower speeds?
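Both examples can be sketched with scipy; all scores and speeds below are made up for illustration.

```python
# Sketch: paired (dependent) vs. independent two-sample t-tests.
from scipy import stats

# Dependent samples: the same ten trainees, before and after a training.
before = [60, 55, 70, 62, 58, 65, 59, 61, 63, 57]
after  = [68, 60, 75, 70, 62, 72, 64, 66, 70, 63]
t_paired, p_paired = stats.ttest_rel(before, after)

# Independent samples: speeds of two separate groups of vehicles.
no_cops   = [72, 75, 69, 74, 71, 73, 70, 76]
with_cops = [66, 68, 64, 67, 65, 69, 63, 66]
t_ind, p_ind = stats.ttest_ind(no_cops, with_cops)

print(p_paired < 0.05, p_ind < 0.05)   # True True: both differences significant
```

Note that `ttest_rel` requires the two lists to pair up observation by observation, while `ttest_ind` treats them as unrelated groups.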
What is standard error and what purpose does it serve?
The standard error of the mean is an estimate of the amount of error in a sample estimate of a population mean. However, in multivariate regression, one must also consider the standard error of the parameter estimate and the standard error of the slope. The standard error of the parameter estimate is an estimate of the error (equivalent to one standard deviation) in an estimate of Y derived from a regression equation; it is a measure of goodness of fit, always expressed as plus or minus around the estimate. A regression coefficient is the ratio of change in Y (dependent variable) to the change in X (independent variable). Regression equations always have some error; the standard error quantifies that error.
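For the standard error of the mean specifically, the usual formula is the sample standard deviation divided by the square root of the sample size. A sketch with made-up numbers showing that larger samples shrink the error:

```python
# Sketch: standard error of the mean, SE = sd / sqrt(n).
import math

def standard_error(sd, n):
    """SE of the mean for sample standard deviation sd and sample size n."""
    return sd / math.sqrt(n)

print(round(standard_error(10, 25), 2))   # 2.0
print(round(standard_error(10, 100), 2))  # 1.0: larger sample, smaller error
```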
What do the values of standardized coefficient of the partial slope signify? When and why are they used when interpreting results of a multiple regression model?
The standardized coefficient of the partial slope is used to put the variables in the analysis on a common scale. This is achieved by converting each variable's scores into standard deviation units from its mean. For example, rain and temperature are measured in two different units. If we want to know which of the two variables has a stronger effect on the number of forest fires that occur, we can use the standardized coefficients of the partial slopes to compare them. These values give the average standard deviation change in Y (forest fires) associated with a one-standard-deviation change in X (rain or temperature), when the other independent variables are held constant. These standardized coefficients are also called BETA WEIGHTS.
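The rain/temperature example can be sketched numerically: standardize every variable into z-scores, then the slopes of the standardized regression are the beta weights. The data below are synthetic, and the units (mm of rain, degrees of temperature) are hypothetical.

```python
# Sketch: beta weights as slopes of a regression on standardized variables.
import numpy as np

rng = np.random.default_rng(3)
n = 500
rain = rng.normal(50, 10, n)                       # hypothetical mm of rain
temp = rng.normal(30, 5, n)                        # hypothetical temperature
fires = 2.0 * temp - 1.0 * rain + rng.normal(0, 5, n)

def z(a):
    """Convert scores to standard deviation units from the mean."""
    return (a - a.mean()) / a.std(ddof=1)

X = np.column_stack([np.ones(n), z(rain), z(temp)])
betas = np.linalg.lstsq(X, z(fires), rcond=None)[0]
print(np.round(betas[1:], 2))   # beta weights for rain (negative) and temp (positive)
```

Here rain's raw slope (-1 per mm) and temperature's (+2 per degree) are not comparable, but the beta weights are: both variables move fires by about the same number of standard deviations, in opposite directions.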