Chapter 14 - Simple Linear Regression (Sections 1-9)


Test For Significance

- A regression measures the relationship between two variables. The sample correlation coefficient r (0.95 in this example) is an estimate of the true population correlation coefficient ρ between the two variables. If ρ = 0, there is no linear relationship between the two variables. If ρ ≠ 0, the variables have a linear relationship.
- We run a hypothesis test with r as our sample estimate of ρ. The null hypothesis is ρ = 0 (no relationship), because most pairs of variables in the world have no relationship (for example, the number of pages in our textbook and the number of minutes it takes me to drive to work). If r is far enough from 0, we have enough evidence to reject the null hypothesis and state that ρ ≠ 0, so the two variables have a significant relationship.
- H0: ρ = 0, Ha: ρ ≠ 0
- Use the test statistic (Excel calculates it).
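One common form of the test statistic for H0: ρ = 0 is t = r·sqrt(n − 2) / sqrt(1 − r²), with n − 2 degrees of freedom. The sketch below shows the arithmetic only. The sample size n = 10, the function name `correlation_t_statistic`, and the use of a table lookup for the critical value are assumptions for illustration; the card does not say how Excel reports the test.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need at least 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# r = 0.95 comes from the card; n = 10 is an assumed sample size.
t_stat = correlation_t_statistic(0.95, 10)

# Two-tailed 5% critical value for 8 degrees of freedom, read from a t table.
t_crit = 2.306
reject_h0 = abs(t_stat) > t_crit
print(round(t_stat, 2), reject_h0)
```

With these assumed inputs the statistic is far beyond the critical value, so the null hypothesis of no relationship would be rejected.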

scatterplot

- A scatterplot is a graphical tool for investigating the relationship between the dependent variable and the independent variable.
- We plot each observation as an ordered pair (x, y), with the independent variable on the horizontal axis and the dependent variable on the vertical axis.

Detecting Outliers

- By standardizing the residuals, we convert them to z-scores (each residual expressed in standard deviations).
- Remember, z-scores usually fall between -2 and 2.
- If an observation has a standardized residual greater than 2 or less than -2, it is unusual and considered an outlier.
- If an outlier is detected, always check that the data was entered correctly.
- Removing or correcting an outlier can dramatically alter the goodness of fit of the model.
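The rule above can be sketched in a few lines. This version assumes the textbook-style definition, standardized residual = e_i / (s·sqrt(1 − h_i)), where s is the standard error of the estimate and h_i is the leverage of observation i; spreadsheet tools may scale residuals slightly differently. The data and the function name `standardized_residuals` are made up for illustration.

```python
import math

def standardized_residuals(x, y):
    """Standardized residuals e_i / (s * sqrt(1 - h_i)) for a simple linear fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Standard error of the estimate: sqrt(SSE / (n - 2)).
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    result = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of this observation
        result.append(e / (s * math.sqrt(1 - h)))
    return result

# Hypothetical data with one point (the fourth) well off the trend.
x_data = [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]
y_data = [45, 55, 50, 75, 40, 45, 30, 35, 25, 15]

std_res = standardized_residuals(x_data, y_data)
# Observation numbers (1-based) whose standardized residual is beyond +/- 2.
outliers = [i + 1 for i, z in enumerate(std_res) if abs(z) > 2]
print(outliers)
```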

Estimated Regression Equation

- If the values of the population parameters β0 and β1 were known, we could use the regression equation to find the mean value of y for a given value of x.
- In practice, the parameter values are not known and must be estimated from sample data. We use b0 and b1 to estimate β0 and β1, respectively.

Residual Plot

- If the variance is constant, the residual plot should show the residuals spread fairly symmetrically over the x values. Look for a horizontal band, centered at the x-axis, which displays no pattern.
- Our Excel regression can produce a residual plot.
- Think about drawing a rectangle around the dots. Do the dots fill the rectangle well?
- If we observe any pattern that is not random, a model assumption has been violated. A spread that widens or narrows across x indicates non-constant variance, and a curved pattern suggests the relationship is not linear.

Coefficient of Determination

- The coefficient of determination, r², tells us the percentage of the variation in y that is explained by the regression model.
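As a rough sketch, r² can be computed as SSR / SST, where SST is the total variation in y around its mean, SSE is the variation left in the residuals, and SSR = SST − SSE is the part the model explains. The data below is hypothetical, chosen so the result lands near the .9027 quoted later in these cards; the function name `coefficient_of_determination` is an assumption.

```python
def coefficient_of_determination(x, y):
    """Return r^2 = SSR / SST for the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total variation
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    ssr = sst - sse                                                # explained
    return ssr / sst

# Hypothetical data: student population (thousands) and sales (thousands of dollars).
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

r_squared = coefficient_of_determination(population, sales)
print(round(r_squared, 4))
```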

Assumptions for the Regression Model

- The error term ε is a random variable with a mean or expected value of zero.
- The variance of ε, denoted by σ², is the same for all values of x.
- The values of ε are independent.
- The error term ε is a normally distributed random variable.

simple linear regression

- The simplest type of regression analysis, involving one independent variable and one dependent variable.
- The relationship is approximated by a straight line.

Confidence Interval for the Mean Value of y

- A confidence interval lets us say, with a given level of confidence, that the mean value of the dependent variable (Sales) for all observations (every restaurant) with a given value of the independent variable (Student Population) lies between two numbers.
- The formula for the confidence interval is built from 3 parts: the confidence level (we'll use 95% for this problem), the standard deviation of the regression model, and how far the input value of the independent variable is from the rest of the data. You may recall from STAT I that it is generally not a good idea to use a value for x that is too far away from the rest of the x values.
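The three parts named above show up directly in the usual textbook formula, ŷ* ± t·s·sqrt(1/n + (x* − x̄)² / Σ(x − x̄)²). The sketch below is a minimal illustration on hypothetical data, so its interval will not match figures quoted elsewhere in these cards. The function name `mean_confidence_interval` is an assumption, and the t value is read from a table rather than computed.

```python
import math

def mean_confidence_interval(x, y, x_star, t_crit):
    """Confidence interval for the mean of y at x_star (simple linear regression)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # standard error of the estimate
    y_hat = b0 + b1 * x_star
    # The margin grows as x_star moves away from the mean of the x values.
    margin = t_crit * s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)
    return y_hat - margin, y_hat + margin

# Hypothetical data: student population (thousands) and sales (thousands of dollars).
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

# 2.306 is the two-tailed 95% t value for n - 2 = 8 degrees of freedom (t table).
lower, upper = mean_confidence_interval(population, sales, x_star=10, t_crit=2.306)
print(round(lower, 2), round(upper, 2))
```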

Least Squares Method

- We want to find the best fit line for the data. To do this, we minimize the sum of the squared differences between the actual data values and the predicted values. Another way to think of this is that we mathematically make the line go as close as possible to each data point.
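As a rough sketch of the idea, the least squares slope and intercept can be computed directly from the data using the standard closed-form formulas. The dataset is hypothetical, chosen so the fitted line comes out to Sales = 60 + 5 * Student Population, and the function name `least_squares_fit` is an assumption rather than anything from the course materials.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line always passes through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data: student population (thousands) and sales (thousands of dollars).
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = least_squares_fit(population, sales)
print(b0, b1)
```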

outlier

- A data point that does not fit the trend shown by the remaining data.
- In STAT2: any point whose residual is 2 or more standard deviations away from the mean (the mean of the residuals will always be 0).
- Outliers represent observations that are suspect and require further investigation:
  * Incorrect data: correct it.
  * Violates model assumptions: consider a different model.
  * Unusual by chance: keep it.

influential observations

- An observation that, when removed from consideration, dramatically alters the model, whereas removing any other single data point would not.
- It may be an outlier (y value very different from the trend), it may correspond to an x value very far from the mean, or it may be caused by a combination of the two.
- It warrants further investigation. It may be incorrect data that can be fixed. If it is valid, it must be kept and considered in explaining the relationship between the variables.
- How to identify:
  1. Data points with extreme values of the independent variable are called high leverage points.
  2. Leverage is a measure of how far an observation's value of the independent variable is from the mean.
  3. An observation is considered to have high leverage if its leverage value is greater than 6/n (a rule-of-thumb value).
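For simple linear regression, leverage is commonly computed as h_i = 1/n + (x_i − x̄)² / Σ(x − x̄)², which depends only on the x values. The sketch below applies the 6/n rule of thumb to a hypothetical set of seven x values in which the last one sits far from the rest. The function name `leverages` and the data are assumptions for illustration.

```python
def leverages(x):
    """Leverage h_i = 1/n + (x_i - x_bar)^2 / sum((x - x_bar)^2) for each x_i."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical x values; the seventh is far from the others.
x_values = [10, 10, 15, 20, 20, 25, 70]

h = leverages(x_values)
cutoff = 6 / len(x_values)  # rule-of-thumb threshold from the card
# Observation numbers (1-based) whose leverage exceeds the cutoff.
high_leverage = [i + 1 for i, hi in enumerate(h) if hi > cutoff]
print(round(h[6], 2), round(cutoff, 2), high_leverage)
```

Because leverage ignores y entirely, a high leverage point is only potentially influential; whether it actually changes the fitted line still has to be checked.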

Regression analysis

- Can be used to develop an equation showing how the variables are related.
- Managerial decisions are often based on the relationship between 2 or more variables. Examples: advertising expenditures and sales; daily high temperature and electricity usage.

residual

- The difference between an actual observed y value and the value a regression model would predict it to be.
- These residuals are minimized by the Least Squares Method; however, they still exist, and studying them can tell us more about our model.
- A residual plot against x will allow us to validate the assumption that the variance of ε is constant for all values of x. We will use a plot of the residuals, which will help eliminate potential problems whose investigation is beyond the scope of the course.
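A minimal sketch of the definition: each residual is the observed y minus the predicted y. The line Sales = 60 + 5 * Student Population is the example model used in these cards; the data is hypothetical and was chosen so that this line is its least squares fit, which is why the residuals sum to zero.

```python
def residuals(x, y, b0, b1):
    """Observed y minus predicted y for each observation."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical data: student population (thousands) and sales (thousands of dollars).
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

res = residuals(population, sales, b0=60.0, b1=5.0)
sse = sum(e * e for e in res)  # the quantity least squares minimizes
print(res)
print(sum(res), sse)
```

Plotting `res` against `population` would give the residual plot described above.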

Simple Linear Regression Model

- y = β0 + β1x + ε
- β0 and β1 are referred to as the parameters of the model and ε is the error term. The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y.

Simple Linear Regression Equation

- E(y) = β0 + β1x
- β0 is the y-intercept of the regression line, β1 is the slope, and E(y) is the mean or expected value of y for a given value of x.
- Basically, this has the same form as y = mx + b.

How to Report Regression Analysis Results

1. State the model.
   - Ex: Sales = 60 + 5 * Student Population
2. When reporting the model, we must also explain the interpretation of each parameter, b0 and b1.
   - Ex: b0 = 60, meaning we predict the sales for a store with a nearby student population of 0 to be 60. b1 = 5, meaning for every one unit increase in student population, we expect an increase of 5 units in sales.
3. Report the coefficient of determination and explain its meaning.
   - Ex: r² = .9027, meaning 90.27% of the variation in Sales is explained by Student Population.
4. Perform the hypothesis test for significance. Explain the results thoroughly, including the p-value, the conclusion of the hypothesis test, and the conclusion in the context of the situation.
   - Ex: With a p-value of .000, we reject the null hypothesis and conclude that there is a significant relationship between the 2 variables.
5. Produce a standardized residual plot and provide a sketch. Interpret the results of the plot.
   - Residuals form a pattern: not a linear relationship.
   - Residuals do not form a pattern: verifies the linear relationship.
   - Ex: The standardized residual plot shows that the assumption of constant variance is not violated, as we observe a horizontal band with no pattern, centered at the x-axis.
6. Identify outliers.
   - Ex: Observation 4 is an outlier because it has a standardized residual of -2.33, which is < -2. Further investigation of this observation is warranted.
7. Identify high leverage points.
   - Ex: Observation 7 is a high leverage point, as its leverage value of .94 exceeds the 6/n cutoff of .86.

Reporting Predictions

1. To report a prediction for an individual observation, report the predicted value, y-hat, and the prediction interval.
   - Ex: We predict the sales for a restaurant with a student population of 10,000 to be 109.96 thousand dollars and are 95% confident that the actual value is between 75.61 and 144.30 thousand dollars.
2. To report a prediction for the mean of all observations with a given value of the independent variable, report the predicted value, y-hat, and the confidence interval.
   - Ex: We predict the mean sales for all restaurants with a student population of 10,000 to be 109.96 thousand dollars and are 95% confident that the actual value is between 98.30 and 121.53 thousand dollars.

Influential observation vs outlier

In general, you can think of outliers as points where the y value is out of whack and influential points as points where the x value is out of whack.

independent variable

The variable or variables being used to predict the value of the dependent variable, denoted by x

Prediction Interval for an Individual Value of y

A prediction interval lets us say, with a given level of confidence, that the value of the dependent variable (sales) for one observation with a given value of the independent variable (student population) lies between two numbers.
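The usual textbook form of the prediction interval is ŷ* ± t·s·sqrt(1 + 1/n + (x* − x̄)² / Σ(x − x̄)²). The extra 1 under the square root accounts for the variability of a single observation around the mean, so this interval is always wider than the confidence interval for the mean at the same x. The sketch below runs on hypothetical data, so its numbers will not match figures quoted elsewhere in these cards; the function name `prediction_interval` is an assumption.

```python
import math

def prediction_interval(x, y, x_star, t_crit):
    """Prediction interval for one new y value at x_star (simple linear regression)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # standard error of the estimate
    y_hat = b0 + b1 * x_star
    # The leading 1 adds the variability of an individual observation.
    margin = t_crit * s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)
    return y_hat - margin, y_hat + margin

# Hypothetical data: student population (thousands) and sales (thousands of dollars).
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

# 2.306 is the two-tailed 95% t value for n - 2 = 8 degrees of freedom (t table).
pi_lower, pi_upper = prediction_interval(population, sales, x_star=10, t_crit=2.306)
print(round(pi_lower, 2), round(pi_upper, 2))
```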

dependent variable

variable being predicted, denoted by y

