Chapter 3: Linear Regression

Interpreting the intercept

In y = a + bx, a is the intercept: the value we might expect for y (e.g. sales) if x (e.g. population) were zero.

Measures similar to RMSE

- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)

Interpreting the RMSE

A small RMSE means that the distance between the line and the points is small, which means there is a good fit between the line and the data. The actual RMSE gives an idea of the scale of the expected errors when using the model.

Interpreting the Coefficient of Determination

Higher values for R^2 indicate a better fit, i.e. how well the linear model describes the relationship between x and y. The advantage of using R^2 is that it does not depend on the scale of the data: the lowest value is always 0% and the highest value is always 100%. The RMSE is still useful for finding the mean error of your predictions and how good or bad your predictions may be.

In and Out of Sample Data

In-sample data is the data which we use to train the model. Out-of-sample data is the data on which the model will be run later. Sometimes a model will perform well (low RMSE and high R^2) on the in-sample data but perform badly on the out-of-sample data: sometimes the model will not generalise well, and sometimes the data on which the model is used is not similar to the data on which it was trained.

Predictions and Responses

In statistics, the variable we are trying to model is usually represented with the letter y and can have several names, including:
- dependent variable
- response variable
- regressand
The variable we use to predict with is typically represented with the letter x and referred to as:
- predictor variable
- independent variable
- explanatory variable
- feature
- regressor
The predictions of a model are typically referred to as f(x), y' or sometimes y hat.

Interpolation

Making predictions for x values inside the range of existing data. This tends to be more reliable than extrapolation.

Extrapolation

Making predictions outside the range of existing data points

Root Mean Square Error (RMSE)

Measures how good the line of best fit is. It is the square root of the mean of all the squared differences between the data points, y, and the values of the line at those points, i.e. the root mean square of the residuals (errors).
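
A minimal Python sketch of the RMSE calculation; the x/y data and the fitted intercept a and slope b below are made up for illustration.

import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
a, b = 0.2, 1.95                          # assumed intercept and slope of an already-fitted line

predictions = [a + b * xi for xi in x]    # values of the line at each x
residuals = [yi - fi for yi, fi in zip(y, predictions)]

# RMSE: square root of the mean of the squared residuals
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
print(rmse)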

Predicting with Regression

Once we have an equation for the regression line, we can use it to make predictions.
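
As a hypothetical Python example, where the intercept a = 12.5 and slope b = 0.8 are assumed to come from an already-fitted model:

# Predicting with the fitted line y = a + b*x (the coefficients are assumptions)
a, b = 12.5, 0.8
new_x = 40                      # e.g. a population value we want a sales prediction for
predicted_y = a + b * new_x
print(predicted_y)              # 44.5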

Least Squares Coefficients in Excel

Place x and y in two columns. Enter the formula =LINEST, putting the dependent (y) variable first. Use training data only; never use testing data to build your model. Press Control, Shift and Enter: the rectangle of cells will be built, giving the b and a values of the line of best fit.
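
Outside Excel, a rough equivalent can be sketched in Python with numpy.polyfit (with degree 1 it returns the slope first, then the intercept); the training data below is made up.

import numpy as np

x_train = np.array([10, 20, 30, 40, 50])    # training data only, as in the notes
y_train = np.array([25, 41, 58, 73, 92])

b, a = np.polyfit(x_train, y_train, deg=1)  # slope b and intercept a of the line of best fit
print(a, b)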

Coefficient of Determination

RMSE depends on the scale of the data. Another measure of how well the line fits the data is the Coefficient of Determination, aka R^2. This is a measure that indicates the proportion of the variance of y that is predictable when using the model, i.e. how much of the change in y the model explains when x changes. It is between 0% (we cannot tell how y will change) and 100% (we can tell precisely).
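
A Python sketch of the calculation, computing R^2 as 1 minus the ratio of the residual sum of squares to the total sum of squares; the observed values and predictions are made up for illustration.

y = [2.1, 4.3, 5.9, 8.2, 9.8]               # observed values
f = [2.15, 4.10, 6.05, 8.00, 9.95]          # predictions from an assumed fitted line

y_bar = sum(y) / len(y)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))   # residual sum of squares
ss_tot = sum((yi - y_bar) ** 2 for yi in y)            # total sum of squares

r_squared = 1 - ss_res / ss_tot             # between 0 (no fit) and 1 (perfect fit)
print(r_squared)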

Regression

Says exactly how large a change in one variable to expect from a particular change in another

Residual

The vertical distance of the line to each point: the difference between the observed y value (e.g. sales) and the y value through which the line passes at that x.

Calculating the residual

The distance of the point from the line at xi is (yi - f(xi)), where f(xi) = a + b*xi. x bar and y bar denote the averages of x and y.
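
A tiny worked example in Python, with made-up values for a, b, xi and yi:

a, b, x_i, y_i = 1.5, 1.0, 5.0, 7.0   # assumed line and observation
line_value = a + b * x_i              # f(xi) = 6.5
residual = y_i - line_value           # 7.0 - 6.5 = 0.5
print(residual)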

Training and Testing Error

We only use some of the available data for training. We divide the data we have into training and testing data. We use the training data to create the model, which gives us low training error and high accuracy on that data. To estimate how the model will do with unseen data, we measure its test error on the testing data, which was not used to build the model. We can't always do this: sometimes data sets are very small, and further breaking them down into training and testing data would leave very few samples to build a model.
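
An illustrative Python sketch of such a split; the 80/20 fraction and the synthetic data are assumptions, not part of the notes.

import random

data = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(100)]   # synthetic (x, y) pairs
random.shuffle(data)

split = int(0.8 * len(data))          # e.g. 80% for training, 20% held back for testing
train, test = data[:split], data[split:]

# Fit the model on train only; estimate its error on test, which it has never seen.
print(len(train), len(test))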

Least Squares Method

We want to minimise the sum of the squares of the residuals when picking the line of best fit / equation of the line. This quantity is called the residual sum of squares: RSS = (y1 - (a + b*x1))^2 + ... + (yn - (a + b*xn))^2, where n is the number of data points, xi and yi are the values of the ith observation, and a and b are the intercept and slope of the line.
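
A short Python sketch using the standard closed-form least-squares solution for a single predictor (the data is made up for illustration):

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# b = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2), and a = y_bar - b * x_bar
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
print(a, b)   # intercept and slope that minimise the residual sum of squares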

The Coefficient of Determination can also be calculated as

r^2(y, f(x)), the square of Pearson's correlation coefficient between the observed values yi and the values predicted by the linear regression (y = a + bx).
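
A quick numeric check of this identity in Python (the statistics module's linear_regression and correlation functions need Python 3.10+; the data is made up):

import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]

fit = statistics.linear_regression(x, y)
f = [fit.intercept + fit.slope * xi for xi in x]   # predictions of the fitted line

y_bar = statistics.mean(y)
r_squared = 1 - sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / sum((yi - y_bar) ** 2 for yi in y)

print(r_squared, statistics.correlation(y, f) ** 2)   # the two values agree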

RMSE and R^2 Calculation in Excel

Work through examples in notes

Model Visualisation in Excel

You can visualise the line of best fit. Start with a scatter plot of the data, right-click a data point and choose to add a trendline.

Interpreting the coefficient

In y = a + bx, b is the slope: it tells us how fast sales might change in response to population, etc. If x increased by one unit, y would increase by b units (this is what the slope means).

