Chapter 3: Linear Regression
Interpreting the intercept
y = a + bx. Here a is the intercept: the value we might expect for sales if the population were zero.
Measures similar to RMSE
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
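Both measures can be computed directly from the errors. A minimal sketch, using hypothetical observed values and predictions (the numbers are made up for illustration):

```python
# Hypothetical observed values and model predictions
y_true = [3.0, 5.0, 7.5, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

n = len(y_true)
errors = [yt - yp for yt, yp in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / n   # Mean Squared Error
mae = sum(abs(e) for e in errors) / n   # Mean Absolute Error
```

Note that MSE penalises large errors more heavily (because of the squaring), while MAE treats all errors proportionally.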
Interpreting the RMSE
A small RMSE means that the distance between the line and the points is small, which indicates a good fit between the line and the data. The actual RMSE value gives an idea of the scale of the errors to expect when using the model.
Interpreting the Coefficient of Determination
Higher values of R^2 indicate a better fit, i.e. how well the linear model describes the relationship between x and y. The advantage of R^2 is that it does not depend on the scale of the data: the lowest value is always 0% and the highest is always 100%. The RMSE is still useful for finding the mean error of your predictions, and hence how good or bad your predictions may be.
In and Out of Sample Data
In-sample data is the data we use to train the model. Out-of-sample data is the data on which the model will be run later. Sometimes a model will perform well (low RMSE and high R^2) on the in-sample data but badly on the out-of-sample data: the model does not generalise well, or the data on which the model is used is not similar to the data on which it was trained.
Predictions and Responses
In statistics the variable we are trying to model is usually represented with the letter y and can have several names, including:
- dependent variable
- response variable
- regressand
The variable we use to predict is typically represented with the letter x and referred to as:
- predictor variable
- independent variable
- explanatory variable
- feature
- regressor
The predictions of a model are typically referred to as f(x), y′, or sometimes ŷ (y hat).
Interpolation
Making predictions for x values inside the range of the existing data. This tends to be more reliable.
Extrapolation
Making predictions outside the range of existing data points. This tends to be less reliable.
Root Mean Square Error (RMSE)
Measures how well the line of best fit fits the data. It is the square root of the mean of the squared differences between the data points, y, and the values of the line at those points, i.e. the root mean square of the residuals (errors).
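The definition above translates directly into code. A minimal sketch with hypothetical observed and predicted values:

```python
import math

# Hypothetical observed values and model predictions
y_true = [3.0, 5.0, 7.5, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

# Root of the mean of the squared residuals
residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
```

The RMSE is in the same units as y, which is why it is useful for judging the scale of the expected errors.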
Predicting with Regression
Once we have an equation for the regression line, we can use it to make predictions
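Making a prediction is just plugging an x value into the fitted equation. A minimal sketch, assuming hypothetical coefficients a and b have already been estimated:

```python
# Hypothetical fitted coefficients: intercept a and slope b
a, b = 12.0, 3.5

def predict(x):
    # The regression line: y-hat = a + b * x
    return a + b * x

prediction = predict(10)  # 12.0 + 3.5 * 10 = 47.0
```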
Least Squares Coefficients in Excel
Place x and y in two columns. Enter the formula =LINEST, putting the dependent (y) variable first. Use training data only; never use testing data to build your model. Press Control, Shift and Enter: a rectangle of cells will be filled in, giving the b and a values of the line of best fit.
Coefficient of Determination
RMSE depends on the scale of the data. Another measure of how well the line fits the data is the Coefficient of Determination, also known as R^2. This measure indicates the proportion of the variance of y that is predictable using the model, i.e. how much of the change in y the model explains when x changes. It lies between 0% (we cannot tell how y will change) and 100% (we can tell precisely).
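One common way to compute R^2 is as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with hypothetical data:

```python
# Hypothetical observed values and model predictions
y_true = [3.0, 5.0, 7.5, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

y_mean = sum(y_true) / len(y_true)

# Residual sum of squares: variance left unexplained by the model
ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
# Total sum of squares: variance of y around its mean
ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)

r_squared = 1 - ss_res / ss_tot
```

Because both sums are in squared y units, the ratio (and therefore R^2) is scale-free, which is exactly the advantage noted above.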
Regression
Says exactly how large a change in one variable to expect from a particular change in another
Residual
The vertical distance from each point to the line: the difference between the observed y value (e.g. sales) and the y value through which the line passes.
Calculating the residual
The distance of the point from the line at x_i is (y_i − f(x_i)). x bar and y bar are the averages of x and y.
Training and Testing Error
We only use some of the available data for training: we divide the data we have into training and testing data. We use the training data to create the model, which gives us low training error and high accuracy. To estimate how the model will do with unseen data, we measure its test error on the test data, which was held back from training. We can't always do this: sometimes data sets are very small, and breaking them down further into training and test data would leave very few samples to build a model.
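The split described above can be sketched as follows; the 80/20 proportion and the synthetic data are illustrative assumptions, not part of the notes:

```python
import random

# Hypothetical dataset of (x, y) pairs with some noise
data = [(x, 2 * x + 1 + random.uniform(-0.5, 0.5)) for x in range(20)]

# Shuffle so the split is not ordered by x
random.shuffle(data)

split = int(0.8 * len(data))        # 80% for training, 20% for testing
train, test = data[:split], data[split:]

# Fit the model on `train` only; measure test error on `test` only.
```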
Least Squares Method
We want to minimise the sum of the squares of the residuals when picking the line of best fit. This sum is called the residual sum of squares: RSS = Σ (y_i − (a + b·x_i))², where n is the number of data points, x_i and y_i are the values of the i-th observation, and a and b are the intercept and slope of the line.
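For simple linear regression, the minimising a and b have a closed form in terms of x bar and y bar. A minimal sketch with hypothetical data:

```python
# Hypothetical training data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope that minimises the residual sum of squares
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
# Intercept follows from the fact that the line passes through (x_bar, y_bar)
a = y_bar - b * x_bar
```

This is the same calculation Excel's LINEST performs when fitting a straight line.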
The Coefficient of Determination can also be calculated as
where r^2(y, f(x)) is the square of Pearson's correlation coefficient between the observed values y_i and the values predicted by the linear regression (y = a + bx).
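For a simple linear regression fitted by least squares, squaring Pearson's r between the observed and predicted values gives the same number as the 1 − SS_res/SS_tot definition. A minimal sketch, reusing a hypothetical least-squares fit:

```python
import math

# Hypothetical data, fitted by the least-squares formulas
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar
preds = [a + b * x for x in xs]

def pearson(u, v):
    # Pearson's correlation coefficient between two sequences
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    return cov / math.sqrt(sum((ui - mu) ** 2 for ui in u) *
                           sum((vi - mv) ** 2 for vi in v))

r_sq = pearson(ys, preds) ** 2  # R^2 via squared correlation

# The alternative definition, for comparison
ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r_sq_alt = 1 - ss_res / ss_tot
```

Note this equivalence holds for a linear fit with an intercept; it does not hold for arbitrary predictions.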
RMSE and R^2 Calculation in Excel
Work through examples in notes
Model Visualisation in Excel
You can visualise the line of best fit. Start with a scatter plot of the data, right-click a data point and choose to add a trendline.
Interpreting the coefficient
y = a + bx. Here b tells us how fast sales might change in response to population, etc. If x increased by one unit, y would increase by b (this is what the slope means).
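The unit-change interpretation can be checked numerically. A minimal sketch, using hypothetical coefficients:

```python
# Hypothetical fitted coefficients: intercept a and slope b
a, b = 12.0, 3.5

y1 = a + b * 5   # prediction at x = 5
y2 = a + b * 6   # prediction after increasing x by one unit
diff = y2 - y1   # the difference equals the slope b
```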