stats
when we increase our sample size
our estimates become more precise: the confidence interval becomes narrower and is therefore less likely to contain zero.
Homoskedasticity
the pattern of the covariation is constant (the same) around the regression line, whether the values of the independent variable are small, medium, or large
ways to determine the relationship / statistical significance between 2 variables
- r-squared (closer to 1)
- hetero/homoskedasticity
- null hypothesis (if the p-value of the coefficient is less than .05): the p-value is the likelihood of observing a slope at least as large as the estimated one if the true slope is in fact zero. The smaller the p-value, the more confident we are that we can reject the null hypothesis. If the p-value for a slope coefficient is greater than 0.05, then we do not have enough evidence to conclude with 95% confidence that there is a significant linear relationship between the variables.
- the confidence interval for the coefficient does not contain zero
Things that weaken our confidence in the relationship: a p-value greater than .05 and having less data. Note that a low p-value does not imply a high r-squared, and vice versa (for example, a regression can have a low r² and a high p-value).
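A minimal sketch of checking these indicators in Python, assuming scipy is available and using made-up data; `linregress` reports the slope, r (square it for r²), and the p-value for the null hypothesis that the slope is zero:

```python
from scipy.stats import linregress

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9]

result = linregress(x, y)
print("slope:    ", result.slope)
print("r-squared:", result.rvalue ** 2)  # closer to 1 = stronger linear fit
print("p-value:  ", result.pvalue)       # < .05 lets us reject the null (slope = 0)
```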
Regression can help us:
1. forecast the future
2. understand the structure of the relationship between 2 variables by expressing it mathematically: if one variable changes, how much does the other variable change?
Predictions we forecast should stay within the historical range: if we are inputting various x's to predict an outcome, they should be within the range of the observed data. The regression equation helps us understand the relationship between 2 variables; you can also have multiple variables producing an outcome.
R-squared
A measure of the explanatory power of a regression analysis. R-squared measures how much of the variation in the dependent variable can be explained by the variation in the independent variable(s). It is calculated by forming the ratio of the regression sum of squares to the total sum of squares.
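A small sketch of that ratio on hypothetical data, computing the total, residual, and regression sums of squares directly from the definitions in these notes:

```python
# Observed values and the predictions from a fitted line (both hypothetical).
y_actual    = [3.0, 5.0, 7.5, 9.0, 11.5]
y_predicted = [3.2, 5.1, 7.0, 9.3, 11.4]

mean_y = sum(y_actual) / len(y_actual)
tss = sum((y - mean_y) ** 2 for y in y_actual)                    # total sum of squares
sse = sum((y - yh) ** 2 for y, yh in zip(y_actual, y_predicted))  # residual sum of squares
rss = tss - sse                                                   # regression sum of squares
print("R-squared:", rss / tss)
```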
adjusted R-squared
A measure of the explanatory power of a regression analysis. Used in multiple regression analysis, adjusted R-squared is R-squared multiplied by an adjustment factor. The adjustment factor compensates for the increase in R-squared gained only by increasing the number of independent variables, as opposed to the increase in R-squared gained by any real explanatory value of the added independent variables.
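The notes don't quote the adjustment factor itself; one standard form (assumed here) is adj R² = 1 - (1 - R²)(n - 1) / (n - k - 1), sketched below:

```python
def adjusted_r_squared(r_squared, n, k):
    """n = number of observations, k = number of independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# As k grows, the factor (n - 1) / (n - k - 1) grows, so R-squared gained
# only by adding variables is discounted.
print(adjusted_r_squared(0.80, n=30, k=3))
```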
correlation coefficient
A measure of the strength of a linear relationship between two variables, ranging over [-1, 1]. A correlation coefficient of -1 or 1 indicates a perfect negative or positive relationship, respectively; 0 indicates no correlation. In a simple regression it is the (signed) square root of r².
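A quick sketch on made-up data: numpy's `corrcoef` returns r, and squaring it recovers r² for a simple regression:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.9, 6.1, 8.0, 10.2])

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient, in [-1, 1]
print("r:", r, "r-squared:", r ** 2)
```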
Multicollinearity
A situation in which several independent variables are highly correlated with each other. This characteristic can result in difficulty in estimating separate or independent regression coefficients for the correlated variables.
lagged variable
A type of independent variable often used in a regression analysis. When data are collected as a time series, a regression analysis is often performed by analyzing values of the dependent variable alongside values of the independent variables from the same time period. However, if a lag between changes in an independent variable and the associated changes in the dependent variable is presumed to exist, the regression can be performed by analyzing values of the dependent variable alongside values of the independent variables from a previous time period.
multicollinearity
An interrelationship among multiple independent variables in a regression analysis that makes it difficult to identify the net relationships between the dependent variable and the independent variables. Multicollinearity can obscure the results of a regression analysis by increasing the uncertainty about the values of the regression equation's coefficients. A symptom of multicollinearity is a high adjusted R-squared accompanied by low significance of the regression coefficients.
multivariate regression
Multiple regression is an extension of simple regression that allows us to analyze the relationships between multiple independent variables and a dependent variable. Relationships among independent variables complicate multivariate regression. With more than two independent variables, graphing multivariable relationships is impossible, so we must proceed with caution and conduct additional analyses to identify patterns. The regression equation in our housing example has the form ŷ = a + b₁(house size) + b₂(distance): house size and distance each have their own coefficients, and they are summed together along with the constant coefficient a.
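A minimal sketch of fitting that housing-example form with numpy (all numbers are made up); `np.linalg.lstsq` picks the coefficients that minimize the sum of squared errors:

```python
import numpy as np

size     = np.array([1200.0, 1500.0, 1700.0, 2000.0, 2200.0])  # sq ft
distance = np.array([10.0, 8.0, 12.0, 5.0, 7.0])               # miles
price    = np.array([250.0, 310.0, 320.0, 420.0, 430.0])       # $ thousands

# Column of ones for the constant coefficient a, then one column per variable.
X = np.column_stack([np.ones_like(size), size, distance])
coeffs, *_ = np.linalg.lstsq(X, price, rcond=None)
a, b_size, b_distance = coeffs
print(f"price = {a:.1f} + {b_size:.3f}*size + {b_distance:.1f}*distance")
```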
regression equation
The _______ ___________ tells us how the dependent variable has typically changed with changes in the independent variable.
Regression pack example
The advertising coefficient tells us how sales have changed on average as advertising has increased: if advertising increases on average by x, the effect on sales (the dependent variable) is the regression coefficient × x. In our example, for every incremental $1 in advertising, sales increase on average by $50. Thus, for every incremental $10,000 in advertising, sales increase on average by $500,000.
residual
The difference between the predicted value of the dependent variable and the actual value of the dependent variable. Residuals are the error terms in a regression; they represent the variation in the dependent variable that is not "explained" by the regression equation.
regression sum of squares
The difference between the total sum of squares and the sum of squared errors for the regression equation. The regression sum of squares measures the variation in the dependent variable "explained" by the regression equation.
regression line
The line that best fits data displayed in a two-dimensional scatter diagram plotting a dependent variable against an independent variable. The regression equation defines the regression line: it is the line that minimizes the sum of squared errors (the line of best fit). A regression line is typically used to forecast the behavior of the dependent variable and/or to better understand the nature of the relationship between the dependent and the independent variable.
significance level
The probability of a Type I error. A benchmark against which the p-value is compared to determine whether the null hypothesis will be rejected. See also alpha.
coefficient of variation
The ratio of a data set's standard deviation to its mean. A standardized measure of the spread of a data set, the coefficient of variation is useful for comparing the magnitude of variability in data sets that have different means.
Heteroskedasticity
When the residuals appear to be getting larger for higher values of the independent variable, the phenomenon is known as heteroskedasticity. Since the variance of the residuals, which contributes to the variation of the dependent variable, is affected by the behavior of the independent variable, we can conclude that there must be more to the story than just the linear relationship. Heteroskedasticity in a residual plot means there is more to the story: likely additional independent variables.
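A sketch with synthetic data built to be heteroskedastic: the noise grows with x, so the residuals' spread is larger at high values of the independent variable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 100, 200)
y = 3 + 2 * x + rng.normal(0, 0.5 * x)   # noise whose spread grows with x
b, a = np.polyfit(x, y, 1)               # fitted slope and intercept
residuals = y - (a + b * x)
print("residual std, low x: ", residuals[:100].std())
print("residual std, high x:", residuals[100:].std())
```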
Total Sum of Squares
The sum of squared differences between the actual values of the dependent variable and the mean of the dependent variable.
sum of squared errors
The sum of the squared vertical differences between the actual data point and the predicted one on the regression line.
example of an r-squared validation statement
We can do better than explaining 39% of the variation in occupancy as we did with our earlier regression using advance bookings as the independent variable.
regression
We can use ____________ to predict the value of the dependent variable for a specified value of the independent variable.
how to interpret coefficients in a multivariate regression model
We should always be careful to interpret regression coefficients properly. A coefficient is "net" with respect to all variables included in the regression (holding them constant, as if they were the same), but "gross" with respect to all omitted variables, as in simple regression. An included variable may be picking up the effects on price of a variable that is not included in the model (school district, for example).
residual sum of squares vs. regression sum of squares
The residual sum of squares (sum of squared errors) measures the variation in the dependent variable not explained by the regression; the regression sum of squares measures the variation that is explained. Together they add up to the total sum of squares.
when regression line is good
each of the residual plots should reveal a random distribution of the residuals (plot the residuals against the values of the independent variable). The distribution should be normal, with mean zero and fixed variance. Compute descriptive statistics on the residuals to check whether the mean is zero and the variance is fixed.
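A sketch of those checks on made-up data: fit the line, compute the residuals, and inspect their mean and variance:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8, 12.2])

b, a = np.polyfit(x, y, 1)           # fitted slope and intercept
residuals = y - (a + b * x)

print("mean of residuals:    ", residuals.mean())  # should be ~0
print("variance of residuals:", residuals.var())   # should stay fixed across x
# A plot of residuals against x should show no systematic pattern.
```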
Error
Called the error in prediction, the residual error, or simply the error: the difference between the observed value and the line's prediction for the dependent variable. The error is y - ŷ, the difference between the actual and predicted values of the dependent variable. The y value of any data point is the y value of the regression line plus the error. The errors across all data points measure the fit of the line: we sum their squared values to gauge how well the line fits the data.
Sum of Squared Errors (SSE), or the Residual Sum of Squares,
gives us a good measure of how accurately a regression line describes a set of data.
Regression analysis
helps us find the mathematical relationship between two variables. We can use regression to describe a linear relationship: one that can be represented by a straight line and characterized by an equation of the form y = a + bx.
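A minimal sketch of fitting y = a + bx by least squares on hypothetical data, using the standard closed-form results (not quoted in these notes): b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b·x̄:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))   # slope
a = mean_y - b * mean_x                       # intercept
print(f"y = {a:.2f} + {b:.2f}x")
```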
residual plot is important because
it helps us see errors more clearly. If the only pattern in the dependent variable is accounted for by a linear relationship with the independent variable, then we should see no systematic pattern in the residual plot. The residuals should be spread randomly around the horizontal axis. In fact, the distribution of the residuals should be a normal distribution, with mean zero and a fixed variance. Residuals are called homoskedastic if their distributions have the same variance.
coefficient of the slope tells us
how variation in the independent variable affects the dependent variable
finding estimated dependent variable in a multivariate regression
knowing the coefficients in a multiple regression problem (e.g., house size and distance), you can get the estimated dependent variable by multiplying each data point's x value by that variable's coefficient, summing the products, and adding the intercept
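A sketch of that prediction step; the coefficients and inputs below are hypothetical stand-ins for the house size/distance example:

```python
intercept  = 200.0   # a, in $ thousands (hypothetical)
b_size     = 0.10    # house-size coefficient, $ thousands per sq ft (hypothetical)
b_distance = -5.0    # distance coefficient, $ thousands per mile (hypothetical)

house_size = 1800.0  # sq ft
distance   = 6.0     # miles

estimate = intercept + b_size * house_size + b_distance * distance
print("estimated price ($ thousands):", estimate)   # 200 + 180 - 30 = 350
```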
Plane
A regression equation with two independent variables is a ___________.
R squared
regression sum of squares / total sum of squares: the proportion of the variation in the dependent variable accounted for by the regression line. The regression sum of squares is (total - residual), and the total sum of squares sums the squared differences between each observed y value (actual data point) and the mean of the dependent variable.
the t-stat
tells us how many standard errors the coefficient is from the value zero. Thus, if the t-stat is greater than 2, we are quite sure (approximately 95% confident) that the true coefficient is not zero.
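A tiny sketch of that ratio with hypothetical numbers (the t-stat is the coefficient divided by its standard error):

```python
coefficient    = 50.0   # estimated slope (hypothetical)
standard_error = 12.0   # its standard error (hypothetical)

t_stat = coefficient / standard_error
print("t-stat:", t_stat)   # > 2, so ~95% confident the true coefficient isn't zero
```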
when we fail to reject the null
the data is not statistically significant
sum of squared errors
the less well the line fits the data, the higher it is
Testing Coefficients
The regression is an estimated story: the estimates of the coefficient and y-intercept are not certain. We use a as an estimate of alpha (the intercept of the true line) and b as an estimate of beta (the true slope of the line). Since we only have the sample, not the true population data, we don't see the whole relationship; as a result, the true slope could lie anywhere within the confidence interval for b. We are not certain that the line reflects a linear relationship unless we test it: if there is no zero within the confidence interval of the slope, we can say, at the interval's confidence level (e.g., 95%), that there is a linear relationship. Equivalently, we can state a null hypothesis that the slope is zero (no linear relationship) and test it.
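A sketch of the interval test described above; the slope, standard error, and critical value are all hypothetical, and in practice t* comes from the t distribution at the chosen confidence level:

```python
b      = 50.0   # estimated slope (hypothetical)
se_b   = 12.0   # standard error of the slope (hypothetical)
t_star = 2.0    # rough 95% critical value for a reasonably large sample

low, high = b - t_star * se_b, b + t_star * se_b
print(f"~95% CI for slope: ({low:.1f}, {high:.1f})")
print("zero outside the interval?", not (low <= 0.0 <= high))  # True -> reject the null
```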
total sum of squares
the sum of the squared deviations of each data point from the grand mean
Regression Sum of Squares (RSS)
total sum of squares - residual sum of squares
r-squared is improved by adding more _______________, but this is _________________ because r-squared will either increase or stay the same when adding a new variable
variables, cheating
R squared tells us
what proportion of the total variation in the dependent variable is explained by its linear relationship with the independent variable.
closely
when a model has one independent variable, r-squared is __________ related to the correlation coefficient
independent variable
The x variable, or explanatory variable. Since a change in the independent variable (here, advertising) is typically accompanied by a proportional change in the dependent variable (here, sales), regression analysis can identify and formalize that relationship.
Variance formula
∑(x - x̄)² / (n - 1) for a sample (divide by n instead of n - 1 for a population)
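A one-line check of the formula on made-up data:

```python
data = [4.0, 7.0, 9.0, 12.0]

mean = sum(data) / len(data)                                            # x̄ = 8.0
sample_variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
print(sample_variance)   # 34 / 3 ≈ 11.33
```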