Exam 3 (Ch. 14 & Multiple Regression Analysis)
β0
-(unknown, original) Intercept (Y-intercept) Parameter -The true intercept -We use b0 to estimate β0
Standard Error of the Regression (sY|X)
-(sample) Standard Deviation of Y given X (sY|X) -Tells us how far off our predictions are, in general (after we fit a line to our data & use it to make predictions) -Interpretation: Predictions of 'Y variable' tend to be off by 'the # reported as R's Residual Standard Error', in general
β1
-(unknown, original) Slope Parameter -Determines whether the linear relationship between x and E(y) (=the expected value of y) is Positive (β1 > 0) or Negative (β1 < 0); β1 = 0 indicates no linear relationship -The true slope -β1=0 : x is NOT useful -β1≠0 : x is useful -We use b1 to estimate β1
Negative Association*
-A Negative sign (r < 0) indicates that Higher values of one variable are associated w/ Lower values of the second variable -When Above Average (Mean) values of one variable occur w/ Below Average (Mean) values of the other variable, then there's a Negative association -Scatterplot moves downward from Left to Right
Positive Association*
-A Positive sign (r > 0) indicates that BOTH variables increase together -i.e. As one increases, the other one tends to also increase -When Above Average (Mean) values of one variable occur w/ Above Average (Mean) values of the other (& Below Average w/ Below Average), then there's a Positive association -Scatterplot moves upward from Left to Right -Ex: r>0 is a Positive association
Method of Least Squares/Ordinary Least Squares (OLS)
-A common approach to fitting a Line to the scatterplot -A regression technique for fitting a straight Line whereby the Error (Residual) Sum of Squares is minimized -Is used to estimate β0 & β1 -Chooses the Line where the Error Sum of Squares (SSE) is minimized -Produces the straight line that's "closest" to the data (when using SSE as the distance measure)
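A minimal R sketch of OLS in action (the data below are simulated, so the numbers & variable names are made up): lm() picks the intercept & slope that minimize SSE.
    set.seed(1)
    x <- runif(30, 0, 10)        # made-up explanatory variable
    y <- 2 + 3 * x + rnorm(30)   # made-up response: a known line plus noise
    fit <- lm(y ~ x)             # lm() performs Ordinary Least Squares
    coef(fit)                    # b0 & b1: the line that minimizes SSE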
Influential Point
-A point is Influential if leaving it out yields a substantially different regression model -Not all influential points have a high residual -If a point is influential, you should check to make sure there is not an error
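A minimal R sketch (simulated data) of one common way to check for influential points; Cook's distance is one diagnostic among several, not the only option:
    set.seed(1)
    x <- c(runif(29, 0, 10), 25)              # last point sits far from the rest
    y <- c(2 + 3 * x[1:29] + rnorm(29), 20)   # ...and far from the line
    fit <- lm(y ~ x)
    cooks.distance(fit)           # a large value for point 30 = influential
    fit2 <- lm(y[-30] ~ x[-30])   # refit leaving that point out
    coef(fit); coef(fit2)         # substantially different model = influential point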
Correlation Coefficient (r)*
-A statistic that measures strength & direction of a linear association between 2 (QuaNtitative) variables, X and Y -A measure that quantifies the linear relationship between 2 QuaNtitative variables (from txtbk) -Establishes a linear relationship between 2 variables (txtbk) -Implies an apparent relationship between the 2 variables (x & y) -Tells if relationship is Positive or Negative -Values must fall between -1 & 1 (-1 ≤ r ≤ 1) - r=0 indicates: mean Y does NOT change with X (NO linear relationship; absence of correlation) - r=1 OR r=-1 indicates: perfect straight line (perfect positive/negative relationship; no variation from fit line) - r > 0 indicates: a Positive association - r < 0 indicates: a Negative association
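A minimal R sketch (simulated data) of computing r with cor():
    set.seed(2)
    x <- rnorm(50)
    y <- 0.8 * x + rnorm(50, sd = 0.5)   # simulated positive association
    cor(x, y)                            # always falls in [-1, 1]; here r > 0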
Regression Analysis
-A statistical method for analyzing the relationship between variables -Allows us to Predict/Describe changes in a variable on the basis of other variables -With Regression Analysis, we assume that one variable (the Response Variable, y) is influenced by other variables (Explanatory Variables, x1, x2, ...) -Estimates the conditional mean/expectation of the Response variable (y) given the Explanatory variable (x/x1) (i.e. the average value of the Response variable (y) when the Explanatory variable(s) (x/x1) are fixed) -Two types: simple regression & multiple regression
To check for normality
-Construct a Normal Probability Plot (plots the standard scores of the Residuals vs. their Expected scores if the Residuals were normal) -Y axis: Sample Quantiles -X axis: Theoretical Quantiles
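A minimal R sketch (simulated data & made-up model) of the normal probability plot:
    set.seed(3)
    x <- runif(40, 0, 10); y <- 1 + 2 * x + rnorm(40)   # simulated data
    fit <- lm(y ~ x)
    qqnorm(resid(fit))   # Sample Quantiles (Y axis) vs. Theoretical Quantiles (X axis)
    qqline(resid(fit))   # points falling near this line suggest normal residuals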
To check for constant variance
-Construct a plot of the Residuals (y - ŷ) versus the Fitted values (ŷ) -resid() on Y axis -fitted() on X axis
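A minimal R sketch (simulated data & made-up model) of the residuals-vs-fitted plot:
    set.seed(4)
    x <- runif(40, 0, 10); y <- 1 + 2 * x + rnorm(40)   # simulated data
    fit <- lm(y ~ x)
    plot(fitted(fit), resid(fit))   # fitted() on X axis, resid() on Y axis
    abline(h = 0)                   # want a random band of roughly constant width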
ANOVA (Analysis of Variance) (Review term)*
-Differences in Mean of a QuaNtitative Response for different levels of a Categorical variable
Simple Linear Regression (SLR)*
-Explore how the Mean for a QuaNtitative Response changes w/ differing values of a QuaNtitative Explanatory variable -ONE Y variable (the Response variable/the dependent variable) & ONE X variable (the Predictor variable/the independent variable) -ONE Explanatory Variable (x1) is used to explain the Variability in the Response Variable (y) (txtbk) -Estimating a linear relationship between only TWO variables (txtbk) -Goals of a SLR Model: 1) Determine if there's a linear relationship between X & Y, and if so, estimate the linear model (estimate the Y intercept & slope) 2) Use the SLR Model to predict the value of Y based on the value of X
Overall Test/Global Test
-For MLR -Tests whether our overall model is useful -H0: β1 = β2 = ... = βk = 0 (none of the X variables are useful) vs. Ha: at least one βj ≠ 0 (at least one X variable is useful) -Done after we look at the scatterplots & get the regression output in R (fit <- lm(), then summary()); the F-statistic & its p-value at the bottom of the output give the test
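A minimal R sketch (simulated data; x2 is deliberately useless here) of reading the Overall Test off the R output:
    set.seed(5)
    x1 <- runif(40); x2 <- runif(40)
    y <- 1 + 2 * x1 + rnorm(40)   # only x1 actually matters
    fit <- lm(y ~ x1 + x2)
    summary(fit)                  # F-statistic & p-value on the last line = Overall Test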
Error Sum of Squares (SSE)
-In ANOVA: A measure of the degree of variability that exists even if all population means are the same -In Regression Analysis: It measures the unexplained variation in the Response variable (y) -The sum of the squared differences between the Observed values (y) & their Predicted values (ŷ) -The sum of the squared distances from the Regression Equation -A distance measure
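A minimal R sketch (simulated data) showing SSE computed from the residuals:
    set.seed(6)
    x <- runif(30, 0, 10); y <- 2 + 3 * x + rnorm(30)
    fit <- lm(y ~ x)
    sum(resid(fit)^2)   # SSE: sum of squared (y - ŷ) differences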
Response Variable (y)
-In Regression Analysis: -The variable that is influenced by the Explanatory Variable(s) (x1) -We use info on the Explanatory Variables to Predict/Describe changes in the Response variable (txtbk) -Also called: the Dependent Variable, the PredictED Variable, the Explained Variable, or the Regressand (note: the PredictED Variable is y itself; ŷ is the PredictED VALUE the model produces for it)
Explanatory Variable(s) (x1,...)
-In Regression Analysis: -The variables that influence the Response Variable (y) -Explains the variation in the Response variable (y) (txtbk) -We use info on the Explanatory Variables to Predict/Describe changes in the Response variable (txtbk) -Also called: the Independent Variables, the PredictOR Variables, the Control Variables, or the Regressors -Denoted by x1--BUT in SLR we often drop the subscript 1 for ease, and just refer to it as x
b0 (in SLR)*
-Intercept (Y-intercept) estimate (coefficient) (in SLR) -A regression coefficient -The point where the line crosses the Y-axis (X=0) -Represents the Predicted value ŷ when x has a value of zero (txtbk) -Fitted value -We use b0 to estimate β0 -For Ŷ = b0 + b1X (the simple linear model relating Y and X)
Multiple Linear Regression (MLR)
-MORE than ONE Explanatory Variable (x1, x2, ...) is used to explain the Variability in the Response Variable (y) -More than one Explanatory Variable is presumed to have a linear relationship w/ the Response Variable (y) -Allows us to examine how the Response variable (y) is influenced by 2 or More Explanatory variables (x1, x2, ...) -Goals of a MLR Model: 1) Determine if there's a linear relationship between one or more of the X's & Y, and if so, estimate the linear model (estimate the Y intercept & slopes for the X variables) 2) Use the MLR Model to predict the value of Y based on the values of the X's
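A minimal R sketch of an MLR fit (simulated data; the variable names x1, x2 are made up):
    set.seed(7)
    x1 <- runif(50); x2 <- runif(50)
    y <- 1 + 2 * x1 - 3 * x2 + rnorm(50)   # simulated response
    fit <- lm(y ~ x1 + x2)                 # more than one Explanatory variable
    coef(fit)                              # b0 plus one slope (bj) per X variable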
Coefficient of Determination (r^2)
-Measures the % of the Variability in Y that is explained using X in the regression -The proportion of the Sample Variation (s^2) in the Response variable (y) that is explained/accounted for by the sample regression equation (ŷ=b0+b1x) (txtbk) -To determine the Proportion of the Variation in Y that can be explained by the X variables in the model -Its value never decreases as you add more X variables (Explanatory) to your model -The closer r^2 is to 1, the better the fit -Sometimes denoted as R^2 -Interpretation: #% of the variability in 'y variable' is explained using 'x variable' in this model -In R: = "Multiple R-squared"
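A minimal R sketch (simulated data) of pulling r^2 out of the R output:
    set.seed(8)
    x <- runif(30); y <- 2 + 3 * x + rnorm(30, sd = 0.5)
    fit <- lm(y ~ x)
    summary(fit)$r.squared   # the "Multiple R-squared" value from summary(fit)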
Extrapolation
-Occurs when one uses the model to Predict a y value corresponding to an x value which is NOT within the range of the Observed x's
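A minimal R sketch (simulated data) of why extrapolation is risky: R will happily compute a prediction outside the observed x range, but the model has seen no data there:
    set.seed(9)
    x <- runif(30, 0, 10); y <- 2 + 3 * x + rnorm(30)
    fit <- lm(y ~ x)
    range(x)                                     # observed x's: roughly 0 to 10
    predict(fit, newdata = data.frame(x = 50))   # x = 50 is extrapolation: computed, but untrustworthy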
Leverage Points
-Points with x-values that are far from the Mean of the x's (where most data points are concentrated on scatterplot)
Outliers
-Points with y-values that are far from the regression model -Can be points with large residuals
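A minimal R sketch (simulated data) of diagnostics for the last two cards; the cutoffs mentioned are common rules of thumb, not hard rules:
    set.seed(14)
    x <- c(runif(29, 0, 10), 30)   # one point with an extreme x value
    y <- 2 + 3 * x + rnorm(30)
    fit <- lm(y ~ x)
    hatvalues(fit)    # leverage: large values = x far from the Mean of the x's
    rstandard(fit)    # standardized residuals: |value| > 2 (or 3) suggests an outlier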
Adjusted R^2
-R^2adj will NOT necessarily Increase if you add another X variable (Explanatory/Predictor) to the model -If you add an X variable (Explanatory/Predictor) that does little or nothing to further explain the variation in Y, then R^2adj will Decrease -R^2adj will ALWAYS be Less than R^2 -If it is NOTICEABLY smaller, this indicates that there's an X variable in the model that shouldn't be there--should remove it -^^ Larger gap = weak predictors ARE in model -If there is very little difference between R^2adj & R^2, this indicates all variables currently in the model should be kept -^^ Smaller gap = little evidence that weak predictors are in model -F test is equivalent to testing R^2=0* -The Higher R^2adj is, the better the model
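A minimal R sketch (simulated data; 'noise' is a deliberately worthless predictor) of R^2 vs. R^2adj when a weak X variable is added:
    set.seed(10)
    x1 <- runif(40); noise <- runif(40)   # noise has nothing to do with y
    y <- 1 + 2 * x1 + rnorm(40)
    fit1 <- lm(y ~ x1)
    fit2 <- lm(y ~ x1 + noise)
    summary(fit1)$r.squared;     summary(fit2)$r.squared       # R^2 never decreases
    summary(fit1)$adj.r.squared; summary(fit2)$adj.r.squared   # R^2adj typically drops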
bj (in MLR)
-Slope estimate (in MLR) -For each Explanatory variable, xj(j=1,...,k), the corresponding slope coefficient, bj, is the estimate of βj -Interpretation: Measures the change in the Predicted value of the Response variable (ŷ), given a unit increase in the associated Explanatory variable (xj), HOLDING ALL OTHER (EXPLANATORY) VARIABLES CONSTANT (i.e. It represents the partial influence of xj on ŷ) -j represents any # (j=1,2,...,k)
b1 (in SLR)*
-Slope estimate (in SLR) -A regression coefficient -The change in Y over the change in X (rise over run) -Represents the change in ŷ when x increases by one unit (txtbk) -Fitted value -We use b1 to estimate β1 -For Ŷ = b0 + b1X (the simple linear model relating Y and X)
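A minimal R sketch (simulated data) of reading b0 & b1 off a fitted SLR model:
    set.seed(11)
    x <- runif(30, 0, 10); y <- 2 + 3 * x + rnorm(30)
    fit <- lm(y ~ x)
    coef(fit)   # first value is b0 (intercept), second is b1 (slope)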
Standard Error of the Model (se)
-Tells how tight (or loose) the scatter is around your regression plane -Standard Error of the Residual (txtbk) -Called the Standard Deviation of the Estimate in the txtbk -The net effect, shown by se, helps us determine if the added Explanatory variables improve the fit of the model (txtbk) -In R: It's called "Residual Standard Error" -The closer se is to 0, the better the model fits the sample data -Less scatter = Smaller se --> implies the model is a good fit for the sample data
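A minimal R sketch (simulated data) of pulling se out of a fit; sigma() returns the value summary() prints as the "Residual standard error":
    set.seed(12)
    x <- runif(30, 0, 10); y <- 2 + 3 * x + rnorm(30)
    fit <- lm(y ~ x)
    sigma(fit)   # the "Residual Standard Error" from summary(fit)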
k
-The # of X variables in the model -The # of Explanatory variables in the Linear Regression Model (txtbk) -Used for MLR -For a SLR Model: k=1 always
Residual (e)
-The difference between the Observed (actual) value & the Predicted value of the Response variable (y) -The Vertical distance between any data point on the scatterplot (y) & the corresponding point on the (Prediction) line (ŷ) represents the Residual, e = y - ŷ -It is the error of Prediction -Accounts for the variability in y that can't be explained by the linear relationship between x & y (small business article) -Estimates the deviation of the observed value y from its conditional mean (µy = E(y|x)) -The Mean of the residuals (µe) is ALWAYS Zero -The line that best fits the data has the smallest possible sum of squared prediction errors (e), one for each of the n data points (use OLS) (If each point is on the line, then each residual equals 0, which means no dispersion between the observed & predicted values) -Residual = Error (="Vertical Deviation") e = y - ŷ
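A minimal R sketch (simulated data) confirming e = y - ŷ and that OLS residuals average to zero:
    set.seed(13)
    x <- runif(30, 0, 10); y <- 2 + 3 * x + rnorm(30)
    fit <- lm(y ~ x)
    e <- y - fitted(fit)                                 # residual = observed - predicted
    all.equal(e, resid(fit), check.attributes = FALSE)   # same values resid() returns
    mean(resid(fit))                                     # essentially 0 for OLS with an intercept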
Conditional Mean
-The expected value of a random variable, given that a certain set of 'conditions' (ex: Y|X) is known to occur -Another name for the Conditional Expected Value
Ill-Conditioned Data*
-Unusually large or small data -Can lead to loss of regression accuracy (from rounding) -Awkward estimates with scientific notation -Awkward interpretations -You should try to put all data on the same scale (or similar scales) OR a scale that is not unwieldy, to try to avoid ill-conditioned data
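A minimal R sketch (simulated, deliberately unwieldy data) of rescaling to avoid ill-conditioned data:
    set.seed(15)
    x <- runif(30, 1e8, 2e8)        # unusually large x values
    y <- 5 + 3e-8 * x + rnorm(30)
    coef(lm(y ~ x))                 # slope prints in awkward scientific notation
    x_scaled <- x / 1e8             # put x on a friendlier scale
    coef(lm(y ~ x_scaled))          # same fit, readable coefficients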
the Line
-the line that minimizes the sum of squared vertical deviations (residuals) -In SLR: It approximates the relationship between the response variable & the explanatory variable -Can show a Positive linear relationship, a Negative linear relationship, or No relationship between the 2 variables (If the line is flat, not sloped, there's no relationship)