Exam 3 (Ch. 14 & Multiple Regression Analysis)


β0

-(unknown, original) Intercept (Y-intercept) Parameter
-The true intercept
-We use b0 to estimate β0

Standard Error of the Regression (sY|X)

-(sample) Standard Deviation of Y given X (sY|X)
-Tells us how far off our predictions are in general (after we fit a line to our data & use it to make predictions)
-Interpretation: Predictions of 'Y variable' tend to be off by '# from R's Residual Standard Error', in general

β1

-(unknown, original) Slope Parameter
-Determines whether the linear relationship between x and E(y) (= the expected value of y) is Positive (β1 > 0) or Negative (β1 < 0); β1 = 0 indicates no linear relationship
-The true slope
-β1 = 0: x is NOT useful for predicting y
-β1 ≠ 0: x is useful
-We use b1 to estimate β1

Negative Association*

-A Negative sign (r < 0) indicates that Higher values of one variable are associated w/ Lower values of the second variable
-When Above Average (Mean) values of one variable occur w/ Below Average (Mean) values of the other variable, there's a Negative association
-Scatterplot moves downward from Left to Right

Positive Association*

-A Positive sign (r > 0) indicates that BOTH variables increase together
-i.e. As one increases, the other one tends to also increase
-When Above Average (Mean) values of one variable occur w/ Above Average (Mean) values of the other variable (and likewise Below Average values occur together), there's a Positive association
-Scatterplot moves upward from Left to Right
-Ex: r > 0 is a Positive association

Method of Least Squares/Ordinary Least Squares (OLS)

-A common approach to fitting a Line to the scatterplot
-A regression technique for fitting a straight Line whereby the Error (Residual) Sum of Squares is minimized
-Is used to estimate β0 & β1
-Chooses the Line where the Error Sum of Squares (SSE) is minimized
-Produces the straight line that's "closest" to the data (when using SSE as the distance measure) (see the R sketch below)
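A minimal R sketch of an OLS fit, assuming a hypothetical data frame dat with columns y and x (names are illustrative):

    m <- lm(y ~ x, data = dat)   # lm() fits the line by ordinary least squares
    coef(m)                      # returns b0 (intercept) and b1 (slope)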

Influential Point

-A point is Influential if leaving it out yields a substantially different regression model
-Not all influential points have a high residual
-If a point is influential, you should check to make sure it is not an error

Correlation Coefficient (r)*

-A statistic that measures the strength & direction of a linear association between 2 QuaNtitative variables, X and Y
-A measure that quantifies the linear relationship between 2 QuaNtitative variables (from txtbk)
-Quantifies the apparent linear relationship between the 2 variables (x & y) & tells if the relationship is Positive or Negative (txtbk)
-Values must fall between -1 & 1 (-1 ≤ r ≤ 1)
-r = 0 indicates: mean Y does NOT change with X (NO linear relationship; absence of correlation)
-r = 1 OR r = -1 indicates: a perfect straight line (perfect positive/negative relationship; no variation from the fit line)
-r > 0 indicates: a Positive association
-r < 0 indicates: a Negative association (see the R sketch below)
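A quick R check, assuming the same hypothetical data frame dat with numeric columns x and y:

    r <- cor(dat$x, dat$y)   # Pearson correlation coefficient, always in [-1, 1]
    r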

Regression Analysis

-A statistical method for analyzing the relationship between variables
-Allows us to Predict/Describe changes in a variable on the basis of other variables
-With Regression Analysis, we assume that one variable (the Response Variable, y) is influenced by other variables (Explanatory Variables, x1, x2, ...)
-Estimates the conditional mean/expectation of the Response variable (y) given the Explanatory variable(s) (i.e. the average value of the Response variable when the Explanatory variable(s) are fixed)
-Two types: simple regression & multiple regression

To check for normality

-Construct a Normal Probability Plot (plots the sample quantiles of the Residuals vs. the quantiles we'd Expect if the Residuals were normal)
-Y axis: Sample Quantiles
-X axis: Theoretical Quantiles (see the R sketch below)
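A minimal R sketch, assuming a fitted lm() object m:

    qqnorm(resid(m))   # sample quantiles of residuals vs. theoretical normal quantiles
    qqline(resid(m))   # reference line; points close to the line suggest normal residuals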

To check for constant variance

-Construct a plot of the residuals (y - ŷ) versus the fitted values (ŷ)
-resid() on Y axis
-fitted() on X axis
-Look for an even, patternless spread around zero (see the R sketch below)
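A minimal R sketch, again assuming a fitted lm() object m:

    plot(fitted(m), resid(m),
         xlab = "Fitted values", ylab = "Residuals")   # residuals vs. fitted plot
    abline(h = 0)   # constant spread around this line suggests constant variance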

ANOVA (Analysis of Variance) (Review term)*

-Compares Differences in the Mean of a QuaNtitative Response across different levels of a Categorical variable

Simple Linear Regression (SLR)*

-Explores how the Mean of a QuaNtitative Response changes w/ differing values of a QuaNtitative Explanatory variable
-ONE Y variable (the Response variable/the dependent variable) & ONE X variable (the Predictor variable/the independent variable)
-ONE Explanatory Variable (x1) is used to explain the Variability in the Response Variable (y) (txtbk)
-Estimates a linear relationship between only TWO variables (txtbk)
-Goals of a SLR Model: 1) Determine if there's a linear relationship between X & Y, and if so, estimate the linear model (estimate the Y intercept & slope) 2) Use the SLR Model to predict the value of Y based on the value of X

Overall Test/Global Test

-For MLR
-To see if our overall model is useful
-Tests H0: β1 = β2 = ... = βk = 0 vs. Ha: at least one βj ≠ 0 (rejecting H0 means the model is useful)
-Performed after we've done the scatter plot & the regression output in R (<- lm()), using the F-statistic & its p-value (see the R sketch below)
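A minimal R sketch, assuming a fitted MLR object m:

    summary(m)              # the F-statistic & p-value on the last line are the overall test
    summary(m)$fstatistic   # the F-statistic with its numerator & denominator df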

Error Sum of Squares (SSE)

-In ANOVA: A measure of the degree of variability that exists even if all population means are the same
-In Regression Analysis: Measures the unexplained variation in the Response variable (y)
-The sum of the squared differences between the Observed values (y) & their Predicted values (ŷ): SSE = Σ(yᵢ - ŷᵢ)²
-The sum of the squared vertical distances from the Regression line
-A distance measure (see the R sketch below)
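In R, SSE can be computed directly from a fitted lm() object m (assumed here):

    sse <- sum(resid(m)^2)   # sum of squared residuals: Σ(y - ŷ)²
    sse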

Response Variable (y)

-In Regression Analysis: The variable that is influenced by the Explanatory Variable(s) (x1)
-We use info on the Explanatory Variables to Predict/Describe changes in the Response variable (txtbk)
-Also called: the Dependent Variable, the PredictED Variable (the variable being predicted; the prediction itself is ŷ), the Explained Variable, or the Regressand

Explanatory Variable(s) (x1,...)

-In Regression Analysis: The variables that influence the Response Variable (y)
-Explain the variation in the Response variable (y) (txtbk)
-We use info on the Explanatory Variables to Predict/Describe changes in the Response variable (txtbk)
-Also called: the Independent Variables, the PredictOR Variables, the Control Variables, or the Regressors
-Denoted by x1, x2, ..., BUT in SLR we often drop the subscript 1 for ease, and just refer to it as x

b0 (in SLR)*

-Intercept (Y-intercept) estimate (coefficient) (in SLR)
-A regression coefficient
-The point where the line crosses the Y-axis (X = 0)
-Represents the Predicted value ŷ when x has a value of zero (txtbk)
-Fitted value
-We use b0 to estimate β0
-For Ŷ = b0 + b1X (the simple linear model relating Y and X)

Multiple Linear Regression (MLR)

-MORE than ONE Explanatory Variable is used to explain the Variability in the Response Variable (y)
-More than one Explanatory Variable (x1, x2, ...) is presumed to have a linear relationship w/ the Response Variable (y)
-Allows us to examine how the Response variable (y) is influenced by 2 or More Explanatory variables (x1, x2, ...)
-Goals of a MLR Model: 1) Determine if there's a linear relationship between one or more of the X's & Y, and if so, estimate the linear model (estimate the Y intercept & slopes for the X variables) 2) Use the MLR Model to predict the value of Y based on the values of the X's (see the R sketch below)
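A minimal R sketch, assuming a hypothetical data frame dat with response y and two illustrative predictors x1 and x2:

    m <- lm(y ~ x1 + x2, data = dat)   # MLR: one slope per explanatory variable
    coef(m)                            # b0, b1, b2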

Coefficient of Determination (r^2)

-Measures the % of the Variability in Y that is explained using X in the regression
-The proportion of the Sample Variation (s^2) in the Response variable (y) that is explained/accounted for by the sample regression equation (ŷ = b0 + b1x) (txtbk)
-Determines the Proportion of the Variation in Y that can be explained by the X variables in the model
-Its value never decreases as you add more X (Explanatory) variables to your model
-The closer r^2 is to 1, the better the fit
-Sometimes denoted as R^2
-Interpretation: #% of the variability in 'y variable' is explained using 'x variable' in this model
-In R: = "Multiple R-squared" (see the R sketch below)
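In R, r^2 can be read from the summary of a fitted lm() object m (assumed here):

    summary(m)$r.squared   # the "Multiple R-squared" printed by summary(m)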

Extrapolation

-Occurs when one uses the model to Predict a y value corresponding to an x value which is NOT within the range of the Observed x's
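A minimal R sketch of guarding against extrapolation, assuming a fitted SLR object m built from dat$x, and an illustrative new value x_new:

    range(dat$x)   # the observed range of x values
    # predictions are trustworthy only for x values inside this range;
    # predicting at an x_new outside it is extrapolation
    predict(m, newdata = data.frame(x = x_new))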

Leverage Points

-Points with x-values that are far from the Mean of the x's (where most data points are concentrated on scatterplot)

Outliers

-Points with y-values that are far from the regression model -Can be points with large residuals

Adjusted R^2

-R^2adj will NOT necessarily Increase if you add another X (Explanatory/Predictor) variable to the model
-If you add an X variable that does little or nothing to further explain the variation in Y, then R^2adj will Decrease
-R^2adj will ALWAYS be Less than R^2
-If it is NOTICEABLY smaller, this indicates that there's an X variable in the model that shouldn't be there & should be removed
-Larger gap = weak predictors ARE in the model
-If there is very little difference between R^2adj & R^2, this indicates all variables currently in the model should be kept
-Smaller gap = little evidence that weak predictors are in the model
-F test is equivalent to testing R^2 = 0*
-The Higher R^2adj, the better the model (see the R sketch below)
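In R, both quantities come from the model summary (fitted lm() object m assumed); comparing them shows the gap described above:

    summary(m)$r.squared       # Multiple R-squared
    summary(m)$adj.r.squared   # Adjusted R-squared; noticeably smaller flags weak predictors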

bj (in MLR)

-Slope estimate (in MLR)
-For each Explanatory variable xj (j = 1, ..., k), the corresponding slope coefficient, bj, is the estimate of βj
-Interpretation: Measures the change in the Predicted value of the Response variable (ŷ), given a unit increase in the associated Explanatory variable (xj), HOLDING ALL OTHER (EXPLANATORY) VARIABLES CONSTANT (i.e. It represents the partial influence of xj on ŷ)
-j represents any # (j = 1, 2, ..., k)

b1 (in SLR)*

-Slope estimate (in SLR)
-A regression coefficient
-The change in Y over the change in X (rise over run)
-Represents the change in ŷ when x increases by one unit (txtbk)
-Fitted value
-We use b1 to estimate β1
-For Ŷ = b0 + b1X (the simple linear model relating Y and X)

Standard Error of the Model (se)

-Tells how tight (or loose) the scatter is around your regression line/plane
-Standard Error of the Residual (txtbk)
-Called the Standard Deviation of the Estimate in the txtbk
-The net effect, shown by se, helps us determine if the added Explanatory variables improve the fit of the model (txtbk)
-In R: It's called "Residual Standard Error"
-The closer se is to 0, the better the model fits the sample data
-Less scatter = Smaller se --> implies the model is a good fit for the sample data (see the R sketch below)
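In R, se is the "Residual Standard Error" printed by summary() and can be extracted from a fitted lm() object m (assumed here):

    summary(m)$sigma   # the residual standard error, se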

k

-The # of X variables in the model
-The # of Explanatory variables in the Linear Regression Model (txtbk)
-Used for MLR
-For a SLR Model: k = 1 always

Residual (e)

-The difference between the Observed (actual) value & the Predicted value of the Response variable (y)
-The Vertical distance between any data point on the scatterplot (y) & the corresponding point on the (Prediction) line (ŷ): the Residual is e = y - ŷ
-It is the error of Prediction
-Accounts for the variability in y that can't be explained by the linear relationship between x & y (small business article)
-The deviation of each observed value y from its conditional mean (µ_y|x)
-The Mean of the residuals (µe) is ALWAYS Zero
-The Line that best fits the data has the smallest possible sum of the n squared prediction errors (one for each data point); we find it with OLS (If every point falls on the line, then each residual equals 0, meaning no dispersion between the observed & predicted values)
-Residual = Error (= "Vertical Deviation"): e = y - ŷ (see the R sketch below)
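A minimal R sketch, assuming a fitted lm() object m built from dat$y:

    e <- resid(m)               # residuals e = y - ŷ
    head(dat$y - fitted(m))     # the same values, computed by hand
    mean(e)                     # essentially zero for an OLS fit with an intercept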

Conditional Mean

-The expected value of a random variable, given that a certain set of 'conditions' (ex: Y|X) is known to occur
-Another name for the Conditional Expected Value

Ill-Conditioned Data*

-Unusually large or small data values
-Can lead to loss of regression accuracy (from rounding)
-Awkward estimates with scientific notation
-Awkward interpretations
-To avoid ill-conditioned data, try to put all data on the same scale (or similar scales), OR a scale that is not unwieldy

the Line

-The line that minimizes the sum of squared vertical deviations (residuals)
-In SLR: It approximates the relationship between the response variable & the explanatory variable
-Can show a Positive linear relationship, a Negative linear relationship, or No relationship between the 2 variables (If the line is flat, not sloped, there's no relationship)

