CHAPTER 4 - STATS
HOW DO YOU FIND THE LEAST SQUARES REGRESSION LINE USING EXCEL
1. Enter the explanatory variable in column A and the response variable in column B. 2. Select the Data menu and then Data Analysis. 3. Select the Regression option. 4. With the cursor in the Y-range cell, highlight the column that contains the response variable. With the cursor in the X-range cell, highlight the column that contains the explanatory variable. Select the Output Range and press OK. In the output, the coefficient labeled X Variable is your slope and the coefficient labeled Intercept is your y-intercept. To draw the scatter plot with the trendline: select your data > go to Insert > Scatter Plot > click on the scatter plot until the green plus sign labeled Chart Elements shows up > then click on Trendline.
when an influential observation occurs in a data set and its removal is not warranted, what can you do?
1. collect more data so that additional points near the influential observation are obtained. 2. use techniques that reduce the influence of the influential observation.
how do you determine whether a linear relation exists between two variables
1. determine the linear correlation coefficient r. 2. find the critical value in table 11 of appendix A, based on the number of observations. a linear relation exists if the absolute value of r is greater than the critical value: if r is positive and greater than the critical value, the variables are positively associated; if r is negative and less than the opposite of the critical value, the variables are negatively associated. if the absolute value of r is not greater than the critical value, then no linear relation exists.
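The decision rule above can be sketched in a few lines of Python. The function name `linear_relation` and the numbers passed to it are made up for illustration; the real critical value has to be looked up in the table for your number of observations.

```python
def linear_relation(r, critical_value):
    """Compare r with a critical value taken from the table."""
    if abs(r) > critical_value:
        # |r| exceeds the critical value, so a linear relation exists
        return "positive" if r > 0 else "negative"
    return "none"

# Placeholder inputs only: 0.7 is NOT a real table entry.
print(linear_relation(0.92, 0.7))
print(linear_relation(-0.85, 0.7))
print(linear_relation(0.35, 0.7))
```

Taking the absolute value first handles the positive and negative cases with a single comparison.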
Properties of the Linear Correlation Coefficient
1. the linear correlation coefficient is always between -1 and 1, inclusive. 2. if r = +1, then a perfect positive linear relation exists between the two variables. 3. if r = -1, then a perfect negative linear relation exists between the two variables. 4. the closer r is to +1, the stronger the evidence of positive association between the two variables. 5. the closer r is to -1, the stronger the evidence of negative association. 6. if r is close to 0, then there is little to no evidence of a linear relation; however, it does NOT imply there is no relation, only no linear relation. 7. the correlation coefficient is not resistant, so an observation that does not follow the overall pattern can strongly affect its value.
Why is it necessary to show a scatter diagram with the correlation coefficient when claiming that a linear relation exists between two variables?
all of the above: Influential observations can cause the correlation coefficient to increase substantially, thereby increasing the apparent strength of the linear relation between two variables. If these observations are excluded, the data may appear more scattered than they are. Influential observations can give insight into values of data outside the range of the rest of the sample. If these observations are excluded, the data may appear linear when they are only linear for a specific range. Influential observations can cause the correlation coefficient to decrease substantially, masking other trends in the data. If these observations are excluded, the true trend of the data may be more apparent.
explanatory variable
also called the predictor variable or independent variable; it helps explain the variability in the response variable.
response variable
also called the dependent variable; it is the variable whose value can be explained by the value of the explanatory or predictor variable.
why must the y-intercept be interpreted with care
the y-intercept is the predicted value of the response variable when the explanatory variable is 0, so it only makes sense to interpret it when 0 is a reasonable value for the explanatory variable and there are observations near 0. we should not use the regression model to make predictions outside the scope of the model, meaning we should not use it to make predictions for values of the explanatory variable that are much larger or much smaller than those observed. this is a dangerous practice because we cannot be certain of the behavior of data for which we have no observations.
how can you determine outliers in your data
by using a residual plot or by drawing a boxplot of the residuals.
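A boxplot flags outliers with fences placed 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a list of residuals; the residual values are made up, and `statistics.quantiles` may compute quartiles slightly differently from the method used in your textbook.

```python
from statistics import quantiles

def residual_outliers(residuals):
    """Return the residuals that fall outside the boxplot fences."""
    q1, _, q3 = quantiles(residuals, n=4)  # quartiles (library default method)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [e for e in residuals if e < lower_fence or e > upper_fence]

# Made-up residuals for illustration only
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2, 0.3, -0.4, 6.0]
print(residual_outliers(residuals))
```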
is the variance of the residuals constant
if the spread of the residuals increases or decreases as the explanatory variable increases, the error variance is not constant. if a model does not have constant error variance, statistical inference using the regression model is not reliable.
Simpson's Paradox
describes a situation in which an association between two variables inverts or goes away when a third variable is introduced to the analysis
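A small numeric sketch can make the reversal concrete. The counts below are invented purely for illustration (treatments "A" and "B", two groups, each entry a pair of successes and trials): A has the higher success rate inside each group, yet B has the higher rate once the groups are combined.

```python
# Hypothetical (successes, trials) counts, invented for illustration only.
counts = {
    "group 1": {"A": (8, 10), "B": (70, 100)},
    "group 2": {"A": (40, 100), "B": (3, 10)},
}

def rate(successes, trials):
    return successes / trials

def combined_rate(treatment):
    """Success rate for a treatment after pooling both groups."""
    successes = sum(counts[g][treatment][0] for g in counts)
    trials = sum(counts[g][treatment][1] for g in counts)
    return successes / trials

for group, by_treatment in counts.items():
    print(group, {t: round(rate(*st), 3) for t, st in by_treatment.items()})
print("combined", {t: round(combined_rate(t), 3) for t in ("A", "B")})
```

The reversal happens because the group (the third variable) is unevenly spread across the two treatments.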
What are deviations?
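to deviate means "to stray": a deviation is how far a value strays from a reference value such as the mean. in regression, the deviation of an observed response from the mean is split into a part explained by the regression line and a part due to other factors and variables that affect the response variable.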
correlation implies causation
false
how to find relative frequencies from a contingency table
1. find the row totals and column totals for each category. 2. find the relative frequencies: divide each row total and each column total by the grand total (the row totals and the column totals add up to the same grand total). relative frequencies must be used because frequencies are difficult to compare when there are different numbers of observations for the categories of a variable. if comparing categories within a certain column or row, use a conditional distribution and make sure to divide by the total of that column or row, not the grand total.
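As a rough sketch of these steps, the snippet below computes marginal and conditional relative frequencies. The table, its category names, and the counts are hypothetical; the row variable is treated as the explanatory variable.

```python
# Hypothetical counts: rows = explanatory variable, columns = response variable.
table = {
    "male": {"yes": 30, "no": 20},
    "female": {"yes": 10, "no": 40},
}

# Step 1: row totals, column totals, and the grand total
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
columns = list(next(iter(table.values())))
col_totals = {col: sum(table[row][col] for row in table) for col in columns}
grand_total = sum(row_totals.values())

# Step 2: marginal relative frequencies (each total divided by the grand total)
row_marginal = {row: total / grand_total for row, total in row_totals.items()}
col_marginal = {col: total / grand_total for col, total in col_totals.items()}

# Conditional distribution of the response for each row (divide by the row total)
conditional = {
    row: {col: count / row_totals[row] for col, count in cells.items()}
    for row, cells in table.items()
}

print(row_marginal, col_marginal)
print(conditional)
```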
HOW DO YOU COMPUTE THE CORRELATION BY HAND
1. find the sample mean of the explanatory variable (x) and the sample mean of the response variable (y), and then the sample standard deviation of the explanatory variable (sx) and of the response variable (sy). 2. for each observation (xi, yi), compute (xi - mean of x) / sx and (yi - mean of y) / sy. 3. multiply the two answers for each observation. 4. add up the products and divide the total by the number of observations minus 1 (for sample data). the correlation is represented by the letter r.
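The by-hand steps translate directly into code. This is a minimal sketch using only the standard library; the function name `correlation` and the data are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample linear correlation coefficient r, following the by-hand steps."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)   # step 1: sample means
    s_x, s_y = stdev(xs), stdev(ys)     # step 1: sample standard deviations
    # steps 2 and 3: standardize each value, then multiply the pair
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    # step 4: add the products and divide by n - 1
    return sum(products) / (n - 1)

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))
```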
difference between correlation and causation
if the data come from a designed experiment, then a claim of causation can be made. if the data come from an observational study, we cannot conclude that the two correlated variables have a causal relationship. it is possible that a lurking variable causes a correlation between two variables that do not really have a relationship.
describing slopes in algebra vs statistics
in algebra you express your results with certainty, while in statistics you are never 100% sure of your answer, so you have to use phrases like "on average" or "the expected change is..."
define leverage
a measure that depends on how much an observation's value of the explanatory variable differs from the mean value of the explanatory variable.
WHAT IS AN INFLUENTIAL OBSERVATION
it's an observation that significantly affects the least-squares regression line's slope and/or y-intercept, or the value of the correlation coefficient. to check, remove the point that you think is an influential observation, then recompute the correlation or regression line. if the correlation coefficient, slope, or y-intercept changes dramatically, then the removed point is influential. observations with high leverage and a large residual tend to be influential. influence is affected by the following 2 factors: 1. the relative vertical position of the observation (residual) 2. the relative horizontal position of the observation (leverage)
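The remove-and-recompute check can be sketched as follows. The helper names and the data are made up for illustration; the last point sits far to the right of the others (high leverage) and far from their pattern (large residual).

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def slope_without(xs, ys, index):
    """Slope recomputed after removing the observation at `index`."""
    return slope(xs[:index] + xs[index + 1:], ys[:index] + ys[index + 1:])

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 5, 4, 5, 0]

slope_all = slope(xs, ys)
slope_dropped = slope_without(xs, ys, 5)
print(round(slope_all, 3), round(slope_dropped, 3))
```

In this made-up data the slope even changes sign once the last point is removed, which marks that point as influential.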
nonlinear vs linear
linear: the points follow a straight-line pattern. nonlinear: the points follow a pattern that is not a straight line.
conditional distribution
lists the relative frequency of each category of the response variable, given a specific value of the explanatory variable in the contingency table.
marginal distribution
a frequency or relative frequency distribution of either the row or column variable in the contingency table. they are called marginal distributions because they appear in the right and bottom margins of the contingency table.
difference between positively associated and negatively associated scatter plots
positively associated: when the value of one variable increases, the value of the other also tends to increase, so the points slope upward from left to right. negatively associated: when the value of one variable increases, the value of the other tends to decrease, so the points slope downward from left to right.
how do you find the coefficient of determination
find the linear correlation coefficient in excel, then square the number you get. the result is then turned into a percentage, and that percentage is the variation in the response variable that is explained by the least-squares regression line; the rest is explained by other factors.
coefficient of determination
represented by R^2; measures the proportion of total variation in the response variable that is explained by the least-squares regression line. it is a number between 0 and 1. if R^2 = 0, the least-squares regression line has no explanatory value. if R^2 = 1, the least-squares regression line explains 100% of the variation in the response variable.
residual
represents how close our prediction comes to the actual observation: residual = observed y - predicted y. the smaller the residual (in absolute value), the better the prediction. residuals are what help determine the line that best describes the relation between two variables: the method of least squares picks the line that minimizes the sum of the squared residuals.
how do you find an equation that describes linearly related data
step 1) choose two points from the table of observations. step 2) label the points (x1, y1) and (x2, y2) and find the slope using the formula: m = (y2 - y1) / (x2 - x1). step 3) use the point-slope formula to find the equation of the line: y - y1 = m (x - x1). look at the following example: y - 258 = 2.8333 (x - 99), so y - 258 = 2.8333x - 280.4967 (notice how 2.8333 is multiplied by both x and -99), so y = 2.8333x - 22.4967. step 4) once you have the equation, just insert the x or y that is mentioned in the problem and solve for the other variable.
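The two-point method can be sketched in code as below. The point (99, 258) comes from the example above; the second point (105, 275) is an assumption chosen so the slope matches the example's 2.8333, since the original second point is not given.

```python
def line_through(p1, p2):
    """Slope and y-intercept of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)   # step 2: slope
    b = y1 - m * x1             # step 3: rearrange y - y1 = m (x - x1)
    return m, b

def predict(m, b, x):
    """Step 4: insert an x value into y = m x + b."""
    return m * x + b

# (105, 275) is a hypothetical second point, not taken from the notes.
m, b = line_through((99, 258), (105, 275))
print(round(m, 4), round(b, 4))
print(predict(m, b, 102))
```

With the unrounded slope the intercept comes out as -22.5; the -22.4967 in the worked example is the result of rounding the slope to 2.8333 first.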
least squares regression line
the equation for this method is: predicted y = b1 (x) + b0, where b1 = r multiplied by (sample standard deviation of the response variable / sample standard deviation of the explanatory variable) is the slope, and b0 = (mean of y) - b1 (mean of x) is the y-intercept. if r is positive then b1 will also be positive.
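As a minimal sketch, the slope and intercept can be computed from r and the two standard deviations, with the intercept found from the two sample means (b0 = mean of y - b1 times mean of x). The function name and the data are made up for illustration.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (b1, b0) for the line: predicted y = b1 * x + b0."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # linear correlation coefficient r
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x          # slope
    b0 = y_bar - b1 * x_bar     # y-intercept
    return b1, b0

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b1, b0 = least_squares_line(xs, ys)
print(round(b1, 4), round(b0, 4))
```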
should you only use scatter plots for your data
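no. to avoid reaching incorrect conclusions, we should always use a scatter plot alongside numerical summaries such as the correlation coefficient and other graphs such as a residual plot.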
TYPES OF DEVIATION
total deviation = observed value - mean of response variable. explained deviation = predicted value - mean of response variable. unexplained deviation = observed value - predicted value.
If r = 0, then no linear relationship exists between the variables.
true
how do you determine whether a linear model is appropriate
you need to draw a residual plot: it's just a scatter diagram with the residuals on the vertical axis and the explanatory variable on the horizontal axis. if the plot shows a pattern, like a curve, then the explanatory and response variables may not be linearly related. no pattern shown = linear model appropriate.
total deviation =
Unexplained Deviation + Explained Deviation
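The identity can be checked numerically for every observation, and the same pieces give the coefficient of determination as explained variation divided by total variation. This is a sketch on made-up data with a hypothetical helper `fit_line`.

```python
def fit_line(xs, ys):
    """Least-squares slope b1 and intercept b0."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return b1, y_bar - b1 * x_bar

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b1, b0 = fit_line(xs, ys)
y_bar = sum(ys) / len(ys)
predicted = [b1 * x + b0 for x in xs]

total = [y - y_bar for y in ys]                        # observed - mean
explained = [p - y_bar for p in predicted]             # predicted - mean
unexplained = [y - p for y, p in zip(ys, predicted)]   # observed - predicted

# proportion of total variation explained by the line
r_squared = sum(e ** 2 for e in explained) / sum(t ** 2 for t in total)
print(round(r_squared, 4))
```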
IN WHAT INSTANCE SHOULD YOU REMOVE AN OUTLIER
YOU SHOULD REMOVE THEM IF THEY ARE THE RESULT OF AN ERROR IN RECORDING , A MISCALCULATION , OR SOME OTHER OBVIOUS BLUNDER IN THE DATA COLLECTION PROCESS
scatter diagram
a graph that shows the degree and direction of the relationship between two variables. the explanatory variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis.
contingency table or two-way table
a method of presenting the relationship between two categorical variables in the form of a table; in other words, it relates two categories of data. the rows hold the categories of the row variable and the columns hold the categories (levels or classes) of the column variable. each box inside the table is called a cell.
