Statistics Ch. 4: Describing the Relation Between Two Variables
Positively associated variables
1) Linear variables. 2) Are positively correlated when above-average values of one variable are associated with above-average values of the other variable. 3) In other words, when one variable increases, the other tends to increase.
If r is close to +1, -1
Close to +1: the stronger the evidence of a positive linear association between the two variables. Close to -1: the stronger the evidence of a negative linear association between the two variables.
Negatively Associated Variables.
1) Linear variables. 2) Are negatively correlated when above-average values of one variable are associated with below-average values of the other variable. 3) In other words, when one variable increases, the other tends to decrease.
If the correlation coefficient shows no linear relation, (2 things!)
1) We can't use the least squares regression line to make predictions (y hat). 2) Instead, y hat (the predicted observation) for any value of the explanatory variable is the mean of the response variable.
How to compute the linear correlation coefficient on a calculator
1. Press 2nd, then 0, to open the catalog. 2. Scroll to DiagnosticOn and press Enter twice until the word Done appears. 3. Press STAT, go to CALC, choose linear regression, and enter for example L1, L2 (explanatory variable, then response variable).
Scatter Diagram Definition
A graph that shows the relationship between two quantitative variables measured on the same individual.
(scatter diagram not reproduced) What sort of relationship does the above scatter diagram show?
A positively associated linear relationship between the explanatory variable on the x-axis and the response variable on the y-axis.
Residual is like
An error because the observed value is different from our predicted value.
Marginal Distributions ... why are they called so?
Because each marginal distribution appears in either the right margin or the bottom margin of the contingency table.
Why are scatter diagrams not sufficient to help us determine whether a relationship exists between two variables?
Because the horizontal and/or vertical scales of the graph can be manipulated and hence mislead, distort, or give a different picture of the relationship between the two variables. Hence, we need a numerical summary, the linear correlation coefficient, to help us determine whether a linear relation exists between the two variables.
Synonym for Response Variable
Dependent Variable
Simpson's Paradox
Describes a situation in which an association between two variables inverts or goes away when a third variable is introduced into the analysis. (see pg. 241 if necessary)
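A small numerical sketch can make the reversal concrete. The admission counts below are made up purely for illustration (they are not from the textbook), and the names `counts`, `rate`, and `overall_rate` are hypothetical helpers. Within each department women have the higher admission rate, yet the combined rates point the other way because the department acts as the third variable.

```python
# Made-up counts for illustration only: (admitted, applicants)
counts = {
    "Dept A": {"men": (80, 100), "women": (9, 10)},
    "Dept B": {"men": (2, 10), "women": (30, 100)},
}

def rate(admitted, applicants):
    """Relative frequency of admission."""
    return admitted / applicants

def overall_rate(group):
    """Admission rate for a group when the departments are combined."""
    admitted = sum(counts[dept][group][0] for dept in counts)
    applicants = sum(counts[dept][group][1] for dept in counts)
    return admitted / applicants

# Rates within each department (the third variable is held fixed).
for dept, groups in counts.items():
    for group, (admitted, applicants) in groups.items():
        print(dept, group, round(rate(admitted, applicants), 3))

# Rates when the department is ignored: the direction of the association flips.
print("overall men", round(overall_rate("men"), 3))
print("overall women", round(overall_rate("women"), 3))
```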
SOMETHING VERY IMPORTANT TO REMEMBER about the usage of least squares regression line.
Do not use the least squares regression line to make predictions for values of the explanatory variable that are NOT within the range of values in the data set, because the linear relation that we computed may not hold for values that are smaller or larger than the scope of the data set.
How is a scatter diagram constructed?
Each individual in the data set is represented by a point on the scatter diagram. The explanatory variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis.
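One way to respect this rule in practice is to check the scope of the data before predicting. The following is only a sketch: the function name `predict_within_scope`, the data, and the slope and intercept values are all hypothetical.

```python
def predict_within_scope(x_new, xs, slope, intercept):
    """Return y hat only when x_new lies inside the observed range of x."""
    lowest, highest = min(xs), max(xs)
    if not (lowest <= x_new <= highest):
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lowest}, {highest}]"
        )
    return intercept + slope * x_new

# Hypothetical data and fitted line, for illustration only.
observed_hours = [1, 2, 3, 4, 5]
print(predict_within_scope(3, observed_hours, slope=2.0, intercept=1.0))

try:
    predict_within_scope(10, observed_hours, slope=2.0, intercept=1.0)
except ValueError as error:
    print("refused:", error)
```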
How do you test for the strength or direction of a linear relationship?
First you have to determine whether a linear relationship exists at all. First: compute the ABSOLUTE VALUE of the linear correlation coefficient r. Second: look up the critical value of r in the appendix for the given sample size. Third: if the absolute value of r is greater than the critical value, then a linear relationship exists between the two variables. Fourth: if it is linear, look at the actual value of r (not the absolute value): if r is positive, the variables are positively associated; if r is negative, they are negatively associated.
Give an example of how conditional frequency can help us see if there is an association between two variables.
For instance: level of high school education and employment. If the relative frequencies of employment for people who graduated high school and for people who did not are both close to the marginal relative frequency of employment, then the level of high school education is not really a factor in employment and is NOT ASSOCIATED with a better chance of employment.
Marginal distribution of a variable compared with conditional distribution
Marginal: the frequency or relative frequency of either the row or the column variable in the contingency table. Conditional: the RELATIVE FREQUENCY (only) of each category of the response variable (e.g. employment) given a certain value of the explanatory variable (e.g. being a high school graduate) (see pg. 238 if necessary)
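The difference between the two kinds of distribution can be sketched with a small contingency table. The counts are made up for illustration, and the variable names are hypothetical. The marginal distribution divides by the grand total, while the conditional distribution by education level divides by each education level's own total.

```python
# Made-up contingency table: education level (explanatory) by employment (response).
table = {
    "graduate": {"employed": 90, "not employed": 10},
    "non-graduate": {"employed": 60, "not employed": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of employment: category totals over the grand total.
status_totals = {}
for row in table.values():
    for status, count in row.items():
        status_totals[status] = status_totals.get(status, 0) + count
marginal = {status: count / grand_total for status, count in status_totals.items()}

# Conditional distribution of employment by education level:
# each count over the total for its own education level (the denominator).
conditional = {
    level: {status: count / sum(row.values()) for status, count in row.items()}
    for level, row in table.items()
}

print("marginal:", marginal)
print("conditional:", conditional)
```

In this invented table the conditional relative frequencies of employment (0.9 and 0.6) differ from each other and from the marginal one (0.75), which is the pattern that suggests an association.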
The residual value represents
How close the value that we predicted using the least squares regression line is to the actual observation. The residual is the difference between the observed value of the response variable and its predicted value; the smaller the residual in absolute value, the better our prediction was.
Interpretation of an ordinary slope value such as m=-1/1 or m=3/2 vs. interpretation of the slope of the least squares regression line.
In general, slope measures the rate of change: m = slope = change in y over change in x. If m=-1/1, then when x increases by 1, y decreases by 1; if m=3/2, then when x increases by 2, y increases by 3. For the least squares regression line, if m=-1/1, then when x increases by 1, y decreases ON AVERAGE by 1 (this is because the line describes the average trend in the data and there is no 100% certainty for any one individual). If m=3/2, then when x increases by 2, y increases on average by 3 (same reason).
What's an interesting application of the y hat. Give an example.
It can be used to represent the mean value of the response variable for any given value of the explanatory variable. For instance, if the best-fit regression line relates students' hours of studying to their GPA, and the line predicts a GPA of 3.8 for a student who studies 12 hours per day, then we can say that the mean GPA of all students who study 12 hours per day is 3.8.
When does a strong positive or strong negative r value imply (or not imply) causality between the two variables, i.e. that the change in the value of one variable CAUSES the CHANGE IN THE VALUE of the other variable?
It depends on how the data were collected. If the data were collected in an OBSERVATIONAL STUDY, then no matter how strong the r value is, the result shows only an ASSOCIATION (the variables are strongly correlated), and the cause of the relation might be a lurking variable not accounted for in the study. But if the data were collected through an EXPERIMENT, then that implies CAUSATION.
Linear Correlation Coefficient Definition
It is a measure of the STRENGTH and DIRECTION of the LINEAR relation between two quantitative variables.
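As a rough illustration of what the calculator computes, here is a minimal sketch of the standard formula for r written out by hand. The function name `pearson_r` and the small data set are made up for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Linear correlation coefficient of paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up paired data (explanatory, response).
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)
print(round(r, 3))  # 0.775, a positive linear association
```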
What are the units of the linear correlation coefficient "r"?
It is unitless
Conditional Distribution by X
It means that X is the denominator. (see h.w. q.2 if necessary)
The least squares regression line is a line that
Minimizes the SUM of the SQUARED errors (residuals). In other words, it minimizes the SUM of the SQUARED vertical distances between the observed y values and the predicted y values (which are also called y hat).
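The minimizing property can be checked numerically. The sketch below fits the line with the standard slope and intercept formulas and compares its sum of squared residuals with that of another candidate line. The data and the function names `least_squares_line` and `sum_squared_residuals` are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared (observed y minus predicted y hat)."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)

# Any other line, such as y = 2.5 + 0.5x, has a larger sum of squared residuals.
other = sum_squared_residuals(xs, ys, slope=0.5, intercept=2.5)

print(round(slope, 3), round(intercept, 3))
print(round(best, 3), round(other, 3))
```

For this data the fitted line is approximately y hat = 2.2 + 0.6x, with a sum of squared residuals of about 2.4 against about 2.5 for the other line.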
What's important to remember when drawing scatter diagrams.
NEVER connect the dots with lines.
(scatter diagram not reproduced) What type of relationship is implied here?
No relation between the explanatory and the response variable.
The linear correlation coefficient is .......? What does that mean?
Not resistant. That means that any value in the data set that doesn't follow the overall pattern of the data may affect the value of r. We don't need to know the formula for "r", but the formula takes into consideration ALL THE VALUES of both the explanatory and response variables, and hence extreme values of either variable can definitely affect the value of r. See pg. 193 for the formula, which we don't need to memorize.
Other words for Linear Correlation Coefficient
Pearson Linear Correlation Coefficient or Pearson Product Moment Correlation Coefficient.
If r=+1
Perfect positive Linear relationship exists.
Synonyms for Explanatory Variable
Predictor or Independent Variable
What's the purpose of drawing a scatter diagram?
A scatter diagram serves as the first step in helping us identify whether or not a relationship exists between two variables.
Summarize how we observe association between qualitative or categorical variables.
Compare the relative frequency of each category of the response variable across the categories of the explanatory variable. Differences in these relative frequencies between the categories MIGHT BE attributed to the difference in the explanatory variable, and hence the explanatory and response variables are associated.
Conditional Distribution of Y
That means Y is the numerator. This makes sense: imagine a question asking for the conditional distribution of employed people among high school graduates. (see h.w. q.2 if necessary)
Hint: When asked what proportion of ..........,
The .............. is the denominator.
Relationship between the scatter diagram and correlation coefficient from section 1 with what we learn in section 2. Tie it to the whole chapter.
The main theme of the chapter is how to describe a relationship between two variables. Section 1 taught us how to test for a linear relationship that exists between 2 variables first by drawing a scatter diagram and then confirming a linear relationship by computing the correlation coefficient. Let's say that there is a linear relationship that exists. We then need to learn how to describe it as accurately as POSSIBLE in an equation called the least squares regression line equation. This is the focus of section 2.
Response Variable
The variable whose value can be explained by the value of the explanatory or predictor variable.
If r=-1
Then perfect negative linear relationship exists
If r=0
Then there is LITTLE or NO evidence of a LINEAR RELATIONSHIP between the two variables.
(scatter diagram not reproduced) What type of relationship is shown here?
This is a non-linear relationship
(scatter diagram not reproduced) What type of relationship is shown here?
This is a nonlinear relationship
(scatter diagram not reproduced) What type of relationship is shown here?
This is a negatively associated linear relationship between the explanatory variable and the response variable.
What's the purpose of computing the conditional distribution? What do we use?
To describe whether a relationship exists between two qualitative variables. We use relative frequencies because there are different numbers of observations in each category of data (see pg. 237 if necessary)
Contingency Table aka .... define
Two-way table. Relates two categorical (QUALITATIVE) variables, for instance level of education and employment status.
What is y hat?
Y hat is the predicted value of the response variable using the least squares regression line equation.
How to determine whether there is a positive, negative, or no linear association, especially if r is ambiguous (not close enough to 1 or -1 to clearly show an association), for example r = -0.5 or r = +0.4?
Compare r with the critical value for the sample size. If r is positive and r > critical value, there is a positive association; if r is negative and r < -(critical value), there is a negative association; in either case, if the absolute value of r is less than the critical value, there is no linear relation.
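The decision rule can be written as a short function. This is only a sketch: the name `classify_linear_association` is hypothetical, and the critical value 0.632 is a placeholder chosen for illustration, so in practice use the appendix entry for your own sample size.

```python
def classify_linear_association(r, critical_value):
    """Apply the critical-value rule to a computed correlation coefficient."""
    if abs(r) <= critical_value:
        return "no linear relation"
    if r > 0:
        return "positive association"
    return "negative association"

# Placeholder critical value; look up the real one for your sample size.
critical_value = 0.632

for r in (-0.5, 0.4, 0.9, -0.8):
    print(r, "->", classify_linear_association(r, critical_value))
```

With this placeholder, the ambiguous values -0.5 and +0.4 from the card both come out as "no linear relation", while 0.9 and -0.8 are classified as positive and negative associations.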
Set of values for linear correlation coefficient "r"
r can be any value from -1 to 1, inclusive (it can also be exactly -1 or 1).
Caution about r=0
This is different from saying that there is NO RELATIONSHIP at all between the two variables; a nonlinear relationship may still exist.