Unit 4
A good Linear model will have:
- Points scattered randomly above and below the line. - No curved pattern in the residuals. - Equal variability throughout the entire residual plot.
Finding r from r squared
1. Take the square root of r squared. 2. Look at the graph to determine the sign (positive for an upward trend, negative for a downward trend).
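The two steps above can be sketched in a few lines of Python. The function name `r_from_r_squared` and the `slope_is_positive` flag are illustrative names chosen here, not part of the original notes, and the value 0.64 is a made-up example.

```python
import math

def r_from_r_squared(r_squared, slope_is_positive):
    """Recover r from r^2: take the square root, then attach the sign of the trend."""
    magnitude = math.sqrt(r_squared)  # step 1: square root of r squared
    # Step 2: the sign comes from the graph (upward trend -> +, downward trend -> -).
    return magnitude if slope_is_positive else -magnitude

# Hypothetical example: r^2 = 0.64 and the scatterplot slopes downward.
r = r_from_r_squared(0.64, slope_is_positive=False)
print(round(r, 2))
```

The square root alone can never tell you the sign, which is why the graph (or the sign of the slope) is needed.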
For a Biology assignment, Lisa collected data on plant growth of a sunflower every week for 9 weeks. When Lisa first planted the sunflower, it was 10 centimeters tall. The time (in weeks) is plotted against the height (in centimeters) as shown below.
30. ON THE LINE.
Correlation Coefficient = r
=CORREL in Excel
Coefficient of Determination (r^2)
A value that explains the percent of variation in the response variable that can be explained by a linear association with the explanatory variable. It is the square of the correlation coefficient.
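As a rough illustration of how r and r^2 relate, the sketch below computes the correlation coefficient by hand and then squares it. The data values are made up for the example and do not come from the notes; the helper names `mean` and `correlation` are also just illustrative.

```python
import math

# Made-up illustrative data (not from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def mean(values):
    return sum(values) / len(values)

def correlation(xs, ys):
    """Correlation coefficient r for paired quantitative data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(x, y)
r_squared = r ** 2  # coefficient of determination
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

For this made-up data r is about 0.775 and r^2 is about 0.600, so roughly 60% of the variation in y would be explained by the linear association with x.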
Which of the following data sets would most likely have a negative association and a correlation coefficient between 0 and -1? a.) Average annual temperature in the United States; Annual sweater sales by an American retailer b.) Age of baby; Weight of baby c.) Number of minutes spent exercising; Number of calories burned d.) Number of miles driven; Number of radio stations listened to
A) Average annual temperature in the United States;Annual sweater sales by an American retailer. For a negative correlation, we would expect as the value of one variable goes up, the other goes down. As the annual temperature goes up, people would need less thick clothes, so sales of sweaters should go down.
Correlations
Allow you to observe the strength and direction of a linear association between two quantitative variables. Values range between -1 and +1. Between -0.5 and +0.5 is weak. Between +0.5 and +0.8, or between -0.5 and -0.8, is moderately strong. Between +0.8 and +1, or between -0.8 and -1, is strong.
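The cutoffs on this card can be turned into a small lookup, sketched below. The function name `describe_correlation` is made up for illustration, and how to treat the exact boundary values 0.5 and 0.8 is an assumption here (they are counted in the stronger category), since the card does not say.

```python
def describe_correlation(r):
    """Label r using this unit's cutoffs: 0.5 and 0.8 (boundary handling is assumed)."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")

    size = abs(r)
    if size < 0.5:
        strength = "weak"
    elif size < 0.8:
        strength = "moderately strong"
    else:
        strength = "strong"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction

print(describe_correlation(-0.85))
print(describe_correlation(0.6))
```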
Multiple Regression
Allows you to predict a response based on more than one explanatory variable, although the explanatory variables have to be independent of each other.
Residual
Amount by which a prediction is off from the actual value. Data points don't lie exactly on a line, so each point has some difference between what the line predicts and the value it actually is.
Non-Linear Relationships
Associations between two variables that can be modeled better with a curve than a line.
X bar
Average x value for a sample
Y bar
Average y value for a sample
Extreme y value
Outlier in the y direction because its y value is so much higher than the other y values, while its x value is not unusual.
Inappropriate Grouping
Combining subgroups that should not be combined, resulting in a weakened, or even reversed, association.
Least Squares Lines
Common type of best fit line that focuses on the residuals. Calculated by minimizing the sum of the squares of the vertical differences from the line of best fit to each point. ŷ = b0 + b1x
Non-Association
Data points resemble a cloud and there is no clear pattern.
Neither extreme x nor y values
Even though it is not extreme in either the x or y direction, it doesn't fit the overall trend established by the rest of the data.
Describing Scatter plots
Form, direction, and strength.
The weekly feed cost for David's rabbit is $2.20. The rabbit used in a study weighs nine pounds. Using the equation ŷ = 0.5 + 0.16x for the regression line of weekly food cost on weight (weight is explanatory), what is the residual for David's rabbit?
To get the residual, take the actual value minus the predicted value. The weight is 9 lb and the actual feed cost is $2.20. Find the predicted cost using the regression line: ŷ = 0.5 + 0.16x = 0.5 + 0.16(9) = 0.5 + 1.44 = 1.94. Residual = 2.20 - 1.94 = 0.26.
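The same arithmetic can be checked with a short Python sketch. The function name `predicted_cost` is an illustrative choice; the numbers come from the question above.

```python
def predicted_cost(weight_lb):
    """Predicted weekly feed cost from the regression line y-hat = 0.5 + 0.16x."""
    return 0.5 + 0.16 * weight_lb

actual_cost = 2.20               # David's actual weekly feed cost
predicted = predicted_cost(9)    # the rabbit weighs 9 lb
residual = actual_cost - predicted  # residual = actual - predicted

print(round(predicted, 2), round(residual, 2))
```

A positive residual means the actual cost is above what the line predicts.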
Reversed Association
If we don't know the direction of the cause and effect between two variables, we cannot say that it is a causal relationship, only that they are strongly correlated.
Why do we square the residuals when using the least squares line method to find the line of best fit?
It cancels out the effect of having negative and positive residuals. If we didn't square them, the positive and negative residuals would cancel each other out and their sum would be zero.
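A small sketch can make the cancellation visible. The data below is made up for illustration (it is not from the notes), and the line is fitted with the standard least-squares formulas for slope and intercept.

```python
# Made-up illustrative data (not from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Least-squares slope and intercept.
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum((a - x_bar) ** 2 for a in x)
b0 = y_bar - b1 * x_bar

# Residual = actual - predicted, for every point.
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]

sum_of_residuals = sum(residuals)                # positives and negatives cancel
sum_of_squares = sum(e ** 2 for e in residuals)  # squaring removes the signs

print(round(sum_of_residuals, 10), round(sum_of_squares, 10))
```

The plain sum comes out as zero (up to rounding) even though no point sits exactly on the line, while the sum of squares is clearly positive, so only the squared version measures how far off the line is.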
Best Fit Line/ trend line / regression line.
Line going through a pack of points. A line that closely approximates the response values for given explanatory values when the form of the scatterplot is linear. It has these features: - Roughly half the points above and half below the line. - No pattern to how the points are off from the line.
Form
The overall shape of the data points. Look at the pattern. Is it linear? Does the data show a curve? Do the points start low, then peak, then end low?
Consistency
Look for cases when the correlation remains while other factors vary. Does the association remain even when other variables are allowed to vary? Does this work across different races and genders? Do high amounts of the alleged cause lead to high or low amounts of the alleged effect?
Extreme x and y value
Outlier in both the x and the y direction because it's so much further to the right and also higher than the rest of the points.
Extreme x-values
Outlier in the x direction because it's so much further to the right of the pack of points but not in the y direction.
Stacey finds a scatterplot that shows data for nine schools. It relates the percentage of students receiving free lunches to the percentage of students wearing a bicycle helmet. The plot shows a strong negative correlation. Stacey recalls that correlation does not imply causation. In this example, Stacey sees that increasing the percentage of free lunches would not cause children to use their bicycle helmets less. Identify the confounding variable that is causing Stacey's observed association.
Parents' annual salary: A confounding variable is a variable that helps to explain the correlation between two variables. It must be related to both variables. Parents' salary would determine whether a student qualifies for free school lunches: the higher the salary, the lower the percentage of free lunches. As parents' salary increases, bicycle helmet use should also increase, because they would be able to afford helmets. So this confounding relationship helps to explain why we see this association. It is not the case that helmet use and receiving free lunches have any type of causal relationship.
Moderate
Points are less clustered in a line or curve, however, direction is still clear.
Weak
Points are much more spread out and the direction may be less clear.
Slope
Rate of change: m = (y2 - y1) / (x2 - x1). It relates the increase or decrease in y to an increase of 1 in x. In y = mx + b, m is the slope and b is the y intercept.
Direction
Refers to how the y axis variable responds as you move to the right on the x axis variable. The way one variable responds to an increase in the other. With a negative association, an increase in one variable is associated with a decrease in the other, whereas with a positive association, an increase in one variable is associated with an increase in the other.
Residual Formula
Residual = y - y(hat) = Actual response - predicted response.
Strong
Scatterplot would most resemble the form. The data is clustered around either a line or a curve.
Pete has measured the diameter and circumference of four different tires, listed the data in a table, and graphed the results on a scatterplot. He noticed the points fall closely on a line. Using the data values that Pete collected, select the correct slope and y-intercept. Diameter = 20, 22, 23, 24 Circumference = 63, 69, 72, 75
Slope = 3 Y intercept = 3 Use the formula m = (y2 - y1) / (x2 - x1). Pick any 2 points, such as (20, 63) and (22, 69): (69 - 63) / (22 - 20) = 6/2 = 3. The y intercept is the y value when x = 0. Use y = mx + b and plug in one point, such as (20, 63), for x and y, along with the slope of 3: 63 = (3)(20) + b, 63 = 60 + b, 3 = b.
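The two-point calculation above can be reproduced in Python using Pete's data. The variable names are illustrative, and the final check simply confirms that every listed tire falls on the resulting line.

```python
diameters = [20, 22, 23, 24]
circumferences = [63, 69, 72, 75]

# Pick any two points, here the first two.
x1, y1 = diameters[0], circumferences[0]   # (20, 63)
x2, y2 = diameters[1], circumferences[1]   # (22, 69)

slope = (y2 - y1) / (x2 - x1)   # m = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1     # solve y = mx + b for b

# Check whether every data point lies on y = slope * x + intercept.
all_on_line = all(slope * d + intercept == c
                  for d, c in zip(diameters, circumferences))

print(slope, intercept, all_on_line)
```

Because all four points lie on the same line, any pair of points gives the same slope and intercept.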
Correlation
Strength and direction of a linear association between two quantitative variables. Positive number if there's a positive association Negative number if there's a negative association
Negative Correlation
Tendency of the response variable to decrease in response to an increase in the explanatory variable. Less than or equal to -0.5.
Positive correlation
Tendency of the response variable to increase in response to an increase in the explanatory variable. Greater than or equal to 0.5.
Strength
The closeness of the points to the indicated form. Points that are strongly linear will all fall on or near a straight line.
Non-linear
The data points follow a curve
Extrapolation
Making predictions outside of a range. Using the linear model to make predictions outside the range of values for which the estimate was intended.
Correlation coefficient (r)
The numerical value between -1 and +1 that measures the correlation between two quantitative variables.
Linear
The scatterplot is approximating a line
Relative Zero Correlation
The type of correlation present when two variables have a correlation coefficient generally between -0.5 and 0.5.
Y intercept
The value of Y when x=0
Negative
The variables go in opposite directions
Response Variable
The variable that tends to increase or decrease due to an increase or decrease in the explanatory variable.
Positive
The variables both increase or both decrease
Lurking Variables
Variable that could be confusing the relationship between the explanatory variable and the response variable.
Explanatory Variable
Variable whose increase or decrease we believe helps explain a tendency to increase or decrease in some other variable.
Scatterplots
Ways that you can show more than one quantitative attribute at a time for a particular data set.
Causation
When one variable actually causes another variable to occur. Correlation does not imply causation.
Control
You need something similar to a control. It's not exactly using a control group, but it's similar to what you would do if you had done an experiment. This is essentially like splitting a group of volunteers into two groups and having a treatment group. Is the effect absent when the cause is absent? Is the effect present when the cause is present?
Outlier
Any point that deviates substantially from the overall form of the remainder of the data points.
Slope of Least Squares Line
b1 = r * sy/sx: multiply the correlation coefficient by the ratio of the standard deviation of the y data to the standard deviation of the x data.
Jaime finished analyzing a set of data with an explanatory variable x and a response variable y. He finds that the mean and standard deviation for x are 5.43 and 1.12, respectively. The mean and standard deviation for y are 10.32 and 2.69, respectively. The correlation was found to be 0.893. Select the correct slope and y-intercept for the least-squares line.
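slope = 2.14 Y intercept = -1.30 slope = r * sy/sx = 0.893 * 2.69/1.12 = 0.893 * 2.402 = 2.14 (the slope is positive because r is positive) 10.32 = b0 + 2.14(5.43) 10.32 = b0 + 11.62 -1.30 = b0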
Weight is the explanatory variable and has a mean of 24.5 and a standard deviation of 25.44. Weekly feed cost is the response variable and has a mean of 7.6 and a standard deviation of 2.97. The correlation was found to be 0.879. Select the correct slope and y-intercept for the least-squares line. Answer choices are rounded to the hundredths place.
slope = 0.10 Y intercept = 5.15 slope = r * sy/sx = 0.879 * 2.97/25.44 = 0.879 * 0.117 = 0.10. The least-squares line passes through (x bar, y bar), so 7.6 = b0 + 0.10(24.5), 7.6 = b0 + 2.45, 5.15 = b0.
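The summary-statistics calculation above can be sketched as a small Python function. The name `least_squares_from_summary` and its parameter names are made up for illustration. Following the worked answer, the slope is rounded to two decimal places before the intercept is computed; that rounding step is a choice taken from the card, not a general rule.

```python
def least_squares_from_summary(r, x_mean, x_sd, y_mean, y_sd, decimals=2):
    """Slope b1 = r * sy/sx, then intercept b0 from the point (x-bar, y-bar)."""
    slope = round(r * y_sd / x_sd, decimals)  # rounded first, as on the card
    intercept = y_mean - slope * x_mean       # the line passes through the means
    return slope, round(intercept, decimals)

# Weight (explanatory): mean 24.5, sd 25.44. Feed cost (response): mean 7.6, sd 2.97.
slope, intercept = least_squares_from_summary(
    r=0.879, x_mean=24.5, x_sd=25.44, y_mean=7.6, y_sd=2.97
)
print(slope, intercept)
```

If the slope were not rounded before computing the intercept, the intercept would come out closer to 5.09, so it is worth checking which convention an answer key uses.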
Linear equation / Slope intercept form
y = mx + b
Slope intercept form of a linear equation
y = mx + b corresponds to ŷ = b0 + b1x. The slope m becomes b1, the y intercept b becomes b0, and instead of y we have ŷ (y hat).