5.07 Correlation and Causation
Observational Study
- Observes responses to variables
- Does not attempt to influence variables
- Cannot determine causation; determines correlation only
An observational study simply observes and records behavior. It does not impose any treatment to manipulate the response. Because no variable is manipulated, an observational study can only determine a correlation between or among the variables, whereas an experiment can determine causation.
Experiment or Controlled Study
- Observes responses to variables
- Influences variables as part of the treatment that results in a response
- Can determine causation
Experiments are able to determine causation, whereas an observational study can only determine a correlation between or among the variables.
Using Tables and Graphs to Show Correlation - Equation
Strengths:
- Easy to find the exact values of the independent and dependent variables.
- Easy to update the equation if changes happen in the scenario.
- Easy to make a table of values.
Limitations:
- Hard to see visually what the equation represents, especially if the equation is complicated.
Table of Values
Strengths:
- Quickly find the exact values.
- Easy to make a graph.
- Easy to calculate the change of values for each variable.
Limitations:
- Need to update the table if a solution isn't included in it.
- Need to update each value if changes occur.
Graph
Strengths:
- Easy to see the relationship between the two quantities.
- Easy to see the whole range of solutions.
Limitations:
- May be difficult to find the exact values if the solutions are not whole numbers.
- Need to make a new graph if the situation changes.
What does an r-value of −0.12 say about the two variables in a set of data?
The scatter plot would be very spread out and angled only slightly down to the right. The data are not strongly correlated and are closer to not being correlated at all.
If the r-value of two variables is 0.99, what can you say about the data?
The data have a very strong, positive correlation. The scatter plot would closely resemble a straight line with a positive slope.
Causation
When one set of data directly causes the other to occur.
Causation is when one set of data can be positively shown to have caused another set to occur. For instance, you are hungry by 11 a.m. because you didn't eat breakfast. This is a simple cause (not eating breakfast) and an effect (feeling hungry earlier than usual).
Correlation and Lines of Best Fit
With data, we often talk about correlations, of which there are three types: positive, negative, or no correlation. Statisticians and analysts use what they know about correlations and lines of best fit to make predictions. A line of best fit is a line drawn through the points on a scatter plot to summarize the relationship between the two sets of data. How well a model actually fits the data can be determined by its correlation coefficient.
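As a rough illustration of using a line of best fit to make a prediction, the sketch below fits a least-squares line in plain Python. The least-squares method is one common way to draw such a line; the notes do not say which method is intended. The function name `best_fit_line` and the study-hours data are made up for this example.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied (x) and test score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

slope, intercept = best_fit_line(hours, scores)
print(slope, intercept)  # 6.0 46.0

# Use the line to predict the score for 6 hours of study.
predicted = slope * 6 + intercept
print(predicted)  # 82.0
```

A prediction like this only summarizes the trend in the data; it does not show that studying longer causes the higher score.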
Which r value shows the strongest correlation? −0.98, −0.20, −0.25, −0.75
−0.98
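The answer follows from comparing how far each value is from 0, ignoring the sign. A small sketch of that comparison (the variable names are only for illustration):

```python
# The four answer choices from the question.
r_values = [-0.98, -0.20, -0.25, -0.75]

# Strength depends on the distance from 0, so compare absolute values.
strongest = max(r_values, key=abs)
print(strongest)  # -0.98
```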
Correlation Versus Causation
If two variables have a strong correlation, then there is a strong relationship between the sets. Causation is a special type of relationship where changing the independent variable causes a change in the dependent variable. Lots of commercials say that smoking causes cancer, but is that true? We cannot perform a controlled experiment to prove causation because it would be unethical to force people to smoke and measure the occurrence of cancer. We can, however, show that there is a strong correlation based on several years of observational studies. To be able to assess the results in a study, it is important to remember the distinction between an experiment and an observational study.
Sum It Up
Causation is when one set of data can be positively shown to have caused another set to occur. Correlation is a measure or degree of relation between two sets of data. The two sets of data might be positively correlated (both sets of values tend to increase together), negatively correlated (one set tends to decrease as the other increases), or not correlated at all (the data are scattered all over the graph). Correlation can be determined using equations, tables, or graphs; each of these has its own strengths and limitations. When variables are identified, the independent variable should be (x) and the dependent variable should be (y). The correlation coefficient (r) will always be between −1.0 and +1.0. If the correlation coefficient is positive, there is a positive relationship. If the correlation coefficient is negative, the relationship is negative. If the two are not correlated at all, the correlation coefficient will be 0.
Causation or Correlation? A student notices she gets better grades on tests when she goes to bed at 9 p.m. instead of 10 p.m.
Correlation. Although it may be true that getting more sleep helps this student do better on tests, without a controlled study we cannot be sure.
Causation or Correlation? A tennis player notices she wins more games in the evening than in the afternoon.
Correlation. The player isn't attributing her better results to any cause at this point; she is only noting a correlation.
Causation vs. Correlation
Correlation:
- Is a measure of the strength of linear association between two variables
- Is always between −1.0 and +1.0
- Can be positive or negative
- Can be shown by an observational study
Causation:
- Is a demonstrable cause and effect
- Can be established by controlled studies or experiments
- Should not be assumed even when correlation is strong and predictable
- Cannot be proven by observational study alone
The Correlation Coefficient
The correlation coefficient (r) is a number that describes how strongly, and in which direction, the two variables in a data set are linearly related.
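The notes do not give a formula for r. Assuming the usual Pearson correlation coefficient is meant, a minimal plain-Python sketch looks like this. The function name `pearson_r` and the three small data sets are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]

# y rises with x along a straight line: r is +1.
r_up = pearson_r(xs, [2, 4, 6, 8, 10])

# y falls as x rises along a straight line: r is -1.
r_down = pearson_r(xs, [10, 8, 6, 4, 2])

# No linear trend in these values: r is approximately 0.
r_flat = pearson_r(xs, [3, 1, 4, 1, 3])

print(r_up, r_down, r_flat)
```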
Important!
The correlation coefficient will always be between −1.0 and +1.0. If the correlation coefficient is positive, there is a positive relationship. If the correlation coefficient is negative, the relationship is negative. If the two are not correlated at all, the correlation coefficient will be 0. Strong and weak correlations are a little more subjective in that there is no exact cutoff between strong and weak, but generally, any r value that is close to either 1 or negative 1 is considered strong. Any value of r that is closer to 0 is considered weak.
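These rules can be sketched as a small helper. Because there is no exact cutoff between strong and weak, the 0.7 threshold below is an arbitrary assumption chosen only for this example, and `describe_r` is a made-up name.

```python
def describe_r(r, strong_cutoff=0.7):
    """Describe r in words. The 0.7 cutoff is an assumption, not a standard."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1.0 and +1.0")
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    strength = "strong" if abs(r) >= strong_cutoff else "weak"
    return f"{strength} {direction}"


print(describe_r(-0.12))  # weak negative
print(describe_r(0.99))   # strong positive
print(describe_r(0))      # no correlation
```

Choosing a different `strong_cutoff` changes the labels for middling values such as 0.6, which is why "strong" and "weak" are better read as rough descriptions than as fixed categories.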
Independent variable
The experimental factor that is manipulated; the variable whose effect is being studied.
Dependent variable
The outcome factor; the variable that may change in response to manipulations of the independent variable.
Correlation
The relationship between two groups of data.
Correlation is a measure or degree of relation between two sets of data. The two sets of data might be positively correlated (both sets of values tend to increase together), negatively correlated (one set tends to decrease as the other increases), or not correlated at all (the data are scattered all over the graph).
