Module 6
How to determine whether there is an association or causation between data
Also, keep in mind that we always need to know whether the data we're looking at is just observed data or data from an experiment. If there is no experiment (that is, no control group and no treatment), we cannot conclude whether there is any causation. If there is no control group but we do notice a correlation between two variables, we can at least say there is an association and may even recommend a full experiment to determine whether causation is involved.
In addition to measuring linear relationships, correlation can be used to measure curvilinear and parabolic relationships. True or False? a. True b. False
b. False; Correlation measures only whether two variables have a linear relationship with one another, so it does not capture curvilinear or parabolic relationships.
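To see why correlation only captures linear relationships, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The function name pearson_r and the sample data are illustrative assumptions, not part of the course material. The parabola is a perfect relationship, yet its r comes out at zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: one perfectly linear set, one perfectly parabolic set
xs = [-3, -2, -1, 0, 1, 2, 3]
linear_ys = [2 * x + 1 for x in xs]
parabola_ys = [x ** 2 for x in xs]

r_linear = pearson_r(xs, linear_ys)      # perfect positive linear relationship
r_parabola = pearson_r(xs, parabola_ys)  # strong relationship, but not linear

print(r_linear, r_parabola)
```

An r near zero therefore means "no linear relationship", not "no relationship".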
The variable that may be the cause of some result, or that is presented as offering an explanation for it (also called an independent variable), is the _______________ _________________
Explanatory Variable
What variable represents the lurking variable among these three scatterplots? a. Obesity b. Testosterone Level c. Systolic Blood Pressure d. There is no lurking variable
Obesity; The scatterplots reveal a weak negative association between systolic blood pressure and testosterone levels. There is also a strong positive correlation between obesity and systolic blood pressure and a strong negative correlation between obesity and testosterone level. Because obesity is associated with both systolic blood pressure (SBP) and testosterone level, it is the lurking variable between them.
_______________________ is used to describe a trend or predict values based on known values.
Regression
______________ _____________ is used when multiple variables' quantities relate to each other. Regression analysis is an extension of scatterplots, associations, and correlations. If we determine that an association exists between two or more variables, we can use regression analysis for description and prediction. Regression is used to describe a trend or predict values based on known values
Regression analysis; For example, a patient's weight is associated with blood pressure, cholesterol, and blood sugar. These variables can be modeled with a regression analysis. Once their association has been established and quantified, regression analysis can be used to predict future data values. So, we can quantify all of the health risks a patient may have based on a risk factor: weight.
7. If a patient's risk of cancer is associated with genetics, diet, and tobacco use, which of these is the response variable? a. Risk of cancer b. Genetics c. Diet d. Tobacco Use
The answer is a. In this example, the risk of cancer is the response variable because it "responds" to genetics, diet, and tobacco use
According to the first scatterplot, what type of relationship (if any) is there between systolic blood pressure and incidence of lung cancer? a. There is a causal relationship between systolic blood pressure and incidence of lung cancer. b. There is a positive association between systolic blood pressure and incidence of lung cancer. c. There is a negative association between systolic blood pressure and incidence of lung cancer. d. There is a parabolic association between systolic blood pressure and incidence of lung cancer.
The answer is b. This scatterplot reveals a positive association between systolic blood pressure and incidence of lung cancer.
What type of variable can cause a Simpson's Paradox? a. Nominal variable b. Categorical variable c. Lurking variable d. Ordinal variable
The answer is c. Simpson's Paradox is always caused by the inclusion of a Lurking Variable.
Overall we identified that Pepcid is more effective than Nexium at controlling heartburn. What lurking variable would make you question this result? a. No Simpson's paradox is evident, therefore there is no lurking variable b. The people taking Nexium in Trial 1 c. The people taking Pepcid in Trial 1 d. Which trial a person was in
The answer is d. Considering which trial a person was in would lead you to conclude that Nexium was more effective than Pepcid. As a result, this is the lurking variable associated with Simpson's paradox in this case.
What can be deduced from the testosterone levels vs. systolic blood pressure scatterplot? a. Elevated systolic blood pressure causes increased testosterone levels. b. Elevated systolic blood pressure causes decreased testosterone levels. c. There is a strong negative association between testosterone levels and systolic blood pressure. d. There is no causation evident between these two variables.
The answer is d. From the scatterplot, we can deduce that males with elevated testosterone levels may have lower blood pressure values, but no definite causation can be identified.
8. If a patient's blood sugar, cholesterol, and blood pressure are all dependent on diet, which of these is the explanatory variable? a. Blood sugar b. Cholesterol level c. Blood pressure d. Diet
The answer is d. In this example, diet is the explanatory variable because all of the other variables respond to it.
Estimate the correlation coefficient for the scatterplot in question #8. a. −0.87 b. −0.1 c. 0.5 d. 0.87
The answer is d. Strong correlations have correlation coefficients close to 1 or −1. Because this is a strong positive correlation, the correlation coefficient is positive and close to 1, so it can be estimated to be around 0.87
y = mx + b
The slope-intercept form of a linear equation, where m represents slope and b represents y-intercept.
4. One dose of the measles vaccine is about 93% effective at preventing measles, if exposed to the virus. Doctors, scientists, and epidemiologists have shown that the vaccine is the cause of the reduction in measles cases. Therefore, there is a causal relationship. True or False? a. True b. False
True; Correct. This statement is true. There is a proven cause-and-effect relationship between the measles vaccine and a reduction in measles cases.
10. Suppose there is a linear regression equation y = 180 − 0.01x where x is equal to dietary calories eliminated per day and y is equal to blood cholesterol levels measured in mg/dL. Which of the following is the correct interpretation of this slope? a. For every dietary calorie eliminated per day there is a corresponding 0.01 point decrease in blood cholesterol levels. b. For every 0.01 point reduction in blood cholesterol level there is a 180 increase in eliminated dietary calories. c. For every 0.01 dietary calories consumed, there is a corresponding 1 point increase in blood cholesterol levels. d. For every 180 dietary calories consumed, there is a corresponding 0.01 point gain in blood cholesterol levels.
a. For every dietary calorie eliminated per day there is a corresponding 0.01 point decrease in blood cholesterol levels.
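The slope interpretation can be checked numerically. This is a small sketch, assuming a hypothetical helper named predict_cholesterol that simply evaluates the equation from the question, y = 180 − 0.01x.

```python
def predict_cholesterol(calories_eliminated):
    """Evaluate y = 180 - 0.01x from the question (y in mg/dL)."""
    intercept = 180.0   # predicted cholesterol when no calories are eliminated
    slope = -0.01       # change in cholesterol per calorie eliminated per day
    return intercept + slope * calories_eliminated

baseline = predict_cholesterol(0)
after_100 = predict_cholesterol(100)

# Eliminating one more calorie per day changes the prediction by the slope
change_per_calorie = predict_cholesterol(101) - predict_cholesterol(100)

print(baseline, after_100, change_per_calorie)
```

Each additional calorie eliminated lowers the predicted value by about 0.01, which matches answer a.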
Regression analysis can help determine the association between risk factors ( ____________________ variables) and diseases or disorders ( __________________ variables)
explanatory variables; response variables
The biggest takeaway is that _______________________ studies help us show an association while ___________________ studies give evidence of a causal relationship between variables.
observational; experimental
Recall that the _____________ ____________ is the variable whose value "responds" to the other variables in the equation
response variable*
For hypothesis testing to be carried out successfully, there are some important measurements that should be obtained during the data collection process, including the
sample size*, mean*, standard deviation*, range*, and degrees of freedom*.
Some common tools used in hypothesis testing include z-tests, ______________, chi-squared tests, and _________________________________
t-tests; analysis of variance (ANOVA). Regardless of the test used, the result of the test is a number that is known as a p-value*.
3. Sunscreen sales increase when shark attacks become more frequent, therefore there is a causal relationship. True or False? a. True b. False
False. There is not a proven cause-and-effect relationship between sunscreen sales and shark attacks. This is an association, but a causal relationship does NOT exist.
1. An association between two variables is a strong indication of causation. True or False? a. True b. False
False; Correct. This statement is false. An association between two variables is not a strong indication of causation. Often there are associations between two variables where causation does not exist.
5. Obesity rates decrease as exposure to sunlight increases. Therefore, there is a causal relationship between sunlight exposure and obesity. True or False? a. True b. False
False; Correct. This statement is false. We have not shown a proven cause-and-effect relationship between lack of sunlight exposure and obesity. This is an association, but we cannot be sure that a causal relationship exists without more information.
A ____________________ variable* is a variable that is associated with both variables, but was not included in the study. As you may have guessed, the lurking variable in this example is age. Age increases the risk of high blood pressure. Bone density also decreases with age.
Lurking Variable
________________ ________________ One way of measuring the relationship between two variables is by using a correlation. A correlation quantitatively describes whether two variables have a linear relationship with one another. Keep in mind that a correlation measures how two variables are associated with one another — a correlation does not establish causation in any way (you still need a full experiment with control and treatment groups to establish causation).
Measuring Correlation
There is no universally accepted line that separates what is considered a strong correlation and a weak correlation. That being said, we will use the following common guide: r values between −0.3 and 0.3 are considered weak linear correlations; r values between −0.7 and −0.3 and between 0.3 and 0.7 are considered moderate linear correlations; r values between −0.7 and −1 and between 0.7 and 1 are considered strong correlations.
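The guide above can be written as a small lookup. This is a sketch only: the function name classify_correlation is made up, and because the guide does not say which category the exact boundary values 0.3 and 0.7 belong to, placing them in the stronger category is an assumption.

```python
def classify_correlation(r):
    """Label a correlation coefficient using the weak/moderate/strong guide."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    strength = abs(r)  # the sign gives direction, not strength
    if strength < 0.3:
        return "weak"
    if strength < 0.7:   # assumption: exactly 0.3 counts as moderate
        return "moderate"
    return "strong"      # assumption: exactly 0.7 counts as strong

labels = [classify_correlation(r) for r in (-0.87, -0.1, 0.5, 0.87)]
print(labels)
```

The sign of r is ignored here because it describes the direction of the relationship, while the guide is about its strength.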
Memorize chart
If the p-value is less than the significance level, the result is ________________________. The probability that the result is caused by chance is low enough for us to accept. If the p-value is greater than the significance level, the result is _____________ significant. The probability that the result is caused by chance is too high for us to accept.
Significant; Not
__________ ____________ _____________ is the prediction of one response variable's value from an explanatory variable's value. In a simple linear regression, we model the relationship between the variables as perfectly linear. So, the closer the data's correlation is to being perfectly linear, the more accurate the linear regression model will be.
Simple linear regression
________________ _______________* is a counterintuitive situation that occurs when a trend or result that appears in groups of data disappears or reverses when we combine the groups. This paradox can be hard to conceptualize without an example. To understand Simpson's Paradox, look at the example below:
Simpson's Paradox
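A numeric illustration may help here. The counts below are hypothetical and are not taken from the course tables or from any real study. The names Drug A, Drug B, Trial 1, and Trial 2 are placeholders. Drug A has the higher success rate inside each trial, yet Drug B has the higher rate once the trials are combined, because the two drugs were given to very different numbers of people in each trial.

```python
# Hypothetical (successes, patients) counts; not real study data
trials = {
    "Trial 1": {"Drug A": (81, 87), "Drug B": (234, 270)},
    "Trial 2": {"Drug A": (192, 263), "Drug B": (55, 80)},
}

def best_drug(counts):
    """Return the drug with the highest success rate in one set of counts."""
    return max(counts, key=lambda drug: counts[drug][0] / counts[drug][1])

# Winner within each trial
per_trial_winner = {trial: best_drug(arms) for trial, arms in trials.items()}

# Pool successes and patients across trials, then find the overall winner
combined = {}
for arms in trials.values():
    for drug, (successes, patients) in arms.items():
        total_s, total_n = combined.get(drug, (0, 0))
        combined[drug] = (total_s + successes, total_n + patients)
overall_winner = best_drug(combined)

print(per_trial_winner, overall_winner)
```

Here "which trial a person was in" plays the role of the lurking variable: ignoring it reverses the conclusion.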
5. According to the first scatterplot, what type of relationship (if any) is there between testosterone levels in males and systolic blood pressure? a. Weak negative association between systolic blood pressure and testosterone levels. b. Weak positive association between systolic blood pressure and testosterone levels. c. Parabolic association between systolic blood pressure and testosterone levels. d. No association is seen between systolic blood pressure and testosterone levels.
The answer is a. The scatterplot reveals there is a weak negative association between systolic blood pressure and testosterone levels.
Consider the data in the table above. Will a Simpson's Paradox be evident, with respect to the relative success rates of Tamoxifen and Paclitaxel? Why or why not? a. No, because an equal number of people are being treated with each drug in each trial. b. Yes, because there are different numbers of people in the different trial groups. c. Yes, because the success rates were different for each trial. d. Yes, because Tamoxifen is the gold standard for treatment of breast cancer.
The answer is a. Because the trend that Paclitaxel has a greater rate of success in each of the trial groups remains when the groups are combined, we do not see Simpson's Paradox occurring here. This was expected because an equal number of people are being treated with each drug in each trial.
What variable represents the lurking variable among these three scatterplots? a. Number of Cigarettes Per Day b. Systolic Blood Pressure c. Incidence of Lung Cancer d. There is no lurking variable
The answer is a. The first scatterplot reveals a positive correlation between systolic blood pressure and incidence of lung cancer. Science has proven that cigarette smoking is a risk factor for lung cancer. The scatterplot on the left illustrates a positive correlation between the incidence of lung cancer and the number of cigarettes smoked, while the scatterplot on the right shows a slight positive correlation between systolic blood pressure and the number of cigarettes smoked. Elevated systolic blood pressure, however, has not been proven to be a risk factor for lung cancer. Therefore, cigarette smoking is the lurking variable in the first scatterplot between systolic blood pressure and incidence of lung cancer.
What can be deduced from these scatterplots? a. There is a positive correlation between cigarette smoking and incidence of lung cancer. b. There is a negative correlation between cigarette smoking and incidence of lung cancer. c. There is a negative correlation between cigarette smoking and elevated systolic blood pressure. d. There is causality between cigarette smoking and incidence of lung cancer.
The answer is a. The scatterplot shows a positive association between cigarette smoking and incidence of lung cancer. While you may know that this relationship is causal, this is not a conclusion we can draw from the scatterplot.
10. If a hypothesis test yields a result that 85% of the results did not occur by chance, what is the p-value of this study? a. 0.08 b. 0.15 c. 0.58 d. 0.85
The answer is b. If 85% of the results did not occur by chance, then 15% of the results occurred by chance, and the p-value for this study is 0.15.
A lurking variable only affects the response variable. True or False?
This statement is false. A lurking variable can have an effect on both the explanatory and response variables.
A lurking variable is a variable that is included in an original study. True or False?
This statement is false. A lurking variable is not a variable that is intended to be included in the original analysis.
Lurking variables are unimportant for a correct understanding of cause and effect. True or False?
This statement is false. Lurking variables can offer a more accurate explanation of cause and effect, such as age causing higher blood pressure and lower bone density.
A lurking variable can hide the true relationship, or falsely identify a strong relationship, between variables. True or False?
This statement is true. A lurking variable impacts analysis because it is a variable that is not included as an explanatory variable or response variable.
2. Causation is a cause-and-effect relationship between two variables. True or False? a. True b. False
a. True; Correct. This statement is true. Causation indicates a cause-and-effect relationship between two variables.
5. Least squares estimation is a technique for predicting future data values. True or False? a. True b. False
b. Correct. This statement is false. Least squares estimation is a technique used to estimate the best-fit line in linear regression.
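As a rough sketch of what least squares estimation does, the code below finds the slope m and intercept b of the best-fit line y = mx + b using the standard closed-form formulas for one explanatory variable. The function name least_squares_line and the data points are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing squared vertical errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of products of deviations / sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the best-fit line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data points
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = least_squares_line(xs, ys)
print(m, b)
```

Estimating the line is the least squares step. Using the fitted line to predict new values is a separate step that comes afterward.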
A strong correlation always explains a cause-and-effect relationship between two variables. True or False? a. True b. False
b. False; A correlation does not prove that one variable causes another. It is possible that both of the variables in question are affected by some other factor, or that one variable is a subset of the other.
Significance Level: The significance level specifies how certain you need to be to state that a result is statistically significant. It is used as the p-value cutoff for statistical significance, which means that any p-value ________________ the set significance level is considered statistically significant.
below; For example, if you decide to use a significance level of 0.1, that means you accept a conclusion only if there is less than a 0.1 (or 10%) probability that a result was caused by chance. Here, you would declare something significant, knowing that there is less than a 10% chance your conclusion is wrong. How can you apply these concepts and draw conclusions from studies? By examining the p-value that results from a statistical hypothesis test. The p-value is the probability that a result was caused by chance. It will always fall between 0 and 1. So, if a hypothesis test results in a p-value of 0.8, that means that there is an 80% probability the result was caused by chance. This value is very high and indicates that the result is not significant: it was most likely caused by random variation or "chance."
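The decision rule reduces to a single comparison. This is a minimal sketch with a hypothetical function name, is_significant. It follows the rule as stated above, where a p-value below the significance level counts as statistically significant.

```python
def is_significant(p_value, significance_level):
    """Apply the cutoff rule: significant only if p is below the chosen level."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must fall between 0 and 1")
    return p_value < significance_level

# With a significance level of 0.1:
high_p = is_significant(0.8, 0.1)    # p = 0.8 is far above the cutoff
low_p = is_significant(0.05, 0.1)    # p = 0.05 is below the cutoff

print(high_p, low_p)
```

The same p-value can be significant at one level and not at another, so the significance level should be chosen before looking at the result.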
A study shows that there is a negative relationship between a student's anxiety before a test and the student's score on the test. Which is a possible lurking variable?
c. Student's time spent studying
The strength of a linear relationship between two variables can be measured by the __________________ __________________
correlation coefficient*; A strong linear relationship indicates that the data will bunch around a straight line, while a weak linear relationship does not. Real data will rarely follow a straight line exactly, but if the points on a scatterplot do fall precisely along a straight line, we call it a perfect linear relationship.
9. Consider the data in the table above. Will a Simpson's Paradox be evident within this group of data? Why or why not? a. No, because an equal number of people are being treated with each drug in each trial. b. No, because it is clear from the data that Nexium has a greater rate of success. c. Yes, because the success rates were different for each trial. d. Yes, because while Nexium had better success in each trial, Pepcid had better success overall.
d. Yes, because while Nexium had better success in each trial, Pepcid had better success overall.
One of the primary uses of linear regression equations is to make estimations or predictions from a given set of data. If the prediction is between known data points, it is called __________________ _________________*. If the prediction is outside of known data points, it is called ___________________ _________________*
linear interpolation; linear extrapolation*. If we make a prediction about any of the clothing sizes between 1 and 5, the process we are using is linear interpolation: a prediction between known data points. If our prediction is for any of the clothing sizes bigger than 5, or smaller than 1, the process we are using is linear extrapolation: a prediction outside of the range of known data points. The important thing to know is that the further away you are from the data values you have, the less reliable your prediction is. Or said another way, the further out your extrapolation is, the less reliable your prediction is.
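The distinction comes down to whether the value being predicted lies inside the range of the known data. Below is a short sketch, assuming the known clothing sizes run from 1 to 5. The function name prediction_type is hypothetical.

```python
def prediction_type(x, known_xs):
    """Say whether predicting at x is interpolation or extrapolation."""
    lowest, highest = min(known_xs), max(known_xs)
    if lowest <= x <= highest:
        return "interpolation"   # inside the range of known data points
    return "extrapolation"       # outside the range, so less reliable

known_sizes = [1, 2, 3, 4, 5]   # assumed known data points
inside = prediction_type(3.5, known_sizes)
above = prediction_type(7, known_sizes)
below = prediction_type(0, known_sizes)

print(inside, above, below)
```

The further x sits outside the known range, the less the extrapolated prediction should be trusted.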
Here, the data points do not form a line, but they have a clear relationship nonetheless. The data points form a very distinct "u" shape, known as a ____________________
parabola
A _______________ ______________* indicates that two variables' values move in the same direction as one another. If two variables have a positive correlation, when one variable increases, the other also increases. If one variable decreases, the other variable also decreases. A _______________ _____________* indicates that two variables' values move in the opposite direction to one another. When one variable increases, the other decreases.
positive correlation; negative correlation