Module 6 statistics part 2
linear interpolation*
If the prediction is between known data points
In what form is a simple linear regression equation usually written? a) point-slope b) x=my+b c) slope-intercept d) a+b=y
c) slope-intercept Correct. The correct answer is c. A simple linear regression equation is usually written in slope-intercept form.
Linear extrapolation is always a reliable method of prediction. True or False?
false Correct. This statement is false. Extrapolation assumes that the linear pattern of the data will continue outside of the range of data points. This may not always be the case and therefore may not always be a reliable method of prediction.
Example using Extrapolation
Now let's use the linear regression equation to make an estimation using extrapolation. Again, we use the linear regression equation for the data presented in the scatterplot of a child's age in years versus clothing size, y = 1x + 0. Let's say we want to predict the age of a child whose clothing size is 8. We substitute 8 for x in the regression equation and calculate y: y = 1(8) + 0 = 8, so the predicted age is 8 years.
linear extrapolation*.
the prediction is outside of known data points,
Recall that the response variable* is the
variable whose value "responds" to the other variables in the equation; in the previous example, the response variable could be blood pressure, cholesterol, or blood sugar, factors that are influenced by a person's weight. We assume in this course that explanatory variables* vary independently; in reality, they may influence one another. For example, people of the same gender and age have a range of different weights; those weights might affect other aspects of their health. The response variable is written on the y -axis, while the explanatory variable is written on the x -axis. A simple way to remember this fact is that the term "explanatory" has an 'x' in it.
4. How does bone density depend on age?
explanatory. "Age" is the explanatory variable.
Complete the following sentence. Regression analysis is used to _________________. calculate the probability that an event will occur. determine if there is an association between two or more variables. predict future values based on known values. determine a cause and effect relationship between two or more values.
Regression is used to describe a trend or predict future values based on known values.
significant difference*.
A difference that is statistically significant is called a significant difference*.
negative correlation*
A negative correlation* indicates that two variables' values move in the opposite direction to one another. When one variable increases, the other decreases.
3. How does blood sugar depend on a diet?
explanatory. "Diet" is the explanatory variable.
p-value*.
Regardless of the test used, the result of the test is a number that is known as a p-value*.
Which of the following significance levels carries the lowest risk of accidentally drawing a conclusion when the result was actually caused by chance? 0.08 0.06 0.04 0.02
The answer is d. A significance level of 0.02 indicates that you only declare significance if there is less than a 2% chance that the result is caused by chance.
Using the scatterplot Body weight vs. Systolic blood pressure, what can we expect the body weight to be for an individual with a systolic blood pressure of 195 ? Round your answer to the nearest whole number. Approx. 262 Approx. 275 Approx. 287 Approx. 295
The answer is c. The regression equation for this scatterplot is y = 0.4546x + 64.313. Substituting 195 for y gives 195 = 0.4546x + 64.313, so x = (195 − 64.313) / 0.4546 ≈ 287.48. Therefore, the answer is around 287.
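Problems like this one rearrange the slope-intercept equation y = mx + b to solve for x, which gives x = (y − b) / m. A minimal Python sketch of that rearrangement follows, using the slope and intercept quoted in the answer above. The function name `solve_for_x` is an illustrative choice, not something from the course.

```python
def solve_for_x(y, slope, intercept):
    """Rearrange y = slope * x + intercept to get x = (y - intercept) / slope."""
    if slope == 0:
        raise ValueError("a horizontal line cannot be solved for x")
    return (y - intercept) / slope


# Body weight (x) for a systolic blood pressure (y) of 195,
# using y = 0.4546x + 64.313 from the answer above.
weight = solve_for_x(195, slope=0.4546, intercept=64.313)
print(round(weight))  # rounds to 287
```

The same function applies to the other "solve for x" questions in this set; only the slope, intercept, and y value change.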
Once the regression line is established,
it can be used as a model to predict future outcomes. For example, when modeling a product's sales based on its price, the regression line can be used to determine future sales and to set a price for a product that will make the company the most profit.
Estimate the correlation coefficient for the scatterplot in question #10. −0.87 0.65 0.03 0.87
The answer is c. A weak positive correlation will yield a correlation coefficient very close to 0.
Estimate the correlation coefficient for the scatterplot in question #8. −0.87 −0.1 0.5 0.87
The answer is d. Strong correlations have correlation coefficients close to 1 or −1. Because this is a strong positive correlation, the correlation coefficient is positive and close to 1, so it can be estimated to be around 0.87.
A strong linear relationship indicates that the data will bunch around a straight line, while a weak linear relationship does not. Although it will rarely follow a straight line exactly, if the points on a scatterplot do fall precisely along a straight line, we call it a perfect linear relationship.
that the data will bunch around a straight line, while a weak linear relationship does not. Although it will rarely follow a straight line exactly,
Anytown Hospital is looking at the relationship between the daily high temperature in Anytown and the number of patients they admit for heat stroke. The following graph shows this relationship: How would you use this scatterplot to predict the number of patients that would be admitted to Anytown Hospital for heat stroke on a 35 -degree day? a) Using the regression equation, you would plug in the temperature ( 35 ) for x . b) It would not make sense to use this scatterplot, as that would be inappropriate extrapolation beyond the range of study. c) Take the number of patients admitted on a 70 -degree day, and divide this total by 2 . d) Using the regression equation, you would plug in the temperature ( 35 ) for y
It would not make sense to use this scatterplot to predict the number of patients that would be admitted to Anytown Hospital for heat stroke on a 35-degree day, as that would be inappropriate extrapolation beyond the range of study.
Which of the following correlation coefficients describes a strong, positive correlation? a) −0.9 b) −0.2 c) 0.2 d) 0.9
0.9 Correct. The correct answer is d. A strong, positive correlation is represented by a correlation coefficient that is close to 1 , such as 0.9 .
5. How does mortality rate depend on Body Mass Index (BMI)?
explanatory. "BMI" is the explanatory variable.
positive correlation*
A positive correlation* indicates that two variables' values move in the same direction as one another. If two variables have a positive correlation, when one variable increases, the other also increases. If one variable decreases, the other variable also decreases.
Anytown Hospital is looking at the relationship between the daily high temperature in Anytown and the number of patients they admit for heat stroke. The following graph shows this relationship: Based on the scatterplot, roughly how many patients would you expect to be admitted to Anytown Hospital for heat stroke on a 95 -degree day? a) 20 b) 30 c) 50 d) 70
Based on the scatterplot, roughly 50 patients would be expected to be admitted to Anytown Hospital for heat stroke on a 95 -degree day.
Least squares estimation is a technique for predicting future data values. True or False?
False Correct. This statement is false. Least squares estimation is a technique used to estimate the best-fit-line in linear regression.
A linear regression equation takes the following form: y = mx^2 + b. True or False?
False Correct. This statement is false. This is not the form that a linear regression equation takes. Linear regression is always of degree 1, so the exponent of 2 associated with the x makes this a non-linear equation.
For example, suppose we have the following data points for children's clothing sizes and ages (in years). The data and the linear regression line are illustrated below.
If we make a prediction about any of the clothing sizes between 1 and 5, the process we are using is linear interpolation, a prediction between known data points. If our prediction is for any of the clothing sizes bigger than 5 or smaller than 1, the process we are using is linear extrapolation, a prediction outside of the range of known data points. There are no standard rules for how to handle extrapolation values. The important thing to know is that the further your extrapolation is from the data values you have, the less reliable your prediction is.
Example of Regression Analysis
For example, a patient's weight is associated with blood pressure, cholesterol, and blood sugar. These variables can be modeled with a regression analysis. Once their association has been established and quantified, regression analysis can be used to predict future data values. So, we can quantify all of the health risks a patient may have based on a risk factor: weight.
Useful Measurements
For hypothesis testing to be carried out successfully, there are some important measurements that should be obtained during the data collection process, including the sample size*, mean*, standard deviation*, range*, and degrees of freedom*. Depending on the situation and statistical test being used, you may want (or need) to have measured values for some combination of these statistics.
2. How does high blood pressure depend on physical inactivity?
response. "High blood pressure" is the response variable.
Not a Representative Sample
If you're making an inference about a population*, your sample should represent the entire population by being proportionally distributed through each demographic that might give different responses. You should be able to spot when such a problem exists in a statistical study as such statistical studies are flawed. When you have a sample that is proportionally distributed through each demographic or characteristic of the population, it is called a "representative sample." If the experiment is applied to a good representative sample, it is still possible that the sample will not respond evenly. This situation can occur when a) data is missing, or b) some segments of the population are not included in the study. Finally, to ensure a valid sample accurately represents the population, the sample size must be large enough to contain the variation that exists in the target population while allowing for some degree of random missing data. In general, the risk of non-representative samples decreases as the sample size increases.
Determining Significance
In statistics, it is not enough to say that two means are "a little different" or "very different" from one another or that they are "close" to one another. Rather, we want to determine if a difference is statistically significant*. A statistically significant result is unlikely to be caused by random variation or errors. A difference that is statistically significant is called a significant difference*. A hypothesis test* will tell us whether the results are statistically significant or not.
Measuring Correlation
One way of measuring the relationship between two variables is by using a correlation. A correlation quantitatively describes whether two variables have a linear relationship with one another. Keep in mind that a correlation measures how two variables are associated with one another — a correlation does not establish causation in any way (you still need a full experiment with control and treatment groups to establish causation).
A strong correlation always explains a cause-and-effect relationship between two variables. True or False?
This statement is false. A correlation does not prove that one variable causes another. It is possible that both of the variables in question are affected by some other factor, or that one variable is a subset of the other.
From the scatterplot below, if the trend line would be extended indefinitely, it would correspond with a patient's systolic blood pressure in excess of 250 mmHg. What pitfall in regression analysis is evident in this chart? Inappropriate Extrapolation Association is Not Causation Not a Representative Sample Small Sample Size
The answer is b. This analysis is obviously missing a lurking variable which, in this case, is obesity. It is nonsense to try to estimate a patient's blood pressure based on testosterone level. Therefore, the association is not a causation.
graphing linear regression on a coordinate plane
We are graphing these linear regressions on a coordinate plane*. Therefore, simple linear regression is usually represented by an equation in slope-intercept form*: y = mx + b, where m is the slope and b is the y-intercept.
What does a weak correlation look like on a scatterplot? a) The points follow closely along a line that moves down and to the right. b) The points do not follow very closely to a line. c) The points follow closely along a line that moves up and to the right. d) All of the above.
The points do not follow very closely to a line. Correct. The correct answer is b. In a weak correlation, the points do not follow very closely to a line.
Correlation Coefficient
The strength of a linear relationship between two variables can be measured by the correlation coefficient*.
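These notes do not give a formula for the correlation coefficient. A common choice is Pearson's r, and the sketch below assumes that is the coefficient meant here. The function name and all data values are made up for illustration.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Pearson's r: how x and y vary together, scaled by how much each varies."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable never changes")
    return sxy / sqrt(sxx * syy)


# Hypothetical data, chosen only to illustrate three cases.
rising = correlation_coefficient([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])   # perfect rising line
falling = correlation_coefficient([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])  # perfect falling line
curved = correlation_coefficient([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # symmetric parabola
print(rising, falling, curved)
```

The two perfect lines give coefficients of 1 and −1. The parabola gives 0 even though y depends entirely on x, which is consistent with the idea that correlation measures linear relationships only.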
Which of the following scatterplots is most likely to have a correlation coefficient of 0.4?
This scatterplot shows a positive, but not very strong, correlation, that might have a correlation coefficient of 0.4 .
In addition to measuring linear relationships, correlation can be used to measure curvilinear and parabolic relationships. True or False?
This statement is false. Correlation measures if two variables have a linear relationship with one another.
A scatterplot shows the potential correlations and relationships between two variables. True or False?
This statement is true. A scatterplot shows the potential correlation or relationship between the explanatory and response variable.
An Example of Regression Analysis
To predict sales figures based on the amount of money spent on regional advertising, we collect the following data comparing advertising dollars spent with projected sales. We can plot these points on a coordinate plane and form a scatterplot. As usual, we place the explanatory variable (advertising dollars spent) on the horizontal x -axis. On the vertical y -axis, we place the variable we are testing, the response variable (in this case projected sales). Remember, the response variable responds to the changes of the explanatory variable.
A linear regression "best-fit-line" can be estimated using least squares. True or False?
True Correct. This statement is true. Least squares estimation is the most common technique used to estimate the best-fit-line in linear regression.
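The notes name least squares estimation but do not show the calculation. The sketch below assumes the standard closed-form formulas for a single explanatory variable: slope = Sxy / Sxx and intercept = mean(y) − slope × mean(x). The function name and the data points are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing squared vertical errors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data points, not taken from the course.
slope, intercept = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

For these five points the fitted line is approximately y = 0.6x + 2.2, which can then be used as the regression equation in slope-intercept form.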
Linear interpolation is a technique used to make a prediction that falls between known data points. True or False?
True Correct. This statement is true. Linear interpolation is a technique used to make a prediction that falls between known data points, using the linear regression equation.
Correlation in a Scatterplot
We can estimate a correlation by analyzing a scatterplot. Look for patterns in the data: data that rises from left to right suggests a positive correlation, while data that falls as you move left to right across the chart indicates a negative correlation. Evaluate how tightly the points fit together to determine whether you have a weak correlation or a strong one.
Which of the following best describes the correlation in the scatterplot below? a) Strong positive correlation b) Strong negative correlation c) Weak positive correlation d) Weak negative correlation
Weak negative correlation. The answer is d. This scatterplot reveals a weak negative correlation.
Using the scatterplot Total Cholesterol vs. BMI, what would the BMI be for an individual with a total cholesterol of 300 ? Round your answer to the nearest whole number. Approx. 36 Approx. 39 Approx. 41 Approx. 43
The answer is d. The regression equation for this scatterplot is y = 6.4957x + 18.514. Substituting 300 for y gives 300 = 6.4957x + 18.514, so x = (300 − 18.514) / 6.4957 ≈ 43.33. Therefore, the answer is approximately 43.
Estimate the correlation coefficient for the following scatterplot: a) −0.9 b) −0.2 c) 0.2 d) 0.9
d) 0.9 Correct. The correct answer is d. A good estimate for this scatterplot's correlation coefficient would be 0.9 , as it appears to have a strong, positive correlation.
Estimate the correlation coefficient for the scatterplot in question #5. −0.92 −0.1 0.5 0.92
The answer is a. Strong correlations have correlation coefficients close to 1 or −1, while weak correlations have coefficients close to 0. Because this is a strong negative correlation, the correlation coefficient is negative and close to −1, so it can be estimated to be around −0.92.
Correlation coefficient is designated by which signs? Superscripts and subscripts Positive and negative signs Up and down arrows Decimal points
The answer is b. The sign of the coefficient, which is either positive or negative, designates the relationship between two variables.
Extrapolation is always inappropriate. True or False?
This statement is false. There are applications of extrapolation, and times in which it is necessary. Be mindful of the situation and try to avoid inappropriate extrapolation by considering the context.
statistically significant*
A statistically significant result is unlikely to be caused by random variation or errors.
5. Which of the following best describes the correlation in the scatterplot below? Strong positive correlation Strong negative correlation Weak positive correlation Weak negative correlation
The answer is b. This scatterplot reveals a strong negative correlation.
Which of the following best describes the correlation in the scatterplot below? Strong positive correlation Strong negative correlation Weak positive correlation No correlation
The answer is c. The scatterplot reveals a weak positive correlation between age and body weight.
A hypothesis test*
A hypothesis test* will tell us whether the results are statistically significant or not.
What best describes a scatterplot that has a correlation coefficient of 0 ? a) The points closely follow a line that is moving down and to the right. b) The points do not seem to follow any discernible linear pattern. c) The points closely follow a line that is moving up and to the right. d) The points loosely follow a line that is moving down and to the right.
A correlation coefficient of 0 represents the weakest correlation. In fact, it indicates that there is no linear correlation. Therefore, it would be perfectly reasonable that the points do not seem to follow any discernible linear pattern.
For adults aged 20 and older, running speed and age have a negative correlation. Which of the following best describes the relationship between these two measures? a) Older people run slower. b) Younger people run faster. c) As age increases, the speed at which someone runs decreases. d) All of the above.
All of the above. Correct. The correct answer is d. All of the above. All of these statements are true because there is a negative correlation between these two variables.
Association is Not Causation
As we've seen earlier, an association between two variables does not necessarily indicate causation. If there is a lurking variable that is not considered, we could draw an incorrect conclusion. As a reminder, here is a graph of a sample of patients' bone density and blood pressure: Performing a regression analysis, we would find that as bone density increases, blood pressure decreases. This conclusion misses the lurking variable: a patient's age. It would be nonsense to try to estimate a patient's blood pressure based on their bone density without knowing the actual risk factor involved, age. Also, keep in mind that we always need to know if the data we're looking at is just observed data or data from an experiment. If there is no experiment (that is, no control group and no treatment), we cannot conclude that there is any causation. If there is no control group but we do notice a correlation between two variables, we can at least say there is an association and may even recommend a full experiment to determine if causation is involved.
How can you apply these concepts and draw conclusions from studies?
By examining the p-value that results from a statistical hypothesis test. The p-value is, loosely speaking, the probability that a result was caused by chance (more precisely, the probability of seeing a result at least this extreme if only chance were at work). It will always fall between 0 and 1. So, if a hypothesis test results in a p-value of 0.8, there is, in these loose terms, an 80% probability that the result was caused by chance. This value is very high and indicates that the result is not significant: it was most likely caused by random variation or "chance."
Hypothesis Testing
Data is significant, statistically speaking, if we can conclude that the results have an extremely low probability of happening by chance. The process of determining whether you can be confident that a specific result did not happen by chance (i.e., whether that result is statistically significant) is known as hypothesis testing. Hypothesis testing allows you to make a determination (based on the data and level of confidence you require) as to whether an observation happened randomly or if it is a non-random, meaningful occurrence. Some common tools used in hypothesis testing include z -tests, t -tests, chi-squared tests, and analysis of variance (ANOVA). While we will not be performing these statistical tests themselves, it is good to remember their names. Regardless of the test used, the result of the test is a number that is known as a p-value*.
Example Estimation using Interpolation
For the children's clothing size and age example, we'll use the linear regression equation to make an estimation using interpolation. The linear regression equation for the data presented in the previous scatterplot of a child's age in years versus clothing size is y = 1x + 0, where the x-variable represents clothing size and the y-variable represents age in years. We can use this linear regression equation to predict a child's age (in years), given a value for their clothing size. Let's say we want to predict the age of a child whose clothing size is 3.5. We substitute 3.5 for x in the regression equation and calculate y: y = 1(3.5) + 0 = 3.5, so the predicted age is 3.5 years.
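The substitution can also be sketched in a few lines of Python. This is a minimal illustration, assuming the regression equation y = 1x + 0 and the known clothing-size range of 1 to 5 stated in these notes; the function names are hypothetical.

```python
def predict_age(size, slope=1.0, intercept=0.0):
    """Predicted age in years from the regression equation y = slope * x + intercept."""
    return slope * size + intercept


def prediction_type(size, min_size=1, max_size=5):
    """Label a prediction by whether x falls inside the range of known data points."""
    if min_size <= size <= max_size:
        return "interpolation"
    return "extrapolation"


for size in (3.5, 8):
    print(size, predict_age(size), prediction_type(size))
```

A clothing size of 3.5 falls inside the known range, so that prediction is an interpolation; a size of 8 falls outside it, so that prediction is an extrapolation and is less reliable.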
Regression analysis can help determine the association between risk factors (explanatory variables) and diseases or disorders (response variables). In each case, determine whether the variable in bold is the explanatory or response variable by typing "explanatory" or "response" into the textbox. 1. How does the rate of lung disease depend on tobacco use?
response. "The rate of lung disease" is the response variable.
Nursing Connections: How is Regression Analysis Used in Nursing? An Example
Regression analysis can be used in a variety of settings in health care. Predicting future health concerns of a patient, future staffing needs, or future budgets are just some of the topics that may be reviewed. Regression analysis uses multiple variables that are known to be associated to predict future needs and concerns. Is there a nursing shortage? Regression analysis is used to evaluate supply and demand data for the health care industry and nursing occupations to identify existing and potential nursing workforce shortages or surpluses. Variables such as the number of currently licensed nurses, overall population growth, the number of nurses planning to retire in the next 10 years, and enrollment in nursing education programs are examined to predict future workforce needs within the nursing profession.
Relationships between variables are very easy to spot on a scatterplot.
If a relationship does exist, the data points on the diagram line up along a curve or straight line across the chart. If the data does not fall along a curve or line, it is likely that no relationship exists between the variables.
To draw a conclusion, you must compare this p -value to your significance level:
If the p -value is less than the significance level, the result is significant. The probability that the result is caused by chance is low enough for us to accept. If the p -value is greater than the significance level, the result is not significant. The probability that the result is caused by chance is too high for us to accept.
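The comparison rule above is simple enough to express directly in code. A minimal sketch follows; the function name is hypothetical, the p-value of 0.8 and the significance level of 0.02 are values that appear elsewhere in these notes, and the p-value of 0.01 is made up for contrast.

```python
def is_significant(p_value, significance_level):
    """Return True when the p-value is below the chosen significance level."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value always falls between 0 and 1")
    return p_value < significance_level


print(is_significant(0.8, 0.02))   # False: not significant
print(is_significant(0.01, 0.02))  # True: significant
```

Choosing a smaller significance level makes the test stricter, since fewer p-values fall below it.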
Segments of Population Not Included
In general, missing data is common in research studies. It is important to understand the source(s) of missing data and whether the missing data are random or non-random. If the missing data are random across the sample, it will not harm the validity of the data except to reduce the sample size*. If, however, missing data are non-random (for example, the study did not capture anyone between the ages of 20 and 40 , or did not include some specific portion of the patient population), then this issue negatively impacts the extent to which the sample responses replicate the entire population. While it is hard to identify if missing data skews the results of a study, one possible approach is to obtain some basic statistics for the entire population (geographic mix, demographics, household characteristics, etc.) and compare those same statistics for the sample being studied. If the distributions for the population and sample are similar for these statistics, then there is a good chance that the sample replicates the population well.
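One way to picture the population-versus-sample comparison described above is to line up the share of each group in both and look at the gaps. The sketch below is only an illustration: the age groups, the shares, and the function name are all hypothetical, and the notes do not prescribe a specific cutoff for how large a gap is acceptable.

```python
def proportion_gaps(population_share, sample_share):
    """Absolute gap between population and sample share for each group."""
    return {
        group: abs(share - sample_share.get(group, 0.0))
        for group, share in population_share.items()
    }


# Hypothetical shares; the sample captured nobody aged 20 to 40.
population = {"under 20": 0.25, "20-40": 0.35, "over 40": 0.40}
sample = {"under 20": 0.40, "20-40": 0.00, "over 40": 0.60}

gaps = proportion_gaps(population, sample)
largest_gap_group = max(gaps, key=gaps.get)
print(largest_gap_group, round(gaps[largest_gap_group], 2))
```

Here the largest gap is in the 20 to 40 group, which flags the kind of non-random missing data the passage warns about.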
regression line (line of best fit)
In this example, there is not an absolute linear correlation between sales and advertising. However, there is an association between the two variables, and it appears to be linear in nature. To represent this correlation, we can impose a line through the middle of the points on the scatterplot. This line is called the "line of best fit" or regression line* for the data. This regression line can be represented by a linear regression equation of the form y = mx + b.
Potential Problems in Regression Analysis
Inappropriate Extrapolation; Association is Not Causation; Not a Representative Sample; Missing Data; Segments of Population Not Included; Small Sample Size
Small Sample Size
It is important when evaluating an experiment to consider the size of a sample. In general, to ensure a valid sample accurately represents the population, the sample size must be large enough to contain the variation that exists in the target population while allowing for some degree of random missing data. In general, the risk of non-representative samples decreases as the sample size increases. For example, imagine you are performing a study in a town that has concerns about water contamination. You are studying whether residents have an increased incidence of various types of cancer. Would it be wise to draw a conclusion from two patients? With a larger sample size and more patients in the study, we would be able to make a more informed decision about the health effects experienced by residents of the town. A greater sample size improves the reliability of the least squares regression.
Missing Data
Missing data is a grave error and could cause the conclusions drawn from your experiment to be called into question. For example, if you are trying to determine if lowering BMI lowers total cholesterol for all patients, including only overweight people would flaw the study since you won't have a representative sample of all patients.
You can check this calculation visually by using the graph. First, find the place on the x -axis where the clothing size equals 3.5 and slide your finger up to where it meets the regression line. Then slide your finger across horizontally until you reach the y -axis. You should get a reading of 3.5 years. This problem is an example of linear interpolation, because 3.5 is inside of the range of the known data points. The image below illustrates this point.
Which of the following is true about simple linear regression? a) It can be used for any correlation, regardless of the degree of the relationship. b) Once the regression equation is found, any number of x can be plugged in to determine the expected y-value. c) It can be used to determine the relationship between two explanatory variables and one response variable. d) It is usually written in the form y=mx+b
Simple linear regression is usually written in the form y=mx+b .
Simple linear regression*
Simple linear regression* is the prediction of one response variable's value from one explanatory variable's value. In a simple linear regression, we model the relationship between the variables as perfectly linear. So, the closer the data's correlation is to being perfectly linear, the more accurate the linear regression model will be.
Which of the following statements is most appropriate with regards to representative samples? a. The risk of non-representative sample decreases as sample size increases. b. The risk of non-representative sample decreases as sample size decreases. c. The risk of non-representative sample increases as sample size increases. d. The sample size has no bearing on whether or not the sample size is representative.
The answer is a. In general, the risk of non-representative sample decreases as sample size increases.
If a patient's risk of cancer is associated with genetics, diet, and tobacco use, which of these is the response variable? Risk of cancer Genetics Diet Tobacco Use
The answer is a. In this example, the risk of cancer is the response variable because it "responds" to genetics, diet, and tobacco use.
From this scatterplot, if the trend line would be extended indefinitely, it would correspond with a patient's systolic blood pressure levels going indefinitely high or indefinitely low. What pitfall in regression analysis is evident from this chart? Inappropriate Extrapolation Association is Not Causation Not a Representative Sample Small Sample Size
The answer is a. It is obvious that systolic blood pressure levels cannot go indefinitely high or indefinitely low, so therefore it would be inappropriate to extrapolate beyond the range of the study. This analysis suffers from inappropriate extrapolation.
Using the scatterplot Age vs. Blood Calcium Level, how old can we expect a person to be if they have a blood calcium level of 10.2 ? Round your answer to the nearest whole number. Approx. 17 Approx. 25 Approx. 32 Approx. 40
The answer is a. The regression equation for this scatterplot is y = −0.0105x + 10.381. Substituting 10.2 for y gives 10.2 = −0.0105x + 10.381, so x = (10.2 − 10.381) / (−0.0105) ≈ 17.24. Therefore, the answer is around 17.
Consider the following equation y=7x+15 . What is the slope of the line? a. 7 b. −7 c. 715 d. −17
The answer is a. The slope intercept equation in simple linear regression is y=mx+b. In this equation, m represents the slope of the line. Therefore, the slope of the line with equation y=7x+15 is 7.
Consider the following equation y=6−8x . What is the y -intercept for this equation? A. 6 B. −6 C. 8 D. −8
The answer is a. The slope intercept equation in simple linear regression is y=mx+b. The equation y=6−8x can be rewritten as y=−8x+6. In this equation, b represents the y-intercept of the line. Therefore, the y-intercept of the line with equation y=6−8x, is 6.
Suppose there is a linear regression equation y = 180 − 0.01x where x is equal to dietary calories eliminated per day and y is equal to blood cholesterol levels measured in mg/dL. Which of the following is the correct interpretation of this slope? A. For every dietary calorie eliminated per day there is a corresponding 0.01 point decrease in blood cholesterol levels. B. For every 0.01 point reduction in blood cholesterol level there is a 180 increase in eliminated dietary calories. C. For every 0.01 dietary calories consumed, there is a corresponding 1 point increase in blood cholesterol levels. D. For every 180 dietary calories consumed, there is a corresponding 0.01 point gain in blood cholesterol levels.
The answer is a. When solving y=180−0.01x algebraically, when x=0, y=180, when x=1, y=179.99. Therefore, for every dietary calorie eliminated per day there is a corresponding 0.01 point decrease in blood cholesterol levels.
Using the scatterplot Total Cholesterol vs. BMI, what would the total cholesterol level be for an individual with a BMI of 18 ? Round your answer to the nearest whole number. Approx. 120 Approx. 135 Approx. 145 Approx. 155
The answer is b. Solving algebraically, the linear regression equation is y=6.4957x+18.514 . Substituting the BMI for x gives y=6.4957(18)+18.514, so y=135.4366. Therefore, an individual with a BMI of 18 will have a cholesterol level around 135.4366, or approximately 135.
What factor is most important to obtain a correct conclusion when performing regression analysis? Small sample size Large sample size Random missing data Variation in population
The answer is b. The greater the sample size, the more likely you are to come to a correct conclusion.
Consider the following equation y=9/4x+6.3 . What is the slope of a line with this equation? A. 4/9 B. 9/4 C. 4.05 D. 6.3
The answer is b. The slope intercept equation in simple linear regression is y=mx+b. In this equation, m represents the slope of the line. Therefore, the slope of the line with equation y=9/4x+6.3 is 9/4.
Consider the following equation, y+6.7x=−2/5 . What is the y -intercept of a line with this equation? A.2/5 B. −2/5 C. 6.7 D. −6.7
The answer is b. The slope intercept equation in simple linear regression is y=mx+b. The equation y+6.7x=−2/5 can be rewritten as y=−6.7x−2/5. In this equation, b represents the y-intercept of the line. Therefore, the y-intercept of the line with equation y=−6.7x−2/5 is −2/5.
Suppose there is a linear equation y=0.5x+39 where x is measured in kilocalories consumed and y is measured in pounds of weight. Which of the following is the correct interpretation of this slope? A. For every additional kilocalorie consumed there is a corresponding 39 pound weight gain. B. For every additional kilocalorie there is an additional 0.5 pound weight gain C. For every 1 pound of weight gained there is an additional 39 kilocalories that were consumed D. For every 39 additional kilocalories consumed, there is an 0.5 pound weight gain.
The answer is b. Evaluating y=0.5x+39, when x=1, y=39.5; when x=2, y=40; and so on. Therefore, for every additional kilocalorie there is an additional 0.5 pound weight gain.
Which of the following represents the explanatory variable? Type of Tobacco Used Cancer Mortality Average Cigarettes Smoked per Day Oxygen Saturation %
The answer is c. In this chart, Average Cigarettes Smoked per Day is the explanatory variable.
Using the scatterplot Total Cholesterol vs. BMI, what would the BMI be for an individual with a total cholesterol of 220 ? Round your answer to the nearest whole number. Approx. 21 Approx. 26 Approx. 31 Approx. 36
The answer is c. Looking at the scatterplot, we can see an x-value of about 31 corresponds to a y-value of about 220.
Which of the following improves a study's reliability as it increases? Correlation coefficient Regression equation slope Sample size Simpson's Paradox
The answer is c. Small study populations can impact the reliability of regression analysis. Nurses need to be aware of the study size when attempting to perform a regression analysis or interpret a study based on small study size.
Using the scatterplot Total Cholesterol vs. BMI, what would the total cholesterol level be for an individual with a BMI of 29 ? Round your answer to the nearest whole number. Approx. 169 Approx. 182 Approx. 207 Approx. 221
The answer is c. Solving algebraically, the linear regression equation is y=6.4957x+18.514. Substituting the BMI for x gives y=6.4957(29)+18.514, so y=206.8893. Therefore, an individual with a BMI of 29 will have a cholesterol level around 207.
Using the scatterplot Age vs. Blood Calcium Level, what can we expect to be the blood calcium level of someone 75 years of age? Round your answer to the nearest tenth. Approx. 9.2 Approx. 9.4 Approx. 9.6 Approx. 9.8
The answer is c. Solving algebraically, the linear regression equation is y=−0.0105x+10.381. Substituting 75 for x gives y=−0.0105(75)+10.381, so y=9.5935. Therefore, the answer is around 9.6.
This analysis suggests that as body weight increases, blood pressure decreases. What problem in regression is evident in this analysis? Inappropriate Extrapolation Not a Representative Sample Association is Not Causation Small Sample Size
The answer is c. This analysis is missing a lurking variable, which in this case is age. It does not make sense to estimate a patient's blood pressure from body weight alone. Therefore, the association is not causation.
Consider the following equation y+6.7x=−2/5 . What is the slope of a line with this equation? A.2/5 B. −2/5 C. 6.7 D. −6.7
The answer is d. The slope intercept equation in simple linear regression is y=mx+b . The equation y+6.7x=−2/5 can be rewritten as y=−6.7x−2/5. In this equation, m represents the slope of the line. Therefore, the slope of the line with equation y=−6.7x−2/5 is −6.7.
Consider the following equation, y=9/4x+6.3 . What is the y -intercept for this equation? A. 4/9 B. 9/4 C. 4.05 D. 6.3
The answer is d. The slope intercept equation in simple linear regression is y=mx+b. In this equation, b represents the y-intercept. Therefore, the y-intercept of a line with equation y=9/4x+6.3 is 6.3.
Consider the following equation y=6−8x . What is the slope of this line? A. 6 B. −6 C. 8 D. −8
The answer is d. The slope intercept equation in simple linear regression is y=mx+b. The equation y=6−8x can be rewritten as y=−8x+6. In this equation, m represents the slope of the line. Therefore, the slope of the line with equation y=6−8x, is −8.
Consider the following equation, y=7x+15 . What is the y -intercept for this equation? A. 1 B. 715 C. 7 D. 15
The answer is d. The slope intercept equation in simple linear regression is y=mx+b . In this equation, b represents the y-intercept. Therefore, the y-intercept of a line with equation y=7x+15 is 15.
Which of the following situations prevents an accurate regression analysis from being performed? a) The relationship between the explanatory and response variable is linear. b) The data you are looking for falls outside of the range of the study. c) The sample size in the study is too large. d) The association between the two variables is causal.
The data you are looking for falls outside of the range of the study.
regression line
The regression line is the line that best minimizes the distance between each data point and the line itself (hence the term "line of best fit"). On this graph, the red dotted line is the regression line, and the blue points are our actual data points. The linear regression equation of this line is y=10.33x−2330 . The slope, m , of the regression line in this case is 10.33 , which tells us valuable information: on average, for every dollar spent on ads, the projected sales go up by $10.33 . Another way of thinking of this is that the slope tells you how much the response variable will respond to a 1 -unit increase in the explanatory variable. So, for example, if we spend $1500 on ads, we can expect projected sales to be, on average, about y=10.33(1500)−2330=15,495−2330=$13,165 . If we spent just one dollar more, or $1501 , projected sales would be, on average, about y=10.33(1501)−2330=$13,175.33 . Notice that the projected sales went up by exactly $10.33 when the amount spent on ads increased by one dollar.
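The arithmetic in this card can be checked with a short Python sketch. The function name predict and the constant names are illustrative assumptions; the slope and intercept are the ones given for the ad-spend regression line.

```python
def predict(x, slope, intercept):
    """Evaluate the regression line y = slope * x + intercept."""
    return slope * x + intercept


SLOPE, INTERCEPT = 10.33, -2330  # from y = 10.33x - 2330

sales_1500 = predict(1500, SLOPE, INTERCEPT)
sales_1501 = predict(1501, SLOPE, INTERCEPT)
increase = sales_1501 - sales_1500

print(round(sales_1500, 2))  # projected sales at $1500 of ad spend
print(round(sales_1501, 2))  # projected sales at $1501 of ad spend
print(round(increase, 2))    # change for one extra dollar, equal to the slope
```

Whatever starting value of x is chosen, the difference between predictions one unit apart equals the slope, which is why the slope is read as the change in the response variable per 1-unit increase in the explanatory variable.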
Inappropriate Extrapolation
We have seen how to make predictions based on the results of regression analysis. Regression analysis can be an excellent tool to predict future or unknown results, but sometimes it doesn't make sense to extrapolate* beyond the range of the study. For example, examine the graph below: The chart displays the height of 30 males aged 4 to 14 . The dotted line represents the linear regression line of best fit with equation ( y=2.278x+31.87 ). The equation and line are both correct and were found using the least squares estimate. So, we can use regression analysis to estimate the height of a male between the ages of four and 14 . It would be inappropriate, however, to extrapolate the data to ages outside of this age range. Using the regression equation, we would find that males aged 18 would have an average height of 73 inches ( 6'1" ), males aged 30 would have an average height of 100 inches ( 8'4" ), and males aged 50 would have an average height of 146 inches ( 12'2" ). This analysis is obviously incorrect. The error lies in the fact that you extrapolated outside of the original range of study. The growth experienced by young males does not continue into adulthood. Therefore, you cannot draw any conclusions about the height of a 50 -year-old male from the growth of males aged 4 through 14 .
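To see how quickly the height predictions become unreasonable, the following Python sketch evaluates the regression equation and labels each prediction as interpolation or extrapolation. The function name predict_height and the age_range parameter are hypothetical; the equation y=2.278x+31.87 and the studied range of ages 4 to 14 come from the passage above.

```python
def predict_height(age, slope=2.278, intercept=31.87, age_range=(4, 14)):
    """Return (predicted height in inches, kind of prediction).

    The prediction counts as interpolation when the age lies inside the
    studied range and as extrapolation otherwise.
    """
    height = slope * age + intercept
    low, high = age_range
    kind = "interpolation" if low <= age <= high else "extrapolation"
    return height, kind


for age in (10, 18, 30, 50):
    height, kind = predict_height(age)
    print(age, round(height), kind)
```

The equation itself never signals a problem: it returns a number for any age. Only the comparison against the studied range shows that the values for ages 18, 30, and 50 should not be trusted.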
Least Squares Estimation
When analyzing data, we are often not supplied with the regression line. To determine the equation that best fits the data in a graph, there are many different estimation techniques one can use. The most commonly used estimation technique for linear regression is known as least squares* estimation, or ordinary least squares (OLS). In fact, ordinary least squares is such a standard technique that it is often referred to as ordinary linear regression. Most spreadsheet programs, such as Excel, can perform a least squares estimation calculation.
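For readers curious about what a spreadsheet does behind the scenes, here is a minimal Python sketch of ordinary least squares for one explanatory variable, using the standard closed-form formulas (slope = sum of cross-deviations divided by sum of squared x-deviations; intercept = mean of y minus slope times mean of x). The function name least_squares and the sample data are made up for illustration and do not come from the course.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = least_squares(xs, ys)
print(slope, intercept)
```

With real data the points will not lie exactly on a line, and the same formulas return the line that minimizes the sum of squared vertical distances between the points and the line.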
Regression analysis*
When analyzing the relationship between two or more variables, regression analysis is a helpful tool. Regression analysis* is used when multiple variables' quantities relate to each other. Regression analysis is an extension of scatterplots, associations, and correlations. If we determine that an association exists between two or more variables, we can use regression analysis for description and prediction. Regression is used to describe a trend or predict values based on known values.
Significance Levels
When conducting a hypothesis test, or analyzing the results of such a test, you will need to decide on a significance level*, to help you determine whether your results are significant. The significance level specifies how certain you need to be to state that a result is statistically significant. It is used as the p -value cutoff for statistical significance, which means that any p -value below the set significance level is considered statistically significant. For example, if you decide to use a significance level of 0.1 , that means you accept a conclusion only if there is less than a 0.1 (or 10% ) probability that a result was caused by chance. Here, you would declare something significant, knowing that there is less than a 10% chance your conclusion is wrong.
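The cutoff rule described above can be written as a one-line comparison. This Python sketch is only an illustration; the function name is_significant is hypothetical, and it assumes the strict "below the significance level" rule stated in the passage.

```python
def is_significant(p_value, significance_level):
    """Return True when the p-value is below the chosen significance level."""
    if not 0 <= p_value <= 1:
        raise ValueError("p-value must be between 0 and 1")
    if not 0 < significance_level < 1:
        raise ValueError("significance level must be between 0 and 1")
    return p_value < significance_level


print(is_significant(0.02, 0.05))  # p-value below the cutoff
print(is_significant(0.06, 0.05))  # p-value above the cutoff
print(is_significant(0.08, 0.10))  # a looser cutoff admits a larger p-value
```

The same p-value can be significant at one level and not at another, which is why the significance level should be chosen before looking at the results.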
regression equation
The regression equation* is the algebraic equation that models the regression. Normally, the regression can be of any degree* with any number of response variables. It need not be linear*. Here, we will use simple linear regression.
Regardless of the correlation suggested, practitioners should be careful not to assume that
a correlation among data proves that one variable causes another — it is possible that both of the variables in question are affected by some other factor (a lurking variable), or that one variable is a subset of the other. To truly know if one variable causes changes in another (i.e., if causation is present), then you must conduct an experiment with treatment and control groups to determine causation. Care must also be taken to ensure that the diagram is not hastily analyzed, since the shape of the diagram and the implied correlation can be easily misunderstood or manipulated by adjusting the length or the scales of the diagram's axes.
Which of the following best describes the relationship between the two variables? a) As the temperature increases, the number of patients admitted to the hospital for heat stroke also increases. b) As the number of patients admitted to the hospital increases, the temperature also increases. c) There is no relationship between temperature and the number of patients admitted for heat stroke. d) As the temperature increases, the number of patients admitted for heat stroke decreases.
a) As the temperature increases, the number of patients admitted to the hospital for heat stroke also increases. Correct. The correct answer is a. As the temperature increases, the number of patients admitted to the hospital for heat stroke also increases.
From the chart above, Lung Cancer Mortality represents what type of variable? Categorical variable Explanatory variable Descriptive variable Response variable
The answer is d. In this chart, Lung Cancer Mortality represents the response variable.
If a patient's blood sugar, cholesterol, and blood pressure are all dependent on diet, which of these is the explanatory variable? Blood sugar Cholesterol level Blood pressure Diet
The answer is d. In this example, diet is the explanatory variable because all of the other variables respond to it.
2. A result is significant if the p -value is .
The answer is a. If the p-value is less than the significance level, the result is statistically significant.
Which of the following best describes the correlation in the scatterplot below? Strong positive correlation Strong negative correlation Weak positive correlation Weak negative correlation
The answer is a. This scatterplot reveals a strong positive correlation.
Using the scatterplot Body weight vs. Systolic blood pressure, what can we expect the systolic blood pressure to be for an individual who weighs 190 pounds? Round your answer to the nearest whole number. Approx. 130 Approx. 140 Approx. 150 Approx. 160
The answer is c. Solving algebraically, the linear regression equation is y=0.4546x+64.313. Substituting 190 for x gives y=0.4546(190)+64.313, so y=150.687. Of the choices given, this is closest to 150.
For a hypothesis test with a significance level of 0.05 , what p -value would be statistically significant? 0.02 0.06 0.08 0.1
The answer is a. A p-value less than the significance level is considered to be statistically significant. For a test with a significance level of 0.05, a p-value of 0.02 is statistically significant.
If the significance level is set at .10 what p -value would be considered statistically significant? 0.08 0.12 0.14 0.16
The answer is a. With a significance level of 0.1, any p-value below 0.1 is statistically significant. Therefore, a p-value of 0.08 would be considered statistically significant.
If a hypothesis test results in a p -value of 0.2 , what is the probability the result was caused by chance? 10% 20% 60% 80%
The answer is b. A p-value of 0.2 means that there is a 20% probability the result was caused by chance.
A result is significant if it was likely caused by chance. True False
The answer is b. This statement is false. A result is significant if it was unlikely to have been caused by chance.
If the significance level is set at 0.08 what p -value would not be considered statistically significant? 0.02 0.04 0.07 0.10
The answer is d. With a significance level of 0.08, only p-values below 0.08 are statistically significant. Therefore, a p-value of 0.10 would not be considered statistically significant.
For a hypothesis test with a significance level of 0.1 , what p -value would not be statistically significant? 0.1 0.14 0.02 0.08
The answer is b. A p-value greater than the significance level is considered to be not statistically significant. For a test with a significance level of 0.1, a p-value of 0.14 is not statistically significant.
If a hypothesis test yields a result that 85% of the results did not occur by chance, what is the p -value of this study? 0.08 0.15 0.58 0.85
The answer is b. If 85% of the results did not occur by chance, then 15% of the results occurred by chance, and the p-value for this study is 0.15.
If a hypothesis test yields a result that 8% of the results occurred by chance, what is the p -value of this study? 0.02 0.2 0.08 0.8
The answer is c. If 8% of the results occurred by chance, the p-value for the study is 0.08.
One of the primary uses of linear regression equations is to make
estimations or predictions from a given set of data. If the prediction is between known data points, it is called linear interpolation*. If the prediction is outside of known data points, it is called linear extrapolation*.
Using the scatterplot Age vs. Blood Calcium Level, what can we expect to be the blood calcium level of someone 56 years of age? Round your answer to the nearest tenth. Approx. 9.5 Approx. 9.8 Approx. 10.1 Approx. 10.3
The answer is b. Solving algebraically, the linear regression equation is y=−0.0105x+10.381. Substituting 56 for x gives y=−0.0105(56)+10.381, so y=9.793. Therefore, the answer is around 9.8.
From the scatterplot below, if the trend line would be extended indefinitely, it would correspond with a patient's systolic blood pressure in excess of 250 mmHg. What pitfall in regression analysis is evident in this chart? Inappropriate Extrapolation Association is Not Causation Not a Representative Sample Small Sample Size
The answer is a. Blood pressure cannot go indefinitely high, so it would be inappropriate to extrapolate beyond the range of the study. This analysis suffers from inappropriate extrapolation.
Through what method can you identify if missing data is skewing the results of the study? a) Obtain basic statistics for the entire population and compare those with the sample being studied. b) Perform regression analysis for every subgroup of the sample being studied. c) Obtain a new sample of study subjects and compare the results with the original sample. d) Reduce the sample size to ensure less variation across the study population.
The answer is a. While it is difficult to identify if missing data is skewing the results of the study, one approach is to obtain basic statistics for the entire population and compare those with the sample being studied.
The more careful you want to be, the smaller your significance level should be. A significance level of 0.01 indicates that you only declare significance if there is less than a 1% chance that the result is caused by chance: less than a 1% chance that you are wrong.
Which of the following qualities of a sample help ensure the accuracy of any analysis, including a regression analysis? a large sample a representative sample Both a and b
The answer is c. When conducting a study, it is important to use a large, representative sample.
If the points on a scatterplot fall precisely along a straight line,
we call it a perfect linear relationship.
There is no universally accepted line that separates what is considered a strong correlation and a weak correlation. That being said,
we will use the following common guide: r values between −0.3 and 0.3 are considered weak linear correlations; r values between −0.7 and −0.3 and between 0.3 and 0.7 are considered moderate linear correlations; r values between −0.7 and −1 and between 0.7 and 1 are considered strong linear correlations.
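The guide above can be expressed as a small lookup in Python. This is a sketch only: the function name correlation_strength is hypothetical, and because the guide does not say which category the boundary values 0.3 and 0.7 belong to, this sketch assumes they fall into the stronger category.

```python
def correlation_strength(r):
    """Classify a correlation coefficient r using the course's rough guide.

    Assumption: |r| of exactly 0.3 counts as moderate and exactly 0.7 as strong.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    magnitude = abs(r)  # the sign gives direction, not strength
    if magnitude < 0.3:
        return "weak"
    if magnitude < 0.7:
        return "moderate"
    return "strong"


for r in (0.1, -0.5, 0.9, -0.95):
    print(r, correlation_strength(r))
```

Using the absolute value reflects that the sign of r only indicates whether the correlation is positive or negative; strength depends on how close r is to −1 or 1.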
Slope-Intercept in Simple Linear Regression
y=mx+b In this equation: b is the y -intercept: In linear regression as with algebra, this is simply the value of y when x=0 . m is the slope: In regression analysis, for every additional unit of the explanatory variable ( x ), we can predict an increase (or decrease, if negative) of m units of the response variable ( y ). x is the x -value: This is the explanatory variable. y is the y -value: This is the response variable. The y variable above is determined by the values of m and b as well as the values of the x variable. For example, the price of a bottle of wine (the response variable y ) is a function of the demand for wine (the explanatory variable x ): If more people want wine, it bids up the price and vice versa.