C459 Module 5: Examining Relationships


Categorical explanatory and categorical response

C→C

The relationship between a categorical explanatory variable and a quantitative response variable is summarized using:

Data display: side-by-side boxplots. Numerical summaries: descriptive statistics.

Categorical explanatory and quantitative response

C→Q

The Correlation Coefficient: r

The correlation coefficient (r) is a numerical measure of the strength and direction of a linear relationship between two quantitative variables.

Case C→Q: Exploring the relationship amounts to comparing the distributions of the quantitative response variable for each category of the explanatory variable. To do this, we use:

Display: side-by-side boxplots. Numerical summaries: descriptive statistics of the response variable, for each value (category) of the explanatory variable separately.
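The per-category numerical summaries described above can be sketched in Python as follows. This is a minimal illustration, not part of the course material: the scores are made up, and `five_number_summary` is a hypothetical helper name. Note that software packages use slightly different quartile conventions, so Q1 and Q3 may differ a little from hand calculations.

```python
import statistics

# Hypothetical test scores, grouped by the categorical explanatory variable.
scores = {
    "female": [72, 85, 90, 66, 78, 88, 95],
    "male": [70, 82, 64, 91, 77, 59, 86],
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# One summary per category, so the distributions can be compared side by side.
summaries = {group: five_number_summary(vals) for group, vals in scores.items()}
for group, summary in summaries.items():
    print(group, summary)
```

Comparing the two tuples line by line is the numerical counterpart of reading side-by-side boxplots.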

Case C→C: Exploring the relationship amounts to comparing the distributions of the categorical response variable for each category of the explanatory variable. To do this, we use:

Display: two-way table. Numerical summaries: conditional percentages of the response variable, for each value (category) of the explanatory variable separately.

We want to explore whether the outcome of the study—the score on a test—is affected by the test-taker's gender. Therefore:

Gender is the explanatory variable; test score is the response variable.

In Case C→Q, we compared distributions of the quantitative response.

In Case C→C, we compared distributions of the categorical response.

Exploring the relationship between a categorical explanatory variable and a quantitative response variable amounts to comparing the distributions of the quantitative response for each category of the explanatory variable.

In particular, we look at how the distribution of the response variable differs between the values of the explanatory variable.

Is there a relationship between gender and test scores on a particular standardized test? Other ways of phrasing the same research question:

Is performance on the test related to gender? Is there a gender effect on test scores? Are there differences in test scores between males and females?

A store asked 250 of its customers whether or not they were satisfied with the service. The purpose of this study was to examine the relationship between the customer's satisfaction and gender. This study is an example of:

Case C→C. Both the explanatory variable (gender) and the response variable (satisfaction) are categorical, so this is an example of Case C→C.

Case C→C: Summarize the relationship between two categorical variables

Case C→C: Two Categorical Variables

A study was conducted in order to determine whether longevity (how long a person lives) is related to a person's handedness (right-handed/left-handed). This study is an example of:

Case C→Q. The explanatory variable (handedness) is categorical and the response variable (longevity) is quantitative, so this is an example of Case C→Q.

Quantitative explanatory and categorical response

Case Q→C will not be discussed in this course, and is typically covered in more advanced courses.

A study was conducted in order to explore the relationship between the number of beers a person drinks, and his/her Blood Alcohol Content (BAC, in %). This study is an example of:

Case Q→Q. Both the explanatory variable (number of beers) and the response variable (BAC) are quantitative, so this is an example of Case Q→Q.

The relationship between two categorical variables is summarized using: Data display: two-way table, supplemented by Numerical summaries: conditional percentages.

Conditional percentages are calculated for each category of the explanatory variable separately. They can be row percentages, if the explanatory variable "sits" in the rows, or column percentages, if the explanatory variable "sits" in the columns.
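As a rough sketch of the calculation, the Python below computes row percentages for a two-way table in which the explanatory variable sits in the rows. The counts are hypothetical (chosen to total 250 customers), and `row_percentages` is an illustrative helper name, not something from the course.

```python
# Hypothetical two-way table: explanatory variable (gender) in the rows,
# response variable (satisfaction) in the columns. Counts are made up.
counts = {
    "female": {"satisfied": 90, "not satisfied": 30},
    "male": {"satisfied": 78, "not satisfied": 52},
}


def row_percentages(table):
    """Conditional percentages of the response within each row (category)."""
    result = {}
    for group, row in table.items():
        total = sum(row.values())  # row total for this explanatory category
        result[group] = {resp: 100 * n / total for resp, n in row.items()}
    return result


percentages = row_percentages(counts)
for group, row in percentages.items():
    print(group, {resp: round(p, 1) for resp, p in row.items()})
```

Each row sums to 100%, so the rows can be compared directly even though the two groups have different sizes. If the explanatory variable sat in the columns instead, the same division would be done by column totals.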

Case C→Q: Categorical Explanatory Variable and Quantitative Response Variable

We essentially compare the distributions of the quantitative response for each category of the explanatory variable, using side-by-side boxplots supplemented by descriptive statistics. Recall that we have done this before: when we discussed the boxplot, we argued that boxplots are most useful when presented side by side to compare the distributions of two or more groups. This is exactly what we are doing here.

Case Q→Q: We examine the relationship using display: scatterplot. When describing the relationship as displayed by the scatterplot, be sure to consider:

Overall pattern → direction, form, strength. Deviations from the pattern → outliers.

Quantitative explanatory and quantitative response

Q→Q

When we explore a relationship using the scatterplot, we should describe the overall pattern of the relationship and any deviations from that pattern. To describe the overall pattern consider the direction, form and strength of the relationship.

Assessing the strength just by looking at the scatterplot can be problematic; using a numerical measure to determine strength will be discussed later in this course.

A lurking variable is a variable that was not included in your analysis, but that could substantially change your interpretation of the data if it were included.

Because of the possibility of lurking variables, we adhere to the principle that association does not imply causation.

explanatory

The key to deciding which variable is the explanatory variable and which is the response variable is to decide which variable affects (or explains) the other. If we can say that X affects (or explains) Y, then X is the explanatory variable.

The slope of the least squares regression line can be interpreted as the average change in the response variable when the explanatory variable increases by one unit.

The least squares regression line predicts the value of the response variable for a given value of the explanatory variable. Extrapolation is prediction of the response for values of the explanatory variable that fall outside the range of the data. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable and should be avoided.
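One way to make the warning about extrapolation concrete is to have the prediction step refuse values outside the observed range. The sketch below assumes a line that has already been fitted; the intercept, slope, and observed range are hypothetical, and `predict` is an illustrative helper name.

```python
def predict(x, intercept, slope, x_min, x_max):
    """Predict the response at x, refusing to extrapolate.

    x_min and x_max are the smallest and largest explanatory values
    that were actually observed in the data used to fit the line.
    """
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x={x} is outside the observed range [{x_min}, {x_max}]; "
            "prediction would be extrapolation"
        )
    return intercept + slope * x


# Hypothetical fitted line and observed range, made up for illustration.
intercept, slope = 2.2, 0.6
x_min, x_max = 1, 5

print(predict(3, intercept, slope, x_min, x_max))  # inside the range

try:
    predict(10, intercept, slope, x_min, x_max)  # outside the range
except ValueError as err:
    print("refused:", err)
```

Raising an error is a design choice for this sketch; a warning would also work when the analyst wants the number but needs to be reminded that it is unreliable.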

The correlation is only an appropriate numerical measure for linear relationships, and is sensitive to outliers. Therefore, the correlation should only be used as a supplement to a scatterplot (after we look at the data).

The most commonly used criterion for finding a line that summarizes the pattern of a linear relationship is "least squares." The least squares regression line has the smallest sum of squared vertical deviations of the data points from the line.
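The least squares criterion can be sketched in a few lines of Python using the standard formulas, slope = Sxy / Sxx and intercept = mean(y) − slope · mean(x). The data are made up for illustration and the function names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_deviations(xs, ys, intercept, slope):
    """Sum of squared vertical deviations of the points from a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

intercept, slope = least_squares_line(xs, ys)
best = sum_squared_deviations(xs, ys, intercept, slope)
# A line with a slightly different slope has a larger sum of squares.
other = sum_squared_deviations(xs, ys, intercept, slope + 0.1)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print(f"fitted line: {best:.3f}, perturbed line: {other:.3f}")
```

Comparing the fitted line with a perturbed one illustrates the "smallest sum of squared vertical deviations" property for one alternative line; it is a spot check, not a proof.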

When the relationship is linear (as displayed by the scatterplot, and supported by the correlation r), we can summarize the linear pattern using the least squares regression line. Remember that:

The slope of the regression line tells us the average change in the response variable that results from a one-unit increase in the explanatory variable. When using the regression line for predictions, you should beware of extrapolation.

In this study we explore whether the nearsightedness of a person can be explained by the type of light that person slept with as a baby.

Therefore: light type is the explanatory variable; nearsightedness is the response variable.

The relationship between two quantitative variables is visually displayed using the scatterplot, where each point represents an individual.

We always plot the explanatory variable on the horizontal X axis, and the response variable on the vertical Y axis.

Summary

When examining the relationship between two variables, the first step is to classify the two relevant variables according to their role and type:

A special case of the relationship between two quantitative variables is the linear relationship. In this case, a straight line simply and adequately summarizes the relationship.

When the scatterplot displays a linear relationship, we supplement it with the correlation coefficient (r), which measures the strength and direction of a linear relationship between two quantitative variables. The correlation ranges between -1 and 1. Values near -1 indicate a strong negative linear relationship, values near 0 indicate a weak linear relationship, and values near 1 indicate a strong positive linear relationship.
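For readers who want to see how r is computed, here is a minimal Python sketch of the usual Pearson formula, r = Sxy / sqrt(Sxx · Syy). The beer and BAC values are hypothetical, and `correlation` is an illustrative function name.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, made up for illustration.
beers = [1, 2, 3, 4, 5, 6]
bac = [0.02, 0.03, 0.05, 0.07, 0.08, 0.11]

r = correlation(beers, bac)
print(f"r = {r:.3f}")
```

Points that fall exactly on an increasing line give r = 1 and points exactly on a decreasing line give r = -1, which is why values near those extremes indicate a strong linear relationship.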

When examining the relationship between two variables (regardless of the case), any observed relationship (association) does not imply causation, due to the possible presence of lurking variables.

When we include a lurking variable in our analysis, we might need to rethink the direction of the relationship → Simpson's paradox.

Including a lurking variable in our exploration may: help us to gain a deeper understanding of the relationship between variables, or lead us to rethink the direction of an association.

Whenever including a lurking variable causes us to rethink the direction of an association, this is an instance of Simpson's paradox.
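A small numerical sketch can show how the direction of an association flips once a lurking variable is included. The counts below are illustrative only and are not taken from the course: two treatments, A and B, are compared on success rate, with case severity ("mild" or "severe") as the assumed lurking variable.

```python
# Illustrative counts as (successes, total), split by the lurking variable.
data = {
    "mild": {"A": (81, 87), "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, total):
    """Success rate as a proportion."""
    return successes / total


def overall(treatment):
    """Success rate for a treatment when the lurking variable is ignored."""
    successes = sum(data[group][treatment][0] for group in data)
    total = sum(data[group][treatment][1] for group in data)
    return rate(successes, total)


# Within each severity group, compare A with B.
for group, by_treatment in data.items():
    a, b = rate(*by_treatment["A"]), rate(*by_treatment["B"])
    print(f"{group}: A={a:.1%}, B={b:.1%}")

# Ignoring severity, compare A with B again.
print(f"overall: A={overall('A'):.1%}, B={overall('B')}".replace(
    str(overall("B")), f"{overall('B'):.1%}"))
```

In these made-up counts A has the higher success rate within each severity group, yet B has the higher rate overall, because A was used mostly on severe cases. That reversal is the pattern the term Simpson's paradox describes.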

A survey was conducted to study the relationship between whether the family is buying or renting their home and the marital status of the parents. Data were collected from a random sample of 280 families from a certain metropolitan area. A meaningful graphical display of these data would be:

A contingency table. A two-way table (also called a contingency table) is the appropriate display for comparing two categorical variables (that is, for the relationship between a categorical explanatory variable, in this case "marital status of the parents," and a categorical response variable, in this case "buy or rent their home").

In order to study the relationship between IQ level and GPA, data were collected from a sample of 540 students. The data collected in this study would best be displayed using:

A scatterplot. A scatterplot is the appropriate display for comparing two quantitative variables (in other words, to display the relationship between a quantitative explanatory variable, in this case "IQ," and a quantitative response variable, in this case "GPA").

The Role-Type Classification

Most studies pose research questions that involve exploring the relationship between two variables using the collected data. Here are a few examples of such research questions with the two variables highlighted:

A store asked 250 of its customers how much they spend on groceries each week. The responses were also classified according to the gender of the customers. We want to study whether there is a relationship between amount spent on groceries and gender. A meaningful display of the data from this study would be:

Side-by-side boxplots. Side-by-side boxplots are the appropriate display for comparing several groups of quantitative data (in other words, for displaying the relationship between a categorical explanatory variable, in this case "gender," and a quantitative response variable, in this case "amount spent on groceries").

