Ch. 3
T or F: It is easier to identify relationships between two categorical variables when looking at the percentages in the respective joint categories than when looking at the count data in the respective joint categories.
True - raw counts can hide a relationship when the total number of observations differs a lot between categories; percentages put the categories on a common scale
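A minimal sketch of the idea on the card above, using made-up counts (the groups and numbers are hypothetical, not from the chapter). Group B has more "Yes" answers in raw counts, but a smaller share of "Yes" once each row is divided by its own total.

```python
# Hypothetical counts (not from the source): rows are groups, columns are answers.
counts = {
    "Group A": {"No": 80, "Yes": 20},
    "Group B": {"No": 270, "Yes": 30},
}

def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result

pct = row_percentages(counts)
for row, cells in pct.items():
    print(row, {col: round(p, 1) for col, p in cells.items()})
```

Comparing the "Yes" column of `pct` across rows is the comparison the card recommends; comparing the "Yes" column of `counts` would point the other way.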
If the correlation of variables is close to 0, then we expect to see
a cluster of points with no apparent relationship on the scatterplot
Examples of comparison problems include: -salary broken down by gender, -cost of living broken down by region of the country, -recovery rate of patients broken down by placebo versus actual drug, -starting salary of graduates broken down by major, -all of the above
all of the above
The limitation of covariance as a descriptive measure of association is that it
is very sensitive to the units of the variables
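A small sketch of why units matter, assuming hypothetical salary and experience data (none of these numbers come from the chapter). The same salaries are expressed in thousands of dollars and in dollars; the covariance scales with the units while the correlation should stay the same.

```python
def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = covariance(xs, xs) ** 0.5
    sd_y = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical data: salary in thousands of dollars and years of experience.
salary_k = [40, 45, 52, 60, 75]
years = [1, 3, 4, 7, 10]
salary_dollars = [1000 * s for s in salary_k]  # same data, different units

cov_k = covariance(salary_k, years)
cov_dollars = covariance(salary_dollars, years)
r_k = correlation(salary_k, years)
r_dollars = correlation(salary_dollars, years)

print("covariance:", cov_k, "vs", cov_dollars)
print("correlation:", r_k, "vs", r_dollars)
```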
The tool that provides useful information about a data set by breaking it down into categories is the
pivot table
We are usually on the lookout for large correlations near:
-1 and +1
A useful way of comparing the distribution of a numerical variable across categories of some categorical variable is: A. side by side pivot table B. side by side box plot C. both
Both side by side pivot table and box plot
Which of the following patterns on a scatterplot are most likely to indicate a correlation close to 0 between the respective variables? Select all that apply. Graph A: points along an upward-sloping line Graph B: points along a downward-sloping line Graph C: points are hovering around in a non-descript "cloud" Graph D: points slope downward then upward in a parabola shape
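C and D. C: the points show no pattern, so the correlation is close to 0. D: the downward-sloping half and the upward-sloping half offset each other, so the overall linear correlation is close to 0 even though the variables are clearly related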
Correlation and covariance measure:
The strength and direction of a linear relationship between 2 numerical variables
In a focus group, a group of people is asked whether or not they would buy a new smartphone for a suggested retail price of $1000. Their answers are reported by gender in the table below. Gender: No / Yes / Total. Female: 65.0% / 35.0% / 100%. Male: 44.1% / 55.9% / 100%. Total: 54.55% / 45.45% / 100%. Based on this data, what can we conclude about the relationship between gender and interest in the product?
There appears to be a relationship between the variables, because the percentage of males who would buy the new smartphone is higher than the percentage of females. A majority of females said "No" (65.0%) while a majority of males said "Yes" (55.9%), so the percentages for males and females are very different
Tables used to display counts of a categorical variable are called: -cross tabs -contingency tables -neither of above -both of above
both
To examine the relationship between ZIP code and household income, we can use: -counts, percentages, and corresponding charts -scatterplots -correlations -box-whisker plots
box-whisker plots: ZIP code is a categorical variable and household income is a numerical variable, so with one categorical and one numerical variable we use box-whisker plots. Counts are good for categorical variables; scatterplots and correlations are good for numerical variables
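A rough sketch of the numbers a side-by-side box plot would display, assuming made-up incomes (in thousands of dollars) for two hypothetical ZIP codes. It groups the numerical variable by the categorical one and summarizes each group; the quartile cut points use the default method of `statistics.quantiles`, which may differ slightly from what a spreadsheet reports.

```python
import statistics

# Hypothetical household incomes (in $1000s) keyed by a hypothetical ZIP code.
incomes_by_zip = {
    "ZIP A": [42, 55, 61, 48, 70],
    "ZIP B": [80, 95, 120, 88, 102],
}

def box_summary(values):
    """Return the min, quartile cut points, and max that a box plot is built from."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}

summary = {zip_code: box_summary(vals) for zip_code, vals in incomes_by_zip.items()}
for zip_code, stats in summary.items():
    print(zip_code, stats)
```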
If the standard deviation of X is 15, the covariance of X and Y is 94.5, the coefficient of correlation r = 0.90, then the variance of Y is 7.0.
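False - the standard deviation of Y is 94.5 / (0.90 x 15) = 7.0, so the variance of Y is 7.0 squared = 49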
One characteristic of "paired variables" is
same # of observations
The most common data format is
stacked
A line or curve superimposed on a scatterplot to quantify an apparent relationship is known as a(n): -slope -trendline
trendline
The data set below contains data on different medical conditions from the 2015 Medical Expenditure Panel Survey by the Agency for Healthcare Research and Quality. HW 2 Data.xlsx The variable "Care Male" lists the number of males (in thousands) who received care for the respective medical condition. Likewise, the variable "Care Female" lists the number of females (in thousands) who received care. The variables "Total Male" and "Total Female" list the number of males and females who reported the respective condition. You may assume that the difference between the two variables did not seek and/or receive care. You want to know whether there is a relationship between gender and seeking care for each of the conditions (i.e. are males more or less likely to seek care for cancer than females). For which conditions does gender appear to have a relationship with whether or not the person sought and/or received care? -Cancer -complications of surgery or device -Hernias -Hypertension -Non-malignant neoplasm -Poisoning by medical and non-medical substances ***look at graph
-complications of surgery or device -Hernias -Non-malignant neoplasm -Poisoning by medical and non-medical substances. Calculate percentages: M or F who sought care divided by total M or F with the condition. Cancer: numbers are identical. Complications: higher % of females. Hernias: higher % of females. Hypertension: almost the same. Non-malignant neoplasm: higher % of males. Poisoning: higher % of males
Correlation and covariance can be used to examine relationships between numerical variables as well as for categorical variables that have been coded numerically.
False
We do not even try to interpret correlations numerically except possibly to check whether they are positive or negative.
False
Relationships between two variables are less evident when counts are expressed as percentages of row totals or column totals.
False - percentages make relationships more evident, especially for categorical variables
Correlation can be affected by the measurement scales applied to X and Y variables.
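False - correlation is unitless, so changing the units of X or Y does not change it; covariance is the measure that is sensitive to units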
Below you will find current annual salary data and related information for 30 employees at Gamma Technologies, Inc. These data include each selected employee's gender (1 for female; 0 for male), age, number of years of relevant work experience prior to employment at Gamma, number of years of employment at Gamma, the number of years of post-secondary education, and annual salary. The tables of correlations and covariances are presented below. Correlations (columns: IQ, Age, Prior exp., Gamma exp., Edu, Salary): IQ: 1.000; Age: -0.111, 1.000; PE: 0.054, 0.800, 1.000; GE: -0.203, 0.616, 0.587, 1.000; Edu: -0.039, 0.518, 0.434, 0.342, 1.000; Sal: -0.154, 0.623, 0.723, 0.780, 0.617, 1.000. Covariances (same columns): IQ: 0.259; Age: -0.633, 134.051; PE: 0.117, 39.060, 19.045; GE: -0.700, 72.047, 17.413, 49.421; Edu: -0.033, 9.951, 3.140, 3.987, 2.947; Sal: -1825.97, 249700.35, 78699.75, 14303.29, 24747.68, 5846.. Which two variables have the strongest positive linear relationship? -Salary and Age -Prior experience and Age
Prior experience and Age. Looking at the correlations table: Salary and Age = 0.623, Prior exp. and Age = 0.800, and 0.800 > 0.623
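The lookup in the card above can be sketched in code. The values below are the lower triangle of the correlations table as given in the question; the variable names and the dictionary layout are just one convenient way to hold them, not anything prescribed by the chapter.

```python
# Lower triangle of the correlations table from the question, row by row.
labels = ["IQ", "Age", "PE", "GE", "Edu", "Sal"]
lower = [
    [1.000],
    [-0.111, 1.000],
    [0.054, 0.800, 1.000],
    [-0.203, 0.616, 0.587, 1.000],
    [-0.039, 0.518, 0.434, 0.342, 1.000],
    [-0.154, 0.623, 0.723, 0.780, 0.617, 1.000],
]

# Keep only off-diagonal pairs; the diagonal is always 1.000 and tells us nothing.
pairs = {
    (labels[i], labels[j]): lower[i][j]
    for i in range(len(labels))
    for j in range(i)
}

strongest = max(pairs, key=pairs.get)  # largest positive correlation
print("Sal/Age:", pairs[("Sal", "Age")])
print("PE/Age:", pairs[("PE", "Age")])
print("strongest positive pair:", strongest, pairs[strongest])
```

The covariance table is deliberately not used here, since covariances in different units cannot be ranked against each other.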
An economic development researcher wants to understand the relationship between the average monthly expenditure on utilities for households in a particular middle-class neighborhood and each of the following household variables: -family size, -approximate location of the household within the neighborhood, -indication of whether those surveyed owned or rented their home, -gross annual income of the first household wage earner, -gross annual income of the second household wage earner (if applicable), -size of the monthly home mortgage or rent payment, - the total indebtedness (excluding the value of a home mortgage) of the household. **** look at table for data*** relationship between Debt and Utilities: 0.778 How would you interpret this relationship?
There is a positive linear relationship between debt and utility payments: the correlation of 0.778 is greater than 0 and fairly close to +1
A trend line on a scatterplot is a line or a curve that "fits" the scatter as well as possible.
True
Correlation is a single-number summary of a scatterplot.
True
Counts for a categorical variable are often expressed as percentages of the total.
True
If the coefficient of correlation r = 0.80, the standard deviations of X and Y are 20 and 25, respectively, then Cov(X, Y) must be 400.
True
If the standard deviations of X and Y are 15.5 and 10.8, respectively, and the covariance of X and Y is 128.8, then the coefficient of correlation r is approximately 0.77.
True
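The two cards above both rest on the identity r = Cov(X, Y) / (sd_X x sd_Y). A minimal sketch that checks the arithmetic with the numbers from those cards (the function names are illustrative only):

```python
def cov_from_r(r, sd_x, sd_y):
    """Rearranged identity: Cov(X, Y) = r * sd_X * sd_Y."""
    return r * sd_x * sd_y

def r_from_cov(cov_xy, sd_x, sd_y):
    """Correlation is the covariance scaled by both standard deviations."""
    return cov_xy / (sd_x * sd_y)

cov = cov_from_r(0.80, 20, 25)     # card: r = 0.80, sd_X = 20, sd_Y = 25
r = r_from_cov(128.8, 15.5, 10.8)  # card: Cov = 128.8, sd_X = 15.5, sd_Y = 10.8

print("Cov(X, Y):", round(cov, 2))
print("r:", round(r, 2))
```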
It is possible that the data points are close to a curve and have a correlation close to 0, because correlation is relevant only for measuring linear relationships.
True
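A quick illustration of the card above, assuming a made-up set of points that lie exactly on the parabola y = x squared. The points follow a curve perfectly, yet the correlation works out to 0 because the downward and upward halves cancel.

```python
def pearson_r(xs, ys):
    """Pearson correlation computed directly from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical points on a parabola, symmetric about x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
r_line = pearson_r(xs, [2 * x + 1 for x in xs])  # a straight line, for contrast

print("parabola:", r_curve)
print("line:", r_line)
```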
Problems in data analysis where we want to compare a numerical variable across two or more subpopulations are called comparison problems.
True
Side-by-side box plots allow you to quickly see how two or more categories of a numerical variable compare.
True
Statisticians often refer to the pivot tables that display counts as contingency tables or crosstabs
True
Strongly related variables may have a correlation close to zero if the relationship is nonlinear.
True
The advantage that the coefficient of correlation has over the covariance is that the former has a set lower and upper limit.
True
The correlation between two variables is unitless and is always between -1 and +1.
True
The scatterplot is a graphical technique used to make apparent the relationship between two numerical variables
True
To form a scatterplot of X versus Y, X and Y must be paired variables.
True
We must specify appropriate bins for side-by-side histograms in order to make fair comparisons of distributions by category.
True
Scatterplots are also referred to as
X-Y charts
A scatterplot allows one to see: A.whether there is any relationship between two variables B.what type of relationship there is between two variables C.Both (a) and (b) are correct
both A and B are correct
Displaying all correlations between 0.6 and 0.999 on a scatterplot as green and all correlations between -1.0 and -0.6 as red is known as
conditional formatting
Which of the following are considered numerical summary measures?
correlation and covariance
We study relationships among numerical variables using
correlation, covariances, and scatterplots
To examine relationships between two categorical variables, we can use
counts and corresponding charts of the counts
Correlation is useful only for
measuring the strength of a linear relationship
We can infer that there is a strong linear relationship between two numerical variables when:
the points on a scatterplot cluster tightly around an upward-sloping OR downward-sloping straight line
An example of a joint category of two variables is the count of all non-drinkers who are also nonsmokers.
true
For a sample of 100 employees, the covariance between annual salary (measured in $) and hours worked per week (measured in hours) is equal to 809.43, while the covariance between annual salary (measured in $) and sick days per year (measured in days) is equal to 450.86. Which of the following statements is correct? -The relationship between annual salary and hours worked per week is stronger than the relationship between annual salary and sick days per year. -We need additional info to determine which relationship is stronger
We need additional info. Covariance depends on the units of the variables, so the two covariances cannot be compared directly; we need the correlations