Chapter 3
The histogram above shows the distribution of population in millions across states. Which of the following statements are true based on the histogram?
-Only a few states have very high populations. -The shape of the distribution is skewed to the right.
When do you tend to see lots of gaps between blocks in your histogram?
-Small bin width -Lots of bins
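A rough way to check this yourself is to count the empty bins, since each empty bin appears as a gap when the histogram is drawn. The sketch below uses base R's hist() rather than gf_histogram(), and the values in x are made up for illustration; they are not from USStates.

```r
# Hypothetical values chosen for illustration; not taken from USStates.
x <- c(1, 2, 2, 3, 3, 3, 8, 9, 9, 10)

# Few wide bins (width 5) versus many narrow bins (width 1).
wide   <- hist(x, breaks = seq(0, 10, by = 5), plot = FALSE)
narrow <- hist(x, breaks = seq(0, 10, by = 1), plot = FALSE)

# An empty bin shows up as a gap between blocks in the drawn histogram.
empty_wide   <- sum(wide$counts == 0)
empty_narrow <- sum(narrow$counts == 0)

print(c(wide = empty_wide, narrow = empty_narrow))
```

With these made-up values, the narrow bins leave several bins empty while the wide bins leave none.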
Let's say you make several histograms in the process of exploring the data. Among them is a frequency histogram of PhysicalActivity and a relative frequency histogram of PhysicalActivity. If you used default settings for each of them, what do the two have in common? (Check all that apply.)
-They display the same variable. -They have the same number of bars. -The shape of the distribution would be the same.
Revise your basic histogram of Smokers in the USStates data frame so that it includes just 5 bins. Locate the bin that represents the states with the highest percentage of residents who smoke. Around what number is that bin centered?
30
In USStates, what's the median percentage of residents with college degrees? (The variable is aptly named College.)
30.6
If we added another data point, 3.7, which bin would it go in?
4
The variable Pres2008 is categorical. It indicates whether McCain or Obama won the state in the 2008 election. Which is the more appropriate visual representation for this data?
Bar Graph
In this case, the median weight (145) is about as far from the minimum weight (about 50-ish pounds away) as it is from the maximum weight (about 50-ish pounds away). What does this tell you about the distribution?
It might be symmetrical.
Using the USStates data frame, make a bar graph to illustrate the number of states that voted for McCain and Obama (recorded as the variable Pres2008). Based on what you see, which of the following statements is true?
McCain won in more than 20 states.
If the distribution of Fat were roughly symmetrical and bell-shaped, what would that mean?
The most frequent number in the distribution would be very close to the average score.
Given that the IQR for Population is about 27 million, and Q3 is about 31 million, which of the following countries' populations could be considered a large outlier? (This does not need very precise calculations.) Mark all that apply.
-China, 1304.50 million -India, 1094.58 million -US, 296.51 million -Indonesia, 220.56 million
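The rough arithmetic behind this answer can be sketched in base R. It uses the approximate Q3 and IQR quoted in the question and assumes the common rule that a large outlier lies above Q3 + 1.5 * IQR.

```r
# Approximate values quoted in the question (in millions).
q3  <- 31
iqr <- 27

# Common 1.5 * IQR rule for the upper boundary (assumed here).
upper_boundary <- q3 + 1.5 * iqr

# Populations listed in the answer (in millions).
population <- c(China = 1304.50, India = 1094.58,
                US = 296.51, Indonesia = 220.56)

# TRUE means the country lies above the upper boundary.
is_large_outlier <- population > upper_boundary

print(upper_boundary)
print(is_large_outlier)
```

The boundary comes out near 71.5 million, so every country listed is far above it.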
Which of the variables below would be appropriate for a histogram?
-College -IQ -Population
Why should you look at a histogram of a variable before you do other statistical analyses?
-You might catch errors made in data collection/entry. -You can see the shape of the distribution to see if it makes sense.
What proportion of states (recorded as the variable Pres2008) was won by Obama? (Hint: use the "tally" command.)
.56
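The course's tally() function reports counts, and this set notes elsewhere that format = "proportion" switches it to proportions. Here is a minimal base R sketch of the same idea using table() and prop.table(). The 22 and 28 are assumptions inferred from the 0.56 answer and 50 states, not values read from USStates.

```r
# Assumed counts: 28 of 50 states for Obama is inferred from the 0.56 answer,
# not read from the USStates data frame.
pres2008 <- rep(c("McCain", "Obama"), times = c(22, 28))

# Counts per category, then the same table rescaled to sum to 1.
counts      <- table(pres2008)
proportions <- prop.table(counts)

print(counts)
print(proportions)
```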
The USStates data frame includes information on the percentage of residents in each state who smoke. Data is coded under the variable named Smokers. Produce a histogram of Smokers, without indicating a particular number of bins or indicating a particular bin size. Where is the peak of the histogram?
20
Now imagine this. Say we use the same histogram command with just 5 bins, but we add a new number to our variable: 3.2. Which bin do you think it will go in?
3
In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had competed in a physical activity in the last month. Given this information, interpret the 70.3 in the table below.
70.3 percent of the surveyed residents said that they competed in a physical activity that month.
Between what two values do we see the middle 50% of all IQ scores in the USStates data frame?
98.5 and 102.7
Experiment, using different numbers of bins in your histogram of Smokers. If you don't want to see gaps between the blocks in your histogram, which is the better choice?
A small number of bins
You decide to make a relative frequency histogram for PhysicalActivity and have added the following to your code: %>% gf_density() What will you now see that you didn't see before?
A smooth density plot overlaying your bars.
The NutritionStudy data frame includes information on the number of Calories patients consumed per day. Produce a histogram of Calories, without indicating a particular number of bins or a particular bin size. What is the peak of the histogram?
Around 1600
A boxplot with a bigger upper part (relative to the lower part) will
Be right-skewed (tail to the right)
Wait! Didn't we just exclude outliers? Why are there outliers in the boxplot above?
Because this distribution has a different Q3 and IQR and these outliers are greater than the new Q3 + 1.5*IQR
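To see why new outliers can appear, here is a small base R sketch. The data are made up, upper_fence() is a hypothetical helper name, and it uses quantile()'s default method, which may differ slightly from the hinges a boxplot function uses.

```r
# Hypothetical helper: upper boundary under the 1.5 * IQR rule.
upper_fence <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  q[2] + 1.5 * (q[2] - q[1])
}

# Made-up, right-skewed values for illustration only.
x <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 8, 9, 14)

# First pass: find and drop the outliers.
first_fence    <- upper_fence(x)
first_outliers <- x[x > first_fence]
trimmed        <- x[x <= first_fence]

# Second pass: the trimmed data have a new Q3 and IQR, so the boundary moves.
second_fence    <- upper_fence(trimmed)
second_outliers <- trimmed[trimmed > second_fence]

print(c(first = first_fence, second = second_fence))
print(second_outliers)
```

In this example, the value 8 sits inside the first boundary but outside the second, so it becomes an outlier only after the others are removed.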
Between which two points do we see the middle .50 of values?
Between Q1 and Q3
Are the quartiles the four blue rectangles, or the five orange lines?
Four blue rectangles
If the x-axis represents the variable, what does the y-axis represent in these histograms?
Frequency (or count)
What would the following R code do, beyond creating a histogram? gf_histogram(~ College, data = USStates) %>% gf_labs(title = "Distribution of Residents with College Degrees", x = "Percentage")
Give the histogram a title and specify the label for the x-axis.
If you had another variable for which the median was much farther from the min, and much closer to the max, what would you expect the distribution to look like?
It would probably be skewed.
In the case of population, the median population (10 million people) is much closer to the min population (about 10 million away) than the max population (more than 1,290 million away). What does this tell you about the shape of this distribution?
It's probably skewed
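The comparison in this card can be sketched in base R by measuring how far the median is from each end. The populations below are made up to mimic a long right tail; they are not the actual country data.

```r
# Hypothetical populations in millions, made up to mimic a long right tail.
population <- c(0.5, 2, 4, 7, 10, 16, 30, 60, 1300)

med <- median(population)

# Distance from the median to each end of the distribution.
distance_to_min <- med - min(population)
distance_to_max <- max(population) - med

print(c(median = med, to_min = distance_to_min, to_max = distance_to_max))
```

A much larger distance to the max than to the min hints at a tail stretching to the right, though a histogram or boxplot is still the better check.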
When do you tend to see clumpy, big blocks of data in your histogram?
Large bin width, few bins
Do you think the distribution of this sample of 12 die rolls would look the same as the population distribution above?
No
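A quick simulation can illustrate why. This sketch assumes the population distribution is that of a fair six-sided die, with each face having probability 1/6; the seed is arbitrary and is set only so the run is repeatable.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Simulate a sample of 12 rolls of a fair die (assumption: fair die).
rolls <- sample(1:6, size = 12, replace = TRUE)

# Proportion of each face in the sample, keeping faces that never came up.
sample_props <- prop.table(table(factor(rolls, levels = 1:6)))

# Population proportions for a fair die.
population_props <- rep(1 / 6, 6)

print(round(rbind(sample = as.numeric(sample_props),
                  population = population_props), 2))
```

Re-running with different seeds usually gives sample proportions that wander around 1/6 rather than matching it exactly.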
Create a box plot for College in the USStates data frame. How many outliers do you see?
None
If you wanted to see the distribution for College (percentage of residents with college degrees), and run the following R code, what would be wrong? gf_histogram(~ College, data = USStates, bins = 10)
Nothing
The histogram below shows the distribution of Alcohol with an outlier removed. What is the "count" on the y-axis a count of?
Number of patients
What is the formula for finding IQR?
Q3 - Q1
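As a small check of the formula, this base R sketch computes Q1 and Q3 with quantile() and compares their difference with the built-in IQR(). The values in x are made up for illustration.

```r
# Made-up values for illustration.
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 15)

# Q1 and Q3 using quantile()'s default method.
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)

# IQR by hand: Q3 - Q1.
iqr_by_hand <- q[2] - q[1]

print(c(Q1 = q[1], Q3 = q[2], IQR = iqr_by_hand))
print(IQR(x))  # built-in function, should agree with iqr_by_hand
```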
Imagine that the PhysicalActivity histogram is skewed to the right. That is, the skinny, longer tail is on the right. What does that mean?
The population in most states is sedentary.
If you saw a study with a small sample that contradicts the results of a similar study with a very large sample, which one should you trust?
The study with the large sample
The quartiles are equally sized. What is "equal" about the quartiles?
They each have the same number of data points.
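One way to convince yourself is to cut a variable at its quartile boundaries and count the points in each piece. The sketch below uses the made-up values 1 to 100; with ties, or a sample size not divisible by four, the counts are only approximately equal.

```r
# Made-up data: the integers 1 to 100.
x <- 1:100

# Boundaries: min, Q1, median, Q3, max.
cuts <- quantile(x)

# Assign each value to one of the four quartile groups.
quarter <- cut(x, breaks = cuts, include.lowest = TRUE,
               labels = c("1st", "2nd", "3rd", "4th"))

counts <- table(quarter)
print(counts)
```

Each group holds the same number of points even though the groups can span very different ranges in a skewed distribution.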
In the Population boxplot, do you see mostly outliers that are too small or too large?
Too large
This histogram shows the same data with binwidth = 4. Why are there only two columns?
Two columns, each 4 units wide, are enough to cover the full range of the data.
Use "%>%" notation to add gf_density() (a density plot) to your density histogram of Smokers in USStates. What does the curve look like?
Unimodal
If we wanted to calculate the upper boundary for outliers, which number should be substituted here:
UpperBoundary <- Q3 (31.225) + 1.5 * (Q3 (31.225) - Q1 (4.455))
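Plugging the quoted quartiles into that formula can be done in base R as follows. The lower boundary is not part of this card; it is added here on the assumption that the same 1.5 * IQR rule is applied below Q1.

```r
# Quartiles quoted in the card (Population, in millions).
q1 <- 4.455
q3 <- 31.225

iqr <- q3 - q1

# Upper boundary from the card: Q3 + 1.5 * (Q3 - Q1).
upper_boundary <- q3 + 1.5 * iqr

# Assumed counterpart for small outliers: Q1 - 1.5 * (Q3 - Q1).
lower_boundary <- q1 - 1.5 * iqr

print(c(lower = lower_boundary, upper = upper_boundary))
```

The upper boundary is about 71.38 million. The lower boundary is negative, and a population cannot be below zero, so under this rule no country can be a small outlier.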
Imagine that you created a histogram of PhysicalActivity. You meant to set it to have 15 bins, but you accidentally set 5 bins instead. How would the result be different from what you hoped?
You would see less detail than you would have otherwise depicted.
If you wanted to get the five-number summary for PhysicalActivity, what R code would you run?
favstats(~ PhysicalActivity, data = USStates)
If you were interested in proportions rather than counts, which argument would you add to your code above?
format = "proportion"
What R code would produce a relative frequency histogram of PhysicalActivity?
gf_dhistogram(~ PhysicalActivity, data = USStates)
What R code would create a distribution of Smokers?
gf_histogram(~ Smokers, data = USStates)
Wait a second! Does this mean that the country with the maximum population has only 1,304 people? HINT: It might help to review how the data were coded. Country: Name of country. Region: Three-digit country code. Happiness: Score on a 0-10 scale for average level of happiness (10 is happiest). LifeExpectancy: Average life expectancy (in years). Footprint: Ecological footprint, which is a measure of the (per capita) ecological impact. HLY: Happy Life Years combines life expectancy with well-being. HPI: Happy Planet Index (0-100 scale). HPIRank: HPI rank for the country. GDPperCapita: Gross Domestic Product (per capita). HDI: Human Development Index. Population: Population (in millions).
no
Which of the following lines of R code would quickly help you find the number of states in which McCain won the 2008 election, and the number of states in which Obama won?
tally(~ Pres2008, data = USStates)
Change your histogram of Smokers in the USStates data frame into a density histogram by using gf_dhistogram() instead of gf_histogram(). What changed?
y-axis
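The same change can be seen numerically with base R's hist(), which stores both the counts and the densities for each bin. This is a sketch with made-up percentages, not the Smokers values, and it uses hist() as a stand-in for gf_histogram() and gf_dhistogram().

```r
# Made-up percentages for illustration; not the Smokers variable.
x <- c(18, 19, 20, 20, 21, 21, 22, 23, 25, 29)

# Same bins for both versions; only the y-axis scale differs.
h <- hist(x, breaks = seq(15, 30, by = 5), plot = FALSE)

print(h$counts)   # heights of a frequency histogram
print(h$density)  # heights of a density histogram

# In a density histogram the bar areas (height * width) sum to 1.
bar_areas <- h$density * diff(h$breaks)
print(sum(bar_areas))
```

The bars and the overall shape stay the same; only the scale of the y-axis changes.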
You create a histogram of IQ and find that it looks relatively normal. Which of the following statements are likely true?
-It's unimodal. -Most scores are clumped at the center. -It's roughly symmetrical. -It's bell-shaped.
Here's something to think about. If we made a boxplot of the Population of SmallerCountries, what would the box look like?
A taller box on the top (above the median)
In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had competed in a physical activity in the last month. What's our goal in analyzing data like this?
-It helps us understand the population. -It helps us understand the processes that produced the variation we see.