Chapters 3 and 4

In summary, the key difference between whether a histogram or a tally will be more useful has to do with the type of outcome variable.

A quantitative outcome variable would lead you to use a faceted histogram. A categorical outcome variable would lead you to use a tally.

Experiment, using different numbers of bins in your histogram of Smokers. If you don't want to see gaps between the blocks in your histogram, which is the better choice?

A small number of bins

You decide to make a relative frequency histogram for PhysicalActivity and have added the following to your code:

A smooth density plot overlaying your bars.

What does a point represent in a scatterplot?

A single case (observation), placed according to its values on the two variables being plotted

Interquartile Range (IQR)

Between Q1 and Q3

What kind of variables should go in tally()?

Categorical

What would the following R code do, beyond creating a histogram? gf_histogram(~ College, data = USStates) %>% gf_labs(title = "Distribution of Residents with College Degrees", x = "Percentage")

Give the histogram a title and specify the label for the x-axis.

In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had competed in a physical activity in the last month. What's our goal in analyzing data like this?

It helps us understand the population. It helps us understand the processes that produced the variation we see.

In this case, the median weight (145) is about as far from the minimum weight (about 50 pounds away) as it is from the maximum weight (about 50 pounds away). What does this tell you about the distribution?

It might be symmetrical.

If you had another variable for which the median was much farther from the min, and much closer to the max, what would you expect the distribution to look like?

It's probably skewed.

In the case of population, the median population (10 million people) is much closer to the min population (about 10 million away) than the max population (more than 1,290 million away). What does this tell you about the shape of this distribution?

It's probably skewed.

You create a histogram of IQ and find that it looks relatively normal. Which of the following statements are likely true? (Check all that apply.)

It's unimodal. Most scores are clumped at the center. It's roughly symmetrical. It's bell-shaped.

When do you tend to see clumpy, big blocks of data in your histogram?

Large bin width, few bins

bottom-up strategy

Looking at a distribution of data, you try to imagine what the population distribution might look like, and what processes might have produced such a distribution. You move from concrete data to the more unknown, abstract DGP.

If our theory is correct, that some students reported their thumb lengths in centimeters instead of millimeters, what kind of error would this be?

Mistake

Consider the following model: Thumb = Height + other stuff. What is the best visualization to use?

Neither. Histograms are useful for visualizing the distribution of a single quantitative outcome variable.

Data generating process (DGP)

Not only do we want to generalize from our data to the population, but our real interest is in understanding the processes that produced the variation in the population itself, and then in the data. We ask: what might the process be that could have generated a distribution of data that looks like this? The DGP is unknown; we can't see it directly.

When do you tend to see lots of gaps between blocks in your histogram?

Small bin width, lots of bins

Consider the following model: WtLost = Condition + other stuff. What is the best visualization to use?

Tally is a good choice for a qualitative outcome variable because it shows the number of observations in each category of the variable.

What is IQR?

The distance between Q3 and Q1. The height of the box.

Imagine that the PhysicalActivity histogram is skewed to the right. That is, the skinny, longer tail is on the right. What does that mean?

The population in most states is sedentary.

Where should you look in the histogram to notice within-group variation?

The spread of the distribution

What differs between the density histogram and the frequency histogram?

The y-axis

Let's say you make several histograms in the process of exploring the data. Among them is a frequency histogram of PhysicalActivity and a relative frequency histogram of PhysicalActivity. If you used default settings for each of them, what do the two have in common?

They display the same variable. They have the same number of bars. The shape of the distribution would be the same.

The quartiles are equally sized. What is "equal" about the quartiles?

They each have the same number of data points. Each quartile contains one-fourth of the observations, regardless of what their exact scores are on the variable.

top-down strategy

Thinking about the DGP, and all that you know about the world, you try to imagine what the data distribution should look like if your theory of the DGP is true. You move from your ideas about the DGP to predicting actual data.

within-group variation (leftover variation)

The variation among members of the same group

Consider the following model: Thumb = Sex + other stuff. What is the best visualization to use?

Thumb is a quantitative variable, so a histogram is appropriate to visualize the distribution.

Why should you look at a histogram of a variable before you do other statistical analyses?

You might catch errors made in data collection/entry. You can see the shape of the distribution to see if it makes sense.

You meant to set it to have 15 bins, but you accidentally set 5 bins instead. How would the result be different from what you hoped?

You would see less detail than you would have otherwise depicted.

Spread

a way of characterizing how well distributed the cases are across the categories. Do most observations fall in one category, or is there an even distribution across all the categories?

The variable Pres2008 is categorical. It indicated whether it was McCain or Obama who won the state in the 2008 election. Which is the more appropriate visual representation for this data?

bar graph

If the values of the variable go on the y-axis, should the variable name appear before or after the ~?

before

distributions of quantitative variables

better explored with histograms and box plots

gf_facet_grid()

can split up your plot into separate panels, one for each category of a categorical variable

ordinal

categories with an order. Values have a hierarchy or an order to them. Order matters here, but the distance between values is not always equal.

discrete data

dichotomous, nominal, ordinal, numerical, interval. Observations can only exist at limited values. Often counted.

y-axis represents...

either the frequency of some score or range of scores in a sample, or the proportion of a sample that had some score.

If you wanted to get the five-number summary for PhysicalActivity, what R code would you run?

favstats(~ PhysicalActivity, data = USStates)

gf_histogram()

for quantitative variables

If you were interested in proportions rather than counts, which argument would you add to your code above?

format = "proportion"
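
A minimal base R sketch of counts versus proportions for a categorical variable. The vector below is made up for illustration and is not the USStates data. Base R's table() and prop.table() are used as a stand-in so the example runs without the mosaic package, which provides the tally() function used in these cards.

```r
# Hypothetical election winners for five made-up states (not real data)
winner <- c("Obama", "McCain", "Obama", "Obama", "McCain")

# Counts per category (what a plain tally reports)
counts <- table(winner)

# Proportions per category: each count divided by the total number of cases
proportions <- prop.table(counts)

print(counts)
print(proportions)
```

The proportions always sum to 1, which is what makes them comparable across samples of different sizes.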

quartiles

four equal groups of values. It is as if the long vector of values has been cut into four equal-sized pieces.

What do we look for when we explore distributions of a variable?

four things: shape, center, spread, and weird things.

distribution

"the pattern of variation in a variable or set of variables", and it is like a "lens" through which we can view variation in data. The numbers must all be measures of the same attribute. For example, if you have measures of height and weight on a sample of 20 people, you can't just lump the height and weight numbers into a single distribution. You can, however, examine the distribution of height and the distribution of weight separately.

bimodal distribution

having two clear clumps of scores around two parts of the measurement scale, with few in the middle

create a boxplot of Thumb length broken down by Sex.

gf_boxplot(Thumb ~ Sex, data = Fingers)

What R code would produce a relative frequency histogram of PhysicalActivity?

gf_dhistogram(~ PhysicalActivity, data = USStates)

%>%

goes at the end of a line to chain on a second command.

Variability

in a set of numbers, how widely dispersed the values are from each other and from the mean. Data sets with similar values are said to have little variability, while data sets that have values that are spread out have high variability.

the law of large numbers

in the long run, by either collecting lots of data or doing a study many times, we will get closer to understanding the true population and DGP.
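
A small simulation sketch of this idea, assuming a fair six-sided die as the DGP (so the true mean is 3.5). The seed and sample sizes are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

die <- 1:6  # the six possible outcomes of a fair die

# A small sample and a large sample, both drawn with replacement
small_sample <- sample(die, 12, replace = TRUE)
large_sample <- sample(die, 100000, replace = TRUE)

# The mean of the large sample should land much closer to 3.5
small_mean <- mean(small_sample)
large_mean <- mean(large_sample)

print(small_mean)
print(large_mean)
```

The small sample can easily miss 3.5 by a noticeable margin, while the large sample's mean settles very near it.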

parameter

info that describes something about a population

what year is it

interval

Q1 - 1.5 * IQR

Any value below this cutoff is considered a small (low) outlier.

Q3 + 1.5 * IQR

Any value above this cutoff is considered a large (high) outlier.
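
A sketch of the 1.5 * IQR rule in base R. The thumb lengths below are made-up values, not the Fingers data, and quantile() is called with its default method, which can give slightly different quartiles than other methods.

```r
# Hypothetical thumb lengths in millimeters (made-up values)
thumb <- c(52, 55, 58, 60, 61, 63, 64, 66, 70, 75, 90)

# Quartiles and the interquartile range
q1 <- quantile(thumb, 0.25, names = FALSE)
q3 <- quantile(thumb, 0.75, names = FALSE)
iqr <- q3 - q1

# Cutoffs: values beyond these are flagged as outliers
lower_cutoff <- q1 - 1.5 * iqr
upper_cutoff <- q3 + 1.5 * iqr

# Keep only the values that fall outside the cutoffs
outliers <- thumb[thumb < lower_cutoff | thumb > upper_cutoff]
print(outliers)
```

With these values only the largest thumb length falls outside the cutoffs.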

Explaining variation could help us in three ways:

it helps us understand what causes the variation in a variable; it helps us predict future observations; or, it helps us change the system we are studying to produce different outcomes.

after the ~

its values go on the x-axis

before the ~

its values go on the y-axis

In a jitter plot, what tells you how many people there are with a particular thumb length?

less transparency

unimodal distribution

meaning that most scores are clustered in the center, with tails going out to either side

uniform distribution

meaning the number of observations is evenly distributed across the possible scores

three-number summary of the distribution

min, median, max

nominal

name of a category (can use numbers). Values refer to categories. Numbers are arbitrary and should only be counted, not added or subtracted.

if you have a dichotomous variable, can you change it into an interval variable?

no. You can only change rich data into simpler data

student ID number

nominal

what is your preferred type of instruction? (in person, live, hybrid)

nominal

If you wanted to see the distribution for College (percentage of residents with college degrees), and run the following R code, what would be wrong? gf_histogram(~ College, data = USStates, bins = 10)

nothing

interval data

numeric values that represent a quantity. Numbers have meaning. Unlike ordinal data, this data has equal spacing throughout the range. 0 does not mean absence of the measure. Ex.: measuring temperature (equal distance between numbers on the scale).

ratio

numerical values; 0 has meaning (absence)

continuous

numerical, ratio measured data, can have infinite range of values

dichotomous

only 2 options ex: did you read through the syllabus?

highest level of education

ordinal

on a scale of 1-5, how much do you love stats

ordinal data

If we are mostly going to put the outcome variable on the y-axis (and the explanatory variable on the x-axis), what order would you expect in our R code?

outcome ~ explanatory

What kind of variables should go in gf_histogram()?

quantitative

weight

ratio

what % of the recent lecture did you watch

ratio

Relative frequency histograms

represent proportion instead of frequency of cases on the y-axis
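
A base R sketch of how the same bins give either counts or proportions. The data and bin edges are made up for illustration, and hist() with plot = FALSE is used only to do the binning, not as the plotting function from these cards.

```r
# Hypothetical thumb lengths in millimeters (made-up values)
thumb <- c(52, 55, 58, 60, 61, 63, 64, 66, 70, 75, 90)

# Bin the data into four bins of width 10 without drawing a plot
bins <- hist(thumb, breaks = seq(50, 90, by = 10), plot = FALSE)

# Frequency: number of cases in each bin
frequency <- bins$counts

# Relative frequency: proportion of cases in each bin
relative_frequency <- frequency / sum(frequency)

print(frequency)
print(relative_frequency)
```

Dividing every count by the same total rescales the y-axis but leaves the relative heights of the bars, and so the shape, unchanged.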

Let's take a simulated sample of 12 die rolls (sampling with replacement from the six possible outcomes) and save it in a vector called sample1.

sample1 <- resample(model.pop, 12)

resample()

sampling with replacement

sample()

sampling without replacement
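
A sketch contrasting the two kinds of sampling using only base R's sample(). Setting replace = TRUE is used here as a stand-in for resample(), so the example runs without extra packages. The name model_pop and the seed are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed so the draws are repeatable

model_pop <- 1:6  # the six possible outcomes of a die

# With replacement: the same outcome can be drawn again,
# so 12 draws from 6 outcomes is possible
with_replacement <- sample(model_pop, 12, replace = TRUE)

# Without replacement (the default): each outcome is used at most once
without_replacement <- sample(model_pop, 6)

# Asking for 12 draws without replacement from 6 outcomes is an error
too_many <- tryCatch(sample(model_pop, 12), error = function(e) "error")

print(with_replacement)
print(without_replacement)
print(too_many)
```

Twelve draws from six outcomes must contain repeats, which is why simulating die rolls requires sampling with replacement.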

In our example about Thumb length and Sex, which is the explanatory variable?

sex

arrange()

sorts our outcome values to take a closer look at them.

gf_facet_grid()

splits the data according to the levels in the variable you specify. This works well with a categorical variable like sex or year in school, but not with a quantitative variable like height where you may have infinite different levels.

how do we find the middle 50% of data

subtracting the first quartile (the 25% mark) from the third quartile (the 75% mark): Middle 50% = Q3 - Q1.

Which of the following R code would quickly help you find the number of states in which McCain won the 2008 election, and the number of states in which Obama won?

tally(~ Pres2008, data = USStates)

five-number summary

the minimum, Q1, the median, Q3, and the maximum
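
A base R sketch of the five-number summary. The data are made up, and quantile() with its default method is used here as a stand-in that needs no extra packages.

```r
# Hypothetical thumb lengths in millimeters (made-up values)
thumb <- c(52, 55, 58, 60, 61, 63, 64, 66, 70, 75, 90)

# With no probabilities given, quantile() returns the
# minimum, Q1, median, Q3, and maximum
five_numbers <- quantile(thumb)
print(five_numbers)

# The three-number summary is the first, middle, and last of these
three_numbers <- c(min(thumb), median(thumb), max(thumb))
print(three_numbers)
```

Here the median sits much closer to the minimum than to the maximum, which hints at a right-skewed distribution.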

mode of the distribution

the most frequent score.

A scatterplot is a common way to show

the relationship between an outcome variable and an explanatory variable.

The outcome variable

the variable whose variation we are trying to explain.

The explanatory variables

the variables we use to explain variation in the outcome variable.

density

think proportion of cases

In our example about Thumb length and Sex, which is the outcome variable?

thumb

inferential statistics

use descriptive statistics plus probability to draw conclusions (inferences) about a population and/or processes or relations between variables

explaining variation

what we mean is that knowing someone's value for the predictor will help us make a better prediction for their value for the outcome.

Distributions of categorical variables are best explored

with frequency tables (tallies) and bar graphs

Which axis represents the variable in these histograms?

x-axis

