Stats 1430
Independent variable
"factor" - item being changed/studied
Dependent variable
"response" - outcome being measured
measures of center: mean
'x-bar' n= sample size (n=u)
Two challenges in getting good survey results
(1) Select a good sample (2) Collect good data
How to spot/avoid biased samples
(1) a sampling procedure must be used (2) the sample must represent the entire population (truly random)
Experiment: making comparisons
- differences in the response must be due to either treatment or random chance
Observational study
- we observe and record
Measures of variability: standard deviation
- xi values represent the data values - use to measure concentration of the data around the mean - use to compare data sets - do not calculate by hand
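Since the notes say not to calculate the standard deviation by hand, here is a minimal Python sketch of how software might do it. The data values are made up for illustration, and the sketch assumes the sample version of the formula (dividing by n - 1), which is what most intro courses and Python's `statistics.stdev` use.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data values (the xi)

n = len(data)
x_bar = sum(data) / n  # the mean, 'x-bar'

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1 (assumed convention).
s_by_formula = math.sqrt(sum((xi - x_bar) ** 2 for xi in data) / (n - 1))

# The standard library gives the same value.
s_by_library = statistics.stdev(data)

print(x_bar, s_by_formula, s_by_library)
```

A smaller result means the data are more concentrated around the mean.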
Interpreting a scatterplot
-describe the relationship between X and Y - simplest general pattern: linear - direction: negative and positive - strength: how closely the data follow the pattern
If you add 10 to every value of a data set, which of the following will also increase by 10?
Both the median and mean will increase by 10
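A quick way to check this answer is to shift a small data set and compare. The sketch below uses made-up values; it also shows that the standard deviation does not change, because adding a constant moves the data without spreading it out.

```python
import statistics

data = [3, 7, 8, 12, 20]            # hypothetical data set
shifted = [x + 10 for x in data]    # add 10 to every value

mean_change = statistics.mean(shifted) - statistics.mean(data)
median_change = statistics.median(shifted) - statistics.median(data)
sd_change = statistics.stdev(shifted) - statistics.stdev(data)

print(mean_change, median_change, sd_change)
```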
Boxplot A and Boxplot B are drawn on the same axes. If Boxplot A is shorter in length than boxplot B, it also has to contain less data than Boxplot B
False
         Smoker   Nonsmoker   Total
Male       125      _____      376
Female     104       233       337
Total      229       484       713
Above is a two-way table examining the relationship between gender and whether or not a person smokes. What is the marginal distribution of gender?
Females: 337 / 713 = 0.473 Males: 376 / 713 = 0.527
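The arithmetic above can be sketched in Python using only the row totals and the grand total from the table. The variable names are illustrative.

```python
# Row totals and grand total taken from the two-way table.
row_totals = {"Male": 376, "Female": 337}
grand_total = 713

# Marginal distribution of gender: each row total divided by the grand total.
marginal = {gender: round(count / grand_total, 3)
            for gender, count in row_totals.items()}

print(marginal)
```

The marginal distribution ignores the smoking variable entirely; only the totals in the margin of the table are used.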
Your boss gives you the following regression equation. Selling price = $5,240 + $33.80 (Number of Square Feet). What is the correct interpretation of the slope of this equation?
For every additional square foot, we expect a home's selling price to increase by $33.80.
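To see the slope interpretation in action, the sketch below plugs two house sizes into the equation from the question. The sizes (1,500 and 1,501 square feet) are hypothetical and chosen only to show a one-square-foot difference.

```python
INTERCEPT = 5240.0   # dollars, from the regression equation
SLOPE = 33.80        # dollars per square foot, from the regression equation

def predicted_price(square_feet):
    """Predicted selling price in dollars for a home of the given size."""
    return INTERCEPT + SLOPE * square_feet

# One additional square foot changes the prediction by the slope.
difference = predicted_price(1501) - predicted_price(1500)
print(predicted_price(1500), difference)
```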
Which of the following summary measures can be directly calculated from a boxplot?
IQR
measures of variability: interquartile range
IQR - distance taken up by the middle 50% of the data - if high concentration of data in the middle, IQR is small - need to divide the data into quarters to find it
Quartiles and the IQR
Q1 = 1st quartile = 25th percentile (25% of the data below); Q2 = 2nd quartile = 50th percentile = the median, M (in the middle); Q3 = 3rd quartile = 75th percentile (25% of the data above); IQR = Q3 - Q1
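A minimal sketch of finding the quartiles and the IQR follows. It assumes one common textbook convention: Q1 is the median of the lower half and Q3 is the median of the upper half, leaving out the middle value when n is odd. Software packages may use slightly different rules, so small differences are possible. The data are made up.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle value when n is odd (assumed convention).
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0],
            statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper),
            ordered[-1])

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # hypothetical data set
summary = five_number_summary(data)
iqr = summary[3] - summary[1]           # IQR = Q3 - Q1
print(summary, iqr)
```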
How do residuals relate to SSE
SSE= sum of squares of residuals (errors)
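As a small illustration, the sketch below computes residuals and then the SSE. The observed and predicted values are hypothetical.

```python
observed_y = [3.0, 5.0, 4.0, 8.0]    # hypothetical observed y values
predicted_y = [2.5, 4.5, 5.5, 7.5]   # hypothetical predictions (y-hat)

# Residual = observed y - predicted y.
residuals = [y - y_hat for y, y_hat in zip(observed_y, predicted_y)]

# SSE = sum of the squared residuals.
sse = sum(e ** 2 for e in residuals)
print(residuals, sse)
```

Squaring keeps positive and negative residuals from canceling out, so a smaller SSE means the line sits closer to the points overall.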
If a data set is skewed to the left, how will the mean and median compare?
The mean will be less than the median
An experimenter compares a single brand of popcorn to see how much popcorn is popped using different time settings on the same microwave. The time settings are 1.5 minutes, 2 minutes, 2.5 minutes and 3 minutes. In this situation, what is the factor?
Time setting
A STAT 1430 student is interested in examining the relationship between the number of bedrooms in a home and its selling price. After downloading a valid data set from the internet, the student creates a scatterplot and calculates the correlation. The correlation value they calculate is 0.67. This implies that the selling price of a house tends to increase as the number of bedrooms increases.
True
If there are a few very large values in a data set compared to the rest of the data, the mean will be larger than the median.
True
The mean is influenced by outliers (values that are much larger or much smaller than the rest of the data.)
True
In thinking about the 5-number summary, the percentage of data below Q1 and above Q3 combined is the same as the percentage of data in the IQR.
True
Bob wants to do a telephone survey based on 100 people. Knowing that some people won't answer the phone, he selects a random sample of 200 names to be safe, so if someone isn't home, he can just call the next person on the list. He continues this way until he gets 100 responses. Will this sampling method create bias in Bob's data?
Yes
A manager of a retail store is interested in the relationship between a person's annual income and their total purchase amount. Could he measure this relationship by finding the correlation?
Yes, because income and total purchase amount are quantitative variables
Which of the following best describes a confounding variable?
a variable you did not include in the study that may have had an effect on the results
The personnel department keeps records on all employees in a company. Here is the information they keep in one of their data files: Employee identification number Last name First name Middle initial Department Number of years with the company Salary Education (coded as high school, some college, or college degree) Age Which of the following combinations of variables would be appropriate to examine with a scatterplot?
age and salary
Biased Sample: Volunteer
aka self-selected example: call-in polls, web surveys issues: - no sampling procedure is used - the sample won't represent any population
best y-intercept
b0 = y-bar - b1 * x-bar (the mean of y minus the slope times the mean of x)
best slope equation
b1 = r * (Sy / Sx), where r is the correlation and Sy, Sx are the standard deviations of y and x
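The best-slope and best-intercept formulas can be tried out on a small example. This is a sketch with made-up data; it computes r directly from its definition rather than relying on a library function, and all variable names are illustrative.

```python
import statistics

x = [1, 2, 3, 4, 5]   # hypothetical x values
y = [2, 4, 5, 4, 5]   # hypothetical y values

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Correlation: sum of products of deviations, scaled by (n - 1) * Sx * Sy.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x       # best slope
b0 = y_bar - b1 * x_bar  # best y-intercept
print(r, b1, b0)
```

Note that the slope must be found first, because the intercept formula uses it.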
A conditional distribution summarizes the information from one variable ONLY, without considering ANY information from another variable.
false
A flat histogram contains no variability whatsoever, according to our definition
false
A researcher is trying to determine the January temperature in regions of the United States using the degrees of latitude. After collecting data, she creates a scatterplot. Given the relationship the researcher is trying to predict, the latitude is the dependent variable and the temperature is the independent variable.
false
Changing the number of bins will never change the shape of a histogram
false
If the correlation coefficient, r, between two variables is 0, we can conclude that there is no relationship between the two variables.
false
If you switch X and Y the sign of the correlation changes.
false
Outliers significantly affect the value of the median
false
Suppose the correlation between X = price of a gallon of gasoline and Y = price of a gallon of milk is r = .40. Then the correlation between the price of a HALF gallon of milk and the price of a HALF gallon of gas must be r = .4/2 = .20.
false
The units of r, the correlation coefficient, are the same as the X variable.
false
Your boss gives you the following regression equation. Selling price = $5,240 + $33.80 (Number of Square Feet). It makes sense to interpret the Y-intercept for this equation.
false
Examining residuals
if a line fits well, residuals should have: - no pattern (should have random scatter about the regression line) - no systematic change as X increases (ex: y values fan out as X increases) - no unusually large values of a residual (outlier in the Y direction) - no influential points (outlier in the X direction)
measures of center: median
notation is 'M' - the number that splits the ordered data in half
correlation
notation: r formula: r interpretation: how x and y move together compared to separately
A veterinarian collects data on 100 of his patients who come in every year for their annual check-ups. After 5 years, he compares the health status of the dogs to the cats. What type of study is this?
observational study
properties of correlation
r (1) 2 quantitative variables only (2) linear relationship only (3) r has no units (4) switching x and y, r does not change (5) r is affected by outliers and skewness ( bc formula for r includes means, SDs, which are affected by outliers/skewness)
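Two of these properties (no units, and switching x and y does not change r) can be checked numerically. The sketch below uses made-up gas and milk prices; `pearson_r` is a helper written for this illustration, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Correlation between two quantitative variables of equal length."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

gas = [2.9, 3.1, 3.4, 3.8, 4.0]    # hypothetical price per gallon of gas
milk = [3.5, 3.3, 3.9, 3.6, 4.1]   # hypothetical price per gallon of milk

r = pearson_r(gas, milk)
r_swapped = pearson_r(milk, gas)   # switch x and y
# Price per HALF gallon: every value is divided by 2.
r_half = pearson_r([g / 2 for g in gas], [m / 2 for m in milk])

print(r, r_swapped, r_half)
```

All three values agree (up to rounding), which is why halving the units does not halve the correlation.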
Residual
residual = observed y - predicted y = y - y-hat
Which of the following is not one of the criteria for a good experiment? - avoid or minimize bias - make comparisons - collect enough data - select a random sample of individuals to participate
select a random sample of individuals to participate
Suppose your data represent revenues from a group of 20 stores in a retail chain across the country, and revenue is measured in millions of dollars. The first quartile of this data set would also be measured in millions of dollars.
true
Your boss gives you the following regression equation. Selling price = $5,240 + $33.80 (Number of Square Feet). The residuals have units of dollars.
true
Displaying the data distribution for categorical data
variable: gender, region, opinion, ... relative frequencies: % in each category - table, pie chart - bar graph showing % in each category (relative frequency bar graph) Frequencies: # in each category - table - bar graph showing number in each category (frequency bar graph)
Notes on histograms
- # of bars can affect the graph - starting point can affect the graph
coefficient of determination
- % of variability in y that is explained by x notation: R^2 compared with r: R^2=r^2 (correlation squared) - shows how well the model fits
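The relationship R^2 = r^2 can be checked on a small example. The sketch below uses made-up data, fits the least-squares line, and computes R^2 as 1 - SSE/SST (SST, the total sum of squares around the mean of y, is a name introduced here for illustration).

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical x values
y = [2, 4, 5, 4, 5]   # hypothetical y values

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
sst = sum((yi - y_bar) ** 2 for yi in y)   # total variability in y

r = sxy / math.sqrt(sxx * sst)             # correlation
b1 = sxy / sxx                             # least-squares slope
b0 = y_bar - b1 * x_bar                    # least-squares intercept

# SSE: variability in y left unexplained by the line.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - sse / sst                  # share of variability explained

print(r ** 2, r_squared)
```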
Variability
- 'concentration around the center'
Biased Sample: Undercoverage
- a subgroup of the population is excluded from the very beginning ex: asking OSU students in dorms about tuition issues: - sampling procedure is used - can only represent the remaining population without the subgroup
Selecting a good sample
- all good samples are RANDOM
Designing a good survey: implementation (response bias)
- an individual in the sample responds but doesn't give the correct data ex: "Have you ever texted while driving?" What can you do to avoid this? - anonymity (can't link you to data) - confidentiality (won't link you to data)
Designing a good survey: implementation (nonresponse)
- an individual is selected to be in the sample but doesn't respond to the survey - respondents likely have stronger opinions
Watch for Simpson's Paradox
- break down results further, not just by one variable - look for lurking variables and collect data on them also - which is the most informed data set? the one that is further broken down - this is a common phenomenon in the real world
Notes about boxplots
- can't tell what the sample size is - bigger boxes DON'T mean more data - boxplots can be horizontal or vertical - you can't see the mean on a boxplot - there is always 25% of the data in each section of the boxplot - can't tell the type of symmetry - easy to compare centers
Biased sample: convenience
- choose individuals the easiest way ex: going to the oval and asking OSU students... Issue: - sampling procedure is used (technically) BUT sample won't represent any population (no system for representing population)
Being successful with your survey...
- clarify the question - define your intended population - choose a good (RANDOM) sample - make sure your survey is well designed (avoid misleading questions, encourage truthful responses) - implement it well (watch timing, minimize non response, response bias) - analyze your data properly
Histogram
- divides data into contiguous groups on the number line and shows how many are in each group Horizontal axis: the variable you measured Vertical axis: number or percentage in each group
Random sample
- each group of the same size has the same chance of being selected as the sample - allow no favoritism by the sampler or the sampled (bias)
Measures of center
- either mean or median
Experiment
- give treatments and record the subjects' responses (compare to a control group) - any sizable enough difference is deemed to be due to treatment
Control Group
- group getting fake (placebo) or existing treatments
categorical data
- groups example: gender, region, district
Treatment group
- groups getting the treatment
data distribution
- listing all the possible values that occurred in data and how often they occurred
Comments on nonresponse
- look for high percentage of respondents instead of high number of respondents (response rate)
A good experiment ...
- makes comparisons - avoids bias - has enough data
Descriptive statistics and boxplots
- measure, interpret, compare measures of center and variability in data sets center: mean and median variability: standard deviation, quartiles boxplot: another graph of quantitative data
Histogram notes
- nice way to see the overall shape of a data set - see data broken down into small groups but hard to identify quartiles - can only get a rough idea of center or variability - hard to compare data sets
quantitative data
- numbers - counts (discrete), measurements (continuous) example: "how many students?"
Methods of collecting data
- observational studies - experiments
Boxplot
- one dimensional graph breaking the data into 4 equal parts (25% each) - "5-number summary" - min, Q1, Q2,Q3, and max - special area around Q1 to Q3 to indicate the middle 50% of the data - lines go out to max and min from there - can see immediately median, IQR and if data is skewed
interpreting the y-intercept (regression)
- only if appropriate! (1) must have data near x=0 (2) x=0 must make sense
Data displays: examining data distributions
- organize and summarize your data set as a first step in data exploration - organize data set using graphs - summarize data set using numbers (descriptive statistics)
Simpson's Paradox
- comparing on one variable originally gives one set of results, but these results are reversed when a 3rd variable gets involved
Joint ("and") distribution
- overall percentage in each cell - sums to one
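A minimal sketch of a joint distribution, using a hypothetical two-way table of counts (the group names and numbers are invented for illustration): each cell count is divided by the grand total, and the resulting proportions sum to one.

```python
# Hypothetical two-way table: (row category, column category) -> count.
counts = {
    ("A", "yes"): 20,
    ("A", "no"): 30,
    ("B", "yes"): 10,
    ("B", "no"): 40,
}

grand_total = sum(counts.values())

# Joint ("and") distribution: each cell divided by the grand total.
joint = {cell: count / grand_total for cell, count in counts.items()}

print(joint, sum(joint.values()))
```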
Simple Random Sample
- purpose: examine the entire population as it exists and take 1 sample ex: what percent of Americans have a certain occupation?; using random-digit dialing or Gallup poll
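Drawing a simple random sample might look like the sketch below. The population list, the sample size, and the seed are all hypothetical; the seed is fixed only so the example is repeatable.

```python
import random

# Hypothetical population of 1,000 labeled individuals.
population = [f"person_{i}" for i in range(1, 1001)]

rng = random.Random(1430)   # fixed seed (arbitrary) for repeatability

# Every group of 100 individuals has the same chance of being the sample.
sample = rng.sample(population, k=100)

print(sample[:5])
```

`sample` draws without replacement, so no individual can appear twice.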
Avoiding bias in an experiment
- randomly assign subjects to treatments - control for confounding variables - avoid experimenter and subject bias (double-blind study) - have enough data; good example n > 30 in each treatment group
Two-way tables
- relations in categorical data
What is the impact of bias?
- results are off in one direction or the other
properties of standard deviation
- same units as the original data - never negative - can equal zero - is affected by outliers and skewness
Doing a good survey
- select a good sample - design a survey that avoids bias - implement your survey to avoid bias - analyze your data properly
How to get a better survey response rate
- select a smaller sample and follow through - provide appropriate incentives (not bribes!)
boxplot notes
- shows skewed vs. symmetric shapes - limitation: does not show what type of symmetric shape - easy to determine center and variability - good for skewed data sets - easy to see quartiles but can't see any other breakdown - easy to compare data sets
Types of random samples
- simple random sample - stratified random sample - many others
Skewed right vs. skewed left
- skewed right (positively skewed) = most data to the left; mean > median - skewed left (negatively skewed) = most data to the right; mean < median
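The mean/median comparison can be seen with a small made-up data set. In the sketch below, one large value creates right skew; mirroring the data creates left skew.

```python
import statistics

# Hypothetical right-skewed data: most values are small, one is very large.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]

# Mirror the values to get a left-skewed version of the same shape.
left_skewed = [51 - v for v in right_skewed]

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)
left_mean = statistics.mean(left_skewed)
left_median = statistics.median(left_skewed)

print(right_mean, right_median)   # mean is pulled above the median
print(left_mean, left_median)     # mean is pulled below the median
```

The mean is pulled toward the long tail, while the median barely moves.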
What is Bias?
- systematic favoritism during sampling or data collection process
Designing a good survey: implementation (timing)
- the timing of a survey can affect the results ex: daytime home phone survey about job satisfaction, gun control survey after a shooting
Designing a good survey: Type of survey
- the type of survey you conduct can affect the results ex: land-line survey to get student opinions
Designing a good survey: question wording
- the wording of a survey question can affect the results ex: "don't you think" "should" ... can lead to bias good examples: "what's your opinion of...", "what do you think about..."
Displaying the distribution for quantitative data
- use graphs to show 3 important characteristics in a data set: shape, 'center' and variability - one way: histogram, box plot
Predicting Y using X; finding the best line
- use x to predict y using the "best" straight line use the model: y (hat) = b0+b1x b1= Slope b0 = y-intercept - smallest SSE (sum of squares for error)
Spotting/avoiding problems in pie charts and bar graphs
- watch the scale on bar graphs - always look for sample size
Suppose you have 4 data sets whose scatterplots all show possible linear relationships. The four data sets have correlations of -0.10, +0.25, -0.90, and +0.80, respectively. Which of the correlations shows the strongest linear relationship?
-0.90
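Strength depends on how far r is from 0, not on its sign, so the comparison is on absolute values. A short sketch using the four correlations from the question:

```python
correlations = [-0.10, 0.25, -0.90, 0.80]   # values from the question

# The strongest linear relationship has the largest absolute value of r.
strongest = max(correlations, key=abs)
print(strongest)
```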
Interpreting correlation
-1 <= r <= 1 - r at or beyond +/- .7: strong - between +/- .5 and +/- .7: moderate - +/- .3 or less: weak - r = 0: no linear association
Stratified Random Sample
-purpose: compare subgroups of the population equally ex: how do people with different occupations feel about the economy?; divide the population into subgroups (strata) of interest, choose a simple random sample from each subgroup
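A stratified random sample could be sketched as follows. The occupations, the members of each subgroup, the per-stratum sample size, and the seed are all hypothetical; `stratified_sample` is a helper written for this illustration.

```python
import random

# Hypothetical population, already divided into strata by occupation.
strata = {
    "teacher": [f"teacher_{i}" for i in range(1, 41)],
    "nurse": [f"nurse_{i}" for i in range(1, 61)],
    "engineer": [f"engineer_{i}" for i in range(1, 26)],
}

def stratified_sample(strata, per_stratum, rng):
    """Take a simple random sample of the same size from each stratum."""
    return {name: rng.sample(members, k=per_stratum)
            for name, members in strata.items()}

rng = random.Random(1430)   # fixed seed (arbitrary) for repeatability
sample = stratified_sample(strata, per_stratum=5, rng=rng)

for name, chosen in sample.items():
    print(name, chosen)
```

Sampling the same number from each stratum lets the subgroups be compared equally, even though their sizes in the population differ.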