Stats 1430

Independent variable

"factor" - item being changed/studied

Dependent variable

"response" - outcome being measured

measures of center: mean

notation: 'x-bar' - the sum of the data values divided by n, where n = sample size

Two challenges in getting good survey results

(1) Select a good sample (2) Collect good data

How to spot/avoid biased samples

(1) a sampling procedure must be used (2) the sample must represent the entire population (truly random)

Experiment: making comparisons

- differences in the response must be due to either treatment or random chance

Observational study

- we observe and record

Measures of variability: standard deviation

- xi values represent the data values - use to measure concentration of the data around the mean - use to compare data sets - do not calculate by hand
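Since the card recommends software over hand calculation, here is a minimal sketch using Python's standard `statistics` module. The two data sets are hypothetical; they share the same mean, so the standard deviation is what distinguishes how tightly each one clusters around it.

```python
from statistics import mean, stdev

# Hypothetical data sets with the same center but different spread
tight = [48, 49, 50, 51, 52]
spread = [30, 40, 50, 60, 70]

mean_tight, mean_spread = mean(tight), mean(spread)

# stdev() is the sample standard deviation (divides by n - 1)
sd_tight = stdev(tight)
sd_spread = stdev(spread)

print(mean_tight, mean_spread)  # same center
print(sd_tight, sd_spread)      # smaller SD = data more concentrated around the mean
```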

Interpreting a scatterplot

-describe the relationship between X and Y - simplest general pattern: linear - direction: negative and positive - strength: how closely the data follow the pattern

If you add 10 to every value of a data set, which of the following will also increase by 10?

Both the median and mean will increase by 10
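A quick way to check this answer is to shift a small data set and compare. The values below are hypothetical; the sketch also shows that the standard deviation, a measure of spread, is unaffected by the shift.

```python
from statistics import mean, median, stdev

data = [2, 4, 4, 7, 13]               # hypothetical values
shifted = [x + 10 for x in data]      # add 10 to every value

mean_shift = mean(shifted) - mean(data)
median_shift = median(shifted) - median(data)
sd_change = stdev(shifted) - stdev(data)

print(mean_shift, median_shift, sd_change)
```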

Boxplot A and Boxplot B are drawn on the same axes. If Boxplot A is shorter in length than boxplot B, it also has to contain less data than Boxplot B

False

          Smoker   Nonsmoker   Total
Male      125      _____       376
Female    104      233         337
Total     229      484         713
Above is a two-way table examining the relationship between gender and whether or not a person smokes. What is the marginal distribution of gender?

Females: 337 / 713 = 0.473 Males: 376 / 713 = 0.527
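The same arithmetic can be sketched in Python. Only the row totals and the grand total from the table are needed; the variable names are illustrative.

```python
# Row totals and grand total taken from the two-way table
row_totals = {"Male": 376, "Female": 337}
grand_total = 713

# Marginal distribution of gender: each row total divided by the grand total
marginal = {group: round(total / grand_total, 3)
            for group, total in row_totals.items()}

print(marginal)
```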

Your boss gives you the following regression equation. Selling price = $5,240 + $33.80 (Number of Square Feet). What is the correct interpretation of the slope of this equation?

For every additional square foot, we expect a home's selling price to increase by $33.80.
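To see the slope interpretation numerically, the equation can be wrapped in a small function (the name `predicted_price` is just illustrative) and evaluated at two sizes one square foot apart. The sizes 1500 and 1501 are arbitrary.

```python
def predicted_price(square_feet: float) -> float:
    """Predicted selling price in dollars from the boss's regression equation."""
    return 5240 + 33.80 * square_feet

# One extra square foot changes the prediction by the slope
difference = predicted_price(1501) - predicted_price(1500)

print(round(difference, 2))
```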

Which of the following summary measures can be directly calculated from a boxplot?

IQR

measures of variability: interquartile range

IQR - distance taken up by the middle 50% of the data - if high concentration of data in the middle, IQR is small - need to divide the data into quarters to find it

Quartiles and the IQR

Q1 = 1st quartile = 25th percentile (25% of the data below)
Q2 = 2nd quartile = 50th percentile = the median, M (in the middle)
Q3 = 3rd quartile = 75th percentile (25% of the data above)
IQR = Q3 - Q1
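A minimal sketch of finding the quartiles and the IQR with Python's `statistics.quantiles`. The data are hypothetical, and the `"inclusive"` method is an assumption: textbooks and software use slightly different quartile conventions, so results on other data sets may differ a little from a by-hand answer.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical ordered data

# n=4 splits the data into quarters and returns the three cut points
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                         # IQR = Q3 - Q1

print(q1, q2, q3, iqr)
```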

How do residuals relate to SSE

SSE= sum of squares of residuals (errors)
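As a small illustration, the sketch below computes each residual (observed y minus predicted y) and then squares and sums them. Both the data points and the line y-hat = 1 + 2x are hypothetical, chosen so the arithmetic is easy to follow.

```python
xs = [1, 2, 3, 4]
ys = [3.5, 4.5, 7.5, 8.5]     # hypothetical observed values
b0, b1 = 1.0, 2.0             # hypothetical line: y-hat = 1 + 2x

predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

# SSE = sum of the squared residuals
sse = sum(res ** 2 for res in residuals)

print(residuals, sse)
```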

If a data set is skewed to the left, how will the mean and median compare?

The mean will be less than the median

An experimenter compares a single brand of popcorn to see how much popcorn is popped using different time settings on the same microwave. The time settings are 1.5 minutes, 2 minutes, 2.5 minutes and 3 minutes. In this situation, what is the factor?

Time setting

A STAT 1430 student is interested in examining the relationship between the number of bedrooms in a home and its selling price. After downloading a valid data set from the internet, the student creates a scatterplot and calculates the correlation. The correlation value they calculate is 0.67. This implies that the selling price of a house tends to increase as the number of bedrooms increases.

True

If there are a few very large values in a data set compared to the rest of the data, the mean will be larger than the median.

True

The mean is influenced by outliers (values that are much larger or much smaller than the rest of the data.)

True

in thinking about the 5-number summary, the percentage of data below Q1 and above Q3 combined is the same as the percentage of data in the IQR.

True

Bob wants to do a telephone survey based on 100 people. Knowing that some people won't answer the phone, he selects a random sample of 200 names to be safe, so if someone isn't home, he can just call the next person on the list. He continues this way until he gets 100 responses. Will this sampling method create bias in Bob's data?

Yes

A manager of a retail store is interested in the relationship between a person's annual income and their total purchase amount. Could he measure this relationship by finding the correlation?

Yes, because income and total purchase amount are quantitative variables

Which of the following best describes a confounding variable?

a variable you did not include in the study that may have had an effect on the results

The personnel department keeps records on all employees in a company. Here is the information they keep in one of their data files: Employee identification number Last name First name Middle initial Department Number of years with the company Salary Education (coded as high school, some college, or college degree) Age Which of the following combinations of variables would be appropriate to examine with a scatterplot?

age and salary

Biased Sample: Volunteer

aka self-selected example: call-in polls, web surveys issues: - no sampling procedure is used - the sample won't represent any population

best y-intercept

b0 = y-bar - b1 * x-bar

best slope equation

b1 = r * (Sy / Sx)
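The slope and intercept formulas can be sketched together in Python. The data set is hypothetical, and the helper `correlation` is written out by hand here (averaging the products of standardized x and y values over n - 1) rather than taken from a library.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / sx) * ((y - y_bar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]          # hypothetical data
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
b1 = r * stdev(ys) / stdev(xs)        # slope: b1 = r * (Sy / Sx)
b0 = mean(ys) - b1 * mean(xs)         # intercept: b0 = y-bar - b1 * x-bar

print(round(b1, 3), round(b0, 3))
```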

A conditional distribution summarizes the information from one variable ONLY, without considering ANY information from another variable.

false

A flat histogram contains no variability whatsoever, according to our definition

false

A researcher is trying to determine the January temperature in regions of the United States using the degrees of latitude. After collecting data, she creates a scatterplot. Given the relationship the researcher is trying to predict, the latitude is the dependent variable and the temperature is the independent variable.

false

Changing the number of bins will never change the shape of a histogram

false

If the correlation coefficient, r, between two variables is 0, we can conclude that there is no relationship between the two variables.

false

If you switch X and Y the sign of the correlation changes.

false

Outliers significantly affect the value of the median

false

Suppose the correlation between X = price of a gallon of gasoline and Y = price of a gallon of milk is r = .40. Then the correlation between the price of a HALF gallon of milk and the price of a HALF gallon of gas must be r = .4/2 = .20.

false

The units of r, the correlation coefficient, are the same as the X variable.

false

Your boss gives you the following regression equation. Selling price = $5,240 + $33.80 (Number of Square Feet). It makes sense to interpret the Y-intercept for this equation.

false

Examining residuals

if a line fits well, residuals should have: - no pattern (should have random scatter about the regression line) - no systematic change as X increases (ex: y values fan out as X increases) - no unusually large values of a residual (outlier in the Y direction) - no influential points (outlier in the X direction)

measures of center: median

notation is 'M' - the number that splits the ordered data in half

correlation

notation: r - formula: r = [1 / (n - 1)] * sum of [(xi - x-bar) / Sx] * [(yi - y-bar) / Sy] - interpretation: how x and y move together compared to separately

A veterinarian collects data on 100 of his patients who come in every year for their annual check-ups. After 5 years, he compares the health status of the dogs to the cats. What type of study is this?

observational study

properties of correlation

r (1) 2 quantitative variables only (2) linear relationship only (3) r has no units (4) switching x and y, r does not change (5) r is affected by outliers and skewness ( bc formula for r includes means, SDs, which are affected by outliers/skewness)
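Properties (3) and (4) can be checked with a short sketch. The price data are hypothetical and the `correlation` helper is hand-written for illustration. Halving every value (as in the half-gallon question) only changes the units, so r stays the same, and swapping x and y leaves r unchanged as well.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / sx) * ((y - y_bar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

gas = [3.0, 3.2, 3.5, 3.9, 4.1]     # hypothetical price per gallon
milk = [3.6, 3.4, 3.9, 4.0, 4.4]    # hypothetical price per gallon

r_xy = correlation(gas, milk)
r_yx = correlation(milk, gas)                     # switch x and y
r_half = correlation([g / 2 for g in gas],        # change units to half gallons
                     [m / 2 for m in milk])

print(r_xy, r_yx, r_half)
```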

Residual

residual = observed y - predicted y = y - y-hat

Which of the following is not one of the criteria for a good experiment? - avoid or minimize bias - make comparisons - collect enough data - select a random sample of individuals to participate

select a random sample of individuals to participate

Suppose your data represent revenues from a group of 20 stores in a retail chain across the country, and revenue is measured in millions of dollars. The first quartile of this data set would also be measured in millions of dollars.

true

Your boss gives you the following regression equation. Selling price = $5,240 + $33.80 (Number of Square Feet). The residuals have units of dollars.

true

Displaying the data distribution for categorical data

variable: gender, region, opinion, ... relative frequencies: % in each category - table, pie chart - bar graph showing % in each category (relative frequency bar graph) Frequencies: # in each category - table - bar graph showing number in each category (frequency bar graph)

Notes on histograms

- # of bars can affect the graph - starting point can affect the graph

coefficient of determination

- % of variability in y that is explained by x notation: R^2 compared with r: R^2=r^2 (correlation squared) - shows how well the model fits
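The relationship R^2 = r^2 can be checked numerically for a least-squares line. In this sketch the data are hypothetical, and R^2 is computed as 1 - SSE/SST, where SST (the total sum of squares of y around y-bar) is a name introduced here for illustration.

```python
from statistics import mean

xs = [1, 2, 3, 4, 5]          # hypothetical data
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / (sxx * syy) ** 0.5          # correlation
b1 = sxy / sxx                        # least-squares slope
b0 = y_bar - b1 * x_bar               # least-squares intercept

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sst = syy                             # total variability in y
r_squared = 1 - sse / sst             # share of variability in y explained by x

print(round(r_squared, 3), round(r ** 2, 3))
```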

Variability

- 'concentration around the center'

Biased Sample: Undercoverage

- a subgroup of the population is excluded from the very beginning ex: asking OSU students in dorms about tuition issues: - sampling procedure is used - can only represent the remaining population without the subgroup

Selecting a good sample

- all good samples are RANDOM

Designing a good survey: implementation (response bias)

- an individual in the sample responds but doesn't give the correct data ex: "Have you ever texted while driving?" what can you do to avoid this? - anonymity (can't link you to data) - confidentiality (won't link you to data)

Designing a good survey: implementation (nonresponse)

- an individual is selected to be in the sample but doesn't respond to the survey - respondents likely have stronger opinions

Watch for Simpson's Paradox

- break down results further, not just by one variable - look for lurking variables and collect data on them also - which is the most informed data set? the one that is further broken down - this is a common phenomenon in the real world

Notes about boxplots

- can't tell what the sample size is - bigger boxes DON'T mean more data - boxplots can be horizontal or vertical - you can't see the mean on a boxplot - there is always 25% of the data in each section of the boxplot - can't tell the type of symmetry - easy to compare centers

Biased sample: convenience

- choose individuals the easiest way ex: going to the oval and asking OSU students... Issue: - sampling procedure is used (technically) BUT sample won't represent any population (no system for representing population)

Being successful with your survey...

- clarify the question - define your intended population - choose a good (RANDOM) sample - make sure your survey is well designed (avoid misleading questions, encourage truthful responses) - implement it well (watch timing, minimize non response, response bias) - analyze your data properly

Histogram

- divides data into contiguous groups on the number line and shows how many are in each group Horizontal axis: the variable you measured Vertical axis: number or percentage in each group

Random sample

- each group of the same size has the same chance of being selected as the sample - allow no favoritism by the sampler or the sampled (bias)

Measures of center

- either mean or median

Experiment

- give treatments to subjects and record/observe their responses (compare to a control group) - any sizable enough difference is deemed to be due to the treatment

Control Group

- group getting fake (placebo) or existing treatments

categorical data

- groups example: gender, region, district

Treatment group

- groups getting the treatment

data distribution

- listing all the possible values that occurred in data and how often they occurred

Comments on nonresponse

- look for high percentage of respondents instead of high number of respondents (response rate)

A good experiment ...

- makes comparisons - avoids bias - has enough data

Descriptive statistics and boxplots

- measure, interpret, compare measures of center and variability in data sets center: mean and median variability: standard deviation, quartiles boxplot: another graph of quantitative data

Histogram notes

- nice way to see the overall shape of a data set - see data broken down into small groups but hard to identify quartiles - can only get a rough idea of center or variability - hard to compare data sets

quantitative data

- numbers - counts (discrete), measurements (continuous) example: "how many students?"

Methods of collecting data

- observational studies - experiments

Boxplot

- one dimensional graph breaking the data into 4 equal parts (25% each) - "5-number summary" - min, Q1, Q2,Q3, and max - special area around Q1 to Q3 to indicate the middle 50% of the data - lines go out to max and min from there - can see immediately median, IQR and if data is skewed

interpreting the y-intercept (correlation)

- only if appropriate! (1) must have data near x=0 (2) x=0 must make sense

Data displays: examining data distributions

- organize and summarize your data set as a first step in data exploration - organize data set using graphs - summarize data set using numbers (descriptive statistics)

Simpson's Paradox

- an original comparison gives one set of results, but these results are reversed when a 3rd variable gets involved
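A small numeric sketch can make the reversal concrete. All counts below are hypothetical (successes, trials) for two made-up groups A and B, broken down by a third variable, case difficulty. A does better within each difficulty level, yet B looks better when the levels are combined.

```python
# Hypothetical (successes, trials) broken down by a third variable
results = {
    "A": {"easy": (9, 10), "hard": (30, 90)},
    "B": {"easy": (80, 90), "hard": (2, 10)},
}

def rate(successes, trials):
    return successes / trials

def overall_rate(group):
    """Success rate after collapsing over the third variable."""
    successes = sum(s for s, _ in results[group].values())
    trials = sum(t for _, t in results[group].values())
    return successes / trials

a_wins_each_subgroup = all(
    rate(*results["A"][level]) > rate(*results["B"][level])
    for level in ("easy", "hard")
)
b_wins_overall = overall_rate("B") > overall_rate("A")

print(a_wins_each_subgroup, b_wins_overall)
```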

Joint ("and") distribution

- overall percentage in each cell - sums to one
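As a sketch, the joint distribution for the gender-by-smoking table from the earlier card can be computed by dividing each cell by the grand total. The male nonsmoker count is left blank on that card; the value 251 used here is derived from the totals (376 - 125, which also equals 484 - 233).

```python
# Cell counts from the two-way table; 251 is derived from the row and column totals
counts = {
    ("Male", "Smoker"): 125,
    ("Male", "Nonsmoker"): 251,
    ("Female", "Smoker"): 104,
    ("Female", "Nonsmoker"): 233,
}

total = sum(counts.values())

# Joint distribution: each cell as a share of the grand total
joint = {cell: n / total for cell, n in counts.items()}

for cell, share in joint.items():
    print(cell, round(share, 3))
```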

Simple Random Sample

- purpose: examine the entire population as it exists and take 1 sample ex: what percent of Americans have a certain occupation?; using random-digit dialing or Gallup poll
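A minimal sketch of drawing one simple random sample with Python's `random` module. The population list, the sample size of 100, and the seed are all hypothetical; `sample()` selects without replacement, so no individual appears twice.

```python
import random

# Hypothetical population of 1000 labeled individuals
population = [f"person_{i}" for i in range(1000)]

rng = random.Random(42)                 # fixed seed only so the run is repeatable
sample = rng.sample(population, 100)    # every group of 100 is equally likely

print(sample[:5])
```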

Avoiding bias in an experiment

- randomly assign subjects to treatments - control for confounding variables - avoid experimenter and subject bias (double-blind study) - have enough data; good example: n > 30 in each treatment group
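Random assignment can be sketched by shuffling the subject list and splitting it in half. The function name `randomly_assign`, the subject labels, and the group size of 40 are all hypothetical choices for illustration.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)       # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

subjects = [f"subject_{i}" for i in range(80)]   # hypothetical volunteers
treatment, control = randomly_assign(subjects, seed=0)

print(len(treatment), len(control))
```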

Two-way tables

- relations in categorical data

What is the impact of bias?

- results are off in one direction or the other

properties of standard deviation

- same units as the original data - never negative - can equal zero - is affected by outliers and skewness

Doing a good survey

- select a good sample - design a survey that avoids bias - implement your survey to avoid bias - analyze your data properly

How to get a better survey response rate

- select a smaller sample and follow through - provide appropriate incentives (not bribes!)

boxplot notes

- shows skewed vs. symmetric shapes - limitation: does not show what type of symmetric shape - easy to determine center and variability - good for skewed data sets - easy to see quartiles but can't see any other breakdown - easy to compare data sets

Types of random samples

- simple random sample - stratified random sample - many others

Skewed right vs. skewed left

- skewed right (positively skewed) = most data to the left; mean > median - skewed left (negatively skewed) = most data to the right; mean < median
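The mean/median comparison can be illustrated with two small hypothetical data sets, one with a long right tail and one with a long left tail. The extreme value pulls the mean toward the tail while the median stays near the bulk of the data.

```python
from statistics import mean, median

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]         # hypothetical: one very large value
left_skewed = [1, 17, 18, 18, 19, 19, 19, 20]    # hypothetical: one very small value

right_mean, right_median = mean(right_skewed), median(right_skewed)
left_mean, left_median = mean(left_skewed), median(left_skewed)

print(right_mean, right_median)   # mean above the median
print(left_mean, left_median)     # mean below the median
```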

What is Bias?

- systematic favoritism during sampling or data collection process

Designing a good survey: implementation (timing)

- the timing of a survey can affect the results ex: daytime home phone survey about job satisfaction, gun control survey after a shooting

Designing a good survey: Type of survey

- the type of survey you conduct can affect the results ex: land-line survey to get student opinions

Designing a good survey: question wording

- the wording of a survey question can affect the results ex: "don't you think" "should" ... can lead to bias good examples: "what's your opinion of...", "what do you think about..."

Displaying the distribution for quantitative data

- use graphs to show 3 important characteristics in a data set: shape, 'center' and variability - one way: histogram, box plot

Predicting Y using X; finding the best line

- use x to predict y using the "best" straight line - model: y-hat = b0 + b1 x, where b1 = slope and b0 = y-intercept - the best line has the smallest SSE (sum of squares for error)

Spotting/avoiding problems in pie charts and bar graphs

- watch the scale on bar graphs - always look for sample size

Suppose you have 4 data sets whose scatterplots all show possible linear relationships. The four data sets have correlations of -0.10, +0.25, -0.90, and +0.80, respectively. Which of the correlations shows the strongest linear relationship?

-0.90

Interpreting correlation

-1 <= r <= 1
+/- .7 or beyond: strong
+/- .5 to +/- .7: moderate
+/- .3 or less: weak
r = 0: no linear association

Stratified Random Sample

-purpose: compare subgroups of the population equally ex: how do people with different occupations feel about the economy?; divide the population into subgroups (strata) of interest, choose a simple random sample from each subgroup
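The procedure described above (divide into strata, then take a simple random sample from each) can be sketched as follows. The occupation names, stratum sizes, and the function `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, per_stratum, seed=None):
    """Take a simple random sample of the same size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical population divided into subgroups (strata) by occupation
strata = {
    "teachers": [f"teacher_{i}" for i in range(50)],
    "nurses": [f"nurse_{i}" for i in range(30)],
    "engineers": [f"engineer_{i}" for i in range(20)],
}

sample = stratified_sample(strata, per_stratum=5, seed=1)

for name, chosen in sample.items():
    print(name, chosen)
```

Sampling the same number from every stratum matches the card's goal of comparing subgroups equally, even though the strata differ in size.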

