Stats 1430 Midterm Rumsey
Is X related to Y? - Method 1
Compare the conditional distributions. If they are the same or close, there is no relationship. If they are different, there is a relationship; use percentages to explain it.
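A minimal sketch of Method 1, assuming a made-up two-way table of counts (the housing and car categories are hypothetical, not from the course notes). It computes the conditional distribution of Y within each X group so the percentages can be compared side by side.

```python
# Hypothetical counts (not from the course notes): housing (X) by car ownership (Y).
counts = {
    "dorm":       {"car": 10, "no car": 40},
    "off campus": {"car": 30, "no car": 20},
}

def conditional_distribution(row):
    """Percentage of each Y value within one X group."""
    total = sum(row.values())
    return {y: 100 * n / total for y, n in row.items()}

conditionals = {x: conditional_distribution(row) for x, row in counts.items()}
for x, dist in conditionals.items():
    print(x, dist)

# A large gap between the groups suggests X and Y are related.
gap = abs(conditionals["dorm"]["car"] - conditionals["off campus"]["car"])
print("difference in % with a car:", gap)
```

How big a gap counts as "different" is a judgment call in this sketch; the card only says to compare the percentages.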
process for stratified random sample
Divide the population into subgroups (strata) of interest. Choose a simple random sample (usually the same size) from each subgroup.
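The two steps above can be sketched in Python. This is an illustration only: the population, the class-year strata, and the helper name `stratified_sample` are all hypothetical.

```python
import random

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Split the population into strata, then draw a simple random
    sample of the same size from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical population of 200 students: (id, class year) pairs.
years = ["freshman", "sophomore", "junior", "senior"]
population = [(i, years[i % 4]) for i in range(200)]

sample = stratified_sample(population, lambda p: p[1], n_per_stratum=5, seed=1)
for year, members in sample.items():
    print(year, members)
```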
random sample
Each group of the same size has the same chance of being selected as the sample. ***** Allow no favoritism by the sampler or the sampled (bias)
complement rule simplified
Everyone who doesn't have characteristic A
T or F, if r = 0 no relationship?
False; there's just no linear relationship
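A small sketch of why r = 0 does not rule out a relationship, using made-up data where Y is exactly X squared. The `correlation` helper is written out by hand for illustration; it uses the sum-of-products form of r, which is assumed to be equivalent to the formula used in the course.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect curved (not linear) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print("r =", r)  # essentially zero even though Y is fully determined by X
```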
Confidentiality -
I can track you but I won't
Anonymity -
I can't track you
When finding the correlation between two quantitative variables, you will get the same answer if you switch X and Y. Explain briefly.
If you switch the X's and Y's around in the entire formula you get the same answer, by commutative property of multiplication.
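A quick check of the symmetry claim on made-up data. The hand-written `correlation` helper is an illustration; the product of the X and Y deviations is the only place the two variables meet, so swapping them leaves every term unchanged.

```python
from math import sqrt, isclose

def correlation(xs, ys):
    """Pearson correlation r; the X and Y deviations are multiplied together."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_xy = correlation(xs, ys)
r_yx = correlation(ys, xs)
print(r_xy, r_yx, isclose(r_xy, r_yx))
```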
Two questions that determine whether the y-intercept can be interpreted (if yes to both, it can be interpreted)
Is there data in the area around x = 0? Does the number make sense in context?
Correlation is affected by outliers. Explain why, briefly
Looking at the formula for r, correlation is based on the mean of X, the mean of Y, the SD of X, and the SD of Y. All four of these items are affected by outliers
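To illustrate the ingredients named above, here is a sketch with made-up X values showing how one outlier moves both the mean and the SD. The same thing would happen to the Y values, which is why r itself can shift.

```python
from statistics import mean, stdev

# Hypothetical X values, with and without a single outlier.
xs = [1, 2, 3, 4, 5]
xs_with_outlier = xs + [50]

print("mean:", mean(xs), "->", mean(xs_with_outlier))
print("SD:  ", stdev(xs), "->", stdev(xs_with_outlier))
```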
Marginal distribution =
Looks at one variable at a time (out of grand table)
a good experiment (3 points)...
Makes comparisons; avoids bias; has enough data
histogram
Nice way to see the overall shape of a data set and spot patterns. Shows data broken down into small groups, but quartiles are hard to identify. Gives only a rough idea of center or variability. Hard to compare data sets in detail; good for the big picture.
If line fits well residuals should have:
No pattern: random scatter about the regression line. No systematic change as X increases (for example, Y values fanning out as X increases). No unusually large residuals (outliers in the Y direction). No influential points (outliers in the X direction).
Observational Studies
Observes individuals; measures variables and makes conclusions and comparisons; does not attempt to influence the responses
joint (and) distribution
Overall percentage in each cell (out of the grand total); sums to one
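A sketch of the joint distribution, and of the marginal distribution from the earlier card, assuming a made-up table of counts (the housing and car categories are hypothetical).

```python
from math import isclose

# Hypothetical cell counts: (housing, car ownership).
counts = {
    ("dorm", "car"): 10,
    ("dorm", "no car"): 40,
    ("off campus", "car"): 30,
    ("off campus", "no car"): 20,
}
grand_total = sum(counts.values())

# Joint distribution: each cell divided by the grand total.
joint = {cell: n / grand_total for cell, n in counts.items()}
print("joint:", joint)
print("joint sums to:", sum(joint.values()))

# Marginal distribution of housing: add the joint cells across car ownership.
marginal_housing = {}
for (housing, _), p in joint.items():
    marginal_housing[housing] = marginal_housing.get(housing, 0) + p
print("marginal (housing):", marginal_housing)
```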
A and B are disjoint if
P(A and B) = 0
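As an illustration, assume one roll of a fair six-sided die (a hypothetical example, not from the notes). Two events with no outcomes in common have an empty intersection, so the "and" probability is zero.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (assumed)
A = {2, 4, 6}                       # roll is even
B = {1, 3}                          # roll is 1 or 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_a_and_b = prob(A & B)
print("P(A and B) =", p_a_and_b)
print("disjoint:", A.isdisjoint(B))
```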
stratified random sample
Purpose: Compare subgroups of the population equally
simple random sample
Purpose: Examine the entire population as it exists
IQR=
Q3-Q1
4 steps to a good survey
Select a good sample; design a survey that avoids bias; implement your survey to avoid bias; analyze your data properly
Biased Sample: Volunteer (aka) & issue
Self-selected; a call goes out and people enter the study on their own. Issue: no sampling procedure is used, so the sample won't represent any population; you usually get mostly strong opinions.
boxplots
Shows skewed vs. symmetric shapes. Limitation: does not show what type of symmetric shape. Easy to determine center and variability. Good for skewed data sets. Easy to see quartiles but can't see any other breakdown. Easy to compare data sets.
Most common observational study:
Surveys
bias
Systematic favoritism in one direction or the other
conditional distribution=
Take 1 value of 1 variable and break into groups by the other variable
complement rule
The complement of A is the set of all outcomes in S which are not included in A.
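A short sketch of the complement as a set operation, again assuming one roll of a fair die as a hypothetical sample space S. It also checks the usual consequence that P(not A) = 1 - P(A).

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # assumed sample space: one roll of a fair die
A = {1, 2}               # hypothetical event: roll is 1 or 2

A_complement = S - A     # all outcomes in S that are not in A

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

print("complement of A:", A_complement)
print("P(A) =", prob(A), " P(not A) =", prob(A_complement))
```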
Differences in the response must be due to
The treatment or random chance
Simpson's paradox
When you look at 2 variables you get one relationship but adding a third variable reverses the relationship
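A sketch of the reversal using illustrative counts (not from the course notes). The two treatments and the severity groups are placeholders: treatment A does better within each severity group, yet treatment B looks better once the groups are pooled.

```python
# Illustrative counts: (successes, patients) for each treatment and severity.
results = {
    "treatment A": {"mild": (81, 87), "severe": (192, 263)},
    "treatment B": {"mild": (234, 270), "severe": (55, 80)},
}

def rate(successes, total):
    return successes / total

def overall_rate(groups):
    """Success rate after pooling the severity groups (ignoring the third variable)."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return successes / total

for name, groups in results.items():
    by_group = {g: round(rate(*cell), 3) for g, cell in groups.items()}
    print(name, by_group, "overall:", round(overall_rate(groups), 3))
```

In this sketch the reversal happens because treatment A was given mostly to severe cases, so severity acts as the third variable.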
what variable should be on the x-axis of your scatterplot
variable you're using for prediction
Statistical significance:
when a result is too large to be reasonably explained by chance alone (in our opinion)
Experiments
In an experiment the researcher actually gets involved: imposes treatments, uses controls, and observes the results (at the end you can try to figure out why you got what you got)
confounding variables (aka)
lurking; variables operating in the background that can influence the results and that you didn't account for
Is correlation affected by outliers and skewness?
yes
It is possible for the first and second quartiles of a data set to be the same
yes
What happens to standard deviation if we add the same number to all values in the data set?
stays the same
standard deviation can equal zero?
true
Direction
uphill or downhill from left to right
Frequencies:
# in each category
Relative frequencies:
% in each category
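A minimal sketch of frequencies versus relative frequencies, using a made-up list of survey answers.

```python
from collections import Counter

# Hypothetical categorical responses.
responses = ["yes", "no", "yes", "yes", "no", "undecided", "yes", "no"]

frequencies = Counter(responses)                   # number in each category
relative = {category: 100 * count / len(responses) # percent in each category
            for category, count in frequencies.items()}

print("frequencies:", dict(frequencies))
print("relative frequencies (%):", relative)
```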
slope=
(change in y) / (change in x); the change in y when x increases by 1
correlation can be _____ <=r<= ____
-1,1
If all the residuals from a regression line are equal to zero, what is R^2
1
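A sketch that checks this on made-up points lying exactly on a line. It assumes the form R^2 = 1 - SSE/SST, which for a simple least-squares line agrees with squaring r; the variable names are illustrative.

```python
from math import isclose

# Hypothetical points that fall exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)        # unexplained variation
sst = sum((y - mean_y) ** 2 for y in ys)    # total variation in y
r_squared = 1 - sse / sst

print("residuals:", residuals)
print("R^2 =", r_squared)
```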
What happens to standard deviation if you multiply all values by 10? New SD=_________
10 x Old SD
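A quick numerical check of this card and of the earlier card on adding the same number to every value, using made-up data and the sample SD from Python's `statistics` module.

```python
from math import isclose
from statistics import stdev

# Hypothetical data set.
data = [2, 4, 4, 5, 7, 8]

shifted = [x + 5 for x in data]    # add the same number to every value
scaled = [x * 10 for x in data]    # multiply every value by 10

print("original SD:", stdev(data))
print("after adding 5:", stdev(shifted))        # unchanged
print("after multiplying by 10:", stdev(scaled))  # 10 times the original
```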
For a correlation __ quantitative variables are needed
2
spot/avoid biased sample, Big Ideas - 2 criteria
A sampling procedure must be used. The sample must represent the entire population (truly random!)
Biased Sample: Undercoverage & issue
A subgroup of the population is excluded from the very beginning. Issue: a sampling procedure is used, but the sample can only represent the remaining population without the subgroup.
complement rule notation
A^c or A'
Implementation: Response Bias
An individual in the sample responds but doesn't give the correct data
implementation: nonresponse
An individual is selected to be in the sample but doesn't respond to the survey
interpretation of correlation
strength + direction of linear relationship between x and y
biased sample: convenience & issue with it
Choose individuals in the easiest way. Issue: a sampling procedure is used (technically), but the sample won't represent any population.
Is X related to Y? - Method 2
Compare conditional distribution to marginal (overall) distribution.
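A sketch of Method 2 on a made-up table of counts (hypothetical housing and car categories). If X and Y were unrelated, the percentage with a car inside each housing group would sit close to the overall (marginal) percentage.

```python
# Hypothetical counts: housing (X) by car ownership (Y).
counts = {
    "dorm":       {"car": 10, "no car": 40},
    "off campus": {"car": 30, "no car": 20},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal (overall) percentage with a car, ignoring housing.
marginal_car = 100 * sum(row["car"] for row in counts.values()) / grand_total

# Conditional percentage with a car within each housing group.
conditional_car = {x: 100 * row["car"] / sum(row.values())
                   for x, row in counts.items()}

print("overall % with a car:", marginal_car)
for x, pct in conditional_car.items():
    print(f"{x}: {pct}% with a car (differs from overall by {pct - marginal_car})")
```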
Data Distribution*****-
all possible values and how often they occur (it's a list showing how data is distributed)
How to avoid response bias
anonymity and confidentiality
Quantitative Data-
counts and measurements
2 things to interpret scatterplot
direction & strength
Independent Variable (aka)
factor; the variable that you're changing to see what happens as a result (the variable being compared)
Control Group-
fake or no treatment or existing treatment (give what is the current solution)
T or F, R squared gives you the percentage of points on the line??
false
T or F, correlation has units?
false
T or F, if A and B are disjoint, they can happen at the same time?
false
Categorical Data-
groups (ex: gender)
strength
how close to the line points are
Treatment Group (s)-
individual groups on which each treatment is imposed
if you switch x and y what happens to r
it doesn't change
Suppose everyone at Bob's restaurant gets a $5.00 raise per hour to their existing wages. How does this raise affect the Interquartile Range of the salaries?
it will stay the same
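A sketch of why the IQR does not move, using made-up hourly wages. Both quartiles shift up by the same $5.00, so their difference is unchanged. It uses `statistics.quantiles` with its default method; textbooks use slightly different quartile conventions, but the conclusion does not depend on the convention.

```python
from math import isclose
from statistics import quantiles

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

# Hypothetical hourly wages in dollars.
wages = [9.0, 10.0, 10.5, 11.0, 12.0, 14.0, 18.0]
raised = [w + 5.0 for w in wages]   # everyone gets a $5.00 raise

print("IQR before:", iqr(wages))
print("IQR after: ", iqr(raised))
```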
The "and" probability is also known as?
joint
if the _____ changes, the standard deviation changes
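mean (SD is measured as distance from the mean; exception: adding the same number to every value shifts the mean but leaves the SD unchanged)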
standard deviation is never _____
negative
Can you see the mean on a boxplot?
no
Can you tell what the sample size is from a boxplot?
no
in general, can you recreate the original data values from its histogram?
no
Don't be fooled by a high _________ of respondents. Look for a high __________ of respondents. This is called the ______ _____
number; percentage; response rate
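To make the distinction concrete, here is a sketch comparing two made-up surveys (all numbers are hypothetical). The first has far more respondents, but the second has the much higher response rate.

```python
def response_rate(responded, selected):
    """Percentage of the selected sample that actually responded."""
    return 100 * responded / selected

# Hypothetical surveys.
big_survey = {"selected": 100_000, "responded": 5_000}
small_survey = {"selected": 500, "responded": 400}

for name, s in [("big survey", big_survey), ("small survey", small_survey)]:
    print(name, "respondents:", s["responded"],
          "response rate:", response_rate(s["responded"], s["selected"]), "%")
```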
residual=
observed y - predicted y (y - ŷ)
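A minimal sketch of computing residuals, assuming a hypothetical fitted line with intercept 1.0 and slope 1.5 and a few made-up data points.

```python
# Hypothetical fitted line: predicted y = intercept + slope * x.
intercept = 1.0
slope = 1.5

# Hypothetical observed data.
xs = [1, 2, 3]
ys = [2.5, 3.5, 6.0]

predicted = [intercept + slope * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

for x, y, y_hat, e in zip(xs, ys, predicted, residuals):
    print(f"x={x}: observed {y}, predicted {y_hat}, residual {e}")
```

A positive residual means the point lies above the line; a negative one means it lies below.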
standard deviation is affected by ________ and _______
outliers; skewness
correlation
r
all good samples are _________
random
Dependent Variable (aka)
response; the variable that responds, that comes out of the experiment ("what happens?")
standard deviation has the _____ _____ as the original data
same units
A ________ is a set of all possible outcomes of some random process
sample space
Getting Good Survey Results - 2 challenges
select a good sample & collect good data
If high concentration of data in the middle, IQR is _______
small