STATS 1430
Why is correlation affected by outliers and skewness?
Because it is based on mean and standard deviation, which are both affected by outliers and skewness
Which is more affected by skewness, the IQR or standard deviation?
Standard deviation; the IQR is resistant to skewness and outliers
What if you multiply all values by the same number?
SD is multiplied by the absolute value of that number
Stratified Random Sample
Compare subgroups of the population equally
Two Way Tables
Display visual relationships between the two categorical variables
Simple Random Sample
Examine the entire population as it exists
The difference between experiments and observational studies
Experiment intervenes
Displaying the Distribution for Quantitative Data
Histogram
IQR equation
IQR = Q3 - Q1
What do outliers and skewness do to pattern in correlation?
Makes pattern in data less strong
Center
Mean and median
What does best fit line do
Minimizes the sum of the squared residuals
Can the median be affected by outliers?
No, the median is resistant to outliers
What happens to SD if we add the same number to all values in the data set?
No effect on SD
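A minimal sketch of the two SD rules in these cards (adding a constant versus multiplying by a constant). The data values are made up for illustration, and `statistics.stdev` is the sample standard deviation.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]      # made-up values

shifted = [x + 10 for x in data]     # add the same number to every value
scaled = [x * 3 for x in data]       # multiply every value by the same number

sd = statistics.stdev(data)          # sample standard deviation
sd_shifted = statistics.stdev(shifted)
sd_scaled = statistics.stdev(scaled)

# Adding a constant slides the data over without changing the spread
print(math.isclose(sd_shifted, sd))
# Multiplying by 3 stretches the spread by a factor of 3
print(math.isclose(sd_scaled, 3 * sd))
```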
Multiplication Rule of Probability
P(A and B) = P(A) * P(B|A)
If you know two events are independent:
P(A and B) = P(A)*P(B)
"Or" probability P(A or B)
Probability that A or B or both occur
Joint "And" probability P(A and B)
Probability that both A and B occur -intersection of A and B
What is the impact of bias?
Reduces credibility of results
What type of sample would be used if we wanted to know "What % of all Americans are nurses?"
Simple Random Sample
Variability
Standard Deviation, Quartiles
r = +/- 0.7 in scatterplot
Strong relationship
Residual Plot
a scatterplot with x values on x axis and residuals on y axis; how far off were we at that x?
Double-blind study
an experiment where neither the experimenter nor the subjects know who got what treatment
Implementation: Response Bias
an individual in the sample responds but doesn't give the correct data
Implementation: Nonresponse
an individual is selected to be in the sample but doesn't respond to the survey
b0 equation
b0 = y-bar - b1(x-bar)
b1 equation
b1 = r(Sy/Sx)
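The slope and intercept formulas can be sketched as follows. The x and y values are made up, and r is computed by hand from its usual sample definition so the example needs only the standard library.

```python
import math
import statistics

x = [1, 2, 3, 4, 5]   # made-up explanatory values
y = [2, 4, 5, 4, 5]   # made-up response values

# The five numbers needed: x-bar, y-bar, SD of x, SD of y, and r
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

n = len(x)
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b1 = r * (s_y / s_x)       # slope
b0 = y_bar - b1 * x_bar    # y-intercept

print(f"y = {b0:.2f} + {b1:.2f}x")
```

To predict, plug an x value inside the observed range into `b0 + b1 * x`; plugging in an x outside that range would be extrapolation.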
better graph for skewed data
box plots
strength of boxplots
can compare different data sets on same scale
What is useful about the boxplot?
can immediately see median and IQR and whether or not data is skewed
weakness of boxplots
cannot tell exact shape of symmetrical data
descriptive statistics
center and variability
Convenience Biased Sample
choosing individuals in the easiest way
Variability definition
how concentrated or spread out the data are around the center
Residuals
observed y - predicted y (data - line)
Categorical Data
data falling into groups
Quantitative Data
data in which quantities are important; measurements and counts
If mean < median,
data is skewed left
If mean > median,
data is skewed right
Interpreting a Scatterplot
describe the relationship between x and y with the simplest general pattern, direction, strength
statistical significance
difference in treatment is decided to be due to more than random chance
Histogram
divides data into contiguous groups on the number line and shows how many are in each group -horizontal axis: the variable you measured -vertical axis: number or percentage in each group
Boxplots
graph based on five-number summary, another graph of quantitative data
IQR is small if
high concentration in the middle
better graph for symmetric data
histogram
Strength
how closely points follow a (line) pattern
smaller section in boxplot
less spread in data
Data Distribution
list of all possible values of the data and how often they occur
Probability of an event
long-term chance that it will occur
five number summary
minimum, Q1, median, Q3, maximum
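A short sketch of the five-number summary and the IQR on made-up data. Quartile conventions differ between textbooks and software; the `method="inclusive"` option of `statistics.quantiles` is just one of them, so hand calculations on other data sets may differ slightly.

```python
import statistics

data = sorted([7, 1, 30, 4, 10, 2, 12, 5, 8])   # made-up values with one high outlier

# Cut the data into four equal parts; the three cut points are Q1, median, Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1   # IQR = Q3 - Q1, the distance covered by the middle 50%

print(five_number_summary)
print(iqr)
```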
r = +/- 0.5 in scatterplot
moderate relationship
more variability
more data is farther from center
larger section in boxplot
more spread in data
r = 0
no linear relationship
Does correlation have units
no units
Sample Space
set of all possible outcomes
3 important characteristics in a data set
shape, center, variability
b1 in best fit line
slope
Event
subset of the sample space S
SSE
sum of squares for errors
Most common observational study
survey
Bias
systematic favoritism
least squares regression line
the "best-fit" line that is calculated by minimizing the sum of the squares of the differences between the observed and predicted values of the line
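To illustrate what "minimizing the sum of the squares" means, the sketch below compares the SSE of the least squares line with the SSE of a few nearby lines. The data and the nudged lines are made up for illustration; the slope is computed as Sxy/Sxx, which is algebraically the same as r(Sy/Sx).

```python
import math

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

def sse(b0, b1):
    """Sum of squared residuals (observed y minus predicted y) for the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx           # least squares slope
b0 = y_bar - b1 * x_bar    # least squares intercept

best = sse(b0, b1)
# SSE for lines with a slightly different intercept and/or slope
others = [
    sse(b0 + d0, b1 + d1)
    for d0 in (-0.5, 0.0, 0.5)
    for d1 in (-0.2, 0.0, 0.2)
    if (d0, d1) != (0.0, 0.0)
]

print(best, min(others))   # the least squares line has the smallest SSE
```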
If the data are symmetric,
the mean and median will be similar
If two events' conditional probabilities look the same,
they are independent events
extrapolation
trying to make a prediction outside of the range of data provided
r = +/- 0.3 in scatterplot
weak relationship
Sample mean =
x bar
Best fit line equation
y = b0 + b1x
b0 in best fit line
y-intercept
Can the mean be affected by outliers?
yes
When comparing Histograms and box plots, box plots:
-show skewed vs. symmetric shape, but do not show what type of symmetric shape -easy to determine center (median) and variability (IQR) -easy to see quartiles but can't see any other breakdown -easy to compare data sets on same scale -best for skewed data sets
Avoiding Bias
1. A sampling procedure must be used 2. The sample must represent the entire population (truly random)
Types of Biased Samples
1. Convenience 2. Volunteer (Self-Selected) 3. Undercoverage
Stratified Random Sample Process
1. Divide the population into subgroups (strata) of interest 2. Choose a simple random sample from each subgroup
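The two steps above can be sketched as follows. The population, the stratum names, and the per-stratum sample size are all hypothetical; `random.Random.sample` draws a simple random sample without replacement.

```python
import random

rng = random.Random(1430)   # fixed seed so the sketch is repeatable

# Hypothetical population: each individual is tagged with a subgroup
population = [("freshman", i) for i in range(500)] + [("senior", i) for i in range(100)]

# Step 1: divide the population into subgroups (strata)
strata = {}
for group, person_id in population:
    strata.setdefault(group, []).append((group, person_id))

# Step 2: choose a simple random sample from each subgroup
per_stratum = 20
sample = {group: rng.sample(members, per_stratum) for group, members in strata.items()}

for group, chosen in sample.items():
    print(group, len(chosen))
```

Taking the same number from each stratum is what lets the subgroups be compared on equal footing, even though seniors are a much smaller share of this made-up population.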
3 results that are true if and only if A and B are independent
1. P(A|B) = P(A) 2. P(A|B) = P(A | not B) 3. P(A and B) = P(A)*P(B)
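A small check of the three independence conditions on a made-up two-way table of counts. `Fraction` is used so the comparisons are exact rather than subject to floating-point rounding.

```python
from fractions import Fraction

# Made-up counts: rows A / not A, columns B / not B
counts = {
    ("A", "B"): 20, ("A", "not B"): 30,
    ("not A", "B"): 20, ("not A", "not B"): 30,
}
total = sum(counts.values())

p_a = Fraction(counts[("A", "B")] + counts[("A", "not B")], total)
p_b = Fraction(counts[("A", "B")] + counts[("not A", "B")], total)
p_a_and_b = Fraction(counts[("A", "B")], total)

p_a_given_b = p_a_and_b / p_b
p_a_given_not_b = Fraction(counts[("A", "not B")], total) / (1 - p_b)

checks = (
    p_a_given_b == p_a,               # 1. P(A|B) = P(A)
    p_a_given_b == p_a_given_not_b,   # 2. P(A|B) = P(A | not B)
    p_a_and_b == p_a * p_b,           # 3. P(A and B) = P(A)*P(B)
)
print(checks)
```

Because each condition holds if and only if A and B are independent, the three checks always agree with one another on any table.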
Displaying the Data Distribution for Categorical Data:
1. Relative Frequencies in each category 2. Frequencies in each category
Two challenges in getting good survey results
1. Select a good sample 2. Collect good data
How do you know if two events are independent?
1. Stated in problem or 2. random sample of individuals for a survey
Properties of Correlation:
1. Two quantitative variables only 2. Linear Relationships only 3. Has no units (unit-free) 4. Switch x and y, doesn't matter 5. Affected by outliers and skewness
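Two of these properties (switching x and y does not matter, and r has no units) can be checked numerically. The data are made up, `corr` is a hypothetical helper written for this sketch, and the unit change uses a positive multiplier; a negative multiplier would flip the sign of r.

```python
import math
import statistics

def corr(x, y):
    """Sample correlation r between two equal-length lists."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    n = len(x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

r_xy = corr(x, y)
r_yx = corr(y, x)                            # switch x and y
x_new_units = [2.54 * xi + 10 for xi in x]   # rescale and shift x (a change of units)
r_new_units = corr(x_new_units, y)

print(r_xy, r_yx, r_new_units)
```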
5 numbers needed to come up with the best fit line
1. correlation 2. mean of x values (x-bar) 3. mean of y values (y-bar) 4. SD of x values 5. SD of y values
Properties of Standard deviation
1. same units as original data 2. never negative 3. can equal zero 4. is affected by outliers and skewness
Q1
25th percentile
Q2
50th percentile (median)
Interpreting coefficient of determination (r^2)
54.9 % of variability in coffee sales is due to temperature
Q3
75th percentile
Volunteer (Self-Selected) Biased Sample
A call goes out and people enter the study on their own
Undercoverage Biased Sample
A subgroup of the population is excluded from the very beginning
Getting a Good Sample
-random sample -allow no bias
conditional distribution
-row numbers in table divided by row total -column numbers in table divided by column total
Confounding Variables
Variables not studied that can affect the results
Is correlation affected by outliers and skewness?
Yes
Looking for Relationships in Two-way tables:
-compare both conditional distributions and if same, there is no relationship
Interquartile Range (IQR)
-distance taken up by the middle 50% of the data
Separate (Marginal) distribution
-in margin -adds to one at bottom right
A good experiment
-makes comparisons -avoids bias -has enough data (>=30)
Skewed Left
-mean < median -tail to the left
Skewed Right
-mean > median -tail to the right
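A quick illustration of how the tail pulls the mean away from the median. The values are made up; the left-skewed list is just the mirror image of the right-skewed one.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # long tail to the right
left_skewed = [-v for v in right_skewed]      # mirror image: tail to the left

mean_r, median_r = statistics.mean(right_skewed), statistics.median(right_skewed)
mean_l, median_l = statistics.mean(left_skewed), statistics.median(left_skewed)

print(mean_r > median_r)   # skewed right: mean > median
print(mean_l < median_l)   # skewed left: mean < median
```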
Direction of scatterplot
-positive (up) -negative (down)
Frequencies
# in each category -table, frequency bar graph
% Difference
% Diff = (1st number - 2nd number)/ 1st number
Relative Frequencies
% in each category -table, pie chart, relative frequency bar graph
coefficient of determination definition
% of variability in y due to x
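One way to make this definition concrete: for the least squares line, r^2 equals 1 - SSE/SST, where SST is the total sum of squared deviations of y from y-bar. The sketch below checks this on made-up data; the variable names are this example's own.

```python
import math

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation
b1 = s_xy / s_xx                    # least squares slope
b0 = y_bar - b1 * x_bar             # least squares intercept

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))   # leftover variability
sst = s_yy                                                      # total variability in y

r_squared = r ** 2
share_explained = 1 - sse / sst

print(r_squared, share_explained)
```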
Ways to check the fit of the line
- scatterplot - correlation - coefficient of determination - residuals
r is between ___ and ___
-1 and 1
If line fits well residuals should have
-No pattern -No systematic change as X increases -Few unusually large values of a residual -Few influential points
Joint Distribution
-Overall percentage in each cell -sums to one -divide by grand total
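The joint, marginal, and conditional distributions from these cards can all be computed from one two-way table of counts. The table below is made up, and `Fraction` keeps the proportions exact.

```python
from fractions import Fraction

# Made-up two-way table: rows are groups, columns are answers
rows, cols = ["group 1", "group 2"], ["yes", "no"]
counts = [[30, 20],
          [10, 40]]
grand_total = sum(sum(row) for row in counts)

# Joint distribution: each cell divided by the grand total (sums to one)
joint = [[Fraction(c, grand_total) for c in row] for row in counts]

# Marginal distributions: row totals and column totals of the joint distribution
row_marginal = [sum(row) for row in joint]
col_marginal = [sum(col) for col in zip(*joint)]

# Conditional distribution given the row: each cell divided by its row total
conditional_on_row = [[Fraction(c, sum(row)) for c in row] for row in counts]

for name, dist in zip(rows, conditional_on_row):
    print(name, dict(zip(cols, dist)))
```

Here the two conditional distributions differ (3/5 "yes" versus 1/5 "yes"), so this made-up table shows a relationship between the two variables.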
When comparing Histograms and Boxplots, Histograms:
-are a nice way to see overall shape of data set -see data broken down into small groups but harder to identify quartiles -can only get a rough idea of center or variability -hard to compare data sets
Independent Events
The outcome of one event does not affect the outcome of the second event
Conditional Probability P(B|A)
The probability of event B occurring, given that event A has already occurred
Implementation: Timing
The timing of a survey can affect the results
Two methods of collecting data
observational studies and experiments
Simpson's Paradox
occurs when a two-way table shows one relationship, but the relationship reverses if a third variable gets involved
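A made-up numeric illustration of the reversal: treatment X has the higher success rate within each severity group, yet treatment Y has the higher rate once the groups are combined. All counts, treatment names, and group names are hypothetical.

```python
from fractions import Fraction

# Made-up (successes, patients) counts, split by a third variable (severity)
results = {
    "X": {"mild": (8, 10), "severe": (20, 90)},
    "Y": {"mild": (63, 90), "severe": (1, 10)},
}

def rate(successes, patients):
    """Success rate as an exact fraction."""
    return Fraction(successes, patients)

def overall(treatment):
    """Success rate with the severity groups combined."""
    groups = list(results[treatment].values())
    return rate(sum(s for s, _ in groups), sum(n for _, n in groups))

x_wins_each_group = all(
    rate(*results["X"][g]) > rate(*results["Y"][g]) for g in ("mild", "severe")
)
y_wins_overall = overall("Y") > overall("X")

print(x_wins_each_group, y_wins_overall)
```

The reversal happens because X was mostly given to severe cases and Y mostly to mild ones, so severity acts as a confounding variable in the combined table.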
Median
orders numbers from lowest to highest, takes number in middle
Influential points
outliers in x direction
Make predictions on scatterplot by
plugging x into the least squares regression (best-fit line) equation
Marginal Probability P(A)
probability of a single event (characteristic)
coefficient of determination
r^2
r
sample correlation
What things can you not determine from a boxplot?
sample size, mean, SD