STATS 1430

Why is correlation affected by outliers and skewness?

Because it is based on mean and standard deviation, which are both affected by outliers and skewness

Which is more affected by skewness, the IQR or standard deviation?

Standard deviation; the IQR is based on quartiles, so it is resistant to skewness and outliers

What if you multiply all values by the same number?

SD is multiplied by the absolute value of that number

Stratified Random Sample

Compare subgroups of the population equally

Two Way Tables

Display visual relationships between the two categorical variables

Simple Random Sample

Examine the entire population as it exists

The difference between experiments and observational studies

An experiment intervenes by imposing a treatment; an observational study only observes

Displaying the Distribution for Quantitative Data

Histogram

IQR equation

IQR = Q3 - Q1
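
As a minimal sketch of the formula above, the quartiles can be taken from Python's standard `statistics.quantiles`. The `method="inclusive"` setting, the helper name `iqr`, and the sample data are assumptions for illustration; textbooks and software use slightly different quartile conventions, so Q1 and Q3 may differ a little from a hand calculation.

```python
from statistics import quantiles

def iqr(data):
    """Return Q3 - Q1 using the 'inclusive' quartile convention."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1

# Hypothetical sample data, chosen only for illustration.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
sample_iqr = iqr(sample)

# Same data with the largest value replaced by an extreme outlier.
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]
outlier_iqr = iqr(with_outlier)
```

With this data, both `sample_iqr` and `outlier_iqr` come out to 4.0, since Q1 and Q3 sit in the middle of the data and do not move when only the largest value changes.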

What do outliers and skewness do to pattern in correlation?

Makes pattern in data less strong

Center

Mean and median

What does best fit line do

Minimizes the sum of the square of the residuals

Can the median be affected by outliers?

Not much; the median is resistant to outliers

What happens to SD if we add the same number to all values in the data set?

No effect on SD
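
A quick numerical check of how shifting and scaling affect SD might look like the sketch below. The data values are made up for illustration, and `pstdev` (population SD) is used only for convenience; the sample SD behaves the same way under shifting and scaling.

```python
from statistics import pstdev

# Hypothetical data set, chosen so the SD is a round number.
values = [2, 4, 4, 4, 5, 5, 7, 9]

sd_original = pstdev(values)

# Adding the same number to every value shifts the data but not the spread.
sd_shifted = pstdev([v + 10 for v in values])

# Multiplying every value by a number stretches the spread by |number|.
sd_scaled = pstdev([3 * v for v in values])
sd_negated = pstdev([-3 * v for v in values])
```

Here `sd_original` and `sd_shifted` are both 2.0, while `sd_scaled` and `sd_negated` are both 6.0.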

Multiplication Rule of Probability

P(A and B) = P(A) * P(B|A)

If you know two events are independent:

P(A and B) = P(A)*P(B)

"Or" probability P(A or B)

Probability of A or B or both occur

Joint "And" probability P(A and B)

Probability that both A and B occur -intersection of A and B

What is the impact of bias?

Reduces credibility of results

What type of sample would be used if we wanted to know "What % of all Americans are nurses?"

Simple Random Sample

Variability

Standard Deviation, Quartiles

r = +/- 0.7 in scatterplot

Strong relationship

Residual Plot

a scatterplot with x values on x axis and residuals on y axis; how far off were we at that x?

Double-blind study

an experiment where neither the experimenter nor the subjects know who got which treatment

Implementation: Response Bias

an individual in the sample responds but doesn't give the correct data

Implementation: Nonresponse

an individual is selected to be in the sample but doesn't respond to the survey

b0 equation

b0 = y-bar - b1(x-bar)

b1 equation

b1 = r(Sy/Sx)
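
The slope and intercept formulas can be sketched as a small function that takes the five summary numbers. The function name `best_fit_from_summary` and the example values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def best_fit_from_summary(r, x_bar, y_bar, s_x, s_y):
    """Return (b0, b1) for the line y = b0 + b1*x from summary statistics."""
    b1 = r * (s_y / s_x)       # slope: correlation times ratio of SDs
    b0 = y_bar - b1 * x_bar    # intercept: line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical summary values for illustration.
b0, b1 = best_fit_from_summary(r=0.5, x_bar=10, y_bar=20, s_x=2, s_y=4)

# Predict y at x = 12 by plugging into the line.
predicted = b0 + b1 * 12
```

With these inputs the slope is 1.0, the intercept is 10.0, and the prediction at x = 12 is 22.0.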

better graph for skewed data

box plots

strength of boxplots

can compare different data sets on same scale

What is useful about the boxplot?

can immediately see median and IQR and whether or not data is skewed

weakness of boxplots

cannot tell exact shape of symmetrical data

descriptive statistics

center and variability

Convenience Biased Sample

choosing individuals in the easiest way

Variability definition

how spread out or concentrated the data are around the center

Residuals

data - line (observed y minus the y predicted by the line)

Categorical Data

data falling into groups

Quantitative Data

data in which quantities are important; measurements and counts

If mean < median,

data is skewed left

If mean > median,

data is skewed right

Interpreting a Scatterplot

describe the relationship between x and y with the simplest general pattern, direction, strength

statistical significance

difference in treatment is decided to be due to more than random chance

Histogram

divides data into contiguous groups on the number line and shows how many are in each group -horizontal axis: the variable you measured -vertical axis: number or percentage in each group

Boxplots

graph based on five-number summary, another graph of quantitative data

IQR is small if

high concentration in the middle

better graph for symmetric data

histogram

Strength

how closely points follow a (line) pattern

smaller section in boxplot

less spread in data

Data Distribution

list of all possible values of the data and how often they occur

Probability of an event

long-term chance that it will occur

five number summary

minimum, Q1, median, Q3, maximum

r = +/- 0.5 in scatterplot

moderate relationship

more variability

more data is farther from center

larger section in boxplot

more spread in data

r = 0

no linear relationship

Does correlation have units

no units

Sample Space

set of all possible outcomes

3 important characteristics in a data set

shape, center, variability

b1 in best fit line

slope

Event

subset of the sample space S

SSE

sum of squares for errors

Most common observational study

survey

Bias

systematic favoritism

least squares regression line

the "best-fit" line that is calculated by minimizing the sum of the squares of the differences between the observed and predicted values of the line
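
To illustrate what "minimizing the sum of squared residuals" means, the sketch below fits a line to a small made-up data set and compares its SSE with that of a slightly different line. The data and the helper names `fit_line` and `sse` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept from raw data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sse(xs, ys, intercept, slope):
    """Sum of squared residuals (observed y minus predicted y)."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

fit_b0, fit_b1 = fit_line(xs, ys)
best_sse = sse(xs, ys, fit_b0, fit_b1)

# Any other line, such as one with a slightly steeper slope, has a larger SSE.
other_sse = sse(xs, ys, fit_b0, fit_b1 + 0.1)
```

For this data the fitted line is y = 2.2 + 0.6x with an SSE of 2.4, and the perturbed line gives a larger SSE.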

If the data are symmetric,

the mean and median will be similar

If two events' conditional probabilities look the same,

they are independent events

extrapolation

trying to make a prediction outside of the range of data provided

r = +/- 0.3 in scatterplot

weak relationship

Sample mean =

x bar

Best fit line equation

y = b0 + b1x

b0 in best fit line

y-intercept

Can the mean be affected by outliers?

yes

When comparing Histograms and box plots, box plots:

-show skewed vs. symmetric shape, but do not show what type of symmetric shape -make it easy to determine center (median) and variability (IQR) -make it easy to see quartiles, but no other breakdown is visible -make it easy to compare data sets on the same scale -are best for skewed data sets

Avoiding Bias

1. A sampling procedure must be used 2. The sample must represent the entire population (truly random)

Types of Biased Samples

1. Convenience 2. Volunteer (Self-Selected) 3. Undercoverage

Stratified Random Sample Process

1. Divide the population into subgroups (strata) of interest 2. Choose a simple random sample from each subgroup

3 results that are true if and only if A and B are independent

1. P(A|B) = P(A) 2. P(A|B) = P(A | not B) 3. P(A and B) = P(A)*P(B)
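
The three conditions can be checked side by side from the four cell counts of a two-way table. This is a sketch under stated assumptions: the function name `independence_checks` and the counts are hypothetical, and every row and column total is assumed to be nonzero.

```python
from math import isclose

def independence_checks(a_and_b, a_not_b, not_a_and_b, not_a_not_b):
    """Return the three independence conditions as a tuple of booleans."""
    total = a_and_b + a_not_b + not_a_and_b + not_a_not_b
    p_a = (a_and_b + a_not_b) / total
    p_b = (a_and_b + not_a_and_b) / total
    p_a_and_b = a_and_b / total
    p_a_given_b = a_and_b / (a_and_b + not_a_and_b)
    p_a_given_not_b = a_not_b / (a_not_b + not_a_not_b)
    return (
        isclose(p_a_given_b, p_a),              # 1. P(A|B) = P(A)
        isclose(p_a_given_b, p_a_given_not_b),  # 2. P(A|B) = P(A | not B)
        isclose(p_a_and_b, p_a * p_b),          # 3. P(A and B) = P(A)*P(B)
    )

# Hypothetical counts where A and B are independent.
independent_result = independence_checks(20, 30, 20, 30)

# Hypothetical counts where A is more likely when B occurs.
dependent_result = independence_checks(30, 20, 10, 40)
```

In the first table all three checks hold together; in the second all three fail together, which is what "if and only if" implies.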

Displaying the Data Distribution for Categorical Data:

1. Relative Frequencies in each category 2. Frequencies in each category

Two challenges in getting good survey results

1. Select a good sample 2. Collect good data

How do you know if two events are independent?

1. Stated in problem or 2. random sample of individuals for a survey

Properties of Correlation:

1. Two quantitative variables only 2. Linear Relationships only 3. Has no units (unit-free) 4. Switch x and y, doesn't matter 5. Affected by outliers and skewness
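
Two of these properties (unit-free, and unchanged when x and y are switched) can be checked numerically. The sketch below computes r directly from its definition; the data and the function name `correlation` are assumptions for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r between two quantitative variables."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_xy = correlation(xs, ys)
r_yx = correlation(ys, xs)                           # switch x and y
r_rescaled = correlation([2.54 * x for x in xs], ys)  # change units of x
r_squared = r_xy ** 2                                # coefficient of determination
```

All three correlations agree (about 0.775), and `r_squared` is 0.6 for this data.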

5 numbers needed to come up with the best fit line

1. correlation 2. mean of x values (x-bar) 3. mean of y values (y-bar) 4. SD of x values 5. SD of y values

Properties of Standard deviation

1. same units as original data 2. never negative 3. can equal zero 4. is affected by outliers and skewness

Q1

25th percentile

Q2

50th percentile (median)

Interpreting coefficient of determination (r^2)

Example: 54.9% of the variability in coffee sales is explained by temperature

Q3

75th percentile

Volunteer (Self-Selected) Biased Sample

A call goes out and people enter the study on their own

Undercoverage Biased Sample

A subgroup of the population is excluded from the very beginning

Getting a Good Sample

-random sample -allow no bias

conditional distribution

-row numbers in table divided by row total -column numbers in table divided by column total

Confounding Variables

Variables not studied that can affect the results

Is correlation affected by outliers and skewness?

Yes

Looking for Relationships in Two-way tables:

-compare both conditional distributions and if same, there is no relationship

Interquartile Range (IQR)

-distance taken up by the middle 50% of the data

Separate (Marginal) distribution

-in margin -adds to one at bottom right

A good experiment

-makes comparisons -avoids bias -has enough data (>=30)

Skewed Left

-mean < median -tail to the left

Skewed Right

-mean > median -tail to the right
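
The mean-versus-median comparison can be sketched as a small helper. The function name `skew_hint` and the data are hypothetical, and the comparison is a rule of thumb, not a formal test of skewness.

```python
from statistics import mean, median

def skew_hint(data):
    """Guess the direction of skew by comparing the mean with the median."""
    m, med = mean(data), median(data)
    if m > med:
        return "skewed right"
    if m < med:
        return "skewed left"
    return "roughly symmetric"

# Hypothetical data with a long right tail (one large value).
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]
right_hint = skew_hint(right_tailed)

# Mirror image of the same data, so the tail points left.
left_hint = skew_hint([-v for v in right_tailed])

symmetric_hint = skew_hint([1, 2, 3, 4, 5])
```

For the right-tailed data the mean is 4.75 and the median is 3, so the large value pulls the mean toward the tail.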

Direction of scatterplot

-positive (up) -negative (down)

Frequencies

# in each category -table, frequency bar graph

% Difference

% Diff = (1st number - 2nd number)/ 1st number
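
As a short sketch of this formula, with a hypothetical function name and made-up numbers:

```python
def percent_diff(first, second):
    """(1st number - 2nd number) / 1st number, as a proportion."""
    return (first - second) / first

# Going from 50 to 40 is a difference of 20% of the first number.
example_diff = percent_diff(50, 40)
```

Multiply the result by 100 to express it as a percentage.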

Relative Frequencies

% in each category -table, pie chart, relative frequency bar graph

coefficient of determination definition

% of variability in y due to x

Ways to check the fit of the line

- scatterplot - correlation - coefficient of determination - residuals

r is between ___ and ___

-1 and 1

If line fits well residuals should have

-No pattern -No systematic change as X increases -Few unusually large values of a residual -Few influential points

Joint Distribution

-Overall percentage in each cell -sums to one -divide by grand total
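
The joint, marginal, and conditional distributions of a two-way table differ only in what each count is divided by. The sketch below uses a made-up 2x2 table of counts; the variable names are assumptions for illustration.

```python
# Hypothetical two-way table of counts: rows are one categorical
# variable, columns are the other.
counts = [
    [10, 30],
    [20, 40],
]

grand_total = sum(sum(row) for row in counts)

# Joint distribution: each cell divided by the grand total (sums to one).
joint = [[c / grand_total for c in row] for row in counts]

# Marginal distributions: row and column totals divided by the grand total.
row_marginal = [sum(row) / grand_total for row in counts]
col_marginal = [sum(col) / grand_total for col in zip(*counts)]

# Conditional distribution within each row: cells divided by the row total.
conditional_by_row = [[c / sum(row) for c in row] for row in counts]
```

Here the two rows of `conditional_by_row` differ (0.25 vs. about 0.33 in the first column), which suggests a relationship between the two variables in this made-up table.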

When comparing Histograms and Boxplots, Histograms:

-are a nice way to see overall shape of data set -see data broken down into small groups but harder to identify quartiles -can only get a rough idea of center or variability -hard to compare data sets

Independent Events

The outcome of one event does not affect the outcome of the second event

Conditional Probability P(B|A)

The probability of event B occurring, given that event A has already occurred

Implementation: Timing

The timing of a survey can affect the results

Two methods of collecting data

observational studies and experiments

Simpson's Paradox

occurs when a two-way table shows one relationship, but the relationship reverses if a third variable gets involved
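
A small numerical sketch can make the reversal concrete. The counts below are illustrative (successes, total) pairs for two treatments, split by a third variable (group); the names `groups`, `rate`, and `overall_rate` are assumptions for this example.

```python
# Illustrative counts: (successes, total) for each treatment within each group.
groups = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

def overall_rate(treatment):
    """Success rate for a treatment with the groups combined."""
    successes = sum(groups[g][treatment][0] for g in groups)
    total = sum(groups[g][treatment][1] for g in groups)
    return successes / total

# Within each group, treatment A has the higher success rate...
a_wins_small = rate(*groups["small"]["A"]) > rate(*groups["small"]["B"])
a_wins_large = rate(*groups["large"]["A"]) > rate(*groups["large"]["B"])

# ...but with the groups combined, treatment B looks better.
b_wins_overall = overall_rate("B") > overall_rate("A")
```

The reversal happens because treatment A is used mostly on the harder "large" group, so the group variable acts as a confounding variable in the combined table.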

Median

orders numbers from lowest to highest, takes number in middle

Influential points

outliers in x direction

Make predictions on scatterplot by

plugging x into the least squares regression (best-fit line) equation

Marginal Probability P(A)

probability of a single event (characteristic)

coefficient of determination

r^2

r

sample correlation

What things can you not determine from a boxplot?

sample size, mean, SD

