STATS 1430.02 MIDTERM

Rules for calculating probability

- Multiplication rule - Addition rule - Complement rule

Implementation biases

- Nonresponse - Response

T/F The standard deviation can never be negative.

TRUE

T/F The starting point can affect the way a graph looks.

TRUE

T/F You can have two data sets with the same mean but different standard deviations.

TRUE

2 major types of data collections:

- observational studies - experiments

Good surveys....

- select a good sample - design a survey that avoids bias - implement the survey to avoid bias - analyze data properly

Question wording

the way in which survey questions are phrased, which influences how respondents answer them

Residuals

data - line, i.e. observed y - predicted y

One way to find P(at least one)...

1 - P(none)

The probabilities in the 4 cells of a two-way table are...

"And" Joint probabilities

P(A) = P(A and B) + P(A and Bc) where Bc means...

"B complement"

Frequencies

# (y-axis) in each category - table, bar graph

Response Rate Formula

# of responses received / total # of surveys

Relative frequencies

% (y-axis) in each category - table, pie-charts

coefficient of determination

% of variability in y explained by x; written R^2 or R-squared

Strength of linear relationships

+/-.7 - strong +/-.5 - moderate +/-.3 - weak +/-0 - no linear relationship

Suppose you make 10 telemarketer calls. What is a complement of "at most" 9 sales?

- "all 10" - "more than 9"

How else to know if A and B are independent?

- "random sample" is stated - experiment is replicated on each subject - stated in problem - never assume

Describing scatterplot relationship

- Pattern (relationship): linear/straight line - Direction: uphill (positive), downhill (negative) or 0 (no relationship) - Strength: how closely points follow a line

Elements of survey design

- Question wording - Type of survey - Timing

A well-designed experiment has what characteristics?

- Randomization of subjects to treatments - Comparison of the results from different groups - Controlled variables - Avoids bias - Has enough data

3 results that are true if and only if A and B are independent:

1. P(A | B) = P(A): conditional = marginal (can switch A & B)
2. P(A | B) = P(A | not B), also written P(A | Bc): the two conditionals are equal
3. P(A and B) = P(A)*P(B): the multiplication rule P(A)*P(B | A) simplifies to P(A)*P(B) when independent
* If any of these fails, A and B are dependent.
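
The three checks can be tried on a small two-way table. This is only a sketch: the counts below are hypothetical (chosen so that 60% of each row falls in A), and the variable names are illustrative.

```python
import math

# Hypothetical counts (not from the course): each key is (column, row).
counts = {
    ("A", "B"): 60, ("not A", "B"): 40,
    ("A", "not B"): 30, ("not A", "not B"): 20,
}
total = sum(counts.values())

# Marginals and the joint, all divided by the grand total.
p_a = (counts[("A", "B")] + counts[("A", "not B")]) / total
p_b = (counts[("A", "B")] + counts[("not A", "B")]) / total
p_a_and_b = counts[("A", "B")] / total

# Conditionals: divide a cell by its own row total.
p_a_given_b = counts[("A", "B")] / (counts[("A", "B")] + counts[("not A", "B")])
p_a_given_not_b = counts[("A", "not B")] / (
    counts[("A", "not B")] + counts[("not A", "not B")]
)

check_1 = math.isclose(p_a_given_b, p_a)              # conditional = marginal
check_2 = math.isclose(p_a_given_b, p_a_given_not_b)  # two conditionals equal
check_3 = math.isclose(p_a_and_b, p_a * p_b)          # joint = product of marginals

print(check_1, check_2, check_3)
```

With these counts all three checks agree, as they must: each one is an equivalent test of independence.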

How to avoid bias

1. Use a sampling procedure 2. Use a random sample in order to represent entire population

Line of best fit 5 number summary

1. correlation (r) 2. mean of x-values (xbar) 3. mean of y-values (ybar) 4. Std. dev. (Sx) of x-values 5. Std. dev. (Sy) of y-values

Probability Tree

1st branch - marginals, sums to 1
2nd branch - conditionals, sums to 1
1st branch x 2nd branch = "and" (joint) probabilities
Every time the tree breaks off, the branches sum to 1

Volunteer sample

A method of sample selection that consists of people choosing themselves by responding to a general appeal.

Timing of survey

EX: Length of time from survey to election; the shorter the time, the more accurate the survey.

What is UNDERCOVERAGE?

Excluding a group from the survey/experiment from the beginning.

categorical data

Data that consists of names, labels, or other nonnumerical values

T/F A flat histogram (with a line straight across) contains no variability whatsoever, according to our definition.

FALSE

Control group

In an experiment, the group that is not exposed to the treatment; contrasts with the experimental group and serves as a comparison/baseline for evaluating the effect of the treatment.

Notation for correlation

R

Treatment groups

The groups receiving different treatments.

Law of Total Probability

The probability of A is the sum of its joint probabilities with B and with not B, which together cover every outcome: P(A) = P(A and B) + P(A and not B)

Response bias

a systematic pattern of incorrect responses in a sample survey

A _______________ is a way to graph quantitative data.

histogram

when to use two-way table?

if P(A and B) ("joint" and)

Main objective in data collections is...

minimize/avoid bias

Scatterplots are two _______________.

quantitative variables

Best slope (b1)

b1 = r * (Sy / Sx), where r = correlation, Sy = std. dev. of y, Sx = std. dev. of x

All good samples are__________.

random

Interpret y-intercept

- only if needed - when x=0 - need data in that area to interpret

Calculating residuals

- plug in the x-value of each point to get the predicted y - subtract that predicted y from the point's original (observed) y

What should the residual plot have if the regression line fits the data well?

- random patterns - no fan shapes - points fall around the horizontal line y=0

Properties of standard deviation

- x values represent data points - used to measure concentration of data around mean - never negative - can equal 0 - affected by outliers & skewness

Correlation of a sample (r) will always be a number between:

-1 and 1

A good experiment...

-makes comparisons -avoids bias -has enough data

What is the standard deviation of the data set 1, 1, 1, 1?

0

a correlation of ______ or higher is considered strong in the positive direction.

0.7

What is the x-variable in a data set?

Independent variable, or factor (cause)

When skewness is present in a set of data, which descriptive summary measure is most appropriate?

Interquartile range and median

Correlation is a measure of the strength and direction of what TYPE of relationship between two quantitative variables?

Linear

Sample space

S = { all possible outcomes }

What does SSE stand for?

Sum of Squares for Error

Correlation does not always imply _____________.

causation

Convenience sample

choosing individuals in the easiest way (not to be confused with volunteer, you are still choosing individuals)

A listing of all possible values in a data set and how often they occurred is called a:

data distribution

Quantitative data

data with numbers

What is the y-variable in a data set?

dependent variable, response

Bayes Rule

describes the probability of an event, based on prior knowledge of conditions that might be related to the event

Histograms vs. Boxplots

Histograms:
- show shape
- hard to identify quantities
- only a rough idea of center & variability
- hard to compare data sets
Boxplots:
- show skewness/symmetry
- do not show what type of symmetric shape
- easy to determine center (median) & variability (IQR, a distance)
- can see quantities but no other breakdown
- good for comparing

when to use tree diagram?

if P(A) (marginal), P(B | A) (conditional), etc.

Examining residuals

if a line fits well the residuals have - no pattern - y-values that don't fan out as x increases (no fan shapes) - none/few unusually large values (outliers in y-direction) - no influential points (outliers in x-direction)

Experiment

imposes some treatment, any sizable differences in data should be due to treatment

If you add a number to every value of a data set, what happens to the standard deviation?

it does not change
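
A quick way to convince yourself, sketched with hypothetical data and the standard library's population standard deviation (the sample version, statistics.stdev, behaves the same way under a shift):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical values
shifted = [x + 100 for x in data]        # add the same number to every value

sd_original = statistics.pstdev(data)
sd_shifted = statistics.pstdev(shifted)

# The mean moves by 100, but the spread around the mean is unchanged.
print(statistics.mean(data), statistics.mean(shifted))
print(sd_original, sd_shifted)
```

The shifted set has a much larger mean and the same standard deviation, which also shows that a large mean does not force a large standard deviation.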

A __________ distribution summarizes the information from one variable ONLY, without considering ANY information from another variable.

marginal

Probability of an event

long-term chance it will occur P(A) "P of A"

Center measures

mean, median

Variability measures

standard deviation, quartiles

Linear relationship

the points show a straight line pattern, go at a constant rate and move from left to right

horizontal (x) axis

the variable measured

An experiment imposes a _______________________ on the subjects, which makes it different from an observational study.

treatment

Correlation has no __________.

units - therefore the change in units of a problem does not affect correlation

You can show that P(B|A) = P(A and B)/P(A) by...

using multiplication rule and evaluating conditionals

Confidentiality is _________ than anonymity.

weaker

x-intercept

y = 0

Line of best fit equation

y = b0 + b1*x, where b0 = y-intercept and b1 = slope

Incentives for truth in survey questions

- Anonymity - Confidentiality

T/F The standard deviation has no units.

FALSE - same as original units of problem

T/F The probabilities in the four cells of a two-way table are conditional probabilities.

FALSE - they are "and" (joint) probabilities

T/F the best fitting line always has an SSE of 0.

False

How do disjoint probabilities affect "Or" Addition Rule?

P(A or B) = P(A) + P(B) + 0 because P(A and B) = 0 if disjoint

Scatterplots examine relationships between what type(s) of variables?

Quantitative

If a data set is skewed to the right, the mean is ____________________ the median.

more than - tail goes to the right, few large values

Observational study

no imposed treatments, does not attempt to influence responses

If the conditional distributions are the same, then there is:

no relationship

Skewed

not necessarily bad, means off from center

vertical (y) axis

number or % frequency in each group

Which technique should you use when you have P(B|A) and some other probabilities but you are looking for P(A|B)?

Bayes Rule - switching conditionals

Which is stronger, experiments or observational studies?

Experiments, because you are controlling variables, imposing treatments, and randomly assigning those treatments

T/F If variables A and B are related in a certain way in a two way table (with 2 variables), no matter how many other variables you look at in addition to these two, the relationship will always stay the SAME.

FALSE

T/F If you have all the information filled in a two-way table, you can fill in all the information on a tree. But not the other way around

FALSE

T/F In a boxplot you can tell the exact pattern of the data set (beyond just whether the data is skewed or symmetric.)

FALSE

T/F The mean of a data set must be one of the values in the data set.

FALSE

T/F You can compare two MARGINAL distributions to see if the corresponding two variables are related.

FALSE

Suppose you have a numerical data set that is very much skewed to the left. If you had to pick one, which measure of center (mean or median) best represents most of the data in this case?

Median

A and B are disjoint if...

P(A and B) = 0

"Or" Probability Addition Rule

P(A or B) = P(A or B, or both) = P(at least one) = P(one or more)
P(A or B) = P(A) + P(B) - P(A and B)
* subtract P(A and B) so "both" isn't counted twice

Bayes Rule Formula

P(A | B) = P(B | A)*P(A) / P(B)
- the top (numerator) will always match something on the bottom (denominator)
- good for when you need to flip conditionals
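
A minimal sketch of flipping a conditional with Bayes Rule. The screening-test numbers are hypothetical, and the helper name bayes is made up for illustration. The denominator P(B) is built from the law of total probability, which is why the numerator reappears in it.

```python
def bayes(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A | B) from P(B | A), P(A), and P(B | not A)."""
    # P(B) = P(A and B) + P(not A and B); the first term is the numerator.
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, not from the course.
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

p_disease_given_pos = bayes(p_pos_given_disease, p_disease, p_pos_given_healthy)
print(round(p_disease_given_pos, 3))
```

Here P(positive | disease) is high, yet P(disease | positive) comes out much lower, so the two conditionals should not be confused.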

Marginal probability

P(A) probability of a single event - adds to 1 - found on tree

For Disjoint Events: P(A or B) = ?

P(A) + P(B)

If A and B are independent then P(A or B) becomes...

P(A) + P(B) - P(A)P(B)
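
The general, disjoint, and independent versions of the addition rule are the same formula with a different P(A and B) plugged in. A small sketch with hypothetical probabilities:

```python
def p_or(p_a, p_b, p_a_and_b):
    """Addition rule: subtract the overlap so 'both' is counted once."""
    return p_a + p_b - p_a_and_b

p_a, p_b = 0.5, 0.4                       # hypothetical marginals

general = p_or(p_a, p_b, 0.1)             # overlap given as 0.1
disjoint = p_or(p_a, p_b, 0.0)            # disjoint: P(A and B) = 0
independent = p_or(p_a, p_b, p_a * p_b)   # independent: P(A and B) = P(A)*P(B)

print(general, disjoint, independent)
```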

Thirty-seven percent of registered voters said they are likely to vote by mail in the November election according to a new survey released Tuesday. Among them, 48% plan to vote for Joe Biden. That's more than twice the 23% who plan to vote for President Trump. Write the 48% as a probability using the given information.

P(Biden voter | voting by mail)

Suppose a school figures that 70% of adults will purchase a candy bar from a 6th grader during a fund-raiser. A sixth grader randomly selects 10 adults. What's the chance that at least one of them will buy a candy bar?

P(at least one) = 1 - P(none) = 1 - (0.30)^10, because each of the 10 adults has a 1 - 0.70 = 0.30 chance of not buying
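
The arithmetic for this problem, as a short sketch. It assumes the 10 adults decide independently of one another, which the random selection in the problem is meant to justify.

```python
p_buy = 0.70       # chance a single adult buys (from the problem)
n_adults = 10

# "None buy" means all 10 adults decline, each with probability 1 - 0.70.
p_none = (1 - p_buy) ** n_adults

# Complement rule: at least one = 1 - none.
p_at_least_one = 1 - p_none
print(p_at_least_one)
```

The result is extremely close to 1, since it is very unlikely that all 10 adults decline.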

What happens to the correlation between two quantitative variables if you switch X and Y?

R (correlation) measures how X and Y move together (numerator) compared to how they move separately (denominator). If you switch the X's and Y's around in the ENTIRE formula you get the SAME ANSWER, by commutative property of multiplication.
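
A sketch that computes the correlation by hand on hypothetical data and confirms that switching X and Y, or changing the units of X, leaves it unchanged. The function name pearson_r is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Correlation: how x and y move together over how they move separately."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    together = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return together / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]

r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)                           # X and Y switched
r_rescaled = pearson_r([100 * x for x in xs], ys)  # X in different units

print(r_xy, r_yx, r_rescaled)
```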

What occurs when a two-way table shows one relationship, but the relationship reverses if a third variable gets involved?

Simpson's Paradox

When a difference in treatment is decided to be due to more than random chance, what do you call the results?

Statistically significant

T/F A two-way table has this name because it contains rows going one way, and columns going the other way.

TRUE

T/F An outlier in a data set can significantly affect the value of the mean but not the median.

TRUE

T/F It is possible for the average of a data set to be larger than most of the values in that data set.

TRUE

If 60% of male-owned businesses are successful in their first year, and 60% of female-owned businesses are successful in their first year, are gender and having a successful business in their first year independent?

Yes, the conditionals are the same: P(success | male) = P(success | female)

Is correlation affected by outliers?

Yes, correlation is based on the mean of X, the mean of Y, the SD of X, and the SD of Y. All four of these items are affected by outliers.

Best y-intercept (b0)

b0 = ybar - b1*(xbar), where ybar = mean of y-values and xbar = mean of x-values
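
Putting the slope and intercept formulas together, here is a sketch that builds the line of best fit from the five summary values. All five numbers are hypothetical.

```python
# Hypothetical summary values for a data set.
r = 0.8        # correlation
x_bar = 10.0   # mean of x-values
y_bar = 50.0   # mean of y-values
s_x = 2.0      # std. dev. of x-values
s_y = 5.0      # std. dev. of y-values

b1 = r * s_y / s_x        # best slope
b0 = y_bar - b1 * x_bar   # best y-intercept

def predict(x):
    """Predicted y on the line y = b0 + b1*x."""
    return b0 + b1 * x

print(b0, b1, predict(12))
```

One consequence of the intercept formula is that the line always passes through the point (xbar, ybar): predict(x_bar) returns y_bar.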

causation

a change in one variable makes another variable change

statistical significance

a difference beyond chance

Confounding variable

a factor other than the independent variable that might produce an effect in an experiment

If a residual is negative, then that data point lies _________________ the regression line.

below

Nonresponse bias

bias introduced to a sample when a large fraction of those sampled fails to respond

anonymity

cannot attach you to your answers

If a data set is skewed to the left, the mean is ____________________ the median.

less than - tail goes to the left, few small values

Mean

measure of center, average of data set

Data is fairly symmetric when...

the mean and median are close

Two-way tables explore relationships between ____________________________.

two categorical variables

Properties of Correlation

- 2 quantitative variables - Linear relationship only - NO units - Switching x & y does not change correlation, only changes slope - Affected by outliers & skewness

Simpson's Paradox

- 2-way table shows one relationship, a 3-way table reverses this relationship - 2 variables v.s. 3. variables

Examples of RESPONSE BIAS?

- Answering incorrectly/dishonestly (the other two situations are different problems: people who were not included in the sample from the beginning and so can't answer the survey question are undercoverage, and not answering the survey question at all is nonresponse)

Types of biased samples

- Convenience - Volunteer/self-selected -Undercoverage

Joint "and" distribution

- Dividing by grand total (NOT column/row totals, that's conditional) - Can't see a relationship yet

Conditionals

- Individual cell divided by column or row total - compare rows and columns to find relationship - if conditionals are same, no relationship

Which of the following distributions must sum to 1?

- Joint "And" - Marginal - Conditionals
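
A sketch of the three distributions computed from one two-way table of hypothetical counts, showing which totals each one divides by and that each sums to 1.

```python
# Hypothetical counts: rows are one variable, columns the other.
counts = [[30, 20],
          [10, 40]]
grand_total = sum(sum(row) for row in counts)

# Joint ("and"): every cell divided by the grand total.
joint = [[cell / grand_total for cell in row] for row in counts]

# Marginal: row totals divided by the grand total.
row_marginal = [sum(row) / grand_total for row in counts]

# Conditional: each cell divided by its own row total.
conditional_by_row = [[cell / sum(row) for cell in row] for row in counts]

print(sum(sum(row) for row in joint))         # whole table sums to 1
print(sum(row_marginal))                      # marginals sum to 1
print([sum(row) for row in conditional_by_row])  # each row sums to 1
```

In this table the two conditional rows differ (0.6 vs. 0.2 in the first column), so the two variables are related.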

Least squares regression line

- Minimized SSE - b0 (y-intercept) - b1 (slope)

Stratified random sampling

- aimed at a specific subgroups ("strata") of interest - choose a random sample from each group - different perspectives

Notes for box plots

- can't tell what the sample size is from a boxplot - bigger boxes don't equal more data - can't see the mean on a boxplot

Information used to find line of best fit:

1. description of relationship (uphill, straight line) 2. mean 3. standard deviations of x and y

A five number summary contains the:

1. min 2. max 3. Q1 (25th) 4. median (Q2, 50th percentile) 5. Q3 (75th)
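
A sketch of a five-number summary using Python's standard library on hypothetical data. Quartile conventions differ between textbooks and software; the "inclusive" method used here is one common choice and may not match the course's hand method on every data set.

```python
import statistics

data = sorted([7, 1, 9, 3, 5, 2, 8, 4, 6])   # hypothetical values

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, median, q3, max(data))
iqr = q3 - q1   # width of the box in a boxplot

print(five_number_summary, iqr)
```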

coefficient of determination

- The % of variability in Y that is explained by X - R-squared - the square of correlation R

To find conditionals from 4 cells in a two-way table...

- divide each cell by its corresponding column or row total (not both)

Sum of Squares for Error (SSE)

- distance between point and line (point minus line: observed y - predicted y), squared to make each value positive, then summed - if error is 0, point is on line - find line with smallest SSE from all potential lines
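
A sketch of residuals and SSE for one candidate line. The data and the line y = 1 + 2x are hypothetical; the line is a candidate for illustration, not a fitted least squares line.

```python
xs = [1, 2, 3, 4]      # hypothetical data
ys = [3, 5, 6, 9]

b0, b1 = 1, 2          # hypothetical candidate line y = b0 + b1*x

predicted = [b0 + b1 * x for x in xs]

# Residual = observed y - predicted y (point minus line).
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

# SSE: square each residual so negatives don't cancel, then add them up.
sse = sum(e ** 2 for e in residuals)

print(residuals, sse)
```

The third point has a negative residual, so it lies below the line; the other three points sit exactly on it.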

Simple random sample

- examines population as it exists - answers same question overall, applies to everyone

Marginals

- grand totals, "the sides" of tables - can't see relationship

Experiment factors

- has to do with treatment imposed - quantity

double-blind study

An experiment in which neither the participant nor the researcher knows whether the participant has received the treatment or the placebo

"And" Joint Probability

P (A and B) = Probability A and B occur - adds to 1 - multiply marginal and conditional to get joint

Conditional probability

P (B | A) - B, given A % "of" A that have characteristic B - look for word "of" - adds to 1 - found on tree

"And" Probability Multiplication Rule

P(A and B) = P(A)*P(B | A)
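
The multiplication rule applied to the vote-by-mail figures quoted in this set (37% likely to vote by mail; among them, 48% plan to vote for Biden). A minimal sketch; the variable names are illustrative.

```python
p_mail = 0.37               # marginal: P(voting by mail)
p_biden_given_mail = 0.48   # conditional: P(Biden voter | voting by mail)

# Joint "and" probability = marginal * conditional.
p_mail_and_biden = p_mail * p_biden_given_mail
print(round(p_mail_and_biden, 4))

# Rearranging the same rule recovers the conditional:
# P(B | A) = P(A and B) / P(A).
recovered_conditional = p_mail_and_biden / p_mail
```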

Suppose the probability of someone being female and voting for Candidate A is 30%. What is the notation for this probability?

P(Female and Voting for A) = 0.30

Suppose 90% of the patients who test positive for a disease actually have the disease. Write this as a probability.

P(have disease | test positive)=0.90

Interpret slope

always put the slope over 1: as x goes up (or down) by 1, y increases/decreases by ___.

Which type of graph is best for COMPARING two or more quantitative data sets, a boxplot or a histogram?

boxplot

Which type of graph is made from the 5-number summary?

boxplot

Confidentiality

can attach you to your answers, but promises not to

How to know if an experiment has enough data?

can be replicated on several different subjects

Complement Rule

set of all outcomes in S that are NOT INCLUDED in A
Notation: Ac
P(Ac) = 1 - P(A)
- at least one - one or more - at most one (none + exactly one)

When every possible sample with the same number of observations is equally likely to be chosen, the selected sample is called:

simple random sample

The "average distance from the mean" is measured by the __________________________.

standard deviation

Which is more affected by skewness, the IQR or standard deviation?

standard deviation

Which type of sample is used when you want to compare subgroups of the population?

stratified random sample

events (characteristics)

subsets of S (letters: A, B, C, etc.)

What is the most common observational study?

surveys

Bias

systematic favoritism of certain outcomes

Variability

- how much the scores in a data set vary from each other and from the mean - farther away from center, more variability

What summary measures CANNOT be directly calculated from a boxplot?

- mean - sample size - standard deviation

Boxplot

- one-dimensional graph of quantitative data, broken into 4 equal parts (25% each) - based on 5 number summary - can be horizontal or vertical

Ways to check fit of a line

- scatterplot - correlation - coefficient of determination - residuals

Interpretations of data

- shape - center - variability

Two things that can affect the graph:

- starting point - # of bars

T/F You can recreate the original data values from its histogram.

FALSE

T/F The median of a data set must be one of the values in the data set.

FALSE

T/F P(A and B) = P(A) * P(B) for ANY events A and B.

FALSE - only for independent events

T/F Conditional probabilities are present on a tree.

TRUE

T/F If 2 corresponding conditional distributions are different from each other in a two-way table, we know that the variables are related.

TRUE

T/F If the mean of a data set is large, the standard deviation has to be large also.

FALSE - the standard deviation measures spread around the mean, not the size of the mean

A student has applied to two graduate schools but she will only enroll in one of them. She has a 60% chance of getting into grad school A, and a 40% of getting into grad school B. If grad school A lets her in first, there is an 80% chance she will enroll. If grad school B lets her in first, there is a 90% chance she will enroll. How do you write .80 in probability notation?

P(Enroll | A)

IQR (interquartile range)

IQR = Q3 - Q1, the range of the middle 50% of the data
Q1 - 1st quartile, 25th percentile (25% of data is below it)
Q2 - 2nd quartile, 50th percentile/median (50% of data below it)
Q3 - 3rd quartile, 75th percentile (75% below)

Random sample

every member of the population has an equal chance of being selected

Problems in surveys that can cause bias:

- Nonresponse - Undercoverage - Question wording

What probability formulas CANNOT be used to test for independence?

- P(A | B) = P(B) - P(A | B) = P(Ac | B), where Ac is the complement of A

Two challenges of collecting good data:

- Selecting a good sample - Collecting good/unbiased data

What do the slices on a pie chart represent?

relative frequencies

What are the units of residuals?

same as the units for Y

Types of survey

mail, telephone, online, face-to-face

