Psych 3020 Exam 1

What percent of bottles pass the quality control inspection?

(a) 1.82% (b) 3.44% (c) 6.88% *(d) 93.12%* (e) 96.56%. Z(35.8) = (35.8 − 36)/0.11 = −1.82 and Z(36.2) = (36.2 − 36)/0.11 = 1.82, so P(35.8 < X < 36.2) = P(−1.82 < Z < 1.82) = 0.9656 − 0.0344 = 0.9312

A 2012 Gallup survey suggests that 26.2% of Americans are obese. Among a random sample of 10 Americans, what is the probability that exactly 8 are obese?

(a) 0.262^8 × 0.738^2 (b) (8 choose 10) × 0.262^8 × 0.738^2 **(c) (10 choose 8) × 0.262^8 × 0.738^2 = 45 × 0.262^8 × 0.738^2 = 0.0005** (d) (10 choose 8) × 0.262^2 × 0.738^8
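
The arithmetic in answer (c) can be checked with a short sketch. The function name `binomial_pmf` is an illustrative choice, not something from the lecture; it only uses Python's standard library.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Obesity example from the card: n = 10 people, k = 8 obese, p = 0.262
prob_8_of_10 = binomial_pmf(8, 10, 0.262)
print(comb(10, 8))             # number of ways to choose which 8 of the 10 are obese
print(round(prob_8_of_10, 4))  # probability rounded to 4 decimals
```

Option (a) leaves out the `comb(10, 8)` factor, which is why it undercounts: it gives the probability of one specific ordering only.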

Bayes' Theorem

-The conditional probability formula we have seen so far is a special case of Bayes' Theorem, which is applicable even when events have more than just two outcomes. •Bayes' Theorem: P(A1|B) = P(B|A1)P(A1) / [P(B|A1)P(A1) + P(B|A2)P(A2) + ··· + P(B|Ak)P(Ak)], where A2, ···, Ak represent all other possible outcomes of variable 1.

Numerical, continuous

sleep

What happens when np and/or n(1-p)<10?

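The distribution of sample proportions gets smooshed together toward one end (it looks skewed and discrete rather than nearly normal), with less variability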

Stacked dot plot

higher bars represent areas where there are more observations, makes it a little easier to judge the center and the shape of the distribution

standard normal distribution

the normal distribution with mean μ=0 and standard deviation σ=1

levels

the possible values of a factor

probability

the probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times

joint probabilities

the probability of outcomes for two or more variables or processes

Bernoulli random variable

when an individual trial only has two possible outcomes, often labeled as success or failure, it is called a Bernoulli random variable

Double-blind

when both the experimental units and the researchers who interact with the patients do not know who is in the control and who is in the treatment group

Blinding

when experimental units do not know whether they are in the control or treatment group

false positive

when we think we perceive a stimulus that is not there

Normal distributions with different parameters

μ: mean, σ: standard deviation. Examples: N(μ=0, σ=1) and N(μ=19, σ=4)

First quartile

25th percentile

Third quartile

75th percentile

Conditional Probability

(Break in the lecture)

Histograms

-Histograms provide a view of the data density. Higher bars represent where the data are relatively more common. -Histograms are especially convenient for describing the *shape* of the data distribution -the chosen *bin width* can alter the story the histogram is telling

Two primary types of data collection

-Observational studies -Experiments

Observational studies

-Researchers collect data in a way that does not directly interfere with how the data arise (e.g. surveys). -Results of an observational study can generally be used to establish an association between the explanatory and response variables •Can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection.

Where p is unknown

-The CLT states SE = √(p(1−p)/n).....

Dot plots & mean

-the mean, also called the average (marked with a triangle in the above plot), is one way to measure the center of a distribution of data

Margin of error

41% ± 2.9%: We are 95% confident that 38.1% to 43.9% of the public believe young adults, rather than middle-aged or older adults, are having the toughest time in today's economy. 49% ± 4.4%: We are 95% confident that 44.6% to 53.4% of 18-34 year olds have taken a job they didn't want just to pay the bills.

Finding the exact probability-using R

> pnorm(-1.82, mean = 0, sd = 1)
[1] 0.0344
OR
> pnorm(35.8, mean = 36, sd = 0.11)
[1] 0.0345

outcome of interest

A

Expected Value

A 2012 Gallup survey suggests that 26.2% of Americans are obese. Among a random sample of 100 Americans, how many would you expect to be obese? •Easy enough, 100 × 0.262 = 26.2. •Or more formally, μ = np = 100 × 0.262 = 26.2. •But this doesn't mean in every random sample of 100 people exactly 26.2 will be obese. In fact, that's not even possible. In some samples this value will be less, and in others more. How much would we expect this value to vary?

Calculating expectation practice

A casino game costs $5 to play. If the first card you draw is red, then you get to draw a second card (without replacement). If the second card is the ace of clubs, you win $500. If not, you don't win anything, i.e. lose your $5. What is your expected profit/loss from playing this game? Remember: profit/loss = winnings - cost. (a) A profit of 5¢ *(b) A loss of 10¢* (c) A loss of 25¢ (d) A loss of 30¢
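
Answer (b) follows from multiplying along the two draws. The sketch below is one way to lay out that arithmetic; the variable names are illustrative, and exact fractions are used so that no rounding happens until the last step.

```python
from fractions import Fraction

# First draw: 26 of the 52 cards are red.
p_first_red = Fraction(26, 52)
# Second draw: the ace of clubs is black, so it is still among the 51 remaining cards.
p_ace_of_clubs_second = Fraction(1, 51)

p_win = p_first_red * p_ace_of_clubs_second
expected_winnings = 500 * p_win        # dollars won on average per game
expected_profit = expected_winnings - 5  # profit/loss = winnings - cost

print(p_win)                             # exact probability of winning
print(round(float(expected_profit), 2))  # expected profit in dollars
```

A negative expected profit means an average loss per game, which matches "a loss of 10¢".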

Cluster sampling example

A city council has requested a household survey be conducted in a suburban area of their city. The area is broken into many distinct and unique neighborhoods, some including large homes, some with only apartments. Which approach would likely be the least effective? A) simple random B) cluster sampling C) stratified sampling D) blocked sampling Answer: Cluster sampling would be least effective because the neighborhoods differ so much that you might study the large homes but not the apartments

Application activity: Inverting probabilities

A common epidemiological model for the spread of diseases is the SIR model, where the population is partitioned into three groups: Susceptible, Infected, and Recovered. This is a reasonable model for diseases like chickenpox where a single infection usually provides immunity to subsequent infections. Sometimes these diseases can also be difficult to detect. Imagine a population in the midst of an epidemic where 60% of the population is considered susceptible, 10% is infected, and 30% is recovered. The only test for the disease is accurate 95% of the time for susceptible individuals, 99% for infected individuals, but 65% for recovered individuals. (Note: In this case accurate means returning a negative result for susceptible and recovered individuals and a positive result for infected individuals). Draw a probability tree to reflect the information given above. If the individual has tested positive, what is the probability that they are actually infected?
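
One way to work the SIR question is to turn each branch of the probability tree into a number. The sketch below assumes only the figures stated in the card; the dictionary names are illustrative.

```python
# Primary branches: share of the population in each group.
prior = {"susceptible": 0.60, "infected": 0.10, "recovered": 0.30}

# Secondary branches: P(test is positive | group).
# "Accurate" means negative for susceptible/recovered and positive for infected,
# so the positive rate is 1 - accuracy for the first and last groups.
p_positive_given = {
    "susceptible": 1 - 0.95,
    "infected": 0.99,
    "recovered": 1 - 0.65,
}

# Joint probabilities P(group and positive): multiply along each branch.
joint = {group: prior[group] * p_positive_given[group] for group in prior}

# Bayes' Theorem: P(infected | +) = P(infected and +) / P(+).
p_positive = sum(joint.values())
p_infected_given_positive = joint["infected"] / p_positive

print(round(p_positive, 3))
print(round(p_infected_given_positive, 3))
```

Even with a test that is 99% accurate for infected people, fewer than half of the positive results come from infected individuals, because the much larger susceptible and recovered groups produce many false positives.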

probability density function

A function used to compute probabilities for a continuous random variable. The area under the graph of a probability density function over an interval represents probability.

Probability distributions

A probability distribution lists all possible events and the probabilities with which they occur -The probability distribution for the gender of one kid. Rules for probability distributions: 1. The events listed must be disjoint 2. Each probability must be between 0 and 1 3. The probabilities must total 1 •The probability distribution for the genders of two kids: (See slide 10, Ch. 3 Lecture)

normal distribution

A symmetric, bell-shaped curve that represents the pattern in which many characteristics are dispersed in the population. (overwhelmingly the most common)

contingency table

A table that summarizes data for two categorical variables

complement of an event

All possible outcomes that are not in the event

Shape of distribution: unusual observations

Are there any unusual observations or potential outliers?

Quality control Example

At Heinz ketchup factory the amounts which go into bottles of ketchup are supposed to be normally distributed with mean 36 oz. and standard deviation 0.11 oz. Once every 30 minutes a bottle is selected from the production line, and its contents are noted precisely. If the amount of ketchup in the bottle is below 35.8 oz. or above 36.2 oz., then the bottle fails the quality control inspection. What percent of bottles have less than 35.8 ounces of ketchup? Let X = amount of ketchup in a bottle: X ∼ N(μ=36, σ=0.11). Z = (35.8 − 36)/0.11 = −1.82
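
As a cross-check on the Z-table lookup, the same areas can be computed in software. The sketch below uses Python's `statistics.NormalDist`, whose `cdf` method plays the role of R's `pnorm`; the variable names are illustrative.

```python
from statistics import NormalDist

# X = amount of ketchup in a bottle, X ~ N(mu = 36, sigma = 0.11)
ketchup = NormalDist(mu=36, sigma=0.11)

# P(X < 35.8): bottles that are underfilled
p_below = ketchup.cdf(35.8)

# P(35.8 < X < 36.2): bottles that pass the inspection
p_pass = ketchup.cdf(36.2) - ketchup.cdf(35.8)

print(round(p_below, 4))
print(round(p_pass, 3))
```

The results differ from the table-based answers only in the last digit, because the table rounds Z to −1.82 before the lookup.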

Data: Gender Discrimination Example

At a first glance, does there appear to be a relationship between promotion and gender? % of males promoted: 21/24 = 0.875. % of females promoted: 14/24 = 0.583

condition

B

How are bar plots different than histograms?

Bar plots are used for displaying distributions of categorical variables, histograms are used for numerical variables. The x-axis in a histogram is a number line, hence the order of the bars cannot be changed. In a bar plot, the categories can be listed in any order (though some orderings make more sense than others, especially for ordinal variables.)

Cutoff points Practice

Body temperatures of healthy humans are distributed nearly normally with mean 98.2°F and standard deviation 0.73°F. What is the cutoff for the highest 10% of human body temperatures? (a) 97.3°F *(b) 99.1°F* (c) 99.4°F (d) 99.6°F. P(X > x) = 0.10 → P(Z < 1.28) = 0.90. Z = (obs − mean)/SD → (x − 98.2)/0.73 = 1.28, so x = (1.28 × 0.73) + 98.2 = 99.1

Finding cutoff points

Body temperatures of healthy humans are distributed nearly normally with mean 98.2°F and standard deviation 0.73°F. What is the cutoff for the lowest 3% of human body temperatures? P(X < x) = 0.03 → P(Z < −1.88) = 0.03. Z = (obs − mean)/SD → (x − 98.2)/0.73 = −1.88, so x = (−1.88 × 0.73) + 98.2 = 96.8°F
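
Both body-temperature cutoffs go from an area back to a value, which is the inverse of a percentile lookup. A minimal sketch with Python's `statistics.NormalDist`, whose `inv_cdf` method does this directly (the variable names are illustrative):

```python
from statistics import NormalDist

# Body temperature ~ N(mu = 98.2, sigma = 0.73), in degrees Fahrenheit
body_temp = NormalDist(mu=98.2, sigma=0.73)

# Highest 10%: the value with 90% of the area below it
top_10_cutoff = body_temp.inv_cdf(0.90)

# Lowest 3%: the value with 3% of the area below it
bottom_3_cutoff = body_temp.inv_cdf(0.03)

print(round(top_10_cutoff, 1))
print(round(bottom_3_cutoff, 1))
```

Note that the "highest 10%" question is asked in terms of the upper tail, but the lookup always uses the area to the left, hence 0.90 rather than 0.10.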

Point estimates and sampling variability

Chapter 5 Lecture

Independence and conditional probabilities

Consider the following (hypothetical) distribution of gender and major of students in an introductory statistics class: -The probability that a randomly selected student is a social science major is 60/100 = 0.6. -The probability that a randomly selected student is a social science major given that they are female is 30/50 = 0.6 -Since P(SS|M) also equals 0.6, major of students in this class does not depend on their gender: P(SS|F) = P(SS).

discrete variable

Consists of separate, indivisible categories. No values can exist between two neighboring categories. ex: 1 person (you can't have 1.5 persons)

Week 4-Chapter 3

Defining Probability

Disjoint and non-disjoint outcomes

Disjoint (mutually exclusive) outcomes: Cannot happen at the same time. -the outcome of a single coin toss cannot be a head and a tail -a student cannot both fail and pass a class -a single card drawn from a deck cannot be an ace and a queen. Non-disjoint outcomes: Can happen at the same time -A student can get an A in Stats and an A in Econ in the same semester

Chapter 4 Vocabulary

Distributions of Random Variables

Pie charts

Ex: can you tell which order encompasses the lowest percentage of mammal species *NOT good for data with many many points*

exponentially

Extremely rapid increase. -In general, the probabilities for geometric distribution decrease exponentially fast.

shapes of binomial distribution

For this activity you will use a web applet. Go to https://gallery.shinyapps.io/distcalc/ and choose Binomial coin experiment in the drop down menu on the left. •Set the number of trials to 20 and the probability of success to 0.15. Describe the shape of the distribution of number of successes. •Keeping p constant at 0.15, determine the minimum sample size required to obtain a unimodal and symmetric distribution of number of successes. Please submit only one response per team. •Further considerations: •What happens to the shape of the distribution as n stays constant and p changes? •What happens to the shape of the distribution as p stays constant and n changes?

Recap-General Addition Rule

General addition rule: P(A or B) = P(A) + P(B) − P(A and B). Note: For disjoint events P(A and B) = 0, so the above formula simplifies to P(A or B) = P(A) + P(B).

Independence and conditional probabilities (cont.)

Generically, if P(A|B) = P(A) then the events A and B are said to be independent. •Conceptually: Giving B doesn't tell us anything about A. •Mathematically: We know that if events A and B are independent, P(A and B) = P(A) × P(B). Then, P(A|B) = P(A and B)/P(B) = [P(A) × P(B)]/P(B) = P(A)

Geometric Distributions (cont.)

Geometric distribution describes the waiting time until a success for independent and identically distributed (iid) Bernoulli random variables. •independence: outcomes of trials don't affect each other •identical: the probability of success is the same for each trial. Geometric probabilities: If p represents probability of success, (1−p) represents probability of failure, and n represents number of independent trials, P(success on the nth trial) = (1−p)^(n−1) × p
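
The geometric formula reads as "n − 1 failures in a row, then one success". A small sketch, assuming p = 0.35 purely as an illustrative success probability; the function name `geometric_pmf` is also an illustrative choice.

```python
def geometric_pmf(n: int, p: float) -> float:
    """P(first success occurs on trial n) for iid Bernoulli trials."""
    if n < 1:
        raise ValueError("n counts trials, so it must be at least 1")
    # (n - 1) failures, each with probability (1 - p), then one success
    return (1 - p) ** (n - 1) * p


p = 0.35  # illustrative probability of success on each trial
first_five = [geometric_pmf(n, p) for n in range(1, 6)]
for n, prob in enumerate(first_five, start=1):
    print(n, round(prob, 4))
```

Each probability is the previous one multiplied by (1 − p), which is the exponential decrease mentioned on the "exponentially" card.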

data density

Histograms provide a view of data density. Higher bars represent where the data are relatively more common

Distributions of number of successes

Hollow histograms of samples from the binomial model where p = 0.10 and n = 10, 30, 100, and 300. What happens as n increases?

Binomial vs. Negative Binomial

How is the negative binomial distribution different from the binomial distribution? •In the binomial case, we typically have a fixed number of trials and instead consider the number of successes. •In the negative binomial case, we examine how many trials it takes to observe a fixed number of successes and require that the last observation be a success.

Extracurricular activities

How would you describe the shape of the distribution of hours per week students spend on extracurricular activities? Answer: Unimodal and right skewed, with a potentially unusual observation at 60 hours/week.

General Addition Rule

If A and B are any two events, disjoint or not, then the probability that at least one of them will occur is P(A or B) = P(A) + P(B) - P(A and B) where P(A and B) is the probability that both events occur

Mean vs. Median

If the distribution is symmetric, center is often defined as the mean: mean = median If the distribution is skewed or has extreme outliers, center is often defined as the median *right-skewed: mean > median *left-skewed: mean < median

blocking

If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups

Expected value of a discrete random variable

In a game of cards you win $1 if you draw a heart, $5 if you draw an ace (including the ace of hearts), $10 if you draw the king of spades and nothing for any other card you draw. Write the probability model for your winnings, and calculate your expected winning.
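
A possible way to write out the probability model for this game, as a sketch. It assumes the ace of hearts pays $5 only (it counts as an ace, as the card states), so 12 hearts remain at $1; the list layout and names are illustrative.

```python
from fractions import Fraction

# (winnings in dollars, number of cards in a 52-card deck with that payout)
model = [
    (1, 12),   # hearts other than the ace of hearts
    (5, 4),    # the four aces, including the ace of hearts
    (10, 1),   # king of spades
    (0, 35),   # every other card
]

# The outcomes are disjoint and must cover the whole deck.
total_cards = sum(count for _, count in model)

# E(X) = sum of x_i * P(X = x_i)
expected_winnings = sum(Fraction(win * count, 52) for win, count in model)

print(total_cards)
print(expected_winnings, round(float(expected_winnings), 2))
```

Checking that the card counts add to 52 is the same as checking that the probabilities in the model total 1.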

Practice-Probability Distributions

In a survey, 52% of respondents said they are Democrats. What is the probability that a randomly selected respondent from this sample is a Republican? (a) 0.48 (b) more than 0.48 (c) less than 0.48 *(d) cannot calculate using only the information given* If the only two political parties are Republican and Democrat, then (a) is possible. However it is also possible that some people do not affiliate with a political party or affiliate with a party other than these two. Then (c) is also possible. However (b) is definitely not possible since it would result in the total probability for the sample space being above 1.

Sampling practice

In most card games cards are dealt without replacement. What is the probability of being dealt an ace and then a 3? Choose the closest answer. (a) 0.0045 (b) 0.0059 *(c) 0.0060* (d) 0.1553. P(ace then 3) = 4/52 × 4/51 ≈ 0.0060

independent and identically distributed (iid)

In this case, the independence aspect just means the individuals in the example don't affect each other, and identical means they each have the same probability of success

Convenience sample

Individuals who are easily accessible are more likely to be included in the sample.

Law of Large Numbers

Law of large numbers states that as more observations are collected, the proportion of occurrences with a particular outcome, p̂_n, converges to the probability of that outcome, p.

Expected value of a discrete random variable Graph

Lecture 3 slide 49

Expected value and its variability

Mean and standard deviation of binomial distribution: μ = np, σ = √(np(1−p)) •Going back to the obesity rate: σ = √(np(1−p)) = √(100 × 0.262 × 0.738) ≈ 4.4 •We would expect 26.2 out of 100 randomly sampled Americans to be obese, with a standard deviation of 4.4.
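
The two binomial formulas are easy to verify numerically. A minimal sketch; the helper name `binomial_mean_sd` is illustrative.

```python
from math import sqrt


def binomial_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Return (mu, sigma) for a binomial distribution with n trials and success probability p."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return mu, sigma


# Obesity example: n = 100 sampled Americans, p = 0.262
mu, sigma = binomial_mean_sd(100, 0.262)
print(round(mu, 1), round(sigma, 1))
```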

Expected value and its variability

Mean and standard deviation of geometric distribution: μ = 1/p, σ = √((1−p)/p²) •Going back to Dr. Smith's experiment: σ = √((1−p)/p²) = √((1−0.35)/0.35²) = 2.3 •Dr. Smith is expected to test 2.86 people before finding the first one that refuses to administer the shock, give or take 2.3 people. •These values only make sense in the context of repeating the experiment many many times.

Application activity: Shapes of Distributions

Sketch the expected distributions of the following variables: •number of piercings •scores on an exam •IQ scores. Come up with a concise way (1-2 sentences) to teach someone how to determine the expected distribution of any variable.

row totals

Provide the total counts across each row.

Simplifying random variables

Random variables do not work like normal algebraic variables: X + X ≠ 2X *See lecture 3 slide 58*

Simple random sample

Randomly select cases from the population, where there is no implied connection between the points that are selected.

Population and Samples Example

Research question: Can people become better, more efficient runners on their own, merely by running? Population of interest: All people. Sample: Group of adult women who recently joined a running group. Population to which results can be generalized: Adult women, if the data are randomly sampled

Experiment

Researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables. *In general, association does not imply causation, and causation can only be inferred from a randomized experiment.*

Probability Practice cont.

Roughly 20% of undergraduates at a university are vegetarian or vegan. What is the probability that, among a random sample of 3 undergraduates, at least one is vegetarian or vegan? (a) 1 − 0.2×3 (b) 1 − 0.2^3 (c) 0.8^3 (d) 1 − 0.8×3 *(e) 1 − 0.8^3* P(at least 1 veg) = 1 − P(none veg) = 1 − (1 − 0.2)^3 = 1 − 0.8^3 = 1 − 0.512 = 0.488. Answer: (e) 1 − 0.8^3
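
The "at least one" trick (take the complement of "none") generalizes to any sample size. A short sketch, assuming the sampled students are independent; the function name is illustrative.

```python
def p_at_least_one(p: float, n: int) -> float:
    """P(at least one success in n independent trials) = 1 - P(no successes)."""
    return 1 - (1 - p) ** n


# Vegetarian/vegan example: p = 0.2, sample of 3 undergraduates
answer = p_at_least_one(0.2, 3)
print(round(answer, 3))
```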

Sample space and complements

Sample space is the collection of all possible outcomes of a trial -A couple has one kid, what is the sample space for the gender of this kid? S = {M, F} -A couple has two kids, what is the sample space for the gender of these kids? S = {MM, FF, FM, MF}. Complementary events are two mutually exclusive events whose probabilities add up to 1. -A couple has one kid. If we know that the kid is not a boy, what is the gender of this kid? {F} (M is ruled out) → Boy and girl are complementary outcomes. -A couple has two kids, if we know that they are not both girls, what are the possible gender combinations for these kids? {MM, FM, MF} (FF is ruled out)

outliers

Sample values that lie very far away from the vast majority of the other sample values

When p is low

Suppose we have a population where the true population proportion is p=.05, and we take random samples of size n=50 from this population. We calculate the sample proportion in each sample and plot these proportions. Would you expect this distribution to be nearly ......

Chapter 2-Examining Numerical Data

Textbook Vocabulary

robust statistics

The median and IQR are called robust statistics because extreme observations have little effect on their values: moving to the most extreme value generally has little influence on these statistics

How large is large enough?

The sample size is considered large enough if the expected number of successes and failures are both at least 10: np ≥ 10 and n(1−p) ≥ 10. Example: 10 × 0.13 = 1.3; 10 × (1 − 0.13) = 8.7

Bin width

The width of an interval that is used in making a histogram. which ones of the histograms are useful? which reveal too much about the data? which hide too much?

Calculating percentiles

There are many ways to compute percentiles/areas under the curve:
•R:
> pnorm(1800, mean = 1500, sd = 300)
[1] 0.8413447
•Applet: https://gallery.shinyapps.io/distcalc/

Probabilities from continuous distributions

Therefore, the probability that a randomly sampled US adult is between 180 cm and 185 cm can also be estimated as the shaded area under the curve. By definition... Since continuous probabilities are estimated as "the area under the curve", the probability of a person being exactly 180 cm (or any exact value) is defined as 0.

Simulations using software

These simulations are tedious and slow to run using the method described earlier. In reality, we use software to generate the simulations. The dot plot below shows the distribution of simulated differences in promotion rates based on 100 simulations.

Application activity: simulating the experiment

Use a deck of playing cards to simulate this experiment.
1. Let a face card represent not promoted and a non-face card represent promoted. Consider aces as face cards. •Set aside the jokers. •Take out 3 aces → there are exactly 13 face cards left in the deck (face cards: A, K, Q, J). •Take out a number card → there are exactly 35 number (non-face) cards left in the deck (number cards: 2-10).
2. Shuffle the cards and deal them into two groups of size 24, representing males and females.
3. Count and record how many files in each group are promoted (number cards).
4. Calculate the proportion of promoted files in each group and take the difference (male - female), and record this value.
5. Repeat steps 2 - 4 many times.
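
The card-shuffling steps translate almost line for line into software. The sketch below is one possible version, not the software used in the lecture; the function name, the seed, and the choice of 100 repetitions are illustrative assumptions.

```python
import random


def simulate_differences(n_sims: int = 100, seed: int = 0) -> list[float]:
    """Shuffle 35 'promoted' and 13 'not promoted' files into two groups of 24
    and record the difference in promotion rates (male - female) each time."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    deck = ["promoted"] * 35 + ["not promoted"] * 13
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(deck)                     # step 2: shuffle and deal
        male, female = deck[:24], deck[24:]
        male_promoted = male.count("promoted")      # step 3: count promotions
        female_promoted = female.count("promoted")
        diffs.append((male_promoted - female_promoted) / 24)  # step 4
    return diffs                              # step 5: repeated n_sims times


diffs = simulate_differences()
observed = (21 - 14) / 24  # difference seen in the actual study
share_as_extreme = sum(d >= observed for d in diffs) / len(diffs)
print(len(diffs), round(observed, 3), share_as_extreme)
```

The last line reports how often shuffling alone produced a difference at least as large as the observed one, which is the quantity the dot plot of simulated differences is meant to show.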

Dot plots

Useful for visualizing one numerical variable. Darker colors represent areas where there are more observations. Ex: how would you describe the distribution of GPAs in this data set? Make sure to say something about the center, shape, and spread of this distribution

Chapter 3

Vocabulary

Variability

We are also often interested in the variability in the values of a random variable. σ² = Var(X) = Σ_{i=1}^{k} (x_i − E(X))² P(X = x_i); σ = SD(X) = √Var(X)

success

We label a person a success if their healthcare costs do not exceed the deductible

Gender Discrimination: Results

We saw a difference of almost 30% (29.2% to be exact) between the proportion of male and female files that are promoted. Based on this information, which of the below is true? (a) If we were to repeat the experiment we will definitely see that more female files get promoted. This was a fluke. (b) Promotion is dependent on gender, males are more likely to be promoted, and hence there is gender discrimination against women in promotion decisions. *Maybe* (c) The difference in the proportions of promoted male and female files is due to chance, this is not evidence of gender discrimination against women in promotion decisions. *Maybe* (d) Women are less qualified than men, and this is why fewer females get promoted.

Introduction to Data

Week 1

Breast cancer screening-Inverting Probabilities

When a patient goes through breast cancer screening there are two competing claims: patient has cancer and patient doesn't have cancer. If a mammogram yields a positive result, what is the probability that patient actually has cancer? P(C|+) = P(C and +)/P(+) = 0.0133/(0.0133 + 0.0983) = 0.12

Union of non-disjoint events

What is the probability of drawing a jack or a red card from a well shuffled full deck? P(jack or red) = P(jack) + P(red) − P(jack and red) = 4/52 + 26/52 − 2/52 = 28/52

Practice

What is the probability that a randomly sampled student thinks marijuana should be legalized or they agree with their parents' political views? (a) (40 + 36 − 78)/165 (b) (114 + 118 − 78)/165 (c) 78/165 (d) 78/188 (e) 11/47 (See slide 8, Ch. 3 Lecture)

right skewed

When data trail off to the right in this way and have a longer right tail, median<mean

Sampling without replacement cont.

When drawing without replacement you do not put back what you just drew. •Suppose you pulled a blue chip in the first draw. If drawing without replacement, what is the probability of drawing a blue chip in the second draw? 1st draw: 5 red, 3 blue, 2 orange; 2nd draw: 5 red, 2 blue, 2 orange. Prob(2nd chip B | 1st chip B) = 2/9 = 0.22 •If drawing without replacement, what is the probability of drawing two blue chips in a row? Prob(1st chip B) · Prob(2nd chip B | 1st chip B) = 0.3 × 0.22 = 0.066

Sampling with replacement

When sampling with replacement, you put back what you just drew. •Imagine you have a bag with 5 red, 3 blue and 2 orange chips in it. What is the probability that the first chip you draw is blue? Prob(1st chip B) = 3/(5 + 3 + 2) = 3/10 = 0.3 •Suppose you did indeed pull a blue chip in the first draw. If drawing with replacement, what is the probability of drawing a blue chip in the second draw? 1st draw: 5 red, 3 blue, 2 orange; 2nd draw: 5 red, 3 blue, 2 orange. Prob(2nd chip B | 1st chip B) = 3/10 = 0.3

Normal approximation to the binomial

When the sample size is large enough, the binomial distribution with parameters n and p can be approximated by the normal model with parameters μ = np and σ = √(np(1−p)). •In the case of the Facebook power users, n = 245 and p = 0.25: μ = 245 × 0.25 = 61.25, σ = √(245 × 0.25 × 0.75) = 6.78 •Bin(n=245, p=0.25) ≈ N(μ=61.25, σ=6.78).

Practice: which of these following events would you be the most surprised by?

Which of the following events would you be most surprised by? (a) exactly 3 heads in 10 coin flips (b) exactly 3 heads in 100 coin flips (c)exactly 3 heads in 1000 coin flips Answer: C

Properties of the choose function-Which of the following is false?

Which of the following is false? (a) There are n ways of getting 1 success in n trials, (n choose 1) = n. (b) There is only 1 way of getting n successes in n trials, (n choose n) = 1. (c) There is only 1 way of getting n failures in n trials, (n choose 0) = 1. *(d) There are n−1 ways of getting n−1 successes in n trials, (n choose n−1) = n−1.*

Whiskers and outliers

Whiskers -the whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles -max upper whisker reach = Q3 + 1.5 × IQR -max lower whisker reach = Q1 − 1.5 × IQR. Example with Q1 = 10 and Q3 = 20: IQR = 20 − 10 = 10 -max upper whisker reach = 20 + 1.5 × 10 = 35 -max lower whisker reach = 10 − 1.5 × 10 = −5 •A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.

Outliers

Why is it important to look for outliers? -Identify extreme skew in the distribution -Identify data collection and entry errors -Provide insight into interesting features of the data

What is the probability that the average Facebook user with 245 friends has 70 or more friends who would be considered power users?

Z = (obs − mean)/SD = (70 − 61.25)/6.78 = 1.29. P(Z > 1.29) = 1 − 0.9015 = 0.0985
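
The Facebook calculation chains three steps: binomial parameters, a Z score, and an upper-tail area. A sketch using Python's standard library, with n = 245 and p = 0.25 taken from the normal-approximation card; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, p = 245, 0.25  # friends per user, probability a friend is a power user

# Normal approximation to the binomial: mu = np, sigma = sqrt(np(1 - p))
mu = n * p
sigma = sqrt(n * p * (1 - p))

# Z score for 70 power-user friends
z = (70 - mu) / sigma

# P(X >= 70) is approximated by the upper-tail area P(Z > z)
p_70_or_more = 1 - NormalDist(mu, sigma).cdf(70)

print(round(mu, 2), round(sigma, 2), round(z, 2))
print(round(p_70_or_more, 3))
```

The software answer can differ from 0.0985 in the fourth decimal place because the table lookup rounds σ and Z first.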

ordinal variable

a categorical variable that has a natural ordering of the possible values

nominal

a categorical variable without this type of natural ordering

bar plot

a common way to display a single categorical variable

tree diagram (probability)

a diagram used to show the total number of possible outcomes in an experiment

stratified sampling

a divide and conquer sampling strategy -> the population is divided into groups called strata. The strata are chosen so that similar cases are grouped together. Then, a second sampling method, usually simple random sampling, is employed within each stratum

placebo

a fake treatment

randomization

a process of randomly assigning subjects to different treatment groups

continuous variable

a quantitative variable that has an infinite number of possible values that are not countable ex: unemployment

summary statistic

a single number summarizing a large amount of data

probability distribution

a table of all disjoint outcomes and their associated probabilities

confounding variable

a variable that is correlated with both the explanatory and response variables

Random variable

a variable whose value is a numerical outcome of a random phenomenon -> such a model allows us to apply a mathematical framework and statistical principles for better understanding and predicting outcomes in the real world

mosaic plot

a visualization technique suitable for contingency tables that resemble a standardized stacked bar plot with the benefit that we will see the relative group sizes of the primary variable as well

secondary branches

all other branches

Types of Variables

all variables 1. Numerical 1A. Continuous 1B. Discrete 2. Categorical 2A: Regular Categorical 2B. Ordinal

simulation

an imitation of a possible situation

Scatterplot

are useful for visualizing the relationship between two numerical variables Ex: do life expectancy and total fertility appear to be associated or independent? -> they appear to be linearly and negatively associated as fertility increases, life expectancy decreases.

Venn diagrams

are useful when outcomes can be categorized as "in" or "out" for two or three variables, attributes, or random processes

whiskers

attempt to capture the data outside of the box

probability of a success

because 70% of individuals will not hit their deductible, we denote the probability of a success as p = 0.7

Categorical, ordinal

bedtime

Continuous Distributions

break in lecture

Geometric Distribution

break in lecture

Random variables

break in lecture

Sampling from a small population

break in lecture

Central Limit Theorem

central limit theorem -Sample proportions will be nearly normally distributed with mean equal to the population proportion, p, and standard error equal to √(p(1−p)/n): p̂ ∼ N(mean = p, SE = √(p(1−p)/n)) -It wasn't a coincidence that the sampling distribution we saw earlier was symmetric, and centered at the true population proportion -We don't go through a detailed proof of why SE = √(p(1−p)/n), but note that as n increases, SE decreases *as n increases samples will yield more consistent p̂s, i.e. variability among p̂s will be lower*
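
To see how quickly the standard error shrinks, it helps to plug in a few sample sizes. The sketch below assumes p = 0.5 purely for illustration; the function name `standard_error` is also an illustrative choice.

```python
from math import sqrt


def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p(1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


p = 0.5  # illustrative population proportion
for n in (100, 400, 1600):
    print(n, standard_error(p, n))
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.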

Multistage Sample

clusters are usually NOT made up of homogenous observations. We take a simple random sample of clusters, then take a simple random sample of observations from the sampled clusters (random sampling at BOTH stages)

retrospective studies

collect data after events have taken place. Example: Researchers reviewing past events in medical records.

intensity map

colors are used to show higher and lower values of a variable

standard deviation

defined as the square root of the variance

negative binomial distribution

describes the probability of observing the "k^th" success on the "n^th" trial

deviation

distance of an observation from its mean

multimodal

distributions with more than two modes

Categorical, ordinal, could be used as numerical

dread

simple random sample

every member of the population has a known and equal chance of selection Ex: a raffle

Intensity map

ex: what patterns are apparent in the change in population between 2000 and 2010?

placebo effect

experimental units showing improvement simply because they believe they are receiving a special treatment

placebo

fake treatment, often used as the control group for medical studies

primary branch

first branch for inoculation

uniform

has no mode

Categorical

gender

prospective study

identifies individuals and collects information as events unfold

Independent variables

if two variables are not associated

without replacement

if we sample from a small population without replacement, we no longer have independence between our observations

census

including the entire population

Outliers

influential points that can drag the line

multistage sample

is like a cluster sample, but rather than keeping all observations in each cluster, we collect a random sample within each cluster

Variance

is roughly the average squared deviation from the mean: s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) •The sample mean is x̄ = 6.71, and the sample size is n = 217. •The variance of amount of sleep students get per night can be calculated as:

Shape of distribution: Skewness

is the histogram right skewed, left skewed, or symmetric? Positive skew: tail on the right. Negative skew: tail on the left. 0 skew: symmetric

Standard deviation

is the square root of the variance, and has the same units as the data: s = √(s²) •The standard deviation of amount of sleep students get per night can be calculated as: s = √4.11 = 2.03 hours •We can see that all of the data are within 3 standard deviations of the mean.
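
The sample variance and standard deviation can be computed by hand or with Python's `statistics` module, which divides by n − 1. The five sleep values below are made up for illustration; they are not the class data set with n = 217.

```python
from statistics import mean, stdev, variance

# Hypothetical hours of sleep for five students (not the lecture data)
hours = [5, 9, 7, 6, 8]

x_bar = mean(hours)

# By hand: sum of squared deviations from the mean, divided by n - 1
s2_by_hand = sum((x - x_bar) ** 2 for x in hours) / (len(hours) - 1)

# Library versions use the same n - 1 denominator
s2 = variance(hours)
s = stdev(hours)

print(x_bar, s2_by_hand, s2, round(s, 2))
```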

left skewed

long, thinner tail to the left

false negative

not perceiving a stimulus that is present

Poisson distribution

often useful for estimating the number of events in a large population over a unit of time. For instance, consider each of the following events: -having a heart attack -getting married -getting struck by lightning

unimodal

one mode

There are two parts to a conditional probability...

outcome of interest and the condition

hollow histograms

outlines of histograms of each group put on the same plot

categorical variable

places an individual into one of several groups or categories Example: vegetables and fruits

marginal probabilities

probabilities based on a single variable without regard to any other variables

random noise

reminder that the observed outcomes in the data sample may not perfectly reflect the true relationships between variables since there is random noise

sample

represents a subset of the cases and is often a small fraction of the population

transformation

rescaling of the data using a function

trial

suppose a health insurance company found that 70% of the people they insure stay below their deductible in any given year. Each of these people can be thought of as a trial

Sampling distribution

suppose you were to repeat this process many times and obtain as many ps. This distribution is called a sampling distribution

Box plot

the box in a box plot represents the middle 50% of the data, and the thick line in the box is the median

Interquartile Range (IQR)

the difference between the first and third quartiles (the total length of the box)

Explanatory and response variables

to identify the explanatory variable in a pair of variables, identify which of the two is suspected of affecting the other: explanatory variable -> might affect response variable *Labeling variables as explanatory and response does not guarantee the relationship between the two is actually causal, even if there is an association identified between the two variables. We use these labels only to keep track of which variable we suspect affects the other.*

column totals

total counts down each column

side-by-side box plot

traditional tool for comparing across groups

bimodal

two modes

Independent

two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other

Bayesian Statistics

updating beliefs using Bayes' Theorem is actually the foundation of this entire section of statistics

Binomial Distribution

used to describe the number of successes in a fixed number of trials. This is different from the geometric distribution, which described the number of trials we must wait before we observe a success

cluster sample

we break up the population into many groups called clusters, then we sample a fixed number of clusters and include all observations from each of those clusters in the sample

failure

we label a person a failure if he/she does exceed his/her deductible in a year

A trial is like a hypothesis test

•Hypothesis testing is very much like a court trial. H0: Defendant is innocent HA: Defendant is guilty •We then present the evidence - collect data. •Then we judge the evidence - "Could these data plausibly have happened by chance if the null hypothesis were true?" >If they were very unlikely to have occurred, then the evidence raises more than a reasonable doubt in our minds about the null hypothesis. •Ultimately we must make a decision. How unlikely is unlikely?

Number of hours of sleep on school nights

•Mean = 6.88 hours, SD = 0.92 hrs •72% of the data are within 1 SD of the mean: 6.88 ± 0.93 •92% of the data are within 2 SD of the mean: 6.88 ± 2×0.93 •99% of the data are within 3 SD of the mean: 6.88 ± 3×0.93

Why do we use the squared deviation in the calculation of variance?

•To get rid of negatives so that observations equally distant from the mean are weighed equally. •To weigh larger deviations more heavily.

Sampling without replacement (cont.)

•When drawing without replacement, the probability of the second chip being blue given the first was blue is not equal to the probability of drawing a blue chip in the first draw since the composition of the bag changes with the outcome of the first draw. Prob(B|B) ≠ Prob(B) •When drawing without replacement, draws are not independent. •This is especially important to take note of when the sample sizes are small. If we were dealing with, say, 10,000 chips in a (giant) bag, taking out one chip of any color would not have as big an impact on the probabilities in the second draw.

row proportions

computed as the counts divided by their row totals

Six sigma

"The term six sigma process comes from the notion that if one has six standard deviations between the process mean and the nearest specification limit, as shown in the graph, practically no items will fail to meet specifications."

Heights of females-Normal Distribution Example

"When we looked into the data for women, we were surprised to see height exaggeration was just as widespread, though without the lurch towards a benchmark height."

Heights of Males-Normal Distribution Example

"The male heights on OkCupid very nearly follow the expected normal distribution - except the whole thing is shifted to the right of where it should be. Almost universally guys like to add a couple inches." "You can also see a more subtle vanity at work: starting at roughly 5' 8", the top of the dotted curve tilts even further rightward. This means that guys as they get closer to six feet round up a bit more than usual, stretching for that coveted psychological benchmark."

What is the main difference between observational studies and experiments?

(a) Experiments take place in a lab while observational studies do not need to. (b) In an observational study we only look at what happened in the past. (c) Most experiments use random assignment while observational studies do not. (d) Observational studies are completely useless since no causal inference can be made based on their findings. Answer: (c) Most experiments use random assignment while observational studies do not.

Practice: which of the following is false?

(a) Majority of Z scores in a right skewed distribution are negative. *(b) In skewed distributions the Z score of the mean might be different than 0.* (c) For a normal distribution, IQR is less than 2×SD. (d) Z scores are helpful for determining how unusual a data point is compared to the rest of the data in the distribution.

An August 2012 Gallup poll suggests that 13% of Americans think home schooling provides an excellent education for children. Would a random sample of 1,000 Americans where only 100 share this opinion be considered unusual?

(a) No (b) Yes μ = np = 1,000 × 0.13 = 130 σ = √(np(1−p)) = √(1,000 × 0.13 × 0.87) ≈ 10.6 Method 1: Range of usual observations: 130 ± 2×10.6 = (108.8, 151.2). 100 is outside this range, so it would be considered unusual. Method 2: Z-score of observation: Z = (x − mean)/SD = (100 − 130)/10.6 = −2.83. 100 is more than 2 SD below the mean, so it would be considered unusual.
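The two methods on this card can be sketched in a few lines. The function names and the `cutoff` parameter are illustrative choices; the 2-SD rule of thumb is the one the card uses.

```python
import math

def binomial_mean_sd(n, p):
    """Mean np and standard deviation sqrt(np(1-p)) of a binomial count."""
    return n * p, math.sqrt(n * p * (1 - p))

def z_and_unusual(x, n, p, cutoff=2.0):
    """Z score of an observed count, and whether |Z| exceeds the cutoff."""
    mu, sigma = binomial_mean_sd(n, p)
    z = (x - mu) / sigma
    return z, abs(z) > cutoff

z, unusual = z_and_unusual(100, 1000, 0.13)
print(round(z, 2), unusual)
```

Because σ is not rounded to 10.6 here, Z comes out very slightly different from the card's −2.83, but the conclusion is the same.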

A random variable that follows which of the following distributions can take on values other than positive integers?

(a) Poisson (b) Negative binomial (c)Binomial *(d)Normal* (e) Geometric

Which of the following describes a case where we would use the negative binomial distribution to calculate the desired probability?

(a) Probability that a 5 year old boy is taller than 42 inches. (b) Probability that 3 out of 10 softball throws are successful. (c) Probability of being dealt a straight flush hand in poker. (d) Probability of missing 8 shots before the first hit. **(e) Probability of hitting the ball for the 3rd time on the 8th try.**

Can we calculate the probability of rolling a 6 for the first time on the 6th roll of a die using the geometric distribution? Note that what was a success (rolling a 6) and what was a failure (not rolling a 6) are clearly defined and one or the other must happen for each trial.

(a) no, on the roll of a die there are more than 2 possible outcomes *(b) yes, why not* P(6 on the 6th roll) = (5/6)^5 × (1/6) ≈ 0.067

A 2012 Gallup survey suggests that 26.2% of Americans are obese. Among a random sample of 10 Americans, what is the probability that exactly 8 are obese?

(a) pretty high *(b)pretty low*

Which of the following is not a condition that needs to be met for the binomial distribution to be applicable?

(a) the trials must be independent (b) the number of trials, n, must be fixed (c) each trial outcome must be classified as a success or a failure **(d) the number of desired successes, k, must be greater than the number of trials** (e) the probability of success, p, must be the same for each trial

Practice: Which of these variables do you expect to be uniformly distributed?

(a) weights of adult females (b) salaries of a random sample of people from North Carolina (c) house prices (d) birthdays of classmates (day of the month) Answer: (d)birthdays of classmates (day of the month)

Below are four pairs of Binomial distribution parameters. Which distribution can be approximated by the normal distribution?

(a) n=100, p=0.95 **(b) n=25, p=0.45** (c) n=150, p=0.05 (d) n=500, p=0.015
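A quick way to see why (b) is the answer, assuming the usual rule that both np and n(1−p) should be at least 10. The function name and the `minimum` parameter are illustrative.

```python
def normal_approx_ok(n, p, minimum=10):
    """True when both expected successes and expected failures reach the minimum."""
    return n * p >= minimum and n * (1 - p) >= minimum

options = {"a": (100, 0.95), "b": (25, 0.45), "c": (150, 0.05), "d": (500, 0.015)}
for label, (n, p) in options.items():
    print(label, round(n * p, 2), round(n * (1 - p), 2), normal_approx_ok(n, p))
```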

Anecdotal evidence and early smoking research

*Anti-smoking research started in 1930s and 1940s when cigarette smoking became increasingly popular. While some smokers seem to be sensitive to cigarette smoke, others were completely unaffected. *Anti-smoking research was faced with resistance based on anecdotal evidence such as "My uncle smokes three packs a day and he's in perfectly good health", evidence based on a limited sample size that might not be representative of the population. *It was concluded that "smoking is a complex human behavior, by its nature difficult to study, confounded by human variability." *In time researchers were able to examine larger samples of cases (smokers), and trends showing that smoking has negative health impacts became much clearer.

Cluster Sample

*Clusters* are usually not made up of homogeneous observations. We take a simple random sample of clusters, and then sample ~all~ observations in that cluster. Usually preferred for economic reasons. Ex: a city is divided into blocks, 2 blocks are randomly selected, and every house on those blocks is surveyed

Stratified sample

*Strata* are made up of similar observations. We take a simple random sample from *each* stratum

Pros and cons of transformations

-Skewed data are easier to model when they are transformed because outliers tend to become far less prominent after an appropriate transformation # of games: 70, 50, 25 log # of games: 4.25, 3.91, 3.22 -However, results of an analysis in log units of the measured variable might be difficult to interpret. What other variables would you expect to be extremely skewed? Answer: salary, housing prices, etc.
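The log values on this card match the natural logarithm, which the short sketch below reproduces. The variable names are illustrative.

```python
import math

games = [70, 50, 25]
log_games = [round(math.log(g), 2) for g in games]  # natural log
print(log_games)  # [4.25, 3.91, 3.22]
```

Note how the gaps shrink: 70 and 25 are 45 games apart, but only about 1 unit apart on the log scale.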

Bar plots with two variables

-Stacked bar plot: graphical display of contingency table information, for counts. -Side-by-side bar plot: Displays the same information by placing bars next to, instead of on top of, each other. -Standardized stacked bar plot: Graphical display of contingency table information, for proportions.

Median

-The median is the value that splits the data in half when ordered in ascending order 0,1,2,3,4 Median=2 -If there are an even number of observations, then the median is the average of the two values in the middle 0,1,2,3,4,5 -> (2+3)/2=2.5 Median=2.5 -Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the 50th percentile
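Both examples on this card can be checked with Python's standard library; this is only a sanity check of the odd and even cases, not part of the original notes.

```python
from statistics import median

odd_case = median([0, 1, 2, 3, 4])       # middle value
even_case = median([0, 1, 2, 3, 4, 5])   # average of the two middle values
print(odd_case, even_case)  # 2 2.5
```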

Extending the framework for other statistics

-The strategy of using a sample statistic to estimate a parameter is quite common, and it's a strategy that we can apply to other statistics besides a proportion -take a sample of students at a college and ask them how many extracurricular activities they are involved in to estimate the average number of extracurricular activities of all students at this college -the principles and general ideas are from this chapter

When the conditions are not met....

-When either np or n(1-p) is small, the distribution is more discrete....

Bar plots

-a bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.

Contingency tables

-a table that summarizes data for two categorical variables is called a contingency table -the contingency table below shows the distribution of survival and ages of passengers on the Titanic.

Sampling distributions are never observed

-in real-world applications, we never actually observe the sampling distribution, yet it is useful...

Class Layout (Lecture 1)

-not necessary to attend during the zoom time -attendance not necessary -textbook assigned open source in bookstore -main focus: get you prepared for graduate school Book: Open Intro to Statistics R Software No dedicated labs, just lectures with some lab work to demonstrate statistical things Free software package Online textbook? Statistical techniques/why we need to know them, how to interpret them https://rstudio.com/products/rstudio/download/ https://leanpub.com/openintro-statistics

Suppose that you don't have access to the population of all American adults, which is a quite likely scenario. In order to estimate the proportion of American adults who support solar power expansion, you might sample from the population and use your sample proportion as the best guess for the unknown population proportion

-sample, with replacement, 1000 American adults from the population....

Q1, Q3 and IQR

-the 25th percentile is also called the first quartile, Q1 -the 50th percentile is also called the median -the 75th percentile is also called the third quartile, Q3 -between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR. IQR = Q3 − Q1

Mean

-the sample mean, denoted as x̄, can be calculated as x̄ = (x_1 + x_2 + ··· + x_n)/n -the population mean is also computed the same way but is denoted as μ. It is often not possible to calculate μ since population data are rarely available. -the sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.

Point estimates and error

-we are often interested in population parameters -complete populations are difficult to collect data on, so we use sample statistics as point estimates for the unknown population parameters of interest -Error in the estimate = difference between population parameter and sample statistic -Bias is a systematic tendency to over- or under-estimate the true population parameter -Sampling error describes how much an estimate will tend to vary from one sample to the next -Much of statistics is focused on understanding and quantifying sampling error, and sample size is helpful for quantifying this error

Extremely skewed data

-when data are extremely skewed, transforming them might make modeling easier. A common transformation is the *log transformation*. -The histogram on the left shows the distribution of number of basketball games attended by students. The histogram on the right shows the distribution of log of number of games attended. (Slide 30, Lecture 2)

Simulating the experiment...

... under the assumption of independence, i.e. leave things up to chance. If results from the simulations based on the chance model look like the data, then we can determine that the difference between the proportions of promoted files between males and females was simply due to chance (promotion and gender are independent). If the results from the simulations based on the chance model do not look like the data, then we can determine that the difference between the proportions of promoted files between males and females was not due to chance, but due to an actual effect of gender (promotion and gender are dependent).

Two competing claims

1. "There is nothing going on." Promotion and gender are independent, no gender discrimination, observed difference in proportions is simply due to chance. → Null hypothesis 2. "There is something going on." Promotion and gender are dependent, there is gender discrimination, observed difference in proportions is not due to chance. → Alternative hypothesis

Practice-Inverting probabilities

1. A woman who gets tested once and obtains a positive result wants to get tested again. In the second test, what should we assume to be the probability of this specific woman having cancer? (a) 0.017 *(b) 0.12* (c) 0.0133 (d) 0.88 2. What is the probability that this woman has cancer if this second mammogram also yielded a positive result? (a) 0.0936 (b) 0.088 (c) 0.48 *(d) 0.52* P(C|+) = P(C and +)/P(+) = 0.0936/(0.0936 + 0.088) = 0.52
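The last step of this card, sketched as code. It uses only the two joint probabilities printed on the card (0.0936 for cancer and positive, 0.088 for no cancer and positive); the function and argument names are illustrative.

```python
def posterior(joint_event_and_evidence, joint_not_event_and_evidence):
    """P(C | +) = P(C and +) / [P(C and +) + P(not C and +)]."""
    total_evidence = joint_event_and_evidence + joint_not_event_and_evidence
    return joint_event_and_evidence / total_evidence

p_cancer_given_second_positive = posterior(0.0936, 0.088)
print(round(p_cancer_given_second_positive, 2))  # 0.52
```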

Principles of Experimental Design

1. Control: Control for the (potential) effect of variables other than the ones directly being studied. 2. Randomize: Randomly assign subjects to treatments, and randomly sample from the population whenever possible. 3. Replicate: Within a study, replicate by collecting a sufficiently large sample. Or replicate the entire study. 4. Block: If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups.

Adding or multiplying

A company has 5 Lincoln Town Cars in its fleet. Historical data show that annual maintenance cost for each car is on average $2,154 with a standard deviation of $132. What is the mean and the standard deviation of the total annual maintenance cost for this fleet? Note that we have 5 cars each with the given annual maintenance cost (X1+X2+X3+X4+X5), not one car that had 5 times the given annual maintenance cost (5X). Lecture 3 Slide 59
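The card does not show the worked answer, so the following is a sketch under one stated assumption: the five cars' costs are independent, so their variances add. The function names are illustrative. It also shows why the sum of five cars differs from 5X.

```python
import math

def sum_of_independent(mean_each, sd_each, count):
    """Mean and SD of X1 + ... + Xcount for independent, identically distributed X."""
    return count * mean_each, math.sqrt(count * sd_each ** 2)

def scaled_single(mean_each, sd_each, factor):
    """Mean and SD of factor * X for a single variable X."""
    return factor * mean_each, factor * sd_each

fleet_mean, fleet_sd = sum_of_independent(2154, 132, 5)
one_car_mean, one_car_sd = scaled_single(2154, 132, 5)
print(fleet_mean, round(fleet_sd, 2))   # same mean, smaller SD
print(one_car_mean, one_car_sd)         # same mean, SD of 660
```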

Gallup poll Practice

A recent Gallup poll suggests that 25.5% of Texans do not have health insurance as of June 2012. Assuming that the uninsured rate stayed constant, what is the probability that two randomly selected Texans are both uninsured? (a) 25.5^2 (b) 0.255^2 (c) 0.255×2 (d) (1−0.255)^2 Answer: (b) 0.255^2

An analysis of Facebook users

A recent study found that "Facebook users get more than they give". For example: •40% of Facebook users in our sample made a friend request, but 63% received at least one request •Users in our sample pressed the like button next to friends' content an average of 14 times, but had their content "liked" an average of 20 times •Users sent 9 personal messages, but received 12 •12% of users tagged a friend in a photo, but 35% were themselves tagged in a photo. Any guesses for how this pattern can be explained? We are given that n=245, p=0.25, and we are asked for the probability P(K≥70). To proceed, we need independence, which we'll assume but could check if we had access to more Facebook data. P(K≥70) = P(K=70 or K=71 or K=72 or ··· or K=245) = P(K=70)+P(K=71)+P(K=72)+···+P(K=245) This seems like an awful lot of work...
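The sum that "seems like an awful lot of work" by hand is short in code. This is a sketch of the exact binomial calculation, not the method used in the lecture; the function names are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(K = k) for a binomial count with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_upper_tail(k_min, n, p):
    """P(K >= k_min), adding up P(K = k) for k = k_min, ..., n."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))

tail = binomial_upper_tail(70, 245, 0.25)
print(tail)
```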

Practice-School Parking

A school district is considering whether it will no longer allow high school students to park at school after two recent accidents where students were severely injured. As a first step, they survey parents by mail, asking them whether or not the parents would object to this policy change. Of 6,000 surveys that go out, 1,200 are returned. Of these 1,200 surveys that were completed, 960 agreed with the policy change and 240 disagreed. Which of the following statements are true? I. Some of the mailings may have never reached the parents. II. The school district has strong support from parents to move forward with the policy approval. III. It is possible that majority of the parents of high school students disagree with the policy change. IV. The survey results are unlikely to be biased because all parents were mailed a survey. (a) Only I (b) I and II (c) I and III (d) III and IV (e) Only IV Answer: (c) I and III

Practice

A study is designed to test the effect of light level and noise level on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so wants to make sure both genders are equally represented in each group. Which of the below is correct? (a) There are 3 explanatory variables (light, noise, gender) and 1 response variable (exam performance) (b) There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance) (c) There is 1 explanatory variable (gender) and 3 response variables (light, noise, exam performance) (d) There are 2 blocking variables (light and noise), 1 explanatory variable (gender), and 1 response variable (exam performance)

Data Basics: Classroom survey

A survey was conducted on students in an introductory statistics course. Below are a few of the questions on the survey, and the corresponding variables the data from the responses were stored in: •gender: What is your gender? •introextra: Do you consider yourself introverted or extraverted? •sleep: How many hours do you sleep at night, on average? •bedtime: What time do you usually go to bed? •countries: How many countries have you visited? •dread: On a scale of 1-5, how much do you dread being here?

What type of variable is a telephone area code?

A. numerical, continuous B. numerical, discrete C. categorical D. categorical, ordinal

Fair game

A fair game is defined as a game that costs as much as its expected payout, i.e. expected profit is 0. Do you think casino games in Vegas cost more or less than their expected payouts? If those games cost less than their expected payouts, it would mean that the casinos would be losing money on average, and hence they wouldn't be able to pay for all this:

generalized linear models

Allow us to analyze models that have a particular kind of non-linearity and particular kinds of non-normally distributed (but still independent and constant) errors. -The idea of modeling rates for a Poisson distribution against a second variable such as the day of week forms the foundation of some more advanced methods that fall in the realm of generalized linear models

double-blind study

An experiment in which neither the participant nor the researcher knows whether the participant has received the treatment or the placebo

Practice-Independence

Between January 9-12, 2013, SurveyUSA interviewed a random sample of 500 NC residents asking them whether they think widespread gun ownership protects law abiding citizens from crime, or makes society more dangerous. 58% of all respondents said it protects citizens. 67% of White respondents, 28% of Black respondents, and 64% of Hispanic respondents shared this view. Which of the below is true? Opinion on gun ownership and race ethnicity are most likely (a) complementary (b) mutually exclusive (c) independent *(d) dependent* (e) disjoint Answer: (d) dependent

CLT conditions

Certain conditions must be met for the CLT to apply: 1. Independence: sampled observations must be independent. This is difficult to verify, but is more likely if -random sampling/assignment is used and -If sampling without replacement, n<10% of the population. 2. Sample size: there should be at least 10 expected successes and 10 expected failures in the observed sample. This is d....

Calculating the # of Scenarios-Choose Function

The choose function is useful for calculating the number of ways to choose k successes in n trials. (n choose k) = n! / (k! (n−k)!) •k=1, n=4: (4 choose 1) = 4! / (1! (4−1)!) = (4×3×2×1) / (1×(3×2×1)) = 4 •k=2, n=9: (9 choose 2) = 9! / (2! (9−2)!) = (9×8×7!) / (2×1×7!) = 72/2 = 36
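The factorial formula translates directly into code. The `choose` helper below is written out for illustration; Python's built-in `math.comb` computes the same quantity.

```python
from math import comb, factorial

def choose(n, k):
    """Number of ways to choose k successes in n trials: n! / (k! (n-k)!)."""
    return factorial(n) // (factorial(k) * factorial(n - k))

print(choose(4, 1), choose(9, 2))  # 4 36
print(comb(9, 2))                  # 36, same result from the standard library
```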

Disjoint vs. Complementary

Does the sum of probabilities of two disjoint events always add up to 1? -> Not necessarily, there may be more than 2 events in the sample space, e.g. party affiliation. Does the sum of probabilities of two complementary events always add up to 1? -> Yes, that's the definition of complementary, e.g. heads and tails.

Shape of distribution: Modality

Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)? Note: In order to determine modality, step back and imagine a smooth curve over the histogram - imagine that the bars are wooden blocks and you drop a limp spaghetti over them, the shape the spaghetti would take could be viewed as a smooth curve.

Choosing the appropriate proportion

Does there appear to be a relationship between age and survival for passengers on the Titanic? To answer this question we examine the row proportions: % adults who survived: 654/2092 ≈ 0.31 % children who survived: 57/109 ≈ 0.52

Geometric Distribution-Milgram experiment

Dr. Smith wants to repeat Milgram's experiments but she only wants to sample people until she finds someone who will not inflict a severe shock. What is the probability that she stops after the first person? P(1st person refuses) = 0.35 ... the third person? P(1st and 2nd shock, 3rd refuses) = 0.65 (S) × 0.65 (S) × 0.35 (R) = 0.65^2 × 0.35 ≈ 0.15 ... the tenth person? P(9 shock, 10th refuses) = 0.65 (S) × ··· × 0.65 (S) [9 of these] × 0.35 (R) = 0.65^9 × 0.35 ≈ 0.0072
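The three calculations on this card follow one pattern: n − 1 failures followed by one success. A minimal sketch, with an illustrative function name and p = 0.35 taken from the card:

```python
def geometric_pmf(n, p):
    """Probability that the first success happens on trial n."""
    return (1 - p) ** (n - 1) * p

p_refuse = 0.35
for n in (1, 3, 10):
    print(n, round(geometric_pmf(n, p_refuse), 4))
```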

Counting the # of scenarios

Earlier we wrote out all possible scenarios that fit the condition of exactly one person refusing to administer the shock. If n was larger and/or k was different than 1, for example, n=9 and k=2: RRSSSSSSS, SRRSSSSSS, SSRRSSSSS, ···, SSRSSRSSS, ···, SSSSSSSRR, writing out all possible scenarios would be incredibly tedious and prone to errors.

Side by side box plots

Ex: does there appear to be a relationship between class year and number of clubs students are in?

Chapter 2 Lecture

Examining Numerical Data

Relapse

Experiment Example (See Ch. 3 Lecture slides 23-25) Researchers randomly assigned 72 chronic users of cocaine into three groups: desipramine (antidepressant), lithium (standard treatment for cocaine) and placebo. Results of the study are summarized below. What is the probability that a patient did not relapse? *Marginal Probability:* What is the probability the patient relapsed? P(relapsed) = 48/72 ≈ 0.67 *Joint Probability:* What is the probability that a patient received the antidepressant (desipramine) and relapsed? P(relapsed and desipramine) = 10/72 ≈ 0.14 *Conditional Probability:* P(relapse | desipramine) = P(relapse and desipramine)/P(desipramine) = (10/72) / (24/72) = 10/24 ≈ 0.42 Conditional Probability cont. -> If we know that a patient received the antidepressant (desipramine), what is the probability that they relapsed? P(relapse | desipramine) = 10/24 ≈ 0.42 P(relapse | lithium) = 18/24 ≈ 0.75 P(relapse | placebo) = 20/24 ≈ 0.83 P(desipramine | relapse) = 10/48 ≈ 0.21 P(lithium | relapse) = 18/48 ≈ 0.375 P(placebo | relapse) = 20/48 ≈ 0.42
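The marginal, joint, and conditional probabilities on this card all come from the same counts. A sketch using only the numbers printed above (relapse counts of 10, 18, and 20 in groups of 24); the dictionary and function names are illustrative.

```python
relapsed = {"desipramine": 10, "lithium": 18, "placebo": 20}
group_size = {"desipramine": 24, "lithium": 24, "placebo": 24}

total_patients = sum(group_size.values())  # 72
total_relapsed = sum(relapsed.values())    # 48

def marginal_relapse():
    """P(relapse), ignoring treatment."""
    return total_relapsed / total_patients

def joint(treatment):
    """P(relapse and treatment)."""
    return relapsed[treatment] / total_patients

def relapse_given(treatment):
    """P(relapse | treatment): restrict to that treatment group."""
    return relapsed[treatment] / group_size[treatment]

def treatment_given_relapse(treatment):
    """P(treatment | relapse): restrict to patients who relapsed."""
    return relapsed[treatment] / total_relapsed

print(round(marginal_relapse(), 2), round(joint("desipramine"), 2))
for name in relapsed:
    print(name, round(relapse_given(name), 2), round(treatment_given_relapse(name), 3))
```

Note that P(relapse | desipramine) and P(desipramine | relapse) use the same numerator but different denominators, which is why they differ.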

A college student working at a psychology lab is asked to recruit 10 couples to participate in a study. She decides to stand outside the student center and ask every 5th person leaving the building whether they are in a relationship and, if so, whether they would like to participate in the study with their significant other. Suppose the probability of finding such a person is 10%. What is the probability that she will need to ask 30 people before she hits her goal?

Given: p=0.10, k=10, n=30. We are asked to find the probability of the 10th success on the 30th trial, therefore we use the negative binomial distribution. P(10th success on the 30th trial) = (29 choose 9) × 0.10^10 × 0.90^20 = 10,015,005 × 0.10^10 × 0.90^20 = 0.00012
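The same calculation as a small function: the first n − 1 trials must contain exactly k − 1 successes, and trial n must be a success. The function name is illustrative.

```python
from math import comb

def negative_binomial_pmf(n, k, p):
    """Probability that the k-th success occurs on trial n."""
    return comb(n - 1, k - 1) * p ** k * (1 - p) ** (n - k)

print(comb(29, 9))                                  # 10015005
print(round(negative_binomial_pmf(30, 10, 0.10), 5))  # 0.00012
```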

Suppose that in a rural region of a developing country electricity power failures occur following a Poisson distribution with an average of 2 failures every week. Calculate the probability that on a given day the electricity fails three times.

Given λ=2. P(only 1 failure in a week) = 2^1 × e^(−2) / 1! = 2 × e^(−2) / 1 = 0.27 We are given the weekly failure rate, but to answer this question we need to first calculate the average rate of failure on a given day: λ_day = 2/7 = 0.2857. Note that we are assuming that the probability of power failure is the same on any day of the week, i.e. we assume independence. P(3 failures on a given day) = 0.2857^3 × e^(−0.2857) / 3! = 0.2857^3 × e^(−0.2857) / 6 = 0.0029
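Both Poisson calculations on this card use the same formula, λ^k × e^(−λ) / k!, with the rate rescaled from weeks to days for the second one. A sketch with an illustrative function name:

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """P(exactly k events) when events occur at the given average rate."""
    return rate ** k * exp(-rate) / factorial(k)

weekly_rate = 2
daily_rate = weekly_rate / 7  # assumes failures are equally likely on any day

print(round(poisson_pmf(1, weekly_rate), 2))  # 0.27
print(round(poisson_pmf(3, daily_rate), 4))   # 0.0029
```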

Expected Value

How many people is Dr. Smith expected to test before finding the first one that refuses to administer the shock? The expected value, or the mean, of a geometric distribution is defined as 1/p. μ = 1/p = 1/0.35 = 2.86 She is expected to test 2.86 people before finding the first one that refuses to administer the shock. But how can she test a non-whole number of people?

General Multiplication Rule

If A and B represent two outcomes or events, then P(A and B) = P(A | B) × P(B). It is useful to think of A as the outcome of interest and B as the condition

Checking for independence

If P(A occurs, given that B is true) = P(A | B) = P(A), then A and B are independent. P(protects citizens) = 0.58 P(randomly selected NC resident says gun ownership protects citizens, given that the resident is white) = P(protects citizens | White) = 0.67 P(protects citizens | Black) = 0.28 P(protects citizens | Hispanic) = 0.64 P(protects citizens) varies by race/ethnicity, therefore opinion on gun ownership and race ethnicity are most likely dependent.

Sampling Bias: Non-Response

If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.

Putting everything together (Texas Gallup Poll Practice)

If we were to randomly select 5 Texans, what is the probability that at least one is uninsured? •If we were to randomly select 5 Texans, the sample space for the number of Texans who are uninsured would be: S={0,1,2,3,4,5} •We are interested in instances where at least one person is uninsured: S={0,1,2,3,4,5} •So we can divide up the sample space into two categories: S={0, at least one} Since the probability of the sample space must add up to 1: Prob(at least 1 uninsured) = 1 − Prob(none uninsured) = 1 − [(1−0.255)^5] = 1 − 0.745^5 = 1 − 0.23 = 0.77 At least 1: P(at least one) = 1 − P(none)
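The complement trick on this card, sketched as a function. It assumes the selections are independent, as the card does; the function name is illustrative.

```python
def prob_at_least_one(p_single, n):
    """P(at least one of n independent selections has the trait) = 1 - P(none)."""
    return 1 - (1 - p_single) ** n

print(round(prob_at_least_one(0.255, 5), 2))  # 0.77
```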

Sampling Bias Example: Landon vs. FDR

In 1936, Landon sought the Republican presidential nomination opposing the re-election of FDR. The Literary Digest Poll: •The Literary Digest polled about 10 million Americans, and got responses from about 2.4 million. •The poll showed that Landon would likely be the overwhelming winner and FDR would get only 43% of the votes. •Election result: FDR won, with 62% of the votes. •The magazine was completely discredited because of the poll, and was soon discontinued. What went wrong? •The magazine had surveyed: -Its own readers -registered automobile owners -registered telephone users •These groups had incomes well above the national average of the day (remember, this is Great Depression era) which resulted in lists of voters far more likely to support Republicans than a truly typical voter of the time, i.e. the sample was not representative of the American population at the time.

Describing variability using the 68-95-99.7 rule

SAT scores are distributed nearly normally with mean 1500 and standard deviation 300. •∼68% of students score between 1200 and 1800 on the SAT. •∼95% of students score between 900 and 2100 on the SAT. •∼99.7% of students score between 600 and 2400 on the SAT.

Robust statistics

Median and IQR are more robust to skewness and outliers than mean and SD -for skewed distributions it is often more helpful to use median and IQR to describe the center and spread -for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread Example: If you would like to estimate the typical household income for a student, would you be more interested in the mean or median income? Answer: Median

Case Study: Treating Chronic Fatigue Syndrome

Objective: Evaluate the effectiveness of cognitive behavior therapy for chronic fatigue syndrome Participant pool: 142 patients who were recruited from referrals by primary care physicians and consultants to a hospital clinic specializing in chronic fatigue syndrome Actual participants: Only 60 of the 142 referred patients entered the study. Some were excluded because they didn't meet the diagnostic criteria, some had other health issues and some refused to be a part of the study Study Design: Patients randomly assigned to treatment and control groups, 30 patients in each group *Treatment: Cognitive behavior therapy-collaborative, educative, and with a behavioral emphasis. Patients were shown on how activity could be increased steadily and safely without exacerbating symptoms *Control: Relaxation-no advice was given about how activity could be increased. Instead progressive muscle relaxation, visualization, and rapid relaxation skills were taught *Results: The table shows the distribution of patients with good outcomes at 6 month follow up. Note that 7 patients dropped out of the study: 3 from the treatment and 4 from the control group *Proportion with good outcomes in treatment group: 19/27=.70->70% *Proportion with good outcomes in the control group: 5/26=.19->19% Understanding the results Does the data show a "real" difference between the groups? *Suppose you flip a coin 100 times. While the chance a coin lands heads in any given coin flip is 50%, we probably won't observe exactly 50 heads. This type of fluctuation is part of almost any type of data generating the process *The observed difference between the two groups (70-19=51%) *Since the difference is quite large, it is more believable that the difference is real *We need statistical tools to determine if the difference is solarge that we should reject the notion that it was due tochance. 
Generalizing the results: These patients had specific characteristics and volunteered to be a part of this study, therefore they may not be representative of all patients with chronic fatigue syndrome. While we cannot immediately generalize the results to all patients, this first study is encouraging. The method works for patients with some narrow set of characteristics, and that gives hope that it will work, at least to some degree, with other patients. *Be careful with generalization: we need to know where the participants came from before we can generalize from them (EX: college students' drinking habits vs. non-college students')
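
One way to ask whether the 70% vs. 19% gap above could plausibly be due to chance is a shuffling (randomization) simulation. The sketch below is only an illustration and not the analysis used in the study: it takes the counts from the card (19 of 27 and 5 of 26), while the function name, the number of shuffles, and the seed are assumptions.

```python
import random

def permutation_p_value(good_t, n_t, good_c, n_c, reps=10_000, seed=1):
    """Share of random relabelings with a difference at least as large as observed."""
    rng = random.Random(seed)  # fixed seed (assumption) so the run is repeatable
    # 1 = good outcome, 0 = not; pool both groups together
    outcomes = [1] * (good_t + good_c) + [0] * (n_t + n_c - good_t - good_c)
    observed = good_t / n_t - good_c / n_c
    extreme = 0
    for _ in range(reps):
        rng.shuffle(outcomes)  # pretend group labels were assigned by chance alone
        diff = sum(outcomes[:n_t]) / n_t - sum(outcomes[n_t:]) / n_c
        if diff >= observed:
            extreme += 1
    return observed, extreme / reps

observed_diff, p_value = permutation_p_value(19, 27, 5, 26)
print(round(observed_diff, 2), p_value)
```

A very small share of shuffles matching the observed difference would suggest that chance alone is an unlikely explanation.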

Sampling Bias: Voluntary response

Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the population.

Calculating the expectation of linear combination

On average you take 10 minutes for each statistics homework problem and 15 minutes for each chemistry homework problem. This week you have 5 statistics and 4 chemistry homework problems assigned. What is the total time you expect to spend on statistics and chemistry homework for the week? E(S+S+S+S+S+C+C+C+C) = 5×E(S) + 4×E(C) = 5×10 + 4×15 = 50 + 60 = 110 min
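
The same arithmetic can be written as a short Python sketch. The helper name `expected_linear_combination` is hypothetical; the numbers come from the homework example above.

```python
def expected_linear_combination(terms):
    """terms: list of (coefficient, expected value) pairs."""
    return sum(coef * mean for coef, mean in terms)

# 5 statistics problems at 10 min each, 4 chemistry problems at 15 min each
total_minutes = expected_linear_combination([(5, 10), (4, 15)])
print(total_minutes)  # 110
```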

Product rule for independent events

P(A and B) = P(A)×P(B) Or more generally, P(A1 and ··· and Ak) = P(A1)×···×P(Ak) You toss a coin twice, what is the probability of getting two tails in a row? P(T on the first toss)×P(T on the second toss) = 1/2×1/2 = 1/4

Standardizing with Z scores

SAT scores are distributed nearly normally with mean 1500 and standard deviation 300. ACT scores are distributed nearly normally with mean 21 and standard deviation 5. A college admissions officer wants to determine which of the two applicants scored better on their standardized test with respect to the other test takers: Pam, who earned an 1800 on her SAT, or Jim, who scored a 24 on his ACT? Since we cannot just compare these two raw scores, we instead compare how many standard deviations beyond the mean each observation is. •Pam's score is (1800−1500)/300 = 1 standard deviation above the mean. •Jim's score is (24−21)/5 = 0.6 standard deviations above the mean. •These are called standardized scores, or Z scores. •The Z score of an observation is the number of standard deviations it falls above or below the mean. Z = (observation − mean)/SD •Z scores are defined for distributions of any shape, but only when the distribution is normal can we use Z scores to calculate percentiles. •Observations that are more than 2 SD away from the mean (|Z|>2) are usually considered unusual.
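
As a quick check of the Pam vs. Jim comparison, here is a minimal Python sketch of the Z score formula. The function name `z_score` is an assumption; the means, SDs, and scores are those given above.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies above (+) or below (-) the mean."""
    return (observation - mean) / sd

pam_z = z_score(1800, 1500, 300)  # SAT
jim_z = z_score(24, 21, 5)        # ACT
print(pam_z, jim_z)  # 1.0 0.6
```

Pam's Z score is larger, so relative to the other test takers she did better than Jim.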

From histograms to continuous distributions

Since height is a continuous numerical variable, its probability density function is a smooth curve.

Binomial distribution

Suppose we randomly select four individuals to participate in this experiment. What is the probability that exactly 1 of them will refuse to administer the shock? Let's call these people Allen (A), Brittany (B), Caroline (C), and Damian (D). Each one of the four scenarios below will satisfy the condition of "exactly 1 of them refuses to administer the shock": Scenario 1: 0.35 (A) refuse × 0.65 (B) shock × 0.65 (C) shock × 0.65 (D) shock = 0.0961 Scenario 2: 0.65 (A) shock × 0.35 (B) refuse × 0.65 (C) shock × 0.65 (D) shock = 0.0961 Scenario 3: 0.65 (A) shock × 0.65 (B) shock × 0.35 (C) refuse × 0.65 (D) shock = 0.0961 Scenario 4: 0.65 (A) shock × 0.65 (B) shock × 0.65 (C) shock × 0.35 (D) refuse = 0.0961 The probability of exactly 1 of 4 people refusing to administer the shock is the sum of all of these probabilities: 4 × 0.0961 ≈ 0.38.

Independence

Two processes are independent if knowing the outcome of one provides no useful information about the outcome of the other. •Knowing that the coin landed on a head on the first toss does not provide any useful information for determining what the coin will land on in the second toss. → Outcomes of two tosses of a coin are independent. •Knowing that the first card drawn from a deck is an ace does provide useful information for determining the probability of drawing an ace in the second draw. → Outcomes of two draws from a deck of cards (without replacement) are dependent.

The birthday problem

What is the probability that 2 randomly chosen people share a birthday? Pretty low, 1/365 = 0.0027. What is the probability that at least 2 people out of 366 people share a birthday? Exactly 1! (Excluding the possibility of a leap year birthday.) What is the probability that at least 2 people (1 match) out of 121 people share a birthday? Somewhat complicated to calculate, but we can think of it as the complement of the probability that there are no matches in 121 people. P(no matches) = 1 × (1 − 1/365) × (1 − 2/365) × ··· × (1 − 120/365) = (365 × 364 × ··· × 245)/365^121 = 365!/(365^121 × (365−121)!) = 121! × C(365, 121)/365^121 ≈ 0, where C(365, 121) is the binomial coefficient "365 choose 121". P(at least 1 match) ≈ 1
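
The complement calculation above is tedious by hand but short in code. A minimal Python sketch, assuming 365 equally likely birthdays and ignoring leap years as the card does (the function name is hypothetical):

```python
def prob_shared_birthday(n_people, days=365):
    """P(at least one shared birthday) = 1 - P(no matches)."""
    p_no_match = 1.0
    for i in range(n_people):
        # the (i+1)-th person must avoid the i birthdays already taken
        p_no_match *= (days - i) / days
    return 1.0 - p_no_match

for n in (2, 121, 366):
    print(n, prob_shared_birthday(n))
```

With 366 people one factor in the product is 0, so the result is exactly 1, matching the pigeonhole argument above.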

Law of Large Numbers Example

When tossing a fair coin, if heads comes up on each of the first 10 tosses, what do you think the chance is that another head will come up on the next toss? 0.5, less than 0.5, or more than 0.5? HHHHHHHHHH? •The probability is still 0.5, or there is still a 50% chance that another head will come up on the next toss. P(H on 11th toss) = P(T on 11th toss) = 0.5 •The coin is not "due" for a tail. •The common misunderstanding of the LLN is that random processes are supposed to compensate for whatever happened in the past; this is just not true and is also called gambler's fallacy (or law of averages).

Associated Variables

When two variables show some connection with one another

column proportions

computed as the count divided by the corresponding column total

parameters

because the mean and standard deviation describe a normal distribution exactly, they are called the distribution's parameters

Numerical, discrete

countries

symmetric

data sets that show roughly equal trailing off in both directions

observational data

data where no treatment has been explicitly applied (or explicitly withheld)

geometric distribution

describes the waiting time until a success for independent and identically distributed (iid) Bernoulli random variables

stacked bar plot

graphical display of contingency table information

population

group of individuals of the same species that live in the same area

probability of failure

sometimes denoted with q=1-p, which would be 0.3 for the insurance example

conditional probability

the likelihood that a target behavior will occur in a given circumstance. The conditional probability of outcome A given condition B is computed as the following: P(A|B) = P(A and B)/P(B)

median

the middle score in a distribution; half the scores are above it and half are below it

mode

the mode is represented by a prominent peak in the distribution

with replacement

the professor sampled her students with replacement: she repeatedly sampled from the entire class without regard to whom she had already picked

sample space

the set of all possible outcomes

Law of Large Numbers (LLN)

the tendency of the observed proportion of an outcome to stabilize around its probability p as more observations are collected
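
A small simulation can make this tendency visible. The sketch below tosses a simulated fair coin and records the running proportion of heads at a few checkpoints; the function name, checkpoint sizes, and seed are assumptions chosen for illustration.

```python
import random

def running_proportion(n_tosses, p=0.5, seed=0):
    """Proportion of heads observed after 10, 100, ..., 100,000 tosses."""
    rng = random.Random(seed)  # fixed seed (assumption) so the run is repeatable
    heads = 0
    checkpoints = {}
    for i in range(1, n_tosses + 1):
        heads += rng.random() < p  # True counts as 1
        if i in (10, 100, 1_000, 10_000, 100_000):
            checkpoints[i] = heads / i
    return checkpoints

props = running_proportion(100_000)
for n, prop in props.items():
    print(n, round(prop, 4))
```

Early proportions can wander well away from 0.5, while the later ones should sit close to it. Each individual toss still has probability 0.5 regardless of what came before.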

The question from the prior slide asked for the probability of a given number of successes, k, in a given number of trials, n (k = 1 success in n = 4 trials), and we calculated this probability as # of scenarios × P(single scenario)

•# of scenarios: there is a less tedious way to figure this out, we'll get to that shortly... •P(single scenario) = p^k × (1−p)^(n−k): probability of success to the power of number of successes, times probability of failure to the power of number of failures. The Binomial distribution describes the probability of having exactly k successes in n independent Bernoulli trials with probability of success p.
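
Putting the two pieces together, the number of scenarios is the binomial coefficient "n choose k". A minimal Python sketch using the standard library's `math.comb`, with the Milgram numbers from the earlier card (p = 0.35, n = 4, k = 1); the function name is an assumption.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli trials)."""
    n_scenarios = math.comb(n, k)            # number of ways to place k successes
    p_single = p**k * (1 - p)**(n - k)       # probability of any one such scenario
    return n_scenarios * p_single

p_one_refusal = binomial_pmf(1, 4, 0.35)
print(round(p_one_refusal, 4))
```

This reproduces the "4 scenarios × 0.0961" calculation without listing the scenarios by hand.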

Linear combination

•A linear combination of random variables X and Y is given by aX+bY where a and b are some fixed numbers. •The average value of a linear combination of random variables is given by E(aX+bY) = a×E(X) + b×E(Y)

Is it Poisson?

•A random variable may follow a Poisson distribution if the event being considered is rare, the population is large, and the events occur independently of each other. •However we can think of situations where the events are not really independent. For example, if we are interested in the probability of a certain number of weddings over one summer, we should take into consideration that weekends are more popular for weddings. •In this case, a Poisson model may sometimes still be reasonable if we allow it to have a different rate for different times; we could model the rate as higher on weekends than on weekdays. •The idea of modeling rates for a Poisson distribution against a second variable (day of the week) forms the foundation of some more advanced methods called generalized linear models.

Random variables

•A random variable is a numeric quantity whose value depends on the outcome of a random event *We use a capital letter, like X, to denote a random variable *The values of a random variable are denoted with a lowercase letter, in this case x *For example, P(X=x) •There are two types of random variables: *Discrete random variables often take only integer values *Example: Number of credit hours, Difference in number of credit hours this term vs last *Continuous random variables take real (decimal) values *Example: Cost of books this term, Difference in cost of books this term vs last

Obtaining good samples

•Almost all statistical methods are based on the notion of implied randomness. •If observational data are not collected in a random framework from a population, these statistical methods - the estimates and errors associated with the estimates - are not reliable. •Most commonly used random sampling techniques are simple, stratified, and cluster sampling.

Random Process

•A random process is a situation in which we know what outcomes could happen, but we don't know which particular outcome will happen. •Examples: coin tosses, die rolls, iTunes shuffle, whether the stock market goes up or down tomorrow, etc. •It can be helpful to model a process as random even if it is not truly random.

Continuous distributions

•Below is a histogram of the distribution of heights of US adults. •The proportion of data that falls in the shaded bins gives the probability that a randomly sampled US adult is between 180 cm and 185 cm (about 5'11" to 6'1").

Bernoulli random variables

•Each person in Milgram's experiment can be thought of as a trial. •A person is labeled a success if she refuses to administer a severe shock, and a failure if she administers such a shock. •Since only 35% of people refused to administer a shock, the probability of success is p = 0.35. •When an individual trial has only two possible outcomes, it is called a Bernoulli random variable.

General Multiplication Rule

•Earlier we saw that if two events are independent, their joint probability is simply the product of their probabilities. If the events are not believed to be independent, the joint probability is calculated slightly differently. •If A and B represent two outcomes or events, then P(A and B) = P(A|B)×P(B) Note that this formula is simply the conditional probability formula, rearranged. •It is useful to think of A as the outcome of interest and B as the condition.

Difference between blocking and explanatory variables

•Factors are conditions we can impose on the experimental units. •Blocking variables are characteristics that the experimental units come with, that we would like to control for. •Blocking is like stratifying, except used in experimental settings when randomly assigning, as opposed to when sampling.

68-95-99.7 rule

•For nearly normally distributed data, •about 68% falls within 1 SD of the mean, •about 95% falls within 2 SD of the mean, •about 99.7% falls within 3 SD of the mean. •It is possible for observations to fall 4, 5, or more standard deviations away from the mean, but these occurrences are very rare if the data are nearly normal.
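
The three percentages can be checked against an exactly normal distribution with Python's standard-library `statistics.NormalDist`. This is a sketch for intuition only; the helper name `within_k_sd` is an assumption.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def within_k_sd(k):
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: within_k_sd(k) for k in (1, 2, 3)}
for k, area in coverage.items():
    print(k, round(area, 4))
```

Because the rule is stated for nearly normal data, real data sets will only match these areas approximately.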

Determining dependence based on sample data

•If conditional probabilities calculated based on sample data suggest dependence between two variables, the next step is to conduct a hypothesis test to determine if the observed difference between the probabilities is likely or unlikely to have happened by chance. •If the observed difference between the conditional probabilities is large, then there is stronger evidence that the difference is real. •If a sample is large, then even a small difference can provide strong evidence of a real difference. We saw that P(protects citizens | White) = 0.67 and P(protects citizens | Hispanic) = 0.64. Under which condition would you be more convinced of a real difference between the proportions of Whites and Hispanics who think widespread gun ownership protects citizens? n = 500 or n = 50,000? Answer: n = 50,000

A trial as a hypothesis test (cont.)

•If the evidence is not strong enough to reject the assumption of innocence, the jury returns with a verdict of "not guilty". >The jury does not say that the defendant is innocent, just that there is not enough evidence to convict. >The defendant may, in fact, be innocent, but the jury has no way of being sure. •Said statistically, we fail to reject the null hypothesis. >We never declare the null hypothesis to be true, because we simply do not know whether it's true or not. >Therefore we never "accept the null hypothesis". •In a trial, the burden of proof is on the prosecution. •In a hypothesis test, the burden of proof is on the unusual claim. •The null hypothesis is the ordinary state of affairs (the status quo), so it's the alternative hypothesis that we consider unusual and for which we must gather evidence.

Case study: Gender Discrimination Example

•In 1972, as a part of a study on gender discrimination, 48 male bank supervisors were each given the same personnel file and asked to judge whether the person should be promoted to a branch manager job that was described as "routine". •The files were identical except that half of the supervisors had files showing the person was male while the other half had files showing the person was female. •It was randomly determined which supervisors got "male" applications and which got "female" applications. •Of the 48 files reviewed, 35 were promoted. •The study is testing whether females are unfairly discriminated against.

Problems with taking a census

•It can be difficult to complete a census: there always seem to be some individuals who are hard to locate or hard to measure. And these difficult-to-find people may have certain characteristics that distinguish them from the rest of the population. •Populations rarely stand still. Even if you could take a census, the population changes constantly, so it's never possible to get a perfect measure. •Taking a census may be more complex than sampling.

Percentiles

•Percentile is the percentage of observations that fall below a given data point. •Graphically, percentile is the area below the probability distribution curve to the left of that observation.

Exploratory analysis to inference

•Sampling is natural. •Think about sampling something you are cooking - you taste (examine) a small part of what you're cooking to get an idea about the dish as a whole. •When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's exploratory analysis. •If you generalize and conclude that your entire soup needs salt, that's an inference. •For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).

Milgram Experiment

•Stanley Milgram, a Yale University psychologist, conducted a series of experiments on obedience to authority starting in 1963. •Experimenter (E) orders the teacher (T), the subject of the experiment, to give severe electric shocks to a learner (L) each time the learner answers a question incorrectly. •The learner is actually an actor, and the electric shocks are not real, but a prerecorded sound is played each time the teacher administers a shock. •These experiments measured the willingness of study participants to obey an authority figure who instructed them to perform acts that conflicted with their personal conscience. •Milgram found that about 65% of people would obey authority and give such shocks. •Over the years, additional research suggested this number is approximately consistent across communities and time.

Sampling with replacement (cont.)

•Suppose you actually pulled an orange chip in the first draw. If drawing with replacement, what is the probability of drawing a blue chip in the second draw? 1st draw: 5, 3, 2 2nd draw: 5, 3, 2 Prob(2nd chip B | 1st chip O) = 3/10 = 0.3 •If drawing with replacement, what is the probability of drawing two blue chips in a row? 1st draw: 5, 3, 2 2nd draw: 5, 3, 2 Prob(1st chip B) · Prob(2nd chip B | 1st chip B) = 0.3 × 0.3 = 0.3^2 = 0.09 •When drawing with replacement, the probability of the second chip being blue does not depend on the color of the first chip since whatever we draw in the first draw gets put back in the bag. Prob(B|B) = Prob(B|O) •In addition, this probability is equal to the probability of drawing a blue chip in the first draw, since the composition of the bag never changes when sampling with replacement. Prob(B|B) = Prob(B) •When drawing with replacement, draws are independent.
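
To see why replacement matters, the sketch below computes the two-blue-chips probability exactly with `fractions.Fraction`, using the 3 blue chips out of 10 from the card. The without-replacement line is added only as a contrast and is not part of the card; the variable names are assumptions.

```python
from fractions import Fraction

blue, total = 3, 10

# With replacement: the bag is unchanged, so the draws are independent.
p_bb_with = Fraction(blue, total) * Fraction(blue, total)

# Without replacement (contrast only): one blue chip is gone on the second draw.
p_bb_without = Fraction(blue, total) * Fraction(blue - 1, total - 1)

print(p_bb_with, p_bb_without)  # 9/100 1/15
```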

Large samples are preferable, but...

•The Literary Digest election poll was based on a sample size of 2.4 million, which is huge, but since the sample was biased, the sample did not yield an accurate prediction. •Back to the soup analogy: If the soup is not well stirred, it doesn't matter how large a spoon you have, it will still not taste right. If the soup is well stirred, a small spoon will suffice to test the soup.

Poisson Distribution

•The Poisson distribution is often useful for estimating the number of rare events in a large, fixed population over a short unit of time, provided the individuals within the population are independent. •The rate for a Poisson distribution is the average number of occurrences in a mostly-fixed population per unit of time, and is typically denoted by λ. •Using the rate, we can describe the probability of observing exactly k rare events in a single unit of time. Poisson distribution: P(observe k rare events) = λ^k × e^(−λ)/k!, where k may take a value 0, 1, 2, and so on, and k! represents k-factorial. The letter e ≈ 2.718 is the base of the natural logarithm. The mean and standard deviation of this distribution are λ and √λ, respectively.
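
The formula translates directly into a few lines of Python. In this sketch the rate λ = 2 events per unit of time is a made-up value for illustration, and the function name is an assumption.

```python
import math

def poisson_pmf(k, lam):
    """P(observe exactly k rare events) for a Poisson distribution with rate lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 2.0  # hypothetical rate, not taken from the card
probs = [poisson_pmf(k, lam) for k in range(60)]  # tail beyond k = 59 is negligible here
mean = sum(k * pk for k, pk in enumerate(probs))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))
print(round(mean, 6), round(math.sqrt(variance), 6), round(math.sqrt(lam), 6))
```

Summing over the probabilities recovers a mean close to λ and a standard deviation close to √λ, as stated above.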

Negative Binomial Distribution

•The negative binomial distribution describes the probability of observing the kth success on the nth trial. •The following four conditions are useful for identifying a negative binomial case: 1. The trials are independent. 2. Each trial outcome can be classified as a success or failure. 3. The probability of success (p) is the same for each trial. 4. The last trial must be a success. Note that the first three conditions are common to the binomial distribution. Negative binomial distribution: P(kth success on the nth trial) = C(n−1, k−1) × p^k × (1−p)^(n−k), where C(n−1, k−1) is the binomial coefficient "n−1 choose k−1" and p is the probability that an individual trial is a success. All trials are assumed to be independent.
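
A minimal Python sketch of this formula follows. The example question (probability that the 2nd refusal occurs on the 4th person, with the Milgram refusal rate p = 0.35) is constructed here for illustration and is not from the card; the function name is an assumption.

```python
import math

def negative_binomial_pmf(n, k, p):
    """P(the k-th success occurs on the n-th trial)."""
    if k < 1 or n < k:
        return 0.0  # the k-th success cannot arrive before trial k
    # choose where the first k-1 successes fall among the first n-1 trials;
    # the n-th trial itself must be a success
    return math.comb(n - 1, k - 1) * p**k * (1 - p)**(n - k)

p_second_refusal_on_fourth = negative_binomial_pmf(4, 2, 0.35)
print(round(p_second_refusal_on_fourth, 4))
```

With k = 1 the expression reduces to (1−p)^(n−1) × p, the geometric distribution's waiting time until the first success.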

The normal approximation breaks down on small intervals

•The normal approximation to the binomial distribution tends to perform poorly when estimating the probability of a small range of counts, even when the conditions are met. •This approximation for intervals of values is usually improved if cutoff values are extended by 0.5 in both directions. •The tip to add extra area when applying the normal approximation is most often useful when examining a range of observations. While it is possible to also apply this correction when computing a tail area, the benefit of the modification usually disappears since the total interval is typically quite wide.
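
The effect of the 0.5 adjustment can be seen by comparing an exact binomial sum with the normal approximation on a narrow range. The sketch below uses hypothetical values (n = 400, p = 0.2, counts 69 through 71) that are not taken from the card; the function names are assumptions as well.

```python
import math
from statistics import NormalDist

def binomial_range_exact(n, p, lo, hi):
    """Exact P(lo <= X <= hi) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))

def binomial_range_normal(n, p, lo, hi, correction=0.0):
    """Normal approximation; correction=0.5 widens the interval on both sides."""
    approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
    return approx.cdf(hi + correction) - approx.cdf(lo - correction)

n, p, lo, hi = 400, 0.2, 69, 71  # hypothetical example values
exact = binomial_range_exact(n, p, lo, hi)
plain = binomial_range_normal(n, p, lo, hi)
corrected = binomial_range_normal(n, p, lo, hi, correction=0.5)
print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

On a range this narrow, the corrected value should land noticeably closer to the exact probability than the uncorrected one.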

Linear combinations (cont.)

•The variability of a linear combination of two independent random variables is calculated as V(aX+bY) = a^2×V(X) + b^2×V(Y) •The standard deviation of the linear combination is the square root of the variance. Note: If the random variables are not independent, the variance calculation gets a little more complicated and is beyond the scope of this course.

Different Interpretations of Probability

•There are several possible interpretations of probability but they (almost) completely agree on the mathematical rules probability must follow. P(A) = Probability of event A, 0 ≤ P(A) ≤ 1 •Frequentist interpretation: -The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times. •Bayesian interpretation: -A Bayesian interprets probability as a subjective degree of belief: For the same event, two separate people could have different viewpoints and so assign different probabilities. -Largely popularized by revolutionary advances in computational technology and methods during the last twenty years.

Chapter 4 Lecture

•Unimodal and symmetric, bell shaped curve •Many variables are nearly normal, but none are exactly normal •Denoted as N(μ,σ) → Normal with mean μ and standard deviation σ

Expectation

•We are often interested in the average outcome of a random variable. •We call this the expected value (mean), and it is a weighted average of the possible outcomes

Recap: Hypothesis testing framework

•We start with a null hypothesis (H0) that represents the status quo. •We also have an alternative hypothesis (HA) that represents our research question, i.e. what we're testing for. •We conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation (today) or theoretical methods (later in the course). •If the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, we stick with the null hypothesis. If they do, then we reject the null hypothesis in favor of the alternative.

Blocking experiment

•We would like to design an experiment to investigate if energy gels make you run faster: *Treatment: energy gel *Control: no energy gel •It is suspected that energy gels might affect pro and amateur athletes differently, therefore we block for pro status: *Divide the sample into pro and amateur *Randomly assign pro athletes to treatment and control groups *Randomly assign amateur athletes to treatment and control groups *Pro/amateur status is equally represented in the resulting treatment and control groups Why is this important? Can you think of other variables to block for?

