Biology 215 exam 2
Law of Large Numbers
"subject to random variation" is not the same thing as "completely unpredictable" if we wanted to *empirically estimate* the probability of tossing a Heads in one coin toss, we would need to toss the coin multiple times and see how often it comes up "Heads". Each toss of the coin is a *trial* we could count up how many times our trials gave us a success (which we define as the outcome we are estimating the probability of - in this case, the coin landing Heads up is a *success* ), and divide by the number of trials But, with only 4 tosses, there is a good chance that instead of 2 Heads we would get 3 heads, for an empirical estimate of P(Heads) = 3/4 = 0.75, or even 4 heads, for an empirical estimate of P(Heads) = 4/4 = 1. Getting 3 or 4 heads would not be a particularly unlikely outcome, but they give us empirical estimates of P(Heads) that are very different from the correct value. If we tossed a coin 1,000 times, the empirical estimate would equal 0.5 exactly only if we got 500 Heads and 500 Tails - we wouldn't be too surprised if we got a few more or fewer Heads than 500 though due to random chance. So, based on this simple example, when we're estimating probabilities empirically we need to use large numbers of trials. In fact, when we specify a probability of Heads = 1/2, we are making a statement about what would happen in an imaginary, mathematically perfect world in which we could toss a coin an infinite number of times. Although we can never do anything an infinite number of times, we do expect our real-world trials to converge on the behavior of infinite tosses as the number of trials gets large. This is the *law of large numbers* , and it tells us that our actual, empirical results should converge on the correct theoretical values as the number of trials goes to infinity. For this exercise, a "Trial" is tossing a coin ten times, not tossing it once. 
It turns out that calculating the probability of a string of "heads" in a row in 10 tosses is a fairly complicated problem to solve mathematically. There are two *simple special cases* that are easy: A) 10 H's in a row, and B) at least one H in ten tosses.
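A minimal sketch of both ideas, assuming a fair coin with P(Heads) = 0.5. The function name `estimate_p_heads` and the seed value are illustrative choices, not part of the course material. It shows the empirical estimate settling toward 0.5 as the number of tosses grows, then computes the two easy special cases directly.

```python
import random

def estimate_p_heads(n_tosses, rng):
    """Empirical estimate of P(Heads): number of successes / number of trials."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

rng = random.Random(215)  # arbitrary fixed seed so the run is repeatable
for n in (4, 100, 10_000, 100_000):
    print(n, estimate_p_heads(n, rng))

# The two easy special cases for ten tosses of a fair coin
p_ten_heads = 0.5 ** 10          # A) all ten tosses are H: 1/1024
p_at_least_one = 1 - 0.5 ** 10   # B) complement of "all ten are T": 1023/1024
print(p_ten_heads, p_at_least_one)
```

Case B uses the complement: "at least one H" is everything except "ten T's in a row", which is why it is easy even though other run lengths are not.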
The z-transformation
The *standard normal distribution*, or z-distribution, is the normal distribution with a mean of 0 and a standard deviation of 1. We use it to calculate the tails of a distribution plot. Let's say the middle of the plot is 65.6 inches (women), the standard deviation is 2.6 inches, and we want the probability of being 69.5 inches or taller. z = (xi - mean) / standard deviation, so z = (69.5 - 65.6) / 2.6 = 1.5. A z-score of 1.5 tells us that 69.5 is 1.5 standard deviations above the mean. In other words, a height of 69.5 on the black curve of heights is equivalent to 1.5 on the blue standard normal curve. The probability of falling above a z-value is the *upper tail probability*. To CALCULATE IT, set the lower z-value to 1.5 and the upper z-value to a high number (1000, for example; it would be infinity, but the curve is so close to the x-axis at 1000 standard deviations that the probability will be the same to four decimal places as if we had used infinity). P(z ≥ 1.5) = 0.0668
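The z-transformation can be checked with a short sketch using Python's standard library. The mean of 65.6 inches is an assumption here: it is the value that makes z come out to exactly 1.5 with a standard deviation of 2.6, and it matches the 61.7 and 69.5 inch figures on the z-table card.

```python
from statistics import NormalDist

mean, sd = 65.6, 2.6   # assumed mean; SD taken from the card
x = 69.5               # height of interest, in inches

z = (x - mean) / sd                    # z-transformation
upper_tail = 1 - NormalDist().cdf(z)   # area above z on the standard normal

print(round(z, 2), round(upper_tail, 4))
```

Using the cumulative distribution function avoids needing a large stand-in for infinity as the upper limit.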
strategy for solving normal probability problems
1. Draw a picture of what you are solving.
2. Do the z-transformations.
3. Figure out what quantities are needed from the z-table.
4. Look up the probabilities from the table.
5. Calculate the final probabilities.
how to do one sample t - test
1. State the null and alternative hypotheses.
2. Calculate a test statistic.
3. The *test statistic* tells us how different our sample mean is from the null hypothetical value, measured in standard errors.
4. To do this unit conversion we first need to know the *standard error* for the data set. The standard error for a single sample of data is just the standard deviation divided by the square root of the sample size. Our sample of 20 people had a standard deviation of 0.3, which means that the standard error would be: 0.3 / √20 ≈ 0.067.
5. Now to get our *observed t-value test statistic* we just need to know how many standard errors 0.1° is: t = 0.1 / 0.067 ≈ 1.49.
6. Compare the test statistic to a sampling distribution to obtain a *p-value*. The p-value is the probability of randomly sampling from a population with a mean equal to the null value (μ = 98.6°) and obtaining a sample mean that is at least as different from the null value as the observed sample mean (x̄ - μ = 0.1°).
7. We will use the t-distribution to obtain the p-value for our observed t-value. For a *one-sample t-test* degrees of freedom equals the sample size minus 1, which in this example is 20 - 1 = 19. This p-value is called a *one-tailed p-value*, because we are only using one tail of the t-distribution to obtain it.
8. We account for the possibility of unexpected results by using a *two-tailed p-value*. The one-tailed p-value is 0.076 and, since the t-distribution is symmetrical, the same area lies in the other tail, so the two-tailed p-value is 0.076 x 2 = 0.152.
9. The way to indicate if your p-value is based on one tail or two is in the way you represent the alternative hypothesis.
If you are only using the upper tail for the p-value you are only testing for a warming effect, and your alternative hypothesis should be: HA: μ > 98.6° If you are testing for an effect of brandy on body temperature, and are interested in finding either an increase or decrease in body temperature, then your alternative hypothesis should be: HA: μ ≠ 98.6° If you were only testing for a reduction in body temperature the alternative hypothesis would be HA: μ < 98.6°.
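The arithmetic in the steps above can be sketched in plain Python. The helper names `t_pdf` and `upper_tail` are hypothetical, and the tail area is approximated by Simpson's rule rather than read from a t-table or statistical software, so treat this as an illustration under those assumptions.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T >= t) for t > 0: 0.5 minus the Simpson's-rule area from 0 to t."""
    h = t / steps  # steps must be even for Simpson's rule
    area = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - area * h / 3

n, s = 20, 0.3          # sample size and standard deviation from the card
diff = 0.1              # sample mean minus null value (98.7 - 98.6)
se = s / math.sqrt(n)   # standard error
t_obs = diff / se       # observed t-value
p_one = upper_tail(t_obs, n - 1)   # one-tailed p-value, df = 19
p_two = 2 * p_one                  # two-tailed p-value (symmetry)

print(round(se, 3), round(t_obs, 2), round(p_one, 3), round(p_two, 3))
```

The one-tailed value goes with HA: μ > 98.6°, and the doubled value goes with HA: μ ≠ 98.6°.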
Anderson-Darling Test
AD is a hypothesis test of the "goodness of fit" of data to the normal curve. A bad fit means the data points fall too far from where a normal distribution would put them. If p is high (p ≥ .05), the data are considered normal: RETAIN THE NULL HYPOTHESIS. If the fit is not good (p < .05): REJECT THE NULL HYPOTHESIS.
One sample hypothesis test
Compare the mean of the measured body temperatures to 98.6° (average normal core body temperature for humans - the body temperature we would expect if brandy has no effect on body temperature). We would draw a conclusion based on these decision rules: If mean body temperature is higher than 98.6°, conclude that drinking brandy increases core temperature. If mean body temperature is equal to 98.6°, conclude that drinking brandy doesn't affect core temperature. If mean body temperature is less than 98.6°, conclude that drinking brandy decreases core temperature. Means exactly equal to 98.6° don't happen very often, but means only rarely go as low as 98.5° or higher than 98.7°. This suggests a solution to our problem - even though a mean that is exactly equal to 98.6° isn't likely, we can change our decision rules to reflect the fact that small differences can happen by chance. If we can characterize the range ... The question is, then, how DIFFERENT from 98.6° would our sample mean have to be to conclude that brandy has an actual effect on core temperature? To answer that question we need *inferential statistics*, which are methods that allow us to draw conclusions about a population based on a sample of data. This week we will learn how to do a type of *null hypothesis significance test* (NHST), called a *ONE-SAMPLE T-TEST*, to determine if our sample mean of x̄ = 98.7° is different enough from 98.6° to conclude that drinking brandy increases body temperature.
checking data normality probability plot, histogram ,
GRAPHING: plot your data. Is it a bell curve? If so, it is normal. For histograms, look at the bins! Look at the *long-run behavior*: the behavior at the far edges of the graph, the far left and far right (think of a positive y=mx+b plot). data samples > sample size = normal distribution. The best graphical method to use with smaller sample sizes is the *NORMAL PROBABILITY PLOT*. Normally distributed data fall along the diagonal line, inside the limits drawn on either side of it (those are like 95% confidence intervals); if more than 5% of the points fall outside, the data are not normally distributed. We also assess with the AD TEST!! It is a hypothesis test that gives a cutoff to determine if the data are normally distributed: if our data are too wonky, p < .05 = not normal.
example of probability: smoking and health
Health effects of smoking? We know it is not healthy; it is associated with high blood pressure, cancer, gum disease, etc. But what does it mean that smoking causes high blood pressure? The cause and effect cannot be *deterministic*: that would be saying that if someone smokes they will definitely get high blood pressure, which is not accurate because some people smoke and do not get high blood pressure, and some people who do not smoke do get high blood pressure. What we want to know is the probability of having high blood pressure if you smoke.
Inferences are based on hypotheses
In the sciences, we base our conclusions on tests of hypotheses. A *hypothesis* is simply a possible explanation for some phenomenon. In inferential statistics we will call these sorts of statements of the way we think a system is working scientific hypotheses, although sometimes we will simply refer to the *scientific hypothesis* as the question we are trying to answer. *Null hypothesis*: at a population level, average body temperature (μ) for people drinking brandy is 98.6°. H0: μ - 98.6° = 0, or μ = 98.6°. *Alternative hypothesis*: at a population level, average body temperature (μ) is not equal to 98.6° for people drinking brandy. HA: μ ≠ 98.6°. The null hypothesis is about the POPULATION MEAN, not about the sample mean.
calculating probabilities
NASA doesn't accept those who are shorter than 149 cm or taller than 193 cm, maybe because of how the rockets are set up, etc. Q: what percentage of people are too tall? What percentage of people are too short? We use a normal curve with a specified mean of 175 cm for men and 163 cm for women, or use the z-distribution.
product rule for *joint* ("and") probabilities
Q: what is the probability of tossing 2 coins and getting an H on both? Symbolically: P(H coin 1 and H coin 2). In the table, find the cell you are trying to determine, then look at its row margin and its column margin for the probability of each outcome. The probability that *TWO INDEPENDENT events* will both occur is the product of the marginal probabilities of each event: P(H coin 1) = 1/2, P(H coin 2) = 1/2, so P(H coin 1 and H coin 2) = 1/2 x 1/2 = 1/4
successful predictions of low event probability
The total solar eclipse at Salem, Oregon was predicted exactly by year, month, day, time, and minute. Probability of predicting a total eclipse correctly: year = 1/375 ... minute = 1/197,100,000. A probability that low does not mean the event is never going to occur.
Test for normality
See SHAPE with HISTOGRAM: graph -> histogram -> simple; all numeric values must go in graph variables; in the multiple variables tab choose "in separate panels of same graph" = 3 histograms with 3 numeric values.
NORMALITY TEST WITH NORMAL PROBABILITY PLOT: graph -> probability plot -> single -> 3 numeric values in graph variables -> multiple graphs on multiple variable tab -> in separate panels of same graph -> in by variables tab add sex to the "by variables with groups on separate graphs" -> 2 sets for each sex.
IF VARIABLES ARE NORMAL: p > 0.05 = pass AD, normal; p < 0.05 = fail AD, not normal.
AVERAGE AND S.D OF VARIABLE THAT PASSED A.D TEST: display descriptive statistics.
GRAPH DISTRIBUTIONS FOR MALE AND FEMALE FOR VARIABLE PASSING A.D TEST: graph -> probability distribution plot -> 2 distributions -> set both distributions to normal -> use female mean and standard deviation for 1, male for the other -> graph by clicking ok -> right click -> add -> reference lines -> add female and male means (separated by a space) -> show lines at x-value.
CALCULATE IF ONE SEX'S VALUES ARE GREATER THAN THE OTHER'S: select probability distribution plot -> view probability -> the distribution tab allows you to specify the mean and standard deviation. If we want to see what proportion of females have a greater diameter than the men's mean, input the female data and make sure it is set to normal distribution -> shaded area tab -> x values for "define shaded area by" -> right tail shaded option -> in the x value box put the male mean -> click ok.
Potential for improvement of the triple test
Since the probability we're looking at, P(DS | Positive), is based just on the Positive column, we can focus on the two ways that we can get positive test results. We have two rates that indicate how well the test is working, sensitivity and false positive rate. We will try changing each of them to see how P(DS | Positive) changes - we want P(DS | Positive) to be as BIG as possible, so if changing sensitivity or false positive rate increases P(DS | Positive) we are improving the performance of the test. Start with *sensitivity* . Being right only 60% of the time leaves lots of room for improvement. The best sensitivity possible would be 1, which means the disease is always detected when it's present. Change the test sensitivity to 0.7, 0.8, 0.9, and 1 and record P(DS | Positive) at each sensitivity. Once you're done with the sensitivities, change sensitivity back to 0.6. Next, try improving *false positive rate* . The best false positive rate we could have is zero, which means the test is never positive for No DS pregnancies. Change false positive rate to 0.04, 0.03, 0.02, 0.01, and 0, and for each one record the P(DS | Positive) on your worksheet. You should see that, even though we could only improve false positive rate by 5%, it has a much bigger effect on P(DS | Positive) than improving sensitivity does
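A minimal sketch of the two sweeps, assuming the general-population DS rate of 1 in 1000 and starting values of 0.6 for sensitivity and 0.05 for false positive rate. The function name `p_ds_given_positive` is hypothetical and stands in for the worksheet's table.

```python
def p_ds_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(DS | Positive) = true positives / (true positives + false positives)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

prevalence = 1 / 1000   # assumed: general-population DS rate

# Improve sensitivity while holding false positive rate at 0.05
for sens in (0.6, 0.7, 0.8, 0.9, 1.0):
    print("sensitivity", sens, round(p_ds_given_positive(prevalence, sens, 0.05), 4))

# Improve false positive rate while holding sensitivity at 0.6
for fpr in (0.05, 0.04, 0.03, 0.02, 0.01, 0.0):
    print("false positive rate", fpr, round(p_ds_given_positive(prevalence, 0.6, fpr), 4))
```

Because almost all pregnancies are No DS, false positives dominate the Positive column, so shrinking them moves P(DS | Positive) far more than adding a few true positives does.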
why do we use probability?
Statistics always deals with random variables. We can only make probability statements about random variables. For example, in a coin toss, we do not know if it is going to be heads or tails, but we know there is a 50% chance of heads and a 50% chance of tails. That does not mean that after getting heads the next toss will be tails. Each toss is random, and on any toss heads and tails have an equal chance of occurring.
how to do an experiment with data (how we should start)
We start with the hypothesis! *Null hypothesis* = statement of no difference. Written as H0. So the null hypothesis would be H0: there is no difference between our distribution and the normal distribution. *Alternative hypothesis* = statement of difference. Written as HA. HA: there is a difference between our distribution and the normal distribution.
Why learn about the normal distribution
We used the *t-distribution* to find how many standard errors we had to go from our sample mean to have a 95% chance of containing the population parameter we were estimating , and to construct confidence intervals for estimated means You have already learned that biological variables, such as human height, can be thought of as *random variables* , because if we randomly pick a person from a population to measure we cannot know ahead of time exactly how tall he or she will be. *normal distribution* is a bell curve. the normal curve is smooth and symmetrical, whereas the human histogram is more uneven and not perfectly symmetrical around the mean. The *mean* is also referred to as the *location* of the distribution, because it determines where along the number line the curve is centered. The standard deviation is referred to as the scale of the distribution, because it determines the dispersion in the distribution. Normal distributions are continuous, smooth curves, with an area beneath them from -∞ to +∞ equal to 1. Probabilities from continuous probability distributions like the normal are always *areas under curve*
combining probabilities
A die has a total of 6 sides. Probability of rolling a 3, 4, 5 or 6? These outcomes are mutually exclusive, so we add their probabilities: 1/6 + 1/6 + 1/6 + 1/6 = 4/6. The probability of not rolling one of those = 1 - 4/6 = 2/6
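The same addition and complement steps, sketched with exact fractions and assuming a fair six-sided die. Note that `Fraction` reduces 4/6 to 2/3 and 2/6 to 1/3 automatically.

```python
from fractions import Fraction

faces = range(1, 7)      # a fair six-sided die
event = {3, 4, 5, 6}     # outcomes we are interested in

# Mutually exclusive outcomes: add the probability of each face in the event
p_event = sum(Fraction(1, 6) for face in faces if face in event)
p_not_event = 1 - p_event    # complement rule

print(p_event, p_not_event)
```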
trial
a single observation, execution of experiment with random outcome (ex: coin toss)
probability
The relative frequency of an event if we repeated a random trial over and over again under the same conditions. This is the frequentist definition of probability -> based on the frequency of occurrence of the event. Since probabilities are *proportions of total*, they MUST fall between 0 and 1.
mutually exclusive
events that CANNOT happen at the same time are *mutually exclusive*
less impressive: low probabilities already observed
For random trials, each possible outcome can be unlikely, but when conducting a trial one of those outcomes must happen. Ex: tossing a coin 10 times. Ten H's in a row seems unlikely, with a probability of about .001 (1/1024), but any other specific sequence of ten tosses has the same probability of about .001.
general product rule
If events are not independent, then P(A and B) is not P(A) x P(B). So the general product rule states P(A and B) = P(B) x P(A|B), or equivalently P(A) x P(B|A).
Probability from a z-table
A z-table gives *upper-tail probabilities* for the standard normal distribution. Find the first two digits of the z-value (the whole number and the first decimal place) in the rows of the first column. For a z-value of 1.50, the right row is 1.5. Find the third digit of the z-value (the second decimal place) in the column headings. For a z-value of 1.50, the correct column is 0. Our probability of 0.067 tells us to expect 6.7% of this population of women to be 69.5 inches or taller. Note that this table starts at 0.00, and doesn't include any negative numbers. Negative z-values are just values below the mean, so this table seems to only be capable of giving a) upper tail probabilities for b) values above the mean. How do we find a *lower-tail probability*? Remember that the normal curve is symmetrical, which means that the probability of being below -1.5 standard deviations is the same as the probability of being above 1.5 standard deviations, which we already found to be 0.067. A height of 61.7 inches is 1.5 standard deviations below the mean, so the probability of being shorter than 61.7 inches would also be 0.067.
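The symmetry argument can be checked numerically with a short sketch. The mean of 65.6 inches and standard deviation of 2.6 inches are assumptions: they are the values that put 69.5 and 61.7 inches at z = 1.5 and z = -1.5.

```python
from statistics import NormalDist

heights = NormalDist(mu=65.6, sigma=2.6)   # assumed mean and SD, in inches

upper = 1 - heights.cdf(69.5)   # P(height >= 69.5), z = +1.5
lower = heights.cdf(61.7)       # P(height <= 61.7), z = -1.5

print(round(upper, 3), round(lower, 3))
```

Both tails have the same area, which is why a table listing only non-negative z-values is enough.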
why normal distribution matters
It is a good mathematical model for many continuous numeric variables (numbers with decimal points), and it can be a good model for discrete variables (whole numbers). We use the normal distribution to calculate probabilities of outcomes for continuous variables, because it gives us a probability distribution with known values to consult.
mistakes in interpreting low probabilities
Low probabilities are unlikely, not impossible. Ex: Americans have a low probability of being members of Congress. Alexandria is a member of Congress, so she is not an American citizen: INVALID. Even if a probability is small, we cannot treat it as 0.
marginal probabilities
marginal = in margins row marginals = probability of flipping heads or tails on any of the coins
location and scale of normal distribution
normal curves are defined by these parameters : location = mean scale = standard deviation
When should you not use a normal curve to represent your data? And how do you know?
Only use the normal distribution if it matches the data. *Normal probability plots* compare how the data actually are distributed against how they would be distributed if they were from a normal distribution. The slightly curved lines on either side of the straight diagonal line are *prediction limits*, which are like confidence intervals. As long as no more than 5% of the red dots fall outside of the prediction limits, the data are considered a sufficiently good match to the normal distribution to use it to represent our data. The *Anderson-Darling* test measures how much difference there is between the observed data and what would be expected if the data were normally distributed, and produces the AD statistic from these differences. Perfectly normal data will have no difference between observed and expected values, and will have an AD value of 0. AD values get bigger as data diverge from normal. The AD test uses the AD statistic to get a *p-value*. The p-value is the probability of obtaining an AD statistic at least as large as the one observed if the data are actually normally distributed. Small differences between observed and expected values will happen by chance, and thus the p-value generated from a small AD value will be large. If the *p-value* is LESS than 0.05, you fail the test - your data are not normally distributed. If the *p-value* is GREATER than or equal to 0.05, you pass the test - you can consider your data to be normally distributed.
how do we know the probability of an event?
p = x/n, where x = number of ways an event can occur and n = number of possible outcomes. Example, coin tossing: there are 2 possible outcomes, heads or tails, so n = 2. Heads can occur in only 1 way, so x = 1. Heads has a probability of 1/2, same for tails. For observed data (*empirical probabilities*), use the RELATIVE FREQUENCIES OF CATEGORICAL VALUES: x = frequency, n = total sample size.
event
particular outcome (ex: heads on a coin toss)
what does probability = 0 and probability = 1 mean?
Probabilities fall between 0 and 1. 0 = outcome cannot happen. Ex: flipping a coin and getting both heads and tails, or measuring a negative length, is p=0. 1 = outcome must happen. Ex: flipping a coin and getting heads or tails, or measuring a length equal to or greater than 0, is p=1.
should we use normal curve as a model for our data ?
A probability distribution describes long-run behavior. Even if the shape of our sample is a little wonky, we can use the normal curve as long as we know the long-run behavior approximates a normal distribution.
relative frequencies as probabilities
r.f. = proportions = part / total. Example: causes of death of teenagers in 1999. Any person can fall into only one category, so *relative frequencies* are *empirical probabilities*. Empirical probability = probability that event A will occur: P(A) = number of times the event actually occurs / number of times the experiment is performed. A listing of all possible outcomes and their probabilities is a *probability distribution*. Ex: 48% of teenage deaths in 1999 were from accidents, so the probability that a teenage death was from an accident is .48...
Finding probabilities by enumerating the possible outcomes
since having a coin be heads or tails is mutually exclusive, we can flip two coins and see the probability of one landing on heads and the other one on tails. *P*= number of ways an event can occur / number of possible outcomes In a punnett square for coin 1 and coin 2, HH , HT, TH, TT. 1/4 probability for each
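Enumeration can be sketched in a few lines, assuming two fair coins so that all four outcomes are equally likely. The helper name `probability` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All possible outcomes for coin 1 and coin 2: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

def probability(event):
    """Number of ways the event can occur / number of possible outcomes."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

p_both_heads = probability(lambda o: o == ("H", "H"))
p_one_of_each = probability(lambda o: set(o) == {"H", "T"})

print(p_both_heads, p_one_of_each)
```

One head and one tail can happen two ways (HT or TH), so its probability is 2/4, twice that of any single cell.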
how do you know two events are INDEPENDENT from each other?
the outcome of 1 does not affect the other we want data to be independent and random so it can have no bias.
conditional probability
The probability that an event, A, will occur given that we know that event B has already occurred. Written as *P ( A | B )*, where the vertical line = given. It is read as "the probability that A will occur, given that B has occurred". A is subject to random chance; B is already known. P ( A | B ) is a PROBABILITY of A, not of B. For example, "what is the probability of having high blood pressure given that you are a smoker?" The key word is given: given smoker, so we look only at the smoker column. Event A is people who have high BP. 37 of the 170 smokers have high BP, so P ( High BP | Smoker ) = 37/170 ≈ 0.22.
Triple test applied to the general population
the study population has a really high rate of DS; according to the P(DS) you calculated, the rate of DS in the population is 0.5. When the test is actually used on the general population the rate is much lower - and this turns out to make a huge difference in the answer to the patient's question. Let's see how huge, and why. In actual human populations, DS is a rare disorder, occurring in only 1 in 1000 pregnancies. We can see how making just this one change affects the probabilities we calculated in the previous step. Change the "DS rate:" to have a 1 for the first entry, and a 1000 for the second - this sets the DS rate to 1 in 1000, and you'll see that the table automatically updates to reflect this new rate. There will be a 1 as the marginal total for DS, a 999 as the marginal total for No DS, and 1000 as the grand total. The conditional probability P(DS | Positive) is telling us that we only need to consider the *positive test results* . There are two ways to get a positive test result - *true positives* (i.e. positives received by women who actually have a DS pregnancy) and *false positives* (i.e. positives received by women who have a No DS pregnancy), and the probability P(DS | Positive) is calculated as the proportion of positive test results that are TRUE POSITIVES. Changing the DS rate affects this probability because: The *relative number* of true positives decreases and The relative number of false positives increases With fewer true positives, and more false positives, the probability P(DS | Positive) goes way down.
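A sketch of the Positive column under both DS rates. The study counts of 100 DS and 100 No DS pregnancies are an assumption chosen to match a rate of 0.5, and the sensitivity of 0.6 and false positive rate of 0.05 are taken as the test's starting values; `positive_counts` is a hypothetical helper.

```python
def positive_counts(n_ds, n_no_ds, sensitivity=0.6, false_positive_rate=0.05):
    """Expected true positives and false positives in the Positive column."""
    true_pos = n_ds * sensitivity
    false_pos = n_no_ds * false_positive_rate
    return true_pos, false_pos

results = {}
for label, n_ds, n_no_ds in (("study", 100, 100), ("general", 1, 999)):
    tp, fp = positive_counts(n_ds, n_no_ds)
    results[label] = tp / (tp + fp)   # P(DS | Positive)
    print(label, round(tp, 2), round(fp, 2), round(results[label], 4))
```

With only 1 DS pregnancy per 1000, the expected number of true positives is below 1 while false positives number about 50, which is what drags P(DS | Positive) down.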
test of independence
The test determines whether or not the marginal probability of having high blood pressure is the same as the conditional probability of having high blood pressure given that you are a smoker. Same = independent. At the opposite extreme, if high blood pressure were completely dependent on smoking, everyone who smokes would have high blood pressure: P ( High BP | Smoker ) = 170/170 = 1
Probabilities based on the "study data"
to calculate the *marginal probability* - A marginal probability is the total number of times that DS occurred, divided by the total number of women in the study. *inverse probability* of being a DS patient if you receive a Positive test result, or P(DS | Positive) - To do a *conditional probability*, you need to set a "Given", which is the outcome that comes after the vertical line - the given is already known to have occurred, and we are not calculating its probability. Then, set the outcome whose probability you are calculating as "Conditional", which is DS. You'll see that when you set Positive as "Given" its entire column is shaded gray to indicate that we only need to consider the data in the "Positive" column if we know already that we've received a positive test result. The conditional probability of DS given a positive result is the number of times the pregnancy actually was DS after a positive test result was obtained, divided by the number of positive test results (60/65). *sensitivity* = P ( Positive | DS ) Sensitivity is a standard measure of how well a screening test is working Another standard measure of effectiveness is the *false positive rate* - The false positive rate is the probability that a patient that is known not to have the condition will incorrectly test positive. This too is a CONDITIONAL probability, P(Positive | No DS), and the vertical line tells us we only need to pay attention to the No DS row, because we know the patient doesn't have a DS pregnancy. Sensitivity and false positive rate are *classic researcher's questions*, because they both require us to already know the DS status of the patient. When a patient takes the triple test, she will know the test result but will not know her actual DS status (if she knew that, the screening test would be unnecessary). 
The probability you calculated in step 2 to test for independence between DS status and test result is a *patient's question*, namely: if I get a positive test result, what is the probability that I have a DS pregnancy? If you think about what the patient's question is asking, what is the *given*? What is it the probability of (DS status, or test result)? She knows her test result, positive, so that's the given. We are calculating the probability of the outcome that ISN'T KNOWN YET, which is the probability that the pregnancy is DS.
independent variables
We want our data to be independent from each other. Is there any effect of smoking on high BP? If high BP is independent of smoking, meaning smoking doesn't increase the chance of having high BP, the probability of high BP among smokers would be the same as the probability of high BP in the whole group.
joint probability example
What is the probability of picking someone randomly and them being a smoker and having high blood pressure, or P (Smoker and High BP)? If the two are independent, the multiplication rule applies: P (High BP) = 220 (total with high BP) / 1000 (total people) = 0.22; P (Smoker) = 170 (total smokers) / 1000 (total people) = 0.17; 0.22 x 0.17 ≈ 0.037. Empirical: 37 (high BP and smoker) / 1000 (total people) = 0.037
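The joint, marginal, and conditional probabilities for this table can be sketched together, using only the four counts given above. Variable names are illustrative.

```python
total = 1000      # everyone in the table
high_bp = 220     # marginal total: high blood pressure
smokers = 170     # marginal total: smokers
both = 37         # cell: smoker AND high blood pressure

p_high_bp = high_bp / total               # marginal probability
p_smoker = smokers / total                # marginal probability
p_joint_product = p_high_bp * p_smoker    # product rule, valid if independent
p_joint_empirical = both / total          # joint probability from the cell
p_high_bp_given_smoker = both / smokers   # conditional probability

print(round(p_joint_product, 4), p_joint_empirical, round(p_high_bp_given_smoker, 4))
```

The conditional probability (about 0.218) is nearly equal to the marginal probability (0.22), and the product of the marginals nearly matches the observed cell, which is what independence looks like in this table.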
normal distributions
For the standard normal, x-axis = units of STANDARD DEVIATION, centered at the mean (mu). Within about two standard deviations of the mean = about 95% of data. Area under the distribution corresponds to the probability of the EVENT OCCURRING.
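A short sketch of the "two standard deviations" rule on the standard normal, using Python's standard library. The exact multiplier for 95% is slightly under 2, which is why the rule is stated as approximate.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# Area within k standard deviations of the mean
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, area in within.items():
    print(k, round(area, 4))

# z-value that leaves 2.5% in each tail, i.e. exactly 95% in the middle
z_95 = std_normal.inv_cdf(0.975)
print(round(z_95, 2))
```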
