Stats Test 1
What is the general rule for choosing the classes (bin width) of a histogram?
# of bins = (square root) of the # of observations
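A minimal sketch of the square-root rule in Python. Rounding up to a whole number of bins is an assumption here (the card does not say how to round), and `sqrt_rule_bins` is a hypothetical helper name.

```python
import math

def sqrt_rule_bins(n_observations):
    """Number of histogram bins under the square-root rule.

    Rounding up is an assumption; the rule itself only says
    bins = sqrt(number of observations).
    """
    return math.ceil(math.sqrt(n_observations))

bins_for_100 = sqrt_rule_bins(100)  # sqrt(100) = 10
bins_for_50 = sqrt_rule_bins(50)    # sqrt(50) is about 7.07, rounded up
```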
IQR
Q3-Q1
Binomial coefficient for lottery example?
(4 choose 1) = 4!/(1!(4-1)!) = 4
Suppose you have data for many individuals on their walking speed and their heart rate after a 10-minute walk. Use this information to answer the next two questions. (a) When you make a scatterplot, the explanatory variable on the x axis (b) You expect to see
(a) should be walking speed (b) a positive association
What is the formula to find the location of the median?
(n+1)/2, the position of the median in the ordered data
Tree Diagrams
(or decision trees) will help us to solve probability problems involving conditional probabilities.
Continuous variables
Quantitative Variable; can take on any real numerical value over an interval(i.e. time, distance). Example Stat 302 Survey: weight (lb)
what does a modified boxplot display?
5 number summary and suspected outliers (if they exist)
Causation
A cause and effect relationship in which one variable controls the changes in another variable. Association is NOT causation
Ac
Complement of A: not A, the opposite of A
TYPES OF VARIABLES QUESTIONS!
LOOK BELOW
is standard deviation resistant to outliers?
Mean is not resistant to outliers, hence standard deviation is not resistant to outliers.
Symmetric Distribution
Data that are evenly distributed.
Binomial Distribution is a ______ probability distribution?
Discrete
What are the types of Sample Space?
Discrete Sample Space Continuous Sample Space
A and B are two events, they could be....
Disjoint but not independent Not disjoint but independent Neither disjoint nor independent
There are two primary tests used by college admissions, the SAT and the ACT. Suppose you score 650 on the SAT math, which has mean=500 and s.d.=100, and 30 on the ACT math, which has mean=21 and s.d.=4.7. How can we compare these scores to tell which score is relatively higher?
First compute the z-score for each test result. SAT: z = (650-500)/100 = 1.5. ACT: z = (30-21)/4.7 = 1.91. Interpretations of the z-score: • Your SAT score of 650 is 1.5 standard deviations higher than the mean SAT score. • Your ACT score of 30 is 1.91 standard deviations higher than the mean ACT score. • Since your z-score is greater for the ACT, you performed relatively better on this exam.
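The comparison can be checked with a few lines of Python. This is only a sketch; `z_score` is a hypothetical helper, and the means and standard deviations are the ones stated in the card.

```python
def z_score(x, mean, sd):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / sd

sat_z = z_score(650, mean=500, sd=100)
act_z = z_score(30, mean=21, sd=4.7)

# The larger z-score marks the relatively better result.
better_test = "ACT" if act_z > sat_z else "SAT"
```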
Discrete probability distributions
For a discrete random variable, its probability distribution (also called the probability distribution function) is any table, graph, or formula that gives each possible value and the probability of that value
Columns of Histogram
For each class on the horizontal axis, draw a column. The height of the column represents the count (or percent) of data points that fall in that class interval.
how do you interpret scatterplots (4)
Form or Trend Direction Strength Outliers
In each of the following situations, is it more reasonable to simply explore the relationship between the two variables or to view one of the variables as an explanatory variable and the other as a response variable? In the latter case, which is the explanatory variable and which is the response variable? Are they categorical or quantitative (quantitative means "numerical")? The weight in kilograms and height in centimeters of a person.
Height: explanatory, quantitative. Weight: response, quantitative.
What charts/graphs does distribution apply to?
Histogram Time Plots Box Plots
Is the IQR resistant to outliers?
IQR IS resistant to outliers
Rule Three
If two events have no outcomes in common (A and B are disjoint events): Addition rule for disjoint events: the probability that one or the other occurs is the sum of their individual probabilities. P(A∪B) = P(A) + P(B). The probability that both occur together is zero: P(A∩B) = 0
z-score what is the equation?
If x is an observation from a distribution that has mean μ and standard deviation σ, the value of x in standard units is z = (x − μ)/σ
What are the three elements of Shape?
Mode, Symmetry, Skewness
Compound Event
More than one outcome
What are the types of Categorical Variables?
Nominal Ordinal
Identify which number is the statistic and which is the parameter. In the 2000 census of the United States, it was determined that in the State of Alabama, the average household earnings were 31,934 dollars. In a random sample of 50 households in Alabama taken last year, the average household earnings were slightly higher at 32,002 dollars.
Parameter: 31,934 dollars, Statistic: 32,002 dollars
Direction of association
Positive Direction: As one goes up, the other goes up also. Negative Direction: As one goes up, the other goes down. No Direction
Probability
Probability is the study of randomness.
Probability
Probability = number of times the outcome occurred / total number of trials
How to report correlation? Bad vs good Raising salaries increases productivity or Employees with higher salaries tend to be more productive.
Raising salaries increases productivity -- Bad Employees with higher salaries tend to be more productive -- Good
What does random mean?
Random denotes things that are unpredictable or not deliberate. Random events are not perfectly predictable, but in the long term they have certain regularities that can be described and quantified using probability.
The performance of a diagnostic test for a given disease can be defined by two properties:
Sensitivity Specificity
What are the types of events?
Simple Event Compound Event
Phytopigments are a marker of the amount of organic matter that settles in sediments. Phytopigment concentrations in deep-sea sediments collected worldwide showed a very strong right-skew. Of these two summary statistics, 0.009 and 0.016 grams of phytopigment per square meter of bottom surface, which one is the mean and which one is the median and why?
Since the distribution is strongly right-skewed, we expect the mean to be larger than the median. Therefore 0.016 is the mean and 0.009 the median.
What skew will a histogram of age at death have?
Skewed left: few people die at a young age, and deaths become more frequent as age increases.
What skew will a histogram of income have?
Skewed right: most incomes are relatively low and few people are rich, so the rich form the long right tail.
how can you tell if a boxplot is skewed?
In a skewed distribution, one whisker will be much longer than the other.
Specificity
Specificity =P(negative test | no disease) = P(T-|Dc) The condition is not present and the test correctly identify the condition is not present. (True Negative)
How would two histograms be different, if using the same set of data?
The number of bins/classes
How does standard deviation measure spread?
Standard deviation measures spread by looking at how far the observations are from their mean.
X: blood cholesterol levels of men aged 55 to 64. X~N(222, 37) What percentage of men have cholesterol levels of less than 230 mg/dl? P(X<230)=?
Step 1: Find the z-score: (230-222)/37 = .2162. Step 2: Use the z-table (standard Normal table) to obtain the percentage: P(Z<.21) = .5832 (the unrounded z gives about .586). Step 3: The percentage of men with cholesterol levels less than 230 is the same as the percentage of observations in the standard Normal distribution that lie below the z-score: P(X<230) = P(Z<.2162) ≈ .5832 from the table, or about 58%.
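The same probability can be computed without a table using `statistics.NormalDist` from the Python standard library. This is a sketch that reads the 37 in N(222, 37) as the standard deviation, as the worked steps do; the exact value differs slightly from a table lookup because z is not rounded.

```python
from statistics import NormalDist

cholesterol = NormalDist(mu=222, sigma=37)

z = (230 - 222) / 37                  # Step 1: z-score
p_via_z = NormalDist().cdf(z)         # Step 2: standard Normal probability
p_direct = cholesterol.cdf(230)       # Step 3: same answer on the original scale
```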
Independence
The outcome of a trial is not influenced by the outcomes of the previous trials.
Variance
The average of the squared differences from the Mean.
Correlation (r)
The correlation is a measure of the direction and strength of a linear relationship between two quantitative variables
Lower case letters (x, y, z) refer to?
refer to particular values taken by random variables
Population
The entire group of subjects (not necessary people) about which we want information
Form or trend of scatterplot
The form of the relationship between 2 quantitative variables refers to the overall pattern.
(Shape) Symmetry
The histogram for a symmetric distribution will look the same on the left and the right of its center.
Identical probabilities
The probabilities for each outcome must remain the same for each trial.
Probability of 0 means
The probability that a continuous r.v. takes a specific single value equals 0. P(X = 45)=0
Rule Four
The probability that an event does not occur is 1 minus the probability that the event does occur P(Ac) = 1-P(A)
Is range resistant to outliers?
The range is not resistant to outliers.
What does the range measure?
The range measures how spread out the data values are
Horizontal Axis of Histogram
The range of values that the quantitative variable takes is divided into equal-size intervals, or classes.
What are the variables?
There are 2 variables. They are "Weight" and "Parents Susceptibility to pyrethroid insecticide".
In each of the following situations, is it more reasonable to simply explore the relationship between the two variables or to view one of the variables as an explanatory variable and the other as a response variable? In the latter case, which is the explanatory variable and which is the response variable? Are they categorical or quantitative (quantitative means "numerical")? A person's leg length and arm length in centimeters.
There is no reason to view one or the other as explanatory: both are quantitative.
categorical data
These place an individual into one of several groups or categories.
Capital letters (X, Y, Z) are used to?
represent random variables
Disjoint
Two events are mutually exclusive (also called disjoint events) if they have no outcomes in common and so can never occur together.
Multiplication rule for finding P(A∩B)
Using the definition of the conditional probability, for events A and B, the probability that A and B both occur: P(A∩B) = P(A|B) * P(B); also P(A∩B) = P(B|A) * P(A)
on a density curve x can take what?
X can take any value in the interval [0,∞)
What is an original boxplot?
an original boxplot is a graphical display of the five number summary
Two main types of random variables:
discrete continuous
Because r uses the standardized values of the observations, r ______ when we change the units of measurement of x, y, or both
does not change
probability is limited to ________
events for which we can specify all possible outcomes.
The relationship between two variables is often due to both variables being...
influenced by other variables lurking in the background.
statistic
is a number that can be computed from the sample data without making use of any unknown parameters. In practice, we use a statistic to estimate an unknown parameter. A statistic would almost certainly take a different value if another sample from the same population is chosen.
0,0,4,0,0,0,10,0,6,0 mean? median? mode?
mean -- 2 median -- 0 mode -- 0
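These three summaries can be checked with the Python standard library `statistics` module; a short sketch using the ten values from the card:

```python
import statistics

data = [0, 0, 4, 0, 0, 0, 10, 0, 6, 0]

data_mean = statistics.mean(data)      # sum of values / number of values
data_median = statistics.median(data)  # middle of the sorted values
data_mode = statistics.mode(data)      # most frequent value
```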
what diagram is used to study the association of two variables?
scatterplot
Sample Space
The set of all possible outcomes of a random phenomenon.
Assessed value of houses in a large city
skewed right
Empirical probability
the long run proportion
Cyclical Variations
variations with some regularity over time
Design
what the plan is for the experiment The sample, the randomization (random assignment of treatment), and the plan to obtain percentages of each group that have heart attacks.
If X has the binomial distribution with n trials and probability p of success on each trial. X~Bin(n,p) the possible values of x are __________?
{0, 1, 2, ..., n}. If k is any one of these values, the probability of k successes is given by the binomial probability formula: P(X=k) = (n choose k) p^k (1-p)^(n-k)
in binomial distribution μ=?
μ= np
Probability of 1 means
The interval containing all possible values has probability equal to 1: P(0<X<∞)=1
What are all the values that a correlation r can possibly take?
−1 ≤ r ≤ 1
Median
"middle point of the data" The median is the number that divides the (ordered) data in half
r is always between -1 and 1
-At -1 or 1, there is a perfect straight line relationship. -The closer to -1 or 1, the stronger the relationship. -The closer to 0, the weaker the relationship.
According to the Children's Oncology Group Research Data Center at the University of Florida, 20% of children with neuroblastoma (a form of brain cancer) undergo surgery rather than the traditional treatment of chemotherapy or radiation. The surgery is successful in curing the disease 95% of the time. (Explore,Spring 2001.) Consider a child diagnosed with neuroblastoma. What are the chances that the child undergoes surgery and is cured?
.19
Based on data from the U.S. Center for Health Statistics, the death rate for males in the 15-24 age bracket is 114.4 per 100,000 population, and the death rate for females is that same bracket is 44.0 per 100,000 population. If two males in that age bracket are randomly selected, what is the probability that they both survive?
0.998
Let's play the lottery on 4 occasions and we consider each trial independent. What is the probability that we win on the first trial and lose on the remaining 3 trials? p = represent the probability of winning lottery 1-p = is the probability of losing lottery
1st trial: p 2nd: 1-p 3rd: 1-p 4th: 1-p A1: win only in 1st trial P(A1) = p (1-p) (1-p) (1-p) = p(1-p)^3, which answers the question. A2: win only in 2nd trial P(A2) = (1-p) p (1-p) (1-p) A3: win only in 3rd trial P(A3) = (1-p) (1-p) p (1-p) A4: win only in 4th trial P(A4) = (1-p)(1-p)(1-p)p P(Win Once) = P(A1) + P(A2) + P(A3) + P(A4) • P(Win Once) = 4 p(1-p)(1-p)(1-p) = 4 p (1-p)^3 ≈ 4(0.0000000001)
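A small Python sketch of why the four orderings add up to 4p(1-p)^3. The value of `p` below is hypothetical and chosen only for illustration, since the card does not state the win probability.

```python
import math

p = 0.001  # hypothetical win probability; not given in the card

# Probability of each specific sequence with exactly one win in 4 trials.
orderings = [
    p * (1 - p) * (1 - p) * (1 - p),  # win on trial 1 only
    (1 - p) * p * (1 - p) * (1 - p),  # win on trial 2 only
    (1 - p) * (1 - p) * p * (1 - p),  # win on trial 3 only
    (1 - p) * (1 - p) * (1 - p) * p,  # win on trial 4 only
]

win_once_by_sum = sum(orderings)
# Binomial coefficient (4 choose 1) counts the orderings.
win_once_by_formula = math.comb(4, 1) * p * (1 - p) ** 3
```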
How to report correlation? Bad vs good A child that has two educated parents will graduate from college. or Children whose parents are educated are more likely to graduate from college
A child that has two educated parents will graduate from college. -- bad Children whose parents are educated are more likely to graduate from college -- good
standard deviation
A computed measure of how much scores vary around the mean score.
(T-)
A negative test result states that the condition is not present
types of quantitative variables
Continuous variables can take on any real numerical value over an interval (i.e. time, distance). Example Stat 302 Survey: weight (lb) Discrete variables take on only certain fixed values, with no intermediate values possible. (i.e. number of car accidents, children, pets).
In each of the following situations, is it more reasonable to simply explore the relationship between the two variables or to view one of the variables as an explanatory variable and the other as a response variable? In the latter case, which is the explanatory variable and which is the response variable? Are they categorical or quantitative (quantitative means "numerical")? Inches of rain in the growing season and the yield of corn in bushels per acre.
Inches of rain during the growing season: explanatory, quantitative. Yield of corn: response, quantitative.
Properties of the standard deviation
It measures the spread of the data (a "typical" distance of the observations from the mean). Its value must be greater than or equal to 0. (Zero occurs only when all observations have the same value, otherwise, it's greater than 0.) It has the same units of measurement as the original observations. The variance has units that are squared. It is not resistant to outliers. Strong skewness or a few outliers can greatly increase its value. It works best for bell-shaped and symmetric distributions
When P(B) > 0, the conditional probability of A given B is:
P(A|B) = P(A∩B)/P(B)
Discrete Variables
Quantitative Variable; take on only certain fixed values, with no intermediate values possible. ex. number of car accidents, children, pets
we interpret probability in the __________ run
We will interpret the probability of an outcome to represent long-run results
Simple Event
When one outcome
what is the notation for binomial distribution?
X ~ Binomial(n,p)
what are the parameters for binomial distribution?
X ~ Bin(n,p) Mean: μ = np Standard deviation: σ = sqrt(σ^2) = sqrt(np(1-p))
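A brief Python sketch of the two formulas. The values n = 20 and p = 0.10 are borrowed from the depression example elsewhere in these cards and are used here only as an illustration.

```python
import math

n = 20    # number of trials (illustrative)
p = 0.10  # probability of success on each trial (illustrative)

binomial_mean = n * p                    # mu = np
binomial_variance = n * p * (1 - p)      # sigma^2 = np(1-p)
binomial_sd = math.sqrt(binomial_variance)
```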
Mean
a.k.a arithmetic mean average or center of gravity A measure of center in a set of numerical data
How do you calculate the mean
add all the values then divide by the number of observations
Uniform Distribution: Y can take?
any value in the interval (0,1)
what is the diagnostic test?
application of Bayes' Theorem
Spurious correlation
association between two variables is because both are related to a third variable omitted from consideration
each interval on density curve has probability of what?
between 0 and 1
what is the name for two variables? three?
bivariate multivariate
What do boxplots not show?
boxplots do NOT provide any info about the existence or location of modes (most often data)
how is a continuous probability distribution specified?
by the density curve
Center
center of mass or midpoint
What graphs for quantitative variables?
histogram time plots boxplots
lurking variable
is a variable that is not among the explanatory or response variables
For each of the following variables, indicate whether you would expect its histogram to be symmetric, skewed to the right, or skewed to the left.
look below
is mean resistant to outliers? median
mean -- no median -- yes
what measures the strength of the linear relationship?
the linear correlation coefficient "r"
normal distribution center is?
the mean (μ)
what does SD measure?
the spread of the data
Strength of the association
The strength of the relationship between 2 quantitative variables refers to how much variation, or scatter, there is around the overall pattern. The stronger the relationship, the tighter the points cluster around the line.
bimodal histogram?
two "high points"
how many outcomes can binomial distributions have?
two -- success or fail
Independent Events vs Disjoint Events
two disjoint events can never be independent, except in the trivial case that one of the events is null.
Finding normal probabilities
use cumulative probability
σ2
variance population
Which of the following statements about the correlation, r, between Y and X is NOT true? r measures the strength of the linear association between Y and X. r does not provide any information about the strength of non-linear relationships. Calculating r is not appropriate if the distributions of Y and X are skewed with outliers. r = 0 means there is no relationship between Y and X.
r = 0 means there is no relationship between Y and X.
how do you find mean of histogram?
Multiply the number of observations in each bin by the bin's x value, add the products, and divide by n.
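A sketch of that calculation in Python. The bin values and counts are made up for illustration, and using each bin's midpoint as its x value is an assumption.

```python
# Hypothetical histogram: bin midpoints and how many observations fall in each.
bin_midpoints = [5, 15, 25]
bin_counts = [2, 5, 3]

n = sum(bin_counts)
# Weight each bin's x value by its count, then divide by n.
weighted_total = sum(x * count for x, count in zip(bin_midpoints, bin_counts))
histogram_mean = weighted_total / n
```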
Multiple myeloma is a cancer of the bone marrow currently without effective cure. It affects primarily older individuals: It is almost never diagnosed in individuals under 40 years old, and its incidence rate (the number of diagnosed malignant cases per 100,000 individuals in the population) is highest among individuals 70 years of age and older. The distribution of incidence rate of multiple myeloma by age at diagnosis is....
Skewed left
Nominal variables
categorical; purely qualitative and unordered ex. car color, gender
the larger the SD, the ______ the variability of the data
greater
when the histogram is skewed right, the mean and median are?
mean is larger than median
what are the measures of location?
mean/median -- measure of center skewness
Mu --
stands for mean on normalized curves
Example 5.15: Mammography screening Breast cancer occurs most frequently among older women. Of all age groups, women in their 60s have the highest rate of breast cancer. The National Cancer Institute (NCI) estimates that 3.65% of women in their 60s get breast cancer. Mammograms are X-ray images of the breast used to detect breast cancer. A mammogram can typically identify correctly 85% of cancer cases and 95% of cases without cancer. 1. If a woman in her 60s gets a positive mammogram, what is the probability that she indeed has breast cancer?
(.85)(.0365) / [(.85)(.0365) + (.05)(.9635)] = .392
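The Bayes' theorem calculation can be laid out step by step in Python. This is a sketch using the figures from the example; the variable names are illustrative.

```python
prevalence = 0.0365   # P(cancer) for women in their 60s
sensitivity = 0.85    # P(positive | cancer)
specificity = 0.95    # P(negative | no cancer)

true_positive = sensitivity * prevalence
false_positive = (1 - specificity) * (1 - prevalence)

# P(cancer | positive) = true positives / all positives
p_cancer_given_positive = true_positive / (true_positive + false_positive)
```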
Degree of freedom
(statistics) an unrestricted variable in a frequency distribution df=n-1
Enzyme immunoassay tests are used to screen blood specimens for the presence of antibodies to HIV, the virus that causes AIDS. Antibodies indicate the presence of the virus. The test is quite accurate but is not always correct. Here are approximate probabilities of positive and negative test results when the blood tested does and does not actually contain antibodies to HIV: Antibodies present w/ positive test -- .9916 Antibodies present w/ negative test -- .0084 antibodies absent w/ positive test -- .0047 antibodies absent with negative test -- .9953 Suppose that 1% of a large population carries antibodies to HIV in its blood. Draw a tree diagram with this information and use it to answer the following question. 1. What is the probability that the test is positive for a randomly chosen person from this population?
.0146
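The tree diagram amounts to adding the two branches that end in a positive test. A short Python sketch with the probabilities given in the question:

```python
p_antibodies = 0.01               # P(antibodies present)
p_pos_given_present = 0.9916      # P(positive | present)
p_pos_given_absent = 0.0047       # P(positive | absent)

# Follow each branch of the tree that ends in a positive test, then add.
branch_present = p_antibodies * p_pos_given_present
branch_absent = (1 - p_antibodies) * p_pos_given_absent
p_positive = branch_present + branch_absent
```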
Two-thirds of patients in a drug trial suffered from severe migraines as a side effect. Another side effect was impaired vision, with 25% of all patients having no impairment, 75% having some level of vision impairment. What is the probability that a randomly selected patient has no side effects? (Assume independence and give answer to 3 decimal places.)
.083
Sport utility vehicles (SUVs) are generally considered to be more prone to roll over than cars. In the USA in 1997 24.0% of all fatalities involved a roll-over. In addition, given that the fatality involved a roll-over, 15.8% of all fatalities involved SUVs. Furthermore, given that the fatality did not involve a roll-over, 5.6% of all fatalities involved SUVs . Suppose that a fatality involved an SUV, calculate the probability that the fatality involved a roll-over. Please use 3 decimal places.
.471
SD values must be greater than or equal to __
0 (Zero occurs only when all observations have the same value, otherwise, it's greater than 0.)
Probability is always between what two numbers?
0 and 1 0 and 100%
Choose four numbers that have the largest possible standard deviation. 0, 0, 10, 10 9, 9, 10, 10 0, 3, 6, 10 1, 4, 7, 10
0,0,10,10
If mothers were always 2 years younger than the fathers of their children, the correlation between the ages of mother and father would be?
1
which is the best candidate for normal distribution? 1. Human heights by gender 2. Guinea pigs survival times after inoculation of a pathogen
1
area of each density curve for normal distribution equals?
1 (one)
Choose four numbers that have the smallest possible standard deviation.
1,1,1,1
"or equal" cumulative probability
1-Z-SCORE
A construction company employs three sales engineers. Engineers 1,2, and 3 estimated the costs of 30%, 20% and 50%, respectively, of all jobs bid on by the company. For i=1,2,3, define Ei to be the event that a job is estimated by engineer i. The following probabilities describe the rates at which engineers make serious errors in estimating costs: P(error|E1)=0.01 P(error|E2)=0.03 P(error|E3)=0.02 1. If a particular bid results in a serious error in estimating job cost, what is the probability that the error was made by engineer 1? 2. If a particular bid results in a serious error in estimating job cost, what is the probability that the error was made by engineer 2? 3. If a particular bid results in a serious error in estimating job cost, what is the probability that the error was made by engineer 3? 4. Based on the probabilities, given parts a-c which engineer is most likely responsible for making the serious error?
1. .158 2. .316 3. .526 4. 3
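With three possible sources of the error, Bayes' theorem divides each engineer's joint probability by the total probability of an error. A Python sketch with the numbers from the question (dictionary names are illustrative):

```python
share_of_jobs = {1: 0.30, 2: 0.20, 3: 0.50}  # P(E_i)
error_rate = {1: 0.01, 2: 0.03, 3: 0.02}     # P(error | E_i)

# Joint probability of "engineer i estimated the job AND made an error".
joint = {i: share_of_jobs[i] * error_rate[i] for i in share_of_jobs}
p_error = sum(joint.values())                # total probability of an error

# P(E_i | error) for each engineer.
posterior = {i: joint[i] / p_error for i in joint}
most_likely_engineer = max(posterior, key=posterior.get)
```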
Choose an American farm at random and measure its size in acres. Here are the probabilities that the farm chosen falls in several acreage categories: Acres <10 -- .14 10-49 -- .19 50-99 -- .13 100-179 -- .04 180-499 -- .29 500-999 -- .12 1000-1999 -- .05 ≥2000 -- .04 Let A be the event that the farm is less than 50 acres in size, and let B be the event that it is 500 acres or more. 1. Find P(A) 2. Find P(B) 3. Find P(Ac) 4. Find P(A or B)
1. .33 2. .21 3. .67 4. .54
Here is the probability model for the blood type of a randomly chosen person in the United States. Blood Type O -- .23 A -- .37 B -- .06 AB -- ? 1. What is the probability that a randomly chosen American has type AB blood? Please use 2 decimal places. 2. Maria has type B blood. She can safely receive blood transfusions from people with blood types O and B. What is the probability that a randomly chosen American can donate blood to Maria? Please use 2 decimal places. 3. What is the probability that a randomly chosen American does not have type O blood? Please use 2 decimal places.
1. .34 2. .29 3. .77
Blood type: O+ -- .39 O- -- .01 A+ -- .27 A- -- .005 B+ -- .25 B- -- .004 AB+ -- .07 AB- -- .001 1.What is the proportion of {O+} among Asian Americans? 2.What proportion of Asian American have blood type O?
1. .39 2. (.39+.01=__) -- .4
Sickle-cell anemia is a hereditary medical condition affecting red blood cells that is thought to protect against malaria, a debilitating parasitic infection of the liver and blood. That would explain why the sickle-cell trait is found in people who originally came from Africa, where malaria is widespread. A study in Africa tested 543 children for the sickle-cell trait and also for malaria infection. In all, 25% of the children had sickle-cell and 6.6% of the children had both sickle-cell and malaria. Overall, 34.6% of the children had malaria. 1. What is the probability that a child has either malaria or sickle-cell? 2. What is the probability that a child has neither malaria nor sickle-cell? 3. What is the probability that a child has malaria given that the child has the sickle-cell trait? 4. What is the probability that a child has malaria given that the child does not have the sickle-cell trait? 5. Are the events sickle-cell trait and malaria independent? What might that tell you about the relationship between sickle-cell and malaria?
1. .53 2. .47 3. .264 4. .373 5. No; it tells us that there is evidence of a biological link between the two conditions
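Each answer follows from the addition rule, the complement rule, or the definition of conditional probability. A Python sketch with the study's percentages; the independence check is a plain numeric comparison, not a formal test.

```python
import math

p_sickle = 0.25    # P(sickle-cell trait)
p_malaria = 0.346  # P(malaria)
p_both = 0.066     # P(sickle-cell and malaria)

p_either = p_sickle + p_malaria - p_both        # general addition rule
p_neither = 1 - p_either                        # complement rule
p_malaria_given_sickle = p_both / p_sickle      # conditional probability
p_malaria_given_no_sickle = (p_malaria - p_both) / (1 - p_sickle)

# Independence would require P(malaria | sickle) = P(malaria).
looks_independent = math.isclose(p_malaria_given_sickle, p_malaria)
```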
In a General Social Survey of Americans in 1991, two variables, gender and finding life exciting or dull, were measured on 980 individuals. The two-way table below summarizes the results. Tmale -- 425 Tfemale -- 555 Mexciting -- 213 Mroutine -- 200 Mdull -- 12 Fexciting - 221 Froutine - 305 Fdull - 29 Total Exciting - 434 Total Routine - 505 Total Dull - 41 Total People -- 980 Let A = randomly chosen person is female Let B = randomly chosen person finds life exciting 1. Find P(A | B) 2. Are the events A & B independent?
1. 221/434 2. No, because P(A | B) ≠ P(A)
X: height (inches) of young women aged 18 to 24 X ~ N(μ=64.5, σ²=2.5²) Using the 68-95-99.7 rule: 1. What percentage of young women are taller than 64.5 inches? P(X>64.5) 2. P(X>69.5) 3. P(X>72) 4. P(X<62)
1. 50% 2. .05/2 = .025 3. .003/2 = .0015 4. .32/2 = .16
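A sketch of the 68-95-99.7 reasoning in Python: find how many standard deviations the cutoff is from the mean, then take half of what lies outside that band. `tail_from_rule` is a hypothetical helper, and the results are the rule's approximations, not exact Normal probabilities.

```python
mu, sigma = 64.5, 2.5

# Share of observations within k standard deviations of the mean.
WITHIN = {1: 0.68, 2: 0.95, 3: 0.997}

def tail_from_rule(cutoff):
    """Approximate one-tail probability beyond a cutoff that sits a whole
    number (1, 2, or 3) of standard deviations from the mean."""
    k = round(abs(cutoff - mu) / sigma)
    return (1 - WITHIN[k]) / 2   # by symmetry, half of what lies outside

p_above_69_5 = tail_from_rule(69.5)  # 2 standard deviations above
p_above_72 = tail_from_rule(72)      # 3 standard deviations above
p_below_62 = tail_from_rule(62)      # 1 standard deviation below
```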
How to check if the normal approximation to the data makes sense?
1. A histogram will work for a large data set. 2. Normal quantile plot (a.k.a. Normal probability plot or QQ plot): works well for large and small data sets. If the data have approximately a Normal distribution, the plot will have roughly a diagonal straight-line pattern. x-axis: z-score; y-axis: sorted data
Application for binomial distributions
1. In a clinical trial, a patient's condition may improve or not. X: # of patients who improved among the study participants. 2. Is a child obese or not (based on their body mass index)? X: # of obese children in a random sample 3. In a quality control study, we assess the number of defective items in a lot of goods, irrespective of the type of defect. X: # of defective items in a random sample
A person's blood type is given as a combination of a group (O, A, B, or AB) and a Rhesus factor (+ or −). 1. What is the blood type of a randomly chosen Asian American? 2. Example of Simple Event? 3. Example of Compound Event?
1. S={O+,O-, A+,A-,B+,B-,AB+,AB-} 2. C = {a random Asian American's blood type is O+} 3. D={a random Asian American's blood type is O+ or O-} E={a random Asian American's blood type is O}
What are the conditions of a tree diagram?
1. The sum of the probabilities emanating from any branch is One 2. The final outcomes are disjoint
when to use the correlation (r)?
1. To use r, there must be a true underlying linear relationship between the two variables. 2. The variables must be quantitative. 3. The form for the points of the scatterplot must be reasonably straight. 4. Outliers can strongly affect the correlation. Look at the scatterplot to make sure that there are no strong outliers.
6% of Americans have blood type O−. The Tennessee Red Cross has 32,000 donors and needs at least 1850 that are O− this year. Will they run out? Let X = number of donors that are O−. X ~ Bin(n=32000, p=0.06) 1. Can we use the normal approximation to the binomial distribution? LOOK AT EXAMPLE ON SLIDE 50 on last slideshow
1. Yes; Condition 1: np=32,000*.06=1,920 ≥ 10 Condition 2: n(1 − p)=32,000*(1-.06)=30,080≥ 10 Yes
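A Python sketch that checks both conditions and then applies the Normal approximation with mean np and standard deviation sqrt(np(1-p)). No continuity correction is used, which is a simplifying assumption, so the result is approximate.

```python
import math
from statistics import NormalDist

n, p, needed = 32000, 0.06, 1850

expected_successes = n * p        # np
expected_failures = n * (1 - p)   # n(1-p)
conditions_met = expected_successes >= 10 and expected_failures >= 10

# Normal approximation to Bin(n, p).
approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
p_enough_donors = 1 - approx.cdf(needed)  # approximate P(X >= 1850)
p_run_out = approx.cdf(needed)            # approximate P(X < 1850)
```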
What are boxplots useful to show?
1. the location of the data and spread of the data 2. whether the distribution is symmetric or skewed 3. whether there are any outliers 4. box plots are useful for comparing distributions for two or more groups
The blood cholesterol levels of men aged 55 to 64 are approximately Normal with mean 222 mg/dl and standard deviation 37 mg/dl. X: blood cholesterol levels of men aged 55 to 64. X ~ N(μ=222, σ²=37²) 1. If the cholesterol level is 240 mg/dl, what is the z-score? 2. If a z-score is -2, what is the corresponding cholesterol level?
1. z = (240-222)/37 = .4865; it is .4865 standard deviations to the right of the mean 2. x = 222+37(-2) = 148
Continuous Random Variable
A continuous r.v. is measured in real units, such as time, weight, temperature or length.
Continuous probability distribution
A continuous random variable (X) has possible values that form an interval. •
Uniform Distribution Example A delivery man informed you that he would arrive any time between 5pm and 6pm. You started waiting for him at 5pm. Y: waiting time (in hours) until the delivery man arrives. 1. P(Y ≤ 0.5) 2. P(Y > 0.8) 3. P(Y ≤ 0.5 or Y > 0.8)
1. .5 2. .2 3. .7
Discrete Random Variable
A discrete r.v. takes a set of separate values, such as 0, 1, 2, 3, ....
Histogram
A histogram can be used to visually analyze a frequency table. It is similar to a bar graph, but with bar charts each column represents a group defined by a categorical variable, and with histograms each column represents a group defined by a quantitative variable.
(T+)
A positive test result states that the condition is present
Time Plot
A time series is data collected over time. Time series can be displayed in a time plot. Time plots show possible trends (a clear overall pattern) and possible cyclical variations (variations with some regularity over time)
the z-score tell us what?
A z‐score tells us how many standard deviations the original observation falls away from the mean, and in which direction. - Observations larger than the mean have positive z‐scores. -Observations smaller than the mean have negative z‐scores.
Total Females: 2606 Total Males: 10833 FAccident -- 1830 FHomicide -- 452 FSuicide -- 324 MAccident -- 6311 MHomicide -- 2414 MSuicide -- 2108 Total Accidents -- 8141 Total Homicide -- 2866 Total Suicide -- 2432 Total Deaths -- 13439 A. Using the above data, what is the probability that the victim was male? B. Using the above data, what is the conditional probability that the victim was male, given that the death was accidental? C. Using the above data, what is the conditional probability that the death was accidental, given that the victim was male? D. Using the above data, let A be the event that a victim of violent death was a woman and B the event that the death was a suicide. The proportion of suicides among violent deaths of women is expressed in probability notation as a. P(A) + P(B) b. P(A | B) c. P(A and B) d. P(B | A)
A. .806 B. .775 C. .583 D. d -- P(B|A)
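Parts A to C are ratios of counts from the two-way table: the numerator is the cell of interest and the denominator is the group being conditioned on. A Python sketch with the table's counts (the dictionary layout is illustrative):

```python
deaths = {
    "female": {"accident": 1830, "homicide": 452, "suicide": 324},
    "male": {"accident": 6311, "homicide": 2414, "suicide": 2108},
}

total_male = sum(deaths["male"].values())
total_deaths = total_male + sum(deaths["female"].values())
total_accidents = deaths["male"]["accident"] + deaths["female"]["accident"]

p_male = total_male / total_deaths                                   # A
p_male_given_accident = deaths["male"]["accident"] / total_accidents # B
p_accident_given_male = deaths["male"]["accident"] / total_male      # C
```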
An opinion poll calls 2000 randomly chosen residential telephone numbers, then asks to speak with an adult member of the household. The interviewer asks, "How many movies have you watched in a movie theater in the past 12 months?" What is the population?
All adults with residential telephones
What is the population?
All male larval tobacco budworms
Rule Two
All possible outcomes (sample space) together must have probability 1. If S is the sample space, then P(S) = 1.
The area within each box represents what?
All the outcomes that could possibly happen, S.
When are two variables associated?
An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.
intervals
An even spacing of the numbers along the axis of the graph.
Event
An event is a subset of the sample space. Denoted by capital letters: A, B, C,... An event corresponds to a particular outcome or a group of possible outcomes
How do you identify an outlier?
An outlier is an observation that is far above or far below the rest of the data values. It falls outside the overall pattern. Suspected low outlier: any value less than Q1 - 1.5 x IQR. Suspected high outlier: any value greater than Q3 + 1.5 x IQR.
Rule One
Any probability is a number between 0 and 1. Consider any event A, the probability P(A) satisfies 0 ≤ P(A) ≤ 1.
Which of the following graphs would be appropriate to describe frequency of days of week people have their birthday on?
Bar chart Pie Chart
How to choose which summary statistics to use?
Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don't have outliers. Otherwise, use the median and the five-number summary, which can be plotted as a boxplot.
Binomial Distribution
Bernoulli trials: • Each trial can result in one of two outcomes (S=success or F=fail) • Trials are independent. • The probability of success, P(S), is a constant p for all trials. X = the number of successes in n Bernoulli trials. • The random variable X is said to have a Binomial distribution with parameters n and p. • Notation: X ~ Binomial(n,p)
The incidence of major depression in adults is about 10%. Let X be the count of adults diagnosed with depression in a sample of 20 adults. What is the notation for the distribution of X?
X ~ Bin(n=20, p=.1)
Of the following data sets, which would you display with a histogram? A) A record of the blood type of selected individuals (A, B, AB, or O) B) A record of gender of selected individuals (labeled as male = 0, female = 1) C) A record of the height of selected individuals (in inches) D) A record of age of selected parents (in years) at birth of first child
C and D
choice of auto to buy (domestic or import)
Categorical
county of residence
Categorical
what type of variable is birth day?
Categorical
What are the two types of variables?
Categorical Quantitative
Bar Graphs
Used for categorical data. Each category is represented by a bar. The height of a bar represents either the count (frequency) or the relative frequency.
conditions of tree diagrams
Conditions: 1. The sum of the probabilities emanating from any branch is 1 2. The final outcomes are disjoint
what are the 3 ways of interpretation of probability?
Equally likely outcomes interpretation, frequency interpretation, subjective interpretation
General Addition Rule
Consider two events A and B P(A or B) = P(A) + P(B) - P(A and B)
Normal Distribution
Continuous probability distributions. The Normal or Gaussian distributions are a family of symmetrical, bell-shaped density curves specified by two numbers or parameters: the mean (μ) and the variance (σ²).
Uniform Distribution
A continuous probability distribution with a flat density curve: every value in the interval is equally likely. The standard uniform distribution takes values between 0 and 1. It is the continuous analogue of rolling a fair die, where 1, 2, 3, 4, 5, and 6 all have the same probability.
Response Variable
Dependent variable (Y) Measures an outcome of a study. The response variable could be categorical or numerical
density curve
Describes the overall pattern of a distribution and the value of the area underneath it is exactly 1 Probability of an outcome in any range of values on the horizontal axis = area under the density curve on that range.
The student performance in a three question quiz can be summarized in the following tree diagram. C: correct answer I: incorrect answer Each path from the first set of two branches to the third set of eight branches determines an outcome in the sample space. Based on the tree diagram, how many outcomes does a student's performance have?
Eight. The sample space is S = {CCC, CCI, CIC, CII, ICC, ICI, IIC, III}
Sample space S={MM,MF,FM,FF} M:male F:female Event A -- Both Children have the same gender Event B -- Elder Child Male What is A∩B? what is AUB? What is Ac?
Writing the elder child first: Event A -- {MM, FF} Event B -- {MM, MF} A∩B -- {MM} ("Both children have the same gender and the elder is a male") AUB -- {MM, MF, FF} ("Either both children have the same gender or the elder is a male or both") Ac -- {MF, FM} ("Both children have different gender" -- opposite of A)
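Events are subsets of the sample space, so intersection, union, and complement map directly onto set operations. The sketch below assumes the first letter of each outcome denotes the elder child; that ordering is an assumption, since the card does not state it.

```python
# Sample space for the genders of two children (assumed order: elder, younger).
S = {"MM", "MF", "FM", "FF"}

A = {o for o in S if o[0] == o[1]}   # both children have the same gender
B = {o for o in S if o[0] == "M"}    # elder child is male

intersection = A & B   # A and B both occur
union = A | B          # A or B or both occur
complement_A = S - A   # A does not occur

print(sorted(intersection), sorted(union), sorted(complement_A))
```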
An individual's blood type is described by the ABO system and the Rhesus factor. In the American population 16% of individuals have a negative Rhesus factor (Rh−), and 43.75% of those with Rh− are blood type O. 1.Pick a random American, which is the probability of having Rh-and blood type O?
Event A: individual is Rh- Event B: individual has blood type O P(A∩B) = ? "Probability of having Rh- & blood type O" P(A) = .16 P(B|A) = .4375 "Probability of being blood type O given an individual has Rh-" P(A∩B) = P(B|A) * P(A) = .4375 * .16 = .07
Determine whether the following statement is true or false: Tomorrow, it might rain sometime, or it might not rain at all. Since the sample space has two possible outcomes, each must have probability ½.
False
Do students with higher SAT scores get higher GPA? explanatory/response variable
GPA = Response (y), SAT score = Explanatory (x)
Tips on Finding Normal Probabilities
Greater than: • P(X > x) = 1 - P(X ≤ x) Between two numbers: • P(a < X < b) = P(X < b) - P(X < a) Outside of two numbers: • P(X < a OR X > b) = P(X < a) + P(X > b) • P(X < a OR X > b) = 1 - P(a ≤ X ≤ b) Due to symmetry of the standard Normal curve about 0: • P(Z < -a) = P(Z > a)
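A short sketch of these tips using `statistics.NormalDist` from the Python standard library in place of a z-table. It assumes the cholesterol model X ~ N(222, 37²) and the cut-offs 259 and 273 that appear elsewhere in these notes.

```python
from statistics import NormalDist

X = NormalDist(mu=222, sigma=37)   # assumed model: cholesterol of men aged 55 to 64
a, b = 259, 273

greater = 1 - X.cdf(b)                 # P(X > b) = 1 - P(X <= b)
between = X.cdf(b) - X.cdf(a)          # P(a < X < b)
outside = X.cdf(a) + (1 - X.cdf(b))    # P(X < a or X > b)

Z = NormalDist()                       # standard Normal N(0, 1)
left_tail = Z.cdf(-1.5)                # P(Z < -1.5)
right_tail = 1 - Z.cdf(1.5)            # P(Z > 1.5), equal by symmetry

print(round(greater, 4), round(between, 4), round(outside, 4))
```

Because the variable is continuous, using < or ≤ gives the same probabilities.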
How to find median of even number of data?
If n is even, the median is the average of the two center observations
when can the binomial distribution be approximated by a normal distribution?
If n is large, and p is not too close to 0 or 1 Practically, the Normal approximation can be used when both np ≥10 and n(1 − p) ≥10.
How to find median of odd number of data?
If n is odd, the median is the value of the center observation
How to report correlation? Bad vs good If r = −0.99 then this proves that drinking more red wine lowers cholesterol. or There is a strong negative association between red wine consumption and cholesterol level
If r = −0.99 then this proves that drinking more red wine lowers cholesterol. -- bad There is a strong negative association between red wine consumption and cholesterol level -- good
Outliers (scatterplot)
In a scatterplot, outliers are points that fall outside of the overall pattern of the scatterplot.
Bayes' Theorem
In some situations, we know the conditional probability P(B|A) but are much more interested in P(A|B). • For example, diagnostic tests provide P(Test +|disease) but we are interested in P(disease|Test +) • Bayes' Theorem provides a method for finding P(A|B) from P(B|A)
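A minimal sketch of Bayes' Theorem for a diagnostic test. The prevalence, sensitivity, and false positive rate below are hypothetical numbers chosen only for illustration; they do not come from these notes.

```python
def bayes_posterior(prior, p_pos_given_disease, p_pos_given_no_disease):
    """P(disease | Test +) from P(Test + | disease) via Bayes' Theorem."""
    # Total probability of a positive test: P(T+) = P(T+|D)P(D) + P(T+|Dc)P(Dc)
    p_pos = p_pos_given_disease * prior + p_pos_given_no_disease * (1 - prior)
    return p_pos_given_disease * prior / p_pos

# Hypothetical inputs: 1% prevalence, 95% sensitivity, 5% false positive rate.
posterior = bayes_posterior(prior=0.01,
                            p_pos_given_disease=0.95,
                            p_pos_given_no_disease=0.05)
print(round(posterior, 3))
```

With a rare condition, P(disease | Test +) can be far smaller than P(Test + | disease), which is why the two must not be confused.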
The 68-95-99.7 rule (a.k.a. the Empirical Rule)
In the Normal distribution with mean μ and s.d. σ: 1. Approximately 68% of the observations fall within σ of μ. 2. Approximately 95% of the observations fall within 2σ of μ. 3. Approximately 99.7% of the observations fall within 3σ of μ.
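The three percentages can be checked numerically. This sketch uses `statistics.NormalDist` on the standard Normal curve; the rule holds for any μ and σ because standardizing does not change the areas.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, s.d. 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} s.d.: {area:.4f}")
```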
in the short run, the proportion of times something happens is? long run?
In the short-run, the proportion of times that something happens is highly random. In the long-run, the proportion of times that something happens becomes very predictable.
inflection point
Inflection point is where the direction of curvature turns from upward to downward or vice versa.
A∩B
Intersection of A and B Both A and B are true/occur
Probability Rules
Look below
Can you compare the counts in (b) to answer the question "Is there a difference between male and female students in the proportion who binge drink?"
No, because the total number of females is far greater than the number of males in the study.
Coffee is a leading export from several developing countries. When coffee prices are high, farmers often clear forest to plant more coffee trees. Here are data on prices paid to coffee growers in Indonesia and the rate of deforestation in a national park that lies in a coffee-producing region, for five years: Coffee is currently priced in dollars. If it were priced in euros, and the dollar prices in the above table were translated into the equivalent prices in euros, would the correlation between coffee price and percent deforestation change?
No, units do not affect correlation
In each of the following situations, is it more reasonable to simply explore the relationship between the two variables or to view one of the variables as an explanatory variable and the other as a response variable? In the latter case, which is the explanatory variable and which is the response variable? Are they categorical or quantitative (quantitative means "numerical")? The typical amount of calories a person consumes per day and that person's percent of body fat.
Number of calories consumed per day: explanatory, quantitative. Percent of body fat: response, quantitative.
Complement rule for conditional probability:
P( Ac|B) = 1-P(A|B)
What percentage of men has the blood cholesterol levels between 259 and 273 mg/dl (inclusive)? P(259 ≤ X ≤ 273) = ?
P(259≤ X ≤ 273)=P(X<273)-P(X<259)
P(A|B) means "the probability that A is true if you are GIVEN that B is true". You must be given P(A&B), the probability that both A and B are true at the same time, in order to calculate P(A|B) or P(B|A). The formulas are...
P(A|B) = P(A & B) / P(B) P(B|A) = P(A & B) / P(A)
If any one of these is true, the other two are also true, and the events A and B are independent.
(1) P(A|B) = P(A) (2) P(B|A) = P(B) (3) P(A∩B) = P(A)P(B)
The TV weather forecaster announced that the probability of rain on Saturday is 50%, the probability of rain on Sunday is 50% and the probability that it rains on both Saturday and Sunday is 35%. What is the probability of rain over the weekend? Consider the following events: A = rains on Saturday B = rains on Sunday
P(Rain Weekend) = P(AUB) = P(A) + P(B) - P(A∩B) -- addition rule .5 + .5 - .35 = .65
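A small sketch of the general addition rule with the forecast numbers above. The independence check at the end is an added illustration rather than part of the original question.

```python
from math import isclose

p_sat = 0.50    # P(A): rain on Saturday
p_sun = 0.50    # P(B): rain on Sunday
p_both = 0.35   # P(A and B): rain on both days

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_weekend = p_sat + p_sun - p_both
print(round(p_weekend, 2))

# A and B are independent only if P(A and B) = P(A) * P(B).
independent = isclose(p_both, p_sat * p_sun)
print(independent)
```

Here P(A)P(B) = .25 differs from P(A∩B) = .35, so rain on the two days is not independent.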
distance (in kilometers) of commute to work
Quantitative: Continuous
the length of time to run a marathon
Quantitative: Continuous
Number of pets in family
Quantitative: Discrete
the number of people in line at a box office to purchase theater tickets
Quantitative: Discrete
What are measures of spread?
Range (not resistant to outliers) Quartiles -- IQR (resistant to outliers) Standard Deviation
Relative frequency
Relative frequency = number of times the outcome occurred / total number of trials
Vertical Axis of Histogram
Represents either the frequency (counts) or the relative frequency (percentage of total).
Sensitivity
Sensitivity = P(positive test | disease) = P(T+|D). The condition is present and the test correctly identifies that the condition is present. (True Positive)
How can you describe the distribution of a quantitative variable?
Shape (mode, symmetry, skewness) Center Spread Unusual Observations
What can we use instead of σ² on normal distribution?
Standard Deviation (s.d)
the spread of normal curves is described by? what does it equal?
Standard Deviation (σ), which equals the distance between the center and the inflection point
How high are the highest 10% cholesterol levels? X: blood cholesterol levels of men aged 55 to 64. X~N(222, 37^2)
Step 1. State the problem and draw a picture. P(X ≤ x*) = 0.90 Step 2. Find the z-score such that P(Z < z*) = .9: z* = 1.28 Step 3. Find the observation x* such that x* = μ + σ z*: x* = 222 + (37)(1.28) = 269.36 Step 4. Answer: The highest 10% of cholesterol levels are at least about 269.4 mg/dl.
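The same steps can be sketched with `statistics.NormalDist`, whose `inv_cdf` method plays the role of the reverse z-table lookup. The software value of z* carries more decimals than the table value 1.28, so the result differs slightly from a hand calculation.

```python
from statistics import NormalDist

mu, sigma = 222, 37

# Step 2: z* with P(Z < z*) = 0.90
z_star = NormalDist().inv_cdf(0.90)

# Step 3: convert back to the cholesterol scale, x* = mu + sigma * z*
x_star = mu + sigma * z_star

# The one-step alternative gives the same cut-off.
x_direct = NormalDist(mu, sigma).inv_cdf(0.90)

print(round(z_star, 2), round(x_star, 1))
```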
Refer to the previous question describing the study of 22,071 male physicians who were randomly assigned to take either an aspirin or a placebo every other day. Match each of the following terms to its correct description Subject? Sample? Population? What are the variables in this study?
Subject -- a physician in the study Sample -- the 22,071 male physicians Population -- all US males aged 40 or older There are two variables: (1) took aspirin (yes/no) and (2) had a heart attack (yes/no)
IQ for the general population
Symmetric
The height of female college students
Symmetric
Types of Unimodal Distribution Shapes
Symmetric Distribution Left Skew Right Skew
how can you tell if a boxplot is symmetric?
Symmetric: The whiskers will be approximately equal in length.
False Negative
Test states the condition is absent (T-), but it is actually present (D)
False Positive
Test states the condition is present (T+), but it is actually absent (Dc)
Standard Deviation
The "typical" distance of the observations from the mean. It describes the variation around the mean. square root of variance
cumulative probability
The cumulative probability for a value x in a distribution is the proportion of observations in the distribution that lie at or below x, i.e. the probability of all values less than or equal to that value. P(X ≤ x) = shaded area = P(Z ≤ z); use z-tables.
What do we use to find measures of location?
The mean, median and mode are measures of the center of the data.
Binomial Coefficient
The number of ways of arranging k successes among n observations is given by the binomial coefficient: (n choose k) = n! / (k!(n - k)!). It is read "n choose k".
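A quick sketch of the binomial coefficient, written out with factorials and compared against Python's built-in `math.comb`. The helper name `n_choose_k` is illustrative; the check values come from the lottery example (4 choose 1) and the blood-type example (3 choose 2) in these notes.

```python
from math import comb, factorial

def n_choose_k(n, k):
    """Binomial coefficient n! / (k! (n - k)!) computed from factorials."""
    return factorial(n) // (factorial(k) * factorial(n - k))

print(n_choose_k(4, 1), comb(4, 1))   # lottery example
print(n_choose_k(3, 2), comb(3, 2))   # 2 successes among 3 children
```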
Quantitative data (numerical)
These take numerical values for which arithmetic operations such as adding and averaging make sense. Usually recorded in a unit of measurement.
Independent Events
Two events are independent, if the outcome of one event cannot influence the outcome of another event. For the intersection of two independent events, A and B, P(A∩B)= P(A) * P(B)
Law of Large Numbers (LLN)
Under the LLN, saying "the probability that a particular outcome A occurs is p%" means that if you repeat the experiment over and over again, independently and under essentially identical conditions, the percentage of the time that A occurs gets closer and closer to (converges to) p%.
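A small simulation sketch of the Law of Large Numbers for a fair coin. The seed, the number of flips, and the checkpoints are arbitrary choices made for illustration.

```python
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
p = 0.5                         # probability of heads for a fair coin
checkpoints = (10, 100, 1_000, 10_000, 100_000)

heads = 0
proportions = {}
for n in range(1, max(checkpoints) + 1):
    heads += rng.random() < p   # True counts as 1, False as 0
    if n in checkpoints:
        proportions[n] = heads / n

for n, prop in proportions.items():
    print(f"{n:>7} flips: proportion of heads = {prop:.4f}")
```

The early proportions can wander noticeably, while the proportion after many flips settles near 0.5.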
AUB
Union of A and B Either A or B or both are true
what do the slices of a pie chart represent?
Values of a categorical variable
Events can be represented on what?
Venn Diagram
In each of the following situations, is it more reasonable to simply explore the relationship between the two variables or to view one of the variables as an explanatory variable and the other as a response variable? In the latter case, which is the explanatory variable and which is the response variable? Are they categorical or quantitative (quantitative means "numerical")? Water temperature controlled at different levels and growth (measured by weight) of corals in aquariums.
Water temperature: explanatory, quantitative. Growth: response, quantitative
Why do we divide by n-1 instead of n?
We divide by the number of independent pieces of information that go into the estimate of a parameter. This number is called the degrees of freedom (df). It equals the number of independent scores that go into the estimate minus the number of parameters estimated as intermediate steps in the estimation of the parameter itself. For the sample variance, one parameter (the mean) is estimated first, so df = n - 1.
tree diagrams
We have used Venn diagrams to solve probability problems involving unions and intersections of two or more events. Tree diagrams (or decision trees) will help us to solve probability problems involving conditional probabilities.
What is the special case of the addition rule?
When two events A and B are disjoint P(A∪B) = P(A) + P(B)
Each child born to a particular set of parents has probability 0.25 of having blood type O. If these parents have 3 children, what is the probability that exactly 2 of them have type O blood?
X = # of children with blood type O p = probability of blood type O = .25 1 - p = probability of not having blood type O = .75 X ~ Bin(n=3, p=.25) P(X=2) = (3 choose 2) p^2 (1-p)^(3-2) = 3(.25)^2(.75) = .1406
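The calculation above can be sketched as a general binomial probability function. The name `binomial_pmf` is an illustrative choice; the inputs n = 3, p = .25, k = 2 are the ones from the blood-type question.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p): (n choose k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_two_type_o = binomial_pmf(k=2, n=3, p=0.25)
print(round(p_two_type_o, 4))

# The probabilities over all possible counts 0..n add up to 1.
total = sum(binomial_pmf(k, 3, 0.25) for k in range(4))
print(total)
```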
notation for normal distribution
X ∼ N(μ, σ²)
Skewness
a measure of the shape of a data distribution. Data skewed to the left result in negative skewness; a symmetric data distribution results in zero skewness; and data skewed to the right result in positive skewness
percentile
A number that tells us what percent of the data values lie at or below a certain level. The pth percentile is the value such that p% of the values fall at or below it.
Individual (subject)
a person or any specific object in a population the objects described by a set of data.
The distribution of heights of young women aged 18 to 24 is approximately Normal with μ = 64.5 inches and σ = 2.5 inches. X~N(μ=64.5, σ²=2.5²) a) 68% of the women's heights are between ______ and ______ inches. b) 95% of the women's heights fall within ________ σ of μ c) 99.7% of the women's heights fall within ______ inches of the mean. (μ = _____ inches)
a) μ - σ and μ + σ: 64.5 - 2.5 = 62 and 64.5 + 2.5 = 67, so between 62 and 67 inches b) 2 c) 7.5 (3 * 2.5 = 7.5 inches); μ = 64.5 inches
Conditional Probability
can be thought of as a means of adjusting probability in the light of new information. In symbols, P(A|B)= "the probability of event A, given event B."
Continuous sample space
can take on any one of an infinite number of possible values over an interval. Example: Cholesterol level For a random person: S = {any reasonable positive value mg / dL}
Discrete Sample Space
can take on only certain values (whole number or a descriptor). Example: Blood types. For a random person: S = {O+, O-, A+, A-, B+, B-, AB+, AB-}
As we make more observations, the proportion of times an outcome occurs gets ________?
closer and closer to a certain number we would expect.
Univariate
data that has one variable
Multivariate
data that has three or more variables
Bivariate
data that has two variables
standardizing normal distribution
If a raw score is transformed into a z-score, the value of the z-score tells exactly where the score is located relative to all the other scores in the distribution. To transform any Normal curve N(μ, σ) into the standard Normal curve N(0, 1), calculate a z-score for each x: z = (x - μ) / σ.
how do you find median of a histogram?
Find n (the total number of measurements). Then (n+1)/2 gives the position of the median. Count the measurements bar by bar and estimate where that position falls on the histogram.
SD works best for what distributions?
for bell-shaped and symmetric distributions
the larger the Standard Deviation the ________ the histogram is?
more spread out (the observations lie further away from the mean)
the larger the standard deviation, the _________ the variability of the data
greater
The interpretation of probability is?
how to connect the mathematics of probability to the real world.
What information can you get from the dot plot or stem-and-leaf plot that you cannot get in a histogram?
individual data values
A historian wants to estimate the average age at marriage for women in New England in the early nineteenth century. Within her state archives, she finds marriage records for the years 1800-1820. She takes a sample of 100 of those records, noting the age of the bride for each. The average age of the brides in the sample is 24.1 years. Using a statistical method, she finds a margin of error and estimates that the average age of brides in this population was between 23.5 and 24.7. Inference? Data? Sample? Description? Population? Variable the number 24.1 is what?
inference -- The prediction that the average age of all brides living in New England in the early nineteenth century is between 23.5 and 24.7 years. data -- A list of the ages at marriage of all the 100 women Sample -- The 100 women chosen Description -- The average age was 24.1 years old. Population -- All women living in New England who married in the early part of the nineteenth century variable(s) -- age at marriage the number 24.1 is what? -- Statistic
standard deviation
is the "typical" distance of the observations from the mean. It describes the variation around the mean.
Pie Chart
it can only represent one categorical variable (e.g. major), breaks down into its components. Each characteristic is represented by a slice, and the size represents what percent of the whole is made up by that.
Variable
it is any characteristic of an individual
Left skewed boxplot
longer line on left side
Right skewed boxplot
longer line on right side
What are the proportions of students who are binge drinkers for male and female?
male -- .486 female -- .409
Males: binge drinkers -- 1908, non-binge drinkers -- 2017, total -- 3925. Females: binge drinkers -- 2854, non-binge drinkers -- 4125, total -- 6979. Report the cell count for male binge drinkers and female binge drinkers.
male -- 1908 female -- 2854
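Counts alone cannot be compared when the groups differ in size, so the sketch below converts the binge-drinker counts to proportions within each sex. The variable names are illustrative.

```python
# (binge drinkers, group total) for each sex, taken from the table above.
groups = {
    "male": (1908, 3925),
    "female": (2854, 6979),
}

# Proportion of binge drinkers within each group.
proportions = {sex: binge / total for sex, (binge, total) in groups.items()}

for sex, prop in proportions.items():
    print(f"{sex}: {prop:.3f}")
```

Females have the larger count of binge drinkers, yet males have the larger proportion.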
What are the two aspects of probability?
mathematical and philosophical. Philosophical: How does the real world connect with the mathematical theory? Mathematical-What does probability mean in the following statement: "the probability that a fair (balanced) coin lands heads is .5"?
A meteorologist preparing a talk about global warming compiled a list of weekly low temperatures (in degrees Fahrenheit) he observed at his southern Florida home last year. The coldest temperature for any week was 36 degrees F but he inadvertently recorded the Celsius value of 2 degrees. Assuming that he correctly listed all the other temperatures, how will this error affect the following summary statistics? mean ________ median and quartiles _____ range ___________ IQR ___________ Standard Deviation _______
mean -- decrease median and quartiles -- unaffected range -- increase IQR --- unaffected Standard Deviation -- increase
how to find mean and s.d from normal approximation? example: Consider 100 free‐throws taken by LeBron James, what is the probability that at least 80 of them are successful? Use the normal approximation if appropriate. Otherwise, choose e)?
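mean = np, s.d. = √(npq), where q = 1 - p. With p = .75: mean = 100 * .75 = 75, s.d. = √(100 * .75 * (1 - .75)) = √18.75 ≈ 4.33. Check: np = 75 ≥ 10 and n(1 - p) = 25 ≥ 10, so the normal approximation is appropriate. Find the z-score: z = (80 - 75)/4.33 ≈ 1.15, so P(X ≥ 80) ≈ 1 - P(Z < 1.15) ≈ 1 - .875 = .125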
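A sketch of the normal approximation to the binomial for this free-throw question, assuming the success probability p = .75 used in the answer above. No continuity correction is applied, to match the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

n, p = 100, 0.75               # p = .75 is taken from the card's answer

mean = n * p                   # np
sd = sqrt(n * p * (1 - p))     # sqrt(npq)

# Rule of thumb from these notes: both np and n(1 - p) should be at least 10.
approximation_ok = n * p >= 10 and n * (1 - p) >= 10

# P(X >= 80) is approximated by the area to the right of 80 under N(mean, sd).
p_at_least_80 = 1 - NormalDist(mean, sd).cdf(80)

print(mean, round(sd, 2), approximation_ok, round(p_at_least_80, 3))
```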
IQR
measure of spread Interquartile Range (IQR) is the difference between Q3 and Q1 IQR= Q3 - Q1 It is the range of the middle 50% of the data values.
what does r measure
Measures the strength and direction of the linear association (only) between two variables. The sample correlation r estimates the unknown population correlation (ρ).
If given a graph that is skewed what measures of location and spread should you use?
median and IQR
15 16 18 19 20 20 21 28 32 34 36 43 46 46 48 48 72 87 min? Q1? median? Q3? Max? IQR? outliers?
min -- 15 Q1-- 20 median -- 33 Q3 -- 46 max -- 87 IQR -- 26 outliers -- 87
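A sketch of the five-number summary and the 1.5 x IQR outlier rule for this data set. It uses the quartile method described in these notes (Q1 and Q3 are the medians of the lower and upper halves); other software may use slightly different quartile conventions.

```python
from statistics import median

data = sorted([15, 16, 18, 19, 20, 20, 21, 28, 32, 34,
               36, 43, 46, 46, 48, 48, 72, 87])
n = len(data)
half = n // 2

lower = data[:half]                                   # values below the median
upper = data[half:] if n % 2 == 0 else data[half + 1:]  # values above the median

q1, med, q3 = median(lower), median(data), median(upper)
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr     # suspected low outliers fall below this
high_fence = q3 + 1.5 * iqr    # suspected high outliers fall above this
outliers = [x for x in data if x < low_fence or x > high_fence]

print(min(data), q1, med, q3, max(data), iqr, outliers)
```

The value 72 stays inside the upper fence of 85, so only 87 is flagged.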
What is Five number summary?
minimum, Q1, median, Q3, and maximum
if r is negative, the association is ________
negative ex. a car's value vs. its age
what are the two types of categorical data?
nominal -- purely qualitative and unordered (i.e. car color, gender). Example Stat 302 Survey: Major ordinal -- can be ranked (i.e. star ratings, like a scale). They are not quantitative variables because the interval between consecutive ranks are often not identical. Example Stat 302 Survey: Election results (Very satisfied, Satisfied, Indifferent, Dissatisfied, Very dissatisfied)
is Standard Deviation resistant to outliers?
not resistant
what is "μ"?
notation for population mean "parameter"
What is "Xbar"? (x with bar over it)
notation for sample mean "statistic"
explanatory variable
or independent variable (X) may explain or influence changes in a response variable. The explanatory variable could be categorical or numerical.
μ
population mean
if "r" is positive, then the association is _______?
positive ex. height vs weight
types of trends
positive linear negative linear independent curvilinear
P(X>45) =
probability of the commuting time larger than 45 mins.
The best evidence for causation comes from what?
properly designed randomized comparative experiments.
relative frequencies
proportion of the data with that value how often something happens divided by all outcomes.
Spread
range of values taken by the variables distance between min and max
M
represents μ mean
pineapple symbol
represents σ² variance
Number of times checking account overdrawn in the past year for the faculty in your school
skewed right
For which of the following distributions is the mean greater than the median?
skewed to the right
Venn Diagram
Used to solve probability problems involving unions and intersections of two or more events.
formula for standard deviation
s = sq. root of ( sum of (x - xbar)^2 / (n - 1) )
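A sketch of the sample standard deviation formula, written out step by step and compared with `statistics.stdev` from the standard library. The data are the eight offensive-tackle weights from the NFL example in these notes.

```python
from math import sqrt
from statistics import stdev

weights = [328, 308, 330, 307, 314, 317, 307, 317]   # pounds

n = len(weights)
xbar = sum(weights) / n                                # sample mean
squared_deviations = [(x - xbar) ** 2 for x in weights]
s = sqrt(sum(squared_deviations) / (n - 1))            # divide by n - 1, not n

print(round(s, 2), round(stdev(weights), 2))
```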
Empircal rule (SD) histogram looks like?
symmetrical
what is correlation "r" based on?
the means and standard deviations of x and y -- so that means that r is only appropriate when there are no outliers
the standard normal distribution is?
the normal distribution N(0,1) with mean 0 and standard deviation 1
if the variable X has a normal distribution with mean μ and standard deviation σ, then the standardized variable has _________
the standard normal distribution, Z ~ N(0,1)
frequency
the number of times the value occurs
what is "k" in binomial distributions?
the number of successes we want
a density curve describes?
the overall pattern of a distribution.
the probability of a particular outcome is?
the proportion of times (relative frequency) that the outcome would occur in a long run of observations.
probability distribution tell us what?
what values X can take and; how to assign probabilities to those values.
What does the statement "the probability (chance) that an event occurs is p" mean?
It depends on the interpretation of probability.
This histogram shows frequencies. If you were to construct a histogram using the percentages for each interval, how would the shape of this histogram change?
would not change at all
normal quantile plot x-axis = y-axis =
x -- z-score y -- sorted data
if the z-score is given, then x = ?
x = μ+ σ z
the spread is always on what axis?
x axis
what is the "statistic" notations for normal distribution
x-bar and s² (the sample mean and the sample variance)
is correlation affected by outliers?
yes -- strongly
Categorical Variable
These place an individual into one of several groups or categories.
Quantitative (or Numerical) Variables
These take numerical values for which arithmetic operations such as adding and averaging make sense. Usually recorded in a unit of measurement (i.e. seconds, kilograms, inches).
Description
What was found in regard to the sample, not the whole pop The actual percentages of the people in the sample who have heart attacks.
Uniform Distributions
When all the bins have the same frequency, or at least close to the same frequency. The Histogram will be flat
Eight offensive tackles (OT) are randomly selected from the NFL OT population and their weights are recorded. The list of weights in pounds is: 328, 308, 330, 307, 314, 317, 307, 317 1. What is the mean weight? 2.What is the median? 3.What will the median be if we add another player with a weight of 310? 4.What will the mean and median be if we add another player with a weight of 500
1. 316 2. 315.5 3. 314 4. (with the 310 lb player also included) median: 315.5; mean: 333.8
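A sketch of these four answers using the standard library. It assumes, as the listed answers imply, that the 500 lb player is added after the 310 lb player has already joined the list.

```python
from statistics import mean, median

weights = [328, 308, 330, 307, 314, 317, 307, 317]   # pounds

mean_8 = mean(weights)            # question 1
median_8 = median(weights)        # question 2

with_310 = weights + [310]
median_9 = median(with_310)       # question 3

with_500 = with_310 + [500]       # assumption: 310 stays in the list
mean_10 = mean(with_500)          # question 4
median_10 = median(with_500)

print(mean_8, median_8, median_9, median_10, mean_10)
```

The single 500 lb value pulls the mean up by almost 18 pounds while the median barely moves, which illustrates that the median is resistant to outliers and the mean is not.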
How to summarize quantitative data by a typical value? (2 ways)
1. A measure of location, such as the mean, median, or mode 2. A measure of spread, such as the range, inter-quartile range, or standard deviation, which shows how well the typical value represents the list
Tricky Examples. 1. Level of Pain -- 0 = no pain, 1 = little pain, 2 = some pain, 3 = lots of pain 2. Comorbidity of Diabetes: 1. Hypertension (66%) 2. Back Pain (44%) 3. Depression (25%) 4. Arthritis (23%)
1. Categorical; ordinal (rank) 2. Categorical; nominal
Internet sites report that about 13% of Americans are left-handed. Question: Is this true for students at your university? During an exam in Stat 302 Section 520, the instructor walks around the room and counts 15 left-handed students out of 98 students in the class. Identify: 1. Population 2. Sample 3. Variable 4. Individual 5. (Population) Parameter 6. (Sample) Statistic
1. Population -- Americans 2. Sample -- Section 520 (class) 3. Variable -- left handed 4. Individual -- any American 5. Parameter -- Percentage of Americans who are left handed (13%) 6. Statistic -- Percentage of students in section 520 who are left handed
In a class of 105 students, the scores on a statistics exam are summarized in the following frequency table Scores Frequency ------------------------------- 61-70 16 71-80 31 81-90 42 91-100 16 Total 105 1. What is n? 2. The median score is in which class of scores? 3. Can we precisely compute the mean score from this frequency table?
1. n=105 2. 81-90 the median is in the position of the 53rd student --- add the frequencies ---> puts 53rd student in the class of "81-90" 3. No; Don't know the specific scores of all the students just the "class" of scores
Frequency Tables
A listing of possible values for a variable, together with the number of times the values occurs For categorical variable, possible values are the list of groups or categories of the variable. For quantitative variable, possible values can be: -List of discrete values of the variable (e.g. number of pets). -List of Intervals that contain the data values (e.g. age group)
Sample
A part of the population selected to draw conclusions about the population
Right Skew
Concentration of scores lies toward the low end of the scale and the tail trails off to the right
What are the types of Quantitative Variables?
Continuous Discrete
The following study is a very famous medical study called the Physicians' Health Study I. It was begun in 1982 and involved 22,071 male physicians between the ages of 40 and 81. One of the goals of the study was to determine if taking 325 mg of aspirin every other day would reduce a man's risk of a first heart attack. This was a randomized clinical trial which means each subject in the study was randomly assigned one of two treatments - taking an aspirin every other day or taking a placebo every other day. About 11,000 subjects were randomly assigned to take an aspirin every other day, and about 11,000 subjects were randomly assigned to take a placebo every other day. The study lasted 5 years. At the end of the study, the researchers described their results by reporting the percentage of heart attacks in the group taking aspirin and in the group taking a placebo. They found 0.9% of the group taking aspirin and 1.7% of the group taking a placebo had heart attacks. They used statistical methods to predict that US men aged 40 or older who take aspirin every other day have a lower risk of having a first heart attack compared to men who don't take an aspirin every other day. Match each term with the aspect of the study to which it refers: Design? Inference Description
Design -- The sample, the randomization (random assignment of treatment), and the plan to obtain percentages of each group that have heart attacks. Inference -- The use of statistical methods to make the prediction that US men aged 40 or older who take aspirin every other day have a lower risk of having a first heart attack compared to men who don't take an aspirin every other day. Description -- The actual percentages of the people in the sample who have heart attacks.
How do you calculate Q1 and Q3?
FIRST FIND MEDIAN OF DATA SET The first quartile (Q1) is the median of the values below the median in the sorted data set. The third quartile (Q3) is the median of the values above the median in the sorted data set.
What does measures of spread or varability tell us?
Measures of location summarize what is typical of elements of a list, but not every element is typical. Measures of spread or variability describe the variability of quantitative data -Are all the observations near the center? -Are most of the elements close to each other? On the average, how far are the elements from each other?
What graphs are for categorical data?
Pie Charts Bar Graphs
Inference
Refers to methods for making decisions about a population, based on data obtained from a sample of that population.
An opinion poll uses random digit dialing equipment to dial 2,000 randomly chosen residential telephone numbers. Of these, 31% are unlisted numbers. This isn't surprising because 35% of all residential numbers are unlisted. Identify the boldface numbers (31%, 35%) as either parameters or statistics.
Statistic: 31%, Parameter: 35%
What does Statistics mean?
Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty.
(Shape) Mode
The mode is the value that appears most often in a set of data. A mode of a histogram is a hump or high-frequency bin. One mode → Unimodal; two modes → Bimodal; three or more → Multimodal.
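As a rough sketch, the most frequent values in a small list of discrete data can be found with Python's `statistics.multimode`. The helper name `describe_modes` and the sample lists are hypothetical. The sketch counts ties for the most frequent raw value, which is not the same as judging humps in a histogram by eye, and it is only meaningful for data with repeated values.

```python
from statistics import multimode

def describe_modes(values):
    """Return the most frequent value(s) and a shape label based on how many there are."""
    modes = multimode(values)                    # all values tied for highest count
    labels = {1: "unimodal", 2: "bimodal"}
    return modes, labels.get(len(modes), "multimodal")

print(describe_modes([1, 2, 2, 3]))      # one most-frequent value
print(describe_modes([1, 1, 2, 3, 3]))   # two values tied for most frequent
```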
Inference
Applying the results from the sample to the whole population, e.g. the use of statistical methods to make the prediction that US men aged 40 or older who take aspirin every other day have a lower risk of having a first heart attack compared to men who don't take an aspirin every other day.
Ordinal variables
Categorical; can be ranked, e.g. star ratings (scale). They are not quantitative variables because the intervals between consecutive ranks are often not identical. Example Stat 302 Survey: Election results (Very satisfied, Satisfied, Indifferent, Dissatisfied, Very dissatisfied)
Trend
clear overall pattern
Left Skew
concentration of scores lies toward the high end of the scale and trails off to the left -- negative skew
parameter
is a number that describes the population In practice, the value of a parameter is not known when we cannot examine the entire population.
Populations of tobacco budworm, an insect pest harmful to the cotton plant, have developed resistance to a number of common insecticides. Entomologists hypothesized that the development of insecticide resistance could also affect other aspects of the tobacco budworm's biology, in particular, the weight of male larvae. A group of biologists wanted to test whether or not the average weight of male larval tobacco budworms whose parents are resistant to pyrethroid insecticide was different from the average weight of male larval tobacco budworms whose parents are very susceptible to pyrethroid insecticide. For their study, the biologists took 2 random samples of male larval tobacco budworms. One random sample consisted of 100 male larvae whose parents were resistant to pyrethroid insecticide. The other random sample consisted of 100 male larvae whose parents were very susceptible to pyrethroid insecticide. The biologists found that the sample average weight of male larval tobacco budworms with resistant parents was greater than the sample average weight of male larval tobacco budworms with very susceptible parents.
look at next questions
range
A measure of spread: the difference between the maximum and the minimum. Range = max - min
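A minimal sketch of the formula in Python; the function name `data_range` and the scores are made up for illustration.

```python
def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)

scores = [12, 15, 11, 19, 14]   # hypothetical data
print(data_range(scores))       # 19 - 11 = 8
```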
Between median and mean which is resistant to skew and outliers?
median -- the mean is NOT
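A quick check with made-up numbers illustrates the point: adding one extreme value moves the mean a long way but barely moves the median. The data below are hypothetical.

```python
from statistics import mean, median

values = [40, 42, 45, 47, 50]       # hypothetical data, no outlier
with_outlier = values + [500]       # same data plus one extreme value

print(mean(values), median(values))               # 44.8 and 45
print(mean(with_outlier), median(with_outlier))   # mean jumps above 120, median is 46.0
```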
Distribution
The distribution of a variable tells us what values it takes and how often it takes these values. The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. Plotting our data will give us an idea of the variable's distribution.
What will occur on a histogram when we plot frequency or relative frequency?
The shape of the distribution will stay the same; only the scale of the vertical axis changes (counts versus proportions).
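One way to see this is to bin a small made-up data set and compare counts with proportions. In the sketch below the helper `bin_counts`, the bin width of 5, and the data are all assumptions for illustration. Each relative frequency is just the count divided by n, so every bar is rescaled by the same factor.

```python
from collections import Counter

def bin_counts(values, width):
    """Count observations per bin, labelling each bin by its left edge."""
    return Counter((v // width) * width for v in values)

data = [1, 2, 2, 3, 5, 6, 6, 7, 8, 12]   # hypothetical data
n = len(data)
counts = bin_counts(data, 5)
rel = {b: c / n for b, c in counts.items()}

for b in sorted(counts):
    print(f"[{b}, {b + 5}): frequency={counts[b]}, relative frequency={rel[b]}")
```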
What is the relationship between mean and the median when graph is "Perfectly Symmetric"?
the mean equals the median
What is the relationship between mean and the median when graph is "Skewed to the Right"?
The mean is larger than the median. For a skewed distribution, the mean is farther out in the long tail than the median.
What is the relationship between mean and the median when graph is "Skewed to the Left"?
The mean is smaller than the median. For a skewed distribution, the mean is farther out in the long tail than the median.
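The three cases can be compared side by side with tiny made-up data sets. The lists below are hypothetical and chosen so that the median is 3 in every case while the mean follows the long tail.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]       # evenly spread around 3
right_skew = [1, 2, 3, 4, 20]     # long tail toward high values
left_skew = [-14, 2, 3, 4, 5]     # long tail toward low values

for name, data in [("symmetric", symmetric),
                   ("right skew", right_skew),
                   ("left skew", left_skew)]:
    print(name, "mean =", mean(data), "median =", median(data))
```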
For skewed distributions, is the mean or the median preferred to describe a typical observation?
The median is preferred because it is more representative of a typical observation.
Quartiles: what are the parts?
The numbers that separate the sorted data set into four equal parts: first (lower) quartile (Q1) is the 25th percentile (p=25); second quartile (Q2) is the 50th percentile (median, p=50); third (upper) quartile (Q3) is the 75th percentile (p=75). Together the four parts cover 100% of the data.
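A small check of the "four equal parts" idea, using Python's `statistics.quantiles` on made-up data. Note that this function interpolates between observations by default, so its cut points can differ slightly from the median-of-halves rule used elsewhere in these notes. The data set of twelve values is hypothetical.

```python
from statistics import quantiles

data = list(range(1, 13))            # hypothetical data: 1, 2, ..., 12
q1, q2, q3 = quantiles(data, n=4)    # three cut points for four parts

# Count how many observations fall in each of the four parts.
parts = [
    sum(v < q1 for v in data),
    sum(q1 <= v < q2 for v in data),
    sum(q2 <= v < q3 for v in data),
    sum(v >= q3 for v in data),
]
print(q1, q2, q3)
print(parts)   # each part holds a quarter of the 12 observations
```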