Statistics Midterm
for a box: the sum of the draws is likely to be around ____________, give or take ___________ or so.
expected value; standard error
5 cards are dealt from a shuffled deck of 52 cards. Compute the following probabilities (a) The first card dealt is a heart. (b) The fifth card dealt is a heart.
13/52 = 1/4 for both. by symmetry, the fifth card is equally likely to be any of the 52 cards, so its chance of being a heart matches the first card's. (careful: the draws are *not* independent; the answer comes from symmetry, not independence.)
which is a possible output of the following code: die = 1:6; sample(die, 3)  (a) 4 1 2  (b) 3
3
two boxes have colored balls in them. box A has 3 blue and 7 red balls in it, box B has 1 blue and 1 red. six balls are drawn without replacement from box A. what is the chance that exactly two balls are blue?
6C2 x 3/10 x 2/9 x 7/8 x 6/7 x 5/6 x 4/5 = 15 x 1/30 = 1/2
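the product-of-fractions answer can be sanity-checked against the hypergeometric count; a quick sketch in Python (rather than the course's R) just to confirm the number:

```python
from math import comb

# exactly 2 blue in 6 draws without replacement from 3 blue + 7 red,
# via the hypergeometric count: C(3,2)*C(7,4)/C(10,6)
p_hyper = comb(3, 2) * comb(7, 4) / comb(10, 6)

# same thing via the ordered-sequence argument: 6C2 orderings of BBRRRR,
# each with probability (3*2*7*6*5*4)/(10*9*8*7*6*5)
p_seq = comb(6, 2) * (3 * 2 * 7 * 6 * 5 * 4) / (10 * 9 * 8 * 7 * 6 * 5)

print(p_hyper, p_seq)  # both come out to 0.5
```

both approaches agree, so the expression above is exactly 1/2.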
consider a binomial random variable x with parameters n= 100, and p= 0.5 how would you use R to compute the probability that x ≤ 2?
pbinom(2, size=100, prob=0.5)
if you have to classify and count the draws,
put 0's and 1's on the tickets (1's count toward your number)
find 5 number summary of the vector 'grades' using quantile()
quantile(grades, c(0, 0.25, 0.5, 0.75, 1))
the second argument to quantile() gives the fractions of observations less than or equal to each returned value
what does this return in R: select(filter(forbes18, total > 20 & salary < 14), total, salary)
returns a data frame with just the total and salary columns, for the rows where total > 20 and salary < 14
what does this return in R: select(filter(forbes18, total > 28), salary, total, sport)
returns the columns salary, total, and sport for the rows with total > 28. the rows keep their original order; filter() and select() never sort anything.
take a histogram with some big outliers to the right would the histogram be skewed left or right? why?
right. skewness is named for the direction of the long tail, and the big outliers stretch the tail to the right.
use the function round( ) to round the vector "grades" to the nearest integer
round(grades, 0)
to find the sd of 'grades' vector, use R's sd() function then use your own the way we do it in class
sd(grades)  # R divides the sum of squared deviations by n-1 instead of n, so sd() is always a bit bigger than our SD
s1 <- grades - mean(grades)
s2 <- s1^2
s3 <- mean(s2)
s4 <- sqrt(s3)
our_sd <- s4
based on forbes18 data frame, who was the top earner among the soccer player (counting total earnings)?
soccer_data <- filter(forbes18, sport == "Soccer")
soccer_data <- soccer_data[order(soccer_data$total, decreasing = TRUE), ]
player <- soccer_data[1, 2]  # dataframe[row, column] gives you the output you desire (assumes Name is column 2)
major difference between standard deviation and standard error
standard deviation is for a list; standard error is for a chance process
given A and B are two events with P(A) = 0.4, and P(A∪B) = 0.7 if they're independent, what's P(B)
P(A∪B) = P(A) + P(B) − P(A∩B) = P(A) + P(B) − P(A)P(B) by independence. 0.7 = 0.4 + x − 0.4x, so 0.3 = 0.6x and x = 0.5. P(B) = 0.5
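the algebra can be checked numerically; a small Python sketch (not part of the exam answer):

```python
# from 0.7 = 0.4 + x - 0.4x (independence substituted into the union rule)
x = (0.7 - 0.4) / (1 - 0.4)   # isolate x: 0.3 = 0.6x
union = 0.4 + x - 0.4 * x     # plug back in; should recover 0.7
print(x, union)
```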
A player throws darts at a target. On each try, his probability of hitting the bullseye is 0.05, independently of the other tries. How many times should he throw so that his probability of hitting the bullseye at least once is 0.5?
P(hit at least once) = 1 − P(never hitting it) = 1 − 0.95^n, where n is the number of throws. set 0.5 = 1 − 0.95^n, i.e. 0.95^n = 0.5; take logs of both sides to get n = log(0.5)/log(0.95) ≈ 13.5. round up to the nearest integer: the player should throw at least 14 times. (you can leave the answer as log(0.5)/log(0.95) since you won't have a calculator.)
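the rounding step matters, so here is a quick arithmetic check (Python, just verifying the numbers):

```python
import math

p_miss = 0.95
n = math.log(0.5) / math.log(p_miss)   # solves 0.95^n = 0.5
throws = math.ceil(n)                  # round up to a whole number of throws

# confirm 14 throws clears 50% while 13 falls short
p_at_least_once_13 = 1 - p_miss ** 13
p_at_least_once_14 = 1 - p_miss ** 14
print(n, throws, p_at_least_once_13, p_at_least_once_14)
```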
what would you expect the following code to return? arrange(filter(forbes18,sport=="Soccer"), total)
a *data frame* of rows of soccer players in ascending order of their totals
does adding/subtracting affect standard deviation? does multiplying/dividing affect standard deviation?
adding/subtracting *no* multiplying/dividing *yes*
the rms size of a list is 0. this means that ...
all #'s are 0
a quiz has 25 multiple choice questions. each question has 5 possible answers. a correct answer is worth 4 points, but one point is lost for each incorrect answer. a student answers all the questions by picking an answer at random from the 5 choices. if we want to simulate this in R using sample( ), we would use the box ______ , and the function sample with inputs _____ fill in the blanks.
box = c( 4, -1, -1, -1,-1); sample(box, 25, replace = TRUE)
in R, how would you create a vector that consisted of ninety-five 0's, and five 1's?
c(rep(0,95),rep(1,5))
how do you calculate a change in the mean when one value changes?
change in mean = (new value − old value)/number of observations. ex/ in a list of 10 numbers, if one value goes up by 20, the mean goes up by 20/10 = 2.
write R code to compute the probability of getting exactly 2 heads in 5 tosses
dbinom(2, 5, prob=0.5)
a study is carried out to determine the effect of party affiliation on voting behavior in a certain city that is divided up into sections called wards. in each ward, the percentage of registered democrats that vote is higher than the percentage of registered republicans who vote. true or false: for the city as a whole, the percentage of registered democrats who vote must be higher than the percentage of registered republicans who vote. explain.
false! this is simpson's paradox: democrats can vote at a higher rate in every ward but a lower rate overall, if democrats are concentrated in wards that have a low overall voting rate.
write the code to simulate 2 rolls of a six-sided die and sum the result. that is, as if you are playing a game where you roll a pair of dice to move in the game
sum(sample(1:6, size=2, replace =TRUE))
given forbes18: write code to compute the average salary (name her "bingo"), grouped by sport
summarize(group_by(forbes18, sport), bingo = mean(salary))
or with piping (fancy!):
forbes18 %>% group_by(sport) %>% summarise(bingo = mean(salary))
can events be mutually exclusive if they're independent?
suppose A and B are independent events such that P(A) = 0.2 and P(B) = 0.6. (a) is it possible for A and B to be mutually exclusive? no: since they are independent, the probability of their intersection is (0.6)(0.2) = 0.12, not 0! *events with nonzero probability cannot be both independent and mutually exclusive*
incoming students at a certain law school have an average LSAT score of 163 and an SD of 8. tomorrow, 1 student will be picked at random. you have to guess the score now; the guess will be compared with the actual score, to see how far off it is. Each point off will cost a dollar. (for example, if the guess is 158 and the score is really 151, you will have to pay $7. (a) is the best guess 150, 163, or 170? (b) you have about one chance in three to lose more than ____
(a) the best guess is *163*, because it is the average. (b) an application of the 68-95 rule: about 32% ≈ 1/3 of the data lies outside 163 ± 8, so the chance of losing more than *8 dollars* is approximately 32%.
standard deviation for when a list only has two different numbers ("big" and "small")
(big − small) • √(fraction big • fraction small)
ex/ for the list 5 1 1 1: standard deviation = (5 − 1) • √((1/4)(3/4))
expected value for the sum of a box
(number of draws)(average of #'s in the box)
standard error for the sum of draws from a box (w/ replacement)
(root n)(standard deviation of box)
how to solve this kind of problem: 'find the missing section!' given a histogram or distribution table of quantitative data
(width × height) + (width × height) + (width × height) = 100%, since the bin areas must total 100%. use that, plus some basic algebra, to find a missing width or height.
ex/ (# of hours, height): (0-2, 6%), (2-4, 14%), (4-8, 8%), (8-12, HELP%), (12-20, 1.25%). the known areas add to (12 + 28 + 32 + 10)% = 82%, so the 8-12 bin must have area 18%. that bin is 4 hours wide, so HELP = 18/4 = 4.5% per hour.
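the bin-area bookkeeping can be verified mechanically; a Python sketch of the same arithmetic:

```python
# (width, height in % per hour) for the known bins: 0-2, 2-4, 4-8, 12-20
bins = [(2, 6), (2, 14), (4, 8), (8, 1.25)]
known_area = sum(w * h for w, h in bins)   # 12 + 28 + 32 + 10
missing_area = 100 - known_area            # all bin areas must total 100%
missing_height = missing_area / 4          # the 8-12 bin is 4 hours wide
print(known_area, missing_area, missing_height)
```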
only about 4% of people have type ab blood. what is the probability that the first AB donor is the tenth donor that we check? answer using the terminology/variables of a geometric distribution (X, p)
*X = the number of people we have to check up to and including the first person with type AB blood*, and P(type AB) = 0.04, so X has a geometric distribution with parameter *p = 0.04*. so P(X = 10) = (1 − p)^9 × p = (0.96^9) × 0.04
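plugging in the numbers (a quick Python check of the geometric formula):

```python
p = 0.04                   # chance a random donor is type AB
prob = (1 - p) ** 9 * p    # nine non-AB donors, then the first AB donor
print(prob)                # about 0.0277
```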
conditions for using binomial random variables to model
- # of trials must be *fixed* - each observation can be represented by *"success" or "failure"* - each observation is *independent*, with the *same probability of success* on every trial (no card-dealing!)
requirements to use geometric model
- 2 possible outcomes for each trial (success or failure) - the trials are independent - the probability of success is the same for each trial - the variable of interest is the number of trials required to obtain *the first success*!!! (whereas a binomial counts successes in a *fixed* number of trials!)
forbes_new = group_by(forbes18, sport) true or false: this will create a new data frame forbes_new that has the rows rearranged by sport. (that is, all the soccer players are listed together, and all the basketball players etc)
false; the data frame will look the same as forbes18, but subsequent verbs will operate by sport, e.g. summarize(forbes_new, n = n()) returns the number of players in each sport
how to find the number of players of each sport in forbes18
table(forbes18$sport)
use summarize( ) to find the mean total income by sport. order the output by income, but in descending order. which sport had the maximum mean total income?
forbes_summary <- arrange(summarise(group_by(forbes18, sport), mean = mean(total)), desc(mean))
forbes_summary
or pipe it:
newdata <- forbes18 %>% group_by(sport) %>% summarise(mean = mean(total)) %>% arrange(desc(mean))
newdata
for admission into a selective elementary school, 50 kids were given a math and reading test. assume both scores have histograms that follow the normal curve. the data is loaded into R as a data frame called data with two variables. the first variable is called math and has the math scores. the second variable is called reading and has the reading scores. you can assume that the data is already loaded in, and that dplyr and ggplot2 are loaded. write code in R to make a histogram (using ggplot() and the density scale) for the reading scores. make your histogram with 5 class intervals (there are a few ways to do this, I don't care which way). if it helps, you can assume that all the scores are between 50 and 100.
ggplot(data, aes(x = reading, y = ..density..)) + geom_histogram(bins = 5)
given A and B are two events with P(A) = 0.4, and P(A∪B) = 0.7 suppose that P(B) = 0.6. what would P(A∩B) be?
make sure u don't just do P(A) × P(B); that would assume A and B are independent, which we're not told! instead use P(A∪B) = P(A) + P(B) − P(A∩B), so 0.7 = 0.4 + 0.6 − P(A∩B), giving P(A∩B) = *0.3*
for admission into a selective elementary school, 50 kids were given a math and reading test. assume both scores have histograms that follow the normal curve. the data is loaded into R as a data frame called data with two variables. the first variable is called math and has the math scores. the second variable is called reading and has the reading scores. you can assume that the data is already loaded in, and that dplyr and ggplot2 are loaded. write code in R to find the proportion of math scores that are above 70
mean(data$math > 70) OR nrow(filter(data, math > 70))/50
for admission into a selective elementary school, 50 kids were given a math and reading test. assume both scores have histograms that follow the normal curve. the data is loaded into R as a data frame called data with two variables. the first variable is called math and has the math scores. the second variable is called reading and has the reading scores. you can assume that the data is already loaded in, and that dplyr and ggplot2 are loaded. write code in R to find the average math score for the students that got over 75 on the reading test
mean(filter(data, reading > 75)$math)
can we use the binomial model to investigate the following situation? If yes, identify n and p: city council of 11 republicans and 8 democrats pick a committee of 4 at random. what is the probability that they choose all democrats?
no: the committee is drawn without replacement from only 19 people, so the picks are not independent and p changes after each pick. (the exact chance is hypergeometric: 8C4/19C4.)
is it always possible to design a double-blind controlled experiment, provided the budget is sufficiently large?
no. for example, you may need to assign the subjects to different diets
a standard deck of cards is shuffled: can we model the number of aces in a 6-card hand using *binomial random variables*? why/why not?
no. the probabilities do not stay the same from draw to draw as we draw *without replacement*
strong positive correlation between the amount of time the children spend playing video games and their levels of aggressiveness while playing with other kids. does this indicate that playing video games causes children to become violent?
nope, because there could be confounding or lurking variables that haven't been considered; correlation doesn't imply causation.
twenty-one people in a room have an average height of 5 feet 6 inches. a 22nd person enters the room. how tall would he have to be in order to raise the average by 1 inch?
old average = (sum of existing observations)/n; new average = (sum of existing observations + new observation)/(n + 1)
67 inches = (66 × 21 + x)/(21 + 1), so x = 88 inches
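the algebra can be confirmed in one line (Python, same arithmetic):

```python
old_avg, n = 66, 21                          # 5 ft 6 in = 66 inches, 21 people
x = (old_avg + 1) * (n + 1) - old_avg * n    # 67*22 - 66*21
print(x)  # 88 inches, i.e. 7 ft 4 in
```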
Which of these has a Geometric model: (a) The number of people we survey until we find someone who has taken Statistics. (b) The number of people we survey until we find two people who have taken Statistics. (c) The number of people in a Sociology class who have taken statistics. (d) The number of sodas drunk by students each day. (e) The number of aces in a 5 card poker hand.
only (a): the geometric model counts trials up to and including the *first* success.
for the list 107, 98, 93, 101, 104, which is smaller - the rms size or the sd? no arithmetic needed. also what is the formulaic relationship between sd and rms?
the numbers in the list are close together, so the SD is small. but the numbers sit around 100, so the rms size (the square root of the average of the squares) is close to 100, which is much bigger. the relationship: SD = √(RMS² − mean²)
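the SD-vs-RMS relationship can be checked directly on this list (a Python sketch):

```python
vals = [107, 98, 93, 101, 104]
n = len(vals)
mean = sum(vals) / n
rms = (sum(v * v for v in vals) / n) ** 0.5           # rms size of the list
sd = (sum((v - mean) ** 2 for v in vals) / n) ** 0.5  # rms deviation from the mean
print(mean, sd, rms)  # sd is small (~4.8), rms is near 100
```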
jane has three children, each of which is equally likely to be a boy or a girl independently of the others. define the events: b = {there is at most one boy} c ={the family includes one boy and one girl} are b and c independent?
yes! P(b) = 4/8 = 1/2 (zero boys or one boy), P(c) = 6/8 = 3/4 (everything except BBB and GGG), and P(b∩c) = P(exactly one boy) = 3/8. since P(b)P(c) = (1/2)(3/4) = 3/8 = P(b∩c), the events are independent.
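with only 8 equally likely families, independence can be checked by brute force (Python sketch):

```python
from itertools import product

families = list(product("BG", repeat=3))   # 8 equally likely outcomes
B = [f for f in families if f.count("B") <= 1]      # at most one boy
C = [f for f in families if 0 < f.count("B") < 3]   # one boy and one girl present
pB, pC = len(B) / 8, len(C) / 8
pBC = len([f for f in B if f in C]) / 8
print(pB, pC, pBC)  # 0.5, 0.75, 0.375, and 0.5 * 0.75 = 0.375: independent
```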
can we use the binomial model to investigate the following situation? If yes, identify n and p: roll 5 dice and need 2 6's to win the game.
yes n = 5 p = 1/6
a die is rolled 3 times and a coin is tossed 3 times. can we model the following using *binomial random variables*: the sum of the number of times the coin lands heads and the die lands with an odd number of spots. why or why not?
yes, as the probabilities are the same for heads or odd number of spots! both = 1/2! we can model this using a box with one ticket marked 0 and one ticket marked 1, and we draw six times from this box. the first three draws represent the coin toss, and the next three represent the die rolls. therefore n = 6 and p = 0.5.
for a list of positive numbers, can the sd ever be larger than the average?
yes. consider:
mean(c(1, 1000))  ## [1] 500.5
sd(c(1, 1000))    ## [1] 706.3997
in general, the SD and the mean don't depend on each other: the mean tells you the center, the SD tells you how dispersed the data is around that center.
this showed up on a quiz, so study it: in each ward, the % of registered democrats who vote is greater than the % of registered republicans who vote. does this mean the percentage of democrats who vote, overall, must be higher than republicans?
"no. this can lead to simpson's paradox where one party is greater in each ward but lower overall, when democrats appear more in wards that have a low overall voting rate." hm ok sure but study that^
what would be a way you can get the RMS deviations from the mean (or SD as is in our text) using the R functions sd( ) and length(x)? You will need one more function, which one
## how to go from R's sd() to our rms-deviation SD
n <- length(grades)
new_sd <- sd(grades) * sqrt((n - 1)/n)
new_sd == our_sd
given 6 numbers: for adding up the draws, the box is...
1 2 3 4 5 6
what does this return in R: slice(select(forbes18, Name, total), 1:3)
the first three rows of a data frame with just Name and total:
1 Floyd Mayweather 285
2 Lionel Messi 111
3 Cristiano Ronaldo 108
create a new data frame forbes1 with a column showing how much money each athlete on the list earned through endorsements
forbes1 <- data.frame(endorsements = forbes18$total - forbes18$salary, name = forbes18$Name)
what was the mean total income of all the athletes on the list?
mean(forbes18$total)
what's in a 5 number summary? how do you find it in R?
minimum 1st quartile (25th percentile) median (50th percentile) 3rd quartile (75th percentile) maximum in R: summary( )
table()
returns # of items in each group like table(forbes18$sport) returns the number of players in each sport
what does this return in R: slice(forbes18, 5:n())
returns rows 5 through the last row (n() is the number of rows), i.e. it drops rows 1-4
head(forbes18)
returns the first 6 rows of forbes18
given this distribution table... cholesterol (mg): datadatadatadata percent (%): datadatadatadata what would be the *horizontal scale*, and the *vertical scale* for this histogram?
the vertical scale will have density (% per mg), and the horizontal axis has mg
true or false: in this course, the height of a block in a histogram represents how crowded that block is.
true, that is what the height represents (%/horizontal unit)
toss a fair coin 4 times. what is the chance of fewer than 2 heads?
(1/2)^4 + [ (1/2)^3 • (1/2)^1 • 4C1 ] = 1/16 + 4/16 = 5/16
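the same sum via the binomial formula (quick Python check):

```python
from math import comb

# P(0 heads) + P(1 head) in 4 tosses of a fair coin
p = sum(comb(4, k) * 0.5 ** 4 for k in (0, 1))
print(p)  # 5/16 = 0.3125
```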
if i deal seven cards from a standard deck of 52, what is the chance that I will get two triples (three of a kind) and one other card. For example, king of hearts, king of spades, king of clubs, ten of diamonds, ten of spades, ten of clubs, three of hearts.
(13C2 • 4C3 • 4C3 • 11C1 • 4C1) / 52C7
a box contains 8 red marbles and 3 green ones. six draws are made at random with replacement. find the chance that 3 green marbles are drawn.
the box has 8 + 3 = 11 marbles, so P(green) = 3/11 on each draw: (6!/3!3!) (3/11)^3 (8/11)^3
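note the box holds 8 + 3 = 11 marbles, so each draw is green with chance 3/11; the binomial arithmetic, sketched in Python:

```python
from math import comb

p_green = 3 / 11                                     # 3 green out of 11 marbles
p = comb(6, 3) * p_green ** 3 * (1 - p_green) ** 3   # exactly 3 green in 6 draws
print(p)  # about 0.156
```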
given 6 numbers: for counting 6's, the box is...
0 0 0 0 0 1
calculate standard deviation for a box
1. calculate each #'s deviation from the mean 2. square each 3. add them together 4. divide by n 5. take the square root
or as a formula: SD = √( Σ(x − mean)² / n )
E(X) for 1. binomial and 2. geometric distributions
1. np 2. 1/p
a coin is tossed and you win a dollar if there are exactly 50% heads. would you rather the coin be tossed 10 times or 100 times? explain your choice.
10 times: the chance of hitting *exactly* 50% heads shrinks as the number of tosses grows (about 24.6% for 5 heads in 10 tosses vs about 8% for 50 heads in 100 tosses).
chance error
how much the sum of the draws differs from its expected value: sum = expected value + chance error
given forbes18: assume that dplyr is loaded in R. write the code that you would use to create a new column called "endorsements" that gives the difference between the player's total earnings (called total) and his salary.
mutate(forbes18, endorsements = total - salary)
write R code to compute the probability of getting at most 1 head in 5 tosses
pbinom(1, 5, prob=0.5) or dbinom(0, 5, prob=0.5)+dbinom(1, 5, prob=0.5)
how to find the sd of a random variable also robert's work schedule for next week will be released today. robert will work either 45, 40, 25, or 12 hours. the probabilities for each possibility are listed below: 45 hours: 0.3 40 hours: 0.2 25 hours: 0.4 12 hours: 0.1 what is the standard deviation of the possible outcomes?
there are four steps to finding the standard deviation of a random variable. first, calculate its mean. second, for each value (45, 40, 25, 12), square its deviation from the mean and multiply by the probability of that outcome. third, add the four results. fourth, take the square root.
mean = 45(0.3) + 40(0.2) + 25(0.4) + 12(0.1) = 32.7
(45 − 32.7)²(0.3) = 45.387
(40 − 32.7)²(0.2) = 10.658
(25 − 32.7)²(0.4) = 23.716
(12 − 32.7)²(0.1) = 42.849
45.387 + 10.658 + 23.716 + 42.849 = 122.61 (the variance)
√122.61 ≈ 11.07
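the four steps, sketched in Python to confirm the numbers:

```python
hours = [45, 40, 25, 12]
probs = [0.3, 0.2, 0.4, 0.1]

mean = sum(h * p for h, p in zip(hours, probs))               # step 1: 32.7
var = sum((h - mean) ** 2 * p for h, p in zip(hours, probs))  # steps 2-3: 122.61
sd = var ** 0.5                                               # step 4: about 11.07
print(mean, var, sd)
```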
why does this produce an error ggplot(topo)+geom_histogram(x=value,y=..density..)
the aesthetic mappings aren't wrapped in aes(); it should be ggplot(topo) + geom_histogram(aes(x = value, y = ..density..))
a study is made of the age at entrance of college freshmen. If we assume that the distribution of entering freshmen's ages follows the normal curve, would you guess the SD of their ages to be about 1 month, 1 year, or 5 years? why?
this is an application of the *empirical rule*, which tells us that 68% of the data are within 1 SD of the average. this implies that one month is too little as most freshmen are not born within a month of the average month, and 5 years is too high as freshmen are mostly closer in age than that. this leaves *1 year* to be the correct answer.
in a data frame, every {observation or variable?} has its own column
variable
sort(x)
will sort the *vector* x in ascending order sort(x, decreasing = FALSE) is the default
P(a) = 0.5 and P(b) = 0.4 P(a and b) = 0.2 does this prove that a and b independent?
yes: P(a)P(b) = 0.5 × 0.4 = 0.2 = P(a and b), and that equality is the definition of independence
two boxes have colored balls in them. box A has 3 blue and 7 red balls in it, box B has 1 blue and 1 red. six balls are drawn without replacement from box A. what is the chance that exactly two balls are blue or the first two balls drawn are the same color?
= P(exactly two blue) + P(first two same color) − P(intersection)
P(exactly two blue) = 6C2 x 3/10 x 2/9 x 7/8 x 6/7 x 5/6 x 4/5 = 1/2
P(first two same color) = 7/10 × 6/9 + 3/10 × 2/9 = 8/15
P(intersection) = P(BB then RRRR) + P(RR then two blues among the last four) = (1 + 4C2) × (3·2·7·6·5·4)/(10·9·8·7·6·5) = 7/30
put them together: 1/2 + 8/15 − 7/30 = 24/30 = 4/5
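since the draws are just an ordered sample of 6 balls from 10, the whole answer can be brute-forced; a Python sketch that enumerates every equally likely ordered draw:

```python
from itertools import permutations

balls = "BBBRRRRRRR"   # 3 blue, 7 red
hits = total = 0
for draw in permutations(range(10), 6):   # every ordered 6-ball draw, equally likely
    colors = [balls[i] for i in draw]
    total += 1
    if colors.count("B") == 2 or colors[0] == colors[1]:
        hits += 1
print(hits / total)  # 4/5 = 0.8
```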
empirical rule
given a normal distribution: 68% of the data will be within 1 standard deviation of the mean, 95% within 2 standard deviations, 99.7% within 3 standard deviations.
ex/ in a large statistics course, the scores for the final followed the normal curve closely. the average was a 70 out of 100 and 75% of the class scored between 60 and 80 points. can we conclude that the SD of the scores was less than 10 points? yup! only about 68% of the data should be within 1 SD of the mean, but here 75% of the data is within 10 points of the mean; so 10 points must be more than 1 SD, i.e. SD < 10.
what does this return in R: arrange(forbes18, total, salary, sport)
it sorts the data frame by ascending total, breaking ties by salary, and breaking remaining ties by sport (alphabetically). no grouping happens; the later columns are only tiebreakers.
how to find median given a distribution table or histogram
to find the median, assume the data are uniformly spread within each bin. add up bin areas from the left until you pass 50%, then step back one bin and note the right edge of that bin. figure out how much more area you need to reach 50%, divide it by the area of the *next* bin (the one containing the median), and multiply that fraction by the next bin's width. add the result to the right edge you noted: that's the median.
ex/ through the 2-4 hour bin we've accumulated 40%, so we need 10% more out of the 32% in the 4-8 hour bin. that takes (10/32)(4) = 1.25 hours of that bin, so the median = 4 + 1.25 = 5.25 hours. (equivalently: the height of the 4-8 bin is 8% per hour, so a width of 1.25 hours covers the missing 10%.)
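the worked example, as bare arithmetic (Python sketch):

```python
cum_to_4hrs = 12 + 28        # % accumulated through the 2-4 hour bin
need = 50 - cum_to_4hrs      # 10% still needed to reach the median
bin_area, bin_width = 32, 4  # the 4-8 hour bin
median = 4 + (need / bin_area) * bin_width
print(median)  # 5.25 hours
```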
explain why if X is the value of a ticket drawn from a box of tickets, we have SE(X) = the standard deviation of the tickets in the box.
when X is a single draw from a box, every ticket is equally likely, so the expected value E(X) is just the ordinary arithmetic average of the ticket values. the SE of X is the rms size of the deviations from E(X), again with all tickets weighted equally; that is exactly the definition of the SD of the tickets in the box.