Statistics + probability
X is a discrete random variable. Write in terms of P(X ≤ k), where k is a natural number to be found, P(X > 28)
1 - P(X ≤ 28) because: P(X > 28) = P(X ≥ 29) = 1 - P(X ≤ 28)
How many s.f to write standard deviation and other stats/probabilities to unless stated
3 s.f (must not be more or less or marks lost)
Without further calculation, state whether the probability of obtaining more heads than tails would be the same as in the previous question (three fair coins) if four fair coins were tossed. Give a reason for your answer
No, could get an even number of heads and tails which was not possible with 3 coins
independent event rules for conditional probability
P(A|B) = P(A) and P(A|B) = P(A|B') -these rules only hold when A and B are independent; if the events are not independent, they will not work
mean error vs sample mean
mean error: assumes all values in each class = midpoint of class sample mean: uses actual values
frequency prediction from probability and total number of people
probability (for that category) x total number of people
P(A, B) on tree diagram
probability that A happens in the first event then B happens in the second event
mean and variance of binomial distribution for use with normal approximation + rule to work
X ~ B(n, p) -p must be close to 0.5 and n large for this to work -np must not be too close to 0 or n (write this check in your answer to show you are considering it) E(X) = np Var(X) = np(1 - p) -therefore, approximately: X ~ N(np, np(1 - p))
Donna wraps a £5 prize inside 1 of every 20 bars in her shop then mixes them up and sells them at random. (i) If you buy 10 bars, what is the probability of winning at least once? (ii) How many bars would you need to buy to be 99% sure of winning at least 1 prize? (iii) Donna buys the bars for £0.20 each and sells them for £1.50 each. On an order of 10 bars, what is the probability that she makes a loss?
(i) X ~ B(10, 1/20) P(X ≥ 1) = 1 - 0.95¹⁰ (using binomial formula) = 0.401 (ii) need 1 - 0.95ⁿ ≥ 0.99, so 0.95ⁿ ≤ 0.01, n ≥ 89.78, buy 90 (iii) profit with no wins = 10(1.50 - 0.20) = £13; each prize costs £5, so Donna makes a loss if 3 or more bars win (3 x £5 = £15 > £13) P(X ≥ 3) = 1 - P(X ≤ 2) = 0.0115
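A minimal sketch of the three parts above, using only the Python standard library. The helper name `binom_pmf` and the variable names are illustrative assumptions, not part of the question.

```python
from math import comb, log, ceil

p = 1 / 20  # chance that a single bar holds a prize

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# (i) at least one win in 10 bars
p_at_least_one = 1 - (1 - p) ** 10

# (ii) smallest n with 1 - 0.95**n >= 0.99
n_needed = ceil(log(0.01) / log(1 - p))

# (iii) a loss needs 3 or more winning bars among the 10 sold
p_loss = 1 - sum(binom_pmf(k, 10, p) for k in range(3))

print(round(p_at_least_one, 3), n_needed, round(p_loss, 4))
```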
A firm producing mugs has a quality control scheme in which a random sample of 10 mugs from each batch is inspected. For 50 such samples, the numbers of defective mugs are as follows. Number of defective mugs: 0, 1, 2, 3, 4, 5, 6+ Number of samples: 5, 13, 15, 12, 4, 1, 0 (i) Find the mean and standard deviation of the number of defective mugs per sample. (ii) Show that a reasonable estimate for p, the probability that a mug is defective, is 0.2. Use this figure to calculate the probability that a randomly chosen sample will contain exactly two defective mugs. Comment on the agreement between this value and the observed data.
(i) enter data into calculator, then use 1VAR with x list and frequency list set. Standard deviation is sx (ii) Total number of mugs: 50 x 10 = 500 Total number of defective mugs: 13 + (15 x 2) + (12 x 3) + (4 x 4) + 5 = 100 100/500 = 0.2 X ~ B(10, 0.2) P(X = 2) = 0.302 0.302 x 50 = 15.1 samples with 2 defective mugs, which matches the data well
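The calculator steps can be checked by hand with a short sketch like the one below. It assumes the divisor n - 1 for the standard deviation (the calculator's sx); the variable names are illustrative.

```python
from math import comb, sqrt

defectives = [0, 1, 2, 3, 4, 5]   # number of defective mugs in a sample
samples = [5, 13, 15, 12, 4, 1]   # how many samples showed that number

n = sum(samples)
mean = sum(x * f for x, f in zip(defectives, samples)) / n
sxx = sum(f * (x - mean) ** 2 for x, f in zip(defectives, samples))
sample_sd = sqrt(sxx / (n - 1))   # sx, divisor n - 1

p_hat = mean / 10                 # each sample holds 10 mugs
p_two = comb(10, 2) * p_hat**2 * (1 - p_hat) ** 8
expected_samples = n * p_two      # compare with the observed 15 samples

print(mean, round(sample_sd, 3), round(p_two, 3), round(expected_samples, 1))
```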
The probability distribution of X is: r 1, 2, 3, 4, 5 P(X = r) = 16/31, 8/31, 4/31, 2/31, 1/31 two independent values of X are chosen, and their sum S is found. Find the probability that S is odd
-S is odd when doing: 1 + 2, 1 + 4, 2 + 1, 2 + 3, 2 + 5, 3 + 2, 3 + 4, 4 + 1, 4 + 3, 4 + 5, 5 + 2, 5 + 4 -this would take a very long time to calculate, so instead do: P(sum odd) = P(odd then even) + P(even then odd) P(odd) = 16/31 + 4/31 + 1/31 = 21/31 P(even) = 1 - 21/31 = 10/31 P(odd then even) = 21/31 x 10/31 = 210/961 P(even then odd) = 10/31 x 21/31 = 210/961 210/961 + 210/961 = 420/961
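As a check on the shortcut, the long way (summing over every ordered pair) is quick for a computer. This sketch uses exact fractions; the names are illustrative.

```python
from fractions import Fraction
from itertools import product

pmf = {1: Fraction(16, 31), 2: Fraction(8, 31), 3: Fraction(4, 31),
       4: Fraction(2, 31), 5: Fraction(1, 31)}

# Sum P(a) * P(b) over every ordered pair whose total is odd.
p_odd_sum = sum(pmf[a] * pmf[b]
                for a, b in product(pmf, repeat=2)
                if (a + b) % 2 == 1)

print(p_odd_sum)
```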
conditions for correlation + difference between correlation and association
-a change in the value of one of the variables is associated with a change in the other variable -the relationship is linear -both variables are random -correlation is a special case of association -association covers relationships between variables that do not need to be linear or random
estimating population standard deviation
-as long as n is large enough (greater than 50), sample S.D may be used as estimate for population S.D (done in questions where S.D has to be calculated given a table of values or Σx², Σx, n)
description + features of normal distribution curve + notation + values of standard normal
-bell-shaped -symmetrical, mean = median = mode (μ in centre) -unimodal -area under every normal curve = 1 probability = area under curve between limits spread of normal distribution = standard deviation X ~ N(μ (mean), σ² (variance)) for a general normal variable Z ~ N(0, 1) for standard normal σ = standard deviation
how to calculate the percentile of interval data
-calculate n -find the cumulative frequency represented by that percentile using: percentile/100 x n (eg: 15th percentile of 30 data items = 15/100 x 30 = 4.5) -find the class that frequency ends in, and find how far into the class it goes (eg: if the first class has a frequency of 3 and the second has frequency of 4, the frequency ends in the second class, 1.5 data points in) -use the class interval range and how far into the class the data item is to find the value (eg: if the second class interval is 10 < t < 20, the value is: 1.5/4 x 10 = 3.75, since 3.75 into the 10 class, percentile is 13.75)
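The interpolation steps above can be sketched as a small function. Only the first two class frequencies (3 and 4) and the 10 to 20 interval come from the example; the other classes and frequencies are hypothetical values chosen so that n = 30.

```python
def percentile_from_grouped(percent, classes, freqs):
    """Estimate a percentile by linear interpolation within its class."""
    n = sum(freqs)
    target = percent * n / 100            # cumulative frequency to reach
    cumulative = 0
    for (lower, upper), f in zip(classes, freqs):
        if f > 0 and cumulative + f >= target:
            into_class = target - cumulative   # data points into this class
            return lower + into_class / f * (upper - lower)
        cumulative += f
    raise ValueError("percent must be between 0 and 100")

classes = [(0, 10), (10, 20), (20, 30), (30, 40)]  # hypothetical intervals
freqs = [3, 4, 13, 10]                             # hypothetical, total 30

print(percentile_from_grouped(15, classes, freqs))
```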
how to solve questions where mean, standard deviation, or both have to be found
-calculate z using info given to make equation -use inverse normal to find z value in standard normal for given probability -solve equation from first step -if both mean and standard deviation must be found, this process has to be repeated with two data values which gives two equations that have to be solved simultaneously
How to perform sample correlation coefficient hypothesis test
-compare r with appropriate critical value (critical value depends on sample size, significance level, one/two tail test) -find critical value from table given to you, or critical value is given already
correlation and causation relationship
-correlation does not imply causation -just because correlation between two variables exists, does not mean one causes the other
Michael thinks that because there is a positive correlation between wine consumption and divorce rate, higher wine consumption leads to a higher divorce rate. Charlotte claims that Michael is indulging in 'pseudo-statistics'. What arguments could she use to support this point of view?
-correlation does not imply causation -other variables may affect the relationship between wine consumption and divorce rate
why does it not matter if you use < instead of ≤ with normal but does for binomial
-difference is infinitely small with normal distribution as distribution is continuous -only matters when doing continuity correction -does matter with binomial as distribution is discrete so it would be a whole number difference
There are 12 people at an identification parade; one of them is guilty and the others are innocent. Three witnesses are called to identify the guilty person. Assuming they make their choice purely by random selection, draw a tree diagram showing the possible outcomes labelling them Correct or Wrong.
-different subtree for each witness -trials are independent because each witness can call the same person, so the denominators are the same -for each tree, upper branch = 1/12, lower branch = 11/12
when can the normal distribution be used to model discrete variables
-distribution is approximately normal: steps in possible values are small compared to S.D so can be treated as if continuous -continuity corrections are applied where appropriate -useful when n of binomial nCr is large because calculator cannot calculate with large values of n -also used to approximate binomial distribution (discrete) between integers/steps
Cluster sampling
-divide population area into clusters -use all people in one or more clusters with simple random sampling to choose which cluster(s) are chosen -no sample frame -simple -cheap -biased if wrong cluster is selected
Stratified sampling
-divide population into groups -simple random sampling carried out in each group number to select = group size/population size X sample size (eg: selecting people from UK and US as separate groups - SRS may choose more of one group by accident so stratified is used to make it even/results proportional) -reflects population structure -guarantees proportional representation of groups in pop -population must be clearly classified into distinct groups (strata) -suffers from same downsides as SRS
A bag contains 4 red and 6 blue marbles. A marble is chosen at random but not replaced in the bag. A second marble is then chosen at random. Given that the second marble is blue, what is the probability that the first marble is also blue?
-draw probability tree P(B n B) = 6/10 x 5/9 = 1/3 P(B) = (4/10 x 6/9) + (6/10 x 5/9) = 3/5 P(B|B) = (1/3) / (3/5) = 5/9
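The tree-diagram calculation can be mirrored with exact fractions. This is a sketch with illustrative variable names.

```python
from fractions import Fraction

# Branch products from the tree: first pick, then second pick.
p_blue_blue = Fraction(6, 10) * Fraction(5, 9)
p_red_blue = Fraction(4, 10) * Fraction(6, 9)

p_second_blue = p_blue_blue + p_red_blue
p_first_blue_given_second_blue = p_blue_blue / p_second_blue

print(p_second_blue, p_first_blue_given_second_blue)
```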
Calculating rₛ manually
-draw table with headings (variable 1), rank, (variable 2), rank, d, d² -rank each of the values of variable 1 in the 1st rank column -rank each of the values of variable 2 in the 2nd rank column -calculate d (difference between the two ranks) -calculate d² -find the total of d² values, then use the formula (given to you): rₛ = 1 - 6Σd²/(n(n² - 1))
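The table method can be sketched as below. The function assumes there are no tied values (ties need averaged ranks, which this sketch does not handle), and the data are made up purely for illustration.

```python
def spearman_rs(xs, ys):
    """Spearman's rank correlation via the d-squared formula (no ties)."""
    def ranks(values):
        order = sorted(values)
        return [order.index(v) + 1 for v in values]

    rank_x, rank_y = ranks(xs), ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))

# Hypothetical data, not taken from any question
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]
print(spearman_rs(xs, ys))
```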
how to find the probability of an outcome from two events happening
-draw tree diagram (multiplying probabilities can give errors if events are not independent, so use tree diagram)
There are 90 players in a tennis club. Of these, 23 are juniors, the rest are seniors. 34 of the seniors and 10 of the juniors are male. There are 8 juniors who are left-handed, 5 of whom are male. There are 18 left-handed players in total, 4 of whom are female seniors. Represent this information in a J, L, M Venn diagram
-draw triple venn diagram with circles J, M, L and: centre (J n L n M): 5 J n L n M': 3 J n M n L': 5 J n M' n L': 10 L n M' n J': 4 M n L n J': 6 M n L' n J': 28 J' n M' n L' (outside): 29
A roulette wheel has 38 numbers on it, they are all equally likely to come up. A player chooses a 17 forty times in a row. He bets £1 each time. If he wins, he gets £35. How much does he expect to win or lose in 40 rolls?
-each spin: 35(1/38) - 1(37/38) = -2/38 = -£0.0526, so a 5.26p loss is expected per spin 5.26 x 40 = 210.5p -£2.11 expected loss on 40 rolls
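A short check of the expectation, working in pounds with exact fractions. It takes the question's payout at face value (win £35, otherwise lose the £1 stake); the names are illustrative.

```python
from fractions import Fraction

p_win = Fraction(1, 38)

# Expected winnings per spin in pounds: +35 on a win, -1 otherwise.
per_spin = 35 * p_win - 1 * (1 - p_win)
over_40_spins = 40 * per_spin

print(per_spin, over_40_spins, round(float(over_40_spins), 2))
```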
conclusions from largest value in sample being greater than 3 standard deviations from the mean
-either standard deviation is wrong or distribution is not normal
interpolation
-estimating a value within two known values in a data set (eg: finding a point on a cumulative frequency curve)
Simple random sampling + pros and cons
-every sample has equal chance of being selected -sampling frame has samples with numbers -numbers selected using RNG -bias free -every sample has equal chance -not suitable for large populations -sampling frame needed
Given that P(E′) = 0.44, P(F′) = 0.28 and P(E ∪ F) = 1, find P(E ∩ F).
-find P(E) and P(F) so addition rule can be used: P(E) = 1 - 0.44 = 0.56 P(F) = 1 - 0.28 = 0.72 -using addition rule: P(E ∪ F) = P(E) + P(F) - P(E ∩ F) P(E ∩ F) = 0.56 + 0.72 - 1 P(E ∩ F) = 0.28
A manufacturer produces bolts, the internal diameter of which can be modelled by the Normal distribution N(2.8, 0.2²) (the mean and standard deviation are measured in mm). What percentage of bolts have an internal diameter measurement within one standard deviation of the mean? (proving without known percentage) Bolts which have an internal diameter of less than 2.7 mm or greater than 2.92 mm are rejected. Out of a batch of 800 bolts, how many would be acceptable (to the nearest whole number)?
-find probability of bolts being +-0.2 (S.D) from 2.8 (mean) X ~ N(2.8, 0.2²) P(2.6 ≤ X ≤ 3) = 0.6827 -percentage of bolts within one standard deviation of the mean is 68.3% P(2.7 ≤ X ≤ 2.92) = 0.4172 0.4172 x 800 = 334 expected acceptable bolts
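The same areas can be found without a calculator using `statistics.NormalDist` from the Python standard library. This is a sketch; the variable names are illustrative.

```python
from statistics import NormalDist

diameter = NormalDist(mu=2.8, sigma=0.2)

# Area within one standard deviation of the mean
within_one_sd = diameter.cdf(3.0) - diameter.cdf(2.6)

# Acceptable bolts lie between 2.7 mm and 2.92 mm
p_accept = diameter.cdf(2.92) - diameter.cdf(2.7)
expected_acceptable = round(800 * p_accept)

print(round(within_one_sd, 4), round(p_accept, 4), expected_acceptable)
```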
Assume that the masses of adult men can be modelled by the Normal distribution with mean 75 kg and standard deviation 5 kg. What is the interquartile range for the masses of adult men?
-find upper quartile: X ~ N(75, 5²) P(X < Q) = 0.75 -using inverse normal: Q = 78.37 -the distance from the mean to the upper quartile = 3.37 (78.37 - 75), so by symmetry the IQR is twice this: IQR = 6.74 (could also find LQ and find difference but it will take longer)
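The inverse normal step can be reproduced with `NormalDist.inv_cdf`. This sketch finds both quartiles directly rather than relying on symmetry, so the two approaches can be compared.

```python
from statistics import NormalDist

mass = NormalDist(mu=75, sigma=5)

upper_quartile = mass.inv_cdf(0.75)
lower_quartile = mass.inv_cdf(0.25)
iqr = upper_quartile - lower_quartile

print(round(upper_quartile, 2), round(iqr, 2))
```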
using z for normal distribution probability questions
-find value of z given the information -find the probability of z (may be less than or more than calculation) with the standard normal to find the probability -can also be used to find area between two values by finding probability of the two z values using the standard normal and finding the difference
Skilled operators make a particular component for an engine. The company believes that the time taken to make this component may be modelled by the Normal distribution. They time one of their operators, Sheila, over a long period. They find that only 10% of the components take her over 90 minutes to make, and that 20% take her less than 70 minutes. Estimate the mean and standard deviation of the time Sheila takes.
-first piece of info: z = (90 - μ)/σ P(Z > (90 - μ)/σ) = 0.1 (so Φ(z) = 1 - 0.1 = 0.9) -using inverse normal in calc with standard normal values: -since it is over 90, look at the right side of the curve z = 1.282 so (90 - μ)/σ = 1.282 -second piece of info: -since it is less than 70, look at the left side of the curve z = (70 - μ)/σ P(Z < (70 - μ)/σ) = 0.2 z = -0.842 so (70 - μ)/σ = -0.842 -solving both equations simultaneously (subtracting gives 20 = 2.124σ): σ = 9.4 μ = 77.9
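A sketch of the same simultaneous-equation method in code. The two z values come from the inverse standard normal, then the pair of equations 90 = μ + z₁σ and 70 = μ + z₂σ is solved by subtraction. The variable names are illustrative.

```python
from statistics import NormalDist

standard = NormalDist()          # Z ~ N(0, 1)

z_upper = standard.inv_cdf(0.9)  # 10% of times are above 90 minutes
z_lower = standard.inv_cdf(0.2)  # 20% of times are below 70 minutes

# Subtracting the two equations eliminates mu.
sigma = (90 - 70) / (z_upper - z_lower)
mu = 90 - z_upper * sigma

print(round(z_upper, 3), round(z_lower, 3), round(sigma, 1), round(mu, 1))
```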
normal approximation to binomial distribution idea
-graph of P against x for binomial distribution has discrete stick values -if the tops of the sticks are joined, it forms a normal distribution -the sticks can be turned into bars to find the area under the curve using rectangular approximation in normal eg: P(X = 2) = 0 because there is no area -so to find this value, a continuity correction is used P(1.5 < x < 2.5) to approximate the value
Deviation and spread
-how far a value is from the mean (x - x̄) -spread can be seen using the mean absolute deviation (Σ|x - x̄|)/n
Assuming the distribution of the heights of adult men is Normal, with mean 174 cm and standard deviation 7 cm, find the probability that a randomly selected adult man is under 185cm (direct way and z way)
-in calculator: let X be the height of a man in cm X ~ N(174, 7²) σ = 7 μ = 174 lower = -10000 upper = 185 answer = 0.942 -using z value: z = (185 - 174)/7 = 1.57 in calculator: σ = 1 μ = 0 lower = -10000 upper = 1.57 answer = 0.942
how to do continuity correction (P(X = 3), P(X > 4), P(X ≥ 5), P(X < 6), P(X ≤ 7), P(8 < X ≤ 9), P(10 ≤ X ≤ 11))
-move each boundary by 0.5 (half the gap between the possible integer values), choosing the direction so that the binomial value(s) are included in the corrected interval if ≥ or ≤ or =, or excluded if < or > -write the continuity correction in terms of < or > only P(X = 3) -> P(2.5 < X < 3.5) P(X > 4) -> P(X > 4.5) P(X ≥ 5) -> P(X > 4.5) P(X < 6) -> P(X < 5.5) P(X ≤ 7) -> P(X < 7.5) P(8 < X ≤ 9) -> P(8.5 < X < 9.5) P(10 ≤ X ≤ 11) -> P(9.5 < X < 11.5)
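The rule can be sketched as a small helper for integer-valued X. The function name `continuity_correct` and its parameters are hypothetical, introduced only for illustration; `None` stands for an unbounded side.

```python
def continuity_correct(lower=None, upper=None,
                       lower_inclusive=True, upper_inclusive=True):
    """Return the (low, high) bounds to use with the normal approximation."""
    low = high = None
    if lower is not None:
        # Include the value: step down 0.5. Exclude it: step up 0.5.
        low = lower - 0.5 if lower_inclusive else lower + 0.5
    if upper is not None:
        # Include the value: step up 0.5. Exclude it: step down 0.5.
        high = upper + 0.5 if upper_inclusive else upper - 0.5
    return low, high

print(continuity_correct(3, 3))                         # P(X = 3)
print(continuity_correct(8, 9, lower_inclusive=False))  # P(8 < X <= 9)
```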
Is it reasonable to use a straight line of best fit when the data is almost horizontal on a graph
-no, weak correlation would suggest line of best fit is inappropriate
z
-number of standard deviations beyond the mean -particular value of variable Z which has mean 0 and S.D 1 z = (x - μ)/σ -compared against standard normal
The times taken for 10 people to run 100m are listed below: 11.59, 11.50, 13.61, 14.42, 1.84, 15.54, 16.32, 17.16, 19.43, 25.47 One of the results has been written down incorrectly. Given the actual median time is 15.19, calculate the correct time
-outlier is 1.84 (much lower than the rest, likely impossible to occur) -order correct data: 11.5, 11.59, 13.61, 14.42, 15.54, 16.32, 17.16, 19.43, 25.47 -since 15.19 isn't in the data set, the median must be the result of a number between 15.54 and the correct time we don't know, so: (15.54 + x)/2 = 15.19 x = 14.84
lower and upper tail of a distribution
-outlier range LQ - 1.5 x IQR UQ + 1.5 x IQR
how to draw a box plot
-plot whiskers at (LQ - 1.5 x IQR) and (UQ + 1.5 x IQR) -they now represent 'lowest value that is not an outlier' and 'highest value that is not an outlier', so point on boundary is not an outlier -plot outliers as crosses outside of whiskers -can still assume 25% of data fits between each Quartile as long as outliers are disregarded
Quota sampling
-population already divided into groups based on characteristics -people are self selected from groups until quota is met (enough people from that group) number to select = sample size/population size X group size -representative sample can be achieved with small sample size -cheap (not quick) -no sample frame -non-random sampling, can introduce bias -non-response not recorded -can be biased due to researcher
P(A|B)
-probability of A given B has already happened (conditional probability) P(A|B) = P(A n B)/P(B)
spearman's rank correlation coefficient
-ranks data then compares ranks -used to measure association, can be used for any relationship where appropriate (eg: when relationship is non-linear) -symbol: rₛ rₛ = 1: perfect positive rank order/relationship rₛ = -1: perfect negative rank order/relationship rₛ = 0: no association (random scattering of points)
identify two ways in which the normal distribution used to model an event may be modified if it is found that half the time, the mean is lower
-reduce mean -increase standard deviation
Large data set information and context
-sample of 200 people in USA (out of a sample of 5000) -2003 to 2004 (so results will be different to modern day and UK) -age range of 17-85 -outlier of an individual with 0 diastolic pressure (and many missing/ #N/A readings) -4 systolic pressure (heart pumping) readings and 4 diastolic pressures (heart resting) found and averages found for each systolic > diastolic (in mmHg, millimetres of mercury) -systolic follows normal distribution, diastolic does not
Systematic sampling + pros and cons
-selects sample within sample frame every kth interval k = population size/sample size -after selecting starting point using random number from 1 to k -simple/easy (not quick or cheap) -suitable for large samples (not populations) -can introduce bias without random sampling frame -sampling frame needed
how to prove events are independent
-show that they satisfy P(A ∩ B) = P(A) x P(B)
The probability distribution of a discrete random variable is given by P(X = r) = k for 1 ≤ r ≤ 20. Find the value of k.
-since the probability is discrete and every probability = k: 20k = 1 k = 1/20
conditional probability venn diagram (may help with solving)
-single oval divided in half (with box) by wavy line -each half has different variable and they are mutually exclusive -part of circle in each halved box has different value -may also be helpful to represent with tree diagram because branches are affected by previous branches (dependent events - conditional) (eg: first branch is P(A), so following branch is P(B|A) if events are dependent/conditional)
Given that P(A) = 0.9 and P(B) = 0.7, find the smallest possible value of P(A ∩ B).
-solve algebraically using addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B) P(A ∪ B) = 0.9 + 0.7 - P(A ∩ B) -we know that the probability of P(A ∪ B) cannot exceed 1 so: 1 ≥ 1.6 - P(A ∩ B) P(A ∩ B) ≥ 0.6 -so the smallest possible value of P(A ∩ B) is 0.6 -cannot be solved using trial and error since probabilities of A and B without the intersection are constants and we don't know them
standard error
-standard deviation of the distribution of sample means: σ/√n
Sxx
-sum of squared deviations -easier than converting all values to positive manually Sxx = Σ(x - x̄)² or Σx² - nx̄²
During voting at a by-election, an exit poll of 1700 voters indicated that 50% of people had voted for the Labour party candidate. When the votes were counted, it was found that he had in fact received 57% support. Of the 1700 people interviewed 850 said they had voted Labour but 57% of 1700 is 969, a much higher number. What went wrong? Is it possible to be so far out just by being unlucky and asking the wrong people?
-this situation can be modelled by the binomial distribution: X ~ B(1700, 0.57) np = 969 S.D = 20.4 -the conditions for a normal approximation apply: np is not close to 0 or n p is close to 0.5 at 0.57 and n is large at 1700 X ~ N(969, 20.4²) continuity correction for 850: 850.5 P(X < 850.5) = 3.15e-9 (could also calculate z and use standard normal) -it is very unlikely that this event occurred with random sampling
Pearson's product moment correlation coefficient (PMCC)
-used to measure the strength of linear correlation in a bivariate data set -r always lies between -1 and +1 (it is not the gradient of the line of best fit, although it has the same sign as the gradient of the least squares regression line) r = +1: perfect positive r = -1: perfect negative r = 0: no correlation -pairs of data items (coordinates) can be entered into table in calculator and r can be calculated CALC -> REG -> X -> aX + b -should check conditions for correlation apply before using
midrange
-value halfway between max and min values (max + min)/2 -do not consider frequency of each category when calculating midrange, only value halfway between highest and lowest categories -used when data is symmetrical with no outliers -easy to calculate
Potential errors with cumulative frequency curve
-values should be plotted at upper bound of interval (not midpoint or lower) -using grouped data for the intervals will reduce the error introduced by misplotting but will be less accurate -using graph to interpolate instead of raw data will be less accurate
Three fair coins are tossed. By considering the set of possible outcomes, HHH, HHT, etc., tabulate the probability distribution for X, the number of heads occurring. Then illustrate the distribution and describe the shape of the distribution.
-write possible outcomes: HHH, HHT, HTT, HTH, THH, TTH, THT, TTT total of 8 -tabulate probability distribution: h: 0, 1, 2, 3 P(X = h): 1/8, 3/8, 3/8, 1/8 -draw graph with P(X = h) on the y-axis and h on the x-axis -draw straight lines upwards at each value of h to the probability (should have gaps between lines) -distribution is symmetrical
standard deviation facts for normal distribution
68% of all values are within ±1 S.D from the mean 95% of all values are within ±2 S.D from the mean 99.7% of all values are within ±3 S.D from the mean 6 standard deviations in whole curve (useful for estimates) -points of inflection are at ±1 S.D from the mean -values more than 2 S.D from the mean are usually treated as outliers
The jaguar is a species of big cat native to South America. Records show that 6% of jaguars are born with black coats. Jaguars with black coats are known as black panthers. Due to deforestation a population of jaguars has become isolated in part of the Amazon basin. Researchers believe that the percentage of black panthers may not be 6% in this population. Find the minimum sample size needed to conduct a two-tailed test to determine whether there is any evidence at the 5% level to suggest that the percentage of black panthers is not 6%.
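-two-tailed at 5%, so 0.025 in either tail -for the lower tail to contain any values at all, need P(X = 0) < 0.025 under X ~ B(n, 0.06): (1 - 0.06)ⁿ < 0.025 n > 59.6, so n = 60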
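The smallest n can also be found by a direct search. This sketch simply increases n until the probability of seeing no black panthers drops below the 0.025 tail.

```python
p = 0.06          # proportion of black panthers under the null hypothesis
tail = 0.025      # half of the 5% significance level

n = 1
while (1 - p) ** n >= tail:
    n += 1

print(n)
```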
All the Jacks, Queens and Kings are removed from a pack of playing cards. Giving the Ace a value of 1, this leaves a pack of 40 cards consisting of four suits of cards numbered 1 to 10. The cards are well shuffled and one is drawn and noted. This card is not returned to the pack and a second card is drawn. Find the probability that both cards are of the same suit.
10/40 x 9/39 = 3/52 for each pair picked of same suit -this can happen 4 ways because there are 4 different suits, so: 3/52 x 4 = 12/52 = 3/13
Two cards are picked at random, without replacement, from a pack of 52 playing cards. What is the probability that the second card is black, given that the first is red?
26/51 -once a red card has been removed, 51 cards remain and all 26 black cards are still in the pack -no need for the conditional probability formula here: 'given' just means the probability of B when R has already happened, which can be read straight from the reduced pack
Three dice are thrown. Find the probability of obtaining at least two 6s, and different scores on all dice
At least two 6s: possible by rolling 6, 6, (another number) in 3 arrangements or possible by rolling three sixes: 1/6 x 1/6 x 5/6 x 3 = 5/72 1/6 x 1/6 x 1/6 = 1/216 5/72 + 1/216 = 0.0741 Different scores on all dice: 6/6 x 5/6 x 4/6 = 0.556
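Both answers can be confirmed by listing all 216 equally likely outcomes. This is a brute-force sketch with illustrative names.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=3))   # all 6**3 outcomes

at_least_two_sixes = Fraction(
    sum(1 for r in rolls if r.count(6) >= 2), len(rolls))
all_different = Fraction(
    sum(1 for r in rolls if len(set(r)) == 3), len(rolls))

print(at_least_two_sixes, round(float(at_least_two_sixes), 4))
print(all_different, round(float(all_different), 3))
```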
Hypothesis tests for spearman's rank correlation coefficient
H₀: There is no association between the variables H₁: There is association between the variables (2-tail test) or There is positive association between the variables (1-tail) or There is negative association between the variables (1-tail) -calculated or given value of rₛ is compared with critical value that is given or can be found from table depending on sample size, one/two tail test, and significance level -if rₛ is more extreme than the critical value (|rₛ| > critical value for a 2-tail test, rₛ > critical value for a positive 1-tail test, rₛ < -critical value for a negative 1-tail test): result is significant, reject H₀, there is evidence to suggest (an/a positive/a negative) association between (variable 1) and (variable 2)
H₀ and H₁ of sample correlation coefficient hypothesis test
H₀: ρ = 0 (no correlation between variables) H₁: ρ ≠ 0 there is correlation between the variables (two tail test) or ρ > 0 there is a positive correlation between variables (one tail test) or ρ < 0 there is a negative correlation between variables (one tail test) ρ (rho)
effect on mean and variance (or S.D) of changing values by multiplication and addition
E(y) = a E(x) + b Var(y) = a² Var(x) -shifting does not change variance or S.D because spread is the same (only multiplication of values affects it) E = mean Var = variance y = new value x = original value a = multiplier b = addition/shift value
A seed supplier advertises that, on average, 80% of a certain type of seed will germinate. You calculate the critical region to be X ≥ 17 when 18 seeds are planted which has a probability of 0.0991. Determine the probability that Mr Brewer will reach the wrong conclusion if the true germination rate is 80%. What about if the true germination rate is 82%?
He reaches the wrong conclusion when rejecting H₀ (because the true germination rate of 80% = H₀), which has a probability of 0.0991. He will reach this wrong conclusion if he gets 17 or more seeds to germinate in his tests. If the true germination rate is 82%, he reaches the wrong conclusion by accepting H₀ (because H₀ of 80% is false). The probability of this is P(X ≤ 16) when X ~ B(18, 0.82), i.e. the probability of not landing in the critical region.
The results of an examination, in which there were 2454 candidates, are modelled by a Normal distribution with mean 62 and standard deviation 15. Marks are recorded as integers. If the pass mark is 40, approximately how many candidates would you expect to pass? Approximately how many candidates (to the nearest 5) would you expect to gain marks between 50 and 70 inclusive?
Let X = mark of a randomly chosen candidate X ~ N(62, 15²) -since marks are discrete (integers), a continuity correction must be used: P(X ≥ 39.5) = 0.9332 0.9332 x 2454 = 2290 expected to pass P(49.5 ≤ X ≤ 70.5) = 0.5122 0.5122 x 2454 = 1257 which is 1255 to the nearest 5
A machine is set to produce metal rods of length 20 cm, with standard deviation 0.8 cm. The lengths of rods are Normally distributed. The machine is reset to be more consistent so that the percentage between 19 cm and 21 cm is increased to at least 95%. Calculate the new standard deviation to 1 d.p.
P(19 < X < 21) ≥ 0.95 P((19 - 20)/σ < Z < (21 - 20)/σ) ≥ 0.95 -the interval is symmetric about the mean, so 0.95 is the central area with 0.025 in each tail -using inverse normal with central area = 0.95: -1.96 < z < 1.96 1.96 = (21 - 20)/σ σ = 0.51 -> 0.5
A particular condition affects 0.8% of the population. 90.1% of the population as a whole carry a certain gene. 9.85% of the population neither carry the gene nor are affected by the condition. Paul discovers that he carries the gene. He believes that it is very likely that he will be affected by the condition. Determine whether or not he is correct.
P(G u C) = 1 - 0.0985 = 0.9015 0.9015 = P(G) + P(C) - P(G n C) = 0.901 + 0.008 - P(G n C) P(G n C) = 0.0075 P(C|G) = P(G n C)/P(G) = 0.0075/0.901 = 0.00832 This is very unlikely so he is incorrect
A venn diagram of independent events M and P has values: (M n P') = 15, (P n M') = 10, with total number of items 60. Find the value of the intersection
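P(M) = (15 + x)/60 P(P) = (10 + x)/60 P(M n P) = x/60 -since they are independent events: P(M n P) = P(M) x P(P) x/60 = (15 + x)/60 x (10 + x)/60 -multiplying out: 60x = (15 + x)(10 + x), so x² - 35x + 150 = 0 x = 5 or x = 30 -both values fit within the total of 60, so check the diagram for which applies (x = 30 is the value taken here)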
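A brute-force sketch that tests every whole-number intersection that fits inside the total of 60 against the independence condition P(M n P) = P(M) x P(P). Multiplying both sides by 60² avoids fractions. The variable names are illustrative.

```python
total = 60
only_m, only_p = 15, 10

# x can be at most total - only_m - only_p so the regions still fit.
solutions = [x for x in range(0, total - only_m - only_p + 1)
             if total * x == (only_m + x) * (only_p + x)]

print(solutions)
```

Listing every solution this way shows whether the quadratic has more than one admissible root before settling on an answer.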
In a Donkey Derby event, there are three races. There are six donkeys entered for the first race, four for the second and three the third. Sheila places a bet on one donkey in each race. She knows nothing about donkeys and chooses each donkey at random. Find the probability that she backs at least one winner.
P(Sheila backing at least one winner) = 1 - P(Sheila backing no winners) 1 - (5/6 x 3/4 x 2/3) = 0.583
how to calculate P(41 ≤ x < 57) on calculator
P(X ≤ 56) - P(X ≤ 40) -use BCD on calculator to solve
how to calculate P(52 ≤ X < 61) with binomial distribution
P(X ≤ 60) for upper limit -subtract everything below the lower limit: P(X ≤ 51) P(52 ≤ X < 61) = P(X ≤ 60) - P(X ≤ 51)
A roulette wheel has 38 numbers on it, they are all equally likely to come up. A player chooses a 17 forty times in a row. What is the probability that he wins at least once?
Probability of something happening at least once = 1 - probability of something not happening in all trials: P(win) = 1/38 P(not win) = 37/38 for 40 rolls: 1 - (37/38)⁴⁰ = 0.656
arithmetic mean vs weighted mean
arithmetic mean = average weighted mean = Σxf/n (total of (variable x frequency) / total frequency) Σ (sigma) = total or summation
hypothesis testing: distribution of individual items and distribution of sample means
distribution of individual items: population mean = μ N(μ, σ²) -samples from population are taken, and means of the samples plotted in a distribution: -still centered around population mean x̄ ~ N(μ, σ²/n) -as sample size is increased, the means will be closer to the true population mean (S.D decreases so curve is tighter and taller)
find the probability that, when a card is chosen at random from a pack of cards, the first card chosen is a four or heart
four: 4/52 heart: 13/52 since the 4 of hearts appears in both groups, it must be removed from one so: 4/52 + 13/52 - 1/52 = 16/52 = 4/13
Every evening, 5 men and 5 women are chosen to take part in a competition. Of these 10 people, exactly 3 will win a prize. The 3 prize-winners are chosen at random. i) find the probability that on a particular evening, 2 of the prize-winners are women and the other is a man ii) four evenings are selected at random. Find the probability that on at least 3 evenings, 2 of the prize-winners are women and the other is a man
i) 5/10 x 4/9 x 5/8 = probability of WWM = 5/36 -this combination can occur 3 ways (WWM, WMW, MWW), so multiply by 3: 5/36 x 3 = 5/12 -can also do: (5C2 x 5C1)/(10C3) = 5/12 which is less risky because it already accounts for the 3 possible ways ii) this can happen by: 3 evenings + 4 evenings -4 evenings: (5/12)⁴ = 625/20736 -3 evenings: 7/12 x 5/12 x 5/12 x 5/12 = 875/20736 this can occur 4 ways: (can be found by 4C3 = 4) NYYY, YNYY, YYNY, YYYN, so multiply by 4: 875/5184 add probabilities: 625/20736 + 875/5184 = 1375/6912
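Both parts can be checked with exact fractions. Part (i) uses the combinations route, and part (ii) treats the four evenings as a binomial with that probability. This is a sketch; the names are illustrative.

```python
from fractions import Fraction
from math import comb

# (i) choose 2 of 5 women and 1 of 5 men, out of all ways to pick 3 of 10
p_two_women = Fraction(comb(5, 2) * comb(5, 1), comb(10, 3))

# (ii) at least 3 of 4 evenings: binomial terms for k = 3 and k = 4
p_at_least_three = sum(
    comb(4, k) * p_two_women**k * (1 - p_two_women) ** (4 - k)
    for k in (3, 4))

print(p_two_women, p_at_least_three)
```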
Some years ago the police did a large survey of the speeds of motorists along a stretch of motorway, timing cars between two bridges. They concluded that their mean speed was 80 mph with standard deviation 10 mph. Recently the police wanted to investigate whether there had been any change in motorists' mean speed. They timed the first 20 green cars between the same two bridges and calculated their speeds (in mph) to be as follows: 85, 75, 80, 102, 78, 96, 124, 70, 68, 92, 84, 69, 73, 78, 86, 92, 108, 78, 80, 84 (i) State an assumption you need to make about the speeds of motorists in the survey for the test to be valid. (ii) State suitable null and alternative hypotheses and use the sample data to carry out a hypothesis test at the 5% significance level. State the conclusion.
i) The speeds follow a normal distribution (and the sample is random) ii) let μ be the population mean of motorists' speeds H₀: μ = 80 H₁: μ ≠ 80 two-tailed test sample mean = 85.1 z = (85.1 - 80)/(10/√20) = 2.28 -using standard normal at the 5% level (2.5% in each tail), critical region: z ≤ -1.96 or z ≥ 1.96 2.28 > 1.96 so the standardised result is in the critical region, so it is significant, reject H₀. There is evidence to suggest the population mean speed is different from 80 mph.
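The test statistic and the two-tailed critical value can be reproduced with Python's standard library. This is a sketch that assumes the standard deviation is still 10 mph and uses the 20 speeds whose mean is 85.1.

```python
from math import sqrt
from statistics import NormalDist, mean

speeds = [85, 75, 80, 102, 78, 96, 124, 70, 68, 92,
          84, 69, 73, 78, 86, 92, 108, 78, 80, 84]
mu0, sigma = 80, 10  # H0 mean and the assumed (unchanged) standard deviation

x_bar = mean(speeds)
z = (x_bar - mu0) / (sigma / sqrt(len(speeds)))

# Two-tailed test at 5%: 2.5% in each tail of the standard normal.
z_crit = NormalDist().inv_cdf(0.975)
reject_h0 = abs(z) > z_crit

print(round(x_bar, 1), round(z, 2), round(z_crit, 2), reject_h0)
```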
100 cars are entered for a road worthiness test which is in two parts, mechanical and electrical. A car passes only if it passes both parts. Half the cars fail the electrical test and 62 pass the mechanical. 15 pass the electrical but fail the mechanical test. Find the probability that: i) a car chosen at random fails on one test only ii) given that it has failed, failed the mechanical test only
i) draw a Venn diagram from the information (M = passes mechanical, E = passes electrical): P(M) = 62/100 P(E) = 50/100 P(E ∩ M') = 15/100 P(E ∩ M) = (50 - 15)/100 = 35/100 P(M ∩ E') = (62 - 35)/100 = 27/100 23/100 outside both circles -add the areas of the circles without the intersection (cars that pass exactly 1 test): 15/100 + 27/100 = 42/100 = 21/50 ii) P(fails mechanical only | fails) = P(E ∩ M' | (M ∩ E)') probability of failing, P((M ∩ E)') = (27 + 15 + 23)/100 = 65/100 P(fails mechanical only ∩ fails) = P(E ∩ M') = 15/100 so the answer is 15/65 = 3/13
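The Venn diagram regions can be filled in by simple subtraction. A sketch with illustrative variable names:

```python
from fractions import Fraction

total = 100
pass_elec = total // 2   # half the cars fail the electrical test
pass_mech = 62
fail_mech_only = 15      # pass electrical but fail mechanical

pass_both = pass_elec - fail_mech_only                             # 35
fail_elec_only = pass_mech - pass_both                             # 27
fail_both = total - pass_both - fail_mech_only - fail_elec_only    # 23

# (i) fails exactly one test
p_one_only = Fraction(fail_mech_only + fail_elec_only, total)

# (ii) failed the mechanical test only, given the car failed overall
failed = total - pass_both                                         # 65
p_mech_only_given_failed = Fraction(fail_mech_only, failed)

print(p_one_only, p_mech_only_given_failed)  # 21/50 3/13
```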
A fair coin is tossed 100 times. i) Use the Normal approximation to the binomial to estimate the probability that exactly half of the tosses result in heads. ii) Also estimate the probability that more than 60 of the tosses result in heads.
i) let X be the number of heads obtained X ~ B(100, 0.5) P(X = 50) np = 50 = μ √(np(1 - p)) = 5 = σ X ~ N(50, 5²) np is 50, not close to 0 or n so appropriate for approximation n is large at 100 and p = 0.5 so appropriate for approximation continuity correction: P(49.5 < X < 50.5) = 0.0797 ii) P(X > 60) continuity correction: P(X > 60.5) = 0.0179 (excluding 60 because not ≥)
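Both estimates can be reproduced with `statistics.NormalDist`. A sketch of the Normal approximation with the continuity corrections used above:

```python
from math import sqrt
from statistics import NormalDist

n, p = 100, 0.5
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))  # N(50, 5²)

# (i) P(X = 50) becomes P(49.5 < X < 50.5)
p_exactly_50 = approx.cdf(50.5) - approx.cdf(49.5)

# (ii) P(X > 60) = P(X >= 61) becomes P(X > 60.5)
p_more_than_60 = 1 - approx.cdf(60.5)

print(round(p_exactly_50, 4), round(p_more_than_60, 4))  # 0.0797 0.0179
```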
Sally and Robin have slightly different conjectures about the probabilities of accidents on different days of the week. Sally's conjecture is that an accident is equally likely to occur on any day of the week. (i) Using Sally's conjecture, write down the probability of an accident occurring on any particular day. Hence write down the expected number of accidents on each day of the week for a total of 92 accidents. (ii) Use a Normal approximation to the distribution B(92, 5/7) to find the probability of a number of weekday accidents at least as great as 72. (iii) Robin uses this probability in a hypothesis test at the 5% significance level that cycling accidents are more common on week days, so the probability of an accident occurring on a week day is greater than 5/7. State the null and alternative hypotheses for this test, and interpret the result.
i) p = 1/7 92 x 1/7 = 13.1 per day expected ii) X ~ B(92, 5/7) P(X ≥ 72) np = 65.71 np(1 - p) = 18.78 n is large, the value of np is not close to 0 or n, and p is not too far from 0.5, so a Normal approximation is appropriate: X ~ N(65.71, 18.78) continuity correction: P(X ≥ 72) becomes P(X > 71.5) = 0.091 iii) let p = probability of an accident occurring on a week day H₀: p = 5/7 H₁: p > 5/7 one-tailed test Significance = 5% -> 0.05 0.091 > 0.05 The probability of obtaining 72 or more weekday accidents is greater than the significance level, so the result is not significant and there is insufficient evidence to reject H₀. There is insufficient evidence to suggest that accidents are more common on weekdays.
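A sketch of parts (ii) and (iii): the Normal approximation to B(92, 5/7) gives the tail probability, which is then compared with the 5% significance level.

```python
from math import sqrt
from statistics import NormalDist

n, p0 = 92, 5 / 7
approx = NormalDist(mu=n * p0, sigma=sqrt(n * p0 * (1 - p0)))

# P(X >= 72) with continuity correction: P(X > 71.5)
p_value = 1 - approx.cdf(71.5)

alpha = 0.05
significant = p_value < alpha  # reject H0 only if this is True

print(round(p_value, 3), significant)
```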
Why is a larger sample size preferable for hypothesis tests
It gives a greater probability of rejecting H₀ when it should be rejected
The heights of the tides in a harbour have been recorded over many years and found to be Normally distributed with mean 2.512 fathoms above a mark on the harbour wall and standard deviation 1.201 fathoms. A change is made so that the heights are now recorded in metres above a different datum level, 0.755 metres lower than the mark on the harbour wall. Given that 1 fathom is 1.829 metres, describe the distribution of the heights of the tides as now measured.
let H be the height in fathoms let Y be the height in m above the new level H ~ N(2.512, 1.201²) Y = 1.829H + 0.755 mean: 2.512 x 1.829 = 4.594 m, then shift for the lower datum: 4.594 + 0.755 = 5.349 m = new mean S.D: 1.201 x 1.829 = 2.197 m; adding 0.755 does not change the S.D because a shift leaves the spread the same new distribution: Y ~ N(5.349, 2.197²)
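The change of units is a linear transformation Y = aH + b, so the mean becomes a × mean + b and the standard deviation becomes a × S.D. A sketch of the arithmetic (the constant names are illustrative):

```python
FATHOM_TO_M = 1.829   # scale factor a
DATUM_SHIFT_M = 0.755 # new datum is this much lower, so heights increase by b

mean_fathoms, sd_fathoms = 2.512, 1.201

mean_m = mean_fathoms * FATHOM_TO_M + DATUM_SHIFT_M  # scaling then shift
sd_m = sd_fathoms * FATHOM_TO_M                      # the shift does not affect spread

print(round(mean_m, 3), round(sd_m, 3))  # 5.349 2.197
```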
A bag contains blue discs and red discs. There are 15 blue discs and an unknown number of red discs. There are more red discs than there are blue discs. A disc is taken at random from the bag and not replaced. A second disc is then taken at random from the bag. Calculate the probability that 2 blue discs are taken, given that two discs of the same colour are taken.
let there be n reds P(BB) = 15/(15 + n) x 14/(14 + n) = 210/((15+n)(14+n)) P(same colour) = P(BB) + P(RR) P(RR) = n/(15 + n) x (n-1)/(14 + n) = n(n-1)/((15+n)(14+n)) P(same colour) = (n² - n + 210)/((15+n)(14+n)) -setting this equal to 1/2 (the value of P(same colour) used in this working) gives n² - 31n + 210 = 0, so (n - 21)(n - 10) = 0 n = 21 (cannot be 10 because there are more red than blue) P(BB) = 210/((36)(35)) = 1/6 P(RR) = 21(20)/((36)(35)) = 1/3 P(same colour) = 1/6 + 1/3 = 1/2 P(BB | same colour) = (1/6) / (1/2) = 1/3
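A sketch that searches for the number of red discs and then computes the conditional probability. It assumes P(same colour) = 1/2, which is the value the working relies on; the function and variable names are illustrative.

```python
from fractions import Fraction

BLUE = 15

def same_colour_parts(n_red):
    """Return (P(BB), P(RR)) for two discs drawn without replacement."""
    total = BLUE + n_red
    ways = total * (total - 1)
    return Fraction(BLUE * (BLUE - 1), ways), Fraction(n_red * (n_red - 1), ways)

# Assumption: P(same colour) = 1/2 (taken from the working, not derived here).
candidates = [n for n in range(1, 200)
              if sum(same_colour_parts(n)) == Fraction(1, 2)]

# There are more red discs than blue discs, so keep the root above 15.
n_red = next(n for n in candidates if n > BLUE)

p_bb, p_rr = same_colour_parts(n_red)
p_bb_given_same = p_bb / (p_bb + p_rr)

print(candidates, n_red, p_bb_given_same)  # [10, 21] 21 1/3
```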
A machine is designed to make paperclips with mean mass 4.00 g and standard deviation 0.08 g. The distribution of the masses of the paperclips is Normal. A quality control officer weighs a random sample of 25 paperclips and finds their total mass to be 101.2 g. Conduct a hypothesis test at the 5% significance level to find out whether this provides evidence of an increase in the mean mass of the paperclips (using probability instead of critical ratios)
let μ be the population mean of paperclip masses H₀: μ = 4.00 g H₁: μ > 4.00 g one-tailed test Significance = 5% -> 0.05 x̄ ~ N(4, 0.08²/25) 101.2 / 25 = 4.048 P(x̄ > 4.048) = 0.00135 0.00135 < 0.05 The probability of obtaining a mean paperclip mass of 4.048 g or more is less than the significance level. Therefore, we should reject H₀, the result is significant. There is evidence to suggest an increase in mean mass.
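The probability method can be sketched with `statistics.NormalDist`, using the distribution of the sample mean under H₀:

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 4.00, 0.08, 25
x_bar = 101.2 / n  # observed sample mean, 4.048 g

# Under H0 the sample mean is distributed N(mu0, sigma²/n).
sampling_dist = NormalDist(mu=mu0, sigma=sigma / sqrt(n))
p_value = 1 - sampling_dist.cdf(x_bar)  # one-tailed (upper tail)

alpha = 0.05
reject_h0 = p_value < alpha

print(round(p_value, 5), reject_h0)  # 0.00135 True
```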
ways of solving hypothesis testing questions
probability: x̄ ~ N(μ, σ²/n) P(x̄ ...) = compare with significance critical ratios: -calculate z -find critical values and region using standard normal -compare with z value may have to calculate S.D and x̄ given info. x̄ = Σx/n -state 'significant' if probability is less than significance level, or value is in critical region, and reject H₀
Root mean squared deviations
sqrt(Sxx/n) -measure of how much a typical value might be above or below the mean
difference between stratified and quota
stratified: number selected from each group is proportional to how large the group is compared to the population quota: number selected from each group is (usually) pre-determined (eg: 50 boys, 50 girls)
calculating z of sample means distribution
z = (x̄ - μ) / (σ/√n) (observed value - expected value)/standard deviation
φ(z) and Φ(z) + equation for normal curve
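φ(z): the standard normal curve (probability density function) Φ(z): the cumulative distribution function of the standard normal, Φ(z) = P(Z ≤ z) general normal curve, given in the formula sheet: f(x) = 1/(σ√(2π)) × e^(-½((x - μ)/σ)²) -substituting z = (x - μ)/σ with μ = 0 and σ = 1 gives φ(z) = 1/√(2π) × e^(-½z²)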
root mean square deviations
√(Σ(x - x̄)²/n) Σ(x - x̄)² can be replaced with Σx² - nx̄²
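A quick numerical check that the two forms of Σ(x - x̄)² agree. The data set here is made up purely for illustration.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, not from the notes
n = len(data)
x_bar = sum(data) / n

sxx_definition = sum((x - x_bar) ** 2 for x in data)    # Σ(x - x̄)²
sxx_shortcut = sum(x * x for x in data) - n * x_bar**2  # Σx² - n x̄²

rmsd = sqrt(sxx_definition / n)

print(sxx_definition, sxx_shortcut, rmsd)  # 32.0 32.0 2.0
```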