Prob/Stats and ML Interview Questions

How much does gradient descent change the weights of our model per iteration?

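Each weight changes by (learning rate) * (gradient of the loss with respect to that weight), moving in the direction opposite the gradient: w_new = w - learning_rate * dL/dw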
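A minimal sketch of the update rule, assuming a toy loss L(w) = w^2 (so the gradient is 2w). The function name `gradient_step` and the toy loss are illustrative choices, not part of the original card.

```python
def gradient_step(weights, grads, learning_rate):
    """Return the weights after one gradient descent step."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]

# Toy example (assumed): loss L(w) = w^2, whose gradient is 2w.
weights = [4.0]
learning_rate = 0.1
for _ in range(3):
    grads = [2 * w for w in weights]
    weights = gradient_step(weights, grads, learning_rate)
```

Each step here multiplies the weight by (1 - 2 * learning_rate) = 0.8, so a larger learning rate means a larger change per iteration.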

nCr equation WITH replacement

(n + r - 1)! / ( (n-1)! * r! ) note: this is the w/o replacement formula with n replaced by (n + r - 1), i.e. (n+r-1)Cr

nCr equation WITHOUT replacement

(n!) / ( (n-r)! * r! ) note: similar to # of perms w/o replacement, but has a greater denominator since order doesn't matter for combos

Let's say we use people to rate ads. There are two types of raters. Random and independent from our point of view: 80% of raters are careful and they rate an ad as good (60% chance) or bad (40% chance). 20% of raters are lazy and they rate every ad as good (100% chance). 1. Suppose a rater rates just three ads, and rates them all as good. What's probability the rater was lazy? 2. Suppose a rater sees N ads and rates all of them as good. What happens to the probability the rater was lazy as N tends to infinity? 3. Suppose we want to exclude lazy raters. Can you come up with a rule for classifying raters as careful or lazy?

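1) Bayes' theorem: P(lazy | 3 good) = (1.0)(0.2) / ( (1.0)(0.2) + (0.6^3)(0.8) ) = 0.2 / 0.3728 = 53.6%
2) Tends toward 100%, because the careful-rater term (0.6^N)(0.8) in the denominator gets lower and lower as N grows.
3) One rule: classify a rater as lazy once P(lazy | all good) exceeds 95%. That first happens after 9 ads in a row rated good, when the probability of lazy is 96%. A single "bad" rating proves the rater is careful.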
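The posterior above can be checked for any number of ads with a short sketch. The function names are illustrative, and the 95% cut-off is the assumed threshold from part 3.

```python
def p_lazy_given_all_good(n_ads, p_lazy=0.2, p_good_if_careful=0.6):
    """Posterior probability that a rater is lazy after n_ads 'good' ratings in a row."""
    lazy_term = p_lazy * 1.0  # lazy raters always say 'good'
    careful_term = (1 - p_lazy) * p_good_if_careful ** n_ads
    return lazy_term / (lazy_term + careful_term)

def min_ads_for_threshold(threshold=0.95):
    """Smallest run of 'good' ratings that pushes the posterior to the threshold."""
    n = 0
    while p_lazy_given_all_good(n) < threshold:
        n += 1
    return n
```

With the numbers from the question, `p_lazy_given_all_good(3)` is about 0.536 and `min_ads_for_threshold()` returns 9.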

Amy and Brad take turns in rolling a fair six-sided die. Whoever rolls a "6" first wins the game. Amy starts by rolling first. What's the probability that Amy wins?

1st round of rolls: P(Amy wins on her first roll) = 1⁄6. P(Brad wins on his first roll) = P(Amy misses) x 1⁄6 = 5⁄6 x 1⁄6
2nd round of rolls: P(Amy wins on her second roll) = P(Amy misses first) x P(Brad misses first) x 1⁄6 = 5⁄6 x 5⁄6 x 1⁄6. P(Brad wins on his second roll) = 5⁄6 x 5⁄6 x 5⁄6 x 1⁄6
Since we want the probability of Amy winning, we can write out the geometric series: 1⁄6 + (5⁄6)^2 x 1⁄6 + (5⁄6)^4 x 1⁄6 + ...
The sum of an infinite geometric series is a/(1-r). Here a = 1⁄6 and r = (5⁄6)^2 = 25⁄36, so: (1⁄6) / (1 - 25⁄36) = 6⁄11
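As a quick check of the geometric series, here is a sketch using exact fractions. The function names are illustrative; `partial_sum` adds up the first few terms so the convergence toward the closed form is visible.

```python
from fractions import Fraction

def first_player_win_probability(p=Fraction(1, 6)):
    """Closed form a / (1 - r) with a = p and r = (1 - p)^2."""
    q = 1 - p
    return p / (1 - q * q)

def partial_sum(n_terms, p=Fraction(1, 6)):
    """Sum of the first n_terms of p + q^2 p + q^4 p + ..."""
    q = 1 - p
    return sum(q ** (2 * k) * p for k in range(n_terms))
```

`first_player_win_probability()` evaluates to 6/11, and `partial_sum(n)` approaches it from below as n grows.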

What is a confidence interval? What does it mean if we have a 0.95 confidence interval?

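A confidence interval is a range of plausible values for an unknown parameter, computed from the sample. A 0.95 confidence interval comes from a procedure that, over repeated sampling, produces intervals containing the true value 95% of the time. It is NOT the probability that the true value lies in this one computed interval. Higher confidence level = wider interval.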

What is the difference between a box plot and a histogram?

A histogram shows the distribution of all values over the entire range. A box plot shows the min, max and the 25th, 50th and 75th percentiles of the data (lower quartile, median, upper quartile)

You randomly draw a coin from 100 coins where there is 1 unfair coin (head-head) and 99 fair coins (head-tail), and toss it 10 times. If the result is 10 heads, what's the probability that the coin is unfair?

Applying Bayes Theorem and law of total probability: P(unfair | 10 heads) = P(10 heads | unfair) P(unfair) / ( P(10 heads | unfair) P(unfair) + P(10 heads | fair) P(fair) ) = (1)(0.01) / ( (1)(0.01) + (0.5^10)(0.99) ) = 0.912

Given n samples from a uniform distribution[0,d] (from 0 to d), how would you estimate d?

Two estimators:
1) The maximum observed value. With many samples it is close to d, but it always slightly underestimates d; (n+1)/n * max corrects that bias.
2) The expected value of the distribution is d/2, so we could also estimate d by taking 2*mean of our samples.
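A small simulation sketch comparing the estimators, assuming a true d of 10 and 1000 samples (both values, the seed, and the function name are arbitrary choices for illustration).

```python
import random

def estimate_d(samples):
    """Return several estimates of d for samples drawn from Uniform[0, d]."""
    n = len(samples)
    largest = max(samples)
    return {
        "max": largest,                         # never exceeds d, so biased low
        "unbiased_max": (n + 1) / n * largest,  # bias-corrected maximum
        "twice_mean": 2 * sum(samples) / n,     # method of moments
    }

rng = random.Random(42)   # fixed seed so the run is repeatable
d_true = 10.0             # assumed value for the demo
samples = [rng.uniform(0, d_true) for _ in range(1000)]
estimates = estimate_d(samples)
```

In runs like this the maximum-based estimates tend to land much closer to d than 2*mean, because the maximum's error shrinks like 1/n while the mean's error shrinks like 1/sqrt(n).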

Consider a game with 2 players, A and B. Player A has 8 stones, player B has 6. Game proceeds as follows. First, A rolls a fair 6-sided die, and the number on the die determines how many stones A takes over from B. Next, B rolls the same die, and the exact same thing happens in reverse. This concludes the round. Whoever has more stones at the end of the round wins and the game is over. If players end up with equal # of stones at the end of the round, it is a tie and another round ensues. What is the probability that B wins in 1, 2, ..., n rounds?

Round 1: A starts with 8 and B with 6. Let x be A's roll and y be B's roll. After the round A has 8+x-y and B has 6-x+y, so B wins when y - x >= 2, the round is tied when y - x = 1, and A wins otherwise. Out of 36 outcomes: P(B wins) = 10/36, P(A wins) = 21/36, P(tie) = 5/36.
After a tie both players hold 7 stones, so A has 7+x-y and B has 7-x+y. Now P(B wins) = 15/36, P(A wins) = 15/36 and P(tie) = 6/36 = 1/6. Every later round also starts from 7 and 7, so these probabilities do not change.
P(B wins in round 1) = 10/36
P(B wins in round 2) = (5/36)(15/36)
P(B wins in round r), for r >= 2 = (5/36) * (6/36)^(r-2) * (15/36)
Total probability that B wins within n rounds = 10/36 + sum over r = 2..n of (5/36) * (6/36)^(r-2) * (15/36), which tends to 10/36 + (5/36)(15/36) / (1 - 1/6) = 25/72 as n grows.
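The per-round counts can be verified by enumerating all 36 dice outcomes. This is a sketch with illustrative function names; it assumes the rules exactly as stated in the question.

```python
from fractions import Fraction
from itertools import product

def round_outcomes(a_stones, b_stones):
    """Return (P(A wins), P(B wins), P(tie)) for one round from the given stone counts."""
    a_wins = b_wins = ties = 0
    for x, y in product(range(1, 7), repeat=2):  # x = A's roll, y = B's roll
        a_after = a_stones + x - y
        b_after = b_stones - x + y
        if a_after > b_after:
            a_wins += 1
        elif b_after > a_after:
            b_wins += 1
        else:
            ties += 1
    return Fraction(a_wins, 36), Fraction(b_wins, 36), Fraction(ties, 36)

def p_b_wins_in_round(r):
    """Probability that B wins exactly in round r (r >= 1)."""
    _, b_first, tie_first = round_outcomes(8, 6)
    if r == 1:
        return b_first
    _, b_later, tie_later = round_outcomes(7, 7)
    return tie_first * tie_later ** (r - 2) * b_later
```

Summing `p_b_wins_in_round(r)` over many rounds approaches 25/72, roughly 0.347.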

How do you find an anomaly in a distribution? How do you detect if a new observation is outlier?

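Compare the point with the spread of the data. Common rules: flag it as an outlier if its z-score is more than about 3 standard deviations from the mean (outside roughly the central 99.7% range for normal data), or if it lies more than 1.5 * IQR beyond the quartiles (the box plot rule).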

What is a p-value? Would your interpretation of p-value change if you had a different (much bigger, 3 mil records for ex.) data set?

Def. of p-value: probability of seeing results at least as extreme as those observed if the null hypothesis is true (e.g. null hypothesis: the samples are from the same population). The definition does not change with sample size, but the interpretation does: stderror = std/sqrt(n), so with very large samples even a tiny, practically meaningless difference gives a very small p-value. It is best to measure differences in large samples using 'Effect Size' because this uses standard deviation instead of standard error.

You are about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?

Different approach here. Since we don't know the probability of rain, we can use the probability of consensus, which is the probability that all three agree on what to say: P(consensus) = P(all lie) + P(all truth)
Since we know that they all agreed, we can solve for: P(all truth) / P(consensus)
P(all truth) = 2/3 x 2/3 x 2/3 = 8/27
P(all lie) = 1/3 x 1/3 x 1/3 = 1/27
Then the conditional probability is: (8/27) / (1/27 + 8/27) = 8/9
Note: this is the same as applying Bayes' theorem with an assumed 50% prior probability of rain.

Three ants are sitting at the three corners of an equilateral triangle. Each ant randomly picks a direction and starts to move along the edge of the triangle. What is the probability that none of the ants collide?

Each ant has two directions to choose from, so there are 2^3 = 8 equally likely outcomes. The ants avoid a collision only if they all go in the same direction (all clockwise or all counterclockwise), which can happen in 2 ways, so the probability is 2/8 = 1/4.

Let's say that there are six people trying to divide up into two equally separate teams. Because they want to pick random teams, on each round, each person shows either a face up or face down hand and if there are three of each, then they'll split into teams. What's the expected number of rounds that everyone will have to pick a hand side before they split into teams?

Each person is essentially a Bernoulli trial, coming up 1 or 0 (face up or face down) with equal probability. To calculate the probability of exactly 3 people showing face up (which implies 3 people face down) we use the binomial PMF: nCk * p^k * (1-p)^(n-k).
Filling in the numbers for our problem: 6C3 * .5^3 * (1-.5)^(6-3) = 6!/(3! * 3!) * .5^6 = 20/(2^6) = 20/64 = 0.3125
So in any given round, there's a 31.25% probability of forming teams. The number of rounds needed is geometric, so we divide 1 by the per-round probability to get the expected number of rounds: 1/0.3125 = 3.2 rounds
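The same arithmetic as a short sketch, generalized to any even group size. The function names are illustrative.

```python
from math import comb

def p_even_split(n_people=6):
    """Probability that exactly half of n_people show 'face up' in one round."""
    return comb(n_people, n_people // 2) / 2 ** n_people

def expected_rounds(n_people=6):
    """Mean of a geometric distribution with per-round success probability p."""
    return 1 / p_even_split(n_people)
```

For six people this gives a per-round probability of 0.3125 and an expected 3.2 rounds.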

There are four people on the ground floor of a building that has five levels not including the ground floor. They all get into the same elevator. If each person is equally likely to get on any floor and they leave independently of each other, what is the probability that no two passengers will get off at the same floor?

Each passenger can choose any of the five floors to get off at. The total possibilities are: 5*5*5*5 = 625. The possible ways for every passenger to get off on a different floor are: 5*4*3*2 = 120. Probability that no two passengers get off at the same floor: 120 / 625 = 0.192
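A sketch of the same count for any number of passengers and floors, using `math.perm` (Python 3.8+) for the number of ordered choices without repeats. The function name is illustrative.

```python
from math import perm

def p_all_different_floors(passengers, floors):
    """Probability that independent, uniform floor choices are all distinct."""
    # perm(floors, passengers) is 0 when passengers > floors, as it should be.
    return perm(floors, passengers) / floors ** passengers
```

`p_all_different_floors(4, 5)` reproduces 120/625. The birthday problem has the same structure, with 365 "floors".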

We have two options for serving ads within Newsfeed: 1 - out of every 25 stories, one will be an ad. 2 - every story has a 4% chance of being an ad. For each option, what is the expected number of ads shown in 100 news stories? If we go with option 2, what is the chance a user will be shown only a single ad in 100 stories?

For both options, the expected value is 4 ads. Option 1 always shows exactly 4 ads. Option 2 is binomial(p=0.04, n=100): E = n*p = 4
P(x = k out of 100) = nCk * p^k * (1-p)^(n-k)
P(1 out of 100) = 100 * (0.04) * (0.96)^99 = 0.07
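A minimal binomial PMF sketch for checking numbers like these. `binom_pmf` is an illustrative helper built only on `math.comb`.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p_single_ad = binom_pmf(1, 100, 0.04)                              # option 2, exactly one ad
expected_ads = sum(k * binom_pmf(k, 100, 0.04) for k in range(101))
```

The same helper covers the other binomial cards in this set, for example `binom_pmf(5, 6, 0.3)` for five heads in six tosses of a 30% coin.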

What is the difference between Bayesian vs frequentist statistics?

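Frequentist stats treats probability as long-run frequency: parameters are fixed but unknown, and conclusions (p-values, confidence intervals) describe how a procedure behaves over repeated samples. Bayesian stats treats probability as degree of belief: parameters get a prior distribution, which is updated with the observed data via Bayes' theorem to give a posterior distribution.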

What is the expected value and variance of: geometric distribution

Geometric dist: E(x) = 1/p Var(x) = (1-p) / p^2 Questions usually go: How many times do we expect to roll a die before we see X?

Imagine a deck of 500 cards numbered from 1 to 500. If all the cards are shuffled randomly and you are asked to pick three cards, one at a time, what's the probability of each subsequent card being larger than the previous drawn card?

Ignore the size of the population. It only matters which is the low card, middle card and high card (1,2,3). There are only 6 possible orders to draw them in, all equally likely: (3,2,1) (3,1,2) (2,1,3) (2,3,1) (1,3,2) (1,2,3). So the probability that you always draw a higher card on the 2nd and 3rd draw is 1/6.

Given an unfair coin with the probability of heads and tails not equal to 50/50, what algorithm could generate a list of random ones and zeros?

Use von Neumann's trick. Flip the coin twice. If the result is HT, output 1; if it is TH, output 0; if it is HH or TT, discard both flips and repeat. HT and TH both have probability p(1-p), so the distribution of 1s and 0s is exactly 50⁄50 as long as the flips are independent.
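One standard way to get unbiased bits from a biased coin is von Neumann's pairing trick: flip twice, keep the first flip if the two differ, and retry if they match. A sketch, assuming independent flips and a hypothetical 80% heads coin simulated with a seeded generator:

```python
import random

def biased_flip(p_heads, rng):
    """Simulate one flip of the unfair coin: 1 for heads, 0 for tails."""
    return 1 if rng.random() < p_heads else 0

def fair_bit(p_heads, rng):
    """Von Neumann extractor: HT -> 1, TH -> 0, HH or TT -> flip again."""
    while True:
        first = biased_flip(p_heads, rng)
        second = biased_flip(p_heads, rng)
        if first != second:
            return first

rng = random.Random(0)   # fixed seed so the run is repeatable
bits = [fair_bit(0.8, rng) for _ in range(20000)]
share_of_ones = sum(bits) / len(bits)
```

The cost is wasted flips: a pair is kept with probability 2p(1-p), so a heavily biased coin needs many flips per output bit.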

How do you calculate variance in an unsupervised model?

Let's assume that our unsupervised model is a clustering algorithm and we segmented our data into k clusters. Then, we can define the variance as a ratio: variance = between-cluster variation / within-cluster variation

What is the expected value and variance of: binomial distribution

Note: stdev = sqrt(variance) Binomial dist: E(x) = n*p Var(x) = n*p*(1-p) Questions usually go: If we toss a coin 20 times, how often do we expect to see heads?

What is the equation for Bayes Theorem?

P(A | B) = ( P(B | A) * P(A) ) / P(B)

Alice and Bob take turns in rolling a fair die. Whoever gets "6" first wins the game. Alice starts the game. What are the chances that Alice wins?

P(Alice wins) = P(Alice wins in round 1) + P(Alice wins in round 2) + P(Alice wins in round 3) + P(Alice wins in round 4) + ... = 1/6 + (5/6)^2 (1/6) + (5/6)^4 (1/6) + ... = 1/6 ( 1 + (5/6)^2 + (5/6)^4 + ... ) = (1/6) / (1 - (5/6)^2) = 6/11

A jar holds 1000 coins. Out of all of the coins, 999 are fair and one is double-sided with two heads. Picking a coin at random, you toss the coin ten times. Given that you see 10 heads, what is the probability that the coin is double headed and the probability that the next toss of the coin is also a head? Give your answer to 3 significant figures.

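P(DH | 10H) = P(10H | DH) * P(DH) / P(10H)
P(10H) = P(10H | DH) * P(DH) + P(10H | NOT DH) * P(NOT DH)
NUM = 1 * 0.001 = 0.001
DEN = 0.001 + (0.5^10) * 0.999 = 0.001975586
P(DH | 10H) = NUM/DEN = 0.506
P(next toss is H | 10H) = P(DH | 10H) * 1 + (1 - P(DH | 10H)) * 0.5 = 0.506 + 0.494 * 0.5 = 0.753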
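Both parts of the question can be worked with exact fractions. This sketch uses illustrative function names and takes the jar size as a parameter, so it also covers the 100-coin version of the card.

```python
from fractions import Fraction

def posterior_double_headed(n_coins, n_heads):
    """P(coin is double-headed | n_heads heads in a row), one trick coin in the jar."""
    prior = Fraction(1, n_coins)
    likelihood_fair = Fraction(1, 2) ** n_heads
    return prior / (prior + (1 - prior) * likelihood_fair)

def p_next_toss_heads(n_coins, n_heads):
    """Posterior-weighted chance of heads on the next toss."""
    post = posterior_double_headed(n_coins, n_heads)
    return post * 1 + (1 - post) * Fraction(1, 2)
```

For 1000 coins and 10 heads the posterior is 1024/2023 (about 0.506) and the next-toss probability is about 0.753; for 100 coins the posterior is 1024/1123 (about 0.912).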

What is recall? (binary classification)

Percent of real positives that are predicted positive. Recall = TP / (TP+FN). Recall is the same thing as sensitivity (true positive rate).

How would you test if survey responses were filled at random by certain individuals, as opposed to truthful selections?

Perform a goodness of fit test to see if the responses are uniformly distributed. See if the responses of an individual are statistical outliers compared to the rest of the respondents.

You have 2 dice. What is the probability of getting at least one 4? Same question with n dice.

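Prob of not getting a 4 on one die is 5/6. Prob of no 4 on 2 dice = (5/6)^2, so prob of at least one 4 = 1 - (5/6)^2 = 11/36. Prob of not getting a 4 for n dice = (5/6)^n. Therefore, prob of getting at least one 4 for n dice = 1 - (5/6)^n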

What is the difference between likelihood and probability?

Probability = area under an interval of a distribution curve.
Probability: What is the probability of a measurement, given the distribution? Example: what is the probability a mouse weighs between 32 and 34 grams, given the normal distribution of weight?
Likelihood: What is the likelihood of the distribution, given our measurement? Example: what is the likelihood that this distribution is normal(mean=30, std=2), given our mouse weight measurements?

Let's say we're given a biased coin that comes up heads 30% of the time when tossed. What is the probability of the coin landing as heads exactly 5 times out of 6 tosses?

Probability can be modeled using a binomial distribution. P(H) = 0.3 P(T) = 0.7 P(5 Heads in 6 Tosses) = 6C5*P(H)^5*P(T) = 6*(0.3^5)*0.7 = 0.0102

Amazon has a warehouse system where items on the website are located at different distribution centers across a city. Let's say in one example city, the probability that a specific item X at location A is 0.6, and at location B the probability is 0.8. Given you're a customer in this example city and the items are only found on the website if they exist in the distribution centers, what is the probability that the item X would be found on Amazon's website?

Probability of the item being present= 1- p(item NOT in A AND NOT in B) = 1-(0.4*0.2)=0.92

What is a Q-Q plot?

Quantile-quantile plot. Take the values of two samples at given percentiles. Quantiles of one sample go on the x axis, quantiles of the other sample on the y axis. If the samples have the same distribution, the points fall on a straight line with slope == 1. Note: taking quantiles allows us to compare a sample with fewer obs to a sample with many more obs.

What is an ROC curve? What does area under the curve mean?

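Receiver operating characteristic curve, used in evaluating a binary classification model. Y axis is true positive rate (sensitivity), x axis is false positive rate (1 - specificity). Each point is calculated at a different classification threshold. Area under the curve (AUC) = probability that the model scores a random positive above a random negative: 0.5 is random guessing, 1.0 is perfect separation.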

What are examples of normalization in NLP?

Removing punctuation
Removing stopwords (a, the, of, for)
Stemming: reducing different forms of a word down to the stem (ex: running, runs = run)
Lemmatization: mapping irregular forms to the dictionary form (ex: ran = run)
Separation by parts of speech: verb, noun, adjective, conjunction

What are the major differences between L1 and L2 regularization?

Ridge Regression (L2) can only shrink coefficients (slopes) toward 0, never exactly to 0. LASSO Regression (L1) can shrink coefficients all the way to 0. LASSO is generally better at removing useless variables from models with many variables. Ridge Reg. tends to be better at reducing model variance when most variables are useful.

What is L2 regularization?

Ridge Regression: adds a small amount of bias to the regression model, to reduce the variance (prevents overfitting). Ridge loss = SSR + lambda * sum( slope^2 ), where lambda * sum( slope^2 ) is the penalty. Slope = how much prediction changes for a unit change in input. L2 regularization makes the model less malleable (less sensitive to the training data). SSR = sum of squared residuals

What is TF-IDF in NLP?

Term frequency: how many times the term is in the document divided by total number of words in the document. Inverse document frequency: total number of docs divided by number of docs that contain the term, usually log-scaled. Measures rarity of word (rarer == more informative). TF-IDF = TF * IDF

What does a t-test do?

Tests the null hypothesis that two samples come from populations with the same mean.

There are 25 horses. You can race any 5 of them at once, and all you get is the order they finished. How many races would you need to find the 3 fastest horses?

The minimum number of races is 7.
Step 1: Run 5 races of 5 horses each and note the finishing order within each group.
Step 2: Race the 5 group winners against each other. The winner is the fastest horse overall. Label the groups A to E by their winner's finish in this race, so A1 is the fastest horse.
Step 3: For the final race we can drop groups D and E, since D1 and E1 finished 4th and 5th among the winners and so each has at least three faster horses. The only remaining candidates are A2, A3, B1, B2 and C1. Race them; the top two are the 2nd and 3rd fastest horses.

What is the definition of the variance?

Variance is the expectation of the squared deviation of a random variable from its mean. To calculate it: 1. Calculate the mean of the data set. 2. Square the difference between the mean and each observation and sum them up. 3. Divide the sum by the number of observations for population variance, or by one less than the number of observations for sample variance.

You're at a casino with two dice, if you roll a 5 you win, and get paid $10. What is your expected payout? If you play until you win (however long that takes) then stop, what is your expected payout?

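We can get a sum of 5 in 4 ways (1+4, 2+3, 3+2, 4+1). So the probability of success is P(S) = 4/36 = 1/9, and the expected payout of a single roll is $10 * 1/9 = $1.11.
The expected number of trials to the first success is 1/P(S) = 1/(1/9) = 9, so on average there are 8 losses followed by a win. If playing is free, you stop with exactly $10. If instead we assume each roll costs $5 (the question does not state a cost), the expected net payout is 9 * (-$5) + $10 = -$35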

You're playing casino dice game. You roll a die once. If you reroll, you earn the amount equal to the number on your second roll otherwise, you earn the amount equal to the number on your first roll. Assuming you adopt a profit-maximizing strategy, what would be the expected amount of money you would win?

The expected value of rolling a single die is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5, so you should reroll whenever the first roll is below 3.5.
If you roll a 1, 2 or 3 on the first die, roll the second die: the expected value is 3.5.
If you roll a 4, 5 or 6 on the first die, do not roll the second die: the expected value is (4 + 5 + 6)/3 = 5.
Each case has probability 1/2, so the overall expected monetary value comes to (3.5 + 5)/2 = 4.25
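The optimal strategy amounts to taking, for each first roll, the larger of the face value and the expected value of a reroll. A sketch with exact fractions (variable names are illustrative):

```python
from fractions import Fraction

faces = range(1, 7)
reroll_value = Fraction(sum(faces), 6)   # expected value of one fresh roll: 7/2

# For each possible first roll, keep it or reroll, whichever is worth more.
game_value = sum(max(Fraction(face), reroll_value) for face in faces) / 6
```

`game_value` comes out to 17/4, i.e. 4.25, matching the case analysis above.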

What is an ANOVA? Does ANOVA assume normality?

Analysis of variance: used when you want to compare the means of 3+ groups, rather than 2 (with a t-test).
Core idea: How much of the total variance in all your samples is occurring WITHIN groups VS BETWEEN groups?
F = ratio of (between group variance)/(within group variance). Larger F means the group means are more likely to differ.
b = degrees of freedom between groups = (number of groups - 1)
w = DoF within groups = (total number of observations - number of groups)
Yes, ANOVA assumes the values within each group are approximately normal with similar variances.

Imagine you have N spaghettis in a bowl. You reach in and grab one end-piece, then reach in and grab another end-piece, and tie those two together. What is the expected value of the number of spaghetti loops in the bowl?

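You have 2N ends. After selecting the first end there are 2 possible outcomes:
1- You select the other end of the same spaghetti and form a loop. The probability of this event is 1/(2N-1).
2- You select an end of a different spaghetti, which joins the two into one longer strand. The probability of this event is (2N-2)/(2N-1).
Either way N-1 strands remain, so we get a recurrence formula: E(N) = 1/(2N-1) + E(N-1), with base case E(1) = 1 (the last strand's two ends must be tied together).
Unrolling it: E(N) = 1 + 1/3 + 1/5 + ... + 1/(2N-1)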
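Unrolling the recurrence E(N) = 1/(2N-1) + E(N-1) with E(1) = 1 gives a sum of odd reciprocals, which is easy to evaluate exactly. The function name is illustrative.

```python
from fractions import Fraction

def expected_loops(n_strands):
    """E(N) = 1 + 1/3 + 1/5 + ... + 1/(2N - 1)."""
    return sum(Fraction(1, 2 * k - 1) for k in range(1, n_strands + 1))
```

The sum grows very slowly (roughly like half the natural log of N), so even a large bowl yields only a handful of loops on average.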

Two random cards numbered from 1,2...100 are pulled from the deck. What is the probability that one number doubles the other from the deck?

The favorable pairs are (1;2), (2;4) ... (50;100), 50 in total, out of 100_C_2 = 4950 possible pairs (order ignored). 50/4950 = 1/99 = 0.0101
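A brute-force count is a useful check here, since it is easy to mix ordered and unordered pairs. This sketch enumerates every unordered pair of distinct cards from 1 to 100:

```python
from fractions import Fraction
from itertools import combinations

# combinations() yields each unordered pair once, with low < high.
pairs = list(combinations(range(1, 101), 2))
favorable = sum(1 for low, high in pairs if high == 2 * low)
probability = Fraction(favorable, len(pairs))
```

There are 50 favorable pairs out of 4950, so the probability is 1/99, about 0.0101. Counting ordered pairs instead gives 100 out of 9900, which is the same ratio.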

What goes into the calculation of test duration? (for example, if you want to choose how long to A/B test a new feature)

You primarily want to consider two things: 1) How many consumers you can test in a given time period. 2) How many total subjects do you need to have for statistical power. (based on power analysis)

How do you draw a uniform random sample from a circle in polar coordinates?

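A point in polar coordinates is determined by (r, theta). For a point on the circle itself (the circumference) the radius r is constant, so we draw theta from a uniform random dist on [0, 2*pi]. For a point inside the circle (the disk) of radius R, also draw u uniform on [0, 1] and set r = R * sqrt(u); drawing r uniformly would crowd points near the center. We can convert to x,y cartesian coordinates by x = r * cos(theta), y = r * sin(theta).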
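If the goal is points spread uniformly over the area of the disk rather than along its circumference, the radius needs a square-root transform, because area grows with r^2. A sketch under that assumption (function name, seed, and sample size are arbitrary):

```python
import math
import random

def sample_disk(radius, rng):
    """Draw one point uniformly from the disk of the given radius, via polar coordinates."""
    theta = rng.uniform(0, 2 * math.pi)
    r = radius * math.sqrt(rng.random())   # sqrt makes the density uniform in area
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(7)   # fixed seed so the run is repeatable
points = [sample_disk(1.0, rng) for _ in range(20000)]
inner_share = sum(1 for x, y in points if math.hypot(x, y) <= 0.5) / len(points)
```

A uniform sample should put about a quarter of the points inside half the radius, since that inner disk has a quarter of the area; `inner_share` checks this.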

nPr equation WITHOUT replacement

(n!) / (n-r)!

What percentage of normal distribution lies within 1 std of mean? 2, 3 std?

68%, 95%, 99.7%

A and B toss a die, whoever gets 6 first wins. Given A starts first, what's the probability that A can win?

This is a geometric series which is an infinite G.P. with geometric ratio = (5/6)^2 = 25/36 and starting value 1/6. So, the probability that A can win = (1/6) / (1- 25/36) = 6/11. P = 6/11

Do t-tests and ANOVA assume normality? How do we test normality?

'Parametric' tests like t-tests and ANOVA do assume normal distribution of dependent variables. However, ANOVA is not very sensitive to deviation from normality in the data. We can test for normality using a Q-Q plot or a KS test.

Let's say we use people to rate ads. There are two types of raters. Random and independent from our point of view: 80% of raters are careful and they rate an ad as good (60% chance) or bad (40% chance). 20% of raters are lazy and they rate every ad as good (100% chance). 1. Suppose we have 100 raters each rating one ad independently. What's the expected number of good ads? 2. Now suppose we have 1 rater rating 100 ads. What's the expected number of good ads? 3. Suppose we have 1 ad, rated as bad. What's the probability the rater was lazy?

1) Let X be the number of ads rated good and N = 100 the number of raters. Good ratings can come from either careful or lazy raters, so: E(X) = (80% N)(60%) + (20% N)(100%) = 48 + 20 = 68
2) There is a symmetry between ads and raters, so the answer is the same. With Na = 100 ads: E(X) = 80% (60% Na) + 20% (100% Na) = 68
3) Let B be the event that the ad is rated bad and L the event that the rater is lazy. P(L | B) = P(L, B) / P(B) = 0, because lazy raters never rate an ad as bad.

How many people must be gathered together in a room, before you can be certain that there is a greater than 50/50 chance that at least two of them have the same birthday?

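The 2nd person who enters has a 364/365 chance of not sharing a birthday, the 3rd has 363/365, and so on.
P(no match among k people) = p_no = (365 * 364 * 363 * ... * (365-k+1)) / 365^k = 365! / ( (365-k)! * 365^k )
P(at least one match) = p = 1 - p_no
p_no first drops below 0.5 when k = 23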

What's the probability that in a room full of k people, at least 2 people will have the same birthday?

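The 2nd person who enters has a 364/365 chance of not sharing a birthday, the 3rd has 363/365, and so on.
P(no match among k people) = p_no = (365 * 364 * 363 * ... * (365-k+1)) / 365^k = 365! / ( (365-k)! * 365^k )
P(at least one match) = p = 1 - p_no
For reference, p_no first drops below 0.5 when k = 23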
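The product form is easier to compute than the factorial form, which overflows quickly. A sketch assuming 365 equally likely birthdays and no leap years (function names are illustrative):

```python
def p_shared_birthday(k, days=365):
    """P(at least two of k people share a birthday), assuming uniform birthdays."""
    p_no_match = 1.0
    for i in range(k):
        p_no_match *= (days - i) / days
    return 1 - p_no_match

def smallest_group(threshold=0.5, days=365):
    """Smallest k for which the match probability exceeds the threshold."""
    k = 1
    while p_shared_birthday(k, days) <= threshold:
        k += 1
    return k
```

`smallest_group()` returns 23, where the match probability is about 0.507.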

What is the bias versus variance tradeoff in Machine Learning?

A model has "high bias" when it is too simple or general to capture the real pattern, so it underfits every data set. A model has "high variance" when it is overfit to one particular data set, and thus has high error when applied to other data sets. We want to minimize both bias and variance, but reducing one usually increases the other.

You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?

Applying Bayes Theorem: P(R | YYY) = P(R) P(YYY | R) / ( P(R) P(YYY | R) + P(NR) P(YYY | NR) ) = P(R) (2/3)^3 / ( P(R) (2/3)^3 + (1 - P(R)) (1/3)^3 ) = 8*P(R) / ( 1 + 7*P(R) )
The answer depends on the prior. If we assume a prior probability P(R) = 1/4, then P(R | YYY) = 8/11
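Because the result depends on the prior probability of rain, it helps to treat the prior as a parameter. A sketch with an illustrative function name; the priors 1/4 and 1/2 are assumptions, not given in the question.

```python
from fractions import Fraction

def p_rain_given_three_yes(prior, p_truth=Fraction(2, 3)):
    """Posterior probability of rain after three independent 'yes' answers."""
    yes_if_rain = p_truth ** 3        # all three tell the truth
    yes_if_dry = (1 - p_truth) ** 3   # all three lie
    return prior * yes_if_rain / (prior * yes_if_rain + (1 - prior) * yes_if_dry)
```

A prior of 1/4 gives 8/11, while a prior of 1/2 gives 8/9, which is the answer obtained when the prior is ignored.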

There are 30 red marbles and 10 black marbles in Urn #1. You have 20 red and 20 Black marbles in Urn #2. Randomly you pull a marble from the random urn and find that it is red. What is the probability that it was pulled from Urn #1?

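Applying Bayes Theorem and law of total probability, with each urn equally likely to be picked: P(Urn 1 | red) = P(red | Urn 1) P(Urn 1) / ( P(red | Urn 1) P(Urn 1) + P(red | Urn 2) P(Urn 2) ) = (30/40)(1/2) / ( (30/40)(1/2) + (20/40)(1/2) ) = 3/5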

If a jar has X red balls and Y blue balls, what is the minimum number of draws that is necessary to ensure that you have one ball of each color?

Best case: we draw one of each color immediately and total draws = 2. Worst case: we get unlucky and draw every ball of the more common color first. Answer: max(X,Y) + 1

How should we treat collinearity in data analysis?

Collinearity or multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model. Variance Inflation Factor (VIF) is used to measure collinearity of features. A common rule of thumb treats a VIF greater than 5 (some use 10) as a problem; such variables are candidates to be removed, combined, or handled with regularization.

Three zebras are chilling in the desert. Suddenly a lion attacks. Each zebra is sitting on a corner of an equilateral triangle. Each zebra randomly picks a direction and only runs along the outline of the triangle toward one of the other two corners. What is the probability that none of the zebras collide?

Each zebra has 2 possibilities (clockwise or counterclockwise). So the total is 2 x 2 x 2 = 8. The only possibilities are all 3 zebras going clockwise or counterclockwise (2). So 2⁄8 = 25%.

What is XGBoost?

Extreme Gradient Boosting (regression or classification). Builds an ensemble of decision trees sequentially, each new tree correcting the errors of the trees before it, to get a high performance model. Feature importance scores let us examine the contribution of different features to the model.

You're given a fair coin. You flip the coin until either Heads Heads Tails (HHT) or Heads Tails Tails (HTT) appears. Is one more likely to appear first? If so, which one and with what probability?

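Yes, HHT is more likely to appear first, with probability 2⁄3.
Both patterns start with H, so nothing matters until the first H appears. From there:
Case 1: the next flip is H (prob 1⁄2). We now hold HH. Any further H keeps us at HH and the first T completes HHT, so HHT wins for sure.
Case 2: the next flip is T (prob 1⁄2). We now hold HT. A T next (prob 1⁄2) completes HTT; an H next (prob 1⁄2) puts us back at a single H.
Let p = P(HHT wins | we hold a single H). Then p = 1⁄2 + (1⁄2)(1⁄2)p, so p = 2⁄3. The odds are 2⁄1 in favor of HHT, or a 66.7% chance.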
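A Monte Carlo sketch for sanity-checking which pattern tends to show up first. The seed and trial count are arbitrary, and the result is an estimate rather than an exact value.

```python
import random

def first_pattern(rng):
    """Flip a fair coin until HHT or HTT appears; return the pattern that came first."""
    history = ""
    while True:
        history += rng.choice("HT")
        if history.endswith("HHT"):
            return "HHT"
        if history.endswith("HTT"):
            return "HTT"

rng = random.Random(1)   # fixed seed so the run is repeatable
trials = 30000
hht_share = sum(first_pattern(rng) == "HHT" for _ in range(trials)) / trials
```

The estimated share for HHT should sit close to 2/3, in line with the exact state-by-state argument.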

What is an example of a data set with a normal/gaussian distribution?

Height, weight of humans

I randomly pick an integer from 1 to 100, how would you guess it? What if you have prior knowledge that I pick a number that is hard to guess such as near 1 or 100?

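I would conduct a binary search: I would ask if your number is greater or lower than 50, then greater or lower than 25 or 75, etc. This takes O(log n) rounds, where n is the highest integer (about 7 questions for n = 100). With prior knowledge that numbers near 1 or 100 are more likely, split on probability rather than on range: each question should divide the remaining probability mass in half, which means probing the regions near the ends first.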

Let's say we have a very naive advertising platform. Given an audience of size A and an impression size of B, each user in the audience is given the same random chance of seeing an impression. 1. Compute the probability that a user sees exactly 0 impressions. 2. What's the expected value of each person receiving at least 1 impression?

If the impressions are non-repetitive (each user sees at most one): P(0) = 1 - B/A, P(1) = B/A
If the impressions can be repetitive (each of the B impressions goes to a random user): P(0) = (1 - 1/A)^B, P(at least 1) = 1 - (1 - 1/A)^B

What is the Law of Large Numbers?

If you repeat an experiment independently a large number of times and average the result, what you obtain should be close to the expected value

What is the probability of getting a pair by drawing 2 cards in a 52 card deck?

Ignore whichever card is first chosen. Probability of 2nd card matching first card: P = 3/51

What is a KS test?

Kolmogorov-Smirnov Test. Tests whether sample fits a distribution well.

What is L1 regularization?

LASSO Regression: adds a small amount of bias to the regression model, to reduce the variance (prevents overfitting). LASSO loss = SSR + lambda * sum( abs(slope) ), where lambda * sum( abs(slope) ) is the penalty. Slope = how much prediction changes for a unit change in input. L1 regularization makes the model less malleable (less sensitive to the training data) and can set slopes exactly to 0. SSR = sum of squared residuals

What is the relationship between the coefficient in logistic regression and 'the odds ratio'?

LogReg coefficient = ln( Odds Ratio ), so Odds Ratio = e^(coefficient)
Odds = P(A) / (1 - P(A))
Odds Ratio = (odds after a +1 unit change in the input variable) / (odds before), holding the other variables fixed.
LogReg coefficient is far from 0 for a given variable when that variable has a major role in deciding the classification boundary (assuming the variables are on comparable scales).

What is the relevance of maximum likelihood to logistic regression?

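Logistic regression is fit using Maximum Likelihood: we pick the coefficients that make the observed labels most probable. Equivalently, LogReg loss = negative sum of the log-likelihoods (log loss / cross-entropy). There is no closed-form solution, so it is minimized iteratively (e.g. with gradient descent).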

What is the canonical best classification model to use for NLP document classification? (such as sentiment analysis, etc)

Naive Bayesian Classifier (multiclass) Because NBC is based on Bayes Theorem, which computes conditional probability of two events, given the individual probability of those events. NBC is especially good when we know all the conditional probabilities between our words and target. It can also work with very large data sets in a very computationally light way.

What is the probability of getting a 5 on throwing dice 7 times?

P(5) = 1/6 P(~5) = 1 - 1/6 = 5/6 P(no 5s in 7 throws) = (5/6)^7 Answer: P(at least one 5) = 1 - (5/6)^7 = 0.721

What is more likely: Getting at least one six in 6 rolls, at least two sixes in 12 rolls or at least 100 sixes in 600 rolls? (dice)

P(6) = 1/6 P(not 6) = 5/6
P(at least 1 in 6 rolls) = 1 - (5/6)^6 = 0.665
P(at least 2 in 12 rolls) = 1 - (5/6)^12 - (12 choose 1)*(1/6)*(5/6)^11 = 0.619
P(at least 100 in 600 rolls) = 1 - sum over k = 0..99 of (600 choose k)*(1/6)^k*(5/6)^(600-k), which is a little above 0.5
Every time, the probability goes down: as the number of rolls grows, the chance of reaching the expected count of sixes falls toward 1/2. P(at least 1 in 6 rolls) is greatest
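The three probabilities can be computed directly from the binomial distribution by subtracting the chance of falling short. A sketch with an illustrative helper name:

```python
from math import comb

def p_at_least(k, n, p=1 / 6):
    """P(at least k successes in n independent trials with success probability p)."""
    p_fewer = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))
    return 1 - p_fewer

p_one_in_6 = p_at_least(1, 6)
p_two_in_12 = p_at_least(2, 12)
p_hundred_in_600 = p_at_least(100, 600)
```

The values decrease in that order, so at least one six in 6 rolls is the most likely of the three events.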

What is the probability of pulling a different color or suit card from a shuffled deck of 52 cards?

P(A or B)=P(A)+P(B)-P(A and B) P(diff color) = 26/51 P(diff suit) = 39/51 (13 cards per suit) P(diff color and suit) = 26/51 P(diff color or suit) = 26/51 + 39/51 - 26/51 = 39/51

What is the relationship between AND and XOR (exclusive or) in probability?

P(A xor B) = P(A) + P(B) - 2 * P(A and B) note: inclusive or is P(A or B) = P(A) + P(B) - P(A and B); xor also excludes the overlap itself

What is the Probability definition equation?

P(A) + P(~A) = 1

What is the Law of Total Probability?

P(A) = P(A | N) * P(N) + P(A | ~N) * P(~N)

What is precision? (binary classification)

Percent of positive predictions that are true. Precision = TP / (TP + FP). Models with few false positives (high specificity) tend to have high precision.

What is the ACCURACY of a classification model?

Percent of predictions that are correct. Accuracy = (TP + TN) / (TP + TN + FP + FN)

In an A/B test, how can you check if assignment to the various buckets was truly random?

Plot the distributions of multiple features between A and B group. Distributions should match. Can perform t-tests, effect size tests between feature values between populations.

What is R^2 value? Why do we need it? How to calculate it? What is the difference between R^2 and adjusted R^2?

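R^2 = "coefficient of determination" = proportion of variance in the target (dep. var) that is predictable from the features (indep. vars). We need it to judge how well the model fits compared with always predicting the mean.
R^2 = 1 - SS_res / SS_tot, where SS_res = sum of squared residuals and SS_tot = sum of squared deviations of the target from its mean.
R = "correlation coefficient" = covariance(X, Y) / sqrt( var(X) * var(Y) ). For simple linear regression with one feature, R^2 is the square of R.
Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1) for n observations and p features. It penalizes extra features, so unlike R^2 it does not automatically rise when a useless variable is added.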

Suppose we have two coins. One is fair and the other biased where the probability of it coming up heads is 3/4. Let's say we select a coin at random and flip it two times. What is the probability that both flips result in the same side?

Tails: (1⁄2)(1⁄4)(1⁄4) + (1⁄2)(1⁄2)(1⁄2) = 5⁄32
Heads: (1⁄2)(3⁄4)(3⁄4) + (1⁄2)(1⁄2)(1⁄2) = 13⁄32
Summing both, we get 18⁄32 = 9⁄16

Let's say you're playing a dice game. You have 2 die. 1. What's the probability of rolling at least one 3? 2. What's the probability of rolling at least one 3 given N die?

Take the complement of the probability of rolling no 3's 1-(5⁄6)^2 1- (5⁄6)^N

What is the probability of drawing two cards (from the same deck of cards) that have the same suit?

The key is to ignore the first card because it doesn't matter which suit we have drawn. For the second card, there are 12 cards left of the suit we want out of 51 remaining cards. So the probability is 12/51 = 0.2353

What is the Central Limit Theorem?

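The sum (or mean) of a large number of independent, identically distributed random variables with finite variance is approximately normal, whatever the shape of their original distribution. The more random variables we sum together, the closer the distribution gets to normal.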

We have two buckets full of marbles. There are 30 red marbles and 10 black marbles in Bucket #1 and 20 red and 20 Black marbles in Bucket #2. Your friend secretly pulls a marble from one of the two buckets and shows you that the marble is red. 1. What is the probability that it was pulled from Bucket #1? 2. Let's say your friend picks two marbles instead and they both happen to be red. What is the probability that they both came from Bucket #1?

1) Assuming each bucket is equally likely to be chosen: P(B1 | red) = (1⁄2)(30⁄40) / ( (1⁄2)(30⁄40) + (1⁄2)(20⁄40) ) = 3⁄5
2) Assuming both marbles are drawn from the same bucket without replacement, the probabilities change once the first red marble is gone: P(2 red | B1) = 30⁄40 * 29⁄39 and P(2 red | B2) = 20⁄40 * 19⁄39. So P(B1 | 2 red) = (30*29) / (30*29 + 20*19) = 870⁄1250 = 0.696
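A sketch of the Bayes calculation for one or two red marbles. It assumes the friend picks one bucket with equal probability and draws all marbles from it without replacement; the question does not spell this out, so treat both as assumptions. Function names are illustrative.

```python
from fractions import Fraction

def p_all_red(red, total, draws):
    """P(every one of `draws` marbles is red), drawing without replacement."""
    prob = Fraction(1)
    for i in range(draws):
        prob *= Fraction(red - i, total - i)
    return prob

def p_bucket1_given_all_red(draws):
    """Posterior that the marbles came from Bucket #1 (30 red of 40) vs #2 (20 red of 40)."""
    from_b1 = Fraction(1, 2) * p_all_red(30, 40, draws)
    from_b2 = Fraction(1, 2) * p_all_red(20, 40, draws)
    return from_b1 / (from_b1 + from_b2)
```

Under these assumptions one red marble gives 3/5 and two red marbles give 87/125, about 0.696.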

Suppose you have a categorical dependent variable and a mixture of continuous and categorical independent variables. What type of tools/algorithms/methods could you apply for analysis?

This is a classification problem. For categorical independent variables (features), we can: 1) Encode them (one-hot encoding) and use LogReg, SVC, etc 2) Leave as categorical and only use tree-based models such as Random Forest, Gradient Boosting, AdaBoost

What is an example of a dataset/variable that has a non-normal (non-Gaussian) distribution?

Time of meals in a given day (trimodal), distribution of global wealth (skew right)

Suppose you run an analysis that suggests a certain factor predicts a certain outcome. How would you gauge whether it's actually causation or just correlation?

To test causality, you can test for "necessity and sufficiency" with controlled experiments. First, launch an A/B test without the feature to see if removing it has a causal effect on the metrics. Then launch an A/B test with an enhanced version of the feature to see if there is a causal effect in the opposite direction.

What is vectorization in NLP?

Turning a list of words into a vectorized frequency table where each column is a different word, each row is a different document and the value per element is the count of times that word appears in the document. Known as "Bag of Words" format
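
A minimal bag-of-words sketch using only the standard library. Whitespace tokenization and lowercasing are simplifying assumptions; real pipelines usually handle punctuation and stop words as well.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # word -> number of occurrences in this document
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```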

In statistics, what is a 'Type 1 error'?

Type 1 error = False Positive. Predicting positive when the observation is actually negative (in hypothesis testing: rejecting a null hypothesis that is actually true).

In statistics, what is a 'Type 2 error'?

Type 2 error = False Negative. Predicting negative when the observation is actually positive (in hypothesis testing: failing to reject a null hypothesis that is actually false).

You flip a fair coin 576 times. Without using a calculator, calculate the probability of flipping at least 312 heads.

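Use the normal approximation to the binomial. Mean = np = 288. Variance = np(1-p) = 144. Std dev = sqrt(Variance) = 12. 312 = 288 + 2 * Std dev, so we need the area more than 2 standard deviations above the mean. Right tail area is about 2.3% (roughly half of the ~5% that falls outside 2 standard deviations).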
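
With a computer at hand, the approximation can be compared against the exact binomial tail. This sketch uses `math.erfc` for the normal tail and `math.comb` (Python 3.8+) for the exact sum; no continuity correction is applied, matching the hand calculation.

```python
import math

n, p, threshold = 576, 0.5, 312

mean = n * p                           # 288
std_dev = math.sqrt(n * p * (1 - p))   # 12
z = (threshold - mean) / std_dev       # 2 standard deviations above the mean

# Upper-tail area of the standard normal beyond z.
normal_tail = 0.5 * math.erfc(z / math.sqrt(2))

# Exact P(X >= 312) for X ~ Binomial(576, 1/2).
exact_tail = sum(math.comb(n, k) for k in range(threshold, n + 1)) / 2 ** n

print(round(normal_tail, 4))  # 0.0228
print(exact_tail)
```

The exact value comes out slightly larger than the normal estimate, because the hand calculation skips the continuity correction.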

Based on past data, 98% reviews are legitimate and 2% are fake. If a review is fake, there is 95% chance that the machine learning algorithm identifies it as fake. If a review is legitimate, there is a 90% chance that the machine learning algorithm identifies it as legitimate. What is the percentage chance the review is actually fake when the algorithm detects it as fake?

Using Bayes' Theorem and the Law of Total Probability: P(flagged fake) = P(flagged fake | Fake) * P(Fake) + P(flagged fake | Legit) * P(Legit) = (0.95)(0.02) + (0.10)(0.98) = 0.019 + 0.098 = 0.117. So: P(Fake | flagged fake) = 0.019 / 0.117 = 0.16, which is about 16%.
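
The same calculation as a small function. The parameter names `sensitivity` (chance a fake review is flagged) and `specificity` (chance a legitimate review is passed) are labels chosen here, not terms from the original question.

```python
def posterior_fake(p_fake, sensitivity, specificity):
    """P(review is fake | algorithm flags it as fake) via Bayes' theorem."""
    flagged_and_fake = sensitivity * p_fake
    flagged_and_legit = (1 - specificity) * (1 - p_fake)  # false alarms
    return flagged_and_fake / (flagged_and_fake + flagged_and_legit)

posterior = posterior_fake(p_fake=0.02, sensitivity=0.95, specificity=0.90)
print(round(posterior, 3))  # 0.162
```

Because fake reviews are rare, most flagged reviews are false alarms even though the detector is fairly accurate.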

What is standard deviation and variance? Why do we need it?

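Variance = average squared deviation of the data points from the mean. Stdev = sqrt(variance). Both measure how spread out the data is. Standard deviation is useful because it is in the same units as the data, so we can relate the spread to the distribution (e.g. for a normal distribution, about 68% of values fall within 1 stdev of the mean).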
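
A quick sketch with the standard library's population formulas, checked against the definition. The data values are made up for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only

mean = statistics.fmean(data)
# Definition: average squared deviation from the mean.
variance_by_hand = sum((x - mean) ** 2 for x in data) / len(data)

variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # square root of the population variance

print(mean, variance_by_hand, std_dev)  # 5.0 4.0 2.0
```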

Let's say you have to draw two cards from a shuffled deck, one at a time. What's the probability that the second card is not an Ace?

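We have to consider both cases: the first card being an Ace and the first card not being an Ace. (4⁄52)(48⁄51) + (48⁄52)(47⁄51) = 12⁄13, about 92.3%. By symmetry this equals 48⁄52, the probability that any single card is not an Ace.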
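
The two cases can be added up with exact fractions to confirm that they collapse to 48/52. A minimal sketch:

```python
from fractions import Fraction

# Case 1: first card is an Ace, second is not.
first_ace = Fraction(4, 52) * Fraction(48, 51)
# Case 2: first card is not an Ace, second is not either.
first_not_ace = Fraction(48, 52) * Fraction(47, 51)

second_not_ace = first_ace + first_not_ace
print(second_not_ace)  # 12/13
```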

Imagine you have N pieces of rope in a bucket. You reach in and grab one end-piece, then reach in and grab another end-piece, and tie those two together. What is the expected value of the number of loops in the bucket?

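You have 2N ends. After selecting the first end there are 2 possible outcomes: 1) You select the other end of the same piece of rope, forming a loop. The probability of this event is 1/(2N-1). 2) You select an end of a different piece. The probability of this event is (2N-2)/(2N-1). Either way, one fewer open piece of rope remains, so we get a recurrence formula: E(N) = 1/(2N-1) + E(N-1), with base case E(1) = 1 (the last piece can only be tied to itself). Unrolling the recurrence gives E(N) = 1 + 1/3 + 1/5 + ... + 1/(2N-1).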
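
The recurrence E(N) = 1/(2N-1) + E(N-1) can be evaluated directly. This sketch assumes the tying is repeated until no loose ends remain; `expected_loops` is a hypothetical helper name.

```python
from fractions import Fraction

def expected_loops(n_ropes):
    """Expected number of loops: sum of 1/(2k-1) for k = 1..n_ropes."""
    expected = Fraction(0)
    for k in range(1, n_ropes + 1):
        # With k open pieces left, the chosen end meets its own other end
        # with probability 1/(2k-1), which closes one loop.
        expected += Fraction(1, 2 * k - 1)
    return expected

print(expected_loops(1))  # 1
print(expected_loops(2))  # 4/3
print(expected_loops(3))  # 23/15
```

The sum grows roughly like a logarithm of N, so the expected number of loops stays small even for many pieces of rope.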

nPr equation WITH replacement

n^r (think binary numbers)

There are 100 products and 25 of them are bad. What is the confidence interval?

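p = 25/100 = 0.25. CI = p +/- Z * sqrt( p*(1-p) / n ), where sqrt( p*(1-p) / n ) is the standard error of a binomial proportion. CI = 0.25 +/- 1.96 * sqrt( (0.25*(1-0.25)) / 100 ) = 0.25 +/- 0.085. CI = (0.165, 0.335), i.e. 16.5% to 33.5% bad. 95% confidence = plus or minus 1.96 standard errors.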
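
A sketch of the normal-approximation (Wald) interval for a proportion. The function name is an assumption, and z = 1.96 corresponds to 95% confidence.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * std_error
    return p_hat - margin, p_hat + margin

low, high = proportion_confidence_interval(25, 100)
print(round(low, 3), round(high, 3))  # 0.165 0.335
```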

Explain how a probability distribution could be not normal and give an example scenario.

Rolling a die gives a uniform distribution. The number of accidents in a period, given the average number of accidents, follows a Poisson distribution. Income is a great example: it has a large right skew. Exam scores can be bimodal: students who study for the exam versus those who do not.

Prove why Pearson's correlation coefficient is between -1 and 1.

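R = covariance(X, Y) / sqrt( var(X) * var(Y) ), where var(X) = covariance(X, X) = mean squared deviation of X from its mean. By the Cauchy-Schwarz inequality, covariance(X, Y)^2 <= var(X) * var(Y), so |R| <= 1. Intuitively, R cannot exceed +1 or fall below -1 because X and Y cannot covary with each other more than they covary with themselves. R is negative when X and Y covary in opposite directions relative to their means. R is positive when X and Y covary in the same direction relative to their means.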
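
A small numeric sketch of the formula covariance(X, Y) / sqrt( var(X) * var(Y) ), using population formulas. The data values are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: cov(X, Y) / sqrt(var(X) * var(Y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]          # illustrative values only
r = pearson_r(xs, ys)          # 6 / sqrt(60), about 0.775
r_perfect = pearson_r(xs, [2 * x + 1 for x in xs])   # exact linear relation
r_inverse = pearson_r(xs, [-x for x in xs])          # exact negative relation
print(round(r, 3), round(r_perfect, 3), round(r_inverse, 3))  # 0.775 1.0 -1.0
```

The extremes +1 and -1 are reached only when Y is an exact linear function of X.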

