Statistics
24. What is the probability of rolling two 6's in a row with a fair die (a fair die has numbers 1-6 on it)?
⅙ * ⅙ = 1/36
19. What is a long-tailed distribution?
A long-tailed distribution is one in which a large share of the observations lies far from the head of the distribution, spread across many relatively extreme but individually rare values. These distributions occur often in retail. For example, if we looked at customers' baskets at a grocery store over a one-month period, we might see many thousands, or even millions, of unique baskets. This is because there are so many different item combinations that a customer can select, and because foods are not consumed at the same rate (among other reasons), it is relatively rare for a customer to make repeated identical purchases. Special techniques, such as clustering the tail, are often needed before long-tail datasets can be used to train classification or other predictive models.
25. We roll a fair die 10 times. What is the probability that at least one of them comes up as a 3?
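1 - (⅚)^10 ≈ 0.8385. In this case, it is easier to calculate 1 - P(the complement of what we want), which is 1 - P(we roll a die 10 times and never observe a 3). Each roll misses a 3 with probability ⅚ and the rolls are independent, so P(no 3 in 10 rolls) = (⅚)^10.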
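The complement rule used above can be checked with exact fractions. This is a minimal sketch; the helper name `prob_at_least_one` is made up for illustration and is not part of the original answer.

```python
from fractions import Fraction


def prob_at_least_one(p_single, n_trials):
    """P(at least one success in n independent trials), via the complement rule."""
    return 1 - (1 - p_single) ** n_trials


# Probability of seeing at least one 3 in 10 rolls of a fair die.
p = prob_at_least_one(Fraction(1, 6), 10)
print(p, float(p))
```

Using `Fraction` keeps the arithmetic exact, so the printed fraction is the true probability and not a rounded value.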
Null Hypothesis
A null hypothesis is a hypothesis that says there is no relationship (or no effect) between the two variables being studied. It is the hypothesis that the researcher is trying to disprove. In the example, Susie's null hypothesis would be something like this: There is no relationship between the type of water I feed the flowers and the growth of the flowers. A researcher usually wants to reject the null hypothesis in order to demonstrate that there is a statistically significant relationship between the two variables.
2. What does it mean when a p-value is low?
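A low p-value means that data at least as extreme as what we observed would be unlikely if the null hypothesis were true. A small p-value (typically ≤ 0.05) is taken as strong evidence against the null hypothesis, so you reject the null hypothesis. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.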
3. What value is most often used to determine statistical significance?
A value of alpha = 0.05 is most often used as the threshold for statistical significance.
20. What is A/B testing and why is it useful?
An A/B test is a controlled experiment where two variants (A and B) are tested against each other on the same response. For example, a company could test two different email subject lines and then measure which one has the higher open rate. Once the superior variant has been determined (through statistical significance or some preset time period or metric), all future customers will typically only receive the "winning" variant. A/B testing is useful because it allows practitioners to rapidly test variations and learn about an audience's preferences.
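One common way to decide whether the open rates of two variants really differ is a two-proportion z-test. The sketch below is only one possible approach, and the counts (120 and 90 opens out of 1000 sends each) are hypothetical numbers chosen for illustration.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both variants share one rate.
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical data: subject line A opened 120/1000 times, B opened 90/1000 times.
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

If the p-value falls below the chosen alpha, the difference in open rates would be called statistically significant.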
9. What is bias (of a statistic)?
Bias is the difference between the expected value of an estimate and the true value of the population parameter being estimated. For example, if we survey homeowners on the value of their homes and only the wealthiest homeowners respond, then our "home value" estimate will be biased since it will be larger than the true value for our population. (That is an example of sampling bias causing us to have a biased statistic.) For machine learning models, bias refers to something slightly different: it is error caused by choosing an algorithm that cannot accurately model the signal in the data. For example, selecting a simple linear regression to model highly non-linear data would result in error due to bias.
15. Discuss the differences between Bayesian and frequentist statistics.
Both attempt to estimate a population parameter based on a sample of data. Frequentists treat the data as random and the population parameter as fixed (but unknown). Inferences are based on long-run repeated sampling, and estimates of the parameter come in the form of point estimates or confidence intervals. Bayesians treat the population parameter as random and the data as fixed. Bayesian statistics allows (and requires) you to make informed guesses about the value of the parameter in the form of prior distributions. Estimates of the parameter come in the form of posterior distributions.
5. What are some pitfalls of using classification accuracy to assess your model?
Classification accuracy can be misleading in the case of imbalanced data sets. For example, if 95% of my target is "1" and 5% is "0," I can achieve 95% accuracy by predicting "1" for every observation in my data. Obviously this model isn't useful despite having an accuracy of 95%, because it never identifies a single "0."
Classification accuracy
Classification accuracy is a metric that summarizes the performance of a classification model as the number of correct predictions divided by the total number of predictions: accuracy = # correct predictions / # total predictions
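The accuracy formula, and the imbalanced-data pitfall described earlier, can be illustrated with a short sketch. The `recall` helper and the 95/5 toy labels are illustrative assumptions that mirror the example in the text.

```python
def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Share of observations whose true class is `label` that were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


# 95% of the target is "1" and 5% is "0"; the model always predicts "1".
y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100

acc = accuracy(y_true, y_pred)
recall_minority = recall(y_true, y_pred, label=0)
print(acc, recall_minority)  # 0.95 0.0
```

The accuracy looks strong, but the recall of the minority class shows that the model never finds a single "0".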
14. What is extrapolation and why can it be dangerous?
Extrapolation is making predictions on data that lies outside the range of the training set. Example: let's say we have a model that predicts the value of homes based on their size. Our model was trained on a data set containing homes between 500 and 5000 sq-ft. Using this model to predict the value of a 6000 sq-ft home is extrapolation. Extrapolation is dangerous because you usually can't guarantee the relationship between the target and features beyond what you've observed. In the example, the relationship between square footage and home price may be "locally linear" between 500-5000 sq. feet, but exponential after that, resulting in a poor prediction.
13. What is interpolation?
Interpolation is making predictions on data that lies inside the range of the training set. Example: let's say we have a model that predicts the value of homes based on their size. Our model was trained on a data set containing homes between 500 and 5000 sq-ft. Using this model to predict the value of a 4200 sq-ft home is interpolation.
4. What are the 5 linear regression assumptions and how can you check for them?
Linearity: the target (y) and the features (xi) have a linear relationship. Check: plot the target against each feature; a scatter plot that follows a linear pattern (i.e. not a curvilinear pattern) suggests the linearity assumption is met.
Independence: the errors are not correlated with one another. Check: plot the residuals in observation order and look for patterns, or use a test for autocorrelation such as Durbin-Watson.
Normality: the errors are normally distributed. Check: look at a histogram or Q-Q plot of the residuals.
Homoskedasticity: the variance of the error term is constant across values of the target and features. Check: plot the residuals against the fitted values and look for a funnel shape.
No Multicollinearity: the features are not highly correlated with one another. Check: look for correlations above ~0.8 between features.
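The multicollinearity check (looking for correlations above ~0.8 between features) could be sketched as follows. The feature names and values are hypothetical, and `correlated_pairs` is an illustrative helper, not a library function.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every feature pair with |r| above the threshold."""
    names = list(features)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) > threshold:
                flagged.append((a, b, r))
    return flagged


# Hypothetical housing features; square footage and room count move together.
features = {
    "sqft": [500, 1200, 2000, 3100, 4800],
    "rooms": [2, 4, 6, 9, 14],
    "age_years": [30, 5, 42, 12, 25],
}
flagged = correlated_pairs(features)
print(flagged)
```

Pairwise correlation only catches relationships between two features at a time, so it is a first screen and not a complete test.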
22. What is multi-armed bandit testing and why is it useful?
Multi-armed bandit (or simply "bandit") testing is similar to multivariate and A/B testing, but the sampling distribution for variants changes gradually over time as feedback is received. For example, with traditional A/B tests, we could test 2 email subject lines: A and B. We would initially send out emails to 200 customers, sending 100 A variations and 100 B variations. After some set period of time, say 24 hours, we would observe which email variant was opened by more customers. We would then send that variant to all customers moving forward. With bandit testing, we would set some learning rate for the distribution of variants to change over time. Perhaps 60 customers opened the A variant emails and only 50 customers opened the B variant emails. We could then shift the distribution from (50% A, 50% B) to (55% A, 45% B) for the next round of emails. Using this approach, we can continuously monitor the response from our audience and shift our resources accordingly. This is particularly useful in marketing or any industry where people's preferences and opinions may change rapidly, since it continuously tests and learns preferences and can adapt very quickly. Note from Kyle: I love bandit testing and suggest using it over A/B or multivariate testing whenever possible!
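The allocation shift in the email example can be sketched with a fixed step size. This is a deliberately simple illustration: the function `shift_allocation` and the 5-percentage-point step are assumptions, and established bandit algorithms such as epsilon-greedy or Thompson sampling use different update rules.

```python
def shift_allocation(share_a, opens_a, sends_a, opens_b, sends_b, step=0.05):
    """Move `step` of the traffic toward the variant with the higher open rate."""
    rate_a = opens_a / sends_a
    rate_b = opens_b / sends_b
    if rate_a > rate_b:
        share_a += step
    elif rate_b > rate_a:
        share_a -= step
    # Keep both variants in play so we keep learning about each of them.
    return min(max(share_a, step), 1 - step)


# Round 1 from the text: 100 sends each, 60 opens for A and 50 opens for B.
share_a = shift_allocation(0.50, opens_a=60, sends_a=100, opens_b=50, sends_b=100)
print(f"next round: {share_a:.0%} A, {1 - share_a:.0%} B")
```

Calling the function again after each round of emails lets the split keep following the audience's response.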
Why is multicollinearity a problem?
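Multicollinearity occurs when independent variables in a regression model are correlated with one another. This is a problem because the model can no longer separate the individual effect of each variable: the coefficient estimates become unstable, their standard errors grow, and small changes in the data can change the fitted coefficients substantially. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.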
16. What is the multiple testing problem and how can we compensate for it?
Multiple hypothesis testing occurs when we run many hypothesis tests at once. If more than one hypothesis test is used to arrive at the same (or a correlated) conclusion, our chance of getting at least one false positive goes up. One way to compensate for it is the Bonferroni correction. Here, we recalculate each individual alpha to equal overall_alpha/k, where k is the number of tests, so that we don't artificially increase the chance of false positives.
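The effect of running many tests, and of the Bonferroni correction, can be shown numerically. The sketch assumes the k tests are independent, which is a simplification, and the choice of k = 20 is arbitrary.

```python
def family_wise_error_rate(alpha, k):
    """Chance of at least one false positive across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(overall_alpha, k):
    """Per-test significance threshold under the Bonferroni correction."""
    return overall_alpha / k


k = 20
uncorrected = family_wise_error_rate(0.05, k)
per_test_alpha = bonferroni_alpha(0.05, k)
corrected = family_wise_error_rate(per_test_alpha, k)

print(f"uncorrected family-wise error rate: {uncorrected:.3f}")
print(f"per-test alpha after Bonferroni:    {per_test_alpha:.4f}")
print(f"corrected family-wise error rate:   {corrected:.3f}")
```

With the correction applied, the chance of at least one false positive across all 20 tests stays at or below the overall alpha of 0.05.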
21. What is multivariate testing and why is it useful?
Multivariate testing is very similar to A/B testing, but it simultaneously tests more than 2 variants. This can be extremely useful when trying to optimize across a larger parameter space, e.g. 5 possible email subject lines, but it can take many more samples to achieve a statistically significant result. Another potential drawback is that a relatively large audience (>50%) will receive a non-optimal variation during testing.
26. We randomly draw two cards, without replacement, from a standard deck of cards. What is the probability they are both Kings? (there are 4 Kings in a standard 52-card deck).
P(A and B) = P(A) × P(B | A) = 4/52 × 3/51 = 1/221 ≈ 0.0045
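The same result can be checked with exact fractions and cross-checked by counting hands. This is a small sketch in which the variable names are illustrative.

```python
import math
from fractions import Fraction

# Conditional-probability route: P(first is a King) * P(second is a King | first was).
p_first_king = Fraction(4, 52)
p_second_king_given_first = Fraction(3, 51)
p_both_kings = p_first_king * p_second_king_given_first

# Counting route: ways to choose 2 of the 4 Kings over ways to choose any 2 cards.
p_by_counting = Fraction(math.comb(4, 2), math.comb(52, 2))

print(p_both_kings, p_by_counting)  # 1/221 1/221
```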
6. What are some ways to deal with imbalanced datasets?
Resampling is a common way to deal with imbalanced datasets. Here are two possible resampling techniques:
1. Oversampling: use all samples from your more frequently occurring event and then randomly sample your less frequently occurring event (with replacement) until you have a balanced data set.
2. Undersampling: use all samples from your less frequently occurring event and then randomly sample your more frequently occurring event (with or without replacement) until you have a balanced data set.
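Both resampling techniques can be sketched with the standard library. The row labels and the 95/5 split are hypothetical, and the helper names are made up for illustration.

```python
import random


def oversample_minority(majority, minority, rng):
    """Keep every majority row; draw minority rows with replacement to match its size."""
    resampled = [rng.choice(minority) for _ in range(len(majority))]
    return majority + resampled


def undersample_majority(majority, minority, rng):
    """Keep every minority row; draw majority rows without replacement to match its size."""
    return rng.sample(majority, len(minority)) + minority


rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [(f"row{i}", 1) for i in range(95)]
minority = [(f"row{i}", 0) for i in range(95, 100)]

over = oversample_minority(majority, minority, rng)
under = undersample_majority(majority, minority, rng)
print(len(over), len(under))  # 190 10
```

Oversampling keeps all of the data but repeats minority rows, while undersampling discards most of the majority rows.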
23. What is the bootstrap technique? What is it used for?
The bootstrap technique is a nonparametric method of estimating the sampling distribution of a statistic. Specifically, the bootstrap involves repeatedly drawing resamples from your dataset with replacement, each the same size as the original dataset, and calculating the statistic you're interested in on each resample. A distribution is constructed by building a histogram of the statistics generated from each pass. It is used to estimate standard errors and confidence intervals when no convenient formula is available.
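A minimal bootstrap sketch using only the standard library is shown below. The data values, the 2000 resamples, and the simple percentile interval are illustrative assumptions; other interval constructions exist.

```python
import random
import statistics


def bootstrap_distribution(data, statistic, n_resamples=2000, seed=0):
    """Recompute `statistic` on resamples of `data` drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, confidence=0.95):
    """Simple percentile interval read off the sorted bootstrap statistics."""
    ordered = sorted(values)
    lower_idx = int((1 - confidence) / 2 * len(ordered))
    upper_idx = int((1 + confidence) / 2 * len(ordered)) - 1
    return ordered[lower_idx], ordered[upper_idx]


# Hypothetical measurements; the statistic of interest is the median.
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.0]
medians = bootstrap_distribution(data, statistics.median)
low, high = percentile_interval(medians)
print(f"bootstrap 95% interval for the median: ({low}, {high})")
```

A histogram of `medians` would give the approximate sampling distribution described above.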
11. What is the curse of dimensionality?
The curse of dimensionality refers to problems that occur when we try to use statistical methods in high-dimensional space. As the number of features (dimensionality) increases, the data becomes relatively more sparse and often exponentially more samples are needed to make reliable predictions. Imagine going from a 10x10 grid to a 10x10x10 grid: if we want one sample in each cell, then the addition of the third parameter requires us to have 10 times as many samples (1000) as we needed when we had 2 parameters (100). In short, some models become much less accurate in high-dimensional space and may behave erratically. Examples include: linear models with no feature selection or regularization, kNN, Bayesian models. Models that are less affected by the curse of dimensionality: regularized models, random forest, some neural networks, stochastic models (e.g. Monte Carlo simulations).
1. What is a p-value?
The p-value is the probability of observing a result at least as extreme as the one actually observed, assuming the null hypothesis is true. It is used to determine if the outcome of an experiment is statistically significant. A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
Standard Error of Regression
The standard error of the regression (S), also known as the standard error of the estimate, represents the average distance that the observed values fall from the regression line. Conveniently, it tells you how wrong the regression model is on average using the units of the response variable.
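For a simple linear regression with one feature, S is commonly computed as the square root of the sum of squared residuals divided by n - 2. The sketch below assumes that formula and uses made-up data points.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def standard_error_of_regression(x, y):
    """S = sqrt(SSE / (n - 2)) for a regression with a single feature."""
    slope, intercept = fit_line(x, y)
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))


# Hypothetical data, in the units of the response variable.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
s = standard_error_of_regression(x, y)
print(f"S = {s:.4f}")
```

Because S is expressed in the units of the response, it can be read directly as a typical size of the prediction error.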
7. What is a Type I error?
A Type I error is the rejection of a true null hypothesis, also known as a "false positive."
8. What is a Type II error?
A Type II error is the non-rejection of a false null hypothesis, also known as a "false negative."
18. Name 3 continuous distributions and give an example of each one.
Uniform: All outcomes are equally likely to occur. All equal-length intervals have the same probability of occurring. Any single outcome, i.e. an interval of length 0, has a probability of 0. Example: Select a random real number between 0 and 10. P(x in [0,3]) = 3/10, but P(x=1) = 0.
Normal: A "bell shaped" symmetric distribution that is described by its average (mean) and the degree to which observations deviate from the average (standard deviation). Example: Height of humans.
Beta: A probability distribution over probabilities, i.e. a distribution on [0, 1] that represents how plausible each value of an unknown probability is. Example: You create a distribution of possible 3-point shooting percentages for your favorite basketball player at the start of the season to estimate his true shooting percentage over the entire season, with the knowledge that he will probably have a similar percentage to last year and that a hot or cold streak at the start of the season is not necessarily representative of his "true" underlying shooting percentage for the entire season.
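The basketball example can be sketched with the standard Beta update for success/failure data: a Beta(a, b) prior combined with observed makes and misses gives a Beta(a + makes, b + misses) posterior. The prior parameters and the early-season streak below are hypothetical numbers.

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def update_beta(alpha, beta, makes, misses):
    """Posterior Beta parameters after observing makes and misses."""
    return alpha + makes, beta + misses


# Hypothetical prior: last season suggests about 38% from three-point range.
prior = (38.0, 62.0)
# Hypothetical hot streak at the start of the season: 8 makes in 10 attempts.
posterior = update_beta(*prior, makes=8, misses=2)

print(f"prior mean:     {beta_mean(*prior):.3f}")
print(f"streak rate:    {8 / 10:.3f}")
print(f"posterior mean: {beta_mean(*posterior):.3f}")
```

The posterior mean moves only slightly toward the streak, which matches the intuition that a short hot streak should not override a full season of evidence.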
17. Name 4 discrete distributions and give a brief explanation and example for each one.
Uniform: All outcomes are equally likely to occur. P(each event) = 1/n. Example: Outcome of a fair and balanced die (uniform on [1, 2, 3, 4, 5, 6]).
Bernoulli: Only two possible outcomes can occur. The events are complementary. P(event 1) = p, P(event 2) = 1-p. Example: Outcome of a single coin flip.
Binomial: Describes the count of successes in n repeated independent Bernoulli trials, with each trial having a probability of success of p. Example: Outcome of multiple coin flips, e.g. after observing 2 flips of a fair coin, we have P(2 heads) = .25, P(2 tails) = .25, P(1 heads, 1 tails) = .50.
Poisson: Describes the probability of k events occurring in a fixed period of time, given that events occur at a constant average rate and independently of the time since the last event. Example: The number of cars that drive past your house in the next hour.
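The binomial and Poisson probabilities can be computed directly from their standard probability mass functions. In this sketch the rate of 3 cars per hour is a hypothetical value.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(exactly k events in a period with the given average rate)."""
    return math.exp(-rate) * rate**k / math.factorial(k)


# Number of heads in 2 flips of a fair coin: 0, 1, or 2 heads.
heads_probs = [binomial_pmf(k, 2, 0.5) for k in range(3)]
print(heads_probs)  # [0.25, 0.5, 0.25]

# Hypothetical rate: on average 3 cars pass the house per hour.
p_no_cars = poisson_pmf(0, 3)
print(f"P(no cars in the next hour) = {p_no_cars:.4f}")
```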
10. What is variance (of a statistic)?
Variance is a measurement of how spread out a set of values is around its mean. More formally, Var(X) = E[(X - μ)^2], where μ = E[X] is the mean of X.
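As a worked instance of the formula, the variance of a single roll of a fair die can be computed exactly. This is a small sketch in which the variable names are illustrative.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # each face of a fair die is equally likely

# mu = E[X]
mu = sum(x * p for x in outcomes)
# Var(X) = E[(X - mu)^2]
variance = sum((x - mu) ** 2 * p for x in outcomes)

print(mu, variance)  # 7/2 35/12
```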
12. What is the Central Limit Theorem?
When we draw samples of independent random variables (drawn from any single distribution with a finite variance), their sample mean tends toward the population mean and the distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the distribution from which the random variables were drawn. The variance of the sample mean equals the population variance divided by the sample size. For example, let's say we have a fair and balanced 6-sided die. The result of rolling the die has a uniform distribution on [1, 2, 3, 4, 5, 6]. The average result from a die roll is (1+2+3+4+5+6)/6 = 3.5. If we roll the die 10 times and average the values, then the resulting sample mean will have a distribution that begins to look similar to a normal distribution centered around 3.5. If we roll the die 100 times and average the values, then the resulting sample mean will have a distribution that looks and behaves even more like a normal distribution, again centered at 3.5, but now with decreased variance.
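The die example can be simulated to see the sample mean concentrate around 3.5 with variance close to the population variance divided by the number of rolls. This is a sketch; the 5000 repetitions and the fixed seed are arbitrary choices.

```python
import random
import statistics


def sample_means(n_rolls, n_experiments=5000, seed=0):
    """Average of n_rolls fair die rolls, repeated n_experiments times."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.randint(1, 6) for _ in range(n_rolls)])
        for _ in range(n_experiments)
    ]


POPULATION_VARIANCE = 35 / 12  # variance of a single fair die roll

for n_rolls in (10, 100):
    means = sample_means(n_rolls)
    print(
        f"n = {n_rolls:3d}: "
        f"mean of sample means = {statistics.fmean(means):.3f}, "
        f"variance = {statistics.pvariance(means):.4f}, "
        f"theory = {POPULATION_VARIANCE / n_rolls:.4f}"
    )
```

Plotting a histogram of `means` for each value of `n_rolls` would show the bell shape tightening as the number of rolls grows.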