WEEK6 UDS
What is the primary purpose of an AA test in online experimentation? A. To determine if there are significant differences between two variations of a feature B. To ensure the experimentation platform is measuring metrics correctly C. To measure the impact of a new product launch D. To increase user engagement on the website
(B): To ensure the experimentation platform is measuring metrics correctly. AA tests involve splitting traffic equally between two identical versions (A and A) of a feature. Since there's no change being tested, it helps identify any measurement inconsistencies within the experimentation system.
You've set up an AA test and notice that one of the "A" variations is consistently showing a slight positive lift compared to the other. Which of the following is the most likely explanation? A. This is a true, statistically significant result and you should take action based on it. B. Your experimentation system might have slight biases, or there could be natural variability within the metric. C. The positive lift is caused by a bug in your implementation D. None of the above
(B): Your experimentation system might have slight biases, or there could be natural variability within the metric. AA tests minimize external factors as both groups receive the same experience. Slight lifts could be due to inherent metric fluctuations or minor biases in the system, not a genuine effect.
Which of the following is the best example of survivorship bias? A. A marketing campaign analyzing only the customers who completed a purchase. B. An experiment is stopped early because the treatment group seems to be performing poorly. C. A study on airplane armor during WWII focuses on planes that returned safely. D. An A/B test with an uneven split of users between control and treatment.
(C): A study on airplane armor during WWII focuses on planes that returned safely. This scenario exemplifies survivorship bias. We only analyze data from planes that survived to be inspected, overlooking planes impacted in areas not reinforced (the planes that didn't return).
You want to ensure your AA tests have enough power to detect even small inconsistencies. Which of the following actions would help achieve this? A. Run the AA test for a shorter duration with a smaller sample size B. Include a wider range of user segments in the AA test C. Run the AA test for a longer duration with a larger sample size D. Focus only on top-level key performance indicators (KPIs)
(C): Run the AA test for a longer duration with a larger sample size. Statistical power refers to the ability to detect a true effect. A larger sample size over a longer period increases the test's sensitivity to pick up even minor inconsistencies.
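As a rough illustration, here is a minimal sketch (using statsmodels, with an assumed standardized effect size of 0.02 standing in for a "small inconsistency") of how small effects drive up the required sample size:

```python
# A minimal sketch, assuming a two-sided two-sample t-test and a
# hypothetical standardized effect size of 0.02; numbers are assumptions,
# not values from the source.
from statsmodels.stats.power import TTestIndPower

n_per_arm = TTestIndPower().solve_power(
    effect_size=0.02,   # Cohen's d; assumed for illustration
    alpha=0.05,         # significance level
    power=0.8,          # desired probability of detecting the effect
)
print(f"Users needed per arm: {n_per_arm:,.0f}")  # roughly 39,000+
```

Detecting an effect half that size would roughly quadruple the required sample, which is why longer durations and larger samples are needed for small inconsistencies.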
Which of the following is NOT a common benefit of conducting AA tests? A. Establishing a baseline for your metrics and their natural variation B. Identifying potential bugs or biases in your experimentation platform C. Ensuring consistency of results across multiple experiments D. Improving the user experience with certainty
(D): Improving the user experience with certainty. AA tests establish a baseline and detect biases, but they don't definitively improve user experience. You'd need a follow-up A/B test with a variation to confirm improvements.
Describe a scenario in online experimentation where Sample Ratio Mismatch (SRM) might occur. Provide a plausible reason for why it would happen.
A good answer would describe a situation where the intended allocation of users between control and treatment groups (e.g., a 50/50 split) doesn't match the actual assignment that occurs during the experiment. Plausible reasons include a bad hash function that distributes users unevenly, or lossy instrumentation in one arm (e.g., the treatment adds a redirect that drops some users' events before they are logged).
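A common way to flag SRM is a chi-square goodness-of-fit test on the observed assignment counts. A minimal sketch, assuming an intended 50/50 split; the counts are made up for illustration:

```python
# Chi-square SRM check; observed counts are illustrative placeholders.
from scipy.stats import chisquare

observed = [50_912, 49_088]          # users actually in control, treatment
total = sum(observed)
expected = [total / 2, total / 2]    # intended 50/50 allocation

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:                  # a strict threshold is common for SRM alarms
    print(f"Possible SRM: p = {p_value:.2e} — investigate before trusting results")
```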
Survivorship Bias
A logical error where conclusions are drawn based only on the data from those who "survived" a process
Sample Ratio Mismatch (SRM)
A mismatch between the intended and actual ratio of units (e.g., users) in the control and treatment groups of an experiment
Stable Unit Treatment Value Assumption (SUTVA)
A principle in experimental design assuming that individual units (e.g., users) do not influence each other's outcomes
Intention-to-Treat
A principle where analysis includes all participants initially assigned to treatment groups, regardless of whether they fully adhered to the treatment
Confidence Intervals
A range of values that is likely to contain the true treatment effect within a specified level of certainty (e.g., 95%)
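For instance, a minimal sketch of a 95% confidence interval for a difference in means, assuming samples large enough for a normal approximation; the data are simulated placeholders:

```python
# Normal-approximation CI for the treatment-control difference in means.
import numpy as np
from scipy import stats

control = np.random.default_rng(0).normal(10.0, 2.0, 5_000)    # simulated metric
treatment = np.random.default_rng(1).normal(10.1, 2.0, 5_000)  # simulated metric

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + control.var(ddof=1) / len(control))
lo, hi = diff - stats.norm.ppf(0.975) * se, diff + stats.norm.ppf(0.975) * se
print(f"Estimated lift: {diff:.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```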
False Discovery Rate
A statistical method for controlling the expected proportion of false positives when conducting multiple hypothesis tests
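The Benjamini-Hochberg procedure is one standard way to control the FDR. A minimal sketch using statsmodels, with illustrative p-values (as if from ten metrics in one experiment, not from the source):

```python
# Benjamini-Hochberg FDR control over multiple hypothesis tests.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.020, 0.041, 0.045, 0.090, 0.210, 0.500, 0.720, 0.950]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={sig}")
```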
What does an indicator variable coefficient commonly represent?
The ATE. In a regression of an outcome on a 0/1 treatment indicator, the coefficient on the indicator estimates the average difference in outcomes between the treated and untreated groups (the Average Treatment Effect, under randomization).
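A minimal sketch on simulated data, showing that the indicator's coefficient equals the difference in group means, while the intercept is the mean of the reference (untreated) group:

```python
# Simulated data with a known treatment effect of 0.3.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({"treated": rng.integers(0, 2, 10_000)})     # 0/1 indicator
df["y"] = 5.0 + 0.3 * df["treated"] + rng.normal(0, 1, len(df))

fit = smf.ols("y ~ treated", data=df).fit()
print(fit.params["treated"])   # ~0.3, the estimated ATE
# Identical to the raw difference in group means:
print(df.loc[df.treated == 1, "y"].mean() - df.loc[df.treated == 0, "y"].mean())
# fit.params["Intercept"] is the mean of the reference group (treated == 0).
```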
Average Treatment Effect on the Treated (ATT)
ATT measures the average difference in outcomes for the group that received the treatment compared to what their outcome would have been if they hadn't received the treatment
Average Treatment Effect on the Untreated (ATU)
ATU measures the average difference in outcomes for the group that didn't receive the treatment compared to what their outcome would have been if they had received the treatment
What does the Average Treatment Effect (ATE) measure? A. The difference between potential outcomes for a single individual. B. The average difference between potential outcomes across a population. C. The percentage of people who respond positively to a treatment D. The overall effect size in standard deviation units
B - The average difference between potential outcomes across a population. The ATE is not about any single individual's difference (as we can't directly observe those). Instead, it summarizes the average expected effect if the treatment were applied to a relevant population.
A researcher is interested in the effect of a new job training program on employment outcomes. Which assumption is crucial for making valid causal claims? A. The participants in the job training program are highly motivated. B. The treatment and control groups are comparable (due to randomization or controlling for confounders). C. The effects of the job training program are the same for everyone. D. None of the above
B - The treatment and control groups are comparable (due to randomization or controlling for confounders). This is the concept of ignorability or unconfoundedness. For causal inference, we need groups that would have similar outcomes if they hadn't received the treatment. Randomization or careful control of other factors helps achieve this.
Sarah took a headache relief pill and her headache went away. If Sarah hadn't taken the pill, her headache might still have gone away on its own. This scenario highlights: A. The Stable Unit Treatment Value Assumption (SUTVA) B. The importance of confounders C. The fundamental problem of causal inference D. The definition of a potential outcome
C - The fundamental problem of causal inference. This highlights the core issue: we can only observe what did happen (Sarah took the pill, headache went away), but not what would have happened if she hadn't taken it. This makes directly isolating the pill's causal effect impossible.
Which of the following is NOT a potential outcome? A. A patient's blood pressure if they receive a new medication. B. A student's test score if they attend a tutoring program. C. The salary of a person who goes to college. D. The crop yield under a new fertilizer treatment.
C - The salary of a person who goes to college. This is only an observed outcome; the potential outcome we're missing is the person's salary if they hadn't gone to college. The difference between these is the causal effect of college on salary.
Multiple Hypothesis Tests
Conducting numerous statistical tests on the same dataset, increasing the likelihood of false positives
Threats to Internal Validity
Factors that can compromise the accuracy of your experimental results, making it difficult to isolate the true impact of the treatment
True or False: If a change is statistically significant, you should always launch it. Explain your reasoning. (Short Answer)
False. Cost: a small improvement might not justify the engineering effort. Side effects: the change might harm other metrics (guardrails would catch this). Strategic fit: even a positive change might not align with long-term business goals.
Give an example of the ATE.
Imagine a study giving students a special tutoring program. The ATE would be the average difference in test scores between students who received the tutoring and those who didn't.
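A minimal simulation of this tutoring example (all numbers are assumptions): because both potential outcomes are generated, the true ATE is known, and the estimate from a randomized assignment can be checked against it:

```python
# Simulate both potential outcomes so the true ATE is observable in code,
# even though in reality only one outcome per student is ever seen.
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
y0 = rng.normal(70, 10, n)        # test score without tutoring
y1 = y0 + rng.normal(5, 2, n)     # test score with tutoring (true effect ~5 points)

true_ate = (y1 - y0).mean()       # unobservable in practice

treated = rng.integers(0, 2, n).astype(bool)       # random assignment
estimate = y1[treated].mean() - y0[~treated].mean()  # difference in observed means
print(f"true ATE = {true_ate:.2f}, randomized estimate = {estimate:.2f}")
```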
Lossy Instrumentation
Inaccurate data collection (like click tracking) potentially introducing bias or sample ratio mismatches
Bad Hash Functions
Inappropriate randomization functions can lead to uneven distribution of users into experiment groups
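A minimal sketch of hash-based assignment, assuming string user IDs; salting a cryptographic hash with the experiment name is one common way to get the even, deterministic splits that a bad hash function fails to provide:

```python
# Deterministic bucketing via a salted cryptographic hash.
import hashlib

def assign(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # bucket in [0, 9999]
    return "treatment" if bucket < treatment_share * 10_000 else "control"

counts = {"control": 0, "treatment": 0}
for uid in range(100_000):
    counts[assign(str(uid), "week6_test")] += 1
print(counts)   # should be close to 50/50; large deviations suggest a bad hash
```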
In the provided example in the A/B book, why is "revenue-per-user (who started checkout)" a better OEC than total revenue?
It focuses on the most relevant audience: Users who started checkout are the only ones potentially impacted by the change. Including everyone dilutes the results with noise from unaffected users. Total revenue can fluctuate due to external factors unrelated to the experiment (e.g., overall traffic increase). Revenue-per-user normalizes for such variations.
Residual or Carryover Effects
Lasting effects from previous experiments or treatments that can contaminate current results
Match each term to its definition: Statistical Significance, Practical Significance, Confidence Interval.
a) The range within which we are confident the true difference between variants likely lies.
b) An observed difference unlikely to occur by chance, assuming the null hypothesis is true.
c) The magnitude of change in a metric that matters from a business perspective.
Statistical Significance - b: p-values and confidence intervals measure the likelihood of observing a difference just by chance; a low p-value implies a real difference due to the experiment.
Practical Significance - c: this considers whether the change is big enough to be meaningful for the business, beyond statistical chance.
Confidence Interval - a: this range around the observed change indicates where the true difference likely lies at a given confidence level (e.g., 95%).
Statistical Power
The ability to detect a meaningful difference if it exists
Unconfoundedness
The assignment of treatment is independent of the potential outcomes, either naturally (like in a randomized controlled experiment) or after controlling for observable variables (confounders). This implies no hidden biases that affect both treatment receipt and outcomes.
Average Treatment Effect (ATE)
The average causal effect across the population or a relevant subpopulation. This is the key focus of most causal inference methods, as it offers an effect estimate for the group as a whole rather than for any single unit.
Causal Effect
The difference between the potential outcomes for a single unit. Because we can never observe both Y1 and Y0 simultaneously for the same unit, this individual-level causal effect is fundamentally unobservable.
Randomization Unit
The entity (e.g., user, session) that is randomly assigned to a test condition
Fundamental Problem of Causal Inference
The fact that we can only observe one potential outcome for each individual (the one corresponding to their received treatment), making it impossible to directly calculate individual causal effects.
SUTVA Violations in Social Networks, Marketplaces, etc.
The interconnected nature of these environments can violate the assumption that participants do not affect each other
Treatment
The intervention or exposure whose effect we want to study. This could be a medical treatment, a policy change, an educational program, etc.
Stable Unit Treatment Value Assumption (SUTVA)
The potential outcome for one unit is unaffected by the treatment assignment of other units (no interference or spillover effects). The treatment has a consistent, well-defined version (no variations that would lead to differing effects).
Peeking at p-values
The practice of continuously monitoring p-values in an online experiment, leading to inflated false positive rates
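A minimal simulation (with assumed parameters) of why peeking inflates false positives: under the null, testing after every batch rejects far more often than a single final test:

```python
# Compare false positive rates: one final test vs. peeking after each batch.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_batches, batch = 2_000, 10, 100

peek_fp = final_fp = 0
for _ in range(n_experiments):
    a = rng.normal(0, 1, n_batches * batch)
    b = rng.normal(0, 1, n_batches * batch)   # identical distributions: null is true
    peeked = any(
        stats.ttest_ind(a[: k * batch], b[: k * batch]).pvalue < 0.05
        for k in range(1, n_batches + 1)      # "peek" after every batch
    )
    peek_fp += peeked
    final_fp += stats.ttest_ind(a, b).pvalue < 0.05

print(f"final-only FPR: {final_fp / n_experiments:.3f}")  # ~0.05 as intended
print(f"peeking FPR:    {peek_fp / n_experiments:.3f}")   # substantially higher
```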
OEC (Overall Evaluation Criterion)
The primary metric for evaluating experiment success
Propensity Score
The probability of receiving the treatment, often calculated based on observed characteristics. Used in methods like propensity score matching to help create more comparable groups
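A minimal sketch of one such method, inverse-probability weighting with estimated propensity scores, on simulated observational data where a confounder drives both treatment and outcome (all data and parameters are assumptions):

```python
# Estimate propensity scores, then reweight to correct confounding.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(0, 1, n)                               # confounder
treated = rng.random(n) < 1 / (1 + np.exp(-x))        # more likely treated if x high
y = 2.0 * treated + 1.5 * x + rng.normal(0, 1, n)     # true effect = 2.0

ps = LogisticRegression().fit(x.reshape(-1, 1), treated) \
                         .predict_proba(x.reshape(-1, 1))[:, 1]
weights = treated / ps + (1 - treated) / (1 - ps)     # IPW weights

ipw_ate = (np.average(y[treated], weights=weights[treated])
           - np.average(y[~treated], weights=weights[~treated]))
print(f"naive diff: {y[treated].mean() - y[~treated].mean():.2f}")  # biased upward
print(f"IPW ATE:    {ipw_ate:.2f}")                                 # ~2.0
```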
Potential Outcomes
These are the hypothetical values of an outcome variable that an individual or unit would experience under different treatment conditions. Y1: The outcome if the unit is treated. Y0: The outcome if the unit is not treated.
True or False: If we perfectly randomize participants to a treatment or control group, we can always assume that their potential outcomes are independent of their treatment assignment.
True. Proper randomization breaks any systematic link between the treatment assignment and individuals' characteristics (and thus their potential outcomes). This is why randomized experiments are the gold standard for causal inference.
True or False: AA tests should always be run before launching a new experiment.
True. AA tests act as a control to ensure the experimentation platform is working correctly before launching an A/B test with variations. They identify potential issues that could skew later results.
True or False: The Stable Unit Treatment Value Assumption (SUTVA) states that an individual's participation in a study doesn't affect other people's outcomes.
True. The Stable Unit Treatment Value Assumption (SUTVA) is a core principle assuming participants in an experiment don't influence each other's outcomes.
Name two types of guardrail metrics and provide one example of each. (Short Answer)
Trust-related. Example: ensure an even split of users across control and test groups; unequal splits mean the experiment might be flawed. Organizational. Example: monitor site-wide latency; your experiment shouldn't significantly slow down the site, impacting all users.
Confounders
Variables that affect both the treatment assignment and the outcome, potentially obscuring the true causal relationship
What is the role of the reference group when using indicator variables?
It serves as the baseline for comparison when interpreting the coefficients of the other indicator variables.
What is the purpose of a "fake door" or "painted door" test? a) To fully launch a new feature and measure its impact on all users. b) To determine customer preferences before building a new feature. c) To assess potential user behavior changes without building the full feature. d) To introduce new visual designs on a live website.
c) To assess potential user behavior changes without building the full feature. Fake door tests help gather data on user interest and likely behavioral impact, minimizing investment in features that may not be successful before the full implementation.
Guardrail metrics
Secondary metrics you monitor alongside your primary experiment metric (OEC) to ensure the validity of your experiment and to catch unexpected side effects of your change