WEEK6 UDS


What is the primary purpose of an AA test in online experimentation? A. To determine if there are significant differences between two variations of a feature B. To ensure the experimentation platform is measuring metrics correctly C. To measure the impact of a new product launch D. To increase user engagement on the website

(B): To ensure the experimentation platform is measuring metrics correctly. AA tests involve splitting traffic equally between two identical versions (A and A) of a feature. Since there's no change being tested, it helps identify any measurement inconsistencies within the experimentation system.
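This can be illustrated with a small stdlib-only Python simulation (a hypothetical sketch; `z_test_pvalue` is an illustrative large-sample approximation, not a platform API). When both arms draw from the same distribution, roughly 5% of AA runs should come out "significant" at α = 0.05; a materially higher rate would point to a measurement problem in the platform.

```python
import math
import random
import statistics

def z_test_pvalue(a, b):
    """Two-sided large-sample z-test p-value for a difference in means."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(0)
runs, false_positives = 200, 0
for _ in range(runs):
    # Both "variants" draw from the same distribution: a true AA test.
    a = [random.gauss(10, 2) for _ in range(500)]
    b = [random.gauss(10, 2) for _ in range(500)]
    if z_test_pvalue(a, b) < 0.05:
        false_positives += 1

rate = false_positives / runs
print(rate)  # a healthy platform lands near the nominal alpha of 0.05
```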

You've set up an AA test and notice that one of the "A" variations is consistently showing a slight positive lift compared to the other. Which of the following is the most likely explanation? A. This is a true, statistically significant result and you should take action based on it. B. Your experimentation system might have slight biases, or there could be natural variability within the metric. C. The positive lift is caused by a bug in your implementation D. None of the above

(B): Your experimentation system might have slight biases, or there could be natural variability within the metric. AA tests minimize external factors as both groups receive the same experience. Slight lifts could be due to inherent metric fluctuations or minor biases in the system, not a genuine effect.

Which of the following is the best example of survivorship bias? A. A marketing campaign analyzing only the customers who completed a purchase. B. An experiment is stopped early because the treatment group seems to be performing poorly. C. A study on airplane armor during WWII focuses on planes that returned safely. D. An A/B test with an uneven split of users between control and treatment.

(C): A study on airplane armor during WWII focuses on planes that returned safely. This scenario exemplifies survivorship bias. We only analyze data from planes that survived to be inspected, overlooking planes impacted in areas not reinforced (the planes that didn't return).

You want to ensure your AA tests have enough power to detect even small inconsistencies. Which of the following actions would help achieve this? A. Run the AA test for a shorter duration with a smaller sample size B. Include a wider range of user segments in the AA test C. Run the AA test for a longer duration with a larger sample size D. Focus only on top-level key performance indicators (KPIs)

(C): Run the AA test for a longer duration with a larger sample size. Statistical power refers to the ability to detect a true effect. A larger sample size over a longer period increases the test's sensitivity to pick up even minor inconsistencies.
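As a sketch of why sample size drives power, the textbook two-sample formula n = 2(z_{1-α/2} + z_{1-β})² σ² / δ² can be computed directly with the standard library (`sample_size_per_group` is a hypothetical helper name, and this is an approximation, not a platform-specific calculation):

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Textbook two-sample size formula for detecting a mean difference
    of `delta` when the outcome's standard deviation is `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Halving the effect you want to detect quadruples the required n per group.
print(sample_size_per_group(delta=0.10, sigma=1.0))  # 1570
print(sample_size_per_group(delta=0.05, sigma=1.0))  # 6280
```

The quadratic dependence on δ is why detecting "even small inconsistencies" demands much larger samples.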

Which of the following is NOT a common benefit of conducting AA tests? A. Establishing a baseline for your metrics and their natural variation B. Identifying potential bugs or biases in your experimentation platform C. Ensuring consistency of results across multiple experiments D. Improving the user experience with certainty

(D): Improving the user experience with certainty. AA tests establish a baseline and detect biases, but they don't definitively improve user experience. You'd need a follow-up A/B test with a variation to confirm improvements.

Describe a scenario in online experimentation where Sample Ratio Mismatch (SRM) might occur. Provide a plausible reason for why it would happen.

A good answer would describe a situation where the intended allocation of users between control and treatment groups (e.g., a 50/50 split) doesn't match the actual assignment observed during the experiment. A plausible cause is lossy instrumentation: for example, if the treatment adds a redirect, some treatment users may drop off before their assignment is logged, making the treatment group appear smaller than intended.

Survivorship Bias

A logical error where conclusions are drawn based only on the data from those who "survived" a process

Sample Ratio Mismatch (SRM)

A mismatch between the intended and actual ratio of units (e.g., users) in the control and treatment groups of an experiment
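In practice, SRM is usually detected with a chi-square goodness-of-fit test on the observed group counts. A minimal stdlib-only sketch (the `srm_check` helper name and the counts are hypothetical):

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square goodness-of-fit test for an intended traffic split."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # p-value for a chi-square statistic with 1 degree of freedom.
    p = 1 - math.erf(math.sqrt(chi2 / 2))
    return chi2, p

# A 50.6/49.4 split over ~1M users: looks small, but is strongly significant.
chi2, p = srm_check(506_000, 494_000)
print(chi2, p)  # far below the usual 0.001 SRM alarm threshold
```

A tiny imbalance over a large sample is exactly the kind of SRM that signals a broken assignment or logging pipeline rather than random noise.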

Stable Unit Treatment Value Assumption (SUTVA)

A principle in experimental design assuming that individual units (e.g., users) do not influence each other's outcomes

Intention-to-Treat

A principle where analysis includes all participants initially assigned to treatment groups, regardless of whether they fully adhered to the treatment

Confidence Intervals

A range of values that is likely to contain the true treatment effect within a specified level of certainty (e.g., 95%)
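A minimal sketch of computing such an interval for a difference in means, using a normal approximation in stdlib Python (the simulated data and the `diff_ci` helper are hypothetical):

```python
import math
import random
import statistics
from statistics import NormalDist

def diff_ci(treatment, control, level=0.95):
    """Normal-approximation CI for the difference in means (treatment - control)."""
    diff = statistics.mean(treatment) - statistics.mean(control)
    se = math.sqrt(statistics.variance(treatment) / len(treatment)
                   + statistics.variance(control) / len(control))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se

random.seed(7)
# Simulated experiment with a true lift of 0.5 on a noisy metric.
treatment = [random.gauss(10.5, 3) for _ in range(2_000)]
control = [random.gauss(10.0, 3) for _ in range(2_000)]
lo_95, hi_95 = diff_ci(treatment, control)
print(lo_95, hi_95)
```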

False Discovery Rate

A statistical method for controlling the expected proportion of false positives when conducting multiple hypothesis tests
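The Benjamini-Hochberg procedure is the classic method for controlling the FDR across many tests. A self-contained sketch (the p-values below are made up for illustration):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.
    Returns the indices of the hypotheses that are rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k / m) * q, then reject the
    # hypotheses with the k smallest p-values.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
print(benjamini_hochberg(pvals))  # only the two smallest p-values survive
```

Note that 0.039 would be "significant" at α = 0.05 in isolation, but is not rejected once the multiplicity of tests is accounted for.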

What does an indicator variable coefficient commonly represent?

The Average Treatment Effect (ATE). In a regression with a binary treatment indicator, the indicator's coefficient is the average difference in the outcome relative to the reference (baseline) group; under randomization this estimates the ATE.

Average Treatment Effect on the Treated (ATT)

ATT measures the average difference in outcomes for the group that received the treatment compared to what their outcome would have been if they hadn't received the treatment

Average Treatment Effect on the Untreated (ATU)

ATU measures the average difference in outcomes for the group that didn't receive the treatment compared to what their outcome would have been if they had received the treatment

What does the Average Treatment Effect (ATE) measure? A. The difference between potential outcomes for a single individual. B. The average difference between potential outcomes across a population. C. The percentage of people who respond positively to a treatment D. The overall effect size in standard deviation units

B - The average difference between potential outcomes across a population. The ATE is not about any single individual's difference (as we can't directly observe those). Instead, it summarizes the average expected effect if the treatment was applied to a relevant population.
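A small simulation makes this concrete: we can generate both potential outcomes for every unit (something reality never allows), then check that the randomized difference-in-means recovers the ATE (stdlib Python; the numbers are hypothetical):

```python
import random
import statistics

random.seed(42)
n = 20_000
# Simulate BOTH potential outcomes per unit -- only possible in a simulation;
# in reality we observe exactly one of (Y0, Y1) for each unit.
y0 = [random.gauss(50, 10) for _ in range(n)]
y1 = [y + 2 for y in y0]          # a constant true treatment effect of +2
true_ate = statistics.mean(y1) - statistics.mean(y0)

# Randomize, then observe only the potential outcome matching the assignment.
treated = [random.random() < 0.5 for _ in range(n)]
obs_treat = [y1[i] for i in range(n) if treated[i]]
obs_ctrl = [y0[i] for i in range(n) if not treated[i]]
estimate = statistics.mean(obs_treat) - statistics.mean(obs_ctrl)
print(round(true_ate, 3), round(estimate, 3))  # the estimate lands near 2
```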

A researcher is interested in the effect of a new job training program on employment outcomes. Which assumption is crucial for making valid causal claims? A. The participants in the job training program are highly motivated. B. The treatment and control groups are comparable (due to randomization or controlling for confounders). C. The effects of the job training program are the same for everyone. D. None of the above

B - The treatment and control groups are comparable (due to randomization or controlling for confounders). This is the concept of ignorability or unconfoundedness. For causal inference, we need groups that would have similar outcomes if they hadn't received the treatment. Randomization or careful control of other factors helps achieve this.

Sarah took a headache relief pill and her headache went away. If Sarah hadn't taken the pill, her headache might still have gone away on its own. This scenario highlights: A. The Stable Unit Treatment Value Assumption (SUTVA) B. The importance of confounders C. The fundamental problem of causal inference D. The definition of a potential outcome

C - The fundamental problem of causal inference. This highlights the core issue: we can only observe what did happen (Sarah took the pill, headache went away), but not what would have happened if she hadn't taken it. This makes directly isolating the pill's causal effect impossible.

Which of the following is NOT a potential outcome? A. A patient's blood pressure if they receive a new medication. B. A student's test score if they attend a tutoring program. C. The salary of a person who goes to college. D. The crop yield under a new fertilizer treatment.

C - The salary of a person who goes to college. This is only an observed outcome; the potential outcome we're missing is the person's salary if they hadn't gone to college. The difference between these is the causal effect of college on salary.

Multiple Hypothesis Tests

Conducting numerous statistical tests on the same dataset, increasing the likelihood of false positives

Threats to Internal Validity

Factors that can compromise the accuracy of your experimental results, making it difficult to isolate the true impact of the treatment

True or False: If a change is statistically significant, you should always launch it. Explain your reasoning. (Short Answer)

False. Cost: a small improvement might not justify the engineering effort. Side effects: the change might harm other metrics (guardrail metrics would catch this). Strategic fit: even a positive change might not align with long-term business goals.

Give an example of the ATE. (Short Answer)

Imagine a randomized study giving some students a special tutoring program. The ATE would be the average difference in test scores between students who received the tutoring and those who didn't.

Lossy Instrumentation

Inaccurate data collection (like click tracking) potentially introducing bias or sample ratio mismatches

Bad Hash Functions

Inappropriate randomization functions can lead to uneven distribution of users into experiment groups
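A common way to avoid this is to bucket users with a cryptographic hash salted by the experiment name, which gives a deterministic, near-even split. A sketch in stdlib Python (the `assign` helper and experiment name are hypothetical):

```python
import hashlib
from collections import Counter

def assign(user_id, experiment, n_buckets=2):
    """Deterministically hash a user into a bucket, salted by the
    experiment name so assignments are uncorrelated across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

counts = Counter(assign(uid, "exp_checkout") for uid in range(100_000))
print(counts)  # close to an even 50/50 split
# Determinism: the same user always lands in the same bucket.
assert assign(12345, "exp_checkout") == assign(12345, "exp_checkout")
```

A weak or unsalted hash can produce exactly the uneven or correlated splits this card warns about, which an SRM check would then flag.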

In the provided example in the A/B book, why is "revenue-per-user (who started checkout)" a better OEC than total revenue?

It focuses on the most relevant audience: Users who started checkout are the only ones potentially impacted by the change. Including everyone dilutes the results with noise from unaffected users. Total revenue can fluctuate due to external factors unrelated to the experiment (e.g., overall traffic increase). Revenue-per-user normalizes for such variations.

Residual or Carryover Effects

Lasting effects from previous experiments or treatments that can contaminate current results

Match each term to its definition: Statistical Significance, Practical Significance, Confidence Interval. a) The range within which we are confident the true difference between variants likely lies. b) An observed difference unlikely to occur by chance, assuming the null hypothesis is true. c) The magnitude of change in a metric that matters from a business perspective.

Statistical Significance - b: p-values and confidence intervals measure how likely an observed difference would be under the null hypothesis; a low p-value suggests the difference is unlikely to be due to chance alone. Practical Significance - c: this considers whether the change is big enough to be meaningful for the business, beyond statistical chance. Confidence Interval - a: this range around the observed change indicates where the true difference likely lies with a certain level of confidence (e.g., 95%).

Statistical Power

The ability to detect a meaningful difference if it exists

Unconfoundedness

The assignment of treatment is independent of the potential outcomes, either naturally (like in a randomized controlled experiment) or after controlling for observable variables (confounders). This implies no hidden biases that affect both treatment receipt and outcomes.

Average Treatment Effect (ATE)

The average causal effect across the population or a relevant subpopulation. This is the key focus of most causal inference methods, as it offers a broader effect estimate

Causal Effect

The difference between the potential outcomes for a single unit. Because we can never observe both Y1 and Y0 simultaneously for the same unit, this individual-level causal effect is fundamentally unobservable.

Randomization Unit

The entity (e.g., user, session) that is randomly assigned to a test condition

Fundamental Problem of Causal Inference

The fact that we can only observe one potential outcome for each individual (the one corresponding to their received treatment), making it impossible to directly calculate individual causal effects.

SUTVA Violations in Social Networks, Marketplaces, etc

The interconnected nature of these environments can violate the assumption that participants do not affect each other

Treatment

The intervention or exposure whose effect we want to study. This could be a medical treatment, a policy change, an educational program, etc.

Stable Unit Treatment Value Assumption (SUTVA)

The potential outcome for one unit is unaffected by the treatment assignment of other units (no interference or spillover effects). The treatment has a consistent, well-defined version (no variations that would lead to differing effects).

Peeking at p-values

The practice of continuously monitoring p-values in an online experiment, leading to inflated false positive rates
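The inflation is easy to demonstrate by simulation: under a true null, stopping as soon as any interim look is "significant" rejects far more often than a single pre-planned look (stdlib Python sketch; the peek schedule and `two_sided_p` helper are arbitrary illustrations):

```python
import math
import random
import statistics

def two_sided_p(a, b):
    """Large-sample z-test p-value for a difference in means."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(1)
runs, peeks, batch = 200, 10, 100
fp_fixed = fp_peek = 0
for _ in range(runs):
    a, b = [], []
    hit = False
    for _ in range(peeks):                    # data arrives in batches
        a += [random.gauss(0, 1) for _ in range(batch)]
        b += [random.gauss(0, 1) for _ in range(batch)]
        if two_sided_p(a, b) < 0.05:
            hit = True                        # "stop and ship" on any peek
    fp_peek += hit
    fp_fixed += two_sided_p(a, b) < 0.05      # one pre-planned final look

print(fp_fixed / runs, fp_peek / runs)  # peeking inflates the false positive rate
```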

OEC (Overall Evaluation Criterion)

The primary metric for evaluating experiment success

Propensity Score

The probability of receiving the treatment, often calculated based on observed characteristics. Used in methods like propensity score matching to help create more comparable groups

Potential Outcomes

These are the hypothetical values of an outcome variable that an individual or unit would experience under different treatment conditions. Y1: The outcome if the unit is treated. Y0: The outcome if the unit is not treated.

True or False: If we perfectly randomize participants to a treatment or control group, we can always assume that their potential outcomes are independent of their treatment assignment.

True. Proper randomization breaks any systematic link between the treatment assignment and individuals' characteristics (and thus their potential outcomes). This is why randomized experiments are the gold standard for causal inference.

True or False: AA tests should always be run before launching a new experiment.

True. AA tests act as a control to ensure the experimentation platform is working correctly before launching an A/B test with variations. They identify potential issues that could skew later results.

True or False: The Stable Unit Treatment Value Assumption (SUTVA) states that an individual's participation in a study doesn't affect other people's outcomes.

True. The Stable Unit Treatment Value Assumption (SUTVA) is a core principle assuming participants in an experiment don't influence each other's outcomes.

Name two types of guardrail metrics and provide one example of each. (Short Answer)

Trust-related guardrails. Example: ensure an even split of users across control and test groups; unequal splits mean the experiment might be flawed. Organizational guardrails. Example: monitor site-wide latency; your experiment shouldn't significantly slow down the site, impacting all users.

Confounders

Variables that affect both the treatment assignment and the outcome, potentially obscuring the true causal relationship

What is the reference group considered in an indicator variable?

The reference group serves as the baseline for comparison when interpreting the coefficients of the other indicator variables.

What is the purpose of a "fake door" or "painted door" test? a) To fully launch a new feature and measure its impact on all users. b) To determine customer preferences before building a new feature. c) To assess potential user behavior changes without building the full feature. d) To introduce new visual designs on a live website.

c) To assess potential user behavior changes without building the full feature. Fake door tests help gather data on user interest and likely behavioral impact, minimizing investment in features that may not be successful before full implementation.

Guardrail metrics

secondary metrics you monitor alongside your primary experiment metric (OEC) to ensure the validity of your experiment and to catch unexpected side effects of your change

