STAT 212

Confidence Interval (Ch. 7)

An interval centered around our sample statistic for which we are fairly confident that our parameter of interest lies within the interval.
Sample mean & standard error:
-Approximately 68% of all possible sample means will be within 1 standard error
-Approximately 95% of all possible sample means will be within 2 standard errors
-Approximately 99.7% of all possible sample means will be within 3 standard errors
Z-value representing the number of standard errors:
-90% of all possible sample means fall within 1.65 standard errors of the mean
-95% of all possible sample means fall within 1.96 standard errors of the mean
-98% of all possible sample means fall within 2.33 standard errors of the mean
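
A minimal Python sketch (hypothetical numbers, not course data) showing how these z-values turn a sample mean and standard error into intervals:

```python
import math

# Hypothetical sample summary
x_bar = 50.0   # sample mean
s = 10.0       # sample standard deviation
n = 100        # sample size

se = s / math.sqrt(n)  # standard error of the mean

# z-multipliers from the card: 90% -> 1.65, 95% -> 1.96, 98% -> 2.33
for conf, z in [(90, 1.65), (95, 1.96), (98, 2.33)]:
    lower = x_bar - z * se
    upper = x_bar + z * se
    print(f"{conf}% CI: ({lower:.2f}, {upper:.2f})")
```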

Parsimonious

A model that contains as few predictors as possible while still explaining a reasonable percentage of the variance in the response variable.
-Even if our adjusted r^2 value had gone up slightly, we shouldn't necessarily keep both predictors. The question is whether the amount of additional variability explained is really worth adding an extra layer of complexity to our model.

Dummy Variable

A variable taking only the values 0 and 1, used to represent the 2 different categories of a qualitative variable.
-Use the term "dummy" b/c the #s don't bear any meaning; they are just representative of "yes" and "no".
Slope: If the (dummy variable) equals 1, then we would expect (response) to increase/decrease by (dummy variable coefficient) as the other predictors are held constant.
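
A quick sketch (hypothetical data, assuming pandas) of building a dummy variable by hand:

```python
import pandas as pd

# Hypothetical qualitative variable with 2 categories
df = pd.DataFrame({"smoker": ["yes", "no", "yes", "no"]})

# Encode "yes"/"no" as 1/0 -- the numbers themselves carry no meaning
df["smoker_dummy"] = (df["smoker"] == "yes").astype(int)
print(df)
```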

Absolute Risk Reduction & Relative Risk ("Risk Ratio") - Ch. 8

Absolute Risk Reduction: reports the absolute value of the difference b/w the risk for the treatment group and the risk for the control group.
-Will always be very small in low-risk situations, so the treatment's impact will be hard to see.
Relative Risk:
-Compares risk as a ratio, so the values are compared proportionally.
-Divide the treatment group's risk by the control group's risk.
-If the relative risk is below 1, risk is being reduced; relative risk above 1 means risk is being increased.
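
A small sketch with made-up counts showing both calculations:

```python
# Hypothetical counts, not textbook data
treatment_events, treatment_n = 10, 1000   # risk = 0.01
control_events, control_n = 40, 1000       # risk = 0.04

risk_treatment = treatment_events / treatment_n
risk_control = control_events / control_n

# Absolute risk reduction: absolute difference b/w the two risks
arr = abs(risk_treatment - risk_control)   # 0.03 -- small in low-risk settings

# Relative risk: treatment risk divided by control risk
rr = risk_treatment / risk_control         # 0.25 -> below 1, risk reduced

print(f"ARR = {arr:.3f}, RR = {rr:.3f}")
```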

Experiments vs. Observational Studies

Experiments: Usually more intensive to coordinate and costly to carry out. They may also be difficult to conduct depending on whether the supposed causal agent is ethical or even possible to administer.
-Controlled intervention: apply a "treatment factor" and observe a response on the participants or units of measurement.
Observational Studies: Often much easier to conduct b/c it may be as simple as accessing existing data, or collecting data w/o coordinating interventions. But they may also lose quality w/o the same controlled intervention.
-Can investigate a potential cause-and-effect relationship, but w/o a controlled intervention it's difficult to make causal claims.
-Often just stick to noting that 2 variables have an associative relationship.

Confidence Intervals for Risk (Ch. 8)

Ex: relative risk was 0.287, and a 95% confidence interval for relative risk is calculated to be (0.196, 0.424). We are 95% confident that the risk of contracting polio with the Salk vaccine is b/w 19.6% and 42.4% of the risk of contracting polio with the placebo injection.
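
The card doesn't show how the interval is computed; one common approach is the log method sketched below. The counts here are hypothetical, chosen so that RR comes out near the 0.287 in the example:

```python
import math

# Hypothetical 2x2 counts (events, group size) -- not the actual trial data
a, n1 = 33, 200_000    # treatment group
b, n2 = 115, 200_000   # control group

rr = (a / n1) / (b / n2)   # about 0.287

# Large-sample standard error of ln(RR) under the usual log method
se_log_rr = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)

z = 1.96  # 95% confidence
lower = math.exp(math.log(rr) - z * se_log_rr)
upper = math.exp(math.log(rr) + z * se_log_rr)
print(f"RR = {rr:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```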

Measuring Risk (Ch. 8)

Risk = the probability of an undesired outcome for a particular population (disease contraction, death). -Can be assessed from a general standpoint, or as a conditional probability.

Residual

The vertical distance from a data point to the regression line. Measures the error b/w our best-fit line and an individual data point.
-We use the notation y and ŷ to distinguish b/w an actual data point and a predicted data point for the y variable at a given predictor variable value.
-ŷ_i is the predicted y-value given that x_i is the observed x-value; (x_i, ŷ_i) is a point on the line.
-y_i is the actual y-value paired up with x_i; (x_i, y_i) is an observed data point.
*We define the residual for observation i as y_i - ŷ_i
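
A short sketch (hypothetical data and a hypothetical fitted line) computing residuals exactly as defined above:

```python
import numpy as np

# Hypothetical data and a hypothetical fitted line y_hat = 2 + 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.5, 7.8, 11.2, 13.9])   # observed y_i

y_hat = 2 + 3 * x                      # predicted y_hat_i on the line

residuals = y - y_hat                  # residual_i = y_i - y_hat_i
print(residuals)
```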

Statistical Modeling

Using data to predict most likely outcomes of a random variable, often under different conditions

Equation of a line

y = 74 - 8x
-Slope tells you the rate at which the response variable changes with respect to unit changes in the predictor. Ex: For every one-unit increase in (predictor), we expect (response) to increase/decrease by ##.
-Intercept provides you a starting point/positional reference - the model's approximation for the response variable value when the predictor value is at 0 (typically not a meaningful # to interpret on its own).
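
A tiny sketch evaluating the card's example line to make the slope and intercept interpretations concrete:

```python
def predict(x):
    # The card's example line: intercept 74, slope -8
    return 74 - 8 * x

print(predict(0))  # 74 -> the intercept (predictor at 0)
print(predict(1))  # 66 -> each one-unit increase in x lowers y by 8
print(predict(2))  # 58
```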

Association vs. Causation

-Association DOES NOT imply causation.
-In other words, simply b/c we've identified a linear (or some other) relationship b/w variables, we don't actually know if one is the causal agent of the other, especially when we're looking at observational study data.
-We have to go back to the design. Only if the data comes from an experimental design can we safely presume causation.

Hypothesis Testing Assumptions for 2-sample t-test (Ch. 10)

-Each sample is representative of its respective population.
-Each population distribution we are gathering data from is approximately normally distributed, or the sample size we took is sufficiently large to ensure the distribution of possible sample means from samples of that size will be normally distributed (n > 30).
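
Assuming these conditions hold, a minimal sketch (made-up samples, using scipy) of running the test itself; Welch's version, which doesn't assume equal variances, is used here as a common default:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical measurements from two independent groups
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=11.0, scale=2.0, size=40)

# Two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```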

When is it inappropriate to use a correlation coefficient?

-Not all relationships are linear in form. A low correlation coefficient signals no LINEAR relationship, not necessarily no relationship at all.
-For this reason, remember that the correlation coefficient does NOT measure just ANY relationship. We would need polynomial regression methods and non-linear association measures to describe those relationships.
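
A quick demonstration of the pitfall: a perfect quadratic relationship still produces a correlation near 0, because r only measures linear association:

```python
import numpy as np

# A perfect (but non-linear) relationship: y is exactly x squared
x = np.linspace(-3, 3, 101)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(round(r, 4))  # approximately 0.0 despite the perfect relationship
```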

Creating a Multiple Linear Regression Model

-Remember our r^2 measures how much variance in the response variable is being explained by our model. Therefore, I want predictors in my model that have strong relationships with the response variable and, therefore, meaningfully increase r^2 for my multiple regression.

Important Assumptions for Creating a Confidence Interval for Proportions (Ch. 7)

-We need to ensure our sample is collected in such a way that it is representative of the population we wish to generalize to.
-Our sample size is large enough, and the proportion in question not so extreme, as to guarantee the distribution of possible sample proportions will be normally distributed (at least 10 successes and 10 failures).
-When our sample comprises over 10% of the population, we should recognize our confidence interval will overestimate the possible error in our sample statistic.
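
A minimal sketch (hypothetical counts) that checks the success/failure condition and builds the interval:

```python
import math

# Hypothetical sample: 120 successes out of n = 400
successes, n = 120, 400
p_hat = successes / n

# Normality condition: at least 10 successes and 10 failures
assert successes >= 10 and (n - successes) >= 10

se = math.sqrt(p_hat * (1 - p_hat) / n)
z = 1.96  # 95% confidence
print(f"95% CI: ({p_hat - z*se:.3f}, {p_hat + z*se:.3f})")
```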

Important Assumptions for Creating a Confidence Interval for Means (Ch. 7)

-We need to ensure our sample is collected in such a way that it is representative of the population we wish to generalize to. Any external validity threats should be noted as limitations.
-We need to ensure the CLT holds so we can actually create a confidence interval:
-> Population distribution is somewhat normally distributed and fairly symmetric
-> Sample size is sufficiently large (>30)

The relationship b/w the regression line & the correlation coefficient

-When two variables have a perfect correlation, "the standard deviation line" will be equivalent to the line of best fit: Slope = ±Sy/Sx
-But what about a correlation that isn't 0, but also isn't perfect? Slope = r*(Sy/Sx)
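
A small numeric check (hypothetical data) that the slope formula r*(Sy/Sx) matches an ordinary least-squares fit:

```python
import numpy as np

# Hypothetical paired data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
s_y, s_x = np.std(y, ddof=1), np.std(x, ddof=1)

# Regression slope = r * (Sy / Sx), per the relationship above
slope = r * (s_y / s_x)
print(round(slope, 3))

# Sanity check against a least-squares fit -- should match
print(round(np.polyfit(x, y, 1)[0], 3))
```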

Measuring a Linear Relationship

-While "association" is a general term to describe any relationship, we typically reserve the term "correlation" to refer to linear relationships. -In statistics, the Correlation Coefficient is a value b/w -1 & +1 that describes the direction and strength of a linear relationship b/w 2 numeric variables. -Negative values imply that one variable increases in value, the other decreases in value (negative correlation). Positive values imply that as one variable increases, the other variable increases as well (positive correlation).

P-Value vs r^2

-You might loosely think of r^2 like an effect size. It tells you how well the model is fitting.
-The p-value simply tells us if we have evidence for any linear relationship at all (any departure from the null).
-As with hypothesis testing before, a large sample makes it much easier to get a small p-value, even if the effect is small. But r^2 won't be inflated by a large sample size and will appropriately reflect how much variability is explained by the model.
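
A simulation sketch (assuming numpy and scipy) illustrating the point: a huge sample with a tiny effect yields a tiny p-value but an honestly tiny r^2:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Weak linear relationship, huge sample: tiny slope, lots of noise
n = 100_000
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)

res = stats.linregress(x, y)
# p-value is tiny: evidence for *some* linear relationship...
print(f"p = {res.pvalue:.2e}")
# ...but r^2 stays honest about how little variance is explained
print(f"r^2 = {res.rvalue**2:.5f}")   # roughly 0.0004
```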

Chapter 14: Takeaways

1. When building a linear model, check that each predictor variable you add fits the conditions we discussed in Chapter 13. Graph each predictor variable one-on-one with the response variable.
2. Generally speaking, including fewer good predictors is better than including many not-so-good predictors. Pay attention to the adjusted r^2 to determine model strength. If it doesn't go up, the newest predictor should go. If it goes up only a marginal amount, you should question if it's adding much predictive power.
3. Consider using a model selection method (like forwards or backwards selection) if you have 4 or more predictors to test.
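
A small sketch of the standard adjusted r^2 formula with hypothetical numbers, showing how a marginal r^2 gain from an extra predictor can still lower adjusted r^2:

```python
def adjusted_r2(r2, n, k):
    """Adjusted r^2 for a model with k predictors fit on n observations.
    Penalizes added predictors that don't pull their weight."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: a 4th predictor nudges r^2 from 0.700 to 0.705...
print(round(adjusted_r2(0.700, n=50, k=3), 4))  # 0.6804
print(round(adjusted_r2(0.705, n=50, k=4), 4))  # 0.6788 -> went DOWN
```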

Multiple Linear Regression

Creating a model with multiple linear predictor variables.
-We have to assess which predictors actually show evidence of a linear relationship with our response variable.
-We also have the added layer of deciding which set of predictor variables makes the strongest, yet simplest, model.

Observational Studies Examples

Cross-Sectional Studies = Not well designed for causal claims b/c there is no control over confounding factors. Instead, a good analysis relies on the collection of confounding variables and stratification in the analysis.
-Collect data during a specific and limited period of time (survey).
-If the purpose is inferential, researchers are looking to see if a potential condition or result is associated with a potential causal agent (treatment factor).
-Commonly in the form of a one-time survey.
Cohort Studies = First identify a potential causal factor, then track responses over time. These studies can be slightly stronger than cross-sectional since changes can be tracked more carefully if desired, but other inherent confounders cannot necessarily be controlled.
-Identify several groups of people based on a shared characteristic/condition/risk factor. The idea is to see if each cohort responds differently, presumably due to this observed "treatment factor".
-Commonly prospective in form - "looking ahead".
Case-Control Studies = Used when the response is first noted, then an investigation is made into each case for possible causal factors. These are often the design of choice for rare-incidence conditions where finding a particular response in a random sample would be difficult w/o a super large sample.
-Identify people who have had certain responses, and then look to see if there are any common characteristics that might explain that response.
-Commonly retrospective in form - "looking back".

Reviewing Statistical Methods

Independent Samples t-test (Two-Sample t-test - Ch. 10):
-Used when we want to compare the measurements of 2 different groups to decide whether or not we have evidence for a difference in means.
-Use the term "independent" b/c the groups are independent of each other.
Paired t-test (test for dependent samples - Ch. 10):
-Compares 2 sets of measurements where the measurements from each group are paired together.
-Most clearly seen in a pre-post study, where the same person provides a measurement before and after some treatment.
-Could also be done in the context of 2 separate groups of people/units who are purposely matched. This might be done w/ identical twins, or with other people who have a connection or shared attribute.
ANOVA (Ch. 11):
-Use when we have 3 or more groups to compare across some numerical measure.
-Can be used in more complicated situations.
Chi-Square Goodness-of-Fit Test (Ch. 12):
-Involves analysis of 1 categorical variable to determine whether the frequencies recorded for each category reasonably align with the null hypothesis that the frequencies follow a particular distribution.
-This categorical variable could have 2 or more category options.
-To identify a chi-square test, there should be one categorical variable involved - we'll be analyzing whether the frequencies (the counts) for each category align with certain expected frequencies.
Chi-Square Contingency Table Test (Ch. 12):
-"test for homogeneity" or "test for independence"
-Used when we want to know if 2 categorical variables have any relationship with one another.
Simple Regression (Ch. 13):
-Involves 2 numerical (likely continuous or continuous-like) variables and determining whether they might have some type of correlation.
Multiple Regression (Ch. 14):
-Involves 2 or more predictor variables in the process of building a model to predict 1 numerical variable.
-Could involve non-linear relationships b/w predictors and the response variable as well.
Logistic Regression (Ch. 14):
-Involves building a model where we have a binary variable as our response variable and one or more predictor variables.
-Ex = predicting whether one gets accepted into UIUC using their ACT score as a predictor.
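
A rough cheat sheet (assuming scipy; toy random data, not real measurements) mapping the methods above to common scipy.stats calls:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a, b, c = (rng.normal(size=30) for _ in range(3))

print(stats.ttest_ind(a, b))                        # independent two-sample t-test
print(stats.ttest_rel(a, b))                        # paired t-test (two measurements per unit)
print(stats.f_oneway(a, b, c))                      # one-way ANOVA for 3+ groups
print(stats.chisquare([18, 22, 20], [20, 20, 20]))  # goodness-of-fit vs expected counts
print(stats.chi2_contingency([[10, 20], [30, 40]])) # contingency table test
print(stats.linregress(a, b))                       # simple linear regression
# Multiple and logistic regression are typically done with statsmodels
# (sm.OLS, sm.Logit) or scikit-learn rather than scipy.stats.
```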

Interpolation & Extrapolation

Interpolation = predicting Y based on an X value WITHIN the range of X values we have info for.
-When X = 4, 5, 6, 7, or 8, we can use the regression line to safely predict Y when X = 4.50 or some other value b/w 4 & 8.
Extrapolation = predicting Y based on an X value OUTSIDE the range of X values we have info for.
-Consider if we wanted to predict Y for an X value outside our range of observations.
-While making predictions immediately outside the range is generally safe, making predictions well outside it is often unreliable.

Simple Linear Regression (SLR)

Restricted to comparing no more than 2 continuous/discrete variables to determine if there is a linear relationship b/w them.
-The two variables are denoted as X (predictor variable) and Y (response variable) and are plotted against each other on a scatterplot.
-Sometimes in experiments, the predictor (X) variable is what we think might be the causal agent.

Regression

Modeling the mean value of a response variable associated with certain values of one or more predictor variables

Hypothesis Test for the Predictor

Null: There is no linear relationship b/w our predictor variable and response variable. The slope is equal to 0.
Alternative: There is a linear relationship b/w our predictor variable and response variable. The slope is not equal to 0.

Measuring Odds (Ch. 8)

Odds in Favor: measures the likelihood of the condition taking place.
Odds Ratio: the comparison of the odds in favor for each group, in much the same way that relative risk is a comparison of risks b/w groups.
-Divide the odds in favor for the group getting the treatment (or group being exposed to something) by the odds in favor for the control group (group not exposed).
-Will always slightly "exaggerate" the effect in comparison to relative risk.
Interpretation: The odds of getting ovarian cancer for women with this gene, as compared to those w/o, are 20.57 to 1.
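
A small sketch with hypothetical 2x2 counts showing odds in favor, the odds ratio, and how it exaggerates relative to relative risk:

```python
# Hypothetical 2x2 counts: 100 people per group
exposed_yes, exposed_no = 30, 70
unexposed_yes, unexposed_no = 10, 90

# Odds in favor of the condition for each group
odds_exposed = exposed_yes / exposed_no        # ~0.43
odds_unexposed = unexposed_yes / unexposed_no  # ~0.11

odds_ratio = odds_exposed / odds_unexposed     # ~3.86

# Relative risk for comparison -- the OR "exaggerates" the effect
rr = (exposed_yes / 100) / (unexposed_yes / 100)  # 3.0
print(round(odds_ratio, 2), round(rr, 2))
```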

Measuring the Effectiveness of a Test (Ch. 8)

Positive Predictive Value (PPV): % of positive test results that are correct (given positive).
Negative Predictive Value (NPV): % of negative test results that are correct (given negative).
Sensitivity: % of truly positive people who correctly test positive (given has condition).
Specificity: % of truly negative people who correctly test negative (given does not have condition).
*If the patient is truly negative, then their chance of testing negative should match the known specificity of the test.
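
A quick sketch computing all four measures from a hypothetical confusion matrix:

```python
# Hypothetical test results
tp, fp = 90, 30    # positive test results: true vs false
fn, tn = 10, 870   # negative test results: false vs true

ppv = tp / (tp + fp)          # % of positive results that are correct
npv = tn / (tn + fn)          # % of negative results that are correct
sensitivity = tp / (tp + fn)  # % of truly positive people who test positive
specificity = tn / (tn + fp)  # % of truly negative people who test negative

print(f"PPV={ppv:.2f} NPV={npv:.2f} Sens={sensitivity:.2f} Spec={specificity:.2f}")
```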

Experiments Examples

Randomized Control Trials = typically the most trusted designs - participants are randomly assigned to a treatment factor and the intervention is controlled.
-Blind sorting.
Blocking = an alternative sorting method used when random assignment alone can introduce some group composition biasing; it is usually needed when our group sizes are not large (certainly less than 20, and occasionally with large group sizes too).
-Researchers may split up participants into blocks, and then randomly assign participants from each block to the groups to ensure proportional blocks in each group.
Matched Pairs/Pre-Post Design = creates a more statistically efficient design by reducing within-group variability, but requires careful pairing (or, in the case of pre-post measurements, can introduce confounding like test familiarity or placebo effects).
-A more efficient design than a randomized control trial, but naturally has more validity threats, depending on the context.

Assumptions and Cautionary Notes with Regression

Representative:
-As always, if trying to use our sample results to make a claim about how 2 variables relate in a greater population, then our sample should be representative of that population!
Linear Relationship Seems Appropriate:
-This may seem like an obvious point, but do check that the data you are analyzing with a linear model actually looks linear.
-Never run a regression on 2 variables w/o first graphing the data and looking at the scatterplot. If it doesn't look linear, you might want to solicit help for another type of regression.
Homoscedasticity (Equal Variance):
-If your scatterplot looks like a cone shape, then your variable is heteroscedastic. This means your regression analysis will produce unreliable values.
-You need advanced statistical methods (data transformation) to create a new form of the variable that will be homoscedastic.
Multivariate Normality (or very large sample size):
-This is the regression equivalent of the normality or sufficiently-large-sample-size requirements we've needed before. The idea is that the residuals from your SLR should be approximately normally distributed.
-A visual test with the scatterplot and fitted line, with "residual plots" or "QQ plots" (see the sketch after this list), can give you a sense of whether there's a potential problem. A normality-of-residuals test may also give you a more objective measure of whether normality is a reasonable assumption.
-Alternatively, use n > 100 as a minimum benchmark to ensure the sampling distribution of the residuals for any given x value is approximately normally distributed.
Independence in Residuals (No Autocorrelation):
-This is usually a concern when completing a "Time Series" model, where variability at X + 1 will be dependent on and related to variability at X. This may also be the case when X is a locational variable, where the variability in data points correlates with position and nearby positions are obviously related.
Watching For Outliers:
-In regression, we can have outlier data points on our scatterplot, but there are special types of outliers to be aware of.
-An influential point is an outlier that can have a VERY strong effect on the best-fit line, often making an otherwise "insignificant" relationship look "significant" in our model summary.
-In general, influential points will be outliers that exist near the corners of the graph.
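
A sketch of the residual plot and QQ plot mentioned above (hypothetical data, assuming matplotlib and scipy are available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 3 + 2 * x + rng.normal(scale=2, size=200)  # hypothetical linear data

slope, intercept, *_ = stats.linregress(x, y)
residuals = y - (intercept + slope * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residual plot: look for cone shapes (heteroscedasticity) or curvature
ax1.scatter(x, residuals, s=10)
ax1.axhline(0, color="red")
ax1.set(xlabel="x", ylabel="residual", title="Residual plot")

# QQ plot: points should hug the line if residuals are roughly normal
stats.probplot(residuals, dist="norm", plot=ax2)

plt.tight_layout()
plt.show()
```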

Hazard Ratio (Ch. 8)

Similar to relative risk, but with one additional element: it can be calculated to find the relative risk within any particular time range.
-Calculated by finding the # of new events (infections/deaths) for the treatment group in that time period out of the total unaffected cases, DIVIDED BY the # of new events (infections/deaths) for the comparison group in that time period out of the total unaffected cases.
HR = (T: # of new cases/# remaining)/(C: # of new cases/# remaining)
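
A tiny sketch of the HR formula above with made-up counts for one time period:

```python
# Hypothetical counts for a single time period
treat_new, treat_remaining = 5, 500        # new events / still-unaffected, treatment
control_new, control_remaining = 20, 480   # new events / still-unaffected, comparison

hr = (treat_new / treat_remaining) / (control_new / control_remaining)
print(round(hr, 3))  # 0.24 -> event rate reduced in this period
```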

Regression Slope

Simple Linear Regression: For every unit increase in (predictor), we expect (response) to increase/decrease by (predictor slope value).
Multiple Linear Regression: For every unit increase in (predictor), we expect (response) to increase/decrease by (predictor slope value) as the other predictors are held constant.

The coefficient of determination

The % of the variability in the response variable that is "explained" by the predictors in our model (and will always be a value b/w 0% and 100%).
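
A small sketch (hypothetical data) computing it as 1 minus unexplained variability over total variability:

```python
import numpy as np

# Hypothetical data with a least-squares fitted line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# r^2 = 1 - (unexplained variation) / (total variation)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # always b/w 0 and 1 for a least-squares fit
```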

Multicollinearity

When multiple predictor variables are, themselves, highly correlated and explain mostly the same variability in the response variable.
-Consider an example where we have a response variable "Murder Rate" and 2 predictor variables: "Assault Rate" and "Rape Rate". Individually, each of these variables is a reasonable linear predictor of Murder Rate.
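
One common diagnostic (not named on the card, added here as context) is the variance inflation factor: regress each predictor on the others and compute 1/(1 - r^2). A sketch with hypothetical correlated predictors:

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor for column j of predictor matrix X.
    High values (often > 5 or 10) flag multicollinearity."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # add intercept
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    fitted = A @ beta
    r2 = 1 - np.sum((X[:, j] - fitted) ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

# Hypothetical, strongly related predictors (e.g., assault and rape rates)
rng = np.random.default_rng(4)
assault = rng.normal(size=100)
rape = assault + rng.normal(scale=0.3, size=100)
X = np.column_stack([assault, rape])
print([round(vif(X, j), 2) for j in range(2)])  # both well above 1
```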

