Stats Final - Oxford

The Sample Average Treatment Effect (SATE)

(1/n)∑{Yi(1) − Yi(0)} summarizes the SATE - Yi(1) is the outcome of interest with the treatment, Yi(0) the one without the treatment - We take the difference between the two potential outcomes for every observation and average those differences - So for SATE you are taking the difference between the treated and untreated outcome for each unit, summed together, over the total number of cases - This is possible to estimate if you can randomize - Randomized controlled trials (RCTs) are the gold standard, especially when they are double-blind, because they address the placebo and Hawthorne effects - Randomization lets us infer the counterfactual outcome
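
A minimal sketch in R, using simulated potential outcomes (the vectors y1 and y0 and the effect size are made up), of what the SATE formula computes and of how randomization lets a simple difference in means estimate it:

```r
set.seed(42)
n  <- 1000
y0 <- rnorm(n, mean = 50, sd = 10)   # hypothetical outcomes without treatment
y1 <- y0 + 5                         # hypothetical outcomes with treatment (true effect = 5)

# SATE: average of the unit-level differences Yi(1) - Yi(0)
sate <- mean(y1 - y0)

# In practice we only observe one potential outcome per unit;
# randomization makes the difference in means an unbiased estimate of SATE
treat <- sample(c(0, 1), n, replace = TRUE)
y_obs <- ifelse(treat == 1, y1, y0)
diff_in_means <- mean(y_obs[treat == 1]) - mean(y_obs[treat == 0])

c(SATE = sate, estimate = diff_in_means)
```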

Boxplots

- A way of characterizing the distribution of a continuous variable, summarizing several aspects of that distribution at once - It shows the median, lower quartile, upper quartile, IQR, max/min, and outliers - The whiskers extend either to the max and min, or to the most extreme points within 1.5 x IQR of the quartiles, whichever comes first - Boxplots are useful for comparing a variable across multiple sub-groups (for example, education levels across different provinces in Afghanistan)
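
A quick R illustration of grouped boxplots, using the built-in ToothGrowth data as a stand-in for the education-by-province example:

```r
# One boxplot of the continuous variable per sub-group
boxplot(len ~ dose, data = ToothGrowth,
        xlab = "Dose", ylab = "Tooth length")
```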

Alternative Hypothesis

- The alternative (H1) contradicts the null hypothesis - It is a statement that the parameter falls into a different set of values than those predicted by H0 - Regarding ideology: old people are more right wing than young people - This is the contradiction of the null

Conditions for a Control Being Included in a Regression

- Controls that satisfy cov(C, Y) ≠ 0 and cov(C, X) ≠ 0 can eliminate bias - Be careful how you include those controls: make sure they are predetermined and that they can predict both X and Y - Make sure you don't control on outcomes of X

Correlation

- Correlation scales the covariance by the standard deviations of X and Y - So r = Sxy / (Sx Sy) - This means −1 ≤ r ≤ 1 - Correlation is a measure of the strength of the relationship between two variables - It is sometimes called the Pearson correlation, after Karl Pearson, who developed it

Tight Distributions

- Even if the sampling distribution is normal, some are 'tighter' than others, meaning they give more accurate estimates - If a sampling distribution is tighter, its standard deviation is smaller, so the ranges spanned by 1, 2, or 3 S.D.'s are much narrower and the room for error is much lower - Sampling distributions that are tightly clustered will give us a more accurate estimate on average than those that are more dispersed

Sampling Distributions

- If we took lots of samples we would get a distribution of sample means, or the sampling distribution - The sampling distribution of a statistic (in this case the mean of our sample) is the probability distribution that specifies the probability of the possible values that the statistic can take - It so happens that this sampling distribution (the distribution of sample means) is normally distributed - Moreover, if we took lots of samples then the distribution of the sample means would be centered around the population mean - Due to averaging, the sample means do not vary as widely as the individual observations

Probability Sampling

- It helps ensure that the sample is representative of the population - In this approach, every unit of the population has a known, non-zero probability of being selected into the sample

A useful Rule of Probability

- It is often easier to calculate P(A) as 1 − P(not A) than to calculate P(A) directly - Example: if P(Voting) = 0.6, then P(Not Voting) = 1 − 0.6 = 0.4

Difference-In-Difference Strategy

- Key idea: use the PA before-and-after difference to figure out what would have happened in NJ without the increase - The NJ before-and-after difference addresses within-state confounding, and inferring the counterfactual via Pennsylvania addresses temporal (time-varying) confounding - This relies on the parallel time trend assumption: absent the law, the change in NJ would have paralleled the change in PA - This allows us to estimate the sample average treatment effect for the treated (SATT) - So the estimate is the NJ before-and-after change in employment minus the PA before-and-after change - Assumption: parallel time trends (NJ, absent the treatment, would have seen a trend in employment parallel to PA's) - This design accounts for both unit-specific and time-varying confounders
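
A hedged sketch of the difference-in-differences arithmetic in R, using made-up employment numbers for NJ and PA (illustrative only, not the original study's data):

```r
# Hypothetical average employment per store
nj_before <- 20.4; nj_after <- 21.0
pa_before <- 23.3; pa_after <- 21.2

# DiD estimate: NJ change minus PA change (PA supplies the counterfactual trend)
did <- (nj_after - nj_before) - (pa_after - pa_before)
did
```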

Measures of Central Tendency

- Most common is the mean, which is the average - The median is more robust to outliers - The mode is the most common value

Link Between Normal Distribution and Sampling

- Normal distributions are relevant because the distribution of sample means is normally distributed - Sampling is the process by which we select a portion of observations from all the possible observations in the population - This allows us to learn about the larger population without going to all the expense and time of observing every single individual in the population - In practice we face practical and budget constraints - Done correctly, sampling can provide us with a much, much cheaper alternative to a complete census at a small accuracy cost - However, there are many sources of bias that can arise and may lead a researcher to draw the wrong conclusions about the population from his or her sample - What we want to estimate is the population mean (Θ) → but this is unobservable for the population - So we use the data to compute an estimate of it, the sample mean (Θ(hat))

The Three Central Axioms of Probability Theory

- Probability of any event is non-negative - P(A) ≥ 0 - The probability that one of the outcomes in the sample space occurs is 1 - P(Ω) = 1 - Addition rule: if events A and B are mutually exclusive, then - P(A or B) = P(A) + P(B) - Mutually exclusive events have no overlap between them (they can't happen simultaneously), compared to non-mutually exclusive events which do have overlap

Interpreting C.I.

- The probability that the true value is in a particular confidence interval is either 0 or 1 - Confidence intervals are random, while the truth is fixed - Thus, with repeated sampling, 95% of 95% confidence intervals around the estimate will contain the population parameter - A 95% confidence interval does not mean that the population parameter has a 95% probability of lying within the particular confidence interval that we calculated - All of these confidence intervals ignore non-sampling error (i.e. response bias and non-response bias)

Sample Statistics

- Sample statistics are estimates of population parameters - For a statistic to be useful, the sample needs to be representative of the population - A sample thus aims to capture a representative grouping of the population to make inferences about that population

Options for Visualizing Bivariate Relationships

- Scatterplots - Correlation Coefficients - QQ Plots

Notation for Normal Distribution

- The mean and standard deviation of a normal distribution describe its particular shape - These are called the parameters of the normal distribution - A particular normal distribution can be represented by the following notation - N(μ, σ²) - So to describe a distribution with a mean of 20 and an S.D. of 5 you would write N(20, 25), since the variance is 5² = 25

Properties of the correlation coefficient

- The order we compute this in doesn't matter - cor(x,y) = cor(y,x) - It isn't affected by changes in scale as long as these changes are consistent - Correlation measures linear association - Thus, non-linear relationships sometimes aren't well represented by correlation coefficients (as seen with data that is U-shaped) - Thus, you should plot the data you are computing correlations for, because otherwise you might miss strong relationships that are just non-linear

Z-Scores

- The z-score for a value xi of some variable x is the number of standard deviations that xi falls from the mean of x - So Z = (xi − mean of x) / (standard deviation of x) - For any normal distribution, the probability of falling within z standard deviations of the mean is the same, regardless of the distribution's standard deviation - For 1 s.d. the probability is 0.68 - For 2 s.d. it is 0.95 - For 3 s.d. it is ~0.997 - This is true for any normal distribution - For any value of z (not just whole numbers) there is a corresponding probability that an observation in a normally distributed population will fall within z standard deviations of the mean - these probabilities are listed in z-tables
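
A short R check of these probabilities using pnorm(), the cumulative probability of the standard normal:

```r
# Probability of falling within z standard deviations of the mean
within <- function(z) pnorm(z) - pnorm(-z)
round(c(z1 = within(1), z2 = within(2), z3 = within(3)), 3)
# roughly 0.683, 0.954, 0.997
```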

Correlation Coefficients

- These allow us to characterize the way that two variables move together - The mathematical definition is the mean of the products of the z-scores - This tells us how, on average, two variables move together - It produces a number between −1 and 1 - A positive number means that the variables increase or decrease together - A negative number means the opposite: if one increases the other decreases - This can be done in R with cor()
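
A minimal R sketch (with simulated x and y) showing that the mean of the products of the z-scores matches cor():

```r
set.seed(1)
x <- rnorm(100)
y <- 0.6 * x + rnorm(100)

zx <- (x - mean(x)) / sd(x)   # z-scores of x
zy <- (y - mean(y)) / sd(y)   # z-scores of y

# Mean of the products of the z-scores (dividing by n - 1 to match the sample formula)
sum(zx * zy) / (length(x) - 1)
cor(x, y)                     # same value
```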

Scatterplots

- These are direct comparisons of two different variables for the same units (within the same set of data, with the same observations) - For example, comparing years of education with age - This can be good for showing relationships between variables by the general shape of the data cloud - it can thus indicate correlation

Random Variable

- This is a variable that assigns a number to an event - So for flipping a coin it would assign 1 to a heads flip and a 0 to a tails flip - Voting results could be assigned with random variables into a 1 and 0 as well - These variables can be discrete or continuous

Histograms

- This visualizes the distribution of a continuous (interval) variable - one that can take any number of values on a scale - To create a histogram by hand you create bins along the variable of interest and count the number of observations in each bin - The height of each bin is its density - Density = proportion of observations in the bin / bin width - What is density? The areas of the blocks in a histogram sum to 100% - the height doesn't tell us the percentage of observations in each bin, it tells us that percentage divided by the bin width - It is, more generally, the percentage per horizontal unit - Thus, if the bin width is 1, the height of the bin equals the proportion in that bin
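
A small R illustration of the density scale, using a simulated variable (the data and bin choice are arbitrary):

```r
set.seed(2)
age <- rnorm(500, mean = 40, sd = 12)   # hypothetical continuous variable

# freq = FALSE puts the histogram on the density scale
h <- hist(age, breaks = 20, freq = FALSE)

# Check: density x bin width = proportion in each bin, and these proportions sum to 1
sum(h$density * diff(h$breaks))
```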

Using Categorical Variables with Many Groups

- What about when a categorical variable doesn't have two groups but many - like a person being single, married, divorced, or widowed (coded 1-4)? - If we're comparing life satisfaction based on this status, how can the regression coefficient be interpreted? - If x is measured in several non-ordered categories, you first need to convert the categories into dummy variables - The number of dummy variables to include for k categories is k − 1 - So for marital status you would create three dummies - divorced or not, widowed or not, single or not - And the predicted life satisfaction is β0 plus each of these variables' β - Here, married is the omitted category (because of the k − 1 rule) - In this case, β0 represents the best predicted value for married people because this is the excluded category - not being divorced, widowed, or single means you must be married - Each slope then represents the average difference in life satisfaction between married people and the category in question - So in LifeSatisfaction = β0 + β1Divorced + β2Widowed + β3Single + μi - β0 is the average level of life satisfaction for married people - β1 is the average difference in life satisfaction between divorced and married people - β2 is the average difference in life satisfaction between widowed and married people - For any one observation, when one dummy equals 1 the others equal 0, because the categories are mutually exclusive - see the sketch below
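
A hedged R sketch on simulated data (the variable names lifesat and marital are made up) showing how factor() creates the k − 1 dummies automatically and how the omitted category becomes the intercept:

```r
set.seed(3)
marital <- factor(sample(c("Married", "Divorced", "Widowed", "Single"),
                         200, replace = TRUE))
marital <- relevel(marital, ref = "Married")   # make Married the omitted category
lifesat <- 7 - 1.0 * (marital == "Divorced") -
               0.5 * (marital == "Widowed") -
               0.8 * (marital == "Single") + rnorm(200)

# lm() expands the factor into k - 1 = 3 dummies;
# the intercept is the predicted life satisfaction for the married (omitted) group,
# and each coefficient is the average difference from that group
summary(lm(lifesat ~ marital))
```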

Central Limit Theorem

- What is the distribution of the sample mean X(bar)n when X is not normally distributed? - The approximate (asymptotic) distribution of X(bar)n is also normal - If the sample size is large(ish), the distribution of sample means (the sampling distribution) is approximately normal - This is true regardless of the shape of the population distribution - As n (the sample size) increases, the sampling distribution looks more and more like a normal distribution - This is called the central limit theorem - Thus, when many samples are taken, the sampling distribution is approximately normal independent of the parent population - This becomes more and more accurate the larger the sample sizes are - This plays a large role in calculating uncertainty in the social sciences - The central limit theorem states that the sampling distribution of the mean of any random variable will be normal or nearly normal, if the sample size is large enough.
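
A quick CLT simulation in R with a skewed (exponential) parent distribution, chosen here only for illustration; the sample means come out approximately normal even though the population is not:

```r
set.seed(4)
# Skewed parent population: exponential with mean 1
sample_means <- replicate(5000, mean(rexp(n = 100, rate = 1)))

hist(sample_means, breaks = 40, freq = FALSE,
     main = "Sampling distribution of the mean (n = 100)")
mean(sample_means)   # close to the population mean of 1
sd(sample_means)     # close to the theoretical SE of 1 / sqrt(100) = 0.1
```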

Sample Mean

- What we want to estimate is the population mean (Θ) → but this is unobservable - So we use the data to compute an estimate of it, the sample mean (Θ(hat))

Post-Treatment Bias

- When we include controls that are realized after X and that can be caused by X, this is called post-treatment bias - Thus, we want our controls to predict both X and Y but not be affected by them

Standardized Coefficient

- You can standardize findings across different measurements by comparing differences in standard deviations (SD) instead of in the measure itself - this multiplies the slope by (SD(x) / SD(y)) - This shows how many SDs Y changes for a one-SD change in X - This is known as the standardized coefficient - It helps you compare two variables measured in different units - it helps you judge whether one measure is truly more strongly associated than another, since it focuses not on a one-unit change in x but on a one-SD change in x

Note on Control for Experiments

- You don't need to include controls because people's conditions are randomly assigned - but we still may want to include controls (not because they will correlate with X) but because doing so might increase the precision of our estimate of β1 - The greater the part of the TSS you explain, the smaller the RSS, and hence the higher the precision of β1(hat)

The Two Key Questions of Statistical Inference

1) How would X(bar) behave as the sample size increases? - The law of large numbers means that as the sample size increases, the sample mean of X will converge towards p - this is known as consistency - But how large is large enough? - 2) How would X(bar) behave over repeated data generating processes? - Consider the hypothetical where you conduct a survey under the same conditions again and again - your expectation is that the average of the sample means will equal p (E(X(bar)) = p) → this is known as unbiasedness

Assumptions of OLS

1) Linearity - which relates to bias and efficiency 2) Random Sampling 3) No Perfect Collinearity (multicollinearity) 4) Zero Conditional Mean 5) Homoskedasticity 6) Normality of the Error Term - The key OLS assumptions for unbiasedness and efficiency are: 1) random sampling, 2) no perfect collinearity, 3) zero conditional mean, 4) homoskedasticity, and 5) normality of the error term. Checking that these assumptions hold for our models is crucial. If they do not, we may get biased or inefficient estimates of the coefficients and/or standard errors. When we find that these assumptions are violated, however, we can make adjustments to our model specification to help us make more valid inferences - which we discuss in this worksheet too!

The Regression Work Flow

1) Start With Theory 2) Contrast Hypotheses 3) Think about model 4) Estimate Parameters 5) Interpret the Coefficients 6) Inference

How do we calculate the confidence interval?

1. Calculate the standard error of the estimate of the population mean - The standard error = the standard deviation divided by the square root of the number of observations - SE = sd / √n
2. Calculate the margin of error - Margin of error = the standard error × the z-score associated with the confidence level of our choice - A z-score is the number of standard deviations a data point falls away from the mean
Note: when you have multiple samples and want to describe the standard deviation of those sample means (the standard error), you need the z-score that tells you where the score lies on a normal distribution curve. In other words, it shows how many standard errors there are between the sample mean and the (unknown, estimated) population mean - A z-score of 1 is 1 standard deviation above the mean value of the reference population (a population whose known values have been recorded), etc.
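
A minimal R version of these two steps on a simulated sample (the 95% z critical value of 1.96 is assumed):

```r
set.seed(5)
x <- rnorm(400, mean = 10, sd = 3)   # hypothetical sample

se  <- sd(x) / sqrt(length(x))       # step 1: standard error = sd / sqrt(n)
moe <- 1.96 * se                     # step 2: margin of error = z x SE (95% level)
c(lower = mean(x) - moe, upper = mean(x) + moe)
```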

T-Statistic

The t-value is an estimate of "how extreme your estimate is", relative to the standard error (assuming a normal distribution, centred on the null hypothesis). More specifically, the t-value is a measure of how many standard deviations our coefficient estimate is away from 0. The greater the t-value, the stronger the evidence against the null hypothesis (here, no relationship between presidential elections and vote share). The closer to 0, the more likely it is that there is in fact no relationship between the covariate and the outcome of interest. In our example, the t-statistic values are relatively close to zero and are small relative to the standard error, which suggests that there is no relationship.

Survey Sampling

A key issue in politics is what people think about policies and their government - A key source for understanding people's opinions is opinion polls / surveys - Most interview only several hundred people but are able to make inferences about the thinking of millions of people, or even whole nations - Samples and populations - Often when numbers are reported they relate to a sample of the population - This is because of the high cost of actually polling all eligible voters / people, which is essentially a census - It is also because shorter polls are much quicker and easier to carry out - But these samples can still be used to make inferences about populations

Linear Regression Model

A linear regression model is a linear approximation of the relationship between an explanatory variable X and a dependent variable Y - Y = α + βX + ε - Beta is the regression coefficient, which describes the relationship between X and Y - The assumption we need to make is that Y varies with X in the same way throughout the range of values of X - This allows us to predict the value of Y for each value of X - This assumes a linear relationship - This can be summarized (the conditional expectation function) as E[Y|X] = β0 + β1X - β0 is the y-intercept - this is often not a substantively meaningful value, since X = 0 may be impossible in practice (as when looking at the effect of age on voting) - β1 is the slope - the change in E[Y|X = x] for a one-unit increase in x - It doesn't vary across the values of X because there is only one slope - So the linear regression model is E[Y|X] = β0 + β1X - With Y as the DV, X as the IV, and β1 the association of Y with the value of X - β1 = the increase in Y associated with a one-unit increase in X along the fitted line - If β1 > 0, an increase in X is associated with an increase in Y - If β1 < 0, an increase in X is associated with a decrease in Y - The magnitude of β1 tells us how much Y changes with a one-unit increase in X - You should be mindful of whether the y-intercept is a meaningful value - You should also be careful not to reverse the DV and IV - for example, asking how voting leave affects your age instead of how your age affects voting leave
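
A hedged R sketch of fitting and reading a bivariate regression; age and leave are simulated stand-ins for the age / vote-leave example, not real data:

```r
set.seed(6)
age   <- runif(500, 18, 90)
leave <- 10 + 0.5 * age + rnorm(500, sd = 10)   # simulated outcome

fit <- lm(leave ~ age)
coef(fit)
# (Intercept) is beta0: the predicted Y at X = 0 (often not substantively meaningful)
# age is beta1: the expected change in Y for a one-unit increase in X

predict(fit, newdata = data.frame(age = 45))    # prediction at any value of X
```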

Hypothesis

A testable statement about the world - hypotheses must be falsifiable - they are tested by attempting to see if they could be false, rather than by 'proving' them to be true - The famous example: you can't prove that all swans are white by counting white swans, but you can prove that they aren't all white by finding just one black swan

Log Transformations

A useful transformation when variables are positive and right-skewed (with very large incomes to the right) is the logarithm - Transformations that bring the distribution more in line with a normal distribution are also useful for stabilizing variance - For example, the logarithm of 1000 to base 10 is 3 because 10³ = 1000 - So if x = b^y then y is the logarithm of x to base b, written y = log_b(x) - The most typical base is 10; logarithms with base 10 are known as the common logarithm - Another common choice is base e ≈ 2.718, which gives the natural logarithm - Why do we use logarithms? - Apart from mitigating right skewness, logarithms have an interesting property - When interpreting a logarithm coefficient, you get a constant percentage change estimation - This is just to say that by using the logarithmic transformation for X or for Y, we can still get estimates that are substantively interesting - So by taking the logarithm, we can interpret the results not in terms of a unit change in Y per unit change in X, but in terms of %ΔY per ΔX (logging Y), ΔY per %ΔX (logging X), or %ΔY per %ΔX (logging both)

Selection Bias

After we have a frame, the way we draw the sample from that frame can lead to sampling bias - Literary Digest survey: they had an enormous sample, but they drew the whole sample from a non-representative population, because they only asked people with cars, telephones, and subscriptions to Literary Digest, which skewed strongly towards higher-income people

Critical Value

Alpha (α) - the cut-off level for our test (e.g. 0.05 for a 95% confidence interval) - the critical value is the z- or t-value associated with that α (e.g. 1.96 for α = 0.05, two-sided)

Response Bias

And even when people answer all the questions there can still be response bias - answering questions in a skewed or dishonest way

The Law of Large Numbers

As the sample size increases, the sample average of a random variable approaches its expected value - SEE equation sheet for this equation - Example: flip a coin 10 times and you might get very variable numbers of heads and tails, but flip it many more times and the share of heads should get very close to the 50/50 divide - Relatedly, if we have lots of sample means, their average will be the same as the population mean - In technical language, the sample mean is an unbiased estimator of the population mean
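
A short R simulation of the coin-flip example (a sketch, not from the notes): the running share of heads drifts toward 0.5 as the number of flips grows.

```r
set.seed(7)
flips        <- rbinom(10000, size = 1, prob = 0.5)   # 1 = heads
running_mean <- cumsum(flips) / seq_along(flips)

running_mean[c(10, 100, 1000, 10000)]   # converges toward 0.5 as n grows
```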

Options for Visualizing Univariate Distributions

Bar Plots - Histograms - Boxplots

Response Bias

But bias can still arise even when there is a full response - this is known as response bias, which results from misreporting (i.e. from people lying or being misleading) - Survey responses can be affected by a huge number of things, such as the ordering of questions, the nature of the interview, the identity of the interviewer, etc. - There is also the issue of sensitive questions, which leads to social desirability bias - Here people respond to questions in a way that they think is desirable - People may hide their true feelings or behavior because they are embarrassed about them - seen with issues of racial prejudice, corruption, or political orientation

Covariates on R-Squared

By adding one or more independent variables, also known as covariates, you reduce the sum of the squared residuals and thus your R-squared increases. However, the goal of multivariate regression is to address confounding factors, so that we can better estimate the effect our independent variable of interest has on the dependent variable.

Confidence Intervals

By the Central Limit Theorem (CLT) we know that with repeated sampling, the sampling distribution is going to be normal - and in this case, it will be centered around the population parameter - A confidence interval for an estimate is a range of numbers within which the parameter is likely to fall - By the CLT, X(bar) ~ N(E(X), V(X)/n) - For the current case, X(bar) ~ N(p, p(1 − p)/n) - You choose a level for the confidence interval - The standard one is 95%, which means you are computing a confidence interval which contains the true value 95% of the time over repeated data generation - In the standard normal distribution, the mean is 0 and the SD is 1 - We know that the sampling distribution will be normal - This means a known proportion of the distribution falls within a specific distance of the mean - CIα = [X(bar) − zα/2 × standard error, X(bar) + zα/2 × standard error] - Where zα/2 is called the critical value, and α reflects our chosen confidence level - From worksheet: For the estimate of the mean to be of value, we must have some idea of how precise it is. That is, how close to the population mean is the sample mean estimate likely to be? This is most commonly done by generating a confidence interval around the sample mean. Confidence intervals are calculated in such a way that, under (hypothetical) repeated sampling, the population parameter of interest (e.g. the mean or median) is contained in the confidence interval with a given probability. So, for instance, the population mean lies within the 95% confidence interval in 95% of random samples. Note: this is not the same as saying that a 95% confidence interval contains the population mean with a probability of .95, although this is a common misinterpretation. 95% confidence means that we used a procedure that works 95% of the time to get this interval. That is, 95% of all intervals produced by the procedure will contain the true value of the population parameter we seek to estimate. For any one particular interval, the true population parameter of interest is either inside the interval or outside the interval. The confidence interval centers around the estimated statistic (e.g. the mean) and it extends symmetrically around the point estimate by a factor called the margin of error. We can write it as: CI = point estimate ± margin of error. Most often people use either the 95% or 99% confidence interval.

Nominal Variables

Categorical measure, with no ordering - Married or Unmarried - Employed or Unemployed

Use of Categorical Variables in Regressions

Categorical variables here fit observations into non-ordered categories rather than onto a continuous or ordered scale - You can often have two categories for one variable, with 1 meaning the attribute is present and 0 meaning it is absent - so for the variable colonization, 1 is colonized and 0 is not colonized - So if predicting the effect of being a colony on democracy you might estimate: E[Democracy|Colony] = β0 + β1Colony - What does β0 represent? The mean level of democracy for non-colonies - And β1 is the difference in the level of democracy between colonies and non-colonies - To do this, we can generalize the prediction equation - SEE equation sheet for how this is done for the example using British colonies

Before-and-After Comparison

Compare the same units before and after the treatment - Assuming that there are no time varying confounding factors

Cross-Sectional Comparison

Compare treated units with control units after the treatment - Assumption - the treatment and control units are comparable - Potential problem: There may be some unit specific confounding factors

Confounding Bias

Confounding bias is caused by differences between the treatment and control groups - when there is an unaccounted-for factor that is driving the outcome - The key assumption in observational studies is unconfoundedness - the idea that treatment and control groups are comparable with respect to everything other than the treatment - If we are looking at the effect of T on Y, there can't be an unobserved variable X also affecting the outcome Y - To avoid this we need a good comparison group that is unlikely to suffer from unobserved differences

Confounding

Confounding is defined as the bias due to ​common causes ​of the explanatory variable and outcome - If the purpose is to test the effect of X on Y, a goal is to control all potential confounders - However, not all confounders can be observed effectively, measured, or controlled for - One solution for this is to do an experiment with randomization, which should control for all of these different factors and thus solve confounding variables - For this to work, the groups have to be comparable because of randomization - Why we use controls - Experiments often are either infeasible or inadequate for other reasons - We want to try and control for alternative explanations or confounders as a proxy for doing an experiment - We do this by including controlling variables - Regression is useful in making causal claims less arbitrary A confounder is a pre-treatment variable that influences both your dependent variable y and your main independent variable of interest x

Unbiasedness in Statistical Inference

Consider the hypothetical where you conduct a survey under the same conditions again and again - your expectation is that the average of the sample means will equal p (E(X(bar)) = p) → this is known as unbiasedness

Covariance

Covariance (COV) measures how X and Y vary together - The sample variance is how spread out a variable is - The covariance is a measure of how two variables are associated - it is the expected value of the product of the deviations of the two variables from their means - In a way, the covariance is the same as the variance, the only difference being that instead of using only X we use X and Y - SEE equation sheet for this equation

Direction of Bias

Crucially, the direction of the bias that confounders introduce is ambiguous ex ante: they can either inflate or deflate estimated coefficients. Since a confounder may be unobservable (a latent variable), it may be difficult to control for it even if we know the true underlying data generating process. If a potential confounder is not the only confounder in the true model, then controlling for that confounder can actually worsen one's ability to draw valid causal inferences by reducing the bias on one side (positive) but not the other (negative, or vice versa); this will pull OLS estimates further from the true population parameter.

Descriptive Statistics

Descriptive Statistics are just statistics that describe a large amount of data in a summary form - These are useful because they help us understand what a typical unit from our population looks like - It also helps us reduce any measurement to key indicators

Unit Non-Response Bias

Even after we have our sample, trying to reach people can lead to bias, especially if certain respondents don't respond to the survey - which is called unit non-response - where certain members of a chosen sample don't respond to the survey at all, even if the sample is representative of the population of interest - Literary Digest survey: only 1⁄4 of the questionnaires they sent out were returned, leading to questions about the difference between the group that chose to return the surveys and the group that chose not to

F-Statistic

The F-statistic is a good indicator of whether there is a relationship between our independent and dependent variables. The further the F-statistic is from 1 the better. However, how much larger the F-statistic needs to be depends on both the number of data points and the number of predictors. Generally, when the number of data points is large, an F-statistic that is only a little bit larger than 1 is already sufficient to reject the null hypothesis (H0: there is no relationship between the dependent and independent variables). The reverse is true: if the number of data points is small, a large F-statistic is required to be able to ascertain that there may be a relationship.

Matching

Finding a similar unit for comparison is known as ​matching -- This is seen in the example of the minimum wage increase in New Jersey compared with Pennsylvania, where the underlying conditions should be very similar and so the treatment unit should be very similar to the control unit

The Process of Hypothesis Testing

First, we specify our null hypothesis. Our research hypothesis H1 is that X and Y are related, meaning Beta > 0 or Beta < 0 (that is, the alternative hypothesis). Hence the null hypothesis H0 is usually Beta = 0 (for two-tailed tests; for one-tailed tests, it would be that beta is no larger, or no smaller, than 0). Then, we choose a statistic giving us a score that can be mapped onto a known distribution telling us the probability of observing this score from our sample if H0 were true. In linear regression, we calculate the t-statistic (also called the t-value): t = beta(hat) / SE(beta(hat)). Then, we set the significance cut-off point alpha for rejecting the null. Alpha is usually 0.05 (but it can also be 0.10 or 0.01). Provided that the sample size is large enough and that we are doing a two-tailed test, we find the critical value for this cut-off point from the t-distribution: 1.96 for alpha = 0.05. Lastly, we compare the t-value to the critical value: we reject the null if the absolute value of our realised t-value is equal to or larger than 1.96.

One versus Two Sided Tests

For Paul the octopus, we calculated the probability of observing at least 12 correct guesses - One-sided alternative hypothesis: H1: p > p0 or H1: p < p0 - One-sided p-value: Pr(Z ≥ Zobs) (for p > p0) or Pr(Z ≤ Zobs) (for p < p0) - But we could also consider extreme responses on the other side - e.g. observing at least 12 incorrect guesses - Two-sided p-value = Pr(Z > |Zobs|) + Pr(Z < −|Zobs|) - As the sampling distribution is normal and symmetric, the p-value for this alternative hypothesis is simply 2x the one calculated previously - The convention is to use two-tailed tests - Making it even more difficult to find results just due to chance - We normally don't have very strong prior information about the direction of the difference - A one-tailed test needs a very strong theoretical justification

Frame Bias

From your target population you select a frame population using a sampling frame (a list of all of the potential people within your population), but when your sampling frame isn't fully representative of the population, this can lead to frame bias

Statistical Inference

Guessing what we do not observe from what we do observe - What we want to estimate: the parameter (Θ), which is unobservable - What we do observe: the data - We use the data to compute an estimate of the parameter, Θ(hat) - But the key question is how good Θ(hat) is as an estimate of Θ - We thus want to know the estimation error = Θ(hat) − Θ0, where Θ0 is the true value of Θ - The problem is that Θ0 is unknown - So we consider two hypotheticals instead - 1) How well would Θ(hat) perform as the sample size goes to infinity? - 2) How well would Θ(hat) perform over repeated data generating processes?

Stratified Random Sampling

Here we classify our population into groups, and then within these groups we can select using SRS - Thus, if we wanted to survey cat owners, we could create a group of cat owners with 50 people and a group of non-cat-owners with 50 people, even though only 16% of people own cats

Bias Within the Linearity Assumption

Here, bias is the expected difference between the estimator and the parameter (Θ) - this is not the difference between a single estimate and the parameter - So if an estimator's average across repeated samples is far from the population value, it is highly biased

Efficiency Within the Linearity Assumption

Here, efficiency is having a sampling distribution with a smaller variance, even among unbiased estimators - tighter distributions are less likely to produce extreme values - This is aided by using larger samples, which makes for a smaller spread - If we have two unbiased estimators (Θ(hat)1 and Θ(hat)2) of Θ, we will want to pick the one with the lower variance, which is more efficient - Here, the SE is the standard deviation of the sampling distribution

Hypothesis Testing for Proportions

Hypotheses: H0: p = p0 and H1: p ≠ p0 - Here, the test statistic is the z-score - Z = (X(bar) − p0) / √(p0(1 − p0)/n) (SEE equation list) - Under the null, by the central limit theorem, Z ~ N(0, 1) → it is distributed normally, with mean 0 and standard deviation 1 - Is the observed Zobs unusual under the null? - Reject the null when |Zobs| > zα/2 - Retain the null when |Zobs| ≤ zα/2
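
A minimal R sketch of this z-test for a proportion (the observed proportion, n, and null value are made up):

```r
p0   <- 0.5    # null value
xbar <- 0.56   # observed sample proportion (hypothetical)
n    <- 400

z_obs  <- (xbar - p0) / sqrt(p0 * (1 - p0) / n)
z_crit <- qnorm(0.975)            # critical value for alpha = 0.05, two-sided

abs(z_obs) > z_crit               # TRUE -> reject the null
2 * (1 - pnorm(abs(z_obs)))       # two-sided p-value
```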

Cluster Random Sampling

If population members are naturally clustered, we can SRS clusters, and then SRS respondents within selected clusters- Randomly selecting schools and then randomly selecting students within schools

Comparison of Means

If we find that two groups have different means for a given variable, how do we know whether this is due to chance in the particular sample that we have drawn? What we want to know is the probability that the means of the two groups are different in the population. Typically, when there is sufficient evidence in our sample to say that there is less than a five per cent chance that we got this result purely by chance when the difference in the underlying population is actually zero, we say that there is a statistically significant difference in means. Note that 'statistical significance' is not the same as substantive significance! Even very small differences are statistically significant if we are confident that they reflect a real difference in the population. The t.test function in R tests the difference in means. A simple comparison of means for a variable between two groups is only helpful when the setting is quasi-random, meaning there are no possible confounders. This is unlikely when we have observational data, which is usually the case in Political Science and IR. That is why it is so important to understand regression outputs.
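
A quick illustration of the t.test() call on simulated data for two groups (the group means are arbitrary):

```r
set.seed(8)
group_a <- rnorm(100, mean = 5.0, sd = 2)
group_b <- rnorm(100, mean = 5.6, sd = 2)

# Tests H0: the two population means are equal;
# the p-value tells us how likely a difference this large is if H0 were true
t.test(group_a, group_b)
```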

Tradeoff between Type I and Type II errors

If we have a very, very high bar for rejecting H0, then we will retain H0 much more frequently, even if it isn't true - thus less Type I error = more Type II error - Example: a legal trial, trying not to let guilty people go free while also not convicting someone who is innocent

QQ Plots

If you are trying to compare studies with different respondents, you can't use scatterplots or correlation coefficients because you have different units, not just different variables - One way to compare two different studies is to compare the entire distributions - This can be done with histograms, and it can be done with QQ plots - If you are comparing people's responses in the first and second waves of a survey, you can do so with two histograms side by side - But if we want to compare this data with data from another study, we can use two histograms or a QQ plot - QQ plots plot the quantiles of one variable against those of another - A 45-degree line indicates that the two distributions are equal - We could look at two groups in the gay marriage study (the treatment group and control group), and to do this we would need a histogram or a QQ plot - If you plot the quantiles of one variable on the x-axis and the quantiles of the other on the y-axis, this lets you see the relationship - Points above the 45-degree line show that the variable on the y-axis has a greater value at the corresponding quantiles than the variable on the x-axis

Omitted Variable

If you can think of other explanations for an association, you need to control for those before you can claim causation - these are the omitted factors that can explain both the x variable and the y variable - For example, an omitted variable in the relationship between democracy and GDP is whether a country is an oil producer (since oil-producing countries tend to be more autocratic) - this is where you would 'control' for that variable - There may be other factors affecting the association - omitted variables that affect both x and y - We need to control for these to avoid omitted variable bias, and they must be causally prior to x and y → so in this example, something that affects both someone's education and their turnout (such as their parents' education) - In a linear regression, β is the regression coefficient on X1, which describes the association between X1 and Y - For β(hat) to be unbiased, the expected value of the error conditional on X1 must be zero → E[ε|X1] = 0 - In observational studies, X1 is likely to be related to omitted variables in ε, which may also be related to Y - This leads to E[β(hat)] ≠ β - This is known as omitted variable bias - What we can do in our regression analysis is account for the omitted variable by placing X2 in our model, which should reduce the error term and the residuals - Y = β0 + β1X1 + β2X2 + ε - Here, you hold X2 constant, with β1 denoting the partial association of X1 and Y - And here, cov(X2, Y) ≠ 0 and cov(X2, X1) ≠ 0 → meaning X2 is related to both Y and X1 - This can be extended to many different variables - Y = α + β1X1 + ... + βkXk + εi - see the simulation sketch below
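
A hedged simulation sketch in R (all data simulated, with the true β1 set to 1) showing how omitting X2 biases the estimate of β1 and how controlling for it recovers the truth:

```r
set.seed(9)
n  <- 5000
x2 <- rnorm(n)                      # the confounder
x1 <- 0.8 * x2 + rnorm(n)           # cov(x1, x2) != 0
y  <- 1 * x1 + 2 * x2 + rnorm(n)    # cov(x2, y)  != 0; true beta1 = 1

coef(lm(y ~ x1))        # omitting x2: the coefficient on x1 is biased upward
coef(lm(y ~ x1 + x2))   # controlling for x2 recovers beta1 close to 1
```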

The Danger of Multiple Testing

If you do lots of tests, the odds of getting a false positive are fairly high - Running 10 tests at the 95% level, the probability that at least one of them is a false positive is about 0.4 (1 − 0.95^10 ≈ 0.40)

Hypothesis Testing

In quantitative political science, much of what you do is statistical hypothesis testing - This is essentially a probabilistic 'proof by contradiction': we assume the negation of the proposition and show that the observed data would be very unlikely if it were true - We construct a null hypothesis (H0) and its alternative (H1) - We then pick a test statistic (T) - We then work out the sampling distribution of T under H0 (this is called the reference distribution) - We then ask whether the observed value of T is likely to occur under H0 - If it is, we retain H0 - If it isn't, we reject H0

Control for Factors

In regression analysis, we control for factors that are related to both the dependent and independent variable. When you control for something in statistics, you add to your model one or more variables of interest that are kept constant when you interpret the relationship between your dependent and independent variables of interest.

Conditional Expectations Function

Information about age can help improve prediction about vote leave - if age is called X, and vote leave is called Y, we want to predict the value of Y given the value of X - This can be summarized with the following notation E[Y | X] which means the expectation of Y conditioned upon X - E[Y] is the expectation of Y, and is the ​population mean - So the symbol | means conditional upon - If we want to know the expectation of Y for a specific value of X it would be E[Y | X = x]

The Common Support Assumption

Interaction terms require common support—that is, there are variations across the different possible values of the key independent variable (binary_growth) and the moderator (binary_openness). If there is no variation in either variable, the interaction becomes meaningless.

IQR

Interquartile Range (IQR) - a measure of variability - the range between the 75th and 25th percentiles (Q3 − Q1) - A definition of outliers: more than 1.5 × IQR above the upper quartile or below the lower quartile

Sampling Distribution

It is a distribution of possible averages from the population distribution - all the possible averages we could calculate from different samples, and the probability of getting each of those sample averages - The bigger the samples, the more the distribution will narrow and concentrate around the actual mean, meaning the variance and standard deviation will be lower - The more samples you take, the more the average of the sampling distribution converges to the population mean - We shouldn't make generalizations from small samples, because for these it is impossible to tell how they compare to the overall population - This also means that, largely independent of the population distribution, the sampling distribution will be approximately normal - When many samples are taken, the sampling distribution distributes normally - The same probabilistic reasoning also applies to the statistical measures we have studied so far, including the mean and standard deviation. In fact, these are also random variables because their values will vary depending on the sample we use. Suppose that we draw all possible samples of size n from a given population. Suppose further that we compute a statistic (e.g. a mean, proportion, standard deviation) for each sample. The probability distribution of this statistic is called a sampling distribution (e.g. the sampling distribution of the mean). And the standard deviation of this statistic is called the standard error.

Outcome Controls

It is important to make sure your controls aren't outcome controls → variables that are affected by x and then affect y, so that the chain goes x → z → y, instead of z acting on both x and y before their association - We don't want to control for things realized after the x - so you don't want to control for income after education, because income will be affected by education - Here controls (c) can be outcomes of x - When including outcome controls, the regression controls away the consequences of X - The effect of X on Y can go through C - Which means controls should be predetermined, before the effect of x on y, to avoid bad control bias

The difficulty of obtaining a representative sample

Many surveys are done via telephone, often with random digit dialing - But this can still lead to biased sampling frames (if people have multiple phones because they are wealthy, they could be double counted) - You could also have unit non-response from people screening calls from unknown numbers - An alternative is internet surveys - These are usually run through opt-in panels, which means the sampling is respondent-driven rather than probability sampling and may lead to selection bias - Thus these surveys are cheap but non-representative, since they usually reach the young more than the old, the rich more than the poor, and the urban more than the rural - You can correct for some of this bias with statistical methods, but this is an imperfect art

Summary of Issue / Test / Fix for 4 OLS Assumptions

Multicollinearity - Issue: inflates SEs - Test: regress the suspect variable on the other IVs - Fix: increase the sample size or remove the suspect
Zero conditional mean - Issue: biased estimates - Test: plot residuals against fitted values - Fix: include confounders
Heteroskedasticity - Issue: biased SEs - Test: plot standardized residuals against fitted values - Fix: use robust SEs
Normality of the error term - Issue: wrong p-values (if the sample is small) - Test: plot the distribution of standardized residuals - Fix: none
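
A hedged R sketch of the diagnostic checks listed above, using the built-in mtcars data as a stand-in model; the robust-SE line assumes the sandwich and lmtest packages are installed:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # example model on a built-in dataset

plot(fitted(fit), resid(fit))             # zero conditional mean: look for patterns
abline(h = 0, lty = 2)

plot(fitted(fit), rstandard(fit))         # heteroskedasticity: look for a funnel shape
hist(rstandard(fit))                      # normality of the error term

# Robust standard errors as a fix for heteroskedasticity
# (assumes the 'sandwich' and 'lmtest' packages are available)
library(sandwich); library(lmtest)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```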

Can Controls Address All Confounders?

No - Introducing controls is not a sufficient condition to address confounding variables - The researcher would have to know all the controls that are in ε and that influence X and Y, such that E[β(hat)] = β - And not all confounders can be observed, measured, or accounted for - Other research designs can help address this concern (like randomized control trials)

Confidence Intervals for Proportions (Binary Variables)

Nominally, the interpretation of a 95% confidence interval is that under repeated samples or experiments, 95% of the resultant intervals would contain the unknown parameter in question. However, for binomial data, the actual coverage probability, regardless of method, usually differs from that interpretation. This result occurs because of the discreteness of the binomial distribution, which produces only a finite set of outcomes, meaning that coverage probabilities are subject to discrete jumps and that the exact nominal level cannot always be achieved. Thus, to avoid confidence intervals that include negative values or values larger than one, we refer to probabilities when calculating the standard error.

Significance Tests and CI's

Notice that significance tests look similar to CIs → they are essentially the same thing - We could use a CI around the difference between two sample means to test the hypothesis that they are the same - A 95% CI would just be the estimate ± 1.96 × SE - We've just worked out that SE ≈ 3 - So CIs and significance tests are doing the same job, just presenting the information in a slightly different way

Null Hypothesis

The null hypothesis (H0) is the hypothesis that is directly tested - This is a statement that the parameter we are interested in takes a value consistent with no effect - Regarding ideology: old people are the same as young people - This is what is directly tested

Two Sample Test

Often we wish to compare two samples - e.g. examine the H0 that the means of two populations are equal - Here we want to estimate the difference between the populations (the parameter) using the difference between the sample means (the statistic) - We run a significance test on the test statistic to discover whether the samples likely represent real differences between the populations of men and women - The null hypothesis is simply that there is no difference between male and female mean levels of worship

Randomized Control Trials (RCT)

One of the key ways you deal with the problem of bias and matching is randomization -- The key idea: randomization of the treatment makes the treatment and control groups 'identical' on average - The two groups are similar in terms of ​all ​(both observed and unobserved) characteristics - Thus, you can attribute the average difference in outcome to the difference in the treatment

List Experiments

One way to address response bias - One option is to do list experiments, where you ask respondents how many items on a list of options they support - The control group's list lacks the key item you are trying to measure, whereas the treatment group's list includes that item - You can then subtract the mean of the control group from the mean of the treatment group to estimate how much a certain group of people support something which may be sensitive and hard to poll directly - This approach still has issues from floor and ceiling effects, since when given a list, saying either 0 or the highest number reveals that you definitely support or don't support a certain thing - this can lead to response bias as well

Multiplication Rule for Probability

P(A and B) = P(A|B)P(B) = P(B|A)P(A)

General Addition Rule

P(A or B) = P(A) + P(B) − P(A and B) - If P(happy) = 0.5, P(dating) = 0.5, and P(dating and happy) = 0.4, then P(dating or happy) = 0.5 + 0.5 − 0.4 = 0.6

Law of Total Probability

P(A) = P(A and B) + P(A and not B) - If P(dating) = 0.5 and P(dating and happy) = 0.4, then P(dating but unhappy) = 0.1 - P(A) = P(A|B)P(B) + P(A|not B)P(not B)

Conditional Probability

P(A|B) is the conditional probability of an event occurring ​given ​that event B occurs - P(A|B) = Joint Probability / Marginal Probability = P(A and B) / P(B)

Dealing with Non-Linearities

Polynomial terms can be used to model non-linearities. For example, we get a model with a linear and a quadratic term - Polynomial terms involve a multiplication of x by itself, leading to this equation: Y = β0 + β1X + β2X² + ε - This is known as a second-order polynomial in x - We can also include a third-order polynomial if we want → Y = β0 + β1X + β2X² + β3X³ + ε - An important piece of advice: never include the higher polynomial without including the lower polynomial too - i.e. never include X² without X, or X³ without X² - This allows for more flexibility and so is a good safety measure - When β2 > 0 we get a U-shaped curve (there is no maximum, there is a minimum) - When β2 < 0 we get an inverted U-shaped curve (there is no minimum but there is a maximum)
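
A minimal R sketch of adding a quadratic term with I(), on simulated data given an inverted-U shape (the coefficients are made up):

```r
set.seed(10)
x <- runif(300, 0, 10)
y <- 2 + 3 * x - 0.3 * x^2 + rnorm(300)   # beta2 < 0: inverted U-shape

quad <- lm(y ~ x + I(x^2))                # include the lower-order term x as well
coef(quad)
```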

Predictive Inference

Predict future data on the basis of past data - what is the expected weight of a 2-year old in three years

Causality in Regression

Prediction is not causality - Predicting Y on the basis of x does not imply that it is a change in X that causes Y - Causality implies that we make sure other factors (confounders​) that can cause a change in both X and Y are properly accounted for - Regression is helpful only because it provides a framework to account for other ​observed ​confounders

Probability

Probability is a mathematical model of uncertainty or chance - Frequentist interpretation: repeated experiments → probability as the limit of the relative frequency of a repeatable event - Bayesian interpretation: subjective belief → a personal measure of uncertainty - There is an experiment, a sample space, and an event - Experiment: flipping a coin, rolling a die, voting in a referendum - Sample space (Ω): all possible outcomes of the experiment - {head, tail}, {1, 2, 3, 4, 5, 6}, {abstain, leave, remain} - Event: any subset of outcomes in the sample space - head, tail, head or tail; 1, even number, odd number, does not exceed 3; do not abstain, do not vote leave, etc. - Under this outline, the probability of A is P(A) - If all outcomes in the sample space are equally likely to occur: P(A) = (number of elements in A) / (number of elements in Ω)

Quantile

Quantiles (quartiles, quintiles, percentiles, etc.) - values that divide our data set at different percentiles - Most commonly used are quartiles - 25th percentile = lower quartile - 50th percentile = median - 75th percentile = upper quartile

Random Assignment to Treatment

Random assignment to treatment solves this problem, since observed and unobserved confounders are balanced between control and treatment groups in expectation. Researchers must demonstrate that both treatment and control groups are balanced with respect to salient pre-treatment characteristics. This involves quantitative balance tests, checking that the means of the pre-treatment characteristics are the same between treatment and control groups, but also qualitative observations about how treatment was assigned and why it was (as-if) random.

RDD

Regression Discontinuity Designs (RDD), which use arbitrary cutoffs (those just above and just below the cut off should be very comparable)

Regression

Regression allows us to predict the value of Y for any value of X, even if that specific x is not included in the sample - E[Y|X = xany] = β0 + β1 × xany - In a predictive model, β1(hat) is interpreted as the expected difference in y associated with a one-unit increase in x

Type I Error

Rejecting H0 when H0 is true - Type I errors occur at the rate of the α we chose - so if α = 0.05, Type I errors will happen 5% of the time - Hypothesis tests thus control the probability of Type I error, which is equal to the level of the test, α - A lower α (a stricter significance level) makes it more difficult to detect a real effect, but gives more confidence that any effect we find is real

Rule of Thumb on Statistical Significance

Remember that statistical significance isn't the same as real world significance

Residual Standard Error

Residual Standard Error is measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term epsilon. Due to the presence of this error term, we are not capable of perfectly predicting our dependent variable (vote share) from our independent variables. The Residual Standard Error is the average amount that the dependent variable will deviate from the true regression line.

RSS

Residual Sum of Squares - the sum of squared differences between the observed values and those predicted by the line of best fit, Yi − Y(hat)i - its mistakes are smaller than those in the TSS because the fitted line predicts better than the mean alone

Type II Error

Retaining H0 when H0 is false - A larger sample size reduces the probability of this happening, but it can't be controlled in the same way that Type I error is

Selection Bias

Selection bias is where some groups self-select for the treatment or control - Statistical controls are needed to help account for that - This should help get rid of the issues of bias and allow inferences about causal effects - However, even with these, there may be some unobserved confounding factor that poses a threat to our findings

Multivariate Linear Regression Visualization

Similarly, in multivariate linear regression analysis, we fit a (hyper)plane through a multi-dimensional cloud of data points - and this (hyper)plane is, by a similar logic, the one that minimises the sum of squared residuals. More specifically, the multiple linear regression fits a plane through a multi-dimensional cloud of data points. The simplest form has one dependent and two independent variables.

SRS

Simple Random Sampling - the best way to do probability sampling, where every unit has an ​equal ​chance of being selected to be surveyed - But there are some issues with this approach: - Random samples may not include enough of a particularly interesting group for analysis, since the random element makes it hard to select for such a group - It can also be costly and difficult - If the random people you select are hard to reach, this raises many difficulties - Getting a sample frame for a population can be very difficult

Standard Error

Standard error = the estimated standard deviation of the sampling distribution of an estimator (here, the sample mean) - The standard deviation of a sampling distribution is called a standard error to distinguish it from the standard deviation of a population or sample - A high standard error gives us a short and flabby sampling distribution and a low standard error gives us a 'tall and tight' sampling distribution - The standard error of X(bar) is s / √n - This standard error gives us an estimate of how far any sample mean typically deviates from the population mean - The standard error equation means that: - As the n of the sample increases, the sampling distribution gets tighter - so the bigger the sample is, the better it is at estimating the population mean - As the distribution of the population becomes tighter, the sampling distribution is also tighter - if a population is dispersed it will be more unlikely to get observations near the mean - This works for binary variables too - not just continuous ones - where the mean is just the proportion - Knowing the standard error (and the Central Limit Theorem), we can calculate how likely it is that a specific range around our estimate contains the population mean - We can calculate what's called a confidence interval - SEE equation sheet for this equation
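A minimal sketch in R (simulated data, purely illustrative) of the standard error of a sample mean and the resulting approximate 95% confidence interval:
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)     # hypothetical sample of n = 100
se <- sd(x) / sqrt(length(x))           # standard error of the sample mean: s / sqrt(n)
mean(x) + c(-1.96, 1.96) * se           # approximate 95% confidence interval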

SSR

The Sum of Squared Residuals - the quantity that OLS minimizes when choosing the line of best fit: the total squared gap between the observed data points and the line - See equation sheet for what the equation is

Coefficients for Continuous Variables

The coefficients for continuous variables show the effect of a one-unit change in the associated variable on the expected vote share, other things equal.

Coefficients for Dummy Variables

The coefficients for the dummy variables indicate the magnitude of the shift in the intercept when the associated variable takes a value of 1, all other things equal.

Coefficient Standard Error

The coefficient Standard Error is an estimate of the standard deviation of our coefficient estimate. It measures the average amount by which the coefficient estimate varies from its true value, i.e. if we were to run the model again and again on repeated samples, the Standard Error provides an estimate of the expected difference. It is thus an estimate of the variability of the strength of the effect, or the strength of the relationship between each covariate and the outcome variable. If the standard error is large, then the effect size will have to be stronger for us to be able to be sure that it's a real effect, and not just an artifact of randomness. We'd ideally want a standard error that is low relative to its coefficient. Interpretation of standard errors in a regression output: in 95% of surveys or "repeated samples", the difference between our estimate and the true value is less than approximately 2*SE of the estimate. As you have seen, Standard Errors can also be used to compute confidence intervals and to statistically test the hypothesis of the existence of a relationship between our dependent and each respective independent variable.

R-Squared

The difference in mistakes between the RSS and the TSS: r^2 = (TSS − RSS)/TSS = 1 − (RSS/TSS) - The closer the r-squared is to 1 the better - it is always between 0 and 1 - The r2 represents the proportion of total variation in the outcome variable explained by the predictor(s) included in the model - r2 thus tells us how well a variable explains an outcome - if it is 1 it means it ​perfectly ​explains it - but this obviously never happens unless the Y variable is the same as the X variable - r2 does not tell us about the ​relevance ​of the variable (i.e. its statistical significance), however - Larger R2 implies we can explain more of the variation in the outcome variable with our included variables - Is this our objective as social scientists? - As discussed earlier, when we pose causal questions the objective is to explain the effect of a given X on Y → here we don't care as much about the R2 but about β1 - Thus a larger R2 is no indication of the effect of a given x on y
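A minimal sketch in R (hypothetical data) showing that 1 − RSS/TSS matches the R-squared reported by lm():
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 2.3, 2.9, 4.4, 5.1, 5.8)     # hypothetical outcome
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)             # residual sum of squares
tss <- sum((y - mean(y))^2)              # total sum of squares
1 - rss / tss                            # r-squared computed by hand
summary(fit)$r.squared                   # the same value from lm()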

Normality Assumption

The error term follows a normal distribution - The error term is independent of the explanatory variable and is normally distributed with mean zero and variance σ2: μi ~ N(0, σ2), which implies Y | X ~ N(β0 + β1X, σ2) → this implies Assumptions 4 and 5 - zero conditional mean and homoskedasticity. Together, assumptions 1-6 are called the ​classical linear model assumptions - Non-normality of the errors is not a serious concern in large samples because we can rely on the Central Limit Theorem. Variable transformations can help to come closer to normality. FROM WORKSHEET: This assumption means that the error term (Error) is independent of the explanatory variables and is normally distributed with a mean of zero. Non-normality is the "least of our problems" in regression diagnostics because its violation does not entail bias or inefficiency of our OLS estimates. It matters for the calculation of p-values, but this is only a concern when the sample size is small. If it is large enough, the Central Limit Theorem entails that we can reasonably assume a normal distribution of the error terms. Note that errors and residuals are not the same! The error term is an unobserved quantity based on the true population model. The residual is the observed difference between the predicted or fitted value Yi(hat) and the true observation Yi. However, the expected difference between the fitted values and the observed values (i.e. the expected residual) equals zero if the above-mentioned OLS assumptions hold. The next test one should run is therefore to ensure that the residuals are roughly normally distributed. We already know how to produce simple residuals, but now we need the variance of the residuals to be standardized, so we will generate what are called "studentized" or "standardized" residuals. The standardized residual is the residual divided by its standard deviation. The reason why we standardize the residuals is that the variance of the residuals at different values of the explanatory variables may differ, even if the variances of the errors at these different values are equal. Indeed, because of the way OLS works, the distribution of residuals at different data points (of the explanatory variable) may vary even if the errors themselves are normally distributed (intuitively, the variability of residuals will be higher in the middle of the domain than at the end). Hence to compare residuals at different values of the explanatory variables, we need to adjust the residuals by their expected variability, i.e. standardize them.

Heteroskedasticity

The error term has constant variance, i.e. the same variance regardless of the value of X - Here the focus is on the constant variance in the OLS estimators - This leads to the assumption of homoskedasticity, which says that the error term has the same (conditional) variance given any value of the explanatory variable: V[ε | X] = σ2 → this means there is constant variance - This is violated when the variance differs: V[ε | X = 1] =/= V[ε | X = 0] - This phenomenon is known as heteroskedasticity - We want tn−2 = β(hat)1 / SE[β(hat)1] - But we get tn−2 = β(hat)1 / SE[β(hat)1 | X] From Worksheet: Homoskedasticity means that the error term has the same variance given any value of the explanatory variables. In other words, the variance of the error term, conditional on the explanatory variables, is constant for all combinations of values of the explanatory variables. Homoskedasticity is also referred to as "constant variance". If this assumption is violated, that is the error term varies differently at certain values of the explanatory variable, we have heteroskedasticity (from the Greek: homo = same, hetero = different, skedasis = dispersion). Homoskedasticity is not necessary to get unbiased coefficient estimates - so long as the zero conditional mean assumption is met! It is important still because in case of heteroskedasticity - i.e. if the error term varies at different levels of our explanatory variables - we will get biased standard errors for our coefficient estimates! Indeed, the formula for the variance of Beta(i) requires constant variance of the error term conditional on Xi. In case of heteroskedasticity, this invalidates the usual formula for the standard errors. What to do if we encounter heteroskedasticity? Heteroskedasticity can be corrected for by adjusting how we calculate our standard errors. In particular, we can use "robust standard errors" that correct for the uneven variance of the residuals. Robust standard errors weigh the residuals according to their relative importance to correct for imbalances in their dispersion. It is worth noting that robust standard errors typically (but not always) give us more conservative (i.e. larger) estimates of a coefficient's standard error. However, keep in mind that robust standard errors do not fix problems stemming from model specification, and should not be used just to "be safe": they should be used only if we see heteroskedasticity to be a problem, otherwise it will inflate the standard errors for no reason.
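A minimal sketch of robust standard errors in R, assuming the sandwich and lmtest packages are installed (the data are hypothetical, built so the spread of y grows with x):
library(sandwich)                                  # heteroskedasticity-consistent variance estimators
library(lmtest)                                    # coeftest() to re-test coefficients
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1, 2, 2, 5, 4, 9, 6, 12)                    # hypothetical outcome with growing variance
fit <- lm(y ~ x)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))    # coefficients with robust (HC1) standard errors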

interpretation of interaction effects

The interpretation of interaction effects is less straightforward than in the additive regression case. In particular, the Beta coefficient of the main effect must not be interpreted as the average effect of a change in X on Y as it can be in a linear-additive model. Because the effect of Economic Growth on incumbent vote share is now dependent on the values of Openness, the magnitude of the effect is given by Beta1 + Beta3 x EconomicOpenness - That is, we cannot know the effect of binary_growth by just looking at Beta1. Beta1 is now a particular case: it represents the relationship between Economic Growth and delta_vote when binary_openness is zero.

Consistency in Statistical Inference

The law of large numbers means that as the sample size increases, the sample mean of X converges towards the population mean (for a binary variable, the proportion p) - this is known as consistency

Line of Best Fit

The line of best fit is the line that best minimizes the errors between the predictions and the observed data points - The goal of a regression is to minimize the residuals → the line of best fit is the line that minimizes all of the residuals as much as possible - This is represented by the model - Y = α + βX + μ - With (α, β) as the coefficients and μ as the unobserved error / disturbance - To get this line we minimize the Sum of Squared Residuals (SSR), the total squared gap between the observed points and the line - The line that does best at minimizing the residuals is called the ordinary least squares linear regression (OLS regression)

The Residual

The mistake between the prediction and the actual data point - So the residual is Yi − Yi(hat) → this is denoted by μ(hat)i (the estimated counterpart of the error term μi)

Adjusted R^2 and Multiple R^2

The more predictors we add to our model, the bigger our R squared gets. This is automatic! Thus with multivariate models, we look for the adjusted R squared, which adds a penalty when calculating the model fit. The R2 statistic provides a measure of how well the model is fitting the actual data. It takes the form of a proportion of variance. R2 is a measure of the linear relationship between our dependent variable (vote share) and our independent variables. It always lies between 0 and 1 (i.e.: a number near 0 represents a regression that does not explain the variance in the response variable well and a number close to 1 does explain the observed variance in the response variable). In multiple regression settings, the R2 will always increase as more variables are included in the model. That's why the adjusted R2 is the preferred measure as it adjusts for the number of variables considered. SEE equation sheet

Interval Variables

The numbers represent quantitative variables - This is like income, or age- There is a specific distance between each level - we can say someone is not only older, but that they are 1.6 years older

Random Sampling Assumption

The observed data represent a random sample from the population described in the model - So here, we want the observed data (yi, xi) for i = 1 ... n to have been randomly sampled from the population - Examples consistent with this assumption: surveys based on random selection; samples chosen on the basis of independent variables - Violations: sample selection based on the dependent variable, or clustering in space or time or both - this is an example of sampling ​clustered ​within a specific group - Non-random sampling can lead to bias that may lead a researcher to draw the wrong conclusion about the population from his or her sample. This can be pervasive in the social sciences

P-Value

The p-value is the probability, computed under H0, of observing a value of the test statistic at least as extreme as its observed value - A smaller p-value presents stronger evidence against H0 - So how small is small enough? - The level (size) of the test: Pr(rejection | H0 ) = α - P-values less than α indicate statistical significance - this is the α-level test - Commonly used values of α are 0.05 or 0.01 - So when you say something is statistically significant, you are saying your p-value is less than α - P-values are arbitrary levels that our test must meet - Maybe we need to be 99% confident that we are correctly rejecting the null hypothesis - Make the judgement that p-values of e.g. 5% and below are ​probably good evidence that the null hypothesis can be rejected - The p-value is NOT the probability that H0 is true or H1 is false - Statistical significance indicated by the p-value does not necessarily imply scientific significance - The p-value is the likelihood that the t-statistic you found came from the distribution centered around 0 that is created by the null hypothesis - The p-value is the probability of observing a coefficient at least as extreme as our estimate in a sample if the true coefficient in the population was zero. In other words, it is the probability of obtaining an effect that is at least as extreme (i.e. as far from 0) as the estimate in our sample data, assuming that the true effect is zero. "Assuming that the true effect is zero" is what we usually refer to as the "null hypothesis". The p-value evaluates how well the sample data support the idea that the null hypothesis is true. In other words, it measures the compatibility of the sample with the null hypothesis (and not the truth of the hypothesis!). A small p-value indicates that it is unlikely that we would observe a relationship between the predictor (presidential elections) and outcome (vote share) variables due only to chance. Typically, a p-value of 5% or less is a good cut-off point. A p-value of 0.05 means that if we repeated this study many times when the null hypothesis is in fact true, we would wrongly reject it about 5 out of 100 times
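A minimal sketch in R of how a two-sided p-value is computed from a t-statistic (the numbers are illustrative, not from any real model):
t_stat <- 2.3                       # hypothetical t-statistic: beta_hat / SE
df <- 48                            # hypothetical degrees of freedom (n - 2)
2 * pt(-abs(t_stat), df = df)       # two-sided p-value under the null hypothesis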

Observational Studies

The problem is that often, randomization is not ethical or not possible - this is seen with trying to observe the effect of smoking on lung cancer - Observational studies rely on treatments that are assigned naturally - they look at data naturally observed in the world to infer counterfactuals - Observational studies have better ​external validity - which means their findings can be generalized beyond the study setting - But they have worse ​internal validity

Standardized Residual

The standardized residual is the residual divided by its standard deviation. The reason why we standardize the residuals is that the variance of the residuals at different values of the explanatory variables may differ, even if the variances of the errors at these different values are equal. Indeed, because of the way OLS works, the distribution of residuals at different data points (of the explanatory variable) may vary even if the errors themselves are normally distributed (intuitively, the variability of residuals will be higher in the middle of the domain than at the end). Hence to compare residuals at different values of the explanatory variables, we need to adjust the residuals by their expected variability i.e. standardize them.
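In R, standardized (studentized) residuals can be extracted directly from a fitted model; a minimal sketch with hypothetical data:
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 3, 5, 4, 7, 8)     # hypothetical outcome
fit <- lm(y ~ x)
rstandard(fit)               # residuals divided by an estimate of their standard deviation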

Variability of a Sampling Distribution

The variability of a sampling distribution is measured by its standard deviation. The variability of a sampling distribution depends on three factors: - N: the number of observations in the population - n: the number of observations in the sample - The way that the random sample is chosen (with or without replacement). If the population size is much larger than the sample size, then the sampling distribution has roughly the same standard error, whether we sample with or without replacement. On the other hand, if the sample represents a significant fraction (say, 1/20) of the population size, the standard error will be meaningfully smaller when we sample without replacement.

OLS Regression

There is only one line that fulfills this purpose. It does so because it is estimated by minimizing the sum of squared errors. Thus, it chooses the β1 and β0 values that minimize the SSR. This estimation is called the ​Ordinary Least Squares ​linear regression, which is also known as OLS regression - The important thing is that the OLS estimator will give us the β1 that allows us to minimize the space between the residuals and the line - In OLS, the mean of the residuals μ(hat) is always ​0 - and only one line satisfies this condition - Thus, the goal of OLS is to minimize the Sum of Squared Residuals with the line of best fit - The OLS slope estimate is given by β(hat)1 = Cov(X, Y) / Var(X) - The total sum of the errors would include both positive and negative numbers, cancelling each other out. Squaring each error overcomes this problem of opposite signs. The technique that minimizes the sum of the squared errors for linear regression is known as Ordinary Least Squares estimation (OLS).
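A minimal sketch in R (hypothetical data) showing that Cov(X, Y) / Var(X) reproduces the slope returned by lm():
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9)     # hypothetical outcome
b1 <- cov(x, y) / var(x)            # OLS slope estimate
b0 <- mean(y) - b1 * mean(x)        # OLS intercept estimate
c(b0, b1)
coef(lm(y ~ x))                     # the same values from lm()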

Normal Distribution

These are symmetrical distributions, with the area above the mean the same as the area below the mean - Many variables, like income, have skewed distributions - Even non-normal distributions can be transformed to produce approximately normal distributions by 'logging' them - The normal distribution (also called the Gaussian distribution) is mound-shaped and symmetric: data tend to cluster around the mean value and become rarer and rarer the further we move away from the mean. The normal distribution is defined by two parameters: the mean (μ) and the standard deviation (sigma). Knowing these two parameters, we can draw any normal distribution. The normal distribution is so important because many natural and social science phenomena follow a normal distribution. It is also pretty easy to define, and just by knowing the mean and standard deviation, we can know what proportion of the data are located within a given interval
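A minimal sketch in R of the proportions of a normal distribution that fall within 1, 2, and 3 standard deviations of the mean:
pnorm(1) - pnorm(-1)     # about 0.68 within 1 SD
pnorm(2) - pnorm(-2)     # about 0.95 within 2 SD
pnorm(3) - pnorm(-3)     # about 0.997 within 3 SD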

Bar Plots

These are a very simple way to visualize the distribution of a categorical (factor) variable - In the Afghanistan case, we can look at whether someone reported victimization by the ISAF and divide it into three bars of harm, no harm, and no response, which gives a clear visualization of that

Rejection Region

The critical value splits the distribution centered around 0 into a non-rejection region in the middle and a ​rejection region ​in the tails - at the 5% level the boundary is plus or minus the critical t-value (1.96 in this case) - so we reject the null hypothesis if the absolute value of our t-statistic is above 1.96 in this case

Quota Sampling

This sets certain fixed quotas for types of respondents to try to get the responding population to match the characteristics of the overall population - i.e. making sure that 50% are male and 50% female - Thus, interviewers focus on filling quotas within the population - But there are two issues with this - 1) It will only be representative for the characteristics you set quotas for, which means there may be unobserved characteristics impacting the answers - this is ​selection bias - 2) Within these quotas, interviewers can interview whoever they want, and this may lead interviewers to interview whoever is easiest to find, which again can create skewing characteristics in the surveyed population, leading to ​selection bias again

Increase in the R-Squared

Thus an increase in the R squared does not in itself necessarily mean a better regression model, as it is no indication that we are actually explaining the effect of a given predictor x on our outcome y. In social sciences, we are typically not so much interested in predicting or fully determining y, but rather estimating possible connections between x and y. This is why assessing the relationship between x and y, and the confidence we have in its causality is our chief aim, not increasing the R squared!

Interaction Terms for Widow Example

Thus, without the interaction we lose out on the interaction effect between the two variables - We move from Y = β0 + β1X(tv) + β2X(widowed) + εi to Y = β0 + β1X(tv) + β2X(wid) + β3X(tv)X(wid) + μi - Since the interaction term is significant, there seems to be an important interaction between the two factors - Thus, if we want to make the best prediction and explain the outcome we should include it

TSS

Total Sum of Squares (TSS) measures the error from predicting the value of Y using just the mean of Y: the sum of the squared differences Yi − mean of Y - this will have a much higher error than using the line of best fit

Independence

Two events A and B are said to be independent if P(A and B) = P(A)P(B) - This means the occurrence of one event does not affect the probability of the other - If A and B are independent, then P(A|B) = P(A)

Pre-treatment variable or characteristics

Variables realized before the administration of the treatment, which are therefore not affected by it (e.g. gender, age).

Continuous Variables

Variables that can take any value within a range, with no clear subdivisions between levels - Can be seen with age if coded a certain way, or with time

Multiplicative Interaction Terms

We can examine these kinds of conditional relationships by adding multiplicative interaction terms to the OLS model. This allows the effect of economic growth on support for incumbents to be different in relatively open and relatively closed countries. When we include multiplicative effects, we also need to add separate terms for both of the interacting variables (see the sketch below). Interaction effects (moderating effects, conditional effects) allow us to determine whether the effect of one independent variable (binary_growth) on an outcome variable (delta_vote) is conditional on the level of another variable (binary_openness). We refer to two kinds of effects in interaction models: main effects and interaction effects. - Main effect: the Beta coefficient of a particular variable (represented by Beta1 and Beta2 in equation 2) represents the relationship between that variable and the outcome delta_vote. - Interaction effect: Beta3 is the coefficient of the "interaction term". It indicates that the effects of binary_growth and binary_openness are now dependent on the values of the other variable.
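A minimal sketch in R using simulated data with the variable names from this entry (the data frame and its values are hypothetical):
set.seed(2)
df <- data.frame(binary_growth = rbinom(50, 1, 0.5),
                 binary_openness = rbinom(50, 1, 0.5))
df$delta_vote <- 2 + 3 * df$binary_growth - 1 * df$binary_openness +
                 2 * df$binary_growth * df$binary_openness + rnorm(50)
# The * operator adds both main effects and the interaction term
fit <- lm(delta_vote ~ binary_growth * binary_openness, data = df)
summary(fit)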

Item Non-Response

When units just don't answer certain questions this is called ​item non-response - Where some respondents may choose not to answer certain questions - This can create bias even if the sample is representative - This often happens with sensitive questions, which can lead people to be wary of responding - If those who answer are systematically different from those who don't answer, this can result in the inference being biased

Non-linearity

When we model a relationship with OLS, we assume a linear relationship between the parameters (this is actually the first OLS assumption). This implies that if X and Y are positively related at low values of X, they are positively related, and to the same extent, at high values of X. This is not always the case. Think for instance about the relationship between age and savings: you start your life without savings, but your savings typically increase until retirement age, and decline thereafter until you die. A linear relationship between savings and age does not allow you to capture this. How do we account for non-linearity in regression? We can model this type of relationship by adding a polynomial term to the regression equation. Specifically, we add a polynomial to the variable(s) where we think a non-linear relationship exists, with as many terms as the power of the polynomial which we think best expresses the relationship: two for a quadratic relationship (linear and squared), three for a cubic relationship (linear, squared, and to the power of three), and so on. The polynomial is flexible, in the sense that the model doesn't have priors as to whether the estimated relationship is convex or concave (whether the relationship is U-shaped or inverted U-shaped).
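A minimal sketch in R of adding a quadratic term, using simulated age/savings data as a purely illustrative example:
set.seed(3)
age <- runif(200, 20, 90)                                   # hypothetical ages
savings <- 50 * age - 0.4 * age^2 + rnorm(200, sd = 100)    # hump-shaped relationship
fit <- lm(savings ~ age + I(age^2))                         # quadratic specification
# equivalently: lm(savings ~ poly(age, 2, raw = TRUE))
summary(fit)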

What is in the error term

When you cannot measure a potential control variable, it remains in the error term

Equation for Multivariate Regression

Y = α + β1X1 + β2X2 + ... + βpXp + ε

Increasing Independent Variables in a Multivariate Regression

You can increase the number of your independent variables as long as their inclusion is motivated by theory, they are predetermined (i.e. they are observable before the treatment occurs/the outcome is observed), they are correlated with both the dependent and the independent variable of interest, and there are sufficient degrees of freedom (the number of observations is higher than the number of independent variables in the model).

Confounders

You can never match on everything - the unobserved things are known as ​confounders, which are variables associated with both the treatment and the outcome - they lead to selection bias - A confounder is any pretreatment variable that influences both the dependent and independent variables. Confounders result in confounding bias, making it impossible to draw causal inferences from the data.

Variable

a measurement of a characteristic of a case (something or someone) that (usually) varies across cases in a population of cases

Regression Analysis

a statistical method for examining the relationship between two or more variables of interest. At the core, most types of regression analyses study the influence of one or more independent variables (right-hand side variables) on a dependent variable (left-hand side variable, also known as the outcome of interest).

natural experiment

a type of observational study in which an event allowing for a (quasi-)random assignment of study subjects to different groups is exploited to investigate a particular research question. They are used in situations in which controlled experiments are unfeasible (e.g. for ethical reasons). Examples: weather events, natural disasters, a large and unexpected migration flow.

Random Variables

all the variables we study in political science are actually random variables because each value they take is a result of a probabilistic process. The probability that a random variable takes a given value generates its probability distribution. The values of random variables must represent mutually exclusive and exhaustive events.

Experimental Study

an empirical investigation in which researchers introduce an intervention (or "treatment") and study its effects. Experimental studies are usually randomized, meaning the subjects are randomly allocated to either the treatment or the control group (we then talk about Randomized Controlled Trials)

Observational Study

an empirical investigation that makes an inference from a sample to the population, or estimates the effects of one (set of) variables on another, when the main independent variables are not deliberately administered and so the researcher is not "in control" of study conditions.

outlier

an outlier is an observation that lies at an "abnormal" or "extreme" distance from the rest of the values for a variable in a random sample from a population. There is no ex ante, unique definition of what an abnormal distance is - it is often left to the researcher's subjective judgement. When visualizing a boxplot in R, outliers are observations which fall outside of the interval [Q1 − 1.5×IQR; Q3 + 1.5×IQR], where Q1 and Q3 are the first and third quartiles respectively, and IQR is the inter-quartile range.

Balance

broadly refers to the idea that treatment and control groups are similar along a range of pre-treatment, observable characteristics (e.g. as many democratic as autocratic countries, as many rich as poor individuals, as many men as women).

Causal Inference

does a specific factor cause a phenomenon to happen - like what is the effect of democracy on economic growth

Y-Intercept on Regression With Dummy Variables

factorial variables with k levels can be added to a regression as k − 1 binary variables. If you included all k dummy variables in the regression at the same time, R would automatically leave one out, and that one would become the reference category. The omitted level is captured by the intercept: in this case, the intercept represents the incumbent's expected vote share in a Western European/North American country when all independent variables take a value of zero.

Descriptive Inference

making an inference from what we observe (the data) to what we are actually interested in (the broader population or concept)

Bias

generally, bias refers to the tendency of a measurement process to over or under-estimate the real value of a population parameter. In statistics, the bias of an estimator is the difference between the estimator's expected value and the true value of the parameter estimated

Linearity Assumption

The population regression model is linear in its parameters - This is why the OLS model uses a line (that minimizes the sum of the squared errors) instead of a curve or other potential fits to the data - Sample estimators are assessed on the basis of two criteria: unbiasedness and efficiency - In a perfect world, we want to minimize bias and maximize efficiency - The OLS estimator is chosen because, if certain assumptions hold, it is (among the family of comparable estimators) the one most likely to be unbiased and most efficient - So the reason we pick the OLS line is that it is the most efficient estimator among all the unbiased estimators From Worksheet: When a set of key assumptions hold, OLS gives us the most efficient and least biased estimate of the true relationship between dependent and independent variables. By bias, we mean the expected difference between the estimator (the rule we use to estimate the parameters) and the population parameter (i.e. the true relationship between X and Y). The least biased estimator minimizes this difference. The model that yields the least variance in explaining the population parameters is the most efficient (this makes sense intuitively, as variance is a measure of how far the coefficient estimates potentially are from the true values in the population). Lower variance means that under repeated sampling, the estimates are likely to be similar.

Dropout / Attrition

it refers to the loss of participants or study subjects in an experiment or survey - typically when subjects are interviewed at several points in time i.e. in longitudinal studies.

Level-log

level-log means the X variable is logged but the outcome is not - the coefficient is divided by 100 because the change in X is expressed as a percentage: a 1% increase in X is associated with a β1/100 change in Y

log-level

log-level means we log the outcome but not the X variable - the coefficient is multiplied by 100 because the change in Y is now expressed as a percentage: a one-unit increase in X is associated with an approximate 100×β1 percent change in Y

Log-log

log-log means both are logged - β1 is interpreted as an elasticity: how much of a ​percentage change in y is associated with a ​percentage change in x

Kernel Density Estimation

a non-parametric density estimation method. The intuition is the following: one of the issues with histograms when used to visualize the distribution of a continuous variable is that they are not smooth, as they depend on the width and end points of the bins. Kernel estimators centre a function called the kernel function at each data point. The result amounts to a local average, as kernel estimators smooth the contribution of each data point over a local neighborhood of that observation.
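A minimal sketch in R comparing a histogram with a kernel density estimate (simulated data, purely illustrative):
set.seed(4)
x <- rnorm(500)                 # hypothetical continuous variable
hist(x, freq = FALSE)           # histogram on the density scale
lines(density(x), lwd = 2)      # kernel density estimate overlaid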

Ordinal Variables

ordered categorical measures - Survey questions that ask you to rank your support for something - But although we know the difference between categories, we don't know the degree of this difference ​- we don't know the gap between people who 'agree' with something and those who 'strongly agree'

Descriptive Methods

the range of statistical methods used to summarize the data - like the mean, standard deviation, etc. - which are usually reported to make it easier to understand and display the information.

Inferential Methods

range of statistical methods which use data on a sample of the population to make predictions about the population as a whole.

Bivariate vs. Multivariate Analysis

Bivariate analysis refers to the analysis of the relationship between two variables - as opposed to univariate analysis (analysis of one variable) and multivariate analysis (analysis of the relationship between more than two variables simultaneously).

Random Sample

a sample of n subjects from a population in which each sample of that size has the same probability, i.e. chance, of being selected. Random = equal chance of selection.

Study Arms

the arm of a randomized experimental study refers to a group of subjects who receive a specific treatment (or no treatment) among several possible treatments. One can think of them as different treatment conditions.

Degrees of Freedom

the degrees of freedom (N − k) are the difference between the total number of observations and the number of explanatory variables - So the degrees of freedom are N − 1 for a one-variable test and N − 2 for a two-variable test, etc. Degrees of freedom are the number of data points that went into the estimation of the parameters, after taking into account those parameters (the restriction).

Zero Conditional Mean Assumption

the expected value of the error term (the population mean) is zero conditional on the explanatory variables - A linear regression model is a ​linear approximation of the relationship between explanatory variables X and a dependent variable Y, with an error term that covers the unmeasured variables - β is a regression coefficient of X, which describes the association between X and Y - In order for β(hat) to be unbiased we need the following ​zero conditional mean c​ondition: E[ε | X] = 0 - In observational studies, X is likely to be determined by omitted variables in ε which could also be related to Y - Therefore E[β(hat)] =/= β - This is known as omitted variable bias - A common practice that aims to account for omitted variable bias is to use controls (C) in the analysis - So the best thing you can do to decrease selection/omitted variable bias is to try to include controls for the key alternative variables FROM WORKSHEET: This is the most crucial assumption: if it does not hold, our OLS estimates will be biased. This assumption means that the error term ε has an expected value of zero given any value of the explanatory variable Xi. This means that the expected value of the error term conditional on the independent variable Xi is zero. In other words, other factors not included in the model and affecting your outcome Y are NOT related on average (i.e. in expectation) with any of the independent variables Xi included in your model. Put slightly differently, there should be no discernible correlation between the unexplained part of your outcome (i.e. the error term) and the independent variables which participate in explaining the outcome. Omitting an important factor that is related to any of your explanatory variables Xi will violate this assumption. When the zero conditional mean assumption holds, we say that we have exogenous explanatory variables. If, on the other hand, an independent variable Xi is correlated with the error term for any reason, then Xi is said to be an endogenous explanatory variable

The Fundamental Problem of causal inference

the fundamental problem of causal inference is the need to infer counterfactual outcomes - ​what would have happened if x had happened instead of y? Or if y just hadn't happened at all?

population and sample

the population is the universe of potential and possible cases of interest (individuals, events, objects etc.) defined with respect to a study. The population of a national election study, for instance, is all the possible voters in that country. A sample is the subset of the population for which the study collects and has data on.

Probability Distribution

the set of probabilities that a random variable takes each of its possible values - P(coin): P(coin = 1), P(coin = 0) - P(survey): P(survey = 4), P(survey = 3), etc.

Standard Error and Sample Size

the size of our confidence intervals depends on the sample size. In case of a small sample size the confidence interval is rather wide. In other words, the larger our sample the smaller our standard error and therefore the narrower the confidence interval and more precise estimate of the population mean.

Treatment and Control Groups

the treatment group is the group that has received the treatment (hence we only know what the post-treatment outcome is under treatment conditions for all subjects in this group). The control group is the group which does not receive the treatment (hence we only observe the post-treatment outcome under no treatment for all subjects in this group). Think of medical studies for instance.

No Perfect Collinearity Assumption

there is variation in the explanatory variables and no exact linear relationships among them - The violation of this is known as multicollinearity - when two or more independent variables are highly related (overlap significantly) - An example is two variables, socializing per week and studying per week, used to explain exam scores - If the goal is to estimate the effect of an explanatory variable on an outcome, multicollinearity makes it difficult to gauge the effect of the variables independently - Moreover, multicollinearity leads to inflated standard errors for estimates of regression parameters: - This leads to larger confidence intervals - This means you can be less likely to reject the null hypothesis than you otherwise should be - You can diagnose multicollinearity by understanding the relationship between the concepts being measured - Or you can also diagnose it by looking at correlation - Very strong correlations mean that the variables move very closely with each other, so a variable with a correlation too close to 1 or to -1 is a sign of multicollinearity - Thus, you can diagnose this by complementing correlation analysis with an understanding of the potential relationship between the concepts (see the sketch after this entry) - The central limit theorem applies to samples as well - if the sample size is large enough (rule of thumb >30), the distribution of sample means (the sampling distribution) is approximately normal FROM WORKSHEET: The assumption of no perfect collinearity requires that there are no exact linear relationships between the independent variables. An exact linear relationship means you can write an explanatory variable as a linear combination of the other explanatory variables (e.g. one variable is a multiple of another, or the sum of some of the others). If you have perfect correlation, only one of the two variables can be estimated. More commonly, you may have high (but not perfect) correlation between two or more independent variables. This is called multicollinearity. While this does not constitute a violation of any of the assumptions needed for OLS to be the best linear unbiased estimator, it will inflate the standard errors, and is therefore something to look out for. All other things equal, to estimate Beta(j), it is better to have less correlation between Xj and the other independent variables. The coefficients are less precise (look at the standard errors) than those found in Week 4. Again, this does not constitute a violation of any of the assumptions (OLS is still unbiased), but the coefficients are less precisely estimated, making statistical inference more difficult. If you suspect there is multicollinearity between some of your independent variables, you can look at the correlation between your independent variables. Alternatively, you can regress the suspect (let's call it Xj) on the other independent variables, and look at the R-squared from that regression - which gives you the proportion of the total variation in Xj that can be explained by the other independent variables. A higher R-squared in the regression which takes the suspect independent variable as its outcome reflects multicollinearity with other independent variables in your model. It is also possible to test directly for multicollinearity. What can be done about this? If multicollinearity is an artifact of small sample size, a solution would be to increase the sample size - but often that is not feasible. You can also drop one of the independent variables - but be careful: if you drop a variable that is a predictor of your outcome in the population, this can result in bias!
Remember that the inclusion of explanatory variables has to be motivated theoretically and on the basis of their relationship with a) the outcome, and b) other explanatory variables included in the model.
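A minimal sketch in R of the two diagnostics described above, using a hypothetical data frame df with explanatory variables x1, x2, x3 (x2 is built to be highly correlated with x1):
set.seed(5)
df <- data.frame(x1 = rnorm(100))
df$x2 <- 0.9 * df$x1 + rnorm(100, sd = 0.3)     # x2 highly correlated with x1
df$x3 <- rnorm(100)
cor(df)                                          # pairwise correlations between the predictors
aux <- lm(x1 ~ x2 + x3, data = df)               # regress the suspect variable on the others
summary(aux)$r.squared                           # a high R-squared signals multicollinearity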

T-Statistic

tn−2 = β(hat)1 / SE[β(hat)1] - This creates a ​t-statistic, ​which follows a t-distribution: bell-shaped like the normal distribution but with a wider spread (heavier tails), especially in small samples - The t-statistic tells us how many standard errors our estimate lies away from 0, the value implied by the null hypothesis - Thus, if the t-statistic crosses the needed threshold (usually 1.96) it allows us to reject the null hypothesis - We can also test the null hypothesis by seeing if 0 is included in the confidence interval - if it isn't, it is very unlikely that the data would have arisen under the null hypothesis and it can safely be rejected - The rejection threshold becomes larger if we use a stricter significance level such as 0.01 - The t-statistic is obtained by dividing the beta by the SE, and this can be used to assess whether the finding is within the range of the null hypothesis or not

Treatment Variable

typically a key causal variable of interest, the effect of which we want to estimate

Bin of a histogram

when building a histogram, you start by dividing the range of values into a series of intervals called "bins", before counting how many values fall into each interval. Bins are usually consecutive, non-overlapping intervals of the same size (though that is not required).

Balance on Pre-Treatment Characteristics

when using (quasi-)experimental data, treatment should be assigned randomly among the sample subjects, hence the characteristics of the control and treatment group before treatment administration should be approximately equal. This is referred to as balance

The Hawthorne Effect

where people behave differently because they know they are being watched

Standard Deviation

√[(1/(n−1)) ∑(xi − x(bar))^2] - tells us how far away data points are, on average, from the mean - Variance = standard deviation squared (i.e. this whole expression without the square root at the end) - This is very useful for understanding the spread in the data - The standard deviation is affected by outliers because it is based on differences from the mean
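A minimal sketch in R (hypothetical data) showing that the formula and the built-in function agree:
x <- c(4, 8, 6, 5, 3, 7, 9)                       # hypothetical data
sqrt(sum((x - mean(x))^2) / (length(x) - 1))      # standard deviation by the formula
sd(x)                                             # built-in equivalent
var(x)                                            # variance = standard deviation squared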

