pols 201 midterm
direction
> 0 is positive; < 0 is negative
difference-in-means estimator
Y-bar (treatment group) - Y-bar (control group), or the average outcome for the treatment group minus the average outcome for the control group; non-binary outcome: interpret as a difference in averages, in the same unit of measurement as the variable; binary outcome: interpret in percentage points, after multiplying by 100
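A minimal sketch of the difference-in-means estimator in R. The data frame and its variable names (treated, score) are made up for illustration.

```r
# Hypothetical data: 'treated' is a binary treatment variable,
# 'score' is a non-binary outcome variable
data <- data.frame(
  treated = c(1, 1, 1, 0, 0, 0),
  score   = c(80, 90, 85, 70, 75, 80)
)

# Average outcome in each group, selected with [] and ==
mean_treatment <- mean(data$score[data$treated == 1])  # (80 + 90 + 85) / 3 = 85
mean_control   <- mean(data$score[data$treated == 0])  # (70 + 75 + 80) / 3 = 75

# Difference-in-means estimator: treatment average minus control average
diff_in_means <- mean_treatment - mean_control
diff_in_means  # 10, in the same unit as 'score' (points)
```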
standard deviation
a measure of the spread of a variable's distribution
X_i
a particular observation of X, where i denotes the position of the observation and n is the total number of observations in the variable
correlation coefficient or correlation
a statistic that summarizes the relationship between two variables with a single number between -1 and 1; shows the direction and strength of a linear association
sample
a subset of individuals chosen for a study
randomized experiment
a type of study design in which treatment assignment is randomized
histogram
of a variable is the visual representation of its distribution through bins of different heights
table of proportions
of a variable shows the proportion of observations that take each value in the variable
representative sample
accurately reflects the characteristics of the population from which it is drawn, that is, characteristics appear in the sample in similar proportions as in the population as a whole
if a variable is binary
as a proportion, in % after multiplying the result by 100
if variable is non-binary
as an average, in the same unit of measurement as the variable
β hat is ∆ Y ...
the change in predicted Y associated with a one-unit increase in X (∆X = 1); this is the slope
conclusion statement final paragraph
assuming that [the treatment and control groups are comparable] (a reasonable assumption because...), we estimate that [the treatment] (increases/decreases) [the outcome] by [size and unit of measurement of the effect] on average
lm()
fits a linear model using a formula of the form Y ~ X; ex: lm(data$final ~ data$midterm)
cor()
calculates the correlation ex: cor(star$reading, star$math)
median()
calculates the median
sd()
calculates the standard deviation; a range around the mean runs from X-bar - k × sd(X) to X-bar + k × sd(X), for some number k of standard deviations
datasets
capture the characteristics of a particular set of individuals or entities
least squares method
chooses the line that minimizes the prediction errors
strength
the closer the absolute value of the correlation is to 1, the stronger the linear association; 0 means no linear association
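A small illustration of direction and strength using cor(). The vectors below are invented so that the three cases are easy to check by hand.

```r
# Hypothetical variables
x <- c(1, 2, 3, 4, 5)
y_up   <- 2 * x + 1          # rises with x along a perfect line
y_down <- -3 * x             # falls with x along a perfect line
y_none <- c(2, 1, 3, 1, 2)   # no linear pattern with x

cor_up   <- cor(x, y_up)     # 1 up to rounding: positive, as strong as possible
cor_down <- cor(x, y_down)   # -1 up to rounding: negative, as strong as possible
cor_none <- cor(x, y_none)   # 0 up to rounding: no linear association
```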
variables
column that contains the values of a changing characteristic
mean()
computes the mean of a variable
sqrt()
computes the square root of the argument specified inside the parentheses
var()
computes the variance ex: var(voting$birth)
hist()
creates a histogram based on one variable
plot()
creates a scatter plot using plot(x=data$x_var, y=data$y_var)
assignment operator
(<-) creates an object by storing contents under a name
ifelse()
creates the contents of a new variable based on the values of an existing one; requires three arguments, separated by commas, in the following order: (1) logical test (using ==), (2) return value if the logical test is true, (3) return value if the logical test is false; ex: ifelse(data$variable == "yes", 1, 0)
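A short sketch of ifelse() turning a character variable into a numeric binary one. The data frame and the names answer and answer_binary are hypothetical.

```r
# Hypothetical data frame with a character variable
data <- data.frame(answer = c("yes", "no", "yes", "no", "no"))

# 1. logical test, 2. value if true, 3. value if false
data$answer_binary <- ifelse(data$answer == "yes", 1, 0)

data$answer_binary        # 1 0 1 0 0
mean(data$answer_binary)  # 2/5 = 0.4, that is, 40% answered "yes"
```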
unit of observation
defines the individuals or entities that each observation in the data frame represents; ex: the unit of observation is students if each observation represents a different student
scatter plot
enables us to visualize the relationship between two variables by plotting one variable against the other in the two-dimensional space
substantive interpretation of β hat
ex: an increase in midterm scores of 1 point is associated with a predicted increase in final exam scores of 0.97 points, on average
substantive interpretation of α hat
ex: when a student scores 0 points on the midterm, we predict that in the final exam they will score -6.01 points, on average
[ ]
extracts a selection of observations from a variable; to its left, we specify the variable we want to subset; inside the square brackets, we specify the criteria of selection; ex: data$var1[data$var2 == 1], which extracts the observations of the variable var1 for which the variable var2 equals 1
Yhat = α hat + β hat X
fitted line
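A minimal sketch of fitting the line and reading off α hat and β hat. The scores are invented, and the names alpha_hat, beta_hat, and y_hat_75 are used only for illustration.

```r
# Hypothetical midterm and final exam scores
data <- data.frame(
  midterm = c(60, 70, 80, 90, 100),
  final   = c(55, 68, 79, 85, 98)
)

# Fit the linear model and store it in an object called fit
fit <- lm(data$final ~ data$midterm)

# coef() returns the estimated coefficients: intercept first, then slope
alpha_hat <- coef(fit)[1]  # intercept
beta_hat  <- coef(fit)[2]  # slope

# Predicted final score for a midterm score of 75: Yhat = alpha_hat + beta_hat * X
y_hat_75 <- alpha_hat + beta_hat * 75
```

After drawing the scatter plot with plot(), abline(fit) would add this fitted line to it.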
how do you write a function?
function_name(required_argument, optional_argument_name = optional_argument)
object
how R stores data
control group
individuals who did not receive the treatment
treatment group
individuals who received the treatment
α hat
intercept
linear model
the line fitted to a scatter plot to summarize the relationship between X and Y
random sampling
makes the sample and the target population on average identical to each other in all observed and unobserved characteristics
random treatment assignment
makes the treatment and control groups on average identical to each other in all observed and unobserved pre-treatment characteristics
measures of centrality
mean or median
numeric non-binary
more than two numeric values (ex: distance traveled)
proportion of observations
number of observations that meet the criterion / total number of observations (ex: 3/6 = 0.5, or 50%)
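The 3/6 example can be reproduced in R as a quick check. The vector voted is hypothetical.

```r
# Hypothetical binary variable: 1 = voted, 0 = did not vote
voted <- c(1, 0, 1, 1, 0, 0)

# Observations that meet the criterion / total number of observations
prop_voted <- sum(voted == 1) / length(voted)  # 3/6 = 0.5

# For a 0/1 variable, the mean gives the same proportion
mean(voted)       # 0.5
prop_voted * 100  # 50, in %
```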
median
of a variable is the value at the mid-point of the distribution that divides the data into two equal-size groups
descriptive statistics
of a variable numerically summarize the main characteristics of its distribution
numeric binary
only two numeric values (0,1) that represent the presence or absence of a trait (ex: voted or not)
View()
opens a new tab in the upper-left window of RStudio with the contents of the dataset (only function with a capital first letter)
dataframes
datasets organized into observations (rows) and variables (columns)
y hat is the predicted outcome
our predictions of Y using X
hat
predicted or estimated
∆Y hat = β hat∆X
predicts change in y hat associated with change in X
dim()
provides the dimensions of the data frame, using the name of the object (in the order rows, columns)
read.csv()
reads CSV files; the only required argument is the name of the CSV file in quotes; ex: read.csv("file.csv")
∆
represents change
observation
a row of the data frame; represents a particular entity or individual in the study
setwd()
sets the working directory, that is, directs R to the folder on your computer where the dataset is saved (Session >> Set Working Directory >> To Source File Location)
table()
creates a frequency table: shows how many observations take each value of a variable; ex: table(voting$voted) gives the number of observations that did (1) and did not (0) vote
prop.table(table( ))
creates a table of proportions: shows the proportion of observations that take each value (multiply by 100 for %); ex: prop.table(table(voting$voted))
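A quick sketch of table() and prop.table() side by side. The vector voted stands in for a variable like voting$voted and is made up.

```r
# Hypothetical binary variable: 1 = voted, 0 = did not vote
voted <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

# Frequency table: number of observations that take each value
freq <- table(voted)       # 7 zeros, 3 ones

# Table of proportions: each count divided by the total
props <- prop.table(freq)  # 0.7 and 0.3

# Multiply by 100 to read the proportions as percentages
props * 100                # 70 and 30
```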
head()
shows the first six rows or observations in a dataset using the name of the object
frequency table
shows the values the variable takes and the number of times each value appears in the variable
β hat
slope
population
the full set of individuals we are interested in, such as the residents of a country; it is often infeasible to collect data from an entire population
slope
specifies the angle or steepness of the line
intercept
specifies the vertical location of the line
measures of spread
standard deviation or variance
what is a function?
takes input -> performs actions with the inputs -> produces output
character variable
contains text; in R code its values must be written in quotes (ex: "yes")
X-bar
the average of X
average causal effect
the average of all the individual causal effects of X on Y within a group
causal effects
the cause-and-effect connection between two variables
causal effect of X on Y
the change in the outcome variable caused by a change in the treatment variable
the larger the standard deviation...
the flatter the distribution
what values do not need to be in quotes?
names of objects, names of functions, and names of arguments; special values such as TRUE, FALSE, NA, and NULL; and numbers
i
the position or row number of the observation
unit of measurement
the quantity in which the value is measured (points, miles, kilometers)
variance
the square of the standard deviation of a variable; standard deviations are easier to interpret because they are in the same unit of measurement as the variable
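A small check of the link between var() and sd(), using an invented vector. Note that R's var() and sd() divide by n - 1 rather than n, which may differ slightly from a hand calculation that divides by n.

```r
# Hypothetical variable
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

variance <- var(x)  # in squared units of x
std_dev  <- sd(x)   # in the same unit as x

# The standard deviation is the square root of the variance
sqrt(variance)
std_dev
```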
mean
the sum of the values across all observations divided by the total number of observations
n
the total number of observations
percentage point
the unit of measurement for the arithmetic difference between two percentages
outcome (y)
the variable that we want to predict
predictor(X)
the variable we want to use to predict the outcome
what is # used for?
to write comments: notes in the code that R ignores when running it
aim of predictions
to predict Y as accurately as possible with the smallest errors possible
R script
type of file we use to store the code we write to analyze data
$
used to access an element inside an object such as a variable inside a data frame
fit
the name we give to the object that stores a fitted linear model; ex: fit <- lm(data$final ~ data$midterm)
==
used to create logical tests that evaluate whether the observations of a variable equal a particular value (values in quotes if not numbers ex: "yes")
abline()
adds the fitted line to the existing scatter plot; takes the name of the object that stores the fitted line; ex: abline(fit)
interpretation of mean()
reported in the variable's unit of measurement; non-binary ex: the average year of birth of registered voters is 1956; binary ex: 31% of registered voters voted
outcome variable (Y)
variable that may change as a result of a change in the treatment variable (either binary or non-binary)
treatment variable (X)
variable whose change may produce a change in the outcome variable (ALWAYS BINARY in this class)
x are predictors
variables that we use as the basis for our predictions
y is the outcome variable
what we want to predict
counterfactual outcome
what would happen if we had made different decisions - impossible to observe
writing a conclusion statement breakdown
what's the assumption we are making when estimating the average causal effect? why is this a reasonable assumption? what's the treatment? what's the outcome? what's the direction, size and unit of measurement of the average causal effect?
α hat is Y hat when...
X = 0
ε hat is the errors of our predictions
ε hat = Y - Y hat
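A minimal sketch of computing the prediction errors by hand. The data are invented, and the object names y_hat and errors are chosen for illustration.

```r
# Hypothetical predictor and outcome
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit the line by least squares
fit <- lm(y ~ x)

# Predicted outcomes on the fitted line
y_hat <- fitted(fit)

# Prediction errors: observed outcome minus predicted outcome
errors <- y - y_hat

# lm() stores the same errors; residuals(fit) returns them
residuals(fit)

# The least squares method picks the line that makes this sum as small as possible
sum(errors^2)
```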