R : Statistics
prop.table( )
converts a table object into a relative frequency table EX cm_table <- table(comics$gender, comics$align) prop.table(cm_table)
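Passing a margin argument gives conditional proportions instead of proportions of the grand total; a quick sketch using the same table object: prop.table(cm_table, 1) # proportions within each row (gender) prop.table(cm_table, 2) # proportions within each column (align)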
confusion matrix
cross-tabulates the actual outcomes against what the model predicted
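A minimal sketch, assuming predicted and actual are vectors of class labels of the same length: conf <- table(predicted, actual) # rows = predicted class, columns = actual class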
Observational study
data are collected in a way that does not directly interfere with how the data arise; only a correlation (not causation) between the explanatory and response variables can be inferred
Fitting a line to a binary response
data_space <- ggplot(MedGPA, aes(y = Acceptance, x = GPA)) + geom_jitter(width = 0, height = 0.05, alpha = 0.5) # linear regression line data_space + geom_smooth(method = "lm", se = FALSE)
adding logistic curve
data_space <- ggplot(data = MedGPA, aes(x = GPA, y = Acceptance)) + geom_jitter(width = 0, height = 0.05, alpha = .5) # add logistic curve data_space + geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"))
str( )
shows how many variables and observations are in the dataset, along with each variable's type
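For example, on the comics dataset used elsewhere in these notes: str(comics) # dimensions, variable names, types, and the first few values (glimpse( ) from dplyr gives a similar overview)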
simple random sample in R
states_srs <- us_regions %>% sample_n(8) # randomly selects 8 rows (states)
stratified sample in R
states_str <- us_regions %>% group_by(region) %>% sample_n(2) # must use group_by( ) first so sample_n( ) draws within each stratum
Experimental study
subjects are randomly assigned to various treatments; causation can be inferred
table ( )
table(comics$gender, comics$align) cross-tabulates two variables of a dataset, showing the count for each combination of levels
Null Hypothesis (H0)
the claim that is not interesting; the goal is to disprove this hypothesis
residuals
the difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual. Residual = Observed value - Predicted value, i.e. e = y - ŷ. For a least-squares fit with an intercept, both the sum and the mean of the residuals are equal to zero
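A quick check in R, assuming mod is a fitted lm object: sum(residuals(mod)) # effectively zero, up to floating-point error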
GLM: generalized linear model
used with a binary response variable glm(Acceptance ~ GPA, data = MedGPA, family = binomial) the fitted probabilities from the logistic regression model never reach exactly 0 or 1
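To get fitted probabilities rather than log-odds, a sketch assuming mod holds the fitted model above: mod <- glm(Acceptance ~ GPA, data = MedGPA, family = binomial) predict(mod, type = "response") # predicted probability of acceptance for each observation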
== translation
"is equal to"
adding interaction terms- x:z
# include interaction glimpse(mario_kart) lm(totalPr ~ cond + duration + cond:duration, data = mario_kart) produces a model with 4 coefficients (intercept, cond, duration, and the interaction term)
cor(x,y): Pearson product-moment correlation coefficient
A type of correlation coefficient used with interval and ratio scale data. In addition to providing information on the strength of relationship between two variables, it indicates the direction (positive or negative) of the relationship. ncbirths %>% summarize(N = n(), r = cor(weight, mage)) ncbirths %>% summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs")) #used for NA values
four characteristics of a distribution
center, variability (spread), shape, and outliers
Principles of experimental design: Control
Control: compare treatment of interest to a control group
Alternative Hypothesis (Ha)
The alternative hypothesis corresponds to the research question of interest
Y=β0+β1⋅X+ϵ, where ϵ∼N(0,σϵ).
Y = response variable, β0 = y-intercept, β1 = slope coefficient, ϵ = disturbance / random noise. The fitted model for the poverty rate of U.S. counties as a function of high school graduation rate is: povertŷ = 64.594 − 0.591 ⋅ hs_grad. In Hampshire County in western Massachusetts, the high school graduation rate is 92.4%. These two facts imply that the poverty rate in Hampshire County is expected to be near 10%
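The worked arithmetic behind that prediction, in R: 64.594 - 0.591 * 92.4 # ≈ 9.99, i.e. a predicted poverty rate near 10%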
response variable
a variable that measures an outcome or result of a study
explanatory variable
a variable that we think explains or causes changes in the response variable
stratified sampling
a variation of random sampling; the population is divided into homogeneous subgroups called strata and then sampled from within each stratum
correlation
a way of quantifying the strength of a linear relationship. Values span between -1 and 1; the sign (positive or negative) corresponds to the direction. Values close to 1 indicate a near-perfect positive correlation, values around .75 are considered strong, values around .5 moderate, and values around .2 weak. Plots with no real linear relationship will have a correlation close to 0
Regression to the mean
accommodates the occurrence of chance. It explains why Michael Jordan's sons were middling college basketball players and Jakob Dylan wrote two good songs. It is why there are no American parent-child pairs among Hall of Fame players in any major professional sports league.
Principles of experimental design: Block
account for the potential effect of confounding variables; group subjects into blocks based on these variables, then randomize within each block to treatment groups
R^2adj
adjusted R2 includes a term that penalizes a model for each additional explanatory variable (where p is the number of explanatory variables).
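The standard formula, with n the number of observations and mod an assumed fitted lm object: R²_adj = 1 − (1 − R²) ⋅ (n − 1) / (n − p − 1). In R it can be read off a fitted model with summary(mod)$adj.r.squared or broom's glance(mod)$adj.r.squared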
r^2 (coefficient of determination)
bdims_tidy %>% summarize(var_y = var(wgt), var_e = var(.resid)) %>% mutate(R_squared = 1 - var_e / var_y) gives the proportion of variability in the response (y) that can be explained by the explanatory variable (x); the closer r^2 is to 1, the more of y is explained by x. EX: 46.4% of the variability in poverty rate among U.S. counties can be explained by high school graduation rate
augment( )
bdims_tidy <- augment(mod) # function from the broom package. augment( ) returns the linear model's data in a data frame containing the original explanatory and response values along with several quantities specific to the regression model, including the fitted values, residuals, leverage scores, and standardized residuals
Box plot quick note
box plots can hide bi-modal distributions
log( )
can transform a heavily skewed variable to reduce its skew
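A sketch, assuming gap2007 (used elsewhere in these notes) has a heavily right-skewed population column pop: gap2007 %>% mutate(log_pop = log(pop)) # adds a log-transformed copy of the variable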
SSE :sum of squared errors
captures how much the regression model missed by. SSE = sum(.resid^2), or equivalently (n - 1) * var(.resid)
blocking variables
categorical variables included in the statistical analysis of experimental data as a way of statistically controlling or accounting for variance due to that variable
cluster sampling
clusters of participants within the population of interest are selected at random, followed by data collection from all individuals in each cluster.
Principles of experimental design: Replicate
collect a sufficiently large sample within a study, or replicate the entire study
droplevels( )
comics <- comics %>% filter(align != "Reformed Criminals") %>% droplevels() # removes unused factor levels left behind by filter( )
using factor( ) to re-order data names
comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good")) # re-orders the factor levels of align
Discretizing a variable
converting a numerical variable to a categorical variable based on some criterion EX testing whether a score is below or at/above average: avg_read <- 52.3 mutate(read_cat = ifelse(read < avg_read, "below average", "at or above average"))
ei = yi− ŷi
ei = .resid, yi = actual value, ŷi = .fitted (the line of best fit, aka expected/predicted values). Residual: the difference between the actual observed value of the response variable and the expected (fitted) value according to the lm model. residuals(mod) fitted.values(mod)
case_when( )
evals <- evals %>% mutate(cls_type = case_when( cls_students <= 18 ~ "small", cls_students >= 19 & cls_students <= 59 ~ "midsize", cls_students >= 60 ~ "large" ) ) #alternative to the ifelse statement
simple random sampling
every member of the population has an equal probability of being selected for the sample EX drawing names from a hat
Facet quick note
facet on the basis of a categorical variable
predict( )
function can be used for "out of sample" observations: predict(lm, newdata). newdata must be a data frame, and its column name must match the explanatory variable used to fit the linear model
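A minimal sketch, assuming mod was fit as lm(wgt ~ hgt, data = bdims) (see the lm( ) card below): new_obs <- data.frame(hgt = 170) # column name matches the explanatory variable predict(mod, newdata = new_obs) # predicted weight at hgt = 170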
geom_abline( )
function is used in addition to geom_point to manually define the intercept and slope ggplot(data = bdims, aes(x = hgt, y = wgt)) + geom_point() + geom_abline(data = coefs, aes(intercept = `(Intercept)`, slope = hgt), color = "dodgerblue")
Filtering out OUTLIERS
gap_asia <- gap2007 %>% filter(continent == "Asia") %>% mutate(is_outlier = lifeExp < 50) # Remove outliers, create box plot of lifeExp gap_asia %>% filter(!is_outlier) %>% ggplot(aes(x = 1, y = lifeExp)) + geom_boxplot()
ggplot2: histogram creation
ggplot(cars, aes(x = city_mpg)) + geom_histogram() + facet_wrap(~ suv)
conditional bar chart
ggplot(comics, aes(x = align, fill = gender )) + geom_bar(position = "fill") + ylab("proportion")
geom_bar: side by side & stacked
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "dodge") "dodge" signifies that the bar plots will be side by side ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "fill") + ylab("proportion") "fill" signifies that the bar plots will be stacked
ggplot2: box plot
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) + geom_boxplot()
ggplot2: density plot
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) + geom_density(alpha = .3) the alpha argument determines the level of transparency. Both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers
Scatterplot interaction
ggplot(mario_kart, aes(y = totalPr, x = duration, color = cond)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
box plot counterparts
great for displaying multiple distributions. The y aesthetic for a box plot indicates the variable of interest. The middle line marks the median (2nd quartile), and points beyond the whiskers show outliers
.cooksd
high leverage and a large residual together determine influence; Cook's distance (.cooksd) combines the two measures to quantify influence
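A sketch for flagging the most influential observations, assuming mod is a fitted lm and using broom's augment( ): augment(mod) %>% arrange(desc(.cooksd)) %>% head() # observations with the largest Cook's distance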
calculating odds
how often it happens divided by how often it doesn't happen: p / (1 - p)
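Quick arithmetic sketch: p <- 0.8; p / (1 - p) # odds of 4, i.e. 4 to 1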
Simpsons paradox
illustrates the effect that the omission of an explanatory variable can have on the measure of association between another explanatory variable and the response variable (the inclusion of a third variable can change the relationship between the other explanatory/response variables)
geom_jitter()
is used to move data points on a plot up or down by a small random amount. Creates an illusion of separation in the data; this is often used for graphs with a categorical response variable, EX "dead" or "alive"
RMSE (root mean squared error)
it gives an average of how far the observed values are from the predicted values. The magnitude of a typical residual gives us a sense of generally how close our estimates are. The residual standard error reported by summary( ) is the RMSE, recorded in the same units as the response variable. sqrt(sum(residuals(mod)^2) / df.residual(mod))
levels( )
levels(comics$align) #displays the different values this variable holds
smooth regression line
library(dplyr) library(ggplot2) glimpse(bdims) ggplot(data = bdims, aes(x = hgt, y = wgt)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
Categorical variables
limited number of distinct categories; ordinal: the categories have a natural ordering
fitting a parallel slopes model
lm(totalPr ~ wheels + cond, data = mario_kart) has a numeric explanatory variable and a categorical explanatory variable
lm( )
lm(wgt ~ hgt, data = bdims) interpret the slope coefficient as "for every" one-unit increase in hgt, wgt is expected to change by that amount
Measures of center
mean & median gap2007 %>% group_by(continent) %>% summarize(mean(lifeExp), median(lifeExp))
leverage
measures the disproportionate influence an outlying value of the explanatory variable may have on the slope; the bigger the number, the bigger the potential influence. The .hat variable (from augment( )) measures leverage
parallel slopes model
model with a numerical explanatory variable and a categorical explanatory variable
MLR: multiple linear regression model
model with two or more numeric explanatory variables
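A sketch using the mario_kart variables that appear elsewhere in these notes (duration and startPr as the two numeric explanatory variables): lm(totalPr ~ duration + startPr, data = mario_kart)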
Modality
number of prominent humps in the distribution. Distributions can be unimodal, bimodal, multimodal, or uniform
3D scatter plot
library(plotly) p <- plot_ly(data = mario_kart, z = ~ totalPr, x = ~ duration, y = ~ startPr, opacity = 0.6) %>% add_markers() # draw the plane p %>% add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
Numerical variables
quantitative values. Continuous: infinite number of values within a given range (often measured), EX height. Discrete: set of numeric values that can be counted (often counted), EX number of pets
Random sampling vs Random assignment
with random assignment but no random sampling, causal conclusions apply only to the sample, not the general population; with random sampling but no random assignment, you can generalize an association between the variables studied to the population, but not infer causation
Principles of experimental design: Randomize
randomly assign subjects to treatments
skew
right skew, left skew, or symmetric # skew is named for the direction of the long tail
Measures of Spread
variance: measures how much the data are spread from the center, var( ). standard deviation: the square root of the variance, sd( ). range: diff(range(x)). IQR: IQR( ) # good for heavily skewed data. Median & IQR, like mean and standard deviation, measure central tendency and spread, but are unaffected by outliers and non-normal data
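A sketch using the gap2007 data from the measures-of-center card: gap2007 %>% group_by(continent) %>% summarize(sd(lifeExp), IQR(lifeExp), diff(range(lifeExp)))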
group_by and summarize logic
you are basically grouping the dataset by what you want to study then pulling the data you want from those groups EX "Within non-spam emails, is the typical length of emails shorter for those that were sent to multiple people?" email %>% filter(spam == "not-spam") %>% group_by(to_multiple) %>% summarize(median(num_char))