R : Statistics


prop.table( )

converts a table object into a table of relative frequencies (proportions). EX:
cm_table <- table(comics$gender, comics$align)
prop.table(cm_table)
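prop.table() also accepts a margin argument; a quick sketch using the cm_table from above:
prop.table(cm_table, margin = 1) # row proportions: each row sums to 1
prop.table(cm_table, margin = 2) # column proportions: each column sums to 1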

confusion matrix

cross-tabulates the observed outcomes with what the model predicted
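A minimal sketch of building one in R, assuming the MedGPA logistic model used elsewhere in this set; the 0.5 cutoff is an illustrative choice:
mod <- glm(Acceptance ~ GPA, data = MedGPA, family = binomial)
prob <- predict(mod, type = "response") # fitted probabilities
pred_class <- ifelse(prob >= 0.5, 1, 0) # classify at the cutoff
table(predicted = pred_class, actual = MedGPA$Acceptance) # confusion matrix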

Observational study

data are collected in a way that does not directly interfere with how the data arise; only an association (correlation) between the explanatory and response variables can be inferred

Fitting a line to a binary response

data_space <- ggplot(MedGPA, aes(y = Acceptance, x = GPA)) +
  geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# linear regression line
data_space + geom_smooth(method = "lm", se = FALSE)

adding logistic curve

data_space <- ggplot(data = MedGPA, aes(x = GPA, y = Acceptance)) +
  geom_jitter(width = 0, height = 0.05, alpha = .5)
# add logistic curve
data_space + geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"))

str( )

shows how many variables and observations are in the dataset, along with each variable's type and first few values

simple random sample in R

states_srs <- us_regions %>% sample_n(8)

stratified sample in R

states_str <- us_regions %>% group_by(region) %>% sample_n(2) # must use group_by() first

Experimental study

subjects are randomly assigned to various treatments; causation can be inferred

table ( )

table(comics$gender, comics$align) cross-tabulates two variables of a dataset, showing the count for each combination of their levels

Null Hypothesis (H0)

the claim that is not interesting; the goal is to disprove this hypothesis

residuals

the difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual: Residual = Observed value - Predicted value, i.e. e = y - ŷ. For a least-squares regression line, both the sum and the mean of the residuals are equal to zero.

GLM: generalized linear model

used when the response variable is binary: glm(Acceptance ~ GPA, data = MedGPA, family = binomial). The fitted logistic regression curve never reaches 1 or 0.
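A quick sketch of getting predictions from the model above (the GPA value of 3.5 is just an illustrative input):
mod <- glm(Acceptance ~ GPA, data = MedGPA, family = binomial)
predict(mod, newdata = data.frame(GPA = 3.5)) # log-odds scale (default)
predict(mod, newdata = data.frame(GPA = 3.5), type = "response") # probability scale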

== translation

"is equal to"

adding interaction terms- x:z

# include interaction
glimpse(mario_kart)
lm(totalPr ~ cond + duration + cond:duration, data = mario_kart)
produces a model with 4 coefficients (intercept, cond, duration, and the cond:duration interaction)

cor(x,y): Pearson product-moment correlation coefficient

A type of correlation coefficient used with interval and ratio scale data. In addition to providing information on the strength of the relationship between two variables, it indicates the direction (positive or negative) of the relationship.
ncbirths %>% summarize(N = n(), r = cor(weight, mage))
ncbirths %>% summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs")) # handles NA values

four characteristics of a distribution

center, variability (spread), shape, and outliers

Principles of experimental design: Control

Control: compare treatment of interest to a control group

Alternative Hypothesis (Ha)

The alternative hypothesis corresponds to the research question of interest

Y = β0 + β1⋅X + ϵ, where ϵ ∼ N(0, σϵ)

Y = response variable
β0 = y-intercept
β1 = slope coefficient
ϵ = disturbance / random noise
The fitted model for the poverty rate of U.S. counties as a function of high school graduation rate is: poverty^ = 64.594 − 0.591⋅hs_grad. In Hampshire County in western Massachusetts, the high school graduation rate is 92.4%. These two facts imply that the poverty rate in Hampshire County is expected to be near 10%.
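The 10% figure comes from plugging the graduation rate into the fitted equation; checking the arithmetic in R:
64.594 - 0.591 * 92.4 # = 9.9856, i.e. roughly 10%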

response variable

a variable that measures an outcome or result of a study

explanatory variable

a variable that we think explains or causes changes in the response variable

stratified sampling

a variation of random sampling; the population is divided into homogeneous subgroups called strata and then sampled from within each stratum

correlation

a way of quantifying the strength of a linear relationship. Values span between -1 and 1; the sign (positive or negative) corresponds to the direction. Values close to 1 indicate a near perfect positive correlation, values around .75 are considered strong, values around .5 moderate, and values around .2 weak. Plots with no real linear relationship have a correlation close to 0.

Regression to the mean

accommodates the occurrence of chance. It explains why Michael Jordan's sons were middling college basketball players and Jakob Dylan wrote two good songs. It is why there are no American parent-child pairs among Hall of Fame players in any major professional sports league.

Principles of experimental design: Block

account for the potential effect of confounding variables: group subjects into blocks based on these variables, then randomize within each block to treatment groups

R^2adj

adjusted R2 includes a term that penalizes a model for each additional explanatory variable (where p is the number of explanatory variables).
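One standard form of the penalty (not stated on the original card) is R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1). In R, assuming the mario_kart model used elsewhere in this set:
mod <- lm(totalPr ~ wheels + cond, data = mario_kart)
summary(mod)$r.squared # ordinary R-squared
summary(mod)$adj.r.squared # adjusted R-squared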

r^2 (coefficient of determination)

bdims_tidy %>% summarize(var_y = var(wgt), var_e = var(.resid)) %>% mutate(R_squared = 1 - var_e / var_y)
gives the proportion of the variability in the response (y) that can be explained by the explanatory variable (x); the closer R² is to 1, the more of y is explained by x. EX: 46.4% of the variability in poverty rate among U.S. counties can be explained by high school graduation rate.

augment( )

bdims_tidy <- augment(mod)
A function of the broom package. augment() returns the linear model as a data frame containing the original explanatory and response values along with several quantities specific to the regression model, including the fitted values, residuals, leverage scores, and standardized residuals.

Box plot quick note

box plots can hide bi-modal distributions

log( )

can transform a heavily skewed variable so its distribution becomes closer to symmetric
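A quick sketch, reusing the cars/city_mpg histogram from this set (any right-skewed numeric variable works):
ggplot(cars, aes(x = log(city_mpg))) + geom_histogram() # histogram of the log-transformed variable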

SSE :sum of squared errors

captures how much the regression model missed by:
SSE = sum(.resid^2), or equivalently (n - 1) * var(.resid)
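For example, with the bdims_tidy data frame created by augment(mod) elsewhere in this set:
bdims_tidy %>% summarize(SSE = sum(.resid^2), SSE_also = (n() - 1) * var(.resid)) # two equivalent ways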

blocking variables

categorical variables included in the statistical analysis of experimental data as a way of statistically controlling or accounting for variance due to that variable

cluster sampling

clusters of participants within the population of interest are selected at random, followed by data collection from all individuals in each cluster.

Principles of experimental design: Replicate

collect a sufficiently large sample within a study, or replicate the entire study

droplevels( )

removes unused factor levels, e.g. after filtering out a category:
comics <- comics %>% filter(align != "Reformed Criminals") %>% droplevels()

using factor( ) to re-order data names

comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))

Discretizing a variable

converting a numerical variable to a categorical variable based on some criterion. EX: testing whether a score is below average or at/above average:
avg_read <- 52.3
mutate(read_cat = ifelse(read < avg_read, "below average", "at or above average"))

ei = yi− ŷi

ei = .resid
yi = actual value
ŷi = .fitted (the value on the line of best fit, i.e. the expected/predicted value)
residual: the difference between the actual observed value of the response variable and the expected (fitted) value according to the lm model
residuals(mod)
fitted.values(mod)

case_when( )

evals <- evals %>%
  mutate(cls_type = case_when(
    cls_students <= 18 ~ "small",
    cls_students >= 19 & cls_students <= 59 ~ "midsize",
    cls_students >= 60 ~ "large"
  ))
# alternative to nested ifelse() statements

simple random sampling

every member of the population has an equal probability of being selected for the sample EX drawing names from a hat

Facet quick note

facet on the basis of a categorical variable

predict( )

function can be used for "out of sample" observations: predict(lm, newdata). newdata must be a data frame, and its column names must match the names of the explanatory variables used to fit the linear model.
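A minimal sketch, assuming the lm(wgt ~ hgt, data = bdims) model from this set; the height of 175 is just an illustrative input (same units as bdims$hgt):
mod <- lm(wgt ~ hgt, data = bdims)
predict(mod, newdata = data.frame(hgt = 175)) # predicted weight for a new observation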

geom_abline( )

function used in addition to geom_point() to manually draw a line with a given intercept and slope:
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
  geom_point() +
  geom_abline(data = coefs, aes(intercept = `(Intercept)`, slope = hgt), color = "dodgerblue")

Filtering out OUTLIERS

gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)
# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()

ggplot2: histogram creation

ggplot(cars, aes(x = city_mpg)) + geom_histogram() + facet_wrap(~ suv)

conditional bar chart

ggplot(comics, aes(x = align, fill = gender )) + geom_bar(position = "fill") + ylab("proportion")

geom_bar: side by side & stacked

ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "dodge")
"dodge" signifies that the bars will be placed side by side
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "fill") + ylab("proportion")
"fill" signifies that the bars will be stacked and scaled so each bar shows proportions

ggplot2: box plot

ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) + geom_boxplot()

ggplot2: density plot

ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) + geom_density(alpha = .3)
The alpha argument determines the level of transparency. Both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.

Scatterplot interaction

ggplot(mario_kart, aes(y = totalPr, x = duration, color = cond)) + geom_point() + geom_smooth(method = "lm", se = FALSE)

box plot counterparts

great for displaying multiple distributions. The y aesthetic for a box plot indicates the variable of interest. The line inside the box marks the 2nd quartile (the median), and points plotted beyond the whiskers are shown as outliers.

.cooksd

high leverage and a high residual together determine influence; Cook's distance (.cooksd) combines the two measures into a single measure of influence
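A quick sketch of ranking observations by influence, assuming a fitted lm model mod and the broom augment() output described elsewhere in this set:
library(broom)
library(dplyr)
augment(mod) %>% arrange(desc(.cooksd)) %>% head() # most influential points first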

calculating odds

how often it happens divided by how often it doesn't happen: odds = y / (1 - y), where y is the probability of the event
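A tiny worked example (the 0.75 probability is just illustrative):
p <- 0.75
p / (1 - p) # odds of 3, i.e. 3 to 1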

Simpsons paradox

illustrates the effect that omitting an explanatory variable can have on the measure of association between another explanatory variable and the response variable (the inclusion of a third variable can change, or even reverse, the apparent relationship between the other explanatory variable and the response)

geom_jitter()

is used to move data points on a plot up or down by a small random amount, creating an illusion of separation in the data. This is often used for plots where the response variable is categorical, e.g. "dead" or "alive".

RMSE (root mean squared error)

it gives an average of how far the observed values are from the predicted values. The magnitude of a typical residual gives us a sense of generally how close our estimates are. The "residual standard error" reported by the summary() function is the RMSE, recorded in the same units as the response variable:
sqrt(sum(residuals(mod)^2) / df.residual(mod))
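The same value can be read directly off the model summary; a quick sketch assuming a fitted lm model mod:
summary(mod)$sigma # residual standard error (RMSE)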

levels( )

levels(comics$align) #displays the different values this variable holds

smooth regression line

library(dplyr)
library(ggplot2)
glimpse(bdims)
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Categorical variables

limited number of distinct categories. Ordinal: a categorical variable whose levels have a natural ordering.

fitting parallel slopes model

lm(totalPr ~ wheels + cond, data = mario_kart) has one numeric explanatory variable and one categorical explanatory variable

lm( )

lm(wgt ~ hgt, data = bdims)
The slope is interpreted as "for every" one-unit increase in the explanatory variable (hgt), the expected change in the response (wgt).
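A minimal sketch of pulling the fitted coefficients out of the model object:
mod <- lm(wgt ~ hgt, data = bdims)
coef(mod) # intercept and slope
summary(mod) # coefficient table, standard errors, R-squared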

Measures of center

mean & median gap2007 %>% group_by(continent) %>% summarize(mean(lifeExp), median(lifeExp))

leverage

measures the disproportionate influence an outlying observation may have on the slope; the bigger the value, the bigger the potential influence. The .hat variable returned by augment() measures leverage.

parallel slopes model

model with a numerical explanatory variable and a categorical explanatory variable

MLR: multiple linear regression model

model with two or more numeric explanatory variables

Modality

number of prominent humps (modes) in the distribution. Distributions can be unimodal, bimodal, multimodal, or uniform.

3D scatter plot

p <- plot_ly(data = mario_kart, z = ~ totalPr, x = ~ duration, y = ~ startPr, opacity = 0.6) %>%
  add_markers()
# draw the plane
p %>% add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)

Numerical variables

quantitative values. Continuous: infinite number of values within a given range (often measured), EX: height. Discrete: a set of numeric values that can be counted (often counts), EX: number of pets.

Random sampling vs Random assignment

random sampling determines whether results generalize to the population; random assignment determines whether causation can be inferred. With random assignment but no random sampling, causal conclusions apply only to the sample, not the general population; without random assignment, we can only make an inference about an association between the variables studied.

Principles of experimental design: Randomize

randomly assign subjects to treatments

skew

right skew, left skew, or symmetric # skew is named after the direction of the long tail

Measures of Spread

variance: measures how much the data are spread from the center, var()
standard deviation: the square root of the variance, sd()
range: diff(range(x))
IQR: IQR() # good for heavily skewed data
Median & IQR, like mean and standard deviation, measure the central tendency and spread, but are unaffected by outliers and non-normal data
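For example, computing several spread measures at once on the gap2007 data used elsewhere in this set:
gap2007 %>% group_by(continent) %>% summarize(sd = sd(lifeExp), iqr = IQR(lifeExp), rng = diff(range(lifeExp)))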

group_by and summarize logic

you group the dataset by the variable you want to compare across, then summarize the quantity of interest within each group. EX: "Within non-spam emails, is the typical length of emails shorter for those that were sent to multiple people?" email %>% filter(spam == "not-spam") %>% group_by(to_multiple) %>% summarize(median(num_char))

