PH 142 Midterm 1
Describe the role of %in% and c()
%in% is the "in" operator: it checks whether each value matches any value in a given vector, so it is used to select specific values. c() combines values into a vector- you NEED the values to be in a vector made with c() when matching multiple ones
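A minimal sketch of the two working together. The vector `colors` and its values are made up for illustration and are not from the course data.

```r
# Hypothetical example values
colors <- c("red", "blue", "green", "blue")   # c() combines values into a vector

# %in% returns TRUE/FALSE for each element on its left-hand side
keep <- colors %in% c("red", "blue")

keep           # TRUE TRUE FALSE TRUE
colors[keep]   # "red" "blue" "blue"
```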
How to make separate plots for each value in a different variable?
+ facet_wrap(~variable) added to the ggplot; + facet_grid(var1~var2) for separate plots of combos of two variables
How do you add a line of best fit to a scatterplot?
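+ geom_abline(intercept=*number*, slope=*number*) using the intercept and slope from lm(); alternatively, + geom_smooth(method = "lm", se = FALSE) fits and draws the line in one step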
What is a conditional distribution?
-distribution of one variable within a level of a second variable -ex: can look at distribution of having cancer conditional on being a smoker
Explain what head(), dim(), names(), and str() do
4 basic functions to get to know the data set head()- prints first 6 rows (by default) dim()- Returns the number of rows/individuals and number of columns/variables names()- names of variables str()- Returns information about the types of variables found in the data set in terms of whether they are quantitative (int, num) or categorical (Factor) and provides some information about each one.
What values are included in the 5 # summary?
min = min(), Q1 = quantile( , 0.25), median = median(), Q3 = quantile( , 0.75), max = max(), where blanks are filled with the variable from the dataset
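A short sketch of the five-number summary computed with base R. The vector `x` is a hypothetical set of measurements chosen so the quartiles land on whole numbers.

```r
# Hypothetical measurements
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

five_num <- c(
  min    = min(x),
  Q1     = unname(quantile(x, 0.25)),
  median = median(x),
  Q3     = unname(quantile(x, 0.75)),
  max    = max(x)
)

five_num   # 1 3 5 7 9
```

Note that quantile() has several interpolation rules, so on other data its quartiles can differ slightly from those computed by hand.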
What is a categorical variable? Describe the different types of categorical variables.
A categorical variable represents types of data which may be divided into groups. Mathematically, you can calculate the proportion of individuals in each level of the category -nominal variable: no particular order, like hair color -ordinal variable: has particular order, like military ranking
Describe the difference between discrete and continuous variables
A discrete variable is a variable whose value is obtained by counting, i.e. whole numbers like # of kids A continuous variable is a variable whose value is obtained by measuring, can be decimals (salary, gestational period)
What is a regression line?
A straight line fitted to data to minimize the (squared vertical) distance between the data and the fitted line; this line of best fit can be used to describe the relationship between explanatory and response variables
mutate()
Add variables (columns)
Explain the different types of objects
Also called data types -numbers (double, integer) -text (strings, type = character, text between quotation marks) -2D containers (contain one or more items of another type arranged in rows/columns) (type = data frame)
How do you find correlation between two variables in r?
cor_data <- data %>% summarize(correlation = cor(var1, var2))
How do histograms differ from bar charts?
Bar charts show how many (or what percent) of the individuals fall in each category of a categorical variable. Histograms' bars touch because the underlying scale is continuous -data is binned into intervals -height of each bar is # or % in each bin
Describe a box plot. Why is it used?
Box plot summarizes the distribution of a continuous variable. If there are no outliers, a box plot is the visualization of the five number summary. Maximum- top whisker Q3- top of box Median- middle of box Q1- bottom of box Min- bottom whisker If there are outliers, outliers can be above/below whiskers
What is a quantitative variable and how do they differ from categorical variables?
Numeric variables that you can perform mathematical operations on- you can take the mean, median, etc. (unlike categorical variables, which only place individuals into groups) -two types: DISCRETE ones can be counted, while CONTINUOUS ones can be measured precisely with rulers/scales
Describe what a correlation coefficient can/cannot tell us
Correlation coefficient, a # between -1 and 1, measures the association between two variables (must be quantitative) -only useful for linear associations -cannot confirm causation -affected by outliers
What does correlation measure?
Correlation, or r, measures the direction and strength of the linear relationship between two quantitative variables
Describe the correlation coefficient's relationship to the value r
The correlation coefficient is r itself; r squared (r^2) is the fraction of variation in y values explained by the line of best fit; therefore a higher r^2 means the line fits the data more closely
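A quick numerical check of how r and r^2 relate, sketched with made-up paired measurements (`x` and `y` are hypothetical). For a regression with a single explanatory variable, the r^2 reported by lm() equals the square of the Pearson correlation.

```r
# Hypothetical paired measurements
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

r   <- cor(x, y)                 # Pearson correlation coefficient, between -1 and 1
fit <- lm(y ~ x)                 # simple linear regression of y on x
r2  <- summary(fit)$r.squared    # fraction of variation in y explained by the line

all.equal(r^2, r2)   # TRUE
```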
How do you tell if a distribution is skewed right or left? What does this mean about the average?
Data is skewed right when there is a large peak at the left side and a long tail to the right. Mean > median because the long right tail pulls the mean up while the majority of the values are less than the mean
What is a data frame? What does the first column identify? What do the rest of them identify?
Data set = data frame First column identifies individuals Rest of columns identify recorded/measured variables
3 main problem types?
Descriptive: learning about one attribute of a population Causative/Etiologic: do changes in an explanatory variable cause changes in a response variable? Predictive: how can we best predict the value of a response variable for an individual?
Difference between observational/experimental
Experimental- researcher controls (assigns) the values of at least one variable Observational- researcher only observes and records, without assigning any variable
What is the difference between explanatory and response variables?
Explanatory = x-variable response: y-variable
Difference between prospective/retrospective
FILL IN LATER
Explain the difference between a marginal and a conditional distribution
FILL IN LATER
What does the peak of a distribution tell you about the mean/median
If the distribution is unimodal (one peak), and it peaks at the middle, the mean and median are close/~equal If distribution is skewed right, the mean is greater than the median, because the "middle" value is less than the average.
If you wanted to filter with two conditions BOTH being true, what would you do? If you wanted to filter with EITHER one condition or another being true, what would you do?
If you want condition A AND condition B to be satisfied, use commas or & filtered_data %>% filter(color == 'blue', size == 'large') If you want condition A OR B, use | (or %in% for several values of one variable) filtered_data %>% filter(color == 'blue' | size == 'large')
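A small runnable sketch of AND versus OR filtering with dplyr. The data frame `shapes` and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical data set
shapes <- data.frame(
  color = c("blue", "blue", "red", "green"),
  size  = c("large", "small", "large", "small")
)

# AND: both conditions must hold (comma or &)
both <- shapes %>% filter(color == "blue", size == "large")

# OR: at least one condition must hold
either <- shapes %>% filter(color == "blue" | size == "large")

# %in% is a compact OR over several values of ONE variable
in_set <- shapes %>% filter(color %in% c("red", "blue"))

nrow(both)     # 1
nrow(either)   # 3
nrow(in_set)   # 3
```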
Interpret the slope and intercept values of a linear equation
Intercept: value of the outcome when x = 0 (often extrapolated value that doesn't make sense in the real world) Slope: for a one-unit change in x, the outcome/y-value changes by this number
What is the IQR? What values lie in it?
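Interquartile range (IQR = Q3 - Q1) contains the middle 50% of values Q1: 25th percentile- 25% of individuals have measurements below Q1 Q2 (median): 50th percentile- 50% of individuals have measurements below Q2 Q3: 75th percentile- 75% of individuals have measurements below Q3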
What goes inside the aes()? Provide examples
Linking features to variables goes inside of the aes( ) function: 'col', 'fill', 'size', 'lty'
explain the function of arrange()
Order observations (rows) by a certain variable (column) or variables (columns)
explain the function of group_by()
Group observations (rows) by a certain variable (column) so that later function (verb) calls are applied within each group; it does not reorder the rows
What are the two ways to investigate causation?
Randomized controlled trials Observational studies designed to investigate causation, reduced risk of bias
rename() (how is this different from mutate()?)
Rename variables (columns) -mutate adds a column, while rename just modifies the existing column
How do we visualize two continuous/discrete variables?
Scatter plot
How do you select a subset of variables? How do you deselect?
Selecting clean_data <- data %>% select(var1, var2, var3) Deselecting clean_data <- data %>% select(-var1) *negative sign removes it from dataset*
How to tell if a distribution is skewed left or skewed right?
Skewed left: tail at left, big peak on right Skewed right: tail at right, big peak on left
filter()
Subset Observations (Rows) by certain conditions
select()
Subset variables (columns)
if s^2 = variance of a sample, what is standard deviation?
Standard deviation s is the square root of the variance: s = sqrt(s^2). The variance is calculated as: s^2 = 1/(n-1) * the sum of all the (x - mean of x)^2
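A sketch that computes the sample variance by hand and confirms that the standard deviation is its square root. The sample `x` is hypothetical.

```r
# Hypothetical sample
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

s2 <- sum((x - mean(x))^2) / (n - 1)   # sample variance, computed by hand
s  <- sqrt(s2)                         # standard deviation = square root of the variance

all.equal(s2, var(x))   # TRUE: matches the built-in var()
all.equal(s, sd(x))     # TRUE: matches the built-in sd()
```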
How do the correlations for "average" measures compare to that of individual data?
They are typically stronger
Why does it not make sense to rearrange the bars of a histogram?
The x-axis is a numeric scale, so the bins have a natural order; each bar's height is the frequency of values counted in that separate "bin"
How do you add new variables to a dataset?
To add a new variable, i.e. "percent" when given a variable "ratio" in the data set: new_data <- data %>% mutate(percent = ratio * 100)
How do you make predictions given linear regression of logged data?
[copy steps onto cheat sheet] 1. write down line of best fit: log_e(y) = m * log_e(x) + b 2. plug the given value of x into the equation 3. exponentiate both sides with exp() to get the prediction back on the original scale
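The three steps can be sketched in R as follows. The data are hypothetical and constructed to follow y = 2 * x^1.5 exactly, so the prediction can be checked against the known formula; `x_new` is an arbitrary value chosen for illustration.

```r
# Hypothetical data that follow y = 2 * x^1.5 exactly
x <- c(1, 2, 4, 8, 16)
y <- 2 * x^1.5

# Step 1: line of best fit on the log scale, log(y) = m * log(x) + b
fit <- lm(log(y) ~ log(x))
b <- coef(fit)[["(Intercept)"]]
m <- coef(fit)[["log(x)"]]

# Step 2: plug the given x into the fitted equation
x_new <- 10
log_y <- m * log(x_new) + b

# Step 3: exponentiate to undo the natural log
y_pred <- exp(log_y)
y_pred   # about 63.25, i.e. 2 * 10^1.5
```

log() in R is the natural log by default, which is why exp() is the matching inverse here.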
What is simpson's paradox?
an association/comparison that holds true for all of several groups can reverse direction when data are combined to form a single group
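A numerical illustration of the reversal, using illustrative success counts for two treatments (A and B) split into two groups. These counts are not course data; they are chosen only to show the effect.

```r
# Illustrative counts: successes / total for treatments A and B
small <- c(A = 81 / 87,   B = 234 / 270)   # group 1
large <- c(A = 192 / 263, B = 55 / 80)     # group 2

# Combine the two groups into a single group
overall <- c(A = (81 + 192) / (87 + 263),
             B = (234 + 55) / (270 + 80))

small[["A"]] > small[["B"]]       # TRUE: A does better in group 1
large[["A"]] > large[["B"]]       # TRUE: A does better in group 2
overall[["A"]] > overall[["B"]]   # FALSE: B does better once groups are combined
```

The reversal happens because the group sizes are very unequal across treatments, so the combined rates weight the groups differently.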
What is the difference between bidirectional and unidirectional relationships?
bidirectional: x pred. y or y pred. x (there is a general association between x and y, correlation) unidirectional: x causes y
How do you rename a variable in a dataset?
clean_data <- data %>% rename(new_label = old_label) creates new dataset with relabeled names
How do you rename multiple variables?
clean_data <- data %>% rename(new_name = old_name, new2 = old2)
summarize()
creates a data.frame of summary terms- it's used anytime we want to take a variable and summarize something about it into one number, like its mean or median ex: data %>% summarize(mean = mean(var), s_d = sd(var)) gives a data frame of mean and sd
How to filter for EITHER blue individuals, or those with a ratio of over 0.5?
data %>% filter(color == 'blue' | ratio > 0.5)
How do you filter for a blue individual with a ratio greater than 0.5?
data %>% filter(color == 'blue', ratio > 0.5)
Given a dataset where we only want individuals with red and blue colors, how do we select for them?
data %>% filter(color %in% c('red', 'blue'))
How do you change data to its log?
data <- data %>% mutate(var = log(var, base)) default base e
How do you read a csv file?
data <- read_csv(".../Data/R02_Data.csv") -use assignment operator to assign the data to a name (data)
What package is the pipe operator used with?
dplyr (the %>% pipe comes from the magrittr package and is made available when dplyr is loaded)
Functions you need to know
dplyr: arrange(), group_by(), select(), mutate(), rename(), filter(), summarize(); ggplot2: ggplot() and the associated geoms; stats (built-in): lm()
What do form, direction, strength, and outliers tell us?
form: (of points)- linear or curved direction: + or - strength: how close do points fit to a line outliers: anyone deviate from the pattern?
How to change color of scatterplot?
geom_point(col = 'blue')
If given a dataset, what does group_by() do? Why is it useful?
group_by() allows you to group the data based on a CATEGORICAL variable (it does not sort the rows). It is useful because it can allow you to see numerical data about these individuals when grouped by their variable ex: data %>% group_by(color) %>% summarize(mean = mean(ratio)) *where this allows you to see the mean for every color after grouping all the individuals by color*
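A runnable sketch of grouping followed by a summary. The data frame `toy_data` is hypothetical; the `color` and `ratio` column names follow the example above.

```r
library(dplyr)

# Hypothetical data set
toy_data <- data.frame(
  color = c("blue", "blue", "red", "red"),
  ratio = c(0.2, 0.4, 0.6, 1.0)
)

by_color <- toy_data %>%
  group_by(color) %>%                      # one group per color
  summarize(mean = mean(ratio), n = n())   # one output row per group

by_color   # blue: mean 0.3, n 2; red: mean 0.8, n 2
```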
Histograms are used to illustrate the distribution of _____ variables, while bar charts are used to illustrate distribution of _________
histograms- numeric (continuous or discrete) bar plots- categorical variable (distribution is count or % of individuals in each category)
How do you select which rows to keep?
i.e. to select ratios higher than 0.5: filtered_data <- data %>% filter(ratio > 0.5)
How can you tell if you should take logs of data?
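if x and/or y are widely spread out (values span several orders of magnitude or are strongly right skewed), so that the scatterplot looks curved or bunched up; taking logs can make the relationship closer to linear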
What does the labs() function do? How is it implemented?
Adds a title and axis labels to a ggplot; it is added as a layer with +, ex: + labs(title = "chicken nugs per hour", y = "nugs", x = "time (hour)")
How do you create a scatterplot given data called 'nugs' of chicken nuggets per hour?
library(ggplot2) ggplot(data = nugs, aes(x = hour, y = chicken_nugs)) + geom_point() ggplot() makes the canvas and geom_point() tells ggplot how to display the data; aes() is used to assign variables
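A self-contained sketch of the full plot. The `nugs` values here are a made-up stand-in for the real data set, and the geom_smooth() and labs() layers are optional additions showing how a fitted line and labels could be stacked on the same canvas.

```r
library(ggplot2)

# Hypothetical stand-in for the 'nugs' data
nugs <- data.frame(
  hour         = c(1, 2, 3, 4, 5),
  chicken_nugs = c(12, 19, 33, 41, 48)
)

p <- ggplot(data = nugs, aes(x = hour, y = chicken_nugs)) +   # canvas + variable mapping
  geom_point(col = "blue") +                                  # scatterplot layer
  geom_smooth(method = "lm", se = FALSE) +                    # least-squares line, no confidence band
  labs(title = "chicken nugs per hour", x = "time (hour)", y = "nugs")

p   # printing the object draws the plot
```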
What does aes() do?
links various graph attributes to variables, i.e. aes(col = race) aes(lty = nugs)
How do you fit a linear model in R?
lm(formula = y~x, data=dataset) y = response variable x= explanatory variable
What is faceting?
making separate plots based on a variable facet_wrap(~ var1): make separate plots for different levels of ONE variable facet_grid(var1 ~ var2): make separate plots for combinations of two variables
What is the full range of a distribution
max - min
When are mean and median the same? Different?
median: value for which 1/2 measurements are larger and 1/2 are smaller mean is about the same as median when the data has one peak and is roughly symmetric -if left skewed, mean usually smaller than median -if right skewed, mean usually larger than median outliers pull mean towards their values but do not affect median; skewed distributions pull mean out into tail
When are measures of central tendency not helpful?
multi-modal distributions- mean/median lands between the peaks
identify the class, name, type, and value of the following variable: var <- 2
name: var value: 2 class: numeric type: double
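These answers can be checked directly in the console. The extra objects `int_var` and `txt` are hypothetical additions included only for contrast.

```r
var <- 2
class(var)    # "numeric"
typeof(var)   # "double"

int_var <- 2L     # the L suffix stores the value as an integer
class(int_var)    # "integer"

txt <- "2"        # quotation marks make it text
class(txt)        # "character"
```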
Given a scatterplot of nuggets/hour from a dataset including reg. and spicy nuggets, how to create two separate lines on one line graph of data?
new_nugs <- nugs %>% filter(chicken_nugs %in% c('reg', 'spicy')) then, on a ggplot() of new_nugs, add + geom_line(aes(col = chicken_nugs)) *here chicken_nugs is the column holding the nugget type*
How do you reorder variables in a dataset?
ordered_data <- data %>% mutate(ordered_var = fct_relevel(var, 'first', 'second', 'third')) default = alphabetical (fct_relevel() is from the forcats package)
What does arrange() do?
orders data in ascending order; for descending add - sign data %>% arrange(var1, var2) arranges by 2 variables
What is pearson's correlation and how is it used?
r - quantifies the strength of linear relationship between two variables (do we need to know formulas?)
given variables x and y in the dataset "mydata", how do you show the results of a linear regression?
reg_data <- lm(y ~ x, data=mydata) summary(reg_data)
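A runnable sketch of the same workflow, with a small invented `mydata` so the fitted intercept and slope can be verified by hand. The predict() call is an optional extra showing one way to get a prediction from the fitted model.

```r
# Hypothetical data set
mydata <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(3.1, 4.9, 7.2, 8.8, 11.0)
)

reg_data <- lm(y ~ x, data = mydata)   # y = response, x = explanatory
summary(reg_data)                      # full regression output

coef(reg_data)   # intercept 1.09, slope 1.97

# Predicted y for a new individual with x = 6: 1.09 + 1.97 * 6 = 12.91
pred <- predict(reg_data, newdata = data.frame(x = 6))
```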
How do you remove an individual from a dataset?
removed_blue <- data %>% filter(color != 'blue') removed_blue_red <- data %>% filter(color != 'blue' & color != 'red') *use a single &, which compares row by row; && does not*
What graph method is used to analyze the relationship between two continuous variables?
scatter plot
How to calculate st. dev and variance?
sd() var() use summarize() to output variables
What is the difference between select and filter
select for variables filter for rows
Describe the 3 ways to describe a histogram's distribution
shape -symmetric vs. skewed (right or left) -unimodal or bimodal? center -where is the middle of the data (mean/median)? spread -how spread out are the data? -range of data? outliers?
How do you calculate statistics of variables?
summarize()
What does the stat="identity" command do?
the y variable supplied in the ggplot does not need modification geom_bar(stat="identity") means the y-variable supplied in the ggplot is exactly what we want to plot When you do not specify a y variable containing a count or percent for ggplot to plot on the y-axis, you do not need this command. When `geom_bar()` is doing the counting for us, then we do not use the `stat="identity"` argument.
What is the marginal distribution, with respect to a two way table?
those in margins of table are usually row/column totals marginal distribution is for a single categorical variable -ex: given a dataset of 25 smokers and 75 nonsmokers, marginal distribution of smoking is 25%, nonsmoking 75%
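A sketch of marginal and conditional distributions computed from a two-way table in base R. The row totals (25 smokers, 75 nonsmokers) follow the example above; the split of each row into cancer yes/no is hypothetical.

```r
# Hypothetical two-way table: rows = smoking status, columns = cancer status
counts <- matrix(c(10, 15,
                   15, 60),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(smoking = c("smoker", "nonsmoker"),
                                 cancer  = c("yes", "no")))

# Marginal distribution of smoking: row totals divided by the grand total
marginal_smoking <- rowSums(counts) / sum(counts)
marginal_smoking   # smoker 0.25, nonsmoker 0.75

# Conditional distribution of cancer within each smoking level: each row sums to 1
cond_cancer <- prop.table(counts, margin = 1)
cond_cancer["smoker", "yes"]   # 10 / 25 = 0.4
```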
When do you use quotes in r?
variables do NOT need quotes strings, like axis labels or specific individual names need quotes
What are boxplots?
visual depiction of 5# summary -center line is median -top of box is Q3 -bottom of box is Q1 -top whisker reaches the largest value at or below Q3 + 1.5(IQR) -bottom whisker reaches the smallest value at or above Q1 - 1.5(IQR) -data points above/below whiskers are outliers
What are on x&y axes of histograms?
x-axis: the different values (bins) of the variable y-axis: the # individuals in each bin (do not need to set y to anything)