PH 142 Midterm 1


Describe the role of %in% and c()

%in% is the "in" operator: it checks whether each value appears in a given set, so it is used to select specific values
c() combines values into a vector- NEED the values to be wrapped in c() when matching against more than one
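
A minimal sketch of the two working together, using made-up color values (the names color_values and wanted are only for illustration):

```r
# Hypothetical data: the colors of five individuals
color_values <- c("red", "blue", "green", "blue", "yellow")

# c() builds the vector of values we want to match against
wanted <- c("red", "blue")

# %in% returns TRUE or FALSE for each element of color_values
is_wanted <- color_values %in% wanted
print(is_wanted)  # TRUE TRUE FALSE TRUE FALSE

# Use the logical vector to keep only the matching values
kept <- color_values[is_wanted]
print(kept)       # "red" "blue" "blue"
```

Inside filter(), the same logical test is written as filter(color %in% c("red", "blue")).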

How do you make separate plots for each value of a different variable?

Add + facet_wrap(~variable) to the ggplot; use + facet_grid(var1 ~ var2) for separate plots of combinations of two variables

How do you add a line of best fit to a scatterplot?

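+ geom_abline(intercept = *number*, slope = *number*), using the intercept and slope from the fitted model; alternatively, + geom_smooth(method = "lm", se = FALSE) fits and draws the line in one step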

What is a conditional distribution?

-the distribution of one variable within a level of a second variable
-ex: can look at the distribution of having cancer conditional on being a smoker

Explain what head(), dim(), names(), and str() do

4 basic functions to get to know the data set
head()- prints the first 6 rows
dim()- returns the number of rows/individuals and the number of columns/variables
names()- names of the variables
str()- returns information about the types of variables found in the data set, in terms of whether they are quantitative (int, num) or categorical (Factor), and provides some information about each one

What values are included in the 5 # summary?

min = min( ), Q1 = quantile( , 0.25), median = median( ), Q3 = quantile( , 0.75), max = max( ), where the blanks are filled with the variable
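
A small sketch of the five calls on a made-up numeric vector x (the values are invented for illustration; quantile() is used with its default method):

```r
# Hypothetical measurements
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15)

# Collect the five-number summary in order from smallest to largest;
# unname() drops the "25%" / "75%" labels that quantile() attaches
five_num <- c(
  min    = min(x),
  Q1     = unname(quantile(x, 0.25)),
  median = median(x),
  Q3     = unname(quantile(x, 0.75)),
  max    = max(x)
)
print(five_num)  # 2 4 7 10 15
```

Base R's summary(x) reports the same five values plus the mean.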

What is a categorical variable? Describe the different types of categorical variables.

A categorical variable represents types of data which may be divided into groups. Mathematically, you can calculate the proportion of individuals in each level of the category
-nominal variable: no particular order, like hair color
-ordinal variable: has a particular order, like military ranking

Describe the difference between discrete and continuous variables

A discrete variable is a variable whose value is obtained by counting, e.g. whole numbers like # of kids
A continuous variable is a variable whose value is obtained by measuring and can be a decimal (salary, gestational period)

What is a regression line?

A straight line fitted to data to minimize the distance between the data and the fitted line (the line of best fit); it can be used to describe the relationship between explanatory and response variables

mutate()

Add variables (columns)

Explain the different types of objects

Also called data types
-numbers (double, integer)
-text (strings, type = character, text between quotation marks)
-2D containers (contain one or more items of other types in rows/columns) (type = data frame)

How do you find correlation between two variables in r?

cor_data <- data %>% summarize(correlation = cor(var1, var2))

How do histograms differ from bar charts?

Bar charts show how many (or what percent) of the individuals fall in each category of a categorical variable
Histograms' bars touch because the underlying scale is continuous
-data is binned into intervals
-height of each bar is the # or % in each bin

Describe a box plot. Why is it used?

Box plot summarizes the distribution of a continuous variable. If there are no outliers, a box plot is the visualization of the five number summary.
Maximum- top whisker
Q3- top of box
Median- middle of box
Q1- bottom of box
Min- bottom whisker
If there are outliers, they are plotted as points above/below the whiskers

What is a quantitative variable and how do they differ from categorical variables?

Numeric variables that you can perform mathematical operations on- you can take the mean, median, etc. (categorical variables only sort individuals into groups)
-two types: DISCRETE ones can be counted, while CONTINUOUS ones can be measured precisely with rulers/scales

Describe what a correlation coefficient can/cannot tell us

Correlation coefficient, a # between -1 and 1, measures the association between two variables (must be quantitative)
-only useful for linear associations
-cannot confirm causation
-affected by outliers

What does correlation measure?

Correlation, or r, measures the direction and strength of the linear relationship between two quantitative variables

Describe the correlation coefficient's relationship to the value r

The correlation coefficient is r itself. r squared (r^2) is the fraction of variation in y values explained by the line of best fit; therefore a higher r^2 means the line fits the data more closely
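
A quick check of how r and r^2 relate, as a sketch with made-up x and y values. For a simple linear regression with one explanatory variable, squaring cor(x, y) gives the same number as the R-squared reported by summary(lm()):

```r
# Hypothetical paired measurements
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# r: direction and strength of the linear relationship
r <- cor(x, y)

# R-squared from the fitted regression line
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared

print(r)
print(r^2)
print(r_squared)  # matches r^2
```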

How do you tell if a distribution is skewed right or left? What does this mean about the average?

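Data is skewed right when there is a large peak at the left side and a long tail to the right (skewed left is the mirror image)
For right skew, mean > median, because the tail pulls the mean up while the majority of the values are less than the mean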

What is a data frame? What does the first column identify? What do the rest of them identify?

Data set = data frame
First column identifies individuals
Rest of the columns identify recorded/measured variables

3 main problem types?

Descriptive: learning about one attribute of a population
Causative/Etiologic: do changes in an explanatory variable cause changes in a response variable?
Predictive: how can we best predict the value of a response variable for an individual?

Difference between observational/experimental

Experimental- researcher controls (assigns) the values of at least one variable
Observational- researcher only observes and records; no variable is assigned

What is the difference between explanatory and response variables?

Explanatory = x-variable
Response = y-variable

Difference between prospective/retrospective

FILL IN LATER

Explain the difference between a marginal and a conditional distribution

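Marginal: distribution of a single variable on its own, ignoring the other (the row/column totals of a two way table)
Conditional: distribution of one variable within a level of a second variable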

What does the peak of a distribution tell you about the mean/median

If the distribution is unimodal (one peak) and it peaks at the middle, the mean and median are close/~equal
If the distribution is skewed right, the mean is greater than the median, because the long right tail pulls the mean up while the "middle" value stays near the peak

If you wanted to filter with two conditions BOTH being true, what would you do? If you wanted to filter with EITHER one condition or another being true, what would you do?

If you want condition A AND condition B to be satisfied, use commas or &
filtered_data <- data %>% filter(color == "blue", size == "large")
If you want condition A OR B, use | (or %in% when matching one variable against several values)
filtered_data <- data %>% filter(color == "blue" | size == "large")
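
A runnable sketch of both cases, assuming dplyr is installed and using a small made-up data frame called items (hypothetical name and values):

```r
library(dplyr)

# Hypothetical data set of four individuals
items <- data.frame(
  color = c("blue", "blue", "red", "green"),
  size  = c("large", "small", "large", "small")
)

# AND: both conditions must hold (comma or &)
both <- items %>% filter(color == "blue", size == "large")

# OR: at least one condition must hold
either <- items %>% filter(color == "blue" | size == "large")

# %in%: one variable matched against several allowed values
in_set <- items %>% filter(color %in% c("blue", "red"))

print(nrow(both))    # 1
print(nrow(either))  # 3
print(nrow(in_set))  # 3
```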

Interpret the slope and intercept values of a linear equation

Intercept: value of the outcome when x = 0 (often extrapolated value that doesn't make sense in the real world) Slope: for a one-unit change in x, the outcome/y-value changes by this number

What is the IQR? What values lie in it?

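Interquartile range (IQR = Q3 - Q1) contains the middle 50% of values
Q1: 25th percentile- 25% of individuals have measurements below Q1
Q2: 50th percentile (the median)- 50% of individuals have measurements below Q2
Q3: 75th percentile- 75% of individuals have measurements below Q3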

What goes inside the aes()? Provide examples

Linking features to variables goes inside of the aes( ) function
ex: 'col', 'fill', 'size', 'lty'

explain the function of arrange()

Order observations (rows) by a certain variable (column) or variables (columns)

explain the function of group_by()

Group observations (rows) by the values of a certain variable (column), to be used by later function (verb) calls

What are the two ways to investigate causation?

Randomized controlled trials Observational studies designed to investigate causation, reduced risk of bias

rename() (how is this different from mutate()?)

Rename variables (columns) -mutate adds a column, while rename just modifies the existing column

How do we visualize two continuous/discrete variables?

Scatter plot

How do you select a subset of variables? How do you deselect?

Selecting: clean_data <- data %>% select(var1, var2, var3)
Deselecting: clean_data <- data %>% select(-var1) *negative sign removes it from the dataset*

How to tell if a distribution is skewed left or skewed right?

Skewed left: tail at left, big peak on right
Skewed right: tail at right, big peak on left

filter()

Subset Observations (Rows) by certain conditions

select()

Subset variables (columns)

If s^2 = variance of a sample, what is the standard deviation?

The standard deviation s is the square root of the variance: s = sqrt(s^2). This is how to calculate the variance: s^2 = 1/(n-1) * the sum of all the (x - mean of x)^2
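
A minimal sketch, with made-up numbers, that computes the sample variance by hand and compares it with R's built-in var() and sd():

```r
# Hypothetical sample
x <- c(4, 8, 6, 5, 7)
n <- length(x)

# Sample variance: sum of squared deviations divided by (n - 1)
s2 <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
s <- sqrt(s2)

print(s2)      # 2.5
print(var(x))  # 2.5, same as s2
print(s)
print(sd(x))   # same as s
```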

How do the correlations for "average" measures compare to that of individual data?

They are typically stronger

Why does it not make sense to rearrange the bars of a histogram?

The bins go in numeric order along a continuous scale, and the bar height is the frequency of values counted in each separate "bin"

How do you add new variables to a dataset?

To add a new variable, i.e. "percent" when given a variable "ratio" in the data set: new_data <- data %>% mutate(percent = ratio * 100)

How do you make predictions given linear regression of logged data?

[copy steps onto cheat sheet]
1. write down the line of best fit: log(y) = m * log(x) + b (natural log)
2. plug the given value into the equation
3. exponentiate both sides to undo the log: y = exp(m * log(x) + b)
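
The steps above can be sketched as follows. The data are made up so that y is exactly 3 * x^2, and new_x is a hypothetical value to predict at; real data would not fit this cleanly:

```r
# Hypothetical data following y = 3 * x^2
x <- c(1, 2, 4, 8, 16)
y <- 3 * x^2

# Step 1: fit the line on the log scale, log(y) = m * log(x) + b
fit <- lm(log(y) ~ log(x))
b <- coef(fit)[1]  # intercept
m <- coef(fit)[2]  # slope

# Step 2: plug the given x value into the equation
new_x <- 10
log_pred <- m * log(new_x) + b

# Step 3: exponentiate to get back to the original scale
pred <- exp(log_pred)
print(unname(pred))  # about 300, since 3 * 10^2 = 300
```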

What is Simpson's paradox?

An association/comparison that holds true for all of several groups can reverse direction when the data are combined to form a single group
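
A small numeric illustration with invented counts (not real study data): treatment t1 has the higher success rate inside each group, yet the lower rate once the groups are pooled.

```r
# Made-up counts for two groups and two treatments
trial <- data.frame(
  group     = c("A", "A", "B", "B"),
  treatment = c("t1", "t2", "t1", "t2"),
  successes = c(8, 70, 40, 3),
  total     = c(10, 100, 100, 10)
)

# Success rate within each group: t1 beats t2 in both A and B
trial$rate <- trial$successes / trial$total
print(trial)

# Pool the groups: add up successes and totals per treatment
combined <- aggregate(cbind(successes, total) ~ treatment, data = trial, FUN = sum)
combined$rate <- combined$successes / combined$total
print(combined)  # now t2 has the higher rate
```

The reversal happens because t1 was mostly used in group B, where success is rarer for everyone.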

What is the difference between bidirectional and unidirectional relationships?

bidirectional: x pred. y or y pred. x (there is a general association between x and y, correlation) unidirectional: x causes y

How do you rename a variable in a dataset?

clean_data <- data %>% rename(new_label = old_label) creates new dataset with relabeled names

How do you rename multiple variables?

clean_data <- data %>% rename(new_name = old_name, new2 = old2)

summarize()

creates a data.frame of summary terms- it's used anytime we want to take a variable and summarize something about it into one number, like its mean or median
ex: data %>% summarize(mean = mean(var), s_d = sd(var)) gives a data frame of the mean and sd of var

How to filter for EITHER blue individuals, or those with a ratio of over 0.5?

data %>% filter(color == 'blue' | ratio > 0.5)

How do you filter for a blue individual with a ratio greater than 0.5?

data %>% filter(color == 'blue', ratio > 0.5)

Given a dataset where we only want individuals with red and blue colors, how do we select for them?

data %>% filter(colors %in% c('red', 'blue'))

How do you change data to its log?

data <- data %>% mutate(var = log(var, base)) default base e

How do you read a csv file?

data <- read_csv(".../Data/R02_Data.csv") -use assignment operator to assign the data to a name (data)

What package is the pipe operator used with?

dplyr (the %>% pipe comes from the magrittr package and is made available when dplyr is loaded)

Functions you need to know

dplyr
- arrange()
- group_by()
- select()
- mutate()
- rename()
- filter()
- summarize()
ggplot2
- ggplot() and the associated geoms
stats (built-in)
- lm()

What do form, direction, strength, and outliers tell us?

form: (of points)- linear or curved
direction: + or -
strength: how closely do the points fit a line
outliers: does anyone deviate from the pattern?

How to change color of scatterplot?

geom_point(col = 'blue')

If given a dataset, what does group_by() do? Why is it useful?

group_by() allows you to group the data based on a CATEGORICAL variable. It is useful because it lets you compute numerical summaries for the individuals within each group
ex: data %>% group_by(color) %>% summarize(mean = mean(ratio)) *where this allows you to see the mean for every color after grouping all the individuals by color*
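
A runnable version of that pattern, assuming dplyr is installed; the data frame beads and its values are made up for illustration:

```r
library(dplyr)

# Hypothetical data: a color and a ratio for five individuals
beads <- data.frame(
  color = c("blue", "blue", "red", "red", "red"),
  ratio = c(0.2, 0.6, 0.4, 0.8, 0.9)
)

# One row per color, with the mean ratio and the group size
by_color <- beads %>%
  group_by(color) %>%
  summarize(mean_ratio = mean(ratio), n = n())

print(by_color)
# blue: mean_ratio 0.4, n 2
# red:  mean_ratio 0.7, n 3
```

Without group_by(), the same summarize() call would return a single row for the whole data set.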

Histograms are used to illustrate the distribution of _____ variables, while bar charts are used to illustrate distribution of _________

histograms- numeric (continuous or discrete) bar plots- categorical variable (distribution is count or % of individuals in each category)

How do you select which rows to keep?

i.e. to select ratios higher than 0.5: filtered_data <- data %>% filter(ratio > 0.5)

How can you tell if you should take logs of data?

if the x and y values are spread over a very wide range, so that most points bunch together and a few are far away

What does the labs() function do? How is it implemented?

adds a title and axis labels to a ggplot; add it with +
ex: + labs(title = "chicken nugs per hour", y = "nugs", x = "time (hour)")

How do you create a scatterplot given data called 'nugs' of chicken nuggets per hour?

library(ggplot2)
ggplot(data = nugs, aes(x = hour, y = chicken_nugs)) + geom_point()
ggplot() makes the canvas and geom_point() tells ggplot how to display the data; aes() is used to assign variables

What does aes() do?

links various graph attributes to variables, i.e. aes(col = race) aes(lty = nugs)

How do you fit a linear model in R?

lm(formula = y~x, data=dataset) y = response variable x= explanatory variable

What is faceting?

making separate plots based on a variable
facet_wrap(~ var1): make separate plots for different levels of ONE variable
facet_grid(var1 ~ var2): make separate plots for combinations of two variables

What is the full range of a distribution

max - min

When are mean and median the same? Different?

median: value for which 1/2 of the measurements are larger and 1/2 are smaller
mean is about the same as the median when the data has one peak and is roughly symmetric
-if left skewed, mean is usually smaller than the median
-outliers pull the mean towards their values but barely affect the median; skewed distributions pull the mean out into the tail

When are measures of central tendency not helpful?

multi-modal distributions- mean/median lands between the peaks

identify the class, name, type, and value of the following variable: var <- 2

name: var value: 2 class: numeric type: double
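
These answers can be checked directly in R with class() and typeof(). The second variable, int_var, is a hypothetical addition showing that the L suffix creates an integer instead:

```r
# The variable from the question
var <- 2
print(class(var))   # "numeric"
print(typeof(var))  # "double"

# Hypothetical comparison: the L suffix stores the value as an integer
int_var <- 2L
print(class(int_var))   # "integer"
print(typeof(int_var))  # "integer"
```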

Given a scatterplot of nuggets/hour from a dataset including reg. and spicy nuggets, how do you create two separate lines on one line graph of the data?

new_nugs <- nugs %>% filter(chicken_nugs %in% c('reg', 'spicy'))
then build the ggplot from new_nugs and add + geom_line(aes(col = chicken_nugs))

How do you reorder the levels of a variable in a dataset?

ordered_data <- data %>% mutate(ordered_var = fct_relevel(var, 'first', 'second', 'third'))
(fct_relevel() is from the forcats package) default = alphabetical

What does arrange() do?

orders rows in ascending order; for descending, wrap the variable in desc() or add a - sign
data %>% arrange(var1, var2) arranges by 2 variables

What is Pearson's correlation and how is it used?

r - quantifies the strength of the linear relationship between two variables (do we need to know formulas?)

given variables x and y in the dataset "mydata", how do you show the results of a linear regression?

reg_data <- lm(y ~ x, data=mydata) summary(reg_data)
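
A self-contained sketch of the same two calls on a made-up mydata, followed by reading off the intercept and slope with coef():

```r
# Hypothetical data set with an explanatory x and a response y
mydata <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(3.1, 4.9, 7.2, 8.8, 11.0)
)

# Fit the regression and print the full results table
reg_data <- lm(y ~ x, data = mydata)
print(summary(reg_data))

# Extract just the fitted coefficients
intercept <- unname(coef(reg_data)[1])  # predicted y when x = 0
slope     <- unname(coef(reg_data)[2])  # change in y per one-unit change in x
print(intercept)  # 1.09
print(slope)      # 1.97
```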

How do you remove an individual from a dataset?

removed_blue <- data %>% filter(color != 'blue')
removed_blue_red <- data %>% filter(color != 'blue' & color != 'red')
(use a single &, which compares row by row; && is not vectorized)

What graph method is used to analyze the relationship between two continuous variables?

scatter plot

How to calculate st. dev and variance?

sd() var() use summarize() to output variables

What is the difference between select and filter

select for variables filter for rows

Describe the 3 ways to describe a histogram's distribution

shape -symmetric vs. skewed (right or left)
center -unimodal or bimodal?
spread -how spread out are the data? -range of data? outliers?

How do you calculate statistics of variables?

summarize()

What does the stat="identity" command do?

geom_bar(stat = "identity") means the y variable supplied in the ggplot is exactly what we want to plot and needs no modification. Use it when you supply a y variable that already contains a count or percent. When you do not specify a y variable and `geom_bar()` is doing the counting for us, we do not use the `stat = "identity"` argument.

What is the marginal distribution, with respect to a two way table?

those in the margins of the table, usually the row/column totals
marginal distribution is for a single categorical variable
-ex: given a dataset of 25 smokers and 75 nonsmokers, marginal distribution of smoking is 25%, nonsmoking 75%

When do you use quotes in R?

variables do NOT need quotes
strings, like axis labels or specific individual names, need quotes

What are boxplots?

visual depiction of the 5 # summary
-center line is median
-top of box is Q3
-bottom of box is Q1
-top whisker reaches the largest value no greater than Q3 + 1.5(IQR)
-bottom whisker reaches the smallest value no less than Q1 - 1.5(IQR)
-data points above/below the whiskers are outliers
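
A sketch of the 1.5(IQR) rule on a made-up vector with one extreme value. It uses quantile() with its default method, so the exact quartiles may differ slightly from those a particular plotting function computes:

```r
# Hypothetical measurements with one very large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences: the furthest the whiskers are allowed to reach
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points beyond the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers stop at the most extreme points inside the fences
top_whisker    <- max(x[x <= upper_fence])
bottom_whisker <- min(x[x >= lower_fence])

print(outliers)        # 30
print(top_whisker)     # 9
print(bottom_whisker)  # 1
```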

What are on the x & y axes of histograms?

x-axis: the different values of the variable, grouped into bins
y-axis: the # of individuals in each bin (do not need to set y to anything)

