Introduction, working with data (dplyr), visualizing data (ggplot2), describing the data with numbers
pipe operator
%>% pipes a dataset into the next function: new_dataset <- old_dataset %>% function(). You can use the pipe to stack many dplyr functions in a row: new_data <- old_data %>% arrange(ph) %>% mutate(age_in_years = age/12) %>% select(age_in_years, race, gender)
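A minimal sketch of the chained template above, using a made-up data frame (all column names and values here are hypothetical):

```r
library(dplyr)

# Hypothetical patient data; every column name here is made up
old_data <- tibble(
  age    = c(24, 60, 120),          # age in months
  race   = c("A", "B", "A"),
  gender = c("F", "M", "F"),
  ph     = c(7.1, 6.8, 7.4)
)

# Each %>% passes the result on to the next dplyr verb
new_data <- old_data %>%
  arrange(ph) %>%                       # sort rows by ph, ascending
  mutate(age_in_years = age / 12) %>%   # add a derived column
  select(age_in_years, race, gender)    # keep only these columns
```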
Examining a histogram
- Shape: Is the distribution symmetric or skewed to the left or right?
- Center: Does the histogram have one peak (unimodal), two (bimodal), or more?
- Spread: How spread out are the values? What is the range of the data?
- Outliers: Do any of the measurements fall outside the range of most of the data points?
Type of variable
1. Categorical: a variable that has grouping levels. Mathematically, you can calculate the proportion (%) of individuals in each level of the category.
- nominal: no inherent rank/ordering: disease status, race, hospital ID
- ordinal: can be ordered/ranked: BMI category, age group, socioeconomic status
2. Quantitative: a numeric variable that you can perform mathematical operations on. Mathematically, you can take the median or mean of these variables.
- discrete: can be counted (whole numbers only, no half values): number of brain lesions, number of previous births
- continuous: can be measured precisely, e.g., with a ruler or scale (can be decimal): annual income, gestational age
3 problem types
1. Descriptive: learning about some particular attribute. ex: How many people have been exposed to COVID? ex: What is the distribution of cases among age groups?
2. Causative: do changes in an explanatory variable cause changes in a response variable? ex: What are risk factors for COVID? ex: Which vaccine is most effective?
3. Predictive: how can we best predict the value of the response variable for an individual? ex: What is a patient's expected RBC count? ex: What is the probability of _____?
steps to create a bar graph
1. Read in the data. code: data_set <- read_csv("ID_data")
2. Build the ggplot canvas. 1st code: library(ggplot2) 2nd code: p1 <- ggplot(data_set, aes(x = x_variable, y = y_variable)) + geom_bar(stat = "identity")
*** use stat = "identity" when the y-variable already holds a count or percent and you don't need ggplot to calculate it for you
*** a y variable isn't always needed (without one, geom_bar() counts rows for you)
*** bars are arranged alphabetically on the x-axis by default
3. Other add-ons: base_size controls the font size; theme_minimal() removes the grey background and keeps grid lines (add using + to the code)
4. Reordering bars (make a new variable, disease_ordered): data_set <- data_set %>% mutate(disease_ordered = fct_reorder(disease, percent_cases, .desc = TRUE))
*** orders diseases in descending order according to percent cases (fct_reorder() comes from the forcats package)
5. Fill bars according to another variable inside geom_bar(): geom_bar(stat = "identity", aes(fill = other_variable)) + theme(legend.position = "top")
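The steps above can be sketched end to end with a small made-up dataset (the disease names, percentages, and region column are hypothetical, standing in for data read with read_csv()):

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Hypothetical disease data, standing in for data_set <- read_csv("ID_data")
data_set <- tibble(
  disease       = c("Flu", "Measles", "Mumps"),
  percent_cases = c(55, 30, 15),
  region        = c("North", "South", "North")
)

# Step 4: reorder bars by percent_cases (descending) instead of alphabetically
data_set <- data_set %>%
  mutate(disease_ordered = fct_reorder(disease, percent_cases, .desc = TRUE))

# Steps 2, 3, 5: canvas, theme add-ons, and fill by another variable
p1 <- ggplot(data_set, aes(x = disease_ordered, y = percent_cases)) +
  geom_bar(stat = "identity", aes(fill = region)) +
  theme_minimal(base_size = 15) +
  theme(legend.position = "top")
```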
R functions to get to know dataset
1. head(your_data): shows the first six rows of the supplied dataset
2. dim(your_data): provides the number of rows by the number of columns
3. names(your_data): lists the variable names of the columns in the dataset
4. str(your_data): summarizes the above information and more
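For example, on R's built-in mtcars data frame:

```r
head(mtcars)   # first six rows of the dataset
dim(mtcars)    # number of rows and columns: 32 11
names(mtcars)  # the column (variable) names
str(mtcars)    # types and a preview of every column
```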
steps of creating a histogram
1. read in the data. code: data_set <- read_csv("data source")
2. create the histogram. code: p2 <- ggplot(data_set, aes(x = x_variable)) + geom_histogram(col = "white") + labs(x = "x label", y = "y label") + theme_minimal(base_size = 15)
PPDAC
Problem, Plan, Data, Analysis, Conclusion
dplyr's summarize() to calculate the standard deviation and the variance
CS_dat %>% summarize(cs_sd = sd(cs_rate), cs_var = var(cs_rate))
greater than or equal to, less than or equal to
Greater than or equal to: >= Less than or equal to: <=
histogram
Histograms are used to illustrate the distribution of a numeric (continuous or discrete) variable. There are no spaces between bars, because the underlying variable is numeric and the bins are consecutive. *** it doesn't make sense to reorder a histogram. Data are binned into intervals, and the number/percent in each bin becomes the height of the bar. Bins must be consecutive, non-overlapping, equal-sized intervals (ggplot's default is 30 bins).
measures of spread
IQR, standard deviation, and variance
Interquartile range (IQR)
Q1 is the 1st quartile / the 25th percentile: 25% of individuals have measurements below Q1. Q2 is the 2nd quartile / the 50th percentile / the median: 50% of individuals have measurements below Q2. Q3 is the 3rd quartile / the 75th percentile: 75% of individuals have measurements below Q3. Q3 - Q1 is called the interquartile range (IQR); the middle 50% of individuals lie within it. code: data_set %>% summarize(Q1 = quantile(sym, 0.25), median = median(sym), Q3 = quantile(sym, 0.75)) ex: CS_dat %>% summarize(Q1 = quantile(cs_rate, 0.25), median = median(cs_rate), Q3 = quantile(cs_rate, 0.75))
Box Plots
The centre line is the median. The top of the box is Q3; the bottom of the box is Q1. The top whisker ends at the max value, or at the highest point that is at or below Q3 + 1.5*IQR, whichever is lower. The bottom whisker ends at the min value, or at the lowest point that is at or above Q1 - 1.5*IQR, whichever is higher. code: p1 <- ggplot(CS_dat, aes(y = cs_rate)) + geom_boxplot() + ylab("Cesarean delivery rate (%)") + labs(title = "Box plot of the CS rates across US hospitals", caption = "Data from: Kozhimannil et al. 2013.") + theme_minimal(base_size = 15)
range
The difference between the minimum and maximum value
outliers
The mean is not resistant to outliers. In a larger data set the impact of an outlier is less drastic than in a small one. The median is typically not impacted by outliers.
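A quick numeric sketch of this: replacing one value with an extreme outlier shifts the mean but leaves the median unchanged.

```r
# Without an outlier, mean and median agree
x <- c(2, 3, 4, 5, 6)
mean(x)    # 4
median(x)  # 4

# Replace the 6 with an extreme value: the mean jumps, the median does not
x_out <- c(2, 3, 4, 5, 100)
mean(x_out)    # 22.8
median(x_out)  # 4
```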
bar graphs
best for display of categorical data distribution of a categorical variable, with spaces between bars
==
equal to exactly: lake_data_filtered <- old_data %>% filter(age == 14)
read_csv
a function from readr, used to import CSV files specifically. code template: your_data <- read_csv("data source")
filter() with combine function
lake_data %>% filter(lakes %in% c("Alligator", "Blue Cypress")); c() combines values into a vector
group_by() and summarize()
lake_data %>% group_by(age_data) %>% summarize(mean_ph = mean(ph))
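With a small made-up lake_data (the groups and pH values here are hypothetical), the call above returns one mean per group:

```r
library(dplyr)

# Hypothetical lake data: age_data groups the lakes
lake_data <- tibble(
  age_data = c("young", "young", "old", "old"),
  ph       = c(6.0, 7.0, 7.5, 8.5)
)

lake_data %>%
  group_by(age_data) %>%          # one group per level of age_data
  summarize(mean_ph = mean(ph))   # one row per group, with its mean pH
```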
readr
library used to import data files into R. code: library(readr)
skewed data
When a distribution is skewed, the mean does not equal the median: the skew pulls the mean out into the tail, while the median stays with the bulk of the data. left skewed: the tail extends toward smaller values; the outliers or less common values are smaller than the bulk of the data, pulling the mean below the median (mean < median). right skewed: the tail extends toward larger values; the outliers or less common values are bigger than the bulk of the data, pulling the mean above the median (mean > median).
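A small numeric sketch: in this right-skewed sample the large tail values pull the mean well above the median.

```r
# A right-skewed sample: most values are small, with a long right tail
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 20, 40)
mean(x)    # 8.3: pulled toward the right tail
median(x)  # 3: stays with the bulk of the data
```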
measures of centrality
mean, median. Measures of central tendency are not very helpful for multimodal distributions. Mean and median are roughly the same when the data has one peak, is roughly symmetric, and has no outliers. code: mean(data_set$sym) and median(data_set$sym), or data_set %>% summarize(mean = mean(sym), median = median(sym))
five number summary
min, Q1, median, Q3, max: the full range of a dataset, where the middle 50% of the data lie, and the middle value. code: rent_data %>% summarize(min = min(sym), Q1 = quantile(sym, 0.25), median = median(sym), Q3 = quantile(sym, 0.75), max = max(sym))
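Base R can produce the same five numbers without dplyr; the rent values below are made up for illustration:

```r
rent <- c(500, 750, 800, 900, 1200, 2000)  # hypothetical monthly rents
fivenum(rent)  # min, lower hinge (~Q1), median, upper hinge (~Q3), max
summary(rent)  # the five-number summary plus the mean
```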
!=
not equal to: lake_data_filtered <- old_data %>% filter(age != 14)
arrange ()
puts rows in order by a variable. code template: data_set %>% arrange(pH). default is ascending order. to put in descending order: data_set %>% arrange(-pH), or data_set %>% arrange(desc(pH)). can sort by multiple variables: data_set %>% arrange(age, pH)
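A sketch with hypothetical lake data, showing ascending and both descending forms:

```r
library(dplyr)

# Hypothetical lake measurements
lake_data <- tibble(lake = c("A", "B", "C"), pH = c(7.2, 6.5, 8.1))

lake_data %>% arrange(pH)        # ascending (the default)
lake_data %>% arrange(-pH)       # descending; works for numeric columns
lake_data %>% arrange(desc(pH))  # descending; desc() also handles text columns
```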
data manipulation functions using dplyr
rename() select() arrange() filter() mutate() group_by() summarize()
rename()
renames a variable (column) in a dataset. code template: new_dataset <- rename(old_dataset, new_name = old_name). example: rename the variable michael_jackson to MJ. code: new_data <- rename(old_data, MJ = michael_jackson). to check it worked, use the names() function to view the new variable names: names(new_data)
filter ()
selects which rows we want to keep in a data set. ex: lake_data_filtered <- old_data %>% filter(pH > 7). filtering on more than one condition: lake_data %>% filter(ph > 6, chlorophyll > 30). filter() using "or": lake_data %>% filter(ph > 6 | chlorophyll > 30); | is the OR operator, so at least one of ph > 6 or chlorophyll > 30 needs to be true. filter() with the combine function: lake_data %>% filter(lakes %in% c("Alligator", "Blue Cypress")) keeps only the rows for those lakes; c() combines values into a vector
select()
selects a subset of variables (columns). code template: new_dataset <- old_data %>% select(singers, dancers, musicians). To select a range of columns by name, use the ":" (colon) operator: new_data <- old_data %>% select(variable_a:variable_z). select with starts_with(): selects columns whose names start with certain letters. code: select(data, starts_with("br")) *** returns all columns that start with those letters. to check it worked, use the names() function to view the new variable names: names(new_data)
time plots
a specific subset of plots where the x variable is time. Unlike the previous plots, time plots show a relationship between two variables (a quantitative variable and time). purpose: to look for seasonal patterns and trends
library(readxl)
this library helps with reading xlsx and xls files into R
geom_line
to make a graph with a line: p2 <- ggplot(data_set, aes(x = year, y = y_variable)) + geom_line(col = "white") + labs(title = "title", x = "x label", y = "y label")
geom_point
to make a graph with dots: p1 <- ggplot(data_set, aes(x = year, y = y_variable)) + geom_point() + labs(title = "title", x = "x label", y = "y label") ** can specify the color of the dots within geom_point()
mutate()
used to add new variables to the data set. ex: lake_data_new_fish <- lake_data %>% mutate(actual_fish_sampled = fish_samples * 100). code template: new_data_set <- old_data %>% mutate(new_variable = expression using existing variables)
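A runnable sketch of mutate() on a hypothetical lake dataset (the column names and counts are made up):

```r
library(dplyr)

# Hypothetical lake data with a count of fish samples per lake
lake_data <- tibble(lake = c("A", "B"), fish_samples = c(3, 5))

# mutate() adds a new column; all existing columns are kept
lake_data_new_fish <- lake_data %>%
  mutate(actual_fish_sampled = fish_samples * 100)
```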
negative select
used to deselect variables code template: new_data <- old _data %>% select (-musicians)
dplyr
library used to manipulate data. code: library(dplyr)