Module 3: explore data and R, cleaning/organizing data, biased data


4 main types of operators in R

Arithmetic, Relational, Logical, Assignment

necessary packages for cleaning data

here, skimr, janitor, dplyr

rename_with( )

Similar to the rename() function, rename_with() can change column names to be more consistent. For example, maybe we want all of our column names to be in uppercase. This call makes every column name uppercase:
rename_with(penguins, toupper)
To change them back to lowercase:
rename_with(penguins, tolower)

logical NOT (!)

The NOT operator simply negates the logical value, and evaluates to its opposite. In R, zero is considered FALSE and all non-zero numbers are considered TRUE.
example: apply the NOT operator to your variable (x <- 10):
!(x < 15)
[1] FALSE
The NOT operation evaluates to FALSE because it takes the opposite logical value of the statement x < 15, which is TRUE (10 is less than 15).
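
The card above can be tried directly in the console. This is a minimal sketch in base R; the variable names result, not_zero, and not_five are made up for illustration.

```r
x <- 10

# x < 15 is TRUE, so applying NOT gives FALSE
result <- !(x < 15)
print(result)

# Numbers are coerced to logical before negation:
# 0 counts as FALSE, any non-zero number counts as TRUE
not_zero <- !0
not_five <- !5
print(not_zero)
print(not_five)
```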

readr

The goal of readr is to provide a fast and friendly way to read rectangular data. readr supports several read_ functions. Each function refers to a specific file format.
read_csv(): comma-separated values (.csv) files
read_tsv(): tab-separated values files
read_delim(): general delimited files
read_fwf(): fixed-width files
read_table(): tabular files where columns are separated by white-space
read_log(): web log files

which functions can a data analyst use to get a statistical summary of their dataset?

The sd(), cor(), and mean() functions can provide a statistical summary of the dataset using standard deviation, correlation, and mean.

how do tibbles make printing in R easier?

They won't accidentally overload your console because they're automatically set to pull up only the first 10 rows and as many columns as fit on screen. Super useful when you're working with large sets of data.

Here package

This package makes referencing files easier. To install it, write install.packages() with "here" in the parentheses, and RStudio will install it. After installing it, we also need to load it using library():
install.packages("here")
library("here")

filename don'ts

Use unnecessary additional characters in filenames
Use spaces or "illegal" characters; examples: &, %, #, <, or >
Start or end your filename with a symbol
Use incomplete or inconsistent date formats; example: M-D-YY
Use filenames for related files that do not work well with default ordering; examples: a random system of numbers or date formats, or using letters first

standards for organization of data within R

Variables are organized into columns. Observations are organized into rows and each value must have its own cell.

%>% drop_na() %>%

We're not interested in NA values, so we can leave those out using the drop_na() function, which addresses any missing values in our dataset. Note: be careful when using drop_na(). It's useful when doing a group-level summary statistic like this example, but it removes rows from the data.
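
To see how drop_na() removes rows before a group-level summary, here is a small sketch. It assumes the dplyr and tidyr packages are installed; the birds data frame and its values are hypothetical, made up only for this illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one row has a missing bill length
birds <- tibble(
  island = c("A", "A", "B", "B"),
  bill_length_mm = c(40, NA, 50, 46)
)

# drop_na() removes every row that contains an NA
cleaned <- birds %>% drop_na()

# Group-level summary on the remaining rows
island_means <- cleaned %>%
  group_by(island) %>%
  summarize(mean_bill_length_mm = mean(bill_length_mm))

print(nrow(birds))    # rows before dropping
print(nrow(cleaned))  # rows after dropping
print(island_means)
```

Note that the row with the NA is gone from cleaned entirely, which is why the card warns about using drop_na() carelessly.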

wide data

Wide data has observations across several columns. Each column contains data from a different condition of the variable

Data analysts are cleaning their data in R. They want to be sure that their column names are unique and consistent to avoid any errors in their analysis. What R function can they use to do this automatically?

clean_names()

if we just want to know the column names what function can you use?

colnames( )

data frames: what to know before working with them

Columns should be named (using empty column names can create problems with your results later on).
The data stored in your data frame can be many different types, like numeric, factor, or character (often data frames contain dates, time stamps, and logical vectors).
Each column should contain the same number of data items, even if some of those data items are missing.
Data frames are foundational.

A data analyst inputs the following command: quartet %>% group_by(set) %>% summarize(mean(x), sd(x), mean(y), sd(y), cor(x, y)). Which of the functions in this command can help them determine how strongly related their variables are?

cor(x,y)

what are the building blocks for analysis in R?

data frames and tibbles

rectangular data

data that fits nicely inside a rectangle of rows and columns, with each column referring to a single variable and each row referring to a single observation

you can use the _____ function to load datasets in R. if you run the data function without an argument, R will display a list of the available datasets

data( )

Which syntax would you use to import a dataset called quarter_earnings.csv into RStudio?

earnings_df <- read_csv("quarter_earnings.csv")

If the model is unbiased, the outcome should be pretty close to zero. A high result means that your data might be biased.

example: bias(actual_temp, predicted_temp)
If you get a result close to zero, like 0.7134, the model is probably not biased. If you get a result far from zero, like -35, the data may be biased.
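
The idea behind this check can be sketched in base R without any package. This sketch assumes the bias is the average of (actual - predicted), as the bias function card in this set describes it. The helper mean_difference() and the temperature values are hypothetical stand-ins, not the SimDesign implementation.

```r
# Hypothetical helper: average amount by which actual exceeds predicted
mean_difference <- function(actual, predicted) {
  mean(actual - predicted)
}

# Made-up temperature readings for illustration
actual_temp <- c(68.3, 70, 72.4, 71, 67, 70)
predicted_temp <- c(67.9, 69, 71.5, 70, 67, 69)

# Predictions that track the actual values give a result near zero
small_gap <- mean_difference(actual_temp, predicted_temp)
print(small_gap)

# Predictions that are consistently far too high give a large negative result
large_gap <- mean_difference(actual_temp, predicted_temp + 35)
print(large_gap)
```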

When you just run `arrange()` without saving your data to a new data frame, it does not alter the existing data frame. If you wanted to create a new data frame that had those changes saved, you would use <- as written in the code chunk below to store the arranged data in a data frame named 'hotel_bookings_v2'

example: hotel_bookings_v2 <- arrange(hotel_bookings, desc(lead_time))

You can also use the`mutate()` function to make changes to your columns. Let's say you wanted to create a new column that summed up all the adults, children, and babies on a reservation for the total number of people. Modify the code chunk below to create that new column:

example_df <- bookings_df %>% mutate(guests = adults + children + babies)
head(example_df)

Calculate the total number of canceled bookings and the average lead time for booking - you'll want to start your code after the %>% symbol. Make a column called 'number_canceled' to represent the total number of canceled bookings. Then, make a column called 'average_lead_time' to represent the average lead time. Use the `summarize()` function to do this in the code chunk below:

example_df <- bookings_df %>% summarize(number_canceled = sum(is_canceled), average_lead_time = mean(lead_time))
head(example_df)

You can also find out the maximum and minimum lead times without sorting the whole data set with the `arrange()` function, by using max() and min() instead. In this case, you need to specify which data set and which column using the $ symbol between their names

examples:
max(hotel_bookings$lead_time)
min(hotel_bookings$lead_time)

A data analyst is working with the penguins data. They write the following code: penguins %>% The variable species includes three penguin species: Adelie, Chinstrap, and Gentoo. What code chunk does the analyst add to create a data frame that only includes the Gentoo species?

filter(species == "Gentoo")

glimpse( )

gives a quick idea of what's in the dataset with a summary of the data (gives the number of rows and columns, etc)

skim_without_charts( )

gives us a summary with the name of the dataset and the number of rows and columns. It also gives us the column types and a summary of the different data types contained in the data frame

which functions return a summary of the data frame, including the number of columns and rows?

glimpse( ) skim_without_charts( )

Janitor package

has functions for cleaning data

the _____ function provides a preview of the first 6 rows of a data frame

head( ) This is useful if you want to quickly check out the data, but don't want to print the entire data frame.

be sure to install and load the _____ package before trying to save CSV files

here

use the `read_csv()` function to import data from a .csv in the project folder called "hotel_bookings.csv" and save it as a data frame called `hotel_bookings`:

hotel_bookings <- read_csv("hotel_bookings.csv")

Now, your boss wants to know a lot more information about city hotels, including the maximum and minimum lead time. They are also interested in how they are different from resort hotels. You don't want to run each line of code over and over again, so you decide to use the `group_by()`and`summarize()` functions. You can also use the pipe operator to make your code easier to follow. You will store the new data set in a data frame named 'hotel_summary':

hotel_summary <- hotel_bookings %>% group_by(hotel) %>% summarise(average_lead_time=mean(lead_time), min_lead_time=min(lead_time), max_lead_time=max(lead_time))
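
The same group_by() and summarise() pattern can be run on a tiny stand-in data set. This sketch assumes dplyr is installed; the four bookings below are hypothetical values, not rows from the real hotel_bookings.csv file.

```r
library(dplyr)

# Hypothetical stand-in for the hotel bookings data
hotel_bookings <- tibble(
  hotel = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  lead_time = c(10, 30, 5, 45)
)

# One summary row per hotel type
hotel_summary <- hotel_bookings %>%
  group_by(hotel) %>%
  summarise(
    average_lead_time = mean(lead_time),
    min_lead_time = min(lead_time),
    max_lead_time = max(lead_time)
  )

print(hotel_summary)
```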

when you run the data( ) function, R will load the dataset. where else will the dataset appear?

in the Environment pane of RStudio. The Environment pane displays the names of the data objects, such as data frames and variables, that you have in your current workspace

== meaning in R

it means "exactly equal to"

assignment operators

let you assign values to variables. Best practice in R is using <- instead of =

arithmetic operators

let you perform basic math operations like addition, subtraction, multiplication, and division.

the summarize function...

lets us get high level information about our data

long data

long data has all the observations in a single column, and the variable conditions are placed into separate rows.

Skimr package

makes summarizing data really easy and lets you skim through it more quickly

A data analyst is working with a data frame named salary_data. They want to create a new column named wages that includes data from the rate column multiplied by 40. What code chunk lets the analyst create the wages column?

mutate(salary_data, wages = rate*40)

syntax for importing csv file into RStudio

own_df <- read_csv("<filename.csv>")

to get just the species column / to get everything except the species column

penguins %>% select(species)
penguins %>% select(-species)

As part of the tidyr package, you can use this R function to lengthen the data in a data frame by increasing the number of rows and decreasing the number of columns. Similarly, if you want to convert your data to have more columns and fewer rows, you would use the pivot_wider() function.

pivot_longer()

if you want to convert your data to have more columns and fewer rows, you would use which function?

pivot_wider()
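
A round trip between wide and long shapes shows what the two functions do. This sketch assumes tidyr is installed; the wide data frame, its column names, and its values are hypothetical.

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide <- tibble(
  country = c("X", "Y"),
  `2019` = c(1, 3),
  `2020` = c(2, 4)
)

# Lengthen: more rows, fewer columns
long <- wide %>%
  pivot_longer(cols = c(`2019`, `2020`), names_to = "year", values_to = "value")

# Widen again: more columns, fewer rows
wide_again <- long %>%
  pivot_wider(names_from = year, values_from = value)

print(dim(wide))
print(dim(long))
print(dim(wide_again))
```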

the select statement is useful for...

pulling just a subset of variables from a large dataset. This lets you focus on specific groups of variables

The readr package comes with some sample files from built-in datasets that you can use for example code. To list the sample files, you can run the readr_example() function with no arguments

readr_example( )

tidy data

refers to the principles that make data structures meaningful and easy to understand. It's a way of standardizing the organization of data within R

functions for cleaning data in R

rename_with() clean_names() skim_without_charts() rename() glimpse()

functions that can be used to get summaries of the data frame

skim_without_charts( ) glimpse( ) head( ) select( )

You are working with the penguins dataset. The location where each record was collected is stored in the column island. The mass of each penguin is stored in the column body_mass_g. You want to know which island had penguins with the greatest mass. At this point, the following code to group each island's penguins by mass has already been written into the script: penguins %>% drop_na() %>% group_by(island) %>% Add the code chunk using the summarize() and mean() functions that lets you find the mean value for the variable body_mass_g.

summarize(mean(body_mass_g))

str( )

the structure function can be used to highlight the structure of a data frame. Gives some high-level info like the column names and the type of data contained in those columns.

Rename the variable 'hotel' to be named 'hotel_type' to be crystal clear on what the data is about:

trimmed_df %>% select(hotel, is_canceled, lead_time) %>% rename(hotel_type = hotel)

Based on your notes you are primarily interested in the following variables: hotel, is_canceled, lead_time. Create a new data frame with just those columns, calling it `trimmed_df`

trimmed_df <- bookings_df %>% select(hotel, is_canceled, lead_time)

The rename_with() function can be used to reformat column names to be upper or lower case. T or F?

true

Group by is usually combined with other functions. For example, we might want to group by a certain column and then perform an operation on those groups. T or F?

true example: penguins %>% group_by(island) %>% drop_na() %>% summarize(mean_bill_length_mm = mean(bill_length_mm))

functions to transform data in R

unite() separate() mutate()

A data analyst is working with a data frame named weather. It has separate columns for temperatures (temp) and measurement units (unit). The analyst wants to combine the two columns into a single column called display_temp, with the temperature and unit separated by the string " Degrees ". What code chunk lets the analyst create the display_temp column?

unite(weather, "display_temp", temp, unit, sep = " Degrees ")
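
Running the answer on a small example shows the combined column. This sketch assumes tidyr is installed; the two weather readings are hypothetical, and the result is stored in a made-up name, display.

```r
library(tidyr)

# Hypothetical weather data with separate temperature and unit columns
weather <- tibble(
  temp = c(72, 18),
  unit = c("Fahrenheit", "Celsius")
)

# Combine temp and unit into one column, joined by " Degrees "
display <- unite(weather, "display_temp", temp, unit, sep = " Degrees ")

print(display)
```

By default unite() drops the input columns, so display contains only display_temp.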

logical operators in R

& Element-wise logical AND
&& Logical AND
| Element-wise logical OR
|| Logical OR
! Logical NOT
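
The difference between the element-wise and single-value forms is easiest to see side by side. A minimal base R sketch follows; all variable names are made up for illustration.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# Element-wise operators compare the vectors position by position
elementwise_and <- a & b
elementwise_or <- a | b
print(elementwise_and)
print(elementwise_or)

# && and || work on single logical values, as in an if() condition
x <- 10
single_and <- (x > 5) && (x < 15)
single_or <- (x < 5) || (x > 20)
print(single_and)
print(single_or)
```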

examples of file types that store rectangular data

.csv (comma separated values): a .csv file is a plain text file that contains a list of data. They mostly use commas to separate (or delimit) data, but sometimes they use other characters, like semicolons.
.tsv (tab separated values): a .tsv file stores a data table in which the columns of data are separated by tabs. For example, a database table or spreadsheet data.
.fwf (fixed width files): a .fwf file has a specific format that allows for the saving of textual data in an organized fashion.
.log: a .log file is a computer-generated file that records events from operating systems and other software programs.

good filename examples

2020-04-10_march-attendance.R 2021_03_20_new_customer_ids.csv 01_data-sales.html 02_data-sales.html

examples of filenames to avoid

4102020marchattendance<workinprogress>.R _20210320*newcustomeridsforfebonly.csv firstfile_for_datasales/1-25-2020.html secondfile_for_datasales/2-5-2020.html

the rename function makes it easy to...

change column names
example: penguins %>% rename(island_new = island)

tibbles

In the tidyverse, tibbles are like streamlined data frames.
Tibbles never change the data types of your inputs: they won't change your strings to factors or anything else.
Tibbles never change the names of your variables, and they never create row names.
Tibbles automatically preview only the first 10 rows of data (and as many columns as fit on screen), which makes printing in R easier.

file name do's

Keep your filenames to a reasonable length
Use underscores and hyphens for readability
Start or end your filename with a letter or number
Use a standard date format when applicable; example: YYYY-MM-DD
Use filenames for related files that work well with default ordering; example: in chronological order, or logical order using numbers first

head( )

We can use head() to get a preview of the column names and the first few rows of a data set. We can use select() to specify certain columns or to exclude columns we don't need right now. If you only needed to check the species column, input penguins, then a pipe to indicate we'll add another command, and then select():
penguins %>% select(species)

data frame

a collection of columns. It has column names, rows, and cells with data. Each column contains one variable, and each row has a set of values that match each column. Data frames help summarize data and put it into a format that's easy to read and use (a lot like a spreadsheet or a SQL table). They are usually the starting point for analyzing data in R.

operator

a symbol that identifies the type of operation or calculation to be performed in a formula

the arrange function sorts in ascending order. if you want to sort in descending order what would you do?

add a minus sign before the column name example: penguins %>% arrange(-bill_length_mm)
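
For a numeric column, the minus sign and desc() give the same descending order. This sketch assumes dplyr is installed; the measurements data frame and its values are hypothetical.

```r
library(dplyr)

# Hypothetical measurements
measurements <- tibble(
  id = c("a", "b", "c"),
  bill_length_mm = c(39.1, 46.5, 42.0)
)

# Two ways to sort from largest to smallest
by_minus <- measurements %>% arrange(-bill_length_mm)
by_desc <- measurements %>% arrange(desc(bill_length_mm))

print(by_minus)
print(by_desc)
```

The minus sign only works on numeric columns, while desc() also handles character columns.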

example of mutate function in action

Adding a column: call mutate(), followed by the name of the data frame we want to change (here, diamonds), then the name of the new column and how to calculate it. In this case, to make the carat column easier to read, we multiply it by 100 to create a new column, carat_2.
ex:
library(tidyverse)
mutate(diamonds, carat_2 = carat * 100)

logical operators

allow you to combine logical values. Logical operators return a logical data type or boolean (TRUE or FALSE).

relational operators

also known as comparators, allow you to compare values. Relational operators identify how one R object relates to another—like whether an object is less than, equal to, or greater than another object. The output for relational operators is either TRUE or FALSE (which is a logical data type, or boolean).

you can use the _____ function to choose which variable to sort by

arrange example: library(tidyverse) penguins %>% arrange(bill_length_mm)

functions for organizing data in R

arrange() filter() mean() group_by() drop_na() max() summarize()

`arrange()` automatically orders by ascending order, and you need to specifically tell it when to order by descending order, like the following code chunk

arrange(hotel_bookings, desc(lead_time))

Let's say you want to arrange the data by most lead time to least lead time because you want to focus on bookings that were made far in advance. You decide you want to try using the `arrange()` function and run the following command:

arrange(hotel_bookings, lead_time)

clean names function in the Janitor package

automatically makes sure the column names are unique and consistent example: clean_names(penguins) This ensures that there's only characters, numbers, and underscores in the names.

you can make more changes to _____, but _____ are easier to use

base data frames, tibbles (This saves time because you won't have to do as much cleaning or changing data types in tibbles)

bias function

bias() finds the average amount that the actual outcome is greater than the predicted outcome. It's included in the SimDesign package.
install.packages("SimDesign")
library(SimDesign)

mutate function mutate( )

can be used to make changes to our data frame. The mutate function is part of the dplyr package which is in the tidyverse (So you'll need to load the tidyverse library before using it)

