Data Analytics - Course 7 (pt. 3)


what is the tibble package a part of?

-> The tibble package is part of the core tidyverse. So, if you've already installed the tidyverse, you have what you need to start working with tibbles.

what is anscombe's quartet

four different datasets that have nearly identical summary statistics but look very different when plotted

using the mutate function to create a new variable that would capture each person's age in twenty years

mutate(people, age_in_20 = age + 20)

what does pivot_longer() function do?

As part of the tidyr package, you can use this R function to lengthen the data in a data frame by increasing the number of rows and decreasing the number of columns.

what does the pivot_wider() function do?

Similarly, if you want to convert your data to have more columns and fewer rows, you would use the pivot_wider() function.
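
The two pivot functions can be sketched as a round trip. This is a minimal example, assuming the tidyr package is installed; the `wide` data frame and its column names are hypothetical, made up for illustration.

```r
library(tidyr)

# Hypothetical wide data: one row per country, one column per year.
wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 22),
  check.names = FALSE  # keep the year column names as written
)

# Lengthen: the year columns become rows (more rows, fewer columns).
long <- pivot_longer(wide, cols = c(`2019`, `2020`),
                     names_to = "year", values_to = "value")

# Widen again: each year goes back to its own column.
wide_again <- pivot_wider(long, names_from = year, values_from = value)
```

Here `long` has four rows and three columns, while `wide_again` is back to two rows and three columns.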

what function gives the name and type of each column? (from built-in dataset)

When read_csv() reads a file, it prints a column specification that gives the name and type of each column. As an example, let's read the "mtcars.csv" sample file, which contains the mtcars dataset mentioned earlier. In the parentheses, you need to supply the path to the file. In this case, it's readr_example("mtcars.csv"): read_csv(readr_example("mtcars.csv"))

what does the head() function do?

We can also use head() to get a preview of the column names and the first few rows of this dataset. Having the column names summarized like this will make it easier to clean them up.

what is one of the most common forms of data storage?

.csv files. They are one of the most common forms of data storage and you will work with them frequently; the read_csv() function reads them into R.

examples of good file names

-> 2020-04-10_march-attendance.R
-> 2021_03_20_new_customer_ids.csv
-> 01_data-sales.html
-> 02_data-sales.html

what is the goal of readr?

-> The goal of readr is to provide a fast and friendly way to read rectangular data.

what does the bias function do?

The bias function finds the average amount by which the actual outcome is greater than the predicted outcome. It's included in the SimDesign package, so it's helpful to install that and practice on your own. If the model is unbiased, the outcome should be pretty close to zero. A high result means that your data might be biased.

what if you want to create another data frame, that focuses on the average daily rate

If you want to create another data frame using `bookings_df` that focuses on the average daily rate, which is referred to as `adr` in the data frame, and `adults`, you can use the following code chunk to do that: new_df <- select(bookings_df, `adr`, adults)

what is the assignment operator?

The assignment operator is <-. You use it to store values in variables:
x <- 2
y <- 5

what does the glimpse() function do?

Or we could use glimpse() to get a really quick idea of what's in this dataset. When we run this command, it shows us a summary of the data. There are 344 rows and eight columns. We have species, island, measurements for bills (which are basically beaks) and flippers, the penguins' body mass in grams, the sex, and finally, the year the data was recorded.

how do you create a data frame in R?

Sometimes you will need to generate a data frame directly in `R`. There are a number of ways to do this; one of the most common is to create individual vectors of data and then combine them into a data frame using the `data.frame()` function.

Fill in the blank: The bias function compares the actual outcome of the data with the _____ outcome to determine whether or not the model is biased.

The bias function compares the actual outcome of the data with the predicted outcome to determine whether or not the model is biased.

A data analyst inputs the following command: quartet %>% group_by(set) %>% summarize(mean(x), sd(x), mean(y), sd(y), cor(x, y)). Which of the functions in this command can help them determine how strongly related their variables are?

The cor() function returns the correlation between two variables. This determines how strong the relationship between those two variables is.
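
The same summary can be sketched with base R alone. The card's command uses a `quartet` data frame with a `set` column; the version below swaps in the `anscombe` dataset that ships with R, which stores the four sets as columns x1 to x4 and y1 to y4, so the grouping is done with a loop instead of group_by().

```r
# anscombe is part of base R's datasets package (columns x1..x4, y1..y4).
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), sd_x = sd(x),
    mean_y = mean(y), sd_y = sd(y),
    cor_xy = cor(x, y))  # strength of the linear relationship
})
colnames(stats) <- paste0("set", 1:4)

# One column per set; the rows are nearly identical across all four sets.
round(stats, 2)
```

The near-identical rows are the point of the quartet: the summary statistics agree even though the four scatterplots look nothing alike.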

why do we use two equal signs in the filter function?

The double equal sign (==) means "exactly equal to" in R. A single equal sign assigns a value or names an argument instead of comparing.

Which summary functions can you use to preview data frames in R? Select all that apply.

The head(), glimpse(), and str() summary functions allow you to preview data frames in R. The head() function returns the columns and the first several rows of data. The mutate() function lets you change the data frame, not preview it. Going forward, you can use summary functions to inspect the data frames you create in your career as a data analyst.

Which of the following functions can a data analyst use to get a statistical summary of their dataset? Select all that apply.

The sd(), cor(), and mean() functions can provide a statistical summary of the dataset using standard deviation, correlation, and mean.

Tidy data is a way of standardizing the organization of data within R.

Tidy data refers to the principles that make data structures meaningful and easy to understand. It's a way of standardizing the organization of data within R.

what is tidy data in R?

Tidy data refers to the principles that make data structures meaningful and easy to understand. It's a way of standardizing the organization of data within R. These standards are pretty straightforward. Variables are organized into columns. Observations are organized into rows and each value must have its own cell.

why use the arrange function?

We can use the arrange function to choose which variable we want to sort by. For example, let's say we want to sort our penguin data by bill length. We'll type arrange and our column name. When we execute this command, it returns a tibble with the data sorted by bill length in ascending order. If we want to sort it in descending order, we just add a minus sign before the column name.
library(tidyverse)
penguins %>% arrange(bill_length_mm)
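
A minimal sketch of both sort directions, assuming dplyr is installed. The `birds` data frame is a hypothetical stand-in for the penguins data so the example runs without extra packages.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data.
birds <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, 47.5, 46.0)
)

# Ascending order is the default: smallest bill length first.
ascending <- birds %>% arrange(bill_length_mm)

# Descending order: desc() works for any column type,
# and for numeric columns it matches arrange(-bill_length_mm).
descending <- birds %>% arrange(desc(bill_length_mm))
```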

unite function example

We'll start with unite and indicate the data frame we're referring to. Then, we'll name the new column that will hold the combined first name and last name. Next, we'll say which columns we're combining. No quotation marks are needed here. Finally, we can include a space to separate them. When we run that, those two columns are combined.
unite(employee, 'name', first_name, last_name, sep = ' ')

Which of the following are best practices for creating data frames? Select all that apply.

When creating data frames, columns should be named and each column should contain the same number of data items.

what is wide data?

Wide data has observations across several columns. Each column contains data from a different condition of the variable. In this example, different years.

what are logical operators?

Logical operators allow you to combine logical values. They return a logical data type, or boolean (TRUE or FALSE). The main ones in R are AND (&), OR (|), and NOT (!).

what are relational operators?

Relational operators, also known as comparators, allow you to compare values. They identify how one R object relates to another, like whether an object is less than, equal to, or greater than another object. The output for relational operators is either TRUE or FALSE (which is a logical data type, or boolean).
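
A short base R sketch of comparators, followed by the logical operators from the previous card. The values of `x` and `y` are arbitrary, chosen only for illustration.

```r
x <- 2
y <- 5

# Relational operators compare two values and return TRUE or FALSE.
x < y    # TRUE
x == y   # FALSE
x != y   # TRUE

# Logical operators combine logical values.
both   <- (x < y) & (y > 10)   # AND: FALSE, because y > 10 is FALSE
either <- (x < y) | (y > 10)   # OR: TRUE, because x < y is TRUE
negate <- !(x == y)            # NOT: TRUE, because x == y is FALSE
```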

how to use the read_csv() function to import data and save it as a data frame

bookings_df <- read_csv("hotel_bookings.csv")

clean_names function example

clean_names(penguins) -> This ensures that there are only characters, numbers, and underscores in the names.

what packages do you need for basic data cleaning?

here, skimr, janitor, dplyr

what is long data?

long data has all the observations in a single column, and variables in separate columns.

filter function example

penguins %>% filter(species == "Adelie") -> And now we have a data frame that only contains data on Adelie penguins. This lets us narrow down our analysis if we need to.
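
A self-contained sketch of the same filter, assuming dplyr is installed. The `birds` data frame is hypothetical and stands in for the penguins data.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data.
birds <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie"),
  island  = c("Torgersen", "Biscoe", "Dream")
)

# Keep only the rows where species is exactly equal to "Adelie".
adelie <- birds %>% filter(species == "Adelie")
```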

example of the group by function

penguins %>% group_by(island) %>% drop_na() %>% summarize(mean_bill_length_mm = mean(bill_length_mm))
-> We're not interested in NA values, so we can leave those out using the drop_na() function. This addresses any missing values in our dataset. It's important to be careful when using drop_na(). It's useful when doing a group-level summary statistic like this, but it will remove rows from the data.
-> When we run this, we get a data frame with the three islands and the mean bill length of the penguins living there. We can get other summaries too. For example, if we want to know the maximum bill length, we can write a similar function and replace mean with max.
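
The grouped summary can be sketched on a small made-up data frame, assuming dplyr and tidyr are installed. The `birds` data and its values are hypothetical.

```r
library(dplyr)
library(tidyr)  # drop_na() comes from tidyr

# Hypothetical data with one missing bill length.
birds <- data.frame(
  island = c("Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"),
  bill_length_mm = c(45, 47, 40, NA, 39)
)

island_summary <- birds %>%
  drop_na() %>%            # removes the whole row that contains NA
  group_by(island) %>%
  summarize(
    mean_bill_length_mm = mean(bill_length_mm),
    max_bill_length_mm  = max(bill_length_mm)
  )
```

If dropping whole rows is too aggressive, `mean(bill_length_mm, na.rm = TRUE)` skips the missing values inside the summary while leaving the data untouched.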

what are some readr functions?

-> read_csv(): comma-separated values (.csv) files
-> read_tsv(): tab-separated values files
-> read_delim(): general delimited files
-> read_fwf(): fixed-width files
-> read_table(): tabular files where columns are separated by white-space
-> read_log(): web log files

Which R function can be used to make changes to a data frame?

The mutate() function can be used to make changes to a data frame.

what are some examples of file types that store rectangular data

-> .csv (comma separated values): a .csv file is a plain text file that contains a list of data. They mostly use commas to separate (or delimit) data, but sometimes they use other characters, like semicolons.
-> .tsv (tab separated values): a .tsv file stores a data table in which the columns of data are separated by tabs. For example, a database table or spreadsheet data.
-> .fwf (fixed width files): a .fwf file has a specific format that allows for the saving of textual data in an organized fashion.
-> .log: a .log file is a computer-generated file that records events from operating systems and other software programs.

Examples of filenames to avoid

-> 4102020marchattendance<workinprogress>.R
-> _20210320*newcustomeridsforfebonly.csv
-> firstfile_for_datasales/1-25-2020.html
-> secondfile_for_datasales/2-5-2020.html

there are 3 common sources for data

-> A `package` with data that can be accessed by loading that `package`
-> An external file like a spreadsheet or CSV that can be imported into `R`
-> Data that has been generated from scratch using `R` code

what do the Skimr and Janitor packages do?

-> As a quick reminder, these packages simplify data cleaning tasks. They're both really useful and do slightly different things.
-> The skimr package makes summarizing data really easy and lets you skim through it more quickly.
-> The janitor package has functions for cleaning data.

Which of the following are standards of tidy data?

-> Variables are organized into columns, observations are organized into rows, and each value must have its own cell.

creating a data frame (example)

-> First, create a vector of names by inserting four names between the quotation marks, and then run it:
names <- c("kev", "jess", "rosy", "maria")
-> Then create a vector of ages by adding four ages separated by commas:
age <- c(22, 24, 11, 33)
-> With these two vectors, you can create a new data frame called `people`:
people <- data.frame(names, age)

how are tibbles different from standard data frames?

-> First, tibbles never change the data types of the inputs. They won't change your strings to factors or anything else. You can make more changes to base data frames, but tibbles are easier to use. This saves time because you won't have to do as much cleaning or changing data types in tibbles.
-> Tibbles also never change the names of your variables, and they never create row names.
-> Finally, tibbles make printing in R easier. They won't accidentally overload your console because they're automatically set to pull up only the first 10 rows and as many columns as fit on screen. Super useful when you're working with large sets of data.
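
The point about variable names can be checked directly. This is a minimal sketch, assuming the tibble package is installed; the column name and values are made up.

```r
library(tibble)

# data.frame() rewrites non-syntactic names by default (check.names = TRUE).
df <- data.frame(`first name` = c("Ana", "Ben"))
names(df)   # "first.name"

# tibble() keeps the name exactly as it was written.
tb <- tibble(`first name` = c("Ana", "Ben"))
names(tb)   # "first name"
```

On the data type point, note that since R 4.0 data.frame() also leaves strings as character vectors by default, so the strings-to-factors difference mainly shows up in older R versions.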

rename_with() function example

-> For example, maybe we want all of our column names to be in uppercase. We can use the rename_with() function to do that. This will automatically make our column names uppercase. But since variable names are usually lowercase, we'll use the tolower option to change it back. rename_with(penguins, toupper)
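
Both directions can be sketched on a small hypothetical data frame, assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data.
birds <- data.frame(species = "Adelie", island = "Dream")

# rename_with() applies a function to every column name.
upper <- rename_with(birds, toupper)   # SPECIES, ISLAND
lower <- rename_with(upper, tolower)   # back to species, island
```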

how do you load a specific dataset? (that's already in R)

-> If you want to load a specific dataset, just enter its name in the parentheses of the data() function. For example, let's load the mtcars dataset, which has information about cars that have been featured in past issues of Motor Trend magazine. data(mtcars) -> When you run the function, R will load the dataset. The dataset will also appear in the Environment pane of your RStudio.

do's when naming files

-> Keep your filenames to a reasonable length
-> Use underscores and hyphens for readability
-> Start or end your filename with a letter or number
-> Use a standard date format when applicable; example: YYYY-MM-DD
-> Use filenames for related files that work well with default ordering; example: in chronological order, or logical order using numbers first

mutate function example

-> Let's go back to our penguin dataset. Right now, the body mass column is measured in grams. Maybe we want to add a column with kilograms. To do that, we'll use mutate to perform the conversion and add a new column. You can make calculations on multiple new variables by adding a comma. Let's add a column converting the flipper length too.
View(penguins)
penguins %>% mutate(body_mass_kg = body_mass_g / 1000, flipper_length_m = flipper_length_mm / 1000)

how to save the above as a data frame

-> It's important to remember that this result is only printed in our console. To save it as a data frame, we'll start by naming it. Then we'll input the function we used to arrange the previous version of the penguins data.
penguins2 <- penguins %>% arrange(-bill_length_mm)
View(penguins2)
-> When we execute this, it saves a new data frame, and we can use View(penguins2) to open it in the data viewer. This lets you save cleaned data without losing information from the original dataset.

how do you then get a preview of the dataset? (2 ways)

-> Now that the dataset is loaded, you can get a preview of it in the R console pane. Just type its name, mtcars, and then press Ctrl (or Cmd) and Enter.
-> You can also display the dataset by clicking directly on the name of the dataset in the Environment pane. So, if you click on mtcars in the Environment pane, R automatically runs the View() function and displays the dataset in the RStudio data viewer.

what functions can you use to inspect the data frame? (with example)

-> One common function you can use to preview the data is the `head()` function, which returns the columns and the first several rows of data. You can check out how the `head()` function works by running the chunk below: head(people)

what is rectangular data?

-> Rectangular data is data that fits nicely inside a rectangle of rows and columns, with each column referring to a single variable and each row referring to a single observation.

what does the clean_names function do?

-> The clean_names() function in the janitor package will automatically make sure that the column names are unique and consistent.

how can you start working with the readr package?

-> The readr package is part of the core tidyverse. So, if you've already installed the tidyverse, you have what you need to start working with readr. If not, you can install the tidyverse now.

what function makes it easy to change column names?

-> The rename function makes it easy to change column names. Starting with the penguin data, we'll type rename and change the name of our island column to island_new. penguins %>% rename(island_new = island)

what does the skim_without_charts() function do?

-> The skim_without_charts() function gives us a pretty comprehensive summary of a dataset. When we run it, we get a lot of info back.
-> First, it gives us a summary with the name of the dataset and the number of rows and columns. It also gives us the column types and a summary of the different data types contained in the data frame.

Which of the following functions returns a summary of the data frame, including the number of columns and rows? Select all that apply.

-> The skim_without_charts() and glimpse() functions both return a summary of the data frame, including the number of columns and rows.

what are tibbles? (refresher)

-> Tibbles are like streamlined data frames that are automatically set to pull up only the first 10 rows of a dataset, and only as many columns as can fit on the screen. This is really useful when you're working with large sets of data. Unlike data frames, tibbles never change the names of your variables, or the data types of your inputs. Overall, you can make more changes to data frames, but tibbles are easier to use.

don'ts when naming files

-> Use unnecessary additional characters in filenames
-> Use spaces or "illegal" characters; examples: &, %, #, <, or >
-> Start or end your filename with a symbol
-> Use incomplete or inconsistent date formats; example: M-D-YY
-> Use filenames for related files that do not work well with default ordering; examples: a random system of numbers or date formats, or using letters first

what does the select() function do?

-> We can use select to specify certain columns or to exclude columns we don't need right now. Let's say we only need to check the species column. penguins %>% select(species) -> Now we have just the species column, or maybe we want everything except the species column. We'll put minus species (-species) instead of species, and now we have every column but species. The select statement is useful for pulling just a subset of variables from a large dataset.
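
Keeping and excluding a column can be sketched side by side, assuming dplyr is installed. The `birds` data frame is a hypothetical stand-in for the penguins data.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data.
birds <- data.frame(
  species = c("Adelie", "Gentoo"),
  island  = c("Dream", "Biscoe"),
  year    = c(2007, 2008)
)

only_species <- birds %>% select(species)    # just the species column
no_species   <- birds %>% select(-species)   # every column except species
```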

when will you use a data frame?

-> Wherever data comes from, you will almost always want to store it in a data frame object to work with it.

what other function will allow you to sort data?

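-> You can also organize data with the group_by() function, which collects rows into groups based on the values in one or more columns. group_by() is usually combined with other functions, such as summarize().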

how can you create a tibble from existing data? (what function)

-> You can create a tibble from existing data with the as_tibble() function. Indicate what data you'd like to use in the parentheses of the function. In this case, you will put diamonds: as_tibble(diamonds)

how can you display a list of the available datasets in R?

-> You can use the data() function to load these datasets in R. If you run the data function without an argument, R will display a list of the available datasets. data() This includes the list of preloaded datasets from the datasets package.

what is a data frame?

-> A data frame is a collection of columns.
-> It's a lot like a spreadsheet or a SQL table.
-> We use data frames for a lot of the same reasons as tables too. They help summarize data and put it into a format that's easy to read and use.

what should you do before trying to save .csv files?

-> be sure to install and load the here package

what does the separate function do?

-> for example, if a column has both first and last name, you can use the separate function to split these into separate columns

what does the str() function do?

-> for example, we could use the str() function to highlight the structure of this data frame -> This gives us some high-level info like the column names and the type of data contained in those columns.

what other functions can give you the structure of the data frame?

-> functions like str() and colnames()

what does the colnames() function do?

-> if we just want to know the column names we can use colnames instead.
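
The inspection functions from these cards can be tried on the built-in mtcars dataset using base R only. This is a small sketch, not a full tour of each function.

```r
data(mtcars)

colnames(mtcars)   # just the column names
str(mtcars)        # structure: column names, types, and a few values

first_rows  <- head(mtcars)      # first 6 rows by default
first_three <- head(mtcars, 3)   # ask for a different number of rows
```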

what are tibbles?

-> in the tidyverse, tibbles are like streamlined data frames -> They make working with data easier, but they're a little different from standard data frames.

what does the Here package do?

-> it makes referencing files easier (when data cleaning)
-> To install it, run install.packages("here"). After installing it, we'll also need to load it using library(here).

what functions help organize your data?

-> mean(), max(), group_by(), summarize(), arrange(), filter(), drop_na()

example of the mutate() function

-> mutate(diamonds, carat_2 = carat * 100)
-> We'll start with mutate and the data frame, then add the name of the new column we want to create and how to calculate it. In this case, to make the carat column easier to read, we'll multiply it by 100 to create a new column, carat_2. And when we run this, presto, our data frame has a new column. You won't lose any columns when you create the new one.

what functions help clean your data?

-> rename(), rename_with(), skim_without_charts(), glimpse(), select(), clean_names()

what functions help transform your data?

-> separate(), mutate(), unite()

what are a few different functions you can use to get summaries of the data frame?

-> skim_without_charts(), glimpse(), head(), select()

what function will just give you only the first 6 rows?

-> the head function -> this is a nice preview of the entire dataset. -> accidentally printing the full data frame to the console can be annoying and can take a long time to compute. You can avoid printing the full data frame by using functions like head to get a quick preview.

what function will allow you to make changes to your data frame?

-> the mutate() function -> The mutate function is part of the dplyr package which is in the tidyverse. So you'll need to load the tidyverse library before you test out mutate.

what package is great for reading rectangular data?

-> the readr package

what function can change column names to be more consistent?

-> the rename_with() and clean_names() functions

what does the unite function do?

-> this function allows us to merge columns together -> it basically does the opposite of the separate function

how can mutate be used to transform your data

-> we can also create new variables in our data frame using the mutate function. We worked with mutate a little bit before to clean and organize our data. But mutate can also be used to add columns with calculations.

how to load a dataset (refresher)

-> you can load the dataset with the data() function using the following code: library(tidyverse) data(diamonds) -> Then, let's add the data frame to our data viewer in RStudio with the View() function. View(diamonds)

Which syntax would you use to import a dataset called quarter_earnings.csv into RStudio?

The proper syntax to use for importing the "quarter_earnings.csv" dataset is earnings_df <- read_csv("quarter_earnings.csv"). The results of this function display as column specifications of the data frame it creates. Going forward, you can import data into RStudio with read_csv() for projects throughout your career as a data analyst.
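
The import step can be tried without an existing file. This sketch assumes readr is installed; it writes a tiny made-up CSV to a temporary file first, so the file contents and column names are hypothetical.

```r
library(readr)

# Write a tiny hypothetical CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("quarter,earnings", "Q1,1200", "Q2,1350"), csv_path)

# read_csv() prints the column specification as it reads the file.
earnings_df <- read_csv(csv_path)

# spec() shows the same column names and types on request.
spec(earnings_df)
```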

how do you list the sample files in the readr package?

The readr package comes with some sample files from built-in datasets that you can use for example code. To list the sample files, you can run the readr_example() function with no arguments. readr_example()

what does the sample() function do?

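The sample() function draws a random sample of a chosen size from a set of values, which helps you avoid selecting data in a biased way. It is just one of many functions and methods in R that you can use to address bias in your data. Depending on the kind of analysis you are conducting, you might need to incorporate some more advanced processes in your programming.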
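
A minimal base R sketch of random sampling. The ID range and sample size are arbitrary, chosen only for illustration.

```r
# Fix the random seed so the same sample is drawn each time the code runs.
set.seed(42)

ids <- 1:100

# Draw 10 IDs at random, without replacement (the default).
picked <- sample(ids, size = 10)
picked
```

Setting a seed is optional, but it makes the random selection reproducible when you share your analysis.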

Why are tibbles a useful variation of data frames?

Tibbles can make printing easier. They also help you avoid overloading your console when working with large datasets. Tibbles are automatically set to only return the first ten rows of a dataset and as many columns as it can fit on the screen.

how do you create new variables in your data frame

To create new variables in your data frame, you can use the `mutate()` function. This will make changes to the data frame, but not to the original data set you imported. That source data will remain unchanged. mutate(new_df, total = `adr` / adults)

how to upload your own .csv file to import

To do this, go to the Files tab in the lower-right pane. Then, click the Upload button next to the + New Folder button. This opens a popup that lets you browse your computer for a file. Select any .csv file, then click Open. Then write code that reads that data into `R`, for example with read_csv().

separate function example

We'll start with separate, and then the data frame we want to work with and the column we'd like to separate. Then we'll add what we'd like to split the name column into. We'll just name these new columns first_name and last_name. And finally, we'll tell R to separate the name column at the blank space. When we run this, it will build us new columns for the first and last names.
separate(employee, name, into = c('first_name', 'last_name'), sep = ' ')
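
Because unite() does the opposite of separate(), the two can be sketched as a round trip. This assumes tidyr is installed; the `employee` data frame here is hypothetical.

```r
library(tidyr)

# Hypothetical employee data with full names in one column.
employee <- data.frame(id = 1:2, name = c("Ana Lopez", "Ben Kim"))

# Split the name column at the space into two new columns.
split_names <- separate(employee, name,
                        into = c("first_name", "last_name"), sep = " ")

# Merge the two columns back into a single name column.
rejoined <- unite(split_names, "name", first_name, last_name, sep = " ")
```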

using the bias function

We'll use the bias function to compare forecasted temperatures with actual temperatures. For this example, we'll just take a small sample of our weather data and input the values here, labeling them actual_temp and predicted_temp.
install.packages("SimDesign")
library(SimDesign)
actual_temp <- c(68.3, 70, 72.4, 71, 67, 70)
predicted_temp <- c(67.9, 69, 71.5, 70, 67, 69)
bias(actual_temp, predicted_temp)
When we run this, we find that the result is about 0.72. That's pretty close to zero, but the predictions seem biased toward lower temperatures, which means they aren't as accurate as they could be. And now that the local weather channel knows about this, they can find the problem in their system that's causing biased predictions. This doesn't mean that their predictions will be perfect all the time, but they'll be more accurate overall.
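
The arithmetic behind that result can be checked in base R. This sketch assumes, following the description in these cards, that the value is the average of actual minus predicted; it does not call SimDesign itself.

```r
actual_temp    <- c(68.3, 70, 72.4, 71, 67, 70)
predicted_temp <- c(67.9, 69, 71.5, 70, 67, 69)

# Average amount by which the actual values exceed the predictions
# (assumed to match what bias() reports for these two vectors).
mean_difference <- mean(actual_temp - predicted_temp)
mean_difference
```

A positive value means the actual temperatures ran higher than the forecasts on average, which is why the predictions look biased toward lower temperatures.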

