Data Analytics - Course 7
advice for someone just learning R
"The advice I would give to someone just learning R is that mistakes are part of the process. Errors and error messages are part of the process. When I think about the people who are even better than I am in R, I've come to realize they're not necessarily smarter than I am, but they may be a little bit more persistent and delving a little bit deeper."
variable example
# Here's an example of a variable
first_variable <- "This is my variable"
variable example using a numeric data type
# Here's an example of a variable
first_variable <- "This is my variable"
second_variable <- 12.5
vector example
# Here's an example of a variable
first_variable <- "This is my variable"
second_variable <- 12.5
vec_1 <- c(13, 48.5, 71, 101.5, 2)
(After typing the values, close the parentheses and press Enter.)
calculation example
# Our first calculations
quarter_1_sales <- 35657.98
quarter_2_sales <- 43810.55
midyear_sales <- quarter_1_sales + quarter_2_sales
(Then click Run.)
what are some things to keep in mind when working with data frames?
-> First, columns should be named. -> Second, data frames can include many different types of data, like numeric, logical, or character. -> Finally, elements in the same column should be of the same type.
pro tip for writing code in R
-> R is case sensitive
what should you be aware of when writing code in the console?
-> You can type commands directly into the console, but they'll be forgotten when you close your current session.
what can you do if you want to find out more about a specific function?
-> all you have to do is type a question mark, the function name, and a set of parentheses. -> This returns a page in the Help window, which helps you learn more about the functions you're working with. Keep in mind that functions are case-sensitive, so typing Print with a Capital P brings back an error message.
how do you open the file once again?
-> file -> open file -> select file and open
how do you save your script so that we can use these same variables again?
-> file -> save as -> type a file name -> hit save button
Before you get started working with dates and times, you should load both tidyverse and lubridate. Lubridate is part of tidyverse.
-> First, install tidyverse. -> Next, load the tidyverse and lubridate packages using the library() function. First, load the core tidyverse to make it available in your current R session: library(tidyverse) Then, load the lubridate package: library(lubridate) Now you're ready to be introduced to the tools in the lubridate package.
what are the six primary types of atomic vectors?
-> logical, integer, double, character (which contains strings), complex, and raw. -> The last two, complex and raw, aren't as common in data analysis, so we will focus on the first four.
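A small sketch of the six types, using typeof() to confirm each one. The names `examples` and `types` are made up for this illustration.

```r
# One example value for each of the six atomic vector types.
# `examples` is a hypothetical name chosen for this sketch.
examples <- list(
  logical   = TRUE,
  integer   = 5L,
  double    = 2.5,
  character = "text",
  complex   = 1 + 2i,
  raw       = as.raw(255)
)

# Ask R for the type of each value; the result is a named character vector
types <- vapply(examples, typeof, character(1))
print(types)
```

Each name in the printed result should match the type reported beneath it.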
what are the two main ways of writing code in RStudio?
-> using the console or using the source editor
what is an argument?
-An argument is information that a function in R needs in order to run.
what is a data structure in programming?
-In programming, a data structure is a format for organizing and storing data. Data structures are important to understand because you will work with them frequently when you use R for data analysis. -Think of a data structure like a house that contains your data.
advantages of using R
-Popularity: R is frequently used for data analysis -Tools: R has a convenient library of ready-to-use tools for data cleaning and analysis -Focus: R was created with statistics in mind; data analysts can conveniently use a rich library of statistical routines -Adaptability: R adapts well for use in both machine learning and data analysis projects -Availability: R is an open source programming language
Spreadsheets, SQL, and R: a comparison
-They all use filters: for example, you can easily filter a dataset using any of these tools. In R, you can use the filter function. This performs the same task as a basic SELECT-FROM-WHERE SQL query. In a spreadsheet, you can create a filter using the menu options. -They all use functions: In spreadsheets, you use functions in formulas, and in SQL, you include them in queries. In R, you will use functions in the code that is part of your analysis.
how can you make the panes smaller or larger?
-You can make the panes smaller or larger by clicking on the minimize or maximize buttons at the upper right of each pane. -You can also click and drag the borders of the panes to adjust their size -Click on the Panes button under view for more feature options.
basic concepts of R include
-functions -comments -variables -data types -vectors -pipes
functions (refresher)
-functions are a body of reusable code used to perform specific tasks in R. Functions begin with function names like print or paste, and are usually followed by one or more arguments in parentheses.
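As a quick illustration, here is a sketch of calling print() and paste() with arguments. The variable names `greeting` and `dashed` are made up; `sep` is a named argument of paste() that sets the separator.

```r
# print() takes the value to display as its argument
print("Coding in R")

# paste() accepts several arguments and joins them into one string.
# By default the pieces are separated by a space.
greeting <- paste("Coding", "in", "R")

# The named argument `sep` changes the separator
dashed <- paste("Coding", "in", "R", sep = "-")

print(greeting)
print(dashed)
```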
what are some other data types in R?
-logical -date -date time
what purpose does the console serve?
-the place where you give commands to R
how to install the core tidyverse packages and load them
1. In the bottom of the console, type install.packages("tidyverse") and press return 2. Load the tidyverse library with the library() function. To load the core tidyverse, type library(tidyverse) and press return 3. Load the lubridate package. Type library(lubridate) into the console pane and press return
how to ask a question on kaggle (for help with R)
1. Log in to your Kaggle account by clicking Sign In in the top-right corner. 2. Navigate to the Discussions tab in the left-hand menu. Select the Getting Started subcategory. 3. Then, click the New Topic button to create a new thread on the forum. 4. Write the Topic Title and content of your question, then click Publish Topic.
benefits of using any programming language
1. clarify the steps of your analysis 2. saves time 3. lets you easily reproduce and share your work
situations when you might use R
1. reproducing your analysis 2. processing lots of data 3. creating data visualizations
what is a data frame?
A data frame is a collection of columns, similar to a spreadsheet or SQL table. Each column has a name at the top that represents a variable, and includes one observation per row. Data frames help summarize data and organize it into a format that is easy to read and use.
what are the 3 types of data that refer to an instant in time:
A date ("2016-08-16"). A time within a day ("20:11:59 UTC"). A date-time, which is a date plus a time ("2018-03-31 18:15:48 UTC").
how do you access the 4th pane in RStudio
A fourth pane is hidden by default, but it's easy to open. Just click on File in the menu, then select New File and R Script.
what is a pipe?
A pipe is a tool in R for expressing a sequence of multiple operations. A pipe is represented by a % sign, followed by a > sign, and another % sign. It's used to apply the output of one function into another function. Pipes can make your code easier to read and understand.
what is a variable? (refresher)
A variable is a representation of a value in R that can be stored for use later during programming.
what can you include as your variable name?
A variable name should start with a letter and can also contain numbers and underscores. So the variable 5penguin wouldn't work well because it starts with a number. Also just like functions, variable names are case-sensitive. Using all lower case letters is good practice whenever possible.
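A short sketch of these naming rules. All variable names here are made up for illustration.

```r
# Valid name: starts with a letter, then uses a number and an underscore
penguin_5 <- "valid name"

# Names are case-sensitive, so these are two separate variables
sales <- 100
Sales <- 250

# A name such as 5penguin is rejected by R because it starts with a number,
# so that assignment is left out of this script.

print(sales)
print(Sales)

# TRUE, because the two variables hold different values
different <- sales != Sales
print(different)
```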
what is the history tab in the upper right?
All your previous commands are saved here and they're easy to search and re-execute. You'll find the most recent line of code at the bottom of the list. You can copy any line to the command console by double-clicking it.
why are comments helpful when writing code?
Comments are helpful when you want to describe or explain what's going on in your code. Use them as much as possible so that you and everyone can understand the reasoning behind it. Comments should be used to make an R script more readable. A comment shouldn't be treated as code, so we'll put a # in front of it. Then we'll add our comment.
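For instance, a commented script might look like the sketch below. It reuses the vec_1 vector from the vector example; `vec_length` is a made-up name.

```r
# Store five measurements in a vector so they can be reused later
vec_1 <- c(13, 48.5, 71, 101.5, 2)

vec_length <- length(vec_1)  # a comment can also follow code on the same line

# R ignores everything after the # symbol when the script runs
print(vec_length)
```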
what is computer programming?
Computer programming refers to giving instructions to a computer to perform an action or set of actions.
what function will copy a file?
Copying a file can be done using the file.copy() function. In the parentheses, add the name of the file to be copied. Then, type a comma, and add the name of the destination folder that you want to copy the file to.
file.copy("new_text_file.txt", "destination_folder")
If you check the Files pane in RStudio, a copy of the file appears in the relevant folder.
3. lets you easily reproduce and share your work
Data analysis is most useful when you can reproduce your work and share it with other people. They can double-check it and help you solve problems. Code automatically stores all of the steps of your analysis so you can reproduce and share your work at any time in the future: weeks, months, or even years later.
3. creating data visualizations
Finally, R can create powerful visuals and has state-of-the-art graphic capabilities. As you've seen in this program, tools like spreadsheets and Tableau offer lots of options for visualizing your data. R's on another level. With only a small bit of code, you can create histograms, scatter plots, line plots, and so much more.
1. reproducing your analysis
First, R can save and reproduce every step of your analysis. Earlier, we discussed how data analysis is most useful when you can easily reproduce your work and share it with others. In R, reproducing your analysis is as easy as pressing a button on your keyboard. Your code stores it forever. And you can share it with anyone at any time.
how to use a variable
For example, if you want to filter a dataset, just assign a variable to the function you used to filter the data. That way, all you have to do is use that variable to filter the data later. When naming a variable in R, you can use a short phrase.
when does RStudio truly shine? (example)
For example, imagine you are analyzing sales data for every city across an entire country. That is a lot of data from a lot of different groups; in this case, each city has its own group of data. Here are a few ways RStudio could help in this situation: Using RStudio makes it easy to take a specific analysis step and perform it for each group using basic code. In this example, you could calculate the yearly average sales data for every city. RStudio also allows for flexible data visualization. You can visualize differences across the cities effectively using plotting features like facets, which you'll learn more about later on. You can also use RStudio to automatically create an output of summary stats, or even your visualized plots, for each group.
2. saves time
For example, take the process of cleaning and transforming your data. With one line of code, you can create a separate dataset without any missing values. With another line, you can apply multiple filters on your data. This lets you spend less time preparing your data and more time on the analysis itself.
what are the three ways you are likely to create date-time formats:
From a string. From an individual date. From an existing date/time object. R creates dates in the standard yyyy-mm-dd format by default.
what is a downside of using functions?
Functions are great, but it can be pretty time-consuming to type out lots of values.
what is the environment pane (tab) in the upper right?
Here you'll find all the data you currently have loaded, and you can easily organize and save it. For example, if you import data from a spreadsheet, it'll be visible in the Environment pane. You can view each object in the Environment pane by clicking on it. You can also toggle between a List view and a Grid view.
what follows a variable in quotation marks?
Next, we'll add the value that our variable will represent. We'll use the text, "This is my variable."
what will happen after you run the variable example from above?
If we type the variable and hit Run, it will return the value that the variable represents. This is a very basic way of using a variable.
what can you do to save and reproduce your work?
If you save your script in the editor, you can access your work again at any time and show others how you did it.
how do you determine the specific type of vector you are working with? (ex.2)
In this example, R returns a value of FALSE because the vector does not contain characters; it contains logicals.
y <- c(TRUE, TRUE, FALSE)
is.character(y)
#> [1] FALSE
what is an IDE
Integrated Development Environment -> a software application that brings together all the tools you may want to use in a single place
how do you determine the types of elements a list contains? (ex.2)
Let's use the str() function to discover the structure of our second example. First, let's assign the list to the variable z to make it easier to input in the str() function.
z <- list(list(list(1, 3, 5)))
Let's run the function.
str(z)
#> List of 1
#>  $ :List of 1
#>   ..$ :List of 3
#>   .. ..$ : num 1
#>   .. ..$ : num 3
#>   .. ..$ : num 5
The indentation of the $ symbols reflects the nested structure of this list. Here, there are three levels (so there is a list within a list within a list).
how is a list different from a vector?
Lists are different from atomic vectors because their elements can be of any type—like dates, data frames, vectors, matrices, and more. Lists can even contain other lists.
how do you name a list?
Lists, like vectors, can be named. You can name the elements of a list when you first create it with the list() function:
list('Chicago' = 1, 'New York' = 2, 'Los Angeles' = 3)
$Chicago
[1] 1
$`New York`
[1] 2
$`Los Angeles`
[1] 3
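Once a list has names, the names can be used to pull out individual elements. The sketch below uses the `$` and `[[ ]]` operators, which these notes do not cover; the variable names are made up.

```r
# Create the named list from the card above
cities <- list('Chicago' = 1, 'New York' = 2, 'Los Angeles' = 3)

# names() returns the element names as a character vector
city_names <- names(cities)
print(city_names)

# $ extracts an element by name
chicago_value <- cities$Chicago

# [[ ]] also extracts by name and is handy when the name contains a space
new_york_value <- cities[["New York"]]

print(chicago_value)
print(new_york_value)
```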
what is one way of creating a vector? (refresher)
One way to create a vector is by using the c() function (called the "combine" function). The c() function in R combines multiple values into a vector. In R, this function is just the letter "c" followed by the values you want in your vector inside the parentheses, separated by a comma: c(x, y, z, ...).
what does open source mean?
Open source means that the code is freely available and may be modified and shared by the people who use it.
what are R packages? (cont.)
Packages are units of reproducible R code. Members of the R community create packages to keep track of the R functions that they write and reuse. Packages offer a helpful combination of code, reusable R functions, descriptive documentation, tests for checking your code, and sample data sets.
2. processing lots of data
Processing lots of data is also something R does really well, just like SQL. As you learned earlier, spreadsheets organize projects in sheets or tabs. If you've ever had to deal with spreadsheet files that have tons of sheets or lots of data in each sheet, you know that things can start to move very slowly. Working with too much data in a spreadsheet can even cause crashes. R can handle large amounts of data much more quickly and efficiently.
how is programming different?
Programming goes even further. It gives you the highest level of control over your data. SQL can communicate with databases, but a general-purpose programming language lets you create your own applications and build your own functions from scratch.
what is a programming language?
Programming languages are the words and symbols we use to write instructions for computers to follow. You can think of a programming language as a bridge that connects humans and computers, and allows them to communicate. Programming languages have their own set of rules for how these words and symbols should be used, called syntax.
1. clarity
Programming languages have specific rules and guidelines for giving instructions to the computer. When you're telling a computer what to do, your instructions have to be very clear. There can't be any inconsistency in the way you write code. If there is, the code won't work. Translating your thoughts into code forces you to figure out exactly how to write each step of your analysis and how all the steps fit together.
why RStudio?
R and RStudio are designed to handle large data sets, which spreadsheets might not be able to handle as well. RStudio also makes it easy to reproduce your work on different datasets. When you input your code, it's simple to just load a new dataset and run your scripts again. You can also create more detailed visualizations using RStudio.
R vs. Python
R has been used by professionals who have a statistical or research-oriented approach to solving problems; among them are scientists, statisticians, and engineers. Python has been used by professionals looking for solutions in the data itself, those who must heavily mine data for answers; among them are data scientists, machine learning specialists, and software developers.
what exactly R is
R is a programming language used for statistical analysis, visualization, and other data analysis. As a data analyst, you will use R to complete many of the tasks associated with the data analysis process.
How does the experience of using RStudio differ from other environments like the standard R program?
RStudio works as an integrated development environment, which provides further functionality.
what is a vector?
Simply put, a vector is a group of data elements of the same type stored in a sequence in R. You can make a vector using the combine function. In R, this function is just the letter c followed by the values you want in your vector inside parentheses: c(x, y, z, ...)
what is syntax? (refresher)
Syntax shows you how to arrange the words and symbols you enter so they make sense to a computer. Coding is writing instructions to the computer in the syntax of a specific programming language. Just like the variety of human languages around the world, there are lots of different programming languages available to communicate with computers.
what is the R console?
The R Console is the program window in R where you make use of the R programming language. It is an interface that lets you view, write, edit, and execute your R code. In RStudio, the R Console is often referred to as the console pane (pictured below).
what is R useful for?
The R programming language is super useful for organizing, cleaning, and analyzing data -> "I soon realized that R could help me do almost anything involving data even better and faster than I thought possible. Fortunately, there are tons of great online resources for R and a super supportive online community. If I had a question, I'd go online and find the answer."
what is the tidyverse
The tidyverse is a collection of packages in R with a common design philosophy for data manipulation, exploration, and visualization. For a lot of data analysts, the tidyverse is an essential tool.
creating date-time components
The ymd() function and its variations create dates. To create a date-time from a date, add an underscore and one or more of the letters h, m, and s (hours, minutes, seconds) to the name of the function:
ymd_hms("2021-01-20 20:11:59")
#> [1] "2021-01-20 20:11:59 UTC"
mdy_hm("01/20/2021 08:01")
#> [1] "2021-01-20 08:01:00 UTC"
why do you use <- after the variable name?
This is the assignment operator. It assigns the value to the variable. It looks like an arrow, which makes sense, since it's pointing from the value to the variable. There are other assignment operators that work too, but it's always good to stick with just one type in your code.
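For reference, here is a sketch of the other assignment operators alongside `<-`. The variable names are made up; all three forms store the same value.

```r
# Leftward arrow: the usual choice in R code
a <- 10

# The equals sign also assigns when used on its own line like this
b = 10

# Rightward arrow: the value comes first, the variable name second
10 -> c_val

print(a)
print(b)
print(c_val)
```

Sticking with `<-` throughout a script keeps the code consistent and easier to read.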
how do you create a vector using integers?
To create a vector of integers using the c() function, you must place the letter "L" directly after each number. c(1L, 5L, 15L)
how can you save time when using functions?
To save time, we can use variables to represent the values. This lets us call out the values any time we need to with just the variable.
which vectors are known as numeric vectors?
Together, integer and double vectors are known as numeric vectors because they both contain numbers.
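A quick sketch that checks this with typeof() and is.numeric(). The vector names and values are made up for illustration.

```r
# The L suffix makes each value an integer
int_vec <- c(1L, 5L, 15L)

# Numbers with decimals are stored as doubles
dbl_vec <- c(2.5, 48.5, 101.5)

int_type <- typeof(int_vec)
dbl_type <- typeof(dbl_vec)
print(int_type)
print(dbl_type)

# Both kinds of vector count as numeric
both_numeric <- is.numeric(int_vec) && is.numeric(dbl_vec)
print(both_numeric)

# A whole number typed without the L suffix is still stored as a double
plain_type <- typeof(15)
print(plain_type)
```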
pipe example
ToothGrowth %>% filter(dose == 0.5) %>% arrange(len)
(This pipe filters the data, then sorts it.)
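To see what a pipe does without loading any packages, the sketch below uses the native pipe `|>` that is built into R 4.1 and later. This is a swapped-in operator, not the tidyverse `%>%` from the example above, but for a simple chain like this one the two behave the same way. The variable names are made up.

```r
values <- c(4, 9, 16)

# Nested calls: read from the inside out
nested_result <- sum(sqrt(values))

# Piped calls: read left to right; each output feeds the next function
piped_result <- values |> sqrt() |> sum()

print(nested_result)
print(piped_result)
```

Both versions take the square root of each value and then add the results, so they return the same number.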
(t/f) In RStudio, you can execute code in both the R console pane and the source editor pane.
True
what function will create a new folder?
Use the dir.create() function to create a new folder, or directory, to hold your files. Place the name of the folder in the parentheses of the function.
dir.create("destination_folder")
what function will create a blank file?
Use the file.create() function to create a blank file. Place the name and the type of the file in the parentheses of the function. Your file types will usually be something like .txt, .docx, or .csv.
file.create("new_text_file.txt")
file.create("new_word_file.docx")
file.create("new_csv_file.csv")
If the file is successfully created when you run the function, R will return a value of TRUE (if not, R will return FALSE).
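The file functions in these notes (dir.create(), file.create(), file.copy(), and unlink()) can be combined into one short workflow. The sketch below runs inside a temporary folder so it leaves nothing behind; the helper calls tempdir(), file.path(), file.exists(), and dir.exists() are base R functions not covered in these notes, and the variable names are made up.

```r
# Work inside a temporary directory so nothing is left on disk afterwards
work_dir <- file.path(tempdir(), "file_demo")
dir.create(work_dir)

source_file <- file.path(work_dir, "new_text_file.txt")
dest_dir    <- file.path(work_dir, "destination_folder")

# file.create() returns TRUE if the blank file was created
created <- file.create(source_file)

# Create the destination folder, then copy the file into it
dir.create(dest_dir)
copied <- file.copy(source_file, dest_dir)

# Confirm that the copy now exists in the destination folder
copy_exists <- file.exists(file.path(dest_dir, "new_text_file.txt"))

# Clean up: unlink() needs recursive = TRUE to delete a folder and its contents
unlink(work_dir, recursive = TRUE)
cleaned_up <- !dir.exists(work_dir)

print(c(created, copied, copy_exists, cleaned_up))
```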
What is an advantage of using the RStudio in the cloud?
Using the cloud makes it easier to share your work with others.
what other name is there for variables?
Variables can also be called objects.
what are the most common data structures in R?
Vectors, data frames, matrices, and arrays
when does RStudio truly shine?
When the data is spread across multiple categories or groups, it can be challenging to manage your analysis, visualize trends, and build graphics. And the more groups of data that you need to work with, the harder those tasks become. That's where RStudio comes in.
how do you check the total of the above calculation?
When we run code in a script, the return shows up in the console. This total is now assigned to the midyear_sales variable. We can check this by typing midyear_sales into the console and pressing Enter.
how does the editor and console work together in RStudio?
When you execute code in the editor, the code automatically appears in the console. If you're working on a long analysis, this makes it easy to execute the entire code all at once or run specific sections of it as you go along.
how do you determine the specific type of vector you are working with? (ex.1)
You can also check if a vector is a specific type by using an is function: is.logical(), is.double(), is.integer(), is.character(). In this example, R returns a value of TRUE because the vector contains integers.
x <- c(2L, 5L, 11L)
is.integer(x)
#> [1] TRUE
how do you create a list?
You can create a list with the list() function. Similar to the c() function, the list() function is just list followed by the values you want in your list inside parentheses: list(x, y, z, ...). In this example, we create a list that contains four different kinds of elements: character ("a"), integer (1L), double (1.5), and logical (TRUE).
list("a", 1L, 1.5, TRUE)
Like we already mentioned, lists can contain other lists. If you want, you can even store a list inside a list inside a list, and so on.
list(list(list(1, 3, 5)))
what function will delete R files?
You can delete R files using the unlink() function. Enter the file's name in the parentheses of the function.
unlink("some_.file.csv")
how do you determine the length of the vector you are working with?
You can determine the length of an existing vector, meaning the number of elements it contains, by using the length() function. In this example, we use an assignment operator to assign the vector to the variable x. Then, we apply the length() function to the variable. When we run the function, R tells us the length is 3.
x <- c(33.5, 57.75, 120.05)
length(x)
#> [1] 3
how do you determine what type of vector you are working with?
You can determine what type of vector you are working with by using the typeof() function. Place the code for the vector inside the parentheses of the function. When you run the function, R will tell you the type. For example:
typeof(c("a", "b"))
#> [1] "character"
Notice that the output of the typeof() function in this example is "character".
how do you name the elements of a vector?
You can name the elements of a vector with the names() function. As an example, let's assign the variable x to a new vector with three elements.
x <- c(1, 3, 5)
You can use the names() function to assign a different name to each element of the vector.
names(x) <- c("a", "b", "c")
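After the names are assigned, they appear above the values when the vector is printed, and they can be used to select an element. A small sketch; selecting with square brackets is not covered in these notes, and `b_value` is a made-up name.

```r
x <- c(1, 3, 5)
names(x) <- c("a", "b", "c")

# Printing shows each name above its value
print(x)

# Select a single element by its name
b_value <- x["b"]
print(b_value)
```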
Switching between existing date-time objects
You can use the function as_date() to convert a date-time to a date. For example, put the current date-time, now(), in the parentheses of the function.
as_date(now())
#> [1] "2021-01-20"
what's something to keep in mind when loading packages?
You only need to install a package once, but you need to reload it every time you start a new session.
What should you use to assign a value to a variable in R?
You should use an operator to assign a value to a variable in R. You should use operators such as <- after a variable to assign a value to it.
what is the purpose of the console?
You'll use the console mostly to show the results of your programming.
what is an operator? (refresher)
a symbol that names the type of operation or calculation to be performed in a formula.
what are R packages?
R packages are custom solutions to data problems developed by R users. RStudio gives you access to a library of R packages known as the tidyverse. You can upgrade, install, and manage your library in the Packages pane. Packages loaded in your current session have a check mark.
what are the two types of vectors?
atomic vectors and lists
how do you create a vector containing characters or logicals?
c("Sara", "Lisa", "Anna")
c(TRUE, FALSE, TRUE)
what is the most common way of storing and analyzing data in R?
data frames
how do you determine the types of elements a list contains? (ex.1)
If you want to find out what types of elements a list contains, you can use the str() function. To do so, place the code for the list inside the parentheses of the function. When you run the function, R will display the data structure of the list by describing its elements and their types. Let's apply the str() function to our first example of a list.
str(list("a", 1L, 1.5, TRUE))
We run the function, then R tells us that the list contains four elements, and that the elements consist of four different types: character (chr), integer (int), number (num), and logical (logi).
#> List of 4
#>  $ : chr "a"
#>  $ : int 1
#>  $ : num 1.5
#>  $ : logi TRUE
how do you access all of the shortcuts in R
go to tools, then Keyboard Shortcuts Help
will you need to create data frames in R often?
in most cases, you won't need to manually create a data frame yourself, as you will typically import data from another source, such as a .csv file, a relational database, or a software program.
what happens after you type your variable and press enter again?
it returns your vector underneath your variable (We can use this vector anywhere in our analysis with only its variable name vec_1.)
what package allows you to work with dates/times?
lubridate package
whats the purpose of a pipe?
makes a sequence of code easier to work with and read.
function example
print("Coding in R")
-> This function will return whatever value we include in the parentheses. -> We'll type an open parenthesis followed by a quotation mark. -> Then press Enter.
what function gives the current date-time
the now() function. Note that the time appears to the nearest second.
now()
#> [1] "2021-01-20 16:25:05 UTC"
what function gives you the current date?
To get the current date, you can run the today() function. The date appears as year, month, and day.
today()
#> [1] "2021-01-20"
every vector will have what two key properties?
type and length
what are assignment operators?
used to assign values to variables and vectors.
what are arithmetic operators?
used to complete math calculations
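A short sketch of the common arithmetic operators, reusing the quarterly sales figures from the calculation example. Every variable name other than quarter_1_sales and quarter_2_sales is made up.

```r
quarter_1_sales <- 35657.98
quarter_2_sales <- 43810.55

# Addition and subtraction
total      <- quarter_1_sales + quarter_2_sales
difference <- quarter_2_sales - quarter_1_sales

# Multiplication and division
doubled <- quarter_1_sales * 2
average <- total / 2

# Exponent: 3 raised to the power of 2
squared <- 3 ^ 2

print(total)
print(difference)
print(average)
print(squared)
```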
how do you manually create a data frame in R?
You can use the data.frame() function. The data.frame() function takes vectors as input. In the parentheses, enter the name of the column, followed by an equals sign, and then the vector you want to input for that column. In this example, the x column is a vector with elements 1, 2, 3, and the y column is a vector with elements 1.5, 5.5, 7.5.
data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5))
If you run the function, R displays the data frame in ordered rows and columns.
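A data frame can also mix column types, as long as each column holds a single type. The sketch below builds one with made-up column names and values, then inspects it with str().

```r
# Hypothetical data: each column is built from a vector of one type
df <- data.frame(
  city  = c("Chicago", "New York", "Los Angeles"),
  sales = c(1.5, 5.5, 7.5),
  open  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep text as character in older R versions too
)

# str() lists each column with its type
str(df)

# Collect the class of every column in one named vector
column_types <- vapply(df, class, character(1))
print(column_types)
```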
can you have a vector that contains both logicals and numerics?
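No. An atomic vector is a group of data elements of the same type. If you combine logicals and numerics with c(), R converts the logicals to numbers (TRUE becomes 1 and FALSE becomes 0) so that every element has the same type.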
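If you try to mix types anyway, R does not raise an error. Instead, c() converts every value to one common type. A minimal sketch with made-up variable names:

```r
# Mixing logicals with a number: everything becomes a double
mixed <- c(TRUE, FALSE, 3.5)
mixed_type <- typeof(mixed)
print(mixed)
print(mixed_type)

# Adding a string: everything becomes a character
with_text <- c(1, "a", TRUE)
text_type <- typeof(with_text)
print(with_text)
print(text_type)
```

The result is still a vector of a single type, which is why typeof() reports only one type for each.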