Getting and Cleaning Data Part 2

Explain the difference between quartile versus quantile versus percentile.

0 quartile = 0 quantile = 0 percentile
1 quartile = 0.25 quantile = 25 percentile
2 quartile = 0.5 quantile = 50 percentile (median)
3 quartile = 0.75 quantile = 75 percentile
4 quartile = 1 quantile = 100 percentile
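
As a rough illustration of the correspondence above, the sketch below asks quantile( ) for the five quartile cut points of a small vector. The vector x is made up for illustration and is not course data.

```r
# Hypothetical data: the integers 1 through 9
x <- 1:9

# probs are quantiles on a 0-1 scale; the result is labeled in percentiles
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
print(q)

# The 2nd quartile (0.5 quantile, 50th percentile) is the median
median_matches <- unname(q["50%"]) == median(x)
print(median_matches)
```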

List 3 items of information you can get from viewing a table of a specific variable from a data set using the table( ) command.

1) It can help you quickly identify negative values that should not be in the data set. 2) For a specific variable it can tell you the number of missing values. 3) It can give you the quantity of a specific variable value.

List 3 reasons why you would want to use the str( ) command of R.

1) It tells you the class of the dataset, such as a data frame, matrix, vector, etc. 2) Provides information about the dimensions of the data set. 3) It tells you all the different classes that the different columns correspond to, like factor variable, integer and so forth.

List the 4 different types of databases you can access with the RODBC package of R.

1) PostgreSQL 2) MySQL 3) Microsoft Access 4) SQLite

List the 4 steps in getting data from the web or internet sources.

1) Test to see if directory exists, if it does not, then create a directory. 2) Create an R object with the file path to the URL where the data is located. 3) Download the file and save it to the directory destination you created. 4) Read the downloaded file into an R object so you can work with the file.
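
The four steps above can be sketched end to end as follows. This is only a sketch under stated assumptions: a small CSV written to a temporary file stands in for the remote data, a file:// URL stands in for a real web address (Unix-like paths assumed), and the names src, dataDir and destFile are hypothetical.

```r
# Assumption: a tiny local CSV plays the role of the data on the web
src <- tempfile(fileext = ".csv")
write.csv(data.frame(name = c("A", "B"), zipCode = c(21201, 21231)),
          src, row.names = FALSE)

# Step 1: create the destination directory only if it does not exist yet
dataDir <- file.path(tempdir(), "data")
if (!file.exists(dataDir)) {
  dir.create(dataDir)
}

# Step 2: store the URL of the data in an R object
fileUrl <- paste0("file://", normalizePath(src))

# Step 3: download the file and save it in the directory
destFile <- file.path(dataDir, "restaurants.csv")
download.file(fileUrl, destfile = destFile, quiet = TRUE)

# Step 4: read the downloaded file into an R object
restData <- read.csv(destFile)
print(restData)
```

With a real https:// address, only fileUrl would change; the other three steps stay the same.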

Describe 3 aspects of a tidy data set.

1) Variables in the columns. 2) Observations in the rows. 3) Only the observations that you want to be able to analyze.

List 3 other formats besides the traditional format in terms of text files for data that you can access or send queries to a database from R.

1) images 2) GIS data 3) music data

List the 3 available image formats for reading images for accessing or sending queries to a database from R.

1) jpeg 2) bitmap 3) png

List 4 packages that can be loaded into R that allow you to access image files that load them into R directly and manipulate them.

1) jpeg 2) readbitmap 3) png 4) EBImage (Bioconductor)

List 3 different packages available for R that will allow you to access and play with different kinds of GIS data that are exported by proprietary and not proprietary software.

1) rgdal 2) rgeos 3) raster

List 2 available packages for R that allow you to read musical data directly from mp3 files and do analysis on the musical data.

1) tuneR 2) seewave

Provide an example of R code that allows you to easily check to see if there are negative values for a specific variable of a data set.

all(restData$zipCode > 0)

Provide another example of R code that allows you to easily check for missing values of a specific variable of a data set.

any(is.na(restData$councilDistrict))

Provide the R command for ordering a data frame named X by a variable named var1, using the plyr package. This should be in decreasing order of var1.

arrange(X, desc(var1)) ## here we use the desc( ) function to tell it to put it in descending order.

Provide an example of R code that downloads the file and saves it to the directory destination you created.

download.file(fileUrl, destfile="./data/restaurants.csv", method="curl")

Describe a key process of data cleaning when using R.

A key process of data cleaning is looking at the data set that you have loaded into R, and identifying any sort of quirks or weird issues or missing values, or other problems that you need to address before you can do downstream analysis.

Describe how to ensure the number of missing or NA values is included when using the table( ) command to view a table of a specific variable from a data set.

An argument must be added to the table( ) command of useNA="ifany". This will ensure that if there are any missing values, there will be an added column to the table, which will be NA, and it will tell you the number of missing values there are.
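
A small sketch of the difference, using a made-up vector (district is hypothetical, not the course data set):

```r
# Hypothetical variable with two missing values
district <- c(1, 2, 2, NA, 3, NA)

# Default: NA values are silently left out of the counts
default_tab <- table(district)
print(default_tab)

# useNA = "ifany" adds an NA column when missing values are present
with_na_tab <- table(district, useNA = "ifany")
print(with_na_tab)

# The NA column holds the number of missing values
n_missing <- as.integer(with_na_tab[is.na(names(with_na_tab))])
print(n_missing)
```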

Describe the meaning of the following code:
set.seed(13435)
X <- data.frame("var1"=sample(1:5), "var2"=sample(6:10), "var3"=sample(11:15))
X <- X[sample(1:5), ]
X$var2[c(1,3)] = NA
X

Begin with the set.seed( ) function so the random number generation is reproducible. Next, a data frame is created with 3 variables labeled var1, var2 and var3. The rows are then scrambled so that they are not in a specific order with the X <- X[sample(1:5), ] step. Two of the values are made to be missing or NA values, located in the column named var2 and in rows 1 and 3 per the subsetting vector. The last step auto prints the values of the data frame.

What is the default setting for the head( ) command if you do not specify the first 3 rows of a data frame?

By default, it will show the top six rows of any data frame. So if you do not give it the n parameter, it will just return the top six rows of the data frame.

What is the 2nd of 2 additional resources for learning more about subsetting, reordering and sorting in R?

Consult Andrew Jaffe's lecture notes: http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf

What is the 1st of 2 additional resources for learning more about subsetting, reordering and sorting in R?

Consult R Programming course in Data Science Track.

The ___________________________________________________ is very nice for analyzing and manipulating image data.

EBImage package in Bioconductor

True or False. The default setting on the table( ) command includes the number of NA or missing values for a specific variable.

False. By default, the table function in R does not tell you the number of missing values.

True or False. You cannot subset a data frame using logical statements.

False. It is possible to subset a data frame using logical statements. You can pass logical conditions when subsetting rows and/or columns and end up with just the rows and/or columns where the conditions are met.

True or False. You can only access data or send queries to a database from R that is in the traditional format in terms of text files and so forth.

False. You can also read other things that are not necessarily data in the traditional format in terms of text files and so forth.

What is "GIS" an acronym for when referencing GIS data?

GIS is an acronym for geographic information systems.

Once you have loaded data into R, then you will want to ____________ that data and set it up to be a ___________ data set.

manipulate, tidy

When subsetting a data frame using logical statements, the logical operator of OR is in the form of __________.

|

Regarding searching help files in R or RStudio, explain the difference between the following where "whatever" is the term or function you are researching: 1) ?whatever 2) ??whatever 3) ??"whatever second term"

?whatever opens the help page for a function called whatever. ??whatever searches the text of all the help files for the word whatever. ??"whatever second term" searches all the help pages for an exact phrase, because of the quote marks. This is useful if you have a distinctive phrase that describes the data you want to load.

How can you view a data frame with a limited number of records or rows?

Use the head( ) command for viewing beginning rows or the tail( ) command for viewing ending rows of the data frame.
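
A minimal sketch with a made-up ten-row data frame (df is hypothetical):

```r
# Hypothetical data frame with ten rows
df <- data.frame(id = 1:10, value = (1:10)^2)

first_default <- head(df)        # no n given: the first six rows
first_three <- head(df, n = 3)   # first three rows
last_three <- tail(df, n = 3)    # last three rows

print(first_three)
print(last_three)
```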

Describe the 2-dimensional table you can get from the table( ) command in R.

Using the table( ) command, you can pass in two specific variables. It breaks down a table that is 2-dimensional by variable 1 and by variable 2. You can do this for different qualitative variables and get a sense of the relationship between those variables.
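
A short sketch of a 2-dimensional table, assuming a small made-up data frame (rest and its values are hypothetical):

```r
# Hypothetical data: five restaurants with a district and a zip code
rest <- data.frame(
  councilDistrict = c(1, 1, 2, 2, 2),
  zipCode = c("21201", "21202", "21201", "21201", "21202")
)

# Rows are values of the first variable, columns are values of the second
tab2d <- table(rest$councilDistrict, rest$zipCode)
print(tab2d)
```

Each cell counts the rows that have that combination of district and zip code.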

When subsetting a data frame using logical statements, the logical operator of AND is in the form of __________.

&

Explain how to interpret the following R code for determining if you have missing values or NA values of a specific variable of a data set: sum(is.na(restData$councilDistrict))

Here the is.na( ) function is used. The is.na( ) function returns TRUE if a value is missing or NA and FALSE if it is not missing. The is.na( ) function is nested in the sum( ) function, which counts each TRUE as 1 and each FALSE as 0, so the result is the number of missing values. If the sum is equal to zero, that means there are no missing values.

Explain how to interpret the following R code for determining if you have missing values or NA values of a specific variable of a data set: any(is.na(restData$councilDistrict))

Here the is.na( ) function is used. The is.na( ) function will return a TRUE if it is missing or NA value and FALSE if it is not missing. The is.na( ) function is nested in the any( ) function. The commands will look through the entire set of values for the specific variable. If none of the values are NA or missing, the any( ) function returns FALSE, because all of the values are FALSE. If some of the values are NA or missing, the any( ) function returns TRUE because one or more of the is.na( ) function values is TRUE.
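
Both checks can be tried on a small made-up vector (district is hypothetical):

```r
# Hypothetical variable with two missing values
district <- c(4, NA, 7, NA, 1)

missing_flags <- is.na(district)    # FALSE TRUE FALSE TRUE FALSE
n_missing <- sum(missing_flags)     # TRUE counts as 1, FALSE as 0
has_missing <- any(missing_flags)   # TRUE if at least one value is NA

print(n_missing)
print(has_missing)
```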

Explain what is happening in the following R code: X$var4 <- rnorm(5); X

Here, a new variable var4 is assigned a vector of 5 random normal values, which matches the number of rows of the data frame X. The var4 variable was not in the data frame X before. Since it is assigned as a variable of data frame X, this effectively adds the new variable var4 to the X data frame. The updated data frame with the new column is auto printed.

Explain the subsetting that is occurring in the following R code: X[1:2, "var2"]

Here, the subsetting will output the first two rows of data frame X and the second column (or column named var2) of X. Here you are subsetting on both rows and columns at the same time.

Why is the order important for the location of X and rnorm(5) in the following R code: Y <- cbind(X, rnorm(5)); Y

If you put these in the other order, with the rnorm(5) vector first and the data frame X second, the rnorm(5) column would be bound to the left side of data frame X, which is intuitive.
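
A sketch of both argument orders, using fixed made-up values instead of rnorm( ) so the result is easy to check (X and extra are hypothetical):

```r
# Hypothetical data frame and new column
X <- data.frame(var1 = c(2, 1, 3), var2 = c(6, 7, 8))
extra <- c(0.5, 1.5, 2.5)

right <- cbind(X, extra = extra)   # new column lands on the right
left <- cbind(extra = extra, X)    # new column lands on the left

print(right)
print(left)
```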

Describe how you can add a row to a data frame using the rbind( ) command. How is the form of the command similar to cbind( )?

If you want to bind the rows, you can use the rbind command. It is the same command, only with an r at the beginning. If you put the vector for the row or rnorm( ) for this example at the end it will bind it to the bottom of the data frame X, and if you put it before, it will bind it to the top of the data frame X.
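
A sketch of binding a row to the bottom and to the top, using a fixed made-up row instead of rnorm( ) (X and new_row are hypothetical):

```r
# Hypothetical data frame and new row (one value per column)
X <- data.frame(var1 = c(2, 1, 3), var2 = c(6, 7, 8))
new_row <- c(0, 0)

bottom <- rbind(X, new_row)   # new row is added at the bottom
top <- rbind(new_row, X)      # new row is added at the top

print(bottom)
print(top)
```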

Where can you go to find more information on how to use the Foreign package of R to read files and access data of other programming languages and/or statistical programming languages?

If you want to see a lot more about this, you can go read about the Foreign package in the help file which is pretty self explanatory, and you will find you will be able to read most files from other programming languages. http://cran.r-project.org/web/packages/foreign/foreign.pdf

What is the next step you should follow after reading in the data set or data frame into R in the process of data cleaning?

Look at your data set.

What is usually the next step after loading data into R? Why?

Once you have loaded data into R, then you will want to manipulate that data and set it up to be a tidy data set.

Explain what is happening in the following R code: Y <- cbind(X, rnorm(5)); Y

Take the data set or the data frame X, and it will column-bind the new vector to the data frame. This adds a column on the right-hand side of the X data frame.

Describe the 2nd of two available R packages for accessing data from a MongoDB database.

The R package of RMongoDB can also be used to interface with MongoDB and extract data from that database.

What is the 1st thing you want to do in the process of data cleaning when using R and before doing downstream analysis?

The first thing that you want to do is summarize the data that you have loaded in.

What is the main idea behind grouping data? Which function in the dplyr package of R is responsible for doing this?

The main idea behind grouping data is that you want to break up your dataset into groups of rows based on the values of one or more variables. The group_by( ) function is responsible for doing this.

Describe the general process for using the various available R packages for accessing data or sending queries to the compatible database for the respective R package.

The process for sending queries is very similar to what you did with the RMySQL package. You will have to send queries to the database using the database's own syntax. If you are going to use the R package, you will have to learn a little of the syntax for the database to be able to extract data.

Give an example of where the summary( ) command can prove its value in the process of cleaning data.

The summary( ) command can help identify variables that have negative values when they obviously should not have negative values like a zip code.
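
A small sketch of this check, assuming a made-up zip code vector with one bad entry (zipCode and its values are hypothetical):

```r
# Hypothetical zip codes; one was entered with a negative sign
zipCode <- c(21201, 21231, -21226, 21202)

s <- summary(zipCode)
print(s)

# A negative minimum flags a value that should not be there
has_negative <- s[["Min."]] < 0
print(has_negative)
```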

Describe the 1st of two available R packages for accessing data from a MongoDB database.

There is a very nice R package called RMongo which can be used to interface with MongoDB and extract data from that database.

Describe how you can access data from a PostgreSQL database using R.

There is an RPostgreSQL package available that provides a DBI-compliant database connection from R.

Explain what is happening in the following R code: if(!file.exists("./data")){dir.create("./data")}

This code first tests to see if "./data" exists and if it does not, then it will create the directory of "./data".

Describe what the table( ) command does in R.

This command can be used to look at data by making a table of counts from a specific variable. You can also make 2-dimensional tables from two variables.

Why is it useful to use the str( ) function to find the class of each column, such as factor, integer, etc.?

This information is useful to know if you need to manipulate things and change quantitative variables to qualitative variables and vice versa.

True or False. Whatever your application is, you can just Google search the R package that will allow you to access the data. There is always an R package for that.

True

How do you deal with NA or missing values in a data frame and logical statements to subset? Why is this needed?

When a column contains NA or missing values, subsetting directly with a logical statement on that column will not produce just the actual rows, because the comparison is NA for the missing entries and those come back as rows of NA. The which( ) command returns only the indices where the logical statement is TRUE, so it does not return the NA entries. Wrapping the logical statement in which( ) ensures you can subset a data set even when dealing with NA or missing values in the data frame.
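
A sketch of the difference, using a fixed made-up data frame with two NA values in var2 (the values are hypothetical):

```r
# Hypothetical data frame; var2 has NA in rows 1 and 3
X <- data.frame(var1 = c(2, 1, 3, 5, 4),
                var2 = c(NA, 10, NA, 6, 9),
                var3 = c(15, 11, 12, 14, 13))

# Direct logical subsetting: the NA comparisons come back as rows of NA
with_na_rows <- X[X$var2 > 8, ]
print(with_na_rows)

# which( ) keeps only the indices where the condition is TRUE
clean_rows <- X[which(X$var2 > 8), ]
print(clean_rows)
```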

Explain the difference between the AND versus the OR logical operators.

With the AND logical operator, both logical conditions must be met. With the OR logical operator, one or the other of 2 logical conditions must be met.

Provide an example of R code that adds a column to a data frame using subsetting.

X$var4 <- rnorm(5); X

Suppose you have a data frame named X where the first column name is var1. Provide the command in R to subset that first column.

X[ , "var1"]

Suppose you have a data frame named X. Provide the command in R to subset the first column.

X[ , 1]

Provide an example of subsetting using logical statements and the logical operator of AND.

X[(X$var1 <= 3 & X$var3 > 11), ] Note, here X is the name of the data frame and var1 and var3 are the names of columns in the same data frame. The AND logical operator is represented by &.

Provide an example of subsetting using logical statements and the logical operator of OR.

X[(X$var1 <= 3 | X$var3 > 15), ] Note, here X is the name of the data frame and var1 and var3 are the names of columns in the same data frame. The OR logical operator is represented by |.
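
Both operators can be tried on a fixed made-up data frame (the values are hypothetical, chosen so the result is easy to check by eye):

```r
# Hypothetical data frame
X <- data.frame(var1 = c(2, 1, 3, 5, 4),
                var3 = c(15, 11, 12, 14, 13))

# AND: both conditions must hold for a row to be kept
and_rows <- X[(X$var1 <= 3 & X$var3 > 11), ]
print(and_rows)

# OR: at least one of the conditions must hold
or_rows <- X[(X$var1 <= 3 | X$var3 > 15), ]
print(or_rows)
```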

Provide an example of a command in R or code that subsets by both rows and columns at the same time of a data frame.

X[1:2, "var2"]

Provide the R command that sorts by increasing order of a particular variable named var1 in a data frame named X. This will reorder the rows of data frame X so the variable var1 is in increasing order.

X[order(X$var1), ]

Provide the R command that sorts a data frame named X by increasing order of multiple variables: first by a variable named var1, and then, where several rows share the same value of var1, by a second variable named var3 in increasing order within those ties.

X[order(X$var1, X$var3), ]
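
A sketch with a made-up data frame that has ties in var1, so the effect of the second variable is visible (the values are hypothetical):

```r
# Hypothetical data frame with repeated values in var1
X <- data.frame(var1 = c(2, 1, 2, 1),
                var3 = c(4, 3, 1, 2))

# order( ) returns row positions: sorted by var1, ties broken by var3
row_order <- order(X$var1, X$var3)
sorted <- X[row_order, ]
print(sorted)
```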

Provide an example of R code that adds a column to a data frame using the cbind command.

Y <- cbind(X, rnorm(5)); Y

Describe an alternative approach to the sort( ) command or subsetting using the order( ) command for ordering a data frame by a variable in R.

You can actually do the same thing with the plyr package, which is a package written by Hadley Wickham. Once you have loaded the plyr library, you can use the arrange command.

What else can be done with the table( ) command in R?

You can also use it to make 2-dimensional tables using 2 specific variables from the data set.

What can you do with the quantile( ) command in R?

You can use quantile to look at variability of quantitative variables. You can find the smallest value, the largest value and the median value. You can also tell it to look at different probabilities. If you pass the argument of probs to quantile, it will look at different percentiles of the distribution.

Describe what the summary( ) command does in R.

You can use the summary command to get an overall summary of a data set that has been loaded into R. For every single variable, it will give you some information. For variables that are text-based variables or factor variables, it will tell you the count of each factor. For variables that are quantitative, it will tell you the minimum, the first quartile, the median, and so forth.

Why is it not a good idea to view the entire data set or data frame after reading in the data into R?

You could just type the name of the data frame in and hit return, and it will return the entire data frame. But often the data frame will be a little bit too big for you to be able to see it all very neatly.

Provide an example of R code that creates an R object with the file path to the URL where the data is located.

fileUrl <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"

Provide the command in R that will show the first 3 rows of a data frame.

head(<data frame name>, n=3)

Provide the web address for a tutorial on how to access data from a MongoDB database using the RMongo package of R.

http://cran.r-project.org/web/packages/RMongo/Rmongo.pdf

Provide the web address for a help file that will walk you through most of the functions on how to access data from a compatible database using the RODBC package of R.

http://cran.r-project.org/web/packages/RODBC/RODBC.pdf

Provide the web address for a tutorial on how to access data from a compatible database using the RODBC package of R.

http://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf

Provide the web address for a help file on how to access data from a PostgreSQL database using the RPostgreSQL package of R.

http://cran.r-project.org/web/packages/RPostgreSQL/RPostgreSQL.pdf

Provide the web address for learning more about the jpeg package for R.

http://cran.r-project.org/web/packages/jpeg/index.html

Provide the web address for learning more about the png package for R.

http://cran.r-project.org/web/packages/png/index.html

Provide the web address for learning more about the raster package for R that allows you to access GIS data.

http://cran.r-project.org/web/packages/raster/index.html

Provide the web address for learning more about the readbitmap package for R.

http://cran.r-project.org/web/packages/readbitmap/index.html

Provide the web address for learning more about the rgdal package for R that allows you to access GIS data.

http://cran.r-project.org/web/packages/rgdal/index.html

Provide the web address for learning more about the rgeos package for R that allows you to access GIS data.

http://cran.r-project.org/web/packages/rgeos/index.html

Provide the web address for learning more about the tuneR package for R that allows you to access musical data.

http://cran.r-project.org/web/packages/tuneR/

Provide the web address for learning more about the seewave package for R that allows you to access musical data.

http://rug.mnhn.fr/seewave/

Provide the web address for learning more about the EBImage package from Bioconductor for R.

http://www.bioconductor.org/packages/2.13/bioc/html/EBImage.html

Provide the web address for an example of how to access data from a MongoDB database using the RMongoDB package of R.

http://www.r-bloggers.com/r-and-mongodb/

Provide the web address for a tutorial on how to access data from a PostgreSQL database using the RPostgreSQL package of R.

https://code.google.com/p/rpostgresql/

Provide the R command for ordering a data frame named X by a variable named var1, using the plyr package. This should be in increasing order of var1.

library(plyr) ## loads the plyr package
arrange(X, var1)

Provide an example of R code with output that provides quantiles or quartiles.

quantile(restData$councilDistrict, na.rm=TRUE)
Output:
  0%  25%  50%  75% 100%
   1    2    9   11   14

Provide an example of R code with output that provides percentiles.

quantile(restData$councilDistrict, probs=c(0.5, 0.75, 0.9))
Output:
50% 75% 90%
  9  11  12

Provide the command of the Foreign package of R to read files and access data of the SPSS statistical programming language.

read.spss( )

Provide the command of the Foreign package of R to read files and access data of the SAS statistical programming language.

read.xport( )

Provide an example of R code that reads the downloaded file into an R object so you can work with the file.

restData <- read.csv("./data/restaurants.csv")

Provide the R command that sorts the values in a column named var1 of a data frame named X in the order of increasing values.

sort(X$var1)

Provide the R command that sorts the values in a column named var1 of a data frame named X in the order of decreasing values.

sort(X$var1, decreasing=TRUE)

Provide the R command that sorts the values in a column named var2 of a data frame named X in the order of increasing values and places the NA or missing values at the end.

sort(X$var2, na.last=TRUE)
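
The three sort( ) variants can be compared on a small made-up vector (x is hypothetical):

```r
# Hypothetical values with one NA
x <- c(3, NA, 1, 2)

increasing <- sort(x)                      # NA is dropped by default
decreasing <- sort(x, decreasing = TRUE)   # largest value first
na_at_end <- sort(x, na.last = TRUE)       # NA is kept and placed last

print(increasing)
print(decreasing)
print(na_at_end)
```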

Provide the R code for the str( ) command.

str(<data frame name>)

Provide an example of R code that allows you to easily check for missing values of a specific variable of a data set.

sum(is.na(restData$councilDistrict))

Provide the R code for the summary( ) command.

summary(<data frame name>)

Provide an example of R code that uses the table( ) command to produce a table for a specific variable of a data frame.

table(restData$zipCode, useNA="ifany")

Provide the command in R that will show the last 3 rows of a data frame.

tail(<data frame name>, n=3)

