R Programming Study Set
What is the "not" "!" operator, and what does it do?
The ! operator negates a logical value: !TRUE is FALSE and !FALSE is TRUE.
What is the DRY principle and why is it relevant?
DRY stands for "Don't Repeat Yourself." When you catch yourself repeating code, that is the point at which you should learn to write functions. Bite the bullet.
R IDE: How do you save a function/re-use code snippets?
"Extract Function" from the Code menu
When assigning values to a new vector, why do you need the "c"?
"c" stands for combine or concatenate, and it strings together the values in a new vector.
dbl_var <- c(1, 2.5, 4.5) typeof(dbl_var) is.double(dbl_var) is.atomic(dbl_var)
"double" TRUE TRUE
int_var <- c(1L, 6L, 10L) typeof(int_var)
"integer"
Which dplyr function performs merges using any same-named variables between data sets?
"join". Common flavors: inner_join(x, y), left_join(x, y), right_join(x, y), and full_join(x, y).
planets_df[rings_vector, "name"] What is "name"? How do I print out ALL the columns?
"name" selects the column called name, so only that column is printed for the selected rows. To print all the columns, leave the column index blank: planets_df[rings_vector, ]
# Create speed_vector speed_vector <- c("medium", "slow", "slow", "medium", "fast") Now create factor_speed_vector with values in order (ordinal)
# Convert speed_vector to ordered factor vector factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "medium", "fast"))
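A minimal sketch of what the ordering buys you, using the same speed_vector as the card. The variable names second_is_slower and speed_counts are made up for illustration.

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector, ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

# Ordered factors support comparison operators: "slow" < "medium"
second_is_slower <- factor_speed_vector[2] < factor_speed_vector[1]
second_is_slower  # TRUE

# Summaries follow the level order, not alphabetical order
speed_counts <- table(factor_speed_vector)
names(speed_counts)  # "slow" "medium" "fast"
```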
If star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE) and region <- c("US", "non-US") titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi") name the columns and rows appropriately
# Name the rows with titles and the columns with region rownames(star_wars_matrix) <- titles colnames(star_wars_matrix) <- region
Correlation
# Save the differences of each vector element with the mean in a new variable diff_A <- A - mean(A) diff_B <- B - mean(B) # Do the summation of the elements of the vectors and divide by N-1 in order to acquire the covariance between the two vectors cov <- sum(diff_A*diff_B)/(3-1)
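The card computes a covariance by hand; a correlation is that covariance divided by the product of the two standard deviations. The sketch below uses made-up vectors A and B (the card does not give their values) and checks the hand calculation against the built-in cov() and cor().

```r
# Hypothetical data; the original card does not show A and B
A <- c(1, 2, 3)
B <- c(2, 4, 7)
n <- length(A)

# Deviations from the mean
diff_A <- A - mean(A)
diff_B <- B - mean(B)

# Sample covariance: sum of cross-products divided by n - 1
cov_AB <- sum(diff_A * diff_B) / (n - 1)

# Correlation: covariance scaled by both standard deviations
cor_AB <- cov_AB / (sd(A) * sd(B))

all.equal(cov_AB, cov(A, B))  # TRUE
all.equal(cor_AB, cor(A, B))  # TRUE
```

Using length(A) instead of a hard-coded 3 keeps the calculation valid for vectors of any length.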
COMPARING MATRICES # The social data has been created for you > linkedin <- c(16, 9, 13, 5, 2, 17, 14) > facebook <- c(17, 7, 5, 16, 8, 13, 14) > views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE) >
# When does views equal 13? > views == 13 [,1] [,2] [,3] [,4] [,5] [,6] [,7] [1,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE [2,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
Data set = d x = read, y = write Plot me a scatterplot with a loess smooth layer that provides a "best-fit" curve to the data.
# both scatter plot and loess smooth layers ggplot(d, aes(x=read, y=write)) + geom_point() + geom_smooth() ## `geom_smooth()` using method = 'loess'
Substitution with sub() We can also substitute based on pattern matches with base R function sub(). Now the syntax is sub(pattern, replacement, x), where pattern is the pattern to be matched, replacement is the replacement text, and x is the string vector.
# change STAT to STATISTICAL centers$name <- sub("STAT", "STATISTICAL", centers$name) centers$name ## [1] "IDRE STATISTICAL CONSULT" "OIT CONSULTING" ## [3] "ATS STATISTICAL HELP" "OAC STATISTICAL CENTER"
Name the parts of this for loop: for (i in 1:5) { print("Hello world!") }
# keyword for, then loop control variable, i, and vector 1:5 inside () for (i in 1:5) { print("Hello world!") }
Write a for loop to print "Hello world!" 5 times.
# keyword for, then loop control variable, i, and vector 1:5 inside () for (i in 1:5) { print("Hello world!") }
Provide sample syntax, function.
# not run, just a template # my_fun is the name # 2 arguments, arg1 has no default, arg2 has default value zero my_fun <- function(arg1, arg2=0) { do something to arg1 and arg2 output } The value returned by the function is the last line in the body (above, the object output would be returned by the function). You can also explicitly use return() to return the object you want.
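The template above is not runnable as written, so here is one possible concrete version. The function name add_scaled and what it computes are invented for illustration only.

```r
# arg1 has no default; arg2 defaults to zero
add_scaled <- function(arg1, arg2 = 0) {
  output <- arg1 + 2 * arg2
  output  # the last evaluated expression is the return value
}

result_default <- add_scaled(5)            # arg2 falls back to 0
result_named   <- add_scaled(5, arg2 = 3)  # 5 + 2 * 3
result_default  # 5
result_named    # 11
```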
How can I add a separator into the "paste" function?
# separate with comma and space paste(centers$CITY, centers$STATE, sep=", ") ## [1] "LOS ANGELES, CA" "DALLAS, TX" "MIAMI, FL"
Input arguments to functions can be specified by either name or position. Example:
# specifying arguments by name seq(from=1, to=5, by=1) ## [1] 1 2 3 4 5 # specifying arguments by position seq(10, 0, -2) ## [1] 10 8 6 4 2 0
dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B")) b
## , , A ## ## a b c ## one 1 3 5 ## two 2 4 6 ## ## , , B ## ## a b c ## one 7 9 11 ## two 8 10 12
What is produced? dim(c) <- c(2, 3) c
## [,1] [,2] [,3] ## [1,] 1 3 5 ## [2,] 2 4 6
length(a) ## [1] 6 nrow(a) ## [1] 2 ncol(a) is.....???
## [1] 3
rownames(a) <- c("A", "B") colnames(a) <- c("a", "b", "c") a length(b)
## a b c ## A 1 3 5 ## B 2 4 6 ## [1] 12
The pipe operator
%>%
Pipe Operator: %>%
%>% chains functions together: takes the output of the function on the left and provides it as the first parameter to the function on the right
Why use the pipe operator (%>%)?
%>% chains functions together: takes the output of the function on the left and provides it as the first parameter to the function on the right. It's easier to understand than nested functions.
What are logical operators?
& and | & works as "AND" - returns TRUE only if BOTH logical values are TRUE | works as "OR" - returns TRUE if AT LEAST ONE logical value is TRUE
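A small truth-table sketch of the two operators, which work element by element on logical vectors. The vectors x and y are illustrative.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

and_result <- x & y  # TRUE only where both are TRUE
or_result  <- x | y  # TRUE where at least one is TRUE

and_result  # TRUE FALSE FALSE FALSE
or_result   # TRUE TRUE TRUE FALSE
```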
Using data.frame( ) Ex. mydata <- data.frame(diabetic = c(TRUE, FALSE, TRUE, FALSE), height = c(65, 69, 71, 73)) (1) what does mydata[3,2] return? (2) mydata[1:3, "height"]? (3) mydata[, "diabetic"]?
(1) 71 (2) 65 69 71 (3) TRUE FALSE TRUE FALSE
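The three lookups can be checked directly. This sketch rebuilds the data frame from the card; the names cell, first_three and diabetic_col are only there to hold the results.

```r
mydata <- data.frame(diabetic = c(TRUE, FALSE, TRUE, FALSE),
                     height = c(65, 69, 71, 73))

cell         <- mydata[3, 2]           # row 3, column 2
first_three  <- mydata[1:3, "height"]  # rows 1 to 3 of the height column
diabetic_col <- mydata[, "diabetic"]   # every row of the diabetic column

cell          # 71
first_three   # 65 69 71
diabetic_col  # TRUE FALSE TRUE FALSE
```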
What are the two types of vectors, the basic data structure in R?
(1) Atomic vectors (2) Lists
Data set = d x = math color = female Plot two densities: math by female. Then do using boxplots.
(1) ggplot(d, aes(x=math, color=female)) + geom_density( ) (2) ggplot(d, aes(x=female, y=math)) + geom_boxplot()
How do you get help for a function?
?function_name or help(function_name)
Again, what are the components of a plot?
- Data and aesthetic mappings (variables to aesthetics) - Geometric objects - Scales, and - Statistical transformations - The coordinate system - Facet specification
What will the object "a" contain here? a <- c(0,1)
0 1
What will the object "a" contain? a <- c(rep(0,2), seq(1,5, by=2))
0 0 1 3 5
Describe the iterative Data Analysis Process.
1. Ask questions 2. Wrangle data 3. Explore data 4. Draw conclusions 5. Communicate findings
What are the components of "if/then" statements
1. The IF statement takes a condition, 2. If the condition evaluates to TRUE, 3. The R code associated with the IF statement is executed. if(condition) { expr }
Describe the process: for (i in 1:5) { print("Hello world!") }
1. The loop control variable, i, takes on the first value in the vector 1:5, which is 1. 2. The code block executes. 3. i takes on the next value in the vector. 4. Steps 2 and 3 repeat until every value in the vector has been used; the loop then ends, and i keeps the last value, 5.
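A for loop in R walks through the elements of whatever vector it is given, not just a counter. A minimal sketch with made-up values:

```r
values <- c(10, 20, 30)
total <- 0

# v takes each element of values in turn
for (v in values) {
  total <- total + v
}

total  # 60
v      # 30: the loop variable keeps the last value it was given
```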
What are the three common properties of vectors (either atomic vectors or lists)?
1. Type, typeof( ), what it is 2. Length, length( ), how many elements it contains 3. Attributes, attributes(), additional arbitrary metadata.
What will the object "a" contain? a <- c(10, seq(5, 1, -1))
10 5 4 3 2 1
Give an example of a combination of logical operators.
<= >=
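What does a "double ampersand" expression evaluate?
A "double ampersand" (&&) expression works on single logical values and stops as soon as the result is known (short-circuits). Older versions of R used only the FIRST element of longer vectors; since R 4.3.0, passing a vector longer than one is an error.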
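A short sketch of the short-circuit behavior with length-one conditions, which behaves the same across R versions. The variable x is illustrative.

```r
x <- NULL

# !is.null(x) is FALSE, so the right-hand side is never evaluated
is_positive <- !is.null(x) && x > 0
is_positive  # FALSE
```

This is why && is the usual choice inside if() conditions, while & is used for element-wise work on vectors.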
How is a LAYER formed in ggplot2?
A LAYER is formed by the: - Data - Mappings - Statistical transformation - Geometric object Layers are responsible for creating the objects that we perceive on the plot
what is a call?
A call represents the action of calling a function. Like lists, calls are recursive; they can contain constants, names, pairlists and other calls. ast( ) prints ( ) and then lists the children. The first child is the function that is called, and the remaining children are the function's arguments: ast(f()) ## \- () ## \- `f
What is a categorical variable?
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories.
What is the difference between a categorical variable and a continuous variable ?
A categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.
What is a categorical variable?
A categorical variable can belong to limited number of categories. (Contrast with continuous variable.)
What is a data frame?
A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing. The length() of a data frame is the length of the underlying list and so is the same as ncol(); nrow() gives the number of rows.
What does a density plot do? How do you specify one (assume variable x = write, data set "d")
A density plot smooths out the shape of a histogram, looks like a curve, and does not fill in the space below the curve. ggplot(d, aes(x = write)) + geom_density( ) Obviously, density plot is specified by the choice of geom.
What is a factor?
A factor is a statistical data type used to store categorical variables.
What is a factor?
A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class, "factor", which makes them behave differently from regular integer vectors, and the levels, which defines the set of allowed values. x <- factor(c("a", "b", "b", "a")) x ## [1] a b b a ## Levels: a b class(x) ## [1] "factor" levels(x) ## [1] "a" "b"
What does a generic function do?
A generic function matches object classes to the appropriate function. Generic functions accept objects from multiple classes, then pass the object to a specific function (called a method) designed for the object's class.
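A minimal S3 sketch of a generic and its methods. The generic describe_shape and the class "circle" are hypothetical names invented for this example.

```r
# The generic only dispatches; UseMethod() picks a method by class
describe_shape <- function(x, ...) UseMethod("describe_shape")

# Method for objects of class "circle"
describe_shape.circle <- function(x, ...) paste("circle with radius", x$r)

# Fallback for any other class
describe_shape.default <- function(x, ...) "unknown shape"

circ <- structure(list(r = 2), class = "circle")
circle_text  <- describe_shape(circ)
default_text <- describe_shape(1:3)
circle_text   # "circle with radius 2"
default_text  # "unknown shape"
```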
Define "geom" in the "grammar of graphics" (for ggplot2).
A geom is an object or shape on a graph.
What geom bins continuous variables?
A histogram, plotted with bars whose height is proportional to the number of points in each bin. Another example is mapping the size (or color, or texture) of the lines in a contour plot to the height of the contour.
What is faceting?
Faceting is a more general case of the techniques known as conditioning, trellising, and latticing. It produces small multiples, each showing a different subset of the data.
What is a name or symbol?
A name or symbol represents the name of an object RATHER THAN ITS VALUE. ast() prefixes names with a backtick: ast(x) ## \- `x
A variable is.....? In R, what are its attributes?
A named storage container for data - Starts with a letter - Is case sensitive - Avoid special characters and whitespace - Can use dot "." and underscore "_", but not dash "-" (R reads it as subtraction)
variable
A named storage container for data - Starts with a letter - Is case sensitive - Avoid special characters and whitespace - Can use dot "." and underscore "_", but not dash "-" (R reads it as subtraction)
What are the two types of categorical variables?
A nominal categorical variable and an ordinal categorical variable.
What is a NOMINAL categorical variable?
A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that 'one is worth more than the other'. For example, think of the categorical variable animals_vector with the categories "Elephant", "Giraffe", "Donkey" and "Horse". Here, it is impossible to say that one stands above or below the other.
How is a numeric different from an integer?
A numeric can have decimal places. Integers are whole numbers (positive, negative, or zero). They are also numerics, but in R, integers do not carry decimal places.
What is a scale?
A scale is a function, and its inverse, along with a set of parameters. A scale controls the mapping from data to aesthetic attributes, and so we need one scale for each aesthetic property used in a layer. For example, the color gradient scale maps a segment of the real line to a path through a color space. Scales are represented by legends for the graphs - e.g. mapping to shape, color and size.
What is a string?
A string is a character value. Columns in external datasets that contain any non-number value will generally be read in as strings. You can generate string variables manually with either double quotes, "", or single quotes, ''.
What is a tibble?
A tibble is a class to which data frames are assigned in the tidyverse; it's a structure that alters how data frames behave. Tibbles work in most functions that require data frame inputs.
What is a variable?
A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable's name to easily access the value or the object that is stored within this variable.
What attributes does a data frame possess?
A data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing.
How do you make an atomic vector behave like a multi-dimensional array.
Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions. Matrices are used commonly as part of the mathematical machinery of statistics. Arrays are much rarer, but worth being aware of. Matrices and arrays are created with matrix() and array(), or by using the assignment form of dim(): # Two scalar arguments to specify rows and columns a <- matrix(1:6, ncol = 3, nrow = 2) # One vector argument to describe all dimensions b <- array(1:12, c(2, 3, 2)) # You can also modify an object in place by setting dim() c <- 1:6 dim(c) <- c(3, 2) c ## [,1] [,2] ## [1,] 1 4 ## [2,] 2 5 ## [3,] 3 6
List all arithmetic operators and define them where necessary.
Addition: + Subtraction: - Multiplication: * Division: / Exponentiation: ^ Modulo: %% The ^ operator raises the number to its left to the power of the number to its right: for example 3^2 is 9. The modulo returns the remainder of the division of the number to the left by the number on its right, for example 5 modulo 3 or 5 %% 3 is 2.
How do you specify which variables are mapped to which aspects of the ggplot2 plot?
Aesthetics
List the 3 parts of an R function:
All R functions have three parts: the body(), the code inside the function. the formals(), the list of arguments which controls how you can call the function. the environment(), the "map" of the location of the function's variables. f <- function(x) x^2 f #> function(x) x^2 formals(f) #> $x body(f) #> x^2 environment(f) #> <environment: R_GlobalEnv> If the environment isn't displayed, it means that the function was created in the global environment.
What is coercion?
All elements of an atomic vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type. Types from least to most flexible are: logical, integer, double, and character.
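The coercion order can be seen by combining types with c() and asking for typeof(). A small sketch; the variable names are only there to hold the results.

```r
lgl_int <- typeof(c(TRUE, 1L))   # logical + integer
int_dbl <- typeof(c(1L, 2.5))    # integer + double
dbl_chr <- typeof(c(2.5, "a"))   # double + character

lgl_int  # "integer"
int_dbl  # "double"
dbl_chr  # "character"

# TRUE becomes 1 and FALSE becomes 0 in arithmetic
true_count <- sum(c(TRUE, FALSE, TRUE))
true_count  # 2
```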
R IDE: How do you bring up the list of keyboard shortcuts for commands?
Alt + Shift + K
What is an aesthetic?
An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.
x <- 4 y <- x * 10 y y is 40; y is assigned the computed result (40). z <- quote(y <- x * 10) z quote() returns an expression. What is an expression? What are the four possible components of an expression? Why is an expression also called an abstract syntax tree (AST)?
An expression is an object that represents an action that can be performed by R. Constants, names, calls, and pairlists. An expression is called an abstract syntax tree (AST) b/c it represents the hierarchical tree structure of the code.
What is an interval variable.
An interval variable is similar to an ordinal variable, except that the intervals between the values of the interval variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000. The second person makes $5,000 more than the first person and $5,000 less than the third person, and the size of these intervals is the same.
What is a value?
An object of the same class as .data.
What is an ordinal variable.
An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the categories. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.
What's the difference between appending and merging?
Appending ADDS more ROWS of observations while MERGING adds more COLUMNS of variables.
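A base-R sketch of the difference, using rbind() to append and merge() to merge. The data frames scores_a, scores_b and reading are made up for illustration.

```r
scores_a <- data.frame(id = 1:2, math = c(50, 60))
scores_b <- data.frame(id = 3:4, math = c(70, 80))
reading  <- data.frame(id = 1:2, read = c(55, 65))

# Appending: same columns, more rows
appended <- rbind(scores_a, scores_b)

# Merging: same units matched on id, more columns
merged <- merge(scores_a, reading, by = "id")

dim(appended)  # 4 2
dim(merged)    # 2 3
```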
Functions
Are commands that perform more sophisticated calculations, such as calculating the square root of a number.
What are arguments to a function?
Arguments are the inputs to a function.
How do you create an atomic vector?
Atomic vectors are usually created with c(), short for combine: dbl_var <- c(1, 2.5, 4.5) # With the L suffix, you get an integer rather than a double int_var <- c(1L, 6L, 10L) # Use TRUE and FALSE (or T and F) to create logical vectors log_var <- c(TRUE, FALSE, T, F) chr_var <- c("these are", "some strings")
What type of graphs are best used for categorical variables?
Bar graphs, which are created using the geom_bar( ).
Why are lists sometimes called recursive vectors?
Because a list can contain other lists. This makes them fundamentally different from atomic vectors. x <- list(list(list(list()))) str(x) ## List of 1 ## $ :List of 1 ## ..$ :List of 1 ## .. ..$ : list() is.recursive(x) ## [1] TRUE
Why is lapply( ) called a functional?
Because it takes a function as an argument.
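A brief sketch of passing a function as an argument. The list columns is illustrative; sapply() is shown as a close relative that simplifies its result to a vector.

```r
columns <- list(a = 1:3, b = c(10, 20))

# lapply() applies mean to each element and returns a list
means <- lapply(columns, mean)
means$a  # 2
means$b  # 15

# An anonymous function can be passed the same way
squares <- sapply(1:3, function(x) x^2)
squares  # 1 4 9
```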
Name some statistical transforms provided by ggplot2.
Bin; boxplot; contour; density; jitter; quantile; smooth; summary; unique. Define them.
What exactly does geom_bar do?
By default, geom_bar() counts the number of observations for each value of the variable mapped to x.
How does grep( ) work?
By default, grep() returns the index (row) number of the matches, but we can get the strings themselves returned with value=TRUE. # stat center names centers$name ## [1] "IDRE STAT CONSULT" "OIT CONSULTING" "ATS STAT HELP" ## [4] "OAC STAT CENTER"
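A sketch of the two return styles, using the center names printed on the card as a plain vector (center_names is a stand-in for centers$name). grepl() is included as the related function that returns a logical vector.

```r
center_names <- c("IDRE STAT CONSULT", "OIT CONSULTING",
                  "ATS STAT HELP", "OAC STAT CENTER")

match_index <- grep("STAT", center_names)                # positions
match_value <- grep("STAT", center_names, value = TRUE)  # the strings
match_flag  <- grepl("STAT", center_names)               # TRUE/FALSE per element

match_index  # 1 3 4
match_flag   # TRUE FALSE TRUE TRUE
```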
In ggplot2, how do I choose a shape to display on the graph?
By using the geom, e.g. geom_point( )
If a variable is not continuous, it is.....?
Categorical. These are not variables on which you can compute statistics meaningfully.
Why bother naming vectors?
Character subsetting, described in subsetting, is the most important reason to use names and it is most useful when the names are unique. Not all elements of a vector need to have a name. If some names are missing when you create the vector, the names will be set to an empty string for those elements. If you modify the vector in place by setting some, but not all variable names, names() will return NA (more specifically, NA_character_) for them. If all names are missing, names() will return NULL.
>select()
Chooses columns (x variables) and order, rename
Coercion is often automatic. How will most mathematical functions coerce?
Coercion often happens automatically. Most mathematical functions (+, log, abs, etc.) will coerce to a double or integer, and most logical operations (&, |, any, etc) will coerce to a logical. You will usually get a warning message if the coercion might lose information. If confusion is likely, explicitly coerce with as.character(), as.double(), as.integer(), or as.logical().
What is a nested function?
A combination of functions in which the innermost function is processed first: e.g. head(arrange(df, desc(hwy)))
What is a constant?
Constants are length one atomic vectors, like "a" or 10. ast() displays them as is.
Assume: d <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbraw.csv") This reads in a data frame with columns about students, subjects they study, and their test scores in those subjects. What are the two types of variables that exist? Define them.
Continuous variables are QUANTITATIVE. Categorical variables represent MEMBERSHIP TO A CLASS. The methods that explore the two variables types differ.
What does correlation do?
Correlations provide quick assessments of whether two continuous variables are linearly related to one another.
You have a data table with a column "mycol". Return the column.
DT[, mycol] returns the column as a vector; DT[, "mycol", with = FALSE] returns it as a one-column data.table.
Can data.table work with data.frame?
Data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame and that package can use [.data.frame syntax on the data.table. However data.table and data.frame are very different syntactically. E.g. DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column. DT[3,] == DT[3], but DF[,3] == DF[3] (somewhat confusingly) For this reason we say the comma is optional in DT, but not optional in DF
>filter()
Determines which rows are returned or discarded. Multiple expressions are ANDed. Use & | and () within an expression for AND, OR and order of execution
What is the "unit of analysis" rule?
Each row of the dataset should represent one unit of analysis. Units may be subjects, or trials within subjects, or whole groups of subjects.
> animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse") > factor_animals_vector <- factor(animals_vector) what's next?
Elephant, Giraffe, Donkey, Horse
What does "stacking" refer to in bar charts?
Example: use of color to show a second variable within each bar's count: ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity)) Yields a bar chart in which each bar (the whole bar represents a cut quality) is made up of rectangles of differing color, one per clarity level. Stacking is performed automatically by the "position adjustment".
is.numeric("hello")
FALSE
facebook <- c(17, 7, 5, 16, 8, 13, 14) linkedin <- c(16, 9, 13, 5, 2, 17, 14) facebook <= linkedin
FALSE TRUE TRUE FALSE FALSE TRUE TRUE
What attribute of plotting can you use to separate out behavior of categorical variables (ggplot2)?
Faceting
What does factor( ) do?
factor( ) encodes the vector as a factor. E.g. factor(c("Male", "Female", "Female", "Male", "Male"))
When are factors useful?
Factors are useful when you know the possible values a variable may take, even if you don't see all values in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations: sex_char <- c("m", "m", "m") sex_factor <- factor(sex_char, levels = c("m", "f")) table(sex_char) ## sex_char ## m ## 3 table(sex_factor) ## sex_factor ## m f ## 3 0
What is a factor?
Factors in R provide a way to represent categorical variables both numerically and categorically. Basically, factors assign an integer number (beginning with 1) to each distinct category, and then a character label to each category. We convert character variables to factors with factor(), which gives the categories numeric values. Specify the names of the categories in the levels= argument in the desired ordering. If you omit levels=, R will alphabetically sort the categories.
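A small sketch of the default alphabetical level order versus levels given explicitly, and of the integer codes stored underneath. The vector sizes is made up for illustration.

```r
sizes <- c("medium", "small", "large", "small")

# Without levels=, categories are sorted alphabetically
default_factor <- factor(sizes)
levels(default_factor)  # "large" "medium" "small"

# With levels=, you control the order and therefore the integer codes
sized_factor <- factor(sizes, levels = c("small", "medium", "large"))
size_codes <- as.integer(sized_factor)
size_codes  # 2 1 3 1
```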
Why would you prefer "join" operations to Base R "merge"?
Faster.
List the ways in which stringr's str_c( ) differs from paste( ).
First, its default separator is the empty character "": # default is no separating character str_c(centers$CITY, centers$STATE) ## [1] "LOS ANGELESCA" "DALLASTX" "MIAMIFL" "JUNEAUAK" Second, concatenating a NA into a string will result in NA: # separate with comma and space str_c(centers$CITY, centers$STATE, c("USA", "USA", "USA", NA), sep=", ") ## [1] "LOS ANGELES, CA, USA" "DALLAS, TX, USA" "MIAMI, FL, USA" ## [4] NA
Why use %>% ?
For these multi-step tasks, the pipe operator provides a useful, time-saving and code-saving shorthand.
How do we best represent categorical variables in visual plots?
Frequency tables demonstrate the distribution of membership to each category of the categorical variables.
Every operation is a " " (2 words).
Function call
What is the syntax of a function?
Function definitions consist of the following elements: • function name: used to call it • arguments: the inputs • body of code: what the function executes when called, usually encased in {} The first line of a function definition consists of its name, <-, the keyword function, and the arguments. For example: my_fun <- function(arg1, arg2=0) In that syntax, my_fun is the name of the function, arg1 is the first argument name, arg2 is the second argument name, and its default value is set to zero. The keyword function tells R that my_fun is a function (often called type closure in R errors/messages/warnings).
magrittr
Function pipes a chain of commands using %>%
ggplot2 is the PACKAGE, while ggplot( ) is the....?
Function.
O.K., so you HATE to write functions. Why perfect your function-writing ability?
Functional programming is: - more compact - only needs to be updated in one location - works for any number of columns - not likely to accidentally treat one vector differently from another - can be generalized (this especially pertains to functions using lapply)
Programming Terms: Define "functions"
Functions are commands that perform more sophisticated calculations, such as calculating the square root of a number.
What are functions?
Functions allow us to code tasks in a general way, allowing the inputs to vary. Whenever you have written the same code multiple times, consider using a function instead.
gather( )
Gather columns into key-value pairs
How do you specify which shape to display on a ggplot2 plot?
Geom
What does a geom do in ggplot2?
Geometric objects, or geoms for short, control the type of plot that you create. For example, using a point geom will create a scatterplot, whereas using a line geom will create a line plot.
In the "diamonds" data, how can you use color to distinguish among cut quality?
HERE, outline of bar is colored: ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, colour = cut)) OR [here it FILLS IN color] ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = cut))
- Orders in descending order of highway - On the subset in the variable df - Then gives the first 6 rows
head(arrange(df, desc(hwy)))
How do you handle continuous variables with plots? Note that continuous variables are a unique distribution of variables which need to be gathered into intervals to make sense.
Histograms, density plots, and boxplots. Each plot has a corresponding ggplot2 geom.
R's base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they're homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data types most often used in data analysis: (list them)
Homogeneous: atomic vector (1d), matrix (2d), array (nd). Heterogeneous: list (1d), data frame (2d). There are no scalars or 0-dimensional types, just vectors with length of 1.
Why is "group = 1"? : ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
If group is not set to 1, then all the bars have prop == 1. The function geom_bar() assumes that the groups are equal to the x values, since the stat computes the counts within the group. You just get a bunch of bars, all the same length ( = 1.0).
Define a matrix.
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.
What is a matrix in R? Using what function may you construct this entity?
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional. You can construct a matrix in R with the matrix() function.
What is an ORDINAL categorical variable?
In contrast, ordinal variables do have a natural ordering. Consider for example the categorical variable temperature_vector with the categories: "Low", "Medium" and "High". Here it is obvious that "Medium" stands above "Low", and "High" stands above "Medium".
In the Usage section for a function, the value specified after an argument is ...?
In the Usage section, a value specified after an argument is its default value. Arguments without values have no defaults and usually need to be supplied by the user.
Consider the following example: matrix(1:9, byrow = TRUE, nrow = 3) What do the arguments mean?
In the matrix() function: The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:9 which is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9). The argument byrow indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just place byrow = FALSE. The third argument nrow indicates that the matrix should have three rows. matrix(1:9, byrow = TRUE, nrow = 3)
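A quick sketch contrasting the two fill orders for the same call. The names by_row and by_col are illustrative.

```r
by_row <- matrix(1:9, byrow = TRUE, nrow = 3)   # fill across rows
by_col <- matrix(1:9, byrow = FALSE, nrow = 3)  # fill down columns (the default)

by_row[1, ]  # 1 2 3
by_col[1, ]  # 1 4 7

# The two matrices are transposes of each other
identical(by_row, t(by_col))  # TRUE
```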
How can you specify input arguments to functions?
Input arguments to functions can be specified by either name or position. Ex.: # specifying arguments by name seq(from=1, to=5, by=1) ## [1] 1 2 3 4 5 # specifying arguments by position seq(10, 0, -2) ## [1] 10 8 6 4 2 0
What's the name for the distance between the first and third quartiles?
Interquartile range.
What does jitter geom do?
It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets.
How does %in% work?
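It tests, for each element of the vector on the left, whether it appears anywhere in the vector on the right, and returns a logical vector of the same length as the left side. It is handy for subsetting rows. E.g. df[df$time %in% c(0.5, 3), ]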
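A sketch of %in% used for row subsetting. The data frame df and its columns time and conc are invented for illustration.

```r
df <- data.frame(time = c(0.5, 1, 2, 3),
                 conc = c(1.2, 2.4, 1.8, 0.9))

# One TRUE/FALSE per row: is this time one of the wanted values?
keep <- df$time %in% c(0.5, 3)
keep  # TRUE FALSE FALSE TRUE

subset_df <- df[keep, ]
subset_df$time  # 0.5 3.0
```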
List the arguments that stringr has in common with paste( ).
It has both sep= and collapse= arguments that function the same as those in paste().
How does unite( ) work?
It is strictly designed to concatenate columns stored in the same dataset. It will not accept a vector from outside the dataset. Indeed, the dataset is the first argument to unite().
Why make dot plots of describe( ) output?
It's a quick way to isolate suspicious data.
How does "unite( )" (tidyr) differ from "str_c( )" and paste( )?
Its important distinction from paste() and str_c() is that it is strictly designed to concatenate columns stored in the same dataset. It will not accept a vector from outside the dataset.
What is lexical scoping?
Lexical scoping is the set of rules that govern how R looks up the value of a symbol. In the example below, scoping is the set of rules that R applies to go from the symbol x to its value 10: x <- 10 x ## [1] 10
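A slightly larger sketch of the lookup rules: a function first looks for a name where it was defined, then in the enclosing environments. The functions f and g are illustrative.

```r
x <- 10

f <- function() {
  x <- 20            # a new x, local to f
  g <- function() x  # g has no x of its own, so it looks in f
  g()
}

inner_value <- f()
inner_value  # 20: g found the x defined inside f
x            # 10: the global x is untouched
```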
What is the process of lexical scoping?
Lexical scoping teaches you how R finds values from names.
Functions - even anonymous ones - in R are objects in their own right. What are the components of a function?
Like all functions in R, anonymous functions have formals, a body and an environment.
What are lists, and how do you specify them?
Like vectors, lists are "one-dimensional" structures, but the elements can be a MIXTURE of types (vectors, matrices and data frames). Lists can be manually generated with list().
How is a list different from atomic vectors?
Lists are different from atomic vectors because their elements can be of any type, including lists. You construct lists by using list() instead of c(): x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9)) str(x) ## List of 4 ## $ : int [1:3] 1 2 3 ## $ : chr "a" ## $ : logi [1:3] TRUE FALSE TRUE ## $ : num [1:2] 2.3 5.9
How are lists used in R?
Lists are used to build up many of the more complicated data structures in R. For example, both data frames (described in data frames) and linear models objects (as produced by lm()) are lists:
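One way to check this claim, sketched with the built-in mtcars data set (the model formula is an arbitrary choice for illustration):

```r
# A data frame is a list of equal-length columns
is.list(mtcars)    # TRUE
typeof(mtcars)     # "list"

# A fitted linear model is also a list of named components
fit <- lm(mpg ~ wt, data = mtcars)
is.list(fit)       # TRUE
names(fit)         # includes "coefficients", "residuals", ...
```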
Look up help in R for: ?position_dodge ?position_fill ?position_identity ?position_jitter and ?position_stack
Look up help in R for: ?position_dodge ?position_fill ?position_identity ?position_jitter and ?position_stack
What's a looping structure and what's it good for?
Looping structures provide another tool to code repetitive tasks efficiently. Most (probably all) programming languages support looping structures, including R. Loops in R are useful for iterating across the columns of a dataset or even across different datasets.
Give examples of different trend lines.
Mean, moving average, functions (exponential, log, polynomial), least-squares regression, etc.
What is a method?
Methods are class-specific functions. The methods( ) function lists what methods exist in the current R session. For a given generic, methods( ) lists all the class-specific functions that the generic searches for a class match.
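A minimal sketch of how a generic finds a class-specific method in R's S3 system. The generic describe_it and its methods are hypothetical names invented for this example:

```r
# Hypothetical generic: UseMethod() dispatches on the class of x
describe_it <- function(x, ...) UseMethod("describe_it")

# Fallback method, used when no class-specific method matches
describe_it.default <- function(x, ...) "some object"

# Method for objects of class "data.frame"
describe_it.data.frame <- function(x, ...) {
  paste("data frame with", nrow(x), "rows")
}

describe_it(1:3)      # "some object"
describe_it(mtcars)   # "data frame with 32 rows"
```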
How are missing values specified?
Missing values are specified with NA, which is a logical vector of length 1. NA will always be coerced to the correct type if used inside c(), or you can create NAs of a specific type with NA_real_ (a double vector), NA_integer_ and NA_character_.
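The coercion described above can be checked with typeof(); a small sketch:

```r
typeof(NA)              # "logical"
typeof(c(NA, 1L))       # "integer"   (NA coerced to match)
typeof(c(NA, "a"))      # "character" (NA coerced to match)

# Typed missing values
typeof(NA_real_)        # "double"
typeof(NA_integer_)     # "integer"
typeof(NA_character_)   # "character"
```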
Why do we care about the relationships between two variables?
Namely, we are generally interested whether the values of one variable are independent of the other, or whether they are associated (i.e. correlated or predictive).
Are vectors the only 1-dimensional data structure?
No. You can have matrices with a single row or single column, or arrays with a single dimension. They may print similarly, but will behave differently. The differences aren't too important, but it's useful to know they exist in case you get strange output from a function (tapply() is a frequent offender). As always, use str() to reveal the differences.
What are the four types of scales of measurements? (hint: g(ROIN)
Nominal - classification only Ordinal - ranked or sorted (e.g. survey options) Interval - Differences between items (e.g. Fahrenheit) Ratio - Non-arbitrary or absolute zero (Kelvin, mass, length, duration)
Again: the four scales of data, defined.
Nominal - classification or category (e.g. gender, nationality) Ordinal - Ranked or sorted (e.g. Likert scale) Interval - Differences between items (e.g. Celsius) Ratio - non-arbitrary or absolute zero (Kelvin, mass, length, duration)
Programming Terms: Define "operators"
Operators - symbols that represent calculations such as addition (+) or multiplication (*)
What is "jitter" a solution for?
Overplotting - when there are so many data points that it's hard to see where the mass of data is. "Jitter" adds a small amount of random noise to each point, spreading the points out.
What is overplotting and how do you handle it?
Problem when plotting discrete data (e.g. cyl) and have multiple points at same location: - Move the points around a small amount - Make points somewhat transparent. (e.g. alpha = )
What is an IDE? What is R's IDE called?
RStudio, an integrated development environment (IDE) that features: • a console • a powerful code/script editor featuring ○ syntax highlighting ○ code completion ○ smart indentation • special tools for plotting, viewing R objects and code history • cheat sheets for R programming • tab-completion for object names and function arguments
What is regression, and what types of variables does it rely upon?
Regression involves using one or more variables, labelled independent variables, to predict the values of another variable, the dependent variable.
What are regular expressions?
Regular expressions are a "language" in which a sequence of character codes forms a search pattern for string matching and substitution. For example, the expression "[0-9]" matches any single digit. The expression "[0-9]{2}-[0-9]{2}" matches any two digits followed by a hyphen followed by any two digits.
What is a regular expression?
Regular expressions are a "language" in which a sequence of character codes forms a search pattern for string matching and substitution. For example, the expression "[0-9]" matches any single digit. The expression "[0-9]{2}-[0-9]{2}" matches any two digits followed by a hyphen followed by any two digits.
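The two patterns from the card above can be tried out with base R's grepl() and sub(). The test strings are made up for illustration:

```r
# Hypothetical strings for illustration
x <- c("room 7", "12-34", "no digits", "5-67")

# "[0-9]" matches if the string contains at least one digit
has_digit <- grepl("[0-9]", x)
has_digit      # TRUE TRUE FALSE TRUE

# Two digits, a hyphen, two digits
two_dash_two <- grepl("[0-9]{2}-[0-9]{2}", x)
two_dash_two   # FALSE TRUE FALSE FALSE

# Substitution: replace the first match in each string
masked <- sub("[0-9]{2}-[0-9]{2}", "##-##", x)
masked         # "room 7" "##-##" "no digits" "5-67"
```

Note that "5-67" does not match the second pattern because only one digit precedes the hyphen.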
What is the "full_join" behavior?
Returns all rows from x and from y; unmatched rows in either will have NA in new columns.
What is the "inner_join" behavior?
Returns all rows from x where there is a matching value in y (returns only matching rows).
What is the "left_join" behavior?
Returns all rows from x, unmatched rows in x will have NA in the columns from y. Unmatched rows in y not returned.
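The three join behaviors can be sketched without extra packages by using base R's merge(), which is a stand-in here and not the dplyr functions themselves. The data frames x and y are made up, and merge() sorts rows by the key, which the dplyr joins do not necessarily do:

```r
# Hypothetical data sets sharing the id variable
x <- data.frame(id = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 4), y_val = c("B", "C", "D"))

# Roughly inner_join(x, y): only ids present in both (2 and 3)
inner <- merge(x, y, by = "id")

# Roughly left_join(x, y): all rows of x; y_val is NA for id 1
left <- merge(x, y, by = "id", all.x = TRUE)

# Roughly full_join(x, y): all ids from either data set (1 to 4)
full <- merge(x, y, by = "id", all = TRUE)
```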
What does str( ) do?
Returns the STRUCTURE of the object, including its class, and the data types of elements (e.g. rows and columns).
What kind of plot is an obvious choice to depict the relationship between two variables?
Scatter plots are an obvious choice to depict the relationship between 2 variables. We can also add a loess smooth layer (geom_smooth()) that provides a "best-fit" curve to the data.
What is lexical scoping?
Scoping is the set of rules that govern how R looks up the value of a symbol.
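A small sketch of the lookup rules in action. The functions f and g are made-up examples:

```r
x <- 10

f <- function() {
  y <- 1
  x + y    # x is not defined inside f, so R looks in the enclosing environment
}
f()        # 11

g <- function() {
  x <- 100 # a local x masks the outer x inside g only
  x + 1
}
g()        # 101
x          # still 10; g() did not change the outer x
```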
Define Simpson's Paradox.
Simpson's paradox, or the Yule-Simpson effect, is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title reversal paradox or amalgamation paradox. It is particularly problematic when frequency data is unduly given causal interpretations
What do you need to be able to merge data sets?
Some common id variable.
What can happen when a column you thought was a vector instead produces a factor? This occurs when a non-numeric value turns up in the column. How do you handle it?
Sometimes when a data frame is read directly from a file, a column you'd thought would produce a numeric vector instead produces a factor (this applies to R versions before 4.0.0, where read.csv() converted strings to factors by default). This is caused by a non-numeric value in the column, often a missing value encoded in a special way like . or -. To remedy the situation, coerce the vector from a factor to a character vector, and then from a character to a double vector. (Be sure to check for missing values after this process.) Of course, a much better plan is to discover what caused the problem in the first place and fix that; using the na.strings argument to read.csv() is often a good place to start. # Reading in "text" instead of from a file here: z <- read.csv(text = "value\n12\n1\n.\n9") typeof(z$value) ## [1] "integer" as.double(z$value) ## [1] 3 2 1 4 # BUT these are the integer codes of the FACTOR levels, NOT the values we read in: class(z$value) ## [1] "factor" # To FIX it: as.double(as.character(z$value)) ## Warning: NAs introduced by coercion ## [1] 12 1 NA 9 # OR change how we read it in: z <- read.csv(text = "value\n12\n1\n.\n9", na.strings=".") typeof(z$value) ## [1] "integer"
arrange()
Sort the order of rows by variable values using arrange() from dplyr. E.g. : arrange(d, desc(sex), age)
spread( )
Spread a key-value pair across multiple columns.
What do these brackets enable: [ ] ?
Subsetting.
What does the stat transform stat_summary() do?
Summarizes the y values for each unique x value to draw attention to the summary you're computing, e.g. ggplot(data = diamonds) + stat_summary( mapping = aes(x = cut, y = depth), fun.min = min, fun.max = max, fun = median ) (produces something similar to a whisker plot; ggplot2 versions before 3.3.0 spell these arguments fun.ymin, fun.ymax and fun.y)
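The double ampersand (&&) examines only the FIRST element of each vector. For example, what is the result of c(TRUE, TRUE, FALSE) && c(TRUE, FALSE, FALSE)?
TRUE (in R versions before 4.3.0; from R 4.3.0 on, && raises an error when an operand has length greater than one)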
int_var <- c(1L, 6L, 10L) is.atomic(int_var)
TRUE
int_var <- c(1L, 6L, 10L) is.integer(int_var)
TRUE
!is.numeric("hello")
TRUE (note the "not" operator)
vector: linkedin <- c(16, 9, 13, 5, 2, 17, 14) What does linkedin > 10 return?
TRUE FALSE TRUE FALSE FALSE TRUE TRUE
What happens when a logical vector is coerced to an integer or a double?
TRUE becomes 1 and FALSE becomes 0. This is very useful in conjunction with sum() and mean(). x <- c(FALSE, FALSE, TRUE) as.numeric(x) ## [1] 0 0 1 # Total number of TRUEs sum(x) ## [1] 1 # Proportion that are TRUE mean(x) ## [1] 0.3333333
How does the OR ("|") operator work?
TRUE | TRUE => TRUE TRUE|FALSE => TRUE FALSE|TRUE => TRUE FALSE|FALSE => FALSE
List all the logical or comparison operators known to R.
The (logical) comparison operators known to R are: < for less than > for greater than <= for less than or equal to >= for greater than or equal to == for equal to each other != not equal to each other
Why does DT[,5] return 5, and why should you not use it?
The 2nd argument of a data.table is an expression which is evaluated within the scope of DT. (5 evaluates to 5.) It is bad practice to refer to columns by numbers instead of names with data.table. A better choice is DT[,region], for example. You can place ANY R expression in that position, e.g. DT[, colA*colB/2]. If you have to refer to a column by number, use the "with=FALSE" argument.
Why is it preferable to use tidyverse version of read_csv to read in categorical variables?
In R versions before 4.0.0, the base R function read.csv() by default reads in character variables as factors using alphabetical ordering, which is not always desirable. Thus we recommend the readr (part of tidyverse) function read_csv(), which leaves them as character.
What does the DBI package do?
The DBI package defines an interface for communication between R and relational database management systems (RDBMS). All classes in this package are virtual and need to be extended by the various R DBMS implementations, so-called DBI back ends.
In the help file for functions, what does the Value section specify?
The Value section specifies what is returned by the function.
What is a stat? When/where is it used?
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. A stat is used by a geom; for example, the bar chart function geom_bar( ) uses stat_count( ).
What is the assignment operator? What does it do?
The assignment operator: <- is used to assign data to a variable.
What is the basic syntax for a for loop?
The basic syntax of a for loop is: • a first line that includes the keyword for, followed by a loop control variable, the keyword in, and a vector of values, all encased in () • a code block to execute for each value in the vector, encased in {}
Which function explores correlation between two continuous variables?
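The syntax described above can be sketched as follows. The variable names are made up, and the second loop iterates over the first three columns of the built-in mtcars data set:

```r
# Sum the values of a vector one element at a time
values <- c(2, 4, 6)
total <- 0
for (v in values) {
  total <- total + v
}
total   # 12

# Iterate across columns of a dataset and store each column mean by name
col_means <- numeric(0)
for (name in names(mtcars)[1:3]) {
  col_means[name] <- mean(mtcars[[name]])
}
col_means
```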
The cor( ) function estimates correlations. If supplied with 2 vectors, cor() will estimate a single correlation. If supplied a data frame with several variables, cor() will estimate a correlation matrix.
Explain the summarize( ) conflict.
The dplyr package provides summarize() for those who prefer an American spelling to summarise(). However, the package Hmisc also contains a summarize() function, and if it is loaded into the same session as dplyr, a conflict arises between the two summarize() functions. The function from the package that was loaded first will be masked when the second package is loaded into the session (with an accompanying red warning message). The function loaded second will be the default function used when summarize() is specified.
How does subset( ) work? Ex.: subset(planets_df, subset = rings)
The first argument of subset( ) specifies the data set for which you want a subset. By adding the second argument, you give R the necessary info and conditions to select the correct subset.
Now we want to translate all -99, -98, and "not assessed" values to NA.
The following syntax translates all values of a to b in dataset df: df[df==a] <- b. For example: d[d==-99] <- NA d[d==-98] <- NA d[d=="not assessed"] <- NA
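A runnable sketch of that recoding on a tiny made-up data frame (the column names score and status are assumptions for illustration):

```r
# Hypothetical data with three different missing-value codes
d <- data.frame(score  = c(10, -99, 7, -98),
                status = c("ok", "not assessed", "ok", "ok"),
                stringsAsFactors = FALSE)

# Each comparison yields a logical matrix the same shape as d;
# the cells where it is TRUE are overwritten with NA
d[d == -99] <- NA
d[d == -98] <- NA
d[d == "not assessed"] <- NA

d$score    # 10 NA  7 NA
d$status   # "ok" NA "ok" "ok"
```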
mutate( )
The function mutate() allows us to transform many variables in one step without having to respecify the data frame name over and over.
Calculate the correlation coefficient between two variables A and B (little "r").
The general formula for calculating the correlation coefficient between two variables is $r = \frac{\mathrm{cov}(A, B)}{s_A \cdot s_B}$, where $\mathrm{cov}(A, B)$ is the covariance between A and B, while $s_A$ and $s_B$ are the standard deviations.
What does the boxplot geom show?
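A quick check of the formula against the built-in cor(), using two short made-up vectors:

```r
# Hypothetical data for illustration
A <- c(1, 2, 3)
B <- c(2, 4, 7)

# Covariance: sum of the products of deviations from the mean, divided by N - 1
cov_AB <- sum((A - mean(A)) * (B - mean(B))) / (length(A) - 1)

# Correlation: covariance divided by the product of the standard deviations
r_manual  <- cov_AB / (sd(A) * sd(B))
r_builtin <- cor(A, B)

all.equal(r_manual, r_builtin)   # TRUE
```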
The median, lower and upper quartiles (the hinges) and outliers. Unlike histograms and density plots, map the variable whose distribution we want to plot to y instead of x. If we are making a single boxplot, we need an arbitrary value for x, just as a place holder.
Base R and stringr provide several functions for powerful and flexible pattern matching. What are they?
The primary base R function for pattern matching is grep(). Several other related functions also match patterns, but return slightly different output (e.g. grepl() and regexpr()). The basic syntax is grep(pattern, x), where pattern is the text pattern to match in character vector x.
What is "scaling"?
The process of mapping an aesthetic to a variable is known as SCALING. Scale transformation happens before statistical transformation. This ensures that a plot of log(x) versus log(y) on linear scales looks the same as x versus y on log scales. Transformation is only necessary for nonlinear scales, because all statistics are location-scale invariant.
Assignment Operator
The simplest operator that almost every programming language has is the assignment operator, which in many languages is nothing more than the equal sign (=) symbol, such as VariableName = Value.
Programming Terms: Define "assignment operator"
The simplest operator that almost every programming language has is the assignment operator, which in many languages is nothing more than the equal sign (=) symbol, such as VariableName = Value. In R, the usual assignment operator is "<-".
Extract column vectors by name or by number
The syntax x[["colname"]] and x[[colnum]] can be used to extract column vectors by name or by number.
What is the term for a class-specific function? Give an example.
The term is "method." For example, summary.lm( ) is the method that the generic summary( ) calls for objects of class lm, and print.data.frame( ) is the print( ) method for data frames.
What is a pipe operator?
The pipe operator chains functions together. It is loaded as part of tidyverse and looks like %>%. It takes the output of the function on the left and provides it as the first argument to the function on the right.
What do you need to know for a function?
The two main decisions for writing a function are: • What are the inputs? • What is the output?
How can you test for and coerce to a list?
The typeof() a list is list. You can test for a list with is.list() and coerce to a list with as.list(). You can turn a list into an atomic vector with unlist(). If the elements of a list have different types, unlist() uses the same coercion rules as c().
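A short sketch of those tests and coercions on a made-up list:

```r
x <- list(1L, "a", TRUE)

typeof(x)     # "list"
is.list(x)    # TRUE

# Coerce an atomic vector to a list: one list element per vector element
as_l <- as.list(1:3)
length(as_l)  # 3

# Flatten a list; mixed types follow the same coercion rules as c()
flat <- unlist(x)
flat          # "1" "a" "TRUE" (everything becomes character)
```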
What's the difference between atomic vectors and lists?
The types of their elements; all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.
What are the two types of categorical variables?
There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable. A nominal variable is a categorical variable without an implied order. In contrast, ordinal variables do have a natural ordering. (Ex.: temperature: Low, Medium, High).
What data type are texts or strings? Why do they need " "?
These are "character" data types and they need " " to indicate that "some text" is a character.
what do functional components do?
They describe the three main components of a function.
What is a "level"?
Think of levels as the unique list of values a factor can take. There is a levels( ) function to display the levels.
Why can't I use cbind( ) to create a data frame from multiple vectors (as opposed to adding a vector to a data frame)?
This doesn't work because cbind() will create a matrix unless one of the arguments is already a data frame. Instead use data.frame() directly:
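A minimal sketch of the difference, using two made-up vectors:

```r
x <- 1:3
y <- c("a", "b", "c")

# cbind() on plain vectors builds a matrix, so everything is coerced to one type
m <- cbind(x, y)
is.matrix(m)   # TRUE
typeof(m)      # "character"; the numbers were converted to strings

# data.frame() keeps each column's own type
df <- data.frame(x = x, y = y, stringsAsFactors = FALSE)
is.data.frame(df)   # TRUE
typeof(df$x)        # "integer"
```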
How do you create a categorical variable in R?
To create factors in R, you make use of the function factor(). First thing that you have to do is create a vector that contains all the observations that belong to a limited number of categories. Ex.: sex_vector <- c("Male", "Female", "Female", "Male", "Male") factor_sex_vector <- factor(sex_vector)
How do I extract a single character in base R and stringr?
To extract characters, or a "substring", from a string by position, we can use either of two similar functions: • substr() from base R • str_sub() from stringr For both functions, we specify a string variable, a start position, and an end position. # start at first, end at third character substr(centers$BUILDING, 1, 3) ## [1] "MAT" "HOU" "ORA" "SNO"
How do you install packages in R?
To use packages in R, we must first install them using the install.packages() function, which typically downloads the package from CRAN and installs it for use. E.g. install.packages("installr") install.packages("tidyverse")
What are the Boolean values and what data type are they?
TRUE and FALSE; their data type is logical.
What's the default behavior for most data loading functions in R?
Unfortunately, in R versions before 4.0.0, most data loading functions automatically convert character vectors to factors. This is suboptimal, because there's no way for those functions to know the set of all possible levels or their optimal order. Instead, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data. A global option, options(stringsAsFactors = FALSE), is available to control this behaviour, but I don't recommend using it. (Since R 4.0.0, stringsAsFactors = FALSE is the default.)
unite( )
Unite multiple columns into one.
DT[, region] returns a vector, but what I want is a 1-column data.table.
Use DT[,.(region)]. .( ) is an alias for list( ) and ensures a data.table is returned.
How do you coerce an object to a data frame?
Use as.data.frame( ): A vector will create a one-column data frame. A list will create one column for each element; it's an error if they're not all the same length. A matrix will create a data frame with the same number of columns and rows as the matrix.
How do you change the case of a string in base R?
Use base R functions tolower() and toupper() to change the case of string variables. # lower case city names tolower(centers$citystate) ## [1] "los angeles, us" "dallas, us" "miami, us" "juneau, us"
How to check if an object is a data frame?
Use class() or test explicitly with is.data.frame(): typeof(df) ## [1] "list" class(df) ## [1] "data.frame" is.data.frame(df) ## [1] TRUE
What function can you use when column headers are data, not variable names?
Use gather() to create a variable out of column headings and restructure the dataset
select( )
Use select( ) to reorder columns - e.g. tests_first <- select(d, t1=test1, t2=test2, everything()) names(tests_first)
What function can you use when you have multiple variables crammed into one column?
Use spread( ) to break a variable into multiple columns.
How do you prevent data.frame( )'s default behavior which turns strings into factors?
Use stringsAsFactors = FALSE to suppress this behaviour: df <- data.frame( x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE) str(df) ## 'data.frame': 3 obs. of 2 variables: ## $ x: int 1 2 3 ## $ y: chr "a" "b" "c"
How do you create a trend line in ggplot2? How do you make that a linear model trendline? How do you remove the confidence interval?
Use x = disp and y = mpg. Trend line: geom_smooth(mapping = aes(x = disp, y = mpg)) Linear model trend line: geom_smooth(mapping = aes(x = disp, y = mpg), method = "lm") Without the confidence interval: geom_smooth(mapping = aes(x = disp, y = mpg), method = "lm", se = FALSE)
What characterizes the ODBC connection to SQL?
Uses ODBC package Uses DBI package Works with tidyverse dplyr package Preferred by RStudio Supports read and write operations (DML/DDL)
#colMeans(mydata$Height) is what class( )?
Vector input
What is a vector?
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data.
How does "paste( )" work?
Vectors of numbers used in paste() are converted to character before pasting.
Exploring continuous AND categorical variables...what questions are we asking?
We are often interested in whether the distribution (i.e. mean, variance, etc.) of the continuous variable is the same between the classes of the categorical variable.
What happens inside the aes( ) function.
We then specify which variables are mapped to which aesthetics, which can include: ○ x-axis and y-axis ○ color, size, and shape of objects
How do we plot the distributions of the continuous variables by groups defined by the categorical variables?
We will plot separate density plots and boxplots of the continuous variables for each group of the categorical variable. The grouping variable is commonly mapped to aesthetics that take on categories themselves, such as color or shape, but can be mapped to x as well if it is numeric.
What's wrong with this statement? # Create box_office box_office <- c(new_hope + empire_strikes + return_jedi)
The "+" adds the vectors together element-wise instead of combining them. Separate the vectors with commas: box_office <- c(new_hope, empire_strikes, return_jedi) (Don't use "+"!)
What are the three ways you can name a vector?
When creating it: x <- c(a = 1, b = 2, c = 3). By modifying an existing vector in place: x <- 1:3; names(x) <- c("a", "b", "c"). Or: x <- 1:3; names(x)[[1]] <- c("a"). By creating a modified copy of a vector: x <- setNames(1:3, c("a", "b", "c")).
How does paste() concatenate strings?
When used on vectors, paste() concatenates element-wise. By default, joined elements are separated by a space.
How does paste( ) work?
When used on vectors, paste() concatenates element-wise. By default, joined elements are separated by a space. # element-by-element concatenation paste(centers$ROOM, centers$BUILDING) ## [1] "X4919 MATH SCIENCE" "Y521 HOUSTON HALL" "M234 ORANGE BUILDING" ## [4] "B2431 SNOW HALL"
When should you consider writing a function?
Whenever you've copied and pasted a block of code more than twice.
Why is it usually best to explicitly convert factors to character vectors if you need string-like behaviour?
While factors look (and often behave) like character vectors, they are actually integers. Be careful when treating them like strings. Some string methods (like gsub() and grepl()) will coerce factors to strings, while others (like nchar()) will throw an error, and still others (like c()) will use the underlying integer values.
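A small sketch showing the integer codes underneath a factor and why an explicit as.character() is the safer route. The factor f is a made-up example:

```r
f <- factor(c("low", "high", "low"))

typeof(f)         # "integer"
levels(f)         # "high" "low" (sorted alphabetically by default)
as.integer(f)     # 2 1 2, the level codes and not the labels

# nchar() refuses a factor, so capture the error instead of stopping
nchar_fails <- inherits(try(nchar(f), silent = TRUE), "try-error")
nchar_fails       # TRUE

# Converting explicitly gives ordinary string behaviour
nchar(as.character(f))   # 3 4 3
```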
How do you load files into R? Which 2 functions are critical?
Without further specification, files will be loaded from and saved to the working directory. The functions getwd() and setwd() will get and set the working directory, respectively. #get current directory (not run) getwd() # set new working directory (not run) setwd("/path/to/directory")
Which geoms work best with the following? X-Y plot (scatterplot) Trend line Overplotting Categorical variables
X-Y plot (scatterplot) - geom_point Trend line - geom_smooth Overplotting - geom_jitter (aes: alpha) Categorical variables - facet_wrap, facet_grid, and use of aesthetics (color, shape)
Can R objects belong to more than one class? What does class( ) function do? Which other function returns the object class?
Yes. But many functions only accept objects of a specific class, so it is important to know the classes of our objects. The class() function lists all classes to which the object belongs. The str() function also returns the class of the object.
What happens when you combine a character and an integer? str(c("a", 1))
Yields a character chr [1:2] "a" "1"
How do you create strings manually?
You can generate string variables manually with either double quotes, " ", or single quotes, ' '.
How can you assign names to the elements of a vector?
You can give a name to the elements of a vector with the names() function. Have a look at this example: some_vector <- c("John Doe", "poker player") names(some_vector) <- c("Name", "Profession")
How can you keep track of what exactly you are loading into vectors, by naming it?
You can give a name to the elements of a vector with the names() function. Have a look at this example: some_vector <- c("John Doe", "poker player") names(some_vector) <- c("Name", "Profession")
How can you test if an object is a matrix or array?
You can test if an object is a matrix or array using is.matrix() and is.array(), or by looking at the length of the dim(). as.matrix() and as.array() make it easy to turn an existing vector into a matrix or array.
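These tests can be sketched on a plain vector, a matrix and a three-dimensional array built from the same made-up values:

```r
v <- 1:6
m <- matrix(v, nrow = 2)
a <- array(v, dim = c(1, 2, 3))

is.matrix(v)     # FALSE
is.matrix(m)     # TRUE
is.array(m)      # TRUE; a matrix is a 2-dimensional array
is.matrix(a)     # FALSE; three dimensions
is.array(a)      # TRUE

length(dim(m))   # 2
length(dim(a))   # 3
dim(v)           # NULL; a plain vector has no dim attribute

# as.matrix() turns the vector into a one-column matrix
dim(as.matrix(v))   # 6 1
```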
Create a data frame.
You create a data frame using data.frame(), which takes named vectors as input: df <- data.frame(x = 1:3, y = c("a", "b", "c")) str(df) ## 'data.frame': 3 obs. of 2 variables: ## $ x: int 1 2 3 ## $ y: Factor w/ 3 levels "a","b","c": 1 2 3
What happens when you pass a regression model object (class - lm) to summary( ), as in: model1 <- lm(Height ~ Diabetes, data=mydata)?
summary( ) is a generic function: given a regression model object (class lm), it calls the method summary.lm( ), which produces a regression table. E.g.: summary(model1)
What is the output for read_csv() and read_delim()?
You'll see the data type of each column. Try it: dat_csv <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")
Select ALL rows
[,column]
Select ALL columns
[row,]
If b <- data.frame(letters=c("a","b","c"), numbers=c(1,2,3)) what is the result of b[b$numbers<2,]?
a 1
mutate()
add new columns
Each geom can only display certain ______________________.
aesthetics. For example, a point geom has position, color, shape, and size aesthetics. A bar geom has position, height, width, and fill color.
Bind a new variable "worldwide_vector" as a column to star_wars_matrix, as part of a matrix: all_wars_matrix.
all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)
Combine two Star Wars trilogies (star_wars_matrix, star_wars_matrix2) into one matrix
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)
Hmisc
another large collection of tools for most aspects of data analysis, but we use it here for describe(), a dataset summary function
arrange mpg data frame by "hwy" (lowest to high) "mpg" data set in Hadley Wickham's "R for Data Science"
arrange(mpg, hwy)
If b <- data.frame(letters=c("a","b","c"), numbers=c(1,2,3)) what is the result of b[2,]?
b 2
What are three different ways to access "c" in data frame b? b <- data.frame(letters=c("a","b","c"), numbers=c(1,2,3))
b$letters[3] b[3,1] b[[1]][3]
group_by() and summarise( )
by_doc <- group_by(d, docid) class(by_doc)
Group our data frame, d, by female
by_female <- group_by(d, female)
How are cbind(), rbind(), and abind() used?
c() generalises to cbind() and rbind() for matrices, and to abind() (provided by the abind package) for arrays.
Combine/concatenate will.....
c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them. Compare the results of list() and c(): x <- list(list(1, 2), c(3, 4)) y <- c(list(1, 2), c(3, 4)) str(x) ## List of 2 ## $ :List of 2 ## ..$ : num 1 ## ..$ : num 2 ## $ : num [1:2] 3 4 str(y) ## List of 4 ## $ : num 1 ## $ : num 2 ## $ : num 3 ## $ : num 4
summarize()
calculate statistics across groups
Which functions can I use to combine data frames?
cbind( ) rbind( )
df <- data.frame(x = 1:3, y = c("a", "b", "c")) Use cbind( ) or rbind( ) to add a 3rd column "z" containing numbers 3,2,1.
cbind(df, data.frame(z = 3:1)) ## x y z ## 1 1 a 3 ## 2 2 b 2 ## 3 3 c 1
what is "position adjustment"?
cf. # 488. Basically, in a stacked bar chart, where each bar is composed of smaller rectangles (texture, color, etc.) to represent yet another attribute of the variable, position = "identity" will place each object where it falls in the context of the graph. It works on graphs which are NOT bar charts too.
Create a character vector and assign values "a" "b" and "c" to it.
character_vector <- c("a", "b", "c")
How do you check the data type of a variable?
class ( )
Sum columns and rows.
colSums() and rowSums()
We previously encountered unite() when tidying data, and its important distinction from paste() and str_c() is that it is strictly designed to concatenate columns stored in the same dataset. It will not accept a vector from outside the dataset. Indeed, the dataset is the first argument to unite(). It is best used to add a column of concatenated existing variables to a dataset (and remove the original columns).
# concatenate CITY and STATE as citystate, then remove them centers <- unite(centers, col="citystate", CITY, STATE, sep=", ") centers ## # A tibble: 4 x 5 ## NAME ROOM BUILDING UNIVERSITY citystate ## * <chr> <chr> <chr> <chr> <chr> ## 1 IDRE STAT CONSULT X4919 MATH SCIENCE UCLA LOS ANGELES,~ ## 2 OIT CONSULTING Y521 HOUSTON HALL LONE STAR UNIV. DALLAS, TX
Editor and....
console. Console shows the result of the R code you submitted.
Data set = d Variables: "write" "read" Explore correlation.
cor(d$write, d$read)
Use logical subsetting to assign NA to the data entry errors
d$age[d$age<18|d$age>120] <- NA # change all impossible age values to NA # assume all adults in this dataset
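A runnable sketch of that logical subsetting on a tiny made-up data frame (the ages are invented for illustration):

```r
# Hypothetical data: 5 and 130 are treated as data entry errors
d <- data.frame(age = c(25, 5, 47, 130, 61))

# The condition is TRUE for impossible adult ages; those cells become NA
d$age[d$age < 18 | d$age > 120] <- NA

d$age   # 25 NA 47 NA 61
```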
Convert the gender-specific character variables into factors. Data set = d, variable = female.
d$female <- factor(d$female) levels(d$female) A: "female" "male"
OK, data set is still "d" and the categorical variable is "ses". Convert ses to a factor using factor( ), specifying the levels explicitly instead of relying on the default alphabetical sort.
d$ses <- factor(d$ses, levels=c("low", "middle", "high"))
Use levels() on a factor to check the ordering of levels for data set "d".
d$ses <- factor(d$ses, levels=c("low", "middle", "high")) levels(d$ses) ## [1] "low" "middle" "high"
In the file "http://stats.idre.ucla.edu/stat/data/hsbsemi.txt", the fields are separated by ";". What function could I use to read it in?
d_semi <- read_delim("https://stats.idre.ucla.edu/stat/data/hsbsemi.txt", delim=";")
What info does ggplot( ) consume?
A data set and, inside the aes( ) function, the variables mapped to aesthetics such as the x and y axes and the color, size and shape of objects.
Nested function: arrange (filter(dataframe, A==B), B) Now say it using the pipe operator.
dataframe %>% filter (A==B) %>% arrange (B)
lubridate
date and time variable processing
Every geom has a default ___________, and every _____________________has a default geom.
default statistic, statistic For example, the bin statistic defaults to using the bar geom to produce a histogram.
Which tidyverse package is helpful in examining relationships between continuous AND categorical variables?
dplyr provides a useful function, group_by(), which converts a data frame into a grouped data frame, grouped by one or more variables. After grouping the data frame, we then use the dplyr function summarize() to calculate statistics by group.
Drop rows containing missing values.
drop_na( )
Extract one column into multiple columns.
extract( )
Ordinal factor for temperature - create factor_temperature_vector with temps low, medium and high.
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
How do I express an ordinal variable - example, temperatures "low", "medium", and "high". Define factor_temperature_vector.
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
The factor "temperature_vector" is ordinal. How can you express this so that values for the vector are printed out in a specific order?
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
Use filter( ) to select all rows where cylinder ("cyl") is 5. ("mpg" data set in Hadley Wickham's "R for Data Science")
filter(mpg, cyl == 5)
fixmissing() Why?
fixmissing( ) helps you clean up a single vector - e.g. replace bad data with "NA"
Alternative to geom_bar?
geom_col( ) - There are two types of bar charts: geom_bar makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col instead.
Which functions get and set your working directories?
getwd( ) and setwd( )
data set is "d", x = 1, y = math Build me a boxplot.
ggplot(d, aes(x = 1, y = math)) + geom_boxplot() Unlike histograms and density plots, map the variable whose distribution we want to plot to y instead of x. If we are making a single boxplot, we need an arbitrary value for x, just as a place holder.
Data set is "d" and x variable is "write". Build me a histogram with 10 intervals.
ggplot(d, aes(x = write)) + geom_histogram(bins=10)
Sometimes you need to explore categorical variables by other categorical variables graphically. Here we map prog to fill, the color used to fill the bars of the bar graph. (The color aesthetic specifies the color of the outline of the bars)
ggplot(d, aes(x=ses, fill=prog)) + geom_bar() This produces a stacked bar chart. We again see that "high" ses has a higher proportion of "academic" https://stats.idre.ucla.edu/stat/data/intro_r/intro_r_flat.html
The position argument in geom_bar() changes how the colors are sorted on the graph. We can specify that the color positions should stack (the default), dodge (side-by-side), or fill (uniform height to examine proportions). Adding color by prog to our scatter plot of read vs write assesses whether the read-write relationship appears the same between programs.
ggplot(d, aes(x=ses, fill=prog)) + geom_bar(position="dodge") https://stats.idre.ucla.edu/stat/data/intro_r/intro_r_flat.html
Take the data in "d" and keep only the rows where math is less than its mean (d$math < mean(d$math)). Plot the distribution of "write" for students with math scores below the mean math score.
ggplot(d[d$math < mean(d$math),], aes(x=write)) + geom_histogram(bins=10)
For all plotting exercises, use the data set "mpg". -create a scatterplot and linear trendline of mpg - miles per gallon vs. engine cylinders - WITH jitter
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) + geom_jitter( ) + geom_smooth(method = "lm", se = FALSE)
For all plotting exercises, use the data set "mpg". -create a scatterplot and linear trendline of mpg - miles per gallon vs. engine cylinders - no jitter
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) + geom_point( ) + geom_smooth(method = "lm", se = FALSE)
What are the 7 parameters that the "layered grammar of graphics" accepts?
ggplot(data = <DATA>) + <GEOM_FUNCTION>( mapping = aes(<MAPPINGS>), stat = <STAT>, position = <POSITION> ) + <COORDINATE_FUNCTION> + <FACET_FUNCTION> The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.
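One way the template might be filled in, assuming ggplot2 is installed. The choice of variables (class, drv, year from the mpg data set) and of coord_flip() is purely illustrative.

```r
library(ggplot2)

p <- ggplot(data = mpg) +              # dataset
  geom_bar(                            # geom
    mapping = aes(x = class, fill = drv),  # mappings
    stat = "count",                    # stat (the default for geom_bar)
    position = "dodge"                 # position adjustment
  ) +
  coord_flip() +                       # coordinate system
  facet_wrap(~ year)                   # faceting scheme

# Printing p (or typing p at the console) draws the plot.
```

Each of the seven pieces in the template corresponds to one commented line above.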
data set is "d", x variable is "write" and y variable is "read". ggplot me a scatterplot
ggplot(data=d, aes(x=write, y=read)) + geom_point()
Modify this to facet on the variable "cyl" (cylinder) ggplot(df_filtered, mapping = aes(x = displ, y = hwy)) + geom_jitter(alpha = 0.5) + geom_smooth(method = "lm", se = FALSE)
ggplot(df_filtered, mapping = aes(x = displ, y = hwy)) + geom_jitter(alpha = 0.5) + geom_smooth(method = "lm", se = FALSE) + facet_wrap (~cyl)
Which package of functions in Hadley Wickham's "tidyverse" library enables plotting data on coordinates?
ggplot2 (don't forget the "2"!)
What is the difference between ggplot2 and ggplot( )?
ggplot2 is a package of functions, whereas ggplot( ) is a function.
The describe() function from package Hmisc will...
give different summary statistics depending on the variable's type and number of distinct values.
How to quickly view data set.
glimpse( )
How do you view the beginning and ends of data sets? Can you specify the # of rows you see? Demonstrate.
head( ) and tail( ). head(dat_csv, 2) - first 2 rows of data frame dat_csv; tail(dat_csv, 8) - last 8 rows of dat_csv
Use "-" to specify the bottom 2 values, arrange "mpg" data frame by "hwy" excepting 2 "worst" values "mpg" data set in Hadley Wickham's "R for Data Science"
head(arrange(df, hwy), n = -2) - a negative n returns all rows of the sorted data frame except the last 2
Arrange the mpg data frame first by class, then by highest to lowest hwy. "mpg" data set in Hadley Wickham's "R for Data Science"
head(arrange(df, class, desc(hwy)), n = 10)
arrange mpg data frame to find top 6 worst gas guzzlers using arrange( ) and head( ) and variable "hwy" "mpg" data set in Hadley Wickham's "R for Data Science"
head(arrange(df, hwy)) returns the 6 lowest hwy values, and corresponding "displ", "cyl" and "class"
Factors are represented both by their integers and their character labels. Factors are converted to 0/1 variables in regression models. Specify that the head(d$ses) be represented as numeric values.
head(d$ses) head(as.numeric(d$ses)) Then find the value for "low" d$ses[1] == "low"
"mpg" data frame, select first 10 rows, then last 10 rows "mpg" data set in Hadley Wickham's "R for Data Science"
head(df, n = 10) tail(df, n = 10)
Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
https://ggplot2.tidyverse.org/reference/
Special calls describe two types of functions: ...?
infix and replacement functions
How do you install packages in R?
install.packages("package") - e.g. install.packages("tidyverse")
What is a general test for the "numberliness" of a vector that returns TRUE for both integer and double vectors? It is not a specific test for double vectors, which are often called numeric.
is.numeric( ) (returns T/F)
What does is.vector( ) test?
is.vector( ) does NOT test whether an object is a vector. Instead it returns TRUE only if the object is a vector with no attributes apart from names. Use is.atomic(x) || is.list(x) to test if an object is actually a vector.
What is lapply( ) and what are its three inputs?
lapply( ) applies a function to each element of a list and returns a new list. Its inputs are x (a list), f (a function), and other arguments to pass to f( ).
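A brief base R sketch of the three inputs in action. The list named scores and its values are made up for illustration.

```r
# x: a list (hypothetical test scores, one with a missing value)
scores <- list(math = c(90, 80, NA), read = c(70, 75, 80))

# f: mean; extra argument passed on to f: na.rm = TRUE
means <- lapply(scores, mean, na.rm = TRUE)

# The result is a new list with the same names as the input.
means
```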
length() and names() have high-dimensional generalisations, which are....?
length() generalises to nrow() and ncol() for matrices, and dim() for arrays. names() generalises to rownames() and colnames() for matrices, and dimnames(), a list of character vectors, for arrays.
What will be the result of dim(b)? b <- data.frame(letters=c("a","b","c"), numbers=c(1,2,3)) dim(b)
3 2 - that is, 3 rows and 2 columns. The data frame itself prints as: letters numbers a 1 b 2 c 3
survey_vector <- c("M", "F", "F", "M", "M") factor_survey_vector <- factor(survey_vector) # Specify the levels of factor_survey_vector
levels(factor_survey_vector) <- c("Female", "Male")
haven is a PACKAGE in the tidyverse that can read in datasets from other statistical analysis software. It has to be loaded separately. Demonstrate how.
library("tidyverse") require(haven) Ex. dat_spss <- read_spss("https://stats.idre.ucla.edu/stat/data/hsb2.sav")
Load tidyverse.
library(tidyverse)
What are four common types of atomic vectors?
logical, integer, double (often called numeric), and character In addition, there are two rare types: complex and raw.
Specify a matrix with 9 elements, spread by rows, with 3 rows. How do you modify to spread by columns?
matrix(1:9, byrow = TRUE, nrow = 3). If by columns, byrow = FALSE.
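A quick base R comparison of the two fill orders; the object names by_row and by_col are chosen here for illustration.

```r
by_row <- matrix(1:9, nrow = 3, byrow = TRUE)   # fill across rows
by_col <- matrix(1:9, nrow = 3, byrow = FALSE)  # fill down columns (default)

by_row[1, ]  # first row when filled by rows: 1 2 3
by_col[1, ]  # first row when filled by columns: 1 4 7

# Transposing the row-filled matrix gives the column-filled one.
identical(t(by_row), by_col)
```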
b <- data.frame(letters=c("a","b","c"), numbers=c(1,2,3)) is a data frame. (class(b)) Why does mean(b) not work?
mean() cannot operate on several columns. It expects a numeric or logical vector as an input.
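A sketch of two common workarounds in base R: take the mean of a single column, or restrict to the numeric columns first. The helper name num_cols is hypothetical.

```r
b <- data.frame(letters = c("a", "b", "c"), numbers = c(1, 2, 3))

dim(b)            # number of rows, then number of columns

# Pass mean() a single numeric column instead of the whole data frame.
mean(b$numbers)

# Or compute column means over the numeric columns only.
num_cols <- sapply(b, is.numeric)
col_means <- colMeans(b[, num_cols, drop = FALSE])
col_means
```

Using drop = FALSE keeps the selection a data frame even when only one column is numeric, which colMeans() requires.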
Get me the mean, median, variance and standard deviation for the column d$read.
mean(d$read) median(d$read) var(d$read) sd(d$read)
How do you find out what functions accept data frames as arguments?
methods(class = "data.frame")
How do you find out what classes of objects the generic function summary() accepts?
methods(summary) Generic functions match classes to methods.
Example of useful functions for summarise():
min(), max(), mean(), sum(), var(), sd() • n(): number of observations in the group • n_distinct(x): number of distinct values in variable x
Use the pipe operator to sort "mpg" column names ("names") "mpg" data set in Hadley Wickham's "R for Data Science"
mpg %>% names( ) %>% sort( )
Using the "mpg" data frame, change the column name "displ" to "displacement" "mpg" data set in Hadley Wickham's "R for Data Science"
mpg %>% rename(displacement = displ) %>% names( )
Use the minus sign ("-") to remove columns "trans" "fl" from the "mpg" data frame's column names "mpg" data set in Hadley Wickham's "R for Data Science"
mpg %>% select(-trans, -fl) %>% names( )
"mpg" data set: select "class" as the first column followed by all others and include in the variable "names" "mpg" data set in Hadley Wickham's "R for Data Science"
mpg %>% select(class, everything()) %>% names( )
Concatenating specifications: Using the "mpg" data frame, select all the columns between "displ" and "fl", except "trans", followed by "manufacturer" and assign to "names" "mpg" data set in Hadley Wickham's "R for Data Science"
mpg %>% select(displ:fl, -trans, manufacturer) %>% names( )
"mpg" data set: select all columns that end with the letter y and include in (variable) "names" "mpg" data set in Hadley Wickham's "R for Data Science"
mpg %>% select(ends_with("y")) %>% names( )
Using "mpg" data set, use the pipe operator to select four columns from the "mpg" data set and sort in alpha order "mpg" data set in Hadley Wickham's "R for Data Science"
mpg %>% select(hwy, displ, cyl, class) %>% names( ) %>% sort( )
"mpg" data set: select all columns that start with the letter m and include in (variable) "names" "mpg" data set in Hadley Wickham's "R for Data Science"
mpg %>% select(starts_with("m")) %>% names( )
my_df <- mtcars[1:10,] Store my_df in a list under the name "df" and assign the list to my_list
my_list <- list(df = my_df)
What does names( ) do? Give an example.
names( ) gets or sets the names of an object's elements. For example, for the vector some_vector <- c("John Doe", "poker player"), names(some_vector) <- c("Name", "Profession") assigns a name to each element.
Pre-existing vector: roulette_vector <- c(-24, -50, 100, -350, 10) Create a names vector to assign names of the week days to this vector.
names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
Which function gives you the ranked position of each element when it is applied on a variable?
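order( ) - it returns the positions (indices) that put the elements in sorted order; rank( ) returns each element's rank directly.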
At least 3 functions loaded in the current session (base R and tidyverse) can concatenate strings together. What are they?
paste() from base R; str_c() from stringr (tidyverse); unite() from tidyr (tidyverse)
You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick: planets_df[,3] planets_df[,"diameter"] Use a shortcut to pull up column "diameter"
planets_df$diameter
# Print out diameter of Mercury (row 1, column 3), data frame = planets_df
planets_df[1,3]
what does "position = dodge" do?
position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
What does "position = fill" do? E.g. ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups. (You can compare the relative size of a single color across all the bars, which are of the same height.)
What function joins two matrices?
rbind( )
What are your two choices (base R, dplyr) for appending rows of data? How do they differ?
rbind( ) and bind_rows( ) They differ in whether the datasets being appended have non-matching columns. rbind( ) will produce an error where there are non-matching columns.
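A small sketch of the difference, assuming dplyr is installed. The two data frames a and b are invented for illustration; b deliberately lacks the score column.

```r
library(dplyr)

a <- data.frame(id = 1:2, score = c(10, 20))
b <- data.frame(id = 3L)               # no `score` column

# bind_rows() matches columns by name and fills the gaps with NA.
stacked <- bind_rows(a, b)
stacked

# rbind() refuses data frames whose columns do not match;
# tryCatch() captures the error so the script keeps running.
rbind_result <- tryCatch(rbind(a, b), error = function(e) e)
inherits(rbind_result, "error")
```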
How do you read in a csv file using tidyverse? How about a text data file delimited by other characters?
read_csv( ) read_delim( ) For read_delim( ), specify the delimiter in the delim= argument.
Replace missing values
replace( )
What are return values (functions)
return values discuss how and when functions return values, and how you can ensure that a function does something before it exits.
Choose the correct geom to visualize the following: scatterplot trend line overplotting categorical variables
scatterplot: geom_point( ); trend line: geom_smooth( ); overplotting: geom_jitter( ), use of "alpha"; categorical variables: facet_wrap( ), facet_grid( )
select()
select columns (you can rename as you select)
filter()
select rows according to conditions
Give the R functions for these SQL commands: SELECT WHERE ORDER BY What do these do?
select( ) - chooses columns; filter( ) - determines which rows are returned; arrange( ) - sorts the rows by one or more columns
Create a vector "selection_vector" for all values of "roulette_vector" with positive returns. Then select from roulette_vector those days, assigning to the new variable "roulette_winning_days". Hint: subset
selection_vector <- roulette_vector > 0 roulette_winning_days <- c(roulette_vector[selection_vector])
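The full sequence in base R, using the roulette_vector values and weekday names given earlier in this set:

```r
roulette_vector <- c(-24, -50, 100, -350, 10)
names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# A logical vector: TRUE where the day's return was positive.
selection_vector <- roulette_vector > 0

# Logical subsetting keeps only the elements where the selector is TRUE.
roulette_winning_days <- roulette_vector[selection_vector]
roulette_winning_days
```

Wrapping the subset in c( ) is optional, since the subset is already a vector.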
Separate one column into multiple columns.
separate( )
# The variables mov, act and rev are available for: moviename, actors, and reviews # Finish the code to build shining_list with the named variables
shining_list <- list(moviename = mov, actors = act, reviews = rev)
#
signifies what follows is commentary, not code.
arrange()
sort rows
What function other than spread( ) can you use to break up a column?
split( )
List a variety of "select" helper functions that help you select variables based on their names. "mpg" data set in Hadley Wickham's "R for Data Science"
starts_with( ) contains( ) one_of( ) matches( ) num_range( ) ends_with( ) everything( )
What does str( ) do?
str( ) delineates the structure of the data frame.
What's a fast way to assess an objects data structure?
str()
what's the best way to quickly get a view of a data frame?
str(dataframe)
How do you concatenate strings in tidyverse?
str_c() from stringr
stringr
string variable manipulation
What's the string manipulation function in tidyverse?
stringr (a package rather than a function)
what's the name of the tidyverse package for manipulating strings?
stringr. It is loaded automatically with tidyverse.
df is "planets_df" select all planets with diameter less than 1
subset(planets_df, subset = diameter<1)
Specify a function to evaluate a variable by groups in summarize(). First specify the (grouped) dataset, then the functions to run on variables in the dataset.
summarize(by_female, mean(math), var(math)) ## # A tibble: 2 x 3 ## female `mean(math)` `var(math)` ## <fctr> <dbl> <dbl> ## 1 female 52.39450 83.74108 ## 2 male 52.94505 93.40806
Which function quickly provides an overview of data?
summary( ) can be run on defined factors
What kind of function is summary( )? Why?
summary( ) is a generic function. When a data frame is passed to summary( ), the data frame is then passed to a specific function called summary.data.frame( ), which provides a numeric summary of all variables in the data frame.
What's the quickest way to compute critical statistics for column d$read? What values does the function return?
summary(d$read) min, max, mean, median, 1st and 3rd quartiles
Operators
symbols that represent calculations such as addition (+) or multiplication (*)
How can you transpose a matrix?
t( )
How do I build a frequency table?
table( ) prop.table( ) can be used on tables produced by table( ) to see the frequencies expressed as proportions.
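A minimal base R sketch; the ses values below are invented for illustration and are not the counts from the sample data set.

```r
# Hypothetical categorical variable
ses <- c("low", "high", "medium", "high", "low", "high")

counts <- table(ses)          # frequency of each category
counts

props <- prop.table(counts)   # the same table expressed as proportions
props
```

The proportions sum to 1 across the table.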
In the sample data set = "d", some categorical variables are: programs: "general" "academic" "honors" female: "male" "female" honors: "enrolled" "not enrolled" ses (socioeconomic class): "low" "medium" "high" Query for gender and ses.
table(d$female) female male 109 91 table(d$ses) high.... 58....
Use logical expressions to subset
the expression d$age>60 returns a vector of TRUE/FALSE, # rows with TRUE are selected d[d$age>60, c("age", "pain")]
read_csv() assigns the data frame to a class called...?
tibble, a tidyverse structure that slightly alters how data frames behave
express total_revenue_vector as a sum of all the columns in the matrix "all_wars_matrix"
total_revenue_vector <- colSums(all_wars_matrix)
How can you determine a vector's type with functionS?
typeof( ) is.character( ) is.double( ) is.integer( ) is.logical () is.atomic( )
What function can you use to concatenate a split column?
unite( )
How can you view a data set in a spreadsheet style?
view( ) - e.g. view(dat_csv)
geom_jitter( ) has two arguments that control visualization. what are they?
width controls the amount of horizontal displacement, and height controls the amount of vertical displacement.
Write the data set dat_csv to a file.
write_csv(dat_csv, file = "path/to/save/filename.csv")
Write a conditional statement where R prints "x is a negative number" if x is negative, with x <- -3.
x <- -3 if(x <0) { print("x is a negative number") }
Extract the column vector with name colname
x$colname
Not all elements of a vector need to have a name. If some names are missing when you create the vector, the names will be set to an empty string for those elements. If you modify the vector in place by setting some, but not all variable names, names() will return NA (more specifically, NA_character_) for them. If all names are missing, names() will return NULL. Example:
y <- c(a = 1, 2, 3) names(y) ## [1] "a" "" "" v <- c(1, 2, 3) names(v) <- c('a') names(v) [1] "a" NA NA z <- c(1, 2, 3) names(z) ## NULL
| is the "x" operator. What is "x"?
| is the "OR" operator
R Logical operators and functions
• ==: equality • >, >=: greater than, greater than or equal to • !: not • &: AND • |: OR • %in%: matches any of (2 %in% c(1,2,3) = TRUE) • is.na(): equality to NA • near(): checking for equality for floating point (decimal) numbers, has a built-in tolerance
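A short base R sketch of a few of these, including why a tolerance matters for decimal numbers. It uses all.equal() as a base R stand-in for dplyr's near(); that substitution is a choice made here, not part of the card.

```r
2 %in% c(1, 2, 3)        # is 2 among the values?
is.na(c(1, NA, 3))       # which elements are missing?

# Floating point: squaring sqrt(2) does not give exactly 2.
x <- sqrt(2)^2
x == 2                   # exact comparison fails
isTRUE(all.equal(x, 2))  # comparison with a tolerance succeeds
```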
What are the 3 principles of a "tidy" data set?
• A dataset is a set of values organized into variables (columns) and observations (rows) • A variable should measure the same attribute across observations • An observation should represent the same unit measured across variables.
What are the tidyr structural rules?
• A dataset is a set of values organized into variables (columns) and observations (rows) • A variable should measure the same attribute across observations • An observation should represent the same unit measured across variables.
What are the three main actions you need to know how to take with strings?
• concatenate strings together • extract and change characters in strings • match patterns in strings
Useful R functions for transforming:
• log(): logarithm • min_rank(): rank values • cut(): cut a continuous variable into intervals • scale(): standardizes variable (subtracts mean and divides by standard deviation) • lag(), lead(): lag and lead a variable (from dplyr) • cumsum(): cumulative sum • rowMeans(), rowSums(): means and sums of several columns • recode(): recode values (from dplyr)
What are the THREE functions (base R and tidyverse) that can be used to concatenate strings?
• paste() from base R • str_c() from stringr (tidyverse) • unite() from tidyr (tidyverse)
Often we want to select a group of related variables with similar names. dplyr supplies helper functions to find columns whose names match a specified pattern. Examples?
• starts_with(x): matches names that begin with the string x • ends_with(x): matches names that end with the string x • contains(x): matches names that contain the string x • matches(re): matches regular expression re • num_range(prefix, range) matches names that contain prefix and one element of range
What are the three (3) parts of a function?
• the body(), the code inside the function. • the formals(), the list of arguments which controls how you can call the function. • the environment(), the "map" of the location of the function's variables.
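The three parts can be inspected directly in base R. The function power_of below is a hypothetical example written for this illustration.

```r
# A small example function with one default argument
power_of <- function(x, power = 2) {
  x^power
}

formals(power_of)      # the argument list, including the default for `power`
body(power_of)         # the code inside the function
environment(power_of)  # where the function looks up its variables

power_of(3)            # uses the default power = 2
```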
What sections turn up in a help file?
The Usage section, where a value specified after an argument is its default value. The Value section, which specifies what the function returns. Examples are at the bottom. Vignettes provide tutorials for packages.
How do I assign data to and store it in an object?
Use the operators <- or =.
How do I ACCESS matrix elements?
Using matrix[row,column] notation. Omitting row requests all rows, and omitting columns requests ALL columns.
How do you map variables TO the aesthetics layer in ggplot2?
Using the aes( ) function.
Conditional selection, or subsetting by value is used when?
Vector elements subset with a logical (TRUE/FALSE) vector; this is logical subsetting.
How do you install a package?
install.packages("packagename")
How do I determine a vector's length?
length( )
How do you determine the LENGTH of a function?
length( ) function returns its length.
How do you load a package into the R environment?
library("package name")
How do I list all the objects in the current session?
ls( )
Demonstrate how list elements can be named.
mary_info <- list(classes = c("Biology", "Math", "Music")) mary_info $classes [1] "Biology" "Math" "Music"
How do I generate a matrix?
matrix( ). The input to matrix( ) is a one-dimensional vector, which is RESHAPED into a TWO dimensional matrix according to dimensions specified by the user in the arguments nrow and/or ncol.
If a list accepts a mixture of data types, give me a list of a numeric vector, an integer vector, and a character vector.
mylist <- list(1.1, c(1L,3L,7L), c("abc", "def")) [[1]] 1.1 [[2]] 1 3 7 [[3]] "abc" "def"
class(mydata$Height)
numeric
How do I create a vector with a PREDICTABLE SEQUENCE of elements?
rep( ) - as in: rep(0, times = 3) ## [1] 0 0 0 or rep("abc", 4) ## [1] "abc" "abc" "abc" "abc" The first argument specifies the value, the second the number of times it is repeated.
Vector elements can be named, then subset by name (using " "). E.g. scores <- c(John=25, Marge=34, Dan=24, Emily=20) How can I subset John and Emily?
scores[c("John", "Emily")]
How do I create a vector with SEQUENTIAL elements?
seq( ) - as in seq(from=1, to=5, by=2) ## [1] 1 3 5 seq(10, 0, -5) ## [1] 10 5 0 The first value is the beginning of the vector, the second is the end, and the third is the step (here negative, so a decrement).
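The seq( ) and rep( ) calls from these cards, collected into one runnable base R snippet:

```r
seq(from = 1, to = 5, by = 2)   # 1 3 5
seq(10, 0, -5)                  # 10 5 0 (a negative step counts down)

rep(0, times = 3)               # 0 0 0

# Calls can be nested: repeat the sequence 1 2 3 twice.
rep(seq(1, 3, 1), times = 2)    # 1 2 3 1 2 3
```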
How do I determine a vector's type?
typeof( )
How do you determine the TYPE of vector?
typeof( ) function ID's the vector's type.
Again, what are the 6 elements of the grammar of graphics?
Data (variables); Geoms; Stats; Scales; Coordinate Systems; Faceting.
Define "data" in the "grammar of graphics" (for ggplot2).
Data are variables mapped to aesthetic features of the graph
How MUST data for plotting with ggplot2 tools be stored?
Data must be stored in a data frame.
In the "grammar of graphics", what are the six elements?
Data; geoms; stats; scales; coordinate systems; faceting.
What is faceting in ggplot2?
Faceting is splitting the data into subsets to create multiple variations on the same graph.
What is a function in R?
Functions perform most of the work on data in R, similar to functions in math.
What is Hmisc?
Hmisc is an R package of miscellaneous functions for data analysis; its describe() function produces statistical summaries.
Which classes imply an implicit class of "vector"?
If class( ) returns a basic data type ("numeric", "character", "integer"), the object has an implicit class of vector.
To how many classes can an R object belong?
Objects can belong to more than one class. Many functions only accept objects of a specific class, so it is important to know the classes of our objects.
What is R Studio?
R Studio is an IDE, integrated development environment.
What is R?
R is a powerful statistical programming language.
What is R?
R is a programming environment that can serve as a data analysis and storage facility.
What is R optimized to do?
R is designed to perform operations on vectors and matrices, using a simple programming language that is a dialect of S.
What is a programming object in R?
R stores both data and output from data analysis in objects.
What does dim( ) do?
Returns the number of ROWS and COLUMNS in a data frame or two-dimensional object.
Define "scales" in the "grammar of graphics" (for ggplot2).
Scales are mapping of aesthetic values to data values.
How do I print the contents of an object, say an object named "abc"?
Simply type abc.
What is faceting (ggplot2)?
Splitting the data into subsets to create multiple variations of the same graph.
Define "stats" in the "grammar of graphics" (for ggplot2).
Stats are statistical transforms that summarize data (e.g. mean).
Interpret: c(1,2,3) <2
TRUE FALSE FALSE
Functions in R are similar to functions in math. How?
They perform an operation on an input and return an output. E.g. mean( ) takes a vector of numbers and returns its mean.
How do I get "help" for a function?
Type "?" before it - e.g. ?getwd
What are the components of R Studio?
A console, a code/script editor, and special tools for plotting, viewing R objects and code history.
Create a vector called "first_vec" with the integers 1,3, and 5.
first_vec <- c(1,3,5) - note that these values are stored as doubles; c(1L, 3L, 5L) stores them as integers.
#mydata is of class data.frame. class(mydata)
"data.frame"
What is the function of "#"
# tells R not to operate on any text that follows it on that line
What are the four vector types which can be used to represent a single variable?
(1) Logical (TRUE or FALSE, which coerce to 1 or 0); (2) integer (integers only, written as a number followed by L, as in 10L); (3) double, which are real numbers, also known as "numeric"; and (4) character, i.e. strings.
NEST functions - e.g. rep(seq(1,3,1), times = 2)
1 2 3 1 2 3
seq(from=1, to=5, by=2) means?
1 3 5
How can I use [ ] to subset a vector? Given vector: a <- seq(10,1,-1) which is 10 9 8 7 6 5 4 3 2 1 what is a[c(1,3,4)]?
10 8 7
How can I use [ ] to subset a vector? Given vector: a <- seq(10,1,-1) which is 10 9 8 7 6 5 4 3 2 1 what is a[seq(1,5)]?
10 9 8 7 6 (the first 5 elements)
Why is R unhappy here? c(2,3,4) + c(10, 20)
12 23 14 - with a warning that the longer object length is not a multiple of the shorter object length
Interpret: c(1,2,3) + 1
2 3 4
Interpret: c(1,2,3,4,5,6) + c(1,2)
2 4 4 6 6 8
If abc <- 3, and I type "abc", what will the console return?
3
What will this matrix look like? b <- matrix(5:14, nrow=2, byrow=TRUE)
Row 1: 5 6 7 8 9; row 2: 10 11 12 13 14
For matrix: 1 2 3 4 5 6 what is a[2,3]
6
Given numeric vector: mydata[["height"]] returns 65 69 71 73 What is: mydata[["height"]][2]? ...and why?
69 - [["height"]] extracts the height column as a vector, and [2] selects its second element.
How can I use [ ] to subset a vector? Given vector: a <- seq(10,1,-1) which is 10 9 8 7 6 5 4 3 2 1 what is a[2]?
9 (the 2nd element)
How can I search help for a topic?
??keyword searches R documentation for keyword (e.g. ??logistic)
In a data frame, what is a column, and what is a row?
A column is a variable, and a row is an observation: one value for each of the variables (columns).
What is the fundamental data structure in R? Describe it.
The vector is the fundamental data structure in R. Vectors are one-dimensional and homogeneous: every element has the same type.
What is a single value?
A vector of length one.
What is an argument?
An argument is an input to a function. Ex.: install.packages("tidyverse"), where "tidyverse" is the argument.
What is a geom?
An object or shape on the graph.
List the means by which you can access a list element.
By position - e.g. mary_info[[2]] By name - e.g. mary_info$SAT 1450 By position element in specific vector mary_info$friends[2] "Dan"
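A runnable base R sketch of the three access styles. The SAT value and the friend "Dan" come from the card; the other list contents are assumed for illustration.

```r
mary_info <- list(
  classes = c("Biology", "Math", "Music"),
  friends = c("John", "Dan", "Emily"),   # hypothetical apart from "Dan"
  SAT = 1450
)

mary_info[[2]]          # by position: the second element (the friends vector)
mary_info$SAT           # by name
mary_info$friends[2]    # by name, then by position within that vector

# Single brackets return a (sub)list rather than the element itself.
class(mary_info["SAT"])
```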
R objects belong to....? How do I figure out what an object is (what category)?
Classes. Objects can belong to more than one class. Many functions only accept objects of a SPECIFIC class (e.g. data frame). The class( ) function tells you to which class an object belongs.
What is a value specified after an argument?
It is the argument's default value. An example is byrow = FALSE in the Usage section for matrix( ).
How do data frames combine the features of matrices and lists?
Like matrices, data frames are rectangular, where the COLUMNS are VARIABLES and the ROWS are OBSERVATIONS. Like lists, data frames can have elements (column vectors) of different data types.
What is a list?
Lists are one-dimensional structures (like vectors), BUT the elements can be a mixture of types, including other lists, matrices and data frames. Lists are one-dimensional but HETEROGENEOUS.
Back to data structures. Again, what are the vector data types that can represent a single variable?
Logical, integer, double, and character.
What is a scale (ggplot2)?
Mapping of aesthetic values to data values - legends and axes display these mappings.
where scores <- c(John=25, Marge=34, Dan=24, Emily=20) what does this return?: scores[c(FALSE, TRUE, TRUE, FALSE)]?
Marge Dan 34 24
What is a matrix/matrices - two attributes?
Matrices are TWO-DIMENSIONAL, HOMOGENEOUS data structures.
Are data frames homogeneous?
No. They are two-dimensional, HETEROGENEOUS, rectangular structures.
What is the first layer for any ggplot2 graph?
The aesthetics layer. Aesthetics are the visually perceivable components of the graph. You map variables TO the aesthetics using the aes( ) function.
In ggplot2, variables are mapped to which aspects of the graph?
The aesthetics. Variables are mapped to aesthetics including the x and y axes, color, size and shapes of objects.
What is a coordinate system (ggplot2)?
The plane on which data are mapped on the graphic.
Coordinate systems in ggplot2 are...?
The plane on which data are mapped.
What happens when you perform an operation on two vectors of unequal length?
The shorter vector is recycled until the two vectors are the same length.
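The recycling examples from these cards, as one runnable base R snippet. suppressWarnings() is used here only so the uneven case does not interrupt a script.

```r
c(1, 2, 3) + 1                      # 1 is recycled three times: 2 3 4

c(1, 2, 3, 4, 5, 6) + c(1, 2)       # c(1, 2) is recycled three times: 2 4 4 6 6 8

# Lengths 3 and 2: R still recycles (10, 20, 10) but warns that the
# longer length is not a multiple of the shorter one.
uneven <- suppressWarnings(c(2, 3, 4) + c(10, 20))
uneven                              # 12 23 14
```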
What is the stat (ggplot2)?
The statistical transformation that summarizes data (e.g. mean, confidence intervals).
What is a factor?
The term factor refers to a statistical data type used to store categorical variables.
How can you subset a data frame?
You can use the same methods you used for matrices or lists. As a two-dimensional structure, data frames can be subset like matrices [rows, columns]. Ex.: mydata[3,2] selects row 3, column 2
What does this matrix look like? a <- matrix(1:6, nrow = 2)
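A base R sketch combining the matrix-style and list-style approaches, using the mydata example that appears elsewhere in this set. The name tall is hypothetical.

```r
mydata <- data.frame(
  diabetic = c(TRUE, FALSE, TRUE, FALSE),
  height = c(65, 69, 71, 73)
)

mydata[3, 2]              # matrix style: row 3, column 2 -> 71
mydata[, "height"]        # all rows of one column, selected by name
mydata[["height"]][2]     # list style: extract the column, then element 2 -> 69

# Logical subsetting on rows: keep rows where height exceeds 68.
tall <- mydata[mydata$height > 68, ]
tall
```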
[,1] [,2] [,3] [1,] 1 3 5 [2,] 2 4 6 Notice the positional markers - e.g. [,1]. What do those mean?
What function do you use to create a vector?
c( ), or the "concatenate" function combines values of common types together
#colMeans(mydata) is the means of columns. what is class(mydata$colMeans)
class data.frame or matrix
How do you list all the classes to which an object belongs?
The class( ) function lists all classes to which the object belongs.
How do you return the column names of the data frame?
colnames(data_frame) returns the column names of data_frame.
Assume your data_frame is named "mydata". Assign two column names: "Diabetic" and "Height."
colnames(mydata) <- c("Diabetic", "Height")
From whence do you download Base R and most R packages?
cran.r-project.org
How can you manually create a data frame?
data.frame( ) Ex. mydata <- data.frame(diabetic = c(TRUE, FALSE, TRUE, FALSE), height = c(65, 69, 71, 73))
What two commands examine the structure of a 2-dimensional object to get # of rows, columns?
dim() and str()
What is dplyr?
dplyr is a data management package in the tidyverse.
Give an example of a function in math.
f(x) = x^2