R Commands


matrix()

Creates a matrix from a vector. General form:
    MyMatrix <- matrix(vector, nrow = r, ncol = c, byrow = FALSE, dimnames = list(char_vector_rownames, char_vector_colnames))
byrow = FALSE is the default, so the matrix is filled column by column.
Generate a 5x4 numeric matrix:
    > y <- matrix(1:20, nrow = 5, ncol = 4)
    > y
         [,1] [,2] [,3] [,4]
    [1,]    1    6   11   16
    [2,]    2    7   12   17
    [3,]    3    8   13   18
    [4,]    4    9   14   19
    [5,]    5   10   15   20
Note: with byrow = TRUE, the first row would be 1 2 3 4.
Matrix with named rows and columns:
    > cells <- c(1, 26, 24, 68)
    > rnames <- c("R1", "R2")
    > cnames <- c("C1", "C2")
    > mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE, dimnames = list(rnames, cnames))
    > mymatrix
       C1 C2
    R1  1 26
    R2 24 68
For dimnames, the first item in the list gives the headings down the side of the matrix (row names), not across the top.
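
A minimal sketch of the byrow note above, reusing the card's 1:20 example. The names by_col and by_row are made up for illustration.

```r
# Same data, two fill orders (by_col and by_row are illustrative names)
by_col <- matrix(1:20, nrow = 5, ncol = 4)                # default: fill down the columns
by_row <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)  # fill across the rows

first_row_by_col <- by_col[1, ]  # 1 6 11 16
first_row_by_row <- by_row[1, ]  # 1 2 3 4
```

Only the fill order changes; both matrices hold the same 20 values.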

mfv()

Most frequent value (the mode). To use it, you must install and load the modeest package.
Returns multiple modes when there is a tie:
    > y <- c(2, 2, 2, 1, 1, 1)
    > mfv(y)
    [1] 1 2

as.data.frame vs as_data_frame

as.data.frame() (with a dot) gives a DATA FRAME:
    > t <- as.data.frame(c(x, y))
    > t
      c(x, y)
    1  kelsey
    2   scott
    3     zac
    4   crook
as_data_frame() (with an underscore) gives a TIBBLE:
    > s <- as_data_frame(c(x, y))
    > s
    # A tibble: 4 × 1
      value
      <chr>
    1 kelsey
    2 scott
    3 zac
    4 crook
Checking the classes:
    > class(t)
    [1] "data.frame"
    > class(s)
    [1] "tbl_df"     "tbl"        "data.frame"

load()

Reloads objects that were previously saved to a file with save().
    load(file, envir = parent.frame(), verbose = FALSE)
    > load("F:/MyImportedDallasLA.R")

seq()

With one argument, starts at 1 and goes up to your parameter:
    > seq(10)
     [1]  1  2  3  4  5  6  7  8  9 10
General form: seq(from, to, by)
    > seq(1, 10, by = 0.5)
     [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0

is_data_frame()

?????????

Selecting elements of matrices

A row: matrixName[2, ]
A column: matrixName[ , 2]
A value: matrixName[2, 2]
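
A short sketch of the three indexing forms above on a hypothetical 3x3 matrix. The matrix m and the variable names are assumptions for illustration.

```r
# Hypothetical 3x3 matrix, filled column by column (the default)
m <- matrix(1:9, nrow = 3, ncol = 3)

second_row <- m[2, ]   # 2 5 8
second_col <- m[, 2]   # 4 5 6
one_value  <- m[2, 2]  # 5

# drop = FALSE keeps the result a 1-row matrix instead of a plain vector
row_as_matrix <- m[2, , drop = FALSE]
```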

Theta-Join

A more general version of the equi-join.
In an equi-join the condition is restricted to 'equals'; in a theta-join any comparison operator may be used.
General form: JOIN R AND S WHERE C
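
One way to imitate a theta-join in R, as a sketch only: build the Cartesian product with merge(..., by = NULL), then keep the rows that satisfy the condition. The relations R and S, their columns a and b, and the condition a < b are made up for illustration.

```r
# Hypothetical relations with one attribute each
R <- data.frame(a = c(1, 5, 9))
S <- data.frame(b = c(4, 6))

# by = NULL asks merge() for the Cartesian product (3 x 2 = 6 rows)
cross <- merge(R, S, by = NULL)

# Theta-join condition C: a < b (any comparison operator would do)
theta <- cross[cross$a < cross$b, ]
```

Replacing `<` with `==` in the condition turns this into an equi-join.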

Measures of center

Mean: greatly affected by a few large or a few small values in an otherwise regular data set (not robust).
Median: not greatly affected by extreme values (robust).
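
A small illustration of robustness, using made-up numbers: adding one extreme value moves the mean a lot and leaves the median where it was.

```r
# Made-up data: five ordinary values, then the same values plus one outlier
x         <- c(10, 12, 11, 13, 12)
x_outlier <- c(x, 500)

mean_before   <- mean(x)            # 11.6
median_before <- median(x)          # 12
mean_after    <- mean(x_outlier)    # 93
median_after  <- median(x_outlier)  # 12
```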

Data Frame

A table or two-dimensional array-like structure in which each column contains the values of one variable and each row contains one set of values, one from each column.
Different columns can have different modes (numeric, character, etc.), but all columns of a data frame need to be vectors of equal length.
You can't use the same name for two different variables.
Looks similar to a matrix, except that the row names default to 1 2 3 instead of [1,] [2,] [3,].

Tibble

A type of data frame created by using tibble().
It includes all the useful features of a traditional data frame and excludes the more problematic ones:
Printing shows only the first 10 rows and the columns that fit on the screen (whereas data frames print all rows).
It doesn't convert characters to factors, change the names of variables, or create its own row names.

ts(z)

Converts a numeric vector into an R time series object.
Then call plot() on the resulting time series object, e.g. plot(ts(z)), to get a time plot.

barplot()

Bar graphs. Bars should never touch, as each one represents a category.
    barplot(H, xlab, ylab, main, names.arg, col, border)
H: a vector (or matrix) containing the numeric values used in the bar chart
NAMES.ARG: a vector of names appearing under each bar

Union

Binary.
Creates a relation consisting of all the distinct tuples in the two relations.
Can also be used to add tuples to a relation.
General form: R UNION S

Cartesian Product

Binary.
Creates a relation consisting of every tuple of one relation combined with every tuple of the other relation.
General form: R TIMES S

Difference

Binary.
The difference of relation S from relation R is a relation with the tuples that are in R but not in S.
The relations must have the same number of attributes.
General form: R MINUS S
Example: if R has PSTAT 131, PSTAT 10, PSTAT 170, and PSTAT 140, and S has PSTAT 141, PSTAT 170, and PSTAT 231, then R MINUS S returns PSTAT 131, PSTAT 10, and PSTAT 140 (PSTAT 170 is the only tuple in both).
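
The PSTAT example can be checked in R with setdiff(), treating each single-attribute relation as a character vector. This is a sketch of the idea, not a relational database operation.

```r
# The two relations from the card, as character vectors
R <- c("PSTAT 131", "PSTAT 10", "PSTAT 170", "PSTAT 140")
S <- c("PSTAT 141", "PSTAT 170", "PSTAT 231")

r_minus_s <- setdiff(R, S)    # elements of R that are not in S
in_both   <- intersect(R, S)  # the only shared element
```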

tail()

Print just the last n rows tail(data, n=1) Prints the last row

nominal

Categories only Names, labels Ex: Hair color Can't be organized in an ordering scheme

ordinal

Categories with some order/can be ranked in logical order Ex: grades, level of education, ranking of food preference (excellent, good, poor)

rbind()

Combines the rows of data frames:
    new_faculty_profile <- data.frame(First = c("tom", "Jerry"), Last = c("Patel", "Murphy"), Department = c("Ling", "Philosophy"))
    all.faculty.profiles <- rbind(faculty.profile, new_faculty_profile)
Gives us more rows but keeps the number of columns: it adds the second table below the first.
*MUST have the same number of columns to work (even if the number of rows is different)*

cbind()

Combines columns:
    First <- c("Ann", "Paul", "Bob")
    Last <- c("Smith", "Liu", "Lopez")
    Department <- c("Math", "Physics", "Bio")
    faculty.profile <- cbind(First, Last, Department)
Fills by column (gives us more columns but keeps the number of rows): it adds the second object onto the side of the first.
Note that cbind() on plain vectors, as here, returns a matrix; use data.frame() or as.data.frame() if a data frame is needed.
*MUST have the same number of rows to work (even if the number of columns is different)*

ratio

Counted values Differences AND a natural (zero) starting point Ex: prices of textbooks, age

interval

Counted values.
Differences but no natural starting point: the interval between measurements is the same, but the ratio between measurements is not known.
Ex: years (1000, 2000, 1726, etc.), temperature (can start from any value, even negative)

function()

Creates functions, which are stored in R as objects just like anything else.
Class: "function"
    f <- function(arguments) {
      # do something
    }

tabulate()

Counts the number of times each integer occurs in a vector. Goes from 1 to the number of bins, and counts 0 if a value isn't there.
    tabulate(bin, nbins = max(1, bin, na.rm = TRUE))
bin: a numeric vector (of positive integers), or a factor. Long vectors are supported.
nbins: the number of bins to be used.
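
A short sketch of how the bins behave, using a made-up vector bin.

```r
# Made-up input: the value 3 occurs twice, 1 and 4 never occur
bin <- c(2, 3, 3, 5)

counts_all   <- tabulate(bin)             # bins 1..5 -> 0 1 2 0 1
counts_three <- tabulate(bin, nbins = 3)  # bins 1..3 -> 0 1 2 (the 5 is ignored)
```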

Intersection

Defines a relation composed of all the tuples which occur in both relations General Form: R INTERSECT S

repeat loop

Executes a sequence of statements many times and abbreviates the code that manages the loop variable.
Keeps going indefinitely, so it needs a break statement that fires once a stop condition is met.
    A_Loop <- c("hello", "loop")
    count <- 2
    repeat {
      print(A_Loop)
      count <- count + 1
      if (count > 5) {
        break
      }
    }

ordered()

FOR ORDINAL DATA. Having ordered data is helpful when creating tables or printing results.
To create an ordered factor in R, use the factor() function with the argument ordered = TRUE:
    > ordered.status <- factor(status, levels = c("Lo", "Med", "Hi"), ordered = TRUE)
    > ordered.status
    [1] Lo  Hi  Med Med Hi
    Levels: Lo < Med < Hi
Or use the ordered() function:
    > x <- factor(c(1, 2, 3, 3, 5, 3, 2, 4, NA))
    > ordered(x)
    [1] 1    2    3    3    5    3    2    4    <NA>
    Levels: 1 < 2 < 3 < 4 < 5
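
A sketch of what the ordering buys you: ordered factors can be compared with < and >. The status vector is an assumption, chosen to match the output printed on the card.

```r
# Assumed input vector, consistent with the printed output Lo Hi Med Med Hi
status <- c("Lo", "Hi", "Med", "Med", "Hi")
ordered.status <- factor(status, levels = c("Lo", "Med", "Hi"), ordered = TRUE)

below_hi <- ordered.status < "Hi"         # comparisons respect Lo < Med < Hi
codes    <- as.integer(ordered.status)    # position of each value in the level order
```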

length()

Gets or sets the length of vectors (including lists, and data frames because data frames are lists) and factors, and of any other R object for which a method has been defined.
For a data frame, the length is the number of columns:
    > mydata <- data.frame(d, e, f, row.names = c("R1", "R2", "R3", "R4"), check.rows = FALSE)
    > length(mydata)
    [1] 3
This would be 2 if the data frame held only d and e.

hist(z)

Histograms (for quantitative data). The bars touch, unlike in a bar graph.
Rule of thumb: start with 5 to 10 bins ("Sturges" is an algorithm R uses to decide your break points for you).
    hist(x, breaks = "Sturges", freq = NULL, probability = !freq, col = NULL, border = NULL, main = paste("Histogram of", xname), xlim = range(breaks), ylim = NULL, xlab = xname, ylab, ...)
FREQ: logical; if TRUE (the default), the y axis shows the frequency (count) of each bin; if FALSE, probability densities are plotted on the y axis, showing how likely it is that an interval of x-axis values occurs.
*Note: you can reorder the named arguments as long as your vector comes first.

stem(z, scale = )

Make sure you put a 0 in front of the single digits:
    > z <- c(09, 09, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70)
    > stem(z, scale = 2)
Scale controls the length of the plot. With the default (scale = 1), the numbers left of the bar go up by TWO, but we only want them to go up ONE at a time (scale = 2).
0-101: scale = 2
102-151: scale = 3
152-201: scale = 4
202-251: scale = 5
252-299: scale = 6
Stemplots preserve the data: if you're given one but don't have the original data, you can reconstruct it by reading the values off the stemplot (a histogram doesn't allow this).

return

Often we need a function to do some processing and hand back the result. This is accomplished with the return() function.
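
A minimal sketch of return() in use. The function name area_of_circle and its behavior for negative input are made up for illustration.

```r
# Hypothetical function: area of a circle with radius r
area_of_circle <- function(r) {
  if (r < 0) {
    return(NA_real_)  # early exit: a negative radius has no area
  }
  area <- pi * r^2
  return(area)        # hand the result back to the caller
}

result <- area_of_circle(2)
```

If return() is left out, R returns the value of the last expression evaluated in the function.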

Equi-Join

More flexible than a natural join.
Defines a relation of tuples, each a combination of a pair of tuples taken from two relations (one from each), where the specified attributes have the same value in both tuples of the pair.
General form: JOIN R AND S WHERE R.A = S.B (A and B are attributes)

Import Wizard

Name: choose a name for the imported file
Skip: do we want to skip any of the rows? Maybe a header we don't want included in the data?
First Row As Names: these will usually be the column headers
Trim Spaces: usually yes
Open Data Viewer: a matter of preference; shows a new R file
Delimiter: comma
Ignore the rest and leave them as they are

Keys

Primary: each tuple has a unique identifier. It's a column or combination of columns that uniquely identifies a record. There can only be one per table.
Candidate: any column or combination of columns that can qualify as a unique key. There can be several in one table; the primary key is chosen from among them, and the ones not chosen still identify records uniquely.
Foreign: an attribute whose values match primary key values in the related table. The names of the attributes don't have to be the same, but they have to belong to the same domain.

head()

Print just the first n rows head(data, n=2) Prints the first two rows

IQR()

Q3-Q1
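
A quick check, on made-up data, that IQR() agrees with Q3 minus Q1 as computed by quantile() with its default settings.

```r
# Made-up sample
x <- c(1, 3, 5, 7, 9, 11, 13)

q <- quantile(x, c(0.25, 0.75))    # Q1 and Q3 (default quantile type)
iqr_manual  <- unname(q[2] - q[1]) # Q3 - Q1
iqr_builtin <- IQR(x)
```

Hand methods taught in class for finding quartiles can give slightly different values than quantile() on small samples.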

dotchart(z)

Really only useful for small data sets. Shows the entire raw data as dots above their values.
    > z <- c(11, 12, 14, 17, 21)
    > dotchart(z)

The Structural Part of the Relational Model

Relation: the table itself (order doesn't matter)
Attribute: column (the first row is the attribute name, the rest are the values)
Domain: pre-defined value scope; each attribute of a relation is associated with a domain (CourseID could be the domain associated with Course#)
Tuple: row

while loop

Repeats a statement or group of statements while a given condition is true.
It tests the condition before executing the code in the loop (which is what makes it different from a repeat loop).
Doesn't need a break statement! It keeps going until the condition becomes false.
    B_loop <- c("Hello", "while loop")
    count <- 2
    while (count < 7) {
      print(B_loop)
      count <- count + 1
    }

date()

Returns today's date and current time date() [1] "Wed May 09 10:35:23 2018"

for loop

Similar to the while loop, but instead of testing a condition it runs the loop body once for each element of a vector (or list).
    c_loop <- letters[1:4]
    for (i in c_loop) {
      print(i)
    }

The Relational Model Parts

Structural: the building blocks from which databases are constructed
Manipulative: the operations for retrieving data and for updating the database
Integrity: the collection of rules that all databases must obey

Projection

Unary.
Returns only the column(s) that you are projecting on.
Removes duplicates, because by definition a relation cannot have duplicate tuples.
Example: PROJECT ENROLMENT ON Course# will return a single column: PSTAT 131, PSTAT 10, PSTAT 170, PSTAT 140
General form: PROJECT R ON X (X consists of attributes)

%in%

Membership test: returns TRUE for each element on the left that is found in the vector on the right.
    if ("hello" %in% x)
    > 6:10 %in% 1:36
    [1] TRUE TRUE TRUE TRUE TRUE

Outer Join

The resulting relation of an outer join is defined on relations R and S Each tuple is formed either by joining a tuple of R with a tuple of S for which C is true OR by joining other tuples of R and S with null value General Form: OJOIN R AND S WHERE C
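
In R, merge() with all = TRUE behaves like an outer join on an equality condition. A sketch with made-up relations R and S that share an id column:

```r
# Hypothetical relations sharing the attribute id
R <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
S <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

inner <- merge(R, S, by = "id")              # only ids present in both: 2 and 3
outer <- merge(R, S, by = "id", all = TRUE)  # ids 1 to 4, unmatched sides filled with NA
```

Tuples with no partner (id 1 in R, id 4 in S) still appear in outer, with NA standing in for the null values.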

Semi-Join

The set of all the tuples of the first which join with the tuples of the second General form: SJOIN R AND S WHERE C

Restrict

Unary Used to define a new relation that contains only those tuples (rows) of a relation for which some condition is true The resulting relation has the same attributes as those in the first relation General form RESTRICT R WHERE C (c is condition)

data()

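Indicates to R the dataset of interest: loads a dataset that ships with R or with an installed package, e.g. data("mtcars").
The dataset name can be put in quotes.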

scan()

This assumes that the data for successive time points is in a simple text file with one column.
The first three lines of the text file are not required, so SKIP THEM:
    > kings <- scan("C:\\Users\\holmes\\Documents\\Documents\\PSTAT 10 COURSE\\MYLECTURES\\kings.dat.txt", skip = 3)
    Read 42 items
(42 because there are 42 numbers)
General form: scan("location and name of data file", skip = number of lines to skip)

select = -c(d:b)

To delete a run of adjacent columns in a data frame.
Part of the subset function:
    editedAPdata4 <- subset(airport_data, select = -c(Flight:FirstClass))

is.na()

To find missing values Returns boolean values returns TRUE if it is in fact missing a value (na)

select = ()

To keep only selected columns in a data frame.
Part of the subset function:
    editedAPdata3 <- subset(airport_data, select = c(FirstClass, Economy))

paste()

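To merge columns of a data frame into one character vector, pass the columns as separate arguments:
    MergedAirport <- paste(airport_data$FirstClass, airport_data$Economy)
Each element of the result is the two values joined with a space (using + instead of a comma would add the columns numerically).
Or, to get a space between hello and world:
    > x <- "hello"
    > y <- "world"
    > paste(x, y)
    [1] "hello world"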

select = -column name

To remove a column in a data frame.
Part of the subset function:
    editedAPdata <- subset(airport_data, select = -FirstClass)

select = -c()

To remove more than one column in a data frame.
Part of the subset function:
    editedAPdata2 <- subset(airport_data, select = -c(FirstClass, Economy))

Natural Join

Used to combine two relations on the basis of all attributes that occur in both of them Tuples are combined where the common attributes have the same value General form: JOIN R AND S
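
In R, merge() with no by argument joins on all the column names the two data frames share, which matches the natural join described above. A sketch with made-up relations; the names enrolment, course, Student, and Units are assumptions for illustration.

```r
# Hypothetical relations; Course is the only attribute they have in common
enrolment <- data.frame(Student = c("Ann", "Bob", "Cy"),
                        Course  = c("PSTAT 10", "PSTAT 131", "PSTAT 10"))
course    <- data.frame(Course = c("PSTAT 10", "PSTAT 131"),
                        Units  = c(5, 4))

# No 'by' given: merge() joins on the shared column name(s), here Course
joined <- merge(enrolment, course)
```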

Division

Binary.
Defines a relation comprising attributes which are in R but not in S.
S must be defined on attributes that are also in R.
Needs both of them but not exclusively.
General form: DIVIDE R BY S

is.data.frame

checks if an object is a data frame returns true if it is, false if not

is.tibble

checks if an object is a tibble returns true if it is, false if not

read.csv()

Writing our own code with read.csv() creates a DATA FRAME:
    data <- read.csv("import_text.txt")
Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.

read_csv()

The import wizard (or our own code using read_csv(), from the readr package) creates a TIBBLE:
    data <- read_csv("import_text.txt")

data.frame()

    data.frame(vectors, row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = FALSE)
stringsAsFactors: the default was TRUE before R 4.0.0 and is FALSE from R 4.0.0 onward.
check.rows: if TRUE, then the rows are checked for consistency of length and names.
check.names: if TRUE, then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated.
Ex:
    > d <- c(2, 3, 4, 6)
    > e <- c("red", "white", "red", NA)
    > f <- c(TRUE, TRUE, TRUE, FALSE)
    > mydata <- data.frame(d, e, f)
    > mydata
      d     e     f
    1 2   red  TRUE
    2 3 white  TRUE
    3 4   red  TRUE
    4 6  <NA> FALSE

boxplot()

Displays the five-number summary in a box-and-whisker plot.
~ is used to separate the left- and right-hand sides in a model formula (in this case weight and cylinders):
    > boxplot(wt ~ cyl, data = mtcars, main = toupper("compare vehicle weight to number of cylinders"), xlab = "Number of Cylinders", ylab = "Weight", col = "purple")
Gives us a vertical boxplot (vertical is the default; use horizontal = TRUE to change it).

as_data_frame()

just verifies that the list is structured correctly (i.e. named, and each element is same length) then sets class and row name attributes. as_data_frame is considerably simpler/faster than as.data.frame, making it more suitable for use when you have lists includes methods for data frames, tibbles (returns unchanged input), lists, matrices, and tables. Other types are first coerced via as.data.frame() with stringsAsFactors = FALSE.

objects()

list user-defined objects

which.max

locates the index of the element of a vector that is the largest

array()

Matrices with more than 2 dimensions (creates multiple tables). Each row is the same length, and likewise for each column and the other dimensions.
    array(data = NA, dim = length(data), dimnames = NULL)
    result <- array(c(vector1, vector2), dim = c(3, 3, 2), dimnames = list(row.names, column.names, matrix.names))
Data: has to be provided (if one of the vectors is longer it's ok, but the extra values are not going to show up in the array)
Dim: has to be created (rows, columns, tables)
Dimnames: names of the dimensions
If we create an array of dimension (2, 3, 4), then it creates 4 rectangular matrices, each with 2 rows and 3 columns.

range()

In statistics the range is max value - min value, but R's range() gives the minimum and then the maximum rather than the difference:
    > range(HowMuch)
    [1]  7.7 45.5
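
To get the single number max minus min, wrap range() in diff(). The vector below is made up, except that its minimum and maximum match the card's output.

```r
# Made-up data whose min and max match the card (7.7 and 45.5)
HowMuch <- c(7.7, 20.1, 45.5, 12.3)

min_max <- range(HowMuch)        # 7.7 45.5
spread  <- diff(range(HowMuch))  # max - min
```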

Calculating mean without missing values

mean(x, na.rm = TRUE)

Right Skew

mean>median tail is to the right, curve towards the left

Symmetric

median=mean

Left Skew

median>mean tail is to the left, curve towards the right

merge()

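Merges two datasets. By default the data frames are merged on the columns with names they both have (finds the intersection between two different sets of data).
If there are no shared column names, as with the two plain vectors here, the result is every combination of the two inputs:
    > A <- c(1:3)
    > B <- c(2:4)
    > merge(A, B)
      x y
    1 1 2
    2 2 2
    3 3 2
    4 1 3
    5 2 3
    6 3 3
    7 1 4
    8 2 4
    9 3 4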

nlevels()

number of levels of a factor nlevels(gender) [1] 2

pie()

    pie(x, labels, radius, main, col, clockwise)
X: a vector containing the numeric values used in the pie chart
LABELS: used to give descriptions to the slices
RADIUS: the pie is drawn centered in a square box whose sides range from -1 to 1
CLOCKWISE: a logical value indicating whether the slices are drawn clockwise or counter-clockwise (default = FALSE: counter-clockwise)
Ex:
    slices <- c(10, 12, 4, 16, 8)
    lbls <- c("US", "UK", "Australia", "Germany", "France")
    pie(slices, labels = lbls, main = "Pie Chart of Countries")
*Note: you can reorder the named arguments as long as your vector comes first.

print()

Prints its argument and returns it invisibly. It is a generic function, which means that new printing methods can easily be added for new classes.
    > x <- c("kelsey", "scott")
    > print(x)
    [1] "kelsey" "scott"
    > y <- c("zac", "crook")
    > print(c(x, y))
    [1] "kelsey" "scott"  "zac"    "crook"

read.table

    read.table(file, header = TRUE or FALSE, text = 'insert text here')
Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.
header: indicates whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format: header is set to TRUE if and only if the first row contains one fewer field than the number of columns.

rm(list = ls())

remove all variables in the environment

na.omit

Removes any missing values:
    > x <- c(1:4, NA, 6:7, NA)
    > x1 <- na.omit(x)
In a data frame, R will remove the entire row that has an NA, because the columns must stay equal in length to qualify as a data frame.
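
A sketch of the data frame case, using a cut-down, made-up version of the d/e columns from the data.frame() card.

```r
# Made-up data frame with one missing value in column e
df <- data.frame(d = c(2, 3, 4, 6),
                 e = c("red", "white", "red", NA))

clean <- na.omit(df)  # the whole fourth row is dropped, not just the NA cell
```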

replace missing values with 0

replace(x, is.na(x), 0) [1] 1 2 3 4 0 6 7 0

table()

Returns the frequency of each value:
    > x <- c("yes", "no", "no", "yes", "yes")
    > table(x)
    x
     no yes
      2   3

ls()

returns a vector of character strings giving the names of the objects in the specified environment When invoked with no argument at the top level prompt, ls shows what data sets and functions a user has defined.

class()

returns type of object

unique()

returns unique elements (removes duplication)

levels()

Shows the levels (categories):
    > gender <- factor(c("male", "female", "female", "male"))
    > levels(gender)
    [1] "female" "male"
R will assign 1 to the level "female" and 2 to the level "male". This is because f comes before m alphabetically, even though the first element in the vector was male.

break

Stops the iterations and transfers the flow of control to just outside the loop.
In a nested looping situation, this statement exits from the innermost loop that is being evaluated.

str()

Structure: a display of the structure of an R object.
    > z <- c("bob", "ann", "jon", "ivy")
    > str(z)
     chr [1:4] "bob" "ann" "jon" "ivy"

finding total number of missing values

sum(is.na(x))

Time Plots

the variable is plotted AGAINST TIME Time: x axis Variable of interest: y axis scan() and ts() used with them

as.data.frame

To coerce another object to a data frame:
    df2 <- as.data.frame(matrix(1:12, 3, 4), 1:3)
If a list is supplied, each element is converted to a column in the data frame. Similarly, each column of a matrix is converted separately.

as.tibble

to coerce other object to a tibble as_tibble does the same thing

apply()

Returns a vector, array, or list of values obtained by applying a function to the margins of an array or matrix. Lets us work on data frames and matrices row by row, column by column, or entry by entry.
    apply(x, margin, fun)
X: an array (including matrices)
Margin: a vector giving the subscripts which the function will be applied over. Ex: for a matrix, MARGIN = 1 indicates rows, MARGIN = 2 indicates columns, MARGIN = c(1, 2) indicates rows and columns (every entry).
Fun: the function to be applied. It can be passed by name (sum) or as a quoted string ("sum"); operators such as "+" must be quoted.
Ex:
    # Return the sum of each of the columns
    apply(matrixName, 2, sum)
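
A sketch of the three MARGIN settings on a small made-up matrix m.

```r
# Made-up 2x3 matrix: columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)                         # one value per row: 9 12
col_sums <- apply(m, 2, sum)                         # one value per column: 3 7 11
doubled  <- apply(m, c(1, 2), function(v) v * 2)     # function applied to every entry
```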

factor()

Used for nominal data. The values need to be sortable.
    > x <- factor(c("yes", "no", "no", "yes", "yes"))
    > levels(x)
    [1] "no"  "yes"
The levels go in ABC order.

vector()

vector("class of object", length) vector("character", length = 10)

identify location of missing values in vector

which(is.na(x))

read_excel

works the same as reading a csv file but instead with an excel file

write.csv

    write.csv(books, "books.csv", row.names = FALSE)
row.names = TRUE is the default, which keeps the row names.
Unless a full path is given, the file is saved in the working directory.
Base R has no write.excel(); writing Excel files needs an add-on package such as writexl.

list()

    x <- list("dog", 3, "cat", "mouse", 7, 12, 9, "chicken")
By using list, we can store different types of objects without having to change their types. A list prints its elements on separate lines because of this possibility of mixed elements.
Lists are sometimes called generic vectors, because the elements of a list can be any type of R object, even lists containing further lists. They allow you to gather a variety of possibly unrelated objects under one name.

show values that are not NA

x[!is.na(x)] [1] 1 2 3 4 6 7

