R programming


the chi-squared distribution

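Used in R the same way as the normal distribution and the t-distribution: dchisq(), pchisq(), qchisq() and rchisq() give the density, distribution function, quantile function and random generation.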

dim() and names()

dim() shows the dimensions of a data frame, namely the number of rows and columns. names() shows the names of the variables in the data set.
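
A minimal sketch of both calls, assuming a small made-up data frame called toy (hypothetical, not from the labs):

```r
# Hypothetical toy data frame, only for illustration.
toy <- data.frame(
  year  = c(2011, 2012, 2013),
  boys  = c(10, 12, 9),
  girls = c(11, 10, 13)
)

dims      <- dim(toy)    # c(number of rows, number of columns)
var_names <- names(toy)  # names of the variables (columns)

print(dims)
print(var_names)
```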

check the nearly normal condition in linear regression (are the residuals nearly normally distributed and centered at 0?)

1. Using a histogram of the residuals, e.g., ggplot(data = m1, aes(x = .resid)) + geom_histogram(binwidth = 25) + xlab("Residuals") 2. Using a normal probability plot of the residuals: ggplot(data = m1, aes(sample = .resid)) + stat_qq() Note that the syntax for making a normal probability plot is a bit different from what you're used to seeing: we set `sample` equal to the residuals instead of `x`, and we set a statistical method `qq`, which stands for "quantile-quantile", another name commonly used for normal probability plots.

plot_ss()

An interactive function that will generate a scatterplot of two variables, then allow the user to click the plot in two locations to draw a best fit line. For instance, plot_ss(x = at_bats, y = runs, data = mlb11, showSquares = TRUE)

as.numeric() and as.character()

As the names indicate, as.numeric() changes a variable into a numerical variable; as.character() changes a variable into a character variable. Note: when changing a factor variable (categorical variable) into a numerical variable, using as.numeric() alone is not enough (you can see this by grouping on the converted variable). You should change the factor into character first, then change it into numerical data before summarising it. The reason is that factor entries have two parts: the text we see on the screen, and a numeric order (remember how 10 was coming between 1 and 2 because of the alphabetical order). When we say "turn this into a number", R uses the numeric order in which it stores the values to do that conversion, as opposed to the names of the levels of the categorical variable. Hence, we need a conversion method that will use the text strings that label the levels, as opposed to the storage order of these levels. We can do this by first saving the variable as a character variable, and then turning it into a number: selected_nzes2011 <- selected_nzes2011 %>% mutate(numlikenzf = as.numeric(as.character(jnzflike)))
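
The difference between the two conversions can be seen on a tiny example. This is a sketch with a made-up factor called ratings (hypothetical), not the NZES data:

```r
# Hypothetical ratings stored as a factor; the labels are text.
ratings <- factor(c("0", "10", "5", "10"))

print(levels(ratings))  # levels are sorted as text, so "10" comes before "5"

codes  <- as.numeric(ratings)                # internal level codes, not the labels
values <- as.numeric(as.character(ratings))  # the numbers spelled out by the labels

print(codes)   # 1 2 3 2
print(values)  # 0 10 5 10
```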

data.frame()

A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)

The binomial distribution {stats}

Density, distribution function, quantile function and random generation for the binomial distribution with parameters size and prob. This is conventionally interpreted as the number of 'successes' in size trials.
dbinom(x, size, prob, log = FALSE)
pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
rbinom(n, size, prob)
x, q: vector of quantiles.
p: vector of probabilities.
n: number of observations. If length(n) > 1, the length is taken to be the number required.
size: number of trials (zero or more).
prob: probability of success on each trial.
log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
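
A short sketch of the four functions, assuming 4 flips of a fair coin (the numbers are chosen only for illustration):

```r
# P(X = 2) in 4 fair coin flips: choose(4, 2) / 2^4
p_exact <- dbinom(2, size = 4, prob = 0.5)

# P(X <= 2): point probabilities for 0, 1 and 2 successes added up
p_cum <- pbinom(2, size = 4, prob = 0.5)

# Smallest x with P(X <= x) >= 0.5
q_med <- qbinom(0.5, size = 4, prob = 0.5)

# Five simulated experiments of 4 flips each (values differ from run to run)
draws <- rbinom(5, size = 4, prob = 0.5)

print(c(p_exact = p_exact, p_cum = p_cum, q_med = q_med))
```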

The normal distribution {stats}

Density, distribution function, quantile function and random generation for the normal distribution with mean equal to mean and standard deviation equal to sd.
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
x, q: vector of quantiles.
p: vector of probabilities.
n: number of observations. If length(n) > 1, the length is taken to be the number required.
mean: vector of means.
sd: vector of standard deviations.
log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
pnorm() takes a value and returns its percentile; qnorm() takes a percentile and returns the value (see the definitions of p and q).
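
The pnorm()/qnorm() relationship can be checked with a quick round trip. A minimal sketch on the standard normal, with 1.96 picked as an illustrative value:

```r
z <- 1.96

pct  <- pnorm(z)    # value -> percentile, roughly 0.975
back <- qnorm(pct)  # percentile -> value, recovers 1.96

# Upper tail instead of lower tail: P(X > 1.96)
upper <- pnorm(z, lower.tail = FALSE)

print(c(pct = pct, back = back, upper = upper))
```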

geom_jitter()

The jitter geom is a convenient shortcut for geom_point(position = "jitter"). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets. https://stackoverflow.com/questions/39255781/what-is-difference-between-geom-point-and-geom-jitter-in-simple-language-in-r Overplotting is when one or more points are in the same place (or close enough to the same place) that you can't look at the plot and tell how many points are there. Two (not mutually exclusive) cases often lead to overplotting: 1. Noncontinuous data - e.g., if x or y are integers, then it will be difficult to tell how many points there are. 2. Lots of data - if your data is dense (or has regions of high density), then points will often overlap even if x and y are continuous. Jittering is adding a small amount of random noise to data. It is often used to spread out points that would otherwise be overplotted. It is only effective in the noncontinuous data case, where overplotted points are typically surrounded by whitespace - jittering the data into the whitespace allows the individual points to be seen. It effectively un-discretizes the discrete data. With high-density data, jittering doesn't help because there is no reliable area of whitespace around overlapping points. Other common techniques for mitigating overplotting include using smaller points, using transparency, and binning data (as in a heat map).
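
To see the idea without plotting, here is a rough sketch that uses base R's jitter() instead of geom_jitter() (a swapped-in stand-in so the example needs no packages; the data are made up):

```r
set.seed(42)  # arbitrary seed so the noise is reproducible

x     <- rep(1:3, each = 50)       # discrete x: only 3 distinct positions
x_jit <- jitter(x, amount = 0.2)   # add uniform noise within 0.2 of each value

print(length(unique(x)))      # many points share the same position
print(length(unique(x_jit)))  # points are spread out after jittering
```

In a ggplot2 plot, replacing geom_point() with geom_jitter() applies the same kind of noise when the points are drawn.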

predict() - Using a linear regression model to predict the response values we want to know

First, we need to create a new data frame that holds the explanatory values used in the lm model: newprof <- data.frame(gender = "male", bty_avg = 3) Then, we can do the prediction using the `predict` function (m_bty_gen is an lm model containing the gender and bty_avg variables): predict(m_bty_gen, newprof) We can also construct a prediction interval around this prediction, which will provide a measure of uncertainty around the prediction: predict(m_bty_gen, newprof, interval = "prediction", level = 0.95)
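
The same steps can be tried on data that ships with R. This is a sketch that assumes the built-in mtcars data as a stand-in for the evals data; new_car is a hypothetical observation:

```r
# Fit a model on the built-in mtcars data (stand-in for the lab data).
m <- lm(mpg ~ wt, data = mtcars)

# New data frame with the explanatory variable(s) used in the model.
new_car <- data.frame(wt = 3)

point <- predict(m, new_car)  # point prediction only
pred  <- predict(m, new_car, interval = "prediction", level = 0.95)

print(pred)  # columns: fit (prediction), lwr and upr (interval bounds)
```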

lm()

lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance. For example, m1 <- lm(runs ~ at_bats, data = mlb11) The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary() function.
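
A small sketch of fitting and inspecting a model, assuming a made-up data frame called toy (hypothetical values, not mlb11):

```r
# Hypothetical toy data, roughly following y = 3 + 2x.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(5.1, 6.9, 9.2, 10.8, 13.0)
)

fit <- lm(y ~ x, data = toy)

coefs       <- coef(fit)             # intercept and slope
fit_summary <- summary(fit)          # coefficient table, R-squared, etc.
r2          <- fit_summary$r.squared

print(coefs)
print(r2)
```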

Return rows with matching conditions: filter() {dplyr}

For example, rdu_flights <- nycflights %>% filter(dest == "RDU") ggplot(data = rdu_flights, aes(x = dep_delay)) + geom_histogram() The output is a histogram of rdu_flights with dep_delay on the x-axis and the count of flights in each delay bin on the y-axis.

ggpairs() - Make a matrix of plots with a given data set

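For example, ggpairs(evals, columns = 13:19) plots every pair of variables in columns 13 to 19 of evals against each other in one grid. ggpairs() comes from the GGally package.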

The number of observations in the current group: n() {dplyr}

For example, nycflights %>% group_by(origin) %>% summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>% arrange(desc(ot_dep_rate)) The summarise step is telling R to count up how many records of the currently found group are on time - sum(dep_type == "on time") - and divide that result by the total number of elements in the currently found group - n() - to get a proportion, then to store the answer in a new variable called ot_dep_rate.

Encode a numerical variable into a categorical variable

For example, ggplot(nycflights, aes(x = factor(month), y = dep_delay)) + geom_boxplot() Side-by-side box plots require a categorical variable on the x-axis. However, in the data frame month is stored as a numerical variable (numbers 1 - 12). We can force R to treat this variable as categorical, which R calls a factor, with factor(month).

Group by one or more variables: group_by() {dplyr}

For example, rdu_flights %>% group_by(origin) %>% summarise(mean_dd = mean(dep_delay), sd_dd = sd(dep_delay), n = n()) The output is the summary data for each group.
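
The filter, group_by, summarise and n() steps can be tried on data that ships with R. A sketch assuming the dplyr package is installed and using the built-in mtcars data as a stand-in for nycflights:

```r
library(dplyr)

manual_by_cyl <- mtcars %>%
  filter(am == 1) %>%        # keep only rows that match the condition
  group_by(cyl) %>%          # one group per number of cylinders
  summarise(
    mean_mpg = mean(mpg),    # one summary value per group
    sd_mpg   = sd(mpg),
    n        = n()           # number of rows in the current group
  )

print(manual_by_cyl)
```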

seq()

Generate regular sequences. seq is a standard generic with a default method. seq.int is a primitive which can be much faster but has a few restrictions. seq_along and seq_len are very fast primitives for two common cases. For more details, see '?seq' in RStudio. Example: d <- data.frame(p = seq(0, 1, 0.01)) makes a data frame d with a column p that is a sequence from 0 to 1 with each number separated by 0.01.
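
A quick sketch of the three most common forms (the object names are only illustrative):

```r
p <- seq(0, 1, by = 0.01)  # 0.00, 0.01, ..., 1.00
d <- data.frame(p = p)     # one column named p

idx <- seq_len(5)                    # 1 2 3 4 5
pos <- seq_along(c("a", "b", "c"))   # 1 2 3, one index per element

print(length(p))
print(idx)
print(pos)
```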

inference() {statsr}

Hypothesis tests and confidence intervals: For example, inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical") Let's pause for a moment to go through the arguments of this custom function. The first argument is y, which is the response variable that we are interested in: weight. The second argument is the explanatory variable, x, which is the variable that splits the data into two groups, smokers and non-smokers: habit. The third argument, data, is the data frame these variables are stored in. Next is statistic, which is the sample statistic we're using, or similarly, the population parameter we're estimating. In future labs we can also work with "median" and "proportion". Next we decide on the type of inference we want: a hypothesis test ("ht") or a confidence interval ("ci"). When performing a hypothesis test, we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other. The alternative hypothesis can be "less", "greater", or "twosided". Lastly, the method of inference can be "theoretical" or "simulation" based.

stat_smooth() or geom_smooth()

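It adds a smoothed trend line to a plot. With method = "lm" it draws the least-squares regression line, and se = FALSE hides the confidence band around the line. For example, ggplot(data = mlb11, aes(x = at_bats, y = runs)) + geom_point() + stat_smooth(method = "lm", se = FALSE)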

summary()

It gives summary statistics for a data set or a single column. For a numerical variable these are the minimum, 1st quartile, median, mean, 3rd quartile and maximum.

Load packages/data in R

Just like you have to load your tools in order to start working, you have to load packages and data for RStudio to start working on your project. Function names are case-sensitive, so both are written in lowercase: ## load packages library(package name) ## load data data(data file name)

Reduces multiple values down to a single value: summarise() {dplyr}

Summary statistics: some useful function calls for summary statistics for a single numerical variable are mean, median, sd, var, IQR, range, min, max.
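
A small sketch of these calls on a made-up vector x (hypothetical values). Note that range() returns two numbers, the minimum and the maximum, while the others return one:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical measurements

stats <- c(
  mean   = mean(x),
  median = median(x),
  sd     = sd(x),
  var    = var(x),
  IQR    = IQR(x),
  min    = min(x),
  max    = max(x)
)

print(stats)
print(range(x))  # c(min, max)
```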

'$' sign in R

The dollar sign basically says "go to the data frame that comes before me, and find the variable that comes after me". For example, arbuthnot$boys

%in% operator

When interested in filtering for multiple values a variable can take, the %in% operator can come in handy: selected_nzes2011 %>% filter(jnzflike %in% c("0","10")) %>% group_by(jnzflike) %>% summarise(count = n()) The output would be:
## jnzflike count
## <fctr> <int>
## 1 0 622
## 2 10 134
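
On its own, %in% returns one TRUE or FALSE per element, which is what filter() uses to keep rows. A minimal sketch with a made-up vector called scores (hypothetical):

```r
scores <- c("0", "5", "10", "7", "10")  # hypothetical responses

keep <- scores %in% c("0", "10")  # TRUE where the value is one of the targets

print(keep)                 # TRUE FALSE TRUE FALSE TRUE
print(scores[keep])         # "0" "10" "10"
print(table(scores[keep]))  # counts per kept value
```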

Add a new variable in the data frame: mutate()

arbuthnot <- arbuthnot %>% mutate(total = boys + girls) %>% is an operator called the pipe. A note on piping: we can read this code as follows: "Take the arbuthnot dataset and pipe it into the mutate function. Using mutate, create a new variable called total that is the sum of the variables called boys and girls. Then assign the resulting dataset to the object called arbuthnot, i.e. overwrite the old arbuthnot dataset with the new one containing the new variable." <- is the assignment operator. In this case, you already have an object called arbuthnot, so this command updates that data set with the new mutated column. Another example: arbuthnot <- arbuthnot %>% mutate(more_boys = boys > girls) This command adds a new variable to the arbuthnot data frame containing TRUE if that year had more boys than girls, or FALSE if that year did not (the answer may surprise you).

choose()

choose(n, k) in R calculates the number of ways k successes can be arranged in n trials. Namely, it is the "n choose k" function (the binomial coefficient) from statistics.
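
A short sketch checking choose() against the factorial formula and the binomial probability, with n = 5, k = 2 and p = 0.3 chosen only for illustration:

```r
n <- 5
k <- 2

ways         <- choose(n, k)  # 10 ways to place 2 successes in 5 trials
by_factorial <- factorial(n) / (factorial(k) * factorial(n - k))

# Binomial probability built by hand: ways * p^k * (1 - p)^(n - k)
p      <- 0.3
manual <- choose(n, k) * p^k * (1 - p)^(n - k)

print(ways)
print(c(manual = manual, dbinom = dbinom(k, size = n, prob = p)))
```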

?ggplot

ggplot(data = arbuthnot, aes(x = year, y = girls)) + geom_point() We use the ggplot() function to build plots. If you run the plotting code in your console, you should see the plot appear under the Plots tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with arguments separated by commas. The first argument is always the dataset. Next, we provide the variables from the dataset to be assigned to aesthetic elements of the plot, e.g. the x and the y axes. Finally, we use another layer, separated by a +, to specify the geometric object for the plot. Since we want a scatterplot, we use geom_point.

check the linearity of two variables in linear regression (do the residuals scatter randomly around y = 0?)

ggplot(data = m1, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") + ylab("Residuals") Notice here that our model object `m1` can also serve as a data set because stored within it are the fitted values ($\hat{y}$) and the residuals. m1 is created by lm(y ~ x).
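
The fitted values and residuals stored in a model can also be pulled out directly. A sketch assuming the built-in mtcars data in place of the lab data, so this m1 is only a stand-in:

```r
# Stand-in model on built-in data; the lab's m1 uses different variables.
m1 <- lm(mpg ~ wt, data = mtcars)

fitted_vals <- fitted(m1)  # predicted y for each observed x (y-hat)
resid_vals  <- resid(m1)   # observed y minus fitted y

# Each observation is its fitted value plus its residual.
rebuilt <- unname(fitted_vals + resid_vals)

print(head(cbind(fitted = fitted_vals, resid = resid_vals)))
print(mean(resid_vals))  # essentially 0 for a least-squares fit with an intercept
```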

grep()

grep() is a pattern matching function. When I want to search for a variable name or important text, I can use the grep() function. For example, grep("singlefav", names(selected_nzes2011), value = TRUE) "singlefav" is the key word; names(selected_nzes2011) is where to search, which is a character vector of variable names; value = TRUE means to return the matching elements themselves rather than their positions.
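
A minimal sketch of the difference that value = TRUE makes, assuming a made-up vector of variable names called vars (hypothetical):

```r
vars <- c("singlefav", "age", "X_singlefav", "jnzflike")  # hypothetical names

positions <- grep("singlefav", vars)                # indices of the matches
matches   <- grep("singlefav", vars, value = TRUE)  # the matching names themselves
flags     <- grepl("singlefav", vars)               # TRUE/FALSE for every element

print(positions)  # 1 3
print(matches)    # "singlefav" "X_singlefav"
print(flags)      # TRUE FALSE TRUE FALSE
```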

What are fitted values in a linear regression model?

http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/HleastSqrs/leastSqrs_c3.html

|| and | in R

https://stackoverflow.com/questions/26197759/inference-function-insisting-that-i-use-anova-versus-two-sided-hypothesis-test

logical operators in R

https://www.statmethods.net/management/operators.html

is.na()

is.na() returns TRUE for every missing (NA) value and FALSE otherwise, so inside filter() it keeps only the rows where a variable is NA. For example, filter(is.na(variable name)) The ! in front negates the test, so !is.na() keeps all rows of a variable except the NA ones. For example, filter(!is.na(X_singlefav))
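
What is.na() returns is easiest to see on a plain vector. A sketch with made-up values in x (hypothetical):

```r
x <- c(4, NA, 7, NA, 1)  # hypothetical data with two missing values

missing  <- is.na(x)       # FALSE TRUE FALSE TRUE FALSE
complete <- x[!is.na(x)]   # drop the NA entries: 4 7 1
n_miss   <- sum(is.na(x))  # TRUE counts as 1, so this counts the NAs

print(missing)
print(complete)
print(n_miss)
print(mean(x, na.rm = TRUE))  # many summary functions can skip NAs directly
```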

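The t-distribution {stats}

These are the functions for the t-distribution, used in the same way as the normal distribution functions: dt() (density), pt() (distribution function), qt() (quantile function), rt() (random generation).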

sample()

Draws a random sample from a vector. For example, coin_outcomes <- c("heads", "tails") sample(coin_outcomes, size = 1, replace = TRUE, prob = c(0.2, 0.8)) This draws 1 element from the coin_outcomes vector. The replace = TRUE argument indicates we put the slip of paper back in the hat before drawing again. prob = c(0.2, 0.8) means the probability of getting "heads" is 0.2 and the probability of getting "tails" is 0.8.

sample_n() & rep_sample_n()

sample_n(): samples n rows from a table. rep_sample_n(): repeats the sampling several times, returning one sample per replicate.

set.seed()

set.seed() fixes the starting point of R's random number generator, so the same code produces the same "random" numbers on every run. It locks the random number generator so everyone running the code gets the same simulation, which is why people get the same answers on assignments.
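
A small sketch showing that resetting the seed reproduces a sample() draw. The seed value 123 and the sample size of 10 are arbitrary choices for illustration:

```r
coin_outcomes <- c("heads", "tails")

set.seed(123)  # arbitrary seed
first <- sample(coin_outcomes, size = 10, replace = TRUE, prob = c(0.2, 0.8))

set.seed(123)  # same seed again, so the generator restarts from the same point
second <- sample(coin_outcomes, size = 10, replace = TRUE, prob = c(0.2, 0.8))

print(first)
print(identical(first, second))  # the two draws match
```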

str()

str(object) can be used to see the structure of a data set. For example, str(nycflights) The output is:
Classes 'tbl_df' and 'data.frame': 32735 obs. of 16 variables:
$ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
$ month : int 6 5 12 5 7 1 12 8 9 4 ...
$ day : int 30 7 8 14 21 1 9 13 26 30 ...
...

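Files with the .rmd extension

It means a Markdown file. Markdown is a lightweight markup language created by John Gruber. It allows people "to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML)". The language borrows many features of plain-text markup already used in email. In RStudio, open a file ending in .rmd (note that Chrome cannot save a file with the .rmd extension) and choose Knit to produce a readable file ending in .html.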

