ECON 381E Test #1 Notes
scores <- c(3,8,5,8,0)
"scores" is a vector
Things you can run
1+2, 1*2, 1/2
CPS85[1:2, c(1, 3, 10)]
1:2 selects the first 2 rows of the CPS85 data set. c(1, 3, 10) selects the 1st, 3rd, and 10th columns of the data set
sample variation
A sample will almost never be the exact replica of the parent population
Inter-Quartile Range (IQR)
IQR = Q3 - Q1
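A minimal base-R sketch of the IQR formula, using the scores vector from the top of these notes. Note that quantile() interpolates by default, so on some data it can give slightly different quartiles than the "median of each half" rule.

```r
scores <- c(3, 8, 5, 8, 0)

# Q1 and Q3 using R's default quantile method
q <- quantile(scores, probs = c(0.25, 0.75))

# IQR = Q3 - Q1, computed by hand and with the built-in function
iqr_by_hand <- unname(q[2] - q[1])
iqr_builtin <- IQR(scores)

print(q)
print(iqr_by_hand)
print(iqr_builtin)
```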
rflip(10)
This simulates 10 random flips of a coin
print(scores)
This prints the scores we have
what does a bigger sample size do?
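a bigger sample size makes the histogram of the sample look more like the distribution of the population, because a bigger sample is more representative (ex: for die rolls the population is uniform, so the histogram gets closer to uniform)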
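A small simulation sketch of this idea in base R, assuming the population is rolls of a fair die (so each face has proportion 1/6). The sample sizes 12 and 60000 and the seed are arbitrary choices for illustration.

```r
set.seed(381)  # arbitrary seed so the run is repeatable

small <- sample(1:6, 12, replace = TRUE)      # small sample of die rolls
large <- sample(1:6, 60000, replace = TRUE)   # much larger sample

# Proportion of each face in each sample
prop_small <- table(factor(small, levels = 1:6)) / length(small)
prop_large <- table(factor(large, levels = 1:6)) / length(large)

print(round(prop_small, 3))
print(round(prop_large, 3))

# Largest distance from the population proportion 1/6 in the large sample
max_gap_large <- max(abs(prop_large - 1/6))
print(max_gap_large)
```

The proportions in the large sample should sit much closer to 1/6 than those in the small sample.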
random sampling
a representative, unbiased sample, where every population entity has an equal probability of being included in the sample
dataset name[row numbers, column numbers]
allows us to collect a subset of the data set (ex: CPS85[1, 4] -- this gives us row 1, column 4)
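A short sketch of row and column indexing. CPS85 needs an extra package, so this uses mtcars, a data set built into base R, as a stand-in (that substitution is an assumption, not part of the course notes).

```r
# mtcars stands in for CPS85 here

# Rows 1-2; columns 1, 3, and 10
first_two <- mtcars[1:2, c(1, 3, 10)]
print(first_two)

# A single cell: row 1, column 4
one_cell <- mtcars[1, 4]
print(one_cell)

# Leaving the column part empty keeps every column
two_rows_all_cols <- mtcars[c(1, 2), ]
print(dim(two_rows_all_cols))
```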
gf_histogram(~ variable1, data = dataset, fill= ~ variable2)
draws one histogram of variable1 with the bars colored by the categories of variable2, so the groups can be compared on top of each other
gf_histogram(~ variable1, data = dataset) %>% gf_facet_grid(variable2 ~ .)
shows a separate histogram of variable1 for each category of variable2, stacked one on top of the other
gf_histogram(~ variable1, data = dataset) %>% gf_facet_grid(~ variable2)
shows a separate histogram of variable1 for each category of variable2, side by side (this helps show whether a categorical variable explains some of the variation in another variable, more specifically a quantitative variable)
sqrt(var(studentdata$weight))
another way of getting the standard deviation of the values of a variable
library(supernova); aov1 <- supernova(lm1); aov1
another way of getting the sum of squared residuals (SS) and the variance (MS)
y = b0 + e
b0 = sample mean, y = variable, e = residual
gf_boxplot(variable1 ~ variable2, data = dataset)
box plots of variable1, one for each category of variable2
gf_boxplot(weight ~ ., data = dataset) %>% gf_facet_grid(~ two_heights)
box plots of weight for the two groups of two_heights (height split into two sections), to see if height explains at least some of the variation in weight
gf_jitter(variable1 ~ variable2, data = dataset)
jitter plot of variable1 (y axis) against variable2 (x axis); points are nudged randomly so they do not overlap
gf_point(variable1 ~ variable2, data = dataset)
scatterplot of variable1 (y axis) against variable2 (x axis)
gf_boxplot(shuffle(variable1) ~ variable2, data=dataset, color="blue") %>% gf_jitter(height=0, alpha=0.25, width=.2, size=2)
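box plots of variable1 by variable2 after shuffling variable1, so that its values are randomly assigned to the categories of variable2. Shuffling breaks any real link between the two, so comparing this plot with the unshuffled one helps us see if variable2 really explains some of the variation in variable1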
gf_point(weight ~ height, data = dataset)
comparing two variables using a scatterplot
gf_boxplot(variable1 ~ variable2, data = dataset, color = "red") %>% gf_jitter(height=0, alpha=0.25, width=.2, size=2)
comparing two variables using box plot and jitter
gf_dhistogram(~ rweight, data=dataset, color="black", fill="grey") %>% gf_density()
distribution of residuals
v2 <- 16:20
easier way of typing out v2 <- c(16,17,18,19,20)
==
equal to
f1 <- as.factor(age)
converts age to a factor and saves it as f1. In our example f1 has three levels: "all_other", "child", and "senior"
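A minimal sketch of as.factor() and levels(). The contents of the age vector below are made up for illustration; only the three level names come from the notes.

```r
# Hypothetical age-group labels (values invented for this example)
age <- c("child", "senior", "all_other", "child", "all_other")

f1 <- as.factor(age)

print(f1)          # the values, plus the list of levels
print(levels(f1))  # the distinct categories, in alphabetical order
print(nlevels(f1)) # how many levels there are
```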
gf_histogram(~ HPI, data = HappyPlanetIndex, bins = 20, fill = "yellow", color ="blue")
for this histogram we first need library(Lock5withR), which has the HappyPlanetIndex data. Everything else is the same, but we add bins = 20 to set the number of bins to 20
lm1$coefficients
gives the intercept, which is the number we need for b0. For the empty model (variable ~ NULL) this is exactly the mean of the variable
gf_bar(~ variable, data=dataset, color = "darkblue", fill= "pink")
gives us a bar graph
gf_boxplot( ~ variable, data = dataset)
gives us a box plot
favstats(~ variable, data = dataset)
gives us min, Q1, median, Q3, max, mean, sd, n, and missing
v5[c(11,13,15)]
gives us specific numbers of the 11th, 13th, and 15th position in the v5 vector
dev_weight <- (dataset$variable - mean(dataset$variable)); sum(dev_weight)
gives us the deviation of each value from the mean of the variable; the sum of these deviations is always 0
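A quick check of the deviations idea in base R, using the scores vector from the top of these notes instead of a data set column.

```r
scores <- c(3, 8, 5, 8, 0)

# Deviation of each score from the mean
deviations <- scores - mean(scores)
print(deviations)

# The deviations cancel out (up to tiny floating-point rounding)
sum_dev <- sum(deviations)
print(sum_dev)
```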
tail(CPS85)
gives us the last 6 lines of the CPS85 data set
max(newvector)
gives us the maximum of the new vector
gf_histogram(~ variable, data = dataset, color="green", fill="yellow") %>% gf_labs(title = "title") %>% gf_vline(xintercept=mean(dataset$variable), color="red", linewidth = 1.5)
adds a title and a red vertical line at the mean of the variable on the histogram
min(newvector)
gives us the minimum of the new vector
sd(dataset$variable)
gives us the standard deviation of the values of a variable. We can also use favstats to get the sd
str(CPS85)
gives us the structure of the data set (we can see how many observations and variables there are)
head(CPS85, 10)
gives us the top 10 lines of the CPS85 data set
head(CPS85)
gives us the top 6 lines of the CPS85 data set
var(dataset$variable)
gives us the variance of a variable
>
greater than
>=
greater than or equal to
range
highest value - lowest value
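A small sketch of the range as a single number. In base R, range() returns the lowest and highest values rather than their difference, so the subtraction is done separately.

```r
scores <- c(3, 8, 5, 8, 0)

lo_hi <- range(scores)                       # lowest and highest value
range_by_hand <- max(scores) - min(scores)   # highest value - lowest value
range_via_diff <- diff(range(scores))        # same number, one line

print(lo_hi)
print(range_by_hand)
print(range_via_diff)
```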
gf_histogram(~ variable, data = dataset, fill = "red", color = "darkblue")
how to plot a histogram
categorical variable
it is usually a factor variable (ex: race, sex)
what does the height of each bar in a bar graph represent?
it represents the number of observations in each category of the categorical variable
what does the x axis show on a bar graph?
it shows the categories - not a quantitative measure (so max and min values are meaningless for bar graph)
<
less than
<=
less than or equal to
!=
not equal to
CPS85[c(1,2),]
picking the first 2 rows and keeping every variable (all columns)
dataset$pweight <- predict(lm1); head(dataset$pweight); gf_point(weight ~ height, data=dataset) %>% gf_point(pweight ~ height, data=dataset, color = "red")
saving the model's predicted values, then plotting them (in red) on the same scatterplot as the actual values of the variable
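A base-R sketch of predicted values, assuming a model with one predictor fitted to the built-in mtcars data (mpg explained by wt). The data set and variable names are stand-ins, not the ones used in class.

```r
# Assumed stand-in model: mpg explained by wt
m <- lm(mpg ~ wt, data = mtcars)

dat <- mtcars
dat$pmpg <- predict(m)   # one predicted value per row
print(head(dat$pmpg))

# Each actual value equals its prediction plus its residual
rebuilt <- dat$pmpg + residuals(m)
print(head(cbind(actual = dat$mpg, rebuilt = rebuilt)))
```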
sample1 <- resample(1:6, 12)
sampling from the numbers 1-6 twelve times with replacement, like rolling a die 12 times (ex: 5 5 2 6 4 3 6 2 2 2 2 3)
weight ← sex + other factors
sex and other factors (other observed variables, sampling error, errors in measurements, etc) explain at least some of the variation in weight
gf_point(variable1 ~ variable2, data = dataset) %>% gf_hline(yintercept=lm1$coefficients, color="red", linewidth=1.5)
draws the empty model's prediction (b0, the mean of variable1) as a red horizontal line on the scatterplot
filter(CPS85, sex == "F")
shows us the observations for females specifically
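filter() comes from an add-on package; a rough base-R equivalent uses a logical condition inside the row index. This sketch uses the built-in mtcars data and its cyl column as a stand-in for CPS85 and sex.

```r
# Keep only the rows where the condition is TRUE; keep all columns
four_cyl <- mtcars[mtcars$cyl == 4, ]
print(nrow(four_cyl))

# subset() does the same thing with a shorter syntax
four_cyl_again <- subset(mtcars, cyl == 4)

# Other comparison operators work the same way
not_four <- mtcars[mtcars$cyl != 4, ]
print(nrow(not_four))
```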
gf_jitter(variable1 ~ variable2, data = dataset, height=0, alpha=0.25, width=.2, size=2)
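jitter plot with specific settings: height=0 means no vertical jitter, width sets the amount of horizontal jitter, alpha sets the transparency of the points, and size sets the point size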
unit of observation
the kind of entity (ex: a person) on which the values of the variables are collected; each observation is one such entity
tally(newvector)
tallies up each number in the new vector (how many of each number there are)
glimpse(CPS85)
tells you what kind of variable the variable is (ex: <fct>)
lm(variable ~ NULL, data = dataset); lm1 <- lm(variable ~ NULL, data = dataset)
fitting the best-fitting empty model, then saving it as lm1. The intercept number we get from this is our b0
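A minimal sketch of the empty model in base R, using mpg from the built-in mtcars data as a stand-in for the course variable. It checks that b0 matches the sample mean.

```r
# Empty model: no predictors, only an intercept (b0)
lm1 <- lm(mpg ~ NULL, data = mtcars)

b0 <- unname(lm1$coefficients)   # drop the "(Intercept)" label
print(b0)
print(mean(mtcars$mpg))          # same number as b0
```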
residual
the leftover variation that cannot be explained by the model (actual value minus predicted value)
values
the measurements in the data table (what is in the row next to the observation)
first quartile
the median of the lower 50% of the observations
third quartile
the median of the upper 50% of the observations
dataset$two_heights <- ntile(dataset$height, 2); dataset$two_heights <- factor(dataset$two_heights, levels=c(1,2), labels=c("short", "tall")); gf_histogram(~ weight, data = dataset) %>% gf_facet_grid(two_heights ~ .)
this splits height into two equal-sized groups (short/tall), then lets us compare the histograms of weight for the two groups. This allows us to see if height explains at least some of the variation in weight
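ntile() comes from an add-on package. The sketch below is a base-R approximation of a two-group split, using wt from the built-in mtcars data as a stand-in for height; the rank-based formula is an assumption chosen to give two equal-sized groups and may treat ties differently from ntile().

```r
dat <- mtcars   # work on a copy

# Rank the values, then put the lower half in group 1 and the upper half in group 2
wt_rank <- rank(dat$wt, ties.method = "first")
two_groups <- ceiling(2 * wt_rank / nrow(dat))

# Turn the group numbers into a labeled factor
dat$two_weights <- factor(two_groups, levels = c(1, 2),
                          labels = c("light", "heavy"))

group_counts <- table(dat$two_weights)
print(group_counts)

# Compare the outcome variable across the two groups
group_means <- tapply(dat$mpg, dat$two_weights, mean)
print(group_means)
```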
select(CPS85, variable1, variable2, variable3, ...)
this keeps only the specific variables (columns) you chose from the data set
arrange(CPS85, desc(variable))
this sorts the rows of the CPS85 data set by the variable in decreasing order
arrange(CPS85, variable)
this sorts the rows of the CPS85 data set by the variable in increasing order
sample_w <- sample(population_w, 24)
this gives us a random sample of 24 values (people) from the population, drawn without replacement
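A base-R sketch contrasting sampling without and with replacement. The contents of population_w are made up here, and sample(..., replace = TRUE) is used as the base-R counterpart of resampling.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Hypothetical population of 100 distinct values
population_w <- 101:200

# Without replacement: no value can be picked twice
sample_w <- sample(population_w, 24)
print(sample_w)

# With replacement: the same value can come up again, like rolling a die
rolls <- sample(1:6, 12, replace = TRUE)
print(rolls)
```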
tally( ~ variable, data=CPS85)
this counts how many observations fall in each category of the variable
tally(variable1 ~ variable2, data=CPS85)
this gives us a two-way table: the counts of variable1 within each category of variable2
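tally() comes from an add-on package; base R's table() gives similar counts. This sketch uses cyl and am from the built-in mtcars data as stand-ins for the CPS85 variables.

```r
# One variable: how many rows fall in each category
counts <- table(mtcars$cyl)
print(counts)

# Two variables: counts for every combination of categories
two_way <- table(mtcars$cyl, mtcars$am)
print(two_way)
```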
gf_dhistogram(~ variable, data = dataset, fill = "yellow", color = "blue") %>% gf_density()
this gives us the density and the overall shape of the histogram
CPS85[,1]
this gives us the first column (one specific variable) for all rows
mean(CPS85$variable)
this gives us the mean of the variable in the CPS85 data set
levels(dataset$variable)
this gives us the possible values (categories) of a factor variable
v5[6:9]
this gives us whatever numbers are in the 6th through 9th position in the v5 vector
residuals_lm1 <- residuals(lm1); SumResiduals_lm1 <- sum(residuals_lm1)
this is how to get the residuals and the sum of the residuals (the sum is 0, apart from rounding)
SSresiduals_lm1 <- sum(residuals_lm1^2); SSresiduals_lm1
this is how to get the sum of squares of residuals
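A base-R sketch tying these pieces together for the empty model, using mpg from the built-in mtcars data as a stand-in. It assumes the usual definition variance = SS / (n - 1), which is what var() computes.

```r
lm1 <- lm(mpg ~ NULL, data = mtcars)   # empty model

residuals_lm1 <- residuals(lm1)
SumResiduals_lm1 <- sum(residuals_lm1)    # about 0
SSresiduals_lm1 <- sum(residuals_lm1^2)   # sum of squares (SS)

n <- nrow(mtcars)
MS_lm1 <- SSresiduals_lm1 / (n - 1)       # mean square (MS) = variance

print(SumResiduals_lm1)
print(SSresiduals_lm1)
print(MS_lm1)
print(var(mtcars$mpg))   # same as MS
print(sqrt(MS_lm1))      # standard deviation
print(sd(mtcars$mpg))    # same as sqrt(MS)
```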
x^2
this is how we square a value
v3 <- rep(2, 5)
this repeats the number 2 five times
sort(newvector, decreasing = TRUE)
this sorts the new vector in decreasing order
sort(newvector)
this sorts the new vector in order of increasing numbers
variables
the columns of the data set; their names are usually at the top
observation
the rows of the data set; usually listed down the left side
How do we use histogram to explore the distribution of data?
visually check whether anything looks odd: it allows us to see the shape of the distribution, spot any outliers, and detect possible errors in the data
gf_histogram(...) %>% gf_labs(title = "title")
we can include a title after histogram by using gf_labs
x[1] <- 0
we changed the first number in the vector to 0
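A short base-R sketch pulling together the vector commands in these notes. The vector x is given the scores values for illustration, and the positions used for indexing are arbitrary.

```r
v2 <- 16:20          # same as c(16, 17, 18, 19, 20)
v3 <- rep(2, 5)      # the number 2 repeated five times
x  <- c(3, 8, 5, 8, 0)

picked <- v2[c(1, 3)]   # values in the 1st and 3rd positions
middle <- v2[2:4]       # values in the 2nd through 4th positions

x[1] <- 0               # change the first number in x to 0

print(v2)
print(v3)
print(picked)
print(middle)
print(x)
print(sort(x, decreasing = TRUE))
```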
$
we use this after the data set name to select a variable in the data set (ex: CPS85$wage)
what do we want our sample to be like?
we want our sample to be representative of the population. If it is not then it is biased
moonsofties <- c("USSR", "USA", "China", "India")
when writing words they have to be in quotation marks
y <- x
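word-equation notation (not R code): x explains some of the variation in y. y is the outcome variable and x is the predictor (explanatory) variable. Typed into R, y <- x would instead save the value of x as y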