ECON 381E Test #1 Notes

scores <- c(3,8,5,8,0)

"scores" is a vector

Things you can run

1+2, 1*2, 1/2

CPS85[1:2, c(1, 3, 10)]

1:2 selects the first 2 rows of the CPS85 data set. c(1, 3, 10) selects the 1st, 3rd, and 10th columns of the data set

sample variation

A sample will almost never be the exact replica of the parent population

Inter-Quartile Range (IQR)

IQR = Q3 - Q1
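
A minimal sketch of the IQR formula in base R, reusing the scores vector from these notes. quantile() uses R's default interpolation rule, so on other data its quartiles can differ slightly from the hand rule "median of the lower/upper half".

```r
scores <- c(3, 8, 5, 8, 0)

# Q1 and Q3 with R's default quantile rule
q <- quantile(scores, c(0.25, 0.75))

# IQR = Q3 - Q1, computed by hand and with the built-in function
iqr_by_hand <- unname(q[2] - q[1])
iqr_builtin <- IQR(scores)

print(q)
print(iqr_by_hand)
print(iqr_builtin)
```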

rflip(10)

This simulates flipping a coin 10 times at random (rflip comes from the mosaic package)

print(scores)

This prints the scores we have

what does a bigger sample size do?

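a bigger sample size makes the histogram of our sample look more like the population distribution, because a larger random sample is more representative of the population (ex: many rolls of a fair die look closer to uniform than a few rolls)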
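
One way to see the effect of sample size is to simulate rolls of a fair die, where the population distribution is known (each face has probability 1/6). The sketch below is an illustration in base R, not the course's own code; the helper max_gap, the seed, and the two sample sizes are assumptions chosen for the example.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Largest gap between the sample proportions and the true value 1/6
# (max_gap is a hypothetical helper written for this sketch)
max_gap <- function(n) {
  rolls <- sample(1:6, n, replace = TRUE)
  props <- table(factor(rolls, levels = 1:6)) / n
  max(abs(props - 1/6))
}

gap_small <- max_gap(12)     # a small sample of 12 rolls
gap_large <- max_gap(60000)  # a much larger sample

print(gap_small)
print(gap_large)
```

The gap for the large sample should come out very close to 0, which is the numerical version of "the histogram looks more like the population".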

random sampling

a representative, unbiased sample, where every population entity has an equal probability of being included in the sample

dataset name[row numbers, column numbers]

allows us to collect a subset of the data set (ex: CPS85[1, 4] -- this gives us row 1, column 4)
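
A short sketch of row/column indexing on a small made-up data frame (df and its values are hypothetical stand-ins for a data set such as CPS85, so the block runs without loading any package).

```r
# Hypothetical stand-in for a data set such as CPS85
df <- data.frame(
  wage = c(9.0, 5.5, 3.8, 10.5),
  educ = c(10, 12, 12, 12),
  sex  = factor(c("M", "M", "F", "F")),
  age  = c(43, 38, 22, 47)
)

one_cell   <- df[1, 4]          # row 1, column 4: a single value
block      <- df[1:2, c(1, 3)]  # rows 1 and 2, columns 1 and 3
first_rows <- df[c(1, 2), ]     # rows 1 and 2, every column
one_column <- df[, 1]           # every row of column 1, returned as a vector

print(one_cell)
print(block)
```

Leaving the row or the column position empty means "all rows" or "all columns".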

gf_histogram(~ variable1, data = dataset, fill= ~ variable2)

draws the histogram of variable1 with the bars colored by the categories of variable2, so we can compare the groups in a single panel

gf_histogram(~ variable1, data = dataset) %>% gf_facet_grid(variable2 ~ .)

draws a separate histogram of variable1 for each category of variable2, with the panels stacked one on top of the other

gf_histogram(~ variable1, data = dataset) %>% gf_facet_grid(~ variable2)

allows us to look at two different histograms side by side (it also shows how a categorical variable explains the variation in another variable, more specifically a quantitative variable)

sqrt(var(studentdata$weight))

another way of getting the standard deviation of the values of a variable
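
A quick check, on made-up weights, that the square root of the variance matches sd(). The weight values are hypothetical.

```r
weight <- c(120, 135, 150, 160, 185)  # made-up values for illustration

sd_direct   <- sd(weight)         # standard deviation directly
sd_from_var <- sqrt(var(weight))  # square root of the variance

print(sd_direct)
print(sd_from_var)
```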

library(supernova); aov1 <- supernova(lm1); aov1

another way of getting the sum of squared residuals (SS) and the variance (MS)

y = b0 + e

b0 = sample mean, y = variable, e = residual

gf_boxplot(variable1 ~ variable2, data = dataset)

comparing box plots

gf_boxplot(weight ~ ., data = dataset) %>% gf_facet_grid(~ two_heights)

comparing box plots between one variable split into two sections to see if it explains at least some of the variation in another variable (weight)

gf_jitter(variable1 ~ variable2, data = dataset)

comparing jitter plots

gf_point(variable1 ~ variable2, data = dataset)

comparing scatterplots

gf_boxplot(shuffle(variable1) ~ variable2, data=dataset, color="blue") %>% gf_jitter(height=0, alpha=0.25, width=.2, size=2)

comparing box plots after shuffling the first variable so that its values are randomly assigned to the categories of the second variable. If the shuffled plot looks about the same as the original, variable 2 probably does not explain much of the variation in variable 1
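
The idea behind shuffling can be sketched in base R. The notes use shuffle() from the mosaic package; here sample(x), which returns a random permutation of x, is swapped in as a stand-in so the block runs without mosaic. The weights, groups, and seed are made up.

```r
set.seed(42)  # arbitrary seed for reproducibility

weight <- c(118, 125, 131, 160, 172, 181)  # hypothetical outcome values
group  <- factor(c("short", "short", "short", "tall", "tall", "tall"))

# Randomly reassign the same weights to the observations
shuffled <- sample(weight)

# Difference in group means (tall minus short), before and after shuffling
observed_gap <- diff(tapply(weight, group, mean))
shuffled_gap <- diff(tapply(shuffled, group, mean))

print(observed_gap)
print(shuffled_gap)
```

Shuffling keeps exactly the same values but breaks any real link between the two variables, so the shuffled gap shows what chance alone can produce.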

gf_point(weight ~ height, data = dataset)

comparing two variables using a scatterplot

gf_boxplot(variable1 ~ variable2, data = dataset, color = "red") %>% gf_jitter(height=0, alpha=0.25, width=.2, size=2)

comparing two variables using box plot and jitter

gf_dhistogram(~ rweight, data=dataset, color="black", fill="grey") %>% gf_density()

distribution of residuals

v2 <- 16:20

easier way of typing out v2 <- c(16,17,18,19,20)

==

equal to

f1 <- as.factor(age)

converts age into a factor (categorical) variable and saves it as f1. Printing f1 tells us that there are three levels: "all_other", "child", and "senior"
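
A small sketch of converting a character vector to a factor. The age values below are hypothetical; only the three level names come from the notes.

```r
# Hypothetical age categories
age <- c("child", "senior", "all_other", "child", "all_other")

f1 <- as.factor(age)

print(levels(f1))  # the distinct categories, in alphabetical order by default
print(table(f1))   # how many observations fall in each level
```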

gf_histogram(~ HPI, data = HappyPlanetIndex, bins = 20, fill = "yellow", color ="blue")

for this histogram we first load library(Lock5withR), which contains the HappyPlanetIndex data. Everything else is the same as a regular histogram, but we set the number of bins to 20

lm1$coefficients

gives the estimated coefficients of the model. For the empty model this is just the intercept, which is the number we need for b0 and is the same as the sample mean

gf_bar(~ variable, data=dataset, color = "darkblue", fill= "pink")

gives us a bar graph

gf_boxplot( ~ variable, data = dataset)

gives us a box plot

favstats(~ variable, data = dataset)

gives us min, Q1, median, Q3, max, mean, sd, n, and the number of missing values

v5[c(11,13,15)]

gives us specific numbers of the 11th, 13th, and 15th position in the v5 vector

dev_weight <- (dataset$variable - mean(dataset$variable)); sum(dev_weight)

gives us the deviation of each value from the mean of the variable, then adds the deviations up (the sum is always 0, up to rounding)

tail(CPS85)

gives us the last 6 rows of the CPS85 data set

max(newvector)

gives us the maximum of the new vector

gf_histogram(~ variable, data = dataset, color="green", fill="yellow") %>% gf_labs(title = "title") %>% gf_vline(xintercept=mean(dataset$variable), color="red", linewidth = 1.5)

gives us the histogram with a title and a red vertical line marking the mean of the variable

min(newvector)

gives us the minimum of the new vector

sd(dataset$variable)

gives us the standard deviation of the values of a variable. We can also use favstats to get the sd

str(CPS85)

gives us the structure of the data set (we can see how many observations and variables there are, and the type of each variable)

head(CPS85, 10)

gives us the top 10 lines of the CPS85 data set

head(CPS85)

gives us the top 6 lines of the CPS85 data set

var(dataset$variable)

gives us the variance of a variable

>

greater than

>=

greater than or equal to

range

highest value - lowest value
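
Note that R's range() returns the lowest and highest values rather than their difference. A minimal sketch with the scores vector from these notes:

```r
scores <- c(3, 8, 5, 8, 0)

low_high <- range(scores)        # lowest and highest value
spread   <- diff(range(scores))  # highest value - lowest value

print(low_high)
print(spread)
```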

gf_histogram(~ variable, data = dataset, fill = "red", color = "darkblue")

how to plot a histogram

categorical variable

it is usually a factor variable (ex: race, sex)

what does the height of each bar in a bar graph represent?

it represents the number of observations in each category of the categorical variable

what does the x axis show on a bar graph?

it shows the categories - not a quantitative measure (so max and min values are meaningless for bar graph)

<

less than

<=

less than or equal to

!=

not equal to
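
The comparison operators in these notes (==, !=, >, >=, <, <=) work element by element on a vector and return TRUE/FALSE values. A brief sketch using the scores vector from these notes; the names big and n_low are made up for the example.

```r
scores <- c(3, 8, 5, 8, 0)

print(scores == 8)  # which scores are equal to 8
print(scores != 8)  # which scores are not equal to 8
print(scores >= 5)  # which scores are at least 5

# Logical results can be used to pick or count values
big   <- scores[scores > 4]  # keep only the scores greater than 4
n_low <- sum(scores <= 3)    # count the scores that are 3 or less

print(big)
print(n_low)
```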

CPS85[c(1,2),]

picking the first 2 rows and showing them with every variable (column)

dataset$pweight <- predict(lm1); head(dataset$pweight); gf_point(weight ~ height, data=dataset) %>% gf_point(pweight ~ height, data=dataset, color = "red")

saving the model's predicted values as a new variable, then plotting the predicted values (in red) on top of the actual values of the variable

sample1 <- resample(1:6, 12)

randomly drawing 12 numbers from 1-6 with replacement, like rolling a die 12 times (ex: 5 5 2 6 4 3 6 2 2 2 2 3)

weight ← sex + other factors

sex and other factors (other observed variables, sampling error, errors in measurements, etc) explain at least some of the variation in weight

gf_point(variable1 ~ variable2, data = dataset) %>% gf_hline(yintercept=lm1$coefficients, color="red", linewidth=1.5)

shows us the scatterplot with a red horizontal line at b0, the prediction of the empty model lm1

filter(CPS85, sex == "F")

shows us the observations for females specifically

gf_jitter(variable1 ~ variable2, data = dataset, height=0, alpha=0.25, width=.2, size=2)

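the specific settings for the jitter plot: height=0 means no vertical jitter, width=.2 sets the amount of horizontal jitter, alpha=0.25 makes the points semi-transparent, and size=2 sets the point size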

unit of observation

the typical entity (ex: a person) on which the values of the variables are collected

tally(newvector)

tallies up each number in the new vector (how many of each number there are)

glimpse(CPS85)

tells you what kind of variable each variable is (ex: <fct> for a factor) and shows its first few values

lm(variable ~ NULL, data = dataset); lm1 <- lm(variable ~ NULL, data = dataset)

fitting the empty model (no predictor variables), then saving it as lm1. The intercept we get from this is our b0
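
A minimal sketch of fitting the empty model with base R's lm() and checking that its intercept equals the sample mean. The data frame and its weight values are made up for illustration.

```r
# Hypothetical data set with one outcome variable
dataset <- data.frame(weight = c(120, 135, 150, 160, 185))

# Empty model: no predictors, only an intercept
lm1 <- lm(weight ~ NULL, data = dataset)

b0 <- unname(lm1$coefficients)  # the intercept, our b0

print(b0)
print(mean(dataset$weight))     # same number as b0
```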

residual

the leftover variation that cannot be explained by the model (actual value minus predicted value)

values

the measurements in the data table (what is in the row next to the observation)

first quartile

the median of the lower 50% of the observations

third quartile

the median of the upper 50% of the observations

dataset$two_heights <- ntile(dataset$height, 2); dataset$two_heights <- factor(dataset$two_heights, levels=c(1,2), labels=c("short", "tall")); gf_histogram(~ weight, data = dataset) %>% gf_facet_grid(two_heights ~ .)

this allows us to compare two histograms between one variable that's split into two sections (height is split into tall/short). This allows us to see if height explains at least some of the variation in weight

select(CPS85, variable1, variable2, variable3, ...)

this keeps only the specific variables (columns) you chose from the data set

arrange(CPS85, desc(variable))

this sorts the rows of the CPS85 data set by the variable in decreasing order

arrange(CPS85, variable)

this sorts the rows of the CPS85 data set by the variable in increasing order
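
The notes use filter(), select(), and arrange() from dplyr. As a rough sketch of what each one does, here are base-R counterparts on a small made-up data frame (df and its values are hypothetical). These are stand-ins so the block runs without dplyr, not the functions from the notes.

```r
# Hypothetical stand-in for a data set such as CPS85
df <- data.frame(
  wage = c(9.0, 5.5, 3.8, 10.5),
  sex  = factor(c("M", "M", "F", "F")),
  age  = c(43, 38, 22, 47)
)

females    <- subset(df, sex == "F")                     # like filter(df, sex == "F")
two_cols   <- df[, c("wage", "age")]                     # like select(df, wage, age)
increasing <- df[order(df$wage), ]                       # like arrange(df, wage)
decreasing <- df[order(df$wage, decreasing = TRUE), ]    # like arrange(df, desc(wage))

print(females)
print(increasing)
```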

sample_w <- sample(population_w, 24)

this gives us a random sample of 24 values from the population (each value can be picked only once)
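
A sketch contrasting sampling without and with replacement in base R. population_w here is a made-up population of 100 values, and the seed is arbitrary. sample(..., replace = TRUE) is used as a base-R stand-in for the resample() function from mosaic that appears elsewhere in these notes.

```r
set.seed(7)  # arbitrary seed for reproducibility

population_w <- 1:100  # hypothetical population

# Without replacement: no value can be drawn twice
sample_w <- sample(population_w, 24)

# With replacement: the same number can come up again, like die rolls
rolls <- sample(1:6, 12, replace = TRUE)

print(sample_w)
print(rolls)
```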

tally( ~ variable, data=CPS85)

this counts how many observations fall in each category of the variable in the data set

tally(variable1 ~ variable2, data=CPS85)

this gives us a two-way table of counts: how many observations fall in each combination of categories of the two variables
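
tally() comes from the mosaic package. The same kind of counting can be sketched with base R's table(), used here as a stand-in; the sex and married values are made up.

```r
# Hypothetical categorical data
sex     <- factor(c("M", "M", "F", "F", "M"))
married <- factor(c("Married", "Single", "Single", "Married", "Married"))

one_way <- table(sex)           # counts for one variable
two_way <- table(married, sex)  # counts for each combination of two variables

print(one_way)
print(two_way)
```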

gf_dhistogram(~ variable, data = dataset, fill = "yellow", color = "blue") %>% gf_density()

this gives us the density and the overall shape of the histogram

CPS85[,1]

this gives us the information on one specific column or variable from all rows

mean(CPS85$variable)

this gives us the mean of the variable in the CPS85 data set

levels(dataset$variable)

this gives us the possible values (levels) of a factor variable

v5[6:9]

this gives us whatever numbers are in the 6th through 9th position in the v5 vector
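
A short sketch of picking values out of a vector by position. The contents of v5 are not given in the notes, so a made-up 20-element vector is used here.

```r
v5 <- seq(10, 200, by = 10)  # hypothetical vector: 10, 20, ..., 200

middle <- v5[6:9]            # positions 6 through 9
picked <- v5[c(11, 13, 15)]  # positions 11, 13, and 15

print(middle)
print(picked)
```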

residuals_lm1 <- residuals(lm1); SumResiduals_lm1 <- sum(residuals_lm1)

this is how to get the residuals and the sum of the residuals (the sum is always 0, up to rounding)

SSresiduals_lm1 <- sum(residuals_lm1^2); SSresiduals_lm1

this is how to get the sum of squares of residuals
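
A worked sketch, on made-up weights, of how the residuals of the empty model lead to the sum of squares (SS) and the variance (MS). The weight values are hypothetical, and MS_lm1 is a name introduced for this example.

```r
weight <- c(120, 135, 150, 160, 185)  # made-up values, mean = 150

lm1 <- lm(weight ~ NULL)              # empty model

residuals_lm1    <- residuals(lm1)        # -30, -15, 0, 10, 35
SumResiduals_lm1 <- sum(residuals_lm1)    # 0, up to rounding
SSresiduals_lm1  <- sum(residuals_lm1^2)  # 900 + 225 + 0 + 100 + 1225

# Dividing SS by (n - 1) gives the variance
MS_lm1 <- SSresiduals_lm1 / (length(weight) - 1)

print(SSresiduals_lm1)
print(MS_lm1)
print(var(weight))
```

Squaring the residuals before adding them up is what keeps the positive and negative residuals from cancelling out.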

x^2

this is how we square a value

v3 <- rep(2, 5)

this repeats the number 2 five times

sort(newvector, decreasing = TRUE)

this sorts the new vector in decreasing order

sort(newvector)

this sorts the new vector in order of increasing numbers

variables

the columns of the data set (the variable names are usually at the top)

observation

the rows of the data set (each observation is usually listed down the left side)

How do we use histogram to explore the distribution of data?

we visually check whether anything looks odd: it allows us to see if there are any outliers and helps us detect possible errors in the data

gf_histogram(...) %>% gf_labs(title = "title")

we can include a title after histogram by using gf_labs

x[1] <- 0

we changed the first number in the vector to 0

$

we use this after the data set name to select a variable in the data set (ex: CPS85$wage)

what do we want our sample to be like?

we want our sample to be representative of the population. If it is not then it is biased

moonsofties <- c("USSR", "USA", "China", "India")

when writing words they have to be in quotation marks

y <- x

x explains some of the variation in y. y is the outcome variable and x is the predictor variable. This is a word equation, not R code: typed into R, y <- x would assign the value of x to y

