busn 5000 hw pt 1


Before we look at the structure of the data, let's do a little provenance work. (Just a little.) Go to the paper linked to above to answer the next six questions.

-

Now let's turn to documentation and structure. The wooldridge vignette (https://cran.r-project.org/web/packages/wooldridge/wooldridge.pdf) provides descriptions of the variables contained in the data set. Use the vignette to answer the next few questions.

-

The estimated correlation between earnings and age among 23-62 year-olds using the March 2009 CPS is ______ .

.13

A terabyte is equal to a _____ bytes.

1000000000000

The average gender earnings gap among 25-34 year-olds is $______ (round to the nearest hundred), which is about $_____ smaller than the gap for all workers. (Use your first answer to Part B Q6.)

10100,8900

A megabyte can store _____ num values, while a terabyte can store roughly a _____ times that.

131072, 1000000
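The arithmetic behind these two answers can be checked directly in R. This is a minimal sketch, assuming a megabyte is counted as 2^20 bytes (which is what yields 131072) and that each num value takes 8 bytes; the variable names are illustrative only.

```r
bytes_per_num <- 8          # R stores each num (double) in 8 bytes
bytes_per_mb  <- 2^20       # assumption: binary megabyte, 1,048,576 bytes
bytes_per_tb  <- 2^40       # assumption: binary terabyte

nums_per_mb <- bytes_per_mb / bytes_per_num   # num values that fit in one megabyte
tb_over_mb  <- bytes_per_tb / bytes_per_mb    # how many megabytes fit in one terabyte

nums_per_mb
tb_over_mb
```

Under the decimal convention (10^12 bytes per terabyte, 10^6 per megabyte) the ratio is exactly one million; under the binary convention it is 1,048,576, which is why the card says "roughly".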

On average, we found that men earned roughly $_____ (round to the nearest thousand) more than women, which translates into about a ______ percentage (round to the nearest integer) earnings gap

19000, 43

The source of Card's data is a survey that began in _____ with _____ young men age 14-24.

1966,5525

The same young men were surveyed again in selected years through _____ , effectively creating a _____ data set where the unit of observation is the person- _____ .

1981,panel,year

There are _____ categories in the variable race with individuals identifying as Asian only assigned the value _____

21,4 project 26,4

The dollar gap among 25-34 year-olds translates into a ______ percent (round to the nearest integer), which is about _____ percentage points smaller than the gap for all workers.

26,17

For 25-34 year-olds, the gender gap in the likelihood of earning at least $100,000 is only _____ percentage points. Answer to 2 decimal places; an example answer would read "1.12" percentage points.

3.54

The extract was filtered to include individuals who had worked at least _____ hours per week and _____ weeks during the past year.

36, 48 project=40

This average dollar difference translates into roughly a _____ (round to the nearest integer) percent earnings gap.

43

Based on Figure 12, in the first year of a career, male earnings increase _____ % on average while female increase by only _____ % (round to the nearest integer for both answers).

5,3

Using the March 2009 CPS data, the quadratic model of E(earnings|age) predicts earnings increase up to roughly age _____ .

50

The filtered extract contains _____ observations on _____ variables.

50742,12 project 52097,15

Average earnings for men were $______ (round to the nearest integer), while average earnings for women were roughly $______ (round to the nearest 1,000) less.

64190, 19000 project 86880.62,19234.73

The variable age is top-coded at ______.

85 project 62

Using the 2009 CPS data we find, men are almost _____ percentage points more likely to earn at least $100,000.

9

We distinguish 4 stages of data analysis and refer to them compactly as _____ (in all caps). Name the stages.

ATAC acquisition, transformation, analysis, communication

males <- filter(cef_fvm, gender == "Male")
females <- filter(cef_fvm, gender == "Female")
df_ratio <- data.frame(age = males$age, ratio = males$earnings / females$earnings)
df_ratio

Because our interest is in the earnings gap, it would probably be more informative just to examine the ratio of the two CEFs. So, let's do it. We'll use filter to separate the male and female CEF estimates. Then, we'll compute the ratio, put it in a new data frame with age, and list its values. This is a newer skill, so we will check your code here.

In the third stage, the workhorse will be the _____.

CEF

library(readxl)
cps09mar <- read_xlsx("./data/cps09mar.xlsx")

First, we need to load the data. Because the March 2009 extract is contained in an .xlsx file, we will use the read_xlsx function provided by the readxl package. Load the package and read the file; note that the file name is cps09mar.xlsx. (You need to get this right to move on, so we will check your code here.)

library(readxl)
library(dplyr)  # provides %>%, filter(), mutate(), and case_when()
cps09mar <- read_xlsx("./data/cps09mar.xlsx")
cps_mar_2362 <- cps09mar %>%
  filter(age >= 23, age <= 62) %>%
  mutate(gender = case_when(female == 1 ~ "Female", female == 0 ~ "Male"))

First, we'll replicate the estimated CEFs for women and men we showed in class, but using actual earnings instead of log earnings. Then, we'll evaluate the percentage gap in earnings between women and men. Make sure that you click Submit Answer on each coding exercise where it appears. As in Homework 1, we start by loading the data, refer to prior HW if you need to be reminded of the file name. Picking up where we left off in class, we filter down to workers who are between 23-62 years old using filter and recreate the gender variable, using the mutate() and case_when() functions. We'll do it in one chunk

The _____ says that the expected value of the CEF of, say, Y given X, is the expected value of Y

LIE
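The law of iterated expectations (LIE) can be checked on simulated data. The sketch below uses made-up data and hypothetical names (x, y, cef_hat, shares): it averages the group means of y, weighting each by the share of observations in that group, and compares the result with the overall mean of y.

```r
set.seed(1)
x <- sample(c("a", "b", "c"), size = 1000, replace = TRUE)  # made-up grouping variable
y <- rnorm(1000) + (x == "b")                               # made-up outcome

cef_hat <- as.numeric(tapply(y, x, mean))     # estimated E(Y | X = x) for each group
shares  <- as.numeric(table(x)) / length(x)   # sample share of each group, same order

lie_avg <- sum(cef_hat * shares)              # average of the CEF over the groups
all.equal(lie_avg, mean(y))                   # compares with the plain average of Y
```

In the sample this holds as an exact identity (up to floating-point error), because weighting each group mean by its group share just reassembles the overall sum.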

Card obtained the data from the _____.

NLSYM

earnings_bar <- cps09mar_2534 %>%
  group_by(gender) %>%
  summarise(avg_earnings = mean(earnings))
earnings_bar

Next, let's compute the gap in average earnings in dollar and percentage terms. Use code on slide 49 to carry out this calculation.

cps09mar_2534 <- cps09mar %>% filter(age <= 34, age >= 25)

Next, let's filter down to workers who are between 25-34 years old. This will allow us to focus on individuals who (in all likelihood) have completed their educations but are still in the early stage of their careers. To do this we will need the filter function from the dplyr package. To parallel the analysis covered in the class slide deck, we will create a gender indicator using the mutate function from dplyr. We could easily do both things in one chunk but we will split them up. The filtering action will take the original data set, cps09mar, filter on the age restriction, and create a new data set containing only the younger workers. (Another term for filtering is "drilling down".)

cef_fvm <- cps_mar_2362 %>%
  mutate(age = age - 23) %>%  # Center on age=23
  group_by(age, gender) %>%
  summarise(earnings = mean(earnings))
options(scipen=999)
ggplot(cef_fvm, aes(age, earnings, color=gender)) +
  geom_point() +
  geom_line() +
  ylab("Average earnings by age") +
  labs(title="CEFs of earnings by gender")

Now we are ready to replicate the estimated CEFs for women and men using actual earnings. Remember, to talk in terms of career years, we "center" age on 23. Now, plot the estimated CEFs just like in Figure 11, except the vertical axis should show actual dollar values.

earnings_dist_fvm <- ggplot(cps09mar_2534, aes(x=earnings, group = gender, fill = gender)) +
  geom_density(adjust=1.5, alpha = 0.4) +
  labs(title="Distribution of earnings by gender")
earnings_dist_fvm

Ok, now we are ready to do some work. First, let's replicate the earnings distributions shown in Figure 4 of the deck for 25-34 year-olds. Here is all the code you need. Just fill in the blank with the earnings distribution object name and see what you get.

example:
current_data_set <- current_data_set %>%
  mutate(newvariable = case_when(_____ == 1 ~ "Female", _____ == 0 ~ "Male"))
real:
cps09mar_2534 <- cps09mar_2534 %>%
  mutate(gender = case_when(female == 1 ~ "Female", female == 0 ~ "Male"))

The next step is, strictly speaking, not necessary because cps09mar already contains a female indicator, but as we said, we want to parallel the analysis in the deck. Here is one way to create the new variable gender indicator using the mutate and case_when functions from dplyr. Then we will add the new variable to cps09mar_2534. Use this sample code to complete the mutating operation that will create a new gender variable that takes on the values Female and Male and adds it to cps09mar_2534.

A variable will not have _____ if it does not measure what it is supposed to.

Validity

six_figs_fvm <- cps09mar_2534 %>%
  group_by(gender) %>%
  summarise(six_figs_shrs = mean(earnings >= 100000))
print(six_figs_fvm)

We'll finish this exercise by calculating the gender gap in the likelihood of earning six figures among 25-34 year-olds. Use the code on slide 38 to complete this code chunk. This calculation may take a minute to run, but reach out to Abbi if you have issues getting the answer to finish calculating.

If you model the pattern in Figure 6 with a quadratic function of age, the difference in earnings from one age to the next varies with ______ .

age
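To see why the answer is age, it may help to write the quadratic model out. In this sketch the coefficient symbols \beta_0, \beta_1, \beta_2 are notation assumed here, not taken from the course slides.

```latex
\begin{align*}
E(\text{earnings}\mid \text{age}) &= \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{age}^2 \\
E(\text{earnings}\mid \text{age}+1) - E(\text{earnings}\mid \text{age})
  &= \beta_1 + \beta_2\,(2\,\text{age} + 1)
\end{align*}
```

The one-year difference depends on age through the \beta_2 term. In a linear model \beta_2 = 0, so the difference collapses to the constant \beta_1.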

The variable exper measures labor-market experience as ______.

age-education-c

How to handle missing data depends on whether they are missing _____.

at random

If we want to estimate E(earnings|age=23) , the simplest thing to do is plug in the sample ______ of earnings of 23-year-olds.

average

If we want to estimate E(earnings|age) , the simplest thing to do is plug in the sample ______ earnings for each value of ______ .

average, age
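The plug-in idea amounts to computing one sample average per age value. Here is a minimal base-R sketch on a made-up data frame; the name toy and its values are hypothetical and are not the CPS extract.

```r
# Made-up data: five workers, three distinct ages
toy <- data.frame(
  age      = c(23, 23, 24, 24, 25),
  earnings = c(30000, 34000, 36000, 40000, 45000)
)

# Estimated CEF: the sample average of earnings at each value of age
cef_hat <- aggregate(earnings ~ age, data = toy, FUN = mean)
cef_hat
```

The same result could be produced with the group_by() and summarise() pattern used elsewhere in this set; aggregate() is used here only to keep the sketch free of package dependencies.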

To estimate E{[X−E(X)][Y−E(Y)]} , we can just plug in the sample _____ for E(X) and E(Y) and replace the outer expectation with another sample _____ .

average,average
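The plug-in recipe for the covariance can be sketched in a few lines of R. The vectors below are made-up numbers. Note that R's built-in cov() divides by n - 1 rather than n, so the two results differ by a factor of n / (n - 1).

```r
x <- c(1, 2, 3, 4, 5)   # made-up values
y <- c(2, 4, 5, 4, 5)   # made-up values
n <- length(x)

# Plug in sample averages for E(X) and E(Y), then average the products
cov_plugin <- mean((x - mean(x)) * (y - mean(y)))

# Built-in estimator, which uses the n - 1 divisor
cov_builtin <- cov(x, y)

cov_plugin
cov_builtin
cov_plugin * n / (n - 1)   # rescaled plug-in value, for comparison with cov_builtin
```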

homework 2

beginning to learn

Sample selection may be a source of _____ if the data we have does not represent the population we want to learn about.

bias

The survey was not a random sample of the US population because men from neighborhoods with a high concentration of _____ residents were over-sampled.

black

The wage variable is measured in _____. The lwage variable is the _____ transformation of wage .

cents, log

part b

code

We say that data are tidy if each variable corresponds to a _____, each row an _____, and each cell a _____.

column, observation, single value

Modeling the pattern in Figure 6 with a quadratic function of age captures the ______ shape of the relationship between earnings and age.

concave

Comparing the earnings of women and men involves estimating the _____ expectation of _____ given _____.

conditional,earnings,gender

Modeling the pattern in Figure 6 with a linear function of age assumes that the difference in earnings from one age to the next is ______ .

constant

The expression E{[X−E(X)][Y−E(Y)]} defines the ______ between ______ and ______ .

covariance, x,y

homework 1

data fundamentals

Covariance indicates the ______ of a relationship but not the ______ of a relationship.

direction, strength

The CEF gives the expected value of some random variable Y given the value of another random variable X. Applied to last week's work on the gender pay gap, the Y was ______ and the X was ______ .

earnings,gender

part b

empirical code

part b

empirical exercise

Another reason reproducibility matters is to guard against _____ and _____.

error,fraud

The _____ is the thing we want to learn about. An _____ is the thing we compute to learn about it, which for a given set of data, gives us an _____

estimand, estimator, estimate

Because we rarely know a random variable's distribution, we typically _____ its expected value using its _____ average.

estimate,sample

The natural log function is the inverse of the _____ function.

exponential

Each record is made of _____ that contain measurements of known types.

fields

One reason reproducibility matters is to protect and support your _____ self.

future

The CEF plots indicate that the gender earnings gap (grows/shrinks) _____ over a typical career.

grows

The quadratic model of E(earnings|age) fits the data well and is also justified by ______ theory.

human capital

The key variable in the data set is _____.

id

Based on Figure 6, earnings tend to ______ early in a career and plateau after age 40 or so.

increase

Because we generally do not know the underlying data-generating process, we try to ______ it from the data we observe.

infer

The frequentist approach to probability defines the probability of some event A as the number of times it occurs out of an _____ number of random trials.

infinite

This representation of the data also makes clear what are the _____ that identify an observation.

key variables

This idea of relative frequency converging to the true probability is an example of the _____.

law of large numbers
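A short simulation can illustrate the convergence. This sketch assumes a fair coin (probability 0.5) and tracks the running share of heads as the number of flips grows; the seed and the sample size are arbitrary choices.

```r
set.seed(42)
flips <- rbinom(10000, size = 1, prob = 0.5)   # 1 = heads, 0 = tails

# Relative frequency of heads after each flip
running_share <- cumsum(flips) / seq_along(flips)

# Relative frequency after 10, 100, 1,000, and 10,000 flips
running_share[c(10, 100, 1000, 10000)]
```

The early values can sit well away from 0.5, while the later ones should settle close to it.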

The skim() function provided by the skimr package is another useful tool for data documentation. Load skimr via a library() command and then "skim" the card data. Answer a few more questions based on the skim() output.

library(skimr)
skim(card)

First, use the library() and data() functions to load the wooldridge package and card data set.

library(wooldridge)
data(card)

homework 3

models for exploration

Based on Figures 11 and 12, you would say male earnings increase (more/less) ______ rapidly than female earnings early in a career.

more

Based on Table 2, you would say male earnings increases are (more/less) ______ variable than female earnings.

more

Compared with the earnings distributions in Figure 4, these show (more/less) _____ overlap.

more

R stores real numbers as a _____ data type and allocates _____ bytes of data to each number.

num,8

What data types are lwage and wage?

numeric, integer

Finally, use the object.size function to estimate the amount of memory allocated to store the Card data.

object.size(card)

A quick-serve restaurant chain records sales, staffing and customer traffic every day for each store. You recognize this as a _____ data set where the unit of observation is the store-day.

panel

Log transformations help us talk about _____ differences or changes.

percentage
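The link between log differences and percentage changes can be seen with two made-up earnings figures. The values below are hypothetical; the approximation is close for small changes and degrades as the change gets larger.

```r
w0 <- 50000   # made-up earnings in the first year
w1 <- 52000   # made-up earnings in the second year

pct_change <- (w1 - w0) / w0       # exact proportional change
log_diff   <- log(w1) - log(w0)    # difference in natural logs

pct_change
log_diff
```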

What does %>% mean in R?

It is the pipe operator. You start with an object or data frame on the left side of the operator, and %>% passes that object as the first argument to the function on the right side. The output of the function on the right side becomes the input for the next function in the chain.
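As a small illustration, the sketch below writes the same computation twice, once with nested function calls and once with the pipe. The data frame toy is made up, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Made-up data
toy <- data.frame(age = c(22, 30, 41), earnings = c(30000, 52000, 70000))

# Nested calls: read from the inside out
nested <- summarise(filter(toy, age >= 23), avg = mean(earnings))

# Piped calls: read from top to bottom
piped <- toy %>%
  filter(age >= 23) %>%
  summarise(avg = mean(earnings))

nested
piped
```

Both versions keep the two rows with age of at least 23 and average their earnings; the piped form simply lists the steps in the order they happen.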

The term data is

plural

The expected value of an indicator variable that takes on the values 1 and 0 is equivalent to the _____ the random variable equals _____.

probability , 1
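This is why averaging a logical condition gives a share. A minimal sketch with made-up earnings values (hypothetical, not the CPS extract):

```r
earnings <- c(45000, 120000, 80000, 150000, 60000)   # made-up values

# Indicator: 1 if earnings are at least $100,000, 0 otherwise
six_figs <- as.numeric(earnings >= 100000)

# The sample average of the indicator is the share of 1s
share_six_figs <- mean(six_figs)
share_six_figs
```

Two of the five made-up values clear the threshold, so the average of the indicator equals that share. The six-figure calculation elsewhere in this set relies on the same fact when it takes mean(earnings >= 100000).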

You should view a reproducible analysis as a _____ that you should be able to produce again and again.

product

One important component is describing the exact _____ of your raw input data.

provenance

The key idea behind _____ is that one draw from a population does not depend on another.

random sampling

A data set is made up of _____ that contain information on a specific entity.

records

A data table is made up of rows containing _____ and columns containing _____.

records, fields

A _____ is a representation of the data structure comprising all of the attributes of the data and their types.

schema

It is advisable to _____ the acquisition, transformation and analysis tasks.

separate

Because earnings distributions tend to be _____ right, the _______ distribution is often a good model for earnings data.

skewed, log-normal

The str() function provides an overview of the data type, size, and content in a data set. Apply it to determine the structure of the card data set and answer the questions that follow.

str(card)

If we want to estimate how earnings change from one point in a career to the next, we can just _____ the sample ______ earnings for one age value from another.

subtract,average

The second stage involves, among other things, making sure the data are _____ (as the Posit folks would say).

tidy

If E(estimator) equals the thing we want to learn about, we say that it is _____

unbiased

This representation of the data structure identifies the _____ to which each observation pertains.

unit of record

The formula E{[X−E(X)]²} calculates the ______ of a ______ .

variance, random variable

The concept of a random variable's expected value is a _____ average of all the random variable's possible _____.

weighted,outcomes

A national company has developed a new product and is offering it for sale at a discount to introduce it to the market. Randomly surveying customers who purchased the product in the initial discount period (would/would not) ______ generate a sample representing the population of typical customers.

would not

The log of earnings is undefined if earnings equal _____

zero

