HW 1 Data Fundamentals: BUSN 5000
Based on object.size the Card data take up _____ MB in memory. (Round to 3 digits)
0.438
Pro Tips
1. Make sure the first letter in the answer is NOT capitalized. It will say your answer is wrong even if you're right. 2. for questions with multiple answers make sure you put a comma between the answers plus a space after the comma.
A megabyte can store _____ num values, while a terabye can store roughly a _____ times that.
131072, 1000000
The source of Card's data is a survey that began in _____ with _____ young men age 14-24.
1966, 5525
The same young men were surveyed again in selected years through _____ , effectively creating a _____ data set where the unit of observation is the person- _____ .
1981, panel, year
What percentage of the sample are Black? _____ %. Is that representative of the US population in 1976? (Yes/No) _____ .
23, No
Card's analysis is based on the 1976 survey when the youngest respondents are _____. By 1976, attrition had reduced the sample size to _____ observations. After filtering the sample on observations with valid education and wage data, Card is left with an analysis sample of _____ young men.
24, 3694, 3010
The card data set contains _____ observations and _____ variables.
3010, 34
What percentage of young men in the sample are missing IQ test scores? _____ % . (Answer to 1 decimal place, for example: ''99.9'' percent)
31.5
The third person in the data set is _____ years old, has _____ years of education, has _____ years of experience, and reported a wage of $ ______ .
34, 12, 16, 7.21
How many variables have missing data? _____.
6
In the third stage, the workhorse will be the _____.
CEF
Card obtained the data from the _____.
NLSYM
PART B: BELOW THIS CARD
PART B: BELOW THIS CARD
We distinguish 4 stages of data analysis. Name them.
acquisition, transformation, analysis, communication
The variable exper measures labor-market experience as ______.
age - educ - 6
Another reason reproducibility matters is to guard against confirmation _____.
bias
The survey was not a random sample of the US population because men from neighborhoods with a high concentration of _____ residents were over-sampled.
black
The wage variable is measured in _____. The lwage variable is the _____ transformation of wage.
cents, log
We say that data are tidy if each variable corresponds to a _____, each row an _____, and each cell a _____.
column, observation, single value
This representation of the data structure identifies the _____ to which each observation pertains.
entity
Each record is made of _____ that contain measurements of known types.
fields
One reason reproducibility matters is to protect and support your _____ self.
future
The key variable in the data set is _____.
id
This representation of the data also makes clear what are the _____ that identify an observation.
key variables
The skim() function provided by the skimr package is another useful tool for data documentation Load skimr via a library()command and then "skim" the card data. Answer a few more questions based on the skim() output.
library(skimr) skim(card)
First, use the library() and data() functions to load the wooldridge package and card data set.
library(wooldridge) data(card)
R stores real numbers as a _____ data type and allocates _____ bytes of data to each number.
num, 8
What data type is lwage? _____. How about wage? ______. (Use the full-name description of the data type in your answers.)
numeric, integer
Finally, use the object.size function to estimate the amount of memory allocated to store the Card data.
object.size(card)
A quick-serve restaurant chain records sales, staffing and customer traffic every day for each store. You recognize this as a _____ data set where the unit of observation is the store-day.
panel
The term data is (singlular/plural) _____.
plural
You should view a reproducible analysis as a _____ that you should be able to produce again and again.
product
One important component of reproducibility is describing the exact _____ of your raw input data.
provenance
A data table is made up of rows containing _____ and columns containing _____.
records, fields
The first requirement for a data analysis to be reproducible is if the _____ of the original analysis can be generated using the same _____.
results, inputs
A data _____ is a representation of the data structure comprising all of the attributes of the data and their types.
schema
It is advisable to _____ the acquisition, transformation and analysis tasks.
separate
The str() function, which provides an overview of the data type, size, and content in a data set. Apply it to determine the structure of the card data set and answer the questions that follow.
str(card)
Data that is organized in a table of records and fields is _____.
structured
The second stage involves, among other things, making sure the data are _____ (as the Posit folks would say).
tidy
A variable will not have _____ if it does not measure what it is supposed to.
validity
A national company has developed a new product and is offering it for sale at a discount to introduce it to the market. Randomly surveying customer who purchased the product in the initial discount period (would/would not) ______ generate a sample representing the population of typical customers.
would not