Exam 1
What percentage of grade G loans are for debt consolidation?
sum(lend$grade=="G" & lend$purpose=="debt_consolidation") / sum(lend$grade=="G") * 100
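One way to compute this kind of conditional percentage, sketched on a made-up data frame. The rows below are hypothetical; only the column names grade and purpose come from the card.

```r
# Hypothetical toy data; only the column names come from the card
lend <- data.frame(
  grade   = c("G", "G", "G", "G", "A", "B"),
  purpose = c("debt_consolidation", "car", "debt_consolidation",
              "debt_consolidation", "debt_consolidation", "car")
)

# Restrict to grade G loans, then take the share that are debt consolidation.
# mean() of a TRUE/FALSE vector is the proportion of TRUEs.
is_g <- lend$grade == "G"
pct_g_debt <- mean(lend$purpose[is_g] == "debt_consolidation") * 100
pct_g_debt  # 75: three of the four grade G loans
```

The denominator is the number of grade G loans, not the number of all loans, because the question asks about grade G loans only.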
What percentage of these loans have their current status as late by 31-120 days?
mean(lend$loan_status=="Late (31-120 days)") * 100
Suppose we define a vector as follows: vec4 <- c(20,20,5,10) What does the following line of code below return in R? sum(vec4==10)
1
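The comparison inside sum() can be looked at on its own to see where the 1 comes from. A short sketch using the vector from the card; the names matches and n_matches are only for illustration.

```r
vec4 <- c(20, 20, 5, 10)

# == compares element by element and returns one TRUE/FALSE per entry
matches <- vec4 == 10
matches            # FALSE FALSE FALSE  TRUE

# sum() treats TRUE as 1 and FALSE as 0, so it counts the matches
n_matches <- sum(matches)
n_matches          # 1
```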
As a manager of a retail store location, your profits are reported (in $M) to Corporate. Corporate reports that, compared to profits of all locations, your profit has a Z-score of -1.5. Your profit is:
1.5 standard deviations below the average
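A Z-score is (value - mean) / standard deviation. A minimal sketch with made-up profit figures; the numbers are hypothetical and not from the exam.

```r
# Hypothetical profits (in $M) for five store locations
profits <- c(2.0, 3.5, 4.0, 4.5, 6.0)

# Z-score: how many standard deviations each value is from the mean
z <- (profits - mean(profits)) / sd(profits)
z

# A negative Z-score means the location is below the average profit
profits[z < 0]
```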
Statistic
26.5% = percent of the 200 surveyed households that are headed by a single woman.
The kurtosis() function in the {moments} package in R computes the kurtosis, which uses
3 as a baseline instead of 0
What percentage of observations lie between Q1 and Q3?
50%
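This can be checked numerically. A sketch on a hypothetical vector, using quantile() to get Q1 and Q3 and then counting the share of values between them.

```r
x <- 1:100   # hypothetical data

# Q1 is the 25th percentile, Q3 is the 75th percentile
q <- quantile(x, c(0.25, 0.75))

# Proportion of observations from Q1 to Q3
share_between <- mean(x >= q[1] & x <= q[2])
share_between   # 0.5

# The distance Q3 - Q1 is the interquartile range
IQR(x)
```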
The summary() function applied to a categorical variable will return
A list of each category and how often it occurs
What does the following line of R code do? data <- read.csv("myData.csv", strings=T)
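Reads the file myData.csv into R and stores it as a data frame named data (strings=T is short for stringsAsFactors=TRUE, so text columns are stored as factors)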
Thinking about Chebyshev's rule and the Empirical (68-95-99.7%) rule, which is it OK to use for this distribution?
Chebyshev's only
Suppose we define a vector as follows: vec4 <- c(20,20,5,10) What does the following line of code below return in R? vec4==10
FALSE FALSE FALSE TRUE
TRUE or FALSE: If you open a csv file in Excel and see a column named "wait time", it will be named "wait time" in R.
FALSE: R does not like spaces or special characters. It will replace a space with a ".". Always look at your data in R using something like the str() function to see how to refer to columns.
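A small sketch of the renaming, reading a made-up two-column CSV from a string instead of a file. The column names and values are hypothetical.

```r
# Hypothetical CSV contents with a space in the first column name
csv_text <- "wait time,cost\n5,10\n7,12"

# By default read.csv() makes column names syntactically valid
d <- read.csv(text = csv_text)
names(d)      # "wait.time" "cost"

# check.names = FALSE keeps the original name; it then needs backticks
d2 <- read.csv(text = csv_text, check.names = FALSE)
names(d2)     # "wait time" "cost"
mean(d2$`wait time`)
```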
TRUE or FALSE: R recognizes sys.time() and Sys.time() as the same function.
FALSE: R is case sensitive
TRUE or FALSE: R is a statistical computing language used mainly in education?
FALSE: R is used widely throughout industry and government all over the world. It is the real industrial strength deal.
What does the following code calculate in R? mean(dName$weight[dName$shipping=="high"])
Mean weight for high shipping cost items
Different kinds of variables
Quantitative (discrete & continuous) Qualitative
The kurtosis of Stock A's returns is 3.157 and for Stock B is 4.596. Which one has more kurtosis risk?
Stock B
The return to risk for Stock A is 0.12 and for Stock B is 0.27. Based on this measure, which one is safer?
Stock B
Consider this R code and output and explain what this means > mean(fileName$billing) [1] NA
There is at least one missing value in "billing"
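A minimal sketch of the same behavior on a made-up vector (the values are hypothetical), along with two ways to follow up.

```r
billing <- c(100, 250, NA, 175)   # hypothetical values with one NA

mean(billing)                 # NA, because one value is missing
sum(is.na(billing))           # 1 missing value
mean(billing, na.rm = TRUE)   # 175, the mean of the three known values
```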
Which is better for working with data - Python or R?
They are just as good as each other, with each having some advantages and disadvantages
When should you choose a box plot over a histogram for these data? (When would a box plot be better?)
When you want to compare light beer sales by container (bottle, can, keg, etc...)
If you get a + instead of the command prompt > after hitting enter/return in R, this means
You didn't close a parenthesis, quote, etc.
These side by side box plots display the salary distribution (many observations for salary) across 2 categories (broken out by gender). If we had data such that we had one observation for each gender, we should use
a bar chart
Variables
a characteristic of an individual or item in the population - its value varies from individual to individual or from one time period to the next - ex: hair color = light brown; height = 50 in; grade = 3 - summarized by saying what the average is or how the values differ
What is a CRAN mirror?
a repository that stores all the R files for download onto individuals' computers
Identifier variable
a unique identifier assigned to each individual or item in a group - they: do not have units; are a special kind of categorical variable; are useful in combining data from different sources to avoid duplication; are not variables to be analyzed - ex: SSNs, student ID numbers, tracking numbers, transaction numbers
Population
all households in this city
How do we usually want a categorical variable to be stored in R?
as a factor
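A short sketch of converting text to a factor, using a made-up vector. The name gender_f is only for illustration.

```r
gender <- c("F", "M", "F", "F")   # hypothetical values stored as text

# factor() stores the distinct categories as levels
gender_f <- factor(gender)
levels(gender_f)     # "F" "M"

# summary() on a factor counts how often each category occurs
summary(gender_f)    # F: 3, M: 1
```

On a plain character vector, summary() only reports the length, class, and mode, which is one reason factors are preferred for categorical data.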
gender is a _______ variable
categorical, nominal
The standard deviation expressed as a percent of the mean is the
coefficient of variation
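Base R has no built-in function for this, so it is usually computed directly. A sketch on made-up numbers; the vector is hypothetical.

```r
returns <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical values

# Coefficient of variation: standard deviation as a percent of the mean
cv <- sd(returns) / mean(returns) * 100
cv
```

Because the units cancel in the ratio, the result can be compared across variables measured in different units.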
descriptive statistics
collecting, organizing, summarizing, and presenting the data
variables (with R)
columns of the dataset (left to right)
discrete quantitative variable
counted set of values ex. number of hummingbirds
Data on sales revenue, number of customers, and expenses for last month at each Starbucks (more than 20,000 locations as of 2012) would be
cross-sectional
When the <- symbol is used in R, you are
defining an object
Use the summary function on the region 2 column to examine the levels of this categorical variable. How does the Sonoma region of the Napa valley appear?
data frame name: wine; column name in R: region.2; code: summary(wine$region.2); answer: Napa-Sonoma
What are the dimensions of the dataframe?
dim(dfname) gives the number of rows, then the number of columns; nrow(dfname) gives the number of rows; ncol(dfname) gives the number of columns
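A quick sketch on a made-up data frame with 3 rows and 4 columns. The contents are hypothetical.

```r
# Hypothetical data frame: 3 observations, 4 variables
dfname <- data.frame(
  id     = 1:3,
  price  = c(10, 20, 30),
  points = c(88, 90, 92),
  region = c("a", "b", "c")
)

dim(dfname)    # 3 4  (rows first, then columns)
nrow(dfname)   # 3
ncol(dfname)   # 4
```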
inferential statistics
drawing conclusions about a population based on data observed on a sample from that population
When hitting enter after the following line of code, you will simply get the cursor: vec <- c(0,0,0,0) What did that line of code do?
everything looks OK, it stored the 4 zeros as "vec"
mean beer sales are ______ median beer sales
greater than
qualitative variable (categorical)
have categories as values - arise from descriptive responses to questions like "What kind of advertising do you use?" - may have only two possible values (like "yes" or "no") - may be a number, like a zip code or area code - ex: do you invest in the stock market (yes/no); what type of advertising do you use (internet, newspaper, radio); rate your satisfaction with this product (very negative, negative, neutral, positive, very positive)
Quantitative variables
have units of measure; have magnitude. The units indicate: how each value has been measured; the corresponding scale of measurement; how much of something we have; how far apart two values are. Units of measurement have to be included.
observational unit
is the unit upon which an observation is made (e.g., individual, neighborhood, school, country)
What happens to the standard deviation if an observation is added to the low end of the distribution?
it gets higher
What happens to the standard deviation if an unusually low (small) observation is added to a dataset?
it gets larger
What happens to the mean when an observation is added to the low end of the distribution?
it gets lower
What happens to the mean if an unusually low (small) observation is added to a dataset?
it gets smaller
What is the length of a column?
length(dfname$colname) - if it were for the price column on the wine data frame: length(wine$price)
Looking at box plot: female and male annual salaries: Which of the following is a true statement
males make more than females, in general
What is the average hourly wage for subjects in this dataset? Use the mean function.
mean(cps$hrwage) hrwage - column name
What is the average hourly wage for a male?
mean(cps$hrwage[!cps$female])
What is the average hourly wage for subjects who are female and have more than 15 years potential work experience?
mean(cps$hrwage[cps$female&cps$pexp>15])
What is the average hourly wage for a female?
mean(cps$hrwage[cps$female],na.rm=T)
What is the average hourly wage for someone with at most 35 years potential work experience?
mean(cps$hrwage[cps$pexp<=35])
What is the average hourly wage for someone with more than 35 years potential work experience?
mean(cps$hrwage[cps$pexp>35])
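The wage questions above all follow one pattern: put a TRUE/FALSE condition inside the square brackets to keep only the matching rows, then take the mean. A sketch on a made-up version of cps, assuming female is stored as TRUE/FALSE; the four rows are hypothetical.

```r
# Hypothetical data; assumes female is a logical (TRUE/FALSE) column
cps <- data.frame(
  hrwage = c(10, 20, 30, 40),
  female = c(TRUE, FALSE, TRUE, FALSE),
  pexp   = c(5, 20, 40, 10)
)

# ! flips TRUE and FALSE, so this keeps the male rows
male_mean <- mean(cps$hrwage[!cps$female])                        # 30

# & requires both conditions to hold for a row to be kept
female_exp_mean <- mean(cps$hrwage[cps$female & cps$pexp > 15])   # 30

# <= means "at most"
low_exp_mean <- mean(cps$hrwage[cps$pexp <= 35])
low_exp_mean
```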
What is the average of a column?
mean(dfname$colname) - for points in the wine data frame: mean(wine$points) - for price in the wine data frame: mean(wine$price,na.rm=T) - summary(wine$price) reveals that there is missing data (NAs) in the price column, so the na.rm=T argument must be added to the mean() function to tell R to ignore them. The mean function does not automatically ignore them.
A borrower having a hardship can make smaller than usual payments for several months under a hardship plan. What is the average hardship length?
mean(lend$hardship_length,na.rm=T)
What is the average interest for loans taken out for debt consolidation with a FICO score (use the column for the FICO at origination for the high end of the range) greater than 730?
mean(lend$int_rate[lend$fico_range_high>730&lend$purpose=="debt_consolidation"])
Wine data frame: What are the average points for Sauvignon Blanc for the Sonoma region?
mean(points[region.2=="Sonoma"&variety=="Sauvignon Blanc"])
Wine data frame: Considering region 2, what are the average points for the Southern Oregon region? column name: region.2 specific region: "Southern Oregon"
mean(points[region.2=="Southern Oregon"]) *if the data frame is not attached, add wine$ in front of the column names
Wine data frame: Considering region.2, what is the average price for the Sierra Foothills region?
mean(price[region.2=="Sierra Foothills"],na.rm=TRUE) *using summary(price) reveals there is missing data (NAs) in the price column, so the na.rm=T argument must be added to the mean() function to tell R to ignore them. Many functions in R ignore them automatically, but not the mean() function.
continuous quantitative variable
measured set of values ex. distance of hummingbird from my patio
Lending club uses a FICO score when it is deciding to grant a loan to applicants. The FICO score is a three-digit number on a 300-850 range and is a credit score. This number is calculated through a complicated statistical model and aims to tell lenders how likely a consumer is to repay borrowed money based on their credit history. Our data reports the FICO score in ranges of 4 (except for the top tier that goes from 845 - 850). Using the column associated with the high end of the range for FICO scores at loan origination, calculate the median FICO score for these approved loans. Do not use the summary() function. Instead, choose from the min(), max(), median(), or mean() functions. Feel free to check your answer with the summary() function, though.
median(lend$fico_range_high)
What is the median FICO score (use the column for the FICO at origination for the high end of the range) for grade E loans?
median(lend$fico_range_high[lend$grade=="E"])
Which measure of center can be used for a categorical variable?
mode
How many variables are reported in this dataset?
ncol(lend) lend is the dataset
How many subjects' (people's) information are in this dataset?
nrow(cps) cps is data file
How many loan records are available in the data set?
nrow(lend) lend is the dataset
statistic (vs. parameter)
number calculated from a sample (used to estimate the parameter)
Parameter (notes)
number used to describe/summarize a population - ex: mode, percentage of people
A number used to describe/summarize a characteristic of the population is called a
parameter
Parameter
percent of all households in this large city that are headed by a single woman.
nominal variable
qualitative (categorical) variables that have values that cannot be ordered - ex: a UofSC undergraduate student's major - nominal means name
salary is a ________ variable
quantitative
age can be quantitative or qualitative
quantitative - ex: average age of our customers (24, 37, 51, 28, 24); qualitative - ex: age group of books (child, teen, adult, senior)
You are weighing subjects (pounds) for a weight loss program. The variable "weight of a person" is
quantitative and continuous
In weightlifting, weight can be added to the bar in 5 pound increments. You are recording the highest amount lifted in the deadlift for members of a gym. The variable "weight lifted" is
quantitative and discrete
time series data
results from a variable measured at regular intervals over time - sequences of data - intervals: equally spaced points in time -- ex: hourly, daily, weekly, monthly, quarterly, annual..
The coefficient of variation is also known as
risk to return
observational units (with R)
rows of the data set (up to down)
The distribution of light beer sales is
skewed right (more concentrated on the left and goes off to the right)
Suppose a dataset has a calculated skewness of -1.372. This means that the distribution is
skewed to the left
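To see where a negative skewness comes from, here is a sketch that computes a moment-based skewness by hand on made-up data. The formula (third central moment divided by the 1.5 power of the second) is an assumption about what a skewness function reports; packages can differ in small-sample adjustments.

```r
x <- c(1, 8, 9, 10, 10)   # hypothetical data with one unusually low value

m <- mean(x)

# Moment-based skewness: third central moment over variance^(3/2)
skew <- mean((x - m)^3) / mean((x - m)^2)^(3/2)
skew            # negative, so the long tail is on the left

# With a left skew the mean is pulled below the median
mean(x) < median(x)   # TRUE
```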
How to find a column name in R?
str(dfname) or head(dfname) and look for the column you need
How many observations are in a specific column?
sum(!is.na(dfname$colname)) or length(dfname$colname[!is.na(dfname$colname)]) #count the observations that are not NA
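A sketch of why length() alone is not enough when a column has missing values. The vector is made up.

```r
price <- c(12, NA, 30, NA, 18)   # hypothetical column with two NAs

length(price)                  # 5, counts every entry including NAs
sum(!is.na(price))             # 3, counts only the non-missing entries
length(price[!is.na(price)])   # 3, same count after dropping the NAs
sum(is.na(price))              # 2 missing values
```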
How many subjects are from the west?
sum(cps$we) we - column name
How many NAs are contained in the entire data set?
sum(is.na(cps))
How many of these applicants had income over $65,000 and were given a grade D loan?
sum(lend$annual_inc>65000&lend$grade=="D")
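Counting rows that meet two conditions works by summing a TRUE/FALSE vector. A sketch on made-up loan rows; only the column names annual_inc and grade come from the card.

```r
# Hypothetical toy data; only the column names come from the card
lend <- data.frame(
  annual_inc = c(50000, 70000, 90000, 66000),
  grade      = c("D", "D", "A", "D")
)

# A row counts only if income is over 65000 AND the grade is D
n_high_inc_d <- sum(lend$annual_inc > 65000 & lend$grade == "D")
n_high_inc_d   # 2 (rows 2 and 4)
```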
How many of these loans have as their current status that they are currently late by 16-30 days?
sum(lend$loan_status=="Late (16-30 days)")
Sample
the 200 households surveyed
In the R console, the > with the flashing cursor next to it is referred to as
the command prompt
The summary() function applied to a numerical variable will return
the five number summary and mean
The baseline kurtosis is based on comparing the tails of a distribution to
the normal distribution
Statistics
the science of data
Z-score
to compare 2 observations from distributions with different units
coefficient of variation
to compare the variation for 2 distributions with different units
cross-sectional data (under time series data)
when a characteristic (variable) is measured on many subjects at the same time point (or time frame), the data are called cross-sectional - taken at the same time - gives a "snapshot" of the data at the given point in time
ordinal variable
when data values can be ordered, we say that the variable is ordinal - ex: length of time employed (<5 years, between 5 and 10 years, >10 years)
When is it improper to use Z-scores for comparison?
when the distributions have different shapes
If you execute a line of code in R and get the + on the next line instead of >, what does this mean?
you didn't close a parenthesis or quote