Exam 1

What percentage of grade G loans are for debt consolidation?

sum(lend$grade=="G" & lend$purpose=="debt_consolidation") / sum(lend$grade=="G") * 100

What percentage of these loans have their current status as late by 31-120 days?

sum(lend$loan_status=="Late (31-120 days)") / nrow(lend) * 100
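
One way to sketch a percentage calculation like the two above: the mean of a TRUE/FALSE vector is the proportion of TRUEs, so multiplying by 100 gives a percent. The small lend data frame below is made up for illustration; only the column names and status labels follow the cards.

```r
# Hypothetical toy data standing in for the real lend data frame
lend <- data.frame(
  grade = c("G", "G", "A", "G", "B"),
  purpose = c("debt_consolidation", "car", "debt_consolidation",
              "debt_consolidation", "car"),
  loan_status = c("Current", "Late (31-120 days)", "Current",
                  "Current", "Fully Paid")
)

# Percent of grade G loans that are for debt consolidation:
# subset to grade G first, then take the proportion of matches
pct_g_debt <- mean(lend$purpose[lend$grade == "G"] == "debt_consolidation") * 100

# Percent of all loans whose status is late by 31-120 days
pct_late <- mean(lend$loan_status == "Late (31-120 days)") * 100

print(pct_g_debt)
print(pct_late)
```

The same numbers come from sum(condition) divided by the size of the group, which is the form used in the answers above.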

Suppose we define a vector as follows: vec4 <- c(20,20,5,10). What does the following line of code return in R? sum(vec4==10)

1
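
A short sketch of why the answer is 1: == compares each element of the vector to 10, and sum() counts the TRUEs. The names is_ten and n_tens are only illustrative.

```r
vec4 <- c(20, 20, 5, 10)

is_ten <- vec4 == 10   # element-wise comparison: one TRUE/FALSE per element
n_tens <- sum(is_ten)  # TRUE counts as 1 and FALSE as 0, so this counts matches

print(is_ten)
print(n_tens)
```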

As a manager of a retail store location, your profits are reported (in $M) to Corporate. Corporate reports that, compared to profits of all locations, your profit has a Z-score of -1.5. Your profit is:

1.5 standard deviations below the average
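
As a rough illustration, a Z-score is the observation minus the mean, divided by the standard deviation. The profit figures below are invented for the example.

```r
# Hypothetical profits (in $M) for five store locations
profits <- c(2.0, 3.5, 4.0, 4.5, 6.0)

# Z-score: how many standard deviations each value sits from the mean
z <- (profits - mean(profits)) / sd(profits)

print(round(z, 2))  # negative values are below average, positive above
```

A location with a Z-score of -1.5 would sit 1.5 standard deviations below the mean of all locations.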

Statistic

26.5% = percent of the 200 surveyed households that are headed by a single woman.

The kurtosis() function in the {moments} package in R computes the kurtosis, which uses

3 as a baseline instead of 0
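
To show what the baseline of 3 means, here is a hand-rolled Pearson kurtosis (fourth central moment divided by the squared second central moment). This is a sketch written for this card, not the {moments} package's own code, and pearson_kurtosis is a made-up name.

```r
# Hand-rolled Pearson kurtosis; a normal distribution scores about 3
pearson_kurtosis <- function(x) {
  dev <- x - mean(x)
  mean(dev^4) / mean(dev^2)^2
}

x <- c(1, 2, 3, 4, 5)      # toy data
k <- pearson_kurtosis(x)   # kurtosis on the "3 as baseline" scale
excess <- k - 3            # the same quantity on the "0 as baseline" scale

print(k)
print(excess)
```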

What percentage of observations lie between Q3 and Q1?

50%

The summary() function applied to a categorical variable will return

A list of each category and how often it occurs

What does the following line of R code do? data <- read.csv("myData.csv", strings=T)

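Reads the file myData.csv into R and stores it as a data frame named data (strings=T is short for stringsAsFactors=TRUE, so text columns are stored as factors)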

Thinking about Chebyshev's rule and the Empirical (68-95-99.7%) rule, which is it OK to use for this distribution?

Chebyshev's only

Suppose we define a vector as follows: vec4 <- c(20,20,5,10). What does the following line of code return in R? vec4==10

FALSE FALSE FALSE TRUE

TRUE or FALSE: If you open a csv file in Excel and see a column named "wait time", it will be named "wait time" in R.

FALSE: R does not like spaces or special characters in column names. By default it will replace a space with a ".", so the column becomes wait.time. Always look at your data in R using something like the str() function to see how to refer to columns.
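
A small sketch of this behavior, using read.csv() with its text argument so no file is needed. The CSV contents are made up.

```r
# Two-column CSV whose first header contains a space (hypothetical data)
csv_text <- "wait time,cost\n5,10\n7,12"

dat <- read.csv(text = csv_text)  # default check.names = TRUE repairs names
print(names(dat))                 # the space has been replaced with "."
str(dat)                          # str() shows the names R actually uses

# Turning the repair off keeps the original header, but then the column
# has to be referred to with backticks: raw$`wait time`
raw <- read.csv(text = csv_text, check.names = FALSE)
print(names(raw))
```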

TRUE or FALSE: R recognizes sys.time() and Sys.time() as the same function.

FALSE: R is case sensitive

TRUE or FALSE: R is a statistical computing language used mainly in education.

FALSE: R is used widely throughout industry and government all over the world. It is the real industrial strength deal.

What does the following code calculate in R? mean(dName$weight[dName$shipping=="high"])

Mean weight for high shipping cost items

Different kinds of variables

Quantitative (discrete and continuous) and qualitative

The kurtosis of Stock A's returns is 3.157 and for Stock B is 4.596. Which one has more kurtosis risk?

Stock B

The return to risk for Stock A is 0.12 and for Stock B is 0.27. Based on this measure, which one is safer?

Stock B

Consider this R code and output and explain what it means: > mean(fileName$billing) [1] NA

There is at least one missing value in "billing"
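
A minimal sketch of the same situation with a made-up billing vector: a single NA makes mean() return NA unless na.rm=TRUE is supplied.

```r
billing <- c(120, 80, NA, 100)  # hypothetical values with one missing entry

m_default <- mean(billing)               # NA, because of the missing value
m_clean   <- mean(billing, na.rm = TRUE) # drops the NA before averaging
n_missing <- sum(is.na(billing))         # how many values are missing

print(m_default)
print(m_clean)
print(n_missing)
```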

Which is better for working with data - Python or R?

They are just as good as each other, with each having some advantages and disadvantages

When should you choose a box plot over a histogram for these data? (When would a box plot be better?)

When you want to compare light beer sales by container (bottle, can, keg, etc...)

If you get a + instead of the command prompt > after hitting enter/return in R, this means

You didn't close a parenthesis, quote, etc.

These side by side box plots display the salary distribution (many observations for salary) across 2 categories (broken out by gender). If we had data such that we had one observation for each gender, we should use

a bar chart

Variables

a characteristic of an individual or item in the population - its value varies from individual to individual or from one time period to the next - ex: hair color = light brown; height = 50 in; grade = 3 - we describe a variable by saying what the average is or how the values differ

What is a CRAN mirror?

a repository that stores all the R files for download onto individuals' computers

Identifier variable

a unique identifier assigned to each individual or item in a group. Identifier variables: - do not have units - are a special kind of categorical variable - are useful in combining data from different sources to avoid duplication - are not variables to be analyzed - ex: SSNs, student ID #s, tracking #s, transaction #s

Population

all households in this city

How do we usually want a categorical variable to be stored in R?

as a factor
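
A brief sketch of what storing a categorical variable as a factor looks like. The container values are invented for the example.

```r
# Hypothetical categorical data stored as plain text
container <- c("bottle", "can", "keg", "can", "bottle", "can")

container_f <- factor(container)  # convert the character vector to a factor

print(levels(container_f))        # the distinct categories
counts <- summary(container_f)    # for a factor, summary() counts each category
print(counts)
```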

gender is a _______ variable

categorical, nominal

The standard deviation expressed as a percent of the mean is the

coefficient of variation
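
As a sketch, the coefficient of variation can be computed directly from sd() and mean(). The returns vector is a made-up example.

```r
returns <- c(2, 4, 4, 6)  # hypothetical returns, in percent

# Coefficient of variation: standard deviation as a percent of the mean
cv <- sd(returns) / mean(returns) * 100

# Its reciprocal (without the factor of 100) is the return-to-risk ratio
return_to_risk <- mean(returns) / sd(returns)

print(cv)
print(return_to_risk)
```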

descriptive statistics

collecting, organizing, summarizing, and presenting the data

variables (with R)

columns of the dataset (left to right)

discrete quantitative variable

counted set of values ex. number of hummingbirds

Data on sales revenue, number of customers, and expenses for last month at each Starbucks (more than 20,000 locations as of 2012) would be

cross-sectional

When the <- symbol is used in R, you are

defining an object

Use the summary function on the region 2 column to examine the levels of this categorical variable. How does the Sonoma region of the Napa valley appear?

With the data frame named wine, the column is named region.2 in R: summary(wine$region.2) - the region appears as Napa-Sonoma

What are the dimensions of the dataframe?

dim(dfname) gives both the number of rows and the number of columns (rows first); nrow(dfname) gives the number of rows; ncol(dfname) gives the number of columns
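
A quick sketch of the three functions on a tiny made-up data frame:

```r
# Hypothetical data frame with 3 rows and 2 columns
dfname <- data.frame(id = 1:3, price = c(10, 20, NA))

d <- dim(dfname)  # rows first, then columns
print(d)
print(nrow(dfname))
print(ncol(dfname))
```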

inferential statistics

drawing conclusions about a population based on data observed on a sample from that population

When hitting enter after the following line of code, you will simply get the cursor vec <- c(0,0,0,0) What did that line of code do?

everything looks OK, it stored the 4 zeros as "vec"

mean beer sales are ______ median beer sales

greater than

qualitative variable (categorical)

have categories as values - arise from the descriptive responses to questions like "What kind of advertising do you use?" - may only have two possible values (like "yes" or "no") - may be a number like a zip code or area code - ex: do you invest in the stock market (yes/no); what type of advertising do you use (internet, newspaper, radio); rate your satisfaction with this product (very negative, negative, neutral, positive, very positive)

Quantitative variables

have units of measure; have magnitude. The units indicate... - how each value has been measured - the corresponding scale of measurement - how much of something we have - how far apart two values are - units of measurement have to be included

observational unit

is the unit upon which an observation is made (e.g., individual, neighborhood, school, country)

What happens to the standard deviation if an observation is added to the low end of the distribution?

it gets higher

What happens to the standard deviation if an unusually low (small) observation is added to a dataset?

it gets larger

What happens to the mean when an observation is added to the low end of the distribution?

it gets lower

What happens to the mean if an unusually low (small) observation is added to a dataset?

it gets smaller

What is the length of a column?

length(dfname$colname) - for the price column in the wine data frame: length(wine$price)

Looking at box plot: female and male annual salaries: Which of the following is a true statement

males make more than females, in general

What is the average hourly wage for subjects in this dataset? Use the mean function.

mean(cps$hrwage), where hrwage is the column name

What is the average hourly wage for a male?

mean(cps$hrwage[!cps$female])

What is the average hourly wage for subjects who are female and have more than 15 years potential work experience?

mean(cps$hrwage[cps$female&cps$pexp>15])

What is the average hourly wage for a female?

mean(cps$hrwage[cps$female],na.rm=T)

What is the average hourly wage for someone with at most 35 years potential work experience?

mean(cps$hrwage[cps$pexp<=35])

What is the average hourly wage for someone with more than 35 years potential work experience?

mean(cps$hrwage[cps$pexp>35])
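
The subsetting pattern in these cards can be tried on a small made-up data frame. This sketch assumes the female column is stored as TRUE/FALSE, which is what the ! and & in the answers rely on; the numbers are invented.

```r
# Hypothetical stand-in for the cps data (female assumed to be logical)
cps <- data.frame(
  hrwage = c(10, 20, 30, 40),
  female = c(TRUE, FALSE, TRUE, FALSE),
  pexp   = c(5, 20, 40, 36)
)

# The logical vector inside [ ] keeps only the rows where it is TRUE
male_mean      <- mean(cps$hrwage[!cps$female])                  # not female
female_exp15   <- mean(cps$hrwage[cps$female & cps$pexp > 15])   # both conditions
at_most_35_exp <- mean(cps$hrwage[cps$pexp <= 35])               # at most 35 years

print(male_mean)
print(female_exp15)
print(at_most_35_exp)
```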

What is the average of a column?

mean(dfname$colname) - for points in the wine data frame: mean(wine$points) - for price in the wine data frame: mean(wine$price,na.rm=T) - summary(wine$price) reveals that there is missing data (NAs) in the price column, so the na.rm=T argument must be added to the mean() function to tell R to ignore them. The mean function does not automatically ignore them.

A borrower having a hardship can make smaller than usual payments for several months under a hardship plan. What is the average hardship length?

mean(lend$hardship_length,na.rm=T)

What is the average interest for loans taken out for debt consolidation with a FICO score (use the column for the FICO at origination for the high end of the range) greater than 730?

mean(lend$int_rate[lend$fico_range_high>730&lend$purpose=="debt_consolidation"])

Wine data frame: What are the average points for Sauvignon Blanc for the Sonoma region?

mean(points[region.2=="Sonoma"&variety=="Sauvignon Blanc"])

Wine data frame: Considering region 2, what are the average points for the Southern Oregon region? column name: region.2 specific region: "Southern Oregon"

mean(points[region.2=="Southern Oregon"]) *if the data frame is not attached, add wine$ in front of the column names

Wine data frame: Considering region.2, what is the average price for the Sierra Foothills region?

mean(price[region.2=="Sierra Foothills"],na.rm=TRUE) *using summary(price) reveals there is missing data (NAs) in the price column, so the na.rm=TRUE argument must be added to the mean() function to tell R to ignore them. Many functions in R ignore them automatically, but not the mean() function.

continuous quantitative variable

measured set of values ex. distance of hummingbird from my patio

Lending club uses a FICO score when it is deciding to grant a loan to applicants. The FICO score is a three-digit number on a 300-850 range and is a credit score. This number is calculated through a complicated statistical model and aims to tell lenders how likely a consumer is to repay borrowed money based on their credit history. Our data reports the FICO score in ranges of 4 (except for the top tier that goes from 845 - 850). Using the column associated with the high end of the range for FICO scores at loan origination, calculate the median FICO score for these approved loans. Do not use the summary() function. Instead, choose from the min(), max(), median(), or mean() functions. Feel free to check your answer with the summary() function, though.

median(lend$fico_range_high)

What is the median FICO score (use the column for the FICO at origination for the high end of the range) for grade E loans?

median(lend$fico_range_high[lend$grade=="E"])

Which measure of center can be used for a categorical variable?

mode

How many variables are reported in this dataset?

ncol(lend), where lend is the dataset

How many subjects' (people's) information are in this dataset?

nrow(cps), where cps is the data file

How many loan records are available in the data set?

nrow(lend), where lend is the dataset

statistic (vs. parameter)

number calculated from a sample (used to estimate the parameter)

Parameter (notes)

number used to describe/summarize a population - ex: the mode, or the percentage of people with some characteristic

A number used to describe/summarize a characteristic of the population is called a

parameter

Parameter

percent of all households in this large city that are headed by a single woman.

nominal variable

qualitative (categorical) variables that have values that cannot be ordered - ex: a UofSC undergraduate student's major - nominal means name

salary is a ________ variable

quantitative

age can be quantitative or qualitative

quantitative - average age of our customers - 24, 37, 51, 28, 24; qualitative - age group of books - child, teen, adult, senior

You are weighing subjects (pounds) for a weight loss program. The variable "weight of a person" is

quantitative and continuous

In weightlifting, weight can be added to the bar in 5 pound increments. You are recording the highest amount lifted in the deadlift for members of a gym. The variable "weight lifted" is

quantitative and discrete

time series data

results from a variable measured at regular intervals over time - sequences of data - intervals: equally spaced points in time -- ex: hourly, daily, weekly, monthly, quarterly, annual..

The coefficient of variation is also known as

risk to return

observation units (variables with R)

rows of the data set (top to bottom)

The distribution of light beer sales is

skewed right (more concentrated on the left and goes off to the right)

Suppose a dataset has a calculated skewness of -1.372. This means that the distribution is

skewed to the left

How to find a column name in R?

str(dfname) or head(dfname) and look for the column you need

How many observations are in a specific column?

sum(!is.na(dfname$colname)) or length(dfname$colname[!is.na(dfname$colname)]) #count the observations that are not NA
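
Both forms can be checked on a small made-up column. Note that length() on its own counts the NAs too.

```r
price <- c(15, NA, 30, NA, 22)  # hypothetical column with two missing values

n_obs_sum    <- sum(!is.na(price))            # count the TRUEs from !is.na()
n_obs_length <- length(price[!is.na(price)])  # drop the NAs, then count
n_total      <- length(price)                 # counts every entry, NA or not

print(n_obs_sum)
print(n_obs_length)
print(n_total)
```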

How many subjects are from the west?

sum(cps$we), where we is the column name

How many NAs are contained in the entire data set?

sum(is.na(cps))

How many of these applicants had income over $65,000 and were given a grade D loan?

sum(lend$annual_inc>65000&lend$grade=="D")

How many of these loans have as their current status that they are currently late by 16-30 days?

sum(lend$loan_status=="Late (16-30 days)")

Sample

the 200 households surveyed

In the R console, the > with the flashing cursor next to it is referred to as

the command prompt

The summary() function applied to a numerical variable will return

the five number summary and mean
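
A short sketch of summary() on a numeric vector. The values are made up, with one large value so the mean and median differ.

```r
x <- c(1, 2, 3, 4, 100)  # hypothetical data with one large value

s <- summary(x)  # minimum, quartiles, median, mean, and maximum
print(s)
print(names(s))
```

Here the large value pulls the mean above the median, which is the pattern expected for a right-skewed distribution.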

The baseline kurtosis is based on comparing the tails of a distribution to

the normal distribution

Statistics

the science of data

Z-score

to compare 2 observations from distributions with different units

coefficient of variation

to compare the variation for 2 distributions with different units

cross-sectional data (under time series data)

when a characteristic (variable) is measured on many subjects at the same time point (or time frame), the data are called cross-sectional - taken at the same time - gives a "snapshot" of the data at the given point in time

ordinal variable

when data values can be ordered, we say that the variable is ordinal - ex: length of time employed (<5 years, between 5 and 10 years, >10 years)

When is it improper to use Z-scores for comparison?

when the distributions have different shapes

If you execute a line of code in R and get the + on the next line instead of >, what does this mean?

you didn't close a parenthesis or quote

