Stats
what is the difference between a permutation and a combination
order: in a permutation the order of the items matters; in a combination it does not
on occasion coding does ranking such as
1= bachelor 2= master's 3= doctorate
Likert scale
a special case that is frequently used in survey research. usually a statement is made and the respondent is asked to indicate his or her agreement/disagreement, ex. strongly agree, somewhat agree, neither agree nor disagree.... researchers believe the intervals between the numbers are the same, ex. the distance from 1 to 2 is the same as the distance from 3 to 4
column chart
a vertical display of data p 78
difference between a sample and population
census: an examination of all items in a defined population; sample: an examination of only some items selected from the population
mean absolute deviation (MAD)
an additional measure of dispersion that reveals the average distance from the center
probability density function (PDF)
an equation that shows the height of the curve f(x) at each possible value of X; any continuous PDF must be nonnegative and the area under the entire PDF must be 1
subjective probability
based on informed opinion or judgement; example: there is a 60% chance that Toronto will bid for the 2024 Winter Olympics
trimmed mean
calculated like any other mean, except that the highest and lowest k percent of the observations in the sorted data array are removed
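A minimal Python sketch of the trimmed mean described above. The function name `trimmed_mean`, the sample data, and the choice to round the number of dropped observations down are assumptions for illustration, not part of the notes.

```python
def trimmed_mean(data, k_percent):
    """Mean after removing the lowest and highest k percent of the sorted data."""
    values = sorted(data)
    # Observations to drop from EACH end (rounded down: an assumption).
    drop = int(len(values) * k_percent / 100)
    kept = values[drop:len(values) - drop]
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # hypothetical data with one outlier
plain_mean = sum(data) / len(data)        # pulled upward by the outlier
trimmed = trimmed_mean(data, 10)          # drops 1 and 100 before averaging
```

With 10 observations and k = 10, one value is removed from each end, so the outlier no longer affects the result.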
pie charts
can only convey a general idea of the data because it is hard to assess areas precisely - the only correct way to use a pie chart is to portray data that sum to a total
if the outcome is a continuous measurement, the sample space....
cannot be listed but can be described by a rule
events that include more than one outcome from the sample space are known as
compound events
ordinal data
connote a ranking of data values ex: How often do you use Microsoft Access? 1. frequently 2. sometimes 3. rarely 4. never
data set
consists of all the values of all of the variables for all of the observations we have chosen to observe
target population
contains all of the individuals in which we are interested
empirical data
data collected through observations and experiments
left skewed
mean < median
bivariate data sets
two variables
random number
"pick" the name
what is normal?
- be measured on a continuous scale - possess a clear center - have only one peak (unimodal) - exhibit tapering tails - be symmetric about the mean (equal tails)
which of the following events are mutually exclusive? - being on time & being late for an appointment - passing a stats test & passing an english test - being of German descent & being of Mexican descent - rolling an odd # & even # on the same roll of a die
- being on time & being late for an appointment - rolling an odd number & an even # on the same roll of a die
the sum of the probability of all outcomes in the sample space is
1
place in order, from beginning to end, the steps to calculate the mean absolute deviation
1. calculate the arithmetic mean for the data set 2. find the absolute difference between each data set value and the mean 3. sum the absolute differences 4. divide by the sample (or the population) size
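The four steps above can be sketched in Python as follows. The function name and the sample data are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def mean_absolute_deviation(data):
    # Step 1: arithmetic mean of the data set.
    mean = sum(data) / len(data)
    # Step 2: absolute difference between each value and the mean.
    abs_diffs = [abs(x - mean) for x in data]
    # Step 3: sum the absolute differences.
    total = sum(abs_diffs)
    # Step 4: divide by the sample (or population) size.
    return total / len(data)


# mean = 5; absolute differences = 3, 1, 1, 3; sum = 8; MAD = 8 / 4
mad = mean_absolute_deviation([2, 4, 6, 8])
```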
the highest possible probability, of the choices below for an event is: 1.1 1.0 2.0 0.99
1.0
pareto chart
A special type of column chart used in business - it displays categorical data, with categories displayed in descending order of frequency, so that the most common categories appear first - 80/20 rule holds true for many aspects of business and the majority of the chart is usually leaning one way or the other
mutually exclusive events
Events A and B are mutually exclusive (or disjoint) if their intersection is the null set (∅), which contains no elements
match the excel normal CDF to the explanation
NORM.S.DIST --> area to the left of a z score NORM.DIST --> area to the left of a given x value
match the excel normal function to the explanation
NORM.S.INV --> Z score for a given cumulative area NORM.INV --> X value for a given cumulative area
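For comparison, the four Excel functions above have rough counterparts in Python's standard library class `statistics.NormalDist`. This is only a sketch: the mean of 100 and standard deviation of 15 are hypothetical values, and the Excel calls in the comments assume the cumulative argument is TRUE.

```python
from statistics import NormalDist

z = NormalDist()                    # standard normal: mean 0, standard deviation 1
x = NormalDist(mu=100, sigma=15)    # hypothetical mean and standard deviation

area_left_of_z = z.cdf(1.96)        # like NORM.S.DIST(1.96, TRUE)
area_left_of_x = x.cdf(115)         # like NORM.DIST(115, 100, 15, TRUE)
z_for_area = z.inv_cdf(0.975)       # like NORM.S.INV(0.975)
x_for_area = x.inv_cdf(0.5)         # like NORM.INV(0.5, 100, 15)
```

The `cdf` calls return a cumulative area for a given value, and the `inv_cdf` calls go the other way, returning the value for a given cumulative area.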
actuarial science
a career that involves estimating empirical probabilities - they help companies calculate payout rates on life insurance, pension plans, and health care plans
variable
a characteristic of the subject or individual, such as the employee's income or an invoice amount
bar chart
a horizontal display of data p. 78
systematic sampling
a method of random sampling that chooses every kth item from a sequence or list, starting from a randomly chosen entry among the first k items on the list. ex. the book shows a row of x's with every 4th x highlighted. an attraction of systematic sampling is that it can be used with unlistable or infinite populations, such as production processes or political polling.
judgment sampling
a non-random sampling method that relies on the expertise of the sampler to choose items that are representative of the population. for example, to estimate the corporate spending on research and development in the medical equipment industry, we might ask an industry expert to select several "typical" firms
focus group
a panel of individuals chosen to be representative of a wider population
observation
a single member of a collection of items that we want to study, such as a person, firm, or region ex: an employee or an invoice mailed last month
What type of data are these? a. the manufacturer of your car b. your college major c. the number of college credits you are taking
a. categorical b. categorical c. discrete numerical
cluster samples
are taken from strata consisting of geographical regions. we divide a region (say, a city) into subregions (say, blocks, subdivisions, or school districts). in one-stage cluster sampling, our sample consists of all elements in each of k randomly chosen subregions (or clusters). in two-stage cluster sampling, we first randomly select k subregions (clusters) and then choose a random sample of elements within each cluster. cluster sampling is useful when: - population frame and stratum characteristics are not readily available - it is too expensive to obtain a simple or stratified sample - the cost of obtaining data increases sharply with distance - some loss of reliability is acceptable
cumulative distribution function (CDF)
denoted as F(x), it shows the cumulative area to the left of a given value of X. it is used for probabilities, while the PDF reveals the shape of the distribution
arithmetic scale
distances on the Y axis are proportional to the magnitude of the variable being displayed
statistics can help you handle
either too much or too little information
logarithmic scale
equal distances represent equal ratios (for this reason, a log scale is sometimes called a ratio scale). when data vary over a wide range, say, by more than an order of magnitude, we might prefer a log scale for the vertical axis, to reveal more detail for small data values. a log graph reveals whether the quantity is growing at an increasing percent (convex function), constant percent (straight line), or declining percent (concave function). a log scale is useful for time series data that might grow rapidly, ex. GDP, the national debt, or your future income
empirical probability
estimated from observed outcome frequency example: there is a 2% chance of twins in a randomly chosen birth
simple random sample
every item in the population of N items has the same chance of being chosen in the sample of n items. a physical experiment to accomplish this would be to write each of the N data values on a poker chip, and then to draw n chips from a bowl after stirring it thoroughly
interval data
has meaningful intervals between scale points. examples are the Celsius or Fahrenheit scales of temperature. intervals between numbers represent *distances*
ratio data
have all of the properties of the other three data types, but in addition possess a *meaningful zero* that represents the absence of the quantity being measured. we can recode ratio measurements downward into ordinal or nominal measurements (but not conversely). zero does not have to be observable in the data
time series data
if each observation in the sample represents a different equally spaced point in time (years, months, days), we have time series data. the periodicity is the time between observations
cross sectional data
if each observation represents a different individual unit (ex. a person, firm, geographic area) at the same point in time, we have cross sectional data. in cross sectional data we are interested in *variation among observations* or in *relationships*
non-random sampling
is less scientific than random sampling but is sometimes used for expediency
sampling error
it is uncontrollable random error that is inherent in any random sample. even when using a random sampling method, it is possible that the sample will contain unusual responses. this cannot be prevented and is generally undetectable. it is not an error on your part
random sampling
items are chosen by randomization or a chance procedure. the idea of this is to produce a sample that is representative of the population
standard normal distribution
its mean is 0 and its standard deviation is 1, denoted N(0,1). the maximum height of f(z) is at z = 0 (the mean) and its points of inflection are at + or - 1 (the standard deviation). the shape of the distribution is unaffected by the z transformation
right skewed
mean > median
sampling without replacement
means that once an item has been selected to be included in the sample, it cannot be considered for the sample again
4 levels of measurement
nominal, ordinal, interval, and ratio
coverage error
occurs when some important segment of the target population is systematically missed. for example, a survey of Notre Dame University alumni will fail to represent non-college graduates or those who attended public universities
nonresponse bias
occurs when those who respond have characteristics different from those who don't respond. for example, people with caller ID, answering machines, blocked, or unlisted numbers, or cell phones are likely to be missed in telephone surveys. since these are generally more affluent individuals, their socioeconomic class may be underrepresented in the poll
numerical data
or quantitative data arise from counting, measuring something, or some kind of mathematical operation. can be broken down into 2 types: 1. discrete: a variable with a countable number of distinct values 2. continuous: a numerical variable that can have any value within an interval (this would include things like physical measurements, ex. distance, weight, speed, time, or financial variables, ex. sales, assets, inventory turns)
data
a plural noun (the singular is datum). in a data set each column is a variable (m columns) and each row is an observation (n rows), giving an n x m array
stratified sampling
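procedure where the population is divided into relatively homogeneous subgroups (strata). a random sample could be taken within each stratum, or a random sample of the whole population could be taken and the individual strata estimates then combined using appropriate weights. it can reduce cost per observation and narrow the error bounds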
if an event is getting a letter grade of A in your stats class, what is the complement of receiving an A?
receiving any grade except an A
inferential statistics
refers to generalizing from a sample to a population, estimating unknown population parameters, drawing conclusions, and making decisions
measurement error
results when the survey questions do not accurately reveal the construct being assessed
selection bias
occurs when respondents are a self-selected sample. for example, a talk show host who invites viewers to take a web survey about their sex lives will attract plenty of respondents
binary variables
some categorical variables have only two values; we call these binary variables (usually coded 1 or 0) ex: 1= female 0= male or vice versa
many statisticians feel which two displays are better than a pie chart
a table or a bar chart (pie charts do appear daily in companies, though)
continuity correction
the 0.5 that is added or subtracted when a normal distribution is used to approximate a discrete distribution is called this
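A hedged Python sketch of why the 0.5 adjustment helps. The binomial setting (n = 20, p = 0.5) and the cutoff of 10 are hypothetical values picked for illustration; they are not from the notes.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 20, 0.5                      # hypothetical binomial parameters
mu = n * p                          # binomial mean
sigma = sqrt(n * p * (1 - p))       # binomial standard deviation

# Exact binomial probability P(X <= 10).
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(11))

normal = NormalDist(mu, sigma)
with_correction = normal.cdf(10 + 0.5)   # continuity correction: use 10.5
without_correction = normal.cdf(10)      # no correction: use 10
```

In this example the corrected approximation lands much closer to the exact binomial probability than the uncorrected one.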
stacked column chart
the bar height is the sum of several subtotals pg. 79
sampling frame
the group from which we take the sample. if the frame differs from the target population, then our estimates might not be accurate. ex. names and addresses of all registered voters in Colorado Springs, Colorado
symmetric data
the mean and median are about the same
center
the middle or typical values of a distribution
mean
the most familiar statistical measure of center is the mean. it is the balancing point because it has the property that the signed deviations of the data points from the mean always sum to 0
sample space
the set of all possible outcomes
nominal measurement
the weakest level of measurement and the easiest to recognize. it identifies a category. it is common to use OTHER as the last item on the list ex.: which cell phone service provider do you use? 1. AT&T 2. Sprint 3. T-Mobile 4. Verizon 5. other
sampling with replacement
this means that the same random number could show up more than once
line chart
used to display *time series data*, to spot trends, or to compare time periods. they can be used to display several variables at once. - usually has no vertical gridlines - numerical variable is shown on the Y axis - to avoid graph clutter, numerical labels are usually omitted on a line chart
parameters and statistics
we use different symbols for each parameter and its corresponding statistic. parameter: a measurement or characteristic of the population (ex. a mean or proportion). usually unknown since we can rarely observe the entire population. usually (but not always) represented by a Greek letter (ex. π or μ). statistic: a numerical value calculated from a sample (ex. a mean or proportion). usually (but not always) represented by a Roman letter (ex. x̄ or p)
interviewer error
when the interviewer's facial expressions, tone of voice, or appearance influences the responses
one characteristic of a well-defined probability density function of a continuous random variable X is that the area under the curve, f(x), over all values of x is
equal to one
using the multiplication rule, the joint probability of event A and event B is computed by multiplying the conditional probability of event A given event B by the probability of
event B
one condition of a well defined probability density function of a continuous random variable X is that f(x) is
greater than or equal to zero for all values of X
what is normal
must be: - continuous - possess a clear center - have only one peak - exhibit tapering tails - be symmetric
dichotomous (or binary) events
two mutually exclusive, collectively exhaustive events; example: a car repair is either covered by the warranty (A) or is not covered by the warranty (A'). There can be more than two mutually exclusive, collectively exhaustive events. For example, a Walmart customer can pay by credit card (A), debit card (B), cash (C), or check (D).
the _____________ of two events, A and B, contains all of the outcomes in either A or B, or both A and B
union
empirical or relative frequency approach
use this to assign probabilities by counting the *frequency* of observed outcomes defined on the experimental sample space. example: to estimate the default rate on student loans: P(a student defaults) = f/n = # of defaults/# of loans
combination formula
used to determine the # of different ways to arrange a group of x objects from a total of n objects and the order of the objects is irrelevant
an example of a random variable that closely follows the normal distribution is
weight of newborn babies
the normal distribution is the most extensively used distribution in statistical studies because
- many physical measurements have a bell-shaped distribution - economic and financial data often display bell-shaped distribution - it has important features used in sampling and estimation
standard deviations can be compared
- only for data sets with the same measurement units and similar magnitude - only for data sets with the same measurement units
which of the following statements about the variance of a continuous variable are true?
- the standard deviation is the square root of the variance - the variance is the weighted average of the squared deviations from the mean
consider rolling two dice. which of the following describe two events that are collectively exhaustive? - event 1: a value of 9 or more. event 2: a value of 7 or less - event 1: rolling an even #. event 2: rolling an odd # - event 1: a value of 7 or more. event 2: a value of 6 or less - event 1: a value of 6 or more. event 2: a value of 8 or less
all except the first one
categorical data
also called qualitative data, have variables that are described by words rather than numbers. for ex: structural lumber can be classified by the lumber type (ex. fir, hemlock, pine), automobile styles can be classified by size (ex. full, midsize, compact, subcompact), and movies can be categorized using common movie classifications (ex. action and adventure, children and family, classics, comedy, documentary)
combination
an arrangement of r items chosen at random from n items where the order of the selected items is not important (ex. XYZ is the same as ZYX). a combination is denoted nCr
permutation
an arrangement *in a particular order* of r randomly sampled items from a group of n items, denoted nPr. in other words, how many ways can the r items be arranged from n items, treating each arrangement as different (ex. XYZ is different from ZYX)?
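A short Python sketch contrasting the two counts, using `math.perm` and `math.comb` from the standard library (Python 3.8+). The values n = 5 and r = 3 and the item labels are hypothetical.

```python
from itertools import combinations, permutations
from math import comb, perm

n, r = 5, 3                      # hypothetical: choose 3 items out of 5
n_permutations = perm(n, r)      # order matters: 5 * 4 * 3
n_combinations = comb(n, r)      # order ignored: 5 * 4 * 3 / 3!

# Cross-check by listing the arrangements explicitly.
items = "VWXYZ"
listed_permutations = list(permutations(items, r))   # XYZ and ZYX both appear
listed_combinations = list(combinations(items, r))   # XYZ appears only once
```

Each combination of 3 items corresponds to 3! = 6 permutations, which is why the permutation count is six times the combination count here.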
random experiment
an observational process whose results cannot be known in advance
event
any subset of outcomes in the sample space
which of the following statements from the empirical rule is correct?
approximately 95% of values fall within 2 standard deviations of the mean for data with bell shaped histogram
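The empirical rule percentages can be checked against the standard normal distribution. This is only an illustrative sketch using Python's `statistics.NormalDist`; the variable names are made up.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

within_1_sd = z.cdf(1) - z.cdf(-1)   # roughly 68% of values
within_2_sd = z.cdf(2) - z.cdf(-2)   # roughly 95% of values
within_3_sd = z.cdf(3) - z.cdf(-3)   # roughly 99.7% of values
```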
the CDF for a continuous random variable gives the cumulative ________ under the PDF to the left of x.
area
what is an example of a discrete random variable
binomial
union of two events
consists of all outcomes in the sample space S that are contained either in event A or in event B or in both. The union of A and B is sometimes denoted A ∪ B or "A or B" as illustrated in the Venn diagram. the symbol ∪ may be read "or" since it means that either or both events occur.
it is meaningful to compute the probability that a continuous random variable is between 2 numbers, greater than or equal to a #, or less than or equal to a number. only thing that is not meaningful is
exactly equal to a number
true or false: a continuous random experiment can have a finite set of integer values
false
fundamental rule of counting
if event A can occur in n1 ways and event B can occur in n2 ways, then events A and B can occur in n1 x n2 ways. in general, m events can occur in n1 x n2 x ... x nm ways. example: stock keeping labels: how many unique stock keeping unit (SKU) labels can a hardware store create by using two letters (ranging from AA to ZZ) followed by four digits (0 through 9)?
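The SKU question can be worked with the counting rule directly. The sketch below assumes a 26-letter alphabet and that letters and digits may repeat; the variable names are hypothetical.

```python
letters = 26          # choices for each letter position (A through Z, assumed)
digits = 10           # choices for each digit position (0 through 9)

# Two letter positions followed by four digit positions:
# 26 x 26 x 10 x 10 x 10 x 10
sku_labels = letters**2 * digits**4
```

Under those assumptions the store could create 6,760,000 unique labels.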
special law of addition
in the case of mutually exclusive events, the addition law reduces to: P(A ∪ B) = P(A) + P(B)
the ______________ of the two events A and B contains only those outcomes that are in both A and B
intersection
normal distribution
it is continuous (if you start to draw it you trace it out; there are no skips or missing pieces in the middle). it is symmetric. it has a single highest point in the middle, and that point is the mean
What is not a characteristic of the midrange?
it is robust to outliers
classical probability
known a priori by the nature of the experiment. example: there is a 50% chance of heads on a coin flip
symmetrical
mean= median
standard deviation
measures the degree of variation in the data
multivariate data sets
more than two variables
Is an imperfect analysis survey better than no survey even if only 80 people participate out of 1000?
no, causation is not shown
coding
on occasion a categorical variable might be represented using numbers, which is called coding ex. 1= cash 2= check 3= credit/debit card 4= gift card
dependent events
when P(A) differs from P(A | B). dependent events may be causally related, but statistical dependence does not prove cause and effect. It only means that knowing that event B has occurred will affect the probability that event A will occur
independent events
when knowing that event B has occurred does not affect the probability that event A will occur. in other words, event A is independent of event B if the conditional probability P(A | B) is the same as the unconditional probability P(A); that is, if the probability of event A is the same whether event B occurs or not. For example, if text messaging among high school students is independent of gender, this means that knowing whether a student is male or female does not change the probability that the student uses text messaging
event A= {1, 2, 3, 4} and event B= {2, 3, 6, 7}. A U B=
{1, 2, 3, 4, 6, 7}
If the revenue over a four year period was $2000, $2000, $3000, and $5000, what is the geometric mean revenue? round answer to a whole number
$2783
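The answer above can be checked in Python. This sketch computes the geometric mean as the nth root of the product of the n values and compares it with `statistics.geometric_mean` (Python 3.8+); the variable names are made up.

```python
from math import prod
from statistics import geometric_mean

revenue = [2000, 2000, 3000, 5000]

# nth root of the product of the n values.
by_hand = prod(revenue) ** (1 / len(revenue))

# Same calculation using the standard library.
from_library = geometric_mean(revenue)

rounded = round(from_library)
```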
If Fund A has a coefficient of variation of 1.1 and Fund B has a coefficient of variation of 0.9, which Fund has the greater relative dispersion?
A
probability of an event
a number that measures the relative likelihood that the event will occur. the probability of event A, denoted P(A), must lie within the interval from 0 to 1: 0 ≤ P(A) ≤ 1. P(A) = 0 means the event cannot occur while P(A) = 1 means the event is certain to occur. in a discrete sample space, the probabilities of all simple events must sum to 1, since it is certain that one of them will occur: P(S) = P(E1) + P(E2) + ... + P(En) = 1
simple or elementary events are....
a single outcome
random experiment
a trial or process that produces several possible outcomes that cannot be known in advance
true or false: under appropriate circumstances, many discrete random variables can be described by the normal distribution
true
univariate data sets
one variable
classical approach- what is a priori?
a priori: the process of assigning probabilities before the event is observed or the experiment is conducted - a priori probabilities are *based on logic*, not experience - when flipping a coin or rolling a pair of dice, we do not actually have to perform an experiment because the nature of the process allows us to envision the entire sample space - instead of performing the experiment, we can use deduction to determine the probability of an event - this is the classical approach to probability
descriptive statistics
refers to the collection, organization, presentation, and summary of data (either using charts and graphs or using a numerical summary)
subjective approach of probability
reflects someone's informed judgement about the likelihood of an event - used when there is no repeatable random experiment - for example, what is the probability that a new truck product program will show a return on investment of at least 10 percent? - what is the probability that the price of Ford's stock will rise within the next 30 days?
law of large numbers
says that as the number of trials increases, any empirical probability approaches its theoretical limit
standard normal distribution
shape: symmetric, mesokurtic, and bell-shaped domain: -infinity < z < + infinity mean: 0 standard deviation: 1
law of large numbers
states that as the # of trials increases, an empirical probability will approach the theoretical probability
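A small simulation sketch of the law of large numbers, assuming a fair coin with theoretical probability 0.5 of heads. The seed, the trial counts, and the function name are arbitrary choices for illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable


def relative_frequency_of_heads(trials):
    """Empirical probability of heads over the given number of simulated flips."""
    heads = sum(random.random() < 0.5 for _ in range(trials))
    return heads / trials


few_trials = relative_frequency_of_heads(10)         # can be far from 0.5
many_trials = relative_frequency_of_heads(100_000)   # should be close to 0.5
```

With only a handful of flips the empirical probability can wander well away from 0.5, while with many flips it settles near the theoretical value.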
general law of multiplication
states that the probability of the intersection of two events A and B is P(A ∩ B) = P(A | B)P(B)
general law of addition
states that the probability of the union of two events A and B is: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
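A minimal Python check of the general law of addition using sets. The single die roll and the events A (even number) and B (greater than 3) are hypothetical examples; all outcomes are assumed equally likely.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a die
A = {2, 4, 6}            # event A: an even number
B = {4, 5, 6}            # event B: a number greater than 3


def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))


union_directly = prob(A | B)                        # count outcomes in A or B
union_by_law = prob(A) + prob(B) - prob(A & B)      # subtract the double-counted overlap
```

Subtracting the joint probability removes the outcomes 4 and 6, which would otherwise be counted once in A and again in B.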
critical thinking
stats is an essential part of critical thinking because it allows us to test an idea against empirical evidence
where should tables go
tables should be embedded in the narrative (not on a separate page) near the paragraph in which they are cited
geometric mean
the appropriate measure to use when evaluating growth rates
the probability that a continuous random variable takes on a particular value is zero because why?
the area under a curve AT a certain point is zero
when comparing two data sets with different units of measurement, what is the relative measure of dispersion?
the coefficient of variation
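A hedged sketch of the coefficient of variation, taken here as the sample standard deviation divided by the mean. The two data sets (heights in cm, weights in kg) are invented for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """Sample standard deviation expressed relative to the mean (unit-free)."""
    return stdev(data) / mean(data)


heights_cm = [150, 160, 170, 180, 190]   # hypothetical data in centimeters
weights_kg = [50, 60, 70, 80, 90]        # hypothetical data in kilograms

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
```

Both data sets have the same standard deviation, but the weights have the larger coefficient of variation because their mean is smaller, so they show greater relative dispersion.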
complement
the complement is the A' on the outside of the circle (consists of everything in the sample space S except event A). the probability of the complement of A is found by subtracting the probability of A from 1: P(A') = 1 - P(A)
Event A is independent of event B if
the conditional probability P(A | B) is the same as the marginal probability P(A)
the function used to find the area under the f(X) of a continuous random variable X up to any value x is called
the cumulative distribution function or CDF
intersection of two events
the event consisting of all outcomes in the sample space S that are contained in both event A and event B. the intersection of A and B is denoted A ∩ B or "A and B" as illustrated in the Venn diagram. the probability of A ∩ B is called the *joint probability* and is denoted P(A ∩ B)
which of the following are examples of conditional probabilities? - if Neil has already purchased groceries, then the probability of Colleen purchasing groceries - the probability of Angel going to the movie, given that Derrick is going to the movie - the probability of Amir purchasing a video game or the probability of Natasha purchasing a video game - the probability of Marilyn going to the football game and Tom going to the football game
the first two
which of the following statements is true? - two data sets could have different means but the same standard deviation - two data sets could have the same mean but different standard deviations - if two data sets have the same mean they must have the same standard deviation - if two data sets have different means they must have different standard deviations
the first two
mode
the measure of center that identifies the most frequently occurring value in the data set
median
the measure of center where half of the data set lie above this measure and half the data set lie below the measure
post hoc fallacy
the mistaken conclusion that if A precedes B then A is the cause of B
n factorial
the number of ways that n items can be arranged in a particular order n factorial is the product of all integers from 1 to n
conditional probability
the probability of an event given that another event has already occurred
conditional probability
the probability of event A given that event B has occurred, denoted P(A | B). the vertical line "|" is read as "given"
the normal distribution is asymptotic in the sense that
the tails get closer and closer to the horizontal axis, but never touch it
the addition rule is used to calculate
the union of two events
events are collectively exhaustive if
their union is the entire sample space S
factorials
they are useful for counting the possible arrangements of any n items: there are n ways to choose the first, n-1 ways to choose the second, and so on. ex. a home appliance service truck must make 3 stops (A, B, C). in how many ways could the three stops be arranged? 3! = 3 x 2 x 1 = 6
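The service truck example can be reproduced in Python by computing 3! and by listing every ordering of the stops. The variable names are made up for illustration.

```python
from itertools import permutations
from math import factorial

stops = ["A", "B", "C"]

# 3! = 3 x 2 x 1
number_of_routes = factorial(len(stops))

# List each possible ordering of the three stops explicitly.
routes = ["".join(order) for order in permutations(stops)]
```

The listed routes are ABC, ACB, BAC, BCA, CAB, and CBA, matching the count from the factorial.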