Exam 1 - Econ Stats
empirical rule
for symmetrical, bell-shaped frequency distribution, about 68% of observations will lie within +/- 1 standard deviation of mean; about 95% within +/- 2 standard deviations of mean; practically all (99.7%) within +/- 3 standard deviations of mean
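A quick check of these percentages for a normal curve, sketched with the standard-library `statistics.NormalDist` (the helper name `share_within` is made up for illustration):

```python
from statistics import NormalDist


def share_within(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    # Roughly 68%, 95%, and 99.7% for k = 1, 2, 3
    print(k, round(share_within(k), 4))
```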
ratio
named, natural order, equal interval btwn values, has "true zero" - zero means absence --> ratios can be calculated (ex: height)
experiment
observation of some activity or act of taking some measurement
independent
occurrence of one event does not affect occurrence of another event; all three equivalent statements hold: P(A|B) = P(A); P(B|A) = P(B); P(A and B) = P(A)P(B)
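One way to see the three statements agree is to enumerate a small experiment. The sketch below assumes two fair dice, with A = "first die is even" and B = "sum is 7" (both events chosen only for illustration):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))


def prob(event) -> Fraction:
    """Classical probability: favorable outcomes / total outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def A(o):  # first die is even
    return o[0] % 2 == 0


def B(o):  # sum of the dice is 7
    return sum(o) == 7


p_a = prob(A)
p_b = prob(B)
p_ab = prob(lambda o: A(o) and B(o))
p_a_given_b = p_ab / p_b  # P(A|B) = P(A and B) / P(B)
p_b_given_a = p_ab / p_a  # P(B|A) = P(A and B) / P(A)

print(p_a_given_b == p_a, p_b_given_a == p_b, p_ab == p_a * p_b)
```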
combination
order is *not important*
outcome
particular result of experiment
measures of location
pinpoint center of distribution of data. only describes center of data, does not tell us anything about *how well* data is concentrated around the center (that is measure of dispersion), we need to consider both (there are 5: arithmetic mean, weighted mean, geometric mean, median, mode)
sample
portion, or part, of population of interest. used to obtain reliable estimates of population parameters
histogram
presents numerical data. no gaps btwn bars, represents frequency distribution of continuous variables. based on quantitative data. useful for large data sets
P(A)
probability of A
P(A and B)
probability of A and B
P(A|B)
probability of A given B has happened
P(A or B)
probability of A or B
P(~A)
probability of not A
discrete
quantitative variable that can assume only certain clearly separated values; there are "gaps" between values (ex: number of children)
continuous
quantitative variable that can take infinite number of values within particular range (ex: outside temp, air pressure in tire)
correlation coefficient
r, used to measure direction and strength of linear relationship between two variables (ranges from -1 to 1)
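A minimal sketch of computing r by hand, assuming two equal-length lists of numbers (the function name `pearson_r` and the data are illustrative only):

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Correlation coefficient r for paired observations xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
perfect_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises with x
perfect_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises
print(perfect_positive, perfect_negative)
```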
coefficient of variation (CV)
ratio of *population* standard deviation σ to population mean μ
coefficient of variation (cv)
ratio of *sample* standard deviation s to sample mean x̅
qualitative variable
recorded as nonnumeric characteristic (ex: eye color)
general rule of multiplication
refers to events that are *not* independent: P(A and B) = P(A)*P(B|A)
special rule of multiplication
refers to events that are independent P(A and B) = P(A)*P(B)
rule of addition
refers to probability that any of two or more events can occur (special rule, general rule, complement rule)
statistics
science of collecting, organizing, presenting, analyzing, interpreting data to assist in making more effective decisions (two types: descriptive, inferential)
coefficient of skewness
measures asymmetry (shape) of data's distribution (four shapes: symmetric, positively skewed, negatively skewed, bimodal)
box plot
shows general shape of variable's distribution (based on 5 descriptive statistics: max and min, first and third quartiles, and median)
frequency polygon
shows shape of distribution. consists of line segments formed by intersections of class *midpoints* and class frequencies
sample standard deviation
square root of sample variance
bivariate
studying relationship btwn two variables
sample variance
s², sum of squared deviations from sample mean (x̅) divided by (n-1), where n is sample size; dividing by (n-1) instead of n is to not underestimate population variance
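A small sketch comparing the (n-1) divisor with the N divisor on made-up data; `sample_variance` is a hand-rolled helper, while `variance` and `pvariance` are the standard-library equivalents:

```python
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


s2 = sample_variance(data)    # divides by n - 1
sigma2 = pvariance(data)      # divides by N, treating data as a whole population
print(s2, variance(data), sigma2)
```

The sample version comes out larger than the population version on the same numbers, which is the point of dividing by (n-1).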
stem-and-leaf plot
technique used to display info in condensed form while providing more info than frequency distribution (get identity of each value, can see distribution) two parts: stem (leading digit, vertical axis); leaves (trailing digit, horizontal axis)
complement rule
to determine probability of event happening by subtracting probability of event not happening from 1 P(A) = 1 - P(~A)
Chebyshev's theorem and empirical rule
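two results that allow us to characterize data dispersion around mean (Chebyshev's theorem holds for any set of observations; empirical rule only for symmetrical, bell-shaped distributions)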
hypergeometric distribution
used for problems with fixed n where probability of success changes from trial to trial because sampling is without replacement from finite population. used when sample is not small compared to population; if population is large relative to sample, binomial is easier and gives good approximation (ex: 30 people apply for two jobs. what is probability both positions are filled by women?)
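A sketch of the two-jobs example. The card does not say how many of the 30 applicants are women, so the figure of 12 below is an assumption made purely for illustration, as is the helper name `hypergeom_pmf`:

```python
from math import comb


def hypergeom_pmf(k, N, S, n):
    """P(exactly k successes in n draws without replacement),
    from a population of N items containing S successes."""
    return comb(S, k) * comb(N - S, n - k) / comb(N, n)


N, S, n = 30, 12, 2  # 30 applicants, ASSUMED 12 women, 2 positions
p_both_women = hypergeom_pmf(2, N, S, n)

# Binomial value with constant p = S/N, for comparison only
p_binomial = (S / N) ** 2
print(round(p_both_women, 4), round(p_binomial, 4))
```

With such a small population the two answers differ noticeably, which is why the hypergeometric is the better fit here.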
binomial distribution
used for problems with fixed number of trials n and known p (prob of success is constant from trial to trial, as when sampling with replacement; trials are independent). used when you know *exact* probability of event happening and want to find probability of that event happening k times out of n (ex: number of defects in box of 1,000 factory-produced widgets)
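A minimal sketch of the binomial formula P(k) = C(n, k) p^k (1-p)^(n-k). The defect rate and box size below are hypothetical numbers, not taken from the card:

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.05  # ASSUMED: 10 widgets inspected, 5% defect rate
p_no_defects = binomial_pmf(0, n, p)
p_one_defect = binomial_pmf(1, n, p)
print(round(p_no_defects, 4), round(p_one_defect, 4))
```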
continuous probability distribution
any value in interval (3 types: uniform, normal, exponential)
rule of multiplication
applied when two or more events occur simultaneously (special, general)
population and sample mean are examples of...
arithmetic mean
permutation
arrangement in which order of objects selected from specific pool of objects *is important*
geometric mean
used for finding avg of percentages, ratios, indexes, or growth rates over time. rate of increase (avg percentage change over period) = nth root of (value at end of period / value at start of period) - 1
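A sketch showing that the endpoint formula and the geometric mean of the per-period growth factors give the same average rate. The values and the helper name `avg_growth_rate` are made up for illustration:

```python
from statistics import geometric_mean


def avg_growth_rate(start, end, periods):
    """Average rate of increase per period: nth root of (end / start), minus 1."""
    return (end / start) ** (1 / periods) - 1


# Hypothetical growth factors for three periods: +50%, 0%, +100%
factors = [1.5, 1.0, 2.0]

# A value of 100 grows to 100 * 1.5 * 1.0 * 2.0 = 300 over the three periods
rate_from_endpoints = avg_growth_rate(100, 300, 3)
rate_from_factors = geometric_mean(factors) - 1
print(round(rate_from_endpoints, 4), round(rate_from_factors, 4))
```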
Chebyshev's theorem
for any set of observations (sample or population), proportion of values that lie within k standard deviations of mean is at least 1 - 1/k^2 (k is any value greater than 1)
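The bound is easy to tabulate. A minimal sketch (the function name `chebyshev_bound` is illustrative):

```python
def chebyshev_bound(k: float) -> float:
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


for k in (1.5, 2, 3):
    print(k, round(chebyshev_bound(k), 4))
```

For k = 2 the bound is 75%, a weaker guarantee than the empirical rule gives for bell-shaped data, because Chebyshev's theorem makes no assumption about shape.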
interquartile range
IQR = Q3 - Q1
arithmetic mean
1. data must be measured at interval/ratio level 2. influenced by extremely high and low values 3. all values included when computing 4. there is only one 5. sum of deviations of each value from mean is 0
subjective probability
based on whatever subjective info is available, relies on individual knowledge and assessment
harmonic mean
calculate avg value when value involves rates (value/unit) or ratios (index) (ex: speed in km/hr or price-earnings ratio)
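A sketch of the speed case, assuming a trip driven at 40 km/hr one way and 60 km/hr back over the same distance (numbers chosen for illustration):

```python
from statistics import harmonic_mean, mean

speeds = [40, 60]  # km/hr over two legs of EQUAL distance (assumed)

avg_speed = harmonic_mean(speeds)  # correct average speed for the whole trip
naive_avg = mean(speeds)           # arithmetic mean overstates it
print(avg_speed, naive_avg)
```

The harmonic mean is the lower of the two because more time is spent on the slower leg.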
quantitative variable
can be recorded numerically (ex: number of children in fam, outside temp, balance in checkings) (two types: discrete & continuous)
measures of dispersion
capture variation or spread in data. two distributions can have same average but different spreads (range, variance, coefficient of variation, chebyshev's theorem & empirical rule)
relative class frequencies
captures relationship between class frequency and total number of observations (fraction)
variable
characteristic of statistical unit being observed that may assume more than one of a set of values, to which a numerical measure or a category from a classification can be assigned (2 types: qualitative & quantitative)
discrete probability distribution
characterized by all values x and associated probabilities (3 types: binomial, hypergeometric, Poisson) 1. sum of all probabilities is 1 2. probability of particular outcome is [0,1] 3. outcomes are mutually exclusive
normal distribution
characterized by mean (mu) and variance; useful for determining probabilities for any normally distributed random variable; find z value for particular value x of random variable based on mean and standard deviation of distribution
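A sketch of the z-value step, z = (x - μ) / σ, followed by a lookup on the standard normal curve. The mean and standard deviation below are hypothetical:

```python
from statistics import NormalDist


def z_value(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


mu, sigma = 100, 15  # ASSUMED distribution parameters
z = z_value(115, mu, sigma)

# P(X <= 115) read from the standard normal distribution
p_below = NormalDist().cdf(z)
print(z, round(p_below, 4))
```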
pie chart
chart that shows proportion/percentage that each class represents of total number of frequencies. shows qualitative info
3 approaches to computing probabilities
classical, empirical, subjective
levels of measurement
classify data according to levels. level determines type of statistical analysis we can perform on data. 4 levels: nominal, ordinal, interval, ratio
event
collection of one or more outcomes of an experiment
weighted mean
compute arithmetic mean when we have several observations of same value
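A minimal sketch, assuming each distinct value comes with a count of how often it occurs (the prices, counts, and function name `weighted_mean` are made up):

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by sum of weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


prices = [1.0, 2.0, 3.0]  # hypothetical distinct values
counts = [3, 1, 1]        # how many times each value was observed

result = weighted_mean(prices, counts)

# Same answer as the plain arithmetic mean of the expanded list
expanded = [1.0, 1.0, 1.0, 2.0, 3.0]
print(result, sum(expanded) / len(expanded))
```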
contingency table
cross-tabulation that simultaneously summarizes two variables of interest (enables classification of data according to 2 identifiable characteristics)
mean
describes central value of data (5 types: population, sample, geometric, weighted, harmonic)
measures of position
describes spread of data by determining position of values that divide observations into equal parts (quantiles)
range
difference btwn max and min values in data set (only considers max & min --> leaves out info)
dot plot
displays dot for each observation along horizontal number line indicating possible values of data; shows shape of distribution, value about which data tend to cluster, and largest and smallest observations; helpful for smaller data sets (when we organize data into classes w histogram, we lose exact value of observs) *if identical observs or observs are too close to be shown, dots are stacked on top of each other
population
entire set of individuals or objects of interest / measurements obtained from all individuals or objects of interest
bar chart
graph that shows qualitative classes on horizontal axis and class frequencies on vertical axis. class frequencies are proportional to heights of bars. most common graphic form to present qualitative variable. presents categorical data
scatter diagram
graphical technique used to show relationship between two variables measured with interval or ratio scales. one variable on vertical axis and other on horizontal (bivariate)
frequency table
grouping of qualitative data into mutually exclusive and collectively exhaustive classes showing number of observations in each class
frequency distribution
grouping of quantitative data into mutually exclusive and collectively exhaustive classes showing number of observations in each class (decide on number of classes, determine class interval, set individual class limits, tally observations into classes and determine number of observations in each class)
collectively exhaustive
if at least one of events must occur when experiment is conducted; if events are also mutually exclusive, sum of their probabilities is equal to 1
mutually exclusive
if one event happens, the other cannot
multiplication rule
if there are m ways one event can happen and n ways another event can happen, then there are mn ways that two events can happen
conditional probability
likelihood that event will happen, given that another event has already happened
joint probability
likelihood that two or more events will happen at same time
negatively skewed
mean < median and mode
positively skewed
mean > median and mode
parameter
measurable characteristic of population (we rely on sample data to learn about population parameter)
statistic
measurable characteristic of sample (sample mean = best estimate of population mean)
variance
measures mean of squared deviations of values in population, or sample, from mean (two types: population & sample)
symmetric
median = mode = mean
descriptive statistics
methods of organizing, summarizing, presenting data in informative way (data collection, data presentation, summarizing data - surveys, graphs, tables)
inferential statistics
methods used to estimate a property (mean, proportion, etc) of a population on basis of sample (estimation and hypothesis testing). limited set of data
median
midpoint of values after they have been ordered from min to max values (2 midpoints--> find mean of two numbers) 1. unique for each data set 2. not affected by extremely large or small values (measure of location when such values do occur) 3. can be computed on ordinal, interval, and ratio level
uniform distribution
models events that are equally likely to occur within given range/interval; characterized by min value a and max value b, with constant density 1/(b-a) over that range, so P(c <= X <= d) = (d-c)/(b-a); rectangular in shape and symmetric
exponential distribution
models time btwn occurrences of event in sequence; actions occur independently at constant rate per unit of time/length; nonnegative, positively skewed, declines steadily to right, asymptotic
nominal
named (ex: eye color)
ordinal
named, natural order (ex: level of satisfaction)
interval
named, natural order, equal interval btwn values, no true zero (ex: temperature)
Poisson distribution
used when number of trials n is unknown or very large (potentially infinite) and p for each trial is unknown or very small, but *mean* number of occurrences per interval is known; occurrences are independent. used to find probability of k events happening in that interval (ex: number of innocent people convicted of a crime)
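A minimal sketch of the Poisson formula P(k) = μ^k e^(-μ) / k!, where μ is the mean number of occurrences per interval. The value μ = 2 is a made-up example:

```python
from math import exp, factorial


def poisson_pmf(k, mu):
    """P(exactly k occurrences) when the mean number of occurrences is mu."""
    return mu**k * exp(-mu) / factorial(k)


mu = 2  # ASSUMED mean number of occurrences per interval
p_zero = poisson_pmf(0, mu)
p_at_most_two = sum(poisson_pmf(k, mu) for k in range(3))
print(round(p_zero, 4), round(p_at_most_two, 4))
```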
special rule of addition
used when events are mutually exclusive P(A or B) = P(A) + P(B)
general rule of addition
used when events are not mutually exclusive P(A or B) = P(A) + P(B) - P(A and B)
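A sketch of the general rule on a standard 52-card deck, with A = "card is a king" and B = "card is a heart" (events chosen for illustration). The overlap, the king of hearts, would be counted twice without the subtraction:

```python
from fractions import Fraction
from itertools import product

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs


def prob(event) -> Fraction:
    """Classical probability over the equally likely cards in the deck."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


p_king = prob(lambda c: c[0] == "K")
p_heart = prob(lambda c: c[1] == "hearts")
p_both = prob(lambda c: c[0] == "K" and c[1] == "hearts")

# General rule of addition: P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_king + p_heart - p_both
print(p_either)
```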
three counting rules
useful in determining number of outcomes in experiment (multiplication rule, permutation, combination)
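The three rules side by side, sketched with `math.perm` and `math.comb` (Python 3.8+). The counts of shirts, pants, and objects are hypothetical:

```python
from math import comb, perm

# Multiplication rule: m ways for one event, n ways for another -> m * n
shirts, pants = 3, 4
outfits = shirts * pants

# Permutation: choose 3 of 5 objects when order IS important
ordered = perm(5, 3)

# Combination: choose 3 of 5 objects when order is NOT important
unordered = comb(5, 3)

print(outfits, ordered, unordered)
```

Each combination of 3 objects can be arranged in 3! = 6 orders, which is why the permutation count is 6 times the combination count.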
probability
value btwn 0 and 1 inclusive that represents likelihood a particular event will happen
mode
value of observation that appears most frequently 1. not always unique mode for each data set (can have multiple) 2. not affected by extremely large or small values (measure of location when such values do occur) 3. can be computed for nominal, ordinal, interval, ratio levels
outlier
value that is more than 1.5x IQR smaller than Q1 or larger than Q3
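A sketch of the 1.5 x IQR check. Textbooks and software locate quartiles in slightly different ways; this version assumes the "inclusive" method of the standard-library `statistics.quantiles`, so Q1 and Q3 may differ a little from a hand calculation. The data and the name `find_outliers` are made up:

```python
from statistics import quantiles


def find_outliers(data):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical observations
print(find_outliers(data))
```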
empirical probability
when number of times event happens is divided by number of observations
classical probability
when there are n equally likely outcomes to an experiment; probability of event = number of favorable outcomes / total number of possible outcomes
sample mean
x̅, sum of all values of x in sample divided by number of values in sample n
population mean
μ, sum of all values of x in population divided by number of values in population N
population standard deviation
σ, square root of population variance; larger standard deviation = more spread around mean
population variance
σ², arithmetic mean of squared deviations from mean (μ) when population size is N