Business Analytics (Evans text) Quiz 2
correlation coefficient formula
(Cov of A and B) / [( STD of A) x (STD of B) ] Aka Pearson Product Moment correlation coefficient
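A minimal sketch of the formula above in Python, assuming sample (n-1) versions of the covariance and standard deviations. The function name and the data are made up for illustration.

```python
from math import sqrt

def pearson_r(a, b):
    """Correlation = cov(a, b) / (std(a) * std(b)), using n - 1 divisors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Covariance: average product of paired deviations from the means.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
    std_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return cov / (std_a * std_b)

# Hypothetical data: hours studied vs. quiz score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson_r(hours, scores)  # close to +1: strong positive linear relationship
```

Because the same divisor appears in the numerator and the denominator, the population (n) versions give the same r.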
Triangular Distribution
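A continuous distribution defined by 3 values: the minimum, the most likely, and the maximum. Often used to model pessimistic, most likely, and optimistic scenarios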
ogive
A chart that displays the cumulative relative frequency
event
A collection of one or more outcomes from a sample space
Exponential Distribution
A continuous distribution that models the time between randomly occurring events (p 203). EXAMPLE - hits on a news subject (the cumulative curve grows fast at first, then levels out). In probability theory and statistics, the exponential distribution is the probability distribution of the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate.
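A small sketch of the exponential cumulative probability, 1 - e^(-rate * x), which is the curve that rises fast and then levels out. The rate of 2 events per hour is an assumed value, not from the text.

```python
from math import exp

def exponential_cdf(x, rate):
    """P(time until the next event <= x) when events occur at `rate` per unit time."""
    if x < 0:
        return 0.0
    return 1.0 - exp(-rate * x)

# Hypothetical: on average 2 events per hour.
rate = 2.0
p_within_half_hour = exponential_cdf(0.5, rate)  # equals 1 - e^-1
mean_wait = 1 / rate                             # mean time between events is 1 / rate
```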
contingency table aka cross-tabulation
A data matrix that displays the frequency of some combination of possible responses to multiple variables; cross tabulation results
cumulative distribution function
A function giving the probability that a random variable is less than or equal to a specified value.
Unimodal
A histogram with one peak (mode). A bell curve is unimodal but a skewed curve could be too.
Data profile (fractile)
A measure of dividing data into sets
coefficient of variation
A measure of relative variability, i.e. relative to the average, instead of just using Std Dev, which you can't compare as easily across populations. Computed as (Std dev / mean), often x 100 to express it as a percentage. Its reciprocal (mean / Std dev) is "return to risk".
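A quick sketch showing why the coefficient of variation compares more easily across populations than the standard deviation alone. The two "funds" are invented data, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Std dev relative to the mean (multiply by 100 for a percentage)."""
    return stdev(values) / mean(values)

def return_to_risk(values):
    """Reciprocal of the coefficient of variation."""
    return mean(values) / stdev(values)

# Hypothetical returns on two very different scales.
fund_a = [8.0, 10.0, 12.0]     # std dev 2, mean 10
fund_b = [80.0, 100.0, 120.0]  # std dev 20, mean 100

cv_a = coefficient_of_variation(fund_a)
cv_b = coefficient_of_variation(fund_b)
```

The standard deviations differ by a factor of 10, but both coefficients of variation come out the same, so the relative variability is equal.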
Coefficient of skewness
A measure of the degree of asymmetry of observations around the mean (p 144 for the calculation). If the number is positive, the distribution is positively skewed (long tail to the right). If the number is negative, the distribution is negatively skewed (long tail to the left). If it's between -0.5 and 0.5, the skew is low and there is relative symmetry.
covariance
A measure of the linear relationship between two variables, x and y. It DOES depend on the units of measurement. Correlation is easier to use because it doesn't depend on the unit of measurement and always falls between -1 and 1.
Correlation
A measure of the linear relationship between two variables, x and y, which (unlike covariance) does not depend on the units of measurement. The value of a correlation coefficient ranges between -1 and 1. You could have a very strong relationship and still get a very low corr. coeff. because the relationship is not LINEAR.
interval estimate
A method that provides a range for a population characteristic based on a sample
estimation
A method used to assess the value of an unknown population parameter such as a population mean, population proportion, or population variance using sample data.
continuous metric
A metric that is based on a continuous scale of measurement. There are no jumps in the data.
Judgment sampling
A nonprobability method of sampling whereby elements are selected for the sample based on the judgment of the person doing the study.
standard normal distribution
A normal distribution with a mean of 0 and a standard deviation of 1. (z scores)
probability density function
A probability distribution is a list of outcomes and their associated probabilities. ... A function that represents a discrete probability distribution is called a probability mass function. A function that represents a continuous probability distribution is called a probability density function. It's like a histogram on steroids.
frequency distribution
A table that shows the number of observations in each of several non-overlapping groups. p 147 ...When data are summarized in a frequency distribution we can use the frequencies to compute the mean and variance
cumulative relative frequency distribution
A tabular summary of cumulative relative frequencies
Outlier
A value much greater or much less than the others in a data set
Discrete Uniform Distribution
A variation of the uniform distribution for which the random variable is restricted to integer values between a and b (also integers) A good example of a discrete uniform distribution would be the possible outcomes of rolling a 6-sided die. The possible values would be 1, 2, 3, 4, 5, or 6. In this case, each of the six numbers has an equal chance of appearing.
empirical probability distribution
An approximation of the probability distribution of the associated random variable. ... the ratio of the number of outcomes in which a specified event occurs to the total number of trials, not in a theoretical sample space but in an actual experiment.
Sample
a subset of the population. In ALL statistical analysis we study the characteristics of the SAMPLE so that we can state something useful about the POPULATION (which is usually too large to study). Statistics of a sample don't use greek letters like mu, sigma, and pi (used for the population average, std dev, and proportion); instead we use x-bar and s.
relative frequency distribution
a table that presents the relative frequency of each category
metric
a unit of measurement that provides a way to objectively quantify performance
"kth" percentile
a value at or below which k percent of the observations lie
proportion
fraction of data having a certain characteristic: p = x/n. Proportions are always between zero and one. Key descriptive statistic for categorical data.
Coefficient of Kurtosis (CK)
measures the degree of kurtosis of a population (kurtosis is the "peakedness" or flatness of a histogram). If the CK is less than three, the distribution is relatively flat with a wide degree of dispersion. If the CK is greater than three, the distribution is peaked and has less dispersion. Note that Excel subtracts three, so in Excel a value less than zero means relatively flat and a value greater than zero means peaked (pg 145 for the formula).
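A sketch of the coefficients of skewness and kurtosis, assuming the usual population moment formulas (average cubed or fourth-power deviation divided by sigma cubed or sigma to the fourth). Check these against the formulas on the cited pages. The data sets are invented.

```python
from statistics import fmean, pstdev

def coefficient_of_skewness(values):
    """Average cubed deviation from the mean, divided by sigma^3."""
    mu, sigma = fmean(values), pstdev(values)
    return fmean([(x - mu) ** 3 for x in values]) / sigma ** 3

def coefficient_of_kurtosis(values):
    """Average fourth-power deviation from the mean, divided by sigma^4."""
    mu, sigma = fmean(values), pstdev(values)
    return fmean([(x - mu) ** 4 for x in values]) / sigma ** 4

symmetric = [1, 2, 3, 4, 5]        # no skew, flatter than a bell curve
right_tailed = [1, 1, 2, 2, 3, 10] # one large value pulls the tail to the right

cs_symmetric = coefficient_of_skewness(symmetric)
cs_right = coefficient_of_skewness(right_tailed)
ck_symmetric = coefficient_of_kurtosis(symmetric)
```

Here `cs_symmetric` is zero, `cs_right` is positive, and `ck_symmetric` is below three (relatively flat).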
descriptive statistics
measures used to describe and summarize data using tabular, visual, and quantitative techniques
estimators
measures used to estimate population parameters
Ratio Data
data that have a natural zero, so ratios are meaningful, eg Length, Width, Height. Usually continuous, but ratio data can also be discrete (eg counts).
degrees of freedom
n-1. The number of scores that can vary in the calculation of a statistic. Video: https://www.youtube.com/watch?v=VIlVWeUQ0vs For categorical values with k categories, df = k-1, and n-k is the df for the errors. The general formula for degrees of freedom in ANY field of math is d.f. = # of variables - # of constraints.
measure
numerical value associated with a metric
Discrete vs. Continuous distributions
one is a histogram, the other is a curve. How many countries have you been to? (answers can be counted) vs. how much do you weigh? (answers can be any value in a range) VIDEO https://www.youtube.com/watch?v=bPFNxD3Yg6U
Dispersion
the degree of variation in the data, spread
sampling error
the difference between a sample result and the true population value (which is also why the results of random samples taken at the same time differ from each other). There will always be sampling error. LARGER sample sizes have less sampling error.
relative frequency
the fraction or percent of the time that an event occurs
degrees of freedom (df)
the number of independent pieces of information remaining after estimating one or more parameters (typically n-1). If there are a lot of df, there are many possible lines. video: https://www.youtube.com/watch?v=Cm0vFoGVMB8
Marginal Probability
the probability of a single event without consideration of any other event
Joint Probability
the probability of the intersection of two events
Conditional Probability
the probability that one event happens given that another event is already known to have happened
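A sketch tying marginal, joint, and conditional probability together with a small contingency table. The counts and the function names are hypothetical.

```python
# Hypothetical counts: (gender, preferred brand) -> number of respondents.
counts = {
    ("female", "brand1"): 30, ("female", "brand2"): 20,
    ("male", "brand1"): 10, ("male", "brand2"): 40,
}
total = sum(counts.values())

def joint(gender, brand):
    """P(gender AND brand): the intersection of two events."""
    return counts[(gender, brand)] / total

def marginal_gender(gender):
    """P(gender), ignoring brand: sum across the row of the table."""
    return sum(n for (g, _), n in counts.items() if g == gender) / total

def conditional_brand_given_gender(brand, gender):
    """P(brand | gender) = P(gender AND brand) / P(gender)."""
    return joint(gender, brand) / marginal_gender(gender)

p_joint = joint("female", "brand1")                          # 30 / 100
p_marginal = marginal_gender("female")                       # 50 / 100
p_cond = conditional_brand_given_gender("brand1", "female")  # 30 / 50
```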
confidence interval
the range of values within which a population parameter is estimated to lie, along with a probability that the interval correctly estimates the true (unknown) population parameter. NOT Correct - "there is a 90% probability that the true population mean is within the interval". CORRECT - "there is a 90% probability that any given confidence interval from a random sample will contain the true population mean".
Population
the set of all elements of interest in a particular study; parameters of a population use greek letters like mu and sigma for average and std dev, unlike samples, where instead we use x-bar and s
sample space
the set of all possible outcomes of an experiment
standard deviation
the square root of the variance; the units of measure are the same as the units of the data, so it is easier to interpret than variance is
cumulative relative frequency
the sum of previous relative frequencies up through, and including, the category of interest eg 20% of customers account for 80% of total sales
rules for normal distribution
1. skew is zero 2. mean=median=mode 3. x has no bounds, tails go to infinity 4. Nearly all data (about 99.7%) fall within +/-3SD of the mean; the actual % in a data set may be higher or lower. Under the Empirical Rule, 68% of the data falls within one standard deviation, 95% within two standard deviations, and 99.7% within three standard deviations of the mean. The Empirical Rule is an approximation that applies only to data sets with a bell-shaped relative frequency histogram. ... Chebyshev's Theorem is a fact that applies to all possible data sets. It describes the minimum proportion of the measurements that must lie within one, two, or more standard deviations of the mean.
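The 68 / 95 / 99.7 figures can be checked against the standard normal curve. A minimal sketch using Python's built-in `statistics.NormalDist`; the helper name `share_within` is made up.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within(1)  # about 0.68
within_2 = share_within(2)  # about 0.95
within_3 = share_within(3)  # about 0.997
```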
what affects the WIDTH (or range) of the confidence interval
1. the size of the sample 2. the amount of variation in the population from which we drew our sample 3. the LEVEL of confidence, ie 90%? 95%? Bigger samples and less variation mean our confidence INTERVAL can be narrower; a higher confidence level makes it wider.
Ordinal data example
data that can be ranked according to some relationship, eg 1=tallest, 2=next tallest, 3=third tallest
Types of Discrete Probability Distributions (histogram)
Bernoulli, binomial, Poisson
Chi-square test
Chi-square test for independence: tests whether two categorical variables are independent, eg whether the same proportion of girls and boys prefer brand 1 over brand 2. H0 - they are independent. H1 - they are dependent. Specifically, it tests for the equality of two frequencies or proportions.
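A sketch of how the chi-square statistic for independence is computed from observed counts: expected count = (row total x column total) / grand total, then sum (observed - expected)^2 / expected. The 2x2 table is invented, and the function names are illustrative.

```python
def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

def degrees_of_freedom(table):
    """(rows - 1) x (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

# Hypothetical: rows = girls, boys; columns = prefer brand 1, prefer brand 2.
observed = [[30, 20], [10, 40]]
stat = chi_square_statistic(observed)
df = degrees_of_freedom(observed)
```

With df = 1 the 5% critical value is about 3.84, so a statistic this large (about 16.7) would lead to rejecting H0 of independence for these made-up counts.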
Interval Data
Data that are ordinal but have constant differences between observations and have arbitrary zero points, eg temperature in degrees Fahrenheit or Celsius. For contrast, examples of ORDINAL (not interval) variables include: socio economic status ("low income","middle income","high income"), education level ("high school","BS","MS","PhD"), income level ("less than 50K", "50K-100K", "over 100K"), satisfaction rating ("extremely dislike", "dislike", "neutral", "like", "extremely like")
categorical data
Data that consist of names, labels, or other nonnumerical values. Hierarchical - categorical data cannot be converted into ratio data (p. 118)
mutually exclusive
Events with no outcomes in common.
Chebyshev's formula
For any set of data, the proportion of values that are within "k" standard deviations of the mean is at least 1 - (1/k^2), so at least 75% of data are within +/- 2 std dev of the mean and at least 89% of data are within +/- 3 std dev of the mean. The Empirical Rule is an approximation that applies only to data sets with a bell-shaped relative frequency histogram. ... Chebyshev's Theorem is a fact that applies to all possible data sets. It describes the minimum proportion of the measurements that must lie within one, two, or more standard deviations of the mean.
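The bound 1 - 1/k^2 is a one-liner. A sketch with a made-up function name:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of ANY data set within k std devs of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

at_least_2sd = chebyshev_lower_bound(2)  # 0.75
at_least_3sd = chebyshev_lower_bound(3)  # 8/9, about 0.89
```

Compare with the Empirical Rule's 95% and 99.7%: Chebyshev is a guaranteed floor for any shape, so it is much more conservative than what a bell-shaped data set actually delivers.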
cluster sampling example
Instead of drawing a random sample of individuals, cluster sampling randomly selects whole groups (clusters). It is a method of probability sampling that is often used to study large populations, particularly those that are widely geographically dispersed. Researchers usually use pre-existing units such as schools or cities as their clusters.
uniform distribution
Like the probability of rolling a 1, 2, 3, 4, 5, or 6, the uniform distribution has the same probability for measuring each value. EQUIPROBABLE
Types of Continuous Probability Distributions (curve)
Normal, Student's T distn, Chi squared (no neg values; starts at zero), Logistic
ORDINAL data vs INTERVAL data
Ordinal data are most concerned with order and ranking, while interval data are concerned with the differences in value between two consecutive values. ... Ordinal data place an emphasis on the position on a scale, while interval data place it on the difference between two values on a scale.
Is the Poisson distribution a discrete function
The Poisson distribution is a discrete function, meaning that the variable can only take specific values. It is used to model the number of occurrences in some unit of measure, like the number of customers arriving between noon and 1, or machine failures per month
Variance
The average of the squared differences from the mean. A common measure of dispersion. The variance for a population sums up all the squared differences between each observation and the mean, and then divides by n. The variance for a sample sums up all the squared differences between each observation and the sample mean, and then divides by (n-1). So for the same data, the sample formula always gives the larger value.
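A sketch comparing the two divisors on the same (invented) data, using Python's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n-1):

```python
from math import sqrt
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, squared differences sum to 32

population_var = pvariance(data)  # 32 / 8
sample_var = variance(data)       # 32 / 7, larger because the divisor is smaller
population_std = sqrt(population_var)  # std dev is the square root of the variance
```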
complement
The complement of an event E, denoted E′, is the set of outcomes in the sample space that are not in E. For example, suppose we are interested in the probability that a horse will lose a race. If event W is the horse winning the race, then the complement of event W is the horse losing the race
sample correlation coefficient
The correlation coefficient is (Cov of A and B) / [( STD of A) x (STD of B) ] The sample correlation coefficient, r, estimates the population correlation coefficient, ρ.
multiplicative law of probability
The multiplication law of probabilities states that if event A is independent of event B, then the probability of A and B happening together is simply pA×pB.
sample correlation coefficient formula
The only difference vs the POPULATION correlation coefficient is that you use the covariance and STD of the samples! (Sample Cov of A and B) / [(STD of A Sample) x (STD of B Sample)]
return to risk
The reciprocal of the coefficient of variation, it equals: Mean/Std DeV while the coefficient of variation equals: StdDev/Mean
statistics
The science of uncertainty and the technology of extracting information from data; an important element of business given the large growth of data
Std dev and variance
The standard deviation is the square root of the variance.
expected value
The weighted average of all of the possible outcomes of a probability distribution, where the weights are the P's. Mean, average
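A sketch of the weighted average, using a fair six-sided die as the example. The function name is made up.

```python
def expected_value(outcomes, probabilities):
    """Sum of each outcome times its probability."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(outcomes, probabilities))

# Fair die: each face has probability 1/6.
die_mean = expected_value([1, 2, 3, 4, 5, 6], [1 / 6] * 6)  # 3.5
```

Note the expected value (3.5) does not have to be one of the possible outcomes.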
cluster sampling
a sampling technique in which clusters of participants that represent the population are used
Binomial Distribution (Conditions)
VIDEO https://www.youtube.com/watch?v=b9a27XN_6tg ...a binomial dist'n models "n" independent repetitions of a Bernoulli experiment, each with a "p" probability of success. 1. Binary? Trials can be classified as success/failure BUT, unlike Bernoulli, we have many iterations. So a coin toss is Bernoulli, a series of them is a binomial distribution. 2. Independent? Trials must be independent. 3. Number? The number of trials (n) must be fixed in advance. 4. Success? The probability of success (p) must be the same for each trial.
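When the four conditions hold, the probability of exactly k successes in n trials is C(n, k) x p^k x (1-p)^(n-k). A minimal sketch; the coin-toss numbers are just an example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Exactly 2 heads in 4 fair coin tosses: 6 of the 16 equally likely sequences.
p_two_heads = binomial_pmf(2, 4, 0.5)  # 0.375

# The probabilities over all possible k (0..n) add up to 1.
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
```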
discrete random variable
Variable where the number of outcomes can be counted and each outcome has a measurable and positive probability. Eg the number of eggs that a hen lays in a given day (it can't be 2.3, which would be a CONTINUOUS random variable)
Union
a composition of all outcomes that belong to either of TWO events
intersection
a composition with all outcomes belonging to both events
tree diagram
a diagram used to show the total number of possible outcomes in a probability experiment
t distribution
a family of bell-shaped curves based on degrees of freedom, similar to the standard normal distribution with the exception that the variance is greater than 1; used when you are testing small samples and when the population standard deviation is unknown. t distributions have more probability in the tails (fat tails) and less in the center than does the standard normal. The bigger your sample and the more degrees of freedom, the closer the t curve gets to normal (p232)
Histogram
a graphical representation of a frequency distribution in columns. Can help make better decisions than just using the average; see p 136 on repair times: "% of them are under ____ weeks" is better than using avg
combination
a grouping of items in which order does not matter. A combination is a selection of all or part of a set of objects, without regard to the order in which objects are selected. For example, suppose we have a set of three letters: A, B, and C. We might ask how many ways we can select 2 letters from that set. For a combination, selecting A and then B is the same as selecting B and then A. For a permutation, order matters, so selecting A and then B is different from selecting B and then A. See p 177 for formulas.
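The A, B, C example can be counted directly with Python's standard library (`math.comb`, `math.perm`, and `itertools`). A small sketch:

```python
from itertools import combinations, permutations
from math import comb, perm

letters = ["A", "B", "C"]

# Order does not matter: AB is the same selection as BA.
combos = list(combinations(letters, 2))   # AB, AC, BC
# Order matters: AB and BA are counted separately.
perms = list(permutations(letters, 2))    # AB, AC, BA, BC, CA, CB

n_combos = comb(3, 2)  # 3
n_perms = perm(3, 2)   # 6
```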
z-score
a measure of how many standard deviations you are away from the norm (average or mean): z = (x - avg) / std dev. Aka Standardized Value
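A one-function sketch of the standardized value. The quiz-score numbers are hypothetical.

```python
def z_score(x, mean, std_dev):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Hypothetical: quiz scores with mean 70 and std dev 10.
z_high = z_score(85, 70, 10)  # 1.5 std devs above the mean
z_low = z_score(60, 70, 10)   # 1 std dev below the mean
```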
Skewness
a measure of the degree to which a distribution is asymmetrical
discrete metric
a metric derived from counting something
random variable
a numerical description of the outcome of an experiment
metric vs measure
a metric is a unit of measurement that provides a way to objectively quantify performance; a measure is the numerical value associated with a metric
statistical thinking
a philosophy of learning and action based on the following fundamental principles: all work occurs in a system of interconnected processes, variation exists in all processes, and understanding and reducing variation are keys to success
goodness of fit
a procedure that attempts to draw a conclusion about the nature of a distribution. The chi-square goodness of fit test determines whether sample data are representative of some prob dist'n. If the chi sq statistic is <= the critical value, then the data can reasonably be assumed to come from a normal distribution having the same mean and std dev as the sample. Otherwise, the normal dist is not appropriate to model the data. ... Normality is a requirement for the chi-square test that a variance equals a specified value, but there are many tests that are called chi-square because their asymptotic null distribution is chi-square, such as the chi-square test for independence in contingency tables (WHICH YOU USE ON NOMINAL DATA) and the chi-square goodness of fit test.
experiment
a process that results in an outcome
continuous random variable
a random variable that may assume any numerical value in an interval or collection of intervals A continuous variable is a variable whose value is obtained by measuring. Examples: height of students in class. weight of students in class. time it takes to get to school. distance traveled between classes.
midrange
average of the lowest and highest values in a data set (Max + Min)/2
probability distribution
list of possible outcomes with associated probabilities
chi-square statistic
goodness of fit? p 208 USED for data where you have no negative values. ... The chi-square statistic is used to compare two categorical variables to see if they are related. The statistic is calculated from the data and then compared with a critical value looked up in the chi-square table. The chi-square table is similar to other distribution tables; you need a couple of pieces of information to look up the critical value. In the case of chi-square, you'll need to know degrees of freedom and probability (both of which are usually supplied in the question).
confidence interval video
https://www.youtube.com/watch?v=tFWsuO9f74o
Process Capability Index
index that measures the potential for a process to generate defective outputs relative to either upper or lower specifications. Cp = (upper specification - lower specification) / total variation
IQR
interquartile range, Q3-Q1 where Q3= 75th percentile (the median of the 'top' half) and Q1= 25th percentile (the median of the 'bottom' half), gives the spread of the central (middle) 50% of the data set aka "Midspread"
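A sketch of the "median of each half" approach described above. Quartile conventions differ between textbooks and software (many tools interpolate instead), so other methods can give slightly different Q1 and Q3. The function name and data are made up, and the function assumes at least two values.

```python
from statistics import median

def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) with Q1/Q3 as medians of the bottom/top halves.

    Needs at least two values. For an odd number of values the middle
    observation is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[-half:]
    q1, q3 = median(lower_half), median(upper_half)
    return q1, q3, q3 - q1

q1, q3, iqr = iqr_median_of_halves([15, 1, 9, 3, 13, 5, 11, 7])
# sorted halves: [1, 3, 5, 7] and [9, 11, 13, 15]
```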
ratio data
is continuous and has a natural zero, like dollars and time - ratio data can be converted into interval data, ordinal data or categorical data
Bernoulli distribution
is the simplest case of a binomial distribution; only 2 possible outcomes. Ex: Favorable (40%) view of the President vs Unfavorable (60%). To do calculations, assign 1 and zero to these categories and then do the calculations. https://www.khanacademy.org/math/statistics-probability/random-variables-stats-library/binomial-mean-standard-dev-formulas/v/mean-and-variance-of-bernoulli-distribution-example Random variable, 2 outcomes, with probabilities p and 1-p (they don't have to be equal)
Central Limit Theorem
see pg 228. If the sample size is large enough, the sampling dist'n of the mean will be approximately normally distributed even if the population is not, AND the mean of the sampling dist'n will equal the mean of the population. If the population is normally distributed, then the sampling dist'n of the mean will be normal for any sample size. VIDEO https://www.youtube.com/watch?v=_YOr_yYPytM Let's say you have a population of 1 million and you keep taking samples of 100. Each time you take a sample, you calculate the average of that sample. Those sample averages will form an approximately normal dist'n centered on the population mean, even if the population itself does NOT have a normal distribution. x-bar is the mean of a sample (we don't know the mean of the population), n is the sample size, and s is the std dev of the sample
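A small simulation sketch of the idea: draw many samples of 100 from a deliberately skewed (exponential) population and look at the sample averages. The population, sizes, and seeds are all assumptions chosen for illustration.

```python
import random
from statistics import fmean, pstdev

def sample_means(population, sample_size, n_samples, seed=0):
    """Average of each of n_samples random samples (drawn with replacement)."""
    rng = random.Random(seed)
    return [fmean(rng.choices(population, k=sample_size)) for _ in range(n_samples)]

# A skewed, clearly non-normal population.
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1.0) for _ in range(100_000)]

means = sample_means(population, sample_size=100, n_samples=2_000)

center_of_means = fmean(means)   # close to the population mean
spread_of_means = pstdev(means)  # close to population std dev / sqrt(100)
```

A histogram of `means` would look roughly bell-shaped even though the population is heavily right-skewed, and its spread shrinks as the sample size grows.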
statistic
summary measure of data
Measurement
the act of obtaining data associated with a metric
