Business Analytics

Time Series

Data about an attribute for a given subject (e.g. person, organization, or country) in temporal order, measured at regular time intervals (e.g. minutes, months, or years). When setting up visualizations like a scatter plot, be sure to put the independent variable on the x-axis. For a time series, time is the independent variable.

Cross-Sectional Data

Data that provide a measure of an attribute across multiple different subjects (e.g. people, organizations, or countries) at a given moment in time or during a given time period. When setting up visualizations like a scatter plot, be sure to put the independent variable on the x-axis. Cross-sectional charts are useful for comparing two variables at a given time.

Calculate Range Associated with Non-Cumulative Probability, E.g., Middle 99%

NORM.INV(0.005,63.5,2.5)=57.1 and NORM.INV(0.995,63.5,2.5)=69.9. The normal curve is symmetrical, so we know that the middle 99% of the distribution comprises 49.5% on either side of the mean and excludes 0.5% on each of the tails. Thus we can find the value corresponding to the left side of the range using the NORM.INV function evaluated at 0.5% and the right side using the NORM.INV function evaluated at 99.5%.
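
The same calculation can be sketched outside Excel. This is a minimal Python illustration, assuming the standard library's `statistics.NormalDist.inv_cdf` as a stand-in for NORM.INV; the helper name `middle_range` is hypothetical and not part of the course material.

```python
from statistics import NormalDist

def middle_range(mean, sd, coverage):
    """Return the (lower, upper) values bounding the central `coverage` share
    of a normal distribution, e.g. coverage=0.99 for the middle 99%."""
    tail = (1 - coverage) / 2              # probability left out in each tail
    dist = NormalDist(mu=mean, sigma=sd)
    # inv_cdf plays the role of NORM.INV(probability, mean, standard_dev)
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

lower, upper = middle_range(63.5, 2.5, 0.99)
print(round(lower, 1), round(upper, 1))    # 57.1 69.9
```

The two bounds are equally far from the mean because the normal curve is symmetrical.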

Standard Normal Curve

Normal distribution whose mean is equal to zero (μ=0) and whose standard deviation is equal to one (σ=1).

Outliers

Numbers that are much greater or much less than the other numbers in the set. Technically, a data point is considered an outlier if it is more than a specified distance below the lower quartile or above the upper quartile of a data set. Let's start with a couple of definitions. The lower quartile, Q1, is the 25th percentile—by definition, 25% of all observations fall below Q1. The upper quartile, Q3, is the 75th percentile—75% of all observations fall below Q3. The interquartile range (IQR) is the difference between the upper and lower quartiles, that is, IQR=Q3-Q1. We then multiply the IQR by 1.5 to find the appropriate range, computing 1.5(IQR)=1.5(Q3-Q1). A data point is an outlier if it is less than Q1-1.5(IQR) or greater than Q3+1.5(IQR).
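
The 1.5(IQR) rule can be sketched in Python as follows. The data values are made up for illustration, and the quartile method (`method="inclusive"`, which interpolates between data points) is an assumption; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

def find_outliers(values):
    """Return the points lying more than 1.5*IQR below Q1 or above Q3."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

sample = [10, 12, 12, 13, 14, 15, 16, 40]   # hypothetical data
print(find_outliers(sample))                # [40]
```

Here Q1 = 12 and Q3 = 15.25, so the fences are 7.125 and 20.125 and only 40 falls outside them.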

Type II Error

Often called a false negative (we incorrectly fail to reject the null hypothesis when it is actually not true). Note that since we have no information on the probabilities of different sample means if the null hypothesis is false, we cannot calculate the likelihood of a type II error.

Type I Error

Often called a false positive (we incorrectly reject the null hypothesis when it is actually true). At a 95% confidence level (a 5% significance level), there is a 5% chance of making this error when the null hypothesis is true. You can reduce this chance by increasing the confidence level; the downside is an increased chance of a false negative.

Two-Sided Hypothesis Test

We perform this test when we do not have strong convictions about the direction of a change. Therefore we test for a change in either direction. H0:μafter=μbefore Ha:μafter≠μbefore

One-Sided Hypothesis Test

We perform this test when we have strong convictions about the direction of a change—that is, we know that the change is either an increase or a decrease. H0:μafter≤μbefore Ha:μafter>μbefore

Two-Population Hypothesis Test (A/B Test)

We perform this test when we want to compare the means of two populations—for example, when we want to conduct an experiment and test for a difference between a control and treatment group.

Single Population Hypothesis Test

We perform this test when we want to determine whether a population's mean is significantly different from its historical average.

Margin of Error

The range around a sample estimate within which the sample is expected to reflect the population accurately. =CONFIDENCE.NORM(alpha, standard_dev, size) -alpha, the significance level, equals one minus the confidence level (for example, a 95% confidence interval would correspond to the significance level 0.05). -standard_dev is the standard deviation of the population distribution. We will typically use the sample standard deviation, s, which is our best estimate of our population's standard deviation. -size is the sample size, n.

Standard Deviation of the Distribution of Sample Means

σ/√n, the population standard deviation divided by the square root of the sample size. Large samples will create a "tighter" distribution of sample means than smaller samples, visualized as a narrower bell-curve with a smaller SD.

Rules of Thumb for Normal Distribution: P(μ−σ≤x≤μ+σ)≈68%

About 68% of the probability is contained in the range reaching one standard deviation away from the mean on either side.

Upper & Lower Bound

Calculate by adding CONFIDENCE.NORM(alpha, s, n) to the historical mean (upper bound) or subtracting it from the historical mean (lower bound).

P-Value

Can be interpreted as the probability, assuming the null hypothesis is true, of obtaining an outcome that is equal to or more extreme than the result obtained from a data sample. The lower the p-value, the greater the strength of statistical evidence against the null hypothesis. When the p-value of a sample mean is less than the significance level, we reject the null hypothesis. =T.TEST(array1, array2, tails, type) -array1 is a set of numerical values or cell references. We will place our sample data in this range. -array2 is a set of numerical values or cell references. If we have only one set of data, we use the historical mean as the second data set. To do this, we create a column with each entry equal to the historical mean. -tails is the number of tails for the distribution. It can be either 1 or 2. If the alternative hypothesis is that the mean has changed and therefore can be either lower or higher than the historical mean, we will use a two-tailed, or two-sided, hypothesis test. -type can be 1, 2, or 3. Type 1 is a paired test and is used when the same group from a single population is tested twice to provide paired "before and after" data for each member of the group. Type 2 is an unpaired test in which the samples are assumed to have equal variances. Type 3 is an unpaired test in which the samples are assumed to have unequal variances. There are ways to test whether variances are equal, but when in doubt, use type 3. One-sided and two-sided tests are set up the same way, except that tails is 1 rather than 2; don't forget to adjust the significance level accordingly.

Alternative Hypothesis (Ha)

(The opposite of the null hypothesis) is the theory or claim we are trying to substantiate. If our data allow us to nullify the null hypothesis, we substantiate the alternative hypothesis.

Excel function to calculate the probability of a value greater than or equal to x on a standard normal curve.

=1-NORM.DIST(x,0,1,TRUE).

Conditional Mean

=AVERAGEIF(range, criteria, [average_range]) -range contains the one or more cells to which we want to apply the criteria or condition. -criteria is the condition that is to be applied to the range. -[average_range] is the range of cells containing the data we wish to average. The average of one variable given that another variable takes on a specific value: the mean of a specific subset of data.
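
A conditional mean can be sketched in Python like this. The function name `average_if` and the region and sales figures are hypothetical, chosen only to mirror the three AVERAGEIF arguments; the sketch handles exact-match criteria only.

```python
def average_if(criteria_range, criterion, average_range):
    """Average the entries of average_range whose matching entry in
    criteria_range equals criterion (exact-match criteria only)."""
    selected = [value
                for key, value in zip(criteria_range, average_range)
                if key == criterion]
    if not selected:
        raise ValueError("no rows satisfy the criterion")
    return sum(selected) / len(selected)

regions = ["East", "West", "East", "West", "East"]   # hypothetical data
sales = [100, 80, 120, 60, 110]
print(average_if(regions, "East", sales))            # 110.0
```

The result is the mean of sales for the subset of rows where the region is "East".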

Percentile

=PERCENTILE.INC(array, k) -array is the range of data for which we want to calculate a given percentile. -k is the percentile value. For example, if we want to know the 95th percentile, k would be 0.95. The value of a variable for which a certain percentage of the data set falls below. For example, if 87% of students taking the GMAT exam earn scores below 670, the 87th percentile for the GMAT exam is 670 points.
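
The interpolation behind an inclusive percentile can be sketched as follows. This assumes the common convention of locating the percentile at position k(n-1) in the sorted data and interpolating linearly between neighbors; the function name and scores are hypothetical.

```python
def percentile_inc(values, k):
    """Inclusive percentile for 0 <= k <= 1, interpolating linearly
    between the two nearest sorted data points."""
    if not values or not 0 <= k <= 1:
        raise ValueError("need data and a k between 0 and 1")
    data = sorted(values)
    position = k * (len(data) - 1)       # fractional index into sorted data
    lower_index = int(position)
    fraction = position - lower_index
    if lower_index + 1 >= len(data):     # k == 1 lands on the maximum
        return float(data[-1])
    step = data[lower_index + 1] - data[lower_index]
    return data[lower_index] + fraction * step

scores = [1, 2, 3, 4, 5]                 # hypothetical data
print(percentile_inc(scores, 0.5))       # 3.0
```

With k = 0.5 the function returns the median, and with k = 0 or k = 1 it returns the minimum or maximum.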

Confidence Intervals for Small Sample Sizes

=CONFIDENCE.T(alpha, standard_dev, size) -alpha, the significance level, equals one minus the confidence level (for example, a 95% confidence interval would correspond to the significance level 0.05). -standard_dev is the standard deviation of the population distribution. We will typically use the sample standard deviation, s, which is our best estimate of our population's standard deviation. -size is the sample size, n. Like CONFIDENCE.NORM, CONFIDENCE.T returns the margin of error, which we can add to and subtract from the sample mean. Thus the confidence interval is: x¯ ± t(s/√n) = x¯ ± CONFIDENCE.T(alpha, standard_dev, size). Confidence intervals are usually constructed for sample sizes of 30 or more, for which we know the Central Limit Theorem will hold; otherwise, we must use a different approach. We can still create a confidence interval for a sample of fewer than 30, but only if we know the underlying population is normally distributed. We would then use CONFIDENCE.T to calculate the margin of error for the small sample. CONFIDENCE.T works just like CONFIDENCE.NORM, except that it uses a t-distribution rather than a normal distribution.

Correlation Coefficient

=CORREL(array 1, array 2) -array 1 is a set of numerical variables or cell references containing data for one variable of interest. -array 2 is a set of numerical variables or cell references containing data for the other variable of interest. -Note that the number of observations in array 1 must be equal to the number in array 2. A measure of the strength of a linear relationship between two variables. The correlation coefficient can range from -1 to +1. A correlation coefficient of -1 indicates a perfect negative linear relationship between two variables, whereas a correlation coefficient of +1 indicates a perfect positive linear relationship. A correlation coefficient of 0 indicates that no linear relationship exists between two variables, though it is possible that a non-linear relationship exists between them. Correlation does not imply causation. However, it can show relationships between two variables that are significant to our understanding of the data set, so long as we are mindful not to overlook hidden variables.
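
To make the -1, 0, and +1 cases concrete, here is a small Python sketch of the standard Pearson correlation formula. The function name `correl` simply echoes the Excel function, and all data values are invented for illustration.

```python
from math import sqrt

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys):
        raise ValueError("both arrays need the same number of observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations, and sums of squared deviations for each variable
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(ss_x * ss_y)

perfect_positive = correl([1, 2, 3, 4], [2, 4, 6, 8])      # y = 2x
perfect_negative = correl([1, 2, 3], [3, 2, 1])
# y = x squared: clearly related, but not linearly
no_linear = correl([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(perfect_positive, perfect_negative, no_linear)
```

The third case illustrates the caveat in the definition: the coefficient is 0 even though y is completely determined by x.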

Dummy Variable

=IF(logical_test,[value_if_true],[value_if_false]) E.g., =IF(A2="Yes",1,0). A variable that takes on one of two values: 0 or 1. Dummy variables are used to transform categorical variables into quantitative variables. A categorical variable with only two categories (e.g. "heads" or "tails") can be transformed into a quantitative variable using a single dummy variable that takes on the value 1 when a data point falls into one category (e.g. "heads") and 0 when a data point falls into the other category (e.g. "tails"). For categorical variables with more than two categories, multiple dummy variables are required. Specifically, the number of dummy variables must be the total number of categories minus one.
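
The "categories minus one" rule can be sketched in Python. The function name `dummy_columns`, the choice of a baseline category, and the color data are all hypothetical; the baseline is the category that gets no column of its own and is represented by zeros in every dummy.

```python
def dummy_columns(values, baseline):
    """Build one 0/1 column per category, skipping the baseline category,
    so k categories produce k - 1 dummy variables."""
    categories = sorted(set(values) - {baseline})
    return {category: [1 if v == category else 0 for v in values]
            for category in categories}

colors = ["red", "blue", "green", "blue"]   # three categories
dummies = dummy_columns(colors, baseline="red")
print(dummies)   # {'blue': [0, 1, 0, 1], 'green': [0, 0, 1, 0]}
```

Three categories yield two dummy variables; a "red" row is the one where both dummies are 0.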

Cumulative Probability Hint: cumulative probabilities are conceptually related to the percentiles of a distribution. For example, the value associated with a cumulative probability of 90% is the 90th percentile of the distribution.

=NORM.DIST(x, mean, standard_dev, cumulative) -x is the value at which you want to evaluate the distribution function. -mean is the mean of the distribution. -standard_dev is the standard deviation of the distribution. -cumulative is an argument that specifies the type of probability we wish to calculate. We insert "TRUE" to indicate that we wish to find the cumulative probability, that is, the probability of being less than or equal to the x-value. (Inserting the value "FALSE" provides the height of the normal distribution at the value x, which we will not cover in this course.) The probability of all values less than or equal to a particular value. For a standard normal curve, we know the mean is 0 and the standard deviation is 1, so we could find a cumulative probability using =NORM.DIST(x,0,1,TRUE). Alternatively, we use Excel's NORM.S.DIST function =NORM.S.DIST(z, cumulative). The "S" in this function indicates it applies to a standard normal curve. -z is the value (the z-value) at which we want to evaluate the standard normal distribution function. -cumulative is an argument that specifies the type of probability we wish to calculate. We will insert "TRUE".
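
As a cross-check outside Excel, a cumulative normal probability can be sketched in Python. This assumes `statistics.NormalDist.cdf` as the counterpart of NORM.DIST with cumulative set to TRUE; the mean of 63.5 and standard deviation of 2.5 are reused from the middle-99% card, and x = 66 is a made-up value one standard deviation above the mean.

```python
from statistics import NormalDist

def norm_dist_cumulative(x, mean, sd):
    """Probability of a value less than or equal to x on a normal curve."""
    return NormalDist(mu=mean, sigma=sd).cdf(x)

p_below = norm_dist_cumulative(66, 63.5, 2.5)   # x is one SD above the mean
p_above = 1 - p_below                           # complement: probability above x
print(round(p_below, 4), round(p_above, 4))     # 0.8413 0.1587
```

Subtracting the cumulative probability from 1 gives the probability of exceeding x, which is the same idea as =1-NORM.DIST(x,0,1,TRUE) on the standard normal curve.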

Given Percentile

=NORM.INV(probability, mean, standard_dev) -probability is the cumulative probability for which we want to know the corresponding x-value on a normal distribution. -mean is the mean of the distribution. -standard_dev is the standard deviation of the distribution. The value that corresponds to a given cumulative probability. To calculate the corresponding value on a normal curve, we use NORM.INV.

Z-Value

=STANDARDIZE(x, mean, standard_dev) -x is the value to be standardized. -mean is the mean of the distribution. -standard_dev is the standard deviation of the distribution. After standardizing, we can insert the resulting z-value into the NORM.S.DIST function to find the cumulative probability of that z-value. The distance in standard deviations from the data point to the mean. Negative z-values correspond to data points less than the mean; positive z-values correspond to data points greater than the mean. The z-values for any normal distribution can be calculated using the formula, z = (x−μ)/σ
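
The formula z = (x−μ)/σ and its use with the standard normal curve can be sketched in Python. The function name `standardize` echoes the Excel function; the value x = 61 is hypothetical, and the mean and standard deviation are reused from the middle-99% card.

```python
from statistics import NormalDist

def standardize(x, mean, sd):
    """Distance from x to the mean, measured in standard deviations."""
    return (x - mean) / sd

z = standardize(61.0, 63.5, 2.5)     # one SD below the mean, so z is -1.0
# NormalDist() with no arguments is the standard normal curve (mean 0, SD 1),
# so cdf(z) corresponds to NORM.S.DIST(z, TRUE)
p = NormalDist().cdf(z)
print(z, round(p, 4))                # -1.0 0.1587
```

The negative z-value signals a data point below the mean, and the cumulative probability matches what NORM.DIST would give for the unstandardized x.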

Standard Deviation

=STDEV.S(number 1, [number 2], ...) A measure of the spread of a data set's values around its mean value. The standard deviation is the square root of the variance. We can also find the standard deviation using the Excel function =SQRT(number) to take the square root of the variance. For example, =SQRT(16)=4.

Variance

=VAR.S(number 1, [number 2], ...) A measure of the spread of a data set's values around its mean value. If the true population mean is known, the variance is equal to the sum of the squares of the differences between each point of the data set and the population mean, divided by the total number of data points. If the mean is estimated from a sample, the variance is equal to the sum of the squares of the differences between each point of the data set and the sample mean, divided by the total number of data points in the sample minus one.
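
The difference between dividing by n and by n−1 can be sketched in Python. The function name `variance` and its `known_mean` parameter are hypothetical, and the data set is invented so that the arithmetic is easy to follow.

```python
from math import sqrt

def variance(values, known_mean=None):
    """Variance of a data set.

    With a known population mean: divide the squared deviations by n.
    With the mean estimated from the sample: divide by n - 1 (as VAR.S does).
    """
    n = len(values)
    if known_mean is not None:
        return sum((v - known_mean) ** 2 for v in values) / n
    sample_mean = sum(values) / n
    return sum((v - sample_mean) ** 2 for v in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]                 # mean is 5; squared deviations sum to 32
population_var = variance(data, known_mean=5)   # 32 / 8 = 4.0
sample_var = variance(data)                     # 32 / 7, about 4.571
print(population_var, sqrt(population_var))     # 4.0 2.0
```

Taking the square root of either variance gives the corresponding standard deviation.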

Scatter Plot Graph

A graph of plotted points, not connected by a line, that shows the relationship between two sets of data (e.g., height versus weight). It helps us visualize the relationship between two variables.

Histogram

A graph of vertical bars representing the frequency distribution of a set of data: helps us visualize a single variable's distribution.

Null Hypothesis (H0)

A statement about a topic of interest. It is typically based on historical information or conventional wisdom. We always start a hypothesis test by assuming that the null hypothesis is true and then test to see if we can nullify it—that's why it's called the "null" hypothesis. The null hypothesis is the opposite of the hypothesis we are trying to prove (the alternative hypothesis).

Central Limit Theorem

A theorem stating that if we take sufficiently large randomly-selected samples from a population, the means of these samples will be normally distributed regardless of the shape of the underlying population. (Technically, the underlying population must have a finite variance.) This means we can disregard the distribution of the population and focus solely on the sample mean, which we know falls somewhere in a normal distribution centered at the true population mean.

Hidden Variable

A variable that is correlated with two different variables that are not directly related to each other. The two variables may appear to be unrelated, but are mathematically correlated because each of them is correlated with a third, the hidden variable that drives the observed correlation. A hidden variable makes its presence known through its relationship with each of the two variables that are being observed.

Rules of Thumb for Normal Distribution: P(μ−2σ≤x≤μ+2σ)≈95%

About 95% of the probability is contained in the range reaching two standard deviations (1.96 to be exact) away from the mean on either side: We'll use two standard deviations when discussing the normal distribution conceptually, but we will always use 1.96 for actual calculations in Excel.

Rules of Thumb for Normal Distribution: P(μ−3σ≤x≤μ+3σ)≈99.7%

About 99.7% of the probability is contained in the range reaching three standard deviations away from the mean on either side.

=SUM(Range 1)/Count

An alternative formula for calculating the Mean (average) of a data set.

A/B Test

An experiment that compares the value of a specified dependent variable (such as the likelihood that a website visitor purchases an item) across two different groups (usually a control group and a treatment group). The members of each group must be randomly selected to ensure that the only difference between the groups is the "manipulated" independent variable (for example, the size of the font on two otherwise-identical websites). To perform a two-sample test in Excel, we use the same T.TEST function we used earlier. The only difference is that we use the actual data from the second sample for our second column of data (as opposed to the historical mean).

Confidence Interval for a Population Mean

CONFIDENCE.NORM returns the margin of error, z(s/√n), where z is the z-value associated with the specified level of confidence. The lower and upper bounds of the confidence interval are equal to the sample mean, plus or minus that margin of error. Thus the confidence interval is: x¯ ± z(s/√n) = x¯ ± CONFIDENCE.NORM(alpha, standard_dev, size). A range constructed around a sample mean that estimates the true population mean. The confidence level of a confidence interval indicates how confident we are that the range contains the true population mean. For example, we are 95% confident that a 95% confidence interval contains the true population mean. The confidence level is equal to 1 - significance level. To increase the width of the confidence interval: increase the confidence level, or decrease the sample size.
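
The margin of error z(s/√n) can be sketched in Python. The function name `confidence_norm` echoes the Excel function and takes the same three arguments; the sample mean, standard deviation, and sample size are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def confidence_norm(alpha, standard_dev, size):
    """Margin of error z * s / sqrt(n) for a two-sided confidence interval."""
    # alpha is split across both tails, so z is taken at 1 - alpha/2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_dev / sqrt(size)

sample_mean, s, n = 63.5, 2.5, 100            # hypothetical sample statistics
margin = confidence_norm(0.05, s, n)          # 95% confidence level
lower_bound = sample_mean - margin
upper_bound = sample_mean + margin
print(round(margin, 2), round(lower_bound, 2), round(upper_bound, 2))
# 0.49 63.01 63.99
```

Raising the confidence level increases z and widens the interval, while a larger n shrinks s/√n and narrows it.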

A Sample Size of 1000

Generally, the sample mean and standard deviation become closer to the population mean and standard deviation as the sample size increases; however, a sample of 1,000 is often a satisfactory representation of a population numbering in the millions, as long as the sample is randomly selected and representative of the entire population. We might expect that for a larger population, a larger sample size is needed to achieve a given level of accuracy, but this is not necessarily true. We may, however, need a larger sample size when we are trying to detect something very rare. For example, if we are trying to estimate the incidence of a rare disease, we may need a larger sample simply to ensure that some people afflicted with the disease are included in the sample.

Random Sample

RAND assigns a random identification (ID) number between 0 and 1 to each data point. -The function takes no arguments; we simply type =RAND() with empty parentheses. -We can use the RAND function to generate random numbers between any two specified values. For example, if we wanted to generate random numbers between 0 and 10 we would multiply the function by 10 and enter =RAND()*10. If we wanted numbers between 5 and 15, we would enter =5+RAND()*10. A sample (a subset of items) chosen from a population (the full set of items) such that every item in the population has an equal likelihood of being chosen. All statistical inferences about a population based on sample data should be based on random samples.
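
The random-ID approach can be sketched in Python: tag every item with a random number between 0 and 1, sort by that number, and keep the first k items. The function name `random_sample`, the `seed` parameter, and the population of 100 numbered items are assumptions made for illustration.

```python
import random

def random_sample(items, k, seed=None):
    """Pick k items by giving each a random ID in [0, 1) and sorting on it."""
    rng = random.Random(seed)                  # seed only makes runs repeatable
    tagged = [(rng.random(), item) for item in items]
    tagged.sort(key=lambda pair: pair[0])      # order by random ID only
    return [item for _random_id, item in tagged[:k]]

population = list(range(1, 101))               # hypothetical population of 100 items
chosen = random_sample(population, 10, seed=42)
scaled = 5 + random.random() * 10              # same idea as =5+RAND()*10
print(len(chosen), 5 <= scaled < 15)           # 10 True
```

Because the IDs are assigned independently of the items, every item has the same chance of landing among the first k.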

Survey

Researchers ask questions and record self-reported responses from a random sample of a population.

Experiment

Researchers divide a sample into two or more groups. One group is a "control group," which has not been manipulated. In the "treatment group (or groups)," they manipulate a variable and then compare the treatment group(s) responses to the responses of the control group.

Observational Study

Researchers observe and collect data about a sample (e.g., people or items) as they occur naturally, without intervention, and analyze the data to investigate possible relationships.

Population Proportion

The number of data points in a population with a certain characteristic divided by the total number of data points in the population. We generally cannot measure the full population proportion directly; we estimate it from the sample proportion. Population proportions are often expressed as fractions or percentages. Before we can calculate the confidence interval for the proportion of "yes" or "no" answers, we need to calculate p¯, the percent of the total number of responses that were "yes" responses. p¯ is our best estimate of our variable of interest, p, the true percentage of "yes" responses in the underlying population. Because every respondent must answer "yes" or "no", we know that the percentage of "no" responses equals 1−p¯. The easiest way to do this is to assign a "dummy variable" to each response, that is, a variable that can take on only the values 0 and 1. The following guidelines are typically used when estimating proportions to ensure that a sample is large enough to provide a good estimate. The sample size n must be large enough to satisfy both conditions: n·p¯ ≥ 5 and n(1−p¯) ≥ 5.
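
The dummy-variable approach and the two sample-size conditions can be sketched in Python. The function name `proportion_summary` and the survey responses are hypothetical. Since n·p¯ is simply the count of "yes" responses and n(1−p¯) the count of "no" responses, the check below compares those counts with 5 directly.

```python
def proportion_summary(responses):
    """Return (p_bar, large_enough) for a list of "yes"/"no" responses."""
    dummies = [1 if r == "yes" else 0 for r in responses]   # 1 = yes, 0 = no
    n = len(dummies)
    yes_count = sum(dummies)          # equals n * p_bar
    no_count = n - yes_count          # equals n * (1 - p_bar)
    p_bar = yes_count / n
    large_enough = yes_count >= 5 and no_count >= 5
    return p_bar, large_enough

survey = ["yes"] * 12 + ["no"] * 28   # hypothetical survey of 40 people
p_bar, ok = proportion_summary(survey)
print(p_bar, ok)                      # 0.3 True
```

A sample with only two "yes" answers out of 40 would fail the first condition, signaling that the sample is too small for a good estimate of p.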

Confidence Level

The percentage of all possible samples that can be expected to include the true population parameter. For example, for a 95% confidence level, the intervals should be constructed so that, on average, for 95 out of 100 samples, the confidence interval will contain the true population mean. Note that this does not mean that for any given sample, there is a 95% chance that the population mean is in the interval; each confidence interval either contains the true mean or it does not. A 95% confidence level should be interpreted as saying that if we took 100 samples from a population and created a 95% confidence interval for each sample, on average 95 of the 100 confidence intervals (that is, 95%) would contain the true population mean.

Distribution of Sample Means

The probability distribution of the means of all randomly-selected samples of the same size that could be taken from a population: more closely approximates a normal curve as we increase the number of samples and/or the sample size. The mean of any single sample lies on the normally distributed Distribution of Sample Means, so we can use the normal curve's special properties to draw conclusions from a single sample mean. The mean of the Distribution of Sample Means equals the mean of the population distribution. The standard deviation of the Distribution of Sample Means equals the standard deviation of the population distribution divided by the square root of the sample size. Thus, increasing the sample size decreases the width of the Distribution of Sample Means.

Coefficient of Variation (CV)

The ratio of the standard deviation to the mean. (SD/Mean) = CV

Bias

The tendency of a measurement process to over- or under-estimate the value of a population parameter. Although a sample statistic will almost always differ from the population parameter, for an unbiased sample, the difference will be random. In contrast, for a biased sample, the statistic will differ in a systematic way (e.g., tend to be too high). Some common reasons for bias include non-random sampling methods and non-neutral question phrasing. Avoid biased results by phrasing questions neutrally; ensuring that the sampling method is appropriate for the demographic of the target population; and pursuing high response rates. It is often better to have a smaller sample with a high response rate than a larger sample with a low response rate. If a sample is sufficiently large and representative of the population, the sample statistics, x¯ and s, should be reasonably good estimates of the population parameters, μ and σ, respectively.

Significance Level

The threshold for deciding whether to reject the null hypothesis. The most commonly used significance level is .05 (corresponding to a confidence level of 95%), which means we would reject the null hypothesis when the p-value < .05. The significance level is represented by the Greek letter α (alpha) and is equal to 1 - confidence level.

Determining Sample Size

To find the sample size necessary to ensure a specified margin of error is less than or equal to a given distance, M, we just rearrange the equation and solve for the sample size, n. if Margin of Error = z*σ/√n ≤ M . . . then . . . n ≥ (z*σ/M)^2 Recall that we usually don't know σ, the true standard deviation of the population. Moreover, when determining the appropriate sample size, we typically would not have even taken a sample yet, so we don't have a sample standard deviation. In a case like this, we could take a preliminary sample and use that sample's standard deviation, s, as an estimate of σ. Thus, to ensure that the margin of error is less than M, the sample size must satisfy: n ≥ (z*s/M)^2 Also, it is necessary to round up because we are solving for the smallest value that will satisfy the inequality.
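
The rearranged inequality n ≥ (z*s/M)^2 can be sketched in Python. The function name `required_sample_size` and the inputs (a preliminary standard deviation of 2.5 and a target margin of 0.5) are hypothetical values chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_sample_size(confidence, sd_estimate, max_margin):
    """Smallest n for which z * s / sqrt(n) <= max_margin."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Round up: n must be a whole number that satisfies the inequality
    return ceil((z * sd_estimate / max_margin) ** 2)

n_needed = required_sample_size(0.95, 2.5, 0.5)
z_95 = NormalDist().inv_cdf(0.975)
print(n_needed, round(z_95 * 2.5 / sqrt(n_needed), 4))   # 97 0.4975
```

The unrounded answer is about 96.04, so rounding down to 96 would leave the margin of error slightly above 0.5, which is why we round up to 97.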

Spread of a Data Distribution

A measure of the amount of variability, or how "spread out" a set of data is. The range, variance, and standard deviation measure the spread of the data: -Standard deviation is equal to the square root of the variance. -The coefficient of variation measures the size of the standard deviation relative to the size of the mean.

Measures of Central Tendency

Mean, median, mode. To determine the "central tendency" of a data set (an indication of where the "center" of the data set lies), we usually start by calculating the mean, the most common measurement of central tendency. The mean is the number people refer to when they talk about the "average" of a set of numbers. However, the mean is not always a reliable numerical measurement when considering a large data set. The mean is strongly affected by outliers, which skew the data. Considering measurements like the mode and median helps balance this, as does visualizing the data in a histogram or scatter plot.

