Business Analytics

Drawbacks of lagged variables (2)

1) Each lagged variable decreases our sample size by one observation. 2) If the lagged variable does not increase the model's explanatory power, the addition of the variable decreases Adjusted R^2

Why do we use regression analysis? (2 parts)

1) Studying the magnitude and structure of the relationship between two variables. 2) Forecasting a variable based on its relationship with another variable.

Given the general regression equation, ŷ = a + bx, which of the following describe ŷ?

1) The expected value of y 2) The dependent variable 3) The value we are trying to predict

Before conducting a hypothesis test: (2 parts)

1. Determine whether to analyze a change in a single population or compare two populations. a) We perform a single-population hypothesis test when we want to determine whether a population's mean is significantly different from its historical average. b) We perform a two-population hypothesis test when we want to compare the means of two populations—for example, when we want to conduct an experiment and test for a difference between a control and treatment group. 2. Determine whether to perform a one-sided or two-sided hypothesis test. a) We perform two-sided tests when we do not have strong convictions about the direction of a change. Therefore we test for a change in either direction. b) We perform a one-sided test when we have strong convictions about the direction of a change—that is, we know that the change is either an increase or a decrease. For a one-sided test at a given confidence level, we double the alpha (significance level) we would have used for a two-sided test at that confidence level; equivalently, the p-value of a one-sided test is half that of the corresponding two-sided test.

It is often useful to infer attributes of a large population from a smaller sample. To make sound inferences: (3 guidelines)

1. Make sure the sample is sufficiently large and representative of the population. 2. Avoid biased results by phrasing questions neutrally, ensuring that the sampling method is appropriate for the demographic of the target population, and pursuing high response rates. 3. If a sample is sufficiently large and representative of the population, the sample statistics, x̄ and s, should be reasonably good estimates of the population parameters μ and σ, respectively.

To conduct a hypothesis test, we follow 4 steps:

1. State the null and alternative hypotheses 2. Choose the level of significance for the test. 3. Gather data about a sample or samples. 4. To determine whether the sample is highly unlikely under the assumption that the null hypothesis is true, construct the range of likely sample means or calculate the p-value. If the sample mean falls in the range of likely sample means, or if its p-value is greater than the stated significance level, we do not have sufficient evidence to reject the null hypothesis. If the sample mean falls in the rejection region, or if it has a p-value lower than the stated significance level, we have sufficient evidence to reject the null hypothesis. We can never accept the null hypothesis.

Suppose we wanted to calculate a 90% range of likely sample means for the movie theater example. Select the function that would correctly calculate this range.

6.7±CONFIDENCE.NORM(0.10,2.8,196) The range of likely sample means is centered at the historical population mean, in this case 6.7. Since this is a 90% range of likely sample means, alpha equals 0.10.

Suppose we wanted to calculate a 90% range of likely sample means for the movie theater example but our sample size had been only 15. (Assume the same historical population mean, sample mean, and sample standard deviation.) Select the function that would correctly calculate this range.

6.7±CONFIDENCE.T(0.10,2.8,15) The range of likely sample means is centered at the historical population mean, in this case 6.7. We must use CONFIDENCE.T since the sample size is less than 30.

Confidence Interval Excel function

=CONFIDENCE.NORM(alpha, standard_dev, size) - alpha, the significance level, equals one minus the confidence level (for example, a 95% confidence interval would correspond to the significance level 0.05). - standard_dev is the standard deviation of the population distribution. We will typically use the sample standard deviation, s, which is our best estimate of our population's standard deviation. - size is the sample size, n. x̄ ± z*s/√n = x̄ ± CONFIDENCE.NORM(alpha, standard_dev, size)
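
A minimal sketch of the same calculation in Python's standard library (the helper name margin_of_error_norm is ours, not Excel's); the usage lines reuse the movie theater example:

from statistics import NormalDist

def margin_of_error_norm(alpha, standard_dev, size):
    # Critical value z* leaving alpha/2 in each tail, mirroring
    # Excel's CONFIDENCE.NORM(alpha, standard_dev, size).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_dev / size ** 0.5

# Movie theater example: 90% range of likely sample means,
# centered at the historical mean 6.7, with s = 2.8 and n = 196.
m = margin_of_error_norm(0.10, 2.8, 196)
print(6.7 - m, 6.7 + m)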

To calculate the width of the confidence interval for small samples:

=CONFIDENCE.T(alpha, standard_dev, size) - alpha, the significance level, equals one minus the confidence level (for example, a 95% confidence interval would correspond to the significance level 0.05). - standard_dev is the standard deviation of the population distribution. We will typically use the sample standard deviation, s, which is our best estimate of our population's standard deviation. - size is the sample size, n. CONFIDENCE.T returns the margin of error, which we can add and subtract from the sample mean.
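
A sketch of the same small-sample margin of error in Python, assuming SciPy is available for the t-distribution (the helper name margin_of_error_t is ours):

from scipy.stats import t

def margin_of_error_t(alpha, standard_dev, size):
    # Critical value t* with n - 1 degrees of freedom, mirroring
    # Excel's CONFIDENCE.T(alpha, standard_dev, size).
    t_star = t.ppf(1 - alpha / 2, df=size - 1)
    return t_star * standard_dev / size ** 0.5

# Small-sample movie theater example (n = 15).
m = margin_of_error_t(0.10, 2.8, 15)
print(6.7 - m, 6.7 + m)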

Dummy Variable Excel function

=IF(logical_test,[value_if_true],[value_if_false]) To make this function assign the value 1 if the referenced cell is "Yes," and 0 if it is not (in this case, if the referenced cell is "No"), we would enter the IF function for every observation. In this example, the following formula refers to the first observation in cell A2. =IF(A2="Yes",1,0) The number of dummy variables we include must always be one fewer than the number of options in a category!
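
A minimal Python sketch of the same 0/1 conversion (the response list is hypothetical):

# Hypothetical column of survey responses.
responses = ["Yes", "No", "Yes", "Yes", "No"]

# Equivalent of entering =IF(A2="Yes",1,0) for every observation.
dummies = [1 if r == "Yes" else 0 for r in responses]
print(dummies)  # [1, 0, 1, 1, 0]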

Cumulative Probability Excel function

=NORM.DIST(x, mean, standard_dev, cumulative) x is the value at which you want to evaluate the distribution function. mean is the mean of the distribution. standard_dev is the standard deviation of the distribution. cumulative is an argument that specifies the type of probability we wish to calculate. We insert "TRUE" to indicate that we wish to find the cumulative probability, that is, the probability of being less than or equal to the x-value. (Inserting the value "FALSE" provides the height of the normal distribution at the value x, which we will not cover in this course.)

Standard Normal Curve Excel function

=NORM.DIST(x,0,1,TRUE) or, equivalently, =NORM.S.DIST(z, cumulative). z is the value (the z-value) at which we want to evaluate the standard normal distribution function. cumulative is an argument that specifies the type of probability we wish to calculate. We will insert "TRUE".

Excel function for starting with a cumulative probability and finding the corresponding value on a normal curve

=NORM.INV(probability, mean, standard_dev) probability is the cumulative probability for which we want to know the corresponding x-value on a normal distribution. mean is the mean of the distribution. standard_dev is the standard deviation of the distribution.
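
Equivalent calculations sketched in Python's standard library, using the women's-height example (mean 63.5, standard deviation 2.5) that appears elsewhere in this set:

from statistics import NormalDist

heights = NormalDist(mu=63.5, sigma=2.5)

# Cumulative probability, like =NORM.DIST(66, 63.5, 2.5, TRUE).
print(heights.cdf(66))

# Value for a given cumulative probability, like =NORM.INV(0.99, 63.5, 2.5).
print(heights.inv_cdf(0.99))  # about 69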

Excel function to return the p-value associated with a given test

=T.TEST(array1, array2, tails, type) - array1 is a set of numerical values or cell references. - array2 is a set of numerical values or cell references. If we only have one set of data, for the second data set we create a column for which every entry is the historical mean. - tails is the number of tails for the distribution. It should be set to 1 to perform a one-sided test; to 2 to perform a two-sided test. - type can be 1, 2, or 3. - Type 1 is a paired test and is used when the same group is tested twice to provide paired "before and after" data for each member of the group. - Type 2 is an unpaired test in which the samples are assumed to have equal variances. - Type 3 is an unpaired test in which the samples are assumed to have unequal variances. Unless we have a good reason to believe two samples have equal variances, we typically use type 3 when conducting an unpaired test.
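
Roughly equivalent calls in Python, assuming SciPy is available (SciPy's t-tests return two-sided p-values by default; the data are hypothetical):

from scipy.stats import ttest_ind, ttest_rel

sample_a = [6.2, 7.1, 5.9, 6.8, 7.4, 6.5]  # hypothetical data
sample_b = [5.4, 6.0, 5.8, 6.1, 5.7, 5.9]

# Like T.TEST with type 3: unpaired, unequal variances (Welch's t-test).
print(ttest_ind(sample_a, sample_b, equal_var=False).pvalue)

# Like T.TEST with type 1: paired "before and after" data.
print(ttest_rel(sample_a, sample_b).pvalue)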

range of likely sample means

A confidence interval around the population mean assumed under the null hypothesis. The width of the range is determined by the sample standard deviation, the sample size, and the desired confidence level. When a sample mean falls outside the range of likely sample means, we reject the null hypothesis at the stated confidence level. = historical mean ± CONFIDENCE.NORM(alpha, standard_dev, size)

Hidden variables

A hidden variable is a variable that is correlated with each of two variables (such as ice cream and snow shovel sales) that are not fundamentally related to each other. That is, there is no reason to think that a change in one variable will lead to a change in the other; in fact, the correlation between the two variables may seem surprising until the hidden variable is considered. Although there is no direct relationship between these two variables, they are mathematically correlated because each is correlated individually with a third "hidden" variable. Therefore, for a variable to act as a hidden variable, there must be three variables, all of which are mathematically correlated (either directly or indirectly).

two-sided hypothesis test

A hypothesis test that tests for any difference in a parameter (e.g., if the mean of one group is different -- either greater than or less than -- the mean of another group). In a two-sided test, the null hypothesis is that the parameter is the same (e.g., that the means of two groups are the same), whereas the alternative hypothesis is that the parameter is different (e.g., the means of two groups are different). The rejection region for a two-sided hypothesis test is divided into two parts in the tails of the distribution.

Note that the p-value and R2 provide different information...

A linear relationship can be significant (have a low p-value) but not explain a large percentage of the variation (not have a high R2)

variance

A measure of the spread of a data set's values around its mean value. If the true population mean is known, the variance is equal to the sum of the squares of the differences between each point of the data set and the population mean, divided by the total number of data points. If the mean is estimated from a sample, the variance is equal to the sum of the squares of the differences between each point of the data set and the sample mean, divided by the total number of data points in the sample minus one. The variance is the square of the standard deviation. The variance is measured in squared units (e.g., if the data set contains data denominated in dollars, the variance will be in squared dollars). Population variance: σ2 = sum of squared differences / n (the average of the squared differences). Sample variance: s2 = sum of squared differences / (n−1). =VAR.S(number 1, [number 2], ...)

standard deviation

A measure of the spread of a data set's values around its mean value. The standard deviation is the square root of the variance. The standard deviation is measured in the same units (such as dollars or hours) as the observations in the data. Population Standard Deviation σ= square root of the population variance Sample Standard Deviation s= square root of the sample variance =STDEV.S(number 1, [number 2], ...) We can also find the standard deviation using the Excel function =SQRT(number) to take the square root of the variance. For example, =SQRT(16)=4.
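
The same measures, sketched with Python's standard statistics module (the data set is hypothetical):

import statistics

data = [12, 15, 9, 14, 10]  # hypothetical data set

print(statistics.pvariance(data))  # population variance (divides by n)
print(statistics.variance(data))   # sample variance (divides by n - 1), like =VAR.S
print(statistics.pstdev(data))     # population standard deviation
print(statistics.stdev(data))      # sample standard deviation, like =STDEV.S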

correlation coefficient

A measure of the strength of a linear relationship between two variables. The correlation coefficient can range from -1 to +1. A correlation coefficient of -1 indicates a perfect negative linear relationship between two variables, whereas a correlation coefficient of +1 indicates a perfect positive linear relationship. A correlation coefficient of 0 indicates that no linear relationship exists between two variables, though it is possible that a non-linear relationship exists between the two variables. =CORREL(array 1, array 2)

null hypothesis

A null hypothesis is a statement about a topic of interest, typically based on historical information or conventional wisdom. We start a hypothesis test by assuming that the null hypothesis is true and then test to see if we can nullify it, which is why it's called the "null" hypothesis. The null hypothesis is the opposite of the hypothesis we are trying to substantiate (the alternative hypothesis).

Prediction Interval

A prediction interval is a range of values constructed around a point forecast. The center of the prediction interval is the point forecast, that is, the expected value of y for a specified value of x. The range of the interval extends above and below the point forecast. The width of the interval is based on the standard error of the regression, the desired level of confidence in the prediction, and the location of the x-value of interest in relation to the historical values of the independent variable. As the confidence level increases, the width of the prediction interval increases. As we move to the edge of, and beyond, the range of historical data, the width of the prediction interval increases.

single-population hypothesis test

A test in which a single population is sampled to test whether a parameter's value is different from a specific value (often a historical average).

two-population hypothesis test

A test in which samples from two different populations are compared to see if the parameter of interest is different between the two populations.

Lagged Variables

A type of independent variable often used in regression analysis. When data are collected as a time series, a regression analysis is often performed by analyzing values of the dependent variable with independent variables from the same time period. However, if researchers hypothesize that there is a relationship between the dependent variable and values of an independent variable from a previous time period, they may include a "lagged variable," that is, an independent variable based on data from a previous time period. Lagged variables are used to capture the ongoing effects of a given variable. The lag period is based on managerial insight and data availability.

Outlier

A value that falls far from the rest of the data (we should always carefully investigate an outlier before deciding whether to leave it as is, change its value to the correct value, or remove it).

z-value

A z-value of a point x is the distance x lies from the mean, measured in standard deviations: z = (x−μ)/σ

Adjusted R-Squared

Because R2 never decreases when independent variables are added to a regression, it is important to multiply it by an adjustment factor when assessing and comparing the fit of a multiple regression model. This adjustment factor compensates for the increase in R2 that results solely from increasing the number of independent variables. A measure of the explanatory power of a regression analysis. Adjusted R-squared is equal to R-squared multiplied by an adjustment factor that decreases slightly as each independent variable is added to a regression model. Unlike R-squared, which can never decrease when a new independent variable is added to a regression model, Adjusted R-squared drops when an independent variable is added that does not improve the model's true explanatory power. Adjusted R2 should always be used when comparing the explanatory power of regression models that have different numbers of independent variables. The adjusted R-squared is not a measure from which we can estimate the variability explained by the model - an adjusted R-squared can even be negative when we have included too many independent variables that don't add enough explanatory power! Adjusted R2 is provided in the regression output.

In addition to analyzing R2, we must test whether the relationship between the dependent and independent variable is:

significant and whether the linear model is a good fit for the data. We do this by analyzing the p-value (or confidence interval) associated with the independent variable and the regression's residual plot.

The interquartile range (IQR) OPTIONAL

the difference between the upper and lower quartiles, that is, IQR=Q3-Q1. We then multiply the IQR by 1.5 to find the appropriate range, computing 1.5(IQR)=1.5(Q3-Q1). A data point is an outlier if it is less than Q1-1.5(IQR) or greater than Q3+1.5(IQR).
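
A small Python sketch of this rule (statistics.quantiles with method="inclusive" computes quartiles directly from the data points; the data are hypothetical):

import statistics

data = [2, 3, 4, 4, 5, 5, 6, 7, 8, 30]  # hypothetical data; 30 looks suspicious

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(outliers)  # [30]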

A confidence interval associated with an independent variable's coefficient indicates:

the likely range for that coefficient. If the 95% confidence interval does not contain zero, we can be 95% confident that there is a significant linear relationship between the variables.

median

the middle value of the data set: half of the data set's values lie below the median, and half lie above the median To find the median, first arrange the values in order of magnitude. If the total number of data points is odd, the median is the value that lies in the middle. If the total number is even, the median is the average of the two middle values. =MEDIAN(number 1, [number 2], ...)

We should also analyze in multiple regression:

the p-values of the independent variables to determine whether there is a significant relationship between the variables in the model. If the p-value of each of the independent variables is less than .05, we can conclude that there is sufficient evidence to say that we are 95% confident that there is a significant linear relationship between the independent and dependent variables.

mode

the value that occurs most frequently in the data set. A data set may have multiple modes. =MODE.SNGL(number 1, [number 2], ...)

Large Sample

typically defined as 30 or more data points

mediating variables

variables which are affected by one variable, and then affect another variable in turn. e.g. being worried about grades may cause a student to study harder, and thus get better grades, but we wouldn't consider studying to be a hidden variable linking worry and getting better grades. Those two variables ARE fundamentally related, in that the worry is leading to the better grades. If students are more worried, they may study harder and get even better grades.

To find the sample size necessary to ensure a specified margin of error is less than or equal to a given distance, M, we...

we rearrange the equation and solve for the sample size, n. z*σ/√n≤M --> n≥(z*σ/M)^2 Recall that we usually don't know σ, the true standard deviation of the population. Moreover, when determining the appropriate sample size, we typically would not have even taken a sample yet, so we don't have a sample standard deviation. In a case like this, we could take a preliminary sample and use that sample's standard deviation, s, as an estimate of σ. Thus, to ensure that the margin of error is less than M, the sample size must satisfy: n≥(z*s/M)^2
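
A minimal Python sketch of this calculation (the helper name required_sample_size and the example numbers are ours):

import math
from statistics import NormalDist

def required_sample_size(confidence, s, M):
    # z* for a two-sided interval at the given confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * s / M) ** 2)

# e.g., 95% confidence, preliminary sample standard deviation 2.8,
# margin of error at most 0.5.
print(required_sample_size(0.95, 2.8, 0.5))  # 121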

To estimate the population mean with a small sample...

we use a t-distribution instead of a "z-distribution", that is, a normal distribution. A t-distribution looks similar to a normal distribution but is not as tall in the center and has thicker tails. These differences reflect the fact that a t-distribution is more likely than a normal distribution to have values farther from the mean. Therefore, the normal distribution's "rules of thumb" do not apply. The shape of a t-distribution depends on the sample size; as the sample size grows towards 30, the t-distribution becomes very similar to a normal distribution. Smaller samples have greater uncertainty, which means wider confidence intervals.

Confidence Interval Equation

x̄ ± z*s/√n Increasing the confidence level or decreasing the sample size INCREASES the width of the confidence interval: as n decreases, s/√n increases, so the width of the confidence interval increases.

mean

x̅ The mean is equal to the sum of all of the data points in a set, divided by n, the number of data points. =AVERAGE(number 1, [number 2], ...) Or =SUM(A1:A5)/COUNT(A1:A5). The COUNT function counts the number of cells that contain numerical values, so in this case =SUM(A1:A5)/COUNT(A1:A5) is equivalent to =SUM(A1:A5)/5.

The true relationship between two variables is described by the equation:

y=α+βx+ε, where ε is the error term (ε=y−ŷ). The idealized equation that describes the true regression line is ŷ =α+βx.

The structure of the single variable linear regression line is:

ŷ = a + bx - ŷ is the expected value of y, the dependent variable, for a given value of x. - x is the independent variable, the variable we are using to help us predict or better understand the dependent variable. - a is the y-intercept, the point at which the regression line intersects the vertical axis. This is the value of ŷ when the independent variable, x, is set equal to 0. - b is the slope, the average change in the dependent variable y as the independent variable x increases by one.
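
A minimal Python sketch of fitting this line by least squares, using the standard library's linear_regression (Python 3.10+; the data are hypothetical):

from statistics import linear_regression

x = [1, 2, 3, 4, 5]            # hypothetical independent variable
y = [2.1, 3.9, 6.2, 7.8, 9.9]  # hypothetical dependent variable

slope, intercept = linear_regression(x, y)  # least-squares fit
print(f"y-hat = {intercept:.2f} + {slope:.2f}x")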

Cross-Sectional

Cross-sectional data contain data that measure an attribute across multiple different subjects (e.g. people, organizations, countries) at a given moment in time or during a given time period. The average oil consumption of ten countries in 2012 is an example of cross-sectional data. Managers use cross-sectional data to compare metrics across multiple groups.

Residual Plots for Multiple Regression

For multiple regression models, because it is difficult to view the data in a simple scatter plot, residual plots are an indispensable tool for detecting whether the linear model is a good fit. - There is a residual plot for each independent variable included in the regression model. - We can graph a residual plot for each independent variable to help detect patterns such as heteroskedasticity and nonlinearity. - As with single variable regression models, if the underlying relationship is linear, each of the residuals follows a normal distribution with a mean of zero and fixed variance.

To create a regression model in Excel:

From the Data menu, select Data Analysis, then select Regression. In this example, the Input Y Range is F1:F59 and the Input X Range is C1:E59. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.

rejection region

In a hypothesis test, the rejection region is the region outside the range of likely sample means. If the sample statistic falls in the rejection region, there is sufficient evidence to reject the null hypothesis at the confidence level used to create the range of likely sample means.

Even though our question has only two possible responses, we still have to address an inherent uncertainty: how often will each response occur?

In such cases, we usually convey the survey results by reporting p̄, the percent of the total number of responses that were "yes" responses. p̄ is our best estimate of our variable of interest, p, the true percentage of "yes" responses in the underlying population. Because every respondent must answer "yes" or "no," we know that the percentage of "no" responses equals 1−p̄.

type II error

Incorrectly failing to reject a false null hypothesis (e.g., concluding that there is no difference between two groups, when there is, in fact, a difference); also called a false negative. If you are performing a hypothesis test based on a 90% confidence level, it is not possible to know your chances of making a type II error without more information.

type 1 error/false positive

Incorrectly rejecting a true null hypothesis (e.g., concluding that there is a difference between two groups when there is not one); also called a false positive.

What if our sample size is small? Is there a way to estimate the population mean even if we have only a handful of data points? Can we still create a confidence interval?

It depends: if we don't know anything about the underlying population, we cannot create a confidence interval with fewer than 30 data points because the properties of the Central Limit Theorem may not hold. However, if the underlying population is roughly normally distributed, we can use a confidence interval to estimate the population mean as long as we modify our approach slightly. We can gain insight into whether a data set is approximately normally distributed by looking at the shape of a histogram of that data. There are formal tests of normality that are beyond the scope of this course.

The lower quartile

Q1, 25% of all observations fall below Q1 The 25th percentile is the smallest value that is greater than or equal to 25% of the data points.

The upper quartile

Q3, 75% of all observations fall below Q3

R^2

R2 measures the percent of total variation in the dependent variable, y, that is explained by the regression line. R2 = Variation explained by the regression line / Total variation = Regression Sum of Squares / Total Sum of Squares. 0 ≤ R2 ≤ 1. For a single variable linear regression, R2 is equal to the square of the correlation coefficient.
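
A short sketch, in Python 3.10+, of computing R2 from its definition and checking it against the squared correlation coefficient (the data are hypothetical):

from statistics import correlation, linear_regression, mean

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 9.9]

slope, intercept = linear_regression(x, y)
y_hat = [intercept + slope * xi for xi in x]

sst = sum((yi - mean(y)) ** 2 for yi in y)             # total sum of squares
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
r_squared = 1 - ssr / sst

# For single variable regression, R2 equals the squared correlation.
print(r_squared, correlation(x, y) ** 2)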

When is sample size particularly important?

Sample size is particularly important when dealing with very small (or very large) proportions. Suppose we are sampling to find the prevalence of Amyotrophic Lateral Sclerosis (ALS), a disease commonly known as Lou Gehrig's disease. In the United States, an estimated six to eight people per 100,000 have ALS. That is, the likelihood that a person in the U.S. has ALS is between 0.00006 and 0.00008, or between 0.006% and 0.008%. Would our sample be useful if we surveyed 100 people? No. Since the proportion we are estimating is very small, we need to have a large enough sample to make sure that it includes at least SOME people with the disease.

Single Variable Linear Regression

Single Variable Linear Regression analysis is used to identify the best fit line between two variables. This analysis builds on two previous concepts we have used to study relationships between two variables: 1) Scatter plots, which are useful for visualizing a relationship between two variables. 2) The correlation coefficient, a value between -1 and 1 that measures the strength and direction (positive or negative) of the linear relationship between two variables. A coefficient in a single variable linear regression characterizes the gross relationship between the dependent variable and the independent variable.

Central Limit Theorem

The Central Limit Theorem states that if we take enough sufficiently large samples from any population, the means of those samples will be normally distributed, regardless of the shape of the underlying population.
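
A quick simulation sketch of the theorem in Python's standard library (the population here is hypothetical and deliberately skewed):

import random
import statistics

random.seed(0)
# A deliberately skewed, non-normal population (exponential).
population = [random.expovariate(1.0) for _ in range(100_000)]

# Means of many sufficiently large samples are approximately normal.
sample_means = [statistics.mean(random.sample(population, 50))
                for _ in range(2_000)]

print(statistics.mean(sample_means))   # close to the population mean (~1.0)
print(statistics.stdev(sample_means))  # close to population stdev / sqrt(50)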

Residual Sum of Squares

The amount of variation that is not explained by the regression line. The residual sum of squares is equal to the sum of the squared residuals, that is, the sum of the squared differences between the observed values of the dependent variable and the predicted values of the dependent variable. To calculate the residual sum of squares, subtract the regression sum of squares from the total sum of squares.

range

The distance between the smallest and greatest values in a data set. Range is the simplest measure of the variability of a data set. Maximum value-Minimum value

Distribution of Sample Means

The distribution of the sample means of a population more closely approximates a normal curve as we increase the number of samples and/or the sample size. The mean of any single sample lies on the normally distributed Distribution of Sample Means, so we can use the normal curve's special properties to draw conclusions from a single sample mean. The mean of the Distribution of Sample Means equals the mean of the population distribution. The standard deviation of the Distribution of Sample Means equals the standard deviation of the population distribution divided by the square root of the sample size. Thus, increasing the sample size decreases the width of the Distribution of Sample Means.

Confidence level trade-off

The higher the confidence level (and therefore the lower the significance level), the lower the chance of rejecting the null hypothesis when it is true (type I error or false positive). But the higher the confidence level, the higher the chance of not rejecting it when it is false (type II error or false negative).

conditional mean

The mean of a specific subset of data...we are imposing a condition on our data set and we only want to find the mean of the values that meet that condition. =AVERAGEIF(range, criteria, [average_range]) - range contains the one or more cells to which we want to apply the criteria or condition. - criteria is the condition that is to be applied to the range. - [average_range] is the range of cells containing the data we wish to average.

Suppose we want to know the range of values associated with the "middle" 99% of a normal distribution?

The normal curve is symmetrical, so we know that the middle 99% of the distribution comprises 49.5% on either side of the mean and excludes 0.5% on each of the tails. Thus we can find the value corresponding to the left side of the range using the NORM.INV function evaluated at 0.5% and the right side using the NORM.INV function evaluated at 99.5%. In this case, the values associated with the middle 99% are NORM.INV(0.005,63.5,2.5)=57.1 and NORM.INV(0.995,63.5,2.5)=69.9.

Normal Distribution

The normal distribution is a symmetric, bell-shaped continuous distribution, with a peak at the mean. A normal distribution is completely determined by two parameters, its mean and standard deviation. Approximately 68% of a normal distribution's outcomes fall within one standard deviation of the mean, and approximately 95% of its outcomes fall within two standard deviations of the mean. The mean, median, and mode of a normal distribution are equal. Using the properties of the normal distribution, we can calculate a probability associated with any range of values. Rules of thumb: 1. About 68% of the probability is contained in the range reaching one standard deviation away from the mean on either side, that is, P(μ−σ≤x≤μ+σ)≈68% 2. About 95% of the probability is contained in the range reaching two standard deviations (1.96 to be exact!) away from the mean on either side, that is, P(μ−2σ≤x≤μ+2σ)≈95% 3. About 99.7% of the probability is contained in the range reaching three standard deviations away from the mean on either side, that is, P(μ−3σ≤x≤μ+3σ)≈99.7%

p-value

The p-value is the likelihood of obtaining a sample as extreme as the one we've obtained, if the null hypothesis is true. The p-value of a one-sided hypothesis test is half the p-value of a two-sided hypothesis test. A p-value can never be exactly 0. However, p-values can be very close to 0. What this means is that if the null hypothesis were true, you would be extremely unlikely to see data as extreme as the sample you are analyzing. However, there is still a chance (albeit a small one), that the null hypothesis is true, and you just drew a very unusual sample. Very small p-values are often seen and analyzed in high-throughput screening (HTS) processes that automatically test hundreds of thousands of active compounds to identify those with particular properties. (Note that if your software package reports p-values only to four digits, any p-value less than 0.00005 will appear in the output as 0.0000.) To test a two-sided hypothesis at the 90% confidence level, you would use a p-value of 1.0 - 0.9 = 0.1, or 10%. As we've seen in the graphs in the course, that 10% is split into two regions in each tail of the normal distribution, each with 5% probability. These are the rejection regions - if our sample falls into either region we reject the null hypothesis.

cumulative probability

The probability of all values less than or equal to a particular value is called a cumulative probability

Gross Relationship

The relationship between a single independent variable and a dependent variable. The gross relationship is affected by any variables that are related to the independent and/or dependent variable but are not included in the model.

Residual plots

The residual plot is a scatter plot with residuals on the y-axis and the independent variable on the x-axis. The plot graphically represents the residual (the difference between the observed value and predicted value of the dependent variable) for each observation. Examining residual plots can provide significant insight into the relationships among variables and the validity of the assumptions underlying regression models. - Each observation in a data set has a residual equal to the historically observed value minus the regression's predicted value, that is, e = y − ŷ - Linear regression models assume that the regression's residuals follow a normal distribution with a mean of zero and fixed variance

Standard Normal Curve

The standard normal curve is a normal distribution whose mean is equal to zero (μ=0), and whose standard deviation is equal to one (σ=1).

margin of error of the confidence interval

The term z*σ/√n, estimated by z*s/√n, is often referred to as the confidence interval's "margin of error."

percentile

The value of a variable for which a certain percentage of the data set falls below. For example, if 87% of students taking the GMAT exam earn scores below 670, the 87th percentile for the GMAT exam is 670 points. =PERCENTILE.INC(array, k) - array is the range of data for which we want to calculate a given percentile. - k is the percentile value. For example, if we want to know the 95th percentile, k would be 0.95.

Total Sum of Squares

The total variation of the dependent variable, that is, the sum of squared differences between the observed values of the dependent variable and the mean of the dependent variable. The total sum of squares is equal to the regression sum of squares plus the residual sum of squares.

Time Series

Time series data contain data about a given subject in temporal order, measured at regular time intervals (e.g. minutes, months, or years). U.S. oil consumption from 2002 through 2012 is an example of a time series. Managers collect and analyze time series to identify trends and predict future outcomes.

How do we perform regression analyses using qualitative, or categorical, variables?

To do so, we must convert data to dummy (0,1) variables. After that, we can proceed as we would with any other regression analysis. A dummy variable is equal to 1 when the variable of interest fits a certain criterion. For example, a dummy variable for "Female" would equal 1 for all female observations and 0 for male observations. Qualitative variables that can be sorted or grouped into categories must be transformed into dummy variables. These include: direction (North, South, East, and West) and country telephone codes. Even though country telephone codes are numerical values, the values do not have the usual numerical properties. For example, the country code for Russia is 7, for Egypt is 20, and for South Africa is 27. Although 7 + 20 = 27, it does not make sense to say that Russia + Egypt = South Africa. The country codes only make sense as categories, and thus require the use of dummy variables to include the codes in a regression analysis. A variable that can be counted or measured and is naturally represented as a number does not need to be represented as a dummy variable. These include: temperature (in degrees Celsius) and volume (in cubic meters).

Suppose we want to calculate the value associated with the upper tail of a distribution, that is, the probability of an outcome greater than a specified value.

We first need to calculate the value associated with the corresponding cumulative probability, which is one minus the probability of the upper tail. For example, the height that 1% of women are taller than is the same as the height that 99% of women are shorter than. Thus we calculate the value associated with the top 1% by entering the function =NORM.INV(0.99,63.5,2.5)≈69.

When estimating true population proportion

We must ensure that the sample size is large enough by checking that both of the following conditions are true: n·p̄ ≥ 5 and n·(1−p̄) ≥ 5, where n is the sample size and p̄ is the sample proportion. If either of these guidelines is not satisfied, we must collect a larger sample.
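
A small Python sketch of this check (the function name proportion_sample_ok is ours; the tiny proportion comes from the ALS example elsewhere in this set):

def proportion_sample_ok(n, p_bar):
    # Both guidelines must hold: n*p-bar >= 5 and n*(1 - p-bar) >= 5.
    return n * p_bar >= 5 and n * (1 - p_bar) >= 5

print(proportion_sample_ok(100, 0.00007))      # False: n far too small
print(proportion_sample_ok(200_000, 0.00007))  # True: n*p-bar = 14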

Histogram

We use histograms to help us visualize a single variable's distribution. A histogram's x-axis represents bins corresponding to ranges of data; its y-axis indicates the frequency of observations falling into each bin. For example, bin 2 contains all values greater than 1 and less than or equal to 2. The value 2 is represented by the gray vertical grid line between bins 2 and 3 (not where the bin label 2 appears). The best bin size depends on what we are trying to learn from the data. Using larger bins can simplify a histogram but may make it difficult to see trends in the data. Very small bins can have such low frequencies that it becomes difficult to discern patterns.

It is important to evaluate several metrics in order to determine whether a single variable linear regression model is:

a good fit for a data set, rather than looking at individual metrics in isolation.

Scatter Plot

A graph of two data sets that can reveal relationships between the two variables. Although there may be a relationship between two variables, we cannot conclude that one variable "causes" the other: "correlation does not imply causation."

"central tendency" of a data set

an indication of where the "center" of the data set lies

How do we calculate p̄?

assign "dummy variable" to each response—a variable that can take on only the values 0 and 1. (Dummy variables are also called indicator variables or binary variables

Coefficient of Variation

Compares variation in two data sets; the coefficient of variation is the ratio of the standard deviation to the mean, that is, the relative variation: coefficient of variation = standard deviation / mean

We use regression analysis to:

forecast the dependent variable, y, within the historically observed range of the independent variable, x - We determine a point forecast by entering the desired value of x into the regression equation. - We must be extremely cautious about using regression to forecast for values outside of the historically observed range of the independent variable (x-values).

The p-value of an independent variable:

is the result of the hypothesis test that tests whether there is a significant linear relationship between the dependent and independent variable; that is, it tests whether the slope of the regression line is zero, H0:β=0 and Ha:β≠0. - If the coefficient's p-value is less than 0.05, we reject the null hypothesis and conclude that we have sufficient evidence to be 95% confident that there is a significant linear relationship between the dependent and independent variables. - Note that the p-value and R2 provide different information. A linear relationship can be significant (have a low p-value) but not explain a large percentage of the variation (not have a high R2.) We reach a specified level of confidence when our p-value is less than 1-confidence level.

In addition to analyzing Adjusted R2, we must test whether the relationship between the independent and dependent variables is:

linear and significant. We do this by analyzing the regression's residual plots and the p-values associated with each independent variable's coefficient.

The sample size gives us a confidence interval that extends a distance...

m = z*s/√n; the distance m is the confidence interval's margin of error

descriptive/summary statistics

Provide a quick overview of a data set without showing every data point. From the Data menu, select Data Analysis, then select Descriptive Statistics. Make sure to check the Labels in First Row box, and check Summary Statistics so that the output table is generated.

confidence interval for a population mean

A range constructed around a sample mean that estimates the true population mean. The confidence level of a confidence interval indicates how confident we are that the range contains the true population mean. For example, we are 95% confident that a 95% confidence interval contains the true population mean. The confidence level is equal to 1 - significance level. A confidence interval depends on the sample's mean, standard deviation, and sample size; it also depends on how "confident" we would like to be that the range contains the true mean of the population. The 95% confidence interval estimates the population mean and does not tell us about the distribution of the population. We construct a confidence interval around the sample mean to draw conclusions about the population mean, not the sample mean. For example, if the sample mean is 50 and the 95% confidence interval is 42 to 58, there is a 100% chance that the range includes the sample mean, but there is not a 95% chance that the population mean lies between 42 and 58. If multiple 95% confidence intervals were calculated to estimate the population mean, on average, 95% of these confidence intervals would contain the true mean. The confidence interval's level of confidence does not tell us the chance, probability, or likelihood that an individual confidence interval contains the true population mean. We can say that we are 95% confident that the true population mean is within the range, based on the methods we used to calculate it. If we were to construct similar intervals for 100 samples drawn from this population, on average 95 of the intervals would contain the true population mean.

Multiple Regression

A regression analysis with two or more independent variables. We use multiple regression to investigate the relationship between a dependent variable and multiple independent variables. The structure of the multiple regression equation is ŷ = a + b1x1 + b2x2 + ... + bkxk. The true relationship between multiple variables is described by y = α + β1x1 + β2x2 + ... + βkxk + ε, where ε is the error term. The idealized equation that describes the true regression model is ŷ = α + β1x1 + β2x2 + ... + βkxk. Coefficients in multiple regression characterize relationships that are net with respect to the independent variables included in the model but gross with respect to all omitted independent variables. For multiple regression we rely less on scatter plots and more on numerical values and residual plots because visualizing three or more variables can be difficult. Forecasting with a multiple regression equation is very similar to forecasting with a single variable linear model. However, instead of entering only one value for a single independent variable, we input a value for each of the independent variables.
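
A minimal sketch of fitting such a model in Python, assuming NumPy is available (the data and variable names are hypothetical, not from the course):

import numpy as np

# Hypothetical data: two independent variables, one dependent variable.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([4.1, 4.9, 8.2, 8.8, 11.9])

# Design matrix with a column of ones for the intercept a.
X = np.column_stack([np.ones_like(x1), x1, x2])
a, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares coefficients
print(f"y-hat = {a:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")

# Forecast: enter a value for each independent variable.
print(a + b1 * 2.5 + b2 * 3.0)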

Net Relationship

A relationship between an independent variable and a dependent variable that controls for other independent variables in a multiple regression. Because we can never include every variable that is related to the independent and dependent variables, we generally consider the relationship between the independent & dependent variables to be net with regard to the other independent variables in the model, and gross with regard to variables that are not included.

survey

A research method in which a sample of respondents provide self-reported data in response to questions or prompts. To ensure that survey data are representative of a population, respondents must be randomly selected (i.e., each member of the population should have the same likelihood of being selected to complete the survey), and the survey questions must be free of bias. Note that simply asking a randomly selected set of people to complete a survey is insufficient; the respondents themselves must be representative of the population.

observational study

A research method in which researchers collect data without manipulating the observations or the variables under investigation. An observational study provides insight into relationships, but does not establish causality.

experiment

A research method used to test whether changing the values of one or more independent variables changes the value of the dependent variable. An experiment involves randomly dividing observations into two or more groups; each group is treated the same except for the independent variable(s), which is (are) systematically varied across the groups. The dependent variable is then measured, and the differences across the groups are tested for statistical significance.

Multicollinearity

A situation that occurs when two independent variables are so highly correlated that it is difficult for the regression model to separate the relationship between each variable and the dependent variable. Multicollinearity can obscure the results of a regression analysis. If adding a new independent variable decreases the significance of another independent variable in the model that was previously significant, multicollinearity may well be the culprit. Another symptom of multicollinearity is when the R-square of a regression is high but none of the independent variables are significant. Multicollinearity occurs when there is a strong linear relationship among two or more of the independent variables. Indications of multicollinearity include seeing an independent variable's p-value increase when one or more other independent variables are added to a regression model. We may be able to reduce multicollinearity by either increasing the sample size or removing one (or more) of the collinear variables.

alternative hypothesis

An alternative hypothesis is the theory or claim we are trying to substantiate, and is stated as the opposite of a null hypothesis. When our data allow us to nullify the null hypothesis, we substantiate the alternative hypothesis.

A/B test

An experiment that compares the value of a specified dependent variable (such as the likelihood that a web site visitor purchases an item) across two different groups (usually a control group and a treatment group). The members of each group must be randomly selected to ensure that the only difference between the groups is the "manipulated" independent variable (for example, the size of the font on two otherwise-identical web sites). An A/B test is a hypothesis test that tests whether the means of the dependent variable are the same across the two groups. (An A/B test can also be used to test whether another parameter, such as a standard deviation, is the same across two groups.)

