Business Analytics


Outliers

- Leave, change, or remove: these are the three options for handling an outlier.
- Technically, a data point is considered an outlier if it is more than a specified distance below the lower quartile or above the upper quartile of a data set.
- The lower quartile, Q1, is the 25th percentile: by definition, 25% of all observations fall below Q1. The upper quartile, Q3, is the 75th percentile: 75% of all observations fall below Q3.
- The interquartile range (IQR) is the difference between the upper and lower quartiles, that is, IQR = Q3 - Q1. We then multiply the IQR by 1.5 to find the appropriate range, computing 1.5(IQR) = 1.5(Q3 - Q1).
- A data point is an outlier if it is less than Q1 - 1.5(IQR) or greater than Q3 + 1.5(IQR).
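As a quick illustration of the 1.5(IQR) rule above, here is a minimal Python sketch (the data values are made up). The "inclusive" quartile method in the standard library matches Excel's QUARTILE.INC/PERCENTILE.INC convention:

```python
import statistics

def iqr_outliers(data):
    """Return the values flagged as outliers by the 1.5*IQR rule."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

# 100 falls far above Q3 + 1.5(IQR), so it is flagged as an outlier.
print(iqr_outliers([1, 2, 3, 4, 5, 100]))
```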

To calculate the values for the mean, median, and mode in Excel, we can use the following formulas:

- =AVERAGE(number 1, [number 2], ...)
- =MEDIAN(number 1, [number 2], ...)
- =MODE.SNGL(number 1, [number 2], ...)
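For reference, Python's standard library offers rough equivalents of these three Excel formulas (the data below is made up):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]
print(statistics.mean(data))    # like =AVERAGE(...): the mean is 5
print(statistics.median(data))  # like =MEDIAN(...): average of 3 and 5, i.e. 4
print(statistics.mode(data))    # like =MODE.SNGL(...): 3 appears most often
```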

Hidden Variables

- A hidden variable is a variable that is correlated with each of two variables (such as ice cream and snow shovel sales) that are not fundamentally related to each other.
- That is, there is no reason to think that a change in one variable will lead to a change in the other; in fact, the correlation between the two variables may seem surprising until the hidden variable is considered.
- Although there is no direct relationship between these two variables, they are mathematically correlated because each is correlated individually with a third "hidden" variable. Therefore, for a variable to act as a hidden variable, there must be three variables, all of which are mathematically correlated (either directly or indirectly).

Business Analytics Summary

- Before gathering and analyzing data, we should always clearly articulate the question we wish to answer.
- Organizing data into logical categories can make it easier to perform analyses within a category or across multiple categories.
- Graphs are very useful for examining data sets, as they often reveal patterns and trends and help us detect outliers.
- One useful graph is a histogram.
  o A histogram's x-axis represents bins corresponding to ranges of data; its y-axis indicates the frequency of observations falling into each bin.
  o The best bin size depends on what we are trying to learn from the data.
  o Using larger bins can simplify a histogram, but may make it difficult to see trends in the data.
  o Very small bins can have such low frequencies that it is difficult to discern patterns.
- An outlier is a value that falls far from the rest of the data.
  o We should always carefully investigate outliers before deciding whether to leave them as is, change their values to the correct values, or remove them. We never change outliers' values or remove outliers without carefully investigating the situation.
- Skewness measures the degree of a graph's asymmetry.

Histograms

- A histogram displays the frequency, or number of data points (observations), that fall within specified bins.
- Histograms allow us to quickly discern trends or patterns in a data set and are easy to construct using programs such as Excel.
- On the horizontal axis, we display a series of single values, each of which represents a bin, or range of possible values. On the vertical axis, we display the frequency of the observations in each bin. With a small data set, we can count and assign data points to bins by hand, but with large data sets, this approach would be extremely tedious.
- By convention, Excel includes in the range the number represented by the bin label. So bin 1 includes all countries with oil consumption less than or equal to 1 million barrels per day (x≤1); bin 2 includes all countries with oil consumption greater than 1 but less than or equal to 2 million barrels per day (1<x≤2); and bin 19 includes all countries with oil consumption greater than 18 but less than or equal to 19 million barrels per day (18<x≤19).
- In order to create histograms in Excel, you may need to download the Analysis ToolPak, which is an add-in program. Please consult the Microsoft website for more information on how to install the Analysis ToolPak for your own version of Excel.
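The Excel binning convention described above (each bin label b collects values x with previous label < x ≤ b) can be sketched in a few lines of Python. The data values here are hypothetical oil-consumption figures, not the course's actual data set:

```python
def excel_bins(data, labels):
    """Count values into bins using Excel's inclusive-upper-edge rule.

    Each label b collects values x with previous_label < x <= b.
    Values above the largest label are simply not counted.
    """
    counts = {b: 0 for b in labels}
    for x in data:
        for b in labels:
            if x <= b:          # inclusive upper edge, like Excel
                counts[b] += 1
                break
    return counts

data = [0.4, 0.9, 1.0, 1.5, 2.0, 18.5]
print(excel_bins(data, labels=[1, 2, 19]))
```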

Percentile

- In addition to calculating measures of central tendency, sometimes we may want to know the value beneath which a certain percentage of the data lie. For example, we may want to find the 25th percentile, also known as the first quartile. The 25th percentile is the smallest value that is greater than or equal to 25% of the data points.
- Percentiles are often used to categorize test scores. For example, someone who scored in the 95th percentile of a test scored equal to or higher than 95% of all people who took that test. We can also say that person scored in the top 5%.
- Remember that the mean's location depends upon the distribution of the data set. Recall how the location of the mean differs for a symmetrical distribution and a skewed distribution.
- Remember that the mode's location depends upon the distribution of the data set. Therefore, there is no way to determine the percentile of the mode without more information about the data set.

=PERCENTILE.INC(array, k)

- array is the range of data for which we want to calculate a given percentile.
- k is the percentile value. For example, if we want to know the 95th percentile, k would be 0.95.
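As a sketch, Python's `statistics.quantiles` with the "inclusive" method computes the same quartile cut points as Excel's PERCENTILE.INC (the scores below are made up, evenly spaced values so the quartiles are easy to verify):

```python
import statistics

scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]
# method="inclusive" matches Excel's PERCENTILE.INC interpolation rule.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
print(q1, q2, q3)  # for this evenly spaced data: 30, 50, 70
```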

Hidden variables are not the same as "mediating variables," which are variables which are affected by one variable, and then affect another variable in turn. For example, being worried about grades

- may cause a student to study harder, and thus get better grades, but we wouldn't consider studying to be a hidden variable linking worry and getting better grades. Those two variables ARE fundamentally related, in that the worry is leading to the better grades. If students are more worried, they may study harder and get even better grades.
- may cause a student to stress eat and gain weight, but we wouldn't consider eating to be a hidden variable linking worry and weight gain. Those two variables ARE fundamentally related, in that the worry is leading to the weight gain. If students are more worried, they may gain even more weight.
- In this situation, we'd see a correlation between weight gain and grades, driven by the hidden variable, worry. Students couldn't just eat more food and expect their grades to improve, nor could they make a point of doing poorly in their courses just to lose weight. These two variables are not fundamentally related.

Correlation

1) Range: Correlation coefficients include all values, and only values, from -1 to 1. Both 1 and -1 must be mentioned for this criterion to be answered correctly.
2) Magnitude: Correlations are stronger for coefficients that are closer to -1 or 1; this criterion could also be phrased correctly as correlations are stronger as the coefficient value moves farther from 0.
3) Directionality: A positive correlation coefficient indicates a positive relationship, meaning that as one variable increases, the other variable increases. A negative correlation coefficient indicates a negative relationship, meaning that as one variable increases, the other variable decreases. Both positive and negative correlations must be mentioned for this criterion to be answered correctly.
4) Non-linearity: Correlation coefficients measure only linear relationships; they may not provide insight into other types of relationships. Two variables with a correlation close to 0 or equal to 0 have little or no linear relationship; they may have no relationship at all or may have another type of relationship. A non-linear relationship may be visible in a scatterplot. If a response contains the information included in any of these three sentences, this criterion can be considered correctly answered.

=CORREL(array 1, array 2)

• array 1 is a set of numerical variables or cell references containing data for one variable of interest.
• array 2 is a set of numerical variables or cell references containing data for the other variable of interest.
• Note that the number of observations in array 1 must be equal to the number in array 2.
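A stdlib sketch of what =CORREL computes, the Pearson correlation coefficient, using made-up, perfectly linear data so the expected result is easy to check:

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient, like Excel's =CORREL."""
    assert len(xs) == len(ys), "arrays must be the same length"
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(correl([1, 2, 3, 4], [2, 4, 6, 8]))  # exactly linear: essentially 1.0
print(correl([1, 2, 3, 4], [8, 6, 4, 2]))  # exactly linear, negative: essentially -1.0
```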

If two data sets, A and B, have the same standard deviation, then:

1. The range of A might not equal the range of B. Data sets with equal standard deviations do not necessarily have the same range. The standard deviation is based upon each data point's distance from the mean and thus provides little information about the range, which is based only on the minimum and maximum values in the data set. Note that another option is also correct.
2. The variance of A must equal the variance of B. The variance is equal to the square of the standard deviation. If the standard deviations of A and B are equal, then the variances must also be equal. Note that another option is also correct.

=VAR.S(number 1, [number 2], ...)
=STDEV.S(number 1, [number 2], ...)

• number 1 is the first number, cell reference, or range of cells for which to calculate the specified value.
• [number 2], ... represents additional numbers, cell references, or ranges of cells. The square brackets indicate that the argument is optional.
• Note that the "S" in VAR.S and STDEV.S indicates that we are working with a sample. We will learn more about the differences between samples and populations in the next module.
• We can also find the standard deviation using the Excel function =SQRT(number) to take the square root of the variance. For example, =SQRT(16) = 4.
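The stdlib equivalents of =VAR.S and =STDEV.S (the sample versions) make the square-root relationship easy to verify; the data values are made up:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
var_s = statistics.variance(data)  # sample variance, like =VAR.S
std_s = statistics.stdev(data)     # sample standard deviation, like =STDEV.S
print(var_s, std_s)

# The standard deviation is the square root of the variance,
# mirroring =SQRT(...) applied to the variance in Excel.
print(math.isclose(std_s, math.sqrt(var_s)))
```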

Constructing a Range of Likely Sample Means

By convention, the range of likely sample means is usually constructed at the 95% confidence level.

To calculate a conditional mean in Excel, we can use the following formula:

=AVERAGEIF(range, criteria, [average_range])

- range contains the one or more cells to which we want to apply the criteria or condition.
- criteria is the condition that is to be applied to the range.
- [average_range] is the range of cells containing the data we wish to average.
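A minimal Python sketch of the same idea as =AVERAGEIF: average only the values whose matching key meets a condition. The region/sales data here is invented for illustration:

```python
def average_if(keys, values, condition):
    """Average the entries in `values` whose matching key satisfies `condition`."""
    selected = [v for k, v in zip(keys, values) if condition(k)]
    return sum(selected) / len(selected)

regions = ["East", "West", "East", "West"]
sales = [100, 200, 300, 400]
# Average sales for the East region only: (100 + 300) / 2 = 200.0
print(average_if(regions, sales, lambda r: r == "East"))
```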

Another quick way to forecast is to use Excel's FORECAST function:

=FORECAST(x, known_y's, known_x's)

• x is the data point for which you want to predict a value.
• known_y's is the dependent array or range of data.
• known_x's is the independent array or range of data.

In order to use this function we must have the original data. This approach also gives us a point forecast, but does not provide other helpful values that Excel's regression tool produces.
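Under the hood, FORECAST fits a least-squares line to the known data and evaluates it at x. A stdlib sketch, using made-up, perfectly linear data (y = 2x + 1) so the point forecast is exact:

```python
def forecast(x, known_ys, known_xs):
    """Point forecast from a least-squares line, like Excel's =FORECAST."""
    n = len(known_xs)
    mx = sum(known_xs) / n
    my = sum(known_ys) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(known_xs, known_ys))
             / sum((xi - mx) ** 2 for xi in known_xs))
    intercept = my - slope * mx
    return intercept + slope * x

# The data lies exactly on y = 2x + 1, so forecast(10) = 21.
print(forecast(10, known_ys=[3, 5, 7, 9], known_xs=[1, 2, 3, 4]))
```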

Excel's NORM.DIST function.

=NORM.DIST(x, mean, standard_dev, cumulative)

• x is the value at which you want to evaluate the distribution function.
• mean is the mean of the distribution.
• standard_dev is the standard deviation of the distribution.
• cumulative is an argument that specifies the type of probability we wish to calculate. We insert "TRUE" to indicate that we wish to find the cumulative probability, that is, the probability of being less than or equal to the x-value. (Inserting the value "FALSE" provides the height of the normal distribution at the value x, which we will not cover in this course.)

For a standard normal curve, we know the mean is 0 and the standard deviation is 1, so we could find a cumulative probability using =NORM.DIST(x,0,1,TRUE). Alternatively, we use Excel's NORM.S.DIST function =NORM.S.DIST(z, cumulative). The "S" in this function indicates it applies to a standard normal curve.

• z is the value (the z-value) at which we want to evaluate the standard normal distribution function.
• cumulative is an argument that specifies the type of probability we wish to calculate. We will insert "TRUE".

Another way to find the cumulative probability for a normal distribution is to use the functions STANDARDIZE and NORM.S.DIST. The STANDARDIZE function "standardizes" the data by returning the z-value, using the familiar equation z = (x - mean)/standard_dev. Note that the term "standardize" reinforces our interpretation of the z-value as the number of standard deviations from the mean.

=STANDARDIZE(x, mean, standard_dev)

• x is the value to be standardized.
• mean is the mean of the distribution.
• standard_dev is the standard deviation of the distribution.

After standardizing, we can insert the resulting z-value into the NORM.S.DIST function to find the cumulative probability of that z-value.
Related: Alternative Excel Functions

NORM.DIST replaces the function =NORMDIST(x, mean, standard_dev, cumulative). NORM.S.DIST replaces the function =NORMSDIST(z). Throughout the course we will provide alternative functions that existed prior to Excel 2010 that can still be used in Excel 2010.
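For readers working outside Excel, Python's `statistics.NormalDist` (available in Python 3.8+) mirrors the cumulative versions of these functions; the mean and standard deviation below are made-up example values:

```python
from statistics import NormalDist

# Like =NORM.DIST(115, 100, 15, TRUE): P(X <= 115) for X ~ N(100, 15).
print(NormalDist(mu=100, sigma=15).cdf(115))

# Like =STANDARDIZE(115, 100, 15) followed by =NORM.S.DIST(z, TRUE):
z = (115 - 100) / 15            # z-value: 1 standard deviation above the mean
print(NormalDist(0, 1).cdf(z))  # same cumulative probability as above
```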

To perform a two-sample test in Excel, we use the same T.TEST function we used earlier. The only difference is that we use the actual data from the second sample for our second column of data.

=T.TEST(array1, array2, tails, type)

• array1 is a set of numerical values or cell references. This will be one sample.
• array2 is a set of numerical values or cell references. This will be the other sample.
• tails is the number of tails for the distribution. It should be set to 1 to perform a one-sided test; to 2 to perform a two-sided test.
• type can be 1, 2, or 3. Type 1 is a paired test and is used when the same group is tested twice to provide paired "before and after" data for each member of the group. Type 2 is an unpaired test in which the samples are assumed to have equal variances. Type 3 is an unpaired test in which the samples are assumed to have unequal variances. Since we have no reason to believe that the variances of our two samples are the same, we use type 3. There are ways to test whether variances are equal, but when in doubt, use type 3.

T.TEST replaces the function =TTEST(array1, array2, tails, type).
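As a hedged sketch of what the "type 3" (unequal variances) option computes, here is the Welch t statistic in stdlib Python, with made-up sample data. Turning the statistic into a p-value requires a t-distribution CDF, which Excel supplies and the Python standard library does not:

```python
import math
import statistics

def welch_t(sample1, sample2):
    """Welch's t statistic for two samples with unequal variances
    (the statistic behind T.TEST's type 3 option)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

a = [20, 22, 19, 24, 25]  # hypothetical sample 1 (mean 22)
b = [28, 30, 27, 31, 29]  # hypothetical sample 2 (mean 29)
print(welch_t(a, b))  # clearly negative: sample a's mean is well below b's
```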

Graphing two variables on a scatter plot can reveal relationships between two variables, or data sets.

Although there may be a relationship between two variables, we cannot conclude that one variable "causes" the other. This point is best summarized in the admonition, "correlation does not imply causation." Be alert to the possibility of hidden variables, which may be responsible for patterns we see when graphing or examining relationships between two data sets.

Adjusted R-squared

As always, once we perform a regression analysis we must assess the model's fit and explanatory power. As we mentioned earlier, since it is difficult to visualize the relationship between the dependent variable and multiple independent variables, we must rely more heavily on numerical values and residual plots to determine our regression model's fit.

In single variable regression, to measure the predictive power of a single independent variable we used R²: the percentage of the variation in the dependent variable explained by the independent variable. For multiple regression models, we will rely on the Adjusted R². To adjust for the effect that adding independent variables has on R², we modify R² by an adjustment factor. The adjustment factor looks quite complicated, but notice that it is largely determined by the terms n-1 and n-k-1. The term n-1 is the Total df, which can be found in the regression output table. The term n-k-1 is the difference between the number of observations, n, and the number of independent variables in the regression model, k, minus one. Recall that n-k-1 is the Residual df and can also be found in the regression output table. For the multiple regression model, Selling Price vs. House Size and Distance from Boston, n-1 = 30-1 = 29 and n-k-1 = 30-2-1 = 27.

Adjusted R² = 1 - [(1 - R²)((n - 1)/(n - k - 1))] = 1 - [(1 - R²)(Total df / Residual df)]

• Adjusted R² is provided in the regression output.
• It is particularly important to look at Adjusted R², rather than R², when comparing regression models with different numbers of independent variables.
• In addition to analyzing Adjusted R², we must test whether the relationship between the dependent variable and independent variables is linear and significant. We do this by analyzing the regression's residual plots and the p-values associated with each independent variable's coefficient.
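The adjustment formula above is easy to sketch in Python. The n and k values come from the house-price example in the text (n = 30 observations, k = 2 independent variables); the R² value of 0.85 is hypothetical:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: penalizes R-squared for the number of
    independent variables k, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical R-squared of 0.85 with the example's n = 30, k = 2:
print(adjusted_r2(0.85, n=30, k=2))  # slightly below 0.85, as expected
```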

One-sided Testing

CONFIDENCE.NORM finds the margin of error for a two-sided hypothesis test, but we are interested in the upper bound of a one-sided test. To find the upper bound for the one-sided test, we must first determine what two-sided test would have a 5% rejection region on the right side. Since the distribution of sample means is symmetric, a two-sided test with a 10% significance level would have a 5% rejection region on the left side of the normal distribution and a 5% rejection region on the right side. Thus, the upper bound for a two-sided test with alpha = 0.10 will be the same as the upper bound on a one-sided test with alpha = 0.05. The margin of error is CONFIDENCE.NORM(0.1,C3,C4) = 0.33. The upper bound of the 95% range of likely sample means for this one-sided hypothesis test is the population mean plus the margin of error, approximately 6.7 + 0.33 = 7.03.

Performing a one-sided hypothesis test using Excel is very similar to performing a two-sided test. To calculate the p-value for the sample mean we use the same function we learned about earlier. The only difference in setting up a one-sided test versus a two-sided test is the number we assign to the tails argument: 1 for a one-sided test and 2 for a two-sided test.

=T.TEST(array1, array2, tails, type)

• array1 is a set of numerical values or cell references. We will place our sample data in this range.
• array2 is a set of numerical values or cell references. We have only one set of data, so we will use the historical mean, 6.7, as the second data set. To do this, we create a column with each entry equal to 6.7.
• tails is the number of tails for the distribution. It can be either 1 or 2. Now that we are performing a one-sided test, we will enter a 1 instead of a 2.
• type can be 1, 2, or 3. Type 1 is a paired test and is used when the same group is tested twice to provide paired "before and after" data for each member of the group. Type 2 is an unpaired test in which the samples are assumed to have equal variances. Type 3 is an unpaired test in which the samples are assumed to have unequal variances. The variances of the two columns are clearly different in our case, so we use type 3. There are ways to test whether variances are equal, but when in doubt, use type 3.

T.TEST replaces the function =TTEST(array1, array2, tails, type).

Excel function CONFIDENCE.T to calculate the width of the confidence interval for small samples.

CONFIDENCE.T works just like CONFIDENCE.NORM, except that it uses a t-distribution rather than a normal distribution.

=CONFIDENCE.T(alpha, standard_dev, size)

• alpha, the significance level, equals one minus the confidence level (for example, a 95% confidence interval would correspond to the significance level 0.05).
• standard_dev is the standard deviation of the population distribution. We will typically use the sample standard deviation, s, which is our best estimate of our population's standard deviation.
• size is the sample size, n.

Like CONFIDENCE.NORM, CONFIDENCE.T returns the margin of error, which we can add and subtract from the sample mean. Thus the confidence interval is the sample mean plus or minus the margin of error.

Another way to find the confidence interval is to use the descriptive statistics tool we learned about earlier in the course. We just check the Confidence Level for Mean box and enter the desired level of confidence. The resulting output table calculates the margin of error, which we can then add to and subtract from the sample mean to find the confidence interval range. The descriptive statistics tool uses CONFIDENCE.T to calculate the margin of error; thus, we typically don't use the descriptive statistics tool for large samples because the t-distribution is more conservative than the normal distribution. However, for very large samples, the tool generates a confidence interval that's very close to the results of CONFIDENCE.NORM.

The function T.INV.2T can find the t-value for a desired level of confidence.

=T.INV.2T(probability, degrees_freedom)

• probability is the significance level, that is, 1 - confidence level, so for a 95% confidence interval, the significance level = 0.05.
• degrees_freedom is the number of degrees of freedom, which in this case is simply the sample size minus one, or n-1. For example, for the BMI example where the confidence level was 95% and n = 15, the t-value would be T.INV.2T(0.05,14) = 2.14.

T.INV.2T replaces the function =TINV(probability, degrees_freedom).
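A small sketch of what CONFIDENCE.T returns, using the t-value from the BMI example above (T.INV.2T(0.05, 14) = 2.14 for a 95% interval with n = 15). The sample standard deviation of 4.0 is a made-up value for illustration:

```python
import math

def t_margin_of_error(t_value, sample_std, n):
    """Margin of error for a t-based confidence interval: t * s / sqrt(n)."""
    return t_value * sample_std / math.sqrt(n)

# t-value 2.14 comes from the example; the sample std dev 4.0 is hypothetical.
margin = t_margin_of_error(2.14, sample_std=4.0, n=15)
print(round(margin, 2))  # add this to and subtract it from the sample mean
```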

Forecasting with a multiple regression equation

Forecasting with a multiple regression equation is very similar to forecasting with a single variable linear model. However, instead of entering only one value for a single independent variable, we input a value for each of the independent variables.

Visualizing the Relationship

From the Insert menu, select Scatter, then select Scatter With Only Markers. Next, enter the appropriate Input Y Range and Input X Range, making sure to include the cells containing labels. Check the Labels in first row box, as this ensures that your scatter plot will be appropriately labeled. Remember that creating scatter plots in this course is somewhat different from the process in Excel.

Avoiding Bias

- The General Social Survey is conducted regularly to track trends in social attitudes. Framing the same question differently can produce very different answers: "welfare" and "assistance to the poor" describe essentially the same thing, yet the two wordings received very different poll results (64% vs. 19%).
- Surveyors wish to get as high a response rate as possible. Low response rates can introduce bias if the non-respondents' answers would have differed from those who responded, that is, if the non-respondents and the respondents represent different segments of the population. If we do not represent a segment of the population, then our sample is not representative of the population.
- If resources are limited, it is often better to take a small sample and relentlessly pursue a high response rate than to take a larger sample and settle for a low response rate. If we have a low response rate, we should contact non-respondents and try to either increase the response rate or demonstrate that the non-respondents' answers do not differ from the respondents' answers.
- It is often useful to infer facts about a large population from a smaller sample. To make sound inferences, make sure the sample is representative of the population by choosing members randomly to ensure that each member of the population is equally likely to be included in the sample.

The Central Limit Theorem

If we take enough samples, the sample means will form a normal distribution, and their average will be very close to the actual population mean.
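A small simulation of this idea, with made-up parameters: even when the population is skewed (here, exponential with mean 1), the average of many sample means lands very close to the population mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible
population_mean = 1.0  # an exponential distribution with rate 1 has mean 1

# Draw 2,000 samples of size 50 and record each sample's mean.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(50))
    for _ in range(2000)
]
print(statistics.mean(sample_means))  # close to the population mean, 1.0
```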

Multicollinearity

In addition to the measurements we have already discussed, we must also review the multiple regression output table for signs of multicollinearity. Multicollinearity occurs when two independent variables are so highly correlated that it is difficult for the regression model to separate the effect each variable has on the dependent variable. Recall that house size and distance from Boston were significant in each of the single variable linear regression models and in the multiple regression model of selling price versus house size and distance from Boston.

Samples vs. Populations

In some cases, selecting a random sample is quite straightforward. If we have a list of all members of a population in a database, we can use a computer to assign a random number to each member and draw a sample from the list. This process makes sure that each member—that is, each element of the population—has an equal likelihood of being selected, which ensures that the sample is representative of the population.

The Normal Function: NORM.INV

In the previous module we learned to use the Excel function PERCENTILE.INC for a discrete set of numbers. We cannot use the PERCENTILE.INC function for a normal curve, since the normal distribution's values are not restricted to a discrete set of numbers. For a normal distribution, we can use Excel's NORM.INV function to calculate a given percentile. The "INV" indicates that this function calculates the inverse of the cumulative probability.

=NORM.INV(probability, mean, standard_dev)

• probability is the cumulative probability for which we want to know the corresponding x-value on a normal distribution.
• mean is the mean of the distribution.
• standard_dev is the standard deviation of the distribution.

NORM.INV replaces the function =NORMINV(probability, mean, standard_dev).

For a standard normal curve, we could translate a cumulative probability into a value using NORM.INV(probability,0,1). Alternatively, we could use the Excel function NORM.S.INV. The "S" in this function indicates that it applies to a standard normal curve. Since we are working with the standard normal curve, we can interpret the resulting value as a z-value.

=NORM.S.INV(probability)

• probability is the cumulative probability for which we want to know the corresponding z-value on a standard normal distribution.

For example, if we wanted to know the z-value for the 95th percentile on a standard normal curve, we would enter =NORM.S.INV(0.95) = 1.645. Equivalently, we could enter =NORM.INV(0.95,0,1) = 1.645. NORM.S.INV replaces the function =NORMSINV(probability).
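Python's `statistics.NormalDist` also provides the inverse function, mirroring NORM.INV and NORM.S.INV (assumes Python 3.8+). The mean of 100 and standard deviation of 15 in the second call are made-up example values:

```python
from statistics import NormalDist

# Like =NORM.S.INV(0.95): the z-value of the 95th percentile, 1.645.
print(round(NormalDist(0, 1).inv_cdf(0.95), 3))

# Like =NORM.INV(0.95, 100, 15) for a non-standard normal curve:
# the mean plus 1.645 standard deviations, about 124.7.
print(round(NormalDist(100, 15).inv_cdf(0.95), 1))
```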

Spreadsheet: Assigning Dummy Variables

It is easy to transform categorical variables into dummy variables. Below is a segment of data that Caesars Palace uses for predicting the number of daily check-ins and subsequently making staffing decisions. This segment of data includes the date and day of week for each observation. Day of week is an important variable because guest arrival patterns differ from one day of the week to another, particularly on weekdays versus weekends. However, since day of week is a categorical variable, before performing any type of regression analysis, Ruben and his team had to transform day of week into dummy variables.

Let's first create a dummy variable called Monday. Using the function IF, we'll assign a 1 to the days that are Mondays and a 0 to all other days.

=IF(logical_test, [value_if_true], [value_if_false])

To develop its regression model, Caesars' managers assigned dummy variables for each day so that they could analyze the relationship between day of week and the number of daily check-ins. It is important to note that we should always use one fewer dummy variable than the number of options in a category. In this case, there is one dummy variable for each day except Saturday, which serves as the base case. For each day of the week, the dummy variable is set to one on the appropriate day and to zero on all other days. There is no variable for Saturday because when all of the dummy variables equal zero, we know the day must be Saturday. If we tried to add a dummy variable for Saturday, the regression would return an error term.

Any one category can be used as the base case and hence excluded from the model. For example, Caesars could have used Friday as the base case and included a dummy variable for Saturday. However, this would alter the coefficients in the regression model. The choice of which variable is the base case influences the interpretation of the dummy variable coefficients. Why don't we have exactly as many dummy variables as categories?
Suppose we want to designate day of week by two categories, weekdays and weekends, and we created two dummy variables called "Weekday" and "Weekend". On weekdays, Weekday=1 and Weekend=0, and on weekend days, Weekday=0 and Weekend=1. Having these two dummy variables would be completely redundant: Since there are only two categories, if Weekday=0, we know that, by default, it must be a weekend. Technically, including both dummy variables would be problematic because the variables would be perfectly correlated. When Weekday equals zero, Weekend must equal one. Likewise, whenever Weekday equals one, Weekend must equal zero. Perfectly correlated variables are perfectly collinear: If we include both variables the regression software would return an error term.

Sales = -631,085 + 533,024(Red) + 50.5(Advertising)

We always interpret the coefficient of a dummy variable as being the expected change in the dependent variable when that dummy variable equals one compared to the base case. In this case, controlling for advertising, we expect sales for red sneakers to be $533,024 more than for blue sneakers. It is helpful to view this model graphically. Consider two parallel regression lines: one for red sneakers and one for blue sneakers. The vertical distance between the lines is the average increase in sales the manager can expect when red sneakers are sold versus blue sneakers, controlling for advertising. The slope of the two lines, 50.5, is the same: It tells us the average increase in sales, controlling for sneaker color, as we increase advertising by $1.
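The day-of-week scheme described above, one dummy per day with Saturday as the base case, can be sketched as an IF-style assignment in Python. The day abbreviations are illustrative, not from the course's data set:

```python
# One dummy per day EXCEPT the base case (Saturday), as described above.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sun"]

def day_dummies(day):
    """Return the dummy-variable row for a given day of week,
    assigning 1 to the matching dummy and 0 to all others."""
    return {d: (1 if day == d else 0) for d in DAYS}

print(day_dummies("Mon"))  # Mon=1, all other dummies 0
print(day_dummies("Sat"))  # all zeros: Saturday is the implied base case
```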

Distribution of sample vs. distribution

- Larger samples produce a narrower distribution of sample means.
- If we take enough samples, the sample means will form a normal distribution centered at the true population mean.
- In the real world, we typically take just one sample and extract as much information from it as we can.

Quantifying Predictive Power

We measure the vertical distance between each data point and the regression line, because we want to see how well the line predicts the value of the dependent variable.

Net & Gross Relationship

- Net: with respect to all variables included in the model. Gross: with respect to all omitted variables.
- Gross Relationship: A single variable linear regression model determines the gross effect of an independent variable on a dependent variable. For example, the gross effect of house size on selling price is the average change in selling price when house size increases by one square foot. Since no other independent variables are included in the model, the coefficient for house size may pick up the effect of other factors related to selling price.
- Net Relationship: A multiple regression model determines the net effect of an independent variable on a dependent variable. The net effect controls for all other factors (independent variables) included in the regression model. For example, in a regression model including both distance and house size as independent variables, the coefficient for house size controls for distance. That is, the regression determines the average change in selling price if a house's size increases by one square foot but its distance from Boston does not change.
- Coefficients in multiple regression are net with respect to variables included in the model and gross with respect to variables that are omitted from the model.

Prediction Intervals Questions Examples

Q: The best point forecast for the selling price of a 2,500 square foot house is the expected selling price of a 2,500 square foot home, approximately 13,490 + 255.36(2,500) = $652,000. Given that the standard error of the regression is about $151,000, which of the following would give the BEST estimate for the prediction interval for a 2,500 square foot home with approximately 95% confidence? A: $652,000 ± 2($151,000). A prediction interval is centered at a point forecast, in this case $652,000. The standard error of the regression is multiplied by 2 since we wish to estimate the prediction interval at the 95% confidence level. Note that we are using 2 to approximate the z-value for a 95% prediction interval. The actual z-value corresponding to 95% (for sufficiently large samples) is 1.96. Note that using our method to estimate a prediction interval, the width of our estimated prediction interval is the same for a 2,500 square foot home as it is for a 2,000 square foot home, despite the fact that 2,500 square feet is further from the average house size (about 1,750 square feet). The width does not change because our estimated prediction interval method is based on the standard error of the regression, $151,000, which does not vary by house size. The calculation of the actual prediction interval takes into account both the standard error of the regression and the distance between x¯ and the specific value of x on which we are basing our prediction interval. You are not responsible for knowing the formula to compute the width of an actual prediction interval, though interested students can check the Drill Down above. You are responsible for understanding that the further the value of the independent variable is from its mean, the wider an actual prediction interval will be. 
Q: How would the width of the actual prediction interval (at a 95% confidence level) for a 3,000 square foot home differ from the width of the actual prediction interval (at a 95% confidence level) for a 2,000 square foot home, given that the average home size is approximately 1,750 square feet? A: The width of the actual prediction interval for a 3,000 square foot home would be larger than the width of the prediction interval for a 2,000 square foot home. Because 3,000 square feet is further from the mean house size (1,750 square feet) than 2,000 square feet, the actual prediction interval at 3,000 square feet will be wider. The width of the actual prediction interval is based on both the standard error of the regression and the distance from the mean; the actual prediction interval gets wider as the value of the independent variable moves further from the mean of the independent variable. The image below compares prediction intervals created using both of the methods we have discussed. The red dashed lines show the actual prediction intervals for different house sizes. The blue dashed lines represent our method of estimating the prediction interval using the standard error of the regression. Note that the actual prediction intervals widen as house size moves further from the mean, whereas the estimated prediction intervals do not; they remain parallel to the regression line.

R squared

R2 (pronounced R-squared) measures how closely a regression line fits a data set. It is a standardized measure of the regression line's explanatory power. R2 is defined as the percentage of total variation in the dependent variable, y, that is explained by the regression line: R2 = (Variation explained by the regression line)/(Total variation) = (Regression Sum of Squares)/(Total Sum of Squares). For example, the R2 of the regression equation for the housing data is the percentage of variation in selling price that is explained by the variation in house size. In single variable linear regression, i.e. a regression model that has only one independent variable, R2 is the square of the correlation coefficient between the independent and dependent variables: R2 = (Correlation Coefficient)². Equivalently, the correlation coefficient is the positive or negative square root of R2; the sign is determined by whether there is a positive or negative relationship between the two variables: Correlation Coefficient (R) = ±√R2. We can quickly find the R2 for a single variable linear regression by creating a scatter plot of the two variables. When we select Trendline and check the Display Equation box to display the equation of the best fit line, we can also check the Display R-squared Value box to display the R2 value on the scatter plot.
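For a single variable regression, the relationship R2 = (correlation coefficient)² can be verified directly. A minimal sketch with hypothetical house sizes and prices (not the course data set):

```python
import math

# Hypothetical house sizes (sq ft) and selling prices ($).
x = [1000, 1500, 2000, 2500, 3000]
y = [300_000, 420_000, 510_000, 650_000, 780_000]

def corr(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = corr(x, y)
r2 = r * r          # R-squared for the single variable regression of y on x
print(round(r, 4), round(r2, 4))
```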

Using P-values

Significance Level = 1 - Confidence Level. The significance level (α) of a hypothesis test has multiple interpretations. Although they are essentially equivalent, it is helpful to understand each one. We will see the different uses of these interpretations as we continue through this module. • The significance level defines the rejection region by specifying the threshold for deciding whether or not to reject the null hypothesis. When the p-value of a sample mean is less than the significance level, we reject the null hypothesis. • The significance level is the area of the rejection region, meaning the area under the distribution of sample means over the rejection region. • The significance level is the probability of rejecting the null hypothesis when the null hypothesis is actually true. The significance level also defines the confidence level: Significance Level = 1 - Confidence Level. The confidence level tells us how confident we can be that the range of likely sample means contains the true population mean. We should always specify the significance level (and thus the confidence level) before performing a hypothesis test. If we use the most common significance level of 0.05, we draw our conclusions based on whether the sample's p-value is less than or greater than 0.05. If the p-value is less than 0.05, we reject the null hypothesis. If the p-value is greater than or equal to 0.05, we fail to reject the null hypothesis. It is always important to use your managerial judgment when making decisions, especially when the p-value is very close to the significance level.
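The decision rule itself is a one-line comparison; a minimal sketch:

```python
# Reject H0 when the p-value falls below the significance level.
def decide(p_value, significance=0.05):
    return "reject H0" if p_value < significance else "fail to reject H0"

print(decide(0.03))   # p-value below 0.05
print(decide(0.20))   # p-value above 0.05
```

Note that a p-value exactly equal to the significance level leads to "fail to reject," matching the "greater than or equal to" rule above.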

Estimating the Population Proportion

Sometimes our sample does not consist of numerical values. For example, if we pose a "yes" or "no" question, the data will consist only of yes and no answers. Even though our question has only two possible responses, we still have to address an inherent uncertainty: how often will each response occur? In such cases, we usually convey the survey results by reporting the proportion of each response. The easiest way to do this is to assign a "dummy variable" to each response, a variable that can take on only the values 0 and 1. (Dummy variables are also called indicator variables or binary variables; we will learn more about them in the regression modules.) In this case, we'll assign 1 to every "yes" response and 0 to every "no" response. (Typically we assign 1 to our value of interest, in this case the proportion of "yes" responses.) To do so, we use Excel's IF function. =IF(logical_test, [value_if_true], [value_if_false]) To make this function assign the value 1 if the referenced cell is "Yes," and 0 if it is not (in this case, if the referenced cell is "No"), we enter the IF function for every observation. In this example, the following formula refers to the first observation in cell A2. =IF(A2="Yes",1,0) Be sure to use the correct formula syntax when creating dummy variables. For example, if you place quotation marks around the numbers in your formula, that is, if you write =IF(A2="Yes","1","0") instead of =IF(A2="Yes",1,0), the misconfigured formula will produce a 1 or 0 result that looks like a number but does not behave like one. The quotation marks tell the spreadsheet to treat the 1 or 0 as text rather than as a numeric value. If you then try to perform calculations on these text-formatted values, you will get an error.
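The same dummy-variable logic as =IF(A2="Yes",1,0), sketched with a hypothetical list of responses; the sample proportion is then simply the average of the dummies:

```python
# Convert "Yes"/"No" survey responses into 0/1 dummy variables.
responses = ["Yes", "No", "Yes", "Yes", "No"]        # hypothetical survey data
dummies = [1 if r == "Yes" else 0 for r in responses]
p_hat = sum(dummies) / len(dummies)                  # sample proportion of "Yes"
print(dummies, p_hat)
```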

The Best Fit Line

Spreadsheet: Adding the Best Fit Line to a Scatter Plot. As we discussed earlier in the course, a scatter plot is a powerful visual tool that helps us build intuition about the relationship between two variables. Excel can help us deepen and formalize our intuition by finding the line that best fits the relationship. Let's walk through how to do this. Step 1: Create a scatter plot with "House Size (Sqft)" on the horizontal axis and "Selling Price ($)" on the vertical axis. Include the labels when inputting your ranges so that the scatter plot is appropriately labeled. Step 2: Select Chart Tools from the Insert menu. Then select Layout, then select Trendline. Check the Display Equation box to display the equation of the best fit line. The Input Y Range is C1:C31 and the Input X Range is B1:B31. You must check the Labels in first row box since we included B1 and C1 to ensure that the scatter plot's axes are appropriately labeled. You must select Trendline and check the Display Equation box to display the best fit line and equation on the scatter plot. The best fit line is y = 255.36x + 13,490.45, or in terms of our variables: SellingPrice = 13,490.45 + 255.36(HouseSize).

General relationship between accuracy and sample size

The graphic below suggests the general relationship between accuracy and sample size. Later in this module, we will learn how to calculate the minimum required sample size to ensure a specified level of accuracy. Although we don't necessarily have to increase the sample size for larger populations, we may need a larger sample size when we are trying to detect something very rare. For example, if we are trying to estimate the incidence of a rare disease, we may need a larger sample simply to ensure that some people afflicted with the disease are included in the sample. As the sample size increases, the sample mean and standard deviation generally become closer to the population mean and standard deviation.

Spreadsheet: Creating the Multiple Regression Output Table

Step 1 From the Data menu, select Data Analysis, then select Regression. Step 2 Enter the appropriate Input Y Range and Input X Range: • The Input Y Range is the dependent variable, in this case selling price. The data are in column D with its label, D1:D31. • The Input X Range should include both independent variables, in this case house size and distance from Boston. To ensure that the independent variables are labeled correctly in the output table, enter the data with its labels in column B and column C, B1:C31. • Note that to run a regression in Excel the independent variables must be in contiguous columns. • Since we included the cells containing the variables' labels when inputting the ranges, check the Labels box. Step 3 Scroll down and make sure to check the Residuals and Residual Plots boxes to ensure we see the relevant residual information. You will not be able to submit if you do not include the residual plots. The multiple regression output table, like the single variable output table, is divided into three main parts: the Regression Statistics table, the ANOVA table, and the Regression Coefficients table. Although this course does not cover some of the ANOVA (Analysis of Variation) measures, we've included the definitions for completeness. The Residual Output table appears only if we select residuals when inputting data in the regression dialog box. It is important to be able to identify the most useful measures and interpret them correctly. Click on the name of any statistic in the output table below to view its definition. Make sure to familiarize yourself with the locations of the items we've discussed in this module: the coefficients of the regression equation, the Adjusted R2, the p-value of the independent variables, the 95% confidence interval of the independent variables, and the residuals.

Regression Sum of Squares :

The Regression Sum of Squares is the amount of variation that is explained by the regression line, that is, the sum of the squared differences between the values predicted by the regression line and the mean of the dependent variable.

Residual Sum of Squares

The Residual Sum of Squares is the amount of variation that is left unexplained by the regression line, that is, the sum of the squared differences between the observed values and the values predicted by the regression line. The vertical distance between an observed value and its predicted value is the residual.

Total Sum of Squares:

The Total Sum of Squares is the total variation in y, that is, the sum of the squared differences between the observed values and the mean of y. R2 (pronounced R-squared) measures how closely a regression line fits a data set; it is the percentage of the Total Sum of Squares that is explained by the regression line, that is, the Regression Sum of Squares divided by the Total Sum of Squares.
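The three sums of squares fit together as Total = Regression + Residual, with R2 = Regression Sum of Squares / Total Sum of Squares. A sketch on a tiny hypothetical data set:

```python
# Sum-of-squares decomposition for a least-squares line on hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = sum(x) / len(x), sum(y) / len(y)
b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
a0 = my - b * mx                                   # intercept
pred = [a0 + b * xi for xi in x]                   # predicted values

tss = sum((yi - my) ** 2 for yi in y)              # Total Sum of Squares
regss = sum((p - my) ** 2 for p in pred)           # Regression Sum of Squares
rss = sum((yi - p) ** 2 for yi, p in zip(y, pred)) # Residual Sum of Squares
r2 = regss / tss
print(round(tss, 3), round(regss + rss, 3), round(r2, 3))
```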

Lagged Variables

The real estate data we examined earlier—selling price, house size, lot size, and distance from Boston—all result from transactions during the summer of 2013. These data capture a moment in the Boston-area housing market, and thus should be classified as cross-sectional data. For sneaker sales, we have data on sales and advertising levels for ten consecutive years, so we should classify these data as time series data. Sometimes, the value of the dependent variable in a given period is affected by the value of an independent variable in an earlier period. Advertising provides a good example, because its effects often persist. For example, last year's advertising may still influence this year's sneaker sales. We can incorporate the delayed effect of an independent variable on the dependent variable using a lagged variable. To study the lagged effect of advertising on sneaker sales, we add the lagged variable "Previous Year's Advertising" so that our regression model now includes the current year's and the previous year's advertising levels. Adding a lagged variable is costly in two ways: • Each lagged variable creates an incomplete line of data. If we have a single lagged variable, our first observation will be incomplete. If we have two lagged variables, our first two observations will be incomplete, and so on. The loss of each data point decreases our sample size by one, which reduces the precision of our estimates of the regression coefficients. • In addition, if the lagged variable, or variables, do not increase the model's explanatory power, the addition of the variable decreases Adjusted R2, just as the addition of any variable to a regression model can. We include lagged variables only if we believe the benefits of doing so outweigh the loss of one or more observations and the "penalty" imposed by the adjustment to R2. Despite those costs, lagged variables can be very useful. 
Because they pertain to previous time periods, they are usually available ahead of time. They are often good leading indicators, which help us predict future values of a dependent variable.
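Creating a lagged variable, and the resulting loss of the first observation, can be sketched as follows (hypothetical yearly figures):

```python
# Build "Previous Year's Advertising" as a lagged variable.
advertising = [10, 12, 11, 15, 14]       # hypothetical yearly ad spend
sales =       [100, 110, 108, 130, 128]  # hypothetical yearly sneaker sales

lagged_ad = [None] + advertising[:-1]    # shift forward one period; first year has no lag
# Keep only complete rows (sales, current ad, previous year's ad).
rows = [(s, a, la) for s, a, la in zip(sales, advertising, lagged_ad) if la is not None]
print(len(sales), len(rows))             # one observation is lost to the lag
```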

The Shopping Cart A/B Test

The table below summarizes the results of the two hypothesis tests for the new shopping cart design. The sample size is large (1.5 million observations), so we provide the results of the hypothesis tests rather than performing the analysis ourselves. These results are based on the following null and alternative hypotheses: Ordered Product Sales (OPS) • H0: Average ordered product sales are equal for customers who see the old shopping cart and for those who see the new shopping cart (μcontrol OPS=μtreatment OPS) • Ha: Average ordered product sales are different for customers who see the old shopping cart and for those who see the new shopping cart (μcontrol OPS≠μtreatment OPS) Total Units Ordered (Units) • H0: Average total units ordered are equal for customers who see the old shopping cart and for those who see the new shopping cart (μcontrol units=μtreatment units) • Ha: Average total units ordered are different for customers who see the old shopping cart and for those who see the new shopping cart (μcontrol units≠μtreatment units)

The Arrow A/B Test

The table below summarizes the results of the two hypothesis tests for the new, modern arrow design. These results are based on the following null and alternative hypotheses: Ordered Product Sales (OPS) • H0: Average ordered product sales are equal for customers who see the old arrow design and for those who see the new arrow design (μcontrol OPS=μtreatment OPS) • Ha: Average ordered product sales are different for customers who see the old arrow design and for those who see the new arrow design (μcontrol OPS≠μtreatment OPS) Total Units Ordered (Units) • H0: Average total units ordered are equal for customers who see the old arrow design and for those who see the new arrow design (μcontrol units=μtreatment units) • Ha: Average total units ordered are different for customers who see the old arrow design and for those who see the new arrow design (μcontrol units≠μtreatment units)

We can quantify the strength of a linear relationship between two variables by calculating the correlation coefficient.

The value of the correlation coefficient ranges between -1 and +1. A correlation coefficient near zero indicates a weak or nonexistent linear relationship. - A correlation coefficient near zero does not mean there is no relationship between the two variables; it indicates only that any relationship that does exist is not linear. - Outliers can unduly influence the calculation of the correlation coefficient, making the correlation coefficient much higher or lower than what it would be without the outlying points.
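The first caveat can be seen directly: y = x² on a symmetric range is a strong (but nonlinear) relationship whose linear correlation coefficient is zero:

```python
import math

# A perfect quadratic relationship with zero *linear* correlation.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

mx, my = sum(x) / len(x), sum(y) / len(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = sxy / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
print(r)
```

Even though y is completely determined by x here, the positive and negative halves of the range cancel in the covariance, so r comes out zero.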

Excel's RAND function

We will do this using Excel's RAND function. RAND assigns a random identification (ID) number between 0 and 1 to each data point, in this case, to each phone number. • RAND takes no arguments; we simply type =RAND() with empty parentheses. • We can use the RAND function to generate random numbers between any two specified values. For example, if we wanted to generate random numbers between 0 and 10, we would multiply the function by 10 and enter =RAND()*10. If we wanted numbers between 5 and 15, we would enter =5+RAND()*10.
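The same scaling idea in Python, using random.random() as the analog of Excel's RAND():

```python
import random

# =low + RAND()*(high-low) in Excel, e.g. =5+RAND()*10 for numbers between 5 and 15.
def rand_between(low, high):
    return low + random.random() * (high - low)

samples = [rand_between(5, 15) for _ in range(1000)]
print(min(samples) >= 5 and max(samples) < 15)
```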

Time Series vs. Cross-Sectional Data

Time Series: Time series data contain data about a given subject in temporal order, measured at regular time intervals (e.g. minutes, months, or years). U.S. oil consumption from 2002 through 2012 is an example of a time series. Managers collect and analyze time series to identify trends and predict future outcomes. Cross-Sectional: Cross-sectional data contain data that measure an attribute across multiple different subjects (e.g. people, organizations, countries) at a given moment in time or during a given time period. The average oil consumption of ten countries in 2012 is an example of cross-sectional data. Managers use cross-sectional data to compare metrics across multiple groups.

For multiple regression models, residual plots are an indispensable tool for detecting whether the linear model is a good fit.

We rely heavily on residual plots for multiple regression models because the full relationship among multiple independent variables and the dependent variable is difficult or impossible to represent in a single scatter plot. (When there are two independent variables, the regression model describes the best fit plane through the data and the residuals can be seen on the three-dimensional scatter plot as well as on their individual residual plots.) The residuals are the difference between the historically observed values and the values predicted by the regression model. There is a residual plot for each independent variable included in the regression model. We can graph a residual plot for each independent variable to help detect patterns such as heteroskedasticity and nonlinearity. As with single variable regression models, if the underlying multiple relationship is linear, the residuals follow a normal distribution with a mean of zero and fixed variance. • We should also analyze the p-values of the independent variables to determine whether there is a significant relationship between the variables in the model. If the p-value of each of the independent variables is less than 0.05, we conclude that there is sufficient evidence to say that we are 95% confident that there is a significant linear relationship between the dependent and independent variables.

Amazon's Use of Hypothesis Testing

To ensure that important managerial decisions are as well-informed as possible, it is critical to put our claims and theories to the test before making those decisions. For example, we may wish to determine whether a pilot marketing campaign increases revenues before rolling it out on a large scale. Or we may wish to gauge customer interest before opening a new restaurant in town. Hypothesis testing allows us to rigorously test such claims and theories. • In a survey, researchers ask questions and record self-reported responses from a random sample of a population. • In an observational study, researchers observe and collect data about a sample (e.g., people or items) as they occur naturally, without intervention, and analyze the data to investigate possible relationships. • In an experiment, researchers divide a sample into two or more groups. One group is a "control group," which has not been manipulated. In the "treatment group (or groups)," they manipulate a variable and then compare the treatment group(s) responses to the responses of the control group. As data have become more readily available, it has become easier for companies to employ such methods. In some cases, an organization may use a combination of methods to examine an issue. For example, a company may start with an observational study to learn more about what it wants to test before designing an experiment. In this module we'll focus on using hypothesis tests to analyze sample data to test a claim about a population. Let's return to Amazon to see how it uses experiments and hypothesis tests. The specific method of hypothesis testing that Amazon uses is called A/B testing, one of the most commonly used tools for web design optimization.

Type I and Type II Errors

To summarize, we can go wrong in two ways. A type I error is often called a false positive (we incorrectly reject the null hypothesis when it is actually true), and a type II error is often called a false negative (we incorrectly fail to reject the null hypothesis when it is actually false). Note that since we have no information on the probabilities of different sample means if the null hypothesis is false, we cannot calculate the likelihood of a type II error.

use Excel's T.TEST function

To use Excel's T.TEST function for a hypothesis test with one sample, we must create a second column of data that will act as a second sample. We will walk through how to use the T.TEST function but please understand that the most important thing to take away from this discussion is the interpretation of the p-value. =T.TEST(array1, array2, tails, type) • array1 is a set of numerical values or cell references. We will place our sample data in this range. • array2 is a set of numerical values or cell references. We have only one set of data, so we will use the historical mean, 6.7, as the second data set. To do this, we create a column with each entry equal to 6.7. • tails is the number of tails for the distribution. It can be either 1 or 2. We will learn more about what this means later in the module. Since our alternative hypothesis is that the mean has changed and therefore can be either lower or higher than the historical mean, we will be using a two-tailed, or two-sided hypothesis test. • type can be 1, 2, or 3. Type 1 is a paired test and is used when the same group from a single population is tested twice to provide paired "before and after" data for each member of the group. Type 2 is an unpaired test in which the samples are assumed to have equal variances. Type 3 is an unpaired test in which the samples are assumed to have unequal variances. The variances of the two columns are clearly different in our case, so we use type 3. There are ways to test whether variances are equal, but when in doubt, use type 3. Excel's Data Analysis Tool Pack provides another tool, called the t-Test: Two-Sample Assuming Unequal Variances tool, for finding the p-value when we assume unequal variances. This tool provides information in addition to the p-value. The inputs for the tool are similar to the T.TEST function. We enter the ranges of our samples into Variable 1 Range and Variable 2 Range. 
The Hypothesized Mean Difference is 0, because our null hypothesis assumes no change, and alpha is the significance level. The default significance level is 0.05. T.TEST replaces the function: =TTEST(array1, array2, tails, type) Throughout the course we will provide alternative functions that existed prior to Excel 2010 that can still be used in Excel 2010.
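The t statistic underlying these tools is (sample mean - hypothesized mean) / (s/√n). A sketch with hypothetical ratings tested against the historical mean of 6.7 (converting the t statistic into a p-value requires the t-distribution, which Excel computes for us):

```python
import math
import statistics as st

# One-sample t statistic against the historical mean of 6.7 (hypothetical ratings).
sample = [6.9, 7.1, 6.5, 7.3, 6.8, 7.0, 7.2, 6.6]
n = len(sample)
t = (st.mean(sample) - 6.7) / (st.stdev(sample) / math.sqrt(n))
print(round(t, 3))
```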

Estimating the Population Mean

We also need to have a measure of the accuracy of our estimate—in this case, how close our estimate is to the true population mean. Based on the sample mean and a specified level of confidence, we can create a confidence interval. A confidence interval depends on the sample's mean, standard deviation, and sample size. As we'll see, a confidence interval also depends on how "confident" we would like to be that the range contains the true mean of the population. For example, if our sample mean is 50, we might calculate a 95% confidence interval from 42 to 58. We can be 95% confident that our interval contains the true population mean. A 95% confidence level should be interpreted as saying that if we took 100 samples from a population and created a 95% confidence interval for each sample, on average 95 of the 100 confidence intervals (that is, 95%) would contain the true population mean. Often people misinterpret a confidence interval's meaning, thinking that for a 95% confidence interval, there is "a 95% probability (or 95% chance) that the confidence interval contains the true population mean." This is false. For any given sample and a corresponding confidence interval, the population mean either is or is not in the confidence interval. Because we don't know the true population mean, we cannot say whether a particular sample's confidence interval is one that contains the population mean. However, we do know that, on average, 95 out of 100 such confidence intervals do contain the true mean, which is why we say we're 95% confident that our interval does. What if we wish to construct an interval based on a level of confidence other than 95%? Higher levels of confidence require wider confidence intervals that have a higher likelihood of containing the true population mean; lower levels of confidence require narrower confidence intervals. 
For example, a 68% confidence interval is approximately half the width of a 95% confidence interval: it extends only one standard deviation of the distribution of sample means on either side of the sample mean, rather than two.
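Using the rule-of-thumb multipliers (2 standard errors for 95% confidence, 1 for 68%), the half-width relationship can be checked directly with hypothetical data:

```python
import math
import statistics as st

# Compare 95% and 68% confidence interval widths on a hypothetical sample.
sample = [48, 52, 50, 47, 53, 51, 49, 50]
m = st.mean(sample)
se = st.stdev(sample) / math.sqrt(len(sample))   # standard error of the mean

ci95 = (m - 2 * se, m + 2 * se)                  # ~95% confidence
ci68 = (m - se, m + se)                          # ~68% confidence
width95 = ci95[1] - ci95[0]
width68 = ci68[1] - ci68[0]
print(round(width68 / width95, 2))   # 0.5
```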

Comparing Two Populations

We have conducted single-population hypothesis tests, collecting a sample from one population and testing to see if its average was significantly different from the historical average. In many cases, hypothesis tests are performed by comparing samples from two different populations. For example, if our movie theater manager had two theaters, he could conduct an experiment—showing new releases in one theater and classic films in the other. To do this he would collect two samples during the same time period, one from each theater, and use a two-population hypothesis test to determine whether the average customer satisfaction rates in the two populations were different.

Large Samples

We know that if we have a large enough sample size (typically defined as greater than 30 data points), we can construct a confidence interval using properties of the normal distribution such as z-values, normal distribution rules of thumb, etc. For large samples, Excel consolidates the multiple steps for building a confidence interval into a single function, CONFIDENCE.NORM. =CONFIDENCE.NORM(alpha, standard_dev, size) • alpha, the significance level, equals one minus the confidence level (for example, a 95% confidence interval would correspond to the significance level 0.05). • standard_dev is the standard deviation of the population distribution. We will typically use the sample standard deviation, s, which is our best estimate of our population's standard deviation. • size is the sample size, n.
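CONFIDENCE.NORM returns the margin of error, z · s/√n. The same computation, sketched directly with a hypothetical s and n and 1.96 as the 95% z-value:

```python
import math

# Margin of error for a large-sample 95% confidence interval.
s, n = 12.0, 100          # hypothetical sample standard deviation and sample size
z = 1.96                  # z-value for alpha = 0.05 (95% confidence)
margin = z * s / math.sqrt(n)
print(round(margin, 3))
```

The confidence interval is then the sample mean plus or minus this margin.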

Small Samples

If we don't know anything about the underlying population, we cannot create a confidence interval with fewer than 30 data points because the properties of the Central Limit Theorem may not hold. Smaller samples have greater uncertainty, which means wider confidence intervals. However, if the underlying population is roughly normally distributed, we can use a confidence interval to estimate the population mean as long as we modify our approach slightly. We can gain insight into whether a data set is approximately normally distributed by looking at the shape of a histogram of that data. (There are formal tests of normality that are beyond the scope of this course.) To estimate the population mean with a small sample, we use a t-distribution instead of a "z-distribution," that is, a normal distribution. A t-distribution looks similar to a normal distribution but is not as tall in the center and has thicker tails. These differences reflect the fact that a t-distribution is more likely than a normal distribution to have values farther from the mean. Therefore, the normal distribution's "rules of thumb" do not apply. The shape of a t-distribution depends on the sample size; as the sample size grows toward 30, the t-distribution becomes very similar to a normal distribution.

The standard error, the population standard deviation divided by the square root of the sample size, is the standard deviation of the distribution of sample means. Large samples create a "tighter" distribution of sample means than smaller samples.

o A z-value of a point x is the distance x lies from the mean, measured in standard deviations: z = (x - µ)/σ. - Using the properties of the normal distribution, we can calculate a probability associated with any range of values. o The Excel function NORM.DIST calculates the cumulative probability associated with a specified value. o The Excel function NORM.INV calculates the value associated with a specified cumulative probability. This is equivalent to calculating the percentile. - The Central Limit Theorem states that if we take enough sufficiently large samples from any population, the means of those samples will be normally distributed, regardless of the shape of the underlying population. o The Distribution of Sample Means more closely approximates a normal curve as we increase the number of samples and/or the sample size. o The mean of any single sample lies on the normally distributed Distribution of Sample Means, so we can use the normal curve's special properties to draw conclusions from a single sample mean. o The mean of the Distribution of Sample Means equals the mean of the population distribution. o The standard deviation of the Distribution of Sample Means equals the standard deviation of the population distribution divided by the square root of the sample size. Thus, increasing the sample size decreases the width of the Distribution of Sample Means. - Excel Summary - =NORM.DIST(x, mean, standard_dev, cumulative) - =NORM.S.DIST(z, cumulative) - =NORM.INV(probability, mean, standard_dev)
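The z-value formula and the cumulative probability that =NORM.DIST(x, mean, standard_dev, TRUE) returns can be sketched with the standard normal CDF, expressed via the error function (the example values 130, 100, and 15 are hypothetical):

```python
import math

def z_value(x, mu, sigma):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mu) / sigma

def norm_cdf(z):
    """Cumulative probability of a standard normal distribution at z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = z_value(130, 100, 15)          # x = 130 on a normal distribution with mean 100, sd 15
print(round(z, 2), round(norm_cdf(z), 4))
```

norm_cdf(z_value(x, mu, sigma)) plays the role of NORM.DIST with cumulative set to TRUE; inverting it would correspond to NORM.INV.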

Hypothesis test

· Null Hypothesis (H0): The null hypothesis is a statement about a topic of interest. It is typically based on historical information or conventional wisdom. We always start a hypothesis test by assuming that the null hypothesis is true and then test to see if we can nullify it; that is why it's called the "null" hypothesis. The null hypothesis is the opposite of the hypothesis we are trying to prove (the alternative hypothesis). · Alternative Hypothesis (Ha): The alternative hypothesis (the opposite of the null hypothesis) is the theory or claim we are trying to substantiate. If our data allow us to nullify the null hypothesis, we substantiate the alternative hypothesis. A hypothesis test has only two possible outcomes: · Reject the null hypothesis · Fail to reject the null hypothesis (insufficient evidence) For example, in a U.S. court the accused is innocent until proven guilty: Null Hypothesis (H0): The accused is innocent. Alternative Hypothesis (Ha): The accused is not innocent (guilty).

Multiple Regression (two or more independent variables)

ŷ = a + b1x1 + b2x2 + ... + bkxk SellingPrice = 194,986.59 + 244.54(HouseSize) - 10,840.04(DistancefromBoston) Like single variable linear regression, the true multiple regression best fit line is written as ŷ = α + β1x1 + β2x2 + ... + βkxk, and the regression line with the error term as y = α + β1x1 + β2x2 + ... + βkxk + ε. The ε represents the error (or residual) term, which is the difference between the actual value of y and the expected value of y, which we refer to as ŷ. As in single variable regressions, the error term in a multiple linear regression model is assumed to be normally distributed with mean 0 and fixed variance.

Single Variable Linear Regression (one independent variable)

ŷ = a + bx SellingPrice = 13,490.45 + 255.36(HouseSize) SellingPrice = 686,773.86 - 15,162.92(DistancefromBoston)

The structure of the single variable linear regression line is ŷ = a + bx.

ŷ is the expected value of y, the dependent variable, for a given value of x.
x is the independent variable, the variable we are using to help us predict or better understand the dependent variable.
a is the y-intercept, the point at which the regression line intersects the vertical axis. This is the value of ŷ when the independent variable, x, is set equal to 0.
b is the slope, the average change in the dependent variable y as the independent variable x increases by one.
The true relationship between two variables is described by the equation y = α + βx + ε, where ε is the error term (ε = y − ŷ). The idealized equation that describes the true regression line is ŷ = α + βx.
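For intuition, a and b can be estimated from data by ordinary least squares: b = cov(x, y) / var(x) and a = ȳ − b·x̄. A self-contained sketch with made-up data points:

```python
# Ordinary least squares fit of y-hat = a + b*x (data below is made up)
def fit_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # slope: covariance of x and y divided by variance of x
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar        # intercept: line passes through (x_bar, y_bar)
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]
a, b = fit_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")  # y-hat = 0.09 + 1.99x
```

Note the fitted line always passes through the point (x̄, ȳ), which is why the intercept formula a = ȳ − b·x̄ works.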

Excel Summary of Basic Analytics

• =AVERAGE(number1, [number2], ...)
• =MEDIAN(number1, [number2], ...)
• =MODE.SNGL(number1, [number2], ...)
• =AVERAGEIF(range, criteria, [average_range])
• =PERCENTILE.INC(array, k)
• =VAR.S(number1, [number2], ...)
• =STDEV.S(number1, [number2], ...)
• =SQRT(number)
• =COUNT(value1, [value2], ...)
• =SUM(number1, [number2], ...)
• =MIN(number1, [number2], ...)
• =MAX(number1, [number2], ...)
• Creating the Descriptive Statistics output table

Excel Summary Recall the Excel functions and analyses covered in this course and make sure to familiarize yourself with all of the necessary steps, syntax, and arguments. We have provided some additional information for the more complex functions listed below. As usual, the arguments shown in square brackets are optional. The functions whose names include "S" use the standard normal distribution.

• =RAND()
• =NORM.DIST(x, mean, standard_dev, cumulative) When cumulative is set to "TRUE", NORM.DIST finds the cumulative probability, that is, the probability of being less than or equal to the specified value x, for a normal distribution with the specified mean and standard deviation. (Inserting the value "FALSE" provides the height of the normal distribution at the value x, which is not covered in this course.)
• =NORM.S.DIST(z, cumulative) When cumulative is set to "TRUE", NORM.S.DIST finds the cumulative probability, that is, the probability of being less than or equal to the specified value z, for a standard normal distribution.
• =NORM.INV(probability, mean, standard_dev) Returns the corresponding x-value on a normal distribution for the specified mean, standard deviation, and cumulative probability.
• =CONFIDENCE.NORM(alpha, standard_dev, size) Returns the margin of error using a normal distribution for a specified alpha, standard_dev, and size. Alpha is the significance level, which equals one minus the confidence level (for example, a 95% confidence interval corresponds to a significance level of 0.05).
• =CONFIDENCE.T(alpha, standard_dev, size) Returns the margin of error using a t-distribution for a specified alpha, standard_dev, and size.
• =IF(logical_test, [value_if_true], [value_if_false]) Returns value_if_true if the specified condition is met, and value_if_false if it is not.
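To make the CONFIDENCE.NORM calculation concrete: the margin of error it returns is the critical z-value for the chosen confidence level times σ/√n. A sketch in Python, with made-up inputs (α = 0.05, σ = 10, n = 100):

```python
from statistics import NormalDist

# Sketch of what =CONFIDENCE.NORM(alpha, standard_dev, size) computes:
# margin of error = z_(1 - alpha/2) * standard_dev / sqrt(size)
def confidence_norm(alpha: float, standard_dev: float, size: int) -> float:
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical z-value (1.96 for alpha = 0.05)
    return z * standard_dev / size ** 0.5

# 95% confidence (alpha = 0.05), sigma = 10, n = 100
print(round(confidence_norm(0.05, 10, 100), 4))  # 1.96
```

The confidence interval is then the sample mean plus or minus this margin of error. Quadrupling the sample size halves the margin, since n appears under a square root.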

Choose the right sample size:

• Sample size does not necessarily depend on population size. The right sample size depends on the desired accuracy and, in some cases, the likelihood of the phenomenon we wish to observe or measure.
• Avoid biased results by phrasing questions neutrally, ensuring that the sampling method is appropriate for the demographic of the target population, and pursuing high response rates. It is often better to have a smaller sample with a high response rate than a larger sample with a low response rate.
• If a sample is sufficiently large and representative of the population, the sample statistics, x̄ and s, should be reasonably good estimates of the population parameters, μ and σ, respectively.

To summarize a data set numerically, we often use descriptive statistics, also known as summary statistics.

• Three values describe the center, or central tendency, of the data set: the mean, the median, and the mode.
• The mean is equal to the sum of all data points in the set divided by the number of data points.
• The median is the middle value of the data set; it is the 50th percentile of the data set.
• The mode is the value that occurs most frequently in the data set. A data set may have multiple modes.
• We can also calculate a conditional mean, the mean of a subset of the data: we apply a condition and calculate the mean of the values that meet that condition.
• A percentile may be another value of interest. For example, 60% of the observations are less than or equal to the 60th percentile.
• The range, variance, and standard deviation measure the spread of the data.
• The standard deviation is equal to the square root of the variance. We typically work with the standard deviation rather than the variance because the standard deviation is in the same units as the data itself.
• To compare variation in different data sets, we calculate and compare the coefficients of variation for each set rather than simply comparing standard deviations. The coefficient of variation measures the size of the standard deviation relative to the size of the mean: coefficient of variation = standard deviation / mean.
• We can calculate many descriptive statistics using the Descriptive Statistics tool in Excel. If we are interested in only one of the measures, we can use the designated formula for that measure.
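These descriptive statistics can all be computed outside Excel as well; a minimal sketch using Python's standard `statistics` module on a made-up data set:

```python
import statistics as st

data = [4, 8, 8, 5, 7, 6, 9, 4, 8]    # illustrative data

mean = st.mean(data)          # like =AVERAGE(...)
median = st.median(data)      # like =MEDIAN(...)
mode = st.mode(data)          # like =MODE.SNGL(...)
variance = st.variance(data)  # sample variance, like =VAR.S(...)
stdev = st.stdev(data)        # sample standard deviation, like =STDEV.S(...)
cv = stdev / mean             # coefficient of variation = stdev / mean

print(median, mode)           # 7 8
print(round(mean, 2), round(cv, 3))
```

Because the coefficient of variation is unitless, it can be compared across data sets measured in different units, which the raw standard deviations cannot.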

