HBX Business Analytics
How to create summary statistics
Excel has a descriptive statistics tool that provides a number of summary statistics, including those we've already learned, for a set of data. Let's walk through how to create it. Step 1 From the Data menu, select Data Analysis, then select Descriptive Statistics. Step 2 Enter the appropriate Input Range: The Input Range is the oil consumption data in column A with its label, A1:A11. Make sure to include A1, the cell containing the label, when inputting your range and check the Label in first row box, as this ensures that your output table will be appropriately labeled. Step 3 Enter the appropriate Output Range, in this case enter D1. This cell is the top left hand cell in which the output table will appear. Step 4 Be sure to select Summary Statistics so that the output table is generated.
Mean
Mean (average): The mean is equal to the sum of all of the data points in a set, divided by n, the number of data points. We represent the mean by x¯ when we calculate the mean of a sample and by μ when we calculate the mean of a population. In the following definitions, n is the number of data points in a sample and N is the number of data points in a population. x¯=∑i=1nxin=x1+x2+...+xnn and μ=∑i=1NxiN=x1+x2+...+xNN For the data set {0.5, 0.5, 1.5, 3.0, 4.0}, x¯=(0.5+0.5+1.5+3.0+4.0)5=9.55=1.9
Median
Median: The middle value of a data set. The same number of data points fall above and below the median. To find the median, first arrange the values in order of magnitude. If the total number of data points is odd, the median is the value that lies in the middle. If the total number is even, the median is the average of the two middle values. The median of the data set {0.5, 0.5, 1.5, 3.0, 4.0} is 1.5. The median of the data set {0.5, 0.5, 1.5, 3.0} is the average of 0.5 and 1.5, which is 1.0.
Mode
Mode: The value that occurs most frequently in a data set. The mode of the data set {0.5, 0.5, 1.5, 3.0, 4.0} is 0.5.
Range
One of the simplest measures of variability, or spread, is the range: Range=Maximum value-Minimum value
Percentile
Percentiles are often used to categorize test scores. For example, someone who scored in the 95th percentile of a test scored equal to or higher than 95% of all people who took that test. We can also say that person scored in the top 5%.
Descriptive Statistic
We call such summary measures "descriptive statistics"—they provide a quick overview of a data set without showing every data point. We encounter these kinds of descriptive statistics every day: we talk about a baseball player's performance by referring to his batting average; we measure stock market performance using the Dow Jones Industrial Average or the Nikkei Index; we summarize our academic performance using our grade point average.
Scatter Plots
We use histograms to help us visualize a single variable's distribution. To visualize the relationship between two variables, we typically use a scatter plot. One variable is plotted on the horizontal axis (x-axis), and the other is plotted on the vertical axis (y-axis)
z-value
the distance that x lies from the mean, measured in standard deviations.
Time series
time series—a data set in which one of the variables is time
Central Tendency
"central tendency" of a data set—an indication of where the "center" of the data set lies.
Rules of thumb for the normal distribution
About 68% of the probability is contained in the range reaching one standard deviation away from the mean on either side: P(μ−σ≤x≤μ+σ)≈68% About 95% of the probability is contained in the range reaching two standard deviations (1.96 to be exact) away from the mean on either side: P(μ−2σ≤x≤μ+2σ)≈95% We'll use two standard deviations when discussing the normal distribution conceptually, but we will always use 1.96 for actual calculations in Excel. About 99.7% of the probability is contained in the range reaching three standard deviations away from the mean on either side: P(μ−3σ≤x≤μ+3σ)≈99.7%
Confidence interval
An estimate of the range in which the true population mean likely lies.
Histogram
Histogram, also called a bar chart. A histogram's x-axis represents bins corresponding to ranges of data; its y-axis indicates the frequency of observations falling into each bin. The best bin size depends on what we are trying to learn from the data. Using larger bins can simplify a histogram, but may make it difficult to see trends in the data. Very small bins can have such low frequencies that make it difficult to discern patterns.
multimodal
If a data set has more than one value with the highest frequency, that data set has more than one mode. A distribution is called bimodal if it has two clearly defined peaks (two points with very high frequency). The two peaks may have equal frequency and hence be true modes, or one peak may be a mode and the other peak may simply have a very high (but not the highest) frequency. Distributions with multiple peaks are called multimodal.
Central Limit Theorem
If we take enough sufficiently large samples from any population, the means of those samples will be normally distributed regardless of the shape of the underlying population.
Kurtosis
Kurtosis: a measure of the height and sharpness of a distribution. A flat distribution has a very low kurtosis; a very peaked distribution has high kurtosis.
Skewness
Skewness measures the degree of a graph's asymmetry. If the right tail is longer, we say the distribution is skewed to the right or "right-tailed." Likewise, if the left tail is longer, we say the distribution is skewed to the left or "left-tailed."
Outlier
Technically, a data point is considered an outlier if it is more than a specified distance below the lower quartile or above the upper quartile of a data set. Let's start with a couple of definitions. The lower quartile, Q1, is the 25th percentile—by definition, 25% of all observations fall below Q1. The upper quartile, Q3, is the 75th percentile—75% of all observations fall below Q3. The interquartile range (IQR) is the difference between the upper and lower quartiles, that is, IQR=Q3-Q1. We then multiply the IQR by 1.5 to find the appropriate range, computing 1.5(IQR)=1.5(Q3-Q1). A data point is an outlier if it is less than Q1-1.5(IQR) or greater than Q3+1.5(IQR).
Conditional Mean
The mean of a specific subset of data is known as the conditional mean, because we are imposing a condition on our data set and we only want to find the mean of the values that meet that condition. To calculate a conditional mean in Excel, we can use the following formula: =AVERAGEIF(range, criteria, [average_range]) range contains the one or more cells to which we want to apply the criteria or condition. criteria is the condition that is to be applied to the range. [average_range] is the range of cells containing the data we wish to average.
Standard Deviation
The standard deviation describes how much the values in a single data set vary.The range, variance, and standard deviation measure the spread of the data. The standard deviation is equal to the square root of the variance.
Time Series vs. Cross-Sectional Data:
Time Series: Time series data contain data about a given subject in temporal order, measured at regular time intervals (e.g. minutes, months, or years). U.S. oil consumption from 2002 through 2012 is an example of a time series. Managers collect and analyze time series to identify trends and predict future outcomes. Cross-Sectional: Cross-sectional data contain data that measure an attribute across multiple different subjects (e.g. people, organizations, countries) at a given moment in time or during a given time period. The average oil consumption of ten countries in 2012 is an example of cross-sectional data. Managers use cross-sectional data to compare metrics across multiple groups.
Mean Median Mode Excel Functions
To calculate the values for the mean, median, and mode in Excel, we can use the following formulas: =AVERAGE(number 1, [number 2], ...) =MEDIAN(number 1, [number 2], ...) =MODE.SNGL(number 1, [number 2], ...) number 1 is the first number, cell reference, or range of cells for which to calculate the specified value. [number 2],... represents additional numbers, cell references, or ranges of cells. The square brackets indicate that the argument is optional. For example, if we had the data set {0.5, 0.5, 1.5, 3.0, 4.0}, we could find the mean by entering =AVERAGE(0.5, 0.5, 1.5, 3.0, 4.0) into a cell. This would calculate the average, 1.9. More often when using Excel, we won't input data directly into the formula, but will instead input a range of cells. For example if cell A1=0.5, A2=0.5, A3=1.5, A4=3.0, and A5=4.0, we could calculate the mean of the data set by entering =AVERAGE(A1:A5), which would return 1.9. Because the mean is the sum of all data points, divided by the number of data points, we could also enter =SUM(A1:A5)/COUNT(A1:A5). The COUNT function counts the number of cells that contain numerical values so in this case =SUM(A1:A5)/COUNT(A1:A5) is equivalent to =SUM(A1:A5)/5.
Variance
To calculate the variance or standard deviation of a sample in Excel, we can use the following functions: =VAR.S(number 1, [number 2], ...) =STDEV.S(number 1, [number 2], ...) number 1 is the first number, cell reference, or range of cells for which to calculate the specified value. [number 2],... represents additional numbers, cell references, or ranges of cells. The square brackets indicate that the argument is optional. Note that the "S" in VAR.S and STDEV.S indicates that we are working with a sample. We will learn more about the differences between samples and populations in the next module. We can also find the standard deviation using the Excel function =SQRT(number) to take the square root of the variance. For example, =SQRT(16)=4.
Coefficient Variation
To compare variation in two data sets, we calculate a value called the coefficient of variation (CV). The coefficient of variation is the ratio of the standard deviation to the mean. The equation for the coefficient of variation is: Coefficient of Variation=Standard DeviationMean CVStock A=4.5025.70=0.180 CVStock B=4.50100.00=0.045
Correlation coefficient
We will use the correlation coefficient to measure the strength of the linear relationship between two variables. The correlation coefficient measures the extent to which the data points on the scatter plot create a line, on a scale from -1 to +1. To find the correlation coefficient in Excel, we use the following function: =CORREL(array 1, array 2) array 1 is a set of numerical variables or cell references containing data for one variable of interest. array 2 is a set of numerical variables or cell references containing data for the other variable of interest. Note that the number of observations in array 1 must be equal to the number in array 2. Correlation indicates a linear relationship, but it does not indicate causality.
Normal Distribution
a unique symmetrical shape whose centre and width are determined by its mean and standard deviation respectively.
Bias
avoid biased results by - phrasing questions neutrally -ensuring that the sampling method is appropriate for the demographic of the target population -pursuing a high response rate