Business Analytics

Lakukan tugas rumah & ujian kamu dengan baik sekarang menggunakan Quizwiz!

Now calculate the 50th percentile for the oil consumption data on your own. Also, calculate the median using the MEDIAN function. Note how the 50th percentile relates to the median.

PERCENTILE.INC(A2:A11,0.50)=3.05 and MEDIAN(A2:A11)=3.05. 50% of the countries in our data set consume less than 3.05 million barrels of oil per day. 50% of the countries in our data set consume more than this amount. Note that by definition, the median is the 50th percentile.

correlation coefficient.

We can quantify the strength of a linear relationship between two variables by calculating the correlation coefficient. The value of the correlation coefficient ranges between -1 and +1. • A correlation coefficient near zero indicates a weak or nonexistent linear relationship. A correlation coefficient near zero does not mean there is no relationship between the two variables; it indicates only that any relationship that does exist is not linear.

multiple regression

We use multiple regression to investigate the relationship between a dependent variable and multiple independent variables.

single variable linear regression

We use single variable linear regression to investigate the relationship between a dependent variable and one independent variable. • A coefficient in a single variable linear regression characterizes the gross relationship between the independent variable and the dependent variable.

We use regression analysis for two primary purposes:

• Studying the magnitude and structure of a relationship between two variables. • Forecasting a variable based on its relationship with another variable.

The alternative hypothesis (Ha)

• The alternative hypothesis (Ha) is the theory or claim we are trying to substantiate.

The width of the confidence interval depends on

• The width of the confidence interval depends on the level of confidence, our best estimate of the population standard deviation, and the sample size. We can only control the level of confidence and the sample size. For large samples (n≥30), the lower and upper bounds are calculated using the following equation:

=CONFIDENCE.NORM

→ =CONFIDENCE.NORM(alpha, standard_dev, size) • Returns the margin of error using a normal distribution for a specified alpha, standard_dev, and size. Alpha is the significance level, which equals one minus the confidence level (for example, a 95% confidence interval would correspond to the significance level 0.05).

Calculating the range of likely sample means using

Calculating the range of likely sample means using CONFIDENCE.NORM or CONFIDENCE.T

Using the Data Analysis tool to:

Create bins and histograms Create the Descriptive Statistics output table

regression models

Creating regression models using lagged variables Creating regression models using dummy variables • =IF(logical_test,[value_if_true],[value_if_false]) → Returns value_if_true if the specified condition is met, and returns value_if_false if the condition is not met. Forecasting with regression models in Excel → Creating a regression output table using the Data Analysis tool

calculate 75th percentile.

Step 1 In cell E2, enter the function =PERCENTILE.INC(A2:A11,0.75).

The function CONFIDENCE.NORM

The function CONFIDENCE.NORM calculates the margin of error, which we add and subtract from the sample mean to find the confidence interval.

Outliners

离群值,异常值

=COUNT

=COUNT(value 1, [value 2], ...)

=IF

=IF(logical_test,[value_if_true],[value_if_false]) • Returns value_if_true if the specified condition is met, and returns value_if_false if the condition is not met.

=MAX

=MAX(number 1, [number 2], ...)

=MEDIAN

=MEDIAN(number 1, [number 2], ...)

=MIN

=MIN(number 1, [number 2], ...)

=MODE.SNGL

=MODE.SNGL(number 1, [number 2], ...)

=NORM.DIST

=NORM.DIST(x, mean, standard_dev, cumulative) • When cumulative is set to "TRUE", NORM.DIST finds the cumulative probability, that is, the probability of being less than or equal to the specified value x, for a normal distribution with the specified mean and standard deviation. (Inserting the value "FALSE" provides the height of the normal distribution at the value x, which is not covered in this course.)

=NORM.INV

=NORM.INV(probability, mean, standard_dev) • Returns the corresponding x-value on a normal distribution for the specified mean, standard deviation, and cumulative probability.

=NORM.S.DIST

=NORM.S.DIST(z, cumulative) • When cumulative is set to "TRUE", NORM.S.DIST finds the cumulative probability, that is, the probability of being less than or equal to the specified value z for a standard normal distribution.

=PERCENTILE.INC

=PERCENTILE.INC(array, k) • Returns the k-th percentile of value in the specified array. For example, if we want to know the 95th percentile for an array of data, k would be 0.95.

=SQRT

=SQRT(number)

=STDEV.S

=STDEV.S(number 1, [number 2], ...)

=SUM

=SUM(number 1, [number 2], ...)

=VAR.S

=VAR.S(number 1, [number 2], ...)

conditional mean.

A conditional mean is the mean of a subset of the data that includes all values satisfying a certain condition.

confidence interval

A confidence interval associated with an independent variable's coefficient indicates the likely range for that coefficient. → If the 95% confidence interval does not contain zero, we can be 95% confident that there is a significant linear relationship between the variables.

histogram.

A histogram's x-axis represents bins corresponding to ranges of data; its y-axis indicates the frequency of observations falling into each bin.

percentile

A percentile may be another value of interest. For example, 60% of the observations are less than or equal to the 60th percentile. The median is by definition the 50th percentile of a data set.

z-value

A z-value of a point x is the distance x lies from the mean, measured in standard deviations,

scatter plot

Adding the best fit line to a scatter plot using the Insert menu

mean, median,mode

In general, if there is an outlier, or if the distribution is skewed-- that is, has a tail that extends out to one side-- the extreme values will pull the mean towards them. People tend to rely heavily on the mean to characterize a data set, but it's important to realize that the mean is affected by outliers and therefore may not be the best value to represent the distribution. In this case, with a skewed distribution, we may draw false conclusions about the underlying distribution if we use only the mean to represent the data. In this case, we may consider the median to capture the central tendency more accurately because it is not pulled upwards by the two largest consumers. The mode, like the median, is not affected by outliers.

If you were interested in knowing the average oil consumption of the top oil-consuming countries, how would you handle the outliers?

Keep them in the data set is correct Because we want to know average oil consumption of the top oil-consuming countries, we would keep the outliers in the data set. We know that the United States and China are large oil-consumers, so we have no reason to suspect that those countries' data are not valid entries. Removing them would significantly alter our analysis.

Which histogram do you think best displays the 2012 revenue for the top 100 U.S. companies? Why?

Option B: it displays 10 buckets which eases the review and provides enough detailed information at the same time. Option A may be to general while the others are too detailed and do not accentuate on the trends.

R2

R2 measures the percent of total variation in the dependent variable, y, that is explained by the regression line. • 0≤R2≤1 • For a single variable linear regression, R2 is equal to the square of the correlation coefficient. In addition to analyzing R2, we must test whether the relationship between the dependent and independent variable is significant and whether the linear model is a good fit for the data. We do this by analyzing the p-value (or confidence interval) associated with the independent variable and the regression's residual plot. The p-value of the independent variable is the result of the hypothesis test that tests whether there is a significant linear relationship; that is, it tests whether the slope of the regression line is zero, H0: β=0 and Ha: β≠0. If the coefficient's p-value is less than 0.05, we reject the null hypothesis and conclude that we have sufficient evidence to be 95% confident that there is a significant linear relationship between the dependent and independent variables. Note that the p-value and R2 provide different information. A linear relationship can be significant (have a low p-value) but not explain a large percentage of the variation (not have a high R2.)

One of the simplest measures of variability, or spread, is the range:

Range=Maximum value-Minimum value

Skewness

Skewness measures the degree of a graph's asymmetry.

coefficient of variation.

Standard deviation / mean

Calculating Conditional Means

Step 1 In cell F2, enter the function =AVERAGEIF(C2:C11,E2,A2:A11). This function says, if the country's continent is North America, then include that country's oil consumption in the calculation of the average. C2:C11 contains the data to which we want to apply the condition. E2 is the condition or criterion by which we choose the data points to include in the calculation of the conditional mean. In the function we have specified, we have just entered a link to cell E2, which contains "North America". Alternatively, we could type "North America" into the function as follows: =AVERAGEIF(C2:C11,"North America",A2:A11). A2:A11 contains the numeric data from which the conditional mean will be calculated. Only the values that meet the criterion—in this case, data corresponding to countries in North America—will be used in the calculation. Thus, the mean oil consumption will be calculated for Canada and the United States. The average oil consumption of North American countries is AVERAGEIF(C2:C11,"North America",A2:A11), or equivalently, AVERAGEIF(C2:C11,E2,A2:A11)=10.45 million barrels per day.

Creating a Histogram

Step 3 From the Data menu, select Data Analysis, then select Histogram. Step 4 Enter the appropriate Input Range and Bin Range: The Input Range is the original oil consumption data in column A with its label, A1:A11. The Bin Range is the set of values from 1-19 that you created in column D with its label, D1:D20. Make sure to include the cells containing labels when inputting your ranges and check the Labels in first row box, as this ensures that your histogram will be appropriately labeled.

Central Limit Theorem

The Central Limit Theorem states that if we take enough sufficiently large samples from any population, the means of those samples will be normally distributed, regardless of the shape of the underlying population. The distribution of those sample's means, called the Distribution of Sample Means, more closely approximates a normal curve as we increase the number of samples and/or the sample size. • The mean of any single sample lies on the normally distributed Distribution of Sample Means, so we can use the normal curve's special properties to draw conclusions from a single sample mean. • The mean of the Distribution of Sample Means equals the mean of the population distribution. • The standard deviation of the Distribution of Sample Means equals the standard deviation of the population distribution divided by the square root of the sample size. Thus, increasing the sample size decreases the width of the Distribution of Sample Means.

What percentile does the mean represent?

The answer cannot be determined without further information is correct Remember that the mean's location depends upon the distribution of the data set. Recall how the location of the mean differs for a symmetrical distribution and a skewed distribution. Therefore, there is no way to determine the percentile of the mean without more information about the data set.

What percentile does the mode represent?

The answer cannot be determined without further information is correct Remember that the mode's location depends upon the distribution of the data set. Therefore, there is no way to determine the percentile of the mode without more information about the data set.

The structure of the multiple regression equation is

The structure of the multiple regression equation is 𝐲=a+b1x1+b2x2+...+bkxk. • The true relationship between multiple variables is described by y=α+β1x1+β2x2+...+βkxk+ε, where 𝜀 is the error term. The idealized equation that describes the true regression model is y=α+β1x1+β2x2+...+βkxk. • Coefficients in multiple regression characterize relationships that are net with respect to the independent variables included in the model but gross with respect to all omitted independent variables.

What percentile does the median represent?

50% is correct Remember that half of a distribution's data points are less than or equal to the median. Therefore, the median is equal to the 50th percentile, because 50% of the data points are equal to or below this value.

=AVERAGE

=AVERAGE(number 1, [number 2], ...)

=AVERAGEIF

=AVERAGEIF(range, criteria, [average_range]) • Returns the conditional mean, or average of the cells in a specified range that meet the given criteria. • range contains the one or more cells to which we wish to apply the criteria or condition. • criteria is the condition that is to be applied to the range. • [average_range] is the range of cells containing the data we wish to average.

=CONFIDENCE.T

=CONFIDENCE.T(alpha, standard_dev, size) • Returns the margin of error using a t-distribution for a specified alpha, standard_dev, and size.

=CORREL

=CORREL(array 1, array 2)

bins

Before we plot the data, we first need to determine those height ranges. We call these ranges "bins" and will use them to categorize players' heights.

Creating regression models with dummy variables

Creating regression models with dummy variables • =IF(logical_test,[value_if_true],[value_if_false]) → Returns value_if_true if the specified condition is met, and returns value_if_false if the condition is not met. • To perform a regression analysis with an independent dummy variable, follow the same steps as when using quantitative variables.

cross-sectional data

Data that provide a measure of an attribute across multiple different subjects (e.g. people, organizations, or countries) at a given moment in time or during a given time period.

Forecasting with regression models in Excel

Forecasting with regression models in Excel • =SUMPRODUCT(array1, [array2], [array3],...) is a convenient function for calculating point forecasts. → Creating a regression output table using the Data Analysis tool

normal distribution

The normal distribution has a unique symmetrical shape whose center and width are determined by its mean and standard deviation respectively.

The null hypothesis (H0)

The null hypothesis (H0) is a statement about a topic of interest about the population. It is typically based on historical information or conventional wisdom. We always start a hypothesis test by assuming that the null hypothesis is true and then test to see if we can nullify it using evidence from a sample. The null hypothesis is the opposite of the hypothesis we are trying to prove (the alternative hypothesis).

confidence interval

The sample mean is only a point estimate. Using the properties of the normal distribution and the Central Limit Theorem, we can construct a range around the sample mean, called a confidence interval, to estimate the range in which the true population mean likely lies.

dummy variable

We can also perform regression analyses using qualitative, or categorical, variables. To do so, we must convert data to dummy (0, 1) variables. After that, we can proceed as we would with any other regression analysis. • A dummy variable is equal to 1 when the variable of interest fits a certain criterion. For example, a dummy variable for "Female" would equal 1 for all female observations and 0 for male observations.

The structure of the single variable linear regression line is 𝐲=a+bx.

The structure of the single variable linear regression line is 𝐲=a+bx. • 𝐲 is the expected value of y, the dependent variable, for a given value of x. • x is the independent variable, the variable we are using to help us predict or better understand the dependent variable. • a is the y-intercept, the point at which the regression line intersects the vertical axis. This is the value of y when the independent variable, x, is set equal to 0. • b is the slope, the average change in the dependent variable y as the independent variable x increases by one. • The true relationship between two variables is described by the equation y=α+βx+𝜀, where 𝜀 is the error term y-y. The idealized equation that describes the true regression line is y=α+βx.

Trade-offs

Trade-offs: The higher the confidence level (and therefore the lower the significance level), the lower the chance of rejecting the null hypothesis when it is true (type I error or false positive). But the higher the confidence level, the higher the chance of not rejecting it when it is false (type II error or false negative).

when we see an outlier:

Typically, we take one of three approaches when we see an outlier: leave it as is, change it to a corrected value, or, very rarely, remove it from the data set. Excluding or changing data is not something we do often. Such changes to a data set should only be done on a case-by-case basis after careful investigation of the situation. We should always carefully investigate outliers before deciding whether to leave them as is, change their values to the correct values, or remove them. We never change their values or remove outliers without carefully investigating the situation.

dummy (0, 1) variables.

We can also calculate confidence intervals for proportions. To do so, we must convert data to dummy (0, 1) variables. After that, we can proceed as we would with any other confidence interval. → When estimating the true population proportion, we should ensure that the sample size is large enough by checking that both of the following conditions are true: n*p ≥ 5, and n(1−p)≥ 5. If either of these guidelines is not satisfied, we must collect a larger sample.

Before conducting a hypothesis test:

• Determine whether to analyze a change in a single population or compare two populations. • Determine whether to perform a one-sided or two-sided hypothesis test.

To make sample sound inferences

• Make sure the sample is sufficiently large and is representative of the population. • Avoid biased results by → phrasing questions neutrally; → ensuring that the sampling method is appropriate for the demographic of the target population; and → pursuing high response rates. • If a sample is sufficiently large and representative of the population, the sample statistics, x and s, should be reasonably good estimates of the population parameters, μ and σ, respectively.

To conduct a hypothesis test, we must follow these steps:

• State the null and alternative hypotheses. • Choose the level of significance for the test. • Gather data about a sample or samples. • To determine whether the sample is highly unlikely under the assumption that the null hypothesis is true, construct the range of likely sample means or calculate the p-value.

lagged variables

• We can also include lagged variables in multiple regression models. Lagged values are used to capture the ongoing effects of a given variable. → The lag period is based on managerial insight and data availability. → Including lagged variables has some drawbacks: • Each lagged variable decreases our sample size by one observation. • If the lagged variable does not increase the model's explanatory power, the addition of the variable decreases Adjusted R2.

=T.TEST

→ =T.TEST(array1, array2, tails, type) • Returns the p-value associated with a given t-test. • array1 is a set of numerical values or cell references. • array2 is a set of numerical values or cell references. If we only have one set of data, for the second data set we create a column for which every entry is the historical mean. • tails is the number of tails for the distribution. It should be set to 1 to perform a one-sided test; to 2 to perform a two-sided test. • typecanbe1,2,or3. → Type 1 is a paired test and is used when the same group is tested twice to provide paired "before and after" data for each member of the group. → Type 2 is an unpaired test in which the samples are assumed to have equal variances. → Type 3 is an unpaired test in which the samples are assumed to have unequal variances. Unless we have a good reason to believe two samples have equal variances, we typically use type 3 when conducting an unpaired test.

Several rules of thumb are helpful for estimating probabilities for a normal distribution.

→ About 68% of the probability is contained in the range reaching one standard deviation away from the mean on either side, that is, P(μ-σ≤ 𝑥 ≤μ+σ)≈ 68%. → About 95% of the probability is contained in the range reaching two standard deviations (1.96 to be exact) away from the mean on either side, that is, P(μ-2σ≤ 𝑥 ≤μ+2σ)≈ 95%. → About 99.7% of the probability is contained in the range reaching three standard deviations away from the mean on either side, that is, P(μ-3σ≤ 𝑥 ≤μ+3σ)≈ 99.7%.

For small samples (n<30), the lower and upper bounds are calculated using the following equation

→ For small samples, we use a t-distribution, which is shorter and wider than a normal distribution. The t- distribution provides a wider range, a more conservative estimate of where the true population mean lies. → The function CONFIDENCE.T calculates the margin of error, which we add and subtract from the sample mean to find the confidence interval.

Residual plots

→ Residual plots can provide important insights into whether a linear model is a good fit. • Each observation in a data set has a residual equal to the historically observed value minus the regression's predicted value, that is, 𝜀=y-𝐲. • Linear regression models assume that the regression's residuals follow a normal distribution with a mean of zero and fixed variance.

p-value.

→ The p-value is the likelihood of obtaining a sample as extreme as the one we've obtained, if the null hypothesis is true. → The p-value of a one-sided hypothesis test is half the p-value of a two-sided hypothesis test. → If the sample mean falls in the range of likely sample means, or if its p-value is greater than the stated significance level, we do not have sufficient evidence to reject the null hypothesis. → If the sample mean falls in the rejection region, or if it has a p-value lower than the stated significance level, we have sufficient evidence to reject the null hypothesis. We can never accept the null hypothesis.

point forecast

→ We determine a point forecast by entering the desired value of x into the regression equation. • We must be extremely cautious about using regression to forecast for values outside of the historically observed range of the independent variable (x-values). • Instead of predicting a single point, we can construct a prediction interval, an interval around the point forecast that is likely to contain, for example, the actual selling price of a house of a given size. → The width of a prediction interval varies based on the standard deviation of the regression (the standard error of the regression), the desired level of confidence, and the location of the x-value of interest in relation to the historical values of the independent variable.


Set pelajaran terkait

Cardiovascular, Hematologic, and Lymphatic Systems Level 1

View Set

Sociology Exam 2: Chapter 11- The Economy and Work

View Set

QM 6655 Midterm (Chapters 1, 2, 3, 4, 5, 10, 11, 16)

View Set

Coursepoint Module 15 Quiz: Taylor's Clinical Nursing Skills

View Set

Principles of Management and Leadership

View Set

Los Números de 10 en 10 (del 10 al 200)

View Set