business analytics


95% CI corresponds to? 68%?

1.96 StDev (about 2), 1 StDev
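The multipliers for these two coverage levels can be checked against the standard normal distribution. This is a minimal sketch using Python's standard library; the helper name `z_for_central_coverage` is made up for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def z_for_central_coverage(coverage):
    """Return z such that P(-z <= Z <= z) equals `coverage` for a standard normal Z."""
    tail = (1.0 - coverage) / 2.0          # probability left in each tail
    return std_normal.inv_cdf(1.0 - tail)

# Number of standard deviations that captures the central 95% of a normal curve.
z95 = z_for_central_coverage(0.95)

# Share of a normal curve that lies within 1 standard deviation of the mean.
coverage_within_1sd = std_normal.cdf(1.0) - std_normal.cdf(-1.0)

print(round(z95, 2), round(coverage_within_1sd, 3))
```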

What happens to the sample mean and standard deviation as you take new samples of equal size?

The sample mean and standard deviation vary but remain fairly close to the population mean and standard deviation

in a single variable linear regression, R^2 = ?

(Correlation Coefficient)^2
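A quick numerical check of this identity, as a sketch: the data points below are invented, and the correlation and least-squares fit are computed by hand rather than with any particular library.

```python
from math import sqrt

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)   # total sum of squares of y

# Correlation coefficient (what =CORREL returns in Excel).
r = sxy / sqrt(sxx * syy)

# Least-squares line, then R^2 = 1 - (residual SS / total SS).
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
r_squared = 1.0 - sse / syy

print(r ** 2, r_squared)  # the two values agree up to rounding
```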

why are lagged variables costly? 2

1. Each lagged variable creates an incomplete line of data. If we have a single lagged variable, our first observation will be incomplete. If we have two lagged variables, our first two observations will be incomplete, and so on. The loss of each data point decreases our sample size by one, which reduces the precision of our estimates of the regression coefficients. 2. In addition, if the lagged variable, or variables, do not increase the model's explanatory power, the addition of the variable decreases Adjusted R2, just as the addition of any variable to a regression model can.

How to test whether the linear relationship between two variables is significant? In other words, that the regression line's slope isn't 0?

1. check whether the CI for the line's slope contains 0 2. look at the p-value to determine whether we can reject the null hypothesis that the slope is 0.

how to calculate a p-value in excel to compare sample to historical mean?

1. create a second column that is just the historical average copied down 2. =T.TEST(array1, array2, tails, type) array1 is a set of numerical values or cell references. We will place our sample data in this range. array2 is a set of numerical values or cell references. We have only one set of data, so we will use the historical mean, 6.7, as the second data set. To do this, we create a column with each entry equal to 6.7. tails is the number of tails for the distribution. It can be either 1 or 2. We will learn more about what this means later in the module. Since our alternative hypothesis is that the mean has changed and therefore can be either lower or higher than the historical mean, we will be using a two-tailed, or two-sided hypothesis test. type can be 1, 2, or 3. Type 1 is a paired test and is used when the same group from a single population is tested twice to provide paired "before and after" data for each member of the group. Type 2 is an unpaired test in which the samples are assumed to have equal variances. Type 3 is an unpaired test in which the samples are assumed to have unequal variances. The variances of the two columns are clearly different in our case, so we use type 3. There are ways to test whether variances are equal, but when in doubt, use type 3.

how to conduct linear regression using CORe excel?

1. data > data analysis > regression 2. try to include labels. y is dependent variable and x is independent variable. 3. make sure to check residuals and residual plots

Correlation Coefficient excel function

=CORREL(array 1, array 2) array 1 is a set of numerical variables or cell references containing data for one variable of interest. array 2 is a set of numerical variables or cell references containing data for the other variable of interest. Note that the number of observations in array 1 must be equal to the number in array 2.

excel formulas for mean, median, and mode?

=AVERAGE(number 1, [number 2], ...) =MEDIAN(number 1, [number 2], ...) =MODE.SNGL(number 1, [number 2], ...)

averageif?

=AVERAGEIF(range, criteria, [average_range]) range contains the one or more cells to which we want to apply the criteria or condition. criteria is the condition that is to be applied to the range. [average_range] is the range of cells containing the data we wish to average.

how to calculate the margin of error using excel?

=CONFIDENCE.NORM(alpha, standard_dev, size) alpha, the significance level, equals one minus the confidence level (for example, a 95% confidence interval would correspond to the significance level 0.05). standard_dev is the standard deviation of the population distribution. We will typically use the sample standard deviation, s, which is our best estimate of our population's standard deviation. size is the sample size, n. CONFIDENCE.NORM returns the margin of error, z·s/√n, where z is the z-value associated with the specified level of confidence. The lower and upper bounds of the confidence interval are equal to the sample mean, plus or minus that margin of error.
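The same calculation can be sketched outside Excel. The function below is a hypothetical stand-in for CONFIDENCE.NORM, written from the description above, and the sample mean, standard deviation, and sample size are invented.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error_norm(alpha, standard_dev, size):
    """Two-sided margin of error: z-value times standard_dev / sqrt(size)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # alpha is split across both tails
    return z * standard_dev / sqrt(size)

# Hypothetical sample: mean 6.7, standard deviation 2.0, 100 observations.
sample_mean = 6.7
moe = margin_of_error_norm(alpha=0.05, standard_dev=2.0, size=100)
lower, upper = sample_mean - moe, sample_mean + moe

print(round(moe, 3), round(lower, 2), round(upper, 2))
```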

t-dist CI width excel formula?

=CONFIDENCE.T(alpha, standard_dev, size) alpha, the significance level, equals one minus the confidence level (for example, a 95% confidence interval would correspond to the significance level 0.05). standard_dev is the standard deviation of the population distribution. We will typically use the sample standard deviation, s, which is our best estimate of our population's standard deviation. size is the sample size, n.

another way to forecast future ys?

=FORECAST(x, known_y's, known_x's) x is the data point for which you want to predict a value. known_y's is the dependent array or range of data. known_x's is the independent array or range of data.

To find a cumulative probability, the probability of being less than a specified value on a normal curve, we use?

=NORM.DIST(x, mean, standard_dev, cumulative) x is the value at which you want to evaluate the distribution function. mean is the mean of the distribution. standard_dev is the standard deviation of the distribution. cumulative is an argument that specifies the type of probability we wish to calculate. We insert "TRUE" to indicate that we wish to find the cumulative probability, that is, the probability of being less than or equal to the x-value. (Inserting the value "FALSE" provides the height of the normal distribution at the value x, which we will not cover in this course.)

how to start with a cumulative probability and find the corresponding value on a normal curve?

=NORM.INV(probability, mean, standard_dev) probability is the cumulative probability for which we want to know the corresponding x-value on a normal distribution. mean is the mean of the distribution. standard_dev is the standard deviation of the distribution.
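The two directions (value to cumulative probability, and cumulative probability back to value) can be sketched with Python's `statistics.NormalDist`, which plays the role of NORM.DIST with cumulative = TRUE and of NORM.INV here. The mean of 100 and standard deviation of 15 are invented.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100.0, sigma=15.0)

# Value -> cumulative probability, like =NORM.DIST(115, 100, 15, TRUE).
p_below_115 = dist.cdf(115.0)

# Cumulative probability -> value, like =NORM.INV(p, 100, 15).
x_back = dist.inv_cdf(p_below_115)

print(round(p_below_115, 4), round(x_back, 4))
```

Because 115 is exactly one standard deviation above the mean, the cumulative probability is also what a z-value of 1 would give.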

excel percentile?

=PERCENTILE.INC(array, k) array is the range of data for which we want to calculate a given percentile. k is the percentile value. For example, if we want to know the 95th percentile, k would be 0.95.

how to find the z-value in excel?

=STANDARDIZE(x, mean, standard_dev) x is the value to be standardized. mean is the mean of the distribution. standard_dev is the standard deviation of the distribution.

how to calculate the t-value with n-2 dof used in calculating a prediction interval using excel?

=T.INV.2T(probability, degrees_freedom) probability is 1-confidence level, so for a 95% prediction interval, we would enter 0.05. degrees_freedom is the number of degrees of freedom, in this case, n-2, where n is the sample size.

given CI, excel function to find t-value?

=T.INV.2T(probability, degrees_freedom) probability is the significance level, that is, 1-confidence level, so for a 95% confidence interval, the significance level=0.05. degrees_freedom is the number of degrees of freedom, which in this case is simply the sample size minus one, or n-1.

variance and standard deviation excel formulas?

=VAR.S(number 1, [number 2], ...) =STDEV.S(number 1, [number 2], ...) number 1 is the first number, cell reference, or range of cells for which to calculate the specified value. [number 2],... represents additional numbers, cell references, or ranges of cells. The square brackets indicate that the argument is optional. Note that the "S" in VAR.S and STDEV.S indicates that we are working with a sample. We will learn more about the differences between samples and populations in the next module.

What is a hidden variable?

A hidden variable is a variable that is correlated with each of two variables (such as ice cream and snow shovel sales) that are not fundamentally related to each other. Although there is no direct relationship between these two variables, they are mathematically correlated because each is correlated individually with a third "hidden" variable. Therefore, for a variable to act as a hidden variable, there must be three variables, all of which are mathematically correlated (either directly or indirectly).

what is a histogram?

A histogram displays the frequency, or number, of data points (often called observations) that fall within specified bins.

by convention, how does excel create histograms?

By convention, Excel includes in the range the number represented by the bin label. So bin 1 includes all countries with oil consumption less than or equal to 1 million barrels per day (x≤1); bin 2 includes all countries with oil consumption greater than 1 but less than or equal to 2 million barrels per day (1<x≤2); and bin 19 includes all countries with oil consumption greater than 18 but less than or equal to 19 million barrels per day (18<x≤19).

Kurtosis?

Kurtosis: a measure of the flatness or sharpness of a distribution. A flat distribution has low kurtosis; a very sharp distribution has high kurtosis.

Suppose we want to sample from two populations—the first population comprises 5,000 observations and the second population comprises 5 million. If we take a sample of size 1,000 from the first population, how many times larger does the sample need to be from the second population to ensure the same level of accuracy?

No larger. We might expect that for a larger population, a larger sample size is needed to achieve a given level of accuracy, but this is not necessarily true. A sample of 1,000 is often a satisfactory representation of a population numbering in the millions, as long as the sample is randomly selected and representative of the entire population.

Is this an example of a hidden variable? A hidden variable, such as GDP, may explain variation in oil consumption across various countries, and provide more clarity than looking solely at the number of barrels of oil consumed.

No. GDP is likely correlated with oil consumption. To determine whether there is a hidden variable, first identify two variables that are not fundamentally related to each other, and then identify a third "hidden" variable that is correlated with each. In this example, what would the two variables be? One would be oil, but there is no second variable proposed that is fundamentally unrelated to oil.

how much of the variation in y can be explained by x?

R^2

how to display the best fit line in CORe excel?

Select Chart Tools from the Insert menu. Then select Layout, then select Trendline. Check the Display Equation box to display the equation of the best fit line.

Skewness

Skewness: a measure of the degree of asymmetry of a distribution.

technical definition of an outlier?

The interquartile range (IQR) is the difference between the upper and lower quartiles, that is, IQR=Q3-Q1. We then multiply the IQR by 1.5 to find the appropriate range, computing 1.5(IQR)=1.5(Q3-Q1). A data point is an outlier if it is less than Q1-1.5(IQR) or greater than Q3+1.5(IQR).
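The rule can be sketched as a short function. The data are invented, and the quartile convention is an assumption: `method="inclusive"` is intended to mirror Excel's inclusive quartiles, but other tools may compute Q1 and Q3 slightly differently.

```python
from statistics import quantiles

def find_outliers(data):
    """Return the points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in data if v < low_fence or v > high_fence]

# Hypothetical data set with one unusually large value.
values = [2, 3, 4, 5, 6, 7, 8, 9, 40]
outliers = find_outliers(values)
print(outliers)
```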

What is the center value of the distribution of the sample means?

The population mean (μ) According to the Central Limit Theorem, if we take enough large samples, the mean of the set of sample means equals the population mean.

How does the distribution of sample means differ from the distribution of the population?

The two distributions have different standard deviations. Larger sample sizes result in a narrower distribution of sample means.

Time series vs cross-sectional?

Time Series: Time series data contain data about a given subject in temporal order, measured at regular time intervals (e.g. minutes, months, or years). U.S. oil consumption from 2002 through 2012 is an example of a time series. Managers collect and analyze time series to identify trends and predict future outcomes. Cross-Sectional: Cross-sectional data contain data that measure an attribute across multiple different subjects (e.g. people, organizations, countries) at a given moment in time or during a given time period. The average oil consumption of ten countries in 2012 is an example of cross-sectional data. Managers use cross-sectional data to compare metrics across multiple groups.

how to find multiple modes in excel?

To find them, instead of entering the MODE.MULT formula in a single cell, highlight at least three vertically-contiguous cells and then input =MODE.MULT(number 1, [number 2], ...) into the formula bar. Then, instead of using ENTER to find the result, use CTRL+SHIFT+ENTER to enter the array.

We always start a hypothesis test by assuming? and then test to see?

We always start a hypothesis test by assuming that the null hypothesis is true and then test to see if we can nullify it

two possible outcomes of the hypothesis test?

We either reject the null hypothesis, or we fail to reject it, because we have insufficient evidence.

Correlation coefficient?

We will use the correlation coefficient to measure the strength of the linear relationship between two variables. The correlation coefficient measures the extent to which the data points on the scatter plot create a line, on a scale from -1 to +1. Even when the correlation coefficient is 0, a relationship between two variables might exist, just not a linear one. The relationship may appear more like a curve, for example. Never rely solely on the correlation coefficient. Always consult a visual representation of the data such as a scatter plot to see patterns and gain other insights into the situation the data describe.

What is a time series?

a data set in which one of the variables is time.

sum of squared errors?

also called residual sum of squares, is the sum of each of the residual errors squared. also known as the variation left unexplained by the regression model

how to create dummy variables for regression?

always assign 1 fewer dummy variable than the number of options available
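A minimal sketch of the "one fewer dummy than options" rule. The category names and the choice of baseline are invented, and `make_dummies` is a hypothetical helper, not a library function.

```python
def make_dummies(values, baseline):
    """Build 0/1 indicator columns for every category except the baseline."""
    categories = sorted(set(values) - {baseline})
    rows = [{c: int(v == c) for c in categories} for v in values]
    return categories, rows

# Hypothetical qualitative variable with three options.
regions = ["north", "south", "west", "south", "north"]
columns, dummy_rows = make_dummies(regions, baseline="north")

print(columns)        # two dummy columns for three options
print(dummy_rows[0])  # the baseline category is all zeros
```

The baseline option needs no column of its own because it is identified by every dummy being 0.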

why is R^2 not a good measure of predictive power of regression? rather?

because R^2 can never decrease when an independent variable is added to the model. Instead, we should use adjusted R^2, which penalizes each added variable and so rises only when the new variable adds enough explanatory power.
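A sketch of how the adjustment behaves, using the standard formula adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The R^2 values below are invented to show a weak extra variable.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical models fitted to the same 30 observations.
base = adjusted_r_squared(0.800, n_obs=30, n_predictors=2)
# Adding a third variable nudges R^2 up only slightly...
with_weak_extra = adjusted_r_squared(0.801, n_obs=30, n_predictors=3)

# ...so adjusted R^2 falls even though plain R^2 rose.
print(round(base, 4), round(with_weak_extra, 4))
```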

how to figure out how much additional value the regression model gives us?

compare the accuracy of the regression line with that of the mean price line. We do this by calculating the regression sum of squares, which equals the total sum of squares minus the residual sum of squares.
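The comparison between the mean line and the regression line can be sketched numerically. The sizes and prices are invented, and the fit is ordinary least squares computed by hand.

```python
# Hypothetical data: x = size, y = price.
sizes = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [12.0, 15.0, 21.0, 24.0, 28.0]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in sizes]

# Total SS: squared errors around the mean price line.
total_ss = sum((y - mean_y) ** 2 for y in prices)
# Residual SS: squared errors around the regression line.
residual_ss = sum((y - p) ** 2 for y, p in zip(prices, predicted))
# Regression SS: the variation the regression explains beyond the mean line.
regression_ss = total_ss - residual_ss
r_squared = regression_ss / total_ss

print(total_ss, round(residual_ss, 2), round(regression_ss, 2), round(r_squared, 4))
```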

95% of all samples have CI that?

contain the true population mean

how to do multiple regression in CORe excel?

follow the same steps as for single-variable regression, but select all of the independent variable columns as the x range

how to tell p-value from a regression line and the bunch of points around it?

even if the graph's points are scattered and looks like it has a low R^2, it can also have a low p-value

type 1 error? type 2 error?

Type 1: false positive. We reject the null when the null is true. That is, we select a sample that happens to be one of the alpha % of samples that fall inside the rejection region even though the null is true. Type 2: false negative. We fail to reject the null when the null is false. That is, we select a sample that just happens to fall within the range of likely sample means under the null.

how to create a lagged variable in excel?

add a new column in which each cell equals the original variable's value from the previous period (the cell one row up in the original column, when rows are in time order). The first row has no lagged value.
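The same shift can be sketched in code. The sales figures are invented and `add_lag` is a hypothetical helper; it also shows why each lag costs one usable observation.

```python
def add_lag(series, lag=1):
    """Pair each value with the value `lag` periods earlier, dropping incomplete rows."""
    lagged = [None] * lag + series[:-lag]   # shift the series down by `lag` rows
    return [(cur, prev) for cur, prev in zip(series, lagged) if prev is not None]

# Hypothetical yearly sales, in time order.
sales = [100, 110, 125, 130]
rows = add_lag(sales, lag=1)

print(rows)                    # (current value, previous-period value)
print(len(sales), len(rows))   # one observation is lost to the lag
```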

what do the coefficients in a multiple regression model tell us?

for each one-unit increase in x, y changes by that variable's coefficient, assuming that all other independent variables stay the same.

what is a p-value and how is it useful?

if a specific sample's p-value is less than alpha, we reject the null because the likelihood of obtaining a sample at least that extreme, given a true null, is less than alpha.
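A sketch of the decision rule, using a z-test as a simplifying assumption (Excel's T.TEST uses the t-distribution instead). The sample mean, null mean, standard deviation, and sample size are all invented.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, standard_dev, size):
    """Probability, under the null, of a sample mean at least this far from null_mean."""
    z = (sample_mean - null_mean) / (standard_dev / sqrt(size))
    one_tail = 1.0 - NormalDist().cdf(abs(z))
    return 2.0 * one_tail   # count deviations in both directions

alpha = 0.05
# Hypothetical test: null mean 6.7, sample mean 7.3, s = 2.0, n = 100.
p_value = two_sided_p_value(7.3, 6.7, 2.0, 100)
reject_null = p_value < alpha

print(round(p_value, 4), reject_null)
```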

What is the central limit theorem?

if we take many large enough random samples from the population and plot the means of the samples, those sample means will be approximately normally distributed, even if the population itself is not.
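A small simulation can illustrate the idea. Everything here is an assumption made for the sketch: the skewed (exponential) population, the sample size of 50, the 2,000 repetitions, and the fixed random seed.

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical, strongly skewed population (not normal at all).
population = [random.expovariate(1.0) for _ in range(20000)]

sample_size = 50
sample_means = [fmean(random.sample(population, sample_size)) for _ in range(2000)]

pop_mean = fmean(population)
pop_sd = pstdev(population)

# The sample means center on the population mean...
center = fmean(sample_means)
# ...and their spread is close to pop_sd / sqrt(sample_size).
spread = pstdev(sample_means)
expected_spread = pop_sd / sample_size ** 0.5

print(round(pop_mean, 3), round(center, 3))
print(round(spread, 3), round(expected_spread, 3))
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the population is skewed.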

dummy variables are also called?

indicator variables

residual plot? what does it tell us? residual distribution should follow?

it's a plot of the residuals on the y-axis against the original x values on the x-axis. If there is no pattern in the residual plot, a linear model is appropriate for the relationship between the independent and dependent variables. The residuals should follow a normal distribution with mean 0 and fixed variance.

what is the net effect of x on y?

it's the coefficient of x in a multiple linear regression. this is because we are controlling for other variables.

what is the gross effect of x on y?

it's the coefficient of x in a single-variable linear regression. this is because we are not taking other variables into consideration.

what is R^2?

measures how closely a regression line fits a data set. R2 is defined as the percentage of total variation in the dependent variable, y, that is explained by the regression line.

is "DO YOU THINK WOMEN SHOULD BE DRAFTED INTO THE MILITARY?" biased?

no, because although it is controversial, it doesn't take a stance within the question.

Is the number on the back of a player's jersey a quantitative variable?

no, it's qualitative (categorical), so it would be represented with dummy variables

if the population mean is 6.7, leading to a CI of 6.3 to 7.1 at an 0.05 alpha, what do we do if we draw a sample that has a mean of 7.3?

since 7.3 falls in the rejection region at 0.05 alpha, we reject the null hypothesis and accept the alternative hypothesis. This is because, if the null hypothesis were true, there would be less than a 5% chance of drawing a sample whose mean is this far from 6.7.

if a distribution is symmetric and has only one peak then?

the mean, median, and mode are equal at that peak

heteroskedasticity?

the residuals become larger (pos and neg) as we move along the x-axis. thus, this violates assumption that residuals' variances are fixed.

regression sum of squares?

the variation in price that's explained by the regression model

residual error?

the vertical distance between a data point and the regression line

total variation in price data?

total sum of squares = the sum of squared errors calculated for the mean price line

how to Calculate the upper bound of the 95% range of likely sample means for this one-sided hypothesis test using the CONFIDENCE.NORM function?

use CONFIDENCE.NORM with alpha = 0.10 instead of 0.05, because CONFIDENCE.NORM assumes a two-tailed test and splits alpha across both tails. Then add the resulting margin of error to the null hypothesis mean to get the upper bound.
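A sketch of why doubling alpha works. `confidence_norm` is a hypothetical stand-in for Excel's CONFIDENCE.NORM, and the null mean, standard deviation, and sample size are invented.

```python
from math import sqrt
from statistics import NormalDist

def confidence_norm(alpha, standard_dev, size):
    """Two-tailed margin of error: alpha is split evenly across both tails."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0) * standard_dev / sqrt(size)

# Hypothetical one-sided test: null mean 6.7, s = 2.0, n = 100.
null_mean, sd, n = 6.7, 2.0, 100

# alpha = 0.10 puts 5% in each tail, so the upper tail holds the 5% we want.
upper_bound = null_mean + confidence_norm(0.10, sd, n)

# Direct one-sided calculation for comparison: the z-value with 95% below it.
z_one_sided = NormalDist().inv_cdf(0.95)
direct_upper_bound = null_mean + z_one_sided * sd / sqrt(n)

print(round(upper_bound, 3), round(direct_upper_bound, 3))
```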

What are mediating variables?

variables which are affected by one variable, and then affect another variable in turn. For example, being worried about grades: 1. may cause a student to study harder, and thus get better grades, but we wouldn't consider studying to be a hidden variable linking worry and getting better grades. Those two variables ARE fundamentally related, in that the worry is leading to the better grades. If students are more worried, they may study harder and get even better grades. 2. may cause a student to stress eat and gain weight , but we wouldn't consider eating to be a hidden variable linking worry and weight gain. Those two variables ARE fundamentally related, in that the worry is leading to the weight gain. If students are more worried, they may gain even more weight.

We never accept the null hypothesis—

we simply do not reject it, that is, we "fail to reject" it. If we reject the null hypothesis, we essentially accept the alternative hypothesis.

estimated prediction interval vs actual prediction interval?

when estimating a prediction interval, the interval width is the same regardless of x because the estimated prediction interval is based on the standard error of regression, which doesn't vary with x.

when is a sample size considered small? what to use?

when the sample is smaller than 30. use t-dist in this case

what happens in multi linear reg when two of x's are negatively correlated?

when we switch from net to gross, one of the coefficients will increase and the other will decrease.

forecast uncertainty? prediction interval? SEoR?

when we use the regression line to forecast, our forecast uncertainty increases as we move outside the range of historical x values. the specific uncertainty around a point forecast is the prediction interval. each prediction interval around a point forecast is normally distributed. the standard error of regression is a reasonable but conservative estimate of the forecast standard deviation.

we can conclude that there is a 95% chance that a large sample's mean is?

within about 2 standard errors (standard deviations of the distribution of sample means) of the population mean

if the p-value for a one-sided test is x, what is the p-value for a two-sided test?

2x (the two-sided test counts both tails, so its p-value is double the one-sided p-value)

