BADM 210 Final

what are the steps of the two-tailed test

(1) Gather or calculate all sample statistics (2) Determine the confidence level and level of significance (3) Calculate a t-statistic for the sample (4) Determine p-values using Excel (5) Decide whether to reject the null (6) Explain the statistical conclusion

what are the steps of calculating the one-tailed test

(1) Gather or calculate all sample statistics (2) Determine the confidence level and level of significance (3) Calculate a t-statistic for the sample (4) Determine p-values using Excel -P-value: the probability, assuming that Ho is true, of obtaining a random sample of size n that results in a test statistic at least as extreme as the one observed in the current sample --Smaller p-values indicate more evidence against Ho, as they suggest that it is increasingly unlikely that the sample could occur if Ho is true (5) Decide whether to reject the null --Reject Ho if the p-value is less than or equal to alpha (6) Explain the statistical conclusion
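
Steps 3 to 5 can be sketched in Python as a rough stand-in for the Excel workflow. This is a minimal sketch that assumes the test is about a single population mean, so t = (x_bar - mu0) / (s / sqrt(n)) with n - 1 degrees of freedom. The helper names (t_pdf, t_cdf, one_sample_t_test) and the sample numbers are hypothetical, and the lower-tail probability is approximated numerically rather than taken from Excel's T.DIST.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """Lower-tail probability P(T <= t), comparable to T.DIST(t, df, TRUE).

    Integrates the density over [0, |t|] with Simpson's rule (steps must be
    even) and uses the symmetry of the distribution around zero.
    """
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

def one_sample_t_test(x_bar, s, n, mu0, alpha=0.05):
    """Steps 3-5: t-statistic, p-values, and the rejection decisions."""
    t_stat = (x_bar - mu0) / (s / math.sqrt(n))  # step 3
    df = n - 1
    lower = t_cdf(t_stat, df)                    # step 4: lower-tail p-value
    upper = 1 - lower                            # upper-tail p-value
    two_tailed = 2 * min(lower, upper)           # two-tailed p-value
    return {
        "t": t_stat,
        "df": df,
        "lower": lower,
        "upper": upper,
        "two_tailed": two_tailed,
        "reject_upper_tail": upper <= alpha,     # step 5, one-tailed (Ha: mean > mu0)
        "reject_two_tailed": two_tailed <= alpha,  # step 5, two-tailed
    }

# Hypothetical sample statistics (step 1) and alpha = 0.05 (step 2).
result = one_sample_t_test(x_bar=52.0, s=4.0, n=16, mu0=50.0)
print(result)
```

With these made-up numbers the same t-statistic leads to rejection in the upper-tail test but not in the two-tailed test, which is why the direction of Ha has to be fixed before looking at the data.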

coefficient of determination

A measure of the goodness of fit of the estimated regression equation -can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation -r^2 is the amount of variation in y that is explained by the x's -must be between 0 and 1 -Ex. R^2 = 0.8838 indicates that the regression model explains approximately 88.4% of the variability in travel time for the driving assignments in the sample

what does it mean to overfit a regression model

-condition where a statistical model begins to describe the random error in the data rather than the relationship between variables - can produce misleading R-squared values, regression coefficients, and p-values -Fitting a model too closely to sample data, resulting in a model that does not accurately reflect the population

determine upper tail p-value

1 - lower-tail p-value, i.e. =1 - T.DIST(test statistic, degrees of freedom, TRUE) -equivalently =T.DIST.RT(test statistic, degrees of freedom)

level of significance formula and relation to the confidence level

confidence level = 1 - alpha, where alpha is the level of significance (ex. If significance level is 0.05 then confidence level is 0.95) -If the p-value is less than your significance level, the hypothesis test is statistically significant and you reject the null

meaning of lift ratio

lift ratio > 1 indicates the rule is useful: the antecedent and consequent occur together more often than expected by chance (not random)

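interpret a regression model with a categorical variable

The dummy variable gives a separate model for each of the two cases, where only the y-intercept varies -the difference between the two intercepts (the dummy's coefficient) is the expected difference in y between the two scenarios, all else being equal; its magnitude is the size of the difference and its sign gives the direction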

Categorical Predictors

Qualitative variable -ex. Marital status

sample statistic

Characteristic of sample data (sample mean, sample SD, sample proportion) where the value is used to estimate the value of the corresponding population parameter

how to test the hypothesis of the hypothesis test of the regression slope

Determine whether the p-value associated with t is LESS THAN α. If so, then reject the null

Type II error

Fail to reject H0 even though Ha is true -usually related to not having enough statistical power to allow us to conclude that we should reject H0 -Much more uncertainty is associated with making a Type II error than a Type I error, so it's better to use the statement "do not reject Ho" instead of "accept Ho"

relationship between standard deviation and standard error

SD measures the amount of variability from the mean while standard error measures how far the sample mean of the data is likely to be from the true population mean
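
A small sketch of how the two quantities relate, assuming the usual standard error of the sample mean, s / sqrt(n). The function name and the sample values are hypothetical.

```python
import math
import statistics

def standard_error(sample):
    """Standard error of the sample mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

sample = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data
sd = statistics.stdev(sample)        # spread of individual values around the mean
se = standard_error(sample)          # expected spread of the sample mean itself
print(sd, se)
```

Because of the division by sqrt(n), the standard error shrinks as the sample grows even when the standard deviation stays about the same.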

t-test

Statistical test based on the Student's t probability distribution that can be used to test the hypothesis that a regression parameter βj is zero -if this hypothesis is rejected, we conclude that there is a regression relationship between the jth independent variable and the dependent variable -formula is Tbj on test sheet

how to create random sample in excel

Step 1: Assign a random number to each element of the population by utilizing =RAND() function (assigns random # between 0-1) - Select the cell range with random numbers → copy in clipboard group → paste values - Data → Sort → check my data has headers → sort by random numbers Step 2: Select the n elements corresponding to the n smallest random numbers
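
The same two-step procedure can be sketched in Python. This is only an illustration of the idea: random.random() plays the role of =RAND(), and the function name, the seed parameter, and the example population are assumptions made for the sketch.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Pick n elements by assigning random numbers and keeping the n smallest."""
    rng = random.Random(seed)
    # Step 1: attach a random number in [0, 1) to every element, like =RAND().
    keyed = [(rng.random(), item) for item in population]
    # Sort by the random numbers (the Data -> Sort step in Excel).
    keyed.sort(key=lambda pair: pair[0])
    # Step 2: keep the elements with the n smallest random numbers.
    return [item for _, item in keyed[:n]]

population = [f"employee_{i}" for i in range(1, 51)]  # hypothetical population
sample = simple_random_sample(population, n=5, seed=42)
print(sample)
```

Fixing the seed corresponds to the paste-values step: it freezes the random numbers so the selection does not change on every recalculation.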

quadratic regression model equation

y = b0 + b1x + b2x^2 -if b2 is negative then the curve is concave down (mound-shaped) and if b2 is positive then it is concave up (bowl-shaped)

coverage error

occurs if the research objective and the population from which the sample is drawn do not align -the sample is not representative on an important factor -ex: ask for ethnicities of the U.S. population and only white people and Asians respond

association rules

if-then statement describing the relationship between item sets; likelihood of items being bought together -ex: information that customers who buy beer also buy diapers is encoded as: Beer => Diapers [support = 2%, confidence = 75%]

omitted variable

important factor left out of the regression model

measurement error

incorrect measurement of the population characteristic at interest -the values taken are not true measures of the actual values -ex: asking for starting salary of recent graduates, and they overstate their salary to sound more impressive

what does the column F mean on the regression output

it's a measure of how well the line fits the data (the F test statistic for the model as a whole)

what is the shape of the t-distribution relative to the standard normal distribution

it's shorter and fatter, with heavier tails than the standard normal distribution -The t-distribution, like the normal distribution, is bell-shaped and symmetric, but it has heavier tails, which means it tends to produce values that fall far from its mean -The standard normal distribution has a mean of 0 and standard deviation of 1

effect of a larger sample size on the height/width of normal distribution

larger sample size = higher height and lower width

impact of large samples on confidence intervals

larger samples mean lower standard errors -this means that the confidence interval is narrower (tighter) -this in turn means that we are more likely to reject the null hypothesis -So we should also consider the practical significance of the test -Ex. Does the $2 difference between $52,002 and $52,000 really matter, even though we may find a statistical difference?

Type 1 Error relationship to the level of significance

level of significance is the probability of making a Type 1 error when the null hypothesis is true as an equality -If a higher cost is associated with making a Type I error, then a smaller level of significance is preferred

best fit

line is a better fit of the relationship in the data when the regression relationship outweighs the error

regression output

make sense of these -R^2 -Observations -Degrees of freedom (regression) = q (number of factors/independent variables) -Degrees of freedom (residual) = n - q - 1 -Degrees of freedom (total) = n - 1 -SSR, SSE, SST -Coefficients (intercepts and slopes) -Standard errors -T-stat -P-value -Lower % and upper %

term frequency

measure used to understand how important a given word is in a document -TF-IDF -ex: used "compet" as it can be the token for competition, competes, competitive

how do you find the number of dummy variables

number of categories - 1 (here n is the number of categories, not the sample size)

residual degrees of freedom

n - 1 - q -Use in =T.INV excel formula

degrees of freedom for residual of sample

n - q - 1, where q is the number of independent variables (with coefficients b1, b2, etc.) in the regression model

degrees of freedom

number of scores that can vary in the calculation of a statistic

support count

number of times that a collection of items occurs together in a transaction data set -counts how many times things are purchased together -rule of thumb is to work with those rules that have at least 20% support

target population

population for statistical inference

sampling distribution

probability distribution of all possible values of a sample statistic; when you repeat your survey or poll for all possible samples of the population -this will help us to make probability statements about how close sample is to population -it is the distribution of sample means if we took multiple samples -will vary in tighter range than the sample

confidence

probability that the consequent of the association rule occurs given that the antecedent occurs

least squares method

procedure for using sample data to find the estimated regression equation -the best fit regression line will minimize the sum of the square errors
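
A minimal sketch of the least squares calculation for one independent variable, using the standard closed-form slope and intercept. The function name and the data points are hypothetical.

```python
def least_squares(x, y):
    """Return (b0, b1) that minimize the sum of squared errors for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the line through the point (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

x = [1, 2, 3, 4]   # hypothetical independent variable
y = [3, 5, 7, 9]   # hypothetical dependent variable
b0, b1 = least_squares(x, y)
print(b0, b1)
```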

hierarchical clustering

process of agglomerating observations into a series of nested groups based on a measure of similarity -Typically for a data set of fewer than 500 observations and is highly influenced by outliers

k-means clustering

process of organizing observations into one of k groups based on a measure of similarity (typically from Euclidean distance) -Typically for a data set greater than 500, more accurate, no good for binary/ordinal data

rejection rule

reject Ho if the p-value is less than or equal to the level of significance

how do the sample means from a sampling distribution differ from those of the sample

sample means of a sampling distribution will vary in a tighter range than the sample

random sample

sample of size n from a finite population of size N, where each possible sample of size n has the same probability of being selected

point estimator

sample statistic that provides the point estimate of the population parameter

regression model

the equation that describes how y is related to x and an error term: y = β0 + β1x + ε -β0 and β1 = population parameters for the y-intercept and slope that relate y and x -ε = error term, the variability in y that can't be explained by x; we assume its mean is 0 because we can't observe it -use for population

calculating the residual errors

the error (ei) for an observation is the observed value minus the predicted value (from the line) -ei = yi - ŷi

confidence level

the estimated probability that a population parameter lies within a given confidence interval

one-tailed test of significance

the hypothesis is directional, and extreme statistical values that occur in a single tail of the curve are of interest

degrees of freedom for a regression

the number of independent variables in the model (q)

confidence interval

the range of values within which a population parameter is estimated to lie

Euclidean distance

the straight-line distance, or shortest possible path, between two points -this equation is given to you but know how to use the equation to find distance between two observations -smaller number = closer together
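
A short sketch of using the distance formula on two observations. The function name and the two points are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two observations with the same variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

obs_a = (0, 0)   # hypothetical observation
obs_b = (3, 4)   # hypothetical observation
distance = euclidean_distance(obs_a, obs_b)
print(distance)
```

A smaller result means the two observations are more similar on the measured variables.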

point estimate (sample)

the value of a point estimator used in a particular instance as an estimate of a population parameter -ex: sample mean, sample standard deviation, sample variance, sample proportion, etc.

different types of residual plots that indicate probable bias

they indicate probable bias when the residuals show a pattern (ex: a curve, or a spread that changes as x changes) instead of scattering randomly around zero

Multicollinearity

two or more predictor variables in a multiple regression model are highly correlated -meaning that one can be linearly predicted from the others with a substantial degree of accuracy

why do we sample

use a sample statistic to make an inference about a population parameter -it is important to have a close correspondence between the sampled population and the target population

Sum of Squares Error (SSE)

value that measures the variation in the dependent variable that is not explained by the independent variable(s) in the model; the sum of the squared residuals

interval estimation

we know that we cannot estimate the population mean exactly, so there is always a margin of error -formula: point estimate ± margin of error

sample estimate

we use sample estimates to infer information about population parameters (which are usually unknown) -provides our best guess as to what is the value of the population parameter, but it is not 100% accurate

estimating regression line

we use the estimated regression equation ŷ = b0 + b1x to estimate the regression line -ŷ = point estimator of E(y | x); predicted value of y given any value of x -b0 and b1 = sample estimates of β0 and β1 -use for sample

non response error

when segments of the population are more or less likely to respond to the survey mechanism -people who do not respond are very different than our population of interest -ex: US census is sent out and some people don't respond because of fear of immigration authorities

Type 1 Error

when we incorrectly reject H0

independent variable

x -the predictor

dependent variable

y -the response

how to solve for the confidence

​Support of [antecedent and consequent] / support of antecedents

simple random sample

​each possible sample of size n has the same probability of being selected

holdout method

​method of cross-validation in which sample data are randomly divided into mutually exclusive and collectively exhaustive sets, then one set is used to build the candidate models and the other set is used to compare model performances and ultimately select a model

t-distribution and sampling distribution

We can calculate mean and standard deviation for repeated samples -ex: For a confidence level of 90%, we expect that our confidence interval contains the population mean for 90% of the samples.

Be able to perform hypothesis tests just as with point estimates except two modifications: Modify the process to calculate z (you do not need to memorize the formula) and use NORM.S.DIST (z is from the normal distribution) to determine p values.

Z-score definition: number of standard deviations below or above the population mean a data point is; most values range from -3 standard deviations to +3 standard deviations -Z-score formula: z = (x - μ) / σ -=NORM.S.DIST(z-score, TRUE)
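
The z-score formula and the lower-tail probability can be sketched with Python's standard library, where NormalDist().cdf(z) stands in for =NORM.S.DIST(z, TRUE). The function names and input values are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def lower_tail_p(z):
    """P(Z <= z) for the standard normal distribution."""
    return NormalDist().cdf(z)

z = z_score(x=110, mu=100, sigma=5)   # hypothetical values
p_lower = lower_tail_p(z)
p_upper = 1 - p_lower
print(z, p_lower, p_upper)
```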

sampling distribution definition

a probability distribution consisting of all possible values of a sample statistic -distribution of sample means if we took multiple samples

residual plot

a scatterplot of the regression residuals against the explanatory variable

Sum of Squares Regression (SSR)

a value that measures the amount of variation in a dependent variable that is explained by an independent variable -deviation of the line from a line with no relationship (β1 = 0) -The sum of squares regression comes from the difference between the predicted value on the line and a line with no slope (at the average y or y-bar) -measures how much of a relationship exists between x and y

how does the area under the normal distribution curve reflect a probability

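The whole area under the curve is always equal to 1 (or 100%) -the area under the curve over a range of values is the probability of a value falling in that range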

confidence interval of regression slope

bj ± tα/2 × Sbj -bj is the estimate of the slope (relationship between x and y) -Sbj is the estimated standard error of bj -tα/2 is the number of standard errors required to meet a certain level of confidence -formula will be given to you but know how to use

How do we implement a categorical variable into a regression model

by using a dummy variable -ex: if the data can be sorted into four categories (East, West, North, South), we need three binary (0/1) variables in the model (East, West, North). If East = 0, West = 0, and North = 0, then we know we have the case of South so no need for the South variable
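
The East/West/North/South example can be sketched as a small encoding function. The function name and the choice to pass the left-out category as a `baseline` argument are assumptions for illustration.

```python
def dummy_encode(value, categories, baseline):
    """Return 0/1 dummy variables for every category except the baseline."""
    return {c: int(value == c) for c in categories if c != baseline}

regions = ["East", "West", "North", "South"]
# South is left out, so it is represented by all three dummies being 0.
print(dummy_encode("West", regions, baseline="South"))
print(dummy_encode("South", regions, baseline="South"))
```

Four categories produce three dummy variables, matching the rule of one fewer dummy than the number of categories.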

meaning of r^2

closer to 0: SSR near 0, SSE high, no relationship between x and y -closer to 1: SSR near SST, SSE low, best fit

alternative hypothesis

concluded to be true if the null hypothesis is rejected -opposite of what is stated in the null -denoted by Ha -often what the test is attempting to prove -never has an equal sign in it

how to solve for lift ratio

confidence / (support of consequent / total number of transactions) -equation will be given to you
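
A minimal sketch of the confidence and lift ratio calculations, treating "support" as a count of transactions as in the formula above. The function names and the five transactions are hypothetical.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)

def confidence(transactions, antecedent, consequent):
    """Support of [antecedent and consequent] / support of antecedent."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    return both / support_count(transactions, antecedent)

def lift_ratio(transactions, antecedent, consequent):
    """Confidence / (support of consequent / total number of transactions)."""
    baseline = support_count(transactions, consequent) / len(transactions)
    return confidence(transactions, antecedent, consequent) / baseline

transactions = [              # hypothetical transaction data
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"beer"},
    {"diapers"},
    {"chips"},
]
conf = confidence(transactions, {"beer"}, {"diapers"})
lift = lift_ratio(transactions, {"beer"}, {"diapers"})
print(conf, lift)
```

Here the lift ratio comes out slightly above 1, so in this made-up data buying beer makes diapers a little more likely than picking a transaction at random.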

t in confidence intervals

confidence intervals are always two-sided, so you will divide alpha by 2 every time

confidence coefficient

confidence level expressed as a decimal value -ex: .95 is the confidence coefficient for a 95% confidence level

clustering

grouping together a set of observations based on similar attributes; creates similar groups -the max number of clusters is however many points there are -used to identify outliers -clusters should minimize the distance between observations within a cluster and maximize the distance between clusters

null hypothesis

hypothesis tentatively assumed to be true in the hypothesis testing procedure and is denoted by Ho -if rejected, then generally means the relationship is significant -can never prove it is right, can only prove it to be wrong -always has an equal sign in it

two-tailed test

hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution

one-tailed test

hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution

Understand what a quadratic relationship between x and y looks like in a chart or in the regression equation

if b2 > 0 then convex (bowl-shaped) and if b2 < 0 then concave (mound-shaped)

formulate null and alternative hypothesis from conjectures

-Ask questions like what is the purpose of collecting the sample? What conclusions are we hoping to make? -In research-based scenarios, it is generally easier to formulate the alternative hypothesis first -In validation scenarios, it is generally easier to formulate the null and then the alternative hypothesis -ex. Company thinks their sodas contain 65 ounces or more of soda but researcher thinks it is less so takes a random sample; Ho would be the mean is greater than or equal to 65 and Ha would be the mean is less than 65

what are the reasons that we decide to take a sample rather than take a census

-Potential difficulties with a census: Expensive, time consuming, misleading if population is changing quickly, may be unnecessary or impractical -Sample allows us to gather data from a subset of the population that is as similar as possible to the entire population so that what we learn from the sample data accurately reflects what we want to understand about the entire population

identify under what scenarios model overfitting may occur

-The analyst adds an independent variable that helps explain the variation in the sample, but does not have a relationship with the dependent variable in the population -The analyst chooses a complex non-linear relationship for the model without a good theoretical reason

three conditions that may create bias in a regression

-Variance in y that is not constant across all x values (don't want a high variance) -Omitted Variable -Multicollinearity These conditions may indicate bias in the coefficients of the regression analysis.

what to think about when determining the level of significance for Type 1 errors

-if cost of making Type 1 error is high, smaller values of level of significance are preferred -if cost of making Type 1 error is not high, larger values of level of significance are preferred

two conditions that makes something a sample

1. elements come from same population 2. elements are selected individually/independently to prevent selection bias

determine two tailed p-value

2 * smaller value between the upper and lower tail values

t-distribution

A family of probability distributions that can be used to develop an interval estimate of a population mean whenever the population standard deviation σ is unknown and is estimated by the sample standard deviation s -accounts for error and additional uncertainty in our estimates -fewer degrees of freedom mean that it will get shorter and fatter -used instead of the normal distribution when you have small samples

dummy variable

A variable for which all cases falling into a specific category assume the value of 1, and all cases not falling into that category assume a value of 0.

y = b0 + b1(miles) + b2(deliveries) + b3(rush hour) y=-0.33 +0.0672(Miles) + 0.674(Deliveries) + 0.998(RushHour) what is the interpretation of b3

An assignment route with a congested rush hour highway segment is expected to take 0.998 additional hours, with all else being equal.
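
The interpretation of b3 can be checked by evaluating the estimated equation twice with only the rush hour dummy changed. The coefficients come from the card above; the function name and the example route (50 miles, 2 deliveries) are hypothetical.

```python
def predicted_hours(miles, deliveries, rush_hour):
    """Estimated travel time from the regression equation on the card."""
    return -0.33 + 0.0672 * miles + 0.674 * deliveries + 0.998 * rush_hour

# Same hypothetical route, with and without a rush hour segment.
with_rush = predicted_hours(miles=50, deliveries=2, rush_hour=1)
without_rush = predicted_hours(miles=50, deliveries=2, rush_hour=0)
difference = with_rush - without_rush
print(with_rush, without_rush, difference)
```

The difference equals b3 no matter which miles and deliveries are used, which is what "all else being equal" means.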

central limit theorem

As sample sizes get large, the sampling distribution will approach a normal distribution, regardless of the shape of the original population -helpful for identifying the shape when selecting random samples that don't have a normal distribution -true for any shape of population

meaning of t-test

As the magnitude of t increases (as t deviates from 0 in either direction), we are more likely to reject the hypothesis that the regression parameter bj is 0 -so conclude that a relationship exists between the dependent variable y and the independent variable xj

cross-validation

Assessment of the performance of a model on data other than the data that were used to generate the model -We hold out a portion of the sample to "test" the fit of the line and then we repeat by using a different portion of the sample as the holdout -If k = 10, that means we use 90% of the sample to create the model, then hold out 10% to test. Then, we repeat the holdout process 10 times, each with a different 10% of the sample
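
The k = 10 description can be sketched as a generator of train/holdout index splits. This only shows how the sample is partitioned; the function name, the shuffling step, and the seed are assumptions, and no model is actually fitted.

```python
import random

def k_fold_splits(n_obs, k, seed=0):
    """Yield (train_indices, holdout_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)       # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]  # k mutually exclusive holdout sets
    for holdout in folds:
        held = set(holdout)
        train = [i for i in indices if i not in held]
        yield train, holdout

# 20 hypothetical observations, k = 10: each round trains on 90% and tests on 10%.
splits = list(k_fold_splits(n_obs=20, k=10))
for train, holdout in splits[:2]:
    print(len(train), len(holdout), holdout)
```

Every observation lands in exactly one holdout set, so each one is used for testing once and for model building in the other rounds.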

when is a one-tailed test used and when is a two-tailed test used

If the hypotheses contain greater-than or less-than symbols (Ha uses > or <) the test is one-tailed, and if they contain equal to / not equal to (Ha uses ≠) then it is two-tailed

μ0

It is the hypothesized value of the population mean -constant threshold to be tested -claimed value of the population mean given in H0

impact of large samples on errors

Large samples mean very low standard errors -can be found from standard error formula

Identify scenarios that might require a non-linear model for the regression equation

Look at scatter chart to determine if there is more of a curvilinear relationship

what is the impact of the degrees of freedom on the t-distribution

Low df (low sample size) means the t- distribution is wider and shorter in height.

effect that a larger/smaller standard deviation has on the width and height of a normal distribution

Lower the SD, the higher the curve and the smaller the width

multiple regression

Multiple regression simply means that there are multiple predictors in the regression model

how can sampling distribution be impacted if there is a finite population

Sampling distribution can be impacted if there is a finite population unless the population involved is large relative to the sample size -then the difference between the values of the standard deviation for the finite and infinite populations becomes negligible -ex: n/N < 0.05 to be considered negligible -won't need to calculate on the test

standard deviation

Shows how much variation or "dispersion" there is from the "average" (mean/EV) -low SD indicates that the data points tend to be very close to the mean

determine lower tail p-value using excel

T.DIST(test statistic, degrees of freedom, cumulative) -cumulative = true

t-statistic of confidence interval on excel

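=T.INV(alpha/2, DF) -T.INV takes only the probability and the degrees of freedom, and returns a negative value for the lower tail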

An analyst wants to understand the impact of class standing (freshman, sophomore, junior, or senior are the four possible categories) on the GPA of students in the Gies College of Business. The analyst creates a regression model for the prediction: Predicted GPA = b0 + b1(Freshman) + b2(Sophomore) + b3(Junior) + b4(Senior) What is wrong about this regression model?

The analyst should leave one dummy variable out of the model

level of significance

The probability that the interval estimation procedure will generate an interval that does not contain the value of the parameter being estimated -also the probability of making a Type 1 error when the null hypothesis is true as an equality

residual plots

The residuals e are the differences between the observed values and the values predicted by the line, and are plotted against x (with x on the horizontal axis) -Used to diagnose problems of bias in our regression model

what does the significance F column mean on the regression output

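This is a p-value for the model as a whole: the probability of seeing a fit at least this good if none of the predictor variables were related to y -a small value means the model is better than one using no predictor variables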

Calculate the lower and upper bound of the Confidence Interval for a given sample mean, standard deviation, and sample size, and confidence level

Use + for the upper bound and - for the lower bound *formula provided -The part of the formula that you are adding or subtracting is the margin of error; confidence intervals are always two-sided (upper and lower), so you will see the text use alpha divided by 2 -Use =-T.INV(probability, residual degrees of freedom) for the t-stat at alpha/2 ---Remember the negative sign!! ---Probability is significance level / 2 ---If using T.INV.2T then use the full alpha (no negative sign needed)

draw conclusions on a population based on estimates from a sample

Use sample estimates to infer information about population parameters (which are usually unknown)

when is the sample considered unbiased

When the expected value of the mean of the sampling distribution is equal to the mean of the population

normal distribution

function that represents the distribution of variables as a symmetrical bell-shaped graph -continuous random variables frequently show this type of distribution

standard error

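standard deviation of a sample statistic's sampling distribution; measures the accuracy of the estimate (how far the sample statistic is likely to be from the population parameter)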

p-value

assuming H0 is true, the probability of obtaining a random sample that results in test statistic at least as extreme as the one observed in the current sample -measures strength of evidence provided by the sample against the null hypothesis -smaller p = more evidence, as it suggests that it is increasingly more unlikely that the sample could occur if Ho is true

t-statistic of regression slope

b = estimate of slope in relation to x and y S = estimated standard error -formula will be given, but know what it means

sampling error

deviation of the sample from the population that results from randomness -a larger sample size reduces sampling error, but it does not eliminate it -ex: sometimes when we flip a coin, it comes up heads more often than tails

non sampling error

deviation of the sample from the population that occurs for reasons other than random sampling -Arises because of deficiencies and inappropriate analysis of data; can be random or non-random

statistical inference

drawing conclusions about a population based on estimates from a sample (value of one or more parameters) -sample provides estimates for the population

lift ratio

evaluates how effective association rules are at identifying transactions of the consequent vs randomly selected transaction

parameter (population)

factor that defines a characteristic of the population -ex: population mean, population standard deviation, population variance, etc.

no relationship

slope is 0

negative linear relationship

slope is negative

positive linear relationship

slope is positive

why to standardize variables in Euclidean distance

so that each variable receives equal weight in the clustering -otherwise the clustering might be dominated by a variable with large values -ex: annual salary compared to years of experience -analyst may choose to weight one variable higher than another based on its importance to the grouping
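
A small sketch of standardizing two variables on very different scales before computing distances, assuming standardization means converting each value to a z-score with the sample mean and sample standard deviation. The function name and the salary and experience figures are hypothetical.

```python
import statistics

def standardize(values):
    """Convert values to z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

salaries = [40000, 55000, 70000, 85000]   # hypothetical annual salaries
experience = [1, 3, 8, 12]                # hypothetical years of experience

z_salaries = standardize(salaries)
z_experience = standardize(experience)
print(z_salaries)
print(z_experience)
```

After standardizing, both variables have mean 0 and standard deviation 1, so a salary gap no longer swamps a gap in years of experience when the distance is computed.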

Sum of Squares Total (SST)

sum of squares total comes from the difference between the observed values and a line with no slope (at the average y or y-bar) -SST = SSR + SSE -Total = regression + error
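
The identity SST = SSR + SSE can be checked numerically for a least squares line with one independent variable. This is a sketch with hypothetical data; the function name is an assumption, and R^2 is computed as SSR / SST.

```python
def sums_of_squares(x, y):
    """Fit y = b0 + b1*x by least squares and return (SST, SSR, SSE)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                   # observed vs. y-bar
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)               # predicted vs. y-bar
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))      # observed vs. predicted
    return sst, ssr, sse

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(x, y)
r_squared = ssr / sst
print(sst, ssr, sse, r_squared)
```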

