BADM 210 Final
what are the steps of the two-tailed test
(1) Gather or calculate all sample statistics (2) Determine the confidence level and level of significance (3) Calculate a t-statistic for the sample (4) Determine p-values using Excel (5) Decide whether to reject the null (6) Explain the statistical conclusion
what are the steps of calculating the one-tailed test
(1) Gather or calculate all sample statistics (2) Determine the confidence level and level of significance (3) Calculate a t-statistic for the sample (4) Determine p-values using Excel --P-value: the probability, assuming that Ho is true, of obtaining a random sample of size n that results in a test statistic at least as extreme as the one observed in the current sample --Smaller p-values indicate more evidence against Ho, as they suggest that it is increasingly unlikely that the sample could occur if Ho is true (5) Decide whether to reject the null --Reject if the p-value is less than alpha (6) Explain the statistical conclusion
coefficient of determination
A measure of the goodness of fit of the estimated regression equation -can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation -r^2 is the amount of variation in y that is explained by the x's -must be between 0 and 1 -Ex. R^2 = 0.8838 indicates that the regression model explains approximately 88.4% of the variability in travel time for the driving assignments in the sample
what does it mean to overfit a regression model
-condition where a statistical model begins to describe the random error in the data rather than the relationship between variables - can produce misleading R-squared values, regression coefficients, and p-values -Fitting a model too closely to sample data, resulting in a model that does not accurately reflect the population
determine upper tail p-value
1 - lower-tail p-value, i.e. =1 - T.DIST(test statistic, degrees of freedom, cumulative) -cumulative = TRUE -equivalently =T.DIST.RT(test statistic, degrees of freedom)
level of significance formula and relation to the confidence level
Confidence level = 1 - the alpha level (ex. If significance level is 0.05 then confidence level is 0.95) -If the p-value is less than your significance level, the hypothesis test is statistically significant and you reject the null
meaning of lift ratio
> 1 = indicates usefulness; more significant - not random
interpret a regression model with a categorical variable
Provides a model for two cases where the y-intercept varies; the difference between the two intercepts (ignoring the sign) shows the difference between the two scenarios
Categorical Predictors
Qualitative variable -ex. Marital status
sample statistic
Characteristic of sample data (sample mean, sample SD, sample proportion) where the value is used to estimate the value of the corresponding population parameter
how to test the hypothesis of the hypothesis test of the regression slope
Determine whether the p-value associated with t is LESS THAN α. If so, then reject the null
Type II error
Fail to reject H0 even though Ha is true -Usually related to not enough statistical power to allow us to conclude that we should reject H0 -Much more uncertainty is associated with making a Type II error than a Type I error, so it's better to use the statement "do not reject Ho" instead of "accept Ho"
relationship between standard deviation and standard error
SD measures the amount of variability from the mean while standard error measures how far the sample mean of the data is likely to be from the true population mean
t-test
Statistical test based on the Student's t probability distribution that can be used to test the hypothesis that a regression parameter Bj is zero -if this hypothesis is rejected, we conclude that there is a regression relationship between the jth independent variable and the dependent variable -formula is Tbj on test sheet
how to create random sample in excel
Step 1: Assign a random number to each element of the population by utilizing =RAND() function (assigns random # between 0-1) - Select the cell range with random numbers → copy in clipboard group → paste values - Data → Sort → check my data has headers → sort by random numbers Step 2: Select the n elements corresponding to the n smallest random numbers
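The Excel steps above can be mirrored in a few lines of Python. This is only a sketch of the same idea (tag each element with a random number, sort, keep the n smallest); the function name `simple_random_sample`, the population of 100 IDs, and the seed are made up for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Tag every element with a random number, sort, and keep the n smallest."""
    rng = random.Random(seed)
    # Step 1: like putting =RAND() next to each element (value between 0 and 1)
    tagged = [(rng.random(), item) for item in population]
    # Like Data -> Sort on the random-number column
    tagged.sort(key=lambda pair: pair[0])
    # Step 2: keep the elements with the n smallest random numbers
    return [item for _, item in tagged[:n]]

population = list(range(1, 101))   # hypothetical population: element IDs 1..100
sample = simple_random_sample(population, n=10, seed=42)
print(sample)
```

Fixing the seed plays the same role as "paste values" in Excel: the random numbers stop changing, so the sample can be reproduced.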
quadratic regression model equation
y = b0 + b1x + b2x^2 -if b2 is neg then concave down and if b2 is pos then concave up
coverage error
if the research objective and the population from which the sample is drawn do not align -we do not choose a sample that is representative on an important factor -ex: ask for ethnicities of U.S. population and only white people and asians respond
association rules
if-then statement describing the relationship between item sets; likelihood of items being bought together -ex: the information that customers who buy beer also buy diapers is encoded as: Beer => Diapers [support = 2%, confidence = 75%]
omitted variable
important factor left out of the regression model
measurement error
incorrect measurement of the population characteristic at interest -the values taken are not true measures of the actual values -ex: asking for starting salary of recent graduates, and they overstate their salary to sound more impressive
what does the column F mean on the regression output
it's a measure of how well the line fits the data as a whole -a larger F means the regression relationship is strong relative to the error
what is the shape of the t-distribution relative to the standard normal distribution
it's shorter and fatter with heavier tails than the normal distribution -The t-distribution, like the normal distribution, is bell-shaped and symmetric, but it has heavier tails, which means it tends to produce values that fall far from its mean -The standard normal distribution has a mean of 0 and standard deviation of 1
effect of a larger sample size on the height/width of normal distribution
larger sample size = higher height and lower width
impact of large samples on confidence intervals
larger samples mean low standard errors -this then means that the confidence interval is smaller and more tight -this then means that we are more likely to reject the null hypothesis -This means we should consider the practical significance of the test -Ex. Does the $2 difference between $52,002 and $52,000 really matter, even though we may find a statistical difference?
Type 1 Error relationship to the level of significance
level of significance can be the probability of making type 1 error when the null hypothesis is true as an equality -If a higher cost is associated with making a Type I error, then a smaller level of significance is preferred
best fit
line is a better fit of the relationship in the data when the regression relationship outweighs the error
regression output
make sense of these -R^2 -Observations -Degrees of freedom (regression) = q (number of factors/independent variables) -Degrees of freedom (residual) = n - q - 1 -Degrees of freedom (total) = n - 1 -SSR, SSE, SST -Coefficients (intercepts and slopes) -Standard errors -T-stat -P-value -Lower % and upper %
term frequency
measure used to understand how important a given word is in a document -TF-IDF -ex: used "compet" as it can be the token for competition, competes, competitive
how do you find the number of dummy variables
number of categories - 1 -ex: four categories need three dummy variables
residual degrees of freedom
n - 1 - q -Use in =T.INV excel formula
degrees of freedom for residual of sample
n - q - 1, where n is the number of observations and q is the number of independent variables (with coefficients b1, b2, etc.) in the regression model
degrees of freedom
number of scores that can vary in the calculation of a statistic
support count
number of times that a collection of items occurs together in a transaction data set -counts how many times things are purchased together -rule of thumb is to work with those rules that have at least 20% support
target population
population for statistical inference
sampling distribution
probability distribution of all possible values of a sample statistic; when you repeat your survey or poll for all possible samples of the population -this will help us to make probability statements about how close sample is to population -it is the distribution of sample means if we took multiple samples -will vary in tighter range than the sample
confidence
probability that the consequent of the association rule occurs given that the antecedent occurs
least squares method
procedure for using sample data to find the estimated regression equation -the best fit regression line will minimize the sum of the square errors
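As a rough illustration of the least squares method for one predictor, the sketch below uses the standard closed-form estimates (slope = sum of cross-deviations / sum of squared x-deviations). The five data points and the function name `least_squares_fit` are invented for the example.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) for the line y-hat = b0 + b1*x that minimizes the sum of squared errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated intercept
    return b0, b1

x = [1, 2, 3, 4, 5]   # made-up sample data
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_fit(x, y)
print(round(b0, 4), round(b1, 4))
```

For this toy data the slope works out to 6 / 10 = 0.6 and the intercept to 4 - 0.6(3) = 2.2.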
hierarchical clustering
process of agglomerating observations into a series of nested groups based on a measure of similarity -Typically for a data set less than 500 and is highly influenced by outliers
k-means clustering
process of organizing observations into one of k groups based on a measure of similarity (typically from Euclidean distance) -Typically for a data set greater than 500, more accurate, no good for binary/ordinal data
rejection rule
reject Ho if the p-value is less than or equal to the level of significance
how do the sample means from a sampling distribution differ from those of the sample
sample means of a sampling distribution will vary in a tighter range than the sample
random sample
sample of size n from a finite population of size N, where each possible sample of size n has the same probability of being selected
point estimator
sample statistic that provides the point estimate of the population parameter
regression model
the equation that describes how y is related to x and an error term: y = B0 + B1x + E -B0 and B1 = population parameters for the y-intercept and slope that relate y and x -E = variability in y that can't be explained; we assume its expected value to be 0 bc we don't know what it is -use for population
calculating the residual errors
the error (ei) for an observation is the observed value minus the predicted value (from the line) -ei = yi - y^i
confidence level
the estimated probability that a population parameter lies within a given confidence interval
one-tailed test of significance
the hypothesis is directional, and extreme statistical values that occur in a single tail of the curve are of interest
degrees of freedom for a regression
the number of independent variables in the model (q)
confidence interval
the range of values within which a population parameter is estimated to lie
Euclidean distance
the straight-line distance, or shortest possible path, between two points -this equation is given to you but know how to use the equation to find distance between two observations -smaller number = closer together
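A minimal sketch of the distance calculation between two observations, assuming each observation is a tuple of numeric variables. The observations and the helper name `euclidean_distance` are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

obs_a = (3.0, 4.0)   # hypothetical observation with two variables
obs_b = (0.0, 0.0)
distance = euclidean_distance(obs_a, obs_b)
print(distance)      # sqrt(3^2 + 4^2)
```

A smaller result means the two observations are closer together, so they are more likely to land in the same cluster.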
point estimate (sample)
the value of a point estimator used in a particular instance as an estimate of a population parameter -ex: sample mean, sample standard deviation, sample variance, etc.
different types of residual plots that indicate probable bias
they indicate bias if the points are not randomly scattered around zero -ex: a curved pattern, or a spread that widens or narrows as x increases
Multicollinearity
two or more predictor variables in a multiple regression model are highly correlated -meaning that one can be linearly predicted from the others with a substantial degree of accuracy
why do we sample
use sample statistic to make inference about population parameter -it is important to have a close correspondence between sample population and target population
Sum of Squares Error (SSE)
value that measures the variation in the dependent variable that is not explained by the regression (the variation left to factors other than the independent variables) -sum of the squared residuals
interval estimation
we know that we cannot estimate the population mean exactly, so there is always a margin of error -formula: point estimate +- margin of error
sample estimate
we use sample estimates to infer information about population parameters (which are usually unknown) -provides our best guess as to what is the value of the population parameter, but it is not 100% accurate
estimating regression line
we use this equation to estimate the regression line: y^ = b0 + b1x -y^ = point estimator of E(y | x); predicted value of y given any value of x -b0 and b1 = sample estimates of B0 and B1 -use for sample
non response error
when segments of the population are more or less likely to respond to the survey mechanism -people who do not respond are very different than our population of interest -ex: US census is sent out and some people don't respond because of fear of immigration authorities
Type 1 Error
when we incorrectly reject H0
independent variable
x -the predictor
dependent variable
y -the response
how to solve for the confidence
Support of [antecedent and consequent] / support of antecedents
simple random sample
each possible sample of size n has the same probability of being selected
holdout method
method of cross-validation in which sample data are randomly divided into mutually exclusive and collectively exhaustive sets, then one set is used to build the candidate models and the other set is used to compare model performances and ultimately select a model
t-distribution and sampling distribution
We can calculate mean and standard deviation for repeated samples -ex: For a confidence level of 90%, we expect that our confidence interval contains the population mean for 90% of the samples.
Be able to perform hypothesis tests just as with point estimates except two modifications: Modify the process to calculate z (you do not need to memorize the formula) and use NORM.S.DIST (z is from the normal distribution) to determine p values.
Z-score definition: number of standard deviations below or above the population mean a data point is; range from -3 standard deviations to +3 standard deviations -Z-score formula: z = (x - μ) / σ - =NORM.S.DIST (z-score, TRUE)
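A small sketch of the z-score formula and the lower-tail probability that =NORM.S.DIST(z, TRUE) returns, using `statistics.NormalDist` from the Python standard library. The numbers (x = 110, μ = 100, σ = 5) are made up.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

z = z_score(x=110, mu=100, sigma=5)   # hypothetical values
lower_tail = NormalDist().cdf(z)      # standard normal CDF, like =NORM.S.DIST(z, TRUE)
print(z, round(lower_tail, 4))
```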
sampling distribution definition
a probability distribution consisting of all possible values of a sample statistic -distribution of sample means if we took multiple samples
residual plot
a scatterplot of the regression residuals against the explanatory variable
Sum of Squares Regression (SSR)
a value that measures the amount of variation in a dependent variable that is explained by an independent variable -deviation of the line from a line with no relationship (B1 =0) -The sum of squares regression comes from the difference between the predicted value on the line and a line with no slope (at the average y or y-bar) -measures how much of a relationship exists between x and y
how does the area under the normal distribution curve reflect a probability
The whole area under the curve is always equal to 1 (or 100%) -the probability that a value falls in a given range is the area under the curve over that range
confidence interval of regression slope
bj is the estimate of the slope (relationship between x and y) Sbj is the estimated standard error of bj tα/2 is the number of standard errors required to meet a certain level of confidence -formula will be given to you but know how to use
How do we implement a categorical variable into a regression model
by using a dummy variable -ex: if the data can be sorted into four categories (East, West, North, South), we need three binary (0/1) variables in the model (East, West, North). If East = 0, West = 0, and North = 0, then we know we have the case of South so no need for the South variable
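The East/West/North/South example can be sketched as a tiny encoder that leaves one category out as the baseline. The function name `dummy_encode` and the choice of South as the baseline are assumptions that follow the example above.

```python
def dummy_encode(value, categories, baseline):
    """Return 0/1 dummy variables for every category except the baseline."""
    return {c: int(value == c) for c in categories if c != baseline}

regions = ["East", "West", "North", "South"]

west_row = dummy_encode("West", regions, baseline="South")
south_row = dummy_encode("South", regions, baseline="South")
print(west_row)    # West = 1, the other two dummies = 0
print(south_row)   # all three dummies = 0, which identifies South
```

Four categories produce three dummy variables; adding a fourth for South would be redundant.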
meaning of r^2
closer to 0 = -- SSR = 0 -- SSE high -- no relationship between x and y closer to 1 = -- SSR = SST -- SSE low -- best fit
alternative hypothesis
concluded to be true if the null hypothesis is rejected -opposite of what is stated in the null -denoted by Ha -often what the test is attempting to prove -never has an equal sign in it
how to solve for lift ratio
confidence / (support of consequent / total number of transactions) -equation will be given to you
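To see support count, confidence, and lift ratio together, here is a sketch on five invented transactions for the rule Beer => Diapers. The helper names are hypothetical; the formulas are the ones on these cards.

```python
transactions = [            # made-up transaction data
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"diapers", "milk"},
    {"milk", "bread"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in the item set."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """Support of [antecedent and consequent] / support of antecedent."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)

def lift_ratio(antecedent, consequent, transactions):
    """Confidence / (support of consequent / total number of transactions)."""
    baseline = support_count(consequent, transactions) / len(transactions)
    return confidence(antecedent, consequent, transactions) / baseline

rule_confidence = confidence({"beer"}, {"diapers"}, transactions)
rule_lift = lift_ratio({"beer"}, {"diapers"}, transactions)
print(round(rule_confidence, 3), round(rule_lift, 3))
```

Here the confidence is 2/3 and the baseline share of diaper transactions is 3/5, so the lift ratio is above 1, meaning the rule does better than picking a transaction at random.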
t in confidence intervals
confidence intervals are always two-sided, so you will divide alpha by 2 every time
confidence coefficient
confidence level expressed as a decimal value -ex: .95 is the confidence coefficient for a 95% confidence level
clustering
grouping together a set of observations based on similar attributes; creates similar groups -the max number of clusters is however many points there are -used to identify outliers -clusters should minimize the distance between observations within a cluster and maximize the distance between clusters
null hypothesis
hypothesis tentatively assumed to be true in the hypothesis testing procedure and is denoted by Ho -if rejected, then generally means the relationship is significant -can never be proven right, can only be proven wrong -always has an equal sign in it
two-tailed test
hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution
one-tailed test
hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution
Understand what a quadratic relationship between x and y looks like in a chart or in the regression equation
*if b2 > 0 then convex (bowl-shaped) and if b2 < 0 then concave (mound-shaped)
formulate null and alternative hypothesis from conjectures
-Ask questions like what is the purpose of collecting the sample? What conclusions are we hoping to make? -In research based scenarios, it is generally easier to formulate the alternative hypothesis first -In validation scenarios, it is generally easier to formulate null and then the alternative hypothesis -ex. Company thinks their sodas contain 65 ounces or more of soda but researcher thinks it is less so takes a random sample; Ho would be the mean is greater than or equal to 65
what are the reasons that we decide to take a sample rather than take a census
-Potential difficulties with a census: Expensive, time consuming, misleading if population is changing quickly, may be unnecessary or impractical -Sample allows us to gather data from a subset of the population that is as similar as possible to the entire population so that what we learn from the sample data accurately reflects what we want to understand about the entire population
identify under what scenarios model overfitting may occur
-The analyst adds an independent variable that helps explain the variation in the sample, but does not have a relationship with the dependent variable in the population -The analyst chooses a complex non-linear relationship for the model without a good theoretical reason
three conditions that may create bias in a regression
-Variance in y that is not constant across all x values (don't want a high variance) -Omitted Variable -Multicollinearity These conditions may indicate bias in the coefficients of the regression analysis.
what to think about when determining the level of significance for Type 1 errors
-if cost of making Type 1 error is high, smaller values of level of significance are preferred -if cost of making Type 1 error is not high, larger values of level of significance are preferred
two conditions that makes something a sample
1. elements come from same population 2. elements are selected individually/independently to prevent selection bias
determine two tailed p-value
2 * smaller value between the upper and lower tail values
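A sketch of the three p-values for a z test statistic, using the standard normal distribution from the Python standard library. This is an assumption for illustration only: for a t statistic the same logic applies, but the tail areas would come from a t-distribution CDF (what T.DIST gives in Excel), which the standard library does not provide.

```python
from statistics import NormalDist

def tail_p_values(z):
    """Return (lower-tail, upper-tail, two-tailed) p-values for a z statistic."""
    lower = NormalDist().cdf(z)        # area to the left of z
    upper = 1 - lower                  # area to the right of z
    two_tailed = 2 * min(lower, upper) # double the smaller tail
    return lower, upper, two_tailed

lower_p, upper_p, two_tailed_p = tail_p_values(-2.5)   # hypothetical test statistic
print(round(lower_p, 4), round(upper_p, 4), round(two_tailed_p, 4))
```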
t-distribution
A family of probability distributions that can be used to develop an interval estimate of a population mean whenever the population standard deviation σ is unknown and is estimated by the sample standard deviation s -accounts for error and additional uncertainty in our estimates -fewer degrees of freedom means it gets shorter and fatter -used instead of the normal distribution when you have small samples
dummy variable
A variable for which all cases falling into a specific category assume the value of 1, and all cases not falling into that category assume a value of 0.
y = b0 + b1(miles) + b2(deliveries) + b3(rush hour) y=-0.33 +0.0672(Miles) + 0.674(Deliveries) + 0.998(RushHour) what is the interpretation of b3
An assignment route with a congested rush hour highway segment is expected to take 0.998 additional hours, with all else being equal.
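The interpretation of b3 can be checked by predicting the same route twice, once with RushHour = 0 and once with RushHour = 1, using the coefficients from the card. The route (50 miles, 2 deliveries) is a made-up example.

```python
def predicted_travel_time(miles, deliveries, rush_hour):
    """Estimated regression equation from the card (travel time in hours)."""
    return -0.33 + 0.0672 * miles + 0.674 * deliveries + 0.998 * rush_hour

# Same hypothetical route, with and without the rush hour dummy switched on
no_rush = predicted_travel_time(miles=50, deliveries=2, rush_hour=0)
rush = predicted_travel_time(miles=50, deliveries=2, rush_hour=1)
difference = rush - no_rush
print(round(no_rush, 3), round(rush, 3), round(difference, 3))
```

Holding miles and deliveries fixed, the two predictions differ by exactly the coefficient b3.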
central limit theorem
As sample sizes get large, the sampling distribution will approach a normal distribution, regardless of the shape of the original population -helpful for identifying the shape when selecting random samples that don't have a normal distribution -true for any shape of population
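A quick simulation can make the central limit theorem concrete. The sketch below assumes a strongly skewed population (exponential with mean 1 and SD 1), draws 2,000 samples of size 50, and looks at the sample means; the seed, sample size, and number of samples are arbitrary choices.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable
n = 50                   # observations per sample
num_samples = 2000       # number of repeated samples

# Population: exponential with mean 1 and SD 1 (right-skewed, not normal)
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = 1.0 / n ** 0.5   # population SD / sqrt(n)

# Share of sample means within one standard error of the population mean
within_one_se = sum(abs(m - 1.0) <= expected_se for m in sample_means) / num_samples
print(round(mean_of_means, 3), round(sd_of_means, 3), round(within_one_se, 3))
```

The sample means should center near 1, spread out by roughly SD / sqrt(n), and put roughly 68% of their values within one standard error, which is what a normal shape would give even though the population itself is skewed.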
meaning of t-test
As the magnitude of t increases (as t deviates from 0 in either direction), we are more likely to reject the hypothesis that the regression parameter bj is 0 -so conclude that a relationship exists between the dependent variable y and the independent variable xj
cross-validation
Assessment of the performance of a model on data other than the data that were used to generate the model -We hold out a portion of the sample to "test" the fit of the line and then we repeat by using a different portion of the sample as the holdout -If k = 10, that means we use 90% of the sample to create the model, then hold out 10% to test. Then, we repeat the holdout process 10 times, each with a different 10% of the sample
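The k = 10 description can be sketched as index bookkeeping: shuffle the observation indices, cut them into k folds, and let each fold take one turn as the holdout. The function name `k_fold_splits` and the sample of 100 observations are assumptions; fitting and scoring the model is left out.

```python
import random

def k_fold_splits(n_obs, k, seed=0):
    """Yield (training_indices, holdout_indices) for each of the k folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)         # random assignment to folds
    folds = [indices[i::k] for i in range(k)]    # k mutually exclusive folds
    for i, holdout in enumerate(folds):
        # Training set = every fold except the current holdout
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, holdout

splits = list(k_fold_splits(n_obs=100, k=10))
for training, holdout in splits[:2]:
    print(len(training), len(holdout))   # 90 used to build the model, 10 held out
```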
when is a one tailed test used and when its a two tailed test used
If the hypotheses contain greater than or less than symbols (including greater/less than or equal to) it is one-tailed, and if equal to and not equal to then two-tailed
μ0
It is the hypothesized value of the population mean -constant threshold to be tested -claimed value of the population mean given in H0
impact of large samples on errors
Large samples mean very low standard errors -can be found from standard error formula
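A short sketch of the standard error of the mean, SD / sqrt(n), showing how it shrinks as n grows. The SD of 10 and the sample sizes are made up.

```python
import math

def standard_error(sd, n):
    """Standard error of the sample mean: standard deviation / sqrt(sample size)."""
    return sd / math.sqrt(n)

for n in (25, 100, 400, 10000):          # hypothetical sample sizes
    print(n, standard_error(sd=10, n=n))
```

Quadrupling the sample size cuts the standard error in half, which is why very large samples can flag tiny differences as statistically significant.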
Identify scenarios that might require a non-linear model for the regression equation
Look at scatter chart to determine if there is more of a curvilinear relationship
what is the impact of the degrees of freedom on the t-distribution
Low df (low sample size) means the t- distribution is wider and shorter in height.
effect that a larger/smaller standard deviation has on the width and height of a normal distribution
Lower the SD, the higher the curve and the smaller the width
multiple regression
Multiple regression simply means that there are multiple predictors in the regression model
how can sampling distribution be impacted if there is a finite population
Sampling distribution can be impacted if there is a finite population unless the population involved is large relative to the sample size -then the difference between the values of the standard deviation for the finite and infinite populations becomes negligible -ex: n/N < 0.05 to be considered negligible -won't need to calculate on the test
standard deviation
Shows how much variation or "dispersion" there is from the "average" (mean/EV) -low SD indicates that the data points tend to be very close to the mean
determine lower tail p-value using excel
T.DIST(test statistic, degrees of freedom, cumulative) -cumulative = true
t-statistic of confidence interval on excel
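=T.INV(alpha/2, DF) -T.INV takes only two arguments (probability, degrees of freedom), with no cumulative flag -returns a negative value, so flip the sign -or use =T.INV.2T(alpha, DF)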
An analyst wants to understand the impact of class standing (freshman, sophomore, junior, or senior are the four possible categories) on the GPA of students in the Gies College of Business. The analyst creates a regression model for the prediction: (GPA) ̂=b_0+ b_(1 ) (Freshman)+b_(2 ) (Sophomore)+b_3(Junior)+b_4 (Senior) What is wrong about this regression model?
The analyst should leave one dummy variable out of the model
level of significance
The probability that the interval estimation procedure will generate an interval that does not contain the value of the parameter being estimated -also the probability of making a Type 1 error when the null hypothesis is true as an equality
residual plots
The residuals e are the differences between the observed values and the values predicted by the line, plotted against x (along the x axis) -Used to diagnose problems of bias in our regression model
what does the significance F column mean on the regression output
This is the p-value for the F statistic -a small value (less than alpha) indicates that the model with the predictor variables fits better than a model with no predictor variables
Calculate the lower and upper bound of the Confidence Interval for a given sample mean, standard deviation, and sample size, and confidence level
Use + for upper bound and - for lower bound *formula provided -The part of the formula that you are adding or subtracting is the margin of error; confidence intervals are always two-sided (upper and lower), so you will see the text use alpha divided by 2 - =-T.INV(probability, residual degrees of freedom) for the t stat at alpha/2 ---Remember the negative sign!! (T.INV returns a negative value for a lower-tail probability) ---Probability is significance level / 2 ---If using T.INV.2T then use the full alpha
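A sketch of the bound calculation, assuming the usual form point estimate ± t × (s / sqrt(n)). The critical value is passed in rather than computed, since it would come from Excel (for example =T.INV.2T(0.05, 15), about 2.131). The sample numbers are invented.

```python
import math

def confidence_interval(sample_mean, sample_sd, n, t_crit):
    """Return (lower, upper) bounds: sample mean -/+ t_crit * standard error."""
    standard_error = sample_sd / math.sqrt(n)
    margin_of_error = t_crit * standard_error
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical sample: mean 50, SD 8, n = 16, 95% confidence (df = 15, t about 2.131)
ci_lower, ci_upper = confidence_interval(50.0, 8.0, 16, t_crit=2.131)
print(round(ci_lower, 3), round(ci_upper, 3))
```

The standard error here is 8 / 4 = 2, so the margin of error is 2.131 × 2 = 4.262 on each side of the sample mean.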
draw conclusions on a population based on estimates from a sample
Use sample estimates to infer information about population parameters (which are usually unknown)
when is the sample considered unbiased
When the expected value of the mean of the sampling distribution is equal to the mean of the population
normal distribution
function that represents the distribution of variables as a symmetrical bell-shaped graph -continuous random variables frequently show this type of distribution
standard error
accuracy of the predictions
p-value
assuming H0 is true, the probability of obtaining a random sample that results in test statistic at least as extreme as the one observed in the current sample -measures strength of evidence provided by the sample against the null hypothesis -smaller p = more evidence, as it suggests that it is increasingly more unlikely that the sample could occur if Ho is true
t-statistic of regression slope
b = estimate of slope in relation to x and y S = estimated standard error -formula will be given, but know what it means
sampling error
deviation of the sample from the population that results from randomness -increasing the sample size reduces sampling error, but it does not eliminate it -ex: sometimes when we flip a coin, it comes up heads more often than tails
non sampling error
deviations of the sample from the population that occur for reasons other than random sampling -Arises because of deficiencies in, or inappropriate analysis of, the data; can be random or non-random
statistical inference
drawing conclusions about a population based on estimates from a sample (value of one or more parameters) -sample provides estimates for the population
lift ratio
evaluates how effective association rules are at identifying transactions of the consequent vs randomly selected transaction
parameter (population)
factor that defines characteristic of population -ex: population mean, population standard deviation, population variance, census, etc.
no relationship
slope is 0
negative linear relationship
slope is negative
positive linear relationship
slope is positive
why to standardize variables in Euclidean distance
so that each variable receives equal weight in the clustering -otherwise the clusters might be dominated by a variable measured on a large scale -ex: annual salary compared to years of experience -analyst may choose to weight one variable higher than another based on its importance to the grouping
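The salary versus years-of-experience example can be sketched by converting each variable to z-scores before measuring distance. The five observations are made up, and using the sample standard deviation for the scaling is an assumption.

```python
import math
import statistics

def standardize(values):
    """Convert values to z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

salary = [40000, 50000, 60000, 70000, 80000]   # hypothetical annual salaries
years = [1, 3, 5, 7, 9]                        # hypothetical years of experience

z_salary = standardize(salary)
z_years = standardize(years)

# Distance between the first two observations, before and after standardizing
raw_distance = math.dist((salary[0], years[0]), (salary[1], years[1]))
std_distance = math.dist((z_salary[0], z_years[0]), (z_salary[1], z_years[1]))
print(round(raw_distance, 2), round(std_distance, 3))
```

The raw distance is almost entirely the $10,000 salary gap, while after standardizing both variables contribute equally.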
Sum of Squares Total (SST)
sum of square total comes from the difference between the observed values and a line with no slope (at the average y or y-bar) SST = SSR + SSE Total = regression + error
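The identity SST = SSR + SSE, and R^2 = SSR / SST, can be verified numerically. The sketch below fits a least squares line to five invented points and then computes the three sums of squares; the data are made up for illustration.

```python
x = [1, 2, 3, 4, 5]   # made-up sample data
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]   # predicted values on the line

sst = sum((yi - y_bar) ** 2 for yi in y)                 # observed vs. flat line at y-bar
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # fitted line vs. flat line at y-bar
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # observed vs. fitted line
r_squared = ssr / sst

print(round(sst, 4), round(ssr, 4), round(sse, 4), round(r_squared, 4))
```

For this data SST = 6, SSR = 3.6, and SSE = 2.4, so the line explains 60% of the variation in y.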