BA - Core

normal distribution

-a probability distribution whose center and width are determined by its mean and standard deviation -using the properties of the normal distribution, we can calculate the probability associated with any range of values -total area under the curve = 1 -the x axis shows the variable we are studying

normal distribution rules of thumb

-68% of the probability of a normal distribution is contained in the range from 1 SD below the mean to 1 SD above the mean -95% within 2 SD above and below the mean -99.7% within 3 SD above and below the mean (also stated as: 99.7% of the probability is contained in the range reaching three standard deviations away from the mean on either side)
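
The rules of thumb above can be checked numerically. A minimal sketch using Python's standard library (the helper name `prob_within` is made up for illustration):

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
standard_normal = NormalDist()

def prob_within(k):
    """Probability that a normal value lies within k SDs of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {prob_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```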

percentile

=percentile.inc(range of data, percentile # like .95) -the median by definition is the 50th percentile -60% of the observations are less than or equal to the 60th percentile
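
As a rough illustration of what an inclusive percentile does, the sketch below interpolates linearly between sorted values. The function name `percentile_inc` and the scores are hypothetical, and the interpolation rule is an assumption about how the Excel function behaves, not a quotation of its documentation.

```python
def percentile_inc(data, p):
    """Inclusive percentile with linear interpolation, p between 0 and 1."""
    values = sorted(data)
    rank = p * (len(values) - 1)   # fractional position in the sorted list
    lower = int(rank)
    frac = rank - lower
    if lower + 1 >= len(values):   # p == 1 returns the maximum
        return values[-1]
    return values[lower] + frac * (values[lower + 1] - values[lower])

scores = [62, 70, 75, 81, 90]      # hypothetical data
median = percentile_inc(scores, 0.50)
print(median)                      # 75.0, the same as the median
```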

cumulative probability

=the probability of obtaining any value less than or equal to that number, which is the area to the left of that number

prediction interval

A prediction interval is a range of values constructed around a point forecast. The center of the prediction interval is the point forecast, that is, the expected value of y for a specified value of x. The range of the interval extends above and below the point forecast. The width of the interval is based on the standard error of the regression and the desired level of confidence in the prediction. -the standard error of the regression is found in the regression output table (we need to pick a level of confidence); then we can say, for example, that we are 95% confident that the actual selling price will fall within the prediction interval

single-population hypothesis test

collecting from 1 population and testing to see if its average was significantly different from the historical average

P bar

equivalent to the mean of all responses when they are coded as a binary variable (0 and 1) -the proportion of "yes" (success) responses out of the total

type 2 error

false negative - incorrectly fail to reject the null hypothesis when it is actually not true

type 1 error

false positive - incorrectly reject the null hypothesis when it is actually true. The probability of a type 1 error is equal to the significance level (5% at a 95% confidence level), which is 1 - confidence level. To decrease the chance of a type 1 error we can increase the confidence level (e.g. to 99%), but when we decrease the chance of a type 1 error, we increase the chance of a type 2 error

adjusted r^2

for multiple regressions, we rely on an adjusted R^2. We can't just check whether R^2 increased when running a multiple regression compared to a single variable regression, because R^2 can never decrease when an independent variable is added to a model. We can always increase R^2 by adding more independent variables, but we don't want to overfit, so to compensate we modify R^2 by an adjustment factor that reduces R^2 slightly for each independent variable we add to the model. If the new variable explains enough additional variation to increase the adjusted R^2, we consider the new variable to add predictive or explanatory power
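
The card does not state the adjustment factor. A minimal sketch using the standard textbook formula, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables (the numbers below are hypothetical):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: a second variable nudges R^2 from 0.740 to 0.745.
one_variable = adjusted_r2(0.740, n=30, k=1)
two_variables = adjusted_r2(0.745, n=30, k=2)

print(round(one_variable, 4), round(two_variables, 4))
# Adjusted R^2 falls even though R^2 rose, so the new variable
# would not be considered to add explanatory power here.
```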

low and high r^2

high r^2: a large portion of the variation in y is explained by the regression line. low r^2: a smaller portion of the variation in y is explained by the regression line. Separately, if the p value is less than .05, then we can be 95% confident that the true slope is not 0, that is, that there is a significant linear relationship between the independent and dependent variables

1 sided hypothesis test

if the movie theater manager has reason to believe that showing old classics has increased the customer satisfaction rating, use a one-sided hypothesis test. If the alternative hypothesis is that the average satisfaction rating has increased, then the null hypothesis is that the rating is the same or lower. null hypothesis: mean ≤ 6.7; alternative hypothesis: mean > 6.7. If the two-sided p-value of a given sample mean is 0.0040, what is the one-sided p-value for that sample mean? 0.0020. The one-sided p-value is half of the two-sided p-value. Since the two-sided p-value is 0.0040, the one-sided p-value is 0.0040/2 = 0.0020.

p-values for multiple regression

if the p-value is less than .05 for each of the independent variables, we can be 95% confident that the true coefficient of each of the independent variables is not 0. In other words, we can be confident that there is a significant linear relationship between the dependent variable and the independent variable

residual plot

scatter plot of residuals: residuals on the y axis, independent variable (house size) on the x axis. The residuals should be spread randomly. If you do see a pattern in a residual plot, then it's possible that other factors may be influencing the dependent variable and the linear model may not be the best fit for the data -look for a curve, or for the residuals to get larger along the x axis (heteroskedasticity)

multiple linear regression

seeks to identify a linear relationship among 3 or more variables. example - selling price vs house size and distance from boston. note that if r^2 = 15% of the variation in selling price is explained by a home's distance from boston (low) and r^2 = 74% of the variation in selling price is explained by house size, the two variables together DO NOT necessarily explain the sum (89%). y = a + b1x1 + b2x2 - essentially separates the effects that house size and distance each have on price

margin of error

the term z*(SD/sqrt(n)) is often referred to as the interval's margin of error, where z is the z-value for the chosen level of confidence (in general, z = (x-mean)/SD)

alternative hypothesis

the theory or claim we are trying to support. Ha: mean ≠ 6.7

regression line vs mean price line

to find out how much additional value the regression model gives us, we'll compare the accuracy of the regression line with that of the mean price line. Calculate the sum of squared errors around each of the 2 lines and see how much smaller the error is around the regression line (called the residual sum of squares) than around the mean line (called the total sum of squares). The difference is called the regression sum of squares - it measures the variation in price that's explained by the regression model -if the regression sum of squares is a large fraction of the total sum of squares (that is, the residual sum of squares is a small fraction), we know the regression line helps us predict more accurately than the mean price alone
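
The comparison between the two lines can be sketched in a few lines of Python. The house sizes and prices below are made-up illustration data, and the least-squares slope and intercept are computed by hand rather than with a statistics package.

```python
# Hypothetical data: house size (sq ft) and selling price (in $1,000s).
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [260, 400, 520, 640, 790]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Least-squares slope and intercept of the regression line y = a + bx.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in sizes]

# Errors around the mean line vs. errors around the regression line.
total_ss = sum((y - mean_y) ** 2 for y in prices)
residual_ss = sum((y - p) ** 2 for y, p in zip(prices, predicted))
regression_ss = total_ss - residual_ss

r_squared = regression_ss / total_ss
print(f"total={total_ss:.0f} residual={residual_ss:.0f} R^2={r_squared:.4f}")
```

Here the residual sum of squares is a tiny fraction of the total, so the regression line predicts far better than the mean price alone.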

t-test

use a t-test for all calculations of p-value: =T.TEST(array1, array2, tails, type), almost always type 3. Although there are multiple ways to calculate a p-value in Excel, we will use a t-test, the most common method used for hypothesis tests. The t-test uses a t-distribution, which provides a more conservative estimate of the p-value when the sample size is small.

single variable linear regression

used for a) studying magnitude and structure of a relationship between two variables b) forecasting a variable based on its relationship with another variable

descriptive statistics (also known as summary stats)

used to summarize a data set numerically -3 values describe the center, or central tendency, of a data set -mean = the sum of all data points in the set divided by the number of data points =average(A1:A5) (a conditional mean is the mean of a subset of the data that includes all values satisfying a certain condition =averageif(cells that contain the labels, the criteria like "NA", range of data)) -median - the middle value of the data set: half of the data set's values lie below the median, and half above =median(A1:A5) -mode - the # that occurs most frequently; a data set can have multiple modes =mode.sngl(A1:A5) =mode.mult(A1:A5); to find how many modes =count(mode.mult(A1:A5))
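
For readers working outside Excel, the same three measures of center can be sketched with Python's standard `statistics` module (the data values are hypothetical):

```python
import statistics

data = [4, 8, 8, 5, 3, 5]  # hypothetical data set

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # middle value of the sorted data
modes = statistics.multimode(data)      # every value tied for most frequent

print(mean_value, median_value, sorted(modes), len(modes))
# 5.5 5.0 [5, 8] 2
```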

mediating variable

variables that are affected by one variable and then affect another. example - worrying and getting better grades, because worrying causes a student to study harder (the mediating variable is studying harder)

p-value

we reject the null hypothesis if the sample's p-value is less than 5%. The threshold at which we reject the null hypothesis is called the significance level and is equal to 1 - the confidence level (example: 1 - .95 = .05). In general, the smaller the p-value, the stronger the evidence is against the null hypothesis. When the p-value of a sample mean is less than the significance level, we reject the null hypothesis. How would you interpret the p-value of 0.0026? If the null hypothesis is true, the likelihood of obtaining a sample with a mean at least as extreme as 7.3 is 0.26%

histogram

we use histograms to help us visualize a variable's distribution. A histogram's x-axis represents bins corresponding to ranges of data; a histogram's y-axis indicates the frequency of observations falling into each bin -bins = a range of possible numbers that runs up to a number and includes that number (e.g. greater than 66 and less than or equal to 68) -data analysis -> histogram -input range = the actual numbers -bin range = the numbers you decided to label each bin with

scatter plot

we use scatterplots to visualize the relationship between 2 variables. Although there may be a relationship between two variables, we cannot conclude that one variable causes the other - correlation does not imply causation -usually the dependent or response variable is plotted on the y (vertical) axis while the independent or predictor variable is plotted on the x (horizontal) axis

residual plots for multiple regression

you will see a residual plot for each of the independent variables each residual plot should reveal a random distribution

Spread of Data

- 3 values measure the spread of the data -range -variance - measures how far the data points are from the mean, measured in squared units =var.s(A1:A5) -standard deviation is equal to the square root of the variance -a small SD means data points are close to the mean, a large SD means data points are far from the mean and have a broader spread =stdev.s(A1:A5) (the s means that you are working with a sample). If variance = 16, find the SD by: =sqrt(16) = 4
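
A minimal Python sketch of the three spread measures, using the sample (n - 1) versions that correspond to var.s and stdev.s; the data values are hypothetical.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

data_range = max(data) - min(data)
sample_variance = statistics.variance(data)  # divides by n - 1, like var.s
sample_sd = statistics.stdev(data)           # square root of the variance

print(data_range, round(sample_variance, 3), round(sample_sd, 3))
# 7 4.571 2.138
```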

challenges of using lagged variable

- note that you will lose a data point, which decreases our sample size and so reduces the precision of our estimates of the regression coefficients -also, if the lagged variable does not increase the model's explanatory power (the addition of the variable decreases adjusted R^2), then adding a lagged variable can be costly

sample sizes larger than 30

-for large samples (greater than 30), the lower and upper bounds are calculated using the equation sample mean +/- z*(sample SD/sqrt(sample size)) -the function CONFIDENCE.NORM calculates the margin of error, which we add and subtract from the sample mean to find the confidence interval
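
The large-sample bounds can be sketched in Python as follows. The function name `confidence_interval` and the sample figures are hypothetical; the z-value comes from the standard normal distribution, as in the equation above.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Lower and upper bounds: sample mean +/- z * (SD / sqrt(n))."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    margin_of_error = z * sample_sd / sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical sample: mean rating 7.3, SD 2.0, 100 respondents.
lower, upper = confidence_interval(7.3, 2.0, 100)
print(round(lower, 3), round(upper, 3))  # 6.908 7.692
```

For samples smaller than 30, the z-value would be replaced by a t-value, which this sketch does not cover.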

sample sizes smaller than 30

-for small samples (smaller than 30), the lower and upper bounds are calculated using the equation sample mean +/- t*(sample SD/sqrt(sample size)) -for small samples, we use a t-distribution, which is shorter and wider than a normal distribution -the t-distribution provides a wider range, a more conservative estimate of where the true population mean lies -the function CONFIDENCE.T calculates the margin of error, which we add and subtract from the sample mean to find the confidence interval

What would increase the width of the confidence interval

-increasing the confidence level -decreasing the sample size (will result in a less precise estimate and therefore a wider confidence interval)

Multicollinearity

-multicollinearity - when 2 independent variables are so highly correlated that it is difficult for the regression model to separate the effect each variable has on the dependent variable -if a variable that was significant (p value less than .05) becomes insignificant when we add another variable to the regression model, we can usually attribute it to a relationship between 2 or more of the independent variables -example - house size and lot size (high correlation). OK for making predictions - can ignore multicollinearity. NOT OK for understanding the net effects - the best way to reduce multicollinearity is to increase the sample size OR remove one of the collinear variables

population vs sample

-the numerical properties of a population are called parameters -the numerical properties of a sample are called statistics (a statistic is an estimate of a true value of a parameter)

2 ways to test whether the slope of the best fit line = 0

1) check whether the confidence interval for the line's slope contains 0: look at the lower and upper 95% bounds - example: if they are 196 and 314, then we are 95% confident that the true slope of the regression line is between 196 and 314, and since that interval does not contain 0, we can be 95% confident that there is a significant linear relationship between the variables 2) check whether the p-value of the independent variable (house size) is less than .05 - we need to know that the true slope of the regression line is not 0 so we can be confident there is a significant linear relationship. Ho: the true slope of the regression line = 0. Ha: the true slope of the regression line ≠ 0. If the p value is .0000, we reject the null hypothesis that the slope is 0 and can be confident that there is a significant relationship between the 2 variables

interpreting pvalue

The p-value of 0.0026 indicates that if the population mean were actually still 6.7, there would be a very small possibility, just 0.26%, of obtaining a sample with a mean at least as extreme as 7.3. Equivalently, since 7.3-6.7=0.6, this p-value tells us that if the null hypothesis is true, the probability of obtaining a sample with a mean less than 6.7-0.6=6.1 or greater than 6.7+0.6=7.3 is 0.26%. since it is less than .05, we reject the null hypothesis and conclude that customer satisfaction has changed

significance level

The significance level defines the rejection region by specifying the threshold for deciding whether or not to reject the null hypothesis. When the p-value of a sample mean is less than the significance level, we reject the null hypothesis. If we use the most commonly used significance level of 0.05, we draw our conclusions based on whether the sample's p-value is less than or greater than 0.05. If the p-value is less than 0.05, we reject the null hypothesis. If the p-value is greater than or equal to 0.05, we fail to reject the null hypothesis.

z-value

The z-value of a data point is the distance in standard deviations from the data point to the mean. Negative z-values correspond to data points less than the mean; positive z-values correspond to data points greater than the mean. z= (x-mean)/SD If a particular standardized test has an average score of 500 and a standard deviation of 100, what z-value corresponds to a score of 350? =(350-500)/100 = -1.5
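
The worked example above, together with the cumulative probability to the left of that score, can be sketched in Python (the function name `z_value` is made up for illustration):

```python
from statistics import NormalDist

def z_value(x, mean, sd):
    """Distance from x to the mean, measured in standard deviations."""
    return (x - mean) / sd

z = z_value(350, mean=500, sd=100)
# Cumulative probability: area under the normal curve to the left of 350.
cumulative = NormalDist(mu=500, sigma=100).cdf(350)

print(z, round(cumulative, 4))  # -1.5 0.0668
```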

dummy variables for multiple regression

We always interpret the coefficient of a dummy variable as being the expected change in the dependent variable when that dummy variable equals one compared to the base case. example Sales=−631,085+533,024(Red)+50.5(Advertising) In this case, controlling for advertising, we expect sales for red sneakers to be $533,024 more than blue sneakers.
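
To see why the dummy coefficient is read as a difference from the base case, the sneaker equation above can be evaluated for both colors at the same advertising level. This is a sketch; the function name and the advertising figure of 20,000 are hypothetical.

```python
def predicted_sales(red, advertising):
    """Sales = -631,085 + 533,024(Red) + 50.5(Advertising); red is 0 or 1."""
    return -631_085 + 533_024 * red + 50.5 * advertising

advertising = 20_000  # hypothetical, held fixed for both colors
red_sales = predicted_sales(red=1, advertising=advertising)
blue_sales = predicted_sales(red=0, advertising=advertising)  # base case

print(red_sales - blue_sales)  # 533024.0, the dummy variable's coefficient
```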

interpreting confidence interval

a 95% confidence level should be interpreted as: if we took 100 samples from a population and created a 95% confidence interval for each sample, on average 95 of the 100 confidence intervals would contain the true population mean -or: on average, 95 out of 100 such confidence intervals do contain the true mean, which is why we say we're 95% confident that our interval does. It DOES NOT MEAN that there is a 95% chance that the confidence interval contains the true population mean. "The lower bound is approximately 22.04 and the upper bound is 29.90, so we can be 95% confident that the true mean BMI of all U.S. citizens is between 22.04 kg/m2 and 29.90 kg/m2."

time series

a data set in which one of the variables is time time is usually on the horizontal axis

bimodal

a distribution is bimodal if it has 2 clearly defined peaks

skewness

a distribution that is skewed to the right looks "stacked" on the left and has a tail on the right side. Having one or more outliers at one end of a dataset will skew a distribution that is otherwise not skewed

null hypothesis

a statement about a topic of interest about the population. It is typically based on historical information (e.g. the average moviegoer satisfaction rating at the theater). We assume that the null hypothesis is true and then test to see if we can nullify it using evidence from a sample. The null hypothesis is the opposite of the hypothesis we are trying to prove (the alternative hypothesis). Ho: mean = 6.7

outlier

a value that falls far from the rest of the data. We should always carefully investigate an outlier before deciding whether to leave it as is, change its value to the correct value, or remove it

outcomes of hypothesis testing

a) reject the null hypothesis and accept the alternative hypothesis b) fail to reject the null hypothesis because we have insufficient evidence (this does not prove that the null hypothesis is true)

hidden variables

be aware of hidden variables, which may be responsible for patterns we see when graphing or examining relationships between two data sets

two population hypothesis test

collect 2 samples during the same time period and test to determine whether the average customer satisfaction rates in the 2 populations were different (a/b tests)

margin of error

either =confidence.norm(alpha, SD, size), where alpha = 1 - .95 = .05, which gives the margin of error; to find the upper and lower bounds: =sample mean +/- confidence.norm(alpha, SD, size) OR sample mean +/- z * (s/sqrt(n)) (change to confidence.t for smaller sample sizes)

dummy variables

either 0 or 1. How it relates to hypothesis testing: Ho: selling price in neighborhoods where SAT is at or above 1700 = selling price in neighborhoods where SAT is below 1700. Ha: selling price in neighborhoods where SAT is at or above 1700 ≠ selling price in neighborhoods where SAT is below 1700

Central Limit Theorem

if we take enough sufficiently large samples from any population, the means of those samples will be approximately normally distributed, regardless of the shape of the underlying population -the distribution of sample means more closely approximates a normal curve as we increase the number of samples or the sample size -we can use the normal curve's special properties to draw conclusions from a single sample mean -the mean of the distribution of sample means equals the mean of the population distribution -the SD of the sample means = the SD of the population distribution divided by the square root of the sample size --> thus, increasing the sample size decreases the width of the distribution of sample means
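
A small simulation can make these properties concrete. The sketch below draws samples from a uniform population on [0, 1], which is clearly not normal; the sample size, number of samples, and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)         # fixed seed so the run is repeatable
sample_size = 36
num_samples = 5000

# Uniform(0, 1) population: mean 0.5, SD sqrt(1/12), about 0.2887.
population_mean = 0.5
population_sd = sqrt(1 / 12)

sample_means = [
    mean([random.random() for _ in range(sample_size)])
    for _ in range(num_samples)
]

observed_sd = stdev(sample_means)
expected_sd = population_sd / sqrt(sample_size)  # SD / sqrt(n)

print(f"mean of sample means: {mean(sample_means):.3f}")
print(f"SD of sample means: {observed_sd:.4f} (theory: {expected_sd:.4f})")
```

The mean of the sample means lands close to the population mean, and their SD is close to the population SD divided by the square root of the sample size.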

regression output: x and y

input y range is the dependent input x is for the independent

sample

it is often very useful to infer attributes of a large population from a smaller sample. To make sound inferences: -make sure the sample is sufficiently large and representative of the population -avoid bias (phrase questions neutrally, ensure the sampling method is appropriate, pursue high response rates). If a sample is sufficiently large and representative of the population, the sample statistics x̄ (mean) and s (standard deviation) should be reasonably good estimates of the population parameters μ and σ -as the sample grows, the sample mean and SD approach the population's mean and SD

difference between 68% confidence interval and 95%

lower levels of confidence allow narrower confidence intervals. For example, a 68% confidence interval would be approximately half the width of a 95% confidence interval (about 1 standard error on each side of the mean instead of about 2)
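
The "approximately half" claim can be checked by comparing the z-values behind the two intervals, since for the same sample the width is proportional to z. A minimal sketch (the helper name `z_for_confidence` is made up for illustration):

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """z-value that leaves (1 - confidence) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

z_68 = z_for_confidence(0.68)
z_95 = z_for_confidence(0.95)
width_ratio = z_68 / z_95  # same SD and sample size, so the widths scale with z

print(round(z_68, 3), round(z_95, 3), round(width_ratio, 2))
# 0.994 1.96 0.51
```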

cross sectional data

measures an attribute across different subjects (people, countries) -provides a snapshot of data across multiple groups at a given point in time. example - oil consumed by 10 countries in 2012

r^2

measures how closely a regression line fits a data set - explanatory power. R^2 = 0 - the regression line explains none of the variation in y. R^2 = 1 - the regression line explains all the variation in y. note - correlation measures the linear relationship between 2 variables, while R^2 measures how much variation is explained by the regression line, but it does not reveal exactly how the variables are related

t-distribution

not as tall as a normal distribution and has thicker tails - these differences reflect the fact that a t-distribution is more likely than a normal distribution to have values farther from the mean. Use for samples smaller than 30

multiple linear regression - interpreting equation

selling price = 194,800 + 244(house size) - 10,840(distance from boston). 244(house size) - interpret the coefficient as the expected change in selling price per additional square foot if distance stays the same. 10,840(distance from boston) - for every additional mile a house is from Boston, on average, price decreases by 10,840, assuming that the size of the house stays the same ("net effect of distance on price" or "effect of distance on price controlling for house size"). Similarly, because we are controlling for distance, the coefficient for house size assumes that all houses are the same distance from boston. The coefficient is NET with respect to all variables included in the model but GROSS with respect to all omitted variables
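
The "controlling for" reading can be illustrated by evaluating the equation above while changing one variable at a time. This is a sketch; the function name and the example house (2,000 sq ft, 10 miles out) are hypothetical.

```python
def predicted_price(house_size, distance):
    """selling price = 194,800 + 244(house size) - 10,840(distance)."""
    return 194_800 + 244 * house_size - 10_840 * distance

base = predicted_price(house_size=2000, distance=10)

# Move one mile farther out, holding house size fixed.
one_mile_farther = predicted_price(house_size=2000, distance=11)
# Add one square foot, holding distance fixed.
one_sqft_larger = predicted_price(house_size=2001, distance=10)

print(base, one_mile_farther - base, one_sqft_larger - base)
# 574400 -10840 244
```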

point forecast

once we have found the regression equation for a given data set, we can use that equation to obtain a point forecast. example: forecast the price of a 1500 sq foot house: selling price = 13,500 + 255(house size) = 13,500 + 255(1500) = 396,000. There is greater uncertainty as we forecast outside of the historical range of the data, and we should feel less comfortable with those predictions
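
A minimal sketch of the point forecast above. The historical range of 1,000 to 3,000 sq ft is an assumption added only to illustrate the extrapolation warning; it does not come from the card.

```python
# Assumed range of house sizes in the historical data (hypothetical).
HISTORICAL_RANGE = (1000, 3000)

def forecast_price(house_size):
    """Point forecast from selling price = 13,500 + 255(house size)."""
    price = 13_500 + 255 * house_size
    low, high = HISTORICAL_RANGE
    within_range = low <= house_size <= high  # False means extrapolating
    return price, within_range

print(forecast_price(1500))  # (396000, True)
print(forecast_price(5000))  # (1288500, False): treat with more caution
```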

p value vs r squared

r^2 measures how much variation is explained by the regression line but it DOES NOT reveal exactly how the variables are related...so look at 2 other metrics 1) p value of the independent variable (house size) 2) the residual plot

lagged variable

sometimes the value of the dependent variable in a given period is affected by the value of an independent variable in an earlier period. example: advertising in an earlier period will still influence this year's sneaker sales. Incorporate the delayed effect using a lagged variable: to do this, we add the previous year's advertising (the lagged variable) so that the model now includes the current year's and previous year's advertising. independent variables (ads) - x range; dependent variable (sales) - y range
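
Building the lagged column also shows why one data point is lost: the first year has no previous year to pair with. A sketch with made-up yearly figures:

```python
# Hypothetical yearly data, oldest first.
advertising = [100, 120, 90, 150, 130]
sales = [900, 1000, 950, 1100, 1150]

# Each row pairs this year's sales with this year's and last year's ads.
rows = [
    (advertising[t], advertising[t - 1], sales[t])
    for t in range(1, len(advertising))  # starts at 1: year 0 has no lag
]

for current_ads, lagged_ads, current_sales in rows:
    print(current_ads, lagged_ads, current_sales)

print(len(advertising), "years of data ->", len(rows), "usable rows")
```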

correlation coefficient

tells us the strength of a linear relationship between 2 variables -the value of the correlation coefficient ranges between -1 and +1 -a correlation coefficient near zero indicates a weak or nonexistent linear relationship -a correlation coefficient near 0 does not mean there is no relationship between the two variables, but only that any relationship that does exist is not linear -perfectly upward sloping = +1, perfectly downward sloping = -1 =correl(B2:B11, C2:C11). Correlation indicates a linear relationship but not causality: ex - taller people tend to weigh more, but gaining weight won't make you taller
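
The points above can be checked with a hand-written correlation function. This is a sketch of the standard Pearson formula; the function name `correlation` and the data are made up for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [-2, -1, 0, 1, 2]
upward = correlation(xs, [2 * x + 1 for x in xs])     # perfect upward line
downward = correlation(xs, [-3 * x + 4 for x in xs])  # perfect downward line
u_shaped = correlation(xs, [x ** 2 for x in xs])      # related, but not linearly

print(round(upward, 6), round(downward, 6), round(u_shaped, 6))
```

The U-shaped case has a correlation of 0 even though y is completely determined by x, which is why a coefficient near 0 only rules out a linear relationship.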

2 sided hypothesis test

test for change either up or down, when you don't have conviction about how satisfaction will change

residual error

the difference between the observed value and the line's prediction for the dependent variable. To quantify the total size of the errors, we can't just sum the vertical distances (positive and negative distances would cancel each other out), so we calculate the sum of squared errors, or residual sum of squares. A regression line is formally defined as the line that minimizes the sum of squared errors

regression line

the linear relationship that best fits the data, but it won't pass through every point the regression line is the line that minimizes the dispersion of points around that line and we measure the accuracy by measuring that dispersion -we attribute the difference between the actual data points and the values predicted by the regression line either to a) relationships between selling price and variables other than house size or b) chance alone y = a + bx

confidence interval

the sample mean is only a point estimate. Using the properties of the normal distribution and the central limit theorem, we can construct a range around the sample mean, called a confidence interval -the width of the confidence interval depends on the level of confidence, our best estimate of the population SD, and the sample size; we can only control the level of confidence and the sample size -a confidence interval must cover a distance of z*(SD/sqrt(sample size)) on each side of the sample mean (this distance is also called the margin of error), where z is the z-value associated with our specified level of confidence

