Stats Final Exam - Theory Side

Evaluating variance for regression analysis

- Can evaluate the assumption of equal variance from a plot of the residuals against Xi

Coefficient of Determination for regression analysis

- The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
- Is between 0 and 1
  o If it is 0, the model explains none of the variation in Y
  o If it is 1, it is a perfect model: the linear relationship between X and Y explains all of the variation in Y
  o Rule of thumb: if above about 0.6 to 0.65, the model is a good fit
  o If around 0.3 or 0.4, the model is not a good fit
- The coefficient of determination is also called r-squared and is denoted r²
- NOTE: find this on Excel

The principle of parsimony

- The principle of parsimony is the belief that you should select the simplest model that gets the job done adequately
- Suppose two or more models provide a good fit for the data: select the simplest model
- Among forecasting models, the least-squares linear and quadratic models and the first-order autoregressive model are regarded by most statisticians as the simplest.
- The second- and pth-order autoregressive models, the least-squares exponential model and the Holt-Winters model are considered the most complex of the techniques presented.

Approaches to time forecasting

- Qualitative forecasting methods: used when historical data are unavailable. Considered highly subjective and judgemental
- There are 2 quantitative forecasting methods, both using past data to predict future values
  o 1) Time series: forecasts future values of a variable based entirely on the past and present values of that variable, eg: quarterly GDP, annual recorded total sales revenue of a company
  o 2) Causal: eg: multiple regression analysis. Involves determining the factors that relate to the variable you are trying to forecast

Autoregressive modelling steps

1) Choose a value for p
2) Form a series of p 'lagged predictor' variables Yi-1, Yi-2, ..., Yi-p
3) Use Excel to run a multiple regression model using all p predictor variables
4) Test significance of the pth variable:
  o If significant, this model is selected
  o If not, discard the pth variable and repeat steps 3 and 4, evaluating the new highest-order parameter, whose predictor variable lags by p-1 years
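Step 2 can be sketched in Python. This is only an illustration, not part of the original notes: the function name `lagged_predictors` and the series values are made up, and the regression itself (step 3) is still left to Excel.

```python
def lagged_predictors(series, p):
    """Build the p lagged predictor columns for an autoregressive model.

    Returns (rows, targets): each row holds Y[i-1], ..., Y[i-p] and the
    matching target is Y[i]. The first p observations have no full set of
    lags, so they are dropped.
    """
    rows, targets = [], []
    for i in range(p, len(series)):
        rows.append([series[i - lag] for lag in range(1, p + 1)])
        targets.append(series[i])
    return rows, targets


series = [10, 12, 13, 15, 18, 20]  # made-up annual values
rows, targets = lagged_predictors(series, p=2)
for row, target in zip(rows, targets):
    print(row, "->", target)
```

Note that choosing a larger p leaves fewer rows for the regression, since p observations are lost at the start of the series.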

Simple linear regression model equation

Yi = β0 + β1Xi + εi
- Need gradient and y intercept to find the equation of a line
- Slope of line, β1, represents the expected change in Y per unit change in X, ie: the mean amount that Y changes (either positively or negatively) for a one-unit change in X
- Y intercept, β0, represents the mean value of Y when X equals 0
- Random error component is added to the equation to include all other factors you haven't considered in the model, ie: everything not included in the X variable
  o εi represents the random error in Y for each observation i that is not explained by β0 + β1Xi, aka it is the vertical distance Yi is above or below the line
  o Random/residual error: diff between observed value of Y and predicted value of Y, see on diagram below
  o The errors εi are assumed to have a mean of 0
- This equation is for reality, ie: the population
- Model is never going to be perfect in real life, as there are always going to be other factors affecting Y

Simple linear regression

o Only one independent variable, X, and one dependent variable, Y
o Relationship between X and Y described by a linear function
o Changes in Y assumed to be caused by changes in X, ie: causal relationship

Interpolation vs extrapolation in regression model

- When using a regression model for prediction, only predict within the relevant range of data, ie: do not try to extrapolate beyond range of observed X's

One tail tests

- Where the alternative hypothesis focuses on a particular direction
- Hypothesis where the entire rejection region is contained in one tail of the sampling distribution
- Can be upper or lower tail

Test for when pop standard dev is known

- Z test of hypothesis for the mean
- When pop standard dev is known, use Z test if pop is normally distributed or sample size is large enough for the Central Limit Theorem
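A minimal sketch of the Z test for the mean, using only the Python standard library. The function name `z_test_mean` and all the numbers are hypothetical; the formula is the usual Z = (sample mean - hypothesised mean) / (σ / √n).

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, mu0, sigma, n, tail="two"):
    """Return (Z statistic, p-value) for H0: mu = mu0 when sigma is known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal: mean 0, standard dev 1
    if tail == "upper":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "lower":
        p_value = std_normal.cdf(z)
    else:  # two-tail test
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value


# Made-up example: sample of 25 with mean 372.5, H0: mu = 368, sigma = 15
z, p = z_test_mean(sample_mean=372.5, mu0=368, sigma=15, n=25)
print(round(z, 2), round(p, 4))
```

With these made-up inputs the p-value is larger than 0.05, so the null would not be rejected at the 5% level of significance.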

What is a hypothesis?

Statement/assumption about a population parameter

What is time series forecasting?

- About relationship between time and the variable of interest, eg: how inflation is changing over time
- Using past and present data for one variable to predict future values

Confidence Interval Estimation vs hypothesis testing

- Both major components of statistical inference
- Based on same concepts
- Used for different purposes
  o Confidence intervals used to estimate parameters
  o Hypothesis testing used for making decisions about specified values of pop parameters
    § Used when trying to prove that a parameter is less than, more than, or not equal to a specified value
    § NOTE: proper interpretation of a confidence interval can indicate whether a parameter is less than, more than or not equal to a specified value

Evaluating independence for regression analysis

- Can evaluate assumption of independence of errors by plotting residuals in order or sequence in which observed data were collected.

Evaluating normality for regression analysis

- Can evaluate assumption of normality in errors by tallying the residuals into a frequency distribution and displaying the results in a histogram.
- Can also evaluate by comparing actual vs theoretical values of residuals, or by constructing a normal probability plot of the residuals

Pitfalls in time series forecasting

- Critics of time-series forecasting argue that these techniques are overly naïve and mechanical; that is, a mathematical model based on the past should not be used to extrapolate trends mechanically into the future without considering personal judgments, business experiences or changing technologies, habits and needs.
  o Thus, in recent years econometricians have developed highly sophisticated computerised models of economic activity incorporating such factors for forecasting purposes.
- Time-series models provide useful guides for projecting future trends (on long and short term bases)
- Although they do not explicitly state the causes of time-series change, many of the past changes reflect economic causes.
- If used properly and in conjunction with other forecasting methods, as well as with business judgment and experience, time-series methods will continue to be excellent tools for forecasting.
- They maintain the advantages of fewer technical assumptions and do not require knowledge of a future cause (independent variable) in order to predict change in the dependent variable.

Dependent variable in regression analysis

- Dependent variable (Y): variable we wish to predict or explain (response variable).
  o Assuming X is affecting Y
  o Called dependent because it depends on the independent variable
  o That's why X is on the right hand side of the equation, and Y on the left hand side
  o Y is something we want to predict and explain using X values

What is the Level of significance?

- Designated by α
- Usually 0.01, 0.05 or 0.10, and related to confidence level, eg: 99% confidence has 1% level of significance
- Defines the unlikely values of the sample statistic if null hypothesis is true
- Defines critical value(s) and rejection region(s) of the test
  o Can have one or two rejection regions
- Selected by researcher

Trend component of time series forecasting

- Long-run increase or decrease over time (overall upward or downward movement)
- Trend can be upward or downward, and linear or non-linear

Is the model significant for multiple regression model?

- F Test for Overall Significance of the Model
- R squared gives a general understanding of whether changes in Y are explained by changes in X1 and X2, but need more. Want to be sure about quality of model, ie: that it is a good representation of the true model we are trying to find
  o So, we test the overall model, and then if it is a significant/good model, do some other tests
- Shows if there is a linear relationship between all of the X variables considered together and Y
- Hypotheses:
  o For null, saying every slope coefficient is equal to 0, because this means there is no relationship between X1 and Y or between X2 and Y, ie: model is really bad/not significant
  o For alternative, if at least one of the slope coefficients doesn't equal 0, then overall model is significant
    § We want the alternative to be true
  o Because there is more than one independent variable, use following null and alternative hypotheses:
    H0: β1 = β2 = ... = βk = 0
    H1: at least one βj does not equal 0

Choosing a forecasting model

- The following guidelines are provided for determining the adequacy of a particular forecasting model.
- These guidelines are based on a judgment of how well the model fits the past data of a given time series, and assume that future movements in the time series can be projected by a study of the past data:
  o Perform a residual analysis to make sure you don't see any obvious patterns (see below)
  o Measure magnitude of residual error through squared differences
  o Look for pattern or direction
  o Use simplest model, ie: principle of parsimony

Why is time forecasting important?

- Govs forecast unemployment, interest rates and expected revenues from income taxes for policy purposes
- Marketing executives forecast demand, sales and consumer preferences for strategic planning
- Uni administrators forecast enrolments to plan for facilities and for faculty recruitment
- Retail stores forecast demand to control inventory levels, hire employees and provide training
- Can predict future values from past and present data
- Has many applications
- Forecasting is needed to monitor the changes that occur over time.

The Null hypothesis

- H0
- Statement about the value of one or more pop parameters which we test and aim to disprove
- It states the belief or assumption in the current situation (status quo)
- Begin with assumption that the null is true
- Can contain =, ≤ or ≥ signs
- May or may not be rejected
  o Is rejected when there is sufficient evidence from sample info that the null is false
  o NOTE: can never prove the null is correct; failure to reject it isn't proof it is true. This is because the decision is based on sample info, not pop info. So, there is only insufficient evidence to warrant its rejection
  o Expect sample statistic to be close to pop parameter if null is true; if there is a large difference between the statistic and the hypothesised value of the pop parameter, conclude the null is false
- Always about a pop parameter, not a sample statistic
- Is written in terms of the population

Alternative hypothesis

- H1
- Opposite of the null
- Statement we aim to prove about one or more pop parameters
- Can contain <, > or ≠ sign
- May or may not be proven
  o If proven, reject the null hypothesis
- Generally, the claim or hypothesis the researcher is trying to prove
- Represents the conclusion reached by rejecting the null hypothesis

Hypothesis testing for two sample tests

- Here, looking at two samples, so there are two means
- There are independent and related samples
  o Independent: sample selected from one pop has no effect on the sample selected from the other pop

Model selection

- In addition to visually inspecting scatter plots and comparing adjusted R² values, you can calculate and examine first, second and percentage differences.
- Use linear trend model if the first differences are approximately constant
  o Ie: diff between Y2 and Y1 approx equal to diff between Y3 and Y2 and so on
- Use quadratic trend model if second differences are approx. constant
- Use exponential trend model if the percentage differences are approx. constant
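The three kinds of differences are quick to compute. The sketch below uses made-up series, each built so that exactly one kind of difference is constant; the function names are illustrative only.

```python
def first_differences(y):
    """Y2 - Y1, Y3 - Y2, ... (roughly constant for a linear trend)."""
    return [b - a for a, b in zip(y, y[1:])]


def second_differences(y):
    """Differences of the first differences (roughly constant for a quadratic trend)."""
    return first_differences(first_differences(y))


def percentage_differences(y):
    """(Y2 - Y1) / Y1 * 100, ... (roughly constant for an exponential trend)."""
    return [(b - a) / a * 100 for a, b in zip(y, y[1:])]


linear = [10, 13, 16, 19, 22]          # made-up: grows by 3 each period
quadratic = [1, 4, 9, 16, 25]          # made-up: squares of 1..5
exponential = [100, 110, 121, 133.1]   # made-up: grows by 10% each period

print(first_differences(linear))
print(second_differences(quadratic))
print([round(d, 2) for d in percentage_differences(exponential)])
```

With real data the differences will never be exactly constant, so the comparison is a judgment about which set is closest to constant.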

What is hypothesis testing?

- In hypothesis testing, you consider the sample statistic to see whether the evidence better supports the null hypothesis or the mutually exclusive alternative
- Hypothesis testing: method of statistical inference used to test claims about the value of a pop parameter, eg: the pop mean
- Judging how plausible the null is by considering the info from the sample
- Calculating the probability of getting a given sample result if null hypothesis is true
- Use normal distribution or t distribution to help decide whether to reject the null

What is multiple regression model?

- In the real world, rarely use simple regression, as there are lots of variables that affect the Y variable
- Examines the linear relationship between one dependent (Y) & two or more independent variables (Xk)
- Uses two or more independent variables to predict the value of a dependent variable
- Can see if model is acceptable or not, and which variables are significant or not
- Assuming Y variable (dependent variable) is affected by all X variables, and changes in X variables will cause changes in Y
- Similar to simple regression model, just adding more X variables to have a better model
- Random error: includes all other variables you haven't been able to include in the model because you didn't have the data, they didn't come to mind as an important factor, etc.
- In multiple regression model with 2 independent variables, slope, β1, represents the change in mean of Y per unit change in X1, taking into account the effect of X2
- But, we take a sample, do the analysis based on that sample, and make predictions based on that
  o You use sample regression coefficients (b0, b1 and b2) as estimates of pop parameters (β0, β1 and β2)
  o So, the coefficients of the multiple regression model are estimated using sample data
  o We assume the average of errors is equal to 0

Independent variable in regression analysis

- Independent variable (X): variable used to explain the dependent variable (explanatory variable), variable used to make the prediction

Hypothesis tests for proportions

- Involves categorical variables with 2 possible outcomes:
  o Success (possesses a certain characteristic)
  o Failure (does not possess that characteristic)
- Fraction or proportion of the pop in the success category is denoted by π
- Sample proportion in the success category is denoted by p
- When both nπ and n(1 - π) are at least 5, the sampling distribution of p can be approximated by a normal distribution with mean π and standard dev √(π(1 - π)/n)
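A minimal sketch of a two-tail Z test for a proportion under the normal approximation, using only the standard library. The function name `z_test_proportion` and the survey numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_proportion(successes, n, pi0):
    """Return (Z statistic, two-tail p-value) for H0: pi = pi0."""
    # The normal approximation needs n*pi and n*(1 - pi) to be at least 5
    if n * pi0 < 5 or n * (1 - pi0) < 5:
        raise ValueError("sample too small for the normal approximation")
    p = successes / n                       # sample proportion
    std_error = sqrt(pi0 * (1 - pi0) / n)   # standard dev of p under H0
    z = (p - pi0) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up example: 60 successes in a sample of 100, H0: pi = 0.5
z_prop, p_prop = z_test_proportion(successes=60, n=100, pi0=0.5)
print(round(z_prop, 2), round(p_prop, 4))
```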

What is ANOVA?

- It is hypothesis testing for finding out whether there are differences between pop means when there are three or more pops
- Aka analysis of variance
- Investigator controls one or more factors of interest
- Each factor contains two or more levels
- Levels can be numerical or categorical
- Different levels produce different groups
- Think of each group as a sample from a different population
- The F distribution is right skewed, and the ANOVA F test is always an upper tail test

Smoothing the annual time series - moving averages

- Moving averages: a series of arithmetic means over time
- Calculate moving averages to get overall impression of pattern of movement over time
- Moving averages can be used for smoothing a time series.
  o Smooth fluctuations in data so that it shows the overall trend
- Averages of consecutive time series values for a chosen period of length L
- Result dependent upon choice of L (length of period for computing means)
- Examples: for a 5-year moving average, L = 5; for a 7-year moving average, L = 7; etc.
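A minimal sketch of the calculation. It assumes an odd L and centres each average on the middle year of its window, which is a common textbook convention but is not stated in these notes; the function name and the sales figures are made up.

```python
def moving_average(values, L):
    """Centred moving averages of odd length L.

    Positions without a full window of L values (the first and last
    L // 2 positions) are returned as None.
    """
    if L < 1 or L % 2 == 0:
        raise ValueError("L must be a positive odd number")
    half = L // 2
    result = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half : i + half + 1]
        result[i] = sum(window) / L
    return result


sales = [4, 6, 5, 8, 9, 5, 4, 3, 7]  # made-up annual values
smoothed = moving_average(sales, 3)
print(smoothed)
```

A larger L gives a smoother series but loses more values at each end, which is one reason the result depends so heavily on the choice of L.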

Least-squares trend fitting

- Moving average is too simple and can be misleading, so use least-squares trend fitting
- Estimate a trend line using regression analysis and, based on the trend line, predict future values
- Code the first year as 0, second year as 1, etc.
- Independent variable (X) is time
- Dependent variable (Y) is whatever is being measured
- Put values into Excel and get the regression line, as with regression analysis

Moving averages

- Moving averages (MA) for a chosen period of length L consist of a series of means calculated over time, such that each mean is calculated for a sequence of L observed values
- Highly subjective and dependent on L, the length of the period selected for constructing the averages.
- Eg: 5 year moving average

Non-linear trend forecasting

- Non-linear regression model can be used when time series exhibits a non-linear trend
- Quadratic form is one type of non-linear model
- Exponential trend model is another type
- Can compare adj. R² to that of the linear model to see if there is an improvement.
- Can try other functional forms to get best fit

What are the hypotheses for ANOVA?

- Null hypothesis is all means are equal, ie: no variation in means between groups
- It assumes all the means are equal, and all the distributions are normal
- Alternative hypothesis: at least one pop mean is different
- H1: not all of the pop means are the same
  o Ie: some may be the same and others different, or all may be different

What is regression analysis/simple linear regression?

- Regression analysis: method for predicting values of a numerical variable based upon values of one or more other variables
- Simple linear regression: where a single numerical independent variable, X, is used to predict a numerical dependent variable, Y
- Nature of relationship between 2 variables can take many forms, ranging from simple to extremely complicated mathematical functions. Simplest relationship is a linear relationship.
- Only use simple linear regression for linear relationships

What is the decision rule for ANOVA testing?

- Reject H0 (null hypothesis) if F stat > F crit, otherwise do not reject H0

Coefficient of determination (r squared) for multiple regression model?

- Reports the proportion of total variation in Y explained by all X variables taken together
  o Don't worry about the formula, it is always in Excel output, as seen below
- If equal to 98%, eg, then model is very good and predicts nearly all changes in Y, ie: changes in Y variable are explained very well by changes in X variables
- If r squared is low, eg: the one from the question is 52%, means 52% of changes in Y are explained by changes in X variables.
  o We don't know about the other 48%, as those variables are not included in the model, eg: income of people, operating hours of the shop, customer service, etc.
  o So, this model is not a good fit
- Usually expect r squared to be around 65-70%, the higher the better

Residual Analysis

- Residual for observation i, ei, is diff between its observed and predicted value
- Used to evaluate assumptions and determine whether regression model selected is appropriate model.
- Graphically, residual appears on scatter diagram as vertical distance between observed value of Y and prediction line.
- Graphical analysis of residuals: can plot residuals vs X

Residual analysis for time series forecasting

- Residuals are diff between the observed and the predicted values
- When you don't see any obvious patterns, it suggests none of the assumptions are violated and the model is good
- If you see patterns, it means something is missing in your model, as shown above, ie: cyclical effects not accounted for

What is the rejection region?

- Sampling distribution of test statistic is divided into 2 regions: rejection region (aka critical region) and non-rejection region
- Rejection region: range of values of test statistic where null is rejected
  o If test statistic is in this region, reject the null
  o Consists of values of test statistic that are unlikely to occur if null is true, ie: more likely to occur if null is false
- Non-rejection region: range of values of test statistic where null hypothesis cannot be rejected.
  o If test statistic is in this region, don't reject null
- To make decision about null, first determine critical value of test statistic.
  o Critical value divides the non-rejection region from the rejection region

Pitfalls of regression analysis

- Some pitfalls involved in using regression analysis are:
  o Lacking awareness of assumptions of least-squares regression.
  o Not knowing how to evaluate assumptions of least-squares regression.
  o Not knowing what alternatives to least-squares regression are if particular assumption is violated.
  o Using regression model without knowledge of subject matter.
  o Extrapolating outside relevant range.
  o Concluding that sig relationship identified in observational study is due to a cause-and-effect relationship

Related populations: paired difference test

- Tests means of 2 related pops, eg: paired or matched samples and repeated measures (before and after)
- Related pops: performance of one affects the other
- Assumptions: both pops are normally distributed, or samples are large enough to say they are

What are independent samples? What are the two tests?

- There are two types of t tests/assumptions for when you don't know the pop standard devs
  o 1) Assume pop standard devs are equal: use pooled variance t test
  o 2) Assume pop standard devs are not equal: use separate variances t test
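A minimal sketch of the pooled variance t statistic for two independent samples, using only the standard library. The function name and the data are made up; the critical value still has to come from a t table (or Excel) with n1 + n2 - 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def pooled_variance_t(sample1, sample2):
    """Return (t statistic, degrees of freedom) for H0: mu1 = mu2,
    assuming the two pop variances are equal."""
    n1, n2 = len(sample1), len(sample2)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    t = (mean(sample1) - mean(sample2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


group1 = [48, 52, 50, 49, 51]  # made-up data
group2 = [45, 47, 46, 48, 44]  # made-up data
t_pooled, df_pooled = pooled_variance_t(group1, group2)
print(round(t_pooled, 3), df_pooled)
```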

Measures of variation in regression analysis

- This is seeing whether the model used is a good fit: does it explain the relationship between X and Y very well? How close is it to reality? How much of the variation in Y is explained by variation in X?
- Total variation is made up of 2 parts, ie: regression sum of squares and error sum of squares
- The smaller the gap between dot and line, the better, as it means the model is explaining the relationship between X and Y very well
- So, use coefficient of determination to see how good the model is
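The split of total variation into its two parts can be checked numerically. The sketch below uses made-up data; b0 = 2.2 and b1 = 0.6 are the least-squares estimates for that data, and the names SST, SSR and SSE are the usual abbreviations for the total, regression and error sums of squares.

```python
def variation_breakdown(x, y, b0, b1):
    """Return (SST, SSR, SSE, r squared) for the prediction line b0 + b1*x."""
    y_bar = sum(y) / len(y)
    predicted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                    # total variation
    ssr = sum((pi - y_bar) ** 2 for pi in predicted)            # explained by the regression
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))   # unexplained (error)
    return sst, ssr, sse, ssr / sst


x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]  # made-up data
sst, ssr, sse, r2 = variation_breakdown(x, y, b0=2.2, b1=0.6)
print(round(sst, 2), round(ssr, 2), round(sse, 2), round(r2, 2))
```

SST equals SSR plus SSE only when b0 and b1 are the least-squares estimates; with any other line the two parts no longer add up to the total.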

Time series data and plot

- Time series data: numerical data obtained at regular time intervals
  o Time intervals can be annual, quarterly, daily, hourly, etc
- As long as it is based on regular time intervals and can be collected over a period of time, we call it time series data
- Time series plot:
  o Making plot of variable of interest over time
  o 2D plot of time series data
  o Vertical axis measures the variable of interest
  o Horizontal axis corresponds to the time periods, eg: 1979-2004

Evaluating linearity for regression analysis

- To evaluate linearity, plot the residuals on the vertical axis against the corresponding Xi values of the independent variable on the horizontal axis.
- If linear model is appropriate for the data, there is no apparent pattern in this plot.
- If linear model is not appropriate, there will be a relationship between the Xi values and the residuals ei.

Partition of total variation for ANOVA testing

- To perform ANOVA test, subdivide total variation in values into two parts: that which is due to variation between groups and that which is due to variation within groups.
- Between group variation: difference in means from one group to another, ie: sample mean for group 1 is diff to sample mean of groups 2 and 3
- Within-group variation: inside each group, there is variation, ie: diff between each value and the mean of its own group, eg: diff petrol stations each offering different prices
- Total variation is equal to between group variation plus within group variation
- Ie: SST = SSB + SSW
  o Aka total variation (SST, total sum of squares) = between group variation (SSB, sum of squares between groups) + within group variation (SSW, sum of squares within groups)
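The partition and the resulting F statistic can be sketched with the standard library alone. The function name and the three groups are made up; the F statistic is the between-group mean square divided by the within-group mean square, and F crit still has to come from an F table or Excel.

```python
def one_way_anova(groups):
    """Return (SST, SSB, SSW, F statistic) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    n, c = len(all_values), len(groups)   # total observations, number of groups
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group variation: group means vs the grand mean
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group variation: each value vs the mean of its own group
    ssw = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    sst = sum((v - grand_mean) ** 2 for v in all_values)
    f_stat = (ssb / (c - 1)) / (ssw / (n - c))
    return sst, ssb, ssw, f_stat


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up data
sst_a, ssb_a, ssw_a, f_stat = one_way_anova(groups)
print(round(sst_a, 2), round(ssb_a, 2), round(ssw_a, 2), round(f_stat, 2))
```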

F test for overall significance for multiple regression model

- To test the overall model, use F test
- F distribution is right skewed
- Need the critical value of F to define the rejection region, and then the test stat.
- If test stat falls into rejection region, you reject null and have proven alternative, and vice versa

What are the ANOVA assumptions?

- To use the one-way ANOVA F test, must make certain assumptions about the data
- The 3 assumptions are:
  o 1) Randomness and independence
  o 2) Normality
  o 3) Homogeneity of variance
- When only normality assumption is violated, Kruskal-Wallis rank test, a non-parametric procedure, is appropriate
- When only homogeneity of variance assumption is violated, procedures similar to those used in the separate variance t test are available
- When both normality and homogeneity of variance assumptions have been violated, need to use an appropriate data transformation that will normalise the data and reduce the differences in variances, or use a more general non-parametric procedure

How can you tell what type of test (ie: one tail, lower tail, etc) it is?

- Two tail test: if alternative hypothesis has ≠ sign
- Upper tail test: if alternative hypothesis has > sign
- Lower tail test: if alternative hypothesis has < sign
- Based on α, we define the shaded areas (aka rejection regions).
  o Then, take a sample and if sample statistic falls into shaded area, reject the null as too unlikely. If it falls into white area, cannot reject null

What are the assumptions of regression?

- Use acronym LINE
  o Linearity: underlying relationship between X and Y is linear. Always have to test if there is a linear relationship
  o Independence of errors: error values are statistically independent, ie: no relationship between errors. If they are related, it suggests there is an important variable that you missed/haven't included in the model, so have to find that variable.
    § This is particularly important when data are collected over a period of time. There, the errors for a specific time period are often correlated with those of the previous time period
  o Normality of error: error values (ε) are normally distributed for any given value of X; if not normally distributed, suggests it is a bad model
    § Regression analysis is fairly robust against departures from the normality assumption.
    § As long as the distribution of errors at each level of X is not extremely diff from a normal distribution, inferences about β0 and β1 are not seriously affected.
  o Equal variance (aka homoscedasticity): probability distribution of the errors has constant variance
    § Requires that variance of errors (εi) is constant for all values of X.
    § Variability of Y values will be the same when X is a low value as when X is a high value.
    § This assumption is important when making inferences about β0 and β1.
    § If there are serious departures from this assumption, can use either data transformations or weighted least-squares methods

What is adjusted r squared?

- Use it when you compare different models with different numbers of variables
- Is always less than r squared
- It shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used.
- Interpretations are still the same, ie: the higher the better
- For model below, adjusted r square is 44% (see statement about what it means below on Excel output, need to write the underlined bit)
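The adjustment itself is a one-line formula: adjusted r² = 1 - (1 - r²)(n - 1)/(n - k - 1), where n is the sample size and k the number of X variables. In the sketch below, n = 15 and k = 2 are assumed values chosen for illustration, not taken from the notes' Excel output.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjust r squared for sample size n and number of X variables k."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical inputs: r squared of 0.52 with n = 15 observations and k = 2 X variables
adj = adjusted_r_squared(0.52, n=15, k=2)
print(round(adj, 2))
```

Adding an X variable raises k, so adjusted r squared only goes up if the new variable improves r squared by enough to offset the penalty.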

Autoregressive modelling

- Used for forecasting
- It takes advantage of autocorrelation
- 1st order autocorrelation refers to association between consecutive values (AR1)
- 2nd order autocorrelation refers to association between values 2 periods apart (AR2)
- pth order autoregressive model:

Tukey-Kramer Procedure

- Used in ANOVA after rejecting the null to find out which pop means are different
- Tells which population means are significantly different
  o It allows paired comparisons
  o It compares absolute sample mean differences with a critical range (that is provided on the next slide)
  o Eg: μ1 = μ2 but μ1 ≠ μ3 and μ2 ≠ μ3

Hypothesis testing pitfalls and ethical issues

- When planning to carry out a test of hypothesis based on a survey, research study or designed experiment, must ask several questions to ensure proper methodology is used, eg:
  o 1) What is goal of survey, study or experiment? How can you translate the goal into a null hypothesis and an alternative hypothesis?
  o 2) Is hypothesis test a two-tail test or a one-tail test?
  o 3) Can you select a random sample from the underlying population of interest?
  o 4) What kinds of data will you collect from the sample? Are the variables numerical or categorical?
  o 5) At what level of significance, or risk of committing a Type I error, should you conduct the hypothesis test?
  o 6) Is the intended sample size large enough to achieve the desired power of the test for the level of significance chosen?
  o 7) What statistical test procedure should you use and why?
  o 8) What conclusions and interpretations can you make from the results of the hypothesis test?
- To do this, consult a person with sig statistical training early in the process.
- To avoid biases, adequate controls must be built in from the beginning
- Ethical considerations arise when the hypothesis-testing process is manipulated.
- To eliminate the possibility of potential biases in the results, you must use proper data-collection methods, ie: random sample from a pop or from experiment with randomisation process used
- Ethical considerations require that any individual who is to be subjected to some 'treatment' in an experiment must be made aware of the research endeavour and any potential behavioural or physical side effects.
- If prior info is available that leads you to test the null hypothesis against a specifically directed alternative, then a one-tail test is more powerful than a two-tail test.
- If interested only in differences from null hypothesis, not in the direction of the difference, use two-tail test
- In well designed study, select level of significance before data collection occurs.
- Cannot alter level of significance after the fact to achieve a specific result
- Data snooping is never permissible.
  o Unethical to perform hypothesis test on set of data, look at results, then decide on level of significance or decide between a one-tail and two-tail test.
  o Data cleansing is not data snooping.
- Cannot arbitrarily change or discard extreme or unusual values in order to alter the results of the hypothesis tests.
- Should document good and bad results

Classical multiplicative time series model components

- Whenever you make a time-series model, it has to consider 4 diff components of the data (see below)
- Trend component
- Seasonal component: ie: see increase in sales in summer every year and decrease in winter every year
  o When you have monthly or quarterly data, an additional component, ie: the seasonal component, is considered along with the trend, cyclical and irregular components.
- Cyclical component: like seasonal, but over more than one year, eg: financial crisis every 7 years. Cycles vary in length, usually between 2 and 10 years.
  o They differ in intensity or amplitude and are often correlated with the business cycle. In some years, values are higher than would be predicted by trend line (i.e. they are at or near the peak of a cycle); in other years, values are lower than would be predicted by trend line (i.e. they are at or near the bottom or trough of a cycle).
- Irregular component: random events, can't predict them
  o Any data that do not follow the trend curve modified by the cyclical component are considered part of the irregular or random component.

Use of Excel and other software issues with regression analysis

- Widespread availability of spreadsheet and statistical software has removed the computational hurdle that prevented many users from applying regression analysis.
  o However, for many users this enhanced availability of software has not been accompanied by an understanding of how to use regression analysis properly.
  o A user who is not familiar with the assumptions of regression or how to evaluate them cannot be expected to know what the alternatives to least-squares regression are if a particular assumption is violated.

Simple Linear Regression Equation (Prediction Line)

- Ŷi = b0 + b1Xi
- Simple linear regression equation provides an estimate of the population regression line
  o Never know exact relationship between X and Y for above, so take a sample and calculate estimated version of line, which is shown below
- Equation doesn't have an error term because it is the estimated version: we only have X and Y, and assume for this model that the average of all residuals will be 0, so don't need to include it
- NOTE: Find the values for this on Excel

Least squares method for regression analysis

- b0 and b1 are obtained by finding the values of b0 and b1 that minimise the sum of the squared differences between actual values (Yi) and predicted values (Ŷi)
- NOTE: this is calculated on Excel, so don't need to do it yourself
- So, finding the diff between observed values and predicted values
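For anyone curious what Excel is doing, the least-squares estimates have a closed form: b1 = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)² and b0 = Ȳ - b1X̄. The sketch below applies it to made-up data; the function name is illustrative only.

```python
def least_squares(x, y):
    """Return (b0, b1) minimising the sum of squared differences
    between the actual values Y and the predicted values b0 + b1*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ssxy / ssx           # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]  # made-up data
b0, b1 = least_squares(xs, ys)
print(round(b0, 2), round(b1, 2))
```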

Type 2 error

- Failing to reject a false null hypothesis o The probability of a Type 2 error is β o β depends on the difference between the hypothesised and actual values of the population parameter. o To reduce the probability of making a Type 2 error, increase the sample size o If the difference between the hypothesised and actual values of the population parameter is large, β is small, because large differences are easier to find than small ones. o If the difference between the hypothesised and actual values of the parameter is small, the probability that you commit a Type 2 error is large
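The two claims above (β shrinks as the sample size grows and as the true difference grows) can be checked numerically. This is a minimal sketch that assumes an upper-tail Z test with a known population standard deviation. The function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def type2_error(mu0, mu1, sigma, n, alpha=0.05):
    """Beta for an upper-tail Z test of H0: mu <= mu0 when the true mean is mu1."""
    se = sigma / sqrt(n)  # standard error of the sample mean
    # Smallest sample mean that leads to rejecting H0
    critical_mean = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability the sample mean falls below that cut-off when mu = mu1
    return NormalDist(mu1, se).cdf(critical_mean)


beta_small_n = type2_error(mu0=100, mu1=102, sigma=10, n=25)
beta_large_n = type2_error(mu0=100, mu1=102, sigma=10, n=100)
beta_big_gap = type2_error(mu0=100, mu1=110, sigma=10, n=25)
power_small_n = 1 - beta_small_n  # power is the complement of beta
```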

Which test is used for hypothesis testing when the population standard deviation is unknown?

- t test of hypothesis for the mean (population standard deviation unknown) - We don't know the population standard deviation, so we use the sample standard deviation (s) instead - If we assume the population is normally distributed, the sampling distribution of the mean follows a t distribution with n - 1 degrees of freedom o If the population is not normally distributed, we can still use the t test if the sample size is large enough for the Central Limit Theorem to apply
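A short sketch of the test statistic, t = (sample mean - hypothesised mean) / (s / √n), with n - 1 degrees of freedom. The function name and the sample values are invented for illustration. The critical value or p value would still come from a t table or from software.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(sample, mu0):
    """Return (t, degrees of freedom) for a one-sample t test of the mean."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (divides by n - 1)
    t = (mean(sample) - mu0) / (s / sqrt(n))
    return t, n - 1


# Hypothetical sample and hypothesised mean
t_stat, df = t_statistic([10, 12, 14, 16, 18], mu0=12)
```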

Testing the overall model with the p value approach for multiple regression modelling

1) The null hypothesis is that β1 and β2 are both zero; the alternative hypothesis is that at least one β is not zero, i.e. the model is significant 2) Level of significance: 0.05 3) Method: F test 4) Find the p value: it is given in the Significance F column of the Excel output, e.g. 0.01201 5) Decision rule: if the p value is less than alpha, reject the null 6) Conclusion: the p value (0.01201) is less than alpha (0.05), so reject the null, which means the overall model is significant

Testing the individual variables with p value approach

1) The null hypothesis is that βi is zero and the alternative is that βi is not zero, where i is 1 or 2 (or however many betas there are) 2) Level of significance: 0.05 3) Method: t test, as we are testing individual variables 4) P value: found in the Excel output; for X1 it is 0.03979 and for X2 it is 0.01449 5) Decision rule: if the p value is less than alpha, reject the null 6) Conclusion: for X1, 0.03979 is less than 0.05, so reject the null, which means X1 is significant. For X2, 0.01449 is less than 0.05, so reject the null, which means X2 is significant
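The decision rule for the overall F test and the individual t tests is the same comparison of a p value with alpha. A tiny sketch using the p values quoted in these cards. The function and variable names are made up.

```python
ALPHA = 0.05


def is_significant(p_value, alpha=ALPHA):
    """Reject the null hypothesis when the p value is below alpha."""
    return p_value < alpha


overall_p = 0.01201                            # Significance F from the Excel output
individual_p = {"X1": 0.03979, "X2": 0.01449}  # p values of the t tests

overall_significant = is_significant(overall_p)
significant_vars = [name for name, p in individual_p.items() if is_significant(p)]
```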

Steps to solving the problems

o 1) State the null hypothesis, H0, and the alternative hypothesis, H1 o 2) Choose the level of significance, α, and the sample size, n o 3) Determine the appropriate method (i.e. Z test or t test, one-tail test or two-tail test) o 4) Determine the critical values that divide the rejection and non-rejection regions o 5) Collect data and compute the value of the test statistic (e.g. take the sample mean and convert it to a Z score or t score) o 6) Make the statistical decision § If the test statistic falls into the non-rejection region, do not reject the null hypothesis, H0. § If the test statistic falls into the rejection region, reject the null hypothesis, H0 o 7) Express the managerial conclusion in the context of the real-world problem
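Steps 4 to 6 can be sketched for a two-tail Z test of the mean, assuming the population standard deviation is known. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_critical(x_bar, mu0, sigma, n, alpha=0.05):
    """Two-tail Z test of H0: mu = mu0 using the critical value approach."""
    # Step 4: critical value bounding the rejection regions in both tails
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 5: convert the sample mean to a Z score
    z_stat = (x_bar - mu0) / (sigma / sqrt(n))
    # Step 6: reject H0 if the statistic falls in a rejection region
    reject_h0 = abs(z_stat) > z_crit
    return z_stat, z_crit, reject_h0


z_stat, z_crit, reject_h0 = z_test_critical(x_bar=372.5, mu0=368, sigma=15, n=25)
```

Here the Z score is 1.5, which lies inside the non-rejection region, so H0 is not rejected.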

Homogeneity of variance ANOVA assumption

o All populations sampled from have the same variance, i.e. the population variances of the c groups are equal o If you have equal sample sizes in each group, inferences based on the F distribution are not seriously affected by unequal variances. o However, if you have unequal sample sizes, then unequal variances can have a serious effect on inferences developed from the ANOVA procedure. o Thus, when possible, you should have equal sample sizes in all groups. o The Levene test for homogeneity of variance is one method to test whether the variances of the c populations are equal.

P value approach to one sample hypothesis testing

o Another way of testing the null hypothesis, which is more straightforward and quicker o P value: the probability of obtaining a test statistic equal to or more extreme than the observed sample value, given that H0 is true o Decision rule: § If p < α, reject H0 § If p is greater than or equal to α, do not reject H0 o Approach: § 1) State the null and alternative hypotheses § 2) Choose the level of significance § 3) Determine the method: the appropriate test statistic and sampling distribution § 4) Calculate the value of the test statistic and the p value § 5) Make the statistical decision and state the conclusion
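A minimal sketch of the p value approach for a two-tail Z test, assuming a known population standard deviation. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(x_bar, mu0, sigma, n):
    """Return (Z statistic, two-tail p value) for H0: mu = mu0."""
    z_stat = (x_bar - mu0) / (sigma / sqrt(n))
    # Probability of a Z at least this far from 0 in either tail, given H0 is true
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value


alpha = 0.05
z_stat, p_value = z_test_p_value(x_bar=372.5, mu0=368, sigma=15, n=25)
reject_h0 = p_value < alpha  # decision rule: reject H0 only if p < alpha
```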

What are two issues with correlation coefficient (why don't we use it)?

o The correlation coefficient relates just one variable to one other variable, whereas regression analysis can use one or more independent variables, e.g. 5, 6 or 10 factors affecting another factor o Regression assumes there is a causal relationship, whereas the correlation coefficient only measures a relationship, which may or may not be causal

Power of a test

o Complement of the probability of a Type 2 error (1 - β) o Probability that you reject the null hypothesis when it is in fact false and should be rejected

Confidence Coefficient

o Complement of the probability of a Type 1 error (1 - α) o Probability that you will not reject the null hypothesis when it is true and should not be rejected

How does Tukey-Kramer Procedure work?

o Used once ANOVA has shown that not all of the population means are equal o We do not know the population means, so find sample mean 1, sample mean 2 and sample mean 3 o Then calculate the absolute differences between the sample means § E.g. to test whether M1 and M2 differ, find the absolute value of sample mean 1 - sample mean 2. If the difference is greater than the critical range, conclude that M1 is different from M2 § E.g. to test whether M1 and M3 differ, find the absolute value of sample mean 1 - sample mean 3. If that absolute value is greater than the critical range, conclude that M1 is different from M3 § Same thing for M2 and M3
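The pairwise comparisons can be sketched as below, assuming the usual critical range formula Q × √((MSW / 2) × (1/nj + 1/nj')), where MSW is the mean square within groups from the ANOVA. The function name and the data are invented. The Q value of 3.77 is taken from a studentized range table for 3 groups and 12 degrees of freedom at the 0.05 level, so check it against your own table.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def tukey_kramer(groups, q_alpha):
    """Compare every pair of group means against the critical range."""
    n_total = sum(len(values) for values in groups.values())
    c = len(groups)
    # Mean square within groups: pooled within-group variation
    ssw = sum(sum((v - mean(values)) ** 2 for v in values)
              for values in groups.values())
    msw = ssw / (n_total - c)
    results = {}
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        diff = abs(mean(a) - mean(b))
        critical_range = q_alpha * sqrt((msw / 2) * (1 / len(a) + 1 / len(b)))
        # Means differ significantly when the gap exceeds the critical range
        results[(name_a, name_b)] = (diff, critical_range, diff > critical_range)
    return results


# Hypothetical data for three groups of five observations each
groups = {
    "M1": [10, 12, 11, 13, 14],
    "M2": [15, 17, 16, 18, 19],
    "M3": [11, 13, 12, 14, 15],
}
results = tukey_kramer(groups, q_alpha=3.77)  # Q assumed from a table
```

With these numbers M1 and M2 differ, M2 and M3 differ, and M1 and M3 do not.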

Cyclical component of time series forecasting

o Long-term wave-like patterns o Usually occur every 2-10 years o Often measured peak to peak or trough to trough

What is regression analysis used to do?

o Predict the value of a dependent variable (Y) based on the value of at least one independent variable (X) o Explain the impact of changes in an independent variable on the dependent variable o Find the relationship between X and Y and see how strongly X affects Y, i.e. if X changes, what is the effect on Y o Can then predict future values of Y

Type 1 Error

o Rejecting a null hypothesis when it is, in fact, true and should not be rejected o Considered a serious type of error o The probability of a Type 1 error (α) is set by the researcher in advance, so the risk of committing a Type 1 error is directly under your control o Also known as the level of significance o The choice of a particular risk level for making a Type 1 error depends on the cost of making such an error

Randomness and independence ANOVA assumption

o Select random samples from the c groups (or randomly assign the levels) o Critically important o Assumptions necessary to avoid bias o The validity of any experiment depends on random sampling and/or the randomisation process. o To avoid biases in outcomes, need either to select random samples from c populations or randomly assign items or individuals to c levels of the factor. o Selecting a random sample, or randomly assigning the levels, will ensure that a value from one group is independent of any other value in the experiment. o Departures from this assumption can seriously affect inferences from analysis of variance.

Seasonal component of time series forecasting

o Short-term regular wave-like patterns o Observed within 1 year o Often monthly or quarterly

What are strategies to use to help avoid pitfalls of regression?

o Start with a scatter plot to observe the possible relationship between X and Y. o Check the assumptions of regression before moving on to use the results of the model. o Plot the residuals against the independent variable to determine whether the linear model is appropriate and to check the equal variance assumption. o Use a histogram, stem-and-leaf display, box-and-whisker plot or normal probability plot of the residuals to check the normality assumption. o If you collected the data over time, plot the residuals against time or use the Durbin-Watson test to check the independence assumption. o If there are violations of the assumptions, use alternative methods to least-squares regression or alternative least-squares models. o If there are no violations of the assumptions, carry out tests for the significance of the regression coefficients and develop confidence and prediction intervals. o Avoid making predictions and forecasts outside the relevant range of the independent variable. o Always note that the relationships identified in observational studies may or may not be due to a cause-and-effect relationship. Remember that, while causation implies correlation, correlation does not imply causation.

Normality ANOVA assumption

o The sample values in each group are from a normally distributed population o The one-way ANOVA F test is fairly robust against departures from the normal distribution o As long as the distributions are not extremely different from a normal distribution, the level of significance of the ANOVA F test is usually not greatly affected, particularly for large samples. o You can assess the normality of each of the c samples by constructing a normal probability plot or a box-and-whisker plot.

Irregular component of time series forecasting

o Unpredictable, random, 'residual' fluctuations o Due to random variations of nature, or accidents or unusual events o Usually short duration and non-repeating o Called 'Noise' in the time series
