Topic 4: Hypothesis testing


Analysis of variance (F-test for the ratio of variances)

Sections 1 and 2 outline different techniques for estimating and testing the value of a single population mean and the difference between two population means. However, sometimes there is a need to estimate and compare several population means simultaneously. The analysis of variance is a general method that can be used to analyse data from designed experiments involving multiple independent variables. The objective in the analysis of variance is to isolate and assess the sources of variation associated with the independent variables and to determine how they interact and affect the total variation of the dependent variable. Analysis of variance is discussed in more detail in subsequent topics.

One-tailed test

A one-tailed test is used when the researcher is examining the possibility of only one outcome. To detect whether µ1 > µ2 (or µ1 − µ2 > 0), the rejection region is located in the upper tail of the distribution. To detect whether µ1 < µ2 (or µ1 − µ2 < 0), the rejection region is located in the lower tail. The probability of a Type I error (α) in a one-tailed test is equal to the area under the normal curve located in the rejection region (i.e. the shaded area of the distribution in Figure 1). Therefore, if α = 0.05, the critical value of a one-tailed test would be 1.65 standard deviations from the mean of the sample (either above or below).

Many fund managers claim that their fund has consistently achieved higher returns than the Australian All Ordinaries Index. A hypothesis test could be set up to test whether a particular fund has, on average, significantly outperformed the All Ordinaries Index; that is, whether the mean return of Fund A (µFund) is greater than the mean return of the All Ordinaries Index (µAllOrds). The one-tailed test in this case tests the null hypothesis that µFund = µAllOrds (or µFund − µAllOrds = 0) against the research (alternative) hypothesis that µFund > µAllOrds (or µFund − µAllOrds > 0). If a normal distribution and a significance level of α = 0.05 are assumed, the mean return of the fund would have to be more than 1.65 standard deviations above the mean return of the All Ordinaries Index to reject the null hypothesis (µFund − µAllOrds = 0).

Note: When the critical value is negative (i.e. when the research hypothesis is a 'less than' test), it is best to consider absolute values of both the critical value and the test statistic. When the test is a one-tailed test to the left of the mean, the rejection zone lies below the critical value but is greater in absolute size (distance from the mean) than the critical value. Thus, the decision rule for a hypothesis test is to reject the null hypothesis when the absolute value of the test statistic is greater than the absolute value of the critical value.
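The fund-versus-index example can be sketched in Python. All figures here are hypothetical (the sample size, mean excess return and standard deviation are invented purely for illustration), and the known-σ Z-test is assumed:

```python
from statistics import NormalDist

# Hypothetical figures for illustration only: monthly excess returns of
# Fund A over the All Ordinaries Index, sampled over n months.
n = 36
mean_excess = 0.40   # sample mean of (fund - index) returns, in %
sigma = 1.20         # assumed known standard deviation of the excess, in %

# H0: muFund - muAllOrds = 0, H1: muFund - muAllOrds > 0 (one-tailed)
se = sigma / n ** 0.5        # standard error of the sample mean
z = (mean_excess - 0) / se   # test statistic

alpha = 0.05
critical = NormalDist().inv_cdf(1 - alpha)   # about 1.65 for one tail

print(f"z = {z:.2f}, critical value = {critical:.2f}")
print("reject H0" if z > critical else "fail to reject H0")
```

With these made-up numbers z = 2.00 exceeds the critical value of about 1.65, so the null hypothesis of no outperformance would be rejected at the 5% level.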

Chi-squared test for variance

Although the estimation of the mean is the most discussed statistic in this topic, the estimation of variance (and its implications) is often a more important statistic. A higher variance implies a larger estimation error. The variance provides information on the spread of the population being examined. The chi-squared test for variance is examined in more detail in subsequent topics.

Degrees of freedom

As discussed above, there is more than one t-distribution, depending on the size of the sample used. The t-distribution varies according to the number of degrees of freedom (a single parameter, in contrast to the two parameters, mean and variance, that define the normal distribution). Typically, if a sample of size n is being examined, the t-distribution is said to have (n − 1) degrees of freedom. The letter ν (pronounced 'nu') is used to indicate how many degrees of freedom a t-distribution has. So, in general, for a t-distribution based on a sample of size n, the number of degrees of freedom is ν = n − 1.

The loss of one degree of freedom is due to the estimation of the mean from the same sample data. This is the case because the sum of the n deviations from the sample mean is always zero: if (n − 1) deviations are freely observed from the sample, the remaining deviation can be determined with certainty.

The concept of degrees of freedom will arise in a number of areas in this subject. Essentially, it refers to the number of variables that are allowed to vary freely. For example, if there are 10 observations, there will generally be 10 degrees of freedom. However, if the 10 observations are used to determine the sample mean, and a test is performed that involves an estimation of the variability around the sample mean, one degree of freedom is lost, leaving (10 − 1 =) 9 degrees of freedom.
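The point that the n deviations from the sample mean always sum to zero can be checked directly; the small sample below is arbitrary:

```python
from statistics import mean

sample = [4.0, 7.0, 5.5, 6.5, 9.0]              # arbitrary sample, n = 5
deviations = [x - mean(sample) for x in sample]

# The n deviations from the sample mean sum to zero, so only n - 1 of
# them are free to vary: hence nu = n - 1 degrees of freedom.
print(round(abs(sum(deviations)), 10))   # 0.0
nu = len(sample) - 1
print(nu)                                # 4
```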

Avoiding Type I and Type II errors

As noted above, the probabilities of making Type I and Type II errors depend on the sample size (n) and the size of the rejection region. Therefore, the probability of making a Type I or Type II error can be reduced by changing either of these factors:
• Sample size: For a given significance level, the larger the sample size, the lower the probability of a Type II error; that is, the larger the sample on which the decision is based, the smaller the chance of an error, and the power of the test is also increased.
• Size of the rejection region: Since α is the probability of the test statistic falling within the rejection region when it should in fact fall in the acceptance region, any increase in the size of the rejection region increases α; that is, it increases the probability of the test statistic wrongly falling within the rejection region. While an increase in the size of the rejection region increases the probability of a Type I error (α), it reduces the chance of a Type II error (β) and increases the power (1 − β) of the test. Similarly, reducing the size of the rejection region decreases the probability of a Type I error (α), while the probability of a Type II error (β) increases and the power of the test decreases.

Steps in a hypothesis test

Hypothesis testing is the statistical assessment of a statement or idea regarding a population. For instance, a statement could be: 'The mean return of the Australian sharemarket is greater than zero.' Given the relevant returns data, hypothesis testing procedures can be employed to test the validity of this statement at a given significance level.

A hypothesis is a statement about the value of a population parameter developed for the purpose of testing a theory or belief. Hypotheses are stated in terms of the population parameter to be tested, such as the population mean. For example, a researcher may be interested in the mean daily return on stock options; the hypothesis may be that the mean daily return on a portfolio of stocks is positive. Hypothesis testing procedures, based on sample statistics and probability theory, are used to determine whether a hypothesis is a reasonable statement that should not be rejected, or an unreasonable statement that should be rejected.

The process of hypothesis testing is a series of steps:
• state the research hypothesis (or alternative hypothesis) (H1)
• state the null hypothesis (H0) (the contradiction of the research hypothesis)
• select the appropriate test statistic
• state the decision rule regarding the hypothesis
• collect the sample and calculate the sample statistics
• decide whether to reject the hypothesis based on whether the result falls within the rejection region
• make a decision based on the results of the test

Topic learning outcomes

On completing this topic, students should be able to:
• use hypothesis testing to evaluate population means, proportions and variances
• apply the principles of hypothesis testing in assessing the performance of investments
• recognise and apply ethical standards in hypothesis testing

Rejection region

The entire set of values the test statistic may assume is broken into two parts: the acceptance region and the rejection region (refer to Figure 1). If the test statistic falls within the acceptance region, the null hypothesis is 'accepted'; if it falls within the rejection region, the null hypothesis is rejected. The term 'acceptance' is somewhat misleading, as the null hypothesis is not really accepted. It is more appropriate to say 'fail to reject the null hypothesis' than 'accept the null hypothesis'. That is, if the test statistic lies in the 'acceptance' region, there is not enough evidence to reject the null hypothesis. The null hypothesis may still be false, but the statistics are not strong enough to demonstrate this convincingly. As stated previously, the researcher is keen to reject a null hypothesis, as this provides statistically important support for the research hypothesis. Failing to reject the null hypothesis provides no significant information to the researcher.

Levels of significance

The level of significance, denoted by the Greek letter alpha (α), is the probability of rejecting the null hypothesis when it is actually true. It is also referred to as the level of risk because it states the risk of incorrectly rejecting the null hypothesis. There is no universal level of significance applied to sampling. Two of the most common levels are the 0.05 level (often called 'the 5% level') and the 0.01 level ('the 1% level'). It can, however, be any level between 0 and 1 (usually smaller than 10%).

For any given α-level of a statistical test, the researcher can calculate the smallest value of the test statistic that will lead to rejection of the null hypothesis. This value is called the critical value of the test statistic and is obtained from the distribution of the test statistic. For example, for a two-sided test, if the test statistic is assumed to follow a standard normal distribution, the critical value can be read from the standard normal tables. If α = 0.05, the critical value of Z is 1.96, because when Z = 1.96 the probability in the rejection region is 2 × (0.5 − 0.475) = 0.05. Similarly, if α = 0.02, the critical value is close to 2.33, since the tail probability when Z = 2.33 is 2 × (0.5 − 0.49) = 0.02.
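These critical values can be recovered without tables using Python's standard library; this is a sketch in which `statistics.NormalDist()` stands in for the standard normal distribution:

```python
from statistics import NormalDist

z = NormalDist()   # standard normal distribution

# Two-tailed critical values: alpha is split evenly between the tails.
for alpha in (0.05, 0.02):
    critical = z.inv_cdf(1 - alpha / 2)   # 1.96 then 2.33 (rounded)
    print(f"alpha = {alpha}: critical Z = {critical:.2f}")
```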

p-value

The rejection zone in a hypothesis test is set by the level of significance. The smaller the level of significance, the lower the risk of a false rejection. Researchers may be tempted to increase this risk (α) so as to ensure a rejection of the null hypothesis. The p-value for a hypothesis test is the lowest level of significance at which the null hypothesis can be rejected. This may be less than 5% (the most common value for α) or above. Often, researchers will simply refer to a test's p-value when determining statistical significance: if the test's p-value is less than 5%, the null hypothesis is rejected; if it is greater than 5%, the null hypothesis is not rejected.

The p-value of a test statistic, like any probability, can be found by referring to the standard normal distribution table. For example, if the calculated test statistic for a normally distributed variable (formula given in the following sub-sections) is 1.37 (i.e. the parameter is 1.37 standard deviations away from the test level), then the probability of an observation being above this level is 1 − 0.9147 = 8.53% (see the figure below). If the hypothesis test is one-sided, the p-value relating to a test statistic of 1.37 would be 8.53%. If the test is two-sided, the p-value would be 2 × 8.53% = 17.06%.
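The 8.53% figure can be reproduced from the standard normal CDF rather than a table; a minimal sketch using the standard library:

```python
from statistics import NormalDist

z = NormalDist()
test_stat = 1.37

one_sided_p = 1 - z.cdf(test_stat)   # area in the upper tail beyond 1.37
two_sided_p = 2 * one_sided_p        # both tails for a two-sided test

print(f"one-sided p-value: {one_sided_p:.4f}")   # about 0.0853
print(f"two-sided p-value: {two_sided_p:.4f}")   # about 0.1707
```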

Research hypothesis (or alternative hypothesis)

The research hypothesis is the hypothesis that the researcher is trying to test and is what is concluded if there is sufficient evidence to reject the null hypothesis. The rejection of the null hypothesis gives support to the research hypothesis. For example, the research hypothesis (H1 below) is that a significant difference in returns exists between small and large capitalisation stocks. This difference can be one-sided or two-sided (see the section below on one- and two-tailed tests). A one-sided (or one-tailed) test would test either whether the average return of large capitalisation stocks is greater than the average return of small capitalisation stocks, or whether the average return of small capitalisation stocks is greater than that of large capitalisation stocks.

For the first case, the research hypothesis would be:
H1: µ1 > µ2 or H1: µ1 − µ2 > 0
For the second case, the research hypothesis would be:
H1: µ2 > µ1 or H1: µ1 − µ2 < 0
A two-sided test would test that µ1 − µ2 ≠ 0.

In a two-sided test, the key issue is whether there is a significant difference from the null hypothesis. In a one-sided test, the direction of the variation is also being tested. Whether a test is one-sided or two-sided affects the regions of rejection, given the level of significance (a given level of risk of being wrong). For a one-sided test at a 5% level of significance using the normal distribution, the rejection region is located at only one end of the domain of underlying values. For a two-sided test at a 5% level of significance (discussed below), the rejection region is evenly split between the two tails of the domain of underlying values.

Test statistic

The test statistic is used to determine whether the null hypothesis can be rejected. It is calculated from information contained within the sample. The set of possible values of the test statistic forms a distribution. Remember that for large random samples (n > 30), the distribution of the sample mean is approximately normal (by the central limit theorem (CLT)).

Test statistics based on the normal distribution

The test statistic is used to test whether a hypothesis is rejected or not. The test statistic is always calculated from data contained within the sample. The full set of possible values of the test statistic forms a sampling distribution. Assuming that µ and σ are the population mean and population standard deviation respectively, if the random sample size (n) is of sufficient size (n > 30), then the means of random samples will be normally distributed with mean µ and standard deviation σ/√n.
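A quick simulation illustrates this result. The population parameters below (µ = 100, σ = 15, n = 36) are arbitrary choices for the sketch:

```python
import random
from statistics import mean, stdev

random.seed(1)   # fixed seed so the sketch is repeatable
mu, sigma, n = 100.0, 15.0, 36

# Draw many samples of size n and record each sample mean.
sample_means = [mean(random.gauss(mu, sigma) for _ in range(n))
                for _ in range(5000)]

# The sample means cluster around mu with standard deviation close to
# sigma / sqrt(n) = 15 / 6 = 2.5, as the theory predicts.
print(round(mean(sample_means), 2))    # close to 100
print(round(stdev(sample_means), 2))   # close to 2.5
```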

Errors in hypothesis testing

There are two errors that can occur during a hypothesis test: Type I and Type II errors. The level of significance for a hypothesis test is measured by the probability of making a Type I error, and the 'power of a hypothesis test' is related to the probability of making a Type II error. (Note: In order to determine the probability of a Type I or Type II error, the null hypothesis and the alternative hypothesis must be specified at a fixed value rather than a range of values.)

Type I error

A Type I error occurs when the null hypothesis is rejected when it is true. For example, the null hypothesis that there is no significant difference in returns between small and large capitalisation stocks may be incorrectly rejected (i.e. µ1 = µ2 is rejected when µ1 = µ2 is true). The significance level is the probability of making a Type I error and is designated by the Greek letter α. For instance, a significance level of 5% (α = 0.05) means there is a 5% chance of rejecting a true null hypothesis. When conducting hypothesis testing, a significance level must be specified in order to identify the critical values needed to evaluate the test statistic. The decision for a hypothesis test is to either reject the null hypothesis or fail to reject it. The decision rule for rejecting or failing to reject the null hypothesis is based on the distribution of the test statistic. For example, if the test statistic follows a normal distribution, the decision rule is based on critical values determined from the standard normal distribution (z-distribution).

Type II error and the power of the test

A Type II error occurs when the null hypothesis is not rejected when it is false and the research (or alternative) hypothesis is true. For example, a null hypothesis states that there is no significant difference in returns between small and large capitalisation stocks. This is not rejected when, in fact, the returns are significantly different (i.e. µ1 = µ2 is not rejected when µ1 = µ2 is false). The probability of making a Type II error is represented by β. (Note: 1 − β is the probability of not making a Type II error.)

While the significance of a test is the probability (α) of rejecting the null hypothesis when it is true, the power of a test is the probability of correctly rejecting the null hypothesis when it is false. The power of a test is one minus the probability of making a Type II error, or 1 − β (i.e. the probability of not making the Type II error). When more than one test statistic is available, the power of the test for the competing test statistics may be useful in deciding which one to use. Ordinarily, the most powerful test among all possible tests is used.

Sample size and the choice of significance level (Type I error probability) together determine the probability of a Type II error. The relationship is not simple, and calculating the probability of a Type II error in practice is quite difficult. Decreasing the significance level (probability of a Type I error) from 5% to 1%, for example, will increase the probability of failing to reject a false null hypothesis (Type II error) and therefore reduce the power of the test. Conversely, for a given sample size, the power of a test can be increased only at the cost of increasing the probability of rejecting a true null hypothesis (Type I error). For a given significance level, decreasing the probability of a Type II error and increasing the power of the test can only be achieved by increasing the sample size.
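With σ known and a specific alternative value of µ fixed, β and the power can be computed directly. The numbers below are hypothetical, chosen purely to show the mechanics:

```python
from statistics import NormalDist

# Hypothetical one-tailed test of H0: mu = 0 against H1: mu > 0, with
# sigma assumed known; the sample size and alternative are invented.
sigma, n = 1.0, 25
alpha = 0.05
mu_alt = 0.5   # the specific alternative value of mu

se = sigma / n ** 0.5
critical_xbar = NormalDist().inv_cdf(1 - alpha) * se   # reject H0 above this

# Type II error: fail to reject H0 when mu really equals mu_alt.
beta = NormalDist(mu_alt, se).cdf(critical_xbar)
power = 1 - beta
print(f"beta = {beta:.3f}, power = {power:.3f}")
```

Raising n (with α fixed) shrinks the standard error, lowers β and raises the power, matching the discussion above.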

Hypothesis testing ethics

Type of test — two-tail or one-tail

If prior information is available that leads to testing the null hypothesis against a specifically directed alternative, then a one-tail test is more powerful than a two-tail test. However, if the interest is only in differences from the null hypothesis, not in the direction of the difference, the two-tail test is the appropriate procedure to use. For example, if previous research and statistical testing have already established the difference in a particular direction, or if an established scientific theory states that it is possible for results to occur in one direction only, then a one-tail test is appropriate. It is never appropriate to change the direction of a test after the data is collected.

Choice of level of significance (α)

In a well-designed study, the level of significance (α) is selected before data collection occurs. You cannot alter the level of significance, after the fact, to achieve a specific result. It is also good practice always to report the p-value, not just the conclusions of the hypothesis test.

Data snooping

Data snooping is never permissible. It is unethical to perform a hypothesis test on a set of data, look at the results and then decide on the level of significance or decide between a one-tail and two-tail test. You must make these decisions before the data is collected if the conclusions are to have meaning. In situations in which you consult a statistician late in the process, with data already available, it is imperative that you establish the null and alternative hypotheses and choose the level of significance prior to carrying out the hypothesis test. In addition, you cannot arbitrarily change or discard extreme or unusual values in order to alter the results of the hypothesis tests.

Cleansing and discarding of data

Data cleansing is not data snooping.
In the data preparation stage of editing, coding and transcribing, you have an opportunity to review the data for any value whose measurement appears extreme or unusual. After reviewing the unusual observations, you should construct a stem-and-leaf display and/or a boxplot in preparation for further data presentation and confirmatory analysis. This exploratory data analysis stage gives you another opportunity to cleanse the data set by flagging possible outliers to double-check against the original data. In addition, the exploratory data analysis enables you to examine the data graphically with respect to the assumptions underlying a particular hypothesis test procedure. The process of data cleansing raises a major ethical question. Should you ever remove a value from a study? The answer is a qualified 'yes'. If you can determine that a measurement is incomplete or grossly in error because of some equipment problem or unusual behavioural occurrence unrelated to the study, you can discard the value. Sometimes you have no choice — an individual may decide to quit a study they have been participating in before a final measurement can be made. In a well designed experiment or study, you should decide, in advance, on all the rules regarding the possible discarding of data.

Hypothesis test for the population mean (µ) when population standard deviation (σ) is known

Without the benefit of a computer, a statistical table of the standard normal distribution is needed for hypothesis testing. To test a hypothesis about µ and (µ1 − µ2), the test statistic used is the Z-score, where:

Z = (X̄ − µ) / σ

The Z-score is the deviation of the sample mean (X̄) from the population mean (µ) of a normally distributed variable, expressed in units of σ. A Z-score of 2 implies that X̄ is 2 standard deviations from µ. Therefore, a two-tailed test with α = 5% would reject the null hypothesis (H0) when Z > 1.96 or Z < −1.96.
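A worked sketch of this test with a hypothetical sample of daily returns and an assumed known σ; the deviation of X̄ from µ is scaled by the standard error σ/√n of the sample mean:

```python
from statistics import NormalDist, mean

# Hypothetical sample of 32 daily returns (%); sigma is assumed known.
sample = [0.8, -0.2, 1.1, 0.4, 0.9, -0.5, 1.3, 0.6, 0.2, 0.7,
          1.0, -0.1, 0.5, 0.8, 0.3, 1.2, -0.3, 0.6, 0.9, 0.4,
          0.1, 0.7, 1.1, -0.4, 0.5, 0.8, 0.2, 0.6, 1.0, 0.3,
          0.9, 0.4]
mu_0 = 0.0    # hypothesised population mean under H0
sigma = 0.6   # assumed known population standard deviation

n = len(sample)
se = sigma / n ** 0.5                # standard error of the sample mean
z_stat = (mean(sample) - mu_0) / se  # Z-score of the sample mean

alpha = 0.05
critical = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for a two-tailed test

print(f"Z = {z_stat:.2f}")
print("reject H0" if abs(z_stat) > critical else "fail to reject H0")
```

Since |Z| comfortably exceeds 1.96 for this made-up sample, H0: µ = 0 would be rejected at the 5% level.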

Two-tailed test

A two-tailed test is used when the researcher is examining the possibility of two outcomes. The rejection region in a two-tailed test is therefore located in both tails of the sampling distribution (e.g. µ1 − µ2 ≠ 0). In practice, most hypothesis tests are constructed as two-tailed tests. For a two-tailed test, the probability of a Type I error (α) is equally divided between the two tails of the sampling distribution (see Figure 2). The general decision rule for a two-tailed test is:

Reject H0 if: test statistic > upper critical value, or test statistic < lower critical value.

This can also be written as:

Do not reject H0 if: lower critical value < test statistic < upper critical value.

If α = 0.05, the critical values of a two-tailed test would be 1.96 standard deviations from the mean of the sample (above and below).

