EPM102: Statistics with Computing

the multiplicative law

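The multiplicative law applies to independent events: the probability that both events occur is the product of their individual probabilities, P(A and B) = P(A) × P(B).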
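
As a minimal sketch of the multiplicative law, the snippet below multiplies the probabilities of two independent events. The two-dice scenario and the variable names are illustrative assumptions, not part of the course material.

```python
from fractions import Fraction

# Illustrative assumption: two fair six-sided dice thrown independently.
p_six_first = Fraction(1, 6)
p_six_second = Fraction(1, 6)

# Multiplicative law for independent events: P(A and B) = P(A) * P(B)
p_both_six = p_six_first * p_six_second

print(p_both_six)  # 1/36
```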

unpaired data

Unpaired data occur when individual observations in one sample are independent of individual observations in the other. For example, to study the effect of a new drug in patients who had suffered a coronary heart event, patients were randomised into two groups: those on the new drug (A) and those on conventional treatment (B). Each patient belongs to only one group, so the two samples are unpaired.

sample size, degree of freedom and t-distribution

The larger the sample size, and hence the larger the number of degrees of freedom, the closer a t-distribution is to the Normal distribution.

Qualitative Variables The simplest type of categorical variable is one which can take only two categories. Such a variable is known as binary (or dichotomous).

Binary variables are a common feature in epidemiological studies. Sex: male / female; Smoking status: smoker / non-smoker; Vaccination status: vaccinated / non-vaccinated; HIV status: HIV positive / HIV negative.

paired data

Paired samples occur when the individual observations in the first sample are matched to individual observations in the second sample. For quantitative data this usually occurs when there are repeated measurements on the same person. For example, to assess the effect of physical exercise on a group of students, students were asked to run on the spot for 2 minutes. The pulse rate was measured for each student before and after running, so each student had two measurements of pulse rate. In the analysis of paired data we calculate the difference between the first and second measurements. This results in one sample of differences.
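
The reduction of paired data to one sample of differences can be sketched as follows. The pulse rates are hypothetical values chosen for illustration; they are not from the course.

```python
from statistics import mean

# Hypothetical pulse rates (beats per minute) for five students.
before = [68, 72, 75, 80, 66]
after = [92, 95, 101, 110, 90]

# One difference per student: the paired data collapse to a single sample.
differences = [a - b for b, a in zip(before, after)]
mean_difference = mean(differences)

print(differences)      # [24, 23, 26, 30, 24]
print(mean_difference)  # 25.4
```

Any later analysis (confidence interval, t-test) is then carried out on `differences` alone.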

Qualitative data in summary

Qualitative data arise when values of a variable can be categorised, so they are also known as categorical data. There are 3 types of qualitative data: BINARY data (dichotomous): a variable can take one of two values. UNORDERED categorical data (nominal): a variable has several distinct categories. ORDERED categorical data (ordinal): a variable has several distinct categories that may be ordered.

Appropriate charts If the data are categorical,

we use a bar chart or a pie chart.

For most variables that are approximately normally distributed:

~ 68% of the data lie within 1 standard deviation of the mean; ~ 95% of the data lie within 2 standard deviations of the mean; ~ 99.7% of the data lie within 3 standard deviations of the mean.
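
These proportions can be checked against the standard Normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `proportion_within` is an assumption made for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a Normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(proportion_within(k), 4))
```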

The purposes of applying statistical methods are

•Obtaining a suitable sample •Summarising data numerically and graphically •Making inferences about a population based on information from a sample •Obtaining a simple model for a set of data

Data not normally distributed

•The SD and mean cannot describe the distribution appropriately. •Always be aware when the SD is LARGER than the mean: for data that cannot be negative, this often means that the distribution is skewed! •However, most distributions can be brought closer to the Normal distribution by using a transformation.

Properties of the Mean, Median & Mode

•The mean is sensitive to outliers; the others are not. •The mode may be affected by small changes in the data; the others are not. •The mode and median may be found graphically. •All three measures of location are equal for a symmetric distribution; in a skewed distribution they differ.
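
A small sketch of the first property: adding a single extreme value moves the mean a long way but barely moves the median. The data values are made up for illustration.

```python
from statistics import mean, median

# Hypothetical observations, then the same data with one outlier added.
values = [4, 5, 5, 6, 7]
with_outlier = values + [60]

print(mean(values), median(values))              # 5.4 5
print(mean(with_outlier), median(with_outlier))  # 14.5 5.5
```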

t-test statistic

For testing H0: μ = μ0 against H1: μ ≠ μ0, the formula has the same form as the z-statistic, but the P-value has to be read from tables of the t-distribution: t test statistic = (x̄ − μ0) / SE(x̄)
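
A minimal sketch of the calculation, assuming a small hypothetical sample of heights and estimating the standard error as s / √n. The data and variable names are illustrative; the P-value itself would still be read from t tables with n − 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of 10 heights in cm (not from the course data).
heights = [168, 172, 165, 170, 175, 169, 171, 166, 173, 171]
mu_0 = 171.4  # hypothesised population mean under H0

sample_mean = mean(heights)
standard_error = stdev(heights) / sqrt(len(heights))  # s / sqrt(n)

t_statistic = (sample_mean - mu_0) / standard_error
degrees_of_freedom = len(heights) - 1

print(round(t_statistic, 2), degrees_of_freedom)
```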

Chi Square test

For testing the association between two categorical variables. The expected frequencies are what we would expect if the null hypothesis were true. To test the null hypothesis we compare the expected frequencies (E) with the observed frequencies (O). The formula for calculating a χ² statistic is: χ² = Σ (O − E)² / E, summed over all cells of the table.
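
The comparison of observed and expected frequencies can be sketched for a 2 × 2 table using the standard statistic, the sum over cells of (O − E)² / E, with each expected frequency taken as row total × column total / grand total. The counts below are hypothetical.

```python
# Hypothetical 2 x 2 table of observed counts (rows: exposure, columns: outcome).
observed = [[20, 30],
            [30, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected frequency if the null hypothesis (no association) were true.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - expected) ** 2 / expected

print(chi_square)  # 4.0
```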

Thus, if H0 is true, the probability of observing a sample mean at least 1.8 cm lower or higher than 171.4 cm is only about 1% (P = 0.012).

For this reason we would be unlikely to obtain a sample mean of 169.6 cm if the true mean height were 171.4 cm. We conclude that there is evidence to suggest that the height of EPP students does differ from that of Danish students.

The notation for the mean and standard deviation in a population or a sample.

Greek letters for population parameters (mean μ, standard deviation σ); Roman letters for sample estimates (mean x̄, standard deviation s).

Large P-value

If the P-value is large we say that the chance of observing a value as extreme as the sampled value would be high if the Null Hypothesis were true. In this situation sampling variation alone can be the reason for the difference between the sample mean x̄ and the hypothesised mean, μ0. Conclusion: There is little evidence to suggest that the population mean μ differs from μ0.

95% CI using the t-distribution

estimated mean ± multiplier × standard error of estimate, where the multiplier is the value of t corresponding to a two-sided p = 0.05, with degrees of freedom equal to the sample size minus 1.
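
A sketch of this interval for a hypothetical sample of 10 heights. The multiplier 2.262 is the tabulated t value for a two-sided p = 0.05 with 9 degrees of freedom; the data and names are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of 10 heights in cm (not from the course data).
heights = [168, 172, 165, 170, 175, 169, 171, 166, 173, 171]

n = len(heights)
estimate = mean(heights)
standard_error = stdev(heights) / sqrt(n)

# From t tables: two-sided p = 0.05 with n - 1 = 9 degrees of freedom.
t_multiplier = 2.262

lower = estimate - t_multiplier * standard_error
upper = estimate + t_multiplier * standard_error

print(round(lower, 1), round(upper, 1))
```

With a different sample size the multiplier must be looked up again, since it depends on the degrees of freedom.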

There are 3 main ways to summarise the variability of a set of data.

•give the range of all values •quote percentiles within the data •calculate a single numerical measure of the dispersion around the mean: the standard deviation

percentiles

A percentile (or centile) is the value below which a given percentage of the data has occurred. For example, the 5th percentile is the value in the data corresponding to the point at which 5% of the data have occurred.

Qualitative variables and categorical data

A qualitative variable characterises a certain quality of a subject. Qualitative data are also known as categorical data (since qualities can be grouped into categories). • Immune status • Car ownership • Place of residence • Disease status

Qualitative Variables However, some qualitative variables do have a natural ordering to their categories. These are called ordered categorical variables and their data are called ordinal data. Although they do not have a true measured value, a numerical value (or score) may be assigned to them.

Example of ordered categorical data: Consider the level of cigarette smoking within a population. We can categorise the individuals in the group according to the degree to which they smoke, and assign a score to each category, in increasing order of exposure: non-smokers; ex-smokers; 1 - 5 cigarettes per day; 6 - 20 per day; 21 - 40 per day; >40 per day. So although we are not using an actual numerical measurement of cigarette smoking in this case, we can still use a number to represent the level of exposure.

degrees of freedom

Estimates of parameters can be based upon different amounts of information. The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom. The shape of the t-distribution depends on its "degrees of freedom" (or d.f.), which reflect the size of the sample: the smaller the sample, the fewer the degrees of freedom. The degrees of freedom of a t-distribution are equal to the sample size minus 1.

Quantitative Variables The first type of numerical data, discrete data, is usually the result of a count, in which case the values will always be non-negative integers (0, 1, 2, ...).

Examples of Discrete Data • Number of children • Number of visits to doctor • Number of cups of coffee consumed per day

Quantitative Variables The second type of numerical data is continuous data. These are usually obtained by some form of measurement, where the values are not restricted to an integer. However the actual measurements are restricted by the precision of the measuring instrument.

Examples of continuous data • Age • Weight • Height • Blood pressure Note that although weight is a continuous variable, if your scales are calibrated at intervals of 0.5kg, then your measurements are restricted to discrete values, e.g. 52.5kg, 53.0kg, 53.5kg etc

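outcome variable and explanatory variable

The outcome variable is the variable of primary interest, also called the dependent or response variable. An explanatory variable is a factor that may influence the outcome. Such a variable partly explains the variability of the outcome. Explanatory variables are also called independent or predictor variables.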

Small P-value

If the P-value is small we say that the chance of observing a value as extreme as the sampled value would be low if the Null Hypothesis were true. So, in this situation, sampling variation alone is unlikely to be the reason for the difference between the sample mean x̄ and the hypothesised mean, μ0. Conclusion: There is sufficient evidence to suggest that the population mean μ differs from μ0.

Qualitative Variables Some qualitative variables can take more than two values. There may be several different categories which are distinct from each other.

If these categories may be listed in any order, then the variable is called unordered categorical. Such variables are also known as nominal variables. Examples of Unordered Categorical Variables • Ethnic Group • Blood Group • Nationality

Central Limit Theorem

It is the most important of the three properties of sampling distributions, and also the most surprising. It says that, when the sample size is large, the distribution of the sample estimates is approximately NORMAL, whatever the distribution of the underlying data.
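
The theorem can be illustrated by simulation. The sketch below repeatedly samples from a strongly skewed exponential distribution (mean 1, SD 1) and looks at the distribution of the sample means; the sample size, number of samples and seed are arbitrary choices for this example.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 50
NUMBER_OF_SAMPLES = 5000

# Draw many samples from a skewed distribution and keep each sample mean.
sample_means = []
for _ in range(NUMBER_OF_SAMPLES):
    sample = [random.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    sample_means.append(sum(sample) / SAMPLE_SIZE)

# The sample means cluster around the population mean (1.0), with spread
# close to sigma / sqrt(n), the standard error.
print(round(mean(sample_means), 3))
print(round(stdev(sample_means), 3), round(1 / sqrt(SAMPLE_SIZE), 3))
```

A histogram of `sample_means` would look close to a Normal curve even though the individual observations are far from Normal.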

Key points of bar charts

•Bar charts are used to display qualitative data. •One bar represents one category, and the height of the bar equals its frequency (or relative frequency). •Each bar is the same width and the bars are equally spaced. •Bars should have a space between them to stress that they represent categorical data. •The position of each category is arbitrary if the variable is unordered. •It is important that the vertical axis of a bar chart starts at zero, to avoid distortion of true differences between frequencies.

Quantitative Variables A quantitative variable represents a counted or measured quantity.

Quantitative data are also called numerical data, and may either be discrete or continuous. Examples of Quantitative Variables • Weight • Height • Number of children • Annual income

Large P-values and small P-values again

LARGE P-VALUE: Sampling variation is a likely explanation. There is little or no evidence against the null hypothesis. SMALL P-VALUE: Sampling variation is an unlikely explanation. There is strong evidence against the null hypothesis. "..the smaller the P-value, the stronger the evidence against the null hypothesis" (Kirkwood & Sterne P74)

mutually exclusive event

Mutually exclusive events are events that cannot happen at the same time. For example, when we throw a die, only one of the six faces will land on top. Events that are not mutually exclusive are not necessarily independent: two events are independent only if the occurrence of one does not change the probability of the other.

Quantitative Data in summary

Quantitative data can be a count or a measured value. They are also called numerical data. There are two types of quantitative data. DISCRETE data result from a count. Such data values can only be non-negative integers. CONTINUOUS data are measured values. The data values are real numbers.

binomial distribution

Remember that the binomial distribution describes the probability of a characteristic that can take one of only two values; in this case each baby is either a boy or not, i.e. a boy or a girl. The binomial distribution is a probability distribution for discrete data that can take only one of two values (binary data).
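
As a sketch, the binomial probability of k "successes" in n independent trials is C(n, k) × p^k × (1 − p)^(n − k). The example below assumes, purely for illustration, that each baby is a boy with probability 0.5 and asks for the chance of exactly 2 boys among 4 babies.

```python
from math import comb

def binomial_probability(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Illustrative assumption: P(boy) = 0.5, four independent births.
p_two_boys = binomial_probability(k=2, n=4, p=0.5)

print(p_two_boys)  # 0.375
```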

Two sided/ Two tailed p-value

The "2-sided" or "2-tailed" p-value is the probability of getting Z > 2.5 or Z < -2.5, that is, the area in both tails. This is calculated as follows: P = 2 × 0.006 = 0.012. This p-value tells us the probability of observing a difference equal to, or more extreme than, the one we have observed, if the Null Hypothesis were true. On a plot of the standard Normal distribution, this probability is the area in both tails, covering all values greater than 2.5 and less than -2.5.
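
The doubling of the one-sided tail area can be reproduced with `statistics.NormalDist` instead of printed tables. This is a sketch; the variable names are chosen for the example.

```python
from statistics import NormalDist

z = -2.5

# Area in one tail beyond |z| under the standard Normal distribution.
one_sided_p = NormalDist().cdf(-abs(z))

# The distribution is symmetric, so the two tails have equal area.
two_sided_p = 2 * one_sided_p

print(round(one_sided_p, 4), round(two_sided_p, 4))
```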

variables and data

Suppose we wanted to study a group of epidemiology students. We might ask about their: age, sex, place of residence, weight, height. Each of these characteristics varies from student to student. They are what we call variables, and the values we collect from the students are called data.

normal distribution

The Normal distribution can be used to represent the distribution of values that would be observed if we could examine everybody in a population. In this sense, it shows the distribution of values that we would observe "in the long run", in a large population (see SC03). That is why the y-axis in the plot of a Normal distribution is called "probability".

calculate z-value

The Null Hypothesis is that the mean height of all EPP students is the same as the mean height of all Danish students. H0: µ = 171.4 cm. We need to use: z = (x̄ − µ0) / SE(x̄), where x̄ is the sample mean of EPP students = 169.6 cm, and µ0 is the population mean for all Danish students under H0 = 171.4 cm. The standard error of the sample mean of the heights of EPP students was estimated as 0.72 cm. So, the actual difference between sample mean and hypothesised mean is 1.8 cm, while the standardised difference is: z = (169.6 − 171.4) / 0.72 = −1.8 / 0.72 = −2.5
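
The same arithmetic as a short sketch, using the figures quoted in this card (variable names are chosen for the example):

```python
sample_mean = 169.6        # mean height of EPP students (cm)
hypothesised_mean = 171.4  # mean height of Danish students under H0 (cm)
standard_error = 0.72      # estimated SE of the sample mean (cm)

# Standardised difference between the sample mean and the hypothesised mean.
z = (sample_mean - hypothesised_mean) / standard_error

print(round(z, 2))  # -2.5
```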

The P-value is ...

The P-value is the area under the curve corresponding to values outside the range (-z, z). That is, the area in the tails of the distribution gives the probability of observing the more extreme values.

z-value

The general form of the test statistic compares the observed sample mean with the mean value expected if the null hypothesis were true. It also takes account of the variability in the population and the sample size using the standard error. The value of the test statistic is called a Z-VALUE as the statistic follows a normal distribution, and is equal to: (observed mean − hypothesised mean) / (standard error of the estimated mean), that is, z = (x̄ − μ0) / SE(x̄)

mean

The mean is the conventional average. It is the sum of the observations divided by the number of observations.

Median for Frequency Distributions

The median for a frequency distribution is simply the value at which the cumulative relative frequency is 50%.
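
A sketch of this rule: walk through the values in order, accumulate the relative frequency, and stop at the first value where it reaches 50%. The frequency table (number of children per family) is hypothetical.

```python
# Hypothetical frequency distribution: value -> number of observations.
frequencies = {0: 5, 1: 10, 2: 20, 3: 10, 4: 5}

total = sum(frequencies.values())
cumulative = 0
median_value = None

for value in sorted(frequencies):
    cumulative += frequencies[value]
    # First value at which the cumulative relative frequency reaches 50%.
    if cumulative / total >= 0.5:
        median_value = value
        break

print(median_value)  # 2
```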

median

The median is the value that divides the distribution into two equal parts so that there are the same number of observations above and below it. When there is an even number of data values, there is no single middle value. In this case the median is defined as the mean of the central pair of values.

mode

The mode of a distribution is simply the value that occurs most frequently.

standard deviation

The most common way of quantifying the variability of a distribution is to calculate its standard deviation. This method uses all the observations, by accounting for all deviations from the mean. By deviations we mean the differences between each observation and the mean.
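
The calculation can be sketched step by step. The card does not state the divisor, so this example assumes the usual sample standard deviation with divisor n − 1, and checks the result against `statistics.stdev`. The data are made up.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

centre = mean(data)

# Deviation of each observation from the mean.
deviations = [x - centre for x in data]

# Assumption: sample SD, i.e. sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (len(data) - 1)
standard_deviation = sqrt(variance)

print(round(standard_deviation, 3), round(stdev(data), 3))
```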

t-distribution

The most important property of sampling distributions is that they are approximately Normal if the sample size is large enough (see session SC06). The distribution of sample means is therefore approximately Normal for large samples. For small samples, however, the Normal approximation is poor. In this case, the t-distribution is used in the calculation of confidence intervals and in significance tests. The Normal distribution should ONLY really be used when the variance or standard deviation in the population is known, which is rarely the case.

range

The simplest way to describe the spread of a set of observations is to quote the range, stating the lowest and highest values and hence the difference in between. The problem with this is that it reports the extreme values (which may be the most peculiar), while the actual distribution of all the values in between will not be summarised in any way.

Standard Error

The standard deviation of a sampling distribution takes a special name, standard error, often indicated by the letters SE. SE(x̄) = σ / √n •The standard deviation represents the variability in the individual data. •The standard error represents the variability in the sample estimates.
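
A short sketch of SE = σ / √n, showing how the standard error shrinks as the sample size grows while the standard deviation stays the same. The value of σ and the sample sizes are illustrative assumptions.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

sigma = 9.2  # illustrative standard deviation of individual heights (cm)

for n in (25, 100, 400):
    print(n, round(standard_error(sigma, n), 2))
```

Quadrupling the sample size halves the standard error.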

Key Points about Histograms

•The x-axis must be continuous, and there are no spaces between the bars. •The y-axis always begins at zero; this is important because relative comparisons are being made. •The area of each bar represents the frequency in each group. •The width of each bar is the size of the interval for each group.

Qualitative Variables Sometimes the categories of an unordered categorical variable may be suggestively placed in an order, despite there being no true order to the categories.

There is no natural order for the categories of marital status, but they are usually ordered like this: • Married • Single • Divorced • Separated • Widowed

P-value

Using two-tailed tables of the Standard Normal distribution (this may be reviewed in SC05) we can obtain a probability value for the observed z-value. This value tells us the probability of obtaining the observed or a more extreme sample mean if the Null Hypothesis were true. This is known as the P-value. On the distribution of sample means that would occur if the Null Hypothesis were true, the P-value corresponds to the area in the tails beyond the observed value.

Two ways to summarise a distribution

We can summarise this distribution in two ways. (1) Mean and standard deviation: mean height 169.3 cm, standard deviation 9.2 cm. (2) Median and spread: median height 168 cm, interquartile range 161 cm to 176 cm, range 149 cm to 194 cm. For a skewed distribution, the mean is distorted by extreme values, so the median is more appropriate.

How likely is the observed z-value if the null hypothesis is true?

We now need to find out what z = -2.5 corresponds to in terms of probability. We can refer to, for example, the one-sided table of the Standard Normal distribution in "Medical Statistics" by Kirkwood & Sterne (p470-471) to find the P-value for z = -2.5. The table does not include negative values of Z, but for Z = 2.5 we find that p = 0.00621. (Note that this represents the proportion of the area of the standard normal distribution that is above Z.) As the standard normal distribution is symmetrical about zero, the area below Z = -2.5 is equal to the area above Z = 2.5. So the value of p corresponding to Z = -2.5 is also 0.00621. P = 0.00621 (or 0.006) is called the "1-sided" p-value. However, 1-sided p-values are not usually used, because we need to include the probability that the difference might (by chance) have been in the opposite direction. We therefore use a "2-sided" p-value.

Appropriate charts If the data are quantitative,

a histogram or frequency polygon is more appropriate.

Use of the 2 sample t-test in a small sample is only valid if we can assume that:

•the data are symmetrically distributed and not too different from a normal distribution, •the population standard deviations are approximately equal, •the data are independent, that is, not paired or related in any way.

