Clinical research
Nominal categorical data:
Data points that represent words ('yes' or 'no') or concepts (like gender or heart disease) which have no mathematical value. There is no natural order to the values or words, hence 'nominal'. Examples: gender, or your profession.
Quartiles
Quartiles divide the group of values into four equal parts, in the same way that the median divides the dataset into two equal parts.
-The zeroth quartile is the minimum value in the dataset and the fourth quartile is the maximum value.
-The first quartile is the value that divides the set so that a quarter of the values are smaller than it and three-quarters are larger.
-The second quartile divides the dataset into two equal sets and is nothing other than the median.
-The third quartile is the value that divides the set so that three-quarters of the values are smaller than it and one quarter is larger.
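A minimal sketch of how quartiles could be computed in Python with NumPy; the data values are made up for illustration.

import numpy as np

values = np.array([2, 3, 5, 7, 8, 10, 12, 15, 18, 21, 24])  # hypothetical data

# Zeroth to fourth quartiles (minimum, Q1, median, Q3, maximum)
q0, q1, q2, q3, q4 = np.quantile(values, [0, 0.25, 0.5, 0.75, 1.0])
print(q0, q1, q2, q3, q4)  # q0 = minimum, q2 = median, q4 = maximum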
discrete variables
Have a finite set of values and cannot be subdivided (the roll of a die is an example: you can only roll a 6, not a 6.5!). A good example is binomial values, where only two values are possible, for example, a patient develops a complication or they do not.
Continuous data:
Has infinite possibilities of subdivision (for example, 1.1, 1.11, 1.111, etc.). An example is the measurement of blood pressure, where ever more detailed readings are possible depending on the sensitivity of the equipment being used. The distinction is mostly a practical one: although we can keep halving the number of red blood cells per litre of blood and eventually end up with a single (discrete) cell, the very large numbers involved make red blood cell count a continuous data value.
Null hypothesis
The null hypothesis predicts that there will be no difference between the variables under investigation.
Stratified random
Individuals are chosen because of some common, mutually exclusive trait. The most common such trait is gender, but it may also be socio-economic class, age, and many others.
Median
The value that falls right in the middle of all the other values: half of the values are higher than it and half are lower, irrespective of how high or low they are (what their actual values are). The median is used when there are values that might skew your data, i.e. a few values that differ greatly from the majority.
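A minimal sketch (assuming NumPy, with made-up ages) of how an outlier pulls the mean but leaves the median unchanged.

import numpy as np

ages = [23, 25, 26, 27, 28, 29, 95]  # hypothetical ages; 95 is an outlier

print(np.mean(ages))    # about 36.1, pulled upwards by the outlier
print(np.median(ages))  # 27, unaffected by how extreme the outlier is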
Inferential statistics
Is all about comparing different sample groups or subjects to each other. Most commonly we deal with numerical data point values, for which we can calculate measures of central tendency and dispersion.
Mode
The data value that appears most frequently. It is used to describe categorical values: it returns the value that occurs most commonly in a dataset, which means that some datasets might have more than one mode, leading to the terms bimodal (two modes) and multimodal (more than two modes). Note: it would be incorrect to use mean and median values when it comes to categorical data. The mode is most appropriate for categorical data types, and it is the only measure of central tendency that can be applied to nominal categorical data. In the case of ordinal categorical data, such as our pain score example above or a Likert scale, a mean score of 5.5625 would be meaningless. Even if we rounded this off to 5.6, it would be difficult to explain what 0.6 of a pain unit is. If you consider it carefully, even the median suffers from the same shortcoming.
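A minimal sketch using Python's standard library; the categorical values are hypothetical, and multimode shows how a dataset can be bimodal.

from statistics import multimode

# Hypothetical nominal categorical data points
professions = ["nurse", "doctor", "nurse", "physio", "doctor", "nurse"]
print(multimode(professions))  # ['nurse'] - the most frequent category

pain_scores = [3, 5, 5, 7, 7, 8]  # ordinal data with two modes -> bimodal
print(multimode(pain_scores))   # [5, 7]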
significance level
The probability of rejecting the null hypothesis given that it is true (a type I error). With a significance level of, say, 0.05:
-If the p-value is larger than 0.05, we do not reject the null hypothesis. We do not in fact accept it, and we cannot prove it; we simply do not reject it, so the statement of neutrality stands: there is no difference in the use of drug A or placebo, or no difference in the means between patients in group A and patients in group B.
-If the p-value is less than the chosen level of significance (say 0.05), we reject the null hypothesis and accept the alternative hypothesis: yes, drug A is statistically significantly better than placebo, or yes, there is a statistically significant difference between the means for patients in group A and group B.
When we reject the null hypothesis and accept the alternative hypothesis, it simply means we found a result which has a low likelihood of occurring. Remember, the result from our study is just one of many possible results, and it is one that would occur very infrequently as opposed to results that occur much more frequently; that is what we are after.
Percentile
Percentiles look at your data in finer detail: instead of simply cutting your values into quarters, you can calculate a value for any percentage of your data points. The first quartile becomes the 25th percentile, the median (or second quartile) the 50th percentile, and the third quartile the 75th percentile (all different expressions of the same thing). There is also the percentile rank, which gives the percentage of values that fall below any value in your set that you decide on, i.e. a value of 99 might have a percentile rank of 13, meaning that 13% of the values in the set are less than 99 and 87% are larger than 99.
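A minimal sketch (assuming NumPy and SciPy, with made-up values) of percentiles and of a percentile rank.

import numpy as np
from scipy import stats

values = np.array([87, 90, 92, 95, 96, 99, 101, 103, 110, 120])  # hypothetical

print(np.percentile(values, 25))  # 25th percentile = first quartile
print(np.percentile(values, 50))  # 50th percentile = median

# Percentile rank: the percentage of values strictly less than 99
print(stats.percentileofscore(values, 99, kind="strict"))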
Standard normal distribution
mean: 0, standard deviation: 1
Mean
Or average: the simple mathematical concept of adding up all the data point values for a variable in a dataset and dividing that sum by the number of values in the set. It is a meaningful way to represent a set of numbers that does not have outliers (values that differ greatly from the large majority of numbers).
At a significance level of 0.05: P-value: <0.05
reject the null hypothesis.
Cluster random
The groups of individuals that are included are somehow clustered together, i.e. all in the same space, location, or allowed time-frame. There are many forms of cluster random sampling. We often have to deal with the fact that a master list simply does not exist or is too costly or difficult to obtain. Clustering groups of individuals greatly simplifies the selection process.
quantile
For any specific value, the quantile is the percentage of values that are less than that value.
Systematic random
the selection process iterates over every decided number of individuals, i.e. every 10th or 100th individual on a master list.
Standard error
The standard deviation of a sampling distribution. A statistic used to make an inference about how well the sample mean matches up to the true population mean.
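A minimal sketch (assuming NumPy and SciPy, with hypothetical blood pressure readings): the standard error of the mean is the sample standard deviation divided by the square root of the sample size.

import numpy as np
from scipy import stats

sample = np.array([120, 118, 125, 131, 122, 128, 119, 124])  # hypothetical systolic BPs

se_manual = np.std(sample, ddof=1) / np.sqrt(len(sample))  # sample SD / sqrt(n)
se_scipy = stats.sem(sample)                               # same calculation via SciPy
print(se_manual, se_scipy)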
At a significance level of 0.05: P-value: >0.05
we do not reject the null hypothesis.
Population
-A group of individuals that share at least one characteristic in common
-On a macro level, this might refer to all of humanity
-At the level of clinical research, this might refer to every individual with a certain disease or risk factor, which might still be an enormous number of individuals
-It is quite possible to have a very small population, i.e. in the case of a very rare condition
-The findings of a study are inferred to a larger population; we make use of the findings to manage the population to which those study findings apply
Inferential statistics
-The investigation of specified elements which allows us to make inferences about a larger population (i.e., beyond the sample)
-Here we compare groups of subjects or individuals
-It is normally not possible to include every subject or individual in a population in a study, therefore we use statistics and infer that the results we get apply to the larger population
What is the p-value?
-The p-value expresses the probability of an event occurring
-It is based on the calculation of a geometrical area
-The mathematics behind a p-value draws a curve and simply calculates the area under a certain part of that curve
-It takes a value between 0 and 1
Descriptive statistics
-The use of statistical tools to summarize and describe a set of data values
-Human beings usually find it difficult to create meaning from long lists of numbers or words
-Summarizing the numbers or counting the occurrences of words and expressing that summary with single values makes much more sense to us
-In descriptive statistics, no attempt is made to compare any data sets or groups
Non-parametric tests: when to use
-Data that are not normally distributed
-Ordinal categorical or numerical data
What are the measures of dispersions and why are they used?
1. Range
2. Quartiles
3. Percentiles
4. Interquartile range
These describe the spread or dispersion of the data values.
Tests for comparing categorical data
1. The chi-squared goodness-of-fit test
2. Fisher's exact test
3. Paired t-test
Data type: nominal
ANOVA Types
1. The most common type of ANOVA test is one-way ANOVA. Here we simply use a single factor (variable) and compare more than two groups to each other.
How would you know that this pop parameter comes from a normal distribution = whether parametric tests should be used?
1. Make a graph of the data values for each group and see if they form a normal distribution
2. Q-Q plot: if the dots lie on the line, the data are normally distributed
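A minimal sketch (assuming SciPy and NumPy) of a Q-Q plot check; the sample is simulated. The Shapiro-Wilk test shown at the end is an additional formal normality test not covered in these notes, included only for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=70, scale=10, size=30)  # hypothetical group values

# Q-Q plot coordinates: if the points lie close to the fitted line,
# the data are consistent with a normal distribution
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(r)  # r close to 1 suggests the dots lie on the line

# Formal check (Shapiro-Wilk): p > 0.05 means we do not reject normality
stat, p = stats.shapiro(sample)
print(p)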
The chi-squared goodness-of-fit test: Type of data? Number of groups?
1. We predict a distribution of values, 2. go out and measure some actual data, and 3. see how well our prediction fared. The chi-square value is calculated from the differences between the observed and expected values, and from this we can calculate the probability of having found this difference (a p-value). If it is less than our chosen level of significance, we can reject the null hypothesis and accept the test hypothesis. If not, we cannot reject the null hypothesis.
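A minimal sketch (assuming SciPy) with made-up counts: we predicted equal numbers in four categories and then observed actual counts.

from scipy import stats

observed = [18, 22, 31, 29]   # hypothetical measured counts
expected = [25, 25, 25, 25]   # our predicted distribution

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)  # if p < chosen significance level, reject the null hypothesis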
Mann-Whitney-U test: Type of data? Number of groups?
The non-parametric equivalent of the t-test: two groups, numerical (or ordinal categorical) data; uses rank sums.
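A minimal sketch (assuming SciPy, with hypothetical pain scores for two groups that we do not assume to be normally distributed).

from scipy import stats

group_a = [3, 4, 4, 5, 6, 7, 7, 8]
group_b = [5, 6, 6, 7, 8, 8, 9, 9]

u_stat, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u_stat, p)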
Hypothesis
A proposed, scientifically testable explanation for an observed phenomenon.
question
A researcher calculates the mean admission white cell count for 30 patients with mild acute appendicitis (infection of the appendix) and for 30 with severe appendicitis and compares these means, finding a p-value of 0.03. What conclusion can be drawn from this?
-There was a 97% chance of not finding this value
-This study has conclusively proven that there is an absolute difference in admission white cell count between all patients with mild and severe acute appendicitis
-The difference found in this study would occur in 3% of cases
These responses are incorrect. When dealing with continuous variables (and in the case of the Central Limit Theorem the occurrences of the means do represent continuous data types) we cannot calculate the probability of a single outcome, only of a range of values. The difference in means between these two groups of 30 patients found in this study represents one of the less likely possible differences that would be found if this study were repeated many, many times.
Sample
A sample is a selection of members within the population (I'll discuss different ways of selecting a sample a bit later in this course). Research is conducted using that sample set of members and any results can be inferred to the population from which the sample was taken. This use of statistical analysis makes clinical research possible, as it is usually near impossible to include the complete population.
Parameter (criterion)
A statistical value that is calculated from all the values in a whole population is termed a parameter. If we knew the age of every individual on earth and calculated the mean or average age, that age would be a parameter.
Statistic (vs. parameter: sample, not the whole population)
A statistical value that is calculated from all the values in a sample is termed a statistic. The mean or average age of all the participants in a study would be a statistic.
ANOVA: Type of data? Number of groups?
ANOVA is the acronym for analysis of variance. As opposed to the t-test, ANOVA can compare a point estimate for a numerical variable between more than two groups.
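A minimal sketch of a one-way ANOVA (assuming SciPy, with made-up cholesterol values for three independent groups).

from scipy import stats

group_a = [4.9, 5.1, 5.3, 5.0, 5.2]
group_b = [5.4, 5.6, 5.5, 5.8, 5.7]
group_c = [4.8, 4.7, 5.0, 4.9, 5.1]

f_stat, p = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p)  # a significant p suggests at least one group mean differs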
Paired comparisons: Sign ranks
After assigning the signs, you rank the values, but you rank their absolute values. In the very last step you simply multiply the sign by the rank.
Linear regression: does correlation = causation?
Any correlation between variables does not necessarily mean causation. Just because two variables are correlated does not mean the change in one is caused by a change in the other. There might be a third factor influencing both. Proof of a causal relationship requires much more than linear regression.
Linear regression: strength of the correlation
As you will have noticed, some of the data point pairs lie quite a distance away from the linear regression line. With statistical analysis we can calculate how strongly the pairs of values are correlated and express that strength as a correlation coefficient, r. This correlation coefficient ranges from -1 to +1, with -1 being perfect negative correlation: all the dots would fall on the line and there is perfect movement in the one variable as the other moves. A correlation of +1 is the opposite, perfect positive correlation. In most real-life situations the correlation coefficient will fall somewhere in between. There is also the zero value, which means no correlation at all.
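A minimal sketch (assuming SciPy) using hypothetical pairs of cigarettes smoked per day and systolic blood pressure; rvalue is the correlation coefficient r described above.

from scipy import stats

cigarettes = [0, 2, 5, 10, 15, 20, 25, 30]
systolic_bp = [118, 120, 124, 130, 135, 141, 150, 155]

result = stats.linregress(cigarettes, systolic_bp)
print(result.slope, result.intercept, result.rvalue)  # rvalue lies between -1 and +1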
Types of data
Categorical (including nominal and ordinal data) refers to categories or things, not mathematical values. Numerical (further defined as being either interval or ratio data) refers to data which is about measurement and counting. I will also cover another numerical classification type: discrete and continuous variables. Discrete values, as the name implies, exist as little islands which are not connected (no land between them). Think of the roll of a die: with a normal six-sided die you cannot roll a three-and-a-half. Continuous numerical values, on the other hand, have (for practical purposes) many values in-between other values. They are infinitely divisible (within reasonable limits).
Confidence intervals intro
Confidence intervals are very often quoted in the medical literature, yet it is often mentioned that it is a poorly understood statistical concept amongst healthcare personnel. In this course we are concentrating on inferential statistics. We realize that most published literature is based on the concept of taking a sample of individuals from a population and analyzing some data point values we gather from them. These results are then used to infer some understanding about the population. We do this because we simply cannot investigate the whole population. This method, though, is fraught with the danger of introducing bias. The sample that is selected for analysis might not properly represent the population. If a sample statistic such as a mean is calculated, it will differ from the population mean (if we were able to test the whole population). If a sample was taken with proper care, though, this sample mean should be fairly close to the population mean. This is what confidence intervals are all about. We construct a range of values (lower and upper limit, or lower and upper maximum), which is symmetrically formed around the sample mean (in this example) and we infer that the population mean should fall between those values. We can never be entirely sure about this, though, so we decide on how confident we want to be in the bounds that we set, allowing us to calculate these bounds. Confidence intervals can be constructed around more than just means and in reality, they have a slightly more complex meaning than what I have laid out here. All will be revealed in this lesson, allowing you to fully understand what is meant by all those confidence intervals that you come across in the literature.
Type II error
Failing to reject a null hypothesis when it is in fact false.
variables and data points
I refer to a data point as a single example value for a variable, i.e. a patient might have a systolic blood pressure (the variable) of 120 mm Hg (the data point).
Ordinal categorical data
If categorical data have some natural order or a logical ranking to the data points, they are termed ordinal categorical data, i.e. they can be placed in some increasing or decreasing order. I gave the example of a pain score from 1 (one) to 10 (ten). Even though these are numbers, no mathematical operation can be performed on these digits. They are ordered in magnitude from 1 to 10, but there is no standardized measurement of these rankings and therefore no indication that the interval between the specific scores is of the same value. Other common examples include survey questions where a participant can rate their agreement with a statement on a scale, say 1 (one), indicating that they don't agree at all, to 5 (five), indicating that they fully agree. Likert-style answers such as totally disagree, disagree, neither agree nor disagree, agree and totally agree can also be converted to numbers, i.e. 1 (one) to 5 (five). Although they can be ranked, they still have no inherent numerical value and as such remain ordinal categorical data values.
Parametric vs non-parametric tests
In the literal meaning of the terms, a parametric statistical test is one that makes assumptions about the parameters (defining properties) of the population distribution(s) from which one's data are drawn, while a non-parametric test is one that makes no such assumptions.
Interval numerical data
Interval With interval data, the difference between each value is the same, which means the definition as 'I' used above holds. The difference between 1 and 2 degrees Celsius is the same as the difference between 3 and 4 degrees Celsius (there is a 1 degree difference). However, temperatures expressed in degrees Celsius (or Fahrenheit) do not have a 'true zero' because 0 (zero) degrees Celsius is not a true zero. This means that with numerical interval data (like temperature) we can order the data and we can add and subtract, but we cannot divide and multiply the data (we can't do ratios without a 'true zero'). 10 degrees plus 10 degrees is 20 degrees, but 20 degrees is not twice as hot as 10 degrees Celsius. Ratio type numerical data requires a true zero.
Why is it imp to know and understand the different types of distribution?
It is important to understand the various types of distributions because, as mentioned, distribution types have an influence on the choice of statistical analysis that should be performed on them. It would be quite incorrect to do the famous t-test on data values for a sample that do not come from a variable with a normal distribution in the population from which the sample was taken.
Non-parametric tests: key concepts
-Common parametric tests for the analysis of numerical data type values include the various t-tests, analysis of variance, and linear regression
-The most important assumption that must be met for their appropriate use is that the sample data points must be shown to come from a population in which the parameter is normally distributed
-The term parametric stems from the word parameter, which should give you a clue as to the underlying population parameter
-Testing whether the sample data points are from a population in which the parameter is normally distributed can be done by checking for skewness in the sample data or by the use of quantile (distribution) plots, amongst others
-When this assumption is not met, it is not appropriate to use parametric tests; the inappropriate use of parametric analyses may lead to false conclusions
-Nonparametric tests are slightly less sensitive at picking up differences between groups
-Nonparametric tests can be used for numerical data types as well as ordinal categorical data types
-When data points are not from a normal distribution, the mean (on which parametric tests are based) is not a good point estimate; in these cases it is better to consider the median
-Comparing medians makes use of signs, ranks, sign ranks and rank sums
-When using signs, all of the sample data points are grouped together and each value is assigned a label of either zero or (plus) one based on whether it is at or lower than a suspected median (zero) or higher than that median (one)
-The closer the suspected value is to the true median, the closer the sum of the signs should be to one half of the size of the sample
-A distribution can be created from ranks, where all the sample values are placed in ascending order and ranked from one, with ties resolved by giving them an average rank value
-The sign value is multiplied by the (absolute value) rank number to give the sign rank value
-In the rank sums method, specific tables can be used to compare values in groups, making lists of which values 'beat' which values, with the specific outcome one of many possible outcomes
-The Mann-Whitney-U (or Mann-Whitney-Wilcoxon or Wilcoxon-Mann-Whitney or Wilcoxon rank-sum) test uses the rank-sum distribution
-The Kruskal-Wallis test is the nonparametric equivalent to the one-way analysis of variance test
-If a Kruskal-Wallis test finds a significant p-value, then the individual groups can be compared using the Mann-Whitney-U test
-The Wilcoxon sign-rank test is analogous to the parametric paired-sample t-test
-Spearman's rank correlation is analogous to linear regression
-The alternative Kendall's rank test can be used for more accuracy when Spearman's rank correlation rejects the null hypothesis
Kurtosis
Kurtosis refers to the spread of your data values. A platykurtic curve is flatter and broader than normal as a result of having few scores around the mean. Large sections under the curve are forced into the tails, thereby (falsely) increasing the probability of finding a value quite far from the mean. A mesokurtic curve takes the middle ground, with a medium curve from average distributions. A leptokurtic curve is more peaked, with many values centred around the mean. Remember, in this section we are discussing the distribution of the actual data point values in a study, but the terms used here can also refer to the curve that is eventually constructed when we calculate a p-value. As we will see later, these are quite different things (the curve of actual data point values and the p-value curve calculated from the data point values). This is a very important distinction.
Central limit theorem
We saw in the previous section on combinations that when you compare the difference in averages between two groups, your answer is but one of many that exist. The Central Limit Theorem states that if we were to plot all the possible differences, the resulting graph would form a smooth, symmetrical curve. Therefore, we can do statistical analysis and look at the area under the curve to calculate our p-values. The mathematics behind the calculation of the p-value constructs an estimation of all the possible outcomes (or differences, as in our example). Let's look at a visual representation of the data. In the first graph below we asked a computer program to give us 10,000 random values between 30 and 40. As you can see, there is no pattern. Let's suggest that these 10,000 values represent a population and we need to randomly select 30 individuals from the population to represent our study sample. So, let's instruct the computer to take 30 random samples from these 10,000 values and calculate the average for those 30. Now, let's repeat this process 1000 times. We are in essence repeating our medical study 1000 times! The result of the occurrence of all the averages is shown in the graph below: as the Central Limit Theorem predicts, a lovely smooth, symmetric distribution, just ready and waiting for some statistical analysis. Every time a medical study is conducted, the data point values (and their measures of central tendency and dispersion) are just one example of countless others. Some will occur more commonly than others and it is the Central Limit Theorem that allows us to calculate how likely it was to find a result as extreme as the one found in any particular study.
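A minimal sketch (assuming NumPy) of the simulation described above: 10,000 uniform values between 30 and 40 act as the population, we repeatedly take samples of 30 and record their means.

import numpy as np

rng = np.random.default_rng(0)

# 10,000 random values between 30 and 40 acting as our "population"
population = rng.uniform(30, 40, size=10_000)

# Repeat the "study" 1000 times: each time sample 30 individuals and take the mean
sample_means = [rng.choice(population, size=30, replace=False).mean()
                for _ in range(1000)]

# The means cluster symmetrically around the population mean (about 35),
# as the Central Limit Theorem predicts
print(np.mean(sample_means), np.std(sample_means))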
The chi-squared- when to use?
Use when the expected values are large (more than 5).
Non-parametric tests: types
1. Mann-Whitney-U test: the non-parametric equivalent of the Student t-test; uses rank sums for 2 groups
2. Kruskal-Wallis test: the non-parametric equivalent of one-way ANOVA; compares the medians of more than 2 groups
3. Wilcoxon sign-rank test: analogous to the parametric paired-sample t-test
4. Spearman's rank correlation: analogous to linear regression
5. Kendall's rank test: an alternative used for more accuracy when Spearman's rank correlation rejects the null hypothesis
Interval estimation: Confidence intervals
Now that we understand what confidence levels are, we are ready to define the proper interpretation of confidence intervals. It might be natural to suggest that, given a 95% confidence level, we are 95% confident that the true population parameter lies between the intervals given by that confidence level. Revisiting our last example, we might have a sample statistic of 55 years for the mean of our sample and, with a 95% confidence level, construct intervals of 51 to 59 years. This would commonly be written as a mean age of 55 years (95% CI, 51-59). It would be incorrect, though, to suggest that there is a 95% chance of the population mean age being between 51 and 59 years.
The true interpretation of confidence intervals
Consider that both the sample statistic (mean age of the participants in a study) and the population parameter (mean age of the population from which the sample was taken) exist in reality. They are both absolutes. Given this, the population parameter either does or does not fall inside the confidence interval. It is all or nothing. The true meaning of a confidence level of say 95% is that if the study is repeated 100 times (each with its own random sample set of patients drawn from the population, each with its own mean and 95% confidence intervals), 95 of these studies will correctly have the population parameter within the intervals and 5 will not. There is no way to know which one you have for any given study.
Central limit theorem
Now that you are aware of the fact that the p-value represents the area under a very beautiful and symmetric curve (for continuous data type variables at least) something may start to concern you. If it hasn't, let's spell it out. Is the probability curve always so symmetric? Surely, when you look at the occurrence of data point values for variables in a research project (experiment), they are not symmetrically arranged. In this lesson, we get the answer to this question. We will learn that this specific difference between the means of the two groups is but one of many, many, many (really many) differences that are possible. We will also see that some differences occur much more commonly than others. The answer lies in a mathematical theorem called the Central Limit Theorem (CLT). As usual, don't be alarmed, we won't go near the math. A few simple visual graphs will explain it quite painlessly.
Ordering values: Ranking
If we don't have a normal distribution to work from, i.e. our sample data does not come from a population in which the parameter is normally distributed, we've got to create a distribution somehow. We can't use the normal distribution, so we have to come up with one, and the way we do that is by ranking the values from lowest to highest, in ascending order.
Comparing paired observations: Signs: example
Imagine we have a sample with these values: 10, 10, 12, 12, 13, 14, 14, 16, 19, 21, 21, 23, 24. The sample size is 13 and the sample median works out to 14. We can now ask: is the true population median 17? To answer this, we assign each value either 0 or +1. Remember that for non-parametric tests the data have to be ordinal categorical or numerical, so they can be placed in ascending order; our sample is already ordered. Every value that is 17 or less is assigned the sign 0 and every value that is more than 17 is assigned +1, and we simply add up all the signs. If 17 really were the median, we would expect an equal number of values on both sides: with 13 values, 6 would be below, 6 above and one in the middle, so the signs would add up to 6. Here they add up to 5, so we can start to suspect that 17 probably isn't the true median, at least as far as this sample is concerned. That is assigning a sign to each of your data points.
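A minimal sketch of the sign-counting step above, using the same example values.

values = [10, 10, 12, 12, 13, 14, 14, 16, 19, 21, 21, 23, 24]
suspected_median = 17

# Sign of 0 for values at or below the suspected median, +1 for values above it
signs = [0 if v <= suspected_median else 1 for v in values]
print(sum(signs))  # 5, whereas we expected about 6 if 17 really were the median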
T value
The next step is that the difference in means for this exact study gets converted into so many standard errors away from the mean. The t value is the difference between an observed sample statistic and its hypothesised population parameter, expressed in standard errors; it is a measure of the strength of evidence.
The chi-squared goodness-of-fit test-hypothesis:
Null: no association. Alternative: there's an association
q-q plot
On this kind of plot you take any value and plot the percentage of values in the dataset that are less than it, i.e. each value versus its quantile.
Type I error
Rejecting the null hypothesis when it is in fact true.
Linear regression: results
Sets of data points will almost never fall on a straight line, but the mathematics underlying linear regression can try to make a straight line out of all the data point sets. When we do this we usually note a direction: most sets are either positively or negatively correlated.
-With positive correlation, one variable (called the dependent variable, which is on the y-axis) increases as the other (called the independent variable, which is on the x-axis) also increases.
-With negative correlation, as you might imagine, the dependent variable decreases as the independent variable increases.
Skewness
Skewness is rather self-explanatory and is commonly present in clinical research. It is a marker that shows that there are more occurrences of certain data point values at one end of a spectrum than another. Below is a graph showing the age distribution of participants in a hypothetical research project. Note how most individuals were on the younger side. Younger data point values occur more commonly (although there seems to be some very old people in this study). The skewness in this instance is right-tailed. It tails off to the right, which means it is positively skewed. On the other hand, negative skewness would indicate the data is left-tailed.
Fisher's exact test- When to use?
Small expected values (in the order of 1-5); a 2x2 table; can be one-tailed.
Parametric tests
Student t-test (assuming equal variances, assuming unequal variances, and paired t-tests), ANOVA, linear regression.
Parametric tests when to use
T-test and ANOVA:
-Used when comparing numerical values between categorical groups.
-Example: the variables cholesterol and white cell count contain data points that are ratio-type numerical and continuous, but the groups themselves are categorical (group A and B, or test and control).
-T-test: 2 groups; ANOVA: more than 2 groups.
Linear regression:
-Used when looking for a correlation between two sets of values. These sets must come in pairs: does one depend on the other, is there a change in one as the other changes?
-Example: the first value in each pair comes from one set of data point values and the second from a second set. I've mentioned an example before: number of cigarettes smoked per day versus blood pressure value.
Student t-test: Type of data? Number of groups?
Tests whether the difference between the means of two independent populations is equal to a target value. Example: does the mean height of female college students significantly differ from the mean height of male college students? These are truly the most commonly used tests and most people are familiar with Student's t-test. There is more than one t-test, depending on various factors. As a group, though, they are used to compare the point estimate for some numerical variable between two groups. Requirements:
-two separate groups
-numerical, ratio, continuous data
-normal distribution
-equal variance
-unpaired groups (not connected, not dependent on each other; if they are, use the paired t-test)
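A minimal sketch (assuming SciPy, with made-up cholesterol values for two independent groups). equal_var=False would give the unequal-variance (Welch) version.

from scipy import stats

group_a = [4.9, 5.2, 5.4, 5.1, 5.6, 5.0, 5.3]
group_b = [5.6, 5.9, 6.1, 5.8, 6.0, 5.7, 6.2]

t_stat, p = stats.ttest_ind(group_a, group_b, equal_var=True)
print(t_stat, p)  # if p < significance level, the group means differ significantly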
Paired t-test: Type of data? Number of groups?
Tests whether the mean of the differences between dependent or paired observations is equal to a target value. Example: if you measure the weight of male college students before and after each subject takes a weight-loss pill, is the mean weight loss significant enough to conclude that the pill works? Use it when the observations are dependent, e.g. if you have monozygotic twins, one in each group, the results are not independent of each other; likewise if you have two sets of values from the same individuals, one before and one after an intervention (the data are dependent).
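A minimal sketch (assuming SciPy) with hypothetical weights for the same individuals before and after an intervention.

from scipy import stats

before = [92, 88, 101, 95, 110, 99]
after = [89, 86, 97, 94, 105, 96]

t_stat, p = stats.ttest_rel(before, after)
print(t_stat, p)  # a significant p suggests a real mean difference within pairs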
Comparing paired observations: Signs
This step takes each of the values and attaches a sign to it. There are only two signs that can be attached: either a zero or a plus one. One of these is attached to each value in your dataset, for all the groups and all their values.
Alternative hypothesis
The alternative hypothesis, also known as the test or research hypothesis predicts that there will be a difference or significant relationship.
Critical T value
The critical t-value is a line somewhere on your graph, so many standard errors away from the mean, such that if our value is larger (or smaller) than it, the area under the curve beyond it is 5% (for a significance level of 0.05).
The Interquartile Range and Outliers
The interquartile range (IQR) is the difference between the values of the first and third quartiles. A simple subtraction. It is used to determine statistical outliers. Extreme or atypical values which fall far out of the range of data points are termed 'outliers' and can be excluded.
Variance and standard deviation
The method of describing the extent of dispersion or spread of data values in relation to the mean is referred to as the variance. We use the square root of the variance, which is called the standard deviation (SD). Imagine all the data values in a data set are represented by dots on a straight line, i.e. the familiar x-axis from graphs at school. A dot can also be placed on this line representing the mean value. Now the distance between each point and the mean is taken and then averaged, so as to get an average distance of how far all the points are from the mean. Note that we want distance away from the mean, i.e. not negative values (some values will be smaller than the mean). For this mathematical reason all the differences are squared, resulting in all positive values. The average of all these values is the variance. The square root of this is then the SD, the average distance that all the data points are away from the mean. As an illustration, the data values of 1, 2, 3, 20, 38, 39, and 40 have a much wider spread (standard deviation) than 17, 18, 19, 20, 21, 22, and 23. Both sets have an average of about 20, but the first has a much wider spread or SD. When comparing the results of two groups we should always be circumspect when large standard deviations are reported and especially so when the values of the standard deviations overlap for the two groups.
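A minimal sketch (assuming NumPy) following the description above: squared distances from the mean are averaged to give the variance, and its square root is the SD. It uses the two example sets from the text.

import numpy as np

wide = np.array([1, 2, 3, 20, 38, 39, 40])       # wide spread
narrow = np.array([17, 18, 19, 20, 21, 22, 23])  # narrow spread, similar mean

variance = np.mean((wide - wide.mean()) ** 2)  # average of squared distances
sd = np.sqrt(variance)
print(wide.mean(), sd)                 # mean about 20, large SD
print(narrow.mean(), np.std(narrow))   # mean 20, much smaller SD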
Normal distribution
The normal distribution is perhaps the most important distribution. We need to know that data point values for a sample are taken from a population in which that variable is normally distributed before we decide on what type of statistical test to use. Furthermore, the distribution of all possible outcomes (through the Central Limit Theorem) is normally distributed. The normal distribution has the following properties: -most values are centered around the mean -as you move away from the mean, there are fewer data points -symmetrical in nature -bell-shaped curve -almost all data points (99.7%) occur within 3 standard deviations of the mean Most variables that we use in clinical research have data point values that are normally distributed, i.e. the sample data points we have, come from a population for whom the values are normally distributed. As mentioned in the introduction, it is important to know this, because we have to know what distribution pattern the variable has in the population in order to decide on the correct statistical analysis tool to use. It is worthwhile to repeat the fact that actual data point values for a variable (i.e. age, height, white cell count, etc.) have a distribution (both in a sample and in the population), but that through the Central Limit Theorem, where we calculate how often certain values or differences in values occur, we have a guaranteed normal distribution.
The self-controlled case series (SCCS)
The self-controlled case series (SCCS) method is an alternative study method for investigating the association between a transient exposure and an adverse event. The method was developed to study adverse reactions to vaccines. The method uses only cases, no separate controls are required as the cases act as their own controls. Each case's given observation time is divided into control and risk periods. Risk periods are defined during or after the exposure. Then the method finds a relative incidence, that is, the incidence in risk periods relative to the incidence in control periods. Time-varying confounding factors such as age can be allowed for by dividing up the observation period further into age categories. An advantage of the method is that confounding factors that do not vary with time, such as genetics, location, socio-economic status are controlled for implicitly.
Types of t-tests
There is a variety of t-tests. Commonly we will have two independent groups. If we were to compare the average cholesterol levels between two groups of patients, the participants in these two groups must be independent of each other, i.e. we cannot have the same individual appear in both groups. A special type of t-test exists if the two groups do not contain independent individuals, as would happen if the groups are made up of homozygotic (identical) twins, or if we test the same variable in the same group of participants before and after an intervention (with the two sets of data constituting the two groups). There are also two variations of the t-test based on equal and unequal variances. It is important to consider the difference in the variances (square of the standard deviation) of the data point values for the two groups. If there is a big difference, a t-test assuming unequal variances should be used.
Fisher's exact test- How to use?
There are cases in which the χ2-test does become inaccurate. This happens when the numbers are quite small, with totals in the order of five or less. There is actually a rule called Cochrane's rule, which states that more than 80% of the values in the expected table (above) must be larger than five. If not, Fisher's exact test should be used. Fisher's test, though, only considers two columns and two rows. So in order to use it, the categorical numbers above must be reduced by combining some of the categories. In the example above we might combine considerable improvement, moderate improvement and no change into a single row and all the deteriorations and deaths into a second row, leaving us with a two-column and two-row contingency table (observed table). The calculation for Fisher's test uses factorials. Five factorial (written as 5!) means 5 x 4 x 3 x 2 x 1 = 120 and 3! is 3 x 2 x 1 = 6. For interest's sake, 1! = 1 and 0! is also equal to 1. As you might realise, factorial values increase in size quite considerably. In the example above we had a value of 27, and 27! is a value with 29 digits. That is billions and billions. When such large values are used, a researcher must make sure that his or her computer can accurately manage such large numbers and not make rounding mistakes. Fisher's exact test should not be used when it is not required, i.e. when the expected values are not small.
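A minimal sketch (assuming SciPy) on a made-up two-row, two-column contingency table such as the one described after combining categories.

from scipy import stats

# Hypothetical 2x2 table: rows = improvement vs deterioration,
# columns = treatment vs control
table = [[8, 2],
         [1, 5]]

odds_ratio, p = stats.fisher_exact(table)
print(odds_ratio, p)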
Confidence Intervals
This is what confidence intervals are all about.
1. We construct a range of values (a lower and upper limit), which is symmetrically formed around the sample mean,
2. and we infer that the population mean should fall between those values. We can never be entirely sure about this, though,
3. so we decide on how confident we want to be in the bounds that we set, allowing us to calculate those bounds.
In reverse: 1. choose a 95% confidence level, 2. calculate the limits.
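A minimal sketch (assuming NumPy and SciPy, with made-up ages) of calculating a 95% confidence interval around a sample mean using the t-distribution.

import numpy as np
from scipy import stats

ages = np.array([48, 52, 55, 57, 61, 50, 58, 54, 59, 56])  # hypothetical sample

mean = ages.mean()
ci_low, ci_high = stats.t.interval(0.95, df=len(ages) - 1,
                                   loc=mean, scale=stats.sem(ages))
print(f"mean {mean:.1f} (95% CI, {ci_low:.1f}-{ci_high:.1f})")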
Ratio numerical data
This type applies to data that have a true 0 (zero), which means you can establish a meaningful relationship between the data points as related to the 0 (zero) value eg. age from birth (0) or white blood cell count or number of clinic visits (from 0). A systolic blood pressure of 200 mm Hg is indeed twice as high as a pressure of 100 mm Hg.
Types of distributions
We all know that certain things occur more commonly than others. We all accept that there are more days with lower temperatures in winter than days with higher temperatures. In the northern hemisphere there will be more days in January that are less than 10 degrees Celsius (50 degrees Fahrenheit) than there are days that are more than 20 degrees Celsius (68 degrees Fahrenheit). Actual data point values for any imaginable variable come in a variety of, shall we say, shapes or patterns of spread. The proper term for this is a distribution. The most familiar shape is the normal distribution. Data from this type of distribution are symmetric and form what many refer to as a bell-shaped curve. Most values center around the average and taper off to both ends. If we turn to healthcare, we can imagine that certain hemoglobin levels occur more commonly than others in a normal population. There is a distribution to the data point values. In deciding which type of statistical test to use, we are concerned with the distribution that the parameter takes in the population. As we will see later, we do not always know what the shape of the distribution is, and we can only calculate whether our sample data point values might come from a population in which that variable is normally (or otherwise) distributed. It turns out that there are many forms of data distributions for both discrete and continuous data type variables. Even more so, averages, standard deviations, and other statistics also have distributions. This follows naturally from the Central Limit Theorem we looked at before. We saw that if we could repeat an experiment thousands of times, each time selecting a new combination of subjects, some average values or differences in averages between two groups would form a symmetrical distribution. It is important to understand the various types of distributions because, as mentioned, distribution types have an influence on the choice of statistical analysis that should be performed on them. It would be quite incorrect to do the famous t-test on data values for a sample that do not come from a variable with a normal distribution in the population from which the sample was taken. Unfortunately, most data is not shared openly and we have to trust the integrity of the authors and that they chose an appropriate test for their data. The onus then also rests on you to be aware of the various distributions and what tests to perform when conducting your own research, as well as to scrutinize these choices when reading the literature. In this lesson you will note that I refer to two main types of distributions. First, there is the distribution pattern taken by the actual data point values in a study sample (or the distribution of that variable in the underlying population from which the sample was taken). Then there is the distribution that can be created from the data point values by way of the Central Limit Theorem. There are two of these distributions, the Z- and the t-distributions (both sharing a beautifully symmetric, bell-shaped pattern, allowing us to calculate a p-value from them).
What do we want from this t-test?
We want the probability, the p-value, that the sample values for your two groups come from different populations, i.e. that there is some quintessential difference between the populations from which these two sample groups come (or to which they infer). If the assumptions that we have talked about until now are not met, Student's t-test is invalid: you cannot believe the p-value that comes from it. You then have to look at the alternatives, whether it be the unequal-variance t-test, the paired t-test, or non-parametric tests.
Ordering values: Ranking example
Say we have the values 13, 12, 2, 14, 15, 3, and 40. We can certainly place them in ascending order, starting at the smallest value and ending with the largest: 2, 3, 12, 13, 14, 15, 40. We can now simply assign a rank to each: 1, 2, 3, 4, 5, 6, 7, and it is those ranks that we work with. We cannot work with the actual data values because they are not normally distributed; a mean would not be a proper representation of those values, so we look at using the median. Because the median is the value for which half of the values are less and the other half more, the ranks are what we actually work with. You might ask: what if there is a tie? If we are dealing with sample sets of 60 or 100 patients in each group, we are likely to have some ties. Imagine the first 13 was instead a 12, so in order the values are 2, 3, 12, 12, 14, 15, 40. The 2 is ranked 1, the 3 is ranked 2, and the two 12s would have taken ranks 3 and 4; because they are equal, they each get the average rank (3 + 4 = 7, divided by 2 = 3.5). Remember they still take up two positions, so the next value, 14, has to take rank 5, and the remaining values take ranks 6 and 7. In general, you add up the ranks of the tied values and divide by how many there are. Ranks are very, very important: they are what we use to do most non-parametric tests.
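A minimal sketch (assuming SciPy) of ranking with ties, using the example values above; tied values share the average of the ranks they occupy.

from scipy import stats

values = [12, 12, 2, 14, 15, 3, 40]  # the example with a tie
print(stats.rankdata(values))
# [3.5, 3.5, 1., 5., 6., 2., 7.] -> the two 12s share rank (3 + 4) / 2 = 3.5,
# and the next value (14) still takes rank 5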
Confidence levels
What are they? As I've mentioned in the previous section, we only examine a small sample of subjects from a much larger population. The aim, though, is to use the results of our studies when managing that population. Since we only investigate a small sample, any statistic that we calculate based on the data points gathered from them will not necessarily reflect the population parameter, which is what we are really interested in. Somehow, we should be able to take a stab at what that population parameter might be. Thinking on a very large scale, the true population parameter for any variable could be anything from negative infinity to positive infinity! That sounds odd, but makes mathematical sense. Let's take age for instance. Imagine I have the mean age of a sample of patients and am wondering about the true mean age in the population from which that sample was taken. Now, no one can be -1000 years old, neither can they be +1000 years old. Remember the Central Limit Theorem, though? It posited that the analysis of a variable from a sample was just one of many (by way of combinations). That distribution graph is a mathematical construct and stretches from negative to positive infinity. To be sure, the occurrences of these extreme values are basically nil, and in practice they are. So let's say (mathematically) that there are values of -1000 and +1000 (and for that matter negative and positive infinity) and I use these bounds as my guess as to what the mean value in the population as a whole is. With that wide a guess I can be 100% confident that the population mean would fall between those bounds. This 100% represents my confidence level, and I can set this arbitrarily for any sample statistic.
What happens if I shrink the bounds? Enough with the nonsensical values. What happens if, for argument's sake, the sample mean age was 55 and I suggest that the population mean is between 45 and 65? I have shrunk the bounds, but now, logically, I should lose some confidence in my guess. Indeed, that is what happens: the confidence level goes down. If I shrink it to 54 to 56, there is a much greater chance that the population mean escapes these bounds and the confidence level would be much smaller.
A 95% confidence level
It is customary to use a confidence level of 95% and that is the value that you will notice most often in the literature when confidence intervals are quoted. The 95% refers to the confidence level and the quoted values represent the actual interval. The mathematics behind confidence intervals constructs a distribution around the sample statistic based on the data point values and calculates what area would be covered by 95% (for a 95% confidence level) of the curve. The x-axis values are reconstituted to actual values, which are then the lower and upper values of the interval.
Two-tailed test
When an alternative hypothesis states simply that there will be a difference, we conduct what is termed a two-tailed test.
Steps of comparing more than two groups
When comparing more than two groups, it is essential to start with analysis of variance. Only when a significant value is calculated should a researcher continue with comparing two groups directly using a t-test, so as to look for significant differences. If the analysis of variance does not return a significant result, it is pointless to run t-tests between groups and if done any significant finding should be ignored.
Linear regression: Type of data? Number of groups?
When comparing two or more groups, although the data point values for a variable are numerical in type, the groups themselves are not. We might call the groups A and B, or one and two, or test and control; as such they refer to categorical data types. In linear regression we directly compare a numerical value to a numerical value. For this, we need pairs of values, and in essence we look for a correlation between these: can we find that a change in the set represented by the first value in all the pairs causes a predictable change in the set made up of all the second values in the pairs? As an example, we might correlate the number of cigarettes smoked per day with blood pressure level. To do this we would need a sample of participants and, for each, a value for cigarettes smoked per day and a blood pressure value. As we will see later, correlation does not prove causation!
Why do we need to spend time distinguishing data?
You have to use very different statistical tests for different types of data, and without understanding what the data type values (data points) reflect, it is easy to make false claims or use incorrect statistical tests.
Simple random
a master list of the whole population is available and each individual on that list has an equal likelihood of being chosen to be part of the sample.