Biostatistics
Type I error
The error of rejecting a true null hypothesis; i.e. declaring a difference exists when it does not.
Transformation
A change in the scale of measurement for some variable(s). Examples are the square root transformation and logarithm transformation.
hypothesis testing
A general term for the procedure of assessing whether sample data is consistent or otherwise with statements made about the population.
case-control study
(Syn: case comparison study, case compeer study, case history study, case referent study, retrospective study) The observational epidemiologic study of persons with the disease (or other outcome variable) of interest and a suitable control (comparison, reference) group of persons without the disease.
cohort study
(Syn: concurrent, follow-up, incidence, longitudinal, prospective study) The analytic method of epidemiologic study in which subsets of a defined population can be identified who are, have been, or in the future may be exposed or not exposed, or exposed in different degrees, to a factor or factors hypothesized to influence the probability of occurrence of a given disease or other outcome.
cross-sectional study
(Syn: disease frequency survey, prevalence study) A study that examines the relationship between diseases (or other health-related characteristics) and other variables of interest as they exist in a defined population at one particular time.
clinical trial
(Syn: therapeutic trial) A research activity that involves the administration of a test regimen to humans to evaluate its efficacy and safety. The term is subject to wide variation in usage, from the first use in humans without any control treatment to a rigorously designed and executed experiment involving test and control treatments and randomization. Several phases of clinical trials are distinguished. Phase I trial: Safety and pharmacologic profiles. The first introduction of a candidate vaccine or a drug into a human population to determine its safety and mode of action. In drug trials, this phase may include studies of dose and route of administration. Phase I trials usually involve fewer than 100 healthy volunteers. Phase II trial: Pilot efficacy studies. Initial trial to examine efficacy, usually in 200 to 500 volunteers; with vaccines, the focus is on immunogenicity, and with drugs, on demonstration of safety and efficacy in comparison to other existing regimens. Usually, but not always, subjects are randomly allocated to study and control groups. Phase III trial: Extensive clinical trial. This phase is intended for complete assessment of safety and efficacy. It involves larger numbers, perhaps thousands, of volunteers, usually with random allocation to study and control groups, and may be a multicenter trial. Phase IV trial: With drugs, this phase is conducted after the national drug registration authority (e.g., the Food and Drug Administration in the United States) has approved the drug for distribution or marketing. Phase IV trials may include research designed to explore a specific pharmacologic effect, to establish the incidence of adverse reactions, or to determine the effects of long-term use. Ethical review is required for Phase IV clinical trials, but not for routine postmarketing surveillance.
predictive value
(positive and negative): In screening and diagnostic tests, the probability that a person with a positive test is a true positive (i.e., does have the disease) is referred to as the "predictive value of a positive test." The predictive value of a negative test is the probability that a person with a negative test does not have the disease. The predictive value of a screening test is determined by the sensitivity and specificity of the test, and by the prevalence of the condition for which the test is used.
Continuous scale
(see measurement scale)
alternative hypothesis
1. A supposition, arrived at from observation or reflection, that leads to refutable predictions. 2. Any conjecture cast in a form that will allow it to be tested and refuted.
qualitative data
1. observations or information characterized by measurement on a categorical scale, i.e. a dichotomous (non-numeric) or nominal scale, or if the categories are ordered, an ordinal scale. Examples are sex, hair color, death or survival. 2. systematic non-numerical observations by sociologists, anthropologists, etc. using approved methods such as participant observation or key informants.
controlled trial
A Phase III clinical trial in which an experimental treatment is compared with a control treatment, the latter being either the current standard treatment or a placebo.
paired t-test (matched pair t-test)
A Student's t-test for the equality of the means of two populations when the observations arise as paired samples. The test is based on the differences between the observations of the matched pairs. The test statistic is t = d̄/(sd/√n), where n is the number of pairs, d̄ is the mean of the differences, and sd their standard deviation. If the null hypothesis of the equality of the population means is true, then t has a Student's t-distribution with n - 1 degrees of freedom.
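As a quick illustration, the statistic can be computed directly from the pairwise differences; the sketch below uses hypothetical before/after measurements and assumes NumPy and SciPy are available (scipy.stats.ttest_rel computes the same test).

    import numpy as np
    from scipy import stats

    # hypothetical paired measurements, e.g. before and after treatment
    before = np.array([120., 125., 130., 118., 140., 132.])
    after  = np.array([115., 121., 131., 110., 135., 130.])

    d = before - after                                 # within-pair differences
    t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # t = d-bar / (sd / sqrt(n))
    print(t)                                           # hand-computed statistic
    print(stats.ttest_rel(before, after))              # same statistic plus p-value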
Biostatistics
A branch of science which applies statistical methods to biological problems. The science of biostatistics encompasses the design of biological experiments, especially in medicine and health sciences.
nonrandomized clinical trial
A clinical trial in which a series of consecutive patients receive a new treatment and those that respond (according to some pre-defined criterion) continue to receive it. Those patients that fail to respond receive an alternative, usually the conventional, treatment. The two groups are then compared on one or more outcome variables. One of the problems with such a procedure is that patients who respond may be healthier than those who do not respond, possibly resulting in an apparent but not real benefit of treatment.
meta-analysis
A collection of techniques whereby the results of two or more independent studies are statistically combined to yield an overall answer to a question of interest. The rationale behind this approach is to provide a test with more power than is provided by the separate studies themselves. The procedure has become increasingly popular in the last decade or so but it is not without its critics particularly because of the difficulties of knowing which studies should be included and to which population final results actually apply.
confounding variable
A confounding variable (also confounding factor, lurking variable, a confound, or confounder) is an extraneous variable in a statistical model that correlates (positively or negatively) with both the dependent variable and the independent variable. The methodologies of scientific studies therefore need to control for these factors to avoid what is known as a type I error: a 'false positive' conclusion that the dependent variable is in a causal relationship with the independent variable. Such a relation between two observed variables is termed a spurious relationship. Thus, confounding is a major threat to the validity of inferences made about cause and effect, i.e. internal validity, as the observed effects may be attributable to the confounder rather than the independent variable. By definition, a confounding variable is associated with both the probable cause and the outcome. The confounder is not allowed to lie in the causal pathway between the cause and the outcome: if A is thought to be the cause of disease C, the confounding variable B may not be solely caused by behaviour A, and behaviour B shall not always lead to behaviour C. An example: being female does not always lead to smoking tobacco, and smoking tobacco does not always lead to cancer. Therefore, any study that tries to elucidate the relation between being female and cancer should take smoking into account as a possible confounder. In addition, a confounder is always a risk factor that has a different prevalence in two risk groups (e.g. females/males). (Hennekens, Buring & Mayrent, 1987).
logistic regression
A form of regression analysis used when the response variable is a binary variable. The method is based on the logistic transformation or logit of a proportion, namely logit(p) = ln[p/(1 - p)]. As p tends to 0, logit(p) tends to -∞, and as p tends to 1, logit(p) tends to ∞; the transformation is symmetric about p = 0.5 in the sense that logit(1 - p) = -logit(p). Applying this transformation, this form of regression is written as logit(p) = β0 + β1x1 + β2x2 + ... + βqxq, where p = Pr(dependent variable = 1) and x1, x2, ..., xq are the explanatory variables. Using the logistic transformation in this way overcomes problems that might arise if p was modeled directly as a linear function of the explanatory variables; in particular, it avoids fitted probabilities outside the range (0,1). The parameters in the model can be estimated by maximum likelihood estimation.
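A minimal sketch of the maximum likelihood step, assuming NumPy and SciPy are available and using made-up dose/response data; it maximizes the Bernoulli log-likelihood for a single explanatory variable directly, rather than calling a packaged routine.

    import numpy as np
    from scipy.optimize import minimize

    # hypothetical data: x is a dose, y a binary response
    x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
    y = np.array([0., 0., 0., 1., 0., 1., 1., 1.])
    X = np.column_stack([np.ones_like(x), x])       # intercept column plus x

    def neg_log_lik(beta):
        eta = X @ beta                              # linear predictor, logit(p)
        # negative Bernoulli log-likelihood: sum log(1 + e^eta) - y'eta
        return np.sum(np.log1p(np.exp(eta))) - y @ eta

    fit = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
    print(fit.x)                                    # MLE of (beta0, beta1)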
likelihood function
A function constructed from a statistical model and a set of observed data that gives the probability of the observed data for various values of the unknown model parameters. The parameter values that maximize the probability are the maximum likelihood estimates of the parameters.
descriptive statistics
A general term for methods of summarizing and tabulating data that make their main features more transparent. For example, calculating means and variances and plotting histograms.
measures of central tendency
A general term for several measures of the distribution of a set of values or measurements that are located at or near the middle of the set. The principal measures of central tendency are the mean, median, and mode.
Retrospective study
A general term for studies in which all the events of interest occur prior to the onset of the study and findings are based on looking backward in time. Most common is the case-control study, in which comparisons are made between individuals who have a particular disease or condition (the cases) and individuals who do not have the disease (the controls). A sample of cases is selected from the population of individuals who have the disease of interest and a sample of controls is taken from among those individuals known not to have the disease. Information about possible risk factors for the disease is then obtained retrospectively for each person in the study by examining past records, by interviewing each person and/or interviewing their relatives, or in some other way. In order to make the cases and controls otherwise comparable, they are frequently matched on characteristics known to be strongly related to both disease and exposure leading to a matched case-control study. Age, sex and socioeconomic status are examples of commonly used matching variables. Also commonly encountered is the retrospective cohort study, in which a past cohort of individuals are identified from previous information, for example, employment records, and their subsequent mortality or morbidity determined and compared with the corresponding experience of some suitable control group.
Receiver operating characteristic (ROC or relative operating characteristic) curve
A graphic means for assessing the ability of a screening test to discriminate between healthy and diseased persons. The term receiver operating characteristic comes from psychometrics, where the characteristic operating response of a receiver-individual to faint stimuli or nonstimuli was recorded.
histogram
A graphical representation of a set of observations in which class frequencies are represented by the areas of rectangles centred on the class intervals. If the latter are all equal, the heights of the rectangles are also proportional to the observed frequencies.
historical controls
A group of patients treated in the past with a standard therapy, used as the control group for evaluating a new treatment on current patients. Although used fairly frequently in medical investigations, the approach is not to be recommended since possible biases, due to other factors that may have changed over time, can never be satisfactorily eliminated.
logit model
A linear model for the logit (natural log of the odds) of disease as a function of a quantitative factor X: logit(disease given X = x) = α + βx. This model is mathematically equivalent to the logistic model.
probability
A measure associated with an event A and denoted by Pr(A) which takes a value such that 0 ≤ Pr(A) ≤ 1. Essentially the quantitative expression of the chance that an event will occur. In general the higher the value of Pr(A) the more likely it is that the event will occur. If the event cannot happen Pr(A) = 0; if an event is certain to happen Pr(A) = 1. Numerical values can be assigned in simple cases by one of the following two methods: If the sample space can be divided into subsets of n (n ≥ 2) equally likely outcomes and the event A is associated with r (0 ≤ r ≤ n) of these, then Pr(A) = r/n. If an experiment can be repeated a large number of times, n, and in r cases the event A occurs, then r/n is called the relative frequency of A. If this leads to a limit as n → ∞, this limit is Pr(A).
Standard deviation
A measure of dispersion or variation. The most commonly used measure of the spread of a set of observations. Equal to the positive square root of the variance.
mean
A measure of location or central value for a continuous variable. For a definition of the population value see expected value. For a sample of observations x1, x2, ..., xn the measure is calculated as x̄ = (x1 + x2 + ... + xn)/n. Most useful when the data have a symmetric distribution and do not contain outliers.
Interquartile Range
A measure of spread given by the difference between the first and third quartiles of a sample.
Relative risk (RR or risk ratio)
A measure of the association between exposure to a particular factor and risk of a certain outcome, calculated as relative risk = (incidence rate among exposed)/(incidence rate among nonexposed). Thus a relative risk of 5, for example, means that an exposed person is 5 times as likely to have the disease as one who is not exposed. Relative risk does not measure the probability that someone with the factor will develop the disease. The disease may be rare among both the nonexposed and the exposed.
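For instance, a sketch of the calculation on hypothetical cohort counts (all numbers invented for illustration):

    # hypothetical cohort: 1000 exposed and 1000 nonexposed persons
    cases_exposed, n_exposed = 30, 1000          # incidence 0.030
    cases_unexposed, n_unexposed = 6, 1000       # incidence 0.006

    rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    print(rr)   # 5.0: exposed persons are 5 times as likely to develop disease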
kappa
A measure of the degree of nonrandom agreement between observers or measurements of the same categorical variable, κ = (Po − Pe)/(1 − Pe), where Po is the proportion of times the measurements agree, and Pe is the proportion of times they can be expected to agree by chance alone. If the measurements agree more often than expected by chance, kappa is positive; if concordance is complete, kappa = 1; if there is no more nor less than chance concordance, kappa = 0; if the measurements disagree more than expected by chance, kappa is negative.
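A short sketch of the calculation from a hypothetical 2 x 2 agreement table (counts invented for illustration), assuming NumPy is available:

    import numpy as np

    # rows: rater 1's categories; columns: rater 2's categories
    table = np.array([[40, 10],
                      [ 5, 45]])
    n = table.sum()
    po = np.trace(table) / n                                   # observed agreement
    pe = (table.sum(axis=0) * table.sum(axis=1)).sum() / n**2  # chance agreement
    print((po - pe) / (1 - pe))                                # kappa = 0.70 here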
Rate
A measure of the frequency of some phenomenon of interest given by rate = (number of events in specified period)/(average population during the period). (The resulting value is often multiplied by some power of ten to convert it to a whole number.)
prevalence
A measure of the number of people in a population who have a particular disease at a given point in time. Can be measured in two ways, point prevalence and period prevalence: point prevalence = (number of existing cases at a point in time)/(total population at that time), and period prevalence = (number of cases present during a specified period)/(average population during the period). Essentially measures the existence of a disease.
Incidence
A measure of the rate at which people without a disease develop the disease during a specific period of time. It measures the appearance of disease. More generally, the number of new events, e.g. new cases of a disease in a specified population, within a specified period of time. The term incidence is sometimes wrongly used to denote incidence rate. incidence=(number of new cases of a disease over a period of time)/(population at risk of the disease in the time period)
Stem and leaf plot
A method of displaying data in which each observation is split into two parts labeled the 'stem' and the 'leaf'. A tally of the leaves corresponding to each stem has the shape of a histogram but also retains the actual observation values.
proportional hazards model
A method that allows the hazard function to be modeled on a set of explanatory variables without making restrictive assumptions about the dependence of the hazard function on time. The model involved is h(t) = a(t) exp(β1x1 + β2x2 + ... + βqxq), where x1, x2, ..., xq are the explanatory variables of interest and h(t) is the hazard function. The so-called baseline hazard function, a(t), is an arbitrary function of time. For any two individuals at any point in time the ratio of the hazard functions is a constant. Because the baseline hazard function, a(t), does not have to be specified explicitly, the procedure is essentially a distribution free method. Estimates of the parameters in the model, i.e. β1, β2, ..., βq, are usually obtained by maximum likelihood estimation, and depend only on the order in which events occur, not on the exact times of their occurrence.
Measurement error
A mismatch between an estimated value and its true value. Can be observed when using multiple measures of the same entity or concept.
dichotomous observation
A nominal measure with two outcomes (examples are gender male or female; survival yes or no); also called binary. See dichotomous data.
Kaplan-Meier estimate (product limit method)
A nonparametric method of compiling life or survival tables. This combines calculated probabilities of survival with estimates to allow for censored observations, which are assumed to occur randomly. The intervals are defined as ending each time an event (death, withdrawal) occurs and are therefore unequal.
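A bare-bones sketch of the product limit calculation on hypothetical survival times, assuming NumPy is available; real analyses would normally use a packaged routine such as those in the lifelines library.

    import numpy as np

    # hypothetical follow-up times; event = 1 for death, 0 for censored
    time  = np.array([2, 3, 3, 5, 6, 7, 9, 9])
    event = np.array([1, 1, 0, 1, 0, 1, 1, 0])

    s = 1.0
    for t in np.unique(time[event == 1]):          # distinct event times
        n_at_risk = np.sum(time >= t)              # at risk just before t
        d = np.sum((time == t) & (event == 1))     # deaths at t
        s *= 1 - d / n_at_risk                     # product limit update
        print(t, round(s, 3))                      # estimated S(t)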
Parameter
A numerical characteristic of a population or a model. The probability of a 'success' in a binomial distribution, for example.
Statistic
A numerical characteristic of a sample. For example, the sample mean and sample variance.
symmetric distribution
A probability distribution or frequency distribution that is symmetrical about some central value.
normal distribution
A probability distribution, f(x), of a random variable, X, that is assumed by many statistical methods. Specifically given by f(x) = [1/(σ√(2π))] exp[−(x − µ)²/(2σ²)], where µ and σ² are, respectively, the mean and variance of X. This distribution is bell-shaped.
experiment (in probability)
A probability experiment involves performing a number of trials to measure the chance of the occurrence of an event or outcome. http://www.uic.edu/classes/upp/upp503/sanders4-5.pdf
Bonferroni correction
A procedure for guarding against an increase in the probability of a type I error when performing multiple significance tests. To maintain the probability of a type I error at some selected value (α), each of the m tests to be performed is judged against a significance level (α/m ). For a small number of simultaneous tests (up to five) this method provides a simple and acceptable answer to the problem of multiple testing. It is however highly conservative and not recommended if large numbers of tests are to be applied, when one of the many other multiple comparison procedures available is generally preferable.
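A sketch of the adjustment on hypothetical p-values, simply comparing each against α/m:

    # hypothetical p-values from m = 5 simultaneous significance tests
    p_values = [0.001, 0.013, 0.020, 0.045, 0.300]
    alpha = 0.05
    m = len(p_values)

    for p in p_values:
        # each test is judged against alpha / m rather than alpha
        print(p, "reject" if p < alpha / m else "do not reject")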
life table analysis
A procedure often applied in prospective studies to examine the distribution of mortality and/or morbidity in one or more diseases in a cohort study of patients over a fixed period of time. For each specific increment in the follow-up period, the number entering the period, the number leaving during the period, and the number either dying from the disease (mortality) or developing the disease (morbidity) are all calculated. It is assumed that an individual who has not completed the follow-up period is exposed for half this period, thus enabling the data for those 'leaving' and those 'staying' to be combined into an appropriate denominator for the estimation of the percentage dying from or developing the disease. The advantage of this approach is that all patients, not only those who have been involved for an extended period, can be included in the estimation process.
double-blinded trial
A procedure used in clinical trials to avoid the possible bias that might be introduced if the patient and/or doctor knew which treatment the patient is receiving. If neither the patient nor doctor are aware of which treatment has been given the trial is termed double-blind.
blinded study (blinding)
A procedure used in clinical trials to avoid the possible bias that might be introduced if the patient and/or doctor knew which treatment the patient is receiving. If neither the patient nor doctor are aware of which treatment has been given the trial is termed double-blind. If only one of the patient or doctor is unaware, the trial is called single-blind. Clinical trials should use the maximum degree of blindness that is possible, although in some areas, for example, surgery, it is often impossible for an investigation to be double-blind.
subjective probability (personal probability)
A radically different approach to allocating probabilities to events than, for example, the commonly used long-term relative frequency approach. In this approach, probability represents a degree of belief in a proposition, based on all the available information. Two people with different information and different subjective ignorance may therefore assign different probabilities to the same proposition. The only constraint is that a single person's probabilities should not be inconsistent.
confidence interval (CI)
A range of values, calculated from the sample observations, that is believed, with a particular probability, to contain the true value of a population parameter. A 95% confidence interval, for example, implies that were the estimation process repeated again and again, then 95% of the calculated intervals would be expected to contain the true parameter value. Note that the stated probability level refers to properties of the interval and not to the parameter itself which is not considered a random variable.
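As an illustration, a 95% confidence interval for a population mean based on the t-distribution, computed on a small invented sample (assumes NumPy and SciPy):

    import numpy as np
    from scipy import stats

    x = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 4.7, 5.0])  # hypothetical data
    n = len(x)
    se = x.std(ddof=1) / np.sqrt(n)              # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=n - 1)        # two-sided 95% critical value
    print(x.mean() - t_crit * se, x.mean() + t_crit * se)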
stepwise regression (selection models in regression)
A series of methods for selecting 'good' (although not necessarily the best) subsets of explanatory variables when using regression analysis. The three most commonly used of these methods are forward selection, backward elimination and a combination of both of these known as stepwise regression. The criterion used for assessing whether or not a variable should be added to an existing model in forward selection or removed from an existing model in backward elimination is, essentially, the change in the residual sum-of-squares produced by the inclusion or exclusion of the variable. Specifically, in forward selection, an 'F-statistic' known as the F-to-enter is calculated as F = (RSSm − RSSm+1)/[RSSm+1/(n − m − 2)] and compared with a preset value; calculated Fs greater than the preset value lead to the variable under consideration being added to the model. (RSSm and RSSm+1 are the residual sums of squares when models with m and m + 1 explanatory variables have been fitted.) In backward selection a calculated F less than a corresponding F-to-remove leads to a variable being removed from the current model. In the stepwise procedure, variables are entered as with forward selection, but after each addition of a new variable, those variables currently in the model are considered for removal by the backward elimination process. In this way it is possible that variables included at some earlier stage might later be removed, because the presence of new variables has made their contribution to the regression model no longer important. It should be stressed that none of these automatic procedures for selecting variables is foolproof and they must be used with care.
factor analysis
A set of statistical methods for analyzing the correlations among several variables in order to estimate the number of fundamental dimensions that underlie the observed data and to describe and measure those dimensions. Used frequently in the development of scoring systems for rating scales and questionnaires.
Standardization
A set of techniques used to remove as far as possible the effects of age or other confounding variables when comparing two or more populations. The common method uses weighted averaging of rates specific for age, sex, or some other confounding variable(s) according to some specified distribution of these variables.
one-tailed test (one-sided test)
A significance test for which the alternative hypothesis is directional; for example, that one population mean is greater than another. The choice between a one-sided and two-sided test must be made before any test statistic is calculated.
Chi-square statistic
A statistic having, at least approximately, a chi-squared distribution.
unbiased
A statistic is said to be an unbiased estimate of a given parameter when the mean of the sampling distribution of that statistic can be shown to be equal to the parameter being estimated. For example, the mean of a sample is an unbiased estimate of the mean of the population from which the sample was drawn.
Test statistic
A statistic used to assess a particular hypothesis in relation to some population. The essential requirement of such a statistic is a known distribution when the null hypothesis is true.
logistic model
A statistical model of an individual's risk (probability of disease y) as a function of a risk factor x: P(y | x) = 1/(1 + e^(−α−βx)), where e is the base of the natural logarithms. This model has a desirable range, 0 to 1, and other attractive statistical features. In the multiple logistic model, the term βx is replaced by a linear term involving several factors, e.g., β1x1 + β2x2 if there are two factors x1 and x2.
log-linear model
A statistical model that uses an analysis of variance type of approach for the modeling of frequency counts in contingency tables.
Cox regression model (Proportional Hazards Model)
A statistical model used in survival analysis developed by D.R. Cox in 1972 asserting that the effect of the study factors on the hazard rate in the study population is multiplicative and does not change over time.
likelihood ratio test
A statistical test based on the ratio of the maximum value of the likelihood function under one statistical model to the maximum value under another statistical model; the models differ in that one includes and the other excludes one or more parameters.
Goodness of fit test
A statistical test of the hypothesis that data have been randomly sampled or generated from a population that follows a particular theoretical distribution or model. The most common such tests are chi-square tests.
Intervention Study
A study in which conditions are under the direct control of the investigator. In epidemiology, a study in which a population is selected for a planned trial of a regimen whose effects are measured by comparing the outcome of the regimen in the experimental group with the outcome of another regimen in a control group.
experimental study
A study in which conditions are under the direct control of the investigator. In epidemiology, a study in which a population is selected for a planned trial of a regimen whose effects are measured by comparing the outcome of the regimen in the experimental group with the outcome of another regimen in a control group.
experiment
A study in which the investigator intentionally alters one or more factors under controlled conditions in order to study the effects of doing so.
observational study
A study in which the objective is to uncover cause-and-effect relationships but in which it is not feasible to use controlled experimentation, in the sense of being able to impose the procedure or treatments whose effects it is desired to discover, or to assign subjects at random to different procedures. Surveys and most epidemiologic studies fall into this class. Since the investigator does not control the assignment of treatments there is no way to ensure that similar subjects receive different treatments. The classical example of such a study that successfully uncovered evidence of an important causal relationship is the smoking and lung cancer investigation of Doll and Hill.
adjusted rate (adjustment)
A summarizing procedure for a statistical measure in which the effects of differences in composition of the populations being compared have been minimized by statistical methods. Examples are adjustment by regression analysis and by standardization. Adjustment is often performed on rates or relative risks, commonly because of differing age distributions in populations that are being compared. The mathematical procedure commonly used to adjust rates for age differences is direct or indirect standardization. **Age adjustment can make the different groups more comparable. A "standard" population distribution is used to adjust death and hospitalization rates. The age-adjusted rates are rates that would have existed if the population under study had the same age distribution as the "standard" population.
Mantel-Haenszel test
A summary chi-square test developed by Mantel and Haenszel for stratified data and used when controlling for confounding.
Interaction
A term applied when two (or more) explanatory variables do not act independently on a response variable, for example in a 2 x 2 factorial design. In statistics, the presence of interaction implies the need for a product term in a linear model.
homogeneity (homogeneous)
A term that is used in statistics to indicate the equality of some quantity of interest (most often a variance), in a number of different groups, populations, etc.
Regression coefficient (multiple regression)
A term usually applied to models in which a continuous response variable, y, is regressed on a number of explanatory variables, x1, x2, ..., xq. Explicitly the model fitted is y = β0 + β1x1 + β2x2 + ... + βqxq + ε, where ε is a residual error term, and the parameters are estimated by least squares (see multiple regression). The regression coefficients β1, β2, ..., βq give the change in the response variable corresponding to a unit change in the appropriate explanatory variable, conditional on the other variables remaining constant. Significance tests of whether the coefficients take the value zero can be derived on the assumption that for a given set of values of the explanatory variables, y has a normal distribution with constant variance.
multiple regression
A term usually applied to models in which a continuous response variable, y, is regressed on a number of explanatory variables, x1, x2, ..., xq. Explicitly the model fitted is E(y) = β0 + β1x1 + β2x2 + ... + βqxq, where E denotes expected value. By introducing a vector β = (β0, β1, ..., βq)' and an n × (q + 1) matrix X whose ith row is (1, xi1, xi2, ..., xiq), the model for n observations can be written as y = Xβ + ε, where ε contains the residual error terms. Least squares estimation of the parameters involves solving the normal equations (X'X)β = X'y, giving the estimator (X'X)⁻¹X'y. The regression coefficients give the change in the response variable corresponding to a unit change in the appropriate explanatory variable, conditional on the other variables remaining constant. Significance tests of whether the coefficients take the value zero can be derived on the assumption that for a given set of values of the explanatory variables, y has a normal distribution with constant variance.
expected frequencies
A term usually encountered in the analysis of contingency tables. Such frequencies are estimates of the values to be expected under the hypothesis of interest. In a two-dimensional table, for example, the values under independence are calculated from the product of the appropriate row and column totals divided by the total number of observations.
gold standard trials
A term usually retained for those clinical trials in which there is random allocation to treatments, a control group and double-blinding.
Chi-square test for trend
A test applied to a two-dimensional contingency table in which one variable has two categories and the other has k ordered categories, to assess whether there is a difference in the trend of the proportions in the two groups. The result of using the ordering in this way is a test that is more powerful than using the chi-squared statistic to test for independence.
z-test
A test for assessing hypotheses about population means when their variances are known. For example, for testing that the means of two populations are equal, i.e. H0: μ1 = μ2, when the variance of each population is known to be σ², the test statistic is z = (x̄1 − x̄2)/(σ√(1/n1 + 1/n2)), where x̄1 and x̄2 are the means of samples of size n1 and n2 from the two populations. If H0 is true, z has a standard normal distribution.
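A sketch of the two-sample case on invented data, taking the common variance as known (assumes NumPy and SciPy):

    import numpy as np
    from scipy import stats

    sigma2 = 4.0                                  # known common variance
    x1 = np.array([10.2, 11.1, 9.8, 10.5, 10.9])  # hypothetical sample 1
    x2 = np.array([ 9.1,  9.5, 8.8,  9.9,  9.0])  # hypothetical sample 2

    z = (x1.mean() - x2.mean()) / np.sqrt(sigma2 * (1/len(x1) + 1/len(x2)))
    p = 2 * stats.norm.sf(abs(z))                 # two-sided p-value under N(0,1)
    print(z, p)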
McNemar test
A test for comparing proportions in data involving paired samples. The test statistic is given by X² = (b − c)²/(b + c), where b is the number of pairs for which the individual receiving treatment A has a positive response and the individual receiving treatment B does not, and c is the number of pairs for which the reverse is the case. If the probability of a positive response is the same in each group, then X² has a chi-squared distribution with a single degree of freedom.
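For example, with invented counts of discordant pairs (assumes SciPy for the p-value):

    from scipy import stats

    b, c = 15, 5                            # hypothetical discordant pair counts
    x2 = (b - c) ** 2 / (b + c)             # McNemar chi-squared statistic
    print(x2, stats.chi2.sf(x2, df=1))      # statistic and its p-value (1 df)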
logrank test
A test for comparing two or more sets of survival times, to assess the null hypothesis that there is no difference in the survival experience of the individuals in the different groups. For the two-group situation the test statistic is U = Σj wj(d1j − e1j), where the weights wj are all unity, d1j is the number of deaths in the first group at t(j), the jth ordered death time, j = 1, 2, ..., r, and e1j is the corresponding expected number of deaths given by e1j = n1j dj/nj, where dj is the total number of deaths at time t(j), nj is the total number of individuals at risk at this time, and n1j the number of individuals at risk in the first group. The expected value of U is zero and its variance is V = Σj wj² n1j(nj − n1j)dj(nj − dj)/[nj²(nj − 1)]. Consequently U/√V can be referred to a standard normal distribution to assess the hypothesis of interest. Other tests use the same test statistic with different values for the weights. The Tarone-Ware test, for example, uses wj = √nj, and the Peto-Prentice test uses weights based on an estimate of the overall survival function.
F-test
A test for the equality of the variances of two populations having normal distributions, based on the ratio of the variances of a sample of observations taken from each. Most often encountered in the analysis of variance, where testing whether particular variances are the same also tests for the equality of a set of means.
hazard rate (force of morbidity, instantaneous incidence rate):
A theoretical measure of the risk of an occurrence of an event, e.g. death or new disease, at a point in time, t, defined mathematically as the limit, as Δt approaches zero, of the probability that an individual well at time t will experience the event by t + Δt, divided by Δt.
z-transformation (Fisher's Z transformation)
A transformation of Pearson's product moment correlation coefficient, r, given by z = ½ ln[(1 + r)/(1 − r)]. The statistic z has a normal distribution with mean ½ ln[(1 + ρ)/(1 − ρ)], where ρ is the population correlation value, and variance 1/(n − 3), where n is the sample size. The transformation may be used to test hypotheses and to construct confidence intervals for ρ.
Fisher's z-transformation
See: z-transformation.
placebo
A treatment designed to appear exactly like a comparison treatment, but which is devoid of the active component.
Scatterplot (Synonym for scatter diagram)
A two-dimensional plot of a sample of bivariate observations. The diagram is an important aid in assessing what type of relationship links the two variables.
proportion
A type of ratio in which the numerator is included in the denominator.
Random variable
A variable, the values of which occur according to some specified probability distribution.
Randomization (randomized experiment)
Allocation of individuals to groups, e.g., for experimental and control regimens, by chance.
Risk factor
An aspect of personal behavior or lifestyle, an environmental exposure, or an inborn or inherited characteristic which is thought to be associated with a particular disease or condition.
weighted average
An average of quantities to which have been attached a series of weights in order to make proper allowance for their relative importance. For example, a weighted arithmetic mean of a set of observations x1, x2, ..., xn, with weights w1, w2, ..., wn, is given by (w1x1 + w2x2 + ... + wnxn)/(w1 + w2 + ... + wn).
degrees of freedom
An elusive concept that occurs throughout statistics. Essentially the term means the number of independent units of information in a sample relevant to the estimation of a parameter or calculation of a statistic. For example, in a two-by-two contingency table with a given set of marginal totals, only one of the four cell frequencies is free and the table has therefore a single degree of freedom. In many cases the term corresponds to the number of parameters in a model. Also used to refer to a parameter of various families of distributions, for example, Student's t-distribution and the F-distribution.
Randomized controlled trial (RCT)
An epidemiologic experiment in which subjects in a population are randomly allocated into groups, usually called study and control groups, to receive or not receive an experimental preventive or therapeutic procedure, maneuver, or intervention. The results are assessed by rigorous comparison of rates of disease, death, recovery, or other appropriate outcome in the study and control groups. RCTs are generally regarded as the most scientifically rigorous method of hypothesis testing available in epidemiology.
factor
An event, characteristic, or other definable entity that brings about a change in a health condition or other defined outcome.
Sensitivity
An index of the performance of a diagnostic test, calculated as the percentage of individuals with a disease who are correctly classified as having the disease, i.e. the conditional probability of having a positive test result given having the disease. A test is sensitive to the disease if it is positive for most individuals having the disease. With test results cross-classified against true disease status:

Screening test result   Diseased   Not diseased   Total
Positive                a          b              a + b
Negative                c          d              c + d
Total                   a + c      b + d          a + b + c + d

a. diseased individuals detected by the test (true positives)
b. nondiseased individuals positive by the test (false positives)
c. diseased individuals not detected by the test (false negatives)
d. nondiseased individuals negative by the test (true negatives)

Sensitivity = a/(a + c)
Specificity = d/(b + d)
Predictive value (positive test result) = a/(a + b)
Predictive value (negative test result) = d/(c + d)
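A sketch of these four indices computed from invented cell counts a, b, c, d:

    # hypothetical screening-test counts from the 2 x 2 table above
    a, b, c, d = 90, 40, 10, 860            # TP, FP, FN, TN

    print("sensitivity:", a / (a + c))      # 0.90
    print("specificity:", d / (b + d))      # about 0.96
    print("PPV:", a / (a + b))              # about 0.69; depends on prevalence
    print("NPV:", d / (c + d))              # about 0.99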
Specificity
An index of the performance of a diagnostic test, calculated as the percentage of individuals without the disease who are classified as not having the disease, i.e. the conditional probability of a negative test result given that the disease is absent. A test is specific if it is positive for only a small percentage of those without the disease.
correlation coefficient r (Pearson product moment)
An index that quantifies the linear relationship between a pair of variables; in a bivariate normal distribution, for example, the parameter ρ. An estimator of ρ obtained from n sample values of the two variables of interest, (x1, y1), (x2, y2), ..., (xn, yn), is Pearson's product moment correlation coefficient r, given by r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²]. The coefficient takes values between -1 and 1, with the sign indicating the direction of the relationship and the numerical magnitude its strength. Values of -1 and 1 indicate that the sample values fall on a straight line. A value of zero indicates the lack of any linear relationship between the two variables.
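A direct transcription of the formula on invented paired data (assumes NumPy); np.corrcoef gives the same value:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical paired data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
    print(num / den, np.corrcoef(x, y)[0, 1])   # identical values of r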
censored observation
An observation Xi on some variable of interest is said to be censored if it is known only that Xi ≤ Li (left-censored) or Xi ≥ Ui (right-censored), where Li and Ui are fixed values. Such observations arise most frequently in studies where the variable of main interest is the time until a particular event occurs (for example, time to death) and, at the completion of the study, the event of interest has not happened to a number of subjects.
post hoc comparisons
Analyses not explicitly planned at the start of a study but suggested by an examination of the data. Such comparisons are generally performed only after obtaining a significant overall F value.
Regression
As used by Francis Galton (1822-1911), one of the founders of biometry, in his book Hereditary Genius (1869), this meant the tendency of offspring of exceptional parents to possess characteristics closer to the average for the general population. Hence "regression to the mean," i.e. the tendency of individuals at the extremes to have values nearer to the mean on repeated measurement. Can also be a synonym for regression analysis in statistics.
Bayes' theorem
Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
categorical data
Categorical data represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level. While the latter two variables may also be considered in a numerical manner by using exact values for age and highest grade completed, it is often more informative to categorize such variables into a relatively small number of groups.
multivariate data
Data for which each observation consists of values for more than one random variable, for example, measurements on blood pressure, temperature and heart rate for a number of subjects. Such data are usually displayed in the form of a data matrix X = (xij) with n rows and q columns, where n is the number of subjects, q the number of variables and xij the observation on variable j for subject i.
Goodness of fit
Degree of agreement between an empirically observed distribution and a mathematical or theoretical distribution.
factorial designs
Designs which allow two or more questions to be addressed in an investigation. The simplest factorial design is one in which each of two treatments or interventions is either present or absent, so that subjects are divided into four groups: those receiving neither treatment, those having only the first treatment, those having only the second treatment and those receiving both treatments. Such designs enable possible interactions between factors to be investigated. A very important special case of a factorial design is that where each of k factors of interest has only two levels; these are usually known as 2^k factorial designs. A single replicate of a 2^k design is sometimes called an unreplicated factorial.
quantiles
Divisions of a probability distribution or frequency distribution into equal, ordered subgroups, for example, quartiles or percentiles.
dummy coding
Dummy coding provides one way of using categorical predictor variables in various kinds of estimation models (see also effect coding), such as, linear regression. Dummy coding uses only ones and zeros to convey all of the necessary information on group membership. http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm
Random sample
Either a set of n independent and identically distributed random variables, or a sample of n individuals selected from a population in such a way that each sample of the same size is equally likely.
mutually exclusive events
Events that cannot occur jointly.
probability distribution
For a discrete random variable, a mathematical formula that gives the probability of each value of the variable. See, for example, binomial distribution and Poisson distribution. For a continuous random variable, a curve described by a mathematical formula which specifies, by way of areas under the curve, the probability that the variable falls within a particular interval. Examples include the normal distribution and the exponential distribution. In both cases the term probability density may also be used. (A distinction is sometimes made between 'density' and 'distribution', when the latter is reserved for the probability that the random variable falls below some value. In this dictionary, however, the latter will be termed the cumulative probability distribution, and probability distribution and probability density will be used synonymously.)
Regression analysis
Given data on a dependent variable y and one or more independent or predictor variables x1, x2, etc., regression analysis involves finding the "best" mathematical model (within some restricted class of models) to describe y as a function of the x's, or to predict y from the x's. The most common form is a linear model; in epidemiology, the logistic and proportional hazards models are also common.
Central Limit Theorem
If a random variable Y has population mean µ and population variance σ², then the sample mean, ȳ, based on n observations, has approximately a normal distribution with mean µ and variance σ²/n, for sufficiently large n. The theorem occupies an important place in statistical theory. In short, the Central Limit Theorem states that if the sample size is large enough, the distribution of sample means can be approximated by a normal distribution, even if the original population is not normally distributed.
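The theorem is easy to see by simulation; the sketch below (assuming NumPy) draws samples from a markedly non-normal population and shows that the sample means behave as the theorem predicts:

    import numpy as np

    rng = np.random.default_rng(0)
    # exponential population: mean 1, variance 1, strongly skewed
    samples = rng.exponential(scale=1.0, size=(10_000, 50))
    means = samples.mean(axis=1)     # 10,000 sample means, each with n = 50

    print(means.mean())              # close to the population mean, 1.0
    print(means.std())               # close to sigma/sqrt(n) = 1/sqrt(50), ~0.141
    # a histogram of `means` would look approximately normal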
variance
In a population, the second moment about the mean. An unbiased estimator of the population value is provided by s² = Σ(xi − x̄)²/(n − 1), where x1, x2, ..., xn are the n sample observations and x̄ is the sample mean.
bias
In general terms, deviations of results or inferences from the truth, or processes leading to such deviation. More specifically, the extent to which the statistical method used in a study does not estimate the quantity thought to be estimated, or does not test the hypothesis to be tested. In estimation, bias is usually measured by the difference between a parameter estimate and its expected value; an estimator whose expected value equals the parameter is said to be unbiased.
population
In statistics this term is used for any finite or infinite collection of 'units', which are often people but may be, for example, institutions, events, etc.
hazard
Inherent capability of an agent or situation to have an adverse effect. A factor or exposure that may adversely affect health.
Mantel-Haenszel estimate
Mantel and Haenszel provided an adjusted odds ratio as an estimate of relative risk that may be derived from grouped and matched sets of data. It is now known as the Mantel-Haenszel estimate. The statistic may be regarded as a type of weighted average of the individual odds ratios, derived from stratifying a sample into a series of strata that are internally homogeneous with respect to confounding factors. The Mantel-Haenszel summarization method can also be extended to the summarization of rate ratios and rate differences from follow-up studies. An estimator of the assumed common odds ratio in a series of two-by-two contingency tables arising from different populations, for example, occupation, country of origin, etc.
complementary event
Mutually exclusive events A and B for which Pr(A) + Pr(B) = 1 where Pr denotes probability.
binary variable (binary observation)
Observations which occur in one of two possible states, these often being labeled 0 and 1. Such data are frequently encountered in medical investigations; commonly occurring examples include 'dead/alive', 'improved/not improved' and 'depressed/not depressed'. Data involving this type of variable often require specialized techniques for their analysis, such as logistic regression.
covariate
Often used simply as an alternative name for explanatory variables, but perhaps more specifically to refer to variables that are not of primary interest in an investigation, but are measured because it is believed that they are likely to affect the response variable and consequently need to be included in analyses and model building.
analysis of covariance (ANCOVA)
Originally used for an extension of the analysis of variance that allows for the possible effects of continuous concomitant variables (covariates) on the response variable, in addition to the effects of the factor or treatment variables. It is usually assumed that covariates are unaffected by treatments and that their relationship to the response is linear. If such a relationship exists, then inclusion of covariates in this way decreases the error mean square and hence increases the sensitivity of the F-tests used in assessing treatment differences. The term now appears to be used more generally for almost any analysis seeking to assess the relationship between a response variable and a number of explanatory variables. **An ANOVA is a regression, just one where all of the covariates are categorical. An ANCOVA is a regression with qualitative and continuous covariates. **Analysis of covariance is used to test the main and interaction effects of categorical variables on a continuous dependent variable, controlling for the effects of selected other continuous variables, which co-vary with the dependent. The control variables are called the "covariates." ANCOVA is used for several purposes: in experimental designs, to control for factors which cannot be randomized but which can be measured on an interval scale; in observational designs, to remove the effects of variables which modify the relationship of the categorical independents to the interval dependent.
multiple comparison test
Procedures for detailed examination of the differences between a set of means, usually after a general hypothesis that they are all equal has been rejected. No single technique is best in all situations and a major distinction between techniques is how they control the possible inflation of the type I error.
Repeated-measures design
Repeated measures is a type of analysis of variance that generalizes Student's t test for paired samples. It is used when two or more measurements of the same type are made on the same subject. Analysis of variance is characterized by the use of factors, which are composed of levels. Repeated measures analysis of variance involves two types of factors: between-subjects factors and within-subjects factors. The repeated measures make up the levels of the within-subjects factor. For example, suppose each subject has his/her reaction time measured under three different conditions; the conditions make up the levels of the within-subjects factor. Depending on the study, subjects may be divided into groups according to levels of other factors called between-subjects factors. Each subject is observed at only a single level of a between-subjects factor. For example, if subjects were randomized to aerobic or stretching exercise, form of exercise would be a between-subjects factor. The levels of a within-subjects factor change as we move within a subject, while levels of a between-subjects factor change only as we move between subjects.
Gaussian distribution
See: normal distribution
Variable
Some characteristic that differs from subject to subject or from time to time. Any attribute, phenomenon, or event that can have different values.
statistical significance
Statistical methods allow an estimate to be made of the probability of the observed or greater degree of association between independent and dependent variables under the null hypothesis. From this estimate, in a sample of given size, the statistical "significance" of a result can be stated. Usually the level of statistical significance is stated by the p value.
nonparametric method (distribution free methods)
Statistical techniques of estimation and inference that are based on a function of the sample observations, the probability distribution of which does not depend on a complete specification of the probability distribution of the population from which the sample was drawn. Consequently the techniques are valid under relatively general assumptions about the underlying population. Often such methods involve only the ranks of the observations rather than the observations themselves. Examples are Wilcoxon's signed rank test and Friedman's two way analysis of variance. In many cases these tests are only marginally less powerful than their analogues which assume a particular population distribution (usually a normal distribution), even when that assumption is true. Also commonly known as distribution free methods, although the terms are not completely synonymous.
prospective study (cohort study)
Studies in which individuals are followed up over a period of time. A common example of this type of investigation is where samples of individuals exposed and not exposed to a possible risk factor for a particular disease are followed forward in time to determine what happens to them with respect to the illness under investigation. At the end of a suitable time period a comparison of the incidence of the disease amongst the exposed and non-exposed is made. A classical example of such a study is that undertaken among British doctors in the 1950s, to investigate the relationship between smoking and death from lung cancer. All clinical trials are prospective.
longitudinal study (cohort study)
Studies that give rise to longitudinal data. The defining characteristic of such a study is that subjects are measured repeatedly through time.
null hypothesis
The 'no difference' or 'no association' hypothesis to be tested (usually by means of a significance test) against an alternative hypothesis that postulates non-zero difference or association.
Chi-Square Distribution
The Chi-Square distribution is based on a normally distributed population with variance σ², with randomly selected independent samples of size n and computed sample variance s² for each sample. The sample statistic is X² = (n − 1)s²/σ². The chi-square distribution is skewed, the values can be zero or positive but not negative, and it is different for each number of degrees of freedom. Generally, as the number of degrees of freedom increases, the chi-square distribution approaches a normal distribution.
Target Population
The collection of individuals, items, measurements, etc., about which it is required to make inferences. Often the population actually sampled differs from the target population and this may result in misleading conclusions being made. The target population requires a clear precise definition, and that should include the geographical area (country, region, town, etc.) if relevant, the age group and gender.
Range
The difference between the largest and smallest observations in a data set. Often used as an easy-to-calculate measure of the dispersion in a set of observations but not recommended for this task because of its sensitivity to outliers and the fact that its value increases with sample size.
Residual
The difference between the observed value of a response variable (yi) and the value predicted by some model of interest (ŷi). Examination of a set of residuals, usually by informal graphical techniques, allows the assumptions made in the model fitting exercise, for example, normality, homogeneity of variance, etc., to be checked. Generally, discrepant observations have large residuals, but some form of standardization may be necessary in many situations to allow identification of patterns among the residuals that may be a cause for concern. The usual standardized residual for observation yi is calculated as (yi − ŷi)/(s√(1 − hi)), where s² is the estimated residual variance after fitting the model of interest and hi is the ith diagonal element of the hat matrix. An alternative definition of a standardized residual (sometimes known as the Studentized residual) replaces s² with the estimated residual variance from fitting the model after the exclusion of the ith observation.
binomial distribution
The distribution of the number of 'successes', X, in a series of n independent Bernoulli trials where the probability of success at each trial is p and the probability of failure is q = 1 − p. Given by Pr(X = x) = [n!/(x!(n − x)!)] p^x q^(n−x), x = 0, 1, ..., n.
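A sketch evaluating the probabilities directly from the formula, using invented values of n and p (Python 3.8+ for math.comb):

    from math import comb

    n, p = 10, 0.3                  # hypothetical: 10 trials, success prob 0.3
    q = 1 - p
    for x in range(n + 1):
        pr = comb(n, x) * p**x * q**(n - x)    # Pr(X = x)
        print(x, round(pr, 4))                 # the probabilities sum to 1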
Student's t-distribution
The distribution of the variable t = (x̄ − μ)/(s/√n), where x̄ is the arithmetic mean of n observations from a normal distribution with mean μ and s is the sample standard deviation. Given explicitly by f(t) = {Γ[(ν + 1)/2]/[√(νπ) Γ(ν/2)]} (1 + t²/ν)^(−(ν+1)/2), where ν = n − 1. The shape of the distribution varies with ν, and as ν gets larger the shape of the t-distribution approaches that of the standard normal distribution. A multivariate version of this distribution arises from considering a q-dimensional vector x having a multivariate normal distribution with zero mean vector and variance-covariance matrix Σ, and defining the elements ui of a vector u as ui = μi + xi/y^(1/2), i = 1, 2, ..., q, where νy has a chi-squared distribution with ν degrees of freedom; u then has a multivariate Student's t-distribution.
Type II error
The error of failing to reject a false null hypothesis; i.e. declaring a difference does not exist when it in fact does.
Reliability
The extent to which the same measurements of individuals obtained under different conditions yield similar results. Reliability refers to the degree to which the results obtained by a measurement procedure can be replicated. Lack of reliability may arise from divergences between observers or instruments of measurement, or from instability of the attribute being measured.
level of significance
The level of probability at which it is agreed that the null hypothesis will be rejected. Conventionally set at 0.05.
coefficient of variation (CV)
The measure of spread for a set of data defined as 100 × (standard deviation / mean): CV = (s/x̄) × 100 for a sample, and CV = (σ/μ) × 100 for a population. Originally proposed as a way of comparing the variability in different distributions, but found to be sensitive to errors in the mean. Simpler definition: the ratio of the standard deviation to the mean. This is meaningful only if the variable is measured on a ratio scale.
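A minimal Python sketch of the sample CV; the measurements below are invented for illustration:

```python
import numpy as np

# Hypothetical sample measured on a ratio scale.
x = np.array([12.1, 9.8, 11.4, 10.6, 13.0])

cv = x.std(ddof=1) / x.mean() * 100   # CV = (s / x-bar) * 100
print(cv)
```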
mode
The most frequently occurring value in a set of observations. Occasionally used as a measure of location.
Sampling distribution
The probability distribution of a statistic calculated from a random sample of a particular size. For example, the sampling distribution of the arithmetic mean of samples of size n taken from a normal distribution with mean μ and standard deviation σ is a normal distribution also with mean μ but with standard deviation σ/√n.
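This can be checked by simulation; the following Python sketch (all values invented) draws repeated samples and compares the empirical standard deviation of the sample means with σ/√n:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 50.0, 8.0, 25

# Draw many samples of size n and record each sample mean.
means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

# The empirical spread of the means should be close to sigma / sqrt(n).
print(means.std(ddof=1), sigma / np.sqrt(n))
```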
Intercept
The parameter in an equation derived from a regression analysis corresponding to the expected value of the response variable when all the explanatory variables are zero.
poisson distribution
The probability distribution of the number of occurrences, X, of some random event, in an interval of time or space. Given by Pr(X = x) = e^(−λ) λ^x / x!, x = 0, 1, .... The mean and variance of the distribution are both λ. The skewness of the distribution is 1/√λ, and its kurtosis is 3 + (1/λ). The distribution is positively skewed for all values of λ.
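A small illustration using scipy.stats; λ and x below are arbitrary example values:

```python
import numpy as np
from scipy import stats

lam = 3.5
x = 2
print(stats.poisson.pmf(x, lam))   # e^(-lambda) * lambda^x / x!

# Mean and variance both equal lambda; skewness is 1 / sqrt(lambda).
print(stats.poisson.mean(lam), stats.poisson.var(lam), 1 / np.sqrt(lam))
```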
alpha (α)
The probability of a type I error, the error of rejecting a true null hypothesis, i.e. declaring a difference exists when it does not.
beta (β)
The probability of a type II error, the error of failing to reject a false null hypothesis, i.e. declaring that a difference does not exist when in fact it does.
power
The probability of rejecting the null hypothesis when it is false. Power gives a method of discriminating between competing tests of the same hypothesis, the test with the higher power being preferred. It is also the basis of procedures for estimating the sample size needed to detect an effect of a particular magnitude. Mathematically, power is 1 − β, where β is the probability of a type II error.
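When no closed form is convenient, power can be estimated by simulation. The following Python sketch (sample size, effect size and α are arbitrary choices, not from the original text) estimates the power of a two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, delta, sd, alpha = 30, 0.5, 1.0, 0.05

# Simulate many two-sample experiments under a true difference delta
# and count how often the t-test rejects at level alpha.
n_sims = 5_000
rejections = 0
for _ in range(n_sims):
    a = rng.normal(0.0, sd, n)
    b = rng.normal(delta, sd, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1

print(rejections / n_sims)   # estimated power = 1 - beta
```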
conditional probability
The probability that an event occurs given the outcome of some other event, usually written Pr(A | B). For example, the probability of a person being colour blind given that the person is male is about 0.1, and the corresponding probability given that the person is female is approximately 0.0001. It is not, of course, necessary that Pr(A | B) = Pr(B | A); the probability of having spots given that a patient has measles, for example, is very high, whereas the probability of measles given that a patient has spots is much less. If Pr(A | B) = Pr(A), then the events A and B are said to be independent.
inference (statistical)
The process of drawing conclusions about a population on the basis of measurements or observations made on a sample of individuals from the population.
matching (or matched groups)
The process of making a study group and a comparison group comparable with respect to extraneous factors. Often used in retrospective studies when selecting cases and controls to control variation in a response variable due to sources other than those immediately under investigation. Several kinds of matching can be identified, the most common of which is when each case is individually matched with a control subject on the matching variables, such as age, sex, occupation, etc. When the variable on which the matching takes place is continuous it is usually transformed into a series of categories (e.g. age), but a second method is to say that two values of the variable match if their difference lies between defined limits. This method is known as caliper matching. Also important is group or category matching in which the distributions of the extraneous factors are made similar in the groups to be compared.
point estimate (estimation)
The process of providing a numerical value for a population parameter on the basis of information collected from a sample. If a single figure is calculated for the unknown parameter the process is called point estimation. If an interval is calculated which is likely to contain the parameter, then the procedure is called interval estimation.
false-negative
The proportion of cases in which a diagnostic test indicates disease is absent in patients who have the disease.
false-positive
The proportion of cases in which a diagnostic test indicates disease is present in disease-free patients.
crossover rate
The proportion of patients in a clinical trial transferring from the treatment decided by an initial random allocation to an alternative one.
likelihood ratio
The ratio of the likelihood of observing data under actual conditions to the likelihood of observing these data under other (e.g., "ideal") conditions; or a comparison of various model conditions to assess which model provides the best fit. Likelihood ratios are used to appraise screening and diagnostic tests in clinical epidemiology.
odds ratio (OR)
The ratio of two odds for a binary variable in two groups of subjects, for example, males and females. If the two possible states of the variable are labeled 'success' and 'failure', then the odds ratio is a measure of the odds of a success in one group relative to that in the other. When the odds of a success in each group are identical, the odds ratio is equal to one. Usually estimated as OR = ad/bc, where a, b, c and d are the appropriate frequencies in the two-by-two contingency table formed from the data.
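A minimal sketch of this estimate; the 2 × 2 frequencies below are invented:

```python
# Hypothetical 2x2 table:          exposed  unexposed
#                        disease      a=20       b=10
#                     no disease      c=80       d=90
a, b, c, d = 20, 10, 80, 90
odds_ratio = (a * d) / (b * c)   # OR = ad / bc
print(odds_ratio)
```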
Risk ratio
The ratio of two risks, usually exposed/not exposed.
Ranks
The relative positions of the members of a sample with respect to some characteristic.
Effective sample size
The sample size after dropouts, deaths and other specified exclusions from the original sample.
analysis of variance (ANOVA)
The separation of variance attributable to one cause from the variance attributable to others. By partitioning the total variance of a set of observations into parts due to particular factors, for example, sex, treatment group, etc., and comparing variances (mean squares) by way of F-tests, differences between means can be assessed. The simplest analysis of this type involves a one-way design, in which N subjects are allocated, usually at random, to the k different levels of a single factor. The total variation in the observations is then divided into a part due to differences between level means (the between groups sum of squares) and a part due to the differences between subjects in the same group (the within groups sum of squares, also known as the residual sum of squares). These terms are usually arranged as an analysis of variance table. If the means of the populations represented by the factor levels are the same, then, within the limits of random variation, the between groups mean square and within groups mean square should be the same. Whether this is so can be assessed by a suitable F-test, provided certain assumptions are met: that the response variable is normally distributed in each population and that the populations have the same variance. Essentially an example of a generalized linear model with an identity link function and normally distributed errors.
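For the one-way design described above, the F-test is available in scipy.stats.f_oneway; the group data below are hypothetical:

```python
from scipy import stats

# Hypothetical responses at k = 3 levels of a single factor.
g1 = [23, 25, 21, 22, 24]
g2 = [27, 29, 26, 30, 28]
g3 = [22, 20, 23, 21, 19]

# One-way ANOVA: F = between-groups mean square / within-groups mean square.
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)
```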
Standard error (SE)
The standard deviation of the sampling distribution of a statistic. For example, the standard error of the sample mean of n observations is √(s²/n) = s/√n, where s² is the variance of the original observations.
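A one-line Python illustration with invented data:

```python
import numpy as np

x = np.array([5.1, 4.8, 6.2, 5.5, 5.0, 5.9])
se = x.std(ddof=1) / np.sqrt(len(x))   # SE of the mean = s / sqrt(n)
print(se)
```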
contingency table (or two-way frequency table)
The table arising when observations on a number of categorical variables are cross-classified. Entries in each cell are the number of individuals with the corresponding combination of variable values. Most common are two-dimensional tables involving two categorical variables. The analysis of such two-dimensional tables generally involves testing for the independence of the two variables using the familiar chi-squared statistic. Three- and higher-dimensional tables are now routinely analyzed using log-linear models.
cumulative frequency distribution
The tabulation of a sample of observations in terms of numbers falling below particular values. The empirical equivalent of the cumulative probability distribution.
Two-way analysis of variance (factorial AOV)
The two-way analysis of variance is an extension of the one-way analysis of variance in which there are two independent variables (hence the name two-way). The two independent variables in a two-way ANOVA are called factors: the idea is that two variables, or factors, each affect the dependent variable. Each factor has two or more levels, and the degrees of freedom for each factor is one less than its number of levels. The same assumptions apply as for the one-way analysis of variance.
median
The value in a set of ranked observations that divides the data into two parts of equal size. When there is an odd number of observations the median is the middle value. When there is an even number of observations the measure is calculated as the average of the two central values. Provides a measure of location of a sample that is suitable for asymmetric distributions and is also relatively insensitive to the presence of outliers.
Ratio
The value obtained by dividing one quantity by another: a general term of which rate, proportion, percentage, etc., are subsets. The important difference between a proportion and a ratio is that the numerator of a proportion is included in the population defined by the denominator, whereas this is not necessarily so for a ratio.
critical value
The value with which a statistic calculated from sample data is compared in order to decide whether a null hypothesis should be rejected. The value is related to the particular significance level chosen.
quartile
The values that divide a frequency distribution or probability distribution into four equal parts.
dependent variable (response or outcome variable)
The variable of primary importance in investigations since the major objective is usually to study the effects of treatment and/or other explanatory variables on this variable and to provide suitable models for the relationship between it and the explanatory variables.
Response variable
The variable of primary importance in investigations, since the major objective is usually to study the effects of treatment and/or other explanatory variables on this variable and to provide suitable models for the relationship between it and the explanatory variables.
independent variable (explanatory variables):
The variables appearing on the right-hand side of the equations defining, for example, multiple regression or logistic regression, and which seek to predict or 'explain' the response variable. Using the term independent variable is not recommended since they are rarely independent of one another.
explanatory variable
The variables appearing on the right-hand side of the equations defining, for example, multiple regression or logistic regression, and which seek to predict or 'explain' the response variable. Also commonly known as the independent variables, although this term is not to be recommended since they are rarely independent of one another.
Random error or variation
The variation in a data set unexplained by identifiable sources.
Independence
Two events are said to be independent if the occurrence of one is in no way predictable from the occurrence of the other. Two variables are said to be independent if the distribution of values of one is the same for all values of the other.
z-score (standard scores)
Variable values transformed to zero mean and unit variance.
Survival analysis
a class of statistical procedures for estimating the survival function and for making inferences about the effects on it of treatments, prognostic factors, exposures, and other covariates.
Sums of squares
a concept in inferential statistics and descriptive statistics. More properly, it is "the sum of the squared deviations". Mathematically, it is an unscaled, or unadjusted, measure of variability. When scaled for the number of degrees of freedom, it estimates the variance, or spread, of the observations about their mean value. The distance from any point in a collection of data to the mean of the data is the deviation. This can be written as Xᵢ − X̄, where Xᵢ is the ith data point and X̄ is the estimate of the mean. If all such deviations are squared and then summed, as in Σᵢ(Xᵢ − X̄)², we have the "sum of squares" for these data. When more data are added to the collection, the sum of squares will increase, except in unlikely cases such as the new data being equal to the mean. So usually the sum of squares will grow with the size of the data collection; that is a manifestation of the fact that it is unscaled.
Discrete variable
a variable taking only a countable (typically finite) set of values, for example grade: 1, 2, 3, ..., 12.
linear regression (of Y on X)
a form of regression analysis in which observational data are modeled by a function which is a linear combination of the model parameters and depends on one or more independent variables. In simple linear regression the model function represents a straight line. The results of data fitting are subject to statistical analysis. The data consist of m values taken from observations of the dependent (response) variable y. The independent variables are also called regressors, exogenous variables, input variables and predictor variables. In simple linear regression the data model is written as yᵢ = β₀ + β₁xᵢ + εᵢ, where εᵢ is an observational error, and β₀ (intercept) and β₁ (slope) are the parameters of the model.
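A short Python sketch of simple linear regression using scipy.stats.linregress; the (x, y) pairs are invented:

```python
import numpy as np
from scipy import stats

# Hypothetical (x, y) observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fits y = b0 + b1 * x by least squares.
result = stats.linregress(x, y)
print(result.intercept, result.slope)   # estimates of beta_0 and beta_1
```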
frequency (occurrence)
a general term describing the frequency or occurrence of a disease or other attribute or event in a population without distinguishing between incidence and prevalence.
two-sample t-test
a hypothesis test for answering questions about the mean where the data are collected from two random samples of independent observations, each from an underlying normal distribution.
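A minimal sketch using scipy.stats.ttest_ind (which by default assumes equal variances); the two samples are invented:

```python
from scipy import stats

# Two independent random samples, assumed drawn from normal distributions.
a = [5.2, 4.9, 6.1, 5.5, 5.8, 5.0]
b = [4.4, 4.7, 4.1, 4.9, 4.3, 4.6]

t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)
```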
effect or effect size
a measure of the strength of the relationship between two variables. In scientific experiments, it is often useful to know not only whether an experiment has a statistically significant effect, but also the size of any observed effects. In practical situations, effect sizes are helpful for making decisions. Effect size measures are the common currency of meta-analysis studies that summarize the findings from a specific area of research.
Statistical test
a procedure that is intended to decide whether a hypothesis about the distribution of one or more populations or variables should be rejected or accepted. Statistical tests may be parametric or nonparametric. *parametric test: a statistical test that depends upon assumptions about the distribution of the data, e.g. that the data are normally distributed. *nonparametric test (distribution-free method): statistical techniques of estimation and inference that are based on a function of the sample observations, the probability distribution of which does not depend on a complete specification of the probability distribution of the population from which the sample was drawn. Consequently the techniques are valid under relatively general assumptions about the underlying population. Often such methods involve only the ranks of the observations rather than the observations themselves. Examples are Wilcoxon's signed rank test and Friedman's two-way analysis of variance. In many cases these tests are only marginally less powerful than their analogues which assume a particular population distribution (usually a normal distribution), even when that assumption is true. Also commonly known as distribution-free methods, although the two terms are not completely synonymous.
weighted sample
a sample that is not strictly proportional to the distribution of classes in the universe population. A weighted sample has been adjusted to include larger proportions of some parts of the population than of others, because the parts accorded greater "weight" would otherwise not have sufficient numbers in the sample to lead to generalizable conclusions, or because they are considered to be more important, more interesting, or more worthy of detailed study, among other reasons.
Sample
a selected subset of a population. A sample may be random or nonrandom and may be representative or nonrepresentative. Several types of samples exist:
1. area sample - a method of sampling that can be used when the numbers in the population are unknown. The total area to be sampled is divided into subareas, e.g. by means of a grid that produces squares on a map; these subareas are then numbered and sampled, using a table of random numbers.
2. cluster sample - each unit selected is a group of persons (all persons in a city block, a family, a school, etc.) rather than an individual.
3. grab sample (sample of convenience) - samples selected by easily employed but basically nonprobabilistic methods. It is improper to generalize from the results of a survey based upon such a sample, for there is no way of knowing what types of bias may have been present.
4. probability (random) sample - all individuals have a known chance of selection. They may all have an equal chance of being selected, or, if a stratified sampling method is used, the rate at which individuals from several subsets are sampled can be varied so as to produce greater representation of some classes than others.
5. simple random sample - a form of sampling design in which n distinct units are selected from the N units in the population in such a way that every possible combination of n units is equally likely to be the sample selected. With this type of design the probability that the ith population unit is included in the sample is n/N, so that the inclusion probability is the same for each unit. Designs other than this one may also give each unit equal probability of being included, but only here does each possible sample of n units have the same probability.
6. stratified random sample - this involves dividing the population into distinct subgroups according to some important characteristic, such as age or socioeconomic status, and selecting a random sample out of each subgroup. If the proportion of the sample drawn from each of the subgroups, or strata, is the same as the proportion of the total population contained in each stratum, then all strata will be fairly represented with regard to numbers of persons in the sample.
7. systematic sample - the procedure of selecting according to some simple, systematic rule, such as all persons whose names begin with specified alphabetic letters, born on certain dates, or located at specified points on a list. A systematic sample may lead to errors that invalidate generalizations.
multivariate analysis
a set of techniques used when the variation in several variables has to be studied simultaneously. In statistics any analytic method that allows the simultaneous study of two or more dependent variables.
principal component analysis
a statistical method to simplify the description of a set of interrelated variables. Its general objectives are data reduction and interpretation; there is no separation into dependent and independent variables; the original set of correlated variables is transformed into a smaller set of uncorrelated variables called the principal components. Often used as the first step in a factor analysis.
two-tailed test
a statistical significance test in which departures from the null hypothesis in either direction are of interest, so that values of the test statistic in both tails of its distribution lead to rejection.
parametric test
a statistical test that depends upon assumptions about the distribution of the data, e.g. that the data are normally distributed.
percentage
a way of expressing a number as a fraction of 100 (per cent meaning "per hundred").
frequency table
a way of summarizing data; used as a record of how often each value (or set of values) of a variable occurs. A frequency table is used to summarize categorical, nominal, and ordinal data. It may also be used to summarize continuous data once the data is divided into categories.
Survey
an investigation in which information is systematically collected but in which the experimental method is not used. A population survey may be conducted by face-to-face inquiry, self-completed questionnaires, telephone, postal service, or in some other way.
ordinal scale
classification into ordered qualitative categories, e.g. grade, where the values have a distinct order but their categories are qualitative in that there is no natural (numerical) distance between their possible values.
nominal scale
classification into unordered qualitative categories, e.g. race, religion, country of birth. Measurements of individual attributes are purely nominal scales, as there is no inherent order to their categories.
homoscedasticity
homo- means "same" and -scedastic means "scattered"; homoscedasticity therefore means the constancy of the variance of a measure over the levels of the factors under study.
multicollinearity
in multiple regression analysis, a situation in which at least some of the independent variables are highly correlated with each other. Such a situation can result in inaccurate estimates of the parameters in the regression model.
dummy variable (indicator variable)
in statistics, a variable taking only one of two possible values, one (usually 1) indicating the presence of a condition, and the other (usually 0) indicating the absence of the condition, used mainly in regression analysis.
frequency distribution
lists data values (either individually or by groups of intervals), along with their corresponding frequencies (or counts).
outliers
observations differing so widely from the rest of the data as to lead one to suspect that a gross error may have been committed, or suggesting that these values come from a different population. Statistical handling of outliers varies and is difficult.
dichotomous scale
one that arranges items into either of two mutually exclusive categories, e.g. yes/no, alive/dead.
bivariate
involving two variables; for example, a "bivariate binomial distribution" describes the joint distribution of two outcomes, each belonging to two categories, e.g. yes/no, acceptable/defective.
continuous data
result from infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions or jumps, e.g. blood pressure.
discrete data
result when the number of possible values is either a finite number or a "countable" number.
systematic error
see bias
Variation
see coefficient of variation
cohort
see cohort study
outcome variable
see dependent variable
Standard normal distribution
see normal distribution
Inter-rater reliability (observer variation, inter-rater agreement, Concordance)
the degree of agreement among raters. It gives a score of how much homogeneity or consensus there is in the ratings given by judges. It is useful in refining the tools given to human judges, for example by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained. There are a number of statistics which can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are: joint-probability of agreement, Cohen's kappa and the related Fleiss' kappa, inter-rater correlation, concordance correlation coefficient and intra-class correlation.
mean-squared error
the expected value of the square of the difference between an estimator and the true value of a parameter. If the estimator is unbiased then the mean squared error (MSE) is simply the variance of the estimator. For a biased estimator the MSE is equal to the sum of the variance and the square of the bias.
kurtosis
the extent to which a unimodal distribution is peaked.
logit (log-odds)
the logarithm of the ratio of frequencies of two different categorical outcomes such as healthy versus sick.
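A tiny Python illustration with hypothetical frequencies:

```python
import math

# Hypothetical frequencies: 30 sick vs 70 healthy.
sick, healthy = 30, 70
logit = math.log(sick / healthy)   # log-odds of being sick
print(logit)
```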
multinomial distribution
the probability distribution associated with the classification of each of a sample of individuals into one of several mutually exclusive and exhaustive categories. When the number of categories is two, the distribution is called binomial.
predictive value of a negative test
the probability that a person with a negative test does not have the disease.
predictive value of a positive test
the probability that a person with a positive test is a true positive (i.e. does have the disease).
p value
the probability that a test statistic would be as extreme as or more extreme than observed if the null hypothesis were true.
measurement scale
the range of possible values for a measurement (e.g. the set of possible responses to a question, the physically possible range for a set of body weights). Measurement scales can be classified according to the quantitative character of the scale:
1. dichotomous scale - one that arranges items into either of two mutually exclusive categories, e.g. yes/no, alive/dead.
2. nominal scale - classification into unordered qualitative categories, e.g. race, religion, country of birth. Measurements of individual attributes are purely nominal scales, as there is no inherent order to their categories.
3. ordinal scale - classification into ordered qualitative categories, e.g. grade, where the values have a distinct order but their categories are qualitative in that there is no natural (numerical) distance between their possible values.
4. interval scale - assignment of values with a natural distance between them, so that a particular distance (interval) between two values in one region of the scale meaningfully equals the same distance between two values in another region of the scale. Examples include Celsius and Fahrenheit temperature and date of birth.
5. ratio scale - an interval scale with a true zero point, so that ratios between values are meaningfully defined. Examples are absolute temperature, weight, height, blood count, and income, as in each case it is meaningful to speak of one value as being so many times greater or less than another value.
odds
the ratio of the probability of occurrence of an event to that of nonoccurrence (a binary variable), or the ratio of the probability that something is so to the probability that it is not so.
Marginals
the row and column totals of a contingency table.
percentile
the set of divisions that produce exactly 100 equal parts in a series of continuous values, such as blood pressure, weight, height, etc. Thus a person with blood pressure above the 80th percentile has a blood pressure value greater than 80% of the other recorded values.
t-test (T-distribution)
the t-distribution is the distribution of a quotient of independent random variables, the numerator of which is a standard normal variate and the denominator of which is the positive square root of the quotient of a chi-square distributed variate and its number of degrees of freedom. The t-test uses a statistic that, under the null hypothesis, has the t-distribution, to test whether two means differ significantly or to test linear regression or correlation.
maximum likelihood estimate
the value for an unknown parameter that maximizes the probability of obtaining exactly the data that were observed. Used, for example, to fit logistic regression models.
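A sketch of numerical maximum likelihood estimation, assuming independent Bernoulli observations; the data are invented, and the closed-form answer here is simply the sample proportion (7/10):

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical binary data: 7 successes in 10 Bernoulli trials.
data = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# Negative log-likelihood of p for independent Bernoulli observations.
def neg_log_lik(p):
    return -np.sum(stats.bernoulli.logpmf(data, p))

# Maximize the likelihood numerically over 0 < p < 1.
result = optimize.minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6),
                                  method="bounded")
print(result.x)   # should be close to 0.7
```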
Slope
used to describe the measurement of the steepness, incline, gradient, or grade of a straight line. A higher slope value indicates a steeper incline. The slope is defined as the ratio of the "rise" divided by the "run" between two points on a line, or in other words, the ratio of the altitude change to the horizontal distance between any two points on the line. The slope of a line in the plane containing the x and y axes is generally represented by the letter m, and is defined as the change in the y coordinate divided by the corresponding change in the x coordinate, between two distinct points on the line. This is described by the following equation: m = Δy / Δx If y is a linear function of x, then the coefficient of x is the slope of the line created by plotting the function. Therefore, if the equation of the line is given in the form y = mx + b then m is the slope. This form of a line's equation is called the slope-intercept form, because b can be interpreted as the y-intercept of the line, the y-coordinate where the line intersects the y-axis.
observer variation (error)
variation (or error) due to failure of the observer to measure or identify a phenomenon accurately. Observer variation erodes scientific credibility whenever it appears. There are two varieties of observer variation: interobserver variation, i.e. the amount observers vary from one another when reporting on the same material, and intraobserver variation, i.e. the amount one observer varies between observations when reporting more than once on the same material.