Epidemiology Exam 1 Chapter 2
Types of distribution curves
- Normal Distribution/Gaussian -standard normal -Same mean and different dispersions -Skewed distributions -measures of central tendency - median (better measure of central tendency for skewed distributions)
Graphical presentations
- Used to summarize key aspects of the data set - Types of graphs: - Bar chart - Line graph - Pie chart
parameter
- a variable for describing a population - measurable attribute of a population - ex: the population mean, a parameter, is designated by the symbol μ (mu)
examples of populations
- all of the inhabitants of a country (e.g. China) - all of the people who live in a city (e.g. NY) - all students currently enrolled in a particular university - all of the people diagnosed with a disease such as type 2 diabetes or lung cancer
Nominal scales
- are qualitative and consist of categories that are not ordered (whereas ordered data has categories like best to worst) - include dichotomous scales ex: race, religion, gender, car ownership, cell phone provider
reasons for multimodal distributions of health outcomes:
- changes in lifestyle and immune status of the host - latency effects: latency refers to the time period bw initial exposure and a measurable response, ie the occurrence of conditions such as chronic diseases that have long latency periods and occur later in life
Rationale for using samples
- improved parameter estimates - possible cost savings
Curvilinear (inverted U-shape) scatterplot
- it's possible for scatterplots to conform to nonlinear shapes, like a curved line - in these, the linear correlation bw X and Y is essentially 0 (ex: r = -0.09), indicating that there is no linear association - However, nonlinear curves do not imply that there is no relationship bw two variables, ONLY that their relationship is nonlinear
Types of scatter plots
- perfect direct linear association - perfect inverse linear association - no association - positive relationship (r = 0.7) - curvilinear (inverted U-shape)
simple random sampling
- samples are selected by a random process - unbiased: the average of the sample estimates over all possible samples (of size n from N) is equal to the population parameter - example using a mean: x̄ = sample mean, μ = population mean; the mean of all sample means of size n from N will be equal to the population mean
Random Sampling
- simple random sampling - stratified random sampling
Symmetrical (non-skewed) distributions
- the mean and median are identical and can be used interchangeably - general rule: the arithmetic mean is preferred over median as a measure of central tendency
stratified random sampling
- uses over-sampling of strata in order to ensure that a sufficient number of individuals from a particular stratum are included in the final sample - can improve parameter estimates for large, complex population, especially when there is substantial variability among subgroups
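A minimal Python sketch of over-sampling a small stratum. The population, the stratum names, and the per-stratum sample sizes are all made-up assumptions for illustration; they are not from the course material.

```python
import random

random.seed(1)  # fixed seed only so the sketch is reproducible

# Hypothetical population: 950 people in a large stratum, 50 in a small one.
population = [("majority", i) for i in range(950)] + \
             [("minority", i) for i in range(50)]

# Group the population by stratum.
strata = {}
for stratum, person_id in population:
    strata.setdefault(stratum, []).append((stratum, person_id))

# Over-sample the small stratum: it is 5% of the population
# but one third of the sample (assumed allocation).
per_stratum = {"majority": 50, "minority": 25}

stratified_sample = []
for name, members in strata.items():
    # Simple random sample drawn separately within each stratum.
    stratified_sample.extend(random.sample(members, per_stratum[name]))

minority_in_sample = sum(1 for s, _ in stratified_sample if s == "minority")
print(len(stratified_sample), minority_in_sample)
```

Because the small stratum is deliberately over-represented, estimates for the whole population would normally need to weight each stratum back to its true share.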
Ordinal scales
-are categorical data that can be ordered and ranked, but are still qualitative; you are able to assign #'s/ranks even though your data is qualitative - the intervals between each point on the scale are not equal intervals - use bar graphs ex: socioeconomic status, occupational prestige, level of educational attainment, self-perception of health (strongly agree, agree, etc.)
Ratio scale
-most used scale in Epi - the ratio scale has a true zero point, so one can create ratio comparisons - ex: the Kelvin scale: 0 degrees K represents the absence of all heat, therefore we can say 200 degrees K is twice as hot as 100 degrees K
Measures of variation/dispersion/spread
-range -mean deviation -variance and standard deviation
in a generic contingency table
A = exposure is present and disease is present B = exposure is present and disease is absent C = exposure is absent and disease is present D = exposure is absent and disease is absent where A+B+C+D = all study subjects
contingency table
Another method for demonstrating associations A type of table that tabulates data according to two dimensions - there is an exposure variable (like viewing or not viewing alcoholic beverage commercials) - and an outcome variable (like whether study subjects engage in binge drinking) - column and row totals are known as marginal totals
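A short Python sketch of a 2x2 contingency table using the A, B, C, D cell labels from the generic table. The counts are made up for illustration.

```python
# Hypothetical counts (made-up numbers).
a = 30  # exposure present, disease present
b = 70  # exposure present, disease absent
c = 10  # exposure absent, disease present
d = 90  # exposure absent, disease absent

# Marginal totals: row totals (by exposure) and column totals (by disease).
row_totals = {"exposed": a + b, "unexposed": c + d}
col_totals = {"disease": a + c, "no disease": b + d}
grand_total = a + b + c + d  # all study subjects

print(f"{'':<11}{'Disease':>9}{'No disease':>12}{'Total':>7}")
print(f"{'Exposed':<11}{a:>9}{b:>12}{row_totals['exposed']:>7}")
print(f"{'Unexposed':<11}{c:>9}{d:>12}{row_totals['unexposed']:>7}")
print(f"{'Total':<11}{col_totals['disease']:>9}"
      f"{col_totals['no disease']:>12}{grand_total:>7}")
```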
Drawback of simple random sampling
Most large populations in the US and other countries are comprised of numerous subgroups, which an epidemiologist may want to investigate the characteristics of. Unfortunately, when a simple random sample of a large population is selected, members of subgroups of interest may not appear in sufficient numbers in the chosen sample to permit statistical analyses of them.... Stratified random sampling offers a work-around for this problem
parameter estimation
Recall that epidemiologists use statistics to estimate parameters... 2 types of parameter estimates are - point estimate - interval estimate
interval estimate
Uses a range of values for estimation of a parameter; in other words, it is defined as a range of values that, with a certain level of confidence, contains the parameter ex: one common level of confidence is 95% (the 95% confidence interval), meaning that one is 95% certain the confidence interval contains the parameter or value that we are interested in
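A minimal Python sketch of a point estimate and a 95% interval estimate for a mean. The blood pressure values are made up, and the interval uses the normal (z) approximation as a simplifying assumption; with a sample this small, a t-based interval would usually be preferred.

```python
import math
import statistics

# Hypothetical sample of systolic blood pressures in mmHg (made-up values).
sample = [118, 125, 130, 122, 135, 128, 120, 132, 127, 124]

n = len(sample)
x_bar = statistics.mean(sample)   # point estimate of mu
s = statistics.stdev(sample)      # sample standard deviation

# z value that leaves 2.5% in each tail (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

margin = z * s / math.sqrt(n)     # half-width of the interval
ci_low, ci_high = x_bar - margin, x_bar + margin

print(f"point estimate: {x_bar:.1f}")
print(f"95% CI: ({ci_low:.1f}, {ci_high:.1f})")
```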
point estimate
Uses a single value for estimation of a parameter ex: is the use of x-bar (the sample mean) to estimate mu (the corresponding population mean)
Estimation
Using sample-based data to infer conclusions about the population - thus x̄ can be used as an estimate for μ (the pop. mean)
pie chart
a circle or pie - shows the proportion of cases according to several categories - the size of each piece of the pie is proportional to the frequency of cases - the pie chart demonstrates relative importance of each subcategory
population
a collection of people who share common observable characteristics
Epidemic Curve
a graphic plotting of the distribution of cases by time of onset; it's a unimodal curve - there is a baseline mean of cases over 5 years, for ex (blue line), which tells you when most cases occurred - aids in identifying the cause of a disease outbreak - helps you understand the outbreak and the distribution of cases - so you can prevent it from happening again ex: Foster Farms Salmonella outbreak
Sample
a sub-group that has been selected, by using one of several methods, from the population
bar chart
a type of graph that shows the frequency of cases for categories of a discrete variable - height of each bar represents frequency of cases for each category - ex: qualitative, discrete variable such as a Yes/No variable - along base/x-axis of bar chart are categories of the variable: Sex, Injection drug use?, Shared needle?
Normal distribution: Standard normal distribution
a type of normal distribution with: a mean of 0 and a standard deviation of 1 unit - is created by standardizing a normal distribution (converting each value to a z-score) - the 68-95-99.7 rule describes the resulting bell curve
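A small Python sketch of standardizing, assuming the usual z-score formula z = (value - mean) / SD, which these notes do not spell out. The values are made up. After standardizing, the mean is 0 and the SD is 1; the shape of the distribution itself is unchanged.

```python
import statistics

# Hypothetical measurements (made-up values).
values = [60, 65, 70, 75, 80]

mean_v = statistics.mean(values)
sd_v = statistics.pstdev(values)  # SD treating the values as the whole group

# z-score: how many SDs each value lies from the mean.
z_scores = [(v - mean_v) / sd_v for v in values]

z_mean = statistics.mean(z_scores)
z_sd = statistics.pstdev(z_scores)
print(z_scores, z_mean, z_sd)
```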
Mean (x̄)
also called arithmetic mean or average - in distribution curves, the mean is the location on the X-axis
subthreshold phase of dose-response curve
at the very beginning, before the threshold on the curve - suggests that at low levels of dosage, no or minimal effect occurs
example of parameter
average age of the population
Interval scale
consist of continuous data with equal intervals between points on the measurement scale without a true zero point - therefore we cannot calculate ratios ex: IQ; the Fahrenheit scale does not have a true zero point, which is why 100 degrees F is not twice as hot as 50 degrees F
qualitative data
do not have numerical values or rankings - measured on a categorical scale ex: marital status, sex, occupation (have no natural ordering)
Distributions with Multimodal Curves
has several peaks in the frequency of a condition
frequency tables
helpful in identifying outliers, extreme values - one of the most convenient ways to summarize or display data in a grouped format - after tabulating data in a freq. table, an epidemiologist might plot the data graphically as a bar chart, histogram, line graph or pie chart
positive relationship (ex r = 0.7) between variables scatterplot
if the relationship is fairly strong with r = 0.7, then the points cluster fairly closely around a straight line - if an oval were to be drawn around the points, the oval would be cigar-shaped
68-95-99.7 rule (empirical rule)
in a normal distribution model, about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean
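The rule can be checked with a short Python sketch using the standard library's NormalDist. The helper name share_within is an illustrative assumption.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # standard normal distribution

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

# Percent of values within 1, 2 and 3 standard deviations of the mean.
within = {k: round(share_within(k) * 100, 1) for k in (1, 2, 3)}
print(within)  # {1: 68.3, 2: 95.4, 3: 99.7}
```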
sampling bias in non-random sampling:
individuals who have been selected are not representative of the population to which epidemiologists would like to generalize the results of the research. (often times, only people who are interested in the survey topic respond to the survey)
when r is negative, the association is
inverse, meaning that the value of one variable will increase if the value of the other variable decreases
Distribution Curves
is a graph that is constructed from the frequencies of the values of a variable - can take various forms, like symmetric and non-symmetric (skewed) - are described in terms of central tendency (mean, median, mode) and dispersion/spread (SD, range, percentiles, quartiles)
Standard Deviation of a sample (s)
is a measure of the dispersion (spread) - measure of how much your data varies from the mean to determine the standard deviation of a sample, you take the square root of the variance
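A minimal Python sketch of the calculation, assuming the common sample formula that divides the sum of squared deviations by n - 1 (the notes do not state the denominator). The data are made up.

```python
import math
import statistics

# Hypothetical sample (made-up values).
data = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = statistics.mean(data)                    # sample mean
squared_devs = [(x - x_bar) ** 2 for x in data]  # squared distance from the mean

variance = sum(squared_devs) / (len(data) - 1)   # s^2, assumed n - 1 denominator
sd = math.sqrt(variance)                         # s = square root of the variance

print(f"variance = {variance:.3f}, SD = {sd:.3f}")
```

The standard library's statistics.variance and statistics.stdev use the same n - 1 convention and return the same results.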
Normal (Gaussian) Distribution
is a symmetrical distribution where the mean, median and mode are identical and fall exactly in the middle of the distribution
dose-response curve
is the plot of dose-response relationship, which is a type of correlative association bw an exposure (like a dose of a toxic chemical) and an effect (like a biologic outcome) ex: dose response relationship bw # of cigs smoked daily and mortality from lung cancer
continuous variable
made up of continuous data - have infinite number of possible values along a continuum - ex: heart rate, blood cholesterol, blood sugar levels, age, height, weight
discrete variable
made up of discrete data - have finite or countable number of values ex: household size (# of ppl who reside in house) # of doctor visits
Pearson correlation coefficient
measure of strength and direction of linear relationship between 2 continuous variables - varies from -1 to 0 to +1
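A short Python sketch of r for three made-up data sets: a perfect direct line, a perfect inverse line, and an inverted U. The helper name pearson_r and all values are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Perfect direct and perfect inverse linear associations.
r_direct = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_inverse = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])

# Inverted U: y rises, peaks at x = 0, then falls.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [9 - x ** 2 for x in xs]
r_curved = pearson_r(xs, ys)

print(r_direct, r_inverse, r_curved)
```

In the third case y is completely determined by x, yet r comes out as 0, because r only measures the linear part of a relationship.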
median
middle point of a set of numbers to find: re-write numbers in data set from lowest to highest, middle number is the median; if even data set then average the 2 middle numbers and that is your median
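The steps above can be sketched in Python as follows. The function name and the example numbers are illustrative.

```python
def median(values):
    """Middle value of a data set; averages the two middle values if n is even."""
    ordered = sorted(values)  # re-write from lowest to highest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the middle number
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the 2 middle numbers

odd_result = median([7, 1, 5, 3, 9])   # sorted: 1 3 5 7 9 -> 5
even_result = median([8, 2, 6, 4])     # sorted: 2 4 6 8 -> (4 + 6) / 2 = 5.0
print(odd_result, even_result)
```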
Measures of central tendency (or location)
mode, median, mean
Stanley Stevens' measurement scales
nominal, ordinal, interval, ratio in 1946, Stevens wrote that before conducting a data analysis, one should choose an analysis that is appropriate to the scale of measurement being used
N is designated as the
number in the population
n is designated as the
number in the sample
mode
number occurring most frequently in a set or distribution of numbers aka the category in a frequency distribution that has the highest frequency of cases
Statistics
numbers that describe a sample
Scatter Plot Diagram
plots two variables, one on the X axis (horizontal) and one on the Y axis (vertical) - the measurements for each case or individual subject are plotted as a single data point (dot)
when r is positive, the association is
positive, when one variable increases so does the other
non-random sampling
prone to sampling bias bc of self-selection in internet surveys, media based polling, etc. - convenience sampling - systematic sampling - cluster sampling
examples of simple random sampling
random digit dialing for phone surveys, drawing a name from hat, or lists that include a large diverse population, like licensed drivers
Threshold dose-response curve
refers to the lowest dose at which a particular response may occur
Analyses of Bivariate Associations examines the
relationships between two variables ex: - scatter plots - correlation coefficients - contingency tables
quantitative data
reported as numerical quantities - obtained by counting or taking measurements (ex is measuring patient's height)
μ represents the
population mean, the average of a population (ex is average age); the sample mean is written x̄
Histogram
similar to bar chart, used for continuous variables - used to display the frequency distributions for grouped categories of a continuous variable - coding procedures are applied to convert continuous variables into categories on the x-axis ex: ages 15-19, 20-24, 25-29, 30-34
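A minimal Python sketch of that coding step: individual ages are grouped into the categories listed above and the frequencies are counted. The ages and the helper name age_group are made up for illustration.

```python
from collections import Counter

# Hypothetical ages (made-up values).
ages = [15, 17, 19, 20, 22, 22, 24, 25, 28, 29, 30, 33, 34, 16, 21]

# Age categories for the x-axis, as (low, high) inclusive bounds.
bins = [(15, 19), (20, 24), (25, 29), (30, 34)]

def age_group(age):
    """Return the label of the category an age falls into, or None."""
    for low, high in bins:
        if low <= age <= high:
            return f"{low}-{high}"
    return None

counts = Counter(age_group(a) for a in ages)

# Text histogram: one '#' per case in each category.
for low, high in bins:
    label = f"{low}-{high}"
    print(f"{label:>5} | {'#' * counts[label]}")
```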
skewed distributions
skewed data are not equally distributed on both sides of the distribution - so it's NOT a symmetrical distribution - they are either right or left skewed: determined by the direction that the tail of the distribution is pointing - in these cases most of the data is not necessarily close to the mean as we saw in normal distributions
a stratum is a
subgroup of the population -ex: a population can be stratified by racial or ethnic group, age category or socioeconomic status
Example of n and N:
suppose an epidemiologist wants to study the health characteristics of racial or ethnic subgroups that are uncommon in the general population - the size of n is limited by our available budget - if n is small (which is often the case) in comparison to N, then only a few individuals from the minority group will enter the sample.
Remember that an association between two variables signifies ONLY that they are related and NOT
that the association is causal
simple random sampling is unbiased meaning
that the average of the sample estimates over all possible samples is equal to the population parameter
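A small Python sketch of this property, using a made-up population of N = 5 values and every possible sample of size n = 2. The mean of all the sample means equals the population mean.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population (made-up values), N = 5.
population = [2, 4, 6, 8, 10]
n = 2

mu = mean(population)  # population mean

# Every possible sample of size n, and the mean of each one.
all_samples = list(combinations(population, n))
sample_means = [mean(s) for s in all_samples]

mean_of_sample_means = mean(sample_means)
print(len(all_samples), mu, mean_of_sample_means)
```

Any single sample mean can miss μ (here they range from 3 to 9); unbiased means only that the misses average out over all possible samples.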
as r gets closer to -1 or +1
the association becomes stronger
Measures of central tendency of a skewed distribution
the mean, median and mode all have different values in a skewed distribution - and the median is a more appropriate measure of central tendency than the mean in a skewed distribution
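A short Python sketch with made-up, right-skewed data (one very long hospital stay). The mean is pulled toward the tail, while the median stays near the bulk of the data.

```python
import statistics

# Hypothetical lengths of hospital stay in days (made-up, right-skewed).
stays = [1, 2, 2, 2, 3, 3, 4, 5, 6, 30]

mean_stay = statistics.mean(stays)      # 5.8, pulled up by the 30-day stay
median_stay = statistics.median(stays)  # 3.0, middle of the ordered data
mode_stay = statistics.mode(stays)      # 2, the most frequent value

print(mean_stay, median_stay, mode_stay)
```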
Range
the range is the difference between the highest and lowest value in a group of numbers highest - lowest = range
cluster sampling
the researcher divides the population into separate groups (rather than individuals), called clusters. Then, a simple random sample of clusters is selected from the population. -a common method for sampling - can produce cost-savings (more parsimonious than random sampling) - creates unbiased estimates of parameters.
x̄
the sample estimate of μ (the sample mean)
in a scatter plot, the closer the points lie with respect to the straight line of best fit through them (the regression line)...
the stronger the association between variables X and Y is
the closer r gets to 0,
the weaker the association becomes
if r = 0
there is no linear association (a nonlinear relationship may still exist)
Same mean different distributions
these two have the same mean (ie location on X-axis) and different dispersions
significance of dose-response relationship
this relationship is one of the indicators used to address a causal effect of a suspected exposure associated with an adverse health outcome - ex: is dose response relationship bw # cigs smoked daily and rates of lung cancer mortality this dose-response relationship was one of considerations that led to the conclusion that smoking is a cause of lung cancer mortality
Universe
total set of elements from which a sample is selected
Line graph
used to display trends - points of graph have been joined by a line - a single point represents the frequency of cases for each category of a variable - when using more than one line, the epidemiologist is able to demonstrate comparisons among subgroups ex: time trends
systematic sampling
uses a systematic procedure to select a sample of a fixed size from a sampling frame (a complete list of people who constitute the population) - feasible when a sampling frame such as a list of names is available - but may not be representative of the sampling frame ex: an epidemiologist wants to select a sample of 100 individuals from an alphabetical list that contains 2000 names
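A minimal Python sketch of the example above: 100 names from a list of 2000 means a sampling interval of 2000 / 100 = 20, so every 20th name is taken. The list contents and the random starting point are assumptions for illustration; the notes do not say how the first name is chosen.

```python
import random

# Hypothetical sampling frame: an ordered list of 2000 names.
frame = [f"person_{i:04d}" for i in range(1, 2001)]

sample_size = 100
k = len(frame) // sample_size  # sampling interval: 2000 / 100 = 20

random.seed(0)                 # fixed seed only so the sketch is reproducible
start = random.randrange(k)    # assumed: random start within the first k names

# Take every k-th name from the starting position onward.
systematic_sample = frame[start::k]

print(k, len(systematic_sample), systematic_sample[:3])
```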
convenience sampling
uses available groups selected by an arbitrary and easily performed method. - highly likely to be biased - not appropriate for application of inferential statistics - but can be helpful in descriptive studies and for suggesting add'l research ex: a group of patients who receive medical service from a physician who is treating them for a chronic disease
Variance of a sample (s^2)
variance is the degree of variability in a set of numbers. - it answers the question below in a mathematical way: "How different are the data points from one another?"
right skewed distribution
when most of the data is on the left side - tail of distribution trails off to the right - positively skewed
left-skewed distribution
when most of the data is on the right side - tail of distribution trails off to the left - negatively skewed
linear phase of dose-response curve
where an increase in response is proportional to an increase in dose