PubHlth 223: Biostats Exam 1
modality
number of areas of pronounced density of observations (prominent peaks) - unimodal: 1 peak - bimodal: 2 peaks - multimodal: more than 2 peaks - uniform: no prominent peak
variance: why do we use squared deviations in the calculation of variance?
- so that observations equally far above and below the mean contribute equally - so that larger deviations from the mean weigh more heavily
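A minimal sketch of why squaring matters, using a small made-up data set (the values are hypothetical). Raw deviations cancel out to zero, while squared deviations do not, and the largest deviation dominates the total.

```python
from statistics import mean, variance

data = [2, 4, 4, 6, 9]          # hypothetical observations
m = mean(data)                   # sample mean

deviations = [x - m for x in data]
squared = [d ** 2 for d in deviations]

sum_dev = sum(deviations)        # positive and negative deviations cancel
s2 = sum(squared) / (len(data) - 1)  # sample variance: divide by n - 1

print(sum_dev, squared, s2, variance(data))
```

The hand calculation and `statistics.variance` should agree, since both divide by n - 1.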
descriptive statistics for numerical data
-central tendency measures: mean and median -spread of data measures: variance, SD, interquartile range
cluster sampling
-challenge: it may not be practical / economical to list all individuals that may be sampled - clusters represent groupings that can be enumerated - a cluster may not be made up of homogeneous observations - sample clusters and study all cases within the sampled clusters - analysis methods are more complex because they must account for multiple layers of variability
independent outcomes
-knowing the outcome of one random process provides no useful information about the outcome of the second -ex: knowing that a coin landed on heads (outcome of the first random process) does not help predict how it will land on the second toss --> the outcomes of two tosses of a coin are independent
stratified sample
-list all possible cases in the population -classify each case into a stratum -the cases within a stratum are expected to be similar with respect to some underlying characteristic that may relate to the response variable of interest -take a simple random sample from each stratum
simple random sampling
-list all possible cases in the population -randomly select cases to be studied -there is no implied connection between the cases that are selected beyond the fact that they come from the specified population
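A rough sketch of simple random and stratified sampling with Python's `random` module. The population, the stratum labels ("urban" / "rural"), the sample sizes, and the helper `stratified_sample` are all hypothetical and chosen only for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 cases, each tagged with a stratum label
population = [
    {"id": i, "stratum": "urban" if i < 60 else "rural"} for i in range(100)
]

# Simple random sample: every case is equally likely to be selected
srs = random.sample(population, k=10)

def stratified_sample(cases, per_stratum):
    """Group cases by stratum, then take a simple random sample from each."""
    strata = {}
    for case in cases:
        strata.setdefault(case["stratum"], []).append(case)
    return {
        name: random.sample(members, k=per_stratum)
        for name, members in strata.items()
    }

strat = stratified_sample(population, per_stratum=5)
print(len(srs), {name: len(cases) for name, cases in strat.items()})
```

The stratified version guarantees that both strata are represented, which a simple random sample does not.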
rules for probability distribution
-the events listed must be disjoint (mutually exclusive) -all possible outcomes must be delineated (listed) -each probability must be between 0 and 1 -the probabilities must total 1
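The numeric rules above can be checked mechanically. This is a sketch only: the function name `is_valid_distribution` and the two example distributions are made up, and using dictionary keys as outcomes is an assumption that stands in for "disjoint and fully listed".

```python
import math

def is_valid_distribution(probs):
    """Check a dict mapping outcome -> probability.

    Dict keys are unique, so the listed outcomes are treated as disjoint.
    """
    each_in_range = all(0 <= p <= 1 for p in probs.values())
    totals_one = math.isclose(sum(probs.values()), 1.0)
    return each_in_range and totals_one

fair_die = {face: 1 / 6 for face in range(1, 7)}
broken = {"heads": 0.6, "tails": 0.6}   # totals 1.2, so not a distribution

print(is_valid_distribution(fair_die), is_valid_distribution(broken))
```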
histogram
-visual of data density for continuous numerical data -higher bars represent where the data are relatively more common -displays shape of the data distribution -need to carefully consider "bin width"
what are 3 key decisions in hypothesis testing
1. null hypothesis 2. alternative hypothesis 3. the type 1 error rate you are willing to tolerate (how low the p-value has to be to conclude the null should be rejected; usually 5%, i.e. p < 0.05)
what are the two ways to use normal distribution?
1. uses related to specific values that may be observed - standardize observed data using the normal distribution 2. uses related to summary statistics - determine the probability of an observed result using the normal distribution
sampling
a feature of observational and experimental studies. accurate interpretation of statistical analyses relies on understanding who got into the study and how -simple random -stratified sample -cluster sample -multi-stage sampling
random process
a situation in which we know what outcomes could happen, but we don't know which particular outcome will happen (coin toss, die roll)
explanatory variable
a variable that might cause changes in the response variable
standard error
measure of variability of an estimated statistic (SD of the sampling distribution of a statistic) -a function of both the variability in the data and the sample size -larger sample size = smaller standard error -reflects increasing precision as more information is observed
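As an illustration, the standard error of a sample mean is commonly estimated as s / sqrt(n), where s is the sample SD. The sketch below assumes that formula; the data are hypothetical, and the larger sample is built by repeating the same values purely to show the effect of n.

```python
import math
from statistics import stdev

def standard_error_of_mean(sample):
    """Estimate the SE of the sample mean as s / sqrt(n)."""
    return stdev(sample) / math.sqrt(len(sample))

small = [4, 8, 6, 2]    # hypothetical sample, n = 4
large = small * 4       # same values repeated, n = 16

se_small = standard_error_of_mean(small)
se_large = standard_error_of_mean(large)

print(se_small, se_large)  # the larger sample gives the smaller SE
```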
disjoint
mutually exclusive
complementary events
mutually exclusive events whose probabilities add up to 1 -ex: the complement of event D (rolling a 2 or a 3) is D complement (rolling a 1, 4, 5, or 6)
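The die example can be worked through with sets, assuming a fair six-sided die so that every face is equally likely. The helper name `prob` is made up for this sketch.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (assumed)
D = {2, 3}                           # event D: rolling a 2 or a 3
D_complement = sample_space - D      # everything not in D

def prob(event):
    """Probability of an event when all faces are equally likely."""
    return Fraction(len(event), len(sample_space))

p_D = prob(D)
p_Dc = prob(D_complement)

print(D_complement, p_D, p_Dc, p_D + p_Dc)
```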
robust statistics
not significantly impacted by extreme values (outliers) or skewness -median and IQR are more robust to skewness and outliers than mean and SD
descriptive statistics for categorical data
number and percent
what type of variable would GPA be? continuous or numerical
numerical, continuous
how do you assess a histogram's normality?
overlay it with a normal distribution curve
joint probability
probability of a particular outcome of one random process AND a particular outcome of a second random process (inside contingency table)
Marginal Probability
probability of one random process regardless of another process (the totals on a contingency table)
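A small sketch of joint and marginal probabilities computed from a 2x2 contingency table. The counts, the labels, and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical 2x2 contingency table: (treatment, outcome) -> count
counts = {
    ("treated", "improved"): 30,
    ("treated", "not improved"): 20,
    ("control", "improved"): 15,
    ("control", "not improved"): 35,
}
total = sum(counts.values())

def joint(treatment, outcome):
    """Interior cell of the table divided by the grand total."""
    return Fraction(counts[(treatment, outcome)], total)

def marginal_treatment(treatment):
    """Row total divided by the grand total, ignoring the outcome."""
    row = sum(c for (t, _), c in counts.items() if t == treatment)
    return Fraction(row, total)

def marginal_outcome(outcome):
    """Column total divided by the grand total, ignoring the treatment."""
    col = sum(c for (_, o), c in counts.items() if o == outcome)
    return Fraction(col, total)

print(joint("treated", "improved"),
      marginal_treatment("treated"),
      marginal_outcome("improved"))
```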
probability
proportion of times a particular outcome (event) would be observed if the random process were repeated a very large number of times
1-sided hypothesis testing
question is whether a summary statistic is far enough away from a hypothetical value in one specified direction (greater? less?) - ex: is the response rate higher in the treated group?
2-sided hypothesis testing
question is whether summary statistic is different than a hypothetical value -ex: is the response rate different between groups?(allowing for it to be higher or lower)
primary tools for controlling confounding
-randomization: ideally distributes a similar population of study participants to each treatment -blocking (stratified randomization): forces study participants with a particular characteristic to be evenly distributed across treatment groups
P-value
the probability, if the null hypothesis is true, of observing a result at least as favorable to the alternative hypothesis as the data actually observed -use the p-value to decide whether to reject or fail to reject the null hypothesis
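A hedged sketch of how a p-value might be computed, assuming the test statistic is a z statistic that follows a standard normal distribution when the null is true. The function name `p_value_from_z`, the test statistic, and the 0.05 cutoff are illustrative assumptions.

```python
from statistics import NormalDist

def p_value_from_z(z, alternative="two-sided"):
    """p-value for a z statistic under a standard normal null distribution."""
    std_normal = NormalDist()
    if alternative == "greater":      # 1-sided: is the statistic higher?
        return 1 - std_normal.cdf(z)
    if alternative == "less":         # 1-sided: is the statistic lower?
        return std_normal.cdf(z)
    return 2 * (1 - std_normal.cdf(abs(z)))   # 2-sided: different at all?

z = 2.1        # hypothetical observed test statistic
alpha = 0.05   # assumed significance level

p_one = p_value_from_z(z, "greater")
p_two = p_value_from_z(z)
reject_two_sided = p_two < alpha

print(p_one, p_two, reject_two_sided)
```

For a positive z, the 2-sided p-value is twice the 1-sided ("greater") one, because both tails count as evidence against the null.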
observational research
-retrospective: data are collected after events have taken place -prospective: identify individuals and collect information as events unfold
which is larger (mean or median) under right skew and under left skew?
-right skew: mean > median -left skew: mean < median
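A quick check with two made-up data sets, one with a long right tail and its mirror image with a long left tail. The values are hypothetical.

```python
from statistics import mean, median

right_skewed = [1, 2, 2, 3, 3, 4, 20]          # long tail to the right
left_skewed = [-x for x in right_skewed]       # mirror image: tail to the left

print(mean(right_skewed), median(right_skewed))  # the tail pulls the mean up
print(mean(left_skewed), median(left_skewed))    # the tail pulls the mean down
```

The median barely reacts to the extreme value, which is also why it is described as more robust than the mean.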
variance (S^2)
roughly the average squared deviation from the mean
multistage sampling
similar to cluster sampling, but there is a second stage of sampling in which cases within each sampled cluster are themselves sampled -analysis methods are more complex because they must account for multiple layers of variability
sample
subset of the population used to estimate characteristics of the entire population -much more common than studying the whole population -statistics are calculated from the sample and strive to accurately estimate population parameters
intensity map
summary of two variables, if one variable is geographic location
central limit theorem
the distribution of many statistics (such as the sample mean) computed from repeated samples will converge on a normal distribution -as sample size increases, the distribution of the sample mean more closely approximates a normal distribution -sample means concentrate around the center of the distribution and the SD (spread) of the sample mean decreases -sample size up, spread down -assumes the observations in the sample are independent and the sample is large
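A simulation sketch of the "sample size up, spread down" point. The exponential population, the sample sizes (5 and 50), the number of repetitions, and the helper `sample_means` are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_means(n, reps=2000):
    """Means of `reps` samples of size n from a right-skewed population.

    The population here is exponential with mean 1 (an assumed example).
    """
    return [
        mean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

means_small = sample_means(5)
means_large = sample_means(50)

spread_small = stdev(means_small)
spread_large = stdev(means_large)

print(mean(means_large), spread_small, spread_large)
```

Even though the individual observations are skewed, the sample means cluster around the population mean, and the cluster tightens as n grows.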
z-score
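the number of standard deviations an observation is above or below the mean: z = (observation - mean) / SD -ex: the number of standard deviations a bone mineral density measurement is above or below the age-matched mean bone mineral density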
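A minimal sketch of the z-score calculation. The bone mineral density numbers are invented for illustration and are not clinical reference values.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

# Hypothetical values: measurement 0.85, age-matched mean 1.00, SD 0.10
z = z_score(0.85, 1.00, 0.10)

print(z)  # roughly -1.5, i.e. about 1.5 SDs below the age-matched mean
```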
event
the particular outcome of a random process that we want to know the probability of
outcome
the possible results of a random process
statistical inference
the practice of drawing conclusions about a population from a sample of data, recognizing that the sample has been observed in the context of random variation -while a given sample of data may not always lead us to a correct conclusion, statistical inference gives us tools to control and evaluate how often these errors occur
interquartile range
the range between Q1 and Q3: the amount of spread observed in the central bulk of the data -percentiles: values below which a specified percent of observations fall -quartiles: the 3 cut points that delineate quarters of the observations
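A sketch of computing quartiles and the IQR with `statistics.quantiles`. The data are hypothetical, and software packages use slightly different quartile conventions, so other tools may give slightly different cut points for the same data.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]   # hypothetical observations

# n=4 asks for the 3 cut points that split the data into quarters
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```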
dot plot
useful for visualizing one numerical variable -darker colors represent areas where there are more observations
segmented bar plot
visual representation of 2x2 contingency table
mosaic plot
visual representation of a 2x2 contingency table -each box has a width and height corresponding to the relative proportion of observations in a particular cell
bar plot
visualize a single categorical variable -can also be used for a single discrete numerical variable -categories can go in any order along the x-axis (unless the variable is ordinal; those stay in order)
relative frequency bar plot
visualize a single categorical variable in proportions rather than numbers
stacked dot plot
visualize one numerical variable -higher stacks of dots represent areas where there are more observations
pie chart
visualize a single categorical variable -sections represent the proportion of the total sample in a particular category -sections are usually ordered sequentially by proportion
skewness
when a distribution has a long tail on one side (so it is not symmetric like a normal distribution) -the distribution is skewed toward whichever side the tail is on
single process outcome
when only one event is of interest -what is prob of A -what is prob of A or B -addition -disjoint outcomes: mutually exclusive -non-disjoint outcomes: can occur together
what is calculated in a population? what is calculated in a sample?
population --> parameter; sample --> statistic
type 1 error
rejecting the null hypothesis when it is actually true (H0 is true, reject H0)
what determines location and what determines spread in normal distribution?
location is determined by the mean; spread is determined by the SD
visual summary of categorical data
bar plot, relative frequency bar plot, pie chart
numerical data
can take any numerical value -it makes sense to perform mathematical operations on them (would it make sense to add or subtract the values?) and to place them in ascending or descending order -continuous numerical: any number is possible within a range (blood pressure, weight) -discrete numerical: integers, often naturally ordered, with no possible values in between (number of hospital stays)
categorical data
can be sorted into groups or categories -ordinal categorical: has clear order -regular categorical: groups or levels with no clear order ( blue, orange, purple)
non-disjoint
can happen at the same time -ex: the sum of two dice can be both 2 and even
disjoint
cannot happen at the same time (mutually exclusive outcomes) -ex: the sum of two dice cannot be both 2 and 12
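The two dice examples can be checked by enumerating all 36 equally likely rolls of two fair dice. The sketch also shows the addition rule: for disjoint events P(A or B) = P(A) + P(B), and for non-disjoint events the overlap P(A and B) is subtracted once. The helper name `prob` is made up.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # all 36 rolls of two fair dice

def prob(event):
    """Probability that event(total) is true for the sum of two dice."""
    hits = sum(1 for a, b in rolls if event(a + b))
    return Fraction(hits, len(rolls))

p_2 = prob(lambda s: s == 2)
p_12 = prob(lambda s: s == 12)
p_even = prob(lambda s: s % 2 == 0)

# Disjoint: a sum cannot be both 2 and 12, so the probabilities just add
p_2_or_12 = prob(lambda s: s == 2 or s == 12)

# Non-disjoint: a sum of 2 is also even, so the overlap is subtracted once
p_2_and_even = prob(lambda s: s == 2 and s % 2 == 0)
p_2_or_even = prob(lambda s: s == 2 or s % 2 == 0)

print(p_2_or_12, p_2 + p_12)
print(p_2_or_even, p_2 + p_even - p_2_and_even)
```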
would zip code be categorical or numerical?
categorical --> it wouldn't make sense to add or subtract zip codes
standard deviation
describes how concentrated the data are around the mean -SD is rescaling of the variance back into the scale of the original data - SD is the square root of the variance
sampling distribution
distribution of values for a statistic for all possible samples from the same population
population
entire group that researchers are interested in understanding
response variable
expected effect from the explanatory variable
type 2 error
failing to reject the null hypothesis when the alternative is actually true (HA is true, fail to reject H0)
bar graph
for discrete numerical data. also works for categorical -displays shape of data distribution
what is the 68-95-99.7 rule
for normally distributed data -about 68% falls within 1 SD of the mean -about 95% falls within 2 SD of the mean -about 99.7% falls within 3 SD of the mean
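The rule can be checked against the standard normal distribution with `statistics.NormalDist`. The helper name `prob_within` is made up for this sketch.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a normal observation falls within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within = {k: prob_within(k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")
```

The exact values are slightly different from the rounded 68 / 95 / 99.7 figures, which is why the rule is an approximation.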
scatter plot
graphical summary of two variables -each dot is a case with two pieces of information
differences between histogram and bar graph
-histogram: continuous numerical variable on X, values grouped into bins, frequency count on Y, no gaps between bars -bar graph: discrete numerical variable on X, frequency count on Y, gaps between bars, each bar is a distinct value (the values "jump" from one to the next) rather than a bin
purpose of outliers in box plot
-identify skew in the distribution -may identify data collection / entry errors -provide insight into interesting features in the data -provide caution: extreme values may distort our understanding of central tendency and variability
when is something deemed statistically significant?
if the p-value is less than some prespecified value
dependent
knowing the outcome of one random process provides some information about the outcome of the second random process -ex: knowing that the first card drawn from a deck (and not replaced) is an ace does provide information for determining the probability of drawing an ace on the second draw
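The card example worked out with exact fractions, assuming a standard 52-card deck with 4 aces and drawing without replacement. The variable names are made up.

```python
from fractions import Fraction

# Standard 52-card deck with 4 aces (assumed), drawing without replacement
p_first_ace = Fraction(4, 52)

# The chance of an ace on the second draw depends on what the first draw was
p_second_ace_given_first_ace = Fraction(3, 51)       # one ace already gone
p_second_ace_given_first_not_ace = Fraction(4, 51)   # all four aces remain

# Joint probability of two aces in a row
p_both_aces = p_first_ace * p_second_ace_given_first_ace

print(p_second_ace_given_first_ace, p_second_ace_given_first_not_ace, p_both_aces)
```

Because the two conditional probabilities differ, the first draw carries information about the second, which is what makes the draws dependent.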
alpha
level of significance in hypothesis testing to determine if H0 can be rejected - our probability of committing type 1 error
