Modules 1 and 2 - Descriptive and Inferential Statistics
Descriptive or inferential: explore the relationship between HIV infection and age, sex, race and SES
inferential
Descriptive or inferential: use MCAT scores to predict the likelihood of graduating from medical school
inferential
If you have a distribution and the right tail is longer, what is its skewness?
positive
Inferential statistics: Which tests would you consider if there are categorical variables associated and variables X1, X2,... etc predict Y?
logistic regression
Inferential statistics: Which tests would you consider if you are looking for a difference between 2 groups?
unpaired t-test or paired t-test
What is a stem and leaf plot?
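- each value is split into a stem (leading digits) and a leaf (last digit), so both the shape of the distribution and the individual values stay visible - used to display data from small data sets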
What does a confidence interval of 95% mean?
- if the sampling were repeated many times, 95% of the intervals constructed would include the true population parameter - confidence level describes the uncertainty associated with a sampling method (some interval estimates would include the true population parameter and some would not) *High yield* - 95% confidence limits are approximately equal to the sample mean plus or minus two standard errors - to halve the width of the confidence interval (double the precision), the sample size must be increased fourfold
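The fourfold rule follows from the standard error shrinking with the square root of n. A minimal sketch, assuming a hypothetical SD of 10 and z = 1.96; the function name `ci_width_95` is illustrative only:

```python
import math

def ci_width_95(sd, n, z=1.96):
    """Full width of a 95% confidence interval for a mean: 2 * z * SD / sqrt(n)."""
    return 2 * z * sd / math.sqrt(n)

# Hypothetical SD of 10; only the sample size changes.
width_n25 = ci_width_95(sd=10, n=25)    # 2 * 1.96 * 10 / 5
width_n100 = ci_width_95(sd=10, n=100)  # 2 * 1.96 * 10 / 10

print(width_n25, width_n100)  # the second width is half the first
```

Going from n = 25 to n = 100 (fourfold) halves the interval width.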
What is a normal distribution curve?
- a graphical presentation of any quantitative variables such as: weights, heights, Hb levels and blood pressure - continuous data - occupies a major role in the techniques of statistical analysis
What is a frequency histogram?
- a histogram is a plot of the class frequencies, relative frequencies, or percent relative frequencies against the class boundaries (or class midpoints)
By which parameters are distribution curves described?
- arithmetic mean (determines location of center of curve) - standard deviation (scatter around the mean)
What is the difference between qualitative and quantitative data?
*Qualitative* - deals with descriptions - data can be observed but not measured - colors, textures, smells, tastes, appearance, beauty etc *Quantitative* - deals with numbers - data which can be measured - length, height, area, volume etc
What are the subsets of ratio?
*Discrete* - the variable takes on a countable number of values - most often these variables represent some kind of count, such as the number of prescriptions an individual takes daily *Continuous* - result of a measuring process ie age, money, time, mph, height and weight - precision is often limited by the measuring instrument - units should be provided
What are levels of numerical data?
*Interval* - distance exists but not ratio - zero is arbitrary and not an indication of absence of the measurement (ie. temperature scale in Celsius, calendar years, IQ scores and GPA) - have order and equal intervals - allow ranking and quantifying to compare magnitudes - For example, if you subtract two interval values (60 degrees F - 30 degrees F = 30 degrees F), the distance still makes sense. However, a ratio does not make sense: 60F is not twice as warm as 30F *Ratio* - ratios exist and zero indicates an absence of the measurement - ie height of 6 feet would be twice that of 3 feet
What are measures of central tendency?
*Mode* - most frequent data point - unaffected by extreme values - useful for qualitative data - may have more than 1 value *Median* - value that divides ranked data points into halves: 50% are larger than it and 50% are smaller - may not exist as a data point in the set - influenced by position of items but not their values *Mean* - most stable measure - affected by extreme values - may not exist as a data point in the set
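A small illustration of how an extreme value moves the mean but barely touches the median or mode. The data are made up for the example:

```python
import statistics

scores = [2, 3, 3, 4, 5]        # hypothetical data
with_outlier = scores + [40]    # same data plus one extreme value

for label, values in [("original", scores), ("with outlier", with_outlier)]:
    print(
        label,
        "mean:", statistics.mean(values),
        "median:", statistics.median(values),
        "mode:", statistics.mode(values),
    )
```

The mean jumps from 3.4 to 9.5, the median only shifts from 3 to 3.5 (a value that is not in the data set), and the mode stays at 3.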
What are the levels of categorical data?
*Ordinal* - ordering does exist --> observations can be ranked or have a rating scale attached (i.e. class levels or military rank) *Nominal* - ordering does not exist (ie social security number) - observations can be assigned a code in form of a number where numbers are simply labels - hair color, gender, ethnicity
What is the difference between accuracy and precision?
*Precision* - degree to which a figure is immune from random variation - the width of the confidence interval reflects precision; the wider the CI, the less precise the estimate *Accuracy* - degree to which an estimate is immune from systematic error or bias - how close a measurement comes to the truth, and is determined by how close a measurement comes to an existing value that has been measured by many, many experts
What is a boxplot?
*Very important* - box and median line - outliers displayed with a symbol (ie a circle) - whiskers extend to the smallest and largest values that are not outliers - IQR = interquartile range = Q3-Q1 - IQR is the width of an interval which contains the middle 50% of the sample, so it is smaller than the range and its value is less affected by outliers - useful when large numbers of observations are involved and when two or more data sets are being compared
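A rough sketch of the numbers behind a boxplot, using made-up data. The 1.5 x IQR fence for flagging outliers is a common convention assumed here, not something stated in these notes, and quartile definitions vary slightly between tools:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical sample with one extreme value

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention (not from the notes): 1.5 * IQR fences define outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]
whiskers = (min(inside), max(inside))
data_range = max(data) - min(data)

print("Q1, median, Q3, IQR:", q1, median, q3, iqr)
print("whiskers:", whiskers, "outliers:", outliers, "range:", data_range)
```

Here the IQR is 4 while the range is 29, which shows how much less the IQR is affected by the single outlier.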
What are the two types of data and what is the difference between the two?
*categorical* - such as color of student's eyes or Likert scale: agree vs. disagree *numerical* - i.e. temperature of room in Fahrenheit or height in inches
A study is being conducted to understand the health outcomes for smokers in NY state, what is an example of the population, sample and element for this study?
*population* - all smokers in NY state *sample* - subset of population that is actually being observed or studied; smokers from 3 clinics in NY state *element* - a single observation represents a single data element; an individual smoker
What is a pie chart?
- *categorical* - Commonly used circular chart divided into sectors; each sector showing the relative size of each categorical variable
What is a bar chart?
- *categorical* - made up of columns and rows plotted on a graph - columns are positioned over a label that represents a categorical variable - height of column = size of group defined by column label
Inferential statistics: Which tests would you consider if there are continuous variables associated?
- are variables X and Y linearly related? --> Pearson's correlation - do multiple values predict Y? --> linear regression
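As a sketch of what Pearson's correlation measures, the coefficient can be computed by hand from the deviations of X and Y around their means. The helper `pearson_r` and the data are illustrative assumptions:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: co-variation of X and Y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

xs = [1, 2, 3, 4, 5]                    # hypothetical predictor values
r_up = pearson_r(xs, [2, 4, 6, 8, 10])  # perfectly linear, increasing
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # perfectly linear, decreasing
print(r_up, r_down)
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.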
What are the characteristics of normal distribution curve?
- bell-shaped and continuous curve - symmetrical - tails never touch the base line rather they extend into infinity in either direction - the mean, median and mode values coincide
What is descriptive statistics?
- describe, organize or summarize data - refers only to *actual* data available
What are the characteristics of range?
- difference between the max and min value in a dataset - may be misleading - ignores the way data is distributed - sensitive to outliers
What is a t-score?
- estimated SE is used to find a statistic, t, that can be used in place of z - must be used when making inferences about means that are based on estimates of population parameters such as estimated SE rather than on the population parameters themselves
What are different ways to present categorical data?
- frequency table - bar graph - pie chart
What are different ways to present quantitative data?
- histogram (frequency) - stem and leaf plot - box plot
What is a frequency table?
- lists categories of a variable and their respective frequencies - frequency is a count of subjects who fall into a particular category - relative frequency is the frequency divided by total count - relative frequency is independent of the size of the data set, so it can be used to compare datasets of different sizes
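A minimal sketch of building a frequency table with counts and relative frequencies. The eye-color data and the helper name `frequency_table` are made up for illustration:

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (count, count / total) for category, count in counts.items()}

# Hypothetical categorical data.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]
table = frequency_table(eye_colors)

for category, (count, relative) in table.items():
    print(f"{category:<6} {count:>2} {relative:.3f}")
```

The relative frequencies always sum to 1, whatever the size of the data set.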
What is inferential statistics?
- makes inferences that go beyond actual data - usually involves inductive reasoning (generalizing to a population after observing a sample) - involves use of a statistic to estimate a parameter - making a generalization about a larger group of individuals on the basis of a subset or sample
What is a z-score?
- measure of how many standard deviations below or above the population mean a raw score is; aka standard score (can be placed on normal distribution curve) - standardized value of observation x from a distribution with a known mean and SD: z = (x - mean)/SD - typically ranges from -3 to +3 SDs - compare results from a sample value to a normal population - z-score of 1 is 1 SD above the mean - z-score of 2 is 2 SDs above the mean - z-score of -1.6 is 1.6 SDs below the mean
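A short sketch of the standardization, assuming a hypothetical population with mean 100 and SD 15:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical population: mean 100, SD 15.
z_high = z_score(130, mean=100, sd=15)  # 30 points above the mean
z_low = z_score(76, mean=100, sd=15)    # 24 points below the mean
print(z_high, z_low)
```

A raw score of 130 is 2 SDs above the mean and a raw score of 76 is 1.6 SDs below it.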
What is the standard error of the mean (SEM)?
- measure of the extent to which sample means deviate from the true population mean - SEM = SD/square root of n
By which parameters are distribution curves summarized?
- measures of central tendency - measures of dispersion
What is a measure of dispersion?
- range - standard deviations - variance - mean deviation
What are the characteristics of standard deviation?
- square root of the variance - based on the average of the squared differences between the mean and each observation in the data - subtract the mean from each value, square these differences, sum them and divide by the number of observations (n - 1 for a sample), then take the square root - measure of how much variability is present in the sample
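The calculation can be sketched step by step. The raw deviations from the mean always sum to zero, which is why they are squared before averaging. The data are made up, and the helper `standard_deviation` is illustrative:

```python
import math
import statistics

def standard_deviation(values, sample=True):
    """Square root of the variance; divides by n - 1 for a sample, n for a population."""
    n = len(values)
    mean = sum(values) / n
    squared_devs = [(x - mean) ** 2 for x in values]
    variance = sum(squared_devs) / (n - 1 if sample else n)
    return math.sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5
raw_dev_sum = sum(x - sum(data) / len(data) for x in data)  # deviations cancel out

sd_population = standard_deviation(data, sample=False)
sd_sample = standard_deviation(data, sample=True)
print(raw_dev_sum, sd_population, sd_sample)
```

For these values the population SD is 2 and the sample SD is slightly larger because of the n - 1 divisor.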
What do degrees of freedom mean?
- t-tables express sample size in terms of degrees of freedom = n-1 - if you had four numbers and knew their sum total or their mean, only three of the values are free to vary; once three are known, the fourth is fixed (df = 4 - 1 = 3)
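A tiny sketch of why one value is not free once the mean is known. The numbers and the helper `last_value` are hypothetical:

```python
def last_value(known_values, mean, n):
    """Given the mean of n numbers and n - 1 of the values, the last one is determined."""
    return mean * n - sum(known_values)

# Four numbers with mean 6: three are free to vary, the fourth is forced.
fourth = last_value([3, 5, 8], mean=6, n=4)
print(fourth)
```

With a mean of 6 the four numbers must sum to 24, so after 3, 5 and 8 the last value can only be 8.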
What is the addition rule?
- the probability of any one of several particular events occurring is equal to the sum of their individual probabilities (provided they are mutually exclusive) - ie picking a heart or a diamond in a single draw from a deck of cards: 13/52 + 13/52 = 1/2
What is the multiplication rule?
- the probability of two or more statistically *independent* events all occurring is equal to the product of their individual probabilities - ie probability of having both cancer and schizophrenia
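Both rules can be checked with exact fractions. The card probabilities are standard; the two disease prevalences are hypothetical placeholders chosen only to show the arithmetic:

```python
from fractions import Fraction

# Addition rule: mutually exclusive events (one card cannot be both suits).
p_heart = Fraction(13, 52)
p_diamond = Fraction(13, 52)
p_heart_or_diamond = p_heart + p_diamond

# Multiplication rule: independent events.
# Hypothetical prevalences, not real epidemiological figures.
p_condition_a = Fraction(1, 100)
p_condition_b = Fraction(1, 50)
p_both = p_condition_a * p_condition_b

print(p_heart_or_diamond, p_both)
```

The addition rule answers "either one" questions for events that cannot co-occur; the multiplication rule answers "both" questions for events that do not influence each other.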
What is the difference between t-score vs z-score?
- the value of t for any given proportion is not constant, it varies according to sample size - when the sample size is large, the values of t and z are similar, but as samples get smaller, t and z scores become increasingly different
Which of the following categories best describes the tumor staging of ovarian cancer? 1. Categorical,Ordinal 2. Categorical,Nominal 3. Numerical,Interval 4. Numerical,Ratio
1. Categorical, Ordinal
A research group wishes to report their findings of a study predicting the mean concentration of arsenic in the municipal water supply of Bangladesh. This is an example of which type of statistics: A. Inferential B. Descriptive
A. Inferential
A research group wishes to report their findings of a study predicting the mean number of hours of time spent on Facebook per day, based on a sample of 1,000 graduate students. This is an example of which type of statistics: A. Inferential B. Descriptive
A. Inferential
If the sample of patients was increased what would happen to the SEM? A. It would decrease due to an inverse relationship with the square root of the sample size B. It would increase due to an inverse relationship with the square root of the sample size C. It would decrease due to an inverse relationship with the square of the standard deviation D. It would increase due to an inverse relationship with the square of the standard deviation
A. It would decrease due to an inverse relationship with the square root of the sample size - SEM = SD/square root of n
Inferential statistics: Which tests would you consider if you are looking for a difference between 2 or more groups?
ANOVA
Question: The health department is investigating the possibility of lead poisoning in a population of children in a large city. You sample 25 patients and find that their average serum lead level is 15 μg/dL. The standard deviation is 2.5. What is the estimated standard error? A. 0.25 B. 0.5 C. 2.0 D. 2.5 E. 5
B. 0.5 - SE = SD/√n = 2.5/√25 = 2.5/5 = 0.5
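The same arithmetic as a small helper, shown as a sketch; the function name `standard_error` is illustrative:

```python
import math

def standard_error(sd, n):
    """Estimated standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

se_lead = standard_error(sd=2.5, n=25)  # lead level question: 2.5 / 5
print(se_lead)
```

Because n sits under a square root, a larger sample always gives a smaller standard error for the same SD.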
The research group reports their findings of the study of arsenic in the municipal water supply of Bangladesh. The mean arsenic level was 100 μg/L, with a standard error of 9. Which of the following best represents the 95% Confidence Interval where the true mean arsenic level of this population lies? A. 91-109 B. 82-118 C. 69-131 D. 55-145 E. 96-104
B. 82-118 - Confidence Interval (CI) = sample mean (+/-) z(SE) - Sample mean 100, z = +/- 1.96, SE = 9 - CI = 100 +/- 1.96(9) = 100+/- 17.6 - CI = 82, 118
A research group wishes to report their findings of a survey describing the prevalence of obesity in students at their college. This is an example of which type of statistics: A. Inferential B. Descriptive
B. Descriptive
Researchers wish to generalize from a random sample of 36 patients the mean BMI of a population. They calculate the mean BMI of their sample to be 25 and the standard deviation to be 4. The 95% confidence interval of this sample estimating the mean true population BMI is closest to which of the following? A. 24.3-25.7 B. 23.9-26.1 C. 23.7-26.3 D. 24.3-26.3 E. 35.3-36.7
C. 23.7-26.3 - Estimated SE = 4/√36 = 4/6 = 0.67 - 95%CI = sample mean +/- (1.96)SE - 95%CI = 25 +/- (1.96)(0.67) = 25 +/- 1.3 - 95%CI = 23.7, 26.3
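The worked answer can be reproduced with a short sketch that assumes the z-based interval used in these cards (z = 1.96); the function name is illustrative:

```python
import math

def confidence_interval_95(mean, sd, n, z=1.96):
    """Return (lower, upper) limits: sample mean +/- z * SD / sqrt(n)."""
    se = sd / math.sqrt(n)
    margin = z * se
    return mean - margin, mean + margin

# BMI question: mean 25, SD 4, n = 36.
low, high = confidence_interval_95(mean=25, sd=4, n=36)
print(round(low, 1), round(high, 1))
```

Rounded to one decimal place, the limits match option C.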
Researchers wish to generalize from a random sample of 36 patients the mean BMI of a population. They calculate the mean BMI of their sample to be 25 and the standard deviation to be 4. The estimated standard error is closest to which of the following? A. 1.5 B. 1.25 C. 0.80 D. 0.67 E. 0.05
D. 0.67 - Estimated SE = 4/√36 = 4/6 = 0.67
Question: In a *population* of runners with a mean resting heart rate of 60 beats/min and a standard deviation of 12, the probability that a random sample of 16 runners will have a mean heart rate of 66 beats/min or higher is closest to which of the following? A. 16% B. 10% C. 5% D. 2.5% E. 1%
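D. 2.5% - SE = 12/√16 = 12/4 = 3 - z = (66 - 60)/3 = 2 - about 95% of sample means lie within 2 SE of the population mean, so about 2.5% lie 2 SE or more above it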
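A sketch of the calculation using the exact normal tail area rather than the 2 SE rule of thumb. It assumes sample means are normally distributed; the function name is illustrative:

```python
import math
from statistics import NormalDist

def prob_sample_mean_at_least(threshold, pop_mean, pop_sd, n):
    """P(sample mean >= threshold), assuming normally distributed sample means."""
    se = pop_sd / math.sqrt(n)          # standard error of the mean
    z = (threshold - pop_mean) / se     # distance from the mean in SE units
    return 1 - NormalDist().cdf(z)      # upper-tail area of the standard normal

# Runner question: population mean 60, SD 12, sample of 16, threshold 66.
p = prob_sample_mean_at_least(66, pop_mean=60, pop_sd=12, n=16)
print(round(p, 4))
```

The exact tail area for z = 2 is a little under 2.5%, so 2.5% is the closest of the listed options.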
Inferential statistics: Which tests would you consider if there are categorical variables associated and variables X and Y are related?
Spearman correlation
A researcher would like an estimate of the mean systolic blood pressure (SBP) of newborn infants. A random sample of 100 patients is drawn from a newborn nursery. The standard deviation for neonatal SBP is 6.5 mm Hg. What is the value of SEM (rounded to second Decimal)?
- SEM = SD/square root of n - SEM = 6.5/√100 - SEM = 6.5/10 - SEM = 0.65
What type of variable is ethnicity?
categorical, nominal
What type of variable is hair color?
categorical, nominal
What type of variable is class rank?
categorical, ordinal
Inferential statistics: Which tests would you consider if there are categorical variables associated and variables X1, X2,... and Y are independent?
chi square test
Descriptive or inferential: distribution of favorite ice cream flavors at an ice cream shop
descriptive
Descriptive or inferential: summary of household age, income and debt.
descriptive
Descriptive or inferential: examine the association between opening of the plastics plant and the incidence of childhood malignancies
inferential
If you have a distribution and the left tail is longer, what is its skewness?
negative
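The sign of the skew can be checked numerically. This sketch uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is one common definition and an assumption here; the data are made up:

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]      # long right tail
left_tailed = [-x for x in right_tailed]      # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0, skewness(left_tailed) < 0, skewness(symmetric))
```

A long right tail gives a positive value, a long left tail gives a negative value, and a symmetric set gives zero.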
Inferential statistics: Which tests would you consider if you are trying to determine the probability of reaching an endpoint of interest (ie death)?
survival analysis