Descriptive Statistics: Summarizing the Data
sample size (n) population size (N)
units for sample/population size
Absolute Range: -"If we have a dataset that has extreme outliers (positive direction or negative) we're going to want to use the median to describe the central tendency of that dataset; if we use the median, we need to be able to measure variability too and absolute range is one of those ways."
- Difference b/t the largest and smallest observation in a dataset (ex: if your lowest value is 1 and your highest value is 10, the absolute range can be written as 9, the difference between the two numbers, or as the pair (1, 10)) - *Disadvantage*: the range is based solely on two observations and is likely not representative of the whole dataset (if you have outliers, it doesn't tell you where they are) - Absolute range is particularly susceptible to outliers.
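A quick sketch of the 1-to-10 example above, using made-up values in between (the numbers other than 1 and 10 are assumptions for illustration). It also shows how a single outlier changes the range.

```python
# Made-up dataset whose lowest value is 1 and highest value is 10
data = [1, 4, 5, 7, 10]

lowest, highest = min(data), max(data)
absolute_range = highest - lowest      # written as a single number: 9
range_as_pair = (lowest, highest)      # written as a pair: (1, 10)

# One extreme observation is enough to change the range completely
with_outlier = data + [100]
range_with_outlier = max(with_outlier) - min(with_outlier)  # 99

print(absolute_range, range_as_pair, range_with_outlier)
```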
Standard Error of the Mean (SEM):
- defined as the variability of the sample means (very similar to SD) - calculated as the *SD divided by the square root of the sample size* (SEM = SD / sqrt(n)) - SEM will always be *smaller* than the sample SD whenever n > 1 ----> reported in the scientific and medical literature as an intermediary step in calculating confidence intervals. -*it's the variability of the sample means themselves*
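A minimal sketch of the SEM calculation described above (SD divided by the square root of n). The sample values are invented for illustration, and the SD here is assumed to be the sample SD (n - 1 in the denominator), which is what `statistics.stdev` computes.

```python
import math
import statistics

# Invented sample of n = 9 observations
sample = [4.0, 8.0, 6.0, 5.0, 3.0, 7.0, 9.0, 6.0, 6.0]

n = len(sample)
sd = statistics.stdev(sample)   # sample standard deviation
sem = sd / math.sqrt(n)         # standard error of the mean

print(f"SD = {sd:.3f}, SEM = {sem:.3f}")
```

Because the SD is divided by sqrt(n), the SEM shrinks as the sample size grows, while the SD itself does not systematically shrink.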
Standard Deviation
- most common measure of the variability around the mean -used with the mean b/c it's the most informative when the dataset is normally distributed -mean and SD are commonly used together as the most informative measures of central tendency and variability of a dataset, respectively
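A short sketch of how the SD measures spread around the mean, worked out step by step and then checked against Python's `statistics` module. The dataset is made up, and the choice between the sample SD (divide by n - 1) and the population SD (divide by N) is an assumption the notes don't specify, so both are shown.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values

mean = sum(data) / len(data)                       # 5.0
squared_devs = [(x - mean) ** 2 for x in data]     # squared distance of each value from the mean

sample_sd = math.sqrt(sum(squared_devs) / (len(data) - 1))   # divide by n - 1
population_sd = math.sqrt(sum(squared_devs) / len(data))     # divide by N

print(f"mean = {mean}, sample SD = {sample_sd:.3f}, population SD = {population_sd:.3f}")
```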
Median
- number in the middle -50% of all ranked values are smaller and the other 50% are larger -*not influenced by extreme outliers* -therefore it's useful in situations in which there are unusually low or high values that would render the *mean* unrepresentative of the data
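To illustrate why the median is not influenced by extreme outliers, here is a small sketch with invented numbers: replacing the largest value with an extreme one moves the mean a lot but leaves the median where it was.

```python
import statistics

values = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]   # same data, but the largest value is now extreme

median_before = statistics.median(values)        # 4
median_after = statistics.median(with_outlier)   # still 4

mean_before = statistics.mean(values)            # 4
mean_after = statistics.mean(with_outlier)       # 14.8, pulled toward the outlier

print(median_before, median_after, mean_before, mean_after)
```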
Central tendency
- tells you where the average, middle, or most common data are located -"a general term used to describe a single value near a point in the data that represents where the largest portion of data are located"
For a bell-curve *(normal distribution)*, the highest point on the Y-axis *(top of the bell)* is the mean *(on the X-axis)*. What happens when you go one or two standard deviations from the mean?
-68% of values will fall within +/- 1 SD from the mean -95% of the values will fall within +/- 2 SD from the mean -99.7% of the values will fall within +/- 3 SD from the mean -*this will be true when dealing with NORMAL DISTRIBUTION*
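The 68 / 95 / 99.7 figures above can be checked numerically. This is a sketch using `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, SD 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of values within +/- k SD of the mean for a normal distribution."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"+/- {k} SD: {share_within(k) * 100:.2f}%")
# Roughly 68.27%, 95.45%, and 99.73%
```

The rounded values in the notes (68%, 95%, 99.7%) are the usual shorthand for these percentages.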
Distribution:
-Statistical test selection relies heavily on the distribution of data. These distributions can be summarized by central tendency and variation around the center. -Most important distribution is the normal or Gaussian curve (bell-shaped curve)
When the curve is not distributed normally *(skewed to the positive or negative)*, the median is a better representation of central tendency because the skew will cause the mean to move to the left *(negatively skewed curves)* or the right *(positively skewed curves)*
-for skewed distributions, the median is a better representation of the central tendency than the mean, and the IQR is a better measure of the variability than the SD.
Mean
-most commonly used measure of central tendency (average of all observations in a dataset) -it *IS* influenced by extreme outliers -most useful when data are symmetrically distributed without outliers (i.e., normal distribution)
Interquartile Range (IQR): -used much more often than absolute range (which doesn't attest to reality due to the potential for outliers)
-quartiles are calculated in a way similar to the median, which splits a dataset into two equally sized groups (50th percentile) - *quartiles split the data into four approximately equal groups* [25th percentile (lower quartile, Ql or Q1) and 75th percentile (upper quartile, Qu or Q3)] -IQR is the range b/t the lower and upper quartiles ---->like the median, the IQR is particularly *useful when data are not symmetrically distributed (there are outliers in either the pos OR neg direction, not both).*
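A sketch of the quartile and IQR calculation using `statistics.quantiles` with its default settings. The data are invented, and note that software packages use slightly different quartile conventions, so other tools may give slightly different Q1 and Q3 for the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]          # invented, n = 11
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]      # same data with one extreme high value

def iqr(values):
    """Interquartile range: upper quartile (Q3) minus lower quartile (Q1)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

q1, q2, q3 = statistics.quantiles(data, n=4)   # 25th, 50th, 75th percentiles

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr(data), "vs IQR with outlier:", iqr(skewed))
print("Range:", max(data) - min(data), "vs range with outlier:", max(skewed) - min(skewed))
```

In this example the outlier leaves the IQR unchanged while the absolute range jumps from 10 to 999.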
Mode
-the value with the greatest frequency of occurrence - not generally used because it's often NOT representative of the data, particularly when the dataset is small
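A small sketch of the mode with invented data. It also shows one reason the mode can be unrepresentative in a small dataset: when no value repeats, every value ties for "most frequent".

```python
import statistics

repeated = [1, 2, 2, 3, 3, 3, 4]
single_mode = statistics.mode(repeated)     # 3 occurs most often

tied = [1, 1, 2, 2, 3]
tied_modes = statistics.multimode(tied)     # [1, 2]: two values share the top frequency

small = [5, 7, 9]
small_modes = statistics.multimode(small)   # [5, 7, 9]: nothing repeats, so all values tie

print(single_mode, tied_modes, small_modes)
```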
What is inferential statistics used for?
-to make predictions (i.e. infer) about a large amount of information (population) based on a sample. -used in decision making in many fields, including regulatory drug approval and clinical drug usage.
What is descriptive statistics used for?
-to organize, summarize, categorize, and display data (basically just summarizes data)
What graphical method best depicts variability through median and IQR *(used when you have extreme outliers)*?
-using *boxplots* -boxplots are one-dimensional graphs that can be drawn from the range, IQR, and median.
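A sketch of the numbers a boxplot is drawn from (smallest value, lower quartile, median, upper quartile, largest value). The data and the helper name `boxplot_numbers` are made up for illustration, and the quartiles follow the default convention of `statistics.quantiles`; plotting software may use a slightly different one.

```python
import statistics

def boxplot_numbers(values):
    """Return (min, Q1, median, Q3, max), the values a basic boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 50]   # invented values

lowest, q1, median, q3, highest = boxplot_numbers(data)
print("range:", (lowest, highest))      # the whiskers in a basic boxplot
print("IQR box:", (q1, q3))             # the edges of the box
print("median line:", median)           # the line inside the box
```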
Variability measures: -Central tendency (mean, median) does NOT provide any information about the variability in a given dataset; therefore CT is usually used with a variability measure to better describe the dataset
1. Standard Deviation (SD) ----used with mean 2. Standard Error of the Mean (SEM) ----used with mean 3. Absolute Range ----used with median 4. Interquartile Range (IQR) ----used with median
Statistics is broken down into two fundamental parts:
1. descriptive statistics 2. inferential statistics
Most commonly used central tendency measures:
1. median (m) 2. mean (X̄ or μ) 3. mode (not used as much)
A dataset with a large number of outliers is best described by a *BLANK* and *BLANK* as opposed to a *BLANK2* and *BLANK2*.
BLANKS: median and IQR BLANKS2: mean and SD