Descriptive statistics
Population variance
- Average of the squared deviations of values from the mean - A commonly used measure of variation - Shows variation about the mean - Expressed in squared units of the original data (take the square root to get back to the original units) - Never negative
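A minimal sketch of the definition above, computed by hand and checked against Python's standard library. The data values are made up for illustration and are treated as a whole population.

```python
import statistics

# Hypothetical data, treated as an entire population (illustrative values only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in values]

# Population variance: the average of the squared deviations (divide by N).
pop_variance = sum(squared_deviations) / len(values)

# The standard library's population variance should agree.
library_variance = statistics.pvariance(values)

print(pop_variance, library_variance)
```

Because every deviation is squared before averaging, the result cannot be negative.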
mean
- Most commonly used measure because most data are interval or ratio - Most sensitive measure; affected by outliers - If there are extreme scores, the mean can be misleading, so use the median instead
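To illustrate the sensitivity of the mean, the sketch below adds one extreme score to a small made-up data set and compares how far the mean and the median move. The scores are hypothetical.

```python
import statistics

# Hypothetical scores (illustrative values only).
scores = [10, 12, 13, 14, 15]
scores_with_outlier = scores + [100]  # add one extreme score

mean_before = statistics.mean(scores)
mean_after = statistics.mean(scores_with_outlier)

median_before = statistics.median(scores)
median_after = statistics.median(scores_with_outlier)

# The mean shifts a lot; the median barely moves.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```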
Standard Deviation
- Most commonly used measure of variation - Shows variation about the mean; a measure of the average scatter around the mean - Has the same units as the original data - The square root of the variance
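As a quick check of the last point, this sketch takes the square root of a population variance and compares it with the standard library's population standard deviation. The data values are made up.

```python
import math
import statistics

# Hypothetical data, treated as an entire population (illustrative values only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

pop_variance = statistics.pvariance(values)  # in squared units
pop_sd = math.sqrt(pop_variance)             # back in the original units

print(pop_sd, statistics.pstdev(values))
```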
Measures of central tendency
- A descriptive statistic - Measures of typicality - More than one measure of central tendency is needed because the shape of the distribution matters (sensitivity varies) - The type of measurement determines which measure of central tendency to use - Mean, median, and mode
Interquartile range
- Reduces the problem of outliers because the middle 50% of the data is where values are most stable - Eliminates some high- and low-valued observations and calculates the range from the remaining values - The range of the middle 50% (Q3 - Q1)
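The sketch below compares the range and the interquartile range on a made-up data set before and after one value is replaced by an extreme one. Quartile conventions differ between textbooks and software; this example assumes the default ("exclusive") method of Python's `statistics.quantiles`, and the helper name `iqr` is hypothetical.

```python
import statistics

def iqr(values):
    """Interquartile range Q3 - Q1, using statistics.quantiles' default method."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical data (illustrative values only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
data_with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

range_before = max(data) - min(data)
range_after = max(data_with_outlier) - min(data_with_outlier)

# The range jumps with the extreme value; the IQR stays the same here.
print("range:", range_before, "->", range_after)
print("IQR:  ", iqr(data), "->", iqr(data_with_outlier))
```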
median
- The 50th percentile - A little more sensitive than the mode, but still not affected by extreme values - Can't be used for categorical (nominal) data - Best measure for ordinal data - If the number of values is odd, the median is the middle number of the sorted values - If the number of values is even, the median is the average of the two middle numbers
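The odd/even rule above can be sketched as a small function. The name `median_of` is hypothetical; in practice `statistics.median` from the standard library does the same job.

```python
def median_of(values):
    """Median by the odd/even rule: middle value, or average of the two middle values."""
    ordered = sorted(values)  # the rule applies to the sorted values
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([7, 1, 3]))     # odd number of values
print(median_of([7, 1, 3, 5]))  # even number of values
```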
mode
- The least sensitive measure (not affected by extreme values) - The most frequently occurring value - Problems: there may be multiple modes or no mode - Can be used for categorical (nominal) or numerical data
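A short sketch of the mode on categorical and numerical data, including the multiple-mode and no-mode problems. It uses `statistics.multimode` (available in Python 3.8 and later); the data are made up.

```python
import statistics

# Hypothetical nominal data with a single mode.
colors = ["red", "blue", "red", "green"]
single_mode = statistics.multimode(colors)

# Two values tie for most frequent: multiple modes.
two_modes = statistics.multimode(["red", "blue", "red", "green", "blue"])

# Every value occurs once: all values tie, so there is no useful mode.
all_tied = statistics.multimode([1, 2, 3])

print(single_mode, two_modes, all_tied)
```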
skewed distributions and positions of measures
- Mode at the peak - Median in between - Mean pulled toward the tail
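As an illustration, the made-up data set below has a long right tail, so the three measures are expected to line up as mode, then median, then mean.

```python
import statistics

# Hypothetical right-skewed data: most values are small, a few are large.
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]

mode = statistics.mode(data)      # sits at the peak
median = statistics.median(data)  # in between
mean = statistics.mean(data)      # pulled toward the long right tail

print(mode, median, mean)
```

For a left-skewed data set the order would be expected to reverse, with the mean pulled toward the low end.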
measures of variation
- Numbers that indicate how much scores differ from each other and from the measure of central tendency - Give information on the spread or variability of the data values - Range, interquartile range, variance, standard deviation, coefficient of variation - No measure of variability for nominal data
range
- Simplest measure of variability - Difference between the largest and smallest values - Disadvantages: ignores how the data are distributed and is sensitive to outliers
quartiles
- Split the ordered data into 4 segments with an equal number of values per segment - Identify the relative position of scores in a distribution - Q1 = 25th percentile - Q2 = 50th percentile (median) - Q3 = 75th percentile
Empirical rule
- Things are highly predictable if the data distribution is bell-shaped: - the interval mean ± 1 standard deviation contains about 68% of the values in the population or the sample - the interval mean ± 2 standard deviations contains about 95% of the values in the population or the sample - the interval mean ± 3 standard deviations contains about 99.7% of the values in the population or the sample
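The three percentages can be checked against an exact normal distribution. This sketch uses `statistics.NormalDist` (Python 3.8 and later) with a standard normal curve; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
```

Real data are only roughly bell-shaped, so observed percentages would be expected to land near these figures rather than match them exactly.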
Exploratory data analysis
The box-and-whisker plot is a graphical display of data using the 5-number summary: Minimum -- Q1 -- Median -- Q3 -- Maximum - The box and central line are centered between the endpoints if the data are symmetric around the median - Allows you to see if there are outliers
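A minimal sketch of the 5-number summary behind a box plot, using made-up symmetric data. Quartile conventions vary; this assumes the default method of `statistics.quantiles`, so other software may give slightly different Q1 and Q3.

```python
import statistics

# Hypothetical symmetric data (illustrative values only).
data = [2, 4, 6, 8, 10, 12, 14]

q1, median, q3 = statistics.quantiles(data, n=4)
five_number_summary = (min(data), q1, median, q3, max(data))

# For symmetric data the median sits midway between Q1 and Q3,
# and the box sits midway between the minimum and maximum.
box_is_centered = (median - q1) == (q3 - median)
whiskers_are_equal = (q1 - min(data)) == (max(data) - q3)

print(five_number_summary, box_is_centered, whiskers_are_equal)
```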
advantages of SD and variance
Each value in the data set is used in the calculation - Values far from the mean are given extra weight (because deviations from the mean are squared)
variation
The most important concept in statistics is analyzing variance, because it measures diversity and individual differences - On average, how different is a score from the mean?
normal distribution
Many naturally occurring variables are approximately normally distributed - Mean, median, and mode all fall at the same spot in the center
variance
only interpretable in comparison to other measures of variability; can never be negative
parameters vs statistics
Population parameters are constant (relatively stable), but sample statistics vary from sample to sample
reason for n-1
The standard deviation is the square root of the variance - Without subtracting 1 you would underestimate the variability in the population, because a small group cannot represent the full diversity of the population - Subtracting 1 from n in the denominator bumps up the final answer - Without it, the sample variance is a biased estimator of the population variance
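One way to see the bias is to list every possible sample from a tiny population and average the two versions of the variance. The sketch below assumes a made-up population of six values and samples of size 2 drawn with replacement.

```python
import itertools
import statistics

# Hypothetical tiny population (illustrative values only).
population = [1, 2, 3, 4, 5, 6]
true_variance = float(statistics.pvariance(population))

# Every possible ordered sample of size 2, drawn with replacement.
samples = list(itertools.product(population, repeat=2))

# Average sample variance when dividing by n (no correction).
avg_divide_by_n = statistics.mean(
    [float(statistics.pvariance(s)) for s in samples]
)

# Average sample variance when dividing by n - 1.
avg_divide_by_n_minus_1 = statistics.mean(
    [float(statistics.variance(s)) for s in samples]
)

print(true_variance, avg_divide_by_n, avg_divide_by_n_minus_1)
```

Dividing by n comes out too low on average, while dividing by n - 1 matches the population variance on average in this setup.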
Insensitivity
The statistic doesn't change when the data change - In general, want the most sensitive measure