Ch 2: Methods for Describing Sets of Data
Median
The number that divides the bottom 50% of the data from the top 50%; the middle value of the ordered list. If n is odd, M (or η) is the middle number; if n is even, M (or η) is the average of the 2 middle numbers.
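A minimal sketch of the odd/even rule for the median, assuming the data arrive as a plain Python list; the function name `median` is illustrative only.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n is odd: a single middle number exists
        return ordered[mid]
    # n is even: average the 2 middle numbers
    return (ordered[mid - 1] + ordered[mid]) / 2
```

For example, `median([5, 1, 3])` uses the odd-n branch and `median([4, 1, 3, 2])` uses the even-n branch.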
3 Measures of Central Tendency
1. Mean (numerical: discrete/continuous) 2. Median (numerical: discrete/continuous) 3. Mode (numerical or categorical)
Interquartile Range
(for middle 50% of data) IQR = upper quartile - lower quartile
3 graphs/charts that provide visual summary of qualitative data
1. (Relative) Frequency Table 2. Bar Chart (ex: Pareto Chart) 3. Pie Chart *Bars centered over: classes *White space between bars: yes
There are a large number of numerical measures available to describe quantitative data sets. Most of these methods measure one of two data characteristics.
1. Central Tendency 2. Variability
2 rules for using the mean and standard deviation (together) to describe a data set
1. Chebyshev's Rule 2. Empirical Rule
3 graphs used for describing numerical data
1. Dot Plot 2. Stem and Leaf 3. Histogram
When numerically describing a data set, what 2 numerical properties should be reported together?
1. Mean 2. Standard deviation
What percent of measurements in data set lie to the left of upper quartile?
75%
Histogram Bar Locations
Discrete: bars are centered over the values. Continuous: each bar begins at the lower endpoint of its class interval and ends at the upper endpoint.
Class Frequency
NUMBER (count) of observations/individuals in the data set that fall into a particular class
Stem and Leaf Display
Stem: listed in order in a column. Leaf: each quantitative measurement in the data set is placed in its corresponding stem row; leaves for observations with the same stem value are listed horizontally in increasing order. Ideal for: −continuous data with high resolution (same range; 15 decimals) or low resolution (same range; 1 decimal; ex: temp) −large data sets. Can be tricky for discrete, integer-valued data.
General Policy
Greek letters: population Roman letters: sample
Inner Fences
Lower (LIF) = QL − 1.5(IQR); Upper (UIF) = QU + 1.5(IQR). Outliers lie beyond the points at a distance of 1.5(IQR) from each hinge/quartile.
Outer Fences
Lower (LOF) = QL − 3(IQR); Upper (UOF) = QU + 3(IQR). Really extreme; usually beyond the data range; measurements past them are "highly" suspect OUTLIERS.
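The inner and outer fence formulas can be sketched as below. This is only an illustration: the quartiles QL and QU are assumed to be supplied by the caller, and the names `fences` and `classify` are hypothetical.

```python
def fences(q_lower, q_upper):
    """Inner and outer fences computed from the lower and upper quartiles."""
    iqr = q_upper - q_lower
    return {
        "iqr": iqr,
        "lower_inner": q_lower - 1.5 * iqr,
        "upper_inner": q_upper + 1.5 * iqr,
        "lower_outer": q_lower - 3 * iqr,
        "upper_outer": q_upper + 3 * iqr,
    }

def classify(x, q_lower, q_upper):
    """Label a measurement by where it falls relative to the fences."""
    f = fences(q_lower, q_upper)
    if x < f["lower_outer"] or x > f["upper_outer"]:
        return "highly suspect outlier"
    if x < f["lower_inner"] or x > f["upper_inner"]:
        return "suspect outlier"
    return "inside inner fences"
```

With QL = 10 and QU = 20, the IQR is 10, so the inner fences sit at −5 and 35 and the outer fences at −20 and 50.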
Skewed Right
Mean is greatest: Mode < Median < Mean. ex: salaries of all persons employed by a large company; ex: grades on a difficult test
Skewed Left
Mean is smallest: Mean < Median < Mode. ex: grades on an easy test; ex: amounts of time students spent on a difficult exam
Measure of Center: Measure of Variability
Mean: Standard Deviation (s, σ); Median: IQR
Interpreting Box Plots
Median = center of the distribution of the data. Length of box (IQR) = measure of the sample's variability (useful for comparing 2 samples). Visually compare the lengths of the whiskers: if one is clearly longer, the distribution is probably skewed in the direction of the longer whisker. Analyze any measurements that lie beyond the fences: less than 5% should fall beyond the inner fences, even for very skewed distributions; measurements beyond the outer fences are probably outliers.
5 Number Summary
Min, QL, Median, QU, Max; measures variability around the median
Class Percentage
PERCENTAGE: (class relative frequency) × 100%
Class Relative Frequency
PROPORTION written as a DECIMAL: class frequency divided by the total number of observations in the data set, (class frequency) / n
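Class frequency, class relative frequency, and class percentage can be computed together. The sketch below assumes the observations are a list of category labels; `frequency_table` is an illustrative name, not a standard function.

```python
from collections import Counter

def frequency_table(observations):
    """Map each class to (frequency, relative frequency, percentage)."""
    counts = Counter(observations)
    n = len(observations)
    return {
        cls: (freq, freq / n, 100 * freq / n)
        for cls, freq in counts.items()
    }
```

The relative frequencies across all classes should sum to 1, and the percentages to 100.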
Variability
SPREAD of the data with a SPECIFIC measure of center as REFERENCE
Measures of central tendency provide only a partial description of a quantitative data set.
The description is incomplete without a measure of variability, or spread, of the data set
3 Measures of Variability
aka spread; completes description of quantitative data set; most are in reference to measure of center 1. Range 2. Standard Deviation 3. IQR (Interquartile Range)
Histogram
Analog to the bar chart for quantitative data. The possible numerical values of the quantitative variable are partitioned into CLASS INTERVALS, each of the same width, and a vertical bar is placed over each class interval. Horizontal axis: class intervals, in numerical order. Bars' location: depends on discrete v continuous data. Vertical axis: bar height = class frequency, class relative frequency, or class percentage. Space between bars: no (numerical data).
Histogram v Bar Chart
Bar chart and Pareto chart: −white space between bars −bars centered over values −bar chart: arbitrary bar order; Pareto: bar order determined by descending height −cannot look at shape
Histogram: −no white space between bars −limits of the bar bases are defined by the endpoints of the class intervals −bar order uniquely determined by the numerical values of the quantitative variable (unchangeable order, so you can look at SHAPE)
Distorting the Truth with Descriptive Stats
biased sampling; small sample size; poorly chosen average; results falling within standard error; using graphs to create an impression; semi-attached figure (ex: contains); post-hoc fallacy
How is a Stem and Leaf similar to Dot Plot?
both plots: show data distribution, used to identify unusual data values, and used to determine specific data entries
Box Plot (aka modified or outlier)
box drawn with the hinges, or quartiles, at QL and QU, with M (median) in between QL and QU −inner fences −whiskers −IQR (QL, M, QU)
Methods for detecting outliers
box plots (graphical): beyond fences; z-score (numerical): |z| > 2 or |z| > 3
"Class" (categorical data, discrete data, and continuous data)
categorical data: discrete data: continuous data:
Ordinal Variable
categorical variable, where values specify some type of ordering ex: Likert scale
Skewed
data set in which one tail of the distribution has more extreme observations than the other tail (the direction of skew). Median: almost always between mean and mode. Mean: always pulled in the direction of the longer tail. Mode: always at the peak.
How does a Pareto diagram differ from a bar chart?
descending order of bars
Measures of Relative Standing
describe the relative quantitative location of a particular measurement within a data set. ex: percentile ranking or percentile score
Interquartile Range (IQR)
distance (range) between the first and third quartiles: IQR = QU − QL. Smaller IQR indicates more dense values; larger IQR indicates less dense values.
Population Mean
formula: µ = (Σx) / N, the sum of all N population measurements divided by N; symbol: µ
Sample Mean
formula: x̄ = (Σx) / n, the sum of the n sample measurements divided by n; symbol: x̄ (x bar)
Statistic
general term for a numerical measure of a SAMPLE. sample mean: x̄ (x bar); sample median: M; sample variance: s²; sample standard deviation: s
Parameter
general term for a numerical measure of a POPULATION. population mean: µ; population median: η; population variance: σ²; population standard deviation: σ
Divisor of Variance
the divisor gives the degrees of freedom (n − 1 for the sample variance)
1st step in data analysis
identify and classify data! ex: Total Appearance Score is ordinal (not discrete)
Dot Plot
individual dot: numerical value of each quantitative measurement in the data set (on a horizontal scale); when data values repeat, dots are placed above one another vertically. Ideal for: −continuous data with low resolution (values that repeat) −discrete data −small to moderate sample size (n)
Range
largest measurement minus the smallest measurement; a measure of variability NOT in reference to a specific measure of center
Whiskers (and IFs)
lines drawn from each hinge/quartile to the most extreme measurement inside the inner fence. End of 1st whisker: the data value closest to, but not less than, the lower inner fence. End of last whisker: the data value closest to, but not exceeding, the upper inner fence.
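The whisker ends are the most extreme data values that still lie inside the inner fences. A small sketch, assuming the quartiles QL and QU have already been computed and are passed in; `whisker_ends` is a hypothetical helper name.

```python
def whisker_ends(data, q_lower, q_upper):
    """Return (end of 1st whisker, end of last whisker) for a box plot."""
    iqr = q_upper - q_lower
    lower_inner = q_lower - 1.5 * iqr
    upper_inner = q_upper + 1.5 * iqr
    # Keep only the measurements that lie on or inside the inner fences.
    inside = [x for x in data if lower_inner <= x <= upper_inner]
    return min(inside), max(inside)
```

For the data 1, 2, 3, 4, 5, 6, 7, 30 with assumed quartiles QL = 2.5 and QU = 6.5, the inner fences are −3.5 and 12.5, so the whiskers end at 1 and 7 and the value 30 is left beyond the upper fence.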
Symmetric Data
mean, median, and mode are approximately equal. ex: amounts of time students spent studying last week; ex: ages of cars in a used-car lot
Percentile Ranking (Percentile Score)
measure of relative standing; useful only with large data sets
Likert Scale
measurement device, developed by a psychologist, for gauging opinions, attitudes, and personal values; the qualitative values are assigned numerical values indicative of their order; a special case of ORDINAL data. ex: 1 = definitely dissatisfied to 5 = definitely satisfied
Mode
most frequently occurring value; possible to have no mode, one mode, or more than one mode. May not be useful; a more meaningful measure can be obtained from a relative frequency HISTOGRAM: the measurement class containing the largest relative frequency is called the MODAL CLASS (for continuous data), the histogram peak.
Z-score
number of standard deviations the value x is from the mean (the relative location of x with respect to the mean, using the standard deviation as the counting scale; counting by "s"). Should be used with symmetrical data. Couple with the Empirical Rule to show how well an individual falls within the majority of observations: values more than 2 standard deviations from the mean (outside the central 95%) are possible outliers. Average of all z-scores = 0; standard deviation of all z-scores = 1.
p-th percentile
number such that p% of the measurements fall below that number and (100 − p)% fall above it. ex: According to a statistics bureau, 25.6% of all licensed drivers stopped by police are 41 years or older, so age 41 is the 74.4th percentile.
Outlier
observation/measurement that is UNUSUALLY large or small relative to the other values in the data set. Due to: 1. measurement is observed, recorded, or entered into the computer incorrectly 2. measurement comes from a different population 3. measurement is correct, but represents a rare (chance) event. Outliers lie beyond the inner and/or outer fences, or more than 2-3 standard deviations from the mean.
Class
one of the CATEGORIES into which the qualitative data can be classified
(Relative) Frequency Table
only method for NUMERICALLY describing CATEGORICAL data. Cumulative relative frequency: last value in the column should always be 1
Quartiles
percentiles that partition a data set into 4 categories, each containing 25% of the measurements. Lower quartile: 25th percentile of the data set; middle quartile: 50th percentile (median M); upper quartile: 75th percentile. Only 3 quartiles are needed to create 4 parts; there is no 4th quartile!
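A sketch of the three quartiles, the IQR, and the 5 number summary using Python's standard `statistics.quantiles`. Textbooks and software use slightly different conventions for locating quartiles, so the library's default ("exclusive") method is an assumption here and may not match hand calculations exactly; the function names are illustrative.

```python
import statistics

def five_number_summary(data):
    """Return (Min, QL, Median, QU, Max)."""
    # quantiles(..., n=4) returns exactly 3 cut points: QL, M, QU
    q_lower, med, q_upper = statistics.quantiles(data, n=4)
    return min(data), q_lower, med, q_upper, max(data)

def interquartile_range(data):
    """IQR = QU - QL, the spread of the middle 50% of the data."""
    q_lower, _, q_upper = statistics.quantiles(data, n=4)
    return q_upper - q_lower
```

Note that the call returns three values, not four, which mirrors the point that 3 quartiles create 4 parts.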
Bivariate Relationship
relationship between 2 numerical variables; on a scatterplot: positive (increasing), negative (decreasing), or no relationship
(Sample) Standard Deviation
s = √(s²); never negative; roughly measures the distance of a typical observation from the sample mean; provides a measure of variability within a single data set; when comparing the variability of 2 data sets, the data set with the larger standard deviation is more variable
Pie Chart
slices = classes of the qualitative variable; slice size/area: proportional to class relative frequency; order of classes relative to one another: arbitrary
Z-score form
standardized version of a variable: z = (variable − mean of variable) / (standard deviation of variable); for a sample, z = (x − x̄) / s; for a population, z = (x − µ) / σ
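Standardizing a sample can be sketched as follows, assuming the sample standard deviation s (divisor n − 1) is the counting scale; `z_scores` is an illustrative name. The data need at least 2 values that are not all equal, otherwise s is undefined or zero.

```python
import statistics

def z_scores(data):
    """z = (x - x_bar) / s for every measurement in the sample."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation, divisor n - 1
    return [(x - x_bar) / s for x in data]
```

Up to rounding error, the resulting z-scores have mean 0 and standard deviation 1, which can be checked by applying `statistics.mean` and `statistics.stdev` to the output.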
Mean
sum of the measurements divided by the number of measurements (arithmetic AVERAGE; written in summation notation); most common measure of center
Sample Variance
s² = Σ(x − x̄)² / (n − 1); divisor: n − 1; point of reference: sample mean
Divisor of Sample Variance: n-1
s² = "unbiased" estimator for σ²
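The two divisors can be compared side by side. This is a sketch with illustrative function names; Python's standard `statistics.variance` and `statistics.pvariance` compute the same two quantities.

```python
def sample_variance(data):
    """s^2: squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

def population_variance(data):
    """sigma^2: squared deviations from the population mean, divided by N."""
    big_n = len(data)
    mu = sum(data) / big_n
    return sum((x - mu) ** 2 for x in data) / big_n
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so the divisor n − 1 = 7 gives 32/7 while the divisor N = 8 gives 4.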
Central Tendency
tendency of data to CENTER about certain numerical values
Pareto Diagram
type of bar graph. Horizontal axis: classes, ordered by descending bar height (frequency or relative frequency), tallest to shortest from left to right. Bars are centered over: class. Vertical axis: bar height = class frequency, class relative frequency, or class percentage. Space between the bars: yes
Empirical Rule
used only with data sets having a bell-shaped, symmetric, unimodal distribution (not uniform). Gives APPROXIMATE but more precise percentages of data within k standard deviations of the mean: within 1 standard deviation: 68.3%; within 2 standard deviations: 95.4%; within 3 standard deviations: 99.7%
Chebyshev's Rule
used with any data set. AT LEAST (1 − 1/k²) × 100% of the data is within k standard deviations of the mean, for any k > 1 (k can be a non-integer). Gives less precise percentages: within 1 standard deviation: no useful info is provided; within 2 standard deviations: at least 75%; within 3 standard deviations: at least 89%
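A short sketch contrasting the two rules. The names `chebyshev_bound`, `EMPIRICAL_RULE`, and `proportion_within` are illustrative, and the sample standard deviation s is assumed as the measure of spread.

```python
import statistics

def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations (requires k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule gives no useful information for k <= 1")
    return 1 - 1 / k**2

# Approximate proportions from the Empirical Rule (bell-shaped data only).
EMPIRICAL_RULE = {1: 0.683, 2: 0.954, 3: 0.997}

def proportion_within(data, k):
    """Observed proportion of the data within k sample standard deviations of the mean."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    inside = [x for x in data if abs(x - x_bar) <= k * s]
    return len(inside) / len(data)
```

For any data set, `proportion_within(data, k)` should be at least `chebyshev_bound(k)`; for bell-shaped data it should also land near the Empirical Rule value.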
Bar Graph
vertical bars = classes of the qualitative variable (vertical = better representation for comparing heights). Horizontal axis: classes/categories, in arbitrary order. Bars are centered over: class/category. Vertical axis: bar height = class frequency, class relative frequency, or class percentage. White space between bars: yes
Population Variance
σ² = Σ(x − µ)² / N; divisor: N
Numerical Descriptive Statistics
−Relative frequency table (intervals come from bars of histogram) −Measures of center −Measures of variability −Measures of relative standing −Z-scores −Boxplot