Descriptive Statistics
Cumulative Frequency Distributions
A tabular display of the number of observations in a batch of data that have a value less than or equal to each value of the measurement.
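A minimal sketch of this table, assuming a small made-up batch; the function name `cumulative_frequency` is illustrative and not from the notes:

```python
from collections import Counter
from itertools import accumulate

def cumulative_frequency(data):
    """Map each observed value to the number of observations <= that value."""
    counts = Counter(data)
    values = sorted(counts)  # requires at least ordinal data
    running = accumulate(counts[v] for v in values)
    return dict(zip(values, running))

batch = [2, 3, 3, 5, 5, 5, 7]  # hypothetical batch of data
cum_freq = cumulative_frequency(batch)
print(cum_freq)  # {2: 1, 3: 3, 5: 6, 7: 7}
```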
Limits of Cumulative Relative Frequency Distribution
Requires that the data be at least ordinal so that the values can be arranged from smallest to largest.
Limitations of skewness
Requires the measurement to be interval.
Frequency Polygons
Similar to a histogram, except that the frequency or relative frequency is represented by the height of a line.
Sensitivity to extremes criterion:
The mean is the most sensitive; the mode is the least.
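A small sketch of this behavior, assuming a made-up batch in which the largest value is replaced by an extreme one. In this particular batch the median and the mode both stay put, while the mean moves a long way:

```python
from statistics import mean, median, mode

batch = [1, 2, 2, 3, 4]           # hypothetical batch
with_extreme = [1, 2, 2, 3, 400]  # same batch, largest value made extreme

for name, stat in [("mean", mean), ("median", median), ("mode", mode)]:
    # Print each statistic before and after the extreme value is introduced.
    print(name, stat(batch), stat(with_extreme))
```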
Information content criterion:
The mean uses the most information; the mode uses the least.
The Objective of descriptive statistics
To summarize data and to provide a method for conveying impressions about the data.
The Purpose of Skewness
To describe whether values on one side of the central tendency are more or less common than values on the other side. If the tail on the right is longer, the distribution is positively skewed (skewed to the right).
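The notes do not name a specific skewness statistic. As an assumption, the sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is one common choice; its sign matches the direction of the longer tail:

```python
def moment_skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (an assumed choice)."""
    n = len(data)
    center = sum(data) / n
    m2 = sum((x - center) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - center) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

right_tail = [1, 2, 2, 3, 12]          # long tail on the right
left_tail = [-x for x in right_tail]   # mirror image, long tail on the left

print(moment_skewness(right_tail) > 0)  # True: positively skewed
print(moment_skewness(left_tail) < 0)   # True: negatively skewed
```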
Purpose of Dispersion Statistics
To describe something about the atypical values or the spread of values.
Purpose of Univariate Graphical Summaries
To provide a rapid way to summarize tabular information.
Purpose of Univariate Numerical Descriptions
To provide an even more compact summary of data that yields impressions that one could get from a graphical summary.
Purpose of Central Tendency Statistics
To provide some description of the common, typical, or representative values of the measurement.
Data summarizing problems
Univariate problems, bivariate problems, multivariate problems
Benefit to Cumulative Frequency Distribution
When compared to a frequency distribution, it provides positional information about each value. Works even if the measurement is continuous!
The loss function criterion:
When we use the measures of central tendency to represent the entire batch of data, there are errors associated with this process. We should choose a way to represent the data with the descriptive statistic that has the lowest possible error of the type we consider to be most important.
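As an illustration, assuming two common error types (sum of squared errors and sum of absolute errors), the sketch below compares the mean and the median as single-value representatives of a made-up batch. The mean gives the smaller squared error and the median the smaller absolute error; the helper names are illustrative:

```python
from statistics import mean, median

def squared_loss(data, guess):
    """Sum of squared errors when every value is represented by `guess`."""
    return sum((x - guess) ** 2 for x in data)

def absolute_loss(data, guess):
    """Sum of absolute errors when every value is represented by `guess`."""
    return sum(abs(x - guess) for x in data)

batch = [1, 2, 2, 3, 12]  # hypothetical batch
m, md = mean(batch), median(batch)

print(squared_loss(batch, m), squared_loss(batch, md))    # 82 102
print(absolute_loss(batch, m), absolute_loss(batch, md))  # 16 12
```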
What is the issue with grouping?
You end up with a lot of arbitrary features.
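A small sketch of that arbitrariness, assuming a made-up batch and equal-width intervals: shifting the starting point of the intervals changes the apparent shape even though the data are identical. The function name `group_counts` is illustrative:

```python
from collections import Counter

def group_counts(data, start, width):
    """Count observations per interval [start + k*width, start + (k+1)*width)."""
    counts = Counter((x - start) // width for x in data)
    return {(start + k * width, start + (k + 1) * width): counts[k]
            for k in sorted(counts)}

batch = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # hypothetical batch

from_zero = group_counts(batch, start=0, width=2)
from_one = group_counts(batch, start=1, width=2)
print(from_zero)  # {(0, 2): 1, (2, 4): 5, (4, 6): 3}
print(from_one)   # {(1, 3): 3, (3, 5): 5, (5, 7): 1}
```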
The computation criterion:
Does the computation make sense for the type of data?
Measures of dispersion
range, interquartile range, mean absolute deviation, mean squared deviation, standard deviation
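A minimal sketch computing each measure for a made-up batch. The assumptions are mine, not the notes': deviations are taken about the mean, sums are divided by n rather than n - 1, and quartiles follow the "inclusive" convention of Python's `statistics.quantiles` (other quartile conventions give slightly different interquartile ranges):

```python
from math import sqrt
from statistics import quantiles

def dispersion_summary(data):
    n = len(data)
    center = sum(data) / n  # deviations are measured from the mean (assumption)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    msd = sum((x - center) ** 2 for x in data) / n
    return {
        "range": max(data) - min(data),
        "interquartile_range": q3 - q1,
        "mean_absolute_deviation": sum(abs(x - center) for x in data) / n,
        "mean_squared_deviation": msd,
        "standard_deviation": sqrt(msd),
    }

summary = dispersion_summary([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical batch
for name, value in summary.items():
    print(name, value)
```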
Cumulative Relative Frequency Distribution
A tabular display of the fraction of observations in a batch of data that have a value less than or equal to each value of the measurement.
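A minimal sketch, assuming a small made-up batch; the function name is illustrative. The last entry is always 1, because every observation is less than or equal to the largest value:

```python
from collections import Counter
from itertools import accumulate

def cumulative_relative_frequency(data):
    """Map each observed value to the fraction of observations <= that value."""
    counts = Counter(data)
    values = sorted(counts)
    running = accumulate(counts[v] for v in values)
    return {v: c / len(data) for v, c in zip(values, running)}

batch = [2, 3, 3, 5, 5, 5, 7, 7]  # hypothetical batch of 8 observations
cum_rel_freq = cumulative_relative_frequency(batch)
print(cum_rel_freq)  # {2: 0.125, 3: 0.375, 5: 0.75, 7: 1.0}
```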
Relative Frequency Distribution
A tabular display of the fraction of observations in a batch of data that is associated with each value of the measurement.
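A minimal sketch, assuming a small made-up batch; the function name is illustrative. Each count is divided by the batch size, so the fractions sum to 1:

```python
from collections import Counter

def relative_frequency(data):
    """Map each observed value to the fraction of observations equal to it."""
    n = len(data)
    return {value: count / n for value, count in sorted(Counter(data).items())}

batch = [2, 3, 3, 5, 5, 5, 7, 7]  # hypothetical batch of 8 observations
rel_freq = relative_frequency(batch)
print(rel_freq)  # {2: 0.125, 3: 0.25, 5: 0.375, 7: 0.25}
```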
Limitations to Cumulative Frequency Distribution
1. Must be a discrete measurement; no commonness information is provided for individual values. 2. Requires that the data be at least ordinal so that the values can be arranged from smallest to largest.
Frequency Distribution Limitations
1. Provides no useful summary if the measurement is continuous unless the values are grouped. 2. Grouping data involves introducing both error and arbitrariness. 3. The frequency in isolation provides no information about the commonness of values.
Limitations to Relative frequency distributions
1. Provides no useful summary if the measurement is continuous unless the values are grouped. 2. Grouping data involves introducing both error and arbitrariness.
Benefits of Cumulative Relative Frequency Distribution
1. Provides both commonness and positional information about each value. 2. Works even if the measurement is continuous!
Relative Frequency Density Curve
A curve which represents frequency per unit width as a limiting process in which the width of an interval (amount of grouping) gets smaller and smaller. This approach is designed to combat the arbitrary aspects of grouping when the measurement is continuous. Not used for discrete measurements.
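A rough sketch of the quantity behind the curve, assuming equal-width intervals over a hypothetical range [low, high] and a made-up batch. Density here is relative frequency divided by interval width, so the total area stays 1 however the data are grouped; the function name is illustrative:

```python
def relative_frequency_density(data, low, high, n_bins):
    """Return (interval width, list of relative frequency per unit width)."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp the index so that x == high falls in the last interval.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    n = len(data)
    return width, [c / (n * width) for c in counts]

batch = [0.5, 1.2, 1.9, 2.1, 2.4, 3.3, 3.8, 4.0]  # hypothetical continuous data

coarse_width, coarse = relative_frequency_density(batch, 0, 4, 2)
fine_width, fine = relative_frequency_density(batch, 0, 4, 4)
print(coarse)  # [0.1875, 0.3125]
print(fine)    # [0.125, 0.25, 0.25, 0.375]
```

Narrowing the intervals further only makes sense with a larger batch; with few observations most narrow intervals would be empty.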
Ogives
A polygon based on the cumulative relative frequency. Commonness is portrayed by steepness. Positional information is portrayed by value.
Frequency Distribution
A tabular display of how often each value of a measurement occurs in a batch of data.
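A minimal sketch, assuming a small made-up batch; the function name is illustrative:

```python
from collections import Counter

def frequency_distribution(data):
    """Map each observed value to the number of times it occurs."""
    return dict(sorted(Counter(data).items()))

batch = [2, 3, 3, 5, 5, 5, 7]  # hypothetical batch of data
freq = frequency_distribution(batch)
print(freq)  # {2: 1, 3: 2, 5: 3, 7: 1}
```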
Limitations to measures of dispersion
Almost all dispersion statistics require that the measurement be interval.
Limitation of the Mode statistic
Because it is based on frequency, frequency must be meaningful. Requires discrete data.
Limitation of the Median Statistic
Because the construction requires the greater-than-or-equal-to ordering of values, the measurement must be at least ordinal.
Limitation of the mean statistic
Because the construction requires the values to be added together, sums and differences must make sense. So, the measurements must be at least interval for the mean to be a useful statistic.
Mean
Calculated as the sum of all values divided by the number of observations in the batch of data.
Types of Univariate Numerical Descriptions
Central tendency, dispersion, skewness, kurtosis, locational information
The 5 criteria to guide the choice of statistic
Computation, information content, purpose, sensitivity to extremes, and the loss function.
Mode
Describes the most frequently occurring value of the measurement.
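A short sketch using `statistics.multimode` from the Python standard library (3.8+) on made-up data. It works on nominal values because only counting is needed, and it returns every value tied for the highest frequency:

```python
from statistics import multimode

colors = ["red", "blue", "red", "green", "blue", "red"]  # hypothetical nominal data
scores = [1, 2, 2, 3, 3]                                 # hypothetical data with a tie

color_modes = multimode(colors)
score_modes = multimode(scores)
print(color_modes)  # ['red']
print(score_modes)  # [2, 3]
```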
Median
Describes the value such that at least 50% of the data is less than or equal to that value and at least 50% is greater than or equal to that value.
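A small sketch that checks the two-sided 50% condition directly, assuming made-up batches; the helper name is illustrative. With an even number of observations, more than one value can satisfy the definition, and `statistics.median` reports the midpoint of the two middle values:

```python
from statistics import median

def satisfies_median_definition(data, candidate):
    """True if at least half the data is <= candidate and at least half is >= it."""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= candidate)
    at_or_above = sum(1 for x in data if x >= candidate)
    return 2 * at_or_below >= n and 2 * at_or_above >= n

even_batch = [1, 3, 5, 7]  # hypothetical batch
odd_batch = [1, 3, 5]      # hypothetical batch

print(median(even_batch))                          # 4.0
print(satisfies_median_definition(even_batch, 4))  # True
print(satisfies_median_definition(even_batch, 3))  # True (also qualifies)
print(satisfies_median_definition(even_batch, 6))  # False
print(satisfies_median_definition(odd_batch, 3))   # True
```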
The two key properties for choosing a descriptive approach
Is the measurement continuous or discrete? What is the level of the measurement?
Summarizing Strategy
List-oriented descriptions, graphical descriptions, numerical descriptions
Purpose criterion:
The mode is best for defining a typical value, the median is best for describing a typical individual, and the mean is best for representing the entire batch of data.
Types of central tendency statistics
Mode, median, arithmetic mean (mean), geometric mean, trimmed mean, weighted mean, etc.
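The notes list these statistics without defining them. The sketch below uses their standard definitions on made-up data: `statistics.geometric_mean` is from the Python standard library (3.8+), while `trimmed_mean` and `weighted_mean` are illustrative helpers:

```python
from statistics import geometric_mean

def trimmed_mean(data, trim_each_end):
    """Arithmetic mean after dropping `trim_each_end` values from each end."""
    ordered = sorted(data)
    kept = ordered[trim_each_end:len(ordered) - trim_each_end]
    return sum(kept) / len(kept)

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(geometric_mean([1, 4, 16]))                    # approximately 4.0
print(trimmed_mean([1, 2, 3, 4, 100], 1))            # 3.0
print(weighted_mean([1, 2, 3], weights=[1, 1, 2]))   # 2.25
```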
Choice between alternatives of dispersion:
Most dispersion measures are also loss functions. Sensitivity to extremes is a common rationale for the choice, but we do want the measure to be sensitive. Naturalness or interpretability of the statistic also matters (MSD and variance are unnatural).
Relative Frequency Polygons
The picture conveys the same impression as a frequency polygon, except that the vertical scale is changed.
Benefits of Relative Frequency Distribution
Provides useful information about the commonness of each value.