CS995 - Descriptive Statistics for a Single Variable
Spread
A way to describe the dispersion of quantitative data: is all the data clustered around one point, or is it spread out?
Histogram
Quantitative data. Shows the distribution (shape and spread) of the data.
Box Plot
Quantitative data. Shows the center, spread, and outliers in a given data set.
Standard Deviation Rule
For approximately normal (bell-shaped) data: about 68% of the data falls within 1 standard deviation of the mean; about 95% falls within 2 standard deviations of the mean; about 99.7% falls within 3 standard deviations of the mean. A greater standard deviation means that the data is more spread out.
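As a rough illustration of the rule, the sketch below computes the three intervals for a given mean and standard deviation. The function name `empirical_rule_intervals` and the example values (mean 100, standard deviation 15) are made up for illustration, and the percentages only apply when the data is approximately normal.

```python
def empirical_rule_intervals(mean, sd):
    """Return {approximate share of data: (low, high)} for 1, 2, and 3 SDs."""
    shares = {1: "68%", 2: "95%", 3: "99.7%"}
    return {shares[k]: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

# Hypothetical example: mean 100, standard deviation 15
for share, (low, high) in empirical_rule_intervals(100, 15).items():
    print(f"about {share} of the data falls between {low} and {high}")
```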
Data Distribution - Skewed Distributions
An asymmetric distribution with a long tail on one side. Skewed left (aka negatively skewed): the long tail extends to the left of the peak. Skewed right (aka positively skewed): the long tail extends to the right of the peak.
What would be the best type of graph to use to display the age of all employees in a particular division in a company? a) Bar chart b) Histogram c) Scatterplot d) Pie chart
Feedback: The correct answer is b. This is quantitative data that will be grouped into ranges or bins. Therefore, a histogram is the best choice to display this data.
Anatomy of Box Plot
Four parts: a first whisker, two rectangles (the box, split at the median), and another whisker. Note: regardless of its size, each part represents about 25% of the data. A convenient way of showing five important values: minimum, first quartile, median, third quartile, maximum.
Histogram vs. Bar Chart
Histogram: displays frequencies or relative frequencies for quantitative data (e.g., how many people fall into each interval of heights). Bar chart: displays frequencies for categorical data (e.g., how many people come from each country).
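The difference can be sketched in a few lines: a histogram counts quantitative values per interval, while a bar chart counts occurrences per category. The helper `bin_counts`, the bin width, and both data sets are hypothetical; each interval includes its lower bound and excludes its upper bound.

```python
from collections import Counter

def bin_counts(values, start, width):
    """Histogram-style counts: how many values fall in each interval."""
    counts = Counter((v - start) // width for v in values)
    return {(start + i * width, start + (i + 1) * width): counts[i]
            for i in sorted(counts)}

# Hypothetical quantitative data (heights in cm) -> histogram bins
heights = [150, 155, 162, 168, 171, 175, 181]
print(bin_counts(heights, start=150, width=10))

# Hypothetical categorical data (countries) -> bar chart categories
countries = ["US", "UK", "US", "IN", "US", "IN"]
print(Counter(countries))
```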
Extreme values in Symmetric Distribution
If a distribution is symmetric, the extreme values on either side of it will be roughly similar, i.e., it will have similar extreme high and low values.
Five Number summary
Lists the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum of a data set.
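A minimal sketch of computing these five values follows. It assumes the "median of each half" convention for quartiles (other conventions give slightly different Q1 and Q3) and at least two data values; the function name `five_number_summary` and the data are hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]    # lower half (middle value excluded when the count is odd)
    upper = values[-half:]   # upper half (middle value excluded when the count is odd)
    return (values[0], median(lower), median(values), median(upper), values[-1])

print(five_number_summary([1, 3, 5, 7, 9, 11, 13]))
```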
Bimodal/multimodal
Bimodal: a distribution with two clear peaks rather than one. Multimodal: a histogram with two or more clear peaks.
Mean
Aka the average: a single value that represents the center of a set of data values. Preferred when the data is symmetric, because the mean is not a resistant measure of center (it is pulled toward extreme values).
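To see what "not resistant" means in practice, the sketch below adds one extreme value to a small, made-up data set and compares how far the mean and the median move.

```python
from statistics import mean, median

# Hypothetical data set, then the same data with one extreme value added
values = [10, 12, 14, 16, 18]
with_extreme = values + [200]

print(mean(values), median(values))              # both are 14
print(mean(with_extreme), median(with_extreme))  # mean jumps to 45, median only moves to 15
```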
1.5 IQR Criterion Rule for Outliers
Outliers are any points that are more than 1.5 × IQR (IQR = Q3 - Q1) above Q3 or below Q1. Upper outlier: greater than Q3 + 1.5 × IQR. Lower outlier: less than Q1 - 1.5 × IQR.
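The criterion can be sketched as two small functions. The names `iqr_fences` and `find_outliers`, the quartile values, and the data are all hypothetical; Q1 and Q3 are assumed to be already known.

```python
def iqr_fences(q1, q3):
    """Return the (lower, upper) cutoffs from the 1.5 x IQR criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Return the values that fall outside the cutoffs."""
    lower, upper = iqr_fences(q1, q3)
    return [x for x in data if x < lower or x > upper]

# Hypothetical example: Q1 = 10 and Q3 = 20, so IQR = 10
print(iqr_fences(10, 20))                               # (-5.0, 35.0)
print(find_outliers([-8, 2, 12, 18, 30, 41], 10, 20))   # [-8, 41]
```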
Reliable data and valid data
Reliable data is both consistent and repeatable. Valid data results from a test that accurately measures what it is intended to measure.
Data Distribution - Symmetric Distributions
A common type of frequency distribution in which the left half of the histogram is roughly equal to (a mirror image of) the right half. NOTE: Just because a histogram is symmetric does not make it normal.
Dot plot
Quantitative data. Shows the distribution of the data (clusters, gaps, outliers). Useful for smaller data sets.
Stem Plot (Stem and leaf plot)
Quantitative data. Shows the distribution or shape of the data according to place values.
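As a sketch of how place values organize the display, the code below splits each value into a stem (tens and above) and a leaf (ones digit). It assumes non-negative whole numbers, and the function name `stem_plot` and the data are hypothetical.

```python
from collections import defaultdict

def stem_plot(values):
    """Return {stem: [leaves]} with the tens part as stem and the ones digit as leaf."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)

# Hypothetical data: non-negative whole numbers
for stem, leaves in stem_plot([23, 12, 30, 15, 21, 23]).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Every original value can be read back from the output (stem 2 with leaf 3 is 23). This simplified version omits stems that have no leaves, which a hand-drawn stem plot would normally keep so that gaps stay visible.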
Median
The halfway point of a set of values. We can use the median when the data is skewed, because the median is a resistant measure of center. To find the median: 1. Sort the data from smallest to largest. 2. If the number of values is odd, the middle value is the median. 3. If the number of values is even, find the two center values and divide their sum by 2.
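The three steps for finding the median can be sketched directly in code. The function name `median_by_steps` is hypothetical, and the input is assumed to be a non-empty list of numbers.

```python
def median_by_steps(values):
    """Find the median by sorting and then taking the center value(s)."""
    ordered = sorted(values)      # step 1: sort smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # step 2: odd count, take the middle value
        return ordered[mid]
    # step 3: even count, average the two center values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_steps([7, 1, 4]))       # 4
print(median_by_steps([7, 1, 4, 10]))   # 5.5
```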
Standard deviation
tells how far, on average, the data points are from the mean. Used for symmetric data
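A minimal sketch of the calculation, assuming the population form (divide by the number of values): the standard deviation is the square root of the average squared distance from the mean. The function name `population_sd` and the data are hypothetical.

```python
from math import sqrt

def population_sd(values):
    """Square root of the average squared distance from the mean."""
    m = sum(values) / len(values)
    return sqrt(sum((x - m) ** 2 for x in values) / len(values))

# Hypothetical data with mean 5
print(population_sd([2, 4, 4, 4, 5, 5, 7, 9]))   # 2.0
```

For a sample rather than a whole population, the sum is usually divided by n - 1 instead; Python's `statistics.stdev` uses that form, while `statistics.pstdev` matches the sketch above.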
Range
the difference between the smallest and greatest values of a data set
Interquartile range
measures the difference between the third quartile and the first quartile
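The range and the interquartile range can be compared on the same data with a short sketch. It assumes the "inclusive" quartile method of Python's `statistics.quantiles`, which is only one of several quartile conventions; the function names and the data are hypothetical.

```python
from statistics import quantiles

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def interquartile_range(values):
    """Q3 minus Q1, with quartiles from statistics.quantiles (inclusive method)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical data
print(value_range(data))             # 8
print(interquartile_range(data))     # 4.0
```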
Extreme values in Skewed Distribution
In a skewed distribution, values in a histogram that come after a gap are referred to as extreme values and are possible outliers.
Mode
The value that occurs most often in a data set. There can be more than one mode in a data set.
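A short example using Python's standard-library `statistics.multimode`, which returns every value tied for most frequent; both data sets are made up.

```python
from statistics import multimode

# Hypothetical data sets
print(multimode([1, 2, 2, 3, 4]))      # [2]
print(multimode([1, 2, 2, 3, 3, 4]))   # [2, 3]
```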
Quartiles
Values that divide a data set into four equally sized groups. A data set has three quartiles (Q1, Q2, and Q3).
A study was conducted on the number of attendees each day at the state fair. You are asked to recommend a method for displaying the data graphically so that the shape of the data can be seen, and each data value is also visible. What would be the best choice among the following? (Enter the letter that corresponds with your choice.) a. Bar chart b. Histogram c. Scatter plot d. Stem plot
The answer is d, stem plot. A stem plot is the best choice because this type of graph shows both the shape of a data set and each data value.
Preferred measures of center and measures of spread for normal and skewed distributions
Distribution - Measure of Center - Measure of Spread
Skewed - Median - Range or IQR
Normal (symmetric) - Mean - Standard Deviation
A marketing researcher was investigating residential water usage in a metropolitan area for a report she was putting together for a client. She polled individual households and asked them to report their average monthly water bill. The lowest average monthly water bill was $35.17 and the highest average monthly water bill was $153.20. When presenting the data, she did not want the decimal values to get lost. What display would you suggest she use? a. Histogram b. Pie chart c. Stem plot d. Box plot
The answer is c. A stem plot is a good choice as you can see the distribution of the data and the values are preserved.
You are designing a study of the number of hours worked by financial analysts working at a particular firm. You are especially interested in knowing if there are any outliers in the data, as well as the median number of hours worked and the approximate distribution of the data. Which graphical display would satisfy your needs? a. Bar chart b. Histogram c. Stem plot d. Box plot
The answer is d. A box plot is a good display to use to show the shape of a data set, as well as outliers (if any).
At which value should the frequency scale on a bar chart start to be certain not to overemphasize a difference in values? a) Zero b) The lowest measured value c) The smallest frequency d) As long as the scale is even, it doesn't matter where you start it.
The correct answer is a. A vertical scale that does not start at zero can exaggerate the differences in a data set.
You are a professional trainer at a local sports academy. You ask your athletes to determine the number of grams of protein they consume for a particular meal. Which of the following would be the best choice to illustrate the shape of the data you collect? a) Bar chart b) Pie chart c) Box plot d) None of the above
The correct answer is c. Because the data you are collecting is quantitative, a box plot is the best of the listed choices for illustrating the shape of the data.
Of the following sets of data, which would you assume should have the smallest range? a) Price in dollars ($) of penny stocks currently being traded over-the-counter through the OTC Bulletin Board. b) Ages of stockbrokers currently on the trading floor. c) The number of trades on the NYSE on any given day. d) The ages of interns currently in the college summer internship program.
The correct answer is d. 21 years is the average age of a college student in the summer internship program. The variation from this amount is generally plus or minus one or two years. Therefore, we can assume the data set would be 19, 20, 21, 22, 23 years of age. The range would be equal to 23 - 19 = 4. Each of the other options has a far greater probability of having data that is more spread out.
What is the best type of graph to use where it is easiest to estimate outliers? a) Stem plot b) Histogram c) Dot plot d) Box plot
The correct answer is d. Outliers are determined by Q1 and Q3, which are clearly shown on a box plot. The outliers themselves are also displayed on the box plot.