Statistics Chapter 2 (Bentley)
Contingency Tables and Scatter Diagrams
Two methods for summarizing the data for two variables simultaneously.
Frequency Distribution
(For qualitative data) groups data into categories and records the number of observations that fall into each category. Shows the frequency (or number) of items in each of several non-overlapping classes. The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.
Relative Frequency Distribution
(Of each category) equals the proportion (fraction) of observations in each category. A category's relative frequency is calculated by dividing its frequency by the total number of observations. The sum of the relative frequencies should equal one (or a value very close to one due to rounding). Identifies the proportion (or fraction) of observations that falls into each class; that is, it equals Class Frequency/Total Number of Observations.
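A minimal Python sketch of the calculation described above. The function name and the soft-drink sample are made up for illustration only.

```python
from collections import Counter

def relative_frequencies(observations):
    """Return {category: frequency / total number of observations}."""
    counts = Counter(observations)   # frequency distribution
    total = len(observations)
    return {category: count / total for category, count in counts.items()}

# Hypothetical sample of 10 soft-drink purchases (made-up data).
sample = ["Coke", "Pepsi", "Coke", "Sprite", "Coke",
          "Pepsi", "Coke", "Sprite", "Pepsi", "Coke"]
rel_freq = relative_frequencies(sample)

for category, value in rel_freq.items():
    print(f"{category}: {value:.2f}")
```

Multiplying each relative frequency by 100 gives the corresponding percent frequency.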
Frequency Distribution (Guidelines for Determining the Number of Classes)
-Use between 5 and 20 classes. -Data sets with a larger number of elements usually require a larger number of classes. -Smaller data sets usually require fewer classes.
Frequency Distribution (Guidelines for Determining the Width of Each Class)
-Use classes of equal width -Approximate Class Width=(Largest Data Value-Smallest Data Value)/Number of Classes Note: Making the classes the same width reduces the chance of inappropriate interpretations.
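The width formula can be sketched in Python as follows. The data values are hypothetical, and rounding the result up to a convenient whole number is one common choice, not a fixed rule.

```python
import math

def approximate_class_width(values, number_of_classes):
    """(Largest Data Value - Smallest Data Value) / Number of Classes."""
    return (max(values) - min(values)) / number_of_classes

# Hypothetical data set of 20 observations (made-up values).
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

width = approximate_class_width(data, 5)   # (33 - 12) / 5
rounded_width = math.ceil(width)           # round up to a convenient width
print(width, rounded_width)
```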
Scatterplot Graph May Reveal...
1. A linear relationship exists between the two variables. 2. A curvilinear relationship exists between the two variables. 3. No relationship exists between the two variables.
Guidelines for Constructing a Frequency Distribution
1. Classes are mutually exclusive. 2. Classes are exhaustive. 3. The total number of classes in a frequency distribution usually ranges from 5 to 20. 4. Once we choose the number of classes for a raw data set, we can then approximate the width of each class by using the formula (Largest Value-Smallest Value)/Number of Classes.
Frequency Distribution (Three Steps to Define Classes with Quantitative Data)
1. Determine the number of non-overlapping classes. 2. Determine the width of each class. 3. Determine the class limits.
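The three steps above might look like this in Python. This is only a sketch: it assumes integer data, starts the first class at the smallest data value, and uses hypothetical helper names and made-up data.

```python
def build_classes(values, number_of_classes, width):
    """Step 3: derive (lower limit, upper limit) pairs for integer data."""
    lower = min(values)
    classes = []
    for i in range(number_of_classes):
        lo = lower + i * width
        hi = lo + width - 1          # integer data, so limits do not overlap
        classes.append((lo, hi))
    return classes

def frequency_distribution(values, classes):
    """Count the observations that fall within each class."""
    return {(lo, hi): sum(lo <= v <= hi for v in values) for lo, hi in classes}

# Hypothetical data set (made-up values).
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

number_of_classes = 5                                 # step 1
width = 5                                             # step 2: (33 - 12) / 5 = 4.2, rounded up
classes = build_classes(data, number_of_classes, width)   # step 3
freq = frequency_distribution(data, classes)

for (lo, hi), count in freq.items():
    print(f"{lo}-{hi}: {count}")
```

Because every value lands in exactly one class, the frequencies add up to the number of observations.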
Cautionary Comment When Constructing or Interpreting Charts or Graphs
1. The simplest graph should be used for a given set of data. Strive for clarity and avoid unnecessary adornments. 2. Axes should be clearly marked with the numbers of their respective scales; each axis should be labeled. 3. The scale on the vertical axis should begin at zero. Moreover, the vertical axis should not be given a very high value as an upper limit.
Pie Chart
A segmented circle whose segments portray the relative frequencies of the categories of some qualitative variable. Commonly used graphical device for presenting relative frequency and percent frequency distributions for qualitative data. First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.
Stem-and-Leaf Display (Leaf Units)
A single digit is used to define each leaf. In the preceding example, the leaf unit was 1. Leaf units may be 100, 10, 1, 0.1, and so on. Where the leaf unit is not shown, it is assumed to equal 1. The leaf unit indicates how to multiply the stem-and-leaf numbers in order to approximate the original data.
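As a small illustration of how the leaf unit recovers an approximate data value, here is a Python sketch. The function name and the example stem and leaf are hypothetical.

```python
def approximate_value(stem, leaf, leaf_unit=1):
    """Join stem and leaf digits, then multiply by the leaf unit."""
    return (stem * 10 + leaf) * leaf_unit

# With a leaf unit of 10, stem 15 and leaf 6 stand for roughly 1560.
approx = approximate_value(15, 6, leaf_unit=10)
print(approx)
```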
Contingency Table
A tabular summary of data for two variables. Can be used when: -One variable is qualitative and the other is quantitative -Both variables are qualitative -Both variables are quantitative The left and top margin labels define the classes for the two variables.
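A contingency table for two qualitative variables can be sketched in Python by counting pairs of values. The variable names and the restaurant-style data below are invented for illustration.

```python
from collections import Counter

def contingency_table(row_values, col_values):
    """Return {row label: {column label: count}} for paired observations."""
    cell_counts = Counter(zip(row_values, col_values))
    row_labels = sorted(set(row_values))
    col_labels = sorted(set(col_values))
    # Counter returns 0 for pairs that never occur.
    return {r: {c: cell_counts[(r, c)] for c in col_labels} for r in row_labels}

# Hypothetical paired observations (made-up data).
quality = ["Good", "Very Good", "Good", "Excellent", "Very Good", "Good"]
price = ["$", "$$", "$", "$$$", "$$", "$$"]

table = contingency_table(quality, price)
for row_label, row in table.items():
    print(row_label, row)
```

For a quantitative variable, the values would first be grouped into classes, and the class labels would take the place of the category labels.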
Frequency Distribution (Guidelines for Determining the Class Limits)
Class limits must be chosen so that each data item belongs to one and only one class. The lower class limit identifies the smallest possible data value assigned to the class. The upper class limit identifies the largest possible data value assigned to the class. The appropriate values for the class limits depend on the level of accuracy of the data. Note: An open-end class requires only a lower class limit or an upper class limit.
Polygon
Connects a series of neighboring points where each point represents the midpoint of a particular class and its associated frequency or relative frequency.
Exploratory Data Analysis
Consists of simple arithmetic and easy-to-draw pictures that can be used to summarize data quickly.
Contingency Table: Simpson's Paradox
Data in two or more contingency tables are often aggregated to produce a summary contingency table. We must be careful in drawing conclusions about the relationship between the two variables in the aggregated contingency table. In some cases, the conclusions based upon an aggregated contingency table can be completely reversed if we look at the unaggregated data. The reversal of conclusions based on aggregated and unaggregated data is called Simpson's paradox.
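The reversal can be demonstrated with a short Python sketch. The judges, courts, and counts below are illustrative numbers chosen to show the effect, not data from the chapter.

```python
# (rulings upheld, total rulings) per judge per court; illustrative numbers only.
cases = {
    "Judge A": {"Court 1": (29, 32), "Court 2": (100, 118)},
    "Judge B": {"Court 1": (90, 100), "Court 2": (20, 25)},
}

def upheld_rate(upheld, total):
    return upheld / total

def aggregated_rate(by_court):
    """Upheld rate after combining the separate court tables."""
    upheld = sum(u for u, _ in by_court.values())
    total = sum(t for _, t in by_court.values())
    return upheld / total

# Unaggregated view: Judge A has the higher upheld rate in every court.
a_wins_each_court = all(
    upheld_rate(*cases["Judge A"][court]) > upheld_rate(*cases["Judge B"][court])
    for court in cases["Judge A"]
)

# Aggregated view: Judge B has the higher overall upheld rate.
b_wins_aggregate = aggregated_rate(cases["Judge B"]) > aggregated_rate(cases["Judge A"])

print(a_wins_each_court, b_wins_aggregate)
```

The reversal arises because the two judges handle very different mixes of cases across the two courts, which the aggregated table hides.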
Bar Chart
Depicts the frequency or the relative frequency for each category of the qualitative variable as a series of horizontal or vertical bars, the lengths of which are proportional to the values that are to be depicted. On one axis (usually the horizontal axis), we specify the labels that are used for each of the classes. For the other axis, a frequency, relative frequency, or percent frequency scale can be used. Using a bar of fixed width drawn above each class label, we extend the height appropriately. The bars are separated to emphasize the fact that each class is a separate category.
Stretched Stem-and-Leaf Display
If we believe the original stem-and-leaf display has condensed the data too much, we can stretch the display vertically by using two stems for each leading digit(s). Whenever a stem value is stated twice, the first value corresponds to leaf values of 0-4, and the second value corresponds to leaf values of 5-9.
Frequency Distribution (Note on Number of Classes and Class Width)
In practice, the number of classes and the appropriate class width are determined by trial and error. Once a possible number of classes is chosen, the appropriate class width is found. The process can be repeated for a different number of classes. Ultimately, the analyst uses judgement to determine the combination of the number of classes and class width that provides the best frequency distribution for summarizing the data.
Pareto Diagram
In quality control, bar charts are used to identify the most important causes of problems. When the bars are arranged in descending order of height from left to right (with the most frequently occurring cause appearing first), the bar chart is called a Pareto diagram. Named for its founder, Vilfredo Pareto, an Italian economist.
Classes
Intervals
Ogive
Is a graph that plots the cumulative frequency or the cumulative relative frequency of each class against the upper limit of the corresponding class. A graph of a cumulative distribution. The data values are shown on the horizontal axis. Shown on the vertical axis are the: -cumulative frequencies, or -cumulative relative frequencies, or -cumulative percent frequencies The frequency (one of the above) of each class is plotted as a point. The plotted points are connected by straight lines.
Scatter Diagram
Is a graphical presentation of the relationship between two quantitative variables. One variable is shown on the horizontal axis and the other variable is shown on the vertical axis. The general pattern of the plotted points suggests the overall relationship between the variables.
Scatterplot
Is a graphical tool that helps in determining whether or not two quantitative variables are related in some systematic way. Each point in the diagram represents a pair of known or observed values of the two variables.
Histogram
Is a series of rectangles where the width and height of each rectangle represent the class width and frequency (or relative frequency) of the respective class. Another common graphical presentation of quantitative data. The variable of interest is placed on the horizontal axis. A rectangle is drawn above each class interval with its height corresponding to the interval's frequency, relative frequency, or percent frequency. Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.
Percent Frequency
Is the percent (%) of observations in a category; it equals the relative frequency of the category multiplied by 100.
Dot Plot
One of the simplest graphical summaries of data. A horizontal axis shows the range of data values. Then each data value is represented by a dot placed above the axis.
Trendline
Provides an approximation of the relationship.
Cumulative Frequency Distribution
Records the number of observations with values less than or equal to the upper limit of each class.
Stem-and-Leaf Display
Shows both the rank order and shape of the distribution of the data. It is similar to a histogram on its side, but it has the advantage of showing the actual data values. The first digits of each data item are arranged to the left of a vertical line. To the right of the vertical line we record the last digit for each item in rank order. Each line in the display is referred to as a stem. Each digit on a stem is a leaf.
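A stem-and-leaf display can be sketched in a few lines of Python. This version assumes non-negative integers and a leaf unit of 1. The function name and the scores are made up, and stems with no leaves are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its leaves (last digit) in rank order."""
    stems = defaultdict(list)
    for value in sorted(values):          # sorting puts leaves in rank order
        stems[value // 10].append(value % 10)
    return dict(stems)

# Hypothetical test scores (made-up data).
scores = [68, 72, 75, 81, 84, 85, 91, 97, 73, 86]
display = stem_and_leaf(scores)

for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```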
Cumulative Percent Frequency Distribution
Shows the percentage of items with values less than or equal to the upper limit of each class.
Cumulative Relative Frequency Distribution
Shows the proportion of items with values less than or equal to the upper limit of each class.
Histograms Showing Skewness
Symmetric: the left tail is the mirror image of the right tail (ex. height and weight of people). Moderately Skewed Left: a longer tail to the left (ex. exam scores). Moderately Skewed Right: a longer tail to the right (ex. housing values). Highly Skewed Right: a very long tail to the right (ex. executive salaries).
Cumulative Distributions
The last entry in a cumulative frequency distribution always equals the total number of observations. The last entry in a cumulative relative frequency distribution always equals 1.00. The last entry in a cumulative percent frequency distribution always equals 100.
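These three properties can be checked with a short Python sketch that builds all three cumulative distributions from a list of class frequencies. The frequencies and the function name are hypothetical.

```python
from itertools import accumulate

def cumulative_distributions(frequencies):
    """Return cumulative frequency, relative frequency, and percent frequency lists."""
    total = sum(frequencies)
    cum_freq = list(accumulate(frequencies))         # running totals
    cum_rel = [c / total for c in cum_freq]          # proportions
    cum_pct = [100 * c / total for c in cum_freq]    # percentages
    return cum_freq, cum_rel, cum_pct

# Hypothetical class frequencies for five classes (made-up data).
frequencies = [4, 8, 5, 2, 1]
cum_freq, cum_rel, cum_pct = cumulative_distributions(frequencies)

print(cum_freq)
print(cum_rel)
print(cum_pct)
```

Plotting these cumulative values against the upper class limits and connecting the points with straight lines gives an ogive.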