C459: Module 4: Examining Distributions, using visual displays and numerical summaries

What are the steps to determining the Inter-Quartile Range (IQR)?

1. Arrange the data in increasing order and find the median. 2. Find the median of the lower 50% of the data (Q1). 3. Find the median of the upper 50% of the data (Q3). 4. The middle 50% of the data falls between Q1 and Q3, so IQR = Q3 - Q1.
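The steps above can be sketched in Python. This is a minimal illustration, not the course's official procedure: the function names are made up for this example, and it assumes the common convention that the median itself is left out of both halves when the number of observations is odd. The sample values are the nine computer-hours observations used elsewhere in this set.

```python
def median(sorted_values):
    """Median of a list that is already in increasing order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles_and_iqr(data):
    """Return (Q1, Q3, IQR) by taking the median of each half of the data."""
    values = sorted(data)               # step 1: increasing order
    n = len(values)
    half = n // 2
    lower = values[:half]               # lower 50% (median excluded if n is odd)
    upper = values[half + (n % 2):]     # upper 50% (median excluded if n is odd)
    q1 = median(lower)                  # step 2
    q3 = median(upper)                  # step 3
    return q1, q3, q3 - q1              # step 4: IQR = Q3 - Q1


hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]
q1, q3, iqr = quartiles_and_iqr(hours)
print(q1, q3, iqr)  # 5.0 11.5 6.5
```

Statistical software may use a slightly different quartile convention, so its Q1 and Q3 can differ a little from these hand-calculated values.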

What is the five number summary?

A combination of Min, Q1, Median, Q3, Max A quick numerical description of the center and spread of a distribution.

The 1.5(IQR) Criterion for Outliers

An observation is considered a suspected outlier if it is: less than Q1 - 1.5(IQR), or more than Q3 + 1.5(IQR).
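As a rough sketch, the criterion can be written as two small Python functions. The names `outlier_fences` and `suspected_outliers` are hypothetical, chosen only for this example; the quartiles Q1 = 95 and Q3 = 120 are the ones from the driving-speed survey in this set.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5(IQR) criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def suspected_outliers(data, q1, q3):
    """Return the observations that fall outside the two cutoffs."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]


low, high = outlier_fences(95, 120)
print(low, high)                                     # 57.5 157.5
print(suspected_outliers([55, 110, 155], 95, 120))   # [55]
```

An observation exactly equal to a cutoff is not flagged here, since the rule says "less than" and "more than".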

If an outlier can be explained to have been produced under fundamentally different conditions from the rest of the data (or by a fundamentally different process), such an outlier can be removed from the data if your goal is to investigate only the process that produced the rest of the data.

An outlier might indicate a mistake in the data (like a typo, or a measuring error), in which case it should be corrected if possible or else removed from the data before calculating summary statistics or making inferences from the data (and the reason for the mistake should be investigated).

Here are the number of hours that nine students spend on the computer on a typical day: 1 6 7 5 5 8 11 12 15 The median number of hours spent on the computer is:

Answer: 7. Good job! After you order the data (1 5 5 6 7 8 11 12 15), since n = 9, the median is the (9 + 1) / 2 = 5th observation in the ordered list, which in this case is 7.

Side-By-Side (Comparative) Boxplots

Boxplots are most useful when presented side-by-side for comparing and contrasting distributions from two or more groups.

What are the variables classification types?

Categorical and Quantitative

Distribution of categorical data is displayed numerically by using?

Counts and percentages
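A small Python sketch of such a numerical summary follows. The function name and the survey answers are invented for illustration only.

```python
from collections import Counter


def counts_and_percentages(values):
    """Map each category to a (count, percentage) pair."""
    counts = Counter(values)
    total = len(values)
    return {
        category: (count, 100 * count / total)
        for category, count in counts.items()
    }


# Hypothetical answers to a yes/no question
answers = ["yes", "no", "yes", "yes", "no"]
table = counts_and_percentages(answers)
for category, (count, percent) in table.items():
    print(f"{category}: {count} ({percent:.0f}%)")
```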

Understanding Outliers

Even though it is an extreme value, if an outlier can be understood to have been produced by essentially the same sort of physical or biological process as the rest of the data, and if such extreme values are expected to eventually occur again, then such an outlier indicates something important and interesting about the process you're investigating, and it should be kept in the data.

Here is how the IQR is actually found: Arrange the data in increasing order, and find the median M. Recall that the median divides the data, so that 50% of the data points are below the median and 50% are above the median.

Find the median of the lower 50% of the data. This is called the first quartile of the distribution, and the point is denoted by Q1. Note from the picture that Q1 divides the lower 50% of the data into two halves, containing 25% of the data points in each half. Q1 is called the first quartile, since one quarter of the data points fall below it.

What can be used to display quantitative data graphically?

Histograms, Stemplots, & Boxplots

Measures of Spread

In order to describe the distribution, we need to supplement the graphical display not only with a measure of center, but also with a measure of the variability (or spread) of the distribution. In this section, we will discuss the three most commonly used measures of spread: Range Inter-quartile range (IQR) Standard deviation

Histograms -what does center mean?

It is the distribution midpoint

Histograms -what does spread mean?

It is the range of data. Smallest observation to the largest observation.

What does skewed right represent on a histogram chart?

It is when the tail is on the right (larger values) and is much longer.

What does skewed left represent on a histogram?

It is when the tail represents the smaller values and is much longer on the left than on the right.

How do you form a stemplot?

Leaf = the right-most digit. Stem = everything except the right-most digit.

How do you determine the range?

Max observation - Min observation

How do you calculate spread?

Max-Min

A survey taken in a large statistics class contained the question: "What's the fastest you have driven a car (in miles per hour)?" For the 87 males surveyed, we found the following: min=55, Q1=95,Median=110, Q3=120, Max=155. Should the largest observation in this data set be classified as an outlier?

No. Good job! The IQR in this case is 120 - 95 = 25. Applying the 1.5(IQR) rule, we find: Q3 + 1.5(IQR) = 120 + 1.5(25) = 157.5, and, therefore, the largest observation, 155, should NOT be classified as an outlier. Note, however, that in this case Q1 - 1.5(IQR) = 95 - 1.5(25) = 57.5, and, therefore, the smallest observation, 55, should be classified as an outlier.

What are outliers?

Observations that fall outside of the overall pattern

Interpreting the Histogram

Once the distribution has been displayed graphically in a histogram, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern. More specifically, we should consider the following features of the distribution:

How do you find the median?

Order the data from smallest to largest. Determine whether n, the number of observations, is even or odd. If n is odd, the median is the observation in position (n + 1)/2. If n is even, the median is the mean of the two observations in positions n/2 and n/2 + 1.
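The odd/even rule can be sketched in Python as below. Positions in the rule count from 1, while Python lists count from 0, so the code subtracts 1 when indexing. The sample values are the nine computer-hours observations from this set.

```python
def median(data):
    """Median using the (n + 1)/2 rule for odd n and the two middle spots for even n."""
    values = sorted(data)                 # smallest to largest
    n = len(values)
    if n % 2 == 1:
        position = (n + 1) // 2           # 1-based position of the median
        return values[position - 1]
    lower_pos = n // 2                    # 1-based positions of the two middle values
    upper_pos = n // 2 + 1
    return (values[lower_pos - 1] + values[upper_pos - 1]) / 2


hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]
print(median(hours))          # 7
print(median([1, 2, 3, 4]))   # 2.5
```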

Outliers

Outliers are data points that fall outside the overall pattern of the distribution and need further research before continuing the analysis.

Shape, Center, and Spread=Overall Pattern

Outliers=deviations from pattern

What are two simple graphical displays for categorical data?

Pie Chart and Bar Chart

Histograms are used as a graphic display of what?

Quantitative Data

What are the three most commonly used measures of spread?

Range, IQR, & Standard Deviation

What do you think the shape of the distribution of age of death from trauma (accident, murder, suicide, drug overdose, etc.) would be when represented by a histogram? Why? Recall that we talked earlier about the shape of the distribution of age of death from natural causes (heart disease, cancer, etc.). Use a similar type of reasoning for the age of death from trauma.

Skewed Right - The bulk of deaths from trauma, accidents, suicide, drug overdose, etc. happen at a younger age, and fewer at an older age. Therefore, we expect the distribution of age of death from trauma to be skewed right.

How do you describe the shape of a histogram?

By its symmetry or skewness and its peakedness (number of modes)

There are two simple graphical displays for visualizing the distribution of categorical data:

The Pie Chart and The Bar Chart

What is mean?

The average of the set of observations.

The Boxplot

The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five-number summary and any observations that were classified as a suspected outlier using the 1.5(IQR) criterion.

Center of distribution

The center of the distribution is its midpoint—the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.

The Five Number Summary

The combination of all five numbers (min, Q1, M, Q3, Max) is called the five number summary, and provides a quick numerical description of both the center and spread of a distribution.

Histogram

The idea behind creating a histogram is to break the range of values into intervals and count how many observations fall into each interval.
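A minimal Python sketch of that idea is shown below. The function name, the starting point, and the interval width are choices made for this example; each interval includes its left endpoint and excludes its right endpoint.

```python
def histogram_counts(data, start, width, bins):
    """Count how many observations fall into each equal-width interval."""
    counts = [0] * bins
    for x in data:
        index = int((x - start) // width)   # which interval x belongs to
        if 0 <= index < bins:               # ignore values outside the covered range
            counts[index] += 1
    return counts


hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]
counts = histogram_counts(hours, start=0, width=5, bins=4)
print(counts)  # [1, 5, 2, 1]

# Text version of the histogram: one star per observation
for i, count in enumerate(counts):
    left = i * 5
    print(f"[{left:2d}, {left + 5:2d}) {'*' * count}")
```

Changing the interval width changes how the histogram looks, so it is worth trying more than one width before describing the shape.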

Mean

The mean is the average of a set of observations, i.e., the sum of the observations divided by the number of observations. When finding a mean, you need to make sure you add the numbers first, then divide (follow the correct order of operations).
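As a quick illustration in Python, using the nine computer-hours observations from this set, the sum is computed first and the division comes last. The second calculation shows what goes wrong when the order of operations is ignored.

```python
def mean(data):
    """Sum of the observations divided by the number of observations."""
    return sum(data) / len(data)


hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]
print(round(mean(hours), 2))  # 7.78

# Common mistake: without parentheses, only the last number is divided by 9
wrong = 1 + 6 + 7 + 5 + 5 + 8 + 11 + 12 + 15 / 9
print(wrong == mean(hours))   # False
```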

The Current Population Survey conducted by the Census Bureau records the incomes of a large sample of U.S. households each month. What will be the relationship between the mean and median of the collected data?

The mean will be bigger than the median. Good job! The distribution of incomes is skewed right, so the mean will be bigger than the median.

The SAT Math scores of 1,000 future engineers and physicists are recorded. What will be the relationship between the mean and median of the collected data?

The mean will be smaller than the median. Well done! Since the SAT Math scores for these students will be mostly high scores, the distribution will be skewed to the left. Thus, the few low scores (outliers) will make the mean smaller than the median.

Median

The median, M, is the midpoint of the distribution. In other words, the median is the value that satisfies the following: half the observations are smaller than (or equal to) the median, and half the observations are larger than (or equal to) the median. To find the median: Order the data from smallest to largest. Consider whether n, the number of observations, is even or odd.

Range

The range covered by the data is the most intuitive measure of variability. The range is exactly the distance between the smallest data point (min) and the largest one (Max). Range = Max - min

Spread

The spread (also called variability) of the distribution can be described by the approximate range covered by the data. From looking at the histogram, we can approximate the smallest observation (min), and the largest observation (max), and thus approximate the range.

The Standard Deviation Rule: Approximately 68% of the observations fall within 1 standard deviation of the mean. Approximately 95% of the observations fall within 2 standard deviations of the mean. Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.

The standard deviation rule. When using the Standard Deviation Rule, you need to make sure you multiply the standard deviation by 1, 2, or 3 first, then add or subtract from the mean
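The intervals in the rule can be sketched in Python as follows. The mean of 100 and standard deviation of 15 are hypothetical values picked for illustration, and the rule itself is generally applied only to distributions that are roughly bell-shaped.

```python
def sd_rule_intervals(mean, sd):
    """Return {k: (low, high, approximate percent)} for k = 1, 2, 3."""
    shares = {1: 68.0, 2: 95.0, 3: 99.7}
    # Multiply the SD by k first, then subtract from and add to the mean
    return {k: (mean - k * sd, mean + k * sd, share) for k, share in shares.items()}


intervals = sd_rule_intervals(mean=100, sd=15)  # hypothetical mean and SD
for k, (low, high, share) in intervals.items():
    print(f"about {share}% of observations fall between {low} and {high}")
```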

Stemplot

The stemplot (also called stem and leaf plot) is another graphical display of the distribution of quantitative data. The leaf is the right-most digit. The stem is everything except the right-most digit. So, if the data point is 34, then 3 is the stem and 4 is the leaf. If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.
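A simple stemplot can be sketched in Python like this. The sketch assumes the data are non-negative whole numbers, so the leaf is the ones digit and the stem is everything to its left; the sample values are invented for illustration.

```python
from collections import defaultdict


def stemplot(data):
    """Return the lines of a stemplot for non-negative integers."""
    stems = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)      # 34 -> stem 3, leaf 4
        stems[stem].append(leaf)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


values = [34, 31, 45, 52, 58, 47, 34]   # hypothetical data
for line in stemplot(values):
    print(line)
```

Keeping empty stems in the display preserves gaps in the data, which helps when judging shape and spotting outliers.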

What are the components of a box plot?

The top whisker = the upper 25% of the data. The box = the middle 50% of the data, with the median marked by a line inside it. The bottom whisker = the lower 25% of the data.

How do you use IQR to determine outliers?

There is a suspected outlier if: The point is below Q1 - 1.5 X IQR The point is above Q3 + 1.5 X IQR

Dotplot

There is another type of display that we can use to summarize a quantitative variable graphically—the dotplot. The dotplot, like the stemplot, shows each observation, but displays it with a dot rather than with its actual value.

What is median?

This represents the midpoint or the center of the data.

One Quantitative Variable

To display data from one quantitative variable graphically, we can use either the histogram or the stemplot

Symmetric-Uniform distribution

Uniform distribution

What are the histogram modes?

Unimodal (1 peak), bimodal (2 peaks), and uniform or flat (no peaks)

What is mode?

Value that occurs the most times.

Standard Deviation

We now move on to another measure of spread, the standard deviation, which quantifies the spread of a distribution in a completely different way.

Shape

When describing the shape of a distribution, we should consider the following: Symmetry/skewness of the distribution Peakedness (modality)—the number of peaks (modes) the distribution has

Split Stemplots

When some of the stems hold a large number of leaves, it is common for statistical software to split each stem into two: the first holding the leaves 0-4, and the second holding the leaves 5-9.
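A possible Python sketch of a split stemplot is shown below. It assumes non-negative whole numbers, uses invented sample values, and for brevity prints only the rows that actually contain leaves.

```python
def split_stemplot(data):
    """Return stemplot lines with each stem split into leaves 0-4 and leaves 5-9."""
    rows = {}
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        key = (stem, 0 if leaf <= 4 else 1)   # 0 = first half, 1 = second half
        rows.setdefault(key, []).append(leaf)
    return [
        f"{stem} | {''.join(map(str, leaves))}"
        for (stem, _), leaves in sorted(rows.items())
    ]


values = [31, 34, 34, 36, 38, 39, 42, 45]   # hypothetical data
for line in split_stemplot(values):
    print(line)
```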

Inter-Quartile Range (IQR)

While the range quantifies the variability by looking at the range covered by ALL the data, the IQR measures the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data.

Can a stem plot be used to display a quantitative variable distribution?

Yes

A survey taken of 140 sports fans asked the question: "What is the most you have ever spent for a ticket to a sporting event?" For the 140 sports fans surveyed, we found the following: min=85, Q1=130, Median=145, Q3=150, Max=250. Should the smallest observation be classified as an outlier?

Yes. Good job! The IQR is 150 - 130 = 20. Using the 1.5(IQR) criterion we get 130 - 1.5(20) = 100. Since the smallest observation of 85 is smaller than 100, it should be considered an outlier.

Numerical Measures

a more precise numerical description of the center and spread of the distribution. In this section we will learn: how to quantify the center and spread of a distribution with various numerical measures; some of the properties of those numerical measures; and how to choose the appropriate numerical measures of center and spread to supplement the histogram.

variable

a particular characteristic of the individual. Examples: Gender, Age, Weight, Height, Smoking, and Race.

Boxplot

another graphical display of the distribution of a quantitative variable, the boxplot.

Symmetric-Double peaked

bimodal distribution

dataset

is a set of data identified with particular circumstances. Datasets are typically displayed in tables, in which rows represent individuals and columns represent variables.

How do you describe the distribution of quantitative data numerically?

mean, median (M), mode (m)

Categorical variables

take category or label values, and place an individual into one of several groups.

Quantitative variables

take numerical values, and represent some kind of measurement.

Skewed Left Distributions

the left tail (smaller values) is much longer than the right tail (larger values).

Mode

the mode is the most commonly occurring value in a distribution.
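One way to find the mode in Python is sketched below. Because a data set can have more than one most common value, this hypothetical helper returns a list; the first sample is the nine computer-hours observations from this set.

```python
from collections import Counter


def modes(data):
    """Return every value that occurs the greatest number of times."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]
print(modes(hours))             # [5]
print(modes([1, 1, 2, 2, 3]))   # [1, 2]
```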

Skewed Right Distributions

the right tail (larger values) is much longer than the left tail (smaller values).

Symmetric Distributions

unimodal-Single peak

The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from their mean, $\bar{x}$. The standard deviation gives the average (or typical) distance between a data point and the mean, $\bar{x}$.

We'll use SD as an abbreviation for standard deviation, and use s as the symbol.
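A short Python sketch of the calculation follows. It assumes the usual sample formula, which divides the sum of squared deviations by n - 1 before taking the square root; the card above does not spell out the formula, so treat the divisor as an assumption. Strictly speaking, s is the square root of an average squared distance, which is why it is described as a "typical" distance.

```python
import math
import statistics


def sample_sd(data):
    """Sample standard deviation s, assuming the n - 1 divisor."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return math.sqrt(squared_deviations / (n - 1))


hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]
s = sample_sd(hours)
print(round(s, 2))

# Cross-check against the standard library, which uses the same n - 1 convention
print(math.isclose(s, statistics.stdev(hours)))  # True
```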

By distribution of a variable, we mean:

what values the variable takes, and how often the variable takes those values.

