STA301 Module 2

Numerical Summaries for Quantitative Data

-Measures of center provide only a partial description of a quantitative data set. The description of data is incomplete without a measure of variability. - Knowledge of center along with the dispersion helps us to visualize the shape of the data set as well as its extreme values. -The more the data vary, the less a measure of center can tell us.

Properties of Range

1. Range is easy to compute and easy to understand. 2. It is an insensitive measure of data variation when the data sets are large, because two data sets can have the same range and yet differ vastly in their variation. This is the main drawback of the range. 3. It is not resistant: it depends only on the two most extreme values, so a single outlier changes it.

Important properties of the mean:

1. The mean of a data set is unique and is not necessarily one of the data values. 2. The mean is affected by extremely high or low values, called outliers (it is a non-resistant statistic). So, if the data have outliers, gaps, skewness, etc., the mean may not be the appropriate measure of center. 3. The formula for the mean uses numerical values for the observations. So, the mean is appropriate only for quantitative variables. 4. The mean is used in computing other statistics, such as the variance.

Class

A class is one of the categories into which qualitative data can be classified.

Box Plot:

A graph of a five-number summary is known as a box plot. If present, outliers are typically depicted as dots. If outliers are present, the whiskers end at the smallest and/or largest data values that are not outliers.

Question 26: The percentage of data points falling at or below the upper quartile is A. 25 B. 50 C. 75 D. 100

C. 75

Upper Quartile (Qu)

Q3 or QU (upper quartile or third quartile) - It is the median of the upper half of the data and divides the bottom 75% of data from top 25%

Quartiles

Quartiles are special percentiles that divide the ordered data into quarters (four groups), each containing about 25% of the measurements.

Continuous Variables (Quantitative)

They can assume an infinite number of values between any two specific values. They often include fractions and decimals. Examples: temperature, rainfall, gasoline, etc.

Question 16: Mean can be computed for A. quantitative data. B. categorical data. C. both quantitative and categorical data.

A. quantitative data.

sample standard deviation:

s = (s^2)^(1/2), i.e., the square root of the sample variance s^2

Possible causes of an outlier:

1. Measurement error (such as observational error, equipment failure, incorrect coding in the data set, etc.) 2. Misclassified measurement: The data value may have been obtained from a subject that is not in the defined population. 3. A particular observation in a sample could be a rare (chance) event from a valid data set. In general, skewed distributions contain rare events.

Question 8: Which of the following can be used to display categorical data?

Bar chart

Middle Quartile (M)

the median or 50th percentile

sample Z score

z = (x - x̄)/s

Dot Plots:

• In a dot plot, the numerical value of each quantitative measurement in the data set is represented by a dot on a horizontal scale. • The dot plot condenses the data by grouping all values that are the same. • When data values repeat, the dots are placed above one another vertically, forming a pile at that particular numerical location. • This plot can be used to compare two or more data sets. • Since every data value is shown in this plot, it is preferred for small data sets.

Is median better than mean?

-In certain situations, the median may be a better measure of central tendency than the mean. In particular, the median is less affected than the mean by extremely large or small measurements (outliers); it is a resistant statistic. -If a data set has extremely large or small measurements, the mean can give a misleading picture of the center. In this situation, the median would be the better measure of central tendency.

interquartile range (IQR)

The distance between the lower and upper quartiles: IQR = QU - QL

Mean (Average):

The mean of a set of quantitative data is the sum of the measurements, divided by the number of measurements contained in the data set.

Range:

The range can be defined as the difference between the largest and the smallest measurements of a quantitative data set. Range(R) = (Max) - (Min)
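The drawback of the range can be seen in a small sketch. The two data sets below are hypothetical, chosen only so that they share the same minimum and maximum; the standard deviation from Python's `statistics` module is used as a second measure of spread.

```python
from statistics import stdev

def data_range(values):
    """Range = Max - Min."""
    return max(values) - min(values)

# Hypothetical data sets, chosen so both have Min = 1 and Max = 9.
clustered = [1, 5, 5, 5, 5, 5, 9]
spread_out = [1, 2, 3, 5, 7, 8, 9]

# Both ranges are 8, yet the second set varies more around its mean.
print(data_range(clustered), data_range(spread_out))
print(stdev(clustered), stdev(spread_out))
```

The range only looks at the two extreme values, so it cannot tell these two sets apart.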

Quantitative Variables (Numerical Variable)

When the values of a variable are measured numerical quantities with units, we call it a quantitative variable. We can find arithmetic summaries such as means or ranges. Quantitative variables must have units. The units indicate... • how each value has been measured. • the corresponding scale of measurement. • how much of something we have. • how far apart two values are.

Unimodal:

A data set that has only one value that occurs with the greatest frequency is said to be Unimodal.

Question 32: If the bars of a histogram represent the proportion of the total count that falls into each interval, what must the heights of the bars sum to? A. The total number of observations in the data set. B. One C. 1 divided by the total number of intervals used in the histogram. D. Not enough information to tell.

B. One

Class Frequency (f):

The class frequency is the number of observations in the data set that fall into a particular class.

Ordinal Variable

The term ordinal can be applied to a variable whose categorical values possess some kind of order.

Pie Chart:

-In a pie chart, the categories (classes) of the qualitative variable are represented by slices of a pie (circle). The size of each slice is proportional to the class frequency, relative frequency, or percentage. -The purpose of the pie chart is to show the relationship of the parts to the whole by visually comparing the sizes of the slices. -The pie chart is read by comparing the sizes of its slices. -This chart is best for data sets with few categories (fewer than 8).

Median (M):

-The median of a set of quantitative data is the middle number when the measurements are arranged in ascending order. -It splits the data into two parts with equal number of observations, when they are ordered in ascending order. In other words, the number that divides the bottom 50% of the data from the top 50% of the data is the Median. -When the sample size(n) is odd, a single observation occurs in the middle. When the sample size is even, two middle observations occur, and the median is the midpoint between the two.

Lower Quartile (Ql)

25th percentile of a data set

Time Series Plot:

A display of values against time is sometimes called a time series plot.

Frequency and Relative Frequency Table:

A frequency table records the counts (frequency) for each of the categories (classes) of the qualitative variable in a data set. If we use proportions (decimal equivalent of a percentage) for each of the categories we call it relative frequency table.

Histogram:

A histogram is a graph for a quantitative variable. Since there are no categories, we usually slice up all the possible values into bins (intervals) and then count the number of cases that fall in each bin.

Question 25: The percentage of data points falling at or below the lower quartile is A. 25 B. 50 C. 75 D. 100

A. 25

Outlier:

An observation that is unusually large or small relative to the data values we want to describe is called an outlier.

Effects of Outlier:

An outlier can strongly affect the mean, standard deviation and other statistics.

Question 18: Which of the following is not affected by extreme outliers? A. Variance B. Median C. Range D. Mean

B. Median

Methods for Describing Qualitative Data: Graphical methods:

Bar chart, Pie chart, and Pareto diagram

Classification of a distribution based on Mode:

Based on mode, distributions can be classified as unimodal, bimodal, and multimodal. It is not necessary that mode always exists.

Importance of Displaying and Describing Data (Variable):

By doing this we can see the patterns, relationships, trends, and exceptions present in the data.

Question 30: In which of the following plots of a quantitative variable are all of the data points clearly visible? A. Box plot B. Histogram C. Bar chart D. Stem-and-leaf display

D. Stem-and-leaf display

Question 3: Consider the following three scenarios and determine whether the numerical variable described in each scenario (I, II, and III) is discrete or continuous. I. Hours of lifetime lost because of smoking. II. The volume of air exhaled per breath by students while taking a statistics exam. III. The number of trips a fire truck made when the temperature exceeded 100°F.

I. Continuous II. Continuous III. Discrete

Question 15: Which of the following statements are true? I. The mean is always one of the data points. II. The mean, median, and mode can never all be the same. III. The mean and standard deviation are not resistant to outliers. IV. The median is the same as the 50th percentile and the second quartile. V. When n is even, the median is one of the data points.

III and IV

Multimodal:

If a data set has more than two values that occur with the same greatest frequency, each value is used as the mode, and the data set is said to be multimodal.

No Mode Example: The following data show the number of coal employees per county for 10 selected counties in southwestern Pennsylvania. Find the mode. 110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752

Since each value occurs only once, mode does not exist.

Calculate median by hand:

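Step 1: Arrange the data set in ascending order. Step 2: Find the middle position, (n+1)/2. When n is odd, the median is the value of the (n+1)/2 th observation in the ordered sample. When n is even, (n+1)/2 falls halfway between two positions, and the median is the average of the n/2 th and (n/2 + 1) th observations.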
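The two steps can be sketched in Python as follows. The function name `median_by_hand` is only illustrative, and the even-n branch averages the two middle observations, as in the definition of the median in this set.

```python
def median_by_hand(values):
    """Median via the (n + 1) / 2 position in the ordered sample."""
    ordered = sorted(values)          # Step 1: ascending order
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is a whole 1-based position.
        position = (n + 1) // 2
        return ordered[position - 1]  # convert to a 0-based index
    # Even n: (n + 1) / 2 falls between positions n/2 and n/2 + 1,
    # so the median is the midpoint of those two observations.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

# Data from Question 13 (n = 7, odd).
print(median_by_hand([3, 4, 3, 5, 2, 4, 25]))
# NFL signing bonuses from the unimodal example (n = 8, even).
print(median_by_hand([18.0, 14.0, 34.5, 10.0, 11.3, 10.0, 12.4, 10.0]))
```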

population z score

z = (x - μ)/σ

Stem-and-leaf display (plot):

A stem-and-leaf plot is a data plot that uses part of the data value as the stem (leading digits) and part of the data value as the leaf (final digit) to form groups or classes. This method of organizing data includes both sorting (both stems and leaves are ordered) and graphing. Each stem is a number on the left of the display and a leaf is a number to the right of it. The stem is shown to the left of solid vertical line and the leaf to the right of it. This plot conveys similar information as a histogram. Turned on its side, it has the same shape as the histogram.

A five-number summary:

The following five numbers are generally referred to as the five-number summary of a data set. These numbers are used to produce a box plot (discussed later). 1. The lowest value of the data set (i.e., minimum) 2. Q1 or QL (first quartile or lower quartile or 25th percentile) 3. Q2 or M (the median or second quartile or 50th percentile) 4. Q3 or QU (third quartile or upper quartile or 75th percentile) 5. The highest value of the data set (i.e., maximum)
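A minimal sketch of the five-number summary, using the coal-employee counts from the no-mode example. It takes the quartiles as the medians of the lower and upper halves of the ordered data; leaving the median itself out of both halves when n is odd is an assumption here, since textbooks and software differ on that convention.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, QL, M, QU, max); assumes at least two observations."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumed convention: for odd n, skip the middle observation.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

coal = [110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752]
minimum, q_l, m, q_u, maximum = five_number_summary(coal)
iqr = q_u - q_l  # IQR = QU - QL
print(minimum, q_l, m, q_u, maximum, iqr)
```

For these ten counts the sketch gives a minimum of 20, QL = 103, M = 424.5, QU = 1031, and a maximum of 1977, so IQR = 928.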

Question 4: Consider the following three scenarios and determine whether the numerical variable described in each scenario (I, II, and III) is discrete or continuous. I. The increase in length of life of a cancer patient following chemotherapy. II. The volume of gasoline lost due to evaporation during the filling of a gas tank. III. The number of cracks that exceed 8 inches in 1 mile of a major highway.

I. Continuous II. continuous III. discrete.

Describing the Distribution for Categorical Data:

When describing the distribution of categorical data, we typically discuss how the observations are distributed among the categories.

No Mode:

When no data value occurs more than once, the data set is said to have no mode.

Cross-Sectional Data

When several variables are all measured at the same time point, the data is called cross-sectional data. For example, data on sales revenue, number of customers, and expenses for last month at each Starbucks (more than 20,000 locations as of 2012) at one point in time would be cross-sectional data.

Pareto Diagram / Pareto Chart:

It is a bar graph with the categories of the qualitative variable arranged by height in descending order from left to right. The heights of the bars represent the frequencies or relative frequencies. Note: In a Pareto chart, comparisons can be made by looking at the heights of the bars.

Empirical Rule:

It is a rule of thumb that applies to data sets with frequency distributions that are mound-shaped and symmetric. Rule 1: Approximately 68% of the measurements will fall within one standard deviation of the mean [i.e., within the interval (x̄ − s, x̄ + s) for a sample and (μ − σ, μ + σ) for a population]. Rule 2: Approximately 95% of the measurements will fall within two standard deviations of the mean [i.e., within the interval (x̄ − 2s, x̄ + 2s) for a sample and (μ − 2σ, μ + 2σ) for a population]. Rule 3: Approximately 99.7% (essentially all) of the measurements will fall within three standard deviations of the mean [i.e., within the interval (x̄ − 3s, x̄ + 3s) for a sample and (μ − 3σ, μ + 3σ) for a population].
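The rule can be checked on simulated data. The sketch below draws a hypothetical mound-shaped sample with `random.gauss` (the mean of 50, standard deviation of 10, and seed are arbitrary assumptions) and counts the share of values within k sample standard deviations of the sample mean.

```python
import random
from statistics import mean, stdev

def share_within(values, k):
    """Proportion of values inside (x_bar - k*s, x_bar + k*s)."""
    x_bar, s = mean(values), stdev(values)
    inside = [x for x in values if x_bar - k * s <= x <= x_bar + k * s]
    return len(inside) / len(values)

random.seed(0)  # fixed seed so the simulated sample is reproducible
sample = [random.gauss(50, 10) for _ in range(10_000)]

shares = {k: share_within(sample, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.3f}")
```

The printed shares should land close to 0.68, 0.95, and 0.997. For skewed data the same function can give quite different proportions, which is why the rule is limited to mound-shaped, symmetric distributions.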

Measure of Central Tendency (Location):

It is defined as the measure of the tendency of the data to cluster, or center, about certain numerical values. Among the several measures of central tendency, we discuss the mean, median, and mode.

Properties of Variance and Standard Deviation

1. For two data sets, when the means are equal, the larger the variance or the standard deviation is, the more variable the data are. 2. If the data are rescaled, the standard deviation is also rescaled. For instance, if we change annual incomes from dollars to thousands of dollars, the standard deviation is also divided by 1,000. 3. s ≥ 0
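The rescaling property can be illustrated with a short sketch. The income figures below are made up for the illustration.

```python
from statistics import stdev

# Hypothetical annual incomes, in dollars.
incomes_dollars = [42_000, 55_000, 61_000, 78_000, 120_000]
# The same incomes expressed in thousands of dollars.
incomes_thousands = [x / 1000 for x in incomes_dollars]

s_dollars = stdev(incomes_dollars)
s_thousands = stdev(incomes_thousands)

# The two standard deviations differ by the same factor as the data.
ratio = s_dollars / s_thousands
print(s_dollars, s_thousands, ratio)
```

The variance, being in squared units, would differ by a factor of 1,000 squared instead.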

Time Series

Variables that are measured at regular intervals over time are called a time series. Typical measuring points are months, quarters, or years.

Unimodal Example: Consider NFL signing Bonus data. Following are the signing bonuses of eight NFL players for a specific year. The bonuses in million dollars are: 18.0, 14.0, 34.5, 10.0, 11.3, 10.0, 12.4, 10.0 Find mode.

We can calculate the mode as follows: It is helpful to arrange the data in order, although it is not necessary. 10, 10, 10, 11.3, 12.4, 14, 18, 34.5 Since $10 million occurred 3 times (the greatest frequency), the mode is $10 million.

Types of Variable Notes:

We typically associate discrete variables with something we can count and continuous variables with something we can measure. Some variables can be both categorical and quantitative. How data are classified depends on why we are collecting the data.

Percentile

For any n measurements (arranged in ascending or descending order), the pth percentile is a number such that p% of the measurements fall below that number and (100-p)% fall above it. Example: Suppose an instructor tells you that you scored at the 90th percentile. It means that 90% of the grades were lower than yours and 10% were higher than yours.

Bar Graph/Chart:

• Bar chart is one of the methods to represent qualitative or categorical data. • In bar chart, the categories (classes) of the qualitative variable are represented by bars, where the height of each bar is either the class frequency (count), class relative frequency, or class percentage. • The bars are the same width, so their heights determine their areas, and the areas are proportional to the counts in each class. • Bar charts make the comparison of the classes easy and natural.

Mode:

-The value that occurs most frequently in a data set is called the mode. -The mode indicates the most common outcome. -The mode shows where the data tend to concentrate. -The mode is the only measure of central tendency that has to be an actual data value in the sample. -Mode can be computed for both quantitative and qualitative data.

Numerical Measure of Relative Position/ Relative Standing

-They tell where a specific data value falls within the data set or its relative position in comparison with other data values. -These measures are used extensively in psychology and education and sometimes are referred to as norms. -In this section we discuss percentiles and quartiles.

Example of use of mode in business

A retailer of men's clothing would be interested in the modal neck size and sleeve length of potential customers.

Question 19: The distribution of salaries of professional basketball players is skewed to the right. Which measure of central tendency would be the best measure to determine the location of the center of the distribution? A. Median B. Mode C. Mean D. Range

A. Median

Question 22: Which of the following is not a measure of variability? A. Proportion B. Population variance C. Interquartile Range (IQR) D. Standard deviation

A. Proportion

Question 28: Which of the following is/are not used to display categorical data? A. Scatter plot B. Pie charts C. Stem-and-leaf displays D. Bar charts

A. Scatter plot and C. Stem-and-leaf displays

Question 35: Consumer Reports National Research Center routinely compares products and services. A poll of more than 1,800 U.S. residents was conducted shortly after the 2008 holiday season to determine consumer tipping behavior. These data are A. cross-sectional. B. time series. C. not enough information is provided. D. both A and B.

A. cross-sectional.

Question 36: Which of the following can be used to display quantitative data? A. Mosaic plot B. Box plot C. Contingency table D. Bar chart

B. Box plot

Question 20: Circle all of the following statements about the sample standard deviation and sample variance that are true. A. The value of s^2 is always greater than the value of s. B. The larger the value of s^2 or s, the smaller the variability of the data set. C. If s^2 or s is equal to zero, all the measurements must have the same value. D. We compute the standard deviation s, which is the square root of the variance s^2, in order to measure the variability in the same units as the original observations.

C & D

Question 17: Which of the following statistics can be used to describe the center and/or shape of the distribution for both quantitative and categorical variables? A. Mean B. Median C. Mode D. Standard deviation

C. Mode

Class Percentage:

The class percentage is the class relative frequency multiplied by 100. Class percentage = (class relative frequency) * 100

Class Relative Frequency (Proportion):

The class relative frequency is the class frequency divided by the total number of observations in the data set. Class relative frequency = (class frequency (f)) / (total frequency (n)). Note: The class relative frequencies sum to 1.
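Class frequency, class relative frequency, and class percentage can be tabulated together, as in the sketch below. The advertising responses are hypothetical and the function name `frequency_table` is only illustrative.

```python
from collections import Counter

def frequency_table(observations):
    """Map each class to (frequency, relative frequency, percentage)."""
    counts = Counter(observations)
    n = len(observations)
    return {cls: (f, f / n, 100 * f / n) for cls, f in counts.items()}

# Hypothetical answers to "What kind of advertising do you use?"
responses = ["TV", "Online", "TV", "Print", "Online", "TV", "Radio", "TV"]
table = frequency_table(responses)

for cls, (f, rel, pct) in table.items():
    print(f"{cls:<8}{f:>3}{rel:>8.3f}{pct:>8.1f}%")

# The relative frequencies add up to 1 (up to rounding in general).
total_relative = sum(rel for _, rel, _ in table.values())
print(total_relative)
```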

Question 13: In the following data set, which (mean or median) measure of center tendency will be better? Why? 3, 4, 3, 5, 2, 4, 25

The median would be better because of the number 25 which is an outlier.
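The effect of the value 25 can be checked directly with Python's `statistics` module by computing both measures with and without it.

```python
from statistics import mean, median

data = [3, 4, 3, 5, 2, 4, 25]
without_outlier = [x for x in data if x != 25]

# With the outlier, the mean is pulled above six of the seven values.
print(mean(data), median(data))
# Without it, the mean and the median agree.
print(mean(without_outlier), median(without_outlier))
```

With 25 included the mean is 46/7, about 6.57, while the median is 4. Dropping 25 moves the mean to 3.5 but the median only from 4 to 3.5, which is what resistance to outliers means in practice.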

Variance (s^2):

The sample variance for a sample of n measurements is the sum of the squared deviations of each data value from the mean, divided by (n - 1). s^2 = Σ(x - x̄)^2 / (n - 1)
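A minimal sketch of the computation, using the n − 1 divisor that is standard for a sample variance. The five data values are hypothetical, and `statistics.variance` is used only as a cross-check.

```python
from math import sqrt
from statistics import variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [2, 4, 4, 6, 9]         # hypothetical sample; its mean is 5
s_squared = sample_variance(data)
s = sqrt(s_squared)            # sample standard deviation

print(s_squared, s)
print(variance(data))          # library value for comparison
```

Here the squared deviations are 9, 1, 1, 1, and 16, which sum to 28, so s^2 = 28/4 = 7.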

Nominal Variable

The term nominal can be applied to a categorical variable whose values are used only to name categories. Example: Gender is a nominal variable which has two categories. 1 = Male 2 = Female

Standard Deviation (s):

The variance plays an important role in measuring spread, but its units are the square of the original units of the data. Taking the square root of the variance corrects this issue and gives us the standard deviation. Specifically, the standard deviation indicates roughly how far, on average, the observations are from the mean.

Discrete Variables (Quantitative)

They assume values that can be counted, taking their possible values from a set of separate numbers. Examples: number of children in a family, number of car accidents, shoe sizes, etc.

Question 21: Which of the following statement(s) about sample standard deviations and sample variances is/are not true? A. The value of s is always greater than or equal to zero. B. The smaller the value of s^2 or s, the larger the variability of the data set. C. If s^2 or s is equal to zero, all the measurements must have the same value. D. We compute the standard deviation s, which is the square root of the variance s^2, in order to measure the variability in the same units as the original observations.

B. The smaller the value of s^2 or s, the larger the variability of the data set.

Question 24: The z-score and percentile are measures of A. location. B. relative position. C. relative frequency. D. variability. E. normality.

B. relative position

Question 34: Determine whether data are a time series or are cross-sectional. The U.S. Bureau of Labor Statistics publishes the monthly CPI (consumer price index). The index shows the changes in prices paid by urban consumers for a market basket of goods and services. These data are A. cross-sectional. B. time series. C. not enough information is provided. D. both A and B.

B. time series.

Question 29: Which of the following is not used to display categorical data? A. Pareto diagram B. Pie charts C. Stem-and-leaf displays D. Bar charts

C. Stem-and-leaf displays

Question 2: Which of the following variables is not discrete? A. the number of words on an 8.5 x 11 inch sheet of paper. B. the number of courses in which a college student is enrolled. C. the number of attempts needed in order to successfully complete a task. D. the number in a group of 20 people who have college degrees. E. the distance traveled by a motorcycle on one gallon of gas.

E. the distance traveled by a motorcycle on one gallon of gas.

Methods for Describing Qualitative Data: Numerical methods

Frequency and relative frequency distribution (table)

Question 9: Which of the following graphical methods cannot be used to describe categorical variables?

Histogram

Bimodal:

If a data set has two values that occur with the same greatest frequency, both values are considered to be the mode and the data set is said to be bimodal.

"lack of skewness"

In any data set if the mean and median are almost equal, it indicates a lack of skewness in the data set. In other words, the data exhibit a tendency to have as many measurements in the left tail of the distribution as in the right tail.

Measure of Variability (Dispersion):

It is defined as the measure of the spread of the data about certain numerical values. Specifically, variability is defined as the measure of the spread of data around the mean. Among the several measures of dispersion, we discuss the range, variance, standard deviation, and interquartile range (IQR).

Example of Pareto Diagram Use

Network TV Shows (this is the same data set used to produce frequency distribution) Produce and interpret a Pareto Diagram for these data.

Methods to Detect outliers

Numerical method: such as the z-score. Graphical method: such as the box plot. z-score method for detecting outliers: If |z| > 2, the data value is considered a possible outlier. If |z| > 3, the data value is considered an outlier.
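The z-score method can be sketched as below, using the sample z-score z = (x - x̄)/s and the data from Question 13. The function names and the label "not flagged" are illustrative choices, not part of the original notes.

```python
from statistics import mean, stdev

def z_scores(values):
    """Sample z-score of every observation: (x - x_bar) / s."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]

def classify(z):
    """Apply the |z| > 2 and |z| > 3 cutoffs."""
    if abs(z) > 3:
        return "outlier"
    if abs(z) > 2:
        return "possible outlier"
    return "not flagged"

data = [3, 4, 3, 5, 2, 4, 25]
labels = [classify(z) for z in z_scores(data)]
for x, z, label in zip(data, z_scores(data), labels):
    print(x, round(z, 2), label)
```

The value 25 gets a z-score of about 2.25 and is flagged only as a possible outlier. In a sample this small, the outlier itself inflates s, so no observation can reach |z| > 3; a box plot may be the more informative check in such cases.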

Bimodal Example: Consider Licensed Nuclear Reactors data. The data show the number of licensed nuclear reactors in the United States for a recent 15-year period. Find mode. 104 104 104 104 104 107 109 109 109 109 109 110 111 111 112

Since the values 104 and 109 both occur 5 times, the modes are 104 and 109.
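The three mode examples in this set (coal employees, NFL signing bonuses, and licensed reactors) can be checked with a short sketch built on `collections.Counter`. It follows the convention used in these notes that a data set has no mode when no value occurs more than once; the function names are illustrative.

```python
from collections import Counter

def modes(values):
    """All values tied for the greatest frequency; [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

def classify_modes(values):
    """Label a non-empty data set by its number of modes."""
    labels = {0: "no mode", 1: "unimodal", 2: "bimodal"}
    return labels.get(len(modes(values)), "multimodal")

coal = [110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752]
bonuses = [18.0, 14.0, 34.5, 10.0, 11.3, 10.0, 12.4, 10.0]
reactors = [104, 104, 104, 104, 104, 107, 109, 109, 109, 109, 109,
            110, 111, 111, 112]

for name, data in [("coal", coal), ("bonuses", bonuses), ("reactors", reactors)]:
    print(name, modes(data), classify_modes(data))
```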

Qualitative Variables (Categorical Variables)

They are variables that can be placed into distinct categories according to some characteristic or attribute. Such a variable simply names the categories (whether with words or numerals). These variables: • arise from descriptive responses to questions like "What kind of advertising do you use?". • may only have two possible values (like "yes" or "no"). • may be a number like a zip code. • do not have meaningful averages.

Relative Frequency Histogram:

This graph reports the percentage or proportion of cases in each bin. The shapes of the two histograms (frequency and relative frequency) are the same; only the labeling of the vertical axis differs. A relative frequency histogram is faithful to the area principle by displaying the percentage of cases in each bin instead of the count.
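The binning step behind a relative frequency histogram can be sketched without any plotting library. The bin width of 10, the starting point of 10, and the left-closed bins [left, left + width) are assumptions made for this illustration; the data are the NFL signing bonuses from the unimodal example.

```python
def relative_frequency_bins(values, bin_width, start):
    """Map each bin's left edge to the proportion of cases in that bin."""
    counts = {}
    for x in values:
        index = int((x - start) // bin_width)  # which bin x falls into
        left = start + index * bin_width       # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    n = len(values)
    return {left: c / n for left, c in sorted(counts.items())}

bonuses = [18.0, 14.0, 34.5, 10.0, 11.3, 10.0, 12.4, 10.0]
bins = relative_frequency_bins(bonuses, bin_width=10, start=10)

for left, proportion in bins.items():
    print(f"[{left}, {left + 10}): {proportion:.3f}")
print(sum(bins.values()))  # the bar heights sum to one
```

Seven of the eight bonuses fall in [10, 20) and one falls in [30, 40), giving proportions of 0.875 and 0.125. Multiplying each proportion by n recovers the counts, which is why the frequency and relative frequency histograms have the same shape.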

