Ch 2: Methods for Describing Sets of Data

Median

the number that divides the bottom 50% of the data from the top 50%; the middle value in the ordered list
n odd: M (or η) is the middle number
n even: M (or η) is the average of the 2 middle numbers
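The odd/even rule can be sketched in a few lines of Python. The function name `median_of` is made up for this illustration and is not part of the original notes.

```python
def median_of(values):
    """Return the median of a non-empty list of numbers."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)      # the median is defined on the ordered list
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # n odd: the single middle number
        return ordered[mid]
    # n even: average of the two middle numbers
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([7, 1, 4]))       # n = 3, middle of 1, 4, 7
print(median_of([7, 1, 4, 10]))   # n = 4, average of 4 and 7
```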

3 Measures of Central Tendency

1. Mean (numerical: discrete/continuous) 2. Median (numerical: discrete/continuous) 3. Mode (can be numerical or categorical)

Interquartile Range

(for middle 50% of data) IQR = upper quartile - lower quartile

3 graphs/charts that provide visual summary of qualitative data

1. (Relative) Frequency Table 2. Bar Chart (ex: Pareto Chart) 3. Pie Chart
Bar chart: bars centered over classes; white space between bars: yes

A large number of numerical measures are available to describe quantitative data sets. Most of these methods measure one of two data characteristics.

1. Central Tendency 2. Variability

2 rules for using the mean and standard deviation (together) to describe a data set

1. Chebyshev's Rule 2. Empirical Rule

3 graphs used for describing numerical data

1. Dot Plot 2. Stem and Leaf 3. Histogram

When numerically describing a data set, what 2 numerical properties should be reported together?

1. Mean 2. Standard deviation

What percent of measurements in data set lie to the left of upper quartile?

75%

Histogram Bar Locations

Discrete: "class interval" bars are centered over the values
Continuous: each bar begins at the lower endpoint of its class interval and ends at the upper endpoint

Class Frequency

the NUMBER (count) of observations/individuals in the data set that fall into a particular class

Stem and Leaf Display

Stem: listed in order in a column
Leaf: the leaf of each quantitative measurement in the data set is placed in the corresponding stem row; leaves for observations with the same stem value are listed in increasing order horizontally
ideal for:
−continuous data with high (same range; 15 decimals)/low (same range; 1 decimal; ex: temp) resolutions
−large data sets
caution: can be tricky for discrete, integer-valued data
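A minimal sketch of building the display, assuming non-negative integers where the stem is everything but the last digit and the leaf is the last digit. The function name `stem_and_leaf` and the sample values are made up for illustration. Stems with no observations are simply omitted here, which a hand-drawn display would normally still list.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (value // 10) to its leaves (value % 10) in increasing order."""
    rows = defaultdict(list)
    for v in sorted(values):            # sorting keeps leaves in increasing order
        rows[v // 10].append(v % 10)
    return dict(sorted(rows.items()))   # stems listed in order


display = stem_and_leaf([34, 12, 27, 15, 21, 27])
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```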

General Policy

Greek letters: population
Roman letters: sample

Inner Fences

Lower (LIF) = QL - 1.5(IQR)
Upper (UIF) = QU + 1.5(IQR)
outliers lie beyond the points at distance 1.5(IQR) from each hinge/quartile

Outer Fences

Lower (LOF) = QL - 3(IQR)
Upper (UOF) = QU + 3(IQR)
really extreme; usually go beyond the data range; measurements beyond them are "highly" suspect OUTLIERS
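The inner and outer fence formulas can be sketched as follows. The quartiles are passed in directly, so no particular quartile convention is assumed. The names `fences` and `classify` and the labels they return are illustrative only.

```python
def fences(ql, qu):
    """Return the four fences computed from the lower and upper quartiles."""
    iqr = qu - ql
    return {
        "LIF": ql - 1.5 * iqr,   # lower inner fence
        "UIF": qu + 1.5 * iqr,   # upper inner fence
        "LOF": ql - 3 * iqr,     # lower outer fence
        "UOF": qu + 3 * iqr,     # upper outer fence
    }


def classify(x, ql, qu):
    """Say where a measurement x falls relative to the fences."""
    f = fences(ql, qu)
    if x < f["LOF"] or x > f["UOF"]:
        return "beyond outer fence"
    if x < f["LIF"] or x > f["UIF"]:
        return "beyond inner fence"
    return "inside fences"


print(fences(10, 20))
print(classify(40, 10, 20))
```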

Skewed Right

Mean is greatest: Mode < Median < Mean
ex: salaries of all persons employed by a large company
ex: grades on a difficult test

Skewed Left

Mean is smallest: Mean < Median < Mode
ex: grades on an easy test
ex: amounts of time students spent on a difficult exam

Measure of Center: Measure of Variability

Mean: Standard Deviation (s, σ)
Median: IQR

Interpreting Box Plots

Median = center of the distribution of the data
Length of box (IQR) = measure of the sample's variability (useful for comparing 2 samples)
Visually compare the lengths of the whiskers (if one is clearly longer, the distribution is probably skewed in the direction of the longer whisker)
Analyze any measurements that lie beyond the fences (fewer than 5% should fall beyond the inner fences, even for very skewed distributions; measurements beyond the outer fences are probably outliers)

5 Number Summary

Min, QL, Median, QU, Max
measures variability around the median
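A short sketch of the five-number summary using Python's standard `statistics.quantiles`. Quartile conventions differ between textbooks and software; this uses Python's default ("exclusive") method, which is an assumption and may not match the rule used in the course. The function name `five_number_summary` is illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (Min, QL, Median, QU, Max) for a data set with at least 2 values."""
    # Assumption: Python's default "exclusive" quartile method.
    ql, m, qu = statistics.quantiles(values, n=4)
    return (min(values), ql, m, qu, max(values))


print(five_number_summary([1, 2, 3, 4, 5, 6, 7]))
```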

Class Percentage

PERCENTAGE: (class relative frequency) × 100%

Class Relative Frequency

PROPORTION written as a DECIMAL: class frequency divided by the total number of observations in the data set
class relative frequency = (class frequency) / n
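Class frequency, class relative frequency, and class percentage can be tabulated together. This is a minimal sketch; the function name `frequency_table` and the sample classes are made up for illustration.

```python
from collections import Counter


def frequency_table(observations):
    """Return frequency, relative frequency, and percentage for each class."""
    counts = Counter(observations)      # class frequency
    n = len(observations)
    table = {}
    for cls, freq in counts.items():
        rel = freq / n                  # class relative frequency
        table[cls] = {
            "frequency": freq,
            "relative_frequency": rel,
            "percentage": rel * 100,    # class percentage
        }
    return table


for cls, row in frequency_table(["A", "B", "A", "C", "A", "B"]).items():
    print(cls, row)
```

The relative frequencies across all classes sum to 1, which is a quick check on a hand-built table.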

Variability

SPREAD of the data with a SPECIFIC measure of center as REFERENCE

Measures of central tendency provide only a partial description of a quantitative data set.

The description is incomplete without a measure of variability, or spread, of the data set.

3 Measures of Variability

aka spread; completes the description of a quantitative data set; most are in reference to a measure of center
1. Range 2. Standard Deviation 3. IQR (Interquartile Range)

Histogram

analog to bar chart
possible numerical values of the quantitative variable are partitioned into CLASS INTERVALS, each of which has the same width (a vertical bar is placed over each class interval)
horizontal axis scale: class intervals
order of values on horizontal axis: determined by the numerical values
bars' location: depends on discrete v continuous
vertical axis (bar height): class frequency, class relative frequency, or class percentage
space between bars: no = numerical

Histogram v Bar Chart

bar and Pareto:
−white space between bars
−bars centered over values
−bar: random bar order vs Pareto: descending height determines bar order
−cannot look at shape
histogram:
−no white space between bars
−limits of bases are defined by the endpoints of the intervals defining the classes
−bar order uniquely determined by the numerical values of the quantitative variable (unchangeable order...can look at SHAPE)

Distorting the Truth with Descriptive Stats

−biased sampling
−small sample size
−poorly chosen average
−results falling within standard error
−using graphs to create an impression
−semi-attached figure (ex: contains)
−post-hoc fallacy

How is a Stem and Leaf similar to Dot Plot?

both plots: show data distribution, used to identify unusual data values, and used to determine specific data entries

Box Plot (aka modified or outlier)

box drawn with the hinges, or quartiles, at QL and QU, with M (median) in between QL and QU
−inner fences
−whiskers
−IQR (QL, M, QU)

Methods for detecting outliers

box plots (graphical): beyond fences
z-score (numerical): |z| > 2 or |z| > 3

"Class" (categorical data, discrete data, and continuous data)

categorical data: discrete data: continuous data:

Ordinal Variable

categorical variable, where values specify some type of ordering ex: Likert scale

Skewed

data set in which one tail of the distribution has more extreme observations than the other tail (the longer tail gives the direction)
median = almost always between mean and mode
mean = always pulled in direction of longer tail
mode = always at peak

How does a Pareto diagram differ from a bar chart?

descending order of bars
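The ordering difference can be sketched with `collections.Counter`, whose `most_common()` method returns classes from most to least frequent. The function name `pareto_order` and the complaint categories are made up for illustration.

```python
from collections import Counter


def pareto_order(observations):
    """Return (class, frequency) pairs from tallest bar to shortest."""
    return Counter(observations).most_common()


complaints = ["late", "damaged", "late", "wrong item", "late", "damaged"]
for cls, freq in pareto_order(complaints):
    print(f"{cls:<12}{'#' * freq}")   # crude text bars, left to right = top to bottom
```

A plain bar chart could list the same classes in any order; only the Pareto diagram fixes the order by height.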

Measures of Relative Standing

describe the relative quantitative location of a particular measurement within a data set
ex: percentile ranking or percentile score

Interquartile Range (IQR)

distance (range) between first and third quartiles
IQR = QU - QL
smaller IQR indicates: more dense values
larger IQR indicates: less dense values

Population Mean

formula: µ = (Σx) / N, the sum of all N population measurements divided by N
symbol: µ

Sample Mean

formula: x̄ = (Σx) / n, the sum of the n sample measurements divided by n
symbol: x̄ (x bar)

Statistic

general term for SAMPLE
sample mean: x̄ (x bar)
sample median: M
sample variance: s²
sample standard deviation: s

Parameter

general term for POPULATION
population mean: µ
population median: η
population variance: σ²
population standard deviation: σ

Divisor of Variance

gives the degrees of freedom (for the sample variance, the divisor n - 1)

1st step in data analysis

identify and classify data! ex: Total Appearance Score is ordinal (not discrete)

Dot Plot

individual dot: numerical value of each quantitative measurement in the data set (on horizontal scale)
when data values repeat, dots are placed above one another vertically
ideal for:
−continuous data with low resolution (values that repeat)
−discrete data
−small to moderate sample size (n)

Range

largest measurement minus the smallest measurement
measure of variability NOT in reference to a specific measure of center

Whiskers (and IFs)

lines drawn from each hinge/quartile to the most extreme measurement inside the inner fence
end of 1st whisker: value in the data that is closest to, but not less than, the lower inner fence
end of last whisker: value in the data that is closest to, but does not exceed, the upper inner fence
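A minimal sketch of locating the whisker ends. The quartiles are supplied by the caller; the example values QL = 11 and QU = 19 come from taking medians of the lower and upper halves of the sample data, which is one convention among several. The function name `whisker_ends` and the data are illustrative.

```python
def whisker_ends(values, ql, qu):
    """Return the most extreme data values that still lie inside the inner fences."""
    iqr = qu - ql
    lower_inner = ql - 1.5 * iqr
    upper_inner = qu + 1.5 * iqr
    inside = [v for v in values if lower_inner <= v <= upper_inner]
    if not inside:
        raise ValueError("no data values lie inside the inner fences")
    return min(inside), max(inside)


data = [1, 10, 12, 14, 16, 18, 20, 60]
print(whisker_ends(data, ql=11, qu=19))   # 60 lies beyond the upper inner fence (31)
```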

Symmetric Data

mean, median, mode are approximately equal
ex: amounts of time students spent studying last week
ex: ages of cars in a used-car lot

Percentile Ranking (Percentile Score)

measure of relative standing useful only with large data sets

Likert Scale

measurement device, developed by a psychologist, for gauging opinions, attitudes, and personal values
the qualitative values are assigned numerical values indicative of their order
special case of ORDINAL data
ex: 1 = definitely dissatisfied to 5 = definitely satisfied

Mode

most frequently occurring value
possible to have no mode, one mode, or more than one mode
may not be useful; a more meaningful measure can be obtained from a relative frequency HISTOGRAM: the measurement class containing the largest relative frequency is called the MODAL CLASS (for continuous data), i.e., the histogram peak
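The "no mode, one mode, or more than one mode" cases can be sketched as below. Treating a data set whose values all occur exactly once as having no mode is an assumption of this sketch; some references instead call every value a mode. The function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when every value is distinct."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []                      # assumption: all values distinct means no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 3]))          # no mode
print(modes([1, 2, 2, 3]))       # one mode
print(modes([1, 1, 2, 2, 3]))    # more than one mode
```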

Z-score

number of standard deviations the value x is from the mean (relative location of x with respect to the mean, using the standard deviation as the counting scale; counting by "s")
should be used with symmetric data
couple with the empirical rule to show how well an individual falls within the majority of observations
possible outliers lie more than 2 standard deviations from the mean (outside roughly 95% of the data)
average of all z-scores = 0
standard deviation of all z-scores = 1

p-th percentile

number such that p% of the measurements fall below that number and (100-p)% fall above it
ex: According to a statistics bureau, 25.6% of all licensed drivers stopped by police are 41 years or older, so age 41 is the 74.4th percentile.
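Going the other way, from a value to its percentile, amounts to counting the share of measurements below it. This is a minimal sketch; the function name `percent_below` and the data are made up for illustration.

```python
def percent_below(values, x):
    """Return the percentage of measurements that fall strictly below x."""
    if not values:
        raise ValueError("data set is empty")
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


scores = list(range(1, 11))        # 1, 2, ..., 10
print(percent_below(scores, 8))    # 7 of the 10 values are below 8
```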

Outlier

observation/measurement that is UNUSUALLY large or small relative to the other values in the data set
due to:
1. measurement is observed, recorded, or entered into the computer incorrectly
2. measurement comes from a different population
3. measurement is correct, but represents a rare (chance) event
lies beyond the inner and/or outer fences; more than 2-3 standard deviations from the mean

Class

one of the CATEGORIES into which the qualitative data can be classified

(Relative) Frequency Table

only method for NUMERICALLY describing CATEGORICAL data
Cumulative relative frequency: last value in column should always be 1

Quartiles

percentiles that partition a data set into 4 categories, each category containing 25% of the measurements
lower quartile: 25th percentile of data set
middle quartile: 50th percentile (median M)
upper quartile: 75th percentile
only 3 quartiles are needed to create 4 parts; there is no 4th quartile

Bivariate Relationship

relationship between 2 numerical variables
scatterplot shows: positive (increasing), negative (decreasing), or no relationship

(Sample) Standard Deviation

s = √(s²)
never negative (zero only when all observations are equal)
roughly measures the distance of a typical observation from the sample mean
when comparing the variability of 2 data sets, the data set with the larger standard deviation is more variable
provides a measure of variability within a single data set

Pie Chart

slice = class of the qualitative variable represented
size/area: proportional to class relative frequency
order of classes relative to one another: arbitrary

Z-score form

standardized version of a variable (form)
z = (variable - mean of variable) / (standard deviation of variable)
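A short sketch of standardizing a sample, using the sample mean and sample standard deviation from Python's `statistics` module. The function name `z_scores` and the data are illustrative. The printed checks show that the z-scores average 0 and have standard deviation 1, up to floating-point rounding.

```python
import statistics


def z_scores(values):
    """Return (x - mean) / s for every x, with s the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)       # divisor n - 1
    if s == 0:
        raise ValueError("all values are equal, so z-scores are undefined")
    return [(x - mean) / s for x in values]


z = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print([round(v, 2) for v in z])
print(round(statistics.mean(z), 10), round(statistics.stdev(z), 10))
```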

Mean

sum of measurements divided by number of measurements (arithmetic AVERAGE; summation notation) most common measure of center

Sample Variance

symbol: s²
divisor: n - 1
point of reference: sample mean

Divisor of Sample Variance: n-1

s² = "unbiased" estimator for σ²
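The two divisors (n - 1 for a sample, N for a population) can be compared directly. This is a minimal sketch; the function name `variance` and its `sample` flag are made up for illustration, and the standard library offers `statistics.variance` and `statistics.pvariance` for the same purpose.

```python
def variance(values, sample=True):
    """Return the variance, dividing by n - 1 (sample) or n (population)."""
    n = len(values)
    if n < 2 and sample:
        raise ValueError("sample variance needs at least 2 measurements")
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)   # squared deviations from the mean
    return sum_sq / (n - 1) if sample else sum_sq / n


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))                 # divisor n - 1 = 7
print(variance(data, sample=False))   # divisor N = 8
```

The sample version is always at least as large as the population version on the same data, since the same sum is divided by a smaller number.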

Central Tendency

tendency of data to CENTER about certain numerical values

Pareto Diagram

type of bar graph
horizontal axis: class
order of values on horizontal axis: descending height (frequency, relative frequency), tallest to shortest from left to right
bars' location determined by: descending count order
bars are centered over: class
vertical axis (bar height): class frequency, class relative frequency, or class percentage
space between the bars: yes

Empirical Rule

used only with data sets having a bell-shaped, symmetric, unimodal distribution (not uniform)
gives the APPROXIMATE % of data that will fall within k standard deviations of the mean (more precise than Chebyshev's Rule)
within 1 standard deviation: 68.3%
within 2 standard deviations: 95.4%
within 3 standard deviations: 99.7%
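These percentages match the areas under a normal curve, which is the usual model for bell-shaped data. The sketch below assumes that normal model and uses `statistics.NormalDist` from the standard library (Python 3.8+). The function name `normal_coverage` is illustrative.

```python
from statistics import NormalDist


def normal_coverage(k):
    """Return the proportion of a normal distribution within k standard deviations."""
    z = NormalDist()               # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(k, round(normal_coverage(k) * 100, 1))
```

Real data that are only roughly bell-shaped will show percentages near these values rather than exactly equal to them.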

Chebyshev's Rule

used with any data set
AT LEAST (1 - 1/k²) × 100% of the data is within k standard deviations of the mean
k > 1 (k = 1 gives no useful info); k can be a non-integer
less precise than the Empirical Rule
within 1 standard deviation: no useful info is provided
within 2 standard deviations: at least 75%
within 3 standard deviations: at least 89%
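The bound and an observed proportion can be put side by side. This is a minimal sketch; the function names `chebyshev_bound` and `fraction_within` and the data are made up for illustration, and the sample standard deviation (divisor n - 1) is used.

```python
def chebyshev_bound(k):
    """Return the minimum proportion of data within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's rule gives no useful information for k <= 1")
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Return the observed proportion of values within k sample standard deviations."""
    n = len(values)
    mean = sum(values) / n
    s = (sum((x - mean) ** 2 for x in values) / (n - 1)) ** 0.5
    return sum(1 for x in values if abs(x - mean) <= k * s) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]
for k in (1.5, 2, 3):
    print(k, round(chebyshev_bound(k), 3), fraction_within(data, k))
```

The observed proportion is usually well above the bound, which is why the rule is described as less precise.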

Bar Graph

vertical bars = classes of qualitative variable (vertical = better representation for comparing heights)
horizontal axis: class/categories
order of values on horizontal axis: arbitrary
bars' location determined by: arbitrary
bars are centered over: class/category
vertical axis (bar height): class frequency, class relative frequency, or class percentage
white space between bars: yes

Population Variance

σ² divisor: N

Numerical Descriptive Statistics

−Relative frequency table (intervals come from bars of histogram)
−Measures of center
−Measures of variability
−Measures of relative standing
−Z-scores
−Boxplot

