Chapter 2 - Summarizing Data

Third quartile

The 75th percentile (Q3), which is the median of the larger half of the data set.

Bar plot

A bar plot is a common way to display a single categorical variable.

Paired Data

Observations are paired when each observation in one set corresponds to exactly one observation in the other. In unpaired data, there is no such correspondence. Example of paired data: the number of lines and the number of characters in the same email.

Comparing distributions

When comparing distributions, compare them with respect to center, spread, and shape as well as any unusual observations. Such descriptions should be in context.

Histogram

A plot in which the counts per bin are drawn as bars.

Back-to-back stem-and-leaf plot

A stem-and-leaf plot with the stem in the middle and leaves on both the left and the right. The two sides show data for two different groups.

Split stem-and-leaf plot

A stem-and-leaf plot in which each row is split into two halves. It is used when there are too many numbers on one row or only a few stems.

Pie chart

A chart that shows the relationship of a part to a whole.

Categorical data

Data that consists of names, labels, or other nonnumerical values.

Robust estimates

The median and IQR are called robust estimates because extreme observations have little effect on their values. The mean and standard deviation are much more affected by changes in extreme observations.
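
As a minimal sketch of this idea, the snippet below uses two hypothetical data sets that differ only in their largest value and compares how the median, mean, and standard deviation respond. The data values are made up for illustration.

```python
import statistics

# Hypothetical data: the second list replaces the largest value
# with an extreme observation.
data = [2, 4, 5, 7, 9]
with_extreme = [2, 4, 5, 7, 90]

median_before = statistics.median(data)
median_after = statistics.median(with_extreme)
mean_before = statistics.mean(data)
mean_after = statistics.mean(with_extreme)
sd_before = statistics.stdev(data)
sd_after = statistics.stdev(with_extreme)

print("median:", median_before, "->", median_after)
print("mean:", mean_before, "->", mean_after)
print("standard deviation:", round(sd_before, 2), "->", round(sd_after, 2))
```

In this example the median does not move at all, while the mean and standard deviation change substantially.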

Parallel box plot

The parallel box plot is a traditional tool for comparing across groups.

Row proportions

The row proportions are computed as the counts divided by their row totals.
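
A short sketch of this calculation, assuming a hypothetical table of counts stored as a list of rows (the numbers are not from the source):

```python
# Hypothetical 2x3 contingency table of counts.
table = [
    [10, 30, 60],
    [20, 20, 40],
]

# Row totals: the total count across each row.
row_totals = [sum(row) for row in table]

# Row proportions: each count divided by its row total.
row_proportions = [
    [count / total for count in row]
    for row, total in zip(table, row_totals)
]

print(row_totals)
print(row_proportions)
```

Each row of proportions sums to 1, which is a quick check that the division used the right totals.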

Row total

The row totals provide the total counts across each row.

Hollow histogram

Used to compare numerical data across groups. Hollow histograms are the outlines of each group's histogram drawn on the same plot.

Dot plot

Uses dots to show the frequency, or number of occurrences, of the values in a data set. These graphs make it easy to observe important features of the data, such as the location of clusters and the presence of gaps.

Interquartile range (IQR)

The interquartile range measures the variability in the data as the range of the middle 50% of the data: IQR = Q3 − Q1.
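
The sketch below computes Q1, Q3, and the IQR using the "median of each half" definition of the quartiles. The helper name `quartiles`, the sample data, and the choice to leave the middle observation out of both halves when the count is odd are assumptions; other quartile conventions exist and can give slightly different values.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the smaller and larger halves.

    Assumption: with an odd number of observations, the middle
    observation is excluded from both halves. Needs at least 2 values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return statistics.median(lower), statistics.median(upper)

data = [1, 3, 4, 6, 7, 9, 12, 15]  # hypothetical data
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)
```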

Relative frequency

When we want to know what fraction or percent of the data meet a certain criterion, we use relative frequency instead of frequency. Relative frequency is another term for percent or proportion. It tells us how large a count is relative to the total.
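
As an illustration, relative frequencies can be obtained by dividing each count by the total number of observations. The observations below are hypothetical.

```python
from collections import Counter

# Hypothetical categorical observations.
observations = ["red", "blue", "red", "green", "red", "blue", "red", "green"]

counts = Counter(observations)   # frequency of each value
total = sum(counts.values())     # total number of observations
relative = {value: count / total for value, count in counts.items()}

for value, proportion in relative.items():
    print(f"{value}: {counts[value]} of {total} = {proportion:.0%}")
```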

Outlier

An observation that appears extreme relative to the rest of the data.

Column total

The column totals are the total counts down each column.

Column proportions

A column proportion is computed as the count divided by the corresponding column total.
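
A brief sketch of the column version, assuming a hypothetical table of counts stored as a list of rows. Transposing the rows with `zip(*table)` gives the columns.

```python
# Hypothetical 2x3 contingency table of counts.
table = [
    [10, 30, 60],
    [20, 20, 40],
]

# Column totals: the total count down each column.
column_totals = [sum(column) for column in zip(*table)]

# Column proportions: each count divided by its column total.
column_proportions = [
    [count / total for count, total in zip(row, column_totals)]
    for row in table
]

print(column_totals)
print(column_proportions)
```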

Five number summary

Consists of the minimum, the three quartiles (Q1, Q2, Q3), and the maximum of the data set being studied.

Scatterplot

Provides a case-by-case view of data that illustrates the relationship between two numerical variables. In any scatterplot, each point represents a single case.

Point estimate

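A single-value estimate of a population parameter. For example, the sample mean (x-bar) is a reasonable point estimate of the population mean μx.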

Population mean (μx)

The average of all observations in the population.

Sample mean (x-bar)

The sample mean, sometimes called the average, is a common way to measure the center of a distribution of data.

Sample standard deviation (s) + Equation

The standard deviation is a measure of spread. It is useful to think of the standard deviation as roughly the average distance that observations fall from the mean. The standard deviation is defined as the square root of the variance.
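
The card title mentions an equation that the definition does not show. The standard formula for the sample standard deviation, written here with the usual symbols (x_i for the observations, x-bar for the sample mean, n for the sample size), is:

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
```

The quantity under the square root is the sample variance, s^2.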

Box plot

Summarizes a data set using five summary statistics while also plotting unusual observations, called outliers.

Empirical rule

The empirical rule tells us that usually about 68% of the data will be within one standard deviation of the mean and about 95% will be within two standard deviations of the mean. These percentages are not strict rules.

Population variance (σ^2) and standard deviation (σ)

The population values have special symbols: σ^2 for the variance and σ for the standard deviation.

First quartile

The 25th percentile (Q1), which is the median of the smaller half of the data set.

Rules of thumb for identifying outliers

• More than 1.5 × IQR below Q1 or above Q3
• More than 2 standard deviations above or below the mean

Cumulative frequency histogram

This type of histogram shows cumulative, or total, frequency achieved by each bin, rather than the frequency in that particular bin.
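
A minimal sketch of how cumulative frequencies follow from per-bin frequencies, using a running total. The bin counts are hypothetical.

```python
from itertools import accumulate

# Hypothetical frequency in each bin, ordered from lowest bin to highest.
bin_counts = [3, 7, 12, 5, 1]

# Running total: frequency achieved up to and including each bin.
cumulative = list(accumulate(bin_counts))

print(cumulative)
```

The last cumulative value always equals the total number of observations.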

Mode

A mode is represented by a prominent peak in the distribution.

Segmented bar plot

A segmented bar plot is a graphical display of contingency table information.

Contingency table

A table that summarizes data for two categorical variables is called a contingency table. Each value in the table represents the number of times a particular combination of variable outcomes occurred. For example, in the email data set, the value 149 is the number of emails that are spam and had no number listed in the email.

Adding or subtracting values to the data

Adding or subtracting a constant a to every observation adds or subtracts a to the measures of center and location (mean, median, quartiles, percentiles). It does not change the shape of the distribution or the measures of spread (IQR, range, standard deviation).
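
This can be checked numerically. The sketch below shifts a hypothetical data set by a constant `a` and compares the summaries before and after.

```python
import statistics

data = [3, 5, 8, 10, 14]  # hypothetical data
a = 10                    # constant added to every observation
shifted = [x + a for x in data]

# Center moves by a.
print(statistics.mean(data), "->", statistics.mean(shifted))
print(statistics.median(data), "->", statistics.median(shifted))

# Spread is unchanged.
range_before = max(data) - min(data)
range_after = max(shifted) - min(shifted)
print(range_before, "->", range_after)
print(statistics.stdev(data), "->", statistics.stdev(shifted))
```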

Outlier EQUATION

Anything lower than: Q1 −1.5×IQR is an outlier. Anything higher than: Q3 +1.5×IQR is an outlier.
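
The sketch below applies these two cutoffs to a hypothetical data set. The function name `iqr_outliers` is made up for illustration, and the quartiles are computed as the medians of the smaller and larger halves, leaving out the middle observation when the count is odd (an assumption; other conventions exist).

```python
import statistics

def iqr_outliers(values):
    """Return the values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    lower_cutoff = q1 - 1.5 * iqr
    upper_cutoff = q3 + 1.5 * iqr
    return [x for x in ordered if x < lower_cutoff or x > upper_cutoff]

data = [1, 3, 4, 6, 7, 9, 12, 40]  # hypothetical data
print(iqr_outliers(data))
```

Here Q1 = 3.5 and Q3 = 10.5, so the IQR is 7 and the cutoffs are −7 and 21. Only 40 falls outside them.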

Center (of a distribution)

Described by either the mean (for symmetric distributions) or the median (for skewed distributions).

Frequency table

For larger samples, we prefer to think of each value as belonging to a bin, e.g. 0-5,000 or 5,001-10,000. Tables that show the counts per bin as numbers are called frequency tables.
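
A rough sketch of building such a table follows. The values are hypothetical, and the bins here are assumed to include their lower edge and exclude their upper edge (0 to under 5,000, 5,000 to under 10,000, and so on), which differs slightly from the 0-5,000; 5,001-10,000 labeling above.

```python
from collections import Counter

bin_width = 5000
values = [300, 1200, 4800, 5200, 7600, 9900, 10400]  # hypothetical data

# Key each value by the lower edge of its bin, then count per bin.
frequency = Counter((value // bin_width) * bin_width for value in values)

for lower in sorted(frequency):
    print(f"{lower} to under {lower + bin_width}: {frequency[lower]}")
```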

Shape (of a distribution)

Frequency and relative frequency histograms are especially convenient for describing the shape of the data distribution. Right skewed, left skewed, and symmetric are all words that can be used to describe shape.

Data density

Histograms provide a view of the data density. Higher bars represent where the data are relatively more common. The bars make it easy to see how the density of the data changes across values of the variable (in the source example, the number of characters).

Unimodal, bimodal, multimodal

Histograms that have one or two prominent peaks are called unimodal and bimodal, respectively. Any distribution with more than two prominent peaks is called multimodal.

Sample variance (s^2)

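A deviation is the distance of an observation from the mean. If we square the deviations and then take an average, the result is about equal to the sample variance, s^2. The sum of squared deviations is divided by n − 1 rather than n.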
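
The steps can be sketched as follows for a small hypothetical data set, dividing the sum of squared deviations by n − 1. The standard library's `statistics.variance` uses the same n − 1 divisor and is shown as a cross-check.

```python
import math
import statistics

data = [2, 4, 4, 5, 10]  # hypothetical data
n = len(data)
mean = sum(data) / n

# Squared deviations of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance divides by n - 1; the standard deviation is its square root.
sample_variance = sum(squared_deviations) / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(sample_variance, sample_sd)
print(statistics.variance(data))  # cross-check with the standard library
```

For this data the mean is 5, the squared deviations are 9, 1, 1, 0, and 25, and their sum of 36 divided by 4 gives a variance of 9 and a standard deviation of 3.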

Stem-and-leaf plot

In a stem-and-leaf plot, each number is broken into two parts. The first part is called the stem and consists of the beginning digit(s). The second part is called the leaf and consists of the final digit(s). When making a stem-and-leaf plot, remember to include a legend that describes what the stem and what the leaf represent.
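
A small sketch of the splitting step, assuming non-negative whole numbers where the stem is everything except the last digit and the leaf is the last digit. The function name `stem_and_leaf` and the data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each number's final digit (leaf) under its leading digits (stem)."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    return dict(rows)

data = [23, 12, 38, 21, 15, 30, 23]  # hypothetical data
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Legend: 1 | 2 means 12")
```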

Median

In an ordered data set, the median is the observation right in the middle. If there are an even number of observations, the median is the average of the two middle values.
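
The two cases can be sketched directly. The function below is an illustrative implementation, not a replacement for `statistics.median` in the standard library.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    if not values:
        raise ValueError("median requires at least one observation")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the observation right in the middle.
        return ordered[mid]
    # Even count: average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([5, 1, 3]))
print(median([4, 1, 3, 2]))
```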

Numerical data

Data that consists of numbers only.

Multiplying or dividing data by a value

Multiplying or dividing every observation by a constant b multiplies or divides the measures of center and location (mean, median, quartiles, percentiles) by b. It multiplies or divides the measures of spread (IQR, range, standard deviation) by |b|. It does not change the shape of the distribution.
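
A numerical check of this, using a hypothetical data set and a negative constant `b` to show that measures of spread scale by the absolute value of b:

```python
import statistics

data = [3, 5, 8, 10, 14]  # hypothetical data
b = -2                    # constant multiplying every observation
scaled = [x * b for x in data]

# Center is multiplied by b (sign included).
print(statistics.mean(data), "->", statistics.mean(scaled))
print(statistics.median(data), "->", statistics.median(scaled))

# Spread is multiplied by |b|, so it stays non-negative.
range_before = max(data) - min(data)
range_after = max(scaled) - min(scaled)
print(range_before, "->", range_after)
print(statistics.stdev(data), "->", statistics.stdev(scaled))
```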

Association (positive, negative, none)

Positive association means that larger values of the first variable are associated with larger values of the second variable (and smaller values of the first variable are associated with smaller values of the second). Negative association means that larger values of the first variable are associated with smaller values of the second, and smaller values of the first variable are associated with larger values of the second. No association means there is no apparent relationship between the two variables. Additionally, an association can follow a linear trend or a curved (nonlinear) trend.

Second quartile (Median)

Q2 represents the second quartile, which is equivalent to the 50th percentile (i.e. the median). 50% of the values lie below it and 50% lie above it.

Distribution

Refers to the values that a variable takes and the frequency of these values.

Relative Frequency Table

Shows the percents (relative frequencies) of observations in each category or class.

