Chapter 2 Terms, BA 201
histogram
A common graphical presentation of quantitative data
cumulative frequency distribution
A variation of the frequency distribution that provides another tabular summary of quantitative data, showing the number of observations less than or equal to the upper limit of each bin
legitimately missing data
Data sets commonly include observations with missing values for one or more variables. In some cases missing data naturally occur; these are called
Missing at random (MAR)
If the tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data, the missing value is called
Missing completely at random (MCAR)
If the tendency for an observation to be missing the value for some variable is entirely random, then whether data are missing does not depend on either the value of the missing data or the value of any other variable in the data. In such cases the missing value is called
illegitimately missing data
When missing data do not occur naturally but arise for other reasons, they are called
Quartiles
It is often desirable to divide data into four parts, with each part containing approximately one-fourth, or 25 percent, of the observations. These division points are referred to as the
Outliers
Sometimes a data set will have one or more observations with unusually large or unusually small values. These extreme values are called
Interquartile Range (IQR)
The difference between the third and first quartiles is often referred to as
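A minimal sketch of the quartile and IQR definitions above, using Python's standard `statistics` module. The data values and the helper name `quartiles_and_iqr` are illustrative assumptions; textbooks and software use slightly different quartile conventions, so results can differ a little from hand calculations.

```python
import statistics

def quartiles_and_iqr(values):
    # quantiles(n=4) returns the three division points Q1, Q2 (median), Q3.
    # method="inclusive" is one of several conventions (an assumption here).
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # IQR is the difference between the third and first quartiles.
    return q1, q2, q3, q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # illustrative values
q1, q2, q3, iqr = quartiles_and_iqr(data)
print(q1, q2, q3, iqr)
```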
random variable, uncertain variable
a quantity whose values are not known with certainty
Observation
a set of values corresponding to a set of variables
population
all elements of interest
z-score
allows us to measure the relative location of a value in the data set
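As a rough illustration, a z-score is usually computed as (value minus mean) divided by the standard deviation. The sketch below assumes the sample standard deviation (n - 1 divisor); the data and the function name `z_scores` are made up for the example.

```python
import statistics

def z_scores(values):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    # Each z-score is the number of standard deviations a value lies from the mean.
    return [(x - mean) / sd for x in values]

data = [46, 54, 42, 46, 32]  # illustrative values
z = z_scores(data)
print(z)
```

A negative z-score marks a value below the mean; a positive one marks a value above it.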
median
another measure of central location; the value in the middle when the data are arranged in ascending order (smallest to largest value). With an even number of observations, it is the average of the two middle values.
cross-sectional data
are collected from several entities at the same, or approximately the same, point in time.
Data
are the facts and figures collected, analyzed, and summarized for presentation and interpretation.
range
can be found by subtracting the smallest value from the largest value in a data set.
Empirical Rule
can be used to determine the percentage of data values that are within a specified number of standard deviations of the mean.
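For bell-shaped (approximately normal) data, the rule gives roughly 68%, 95%, and 99.7% of values within 1, 2, and 3 standard deviations of the mean. A small sketch that recovers those percentages from the standard normal distribution; the helper name `share_within` is an assumption for illustration.

```python
from statistics import NormalDist

def share_within(k):
    # Proportion of a normal distribution lying within k standard
    # deviations of the mean: P(-k <= Z <= k).
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The rule is only an approximation when the data are not bell-shaped.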
time series data
data collected over several time periods.
sample
data from a subset of the population
categorical data
data on which arithmetic operations cannot be performed.
quantitative data
data that are numeric and on which arithmetic operations, such as addition, subtraction, multiplication, and division, can be performed.
Missing Not at Random (MNAR)
if the tendency for the value of a variable to be missing is related to the value that is missing.
coefficient of variation
indicates how large the standard deviation is relative to the mean.
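A short sketch of the usual calculation, standard deviation divided by the mean, expressed as a percentage. The sample standard deviation and the data values are assumptions for illustration.

```python
import statistics

def coefficient_of_variation(values):
    # Standard deviation relative to the mean, as a percentage.
    return statistics.stdev(values) / statistics.mean(values) * 100

data = [46, 54, 42, 46, 32]  # illustrative values
cv = coefficient_of_variation(data)
print(round(cv, 2))
```

Because it is a ratio, the coefficient of variation lets you compare variability across variables measured in different units or with very different means.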
Covariance
is a descriptive measure of the linear association between two variables.
box plot
is a graphical summary of the distribution of data.
geometric mean
is a measure of location that is calculated by finding the nth root of the product of n values.
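The definition above translates almost directly into code. This sketch computes the nth root of the product by hand and compares it with `statistics.geometric_mean` from the standard library; the two growth factors are made-up values.

```python
import math
import statistics

def geometric_mean_manual(values):
    # nth root of the product of the n values.
    n = len(values)
    return math.prod(values) ** (1 / n)

growth_factors = [2, 8]  # illustrative values
gm_manual = geometric_mean_manual(growth_factors)
gm_library = statistics.geometric_mean(growth_factors)
print(gm_manual, gm_library)
```

The geometric mean is commonly used for growth rates, where values multiply rather than add.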
variance
is a measure of variability that utilizes all the data.
frequency distribution
is a summary of data that shows the number (frequency) of observations in each of several nonoverlapping classes, typically referred to as bins.
relative frequency distribution
is a tabular summary of data showing the relative frequency for each bin.
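A minimal sketch of building a frequency distribution and the matching relative frequencies. The data, the bin start (10), the bin width (5), and the function name are all assumptions chosen for illustration; bins here include their lower limit and exclude their upper limit.

```python
def frequency_distribution(values, low, width, n_bins):
    # Count how many observations fall into each nonoverlapping bin.
    counts = [0] * n_bins
    for x in values:
        index = min(int((x - low) // width), n_bins - 1)
        counts[index] += 1
    # Relative frequency = bin frequency / total number of observations.
    total = len(values)
    relative = [c / total for c in counts]
    return counts, relative

data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]  # illustrative values
counts, relative = frequency_distribution(data, low=10, width=5, n_bins=5)
print(counts)
print(relative)
```

Multiplying each relative frequency by 100 gives the percent frequency for that bin.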
scatter chart
is a useful graph for analyzing the relationship between two variables.
standard deviation
is defined to be the positive square root of the variance.
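To make the link between variance and standard deviation concrete, here is a sketch that computes the sample variance by hand and then takes its positive square root. The n - 1 divisor (sample rather than population variance) and the data values are assumptions.

```python
import math
import statistics

def sample_variance(values):
    # Average squared deviation from the mean, using every observation
    # and an n - 1 divisor (sample variance).
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

data = [46, 54, 42, 46, 32]  # illustrative values
var = sample_variance(data)
sd = math.sqrt(var)  # standard deviation: positive square root of the variance
print(var, sd)
print(statistics.variance(data), statistics.stdev(data))  # library cross-check
```

The standard deviation is in the same units as the data, which makes it easier to interpret than the variance.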
Variation
is the difference in a variable measured over observations (time, customers, items, etc.).
dimension reduction
is the process of removing variables from the analysis without losing crucial information
Percentile
is the value of a variable at which a specified (approximate) percentage of observations are below that value.
mode
is the value that occurs most frequently in a data set.
approximate bin width
(largest data value − smallest data value) / number of bins
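The subtraction is done first, then the division. A tiny sketch with made-up data and an assumed choice of five bins:

```python
def approximate_bin_width(values, n_bins):
    # (largest value - smallest value) / number of bins
    return (max(values) - min(values)) / n_bins

data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]  # illustrative values
width = approximate_bin_width(data, n_bins=5)
print(width)
```

In practice the result is usually rounded to a convenient value (for example, 4.2 rounded up to 5).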
correlation coefficient
measures the linear relationship between two variables and, unlike covariance, is not affected by the units of measurement for x and y.
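The sketch below computes the sample covariance and the correlation coefficient by hand, then rescales x to show that covariance changes with the units while correlation does not. The data and function names are illustrative assumptions, and the n - 1 divisor assumes sample data.

```python
import math

def sample_covariance(x, y):
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    # Average product of paired deviations, with an n - 1 divisor.
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    # Covariance divided by the product of the two standard deviations.
    s_x = math.sqrt(sample_covariance(x, x))
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]            # illustrative values
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]  # illustrative values
x_rescaled = [100 * a for a in x]             # same data in different units

print(sample_covariance(x, y), correlation(x, y))
print(sample_covariance(x_rescaled, y), correlation(x_rescaled, y))
```

The covariance grows by a factor of 100 after rescaling, but the correlation coefficient stays the same and always lies between -1 and +1.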
Mean
the most commonly used measure of location; also called the arithmetic mean or the average. It is a measure of central location.
percent frequency distribution
summarizes the percent frequency of the data for each bin.
imputation
systematic replacement of missing values with values that seem reasonable
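One simple way to do this, shown here only as a sketch, is to replace each missing value with the median of the observed values. Using the median is an assumption made for this example (the mean or mode are other common choices), and `None` is assumed to mark a missing value.

```python
import statistics

def impute_with_median(values):
    # Use only the observed (non-missing) values to choose the fill value.
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    # Replace each missing entry; leave observed entries unchanged.
    return [fill if v is None else v for v in values]

data = [10, None, 14, 12, None]  # illustrative values; None marks missing
completed = impute_with_median(data)
print(completed)
```

Whether a simple fill like this is reasonable depends on why the data are missing (MCAR, MAR, or MNAR).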
variable
A characteristic or a quantity of interest that can take on different values