Data analytics: Chapter 4: Descriptive Statistics
When the correlation coefficient approaches the value ____, it indicates that there is a weak relationship between the two variables.
0
Which of the following is NOT a characteristic of the midrange?
it is robust to outliers
correlation
the covariance divided by the product of the standard deviations
The ________ mean is the appropriate measure to use when evaluating growth rates.
geometric
The _______ mean is the multiplicative average of a data set.
geometric
Which characteristic does not describe the range?
it considers all data values in the data range
The second quartile is also the _____
1. 50th percentile 2. median
The mode(s) for the data set: 4, 4, 5, 6, 9, 9 is ________
4 and 9
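As a quick check of the answer above, here is a minimal Python sketch using the standard library's `statistics.multimode` (Python 3.8+). The variable names are illustrative only.

```python
import statistics

data = [4, 4, 5, 6, 9, 9]

# multimode returns every value tied for the highest frequency,
# in the order each value first appears in the data
modes = statistics.multimode(data)
print(modes)  # [4, 9]
```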
If a data set has a standard deviation of 4 units and a mean of 10 units, the coefficient of variation is ___________.
4/10 = 40%
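The same arithmetic as a small Python sketch. The helper name `coefficient_of_variation` is an assumption made for illustration, not a library function.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation expressed as a percent of the mean."""
    return 100 * std_dev / mean

cv = coefficient_of_variation(4, 10)
print(f"{cv:.0f}%")  # 40%
```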
If Fund A has a coefficient of variation of 1.1, and Fund B has a coefficient of variation of 0.9, Fund _____ has the greater relative dispersion.
A
If the revenue over a four year period was $2000, $2000, $3000, and $5000, what is the geometric mean revenue? Round to a whole number.
G = fourth root of (2000)(2000)(3000)(5000) = $2783
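A minimal sketch of this calculation in Python: the geometric mean of n values is the n-th root of their product, so four revenue figures call for a fourth root. Variable names are illustrative.

```python
import math

revenues = [2000, 2000, 3000, 5000]

# n-th root of the product of the n values (n = 4 here)
geometric_mean = math.prod(revenues) ** (1 / len(revenues))
print(round(geometric_mean))  # 2783
```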
Match the following terms with their meaning mesokurtic platykurtic leptokurtic
Mesokurtic: normal bell-shaped distribution. Platykurtic: a flatter distribution than normal, with thinner tails. Leptokurtic: a more sharply peaked distribution than normal, with heavier tails.
True or false: the arithmetic mean is the average of the data set.
true
The interquartile range of a data set _________
1. represents the middle 50% of the data 2. is calculated by subtracting the first quartile from the third quartile
leptokurtic
a population that is more sharply peaked than a normal population
Characteristics of the standard deviation
- nonnegative because the deviations around the mean are squared - can have any nonnegative value, depending on the unit of measurement - can be compared only for data sets measured in the same units - should not be compared if the means differ substantially, even when units of measurement are the same
A company sold 1000 units in its first year of operation, 1400 units in its second year of operation, and 1680 units in its third year of operation. The average growth rate of the company's sales for years one to three is _____%. (Round your final answer to a decimal with four places and then convert to % with 2 decimals.)
square root of (1680/1000), minus 1 = 0.2961 = 29.61%
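The average growth rate over n periods can be sketched as (last/first)^(1/(n−1)) − 1. With three years there are two growth steps, hence the square root. The function name below is an assumption for illustration.

```python
def average_growth_rate(first, last, n_periods):
    """Geometric average growth rate across n_periods observations."""
    # n_periods observations contain n_periods - 1 growth steps
    return (last / first) ** (1 / (n_periods - 1)) - 1

rate = average_growth_rate(1000, 1680, 3)
print(f"{rate:.4f} = {rate:.2%}")  # 0.2961 = 29.61%
```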
Standard deviations can be compared ______
1. for data sets with the same measurement units 2. for data sets with the same measurement units and similar magnitude
Place the steps for using the method of medians in finding quartiles in the proper order.
1. sort the observations 2. find the median for the entire data set, Q2 3. find the medians of the data values above and below Q2.
Accuracy of grouped estimates depends on ______.
1. the bin frequencies 2. the distribution of data within the bins 3. the number of bins
mode
The value that occurs most frequently in a given data set. - a data set may have multiple modes or no mode at all - the only useful measure of central tendency for categorical data - Excel function: =MODE.SNGL(data)
When calculating a percentile, the first step is to arrange the data set in ________.
ascending order (from least to greatest)
Why is the mean the balancing point?
because it has the property that the signed deviations of the data points from the mean always sum to zero
outliers
data values outside μ ± 3σ are rare (less than 1%) in a normal distribution
The measure of central location that can best be labeled as the midpoint of the data set is the ________.
median
The summary measures for grouped data are _______.
only approximate values
symmetric data
the mean and median are about the same - tails of histogram are balanced
skewed right / positively skewed
the mean exceeds the median - long tail of histogram points right
skewed left / negatively skewed
the mean is below the median - long tail of histogram points left
The sum of deviation from mean is always _____.
zero
Which of the following correlation coefficients indicate the strongest inverse relationship between two variables?
-0.87
A box plot is constructed using several different values. Which of the following values from a data set are included in a box plot?
1. the first quartile 2. the largest value 3. the second quartile 4. the smallest value 5. the third quartile
Generally, the _______ is the best measure of center when outliers are present.
median
mean of absolute deviation
reveals the average distance from the center - absolute values must be used; otherwise the deviations around the mean would sum to zero - Excel function: =AVEDEV(data)
When comparing two data sets with different units of measurement, what is the relative measure of dispersion?
the coefficient of variation
skewness coefficient
this unit-free statistic can be used to compare two samples measured in different units or to compare one sample with a known reference distribution such as the symmetric normal distribution - Excel function: =SKEW(data)
Which of the items below describes the usefulness of a standard deviation?
to gauge the relative position of data values within the data set
True or false: the trimmed mean can mitigate the effect of outliers.
true
Nadia purchased 400 shares of XYZ stock at $20 per share. When the stock decreased in value to $16 a share, Nadia purchased 600 more shares of XYZ stock. The weighted average price per share that Nadia paid for XYZ stock is $_____________.
$17.60
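A short sketch of the weighted average above: each price is weighted by the number of shares bought at that price, which is the same as total cost divided by total shares. Variable names are illustrative.

```python
prices = [20, 16]    # price per share for each purchase
shares = [400, 600]  # shares bought at each price (the weights)

total_cost = sum(p * s for p, s in zip(prices, shares))
weighted_price = total_cost / sum(shares)
print(f"${weighted_price:.2f}")  # $17.60
```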
Pat's time in the 1600 meter run places Pat in the 85th percentile in the school. What percentage of students are faster than Pat?
15
The population standard deviation of the data set 3, 4, 5, 6, and 7 is ___. (Round your final answer to 1 decimal place).
1.4
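To check this by machine, Python's `statistics.pstdev` divides by N (population) while `statistics.stdev` divides by n − 1 (sample), mirroring Excel's STDEV.P and STDEV.S. A minimal sketch:

```python
import statistics

data = [3, 4, 5, 6, 7]

sigma = statistics.pstdev(data)  # population: divide by N
s = statistics.stdev(data)       # sample: divide by n - 1
print(round(sigma, 1))  # 1.4
```

The sample version is larger (about 1.58 here) because it divides the same sum of squares by a smaller number.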
Inner fences on a boxplot are ______x IQR above Q3 and below Q1. Outer fences are ________x IQR above Q3 and below Q1.
1.5; 3
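The fence rule can be sketched as below. The helper `fences` and the quartile values Q1 = 10, Q3 = 20 are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def fences(q1, q3):
    """Return (inner, outer) fence pairs as (lower, upper) tuples."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

inner, outer = fences(10, 20)  # hypothetical quartiles, IQR = 10
print(inner)  # (-5.0, 35.0)
print(outer)  # (-20, 50)
```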
The midhinge for data with Q1 = 10 and Q3 = 45 is _______.
(10 + 45)/2 = 27.5
If a company sold 1000 units in its first year of operation and 1400 units in its second year of operation, then the growth rate of the company's sales is _____%.
(1400 − 1000)/1000 × 100 = 40
The maximum value of a data set is 200 and the minimum is 80. The midrange is equal to _____.
(80 + 200)/2 = 140
The empirical rule states that approximately ______% of observations will fall within two standard deviations of the mean.
95.44
weighted mean
a sum that assigns each data value a weight wj that represents a fraction of the total
The owner of BevaMart wants to study the relationship between the temperature and hot chocolate sales. The owner computed the covariance between temperature and hot chocolate sales to be -81.46. Based on the covariance, which option best describes the linear relationship between temperature and hot chocolate sales?
as the temperature increases, hot chocolate sales decrease
midhinge
average of the first and third quartiles - always exactly halfway between Q1 and Q3, while the median Q2 can be anywhere within the "box," which suggests a new way to describe skewness
Which of the following characteristics can be seen on a box plot?
1. variability 2. shape 3. center
method of medians
1. sort the observations 2. find the median Q2 3. find the median of the data values that lie below Q2 4. find the median of the data values that lie above Q2 - one method of finding the quartiles
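The steps above can be sketched in Python as follows. This is one reading of the method, assuming that with an odd number of observations the middle value is excluded from both halves; the function name and sample data are illustrative, and results may differ from Excel's QUARTILE.EXC, which interpolates.

```python
import statistics

def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    data = sorted(values)           # step 1: sort the observations
    n = len(data)
    half = n // 2
    q2 = statistics.median(data)    # step 2: median of the whole set
    lower = data[:half]             # values positioned below Q2
    upper = data[n - half:]         # values positioned above Q2
    return statistics.median(lower), q2, statistics.median(upper)

print(quartiles_by_medians([8, 1, 5, 3, 7, 2, 6, 4]))  # (2.5, 4.5, 6.5)
print(quartiles_by_medians([7, 1, 5, 3, 2, 6, 4]))     # (2, 4, 6)
```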
Which of the following can be used to determine the proportion of data points that fall within a specified number of standard deviations from the mean?
1. the empirical rule - assuming a normal distribution 2. Chebyshev's Theorem
Which of the following statements is true?
1. two data sets could have the same mean but different standard deviations 2. two data sets could have different means but the same standard deviation
population variance
the sum of squared deviations from the mean divided by the population size - a variance is basically a mean squared deviation - Excel function for a population: =VAR.P(data) - Excel function for a sample: =VAR.S(data)
mean
the sum of the data values divided by the number of data items - the most familiar statistical measure of center - affected by every sample item - is the balancing point or fulcrum in a distribution if we view the x-axis as a lever arm and represent each data item as a physical weight excel's mean function =AVERAGE(data)
coefficient of variation
unit-free measure of dispersion - the standard deviation expressed as a percent of the mean, used to compare dispersion in data sets with dissimilar units of measurement or dissimilar means
Multiplying data values by a fraction (where the fractions add to 1) and summing results in a ______ mean.
weighted
True or False: Chebyshev's theorem should only be applied to data sets that are normally distributed.
false
True or false: Summaries of grouped observations are just as accurate as summaries of a data set of individual observations.
false
Chebyshev's Theorem
for any data set, no matter how it is distributed, the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 − 1/k²], for k > 1. For any population with mean μ and standard deviation σ: k = 2: at least 75.0% will lie within μ ± 2σ. k = 3: at least 88.9% will lie within μ ± 3σ. k = 4: at least 93.8% will lie within μ ± 4σ.
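Chebyshev's lower bound is 100[1 − 1/k²] percent for k > 1, which reproduces the three percentages listed above. A minimal sketch, with an illustrative function name:

```python
def chebyshev_lower_bound(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)

for k in (2, 3, 4):
    print(k, chebyshev_lower_bound(k))
```

The values for k = 3 and k = 4 are 88.88...% and 93.75%, which the card rounds to 88.9% and 93.8%.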
Suppose a data set has 80 data points. A 5% trimmed mean would be calculated by removing the _____ highest values and the ______ lowest values.
four; four
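A rough sketch of a trimmed mean, assuming the trim fraction applies to each end separately (so 5% of 80 removes four values from the top and four from the bottom). The function name and the data 1 to 80 are illustrative; other tools may define the trim percentage differently.

```python
def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from EACH end."""
    data = sorted(values)
    k = int(len(data) * trim_fraction)  # count removed per end, rounded down
    trimmed = data[k:len(data) - k]
    return sum(trimmed) / len(trimmed)

data = list(range(1, 81))  # 80 hypothetical data points: 1, 2, ..., 80
result = trimmed_mean(data, 0.05)  # drops 1-4 and 77-80
print(result)  # 40.5
```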
Generally, skewness can be assessed by comparing _____.
the mean and median
In which of the following data sets would the arithmetic mean not be a good measure of central location?
0, 8, 8, 9, 10 (the value 0 is an outlier for the data set)
Match the characteristic to what it describes. center variability shape
center: typical or middle value; where the data values are concentrated. variability: spread of data values, or dispersion. shape: symmetrical or skewed
standardized data
a general approach to identifying unusual observations is to redefine each observation in terms of its distance from the mean in standard deviations to obtain standardized data. Excel function: =STANDARDIZE(Xvalue, Mean, StDev)
An owner of a grocery store wanted to determine the brands of soda that customers purchase at the store. When summarizing the data about soda brand purchase the meaningful measure of center is the ________.
mode
The ________ is the measure of center that identifies the most frequently occurring value in the data set.
mode
When estimating sigma using the formula (Xmax − Xmin)/6, one is assuming the distribution is ____________.
normal
shape
may be judged by looking at the histogram or by comparing the mean and median
When do bimodal and/or multimodal distributions occur?
when dissimilar populations are combined into one sample
The median for the data set 6, 4, 9, 5 is ______.
5.5
In general, a data point is considered an outlier if it falls more than _________ standard deviations away from the average.
3
If the median price for a home is $200,000, then _____% of homes cost less than $200,000.
50
The median of the data set: 10, 6, 4, 9, 5 is ________
6 (place the values in numerical order, then find the middle value)
The following data set: 4, 6, 3, 5, has a range equal to _______.
6-3=3
Calculate the standardized score for the following data value. Assume the mean = 100 and the standard deviation = 25: x=60, z= ________.
(60 − 100)/25 = −1.6
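The standardized score is z = (x − mean)/(standard deviation); the subtraction must happen before the division. A minimal sketch, with an illustrative function name that plays the same role as Excel's STANDARDIZE:

```python
def z_score(x, mean, std_dev):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / std_dev

z = z_score(60, 100, 25)
print(z)  # -1.6
```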
For a given distribution, the range is 60. Assuming the distribution is bell-shaped, the estimated standard deviation = _________.
60/6=10
Estimating Sigma
For a normal distribution, essentially all the observations lie within μ ± 3σ, so the range is approximately 6σ (from μ − 3σ to μ + 3σ). Therefore, if you know the range xmax − xmin, you can estimate the standard deviation as σ = (xmax − xmin)/6.
mesokurtic
a normal bell-shaped population - serves as a benchmark
platykurtic
a population that is flatter than a normal population
box plot / box and whisker plot
a useful tool of exploratory data analysis based on the 5-number summary - shows center, variability, shape
kurtosis
refers to the relative length of the tails and the degree of concentration in the center - not the same as variability - the coefficient is obtained from the Excel function: =KURT(data)
The numerical measure σXY is used frequently by financial portfolio managers. This measure is called the ________.
covariance
sample correlation coefficient
describes the degree of linearity between paired observations on two quantitative variables X and Y Excel's function: =CORREL(Xdata,Ydata)
Characteristics of numerical data
1. Center: Where are the data values concentrated? What seem to be typical or middle data values? Is there central tendency? 2. Variability: How much dispersion is there in the data? How spread out are the data values? Are there unusual values? 3. Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Place the steps in order, from beginning to end, to calculate a mean for grouped data.
1. find the midpoint for each class of grouped data 2. multiply the midpoint of each class by the number of observations in its class 3. sum the products of the midpoints and observations 4. divide by the total number of observations
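The four steps above can be sketched as follows. The bins are hypothetical (lower limit, upper limit, frequency) triples invented for illustration; because midpoints stand in for the actual values, the result is only an approximation of the true mean.

```python
# Hypothetical grouped data: (lower limit, upper limit, frequency)
bins = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]

def grouped_mean(bins):
    """Approximate the mean using each bin's midpoint."""
    total_n = sum(freq for _, _, freq in bins)
    # steps 1-3: midpoint of each bin times its frequency, summed
    weighted_sum = sum(((lo + hi) / 2) * freq for lo, hi, freq in bins)
    # step 4: divide by the total number of observations
    return weighted_sum / total_n

print(grouped_mean(bins))  # 15.0
```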
6 common measures of center
1. mean 2. median 3. mode 4. midrange 5. geometric mean 6. trimmed mean
5 measures of variability for a sample
1. range 2. sample variance 3. sample standard deviation 4. coefficient of variation 5. mean of absolute deviation
Place in order, from beginning to end, the steps to calculate the mean absolute deviation.
1. calculate the arithmetic mean for the data set 2. find the absolute difference between each data set value and the mean 3. sum the absolute differences 4. divide by the sample (or the population) size
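These steps map directly onto a few lines of Python. The function name and data set are illustrative; Excel's AVEDEV performs the same calculation.

```python
import statistics

def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    center = statistics.mean(values)                  # step 1
    abs_devs = [abs(x - center) for x in values]      # step 2
    return sum(abs_devs) / len(values)                # steps 3 and 4

mad = mean_absolute_deviation([3, 4, 5, 6, 7])
print(mad)  # 1.2
```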
Which shape matches the mean and the median relationship? Left skewed Right skewed Symmetrical
Left skewed = mean<median Symmetrical = mean = median Right skewed = mean>median
To calculate the arithmetic mean ______.
all of the data points must be added together, then divided by the number of data points
The correlation coefficient values _______.
fall between -1 and +1, inclusive
The average of the absolute differences between the values of the data set and the mean is the __________
mean absolute deviation
standard deviation
the square root of the variance - a single number that helps us understand how individual values in a data set vary from the mean - Excel function for a sample: =STDEV.S(data) - Excel function for a population: =STDEV.P(data)
Which of the following are measures of center of a data set?
mean, median, mode
covariance
measures the degree to which the values of X and Y change together - particularly important in financial portfolio analysis. Example: if the prices of two stocks X and Y tend to move in the same direction, their covariance is positive (σXY > 0); conversely, if their prices tend to move in opposite directions, it is negative (σXY < 0). If the prices of X and Y are unrelated, their covariance is zero (σXY = 0). Excel function for a population: =COVARIANCE.P(Xdata,Ydata). Excel function for a sample: =COVARIANCE.S(Xdata,Ydata)
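A small sketch of sample covariance and the correlation derived from it (covariance divided by the product of the standard deviations). The temperature and sales figures are made up for illustration and are not the BevaMart data.

```python
import math

# Hypothetical paired observations
temps = [30, 40, 50, 60, 70]
sales = [50, 40, 35, 20, 10]

n = len(temps)
mean_x = sum(temps) / n
mean_y = sum(sales) / n

# Sum of cross-products of deviations, divided by n - 1 (sample covariance)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, sales)) / (n - 1)

# Sample standard deviations
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in temps) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in sales) / (n - 1))

r = cov_xy / (s_x * s_y)
print(cov_xy)       # -250.0
print(round(r, 2))  # -0.99
```

The covariance is negative, so sales fall as temperature rises. Its size depends on the units, while r is unit-free and always lies between -1 and +1.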
Empirical Rule
says that for data from a normal distribution, we expect the interval μ ± kσ to contain a known percentage of the data. k = 1: 68.26% will lie within μ ± 1σ. k = 2: 95.44% will lie within μ ± 2σ. k = 3: 99.73% will lie within μ ± 3σ.
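These percentages can be reproduced from the standard normal distribution with Python's `statistics.NormalDist` (Python 3.8+). A minimal sketch; the helper name `share_within` is an assumption for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, f"{share_within(k):.2%}")
```

Computed to full precision, the first two come out as 68.27% and 95.45%. The 68.26% and 95.44% on the card come from doubling four-decimal table values (0.3413 and 0.4772).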
quartiles
scale points that divide the sorted data into four groups of approximately equal size, that is, the 25th, 50th, and 75th percentiles, respectively. - the 2nd quartile is the median - the 1st and 3rd quartiles indicate center because they define the boundaries for the middle 50% of the data. They also indicate variability because the interquartile range (Q3 − Q1) measures the degree of spread in the data - the 1st quartile is the median of the data values below Q2 - the 3rd quartile is the median of the data values above Q2 - generally resist outliers - Excel function: =QUARTILE.EXC(data, k), where k is the quartile you are searching for
The square root of the average squared deviation of data values from their mean is known as the ____.
standard deviation
median
the 50th percentile or midpoint of the sorted sample data set - separates the upper and lower halves of the sorted observations - especially useful when there are extreme values - Excel function: =MEDIAN(data)
range
the difference between the largest and smallest observations Range= Xmax-Xmin
When calculating a mean for grouped data, ______.
the midpoint of each bin is used to approximate the individual values in that bin