Data analytics: Chapter 4: Descriptive Statistics


When the correlation coefficient approaches the value ____, it indicates that there is a weak relationship between the two variables.

0

Which of the following is NOT a characteristic of the midrange?

it is robust to outliers

correlation

the covariance divided by the product of the standard deviations

The ________ mean is the appropriate measure to use when evaluating growth rates.

geometric

The _______ mean is the multiplicative average of a data set.

geometric

Which characteristic does not describe the range?

it considers all data values in the data range

The second quartile is also the _____

1. 50th percentile 2. median

The mode(s) for the data set: 4, 4, 5, 6, 9, 9 is ________

4 and 9

If a data set has a standard deviation of 4 units and a mean of 10 units, the coefficient of variation is ___________.

4/10 = 40%

If Fund A has a coefficient of variation of 1.1, and Fund B has a coefficient of variation of 0.9, Fund _____ has the greater relative dispersion.

A

If the revenue over a four year period was $2000, $2000, $3000, and $5000, what is the geometric mean revenue? Round to a whole number.

G = fourth root of (2000)(2000)(3000)(5000) = $2783
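
The calculation above can be checked with a short sketch. The helper name `geometric_mean` is an illustrative choice, not something from the course material; it takes the n-th root of the product of n values by averaging their logarithms.

```python
import math

def geometric_mean(values):
    """Return the n-th root of the product of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    # Averaging the logs is equivalent to taking the n-th root of the product
    # and avoids very large intermediate products.
    return math.exp(sum(math.log(v) for v in values) / len(values))

revenue = [2000, 2000, 3000, 5000]   # the four yearly revenues from the card
g = geometric_mean(revenue)
print(round(g))                      # 2783
```

With four values the root is a fourth root, which is why a plain square root of the product would not give $2783.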

Match the following terms with their meaning: mesokurtic, platykurtic, leptokurtic

Mesokurtic: normal bell-shaped distribution. Platykurtic: a flatter distribution than normal, with thinner tails. Leptokurtic: a more sharply peaked distribution than normal, with heavier tails.

True or false: the arithmetic mean is the average of the data set.

true

The interquartile range of a data set _________

1. represents the middle 50% of the data 2. is calculated by subtracting the first quartile from the third quartile

leptokurtic

a population that is more sharply peaked than a normal population

Characteristics of the standard deviation

- nonnegative because the deviations around the mean are squared - can have any nonnegative value, depending on the unit of measurement - can be compared only for data sets measured in the same units - should not be compared if the means differ substantially, even when units of measurement are the same

A company sold 1000 units in its first year of operation, 1400 units in its second year of operation, and 1680 units in the third year of operation. The average growth rate of the company's sales for years one to three is _____%. (Round your final answer to a decimal with four places, then convert to a percentage with two decimals.)

(1680/1000)^(1/2) - 1 = 0.2961, or 29.61%
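
As a rough check of this answer, the average growth rate over n periods is the n-th root of (last / first), minus 1. The function name below is hypothetical.

```python
def average_growth_rate(first, last, periods):
    """Geometric average growth rate across the given number of periods."""
    return (last / first) ** (1 / periods) - 1

# Three yearly figures span two growth periods (year 1 -> 2 and year 2 -> 3).
rate = average_growth_rate(1000, 1680, 2)
print(round(rate, 4))         # 0.2961
print(f"{rate:.2%}")          # 29.61%
```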

Standard deviations can be compared ______

1. for data sets with the same measurement units 2. for data sets with the same measurement units and similar magnitude

Place the steps for using the method of medians in finding quartiles in the proper order.

1. sort the observations 2. find the median for the entire data set, Q2 3. find the medians of the data values above and below Q2.

Accuracy of grouped estimates depends on ______.

1. the bin frequencies 2. the distribution of data within the bins 3. the number of bins

mode

The value that occurs most frequently in a given data set. - a data set may have multiple modes or no mode at all. - only useful measure of central tendency for categorical data. Excel function: =MODE.SNGL(Data)

When calculating a percentile, the first step is to arrange the data set in ________.

ascending order (from least to greatest)

why is the mean the balancing point?

because it has the property that the signed deviations of the data points from the mean always sum to zero

outliers

data values outside μ ± 3σ are rare (less than 1%) in a normal distribution

The measure of central location that can best be labeled as the midpoint of the data set is the ________.

median

The summary measures for grouped data are _______.

only approximate values

symmetric data

the mean and median are about the same - tails of histogram are balanced

skewed right / positively skewed

the mean exceeds the median - long tail of histogram points right

skewed left / negatively skewed

the mean is below the median - long tail of histogram points left

The sum of the deviations from the mean is always _____.

zero

Which of the following correlation coefficients indicate the strongest inverse relationship between two variables?

-0.87

A box plot is constructed using several different values. Which of the following values from a data set are included in a box plot?

1. the first quartile 2. the largest value 3. the second quartile 4. the smallest value 5. the third quartile

Generally, the _______ is the best measure of center when outliers are present.

median

mean of absolute deviation

reveals the average distance from the center - absolute values must be used; otherwise the deviations around the mean would sum to zero. Excel function: =AVEDEV(data)

When comparing two data sets with different units of measurement, what is the relative measure of dispersion?

the coefficient of variation

skewness coefficient

this unit-free statistic can be used to compare two samples measured in different units or to compare one sample with a known reference distribution such as the symmetric normal distribution. Excel function: =SKEW(data)

Which of the items below describes the usefulness of a standard deviation?

to gauge the relative position of data values within the data set

True or false: the trimmed mean can mitigate the effect of outliers.

true

Nadia purchased 400 shares of XYZ stock at $20 per share. When the stock decreased in value to $16 a share, Nadia purchased 600 more shares of XYZ stock. The weighted average price per share that Nadia paid for XYZ stock is $_____________.

$17.60
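
The weighted average above can be reproduced with a small sketch in which the share counts act as weights. The function name is an illustrative assumption.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the total weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

prices = [20, 16]     # price per share for each purchase
shares = [400, 600]   # number of shares bought at each price
average_price = weighted_mean(prices, shares)
print(f"${average_price:.2f}")   # $17.60
```

Dividing each share count by the total (0.4 and 0.6) gives weights that sum to 1, which matches the definition of a weighted mean used elsewhere in this set.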

Pat's time in the 1600 meter run places Pat in the 85th percentile in the school. What percentage of students are faster than Pat?

15

The population standard deviation of the data set 3, 4, 5, 6, and 7 is ___. (Round your final answer to 1 decimal place).

1.4
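
A quick check of this result using Python's standard `statistics` module, which offers `pstdev` (population) and `stdev` (sample). The sample value is shown only for contrast.

```python
import statistics

data = [3, 4, 5, 6, 7]

# Population standard deviation: divide the squared deviations by N.
population_sd = statistics.pstdev(data)
# Sample standard deviation: divide by n - 1 instead.
sample_sd = statistics.stdev(data)

print(round(population_sd, 1))   # 1.4
print(round(sample_sd, 1))       # 1.6
```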

Inner fences on a boxplot are ______x IQR above Q3 and below Q1. Outer fences are ________x IQR above Q3 and below Q1.

1.5; 3
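
A minimal sketch of the fence calculation. The quartile values Q1 = 10 and Q3 = 45 are borrowed from another card in this set purely for illustration, and the function name is hypothetical.

```python
def fences(q1, q3):
    """Return (inner, outer) fence pairs as (lower, upper) tuples."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

inner, outer = fences(10, 45)   # IQR = 35
print(inner)   # (-42.5, 97.5)
print(outer)   # (-95.0, 150.0)
```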

The midhinge for data with Q1 = 10 and Q3 = 45 is _______.

(10 + 45)/2 = 27.5

if a company sold 1000 units in its first year of operation, and 1400 units in its second year of operation, then the growth rate of the company's sales is _____.

(1400 - 1000)/1000 × 100 = 40%

The maximum value of a data set is 200 and the minimum is 80. The midrange is equal to _____.

(80 + 200)/2 = 140

The empirical rule states that approximately ______% of observations will fall within two standard deviations of the mean.

95.44

weighted mean

a sum that assigns each data value a weight wj that represents a fraction of the total

The owner of BevaMart wants to study the relationship between the temperature and hot chocolate sales. The owner computed the covariance between temperature and hot chocolate sales to be -81.46. Based on the covariance, which option best describes the linear relationship between temperature and hot chocolate sales?

as the temperature increases, hot chocolate sales decrease

midhinge

average of the first and third quartiles - always exactly halfway between Q1 and Q3, while the median Q2 can be anywhere within the "box," which suggests a new way to describe skewness

Which of the following characteristics can be seen on a box plot?

1. variability 2. shape 3. center

method of medians

1. sort the observations 2. find the median Q2 3. find the median of the data values that lie below Q2 4. find the median of the data values that lie above Q2 - one method of finding the quartiles
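
The steps above can be sketched as follows. This assumes that when the number of observations is odd, the middle value itself is left out of both halves; the function name is hypothetical, and other quartile conventions (such as Excel's QUARTILE.EXC) can give slightly different results.

```python
from statistics import median

def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    data = sorted(values)            # step 1: sort the observations
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    q2 = median(data)                # step 2: median of the whole data set
    lower = data[: n // 2]           # values positioned below Q2
    upper = data[(n + 1) // 2:]      # values positioned above Q2
    return median(lower), q2, median(upper)   # steps 3 and 4

q1, q2, q3 = quartiles_by_medians([10, 6, 4, 9, 5])
print(q1, q2, q3)   # 4.5 6 9.5
```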

Which of the following can be used to determine the proportion of data points that fall within a specified number of standard deviations from the mean?

1. the empirical rule - assuming a normal distribution 2. Chebyshev's Theorem

Which of the following statements is true?

1. two data sets could have the same mean but different standard deviations 2. two data sets could have different means but the same standard deviation

population variance

the sum of squared deviations from the mean divided by the population size - a variance is basically a mean squared deviation. Excel function for a population: =VAR.P(data). Excel function for a sample: =VAR.S(data)

mean

the sum of the data values divided by the number of data items - the most familiar statistical measure of center - affected by every sample item - is the balancing point or fulcrum in a distribution if we view the x-axis as a lever arm and represent each data item as a physical weight. Excel function: =AVERAGE(data)

coefficient of variation

unit-free measure of dispersion - the standard deviation expressed as a percent of the mean, used to compare dispersion in data sets with dissimilar units of measurement or dissimilar means

Multiplying data values by a fraction (where the fractions add to 1) and summing results in a ______ mean.

weighted

True or False: Chebyshev's theorem should only be applied to data sets that are normally distributed.

false

True or false: Summaries of grouped observations are just as accurate as summaries of a data set of individual observations.

false

Chebyshev's Theorem

for any data set, no matter how it is distributed, the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 - 1/k²]% (for k > 1). For any population with mean μ and standard deviation σ:
k = 2: at least 75.0% will lie within μ ± 2σ.
k = 3: at least 88.9% will lie within μ ± 3σ.
k = 4: at least 93.8% will lie within μ ± 4σ.
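
The percentages above follow from the bound 1 - 1/k². A small sketch, with a hypothetical function name, that reproduces them:

```python
def chebyshev_lower_bound(k):
    """Minimum share of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    # Prints the guaranteed minimum as a percentage with two decimals.
    print(k, f"{chebyshev_lower_bound(k):.2%}")
```

For k = 4 the exact bound is 93.75%, which the card rounds to 93.8%.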

Suppose a data set has 80 data points. A 5% trimmed mean would be calculated by removing the _____ highest values and the ______ lowest values.

four; four
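
A sketch of a trimmed mean under the convention used in this card, where the trim percentage is removed from each end of the sorted data. The data set of 1 through 80 is made up for illustration, and the function names are hypothetical.

```python
def trim_count(n, trim_fraction):
    """Number of observations removed from each end of the sorted data."""
    return int(n * trim_fraction)

def trimmed_mean(values, trim_fraction):
    """Mean of the data after dropping the lowest and highest trim_fraction."""
    data = sorted(values)
    k = trim_count(len(data), trim_fraction)
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

data = list(range(1, 81))        # 80 hypothetical data points
print(trim_count(80, 0.05))      # 4
print(trimmed_mean(data, 0.05))  # 40.5
```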

Generally, skewness can be assessed by comparing _____.

the mean and median

In which of the following data sets would the arithmetic mean not be a good measure of central location?

0, 8, 8, 9, 10 (the value 0 is an outlier for the data set)

Match the characteristic to what it describes: center, variability, shape.

Center: typical or middle value; where the data values are concentrated. Variability: spread of data values, or dispersion. Shape: symmetrical or skewed.

standardized data

a general approach to identifying unusual observations is to redefine each observation in terms of its distance from the mean in standard deviations to obtain standardized data. Excel function: =STANDARDIZE(Xvalue, Mean, StDev)

An owner of a grocery store wanted to determine the brands of soda that customers purchase at the store. When summarizing the data about soda brand purchase the meaningful measure of center is the ________.

mode

The ________ is the measure of center that identifies the most frequently occurring value in the data set.

mode

When estimating sigma using the formula (Xmax - Xmin)/6, one is assuming the distribution is ____________.

normal

shape

may be judged by looking at the histogram or by comparing the mean and median

When do bimodal and/or multimodal distributions occur?

when dissimilar populations are combined into one sample

The median for the data set 6, 4, 9, 5 is ______.

5.5

In general, a data point is considered an outlier if it falls more than _________ standard deviations away from the average.

3

If the median price for a home is $200,000, then _____% of homes cost less than $200,000.

50

The median of the data set: 10, 6, 4, 9, 5 is ________

6 (place the values in numerical order, then find the middle value)

The following data set: 4, 6, 3, 5, has a range equal to _______.

6-3=3

Calculate the standardized score for the following data value. Assume the mean = 100 and the standard deviation = 25: x=60, z= ________.

(60 - 100)/25 = -1.6
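
The standardized score is the distance from the mean measured in standard deviations. A minimal sketch with a hypothetical function name:

```python
def z_score(x, mean, std_dev):
    """Distance of x from the mean, in units of the standard deviation."""
    return (x - mean) / std_dev

z = z_score(60, 100, 25)
print(z)   # -1.6
```

The subtraction has to happen before the division, which is why the parentheses in the formula matter.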

For a given distribution, the range is 60. Assuming the distribution is bell-shaped, the estimated standard deviation = _________.

60/6=10

Estimating Sigma

For a normal distribution, essentially all the observations lie within μ ± 3σ, so the range is approximately 6σ (from μ − 3σ to μ + 3σ). Therefore, if you know the range xmax − xmin, you can estimate the standard deviation as σ = (xmax − xmin)/6.

mesokurtic

a normal bell-shaped population - serves as a benchmark

platykurtic

a population that is flatter than a normal population

box plot / box and whisker plot

a useful tool of exploratory data analysis based on the 5-number summary -shows center, variability, shape

kurtosis

refers to the relative length of the tails and the degree of concentration in the center - not the same as variability. The coefficient is obtained from the Excel function: =KURT(data)

The numerical measure σXY is used frequently by financial portfolio managers. This measure is called the ________.

covariance

sample correlation coefficient

describes the degree of linearity between paired observations on two quantitative variables X and Y Excel's function: =CORREL(Xdata,Ydata)

Characteristics of numerical data

1. Center: Where are the data values concentrated? What seem to be typical or middle data values? Is there central tendency? 2. Variability: How much dispersion is there in the data? How spread out are the data values? Are there unusual values? 3. Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

Place the steps in order, from beginning to end, to calculate a mean for grouped data.

1. find the midpoint for each class of grouped data 2. multiply the midpoint of each class by the number of observations in its class 3. sum the products of the midpoints and observations 4. divide by the total number of observations
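
The steps above can be sketched as follows. The bins and frequencies are made up for illustration, and the function name is hypothetical.

```python
def grouped_mean(bins):
    """Approximate mean from (lower limit, upper limit, frequency) triples."""
    total = 0.0
    count = 0
    for lower, upper, frequency in bins:
        midpoint = (lower + upper) / 2    # step 1: midpoint of the class
        total += midpoint * frequency     # steps 2 and 3: multiply and sum
        count += frequency
    return total / count                  # step 4: divide by the total count

# Hypothetical grouped data: (lower limit, upper limit, frequency)
bins = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
print(grouped_mean(bins))   # 15.0
```

Because each midpoint stands in for every value in its bin, the result is only an approximation of the mean of the underlying observations.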

6 common measures of center

1. mean 2. median 3. mode 4. midrange 5. geometric mean 6. trimmed mean

5 measures of variability for a sample

1. range 2. sample variance 3. sample standard deviation 4. coefficient of variation 5. mean of absolute deviation

Place in order, from beginning to end, the steps to calculate the mean absolute deviation.

1. calculate the arithmetic mean for the data set 2. find the absolute difference between each data set value and the mean 3. sum the absolute difference 4. divide by the sample (or the population) size
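
A short sketch of these four steps, using the data set 3, 4, 5, 6, 7 from an earlier card. The function name is an illustrative choice.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    mean = sum(values) / len(values)                 # step 1
    abs_deviations = [abs(v - mean) for v in values]  # step 2
    return sum(abs_deviations) / len(values)          # steps 3 and 4

mad = mean_absolute_deviation([3, 4, 5, 6, 7])
print(mad)   # 1.2
```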

Which shape matches the mean and the median relationship? Left skewed, right skewed, symmetrical.

Left skewed: mean < median. Symmetrical: mean = median. Right skewed: mean > median.

To calculate the arithmetic mean ______.

all of the data points must be added together, then divided by the number of data points

The correlation coefficient values _______.

fall between -1 and +1, inclusive

The average of the absolute differences between the values of the data set and the mean is the __________

mean absolute deviation

standard deviation

the square root of the variance - a single number that helps us understand how individual values in a data set vary from the mean. Excel function for a sample: =STDEV.S(data). Excel function for a population: =STDEV.P(data)

Which of the following are measures of center of a data set?

mean, median, mode

covariance

measures the degree to which the values of X and Y change together - particularly important in financial portfolio analysis. Example: if the prices of two stocks X and Y tend to move in the same direction, their covariance is positive (σXY > 0); conversely, if their prices tend to move in opposite directions, their covariance is negative (σXY < 0). If the prices of X and Y are unrelated, their covariance is zero (σXY = 0). Excel function for a population: =COVARIANCE.P(Xdata,Ydata). Excel function for a sample: =COVARIANCE.S(Xdata,Ydata)
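
A sketch of the sample covariance, and of the correlation coefficient as the covariance divided by the product of the standard deviations. The temperature and sales figures are invented for illustration, and the function names are hypothetical.

```python
import math

def sample_covariance(x, y):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

temperature = [30, 40, 50, 60, 70]   # hypothetical temperatures
sales = [50, 40, 30, 20, 10]         # hypothetical hot chocolate sales

print(sample_covariance(temperature, sales))              # -250.0
print(round(sample_correlation(temperature, sales), 4))   # -1.0
```

The covariance is negative because sales fall as temperature rises, and the correlation rescales that value to the range from -1 to +1.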

Empirical Rule

says that for data from a normal distribution, we expect the interval μ ± kσ to contain a known percentage of the data:
k = 1: 68.26% will lie within μ ± 1σ.
k = 2: 95.44% will lie within μ ± 2σ.
k = 3: 99.73% will lie within μ ± 3σ.
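
These percentages can be checked against the normal distribution directly, since the share of a normal population within k standard deviations of the mean equals erf(k/√2). The function name below is hypothetical.

```python
import math

def normal_share_within(k):
    """Share of a normal population lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # Prints the share as a percentage with two decimals.
    print(k, f"{normal_share_within(k):.2%}")
```

The exact values round to 68.27%, 95.45%, and 99.73%. Figures such as 68.26% and 95.44% come from doubling rounded table entries, so they differ in the last digit.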

quartiles

scale points that divide the sorted data into four groups of approximately equal size, that is, the 25th, 50th, and 75th percentiles, respectively. - the 2nd quartile is the median - the 1st and 3rd quartiles indicate center because they define the boundaries for the middle 50% of the data. They also indicate variability because the interquartile range (Q3 - Q1) measures the degree of spread in the data - the 1st quartile is the median of the data values below Q2 - the 3rd quartile is the median of the data values above Q2 - generally resist outliers. Excel function: =QUARTILE.EXC(data, k), where k is the quartile you are searching for

The square root of the average squared deviation of data values from their mean is known as the ____.

standard deviation

median

the 50th percentile or midpoint of the sorted sample data set - separates the upper and lower halves of the sorted observations - especially useful when there are extreme values. Excel function: =MEDIAN(data)

range

the difference between the largest and smallest observations Range= Xmax-Xmin

When calculating a mean for grouped data, ______.

the midpoint of each bin is used to approximate the individual values in that bin

