MA 180 (Chapter 3)

Mean from Frequency Distribution

Multiply each class midpoint by the frequency of its class, add the products, and divide by the sum of the frequencies: x-bar = Σ(f · x) / Σf, where f is the class frequency and x is the class midpoint.
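
As a minimal sketch of the formula on this card, the Python function below computes Σ(f · x) / Σf. The function name and the three example classes are hypothetical and chosen only for illustration.

```python
def grouped_mean(midpoints, frequencies):
    """Mean from a frequency distribution: sum(f * x) / sum(f)."""
    if len(midpoints) != len(frequencies):
        raise ValueError("midpoints and frequencies must have the same length")
    total = sum(frequencies)
    if total == 0:
        raise ValueError("frequencies must not sum to zero")
    # Multiply each midpoint by its class frequency, add the products,
    # then divide by the sum of the frequencies.
    return sum(f * x for f, x in zip(frequencies, midpoints)) / total


# Hypothetical classes 0-9, 10-19, 20-29 with midpoints 4.5, 14.5, 24.5.
print(grouped_mean([4.5, 14.5, 24.5], [2, 5, 3]))  # 15.5
```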

Calculating Mean Percent

(|actual mean - frequency mean| / actual mean) × 100%

How to deal with outliers in the data set:

Investigate your sampling methodology to determine whether the outlier is potentially invalid. Do your calculations with and without the outlier and compare the results. Document the presence of outliers and consider taking a second sample.

Variance

The variance is a measure of variation equal to the square of the standard deviation. Sample variance: s², the square of the sample standard deviation s. Population variance: σ², the square of the population standard deviation σ.

Standard Deviation - Important Properties

The standard deviation is a measure of variation of all values from the mean. The value of the standard deviation s is usually positive (it is never negative). The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values far away from all others). The units of the standard deviation s are the same as the units of the original data values.

Ordinary values

-2 ≤ z score ≤ 2 (the z score is greater than or equal to -2 and less than or equal to 2).

Calculating Percentile of Data Value

Percentile of value x = (number of values less than x / total number of values) × 100. Round the result to the nearest whole number.
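
The calculation on this card can be sketched in Python as follows. The function name and sample data are hypothetical. Note one assumption: Python's built-in `round` sends exact halves to the nearest even integer, which may differ from the rounding convention used in class.

```python
def percentile_of_value(data, x):
    """Percentile of x: (count of values less than x / total count) * 100."""
    if not data:
        raise ValueError("data must not be empty")
    below = sum(1 for v in data if v < x)
    # Rounded to the nearest whole number (halves go to the even neighbor).
    return round(below / len(data) * 100)


sample = [2, 4, 5, 7, 8, 9, 10, 12, 15, 20]  # made-up values
print(percentile_of_value(sample, 9))  # 50, because 5 of the 10 values are below 9
```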

Modified Boxplot Construction

A special symbol (such as an asterisk) is used to identify outliers. The solid horizontal line extends only as far as the minimum data value that is not an outlier and the maximum data value that is not an outlier.

Quartiles

Measures of location, denoted Q1, Q2, and Q3, that divide a sorted set of data into four groups with about 25% of the values in each group. Q1 (first quartile) separates the bottom 25% of sorted values from the top 75% (same as P25). Q2 (second quartile) is the same as the median; it separates the bottom 50% of sorted values from the top 50% (same as P50). Q3 (third quartile) separates the bottom 75% of sorted values from the top 25% (same as P75).

Coefficient of Variation

The CV for a set of nonnegative sample or population data, expressed as a percent, describes the standard deviation relative to the mean. Sample: CV = (s / x-bar) × 100%. Population: CV = (σ / μ) × 100%.
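
A short sketch of the sample version of this formula, using Python's standard `statistics` module (`statistics.stdev` divides by n - 1, matching the sample standard deviation s). The function name and data are hypothetical.

```python
import statistics


def coefficient_of_variation(sample):
    """Sample CV as a percent: (s / x-bar) * 100."""
    mean = statistics.mean(sample)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(sample) / mean * 100


values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample with mean 5
print(round(coefficient_of_variation(values), 2))  # about 42.76
```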

Round-off Rule for Measures of Center

Carry one more decimal place than is present in the original set of data values

Finding the median

First sort the values (arrange them in order), then follow one of these rules: 1. If the number of data values is odd, the median is the value located in the exact middle of the list. 2. If the number of data values is even, the median is found by computing the mean of the two middle numbers.
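
The two rules above can be sketched in Python as follows. The function name is hypothetical; the standard library's `statistics.median` gives the same result.

```python
def median(values):
    """Median: sort, then take the middle value or the mean of the two middle values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Rule 1: odd count, the median is the exact middle value.
        return ordered[mid]
    # Rule 2: even count, the median is the mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```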

5-Number Summary

For a set of data, the 5-number summary consists of these five values: 1. Minimum value 2. First quartile, Q1 3. Second quartile, Q2 (same as the median) 4. Third quartile, Q3 5. Maximum value

Empirical Rule

For data sets having a distribution that is approximately bell shaped, the following properties apply: * About 68% of all values fall within 1 standard deviation of the mean. * About 95% of all values fall within 2 standard deviations of the mean. * About 99.7% of all values fall within 3 standard deviations of the mean.

Outliers for Modified Boxplots

For purposes of constructing modified boxplots, we can consider outliers to be data values meeting specific criteria. In modified boxplots, a data value is an outlier if it is above Q3 by an amount greater than 1.5 × IQR (greater than Q3 + 1.5 × IQR) or below Q1 by an amount greater than 1.5 × IQR (less than Q1 - 1.5 × IQR).
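
The outlier criterion can be sketched as a small Python check. This sketch assumes Q1 and Q3 have already been found by whatever quartile method the course uses, so they are passed in as arguments; the function name, data, and quartile values are hypothetical.

```python
def boxplot_outliers(data, q1, q3):
    """Return values more than 1.5 * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in data if v < lower_fence or v > upper_fence]


data = [1, 10, 11, 12, 13, 14, 15, 40]  # made-up values
# Assumed quartiles Q1 = 10.5 and Q3 = 14.5, so IQR = 4 and the fences are 4.5 and 20.5.
print(boxplot_outliers(data, 10.5, 14.5))  # [1, 40]
```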

Inferential Statistics

In later chapters we'll learn to use sample data to make inferences or generalizations about a population.

Descriptive Statistics

In this chapter we'll learn to summarize or describe the important characteristics of a data set (mean, standard deviation, etc.).

Mean (Advantages)

Is relatively reliable, i.e., means of samples drawn from the same population tend to vary less than other measures of center. Takes every data value into account.

Mean (Disadvantage)

Is sensitive to every data value: one extreme value can affect it dramatically, so it is not a resistant measure of center.

Comparing Variation in Different Samples

It is good practice to compare two sample standard deviations only when the sample means are approximately the same. When comparing variation in samples with very different means, it is better to use the coefficient of variation, CV = (s / x-bar) × 100% for a sample.

Converting from the kth Percentile to the Corresponding Data Value

L = (k / 100) × n, where n is the total number of values in the data set, k is the percentile being used, L is the locator that gives the position of a value in the sorted data, and Pk is the kth percentile.
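
The card gives only the locator formula. The sketch below adds one common textbook convention for turning L into a data value, which is an assumption and not stated on the card: if L is a whole number, Pk is midway between the Lth value and the next value; otherwise L is rounded up and Pk is the Lth value. The function name and data are hypothetical.

```python
def percentile_value(data, k):
    """Value of the kth percentile using the locator L = (k / 100) * n."""
    if not isinstance(k, int) or not 1 <= k <= 99:
        raise ValueError("k must be an integer from 1 to 99")
    ordered = sorted(data)
    n = len(ordered)
    product = k * n  # L = product / 100, kept as integers to avoid float error
    if product % 100 == 0:
        # Assumed rule: L is whole, so take the midpoint of the
        # Lth and (L + 1)th sorted values (positions counted from 1).
        L = product // 100
        return (ordered[L - 1] + ordered[L]) / 2
    # Assumed rule: L is not whole, so round it up and take the Lth value.
    L = product // 100 + 1
    return ordered[L - 1]


sample = [2, 4, 5, 7, 8, 9, 10, 12, 15, 20]  # made-up values
print(percentile_value(sample, 25))  # L = 2.5 rounds up to 3, giving 5
print(percentile_value(sample, 50))  # L = 5, midway between 8 and 9, giving 8.5
```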

10 - 90 Percentile Range

P90 - P10

Midquartile

(Q3 + Q1) / 2

Interquartile Range (or IQR):

Q3 - Q1

Semi-interquartile range

(Q3 - Q1) / 2

Boxplots - Normal Distribution

The key feature is that the distribution is completely symmetric about its mean.

Chebyshev's Theorem

The proportion (or fraction) of any set of data lying within K standard deviations of the mean is always at least 1 - 1/K², where K is any positive number greater than 1. Example: For K = 2, at least 3/4 (or 75%) of all values lie within 2 standard deviations of the mean. For K = 3, at least 8/9 (or 89%) of all values lie within 3 standard deviations of the mean.
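
A minimal sketch of the bound 1 - 1/K², reproducing the two examples on this card. The function name is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


print(chebyshev_lower_bound(2))            # 0.75
print(round(chebyshev_lower_bound(3), 2))  # 0.89
```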

unbiased

The sample variance s² is an unbiased estimator of the population variance σ², which means the values of s² tend to target σ².

Outlier

A value that lies very far away from the vast majority of the other values in a data set. Important principles: - An outlier can have a dramatic effect on the mean and the standard deviation. - An outlier can have a dramatic effect on the scale of the histogram, so that the true nature of the distribution is totally obscured.

Standard deviation

The standard deviation of a set of sample values, denoted by s, is a measure of the variation of values about the mean, i.e., how much the data values deviate from the mean. It is never negative, and it is zero only when all data values are the same number.

Unusual

values that fall below the minimum usual value or above the maximum usual value

Percentiles

are measures of location. There are 99 percentiles denoted P1, P2, . . . P99, which divide a set of data into 100 groups with about 1% of the values in each group.

x with - over it (x-bar)

denotes the mean of a set of sample values: x-bar = Σx / n. This is the sample mean.

μ (mu)

denotes the mean of all values in a population: μ = Σx / N. This is the population mean.

Σ (capital sigma)

denotes the sum of a set of values

Modified Boxplots

The boxplots described earlier are called skeletal (or regular) boxplots. Some statistical packages provide modified boxplots, which represent outliers as special points.

Skewed

distribution of data is skewed if it is not symmetric and extends more to one side than the other

Symmetric

distribution of data is symmetric if the left half of its histogram is roughly a mirror image of its right half

Boxplot

graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, Q1, the median, and the third quartile, Q3

x

is the variable used to represent the individual data values.

Weighted Mean

A mean in which some values contribute more than others. Example: grades (tests are weighted more than homework in most classes). To calculate: x-bar = Σ(w · x) / Σw, where w is the weight of each value x.
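
The weighted mean formula can be sketched in Python as follows. The function name, scores, and weights are hypothetical and used only to illustrate the calculation.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Made-up course: homework 20%, quizzes 30%, tests 50%.
print(weighted_mean([90, 80, 70], [20, 30, 50]))  # 77.0
```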

Multimodal

more than two data values occur with the same greatest frequency

No mode

no data value is repeated

Population Standard Deviation

σ = √( Σ(x - μ)² / N ), where σ is the lowercase Greek letter sigma. It is rarely computed in practice and is mainly of theoretical significance.

Range

The range of a set of data values is the difference between the maximum data value and the minimum data value: range = (max value) - (min value). It is very sensitive to extreme values, so it is not as useful as other measures of variation and is rarely used.

N

represents the number of data values in a population.

n

represents the number of data values in a sample.

Variance-Notation

s = sample standard deviation; s² = sample variance; σ = population standard deviation; σ² = population variance

Sample Standard Deviation (Shortcut)

s = √( [n Σ(x²) - (Σx)²] / [n(n - 1)] )

Sample Standard Deviation Formula

s = √( Σ(x - x-bar)² / (n - 1) )
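
As a rough check that this formula and the shortcut formula agree, the sketch below implements both in Python. The function names and the three data values are hypothetical.

```python
import math


def sample_sd(values):
    """Definition: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def sample_sd_shortcut(values):
    """Shortcut: sqrt((n * sum(x^2) - (sum(x))^2) / (n * (n - 1)))."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


data = [1, 3, 14]  # made-up sample with mean 6
print(sample_sd(data), sample_sd_shortcut(data))  # 7.0 7.0
```

The shortcut avoids computing the mean first, which is convenient by hand; with floating-point data the two forms can differ slightly in the last digits.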

Measures of Variation

Describe the spread or variability of data (the width of a distribution): 1. Standard deviation 2. Variance 3. Range (rarely used)

Z score

A standardized value: the number of standard deviations that a given value x is above or below the mean. Sample: z = (x - x-bar) / s. Population: z = (x - μ) / σ.
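
A minimal Python sketch of the z score, with a helper that applies the cutoff of 2 used elsewhere in these notes for unusual values. The function names and the example numbers are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above or below the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


def is_unusual(z):
    """A z score below -2 or above 2 marks an unusual value."""
    return z < -2 or z > 2


z = z_score(85, 70, 10)  # made-up value, mean, and standard deviation
print(z, is_unusual(z))  # 1.5 False
```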

Arithmetic Mean (Mean)

the measure of center obtained by adding all the data values and dividing the total by the number of data values. It is also called the average.

Median

the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. Often denoted by x with ~ over it (pronounced 'x-tilde'). It is not affected by an extreme value, so it is a resistant measure of center.

Measure of Center

the value at the center or middle of a data set: mean, median, mode, midrange (rarely used)

Midrange

the value midway between the maximum and minimum values in the original data set: midrange = (max value + min value) / 2. It is too sensitive to extremes because it uses only the maximum and minimum values, so it is rarely used in practice.

Mode

the value that occurs with the greatest frequency. A data set can have one mode, more than one mode, or no mode. The mode isn't used much with numerical data. However, the mode is the only measure of center that can be used with data at the nominal level of measurement. (Remember that the nominal level of measurement applies to data that consist of names, labels, or categories only, with no ordering scheme.)
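
The one-mode, several-modes, and no-mode cases can be sketched with `collections.Counter` from the Python standard library. The function name and data are hypothetical; returning an empty list for "no mode" is a design choice of this sketch.

```python
from collections import Counter


def modes(values):
    """Return the value(s) with the greatest frequency, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        # No data value is repeated, so there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))                    # [2]
print(modes([1, 1, 2, 2, 3]))                 # [1, 2], a bimodal data set
print(modes([1, 2, 3]))                       # []
print(modes(["red", "blue", "red", "green"])) # ['red'], nominal data
```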

Biased

the values of s do not target σ, i.e., the values of s generally tend to underestimate or overestimate the value of σ.

Estimating Standard Deviation

To roughly estimate the standard deviation from a collection of known sample data, use the range rule of thumb: s ≈ range / 4.
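
A short sketch of this estimate in Python; the function name and data are hypothetical, and the result is only a rough approximation of s.

```python
def estimate_sd(values):
    """Rough estimate of the standard deviation: range / 4."""
    if not values:
        raise ValueError("values must not be empty")
    return (max(values) - min(values)) / 4


print(estimate_sd([2, 10, 6, 18]))  # 4.0, since the range is 18 - 2 = 16
```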

Bimodal

two data values occur with the same greatest frequency

Usual

Values in a data set that are typical and not too extreme. Minimum usual value = (mean) - 2 × (SD). Maximum usual value = (mean) + 2 × (SD).
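
The two limits can be sketched in Python as follows. The function name and the example mean and standard deviation are hypothetical.

```python
def usual_limits(mean, sd):
    """Return (minimum usual value, maximum usual value) = mean -/+ 2 * SD."""
    return mean - 2 * sd, mean + 2 * sd


low, high = usual_limits(100, 15)  # made-up mean and standard deviation
print(low, high)                   # 70 130
print(low <= 128 <= high)          # True, so 128 would count as usual
```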

Unusual values

z score less than -2 or z score greater than 2

