OBA 311 Week 2
Identifying Outliers
There is no standard definition of an outlier. A common rule of thumb flags values with a z-score greater than +3 or less than -3.
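A minimal sketch of the z-score rule of thumb, assuming the sample standard deviation (N-1) is used. The function name `zscore_outliers`, the `threshold` parameter, and the data are made up for illustration.

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Return the values whose z-score is above +threshold or below -threshold."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (N-1); an assumption
    return [x for x in data if abs((x - mean) / sd) > threshold]

# Illustrative data: 19 values between 10 and 13, plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 12, 10, 13, 95]

flagged = zscore_outliers(data)
print(flagged)
```

In very small data sets no value can reach a z-score of 3, so the rule only becomes useful with more observations.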
Sample
A subset of the population, used to obtain sufficient information to draw a valid inference about the population.
Population
ALL items of interest for a particular investigation
Mean
The average. Outliers can affect the value of the mean.
Measures of location
Mean, Median, Mode, Midrange
Covariance
Measure of the linear ASSOCIATION between two variables X and Y. Population covariance divides by N; sample covariance divides by N-1.
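The N versus N-1 distinction can be sketched as follows. The function name `covariance` and the `sample` flag are hypothetical, and the data are made up.

```python
def covariance(xs, ys, sample=True):
    """Covariance of paired data; divides by N-1 for a sample, N for a population."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from each mean.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

sample_cov = covariance(xs, ys, sample=True)       # 20 / 4
population_cov = covariance(xs, ys, sample=False)  # 20 / 5
print(sample_cov, population_cov)
```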
Samples Variability
Samples are sensitive to sample size. Different samples will have different histogram shapes, means, standard deviations, etc.
Median
Specifies the middle value when the data are arranged from least to greatest: half of the values fall below it and half above. Not affected by outliers.
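A small illustration, with made-up numbers, of how one extreme value moves the mean but leaves the median where it was:

```python
import statistics

typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]  # same data, largest value replaced by an extreme one

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)      # the mean shifts toward the extreme value
print(median_before, median_after)  # the median is unchanged
```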
Interquartile Range (IQR)
The difference between the 3rd and 1st quartiles (Q3 - Q1). Includes only the middle 50% of the data and is not influenced by extreme values.
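A sketch of the IQR using Python's `statistics.quantiles`. Quartile conventions differ between textbooks and software, so the exact Q1 and Q3 values here are an assumption tied to that function's default method; the helper name `iqr` and the data are made up.

```python
import statistics

def iqr(data):
    """Interquartile range: Q3 - Q1, using the default quartile method."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

# Illustrative data with one extreme value at the top.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

spread_iqr = iqr(data)
spread_range = max(data) - min(data)
print(spread_iqr, spread_range)
```

The extreme value inflates the range but has no effect on the IQR.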
Association
Two variables have a strong statistical relationship if they appear to move together. Statistical relationships can occur even when there is no cause-and-effect link (ice cream sales & murder).
Correlation
A measure of the linear RELATIONSHIP between two variables that does not depend on the units of measurement. The correlation coefficient is scaled between -1 and 1.
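To illustrate the unit-independence claim, here is a sketch of the Pearson correlation coefficient computed by hand. The function name `correlation` and the height/weight numbers are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52, 60, 63, 70, 80]
heights_cm = [h * 100 for h in heights_m]  # same heights, different units

r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)
print(r_m, r_cm)  # the two values agree up to floating-point rounding
```

Covariance would change by a factor of 100 under the same unit change, which is why correlation is the easier measure to compare across data sets.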
Statistical Thinking
A philosophy of learning and action for improvement based on the principles that: - all work occurs in a system of interconnected processes - variation exists in all processes - better performance results from reducing variation
Variance
Average of the squared deviations from the mean. For a population the denominator is N; for a sample the denominator is N-1.
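A minimal sketch of both versions of the variance. The function name `variance` and the `sample` flag are hypothetical, and the data are made up.

```python
def variance(data, sample=True):
    """Average squared deviation from the mean (N-1 for a sample, N for a population)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return squared_deviations / (n - 1 if sample else n)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32

population_var = variance(data, sample=False)  # 32 / 8
sample_var = variance(data, sample=True)       # 32 / 7
print(population_var, sample_var)
```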
Skewness
Describes the lack of symmetry of data. Distributions that tail off to the RIGHT are positively skewed. Distributions that tail off to the LEFT are negatively skewed.
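The notes do not give a formula for skewness. As an assumption, the sketch below uses one common population definition: the average cubed deviation divided by the cubed population standard deviation. The function name and data are made up.

```python
import math

def skewness(data):
    """Population skewness: mean cubed deviation / (population sd cubed). One common definition."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum((x - mean) ** 3 for x in data) / (n * sd ** 3)

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 2, 2, 3, 10]       # long tail to the right
left_tail = [-x for x in right_tail]   # mirror image: long tail to the left

print(skewness(symmetric))   # zero for symmetric data
print(skewness(right_tail))  # positive
print(skewness(left_tail))   # negative
```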
Chebyshev's Theorem for standard deviation
For any data set, the proportion of values that lie within k standard deviations of the mean is at least 1 - 1/k^2 (for k > 1). So for 2 standard deviations of the mean: 1 - (1/2^2) = .75, or at least 75%.
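The bound can be checked against a small data set. This sketch assumes the population standard deviation; the function names `chebyshev_bound` and `fraction_within` and the data are made up for illustration.

```python
import statistics

def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed proportion of values strictly within k population sds of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum(1 for x in data if abs(x - mean) < k * sd) / len(data)

data = [2, 4, 4, 4, 5, 5, 7, 9]

bound = chebyshev_bound(2)
observed = fraction_within(data, 2)
print(bound, observed)  # the observed proportion is never below the bound
```

The theorem gives a floor, not an estimate: real data sets usually have a much larger share of values within k standard deviations.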
Coefficient of Kurtosis (CK)
measures the degree of kurtosis of a population
Sign of z-score
Negative if the number is LEFT of (below) the mean; positive if the number is RIGHT of (above) the mean.
Mode
The observation that occurs most frequently. Most useful for data sets that contain a small number of unique values. The mode can easily be identified from a frequency distribution or a histogram.
Coefficient of Variation (CV)
Provides a relative measure of dispersion in the data relative to the mean. Provides a relative measure of risk to return. CV = (standard deviation) / (mean)
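A sketch of the CV formula, assuming the sample standard deviation. The function name and the two data sets (`fund_a`, `fund_b`) are made up; they share the same standard deviation but have different means.

```python
import statistics

def coefficient_of_variation(data):
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    return statistics.stdev(data) / statistics.mean(data)

fund_a = [10, 12, 14]     # mean 12, sample sd 2
fund_b = [100, 102, 104]  # mean 102, sample sd 2

cv_a = coefficient_of_variation(fund_a)
cv_b = coefficient_of_variation(fund_b)
print(cv_a, cv_b)  # same absolute spread, but fund_a varies more relative to its mean
```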
standardized value aka z-score
Provides a relative measure of the distance of an observation from the mean, independent of the units of measurement.
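A minimal sketch of the z-score, z = (x - mean) / (standard deviation). Using the population standard deviation is an assumption, and the function name and data are made up.

```python
import statistics

def z_score(x, data):
    """Number of population standard deviations that x lies from the mean of data."""
    return (x - statistics.mean(data)) / statistics.pstdev(data)

data = [2, 4, 4, 4, 5, 5, 7, 9]     # mean 5, population sd 2
scaled = [x * 10 for x in data]     # same data in different units

z_above = z_score(9, data)          # right of the mean: positive
z_below = z_score(1, data)          # left of the mean: negative
z_scaled = z_score(90, scaled)      # unit change leaves the z-score unchanged
print(z_above, z_below, z_scaled)
```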
Kurtosis
Refers to the peakedness or flatness of a histogram. Measured by the coefficient of kurtosis (CK).
Measures of dispersion
Refers to the degree of variation in the data; that is, the numerical spread of the data. Key measures: range, interquartile range, variance, standard deviation.
Range
The DIFFERENCE between the maximum and the minimum value in a data set. Affected by outliers and often used for small data sets. Different from the midrange: the midrange is the average of the two extremes, while the range is their difference.
Midrange
The average of the greatest and least values in the data set. Caution: extreme values easily distort the result.
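The range and the midrange use the same two numbers in different ways, and both react strongly to a single extreme value. The function names and data in this sketch are made up.

```python
def data_range(data):
    """Range: maximum minus minimum."""
    return max(data) - min(data)

def midrange(data):
    """Midrange: average of the maximum and the minimum."""
    return (max(data) + min(data)) / 2

data = [2, 4, 4, 4, 5, 5, 7, 9]
with_extreme = [2, 4, 4, 4, 5, 5, 7, 90]  # largest value replaced by an extreme one

print(data_range(data), midrange(data))
print(data_range(with_extreme), midrange(with_extreme))
```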
proportion
the fraction of data that have a certain characteristic
Standard deviation
the square root of the variance.