K303 Module 1 Quiz
What we can do with outliers (4)
- Correct the error and re-analyze the data - Investigate the process to determine the cause of the outlier - Determine whether you failed to consider a factor that affects the process - Investigate the process and the outlier to determine whether the outlier occurred by chance; conduct the analysis with and without the outlier to see its impact on the results
5 Step Problem Solving
1. Define the problem 2. Gather Data 3. Analyze the data, think creatively & let the problem rest 4. Select a solution through preliminary 5. Evaluate & monitor the solution
5 steps of Business Analytics
1. Problem Definition 2. Data Preparation 3. Technical Analysis & Modeling 4. Evaluation of Results 5. Deployment
2nd quartile
50th percentile; also the median
Approximately ____ of all observations are within plus or minus one standard deviation of the mean (for symmetrical data)
68%
Approximately ____ of all observations are within plus or minus two standard deviations of the mean (for symmetrical data)
95%
Approximately ____ of all observations are within plus or minus three standard deviations of the mean (for symmetrical data)
99.7%
has two modes (two values that occur more frequently than any other) for the data item (variable)
Bi-modal
forward looking and actionable
Business Analytics
backward looking and descriptive
Business Intelligence
Identify the middle or center position of the data distribution. Mean- "average" Mode- "most" Median- "middle"
Central Tendency
Dependence or association: how close two variables are to having a linear relationship with one another. Correlated: height and weight. Not correlated: hair color and number of stars in the sky.
Correlation
collection of values that is typically recorded in tabular form ex: database table or a spreadsheet
Data set
A method of summarizing the characteristics of a single variable within a set of data. - apply when we have data for a complete population—for example, test scores for all students enrolled in a course for which nine sections are offered.
Descriptive Statistics
the subjects of the data collection ex: individual patients on whom measurements are taken
Elements
= Q1 - (3 × IQR) Values below this are extreme outliers
Extreme Lower Limit (skewness)
= Q3 + (3 × IQR) Values above this are extreme outliers.
Extreme Upper Limit (skewness)
MIN, 1st quartile, 2nd quartile, 3rd quartile, MAX
Five number summary
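A minimal sketch of how the five values on this card could be computed with Python's standard library. The function name `five_number_summary` and the test scores are hypothetical, and the "inclusive" quartile method is an assumption; other quartile conventions give slightly different Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, Q2/median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # Assumption: the "inclusive" method, which treats the data as a
    # complete population. Other conventions shift Q1 and Q3 slightly.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

scores = [55, 62, 70, 71, 75, 80, 84, 90, 98]  # hypothetical test scores
summary = five_number_summary(scores)
print(summary)
```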
.70 to .90 (-.70 to -.90)
High
come into play when we have data for only a subset, or sample, of a population—for example, test scores for students in only some of the nine sections of that course.
Inferential statistics
P75 - P25; shows how spread out on the number line the middle 50% of observations are
Interquartile range (IQR)
.00 to .30 (-.00 to -.30)
Low
= mean - (2 × the standard deviation) Values below this are outliers.
Lower Limit of symmetrical
Used with skewed data Less susceptible to influence of outliers
Median
= Q1 - (1.5 × IQR) Values below this are mild or extreme outliers.
Mild Lower Limit (skewness)
= Q3 + (1.5 × IQR) Values above this are mild or extreme outliers.
Mild Upper Limit (skewness)
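The mild (1.5 × IQR) and extreme (3 × IQR) limits for skewed data can be sketched as follows. The names `iqr_fences` and `classify` and the sample scores are hypothetical, and the "inclusive" quartile method is an assumption.

```python
import statistics

def iqr_fences(values):
    """Return the mild and extreme outlier limits based on the IQR."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "extreme_lower": q1 - 3 * iqr,
        "mild_lower": q1 - 1.5 * iqr,
        "mild_upper": q3 + 1.5 * iqr,
        "extreme_upper": q3 + 3 * iqr,
    }

def classify(value, fences):
    """Label a value as an extreme outlier, a mild outlier, or neither."""
    if value < fences["extreme_lower"] or value > fences["extreme_upper"]:
        return "extreme outlier"
    if value < fences["mild_lower"] or value > fences["mild_upper"]:
        return "mild outlier"
    return "not an outlier"

scores = [55, 62, 70, 71, 75, 80, 84, 90, 98]  # hypothetical test scores
fences = iqr_fences(scores)
print(fences)
print(classify(110, fences), classify(130, fences))
```

A value is checked against the extreme limits first because anything past an extreme limit is also past the corresponding mild limit.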
Mostly used in categorical data, not continuous data
Mode
.30 to .70 (-.30 to -.70)
Moderate
has two or more modes
Multi-modal
One mode (most frequent number, "peak")
Normal Distribution
single observation is the value of a variable for a particular element ex: Dallas, Texas
Observations
Uses historic data to model and predict potential outcomes of a situation, often incorporating the use of uncertain variables.
Predictive Analytics
Prescribes a course of action to achieve the best result for a given problem.
Prescriptive Analytics
IQR=
Q3-Q1
Skewed, Median &
Quartiles
Dividing a set of observations into 4 groups of equal size (each holds about 25% of the observations) - Quartiles divide a distribution into four parts. The second quartile is better known as the median.
Quartiles
Most of the data are clustered around the center, while the more extreme values on either side of the center become more rare as the distance from the center increases
SHAPE—Normal Distribution
Symmetrical shape (bell curve)
SHAPE—Normal Distribution
What three characteristics should you focus on in descriptive statistics?
Shape, Center, Spread
Asymmetrical Distribution: the two sides will not be mirror images of each other
Skewed
Skewness is the tendency for the values to be more frequent around the high or low ends of the x-axis
Skewed
SMART Goals =
Specific, Measurable, Achievable, Realistic, Time-based
Measures of Spread: summarize the data in a way that shows how scattered the values are and how much they differ from the mean value (2)
Standard Deviation Interquartile Range
Average distance between each data point and the mean. Indicates the spread of the data, and therefore the volatility within the process or business.
Stdev
This open-source programming language for statistical computing and graphics provides a wide variety of analytical techniques.
The R Project
= mean + (2 × the standard deviation) Values above this are outliers.
Upper Limit of symmetrical
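For symmetrical data, the lower and upper limits (mean minus or plus two standard deviations) can be sketched like this. The function names and the data are hypothetical, and using the population standard deviation (`pstdev`) rather than the sample version is an assumption.

```python
import statistics

def symmetric_limits(values, k=2):
    """Return (lower, upper) = mean -/+ k standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return mean - k * sd, mean + k * sd

def is_outlier(value, lower, upper):
    """A value outside the limits is flagged as an outlier."""
    return value < lower or value > upper

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations: mean 5, stdev 2
lower, upper = symmetric_limits(data)
print(lower, upper)
print(is_outlier(12, lower, upper))
```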
.90 to 1.00 (-.90 to -1.00) =
Very high
is a number between -1 and 1 that describes the linear relationship between two variables.
correlation coefficient
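A minimal sketch of computing a Pearson correlation coefficient and mapping it to the Low / Moderate / High / Very high bands from these cards. The function names and the height and weight figures are hypothetical. Because the bands on the cards share their endpoints (for example .30 appears in both Low and Moderate), this sketch assumes a boundary value belongs to the stronger band.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, a number between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def strength_label(r):
    """Map the size of r (ignoring its sign) to a strength band."""
    size = abs(r)
    if size >= 0.90:
        return "very high"
    if size >= 0.70:
        return "high"
    if size >= 0.30:
        return "moderate"
    return "low"

heights = [60, 62, 65, 68, 70, 72]        # hypothetical, inches
weights = [115, 120, 140, 155, 160, 180]  # hypothetical, pounds
r = pearson_r(heights, weights)
direction = "direct" if r > 0 else "inverse"
print(round(r, 3), strength_label(r), direction)
```

The sign of `r` gives the direction (direct or inverse) and its distance from zero gives the strength.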
is the process of finding new and potentially useful knowledge from data by conducting analyses from different perspectives in order to uncover patterns, typically in large data sets. - builds on ideas from machine learning/artificial intelligence, pattern recognition, statistics, and database systems. It is the exploration and analysis of data by automatic and semi-automatic means
data mining
refers to the attempt to extract meaning from one or more large data sets using techniques that, generally speaking, owe their workings to statistics and artificial intelligence.
data mining
4 organizational terms in statistical analysis
data set, elements, variable & observations
apply when we have data for a complete population
descriptive statistics
communicates both idea of the range & the idea of shape
distribution
box-whisker plot
distribution by percentile
histogram =
distribution of frequency
Used to solve for the lower and upper bounds of the middle 68, 95, and 99.7 percent of the observations, assuming a normal distribution whose two defining parameters (mean and standard deviation) are known. - In a normal distribution, the mean, median, and mode are identical.
empirical rule
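As a rough illustration of the empirical rule, the bounds of the middle 68%, 95%, and 99.7% follow directly from the mean and standard deviation. The function name and the example parameters (mean 100, standard deviation 15) are hypothetical, and the result only applies if the data are approximately normal.

```python
def empirical_rule_bounds(mean, sd):
    """Return (lower, upper) bounds for 1, 2, and 3 standard deviations."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {label: (mean - k * sd, mean + k * sd) for k, label in coverage.items()}

bounds = empirical_rule_bounds(mean=100, sd=15)  # hypothetical parameters
for label, (low, high) in bounds.items():
    print(f"about {label} of observations fall between {low} and {high}")
```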
indicates how many observations fall within each of a number of categories (classes or bins). The classes themselves can be determined by the person (or computer program) doing the analysis, or they can be determined by some external logic.
frequency distribution
by counting the number of departures for this flight that fell into each of those timeliness bins.
frequency table
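A frequency table can be sketched by assigning each observation to a bin and counting. The function name, the 10-minute bin width, and the delay values are hypothetical choices made for illustration; bins with no observations are simply omitted here.

```python
from collections import Counter

def frequency_table(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return {(start, start + bin_width): counts[start] for start in sorted(counts)}

delays = [3, 7, 12, 14, 15, 21, 22, 29, 41]  # hypothetical departure delays, minutes
table = frequency_table(delays, 10)
for (low, high), count in table.items():
    print(f"{low}-{high} min: {count}")
```

Plotting these counts as bars gives the histogram for the same variable.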
comes into play when we have data for only a subset or sample of a population
inferential statistics
Distribution is asymmetrical, and the imbalance is seen in a left tail that is longer than the right tail. Test scores are often left skewed, with very low scores (the left-hand side of the histogram) being the exception. - negatively skewed
left skewed
can describe straight-line relationships between two variables but is not appropriate to describe piecewise relationships, nor is it appropriate in describing other nonlinear relationships.
linear correlation
another parametric distribution that is used to characterize multiplicative processes. A variable is lognormally distributed if the logarithms of the values for the variable are normally distributed. Lognormal distributions are often used in finance to model asset prices.
lognormal
the 1st quartile =
lower quartile is equivalent to the 25th percentile
an attempt to understand customer purchase patterns, usually to drive more sales—and found that certain purchase patterns were good indicators of pregnancy
market basket analysis
About 68% of values lie within one standard deviation (σ) away from the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This is known as the empirical rule or the 3-sigma rule
normal distribution
- An unusually large or small observation - Can have a disproportionate effect on statistical results, such as the mean, which can result in misleading interpretations
outliers
distribution, rigidly defined by mathematical properties and intended to characterize processes that are additive.
parametric
idea of dividing a set of observations into 100 groups of equal size (each holds about 1% of the observations)
percentiles
distribution is asymmetrical, and the imbalance is seen in a right tail that is longer than the left tail -positively skewed
right skewed
If skew is less than −1 or greater than +1, the distribution is ?
skewed
A measure of asymmetry, imbalance, or "lopsidedness" in a variable. A skewness of zero indicates a balanced (and possibly but not necessarily symmetric) distribution of values. Negative skew suggests an imbalance created by low values (graphically, a tail to the left), and positive skew suggests an imbalance created by high values (tail to the right).
skewness
measure of asymmetry, imbalance in a variable
skewness
Symmetrical, mean &
stdev
Used when we need to compare the spread around the mean of two or more variables; gives a value that is independent of the scale of the variable (divide the stdev by the mean) and expresses how much variability exists as a % of the mean
normalized stdev
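The ratio described on this card (standard deviation divided by the mean) can be sketched as below. The function name and the data are hypothetical, and the population standard deviation is an assumption. The second data set is the first one multiplied by 100, which shows that the ratio does not depend on the scale of the variable.

```python
import statistics

def normalized_stdev(values):
    """Standard deviation expressed as a fraction of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("undefined when the mean is zero")
    return statistics.pstdev(values) / mean  # assumption: population stdev

small_scale = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical observations
large_scale = [v * 100 for v in small_scale]  # same shape, 100x the scale

ratio_small = normalized_stdev(small_scale)
ratio_large = normalized_stdev(large_scale)
print(f"{ratio_small:.0%} vs {ratio_large:.0%}")
```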
Variability
stdev, normalized stdev : communicate the idea of spread around the mean
This single statistic communicates two aspects of a linear relationship: strength and direction. The farther from zero, the stronger the linear relationship. The sign of the coefficient communicates the nature of the relationship, direct (positive sign) or inverse (negative sign).
strength/direction
1 standard way of displaying the distribution of values for a variable using very little space
summary
If skew is between -1 and +1, we will treat the data as ?
symmetrical
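A sketch of the skew rule from these cards: compute a skewness statistic, then treat the data as symmetrical if it lies between -1 and +1 and as skewed otherwise. The cards do not say which skewness formula is used, so this assumes the moment coefficient (third central moment divided by the second central moment raised to the 3/2 power); spreadsheet functions may apply a sample adjustment and return slightly different values. The function names and data are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness (an assumed formula, see lead-in)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

def shape_label(values):
    """Apply the -1 to +1 rule from the cards."""
    return "symmetrical" if -1 <= skewness(values) <= 1 else "skewed"

balanced = [1, 2, 3, 4, 5]             # hypothetical, no skew
long_right_tail = [1, 1, 1, 2, 2, 3, 50]  # hypothetical, one very high value

print(shape_label(balanced), shape_label(long_right_tail))
```

A positive result points to a longer right tail and a negative result to a longer left tail.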
Only want to use MEAN with _________ data! Susceptible to the influence of outliers! As the data becomes skewed the mean loses its ability to provide the best central location for the data because the skewed data is dragging it away from the typical value
symmetrical
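To see why the mean is reserved for symmetrical data, the analysis can be run with and without an outlier and the two centers compared. The salary figures are hypothetical.

```python
import statistics

salaries = [40, 42, 45, 47, 50]  # hypothetical salaries, in thousands
with_outlier = salaries + [400]  # one unusually large observation

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)

# The mean is dragged toward the outlier; the median barely moves.
print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```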
Negatively Skewed/Left Skewed-
tail on left side is longer
Positively Skewed/Right Skewed
tail on right side is longer
distribution occurs if there is only one 'peak' (a highest point) in the distribution. -This means there is one mode (a value that occurs more frequently than any other) for the data item (variable)
uni-modal
3rd quartile =
upper quartile = the 75th percentile
Spread = ? Tells us how well the mean represents the data set
variability
types of measurements & other information recorded
variables
