K303 Module 1 Quiz

What we can do with outliers (4)

- Correct the error and re-analyze the data
- Investigate the process to determine the cause of the outlier
- Determine whether you failed to consider a factor that affects the process
- Investigate the process and the outlier to determine whether the outlier occurred by chance; conduct the analysis with and without the outlier to see its impact on the results

5 Step Problem Solving

1. Define the problem
2. Gather data
3. Analyze the data, think creatively, and let the problem rest
4. Select a solution through preliminary
5. Evaluate and monitor the solution

5 steps of Business Analytics

1. Problem Definition
2. Data Preparation
3. Technical Analysis & Modeling
4. Evaluation of Results
5. Deployment

2nd quartile

50th percentile; also the median

Approximately ____ of all observations are within plus or minus one standard deviation of the mean (for symmetrical data)

68%

Approximately ____ of all observations are within plus or minus two standard deviations of the mean (for symmetrical data)

95%

Approximately ____ of all observations are within plus or minus three standard deviations of the mean (for symmetrical data)

99.7%

has two modes (two values that occur more frequently than any other) for the data item (variable)

Bi-modal

forward looking and actionable

Business Analytics

backward looking and descriptive

Business Intelligence

Identifies the middle or center position of the data distribution. Mean: "average". Mode: "most". Median: "middle".

Central Tendency
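
As a rough illustration of the three measures of center, here is a minimal Python sketch using the standard `statistics` module. The sample values are hypothetical and were chosen so that one large value pulls the mean away from the median and mode.

```python
import statistics

# Hypothetical sample: four small values and one large outlier.
values = [1, 2, 2, 3, 12]

mean_value = statistics.mean(values)      # "average"
median_value = statistics.median(values)  # "middle"
mode_value = statistics.mode(values)      # "most"

print(mean_value, median_value, mode_value)
```

The mean lands above four of the five values, while the median and mode stay at the typical value. That is why the median is preferred for skewed data.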

Dependence or association: how close two variables are to having a linear relationship with one another. Correlated: height and weight. Not correlated: hair color and number of stars in the sky.

Correlation

A collection of values, typically recorded in tabular form (e.g., a database table or a spreadsheet)

Data set

A method of summarizing the characteristics of a single variable within a set of data. - apply when we have data for a complete population—for example, test scores for all students enrolled in a course for which nine sections are offered.

Descriptive Statistics

The subjects of the data collection (e.g., the individual patients on whom measurements are taken)

Elements

= Q1 - (3 × IQR) Values below this are extreme outliers

Extreme Lower Limit (skewness)

= Q3 + (3 × IQR) Values above this are extreme outliers.

Extreme Upper Limit (skewness)

MIN, 1st quartile, 2nd quartile, 3rd quartile, MAX

Five number summary
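
The five-number summary could be computed along these lines. This is only a sketch: the function name `five_number_summary` is made up for illustration, and the quartile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption, since textbooks and spreadsheet tools differ slightly in how they interpolate Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # Assumption: "inclusive" quartiles; other conventions can give
    # slightly different Q1 and Q3 on small data sets.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical data: the integers 1 through 9 in shuffled order.
summary = five_number_summary([9, 1, 8, 2, 7, 3, 6, 4, 5])
print(summary)
```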

.70 to .90 (-.70 to -.90)

High

come into play when we have data for only a subset, or sample, of a population—for example, test scores for students in only some of the nine sections of that course.

Inferential statistics

P75 - P25 (the 75th percentile minus the 25th percentile); shows how spread out along the number line the middle 50% of observations are

Interquartile range (IQR)

.00 to .30 (-.00 to -.30)

Low

= mean - (2 × the standard deviation) Values below this are outliers.

Lower Limit of symmetrical

Used with skewed data Less susceptible to influence of outliers

Median

= Q1 - (1.5 × IQR) Values below this are mild or extreme outliers.

Mild Lower Limit (skewness)

= Q3 + (1.5 × IQR) Values above this are mild or extreme outliers.

Mild Upper Limit (skewness)
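
The mild and extreme limits for skewed data (Q1 or Q3 shifted by 1.5 or 3 times the IQR) can be sketched in Python as follows. The function names are hypothetical, and the quartiles are passed in directly so the example does not depend on any particular quartile convention.

```python
def iqr_limits(q1, q3):
    """Return the mild and extreme outlier limits built from Q1 and Q3."""
    iqr = q3 - q1
    return {
        "extreme_lower": q1 - 3 * iqr,
        "mild_lower": q1 - 1.5 * iqr,
        "mild_upper": q3 + 1.5 * iqr,
        "extreme_upper": q3 + 3 * iqr,
    }

def classify_outlier(value, q1, q3):
    """Label a value as an extreme outlier, a mild outlier, or neither."""
    limits = iqr_limits(q1, q3)
    if value < limits["extreme_lower"] or value > limits["extreme_upper"]:
        return "extreme outlier"
    if value < limits["mild_lower"] or value > limits["mild_upper"]:
        return "mild outlier"
    return "not an outlier"

# Hypothetical quartiles: Q1 = 3 and Q3 = 7, so IQR = 4.
print(iqr_limits(3, 7))
print(classify_outlier(14, 3, 7))
```

With Q1 = 3 and Q3 = 7, the mild limits are -3 and 13 and the extreme limits are -9 and 19, so a value of 14 is mild and a value of 20 is extreme.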

Mostly used in categorical data, not continuous data

Mode

.30 to .70 (-.30 to -.70)

Moderate

has two or more modes

Multi-modal

One mode (most frequent number, "peak")

Normal Distribution

A single observation is the value of a variable for a particular element (e.g., Dallas, Texas)

Observations

Uses historic data to model and predict potential outcomes of a situation, often incorporating the use of uncertain variables.

Predictive Analytics

Prescribes a course of action to achieve the best result for a given problem.

Prescriptive Analytics

IQR=

Q3-Q1

Skewed, Median &

Quartiles

Dividing a set of observations into four groups that each contain an equal number of observations. Quartiles divide a distribution into four parts; the second quartile is better known as the median.

Quartiles

Most of the data are clustered around the center, while the more extreme values on either side of the center become rarer as the distance from the center increases

SHAPE—Normal Distribution

Symmetrical shape (bell curve)

SHAPE—Normal Distribution

What three characteristics should you focus on in descriptive statistics?

Shape, Center, Spread

Asymmetrical Distribution: the two sides will not be mirror images of each other

Skewed

Skewness is the tendency for the values to be more frequent around the high or low ends of the x-axis

Skewed

SMART Goals =

Specific, Measurable, Achievable, Realistic, Time-based

Measures of Spread: summarize the data in a way that shows how scattered the values are and how much they differ from the mean value (2)

Standard Deviation Interquartile Range

Average distance between each data point and the mean. Indicates the spread of the data, and therefore the volatility within the process or business.

Stdev

This open-source programming language for statistical computing and graphics provides a wide variety of analytical techniques.

The R Project

= mean + (2 × the standard deviation) Values above this are outliers.

Upper Limit of symmetrical
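
For symmetrical data, the lower and upper limits (mean minus or plus two standard deviations) might be computed like this. It is a sketch under one stated assumption: the population standard deviation (`statistics.pstdev`) is used; a course or tool that uses the sample standard deviation would get slightly wider limits. The function names are hypothetical.

```python
import statistics

def two_sigma_limits(values):
    """Return (lower, upper) = mean -/+ 2 standard deviations."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation, not the sample version.
    sd = statistics.pstdev(values)
    return mean - 2 * sd, mean + 2 * sd

def two_sigma_outliers(values):
    """Return the values that fall outside the two-sigma limits."""
    lower, upper = two_sigma_limits(values)
    return [v for v in values if v < lower or v > upper]

# Hypothetical data: nine identical readings and one much larger one.
readings = [10] * 9 + [50]
print(two_sigma_limits(readings))
print(two_sigma_outliers(readings))
```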

.90 to 1.00 (-.90 to -1.00) =

Very high

is a number between -1 and 1 that describes the linear relationship between two variables.

correlation coefficient
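
The correlation coefficient and the strength bands used in this set (low, moderate, high, very high) can be sketched as below. The Pearson formula is the standard one. How a value that sits exactly on a band boundary (such as .30) is labeled is an assumption here, since the bands as listed overlap at their edges; this sketch assigns boundary values to the higher band.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def strength_label(r):
    """Map |r| to the strength bands; boundary handling is an assumption."""
    size = abs(r)
    if size >= 0.90:
        return "very high"
    if size >= 0.70:
        return "high"
    if size >= 0.30:
        return "moderate"
    return "low"

def direction_label(r):
    """Positive sign means a direct relationship, negative means inverse."""
    if r > 0:
        return "direct"
    if r < 0:
        return "inverse"
    return "none"

# Hypothetical data with a perfect straight-line relationship.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(r, strength_label(r), direction_label(r))
```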

is the process of finding new and potentially useful knowledge from data by conducting analyses from different perspectives in order to uncover patterns, typically in large data sets. - builds on ideas from machine learning/artificial intelligence, pattern recognition, statistics, and database systems. It is the exploration and analysis of data by automatic and semi-automatic means

data mining

refers to the attempt to extract meaning from one or more large data sets using techniques that, generally speaking, owe their workings to statistics and artificial intelligence.

data mining

4 organizational terms in statistical analysis

data set, elements, variable & observations

apply when we have data for a complete population

descriptive statistics

communicates both idea of the range & the idea of shape

distribution

box-whisker plot

distribution by percentile

histogram =

distribution of frequency

Used to solve for the lower and upper bounds of the middle 68, 95, and 99.7 percent of the observations, assuming a normal distribution whose two defining parameters (mean and standard deviation) are known. In a normal distribution, the mean, median, and mode are identical.

empirical rule
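
Given a mean and a standard deviation, the bounds from the empirical rule are simple arithmetic. The sketch below uses a hypothetical function name and hypothetical inputs (mean 100, standard deviation 15), and it only makes sense if the data are assumed to be roughly normal.

```python
def empirical_rule_bounds(mean, sd):
    """Return the bounds of the middle 68%, 95%, and 99.7% of observations."""
    return {
        "68%": (mean - 1 * sd, mean + 1 * sd),
        "95%": (mean - 2 * sd, mean + 2 * sd),
        "99.7%": (mean - 3 * sd, mean + 3 * sd),
    }

# Hypothetical parameters for a normally distributed variable.
bounds = empirical_rule_bounds(mean=100, sd=15)
for share, (low, high) in bounds.items():
    print(share, low, high)
```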

indicates how many observations fall within each of a number of categories (classes or bins). The classes themselves can be determined by the person (or computer program) doing the analysis, or they can be determined by some external logic.

frequency distribution

Built by counting the number of observations that fall into each bin, for example the number of departures for a flight that fell into each timeliness bin.

frequency table
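
A frequency table can be built by counting observations per bin. The sketch below is one way to do it in Python; the function name, the delay values, and the 10-minute bin edges are all hypothetical, and each bin is assumed to include its lower edge and exclude its upper edge.

```python
from bisect import bisect_right

def frequency_table(values, edges):
    """Count how many values fall into each bin [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        index = bisect_right(edges, value) - 1
        # Values below the first edge or at/above the last edge are ignored.
        if 0 <= index < len(counts):
            counts[index] += 1
    return counts

# Hypothetical departure delays in minutes, binned in 10-minute classes.
delays = [1, 5, 12, 18, 25, 31]
counts = frequency_table(delays, edges=[0, 10, 20, 30, 40])
print(counts)
```

Plotting these counts as bars, one per bin, gives the histogram for the same data.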

comes into play when we have data for only a subset or sample of a population

inferential statistics

Distribution is asymmetrical, and the imbalance is seen in a left tail that is longer than the right tail. Test scores are often left skewed, with very low scores (the left-hand side of the histogram) being the exception. Also called negatively skewed.

left skewed

Can describe straight-line relationships between two variables, but is not appropriate for describing piecewise relationships or other nonlinear relationships.

linear correlation

another parametric distribution that is used to characterize multiplicative processes. A variable is lognormally distributed if the logarithms of the values for the variable are normally distributed. Lognormal distributions are often used in finance to model asset prices.

lognormal

1st quartile =

lower quartile, equivalent to the 25th percentile

An attempt to understand customer purchase patterns, usually to drive more sales. In one example, the analysis found that certain purchase patterns were good indicators of pregnancy.

market basket analysis

About 68% of values lie within one standard deviation (σ) away from the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This is known as the empirical rule or the 3-sigma rule

normal distribution

- An unusually large or small observation - Can have a disproportionate effect on statistical results, such as the mean, which can result in misleading interpretations

outliers

distribution, rigidly defined by mathematical properties and intended to characterize processes that are additive.

parametric

The idea of dividing a set of observations into 100 groups that each contain an equal number of observations

percentiles

Distribution is asymmetrical, and the imbalance is seen in a right tail that is longer than the left tail. Also called positively skewed.

right skewed

If skew is less than −1 or greater than +1, the distribution is ?

skewed

A measure of asymmetry, imbalance, or "lopsidedness" in a variable. A skewness of zero indicates a balanced (and possibly but not necessarily symmetric) distribution of values. Negative skew suggests an imbalance created by low values (graphically, a tail to the left), and positive skew suggests an imbalance created by high values (a tail to the right).

skewness

measure of asymmetry, imbalance in a variable

skewness

Symmetrical, mean &

stdev

Used when we need to compare spread around the mean of two or more variables, because it gives a value that is independent of the scale of the variable. Divide the stdev by the mean; the result expresses how much variability exists as a percentage of the mean.

normalized stdev
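
Dividing the standard deviation by the mean removes the scale of the variable, which the following sketch tries to show. The function name and both data sets are hypothetical (the second is just the first multiplied by 100), and the population standard deviation is an assumption.

```python
import statistics

def normalized_stdev(values):
    """Standard deviation divided by the mean (spread as a share of the mean)."""
    # Assumption: population standard deviation; the mean must not be zero.
    return statistics.pstdev(values) / statistics.mean(values)

# Hypothetical variables measured on very different scales.
small_scale = [2, 4, 4, 4, 5, 5, 7, 9]
large_scale = [v * 100 for v in small_scale]

print(statistics.pstdev(small_scale), statistics.pstdev(large_scale))
print(normalized_stdev(small_scale), normalized_stdev(large_scale))
```

The raw standard deviations differ by a factor of 100, but the normalized values match, so the two variables are equally variable relative to their means.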

Variability

stdev, normalized stdev : communicate the idea of spread around the mean

This single statistic communicates two aspects of a linear relationship: strength and direction. The farther from zero, the stronger the linear relationship. The sign of the coefficient communicates the nature of the relationship, direct (positive sign) or inverse (negative sign).

strength/direction

One standard way of displaying the distribution of values for a variable using very little space

summary

If skew is between -1 and +1, we will treat the data as ?

symmetrical
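
The rule of thumb for skew (treat the data as symmetrical between -1 and +1, otherwise as skewed) can be sketched as follows. The skewness formula is an assumption: this set does not say which one the course uses, so the sketch uses the simple moment-based coefficient (third central moment divided by the 1.5 power of the second). Spreadsheet functions often apply a sample-size adjustment and will return somewhat different numbers. Values of exactly -1 or +1 are treated as symmetrical here, which is also an assumption.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (no sample-size adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def shape_label(skew):
    """Apply the -1 / +1 rule of thumb to a skewness value."""
    if skew < -1:
        return "left skewed"
    if skew > 1:
        return "right skewed"
    return "treat as symmetrical"

# Hypothetical data: one balanced sample and one with a long right tail.
balanced = [1, 2, 3, 4, 5]
right_tail = [1, 1, 1, 1, 10]

print(skewness(balanced), shape_label(skewness(balanced)))
print(skewness(right_tail), shape_label(skewness(right_tail)))
```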

Only want to use MEAN with _________ data! Susceptible to the influence of outliers! As the data becomes skewed the mean loses its ability to provide the best central location for the data because the skewed data is dragging it away from the typical value

symmetrical

Negatively Skewed/Left Skewed-

tail on left side is longer

Positively Skewed/Right Skewed

tail on right side is longer

A distribution with only one peak (a highest point). This means there is one mode (a value that occurs more frequently than any other) for the data item (variable).

uni-modal

3rd quartile =

upper quartile = the 75th percentile

Spread = ? Tells us how well the mean represents the data set

variability

types of measurements & other information recorded

variables

