Business Analytics Midterm

Business applications of BA

- New customer acquisition, customer loyalty programs, customer churn
- Cross-sell / upsell
- Pricing tolerance
- Inventory and staffing optimization
- Fraud detection

Business drivers for analytics

- Optimize business operations: sales, pricing, profitability
- Identify business risk: customer churn, fraud
- Predict new opportunities: upsell, cross-sell, new customers
- Comply with laws: anti-money laundering

Data measurement

- Byte = 1 character
- Kilobyte = 1,000 bytes
- Megabyte = 1,000,000 bytes
- Gigabyte = 1,000,000,000 bytes
- Terabyte = 1 followed by 12 zeroes
- Petabyte = 1 followed by 15 zeroes
- Exabyte = 1 followed by 18 zeroes
- Zettabyte = 1 followed by 21 zeroes
- Yottabyte = 1 followed by 24 zeroes
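
The decimal units listed above can be turned into a small conversion helper. This is only an illustrative sketch; the function name `humanize_bytes` and the unit abbreviations are assumptions, not part of the course material.

```python
# Decimal (SI) storage units: each step is a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize_bytes(n_bytes):
    """Return a human-readable size string using powers of 1,000."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value is below 1,000,
        # or at the largest unit we know about.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(humanize_bytes(1_500_000))  # 1.5 MB
print(humanize_bytes(10**12))     # 1 TB
```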

Business Analytics Process - BAP

- Define the problem
- Prepare for modeling
- Modeling
- Deploy model
- Monitor performance
- Back to the business problem - it is a loop!

BA and decision making process

- Problem analysis
- Decision analysis
- Choice

Data Mining Techniques - Business process

- Transform data
- Act on information
- Measure the results
- Identify business opportunities

Model building techniques

1. Association (market basket analysis)
2. Classification (map each item of data into one of a predefined set of classes - e.g., decision tree, discriminant analysis)
3. Clustering (grouping data - no predefined classes)
4. Prediction (predict the value of a variable - regression analysis)
5. Sequential patterns (analyzing time series data - e.g., finding a seasonal pattern)

Business analytics life cycle CRISP

1. Business understanding 2. Data understanding 3. Data preparation 4. Modeling 5. Evaluation 6. Deployment

Challenges of big data

1. Data cannot be processed in a timely manner
2. Infrastructure and software that support high data volume, variety, velocity, and complexity are new, still being developed, and expensive (e.g., Hadoop, NoSQL databases)
3. Data governance (for accurate, high-quality data)
4. Data integration
5. Data compliance/regulation
6. Security and privacy - remember Target? 100 million customer records were compromised

Business Understanding - BA life cycle

1. Develop an understanding of the business domain, the problem, and the purpose of the DM project
2. Translate the goals into a DM problem definition - frame the business problem as an analytic challenge and formulate an initial hypothesis
3. Prepare a strategy for achieving these objectives. Assess the resources (people, technology, time, data)

Model building - BA life cycle

4. Model building
- A model is a simple representation of a complex reality
- A mathematical / statistical model is made up of variables and a mathematical relationship
a. Determine the methods, techniques, and workflow for model building
b. Explore the data to learn about the relationships between the variables
c. Select key variables (dependent, independent) and the models you are likely to use
d. Partition data into training, validation, and test data sets
e. Build and execute the model - apply multiple techniques if appropriate
f. Build the model using the training data set and estimate its quality using separate validation and test data sets - the model is fit on training data and evaluated (scored) against test data
g. Assess validity of results
h. Fine-tune models to optimize results
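
The partitioning step (training, validation, and test sets) can be sketched in a few lines. This is a minimal illustration using only the standard library; the 60/20/20 proportions, the seed, and the function name `partition` are assumptions rather than values from the course.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test sets.

    The proportions and seed are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])  # whatever is left is the test set

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))  # 60 20 20
```

The model would be fit on `train_set`, tuned against `valid_set`, and scored once on `test_set`, so the final quality estimate comes from data the model has not seen.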

Evaluate and communicate results - BA life cycle

5. Evaluate and communicate results
a. Evaluate the different models developed and pick the best model
b. Interpret the model results
c. Determine whether the model achieved the objectives - determine if you succeeded or failed
d. Identify the key findings and quantify business value
e. Summarize your findings
f. Answer the business question you are trying to address
g. Convey your findings to stakeholders
h. Compare the outcomes to your criteria for success and failure
   i. Identify key findings
   ii. Quantify business value
i. Communicate the findings and outcome to the various team members and stakeholders
   i. Summarize findings, depending on audience
   ii. Include caveats, assumptions, and any limitations of results
   iii. Clearly articulate the outcomes

Deployment - BA life cycle

6. Deployment and operationalization
a. Deliver final reports, briefings, code, and technical documents
b. Run a pilot project
c. Implement your models in a production environment
d. Integrate the model into operational systems and decision making
e. Deploy the model
f. Ongoing monitoring and maintenance of the model
   i. Everything changes over time; for example, customer preferences change and competitors make moves. A new model needs to be developed to reflect the change.
g. Run a pilot (in a production environment)
h. Assess the benefits
i. Provide final deliverables
j. Implement the model in the production environment
k. Update the process, retrain, and retire as needed
l. After deploying the model, follow up to reevaluate it after it has been in production for a period
   i. Is it still meeting the goals and expectations?
   ii. Are the desired changes occurring?
   iii. Monitor and update the model if necessary

Alternative hypothesis

The alternative hypothesis (Ha) is what we conclude when there is enough evidence to demonstrate the validity of the claim - something new is happening.

Association models

Association - what events are likely to occur together - example - perform market basket analysis to increase cross selling opportunities
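
Market basket analysis usually starts from two simple ratios, support and confidence. The sketch below computes both on a handful of made-up baskets; the data and the function names are hypothetical and only meant to illustrate the idea.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets)

def support(itemset):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the fraction that also have the consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"bread", "milk"}))       # 0.6
print(confidence({"bread"}, {"milk"}))  # 0.75
```

A high confidence for "bread, then milk" suggests a cross-selling opportunity, provided the support is large enough for the pattern to matter.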

Data mining, Business Analytics, Business Intelligence differences

BA - process of doing analysis in a domain - uses stats and DM techniques
- Data and text mining
- Process mining
- Optimization
DM - process of discovering new patterns in large data sets, involving AI and stats
- Classification
- Prediction
- Segmentation/clusters
- Association
BI - starts with data, ends with decisions

What is Big Data?

BD is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.

Exploratory Analytics

Basic analytics - what is happening - cluster, key variables, segmentation analysis

Business analytics

Broader, explaining why particular results occurred and understanding what might happen in the future, and applying what is learned within context of business problem or opportunity.

Data mining

Business process for exploring large amounts of data to discover meaningful patterns and rules. Methods such as machine learning, AI, statistical tools.

What is pairwise deletion (correlation analysis)?

A case with a missing value for variable V1 is excluded only from the analyses that include V1.
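
A minimal sketch of pairwise deletion, assuming hypothetical records in which `None` marks a missing value. Each pair of variables keeps every row where both of its values are present, so different correlations can end up with different sample sizes.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"v1": 1.0, "v2": 2.0, "v3": 5.0},
    {"v1": None, "v2": 4.0, "v3": 4.0},
    {"v1": 3.0, "v2": 6.0, "v3": None},
    {"v1": 4.0, "v2": 8.0, "v3": 2.0},
]

def pairwise_complete(rows, a, b):
    """Keep rows where both variables of this particular pair are present."""
    return [(r[a], r[b]) for r in rows
            if r[a] is not None and r[b] is not None]

print(len(pairwise_complete(records, "v1", "v2")))  # 3
print(len(pairwise_complete(records, "v2", "v3")))  # 3
print(len(pairwise_complete(records, "v1", "v3")))  # 2
```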

What are the data types measurement scales?

Categorical • Nominal • Ordinal Continuous • Interval • Ratio

What is heteroscedasticity?

Circumstance in which the variance along the line of best fit is unequal as you move along the line.

What is homoscedasticity?

Circumstance in which the variance along the line of best fit remains similar as you move along the line.

Classification models

Classify - classify based on other characteristics associated with individual - ex - fraud credit card transaction, mortgage is good or bad risk

Cluster models

Cluster - put people or things into groups by recognizing patterns -ex - assign customer to different segment based on buying patterns or other attributes, group stock by similar risk

Analytics value chain

Continuous feedback loop: Manage data - perform analytics - drive decisions Work moves from left to right but you have to think from right to left. Start with the decisions to be influenced. Analytics should support a business strategy. You need to frame the outcomes that would be valuable for the business.

What is r?

Correlation coefficient (r) - a measure of the strength of a linear association between two variables. r ranges from -1 to 1. A value closer to -1 or 1 indicates a strong relationship; a value closer to zero indicates a weak relationship. A negative value indicates that the variables move in opposite directions; a positive value indicates that they move in the same direction.
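
As a rough illustration, r can be computed directly from its definition. The price and sales figures below are made up, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

price = [10, 12, 14, 16, 18]        # hypothetical prices
units_sold = [50, 44, 41, 35, 30]   # hypothetical sales volumes
print(round(pearson_r(price, units_sold), 3))  # close to -1: strong negative relationship
```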

What is cross sectional data?

Data collected at the same time or at approximately the same point in time - it is a snapshot in time
• Time is not a variable or a factor
• Example - home price depends on the size of the house and the number of bedrooms

What is time series data?

Data collected over different time periods - daily, monthly, quarterly, yearly
• Time is an independent variable
• Examples: trend over time - Dow Jones average for the last 10 years, revenue over the last four quarters
• Daily average temperature for the last 40 years

What is data imputation?

Data imputation replaces missing data with an estimated value based on other available information, taking advantage of what is known from the rest of the data.
• Impute only continuous data
• Select multiple variables (especially those with missing values) to impute
• You can't impute categorical data because it is qualitative - when you code categorical data as numbers, JMP may treat it as continuous data
• Missing categorical data may be replaced by the mode
• JMP also handles missing data with listwise deletion and pairwise deletion
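
A minimal sketch of the simplest form of imputation, assuming hypothetical data with `None` for missing values: a continuous variable is filled with its mean and a categorical variable with its mode. This is plain Python for illustration only and is not how JMP implements imputation.

```python
import statistics

ages = [34, None, 45, 29, None, 52]        # continuous, two values missing
regions = ["East", "West", None, "East"]   # categorical, one value missing

def impute(values, estimate):
    """Replace None with an estimate computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = estimate(observed)
    return [fill if v is None else v for v in values]

print(impute(ages, statistics.mean))     # missing ages become 40
print(impute(regions, statistics.mode))  # missing region becomes 'East'
```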

Basic blocks of decision making

Data, information, insight, intelligence, action

What is the empirical rule?

The empirical rule states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean. It can be broken down into three parts: about 68% of the data falls within one standard deviation of the mean, about 95% falls within two standard deviations, and about 99.7% falls within three standard deviations.
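
The percentages in the empirical rule can be checked against the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```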

Prediction models

Estimate / predict - what is the value of the target variable - ex - predict orders for next quarter based on historical patterns so you can optimize supply chain and inventory management - regression analysis, neural networks

Descriptive Analytics

Examination of data or content - what happened. BI and visualizations. Basic analytics - summaries, trends, patterns.

Prescriptive Analytics

Examines data and content to answer what should be done? Techniques - graph analysis, simulation, complex event processing, neural networks, machine learning. Advanced - data science.

Predictive Analytics

Examines data or content to answer what is likely to happen? Techniques - regression, forecasting, multivariate stats, pattern matching, predictive modeling - advanced.

Statistical analysis includes

Exploratory analysis - become familiar with data to gain insights into structure and variables Exploratory modeling - understand potential relationships between variables and identify most important variables Predictive modeling - predict new observations - what is likely to happen

What is an F-test?

The F-test is used to find out whether the variances of two populations are significantly different.

What is an independent variable?

Factor = X • Otherwise known as a factor • Explanatory variable • Predictor variable • Input variable • It is how to achieve it • Impact on responses • It explains the response behavior • It is "cause" • Examples - Price, advertising, gender, income, age

Characteristics of Big Data

• Huge volume of data
• Complexity of data types and structures
• Speed of new data creation and growth

Difference between information and insight

Information - data that is aggregated to a level where it makes sense for decision making. Tells you what happened, helps identify problems. Insight - information that has been analyzed and interpreted. Tells you why, when, where, how it happened, helps you understand and solve problems.

Define the problem

Key outcome of this step is to define the problem and project.

Level of significance

Level of significance (alpha) - usually .05 - is the threshold for how rare the observed outcomes of the study must be before we reject the null hypothesis. If the calculated probability of observing the results (the p-value) is smaller than the level of significance (alpha), then we have enough evidence to reject the null hypothesis.
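
The decision rule can be sketched as a comparison between a p-value and alpha. The example assumes a two-sided z-test, which is a simplification chosen for illustration; the test statistics 1.2 and 2.5 are made up.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05  # conventional level of significance
for z in (1.2, 2.5):  # hypothetical test statistics
    p = two_sided_p_value(z)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"z = {z}: p = {p:.4f} -> {decision}")
# z = 1.2 leads to "fail to reject H0"; z = 2.5 leads to "reject H0"
```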

Null hypothesis

The null hypothesis (Ho) is a statement about a population that is assumed to be true - nothing new is happening. You cannot prove the null hypothesis; you can only reject it or fail to reject it. The equal sign always goes with the null hypothesis.

Graphical representation of hypothesis testing

One-tail test: the null uses ≤ or ≥ (so the alternative uses > or <). Two-tail test: the null uses just the = sign (so the alternative uses ≠).

T-test summary

T-tests are one type of hypothesis test. They examine whether two means are statistically significantly different from each other or whether the difference between them simply occurred by chance.

Monitor performance

Ongoing monitoring ensures that the model or models continue to produce the desired results.

Why companies should use BA

A company's only real differentiators are business processes executed efficiently and effectively, the data it collects, and decision making based on analyzing that data. BA helps companies become more efficient and effective, make smart decisions, and personalize products and services.

Chi Square test

Otherwise known as: Contingency analysis, Pearson's chi-square, chi-square test of association. What is it? The chi-square test for independence is used to discover if there is a relationship between two categorical variables. What variables are used? Your two variables should consist of two or more categorical, independent groups.
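
The chi-square statistic compares the observed counts in a contingency table with the counts expected if the two variables were independent. The sketch below computes the statistic and its degrees of freedom; the churn counts are hypothetical, and obtaining a p-value would additionally require the chi-square distribution (for example from a statistics package).

```python
def chi_square(observed):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows = region (East, West), columns = churned (yes, no).
table = [[30, 70], [50, 50]]
stat, dof = chi_square(table)
print(round(stat, 2), dof)  # 8.33 1
```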

Matched pairs T-test

Otherwise known as: Dependent T test, Paired T test, Paired samples t-test What is it? A dependent t-test is an example of a "within-subjects" or "repeated-measures" statistical test. This indicates that the same participants are tested more than once. Thus, in the dependent t-test, "related groups" indicates that the same participants are present in both groups. The reason that it is possible to have the same participants in each group is because each participant has been measured on two occasions on the same dependent variable. When would you use it? For example, you might have measured the performance of 10 participants in a spelling test (the dependent variable) before and after they underwent a new form of computerized teaching method to improve spelling. You would like to know if the computer training improved their spelling performance What variables are used? One dependent variable that is continuous (interval or ratio) and one categorical variable that has only two related groups
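
The spelling example can be sketched as a paired t statistic computed on the per-participant differences. The scores below are invented for illustration, and `paired_t` is a hypothetical helper name.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for paired measurements."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical spelling scores for the same 10 participants.
before = [12, 15, 11, 14, 13, 16, 10, 12, 15, 14]
after = [14, 16, 13, 15, 15, 17, 12, 13, 17, 15]
t, dof = paired_t(before, after)
print(round(t, 2), dof)  # 9.0 9
```

The t statistic would then be compared with a t distribution with n - 1 degrees of freedom to obtain a p-value.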

Two sample T-test

Otherwise known as: Independent T test What is it? Inferential statistical test that determines whether there is a statistically significant difference between the means of two unrelated groups. Example: Are there differences in salary between genders? What variables are used? One independent categorical variable with two levels or groups and one continuous dependent variable.
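
A rough sketch of the test statistic for two unrelated groups. It assumes Welch's version of the test, which does not require equal variances; the course may use the pooled-variance version instead. The salary figures are made up.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """t statistic for two independent groups (Welch's unequal-variance form)."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical salaries (in thousands) for two unrelated groups.
group_a = [52, 55, 58, 61, 64]
group_b = [48, 50, 52, 54, 56]
print(round(welch_t(group_a, group_b), 2))  # 2.35
```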

BA building blocks to success

People, organizational processes, tools/technology/data

Supervised models in data mining

Predefined outcome variable - known dependent variable
• Decide which class each case belongs to
• Find out the major characteristics that differentiate a predefined group
• Techniques: regression, logistic regression, decision trees
• Example: predict demand based on price
You have a target variable in a supervised DM technique - algorithms learn which values of the target are associated with the predictor variables

What is qualitative data?

Qualitative data - categorical data - represented by text and characters
• Examples: sales region, gender, month
• Used to group quantitative variables - for example, average sales by region, average age per gender

What is quantitative data?

Quantitative data - continuous data - is numeric data for which you can calculate mean, median, and standard deviation
• Examples: sales, income, age, GPA, temperature
• Not all numeric data is quantitative - for example, zip code, credit card number, part number

What is a dependent variable?

Response = Y • Otherwise known as the response variable • Target variable • Outcome variable • It is what we want to know • It is the variable that we want to estimate, predict, optimize • It is "effect" • Examples - Sales, grades, credit scores, fraud, default

Big data structures

Structured - data containing a defined data type, format, and structure, such as transactional data
Semi-structured - textual data files with a discernible pattern that enables parsing, such as XML
Quasi-structured - textual data with erratic data formats that can be formatted with effort, tools, and time, such as clickstream data
Unstructured - data that has no inherent structure, such as images, video, and text documents

What is Business Analytics?

The act of turning data into actions through its tool/technologies, processes, and people. Transforming data into information, insights and intelligence.

Modeling

The end result of this step is a model, or set of models, that address the problem

How is linear regression calculated?

The most common method for fitting a regression line is the method of least-squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). Because the deviations are first squared, then summed, there are no cancellations between positive and negative values.
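
The least-squares line has a closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. A minimal sketch with made-up advertising and sales figures:

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the line minimizing squared vertical deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

ad_spend = [1, 2, 3, 4, 5]   # hypothetical predictor values
sales = [2, 4, 5, 4, 5]      # hypothetical response values
slope, intercept = least_squares(ad_spend, sales)
print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend")  # sales = 2.20 + 0.60 * ad_spend
```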

What is a p-value?

The p-value is the probability of getting results at least as extreme as the observed values, assuming the null hypothesis is true.

Why do we care about different data types?

The statistical procedure that you can perform depends on the type of data that you have - continuous versus categorical • you can perform parametric statistics - calculate Z, T, F statistics - with continuous data • You can only perform non-parametric statistics - chi-square - with categorical data

What is a T-test?

The t-test is used to find out whether the means of two populations are significantly different.

Prepare for modeling

This step is about compiling data and preparing data for analysis and modeling

Deploy model

This step is putting the model to use. The final result of the process is to deploy the model for use within the organization

Drivers of BD

Transactional data - WWW, Email - Mobile, text - Sensors, Facebook - Wearables, IOT - Self driving cars

Type 1 error

Type I error is detecting an effect that is not present. Incorrect rejection of the null. You pull a fire alarm when there is no fire.

Type 2 error

Type II error is failing to detect an effect that is present. Incorrectly retaining a false null hypothesis. You should have pulled the fire alarm but you didn't.

Unsupervised models in data mining

Unsupervised technique/model
• No predefined outcome groups - no dependent or outcome variable
• Define groups of cases with similar characteristics
• Find out the structure of the data
• Techniques: cluster analysis, association / market basket analysis
• Example: categorizing customers by gender, age
• Search for patterns and structure among all variables

2 more "V"s of Big Data

Veracity, value.

3 "V"s of Big Data

Volume, velocity, variety

Correlation

What does this test do? The correlation coefficient (or Pearson correlation coefficient) is a measure of the strength of a linear association between two variables and is denoted by r. Basically, a Pearson product-moment correlation attempts to draw a line of best fit through the data of two variables, and the Pearson correlation coefficient, r, indicates how far away all these data points are from this line of best fit (i.e., how well the data points fit this new model/line of best fit).
What variables can be used? Two continuous variables.
Assumptions of r:
• Linearity: the relationship between the two variables is linear
• Bivariate normal distribution: data are from a random sample of a population where the variables are normally distributed

One sample T-test

What is it? Comparison of mean of a sample (observed) and mean of a population (expected). When do you use it? Used to determine if a sample comes from a population with a specific mean. What variables do you use? Normally distributed continuous (interval or ratio) data.

Simple Linear Regression

What is it? It is used when we want to predict the value of a variable based on the value of another variable. The variable we want to predict is called the dependent variable (or sometimes, the outcome variable). The variable we are using to predict the other variable's value is called the independent variable (or sometimes, the predictor variable).
What variables do you use? Your two variables need to be continuous.
What are the assumptions?
• Linear relationship
• Independence of observations
• Data needs to show homoscedasticity - the variance along the line of best fit remains similar as you move along the line
• Residuals (errors) are approximately normally distributed

Two Way ANOVA

What is it? The two-way ANOVA compares the mean differences between groups that have been split on two independent variables (called factors). When do you use it? The primary purpose of a two-way ANOVA is to understand if there is an interaction between the two independent variables on the dependent variable. For example, you could use a two-way ANOVA to understand whether there is an interaction between gender and educational level on test anxiety amongst university students, where gender (males/females) and education level (undergraduate/postgraduate) are your independent variables, and test anxiety is your dependent variable. What variables are used? One dependent continuous variable, more than one independent variable consisting of two or more categorical independent groups.

One way ANOVA

What is this test for? The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. What does this test do? The one-way ANOVA compares the means between the groups you are interested in and determines whether any of those means are statistically significantly different from each other What variables are used? One categorical variable consisting of two or more groups, serving as the independent variable, and one continuous variable, serving as the dependent variable.
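
The comparison of group means boils down to an F statistic: the variation between group means divided by the variation within groups. A minimal sketch with three made-up groups; `one_way_anova_f` is a hypothetical helper name, and a p-value would additionally require the F distribution.

```python
def one_way_anova_f(groups):
    """F statistic: mean square between groups / mean square within groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical scores for three independent groups.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
print(one_way_anova_f(groups))  # 12.0
```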

What is a Z score?

A z-score (also known as a standard score) indicates how many standard deviations an element is from the mean. It can be calculated from the following formula: z = (X - μ) / σ, where z is the z-score, X is the value of the element, μ is the population mean, and σ is the standard deviation.
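
The formula translates directly into code. The exam score, mean, and standard deviation below are made-up values for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical exam: mean 70, standard deviation 10, observed score 85.
print(z_score(85, mu=70, sigma=10))  # 1.5
```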

Data Understanding - BA life cycle

a. Data collection from existing database
   i. Clear description of the problem
   ii. Identify relevant data for the problem
   iii. Selected variables should not contain overlapping info
b. Diverse data sources for data mining
   i. Internal
   ii. External - government, demographic, market report
   iii. Created - research, survey
c. Data understanding process
   i. Collect initial data
   ii. Explore the data through visualization, descriptive stats
   iii. Verify the quality - missing values?
   iv. Find the outliers
   v. Hypothesis testing
   vi. Basic info about data before modeling

Data preparation - BA life cycle

a. Test and clean data for better quality
   i. Standardize formats
   ii. Handle outliers and missing or incorrect data
   iii. Check redundancies
   iv. Remove useless data
   v. Unify scales, units of analysis
   vi. Transform data - continuous to categorical
   vii. Aggregate data if necessary
   viii. Explore data - visualization
   ix. Example - if you find gender has a weak relationship with purchase (based on correlation), drop gender and select stronger variables for DM. Decide how you will handle missing data - remove or impute data.

What is listwise deletion (cluster analysis)?

• Method for handling missing data
• An entire record is excluded from analysis if any single value is missing
• Listwise deletion reduces the statistical power of the test conducted, since statistical power relies in part on a large sample size
• It may also be an issue if the data are not missing at random - for example, if people leave a sensitive question blank in a questionnaire, this could bias the findings
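
A minimal sketch of listwise deletion, assuming hypothetical records in which `None` marks a missing value. Any record with at least one missing value is dropped entirely, which is why the sample can shrink quickly.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"v1": 1.0, "v2": 2.0, "v3": 5.0},
    {"v1": None, "v2": 4.0, "v3": 4.0},
    {"v1": 3.0, "v2": 6.0, "v3": None},
    {"v1": 4.0, "v2": 8.0, "v3": 2.0},
]

def listwise_complete(rows):
    """Keep only the records that have no missing values at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

complete = listwise_complete(records)
print(len(records), "->", len(complete))  # 4 -> 2
```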

What is ratio scale?

• Continuous data that has a natural zero
• Time is ratio scale since zero time is meaningful
• Weight, height, age, income, miles per gallon are all ratio data
• The intervals between values are equal
• The ratio between two numbers is meaningful
• All mathematical operations are permissible
• Highest level in terms of information quality
• Can perform parametric statistics with this type of data

What is interval scale?

• Continuous data where the intervals between values are equal - examples: temperature in degrees Fahrenheit or Celsius. The difference between 39 and 40°F is the same magnitude as the difference between 79 and 80°F, but 80°F is not twice as hot as 40°F
• There is no true zero, so it is not possible to compute meaningful ratios
• You can add and subtract but you cannot meaningfully multiply and divide - example: 20°F is 10°F warmer than 10°F, but it is not twice as hot
• Very few data are of this type
• You can perform parametric statistics with this type of data

Skewed Curves - Distributions

• Negative skewness - observations fall towards the high side of the scale, with few low observations
• Positive skewness - observations fall towards the low side of the scale, with few high observations
• Many techniques will not work well if the data are skewed
• Parametric techniques such as ANOVA and regression need a normal distribution
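
The direction of skew can be checked numerically. The sketch below uses the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), which is one common definition among several; the data sets are made up.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (positive means a long right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

incomes = [20, 22, 25, 27, 30, 120]  # mostly low values, one very high value
scores = [10, 92, 94, 95, 97, 99]    # mostly high values, one very low value
print(skewness(incomes) > 0)  # True: positive skewness
print(skewness(scores) < 0)   # True: negative skewness
```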

What is ordinal scale?

• Refers to data that have a natural order or ranking - example: bigger, faster, longer
• With ordinal data, you cannot state with certainty whether the intervals between values are equal
• Example - we often use rating scales to rate a movie. On a five-point scale, the difference between a five and a four is not necessarily the same as the difference between a four and a three
• Attitudinal scales and the Likert-type questions you usually see on a survey are ordinal - although many points on the scale are likely of equal intervals and are used as continuous data by researchers
• Easy to remember - ordinal sounds like order

What is nominal scale?

• Used to describe data that are categorical, qualitative
• Examples: gender, type of car, ethnicity, ZIP Code, city name
• The categories can be defined using names or text (examples: male, female) or numbers such as a ZIP Code
• Does not provide a lot of information
• Can be used for classification
• Cannot add or subtract, even when the categories are represented by numbers - example: male = 1, female = 2

