MAR6930 AI and Data Science Midterm
The Analytics Process - Business Understanding
"Business Understanding" can also be described as "domain knowledge" or "functional knowledge" ▪ What is the business problem? ▪ How does the business problem fit into business process? ▪ What would successfully solving the problem look like? Define success. ▪ Assess the kinds of analytics models that would create the recommendation or action to solve the business problem. ▪ Defining how the analytics solution would fit back into the business process. ▪ Assess what key variables might be to solve the problem. ▪ Assess data sources that might be available to solve the problem.
Analytics Techniques - Time Series Models (also "Forecasting Models")
Description: Time Series Models use historical data to predict future time-period observations of a variable of interest. Used in supply chain, finance, and accounting. Often built in Excel.
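As a minimal sketch of the idea, the snippet below forecasts the next period with simple exponential smoothing, which is one of many forecasting techniques and not necessarily the one used in the course. The function name, the `alpha` value, and the monthly sales figures are all hypothetical.

```python
def exponential_smoothing_forecast(observations, alpha=0.5):
    """Return a one-step-ahead forecast using simple exponential smoothing."""
    if not observations:
        raise ValueError("need at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = observations[0]  # start the smoothed level at the first observation
    for value in observations[1:]:
        # Blend the newest observation with the previous smoothed level.
        level = alpha * value + (1 - alpha) * level
    return level  # the final level is the forecast for the next period


monthly_units = [100, 110, 105, 120]  # hypothetical past sales, oldest first
forecast = exponential_smoothing_forecast(monthly_units, alpha=0.5)
print(forecast)
```

A higher `alpha` makes the forecast react faster to recent periods; a lower one smooths out more noise.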
The Analytics / Data Science Team
- Data Scientist / Project Lead (domain expertise and analytics methodologies) - Functionalist / Subject Matter Expert (deep functional knowledge, understands databases) - Modeler / Statistician (performs model development, statistical expertise) - Data Engineer / DBA (develops data and solution architecture) - Technical or Business Analyst (assists with data preparation, documents artifacts, makes proof of concept, prepares business briefing)
Why Do Businesses Care about Marketing Analytics?
- Make better investment decisions - Understand customers better - Attract new customers - Make more relevant offers to potential customers - Predict customer behavior - Retain existing customers - Decide on the importance of product or service features to customers - Sophisticated understanding of customers - Forecasting - New product development decisions - Segmentation, targeting, positioning decisions
Standard analytics methodologies
▪ Permit teams across an organization to have a common approach to analytics projects ▪ Help non-analytics professionals understand analytics projects ▪ Help managers make decisions about investments in analytics projects, people, data and software ▪ Help analytics team members make decisions and take actions during an analytics project ▪ Describe in detail both large and small steps to consider taking in your analytics project ▪ Identify responsibilities of team members who conduct the analytics project
Four major activities in data understanding
- collect initial data - describe the data - explore the data - verify the data quality
CRISP-DM (SPSS/IBM) Process
1. Business understanding 2. Data understanding 3. Data preparation 4. Modeling 5. Evaluation 6. Deployment
Firm-specific analytics methodology
1. Understand the Business Problem 2. Evaluate Data Sources 3. Techniques to Address the Problem 4. Select the Right Technology 5. Produce Analytic Outcomes 6. Integrate with Business Processes and Systems
ETL (extraction, transformation, and loading)
A process that extracts information from internal and external databases, transforms the information using a common set of enterprise definitions (cleaning with wrangling, constructing with munging and feature engineering, integrating by combining data), and loads the information into a data warehouse
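The three ETL steps can be sketched with Python's standard library. This is only an illustration under stated assumptions: the CSV text stands in for a source system, an in-memory SQLite database stands in for the data warehouse, and the table name, column names, and cleaning rules are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (messy on purpose).
RAW_CSV = """customer_id,state,spend
1, fl ,120.50
2,GA,
3,FL,80
"""


def extract(text):
    """Extract: read raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: standardise state codes and drop rows with missing spend."""
    cleaned = []
    for row in rows:
        spend = row["spend"].strip()
        if not spend:
            continue  # one possible rule for missing values: skip the row
        state = row["state"].strip().upper()
        cleaned.append((int(row["customer_id"]), state, float(spend)))
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into the warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customer_spend "
        "(customer_id INTEGER, state TEXT, spend REAL)"
    )
    connection.executemany("INSERT INTO customer_spend VALUES (?, ?, ?)", records)
    connection.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(transform(extract(RAW_CSV)), warehouse)
total_by_state = warehouse.execute(
    "SELECT state, SUM(spend) FROM customer_spend GROUP BY state"
).fetchall()
print(total_by_state)
```

Keeping extract, transform, and load as separate functions mirrors the definition and makes each step easy to test on its own.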
Analytics Techniques - Social Network Analysis (SNA)
Used, for example, to find fraud rings and how their members are connected.
SEMMA Methodology
Analytics process designed by SAS • Focuses on the model building process instead of the entire end-to-end process • Provides more details on the modeling step of the CRISP-DM methodology.
The Predictive Analytics Lifecycle - SAS
Another way to view the analytics methodology is through the roles of the individuals assigned to the team • This SAS analytics model focuses on the different roles that perform each task: business manager (domain expert, makes decisions, ROI evaluation), business analyst (data exploration, visualization, report creation), data miner/statistician (exploratory analysis, descriptive segmentation and marketing), IT systems management (model validation, deployment, monitoring, data preparation)
Artificial Neural Networks (ANN)
An Artificial Neural Network (ANN) is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information.
The Cross-Industry Standard Process for Data Mining (CRISP-DM) - SPSS / IBM
Based on what we have seen in the previous analytics models, we will follow the CRISP-DM model to illustrate how to complete marketing analytics projects • The CRISP-DM model has the benefit of defining the sub-steps and deliverables for each of the major steps in the analytics modeling process
Major analytics methodologies used by firms and organizations:
▪ Cross-Industry Standard Process for Data Mining (CRISP-DM) - most popular ▪ Sample, Explore, Modify, Model, and Assess (SEMMA) ▪ Foundation Methodology for Data Science (FMDS) ▪ Agile Analytics Project Model ▪ Predictive Analytics Lifecycle ▪ Team Data Science Process (TDSP) ▪ Firm-specific methodologies
Analytics Techniques - Classification and Regression Trees (CART)
Classification and Regression Trees (CART) or Decision Trees. Works by partitioning the predictor variables and pruning using validation data. Used to predict the outcome for a record or event.
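The partitioning step can be illustrated with a single split on one numeric predictor. The sketch below picks the threshold that minimises weighted Gini impurity, which is a common splitting criterion for classification trees. It covers only one split: a full tree repeats this recursively, and pruning with validation data is not shown. The age and purchase data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    total = len(labels)
    if total == 0:
        return 0.0
    impurity = 1.0
    for cls in set(labels):
        share = labels.count(cls) / total
        impurity -= share ** 2
    return impurity


def best_split(values, labels):
    """Find the threshold on one predictor that minimises weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between neighbouring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


ages = [22, 25, 30, 41, 45, 52]                   # hypothetical predictor
bought = ["no", "no", "no", "yes", "yes", "yes"]  # hypothetical outcome
threshold, impurity = best_split(ages, bought)
print(threshold, impurity)
```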
Business intelligence
Conversion of large amounts of data into summarized information for the decision-maker. Summarized information can be used for reports, dashboards, data visualizations, and data files. Includes: pivot tables, data cubes, query-enabled large datasets, etc.
ETL in depth (extract, transform, load)
The data engineer builds and maintains data systems, cleans and wrangles data into a usable state, determines the SQL, develops data ingestion methods, and establishes and interacts with data tables and lakes. Data preparation is typically 70-80% of a project.
The Analytics Process - Data Understanding
Data inside the organization (ERP system, CRM, customer service info, web server info, online shopping info, web cookie info, cell phone app info, web and audio files). Data outside the organization (data brokers such as AnalyticsIQ and Spokeo, open-source or scraped info, social media, US Census).
Analytics Techniques - Optimization Models
Determine the best possible combination of variables to produce a specified outcome. The outcome is the objective function. Restrictions on the range of inputs are constraints. Useful, for example, to find which combination of three different products will yield the most net income. Uses linear and nonlinear programming.
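To make the terms concrete, here is a small product-mix sketch with an objective function and constraints. It uses brute-force enumeration over whole-unit plans, not linear programming, because that keeps the example dependency-free; a real problem of any size would use an LP or nonlinear solver. All profit figures, hour requirements, and limits are hypothetical.

```python
from itertools import product

# Hypothetical inputs for three products.
PROFIT_PER_UNIT = {"A": 25, "B": 20, "C": 45}  # net income per unit
HOURS_PER_UNIT = {"A": 2, "B": 1, "C": 3}      # machine hours per unit
AVAILABLE_HOURS = 12                           # constraint: total machine hours
MAX_UNITS = 5                                  # constraint: demand cap per product


def objective(plan):
    """Objective function: total net income of a production plan."""
    return sum(PROFIT_PER_UNIT[name] * units for name, units in plan.items())


def hours_used(plan):
    """Machine hours consumed by a production plan."""
    return sum(HOURS_PER_UNIT[name] * units for name, units in plan.items())


def is_feasible(plan):
    """A plan is feasible when it respects the machine-hours constraint."""
    return hours_used(plan) <= AVAILABLE_HOURS


best_plan, best_profit = None, float("-inf")
# Enumerate every whole-unit combination within the demand cap.
for quantities in product(range(MAX_UNITS + 1), repeat=len(PROFIT_PER_UNIT)):
    plan = dict(zip(PROFIT_PER_UNIT, quantities))
    if is_feasible(plan) and objective(plan) > best_profit:
        best_plan, best_profit = plan, objective(plan)

print(best_plan, best_profit)
```

Several plans can tie for the best profit; this loop keeps the first one it finds.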
Supervised learning
Humans provide labels to data related to the outcome of a task in order to "train" the computer algorithm
Metric variables (also called numeric variables)
Interval: numbers that are an equal distance apart but do not have an absolute zero value, e.g. temperature in Celsius or Fahrenheit. Ratio: numbers that do have an absolute zero value and can be expressed as a meaningful ratio, e.g. 50%, 20 lbs. Other data types: Boolean (binary true/false, 0 or 1); factors (categorical: nominal, binomial, ordinal); integer (whole numbers); text (string: categorical, nominal, binomial, ordinal); date (numeric or categorical data).
Predictive modeling
Key component of artificial intelligence. Creates a probability estimate that an event of interest will occur using one or more variables. Expressed as a percentage or as a certainty using a cutoff score. Includes: logistic regression, random forests, neural networks.
Categorical variables (or non-metric variables):
Nominal (names only); Binomial (yes / no); Ordinal (categories that can be placed in order, e.g. low, medium, high).
Model assessment - precision and recall
Precision: fraction of relevant examples (true positives) among all examples that were predicted to belong to a certain class. Recall: fraction of examples that were correctly predicted to belong to a class (true positives) among all examples that truly belong to the class.
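Both measures can be computed from counts of true positives, false positives, and false negatives. The sketch below uses hypothetical label lists, with 1 as the class of interest.

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the class labelled `positive`."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    # Precision: of everything predicted positive, how much was truly positive?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of everything truly positive, how much did the model find?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


actual = [1, 1, 1, 0, 0, 0, 1, 0]     # hypothetical true labels
predicted = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical model output
precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```

Here the model makes one false alarm and misses one true case, so both measures come out at 3 out of 4.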
Return on Marketing Investment (ROMI) Formula
Return on Marketing Investment (ROMI) = [Incremental Revenue Attributable to Marketing ($) * Contribution Margin (%) - Marketing Spending ($)] / Marketing Spending ($)
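A worked example, assuming the standard form of the formula in which the bracketed numerator is divided by marketing spending. The campaign figures are hypothetical.

```python
def romi(incremental_revenue, contribution_margin, marketing_spending):
    """Return on Marketing Investment as a fraction (1.0 means 100%).

    Assumes the standard form:
    (incremental revenue * contribution margin - spending) / spending.
    """
    if marketing_spending <= 0:
        raise ValueError("marketing spending must be positive")
    incremental_contribution = incremental_revenue * contribution_margin
    return (incremental_contribution - marketing_spending) / marketing_spending


# Hypothetical campaign: $200,000 incremental revenue, 50% margin, $50,000 spend.
campaign_romi = romi(200_000, 0.50, 50_000)
print(f"{campaign_romi:.0%}")
```

A negative result means the campaign's incremental contribution did not cover what was spent on it.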
Data Munging
Same as data wrangling (cleaning and unifying complex data for easy analysis and use). Can also include creating new features in the data sets or data tables.
Multivariate statistics
Statistical analyses that describe how individual variables within a dataset relate to one another. Understanding these relationships can be an important next step in developing analytics models or data visualizations. Includes: ANOVA, t-test, z-test, scatterplot, multiple regression.
Unsupervised learning
The computer doesn't receive labels from humans, but looks to find "structure" in the input data. This can be the first step in labeling the data for further supervised learning.
Reinforcement learning
The computer interacts with the environment and uses data from the environment to reward or penalize the algorithm as it is exposed to more data (or scenarios). Example - cars under varying conditions like speed, cornering, acceleration, etc.
Data Storage
The physical or cloud-based location where you store data for analytics. Can be in the analytics program (Excel), commercial database software (MS Access, Oracle, SQL), open source database software (MySQL), or cloud-based data storage (AWS, Google, IBM).
Team Data Science Process (TDSP) Lifecycle - Microsoft
This version of the analytics process model focuses on development of analytics projects in the cloud • Cloud-based data acquisition and storage permits more big data efforts and analytics techniques than some on-premise environments • The 4 major steps listed here still have the same features of the CRISP-DM model
Artificial intelligence (AI)
Utilization of computers and algorithms to perform a human action or activity. Many AI apps use neural networks. AI differs from predictive modeling because it not only creates a prediction, but often acts on the prediction. Includes: computer vision, natural language processing, speech recognition, driving vehicles, etc.
Big Data
▪ Volume - measured in absolute size of data ▪ Variety - structured or unstructured data ▪ Velocity - speed at which data is exposed to models or algorithms ▪ Veracity - quality/accuracy of data that analytics models are exposed to
Comparing across distributions
When you compare across distributions, you can discover how some of these distributions might complement each other in analytics models. Before comparing, the data must be normalized, outliers removed, and the bin sizes and data range for a histogram chosen.
Univariate data analysis
a way to discover characteristics of individual variables in a dataset or database (descriptive statistics or exploratory data analysis). Can include histograms, boxplots, and density plots, as well as statistical distribution analysis (mean, median, z-scores).
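A minimal univariate summary using Python's standard `statistics` module. The order values are hypothetical, and the z-scores use the population standard deviation (`pstdev`); a sample standard deviation (`stdev`) would be the other reasonable choice.

```python
import statistics

order_values = [20, 22, 25, 25, 28, 30, 60]  # hypothetical single variable

mean = statistics.mean(order_values)
median = statistics.median(order_values)
std_dev = statistics.pstdev(order_values)  # population standard deviation

# z-score: how many standard deviations each value sits from the mean.
z_scores = [(value - mean) / std_dev for value in order_values]

# Flag values more than 2 standard deviations from the mean as possible outliers.
possible_outliers = [v for v, z in zip(order_values, z_scores) if abs(z) > 2]

print(mean, median, round(std_dev, 2), possible_outliers)
```

The gap between the mean and the median here hints at a skewed distribution, which a histogram or boxplot would show visually.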
Analytics Techniques - Naïve Bayes Classifier
assigns observations to different "classes." Classes are groups of observations that are similar; classification is a frequent machine learning problem. Used with observations that have categorical predictor variables. Uses prior probabilities of similar classes to predict the class of the current observation. Used, for example, to detect fraudulent financial reporting.
The Analytics Process - Model Assessment
best method depends upon the analytics technique ▪ e.g. R-squared (R²) in linear regression models ▪ Confusion matrix for binary classification models. To perform a validation, most data scientists use a train-test approach • A subset of the data is used to develop the model • A portion of the data is "held out" from training the model to be used to test the model • Depending on the type of model, the "hold out" data may be randomly selected
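The train-test approach and R-squared can be sketched together: hold out a random portion of the data, fit a simple linear regression on the rest, then score the fit on the held-out records. The data, the 25% hold-out fraction, and the seed are hypothetical choices.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=42):
    """Randomly hold out a fraction of the records for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(points, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


# Hypothetical data: a linear trend with a small alternating disturbance.
data = [(x, 3 * x + 5 + (-1) ** x) for x in range(1, 21)]
train, test = train_test_split(data)
slope, intercept = fit_line(train)       # the model sees only the training data
holdout_r2 = r_squared(test, slope, intercept)  # scored on the held-out data
print(round(slope, 2), round(holdout_r2, 3))
```

Scoring on held-out data, instead of the data used for fitting, gives a more honest view of how the model would perform on new records.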
Data Wrangling
cleaning and unifying messy and complex data sets for easy access and analysis. Involves data acquisition, joining data, data cleansing.
Analytics Techniques - Ensemble Models
combine multiple models (potentially using multiple techniques) to create a single output model. Used, for example, to forecast using multiple models. Can combine multiple model types; the only restriction is that the models' outputs can be combined.
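One simple way to combine models is majority voting, sketched below. The three "models" are hypothetical hand-written rules standing in for trained models; averaging or weighting the outputs are other common ways to combine them.

```python
from collections import Counter


# Three hypothetical stand-in models, each predicting "churn" or "stay".
def low_spend_rule(customer):
    return "churn" if customer["monthly_spend"] < 20 else "stay"


def support_calls_rule(customer):
    return "churn" if customer["support_calls"] >= 3 else "stay"


def short_tenure_rule(customer):
    return "churn" if customer["tenure_months"] < 6 else "stay"


def ensemble_predict(models, record):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(record) for model in models)
    return votes.most_common(1)[0][0]


MODELS = [low_spend_rule, support_calls_rule, short_tenure_rule]
customer = {"monthly_spend": 15, "support_calls": 1, "tenure_months": 4}
prediction = ensemble_predict(MODELS, customer)
print(prediction)
```

Using an odd number of models avoids tied votes when there are two classes.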
Reports
computer outputs that are predesigned to utilize data to convey information to a human decision-maker. Include text, graphs, spreadsheets, etc.
The Analytics Process - Modeling
the data scientist and statistician apply the best modeling technique given the business problem and the data available. ▪ Selecting the modeling techniques that match the business problem and data ▪ Building multiple models with multiple data sources and elements ▪ Choosing the best model and tweaking the parameter settings ▪ Assessing the outcome ▪ Revising the model parameters based on the assessment
The Analytics Process - Deployment
the deployment stage is where the data science model is integrated with the firm's systems and business processes • The model recommended by the data science team may have to be modified in order to be implemented "in production" ▪ For example, in order to run on a phone, the model may have to be rewritten in Java ▪ If the program is run against a live data stream instead of batch processed, it may have to be rewritten ▪ Example - algorithms that detect potential fraud on credit card transactions. Generally follows a similar CRISP-DM approach.
Data Types
describes the kind of data that can be collected and utilized in analytics models and data science applications. • Important because each type of model requires specific data types to calculate an outcome correctly. • Categorical variables (or non-metric variables): • Nominal - Variables that are names only (e.g. Boy / Girl); can be counted, but not put into order • Binomial - special case of nominal (yes / no or 0, 1) - if numeric, the numbers can be assigned without regard to order • Ordinal - categories that can be placed in order (e.g. low, medium, high or assigned 1, 2,
Analytics Techniques - General Linear Regression (GLM)
extension of ordinary linear regression that makes less restrictive assumptions about the predicted variable than ordinary linear regression models. Used for regression with factors in social science, e.g. multifactor ANOVA.
Analytics
extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact based management to drive decisions and actions.
Analytics Techniques - Cluster Analysis
family of analytics methods that group observations which are alike into groups. Used, for example, to group cars into categories based on their characteristics.
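As one member of this family, k-means can be sketched in a few lines. This is a bare-bones version with a naive initialisation (the first k points); library implementations use better starting points and usually standardise the variables first. The car data (engine litres, seats) is hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, iterations=20):
    """Group points into k clusters; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # naive start: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)
        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    labels = [nearest_centroid(point, centroids) for point in points]
    return centroids, labels


# Hypothetical cars described by (engine litres, number of seats).
cars = [(1.2, 4), (5.0, 2), (1.4, 5), (1.6, 5), (4.4, 2), (6.2, 2)]
centroids, labels = k_means(cars, k=2)
print(labels)
```

Cars that share a label were judged alike on these two characteristics; the number of clusters `k` is chosen by the analyst.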
Analytics Techniques - Structural Equation Models
fit networks of data to constructs in a series of models. Constructs are conceptual variables (latent) that are estimated by series of measurable variables.
Agile analytics project model
There are many different models for Agile Analytics development • Many of the steps mimic the CRISP-DM process • The major difference in the Houston Analytics model is the assumption that both the problem and solution teams are working on their respective efforts at the same time • Both teams perform the evaluation step
Data munging and Data wrangling
ingesting and converting data into a format usable by analytics models. Data munging may also involve the creation of new variables from existing variables. May involve manipulating a data set to deal with missing data. May also involve combining data sources together to create a single source for analytics models.
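The three activities named above (creating new variables, dealing with missing data, combining sources) can be sketched on a tiny example. The records, field names, and the choice to fill missing values with the median are all hypothetical; dropping incomplete rows is another common option.

```python
from statistics import median

# Hypothetical source 1: order records, one with a missing value.
orders = [
    {"customer_id": 1, "units": 3, "unit_price": 10.0},
    {"customer_id": 2, "units": None, "unit_price": 12.0},
    {"customer_id": 3, "units": 5, "unit_price": 8.0},
]
# Hypothetical source 2: a customer-to-region lookup.
regions = {1: "South", 2: "West", 3: "South"}

# Missing data: fill missing unit counts with the median of the known values.
known_units = [order["units"] for order in orders if order["units"] is not None]
fill_value = median(known_units)

prepared = []
for order in orders:
    units = order["units"] if order["units"] is not None else fill_value
    prepared.append({
        "customer_id": order["customer_id"],
        "units": units,
        "unit_price": order["unit_price"],
        # New variable created from existing variables.
        "revenue": units * order["unit_price"],
        # Combining sources: attach the region from the second source.
        "region": regions.get(order["customer_id"], "Unknown"),
    })

for row in prepared:
    print(row)
```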
Data Understanding
involves collecting the data, identifying problems in the data and developing initial understanding about the data. • Identifying data type/s • Identifying data size • Identifying data velocity • Identifying data veracity • Data storage method & technology
Analytics Techniques - Support Vector Machines (SVM)
machine learning method of classifying groups using hyper-planes. The analytics method identifies the hyper-plane that best separates the groups. Used, for example, to classify images.
Analytics Techniques - Multiple Discriminant Analysis (MDA)
model-based approach to classifying observations. Used, for example, to predict corporate failure.
Analytics Techniques - Text Mining Analytics Models
text mining models include a series of methods that convert plain text into categories or similar groups. Used, for example, to determine groups of documents that have related topics.
Statistical Distribution Analysis:
numerically describing how data for a single variable is distributed across the range of values in the variable observations. • Important when analyzing how a variable is likely to "fit" or add additional value in a predictive model • Can be described mathematically using distribution functions and parameters.
The Analytics Process - Modeling Techniques
o Time series / forecasting models o Optimization models o Naïve Bayes Classifier o Classification and Regression Trees o Random Forests o Logistic Regression o Ensemble Models o Sentiment Analysis o Cluster Analysis
Data Analytics
series of data-driven capabilities that extract meaning from data and facilitate better decision-making. Includes machine learning, artificial intelligence, etc.
Analytics Techniques - Sentiment Analysis
specific kind of Text Analytics modeling technique that identifies a writer's attitude toward the topic they are writing about. Used, for example, to predict from Yelp reviews whether you'll like a restaurant. Uses latent semantic analysis or support vector machines.
Descriptive statistics
statistical analyses that contribute to understanding of data or are used in the initial steps of more complex analytics process. Sometimes called univariate statistics or analyses, performed on single variables within a set of data. Include: variable type, means, mode, median, histogram, box plot, etc.
Analytics Techniques - Conjoint Analysis
survey-based analytic method designed to determine the optimal mix of features and pricing to include in a product offering
Data visualization
technique to convey information about data utilizing charts, graphs, and dashboards. Can be predesigned or self-designed depending on software and display device. Visualization of data enhances insights, comprehension, retention. Helps with storytelling.
Analytics Techniques - Logistic Regression
the predicted (dependent) variable is a 1 or 0, yes or no, or categorical (more than 2 outcome cases). Used to estimate the probability of an event.
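A sketch of how a fitted logistic regression turns predictor values into a probability and then into a 1/0 prediction with a cutoff score. The coefficients below are hypothetical and are not fitted to any data; in practice they would be estimated from training data.

```python
import math


def predict_probability(features, weights, intercept):
    """Logistic function applied to a weighted sum of the predictors."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))


def classify(probability, cutoff=0.5):
    """Turn the probability into a 1/0 prediction using a cutoff score."""
    return 1 if probability >= cutoff else 0


# Hypothetical coefficients for (support_calls, tenure_months).
WEIGHTS = [0.8, -0.05]
INTERCEPT = -1.0

customer = [4, 2]  # 4 support calls, 2 months of tenure
churn_probability = predict_probability(customer, WEIGHTS, INTERCEPT)
churn_prediction = classify(churn_probability)
print(round(churn_probability, 3), churn_prediction)
```

Raising the cutoff makes the model predict the event less often, trading recall for precision.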
Analytics Techniques - Artificial Neural Networks (ANN)
use input layers, hidden layers and output layers to classify observations. Used, for example, to classify images.
Analytics Techniques - Multidimensional Scaling (MDS) - Also called Perceptual Mapping
used to measure the distance between all observations in multiple dimensions, e.g. to compare car characteristics in multiple dimensions.
Machine learning
uses data with algorithms to create a prediction or an output used for an action or task. Models are trained on the training data; learning is captured in weights or coefficients that determine the impact of variables upon the output. Learning can be supervised (data is labeled by humans) or unsupervised (the algorithm finds structure in the data). Deep learning is recursive.
Dashboards
visual representations of data that convey information to a human decision-maker. Dashboards can be either fixed representations of the data or they can be "Self-Service." Fixed dashboards are predesigned for a specific decision-maker or decision. Self-service dashboards allow the decision-maker to connect to data and customize the visualization of the data to their specific needs.