MAR6930 AI and Data Science Midterm

The Analytics Process - Business Understanding

"Business Understanding" can also be described as "domain knowledge" or "functional knowledge" ▪ What is the business problem? ▪ How does the business problem fit into business process? ▪ What would successfully solving the problem look like? Define success. ▪ Assess the kinds of analytics models that would create the recommendation or action to solve the business problem. ▪ Defining how the analytics solution would fit back into the business process. ▪ Assess what key variables might be to solve the problem. ▪ Assess data sources that might be available to solve the problem.

Analytics Techniques - Time Series Models (also "Forecasting Models")

Description: Time Series Models use historical data to predict future-period observations of a variable of interest. Used in supply chain, finance, and accounting. Can be built in Excel.
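As a rough illustration of forecasting a future period from past observations, here is a minimal moving-average sketch. The function name `moving_average_forecast`, the window size, and the sales figures are assumptions made for this example; real forecasting models are usually more elaborate.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next period as the mean of the last `window` observations."""
    if window <= 0 or len(history) < window:
        raise ValueError("need at least `window` observations")
    recent = history[-window:]
    return sum(recent) / window


# Hypothetical monthly unit sales, for illustration only.
monthly_sales = [100, 110, 120, 130, 125, 135]
next_month = moving_average_forecast(monthly_sales, window=3)
print(next_month)
```

A larger window smooths out more noise but reacts more slowly to recent changes.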

The Analytics / Data Science Team

- Data Scientist / Project Lead (domain expertise and analytics methodologies)
- Functionalist / Subject Matter Expert (deep functional knowledge, understands databases)
- Modeler / Statistician (performs model development, statistical expertise)
- Data Engineer / DBA (develops data and solution architecture)
- Technical or Business Analyst (assists with data preparation, documents artifacts, builds proofs of concept, prepares business briefings)

Why Do Businesses Care about Marketing Analytics?

- Make better investment decisions
- Understand customers better
- Attract new customers
- Make more relevant offers to potential customers
- Predict customer behavior
- Retain existing customers
- Decide on the importance of product or service features to customers
- Sophisticated understanding of customers
- Forecasting
- New product development decisions
- Segmentation, targeting, positioning decisions

Standard analytics methodologies

▪ Permit teams across an organization to have a common approach to analytics projects
▪ Help non-analytics professionals understand analytics projects
▪ Help managers make decisions about investments in analytics projects, people, data and software
▪ Facilitate analytics team members to make decisions and take actions during an analytics project
▪ Describe in detail both large and small steps to consider taking in your analytics project
▪ Identify responsibilities of team members who conduct the analytics project

Four major activities in data understanding

- collect initial data - describe the data - explore the data - verify the data quality

CRISP-DM (SPSS/IBM) Process

1. Business understanding 2. Data understanding 3. Data preparation 4. Modeling 5. Evaluation 6. Deployment

Firm specific analytics methodology

1. Understand the Business Problem 2. Evaluate Data Sources: 3. Techniques to Address the Problem 4. Select the Right Technology 5. Produce Analytic Outcomes: 6. Integrate with Business Processes and Systems

ETL (extraction, transformation, and loading)

A process that extracts information from internal and external databases, transforms the information using a common set of enterprise definitions (cleaning with wrangling, constructing with munging and feature engineering, integrating by combining data), and loads the information into a data warehouse.

Analytics Techniques - Social Network Analysis (SNA)

Analyzes how the entities in a network are connected to one another. Used, for example, to find fraud rings and how their members are connected.

SEMMA Methodology

Analytics process designed by SAS • Focuses on the model building process instead of the entire end-to-end process • Provides more details on the modeling step of the CRISP-DM methodology.

The Predictive Analytics Lifecycle - SAS

Another way to view the analytics methodology is through the roles of the individuals assigned to the team • This SAS analytics model focuses on the different roles that perform each task: business manager (domain expert, makes decisions, ROI evaluation), business analyst (data exploration, visualization, report creation), data miner/statistician (exploratory analysis, descriptive segmentation and marketing), and IT systems management (model validation, deployment, monitoring, data preparation).

Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) are an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information.

The Cross-Industry Standard Process for Data Mining (CRISP-DM) - SPSS / IBM

Based on what we have seen in the previous analytics models, we will follow the CRISP-DM model to illustrate how to complete marketing analytics projects • The CRISP-DM model has the benefit of defining the sub-steps and deliverables for each of the major steps in the analytics modeling process

Major analytics methodologies used by firms and organizations:

▪ Cross-Industry Standard Process for Data Mining (CRISP-DM) - most popular
▪ Sample, Explore, Modify, Model, and Assess (SEMMA)
▪ Foundation Methodology for Data Science (FMDS)
▪ Agile Analytics Project Model
▪ Predictive Analytics Lifecycle
▪ Team Data Science Process (TDSP)
▪ Firm-specific methodologies

Analytics Techniques - Classification and Regression Trees (CART)

Classification and Regression Trees (CART) or Decision Trees. Works by partitioning the predictor variables and pruning using validation data. Used to predict a record or event.

Business intelligence

Conversion of large amounts of data into summarized information for the decision-maker. Summarized information can be used for reports, dashboard, data visualization, data file. Includes: pivot table, data cubes, query-enabled large datasets, etc.

ETL in depth (extract, transform, load)

Data engineer builds and maintains data systems, cleans and wrangles data into usable state, determines SQL, develops data ingestion methods, establishes and interacts with data tables and lakes. Data preparation is 70 - 80% of project.

Analytics process data understanding

Data inside the organization: ERP system, CRM, customer service info, web server info, online shopping info, web cookie info, cell phone app info, web and audio files. Data outside the organization: data brokers such as AnalyticsIQ and Spokeo, open-source and scraped information, social media, US Census.

Analytics Techniques - Optimization Models

Determine the best possible combination of variables to produce a specified outcome. The outcome is the objective function. Restrictions on the range of inputs are constraints. Useful, for example, to find which combination of three different products will yield the most net income. Uses linear and nonlinear programming.
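The three-product example can be sketched as a tiny optimization problem. The version below simply enumerates every whole-number production plan rather than using a linear programming solver, which is only practical for very small problems. All figures (`INCOME`, `HOURS`, `HOURS_AVAILABLE`, `MAX_UNITS`) are hypothetical.

```python
from itertools import product

# Hypothetical figures: net income per unit and machine hours per unit.
INCOME = {"A": 30, "B": 20, "C": 24}
HOURS = {"A": 3, "B": 1, "C": 2}
HOURS_AVAILABLE = 10  # constraint: total machine hours
MAX_UNITS = 5         # constraint: at most 5 units of any one product


def best_mix():
    """Try every whole-number plan and keep the feasible one with the highest income."""
    best_plan, best_income = None, float("-inf")
    for a, b, c in product(range(MAX_UNITS + 1), repeat=3):
        plan = {"A": a, "B": b, "C": c}
        hours_used = sum(HOURS[p] * units for p, units in plan.items())
        if hours_used > HOURS_AVAILABLE:
            continue  # violates the machine-hours constraint
        income = sum(INCOME[p] * units for p, units in plan.items())  # objective function
        if income > best_income:
            best_plan, best_income = plan, income
    return best_plan, best_income


plan, income = best_mix()
print(plan, income)
```

Here the objective function is total net income, and the constraints are the hour budget and the per-product cap.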

Supervised learning

Humans provide labels to data related to the outcome of a task in order to "train" the computer algorithm

Metric variables (also called numeric variables)

Interval: numbers that are an equal distance apart but do not have an absolute zero value, e.g. temperature in Celsius or Fahrenheit.
Ratio: numbers that do have an absolute zero value and can be expressed as meaningful ratios, e.g. 50%, 20 lbs.
Other data types: Boolean (binary: true/false, 0 or 1), factors (categorical: nominal, binomial, ordinal), integer (whole numbers), text (string: categorical, nominal, binomial, ordinal), date (numeric or categorical data).

Predictive modeling

Key component of artificial intelligence. Creates a probability estimate that an event of interest will occur using one or more variables. Expressed as a percentage, or as a certainty using a cutoff score. Includes: logistic regression, random forests, neural networks.

Categorical variables (or non-metric variables):

Nominal (names only) Binomial (yes, no) Ordinal (order) placed in order low medium high etc.

Model assessment - precision and recall

Precision: the fraction of examples predicted to belong to a class that truly belong to it, i.e. true positives / (true positives + false positives).
Recall: the fraction of examples that truly belong to a class that were predicted to belong to it, i.e. true positives / (true positives + false negatives).
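A small sketch of how precision and recall can be computed from actual and predicted labels. The function name `precision_recall` and the label lists are made up for illustration.

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the class labelled `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = event of interest, 0 = no event.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 1, 1, 0]
precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```

In this example there are 3 true positives, 2 false positives, and 1 false negative, so precision is 3/5 and recall is 3/4.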

Return on Marketing Investment (ROMI) Formula

Return on Marketing Investment (ROMI) = [Incremental Revenue Attributable to Marketing ($) * Contribution Margin (%) - Marketing Spending ($)] / Marketing Spending ($)
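ROMI is commonly computed as incremental revenue times contribution margin, minus marketing spending, all divided by marketing spending. The sketch below works one example through that calculation; the function name `romi` and the campaign figures are hypothetical.

```python
def romi(incremental_revenue, contribution_margin, marketing_spending):
    """Return on Marketing Investment as a fraction (0.6 means 60%)."""
    if marketing_spending <= 0:
        raise ValueError("marketing_spending must be positive")
    contribution = incremental_revenue * contribution_margin
    return (contribution - marketing_spending) / marketing_spending


# Hypothetical campaign: $200,000 incremental revenue, 40% margin, $50,000 spend.
example = romi(200_000, 0.40, 50_000)
print(f"{example:.0%}")
```

With these numbers the campaign contributes $80,000, which is $30,000 more than it cost, for a ROMI of 60%.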

Data Munging

Same as data wrangling (cleaning and unifying complex data for easy analysis and use.) Can also include creating new features in the data sets or data tables

Multivariate statistics

Statistical analyses that describe how individual variables within a dataset relate to one another. Relationship can be an important next step in developing analytics models or data visualizations. Include: ANOVA, T-test, Z-test, Scatterplot, multiple regression.

Unsupervised learning

The computer doesn't receive labels from humans, but looks to find "structure" in the input data. This can be the first step in labeling the data for further supervised learning.

Reinforcement learning

The computer interacts with the environment and data from the environment to reward or penalize algorithms as it is exposed to more data (or scenarios). Example - cars under varying conditions like speed, cornering, acceleration, etc

Data Storage

The physical or cloud-based location where you store data for analytics. Can be in the analytics program (Excel), commercial database software (MS Access, Oracle, SQL), open source database software (MySQL), or cloud-based data storage (AWS, Google, IBM).

Team Data Science Process (TDSP) Lifecycle - Microsoft

This version of the analytics process model focuses on development of analytics projects in the cloud • Cloud-based data acquisition and storage permits more big data efforts and analytics techniques than some on-premise environments • The 4 major steps listed here still have the same features of the CRISP-DM model

Artificial intelligence (AI)

Utilization of computers and algorithms to perform a human action or activity. Many AI apps use neural networks. AI differs from predictive modeling because it not only creates a prediction, but often acts on the prediction. Includes: computer vision, natural language processing, speech recognition, driving vehicles, etc.

Big Data

Volume - measured in absolute size of data
Variety - structured or unstructured data
Velocity - speed at which data is exposed to models or algorithms
Veracity - quality/accuracy of data that analytics models are exposed to

Comparing across distributions

When you compare across distributions, you can discover how some of these distributions might complement each other in analytics models. To make the comparison, normalize the data, remove outliers, and set the bin size and data range for each histogram.

Univariate data analysis

a way to discover characteristics of individual variables in a dataset or database (descriptive statistics or exploratory data analysis.) can include histograms, boxplots, density plots. Also statistical distribution analysis in distribution models (mean, median, z-scores)
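A short sketch of univariate summary statistics using Python's standard `statistics` module. The sample values are hypothetical; the z-scores here use the population standard deviation (`pstdev`), which is one of several reasonable choices.

```python
import statistics

# Hypothetical observations of a single variable (e.g. order size).
values = [12, 15, 15, 18, 20, 22, 45]

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.pstdev(values)  # population standard deviation

# A z-score expresses each value as a number of standard deviations from the mean.
z_scores = [(v - mean) / stdev for v in values]

print(mean, median)
print([round(z, 2) for z in z_scores])
```

The last value sits more than two standard deviations above the mean, which is the kind of outlier a histogram or boxplot would also reveal.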

Analytics Techniques - Naïve Bayes Classifier

assign observations to different "classes." Classes are groups of observations that are similar, frequent machine learning problem. Used with observations that have categorical predictor variables. Uses prior probabilities of similar classes to predict current observation. Used to detect fraudulent financial reporting.
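The idea of combining prior class probabilities with categorical predictors can be sketched as a very small Naïve Bayes classifier. This is a minimal illustration with add-one (Laplace) smoothing, not a production implementation; the function names and the toy transaction data are assumptions.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class frequencies of each feature value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Return the most probable class and the unnormalized score of each class."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for i, value in enumerate(row):
            # Add-one smoothing avoids zero probabilities for unseen combinations.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical transactions: (channel, amount band) with a known outcome.
rows = [("online", "high"), ("online", "high"), ("store", "low"),
        ("store", "low"), ("online", "low"), ("store", "high")]
labels = ["fraud", "fraud", "ok", "ok", "ok", "ok"]

model = train_naive_bayes(rows, labels)
label, scores = predict(model, ("online", "high"))
print(label)
```

The "naïve" part is the assumption that predictors are independent given the class, which is why the per-feature probabilities are simply multiplied.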

The Analytics Process - Model Assessment

The best assessment method depends upon the analytics technique
▪ e.g. R-squared (R²) in linear regression models
▪ Confusion matrix for binary classification models
To perform a validation, most data scientists use a train-test approach
• A subset of the data is used to develop the model
• A portion of the data is "held out" from training the model to be used to test the model
• Depending on the type of model, the "hold out" data may be randomly selected
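A minimal sketch of a random train-test (hold-out) split. The function name `train_test_split`, the 25% test fraction, and the seed are illustrative assumptions; analytics libraries provide their own versions of this step.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=42):
    """Randomly hold out a fraction of the records for testing."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = records[:]      # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 20 record ids.
records = list(range(20))
train, holdout = train_test_split(records)
print(len(train), len(holdout))
```

The model would be fitted on `train` only and then scored on `holdout`. For time-ordered data a random split is often not appropriate, which is one reason the hold-out selection depends on the type of model.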

Data Wrangling

cleaning and unifying messy and complex data sets for easy access and analysis. Involves data acquisition, joining data, data cleansing.

Analytics Techniques - Ensemble Models

combine multiple models (potentially using multiple techniques) to create a single output model. Used, for example, to forecast using multiple models. Models of different types can be combined; the only restriction is that their outputs can be combined.
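One simple way to combine models is to average their outputs. The sketch below averages three very basic forecasting rules; the rule names and the history values are hypothetical, and real ensembles often use weighted or learned combinations instead.

```python
def last_value(history):
    """Naive forecast: next period equals the most recent observation."""
    return history[-1]


def overall_mean(history):
    """Forecast the mean of all past observations."""
    return sum(history) / len(history)


def moving_average(history, window=2):
    """Forecast the mean of the last `window` observations."""
    return sum(history[-window:]) / window


def ensemble_forecast(history, models):
    """Combine the individual forecasts by taking their simple average."""
    forecasts = [model(history) for model in models]
    return sum(forecasts) / len(forecasts)


# Hypothetical demand history.
history = [10, 12, 14, 16]
combined = ensemble_forecast(history, [last_value, overall_mean, moving_average])
print(round(combined, 2))
```

The models can be combined here because each one produces the same kind of output, a single numeric forecast.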

Reports

computer outputs that are predesigned to utilize data to convey information to a human decision-maker. Include text, graphs, spreadsheets, etc.

The Analytics Process - Modeling

data scientist and statistician apply the best modeling technique given the business problem and the data available. Selecting the modeling techniques that match the business problem and data ▪ Building multiple models with multiple data sources and elements ▪ Choosing the best model and tweaking the parameter setting. Assessing the outcome. Revising the model parameters based on assessment.

The Analytics Process - Deployment

the deployment stage, where the data science model is integrated with the firm's systems and business processes • The model recommended by the data science team may have to be modified in order to be implemented "in production" ▪ For example, in order to run on a phone, the model may have to be rewritten in Java ▪ If the program is run against a live data stream instead of batch processed, it may have to be rewritten ▪ Example - algorithms that detect potential fraud on credit card transactions. Generally follows a similar CRISP-DM approach.

Data Types

describes the kinds of data that can be collected and utilized in analytics models and data science applications.
• Important because each type of model requires specific data types to calculate an outcome correctly.
• Categorical variables (or non-metric variables):
• Nominal - variables that are names only (e.g. Boy / Girl); can be counted, but not put into order
• Binomial - special case of nominal (yes / no or 0, 1); if numbers are used, they can be assigned without regard to order
• Ordinal - categories that can be placed in order (e.g. low, medium, high or assigned 1, 2,

Analytics Techniques - General Linear Regression (GLM)

extension of ordinary linear regression that permits variables to have less restrictive assumptions about the predicted variable than ordinary linear regression models. Used in regression using factors in social science. Multifactor ANOVA, etc.

Analytics

extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact based management to drive decisions and actions.

Analytics Techniques - Cluster Analysis

family of analytics methods that group observations which are alike into groups Used for ex. to group cars into categories based on characteristics.
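As one member of this family, k-means groups observations around the nearest cluster center. The sketch below runs k-means on a single car characteristic; the horsepower values, starting centers, and function name `k_means_1d` are assumptions for illustration.

```python
def k_means_1d(values, centroids, iterations=10):
    """Simple k-means on one numeric variable, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical horsepower figures for seven cars.
horsepower = [90, 95, 100, 105, 250, 260, 270]
centers, groups = k_means_1d(horsepower, centroids=[90, 270])
print(centers)
print(groups)
```

The cars split into an economy-like group and a performance-like group; with more characteristics the same idea applies using a multi-dimensional distance.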

Analytics Techniques - Structural Equation Models

fit networks of data to constructs in a series of models. Constructs are conceptual variables (latent) that are estimated by series of measurable variables.

Agile analytics project model

There are many different models for Agile Analytics development • Many of the steps mimic the CRISP-DM process • The major difference in the Houston Analytics model is the assumption that both the problem and solution teams are working on their respective efforts at the same time • Both teams perform the evaluation step

Data munging and Data wrangling

ingesting and converting data into a format usable by analytics models. Data munging may also involve the creation of new variables from existing variables. May involve manipulating a data set to deal with missing data. May also involve combining data sources together to create a single source for analytics models.
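The three munging tasks described above (handling missing data, creating new variables, and combining sources) can be sketched on a toy dataset. The customer and order records are hypothetical, and mean imputation is just one of several ways to deal with a missing value.

```python
# Hypothetical source 1: customer records, one with a missing age.
customers = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 50},
]

# Hypothetical source 2: order amounts keyed by customer id.
orders = {1: [20.0, 35.0], 2: [15.0], 3: []}

# Deal with missing data: replace a missing age with the mean of the known ages.
known_ages = [c["age"] for c in customers if c["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)

analytic_table = []
for customer in customers:
    spend = orders.get(customer["id"], [])  # combine the two sources on id
    analytic_table.append({
        "id": customer["id"],
        "age": customer["age"] if customer["age"] is not None else mean_age,
        "order_count": len(spend),  # new variable created from existing data
        "total_spend": sum(spend),  # new variable created from existing data
    })

for row in analytic_table:
    print(row)
```

The result is a single table with one row per customer, which is the shape most analytics models expect.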

Data Understanding

involves collecting the data, identifying problems in the data and developing initial understanding about the data. • Identifying data type/s • Identifying data size • Identifying data velocity • Identifying data veracity • Data storage method & technology

Analytics Techniques - Support Vector Machines (SVM)

machine learning method of classifying groups using hyper-planes. The analytics method identifies the hyper-plane that best separates the groups. Used, for example, to classify images.
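To show what classifying with a hyper-plane means, the sketch below applies an already-chosen hyper-plane to new points. It does not perform the SVM training step that finds the best-separating hyper-plane; the weights and bias are hand-picked, hypothetical values.

```python
def classify(point, weights, bias):
    """Return +1 or -1 depending on which side of the hyper-plane the point falls."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else -1


# Hypothetical hyper-plane in two dimensions: x1 + x2 - 5 = 0.
WEIGHTS = [1.0, 1.0]
BIAS = -5.0

above = classify((4.0, 3.0), WEIGHTS, BIAS)
below = classify((1.0, 2.0), WEIGHTS, BIAS)
print(above, below)
```

In two dimensions the hyper-plane is just a line; a trained SVM would choose the weights and bias so that the margin between the two groups is as wide as possible.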

Analytics Techniques - Multiple Discriminant Analysis (MDA)

model-based approach to classifying observations. Used to predict corporate failure for ex.

Analytics Techniques - Text Mining Analytics Models

models include a series of methods that convert plain text into categories or similar groups. Used for ex to determine groups of docs that have related topics.

Statistical Distribution Analysis:

numerically describing how data for a single variable is distributed across the range of values in the variable's observations. • Important when analyzing how a variable is likely to "fit" or add additional value in a predictive model • Can be described mathematically using distribution functions and parameters.

The Analytics Process - Modeling Techniques

o Time series / forecasting models o Optimization models o Naïve Bayes Classifier o Classification and Regression Trees o Random Forests o Logistic Regression o Ensemble Models o Sentiment Analysis o Cluster Analysis

Data Analytics

series of data-driven capabilities that extract meaning from data and facilitate better decision-making. Includes machine learning, artificial intelligence, etc.

Analytics Techniques - Sentiment Analysis

specific kind of Text Analytics modeling technique that identifies a writer's attitude toward a topic that they are writing about. Used, for example, to predict whether you'll like a restaurant on Yelp. Uses latent semantic analysis or support vector machines.

Descriptive statistics

statistical analyses that contribute to understanding of data or are used in the initial steps of more complex analytics process. Sometimes called univariate statistics or analyses, performed on single variables within a set of data. Include: variable type, means, mode, median, histogram, box plot, etc.

Analytics Techniques - Conjoint Analysis

survey-based analytic method designed to determine the optimal mix of features and pricing to include in a product offering

Data visualization

technique to convey information about data utilizing charts, graphs, and dashboards. Can be predesigned or self-designed depending on software and display device. Visualization of data enhances insights, comprehension, retention. Helps with storytelling.

Analytics Techniques - Logistic Regression

the predicted (dependent) variable is a 1 or 0, yes or no, or categorical (more than 2 outcome cases). Used to estimate the probability of an event.
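A logistic regression model turns a weighted sum of predictors into a probability with the logistic (sigmoid) function, and a cutoff then converts that probability into a yes/no prediction. The sketch below scores one observation; the coefficients, intercept, and 0.5 cutoff are hypothetical rather than fitted to any data.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Probability of the event: sigmoid of the linear combination of predictors."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted model with two predictors.
COEFFICIENTS = [0.8, -0.5]
INTERCEPT = -1.0
CUTOFF = 0.5

probability = predict_probability([3.0, 1.0], COEFFICIENTS, INTERCEPT)
predicted_class = 1 if probability >= CUTOFF else 0
print(round(probability, 3), predicted_class)
```

Raising or lowering the cutoff trades off false positives against false negatives.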

Analytics Techniques - Artificial Neural Networks (ANN)

use input layers, hidden layers and output layers to classify observations. Used, for example, to classify images.
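The layer structure can be sketched as a forward pass through a tiny network with two inputs, one hidden layer of two neurons, and one output neuron. The weights and biases are hand-picked, hypothetical values; a real network learns them from training data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Hypothetical weights: one row per neuron, one column per input to that neuron.
HIDDEN_WEIGHTS = [[0.5, -0.2], [0.3, 0.8]]
HIDDEN_BIASES = [0.0, -0.1]
OUTPUT_WEIGHTS = [[1.0, -1.0]]
OUTPUT_BIASES = [0.0]


def forward(inputs):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, HIDDEN_WEIGHTS, HIDDEN_BIASES)
    output = layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)
    return output[0]


score = forward([1.0, 0.5])
predicted_class = 1 if score >= 0.5 else 0
print(round(score, 4), predicted_class)
```

Image classifiers follow the same pattern with many more inputs, layers, and neurons.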

Analytics Techniques - Multidimensional Scaling (MDS), also called Perceptual Mapping

used to measure the distance between all observations in multiple dimensions e.g. to measure car characteristics in multiple dimensions.
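The distances between all observations can be sketched as a pairwise distance table, which is the usual starting point for multidimensional scaling. The sketch stops at the distances and does not compute the low-dimensional map itself; the car characteristics are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

# Hypothetical car characteristics on two dimensions (already on comparable scales).
cars = {
    "compact": (1.0, 1.0),
    "sedan": (1.5, 1.4),
    "truck": (4.0, 5.0),
}


def distance_matrix(points):
    """Euclidean distance between every pair of observations."""
    names = list(points)
    return {(a, b): math.dist(points[a], points[b]) for a in names for b in names}


distances = distance_matrix(cars)
print(distances[("compact", "truck")])
print(round(distances[("compact", "sedan")], 3))
```

Cars that are close together in this table would also appear close together on a perceptual map.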

Machine learning

uses data with algorithms to create a prediction or an output used for an action or task. Models are trained on training data; learning happens by fitting the weights or coefficients that determine the impact of variables upon the output. Learning can be supervised (the data is labeled) or unsupervised (the algorithm finds groupings in the data itself). Deep learning is recursive.

Dashboards

visual representations of data that convey information to a human decision-maker. Dashboards can be either fixed representations of the data or they can be "Self-Service." Fixed dashboards are predesigned for a specific decision-maker or decision. Self-service allow decision-maker to connect data and customize the visualization of the data to their specific needs.

