Final Review


M.L. vs D.L.

With deep learning-based natural language processing models, you don't need to pre-code or pre-specify features. You can throw all the text into the model and it will extract the information by itself! - end-to-end process - deep learning typically uses deeper neural networks

An Illustrative Example - Diapers and Beers

A number of convenience store clerks, the story goes, noticed that men often bought beer at the same time they bought diapers. The store mined its receipts and proved the clerks' observations correct. So, the store began stocking diapers next to the beer coolers, and sales skyrocketed - So, don't be surprised if you find six-packs stacked next to diapers!

Example of AB Test Pt. 2

A/B test whether or not to expand Medicaid? - Expansion might allow more people to use Medicaid - More people would be willing to receive health services because they're covered by insurance - But more coverage creates a moral hazard: people become more likely to visit the ER because they're covered - It's less clear in advance how patients will behave

Lecture 10 AI, Machine Learning and Deep Learning A.I. vs M.L. vs D.L.

AI - broad term covering any technique that enables computers to mimic human intelligence, whether using logic (pre-specified rules), machine learning models (e.g., decision trees), or deep learning - pre-coded rule example: "if-then" rules → when you call customer service and hear "press 1 for service, 2 for review, etc." - no data-driven algorithm here; everything is pre-coded from human knowledge
Machine Learning - subset of AI: statistical techniques that enable a machine to improve at a task with experience (training data, i.e., historical data)
Deep Learning - lets software train itself without hand-crafted feature engineering - usually backed by multi-layer neural networks - requires vast amounts of data

Using ROC for Model Comparison

Any model above the diagonal line is better than random guessing - closer to the top-left means more true positives for fewer false positives. After a certain point we prefer M2 over M1. Calculate the area under the curve to choose which model is better. AUC - area under curve - the maximum AUC is 1 - an AUC of 0.5 is the baseline (random guessing). A sketch of the comparison follows below.
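
A minimal sketch of that AUC comparison with scikit-learn; the synthetic dataset and the two model choices here are placeholders, not the lecture's data:

```python
# Compare two classifiers by area under the ROC curve (AUC).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("M1: decision tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
                    ("M2: neural net", MLPClassifier(max_iter=1000, random_state=0))]:
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]   # probability of the positive class
    print(name, "AUC =", roc_auc_score(y_te, scores))
```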

ROC Curve Example

As we turn up the sensitivity of the receiver so it detects more and more German planes, it also mislabels more flocks of geese - as you raise sensitivity (detecting, say, 92% of German planes), you become more likely to flag any minor signal, even a falsely detected one

Machine Learning - Supervised Approach Apply Model to Test Data

Can compare different models → compare the optimal performance of a decision tree versus a neural network versus a deep learning model. Once you choose a model, decide: is the model better than a human? Is the model good enough?

Example of AB Test

The cancel button is taken away in Version B. What is my prediction of the cancellation rate under Version A versus Version B? Control group: Version A. Treatment group: Version B. Should Facebook make this change to the button? It could lead to more customer service calls (people can't find the cancel button, which frustrates them). Result: the A/B test comparing the two flows reveals the underlying consumer behavior! A sketch of the comparison follows below.
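
A hedged sketch of how the comparison could be tested statistically - a standard two-proportion z-test on made-up cancellation counts (none of these numbers come from the lecture):

```python
# Test whether cancellation rates differ between Version A (control)
# and Version B (treatment); all counts below are hypothetical.
from math import sqrt
from scipy.stats import norm

n_a, cancels_a = 10000, 820   # control: Version A
n_b, cancels_b = 10000, 760   # treatment: Version B

p_a, p_b = cancels_a / n_a, cancels_b / n_b
p_pool = (cancels_a + cancels_b) / (n_a + n_b)          # pooled rate under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))                           # two-sided test
print(f"rate A={p_a:.3f}, rate B={p_b:.3f}, z={z:.2f}, p={p_value:.4f}")
```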

Tableau Dimensions & Measures

Categories are dimensions → categorical data. Discrete vs. continuous dates: - Discrete Quarter: sums the quantity across all years for each quarter (Q1 shows the total of every Q1 in the data) - shows the total sum for each quarter - Continuous Quarter: shows each quarter of each year in order - a continuous breakdown quarter by quarter

Traditional ML

Caveat: machine learning often requires people to hand-engineer the features the machine should look for, which can be complex and time-consuming. The data must go through this feature-engineering step before it is input. Pixel example: you cannot feed raw pixel data directly - it's hard for the machine to process - use feature engineering to simplify the pixel data - humans hand-group image pixels into regions as structured input for the decision tree algorithm to learn from - but this process is exhausting: humans still need to code many features by hand, holding the machine's hand and producing those features as input for the training process

Demonstration in RapidMiner

Demonstrate Decision Tree & Model Evaluation. Demonstrate the comparison between decision tree and logistic regression.

Machine Learning - Supervised Approach Decision Tree Classification Task

Develop a model using the supervised learning approach: • Two phases: training and testing. • Training builds a model from a large historical data sample (training set) - create the model • Testing tries the model on new, previously unseen data to determine its accuracy and other performance characteristics - validate the model (to an extent). Similar to the human learning experience: • Uses observations to form a model of the important characteristics of some phenomenon. • Uses generalizations of the 'real world' and the ability to fit new data into that general framework.
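
A minimal sketch of the two phases with scikit-learn, on a synthetic stand-in dataset:

```python
# Training phase: build a model on historical data; testing phase: score it
# on previously unseen data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)  # training: create model
print("accuracy on unseen data:", model.score(X_test, y_test))        # testing: validate model
```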

Neural Network and Handwritten Digit Problem

Different handwriting - people write "9" in different ways. A computer vision algorithm based on a deep learning network can still recognize the digits (e.g., for depositing checks). How does the model use its hidden layers to recognize different handwriting? - the model learns the high-level features (structural combinations of a loop and a bar)

1. Metrics for Performance Evaluation

Focus on the predictive capability of a model - rather than how fast it classifies or builds, scalability, etc. Confusion matrix: a = true positive (TP); b = false negative (FN) → often the most costly! You predicted they would not default (negative) but they did (you were wrong) → you lose the money you loaned them. You can make the machine more sensitive to avoid this in the future - however, lowering false negatives will raise false positives. c = false positive (FP); d = true negative (TN). a + d are the correct predictions, so accuracy = (a + d) / (a + b + c + d). A sketch follows below.
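
A small sketch of reading accuracy off a confusion matrix; the labels below are made up for a loan-default illustration (positive = "will default"):

```python
# Extract TP/FN/FP/TN from a binary confusion matrix and compute accuracy.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]   # hypothetical actual outcomes
y_pred = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)   # (a + d) / all outcomes
print(f"TP={tp} FN={fn} FP={fp} TN={tn} accuracy={accuracy:.2f}")
```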

Artificial General Intelligence (General AI) Artificial Narrow Intelligence (Functional AI)

General: a machine made to think and function like the human mind - a dream since the 1950s (since the first computers). Narrow: a machine's ability to perform a single task extremely well, even better than humans - specialized algorithms for specific applications - Ex: recognizing objects in a video (detecting certain subjects in the footage) - Ex: natural language processing. AI: "machine" "mimic" "human"

Association Rule Discovery: Definition

Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other items.
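
A pure-Python sketch of the diapers-and-beer rule using the standard support and confidence measures (the transactions are invented for illustration):

```python
# Measure how strongly diapers predict beer in a tiny transaction set.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"diapers", "beer"} <= t)
diapers = sum(1 for t in transactions if "diapers" in t)

support = both / n            # how often the pair occurs at all
confidence = both / diapers   # P(beer | diapers)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```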

Neural Network for Customer Acquisition

Goal: decide whether a customer will rent or buy a property - probabilities at each step - this version gives the model more representational power and capacity - it can model very complicated relationships between input and output features

How to determine the Best Split

Greedy approach: - Nodes with a homogeneous (alike) class distribution are preferred - Need a measure of node impurity (a 5/5 class mix is maximally impure - it gives no information)

Methods of Estimation

Holdout - Challenge: if your partition accidentally includes some outliers, the evaluation isn't very fair to your model
Cross-Validation - can get a more reliable estimate: - split the data into multiple parts/partitions (e.g., 5) - train and evaluate the model over 5 rounds (5 rounds of training/testing give 5 prediction accuracies) - in each round, 1/5 of the parts is the test set and 4/5 are the training set, then rotate which part is held out - perform training and validation in each round - the model eventually uses all the information in the data (a sketch follows below)
Bootstrap - sampling with replacement
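
A sketch of 5-fold cross-validation with scikit-learn; the synthetic data and classifier choice are arbitrary stand-ins:

```python
# Each of the 5 rounds holds out a different fifth of the data as the test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=2)
scores = cross_val_score(DecisionTreeClassifier(random_state=2), X, y, cv=5)
print("5 per-round accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```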

Ice cream & Crime Example

Ice cream and crime → causal or correlational? Is ice cream consumption causing higher levels of crime? - This sounds ridiculous - probably just a spurious correlation - Is it non-causal because it's scientifically unlikely, or because of another correlated factor? Why would ice cream consumption be correlated with crime levels? 1. Omitted Variable 1 - Temperature - more ice cream consumption in hotter temperatures - hotter weather means more people outside, higher chance of crime 2. Omitted Variable 2 - Population - zip codes with larger populations will have higher ice cream consumption - and also higher crime counts 3. Omitted Variable 3 - Socioeconomic status - higher income, higher consumption - also higher crime rate? - depends on the specific context. The causal relationship is not so clear-cut!!! → hard to decipher in a lot of cases in society

What can AI (Deep Learning) do?

Image recognition - can detect each of these images and match them - in a Times Square scene, deep learning can detect all the people and determine trajectory and attributes (gender, height, age). How does it take in an image? An image consists of many, many small pixels (e.g., a 1080p image) - break the pixels down into color values - this pixel matrix format represents the photo - then flatten the matrix into a vector - feed the vector into the deep learning algorithm - the computer vision algorithm takes every single pixel's information as input, after some transformation (e.g., convolution - covered in detail next Thursday)

Unsupervised Learning Example: Clustering

Imagine you're the brand manager for P&G - wide range of products. People have very different purchase behaviors and preferences. You can cluster these customers by type: - within a cluster (intracluster), you want many similar data points - intracluster distances are minimized (points in the same cluster are very similar) - intercluster distances are maximized (the clusters themselves are far apart). A sketch follows below.
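
A minimal k-means sketch on synthetic "customer" data - k-means is one common clustering algorithm, not necessarily the one from lecture:

```python
# k-means tries to make points within a cluster close (small intracluster
# distance) while the clusters themselves stay apart.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=3)  # fake purchase features
km = KMeans(n_clusters=3, n_init=10, random_state=3).fit(X)
print("cluster sizes:", [list(km.labels_).count(c) for c in range(3)])
print("inertia (sum of intracluster squared distances):", round(km.inertia_, 1))
```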

Importance of A/B test in Industry

Inspiring talk by Mark Zuckerberg on A/B testing (from 9:45). Number of A/B tests at top tech firms: per day? Interviews for business analysts, data scientists, and product managers. Lack of talent in this area, even at the very top tech firms. Toward a dream of automated decision making.

Measure of Impurity: GINI***

GINI(t) = 1 - Σ_j [p(j|t)]^2. Maximum (0.5 for two classes) when records are equally distributed among the classes, implying the least information gain from the split. Minimum (0.0) when all records belong to one class, implying the most information gain from the split. Binary attributes: computing the GINI index - split into two partitions - effect of weighting partitions: larger and purer partitions are sought. A 90/10 split is a huge lift compared to a 50/50 or a roughly 55/45 one. A sketch of the arithmetic follows below.
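
A short sketch of the GINI arithmetic behind the 90/10 vs. 50/50 intuition; the split counts are hypothetical:

```python
# GINI impurity of a node, and the weighted GINI of a binary split.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([5, 5]))    # 0.5  -> maximally impure, no information
print(gini([9, 1]))    # 0.18 -> much purer
print(gini([10, 0]))   # 0.0  -> pure node

# Weighted GINI of a split: weight each child node by its share of records.
left, right = [9, 1], [1, 9]          # hypothetical children of attribute B
n = sum(left) + sum(right)
split_gini = sum(sum(child) / n * gini(child) for child in (left, right))
print(split_gini)                     # 0.18 -> B is a good split
```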

Model Evaluation

Metrics for Performance Evaluation - How to evaluate the performance of a model? Methods for Performance Evaluation - How to obtain reliable estimates? Methods for Model Comparison - How to compare the relative performance among competing models?

Multidimensional data with OLAP

Multidimensional data - facts (numeric measurements, e.g., sales revenue) - associated with dimensions (e.g., location and time). How to represent multidimensional data? (This is the core question in designing & using OLAP.) - Relational table - Matrix - Data cube. Data is aggregated to different levels of detail along different dimensions. The outcome is a measure - the cell value - with location, time, and branch as dimensions.

3. Methods for Model Comparison - How to compare the relative performance among competing models? - Accuracy/Confusion Matrix (see previous slides) - ROC Curve

No model is universally better than another (though sometimes one dominates the other). We must therefore determine the best model for our situation.

Decision Tree for Fraud Detection

The tree asks one yes/no question at each node. Goal: differentiate fraudulent from legitimate cases by answering a sequence of these questions.

Underfitting and Overfitting (decision trees)

Overfitting → too many nodes/splits in the decision tree: training error keeps going down (each node becomes purer and more homogeneous as you split) but test error goes up. Why does overfitting happen? → where test error starts to bounce back up on the graph - when using a very small dataset, training is biased toward your small sample - the sample may contain idiosyncratic noise (not representative of all the data). The test set has a higher error rate - the test set is like an exam, the training set is like practice in class - a model's performance is always better on training than on test data - you'll always perform better on a problem you've already tried (practice is easier than being tested)

Notes on Overfitting

Overfitting results in decision trees that are more complex than necessary. Training error no longer provides a good estimate of how well the tree will perform on previously unseen records. Need new ways of estimating errors. Choose the tree with the lowest test error rate - complex enough, but no more!!

Tobacco Consumption Example

People with more tobacco consumption are more likely to get lung cancer. Is this causal? If it isn't causal, what's an alternative explanation? Suppose, for the sake of argument, it's just a correlation. Explanation: - people who smoke tobacco are more likely to use other damaging substances - maybe tobacco consumers engage in other activities that actually lead to cancer - it's not the tobacco consumption that causes cancer but rather the lifestyle choices of tobacco smokers - lifestyle choices can cause cancer, and those lifestyle choices are correlated with tobacco smoking - also, people who are stressed are more likely both to get cancer and to smoke tobacco

Why more focus on SQL in DSO428

The relational data model & SQL are the fundamental way of organizing and accessing data. Business Analytics software requires pre-setup (e.g., an OLAP cube) and does not allow flexible combination of data from different sources/tables, thus limiting the potential of big data. Business Analytics software is easy to understand once you have a deep understanding of the relational data model and SQL (e.g., a main challenge of Tableau is how to prepare the input table). Very easy to learn on your own.

Scale drives Deep Learning Process

The deep learning process requires a huge amount of data. Traditional algorithms → a limited learning curve (performance cannot keep improving by adding more data) → the red line. The larger the neural network and the more data you feed it, the greater the improvement in the learning curve.

Multidimensional Operations

Roll-up: - Aggregations on the data - Performed by moving up the dimensional hierarchy or by dimensional reduction - e.g., 4-D sales data to 3-D sales data.
Drill-down: - The reverse of roll-up - Reveals the detailed data that forms the aggregated data - Performed by moving down the dimensional hierarchy or by dimensional introduction - e.g., 3-D sales data to 4-D sales data.
Slice and Dice: - Page is like a slice; Filter is like a WHERE - Look at data from different viewpoints - Slice performs a selection on one dimension of the data - e.g., sales revenue (type = 'Flat') - Dice uses two or more dimensions - e.g., sales revenue (type = 'Flat' and time = 'Q1')***
Pivot (slide 16): - Two axes - Rotate the data to provide an alternative view of the same data - e.g., sales revenue displayed with location (city) as the x-axis against time (quarter) as the y-axis can be rotated so that time (quarter) is the x-axis against location (city) as the y-axis.

ROC Curve

Sensitivity = true positive rate (TP); 1 - Specificity = false positive rate (FP). Points as (TP, FP): (0,0): declare everything to be the negative class (no German planes) (1,1): declare everything to be the positive class (every signal is a German plane) (1,0): ideal (perfectly differentiates German planes from geese) → top-left of the plot § Diagonal line: • random guessing • below the diagonal line: the prediction is the opposite of the true class - as you increase sensitivity, TP rises but the chance of error (FP) rises with it → the tradeoff (you can choose a sweet spot for your firm) - lowering the approval bar (credit score > 500 instead of > 800) catches more good customers but also lets in more errors (FP) - at the lowest bar (credit score > 0), everyone is declared positive: every good customer falls in the bucket, and so does everyone else

Deep Learning: Tensorflow Playground

Shows the power of a deep learning network. You can see how many layers you need in order to represent a very complex pattern. Classification problem in this example (separate the blue dots from the orange dots - separate the classes of data). You can increase representational power by adding hidden layers!

Lecture 13 What is the relationship between Business Analytics software (e.g. Tableau) and SQL? Is there a 1-1 mapping between Tableau operations and SQL?

Similar formatting - both have columns and tables → the workspaces of the two programs are similar. Tableau dashboard shown. Tableau uses the SAP OLE DB for OLAP provider (part of the Open Analysis Interfaces) to interact with SAP BW.*** (Open question from the notes: is SAP the data mining side and OLAP the data visualization side?)

Figure 33.2: Data Model for Multidimensional Data (Kimball dimension modeling Ch.33)

Star Schema - Fact table at the center surrounded by dimension tables

Operations

Start with a pre-saved table at most granular level Roll-up --> GROUP BY (DIMENSIONS) Drill-down --> keep the granular data Slice and Dice --> WHERE(dim1=.., dim2=...) Pivot --> change display
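
A hedged pandas sketch of the same mappings; the table and column names are invented for illustration, and SQL equivalents are noted in the comments:

```python
# Roll-up / slice / dice / pivot on a granular sales table.
import pandas as pd

sales = pd.DataFrame({
    "city":    ["LA", "LA", "SF", "SF"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "type":    ["Flat", "Flat", "House", "Flat"],
    "revenue": [100, 120, 90, 110],
})

rollup = sales.groupby("city")["revenue"].sum()                       # roll-up: GROUP BY a dimension
slice_ = sales[sales["type"] == "Flat"]                               # slice: WHERE on one dimension
dice = sales[(sales["type"] == "Flat") & (sales["quarter"] == "Q1")]  # dice: WHERE on two dimensions
pivot = sales.pivot_table(index="quarter", columns="city",
                          values="revenue", aggfunc="sum")            # pivot: change the display
print(rollup, slice_, dice, pivot, sep="\n\n")
```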

Stopping Criteria for Tree Induction

Stop expanding a node when all the records belong to the same class. Pruning trees based on model evaluation (discussed later).

Identification of Treatment Effect Omitted Variable Bias & Its Solution --Experiment

Students who choose to take DSO have better outcomes than those who don't → causal or correlational? Correlational factors - omitted (unobserved) variables: 1. Majors/minors - students may already be in fields that lead to higher salaries 2. Motivation - more motivated students may choose this class. What if you run a lottery to randomly determine which students take DSO and which don't? - The randomization process helps with the unobserved variables - the randomization will account for these variables***

Machine Learning Algorithms

Supervised (like the human learning process): - needs a training process with labeled data - put labeled data into the algorithm to guide it as it calibrates its model parameters - Goal: predict a label. Ex: training a baby - show the baby lots of examples of toy trucks - you give them labeled data so they associate the visual characteristics with the name/label "truck" - over time, when the baby sees another truck on the street, they can predict the label. Ex: give historical data on who has defaulted - the model will predict/label future loan applicants' behavior.
Unsupervised: - no labeled data - cluster based on similarity - the model is used directly for pattern recognition - separates entities by type. Ex: give a baby 100 photos of sky, grass, and desert - the baby does not know the "labels" or names, but can still recognize similarities and differences across the images - can distinguish the patterns

Online Analytical Processing (OLAP)

The dynamic synthesis, analysis, and consolidation of large volumes of multidimensional data Optimize the storing and querying of large volume, multidimensional data aggregated (summarized) to various levels of detail to support the analysis of this data. Help users better understand the data from various angles. - OLAP can be used for analysis

Machine learning Process vs. Traditional Programming

Traditional programming uses human-specified rules/algorithms → a pre-specified algorithm, with the computer mainly executing those pre-specified rules - "if the customer presses 1, do what, which output?" - human-coded rules - a straightforward process.
Machine learning goes one step further - instead of humans specifying every detail of the algorithm, why not let the machine learn the rules of the algorithm? - Teach the machine how to learn rather than handing each machine an algorithm! - the machine learns the rules by itself, and after the training process it can make predictions and perform the task - give the machine historical data and the final outputs to LEARN from! - once the algorithm is formed (whether the machine used a decision tree or a neural network to form it), we can plug it back into the traditional process for any new data - for new loan applicants, the machine applies the algorithm it has learned and makes a prediction/output - the machine learns the algorithm itself from historical patterns rather than us telling it which algorithm → much more efficient.
On the flip side: we as humans don't know the algorithm - we don't know what happens inside the blue box in machine learning!

Machine Learning - Supervised Approach Example of Decision Tree

Training Set

Tree Induction (learn model)

Tree induction - how to decide: Greedy strategy • Split the records based on an attribute test that optimizes a certain criterion. Issues: • Determine how to split the records - How to specify the attribute test condition? - How to determine the best split? • Determine when to stop splitting. Model training → 'learn model' learns the tree and its splitting attributes. Model validation → 'apply model': use the test set to test the machine - a human with the answers checks the machine's accuracy - if the machine gets 4 of 5 test cases right, its accuracy is 4/5 and its error rate is 1/5 - choose the best model, the one with the lowest error rate.

Neural Network: Basic Setup

Very different from a decision tree!! A decision tree sorts through the data with binary questions (yes or no) about the features. A neural network is more like regression - it uses all the inputs in parallel - but there is no direct mapping between the input layer and the outcome (unlike linear regression) - a neural network has a hidden layer in between. A sketch follows below.
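
A tiny numpy sketch of the setup - one hidden layer between inputs and output; the weights are random placeholders, not a trained model, and the activation choices (ReLU, sigmoid) are common conventions rather than anything specified in the lecture:

```python
# One forward pass: inputs feed the hidden layer, the hidden layer feeds the output.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(4)                             # 4 input features
W1, b1 = rng.random((3, 4)), rng.random(3)    # input -> hidden (3 units)
W2, b2 = rng.random((1, 3)), rng.random(1)    # hidden -> output

hidden = np.maximum(0, W1 @ x + b1)              # ReLU activation in the hidden layer
output = 1 / (1 + np.exp(-(W2 @ hidden + b2)))   # sigmoid output, like a probability
print("hidden activations:", hidden, "output:", output)
```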

Nature of Data Mining

We humans specify the framework/structure of the data mining model, but the machine learns the details (e.g., the parameters) from data. The structure can be very flexible (e.g., deep learning). Do we know what is actually happening? The machine just follows a series of steps and rules (magic). Does it matter? How do we know if we are right or wrong?

How to Find the Best Split***

Which attribute should we use to split the data? (A 50% yes / 50% no result gives you no information!!) Which split lets us make a better prediction, attribute A or B? - B, because its results are more homogeneous. Suppose attribute B is whether or not the person had a previous loan refunded; C0 = don't default, C1 = default. Yes branch - 90% winning chance - you can make a very good guess for people who said yes (had a previous loan refunded → will not default). No branch - 90% winning chance again - a good guess for people who said no (no previous refund → will default) - an average of 90% prediction accuracy. Better than betting with a 50% or 60% winning chance → far more uncertain. If the tree is deeper and wider you'll have more splits!! As you split more and more, the prediction gets better and better - the results become purer. Favor B because ~90% accuracy is a huge lift compared to 50% or 55% (a 50/50 and a 60/40 node averaged). B increases prediction accuracy! Based on this calculation we choose the splitting attribute → then you must determine when to stop splitting!

AB Test Process

Why do we need A/B tests? → Correlation doesn't mean causality. We can pin down the causal effect of x on y → we can see the causal relationship. Helps us determine whether a relationship is causal or not.

Neural Network: Basic Setup (many hidden layers)

Why do we need hidden layers? - consider the process of recognizing digits on a check - by the last layer there is a prediction of what the number is - the pattern of activations across the layers determines the final prediction - the layered structure can transform a low-level image into a high-level prediction. A sketch follows below.
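
A small sketch of the digit-recognition idea using scikit-learn's built-in 8x8 digits dataset and a multi-layer network; the layer sizes are arbitrary choices, not the lecture's architecture:

```python
# Train a small multi-layer network to recognize handwritten digits.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)                  # images flattened to 64 pixel values
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=4)
net.fit(X_tr, y_tr)                                  # hidden layers learn intermediate features
print("test accuracy:", round(net.score(X_te, y_te), 3))
```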

Example of Classification Model (bottom right corner of the matrix)

§ Decision Tree based Methods § Logistic Regression § Neural Networks § Naïve Bayes § Support Vector Machines

ROC (Receiver Operating Characteristic)

§ Developed in the 1950s for signal detection theory, to analyze noisy signals • Characterizes the trade-off between positive hits and false alarms § The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis) § The performance of a classifier at a given threshold is represented as one point on the ROC curve • changing the algorithm's threshold traces out the curve
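
A sketch of that threshold sweep with scikit-learn's roc_curve; the scores below are made up:

```python
# Each threshold over the model's scores gives one (FP rate, TP rate) point.
from sklearn.metrics import roc_curve

y_true   = [0, 0, 0, 0, 1, 1, 1, 1]
y_scores = [0.1, 0.3, 0.4, 0.8, 0.35, 0.6, 0.7, 0.9]  # hypothetical model scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th:.2f}: TP rate {t:.2f}, FP rate {f:.2f}")
```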

How do we know our model/rules make sense? Model Evaluation and Selection

§ Evaluation metrics: How can we evaluate performance? § Use validation test set of class-labeled tuples instead of training set when assessing accuracy § Methods for estimating a classifier's accuracy: • Holdout method, random subsampling • Cross-validation • Bootstrap § Comparing models: • Confidence intervals • Cost-benefit analysis and ROC Curves

Mathematical Measures of Node Impurity***

§ Gini Index § Entropy § Misclassification error Choose splitting attribute that gives us the most gain!

How to Address Overfitting

§ Pruning the tree • Grow the decision tree to its entirety • Trim the nodes of the decision tree in a bottom-up fashion • If the error rate is reduced after trimming, replace the sub-tree with a leaf node • The class label of the leaf node is the majority class of the instances in the sub-tree. Picture: x-axis: complexity (number of nodes); one curve: training error rate; the other: test error rate. As the model becomes more complex, training error keeps decreasing, but test error decreases only up to a certain point - then it starts increasing again! - we want to find this sweet spot to minimize error - we care MOST about minimizing test error (the real performance of the model). A sketch follows below.
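
A sketch of pruning via scikit-learn's cost-complexity path - one concrete pruning mechanism, not necessarily the exact bottom-up procedure from the slides; data is synthetic:

```python
# Larger ccp_alpha trims more nodes; pick the alpha with the best test error.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)

path = DecisionTreeClassifier(random_state=5).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas[::5]:                   # sample a few candidate alphas
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=5).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f} nodes={tree.tree_.node_count} "
          f"test acc={tree.score(X_te, y_te):.3f}")
```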

Learning Curve

• The learning curve shows how accuracy changes with varying sample size • Effect of sample size: bias in the estimate and variance of the estimate → accuracy saturates at some point as data grows. The graph tells us the final capacity of the model - the upper limit - e.g., the highest prediction accuracy it can reach might be <90% no matter how much data we feed it - the graph's progress also tells us the learning speed of the model. A deep learning model is slower! → its curve is shallower and rises more slowly - in the small-data regime, this is why a deep learning model doesn't have high prediction accuracy or much advantage without lots of data. A sketch follows below.
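
A sketch of computing learning-curve numbers with scikit-learn (synthetic data, arbitrary classifier):

```python
# Accuracy as a function of training-set size, averaged over 5 CV folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=6)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=6), X, y,
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0], cv=5)

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={n:4d}  train acc={tr:.3f}  test acc={te:.3f}")
```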

Ad Click: Is Prediction Enough?

→ how to choose the right customers to advertise to - promotion codes can actually lead to a decrease in profits!! People get annoyed - the questionable practice: target people with a high propensity to buy (Y) but a low incremental effect (ΔY). Rank customers based on Y, not ΔY → why do CMOs adopt this practice? - Because it makes them look better → easier to get financing approval when presenting to the CEO (you can heavily skew the figures/stats this way) - attach the promotion code to people who would buy anyway - rank people based on Y and then target them - this shows a higher apparent ROI for marketing - people find it very unintuitive to think about the causal effect (ΔY) rather than the direct/correlated effect (Y) - takeaway → when you just need a prediction, Y is good enough, but if you are trying to find the source of a change, you need ΔY. A sketch follows below.
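
A hedged sketch of ranking by Y vs. ΔY with invented segment numbers (none of these rates come from the lecture):

```python
# Purchase rates with and without the promotion for two hypothetical segments.
segments = {
    # segment: (rate if targeted, rate if not targeted)
    "loyal fans":   (0.90, 0.88),   # high Y, tiny incremental effect
    "on the fence": (0.40, 0.15),   # lower Y, big incremental effect
}

for name, (treated, control) in segments.items():
    delta_y = treated - control
    print(f"{name}: Y={treated:.2f}, deltaY={delta_y:.2f}")
# Ranking by Y targets the loyal fans (who would buy anyway);
# ranking by deltaY targets where the promotion actually changes behavior.
```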

Is Prediction Enough?

→ when is prediction enough, and when do you need causality? Imagine you are the mayor of LA deciding whether the ice cream/crime relationship is causal or correlational - the mayor cares about causality here! - they want to know if there's a causal relationship - in this case prediction is not enough! But imagine you're a visitor to LA - the visitor only cares about the current crime rate across the city to decide where to stay - in this case prediction (correlation) is enough, since you're just visiting and want to avoid crime!

