BCOR 2205: Mods 6 & 7


Logistic regression

is a simple twist on linear regression. It gets its name from the fact that it is a regression that predicts what's called the log odds of an event occurring.
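A minimal sketch of the log-odds idea in plain Python (the function names here are illustrative): a logistic regression predicts the log odds of an event, and the sigmoid function maps that prediction back to a probability.

```python
import math

def log_odds(p):
    # A logistic regression predicts this quantity: log(p / (1 - p))
    return math.log(p / (1 - p))

def sigmoid(z):
    # The inverse transform: maps log odds back to a probability
    return 1 / (1 + math.exp(-z))

z = log_odds(0.8)   # log(0.8 / 0.2) = log(4)
p = sigmoid(z)      # recovers 0.8
```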

8 Criteria of Auto ML Excellence

1. Accuracy 2. Productivity 3. Ease of use 4. Understanding and learning 5. Resource availability 6. Process transparency (affects understanding and learning) 7. Generalizability across contexts 8. Recommended actions

Boolean (when states are represented as true or false)

0 -> no 1 -> yes Yes/No, True/False, 1/0

Training Set

Data used to create/train the model

Test

Data you pretend is from the future (don't use to train the model)
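To make the training/test idea concrete, here is a stdlib-only sketch (the helper name and the 80/20 split are assumptions for illustration, not a prescribed recipe):

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    # Shuffle a copy so the original order is untouched, then slice
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = list(range(100))
train, test = split_train_test(data)
# train gets 80 rows; test holds out 20 rows we "pretend are from the future"
```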

Our Project Statements (BAD EXAMPLE)

"Readmissions cost our hospital $65m last year, and we don't have a model in place to determine which patients are at risk of readmission"

Log Loss

- A measure of accuracy
- Rather than evaluating the model directly on whether it assigns cases (rows) to the correct "label", the model is evaluated based on the probabilities it generates and their distance from the correct answer
- Lower scores are better
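As a from-scratch sketch of how this metric rewards probabilities close to the correct answer (this is an illustration, not any specific library's implementation):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    # Penalizes each prediction by its distance from the correct label;
    # probabilities are clipped away from 0 and 1 so the log is defined
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident = log_loss([1, 0], [0.9, 0.1])   # correct and confident
unsure = log_loss([1, 0], [0.6, 0.4])      # correct but hesitant
# confident < unsure: lower scores are better
```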

Generalizable across contexts

An AutoML should work for all target types, data sizes, and different time perspectives. In other words, it can predict targets that are numerical (regression problems) as well as targets that contain categorical values (classification problems), both those whose values have two categories and those with multiple categories. Additionally, the system should be capable of handling small, medium, and big data. When data is too big, AutoML should automatically sample down the dataset as necessary, or perhaps even automate the selection of the optimal sample size. Finally, an AutoML should be able to handle both cross-sectional data (data collected at one time, or treated as such) and longitudinal data (data where time order matters). Time is a complex point that we will return to in Chapters 25 and 26. For the hospital readmission project, time is not a factor, as we focus on the first interaction with any patient, specifically their discharge, and then examine the next month to find out whether they were readmitted or not. Our dataset is on the higher end of small (10,000 records), and its target has two values (readmitted=True or readmitted=False). We expect the AutoML system to work as well for the hospital setting as for other business settings.

Confusion Matrix

A chart that tells us what actually happened
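A small illustrative implementation of those "what actually happened" tallies for a binary target (the function and dictionary keys are our own naming, not DataRobot's):

```python
def confusion_counts(actual, predicted):
    # Tally what actually happened vs. what the model predicted
    pairs = list(zip(actual, predicted))
    return {
        "TP": pairs.count((1, 1)),  # predicted yes, actually yes
        "TN": pairs.count((0, 0)),  # predicted no, actually no
        "FP": pairs.count((0, 1)),  # predicted yes, actually no
        "FN": pairs.count((1, 0)),  # predicted no, actually yes
    }

actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
cm = confusion_counts(actual, predicted)
```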

Productivity

A large part of the productivity improvements of AutoML will come from the same processes listed under accuracy, in large part because a data scientist's job is to constantly hunt for accuracy improvements. This hunt means perpetually living with a voice at the back of their head whispering that if only they could find a better algorithm or could tune an existing algorithm, the results would improve. Being freed from that voice is the single greatest productivity impact available. Other factors improving productivity include graceful handling of algorithm-specific needs. For example, some algorithms, such as regression, will throw out any case with even a single missing value, so we "impute" missing values (for example by setting all values to the mean of the feature) before such algorithms get the data. Other types of algorithms work better without such imputation, or with information on which cases have had a given feature imputed. Assuming good-quality data and existing subject matter expertise, the analyst should be able to conduct the analysis and be ready (PowerPoint and all) to present results to top management within two hours.
Without subject matter expertise, the requisite time will be closer to one week. Without access to an AutoML, the timeline may be closer to three months. For the hospital readmissions project, productivity gains come from being able to run different projects per disease group at a vastly improved pace.

Understanding & learning

An AutoML platform or application should improve an analyst's understanding of the problem context. The system should visualize the interactions between features and the target and should "explain" them so that the analyst can stand her ground when presenting findings to management and difficult questions rain down from above. One way the system may accomplish this is by allowing the analyst to interactively experiment with different decisions. In short, the system should support thinking around business decisions. For the hospital project, the analyst may find that patients admitted for elective procedures are six percentage points more likely to be readmitted than emergency admits, potentially prompting some reflective thinking. A good AutoML system will allow the analyst to uncover such potentially useful findings, but only subject matter expertise will allow confident evaluation of this finding.

Ease of use

Analysts should find the system easy to use, meaning that an open-source system designed for coding data scientists should integrate easily into their process flow and be easily understandable. For non-coding analysts, the requirements are entirely different: here the analyst must be guided through the data science process, and visualization and explanations must be intuitive. The system should minimize the machine learning knowledge necessary to be immediately effective, especially if the analyst does not know what needs to be done to the data for a given algorithm to work optimally. Once a model has been created and selected for use in production, operationalizing that model should be easy. For the hospital project, this means that the health management organization's analysts should all have the prerequisite skills to oversee AutoML, and that upon selection of a model, it is simple to implement it into the hospital's decision flow such that when a patient is ready for release, the system automatically uploads their record to the ML model, the model produces a prediction, and recommendations are provided to the user of the hospital system. For medium-range predictions of the likelihood of being readmitted, perhaps a cheap intervention, such as prescribing half an hour of watching a video on how to manage one's diabetes, is likely to improve outcomes.

Numerical Data (aka Quantitative)

Arise from counting, measuring, or some kind of mathematical operation, e.g., 20 of you visited the course website since last class. Can be discrete or continuous. Quant = number based

Categorical Data (aka Qualitative)

Described by words rather than numbers, e.g., Freshman, Sophomore, Junior, Senior; Train, Plane, Bus, etc. Can be nominal or binary

General Platforms

Designed for general-purpose machine learning; split into two types: open source and commercial

Process transparency

Because AutoML insulates the analyst from the underlying complexity of the machine learning, it can be harder to develop trust in such systems. It is critical, therefore, that for analysts who seek to understand machine learning, as we will do in this book, it is possible to drill into the decisions made by the system. Process transparency interacts with the earlier criteria of understanding and learning in that without transparency, learning is limited. It is worth noting that process transparency focuses on improving knowledge of the machine learning process, whereas the understanding and learning criterion focuses more on learning about the business context. Finally, process transparency should enable evaluation of models against each other beyond just their overall performance: which features are the most important in the neural network vs. the logistic regression, and in which cases each excels. For the hospital readmission project, details such as which algorithm produced the best model and the transformations applied to the data should also be available to the analyst.

Agglomerative Clustering:

Bottom up→ all the data points start off as their own cluster and are merged together until there is only 1
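The bottom-up merging can be sketched on toy 1-D points; this sketch assumes single linkage (distance between the closest members of two clusters), which is just one of the linkage rules real implementations offer:

```python
def agglomerate(points):
    # Bottom up: every point starts as its own cluster, then the two
    # closest clusters are merged until only one remains
    clusters = [[p] for p in points]
    history = [len(clusters)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        history.append(len(clusters))
    return history

steps = agglomerate([1.0, 1.1, 5.0, 5.2])
# Cluster count shrinks 4 -> 3 -> 2 -> 1
```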

Accuracy

By far the most important criterion. Without accuracy, there is no reason to engage in the use of AutoML (or any machine learning for that matter). Accuracy stems from the system selecting which features to use and creating new ones automatically, as well as comparing and selecting a variety of relevant models and tuning those models automatically. The system should also automatically set up validation procedures, including cross validation and holdout, and should rank the candidate models by performance, blending the best models for potential improvements. If a model predicts readmission of ten patients within a month, but only one is readmitted, then the accuracy may be too low for use.

Features

Can be thought of as the independent variables we will use to predict the target

Feature Engineering

Cleaning data, combining features, splitting features into multiple features, handling missing values, and dealing with text, etc.
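One of the missing-value techniques mentioned above, mean imputation, can be sketched in a few lines (the helper name and toy data are illustrative):

```python
def impute_mean(values):
    # Replace missing values (None) with the mean of the observed values
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [34, None, 58, 44, None]
filled = impute_mean(ages)   # the two Nones become (34 + 58 + 44) / 3
```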

Accuracy Equation

Correct Classifications/Total # of Cases True Positives + True Negatives/Total # of Cases
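The equation translated directly into code (the example counts are made up):

```python
def accuracy(true_positives, true_negatives, total_cases):
    # Correct classifications / total number of cases
    return (true_positives + true_negatives) / total_cases

acc = accuracy(true_positives=40, true_negatives=45, total_cases=100)
# (40 + 45) / 100 = 0.85
```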

Should We Use More Data?

What is the cost? When was the "additional" data collected? If it's older, it may not add additional insights. Use validation results to aid in the decision about using more data, even if you have to "manually" run a few more models.

Supervised Machine Learning

The data scientist tells the machine what they want it to learn (identifies the target); the data scientist selects WHAT they want the machine to learn (likely buyers vs. likely non-buyers)

Model diagnostics

Evaluation of top models

3 Types of Relationships: Feature Effects:

Feature impact for specific feature values. DR location: Models tab → select model → Understand → Feature Effects → Compute Feature Effects. Actual value: determined by looking at all of the actual values, taking an average, and plotting them. Predicted value: determined by looking at all of the predicted values, taking an average, and plotting them.

Silhouette Score:

Helps determine how well fit our clusters are, even without existing labels. In a good clustering, the data points in one cluster are close to each other but far away from points in other clusters. High silhouette score = tight, well-separated clusters. Low silhouette score = overlapping or poorly fit clusters.
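A from-scratch sketch of the score for 1-D points (real libraries generalize this to arbitrary distance metrics; the toy data are assumptions):

```python
def silhouette_score(points, labels):
    # For each point: a = mean distance to its own cluster,
    # b = mean distance to the nearest other cluster,
    # silhouette = (b - a) / max(a, b); return the mean over all points
    scores = []
    for i, p in enumerate(points):
        own = [abs(p - q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own)
        b = min(
            sum(abs(p - q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

tight = silhouette_score([1.0, 1.1, 9.0, 9.1], [0, 0, 1, 1])
loose = silhouette_score([1.0, 4.0, 6.0, 9.0], [0, 0, 1, 1])
# tight clusters score near 1; overlapping clusters score much lower
```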

Speed vs. Accuracy:

How rapidly will the model evaluate new cases after being put into production? How many nanoseconds will it take to get the actual results? (In DataRobot, select the models built off of 64% of the data, shown in the frame on the right side.)

Context-Specific Tools

Implemented within another system or for a specific purpose (Salesforce Einstein in Salesforce)

*

In DataRobot, a * in the leaderboard means DataRobot used more data to create the model

Continuous Data

Continuous: infinite number of possible responses; like any point on a number line, any value on a spectrum. Discrete: finite number of options. Examples of discrete: course letter grade, country of origin, or Likert scale.

Algorithm selection and hyper-parameter tuning

Keeping up with the "dizzying number" of available algorithms and their quadrillions of parameter combinations

LINEAR DISCRIMINANT ANALYSIS (LDA)

LDA asks which category is more likely--> a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events.
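In the special case of a single feature with equal class priors and equal class variances, LDA's linear boundary sits midway between the class means, so the decision rule reduces to picking the class with the nearer mean. A sketch under those assumptions (the data values are made up):

```python
from statistics import mean

def lda_1d(class_a, class_b, x):
    # One feature, equal priors, equal variances: LDA asks which
    # category is more likely, which here means the closer class mean
    return "A" if abs(x - mean(class_a)) < abs(x - mean(class_b)) else "B"

group_a = [150, 155, 160]
group_b = [180, 185, 190]
prediction = lda_1d(group_a, group_b, 158)
```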

Additional Features:

Whether additional features would help the model. Learning curves will NOT tell you this (they only address whether additional cases will help).

Machine Learning: Basic Overview

Machine learning is about predicting the future based on the past. PAST (training data) --> learn --> model/predictor; FUTURE (testing data) --> model/predictor --> predict

Auto ML

Makes machine learning accessible to most people because it removes the need for years of experience in the most arcane aspects of data science, such as the math, statistics, and computer science skills required to be a top contender in traditional machine learning.

Binary data

Nominal attribute with only two categories/states

Target

The variable (often an outcome) that we are trying to understand, gain insights about, and predict in future cases

Unsupervised Machine Learning

Up to the machine to decide what it wants to learn; the computer sorts the data by itself, with no guarantee that what it learns will be useful to the analyst

Our Project Statements (GOOD EXAMPLE)

Our organization suffers over $65 million in preventable losses annually due to the readmission of diabetes patients who are discharged from our hospitals too soon or who are inadequately prepared to manage their disease on their own. However, keeping all patients in the hospital longer is costly, risky, and inconvenient to patients. We will create a machine-learning model capable of detecting which patients are likely to be readmitted within 30 days of discharge and develop educational and support programs targeting these patients. 60% of our patients are discharged to home without home health services; this group mirrors the overall patient population in terms of average readmission. We are especially interested in detecting cases where these patients are highly likely to be readmitted.

Classifier:

Predicts which group something will be in: classifies it

Commercial Platform

Provided by a commercial vendor, presumably for a price.

The Machine Learning Pipeline

Raw Data --> Features --> Models --> Deploy in Production --> Predictions

Dimensionality Reduction:

Reducing the number of variables we have to deal with
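One very simple way to reduce the number of variables is to drop near-constant columns; variance thresholding is just one of many approaches (alongside methods like PCA), and the helper name and data here are illustrative:

```python
from statistics import pvariance

def drop_low_variance(rows, threshold=1e-9):
    # Drop columns whose values barely vary; they carry little information
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows], keep

rows = [[1.0, 5.0, 7.0],
        [1.0, 6.0, 2.0],
        [1.0, 4.0, 9.0]]
reduced, kept = drop_low_variance(rows)   # the constant first column is dropped
```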

K-Nearest Neighbors:

Relies on the idea that data points will be similar to other data points that are near it
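A minimal majority-vote sketch of the idea on 1-D points (k=3 and the toy data are assumptions):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    # Label a new point by majority vote among its k nearest training points
    nearest = sorted(train, key=lambda pt: abs(pt[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [(1.0, "A"), (1.2, "A"), (1.4, "A"), (8.0, "B"), (8.3, "B")]
label = knn_predict(train, query=1.1)   # the three nearest points are all "A"
```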

Model

Set of weighted relationships between the features and the target (act of buying vs. not buying and a set of features)

Learning Curves:

Shows how the model's predictive ability changes with sample size. Answers the question: "Will more data help our model?" or "Would more data improve the model's predictive ability?" Learning curves tell you if additional cases will be helpful. The steeper the line, the more value you'll see by adding more data.

Prediction Target

The behavior of a "thing" (person, stock, organization, etc.) in the past that we will use to predict the future

Resource availability

The AutoML system should be compatible with existing business systems and easily integrated with other tools in the business intelligence ecosystem. Such compatibility means that it should be able to connect to existing databases and file formats when ingesting data. A system developed around a proprietary file format that requires turning existing data into that file format before use by the system will rate poorly on this criterion. The system should also allow easy use of the resulting model, either through an application programming interface (API) or through code that can be placed easily into the organizational workflow. A solid AutoML should address memory issues, storage space, and processing capabilities in a flexible manner. While this will differ depending on the intended use, for a subject matter expert user of AutoML, this means that a cloud-based system or an enterprise system on a powerful cluster of servers is probably necessary, as an analyst's individual computer could not be expected to scale up to handle any problem in a reasonable amount of time. An AutoML for the coding data scientist could, on the other hand, expect the analyst to set up "containers" or virtual machines. Finally, support for the system should be available, either through online forums or customer service, allowing the analyst access to support from experienced machine learning experts. For the hospital project, the data, which is available as a stand-alone file, should be easily uploadable to the AutoML platform, where we should not have to worry whether the file is too large or whether there is enough "memory" and processing power available to handle the data and analytical processes.

Which Model to Select?

The models at the top of the leaderboard are typically the better models

3 Types of Relationships: Importance

The overall impact of a feature without consideration of the impact of other features Green bar (larger = more important)

3 Types of Relationships: Feature Impact

The overall impact of a feature adjusted for the impact of the other features. Models → Understand → Feature Impact → Enable Feature Impact. Takes all of the data and shuffles each one of the variables to see how it impacts the predictions; if shuffling a variable greatly changes the predictions, you know that the variable was very impactful.

AutoML: Automated Machine Learning

The process of automating machine learning. Makes ML possible without extensive math/stat/programming. Model diagnosis: evaluation/ranking of the models. Models can be run simultaneously. Combinations of models can be run (blenders). Most of our models will run in 20-30 minutes. DataRobot can outperform data scientists. "Areas where repetitive tasks negatively impact the productivity of their data scientists, and AutoML has a definite positive impact on productivity"

Exploratory Data Analysis

The process of examining the descriptive statistics for all features as well as their relationship with the target variable

Recommend actions.

This criterion is mostly for context-specific AutoML. With specific knowledge of the analytical context, the system should be able to translate a probability into action. For example, as the user of an AutoML, you may decide to analyze employee turnover data. The system returns robust predictions on which employees are likely to stay and who will leave over the next year. Let's say two employees come back with the same probability of leaving, .90 (90% likely to depart the organization). One is an employee you know to be the backbone of the company, and one is a slacker. The AutoML cannot understand this unless it has also been programmed to "know" the context of the problem.

Open-Source Platform

Tools tend to be developed by and for computer and data scientists and generally require knowledge of programming languages

Date and Time Data:

Usually stored internally as an unsigned integer (number of seconds since 1970). Date and time data lead to conversion nightmares because of how many formats there are.
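Python's standard library shows the seconds-since-1970 convention directly: timestamp 0 corresponds to midnight on 1970-01-01 UTC, and the conversion round-trips.

```python
from datetime import datetime, timezone

# Unix timestamps count seconds since 1970-01-01 00:00:00 UTC
ts = 0
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
iso = dt.isoformat()         # a human-readable ISO 8601 string
back = int(dt.timestamp())   # round-trips to the original integer
```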

Unit of Analysis:

What specifically are you looking at? What does your data set tell you? What is it defining/describing?

Categorical (a.k.a Nominal) Data

You can identify that groups are different, but there is no meaningful ranking. Examples: -Occupation {teacher, dentist, data scientist, student...} -Marital status {single, married, divorced, widowed} -CustomerId {1, 2, 3,..., n}

Text (Strings) Data

You specify a number of characters; this is either exact or a maximum depending on the data type. Varchar, variable-length string, text varying, VSTR, and other names for variable-length strings

AI

can be thought of as a collection of machine learning algorithms with a central unit deciding which of the ML algorithms need to kick in at that time, similar to how different parts of the human brain specialize in different tasks.

Classification

predicts the category to which a new case belongs. For example, we might build a model of divorce during the first ten years of marriage (our target is divorced: TRUE or FALSE). Alternatively, we can use machine learning to conduct regression, that is, to predict the target's numeric value. With this second kind of target, we might, for example, predict how many years of blissful union a couple has ahead of them.

A unit (of analysis)

the what, who, where, and when of our project. The what could be whether a visitor did indeed click on an ad. The who could be readmitted (or not) patients. An example of the where could be the location of a future crime, and when might be the time we expect a piece of machinery to break down. For each unit of analysis, there can be numerous outcomes that we might want to predict. If the customer is the unit of analysis and you work for Netflix, you may want to predict whether they would be likely to accept an upgrade offer, whether they are likely to cancel their contract next week, and whether they are more likely to enjoy watching Stranger Things or House of Cards.

Machine Learning

•A subset of AI -- builds models to predict future outcomes •"A field of study that gives computers the ability to learn without being explicitly programmed" -McClendon & Meghanathan •The practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world •Takes questions that are already answered and tries to figure out how to predict them •It's called machine learning because instead of following strict rules and instructions from humans, the computers (or machines) learn how to do things from data.

Business Problem

•Anything a company would want to know in order to increase sales or reduce costs

•What might a company like to know?

•Customers likely to buy a product •Customers likely to return a product •Why Customers do not purchase a product •Why Customers do purchase a product •Why Customers are dissatisfied •Why Customers do not renew their contracts •Who will be a bad customer

Data Robot Terms

•Feature Name- directly from flat file •Index- common way to talk about a feature (i.e. feature #34) •Importance (green bars)- Alternating Conditional Expectations (ACE score); answers: is there a relationship? •Var Type- Boolean, Categorical, Numeric, Text; check that they match your expectations •Unique- number of unique values •Missing- number of missing values •Descriptive Stats- Mean, Stddev, Median, Min, Max, where applicable

Artificial Intelligence:

•Machines that can perform tasks that are characteristic of human intelligence. Earliest AI: Checkers video games, Chess, etc. Now it analyzes data

Before We Load the Data

•Open data and examine •Also open the Data Dictionary to be sure you have a good grasp on the terms

Specify the Business Problem

•State problem in language of business (not language of modeling) •What actions might result from this modeling project •Specify actions that might result •Include specifics (number of customers affected, costs etc.) •Explain impact to the bottom line

2. Acquire & Explore Data

◻ Find appropriate data ◻ Merge data into single table ◻ Conduct exploratory data analysis ◻ Find and remove any target leakage ◻ Feature engineering

4. Interpret & Communicate

◻ Interpret model ◻ Communicate model insights

5. Implement, Document & Maintain

◻ Set up batch or API prediction system ◻ Document modeling process for reproducibility ◻ Create model monitoring and maintenance plan

1. Define Project Objectives

◻ Specify business problem ◻ Acquire subject matter expertise ◻ Define unit of analysis and prediction target ◻ Prioritize modeling criteria ◻ Consider risks and success criteria ◻ Decide whether to continue

3. Model Data

◻ Variable selection ◻ Build candidate models ◻ Model validation and selection

