INFO Quiz 3 2205
Descriptive Stats
-Mean, Std. dev, Median, Min, Max
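A minimal sketch of these five statistics using Python's standard library. The `values` list is made up for illustration and is not from the course data.

```python
import statistics

# Hypothetical numeric feature values (made up for illustration).
values = [2, 4, 4, 5, 10]

summary = {
    "mean": statistics.mean(values),
    "std_dev": statistics.stdev(values),  # sample standard deviation
    "median": statistics.median(values),
    "min": min(values),
    "max": max(values),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```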
8 Criteria of Auto ML Excellence
1) Accuracy 2) Productivity 3) Ease of use 4) Understanding and learning 5) Resource availability 6) Process transparency (affects understanding and learning) 7) Generalizability across contexts 8) Recommended actions
Three types of relationships in DataRobot are...?
1. Importance 2. Feature Impact 3. Feature Effects
Why is Cross-Validation better than validation?
1. Uses more data to build the model (trains on multiple folds), which makes it more accurate 2. Helps prevent overfitting 3. Validates on multiple folds
What % of the data is the validation set?
16% of the data
What % of the data is the holdout set?
20% of the data; this allows for 5-fold validation using the remaining 80% (80 / 5 = 16)
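The partition arithmetic can be checked with a few lines. This is only a sketch of the percentages on these cards. The 64% training figure is assumed here to come from training on four of the five folds, which is consistent with the numbers in this set but is an inference, not something the cards state outright.

```python
# Sketch of the partition percentages described on the cards.
HOLDOUT_PCT = 20   # set aside and not used for model selection
N_FOLDS = 5        # 5-fold cross-validation

remaining_pct = 100 - HOLDOUT_PCT        # 80% left after removing the holdout
fold_pct = remaining_pct / N_FOLDS       # each fold (one validation set): 16%
training_pct = fold_pct * (N_FOLDS - 1)  # assumption: the other four folds train the model

print(f"validation fold: {fold_pct}%")   # validation fold: 16.0%
print(f"training data:   {training_pct}%")  # training data:   64.0%
```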
In quick mode DataRobot uses what % of the data to initially build the model in round 1?
32% of the data
What type of validation does DataRobot use?
5-Fold Cross-Validation
Only use models built off what % of the data?
64%
In round 2, DR runs off of what % of the data?
64% of the data (this is what algorithms run off of)
Zip file
A document that has been compressed to take up less space on a computer and to make it quicker to download.
What is Logloss?
A measure of accuracy; Rather than evaluating the model directly on whether it assigns cases (rows) to the correct "label", the model is evaluated based on probabilities generated by the model and their distance from the correct answer; Lower scores are better
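A minimal sketch of binary log loss, written from the standard formula rather than taken from DataRobot. The labels and probabilities are made up, and the `eps` clipping is a common safeguard added here as an assumption to avoid taking log(0).

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log loss; lower scores are better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical data: 1 = event happened, 0 = it did not.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the correct answer
unsure = [0.6, 0.4, 0.5, 0.5]     # probabilities farther from the correct answer

print(log_loss(labels, confident))
print(log_loss(labels, unsure))
```

Both sets of predictions would assign most rows to the correct label at a 0.5 cutoff, but the confident probabilities sit closer to the correct answer and so get the lower (better) score.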
Confusion Matrix
A table showing actual and predicted (TN, FN, FP, TP)
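As a rough illustration, the four cells can be counted from paired actual and predicted labels. The data below is made up, and 1 is assumed to be the positive class.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:  # a == 1 and p == 0
            counts["FN"] += 1
    return counts

# Hypothetical labels for eight rows.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```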
What is the advantage of cross-validation over validation?
Able to use more data to build the model and more data to validate it
The score in Machine Learning that is calculated by taking the correct classifications divided by total number of cases measured is defined as...
Accuracy
What is the most important criterion for Auto ML excellence?
Accuracy
What are the Confusion Matrix Metrics?
Accuracy, True Positive Rate, True Negative Rate
Importance (green bars)
Alternating Conditional Expectations (ACE Score); Answers the question "is there a relationship?"
How do you get the cross validation score?
Averaging the validation scores less the holdout (average of the 5 scores)
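In code, the averaging step might look like this. The five fold scores are invented for illustration and are not real DataRobot output.

```python
# Hypothetical validation scores (e.g. Logloss), one per fold.
fold_scores = [0.42, 0.40, 0.44, 0.41, 0.43]

# Cross-validation score = mean of the five fold scores.
# The holdout set plays no part in this number.
cv_score = sum(fold_scores) / len(fold_scores)

print(round(cv_score, 4))  # 0.42
```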
Why will blended models be slower?
Because it has to blend multiple models
Var Type
Boolean, categorical, numeric, text
Features
Can be thought of as the independent variables we will use to predict the target
Which type of data is more predictive when using decision tree analysis?
Categorical is more predictive than text (because it breaks it up into different categories)
Feature Engineering
Cleaning data, combining features, splitting features into multiple features, handling missing values, and dealing with text, etc.
Supervised ML
Data scientist tells the machine what it wants it to learn (identifies target)
In DataRobot, a * in the leaderboard means that...
DataRobot used more data to create the model; DO NOT use these models
Which is more important for a model, speed or accuracy?
Depends on what you're using it for (For example, day-trading models should probably be faster)
Decision Trees
Diagrams where answers to yes or no questions lead decision makers to address additional questions until they reach the end of the tree.
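A hand-written sketch of that idea: each `if` is one yes/no question, and the answer decides which question comes next. The feature names (`distance_km`, `is_holiday`, `items`) and the cutoffs are hypothetical. A real decision tree algorithm learns its questions from training data instead of having them typed in.

```python
def predict_late_delivery(order):
    """Walk a tiny, hand-built decision tree of yes/no questions."""
    if order["distance_km"] > 100:   # Question 1: is it a long-distance order?
        if order["is_holiday"]:      # Question 2a: was it placed on a holiday?
            return "late"
        return "on time"
    if order["items"] > 20:          # Question 2b: is it a large order?
        return "late"
    return "on time"

# Hypothetical order record.
order = {"distance_km": 150, "is_holiday": True, "items": 3}
print(predict_late_delivery(order))  # late
```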
Flat Line in learning curves means?
Don't add more data
What does Auto ML do?
Evaluate and rank different models; run combinations of models simultaneously
When evaluating ML results, I should always choose the fastest model
False
Where would you find your best feature for your best model?
Feature impact
Feature Effects
Feature impact for specific feature values (circle and + points)
Discrete data
Finite number of options; Like course letter grade, country of origin, etc.
What does the confusion matrix tell us?
How often we are right and how often we are wrong
Why is cross validation important?
If the original validation partition is not representative of the overall population, then the resulting model may appear to have a high accuracy when in reality it just happens to fit the unusual validation set well
Aggregate Targets
In some cases, a target might be a combination of multiple attributes; this is called a...
continuous data
Infinite number of possible responses; Like any point on a number line
For learning curves, if a line is pretty horizontal
It likely does not need more data
In HW 3, How did you know ENET Blender was the best model?
It used 64% of the data, did not have an *, and had the lowest cv score
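The same three checks can be sketched as a filter plus a minimum. The leaderboard rows below are plain dictionaries with invented names and scores. This is not the DataRobot API and not the actual HW 3 leaderboard.

```python
# Hypothetical leaderboard rows (names and scores are made up).
leaderboard = [
    {"name": "Model A", "sample_pct": 64, "starred": False, "cv_logloss": 0.415},
    {"name": "Model B", "sample_pct": 80, "starred": True, "cv_logloss": 0.401},
    {"name": "Model C", "sample_pct": 64, "starred": False, "cv_logloss": 0.409},
    {"name": "Model D", "sample_pct": 32, "starred": False, "cv_logloss": 0.430},
]

# Keep only models built on 64% of the data that do not carry an asterisk.
eligible = [m for m in leaderboard if m["sample_pct"] == 64 and not m["starred"]]

# Among those, the lowest cross-validation Logloss wins.
best = min(eligible, key=lambda m: m["cv_logloss"])
print(best["name"])  # Model C
```

Model B has the lowest score overall but is excluded because it is starred and was built on more than 64% of the data.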
Where do you go to compare models in DR?
Leaderboards
What was the measure to determine how accurate a model was?
Logloss
Artificial Intelligence (AI)
Machines that can perform tasks that are characteristic of human intelligence
By looking at past preferences we can...
Make predictions about the future
How do you find "feature effects" in DataRobot?
Models --> Select Model --> Understand --> Feature Effects --> Compute Feature Effects
How do you find "feature impact" in DataRobot?
Models --> Understand --> Feature Impact --> Enable Feature Impact
Steep line in learning curve means?
More data would be helpful
What are types of categorical data?
Nominal and binary
Binary data
Nominal attribute with only two categories/states (Ex: 0 --> no, 1--> yes)
Should more data always be used?
Not necessarily. Consider the costs and the fact that data that is too old won't be helpful at all; use validation results to aid in the decision about using more data
Missing
Number of missing values
Unique
Number of unique values
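A small sketch of how the Missing and Unique counts could be computed for one column. The column values are invented, missing values are represented as `None`, and the unique count is assumed to exclude missing values.

```python
# Hypothetical categorical column; None marks a missing value.
column = ["NY", "CA", None, "NY", "TX", None, "CA"]

missing = sum(1 for v in column if v is None)       # number of missing values
unique = len({v for v in column if v is not None})  # distinct non-missing values

print(missing, unique)  # 2 3
```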
What is one of the biggest problems with decision trees?
Overfitting
Over Training (Over Fitting)
Poor generalization. The model simply memorizes the training examples and is not able to give correct outputs for patterns that were not in the training dataset
ROC Curve
Probability density on the y-axis and probability of the event on the x-axis; positive cases are green, negative cases are purple; the line marks where cases are predicted to be positive or negative
Any row with even one missing value will get that whole row kicked out from the analysis when using...
Regression and Neural Networks
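A minimal sketch of what "kicking out the whole row" looks like. The rows and field names are hypothetical, and missing values are represented as `None`.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "late": 1},
    {"age": None, "income": 61000, "late": 0},  # one missing value
    {"age": 45, "income": None, "late": 0},     # one missing value
    {"age": 29, "income": 48000, "late": 1},
]

# Keep a row only if every one of its values is present.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

print(len(rows), "->", len(complete_rows))  # 4 -> 2
```

Half of this small data set is lost even though only two individual values were missing.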
Linear Discriminant Analysis
Segregates a larger group into homogeneous subgroups
Learning Curves
Shows how the model's predictive ability changes with 'sample size'; "will more data help my model"
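One rough way to read a learning curve in code is to compare the last two points. The curves and the `min_improvement` cutoff below are invented for illustration, and the score is assumed to be Logloss (lower is better). This is a rule of thumb, not how DataRobot decides anything.

```python
def more_data_likely_helps(curve, min_improvement=0.005):
    """curve: list of (sample_pct, logloss) pairs ordered by sample size.

    Returns True if the score still improved noticeably on the last step
    (steep line) and False if it barely moved (flat line).
    """
    (_, prev_score), (_, last_score) = curve[-2], curve[-1]
    return (prev_score - last_score) > min_improvement

# Hypothetical learning curves.
steep = [(16, 0.60), (32, 0.50), (64, 0.42)]
flat = [(16, 0.45), (32, 0.421), (64, 0.420)]

print(more_data_likely_helps(steep))  # True
print(more_data_likely_helps(flat))   # False
```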
Training Set
Subsection of a dataset from which the machine learning algorithm uncovers or "learns" relationships between the features and the target variable
What type of ML is focused on in this class?
Supervised Machine Learning is the focus of this class
True Negative Rate
TN / (TN + FP)
Accuracy (Confusion Matrix)
(TP + TN) / All cases
True Positive Rate
TP / (TP + FN)
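The three confusion matrix metrics can be computed together from the four cell counts. The counts in the example are made up. Note that accuracy adds TP and TN first and only then divides by all cases.

```python
def rates(tp, tn, fp, fn):
    """Confusion matrix metrics from the four cell counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,         # correct classifications / all cases
        "true_positive_rate": tp / (tp + fn),  # share of actual positives caught
        "true_negative_rate": tn / (tn + fp),  # share of actual negatives caught
    }

# Hypothetical counts for 100 cases.
result = rates(tp=40, tn=45, fp=5, fn=10)
print(result)
```

With these counts, accuracy is 85 / 100, the true positive rate is 40 / 50, and the true negative rate is 45 / 50.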
What do decision trees do?
Take several different features to predict the target
Which data set should never be used to make decisions about which algorithms to use?
The Holdout Set
Where would you go to find what models are being used to create a blended model?
The blueprint
Feature Impact
The overall impact of a feature adjusted for the impact of the other features (Blue horizontal bar graph)
Importance
The overall impact of a feature without consideration of the impact of other features (Green Bars)
Machine Learning (ML)
The practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world; A subset of AI
Exploratory Data Analysis
The process of examining the descriptive statistics for all features as well as their relationship with the target variable
Overfitting
The process of fitting a model too closely to the training data for the model to be effective on other data.
Unsupervised ML
Up to the machine to decide what it wants to learn
Data Splitting
Use algorithms to split data and build several models. Split into: 1. Training Set 2. Validation Set 3. Holdout Set
Holdout Sample
Used to assess the performance of that model
cross validation
Verifying the results obtained from a validation study by administering a test or test battery to a different sample (drawn from the same population)
Unit of Analysis
What the data set is about (Ex: for HW 3 it was late deliveries)
When do you unlock the holdout?
When you implement the model.
Nominal data
You can identify that groups are different, but there is no meaningful ranking (Ex: occupation, marital status, customer ID, etc.)
Text (String) data
You specify a number of characters, either exact or max amount
Holdout Set
a subsection of a dataset used to provide a final estimate of the machine learning model's performance after it has been trained and validated. Holdout sets should never be used to make decisions about which algorithms to use or for improving or tuning algorithms.
For learning curves, when a line is steep, it means
adding more data should increase accuracy
Index
common way to talk about feature
Feature Name
directly from Flat File
What are types of numerical data?
discrete and continuous
Validation score
evaluated on a variety of metrics
What happens if you have too much generalization? (underfitting)
inaccurate results
regression analysis
measures the impact of a set of variables on another variable
k-fold cross validation
partition the data set (less the holdout set) into k equal subsets, each subset is called a fold
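A minimal sketch of the partitioning step, assuming the holdout rows have already been removed. Rows are dealt into folds round-robin to keep the example short; real tools typically shuffle or stratify first. The function names are hypothetical.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k folds of (nearly) equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)  # deal rows out round-robin
    return folds

def train_validation_splits(folds):
    """Yield (training_rows, validation_rows), using each fold once for validation."""
    for i, validation in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, validation

# Example: 100 rows total, 20 held out, so 80 rows remain for 5 folds.
folds = k_fold_indices(n_rows=80, k=5)
splits = list(train_validation_splits(folds))

training, validation = splits[0]
print(len(training), len(validation))  # 64 16
```

With 80 remaining rows out of 100, each fold holds 16 rows and each training set holds 64, which lines up with the 16% and 64% figures in this set.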
ML is about predicting the future based on the...
past
Features are your
predictors
Data Dictionary
stores definitions, such as data types for fields, default values, and validation rules for data in each field
Validation (test) set
subsection of a dataset to which we apply the machine learning algorithm to see how accurately it identifies relationships between the known outcomes for the target variable and the dataset's other features
Which Logloss score would appear higher on the leaderboard?
the lowest one
Auto ML
the process of automating machine learning; makes ML possible without extensive math/stat/programming
Target
the variable we are trying to predict and gain insights about
Boolean
when states are represented as true or false