INFO Quiz 3 2205

Descriptive Stats

-Mean, Std. dev, Median, Min, Max
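These five statistics can be computed with Python's standard library. This is a minimal sketch: the sample values are made up for illustration, and it assumes the population standard deviation (the card does not say whether the sample or population version is meant).

```python
import statistics

# Hypothetical sample values, used only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(values),
    "std_dev": statistics.pstdev(values),  # population standard deviation (assumption)
    "median": statistics.median(values),
    "min": min(values),
    "max": max(values),
}
print(summary)
```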

8 Criteria of Auto ML Excellence

1) Accuracy 2) Productivity 3) Ease of use 4) Understanding and learning 5) Resource availability 6) Process transparency (affects understanding and learning) 7) Generalizability across contexts 8) Recommended actions

Three types of relationships in DataRobot are...?

1. Importance 2. Feature Impact 3. Feature Effects

Why is Cross-Validation better than validation?

1. Uses more data to build the model (it builds on multiple folds), which makes it more accurate 2. Helps prevent overfitting 3. Validates on multiple folds

What % of the data is the validation set?

16% of the data

What % of the data is the holdout set?

20% of the data; this allows for 5-fold validation using the remaining 80% (80 / 5 = 16)

In quick mode DataRobot uses what % of the data to initially build the model in round 1?

32% of the data

What type of validation does DataRobot use?

5-fold cross-validation

Only use models built off what % of data


In round 2, DR runs off of what % of the data

64% of the data (this is what algorithms run off of)
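The percentages on the partition cards above fit together arithmetically. Here is a minimal sketch, assuming (the cards do not state this) that the 32% and 64% samples correspond to two and four folds of 16% each. The variable names are illustrative.

```python
# Percentages taken from the cards above; variable names are illustrative.
holdout_pct = 20
n_folds = 5

remaining_pct = 100 - holdout_pct    # data left after setting aside the holdout
fold_pct = remaining_pct / n_folds   # size of one fold, i.e. the validation set

# Assumption: each round's sample is a whole number of folds.
round_1_pct = 2 * fold_pct
round_2_pct = 4 * fold_pct

print(fold_pct, round_1_pct, round_2_pct)
```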

Zip file

A document that has been compressed to take up less space on a computer and to make it quicker to download.

What is Logloss?

A measure of accuracy; Rather than evaluating the model directly on whether it assigns cases (rows) to the correct "label", the model is evaluated based on probabilities generated by the model and their distance from the correct answer; Lower scores are better
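To make "distance from the correct answer" concrete, here is a sketch of the standard binary log loss formula. It is not necessarily the exact computation DataRobot performs, and the labels and probabilities are made up. Probabilities must be strictly between 0 and 1.

```python
import math

def log_loss(actuals, probabilities):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(actuals, probabilities):
        # The penalty grows as the predicted probability moves away from the true label.
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actuals)

# Hypothetical data: both models pick the same labels at a 0.5 cutoff or tie,
# but the confident model's probabilities sit closer to the correct answers.
actuals = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
unsure = [0.6, 0.4, 0.5, 0.5]

print(log_loss(actuals, confident), log_loss(actuals, unsure))
```

The probabilities closer to the true labels yield the lower (better) score.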

Confusion Matrix

A table showing actual and predicted (TN, FN, FP, TP)
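The four cells can be counted directly from paired actual and predicted labels. This is a minimal sketch with made-up labels, assuming 1 means positive and 0 means negative.

```python
def confusion_counts(actual, predicted):
    """Count TN, FN, FP, TP for binary labels where 1 = positive, 0 = negative."""
    counts = {"TN": 0, "FN": 0, "FP": 0, "TP": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["TP" if a == 1 else "FP"] += 1
        else:
            counts["FN" if a == 1 else "TN"] += 1
    return counts

# Hypothetical labels for eight cases.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(actual, predicted))
```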

What is the advantage of cross-validation over validation?

Able to use more data to build the model and more data to validate it

The score in machine learning that is calculated by dividing the number of correct classifications by the total number of cases measured is defined as...

Accuracy

What is the most important criterion for Auto ML excellence?


What are the Confusion Matrix Metrics?

Accuracy, True Positive Rate, True Negative Rate

Importance (green bars)

Alternating Conditional Expectations (ACE Score); Answers the question "is there a relationship?"

How do you get the cross validation score?

Average the validation scores from the 5 folds (the holdout is excluded)
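As a small worked example, the cross-validation score is the mean of the per-fold validation scores. The five scores below are hypothetical.

```python
# Hypothetical validation scores, one per fold (the holdout is not included).
fold_scores = [0.412, 0.398, 0.405, 0.420, 0.390]

cv_score = sum(fold_scores) / len(fold_scores)
print(round(cv_score, 3))
```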

Why will blended models be slower?

Because it has to blend multiple models

Var Type

Boolean, categorical, numeric, text


Can be thought of as the independent variables we will use to predict the target

Which type of data is more predictive when using decision tree analysis?

Categorical is more predictive than text (because it breaks it up into different categories)

Feature Engineering

Cleaning data, combining features, splitting features into multiple features, handling missing values, and dealing with text, etc.
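Two of these steps, splitting one feature into several and handling a missing value, can be sketched as follows. The record layout, field names, and default weight are all hypothetical.

```python
def engineer(record, default_weight_kg):
    """Split combined fields and fill a missing value (illustrative only)."""
    first_name, last_name = record["full_name"].split(" ", 1)
    year, month, _day = (int(part) for part in record["order_date"].split("-"))
    weight = record["weight_kg"]
    return {
        "first_name": first_name,
        "last_name": last_name,
        "order_year": year,
        "order_month": month,
        # Replace a missing weight with a default and keep a flag recording that it was missing.
        "weight_kg": default_weight_kg if weight is None else weight,
        "weight_was_missing": weight is None,
    }

# Hypothetical raw record with one missing value.
raw = {"full_name": "Ada Lovelace", "order_date": "2021-03-15", "weight_kg": None}
print(engineer(raw, default_weight_kg=1.5))
```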

Supervised ML

The data scientist tells the machine what they want it to learn (identifies the target)

In DataRobot, a * in the leaderboard means that...

DataRobot used more data to create the model; DO NOT use these models

Which is more important for a model, speed or accuracy?

Depends on what you're using it for (for example, day-trading models should probably be faster)

Decision Trees

Diagrams where answers to yes or no questions lead decision makers to address additional questions until they reach the end of the tree.
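A decision tree can be pictured as nested yes/no questions. This is a hand-written sketch; the features (`distance_km`, `is_holiday`) and the cutoff are hypothetical, not learned from any data.

```python
def predict_delivery(distance_km, is_holiday):
    """Follow yes/no questions until reaching a leaf (the prediction)."""
    if distance_km > 500:      # Question 1: is it a long-distance delivery?
        if is_holiday:         # Question 2: is it shipped during a holiday?
            return "late"
        return "on time"
    return "on time"

print(predict_delivery(800, True))
print(predict_delivery(800, False))
print(predict_delivery(100, True))
```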

Flat Line in learning curves means?

Don't add more data

What does Auto ML do?

Evaluate and rank different models; run combinations of models simultaneously

When evaluating ML results, I should always choose the fastest model


Where would you find your best feature for your best model?

Feature impact

Feature Effects

Feature impact for specific feature values (circle and + points)

Discrete data

Finite number of options; Like course letter grade, country of origin, etc.

What does the confusion matrix tell us?

How often we are right and how often we are wrong

Why is cross validation important?

If the original validation partition is not representative of the overall population, then the resulting model may appear to have a high accuracy when in reality it just happens to fit the unusual validation set well

Aggregate Targets

In some cases, a target might be a combination of multiple attributes; this is called a...

continuous data

Infinite number of possible responses; Like any point on a number line

For learning curves, if a line is pretty horizontal

It likely does not need more data

In HW 3, How did you know ENET Blender was the best model?

It used 64% of the data, did not have an *, and had the lowest cv score
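The three criteria on this card can be applied as a filter followed by a minimum. This is a sketch over a hypothetical leaderboard: only the name "ENET Blender" comes from the card, and the other rows and all scores are made up. It assumes a lower score is better.

```python
# Hypothetical leaderboard rows; scores are invented for illustration.
leaderboard = [
    {"name": "ENET Blender", "sample_pct": 64, "starred": False, "cv_score": 0.401},
    {"name": "Model A", "sample_pct": 64, "starred": False, "cv_score": 0.415},
    {"name": "Model B", "sample_pct": 80, "starred": True, "cv_score": 0.395},
    {"name": "Model C", "sample_pct": 32, "starred": False, "cv_score": 0.430},
]

# Keep only models built on 64% of the data that are not marked with an asterisk.
eligible = [m for m in leaderboard if m["sample_pct"] == 64 and not m["starred"]]

# Among those, pick the lowest cross-validation score (assumes lower is better).
best = min(eligible, key=lambda m: m["cv_score"])
print(best["name"])
```

Model B has the lowest score overall but is excluded because it is starred.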

Where do you go to compare models in DR?


What was the measure to determine how accurate a model was?


Artificial Intelligence (AI)

Machines that can perform tasks that are characteristic of human intelligence

By looking at past preferences we can...

Make predictions about the future

How do you find "feature effects" in DataRobot?

Models --> Select Model --> Understand --> Feature Effects --> Compute Feature Effects

How do you find "feature impact" in DataRobot?

Models --> Understand --> Feature Impact --> Enable Feature Impact

Steep line in learning curve means?

More data would be helpful

What are types of categorical data?

Nominal and binary

Binary data

Nominal attribute with only two categories/states (Ex: 0 --> no, 1--> yes)

Should more data always be used?

Not necessarily. Consider the costs, and note that data that is too old won't be helpful at all. Use validation results to help decide whether to use more data.


Number of missing values


Number of unique values

What is one of the biggest problems with decision trees?


Over Training (Over Fitting)

Poor generalization. The model simply memorizes the training examples and cannot give correct outputs for patterns that were not in the training dataset
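An extreme illustration of memorizing the training examples is a "model" that is just a lookup table. It is perfect on rows it has seen and falls back to a default guess on anything else. The data and the class name `MemorizingModel` are hypothetical.

```python
class MemorizingModel:
    """Stores every training example; learns no general pattern."""

    def fit(self, inputs, labels):
        self.table = dict(zip(inputs, labels))

    def predict(self, x, default=0):
        # Unseen inputs get the default label because nothing was generalized.
        return self.table.get(x, default)

def accuracy(model, inputs, labels):
    correct = sum(model.predict(x) == y for x, y in zip(inputs, labels))
    return correct / len(labels)

# Hypothetical data where the true rule is: label is 1 when x >= 5.
train_x, train_y = [1, 2, 8, 9], [0, 0, 1, 1]
new_x, new_y = [3, 6, 7], [0, 1, 1]

model = MemorizingModel()
model.fit(train_x, train_y)
print(accuracy(model, train_x, train_y), accuracy(model, new_x, new_y))
```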

ROC Curve

Probability density on the y-axis and probability of the event on the x-axis; positive cases are green, negative cases are purple; the line shows where cases will be predicted positive or negative

Any row with even one missing value is dropped entirely from the analysis when using...

Regression and Neural Networks
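The row-dropping behavior described on this card (often called listwise deletion) can be sketched as follows. The rows and column names are hypothetical, and `None` stands in for a missing value.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": 58000},
    {"age": 29, "income": None},
]

# Keep a row only if every one of its values is present.
complete_rows = [row for row in rows if all(v is not None for v in row.values())]

print(len(rows), len(complete_rows))
```

Half of the rows are lost here even though each dropped row is missing only one value.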

Linear Discriminant Analysis

Segregates a larger group into homogeneous subgroups

Learning Curves

Shows how the model's predictive ability changes with 'sample size'; "will more data help my model?"

Training Set

Subsection of a dataset from which the machine learning algorithm uncovers or "learns" relationships between the features and the target variable

What type of ML is focused on in this class?

Supervised Machine Learning is the focus of this class

True Negative Rate

TN / (TN + FP)

Accuracy (Confusion Matrix)

(TP + TN) / All cases

True Positive Rate

TP / (TP + FN)
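The three confusion matrix metrics can be computed from the four cell counts. This is a minimal sketch; the counts are made up.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, true positive rate, and true negative rate from cell counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,        # correct classifications over all cases
        "true_positive_rate": tp / (tp + fn), # share of actual positives caught
        "true_negative_rate": tn / (tn + fp), # share of actual negatives caught
    }

# Hypothetical counts for 100 cases.
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```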

What do decision trees do?

Take several different features to predict the target

Which data set should never be used to make decisions about which algorithms to use?

The Holdout Set

Where would you go to find what models are being used to create a blended model?

The blueprint

Feature Impact

The overall impact of a feature adjusted for the impact of the other features (Blue horizontal bar graph)


The overall impact of a feature without consideration of the impact of other features (Green Bars)

Machine Learning (ML)

The practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world; a subset of AI

Exploratory Data Analysis

The process of examining the descriptive statistics for all features as well as their relationship with the target variable


The process of fitting a model too closely to the training data for the model to be effective on other data.

Unsupervised ML

Up to the machine to decide what it wants to learn

Data Splitting

Use algorithms to split data and build several models. Split into: 1. Training Set 2. Validation Set 3. Holdout Set

Holdout Sample

Used to assess the performance of that model

cross validation

Verifying the results obtained from a validation study by administering a test or test battery to a different sample (drawn from the same population)

Unit of Analysis

What the data set is about (Ex: for HW 3 it was late deliveries)

When do you unlock the holdout?

When you implement the model.

Nominal data

You can identify that groups are different, but there is no meaningful ranking (Ex: occupation, marital status, customer ID, etc.)

Text (String) data

You specify a number of characters, either exact or max amount

Holdout Set

a subsection of a dataset used to provide a final estimate of the machine learning model's performance after it has been trained and validated. Holdout sets should never be used to make decisions about which algorithms to use or for improving or tuning algorithms.

For learning curves, when a line is steep, it means

adding more data should increase accuracy


common way to talk about feature

Feature Name

directly from Flat File

What are types of numerical data?

discrete and continuous

Validation score

evaluated on a variety of metrics

What happens if you have too much generalization? (underfitting)

inaccurate results

regression analysis

measures the impact of a set of variables on another variable

k-fold cross validation

partition the data set (less the holdout set) into k equal subsets; each subset is called a fold
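The partitioning can be sketched with row indices. This version assumes the holdout rows have already been removed and assigns rows to folds in a simple interleaved order. Real tools typically shuffle first, which is omitted here, and the function names are illustrative.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k folds of near-equal size."""
    return [list(range(i, n_rows, k)) for i in range(k)]

def train_validation_splits(n_rows, k):
    """Yield (training, validation) index lists, using each fold once for validation."""
    folds = k_fold_indices(n_rows, k)
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# Ten hypothetical rows split into five folds.
for training, validation in train_validation_splits(10, 5):
    print(sorted(training), validation)
```

Each row appears in exactly one validation fold, so every row is used for both training and validation across the k runs.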

ML is about predicting the future based on the...


Features are your


Data Dictionary

stores definitions, such as data types for fields, default values, and validation rules for data in each field

Validation (test) set

subsection of a dataset to which we apply the machine learning algorithm to see how accurately it identifies relationships between the known outcomes for the target variable and the dataset's other features

Which Logloss score would appear higher on the leaderboard?

the lowest one

Auto ML

the process of automating machine learning; makes ML possible without extensive math/stat/programming


the variable we are trying to predict and gain insights about


when states are represented as true or false

