CSE160 Exam 1

Information Gain Equation

IG(parent, children) = entropy(parent) - [p(c1) * entropy(c1) + p(c2) * entropy(c2) + ...]
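The equation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `entropy` and `information_gain` and the toy "yes"/"no" labels are assumptions made for this example, and p(c) is taken to be the fraction of the parent's instances that fall into child c.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, children):
    """entropy(parent) minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split: 10 instances divided into two children of 5 each.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
gain = information_gain(parent, children)
print(round(gain, 3))
```

The parent is perfectly mixed, and each child is purer than the parent, so the gain is positive.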

The most appropriate type of attribute that should be used to encode letter grade is called

Ordinal

What does the R function str() do?

Shows the structure of the dataframe or vector. Gives the type for each vector (column), name if available, and shows the first few entries.

T|F: Overfitting is the tendency of data mining procedures to create models that generalize to previously unseen data points.

False

T|F: The creation of models from data is known as model deduction.

False

What does this function do: help(function_name)

Displays the documentation for a function

T|F: Information gain can be used for attribute selection

True

Learning Curve

shows the generalization performance plotted against the amount of training data used

Cosine Distance

Commonly used for text and documents. Based on the vector dot product, so it compares the direction of two vectors and ignores their magnitude.

Different distance measures have different properties. Suppose we want to ignore differences in scale across instances—technically, we want to ignore the magnitude of the vectors. Which of the following shall we select?

Cosine Distance
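A small sketch of why cosine distance ignores magnitude. The formula used here, 1 minus the dot product divided by the product of the vector lengths, is the usual definition; the function name and the toy vectors are assumptions for illustration, and both vectors are assumed to be non-zero.

```python
import math

def cosine_distance(a, b):
    """1 - (a . b) / (|a| * |b|); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

doc = [1, 2, 0]        # hypothetical word counts for a short document
scaled = [10, 20, 0]   # same proportions, ten times longer
other = [0, 0, 3]      # shares no words with doc

print(cosine_distance(doc, scaled))  # approximately 0
print(cosine_distance(doc, other))
```

Scaling a vector changes its length but not its direction, so `doc` and `scaled` are at (approximately) zero distance.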

Factors

A factor stores categorical values as integer index values into a vector of the actual strings (the levels)

T|F: A collection that is impure means that it is homogeneous with respect to the target variable.

False

T|F: A complex tree with many leaves is the kind of tree that will best prevent overfitting.

False

T|F: A learning curve shows generalization performance plotted against model complexity.

False

T|F: By definition, node construction in decision trees always results in binary trees.

False

T|F: In general, if two classes are linearly separable, there is exactly one linear discriminant.

False

T|F: Logical vectors in R cannot be a part of arithmetic operations.

False

T|F: Logistic regression is misnamed because it does not use a log function.

False

T|F: A loss function is used to determine how much penalty should be assigned to an instance based on the error in the model's predicted value for that instance.

True

T|F: A main purpose of creating homogeneous regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into

True

T|F: All data mining procedures have the tendency to overfit to some extent.

True

T|F: Data science is the discipline of making data useful.

True

T|F: Decomposing a data analytics problem into recognized tasks is a critical skill.

True

T|F: Information gain measures how much an attribute improves/decreases entropy due to new information being added.

True

T|F: Test data should be strictly independent of model building so that we can get a good estimate of model accuracy.

True

When would you NOT want to increase the complexity of your model?

When your model is overfitting

Linear discriminant

a data analysis technique that uses a line to separate data into two classes

scan()

Reads input from the keyboard (or from a file) into a vector

According to Data Science for Business, in terms of nearest neighbor methods, two unique aspects of overall intelligibility are the justification of a specific _____________ and the intelligibility of an entire ______________.

decision, model

Four types of data analytics

descriptive, diagnostic, predictive, prescriptive

Logistic Loss

does not assign zero penalty to any points, but gives less penalty to points correctly classified with high confidence
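One common way to write logistic loss is log(1 + e^(-y * f(x))), with the label y coded as +1 or -1 and f(x) the model's signed score. That coding and the function name below are assumptions for this sketch; other texts use an equivalent form with 0/1 labels.

```python
import math

def logistic_loss(y, score):
    """y is +1 or -1; score is the model's signed output f(x)."""
    return math.log(1 + math.exp(-y * score))

confident_correct = logistic_loss(+1, 5.0)   # far on the correct side
barely_correct = logistic_loss(+1, 0.5)      # close to the boundary
wrong = logistic_loss(+1, -2.0)              # on the wrong side

print(confident_correct, barely_correct, wrong)
```

The penalty shrinks as a point moves deeper into the correct side but never reaches exactly zero.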

Over-fitting

undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data

Pruning

simplifies a decision tree to prevent over-fitting to noise in the data (Reduce overfitting)

Euclidean distance

the straight-line distance, or shortest possible path, between two points (Pythagorean theorem)
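As a quick illustration, the sketch below applies the Pythagorean theorem in any number of dimensions. The function name and the sample points are assumptions for this example.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences along each dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([0, 0], [3, 4]))  # the classic 3-4-5 right triangle
```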

Which loss function specifies a loss proportional to the distance from the boundary?

Absolute error

The data mining procedure that produces a model which determines the category to which an individual sample belongs.

Classification

Categorical prediction (What category does this belong in?)

Classification model

Supervised Segmentation

How can we segment the population into groups that differ from each other with respect to some quantity of interest?

Pre-pruning

stops growing a branch when information becomes unreliable

Name the function in R that reveals the structure of a data object:

str()

Post-pruning

takes a fully-grown decision tree and discards unreliable parts (generally preferred)

Supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the _______ variable.

target

Laplace Correction Eq.

(n + 1) / (n + m + 2), where n = number of positive instances and m = number of negative instances
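A short sketch of how this estimate behaves next to the plain frequency n / (n + m). The function names and the instance counts are made up for illustration.

```python
def frequency_estimate(n, m):
    """Raw proportion of positive instances."""
    return n / (n + m)

def laplace_estimate(n, m):
    """Smoothed estimate: (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives vs. one with 20 positives and 0 negatives.
print(frequency_estimate(2, 0), laplace_estimate(2, 0))
print(frequency_estimate(20, 0), laplace_estimate(20, 0))
```

Both leaves have a raw frequency of 1.0, but the smoothed estimate is more cautious about the leaf backed by only two instances.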

What are some issues with Nearest-Neighbor Models?

- Having too many attributes, or irrelevant attributes, may confuse distance calculations - Computational efficiency: nearest-neighbor queries can be expensive

Entropy Equation

entropy = -(p1 * log2(p1) + p2 * log2(p2) + ...), where pi = probability (proportion) of class i in the set
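For the two-class case, the equation can be checked numerically. This is a minimal sketch: the function name is an assumption, and terms with probability 0 are skipped, following the usual convention that 0 * log2(0) counts as 0.

```python
import math

def binary_entropy(p1):
    """Entropy of a two-class set where p1 is the proportion of class 1."""
    p2 = 1 - p1
    # Skip zero-probability terms: 0 * log2(0) is treated as 0.
    return -sum(p * math.log2(p) for p in (p1, p2) if p > 0)

print(binary_entropy(0.5))  # perfectly mixed set
print(binary_entropy(0.9))  # mostly one class
print(binary_entropy(1.0))  # pure set
```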

Benefit of attribute selection?

1. Better prediction 2. Faster prediction 3. Better explanations and more tractable models 4. Reduced computational and/or storage cost

What are the two types of model validation?

1. Cross-validation 2. Temporal split(s)

Rules for: Trees Structured Model

1. No two parents share descendants 2. There are no cycles 3. The branches always "point downwards" 4. Every example always ends up at a leaf node with some specific class

Uses for Similarity

1. Retrieving and ranking things from a collection 2. Recommending things 3. Classifying things 4. Performing regression

Logistic Regression OR Tree Induction

1. With smaller training-set sizes, logistic regression tends to perform better 2. With larger training-set sizes, tree induction tends to perform better

Jaccard Distance

Used for categorical (non-numeric) data, treating each object as a set of characteristics. Based on the proportion of characteristics shared by both objects: 1 - |A ∩ B| / |A ∪ B|.
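A minimal sketch using Python sets, with Jaccard distance computed as 1 minus the size of the intersection divided by the size of the union. The function name and the shopping-basket items are hypothetical, and at least one of the two sets is assumed to be non-empty.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; assumes the union is not empty."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

basket_1 = {"wine", "cheese", "bread"}
basket_2 = {"wine", "cheese", "beer"}

# 2 shared items out of 4 distinct items overall.
print(jaccard_distance(basket_1, basket_2))
```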

Predictive model

A formula for estimating the unknown value of interest: the target

Loss Function

A loss function determines how much penalty should be assigned to an instance based on the error in the model's predicted value (the lower the better). Zero-one loss: asks only whether a mistake was made. Absolute error: specifies a loss proportional to the absolute distance from the boundary. Squared error: specifies a loss proportional to the square of the distance from the boundary.
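The three losses can be compared side by side in a simplified setting. As an assumption for this sketch, the "distance" is taken to be the difference between the actual value and the predicted value; the function names are illustrative only.

```python
def zero_one_loss(actual, predicted):
    """1 if a mistake was made, 0 otherwise."""
    return 0 if actual == predicted else 1

def absolute_error(actual, predicted):
    """Penalty grows linearly with the size of the error."""
    return abs(actual - predicted)

def squared_error(actual, predicted):
    """Penalty grows with the square of the error, so large errors dominate."""
    return (actual - predicted) ** 2

for predicted in (3, 4, 6):
    print(predicted,
          zero_one_loss(3, predicted),
          absolute_error(3, predicted),
          squared_error(3, predicted))
```

Zero-one loss treats every mistake the same, while squared error penalizes the prediction of 6 nine times as heavily as the prediction of 4.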

Entropy

A measure of disorder that serves as a purity measure: how well mixed the set is. Perfectly mixed = 1 (for two classes); purest = 0.

Manhattan Distance

A measure of travel through a grid system, like navigating around the buildings and blocks of Manhattan, NYC (sum of the absolute differences along each dimension).
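A brief sketch of the grid-walking idea: add up the absolute differences along each dimension. The function name and the sample points are assumptions for this example.

```python
def manhattan_distance(a, b):
    """Sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Walking 3 blocks east and 4 blocks north covers 7 blocks in total.
print(manhattan_distance([0, 0], [3, 4]))
```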

A learning curve is defined as:

A plot of the performance of the classifier on the y axis on a fixed test set as the size of the training data grows on the x axis

Logistic Regression

Actually a classifier. For logistic regression, the model produces a numeric estimate. Logistic regression is a class probability estimation model and not a regression model.
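To make the "numeric estimate" concrete, the sketch below passes a linear score through the logistic function to get a class probability. The function name and the example scores are assumptions; how the score is computed from the attributes is left out.

```python
import math

def class_probability(score):
    """Logistic function: maps any real-valued score to a value between 0 and 1."""
    return 1 / (1 + math.exp(-score))

for score in (-3.0, 0.0, 3.0):
    print(score, class_probability(score))
```

A score of 0 sits exactly on the decision boundary and maps to a probability of 0.5.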

The data science process (aka CRISP-DM)

Business understanding -> data understanding -> data preparation -> modeling -> model evaluation -> model deployment

What is the name for a list of vectors where each vector has the exact same number of elements as the others?

Dataframe

Linear Models OR Tree Induction?

Depends on the situation: how smooth the underlying phenomenon is, how non-linear it is, how much data we have, and the characteristics of the data.

Over-fitting the data

Finding chance occurrences in data that look like interesting patterns, but which do NOT generalize to unseen data.

Hinge Loss

Hinge loss only becomes positive when an example is within the margin or on the wrong side of the boundary. The loss then increases linearly with the example's distance from the correct margin, so points are penalized more the farther they are from it.
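Hinge loss is commonly written as max(0, 1 - y * f(x)), with the label y coded as +1 or -1 and the margin placed at a score of 1. That coding and the function name are assumptions for this sketch.

```python
def hinge_loss(y, score):
    """y is +1 or -1; score is the model's signed output f(x)."""
    return max(0.0, 1 - y * score)

print(hinge_loss(+1, 2.0))   # beyond the margin on the correct side
print(hinge_loss(+1, 0.5))   # correct side, but inside the margin
print(hinge_loss(+1, -1.0))  # wrong side of the boundary
```

Only the first point escapes a penalty; the other two are penalized in proportion to how far they fall short of the margin.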

Similarity

If two things are similar in some ways, they often share other characteristics as well

Non-linear functions

Linear functions can actually represent nonlinear models if we include more complex features in the functions.

Support Vector Machine

A linear model. Chooses the line that maximizes the margin, the space between the line and the nearest data points of each class, so the line sits midway between the two classes. Uses hinge loss.

A simplified representation of reality created to serve a purpose

Model

How many neighbors should be used in k-nn?

No simple answer

When measuring distance between elements of a dataset, what are the units used for Euclidean Distance?

None

Looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what?

Overfitting

How can we tell if our model has overfit?

Overfitting: the model is too specific (accurate on the training data but not on new data). Underfitting: the model is too vague.

The data mining procedure that attempts to characterize the typical behavior of an individual or population.

Profiling

What type of data mining task would be used to predict the gas mileage of a car?

Regression

Numeric prediction (What is the age of this customer?)

Regression model

na.omit()

Removes rows with missing values.

Instance / Example

Represents a fact or a data point Described by a set of attributes (fields, columns, variables, or features)

What does this function do: apropos("function_name")

Searches database for function by name or a partial name

What does this function do: help.search("descriptive_word")

Searches database for function when you do not know the name

What does this function do: example(function_name)

Shows how to use a function

When we analyze a set of data with a defined target in mind, what kind of model are we building?

Supervised

Fitted Graph

shows the generalization performance as well as the performance on the training data, but plotted against model complexity (fixed amount of training data)

Holdout Dataset (Test data)

On holdout data, increasing model complexity first decreases the error, but as the model becomes more and more complex, the error starts to increase again (underfitting, good fit, overfitting).

Training Data

On training data, increasing model complexity keeps increasing performance (decreasing the error).

Model Induction

The creation of a model from data. Also called learning or training a model.

Assuming the existence of a data frame called cats, explain the difference between cats[1,] and cats[,1] in R.

The first returns just the first row (all columns) while the second returns just the first column (all rows).

How does data leakage negatively impact model building?

The model will see information which is not available when making decisions

What is the result of supervised data mining?

The result of training a model is something that can make a prediction when given a new example.

Why is it important to have separate training and test sets?

The training set is used to learn the model. But it cannot be used to evaluate the performance of the model - we need the unseen data of the test set to measure the ability of this model to generalize (to perform well on new data).

Name the function in R that concatenates data elements together into a vector:

c()

What happens to decision boundaries as K increases?

Increasing K simplifies and smooths the decision boundary

Information Gain

measures the change in entropy due to any amount of new information being added

Leakage (Leaking)

A mistake made by the creator of a machine learning model in which information about the target variable leaks into the model's inputs during training; this information will not be available in the ongoing data that we would like to predict on.

Which of these terms best describes the process of turning a data set with a bunch of junk in it into a nice clean data set?

munging

Logistic regression models produce a ____________ estimate for a _____________ target variable.

numeric, categorical

read.table()

reads a file in table format and creates a data frame from it (can also read in string with "text=" parameter)

What happens if the training set size changes?

It results in different generalization performance from the resulting model

