CSE160 Exam 1
Information Gain Equation
IG(parent, children) = entropy(parent) - [p(c1) * entropy(c1) + p(c2) * entropy(c2) + ...]
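A minimal R sketch of this calculation, assuming a two-way split with made-up class counts (the counts are illustrative, not from any course dataset):
  entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }   # see the Entropy Equation card
  # hypothetical parent: 10 positive / 10 negative, split into children of sizes 12 and 8
  parent <- entropy(c(10, 10) / 20)                    # 1
  c1 <- entropy(c(9, 3) / 12); c2 <- entropy(c(1, 7) / 8)
  IG <- parent - (12/20 * c1 + 8/20 * c2)              # about 0.30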
The most appropriate type of attribute that should be used to encode a letter grade is called
Ordinal
What does the R function str() do?
Shows the structure of the dataframe or vector. Gives the type for each vector (column), name if available, and shows the first few entries.
T|F: Overfitting is the tendency of data mining procedures to create models that generalize to previously unseen data points.
False
T|F: The creation of models from data is known as model deduction.
False
What does this function do: help(function_name)
Displays the documentation for a function (what it does and how to use it)
T|F: Information gain can be used for attribute selection
True
Learning Curve
shows the generalization performance plotted against the amount of training data used
Cosine Distance
Commonly used for text and documents; based on the vector dot product, so it compares direction and ignores magnitude.
Different distance measures have different properties. Suppose we want to ignore differences in scale across instances (technically, we want to ignore the magnitude of the vectors). Which of the following should we select?
Cosine Distance
Factors
A factor stores categorical values as integer index values into a vector of the actual strings (the levels)
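For example, with a small made-up vector of grades:
  grades <- factor(c("B", "A", "C", "A"))
  as.integer(grades)   # 2 1 3 1 -- the index values
  levels(grades)       # "A" "B" "C" -- the actual strings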
T|F: A collection that is impure means that it is homogeneous with respect to the target variable.
False
T|F: A complex tree with many leaves is the kind of tree that will best prevent overfitting.
False
T|F: A learning curve shows generalization performance plotted against model complexity.
False
T|F: By definition, node construction in decision trees always results in binary trees.
False
T|F: In general, if two classes are linearly separable, there is exactly one linear discriminant.
False
T|F: Logical vectors in R cannot be a part of arithmetic operations.
False
T|F: Logistic regression is misnamed because it does not use a log function.
False
T|F: A loss function is used to determine how much penalty should be assigned to an instance based on the error in the model's predicted value for that instance.
True
T|F: A main purpose of creating homogeneous regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into
True
T|F: All data mining procedures have the tendency to overfit to some extent.
True
T|F: Data science is the discipline of making data useful.
True
T|F: Decomposing a data analytics problem into recognized tasks is a critical skill.
True
T|F: Information gain measures how much an attribute improves/decreases entropy due to new information being added.
True
T|F: Test data should be strictly independent of model building so that we can get a good estimate of model accuracy.
True
When would you NOT want to increase the complexity of your model?
When your model is overfitting
Linear discriminant
a data analysis technique that uses a line to separate data into two classes
scan()
asks for input from the keyboard
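For example (the second form, reading from a string via the text= parameter, is just one convenient use):
  x <- scan()                     # then type values at the prompt; a blank line ends input
  y <- scan(text = "3 1 4 1 5")   # scan() can also read numbers from a string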
According to Data Science for Business, in terms of nearest neighbor methods, two unique aspects of overall intelligibility are the justification of a specific _____________ and the intelligibility of an entire ______________.
decision, model
Four types of data analytics
descriptive, diagnostic, predictive, prescriptive
Logistic Loss
does not assign zero penalty to any points, but gives less penalty to points correctly classified with high confidence
Over-fitting
undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data
Pruning
simplifies a decision tree to prevent over-fitting to noise in the data (Reduce overfitting)
Euclidean distance
the straight-line distance, or shortest possible path, between two points (Pythagorean theorem)
Which loss function specifies a loss proportional to the distance from the boundary?
Absolute error
The data mining procedure that produces a model which determines the category to which an individual sample belongs.
Classification
Categorical prediction (What category does this belong in?)
Classification model
Supervised Segmentation
How can we segment the population into groups that differ from each other with respect to some quantity of interest?
Pre-pruning
stops growing a branch when information becomes unreliable
Name the function in R that reveals the structure of a data object:
str()
Post-pruning
takes a fully-grown decision tree and discards unreliable parts (generally preferred)
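A hedged sketch of post-pruning in R, assuming the rpart package is available (the package and parameter values are assumptions, not necessarily what this course uses):
  library(rpart)
  fit <- rpart(Species ~ ., data = iris, cp = 0)   # grow a large, fully detailed tree
  pruned <- prune(fit, cp = 0.05)                  # discard branches that add little reliable improvement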
Supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the _______ variable.
target
Laplace Correction Eq.
p = (n + 1) / (n + m + 2), where n = number of positive instances and m = number of negative instances
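Worked example in R with made-up counts:
  n <- 9; m <- 1                # 9 positive and 1 negative instance at a leaf
  (n + 1) / (n + m + 2)         # Laplace-corrected estimate: 10/12, about 0.83
  n / (n + m)                   # versus the raw frequency, 0.9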
What are some issues with Nearest-Neighbor Models?
1. Too many attributes, or irrelevant attributes, may confuse the distance calculations 2. Computational efficiency: nearest-neighbor prediction can be expensive
Entropy Equation
entropy = -(p1 * log2(p1) + p2 * log2(p2) + ...), where each p_i is the proportion (probability) of class i in the set
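For example, in R:
  entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
  entropy(c(0.5, 0.5))   # 1          -- perfectly mixed (two classes)
  entropy(c(0.9, 0.1))   # about 0.47 -- mostly one class
  entropy(c(1))          # 0          -- pure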
Benefit of attribute selection?
1. Better prediction 2. Faster prediction 3. Better explanations and more tractable models 4. Reduced computational and/or storage cost
What are the two types of model validation?
1. Cross-validation 2. Temporal split(s)
Rules for: Trees Structured Model
1. No two parents share descendants 2. There are no cycles 3. The branches always point downwards 4. Every example always ends up at a leaf node with some specific class
Uses for Similarity
1. Retrieving and ranking things from a collection 2. Recommending things 3. Classifying things 4. Performing regression
Logistic Regression OR Tree Induction
1. Smaller training-set sizes = logistic regression 2. Larger training-set sizes = trees
Jaccard Distance
Used for categorical (non-numeric) data. Based on the proportion of characteristics shared between the two items (Jaccard distance = 1 minus that proportion).
Predictive model
A formula for estimating the unknown value of interest: the target
Loss Function
A loss function determines how much penalty should be assigned to an instance based on the error in the model's predicted value (the lower, the better). Zero-one loss: asks only whether a mistake was made. Absolute error: loss proportional to the absolute distance from the boundary. Squared error: loss proportional to the square of the distance from the boundary.
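A small R sketch comparing the three, with made-up labels y (in {-1, +1}) and model scores f:
  y <- c(1, 1, -1)
  f <- c(0.8, -0.3, -2.0)
  zero_one <- as.numeric(sign(f) != y)   # 0 1 0 -- was a mistake made?
  absolute <- abs(y - f)                 # 0.2 1.3 1.0
  squared  <- (y - f)^2                  # 0.04 1.69 1.00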
Entropy
A measure of disorder: how well mixed the set is; used as a purity measure. Perfectly mixed = 1 (for two classes), purest = 0.
Manhattan Distance
A measure of travel through a grid system, like navigating around the buildings and blocks of Manhattan, NYC (the sum of the absolute differences along each dimension).
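A quick R illustration of the distance measures in these cards, using two made-up vectors:
  a <- c(2, 0, 1); b <- c(0, 3, 1)
  dist(rbind(a, b), method = "euclidean")               # straight-line distance, sqrt(sum((a-b)^2))
  dist(rbind(a, b), method = "manhattan")               # sum of absolute differences
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))    # cosine distance (ignores magnitude)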
A learning curve is defined as:
A plot of the performance of the classifier on the y axis on a fixed test set as the size of the training data grows on the x axis
Logistic Regression
Actually a classifier. The model produces a numeric estimate, but that estimate is a class probability: logistic regression is a class probability estimation model, not a regression model.
The data science process (aka CRISP-DM)
Business understanding -> data understanding -> data preparation -> modeling -> evaluation -> deployment
What is the name for a list of vectors where each vector has the exact same number of elements as the others?
Dataframe
Linear Models OR Tree Induction?
Depends on the situation: how smooth the boundary is, how non-linear the relationship is, how much data we have, and other characteristics of the data.
Over-fitting the data
Finding chance occurrences in data that look like interesting patterns, but which do NOT generalize to unseen data.
Hinge Loss
Hinge loss becomes positive only when an example is within the margin or on the wrong side of the boundary. The loss increases linearly with the example's distance from the correct margin, so points are penalized more the farther they are from it.
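Sketch in R, with y as the true class (+1/-1) and f as a made-up signed distance from the boundary:
  y <- c(1, 1, -1)
  f <- c(2.0, 0.4, -0.1)
  pmax(0, 1 - y * f)   # 0.0 (correct and outside the margin), 0.6 and 0.9 (inside the margin)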
Similarity
If two things are similar in some ways, they often share other characteristics as well
Non-linear functions
Linear functions can actually represent nonlinear models if we include more complex features in the functions (for example, squared or interaction terms).
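For example, in R a linear model can fit a curved relationship by adding a squared feature (made-up data):
  x <- seq(-3, 3, by = 0.1)
  y <- x^2 + rnorm(length(x), sd = 0.5)
  fit <- lm(y ~ x + I(x^2))   # still a linear model: linear in the features x and x^2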
Support Vector Machine
A linear model that chooses the line maximizing the margin, the space between the line and the nearest data points on each side, so the boundary sits in the middle between the classes. Uses hinge loss.
A simplified representation of reality created to serve a purpose
Model
How many neighbors should be used in k-nn?
No simple answer
When measuring distance between elements of a dataset, what are the units used for Euclidean Distance?
None
Looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what?
Overfitting
How can we tell if our model has overfit?
Overfitting: the model is too specific (it performs much better on the training data than on holdout data). Underfitting: the model is too vague.
The data mining procedure that attempts to characterize the typical behavior of an individual or population.
Profiling
What type of data mining task would be used to predict the gas mileage of a car?
Regression
Numeric prediction (What is the age of this customer?)
Regression model
na.omit()
Removes rows with missing values.
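For example:
  df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
  na.omit(df)   # keeps only row 1, the only row with no missing values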
Instance / Example
Represents a fact or a data point; described by a set of attributes (fields, columns, variables, or features)
What does this function do: apropos("function_name")
Searches database for function by name or a partial name
What does this function do: help.search("descriptive_word")
Searches database for function when you do not know the name
What does this function do: example(function_name)
Shows how to use a function
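Putting the four help functions together on a familiar function (mean is just an example):
  help(mean)              # documentation for a function you know by name
  example(mean)           # runs the examples from that help page
  apropos("mea")          # functions whose names match a partial name
  help.search("average")  # search help pages by a descriptive keyword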
When we analyze a set of data with a defined target in mind, what kind of model are we building?
Supervised
Fitted Graph
shows the generalization performance as well as the performance on the training data, but plotted against model complexity (fixed amount of training data)
Holdout Dataset (Test data)
As model complexity increases, the error on the holdout (test) data first decreases, but as the model becomes more and more complex the error starts to increase again (underfitting -> good fit -> overfitting).
Training Data
As model complexity increases, performance on the training data keeps improving (the training error keeps decreasing).
Model Induction
The creation of a model from data. Also called learning or training a model.
Assuming the existence of a data frame called cats, explain the difference between cats[1,] and cats[,1] in R.
The first returns just the first row (all columns) while the second returns just the first column (all rows).
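A tiny made-up cats data frame makes the difference concrete:
  cats <- data.frame(weight = c(2.1, 3.4), coat = c("black", "tabby"))
  cats[1, ]   # first row, all columns: one cat
  cats[, 1]   # first column, all rows: the vector of weights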
How does data leakage negatively impact model building?
The model will see information which is not available when making decisions
What is the result of supervised data mining?
The result of training a model is something that can make a prediction when given a new example.
Why is it important to have separate training and test sets?
The training set is used to learn the model. But it cannot be used to evaluate the performance of the model - we need the unseen data of the test set to measure the ability of this model to generalize (to perform well on new data).
Name the function in R that concatenates data elements together into a vector:
c()
What happens to the decision boundary as k increases?
Increasing k simplifies and smooths the decision boundary
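A hedged sketch using the class package's knn() on the built-in iris data (the package, dataset, and k values are assumptions for illustration):
  library(class)
  train <- iris[, 1:4]; labels <- iris$Species
  knn(train, iris[1:5, 1:4], cl = labels, k = 1)    # k = 1: very local, jagged boundary
  knn(train, iris[1:5, 1:4], cl = labels, k = 15)   # larger k: smoother, simpler boundary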
Information Gain
measures the change in entropy due to any amount of new information being added
Leakage (Leaking)
A mistake made by the creator of a machine learning model in which information about the target variable leaks into the model's inputs during training; this information will not be available in the ongoing data that we would like to predict on.
Which of these terms best describes the process of turning a data set with a bunch of junk in it into a nice clean data set?
munging
Logistic regression models produce a ____________ estimate for a _____________ target variable.
numeric, categorical
read.table()
reads a file in table format and creates a data frame from it (can also read in string with "text=" parameter)
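For example, reading from a string with the text= parameter (the column names and values are made up):
  df <- read.table(text = "name score
  alice 90
  bob 85", header = TRUE)
  str(df)   # shows the structure: two columns, name and score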
What happens if the training set size changes?
A different training-set size generally results in different generalization performance from the resultant model
