CS4801 Q2
Tree Induction
incorporates the idea of supervised segmentation in an elegant manner, repeatedly selecting informative attributes
Support Vector Machine (SVM)
instead of thinking about separating with a line, first fit the fattest bar between the classes
Value of a Predictive Model
sometimes lies in the understanding gained from looking at it rather than in the predictions it makes
Deduction
starts with general rules and specific facts, and creates other specific facts from them
Margin
distance between the dashed parallel lines in SVM, you want to maximize this
Predictive Model
a formula for estimating the unknown value of interest: the target
Variance
a natural measure of impurity for numeric values
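A minimal sketch of variance as an impurity measure for a numeric target; the segment values below are made up.

    # Variance as an impurity measure: a segment whose numeric target values
    # cluster tightly around their mean is "purer" (lower variance).
    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    tight_segment = [10.1, 9.8, 10.0, 10.2]   # nearly homogeneous -> low impurity
    spread_segment = [2.0, 18.0, 7.5, 30.0]   # heterogeneous -> high impurity
    print(variance(tight_segment))            # small value
    print(variance(spread_segment))           # much larger value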
Information
a quantity that reduces uncertainty about something
Objective Function
calculated for a particular set of weights and a particular set of data
Numeric Variables
can be "discretized" by choosing a split point (or many split points) and then treating the result as a categorical attribute
Centroid
cluster center
Euclidean Distance
compute the overall distance by combining the distances along the individual dimensions (the individual features in our setting): square each per-feature difference, sum the squares, and take the square root
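A minimal sketch of the computation, with two made-up instances described by numeric features.

    import math

    # Euclidean distance: square each per-feature difference, sum the squares,
    # and take the square root of the total.
    def euclidean_distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    print(euclidean_distance([5.0, 3.0], [2.0, 7.0]))  # sqrt(9 + 16) = 5.0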
Edit Distance (Levenshtein Metric)
counts the minimum number of edit operations required to convert one string into the other, where an edit operation consists of either inserting, deleting, or replacing a character (one could choose other edit operators)
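A minimal sketch of the standard dynamic-programming computation of the Levenshtein metric; the two strings are made-up examples.

    # dp[i][j] is the minimum number of insertions, deletions, and substitutions
    # needed to turn the first i characters of s into the first j characters of t.
    def edit_distance(s, t):
        dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
        for i in range(len(s) + 1):
            dp[i][0] = i                                 # delete every character of s
        for j in range(len(t) + 1):
            dp[0][j] = j                                 # insert every character of t
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution (or match)
        return dp[len(s)][len(t)]

    print(edit_distance("kitten", "sitting"))  # 3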
Hierarchical Clustering
creates a collection of ways to group the points; focuses on the similarities between the individual instances and how similarities link them together
Induction
creation of models from data
Supervised Learning Model
describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable
Linear Discriminant
discriminates between the classes, and the function of the decision boundary is a linear combination—a weighted sum—of the attributes
Linkage Function
distance function between clusters, considering individual instances to be the smallest clusters for hierarchical clustering
Selecting a Subset of Informative Attributes
doing so can substantially reduce the size of an unwieldy dataset, and often will improve the accuracy of the resultant model
Text Classification
each word or token corresponds to a dimension, and the location of a document along each dimension is the number of occurrences of the word in that document
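A minimal sketch of that representation with two made-up documents: each word in the vocabulary becomes a dimension, and each document becomes a vector of word counts.

    from collections import Counter

    docs = ["data science uses data", "science of clustering"]
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})

    # One count per vocabulary word, per document.
    vectors = [[Counter(tokens)[w] for w in vocab] for tokens in tokenized]
    print(vocab)     # the dimensions
    print(vectors)   # each document as a point in that space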
Linear Classifier
essentially a weighted sum of the values for the various attributes
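A minimal sketch with made-up weights and features; the function name and threshold are illustrative.

    # Score an instance with a weighted sum of its feature values, then classify
    # by comparing the score to a threshold.
    def classify(features, weights, bias, threshold=0.0):
        score = sum(w * x for w, x in zip(weights, features)) + bias
        return "positive" if score > threshold else "negative"

    print(classify(features=[2.0, 0.5], weights=[1.5, -2.0], bias=-1.0))  # score 1.0 -> "positive"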
Purity Measure
evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable
Overfitting
finding chance occurrences in data that look like interesting patterns, but which do not generalize
Tree Induction
finding informative attributes is the basis for a widely used predictive modeling technique
Fundamental Idea of Data Mining
finding or selecting important, informative variables or "attributes" of the entities described by the data
Support Vector Machine (SVM)
fit the fattest bar between the classes, the linear discriminant will be the center line through the bar
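A minimal sketch assuming scikit-learn and NumPy are installed; the tiny two-class dataset is made up. The fitted weights define the center line through the bar, and 2/||w|| is the width of the margin.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 6]])
    y = np.array([0, 0, 0, 1, 1, 1])

    svm = SVC(kernel="linear", C=1e6)   # large C approximates a hard margin
    svm.fit(X, y)

    w = svm.coef_[0]
    print("boundary weights:", w, "intercept:", svm.intercept_[0])
    print("margin width (the fattest bar):", 2 / np.linalg.norm(w))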
Prediction
generally means to estimate an unknown value
Similarity for Predictive Modeling
given a new example whose target variable we want to predict, we scan through all the training examples and choose several that are the most similar to the new example; then we predict the new example's target value, based on the nearest neighbors' (known) target values
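A minimal sketch of that procedure with made-up training examples; it uses Euclidean distance as the similarity measure and a simple majority vote.

    import math
    from collections import Counter

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Find the k most similar training examples and vote on their known targets.
    def knn_predict(training, new_features, k=3):
        neighbors = sorted(training, key=lambda ex: euclidean(ex[0], new_features))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    training = [([1, 1], "no"), ([1, 2], "no"), ([6, 5], "yes"), ([7, 6], "yes"), ([6, 6], "yes")]
    print(knn_predict(training, [5, 5]))  # "yes"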
Clustering
groups the points by their similarity
Pure
homogeneous with respect to the target variable
Supervised Segmentation Fundamental Concept
how can we judge whether a variable contains important information about the target variable? How much?
Predictive Modeling as Supervised Segmentation
how can we segment the population into groups that differ from each other with respect to some quantity of interest?
Issue with Nearest-Neighbor Methods
how many neighbors should we use, and should they have equal weights in the combining function?
Entropy
a measure of disorder that can be applied to a set, such as one of our individual segments
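A minimal sketch of the entropy calculation over the class proportions of a segment; the label lists are made up.

    import math

    # entropy = sum over classes of -p * log2(p), where p is the class proportion;
    # 0 for a pure segment, 1 bit for a maximally mixed two-class segment.
    def entropy(labels):
        total = len(labels)
        proportions = [labels.count(c) / total for c in set(labels)]
        return sum(-p * math.log2(p) for p in proportions if p > 0)

    print(entropy(["yes"] * 10))              # 0.0 (pure)
    print(entropy(["yes"] * 5 + ["no"] * 5))  # 1.0 (maximum disorder for two classes)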
Complication in Selecting Informative Attributes
is this split better than another split that does not produce any pure subset, but reduces the impurity more broadly?
Advantage of Hierarchical Clustering
it allows the data analyst to see the groupings—the "landscape" of data similarity—before deciding on the number of clusters to extract
Support Vector Machine (SVM)
linear discriminant
Fitting a (Linear) Model to Data
linear regression, logistic regression, and support vector machines
Similarity-Moderated Voting
majority scoring with weights
Table Model
memorizes the training data and performs no generalization
Cross-validation
more sophisticated holdout training and testing procedure; computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing
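A minimal sketch of the fold bookkeeping only; in practice a model is trained on each training split, scored on the held-out fold, and the k scores are averaged.

    # Split n examples into k folds; each fold is used exactly once for testing
    # while the remaining folds are used for training.
    def k_fold_indices(n_examples, k=5):
        folds = [list(range(i, n_examples, k)) for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
            yield train_idx, test_idx

    for train_idx, test_idx in k_fold_indices(10, k=5):
        print("train:", train_idx, "test:", test_idx)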
Information Gain
the most common splitting criterion; measures the decrease in entropy from the parent set to the segments produced by splitting on an attribute
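A minimal sketch with made-up parent and child segments: information gain is the parent's entropy minus the weighted average entropy of the children the split creates.

    import math

    def entropy(labels):
        total = len(labels)
        proportions = [labels.count(c) / total for c in set(labels)]
        return sum(-p * math.log2(p) for p in proportions if p > 0)

    def information_gain(parent, children):
        total = len(parent)
        weighted_child_entropy = sum(len(c) / total * entropy(c) for c in children)
        return entropy(parent) - weighted_child_entropy

    parent = ["yes"] * 6 + ["no"] * 6
    children = [["yes"] * 5 + ["no"], ["no"] * 5 + ["yes"]]  # a fairly informative split
    print(information_gain(parent, children))  # about 0.35 bits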
K-Means Clustering
most popular centroid-based clustering algorithm
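A minimal from-scratch sketch assuming NumPy is installed; the points are made up. Each round assigns every point to its nearest centroid and then moves each centroid to the mean of its assigned points.

    import numpy as np

    def k_means(points, k=2, iterations=10, seed=0):
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iterations):
            # Distance from every point to every centroid, then nearest-centroid assignment.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            assignment = dists.argmin(axis=1)
            # Move each centroid to the mean of the points assigned to it.
            centroids = np.array([points[assignment == c].mean(axis=0) for c in range(k)])
        return centroids, assignment

    points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
    centroids, assignment = k_means(points, k=2)
    print(centroids)    # the cluster centers
    print(assignment)   # which cluster each point belongs to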
Nearest Neighbors
most-similar instances
Descriptive Model
must be judged in part on its intelligibility, and a less accurate model may be preferred if it is easier to understand
Complication in Selecting Informative Attributes
not all attributes are binary; many attributes have three or more distinct values
Cosine Distance
often used in text classification to measure the similarity of two documents
Cosine Distance
particularly useful when you want to ignore differences in scale across instances—technically, when you want to ignore the magnitude of the vectors
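A minimal sketch with made-up term-count vectors; because cosine distance depends only on the angle between the vectors, doubling every count leaves it unchanged.

    import math

    # cosine distance = 1 - (a . b) / (||a|| * ||b||)
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1 - dot / (norm_a * norm_b)

    doc1 = [2, 1, 0]   # term counts for one document
    doc2 = [4, 2, 0]   # same direction, twice the magnitude
    print(cosine_distance(doc1, doc2))  # ~0.0: identical up to scale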
Decision Boundaries
partition the instance space into regions of similar instances
Complication in Selecting Informative Attributes
rarely split a group perfectly
Dendrogram
shows explicitly the hierarchy of the clusters
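A minimal sketch assuming SciPy and NumPy are installed; the points are made up. linkage builds the cluster hierarchy and dendrogram lays out the tree that shows it explicitly.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram

    points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
    merges = linkage(points, method="complete")   # complete-linkage hierarchy
    tree = dendrogram(merges, no_plot=True)       # drop no_plot to draw it with matplotlib
    print(tree["ivl"])                            # leaf order along the bottom of the dendrogram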
Model
simplified representation of reality created to serve a purpose
Main Purpose of Creating Homogeneous Regions
so that we can predict the target variable of a new, unseen instance by determining which segment it falls into
Complication in Selecting Informative Attributes
some attributes take on numeric values (continuous or integer)
Manhattan Distance
sum of the (unsquared) absolute differences along each dimension
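A minimal sketch, using the same made-up instances as the Euclidean example above.

    # Manhattan (L1) distance: sum the absolute per-feature differences
    # instead of squaring them as Euclidean distance does.
    def manhattan_distance(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    print(manhattan_distance([5.0, 3.0], [2.0, 7.0]))  # 3 + 4 = 7.0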
Similarity
the closer two objects are in the space defined by the features, the more similar they are
Generalization
the property of a model or modeling process, whereby the model applies to data that were not used to build the model
Overfitting
the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points
Parameterized Model
the weights of the linear function (the w_i) are the parameters
Goal of Data Mining
to tune the parameters so that the model fits the data as well as possible; this general approach is called parameter learning or parametric modeling
Jaccard Distance
treats the two objects as sets of characteristics
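A minimal sketch with made-up sets of purchased items: the distance is 1 minus the size of the intersection over the size of the union.

    def jaccard_distance(a, b):
        a, b = set(a), set(b)
        return 1 - len(a & b) / len(a | b)

    customer1 = {"milk", "bread", "eggs"}
    customer2 = {"milk", "bread", "beer", "chips"}
    print(jaccard_distance(customer1, customer2))  # 1 - 2/5 = 0.6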
Finding Informative Attributes
useful to help us deal with increasingly larger databases and data streams
A Key to Supervised Data Mining
we have some target quantity we would like to predict or to otherwise understand better
Fundamental Idea in Data Mining
we need to ask, what should be our goal or objective in choosing the parameters?
Descriptive Modeling
where the primary purpose of the model is not to estimate a value but instead to gain insight into the underlying phenomenon or process