CS4801 Q2

Tree Induction

incorporates the idea of supervised segmentation in an elegant manner, repeatedly selecting informative attributes

Support Vector Machine (SVM)

instead of thinking about separating with a line, first fit the fattest bar between the classes

Value of a Predictive Model

sometimes lies in the understanding gained from looking at the model rather than in the predictions it makes

Deduction

starts with general rules and specific facts, and creates other specific facts from them

Margin

the distance between the dashed parallel lines in an SVM; you want to maximize this

Predictive Model

a formula for estimating the unknown value of interest: the target

Variance

a natural measure of impurity for numeric values
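
A minimal Python sketch of variance as an impurity measure for a segment with a numeric target (the function name is illustrative; this is the population variance):

```python
def variance(values):
    """Impurity of a segment with a numeric target: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```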

Information

a quantity that reduces uncertainty about something

Objective Function

a function, calculated for a particular set of weights and a particular set of data, that the fitting procedure seeks to maximize or minimize

Numeric Variables

can be "discretized" by choosing a split point (or many split points) and then treating the result as a categorical attribute

Centroid

cluster center

Euclidean Distance

computes the overall distance by combining the distances along the individual dimensions (the individual features in our setting): the square root of the sum of the squared per-feature differences
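
A minimal Python sketch (the function name is illustrative):

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared per-feature differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```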

Edit Distance (Levenshtein Metric)

counts the minimum number of edit operations required to convert one string into the other, where an edit operation consists of either inserting, deleting, or replacing a character (one could choose other edit operators)
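
One standard dynamic-programming implementation, sketched in Python (names are illustrative):

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))            # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                            # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # replace (or match)
        prev = curr
    return prev[-1]

# edit_distance("kitten", "sitting") -> 3
```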

Hierarchical Clustering

creates a collection of ways to group the points; focuses on the similarities between the individual instances and how similarities link them together

Induction

creation of models from data

Supervised Learning Model

describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable

Linear Discriminant

discriminates between the classes, and the function of the decision boundary is a linear combination—a weighted sum—of the attributes

Linkage Function

distance function between clusters, considering individual instances to be the smallest clusters for hierarchical clustering

Selecting a Subset of Informative Attributes

doing so can substantially reduce the size of an unwieldy dataset, and often will improve the accuracy of the resultant model

Text Classification

each word or token corresponds to a dimension, and the location of a document along each dimension is the number of occurrences of the word in that document

Linear Classifier

essentially a weighted sum of the values for the various attributes
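
A minimal Python sketch, assuming binary classification by the sign of the weighted sum (the labels and bias term are illustrative):

```python
def linear_classify(weights, features, bias=0.0):
    """Weighted sum of the attribute values; the sign of the score picks the class."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return "+" if score >= 0 else "-"
```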

Purity Measure

evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable

Overfitting

finding chance occurrences in data that look like interesting patterns, but which do not generalize

Tree Induction

finding informative attributes is the basis for a widely used predictive modeling technique

Fundamental Idea of Data Mining

finding or selecting important, informative variables or "attributes" of the entities described by the data

Support Vector Machine (SVM)

fit the fattest bar between the classes, the linear discriminant will be the center line through the bar

Prediction

generally means to estimate an unknown value

Similarity for Predictive Modeling

given a new example whose target variable we want to predict, we scan through all the training examples and choose several that are the most similar to the new example; then we predict the new example's target value, based on the nearest neighbors' (known) target values
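
A minimal Python sketch of this for classification (majority vote among the nearest neighbors; the names and the choice of Euclidean distance are illustrative):

```python
from collections import Counter
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """train is a list of (feature_vector, label) pairs; predict the query's
    label by majority vote among the k most similar training examples."""
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```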

Clustering

groups the points by their similarity

Pure

homogeneous with respect to the target variable

Supervised Segmentation Fundamental Concept

how can we judge whether a variable contains important information about the target variable? How much?

Predictive Modeling as Supervised Segmentation

how can we segment the population into groups that differ from each other with respect to some quantity of interest?

Issue with Nearest-Neighbor Methods

how many neighbors should we use? Should they have equal weights in the combining function?

Entropy

is a measure of disorder that can be applied to a set, such as one of our individual segments
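
For a set with class proportions p_i, entropy is -sum(p_i * log2(p_i)); a minimal Python sketch:

```python
import math

def entropy(labels):
    """Disorder of a set: -sum(p * log2(p)) over the class proportions p."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# entropy(["+"] * 5 + ["-"] * 5) -> 1.0 (a maximally mixed two-class set)
```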

Complication in Selecting Informative Attributes

is this split better than another split that does not produce any pure subset but reduces the impurity more broadly?

Support Vector Machine (SVM)

is a linear discriminant

Advantage of Hierarchical Clustering

it allows the data analyst to see the groupings—the "landscape" of data similarity—before deciding on the number of clusters to extract

Fitting a (Linear) Model to Data

linear regression, logistic regression, and support vector machines

Similarity-Moderated Voting

majority scoring in which each nearest neighbor's vote is weighted by its similarity to the new example

Table Model

memorizes the training data and performs no generalization

Cross-validation

more sophisticated holdout training and testing procedure; computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing
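
A minimal sketch of k-fold splitting in plain Python (assigning folds by striding is an illustrative choice; real procedures usually shuffle first):

```python
def k_fold_splits(data, k=5):
    """Yield (train, test) pairs so that each instance is held out exactly once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```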

Information Gain

most common splitting criterion
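
Information gain is the entropy of the parent set minus the weighted average entropy of the child segments; a minimal, self-contained Python sketch:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    """How much a split decreases entropy, weighting each child by its size."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)
```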

K-Means Clustering

most popular centroid-based clustering algorithm
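
A minimal Python sketch of the algorithm (random initial centroids and a fixed iteration count are illustrative simplifications; real implementations iterate to convergence):

```python
import random

def k_means(points, k, iters=20):
    """Alternate between assigning each point to its nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum((a - b) ** 2
                                                      for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```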

Nearest Neighbors

most-similar instances

Descriptive Model

must be judged in part on its intelligibility, and a less accurate model may be preferred if it is easier to understand

Complication in Selecting Informative Attributes

not all attributes are binary; many attributes have three or more distinct values

Cosine Distance

often used in text classification to measure the similarity of two documents

Cosine Distance

particularly useful when you want to ignore differences in scale across instances—technically, when you want to ignore the magnitude of the vectors
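
A minimal Python sketch (assumes nonzero vectors):

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors; magnitude is ignored."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms
```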

Decision Boundaries

partition the instance space into similar regions

Complication in Selecting Informative Attributes

attributes rarely split a group perfectly

Dendrogram

shows explicitly the hierarchy of the clusters

Model

simplified representation of reality created to serve a purpose

Main Purpose of Creating Homogeneous Regions

so that we can predict the target variable of a new, unseen instance by determining which segment it falls into

Complication in Selecting Informative Attributes

some attributes take on numeric values (continuous or integer)

Manhattan Distance

sum of the (unsquared) pairwise distances along the individual dimensions
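
A minimal Python sketch:

```python
def manhattan_distance(a, b):
    """Sum of the absolute per-feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```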

Similarity

the closer two objects are in the space defined by the features, the more similar they are

Generalization

the property of a model or modeling process, whereby the model applies to data that were not used to build the model

Overfitting

the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points

Parameterized Model

the weights of the linear function (w_i) are the parameters

Goal of Data Mining

to tune the parameters so that the model fits the data as well as possible; this general approach is called parameter learning or parametric modeling

Jaccard Distance

treats the two objects as sets of characteristics
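
The distance is 1 minus the size of the sets' intersection divided by the size of their union; a minimal Python sketch:

```python
def jaccard_distance(a, b):
    """Dissimilarity of two objects viewed as sets of characteristics."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)
```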

Finding Informative Attributes

useful to help us deal with increasingly larger databases and data streams

A Key to Supervised Data Mining

we have some target quantity we would like to predict or to otherwise understand better

Fundamental Idea in Data Mining

we need to ask, what should be our goal or objective in choosing the parameters?

Descriptive Modeling

where the primary purpose of the model is not to estimate a value but instead to gain insight into the underlying phenomenon or process

