609 test 2

accuracy equation

# of correct decisions made/ total # of decisions made

error equation

# of incorrect decisions made/total # of decisions made
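
A minimal Python sketch of both equations, using made-up decision counts; note that the two rates always sum to 1.

```python
correct = 90     # illustrative counts, not from the cards
incorrect = 10
total = correct + incorrect

accuracy = correct / total    # 90 / 100 = 0.9
error = incorrect / total     # 10 / 100 = 0.1
assert abs((accuracy + error) - 1.0) < 1e-9
```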

curse of dimensionality

(an issue with NN methods) - having too many or irrelevant attributes can confuse the distance calculation - injecting domain knowledge is solution 2 (feature selection is solution 1)

Issues with NN methods

- intelligibility: both the justification of a specific decision and the intelligibility of the entire model (NN methods offer little of either) - dimensionality and domain knowledge (the curse of dimensionality) - computational efficiency (querying the database for each prediction can be expensive)

cluster

- a collection of data objects that are similar to each other - the mean must be defined

profit curve 2 requirements

- a pretty good estimate of class priors (the base rate: the proportion of positive and negative instances) - a good estimate of costs and benefits (expected profit is highly sensitive to these)

topic models

- add complexity to the text mining process because they attempt to deal with the complexities and nuances of language - words from a document get mapped to a number of topics identified in the document (using unsupervised techniques) - the final classifier (at the top) is defined using these intermediate topics

two main approaches to hierarchical clustering

- agglomerative approach (bottom up): start with each object as a separate group, then merge groups until a certain termination condition holds - divisive approach (top down): start with all objects in one cluster, then successively split into smaller clusters until a certain termination condition holds
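
A minimal sketch of the agglomerative (bottom-up) approach using SciPy's hierarchical clustering (assumed installed); the data points, the "average" linkage choice, and the 2-cluster cut are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 4.9]])

Z = linkage(X, method="average")                  # successively merge the closest groups
labels = fcluster(Z, t=2, criterion="maxclust")   # termination: cut the tree into 2 clusters
print(labels)  # e.g. [1 1 2 2]
```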

classification terminology

- bad outcome = a "positive" example (sound the alarm) - good outcome = a "negative" example (nothing to look at here)

IDF (inverse document frequency)

- measures the sparseness of a term t - when a term is very rare, IDF is high - as the term becomes more common, IDF decreases - IDF(t) = 1 + log(total # of documents / # of documents containing t)

DBSCAN algorithm (simplified)

- check the neighborhood of each point in the database - find satisfying neighborhoods - merge satisfying neighborhoods which are connected
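
A minimal sketch using scikit-learn's DBSCAN (assumed to be installed); eps and min_samples stand in for the "satisfying neighborhood" test, and the points are made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense regions plus one isolated point (illustrative data)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # dense region A
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # dense region B
              [4.5, 4.5]])                           # isolated point

# A neighborhood "satisfies" if it holds >= min_samples points within radius eps
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; -1 marks noise
```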

naive bayes classifier

- classifies a new example by estimating the probability that the example belongs to each class and reports the class with highest probability - naive because it models each feature as being generated independently (for each target) - fast, efficient and surprisingly effective
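
A minimal sketch of a naive Bayes text classifier with scikit-learn (assumed installed); the tiny spam/ham corpus below is invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win cash now", "limited offer win prize",
               "meeting at noon", "see you at lunch"]
train_labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()                  # bag-of-words counts (term frequency)
X = vec.fit_transform(train_texts)
clf = MultinomialNB().fit(X, train_labels)

test = vec.transform(["win a free prize"])
print(clf.predict(test))        # reports the class with the highest probability
print(clf.predict_proba(test))  # per-class probability estimates
```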

cumulative response curve

- closely related to the ROC curve, but more intuitive - plots the hit rate (tp rate) on the y axis against the percentage of the population targeted on the x axis - the lift curve is essentially the value of the cumulative response curve at a given x point divided by the value of the diagonal line at that point

density based clustering

- continues growing the given clusters as long as the density (i.e. the number of objects or data points) in the "neighborhood" exceeds some threshold - discovers clusters with arbitrary shapes due to not using the distance between data objects for clustering

general approach for partition-based clustering

- create an initial partitioning - improve the partitioning by an iterative relocation technique, i.e., move data objects from one group to another

main purposes of a cluster

- discovery of overall distribution patterns and interesting correlations among data attributes - data reduction: a cluster can be treated collectively as one group in many applications

grid based clustering

- divide the object space into a finite number of cells that form a grid structure - main advantage: fast processing time, which is typically independent of the number of data objects

requirements for clustering

- domain knowledge required to determine input parameters and evaluate results - must be able to deal with noisy data (outliers can influence the clustering) - high dimensionality - interpretability and usability

k-nn graph with hierarchical clustering chameleon

- each vertex represents a data object - two vertices are connected if one object is among the k-most similar objects of the other -- the neighborhood radius of an object is determined by the density of the region in which this object resides

how to classify features (words)

- existence (boolean) - count (term frequency)

hierarchical clustering chameleon

- explores dynamic modeling of the neighborhood in hierarchical clustering - counteracts disadvantages of common hierarchical algorithms - two clusters are merged if the interconnectivity and closeness between two clusters are highly related to the internal interconnectivity and closeness of objects within the clusters

partitioning using k-medoids

- find k clusters in n objects by first determining a representative object for each cluster - each remaining object is clustered with the medoid to which it is most similar - the medoids are iteratively replaced by one of the non-medoids as long as the quality of clustering is improved

partition-based clustering

- given a database of n objects, a partitioning method constructs k partitions of the data - each partition represents a cluster - result: the data is classified into k groups. each group must contain at least 1 object. each object must belong to exactly one group

STING advantages

- grid structure facilitates parallel processing and incremental updating - very efficient

scores from ranking classifiers

- higher score indicates higher likelihood of class membership - useful for deciding which prospects are better than others - often used when there is a budget to consider

typical clustering requirements for DM

- many cluster algorithms work well on small low dimensional data sets and numerical attributes - in large data sets, algorithms must be able to deal with scalability (handling millions of data objects) and different types of attributes (binary, nominal, ordinal)

sparseness

- measured by inverse document frequency (IDF) - the most informative terms are neither too rare nor too common - sparse data (many/most features have no value in a particular observation)

discriminative methods

- minimize loss or entropy

measures for distances between two clusters (hierarchical clustering)

- minimum distance - maximum distance - mean distance: distance between the mean of the clusters - average distance: average of all single distances

false positive and false negative errors

- number of mistakes made on negative examples (false pos) can be relatively high - cost of each mistake made on a positive example (false neg) will be relatively high

unbalanced classes

- one class is often rare - classification is typically used to find a relatively small number of unusual ones - class distribution is often unbalanced ("skewed")

impressions in system 1 affect the conclusions of system 2

- presentation impacts the way data is perceived - mood and emotions impact critical thinking

ROC curves

- similar to other model performance visualizations - separates classifier performance from costs, benefits and target class distributions

advantages and disadvantages of naive bayes

- simple yet includes all features - efficient - performs well - robust to violations of independence - naturally incremental - spam filtering is a common example

criteria for selection of clustering algorithm

- type and structure of clustering - type and definition of clusters - features of the data set - number of data objects and variables - representation of results

what is text

- unstructured data - dirty (from an analytics standpoint) - needs context

general approach to hierarchical clustering chameleon

- use a graph partitioning algorithm to cluster the data objects into a large number of relatively small temporary clusters - combine or merge sub-clusters with an agglomerative hierarchical cluster algorithm

clustering process

- works well when clusters are compact clouds and well separated - scalable and efficient in large data sets - often terminates at local optimum (=> random initial partitioning) - mean must be defined (not suitable for categorical attributes) - sensitivity to noise and outlier data (=> mean calculation)

naive bayes equation

P(C=c|E) = P(E|C=c) P(c) / P(E) - with the naive independence assumption, the likelihood factors as P(E|C=c) = P(e1|C=c) P(e2|C=c) ... P(ek|C=c) over the individual features

expected value calculation

= weighted average of the values of different possible outcomes, where the weight given to each value is the probability of its occurrence
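
A minimal numeric sketch of an expected value (e.g. expected profit per targeted customer); the outcome values and probabilities are illustrative.

```python
# Each (probability, value) pair is one possible outcome
outcomes = [(0.05, 100.0),   # customer responds: profit of 100
            (0.95, -1.0)]    # customer does not respond: cost of 1

expected_value = sum(p * v for p, v in outcomes)
print(expected_value)  # 0.05*100 + 0.95*(-1) = 4.05
```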

Minkowski distance

Generalization of the Euclidean (L2) and Manhattan (L1) distances: the Lp norm of the difference between X and Y

TFIDF

TFIDF(t, d) = TF(t, d) x IDF(t) - the TF (term frequency) part is specific to a single document d, whereas IDF depends on the entire corpus
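
A minimal Python sketch combining the TF and IDF definitions from these cards; the three-document corpus and the natural log are illustrative assumptions.

```python
import math

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs"]
docs = [d.split() for d in corpus]

def tf(term, doc):
    return doc.count(term)              # raw term frequency in one document

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)
    return 1 + math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(idf("cat", docs))                 # rarer term -> higher IDF (~2.10)
print(idf("the", docs))                 # more common term -> lower IDF (~1.41)
print(tfidf("cat", docs[0], docs))      # TF x IDF for one document
```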

if we assume independence then P(AB) =

P(A)P(B)

STING

STatistical Information Grid - the spatial area is divided into rectangular cells - hierarchical structure - statistical information (mean, max, min) regarding the attributes in each cell is precomputed and stored

Manhattan Distance

Sums the absolute differences along the different dimensions between X and Y - represents the total street distance you would have to travel to get between two points - the L1 norm

ranking instead of classifying

a different strategy for making decisions is to rank a set of cases by these scores and then take actions on the cases at the top of the ranked list

method for clustering using k-means

arbitrarily choose k objects as the initial cluster centers; repeat: - (re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster - update the cluster means, i.e., calculate the mean value of the objects for each cluster; until no change
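
A minimal sketch using scikit-learn's KMeans (assumed installed); k = 2 and the data points are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [7.0, 7.0], [7.1, 6.9], [6.8, 7.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each object
print(km.cluster_centers_)  # the cluster means
```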

calinski-harabasz index

cluster homogeneity and separation measure across bootstrap runs

adjusted rand index

cluster similarity across bootstrap runs - lower scores indicate more random cluster generation

corpus

collection of documents

regression

derive target value from the mean or median of the neighbors

differential description

describes only what differentiates this cluster from the others, ignoring the characteristics that may be shared within it - intergroup differences

characteristic description

describes what is typical or characteristic of the cluster, ignoring whether other clusters might share some of the same characteristics - intragroup commonalities

classification trees fall in what category (discriminative or generative)

discriminative

linear equations fall in what category (discriminative or generative)

discriminative

named entity extraction

- for known phrases or names - ex: IBM / I.B.M.; New York Mets / Mets / Amazins; Game of Thrones and other show or movie titles that are relevant as a phrase

bayesian methods fall in what category (discriminative or generative)

generative methods

AUC - area under the ROC curve

good measure of overall performance, if single-number metric is needed and target conditions are completely unknown - measures the ranking quality of a model - a fair measure of the quality of probability estimates - useful when nothing is known about the operating conditions - equivalent to wilcoxon (WMW) statistic and Gini coefficient
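
A minimal sketch of computing AUC from ranking scores with scikit-learn (assumed installed); the labels and scores below are made up.

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # 1 = positive class
y_scores = [0.9, 0.4, 0.7, 0.3, 0.6, 0.1, 0.8, 0.2]    # ranking scores from a classifier

# 0.875 here: the probability that a randomly chosen positive is ranked
# above a randomly chosen negative (the WMW interpretation)
print(roc_auc_score(y_true, y_scores))
```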

clustering using k-means

input: number of clusters k and a database containing n objects output: set of k clusters that minimizes the squared-error criterion

tokens or terms

just a word

K-NN

k is a complexity parameter

differences between k-medoids and k-means

k-medoids is more robust and more costly than k-means

Combining function

like voting or averaging - will give a prediction

classification

look for nearest neighbors and derive target class for new example

generative methods

model how data was generated

K = 1

more complex and more likely to overfit

K = 30

more representative

medoid

most centrally located object in a cluster

Nearest Neighbors

the most-similar instances - prediction uses voting (majority vote) or similarity-moderated voting (each neighbor's contribution is scaled by its similarity)

is differential or characteristic description better

neither is better, it depends what you're using it for

is syntactic similarity the same as semantic similarity

no

criteria for partition-based clustering

objects in the same group are "close" to each other, whereas objects of different clusters are "far apart"

main disadvantage of hierarchical clustering

once a merge or split is done, it can never be undone. i.e., erroneous decisions cannot be corrected

baseline method

one of the first methods to be applied to any new problem

Euclidean Distance

computes the overall distance from the distances along the individual dimensions (the individual features in our setting) - the most common geometric distance measure - the L2 norm
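
A minimal Python sketch of the Minkowski family, which covers both Manhattan (p = 1) and Euclidean (p = 2); the points x and y are illustrative.

```python
def minkowski(x, y, p):
    # Lp norm of the difference vector
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, 3.0]

print(minkowski(x, y, 1))  # Manhattan (L1): 3 + 2 + 0 = 5
print(minkowski(x, y, 2))  # Euclidean (L2): sqrt(9 + 4 + 0) ≈ 3.61
```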

if we do not assume independence then p(AB) =

p(A)p(B | A)

joint probability

p(AB) - dice example - we can calculate the probability of the "joint" event by multiplying the probabilities of the individual events

If A and B are independent then ...

p(B|A) = p(B)

n-grams

phrases made up of multiple words (up to N)
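
A minimal Python sketch extracting word n-grams up to N; the sentence and N = 2 are illustrative.

```python
tokens = "new york mets win again".split()
N = 2

ngrams = [" ".join(tokens[i:i + n])
          for n in range(1, N + 1)
          for i in range(len(tokens) - n + 1)]
print(ngrams)  # unigrams plus bigrams such as 'new york' and 'york mets'
```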

ROC graph

receiver operating characteristics - we are most interested in the left side of the graph - a perfect model sits at (0, 1)

profit curves

record the percentage of the list predicted as positive and the corresponding estimated profit

bayes rule

says that we can compute the probability of our hypothesis H given some evidence E by instead looking at the probability of the evidence given the hypothesis, as well as the unconditional probabilities of the hypothesis and the evidence
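
A minimal numeric sketch of Bayes' rule, p(H|E) = p(E|H) p(H) / p(E); all probabilities below are invented for illustration.

```python
p_H = 0.01             # prior probability of the hypothesis (e.g. "is spam")
p_E_given_H = 0.9      # likelihood of the evidence given the hypothesis
p_E_given_notH = 0.05  # likelihood of the evidence otherwise

p_E = p_E_given_H * p_H + p_E_given_notH * (1 - p_H)  # unconditional evidence probability
p_H_given_E = p_E_given_H * p_H / p_E
print(p_H_given_E)     # ≈ 0.154: the posterior after seeing the evidence
```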

naive bayes is often used with ..

sparse data

system 2

the deliberate, controlled, conscious, and slower way of thinking

system 1

the intuitive, automatic, unconscious, and fast way of thinking

feature selection (solution 1)

the judicious determination of features that should be included in the data mining model

P(E|C=c)

the likelihood of evidence given class C=c

P(E)

the likelihood of the evidence

lift curve

the performance improvement at a given threshold - typically expressed as a multiple (e.g. "2x lift") - very sensitive to changes in priors

P(C=c|E)

the posterior - can be used as a score - requires knowing p(E|c) which is difficult to measure

P(c)

the prior

term frequency

the word count (frequency) in the document instead of just a zero or one

bag of words

this approach is to treat every document as a collection of individual words - this ignores grammar, word order, sentence structure and usually punctuation
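
A minimal Python sketch of the bag-of-words representation for a single document; the sentence is made up, and punctuation handling is omitted.

```python
from collections import Counter

doc = "the cat sat on the mat"
bag = Counter(doc.lower().split())   # word order and grammar are ignored
print(bag)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```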

jaccard distance

treats the two objects as sets of characteristics - appropriate when the possession of a common characteristic between two items is important but the common absence of a characteristic is not
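
A minimal Python sketch of Jaccard distance between two objects treated as sets of characteristics; the item sets are made up.

```python
a = {"hiking", "jazz", "coffee"}
b = {"hiking", "coffee", "cycling"}

similarity = len(a & b) / len(a | b)   # shared characteristics / all characteristics present
distance = 1 - similarity
print(distance)  # 1 - 2/4 = 0.5
```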

conditional independence

two events are conditionally independent if, given some other event (e.g. the class), knowing one does not give you information about the probability of the other - expressed using conditional probabilities

is clustering supervised or unsupervised

unsupervised

cosine distance

used in text classification to measure the similarity of two documents - useful when you want to ignore differences in scale across instances (technically when you want to ignore the magnitude of vectors)
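
A minimal Python sketch of cosine distance between two term-count vectors; the vectors are illustrative and differ only in scale.

```python
import math

a = [3, 0, 1, 2]
b = [6, 0, 2, 4]   # same direction as a, twice the magnitude

def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return 1 - dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

print(cosine_distance(a, b))  # ≈ 0: magnitude is ignored, only direction matters
```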

confusion matrix

used to decompose and count the different types of correct and incorrect decisions made by a classifier - separates out the decisions made by the classifier, making explicit how one class is being confused for another - shows us all possible combinations of predictions and outcomes - main diagonal contains the count of correct decisions
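
A minimal sketch tying the confusion matrix to the accuracy equation and to the tp/fp rates used by the curves above; the counts are made up.

```python
#                 predicted positive   predicted negative
# actual positive       TP = 40              FN = 10
# actual negative       FP = 5               TN = 45
TP, FN, FP, TN = 40, 10, 5, 45

total = TP + FN + FP + TN
accuracy = (TP + TN) / total   # correct decisions / total decisions = 0.85
tp_rate = TP / (TP + FN)       # hit rate (y axis of ROC / cumulative response) = 0.8
fp_rate = FP / (FP + TN)       # x axis of the ROC curve = 0.1
print(accuracy, tp_rate, fp_rate)
```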

evidence "lifts"

used to examine large numbers of possible pieces of evidence for or against a conclusion

cluster analysis

finding groups of objects (consumers, businesses, whiskeys, etc.) where the objects within a group are similar but objects in different groups are not so similar - part of the exploratory phase

word association

words that appear near other words

predictive classes

yes and no

