609 test 2
accuracy equation
# of correct decisions made / total # of decisions made
error equation
# of incorrect decisions made / total # of decisions made
curse of dimensionality
(an issue with NN methods) - having too many or irrelevant attributes may confuse the distance calculation - injecting domain knowledge is a solution (solution 2)
Issues with NN methods
- intelligibility: the justification of a specific decision and the intelligibility of an entire model (NN methods lack both) - dimensionality and domain knowledge - computational efficiency (querying the database for each prediction can be expensive)
cluster
- a collection of data objects that are similar to each other - the mean must be defined
profit curve 2 requirements
- a pretty good estimate of class priors (base rate proportion of positive and negative instances) - a good estimate of costs and benefits (expected profit is highly sensitive to these)
topic models
- add complexity to the text mining process because they attempt to deal with the complexities and nuances of language - from a document, words get mapped to a number of topics identified in the document (using unsupervised techniques) - the final classifier (at the top) is defined over these intermediate topics
two main approaches to hierarchical clustering
- agglomerative approach (bottom up) -- start with each object as a separate group, then merge groups until a certain termination condition holds - divisive approach (top down) -- start with all objects in one cluster, then successively split into smaller clusters until a certain termination condition holds
classification terminology
- bad outcome = a "positive" example (sound the alarm) - good outcome = a "negative" example (nothing to look at here)
IDF (inverse document frequency)
- measures the sparseness of a term t - when a term is very rare, the IDF is high - when a term is more common, the IDF decreases - IDF(t) = 1 + log(total # of documents / # of documents containing t)
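A minimal Python sketch of that formula, using natural log and a toy corpus invented for illustration:

```python
import math

def idf(term, documents):
    """IDF(t) = 1 + log(total # of documents / # of documents containing t)."""
    docs_with_term = sum(1 for doc in documents if term in doc)
    return 1 + math.log(len(documents) / docs_with_term)

corpus = [{"data", "mining", "cluster"},
          {"data", "science"},
          {"cluster", "analysis"}]
print(idf("data", corpus))     # common term -> lower IDF (~1.41)
print(idf("science", corpus))  # rare term -> higher IDF (~2.10)
```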
DBSCAN algorithm (simplified)
- check the neighborhood of each point in the database - find satisfying neighborhoods - merge satisfying neighborhoods which are connected
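A hedged sketch of those three steps using scikit-learn's DBSCAN; the eps and min_samples values and the toy points are arbitrary choices, not from the card:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense blobs plus one far-away noise point
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],
              [9.0, 0.0]])

# eps = neighborhood radius, min_samples = density threshold for a
# "satisfying" neighborhood; connected dense neighborhoods are merged
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # points that satisfy no neighborhood are labeled -1 (noise)
```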
naive bayes classifier
- classifies a new example by estimating the probability that the example belongs to each class and reports the class with highest probability - naive because it models each feature as being generated independently (for each target) - fast, efficient and surprisingly effective
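A minimal sketch using scikit-learn's MultinomialNB; the toy word-count features and labels are invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy word-count features (rows = documents, columns = terms)
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 2]])
y = np.array(["spam", "spam", "ham", "ham"])

model = MultinomialNB().fit(X, y)
print(model.predict([[1, 0, 0]]))        # class with the highest probability
print(model.predict_proba([[1, 0, 0]]))  # per-class probability estimates
```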
cumulative response curve
- closely related to the roc curve, but more intuitive and clearer - plots the hit rate (tp rate) on the y axis - the lift curve is essentially the value of the cumulative response curve at a given x point divided by the value of the diagonal line at that point
density based clustering
- continues growing the given clusters as long as the density (i.e. the number of objects or data points) in the "neighborhood" exceeds some threshold - discovers clusters with arbitrary shapes due to not using the distance between data objects for clustering
general approach for partition-based clustering
- create an initial partitioning - improve the partitioning by an iterative relocation technique, i.e., move data objects from one group to another
main purposes of a cluster
- discovery of overall distribution patterns and interesting correlations among data attributes - data reduction: a cluster can be treated collectively as one group in many applications
grid based clustering
- divide the object space into a finite number of cells that form a grid structure - main advantage: fast processing time, which is typically independent of the number of data objects
requirements for clustering
- domain knowledge required to determine input parameters and evaluate results - must be able to deal with noisy data (outliers can influence the clustering) - high dimensionality - interpretability and usability
k-nn graph with hierarchical clustering chameleon
- each vertex represents a data object - two vertices are connected if one object is among the k-most similar objects of the other -- the neighborhood radius of an object is determined by the density of the region in which this object resides
how to classify features (words)
- existence (boolean) - count (term frequency)
hierarchical clustering chameleon
- explores dynamic modeling of the neighborhood in hierarchical clustering - counteracts disadvantages of common hierarchical algorithms - two clusters are merged if the interconnectivity and closeness between two clusters are highly related to the internal interconnectivity and closeness of objects within the clusters
partitioning using k-medoids
- find k clusters in n objects by first determining a representative object for each cluster - each remaining object is clustered with the medoid to which it is most similar - the medoids are iteratively replaced by one of the non-medoids as long as the quality of clustering is improved
partition-based clustering
- given a database of n objects, a partitioning method constructs k partitions of the data - each partition represents a cluster - result: the data is classified into k groups. each group must contain at least 1 object. each object must belong to exactly one group
STING advantages
- grid structure facilitates parallel processing and incremental updating - very efficient
scores from ranking classifiers
- higher score indicates higher likelihood of class membership - useful for deciding which prospects are better than others - often used when there is a budget to consider
typical clustering requirements for DM
- many cluster algorithms work well on small low dimensional data sets and numerical attributes - in large data sets, algorithms must be able to deal with scalability (handling millions of data objects) and different types of attributes (binary, nominal, ordinal)
sparseness
- measured by inverse document frequency (IDF) - the most informative terms are neither too rare nor too common - sparse data (many/most features have no value in a particular observation)
discriminative methods
- minimize loss or entropy
measures for distances between two clusters (hierarchical clustering)
- minimum distance - maximum distance - mean distance: distance between the mean of the clusters - average distance: average of all single distances
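Those four measures correspond roughly to SciPy's linkage methods; a minimal sketch with invented toy points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])

# 'single' = minimum distance, 'complete' = maximum distance,
# 'average' = average distance, 'centroid' ~ distance between cluster means
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)
    print(method, fcluster(Z, t=2, criterion="maxclust"))
```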
false positive and false negative errors
- false positives: mistakes made on negative examples; their number can be relatively high - false negatives: mistakes made on positive examples; the cost of each one will be relatively high
unbalanced classes
- one class is often rare - classification is typically used to find a relatively small number of unusual ones - class distribution is often unbalanced ("skewed")
impressions in system 1 affect the conclusions of system 2
- presentation impacts the way data is perceived - mood and emotions impact critical thinking
ROC curves
- similar to other model performance visualizations - separates classifier performance from costs, benefits and target class distributions
advantages and disadvantages of naive bayes
- simple yet includes all features - efficient - performs well - robust to violations of independence - naturally incremental - spam filtering is a common example
criteria for selection of clustering algorithm
- type and structure of clustering - type and definition of clusters - features of the data set - number of data objects and variables - representation of results
what is text
- unstructured data - dirty (from an analytics standpoint) - needs context
general approach to hierarchical clustering chameleon
- use a graph partitioning algorithm to cluster the data objects into a large number of relatively small temporary clusters - combine or merge sub-clusters with an agglomerative hierarchical cluster algorithm
clustering process
- works well when clusters are compact clouds and well separated - scalable and efficient in large data sets - often terminates at local optimum (=> random initial partitioning) - mean must be defined (not suitable for categorical attributes) - sensitivity to noise and outlier data (=> mean calculation)
naive bayes equation
p(C=c|E) = p(e1|c) x p(e2|c) x ... x p(ek|c) x p(c) / p(E) - each piece of evidence e is treated as conditionally independent given the class
expected value calculation
= weighted average of the values of different possible outcomes, where the weight given to each value is the probability of its occurrence
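A tiny worked example of that weighted average; the outcome values and probabilities are made up:

```python
# expected value = sum over outcomes of probability * value
outcomes = [
    (0.05, 100.0),   # respond to offer: p = 0.05, profit = $100
    (0.95, -1.0),    # no response:      p = 0.95, cost  = -$1
]
expected_value = sum(p * v for p, v in outcomes)
print(expected_value)  # 0.05*100 + 0.95*(-1) = 4.05
```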
Minkowski distance
generalization of the Manhattan (L1, p = 1) and Euclidean (L2, p = 2) distances: d(X,Y) = (sum of |x_i - y_i|^p over the dimensions)^(1/p)
TFIDF
TFIDF(t, d) = TF(t, d) x IDF(t) - the TF part is specific to a single document (d), whereas IDF depends on the entire corpus
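A hedged sketch combining the TF and IDF cards; the tokenized toy documents are invented, and a real pipeline would typically use a library vectorizer instead:

```python
import math

def tfidf(term, doc, corpus):
    tf = doc.count(term)                                 # term frequency in this document
    df = sum(1 for d in corpus if term in d)             # documents containing the term
    idf = 1 + math.log(len(corpus) / df) if df else 0.0  # corpus-wide sparseness
    return tf * idf

corpus = [["data", "mining", "data"], ["text", "mining"], ["data", "science"]]
print(tfidf("data", corpus[0], corpus))  # TF = 2, IDF = 1 + log(3/2)
```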
if we assume independence then P(AB) =
P(A)P(B)
STING
STatistical Information Grid - spatial area is divided into rectangular cells - hierarchical structure - statistical information (mean, max, min) regarding the attributes in each cell is precomputed and stored
Manhattan Distance
- sums the absolute differences along the different dimensions between X and Y - represents the total street distance you would have to travel to get between two points - L1 norm
ranking instead of classifying
a different strategy for making decisions is to rank a set of cases by these scores and then take actions on the cases at the top of the ranked list
method for clustering using k-means
arbitrarily choose k objects as the initial cluster centers; repeat - (re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster - update the cluster means, i.e., calculate the mean value of the objects of each cluster - until no change
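A minimal NumPy sketch of that loop; k, the toy points, and the convergence test are illustrative assumptions, and empty clusters are not handled:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary initial centers
    for _ in range(n_iter):
        # (re)assign each object to the nearest cluster mean
        labels = np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)
        # update the cluster means (note: empty clusters would break this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # stop when nothing changes
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9]])
print(k_means(X, k=2))
```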
calinski-harabasz index
cluster homogeneity and separation measure across bootstrap runs
adjusted rand index
cluster similarity across bootstrap runs - lower scores indicate more random cluster generation
corpus
collection of documents
regression
derive target value from the mean or median of the neighbors
differential description
describes only what differentiates this cluster from the others, ignoring the characteristics that may be shared within it - intergroup differences
characteristic description
describes what is typical or characteristic of the cluster, ignoring whether other clusters might share some of the same characteristics - intragroup commonalities
classification trees fall in what category (discriminative or generative)
discriminative
linear equations fall in what category (discriminative or generative)
discriminative
named entity extraction
- for known phrases or names - ex: IBM / I.B.M.; New York Mets / Mets / Amazins; Game of Thrones and other show or movie titles that are relevant as a phrase
bayesian methods fall in what category (discriminative or generative)
generative methods
AUC - area under the ROC curve
good measure of overall performance, if single-number metric is needed and target conditions are completely unknown - measures the ranking quality of a model - a fair measure of the quality of probability estimates - useful when nothing is known about the operating conditions - equivalent to wilcoxon (WMW) statistic and Gini coefficient
clustering using k-means
- input: the number of clusters k and a database containing n objects - output: a set of k clusters that minimizes the squared-error criterion
tokens or terms
just a word
K-NN
k is a complexity parameter
differences between k-medoids and k-means
k-medoids is more robust and more costly than k-means
Combining function
combines the nearest neighbors' targets, e.g., by voting or averaging, to give a prediction
classification
look for nearest neighbors and derive target class for new example
generative methods
model how data was generated
K = 1
more complex and more likely to overfit
K = 30
more representative
medoid
most centrally located object in a cluster
Nearest Neighbors
- prediction is based on the most-similar instances - uses majority voting or similarity-moderated voting (each neighbor's contribution is scaled by its similarity)
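A minimal scikit-learn sketch of both voting schemes; the toy data is invented:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

majority = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
similarity = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

print(majority.predict([[2.0]]))    # plain majority vote among the 3 nearest
print(similarity.predict([[2.0]]))  # each neighbor's vote scaled by 1/distance
```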
is differential or characteristic description better
neither is better, it depends what you're using it for
is syntactic similarity the same as semantic similarity
no
criteria for partition-based clustering
objects in the same group are "close" to each other, whereas objects of different clusters are "far apart"
main disadvantage of hierarchical clustering
once a merge or split is done, it can never be undone. i.e., erroneous decisions cannot be corrected
baseline method
one of the first methods to be applied to any new problem
Euclidean Distance
- computes the overall distance from the distances along the individual dimensions (the individual features in our setting) - most common geometric distance measure - d(X,Y) = sqrt(sum of (x_i - y_i)^2 over the dimensions) - L2 norm
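A small sketch relating the Manhattan (L1), Euclidean (L2), and Minkowski cards, with invented toy vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

def minkowski(x, y, p):
    # p = 1 gives Manhattan (L1), p = 2 gives Euclidean (L2)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(minkowski(x, y, 1))            # 5.0   (Manhattan)
print(minkowski(x, y, 2))            # ~3.606 (Euclidean)
print(np.linalg.norm(x - y, ord=2))  # same Euclidean result via NumPy
```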
if we do not assume independence then p(AB) =
p(A)p(B | A)
joint probability
- p(AB) - dice example: we can calculate the probability of the "joint" event by multiplying the probabilities of the individual events (when the events are independent)
If A and B are independent then ...
p(B|A) = p(B)
n-grams
phrases made up of multiple words (up to N)
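A tiny sketch that generates n-grams; the token list and N are chosen arbitrarily:

```python
def ngrams(tokens, n):
    """Return all phrases of exactly n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york mets win again".split()
print(ngrams(tokens, 2))  # ['new york', 'york mets', 'mets win', 'win again']
```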
ROC graph
receiver operating characteristics - most interested in left side - perfect model is at 0,1
profit curves
record the percentage of the list predicted as positive and the corresponding estimated profit
bayes rule
says that we can compute the probability of our hypothesis H given some evidence E by instead looking at the probability of the evidence given the hypothesis, as well as the unconditional probabilities of the hypothesis and the evidence
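written out in the notation of this card: p(H|E) = p(E|H) x p(H) / p(E)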
naive bayes is often used with ..
sparse data
system 2
the deliberate, controlled, conscious, and slower way of thinking
system 1
the intuitive, automatic, unconscious, and fast way of thinking
feature selection (solution 1)
the judicious determination of features that should be included in the data mining model
P(E|C=c)
the likelihood of evidence given class C=c
P(E)
the likelihood of the evidence
lift curve
the performance improvement at a given threshold - typically expressed as a multiple (e.g., "2x lift") - very sensitive to changes in priors
P(C=c|E)
the posterior - can be used as a score - requires knowing p(E|c) which is difficult to measure
P(c)
the prior
term frequency
the word count (frequency) in the document instead of just a zero or one
bag of words
this approach is to treat every document as a collection of individual words - this ignores grammar, word order, sentence structure and usually punctuation
jaccard distance
treats the two objects as sets of characteristics - appropriate when the possession of a common characteristic between two items is important but the common absence of a characteristic is not
conditional independence
two events are independent if knowing one does not give you information on the probability of the other - conditional independence expresses the same idea using conditional probabilities
is clustering supervised or unsupervised
unsupervised
cosine distance
used in text classification to measure the similarity of two documents - useful when you want to ignore differences in scale across instances (technically when you want to ignore the magnitude of vectors)
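A minimal sketch of the cosine and jaccard distance cards; the toy vectors and whisky-flavor sets are invented:

```python
import numpy as np

def cosine_distance(x, y):
    # 1 - cosine similarity: ignores the magnitude of the vectors
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_distance(a, b):
    # 1 - |intersection| / |union|: shared absences are never counted
    return 1 - len(a & b) / len(a | b)

print(cosine_distance(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~0 (same direction)
print(jaccard_distance({"peaty", "smoky"}, {"smoky", "sweet"}))     # 1 - 1/3 ~= 0.667
```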
confusion matrix
used to decompose and count the different types of correct and incorrect decisions made by a classifier - separates out the decisions made by the classifier, making explicit how one class is being confused for another - shows us all possible combinations of predictions and outcomes - main diagonal contains the count of correct decisions
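A small sketch tying the confusion matrix counts to the accuracy and error equations at the top of this set; the counts are invented:

```python
# counts of decisions: actual positives vs negatives, predicted positive vs negative
tp, fn = 30, 10   # actual positives: predicted positive / predicted negative
fp, tn = 20, 940  # actual negatives: predicted positive / predicted negative

total = tp + fn + fp + tn
accuracy = (tp + tn) / total  # correct decisions / all decisions
error = (fp + fn) / total     # incorrect decisions / all decisions
tp_rate = tp / (tp + fn)      # hit rate used by ROC and cumulative response curves
fp_rate = fp / (fp + tn)
print(accuracy, error, tp_rate, fp_rate)  # 0.97 0.03 0.75 ~0.0208
```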
evidence "lifts"
used to examine large numbers of possible pieces of evidence for or against a conclusion
cluster analysis
when finding groups of objects (consumers, businesses, whiskey, etc) where the objects within groups are similar but the objects in different groups are not so similar - exploratory phase
word association
words that appear near other words
predictive classes
yes and no