CS 470 - Midterms 1 & 2
Distance Between Means/Centroids Clustering
A more practical and efficient version of AVERAGE Clustering
AUC
Area Under Curve
Properties of *Ordinal* Attribute
Order & Distinctness (< & >)
DBSCAN: Noise Point
!CorePoint & !BorderPoint
DBSCAN: Border Point
!CorePoint, but within Eps of a CorePoint
Conf(X -> Y)
# Transactions Containing X & Y / # Transactions Containing X
Supp( X U Y)
# Transactions Containing X & Y / Total # Transactions
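A minimal Python sketch (toy transactions, not from the slides) showing how these two formulas are computed:

# Sketch: support and confidence over a hypothetical transaction list.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    # fraction of transactions containing every item in `itemset`
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # Conf(X -> Y) = Supp(X U Y) / Supp(X)
    return support(set(X) | set(Y)) / support(X)

print(support({"diapers", "beer"}))        # 3/5 = 0.6
print(confidence({"diapers"}, {"beer"}))   # (3/5) / (4/5) = 0.75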
Proximity Measure for Nominal Attributes SMC(i,j)
# attributes where i & j match / # attributes
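A tiny sketch of SMC with hypothetical nominal attribute values:

# Sketch: Simple Matching Coefficient for two objects i and j (made-up values).
i = ["red", "small", "round", "yes"]
j = ["red", "large", "round", "yes"]
smc = sum(a == b for a, b in zip(i, j)) / len(i)   # matching attributes / all attributes
print(smc)                                         # 3 of 4 match -> 0.75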
Confusion Matrix: Error
(FP+FN)/All
Confusion Matrix: Accuracy
(TP+TN)/All
K-means limitations
- Need to specify the number of seeds (k) - Not good at non-globular shapes or outliers (ex: the last boba ball)
Data mining tasks
1) Clustering (ex: subdivide a market into subsets of customers) 2) Association Rules (ex: "Hurricane hit Miami" → "Florida flooding") 3) Predictive Modeling 4) Anomaly Detection
FP-Tree Construction
1) Scan the database (DB) once to find the frequent 1-itemsets 2) Sort the frequent items in descending frequency order: this gives the Header Table 3) Scan the DB again and construct the FP-tree by inserting each transaction's frequent items, in header-table order, as a prefix path
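A minimal sketch of the two scans on made-up transactions (item names and min_support are hypothetical; the mining step is omitted):

from collections import defaultdict

transactions = [["f", "a", "c", "d"], ["a", "b", "c", "f"], ["b", "f"], ["b", "c"]]
min_support = 2

# Scan 1: count items, keep the frequent ones, order by descending count (header table order).
counts = defaultdict(int)
for t in transactions:
    for item in t:
        counts[item] += 1
header = [i for i, c in sorted(counts.items(), key=lambda kv: -kv[1]) if c >= min_support]
rank = {item: r for r, item in enumerate(header)}

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

root = Node(None, None)   # placeholder root

# Scan 2: drop infrequent items, sort each transaction in header order, and insert it
# as a prefix path; shared prefixes reuse existing nodes and just increment their counts.
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in rank), key=rank.get):
        if item in node.children:
            node.children[item].count += 1
        else:
            node.children[item] = Node(item, node)
        node = node.children[item]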
Why data mining?
1) explosive growth of data 2) Drowning in data, starving for knowledge 3) Hardware 4) Software
Confusion Matrix: Fmeasure
2 x precision x recall / (precision + recall)
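A short sketch (hypothetical counts) that pulls the confusion-matrix cards together:

# Sketch: error, accuracy, precision, recall, specificity, F-measure from made-up counts.
TP, FN, FP, TN = 40, 10, 5, 45

total       = TP + TN + FP + FN
error       = (FP + FN) / total
accuracy    = (TP + TN) / total
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)            # recall = sensitivity
specificity = TN / (TN + FP)
f_measure   = 2 * precision * recall / (precision + recall)
print(error, accuracy, precision, recall, specificity, f_measure)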
Why preprocess data?
Accuracy, Completeness, Consistency, Timeliness, Believability, Interpretability
More Efficient: Agglomerative or Divisive
Agglomerative (AGNES)
Frequent Itemsets
An itemset is frequent if its support is no less than a minimum support threshold
Downward Closure
Any subset of a frequent itemset must be frequent
Best Attribute for Splitting (based on impurity change)
Impurity before split - weighted sum of impurities after split, where weight = |Dj| / |D| and Dj = j-th partition after the split *Gini can improve while classification error stays the same
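A small sketch of the impurity-change computation using Gini and hypothetical class counts:

# Sketch: Gini gain of a candidate split (class counts are made up).
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = [7, 3]               # class counts before the split
children = [[6, 1], [1, 2]]   # class counts of the partitions D1, D2 after the split

n = sum(parent)
weighted_after = sum(sum(d) / n * gini(d) for d in children)   # weight = |Dj| / |D|
gain = gini(parent) - weighted_after    # pick the attribute with the largest gain
print(gini(parent), weighted_after, gain)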
AdaBoost
Boosting (w/ weights) where each round's weighted error must stay below 50%; if it is greater, scrap that round
Bagging Algorithm
Build multiple classifiers on bootstrap samples without changing record weights; combine them by voting
Boosting
Build multiple classifiers where record weights change to focus on misclassified records - The rounds can complement each other
MIN Clustering
Cluster distance = the shortest pairwise distance between the two clusters Can handle non-globular shapes, but sensitive to noise & outliers
How to handle 0s in Bayes
Laplace estimate (add-one smoothing), e.g., the coin-flip example: (count + 1) / (total + number of possible values)
Agglomerative Hierarchical Clustering - AGNES
Combine groups from bottom to top Calculate pairs necessary: (n-1)(n)/2
What viewpoints motivate data mining?
Commercial viewpoint, Scientific viewpoint, & Society viewpoint
AVERAGE Clustering
Compromise between MIN & MAX
Rule Generation (X → Y)
Conf (X → Y) = Supp(X U Y) / Supp(X) If # items in frequent itemset is k then there are 2^k - 2 candidate association rules.
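A short sketch enumerating the 2^k - 2 candidate rules of one frequent itemset (the itemset is hypothetical; confidence would then be checked with support counts as above):

from itertools import combinations

frequent_itemset = {"bread", "milk", "diapers"}    # k = 3 -> 2^3 - 2 = 6 candidate rules
k = len(frequent_itemset)

rules = []
for r in range(1, k):                              # non-empty, proper antecedents only
    for antecedent in combinations(sorted(frequent_itemset), r):
        consequent = frequent_itemset - set(antecedent)
        rules.append((set(antecedent), consequent))
print(len(rules))    # 6; keep the rules whose confidence meets minconf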
Strongly Convertible
Convertible both when items are ordered in ascending & descending order
Discrete Attribute
Countably Infinite or Finite (Such as Binary)
Major Tasks in Data Preprocessing
Data Cleaning (an iterative process), Data Integration, Data Reduction, Data Transformation & Discretization
Aggregation Purpose
Data Reduction, Change of Scale, More "stable" data
Knowledge Discovery from Data (KDD) Process
Databases → Data Integration & Cleaning → Data Warehouse → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge
Data Reduction Strategies
Dimensionality Reduction (ex: remove unimportant attributes), Numerosity Reduction (ex: clustering or regression), Data Compression
Properties of *Interval* Attribute
Distance is Meaningful & Order & Distinctness (+ & -)
Properties of *Nominal* Attribute
Distinctness (= & ≠)
Quantile Plot
Each value xi is paired with fi indicating that 100 fi% of data are ≤ xi
Evolution of sciences
Empirical (<1600s) → Theoretical (1600-1950s) → Computational (1950-1990s) → Data (1990-now)
Succinct
Whether an itemset satisfies the constraint can be determined explicitly and precisely from the items it contains
Data Mining Definition
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
k-means clustering
Find groups of points that are close to each other but far from points in other groups - Each cluster is defined entirely and only by its center (Straws into Milkshake)
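A minimal NumPy sketch of the k-means loop (Lloyd's algorithm) on made-up 2-D data; empty clusters are not handled:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # pick k seed points
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centers = kmeans(X, k=2)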
Clustering
Finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups
Density Based Clustering: DBSCAN Algorithm
Good with noise and clusters of different shapes & sizes Bad when densities vary significantly
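A usage sketch with scikit-learn (assumes scikit-learn is installed; the eps and min_samples values are arbitrary):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 8])
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # eps = Eps, min_samples = MinPts
print(set(labels))                                       # cluster ids; -1 marks noise points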
Quantile-Quantile (q-q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
DBSCAN: Core Point
Has at least MinPts points within its Eps-neighborhood
Proximity: Similarity
How alike two objects are Range [0,1] Higher value = more alike
Proximity: Dissimilarity
How different two objects are Minimum is often 0, unclear maximum Lower = more alike
Monotonic
If an itemset satisfies C, so do its supersets, so there is no need to check C again in further mining.
Anti-monotonic
If an itemset violates constraint C, so do all of its supersets, so further mining on them can be terminated
Types of Data Cleaning
Incomplete, Noisy, Inconsistent, Intentional
Gain Ratio
Info Gain / Split Info
Correlation Invariance
Invariant to *scaling & translation*
Discretization: Time Complexity
k-1 intervals => O(k²) new items Improve efficiency using *max support*: if an interval is frequent, then all intervals that contain it must also be frequent
Types of Constraints in Pattern Mining
Knowledge type, Data, Dimension/level, Pattern (rule), Interestingness
Apriori Algorithm
Let k=1 Generate frequent itemsets of length 1 Repeat until no new freq itemsets are identified: - Generate k+1 candidate itemsets from length k frequent itemsets - *Prune* candidate itemsets containing subsets of length k that are *infrequent* - Count the support of each candidate by scanning the transaction table - Eliminate candidates that are infrequent, leaving only those that are frequent
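A compact sketch of that loop on toy transactions (data and minsup are made up; candidate generation uses the Fk-1 x Fk-1 prefix merge from the card further down):

from itertools import combinations

transactions = [{"a","b","c"}, {"a","b"}, {"a","c"}, {"b","c"}, {"a","b","c","d"}]
minsup = 3    # absolute support count

def count(itemset):
    return sum(set(itemset) <= t for t in transactions)

items = sorted({i for t in transactions for i in t})
frequent = [(i,) for i in items if count((i,)) >= minsup]    # k = 1
all_frequent = list(frequent)

k = 2
while frequent:
    # generate k-itemset candidates by merging frequent (k-1)-itemsets with the same prefix
    candidates = set()
    for a, b in combinations(frequent, 2):
        if a[:-1] == b[:-1]:
            cand = tuple(sorted(set(a) | set(b)))
            # prune: every (k-1)-subset of the candidate must itself be frequent
            if all(sub in frequent for sub in combinations(cand, k - 1)):
                candidates.add(cand)
    frequent = [c for c in candidates if count(c) >= minsup]   # count support, eliminate
    all_frequent += frequent
    k += 1

print(all_frequent)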
Estimate Median with Grouped Data (Needed for Final but !Midterm)
Locate the group containing the median Fraction (percentile) = [ (total frequency of all data) / 2 - (cumulative frequency of all groups below the median group) ] / (frequency of the median group) Median ≈ minimum of the median group + (range of the median group × fraction)
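A worked sketch in Python with hypothetical bins and frequencies:

# Sketch: approximate median from grouped data; (range, frequency) pairs are made up.
groups = [((0, 10), 5), ((10, 20), 12), ((20, 30), 8)]

total = sum(f for _, f in groups)             # 25 -> median position = 12.5
below = 0
for (lo, hi), f in groups:
    if below + f >= total / 2:                # this group contains the median
        fraction = (total / 2 - below) / f    # how far into the group the median falls
        median = lo + (hi - lo) * fraction
        break
    below += f
print(median)    # 10 + 10 * (12.5 - 5) / 12 = 16.25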
Attribute Creation
Mapping to New Space, Attribute Construction, Attribute Extraction
Apriori: Candidate Generation Fk-1 x Fk-1 Method
Merge two frequent (k-1)-itemsets if their first (k-2) items (prefix) are identical
Major issues in Data Mining
Methodology, user interaction, efficiency & scalability, diversity of data types, & society + bias
DBSCAN: MinPts
MinPts = k nearest neighbors Lower MinPts → Less likely to be a noise point
Partitioning Clusters Equation
Minimize the sum of squared distances within the k clusters: SSE = Σ over clusters Ci of Σ over points x in Ci of dist(ci, x)² Week8_Mon @17
Properties of *Ratio* Attribute
Natural Zero & Distance is Meaningful & Order & Distinctness (* & /)
Decision Tree Structure
Nodes = categories (attribute tests) Branches = options (split into exactly 2) Leaves = specific examples/class labels from the training set *Continuous attributes (such as income) require choosing a splitting point (such as 80k)
Types of *Attributes* of Data
Nominal (ex: ID numbers, eye color) Ordinal (ex: height in {tall, medium, short}) Interval (ex: calendar dates) Ratio (ex: temperature)
Max Itemset (Use header table)
None of its immediate supersets are frequent
Closed Itemset (Use header table)
None of its immediate supersets has the same support as the itemset
PCA
Normalize the input data Compute k orthogonal vectors Each data point is a linear combination of the k vectors Sort the vectors in decreasing 'significance' or strength Since they are sorted, the size of the data can be reduced by eliminating the *weakest* components (Works for numerical data only)
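A NumPy sketch of those steps on random data (a real project would more likely call sklearn.decomposition.PCA):

import numpy as np

X = np.random.randn(100, 5)                  # made-up numeric data
Xn = (X - X.mean(axis=0)) / X.std(axis=0)    # normalize the input data

cov = np.cov(Xn, rowvar=False)               # covariance between attributes
eigvals, eigvecs = np.linalg.eigh(cov)       # orthogonal eigenvectors

order = np.argsort(eigvals)[::-1]            # sort by decreasing 'significance'
k = 2
components = eigvecs[:, order[:k]]           # keep the strongest k, drop the weakest
X_reduced = Xn @ components                  # each row = coefficients of the k vectors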
Convertible
Not Monotonic, Anti-monotonic, or Succinct, but can become one of them if items in transaction are properly ordered
Hierarchical Clustering: Time & Space
O(N³) Time O(N²) Space
Bayes Classifier
Compare P(Xd | yes) • P(yes) with P(Xd | no) • P(no) and predict the class with the larger product
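A tiny sketch of the comparison with made-up (already estimated) probabilities for one record X = (x1, x2):

# Sketch: naive Bayes decision for one record; all probabilities are hypothetical.
p_yes, p_no = 0.6, 0.4
p_x_given_yes = 0.5 * 0.2     # P(x1|yes) * P(x2|yes), conditional-independence assumption
p_x_given_no  = 0.3 * 0.4     # P(x1|no)  * P(x2|no)

score_yes = p_x_given_yes * p_yes    # 0.06
score_no  = p_x_given_no  * p_no     # 0.048
print("yes" if score_yes > score_no else "no")   # predict the class with the larger product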
Data Visualization Categories
Pictures, Symbols, Colors, Words, Dimensions
Used to estimate probabilities for continuous attributes (without discretization)
Probability Density Estimation
Decision Tree Pros/Cons
Pros: - Inexpensive to construct - Fast - Can handle irrelevant/redundant attributes Cons: - Prefers more discriminating attributes - Outputs only hard 0/1 decisions
DBSCAN: EPS
Radius of Neighborhood
Bootstrap
Random sample with replacement; records that aren't sampled into the training set form the test set ~2/3 (more precisely, ~63.2%) of distinct records end up in the bootstrap sample
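A quick sketch showing where the ~2/3 (1 - 1/e ≈ 63.2%) figure comes from:

import random

n = 10_000
sample = [random.randrange(n) for _ in range(n)]   # draw n records with replacement
print(len(set(sample)) / n)                        # ~0.632 of distinct records are in the bootstrap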
Continuous Attribute
Real Numbers, Floating Point Variables
High Impurity
Roughly evenly split data (less useful)
MAX Clustering
Cluster distance = the longest pairwise distance *between* the two clusters (merge the pair with the smallest such distance) Less susceptible to noise & outliers Tends to split large clusters
Low Impurity
Skewed data (more useful)
Count Matrix
Used to evaluate candidate split positions for a continuous attribute (the 80k example from the first slide): sort the values, then move left to right across the split positions, shifting each record's class counts from one side of the split to the other as you pass it
Divisive Hierarchical Clustering - DIANA
Split groups from top to bottom Calculate pairs necessary: ((2^n)-2)/2
Stratified Sampling
Split the data into several partitions; then draw random samples from each partition
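A usage sketch with scikit-learn's stratify option (assumes scikit-learn is available; the labels are made up):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)     # imbalanced classes = the partitions (strata)

X_sample, _, y_sample, _ = train_test_split(X, y, train_size=0.3,
                                            stratify=y, random_state=0)
print(np.bincount(y_sample))          # the 80/20 class ratio is preserved in the sample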
FP-Tree Mining
Suffix Pruning
Multi-level Association Rules
Supp(parent) ≤ Supp(child1) + Supp(child2) If there is only one child, Supp(child) = Supp(parent) & Conf(child) = Conf(parent)
Confusion Matrix: Specificity
TN / (TN + FP)
Confusion Matrix: Recall (= Sensitivity)
TP / (TP + FN)
Confusion Matrix: Precision
TP / (TP + FP) *Precision and recall typically trade off: improving one tends to lower the other
Curse of Dimensionality
The # of possible values increases exponentially with the # of dimensions The number of possible value combinations of N nominal attributes with M categories each is M^N.
Relative Support (FP)
The fraction of transactions that contain X in the database
Classification
The process of assigning objects to predefined groups (classes) based on their similarities [Can be done with a decision tree] Two phases: (1) Training & (2) Test
Discretization: Interval Width Drawback
Too wide = lose interesting patterns Too narrow = break apart patterns that should be grouped
Dendrogram
Visual graph of hierarchical clustering, (ex: food-chain)
Correlation vs Cosine vs Euclidean Distance
Week 2 - Monday - Slide 33
Smooth Binning
Week 3 - Monday - Slide 11
Statistics-based Method
Week 5 - Wednesday - Slide 16
X² (Chi-Square) Test
X² = Σ [(Observed - Expected)² / Expected] Larger X² = more likely the attributes are related Also if supp(Basketball&Cereal) is lower than its expected value, the relation between "play basketball" and "eat cereal" is negative.
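A small sketch computing X² for a 2x2 basketball/cereal-style table with made-up counts:

observed = [[200, 50],     # rows: basketball yes/no; columns: cereal yes/no
            [150, 100]]

row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
total = sum(row)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row[i] * col[j] / total
        chi2 += (observed[i][j] - expected) ** 2 / expected
print(chi2)    # larger X² -> the attributes are more likely to be related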
Ordinal Variable Transform into range [0,1]
Zi = (ri - 1) / (Mi - 1), where ri is the value's rank and Mi is the number of ordered states
Measure Cluster Quality: Intrinsic a(o) & b(o) & s(o)
a(o) & b(o) represent distances and are always positive; silhouette s(o) = (b(o) - a(o)) / max(a(o), b(o)); we prefer larger values of s(o)
Cosine Similarity
cos (x,y) = (x • y) / (||x|| ||y||) • = vector dot product ||x|| = length of vector x = √(x₁² + x₂² + ...)
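A tiny sketch with arbitrary vectors:

import math

x = [3, 2, 0, 5]
y = [1, 0, 0, 0]

dot  = sum(a * b for a, b in zip(x, y))
lenx = math.sqrt(sum(a * a for a in x))
leny = math.sqrt(sum(b * b for b in y))
print(dot / (lenx * leny))    # 3 / (sqrt(38) * 1) ~ 0.49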
Minkowski Distance d(x,y)
d(x,y) = ( |x₁ - y₁|^h + |x₂ - y₂|^h + ... + |xₙ - yₙ|^h )^(1/h)
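A short sketch with arbitrary points (h = 1 gives Manhattan, h = 2 gives Euclidean distance):

def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x, y = (1, 7), (4, 3)
print(minkowski(x, y, 1))   # 7.0 (Manhattan)
print(minkowski(x, y, 2))   # 5.0 (Euclidean)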
Minkowski Properties
d(x,y) > 0 for x ≠ y, and d(x,x) = 0 → Positive Definiteness d(x,y) = d(y,x) → Symmetry d(x,y) ≤ d(x,u) + d(u,y) → Triangle Inequality A distance that satisfies these properties is a metric
Equal-Width Binning
Divide the range into N intervals of equal size: if A and B are the lowest and highest values of the attribute, the interval width is W = (B - A) / N Outliers may dominate the presentation and skewed data is not handled well
Equal-Depth Binning
divides the range into N intervals, each containing roughly the same number of samples
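A NumPy sketch contrasting the two binning schemes on a small made-up value list:

import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

width_edges = np.linspace(values.min(), values.max(), N + 1)   # equal-width: width (B - A) / N
depth_edges = np.quantile(values, [0, 1/3, 2/3, 1])            # equal-depth: ~same count per bin
print(width_edges)   # [ 4. 14. 24. 34.]
print(depth_edges)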
Symmetric vs. Skewed Data
mean - mode ≈ 3 x (mean - median) Positive skew (long right tail): {mode, median, mean} Negative skew (long left tail): {mean, median, mode}
Lift(X → Y) & Lift(X → ~Y)
Compare p(X U Y) / p(X)p(Y) with p(X U ~Y) / p(X)p(~Y)
Normalization Types: Min-Max
v' = ((v - min) / (max - min)) (max' - min') + min'
Normalization Types: Z-Score
v' = (v - μ) / σ
Normalization Types: Decimal Scaling
v' = v / 10^j where j is the smallest integer such that max(|v'|) < 1
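A sketch applying the three formulas to one hypothetical value (the numbers resemble the classic income example but are illustrative only):

v, vmin, vmax = 73_600.0, 12_000.0, 98_000.0
new_min, new_max = 0.0, 1.0
mu, sigma = 54_000.0, 16_000.0

min_max = (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min   # ~0.716
z_score = (v - mu) / sigma                                             # 1.225
decimal = v / 10 ** 5     # j = 5 is the smallest j with max(|v'|) < 1
print(min_max, z_score, decimal)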
K-itemset
An itemset containing k items; ex: {Ketchup, Orange Fanta} is a 2-itemset