CS 470 - Midterms 1 & 2


Distance Between Means/Centroids Clustering

A more practical and efficient version of AVERAGE Clustering

AUC

Area Under Curve

Properties of *Ordinal* Attribute

Order & Distinctness (< & >)

DBSCAN: Noise Point

!CorePoint & !BorderPoint

DBSCAN: Border Point

!CorePoint but within Eps of a CorePoint

Conf(X -> Y)

# Transactions Containing X & Y / # Transactions Containing X

Supp( X U Y)

# Transactions Containing X & Y / Total # Transactions
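
A minimal Python sketch of the two formulas above; the transaction list and function name are made up for illustration:

```python
def support_and_confidence(transactions, X, Y):
    X, Y = set(X), set(Y)
    n_x  = sum(1 for t in transactions if X <= set(t))          # transactions containing X
    n_xy = sum(1 for t in transactions if (X | Y) <= set(t))    # transactions containing X & Y
    supp = n_xy / len(transactions)   # Supp(X U Y) = (# with X & Y) / (total # transactions)
    conf = n_xy / n_x                 # Conf(X -> Y) = (# with X & Y) / (# with X)
    return supp, conf

db = [["bread", "milk"], ["bread", "diaper", "beer"], ["milk", "diaper", "beer"],
      ["bread", "milk", "diaper", "beer"], ["bread", "milk", "diaper"]]
print(support_and_confidence(db, X=["milk", "diaper"], Y=["beer"]))   # (0.4, 0.666...)
```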

Proximity Measure for Nominal Attributes SMC(i,j)

# attributes where i & j match / # attributes

Confusion Matrix: Error

(FP+FN)/All

Confusion Matrix: Accuracy

(TP+TN)/All

K-means limitations

- Need to specify the number of seeds (k) - Not good at non-globular shapes or outliers (ex: the last boba ball)

Data mining tasks

1. Clustering (ex: subdivide a market into subsets of customers)
2. Association Rules (ex: "Hurricane hit Miami" → "Florida flooding")
3. Predictive Modeling
4. Anomaly Detection

FP-Tree Construction

1. Scan the database (DB) once to find frequent 1-itemsets
2. Sort frequent items in descending order: Header Table
3. Scan the DB again, constructing the FP-tree by the prefix method from the original transaction table
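
A minimal Python sketch of the three construction steps, assuming transactions are lists of item names and representing each tree node as a `[count, children]` pair (all names here are illustrative, not the course's code):

```python
from collections import Counter

def build_fp_tree(transactions, min_sup):
    # Step 1: scan the DB once to find frequent 1-itemsets
    counts = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    # Step 2: header-table order = items sorted by descending frequency
    rank = {i: r for r, i in enumerate(sorted(freq, key=freq.get, reverse=True))}
    # Step 3: scan the DB again, inserting each transaction's frequent items (in header order)
    # as a prefix path, incrementing counts along shared prefixes
    root = {}                                            # item -> [count, children]
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in freq), key=rank.get):
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1
            node = entry[1]
    return root, freq

tree, freq = build_fp_tree([["a", "b"], ["b", "c", "d"], ["a", "b", "d"]], min_sup=2)
print(freq)   # e.g. {'b': 3, 'a': 2, 'd': 2} (order may vary)
```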

Why data mining?

1) explosive growth of data 2) Drowning in data, starving for knowledge 3) Hardware 4) Software

Confusion Matrix: F-measure

2 x precision x recall / (precision + recall)

Why preprocess data?

Accuracy, Completeness, Consistency, Timeliness, Believability, Interpretability

More Efficient: Agglomerative or Divisive

Agglomerative (AGNES)

Frequent Itemsets

An itemset is frequent if its support is no less than a minimum support threshold

Downward Closure

Any subset of a frequent itemset must be frequent

Best Attribute for Splitting (based on impurity change)

Δ = Impurity(before split) − Σⱼ (|Dⱼ| / |D|) × Impurity(Dⱼ), where Dⱼ = j-th partition after the split and |Dⱼ| / |D| is its weight *Gini may improve even when classification error stays the same
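
A minimal Python sketch of this computation using Gini as the impurity measure; the class counts and the two-way split below are hypothetical:

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def impurity_change(parent_counts, partitions):
    # Gini(before split) - sum_j (|Dj| / |D|) * Gini(Dj)
    n = sum(parent_counts)
    weighted_after = sum(sum(p) / n * gini(p) for p in partitions)
    return gini(parent_counts) - weighted_after

# hypothetical parent node with 10 yes / 10 no, split into two partitions
print(impurity_change([10, 10], [[8, 2], [2, 8]]))   # 0.5 - 0.32 = 0.18
```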

AdaBoost

Boosting (w/ weights) with a per-round error limit of 50%; if a round's error is greater, scrap that round

Bagging Algorithm

Build Multiple Classifiers w/ Bootstrapping w/o changing weights

Boosting

Build Multiple Classifiers where Weights change to focus on misclassified records - Boosting rounds can complement each other

MIN Clustering

Can handle non-globular (arbitrary) shapes But sensitive to noise & outliers

How to handle 0s in Bayes

Coin Flip Laplace Estimate

Agglomerative Hierarchical Clustering - AGNES

Combine groups from bottom to top; pairs to calculate: n(n-1)/2

What viewpoints motivate data mining?

Commercial viewpoint, Scientific viewpoint, & Society viewpoint

AVERAGE Clustering

Compromise between MIN & MAX

Rule Generation (X → Y)

Conf(X → Y) = Supp(X ∪ Y) / Supp(X). If the # of items in a frequent itemset is k, then there are 2^k - 2 candidate association rules.
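
A minimal Python sketch of enumerating the 2^k - 2 candidate rules from one frequent itemset (the itemset used here is hypothetical):

```python
from itertools import combinations

def candidate_rules(itemset):
    # every non-empty proper subset X yields a rule X -> (itemset \ X)
    items = set(itemset)
    for r in range(1, len(items)):
        for lhs in combinations(sorted(items), r):
            yield set(lhs), items - set(lhs)

rules = list(candidate_rules({"A", "B", "C"}))
print(len(rules))   # 2^3 - 2 = 6 candidate rules
```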

Strongly Convertible

Convertible both when items are ordered in ascending & descending order

Discrete Attribute

Countably Infinite or Finite (Such as Binary)

Major Tasks in Data Preprocessing

Data Cleaning (an iterative process), Data Integration, Data Reduction, Data Transformation & Discretization

Aggregation Purpose

Data Reduction, Change of Scale, More "stable" data

Knowledge Discovery from Data (KDD) Process

Databases → Data Integration & Cleaning → Data Warehouse → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge

Data Reduction Strategies

Dimensionality Reduction (ex: remove unimportant attributes) Numerosity Reduction (ex: clustering or regression) Data Compression

Properties of *Interval* Attribute

Distance is Meaningful & Order & Distinctness (+ & -)

Properties of *Nominal* Attribute

Distinctness (= & ≠)

Quantile Plot

Each value xi is paired with fi indicating that 100 fi% of data are ≤ xi

Evolution of sciences

Empirical (<1600s) → Theoretical (1600-1950s) → Computational (1950-1990s) → Data (1990-now)

Succinct

Explicitly and precisely determinable if itemset contains some specific items

Data Mining Definition

Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

k-means clustering

Find groups of points that are close to each other but far from points in other groups - Each cluster is defined entirely and only by its center (Straws into Milkshake)
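
A minimal NumPy sketch of the idea, assuming points are rows of a 2-D array; the initialization and sample data are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # must specify k seeds up front
    for _ in range(iters):
        # assign each point to its closest center
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        # each cluster is defined entirely by its center: recompute it as the mean of its points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
print(kmeans(X, k=2))
```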

Clustering

Finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups

Density Based Clustering: DBSCAN Algorithm

Good with noise and clusters of different shapes & sizes Bad when densities vary significantly
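
A minimal usage sketch with scikit-learn's DBSCAN (the data and parameter values are made up; `min_samples` plays the role of MinPts and `eps` is the neighborhood radius):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.2],
              [8.0, 8.0], [8.1, 7.9], [8.0, 8.2],
              [50.0, 50.0]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # two clusters; the isolated point is labeled -1 (noise)
```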

Quantile-Quantile (q-q) Plot

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another

DBSCAN: Core Point

Has at least MinPts neighbors within its Eps-neighborhood

Proximity: Similarity

How alike two objects are Range [0,1] Higher value = more alike

Proximity: Dissimilarity

How different two objects are Minimum is often 0, unclear maximum Lower = more alike

Monotonic

If C is satisfied, no need to check C again in further mining.

Anti-monotonic

If constraint C is violated, further mining can be terminated

Types of Data Cleaning

Incomplete, Noisy, Inconsistent, Intentional

Gain Ratio

Info Gain / Split Info

Correlation Invariance

Invariant to *scaling & translation*

Discretization: Time Complexity

k-1 intervals => O(k²) new items. Improve efficiency using *max support*: if an interval is frequent, then all intervals that contain that interval must also be frequent.

Types of Constraints in Pattern Mining

Knowledge, Data, Dimension, Pattern, Interestingness

Apriori Algorithm

Let k=1. Generate frequent itemsets of length 1. Repeat until no new frequent itemsets are identified:
- Generate k+1 candidate itemsets from length-k frequent itemsets
- *Prune* candidate itemsets containing subsets of length k that are *infrequent*
- Count the support of each candidate by scanning the transaction table
- Eliminate candidates that are infrequent, leaving only those that are frequent
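
A minimal Python sketch of the loop above; the transaction data and function names are illustrative, and candidate generation here is a simple join rather than the optimized Fk-1 x Fk-1 prefix merge:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]

    def support_count(candidates):
        # count support by scanning the transaction table, keep only frequent candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    # k = 1: frequent 1-itemsets
    frequent = support_count({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # generate (k+1)-candidates from frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # prune candidates that contain an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = support_count(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

db = [["bread", "milk"], ["bread", "diaper", "beer"], ["milk", "diaper", "beer"],
      ["bread", "milk", "diaper", "beer"], ["bread", "milk", "diaper"]]
print(apriori(db, min_sup=3))
```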

Estimate Median with Grouped Data (Needed for Final but !Midterm)

Locate the group containing the median.
Percentile = [ (frequency of all data) / 2 − (frequency of all groups below the group containing the median) ] / frequency of the group containing the median
Median = minimum of the group w/ the median + (range of the group w/ the median × percentile)
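
A minimal Python sketch of the estimate, assuming each group is given as (lower bound, upper bound, frequency); the groups below are hypothetical:

```python
def grouped_median(groups):
    # groups: list of (lower_bound, upper_bound, frequency), sorted by bound
    total = sum(f for _, _, f in groups)
    below = 0
    for low, high, f in groups:
        if below + f >= total / 2:                      # located the group containing the median
            percentile = (total / 2 - below) / f
            return low + (high - low) * percentile      # min of group + range of group * percentile
        below += f

print(grouped_median([(0, 10, 5), (10, 20, 12), (20, 30, 8), (30, 40, 5)]))   # 18.33...
```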

Attribute Creation

Mapping to New Space, Attribute Construction, Attribute Extraction

Apriori: Candidate Generation Fk-1 x Fk-1 Method

Merge two frequent (k-1)-itemsets if their first (k-2) items (prefix) are identical

Major issues in Data Mining

Methodology, user interaction, efficiency & scalability, diversity of data types, & society + bias

DBSCAN: MinPts

MinPts = k nearest neighbors Lower MinPts → Less likely to be a noise point

Partitioning Clusters Equation

Minimize the sum of squared distances within the k clusters: E = Σᵢ₌₁ᵏ Σ_{p∈Cᵢ} dist(p, cᵢ)² (Week8_Mon @17)

Properties of *Ratio* Attribute

Natural Zero & Distance is Meaningful & Order & Distinctness (* & /)

Decision Tree Structure

Nodes = Categories; Branches = Options (split in exactly 2); Leaves = specific examples from the training set *Continuous attributes (such as income) require choosing a splitting point (such as 80k)

Types of *Attributes* of Data

Nominal (ex: ID numbers, eye color) Ordinal (ex: height in {tall, medium, short}) Interval (ex: calendar dates) Ratio (ex: temperature in Kelvin)

Max Itemset (Use header table)

None of its immediate supersets are frequent

Closed Itemset (Use header table)

None of its immediate supersets has the same support as the itemset

PCA

Normalize input data Compute k orthogonal vectors Each data is a linear combination of the k vectors Sort vectors in decreasing 'significance' or strength Since sorted, size of data can be reduced by eliminating *weakest* components (Works for numerical data only)
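
A minimal NumPy sketch of these steps; eigendecomposition of the covariance matrix is one common way to get the orthogonal vectors, and the data shape is illustrative:

```python
import numpy as np

def pca(X, k):
    X = X - X.mean(axis=0)                      # normalize (center) the input data
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # the k orthogonal vectors come from these columns
    order = np.argsort(eigvals)[::-1]           # sort by decreasing 'significance' (variance)
    components = eigvecs[:, order[:k]]          # keep the k strongest, eliminate the weakest
    return X @ components                       # each row = linear combination of the k vectors

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, k=2).shape)    # (100, 2)
```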

Convertible

Not Monotonic, Anti-monotonic, or Succinct, but can become one of them if items in transaction are properly ordered

Hierarchical Clustering: Time & Space

O(N³) Time O(N²) Space

Bayes Classifier

Compare P(Xd | yes) • P(yes) vs. P(Xd | no) • P(no); predict the class with the larger product

Data Visualization Categories

Pictures Symbols Colors Words Dimensions

Used to estimate probabilities for continuous attributes (without discretization)

Probability Density Estimation

Decision Tree Pros/Cons

Pros: - Inexpensive to construct - Fast - Can handle independent & redundant attributes Cons: - Prefers more discriminating attributes - Only outputs 0 or 1

DBSCAN: EPS

Radius of Neighborhood

Bootstrap

Random Sample with replacement. Records that aren't sampled into the training set form the test set; ~2/3 of the records end up in the bootstrap sample.
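
A quick NumPy sketch of why roughly 2/3 of the records land in the bootstrap sample (the record count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
idx = rng.integers(0, n, size=n)              # random sample with replacement
in_bag = np.unique(idx).size / n              # fraction of records that made it into the sample
print(round(in_bag, 3))                       # ~0.632, i.e. ~2/3; the rest form the test set
```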

Continuous Attribute

Real Numbers, Floating Point Variables

High Impurity

Roughly evenly split data (less useful)

MAX Clustering

Cluster distance = the longest distance *between* points of the two clusters; merging the pair with the shortest such distance keeps the longest *within*-cluster distance small Better with noise & outliers Tends to split large clusters

Low Impurity

Skewed data (more useful)

Count Matrix

Split positions are based on the 80k (income) example from the first slide. Move the candidate split left to right across the sorted values along the top, updating the class counts in the affected row as each value shifts sides of the split.

Divisive Hierarchical Clustering - DIANA

Split groups from top to bottom Calculate pairs necessary: ((2^n)-2)/2

Stratified Sampling

Split the data into several partitions; then draw random samples from each partition

FP-Tree Mining

Suffix Pruning

Multi-level Association Rules

Supp(parent) ≥ Supp(child1) + Supp(child2). If there is only one child, Supp(child) = Supp(parent) & Conf(child) = Conf(parent)

Confusion Matrix: Specificity

TN / (TN + FP)

Confusion Matrix: Recall (= Sensitivity)

TP / (TP + FN)

Confusion Matrix: Precision

TP / (TP + FP) *Precision and recall typically trade off against each other
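
A small Python helper tying together the confusion-matrix cards above (the counts passed in are hypothetical):

```python
def confusion_metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    accuracy    = (tp + tn) / total
    error       = (fp + fn) / total
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)              # = sensitivity
    specificity = tn / (tn + fp)
    f_measure   = 2 * precision * recall / (precision + recall)
    return accuracy, error, precision, recall, specificity, f_measure

print(confusion_metrics(tp=40, tn=45, fp=5, fn=10))
```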

Curse of Dimensionality

The # of possible values increases exponentially against the # of dimensions The number of possible values of N nominal attributes with M categories is M^N.

Relative Support (FP)

The fraction of transactions that contain X in the database

Classification

The process of assigning objects to predefined classes based on their attributes. [Can be done with, e.g., a decision tree] Two phases: (1) Training & (2) Test

Discretization: Interval Width Drawback

Too wide = lose interesting patterns Too narrow = break apart patterns that should be grouped

Dendrogram

Visual graph of hierarchical clustering (ex: food-chain)

Correlation vs Cosine vs Euclidean Distance

Week 2 - Monday - Slide 33

Smooth Binning

Week 3 - Monday - Slide 11

Statistics-based Method

Week 5 - Wednesday - Slide 16

X² (Chi-Square) Test

X² = Σ [(Observed − Expected)² / Expected]. Larger X² = more likely the attributes are related. Also, if supp(Basketball & Cereal) is lower than its expected value, the relation between "play basketball" and "eat cereal" is negative.
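
A minimal Python sketch for a contingency table, computing expected counts from the row/column totals; the observed counts below are hypothetical:

```python
def chi_square(table):
    # table: rows x cols of observed counts
    total = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            x2 += (observed - expected) ** 2 / expected
    return x2

# hypothetical counts: rows = play basketball yes/no, cols = eat cereal yes/no
print(chi_square([[2000, 1750], [250, 1000]]))   # larger value => more likely related
```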

Ordinal Variable Transform into range [0,1]

zᵢ = (rᵢ - 1) / (Mᵢ - 1)

Measure Cluster Quality: Instrinsic a(o) & b(o) & s(o)

a(o) = average distance from o to the other objects in its cluster; b(o) = minimum average distance from o to the objects in another cluster; both are distances and will always be positive. s(o) = (b(o) − a(o)) / max(a(o), b(o)); we prefer larger values of s(o)

Cosine Similarity

cos(x, y) = (x • y) / (||x|| ||y||) • = vector dot product ||x|| = length of vector x = √(x₁² + x₂² + ...)

Minkowski Distance d(x,y)

d(x, y) = (|x₁ - y₁|^h + |x₂ - y₂|^h + ... + |xₚ - yₚ|^h)^(1/h)
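
Minimal Python sketches of the two proximity measures above (the vectors are arbitrary examples):

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))                       # x . y
    norm = lambda v: math.sqrt(sum(a * a for a in v))            # ||v||
    return dot / (norm(x) * norm(y))

def minkowski(x, y, h):
    # h = 1 -> Manhattan distance, h = 2 -> Euclidean distance
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

print(cosine_similarity([3, 2, 0, 5], [1, 0, 0, 0]))   # ~0.487
print(minkowski([1, 2], [4, 6], h=2))                  # 5.0
```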

Minkowski Properties

d(x,y) > 0 → Positive Definiteness; d(x,y) = d(y,x) → Symmetry; d(x,y) ≤ d(x,u) + d(u,y) → Triangle Inequality. A distance that satisfies these properties is a metric.

Equal-Width Binning

Divide the range into N intervals of equal size. If A and B are the lowest and highest values of an attribute, then the width of the intervals is (B - A) / N. Outliers may dominate the presentation, and skewed data is not handled well.

Equal-Depth Binning

Divides the range into N intervals, each containing roughly the same number of samples
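
A minimal NumPy sketch of both binning styles on a small example attribute (the data values are arbitrary):

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
N = 3

# Equal-width: interval width = (B - A) / N, where A = min and B = max
width = (data.max() - data.min()) / N
equal_width_bin = np.minimum(((data - data.min()) // width).astype(int), N - 1)

# Equal-depth: each interval gets roughly the same number of samples
equal_depth_bin = np.repeat(np.arange(N), int(np.ceil(len(data) / N)))[:len(data)]

print(equal_width_bin)   # bin index per value under equal-width binning
print(equal_depth_bin)   # bin index per value under equal-depth binning
```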

Symmetric vs. Skewed Data

mean - mode ≈ 3 × (mean - median) Positive skew (right-foot): {mode, median, mean} Negative skew (left-foot): {mean, median, mode}

Lift(X → Y) & Lift(X → ~Y)

Lift(X → Y) = P(X ∪ Y) / (P(X)·P(Y)); compare with Lift(X → ~Y) = P(X ∪ ~Y) / (P(X)·P(~Y))

Normalization Types: Min-Max

v' = ((v - min) / (max - min)) × (max' - min') + min'

Normalization Types: Z-Score

v' = (v - μ) / σ

Normalization Types: Decimal Scaling

v' = v / 10^j where j is the smallest integer such that max(|v'|) < 1
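
A minimal NumPy sketch of the three normalization types above, mapping min-max onto a new range of [0, 1] (the values are arbitrary):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max to [new_min, new_max] = [0, 1]
minmax = (v - v.min()) / (v.max() - v.min()) * (1.0 - 0.0) + 0.0

# Z-score
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10 ** j

print(minmax, zscore, decimal)
```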

K-itemset

{Ketchup, Orange Fanta} is a 2-itemset

