Data Mining Midterm
5 Vs of Data View
- Volume: large scale datasets - Variety: types of data (text, imagery, time series, etc.) - Velocity: data changes over time; how fast data comes in - Veracity: quality; how accurately data represents the actual scenario - Value: what you do with the data must be useful
FP growth vs Apriori
- At low support thresholds, FP-growth is much faster because Apriori must generate and test many candidate itemsets - the difference between the two runtimes is small at high support thresholds because there are very few candidates
Data Warehouse
- a decision support database that is maintained separately from an organization's operational database - a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process
Vertical Data Format
- Horizontal data format: organizes data by transactions - ex:// T1: {A, D, E, F} - Vertical Data format: organizes data by items/itemsets - ex:// t(AD) = {T1, T6, ...} - Vertical data makes counting support very simple - ex:// t(x) = {T1, T2, T3}, t(y) = {T1, T3, T4} then t(xy) = t(x) ∩ t(y) = {T1, T3} - ex:// search engines: - horizontal: organize by docs - vertical: organizing by keyword - "inverted indexing"
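A minimal Python sketch of vertical-format support counting, using hypothetical toy TID lists: the support of an itemset is just the size of the intersection of its items' TID lists.

```python
# Hypothetical vertical layout: each item maps to the set of transaction IDs containing it.
tidlists = {
    "A": {"T1", "T2", "T3", "T6"},
    "D": {"T1", "T3", "T6"},
    "E": {"T2", "T3"},
}

def support(itemset, tidlists):
    """Intersect the TID lists of all items in the itemset and count."""
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids), tids

count, tids = support({"A", "D"}, tidlists)
print(count, sorted(tids))   # 3 ['T1', 'T3', 'T6']
```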
Generating Association Rules
- 3 key metrics when talking about strength of associations: support, confidence, correlation - for each frequent itemset l, generate all nonempty subsets of l - for every nonempty proper subset s of l, output the rule s => (l-s) if support and confidence exceed their thresholds
ex:// X = {1, 2, 5}
Nonempty proper subsets: {1}, {2}, {5}, {1,2}, {1,5}, {2,5}
Rules: {1} => {2,5}, {2} => {1,5}, {5} => {1,2}, {1,2} => {5}, {1,5} => {2}, {2,5} => {1}
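A short sketch of rule generation from one frequent itemset, assuming hypothetical support values (as fractions of all transactions) are already available; since l itself is frequent, only confidence = sup(l)/sup(s) needs to be checked against min_conf.

```python
from itertools import combinations

# Hypothetical support values for a frequent itemset and its subsets.
support = {
    frozenset({1, 2, 5}): 0.2,
    frozenset({1}): 0.6, frozenset({2}): 0.7, frozenset({5}): 0.3,
    frozenset({1, 2}): 0.4, frozenset({1, 5}): 0.25, frozenset({2, 5}): 0.3,
}

def rules_from_itemset(l, support, min_conf):
    """Emit s => (l - s) for every nonempty proper subset s of frequent itemset l
    whose confidence = sup(l) / sup(s) meets min_conf."""
    l = frozenset(l)
    for size in range(1, len(l)):
        for s in map(frozenset, combinations(l, size)):
            conf = support[l] / support[s]
            if conf >= min_conf:
                yield set(s), set(l - s), conf

for lhs, rhs, conf in rules_from_itemset({1, 2, 5}, support, min_conf=0.5):
    print(f"{lhs} => {rhs}  (conf={conf:.2f})")
```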
Measures of Data Quality
- Accuracy: how accurate sensors are, human errors in surveys, etc. - Completeness: dataset containing all info you need? - Consistency: not conflicting data - Timeliness: data recency - Believability: how much do you trust data (study designed well?) - Interpretability: how values were produced, what they mean - Accessibility
Apriori Algorithm
- Apriori property: any subset of a frequent itemset is also frequent - Apriori pruning: if X is infrequent, then any superset of X is pruned - Procedure: 1. scan DB to get frequent 1-itemsets 2. generate candidate (k+1)-itemsets from frequent k-itemsets 3. test (k+1)-itemsets against DB 4. stop when no frequent or candidate itemsets can be generated - Self-joining of k-itemsets to generate (k+1)-itemsets - two k-itemsets are joined if their first (k-1) items are the same - Pruning: remove a candidate if any of its subsets is not frequent
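A minimal, unoptimized Apriori sketch on toy transactions, illustrating the scan / self-join / prune / count loop described above (min_sup is an absolute count here).

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori sketch: transactions is a list of sets, min_sup an absolute count."""
    # First scan: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    all_freq = dict(freq)
    k = 1
    while freq:
        # Self-join: two frequent k-itemsets whose union has k+1 items form a candidate.
        items = list(freq)
        candidates = {a | b for a in items for b in items if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Scan the database to count candidate support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        freq = {c: n for c, n in counts.items() if n >= min_sup}
        all_freq.update(freq)
        k += 1
    return all_freq

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, min_sup=2))
```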
Data Discretization
- Binning - top-down, splitting approach - bin and then smooth by means, medians, etc. - unsupervised (does not use class information) - Histogram - split, unsupervised - assign interval labels - Clustering - split/merge, unsupervised - Entropy-based discretization - split, supervised - divide data such that the entropy of the subsets is minimized - values are discretized based on split points - Interval merging by chi-squared analysis - merge, supervised - merge adjacent intervals with the most similar class distributions - removes/reduces correlations - Intuitive partitioning - 3-4-5 rule: if the most significant digit of the interval range is 3, 6, 7, or 9 then use three intervals; if 2, 4, or 8 then four intervals; if 1, 5, or 10 then five intervals
Classification vs Prediction
- Classification - determines categorical class labels - Prediction - models continuous-valued functions - accuracy based on how close the prediction is
Naïve Bayes
- Classification: choose the class Ci with maximal P(Ci | X) - Naïve assumption: class conditional independence (no dependence between attributes): P(X|Ci) = P(x1 | Ci) * ... * P(xp | Ci) - If Ak is categorical, then P(xk | Ci) can be determined from the training data - If Ak is continuous, assume a Gaussian distribution - Advantages: - easy to compute - good results in many cases - incremental - Disadvantages: - assumption of class conditional independence; dependencies exist in practice
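A small illustrative sketch of naïve Bayes for continuous attributes under the Gaussian assumption, on hypothetical toy data; it picks the class maximizing the log of P(Ci) * prod_k P(xk | Ci).

```python
import numpy as np

def fit(X, y):
    """Estimate, per class: prior P(Ci) and per-attribute Gaussian mean/std."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X),          # prior P(Ci)
                    Xc.mean(axis=0),           # per-attribute means
                    Xc.std(axis=0) + 1e-9)     # per-attribute std devs (avoid /0)
    return model

def predict(model, x):
    def log_posterior(prior, mu, sigma):
        # log P(Ci) + sum_k log N(x_k; mu_k, sigma_k)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((x - mu) / sigma) ** 2)
        return np.log(prior) + log_lik
    return max(model, key=lambda c: log_posterior(*model[c]))

X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 3.5], [3.2, 3.7]])
y = np.array(["neg", "neg", "pos", "pos"])
print(predict(fit(X, y), np.array([3.1, 3.4])))   # -> "pos"
```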
Clustering Analysis
- Cluster: a collection of data objects - object within clusters should be similar - objects in different clusters should be different - unsupervised learning
Data cube cell and ancestor examples
- Consider a data cube with dimensions month, city, and customer_group, and fact sales a = (Jan, * , *, 2800) (1-D cell) b = (Jan, *, Business, 1500) (2-D cell) c = (Jan, Chicago, Business, 45) (3-D cell) - a and b are ancestors of c - c is a descendant of both a and b - b is a parent of c - c is a child of b
AdaBoost
- D: (x1, y1), (x2, y2), ..., (xd, yd) - initial weight of each tuple: 1/d
round i (i = 1, ..., k):
  Di: sample d tuples with replacement from D; Pr(choose tuple j) is proportional to tuple j's weight
  learn Mi from Di, compute its error rate err(Mi) = sum over Di of (w_j * err(xj))
  reduce the weights of correctly classified tuples: w_j = w_j * err(Mi) / (1 - err(Mi))
  normalize the tuple weights so they sum to 1
classification: weighted votes of the k classifiers, with weight(Mi) = log((1 - err(Mi)) / err(Mi))
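A rough sketch of the sampling-based AdaBoost round described above, using scikit-learn decision stumps as a hypothetical base learner and labels in {-1, +1}; illustrative only, not production code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, k=10):
    d = len(X)
    w = np.full(d, 1.0 / d)                 # initial tuple weights 1/d
    models, alphas = [], []
    for _ in range(k):
        idx = np.random.choice(d, size=d, replace=True, p=w)   # sample D_i by weight
        m = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        miss = (m.predict(X) != y)
        err = np.sum(w[miss])               # weighted error rate
        if err == 0 or err >= 0.5:          # skip degenerate rounds in this sketch
            continue
        w[~miss] *= err / (1 - err)         # shrink weights of correctly classified tuples
        w /= w.sum()                        # normalize
        models.append(m)
        alphas.append(np.log((1 - err) / err))   # classifier vote weight
    return models, alphas

def predict(models, alphas, X):
    # Weighted vote of the learned classifiers.
    votes = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(votes)
```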
Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smoothing noisy data, identify or remove outliers, resolve inconsistencies - data integration: integrating multiple data sources - data reduction: dimensionality, numerosity, compression
Data Matrix and Dissimilarity Matrix
- Data matrix - object by attribute - two modes - Dissimilarity Matrix - object by object - one mode
Data Mining Concepts
- Data mining (knowledge discovery from data): extraction of interesting patterns or knowledge from huge amounts of data - interesting: non-trivial, implicit, previously unknown, potentially useful - huge amounts of data: scalability and efficiency must be addressed
Data Mining Views
- Data view: kinds of data to be mined - knowledge view: kinds of knowledge to be discovered - method view: data mining techniques utilized - application view: kinds of applications adapted
Graphical displays
- Data visualization methods allow for qualitative overview, insight, and exploration - Quantile plot: observed value vs quantile - Quantile-Quantile plot: Y vs X where quantiles are denoted on data points - Histogram: - frequency histogram - density histogram - Scatter plot
Information Gain
- Dataset D, m classes C1, C2, ..., Cm - p_i = |C_i,D| / |D| - expected information (entropy) needed to classify D: Info(D) = -sum_i(p_i * log2(p_i)) - information needed to classify D using attribute A with values a1, a2, ..., av: Info_A(D) = sum over j=1..v (|Dj|/|D| * Info(Dj)) - information gain: Gain(A) = Info(D) - Info_A(D) - Gain measures how much entropy was reduced - split on the attribute with the greatest information gain
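A small sketch computing Info(D), Info_A(D), and Gain(A) for one categorical attribute, on made-up labels.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(attribute_values, labels):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj), splitting D by attribute A."""
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    total = entropy(labels)                       # Info(D)
    weighted = 0.0
    for v in np.unique(attribute_values):
        Dj = labels[attribute_values == v]
        weighted += len(Dj) / len(labels) * entropy(Dj)   # Info_A(D)
    return total - weighted

ages = ["youth", "youth", "middle", "senior", "senior", "middle"]
buys = ["no",    "no",    "yes",    "yes",    "no",     "yes"]
print(round(info_gain(ages, buys), 3))   # ~0.667 bits of entropy removed by splitting on age
```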
DENCLUE
- Density-based clustering - center-defined or arbitrarily shaped clusters - uses statistical density functions - major features: - solid mathematical foundation - good for data sets with large amounts of noise - compact description of arbitrarily shaped clusters in high-dimensional data sets - needs a large number of parameters - Influence function: impact of a data point within its neighborhood - overall density: sum of the influence functions of all data points - density attractor: local maximum of the overall density function - clusters determined mathematically by identifying density attractors
Data Warehouse Conceptual Modeling
- Dimensions: Attributes - Facts: Subject/Goal objects - Star Schema: a fact table and a set of dimension tables - Snowflake Schema: a fact table and a hierarchy of dimension tables - Fact Constellation: multiple fact tables share dimension tables
Frequent Patterns and Frequent Pattern Analysis Basic Concepts
- Frequent patterns - set of items - subsequences - substructures - Basic Concepts - Frequent itemset: X = {x1, x2, ..., xk} - Association rule: X=>Y [sup, conf, correlation] - support: probability that a transaction contains X union Y (P(X and Y occur together)) - confidence: conditional probability that a transaction containing X also contains Y (P(Y|X)) - support measures frequency that rule applies, confidence indicates the strength of the rule - minimum support, minimum confidence
Data Cube Materialization
- Full, partial, or no materialization - full materialization: computation of all data cuboids in the lattice defining the data cube - pros: already computed and available - cons: a lot of memory, computationally expensive, may not be necessary - a data cube with n dimensions contains 2^n cuboids
Partitioning Methods
- Given a dataset D and value k, find a partition into k clusters that optimizes the chosen partitioning criterion - often terminates at a local optimum (not global) - heuristic methods - k-means - k-medoids
Types of Outliers
- Global Outliers: outliers that deviate significantly from the rest of the data set (point anomalies) - Contextual Outliers: deviates significantly with respect to a specific context of the object - ex:// it is 85F today, is it an outlier? Depends on time of year and location - Collective Outliers: objects as a whole deviate significantly from the entire data set - ex:// one flight delay in the day may not be an outlier, but 50 delays would be a collective outlier
Attribute Selection Measures
- Information Gain: biased towards tests with many outcomes. It prefers to select attributes with a large number of values. - Gain Ratio: biased towards unbalanced splits. tries to select relatively uniform subsets but not too many of them. - Gini Index: designed for multi valued, equal sized, and pure partitions, not good when number of classes is large because often results in a skewed tree. Generally generate balanced and pure splits
Neural Networks
- Input layer, hidden layers, output layer - input values: - continuous: normalize to [0,1] - discrete: 1 unit per class if more than two classes - Weaknesses: - long training time - params determined empirically - poor interpretability (semantic meaning of input is lost) - Strengths: - high tolerance to noisy data - can classify untrained patterns - well-suited for continuous valued inputs and outputs - success on a wide array of real world data - inherently parallel - rule extraction
Pattern Interestingness
- Interesting patterns are - valid on new/test data with some certainty - novel (new knowledge) - potentially useful - ultimately understandable by humans - objective measures (support, confidence, etc) - subjective measures: vary by domain - completeness, exclusiveness
Classification Evaluation and Model Selection
- K-fold cross validation - Bootstrapping with replacement (repeat k times) - To determine if the difference in error rates between M1 and M2 is statistically significant, perform a t-test
Avoid 0 probability
- Laplacian correction (or laplace estimator) - add 1 to each test case p(term | Class) = (# instances of term in class + 1) / (# total words in class + |V|) p(class) = (# samples from class + 1) / (# total samples + |C|)
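A tiny numeric illustration of the Laplacian correction with hypothetical counts, showing how it avoids a zero probability that would wipe out the whole product P(X|Ci).

```python
# Hypothetical counts: the term never appears in this class's training documents.
term_count_in_class = 0
total_words_in_class = 1000
vocab_size = 500                     # |V|

p_uncorrected = term_count_in_class / total_words_in_class                  # 0.0
p_laplace = (term_count_in_class + 1) / (total_words_in_class + vocab_size)  # small but nonzero
print(p_uncorrected, round(p_laplace, 6))   # 0.0 vs ~0.000667
```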
Closed and Max Patterns
- Mining closed and max patterns helps address the combinatorial number of possible patterns - Closed pattern X: X is closed if there is no super-pattern Y ⊃ X with the same support - Max pattern X: X is maximal if there is no super-pattern Y ⊃ X that is frequent - closed patterns are a lossless compression of frequent patterns (reducing the number of patterns and rules) - ex:// if a 100-itemset is frequent, then every combination of its items is frequent - ex:// DB = {<a1, ..., a100>, <a1, ..., a50>}, min_sup = 50%: frequent patterns: any item or combination of items; closed patterns: {a1,...,a50} (sup=100%) and {a1,...,a100} (sup=50%); max pattern: {a1,...,a100} (sup=50%)
Object similarity/dissimilarity
- Minkowski distance (Lp norm): d(i,j) = (|xi1 - xj1|^p + ... + |xin - xjn|^p)^(1/p) - Euclidean distance (L2 norm): d(i,j) = sqrt((xi1 - xj1)^2 + ... + (xin - xjn)^2) - Manhattan distance (L1 norm): d(i,j) = |xi1 - xj1| + ... + |xin - xjn| - weighted distance - Nominal attributes - method 1: d(i,j) = (p - m)/p where p = total number of attributes, m = number of matching attributes - method 2: view each state as a binary variable, then use a numeric distance - Binary variables contingency table: q = # attributes where obj_i=1 and obj_j=1, r = # where obj_i=1 and obj_j=0, s = # where obj_i=0 and obj_j=1, t = # where obj_i=0 and obj_j=0 - symmetric binary: d(i,j) = (r+s)/(q+r+s+t) - asymmetric binary: d(i,j) = (r+s)/(q+r+s) - Ordinal variables - map each value to its rank: r_if in {1, ..., Mf} - map to range [0, 1]: z_if = (r_if - 1)/(Mf - 1) - ex:// (bronze, silver, gold) = ranks (1,2,3) => (0, 0.5, 1) - Cosine similarity - measures the angle between vectors - often used for text documents - cos(x, y) = similarity(x,y) = (x · y) / (||x|| * ||y||)
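A short sketch of the numeric distance/similarity measures above (Minkowski, Euclidean, Manhattan, cosine), assuming numpy vectors.

```python
import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)      # L2 norm

def manhattan(x, y):
    return minkowski(x, y, 1)      # L1 norm

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 3.0, 2.0])
y = np.array([4.0, 1.0, 0.0])
print(euclidean(x, y), manhattan(x, y), round(cosine_similarity(x, y), 3))
```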
Challenges in Outlier Detection
- Modeling normal and outlier objects effectively - Application specific outlier detection makes it hard to develop universal approaches - Handling noise. Noisy data can blur the lines between normal objects and outliers - Understandability (justification of detection)
OLTP vs OLAP
- OLTP (online transaction processing) - major task of traditional relational DBMS - day to day operations - OLAP (online analytical processing) - major task of data warehouse system - data analysis and decision making
Partial Materialization and Iceberg Cube
- Partial materialization: compute only a subset of the data cube's cuboids - Iceberg cube: a partially materialized cube; compute only the cells of the cube that exceed the minimum support threshold - only a small portion may be "above the water" in a sparse cube - avoids explosive growth of the cube - computes only the portions where the pattern is frequent or significant enough to consider
Major Clustering Methods
- Partitioning methods - construct k partitions, iterative relocation - ex:// k-means, k-medoids, CLARANS - Hierarchical methods - hierarchical decomposition, split/merge - ex:// BIRCH, ROCK, Chameleon - Density-based methods - connectivity and density functions - ex:// DBSCAN, OPTICS, DENCLUE - Grid-based methods - quantize into cells, multi-granularity grid - ex:// STING, CLIQUE, WaveCluster - Model-based methods - hypothesized cluster model, best fit - ex:// EM, COBWEB, SOM - Clustering high-dimensional data - subspace clustering: CLIQUE, PROCLUS - frequent-pattern-based clustering: pCluster - Constraint-based clustering - user-specified or application-oriented constraints - ex:// COD (obstacles), user-constrained clustering, semi-supervised clustering
Classification accuracy measures
- Recognition of Class Ci: (# correct predictions of Ci) / (total # Ci objects) - accuracy: (TP + TN) / (P + N) - Error rate: (FP + FN) / (P+N) - Sensitivity/TPR/Recall: TP/P - Specificity/TNR: TN/N - Precision: TP/(TP + FP) - F1 score: (2*precision*recall) / (precision+recall), conveys balance between precision and recall
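A quick sketch computing these measures from hypothetical raw confusion-matrix counts.

```python
# Hypothetical confusion-matrix counts.
TP, FN, FP, TN = 70, 30, 10, 90
P, N = TP + FN, FP + TN

accuracy    = (TP + TN) / (P + N)
error_rate  = (FP + FN) / (P + N)
sensitivity = TP / P                      # recall / true positive rate
specificity = TN / N                      # true negative rate
precision   = TP / (TP + FP)
f1          = 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy, error_rate, sensitivity, specificity, round(precision, 3), round(f1, 3))
# 0.8 0.2 0.7 0.9 0.875 0.778
```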
Gini Index (CART)
- Reduction in impurity: delta(Gini(A)) = Gini(D) - Gini_A(D) - select attribute with the largest impurity reduction - performs a binary split for each attribute
Typical OLAP Operations
- Roll up (drill up): summarization - Drill-down: reverse roll up - Slice and Dice: project and select (sub cube) - Pivot (rotate) - Drill across: more than one fact table - Drill Through: to the back end relational tables
Bagging
- Training: given a dataset D of d tuples - training set Di: d tuples sampled with replacement - a classifier Mi is learned for Di - Classification: majority vote - Prediction: average of multiple predictions
Mining Association Rules
- Two step process - find all frequent itemsets (w/ min support) - generate strong association rules from the frequent itemsets (min_sup, min_conf) - a long pattern contains a combinatorial number of subpatterns ex:// with 100 items, you could generate (2^100) - 1 itemsets = 100 choose 1 + 100 choose 2 + ...
Fuzzy Clusters
- a fuzzy cluster is a fuzzy set of objects - a fuzzy set S is a subset of X that allows each object in X to have some membership degree between 0 and 1 - also known as soft clustering because it allows an object to belong to more than one cluster
Confusion Matrix
- actual class on rows, predicted class on columns
              predicted +   predicted -
actual +      TP            FN
actual -      FP            TN
Constraints in data mining
- allow for interactive process - enable more efficient mining and filtering for patterns of interest by either pruning the pattern search space or pruning the data search space - knowledge type constraint (ex:// patterns, itemsets, sequences, classification, etc.) - data constraint (starting by selecting a subset of data to use) - dimension/level constraint - interestingness constraint (ex:// min_sup, min_conf) - rule (or pattern) constraint - metarules (rule templates) - # attributes, attribute values, etc.
Clustering Based Outlier Detection
- an object is an outlier if - does not belong to any cluster - far from closest cluster - belongs to a small or sparse cluster - assumes normal data objects belong to large dense clusters and that outliers belong to small sparse clusters (or belong to no clusters) - can use training sets to find patterns in normal data - strengths - no labels required - support many data types - clusters are summaries of data - only need to compare object to cluster (fast) - weaknesses - effectiveness depends on clustering method/cluster quality - clustering is expensive - to reduce cost: fixed width clustering
Correlation Analysis
- analysis of the degree to which changes in one variable are associated with changes in another - correlation coefficient (numeric data): > 0 => positively correlated; = 0 => independent; < 0 => negatively correlated - chi-squared test (categorical data): compare against the chi-squared table value at the chosen significance threshold
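A brief sketch, assuming numpy/scipy are available: Pearson correlation for two numeric attributes and a chi-squared independence test on a hypothetical contingency table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Numeric attributes: Pearson correlation coefficient.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))        # close to +1 => strongly positively correlated

# Categorical attributes: hypothetical contingency table (rows = A values, columns = B values).
table = np.array([[250,  200],
                  [ 50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(table)
print(round(chi2, 1), p_value < 0.05)   # small p-value => reject independence
```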
Probabilistic Model Based Clustering
- assumes that data are generated by a mixture of underlying probability distributions - mixture model: observed data instances drawn independently from multiple clusters - attempts to optimize the fit between data and some mathematical model - find a set C of k probabilistic clusters such that P(Dataset | Clusters) is maximized - goal is to find hidden categories
Statistical Description of Data
- basics: n, min, max - central tendency: mean, median, mode, midrange (= (min + max)/2) - dispersion: quartiles, IQR, variance, std dev
Handling noisy data
- binning - first sort and partition data into bins - then smooth by: bin means, medians, or boundaries - regression: fit data into regression functions - clustering: detect and remove outliers
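A small sketch of equal-depth binning followed by smoothing by bin means, on a toy sorted attribute.

```python
import numpy as np

values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted toy data
n_bins = 3
bins = np.array_split(values, n_bins)                    # equal-depth (equal-frequency) partitions

# Smoothing by bin means: every value in a bin is replaced by the bin's mean.
smoothed_by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed_by_means)
# bin 1: [4, 8, 15]   -> [9, 9, 9]
# bin 2: [21, 21, 24] -> [22, 22, 22]
# bin 3: [25, 28, 34] -> [29, 29, 29]
```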
Bottleneck in frequent itemset mining
- bottleneck: candidate generation and test (counting support values) - fp growth avoids candidate generation
Rule properties
- can be useful when dealing with certain itemset patterns; can be used in the pruning process - Antimonotonicity: if an itemset S violates a constraint, so does any of its supersets (allows for pruning at each iteration of Apriori-style algorithms) - Monotonicity: if an itemset S satisfies the constraint, so does any of its supersets (can be used for guaranteed satisfaction) - Convertible constraints: constraints that can be converted to antimonotonic or monotonic by properly ordering items
support(S) >= v: antimonotonic: yes, monotonic: no
support(S) <= v: antimonotonic: no, monotonic: yes
all_conf(S) >= v: antimonotonic: yes, monotonic: no
all_conf(S) <= v: antimonotonic: no, monotonic: yes
Concept hierarchy generation
- categorical data - partial/total ordering of attributes - ex:// street < city < state < country - can be generated automatically from the number of distinct values (more distinct values => lower level) - note: fewer distinct values does not always mean a higher level - counter-example: weekday, month, quarter, year
Frequent Pattern Mining Challenges
- challenges: - multiple scans of whole data set - huge number of candidates - tedious support counting for candidates - Improving Apriori: - reduce data scans - reduce number of candidates - facilitate support counting - Techniques: - Partitioning of data set - sampling for frequent patterns - transaction reduction - reduce # candidates with hashing - dynamic itemset counting - hash trees for candidate support counting - vertical data format - fp growth
Data objects and attributes
- data set: a set of data objects - data object: an entity with certain attributes - attributes/dimensions/features/variables
Data View - variety
- database oriented (dbms, data warehouse, transactional db) - sequence, stream, temporal, time-series data. Notes: sequence data only care about order not time. stream data is continuous - text, multimedia, web data - graph, social networks data
DBSCAN
- density based clustering - epsilon neighborhood of p: the epsilon radius around point p - core object p: at least MinPts points in the epsilon neighborhood of p - epsilon and MinPts are the two main params to determine threshold for a dense neighborhood - For a core object q and an object p, p is directly density reachable from q if p is in the epsilon neighborhood of q - p is density reachable from q if there is a chain of directly density reachable objects from q to p - p1 and p2 are density connected if there is an object q such that p1 and p2 are density reachable from q - we can use the closure of density connectedness to find connected dense regions as clusters
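A minimal usage sketch with scikit-learn's DBSCAN on toy 2-D points; here eps plays the role of the epsilon radius and min_samples the role of MinPts.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],     # dense region 1
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],     # dense region 2
              [9.0, 0.0]])                             # isolated point

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # two cluster labels (0 and 1); -1 marks the noise point
```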
Info Gain for continuous values attribute A
- determine best split point for A - sort values of A in ascending order - consider the midpoint of adjacent values - pick the midpoint with minimum Info_A(D) - split: - D1: A <= split point - D2: A > split point
Probabilistic Clustering
- each object is assigned a probability of belonging to a cluster - each category represented by a probability density function over the data space - mixture model: observed data instances drawn independently from multiple clusters
FP growth
- finds frequent itemsets without candidate generation - guaranteed to find all patterns - grow very long patterns from short ones using local frequent itemsets - ex:// abc is frequent - get all transactions with abc: DB | abc - if d is a local frequent itemset on DB|abc, then abcd is a frequent itemset
Statistical Based Outlier Detection
- make the assumption of data normality. Assumes that normal data objects are generated by a statistical model - effectiveness depends on whether the assumptions made for the statistical model are true
EM: Expectation Maximization
- framework for approximating maximum likelihood estimates of parameters in statistical models - EM algorithms can be used to compute fuzzy clustering and probabilistic model-based clustering - iterates until the clustering cannot be improved or a stopping condition is met - each iteration consists of two steps: - expectation step: assigns objects to clusters according to the current fuzzy clustering or the current parameters of the probabilistic clusters - maximization step: finds the new clustering or parameters that minimize the SSE in fuzzy clustering or maximize the expected likelihood in probabilistic model-based clustering - heuristic - extension of k-means
Multi-way array aggregation
- full cube computation - array based "bottom up" algorithm - simultaneous aggregation on multiple dimensions - not for high dimensions - partition array into chunks, where a chunk is a subcube small enough to fit into memory - cannot do apriori pruning because it is a bottom up computation
Decision Tree Induction
- greedy algorithm - top down, divide and conquer - attribute selection - attribute split - stopping conditions: - all samples belong to same class - no remaining attributes: majority voting - no samples left
CLIQUE
- grid-based clustering method for finding density-based clusters in subspaces - discretizes the data space through a grid - grows from single dimensions to higher dimensions - density drives the merging of dimensions - cluster: a set of dense units - uses the Apriori principle: a k-dimensional cell c (k > 1) can have at least l points only if every (k-1)-dimensional projection of c, which is a cell in a (k-1)-dimensional subspace, has at least l points
Hierarchical Clustering
- groups data objects into a tree of clusters - spherical clusters - agglomerative: bottom-up merging - divisive: top-down splitting - dendrogram: represents the process of hierarchical clustering; cut the dendrogram at a certain level to get clusters - don't need to specify k (# clusters) - cannot undo previous split/merge decisions - BIRCH: CF-tree, microclusters (phase 1: microclustering (group objects that are similar to each other), phase 2: macroclustering (iterative partitioning method)) - CHAMELEON: dynamic modeling - arbitrary cluster shapes - data set -> sparse graph -> partition graph (microclusters) -> merge partitions
Reducing number of candidates by hashing
- hash itemsets into buckets - if a hash bucket count is below support threshold, the itemsets in that hash bucket are not frequent itemsets - ex:// when scanning each transaction for frequent 1 itemsets, generate all 2-itemsets and hash them into buckets as candidate 2 itemsets
Data Dispersion
- how much numeric data tend to spread - range: max - min - quartiles: Q1 (25th percentile), Q3 (75th percentile) - 5-number summary: min, Q1, Q2 (median), Q3, max - outlier: value more than 1.5*IQR below Q1 or above Q3 - variance: sigma^2 = (1/n) * sum((xi - x_bar)^2) - std dev: sqrt(sigma^2) - boxplot: box = Q1, median, Q3 - normal dist: values within 1, 2, and 3 std deviations of the mean make up 68%, 95%, and 99.7% of the data respectively
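A short sketch of the five-number summary and the 1.5*IQR outlier rule on made-up values.

```python
import numpy as np

x = np.array([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 120])

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 1.5*IQR fences below Q1 / above Q3

print("5-number summary:", x.min(), q1, median, q3, x.max())
print("outliers:", x[(x < lower) | (x > upper)])
```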
Dynamic itemset counting
- if A and D are frequent, start count for AD - if BC, BD, and CD are frequent, start count for BCD - round robin type scanning - allows for potentially reducing number of scans
Transaction reduction
- if a transaction T does not contain any frequent k-itemsets, then for any h > k, no need to check T when searching for frequent h-itemsets - problem: removing transactions may force us to do random access on disk. solution: create a copy of reduced data set on disk or in main memory
Data Reduction
- if you have too much unrelated data, the pattern may be hidden - mining large amounts of data/high-dimensional data takes a long time - goal: a dataset smaller in volume that produces (almost) the same mining results - strategies: - dimensionality reduction - attribute subset selection (forward, backward) - wavelet transform, PCA - numerosity reduction (reducing the number of objects) - parametric and nonparametric models - regression, log-linear models - data cube aggregation (use the smallest representation suitable for the task) - histogram (store bucket averages or sums), clustering (store cluster representations: centroid and diameter), sampling (random, cluster, stratified) - compression (lossy or lossless)
Handling missing data
- ignore tuple - fill in missing value manually - fill it in automatically with - a global constant - the attribute mean - attribute mean of the same class - the most probable value (e.g., regression, bayesian inference, decision tree)
Gain Ratio
- information gain is biased towards attributes with a large number of values - solution: select the attribute with the maximum gain ratio for the split - GainRatio(A) = Gain(A)/SplitInfo(A), where SplitInfo_A(D) = -sum over j (|Dj|/|D| * log2(|Dj|/|D|)) - many small subsets result in a larger SplitInfo value, penalizing such splits
Back Propagation
- initialize weights and biases to small random numbers - used for efficient weight updates - propagate the inputs forward to get activations - back-propagate the error - terminating conditions can help prevent overfitting
Knowledge view
- kinds of knowledge to be discovered - examples: - concept/class description - data characterization (summarization) - data discrimination (contrast) - frequent patterns, itemsets, sequences, structures - association analysis - classification and prediction - cluster analysis - outlier analysis - trend and evolution analysis - market analysis/management - fraud detection and rare events
Correlation Measures for Association Rules
- lift(A, B) = P(A ∪ B)/(P(A)P(B)) - lift = 1 => A and B independent - lift < 1 => A and B negatively correlated/dependent - lift > 1 => A and B positively correlated/dependent - Chi-squared - Null transaction: a transaction containing neither A nor B (the double-negative case) - Null-variant correlation measures: lift and chi-squared (they are influenced by the total number of transactions) - Null-invariant measures: all_conf, max_conf, Kulc, cosine - Imbalance ratio: IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B)) - the higher the IR, the more imbalanced the data - if heavily imbalanced, correlation measures may not be stable - Recommended: use Kulc and report IR
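A small sketch computing lift, all_confidence, Kulczynski, and the imbalance ratio from hypothetical transaction counts.

```python
# Hypothetical transaction counts.
n    = 10000       # total transactions
n_a  = 6000        # transactions containing A
n_b  = 7500        # transactions containing B
n_ab = 4000        # transactions containing both A and B

p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n

lift     = p_ab / (p_a * p_b)                       # null-variant
all_conf = p_ab / max(p_a, p_b)
kulc     = 0.5 * (p_ab / p_a + p_ab / p_b)          # null-invariant
ir       = abs(p_a - p_b) / (p_a + p_b - p_ab)      # imbalance ratio

print(round(lift, 3), round(all_conf, 3), round(kulc, 3), round(ir, 3))
```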
SVMs
- linear: - separating hyperplane: Wx + b = 0 - find maximum margin hyperplane - nonlinear: - transform data into higher dimension - search for an optimal linear separating hyperplane in the new space - can transform data into high dimensions with nonlinear mappings - finds hyperplane using support vectors and margins
Density based clustering
- major features: - clusters of arbitrary shape - handles noise (may filter out outliers) - single scan - density parameters as termination condition - typical methods: DBSCAN, DENCLUE
Central Tendency
- mean: values must be numerical - weighted mean: sum(wi*xi)/sum(wi) - trimmed mean: chopping extreme values - median - estimated median = L1 + ((n/2 - sum(freq_l))/freq_median) * width (where data are grouped in intervals, L1 is the lower boundary of the median interval, sum(freq_l) is the sum of the frequencies of all intervals below the median interval, and freq_median is the frequency of the median interval) - mode - unimodal, bimodal, etc. - midrange: average of min and max
Data Cube
- multidimensional data model that is the basis for data warehouses and OLAP - allow data to be modeled and viewed in multiple dimensions - facts: numerical measures - cube: a lattice of cuboids - Base cuboid: cuboid with no aggregations - apex cuboid: cuboid aggregated along all dimensions
Attribute types
- nominal: categorical (unordered) - ordinal: categorical with order - binary - symmetric binary: two values occur with similar probability - asymmetric binary - numeric: quantitative - interval scaled: ex:// temp in F, year - ratio scaled: has a true zero point. ex:// dollars, age - discrete: finite or countably infinite (ex:// ints) - continuous: infinite (ex:// real numbers)
Normalization
- normalization allows us to compare distributions of values for attributes with different meanings - Min-max: v' = ((v-min_a)/(max_a - min_a) * (new_max_a - new_min_a)) + (new_min_a) - z-score normalization: v' = (v - mean_a)/(stddev_a) - z score gives number of deviations by which the value of an observation is above the mean - normalization by decimal scaling: v' = v/(10^j) where j is the smallest integer such that max(|v'|)<1 - used for looking at order of magnitude differences
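A short sketch of min-max, z-score, and decimal-scaling normalization on a toy attribute.

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max to a new range [0, 1].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score: number of standard deviations above/below the mean.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10**j

print(minmax, zscore, decimal, sep="\n")
```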
k mediods clustering
- partitioning method - finds representative objects to define clusters - less sensitive to outliers and noise than k-means - cluster represented by its medoid (the object closest to the cluster center) - much more costly than k-means for large k or n - PAM (Partitioning Around Medoids): iteratively replaces a medoid with a non-medoid if it reduces the total distance; effective for small data sets, does not scale; O(k(n-k)^2) per iteration - CLARA: apply PAM on multiple sampled sets; O(ks^2 + k(n-k)) for sample size s - CLARANS: use randomized samples to search for neighboring solutions
k means clustering
- partitioning method - cluster represented by mean (centroid) - partition objects into k nonempty clusters - compute the mean (centroid) of each cluster, update centroid - assign each object to the closest centroid - repeat until no more assignment changes - relatively efficient: O(tnk) where t is number of iterations, n is number of objects, k is number of clusters - not suitable for discovering clusters of nonconvex shape - sensitive to noise and outliers since these objects skew the centroid locations
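A minimal k-means sketch on toy 2-D data; assumes numpy and does not handle empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(n_iter):
        # Assign each object to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned objects.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # no more changes
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
```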
Requirements to consider when selecting clustering algorithm
- scalability - attribute types - clusters with arbitrary shape - minimal domain knowledge needed for parameters - noisy data - incremental, insensitive to input order - high dimensionality - constraint-based clustering - interpretability and usability
FP growth method
- scan, find frequent 1-itemsets - sort frequent items in descending frequency order - scan, construct the FP-tree (a prefix tree); all frequent-itemset information is retained - get conditional pattern bases by traversing the links of each frequent item and collecting its prefix paths - construct the x-conditional FP-tree and get all frequent patterns related to x - repeat until the resulting FP-tree is empty or contains a single path
Sampling for frequent patterns
- select a sample data set that fits in memory - mine frequent patterns within sample (may use lower min_sup) - scan whole data set for actual support (only checking closed patterns) - does not guarantee to find all frequent patterns - take new sample to find missed frequent patterns - beneficial when efficiency is very important
Semi supervised learning
- self-training: - train on some labeled data - classify unlabeled data; high-confidence classifications are added to the training set - co-training: - two classifiers help each other - the two classifiers use independent feature sets (views) of the labeled data so that the classifiers will not be correlated
Various Association Rules
- single level, single dimensional, boolean val (is item contained or not) - multilevel association rules - ex:// redundancy filtering - milk => wheat bread [8%, 70%] - 2% milk => wheat bread [2%, 72%] - multidimensional association rules - quantitative association rules
Data transformation
- smoothing: removing noise from data - aggregation: summarization (summarization of multiple entries into one) - generalization: concept hierarchy climbing (one to one mapping of attrs) - normalization: scale to fall within a range - attribute/feature construction
Multidimensional association
- specifying multidimensional associations allows us to constrain mined patterns; this lets us find patterns more efficiently and focus on patterns of greater interest - single-dimensional rules (intra-dimensional) - ex:// buys(X, "milk") => buys(X, "bread") - multidimensional rules: >= 2 predicates - inter-dimensional rules (no repeated predicates) - ex:// age(X, 19-25) ^ occupation(X, student) => buys(X, coke) - hybrid-dimensional rules (repeated predicates) - ex:// age(X, 19-25) ^ buys(X, popcorn) => buys(X, coke)
Mining quantitative associations
- static discretization: predefined concepts/ranges (e.g., age: 10-15, 16-20, ...) - dynamic discretization: based on data distributions (e.g., equal-width/equal-frequency bucketing) - clustering: distance-based association - deviation: from normal data - ex:// sex=female => wage: mean=$7 (overall mean=$9)
Bayesian Classification
- statistical classifier that predicts class membership probabilities - performance comparable to decision tree and some NNs - incremental: when new samples are added to data set, calculation updates can be made without redoing the entire calculation - Bayes Theorem: P(Ci | X) = (P(X | Ci)*P(Ci)) / P(X) - P(X) and P(Ci) are prior probabilities - P(Ci|X), P(X|Ci) are posterior probabilities
STING
- statistical information grid - grid based clustering - spatial area => rectangular cells - using the count and cell size information, dense clusters can be identified approximately using STING - multi-resolution - pros - O(g) where g is the number of grid cells at bottom level - query independent, incremental update, easy to parallelize - cons - finer granularity vs coarser granularity (choosing right granularity) - only horizontal or vertical cluster boundaries
Important characteristics of data
- structured/unstructured/semi-structured (ex:// relational DB is structured, text data is unstructured (no fixed size)) - dimensionality - sparsity (only pattern presence counts) - resolution: patterns depend on scale, granularity/resolution matters - distribution: centrality and dispersion
Data warehouse properties
- subject oriented - organized around one or more subjects (e.g., customers, products, sales) - focus on modeling and analysis of data for decision making - concise view of a particular subject by excluding data that are not useful - integrated - integrates multiple, heterogeneous data sources - time variant - longer time span than operational DBs - every key structure in a data warehouse contains time info, explicitly or implicitly - non volatile - physically separate store from operational environments - no operational updates of data - two operations: initial loading (append) and access/scan (read)
Bayesian Belief Networks
- subset of variables conditionally independent - causal model: a directed, acyclic graph - conditional probability tables - P(x1, x2, ..., xp) = product(P(xi | Parents(Yi)))
Supervised vs. Unsupervised Learning
- supervised learning (classification) - training data accompanied by class labels - new data is classified based on the training set - unsupervised learning - class labels of the training data are unknown - aims to establish the existence of classes or clusters in the data
Overfitting and Tree pruning
- too many branches reflect anomalies due to noise or outliers. poor accuracy for unseen data - pre-pruning: halt tree construction early - post-pruning: remove branches from a "fully grown" tree
Bottom-Up computation (BUC)
- top down computation of sparse and iceberg cubes - pruning via apriori - computation starts from apex (0-D) cuboid - keep drilling down until frequency/support is below threshold - SEE DIAGRAM
Classification Based Methods
- train a classifier model that can distinguish normal data from outliers - can incorporate human domain knowledge into the detection process by learning from labeled samples - once the model is constructed, classification is generally fast - quality depends on the data - weaknesses - skewness of the data - cannot detect unseen anomalies (new data must be similar to the training set) - one-class model - classifier built to describe only normal data objects - difficulties: there can be more than one type of normal data, and data imbalance - semi-supervised - do clustering to find a large cluster C and a small cluster C1 - C: objects with the normal label - C1: objects with the outlier label - build a one-class model for C - objects not fitting C's model are identified as outliers
Partitioning
- two data scans - guaranteed to find all frequent patterns - partition size = fit in main memory - divide D into n partitions, find the frequent itemsets local to each partition (1 scan), combine all local frequent itemsets to form candidate itemsets, find global frequent itemsets among candidates (1 scan)
Proximity Based Outlier Detection
- unsupervised - assumes that an object is an outlier if its nearest neighbors are far away in feature space - distance based - density based - compare the density around an object with the density around its neighbors
Ensemble Methods
- use a combination of multiple models to increase accuracy - bagging: equal weight votes - often better accuracy than single classifier - more robust to noisy data - boosting: weighted votes - generally better than bagging - may overfit to misclassified data (data with larger weights)
Grid Based Clustering
- uses multi-resolution grid data structure - quantizes object space to cells in the grid - fast processing time: depends on number of cells not number of objects - sensitive to resolution choice of grid - start with the space, not the objects - ex:// STING, CLIQUE
Boosting
- weights are assigned to each training tuple - a series of k classifiers is iteratively learned - after classifier Mi is learned, adjust the weights so M_(i+1) pays more attention to tuples that were misclassified by Mi - M* (final classifier) combines the votes of all k classifiers, weighted by their individual accuracy
Counting candidate support with hash trees
- why is counting candidate support a problem? the number of candidates, both total and per transaction - method: - store candidate itemsets in a hash tree - leaf nodes contain the candidate itemsets and their counts - subset function: finds all candidates contained in a transaction - with a plain list of candidates, we would need to go through every candidate for each transaction (O(n)); with a hash tree, this is reduced to roughly O(log n) - checking for candidates in a transaction - ex:// transaction with n items, checking support of 3-itemset candidates - generate the (n choose 3) 3-itemsets of the transaction - traverse the tree with each of the (n choose 3) itemsets to see whether it exists as a candidate
Often when looking for patterns, we are looking for repeated values; bucketing/grouping values may allow us to find patterns that would remain hidden if all values were unique.
---
Classification Steps
Step 1: Learning - model construction - training set - class labels
Step 2: Classification - test set - accuracy
ROC Curve
- plots the true positive rate vs the false positive rate at different classification thresholds - area under the curve (AUC) summarizes overall performance
OLTP vs OLAP chart
- users: OLTP: clerk, IT professional | OLAP: knowledge worker
- function: OLTP: day to day operations | OLAP: decision support
- db design: OLTP: application oriented | OLAP: subject oriented
- data: OLTP: current, up to date, detailed, isolated | OLAP: historical, summarized, multidimensional, integrated, consolidated
- usage: OLTP: repetitive | OLAP: ad-hoc
- access: OLTP: read/write, index/hash on primary key | OLAP: lots of scans
- unit of work: OLTP: short, simple transactions | OLAP: complex queries
- # records accessed: OLTP: tens | OLAP: millions
- # users: OLTP: thousands | OLAP: hundreds
- db size: OLTP: 100MB-GB | OLAP: 100GB-TB
- metric: OLTP: transaction throughput | OLAP: query throughput, response time