CSC 533 Data Mining Final Exam
Three-tier architecture
A client/server configuration that includes three layers: a client layer and two server layers. Although the nature of the server layers differs, a common configuration contains an application server and a database server.
Classification
A form of data analysis that extracts models describing data classes.
interesting
A pattern is interesting if it is valid on test data with some degree of certainty, novel, potentially useful (e.g., can be acted on or validates a hunch about which the user was curious), and easily understood by humans. Interesting patterns represent knowledge. Measures of pattern interestingness, either objective or subjective, can be used to guide the discovery process
data warehouse
A subject-oriented, integrated, time-variant, nonvolatile collection of data used in support of management decision making.
issues in data mining research
Areas include mining methodology, user interaction, efficiency and scalability, and dealing with diverse data types. Data mining research has strongly impacted society and will continue to do so in the future
data
Data mining can be conducted on any kind of data as long as the data are meaningful for a target application, such as database data, data warehouse data, transactional data, and advanced data types. Advanced data types include time-related or sequence data, data streams, spatial and spatiotemporal data, text and multimedia data, graph and networked data, and Web data
online analytical processing (OLAP)
The multidimensional data analysis capabilities provided by data warehouse systems.
Classifier
Predicts categorical labels (classes)
Apriori Algorithm
a seminal algorithm for mining frequent itemsets for Boolean association rules. It performs level-wise mining using the Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
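A minimal sketch of Apriori-style level-wise mining; the function name, transactions, and min_support value below are illustrative assumptions, not from the source.

from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining using the Apriori property:
    every nonempty subset of a frequent itemset must itself be frequent."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = set(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets to form k-itemset candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
        frequent |= current
        k += 1
    return frequent

# Illustrative transactions (made-up values)
txns = [{"milk", "bread"}, {"milk", "diapers"}, {"milk", "bread", "diapers"}, {"bread"}]
print(sorted(apriori(txns, min_support=0.5), key=len))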
measures that assess a classifiers predictive ability (MODEL EVALUATION)
accuracy, sensitivity, recall, specificity, precision, F score. Reliance on accuracy can be deceiving when the main class of interest is in the minority.
multidimensional data mining
also known as exploratory multidimensional data mining, online analytical mining, or OLAM
Boxplots
are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows: Typically, the ends of the box are at the quartiles so that the box length is the interquartile range. The median is marked by a line within the box. Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
dimensions
are the entities or perspectives with respect to which an organization wants to keep records and are hierarchical in nature
Global outliers
are the simplest form of outlier and the easiest to detect
Data mining functionalities
are used to specify the kinds of patterns or knowledge to be found in data mining tasks. The functionalities include characterization and discrimination; the mining of frequent patterns, associations, and correlations; classification and regression; cluster analysis; and outlier detection. As new types of data, new applications, and new analysis demands continue to emerge, there is no doubt we will see more and more novel data mining tasks in the future.
Clustering evaluation
assesses the feasibility of clustering analysis on a data set and the quality of the results generated by a clustering method. The tasks include assessing clustering tendency, determining the number of clusters, and measuring clustering quality
frequent itemset mining
the process from which association and correlation rules can be derived. Methods include Apriori-like algorithms, frequent-pattern-growth-based algorithms such as FP-growth, and algorithms that use the vertical data format.
Proximity-based outlier detection methods
assume that an object is an outlier if the proximity of the object to its nearest neighbors significantly deviates from the proximity of most of the other objects to their neighbors in the same data set. Distance-based outlier detection methods consult the neighborhood of an object, defined by a given radius. An object is an outlier if its neighborhood does not have enough other points. In density-based outlier detection methods, an object is an outlier if its density is relatively much lower than that of its neighbors
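A minimal sketch of distance-based outlier detection: a point is flagged when its neighborhood of a given radius contains too few other points. The function name, radius, min_neighbors, and data values are illustrative assumptions.

import numpy as np

def distance_based_outliers(points, radius, min_neighbors):
    """Flag a point as an outlier if fewer than min_neighbors other points
    fall within the given radius of it."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Count neighbors within the radius, excluding the point itself
    neighbor_counts = (dists <= radius).sum(axis=1) - 1
    return neighbor_counts < min_neighbors

# Illustrative data: one point far from the dense group (made-up values)
data = [[1, 1], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [8, 8]]
print(distance_based_outliers(data, radius=1.0, min_neighbors=2))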
Clustering-based outlier detection methods
assume that the normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any clusters.
tree pruning
attempts to improve accuracy by removing tree branches that reflect noise in the data
naive bayesian classification
based on Bayes' theorem of posterior probability. It assumes class-conditional independence: the effect of an attribute value on a given class is independent of the values of the other attributes.
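A minimal sketch of naive Bayesian classification for categorical attributes, assuming class-conditional independence; the function names, smoothing choice, and training values are illustrative assumptions, not from the source.

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(class) and P(attribute value | class) from categorical data."""
    priors = Counter(labels)
    cond = defaultdict(Counter)          # (attribute index, class) -> value counts
    vocab = defaultdict(set)             # attribute index -> distinct values seen
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
            vocab[i].add(v)
    return priors, cond, vocab, len(labels)

def predict(row, priors, cond, vocab, n):
    """Pick the class maximizing P(class) * product of P(value | class)."""
    best_class, best_score = None, -1.0
    for y, class_count in priors.items():
        score = class_count / n
        for i, v in enumerate(row):
            # Laplace smoothing avoids zeroing out the whole product
            score *= (cond[(i, y)][v] + 1) / (class_count + len(vocab[i]))
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Illustrative training data (attributes: outlook, windy); values are made up
X = [("sunny", "no"), ("sunny", "yes"), ("rain", "yes"), ("rain", "no")]
y = ["play", "play", "stay", "play"]
model = train_naive_bayes(X, y)
print(predict(("rain", "yes"), *model))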
applications
business intelligence, Web search, bioinformatics, health informatics, finance, digital libraries, and digital governments.
Outlier detection methods for high-dimensional data
can be divided into three main approaches: extending conventional outlier detection, finding outliers in subspaces, and modeling high-dimensional outliers.
Concept characterization
can be implemented using data cube (OLAP-based) approaches and the attribute-oriented induction approach. These are attribute- or dimension-based generalization approaches. The attribute-oriented induction approach consists of the following techniques: data focusing, data generalization by attribute removal or attribute generalization, count and aggregate value accumulation, attribute generalization control, and generalization data visualization.
Online Analytical Processing (OLAP)
can be performed in data warehouses/marts using the multidimensional data model. Operations include roll-up; drill-down, drill-across, and drill-through; slice and dice; and pivot (rotate). OLAP is the manipulation of information to create business intelligence in support of strategic decision making.
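A rough sketch of roll-up and slice using pandas groupby as a stand-in for an OLAP engine; the fact table, column names, and values are illustrative assumptions.

import pandas as pd

# Illustrative sales fact table (made-up values)
sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "region":  ["East", "East", "West", "West"],
    "amount":  [100, 150, 120, 180],
})

# Roll-up: climb the time hierarchy from quarter to year
rollup = sales.groupby(["year", "region"], as_index=False)["amount"].sum()

# Slice: fix one dimension (region = "East") and keep the rest
east_slice = sales[sales["region"] == "East"]

print(rollup)
print(east_slice)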
Concept comparison
can be performed using the attribute-oriented induction or data cube approaches in a manner similar to concept characterization. Generalized tuples from the target and contrasting classes can be quantitatively compared and contrasted.
confusion matrix
can be used to evaluate a classifier's quality. It shows the numbers of true positives, true negatives, false positives, and false negatives.
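A minimal sketch deriving the model evaluation measures listed above from confusion matrix counts; the function name and counts are illustrative assumptions. With a minority positive class, accuracy can look high even when sensitivity on the class of interest is poor.

def classifier_metrics(tp, tn, fp, fn):
    """Derive common evaluation measures from confusion matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision   = tp / (tp + fp)
    f_score     = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f_score

# Illustrative counts with a rare positive class (made-up values)
print(classifier_metrics(tp=10, tn=900, fp=20, fn=70))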
ensemble methods
can be used to increase overall accuracy by learning and combining a series of individual base classifier models. Bagging, boosting, and random forests are popular ensemble methods.
density-based method
clusters objects based on the notion of density. It grows clusters either according to the density of neighborhood objects (e.g., in DBSCAN) or according to a density function (e.g., in DENCLUE). OPTICS is a density-based method that generates an augmented ordering of the data's clustering structure.
Data integration
combines data from multiple sources to form a coherent data store. The resolution of semantic heterogeneity, metadata, correlation analysis, tuple duplication detection, and data conflict detection contribute to smooth data integration
Bitmapped join indexing
combines the bitmap and join index methods; it can be used to further speed up OLAP query processing.
data cube
consists of a lattice of cuboids each corresponding to a different degree of summarization of the given multidimensional data
null invariant pattern evaluation measures
cosine, Kulczynski, max_confidence, all_confidence
hierarchical method
creates a hierarchical decomposition of the given set of data objects. The method can be classified as being either agglomerative (bottom-up) or divisive (top-down), based on how the hierarchical decomposition is formed. To compensate for the rigidity of merge or split, the quality of hierarchical agglomeration can be improved by analyzing object linkages at each hierarchical partitioning (e.g., in Chameleon), or by first performing microclustering (that is, grouping objects into "microclusters") and then operating on the microclusters with other clustering techniques such as iterative relocation (as in BIRCH).
metadata
data defining the warehouse objects. The metadata repository provides details regarding the warehouse structure, data history, algorithms used for summarization, mappings from the source data to the warehouse form, system performance, and business terms and issues.
back-end tools and utilities
data extraction, data cleaning, data transformation, loading, refreshing, warehouse management
multidimensional view; major dimensions
data, knowledge, technologies, and applications
Bitmap indexing
each attribute has its own bitmap index table. Bitmap indexing reduces join, aggregation, and comparison operations to bit arithmetic.
Multidimensional data mining
(also known as exploratory multidimensional data mining) integrates core data mining techniques with OLAP-based multidimensional analysis. It searches for interesting patterns among multiple combinations of dimensions (attributes) at varying levels of abstraction, thereby exploring multidimensional data space.
Contextual outlier detection and collective outlier detection
explore structures in the data. In contextual outlier detection, the structures are defined as contexts using contextual attributes. In collective outlier detection, the structures are implicit and are explored as part of the mining process. To detect such outliers, one approach transforms the problem into one of conventional outlier detection. Another approach models the structures directly.
Challenges in Outlier Detection
finding appropriate data models, the dependence of outlier detection systems on the application involved, finding ways to distinguish outliers from noise, and providing justification for identifying outliers as such.
association rule mining
finding frequent itemsets that satisfy a minimum support threshold (a percentage of the task-relevant tuples), from which strong association rules of the form A ⇒ B are generated. These rules also satisfy a minimum confidence threshold (a prespecified probability of satisfying B under the condition that A is satisfied). Associations can be further analyzed to uncover correlation rules, which convey statistical correlations between itemsets A and B.
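A minimal sketch of computing support and confidence for a candidate rule A ⇒ B; the function name and transactions are illustrative assumptions, not from the source.

def support_confidence(transactions, A, B):
    """Support of A ∪ B over all transactions, and confidence of the rule A ⇒ B."""
    A, B = frozenset(A), frozenset(B)
    n = len(transactions)
    count_A  = sum(1 for t in transactions if A <= t)
    count_AB = sum(1 for t in transactions if (A | B) <= t)
    support = count_AB / n
    confidence = count_AB / count_A if count_A else 0.0
    return support, confidence

# Illustrative transactions (made-up values)
txns = [frozenset(t) for t in
        [{"milk", "bread"}, {"milk", "diapers"}, {"milk", "bread", "diapers"}, {"bread"}]]
print(support_confidence(txns, A={"milk"}, B={"bread"}))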
partitioning method
first creates an initial set of k partitions, where parameter k is the number of partitions to construct. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. Typical partitioning methods include k-means, k-medoids, and CLARANS.
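A minimal sketch of k-means as an iterative relocation method; the function name, seed handling, and data values are illustrative assumptions.

import numpy as np

def k_means(points, k, iterations=100, seed=0):
    """Assign each point to its nearest center, then recompute each center
    as the mean of its assigned points, until the centers stop moving."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assignment step
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step (keep the old center if a cluster becomes empty)
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Illustrative 2-D data with two visible groups (made-up values)
data = [[1, 1], [1.2, 0.8], [0.9, 1.1], [8, 8], [8.2, 7.9], [7.8, 8.1]]
labels, centers = k_means(data, k=2)
print(labels, centers)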
grid-based method
first quantizes the object space into a finite number of cells that form a grid structure, and then performs clustering on the grid structure. STING is a typical example of a grid-based method based on statistical information stored in grid cells. CLIQUE is a grid-based and subspace clustering algorithm.
k-medoids method
groups n objects into k clusters by minimizing the absolute error. The initial representative objects (called seeds) are chosen arbitrarily.
training and test set partitioning methods
holdout, random sampling, cross-validation, bootstrapping
null invariant
a measure is null-invariant if its value is free from the influence of null transactions (transactions that do not contain any of the itemsets being examined)
collective outlier
a group of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set, even though the individual data objects may not be outliers. Collective outlier detection requires background information to model the relationships among objects in order to find outlier groups.
Types of outliers
include global outliers, contextual outliers, and collective outliers. An object may be more than one type of outlier.
Cluster analysis has extensive applications
including business intelligence, image pattern recognition, Web search, biology, and security. Cluster analysis can be used as a standalone data mining tool to gain insight into the data distribution, or as a preprocessing step for other data mining algorithms operating on the detected clusters
frequent pattern mining
searches for interesting associations and correlations between itemsets in transactional and relational databases. Market basket analysis is the earliest form of frequent pattern mining for association rules.
cluster analysis clustering
a cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering; it is a form of unsupervised learning.
outlier
is a data object that deviates significantly from the rest of the objects, as if it were generated by a different mechanism.
Data generalization
is a process that abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels. Data generalization approaches include data cube-based data aggregation and attribute-oriented induction
data warehouse
is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making. The data are stored under a unified schema and are typically summarized. Data warehouse systems provide multidimensional data analysis capabilities, collectively referred to as online analytical processing.
Data quality
is defined in terms of accuracy, completeness, consistency, timeliness, believability, and interpretability. These qualities are assessed based on the intended use of the data.
Concept description
is the most basic form of descriptive data mining. It describes a given set of task-relevant data in a concise and summarative manner, presenting interesting general properties of the data. Concept (or class) description consists of characterization and comparison (or discrimination). The former summarizes and describes a data collection, called the target class, whereas the latter summarizes and distinguishes one data collection, called the target class, from other data collection(s), collectively called the contrasting class(es).
Data mining
is the process of discovering interesting patterns from massive amounts of data. As a knowledge discovery process, it typically involves data cleaning, data integration, data selection, data transformation, pattern discovery, pattern evaluation, and knowledge presentation.
pattern evaluation measures
lift, χ², all_confidence, max_confidence, Kulczynski, cosine. Kulczynski together with the imbalance ratio is suggested for presenting pattern relationships among itemsets.
measures of central tendency
mean, median, mode
frequent pattern growth
a method of mining frequent itemsets without candidate generation. It constructs a highly compact data structure (an FP-tree) to compress the original transaction database. Rather than employing the generate-and-test strategy of Apriori-like methods, it focuses on frequent pattern (fragment) growth, which avoids costly candidate generation and results in greater efficiency.
mining frequent itemsets using the vertical data format (ECLAT)
a method that transforms a given data set of transactions in the horizontal data format of TID–itemset into the vertical data format of item–TID set. It mines the transformed data by TID-set intersections based on the Apriori property and additional optimization techniques such as diffsets.
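A minimal sketch of vertical-format mining: build item → TID-set lists, then grow itemsets by intersecting TID sets. The function name, support count, and transactions are illustrative assumptions (diffset optimization omitted).

from collections import defaultdict

def eclat(transactions, min_support_count):
    """Mine frequent itemsets in the vertical data format via TID-set intersections."""
    # Transform horizontal format (TID -> items) into vertical (item -> TID set)
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[frozenset([item])].add(tid)

    frequent = {}
    current = {iset: tids for iset, tids in tidsets.items()
               if len(tids) >= min_support_count}
    while current:
        frequent.update(current)
        next_level = {}
        keys = list(current)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == len(keys[i]) + 1:
                    tids = current[keys[i]] & current[keys[j]]   # TID-set intersection
                    if len(tids) >= min_support_count:
                        next_level[union] = tids
        current = next_level
    return frequent

# Illustrative transactions (made-up values)
txns = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
print({tuple(sorted(k)): len(v) for k, v in eclat(txns, min_support_count=2).items()})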
Data compression
methods apply transformations to obtain a reduced or "compressed" representation of the original data. The data reduction is lossless if the original data can be reconstructed from the compressed data without any loss of information; otherwise, it is lossy
Numerosity reduction
methods use parametric or nonparametric models to obtain smaller representations of the original data. Parametric models store only the model parameters instead of the actual data; examples include regression and log-linear models. Nonparametric methods include histograms, clustering, sampling, and data cube aggregation.
Numeric Prediction
models continuous valued functions
Data reduction
techniques obtain a reduced representation of the data while minimizing the loss of information content. These include methods of dimensionality reduction, numerosity reduction, and data compression.
types of distributions
normal, uniform, binomial, left skew, right skew
class imbalance problem
occurs when the main class of interest is represented by only a few tuples. Strategies to address this problem include oversampling, undersampling, threshold moving, and ensemble techniques.
five-number summary
of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order of Minimum, Q1, Median, Q3, Maximum.
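A minimal sketch computing the five-number summary (the values that a boxplot is built from); the function name and observations are illustrative assumptions.

import numpy as np

def five_number_summary(values):
    """Minimum, Q1, Median, Q3, Maximum of a distribution."""
    values = np.asarray(values, dtype=float)
    return (values.min(),
            np.percentile(values, 25),
            np.median(values),
            np.percentile(values, 75),
            values.max())

# Illustrative observations (made-up values)
print(five_number_summary([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]))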
Classification-based outlier detection methods
often use a one-class model. That is, a classifier is built to describe only the normal class. Any samples that do not belong to the normal class are regarded as outliers
concept hierarchies
organize the values of attributes or dimensions into gradual abstraction levels
ROC curves
plot the true positive rate, or sensitivity, versus the false positive rate or 1 - specificity of one or more classifiers
contextual outlier
deviates significantly with respect to a specific context of the object (e.g., a Toronto temperature value of 28°C is an outlier if it occurs in the context of winter).
Dimensionality reduction
reduces the number of random variables or attributes under consideration. Methods include wavelet transforms, principal components analysis, attribute subset selection, and attribute creation
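A rough sketch of dimensionality reduction via principal components analysis on the covariance matrix; the function name, component count, and data values are illustrative assumptions.

import numpy as np

def pca_reduce(X, n_components):
    """Project the data onto its top principal components
    (the directions of greatest variance)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Covariance matrix and its eigen-decomposition
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns ascending eigenvalues; keep the largest n_components
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return X_centered @ top

# Illustrative 3-D data reduced to 2 dimensions (made-up values)
X = [[2.5, 2.4, 1.0], [0.5, 0.7, 0.2], [2.2, 2.9, 1.1], [1.9, 2.2, 0.9]]
print(pca_reduce(X, n_components=2))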
Full Materialization
refers to the computation of all the cuboids in the lattice defining a data cube. It typically requires an excessive amount of storage space, particularly as the number of dimensions and size of associated concept hierarchies grow. This problem is known as the curse of dimensionality. Alternatively, partial materialization is the selective computation of a subset of the cuboids or subcubes in the lattice. For example, an iceberg cube is a data cube that stores only those cube cells that have an aggregate value (e.g., count) above some minimum support threshold.
similarity measures; evaluations of clusters
clustering methods can be compared regarding the partitioning criteria, separation of clusters, similarity measures used, and clustering space. The major categories are partitioning methods, hierarchical methods, density-based methods, and grid-based methods.
Join indexing
registers the joinable rows of two or more relations from a relational database, reducing the overall cost of OLAP join operations
OLAP server types
relational OLAP, multidimensional OLAP, or a hybrid OLAP implementation. A ROLAP server uses an extended relational DBMS that maps OLAP operations on multidimensional data to standard relational operations. A MOLAP server maps multidimensional data views directly to array structures. A HOLAP server combines ROLAP and MOLAP. For example, it may use ROLAP for historic data while maintaining frequently accessed data in a separate MOLAP store
Data cleaning
routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Data cleaning is usually performed as an iterative two-step process consisting of discrepancy detection and data transformation
Data transformation
routines convert the data into appropriate forms for mining. For example, in normalization, attribute data are scaled so as to fall within a small range such as 0.0 to 1.0. Other examples are data discretization and concept hierarchy generation.
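A minimal sketch of min-max normalization, the scaling example named above; the function name and income values are illustrative assumptions.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale attribute values into a small range such as [0.0, 1.0]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:                     # all values equal; map everything to new_min
        return [new_min] * len(values)
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

# Illustrative income values (made-up numbers)
print(min_max_normalize([12000, 73600, 98100]))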
classification methods
rule based, bayesian, decision tree, ensemble methods
Statistical outlier detection methods
statistical (or model-based) methods assume that the normal data objects follow a statistical model, where data not following the model are considered outliers. Such methods may be parametric (they assume that the data are generated by a parametric distribution) or nonparametric (they learn a model for the data, rather than assuming one a priori). Parametric methods for multivariate data may employ the Mahalanobis distance, the χ²-statistic, or a mixture of multiple parametric models. Histograms and kernel density estimation are examples of nonparametric methods.
clustering requirements
scalability, the ability to deal with different types of data and attributes, the discovery of clusters in arbitrary shape, minimal requirements for domain knowledge to determine input parameters, the ability to deal with noisy data, incremental clustering and insensitivity to input order, the capability of clustering high-dimensionality data, constraint-based clustering, as well as interpretability and usability.
rainforest
a framework for scalable decision tree induction
choosing attributes in training set; attribute selection measure
splitting rules - information gain, gain ratio, gini index
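A minimal sketch of the information gain splitting rule: the reduction in entropy obtained by partitioning the training tuples on a candidate attribute. The function names and sample labels are illustrative assumptions.

from collections import Counter
from math import log2

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Reduction in entropy from splitting on the given attribute."""
    n = len(labels)
    split_entropy = 0.0
    for value in set(attribute_values):
        subset = [y for y, v in zip(labels, attribute_values) if v == value]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

# Illustrative class labels and one candidate attribute (made-up values)
labels = ["yes", "yes", "no", "no", "yes"]
attr   = ["sunny", "rain", "sunny", "sunny", "rain"]
print(information_gain(labels, attr))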
technologies; interdisciplinary nature of data mining research and development
statistics, machine learning, database and data warehouse systems, and information retrieval
outlier detection method categories
supervised, semi-supervised, or unsupervised; statistical methods, proximity-based methods, and clustering-based methods
Decision Tree Induction
a top-down recursive tree induction algorithm that uses an attribute selection measure to select the attribute tested at each nonleaf node in the tree. ID3, C4.5, and CART are examples of algorithms using different attribute selection measures.
Data discretization
transforms numeric data by mapping values to interval or concept labels. Such methods can be used to automatically generate concept hierarchies for the data, which allows for mining at multiple levels of granularity. Discretization techniques include binning, histogram analysis, cluster analysis, decision tree analysis, and correlation analysis. For nominal data, concept hierarchies may be generated based on schema definitions as well as the number of distinct values per attribute.
multidimensional data model
typically used for the design of corporate data warehouses and departmental data marts. Such a model can adopt a star schema, snowflake schema, or fact constellation schema. The core is the data cube, which consists of a large set of facts or measures and a number of dimensions.
data warehouses
used for information processing (querying and reporting), analytical processing (which allows users to navigate through summarized and detailed data by OLAP operations), and data mining (which supports knowledge discovery).
significance test
used to assess whether the difference in accuracy between two classifiers is due to chance
rule based classifier
uses if-then rules for classification. Rules can be extracted from a decision tree, or they may be generated directly from training data using sequential covering algorithms.