CSC 533 Data Mining Final Exam

Three-tier architecture

A client/server configuration that includes three layers: a client layer and two server layers. Although the nature of the server layers differs, a common configuration contains an application server and a database server.

Classification

A form of data analysis that extracts models describing data classes.

interesting

A pattern is interesting if it is valid on test data with some degree of certainty, novel, potentially useful (e.g., can be acted on or validates a hunch about which the user was curious), and easily understood by humans. Interesting patterns represent knowledge. Measures of pattern interestingness, either objective or subjective, can be used to guide the discovery process

data warehouse

A subject-oriented, integrated, time-variant, nonvolatile collection of data used in support of management decision making.

issues in data mining research

Areas include mining methodology, user interaction, efficiency and scalability, and dealing with diverse data types. Data mining research has strongly impacted society and will continue to do so in the future

data

Data mining can be conducted on any kind of data as long as the data are meaningful for a target application, such as database data, data warehouse data, transactional data, and advanced data types. Advanced data types include time-related or sequence data, data streams, spatial and spatiotemporal data, text and multimedia data, graph and networked data, and Web data

online analytical processing

Data warehouse systems provide multidimensional data analysis capabilities, collectively referred to as online analytical processing.

Classifier

Predicts categorical labels (classes)

Apriori Algorithm

a seminal algorithm for mining frequent itemsets for Boolean association rules. It explores level-wise mining using the Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
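
A minimal sketch of the level-wise candidate generation and pruning that the Apriori property enables (pure Python; the toy baskets and the min_support fraction are illustrative assumptions, not from the course materials):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: returns frequent itemsets (frozensets) with support counts."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}        # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # count support and prune infrequent candidates
        level = {c: support_count(c) for c in current}
        level = {c: s for c, s in level.items() if s / n >= min_support}
        frequent.update(level)
        # candidate generation: join frequent k-itemsets, keep only (k+1)-itemsets
        # whose every k-subset is frequent (the Apriori property)
        prev = list(level)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                    frozenset(sub) in level for sub in combinations(union, k)
                ):
                    candidates.add(union)
        current = candidates
        k += 1
    return frequent

# hypothetical market-basket data
baskets = [{"milk", "bread"}, {"milk", "diapers", "beer"},
           {"bread", "diapers", "beer"}, {"milk", "bread", "diapers", "beer"}]
print(apriori(baskets, min_support=0.5))
```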

measures that assess a classifier's predictive ability (MODEL EVALUATION)

accuracy, sensitivity (recall), specificity, precision, and F score. Reliance on accuracy can be deceiving when the main class of interest is in the minority
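
A small sketch of how these measures follow from confusion-matrix counts (the imbalanced toy counts are hypothetical):

```python
def evaluation_measures(tp, tn, fp, fn):
    """Classifier evaluation measures computed from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision   = tp / (tp + fp)
    f_score     = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(accuracy=accuracy, sensitivity=sensitivity,
                specificity=specificity, precision=precision, f_score=f_score)

# imbalanced example: 95 negatives, 5 positives; the classifier finds only 1 positive.
# Accuracy is 96% even though sensitivity is only 20% -- accuracy alone is deceiving.
print(evaluation_measures(tp=1, tn=95, fp=0, fn=4))
```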

multidimensional data mining

also known as exploratory multidimensional data mining, online analytical mining, or OLAM

Boxplots

are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows: Typically, the ends of the box are at the quartiles so that the box length is the interquartile range. The median is marked by a line within the box. Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
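
A sketch of the five-number summary behind a boxplot, assuming NumPy (and optionally matplotlib) are available; the data values are made up:

```python
import numpy as np

data = np.array([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36])   # hypothetical observations

# five-number summary: Minimum, Q1, Median, Q3, Maximum
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
iqr = q3 - q1                        # interquartile range = box length
print(minimum, q1, median, q3, maximum, iqr)

# a boxplot draws the box from Q1 to Q3, a line at the median,
# and whiskers extending to the extreme observations
# import matplotlib.pyplot as plt; plt.boxplot(data); plt.show()
```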

dimensions

are the entities or perspectives with respect to which an organization wants to keep records and are hierarchical in nature

Global outliers

are the simplest form of outlier and the easiest to detect

Data mining functionalities

are used to specify the kinds of patterns or knowledge to be found in data mining tasks. The functionalities include characterization and discrimination; the mining of frequent patterns, associations, and correlations; classification and regression; cluster analysis; and outlier detection. As new types of data, new applications, and new analysis demands continue to emerge, there is no doubt we will see more and more novel data mining tasks in the future.

Clustering evaluation

assesses the feasibility of clustering analysis on a data set and the quality of the results generated by a clustering method. The tasks include assessing clustering tendency, determining the number of clusters, and measuring clustering quality

frequent itemset mining

from which association and correlation rules can be derived. Methods include Apriori-like algorithms, frequent pattern growth-based algorithms such as FP-growth, and algorithms that use the vertical data format.

Proximity-based outlier detection methods

assume that an object is an outlier if the proximity of the object to its nearest neighbors significantly deviates from the proximity of most of the other objects to their neighbors in the same data set. Distance-based outlier detection methods consult the neighborhood of an object, defined by a given radius. An object is an outlier if its neighborhood does not have enough other points. In density-based outlier detection methods, an object is an outlier if its density is relatively much lower than that of its neighbors
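
A sketch of the distance-based idea: flag objects whose radius-neighborhood contains too few other points (NumPy assumed; the radius, neighbor threshold, and points are hypothetical):

```python
import numpy as np

def distance_based_outliers(points, radius, min_neighbors):
    """Flag an object as an outlier if its radius-neighborhood contains
    fewer than min_neighbors other points (a simple distance-based sketch)."""
    points = np.asarray(points, dtype=float)
    outliers = []
    for i, p in enumerate(points):
        dists = np.linalg.norm(points - p, axis=1)
        neighbors = np.sum(dists <= radius) - 1    # exclude the point itself
        if neighbors < min_neighbors:
            outliers.append(i)
    return outliers

pts = [(1, 1), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8, 8)]     # last point is isolated
print(distance_based_outliers(pts, radius=1.0, min_neighbors=2))   # -> [4]
```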

Clustering-based outlier detection methods

assume that the normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any clusters.

tree pruning

attempts to improve accuracy by removing tree branches reflecting noise in the data

naive bayesian classification

based on Bayes' theorem of posterior probability. Assumes class-conditional independence: the effect of an attribute value on a given class is independent of the values of the other attributes.
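
A sketch of naive Bayesian classification on categorical attributes: multiply the class prior by per-attribute conditional probabilities under the class-conditional independence assumption (the toy data are hypothetical; Laplace smoothing is omitted for brevity):

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(class) and P(attribute value | class) by relative frequency."""
    prior = Counter(labels)
    cond = defaultdict(Counter)                    # (attribute index, class) -> value counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(j, c)][v] += 1

    def classify(row):
        best, best_score = None, 0.0
        for c, nc in prior.items():
            score = nc / len(labels)               # P(C)
            for j, v in enumerate(row):            # product of P(x_j | C)
                score *= cond[(j, c)][v] / nc
            if score > best_score:
                best, best_score = c, score
        return best
    return classify

# hypothetical training data: (outlook, windy) -> play
rows   = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["yes", "no", "yes", "no"]
classify = train_naive_bayes(rows, labels)
print(classify(("sunny", "no")))    # -> "yes"
```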

applications

business intelligence, Web search, bioinformatics, health informatics, finance, digital libraries, and digital governments.

Outlier detection methods for high-dimensional data

can be divided into three main approaches. These include extending conventional outlier detection, finding outliers in subspaces, and modeling high-dimensional outliers

Concept characterization

can be implemented using data cube (OLAP-based) approaches and the attribute-oriented induction approach. These are attribute- or dimension-based generalization approaches. The attribute-oriented induction approach consists of the following techniques: data focusing, data generalization by attribute removal or attribute generalization, count and aggregate value accumulation, attribute generalization control, and generalization data visualization.

Online Analytical Processing (OLAP)

can be performed in data warehouses/marts using the multidimensional data model. Operations include roll-up, drill-(down, across, through), slice and dice, and pivot (rotate). Manipulation of information to create business intelligence in support of strategic decision making.

Concept comparison

can be performed using the attribute-oriented induction or data cube approaches in a manner similar to concept characterization. Generalized tuples from the target and contrasting classes can be quantitatively compared and contrasted.

confusion matrix

can be used to evaluate a classifier's quality. It shows the counts of true positives, true negatives, false positives, and false negatives.

ensemble methods

can be used to increase overall accuracy by learning and combining a series of individual base classifier models. Bagging, boosting, and random forests are popular ensemble methods.

density-based method

clusters objects based on the notion of density. It grows clusters either according to the density of neighborhood objects (e.g., in DBSCAN) or according to a density function (e.g., in DENCLUE). OPTICS is a density-based method that generates an augmented ordering of the data's clustering structure.

Data integration

combines data from multiple sources to form a coherent data store. The resolution of semantic heterogeneity, metadata, correlation analysis, tuple duplication detection, and data conflict detection contribute to smooth data integration

Bitmapped join indexing

combines the bitmap and join index methods and can be used to further speed up OLAP query processing.

data cube

consists of a lattice of cuboids each corresponding to a different degree of summarization of the given multidimensional data

null invariant pattern evaluation measures

cosine, Kulczynski, max_confidence, all_confidence

hierarchical method

creates a hierarchical decomposition of the given set of data objects. The method can be classified as being either agglomerative (bottom-up) or divisive (top-down), based on how the hierarchical decomposition is formed. To compensate for the rigidity of merge or split, the quality of hierarchical agglomeration can be improved by analyzing object linkages at each hierarchical partitioning (e.g., in Chameleon), or by first performing microclustering (that is, grouping objects into "microclusters") and then operating on the microclusters with other clustering techniques such as iterative relocation (as in BIRCH).

metadata

data defining the warehouse objects. The metadata repository provides details regarding the warehouse structure, data history, algorithms used for summarization, mappings from the source data to the warehouse form, system performance, and business terms and issues

back-end tools and utilities

data extraction, data cleaning, data transformation, loading, refreshing, warehouse management

multidimensional view; major dimensions

data, knowledge, technologies, and applications

Bitmap indexing

each attribute has its own bitmap index table. Bitmap indexing reduces join, aggregation, and comparison operations to bit arithmetic.
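
A sketch of the idea using Python integers as bit vectors (the city attribute and rows are hypothetical):

```python
# hypothetical dimension table: city attribute for 6 rows
city = ["NY", "LA", "NY", "SF", "LA", "NY"]

# build a bitmap (one bit per row) for each distinct value of the attribute
bitmaps = {}
for row, value in enumerate(city):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << row

# the selection "city = NY OR city = SF" becomes a bitwise OR;
# counting qualifying rows becomes a popcount
mask = bitmaps["NY"] | bitmaps["SF"]
print(bin(mask), bin(mask).count("1"))     # rows 0, 2, 3, 5 -> 4 rows
```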

Multidimensional data mining

(also known as exploratory multidimensional data mining) integrates core data mining techniques with OLAP-based multidimensional analysis. It searches for interesting patterns among multiple combinations of dimensions (attributes) at varying levels of abstraction, thereby exploring multidimensional data space

Contextual outlier detection and collective outlier detection

explore structures in the data. In contextual outlier detection, the structures are defined as contexts using contextual attributes. In collective outlier detection, the structures are implicit and are explored as part of the mining process. To detect such outliers, one approach transforms the problem into one of conventional outlier detection. Another approach models the structures directly.

Challenges in Outlier Detection

finding appropriate data models, the dependence of outlier detection systems on the application involved, finding ways to distinguish outliers from noise, and providing justification for identifying outliers as such.

association rule mining

finding frequent itemsets satisfying a minimum support threshold (a percentage of the task-relevant tuples), from which strong association rules of the form A ⇒ B are generated. These rules also satisfy a minimum confidence threshold (a prespecified probability of satisfying B under the condition that A is satisfied). Associations can be further analyzed to uncover correlation rules, which convey statistical correlations between itemsets A and B
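
A sketch of the support and confidence computations on toy baskets (the data are made up):

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # confidence(A => B) = support(A and B together) / support(A)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

baskets = [{"milk", "bread"}, {"milk", "diapers", "beer"},
           {"bread", "diapers", "beer"}, {"milk", "bread", "diapers", "beer"}]

# rule {diapers} => {beer}: support 3/4 = 0.75, confidence 3/3 = 1.0
print(support({"diapers", "beer"}, baskets),
      confidence({"diapers"}, {"beer"}, baskets))
```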

partitioning method

first creates an initial set of k partitions, where parameter k is the number of partitions to construct. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. Typical partitioning methods include k-means, k-medoids, and CLARANS.
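
A sketch of k-means as an iterative relocation procedure (pure Python; the points, k, and iteration count are illustrative):

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """k-means sketch: assign points to the nearest of k centroids, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                   # arbitrary initial centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                # assignment (relocation) step
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, members in enumerate(clusters):          # centroid update step
            if members:
                centroids[i] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # two obvious groups
centroids, clusters = k_means(pts, k=2)
print(centroids)
```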

grid-based method

first quantizes the object space into a finite number of cells that form a grid structure, and then performs clustering on the grid structure. STING is a typical example of a grid-based method based on statistical information stored in grid cells. CLIQUE is a grid-based and subspace clustering algorithm.

k-medoids method

groups n objects into k clusters by minimizing the absolute error. the initial representative objects (called seeds) are chosen arbitrarily

training and test set partitioning methods

holdout, random sampling, cross-validation, bootstrapping

null invariant

if its value is free from influence of null transactions (transactions that do not contain any of the itemsets being examined)

collective outlier

if the objects as a whole deviate significantly from the entire data set, even though the individual data objects may not be outliers. Collective outlier detection requires background information to model the relationships among objects to find outlier groups.

Types of outliers

include global outliers, contextual outliers, and collective outliers. An object may be more than one type of outlier.

Cluster analysis has extensive applications

including business intelligence, image pattern recognition, Web search, biology, and security. Cluster analysis can be used as a standalone data mining tool to gain insight into the data distribution, or as a preprocessing step for other data mining algorithms operating on the detected clusters

frequent pattern mining

searches for interesting associations and correlations between itemsets in transactional and relational databases. Market basket analysis is the earliest form of frequent pattern mining for association rules

cluster analysis (clustering)

is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering; it is a form of unsupervised learning.

outlier

is a data object that deviates significantly from the rest of the objects, as if it were generated by a different mechanism.

Data generalization

is a process that abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels. Data generalization approaches include data cube-based data aggregation and attribute-oriented induction

data warehouse

is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making. The data are stored under a unified schema and are typically summarized. Data warehouse systems provide multidimensional data analysis capabilities, collectively referred to as online analytical processing.

Data quality

is defined in terms of accuracy, completeness, consistency, timeliness, believability, and interpretability. These qualities are assessed based on the intended use of the data.

Concept description

is the most basic form of descriptive data mining. It describes a given set of task-relevant data in a concise and summarative manner, presenting interesting general properties of the data. Concept (or class) description consists of characterization and comparison (or discrimination). The former summarizes and describes a data collection, called the target class, whereas the latter summarizes and distinguishes one data collection, called the target class, from other data collection(s), collectively called the contrasting class(es).

Data mining

is the process of discovering interesting patterns from massive amounts of data. As a knowledge discovery process, it typically involves data cleaning, data integration, data selection, data transformation, pattern discovery, pattern evaluation, and knowledge presentation.

pattern evaluation measures

lift, χ², all_confidence, max_confidence, Kulczynski, and cosine. Kulczynski and the imbalance ratio are suggested for presenting pattern relationships among itemsets.
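
A sketch of these measures computed from relative supports P(A), P(B), and the joint support of A and B (the example supports are made up):

```python
from math import sqrt

def measures(sup_a, sup_b, sup_ab):
    """Pattern evaluation measures from relative supports sup(A), sup(B), sup(A and B)."""
    return {
        "lift":            sup_ab / (sup_a * sup_b),          # not null-invariant
        "all_confidence":  sup_ab / max(sup_a, sup_b),
        "max_confidence":  max(sup_ab / sup_a, sup_ab / sup_b),
        "kulczynski":      0.5 * (sup_ab / sup_a + sup_ab / sup_b),
        "cosine":          sup_ab / sqrt(sup_a * sup_b),
        "imbalance_ratio": abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab),
    }

# e.g. P(A) = 0.6, P(B) = 0.4, P(A and B) = 0.3
print(measures(0.6, 0.4, 0.3))
```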

measures of central tendency

mean, median, mode

frequent pattern growth

method of mining frequent itemsets without candidate generation. It constructs a highly compact data structure (an FP-tree) to compress the original transaction database. Rather than employing the generate-and-test strategy of Apriori-like methods, it focuses on frequent pattern (fragment) growth, which avoids costly candidate generation and results in greater efficiency.

mining frequent itemsets using the vertical data format (ECLAT)

method that transforms a given data set of transactions from the horizontal data format of TID-itemset into the vertical data format of item-TID_set. It mines the transformed data by TID-set intersections based on the Apriori property and additional optimization techniques such as diffset.
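
A sketch of the horizontal-to-vertical transformation and a TID-set intersection (the transactions are made up):

```python
from collections import defaultdict

# horizontal format: TID -> itemset (hypothetical transactions)
horizontal = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "b", "c"}, 4: {"b"}}

# transform to vertical format: item -> TID set
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

# support count of an itemset = size of the intersection of its items' TID sets
tids_ab = vertical["a"] & vertical["b"]
print(vertical["a"], vertical["b"], tids_ab, len(tids_ab))   # {1, 3} {1, 2, 3, 4} {1, 3} 2
```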

Data compression

methods apply transformations to obtain a reduced or "compressed" representation of the original data. The data reduction is lossless if the original data can be reconstructed from the compressed data without any loss of information; otherwise, it is lossy

Numerosity reduction

methods use parametric or nonparametric models to obtain smaller representations of the original data. Parametric models store only the model parameters instead of the actual data. Examples include regression and log-linear models. Nonparametric methods include histograms, clustering, sampling, and data cube aggregation

Numeric Prediction

models continuous valued functions

Data reduction

techniques obtain a reduced representation of the data while minimizing the loss of information content. These include methods of dimensionality reduction, numerosity reduction, and data compression.

types of distributions

normal, uniform, binomial, left skew, right skew

class imbalance problem

occurs when the main class of interest is represented by only a few tuples. strategies to address this problem include oversampling, undersampling, threshold moving, and ensemble techniques.

five-number summary

of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order of Minimum, Q1, Median, Q3, Maximum.

Classification-based outlier detection methods

often use a one-class model. That is, a classifier is built to describe only the normal class. Any samples that do not belong to the normal class are regarded as outliers

concept hierarchies

organize the values of attributes or dimensions into gradual abstraction levels

ROC curves

plot the true positive rate (sensitivity) versus the false positive rate (1 - specificity) of one or more classifiers
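
A sketch of how the (FPR, TPR) points of an ROC curve are obtained by sweeping a decision threshold over classifier scores (the scores and labels are made up):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping a threshold over classifier scores
    (labels: 1 = positive, 0 = negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))      # (1 - specificity, sensitivity)
    return [(0.0, 0.0)] + points

# hypothetical probability scores for six tuples
print(roc_points([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0]))
```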

contextual outlier

deviates significantly with respect to a specific context of the object (e.g., a Toronto temperature value of 28°C is an outlier if it occurs in the context of winter).

Dimensionality reduction

reduces the number of random variables or attributes under consideration. Methods include wavelet transforms, principal components analysis, attribute subset selection, and attribute creation
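
A sketch of principal components analysis via SVD of the centered data matrix (NumPy assumed; the data are made up):

```python
import numpy as np

def pca(X, n_components):
    """PCA sketch: project centered data onto the top principal directions (via SVD)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # rows of Vt are principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# reduce 3 attributes to 1 derived attribute (hypothetical data)
X = [[2.5, 2.4, 0.5], [0.5, 0.7, 1.9], [2.2, 2.9, 0.4], [1.9, 2.2, 0.8]]
print(pca(X, n_components=1))
```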

Full Materialization

refers to the computation of all the cuboids in the lattice defining a data cube. It typically requires an excessive amount of storage space, particularly as the number of dimensions and size of associated concept hierarchies grow. This problem is known as the curse of dimensionality. Alternatively, partial materialization is the selective computation of a subset of the cuboids or subcubes in the lattice. For example, an iceberg cube is a data cube that stores only those cube cells that have an aggregate value (e.g., count) above some minimum support threshold.

similarity measures; evaluations of clusters

clustering methods can be compared regarding partitioning criteria, separation of clusters, similarity measures used, and clustering space. The major method categories are partitioning methods, hierarchical methods, density-based methods, and grid-based methods.

Join indexing

registers the joinable rows of two or more relations from a relational database, reducing the overall cost of OLAP join operations

OLAP server types

relational OLAP, multidimensional OLAP, or a hybrid OLAP implementation. A ROLAP server uses an extended relational DBMS that maps OLAP operations on multidimensional data to standard relational operations. A MOLAP server maps multidimensional data views directly to array structures. A HOLAP server combines ROLAP and MOLAP. For example, it may use ROLAP for historic data while maintaining frequently accessed data in a separate MOLAP store

Data cleaning

routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Data cleaning is usually performed as an iterative two-step process consisting of discrepancy detection and data transformation

Data transformation

routines convert the data into appropriate forms for mining. For example, in normalization, attribute data are scaled so as to fall within a small range such as 0.0 to 1.0. Other examples are data discretization and concept hierarchy generation.
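
A sketch of min-max normalization, one common normalization scheme (the income values are made up):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: scale attribute values into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

# e.g. scaling an income attribute into [0.0, 1.0]
print(min_max_normalize([12000, 73600, 98000]))   # -> [0.0, ~0.716, 1.0]
```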

classification methods

rule-based, Bayesian, decision tree, and ensemble methods

Statistical outlier detection methods

(or model-based methods) assume that the normal data objects follow a statistical model, where data not following the model are considered outliers. Such methods may be parametric (they assume that the data are generated by a parametric distribution) or nonparametric (they learn a model for the data, rather than assuming one a priori). Parametric methods for multivariate data may employ the Mahalanobis distance, the χ²-statistic, or a mixture of multiple parametric models. Histograms and kernel density estimation are examples of nonparametric methods.

clustering requirements

scalability, the ability to deal with different types of data and attributes, the discovery of clusters in arbitrary shape, minimal requirements for domain knowledge to determine input parameters, the ability to deal with noisy data, incremental clustering and insensitivity to input order, the capability of clustering high-dimensionality data, constraint-based clustering, as well as interpretability and usability.

RainForest

a framework for scalable decision tree induction

choosing attributes in training set; attribute selection measure

splitting rules: information gain, gain ratio, Gini index
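
A sketch of information gain as the expected reduction in entropy from splitting on an attribute (the outlook/play toy data are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class label list."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(values, labels):
    """Information gain of splitting on an attribute: entropy before minus expected entropy after."""
    total = len(labels)
    split_entropy = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        split_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - split_entropy

# hypothetical attribute "outlook" vs. class "play"
outlook = ["sunny", "sunny", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",  "yes",  "yes"]
print(information_gain(outlook, play))   # a perfect split, so gain = entropy(play)
```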

technologies; interdisciplinary nature of data mining research and development

statistics, machine learning, database and data warehouse systems, and information retrieval

outlier detection method categories

supervised, semi-supervised, or unsupervised; statistical methods, proximity-based methods, and clustering-based methods.

Decision Tree Induction

a top-down recursive tree induction algorithm, which uses an attribute selection measure to select the attribute tested for each nonleaf node in the tree. ID3, C4.5, and CART are examples of algorithms using different attribute selection measures

Data discretization

transforms numeric data by mapping values to interval or concept labels. Such methods can be used to automatically generate concept hierarchies for the data, which allows for mining at multiple levels of granularity. Discretization techniques include binning, histogram analysis, cluster analysis, decision tree analysis, and correlation analysis. For nominal data, concept hierarchies may be generated based on schema definitions as well as the number of distinct values per attribute.

multidimensional data model

typically used for the design of corporate data warehouses and departmental data marts. such a model can adopt a star schema, snowflake schema, or fact constellation schema. the core is the data cube which consists of a large set of facts or measures and a number of dimensions

data warehouses

used for information processing (querying and reporting), analytical processing (which allows users to navigate through summarized and detailed data by OLAP operations), and data mining (which supports knowledge discovery).

significance test

used to assess whether the difference in accuracy between two classifiers is due to chance

rule based classifier

uses IF-THEN rules for classification. Rules can be extracted from a decision tree, or generated directly from training data using sequential covering algorithms.

