Data Mining

What is attribute-oriented induction?

A query-oriented, generalization-based, online data analysis technique.

Frequent pattern based methods

based on the analysis of frequent patterns

prior probability

the probability of a hypothesis (for example, that a tuple belongs to the class we want to classify for) before any evidence is observed; written P(H)

Cosine Similarity Measure

((NewX * OldX) + (NewY * OldY)) / (Sqrt(NewX^2 + NewY^2) * Sqrt(OldX^2 + OldY^2))
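A minimal Python sketch of the formula above, generalized to vectors of any length. The function name `cosine_similarity` and the sample vectors are illustrative assumptions, not part of the card.

```python
import math

def cosine_similarity(new, old):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(new, old))
    norm_new = math.sqrt(sum(a * a for a in new))
    norm_old = math.sqrt(sum(b * b for b in old))
    if norm_new == 0 or norm_old == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_new * norm_old)

# (3*4 + 4*3) / (5 * 5)
print(cosine_similarity((3, 4), (4, 3)))
```

A value near 1 means the vectors point in nearly the same direction; 0 means they are orthogonal.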

What is a star schema?

A data model composed of a main, central table that holds the most important data. Now, to form the "star", the central table branches out into arms, or dimension tables, which expand on the attributes present in the central table.

Boosting

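An ensemble method similar to bagging, but with weights: each training tuple is weighted, later classifiers pay more attention to the tuples earlier ones misclassified, and each classifier's vote is weighted by its accuracy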

Discretization techniques

Binning: top-down split, unsupervised; Histogram analysis: top-down split, unsupervised; Clustering analysis: unsupervised, top-down split or bottom-up merge; Decision-tree analysis: supervised, top-down split; Correlation analysis: unsupervised, bottom-up merge

Bayesian methods

Compute a distribution of possible clusterings

What are the steps involved in data mining when viewed as a process of knowledge discovery?

1. Data cleaning 2. Data integration 3. Data selection 4. Data transformation 5. Data mining 6. Pattern evaluation 7. Knowledge presentation

Intentional data

Disguised missing data

Frequency

Number of times something happens / total number of events

Greedy search doesn't allow for backtracking

True

K-means doesn't guarantee a global optimum and often terminates at a local optimum

True

Minimum Description Length

best decision tree is the one that requires the fewest bits

Entity Identification Problem

different sources don't always label the same data in the same way.

Consider a data cube measure obtained by applying the sum() function. The measure is

distributive

Pivot

a visualization operation that rotates the data axes in view to provide an alternative data presentation

In attribute-oriented induction, data relevant to the task at hand is collected and then generalization is performed by either attribute generalization or __

attribute removal

Spiral

involves the rapid generation of increasingly functional systems, with short intervals between successive releases. This is considered a good choice for data warehouse development, especially for data marts, because the turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner.

Dissimilarity/similarity metric

expressed in terms of a distance function, whose definition usually differs for interval-scaled, Boolean, categorical, ordinal, ratio, and vector variables. Weights should be associated with different variables based on applications and data semantics

Neural network pros

high tolerance to noisy data; ability to classify untrained patterns; well suited for continuous-valued inputs and outputs; successful on real-world data; algorithms are inherently parallel; techniques have recently been developed to extract rules from trained networks

Complete link

largest distance between an element in one cluster and an element in the other

Bayesian classifiers advantages

in theory, the lowest error rate of any classifier; they also provide a theoretical justification for other classifiers that don't explicitly use Bayes' theorem

Gini index

measures the impurity of a data partition; the split point or splitting subset that gives the minimum Gini index is selected
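As a hedged illustration, the impurity measure can be sketched in Python as Gini(D) = 1 - sum of p_i^2 over the class proportions, with a binary split scored by the size-weighted Gini of its two partitions. The names `gini` and `gini_split` are assumptions made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini impurity of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["yes", "yes", "no", "no"]))        # evenly mixed partition
print(gini_split(["yes", "yes"], ["no", "no"]))  # perfectly pure split
```

The candidate split with the smallest `gini_split` value would be the one chosen.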

Multivariate splits

partitioning the tuples on a combination of attributes rather than on a single attribute

Waterfall

performs a structured and systematic analysis at each step before proceeding to the next

Roll-up

performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction

Predictive mining

performs induction on the current data in order to make predictions

k-medoids algorithm

pick one representative object (medoid) for each cluster and assign each remaining object to the cluster whose representative it is most similar to

Attribute vector

the way each tuple is depicted. Consists of n-dimensions and contains the information that pertains to the attribute

Closed frequent itemset

an itemset that is closed and also meets the minimum support threshold

Data integration

merges data from multiple sources into a coherent data store, such as a data warehouse.

Split point

midpoint of 2 adjacent known values
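A short sketch of how candidate split points could be listed for a numeric attribute, assuming the values are first sorted and de-duplicated. The function name `candidate_split_points` is hypothetical.

```python
def candidate_split_points(values):
    """Midpoints between each pair of adjacent distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(candidate_split_points([30, 10, 20, 20]))
```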

five number summary

minimum, Q1, median, Q3, maximum

Classification step

model is used to predict class labels for given data

Drill-down

navigates from less detailed data to more detailed data. Can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions

Why data visualization

Gain insight into an information space by mapping data onto graphical primitives; provide a qualitative overview of large data sets; search for patterns, trends, structure, irregularities, and relationships among data; help find interesting regions and suitable parameters for further quantitative analysis; provide visual proof of computer representations derived

What is the second step of association rule mining

Generate strong association rules from the frequent itemsets: Creating rules that satisfy both the minimum support and minimum confidence.

Algorithmic methods

agglomerative, divisive and multiphase methods, they consider data objects as deterministic and compute clusters according to the deterministic distances between objects

Data migration tools

allow transformations to be specified

ETL (Extraction/Transformation/Loading) tools

allow users to specify transformations through a graphical user interface

Gain ratio

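applies a normalization to information gain using a split information value. For each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. The attribute with the maximum gain ratio is selected, with the constraint that its information gain must be at least as great as the average gain over all tests examined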
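A minimal sketch of the normalization, assuming the usual definitions SplitInfo = -sum((|Dj|/|D|) * log2(|Dj|/|D|)) and GainRatio = Gain / SplitInfo. The function names and the idea of passing the partition sizes directly are assumptions made for this example.

```python
import math

def split_info(partition_sizes):
    """Split information of a test that partitions D into the given sizes."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total)
                for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    """Information gain normalized by split information."""
    si = split_info(partition_sizes)
    if si == 0:
        raise ValueError("split information is zero: all tuples share one outcome")
    return gain / si

print(split_info([2, 2]))        # two equal partitions
print(gain_ratio(0.5, [2, 2]))
```

A test with many small partitions has a large split information, which lowers its gain ratio relative to its raw information gain.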

Analytical processing

supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It generally operates on historic data in both summarized and detailed forms. The major strength is the multidimensional data analysis of data warehouse data.

posterior probability

the probability that a hypothesis H holds given the observed evidence X, for example that a data tuple belongs to a class given its known attribute values; written P(H|X)

Cluster analysis

the process of partitioning a set of data objects into clusters. Unsupervised, learns by observation

Dimensionality reduction

the process of reducing the number of random variables or attributes under consideration. Methods include wavelet transforms, principal components analysis, and attribute subset selection

Radius of a cluster

the square root of the average squared distance from the points of the cluster to the centroid

rule based classifier

uses a set of if-then rules for classification

What are some of the challenges to consider and the techniques employed in data integration?

Entity identification problem; redundancy; tuple duplication; data value conflict detection and resolution. Correlation analysis is one technique used in data integration to detect redundancy

Time-variant

Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.

Not all numerical data sets have a median. (T/F)

False

What is the first step of association rule mining

Find all frequent itemsets: in other words, find each itemset that occurs at least as frequently as a predetermined minimum support count (the minimum number of transactions that must contain the itemset).

Quartile

Find the median, split the sorted data at the median, and take the median of each half; those are the first and third quartiles
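The median-of-halves procedure can be sketched in Python as follows. One assumption is made explicit here: for an odd number of values, the overall median is excluded from both halves (other conventions exist and give slightly different quartiles). The function names are hypothetical.

```python
def median(sorted_vals):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    lower = data[: n // 2]          # values below the median position
    upper = data[(n + 1) // 2 :]    # values above the median position
    return median(lower), median(data), median(upper)

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quartiles(values)
    return min(values), q1, med, q3, max(values)

print(five_number_summary([7, 1, 3, 5, 2, 6, 4]))
```

The interquartile range follows directly as Q3 minus Q1.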

Intuitively, the roll-up OLAP operation corresponds to concept ___ in a concept hierarchy

ascension

associative classification

association rules are generated from frequent patterns and used for classification

Naive Bayesian Classifier

assumes that the effect of an attribute value on a given class is independent of the values of the other attributes (class-conditional independence)

Pessimistic pruning

uses the training set to determine error rates

Numeric prediction

where a model predicts a continuous-valued function, or ordered value, so the order of the predicted values DOES matter

Does an outlier need to be discarded always?

In most cases of data mining, outliers are discarded. However, there are special circumstances, such as fraud detection, where outliers can be useful.

data warehouse applications

Information Processing Analytical Processing Data mining

Maximal Frequent Itemset

A frequent itemset that has no frequent superset. Every maximal frequent itemset is also a closed frequent itemset.

Relationship between association rules and minimum support threshold

When minimum support is low, there exist potentially an exponential number of frequent itemsets

Hierarchical clustering method:

Works by grouping data objects into a hierarchy or tree of clusters. Doesn't require the # of clusters, but needs a termination condition

Separation of clusters

either mutually exclusive (only belong to one cluster) or data can belong to more than 1 cluster

Clustering as a preprocessing tool

for regression, PCA, attribute subset selection, image processing, vector quantization, finding k-nearest neighbors, outlier detection

Data Value Conflict Detection and Resolution

for the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding

partitioning methods

given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster. Mostly distance based, they construct partitions and evaluate them by some criterion (k-means, k-medoids, CLARANS)

Robustness

handling noise and missing values

Discrete Attribute

has a finite or countably infinite set of values

Branching factor

specifies the maximum number of children per nonleaf node

Bayesian classifiers

statistical classifiers that can predict class membership probabilities. They can predict if a tuple belongs to a specific class

What is a closed itemset

An itemset is closed when no proper superset of it has the same support count. For example, an itemset A = {a, b} is not closed if there also exists an itemset B = {a, b, c} with the same support as A, because B is a superset of A that occurs in just as many transactions.
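A small sketch of the closedness check, under the assumption that the support counts of all relevant itemsets are already available in a dictionary. The name `is_closed` and the sample counts are made up for illustration.

```python
def is_closed(itemset, support_counts):
    """True if no proper superset in support_counts has the same support."""
    itemset = frozenset(itemset)
    own = support_counts[itemset]
    return not any(itemset < other and count == own
                   for other, count in support_counts.items())

# Hypothetical support counts keyed by itemset
counts = {
    frozenset("a"): 4,
    frozenset("ab"): 3,
    frozenset("abc"): 3,
}

print(is_closed("ab", counts))   # {a, b, c} has the same support
print(is_closed("abc", counts))
```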

Concept hierarchy generation for nominal data

Where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.

Attribute construction (or feature construction)

Where new attributes are constructed and added from the given set of attributes to help the mining process.

Aggregation

Where summary or aggregation operations are applied to the data. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels.

Tuple Duplication

there are two or more identical tuples for a given unique data entry case

K-modes method

variant of k-means that uses cluster modes instead of means, so it can cluster nominal data

PAM cons

doesn't scale well for large data sets. Improvements (CLARA, CLARANS)

Binning

smooths out the values around the noise by placing data into bins based on mean, median, boundaries, etc.
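One way to picture smoothing by bin means, assuming equal-frequency bins over the sorted values. The function name `smooth_by_bin_means` and the sample prices are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, cut them into bins of bin_size, replace each by its bin mean."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_values = data[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
```

Smoothing by bin medians or bin boundaries would follow the same loop, swapping only the replacement value.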

Probabilistic methods

use probabilistic models to capture clusters and measure the quality of cluster by the fitness of models

Data scrubbing

use simple domain knowledge to detect errors and make corrections

Information gain

used by ID3 as an attribute selection measure. Gain(A) = Info(D) - Info_A(D). The attribute with the highest gain is chosen as the splitting attribute
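A hedged sketch of the gain computation, taking Info(D) to be the entropy of the class labels and Info_A(D) the size-weighted entropy after partitioning on attribute A. The dict-per-tuple data layout and the names `info` and `info_gain` are assumptions for this example.

```python
import math
from collections import Counter

def info(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attribute, label):
    """Gain(A) = Info(D) - Info_A(D) for rows given as dictionaries."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[label])
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info([row[label] for row in rows]) - info_a

rows = [
    {"windy": "yes", "temp": "hot", "play": "no"},
    {"windy": "yes", "temp": "cool", "play": "no"},
    {"windy": "no", "temp": "hot", "play": "yes"},
    {"windy": "no", "temp": "cool", "play": "yes"},
]
print(info_gain(rows, "windy", "play"))  # windy separates the classes perfectly
print(info_gain(rows, "temp", "play"))   # temp tells us nothing about play
```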

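Cost complexity pruning

used in CART; considers the cost complexity of a tree to be a function of the number of leaves and the error rate. Compares the cost complexity of the unpruned subtree with that of its pruned version; if the pruned version's is smaller, the subtree is pruned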

Sequential Covering Algorithm

used to extract if then rules from the data

Attribute generalization

If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute

Holistic measures

A measure is holistic if it is obtained by applying a holistic aggregate function, that is, one for which there is no constant bound on the storage size needed to describe a subaggregate: there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank().
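A tiny illustration of why median() cannot be assembled from per-partition medians alone. The partitioning into `chunks` is an arbitrary assumption chosen to make the mismatch visible.

```python
from statistics import median

chunks = [[1, 2, 3], [4, 5], [6]]
all_values = [v for chunk in chunks for v in chunk]

# Computed from the raw values
true_median = median(all_values)

# Combined from one sub-aggregate per chunk; not the same thing
median_of_medians = median(median(chunk) for chunk in chunks)

print(true_median, median_of_medians)
```

Because the two results differ, a fixed-size summary per partition is not enough in general; the raw values (or an unbounded summary) are needed.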

Handling Missing Data

Ignore the tuple; fill in the missing value manually; fill it in automatically with a global constant, the attribute mean, the attribute mean for all samples belonging to the same class, or the most probable value (inference based, such as a Bayesian formula or decision tree)

Requirements for clustering

Scalability; ability to deal with different types of attributes; discovery of clusters with arbitrary shape; requirements for domain knowledge to determine input parameters; ability to deal with noisy data; incremental clustering and insensitivity to input order; capability of clustering high-dimensionality data; constraint-based clustering; interpretability and usability

K-means algorithm

uses the centroid of a cluster to represent that cluster.
1. Choose (or be given) the initial centroids
2. Assign each object to the cluster whose centroid is nearest by Euclidean distance
3. Compute the mean of the objects in each cluster
4. Repeat step 2, using the means as the new centroids
5. Stop when the assignments no longer change, or when another stopping condition is reached
Complexity: O(nkt) for n objects, k clusters, and t iterations
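The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the initial centroids are passed in by the caller, an empty cluster simply keeps its old centroid, and the names `kmeans` and `max_iter` are assumptions.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Return (centroids, clusters) after iterating assignment and update."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its members
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # assumption: keep an empty cluster's centroid
        # Step 5: stop when nothing moves
        if new_centroids == centroids:
            break
        centroids = new_centroids  # Step 4: repeat with the new means
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = kmeans(points, [(1, 1), (8, 8)])
print(final_centroids)
```

Different initial centroids can lead to different final clusters, which is the local-optimum behavior noted elsewhere in this set.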

Partitioning Around Medoids (PAM)

a popular realization of k-medoids. Randomly chooses k objects as the initial medoids, then assigns each remaining object to the nearest medoid. Randomly selects a nonmedoid object, computes the total cost of swapping it with a medoid, and swaps if the clustering improves.

The ___ OLAP operation performs a selection on one dimension of the given cube

slice

Smoothing

Works to remove noise from the data. Techniques include binning, regression, and clustering.

Grid-based methods

quantize the object space into a finite number of cells that form a grid structure. All the clustering operations are performed on the grid and are therefore fast, since they depend on the number of cells, not on the number of objects.

Bootstrap method

randomly selects tuples for the training set. Uses sampling with replacement, meaning a tuple can be selected twice.
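A minimal sketch of one bootstrap round, assuming the tuples never drawn into the training sample are used as the test set. The function name `bootstrap_split` and the `seed` parameter are illustrative.

```python
import random

def bootstrap_split(tuples, seed=None):
    """Sample len(tuples) items with replacement; unsampled items form the test set."""
    rng = random.Random(seed)
    n = len(tuples)
    chosen = [rng.randrange(n) for _ in range(n)]   # indices may repeat
    picked = set(chosen)
    train = [tuples[i] for i in chosen]
    test = [t for i, t in enumerate(tuples) if i not in picked]
    return train, test

train, test = bootstrap_split(list(range(50)), seed=0)
print(len(train), len(test))
```

On average roughly 63.2% of the distinct tuples end up in the training sample, which is where the ".632 bootstrap" name comes from.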

Nominal attribute

refer to symbols or names of things. Categorical. It can also be represented using a number, however, they are not meant to be used quantitatively. Has no median, but has a mode

Data cleaning

remove noise and correct inconsistencies in the data.

Postpruning

removes subtrees from a fully grown tree

In the ___ schema some dimension tables are normalized generating additional tables

snowflake

threshold parameter

specifies the maximum diameter of the subclusters stored at the leaf nodes

In data warehouse development, with the ___ process changes to requirements can be resolved faster

spiral

Data discrimination

comparison of the target class with one or a set of comparative classes

Objective measures of pattern interestingness

confidence and support

multilayer feed-forward neural network

consists of an input layer, one or more hidden layers, and an output layer

Density-based methods

their general idea is to continue growing a cluster as long as the density in the neighborhood exceeds some threshold (DBSCAN, OPTICS, DenClue)

Similarity measure of clusters

they can be distance based or connectivity based

Interquartile range

third quartile - first quartile

Data compression

transformations are applied so as to obtain a reduced or "compressed" representation of the original data.

BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)

uses bootstrapping to create several smaller subsets of the training data, each of which fits in memory. A tree is constructed from each subset, and these trees are then used to construct a general one. Only requires two scans of the data and can be used for incremental updates

KNN

when given an unknown tuple, it searches the pattern space for the k training tuples that are closest to the unknown tuple, with closeness defined by a distance metric such as Euclidean distance
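A small sketch of the idea, assuming Euclidean distance and a simple majority vote among the k nearest training tuples. The function name `knn_classify` and the sample data are hypothetical.

```python
import math
from collections import Counter

def knn_classify(training, unknown, k=3):
    """training: list of (point, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], unknown))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"),
]
print(knn_classify(training, (2, 2), k=3))
print(knn_classify(training, (8, 9), k=3))
```

No model is built ahead of time; all the work happens when the unknown tuple arrives, which is what makes this a lazy learner.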

Algebraic measures

A measure is algebraic if it is obtained by applying an algebraic aggregate function, that is, one that can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function. For example, avg() (average) can be computed by sum()/count(), where both sum() and count() are distributive aggregate functions.
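To make the avg() example concrete, here is a sketch that combines per-partition sum() and count() values into an overall average. The partitioning into `chunks` and the function names are assumptions for illustration.

```python
def partial_aggregates(values):
    """Distributive sub-aggregates for one partition: (sum, count)."""
    return sum(values), len(values)

def combined_avg(partials):
    """avg() over all partitions, using only the stored (sum, count) pairs."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

chunks = [[1, 2, 3], [4, 5], [6]]
partials = [partial_aggregates(chunk) for chunk in chunks]
print(combined_avg(partials))
```

Only two numbers per partition need to be stored, no matter how many values the partition holds.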

Concept Description vs. Cube Based OLAP

Similarities: data generalization; presentation of data summarization at multiple levels of abstraction; interactive drilling, pivoting, slicing, and dicing. Differences: OLAP has systematic preprocessing, is query independent, and can drill down to a rather low level; AOI has automated allocation of the desired level, may perform dimension relevance analysis/ranking when there are many relevant dimensions, and works on data that are not in relational form

Data mining and society challenges

Social impacts of data mining; privacy-preserving data mining; invisible data mining

snowflake schema

a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake

BIRCH scan 2

applies a selected clustering algorithm to the leaf nodes on the tree that removes the sparse clusters as outliers and groups the dense clusters into larger ones

Support vector machines

classification method for linear and nonlinear data. Uses nonlinear mapping to transform the training data to a higher dimension, then it searches for the linear optimal separating hyperplane using support vectors and margins

Scalability

clustering all the data instead of only samples

DIANA (Divisive Analysis)

works in the inverse order of AGNES: all objects start in a single cluster, which is split according to the maximal distance between neighboring objects, until eventually each object forms a cluster of its own.

Concept description

is the most basic form of descriptive data mining. It describes a given set of task-relevant data in a concise and summative manner, presenting interesting general properties of the data. It consists of characterization and comparison (or discrimination).

Neural network disadvantages

long training time; number of parameters is typically determined empirically; poor interpretability

Training set

made up of database tuples and their associated class labels. Used to teach the classifier in the learning step

Test set

made up of test tuples and their class labels. It is used to test the classifier, it doesn't take part in the training process.

RainForest

maintains an AVC set at each node describing the training tuples at the node

A major distinguishing feature of an online analytical processing system is that

manages large amounts of historic data

SVM applications

numeric prediction and classification, including handwritten digit recognition, object recognition, speaker identification, and benchmark time-series prediction tests

Link-based clustering methods

objects are often linked together in various ways (SimRank, LinkClus)

CLARANS (Clustering Large Applications based upon Randomized Search)

randomly selects k objects in the data set as current medoids, then selects a current medoid and an object and swap only if it improves the absolute error criterion.

Partitioning criteria

single-level partitioning, where no cluster is nested under another, vs. hierarchical partitioning (often preferred)

Divisive approach (top-down)

starts with all objects in the same cluster. In each iteration a cluster is split into smaller clusters, until eventually each object is in a cluster of its own, or a termination condition holds.

Splitting subset

subset of the known values of a splitting attribute

Relevance Analysis

Find attributes which best distinguish different classes

Partial materialization

is the selective computation of a subset of the cuboids or subcubes in the lattice

Data mining turns data into organized ______

knowledge

Decision tree induction

learning of decision trees from class-labeled training tuples

Backpropagation

learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the known target value. For each training tuple, the weights are modified to minimize the mean squared error between the prediction and the target

Entropy of D

The average amount of information needed to identify the class label of a tuple in D

Normalization

Where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
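As one hedged example of such scaling, min-max normalization maps the smallest value to the new minimum and the largest to the new maximum. The function name and default range are assumptions; other schemes (z-score, decimal scaling) exist as well.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print(min_max_normalize([10, 20, 30]))
print(min_max_normalize([10, 20, 30], -1.0, 1.0))
```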

Discretization

Where the raw values of a numeric attribute are replaced by interval labels or conceptual labels. The labels can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users. (Concept hierarchy climbing)

Numeric Attributes

Quantitative; that is, it is a measurable quantity, represented in integer or real values. Can be interval-scaled or ratio-scaled.

Learning step

classification algorithm builds the classifier by analyzing a training set

High intra-class similarity

cohesive within clusters

Metadata

describe or define warehouse elements

Clustering feature

a three-component vector (N, LS, SS) summarizing information about a cluster of objects: the number of objects, their linear sum, and their square sum
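A brief sketch of a clustering feature as used in BIRCH, assuming the form CF = (N, LS, SS) with N the number of points, LS their per-dimension linear sum, and SS kept here as a single scalar sum of squares (some presentations keep SS per dimension). The function names are hypothetical.

```python
def clustering_feature(points):
    """Summarize a non-empty list of points as (N, LS, SS)."""
    n = len(points)
    ls = tuple(sum(axis) for axis in zip(*points))
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CFs are additive: merging two clusters just adds the components."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2

a = [(3, 4), (2, 6), (4, 5)]
b = [(4, 7), (3, 8)]
print(merge_cf(clustering_feature(a), clustering_feature(b)))
```

The additivity is what lets a CF tree absorb new points and merge subclusters without revisiting the raw data.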

Clustering space

clusters are sought either in the entire given space (common for low-dimensional data) or in different subspaces (common for high-dimensional data)

Ensemble methods

combine models to increase accuracy, to come up with an improved composite classification model

CHAMELEON

explores dynamic modeling in hierarchical clustering. Cluster similarity is based on how well connected objects are within a cluster and the proximity of them

Data source view

exposes the information being captured, stored, and managed by operational systems

Iterative relocation technique

improves partitioning by moving objects from one group to another

Overfit

the model incorporates particular anomalies of the training data that are not present in the general data set overall

Training tuples

individual tuples making up the training set, are randomly sampled from the database under analysis

Transactional Database

captures a transaction which typically includes an ID and a list of items that make it up

outlier analysis

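Detects data objects that deviate from the general behavior of the data; such outliers may be discarded as noise or studied in their own right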

Nonvolatile

a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

Enterprise warehouse

a data warehouse model that collects all of the information about subjects spanning the entire organization

Data mart

a data warehouse model that contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects.

In an online analytical processing system, the typical unit of work is

a complex query (mostly read-only)

Top-down view

allows the selection of the relevant information necessary for the data warehouse. This information matches current and future business needs.

Data auditing

analyzing data to discover rules and relationship to detect violators

Data warehouse view

consists of fact tables and dimension tables

Attribute removal

If there is a large set of distinct values for an attribute of the initial working relation, but either there is no generalization operator on the attribute, or its higher-level concepts are expressed in terms of other attributes, then the attribute should be removed from the working relation.

Apriori pruning principle

If there is any itemset that is infrequent, its superset should not be generated/tested!

Relational OLAP (ROLAP)

Uses a relational or extended-relational DBMS to store and manage warehouse data, with OLAP middleware on top. Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services. Offers greater scalability

Classification

when a classifier is constructed to predict categorical class labels, represented by discrete values with no relevant order. A form of data analysis that extracts models describing important data classes

Nearest neighbor clustering algorithm

when an algorithm uses minimum distance to measure the distance between clusters

Constraint-based clustering

User may give inputs on constraints, use domain knowledge to determine input parameters.

Eager learners

when given a set of training tuples will construct a generalization model before receiving new tuples

Lazy learners

when given a set of training tuples, simply stores them and waits until it is given a test tuple to classify

farthest neighbor clustering algorithm

when the algorithm uses max distance to measure cluster distance

Minimal spanning tree algorithm

when the spanning tree of a graph connects all vertices and has the least sum of edge weights

What is a snowflake schema?

A model with a central fact table and a set of constituent dimension tables which are further normalized into sub-dimension tables.

Bayes theorem

Classifies by calculating the posterior probabilities

Please discuss data generalization and some of the concepts associated to it.

"Summarizes data by replacing relatively low-level values with higher-level concepts, or by reducing the number of dimensions to summarize data in concept space involving fewer dimensions."

Data Mining Applications

- Business intelligence
- Web search engines
- Web page analysis
- Basket data analysis for targeted marketing
- Biological and medical data analysis

Attribute Selection Measures Biases

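- Information gain is biased toward multivalued attributes
- Gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others
- Gini index is also biased toward multivalued attributes, has difficulty when the number of classes is large, and tends to favor tests that result in equal-sized partitions with purity in both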

Dice

Defines a subcube by performing a selection on two or more dimensions

Why is data integration necessary?

It is used to combine multiple sources of the same type of data. The more sources the better in case of bias and the more data the better in general.

Rules can be pruned

True

Interesting patterns

- easily understood by humans
- valid on new or test data with some degree of certainty
- potentially useful
- novel
- validates a hypothesis the user sought to confirm
- represents knowledge

K-means cons

- not suitable for discovering clusters with nonconvex shapes or clusters of very different size
- sensitive to noise and outliers
- the number of clusters must be specified in advance
- applicable only when the mean of a cluster is defined, i.e., to numeric data

Making k-means better

- use a good-sized set of samples in clustering
- employ a filtering approach that uses a spatial hierarchical data index to save costs when computing means
- group nearby objects into microclusters and perform k-means on the microclusters

How many cuboids are there in a 6-dimensional data cube if there were no hierarchies associated to any dimension?

64
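The count follows from each dimension being either present or aggregated away, giving 2^6 = 64. A small sketch, assuming the general formula of multiplying (levels + 1) over all dimensions, where a dimension without a hierarchy has one level; the function name is illustrative.

```python
from math import prod

def count_cuboids(levels_per_dimension):
    """Total cuboids = product of (L_i + 1); the +1 is the aggregated 'all' level."""
    return prod(levels + 1 for levels in levels_per_dimension)

print(count_cuboids([1] * 6))   # six dimensions, no hierarchies
print(count_cuboids([4, 1]))    # e.g. a 4-level time hierarchy plus one flat dimension
```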

Relational database

A set of tables, each consisting of a set of attributes (columns) and storing tuples (rows) that represent entities identified by keys

Single-linkage algorithm

if the clustering is terminated when the distance between clusters is greater than the user defined threshold

What are association rules?

if-then statements that help to show the probability of relationships between data items within large data sets in various types of databases.

BIRCH scan 1

scans the database to build an in memory CF tree that is a multilevel compression of the data and tries to preserve the structure

Business query view

sees the perspectives of data in the warehouse from the view of end user

SVM features

training can be slow but accuracy is high

Tree pruning

Removing noise branches

Rules are strong/frequent if the support calculated is greater than or equal to the minimum support given (T/F)

True

Redundancy

an attribute is redundant if it can be derived from another attribute or set of attributes.

What is an itemset

A collection of one or more items

Low inter-class similarity

Distinctive between clusters

Random subsampling

repeating the holdout method k times. Accuracy will be the average of the accuracies of each iteration

Incremental decision tree induction

restructure the decision tree when new training tuples are processed

Virtual Warehouse

A set of views over operational databases. Only some of the possible summary views may be materialized

Data transformation strategies

1. Smoothing 2. Attribute construction (or feature construction) 3. Aggregation 4. Normalization 5. Discretization 6. Concept hierarchy generation for nominal data

What is data mining?

The process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis

Confidence

Times A and B happen / times A happens
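A short sketch of the ratio for a rule A => B over a list of transactions, with support included since confidence is built from it. The function names and the grocery transactions are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(A and B) / support(A) for the rule A => B."""
    support_a = support(transactions, antecedent)
    if support_a == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return support(transactions, set(antecedent) | set(consequent)) / support_a

transactions = [
    {"milk", "bread"},
    {"milk"},
    {"bread"},
    {"milk", "bread", "eggs"},
]
print(support(transactions, {"milk", "bread"}))
print(confidence(transactions, {"milk"}, {"bread"}))
```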

What is a data cube?

A common multi-dimensional model that is a step above a basic 2-d data chart

What is dice?

A slice on more than one dimension of a data cube

Data reduction

can reduce the data size by aggregating, eliminating redundant features, or clustering.

Data classification

A two-step process 1)learning step, 2)classification step

Apriori property

The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent

User-guided or constraint based methods

clustering by considering user specified or application specified constraints (COD, constrained clustering)

Descriptive mining

characterizes properties of the data in a target data set

Holdout method

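data is randomly partitioned into two independent sets, a training set and a test set; typically two-thirds of the data is used for training and one-third for testing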

CLARA (Clustering Large Applications)

takes a sample of the data, then uses PAM algorithm

Slice

Performs a selection on one dimension of the given cube, resulting in a subcube.

In an online transaction processing system, the typical unit of work is

a short, simple transaction

quality of clustering

a separate quality function that measures the goodness of a cluster

Apriori Steps

1. Scan every item and count its occurrences
2. Keep the items whose count >= minimum support
3. Join the kept items with each other to form candidate pairs
4. Repeat step 2 on the candidates
5. If any candidates were kept, repeat step 3, adding one more element to each candidate
6. Repeat step 2
7. Keep repeating steps 5 and 6 until no more candidates are kept
8. The itemsets kept in the last non-empty round, where count >= minimum support, are the largest frequent itemsets; every itemset kept in an earlier round is frequent as well
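The steps above can be sketched as a compact level-wise search. This is an illustrative sketch rather than an optimized implementation: it returns every frequent itemset with its count (not only the last round), uses an absolute support count, and the name `apriori` plus the sample transactions are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Steps 1-2: count single items and keep the frequent ones
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join: combine kept itemsets into candidates of size k
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Keep only the candidates that meet minimum support
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(transactions, min_support=3)
print(sorted(sorted(s) for s in result))
```

The prune step is the Apriori property at work: a candidate with any infrequent subset is dropped before its support is ever counted.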

Discuss the steps associated to the design of a data warehouse.

1. Choose a business process to model. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen. 2. Choose the business process grain, which is the fundamental, atomic level of data to be represented in the fact table for this process. 3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status. 4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold.

Correlation

A calculation used to determine how dependent or independent attributes are with each other. Its analysis is used to keep redundancy in check.

What is a fact constellation?

A composite of the previous schemas. Here, there can be more than one central table; and these tables can share dimensional tables. It could be thought of as a collection of stars.

Integrated

A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records

What is a data cube measure? Any examples?

A numeric function that can be evaluated at each point in the data cube space. Examples include the sum or the average of the data.

Data transformations

A function that maps the entire set of values of a given attribute to a new set of replacement values, each old value can be identified with one of the new values. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.

What do we understand by "multidimensional data model"?

A model for, usually, themed databases. They are used to categorize data into specializations such as dates, locations, and counts. The multi-dimensional model comes into its own as these broad specializations can be further broken down, say as dates could change from years to months or months to days.

Binary Attributes

A nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present.

Describe the Spiral Method

A sequence of waterfalls, considered a "risk-oriented iterative enhancement". The spiral method is usually the method of choice for developing data warehouses because it is an iterative process.

Data warehouse

A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process.

What is a data warehouse?

A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process

What is slice?

A subset of a multidimensional array corresponding to a single value set for one or more of the dimensions not in the subset

Data Characterization

A summary of the general characteristics or features of a target class of data. The data corresponding to the user-specified class is typically collected by a query. For example, to study the characteristics of software products with sales that increased by 10% in the previous year, the data related to such products can be collected by executing an SQL query on the sales database.

Discuss one of the factors comprising data quality and provide examples.

Accuracy, completeness, consistency, timeliness, believability, interpretability

Explain one challenge of mining a huge amount of data in comparison with mining a small amount of data.

Algorithms that deal with data need to scale nicely so that even vast amounts of data can be handled efficiently, and take short amounts of time

Ordinal Attributes

An attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.

What is an outlier?

An object which does not fit in with the general behavior of the model.

What are some of the differences between operational database systems and data warehouses?

An operational database query allows both read and modify operations, while an OLAP query needs only read-only access to stored data. An operational database maintains current data, whereas a data warehouse maintains historical data and provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides Online Analytical Processing (OLAP) tools.

What do we understand by "frequent patterns"? How are they used in data mining? Please provide examples.

Patterns that appear frequently in a data set. There are three categories of these patterns: itemsets, subsequences, and substructures. They are useful for discovering associations and correlations between items in a data set, which can help businesses make smart marketing decisions. One example is market basket analysis, which determines which items customers frequently purchase together (for instance milk and bread, or computers and antivirus software).

Data mining diversity of data types challenges

Handling complex types of data; mining dynamic, networked, and global data repositories

neural network

a set of connected input/output units in which each connection has a weight associated with it

How would you catalog a boxplot, as a measure of dispersion or as a data visualization aid? Why?

As a data visualization aid. The boxplot shows how the boundaries relate to each other visually, where the minimum, maximum values lie, and the Interquartile ranges with a line signifying the median. It does not give you a specific measure, but allows you to somewhat visualize the data set. For example, if you have a boxplot for the grades in a class, if the box is closer to the minimum boundary then you can see that most scores were low.

Frequent itemset applications

Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis; association, correlation, and causality analysis; sequential and structural patterns; pattern analysis in spatiotemporal, multimedia, time-series, and stream data; classification: discriminative, frequent pattern analysis; cluster analysis: frequent pattern-based clustering; data warehousing: iceberg cube and cube-gradient; semantic data compression: fascicles; broad applications

Cluster analysis applications

Business intelligence: organize a large number of customers; project management: partition projects into categories; image recognition: handwritten character recognition systems; Web search: organize search results for accessibility; biology: taxonomy; information retrieval: document clustering; land use: identification of areas of similar land; marketing: help marketers discover different groups of customers and develop plans for them; city planning: identifying groups of houses based on characteristics; earthquake studies: clustering epicenters along continent faults; climate: find patterns of atmosphere and ocean; economic science: market research; stand-alone tool: get insight into data distribution; pre-processing step: make other algorithms work better

DBMS

Consists of a database and a software to manage and access that data

AVC set

Contains an attribute, value, class label information to make the tree take less memory

Why is data quality important

Data can become difficult to analyze, hard to use, unreliable, outdated. In other words, having a database with bad quality can defeat the whole purpose of having a database.

How can the data be preprocessed in order to help improve its quality?

Data cleaning, data integration, data reduction, data transformations

Knowledge Discovery Process

Data cleaning, then data integration (all inside the database). The data are moved to the data warehouse, where they undergo data selection for task-relevant data. Data mining is then performed, which leads to many representations of the data. Patterns are evaluated, and finally knowledge is presented.

Quantile plot

a simple and effective way to have a first look at a univariate data distribution

Why is data mining important?

Data mining turns a large collection of data into knowledge. We live in the information and technology age and have tons of information, but we want more knowledge.

Online Analytical Mining importance

High quality of data in data warehouses; available information processing structure surrounding data warehouses; OLAP-based exploratory data analysis; on-line selection of data mining functions

Please discuss the meaning of noise in data sets and the methods that can be used to remove the noise (smooth out the data).

Noise is the random error or variance found in a measured variable; noisy values often show up as outliers. Methods for smoothing it out include binning, regression, and outlier analysis.
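
As an illustration of binning, here is a minimal sketch of smoothing by bin means, assuming equal-frequency bins over the sorted values. The function name `smooth_by_bin_means` and the `bin_size` parameter are hypothetical names chosen for this example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the mean of the bin it falls into."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed
```

With the values 4, 8, 15, 21, 21, 24, 25, 28, 34 and a bin size of 3, the three bins have means 9, 22, and 29, so each value is replaced by one of those three numbers.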

Quality of a clustering method

Depends on: the similarity measure used by the method; the method implementation; the ability of the method to discover some or all of the hidden patterns

Histogram

Differs from a bar chart in that it is the area of the bar that denotes the value

Characteristics of structured data

Dimensionality, sparsity, resolution, distribution

Discriminant rules

Discrimination descriptions expressed in the form of rules

Data mining efficiency and scalability challenges

Efficiency and scalability of data mining algorithms; parallel, distributed, stream, and incremental mining methods

Clustering

Gathering; forming in a group. Detecting and removing outliers

quantile-quantile plot, or q-q plot

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another

Multiphase clustering

Integrate hierarchical clustering with other clustering techniques

Data mining user interaction challenges

Interactive mining; incorporation of background knowledge; ad hoc data mining and data mining query languages; presentation and visualization of data mining results

Describe the waterfall method.

It is similar to going down a flight of steps: to reach the bottom, every step must be completed. It corresponds to the waterfall methodology used in software development. The model is a linear sequence of activities and requirements, structured so that each task relies on the completion of the previous one. The steps include system design, detailed design, test, performance, and maintenance, among others.

Describe Data Mining

It is all about discovering new and hidden patterns, performing predictions and displaying what was mined using visual tools

Is data cube technology sufficient to accomplish all kinds of concept description tasks for large data sets?

It is not sufficient to accomplish all kinds of concept description tasks for large data sets, for two main reasons. First, concept description should handle complex data types, whereas OLAP restricts the possible dimension and measure types (dimensions to non-numeric data, measures to numeric data) and so represents a simplified model for data analysis. Second, it is too complicated for most users.

Alternate names for data mining

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc

What is a metadata repository and what are some of the elements it should contain?

A metadata repository holds the data that define warehouse objects. It contains: a description of the structure of the data warehouse (schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents); operational metadata, such as data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, audit trails); the algorithms used for summarization; the mapping from the operational environment to the data warehouse; data related to system performance, such as warehouse schema, view, and derived data definitions; and business data, such as business terms and definitions, ownership of data, and charging policies.

Normalization by decimal scaling

Normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A: $v'_i = v_i / 10^j$, where $j$ is the smallest integer such that $\max(|v'_i|) < 1$.
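
A minimal sketch of decimal scaling follows, assuming j is chosen as the smallest integer that brings every scaled absolute value below 1. The function name `decimal_scaling` is hypothetical.

```python
def decimal_scaling(values):
    """Return (scaled_values, j), where each value is divided by 10**j and
    j is the smallest integer that makes every absolute value less than 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j
```

For values ranging from -986 to 917, the maximum absolute value is 986, so j = 3 and -986 becomes -0.986.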

What is the rationale of constructing a separate data warehouse, when online analytical processing could be performed directly on operational databases?

Operational databases store changing and current data, while warehouses store historical data, which is what is needed in the decision making process.

Mining class comparison process

1. Partition the set of relevant data into the target and the contrasting class(es). 2. Generalize both classes to the same high-level concepts. 3. Compare tuples with the same high-level descriptions. 4. Present for every tuple its description and two measures: support (distribution within a single class) and comparison (distribution between classes). 5. Highlight the tuples with strong discriminant features.

Data post-processing techniques

Pattern evaluation, pattern selection, pattern interpretation, pattern visualization

Min-max normalization

Performs a linear transformation on the original data. It preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A. $v'_i = \frac{v_i - \min_A}{\max_A - \min_A}(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$.
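
The min-max transformation can be sketched as below. The function name `min_max_normalize` and the default target range [0.0, 1.0] are assumptions made for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        raise ValueError("all values are equal, so the range cannot be rescaled")
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]
```

With a minimum of 12,000 and a maximum of 98,000, the value 73,600 maps to roughly 0.716 in the range [0.0, 1.0].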

Data visualization techniques

Pixel-oriented techniques, geometric projection techniques, icon-based visualization techniques, hierarchical visualization techniques, visualizing complex data and relations

AGNES (Agglomerative Nesting)

Places each object into a cluster of its own, then the clusters are merged step-by-step according to some criterion. It uses a single-linkage approach to merge. It repeats until all objects are merged.
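
The merging loop can be sketched for one-dimensional points as follows. This is an illustrative, unoptimized sketch: the function name `agnes_single_link`, the restriction to 1-D numeric points, and the `num_clusters` stopping parameter are assumptions added for the example.

```python
def agnes_single_link(points, num_clusters=1):
    """Agglomerative clustering of 1-D points with single-linkage merging."""
    clusters = [[p] for p in points]  # each object starts in its own cluster

    def single_link(a, b):
        # Cluster distance = distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Calling it with `num_clusters=1` merges everything into a single cluster; a larger value stops the merging earlier.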

Numerosity reduction

Replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric.

Galaxy Schema

Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars.

Multidimensional OLAP (MOLAP)

Sparse array-based multidimensional storage engine. Fast indexing to pre-computed summarized data

Data mining technologies

Statistics, machine learning, pattern recognition, database systems, visualization, data warehouse, algorithms, information retrieval, applications, high-performance computing

Describe Analytical Processing

Supports basic OLAP operations, but its major strength is the multidimensional analysis of data warehouse data.

Describe information processing

Supports querying and reporting using charts and graphs, to name a few. It is useful for finding information, but only information stored directly in the databases or computable by aggregate functions. Unlike data mining, it cannot reflect the more complex patterns buried in the database. It is also used to construct low-cost web-based accessing tools that are integrated with web browsers. It is a step behind analytical processing.

Distributive measures

Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner. A measure is distributive if it is obtained by applying a distributive aggregate function.
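
The partition test described above can be tried out directly. The helper name `matches_on_partitions` is hypothetical, and a single data set can only demonstrate the property or expose a counterexample, not prove it in general.

```python
from statistics import median

def matches_on_partitions(func, partitions):
    """Apply func to each partition, then to the partial results, and
    compare with applying func to the whole data set at once."""
    whole = [x for part in partitions for x in part]
    partial_results = [func(part) for part in partitions]
    return func(partial_results) == func(whole)

partitions = [[1, 2, 9], [4, 5], [7, 8, 30]]
sum_is_consistent = matches_on_partitions(sum, partitions)
max_is_consistent = matches_on_partitions(max, partitions)
median_is_consistent = matches_on_partitions(median, partitions)
```

Here `sum` and `max` give the same answer either way, while the median of the partition medians (4.5) differs from the median of the whole data set (6), which shows that the median cannot be computed this way.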

In an online transaction processing system the typical unit of work is

a simple transaction

What do we understand by data normalization?

The process by which data is transformed to fall within a smaller range such as [−1,1] or [0.0, 1.0]. This attempts to give all attributes of the data set an equal weight.

What is the importance of dissimilarity measures

The importance of this is that in some instances, having two objects with low dissimilarity could mean something negative. For example, cheating.

What are the differences between the measures of central tendency and the measures of dispersion?

The measures of central tendency are the mean, median, mode, and midrange. They measure the location of the middle or center of the data distribution, basically where most values fall. The dispersion measures, in contrast, are the range, quartiles, interquartile range, the five-number summary, boxplots, the variance, and the standard deviation of the data. They are mainly used to get an idea of how the data are spread out and to identify outliers.
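
Several of these measures can be computed with Python's standard `statistics` module. The sketch below uses the population variance and standard deviation, which is an assumption; the function name `summarize` is hypothetical.

```python
from statistics import mean, median, mode, pstdev, pvariance

def summarize(values):
    """Collect a few central-tendency and dispersion measures in one dict."""
    return {
        # Central tendency: where the middle of the distribution lies.
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "midrange": (min(values) + max(values)) / 2,
        # Dispersion: how spread out the values are.
        "range": max(values) - min(values),
        "variance": pvariance(values),
        "std_dev": pstdev(values),
    }
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5, the median 4.5, the mode 4, the range 7, and the population standard deviation 2.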

Star schema

The most common modeling paradigm, in which the data warehouse contains a large central table containing the bulk of the data, with no redundancy, and a set of smaller attendant tables, one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table

Prepruning

a tree is pruned by halting its construction early, when the goodness of a candidate split falls below a prespecified threshold

Curse of dimensionality

The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels

What is the importance of similarity measures

They are important because they help us see patterns in data. They also give us knowledge about our data. They are used in clustering algorithms. Similar data points are put into the same clusters, and dissimilar points are placed into different clusters.

Support

Number of times A and B occur together / total number of events

The mean is in general affected by outliers (T/F)

True

The mode is the only measure of central tendency that can be used for nominal attributes. (T/F)

True. An example of this would be hair color, with different categories such as black, brown, blond, and red. Which one is the most common one?

Data discrepancy detection methods

Use metadata; Check field overloading; Check uniqueness rule, consecutive rule and null rule; Use commercial tools like Data scrubbing and Data auditing

Decision tree

a tree in which each internal node holds a test on an attribute and each leaf holds a class label

Cluster

a collection of data objects that are similar to one another and dissimilar to objects in other groups

CF tree

a height-balanced tree that stores the clustering features for a hierarchical cluster

Model-based methods

a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model (EM, SOM, COBWEB)

BIRCH

begins by partitioning objects hierarchically using tree structures where the nodes can be viewed as microclusters depending on the resolution scale. It then applies other clustering algorithms to perform macroclustering on the microclusters

Hierarchical method

can be agglomerative or divisive, creates a hierarchical decomposition of the set of data using some criterion (DIANA, AGNES, BIRCH, CHAMELEON). Once a step is done, it cannot be undone

k-fold cross validation

data is partitioned into k mutually exclusive subsets, or folds. Training and testing are performed k times. In iteration i, fold i is held out as the test set and the remaining folds are used for training
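
A minimal sketch of the fold rotation follows, assuming the data fit in a list and that folds are formed by taking every k-th element (no shuffling or stratification). The generator name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(data, k):
    """Yield (train, test) pairs; fold i is the test set in iteration i."""
    folds = [data[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each element appears in exactly one test set across the k iterations, so every tuple is used for testing once and for training k - 1 times.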

Among the data warehouse applications, __ applications supports knowledge discovery

data mining

Dendrogram

tree structure used to represent the process of hierarchical clustering. Shows how objects are grouped together or partitioned step by step.

Subject-oriented

data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.

Class-label attribute

identifies the class to which each tuple belongs; it is discrete-valued and unordered

Selecting which cuboids to materialize

depending on size, sharing, access frequency, etc

Regression

smooths noisy data by fitting the data to a function; linear regression derives the equation of the best-fit line

Attribute Selection Measures

determine how the tuples at a given node are to be split. Provides a ranking for each attribute describing the given training tuples

Splitting criterion

determines which attribute to test at node N by looking for the best way to partition the tuples D into individual classes. Also, tells us the branches of outcomes of the test. Indicates the splitting attribute, split point or a splitting subset

Decision tree advantages

don't require domain knowledge or parameter setting, so they are appropriate for exploratory knowledge discovery; handle multidimensional data; intuitive representation; learning and classification steps are fast and accurate

Bagging

each classifier is learned from its own training set, drawn from the original data by sampling with replacement. Each classifier returns a prediction, the predictions are counted as votes, and the prediction with the most votes is chosen
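
The voting scheme can be sketched as below. Everything named here is an assumption for illustration: `bagging_predict`, the `learn` callback, the number of models `k`, and the toy one-dimensional nearest-neighbor learner used as the base classifier.

```python
import random
from collections import Counter

def bagging_predict(train, x, learn, k=25, seed=0):
    """Train k models on bootstrap samples of train and return the majority vote for x."""
    rng = random.Random(seed)
    votes = []
    for _ in range(k):
        # Bootstrap sample: same size as train, drawn with replacement.
        sample = [rng.choice(train) for _ in train]
        model = learn(sample)
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]

def learn_nearest_neighbor(sample):
    """Toy base learner: predict the label of the closest (value, label) pair."""
    def model(x):
        nearest = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return model
```

Any learner that takes a sample and returns a prediction function could be passed as `learn`; the nearest-neighbor learner is only there to make the sketch self-contained.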

Single-linkage

each cluster is represented by all the objects in the cluster and the similarity between 2 clusters is measured by the closest pair of data points belonging to different clusters

Agglomerative approach (bottom-up)

each object forms a separate group, it merges the objects or groups close to one another until all groups are merged into 1

Pure partition

if all tuples in the partition belong to the same class

Information processing

supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. A current trend is to construct low-cost web-based accessing tools that are then integrated with web browsers.

Advantages of boosting

tends to have greater accuracy but risks overfitting the model, can be extended for numeric prediction

Unsupervised learning

the classifier doesn't know what the class labels are, the # of classes may also be unknown

Supervised learning

the classifier is trained by being told to which classes the tuples belong

Accuracy of the classifier

the percentage of test set tuples that are correctly classified by the classifier

Diameter of a cluster

the square root of the average mean squared distance between all pairs of points in the cluster

Z-score normalization

the values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A: $v'_i = (v_i - \bar{A}) / \sigma_A$
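
A minimal sketch of z-score normalization, assuming the population standard deviation is used (the card does not say which variant is meant). The function name `z_score_normalize` is hypothetical.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Subtract the mean of the attribute and divide by its standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("standard deviation is zero, so z-scores are undefined")
    return [(v - mu) / sigma for v in values]
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the standard deviation is 2, so the value 9 becomes 2.0 and the value 2 becomes -1.5.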

Continuous Attributes

typically represented as floating-point variables.

Subjective interestingness measures

unexpected and actionable patterns

What are the data mining functionalities

Characterization and discrimination; mining of frequent patterns, associations, and correlations; classification and regression; clustering analysis; outlier analysis

Data reduction strategies.

Dimensionality reduction, numerosity reduction, data compression

Discuss one of the distance measures that are commonly used for computing the dissimilarity of objects described by numeric attributes.

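Euclidean distance: $d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$. Manhattan distance: $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$. Minkowski distance: $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$. Supremum distance: $d(i, j) = \max_{f = 1, \dots, p} |x_{if} - x_{jf}|$.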
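
These four measures can be sketched in a few lines, treating Euclidean and Manhattan distance as the Minkowski distance with h = 2 and h = 1. The function names are hypothetical, and objects are assumed to be equal-length sequences of numbers.

```python
def minkowski(x, y, h):
    """h-th root of the sum of absolute coordinate differences raised to h."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def euclidean(x, y):
    return minkowski(x, y, 2)

def manhattan(x, y):
    return minkowski(x, y, 1)

def supremum(x, y):
    """Largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))
```

For the objects (1, 2) and (3, 5), the Manhattan distance is 5, the Euclidean distance is the square root of 13 (about 3.61), and the supremum distance is 3.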

In many real-life databases, objects are described by a mixture of attribute types. How can we compute the dissimilarity between objects of mixed attribute types?

There are two main approaches to determining the dissimilarity between objects of mixed attribute types. The first separates the attributes by type and performs a data mining analysis for each type. This is acceptable if the analyses give consistent results, but it is rarely viable in real-life projects because analyzing the attribute types separately will most likely generate different results. The second, preferable approach processes all attribute types together and performs a single analysis by combining the attributes into one dissimilarity matrix.

What is the importance of data reduction?

It can increase storage efficiency and reduce costs. It allows analytics to take less time and yield similar (if not identical) results.

What do we understand by similarity measure?

It quantifies the similarity between two objects. Usually, large values are for similar objects and zero or negative values are for dissimilar objects.

What do we understand by dissimilarity measure and what is its importance?

It measures the difference between two objects: the greater the difference between the two objects, the higher the value.

Data normalization methods

Min-max normalization, z-score normalization, normalization by decimal scaling

Data mining methodology challenges

Mining various and new kinds of knowledge; mining knowledge in multidimensional space; integrating new methods from multiple disciplines; boosting the power of discovery in a networked environment; handling uncertainty, noise, or incompleteness of data; pattern evaluation and pattern- or constraint-guided mining

Full Materialization

Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This choice typically requires huge amounts of memory space in order to store all of the precomputed cuboids.

What do we understand by data quality and what is its importance?

Data have quality when they satisfy the requirements of the intended use. Quality has many factors, including accuracy, completeness, consistency, timeliness, believability, and interpretability. It also depends on the intended use of the data: for some users the data may be inconsistent, while for others they may just be hard to interpret.

