Data Mining
What is attribute-oriented induction?
A query-oriented, generalization-based, online data analysis technique.
Frequent pattern based methods
based on the analysis of frequent patterns
prior probability
basically the end goal, the class we want to check or classify for
Cosine Similarity Measure
((NewX * OldX) + (NewY * OldY)) / (Sqrt(NewX^2 + NewY^2) * Sqrt(OldX^2 + OldY^2))
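A minimal Python sketch of this formula for general vectors (the sample values and names x, y are illustrative, not from the card):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Example: two 2-D points ("New" and "Old" in the card's notation)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0 -> same direction
```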
What is a star schema?
A data model composed of a main, central table that holds the most important data. Now, to form the "star", the central table branches out into arms, or dimension tables, which expand on the attributes present in the central table.
Boosting
Like bagging, but with weights: tuples misclassified by earlier classifiers are given higher weight so later classifiers focus on them, and each classifier's vote is weighted by its accuracy
Discretization techniques
Binning: top-down split, unsupervised; Histogram analysis: top-down split, unsupervised; Clustering analysis: unsupervised, top-down split OR bottom-up merge; Decision-tree analysis: supervised, top-down split; Correlation analysis: unsupervised, bottom-up merge
Bayesian methods
Compute a distribution of possible clusterings
What are the steps involved in data mining when viewed as a process of knowledge discovery?
Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, knowledge presentation
Intentional data
Disguised missing data
Frequency
Number of times something happens / total number of events
Greedy search doesn't allow for backtracking
True
K-means doesn't guarantee a global optimum and often terminates at a local optimum
True
Minimum Description Length
best decision tree is the one that requires the fewest bits
Entity Identification Problem
different sources don't always label the same data in the same way.
Consider a data cube measure obtained by applying the sum() function. The measure is
distributive
Pivot
a visualization operation that rotates the data axes in view to provide an alternative data presentation
In attribute-oriented induction, data relevant to the task at hand is collected and then generalization is performed by either attribute generalization or __
attribute removal
Spiral
involves the rapid generation of increasingly functional systems, with short intervals between successive releases. This is considered a good choice for data warehouse development, especially for data marts, because the turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner.
Dissimilarity/similarity metric
expressed in terms of a distance function, which differs for interval-scaled, Boolean, categorical, ordinal, ratio-scaled, and vector variables. Weights should be associated with different variables based on applications and data semantics
Neural network pros
high tolerance to noisy data; can classify untrained patterns; well suited for continuous-valued input and output; successful on real-world data; algorithms are inherently parallel; recently developed techniques can extract rules from trained networks
Complete link
largest distance between an element in one cluster and an element in the other
Bayesian classifiers advantages
minimum error rate in theory; provide a theoretical justification for other classifiers that don't explicitly use Bayes' theorem
Gini index
measures the impurity of a data partition; the split point or splitting subset that yields the minimum Gini index (the greatest impurity reduction) is selected
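A small Python sketch of the Gini computation, assuming class frequency counts as input (the example counts are made up):

```python
def gini(class_counts):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(counts_d1, counts_d2):
    """Weighted Gini of a binary split D -> (D1, D2); the candidate split
    with the smallest value is chosen."""
    n1, n2 = sum(counts_d1), sum(counts_d2)
    n = n1 + n2
    return (n1 / n) * gini(counts_d1) + (n2 / n) * gini(counts_d2)

print(gini([9, 5]))                # impurity of the whole partition
print(gini_split([6, 1], [3, 4]))  # impurity after a candidate split
```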
Multivariate splits
partitioning the tuples based on a combination of attributes
Waterfall
performs a structured and systematic analysis at each step before proceeding to the next
Roll-up
performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction
Predictive mining
performs induction on the current data in order to make predictions
k-medoids algorithm
pick one representative object for each cluster and assign each remaining object to the cluster whose representative it is most similar to
Attribute vector
the way each tuple is represented: an n-dimensional vector containing the tuple's values for the n attributes describing it
Closed frequent itemset
meets the definition of a closed itemset and also passes the minimum support threshold (i.e., it is frequent)
Data integration
merges data from multiple sources into a coherent data store, such as a data warehouse.
Split point
midpoint of 2 adjacent known values
five number summary
minimum, Q1, median, Q3, maximum
Classification step
model is used to predict class labels for given data
Drill-down
navigates from less detailed data to more detailed data. Can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions
Why data visualization
Gain insight into an information space by mapping data onto graphical primitives; provide a qualitative overview of large data sets; search for patterns, trends, structure, irregularities, and relationships among data; help find interesting regions and suitable parameters for further quantitative analysis; provide visual proof of computer representations derived
What is the second step of association rule mining
Generate strong association rules from the frequent itemsets: Creating rules that satisfy both the minimum support and minimum confidence.
Algorithmic methods
agglomerative, divisive and multiphase methods, they consider data objects as deterministic and compute clusters according to the deterministic distances between objects
Data migration tools
allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools
allow users to specify transformations through a graphical user interface
Gain ratio
applies a normalization to information gain using a "split information" value; for each outcome, it considers the number of tuples having that outcome relative to the total number of tuples in D. The attribute with the maximum gain ratio is selected, with the constraint that its information gain must be at least as great as the average gain over all tests examined
Analytical processing
supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It generally operates on historic data in both summarized and detailed forms. The major strength is the multidimensional data analysis of data warehouse data.
posterior probability
the probability that a hypothesis holds the given evidence of a known data tuple that belongs to a class
Cluster analysis
the process of partitioning a set of data objects into clusters. Unsupervised, learns by observation
Dimensionality reduction
the process of reducing the number of random variables or attributes under consideration. Include wavelet transforms, principal components analysis, and attribute subset selection
Radius of a cluster
the square root of the average distance from any point of the cluster to the centroid
rule based classifier
uses a set of if-then rules for classification
What are some of the challenges to consider and the techniques employed in data integration?
Entity identification problem; redundancy; tuple duplication; data value conflict detection and resolution. Correlation analysis is a technique employed in data integration (to detect redundancy)
Time-variant
Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.
Not all numerical data sets have a median. (T/F)
False
What is the first step of association rule mining
Find all frequent itemsets: in other words, find each itemset that occurs at least as frequently as a predetermined minimum support count (the minimum number of transactions that contain the itemset).
Quartile
Get the median, cut the data into two halves at the median, then get the median of each half; those are the first and third quartiles
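A quick Python sketch of this "median of each half" rule (quartile conventions vary; this follows the card's description and excludes the middle element from the halves when the count is odd):

```python
def median(sorted_vals):
    n = len(sorted_vals)
    mid = n // 2
    return sorted_vals[mid] if n % 2 else (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-each-half rule."""
    vals = sorted(data)
    n = len(vals)
    med = median(vals)
    lower = vals[: n // 2]          # lower half
    upper = vals[(n + 1) // 2:]     # upper half
    return median(lower), med, median(upper)

print(quartiles([7, 15, 36, 39, 40, 41]))  # (15, 37.5, 40)
```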
Intuitively, the roll-up OLAP operation corresponds to concept ___ in a concept hierarchy
ascension
associative classification
association rules are generated from frequent patterns and used for classification
Naive Bayesian Classifier
assume that the effect of an attribute value on a given class is independent of the values of the other attributes
Pessimistic pruning
uses the training set to determine error rates
Numeric prediction
where a model predicts a continuous-valued function or ordered value; unlike class labels, order DOES matter
Does an outlier need to be discarded always?
In most cases of data mining, outliers are discarded. However, there are special circumstances, such as fraud detection, where outliers can be useful.
data warehouse applications
Information Processing Analytical Processing Data mining
Maximal Frequent Itemset
An itemset that is frequent and has no frequent superset (it is therefore also closed).
Relationship between association rules and minimum support threshold
When minimum support is low, there exist potentially an exponential number of frequent itemsets
Hierarchical clustering method:
Works by grouping data objects into a hierarchy or tree of clusters. Doesn't require the # of clusters, but needs a termination condition
Separation of clusters
either mutually exclusive (only belong to one cluster) or data can belong to more than 1 cluster
Clustering as a preprocessing tool
for regression, PCA, attribute subset selection, image processing, vector quantization, finding k-nearest neighbors, outlier detection
Data Value Conflict Detection and Resolution
for the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding
partitioning methods
given a set of n objects, a partitioning method constructs k partitions of the data, where each represents a cluster. Mostly distance based, they construct partitions and evaluate them by some criterion (Kmeans, Kmedoids, LLARANS)
Robustness
handling noise and missing values
Discrete Attribute
has a finite or countably infinite set of values
Branching factor
specifies the maximum # of children per nonleaf node
Bayesian classifiers
statistical classifiers that can predict class membership probabilities. They can predict if a tuple belongs to a specific class
What is a closed itemset
An itemset X is closed when there exists no proper superset of X with the same support count. For example, if a set A contains { a, b }, it cannot be said to be closed if there also exists a set B = { a, b, c } with the same support, because B is a superset of A that occurs in exactly the same transactions.
Concept hierarchy generation for nominal data
Where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.
Attribute construction (or feature construction)
Where new attributes are constructed and added from the given set of attributes to help the mining process.
Aggregation
Where summary or aggregation operations are applied to the data. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels.
Tuple Duplication
there are two or more identical tuples for a given unique data entry case
K-modes method
variant of the k-means but using modes instead
PAM cons
doesn't scale well for large data sets. Improvements (CLARA, CLARANS)
Binning
smooths out noisy values by placing the data into bins and replacing each value with the bin mean, median, or boundaries
Probabilistic methods
use probabilistic models to capture clusters and measure the quality of cluster by the fitness of models
Data scrubbing
use simple domain knowledge to detect errors and make corrections
Information gain
used by ID3 as an attribute selection measure: Gain(A) = Info(D) − Info_A(D). The attribute with the highest gain is chosen as the splitting attribute
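A short Python sketch of Gain(A) = Info(D) − Info_A(D), assuming class counts for the parent partition and for each branch (the 9-yes/5-no split is a textbook-style illustration):

```python
import math

def info(class_counts):
    """Entropy Info(D) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c)

def info_gain(parent_counts, partitions):
    """Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the weighted
    entropy of the partitions induced by attribute A."""
    n = sum(parent_counts)
    info_a = sum(sum(p) / n * info(p) for p in partitions)
    return info(parent_counts) - info_a

# 9 yes / 5 no overall, split by a hypothetical attribute into three branches
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.246
```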
Cost complexity pruning
used in CART; considers cost complexity as a function of the # of leaves and the error rate. Compares the unpruned vs. pruned subtree; if the pruned version has lower cost complexity, the subtree is pruned
Sequential Covering Algorithm
used to extract if then rules from the data
Attribute generalization
If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute
Holistic measures
If there is no constant bound on the storage size needed to describe a subaggregate; that is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank(). A measure is holistic if it is obtained by applying a holistic aggregate function
Handling Missing Data
Ignore the tuple; fill in the missing value manually; fill it in automatically with a global constant, the attribute mean, the attribute mean for all samples belonging to the same class, or the most probable value (inference based, such as a Bayesian formula or decision tree)
Requirements for clustering
Scalability; ability to deal with different types of attributes; discovery of clusters with arbitrary shape; minimal requirements for domain knowledge to determine input parameters; ability to deal with noisy data; incremental clustering and insensitivity to input order; capability of clustering high-dimensionality data; constraint-based clustering; interpretability and usability
K-means algorithm
uses the centroid of a cluster to represent that cluster. 1. Choose (or be given) the initial centroids 2. Assign objects to clusters based on their Euclidean distances 3. Find the mean of the objects in each cluster 4. Repeat step 2, but with the means as the new centroids 5. Stop when there is no change, or a stopping condition is reached. Complexity O(nkt)
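A compact Python sketch of these steps (random initialization and plain Euclidean assignment; a teaching sketch, not a production implementation):

```python
import math, random

def kmeans(points, k, max_iter=100):
    """Minimal k-means: assign to nearest centroid, recompute means, repeat."""
    centroids = random.sample(points, k)              # step 1: initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: nearest-centroid assignment
            d = [math.dist(p, c) for c in centroids]
            clusters[d.index(min(d))].append(p)
        new_centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)           # step 3: recompute means
        ]
        if new_centroids == centroids:                # step 5: stop when no change
            break
        centroids = new_centroids                     # step 4: repeat with new means
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, groups = kmeans(pts, 2)
print(centers)
```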
Partitioning Around Medoids (PAM)
a variation of k-medoids. Randomly chooses k objects as initial medoids, then assigns each remaining object to the nearest medoid. It then randomly selects a nonmedoid object, computes the total cost of swapping it with a medoid, and swaps if the result is better.
The ___ OLAP operation performs a selection on one dimension of the given cube
slice
Smoothing
Works to remove noise from the data. Techniques include binning, regression, and clustering.
Grid-based methods
quantize the object space into a finite # of cells that form a grid structure. All clustering operations are performed on the grid and are therefore fast, since they depend on the # of cells rather than the # of data objects
Bootstrap method
randomly selects tuples for the training set. Uses sampling with replacement, meaning a tuple can be selected twice.
Nominal attribute
refers to symbols or names of things. Categorical. It can also be represented using numbers; however, these are not meant to be used quantitatively. Has no median, but has a mode
Data cleaning
remove noise and correct inconsistencies in the data.
Postpruning
removes subtrees from a fully grown tree
In the ___ schema some dimension tables are normalized generating additional tables
snowflake
threshold parameter
specifies the maximum diameter of the subclusters stored at the leaf nodes
In data warehouse development, with the ___ process changes to requirements can be resolved faster
spiral
Data discrimination
comparison of the target class with one or a set of comparative classes
Objective measures of pattern interestingness
confidence and support
multilayer feed-forward neural network
consists of an input layer, one or more hidden layers, and an output layer
Density-based methods
their general idea is to continue growing a cluster as long as the density in the neighborhood exceeds some threshold (DBSCAN, OPTICS, DenClue)
Similarity measure of clusters
they can be distance based or connectivity based
Interquartile range
third quartile - first quartile
Data compression
transformations are applied so as to obtain a reduced or "compressed" representation of the original data.
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
uses bootstrapping to create smaller subsets of the training data, each of which fits in memory. A tree is constructed from each subset, and these trees are then used to construct a general one. Requires only two scans of the data and can be used for incremental updates
KNN
when given an unknown tuple, it searches the pattern space for the k training tuples that are closest to the unknown tuple using Euclidean distance
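A minimal Python sketch of this idea, assuming the training data is given as (feature vector, label) pairs and using majority voting among the k nearest neighbors:

```python
import math
from collections import Counter

def knn_classify(train, unknown, k=3):
    """train: list of (feature_vector, class_label) pairs.
    Returns the majority label among the k training tuples closest
    to the unknown tuple under Euclidean distance."""
    neighbors = sorted(train, key=lambda t: math.dist(t[0], unknown))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((8.0, 9.0), "B"), ((7.5, 8.5), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))  # "A"
```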
Algebraic measures
If it can be computed by an algebraic function with M arguments (where M is a constant), each obtained by applying a distributive aggregate function. For example, avg() (average) can be computed by sum()/count(), where both sum() and count() are distributive aggregate functions. A measure is algebraic if it is obtained by applying an algebraic aggregate function
Concept Description vs. Cube Based OLAP
Similarities: data generalization; presentation of data summarization at multiple levels of abstraction; interactive drilling, pivoting, slicing, and dicing. Differences: OLAP has systematic preprocessing, is query-independent, and can drill down to a rather low level; AOI automates the allocation of the desired generalization level and may perform dimension relevance analysis/ranking when there are many relevant dimensions; AOI can also work on data that are not in relational form
Data mining and society challenges
Social impacts of data mining Privacy-preserving data mining Invisible data mining
snowflake schema
a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake
BIRCH scan 2
applies a selected clustering algorithm to the leaf nodes on the tree that removes the sparse clusters as outliers and groups the dense clusters into larger ones
Support vector machines
classification method for linear and nonlinear data. Uses nonlinear mapping to transform the training data to a higher dimension, then it searches for the linear optimal separating hyperplane using support vectors and margins
Scalability
clustering all the data instead of only samples
DIANA (Divisive Analysis)
inverse order of AGNES, each node forms a cluster of its own and they are split by the maximal distance between neighboring objects.
Concept description
is the most basic form of descriptive data mining. It describes a given set of task-relevant data in a concise and summative manner, presenting interesting general properties of the data. It consists of characterization and comparison (or discrimination).
Neural network disadvantages
long training time; # of parameters determined empirically; poor interpretability
Training set
made up of database tuples and their associated class labels. Used to teach the classifier in the learning step
Test set
made up of test tuples and their class labels. It is used to test the classifier, it doesn't take part in the training process.
RainForest
maintains an AVC set at each node describing the training tuples at the node
A major distinguishing feature of an online analytical processing system is that
manages large amounts of historic data
SVM applications
numeric prediction, classification, handwritten digit recognition, object recognition, speaker identification, time-series prediction tests
Link-based clustering methods
objects are often linked together in various ways (SimRank, LinkClus)
CLARANS (Clustering Large Applications based upon Randomized Search)
randomly selects k objects in the data set as current medoids, then selects a current medoid and an object and swap only if it improves the absolute error criterion.
Partitioning criteria
single-level partitioning, where no cluster is nested under another, vs. hierarchical partitioning (often preferred)
Divisive approach (top-down)
starts with all objects in the same cluster. In each iteration a cluster is split into smaller clusters, until eventually each object is in one cluster, or a termination condition holds.
Splitting subset
subset of the known values of a splitting attribute
Relevance Analysis
Find attributes which best distinguish different classes
Partial materialization
is the selective computation of a subset of the cuboids or subcubes in the lattice
Data mining turns data into organized ______
knowledge
Decision tree induction
learning of decision trees from class-labeled training tuples
Backpropagation
learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the known target value. The weights are modified to minimize the mean squared error for every tuple
Entropy of D
The average amount of information needed to identify the class label of a tuple in D
Normalization
Where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
Discretization
Where the raw values of a numeric attribute are replaced by interval labels or conceptual labels. The labels can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users. (Concept hierarchy climbing)
Numeric Attributes
Quantitative; that is, it is a measurable quantity, represented in integer or real values. Can be interval-scaled or ratio-scaled.
Learning step
classification algorithm builds the classifier by analyzing a training set
High intra-class similarity
cohesive within clusters
Metadata
describe or define warehouse elements
Clustering feature
a three-dimensional vector summarizing information about a cluster of objects
Clustering space
clusters within the entire given space (one dimension) or clusters with different subspaces (multi-dimensional)
Ensemble methods
combine models to increase accuracy, to come up with an improved composite classification model
CHAMELEON
explores dynamic modeling in hierarchical clustering. Cluster similarity is based on how well connected the objects within a cluster are and on the proximity of the clusters
Data source view
exposes the information being captured, stored, and managed by operational systems
Iterative relocation technique
improves partitioning by moving objects from one group to another
Overfit
incorporates particular anomalies of the training data that are not present in the general data set overall
Training tuples
individual tuples making up the training set, are randomly sampled from the database under analysis
Transactional Database
captures a transaction which typically includes an ID and a list of items that make it up
outlier analysis
Detects outliers; in smoothing, values that fall outside of clusters may be considered noise and removed
Nonvolatile
a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
Enterprise warehouse
a data warehouse model that collects all of the information about subjects spanning the entire organization
Data mart
a data warehouse model that contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects.
In an online analytical processing system, the typical unit of work is
a read-only operation
Top-down view
allows the selection of the relevant information necessary for the data warehouse. This information matches current and future business needs.
Data auditing
analyzing data to discover rules and relationships in order to detect violators
Data warehouse view
consists of fact tables and dimension tables
Attribute removal
If there is a large set of distinct values for an attribute of the initial working relation, but either there is no generalization operator on the attribute, or its higher-level concepts are expressed in terms of other attributes, then the attribute should be removed from the working relation.
Apriori pruning principle
If there is any itemset that is infrequent, its superset should not be generated/tested!
Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware. Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services. Greater scalability
Classification
a form of data analysis that extracts models (classifiers) describing important data classes; the classifier is constructed to predict categorical labels, represented by discrete values with no relevant order
Nearest neighbor clustering algorithm
when an algorithm uses minimum distance to measure the distance between clusters
Constraint-based clustering
User may give inputs on constraints, use domain knowledge to determine input parameters.
Eager learners
when given a set of training tuples will construct a generalization model before receiving new tuples
Lazy learners
when given a set of training tuples, it waits until a test tuple to classify
farthest neighbor clustering algorithm
when the algorithm uses max distance to measure cluster distance
Minimal spanning tree algorithm
when the spanning tree of a graph connects all vertices and has the least sum of edge weights
What is a snowflake schema?
A model with a central fact table and a set of constituent dimension tables which are further normalized into sub-dimension tables.
Bayes theorem
Lets a classifier calculate the posterior probability P(H|X) from the prior P(H), the likelihood P(X|H), and the evidence P(X): P(H|X) = P(X|H)P(H) / P(X); classification chooses the class with the highest posterior probability
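A tiny worked example of the rule in Python (the probability values are hypothetical):

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical numbers: P(H) = 0.3, P(X|H) = 0.8, P(X) = 0.5
print(posterior(0.3, 0.8, 0.5))  # 0.48
```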
Please discuss data generalization and some of the concepts associated to it.
"Summarizes data by replacing relatively low-level values with higher-level concepts, or by reducing the number of dimensions to summarize data in concept space involving fewer dimensions."
Data Mining Applications
Business intelligence; web search engines; web page analysis; basket data analysis for targeted marketing; biological and medical data analysis
Attribute Selection Measures Biases
Information gain and the Gini index are biased toward multivalued attributes; gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others; the Gini index has difficulty when the number of classes is large and favors tests that result in equal-size partitions with purity in both
Dice
Defines a subcube by performing a selection on two or more dimensions
Why is data integration necessary?
It is used to combine data from multiple sources into a coherent store. The more sources, the better for guarding against bias, and the more data, the better in general.
Rules can be pruned
True
Interesting patterns
easily understood by humans; valid on new or test data with some degree of certainty; potentially useful; novel; validates a hypothesis the user sought to confirm; represents knowledge
K-means cons
not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes; sensitive to noise and outliers; the number of clusters must be specified in advance; works only for numeric data
Making k-means better
use a good-sized set of samples in clustering; employ a filtering approach that uses a spatial hierarchical data index to save costs when computing means; group nearby objects into microclusters and perform k-means on the microclusters
How many cuboids are there in a 6-dimensional data cube if there were no hierarchies associated to any dimension?
64
Relational database
A collection of tables, each of which consists of a set of attributes and stores a set of tuples of entities, identified by keys
Single-linkage algorithm
a nearest-neighbor algorithm in which the clustering process is terminated when the distance between the nearest clusters exceeds a user-defined threshold
What are association rules?
if-then statements that help to show the probability of relationships between data items within large data sets in various types of databases.
BIRCH scan 1
scans the database to build an in memory CF tree that is a multilevel compression of the data and tries to preserve the structure
Business query view
sees the perspectives of data in the warehouse from the viewpoint of the end user
SVM features
training can be slow but accuracy is high
Tree pruning
Removes branches that reflect noise or outliers in the training data
Rules are strong/frequent if the support calculated is greater than or equal to the minimum support given (T/F)
True
Redundancy
an attribute is redundant if its data can be derived from an existing attribute or set of attributes.
What is an itemset
A collection of one or more items
Low inter-class similarity
Distinctive between clusters
Random subsampling
repeating the holdout method k times. Accuracy will be the average of the accuracies of each iteration
Incremental decision tree induction
restructure the decision tree when new training tuples are processed
Virtual Warehouse
A set of views over operational databases. Only some of the possible summary views may be materialized
Data transformation strategies
1. Smoothing 2. Attribute construction (or feature construction) 3. Aggregation 4. Normalization 5. Discretization 6. Concept hierarchy generation for nominal data
What is data mining?
The process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis
Confidence
Times A and B happen / times A happens
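A small Python sketch showing how support and confidence relate (the transactions are made-up examples):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """Conf(A => B) = support(A union B) / support(A)."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

txns = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk"}]
print(support(txns, {"milk", "bread"}))        # 0.5
print(confidence(txns, {"milk"}, {"bread"}))   # 2/3
```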
What is a data cube?
A common multi-dimensional model that is a step above a basic 2-d data chart
What is dice?
A slice on more than one dimensions of a data cube
Data reduction
can reduce the data size by aggregating, eliminating redundant features, or clustering.
Data classification
A two-step process: (1) the learning step and (2) the classification step
Apriori property
The downward closure property of frequent patterns Any subset of a frequent itemset must be frequent
User-guided or constraint based methods
clustering by considering user specified or application specified constraints (COD, constrained clustering)
Descriptive mining
characterizes properties of the data in a target data set
Holdout method
data is randomly partitioned into two independent sets, typically two-thirds for training and one-third for testing
CLARA (Clustering Large Applications)
takes a sample of the data, then uses PAM algorithm
Slice
Performs a selection on one dimension of the given cube, resulting in a subcube.
In an online transaction processing system, the typical unit of work is
a simple transaction
quality of clustering
a separate quality function that measures the goodness of a cluster
Apriori Steps
1. Analyze every element and calculate the number of occurrences 2. If occurrences > or = minimum support, keep 3. Join the items kept with every single other item kept, in a Cartesian product manner 4. Repeat step 2 5. If there are any elements kept, repeat step 3, but add an extra element 6. Repeat step 2 7. Keep repeating step 5 and 6 until no more elements are kept 8. Choose the last elements kept, where occurrences > minimum support, those are the frequent elements.
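A condensed Python sketch of these steps, treating min_support as an absolute count (candidate generation and pruning are simplified but follow the Apriori property):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch; returns {frequent itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemsets):
        # steps 1-2: count occurrences and keep itemsets meeting min_support
        counts = {s: sum(s <= t for t in transactions) for s in itemsets}
        return {s: c for s, c in counts.items() if c >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = count(items)                      # frequent 1-itemsets
    k = 2
    while level:
        frequent.update(level)
        # step 3: join surviving itemsets to form k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = count(candidates)             # steps 4-7: count and filter again
        k += 1
    return frequent

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, cnt in sorted(apriori(txns, 3).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), cnt)
```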
Discuss the steps associated to the design of a data warehouse.
1. Choose a business process to model. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen. 2. Choose the business process grain, which is the fundamental, atomic level of data to be represented in the fact table for this process. 3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status. 4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold.
Correlation
A calculation used to determine how dependent or independent attributes are with each other. Its analysis is used to keep redundancy in check.
What is a fact constellation?
A composite of the previous schemas. Here, there can be more than one central table; and these tables can share dimensional tables. It could be thought of as a collection of stars.
Integrated
A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records
What is a data cube measure? Any examples?
A function that can evaluate to any point in the data cube's space. An example would be calculating the sum or average of the data.
Data transformations
A function that maps the entire set of values of a given attribute to a new set of replacement values, each old value can be identified with one of the new values. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
What do we understand by "multidimensional data model"?
A model for, usually, themed databases. They are used to categorize data into specializations such as dates, locations, and counts. The multi-dimensional model comes into its own as these broad specializations can be further broken down, say as dates could change from years to months or months to days.
Binary Attributes
A nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present.
Describe the Spiral Method
A sequence of waterfalls and considered a "risk oriented iterative enhancement" . The spiral method is usually the development of choice as it is an iterative process that is used while developing warehouses.
Data warehouse
A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process.
What is a data warehouse?
A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
What is slice?
A subset of a multidimensional array corresponding to a single value set for one or more of the dimensions not in the subset
Data Characterization
A summary of the general characteristics or features of a target class of data. The data corresponding to the user-specified class is typically collected by a query. For example, to study the characteristics of software products with sales that increased by 10% in the previous year, the data related to such products can be collected by executing an SQL query on the sales database.
Discuss one of the factors comprising data quality and provide examples.
Accuracy Completeness Consistency Timeliness Believability Interpretability
Explain one challenge of mining a huge amount of data in comparison with mining a small amount of data.
Algorithms that deal with data need to scale nicely so that even vast amounts of data can be handled efficiently, and take short amounts of time
Ordinal Attributes
An attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
What is an outlier?
An object which does not fit in with the general behavior of the model.
What are some of the differences between operational database systems and data warehouses?
An operational database query allows to read and modify operations, while an OLAP query needs only read only access of stored data. An operational database maintains current data. On the other hand, a data warehouse maintains historical data and provides us generalized and consolidated data in multidimensional view. Along with generalized and consolidated view of data, a data warehouses also provides us Online Analytical Processing (OLAP) tools.
What do we understand by "frequent patterns"? How are they used in data mining? Please provide examples.
They are patterns that appear frequently in a data set. There are three categories of such patterns: itemsets, subsequences, and substructures. They are useful for the discovery of associations and correlations between items in a data set, which can help businesses make smart marketing decisions. One example is market basket analysis, which determines what items are frequently purchased together by customers (for instance milk and bread, or computers and antivirus software).
Data mining diversity of data types challenges
Handling complex types of data Mining dynamic, networked, and global data repositories
neural network
a set of connected input/output units in which each connection has a weight associated with it
How would you catalog a boxplot, as a measure of dispersion or as a data visualization aid? Why?
As a data visualization aid. The boxplot shows how the boundaries relate to each other visually, where the minimum and maximum values lie, and the interquartile range, with a line signifying the median. It does not give you a specific measure, but allows you to visualize the data set at a glance. For example, if you have a boxplot for the grades in a class and the box is closer to the minimum boundary, then you can see that most scores were low.
Frequent itemset applications
Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. Also: association, correlation, and causality analysis; sequential and structural patterns; pattern analysis in spatiotemporal, multimedia, time-series, and stream data; classification (discriminative frequent-pattern analysis); cluster analysis (frequent-pattern-based clustering); data warehousing (iceberg cube and cube-gradient); semantic data compression (fascicles); and other broad applications
Cluster analysis applications
Business intelligence: organize a large # of customers into groups; project management: partition projects into categories; image recognition: handwritten character recognition systems; Web search: organize search results for accessibility; biology: taxonomy; information retrieval: document clustering; land use: identification of areas of similar land use; marketing: help marketers discover different groups of customers and develop plans for them; city planning: identifying groups of houses based on characteristics; earthquake studies: clustering epicenters along continental faults; climate: find patterns of atmosphere and ocean behavior; economic science: market research. Also used as a stand-alone tool to get insight into data distribution and as a pre-processing step to make other algorithms work better
DBMS
Consists of a database and a software to manage and access that data
AVC set
Contains the attribute, value, and class-label count information describing the training tuples at a node, so that building the tree takes less memory
Why is data quality important
Data can become difficult to analyze, hard to use, unreliable, outdated. In other words, having a database with bad quality can defeat the whole purpose of having a database.
How can the data be preprocessed in order to help improve its quality?
Data cleaning Data integration Data reduction Data transformations
Knowledge Discovery Process
Data cleaning, then data integration (both within the source databases); moving the data to the data warehouse, where it undergoes data selection for task-relevant data; then data mining is performed, which leads to various representations of the data; patterns are evaluated; knowledge is presented
Quantile plot
a simple and effective way to have a first look at a univariate data distribution
Why is data mining important?
Data mining turns a large collection of data into knowledge. We live in the information and technology age: we have tons of information, but we want more knowledge
Online Analytical Mining importance
High quality of data in data warehouses Available information processing structure surrounding data warehouses OLAP-based exploratory data analysis On-line selection of data mining functions
Please discuss the meaning of noise in data sets and the methods that can be used to remove the noise (smooth out the data).
The random errors or variance found in measured variables; noisy values often show up as outlier-like values. Smoothing methods: binning, regression, and outlier analysis.
Quality of a clustering method
Depends on: the similarity measure used by the method the method implementation the ability of the method to discover some or all of the hidden patterns
Histogram
Differs from a bar chart in that it is the area of the bar that denotes the value
Characteristics of structured data
Dimensionality Sparsity Resolution Distribution
Discriminant rules
Discrimination descriptions expressed in the form of rules
Data mining efficiency and scalability challenges
Efficiency and scalability of data mining algorithms Parallel, distributed, stream, and incremental mining methods
Clustering
Groups similar values into clusters; values that fall outside the clusters may be detected as outliers and removed
quantile-quantile plot, or q-q plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Multiphase clustering
Integrate hierarchical clustering with other clustering techniques
Data mining user interaction challenges
Interactive mining Incorporation of background knowledge Ad hoc data mining and data mining query languages Presentation and visualization of data mining results
Describe the waterfall method.
Is similar to going down a flight of steps: in order to reach the bottom, every step must be completed. The model is a linear sequence of activities and requirements, structured so that each task relies on completion of the previous one. There are many steps, such as system design, detailed design, testing, performance, and maintenance.
Describe Data Mining
It is all about discovering new and hidden patterns, performing predictions and displaying what was mined using visual tools
Is data cube technology sufficient to accomplish all kinds of concept description tasks for large data sets?
It is not sufficient to accomplish all kinds of concept description tasks for large data sets, for two main reasons. First, concept description should handle complex data types; OLAP, with its restriction on the possible dimension and measure types, represents a simplified model for data analysis. Second, it is too complicated for most users.
Alternate names for data mining
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc
What is a metadata repository and what are some of the elements it should contain?
A metadata repository stores the data defining warehouse objects. It contains a description of the structure of the data warehouse (schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents); operational metadata such as data lineage (history of migrated data and the transformation path), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, audit trails); the algorithms used for summarization; the mapping from the operational environment to the data warehouse; data related to system performance, such as warehouse schema, view, and derived data definitions; and business metadata such as business terms and definitions, ownership of data, and charging policies.
Normalization by decimal scaling
Normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A: vi' = vi / 10^j, where j is the smallest integer such that max(|vi'|) < 1
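A minimal Python sketch of decimal-scaling normalization (the sample values are illustrative):

```python
def decimal_scaling(values):
    """v' = v / 10^j, where j is the smallest integer with max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([-986, 217]))  # [-0.986, 0.217]
```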
What is the rationale of constructing a separate data warehouse, when online analytical processing could be performed directly on operational databases?
Operational databases store changing and current data, while warehouses store historical data, which is what is needed in the decision making process.
Mining class comparison process
Partition the set of relevant data into the target and the contrasting class(es) Generalize both classes to the same high-level concepts Compare tuples with the same high-level descriptions Present for every tuple its description and two measures: support (distribution within single class), and comparison (distribution between classes) Highlight the tuples with strong discriminant features
Data post-processing techniques
Pattern evaluation Pattern selection Pattern interpretation Pattern visualization
Min-max normalization
Performs a linear transformation on the original data and preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside the original data range of A. vi' = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
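A short Python sketch of min-max normalization (the income values are illustrative):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min."""
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

# Map incomes in [12000, 98000] onto [0.0, 1.0]
print(min_max_normalize([12000, 73600, 98000]))  # [0.0, 0.716..., 1.0]
```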
Data visualization techniques
Pixel-oriented techniques Geometric projection techniques Icon-based visualization techniques Hierarchical visualization techniques Visualizing complex data and relations
AGNES (Agglomerative Nesting)
Places each object into a cluster of its own, then the clusters are merged step-by-step according to some criterion. It uses a single-linkage approach to merge. It repeats until all objects are merged.
Numerosity reduction
Replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric
Galaxy Schema
Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars.
Multidimensional OLAP (MOLAP)
Sparse array-based multidimensional storage engine. Fast indexing to pre-computed summarized data
Data mining technologies
Statistics Machine Learning Pattern Recognition Database Systems Visualization Data Warehouse Algorithms Information Retrieval Applications High-performance computing
Describe Analytical Processing
Supports basic OLAP operations; its major strength is multidimensional analysis of data warehouse data.
Describe information processing
Supports query and reporting using charts and graphs, among other tools. It can be useful for finding information, but only information available directly from the databases or through aggregate functions; unlike data mining, it cannot reflect the more complex patterns buried in the database. A current trend is to construct low-cost web accessing tools that are integrated into web browsers. It is a step behind analytical processing.
Distributive measures
Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner. By applying a distributive aggregate function
In an online transaction processing system the typical unit of work is
a simple transaction
What do we understand by data normalization?
The process by which data is transformed to fall within a smaller range such as [−1,1] or [0.0, 1.0]. This attempts to give all attributes of the data set an equal weight.
What is the importance of dissimilarity measures
The importance of this is that in some instances, having two objects with low dissimilarity could mean something negative. For example, cheating.
What are the differences between the measures of central tendency and the measures of dispersion?
The measures of central tendency are the mean, median, mode, and midrange. They are used to measure the location of the middle or center of the data distribution, basically where most values fall. The dispersion measures are the range, quartiles, interquartile range, the five-number summary, boxplots, and the variance and standard deviation of the data. They are mainly used to get an idea of the dispersion of the data, how the data are spread out, and to identify outliers.
Star schema
The most common modeling paradigm, in which the data warehouse contains a large central table containing the bulk of the data, with no redundancy, and a set of smaller attendant tables, one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table
Prepruning
a tree is pruned by halting its construction early, when the goodness measure of a split falls below a threshold
Curse of dimensionality
The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels
What is the importance of similarity measures
They are important because they help us see patterns in data. They also give us knowledge about our data. They are used in clustering algorithms. Similar data points are put into the same clusters, and dissimilar points are placed into different clusters.
Support
Times A and B happen / total number of events
The mean is in general affected by outliers (T/F)
True
The mode is the only measure of central tendency that can be used for nominal attributes. (T/F)
True. An example of this would be hair color, with different categories such as black, brown, blond, and red. Which one is the most common one?
Data discrepancy detection methods
Use metadata; Check field overloading; Check uniqueness rule, consecutive rule and null rule; Use commercial tools like Data scrubbing and Data auditing
Decision tree
a tree that holds the test on an attribute at the node, and a class label at the leaf
Cluster
a collection of data objects that are similar to one another and dissimilar to objects in other groups
CF tree
a height-balanced tree that stores the clustering features for a hierarchical cluster
Model-based methods
a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model (EM, SOM, COBWEB)
BIRCH
begins by partitioning objects hierarchically using tree structures where the nodes can be viewed as microclusters depending on the resolution scale. It then applies other clustering algorithms to perform macroclustering on the microclusters
Hierarchical method
can be agglomerative or divisive, creates a hierarchical decomposition of the set of data using some criterion (DIANA, AGNES, BIRCH, CHAMELEON). Once a step is done, it cannot be undone
k-fold cross validation
data is partitioned into k mutually exclusive subsets or folds. Training and testing are performed k times; in iteration i, fold i is used for testing and the remaining folds for training
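A plain-Python sketch of splitting the data into k folds and rotating the test fold; train_and_score is a placeholder for whatever model-evaluation function is used:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal, mutually exclusive folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_and_score):
    """In iteration i, fold i is the test set and the rest is the training set."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score([data[j] for j in train_idx],
                                      [data[j] for j in test_idx]))
    return sum(scores) / k  # overall score = average over the k iterations

# Usage with a dummy scorer that ignores the training data:
print(cross_validate(list(range(10)), 5, lambda train, test: len(test) / 10))
```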
Among the data warehouse applications, __ applications supports knowledge discovery
data mining
Dendrogram
data structure used to represent the process of hierarchical clustering. Shows how objects are grouped together or partitioned.
Subject-oriented
data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
Class-label attribute
the attribute whose values indicate the class each tuple belongs to; it is discrete-valued and unordered
Selecting which cuboids to materialize
depending on size, sharing, access frequency, etc
Regression
smooths data by fitting it to a function, such as deriving a linear equation for a best-fit line
Attribute Selection Measures
determine how the tuples at a given node are to be split. Provides a ranking for each attribute describing the given training tuples
Splitting criterion
determines which attribute to test at node N by looking for the best way to partition the tuples D into individual classes. Also, tells us the branches of outcomes of the test. Indicates the splitting attribute, split point or a splitting subset
Decision tree advantages
don't require domain knowledge or parameter setting; appropriate for exploratory knowledge discovery; handle multidimensional data; intuitive representation; learning and classification steps are fast and accurate
Bagging
each classifier is trained using sampling with replacement and a classifier is learned from each training set. Each return a prediction which is counted and the prediction with most votes gets chosen
Single-linkage
each cluster is represented by all the objects in the cluster and the similarity between 2 clusters is measured by the closest pair of data points belonging to different clusters
Agglomerative approach (bottom-up)
each object forms a separate group, it merges the objects or groups close to one another until all groups are merged into 1
Pure partition
if all tuples in the partition belong to the same class
Information processing
supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. A current trend is to construct low-cost web-based accessing tools that are then integrated with web browsers.
Advantages of boosting
tends to have greater accuracy but risks overfitting the model, can be extended for numeric prediction
Unsupervised learning
the classifier doesn't know what the class labels are, the # of classes may also be unknown
Supervised learning
the classifier is trained by being told to which classes the tuples belong
Accuracy of the classifier
the percentage of test set tuples that are correctly classified by the classifier
Diameter of a cluster
the square root of the average mean squared distance between all pairs of points in the cluster
Z-score normalization
the values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A: vi' = (vi − mean_A) / std_A
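A small Python sketch of z-score normalization (using the population standard deviation; the sample values are illustrative):

```python
import math

def z_score_normalize(values):
    """v' = (v - mean_A) / std_A."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

print(z_score_normalize([54000, 16000, 35000, 40000]))
```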
Continuous Attributes
typically represented as floating-point variables.
Subjective interestingness measures
unexpected and actionable patterns
What are the data mining functionalities
Characterization and discrimination Mining of frequent patterns, associations, and correlations Classification and regression Clustering analysis Outlier analysis
Data reduction strategies.
Dimensionality reduction Numerosity reduction Data compression
Discuss one of the distance measures that are commonly used for computing the dissimilarity of objects described by numeric attributes.
Euclidean distance: d(i, j) = sqrt((xi1 − xj1)^2 + (xi2 − xj2)^2 + ···); Manhattan distance: d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ···; Minkowski distance: d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ···)^(1/h); Supremum distance: d(i, j) = max over f of |xif − xjf|
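A brief Python sketch of these distance functions (the points p and q are illustrative):

```python
def minkowski(x, y, h):
    """d(i, j) = (sum_f |x_f - y_f|^h)^(1/h); h=1 is Manhattan, h=2 is Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """L_max distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

p, q = (1, 2), (3, 5)
print(minkowski(p, q, 1))  # Manhattan: 5
print(minkowski(p, q, 2))  # Euclidean: ~3.606
print(supremum(p, q))      # 3
```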
In many real-life databases, objects are described by a mixture of attribute types. How can we compute the dissimilarity between objects of mixed attribute types?
In order to determine the dissimilarity between objects of mixed attribute types, there are two main approaches. One is to group each type of attribute together and perform a separate data mining analysis for each type; this is acceptable if the results are consistent, but it is often not viable in real-life projects because analyzing the attribute types separately will most likely generate different results. The second, more acceptable approach processes all attribute types together, performing a single analysis by combining the attributes into a single dissimilarity matrix
What is the importance of data reduction?
It can increase storage efficiency and reduce costs. It allows analytics to take less time and yield similar (if not identical) results
What do we understand by similarity measure?
It quantifies the similarity between two objects. Usually, large values are for similar objects and zero or negative values are for dissimilar objects.
What do we understand by dissimilarity measure and what is its importance?
Measuring the difference between two objects; the greater the difference between the two objects, the higher the value.
Data normalization methods
Min-max normalization Z-score normalization Normalization by decimal scaling
Data mining methodology challenges
Mining various and new kinds of knowledge Mining knowledge in multidimensional space Integrating new methods from multiple disciplines Boosting the power of discovery in a networked environment Handling uncertainty, noise, or incompleteness of data Pattern evaluation and pattern- or constraint-guided mining
Full Materialization
Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This choice typically requires huge amounts of memory space in order to store all of the precomputed cuboids.
What do we understand by data quality and what is its importance?
Data have quality when they satisfy the requirements of the intended use. There are many factors, including accuracy, completeness, consistency, timeliness, believability, and interpretability. Quality also depends on the intended use of the data: for some users the data may be inconsistent, while for others it may just be hard to interpret.