Data Mining
Outlier Analysis
1. A data object that does not comply with the general behavior of the data 2. Noise or exception? One person's garbage could be another person's treasure 3. Methods: by-product of clustering or regression analysis 4. Useful in fraud detection and rare-event analysis
Knowledge to be mined (or: Data mining functions)
Characterization, discrimination, association, classification, clustering, trend/deviation analysis, outlier analysis, etc.
knowledge discovery from data
1. Data cleaning - to remove noise and inconsistent data 2. Data integration - where multiple data sources may be combined 3. Data selection - where data relevant to the analysis task are retrieved from the database 4. Data transformation - where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations 5. Data mining - an essential process where intelligent methods are applied to extract data patterns 6. Pattern evaluation - to identify the truly interesting patterns representing knowledge, based on interestingness measures 7. Knowledge presentation - where visualization and knowledge representation techniques are used to present the mined knowledge to users
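A minimal sketch of steps 1-4 on toy pandas DataFrames; the tables and column names ("store", "region", "sales") are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({"store": [1, 1, 2, 2], "sales": [100.0, None, 80.0, 120.0]})
stores = pd.DataFrame({"store": [1, 2], "region": ["north", "south"]})

clean = sales.dropna(subset=["sales"])       # 1. data cleaning: drop noisy/missing tuples
merged = clean.merge(stores, on="store")     # 2. data integration: combine sources
selected = merged[["region", "sales"]]       # 3. data selection: keep relevant attributes
summary = selected.groupby("region").sum()   # 4. data transformation: aggregate/consolidate
# Steps 5-7 (mining, pattern evaluation, knowledge presentation) would operate on `summary`.
```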
Advanced data sets and advanced applications
1. Data streams and sensor data 2. Time-series data, temporal data, sequence data (incl. bio-sequences) 3. Structured data, graphs, social networks, and multi-linked data 4. Object-relational databases 5. Heterogeneous databases and legacy databases 6. Spatial data and spatiotemporal data 7. Multimedia databases 8. Text databases 9. The World Wide Web
Association and Correlation Analysis
1. Frequent patterns (or frequent itemsets) 2. Association, correlation vs. causality
Structure and Network Analysis
1. Graph mining 2. Information network analysis 3. Web mining
Data cleaning steps
1. Ignore the tuple 2. Fill in the missing value manually 3. Use a global constant to fill in the missing value 4. Use a measure of central tendency for the attribute to fill in the missing value 5. Use the attribute mean or median for all samples belonging to the same class as the given tuple 6. Use the most probable value to fill in the missing value
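A sketch of strategies 1, 3, 4, and 5 above using pandas; the "income" and "class" columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"income": [45.0, None, 52.0, None, 61.0],
                   "class":  ["low", "low", "high", "high", "high"]})

ignored  = df.dropna(subset=["income"])                 # 1. ignore the tuple
constant = df["income"].fillna(0.0)                     # 3. global constant
by_mean  = df["income"].fillna(df["income"].mean())     # 4. attribute mean
by_class = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))    # 5. mean within the same class
```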
Generalization
1. Information integration and data warehouse construction: data cleaning, transformation, integration, and the multidimensional data model 2. Data cube technology: scalable methods for computing (i.e., materializing) multidimensional aggregates, OLAP (online analytical processing) 3. Multidimensional concept description: characterization and discrimination, i.e., generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
data transformation steps
1. Smoothing 2. Attribute construction 3. Aggregation 4. Normalization 5. Discretization 6. Concept hierarchy generation for nominal data
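A sketch of two of the steps above, min-max normalization (into [0, 1]) and equal-width discretization, using numpy and pandas on made-up values.

```python
import numpy as np
import pandas as pd

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
normalized = (values - values.min()) / (values.max() - values.min())      # min-max normalization
labels = pd.cut(values, bins=3, labels=["low", "medium", "high"])         # discretize into 3 intervals
```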
Sequence, trend and evolution analysis
1. Trend, time-series, and deviation analysis: e.g., regression and value prediction. 2. Sequential pattern mining 3. Periodicity analysis 4. Motifs and biological sequence analysis - Approximate and consecutive motifs 5. Similarity-based analysis 6. Mining data streams - Ordered, time-varying, potentially infinite, data streams
Cluster Analysis
1. Unsupervised learning (i.e., the class label is unknown) 2. Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns 3. Principle: maximizing intra-class similarity & minimizing inter-class similarity 4. Many methods and applications
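A clustering sketch with scikit-learn's KMeans on synthetic, unlabeled 2-D points; scikit-learn is an assumed dependency and the data are random.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 2))                                      # 100 points, no class labels
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_                                           # new categories (clusters) per point
centers = km.cluster_centers_                                 # one centroid per cluster
```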
Classification
1. Label prediction - construct models (functions) based on some training examples; describe and distinguish classes or concepts for future prediction; predict some unknown class labels 2. Typical methods - decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression 3. Typical applications - credit card fraud detection, direct marketing, classifying stars, diseases, web pages
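A supervised classification sketch: learn a decision tree from labeled training examples and predict class labels for new tuples. This assumes scikit-learn and uses its bundled iris sample data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                      # training examples with known labels
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)    # construct the model
predicted = clf.predict(X[:5])                         # predict class labels for (here, reused) tuples
```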
Symmetric
A binary attribute is ____ if both of its states are equally valuable and carry the same weight
Attribute Vector/ Feature Vector
A set of attributes used to describe a given object
Techniques utilized
Data-intensive methods, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance computing, etc.
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
Numeric attributes
Interval-scaled attributes and ratio-scaled attributes
Nominal Attributes
Nominal, Categorical, Enumerations.
Univariate
One attribute or variable
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges); multiple heterogeneous networks; links carry a lot of semantic information (link mining)
Discrepancy detection / data transformation
The two-step process of ___ and ____ iterates; this process, however, is error-prone and time-consuming, and some transformations may introduce more discrepancies.
Web mining
The Web is a big information network: from PageRank to Google; analysis of Web information networks
asymmetric
a binary attribute is ____ if the outcomes of the states are not equally important, such as positive and negative outcomes.
Attribute
a data field, representing a characteristic or feature of a data object.
parametric methods
a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data.
Three elements defining data quality
accuracy, completeness, and consistency.
data migration tools
allow simple transformations to be specified, such as replacing the string "gender" by "sex"
ETL (extraction, transformation, and loading) tools
allow users to specify transforms through a graphical user interface (GUI). These tools typically support only a restricted set of transforms, so we may often also choose to write custom scripts for this step of the data cleaning process.
field overloading
another error source that typically results when developers squeeze new attribute definitions into unused portions of already defined attributes
redundancy
another important issue in data integration; an attribute may be redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis: given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
log linear models
approximate discrete multidimensional probability distributions.
Interval-scaled attributes
are measured on a scale of equal-size units. The values have order and can be positive, zero, or negative, like temperature.
similarity / dissimilarity
are used in data mining applications such as clustering, outlier analysis, and nearest-neighbor classification. Such measures of proximity can be computed for each attribute type studied in this chapter or for combinations of such attributes.
discrepancies
can be caused by several factors including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors, and data decay.
Categorical
the values can be some kind of category, code, or state, so nominal attributes are also referred to as categorical
attribute construction
can help improve accuracy and understanding of structure in high dimensional data
numerosity reduction
data are replaced by alternative, smaller representations using parametric models (regression) or nonparametric models (histograms, clusters, data aggregation)
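A nonparametric numerosity-reduction sketch: replace 1,000 raw values with 10 histogram bucket counts. The price data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
prices = rng.integers(1, 30, size=1000)            # 1,000 raw values
counts, bin_edges = np.histogram(prices, bins=10)  # 10 bucket counts replace the raw data
```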
dimensionality reduction
data encoding schemes are applied so as to obtain a reduced/compressed representation of the original data
data quality
data have quality if they satisfy the requirements of the intended use, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
data pre-processing
data integration, normalization, feature selection, dimension reduction
data reduction strategies include
dimensionality reduction, numerosity reduction, and data compression
first step in data cleaning
discrepancy detection
smoothing by bin means
each value in a bin is replaced by the mean value of the bin, e.g., the mean of the values 4, 8, and 15 in bin 1 is 9.
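A sketch of equal-frequency binning with smoothing by bin means, reproducing the 4, 8, 15 example above with numpy.

```python
import numpy as np

sorted_prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = np.array_split(sorted_prices, 3)                  # three equal-frequency bins
smoothed = [np.full(len(b), b.mean()) for b in bins]     # each value -> its bin's mean
# the first bin [4, 8, 15] becomes [9.0, 9.0, 9.0]
```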
data auditing tools
find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions.
correlation coefficient
for numeric attributes we can evaluate the correlation between attributes A and B by computing the...
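A sketch of computing the (Pearson) correlation coefficient between two numeric attributes with numpy; the values are made up, and a coefficient near +1 or -1 suggests one attribute may be redundant.

```python
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.0, 3.0, 5.5, 8.2, 9.9])
r = np.corrcoef(a, b)[0, 1]   # correlation coefficient in [-1, 1]
```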
qqplot
graphs the quantiles of one univariate distribution against the corresponding quantiles of another
discrete attribute
has a finite or countably infinite set of values, which may or may not be represented as integers. It can have numeric values, such as 0 or 1
geometric projection technique
helps users find interesting projections of multidimensional data sets
Continuous
if an attribute is not discrete, such as a floating point variable.
data integration
integrating multiple databases, data cubes or files
Chernoff faces
introduced in 1973 by statistician Herman Chernoff; they display multidimensional data as cartoon faces whose facial features represent attribute values.
linear regression
involves finding the best line to fit two attributes or variables so that one attribute can be used to predict the other.
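A sketch of fitting the best straight line y = w*x + b by least squares with numpy's polyfit, then using one attribute to predict the other; the data points are made up.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
w, b = np.polyfit(x, y, deg=1)     # slope and intercept of the fitted line
y_pred = w * 6.0 + b               # predicted value of y at x = 6
```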
discrete wavelet transform
is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector X' of wavelet coefficients.
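A one-level Haar discrete wavelet transform, sketched with the PyWavelets (pywt) package, which is an assumed dependency. Truncating small coefficients would give a compressed approximation of the vector.

```python
import pywt

X = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
approx, detail = pywt.dwt(X, "haar")   # approximation and detail wavelet coefficients
```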
Binary attribute
is a nominal attribute with only two categories or states, 0 or 1, where 0 typically means the attribute is absent and 1 means it is present (or false and true).
Noise
is a random error or variance in a measured variable
tag cloud
is a visualization of statistics of user generated tags.
ordinal attributes
is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
multiple linear regression
is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface
ratio-scaled attribute
is a numeric attribute with an inherent (true) zero point, like the Kelvin temperature scale.
numeric attribute
is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
alternative names
knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence.
outliers
may be detected by clustering.
nested discrepancies
may only be detected after others have been fixed.
data transformation
most errors, however, will require ____; that is, once we find discrepancies, we typically need to define and apply transformations to correct them.
data transformation
normalization, data discretization, and concept hierarchy generation
data reduction
obtains a reduced representation of the data set that is much smaller in volume, yet produces the same or almost the same results; techniques include the discrete wavelet transform and numerosity reduction.
health care & medical data mining
often adopted such a view in statistics and machine learning
data mining
often requires data integration, the merging of data from multiple data stores.
hierarchical visualization techniques
partition all dimensions into subsets
data mining
pattern discovery, association & correlation, classification, clustering, outlier analysis
post processing
pattern evaluation, pattern selection, pattern interpretation, pattern visualization
Types of Data Visualization graphs
pixel-oriented techniques, geometric projection techniques, icon-based techniques, and hierarchical and graph-based techniques
attribute subset selection
reduces the data set size by removing irrelevant or redundant attributes. The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Basic heuristic methods: 1. Stepwise forward selection 2. Stepwise backward elimination 3. Combination of forward selection and backward elimination 4. Decision tree induction
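A stepwise forward selection sketch using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24 and later); the logistic-regression estimator and the iris data are just illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=2,
                                     direction="forward").fit(X, y)
kept = selector.get_support()   # boolean mask of the selected attributes
```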
Nominal
relating to names; the values can be symbols or names of things.
smoothing by bin medians
each value in a bin is replaced by the bin median
data cleaning
routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
unique rule
says that each value of the given attribute must be different from all other values for that attribute
consecutive rule
says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique
entity identification problem
schema integration and object matching can be tricky: how can equivalent real-world entities from multiple data sources be matched up?
principal components analysis
searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.
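A PCA sketch with scikit-learn: project n = 5 attributes onto the first k = 2 orthogonal components (k ≤ n); the data are synthetic random values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.random((50, 5))                              # 50 tuples, n = 5 numeric attributes
X_reduced = PCA(n_components=2).fit_transform(X)     # reduced representation, shape (50, 2)
```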
binning methods
smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins.
null rule
specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled.
metadata
such knowledge is ___, that is, data about data
stick figure visualization
maps multidimensional data to five-piece stick figures, where each figure has four limbs and a body
data reduction
techniques can be applied to obtain a reduced representation of the dataset that is much smaller in volume, yet closely maintains the integrity of the original data.
orthonormal
the columns are unit vectors and are mutually orthogonal, so that the matrix inverse is just its transpose.
contingency table
the data tuples described by A and B can be shown as a ___, with the c values of A making up the columns and the r values of B making up the rows.
expected values
the mean values of A and B, respectively, are known as the ___
smoothing by bin boundaries
the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
data compression
transformations are applied so as to obtain a reduced or compressed representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is lossless; if not, it is lossy.
bivariate distribution
two attributes
data scrubbing tools
use simple domain knowledge to detect errors and make corrections in the data
icon based visualization technique
use small icons to represent multidimensional data values
Dimension
used in data warehousing
Feature
used in machine learning
Enumerations
values used as names or codes, with no quantitative meaning or meaningful order
business intelligence view
warehouse, data cube, reporting but not much mining