Data Mining

Outlier Analysis

1. A data object that does not comply with the general behavior of the data 2. Noise or exception? ("One person's garbage could be another person's treasure") 3. Methods: by-product of clustering or regression analysis 4. Useful in fraud detection and rare-event analysis

Knowledge to be mined (or: data mining functions)

Characterization, discrimination, association, classification, clustering, trend/deviation analysis, outlier analysis, etc.

knowledge discovery from data

1. Data cleaning - to remove noise and inconsistent data 2. Data integration - where multiple data sources may be combined 3. Data selection - where data relevant to the analysis task are retrieved from the database 4. Data transformation - where data are transformed and consolidated into forms appropriate for mining, by performing summary or aggregation operations 5. Data mining - an essential process where intelligent methods are applied to extract data patterns 6. Pattern evaluation - to identify the truly interesting patterns representing knowledge, based on interestingness measures 7. Knowledge presentation - where visualization and knowledge representation techniques are used to present the mined knowledge to users

Advanced data sets and advanced applications

1. Data streams and sensor data 2. Time-series data, temporal data, sequence data (incl. bio-sequences) 3. Structured data, graphs, social networks and multi-linked data 4. Object-relational databases 5. Heterogeneous databases and legacy databases 6. Spatial data and spatiotemporal data 7. Multimedia databases 8. Text databases 9. The World Wide Web

Association and Correlation Analysis

1. Frequent patterns (or frequent itemsets) 2. Association, correlation vs. causality

Structure and Network Analysis

1. Graph mining 2. Information network analysis 3. Web mining

Data cleaning steps

1. Ignore the tuple 2. Fill in the missing value manually 3. Use a global constant to fill in the missing value 4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value 5. Use the attribute mean or median for all samples belonging to the same class as the given tuple 6. Use the most probable value to fill in the missing value
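
To make strategies 3-5 concrete, here is a minimal Python/pandas sketch; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical toy data: 'income' has missing values; 'class' is the label
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "A"],
    "income": [50.0, None, 70.0, None, 40.0],
})

# Strategy 3: fill with a global constant (a sentinel such as -1)
by_constant = df["income"].fillna(-1)

# Strategy 4: fill with a measure of central tendency (here, the mean)
by_mean = df["income"].fillna(df["income"].mean())

# Strategy 5: fill with the mean computed per class of the tuple
by_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```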

Generalization

1. Information integration and data warehouse construction: data cleaning, transformation, integration, and the multidimensional data model 2. Data cube technology: scalable methods for computing (i.e., materializing) multidimensional aggregates; OLAP (online analytical processing) 3. Multidimensional concept description: characterization and discrimination; generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region

data transformation steps

1. Smoothing 2. Attribute construction 3. Aggregation 4. Normalization 5. Discretization 6. Concept hierarchy generation for nominal data (a normalization/discretization sketch follows below)
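
A small NumPy sketch of two of these steps, normalization and discretization, on hypothetical attribute values:

```python
import numpy as np

values = np.array([12.0, 35.0, 47.0, 64.0, 90.0])  # hypothetical attribute values

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
minmax = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: v' = (v - mean) / std
zscore = (values - values.mean()) / values.std()

# Discretization into 3 equal-width intervals (interval index 0..2)
edges = np.linspace(values.min(), values.max(), 4)  # 3 intervals -> 4 edges
interval = np.digitize(values, edges[1:-1])
```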

Sequence, trend and evolution analysis

1. Trend, time-series, and deviation analysis: e.g., regression and value prediction 2. Sequential pattern mining 3. Periodicity analysis 4. Motifs and biological sequence analysis - approximate and consecutive motifs 5. Similarity-based analysis 6. Mining data streams - ordered, time-varying, potentially infinite data streams

Cluster Analysis

1. Unsupervised learning (i.e., the class label is unknown) 2. Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns 3. Principle: maximizing intra-class similarity & minimizing inter-class similarity 4. Many methods and applications

Classification

1. Label prediction - construct models (functions) based on some training examples; describe and distinguish classes or concepts for future prediction; predict some unknown class labels 2. Typical methods - decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression 3. Typical applications - credit card fraud detection, direct marketing, classifying stars, diseases, web pages

Symmetric

A binary attribute is ____ if both of its states are equally valuable and carry the same weight.

Attribute Vector/ Feature Vector

A set of attributes used to describe a given object

Techniques utilized

Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

Data to be mined

Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multimedia, graphs & social and information networks

Graph mining

Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)

Numeric attributes

Interval-scaled attributes and ratio-scaled attributes

Nominal Attributes

Nominal, Categorical, Enumerations.

Univariate

One attribute or variable

Database-oriented data sets and applications

Relational database, data warehouse, transactional database

Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Information network analysis

Social networks: actors (objects, nodes) and relationships (edges); multiple heterogeneous networks; links carry a lot of semantic information (link mining)

Discrepancy detection / data transformation

The two-step process of ___ and ___ iterates. This process, however, is error-prone and time-consuming, and some transformations may introduce more discrepancies.

Web mining

The Web is a big information network: from PageRank to Google; analysis of Web information networks

asymmetric

A binary attribute is ____ if the outcomes of the states are not equally important, such as positive and negative outcomes.

Attribute

a data field, representing a characteristic or feature of a data object.

parametric methods

a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data.

Three elements defining data quality

accuracy, completeness, and consistency.

data migration tools

allow simple transformations to be specified, such as replacing the string "gender" with "sex"

ETL (extraction, transformation, and loading) tools

allow users to specify transformations through a graphical user interface (GUI). These tools typically support only a restricted set of transforms, so we may often also choose to write custom scripts for this step of the data cleaning process.

field overloading

another error source that typically results when developers squeeze new attribute definitions into unused portions of already defined attributes

redundancy

another important issue in data integration. An attribute may be redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis: given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.

log linear models

approximate discrete multidimensional probability distributions.

Interval-scaled attributes

are measured on a scale of equal-size units. The values have order and can be positive, 0, or negative, like temperature.

similarity / dissimilarity

are used in data mining applications such as clustering, outlier analysis, and nearest-neighbor classification. Such measures of proximity can be computed for each attribute type studied in this chapter or for combinations of such attributes.

discrepancies

can be caused by several factors, including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors, and data decay.

Categorical

values can be some kind of category, code, or state, so nominal attributes are also referred to as categorical

attribute construction

can help improve the accuracy and understanding of structure in high-dimensional data

numerosity reduction

data are replaced by alternative, smaller representations using parametric models (regression) or nonparametric models (histograms, clusters, data aggregation)

dimensionality reduction

data encoding schemes are applied so as to obtain a reduced/compressed representation of the original data

data quality

data have quality if they satisfy the requirements of the intended use, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

data pre-processing

data integration, normalization, feature selection, dimension reduction

data reduction strategies include

dimensionality reduction, numerosity reduction, and data compression

first step in data cleaning

discrepancy detection

smoothing by bin means

each value in a bin is replaced by the mean value of the bin; e.g., the mean of the values 4, 8, and 15 in Bin 1 is 9.
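
A NumPy sketch of this smoothing, assuming the full sorted data vector behind the example above (4, 8, 15, 21, 21, 24, 25, 28, 34) partitioned into equal-frequency bins of depth 3; smoothing by bin boundaries is included for contrast:

```python
import numpy as np

# Sorted values partitioned into equal-frequency bins of depth 3;
# Bin 1 is [4, 8, 15] as in the example above
sorted_vals = np.array([4.0, 8.0, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0, 34.0])
bins = sorted_vals.reshape(3, 3)

# Smoothing by bin means: every value becomes its bin's mean (Bin 1 -> 9, 9, 9)
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: each value is replaced by the closer
# of the bin's minimum and maximum (Bin 1 -> 4, 4, 15)
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundaries = np.where(bins - lo <= hi - bins, lo, hi)
```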

data auditing tools

find discrepancies by analyzing the data to discover rules and relationships, and by detecting data that violate such conditions.

correlation coefficient

for numeric attributes, we can evaluate the correlation between attributes A and B by computing the...
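
A minimal sketch of the computation on hypothetical attribute values; NumPy's built-in np.corrcoef serves only as a sanity check:

```python
import numpy as np

def pearson_r(a, b):
    """Correlation coefficient r_{A,B} = cov(A,B) / (sigma_A * sigma_B)."""
    a_dev, b_dev = a - a.mean(), b - b.mean()
    return float((a_dev * b_dev).sum()
                 / np.sqrt((a_dev ** 2).sum() * (b_dev ** 2).sum()))

A = np.array([2.0, 4.0, 6.0, 8.0])   # hypothetical numeric attributes
B = np.array([1.0, 3.0, 5.0, 9.0])
print(pearson_r(A, B))               # close to +1: strong positive correlation
print(np.corrcoef(A, B)[0, 1])       # sanity check against NumPy's built-in
```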

qqplot

graphs the quantiles of one univariate distribution against the corresponding quantiles of another
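
A sketch of the underlying computation on hypothetical samples; pairing matching quantiles of the two distributions and plotting one against the other (with the reference line y = x) yields the q-q plot:

```python
import numpy as np

# Hypothetical samples, e.g. unit prices observed at two branches
x = np.random.normal(50.0, 10.0, size=500)
y = np.random.normal(55.0, 12.0, size=400)

# Matching quantiles of the two distributions; points near y = x
# when plotted indicate similarly distributed data
qs = np.linspace(0.01, 0.99, 99)
x_q = np.quantile(x, qs)
y_q = np.quantile(y, qs)
```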

discrete attribute

has a finite or countably infinite set of values, which may or may not be represented as integers. It can have numeric values, such as 0 or 1.

geometric projection technique

helps users find interesting projections of multidimensional data sets

Continuous

if an attribute is not discrete, it is continuous, such as a floating-point variable.

data integration

integrating multiple databases, data cubes, or files

chernoff faces

introduced in 1973 by statistician Herman Chernoff

linear regression

involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other.
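
A minimal sketch using NumPy's least-squares polynomial fit on hypothetical paired observations:

```python
import numpy as np

# Hypothetical paired observations of attributes x and y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit of the line y = w1*x + w0
w1, w0 = np.polyfit(x, y, deg=1)

# Use the fitted line to predict y for a new x
y_new = w1 * 6.0 + w0
```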

discrete wavelet transform

is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector X' of wavelet coefficients.
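
A sketch of a single level of the simplest case, the Haar wavelet transform, implemented directly in NumPy; the function name and data values are hypothetical:

```python
import numpy as np

def haar_dwt_step(x):
    """One level of the Haar DWT: pairwise scaled sums and differences.
    Assumes len(x) is even; total output length equals len(x)."""
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)  # smooth/approximation part
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)  # detail (difference) part
    return approx, detail

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
cA, cD = haar_dwt_step(x)
# Zeroing small detail coefficients yields a compressed approximation of x.
```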

Binary attribute

is a nominal attribute with only two categories or states: 0 or 1, where 0 means the attribute is absent and 1 means it is present; if the states correspond to true and false, the attribute is Boolean.

Noise

is a random error or variance in a measured variable

tag cloud

is a visualization of statistics of user generated tags.

ordinal attributes

is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.

multiple linear regression

is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface
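
A sketch using NumPy's general least-squares solver on hypothetical data with two predictor attributes:

```python
import numpy as np

# Hypothetical data: two predictor attributes x1, x2 and a response y
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 4.2, 10.9, 10.1, 15.2])

# Prepend a column of ones for the intercept, then solve least squares
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
w0, w1, w2 = coeffs  # fitted surface: y ~= w0 + w1*x1 + w2*x2
```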

ratio-scaled attribute

is a numeric attribute with an inherent zero point, like the Kelvin scale; it has a true zero point.

numeric attribute

is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.

alternative names

knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence.

outliers

may be detected by clustering.

nested discrepancies

may only be detected after others have been fixed.

data transformation

most errors, however, will require ____; that is, once we find discrepancies, we typically need to define and apply transformations to correct them.

data transformation

normalization, data discretization, and concept hierarchy generation

data reduction

obtains a reduced representation of the data set that is much smaller in volume, yet produces the same or almost the same analytical results; strategies include the discrete wavelet transform and numerosity reduction.

health care & medical data mining

often adopted such a view in statistics and machine learning

data mining

often requires data integration, the merging of data from multiple data stores.

hierarchical visualization techniques

partition all dimensions into subsets (i.e., subspaces)

data mining

pattern discovery, association & correlation, classification, clustering, outlier analysis

post processing

pattern evaluation, pattern selection, pattern interpretation, pattern visualization

Types of Data Visualization graphs

pixel-oriented techniques, geometric projection techniques, icon-based techniques, and hierarchical and graph-based techniques

attribute subset selection

reduces the data set size by removing irrelevant or redundant attributes. The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Basic heuristic methods: 1. Stepwise forward selection (sketched below) 2. Stepwise backward elimination 3. Combination of forward selection and backward elimination 4. Decision tree induction
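
A minimal sketch of stepwise forward selection, assuming a caller-supplied score function (e.g., cross-validated accuracy of a classifier trained on the chosen attribute subset); all names here are hypothetical:

```python
def forward_selection(score, all_attrs, k):
    """Greedy stepwise forward selection (a sketch).

    `score(subset)` is a caller-supplied evaluation of an attribute subset;
    score([]) should return a baseline value.
    """
    selected = []
    remaining = list(all_attrs)
    while remaining and len(selected) < k:
        # Pick the attribute whose addition improves the score the most
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining attribute improves the score; stop early
        selected.append(best)
        remaining.remove(best)
    return selected
```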

Nominal

relating to names; the values are symbols or names of things.

smoothing by bin medians

each value in a bin is replaced by the bin median

data cleaning

routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

unique rule

says that each value of the given attribute must be different from all other values for that attribute

consecutive rule

says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique

entity identification problem

schema integration and object matching can be tricky: how can equivalent real-world entities from multiple data sources be matched up?

principal components analysis

searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k is less than or equal to n.
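
A compact sketch of PCA via eigendecomposition of the covariance matrix; the function name and data are hypothetical:

```python
import numpy as np

def pca(X, k):
    """Project n-dimensional data onto its k principal components (k <= n)."""
    Xc = X - X.mean(axis=0)                  # 1. center the data
    cov = np.cov(Xc, rowvar=False)           # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # 3. orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]        # 4. sort by decreasing variance
    components = eigvecs[:, order[:k]]       # keep the k "strongest" vectors
    return Xc @ components                   # data expressed in the new basis

X = np.random.randn(100, 5)                  # hypothetical 5-dimensional data
X_reduced = pca(X, k=2)                      # reduced to 2 dimensions
```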

binning methods

smooth sorted data values by consulting the "neighborhood," that is, the values around them. The sorted values are distributed into a number of "buckets," or bins.

null rule

specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled.

metadata

such knowledge is "data about data"

stick figure visualization

technique that maps multidimensional data to five-piece stick figures, where each figure has four limbs and a body

data reduction

techniques can be applied to obtain a reduced representation of the dataset that is much smaller in volume, yet closely maintains the integrity of the original data.

orthonormal

the columns are unit vectors and are mutually orthogonal, so that the matrix inverse is just its transpose.

contingency table

the data tuples described by attributes A and B can be shown as a ___, with the c values of A making up the columns and the r values of B making up the rows.
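
A sketch building such a table with pandas and feeding it to SciPy's chi-square test of independence; the attribute names and values are hypothetical:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical nominal attributes A and B
df = pd.DataFrame({
    "A": ["male", "male", "female", "female", "female", "male"],
    "B": ["fiction", "nonfiction", "fiction", "fiction", "nonfiction", "fiction"],
})

# r x c contingency table: values of B as rows, values of A as columns
table = pd.crosstab(df["B"], df["A"])

# Chi-square test of independence; `expected` holds the expected
# frequencies under the hypothesis that A and B are independent
chi2, p, dof, expected = chi2_contingency(table)
```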

expected values

the mean values of A and B, respectively, are known as ___

smoothing by bin boundaries

the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.

data compression

transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is lossless; if not, it is lossy.
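
A tiny illustration of the lossless case using Python's standard zlib module (the byte string is hypothetical): the original data is reconstructed exactly from the compressed form.

```python
import zlib

original = b"AAAABBBCCDAA" * 100        # hypothetical repetitive byte string

compressed = zlib.compress(original)     # reduced representation
restored = zlib.decompress(compressed)   # exact reconstruction

assert restored == original              # no information loss -> lossless
print(len(original), "->", len(compressed), "bytes")
```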

bivariate distribution

two attributes

data scrubbing tools

use simple domain knowledge to detect errors and make corrections in the data

icon based visualization technique

use small icons to represent multidimensional data values

Dimension

used in data warehousing

Feature

used in machine learning

Enumerations

values (e.g., integer codes) with no meaningful order

business intelligence view

warehouse, data cube, reporting but not much mining

