Data Mining

¡Supera tus tareas y exámenes ahora con Quizwiz!

Outlier Analysis

1. A data object that does not comply with the general behavior of the data 2. Noise or exception? ― One person's garbage could be another person's treasure 3. Methods: by product of clustering or regression analysis 4. Useful in fraud detection, rare events analysis

Knowledge to be mined(or:Data mining functions)

1. Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. 2. Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. 3. Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

knowledge discovery from data

1. Data cleaning- to remove noise 2. Data integration - where multiple data sources may be combined 3. data selection - where data relevant to the analysis task are retrieved from the database 4. Data transformation - where data are transformed and consolidate into forms appropriate for mining by performing summary or aggregation operations 5. data mining - an essential process where intelligent methods are applied to extract data patterns. 6. pattern evolution - to identify the truly interesting pattern representing knowledge based on interestingness measure 7. knowledge presentation - where visualization and knowledge representation technique used to present mined knowledge to users.

Advanced data sets and advanced applications

1. Data streams and sensor data 2. Time-series data, temporal data, sequence data(incl.bio-sequences) 3. Structure data, graphs, social networks and multi-linked data 4. Object-relational databases 5. Heterogeneous databases and legacy databases 6. Spatial data and spatiotemporal data 7. Multimedia database 8. Text databases 9. The World-Wide Web

Association and Correlation Analysis

1. Frequent patterns (or frequent itemsets) 2. Association, correlation vs. causality

Structure and Network Analysis

1. Graph mining 2. Information network analysis 3. Web mining

Data cleaning steps

1. Ignore the tuples 2. Fill in the missing value manually 3. Use a global constant to fill in the missing value 4. Use a measure of central tendency for the attribute to fill the missing value 5.Use the attribute mean or median for all samples belonging to the same class as the given tuple 6. Use the most probable value to full in the missing value

Generalization

1. Information integration and data warehouse construction. Data cleaning, transformation, integration, and multidimensional data model 2. Data cube technology, Scalable methods for computing (i.e., materializing) multidimensional aggregates, OLAP (online analytical processing) 3. Multidimensional concept description: Characterization and discrimination. Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region

data transformation steps

1. Smoothing 2. Attribute Construction 3.Aggregration 4.Normalization 5.Discretization 6. Concept hierarchy generation for nominal data.

Sequence, trend and evolution analysis

1. Trend, time-series, and deviation analysis: e.g., regression and value prediction. 2. Sequential pattern mining 3. Periodicity analysis 4. Motifs and biological sequence analysis - Approximate and consecutive motifs 5. Similarity-based analysis 6. Mining data streams - Ordered, time-varying, potentially infinite, data streams

Cluster Analysis

1. Unsupervised learning (i.e., Class label is unknown) 2. Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns 3. Principle: Maximizing intra-class similarity & minimizing interclass similarity 4. Many methods and applications

Classification

1. label prediction - Construct models (functions) based on some training examples. Describe and distinguish classes or concepts for future prediction. Predict some unknown class labels 2. Typical methods - Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern- based classification, logistic regression 3. Typical applications - Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages.

Symmetric

A Binary attribute is ____ if both of its states are equally valuable and carry the same weight

Attribute Vector/ Feature Vector

A set of attributes used to describe a given object

Techniques utilized

Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

Data to be mined

Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks

Graph mining

Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)

Numeric attributes

Interval-scaled attributes and ratio scaled attribute

Nominal Attributes

Nominal, Categorical, Enumerations.

Univariate

One attribute or variable

Database-oriented data sets and applications

Relational database, data warehouse, transactional database

Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Information network analysis

Social networks: actors (objects, nodes) and relationships (edges) Multiple heterogeneous networks Links carry a lot of semantic information: Link mining

Discrepancy detection / data transformation

The two step process of ___ and _____ iterates, this process however is error prone and time consuming, some transformations may introduced more discrepancies.

Web mining

Web is a big information network: from PageRank to Google Analysis of Web information networks

asymmetric

a binary attribute is ____ if the outcomes of the states are not equally important, such as positive and negative outcomes.

Attribute

a data field, representing a characteristic or feature of a data object.

parametric methods

a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data.

Three elements defining data quality

accuracy, completeness, and consistency.

data migration tools

allows simple transformation to be specified such as to replace the string "gender" by "sex"

ETL (extraction, transformation, and loading) tools

allows users to specify transforms through a graphical user interface GUI. These tool typically support only a restricted set of transforms so that, often we may also choose to write custom scripts for this step of the data cleaning process.

field overloading

another error source that typically results when developers squeeze new attributes definitions into unused portions of already defined attribute

redundancy

another important issue in data integration, an attribute may be redundant if it can be derived from another attribute or set of attribute. Some can detected by correlation analysis, given two attributes such analysis can measure how strongly one attribute implies the other, based on the available data.

log linear models

approximate discrete multi dimensional probability distributions.

Interval-scaled attributes

are measured on a scale of equal size units. The values have order and can positive0 or negative, like temperature.

similarity / dissimilarity

are used in data mining applications such as clustering, outlier analysis, and nearest neighbor classification. Such measures of proximity can be computed for each attribute type studied in type studied in this chapter or for combinations of such attributes.

discrepancies

can be caused by several factors including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors, and data decay.

Categorical

can be some kind of category, code, state and so nominal attributes

attribute construction

can help improve accuracy and understanding of structure in high dimensional data

numerosity reduction

data are replaced by alternative, smaller representations using parametric models (regression) or nonparametric models (histograms, clusters, data aggregation)

dimensionality reduction

data encoding schemes are applied so as to obtain a reduced/compressed representation of the original data

data quality

data have quality if they satisfy the requirement of the intended's. Including accuracy, completeness, consistency, timeless, believability and interpretability.

data pre-processing

data integration normalization feature selection dimension reduction

data reduction strategies include

dimensionality reduction, numerosity reduction and at a compression

first step in data cleaning

discrepancy detection

smoothing by bin means

each value in a bin replaced by the mean value of the bin, ex the man of the values 4,8 and 15 in bin 1 is 9.

data auditing tools

find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate each conditions.

correlation coefficient

for numeric attributes we can evaluate the correlation between attributes A and B by computing the...

qqplot

graphs the quantile of one univariate distribution against the corresponding quantities of another

discrete attribute

has a finite or countably infinite set of values, which mayor may not be represented as integers. It can have numeric values, such as 0 or 1

geometric projection technique

helps user find interesting projections of multidimensional data sets

Continuous

if an attribute is not discrete, such as a floating point variable.

data integration

integrating multiple databases, data cubes or files

chernoff faces

introduced in 1973 by statistical Herman chernoff

linear regression

involves finding the best line to fit two attributes or variables so that one attribute can be used to predict the others.

discrete wavelet transform

is a linear signal processing technique that when applied to a data vector X, transforms it to a numerically different vector X of wavelet coefficient.

Binary attribute

is a normal attribute with only 2 categories or state. 0 or 1, 0 means absent, 1 means present or true and false.

Noise

is a random error or variance in a measured variable

Data Mining

Conjuntos de estudio relacionados

BUS 421 Ch. 10 AI

Chapter 21: Abdomen, Ch. 25 Anus, Rectum and Prostate, quiz 9

M2 Equity Method

Official Driving Segment One End of Day Review

English Pop Quiz

Advanced Accounting - Chapter 10

Unit 9 Developmental Psychology

ACCT 2302 CHapter 6 HW

NUR 108 Ch 39 Oxygenation and Perfusion

micro chapter 10

tb ap bio 2-3 (first test)

Chapter 10 "Postural Control Mechanisms" and Chapter 11 "Postural Control in Wellness & Performance"

MPRE

ISYS438 Project Risk Management Chapter 11 Terms

Psy- Addiction

Ethical Hacking - C701 TotalTester Part 1/2

Accounting 201B Chapter 3

chapter 6

Chapter 6 MKTG

Unit # 1 : DNA + Protein Synthesis