1: Introduction to Data Mining

Data Mining Tasks

- Classification
- Clustering
- Sequential Pattern Discovery
- Regression
- Deviation Detection

Business/Analytics applications

- Customer relationship management applications
- Discovery of segments or groups within a customer data set
- Identifying the characteristics of the most successful employees
- Market basket analysis
- Catalog marketing industry

Input Data can be described by the following

- Data sets
- Attributes and measurements
- Attribute type

Different characteristics of data sets

- Dimensionality
- Sparsity
- Resolution

Other classification of attributes

- Discrete attribute: has a finite or countably infinite set of values, e.g. zip codes, ID numbers, dates, colours, standard sizes
- Binary attribute: a special case of a discrete attribute that takes only two values
- Continuous attribute: one whose values are real numbers, e.g. temperature, weight, height, speed
- Asymmetric attribute: one for which only the presence of a value (a non-zero value) is regarded as important

Classification: Definition

- Given a collection of records (training set): each record contains a set of attributes, and one of the attributes is the class
- Find a model for the class attribute as a function of the values of the other attributes
- Goal: previously unseen records should be assigned a class as accurately as possible
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set and a test set: the training set is used to build the model and the test set is used to validate it.
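The train/test procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy records, the 70/30 split, and the choice of a 1-nearest-neighbour model are all hypothetical and are not taken from the study set.

```python
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(training_set, attributes):
    """Assign the class of the closest training record (squared Euclidean distance)."""
    def squared_distance(record):
        return sum((a - b) ** 2 for a, b in zip(record[0], attributes))
    return min(training_set, key=squared_distance)[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(predict_1nn(training_set, attrs) == label
                  for attrs, label in test_set)
    return correct / len(test_set)

# Hypothetical toy records: ((attribute values), class attribute)
records = [
    ((1.0, 1.1), "A"), ((0.9, 1.0), "A"), ((1.2, 0.8), "A"),
    ((1.1, 1.3), "A"), ((1.0, 0.9), "A"),
    ((5.0, 5.1), "B"), ((4.8, 5.3), "B"), ((5.2, 4.9), "B"),
    ((5.1, 5.0), "B"), ((4.9, 4.7), "B"),
]

train, test = split_train_test(records)
print(len(train), len(test), accuracy(train, test))
```

The model only ever sees the training set; the held-out test set stands in for "previously unseen records".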

Different types of Attribute

- Nominal: the values of a nominal attribute are just different names
- Ordinal: the values of an ordinal attribute allow objects to be ordered
- Interval: the differences between values of an interval attribute are meaningful
- Ratio: both differences and ratios are meaningful for ratio attributes

Types of Data Sets

- Record data
- Ordered data
- Graph-based data

Medicine/Science/Engineering applications

- Understand the mapping relationship between the inter-individual variation in human DNA sequences
- Condition monitoring of high voltage electrical equipment
- Dissolved gas analysis on power transformers
- Analysis of factors leading students to choose to engage in behaviours to reduce their learning
- Discovering patterns associating drug prescriptions to medical diagnoses

Steps of knowledge discovery

1. Data cleaning: to remove noise and inconsistent data
2. Data integration: to combine multiple data sources
3. Data selection: to retrieve data from databases
4. Data transformation: to get data into forms appropriate for data mining
5. Data mining: to extract data patterns
6. Pattern evaluation: to identify interesting patterns
7. Knowledge presentation: to present mined knowledge to users

Data quality: Missing values, Inconsistent values and Duplicate data

Missing values: one or more attribute values are not available in a data object. Values can be missing because the information was not collected, because some attributes are not applicable, because their presence depends on the presence of other values, etc.
Inconsistent values: values that violate given consistency constraints
Duplicate data: data objects that are duplicates or almost duplicates of each other
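A minimal sketch of how these three data quality problems might be checked in practice. The record layout, the attribute names, and the consistency constraints below are hypothetical; only exact duplicates are handled, not near-duplicates.

```python
def find_missing(records):
    """Return (row index, attribute) pairs whose value is missing (None)."""
    return [(i, k) for i, r in enumerate(records)
            for k, v in r.items() if v is None]

def find_inconsistent(records, constraints):
    """Return (row index, attribute) pairs that violate a constraint predicate."""
    return [(i, k) for i, r in enumerate(records)
            for k, is_valid in constraints.items()
            if r.get(k) is not None and not is_valid(r[k])]

def drop_exact_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Hypothetical records and constraints, for illustration only
records = [
    {"id": 1, "age": 34, "height_cm": 172.0},
    {"id": 2, "age": None, "height_cm": 165.0},   # missing value
    {"id": 3, "age": 29, "height_cm": -5.0},      # violates height > 0
    {"id": 1, "age": 34, "height_cm": 172.0},     # exact duplicate of row 0
]
constraints = {
    "age": lambda a: 0 <= a <= 130,
    "height_cm": lambda h: h > 0,
}

print(find_missing(records))
print(find_inconsistent(records, constraints))
print(len(drop_exact_duplicates(records)))
```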

Time series data

Special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time

Dimensionality

The dimensionality of a data set is the number of attributes that the objects in the data set possess

Resolution

The resolution of a data set is the average "distance" between the measurements of the attributes of the data objects

Sparsity

The sparsity of a data set is the frequency with which attributes appear in the descriptions of the objects

Cluster Analysis

Finds groups of closely related observations such that observations that belong to the same cluster are more similar to each other than to observations that belong to other clusters.

Data integration: Elimination of redundancies

Finding the attributes whose values can be derived from other attributes, e.g. through correlation analysis
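As an illustration of correlation analysis for spotting redundant attributes, the sketch below computes the Pearson correlation coefficient between pairs of attributes. The attribute names and values are made up, and Pearson correlation is one possible choice of measure, not one named by the study set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical attributes: Fahrenheit is fully derivable from Celsius
celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
humidity = [30.0, 80.0, 45.0, 60.0, 20.0]

r_redundant = pearson(celsius, fahrenheit)
r_unrelated = pearson(celsius, humidity)
print(round(r_redundant, 3), round(r_unrelated, 3))
```

A coefficient close to 1 or -1 suggests that one attribute carries almost no information beyond the other, so one of them is a candidate for removal.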

Data integration

Merging data from multiple data sources into a coherent data store

Examples of application classes

- Business/Analytics applications
- Medicine applications
- Science and Engineering applications

Sparse data matrix

A data matrix with missing or unavailable elements

Transaction data

A set of records, where each record involves a set of items

Example of Predictive modeling

Predicting the species of a flower based on the characteristics of the flower. Based on the categories of petal width and petal length, the following rules can be derived:
- petal width low and petal length low implies the species Setosa
- petal width medium and petal length medium implies the species Versicolour
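The two rules above can be written as a tiny rule-based predictor. The numeric boundaries that turn a measurement into "low", "medium", or "high" are assumptions chosen for illustration; the study set does not state them. Measurements that match neither rule are left unclassified.

```python
def categorize(value, low_upper, medium_upper):
    """Map a numeric measurement to 'low', 'medium', or 'high'."""
    if value < low_upper:
        return "low"
    if value < medium_upper:
        return "medium"
    return "high"

def predict_species(petal_width_cm, petal_length_cm):
    """Apply the two rules from the card; return None when no rule fires."""
    # Assumed category boundaries (hypothetical, in centimetres)
    width = categorize(petal_width_cm, 0.75, 1.75)
    length = categorize(petal_length_cm, 2.5, 5.0)
    if width == "low" and length == "low":
        return "Setosa"
    if width == "medium" and length == "medium":
        return "Versicolour"
    return None

print(predict_species(0.2, 1.4))
print(predict_species(1.3, 4.5))
print(predict_species(2.1, 5.8))
```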

Data preprocessing: Aggregation, Sampling, Dimensionality reduction, Feature subset selection

- Aggregation combines two or more objects into a single object
- Sampling selects a subset of the data objects to be analyzed
- Dimensionality reduction reduces the total number of attributes describing an object by creating new attributes that are combinations of the old ones
- Feature subset selection reduces the total number of attributes by eliminating unimportant attributes
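A short sketch of three of these preprocessing steps on hypothetical sales records: aggregation (summing per store), simple random sampling, and feature subset selection (keeping a chosen list of attributes). Which attributes count as unimportant is simply assumed here; the function and field names are illustrative only.

```python
import random
from collections import defaultdict

def aggregate_by(records, key, value):
    """Aggregation: combine all records sharing `key` into one total."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r[value]
    return dict(totals)

def simple_random_sample(records, n, seed=0):
    """Sampling: pick n records without replacement."""
    return random.Random(seed).sample(records, n)

def select_features(records, keep):
    """Feature subset selection: keep only the listed attributes."""
    return [{k: r[k] for k in keep} for r in records]

# Hypothetical daily sales records
sales = [
    {"store": "north", "day": 1, "amount": 120.0},
    {"store": "north", "day": 2, "amount": 80.0},
    {"store": "south", "day": 1, "amount": 50.0},
    {"store": "south", "day": 2, "amount": 70.0},
]

totals = aggregate_by(sales, "store", "amount")
sample = simple_random_sample(sales, 2)
reduced = select_features(sales, ["store", "amount"])
print(totals)
```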

Data matrix

All records have a fixed set of numeric attributes, so data objects can be considered as "points" in a multidimensional space where each dimension represents a distinct attribute describing the object

Attributes and measurements

An attribute is a property or characteristic of an object that may vary either from one object to another or from one time to another

Anomaly detection

Anomaly detection is a task of identifying observations whose characteristics are significantly different from the rest of the data.

Association analysis

Association analysis is used to discover patterns that describe strongly associated features in the data

Attribute type

Attribute type is determined by the properties of its values that correspond to underlying properties of the attribute

Example of anomaly detection

Credit card fraud detection. Based on the analysis of legitimate credit card transactions, a profile of a legitimate transaction is built. When a new transaction arrives, it is compared against the profile; if its characteristics are very different from the earlier created profile, the transaction is flagged as suspect.
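One very simple way to realise such a profile is sketched below: the profile is the mean and standard deviation of past legitimate transaction amounts, and a new transaction is flagged when it lies more than a chosen number of standard deviations from the mean. The single-attribute profile, the threshold of 3, and the amounts are all assumptions; real fraud detection uses far richer profiles.

```python
import statistics

def build_profile(legitimate_amounts):
    """Profile of legitimate behaviour: (mean, sample standard deviation)."""
    return (statistics.mean(legitimate_amounts),
            statistics.stdev(legitimate_amounts))

def is_suspect(amount, profile, threshold=3.0):
    """Flag a transaction whose amount is far from the legitimate profile."""
    mean, std = profile
    return abs(amount - mean) / std > threshold

# Hypothetical amounts of past legitimate transactions
legitimate = [25.0, 40.0, 32.5, 18.0, 55.0, 29.0, 47.5, 36.0]
profile = build_profile(legitimate)

print(is_suspect(42.0, profile))    # close to the usual amounts
print(is_suspect(950.0, profile))   # very different from the profile
```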

What is data mining?

Data mining is the process of automatically discovering useful information in large data repositories

Data sets

A data set is a collection of data objects (records, points, vectors, graphs, observations, etc.)

Sequence data

Data set that is a sequence of individual entities, such as a sequence of words or letters

Data integration: Detection and resolution of data value conflicts

Detection and resolution of data value conflicts means identifying and eliminating all cases in which, for the same real-world entity, the values of the same attribute differ between sources

Data with objects that are graphs

If objects have internal structure then the objects contain sub-objects that have relationships among them

Sequential data (temporal data)

Extension of record data where each record has a time moment associated with it

Data preprocessing: Feature creation, Discretization and binarization, Variable transformation

- Feature creation means creating a new set of attributes from the original one
- Discretization is a transformation of a continuous attribute into a categorical attribute
- Binarization is a transformation of both continuous and discrete attributes into binary attributes
- Variable transformation refers to a transformation that is applied to all the values of a variable (attribute)
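Discretization and binarization can be illustrated with two small helpers. The interval boundaries, category labels, and the one-binary-attribute-per-category encoding are assumptions made for this sketch.

```python
def discretize(value, boundaries, labels):
    """Map a continuous value to a category using ascending upper boundaries.

    `labels` has one more entry than `boundaries`; the last label covers
    every value at or above the final boundary.
    """
    for upper, label in zip(boundaries, labels):
        if value < upper:
            return label
    return labels[-1]

def binarize(category, categories):
    """Encode a categorical value as one binary attribute per category."""
    return [1 if category == c else 0 for c in categories]

# Hypothetical temperature categories (degrees Celsius)
labels = ["cold", "mild", "hot"]
boundaries = [10.0, 25.0]

category = discretize(17.0, boundaries, labels)
print(category, binarize(category, labels))
```

A continuous attribute is binarized by chaining the two steps: discretize first, then encode the resulting category.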

Example of Association analysis

Finding the items that are frequently bought together by customers. Based on the analysis of customer baskets, the following rule can be derived:
- bread -> butter, which suggests that customers who buy bread also tend to buy butter
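How strongly a rule such as bread -> butter holds is commonly measured with support (the fraction of baskets containing all items of the rule) and confidence (among baskets containing the left-hand side, the fraction that also contain the right-hand side). The baskets below are invented for illustration.

```python
def count(baskets, items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket)

def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    return count(baskets, items) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Among baskets with the antecedent, fraction that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(baskets, both) / count(baskets, antecedent)

# Hypothetical customer baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

print(support(baskets, {"bread", "butter"}))       # 0.6
print(confidence(baskets, {"bread"}, {"butter"}))  # 0.75
```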

Clustering Definition

Given a set of data points, each having a set of attributes and a similarity measure among them, find clusters such that
- Data points in one cluster are more similar to one another
- Data points in separate clusters are less similar to one another
Similarity Measures:
- Euclidean Distance if attributes are continuous
- Other Problem-specific Measures.
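To make the definition concrete, the sketch below uses Euclidean distance inside a basic k-means loop. K-means is only one of many clustering algorithms and is not named by the study set; the points and the starting centroids are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid          # keep centroid of an empty cluster
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical two-dimensional data points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```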

Classification: Application 2 (Fraud Detection)

Goal: Predict fraudulent cases in credit card transactions
Approach:
- Use credit card transactions and the information on the account holder as attributes: when does a customer buy, what does he buy, how often he pays on time, etc.
- Label past transactions as fraud or fair transactions. This forms the class attribute.
- Learn a model for the class of the transactions.
- Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 1 (Direct Marketing)

Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new smart-phone product.
Approach:
- Use the data for a similar product introduced before.
- We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
- Collect various demographic, lifestyle and company interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
- Use this information as input attributes to learn a classifier model.

Clustering: Application 2 (Document Clustering)

Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
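A frequency-based similarity measure of the kind described above can be sketched with term counts and cosine similarity. Cosine similarity is an assumed choice here (the card does not name a specific measure), tokenization is a naive whitespace split, and the documents are invented.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents
docs = [
    "data mining finds patterns in data",
    "mining data reveals patterns",
    "the recipe needs butter and bread",
]
tfs = [term_frequencies(d) for d in docs]

sim_related = cosine_similarity(tfs[0], tfs[1])
sim_unrelated = cosine_similarity(tfs[0], tfs[2])
print(round(sim_related, 3), sim_unrelated)
```

Pairwise similarities like these can then be fed to any clustering method that accepts a similarity or distance measure.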

Classification: Application 3 (Sky Survey Cataloging)

Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescopic survey images (from Palomar Observatory)
- 3000 images with 23,040 x 23,040 pixels per image
Approach:
- Segment the image.
- Measure image attributes (features), 40 of them per object.
- Model the class based on these features.
Success story: could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

Clustering: Application 1 (Market Segmentation)

Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
- Collect different attributes of customers based on their geographical and lifestyle related information.
- Find clusters of similar customers.
- Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Data with objects that are graphs and have relationships amongst objects

Graph of Graphs

Example of Cluster Analysis

Grouping the newspaper articles based on their respective topics

What is Knowledge discovery?

Knowledge discovery is a process of converting raw data into useful information

Data Quality: Measurement and data collection errors

Measurement error happens when a recorded value differs from the true value.
Data collection error refers to omitting data objects or attributes, or inappropriately including a data object.

Record data

No explicit relationship among records, or data fields, every record has the same set of attributes

Data Quality: Noise and artifacts

Noise is the random component of a measurement error; it distorts a value or adds spurious objects.
An artifact is a deterministic distortion of data.

Descriptive tasks

The objective of these tasks is to derive patterns such as correlations, trends, clusters, trajectories, and anomalies that summarize the underlying relationships in data

Predictive Tasks

The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes

Data Quality: Precision, bias, accuracy and Outliers

Precision: the closeness of repeated measurements to one another.
Bias: a systematic variation of measurements from the quantity being measured.
Accuracy: the closeness of measurements to the true value of the quantity being measured.
Outliers: either data objects whose characteristics differ from those of most other data objects in the data set, or attribute values that are unusual with respect to the typical values for that attribute.
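Precision and bias can be estimated from repeated measurements of a known quantity. A common convention, assumed here rather than stated on the card, is to report precision as the standard deviation of the measurements and bias as the difference between their mean and the true value. The measurement values are illustrative.

```python
import statistics

def bias(measurements, true_value):
    """Systematic offset: mean of the measurements minus the true value."""
    return statistics.mean(measurements) - true_value

def precision(measurements):
    """Spread of repeated measurements (sample standard deviation)."""
    return statistics.stdev(measurements)

# Hypothetical repeated weighings of a 1 g reference mass
true_mass = 1.0
weighings = [1.015, 0.990, 1.013, 1.001, 0.986]

print(round(bias(weighings, true_mass), 3))
print(round(precision(weighings), 3))
```

A small bias together with a small spread indicates an accurate measurement process; either one alone is not enough.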

Data mining Tasks

- Predictive Tasks
- Descriptive Tasks

Predictive modeling

Predictive modeling refers to the task of building a model for the target variable as a function of explanatory variables

Examples of Data mining tasks

- Predictive tasks
- Association tasks
- Cluster analysis
- Anomaly detection

Spatial data

Record data that has spatial attributes, such as positions or areas, in addition to other types of attributes

Data with relationships among objects

Relationships among the objects convey important information; the data is represented as a graph

Schema integration and object matching

Schema integration means mapping real-world entities into a common schema.
Object matching means matching identical real-world objects that have slightly different descriptions.

