Data Mining - Exam I

What are the coverage and accuracy for this example? (flip over)

Rule: (Status = Single) -> No. Coverage = 40%; accuracy = 50%.

Proximity measure for binary attributes

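- A contingency table for binary data: q = number of attributes equal to 1 for both objects, r = 1 for object i and 0 for object j, s = 0 for i and 1 for j, t = 0 for both - Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t) - Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s) - Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s) - Note: the Jaccard coefficient is the same as "coherence"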
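
The measures on this card can be sketched in Python. This is a minimal illustration, assuming the usual contingency-table counts: q (both objects are 1), r (1 in the first object, 0 in the second), s (0 in the first, 1 in the second), and t (both are 0). The function names and the two example vectors are made up for illustration.

```python
def contingency(x, y):
    """Count the four cells of the 2x2 contingency table for two binary vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_distance(x, y):
    """Both outcomes matter equally, so 0-0 matches (t) count as agreement."""
    q, r, s, t = contingency(x, y)
    return (r + s) / (q + r + s + t)

def asymmetric_distance(x, y):
    """0-0 matches are ignored; undefined when q + r + s == 0."""
    q, r, s, _ = contingency(x, y)
    return (r + s) / (q + r + s)

def jaccard(x, y):
    """Similarity counterpart of asymmetric_distance (the two sum to 1)."""
    q, r, s, _ = contingency(x, y)
    return q / (q + r + s)

# Hypothetical patients described by six binary symptoms/tests.
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]
```

For these two vectors the counts are q = 2, r = 0, s = 1, t = 3, so the asymmetric distance is 1/3 and the Jaccard coefficient is 2/3.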

Data quality

- Accuracy: correct or wrong, accurate or not - Completeness: not recorded, unavailable, ... - Consistency: some modified but some not, dangling, ... - Timeliness: is the data updated in a timely way? - Believability: how much are the data trusted to be correct? - Interpretability: how easily can the data be understood?

Knowledge to be mined (Data mining functions)

- Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. - Descriptive vs. predictive data mining - Multiple/integrated functions and mining at multiple levels

Major tasks in preprocessing

- Data Cleaning - Data Integration - Data Reduction - Data transformation and data discretization

Measuring data dissimilarity

- Data matrix - Dissimilarity matrix - Cosine Similarity

Reasons to preprocess the data

- Real-world databases are highly susceptible to noisy, missing, and inconsistent data - Heterogeneity: data often come from multiple, heterogeneous sources - How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? - How can the data be preprocessed so as to improve the efficiency and ease of the mining process?

Reasons for Data Mining

- Explosive growth of data (from automated data collection tools, database systems, the web, computerized society) - Need for analysis of massive data sets

Discrete Attribute

- Has only a finite or countably infinite set of values - E.g., zip codes, profession, or the set of words in a collection of documents - Sometimes, represented as integer variables - Note: Binary attributes are a special case of discrete attributes

Continuous Attribute

- Has real numbers as attribute values - E.g., temperature, height, or weight - Practically, real values can only be measured and represented using a finite number of digits - Continuous attributes are typically represented as floating-point variables

Ratio

- Inherent zero-point - We can speak of values as being a multiple of the unit of measurement (10 K is twice as high as 5 K) - E.g., temperature in Kelvin, length, counts, monetary quantities

Measuring Central Tendency

- Mean - Median - Mode

Interval

- Measured on a scale of equal-sized units - Values have order - E.g., temperature in °C or °F, calendar dates - No true zero-point

Five-number summary

- Minimum - Q1 - Median - Q3 - Maximum
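
A minimal Python sketch of the five-number summary, using the standard library. Quartile conventions differ between textbooks and tools; the "inclusive" method used here is one common choice and is an assumption, not necessarily the one used in the course.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

summary = five_number_summary([5, 1, 4, 2, 3])
```

A boxplot draws exactly these five values: the box spans Q1 to Q3, the line inside marks the median, and the whiskers extend toward the minimum and maximum.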

Major issues in Data Mining

- Mining methodology - User Interaction - Efficiency and Scalability - Diversity of data types - Data mining and society

Statistical description of Data

- Motivation: to better understand the data (central tendency, variation, and spread) - Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc. - Numerical dimensions correspond to sorted intervals - Data dispersion: analyzed with multiple granularities of precision; boxplot or quantile analysis on sorted intervals - Dispersion analysis on computed measures: folding measures into numerical dimensions; boxplot or quantile analysis on the transformed

Similarity

- Numerical measure of how alike two data objects are - Value is higher when objects are more alike - Often falls in the range [0,1]

Dissimilarity

- Numerical measure of how different two data objects are - Lower when objects are more alike - Minimum dissimilarity is often 0 - Upper limit varies

Outlier Analysis

- Outlier: a data object that does not comply with the general behavior of the data - Noise or exception? One person's garbage could be another person's treasure - Methods: by-product of clustering or regression analysis, ... - Useful in fraud detection and rare-event analysis

Measuring the Dispersion of Data

- Quartiles, outliers and boxplots - Variance and standard deviation - Boxplot - Five-number summary

Types of Data Sets

- Record: relational records; data matrix (e.g., numerical matrix, crosstabs); document data (text documents as term-frequency vectors); transaction data - Graph and network: World Wide Web; social or information networks; molecular structures - Ordered: video data (sequence of images); temporal data (time series); sequential data (transaction sequences); genetic sequence data - Spatial, image, and multimedia: spatial data (maps); image data; video data

Data mining applications

- Web page analysis: from web page classification, clustering to PageRank & HITS algorithms - Collaborative analysis & recommender systems - Basket data analysis to targeted marketing - Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis - Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue) - From major dedicated data mining systems/tools (e.g., SAS, MS SQLServer Analysis Manager, Oracle Data Mining Tools) to invisible data mining

Proximity measure for nominal attributes

- simple matching - use a large number of
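
Simple matching can be sketched as follows, assuming the usual definition d(i, j) = (p - m) / p, where p is the total number of nominal attributes and m is the number of attributes on which the two objects match. The function name and the example records are illustrative only.

```python
def simple_matching_distance(x, y):
    """Fraction of nominal attributes on which two objects disagree."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    p = len(x)                                   # total attributes
    m = sum(1 for a, b in zip(x, y) if a == b)   # matching attributes
    return (p - m) / p

# Hypothetical objects: (hair color, marital status, occupation)
person_a = ("brown", "single", "engineer")
person_b = ("brown", "married", "engineer")
distance = simple_matching_distance(person_a, person_b)
```

The two records differ on one of three attributes, so the distance is 1/3.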

How do we create a decision tree for Classification?

1. Create a node. 2. Find the best split. 3. Classify. 4. Determine the stopping conditions.

What are the steps for Direct Method for Sequential Covering?

1. Start from an empty rule. 2. Grow a rule using the Learn-One-Rule function. 3. Remove training records covered by the rule. 4. Repeat steps 2 and 3 until the stopping criterion is met.
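
The loop above can be sketched in Python. This is a simplified illustration, not the textbook procedure: the `learn_one_rule` helper here is hypothetical and only picks a single attribute test (the one with the highest accuracy for the target class), whereas a real Learn-One-Rule grows a conjunction of tests. The stopping criterion assumed here is "no records of the target class remain".

```python
def learn_one_rule(records, target):
    """Hypothetical Learn-One-Rule: return the single (attribute, value) test
    with the highest accuracy for the target class (ties: larger coverage)."""
    best, best_key = None, (-1.0, 0)
    tests = {(a, v) for attrs, _ in records for a, v in attrs.items()}
    for attr, value in sorted(tests):
        covered = [label for attrs, label in records if attrs.get(attr) == value]
        correct = sum(1 for label in covered if label == target)
        key = (correct / len(covered), len(covered))
        if correct > 0 and key > best_key:
            best, best_key = (attr, value), key
    return best

def sequential_covering(records, target):
    """records: list of (attribute_dict, class_label) pairs."""
    rules = []
    remaining = list(records)
    while any(label == target for _, label in remaining):
        rule = learn_one_rule(remaining, target)       # step 2: grow a rule
        if rule is None:
            break
        rules.append(rule)
        attr, value = rule
        # Step 3: remove the training records covered by the new rule.
        remaining = [(a, l) for a, l in remaining if a.get(attr) != value]
    return rules

# Hypothetical training records.
animals = [
    ({"gives_birth": "yes", "can_fly": "no"}, "mammal"),
    ({"gives_birth": "yes", "can_fly": "yes"}, "mammal"),
    ({"gives_birth": "no", "can_fly": "yes"}, "bird"),
    ({"gives_birth": "no", "can_fly": "no"}, "reptile"),
]
mammal_rules = sequential_covering(animals, "mammal")
```

On this toy data a single rule, (gives_birth = yes) -> mammal, covers every mammal, so the loop stops after one iteration.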

What is the Modified Value Difference Metric? (MVDM)

A metric, used as part of PEBLS, for calculating the distance between two nominal values.

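What is Bayes' theorem?

A probabilistic framework for solving classification problems. It relates the posterior probability of a class C given evidence A to the likelihood and the prior: P(C | A) = P(A | C) P(C) / P(A).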

How is the best split determined?

Based on the degree of impurity of the child nodes. A class distribution of (0, 1) has the highest purity; a class distribution of (0.5, 0.5) has the lowest purity. Intuition: high purity -> small value of the impurity measure -> better split.

Symmetric binary

Both outcomes equally important. E.g., gender

Name five kinds of graphics/plots that can be used to represent data dispersion characteristics effectively.

Boxplot, Q-Q plot, histogram, quantile plot, scatter plot.

Data Cleaning

Can be applied to remove noise and correct inconsistencies of data.

Data Reduction

Can reduce the size of the data by aggregating, eliminating redundant features, or clustering.

Nominal

Categories, states, or "names of things". E.g., Hair_color = {auburn, black, blond, brown, grey, red, white}, marital status, occupation, ID numbers, zip codes

What are K-Nearest Neighbor classifiers?

Classify a new example by comparing it to all previously seen examples. The classifications of the k most similar previous cases are used for predicting the classification of the current example.

Classification Rule

Classify records by using a collection of "if...then..." rules. Rule: (Condition) -> y, where Condition is a conjunction of attribute tests and y is the class label. Example: (Blood Type = Warm) ^ (Lay Eggs = Yes) -> Birds

What is the best distance measure to find whether two text documents are similar?

Cosine similarity.
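
A minimal Python sketch of cosine similarity between two documents represented as term-frequency vectors. The two vectors below are illustrative counts for ten terms, not data from the course.

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)"""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# Term-frequency vectors over the same ten terms (illustrative values).
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
similarity = cosine_similarity(doc1, doc2)   # about 0.94
```

Because the measure depends only on the angle between the vectors, a long document and a short one with the same term proportions score 1.0, which is why it suits text better than Euclidean distance.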

How do we evaluate the quality of a classification rule?

Coverage and Accuracy.

Negatively Skewed Data

Data with a longer left tail: typically the mean is the smallest, the median is in the middle, and the mode is the largest.

Symmetric Data

Data where the mean, median and mode are all very similar or even the same.

Positively Skewed Data

Data with a longer right tail: typically the mode is the smallest, the median is in the middle, and the mean is the largest.
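
A quick way to check the direction of skew is to compare the three measures of central tendency. The sketch below uses a small made-up sample with one large value on the right.

```python
import statistics

# Hypothetical sample with a long right tail.
data = [1, 2, 2, 3, 4, 9]

mean = statistics.mean(data)      # 3.5
median = statistics.median(data)  # 2.5
mode = statistics.mode(data)      # 2

positively_skewed = mode < median < mean
```

Here mode < median < mean, the ordering described for positively skewed data; mirroring the sample would reverse the ordering.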

Techniques utilized

Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

Data to be mined

Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks.

What is the difference between direct and indirect method for building classification rules?

Direct method: extract rules directly from data. Indirect method: extract rules from other classification models.

What are some measures of impurity?

Entropy, Gini and classification error.
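
The three measures can be sketched in Python as functions of a node's class distribution (a list of class probabilities that sums to 1). The function names are illustrative.

```python
import math

def entropy(p):
    """Entropy = -sum(p_i * log2(p_i)), treating 0 * log2(0) as 0."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini = 1 - sum(p_i ** 2)"""
    return 1.0 - sum(pi ** 2 for pi in p)

def classification_error(p):
    """Error = 1 - max(p_i)"""
    return 1.0 - max(p)

pure = [0.0, 1.0]    # highest purity: every measure is 0
mixed = [0.5, 0.5]   # lowest purity for two classes: every measure is at its maximum
```

For two classes, the maxima at (0.5, 0.5) are 1.0 for entropy and 0.5 for both Gini and classification error.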

What is Tree induction?

Essentially, it is how to split records when building a decision tree and when to stop splitting. This depends on the attribute types (binary, nominal, ordinal, and continuous) as well as on the number of ways to split (2-way/binary split or multi-way split).

What is the issue with the Euclidean measure?

With high-dimensional data it suffers from the curse of dimensionality and can produce counter-intuitive results. One remedy is to normalize the vectors to unit length.

Binning

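A technique for dealing with noisy data: sort the values, partition them into bins, and smooth each value by its bin mean, bin median, or bin boundaries.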
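
A minimal sketch of one common variant, smoothing by bin means with equal-frequency bins. The function name, bin size, and price list are assumptions chosen for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into consecutive bins of bin_size items,
    and replace every value with the mean of its bin."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_values = data[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
```

The three bins (4, 8, 15), (21, 21, 24), and (25, 28, 34) have means 9, 22, and 29, so each value is replaced by one of those three numbers.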

What is Hunt's Algorithm?

Hunt's algorithm grows a decision tree in a recursive fashion by partitioning the training records into successively purer subsets. Let Dt be the set of training records that reach a node t. General procedure: - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt - If Dt is an empty set, then t is a leaf node labeled by the default class, yd - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
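
The recursive procedure can be sketched in Python. Several details here are assumptions not stated on the card: the attribute test is chosen by lowest weighted Gini index, splits are multi-way on nominal values, the default class passed to children is the parent's majority class, and recursion also stops when no attributes remain.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, attributes, default):
    """records: list of (attribute_dict, label) pairs.
    Returns a label (leaf) or an (attribute, {value: subtree}) pair."""
    if not records:                       # Dt is empty: default-class leaf
        return default
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                   # pure node, or nothing left to test

    def weighted_gini(attr):
        groups = {}
        for attrs, label in records:
            groups.setdefault(attrs[attr], []).append(label)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())

    best = min(attributes, key=weighted_gini)      # attribute test for this node
    rest = [a for a in attributes if a != best]
    partitions = {}
    for attrs, label in records:
        partitions.setdefault(attrs[best], []).append((attrs, label))
    # Recursively apply the procedure to each subset.
    return (best, {value: hunt(subset, rest, majority)
                   for value, subset in partitions.items()})

# Hypothetical training records.
loans = [
    ({"refund": "yes", "status": "single"}, "no"),
    ({"refund": "no", "status": "married"}, "no"),
    ({"refund": "no", "status": "single"}, "yes"),
    ({"refund": "yes", "status": "married"}, "no"),
]
tree = hunt(loans, ["refund", "status"], default="no")
```

On this toy data the root tests `refund`; the "yes" branch is already pure, and the "no" branch is split once more on `status`.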

What happens when a test record is presented to an ordered rule set?

It is assigned to the class label of the highest ranked rule it has triggered. If none of the rules fired, it is assigned to the default class.
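
A minimal sketch of this behavior, assuming each rule is stored as a (condition, class label) pair whose condition is a dictionary of attribute tests, and the list is already sorted from highest to lowest rank. The rules and records are hypothetical.

```python
def classify(record, ordered_rules, default_class):
    """Return the label of the first (highest-ranked) rule the record triggers,
    or the default class when no rule fires."""
    for condition, label in ordered_rules:
        if all(record.get(attr) == value for attr, value in condition.items()):
            return label
    return default_class

# Hypothetical decision list, highest-ranked rule first.
rules = [
    ({"gives_birth": "no", "can_fly": "yes"}, "bird"),
    ({"gives_birth": "yes"}, "mammal"),
]
bat = {"gives_birth": "yes", "can_fly": "yes"}
turtle = {"gives_birth": "no", "can_fly": "no"}
```

The bat record skips the first rule and triggers the second, so it is labeled "mammal"; the turtle record triggers no rule and receives the default class.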

What is the best distance measure for comparing similar diseases with a set of medical tests?

Jaccard Coefficient.

When using K-Nearest Neighbor classifications, what are the issues if k is too small vs k is too big?

K is too small: sensitive to noise points. K is too big: neighborhood may include noise/points from other classes.

Data Mining

Knowledge discovery from data. Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.

Are K-NN Classifiers lazy or eager learners?

Lazy: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems. As a result, classifying unknown records is relatively expensive.

Make Classification Rules from Decision Trees. (flip over card)

Make sure you can do this.

Data Transformations

May be applied to scale data so that values fall within a smaller range (e.g., 0.0 to 1.0), as in normalization.
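
One common way to scale values into such a range is min-max normalization, sketched below. This is an illustrative choice of transformation; the function name and the income values are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("all values are equal; the range is zero")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

incomes = [12000, 73600, 98000]
scaled = min_max_normalize(incomes)   # roughly [0.0, 0.716, 1.0]
```

The smallest value maps to 0.0 and the largest to 1.0, so a single extreme outlier compresses everything else toward one end of the range.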

Data Integration

Merges data from multiple sources into a coherent data store.

What are characteristics of rule-based classifiers?

Mutually exclusive rules: the rules are independent of each other, so every record is covered by at most one rule. Exhaustive rules: the classifier accounts for every possible combination of attribute values, so every record is covered by at least one rule. Other characteristics: as highly expressive as decision trees; easy to interpret; easy to generate; can classify new instances rapidly; performance comparable to decision trees.

Binary

Nominal attribute with only 2 states (0 and 1).

Asymmetric binary

Outcomes not equally important. E.g., medical test (positive vs. negative). Convention: assign 1 to most important outcome (e.g., HIV positive)

Data object

Represents an entity. Described by attributes. Examples: - sales database: customers, store items, sales - medical database: patients, treatments - university database: students, professors, courses

Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

What are 2 examples of instance-based classifiers?

Rote-learner and nearest neighbor.

What are the 2 Rule Ordering Schemes?

Rule-based ordering: individual rules are ranked based on their quality. Class-based ordering: rules that belong to the same class appear together.

List 4 Classification techniques besides decision trees.

Rule-based, nearest-neighbor, support vector machines, ensemble methods.

What is an ordered rule set?

Rules are rank ordered according to their priority. (aka a decision list)

What is the best distance measure to find the maximum difference between any attribute of two vectors?

Supremum distance.
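
The supremum distance is the limit of the Minkowski distance as its order h grows without bound. A minimal Python sketch, with illustrative function names and points:

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h = 1: Manhattan, h = 2: Euclidean)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum_distance(x, y):
    """Largest absolute difference over all attributes (L-max norm)."""
    return max(abs(a - b) for a, b in zip(x, y))

p1, p2 = (1, 2), (3, 5)
manhattan = minkowski(p1, p2, 1)       # 5.0
euclidean = minkowski(p1, p2, 2)       # about 3.61
supremum = supremum_distance(p1, p2)   # 3
```

The attribute differences are 2 and 3, so the supremum distance reports the larger one, 3.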

How do we determine the class from nearest neighbor list?

Take the majority vote of class labels among the k nearest neighbors. Optionally, weigh each vote according to distance (e.g., weight factor w = 1/d^2).
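
Both voting schemes can be sketched in Python. This is a minimal illustration that assumes Euclidean distance and, for the weighted variant, a weight of 1/d^2 per neighbor; the training points are made up.

```python
import math
from collections import defaultdict

def knn_classify(query, training, k, weighted=False):
    """training: list of (point, label) pairs; points are numeric tuples."""
    by_distance = sorted((math.dist(query, point), label) for point, label in training)
    votes = defaultdict(float)
    for distance, label in by_distance[:k]:
        if weighted and distance == 0:
            return label                      # exact match: avoid dividing by zero
        votes[label] += 1.0 / distance ** 2 if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical training set: one close "a" point and two farther "b" points.
training = [((1, 0), "a"), ((3, 0), "b"), ((0, 3), "b")]
plain_vote = knn_classify((0, 0), training, k=3)                    # "b" wins 2 to 1
weighted_vote = knn_classify((0, 0), training, k=3, weighted=True)  # "a": 1.0 vs 2/9
```

The example shows how distance weighting lets one very close neighbor outvote two distant ones.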

What is classification?

Task of assigning objects to one of several predefined categories.

What is coverage?

The fraction of records in D that trigger the rule r.

What is accuracy?

The fraction of records triggered by r whose class labels are equal to y.
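
Coverage and accuracy can be computed as in the sketch below. The ten records are hypothetical, chosen so that the rule (Status = Single) -> No has coverage 40% and accuracy 50%; the record layout and function name are assumptions.

```python
def coverage_and_accuracy(records, condition, predicted_class):
    """condition: dict of attribute tests; each record is a dict with a 'class' key."""
    covered = [r for r in records
               if all(r[attr] == value for attr, value in condition.items())]
    coverage = len(covered) / len(records)
    correct = sum(1 for r in covered if r["class"] == predicted_class)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# Hypothetical data set D with ten records.
D = (
    [{"status": "Single", "class": c} for c in ("No", "No", "Yes", "Yes")]
    + [{"status": "Married", "class": "No"} for _ in range(4)]
    + [{"status": "Divorced", "class": c} for c in ("Yes", "No")]
)
coverage, accuracy = coverage_and_accuracy(D, {"status": "Single"}, "No")
```

Four of the ten records satisfy the condition (coverage 0.4), and two of those four have class "No" (accuracy 0.5).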

Data visualization

Used to gain insight, provide a qualitative overview, search, help find interesting regions and suitable parameters, provide visual proof. - Pixel-oriented visualization techniques - Geometric projection visualization techniques - Icon-based visualization techniques - Hierarchical visualization techniques - Visualizing complex data and relations

Ordinal

Values have a meaningful order (ranking) but magnitude between successive values is not known. E.g., Size = {small, medium, large}, grades, army rankings

Attribute

a.k.a. dimension, feature, variable: a data field representing a characteristic or feature of a data object. E.g., customer_ID, name, address. Types: - Nominal - Binary - Ordinal - Numeric (quantitative): interval-scaled or ratio-scaled

Quantity

integer or real-valued

Proximity

refers to similarity or dissimilarity

What is Parallel Exemplar-Based Learning System? (PEBLS)

• Works with both continuous and nominal features • For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM) • Each record is assigned a weight factor • Number of nearest neighbors: k = 1
