Data chapter 4


Data Mining Characteristics/Objectives

- Source of data for DM is often a consolidated data warehouse (not always!).
- DM environment is usually a client-server or a Web-based information systems architecture.
- Data is the most critical ingredient for DM, which may include soft/unstructured data.
- The miner is often an end user.
- Striking it rich requires creative thinking.
- Data mining tools' capabilities and ease of use are essential (Web, parallel processing, etc.).

Why Data Mining?

- More intense competition at the global scale
- Recognition of the value in data sources
- Availability of quality data on customers, vendors, transactions, Web, etc.
- Consolidation and integration of data repositories into data warehouses
- The exponential increase in data processing and storage capabilities, and decrease in cost
- Movement toward conversion of information resources into nonphysical form

Association rule learning

- A very popular data mining method in business
- Finds interesting relationships between variables
- There is no output variable
- Also known as market basket analysis
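
One way to make "interesting relationships" concrete is to score a candidate rule by its support and confidence. These two measures are not named on this card; they are the standard ones for market basket analysis. The sketch below is a minimal illustration, and the baskets and item names are made up.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Note that no column plays the role of a target: every item can appear on either side of a rule, which is what "no output variable" means here.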

Types of patterns

Association, Prediction, Cluster (segmentation), Sequential (or time series) relationships.

Data Mining Process

- CRISP-DM (Cross-Industry Standard Process for Data Mining)
- SEMMA (Sample, Explore, Modify, Model, and Assess)
- KDD (Knowledge Discovery in Databases)

How Data Mining Works

DM extracts patterns from data

Classification techniques

- Decision tree analysis
- Statistical analysis
- Neural networks
- Support vector machines
- Case-based reasoning
- Bayesian classifiers
- Genetic algorithms
- Rough sets

Decision trees

Employs the divide and conquer method. Recursively divides a training set until each division consists of examples from one class.
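
The recursive splitting can be sketched as follows. This is a minimal illustration, not a full algorithm such as ID3 or CART: it assumes categorical attributes, picks the split with the lowest weighted Gini impurity (one common choice, not stated on this card), and uses made-up weather data. All function and attribute names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means one class only)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def build_tree(rows, labels, attributes):
    """Recursively split until a subset holds one class or no attributes remain."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class

    def split_impurity(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    best = min(attributes, key=split_impurity)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return (best, branches)  # internal node: attribute plus one subtree per value


def classify(tree, row):
    """Walk from the root to a leaf; raises KeyError for an attribute value never seen in training."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree


# Hypothetical training set, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

Each recursive call works on a smaller subset of the training set, which is the divide and conquer idea: the "sunny" subset is already pure and becomes a leaf, while the "rainy" subset is split once more.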

Clustering results may be used to

- Identify natural groupings of customers
- Identify rules for assigning new cases to classes for targeting/diagnostic purposes
- Provide characterization, definition, labeling of populations
- Decrease the size and complexity of problems for other data mining methods
- Identify outliers in a specific domain (e.g., rare-event detection)

Data Mining Methods: Classification

- Most frequently used DM method
- Part of the machine-learning family
- Employs supervised learning
- Learns from past data, classifies new data
- The output variable is categorical in nature

Data Mining Tasks

Prediction (classification, regression, time series), Association (market-basket, link analysis, sequence analysis), and Segmentation (clustering, outlier analysis)

Assessment Methods for Classification

Predictive accuracy (hit rate), speed, robustness (ability to make reasonably accurate predictions given noisy or missing data), scalability, and interpretability (transparency, explainability)
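
Of these criteria, the hit rate is the simplest to compute: the share of cases where the predicted class equals the actual class. A minimal sketch follows; the function name and the sample labels are made up for illustration.

```python
def hit_rate(actual, predicted):
    """Fraction of cases where the predicted class matches the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical labels: three of the four predictions are correct.
actual = ["yes", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no"]
rate = hit_rate(actual, predicted)
```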

Data mining mistakes

- Selecting the wrong problem for data mining
- Ignoring what your sponsor thinks data mining is and what it really can or cannot do
- Beginning without the end in mind
- Not leaving sufficient time for data acquisition, selection, and preparation
- Looking only at aggregated results and not at individual records or predictions

Analysis methods

- Statistical methods
- Neural networks
- Fuzzy logic
- Genetic algorithms

Data mining is a blend of multiple disciplines

- Statistics
- Artificial intelligence
- Machine learning and pattern recognition
- Information visualization
- Database management and data warehousing
- Management science and information systems

CRISP-DM (Cross-Industry Standard Process for Data Mining)

Steps:
1. Business understanding
2. Data understanding
3. Data preparation
4. Model building
5. Testing and evaluation
6. Deployment

K-means clustering algorithm

Steps:
1. Randomly generate K points as initial cluster centers
2. Assign each point to the nearest cluster center
3. Recompute the new cluster centers
Repeat steps 2 and 3 until some convergence criterion is met (for example, the assignment of points to clusters stops changing).
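
The loop of assigning points and recomputing centers can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: initial centers are sampled from the data points, distance is squared Euclidean, and convergence means the assignments stop changing. The function name, parameters, and sample points are hypothetical.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Pick k distinct data points as initial centers
    # (one common way to "randomly generate" starting centers).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        # Convergence criterion: the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each center as the mean of the points assigned to it.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
```

Which cluster ends up with index 0 or 1 depends on the random start, so only the grouping itself is meaningful, not the cluster numbers.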

Cluster Analysis for Data Mining

- Used for automatic identification of natural groupings of things
- Learns the clusters of things from past data, then assigns new instances
- Part of the machine-learning family
- Employs unsupervised learning
- There is no output/target variable
- Also known as segmentation in marketing

