IS 383 - DATA MINING

data mining process

A manifestation of best practices: a systematic way to conduct DM projects. Different groups have different versions:
CRISP-DM (Cross-Industry Standard Process for Data Mining)
SEMMA (Sample, Explore, Modify, Model, and Assess)
KDD (Knowledge Discovery in Databases)

Clustering Analysis for Data Mining

Also known as segmentation. Used for identifying natural groupings. Part of the machine learning family; employs unsupervised learning. Learns the groupings from past data, then assigns new instances to them. There is no output variable.

Supervised data mining

Analysts develop a model, then apply a statistical technique to estimate the parameters of the model. Used for making predictions. Common techniques:
classification and decision trees
regression analysis - measures the impact of a set of variables on another variable
neural networks - predict values and make classifications

data mining techniques

are being used by organizations to gain a better understanding of their customers and their operations and to solve complex organizational problems.

data mining patterns

Associations - commonly co-occurring groupings of things
Predictions - future occurrences based on historical occurrences
Clusters (segmentation) - natural groupings based on characteristics
Sequential relationships - time-ordered events

data mining methods

Classification, regression, and clustering.

decision tree

A hierarchical arrangement of criteria that predicts a classification or value. A data mining technique that selects the most useful attributes for classifying entities on some criterion. Useful in the decision process.
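One way to make "selects the most useful attributes" concrete is to score each attribute by how much it reduces uncertainty about the class. The sketch below assumes entropy-based information gain as the criterion (the card does not name one), and the loan records and attribute names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """Attribute with the highest information gain (a candidate root split)."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical loan records; "credit" is the class to predict.
loans = [
    {"income": "high", "owns_home": "yes", "credit": "good"},
    {"income": "high", "owns_home": "no", "credit": "good"},
    {"income": "low", "owns_home": "yes", "credit": "bad"},
    {"income": "low", "owns_home": "no", "credit": "bad"},
]
root = best_attribute(loans, ["income", "owns_home"], "credit")
```

In this toy data, income separates good from bad credit perfectly while home ownership tells nothing, so income would be chosen for the top of the tree. A full tree repeats the same selection on each resulting subset.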

data mining

is a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge or patterns from large sets of data.

Clustering

Is unsupervised learning: there is no controlling mechanism, and only input variables are provided. Uses heuristics to find the groupings.

Cluster Algorithms

k-means clustering algorithm (k = predetermined number of clusters)
Algorithm (first determine the value of k):
1. Randomly generate k points as initial cluster centers
2. Assign each point to the nearest cluster center
3. Recompute the new cluster centers
Repeat steps 2 and 3 until the assignments stop changing.
Goal: intra-cluster distances are minimized and inter-cluster distances are maximized.
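The k-means steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance, picks the k starting centers from the data points themselves (a common variant of "randomly generate k points"), and stops when assignments no longer change. The sample points are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: choose k initial centers (assumption: sampled from the data).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = k_means(points, k=2)
```

Because the starting centers are random, different seeds can label the clusters in a different order, and on less clearly separated data they can lead to different final groupings.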

why use data mining

More intense competition in the market.
Untapped value is hidden in large data sources.
Integration of databases, which enables a single view of customers, vendors, transactions, etc.
Consolidation of databases into a single location in the form of a data warehouse.
Greater storage capabilities at reduced cost.
Conversion of information into nonphysical form.

classification

The most frequently used data mining method. Part of the machine learning family; employs supervised learning. Learns from past data in order to classify new data. The output variable is categorical in nature (nominal or ordinal), ex. good/bad credit; sunny, windy, rainy.
2-step method:
1. Model development and training - data is fed in, and class labels are also provided
2. Model testing and deployment - the model is tested with a new sample of data and its accuracy is assessed (an iterative process); it is then deployed for actual use

Unsupervised data mining

There is no model: analysts apply data mining and then observe the results. Analysts create hypotheses after the analysis is completed. Cluster analysis is a common technique in this category; it groups together entities that have similar characteristics.

Classification vs. Regression

Non-numeric (categorical) output = classification; numeric output = regression.

Regression

When the predicted value is numeric, the prediction problem is called regression.
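A small worked case of predicting a numeric value: the sketch below fits a straight line by ordinary least squares, which is one common regression technique and an assumption here, since the card does not name a method. The data values and variable names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: advertising spend (x) and resulting sales (y).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(spend, sales)

# Predict the numeric outcome for a new, unseen input.
predicted_sales = slope * 5.0 + intercept
```

The slope estimates how much the output changes per unit of input, which is the sense in which regression measures the impact of one variable on another.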

data mining tasks

Prediction - commonly referred to as the act of telling about the future
Association - discovering interesting relationships among variables in large databases
Clustering - partitioning a collection of things into segments whose members share similar characteristics
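For the association task, a relationship between items is often scored with support and confidence. These two measures are standard in market-basket analysis but are not named on the card, so treat them as an assumption; the baskets below are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
bread_and_milk = support(baskets, {"bread", "milk"})
bread_implies_milk = confidence(baskets, {"bread"}, {"milk"})
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain milk.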

Assessing the Models (Classification Models)

Predictive accuracy - hit rate
Speed - model building and predicting
Robustness
Scalability
Interpretability - transparency, explainability
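Predictive accuracy (hit rate) is the fraction of cases the model classifies correctly. A minimal sketch, with hypothetical labels:

```python
def hit_rate(actual, predicted):
    """Fraction of predictions that match the actual class labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

# Hypothetical test-set labels and model outputs.
actual = ["good", "bad", "good", "good", "bad"]
predicted = ["good", "bad", "bad", "good", "bad"]
accuracy = hit_rate(actual, predicted)  # 4 of 5 correct
```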

Data Mining Myths

Myths claim that data mining:
provides instant solutions
is not yet viable for business practice
requires a separate database
can only be done by those with advanced degrees
is only for large firms
is another name for statistics

k-Fold Cross Validation aka rotation estimation (Classification)

Splits the data into k mutually exclusive subsets. Uses each subset in turn for testing while using the rest of the subsets for training. Repeats the experiment k times. Aggregates the test results for a true estimate of prediction accuracy.
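The fold rotation can be sketched as an index generator plus an averaging step. The function names are hypothetical, and `evaluate` stands in for whatever trains a model on the training indices and returns its accuracy on the test indices. The sketch splits indices in order; in practice the data is usually shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k exclusive folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, test) over k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)

# Each of the 10 records lands in exactly one test fold.
folds = list(k_fold_splits(10, 5))
```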

Simple Split (Classification)

Split the data into 2 mutually exclusive sets, training and testing (ex. 70% training, 30% testing).
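A simple split can be sketched as a shuffle followed by a cut. The function name and the fixed seed are assumptions made so the example is repeatable.

```python
import random

def simple_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records, then cut them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 10 hypothetical record ids split 70% / 30%.
train, test = simple_split(range(10))
```

Every record ends up in exactly one of the two sets, so the model is tested only on data it did not see during training.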

Highest Level of Data Abstraction

Unstructured data - textual, imagery, voice, and web content
Structured data - what data mining algorithms use; can be classified as categorical or numeric

data mining tools

Use a variety of techniques to find patterns and relationships in large volumes of information and infer rules that predict future behavior and guide decision making.

Knowledge Discovery in Databases (KDD)

Uses data mining methods to find useful information and patterns in the data. It is a comprehensive process that encompasses data mining. KDD consists of the following steps:
1. Data selection
2. Data preprocessing
3. Data transformation
4. Data mining
5. Interpretation/evaluation

Data mining extracts patterns from data

What is a pattern? A mathematical relationship among data items (numeric or symbolic) that shows a historic trend and predicts future outcomes.

Six Step CRISP-DM Data mining process

1. Business understanding - determine business objectives
2. Data understanding - collect the data, verify data quality
3. Data preparation - select, clean, construct, integrate, and format data
4. Data modeling - select modeling technique, test design
5. Evaluation - evaluate results, review process, determine next steps
6. Deployment - produce final report, review project
The process is highly repetitive and experimental. 85% of project time is spent in steps 1, 2, and 3.

10 data mining mistakes

1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is and what it can or cannot do
3. Leaving insufficient time for data acquisition, selection, and preparation
4. Looking only at aggregated results and not at individual records
5. Being sloppy about keeping track of results
6. Ignoring suspicious findings
7. Running algorithms repeatedly and blindly
8. Believing everything you are told about the data
9. Believing everything you are told about your own data mining analysis
10. Measuring results differently than your sponsor does

CRISP-DM VS. SEMMA

CRISP-DM is a more comprehensive approach that includes understanding of the business and the relevant data.
SEMMA implicitly assumes that the data mining project's goals and objectives, along with the data sources, have already been identified.

data mining application

CRM (Customer Relationship Management):
maximize return on marketing campaigns
improve customer retention (churn analysis)
maximize customer value (cross-selling, up-selling)
identify and treat most valued customers
Banking and other financial:
automate the loan application process
detect fraudulent transactions
maximize customer value (cross-selling, up-selling)
optimize cash reserves with forecasting

Numerical data (aka continuous data; a discrete variable, by contrast, takes finite values)

Interval - measurement on an interval scale, usually with no absolute zero value, ex. temperature
Ratio - measurement variables in science and engineering that possess a non-arbitrary zero value, ex. mass, length, energy, electric charge

Structured Data

Categorical - finite categories or groups; discrete data with finite values and no continuum between them, ex. gender, age groups, colors, educational levels
Numerical - numerical values of specific variables; continuous data, usually real numbers, ex. incomes, travel distances in miles

Data in data mining

Data - a collection of facts.
Data may consist of numbers, words, and images.
Data is the lowest level of abstraction, from which information and then knowledge are derived.
At the highest level of abstraction, data is classified as unstructured, semi-structured, or structured.

Data Mining Objectives

Data may be presented in a variety of formats.
Data is cleansed and consolidated into a data warehouse.
The data mining environment is usually a client/server architecture or a Web-based information systems architecture.
Data mining can include soft/unstructured data.
The miner is often an end user.
Striking it rich involves creative thinking.
Data mining tools are readily combined with spreadsheets and other software development tools, so the mined data can be analyzed quickly and easily.

Data Mining

Data mining is an automated process of discovering and extracting hidden, unexpected patterns in collected data in order to create models for decision making that predict future behavior based on analyses of past activity.

Categorical Data

Nominal - measurements of simple codes assigned to objects as labels (binomial when there are only two possible values)
Ordinal - measurement codes assigned to objects with rank or order, ex. credit score (low, medium, high), educational level (high school, college, graduate school), age (young, middle aged, old)

Data Mining

Process - data mining comprises many steps
Nontrivial - some experimentation-type research is involved
Valid - discovered patterns should hold true on new data with a sufficient degree of certainty
Novel - patterns are not previously known to users of the system being analyzed
Potentially useful - discovered patterns should lead to some benefit to the user or task
Ultimately understandable - patterns should make business sense

SEMMA - Data Mining Process

Sample - generate a representative sample of the data
Explore - visualization and basic description of the data
Modify - select variables, transform variable representations
Model - use a variety of statistical and machine learning models
Assess - evaluate the accuracy and usefulness of the models

