IS 383 - DATA MINING
data mining process
A manifestation of best practices; a systematic way to conduct DM projects. Different groups have different versions: CRISP-DM (Cross-Industry Standard Process for Data Mining), SEMMA (Sample, Explore, Modify, Model, and Assess), and KDD (Knowledge Discovery in Databases).
Clustering Analysis for Data Mining
aka segmentation; used for identifying natural groupings of things. Part of the machine learning family; employs unsupervised learning. Learns the clusters from past data, then assigns new instances to them. There is no output variable.
Supervised data mining
analysts develop a model prior to the analysis and apply statistical techniques to estimate the parameters of that model; used for making predictions. Techniques: classification and decision trees; regression analysis - measures the impact of a set of variables on another variable; neural networks - predict values and make classifications.
data mining techniques
are being used by organizations to gain a better understanding of their customers and their operations and to solve complex organizational problems.
data mining patterns
associations - commonly co-occurring groupings of things; predictions - future occurrences of events based on historical occurrences; clusters (segmentation) - natural groupings of things based on their characteristics; sequential relationships - time-ordered events.
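A minimal sketch of the association idea, assuming it means counting how often pairs of items appear together in the same transaction. The function name `pair_counts` and the basket data are hypothetical, made up only for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical market baskets (not from the notes)
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk", "eggs"],
]
counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs with high counts are candidates for "commonly grouped" items; real association mining adds further measures that these notes do not cover.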
data mining methods
classification, regression, clustering
decision tree
is a hierarchical arrangement of criteria that predicts a classification or value. A data mining technique that selects the most useful attributes for classifying entities on some criterion. Useful in the decision process.
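One way to make "selects the most useful attributes" concrete is information gain (entropy reduction). The notes do not name a selection criterion, so this choice is an assumption, and the records and attribute names below are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy records (not from the notes)
rows = [
    {"income": "high", "owns_home": "yes", "credit": "good"},
    {"income": "high", "owns_home": "no", "credit": "good"},
    {"income": "low", "owns_home": "yes", "credit": "bad"},
    {"income": "low", "owns_home": "no", "credit": "bad"},
]
gains = {a: information_gain(rows, a, "credit") for a in ("income", "owns_home")}
best_attribute = max(gains, key=gains.get)
print(best_attribute, gains)
```

A tree builder would split on `best_attribute` first and then repeat the same selection inside each branch.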
data mining
is a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge or patterns from large sets of data.
Clustering
is unsupervised learning: there is no controlling mechanism, only input variables are provided, and groupings are found heuristically.
Cluster Algorithms
k-means clustering algorithm (k = predetermined number of clusters)
Algorithm (after determining the value of k):
1. Randomly generate k points as the initial cluster centers.
2. Assign each point to the nearest cluster center.
3. Recompute the new cluster centers.
Repeat steps 2 and 3 until the assignments stop changing.
Goal: intra-cluster distances are minimized and inter-cluster distances are maximized.
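The k-means steps can be sketched in plain Python. This is a minimal illustration, not a production implementation. It assumes squared Euclidean distance and draws the initial centers from the data points themselves (one common variant of "randomly generate k points"). The function names, the `seed` parameter, and the sample points are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k data points at random as the initial centers (assumption).
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing: converged
        assignments = new_assignments
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignments

# Hypothetical 2-D points forming two visually obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignments = k_means(points, k=2)
print(assignments)
```

The cluster numbers themselves are arbitrary; what matters is which points share a number.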
why use data mining
more intense competition in the market; untapped value hidden in large data sources; integration of databases, which enables a single view of customers, vendors, transactions, etc.; consolidation of databases into a single location in the form of a data warehouse; greater storage capabilities at reduced cost; conversion of information into non-physical form.
classification
most frequently used data mining method; part of the machine learning family; employs supervised learning: learns from past data, then classifies new data. The output variable is categorical in nature (nominal or ordinal), ex. good/bad credit; sunny, windy, rainy.
2-step method:
1. Model development and training - data is fed in and class labels are also provided.
2. Model testing and deployment - the model is tested with a new sample of data and its accuracy is assessed; this is an iterative process; the model is then deployed for actual use.
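The 2-step method can be illustrated with a deliberately simple classifier. The sketch below uses a nearest-centroid rule, which is an assumption chosen for brevity (the notes do not prescribe a particular classifier). The feature values and function names are hypothetical.

```python
from collections import defaultdict

def train(samples, labels):
    """Step 1: learn one centroid per class from labeled training data."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(model, features):
    """Assign the class whose centroid is closest to the given features."""
    def dist(centroid):
        return sum((x - c) ** 2 for x, c in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))

# Hypothetical training data: (income in $1000s, debt in $1000s) -> credit class
train_x = [(80, 5), (95, 10), (30, 25), (25, 40)]
train_y = ["good", "good", "bad", "bad"]
model = train(train_x, train_y)

# Step 2: test with a new sample of data and assess accuracy
test_x = [(90, 8), (28, 30)]
test_y = ["good", "bad"]
predictions = [predict(model, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

If the accuracy were unsatisfactory, the iterative part of the process would mean revisiting the data or the model before deployment.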
Unsupervised data mining
no model is created before the analysis; analysts apply data mining techniques and then observe the results; analysts create hypotheses after the analysis is completed. Cluster analysis - a common technique in this category; groups together entities that have similar characteristics.
Classification vs. Regression
non-numeric output = classification; numeric output = regression
Regression
when the predicted value is numeric, the prediction problem is called regression
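As a small illustration of predicting a numeric value, here is a sketch of simple linear regression fitted by ordinary least squares. The choice of a one-variable linear model is an assumption made for illustration, and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # numeric prediction for a new x
print(slope, intercept, predicted)
```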
data mining tasks
prediction - the act of telling about the future; association - discovering interesting relationships among variables in large databases; clustering - partitioning a collection of things into segments whose members share similar characteristics.
Assessing the Models (Classification Models)
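Predictive accuracy - hit rate; Speed - of model building and of predicting; Robustness - reasonably accurate predictions despite noisy or missing data; Scalability - efficient model construction on large amounts of data; Interpretability - transparency, explainability.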
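Predictive accuracy (hit rate) can be computed directly from actual and predicted labels. In the sketch below the label lists are hypothetical, and the confusion counts are an added convenience for seeing where the misses fall.

```python
from collections import Counter

def hit_rate(actual, predicted):
    """Fraction of cases where the predicted class equals the actual class."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

def confusion_counts(actual, predicted):
    """Count each (actual, predicted) combination."""
    return Counter(zip(actual, predicted))

# Hypothetical test-set results
actual = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "bad", "bad", "good"]
rate = hit_rate(actual, predicted)
counts = confusion_counts(actual, predicted)
print(rate, dict(counts))
```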
Data Mining Myths
Myths hold that data mining: provides instant solutions; is not yet viable for business practices; requires a separate database; can only be done by those with advanced degrees; is only for large firms; is another name for statistics.
k-Fold Cross Validation aka rotation estimation (Classification)
splits the data into k mutually exclusive subsets; uses each subset in turn as the testing set while using the remaining subsets for training; repeats the experiment k times; aggregates the test results for a true estimate of prediction accuracy.
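A sketch of k-fold cross validation in plain Python. To stay self-contained it evaluates a majority-class baseline rather than a real model, and it builds folds by striding through the indices without shuffling. Both choices are assumptions made for illustration, and the labels are hypothetical.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k mutually exclusive folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(labels, k):
    """Mean accuracy over k folds of always predicting the training majority class."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        majority = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
        hits = sum(labels[j] == majority for j in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k  # aggregate the k test results

# Hypothetical class labels
labels = ["a"] * 8 + ["b"] * 2
mean_accuracy = cross_validate(labels, k=5)
print(mean_accuracy)
```

In practice the records are usually shuffled before the folds are formed, and a real classifier replaces the majority-class baseline.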
Simple Split (Classification)
Split the data into 2 mutually exclusive sets, training and testing: 70% training, 30% testing.
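A minimal sketch of the simple split, assuming the records are shuffled before being cut at 70%. The function name `simple_split` and its `seed` parameter are hypothetical.

```python
import random

def simple_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records, then cut into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical record IDs
train_set, test_set = simple_split(range(10))
print(len(train_set), len(test_set))
```

Because the two slices come from one shuffled list, every record lands in exactly one of the two sets.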
Highest Level of Data Abstraction
unstructured data - textual, imagery, voice, and web content; structured data - what data mining algorithms use; can be classified as categorical or numeric.
data mining tools
use a variety of techniques to find patterns and relationships in large volumes of information and infer rules that predict future behaviors and guide decision making
Knowledge Discovery in Databases (KDD)
uses data mining methods to find useful information and patterns in the data; it is a comprehensive process that encompasses data mining. KDD consists of the following steps: data selection, data preprocessing, data transformation, data mining, interpretation/evaluation.
Data mining extracts patterns from data
what is a pattern? A mathematical relationship among data items (numeric or symbolic); it can show historic trends and predict future outcomes.
Six Step CRISP-DM Data mining process
1. Business Understanding - determine business objectives
2. Data Understanding - collect the data, verify data quality
3. Data Preparation - select, clean, construct, integrate, and format data
4. Modeling - select modeling technique, generate test design
5. Evaluation - evaluate results, review process, determine next steps
6. Deployment - produce final report, review project
The process is highly repetitive and experimental. About 85% of project time is spent in steps 1, 2, and 3.
10 data mining mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is and what it can or cannot do
3. Leaving insufficient time for data acquisition, selection, and preparation
4. Looking only at aggregated results and not at individual records
5. Being sloppy about keeping track of the mining procedure and results
6. Ignoring suspicious findings
7. Running mining algorithms repeatedly and blindly
8. Believing everything you are told about the data
9. Believing everything you are told about your own data mining analysis
10. Measuring your results differently than your sponsor does
CRISP-DM VS. SEMMA
CRISP-DM is a more comprehensive approach that includes understanding of the business and the relevant data. SEMMA implicitly assumes that the data mining project's goals and objectives, along with the appropriate data sources, have already been identified.
data mining application
CRM (Customer Relationship Management): maximize return on marketing campaigns; improve customer retention (churn analysis); maximize customer value (cross-selling, up-selling); identify and treat most valued customers. Banking and other financial services: automate the loan application process; detect fraudulent transactions; maximize customer value (cross-selling, up-selling); optimize cash reserves with forecasting.
Numerical data (aka continuous data, as opposed to a discrete variable, which is finite)
Interval - variables measured on an interval scale, ex. temperature in Celsius or Fahrenheit; there is no absolute zero value. Ratio - measurement variables common in science and engineering that possess a non-arbitrary zero value, ex. mass, length, energy, electric charge.
Structured Data
Categorical - finite categories or groups; discrete data with a finite number of values and no continuum between them, ex. gender, age groups, colors, educational levels. Numerical - numerical values of specific variables; continuous data, usually real numbers, ex. incomes, travel distances in miles.
Data in data mining
Data - a collection of facts. Data may consist of numbers, words, and images. Data is the lowest level of abstraction, from which information and then knowledge are derived. At the highest level of abstraction, data is classified as unstructured, semi-structured, or structured.
Data Mining Objectives
Data may be presented in a variety of formats. Data is cleansed and consolidated into a data warehouse. The data mining environment is usually a client/server architecture or a Web-based information systems architecture. Data mining can include soft/unstructured data. The miner is often an end user. Striking it rich involves creative thinking. Data mining tools are readily combined with spreadsheets and other software development tools, so mined data can be analyzed quickly and easily.
Data Mining
Data mining is an automated process of discovering and extracting hidden, unexpected patterns in collected data in order to create models for decision making that predict future behavior based on analyses of past activity.
Categorical Data
Nominal - measurements of simple codes assigned to objects as labels; binomial when there are only two possible values. Ordinal - measurement codes assigned to objects with rank or order, ex. credit score (low, medium, high), educational level (high school, college, graduate school), age (young, middle aged, old).
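The practical difference between nominal and ordinal data shows up when the values are turned into codes. The sketch below assumes rank codes for ordinal values and one-hot indicator columns for nominal values, which is one common convention and not something the notes prescribe. The function names and sample values are hypothetical.

```python
def encode_ordinal(values, order):
    """Map ordered categories to ranks, so comparisons between codes are meaningful."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]

def encode_nominal(values):
    """One-hot encode unordered labels: one 0/1 column per category, no implied order."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

credit_codes = encode_ordinal(["low", "high", "medium"], ["low", "medium", "high"])
color_codes = encode_nominal(["red", "blue", "red"])  # columns: blue, red
print(credit_codes, color_codes)
```

Using a single integer code for nominal labels would suggest an ordering that does not exist, which is why the two kinds are encoded differently here.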
Data Mining
Process - data mining comprises many steps. Nontrivial - some experimentation-type research is involved. Valid - discovered patterns should hold true on new data with a sufficient degree of certainty. Novel - patterns are not previously known to users of the system being analyzed. Potentially useful - discovered patterns should lead to some benefit to the user or task. Ultimately understandable - patterns should make business sense.
SEMMA - Data Mining Process
Sample - generate a representative sample of the data; Explore - visualization and basic description of the data; Modify - select variables, transform variable representations; Model - use a variety of statistical and machine learning models; Assess - evaluate the accuracy and usefulness of the models.