Data chapter 4
Data Mining Characteristics/Objectives
- Source of data for DM is often a consolidated data warehouse (not always!).
- DM environment is usually a client-server or a Web-based information systems architecture.
- Data is the most critical ingredient for DM, and it may include soft/unstructured data.
- The miner is often an end user.
- Striking it rich requires creative thinking.
- Data mining tools' capabilities and ease of use are essential (Web, parallel processing, etc.).
Why Data Mining?
- More intense competition at the global scale
- Recognition of the value in data sources
- Availability of quality data on customers, vendors, transactions, Web, etc.
- Consolidation and integration of data repositories into data warehouses
- The exponential increase in data processing and storage capabilities, and decrease in cost
- Movement toward conversion of information resources into nonphysical form
Association rule learning
- A very popular data mining method in business
- Finds interesting relationships between variables
- There is no output variable
- Also known as market basket analysis
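One common way to judge whether a relationship between items is "interesting" is to measure its support and confidence. The sketch below is only an illustration: the transactions are hypothetical, and the helper names `support` and `confidence` are assumptions of this example, not part of the notes above.

```python
# Hypothetical market-basket data: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Rule under test: bread -> milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Note that there is no output variable here: the rule is scored only by how often the items occur together.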
Types of patterns
Association, Prediction, Cluster (segmentation), Sequential (or time series) relationships.
Data Mining Process
- CRISP-DM (Cross-Industry Standard Process for Data Mining)
- SEMMA (Sample, Explore, Modify, Model, and Assess)
- KDD (Knowledge Discovery in Databases)
How Data Mining Works
DM extracts patterns from data
Classification techniques
- Decision tree analysis
- Statistical analysis
- Neural networks
- Support vector machines
- Case-based reasoning
- Bayesian classifiers
- Genetic algorithms
- Rough sets
Decision trees
Employs the divide-and-conquer method: recursively divides a training set until each division consists of examples from one class.
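The recursive division can be sketched in a few lines. This is a minimal illustration, not a production algorithm: the choice of Gini impurity to pick the split attribute is an assumption of this sketch (the notes do not name a split criterion), and the toy training set and the names `gini` and `build_tree` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a single class)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def build_tree(rows, attributes):
    """rows: list of (features_dict, label). Returns a nested dict or a class label."""
    labels = [label for _, label in rows]
    # Stop dividing when the division is pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def weighted_impurity(attr):
        groups = {}
        for features, label in rows:
            groups.setdefault(features[attr], []).append(label)
        return sum(len(g) / len(rows) * gini(g) for g in groups.values())

    # Divide: split on the attribute that leaves the purest divisions.
    best = min(attributes, key=weighted_impurity)
    remaining = [a for a in attributes if a != best]
    partitions = {}
    for features, label in rows:
        partitions.setdefault(features[best], []).append((features, label))
    # Conquer: build a subtree for each division.
    return {best: {value: build_tree(subset, remaining)
                   for value, subset in partitions.items()}}


# Hypothetical toy training set.
training = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = build_tree(training, ["outlook", "windy"])
print(tree)
```

Each leaf of the resulting nested dictionary is a class label, reached once a division contains examples from only one class.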
Clustering results may be used to
- Identify natural groupings of customers
- Identify rules for assigning new cases to classes for targeting/diagnostic purposes
- Provide characterization, definition, labeling of populations
- Decrease the size and complexity of problems for other data mining methods
- Identify outliers in a specific domain (e.g., rare-event detection)
Data Mining Methods: Classification
- Most frequently used DM method
- Part of the machine-learning family
- Employs supervised learning
- Learns from past data, classifies new data
- The output variable is categorical in nature
Data Mining Tasks
Prediction (classification, regression, time series), Association (market-basket, link analysis, sequence analysis), and Segmentation (clustering, outlier analysis)
Assessment Methods for Classification
Predictive accuracy (hit rate), speed, robustness (ability to make reasonably accurate predictions, even with noisy or missing data), scalability, and interpretability (transparency, explainability)
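Of these measures, predictive accuracy is the simplest to compute: it is the share of cases the classifier labels correctly. A minimal sketch, with hypothetical labels and a hypothetical helper name `hit_rate`:

```python
def hit_rate(actual, predicted):
    """Fraction of cases whose predicted class matches the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


# Hypothetical test-set labels and classifier output.
actual = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes"]
accuracy = hit_rate(actual, predicted)
print(accuracy)  # 3 of the 5 predictions match
```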
Data mining mistakes
- Selecting the wrong problem for data mining
- Ignoring what your sponsor thinks data mining is and what it really can or cannot do
- Beginning without the end in mind
- Leaving insufficient time for data acquisition, selection, and preparation
- Looking only at aggregated results and not at individual records or predictions
Analysis methods
- Statistical methods
- Neural networks
- Fuzzy logic
- Genetic algorithms
Data mining is a blend of multiple disciplines
- Statistics
- Artificial intelligence
- Machine learning and pattern recognition
- Information visualization
- Database management and data warehousing
- Management science and information systems
CRISP-DM (Cross-Industry Standard Process for Data Mining)
Steps:
1. Business understanding
2. Data understanding
3. Data preparation
4. Model building
5. Testing and evaluation
6. Deployment
K-means clustering algorithm
Steps:
1. Randomly generate K points as initial cluster centers
2. Assign each point to the nearest cluster center
3. Recompute the new cluster centers
Repeat steps 2 and 3 until some convergence criterion is met (for example, the assignment of points to clusters no longer changes)
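The loop can be sketched in plain Python as follows. This is an illustration under stated assumptions: the initial centers are drawn from the data points themselves (one common way to generate K random starting points), distance is Euclidean in two dimensions, convergence means that assignments stop changing, and the sample points and the name `kmeans` are hypothetical.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """points: list of (x, y) tuples. Returns (centers, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick k random data points as the initial cluster centers.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest center (squared Euclidean distance).
        new_assignments = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Convergence criterion: no point changed cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return centers, assignments


# Hypothetical data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Which cluster receives index 0 or 1 depends on the random start, so only the grouping itself is meaningful, not the label numbers.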
Cluster Analysis for Data Mining
- Used for automatic identification of natural groupings of things
- Learns the clusters of things from past data, then assigns new instances
- Part of the machine-learning family
- Employs unsupervised learning
- There is no output/target variable
- Also known as segmentation in marketing