Ch. 4 Predictive analytics I: Data mining process, methods and algorithms ISDS 415
Major characteristics and objectives of data mining?
1. Data is often buried deep in very large databases; it is cleansed and consolidated into a data warehouse and may be presented in a variety of forms. 2. The data mining environment is usually a client/server or web-based information systems architecture. 3. Sophisticated new tools are used to uncover the buried information. 4. The miner is often the end user. 5. Striking it rich often involves finding unexpected results and requires the end user to think creatively. 6. Data mining tools are readily combined with spreadsheets and other software development tools. 7. Because of the large amounts of data and massive search efforts, it is sometimes necessary to use parallel processing for data mining.
7 classification techniques
1. decision tree analysis 2. statistical analysis 3. neural networks 4. case-based reasoning 5. Bayesian classifiers 6. genetic algorithms 7. rough sets
4 additional classification assessment methodologies
1. leave-one-out 2. bootstrapping 3. jackknifing 4. area under the curve (AUC)
What are the two commonly used derivatives of association rule mining?
1. Link analysis: the linkage among many objects of interest is discovered automatically. 2. Sequence analysis: relationships are examined in terms of their order of occurrence to identify associations over time.
Determining the optimal number of clusters
1. Look at the percentage of variance explained as a function of the number of clusters; that is, choose a number of clusters such that adding another cluster would not give much better modeling of the data. 2. Set the number of clusters to (n/2)^(1/2), where n is the number of data points. 3. Use the Akaike information criterion (AIC), a measure of goodness of fit based on the concept of entropy. 4. Use the Bayesian information criterion (BIC), a model selection criterion based on maximum likelihood estimation.
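A minimal sketch of heuristic 1 (the "elbow" approach) and the rule of thumb in heuristic 2, assuming scikit-learn and NumPy are available; the synthetic data and the range of k values are illustrative, not from the card.

```python
# Elbow heuristic: watch how the within-cluster sum of squares drops as k grows.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # made-up data

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

for k, inertia in zip(range(1, 10), inertias):
    print(k, round(inertia, 1))   # pick the k where further gains flatten out

# Rule of thumb from the card: k ~ sqrt(n / 2)
print("rule-of-thumb k:", int(round(np.sqrt(X.shape[0] / 2))))
```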
Gini index
used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable. The best split is the one that increases purity the most.
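A small illustrative sketch of the Gini impurity calculation (1 minus the sum of squared class proportions); the toy labels and the candidate split are made up for the example.

```python
# Gini impurity of a node, and the weighted impurity after a candidate split.
from collections import Counter

def gini(labels):
    """Gini impurity = 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

parent = ['yes', 'yes', 'yes', 'no', 'no', 'no']
left, right = ['yes', 'yes', 'yes'], ['no', 'no', 'no']   # a candidate split

weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(gini(parent), weighted_children)  # 0.5 -> 0.0: this split maximizes purity
```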
What are the two methods cluster analysis relies on?
1. Divisive: all items start in one cluster and are broken apart. 2. Agglomerative: all items start in individual clusters, and the clusters are joined together.
What are the 4 major types of patterns?
1. associations 2. predictions 3. clusters 4. sequential relationships
What are the 6 steps in the CRISP-DM data mining process?
1. business understanding 2. data understanding 3. data preparation 4. model building 5. testing and evaluation 6. deployment
What are the steps in building a decision tree?
1. Create a root node and assign all of the training data to it. 2. Select the best splitting attribute. 3. Add a branch to the root node for each value of the split; split the data into mutually exclusive subsets along the lines of the specific split and move to the branches. 4. Repeat steps 2 and 3 for each and every leaf node until the stopping criterion is reached.
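The recursive splitting described above is what decision tree libraries carry out internally; a minimal usage sketch, assuming scikit-learn is available (the iris data, the Gini criterion, and the depth-based stopping criterion are illustrative choices).

```python
# Build a tree on sample data and print it as a hierarchy of if-then rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3)  # max_depth acts as the stopping criterion
tree.fit(X, y)
print(export_text(tree))  # the learned splits, one if-then branch per line
```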
What fields is data mining used in?
1. Customer relationship management (CRM): customer profiling, improving customer relations, maximizing sales. 2. Banking: automation, fraud detection, maximizing customer value, optimizing cash returns. 3. Retailing and logistics: predicting sales, identifying relationships between products, forecasting, discovering patterns. 4. Manufacturing and production: predicting failures, system optimization, improving product quality. 5. Brokerage and securities trading: predicting bond prices, forecasting stock fluctuations, assessing the effects of events on markets, preventing fraud. 6. Insurance: forecasting, setting optimal rates, predicting customer behavior, preventing fraud. 7. Computer hardware and software: predicting failures, identifying unwanted content, preventing security breaches, identifying unsecured products.
What are some reasons data mining has gained recognition?
1. More intense competition. 2. General recognition of the untapped value hidden in large data sources. 3. Consolidation and integration of database records. 4. The exponential increase in data processing and storage technologies. 5. Significant reduction in the cost of hardware and software for data storage and processing. 6. Movement toward the demassification of business practices (conversion of information resources into nonphysical form).
Model Building
Various modeling techniques are selected and applied to an already prepared data set to address the specific business need. This step also includes the assessment and comparative analysis of the models built.
Classification
Also called supervised induction; it is the most common of all data mining tasks. The objective is to analyze the historical data stored in a database and automatically generate a model that can predict future behavior.
Clustering
Partitions a collection of things into segments whose members share similar characteristics; class labels are unknown. The objective is to create groups whose members have maximum similarity to each other and minimum similarity to members of other groups.
simple split
Partitions the data into two mutually exclusive subsets called a training set and a test set. Disadvantage: it assumes that the data in the two subsets are of the same kind.
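A minimal sketch of a simple (holdout) split, assuming scikit-learn; the iris data and the 1/3 test size are illustrative.

```python
# Hold out one third of the data as the test set; stratify to keep class ratios similar.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)
print(len(X_train), len(X_test))
```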
What is the difference between prediction and forecasting?
Prediction is based on experience and opinion; forecasting is data and model based.
What is the main difference between data mining and statistics?
Statistics starts with a well-defined proposition and hypothesis, whereas data mining starts with a loosely defined discovery statement. Statistics collects sample data; data mining uses existing data. Statistics determines the right sample size needed; data mining looks for data sets as large as possible.
K-means clustering algorithm steps (predetermined number of clusters)
Step 1: randomly generate k points as initial cluster centers. Step 2: assign each point to the nearest cluster center. Step 3: recompute the new cluster centers. Repeat steps 2 and 3 until some convergence criterion is met.
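A minimal NumPy sketch of these steps with made-up data and k = 3; in practice a library implementation such as scikit-learn's KMeans would normally be used.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))                              # illustrative data points
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]     # step 1: random initial centers

for _ in range(100):
    # step 2: assign each point to the nearest cluster center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 3: recompute each center as the mean of its members (keep old center if a cluster empties)
    new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    if np.allclose(new_centers, centers):                  # convergence criterion
        break
    centers = new_centers

print(centers)
```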
data preparation
Take the data identified in the previous step and prepare it for analysis by data mining methods. This step consumes about 80% of the total project time in CRISP-DM.
Prediction
Tells the nature of future occurrences of certain events based on what has happened in the past; closely related to forecasting.
data mining
term used to describe discovering or mining knowledge from large amounts of data.
Cluster analysis
An essential data mining method for classifying items, events, or concepts into common groupings called clusters; it is an exploratory data analysis tool for solving classification problems.
Information gain
The splitting mechanism used in ID3, which is the most widely known decision tree algorithm.
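A small sketch of the information gain calculation (entropy of the parent node minus the weighted entropy of its children); the toy labels and the candidate split are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ['yes'] * 5 + ['no'] * 5
children = [['yes', 'yes', 'yes', 'yes', 'no'], ['yes', 'no', 'no', 'no', 'no']]  # a candidate split
print(round(information_gain(parent, children), 3))   # ~0.278
```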
what is the objective of cluster analysis?
is to sort cases into groups or clusters so that the degree of association is strong among members of the same cluster and weak among members of different clusters.
What is the critique of statistics-based classification techniques?
They make unrealistic assumptions about the data, such as independence and normality, which limit their use in classification-type data mining projects.
K-fold cross-validation
Used to minimize the bias associated with random sampling. The complete data set is randomly split into k mutually exclusive subsets of approximately equal size.
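A minimal sketch of 10-fold cross-validation, assuming scikit-learn; the iris data and the decision tree model are illustrative.

```python
# Each of the 10 folds is held out once while the model trains on the other 9.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores, scores.mean())   # per-fold accuracies and their average
```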
SEMMA: what does it stand for?
Another standardized data mining process and methodology: Sample (generate a sample of the data), Explore (visualization and basic description of the data), Modify (select variables, transform variable representations), Model (use statistical and machine learning models), and Assess (evaluate the accuracy and usefulness of the models).
Decision tree
Recursively divides a training set until each division consists entirely or primarily of examples from one class.
What are the most common clustering techniques?
k-means and self-organizing maps (SOM)
business understanding
understand what the study is for and the business need it is meant to address
what does classification do?
learns patterns from past data to place new instances into their respective groups or classes.
entropy
measures the extent of uncertainty or randomness in a data set.
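For reference, the standard formula (general knowledge, not quoted from the card), where p_i is the proportion of records belonging to class i among the c classes in set S:

```latex
H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
```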
Testing and evaluation
The models are assessed and evaluated for accuracy and generality, to see whether they meet the business objectives and to what extent; a critical and challenging task.
Data privacy
Must be careful not to share or expose individuals' private information uncovered through data mining.
what are common classification tools?
Neural networks and decision trees; logistic regression and discriminant analysis; and emerging tools such as rough sets, support vector machines, and genetic algorithms.
are the ideas behind data mining new? when were they formed?
No, the ideas are not new; the term came into use in the 1980s.
Associations
Also called association rule learning; a technique for discovering interesting relationships among variables in large databases (e.g., market basket analysis).
what are the factors when assessing a model?
1. Predictive accuracy: the ability to correctly predict the class label of new or previously unseen data. 2. Speed: the computational cost involved in generating and using the model, where faster is deemed better. 3. Robustness: the ability to make reasonably accurate predictions given noisy data or data with missing and erroneous values. 4. Scalability: the ability to construct a prediction model efficiently given a rather large amount of data. 5. Interpretability: the level of understanding and insight provided by the model.
terms that describe data mining
1. Process: many iterative steps. 2. Nontrivial: experimentation, search, or inference is involved. 3. Valid: discovered patterns should hold true on new data with a certain degree of certainty. 4. Novel: patterns are not previously known to the user within the context of the system being analyzed. 5. Potentially useful: discovered patterns should lead to some benefit. 6. Ultimately understandable: the patterns should make business sense.
what are some mistakes made when data mining?
1. Selecting the wrong problem for data mining. 2. Ignoring what your sponsor thinks data mining is and what it really can and cannot do. 3. Beginning without the end in mind. 4. Defining the project around a foundation that your data can't support. 5. Leaving insufficient time for data preparation. 6. Looking only at aggregated results and not at individual records. 7. Being sloppy about keeping track of procedures. 8. Using data from the future to predict the future. 9. Ignoring suspicious findings and quickly moving on. 10. Starting with a highly complex project. 11. Running data mining algorithms repeatedly and blindly. 12. Ignoring the subject matter experts. 13. Believing everything you are told about the data. 14. Assuming that the keepers of the data will be fully on board. 15. Measuring your results differently from the way your sponsor measures them. 16. "If you build it, they will come."
cluster analysis methods
1. Statistical methods: k-means or k-modes. 2. Neural networks. 3. Fuzzy logic. 4. Genetic algorithms.
Are all association rules interesting and useful? What are the 3 common metrics we use?
1. Support: how often the products appear together in the same transaction. 2. Confidence: how often the consequent appears in transactions that contain the antecedent. 3. Lift: the ratio of the rule's confidence to the confidence expected if the items were independent.
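A small worked sketch of the three metrics for a hypothetical rule {milk} -> {bread}; the transactions are made up.

```python
transactions = [
    {'milk', 'bread'}, {'milk', 'bread', 'eggs'}, {'milk', 'bread'},
    {'eggs', 'beer'}, {'bread'},
]
n = len(transactions)

support_milk  = sum('milk' in t for t in transactions) / n
support_bread = sum('bread' in t for t in transactions) / n
support_both  = sum({'milk', 'bread'} <= t for t in transactions) / n   # support of the rule

confidence = support_both / support_milk     # P(bread | milk)
lift = confidence / support_bread            # > 1 means milk and bread co-occur more than chance

print(support_both, confidence, round(lift, 2))   # 0.6, 1.0, 1.25
```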
What fields is data mining used in? (continued)
8. Government and defense: forecasting costs, predicting moves, predicting resource consumption, optimization. 9. Travel industry: predicting sales, forecasting demand, identifying profitable customers, retaining valuable employees. 10. Healthcare: identifying people without health insurance, identifying cost-benefit relationships, forecasting demand for services, predicting customer and employee attrition. 11. Medicine: identifying patterns for survivability, predicting success rates, identifying the functions of different genes. 12. Entertainment industry: analyzing viewer data, predicting financial success, forecasting demand, developing optimal pricing. 13. Homeland security and law enforcement: identifying patterns of terrorist activity and crime patterns, prediction. 14. Sports: improving performance.
What data mining software tool is the leading software, and at what percent?
R, with 49%; RapidMiner is second with 33%.
data mining software tools
SPSS, SAS, Weka, RapidMiner, KNIME, Microsoft SQL Server, Microsoft Enterprise Consortium
association rule mining
Also known as affinity analysis or market-basket analysis; aims to find interesting relationships between variables in large databases.
what is the most commonly used algorithm in association rule mining?
The Apriori algorithm, used to discover association rules.
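A minimal sketch using the Apriori implementation in the third-party mlxtend package (an assumption: mlxtend and pandas must be installed); the one-hot transaction table and the thresholds are illustrative.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions (rows = baskets, columns = items), made up for the example.
onehot = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]],
    columns=['milk', 'bread', 'eggs']).astype(bool)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)          # frequent itemsets
rules = association_rules(frequent, metric='confidence', min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```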
When are neural networks best used, and what are some of their disadvantages?
Best used when the number of variables involved is rather large and the relationships among them are complex and imprecise. Disadvantages: it is difficult to provide a good rationale for the predictions, they need a lot of training, and they cannot be trained on very large databases.
What do neural networks resemble, and what do they involve?
Biological neural networks in the human brain. They involve the development of mathematical structures that have the capability to learn from past experiences presented in the form of well-structured data sets.
In what fields is cluster analysis common?
Biology, medicine, genetics, social network analysis, anthropology, archaeology, astronomy, character recognition, and management information systems development.
Visualization/time-series forecasting
can be used in conjunction with other data mining techniques to gain a clearer understanding of underlying relationships.
What are decision trees best used for?
categorical and interval data
What is the most frequently used data mining method?
classification
what is the difference between clustering and classification?
Classification learns the function between the characteristics of things and their membership through a supervised learning process, where both input and output variables are presented to the algorithm. In clustering, the membership of the objects is learned through an unsupervised learning process, where only the input variables are presented to the algorithm.
decision tree
Classifies data into a finite number of classes based on the values of the input variables; a hierarchy of if-then statements; faster than neural networks.
what is the primary source for accuracy estimation?
The confusion matrix: true positives, false positives, false negatives, and true negatives.
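A minimal sketch of reading the four cells out of a confusion matrix, assuming scikit-learn; the labels are made up.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]    # made-up actual labels (1 = positive class)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]    # made-up predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, fp, fn, tn, accuracy)      # 3 1 1 3 0.75
```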
deployment
Depending on the requirements, the deployment phase can be as simple as generating reports or as complex as implementing a repeatable data mining process across the enterprise. It also includes maintenance activities.
How does cluster analysis calculate closeness between pairs of items?
Distance measures. Popular methods are 1. Euclidean distance: the ordinary straight-line distance between two points. 2. Manhattan distance: rectilinear distance.
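A small sketch comparing the two distance measures on a pair of made-up points.

```python
import math

a, b = (1.0, 2.0), (4.0, 6.0)   # made-up points

euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # ordinary straight-line distance
manhattan = sum(abs(x - y) for x, y in zip(a, b))               # rectilinear (city-block) distance

print(euclidean, manhattan)   # 5.0 7.0
```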
what fields on the commercial side most commonly use data mining?
finance, retail, and healthcare
visual analytics
The idea is to combine analytics and visualization in a single environment for easier and faster knowledge creation.
data understanding
Identify the relevant data; use a variety of statistical and graphical techniques to pick the most relevant data. Data can be quantitative or qualitative (numeric, nominal, or ordinal).