ch 4 data mining
types of patterns
-association -prediction -cluster -sequential
data preprocessing steps (4)
1. data consolidation 2. data cleaning 3. data transformation 4. data reduction
estimation methodologies
1. simple split 2. k-fold cross-validation 3. leave-one-out 4. bootstrapping 5. jackknifing 6. area under ROC curve
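A minimal sketch of how k-fold cross-validation partitions the records, assuming the records are addressed by index. The function name k_fold_indices is hypothetical, not from the chapter. Leave-one-out is the special case where k equals the number of records.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs, one pair per fold."""
    indices = list(range(n_samples))
    # Spread any remainder so fold sizes differ by at most one record.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Each record lands in the test set exactly once across the k folds.
folds = list(k_fold_indices(10, 5))
loo_folds = list(k_fold_indices(4, 4))  # leave-one-out: k == n_samples
```

In practice the records are usually shuffled first, and the model's accuracy is averaged over the k test folds.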
In data mining, classification models help in prediction. Select one: True False
TRUE
What is data
a collection of facts usually obtained as the result of experiences, observations, or experiments
what is a pattern
a mathematical relationship among data items
What is the difference between classification and regression
classification: the predicted output is a class label; regression: the predicted output is a numeric value
normalize data, discretize/aggregate data, construct new attributes are examples of what process
data transformation
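A minimal sketch of two of these transformation tasks, assuming min-max scaling for normalization and fixed cut points for discretization. The function names, cut points, and labels are illustrative and not from the chapter.

```python
from bisect import bisect_right


def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, cut_points, labels):
    """Map numeric values to category labels; needs len(cut_points) + 1 labels."""
    return [labels[bisect_right(cut_points, v)] for v in values]


ages = [18, 25, 40, 65]
normalized = min_max_normalize(ages)
age_groups = discretize(ages, [30, 60], ["young", "middle", "senior"])
```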
What steps account for 85% of total project time
steps 1, 2, 3 of the data mining process (business understanding, data understanding, data preparation)
What is data mining
the nontrivial process of identifying valid, novel, potentially useful and understandable patterns in data stored in structured databases
clustering -outlier analysis
unsupervised (k means)
data mining characteristics
-source of data is often a consolidated DW -DM environment is usually a client-server or a web-based IS -data is the most critical ingredient for DM -the miner is often an end user -creative thinking is needed -because of the large amounts of data, parallel processing might be necessary
data mining applications
-customer relationship mgmt -banking and other financial - retailing and logistics - manufacturing and maintenance - brokerage and securities trading - insurance - computer hardware and software - science and engineering
assessment methods for classification
-predictive accuracy (hit rate) -speed (model building, predicting) -robustness -scalability -interpretability (understanding and insight provided by the model)
data mining process (6)
1. business understanding 2. data understanding 3. data preparation 4. model building 5. testing and evaluation 6. deployment
2 types of data mining
1. hypothesis driven 2. discovery driven
Reasons why data mining is gaining attention
1. more intense competition 2. recognition of value in data sources 3. availability of quality data on customers, vendors... 4. integration of data into data warehouses 5. exponential increase in data processing 6. reduction in cost for hardware, software for data storage 7. movement toward the demassification (conversion of info resources into nonphysical form)
What are the most common standard processes for data mining?
CRISP-DM (cross-industry standard process for data mining) SEMMA (sample, explore, modify, model, assess) KDD (knowledge discovery in databases)
types of data (chart)
categorical: nominal & ordinal; numerical: interval & ratio
Data mining requires specialized data analysts to ask ad hoc questions and obtain answers quickly from the system. Select one: True False
FALSE
In the Cabela's case study, the SAS/Teradata solution enabled the direct marketer to better identify likely customers and market to them based mostly on external data sources. Select one: True False
FALSE
In the Memphis Police Department case study, predictive analytics helped to identify the best schedule for officers in order to pay the least overtime. Select one: True False
FALSE
Ratio data is a type of categorical data. Select one: True False
FALSE
Statistics and data mining both look for data sets that are as large as possible. Select one: True False
FALSE
The entire focus of the predictive analytics system in the Infinity P&C case was on detecting and handling fraudulent claims for the company's benefit. Select one: True False
FALSE
Data mining can be very useful in detecting patterns such as credit card fraud, but is of little help in improving sales. Select one: True False
FALSE
If using a mining analogy, "knowledge mining" would be a more appropriate term than "data mining." Select one: True False
TRUE
Interval data is a type of numerical data. Select one: True False
TRUE
The cost of data storage has plummeted recently, making data mining feasible for more firms. Select one: True False
TRUE
Using data mining on data about imports and exports can help to detect tax avoidance and money laundering. Select one: True False
TRUE
Data may consist of numbers, letters, words, images, and voice recordings. Select one: True False
TRUE
What is the main reason parallel processing is sometimes used for data mining? Select one: a. because any strategic application requires parallel processing b. because most of the algorithms used for data mining require it c. because of the massive data amounts and search efforts involved d. because the hardware exists in most organizations and it is available to use
c. because of the massive data amounts and search efforts involved
In the Cabela's case study, what types of models helped the company understand the value of customers, using a five-point scale? Select one: a. simulation and geographical models b. reporting and association models c. clustering and association models d. simulation and regression models
c. clustering and association models
What is the difference between classification and clustering?
classification: supervised clustering: unsupervised
in classification problems, the primary source for accuracy estimation is the
confusion matrix
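A minimal sketch of a two-class confusion matrix and the predictive accuracy (hit rate) read off it. The labels below are made-up illustration data, and the function names are hypothetical.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def accuracy(cm):
    """Hit rate: correctly classified records divided by all records."""
    return (cm["TP"] + cm["TN"]) / sum(cm.values())


actual = [1, 1, 1, 0, 0, 0, 0, 1]     # illustrative true classes
predicted = [1, 0, 1, 0, 1, 0, 0, 1]  # illustrative model output
cm = confusion_matrix(actual, predicted)
hit_rate = accuracy(cm)  # (3 + 3) / 8 = 0.75
```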
what is ordinal data
codes assigned to objects that also represent rank order, e.g. 1 = low, 2 = medium, 3 = high
What is nominal data
simple codes assigned to objects as labels, with no order or magnitude, e.g. 1 = single, 2 = married, 3 = divorced
Understanding customers better has helped Amazon and others become more successful. The understanding comes primarily from Select one: a. asking the customers what they want. b. developing a philosophy that is data analytics-centric. c. collecting data about customers and transactions. d. analyzing the vast data amounts routinely collected.
d. analyzing the vast data amounts routinely collected.
All of the following statements about data mining are true EXCEPT Select one: a. the valid aspect means that the discovered patterns should hold true on new data. b. the potentially useful aspect means that results should lead to some business benefit. c. the novel aspect means that previously unknown patterns are discovered. d. the process aspect means that data mining should be a one-step process to results.
d. the process aspect means that data mining should be a one-step process to results.
impute missing values, reduce noise in data, eliminate inconsistencies are examples of what process
data cleaning
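A minimal sketch of one cleaning task, imputing missing values, assuming missing entries are stored as None and are filled with the mean of the known values. Mean imputation is only one of several possible strategies, and the function name is hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute: every value is missing")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


incomes = [40.0, None, 60.0, 50.0, None]  # illustrative data
cleaned = impute_mean(incomes)            # mean of known values is 50.0
```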
collect data, select data, integrate data are examples of what process
data consolidation
Knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern searching, and data dredging are all alternative names for_________________
data mining
how does data mining work?
data mining extracts patterns from data
Data preparation, the third step in the CRISP-DM data mining process, is more commonly known as
data preprocessing
reduce number of variables, reduce number of cases, balance skewed data are examples of what process
data reduction
Data are often buried deep within very large ___________________ , which sometimes contain data from several years.
databases
What is the most commonly used similarity measure in cluster analysis
distance measure
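A minimal sketch of two widely used distance measures, Euclidean and Manhattan, assuming each record is a tuple of numeric attributes. The card does not say which measure is meant, so both are shown as examples.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute attribute differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


d_euclid = euclidean((0, 0), (3, 4))  # 3-4-5 right triangle
d_manhat = manhattan((0, 0), (3, 4))
```

The smaller the distance, the more similar two records are taken to be.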
What are the most popular application areas for data mining?
healthcare and medicine
What are the most commonly used clustering algorithms?
k means & self organizing maps
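A minimal sketch of k-means in its usual two-step loop (assign each record to the nearest centroid, then move each centroid to its cluster mean). It assumes the caller supplies the starting centroids and that math.dist is available (Python 3.8+). The points are illustrative only.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Cluster points around k centroids; returns (centroids, clusters)."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                updated.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
```

No class labels are passed in, which is what makes the method unsupervised.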
other names for data mining
knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching, data dredging
what is ratio data
measurement variables commonly found in sciences and engineering -mass, length, time, energy
What is interval data
numeric values measured on an interval scale (equal spacing, no true zero), e.g. temperature in Celsius
cluster algorithms are used when the data records do not have
predefined class identifiers
In the Memphis Police Department case study, shortly after all precincts embraced Blue CRUSH, ________________________________ became one of the most potent weapons in the Memphis police department's crime-fighting arsenal.
predictive analytics
Simple split
split the data into 2 mutually exclusive sets: training = 70%, testing = 30%
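A minimal sketch of the simple split, assuming the records are shuffled once before being cut at the 70% mark. The seed and function name are illustrative choices.

```python
import random


def simple_split(records, train_fraction=0.7, seed=42):
    """Shuffle the records, then cut them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = simple_split(range(100))
```

The model is built on train only, and its accuracy is then estimated on test.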
Prediction -classification -regression
supervised
association rule mining is used to discover
two or more items that go together
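A minimal sketch of the two measures commonly used to score an association rule, support and confidence, on made-up market baskets. The basket contents and function names are illustrative and not from the chapter.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions)


def support(transactions, itemset):
    """Share of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions with the antecedent, the share that also has the consequent."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / count_containing(transactions, antecedent)


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
rule_support = support(baskets, {"diapers", "beer"})          # 3 of 5 baskets
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})  # 3 of 4 diaper baskets
```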
association -link analysis -sequence analysis
unsupervised (bar code scanners)
