Data Mining Test 1
Rank the levels of measurement:
1. Nominal 2. Ordinal 3. Interval
In a Decision tree, what happens with each split?
1. There will be an increase in purity. 2. The change in purity level is defined as information gain. 3. Choose the split that achieves the largest information gain.
What are the two phases of constructing a decision tree?
1. Tree Construction 2. Tree Pruning
What is a data warehouse?
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data, in a standardized format.
Describe a decision tree
A series of nested tests made up of nodes and leaves. Each node represents a test on one attribute. Each leaf is an end node with a prediction.
The size of a company is a _________ used to describe it. Options: Dimension; Variable; Feature
All of the above (Dimension, Variable, Feature)
Which of the following is (are) major step(s) in the data mining process? Options: Interpretation/Evaluation; Data Processing; Data Modeling; All of the above
All of the above (Interpretation/Evaluation, Data Processing, Data Modeling)
What do column/bar charts measure?
Useful for comparing values across different categories.
Do Attribute, Dimension, Feature, and Variable all mean the same thing?
Yes
What do box plots measure?
An alternative type of chart for showing the distribution of a variable. Useful for comparing a variable's distribution across different categories.
What is the definition of information gain?
expected reduction in entropy
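A minimal sketch of that definition, assuming entropy is measured in bits (log base 2) and the gain of a split is the parent's entropy minus the size-weighted entropy of its child nodes. The class counts are made up for illustration.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node: 10 buyers and 10 non-buyers, split into two children.
gain = information_gain([10, 10], [[8, 2], [2, 8]])
print(round(gain, 3))
```

A split that separates the two classes perfectly would recover the full parent entropy as gain.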
What do histograms measure?
The most common type of chart for showing the distribution of a numerical variable. Great for showing the shape of the distribution.
What is doing the "right" thing vs doing things "right" in Data Mining?
• Don't solve problems that do not help the business • Rigor takes a backseat to usefulness
Which of the following are advantages of the decision tree?
• Easy to understand and interpret - tree structure specifies entire decision structure • Easy to implement • Running time is low even with large data sets • Very popular method
Give examples for the "Identify" step in the data mining cycle
• Planning for a new product introduction • Planning a direct marketing campaign • Understanding customer attrition/churn • Evaluating the results of a market test
What are the most popular tree stopping criteria?
• Pruning • Restriction on minimum node size (the minimum number of customers in nodes): not good, but makes managerial sense • Set a threshold stopping value on the value of splitting criterion
What are weaknesses of the decision tree?
• Volatile: small changes in underlying data result in very different models • Cannot capture interactions between variables • Can result in large error
How to measure purity in a decision tree?
1. Gini Index 2. Entropy
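A short sketch of the Gini index, assuming the common definition of 1 minus the sum of squared class proportions. The node counts below are hypothetical.

```python
def gini(counts):
    """Gini impurity of a node: 0 for a pure node, larger for mixed nodes."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical two-class nodes, from pure to evenly mixed.
for counts in ([10, 0], [8, 2], [5, 5]):
    print(counts, round(gini(counts), 3))
```

Like entropy, the index is 0 for a pure node and reaches its two-class maximum (0.5) when both classes are equally represented.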
Explain what pruning a decision tree is
- Identify and remove branches - To avoid overfitting
What are examples of possible data sources?
1. "House" Data - Info kept within the company 2. Purchased Data - Data purchased outside 3. Test Data - Data generated with experiments conducted by the company. 4. Surveys - Useful for measuring consumer attitudes and competitive activity 5. Text Data - Content of customer emails or social network activities.
What are single numerical variable statistics?
1. Central Tendency (Mean, Median, Min, Max, percentiles, quartiles) 2. Dispersion/Variability (Variance, Std. Deviation) 3. Distribution shape (skewness - lack of symmetry or Kurtosis - peakedness or flatness of a distribution)
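These statistics can be sketched with Python's standard `statistics` module. The purchase amounts are invented, and the skewness line assumes the simple moment-based formula (third central moment divided by the cubed population standard deviation).

```python
import statistics

# Hypothetical customer purchase amounts (illustrative values only).
amounts = [12, 15, 15, 18, 20, 22, 25, 30, 95]

# Central tendency and position
mean = statistics.mean(amounts)
median = statistics.median(amounts)
quartiles = statistics.quantiles(amounts, n=4)  # three cut points: Q1, Q2, Q3

# Dispersion / variability
variance = statistics.variance(amounts)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(amounts)      # sample standard deviation

# Distribution shape: positive skewness means a longer right tail.
n = len(amounts)
pop_sd = statistics.pstdev(amounts)
skewness = sum((x - mean) ** 3 for x in amounts) / (n * pop_sd ** 3)

print(mean, median, quartiles, round(std_dev, 2), round(skewness, 2))
```

The single large value (95) pulls the mean above the median and makes the skewness positive.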
What are the main categories of potential predictors?
1. Customer Characteristics - Examples: Demographics, psychographics 2. Previous behavior - Examples: Previous purchases, response to previous marketing efforts 3. Previous Marketing - Previous marketing efforts targeted at the customer. 4. Big Data - Other Relevant Information
What is a predictor variable?
Stuff you can use to predict the target
What are the key steps of data mining?
1. Data Selection/Sampling 2. Data Preprocessing (Cleaning, Reduction & transformation) 3. Choosing Techniques 4. Data Mining 5. Pattern Evaluation & Knowledge Presentation
What is the process of predictive modeling?
1. Defining the problem 2. Preparing the data 3. Estimating the model 4. Evaluating the model 5. Making decisions
What are descriptive functions of data mining techniques?
Summarization, Statistics, Tables, Graphs, Visualization, Association Rules, Clustering
Name the data mining technique that answers the following question: What items are commonly bought together?
Association - Descriptive
Why bother data mining?
Because of data explosion and because cheap and powerful computers & the Internet make collecting and crunching gigantic datasets possible. And it creates value.
A bank uses data to predict which customers are more likely to default on their loans. Which of the following methods is NOT an appropriate data mining tool for the objective? A. Neural Network B. Decision Tree C. Exploration D. Regression
C. Exploration - You can use Neural Network, Decision Tree and Regression
What is another word for nominal and ordinal measures?
Categorical
What are the predictive functions of data mining techniques?
Classification, Regression
Name the data mining technique that answers the following question: How likely is a customer to respond to a marketing campaign?
Classification/Regression - Predictive
Name the data mining technique that answers the following question: What will the sales look like for the new product?
Classification/Regression - Predictive
What is the best use for the decision tree?
Classifying an unknown sample (prediction)
Name the data mining technique that answers the following question: What cohesive groups of customers do we have?
Clustering - Descriptive
What are multi numerical variable statistics?
Covariance (relationship between two variables) and correlation (measure of linear association between two variables)
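A minimal sketch of both measures, assuming the sample (n - 1) form of covariance and the Pearson correlation. The ad spend and sales figures are hypothetical.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the range [-1, 1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: sales rise in step with ad spend.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]

print(covariance(ad_spend, sales), correlation(ad_spend, sales))
```

Covariance depends on the units of the two variables, while correlation is unit-free, which is why correlation is easier to compare across pairs of variables.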
Where does data mining apply?
Customers Product/Services Competition Reports Nonbusiness Applications (politics, security & terrorism)
What are three popular models?
Decision Tree, Regression, Neural Network
What is the extraction of potentially useful (yet previously unknown) patterns or knowledge from large volumes of data?
Definition of Data Mining
What are other names for target variable?
Dependent Variables, Response Variables
What category of data mining answers the questions: Who are my best customers? What items are purchased together? On what and how much do they usually spend?
Descriptive
What category of data mining techniques characterizes properties of the data?
Descriptive
What two categories do data mining tasks fall in?
Descriptive and Predictive
What are other words for Attribute?
Dimension, Feature, Variable
Our sample customers have different intentions to buy our products. Out of 900 customers, 450 are willing to buy and the others are not. What is the entropy of this customer sample?
Entropy is 1
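The arithmetic behind that answer, assuming entropy is computed with a base-2 logarithm (so the result is in bits) and that each class proportion is 450/900 = 0.5:

```latex
H = -\sum_{i} p_i \log_2 p_i
  = -\left( \tfrac{450}{900}\log_2\tfrac{450}{900} + \tfrac{450}{900}\log_2\tfrac{450}{900} \right)
  = -\left( 0.5 \cdot (-1) + 0.5 \cdot (-1) \right)
  = 1
```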
T or F: The measurement level of a variable can only be interval.
False
We evaluate the performance of a predictive model based on its performance on the training data.
False
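One common alternative is to hold out part of the data and evaluate on that instead. The sketch below assumes a simple random 70/30 split; the function name, the fraction, and the seed are illustrative choices, not something prescribed by the course material.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

customers = list(range(100))  # hypothetical customer IDs
train, test = train_test_split(customers)
print(len(train), len(test))
```

The model is estimated on `train` only, and its performance is then measured on `test`, which it has never seen.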
What are the entropy ranges?
From 0 (most pure) to log2(k) for k classes (equal representation of cases). With two classes the maximum is 1.
Which type of graph would you use for the distribution of an interval variable?
Histogram
What are the steps of the data mining cycle?
1. Identify - business opportunities where analyzing data can provide value 2. Transform Data - into actionable information using data mining techniques 3. Act - on the information 4. Measure the results - of the efforts to complete the learning cycle (ITAM - Identify, Transform, Act, Measure)
What are other names for predictor variables?
Independent Variables, Explanatory Variables
Which of the statements about measurement level is true? A. Interval variables can be measured as nominal. B. Nominal variables can be measured as interval. C. Categorical variables can be measured as interval. D. None of the above
A. Interval variables can be measured as nominal. (Remember the order: I, O, N: Interval, Ordinal, Nominal.)
What can an interval-scaled attribute be measured by?
Interval, Ordinal, or Nominal measures
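A small sketch of measuring the same interval attribute at lower levels. The income values, cut-offs, and category names are all hypothetical.

```python
# Interval level: differences between values are meaningful.
incomes = [28_000, 54_000, 91_000, 150_000]

def to_ordinal(income):
    """Ordinal level: ranked bands, but distances between bands are lost."""
    if income < 40_000:
        return "low"
    if income < 100_000:
        return "medium"
    return "high"

def to_nominal(income):
    """Nominal level: a label only, with no order implied."""
    return "high earner" if income >= 100_000 else "not high earner"

print([to_ordinal(x) for x in incomes])
print([to_nominal(x) for x in incomes])
```

The conversion only works downward: once 54,000 and 91,000 have both become "medium", the original values cannot be recovered from the labels.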
Which type of graph would you use for trend over time?
Line Chart
What is an ordinal measure?
Meaningful order or ranking, but the distance between rankings has no meaning. Example: age group (e.g., young, middle-aged, senior)
What is interval measure?
Measured on a scale with meaningful differences. We can perform arithmetic operations on them.
What do line charts measure?
Variables that change, usually over time. Best for showing a trend over a time period. Univariate analysis.
What is a nominal measure?
Symbols or names of things. Valid operations are = and ≠.
Does Correlation always equal Causality?
No
What is a binary measure?
Nominal attribute with only two categories. Examples: Yes/No, Positive/Negative, True/False, Male/Female.
What can a nominal-scaled attribute be measured by?
Nominal measures only! NOT ordinal and interval measures!
What is another word for interval measures?
Numeric or Continuous
What can an ordinal-scaled attribute be measured by?
Ordinal and Nominal measures, NOT interval measures
How do you construct a decision tree?
Partition training instances into purer and purer subgroups. (Group A is "purer" than group B if more members in A are similar than members in B.) Trees constructed by recursively partitioning instances.
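A compact sketch of recursive partitioning, assuming numeric features, binary splits of the form "feature <= threshold", entropy as the purity measure, and a minimum node size as the stopping rule. The function names and the customer data are hypothetical, and real implementations add pruning and many refinements.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, feature, threshold) with the largest information gain."""
    base, best = entropy(labels), None
    for f in range(len(rows[0])):
        # Skip the largest value so the right-hand group is never empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or base - weighted > best[0]:
                best = (base - weighted, f, t)
    return best

def build_tree(rows, labels, min_size=2):
    """Recursively partition the instances into purer subgroups."""
    split = None
    if len(set(labels)) > 1 and len(labels) >= min_size:
        split = best_split(rows, labels)
    if split is None or split[0] <= 0:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], min_size),
        "right": build_tree([r for r, _ in right], [y for _, y in right], min_size),
    }

def predict(tree, row):
    """Follow the nested tests down to a leaf and return its prediction."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

# Hypothetical customers: [age, prior purchases] -> will buy?
rows = [[22, 0], [25, 1], [30, 0], [45, 1], [50, 0], [60, 1]]
labels = ["no", "no", "no", "yes", "yes", "yes"]
tree = build_tree(rows, labels)
print(tree, predict(tree, [28, 1]))
```

On this toy data a single split on age separates the classes completely, so both children are pure leaves and the recursion stops after one level.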
Which type of graph would you use to measure the composition of student backgrounds?
Pie Chart
What category of data mining answers the following questions?: Will this customer default on their loan? Will this customer cancel their gym membership?
Predictive
What is the use of statistical models to predict?
Predictive Modeling
What category of data mining techniques makes inferences from data for prediction?
Predictive
What is a "pattern" in data mining?
Relationships, regularities, and structures hidden in data
Which type of graph would you use to measure the relationship between two variables?
Scatterplot
What do scatterplots/bubble graphs measure?
The relationship between two variables. Multivariate analysis.
What is a target variable?
The thing you want to predict
Name the data mining technique that answers the following question: What are stock price movements?
Time-Series Analysis
Pruning a tree helps to address the overfitting problem
True
T or F: An attribute can be measured at a certain level or any other levels.
True
T or F: Attributes may vary from one object to another (cross-sectional) or from one time to another (longitudinal).
True
T or F? Measuring at a lower level loses information and limits possible analyses.
True
True or False: A scatter plot is suitable for exploring the relationship between two interval variables.
True
True or False: Data mining is an iterative process in which data and models should be updated based on feedback, new situations, etc.
True
True or False: It is a predictive task to classify credit card purchases into fraudulent and legitimate ones.
True